I have already used this dataset to build a k-nearest neighbours model. In this project I will try out some new types of model to see if I can get better performance. My initial test also used binary classification. I want to create a model that can attempt to predict the exact position of an outfield player based on their attributes.

I will start by importing the necessary modules and reading the data into a DataFrame. I want to test out some different models, both to practice using them and to see what results I can get.

In [34]:
#imports

import numpy as np
import pandas as pd
import seaborn as sns 
import matplotlib.pyplot as plt
%matplotlib inline


from sklearn.model_selection import train_test_split,GridSearchCV,cross_val_score
from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import LogisticRegression 
from sklearn.metrics import accuracy_score,confusion_matrix,classification_report

#set the random state
random_seed = np.random.RandomState(2)

In [3]:
#Begin by importing the data

df = pd.read_csv('fifa19.csv')





I will be attempting to determine the position of a player from their attributes. For this test I've defined a forward as someone who plays in one of the positions ['RF', 'ST', 'LF']. I will be dropping the goalkeeper position as I want to focus on outfield players only.


In [5]:
df_new =df[['Position','Crossing','Finishing', 'HeadingAccuracy', 'ShortPassing', 'Volleys', 'Dribbling','Curve', 'FKAccuracy', 
      'LongPassing', 'BallControl', 'Acceleration','SprintSpeed', 'Agility', 'Reactions', 'Balance', 'ShotPower','Jumping',
      'Stamina', 'Strength', 'LongShots', 'Aggression','Interceptions', 'Positioning', 'Vision', 'Penalties', 'Composure',
      'Marking', 'StandingTackle', 'SlidingTackle']]

In [8]:
# Dropping the rows where Position == 'GK'
# (the mask is built from df_new itself rather than from the original df)

IndexValues = df_new[df_new['Position'] == 'GK'].index

df_outfield = df_new.drop(IndexValues)
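
As a small illustration of the row-dropping step above, the same result can be reached by simply keeping the rows whose position is not GK. This is only a sketch on a made-up four-row frame; the names toy, dropped and toy_outfield are hypothetical and are not part of the project data.

```python
import pandas as pd

# Hypothetical toy data standing in for the FIFA frame
toy = pd.DataFrame({
    'Position': ['GK', 'ST', 'CB', 'GK'],
    'Finishing': [10.0, 85.0, 30.0, 12.0],
})

# Approach used above: collect the index labels of the goalkeepers, then drop them
gk_index = toy[toy['Position'] == 'GK'].index
dropped = toy.drop(gk_index)

# Equivalent one-step filter: keep every row that is not a goalkeeper
toy_outfield = toy[toy['Position'] != 'GK']

print(toy_outfield['Position'].tolist())
print(dropped.equals(toy_outfield))
```

Both versions leave rows with a missing position in place, so the null check later on is still needed.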



In [11]:
#checking that no values are GK
df_outfield['Position'].value_counts()


ST     2152
CB     1778
CM     1394
LB     1322
RB     1291
RM     1124
LM     1095
CAM     958
CDM     948
RCB     662
LCB     648
LCM     395
RCM     391
LW      381
RW      370
RDM     248
LDM     243
LS      207
RS      203
RWB      87
LWB      78
CF       74
LAM      21
RAM      21
RF       16
LF       15
Name: Position, dtype: int64

In [15]:
# I also want to make sure there are no null values in my DataFrame now
df_outfield.isnull().values.any()
# need to find the NaN values and remove them

True

In [18]:
df_outfield.info()
# drop every row that contains at least one missing value
df_outfield = df_outfield.dropna(how='any', axis=0)
df_outfield.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 16182 entries, 0 to 18206
Data columns (total 30 columns):
Position           16122 non-null object
Crossing           16134 non-null float64
Finishing          16134 non-null float64
HeadingAccuracy    16134 non-null float64
ShortPassing       16134 non-null float64
Volleys            16134 non-null float64
Dribbling          16134 non-null float64
Curve              16134 non-null float64
FKAccuracy         16134 non-null float64
LongPassing        16134 non-null float64
BallControl        16134 non-null float64
Acceleration       16134 non-null float64
SprintSpeed        16134 non-null float64
Agility            16134 non-null float64
Reactions          16134 non-null float64
Balance            16134 non-null float64
ShotPower          16134 non-null float64
Jumping            16134 non-null float64
Stamina            16134 non-null float64
Strength           16134 non-null float64
LongShots          16134 non-null float64
Aggression

In [19]:
#NA values fixed
df_outfield.isnull().values.any()


False

# Setting up the features
My data has now been cleaned and I am ready to set up my features and target variable. In this exercise I am going to use binary classification, so I will use isin to flag the positions I am looking for: a player is labelled True if their position is ST, LF or RF, and False otherwise.

In [32]:


# My features will be player stats

X= df_outfield[['Crossing','Finishing', 'HeadingAccuracy', 'ShortPassing', 'Volleys', 'Dribbling','Curve', 'FKAccuracy', 
      'LongPassing', 'BallControl', 'Acceleration','SprintSpeed', 'Agility', 'Reactions', 'Balance', 'ShotPower','Jumping',
      'Stamina', 'Strength', 'LongShots', 'Aggression','Interceptions', 'Positioning', 'Vision', 'Penalties', 'Composure',
      'Marking', 'StandingTackle', 'SlidingTackle']]
#setting the target positions
forwards=df_outfield['Position'].isin(['ST','LF','RF'])

y=forwards



#splitting the data in to train and test

X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.3,random_state=random_seed)
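
To make the target step above concrete, here is a minimal sketch of how isin builds a boolean target and how the share of positives can be read off it. The positions listed are made up for illustration, and the names positions, is_forward and forward_share are hypothetical.

```python
import pandas as pd

# Hypothetical toy positions, not taken from the real dataset
positions = pd.Series(['ST', 'CB', 'LF', 'CM', 'RF', 'RB', 'CB', 'CM'])

# isin returns True where the value is one of the listed positions
is_forward = positions.isin(['ST', 'LF', 'RF'])

# The mean of a boolean Series is the share of True values,
# which gives a quick read on class balance
forward_share = is_forward.mean()

print(is_forward.tolist())
print(forward_share)
```

When the positive class is a small share of the data, one option is to pass stratify=y to train_test_split so that the train and test sets keep roughly the same proportion of forwards. That was not done in the cell above.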

In [33]:
logreg = LogisticRegression()

logreg.fit(X_train,y_train)

y_pred = logreg.predict(X_test)

print(confusion_matrix(y_test,y_pred))
print(classification_report(y_test,y_pred))
print(logreg.score(X_test,y_test))



[[4032  130]
 [ 160  515]]
              precision    recall  f1-score   support

       False       0.96      0.97      0.97      4162
        True       0.80      0.76      0.78       675

   micro avg       0.94      0.94      0.94      4837
   macro avg       0.88      0.87      0.87      4837
weighted avg       0.94      0.94      0.94      4837

0.9400454827372339


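The logistic regression has worked well at predicting the position: it reaches about 94% accuracy on the test set, with a precision of 0.80 and a recall of 0.76 for forwards. Because only 675 of the 4,837 test players are forwards, accuracy is flattered by the majority class, so the figures for the True class are the more informative ones. I will now perform cross-validation to check the accuracy of the results, and I will perform hyperparameter tuning on my regression model to find the optimal parameters.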
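
As a quick check on the report, the headline numbers can be recomputed by hand from the confusion matrix printed above. The sketch below uses only those four counts; the baseline line is my own addition and shows the accuracy a model would get by labelling every player as not a forward.

```python
# Confusion matrix printed above: rows are true classes, columns are predictions
tn, fp = 4032, 130   # true class False (not a forward)
fn, tp = 160, 515    # true class True (forward)

total = tn + fp + fn + tp

accuracy = (tn + tp) / total
precision = tp / (tp + fp)   # of the players predicted as forwards, how many are
recall = tp / (tp + fn)      # of the actual forwards, how many were found

# Accuracy of a naive model that labels every player "not a forward"
baseline = (tn + fp) / total

print(round(accuracy, 4), round(precision, 2), round(recall, 2), round(baseline, 4))
```

The baseline comes out at roughly 0.86, so the model's 0.94 is a real improvement, but a smaller one than the raw accuracy suggests.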

In [35]:
#setting up the param grid 
c_space = np.logspace(-5, 8, 15)
param_grid = {'C': c_space}

#instantiate a new logreg
logreg_new=LogisticRegression()
#performing a gridsearch to find optimal parameters
logreg_cv= GridSearchCV(logreg_new,param_grid,cv=5)

logreg_cv.fit(X,y)

print('The best tuned score is :{}'.format(logreg_cv.best_score_))
print('the best parameter for this is :{}'.format(logreg_cv.best_params_))





The best tuned score is :0.9408262002232973
the best parameter for this is :{'C': 0.0007196856730011522}


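In this case it seems that tuning has not made much of a difference to my accuracy score using logistic regression: the best cross-validated score (0.9408) is almost identical to the untuned test accuracy (0.9400). The two figures are not strictly like for like, since one is a 5-fold cross-validation mean over the full dataset and the other comes from a single 30% hold-out split, but they point the same way.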
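
One way to get a like-for-like comparison would be to tune on the training split only and then score the tuned model on the held-out test split. The sketch below shows that pattern on synthetic data from make_classification, not on the FIFA attributes. The scaling step, the narrower C range and the names X_demo, y_demo, pipe and search are all my own assumptions rather than part of the notebook above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption; not the FIFA attributes)
X_demo, y_demo = make_classification(
    n_samples=500, n_features=10, weights=[0.85, 0.15], random_state=2
)

# Stratify so both splits keep a similar share of the minority class
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=2
)

# Scaling inside the pipeline means each CV fold is scaled from its own training part
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Pipeline step names are the lowercased class names, hence 'logisticregression__C'
grid = {'logisticregression__C': np.logspace(-3, 3, 7)}

search = GridSearchCV(pipe, grid, cv=5)
search.fit(X_tr, y_tr)  # tune on the training split only

cv_score = search.best_score_
test_score = search.score(X_te, y_te)  # accuracy on data never seen during tuning
print(search.best_params_, cv_score, test_score)
```

With this layout the test score is an honest estimate for the tuned model, because the test rows play no part in choosing C.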