Day 24 Lecture 1 Assignment
In this assignment, we will build our first logistic regression model on numeric data. We will use the FIFA soccer ratings dataset loaded below and analyze the model generated for this dataset.

In [18]:
%reload_ext nb_black
import pandas as pd
import numpy as np

import statsmodels.api as sm
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix

from mlxtend.plotting import plot_decision_regions
import plotly.express as px
import seaborn as sns
import matplotlib.pyplot as plt

%matplotlib inline



In [2]:
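def remove_correlated_features(dataset, threshold):
    """Delete, in place, each column whose correlation with an earlier column is >= threshold."""
    col_corr = set()  # names of the columns removed so far
    # numeric_only=True skips text columns such as "Name" (required on pandas >= 2.0)
    corr_matrix = dataset.corr(numeric_only=True)
    for i in range(len(corr_matrix.columns)):
        for j in range(i):
            # Only positive correlations are checked, and only against
            # columns that have not already been removed.
            if (corr_matrix.iloc[i, j] >= threshold) and (
                corr_matrix.columns[j] not in col_corr
            ):
                colname = corr_matrix.columns[i]
                col_corr.add(colname)
                if colname in dataset.columns:
                    print(f"Deleted {colname} from dataset.")
                    del dataset[colname]

    return dataset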


In [3]:
soccer_data = pd.read_csv(
    "https://tf-assets-prod.s3.amazonaws.com/tf-curric/data-science/fifa_ratings.csv"
)


In [4]:
soccer_data.head()

,ID,Name,Overall,Crossing,Finishing,HeadingAccuracy,ShortPassing,Volleys,Dribbling,Curve,...,LongShots,Aggression,Interceptions,Positioning,Vision,Penalties,Composure,Marking,StandingTackle,SlidingTackle
0,158023,L. Messi,94,84,95,70,90,86,97,93,...,94,48,22,94,94,75,96,33,28,26
1,20801,Cristiano Ronaldo,94,84,94,89,81,87,88,81,...,93,63,29,95,82,85,95,28,31,23
2,190871,Neymar Jr,92,79,87,62,84,84,96,88,...,82,56,36,89,87,81,94,27,24,33
3,192985,K. De Bruyne,91,93,82,55,92,82,86,85,...,91,76,61,87,94,79,88,68,58,51
4,183277,E. Hazard,91,81,84,61,89,80,95,83,...,80,54,41,87,89,86,91,34,27,22



The response for our logistic regression model is a binary label, "Elite" or "Not Elite", indicating whether the player has an overall rating greater than or equal to 75. This corresponds to the top 10% or so of soccer players in the dataset. Create the response column.

In [5]:
soccer_data["Is_Elite"] = soccer_data["Overall"] >= 75
soccer_data

,ID,Name,Overall,Crossing,Finishing,HeadingAccuracy,ShortPassing,Volleys,Dribbling,Curve,...,Aggression,Interceptions,Positioning,Vision,Penalties,Composure,Marking,StandingTackle,SlidingTackle,Is_Elite
0,158023,L. Messi,94,84,95,70,90,86,97,93,...,48,22,94,94,75,96,33,28,26,True
1,20801,Cristiano Ronaldo,94,84,94,89,81,87,88,81,...,63,29,95,82,85,95,28,31,23,True
2,190871,Neymar Jr,92,79,87,62,84,84,96,88,...,56,36,89,87,81,94,27,24,33,True
3,192985,K. De Bruyne,91,93,82,55,92,82,86,85,...,76,61,87,94,79,88,68,58,51,True
4,183277,E. Hazard,91,81,84,61,89,80,95,83,...,54,41,87,89,86,91,34,27,22,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
16117,238813,J. Lundstram,47,34,38,40,49,25,42,30,...,46,46,39,52,43,45,40,48,47,False
16118,243165,N. Christoffersson,47,23,52,52,43,36,39,32,...,47,16,46,33,43,42,22,15,19,False
16119,241638,B. Worman,47,25,40,46,38,38,45,38,...,32,15,48,43,55,41,32,13,11,False
16120,246268,D. Walker-Rice,47,44,50,39,42,40,51,34,...,33,22,44,47,50,46,20,25,27,False
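One quick sanity check on the "top 10% or so" figure is the mean of the boolean label, which equals the share of rows marked elite. The sketch below uses a small made-up list of ratings (an assumption for illustration, not the FIFA data); on the real frame the equivalent call would be `soccer_data["Is_Elite"].mean()`.

```python
import pandas as pd

# Hypothetical toy ratings, not taken from the FIFA dataset
toy = pd.DataFrame({"Overall": [94, 91, 76, 75, 74, 60, 55, 47, 47, 47]})

# >= keeps the boundary value 75 on the "elite" side
toy["Is_Elite"] = toy["Overall"] >= 75

# The mean of a boolean column is the fraction of True values
elite_share = toy["Is_Elite"].mean()
print(f"Share labeled elite: {elite_share:.0%}")
```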



Address potential collinearity issues by removing the appropriate features. There is no universally agreed-upon technique for doing so, so feel free to use any reasonable method. We have provided the convenience function *remove_correlated_features* at the top as one way of doing so; here we call it with a threshold of 0.9. Note that the function deletes columns from the DataFrame in place.

In [6]:
remove_correlated_features(soccer_data, 0.9)

Deleted StandingTackle from dataset.
Deleted SlidingTackle from dataset.


,ID,Name,Overall,Crossing,Finishing,HeadingAccuracy,ShortPassing,Volleys,Dribbling,Curve,...,Strength,LongShots,Aggression,Interceptions,Positioning,Vision,Penalties,Composure,Marking,Is_Elite
0,158023,L. Messi,94,84,95,70,90,86,97,93,...,59,94,48,22,94,94,75,96,33,True
1,20801,Cristiano Ronaldo,94,84,94,89,81,87,88,81,...,79,93,63,29,95,82,85,95,28,True
2,190871,Neymar Jr,92,79,87,62,84,84,96,88,...,49,82,56,36,89,87,81,94,27,True
3,192985,K. De Bruyne,91,93,82,55,92,82,86,85,...,75,91,76,61,87,94,79,88,68,True
4,183277,E. Hazard,91,81,84,61,89,80,95,83,...,66,80,54,41,87,89,86,91,34,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
16117,238813,J. Lundstram,47,34,38,40,49,25,42,30,...,47,38,46,46,39,52,43,45,40,False
16118,243165,N. Christoffersson,47,23,52,52,43,36,39,32,...,67,42,47,16,46,33,43,42,22,False
16119,241638,B. Worman,47,25,40,46,38,38,45,38,...,32,45,32,15,48,43,55,41,32,False
16120,246268,D. Walker-Rice,47,44,50,39,42,40,51,34,...,48,34,33,22,44,47,50,46,20,False
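Since any reasonable method is acceptable here, one possible alternative is sketched below. It is not the notebook's helper: the function name `drop_correlated` and the toy columns are hypothetical, it compares absolute correlations (so strongly negative pairs are caught too), and it returns a new DataFrame instead of deleting columns in place. It assumes pandas 1.5 or later for the `numeric_only` argument.

```python
import numpy as np
import pandas as pd


def drop_correlated(df, threshold=0.9):
    """Return (reduced copy, dropped names) using absolute pairwise correlation."""
    corr = df.corr(numeric_only=True).abs()
    # Keep the strict upper triangle so each pair of columns is examined once;
    # column c then holds its correlations with the columns that precede it.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] >= threshold).any()]
    return df.drop(columns=to_drop), to_drop


# Toy data (an assumption): "b" is almost a rescaled copy of "a", "c" is unrelated
rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame(
    {
        "a": a,
        "b": 2 * a + rng.normal(scale=0.01, size=200),
        "c": rng.normal(size=200),
    }
)

reduced, dropped = drop_correlated(toy, threshold=0.9)
print(dropped)
print(list(reduced.columns))
```

Unlike *remove_correlated_features*, this version does not skip comparisons against columns that have already been dropped, so the two can disagree on chains of correlated features.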



Split the data into train and test, with 80% training and 20% testing. Be sure to leave out columns that would not make sense in the model, like the player ID column, as well as Overall, from which the response is derived.

In [7]:
X = soccer_data.drop(columns=["ID", "Name", "Overall", "Is_Elite"])
y = soccer_data["Is_Elite"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=27
)


Fit the logistic regression model using the statsmodels package and print out the coefficient summary. Which variables appear to be the most important, and what effect do they have on the probability of a player being elite?

# sklearn

In [8]:
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
model.score(X_train, y_train)

0.9582848724509576


In [9]:
model.score(X_test, y_test)

0.9556589147286821


In [16]:
coef_df = pd.DataFrame({"feat": X.columns, "coef": model.coef_.flatten()})
coef_df

,feat,coef
0,Crossing,-0.01252
1,Finishing,0.024284
2,HeadingAccuracy,0.045465
3,ShortPassing,0.184808
4,Volleys,-0.020302
5,Dribbling,-0.004909
6,Curve,0.002839
7,FKAccuracy,0.00458
8,LongPassing,-0.001029
9,BallControl,0.195123
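To read these coefficients as effects, recall that a logistic regression coefficient is the change in the log-odds of the positive class for a one-unit increase in the feature, holding the other features fixed. Exponentiating it gives an odds ratio. The sketch below applies that to the two largest coefficients shown in the table; the variable names are illustrative only.

```python
import numpy as np
import pandas as pd

# Coefficients copied from the scikit-learn table above (log-odds per rating point)
coefs = pd.Series({"ShortPassing": 0.184808, "BallControl": 0.195123})

# exp(coef) is the multiplicative change in the odds for a one-point increase
odds_ratio_1pt = np.exp(coefs)

# exp(10 * coef) is the multiplicative change in the odds for a ten-point increase
odds_ratio_10pt = np.exp(10 * coefs)

print(
    pd.DataFrame(
        {"per_1_point": odds_ratio_1pt, "per_10_points": odds_ratio_10pt}
    ).round(2)
)
```

Each extra point of ShortPassing or BallControl multiplies the odds of being elite by roughly 1.2. These are changes in odds, not in probability, and because scikit-learn's `LogisticRegression` applies an L2 penalty by default, its coefficients need not match an unpenalized fit exactly.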



# statsmodels

In [19]:
X_train_const = sm.add_constant(X_train)
X_test_const = sm.add_constant(X_test)

sm_model = sm.Logit(y_train, X_train_const).fit()
print(sm_model.summary())

Optimization terminated successfully.
         Current function value: 0.104045
         Iterations 10
                           Logit Regression Results                           
Dep. Variable:               Is_Elite   No. Observations:                12897
Model:                          Logit   Df Residuals:                    12869
Method:                           MLE   Df Model:                           27
Date:                Thu, 27 Aug 2020   Pseudo R-squ.:                  0.7152
Time:                        12:01:12   Log-Likelihood:                -1341.9
converged:                       True   LL-Null:                       -4711.0
Covariance Type:            nonrobust   LLR p-value:                     0.000
                      coef    std err          z      P>|z|      [0.025      0.975]
-----------------------------------------------------------------------------------
const             -58.1098      1.753    -33.142      0.000     -61.546     -54.673
Crossing     


We have yet to discuss how to evaluate the model, which will happen next week, but one intuitive way to check whether our model's predictions are reasonable is to plot a calibration curve. In essence, the probabilities predicted by a well-calibrated model match the observed proportions of outcomes (i.e., if we take all of the predictions around 70% made by our model, the corresponding observed outcomes should be Elite about 70% of the time).

First, make predictions on the test set and join them to the corresponding true outcomes. Then, use the *calibration_curve* function in scikit-learn to plot a calibration curve. What do you see?

There is some helpful code for creating calibration plots at the link below:
https://scikit-learn.org/stable/auto_examples/calibration/plot_calibration_curve.html#sphx-glr-auto-examples-calibration-plot-calibration-curve-py
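A minimal sketch of the steps described above follows. It is not the notebook's own cell: to stay self-contained it fits a model on synthetic data from `make_classification` (an assumption, with roughly 10% positives to mimic the class balance), and the names `X_demo`, `y_demo`, and `demo_model` are hypothetical.

```python
import matplotlib.pyplot as plt
from sklearn.calibration import calibration_curve
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): about 10% of rows are in the positive class
X_demo, y_demo = make_classification(
    n_samples=5000, n_features=10, weights=[0.9, 0.1], random_state=27
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=27
)

demo_model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# Predicted probability of the positive class for each test row
prob_pos = demo_model.predict_proba(X_te)[:, 1]

# Bin the predictions, then compare the observed fraction of positives
# in each bin with the mean predicted probability in that bin
frac_pos, mean_pred = calibration_curve(y_te, prob_pos, n_bins=10)

fig, ax = plt.subplots()
ax.plot([0, 1], [0, 1], linestyle="--", label="Perfectly calibrated")
ax.plot(mean_pred, frac_pos, marker="o", label="Logistic regression")
ax.set_xlabel("Mean predicted probability")
ax.set_ylabel("Observed fraction of positives")
ax.legend()
```

With the notebook's objects, `model.predict_proba(X_test)[:, 1]` and `y_test` would take the place of `prob_pos` and `y_te`. Points close to the dashed diagonal indicate good calibration.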

We see that the lower predicted probabilities tend to be well calibrated: when the model predicts a 20% likelihood of a player being elite, for example, we tend to see about 20% in reality, which is a good sign. However, the calibration falters quite a bit for the more confident predictions. Weaker calibration at the extremes is fairly common for probabilistic models, although not always to this extent.