# Logistic Regression

#### I employ logistic regression to predict the Super Bowl winner from regular-season statistics.

##### Imports

In [9]:
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit
from imblearn.over_sampling import SMOTE
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import confusion_matrix, classification_report

I first load the processed season statistics, with and without PCA, as dataframes, and drop the identifier columns that are not used as features.

In [10]:
stats = pd.read_csv("../data/processed/NFL_stats.csv")
pca_stats = pd.read_csv("../data/processed/PCA_NFL_stats.csv")

stats = stats.drop(["Year", "Team"], axis = 1)
pca_stats = pca_stats.drop(["Year", "Team"], axis = 1)

I then split the data into training and testing sets, stratifying on the label so that Super Bowl winners are proportionally represented in both. This matters because winners (the positive class in this dataset) are rare: with one champion per season in a league of up to 32 teams, they make up only about 1/32 of the rows.

In [11]:
split = StratifiedShuffleSplit(n_splits = 1, test_size = 0.2, random_state = 69)

# split() yields positional indices, so select rows with iloc
for train_index, test_index in split.split(stats, stats["Superbowl Status"]):
    stats_train = stats.iloc[train_index]
    stats_test = stats.iloc[test_index]

# Separate the label from the features
sb_train = stats_train["Superbowl Status"]
sb_test = stats_test["Superbowl Status"]

stats_train = stats_train.drop(["Superbowl Status"], axis = 1)
stats_test = stats_test.drop(["Superbowl Status"], axis = 1)

# Repeat the same split for the PCA data
for train_index, test_index in split.split(pca_stats, pca_stats["Superbowl Status"]):
    pca_stats_train = pca_stats.iloc[train_index]
    pca_stats_test = pca_stats.iloc[test_index]

pca_sb_train = pca_stats_train["Superbowl Status"]
pca_sb_test = pca_stats_test["Superbowl Status"]

pca_stats_train = pca_stats_train.drop(["Superbowl Status"], axis = 1)
pca_stats_test = pca_stats_test.drop(["Superbowl Status"], axis = 1)

# Display the PCA test features
pca_stats_test

Unnamed: 0,pca0
113,-1.145308
106,0.698567
306,-0.787280
583,-0.671622
131,-0.898767
...,...
146,-0.920795
277,1.570849
175,-1.107329
281,0.369291
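
As a quick check on the stratified split above, the following minimal sketch confirms that the positive rate is preserved in both partitions. It runs on synthetic labels, not the NFL data: the array sizes, the 1-in-32 positive rate, and the names `train_rate` and `test_rate` are assumptions made for illustration.

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

# Synthetic stand-in for the label column (hypothetical data): 1 positive per 32 rows.
rng = np.random.default_rng(0)
y = np.zeros(640, dtype=int)
y[::32] = 1                      # 20 positives out of 640 rows
X = rng.normal(size=(640, 3))    # placeholder features

split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, test_idx = next(split.split(X, y))

# split() yields positional indices, so they work with NumPy indexing or DataFrame.iloc.
train_rate = y[train_idx].mean()
test_rate = y[test_idx].mean()
print(f"overall: {y.mean():.4f}  train: {train_rate:.4f}  test: {test_rate:.4f}")
```

With a plain random split, a test set this small could easily end up with no winners at all, which would make recall undefined.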


I then use SMOTE to oversample the minority class (Super Bowl winners) in the training set only; the test set is left untouched.

In [12]:
smote = SMOTE(random_state=69)
stats_train, sb_train = smote.fit_resample(stats_train, sb_train)
pca_stats_train, pca_sb_train = smote.fit_resample(pca_stats_train, pca_sb_train)
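
To give some intuition for what the oversampling step does, here is a simplified NumPy sketch of the core SMOTE idea: each synthetic row is placed at a random point on the segment between a minority row and one of its nearest minority neighbours. This is an illustration only, not imblearn's implementation; the function `smote_like_sample`, its parameters, and the example points are hypothetical.

```python
import numpy as np

def smote_like_sample(minority, n_new, k=2, seed=0):
    """Create n_new synthetic rows by interpolating between minority rows and their neighbours."""
    rng = np.random.default_rng(seed)
    minority = np.asarray(minority, dtype=float)
    # Pairwise Euclidean distances between minority rows.
    dists = np.linalg.norm(minority[:, None, :] - minority[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)            # a row is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]
    synthetic = np.empty((n_new, minority.shape[1]))
    for i in range(n_new):
        base = rng.integers(len(minority))     # pick a minority row at random
        neighbour = neighbours[base, rng.integers(k)]
        gap = rng.random()                     # position along the segment, in [0, 1)
        synthetic[i] = minority[base] + gap * (minority[neighbour] - minority[base])
    return synthetic

winners = np.array([[1.0, 2.0], [1.5, 2.5], [2.0, 2.0], [1.2, 3.0]])  # made-up points
new_rows = smote_like_sample(winners, n_new=5)
print(new_rows.shape)  # (5, 2)
```

Because every synthetic row is a convex combination of two real minority rows, the new points stay inside the region already occupied by the minority class.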

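I then train logistic regression classifiers on both the non-PCA and PCA data, tuning the hyperparameters with GridSearchCV and scoring on recall. Note that the lbfgs solver does not support the l1 penalty, so every l1 candidate in the grid fails to fit and is scored as NaN, and that max_iter=100 is too few iterations for lbfgs to converge on the unscaled non-PCA data. Both problems appear in the warnings below.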

In [27]:
LogReg = LogisticRegression(max_iter=100)

params = {
    'penalty' : ['l1','l2'], 
    'C'       : np.logspace(-3,3,7),
    'solver'  : ['lbfgs']
}

grid_log = GridSearchCV(LogReg, params, scoring = 'recall')
pca_grid_log = GridSearchCV(LogReg, params, scoring = 'recall')

grid_log = grid_log.fit(stats_train, sb_train)



STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(

In [28]:
pca_grid_log = pca_grid_log.fit(pca_stats_train, pca_sb_train)

35 fits failed out of a total of 70.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
35 fits failed with the following error:
Traceback (most recent call last):
  File "/anaconda/envs/azureml_py310_sdkv2/lib/python3.10/site-packages/sklearn/model_selection/_validation.py", line 732, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/anaconda/envs/azureml_py310_sdkv2/lib/python3.10/site-packages/sklearn/base.py", line 1151, in wrapper
    return fit_method(estimator, *args, **kwargs)
  File "/anaconda/envs/azureml_py310_sdkv2/lib/python3.10/site-packages/sklearn/linear_model/_logistic.py", line 1168, in fit
    solver = _check_solver(self.solver, self.penalty, self.dual)
  File "/anaconda/envs/azureml_py310_sdkv2
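
The failed fits and convergence warnings above come from pairing the l1 penalty with the lbfgs solver and from fitting unscaled features. One possible way around both is sketched below, on synthetic data rather than the NFL files. The choices here are assumptions and not part of the original analysis: the liblinear solver (which accepts both l1 and l2), the StandardScaler step, the `make_classification` data, and the pipeline step names `scale` and `logreg`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in for the season statistics (hypothetical data).
X, y = make_classification(
    n_samples=400, n_features=8, n_informative=4,
    weights=[0.9, 0.1], random_state=0,
)

# Scaling inside the pipeline means the scaler is refit on each CV training fold.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("logreg", LogisticRegression(solver="liblinear", max_iter=1000)),
])

# Same penalty and C grid as above, addressed through the pipeline step name.
params = {
    "logreg__penalty": ["l1", "l2"],
    "logreg__C": np.logspace(-3, 3, 7),
}

grid = GridSearchCV(pipe, params, scoring="recall", cv=5)
grid.fit(X, y)

scores = grid.cv_results_["mean_test_score"]
print(len(scores), "candidates,", int(np.isnan(scores).sum()), "with NaN scores")
print(grid.best_params_)
```

Whether this changes the results on the real data is untested here; the point is only that every candidate in the grid gets a valid score.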

The confusion matrix and classification report for the non-PCA data are shown below. Rows of the matrix are the true classes and columns are the predicted classes.

In [29]:
preds = grid_log.predict(stats_test)
confusion_matrix(sb_test, preds)

array([[98, 20],
       [ 2,  2]])

In [30]:
print(classification_report(sb_test, preds))

              precision    recall  f1-score   support

           0       0.98      0.83      0.90       118
           1       0.09      0.50      0.15         4

    accuracy                           0.82       122
   macro avg       0.54      0.67      0.53       122
weighted avg       0.95      0.82      0.87       122



In [24]:
pca_preds = pca_grid_log.predict(pca_stats_test)
confusion_matrix(pca_sb_test, pca_preds)

array([[81, 37],
       [ 2,  2]])

In [26]:
print(classification_report(pca_sb_test, pca_preds))

              precision    recall  f1-score   support

           0       0.98      0.69      0.81       118
           1       0.05      0.50      0.09         4

    accuracy                           0.68       122
   macro avg       0.51      0.59      0.45       122
weighted avg       0.95      0.68      0.78       122
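
The positive-class figures in both reports can be recovered by hand from the confusion matrices above, which scikit-learn lays out as [[TN, FP], [FN, TP]]. The helper `binary_metrics` below is a hypothetical name introduced for this small check.

```python
def binary_metrics(cm):
    """Return precision, recall and F1 for the positive class of a 2x2 confusion matrix."""
    (tn, fp), (fn, tp) = cm
    precision = tp / (tp + fp)   # share of predicted winners that really won
    recall = tp / (tp + fn)      # share of real winners that were found
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Confusion matrices copied from the outputs above.
non_pca_cm = [[98, 20], [2, 2]]
pca_cm = [[81, 37], [2, 2]]

for name, cm in [("non-PCA", non_pca_cm), ("PCA", pca_cm)]:
    p, r, f1 = binary_metrics(cm)
    print(f"{name}: precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
# non-PCA: precision=0.09 recall=0.50 f1=0.15
# PCA: precision=0.05 recall=0.50 f1=0.09
```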



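As you can see, both logistic regression classifiers perform quite poorly: each identifies only 2 of the 4 Super Bowl winners in the test set (50% recall), and precision for the winner class is just 0.09 without PCA and 0.05 with it. At least the non-PCA classifier produces fewer false positives (20 versus 37).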