## Project 2
- John Serino   
- Math 219
- 5 May 2023


In [1]:
#importing our most commonly used modules
import numpy as np
import pandas as pd
import seaborn as sns

# import warnings filter
from warnings import simplefilter
# ignore all future warnings
simplefilter(action='ignore', category=FutureWarning)

# Preface and Dataset Manipulation
For this assignment I will be using a dataset created by the user "TALBOTT" and hosted on Kaggle.

This is the same dataset that I used in the previous Project.

The set can be found using this link: https://www.kaggle.com/datasets/ttalbitt/american-football-team-stats-1998-2019?select=AmericanFootball98.csv

"TALBOTT"'s dataset looks at the statistics of NFL football teams over the 21-year range of 1999-2019.
In particular, I will be looking to see which statistics can best be used to predict which team will win the Superbowl in any given year.

In [2]:
# importing "Talbott's" dataset
football = pd.read_csv("AmericanFootball98.csv",index_col=0)

As was stated in the previous project, the original dataset does not have a built-in target classification, so I am instead appending a new column named `Superbowl` to the end of the dataset.

The new column will assign `0` to teams that did not win the Superbowl, and `1` if the team did win the Superbowl.


In [3]:
# adding a `Superbowl` column, initialized to 0, to the DataFrame
football = football.assign(Superbowl = 0)
# a quick look at the dataset for readers.
football.head(3)

team_code,wins,losses,PF,yards,plays,yards/play,TO,Fumbles Lost,1st downs,completions,...,opp pen 1st downs,opp number drives,opp score percentage,opp turnover percentage,opp avg start,opp avg time per drive,opp avg plays per drive,opp avg yards per drive,opp avg points per drive,Superbowl
nwe2019,12,4,420,5664,1095,5.2,15,6,338,378,...,39,191,19.4,17.3,Own 24.8,2:20,5.0,22.8,1.0,0
buf2019,10,6,314,5283,1018,5.2,19,7,314,299,...,33,178,23.6,12.4,Own 27.7,2:39,5.6,26.2,1.29,0
nyj2019,7,9,276,4368,956,4.6,25,9,253,323,...,40,189,34.4,10.1,Own 31.4,2:35,5.6,26.7,1.81,0


We will now fill in the `Superbowl` column with a `1` for each team that won the Superbowl. This can be done manually with the `.at[]` accessor.

In [4]:
football.at['kan2019','Superbowl'] = 1
football.at['nwe2018','Superbowl'] = 1
football.at['phi2017','Superbowl'] = 1
football.at['nwe2016','Superbowl'] = 1
football.at['den2015','Superbowl'] = 1
football.at['nwe2014','Superbowl'] = 1
football.at['sea2013','Superbowl'] = 1
football.at['rav2012','Superbowl'] = 1
football.at['nyg2011','Superbowl'] = 1
football.at['gnb2010','Superbowl'] = 1
football.at['nor2009','Superbowl'] = 1
football.at['pit2008','Superbowl'] = 1
football.at['nyg2007','Superbowl'] = 1
football.at['clt2006','Superbowl'] = 1
football.at['pit2005','Superbowl'] = 1
football.at['nwe2004','Superbowl'] = 1
football.at['nwe2003','Superbowl'] = 1
football.at['tam2002','Superbowl'] = 1
football.at['nwe2001','Superbowl'] = 1
football.at['rav2000','Superbowl'] = 1
football.at['ram1999','Superbowl'] = 1
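
The same labeling can be written more compactly by listing the winning team codes once and assigning through `.loc`. The sketch below is a minimal illustration on a three-row stand-in frame; the stand-in frame and the shortened `winners` list are assumptions made for the example, not the full dataset.

```python
import pandas as pd

# Stand-in for the real table: three rows indexed by team code,
# with the Superbowl column initialized to 0 (illustrative only).
football = pd.DataFrame(
    {"Superbowl": 0},
    index=pd.Index(["kan2019", "buf2019", "nwe2018"], name="team_code"),
)

# Winners listed once; in the notebook this list would hold all 21 champions.
winners = ["kan2019", "nwe2018"]

# Check the codes up front so a misspelled team code gives a clear message.
missing = [code for code in winners if code not in football.index]
assert not missing, f"unknown team codes: {missing}"

# Label every winner in a single assignment.
football.loc[winners, "Superbowl"] = 1
print(football)
```

Keeping the winners in one list also makes it easy to confirm that exactly 21 rows were labeled, for example with `football["Superbowl"].sum()`.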

# Opening
Our goal in this report is very similar to the last: we will use the dataset created by "Talbott" to see how to more accurately predict the eventual Superbowl winner from regular season statistics of the same season. We do this in order to answer a variation of our original question. Instead of asking "is it even possible to predict the winner of the Superbowl?" we will instead be asking: "Which statistics are the most irrelevant predictors of Superbowl victory, and how will eliminating these predictors allow us to increase our predictor's efficiency?".

Where we differ from the previous project is that instead of using `Classification` to analyze our data, we will be using `Multilinear regression` with varying regularizations to determine which features to eliminate from the feature frame in order to create a better predictor for the Superbowl Champion.

# Initial Analysis using `Multilinear Regression` with `LASSO regularization`.

We can use Lasso regularization within Multilinear regression as a way to determine which features we should eliminate in order to increase our ability to predict the winner of the Superbowl.

First, we will define our feature frame and our target.

Our target value will be the `Superbowl` metric, while our features frame will include all quantifiable metrics, excluding metrics that are strings.

In [5]:
# Creating feature frame out of all quantitative values, and dropping 'Superbowl' column
X = football.drop(['Superbowl','avg start','avg time per drive',
                   'opp avg start','opp avg time per drive'],axis = 1)
# Creating target for regression
y = football['Superbowl']


We can train an initial predictor on the training set to measure the test set accuracy and the whole set accuracy before we manipulate the data.

In [6]:
# Importing modules from sci-kit for working with predicting data
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Creating a test set with 25% of the data
X_train, X_test, y_train, y_test = train_test_split(X, y,
  test_size=0.25,
  shuffle=True,
  random_state=3
)


# Training knn classifier to compute test accuracy
knn = KNeighborsClassifier(n_neighbors=8)
knn.fit(X_train, y_train)    # fit only to train set
acc = knn.score(X_test, y_test)    # score only on test set
print(f"test accuracy is {acc:.2%}")


# Computing the predictor's accuracy over the whole dataset (training and test rows together)
n, d = X.shape
# Get vector of predictions for every row:
yhat = knn.predict(X)    
acc = sum(yhat == y) / n    # fraction of correct predictions
print(f"accuracy is {acc:.2%}")

test accuracy is 97.02%
accuracy is 96.86%


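We can see that, prior to any regression or regularization, our test accuracy is 97.02%, while our accuracy over the whole set is 96.86%. These numbers are less impressive than they look: since only about 3% of the rows are Superbowl winners, a predictor that always answers `0` would score roughly the same.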
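
To put these accuracies in context, the sketch below scores a predictor that always answers `0`. It assumes 669 team-seasons in total (31 teams for 1999-2001 and 32 teams from 2002 on), a count that is not stated in the dataset description above, together with the 21 champions labeled earlier.

```python
# Assumed class counts: 21 champions among 669 team-seasons.
n_champions = 21
n_rows = 669  # assumption, see the paragraph above

# True labels, and a baseline that never predicts a champion.
y = [1] * n_champions + [0] * (n_rows - n_champions)
yhat = [0] * n_rows

# Fraction of correct predictions, as in the accuracy computed in the notebook.
accuracy = sum(p == t for p, t in zip(yhat, y)) / n_rows

# Number of actual champions that the baseline identifies.
found = sum(p == 1 and t == 1 for p, t in zip(yhat, y))
recall = found / n_champions

print(f"always-0 accuracy: {accuracy:.2%}")
print(f"champions identified: {found} of {n_champions}")
```

Under that assumption the always-`0` baseline reaches 96.86% accuracy without identifying a single champion, so accuracy alone says little about how well the winner is predicted.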

# Lasso Regularization
We will now use Lasso Regularization to determine which features contribute the least to predicting Superbowl victory, that is, the features whose coefficients LASSO drives to zero, before eliminating those features.

In [7]:
from sklearn.linear_model import Lasso
from sklearn.linear_model import LinearRegression

lass = Lasso(alpha=0.05)
lass.fit(X_train, y_train)

pd.DataFrame( {
    "feature": X.columns,
    "LASSO": lass.coef_
    } )

Unnamed: 0,feature,LASSO
0,wins,0.0
1,losses,-0.0
2,PF,0.000345
3,yards,-0.0
4,plays,0.000268
5,yards/play,0.0
6,TO,0.0
7,Fumbles Lost,0.0
8,1st downs,0.000204
9,completions,-2.4e-05


We will now use the code below to remove every feature whose LASSO regression coefficient is zero.

In [8]:
# Get the locations (indices) of the very small weights:
zeroed = np.nonzero( np.abs(lass.coef_) < 1e-9 )
# Names of the corresponding columns:
dropped = X.columns[zeroed].values

X_train_reduced = X_train.drop(dropped, axis=1)
X_test_reduced = X_test.drop(dropped, axis=1)

X_train_reduced.head()

team_code,PF,plays,1st downs,completions,pass attempts,pass yards,rush yards,penalties,pen yards,opp PF,opp yards,opp plays,opp 1st downs,opp rush att,opp rush yards,opp rush 1st downs,opp pen yards
min2003,416,1055,336,333,520,3951,2343,127,1029,353,5356,955,316,387,1879,94,720
oti1999,392,1011,294,304,527,3485,1811,114,1069,324,5245,994,300,383,1550,81,1010
phi2019,385,1104,354,391,613,3833,1939,100,836,354,5307,967,289,353,1442,76,959
htx2019,378,1017,346,355,534,3783,2009,111,892,385,6213,1020,346,403,1937,98,859
cle2008,232,921,233,238,488,2380,1605,100,669,350,5704,1004,315,541,2431,125,770


After eliminating the columns of the feature frame with a regression coefficient of 0, the table above shows the features that remain in our predictor.
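
The selection step can be checked by hand on the ten coefficients printed in the LASSO table above. The sketch below copies those ten values into arrays (only the rows shown in the output, not the full coefficient vector) and applies the same `1e-9` threshold with a boolean mask.

```python
import numpy as np

# Feature names and LASSO coefficients copied from the table printed above.
features = np.array([
    "wins", "losses", "PF", "yards", "plays",
    "yards/play", "TO", "Fumbles Lost", "1st downs", "completions",
])
coef = np.array([
    0.0, -0.0, 0.000345, -0.0, 0.000268,
    0.0, 0.0, 0.0, 0.000204, -2.4e-05,
])

# True where the coefficient is numerically zero.
is_zero = np.abs(coef) < 1e-9

dropped = features[is_zero]   # features LASSO eliminated
kept = features[~is_zero]     # features that stay in the reduced frame

print("dropped:", list(dropped))
print("kept:", list(kept))
```

The four kept names (`PF`, `plays`, `1st downs`, `completions`) match the first four columns of the reduced frame. Because the features are not standardized here, the size of each coefficient also reflects the units of its column, so coefficients on large-valued columns such as `yards` are not directly comparable to those on small-valued columns such as `yards/play`.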

We can now compare the original model's score with the reduced linear model's score.

In [9]:
lm = LinearRegression()
lm.fit(X_train, y_train)

print(f"original linear model score: {lm.score(X_test, y_test):.4f}")

lm.fit(X_train_reduced, y_train)
Rsq = lm.score(X_test_reduced, y_test)


print(f"reduced linear model score: {Rsq:.4f}")

original linear model score: -0.0172
reduced linear model score: -0.0300


We now see that the original linear model score is -0.0172, while the reduced linear model score is -0.0300.

The score reported here is the coefficient of determination, for which higher is better and 1 is a perfect fit. A negative score means the model does worse on the test set than simply predicting the mean of the target, so both models are very poor, and our reduced linear model is even worse than our original linear model.

This can likely be attributed to the fact that the `Superbowl` column is a pretty poor regression target, as only about 1 in 32 teams receives a `1` in that column.
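
To make the meaning of a negative score concrete, the sketch below computes the coefficient of determination by hand as `1 - SS_res / SS_tot`. The four-value target and the two prediction vectors are made up for illustration and are not taken from the dataset.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    # Squared error of the model's predictions.
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Squared error of always predicting the mean of the target.
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

# Toy imbalanced target: one "winner" among four rows.
y = [0, 0, 0, 1]

# Predicting the mean everywhere scores exactly 0.
mean_prediction = [0.25, 0.25, 0.25, 0.25]
print(r_squared(y, mean_prediction))

# Predictions that miss the winner and favor the wrong row score below 0.
worse_prediction = [0.5, 0.0, 0.0, 0.0]
print(r_squared(y, worse_prediction))
```

A score slightly below zero, like the ones above, therefore means the regression is doing marginally worse than a constant guess at the mean.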

# Multilinear regression with `ridge` regularization
We now use ridge regularization to try to improve our coefficient of determination (`CoD`) score.

In [10]:
from sklearn.linear_model import LinearRegression
lm.fit(X_train, y_train)
print(f"linear model CoD score: {lm.score(X_test, y_test):.4f}")

linear model CoD score: -0.0172


In [11]:
from sklearn.linear_model import Ridge

rr = Ridge(alpha=0.5)
rr.fit(X_train, y_train)
print(f"ridge CoD score: {rr.score(X_test, y_test):.4f}")

ridge CoD score: -0.0156


In [12]:
pd.DataFrame( {
    "feature": X.columns,
    "ridge": rr.coef_
    } )

Unnamed: 0,feature,ridge
0,wins,0.028401
1,losses,0.017689
2,PF,0.0004
3,yards,-0.000244
4,plays,0.002894
5,yards/play,0.084176
6,TO,0.001519
7,Fumbles Lost,0.001211
8,1st downs,0.000524
9,completions,-7.6e-05


We can see that the ridge regression improves the score by a little bit, although our score is still in the negative suggesting a poor model.

We can now compare the 2-norm of the unregularized coefficients with the 2-norm of the ridge coefficients. The ridge norm is drastically lower, which shows that the ridge penalty shrank the coefficients as intended.

In [13]:
from numpy.linalg import norm
print(f"2-norm of unregularized coefficients: {norm(lm.coef_):.1f}")
print(f"2-norm of ridge coefficients: {norm(rr.coef_):.1f}")

2-norm of unregularized coefficients: 10444720420.7
2-norm of ridge coefficients: 0.4


Now we can use different regularization parameters to see how they affect the ridge regression CoD score.

In [14]:
for alpha in [0.25, 0.5, 1, 2]:
    rr = Ridge(alpha=alpha)    # larger alpha means stronger regularization
    rr.fit(X_train, y_train)
    print(f"alpha = {alpha:.2f}")
    print(f"2-norm of coefficient vector: {norm(rr.coef_):.1f}")
    print(f"ridge regression CoD score: {rr.score(X_test, y_test):.4f}")
    print()

alpha = 0.25
2-norm of coefficient vector: 0.5
ridge regression CoD score: -0.0164

alpha = 0.50
2-norm of coefficient vector: 0.4
ridge regression CoD score: -0.0156

alpha = 1.00
2-norm of coefficient vector: 0.3
ridge regression CoD score: -0.0159

alpha = 2.00
2-norm of coefficient vector: 0.2
ridge regression CoD score: -0.0177
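
The steady drop in the coefficient norm as alpha grows can be reproduced with the closed-form ridge solution `w = (XᵀX + αI)⁻¹ Xᵀy`. The sketch below is a simplified illustration on synthetic data: it omits the intercept that scikit-learn's `Ridge` fits by default, and the data, the seed, and the helper name `ridge_coefficients` are all assumptions made for the example.

```python
import numpy as np

# Synthetic data: 50 rows, 5 features, two of them nearly collinear.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
X[:, 4] = X[:, 3] + 0.01 * rng.normal(size=50)

# Sparse 0/1 target, loosely mimicking a rare "winner" label.
y = np.zeros(50)
y[::10] = 1.0

def ridge_coefficients(X, y, alpha):
    """Closed-form ridge solution without an intercept (hypothetical helper)."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)

alphas = [0.25, 0.5, 1, 2]
norms = [np.linalg.norm(ridge_coefficients(X, y, a)) for a in alphas]

for a, nrm in zip(alphas, norms):
    print(f"alpha = {a:.2f}, 2-norm of coefficient vector = {nrm:.4f}")
```

In this formulation the norm can only shrink as alpha increases, which is consistent with the pattern in the output above, while the CoD score does not have to move in the same direction.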



We see that the best ridge score among these values is at alpha = 0.5. We now compute the ridge CoD score with the same alpha on our reduced feature frame.

In [15]:
rr = Ridge(alpha=0.5)
rr.fit(X_train_reduced, y_train)
print(f"ridge CoD score: {rr.score(X_test_reduced, y_test):.4f}")

ridge CoD score: -0.0300


We can see that the CoD has decreased from -0.0156 to -0.0300, suggesting that removing the features zeroed by LASSO made our ridge predictor worse, although the predictor was very poor in the first place.

# KNN grid search hyperparameter optimization or Non-Linear Regression
We now use the KNN grid search function in order to find our best kNN CoD value.

In [16]:
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import KFold, StratifiedKFold
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

kf = KFold(n_splits=6, shuffle=True, random_state=3383)
grid = {
    "kneighborsregressor__n_neighbors": range(2, 25),
    "kneighborsregressor__weights": ["uniform", "distance"] 
    }
knn = make_pipeline( StandardScaler(), KNeighborsRegressor() )
optim = GridSearchCV(
    knn, grid, 
    cv=kf, 
    n_jobs=-1
    )
optim.fit(X_train, y_train)

print(f"best kNN CoD: {optim.score(X_test, y_test):.4f}")

optim.fit(X_train_reduced, y_train)

print(f"best kNN CoD: {optim.score(X_test_reduced, y_test):.4f}")

best kNN CoD: 0.0457
best kNN CoD: 0.0114


Our best kNN coefficient of determination is 0.0457 on the full feature set and 0.0114 on the reduced feature set, suggesting that removing the features zeroed by LASSO made the optimized kNN model poorer as well.
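
The grid keys used in the search above, such as `kneighborsregressor__n_neighbors`, follow from how `make_pipeline` names its steps: each step is named after its class in lowercase, and a parameter is addressed as `<step name>__<parameter>`. The short sketch below, which is only an illustration, lists those names for the same two-step pipeline.

```python
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Same two-step pipeline as in the grid search cell.
pipe = make_pipeline(StandardScaler(), KNeighborsRegressor())

# Step names are generated automatically from the class names.
print(list(pipe.named_steps))  # ['standardscaler', 'kneighborsregressor']

# Every tunable kNN parameter, written the way GridSearchCV expects it.
knn_params = sorted(
    name for name in pipe.get_params()
    if name.startswith("kneighborsregressor__")
)
print(knn_params)
```

After a search has been fit, the chosen values can be read from `optim.best_params_` under the same key names.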

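We will now switch to logistic regression, which treats `Superbowl` as a class label, and use a grid search to find the regularization parameter `C` and penalty type that give the best cross-validated accuracy.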

In [17]:
from sklearn.linear_model import LogisticRegression

grid = { "logisticregression__C": 10 ** np.linspace(-1, 4, 40), 
         # ridge and LASSO cases:
         "logisticregression__penalty": ["l2", "l1"]   
        }

learner = make_pipeline(
    StandardScaler(),
    LogisticRegression( solver="liblinear" )
    )

kf = StratifiedKFold(n_splits=6, shuffle=True, random_state=302)

search = GridSearchCV(
    learner, grid, 
    cv=kf,
    n_jobs=-1
    )

search.fit(X_train, y_train)

print("Best parameters:")
print(search.best_params_)
print()
print(f"Best score is {search.best_score_:.2%}")



Best parameters:
{'logisticregression__C': 1.4251026703029979, 'logisticregression__penalty': 'l2'}

Best score is 96.81%


In [18]:
search.fit(X_train_reduced, y_train)

print("Best parameters:")
print(search.best_params_)
print()
print(f"Best score is {search.best_score_:.2%}")

Best parameters:
{'logisticregression__C': 0.1, 'logisticregression__penalty': 'l2'}

Best score is 96.81%


We find that the best value for logistic regression on the original features is C = 1.425, with a cross-validated accuracy of 96.81%, while the best value of C on the reduced features (those kept by the LASSO regression) is 0.1, with a score of 96.81%, identical to the original predictor.
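
Both reported values of `C` are points of the search grid `10 ** np.linspace(-1, 4, 40)`, which contains 40 values spaced evenly on a logarithmic scale from 0.1 to 10000. The sketch below rebuilds that grid to show where the two values sit. In scikit-learn, `C` is the inverse of the regularization strength, so smaller values mean stronger regularization.

```python
import numpy as np

# The grid of C values used in the search cell.
grid_C = 10 ** np.linspace(-1, 4, 40)

# np.logspace builds the same grid directly.
same_grid = np.allclose(grid_C, np.logspace(-1, 4, 40))
print("matches np.logspace:", same_grid)

# The smallest grid value is the C chosen for the reduced features,
# and the tenth grid value (index 9) is the C chosen for the full features.
print(f"grid_C[0] = {grid_C[0]:.4f}")
print(f"grid_C[9] = {grid_C[9]:.4f}")
```

Since 0.1 is the lower edge of the grid, the search on the reduced features may be limited by the grid itself; extending the range below 0.1 would show whether even stronger regularization scores as well.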

We now compare logistic regression with the default regularization against logistic regression with the tuned value of C: 1.425 on the original features, and 0.1 on the reduced features.

In [19]:
logreg = LogisticRegression(solver="liblinear")
logreg.fit(X_train, y_train)
acc = logreg.score(X_test, y_test)
print(f"accuracy with default regularization is {acc:.2%}")

logreg = LogisticRegression(solver="liblinear", C=1.425)
logreg.fit(X_train, y_train)
acc = logreg.score(X_test, y_test)
print(f"accuracy with C=1.425 is {acc:.2%}")

accuracy with default regularization is 96.43%
accuracy with C=1.425 is 96.43%


In [20]:
logreg = LogisticRegression(solver="liblinear")
logreg.fit(X_train_reduced, y_train)
acc = logreg.score(X_test_reduced, y_test)
print(f"accuracy with default regularization is {acc:.2%}")

logreg = LogisticRegression(solver="liblinear", C=0.1)
logreg.fit(X_train_reduced, y_train)
acc = logreg.score(X_test_reduced, y_test)
print(f"accuracy with C=0.1 is {acc:.2%}")

accuracy with default regularization is 97.02%
accuracy with C=0.1 is 97.02%


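We find that, on both feature sets, the tuned value of `C` gives exactly the same test accuracy as the default regularization, so the grid search did not improve the predictor. The reduced features score slightly higher than the original features (97.02% versus 96.43%).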

# Conclusion
Throughout this project we were able to implement Multi-Linear regression with LASSO regularization, Multi-Linear regression with Ridge regularization, and Non-Linear regression with kNN optimization on hyperparameters.

However, this project was far from a success, as removing the features zeroed by LASSO lowered the regression score in every comparison, often going from slightly negative to even more negative.

This can likely be attributed to the poorly suited target column: only 3.1% of the rows in our target column had non-zero values. Because of this, and because the regression scores were often negative from the start, we can tell that the models were a poor fit for the data we presented them with, which more than likely caused our issues in trying to optimize the regression.

If I were to attempt this analysis again, I would assign each team an integer value based on the playoff round it advanced to, as opposed to only giving a value for winning the Superbowl.
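
A minimal sketch of such a target is given below. The round names, the integer scores, and the team codes are all hypothetical placeholders chosen for illustration; the dataset itself does not contain playoff results, so they would have to be added from another source.

```python
# Hypothetical scoring: a higher value means a deeper playoff run.
ROUND_SCORE = {
    "missed playoffs": 0,
    "wild card": 1,
    "divisional": 2,
    "conference championship": 3,
    "lost Superbowl": 4,
    "won Superbowl": 5,
}

# Placeholder team codes mapped to the furthest round each one reached.
furthest_round = {
    "teamA2019": "won Superbowl",
    "teamB2019": "divisional",
    "teamC2019": "missed playoffs",
}

# Integer target that could replace the 0/1 Superbowl column.
target = {team: ROUND_SCORE[stage] for team, stage in furthest_round.items()}
print(target)
```

A graded target like this gives the regression more non-zero values to learn from than the single champion per season.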