# 2023 Superbowl Score Predictor #

This is a simple regression model to predict the Superbowl score. It uses a two-feature dataset from the 2023 regular season. Playoff results are not included.

In [127]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error

## Collecting and Normalizing Data ##

For the purposes of this demonstration, we are going to breeze over data collection and use a local CSV file. The CSV file was created by hand, earlier in the semester, using 2023 data from the [NFL](https://nfl.com/stats) website.

The data includes two input features that are on very different scales. The first feature is Opponent Points Against: the total number of points scored against our opponent over the entire season. Typical values are in the 300-400 range, and higher numbers indicate that our opponent had a poor defense (so we would expect to score more often). The second feature is Opponent Turnovers, with values in the 20-30 range. This is the total number of times that our opponent's offense fumbled the ball or threw an interception. Higher numbers indicate that our offense should expect more possessions (so we would expect to score more often).

The Points Against and Turnovers features are on different scales, which might cause our model to overemphasize the feature with the larger numbers. To avoid this, we will use a `MinMaxScaler` to rescale each feature to a floating point number between 0 and 1: the smallest value in a column maps to 0, the largest maps to 1, and everything else falls proportionally in between.
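As a rough illustration of the rescaling, the sketch below applies the min-max formula, `(x - min) / (max - min)`, column by column with NumPy. The sample values are made up for this example and are not taken from the CSV file. With its default settings, `MinMaxScaler` performs this same per-column calculation.

```python
import numpy as np

# Hypothetical values, not taken from sample_nfl_stats.csv.
# Column 0: Opponent Points Against, column 1: Opponent Turnovers.
raw = np.array([[300.0, 20.0],
                [350.0, 30.0],
                [400.0, 25.0]])

# Per-column minimum and maximum
col_min = raw.min(axis=0)
col_max = raw.max(axis=0)

# Min-max formula applied to every value in its own column
scaled = (raw - col_min) / (col_max - col_min)
print(scaled)
```

After rescaling, both columns share the 0-1 range, so neither one dominates simply because its raw numbers were larger.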

In [None]:
df = pd.read_csv('sample_nfl_stats.csv')
df.head()

In [None]:
df = df.rename(columns={'Opponent Points Against':'OppPA', 'Opponent Turnovers':'OppTO'})
df.head()

In [None]:
scaler = MinMaxScaler()
df[['OppPA', 'OppTO']] = scaler.fit_transform(df[['OppPA', 'OppTO']])
df.head()

## Encoding Non-Numeric Features ##

Notice that our dataset contains games for both the Chiefs and the 49ers, and the samples are distinguished from each other by a string. These teams have different offenses and we want to make sure that our model keeps the teams separate in its calculations. One idea would be to separate the data and create two different models: one for the Chiefs and another for the 49ers. However, separating the data means that each model has fewer datapoints to learn from. And the fact is, there is a football-is-football aspect to game results that is common to all teams. So let's build one single model but use the "Team" feature to account for the difference in rosters and coaching. But how do we use a string in a regression equation?

One of the most common ways to account for categorical information in a machine learning model is one-hot encoding. This encoding technique creates a binary feature for every possible string value. It assigns a 1 to whichever feature corresponds to the string value and a 0 to the other, non-matching features. So, if our data included all 32 NFL teams, we would transform the data from 1 feature with 32 values to 32 binary features. At some point, all of these features can produce "The Curse of Dimensionality" (TCD), which slows down processing and can lead to overfitting. There are techniques to avoid TCD, but we will save those for later.

There are other encoding techniques, like ordinal encoding, that simply assign a unique value to represent each category while keeping the data within a single feature. This works well when there is a natural order to the data like (cold, cool, room temp, warm, hot) or (low, medium, high). But ordinal encoding can cause problems if the data has no underlying sequence, as is the case with football team names.
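To make the difference between the two encodings concrete, here is a minimal NumPy sketch on a made-up column of team labels. It is only an illustration of the idea, not how `OneHotEncoder` is implemented, and the variable names are hypothetical.

```python
import numpy as np

# Hypothetical team labels for four games (illustration only).
teams = np.array(['Chiefs', '49ers', '49ers', 'Chiefs'])

# np.unique returns the sorted categories and, for each row, the index
# of its category. That index is an ordinal encoding of the label.
categories, ordinal = np.unique(teams, return_inverse=True)

# One-hot encoding: one binary column per category, with a 1 in the
# column that matches the row's team and 0 everywhere else.
one_hot = np.eye(len(categories), dtype=np.uint8)[ordinal]

print(categories)  # column order of the one-hot matrix
print(ordinal)
print(one_hot)
```

The ordinal version implies that one team is numerically "greater" than the other, which has no football meaning. The one-hot version avoids that by giving each team its own column.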

In [None]:
encoder = OneHotEncoder(sparse_output=False, dtype=np.uint8)
encoded = encoder.fit_transform(df[['Team']])
columns = encoder.get_feature_names_out(['Team'])

print(columns)
print(encoded[:3])
print('...')
print(encoded[-3:])

In [None]:
encoded = pd.DataFrame(encoded, columns=columns)
encoded.head()

In [None]:
df = pd.concat([df, encoded], axis=1)
df.head()

## Create a Model ##

We will use a multilinear regression model to predict the Superbowl scores. Multilinear means that we are measuring a linear relationship between the inputs and the output, but there is more than one independent variable (feature). The `LinearRegression` object works the same whether you use it with a single feature or many (phew!).
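Written out for the four input columns used later in this notebook, a multilinear model has the form below. The coefficient names $\beta_0, \dots, \beta_4$ are notation introduced here for illustration; they do not appear in the notebook's code.

$$
\hat{y} = \beta_0 + \beta_1\,\text{Team\_49ers} + \beta_2\,\text{Team\_Chiefs} + \beta_3\,\text{OppPA} + \beta_4\,\text{OppTO}
$$

Here $\hat{y}$ is the predicted number of points, $\beta_0$ is the intercept, and each remaining coefficient measures how much the prediction changes when its feature increases by one unit.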

In order to test the accuracy of our model, we will hold back a few of the games from the training data. The `train_test_split` function will separate the inputs X and the output y simultaneously, randomly choosing the samples to hold back for testing but making sure to choose the same X rows and y rows.
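The idea of splitting X and y with the same rows can be sketched with plain NumPy. This is not how `train_test_split` is implemented internally; it is only an illustration, using made-up arrays and hypothetical variable names, of why the X rows and y rows stay paired.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable

# Hypothetical data: 10 games, 4 input features, 1 output per game.
X_demo = np.arange(40).reshape(10, 4)
y_demo = np.arange(10).reshape(10, 1)

# Shuffle the row indices once, then reuse the same indices for X and y.
indices = rng.permutation(len(X_demo))
n_test = int(np.ceil(0.1 * len(X_demo)))  # hold back 10% for testing
test_idx, train_idx = indices[:n_test], indices[n_test:]

X_train_demo, X_test_demo = X_demo[train_idx], X_demo[test_idx]
y_train_demo, y_test_demo = y_demo[train_idx], y_demo[test_idx]

print(X_train_demo.shape, y_train_demo.shape)
print(X_test_demo.shape, y_test_demo.shape)
```

Because a single shuffled index array selects rows from both arrays, every input row keeps its matching output value.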

In [None]:
X = df[['Team_49ers', 'Team_Chiefs', 'OppPA', 'OppTO']]
y = df[['Points']]
print(X.shape)
print(y.shape)

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1)
print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
print(y_test.shape)

In [136]:
model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
y_pred = np.round(y_pred).astype(np.uint8)

In [None]:
y_pred

In [None]:
y_test

## Score the Model ##

There are a variety of metrics to score the quality of a regression model. Three common metrics are *Mean Absolute Error*, *Mean Squared Error*, and the $R^2$ score.
* Mean Absolute Error (**MAE**): Useful for scoring a single model, or for comparing models with the same output scale, when extreme outlier predictions do not need extra weight. The score is in the same units as the output values.
* Mean Squared Error (**MSE**): Similar to MAE, but squaring the errors penalizes extreme outlier predictions more heavily. The score is in squared units, so its range may not correspond to the output range.
* $R^2$ score (**R2**): A standardized score where 1 is a perfect score and really poor models can score below 0, with no lower limit ($-\infty$). A score of 0 is equal to always picking the mean y-value for your prediction.
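To make the three metrics concrete, the sketch below computes each one by hand with NumPy on a small set of made-up scores. The values and variable names are hypothetical. In practice, `sklearn.metrics` provides `mean_absolute_error`, `mean_squared_error`, and `r2_score` for the same calculations.

```python
import numpy as np

# Hypothetical actual and predicted scores (illustration only).
y_true = np.array([20.0, 30.0, 25.0, 35.0])
y_hat = np.array([22.0, 27.0, 25.0, 30.0])

errors = y_true - y_hat

# MAE: average size of the error, in points
mae = np.mean(np.abs(errors))

# MSE: average squared error, in points squared
mse = np.mean(errors ** 2)

# R2: 1 minus (model's squared error / squared error of always guessing the mean)
ss_res = np.sum(errors ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print(mae, mse, r2)
```

In this example the single 5-point miss contributes 25 of the 38 total squared error, which shows how MSE weights large misses more heavily than MAE does.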

We will use MAE so that our score metric represents points from the game.

If we are unhappy with the score, we can go back and change the features or parameters used in the model. This sort of evaluate-modify feedback loop is important, but it is also one of the easiest ways to introduce leakage or bias in a way that overfits our model. 
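One common way to limit this risk is to set aside a third group of rows, a validation set, for the evaluate-modify loop and to score against the test set only once at the end. This notebook does not do that. The sketch below is a hypothetical illustration with arbitrary split sizes and made-up row counts.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable

n_games = 20  # hypothetical number of rows in the dataset
indices = rng.permutation(n_games)

# Split sizes are arbitrary choices for this illustration.
n_train, n_val = 14, 3
train_idx = indices[:n_train]                # used to fit the model
val_idx = indices[n_train:n_train + n_val]   # used while tuning features
test_idx = indices[n_train + n_val:]         # used once, for the final score

print(len(train_idx), len(val_idx), len(test_idx))
```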

In [None]:
mae_score = mean_absolute_error(y_test, y_pred)
print(mae_score)

## Predict the Superbowl Winner ##

Now that we are happy with our model, let's make a prediction with the Superbowl teams facing off against each other. These statistics weren't in our original data and at this point, it's probably easiest to just manually create the appropriate arrays following the same order as our original input features in X.
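Column order matters because the model matches inputs to coefficients by position. As a small safeguard, a manually built DataFrame can be reordered to match the training columns. The sketch below assumes the training column order was `Team_49ers`, `Team_Chiefs`, `OppPA`, `OppTO`; the names `training_columns`, `new_rows`, and `aligned` are hypothetical.

```python
import pandas as pd

# Assumed column order of the training inputs.
training_columns = ['Team_49ers', 'Team_Chiefs', 'OppPA', 'OppTO']

# Rows typed in by hand, deliberately in a different column order.
new_rows = pd.DataFrame({'OppTO': [28, 18],
                         'OppPA': [294, 298],
                         'Team_Chiefs': [0, 1],
                         'Team_49ers': [1, 0]})

# Selecting with the training column list puts the columns in that order.
aligned = new_rows[training_columns]
print(list(aligned.columns))
```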

By the way, we know the result of the Chiefs-49ers Superbowl. Hopefully our model predicts the Chiefs winning 25-22.

### 49ers vs Chiefs ###
|Team|Opponent Points Against|Opponent Turnovers|
|----|-----------------------|------------------|
|49ers|294 (Chiefs had a good defense)|28 (Chiefs were turnover prone)|
|Chiefs|298 (49ers also had a good defense)|18 (49ers protected the football)|

In [None]:
# No need for OHE because we created the data manually
X_real = pd.DataFrame({
    'Team_49ers': [1, 0],
    'Team_Chiefs': [0, 1],
    'OppPA': [294, 298],
    'OppTO': [28, 18],
})
X_real

In [None]:
X_real[['OppPA', 'OppTO']] = scaler.transform(X_real[['OppPA', 'OppTO']])
X_real

In [None]:

superbowl_scores = model.predict(X_real)
# y was a one-column DataFrame, so predictions have shape (2, 1); flatten to 1-D
superbowl_scores = np.round(superbowl_scores).astype(np.uint8).ravel()
print("2023 SUPERBOWL PREDICTION")
print(f"49ers: {superbowl_scores[0]}")
print(f"Chiefs: {superbowl_scores[1]}")