## **Project Introduction**

The goal of this project is to predict which NFL team will win a match based on their recent historical performance. This is a classification problem, where we aim to forecast the outcome (win or loss) of a game using historical team data and performance metrics.

Predicting NFL game outcomes benefits:
- **Coaches** by refining strategies,
- **Analysts** by providing insights,
- **Betting companies** by setting better odds, and
- **Fans** by boosting engagement.

By analyzing historical performance data, we can uncover patterns that inform decision-making for team preparation, betting strategies, and fan interactions. This improves outcomes for all stakeholders involved, from optimizing team tactics to generating content for fans.


In [1]:
import pandas as pd

In [2]:
# Read in base data from csv
df = pd.read_csv('nfl.csv')
df.head(5)

Unnamed: 0,Rk,Team,Date,Season,Pts,PtsO,Rate,TO,Y/P,DY/P,...,Int.1,Rate.3,Opp_Rush_Att,Opp_Rush_Yds,Y/A.1,TD.3,Rush,Pass,Tot.1,TO.2
0,1,DET,2024-12-05,2024,34,31,109.7,0,5.14,6.62,...,0,110.2,24,99,4.1,3,12,81,93,0
1,2,GNB,2024-12-05,2024,31,34,111.7,0,6.62,5.14,...,1,109.3,34,111,3.3,1,-12,-81,-93,0
2,3,DEN,2024-12-02,2024,41,32,65.7,1,6.56,6.57,...,3,86.5,23,77,3.3,0,29,-181,-152,1
3,4,CLE,2024-12-02,2024,32,41,88.1,-1,6.57,6.56,...,2,65.7,26,106,4.1,2,-29,181,152,-1
4,5,JAX,2024-12-01,2024,20,23,83.0,-1,5.57,5.34,...,0,92.5,25,108,4.3,1,-11,58,47,-1


In [3]:
# Display all rows and columns
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)

### **Data Cleaning & Exploration**
Here is a general overview of the steps we took during data cleaning and exploration.

- **Handle missing/invalid values**: Fill in or remove missing/invalid data to maintain dataset integrity.
- **Standardize formats, data types, and units**: Ensure consistency in date formats, data types, and measurement units.
- **Remove unnecessary columns**: Drop irrelevant columns to simplify the dataset.
- **Anomaly detection & outlier filtering**: Identify and manage extreme values that may distort analysis.
- **Correlation analysis**: Analyze relationships between variables to identify strong correlations that inform feature selection.


#### **Handle Missing/Invalid Values**
To handle missing values, we count the number of null values in each column.
- We can see that the data is quite complete, with no null values except in the 'Away' column.
- Upon further inspection of the 'Away' column, NaN denotes that the current team is at home and '@' denotes that the current team is away.
- Therefore, we do not need to handle any missing values.

Upon further data inspection, we found that some games ended in a tie.
- We consider a tie an invalid result, since we are only trying to predict a win or a loss.
- Therefore, all rows with a tie result are removed.
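
As a minimal sketch of the two checks above, the toy frame below mimics the 'Away' and 'Result' columns. The team codes and the exact 'Result' strings (for example `'W 34-31'`) are assumptions made for illustration; the notebook itself only relies on the first character of 'Result'.

```python
import numpy as np
import pandas as pd

# Toy rows; team codes and 'Result' strings are hypothetical, not copied from nfl.csv
toy = pd.DataFrame({
    'Team':   ['AAA', 'BBB', 'CCC', 'DDD'],
    'Away':   [np.nan, '@', np.nan, '@'],
    'Result': ['W 34-31', 'L 31-34', 'T 20-20', 'T 20-20'],
})

# NaN in 'Away' marks a home game, '@' marks an away game
is_home = toy['Away'].isna()

# Keep only rows whose result starts with W or L, which drops ties
no_ties = toy[toy['Result'].str[0].isin(['W', 'L'])]

print(is_home.tolist())
print(no_ties['Team'].tolist())
```

Filtering on an explicit allow-list (`['W', 'L']`) is one possible design choice; it also drops any unexpected result codes, whereas excluding only `'T'` would keep them.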

In [4]:
# Handle missing values

null_counts = df.isnull().sum()

if (null_counts > 0).any():
    print(null_counts[null_counts > 0])

Away    739
dtype: int64


In [5]:
# Handle invalid values

df = df[~df['Result'].str[0].isin(['T'])]

#### **Standardize Formats, Data Types, & Units**

Before we complete the other data cleaning steps, we should standardize the data to make comparisons easier. This includes:
- Checking that each column has a consistent type.
  - Our check found that each column had a consistent type except 'Away', which we handle in the next step.
- Standardizing non-numeric / categorical data.
  - Here we print out any columns without the 'int64' or 'float64' type.
  - Then we go through each column and convert it to a standardized type.
  - As a final check, we print the data type of each column and double-check by inspecting the data.
- Confirming consistent units.
  - Since the source of the data provides units, they are expected to be consistent.
  - However, this is further checked by a visual inspection of the data and of any outliers/anomalies that may be detected.
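
The sketch below illustrates two of the steps above on small stand-alone Series: detecting a mixed-type column and converting `MM:SS` strings to seconds. The sample strings are assumptions chosen for illustration, and the vectorized `str.split` approach is an alternative to a row-wise `apply`, not necessarily what the notebook uses.

```python
import pandas as pd

# A column mixing NaN (a float) with strings reports two distinct Python types
mixed = pd.Series([float('nan'), '@', '@'])
n_types = mixed.map(type).nunique()
print(n_types)

# Hypothetical time-of-possession strings in MM:SS form
top = pd.Series(['36:06', '23:54', '27:50'])

# Split once on ':' and combine the two parts into total seconds
parts = top.str.split(':', expand=True).astype(int)
top_seconds = parts[0] * 60 + parts[1]
print(top_seconds.tolist())
```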

In [6]:
# Check that each column has a consistent type
for column in df.columns:
  if df[column].map(type).nunique() > 1:
    print(f"Column '{column}' contains mixed types.")

Column 'Away' contains mixed types.


In [7]:
# Standardizing non-numeric data
non_numeric_columns = df.select_dtypes(exclude=['int64', 'float64'])
print(non_numeric_columns.dtypes)

Team      object
Date      object
ToP       object
Day       object
Away      object
Opp       object
Result    object
ToP.1     object
Time      object
dtype: object


In [8]:
# Standardizing non-numeric data
df['Date'] = pd.to_datetime(df['Date'])

def convert_to_seconds(time_str):
    minutes, seconds = map(int, time_str.split(':'))
    return minutes * 60 + seconds
df['ToP'] = df['ToP'].apply(convert_to_seconds)
df['ToP.1'] = df['ToP.1'].apply(convert_to_seconds)

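# NaN (home game) becomes 1 and '@' (away game) becomes 0
df['Away'] = df['Away'].isna().astype(int)

def convert_to_minutes(time_str):
    # Keeps only the integer before the colon, despite the function name
    return int(time_str.split(':')[0])
df['Time'] = df['Time'].apply(convert_to_minutes)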

df['Result'] = df['Result'].str[0].map({'W': 1, 'L': 0})

In [9]:
# Standardizing non-numeric data
non_numeric_columns = df.select_dtypes(exclude=['int64', 'float64'])
print(non_numeric_columns.dtypes)

Team            object
Date    datetime64[ns]
Day             object
Away             int32
Opp             object
dtype: object


In [10]:
# Standardizing non-numeric data
df.head(5)

Unnamed: 0,Rk,Team,Date,Season,Pts,PtsO,Rate,TO,Y/P,DY/P,ToP,Rate.1,Att,Att.1,Day,G#,Week,Away,Opp,Result,Pts.1,PtsO.1,PtDif,PC,Cmp,Att.2,Inc,Cmp%,Yds,TD,Int,TD%,Int%,Rate.2,Sk,Yds.1,Sk%,Y/A,NY/A,AY/A,ANY/A,Y/C,Rush_Att,Rush_Yds,Rush_Y/A,TD.1,Tot,Ply,Y/P.1,DPly,DY/P.1,TO.1,ToP.1,Time,Cmp.1,Att.3,Cmp%.1,Yds.2,TD.2,Sk.1,Yds.3,Int.1,Rate.3,Opp_Rush_Att,Opp_Rush_Yds,Y/A.1,TD.3,Rush,Pass,Tot.1,TO.2
0,1,DET,2024-12-05,2024,34,31,109.7,0,5.14,6.62,2166,110.2,34,24,Thu,13,14,1,GNB,1,34,31,3,65,32,41,9,78.0,280,3,1,7.3,2.4,109.7,1,3,2.38,6.8,6.67,7.2,7.02,8.8,34,111,3.3,1,391,76,5.14,45,6.62,1,2166,3,12,20,60.0,199,1,1,7,0,110.2,24,99,4.1,3,12,81,93,0
1,2,GNB,2024-12-05,2024,31,34,111.7,0,6.62,5.14,1434,109.3,24,34,Thu,13,14,0,DET,0,31,34,-3,65,12,20,8,60.0,199,1,0,5.0,0.0,111.7,1,7,4.76,10.0,9.48,10.95,10.43,16.6,24,99,4.1,3,298,45,6.62,76,5.14,1,1434,3,32,41,78.0,280,3,1,3,1,109.3,34,111,3.3,1,-12,-81,-93,0
2,3,DEN,2024-12-02,2024,41,32,65.7,1,6.56,6.57,1670,86.5,26,23,Mon,13,13,1,CLE,1,41,32,9,73,18,35,17,51.4,294,1,2,2.9,5.7,65.7,0,0,0.0,8.4,8.4,6.4,6.4,16.3,26,106,4.1,2,400,61,6.56,84,6.57,2,1670,3,34,58,58.6,475,4,3,22,3,86.5,23,77,3.3,0,29,-181,-152,1
3,4,CLE,2024-12-02,2024,32,41,88.1,-1,6.57,6.56,1930,65.7,23,26,Mon,12,13,0,DEN,0,32,41,-9,73,34,58,24,58.6,475,4,3,6.9,5.2,88.1,3,22,4.92,8.2,7.79,7.24,6.89,14.0,23,77,3.3,0,552,84,6.57,61,6.56,3,1930,3,18,35,51.4,294,1,0,0,2,65.7,26,106,4.1,2,-29,181,152,-1
4,5,JAX,2024-12-01,2024,20,23,83.0,-1,5.57,5.34,1663,92.5,25,25,Sun,12,13,1,HOU,0,20,23,-3,43,24,42,18,57.1,276,2,1,4.8,2.4,83.0,0,0,0.0,6.6,6.57,6.45,6.45,11.5,25,97,3.9,0,373,67,5.57,61,5.34,1,1663,3,22,34,64.7,218,1,2,24,0,92.5,25,108,4.3,1,-11,58,47,-1


#### **Remove Unnecessary Columns**

To simplify our dataset, we should check whether there are any unnecessary columns. These can include:
- Columns containing irrelevant information
  - Upon inspection of the data, we can see that the 'Rk' column serves as an index column.
  - Therefore, the 'Rk' column can be considered irrelevant to our model and is removed.
- Columns with constant data
  - No columns were found to have constant data.
- Columns with excessive missing data
  - Since we found no missing values that need handling, we do not need to account for this case.
- Columns with identical data (we will keep the first column alphabetically for each duplicate)
  - Column 'Att' is a duplicate of column 'Rush_Att'
  - Column 'Att.1' is a duplicate of column 'Opp_Rush_Att'
  - Column 'DY/P' is a duplicate of column 'DY/P.1'
  - Column 'Pts' is a duplicate of column 'Pts.1'
  - Column 'PtsO' is a duplicate of column 'PtsO.1'
  - Column 'Rate' is a duplicate of column 'Rate.2'
  - Column 'Rate.1' is a duplicate of column 'Rate.3'
  - Column 'TO' is a duplicate of column 'TO.2'
  - Column 'ToP' is a duplicate of column 'ToP.1'
  - Column 'Y/P' is a duplicate of column 'Y/P.1'
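
For comparison, duplicate columns can also be found without a nested loop by transposing the frame and flagging repeated rows. This is a sketch on a three-column toy frame, not the notebook's own method; note that it keeps the first duplicate in column order rather than alphabetically, and that it compares values only.

```python
import pandas as pd

toy = pd.DataFrame({
    'Pts':   [34, 31, 41],
    'Pts.1': [34, 31, 41],   # identical to 'Pts'
    'PtsO':  [31, 34, 32],
})

# Transpose so columns become rows, then flag rows that repeat an earlier row
dup_mask = toy.T.duplicated(keep='first')

# Keep only the columns that were not flagged as repeats
deduped = toy.loc[:, ~dup_mask.to_numpy()]
print(list(deduped.columns))
```

Transposing a wide frame with mixed dtypes can be slow and upcasts values to a common type, so the pairwise `equals` check may still be preferable on the full dataset.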

In [11]:
# Columns containing irrelevant information
df.drop('Rk', axis=1, inplace=True)
df.head(5)

Unnamed: 0,Team,Date,Season,Pts,PtsO,Rate,TO,Y/P,DY/P,ToP,Rate.1,Att,Att.1,Day,G#,Week,Away,Opp,Result,Pts.1,PtsO.1,PtDif,PC,Cmp,Att.2,Inc,Cmp%,Yds,TD,Int,TD%,Int%,Rate.2,Sk,Yds.1,Sk%,Y/A,NY/A,AY/A,ANY/A,Y/C,Rush_Att,Rush_Yds,Rush_Y/A,TD.1,Tot,Ply,Y/P.1,DPly,DY/P.1,TO.1,ToP.1,Time,Cmp.1,Att.3,Cmp%.1,Yds.2,TD.2,Sk.1,Yds.3,Int.1,Rate.3,Opp_Rush_Att,Opp_Rush_Yds,Y/A.1,TD.3,Rush,Pass,Tot.1,TO.2
0,DET,2024-12-05,2024,34,31,109.7,0,5.14,6.62,2166,110.2,34,24,Thu,13,14,1,GNB,1,34,31,3,65,32,41,9,78.0,280,3,1,7.3,2.4,109.7,1,3,2.38,6.8,6.67,7.2,7.02,8.8,34,111,3.3,1,391,76,5.14,45,6.62,1,2166,3,12,20,60.0,199,1,1,7,0,110.2,24,99,4.1,3,12,81,93,0
1,GNB,2024-12-05,2024,31,34,111.7,0,6.62,5.14,1434,109.3,24,34,Thu,13,14,0,DET,0,31,34,-3,65,12,20,8,60.0,199,1,0,5.0,0.0,111.7,1,7,4.76,10.0,9.48,10.95,10.43,16.6,24,99,4.1,3,298,45,6.62,76,5.14,1,1434,3,32,41,78.0,280,3,1,3,1,109.3,34,111,3.3,1,-12,-81,-93,0
2,DEN,2024-12-02,2024,41,32,65.7,1,6.56,6.57,1670,86.5,26,23,Mon,13,13,1,CLE,1,41,32,9,73,18,35,17,51.4,294,1,2,2.9,5.7,65.7,0,0,0.0,8.4,8.4,6.4,6.4,16.3,26,106,4.1,2,400,61,6.56,84,6.57,2,1670,3,34,58,58.6,475,4,3,22,3,86.5,23,77,3.3,0,29,-181,-152,1
3,CLE,2024-12-02,2024,32,41,88.1,-1,6.57,6.56,1930,65.7,23,26,Mon,12,13,0,DEN,0,32,41,-9,73,34,58,24,58.6,475,4,3,6.9,5.2,88.1,3,22,4.92,8.2,7.79,7.24,6.89,14.0,23,77,3.3,0,552,84,6.57,61,6.56,3,1930,3,18,35,51.4,294,1,0,0,2,65.7,26,106,4.1,2,-29,181,152,-1
4,JAX,2024-12-01,2024,20,23,83.0,-1,5.57,5.34,1663,92.5,25,25,Sun,12,13,1,HOU,0,20,23,-3,43,24,42,18,57.1,276,2,1,4.8,2.4,83.0,0,0,0.0,6.6,6.57,6.45,6.45,11.5,25,97,3.9,0,373,67,5.57,61,5.34,1,1663,3,22,34,64.7,218,1,2,24,0,92.5,25,108,4.3,1,-11,58,47,-1


In [12]:
# Columns with constant data
constant_columns = df.columns[df.nunique() == 1]
print('Columns with constant data:', constant_columns)

Columns with constant data: Index([], dtype='object')


In [13]:
# Columns with identical data
duplicate_columns = {}
seen_pairs = set()

for col1 in df.columns:
  for col2 in df.columns:
    if col1 != col2 and df[col1].equals(df[col2]):
      if (col1, col2) not in seen_pairs and (col2, col1) not in seen_pairs:
        duplicate_columns[col1] = col2
        seen_pairs.add((col1, col2))

sorted_duplicate_columns = sorted(duplicate_columns.items())

for col, duplicate in sorted_duplicate_columns:
  print(f"Column '{col}' is a duplicate of column '{duplicate}'")

duplicated_columns_list = list(set(duplicate_columns.keys()).union(duplicate_columns.values()))
print(df[duplicated_columns_list].head(5).T.sort_index(axis=0))


Column 'Att' is a duplicate of column 'Rush_Att'
Column 'Att.1' is a duplicate of column 'Opp_Rush_Att'
Column 'DY/P' is a duplicate of column 'DY/P.1'
Column 'Pts' is a duplicate of column 'Pts.1'
Column 'PtsO' is a duplicate of column 'PtsO.1'
Column 'Rate' is a duplicate of column 'Rate.2'
Column 'Rate.1' is a duplicate of column 'Rate.3'
Column 'TO' is a duplicate of column 'TO.2'
Column 'ToP' is a duplicate of column 'ToP.1'
Column 'Y/P' is a duplicate of column 'Y/P.1'
                    0        1        2        3        4
Att             34.00    24.00    26.00    23.00    25.00
Att.1           24.00    34.00    23.00    26.00    25.00
DY/P             6.62     5.14     6.57     6.56     5.34
DY/P.1           6.62     5.14     6.57     6.56     5.34
Opp_Rush_Att    24.00    34.00    23.00    26.00    25.00
Pts             34.00    31.00    41.00    32.00    20.00
Pts.1           34.00    31.00    41.00    32.00    20.00
PtsO            31.00    34.00    32.00    41.00    23.0

In [14]:
# Columns with identical data
for col1, duplicate in duplicate_columns.items():
    if duplicate in df.columns:
        df.drop(duplicate, axis=1, inplace=True)

df.head(5)

Unnamed: 0,Team,Date,Season,Pts,PtsO,Rate,TO,Y/P,DY/P,ToP,Rate.1,Att,Att.1,Day,G#,Week,Away,Opp,Result,PtDif,PC,Cmp,Att.2,Inc,Cmp%,Yds,TD,Int,TD%,Int%,Sk,Yds.1,Sk%,Y/A,NY/A,AY/A,ANY/A,Y/C,Rush_Yds,Rush_Y/A,TD.1,Tot,Ply,DPly,TO.1,Time,Cmp.1,Att.3,Cmp%.1,Yds.2,TD.2,Sk.1,Yds.3,Int.1,Opp_Rush_Yds,Y/A.1,TD.3,Rush,Pass,Tot.1
0,DET,2024-12-05,2024,34,31,109.7,0,5.14,6.62,2166,110.2,34,24,Thu,13,14,1,GNB,1,3,65,32,41,9,78.0,280,3,1,7.3,2.4,1,3,2.38,6.8,6.67,7.2,7.02,8.8,111,3.3,1,391,76,45,1,3,12,20,60.0,199,1,1,7,0,99,4.1,3,12,81,93
1,GNB,2024-12-05,2024,31,34,111.7,0,6.62,5.14,1434,109.3,24,34,Thu,13,14,0,DET,0,-3,65,12,20,8,60.0,199,1,0,5.0,0.0,1,7,4.76,10.0,9.48,10.95,10.43,16.6,99,4.1,3,298,45,76,1,3,32,41,78.0,280,3,1,3,1,111,3.3,1,-12,-81,-93
2,DEN,2024-12-02,2024,41,32,65.7,1,6.56,6.57,1670,86.5,26,23,Mon,13,13,1,CLE,1,9,73,18,35,17,51.4,294,1,2,2.9,5.7,0,0,0.0,8.4,8.4,6.4,6.4,16.3,106,4.1,2,400,61,84,2,3,34,58,58.6,475,4,3,22,3,77,3.3,0,29,-181,-152
3,CLE,2024-12-02,2024,32,41,88.1,-1,6.57,6.56,1930,65.7,23,26,Mon,12,13,0,DEN,0,-9,73,34,58,24,58.6,475,4,3,6.9,5.2,3,22,4.92,8.2,7.79,7.24,6.89,14.0,77,3.3,0,552,84,61,3,3,18,35,51.4,294,1,0,0,2,106,4.1,2,-29,181,152
4,JAX,2024-12-01,2024,20,23,83.0,-1,5.57,5.34,1663,92.5,25,25,Sun,12,13,1,HOU,0,-3,43,24,42,18,57.1,276,2,1,4.8,2.4,0,0,0.0,6.6,6.57,6.45,6.45,11.5,97,3.9,0,373,67,61,1,3,22,34,64.7,218,1,2,24,0,108,4.3,1,-11,58,47


#### **Anomaly Detection & Outlier Filtering**

Anomaly detection is essential for identifying and addressing outliers or errors in the dataset.
This helps ensure that the analysis remains accurate and that the model is not influenced by misleading or irrelevant data points. Note that this step is also partly data exploration.
We followed the steps below:

- Plot distributions for numeric data
	- Before searching for anomalies, it is good to visualize how the data is distributed
	- Additionally, some of the numeric columns are inherently categorical or ordinal rather than continuous, so they are excluded from the plots:
		- Result
		- Season
		- Time
		- Week
		- G#
- Use the plots to inspect the type of each distribution
	- We classified only Yds.1 and Yds.3 as "exponential"
	- Everything else is treated as having a "normal" distribution, even if severely skewed
- Calculate Z-scores separately for the normal and exponential features (the exponential features are log-transformed with `log1p` first), then combine the results:
	- We get an anomaly rate of 15.60% for a Z-score threshold of 3 and 2.92% for a Z-score threshold of 4
	- Printing out the anomalies for a Z-score threshold of 4, we still see games with relatively normal stats
	- Therefore, we concluded that, given the chaotic and variable nature of sports, and football in particular, we cannot confidently determine which games should be considered anomalies, so no rows are removed
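
The Z-score procedure above can be sketched on synthetic data as follows. Everything here is assumed for illustration (the column names, the random data, and the injected outlier), and the Z-score is computed by hand with `ddof=0`, which matches the default of `scipy.stats.zscore`.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
toy = pd.DataFrame({
    'normal_like': rng.normal(100, 15, 500),   # hypothetical roughly normal stat
    'skewed':      rng.exponential(5, 500),    # hypothetical right-skewed stat
})
toy.loc[0, 'normal_like'] = 400  # inject one obvious outlier

def zscores(s):
    # Population standard deviation (ddof=0)
    return (s - s.mean()) / s.std(ddof=0)

z_normal = zscores(toy['normal_like'])
z_skewed = zscores(np.log1p(toy['skewed']))  # log1p compresses the long right tail

# A row is flagged if either feature exceeds the threshold
flagged = (z_normal.abs() > 4) | (z_skewed.abs() > 4)
print(f"{flagged.mean() * 100:.2f}% of rows flagged")
```

Because a row is flagged when any single feature crosses the threshold, the flagged share grows with the number of features, which may partly explain a high anomaly rate on a wide table.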







In [15]:
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from scipy import stats


In [None]:
exclude_columns = ['Result', 'Season', 'Time', 'Week', 'G#']

def get_numeric_columns(df, exclude_columns):
    return sorted([col for col in df.select_dtypes(include=['float64', 'int64']).columns if col not in exclude_columns])

numeric_columns = get_numeric_columns(df, exclude_columns)

In [None]:
def fit_distributions(data):
	norm_params = stats.norm.fit(data) # Normal distribution
	
	exp_params = stats.expon.fit(data) # Exponential distribution
	
	if (data > 0).all():
		lognorm_params = stats.lognorm.fit(data) # Log-Normal distribution (if all > 0)
		return norm_params, exp_params, lognorm_params, True
	else:
		return norm_params, exp_params, None, False

In [None]:
num_cols = 4
num_rows = int(np.ceil(len(numeric_columns) / num_cols))
plt.figure(figsize=(num_cols * 4, num_rows * 4))

for i, col in enumerate(numeric_columns, 1):
	data = df[col]
	norm_params, exp_params, lognorm_params, plot_lognorm = fit_distributions(data)
	plt.subplot(num_rows, num_cols, i)
	sns.histplot(data, kde=True, bins=20, color='blue', stat="density")
	xmin, xmax = plt.xlim()
	x = np.linspace(xmin, xmax, 100)

	p_norm = stats.norm.pdf(x, *norm_params) # Normal distribution
	plt.plot(x, p_norm, 'k-', label=f'Normal fit')

	p_exp = stats.expon.pdf(x, *exp_params) # Exponential distribution
	plt.plot(x, p_exp, 'r-', label=f'Exponential fit')

	if plot_lognorm:
		p_lognorm = stats.lognorm.pdf(x, *lognorm_params) # Log-Normal distribution (if all > 0)
		plt.plot(x, p_lognorm, 'g-', label=f'Log-Normal fit') 

	plt.title(f'Distribution of {col}')
	plt.legend()

plt.tight_layout()
plt.show()


In [None]:
import pandas as pd
import numpy as np
from scipy.stats import zscore
from sklearn.preprocessing import StandardScaler


In [None]:
numeric_columns = df.select_dtypes(include=[np.number]).columns
exponential_features = ['Yds.1', 'Yds.3']
normal_features = [col for col in numeric_columns if col not in exponential_features and col not in ['Result', 'Season', 'Time', 'Week', 'G#']]


normal_data = df[normal_features]
exponential_data = df[exponential_features]

normal_data_standardized = normal_data.apply(zscore)
exponential_data_log = np.log1p(exponential_data)
exponential_data_log_standardized = exponential_data_log.apply(zscore)

normal_anomalies_3 = (np.abs(normal_data_standardized) > 3).any(axis=1)
normal_anomalies_4 = (np.abs(normal_data_standardized) > 4).any(axis=1)
exponential_anomalies_3 = (np.abs(exponential_data_log_standardized) > 3).any(axis=1)
exponential_anomalies_4 = (np.abs(exponential_data_log_standardized) > 4).any(axis=1)
final_anomalies_3 = normal_anomalies_3 | exponential_anomalies_3
final_anomalies_4 = normal_anomalies_4 | exponential_anomalies_4

print(f"Anomalies detected (threshold = 3): {final_anomalies_3.mean() * 100:.2f}%")
print(f"Anomalies detected (threshold = 4): {final_anomalies_4.mean() * 100:.2f}%")

In [None]:
plt.hist(normal_data_standardized.values.flatten(), bins=50, alpha=0.7, label="Normal Features")
plt.hist(exponential_data_log_standardized.values.flatten(), bins=50, alpha=0.7, label="Exponential Features")
plt.title("Distribution of Z-scores")
plt.xlabel("Z-score")
plt.ylabel("Frequency")
plt.legend()
plt.show()

In [None]:
temp_df = df.copy()
temp_df['Anomaly'] = final_anomalies_4.astype(int)
anomalous_rows = temp_df[temp_df['Anomaly'] == 1]
anomalous_rows.head(10)


#### **Correlation Analysis**

Correlation analysis helps guide our feature engineering. It is useful for detecting irrelevant information and reducing dimensionality. To do so, we computed the following:
- Correlation of each column with 'Result'
	- 'Result' only had a high correlation with 'PtDif', which makes sense because the point differential determines the winner
	- 'Inc' and 'Int' don't seem to be correlated with 'Result'
- Correlation of each column with every other column (threshold = 0.7)
	- Several stats are very highly correlated with each other, mostly because of how they are calculated
	- 'G#' and 'Week': expected, since both are time based, but neither seems to be highly correlated with any stat or with the result
	- Many of the *Y/A stats are highly correlated, since they are mostly calculated from total yards
	- 'Rate' also seems to be highly correlated with multiple other stats, again because of how the rating is calculated
	- There is also a pattern of a raw stat and its percentage version (stat %) being highly correlated, which may make one of them redundant
- Correlation of each column with 'Y/P'
	- NY/A, ANY/A, AY/A, Y/A, Y/C, and Yds are all highly correlated with 'Y/P'
- Correlation of each column with 'DY/P'
	- 'DY/P' does not seem to be highly correlated with any stat
- Correlation of each column with 'Rate'
	- 'Rate' is moderately correlated with stats related to touchdowns, completions, yards, and interceptions, which are used to calculate it
- Correlation of each column with 'Yds'
	- 'Yds' is highly correlated with stats such as 'Cmp' and 'Tot'
- Correlation of each column with 'Yds.1'
	- Yards lost to sacks is strongly correlated with the number of sacks, 'Sk'

In [None]:
numeric_columns = df.select_dtypes(include=[np.number]).columns
numeric_data = df[numeric_columns]

In [None]:
# Step 1: Compute the correlation matrix
correlation_matrix = numeric_data.corr()

# Step 2: Build a mask that keeps only the upper triangle, so each pair appears once (A vs B, not also B vs A)
# The diagonal is still included here; self-correlations are removed in Step 5
mask = np.triu(np.ones_like(correlation_matrix, dtype=bool))

# Step 3: Extract pairs of features with strong correlations (e.g., |correlation| > 0.7)
threshold = 0.7
strong_correlations = correlation_matrix.where(mask).stack().reset_index()
strong_correlations.columns = ['Feature 1', 'Feature 2', 'Correlation']

# Step 4: Filter strong correlations above the threshold (both positive and negative)
strong_correlations = strong_correlations[strong_correlations['Correlation'].abs() > threshold]

# Step 5: Remove self-correlations (Feature 1 and Feature 2 being the same)
strong_correlations = strong_correlations[strong_correlations['Feature 1'] != strong_correlations['Feature 2']]

# Step 6: Sort the correlations by absolute value
strong_correlations['abs_correlation'] = strong_correlations['Correlation'].abs()
strong_correlations_sorted = strong_correlations.sort_values(by='abs_correlation', ascending=False).reset_index(drop=True)

# Step 7: Display the result
print(strong_correlations_sorted[['Feature 1', 'Feature 2', 'Correlation']])


In [None]:
def correlations_with_feature(feature_name):
	correlations = numeric_data.corrwith(df[feature_name])
	correlations = correlations.reset_index()

	correlations.columns = ['Feature', 'Correlation with ' + feature_name]
	correlations['abs_correlation'] = correlations['Correlation with ' + feature_name].abs()

	correlations_sorted = correlations.sort_values(by='abs_correlation', ascending=False).reset_index(drop=True)
	print(correlations_sorted[['Feature', 'Correlation with ' + feature_name]])

In [None]:
correlations_with_feature('Result')

In [None]:
correlations_with_feature('Week')

In [None]:
correlations_with_feature('G#')

In [None]:
correlations_with_feature('Y/P')

In [None]:
correlations_with_feature('DY/P')

In [None]:
correlations_with_feature('Rate')

In [None]:
correlations_with_feature('Yds.1')

### **Feature Selection & Engineering**

Feature selection and engineering help identify the most relevant variables (reducing dimensionality) and transform raw data into informative features, which improves model accuracy, reduces complexity, and helps prevent overfitting.

#### **Feature Selection**

For our feature selection, we used the results of the correlation analysis together with logic and previous experience with the sport.

The general method is to train the model on each game with the format of<br>
[General Game Data | Team 1 Stats | Team 2 Stats | Win/Loss]

Since each game appears twice in the dataset (once from each team's perspective), we can take the Team 1 data for each game and then merge it with the mirrored row to obtain the Team 2 data.
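
The mirrored-row merge can be sketched on a two-row toy table. The point totals come from the DET/GNB rows shown earlier; the reduced column set is an assumption made to keep the example short.

```python
import pandas as pd

# One game, listed once from each team's perspective
games = pd.DataFrame({
    'Date':     ['2024-12-05', '2024-12-05'],
    'Team1':    ['DET', 'GNB'],
    'Team2':    ['GNB', 'DET'],
    'Team1Pts': [34, 31],
})

# Swap the team labels and relabel the stat so each row describes the opponent
mirror = games.rename(columns={
    'Team1': 'Team2',
    'Team2': 'Team1',
    'Team1Pts': 'Team2Pts',
})

# Each row picks up its opponent's stats from the mirrored copy
both = games.merge(mirror, on=['Date', 'Team1', 'Team2'], how='left')
print(both)
```

With `how='left'`, a row whose mirror is missing keeps its Team 1 stats and gets NaN for the Team 2 columns, which also turns those columns into floats.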

General Game Data:
- Season
- Date
- Team
- Opp
- Away

Win/Loss:
- Result

Team 1 Stats Included:
- Pts: Generally important for a win
- PtDif: How much they won/lost by
- TO: Generally important for a win
- Rate: Summarizes many important stats for a game (NFL rating)
- Y/P: Generally important for a win, strongly correlated to other stats, serves as a summary
- DY/P: Generally important for a win
- ToP: Generally important for a win
- Sk: Generally detrimental to a win
- Yds.1: Sacked yards: Generally detrimental to a win
- Att: Rushing attempts: Generally a positive sign of performance
- Rush_Yds: Generally important for a win

Team 1 Stats Excluded:
- Rate.1: A different rating system to the NFL rating, removed for consistency
- NY/A: Strongly correlated to Y/P, redundant
- ANY/A: Strongly correlated to Y/P, redundant
- AY/A: Strongly correlated to Y/P, redundant
- Y/A: Strongly correlated to Y/P, redundant
- Y/C: Strongly correlated to Y/P, redundant
- Cmp: Strongly correlated to Yds, redundant
- Inc: Reflected in Y/P
- Int: Reflected in DY/P
- Rush_Y/A: Reflected in Att and Rush_Yds
- TD: Reflected in Pts
- Yds: Reflected in Y/P
- Att.2: Reflected in Y/P

Team 2 Stats (All Excluded):
- Att.3
- Cmp%.1
- Cmp.1
- Int.1
- Sk.1
- Tot.1
- TD.2
- TD.3
- Y/A.1
- Yds.2
- Yds.3
- Opp_Rush_Yds
- Att.1
- TD.1
- TO.1
- PtsO

Excluded:
- Time: Irrelevant
- Week: Irrelevant
- G#: Irrelevant
- Day: Irrelevant
- Cmp%: Strongly correlated to Cmp, redundant
- Int%: Strongly correlated to Int, redundant
- Sk%: Strongly correlated to Sk, redundant
- TD%: Strongly correlated to TD, redundant
- Ply: Irrelevant, total for both teams
- DPly: Irrelevant, total for both teams
- PC: Irrelevant

Therefore, in the end we have:

General Game Data:
- Season -> Season
- Date -> Date
- Team -> Team1
- Opp -> Team2
- Away -> Home

Win/Loss:
- Result -> Team1Won

Team 1 Stats Included:
- Pts -> Team1Pts
- PtDif -> Team1PtDiff
- TO -> Team1TM
- Rate -> Team1Rating
- Y/P -> Team1Y/P
- DY/P -> Team1DY/P
- ToP -> Team1ToP
- Sk -> Team1Sks
- Yds.1 -> Team1SkYds
- Att -> Team1RushAtt
- Rush_Yds -> Team1RushYds

In [None]:
import pandas as pd

# Step 1: Create a new DataFrame with only the selected columns of the cleaned `df`
new_df = df[['Season', 'Date', 'Team', 'Opp', 'Away', 'Pts', 'PtDif', 'TO', 'Rate', 'Y/P', 'DY/P', 'ToP', 'Sk', 'Yds.1', 'Att', 'Rush_Yds', 'Result']].copy()

# Step 2: Rename columns
new_df.rename(columns={
    'Season': 'Season',
    'Date': 'Date',
    'Team': 'Team1',
    'Opp': 'Team2',
    'Away': 'Home',
    'Pts': 'Team1Pts',
    'PtDif': 'Team1PtDiff',
    'TO': 'Team1TM',
    'Rate': 'Team1Rating',
    'Y/P': 'Team1Y/P',
    'DY/P': 'Team1DY/P',
    'ToP': 'Team1ToP',
    'Sk': 'Team1Sks',
    'Yds.1': 'Team1SkYds',
    'Att': 'Team1RushAtt',
    'Rush_Yds': 'Team1RushYds',
    'Result': 'Team1Won'
}, inplace=True)

# Step 3: Flip the 0/1 values in the Home column.
# Caution: 'Away' was encoded earlier with isna().astype(int), i.e. 1 when the
# team played at home, so after this flip Home == 1 marks an away game.
# Confirm the intended encoding before relying on this column.
new_df['Home'] = new_df['Home'].apply(lambda x: 1 if x == 0 else 0)

# Step 4: The margin columns (passing, rushing, and sack yards) are added in
# the Feature Engineering section.

# Step 5: Preview the DataFrame
new_df.head(5)


Unnamed: 0,Season,Date,Team1,Team2,Home,Team1Pts,Team1PtDiff,Team1TM,Team1Rating,Team1Y/P,Team1DY/P,Team1ToP,Team1Sks,Team1SkYds,Team1RushAtt,Team1RushYds,Team1Won
0,2024,2024-12-05,DET,GNB,0,34,3,0,109.7,5.14,6.62,2166,1,3,34,111,1
1,2024,2024-12-05,GNB,DET,1,31,-3,0,111.7,6.62,5.14,1434,1,7,24,99,0
2,2024,2024-12-02,DEN,CLE,0,41,9,1,65.7,6.56,6.57,1670,0,0,26,106,1
3,2024,2024-12-02,CLE,DEN,1,32,-9,-1,88.1,6.57,6.56,1930,3,22,23,77,0
4,2024,2024-12-01,JAX,HOU,0,20,-3,-1,83.0,5.57,5.34,1663,0,0,25,97,0


#### **Feature Engineering**

Currently, each row only describes the stats of each team individually. Apart from PtDif, it does not capture the difference between the two teams. Therefore, we introduce 3 new features:
- Pass: passing margin (stored as `Team1PYM`)
	- Yds - Yds.2
- Rush: rushing margin (stored as `Team1RYM`)
	- Rush_Yds - Opp_Rush_Yds
- Sack: sack yardage margin (stored as `Team1YM`)
	- Yds.1 - Yds.3

These help tell the 'story' of how close the game really was.

In [None]:
new_df['Team1PYM'] = df['Yds'] - df['Yds.2']
new_df['Team1RYM'] = df['Rush_Yds'] - df['Opp_Rush_Yds']
new_df['Team1YM'] = df['Yds.1'] - df['Yds.3']

new_df.head(5)

Unnamed: 0,Season,Date,Team1,Team2,Home,Team1Pts,Team1PtDiff,Team1TM,Team1Rating,Team1Y/P,Team1DY/P,Team1ToP,Team1Sks,Team1SkYds,Team1RushAtt,Team1RushYds,Team1Won,Team1PYM,Team1RYM,Team1YM
0,2024,2024-12-05,DET,GNB,0,34,3,0,109.7,5.14,6.62,2166,1,3,34,111,1,81,12,-4
1,2024,2024-12-05,GNB,DET,1,31,-3,0,111.7,6.62,5.14,1434,1,7,24,99,0,-81,-12,4
2,2024,2024-12-02,DEN,CLE,0,41,9,1,65.7,6.56,6.57,1670,0,0,26,106,1,-181,29,-22
3,2024,2024-12-02,CLE,DEN,1,32,-9,-1,88.1,6.57,6.56,1930,3,22,23,77,0,181,-29,22
4,2024,2024-12-01,JAX,HOU,0,20,-3,-1,83.0,5.57,5.34,1663,0,0,25,97,0,58,-11,-24


In [None]:
import pandas as pd

# Step 1: Create a copy of new_df and rename the columns to represent Team2's perspective
df_shifted = new_df.copy()

# Rename columns to reflect Team2's perspective (shifted stats)
df_shifted.rename(columns={
    'Team1': 'Team2',
    'Team2': 'Team1',
    'Team1Pts': 'Team2Pts',
    'Team1PtDiff': 'Team2PtDiff',
    'Team1TM': 'Team2TM',
    'Team1Rating': 'Team2Rating',
    'Team1Y/P': 'Team2Y/P',
    'Team1DY/P': 'Team2DY/P',
    'Team1ToP': 'Team2ToP',
    'Team1Sks': 'Team2Sks',
    'Team1SkYds': 'Team2SkYds',
    'Team1RushAtt': 'Team2RushAtt',
    'Team1RushYds': 'Team2RushYds',
    'Team1PYM': 'Team2PYM',
    'Team1RYM': 'Team2RYM',
    'Team1YM': 'Team2YM',
}, inplace=True)

# Step 2: Merge the original DataFrame with the shifted DataFrame based on Date, Team1, and Team2
merged_df = pd.merge(new_df, df_shifted, on=['Date', 'Team1', 'Team2'], how='left', suffixes=('', '_Team2'))

# Step 3: Clean up the DataFrame by selecting the relevant columns (no duplicate columns)
merged_df = merged_df[['Season', 'Date', 'Team1', 'Team2', 'Home', 
                       'Team1Pts', 'Team2Pts', 'Team1PtDiff', 'Team2PtDiff', 
                       'Team1TM', 'Team2TM', 'Team1Rating', 'Team2Rating', 
                       'Team1Y/P', 'Team2Y/P', 'Team1DY/P', 'Team2DY/P',
                       'Team1ToP', 'Team2ToP', 'Team1Sks', 'Team2Sks', 
                       'Team1SkYds', 'Team2SkYds', 'Team1RushAtt', 
                       'Team2RushAtt', 'Team1RushYds', 'Team2RushYds', 
                       'Team1PYM', 'Team2PYM', 'Team1RYM', 'Team2RYM', 
                       'Team1YM', 'Team2YM', 'Team1Won']]

# Now 'merged_df' will have the data for both teams in the same row, with the stats ported over correctly.

merged_df.head(10)


Unnamed: 0,Season,Date,Team1,Team2,Home,Team1Pts,Team2Pts,Team1PtDiff,Team2PtDiff,Team1TM,Team2TM,Team1Rating,Team2Rating,Team1Y/P,Team2Y/P,Team1DY/P,Team2DY/P,Team1ToP,Team2ToP,Team1Sks,Team2Sks,Team1SkYds,Team2SkYds,Team1RushAtt,Team2RushAtt,Team1RushYds,Team2RushYds,Team1PYM,Team2PYM,Team1RYM,Team2RYM,Team1YM,Team2YM,Team1Won
0,2024,2024-12-05,DET,GNB,0,34,31.0,3,-3.0,0,0.0,109.7,111.7,5.14,6.62,6.62,5.14,2166,1434.0,1,1.0,3,7.0,34,24.0,111,99.0,81,-81.0,12,-12.0,-4,4.0,1
1,2024,2024-12-05,GNB,DET,1,31,34.0,-3,3.0,0,0.0,111.7,109.7,6.62,5.14,5.14,6.62,1434,2166.0,1,1.0,7,3.0,24,34.0,99,111.0,-81,81.0,-12,12.0,4,-4.0,0
2,2024,2024-12-02,DEN,CLE,0,41,32.0,9,-9.0,1,-1.0,65.7,88.1,6.56,6.57,6.57,6.56,1670,1930.0,0,3.0,0,22.0,26,23.0,106,77.0,-181,181.0,29,-29.0,-22,22.0,1
3,2024,2024-12-02,CLE,DEN,1,32,41.0,-9,9.0,-1,1.0,88.1,65.7,6.57,6.56,6.56,6.57,1930,1670.0,3,0.0,22,0.0,23,26.0,77,106.0,181,-181.0,-29,29.0,22,-22.0,0
4,2024,2024-12-01,JAX,HOU,0,20,23.0,-3,3.0,-1,1.0,83.0,95.5,5.57,5.34,5.34,5.57,1663,1937.0,0,2.0,0,24.0,25,25.0,97,108.0,58,-58.0,-11,11.0,-24,24.0,0
5,2024,2024-12-01,CIN,PIT,0,38,44.0,-6,6.0,-2,2.0,112.7,126.4,6.58,7.88,7.88,6.58,1731,1869.0,4,2.0,27,4.0,15,26.0,93,110.0,-128,128.0,-17,17.0,23,-23.0,0
6,2024,2024-12-01,TAM,CAR,1,26,23.0,3,-3.0,-1,1.0,70.7,83.4,5.78,5.4,5.4,5.78,2371,1659.0,4,1.0,31,9.0,39,21.0,236,78.0,-80,80.0,158,-158.0,22,-22.0,1
7,2024,2024-12-01,SFO,BUF,1,10,35.0,-25,25.0,-3,3.0,74.8,138.9,5.09,6.64,6.64,5.09,1593,2007.0,2,0.0,8,0.0,27,38.0,153,220.0,-66,66.0,-67,67.0,8,-8.0,0
8,2024,2024-12-01,LAR,NOR,1,21,14.0,7,-7.0,0,0.0,110.2,85.9,5.85,4.81,4.81,5.85,1643,1957.0,2,0.0,17,0.0,29,31.0,156,143.0,-18,18.0,13,-13.0,17,-17.0,1
9,2024,2024-12-01,ATL,LAC,0,13,17.0,-4,4.0,-3,3.0,40.0,87.2,4.55,4.07,4.07,4.55,2155,1445.0,1,5.0,11,19.0,37,17.0,116,56.0,103,-103.0,60,-60.0,-8,8.0,0


In [None]:
merged_df.to_csv('nfl_cleaned.csv', index=False)

In [None]:
import pandas as pd

df = pd.read_csv('nfl-cleaned.csv')

In [None]:
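# sort_values returns a new DataFrame, so the result must be assigned back.
# A stable sort keeps the existing order of games played on the same date.
df = df.sort_values('Date', kind='stable')

df = df.reset_index()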

We must now compute expanding averages over each team's previous games in order to train
and test our model. $\newline$
This is because only statistics from earlier games are available when we predict
an upcoming game. $\newline$
We use expanding rather than rolling averages because we want the mean
over all previous games, not just a fixed sliding window.
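
To make the difference concrete, here is a minimal sketch on toy point totals (the values and the window size of 2 are illustrative assumptions, not part of the pipeline). It shows that shifting by one after averaging leaves only earlier games in each row's average.

```python
import pandas as pd

# Toy point totals for one team over five consecutive games
pts = pd.Series([3, 23, 25, 22, 17], name="Team1Pts")

# Expanding mean over all *previous* games: average first, then shift by one row
expanding_prev = pts.expanding().mean().shift(1)

# Rolling mean over only the two previous games, for comparison
rolling_prev = pts.rolling(window=2).mean().shift(1)

print(pd.DataFrame({
    "pts": pts,
    "expanding_prev": expanding_prev,
    "rolling_prev": rolling_prev,
}))
```

The first row has no history, so both averages are NaN there; from the third row on, the expanding column uses every earlier game while the rolling column uses only the last two.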

In [None]:
df[df['Team1'] == 'DAL'].head()

Unnamed: 0.1,index,Unnamed: 0,ID,Team1Won,Season,Date,Team1,Team2,Home,Team1Pts,...,Team1RushAtt,Team2RushAtt,Team1RushYds,Team2RushYds,Team1RYM,Team2RYM,Team1PYM,Team2PYM,Team1YM,Team2YM
6,6,6,1459,False,2022,2022-09-11,DAL,TAM,1,3,...,18,33,71,152,-81,81,-22,22,-103,103
47,47,47,1383,True,2022,2022-09-26,DAL,NYG,0,23,...,30,25,176,167,9,-9,46,-46,55,-55
53,53,53,1359,True,2022,2022-10-02,DAL,WAS,1,25,...,29,27,62,142,-80,80,62,-62,-18,18
74,74,74,1341,True,2022,2022-10-09,DAL,LAR,0,22,...,34,15,163,38,125,-125,-209,209,-84,84
88,88,88,1309,False,2022,2022-10-16,DAL,PHI,0,17,...,26,39,134,136,-2,2,49,-49,47,-47


In [None]:
# Compute the expanding averages over previous games. 

numeric_columns = ['Home',
				'Team1Pts',    
				'Team2Pts',    
				'Team1PtDiff',
				'Team2PtDiff', 
				'Team1TM',    
				'Team2TM',     
				'Team1Rating', 
				'Team2Rating', 
				'Team1Sks',    
				'Team2Sks',    
				'Team1SkYds',  
				'Team2SkYds',  
				'Team1RushAtt',
				'Team2RushAtt',
				'Team1RushYds',
				'Team2RushYds',
				'Team1RYM',    
				'Team2RYM',    
				'Team1PYM',    
				'Team2PYM',    
				'Team1YM',     
				'Team2YM']

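for column in numeric_columns:
	avg_col_name = column + '_avg'
	# Mean of each team's earlier games only: shift(1) excludes the current game.
	# transform keeps the original row index, so no index reset is needed.
	df[avg_col_name] = (
		df.groupby('Team1')[column]
		.transform(lambda group: group.expanding().mean().shift(1))
	)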

In [None]:
df[df['Team1'] == 'DAL'].head()

Unnamed: 0.1,index,Unnamed: 0,ID,Team1Won,Season,Date,Team1,Team2,Home,Team1Pts,...,Team1RushAtt_avg,Team2RushAtt_avg,Team1RushYds_avg,Team2RushYds_avg,Team1RYM_avg,Team2RYM_avg,Team1PYM_avg,Team2PYM_avg,Team1YM_avg,Team2YM_avg
6,6,6,1459,False,2022,2022-09-11,DAL,TAM,1,3,...,,,,,,,,,,
47,47,47,1383,True,2022,2022-09-26,DAL,NYG,0,23,...,18.0,33.0,71.0,152.0,-81.0,81.0,-22.0,22.0,-103.0,103.0
53,53,53,1359,True,2022,2022-10-02,DAL,WAS,1,25,...,24.0,29.0,123.5,159.5,-36.0,36.0,12.0,-12.0,-24.0,24.0
74,74,74,1341,True,2022,2022-10-09,DAL,LAR,0,22,...,25.666667,28.333333,103.0,153.666667,-50.666667,50.666667,28.666667,-28.666667,-22.0,22.0
88,88,88,1309,False,2022,2022-10-16,DAL,PHI,0,17,...,27.75,25.0,118.0,124.75,-6.75,6.75,-30.75,30.75,-37.5,37.5


In [None]:
# TODO: Should we drop each team's first row or impute it with its original values?
# Note: imputing fills the first game's averages with that game's own post-game stats.

for column in numeric_columns:
	avg_col_name = column + '_avg'
	df[avg_col_name] = df[avg_col_name].fillna(df[column])

df[df['Team1'] == 'DAL'].head()

Unnamed: 0.1,index,Unnamed: 0,ID,Team1Won,Season,Date,Team1,Team2,Home,Team1Pts,...,Team1RushAtt_avg,Team2RushAtt_avg,Team1RushYds_avg,Team2RushYds_avg,Team1RYM_avg,Team2RYM_avg,Team1PYM_avg,Team2PYM_avg,Team1YM_avg,Team2YM_avg
6,6,6,1459,False,2022,2022-09-11,DAL,TAM,1,3,...,18.0,33.0,71.0,152.0,-81.0,81.0,-22.0,22.0,-103.0,103.0
47,47,47,1383,True,2022,2022-09-26,DAL,NYG,0,23,...,18.0,33.0,71.0,152.0,-81.0,81.0,-22.0,22.0,-103.0,103.0
53,53,53,1359,True,2022,2022-10-02,DAL,WAS,1,25,...,24.0,29.0,123.5,159.5,-36.0,36.0,12.0,-12.0,-24.0,24.0
74,74,74,1341,True,2022,2022-10-09,DAL,LAR,0,22,...,25.666667,28.333333,103.0,153.666667,-50.666667,50.666667,28.666667,-28.666667,-22.0,22.0
88,88,88,1309,False,2022,2022-10-16,DAL,PHI,0,17,...,27.75,25.0,118.0,124.75,-6.75,6.75,-30.75,30.75,-37.5,37.5
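
The TODO above leaves open whether first-game rows should be dropped rather than imputed. As a hedged sketch of the drop alternative (the toy frame and the names `toy` and `toy_dropped` are hypothetical), rows without any prior history can be removed with `dropna`:

```python
import pandas as pd

# Toy frame: two teams, games already in chronological order
toy = pd.DataFrame({
    "Team1": ["DAL", "NYG", "DAL", "NYG", "DAL"],
    "Team1Pts": [3, 21, 23, 19, 25],
})

# Expanding mean of previous games per team (NaN for each team's first game)
toy["Team1Pts_avg"] = toy.groupby("Team1")["Team1Pts"].transform(
    lambda s: s.expanding().mean().shift(1)
)

# Alternative to imputation: drop the rows that have no prior history
toy_dropped = toy.dropna(subset=["Team1Pts_avg"]).reset_index(drop=True)
print(toy_dropped)
```

Dropping costs one row per team but keeps every `_avg` feature strictly pre-game; whether that trade-off is worthwhile here is a judgment call.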


Let's hold out a test set for the final evaluation. $\newline$
Because the dataframe is already sorted in chronological order, we can simply split it by position: every training game precedes every held-out game, so data leakage from future games will **not** occur.
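
The chronological hold-out can be sketched on a toy frame as follows (dates, labels, and the names `games`, `train_toy`, `test_toy` are illustrative assumptions). The assertion checks the property the split relies on.

```python
import pandas as pd

# Toy games, already sorted by date (hypothetical values)
games = pd.DataFrame({
    "Date": pd.to_datetime([
        "2022-09-11", "2022-09-18", "2022-09-25", "2022-10-02", "2022-10-09",
    ]),
    "Team1Won": [0, 1, 1, 0, 1],
})

# First 80% of rows for training, last 20% held out
split_idx = int(0.8 * len(games))
train_toy, test_toy = games.iloc[:split_idx], games.iloc[split_idx:]

# No training game may be dated after a held-out game
assert train_toy["Date"].max() <= test_toy["Date"].min()
print(len(train_toy), len(test_toy))
```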

In [None]:
import numpy as np

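# Chronological 80/20 split by position; plain iloc slicing avoids calling
# np.split on a DataFrame, which recent pandas versions warn about.
split_idx = int(0.8 * len(df))
train_set, test_set = df.iloc[:split_idx], df.iloc[split_idx:]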

Let's do cross validation and training!

The function below sets up training and testing data for each fold of a time series split. It relies on two column groups: `train_post_game_cols` holds post-game statistics such as points and rushing yards, while `test_post_game_cols` holds the same statistics with `_avg` appended, i.e. the expanding averages computed earlier. Given a fold (a pair of training and testing indices) and a dataframe, the function splits the dataframe by position and extracts features and outcome labels separately for training (`X_train`, `y_train`) and testing (`X_test`, `y_test`). Note the asymmetry: the model is trained on the actual post-game statistics but evaluated on the pre-game averages, since only the averages are known before a game is played.

In [None]:
# Prepare train and test sets for each fold of TimeSeriesSplit

pre_game_cols = ['Team1', 'Team2', 'Home']
train_post_game_cols = ['Team1Pts', 'Team2Pts', 'Team1RushYds', 'Team2RushYds', 'Team1SkYds', 'Team2SkYds',
                  'Team1Sks', 'Team2Sks', 'Team1RushAtt', 'Team2RushAtt', 'Team1RYM', 'Team2RYM', 
                  'Team1PYM', 'Team2PYM', 'Team1YM', 'Team2YM', 'Team1Rating', 'Team2Rating']

test_post_game_cols = [col + '_avg' for col in train_post_game_cols]

outcome_col = 'Team1Won'

def prep_data_for_fold(fold, df):
  
  # Split data into training and testing based on the fold
  train_indices, test_indices = fold
  train_data = df.iloc[train_indices]
  test_data = df.iloc[test_indices]

  # Extract features that will be trained and tested on
  X_train = train_data[train_post_game_cols]
  X_test = test_data[test_post_game_cols]
  
  # Class labels from fold split
  y_train = train_data[outcome_col]
  y_test = test_data[outcome_col]
  
  return X_train, X_test, y_train, y_test

The function below estimates how well a model performs on the NFL time series dataset using nested cross-validation. The outer loop is a `TimeSeriesSplit` with 5 splits, which preserves the chronological order of the games. For each fold, `prep_data_for_fold` prepares the training and testing features and labels, and the test columns ending in `_avg` are renamed to match their training counterparts. A `StandardScaler` is fitted on the training set only and then applied to both the training and testing data. A `GridSearchCV` then searches the training set for the best hyperparameters, and the resulting best estimator predicts the test set of that fold; its accuracy is stored. After all folds, the function prints the best model found on the final fold and returns the accuracy averaged over the folds.
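
To see what the outer folds look like, here is a small sketch on twelve placeholder samples (the array `X` and the sample count are assumptions chosen for readability). Each fold trains on a growing prefix and tests on the block that immediately follows it.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Twelve placeholder samples in chronological order
X = np.arange(12).reshape(-1, 1)

tscv = TimeSeriesSplit(n_splits=5)
folds = [
    (train_idx.tolist(), test_idx.tolist())
    for train_idx, test_idx in tscv.split(X)
]

for train_idx, test_idx in folds:
    print("train:", train_idx, "test:", test_idx)
```

Because the test indices always come after the training indices, no fold is ever evaluated on games that precede its training data.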

In [None]:
from sklearn.model_selection import TimeSeriesSplit, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score

# TODO: PCA???
# TODO: Run SequentialFeatureSelector to only use best features?

# Nested cross-validation loop that computes how well a model does.
# Return an accuracy score averaged over all folds of the TimeSeriesSplit.

best_models = []

def get_model_accuracy(model, params, df):

  tscv = TimeSeriesSplit(n_splits=5)

  accuracies = []

  # Outer loop: find average accuracy over all time series splits

  for train_indices, test_indices in tscv.split(df):

    # Prepare the data for this fold
    X_train, X_test, y_train, y_test = prep_data_for_fold((train_indices, test_indices), df)

    # Rename the <col>_avg columns to just <col> so StandardScaler sees the same feature names it was fitted on
    X_test = X_test.rename(columns=lambda x: x[:-4] if x.endswith('_avg') else x)

    # Scale the data (fit scaler on training data and transform both train and test)
    scaler = StandardScaler()
    X_train = scaler.fit_transform(X_train)
    X_test = scaler.transform(X_test)

    # Inner loop: find the best hyperparameters for this split
    grid_search = GridSearchCV(estimator=model, param_grid=params, cv=tscv)
    grid_search.fit(X_train, y_train)

    # Get the best model from grid search
    best_model = grid_search.best_estimator_

    # Analyze how well model does by comparing its predictions to actual class labels
    y_pred = best_model.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    accuracies.append(accuracy)
    
  # Runs once, after the loop: keep the best model from the final (largest) fold
  best_models.append(best_model)
  
  print(best_model)

  # Return the average accuracy across all outer folds
  return sum(accuracies) / len(accuracies)

In [None]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
import warnings

warnings.filterwarnings("ignore")

models_dict = {}

dtc = DecisionTreeClassifier()
dtc_params = {
  'max_depth': [5, 10, 15, 20],
  'max_features': [5, 10, 15],
  'min_samples_leaf': [5, 10, 15, 20]
}

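lr = LogisticRegression()
# Only valid solver/penalty pairs are searched: liblinear supports l1 and l2,
# saga also supports no penalty. 'elasticnet' is left out because it
# additionally requires l1_ratio, so those fits could never succeed.
lr_params = [
	{'solver': ['liblinear'], 'penalty': ['l1', 'l2'],
	 'C': [0.1, 1, 10], 'max_iter': [100, 200]},
	{'solver': ['saga'], 'penalty': ['l1', 'l2', None],
	 'C': [0.1, 1, 10], 'max_iter': [100, 200]}
]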

rfc = RandomForestClassifier()
rfc_params = {
  'n_estimators': [50, 100, 200],
  'max_depth': [None, 10, 20],
  'min_samples_split': [2, 5, 10],
  'min_samples_leaf': [1, 2, 4]
}

gbc = GradientBoostingClassifier()
gbc_params = {
  'n_estimators': [50, 100, 200],
  'learning_rate': [0.01, 0.1, 0.2],
  'max_depth': [3, 5, 7],
  'subsample': [0.8, 1.0]
}

knn = KNeighborsClassifier()
knn_params = {
  'n_neighbors': [3, 5, 10],
  'weights': ['uniform', 'distance'],
  'p': [1, 2]  # 1 = Manhattan distance, 2 = Euclidean distance
}

svc = SVC()
svc_params = {
  'C': [0.1, 1, 10],
  'kernel': ['linear', 'rbf', 'poly'],
  'gamma': ['scale', 'auto'],
  'degree': [2, 3, 4]  # Only for 'poly' kernel
}

models_dict[dtc] = dtc_params
models_dict[lr] = lr_params
# models_dict[rfc] = rfc_params
# models_dict[gbc] = gbc_params
models_dict[knn] = knn_params
models_dict[svc] = svc_params

for model in models_dict.keys():
  score = get_model_accuracy(model, models_dict[model], df)
  print(score)

# TODO: Handle <col>_avg NaN for each team's first game of the season
# LogisticRegression() does not accept missing values encoded as NaN natively.

DecisionTreeClassifier(max_depth=15, max_features=15, min_samples_leaf=5)
0.5626016260162602
LogisticRegression(C=0.1, penalty='l1', solver='saga')
0.6097560975609756
KNeighborsClassifier(n_neighbors=10, weights='distance')
0.5772357723577235
SVC(C=1, degree=2, kernel='linear')
0.6032520325203252


Let's do a final evaluation of the model on the held out test set.

In [None]:
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score

# Let's say that LogisticRegression(C=0.1, penalty='l1', solver='saga') is the best model.

lr = LogisticRegression(C=0.1, solver='saga', penalty='l1')

for model in best_models:
	X_train = train_set[train_post_game_cols]
	X_test = test_set[test_post_game_cols].rename(columns=lambda x: x[:-4] if x.endswith('_avg') else x)
	y_train = train_set[outcome_col]
	y_test = test_set[outcome_col]

	scaler = StandardScaler()
	X_train = scaler.fit_transform(X_train)
	X_test = scaler.transform(X_test)

	model.fit(X_train, y_train)
	y_pred = model.predict(X_test)
	accuracy = accuracy_score(y_test, y_pred)

	print("Accuracy of model " + str(model) + ": " + str(accuracy))

Accuracy of model DecisionTreeClassifier(max_depth=15, max_features=15, min_samples_leaf=5): 0.527027027027027
Accuracy of model LogisticRegression(C=0.1, penalty='l1', solver='saga'): 0.581081081081081
Accuracy of model KNeighborsClassifier(n_neighbors=10, weights='distance'): 0.581081081081081
Accuracy of model SVC(C=1, degree=2, kernel='linear'): 0.5675675675675675


In [None]:
from sklearn.metrics import confusion_matrix

# y_pred holds the predictions of the last model fitted in the loop above (the SVC).
# sklearn orders labels as [False, True], so the matrix flattens to TN, FP, FN, TP.
conf_matrix = confusion_matrix(y_test, y_pred)
TN, FP, FN, TP = conf_matrix.ravel()

In [None]:
def print_confusion_matrix(TP, FN, FP, TN):
    table_data = [[TP,FN],[FP,TN]]
    cm_df = pd.DataFrame(table_data, columns=['Predicted 1','Predicted 0'])
    cm_df = cm_df.rename(index={0: 'Actual 1', 1: 'Actual 0'})
    display(cm_df)

In [None]:
print_confusion_matrix(TP, FN, FP, TN)

Unnamed: 0,Predicted 1,Predicted 0
Actual 1,43,38
Actual 0,26,41
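
As a quick sanity check, the diagonal of the table above should reproduce one of the accuracies printed earlier. The sketch below simply redoes that arithmetic with the four cell counts copied from the table (variable names are illustrative).

```python
# Cell counts copied from the confusion-matrix table
diagonal = 41 + 43       # correctly classified games
off_diagonal = 26 + 38   # misclassified games

accuracy = diagonal / (diagonal + off_diagonal)
print(round(accuracy, 4))
```

The result, 84 correct out of 148 games, agrees with the accuracy reported for the SVC, which is the last model fitted in the evaluation loop and therefore the one whose predictions `y_pred` still holds.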
