## **Project Introduction**

The goal of this project is to predict which NFL team will win a match based on their recent historical performance. This is a classification problem, where we aim to forecast the outcome (win or loss) of a game using historical team data and performance metrics.

Predicting NFL game outcomes benefits:
- **Coaches** by refining strategies,
- **Analysts** by providing insights,
- **Betting companies** by setting better odds, and
- **Fans** by boosting engagement.

By analyzing historical performance data, we can uncover patterns that inform decision-making for team preparation, betting strategies, and fan interactions. This improves outcomes for all stakeholders involved, from optimizing team tactics to generating content for fans.


In [603]:
import pandas as pd

In [605]:
# Read in base data from csv
df = pd.read_csv('nfl.csv')
df.head(5)

In [604]:
# Display all rows and columns
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)

### **Data Cleaning & Exploration**
Here is a general overview of the steps we took during data cleaning & exploration.

- **Handle missing/invalid values**: Fill in or remove missing/invalid data to maintain dataset integrity.
- **Standardize formats & data types & units**: Ensure consistency in date formats, data types, and measurement units.
- **Remove unnecessary columns**: Drop irrelevant columns to simplify the dataset.
- **Anomaly detection & filter outliers**: Identify and manage extreme values that may distort analysis.
- **Correlation analysis**: Analyze relationships between variables to identify strong correlations that inform feature selection.


#### **Handle Missing/Invalid Values**
To check for missing values, we count the number of null values in each column.
- The data is quite complete: only the 'Away' column contains null values.
- On closer inspection of the 'Away' column, the NaN value denotes that the current team is at home, and the @ value denotes that the current team is away.
- Therefore, we do not need to handle any missing values.

Upon further data inspection, we found that some games ended in a tie.
- We treat a tie as an invalid result, since we are only trying to predict a win or a loss.
- Therefore, all rows with a tie result are removed.
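
The two checks above can be illustrated on a small toy frame. This is only a sketch: the `Result` strings below are hypothetical placeholders, and the only assumption taken from the notebook is that the first character of `Result` is 'W', 'L', or 'T'.

```python
import numpy as np
import pandas as pd

# Toy data: NaN in 'Away' means "at home", '@' means "away".
# The Result strings are hypothetical; only their first character matters here.
toy = pd.DataFrame({
    "Away": [np.nan, "@", np.nan],
    "Result": ["W 34-31", "L 31-34", "T 20-20"],
})

# Count nulls per column and keep only the columns that have any
null_counts = toy.isnull().sum()
columns_with_nulls = null_counts[null_counts > 0]

# Drop rows whose result starts with 'T' (ties)
no_ties = toy[~toy["Result"].str[0].isin(["T"])]
```

Here the nulls in 'Away' are informative rather than missing, which is why the notebook leaves them in place until the column is recoded.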

In [606]:
# Handle missing values

null_counts = df.isnull().sum()

if (null_counts > 0).any():
    print(null_counts[null_counts > 0])

In [607]:
# Handle invalid values

df = df[~df['Result'].str[0].isin(['T'])]

#### **Standardize Formats, Data Types, & Units**

Before completing the other data cleaning steps, we standardize the data to make comparisons easier. This includes:
- Checking that each column has a consistent type.
  - Our check found that each column had a consistent type except 'Away', which we handle in the next step.
- Standardizing non-numeric / categorical data.
  - Here we print out any columns without the 'int64' or 'float64' type.
  - Then we go through each column and convert it to a standardized type.
  - As a final check, we print the data type of each column and double-check by inspecting the data.
- Confirming consistent units.
  - Since the source of the data provides units, they are expected to be consistent.
  - However, this is further checked by visual inspection of the data and of any outliers/anomalies that are detected.
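
As a small worked example of the time standardization, the sketch below converts a time-of-possession string to seconds. It assumes the raw values are formatted as 'MM:SS'; the helper name `mmss_to_seconds` and the sample string are illustrative only.

```python
def mmss_to_seconds(time_str: str) -> int:
    """Convert a 'MM:SS' string to a total number of seconds."""
    minutes, seconds = map(int, time_str.split(":"))
    return minutes * 60 + seconds

# 36 minutes and 6 seconds of possession
example_seconds = mmss_to_seconds("36:06")
```

Storing the value as an integer number of seconds lets it be compared and correlated like any other numeric column.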

In [608]:
# Check that each column has a consistent type
for column in df.columns:
  if df[column].map(type).nunique() > 1:
    print(f"Column '{column}' contains mixed types.")

In [609]:
# Standardizing non-numeric data
non_numeric_columns = df.select_dtypes(exclude=['int64', 'float64'])
print(non_numeric_columns.dtypes)

In [610]:
# Standardizing non-numeric data
df['Date'] = pd.to_datetime(df['Date'])

def convert_to_seconds(time_str):
    minutes, seconds = map(int, time_str.split(':'))
    return minutes * 60 + seconds
df['ToP'] = df['ToP'].apply(convert_to_seconds)
df['ToP.1'] = df['ToP.1'].apply(convert_to_seconds)

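# NaN in 'Away' means the team is at home, so after this line 1 = home and 0 = away
df['Away'] = df['Away'].isna().astype(int)

# Keep only the leading component of 'Time' (the part before ':')
def convert_to_minutes(time_str):
    return int(time_str.split(':')[0])
df['Time'] = df['Time'].apply(convert_to_minutes)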

df['Result'] = df['Result'].str[0].map({'W': 1, 'L': 0})

In [611]:
# Standardizing non-numeric data
non_numeric_columns = df.select_dtypes(exclude=['int64', 'float64'])
print(non_numeric_columns.dtypes)

In [612]:
# Standardizing non-numeric data
df.head(5)

#### **Remove Unnecessary Columns**

To simplify our dataset, we check for unnecessary columns. These can include:
- Columns containing irrelevant information
  - Upon inspection of the data, we can see that the 'Rk' column served as an index column.
  - Therefore, the 'Rk' column contains information irrelevant to our model and is removed.
- Columns with constant data
  - No columns were found to have constant data.
- Columns with excessive missing data
  - Since we found no missing values, we do not need to account for this case.
- Columns with identical data (we will keep the first column alphabetically for each duplicate)
  - Column 'Att' is a duplicate of column 'Rush_Att'
  - Column 'Att.1' is a duplicate of column 'Opp_Rush_Att'
  - Column 'DY/P' is a duplicate of column 'DY/P.1'
  - Column 'Pts' is a duplicate of column 'Pts.1'
  - Column 'PtsO' is a duplicate of column 'PtsO.1'
  - Column 'Rate' is a duplicate of column 'Rate.2'
  - Column 'Rate.1' is a duplicate of column 'Rate.3'
  - Column 'TO' is a duplicate of column 'TO.2'
  - Column 'ToP' is a duplicate of column 'ToP.1'
  - Column 'Y/P' is a duplicate of column 'Y/P.1'
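
As an alternative to comparing every pair of columns, duplicates can also be found by transposing the frame and using `DataFrame.duplicated`. This is a minimal sketch on toy data, not the notebook's method; the toy values are illustrative.

```python
import pandas as pd

# Toy frame in which 'Rush_Att' repeats 'Att' exactly
toy = pd.DataFrame({
    "Att": [34, 24],
    "Rush_Att": [34, 24],
    "Pts": [34, 31],
})

# After transposing, each column becomes a row, so duplicated() marks
# every column that repeats an earlier one (the first occurrence is kept)
dup_mask = toy.T.duplicated(keep="first")
duplicate_names = dup_mask[dup_mask].index.tolist()

# Keep only the columns that are not flagged as duplicates
deduped = toy.loc[:, ~dup_mask.values]
```

Note that this keeps the first occurrence in column order, which may differ from keeping the first column alphabetically.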

In [613]:
# Columns containing irrelevant information
df.drop('Rk', axis=1, inplace=True)
df.head(5)

In [614]:
# Columns with constant data
constant_columns = df.columns[df.nunique() == 1]
print('Columns with constant data:', constant_columns)

In [615]:
# Columns with identical data
duplicate_columns = {}
seen_pairs = set()

for col1 in df.columns:
  for col2 in df.columns:
    if col1 != col2 and df[col1].equals(df[col2]):
      if (col1, col2) not in seen_pairs and (col2, col1) not in seen_pairs:
        duplicate_columns[col1] = col2
        seen_pairs.add((col1, col2))

sorted_duplicate_columns = sorted(duplicate_columns.items())

for col, duplicate in sorted_duplicate_columns:
  print(f"Column '{col}' is a duplicate of column '{duplicate}'")

duplicated_columns_list = list(set(duplicate_columns.keys()).union(duplicate_columns.values()))
print(df[duplicated_columns_list].head(5).T.sort_index(axis=0))


In [616]:
# Columns with identical data
for col1, duplicate in duplicate_columns.items():
  if duplicate in df.columns:
    df.drop(duplicate, axis=1, inplace=True)

df.head(5)

#### **Anomaly Detection & Filter Outliers**

Anomaly detection is essential to identify and address outliers or errors in the dataset.
This helps to ensure that the analysis remains accurate and the model is not influenced by misleading or irrelevant data points. Note that this is also partly data exploration.
We followed the steps below:

- Plot distributions for numeric data
	- Before we search for anomalies, it is good to visualize how the data is distributed
	- Additionally, note that some of the numeric columns are inherently categorical or discrete rather than continuous, so they are excluded from the plots:
		- Result
		- Season
		- Time
		- Week
		- G#
- Using the plots, we can inspect the types of distributions
	- We classified only Yds.1 and Yds.3 as "exponential"
	- Everything else is considered to have a "normal" distribution, even if severely skewed
- By separately calculating the Z-scores for the normal features and for the log-transformed exponential features, and combining them at the end:
	- We get an anomaly rate of 15.60% for a Z-score threshold of 3 and 2.92% for a Z-score threshold of 4
	- Printing the anomalies for a Z-score threshold of 4, we see games with still relatively normal stats
	- Therefore, we concluded that, due to the chaotic and variable nature of sports, especially football, we cannot confidently determine which games should be considered anomalies, so no rows are removed
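
To make the thresholding step concrete, here is a minimal sketch on toy data. A row is flagged when any of its features has an absolute Z-score above the threshold. The helper name `flag_anomalies` and all toy values are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from scipy.stats import zscore

def flag_anomalies(frame: pd.DataFrame, threshold: float) -> pd.Series:
    """Flag rows where any column's |Z-score| exceeds the threshold."""
    z = frame.apply(zscore)
    return (np.abs(z) > threshold).any(axis=1)

# Toy data: the last row of 'stat_a' is far from the rest
toy = pd.DataFrame({
    "stat_a": [10] * 20 + [100],
    "stat_b": [1, 2] * 10 + [1],
})
flags_3 = flag_anomalies(toy, threshold=3)

# For a skewed feature, apply log1p first, then the same Z-score rule
skewed = pd.DataFrame({"sack_yds": [0, 0, 3, 7, 22, 0, 5, 9]})
flags_skewed = flag_anomalies(np.log1p(skewed), threshold=3)
```

Because a row is flagged if any single feature is extreme, the flagged share grows with the number of features, which is one reason a threshold of 3 can flag a large fraction of games.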







In [617]:
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from scipy import stats

In [618]:
exclude_columns = ['Result', 'Season', 'Time', 'Week', 'G#']

def get_numeric_columns(df, exclude_columns):
    return sorted([col for col in df.select_dtypes(include=['float64', 'int64']).columns if col not in exclude_columns])

numeric_columns = get_numeric_columns(df, exclude_columns)

In [619]:
def fit_distributions(data):
	norm_params = stats.norm.fit(data) # Normal distribution
	
	exp_params = stats.expon.fit(data) # Exponential distribution
	
	if (data > 0).all():
		lognorm_params = stats.lognorm.fit(data) # Log-Normal distribution (if all > 0)
		return norm_params, exp_params, lognorm_params, True
	else:
		return norm_params, exp_params, None, False

In [620]:
num_cols = 4
num_rows = int(np.ceil(len(numeric_columns) / num_cols))
plt.figure(figsize=(num_cols * 4, num_rows * 4))

for i, col in enumerate(numeric_columns, 1):
	data = df[col]
	norm_params, exp_params, lognorm_params, plot_lognorm = fit_distributions(data)
	plt.subplot(num_rows, num_cols, i)
	sns.histplot(data, kde=True, bins=20, color='blue', stat="density")
	xmin, xmax = plt.xlim()
	x = np.linspace(xmin, xmax, 100)

	p_norm = stats.norm.pdf(x, *norm_params) # Normal distribution
	plt.plot(x, p_norm, 'k-', label='Normal fit')

	p_exp = stats.expon.pdf(x, *exp_params) # Exponential distribution
	plt.plot(x, p_exp, 'r-', label='Exponential fit')

	if plot_lognorm:
		p_lognorm = stats.lognorm.pdf(x, *lognorm_params) # Log-Normal distribution (if all > 0)
		plt.plot(x, p_lognorm, 'g-', label='Log-Normal fit')

	plt.title(f'Distribution of {col}')
	plt.legend()

plt.tight_layout()
plt.show()


In [621]:
import pandas as pd
import numpy as np
from scipy.stats import zscore
from sklearn.preprocessing import StandardScaler


In [622]:
numeric_columns = df.select_dtypes(include=[np.number]).columns
exponential_features = ['Yds.1', 'Yds.3']
normal_features = [col for col in numeric_columns if col not in exponential_features and col not in ['Result', 'Season', 'Time', 'Week', 'G#']]


normal_data = df[normal_features]
exponential_data = df[exponential_features]

normal_data_standardized = normal_data.apply(zscore)
exponential_data_log = np.log1p(exponential_data)
exponential_data_log_standardized = exponential_data_log.apply(zscore)

normal_anomalies_3 = (np.abs(normal_data_standardized) > 3).any(axis=1)
normal_anomalies_4 = (np.abs(normal_data_standardized) > 4).any(axis=1)
exponential_anomalies_3 = (np.abs(exponential_data_log_standardized) > 3).any(axis=1)
exponential_anomalies_4 = (np.abs(exponential_data_log_standardized) > 4).any(axis=1)
final_anomalies_3 = normal_anomalies_3 | exponential_anomalies_3
final_anomalies_4 = normal_anomalies_4 | exponential_anomalies_4

print(f"Anomalies detected (threshold = 3): {final_anomalies_3.mean() * 100:.2f}%")
print(f"Anomalies detected (threshold = 4): {final_anomalies_4.mean() * 100:.2f}%")

In [623]:
plt.hist(normal_data_standardized.values.flatten(), bins=50, alpha=0.7, label="Normal Features")
plt.hist(exponential_data_log_standardized.values.flatten(), bins=50, alpha=0.7, label="Exponential Features")
plt.title("Distribution of Z-scores")
plt.xlabel("Z-score")
plt.ylabel("Frequency")
plt.legend()
plt.show()

In [624]:
temp_df = df.copy()
temp_df['Anomaly'] = final_anomalies_4.astype(int)
anomalous_rows = temp_df[temp_df['Anomaly'] == 1]
anomalous_rows.head(10)


#### **Correlation Analysis**

Correlation analysis helps to guide our feature engineering. It is useful for detecting irrelevant information and reducing dimensionality. To do so, we did the following:
- Correlation of each column with 'Result'
	- 'Result' only had a high correlation with PtDif, which makes sense as that's what determines the winner
	- Inc and Int don't seem to be correlated with 'Result'
- Correlation of each column with every other column (threshold = 0.7)
	- Several stats are very highly correlated with each other, mostly due to how they are calculated
	- 'G#' and 'Week': expected due to both being time-based, but they don't seem to be highly correlated with any stat or with the result
	- Many of the *Y/A stats are highly correlated, since they are mostly calculated from the total yards
	- 'Rate' also seems to be highly correlated with multiple other stats, also due to how the rate is calculated
	- There is also a pattern of an overall stat and its % counterpart being highly correlated, which may make one of them redundant
- Correlation of each column with 'Y/P'
	- NY/A, ANY/A, AY/A, Y/A, Y/C, and Yds are all highly correlated with 'Y/P'
- Correlation of each column with 'DY/P'
	- 'DY/P' does not seem to be highly correlated with any stat
- Correlation of each column with 'Rate'
	- 'Rate' is decently correlated with stats related to TDs, completions, yards, and interceptions, which are used to calculate it
- Correlation of each column with 'Yds'
	- 'Yds' is highly correlated with stats such as Cmp and Tot
- Correlation of each column with 'Yds.1'
	- Yards lost to sacks is strongly correlated with the number of sacks, 'Sk'
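
The pairwise step can be sketched compactly on toy data. This version uses `np.triu(..., k=1)` so that the diagonal (self-correlations) is excluded in the same step. The function name `strong_pairs` and the toy columns are illustrative assumptions.

```python
import numpy as np
import pandas as pd

def strong_pairs(frame: pd.DataFrame, threshold: float = 0.7) -> pd.DataFrame:
    """Return each feature pair once, keeping only |correlation| > threshold."""
    corr = frame.corr()
    # k=1 keeps only the entries strictly above the diagonal
    upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(upper).stack().reset_index()
    pairs.columns = ["Feature 1", "Feature 2", "Correlation"]
    # NaN entries (if any survive stacking) fail this comparison and are dropped
    return pairs[pairs["Correlation"].abs() > threshold].reset_index(drop=True)

# 'b' is a linear function of 'a'; 'c' is uncorrelated with both
toy = pd.DataFrame({
    "a": [1, 2, 3, 4, 5],
    "b": [3, 5, 7, 9, 11],
    "c": [3, 1, 2, 1, 3],
})
result = strong_pairs(toy)
```

In this toy case only the pair ('a', 'b') is returned, mirroring how a stat derived from another stat shows up as a strong correlation.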

In [625]:
numeric_columns = df.select_dtypes(include=[np.number]).columns
numeric_data = df[numeric_columns]

In [626]:
# Step 1: Compute the correlation matrix
correlation_matrix = numeric_data.corr()

# Step 2: Build a mask that keeps only the upper triangle (diagonal included), so each pair appears once (A vs B, not also B vs A)
import numpy as np
mask = np.triu(np.ones_like(correlation_matrix, dtype=bool))

# Step 3: Extract pairs of features with strong correlations (e.g., |correlation| > 0.7)
threshold = 0.7
strong_correlations = correlation_matrix.where(mask).stack().reset_index()
strong_correlations.columns = ['Feature 1', 'Feature 2', 'Correlation']

# Step 4: Filter strong correlations above the threshold (both positive and negative)
strong_correlations = strong_correlations[strong_correlations['Correlation'].abs() > threshold]

# Step 5: Remove self-correlations (Feature 1 and Feature 2 being the same)
strong_correlations = strong_correlations[strong_correlations['Feature 1'] != strong_correlations['Feature 2']]

# Step 6: Sort the correlations by absolute value
strong_correlations['abs_correlation'] = strong_correlations['Correlation'].abs()
strong_correlations_sorted = strong_correlations.sort_values(by='abs_correlation', ascending=False).reset_index(drop=True)

# Step 7: Display the result
print(strong_correlations_sorted[['Feature 1', 'Feature 2', 'Correlation']])


In [627]:
def correlations_with_feature(feature_name):
	correlations = numeric_data.corrwith(df[feature_name])
	correlations = correlations.reset_index()

	correlations.columns = ['Feature', 'Correlation with ' + feature_name]
	correlations['abs_correlation'] = correlations['Correlation with ' + feature_name].abs()

	correlations_sorted = correlations.sort_values(by='abs_correlation', ascending=False).reset_index(drop=True)
	print(correlations_sorted[['Feature', 'Correlation with ' + feature_name]])

In [628]:
correlations_with_feature('Result')

In [629]:
correlations_with_feature('Week')

In [630]:
correlations_with_feature('G#')

In [631]:
correlations_with_feature('Y/P')

In [632]:
correlations_with_feature('DY/P')

In [633]:
correlations_with_feature('Rate')

In [None]:
correlations_with_feature('Yds.1')

### **Feature Selection & Engineering**

Feature selection and engineering help to identify the most relevant variables (reducing dimensionality) and to transform raw data into informative features, improving model accuracy, reducing complexity, and preventing overfitting.

#### **Feature Selection**

For our feature selection, we used both the data from the correlation analysis as well as logic and previous experience with the sport. 

The general method is to train the model on each game with the format of<br>
[General Game Data | Team 1 Stats| Team 2 Stats | Win/Loss]

Since each game appears twice in the dataset (once mirrored for each team), we can take the team 1 data for each game and then merge it with the mirrored row.
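
One way to combine the mirrored rows is a self-merge on the game date and the opponent. The sketch below is an assumption about how such a merge could look, not the notebook's implementation. The `Team2Pts` column name is hypothetical, and the toy values are taken from the DET/GNB sample rows.

```python
import pandas as pd

# Two mirrored rows for the same game
games = pd.DataFrame({
    "Date": ["2024-12-05", "2024-12-05"],
    "Team": ["DET", "GNB"],
    "Opp": ["GNB", "DET"],
    "Pts": [34, 31],
    "Result": [1, 0],
})

# View each row from the opponent's side: its 'Team' becomes the join key 'Opp'
team2 = games[["Date", "Team", "Pts"]].rename(
    columns={"Team": "Opp", "Pts": "Team2Pts"}
)

# Attach the opponent's stats to each team's row
merged = games.rename(columns={"Pts": "Team1Pts"}).merge(
    team2, on=["Date", "Opp"], how="inner"
)
```

Joining on both the date and the opponent keeps the match unambiguous as long as a team plays at most one game per date.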

General Game Data:
- Season
- Date
- Team
- Opp
- Away

Win/Loss:
- Result

Team 1 Stats Included:
- Pts: Generally important for a win
- PtDif: How much they won/lost by
- TO: Generally important for a win
- Rate: Summarizes many important stats for a game (NFL rating)
- Y/P: Generally important for a win, strongly correlated to other stats, serves as a summary
- DY/P: Generally important for a win
- ToP: Generally important for a win
- Sk: Generally detrimental to a win
- Yds.1: Sacked yards: Generally detrimental to a win
- Att: Rushing attempts: Generally a positive sign of performance
- Rush_Yds: Generally important for a win

Team 1 Stats Excluded:
- Rate.1: A different rating system to the NFL rating, removed for consistency
- NY/A: Strongly correlated to Y/P, redundant
- ANY/A: Strongly correlated to Y/P, redundant
- AY/A: Strongly correlated to Y/P, redundant
- Y/A: Strongly correlated to Y/P, redundant
- Y/C: Strongly correlated to Y/P, redundant
- Cmp: Strongly correlated to Yds, redundant
- Inc: Reflected in Y/P
- Int: Reflected in DY/P
- Rush_Y/A: Reflected in Att and Rush_Yds
- TD: Reflected in Pts
- Yds: Reflected in Y/P
- Att.2: Reflected in Y/P

Team 2 Stats (All Excluded):
- Att.3
- Cmp%.1
- Cmp.1
- Int.1
- Sk.1
- Tot.1
- TD.2
- TD.3
- Y/A.1
- Yds.2
- Yds.3
- Opp_Rush_Yds
- Att.1
- TD.1
- TO.1
- PtsO

Excluded:
- Time: Irrelevant
- Week: Irrelevant
- G#: Irrelevant
- Day: Irrelevant
- Cmp%: Strongly correlated to Cmp, redundant
- Int%: Strongly correlated to Int, redundant
- Sk%: Strongly correlated to Sk, redundant
- TD%: Strongly correlated to TD, redundant
- Ply: Irrelevant, total for both teams
- DPly: Irrelevant, total for both teams
- PC: Irrelevant

Therefore, in the end we have:

General Game Data:
- Season -> Season
- Date -> Date
- Team -> Team1
- Opp -> Team2
- Away -> Home

Win/Loss:
- Result -> Result

Team 1 Stats Included:
- Pts -> Team1Pts
- PtDif -> Team1PtDiff
- TO -> Team1TM
- Rate -> Team1Rating
- Y/P -> Team1Y/P
- DY/P -> Team1DY/P
- ToP -> Team1ToP
- Sk -> Team1Sks
- Yds.1 -> Team1SkYds
- Att -> Team1RushAtt
- Rush_Yds -> Team1RushYds

In [645]:
import pandas as pd

# Map each selected column to its new name (identity entries keep the original name)
selected_columns = {
    'Season': 'Season',
    'Date': 'Date',
    'Team': 'Team1',
    'Opp': 'Team2',
    'Away': 'Home',  # 'Away' was recoded earlier so that 1 means the team is at home
    'Pts': 'Team1Pts',
    'PtDif': 'Team1PtDiff',
    'TO': 'Team1TM',
    'Rate': 'Team1Rating',
    'Y/P': 'Team1Y/P',
    'DY/P': 'Team1DY/P',
    'ToP': 'Team1ToP',
    'Sk': 'Team1Sks',
    'Yds.1': 'Team1SkYds',
    'Att': 'Team1RushAtt',
    'Rush_Yds': 'Team1RushYds',
    'Result': 'Result',
}

# Keep only the selected columns, in this order, and rename them
new_df = df[list(selected_columns)].copy().rename(columns=selected_columns)

new_df.head(5)

In [644]:
df.head(5)

Unnamed: 0,Team,Date,Season,Pts,PtsO,Rate,TO,Y/P,DY/P,ToP,Rate.1,Att,Att.1,Day,G#,Week,Away,Opp,Result,PtDif,PC,Cmp,Att.2,Inc,Cmp%,Yds,TD,Int,TD%,Int%,Sk,Yds.1,Sk%,Y/A,NY/A,AY/A,ANY/A,Y/C,Rush_Yds,Rush_Y/A,TD.1,Tot,Ply,DPly,TO.1,Time,Cmp.1,Att.3,Cmp%.1,Yds.2,TD.2,Sk.1,Yds.3,Int.1,Opp_Rush_Yds,Y/A.1,TD.3,Rush,Pass,Tot.1
0,DET,2024-12-05,2024,34,31,109.7,0,5.14,6.62,2166,110.2,34,24,Thu,13,14,1,GNB,1,3,65,32,41,9,78.0,280,3,1,7.3,2.4,1,3,2.38,6.8,6.67,7.2,7.02,8.8,111,3.3,1,391,76,45,1,3,12,20,60.0,199,1,1,7,0,99,4.1,3,12,81,93
1,GNB,2024-12-05,2024,31,34,111.7,0,6.62,5.14,1434,109.3,24,34,Thu,13,14,0,DET,0,-3,65,12,20,8,60.0,199,1,0,5.0,0.0,1,7,4.76,10.0,9.48,10.95,10.43,16.6,99,4.1,3,298,45,76,1,3,32,41,78.0,280,3,1,3,1,111,3.3,1,-12,-81,-93
2,DEN,2024-12-02,2024,41,32,65.7,1,6.56,6.57,1670,86.5,26,23,Mon,13,13,1,CLE,1,9,73,18,35,17,51.4,294,1,2,2.9,5.7,0,0,0.0,8.4,8.4,6.4,6.4,16.3,106,4.1,2,400,61,84,2,3,34,58,58.6,475,4,3,22,3,77,3.3,0,29,-181,-152
3,CLE,2024-12-02,2024,32,41,88.1,-1,6.57,6.56,1930,65.7,23,26,Mon,12,13,0,DEN,0,-9,73,34,58,24,58.6,475,4,3,6.9,5.2,3,22,4.92,8.2,7.79,7.24,6.89,14.0,77,3.3,0,552,84,61,3,3,18,35,51.4,294,1,0,0,2,106,4.1,2,-29,181,152
4,JAX,2024-12-01,2024,20,23,83.0,-1,5.57,5.34,1663,92.5,25,25,Sun,12,13,1,HOU,0,-3,43,24,42,18,57.1,276,2,1,4.8,2.4,0,0,0.0,6.6,6.57,6.45,6.45,11.5,97,3.9,0,373,67,61,1,3,22,34,64.7,218,1,2,24,0,108,4.3,1,-11,58,47


#### **Feature Engineering**

Currently, each row only describes the stats of each team. Besides PtDif, it does not capture the difference between the two teams. Therefore, we introduce 3 new features:
- Pass: passing margin
- Rush: rushing margin
- Yardage: yardage margin

These help tell the 'story' of how close the match really was.
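
A minimal sketch of how these margins could be computed is shown below. It assumes that 'Yds' and 'Yds.2' hold the team's and the opponent's passing yards, and that each margin is a simple difference. This mapping is inferred from the sample rows rather than stated in the notebook.

```python
import pandas as pd

# Values taken from the DET/GNB sample rows
sample = pd.DataFrame({
    "Team": ["DET", "GNB"],
    "Yds": [280, 199],           # assumed: team passing yards
    "Yds.2": [199, 280],         # assumed: opponent passing yards
    "Rush_Yds": [111, 99],
    "Opp_Rush_Yds": [99, 111],
})

# Margin = team's value minus the opponent's value
sample["Pass"] = sample["Yds"] - sample["Yds.2"]
sample["Rush"] = sample["Rush_Yds"] - sample["Opp_Rush_Yds"]
sample["Yardage"] = sample["Pass"] + sample["Rush"]
```

With these inputs the margins for the two mirrored rows are equal in magnitude and opposite in sign, as expected for the two sides of the same game.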

In [None]:
# Columns with identical data
duplicate_columns = {}
seen_pairs = set()

for col1 in df.columns:
  for col2 in df.columns:
    if col1 != col2 and df[col1].equals(df[col2]):
      if (col1, col2) not in seen_pairs and (col2, col1) not in seen_pairs:
        duplicate_columns[col1] = col2
        seen_pairs.add((col1, col2))

sorted_duplicate_columns = sorted(duplicate_columns.items())

for col, duplicate in sorted_duplicate_columns:
  print(f"Column '{col}' is a duplicate of column '{duplicate}'")

duplicated_columns_list = list(set(duplicate_columns.keys()).union(duplicate_columns.values()))
print(df[duplicated_columns_list].head(5).T.sort_index(axis=0))


Column 'Att' is a duplicate of column 'Rush_Att'
Column 'Att.1' is a duplicate of column 'Opp_Rush_Att'
Column 'DY/P' is a duplicate of column 'DY/P.1'
Column 'Pts' is a duplicate of column 'Pts.1'
Column 'PtsO' is a duplicate of column 'PtsO.1'
Column 'Rate' is a duplicate of column 'Rate.2'
Column 'Rate.1' is a duplicate of column 'Rate.3'
Column 'TO' is a duplicate of column 'TO.2'
Column 'ToP' is a duplicate of column 'ToP.1'
Column 'Y/P' is a duplicate of column 'Y/P.1'
                    0        1        2        3        4
Att             34.00    24.00    26.00    23.00    25.00
Att.1           24.00    34.00    23.00    26.00    25.00
DY/P             6.62     5.14     6.57     6.56     5.34
DY/P.1           6.62     5.14     6.57     6.56     5.34
Opp_Rush_Att    24.00    34.00    23.00    26.00    25.00
Pts             34.00    31.00    41.00    32.00    20.00
Pts.1           34.00    31.00    41.00    32.00    20.00
PtsO            31.00    34.00    32.00    41.00    23.0