## Project Introduction

The goal of this project is to predict which NFL team will win a game based on each team's recent performance. This is a classification problem: we aim to forecast the outcome (win or loss) of a game using team data and performance metrics.

Predicting NFL game outcomes benefits:
- **Coaches** by refining strategies,
- **Analysts** by providing insights,
- **Betting companies** by setting better odds, and
- **Fans** by boosting engagement.

By analyzing historical performance data, we can uncover patterns that inform decision-making for team preparation, betting strategies, and fan interactions. This improves outcomes for all stakeholders involved, from optimizing team tactics to generating content for fans.


In [54]:
import pandas as pd

In [81]:
# Display all rows and columns
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)

In [80]:
# Read in base data from csv
df = pd.read_csv('nfl.csv')
df.head(5)

Unnamed: 0,Rk,Team,Date,Season,Pts,PtsO,Rate,TO,Y/P,DY/P,ToP,Rate.1,Att,Att.1,Day,G#,Week,Away,Opp,Result,Pts.1,PtsO.1,PtDif,PC,Cmp,Att.2,Inc,Cmp%,Yds,TD,Int,TD%,Int%,Rate.2,Sk,Yds.1,Sk%,Y/A,NY/A,AY/A,ANY/A,Y/C,Rush_Att,Rush_Yds,Rush_Y/A,TD.1,Tot,Ply,Y/P.1,DPly,DY/P.1,TO.1,ToP.1,Time,Cmp.1,Att.3,Cmp%.1,Yds.2,TD.2,Sk.1,Yds.3,Int.1,Rate.3,Opp_Rush_Att,Opp_Rush_Yds,Y/A.1,TD.3,Rush,Pass,Tot.1,TO.2
0,1,DET,2024-12-05,2024,34,31,109.7,0,5.14,6.62,36:06,110.2,34,24,Thu,13,14,,GNB,W 34-31,34,31,3,65,32,41,9,78.0,280,3,1,7.3,2.4,109.7,1,3,2.38,6.8,6.67,7.2,7.02,8.8,34,111,3.3,1,391,76,5.14,45,6.62,1,36:06,3:12,12,20,60.0,199,1,1,7,0,110.2,24,99,4.1,3,12,81,93,0
1,2,GNB,2024-12-05,2024,31,34,111.7,0,6.62,5.14,23:54,109.3,24,34,Thu,13,14,@,DET,L 31-34,31,34,-3,65,12,20,8,60.0,199,1,0,5.0,0.0,111.7,1,7,4.76,10.0,9.48,10.95,10.43,16.6,24,99,4.1,3,298,45,6.62,76,5.14,1,23:54,3:12,32,41,78.0,280,3,1,3,1,109.3,34,111,3.3,1,-12,-81,-93,0
2,3,DEN,2024-12-02,2024,41,32,65.7,1,6.56,6.57,27:50,86.5,26,23,Mon,13,13,,CLE,W 41-32,41,32,9,73,18,35,17,51.4,294,1,2,2.9,5.7,65.7,0,0,0.0,8.4,8.4,6.4,6.4,16.3,26,106,4.1,2,400,61,6.56,84,6.57,2,27:50,3:29,34,58,58.6,475,4,3,22,3,86.5,23,77,3.3,0,29,-181,-152,1
3,4,CLE,2024-12-02,2024,32,41,88.1,-1,6.57,6.56,32:10,65.7,23,26,Mon,12,13,@,DEN,L 32-41,32,41,-9,73,34,58,24,58.6,475,4,3,6.9,5.2,88.1,3,22,4.92,8.2,7.79,7.24,6.89,14.0,23,77,3.3,0,552,84,6.57,61,6.56,3,32:10,3:29,18,35,51.4,294,1,0,0,2,65.7,26,106,4.1,2,-29,181,152,-1
4,5,JAX,2024-12-01,2024,20,23,83.0,-1,5.57,5.34,27:43,92.5,25,25,Sun,12,13,,HOU,L 20-23,20,23,-3,43,24,42,18,57.1,276,2,1,4.8,2.4,83.0,0,0,0.0,6.6,6.57,6.45,6.45,11.5,25,97,3.9,0,373,67,5.57,61,5.34,1,27:43,3:18,22,34,64.7,218,1,2,24,0,92.5,25,108,4.3,1,-11,58,47,-1


### Data Cleaning
Here is a general overview of the steps we took during data cleaning:

- **Handle missing values**: Fill in or remove missing data to maintain dataset integrity.
- **Standardize formats, data types, and units**: Ensure consistency in date formats, data types, and measurement units.
- **Remove duplicates**: Eliminate repeated rows to prevent skewed analysis.
- **Remove unnecessary columns**: Drop irrelevant columns to simplify the dataset.
- **Handle categorical data**: Convert categorical variables into a suitable format for analysis.
- **Filter outliers**: Identify and manage extreme values that may distort analysis.


#### Handle Missing Values
In this step, we count the number of null values in each column.
The data is complete, with no null values except in the 'Away' column (739 nulls).
Further inspection of the 'Away' column shows that NaN denotes that the current team is at home, while '@' denotes that the current team is away.
These NaN values are therefore meaningful rather than missing, so we do not need to handle any missing values.

In [82]:
# Check for null values
null_counts = df.isnull().sum()
print(null_counts)

Rk                0
Team              0
Date              0
Season            0
Pts               0
PtsO              0
Rate              0
TO                0
Y/P               0
DY/P              0
ToP               0
Rate.1            0
Att               0
Att.1             0
Day               0
G#                0
Week              0
Away            739
Opp               0
Result            0
Pts.1             0
PtsO.1            0
PtDif             0
PC                0
Cmp               0
Att.2             0
Inc               0
Cmp%              0
Yds               0
TD                0
Int               0
TD%               0
Int%              0
Rate.2            0
Sk                0
Yds.1             0
Sk%               0
Y/A               0
NY/A              0
AY/A              0
ANY/A             0
Y/C               0
Rush_Att          0
Rush_Yds          0
Rush_Y/A          0
TD.1              0
Tot               0
Ply               0
Y/P.1             0
DPly              0
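
One way to confirm how the 'Away' column encodes home and away games is to count every distinct value, including NaN. The sketch below uses a small hypothetical frame (`toy`) in place of `nfl.csv`; only the column names and the NaN/'@' convention come from the data above.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame mimicking the 'Away' column: NaN = home, '@' = away
toy = pd.DataFrame({
    "Team": ["DET", "GNB", "DEN", "CLE"],
    "Away": [np.nan, "@", np.nan, "@"],
})

# Count every distinct value; dropna=False keeps NaN as its own category
away_counts = toy["Away"].value_counts(dropna=False)
print(away_counts)
```

If NaN and '@' are the only two values, and they occur equally often, the NaN entries can be read as "home game" rather than as missing data.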


#### Standardize Formats, Data Types, & Units

Before we complete the other data cleaning steps, we standardize the data to make comparisons easier. This includes:
- Checking that each column has a consistent type.
  - Our check found that each column had a consistent type except 'Away', which we handle in the next step.
- Standardizing non-numeric data.
  - Here we print out any columns without the 'int64' or 'float64' type.
  - Then we go through each of those columns and convert it to a standardized type.
  - As a final check, we print the data type of each column and double-check by inspecting the data.
- Confirming consistent units.
  - Since the source of the data provides units, they are expected to be consistent.
  - However, we verify this further through a visual inspection of the data and of any outliers or anomalies that are detected.

In [83]:
# Check that each column has a consistent type
for column in df.columns:
  if df[column].map(type).nunique() > 1:
    print(f"Column '{column}' contains mixed types.")

Column 'Away' contains mixed types.


In [84]:
# Standardizing non-numeric data
non_numeric_columns = df.select_dtypes(exclude=['int64', 'float64'])
print(non_numeric_columns.dtypes)

Team      object
Date      object
ToP       object
Day       object
Away      object
Opp       object
Result    object
ToP.1     object
Time      object
dtype: object
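
A minimal sketch of how the object columns listed above could be converted to standardized types. The frame `sample`, the helper `clock_to_seconds`, and the new column names (`ToP_sec`, `Time_min`, `is_away`, `Win`, `OT`) are hypothetical. Reading 'Time' as hours:minutes is also an assumption, not something confirmed by the data source.

```python
import numpy as np
import pandas as pd

# Hypothetical sample using the same string formats as the object columns
sample = pd.DataFrame({
    "Date": ["2024-12-05", "2024-12-01"],
    "ToP": ["36:06", "39:31"],
    "Time": ["3:12", "3:38"],
    "Away": [np.nan, "@"],
    "Result": ["W 34-31", "W 26-23 (OT)"],
})

def clock_to_seconds(col: pd.Series) -> pd.Series:
    """Convert 'A:B' strings to the integer A * 60 + B."""
    parts = col.str.split(":", expand=True).astype(int)
    return parts[0] * 60 + parts[1]

# Dates become proper datetime values
sample["Date"] = pd.to_datetime(sample["Date"], format="%Y-%m-%d")

# Time of possession (minutes:seconds) becomes total seconds
sample["ToP_sec"] = clock_to_seconds(sample["ToP"])

# Game duration, assumed to be hours:minutes, becomes total minutes
sample["Time_min"] = clock_to_seconds(sample["Time"])

# '@' marks an away game; NaN (home) compares as False
sample["is_away"] = sample["Away"].eq("@")

# Split 'Result' into a win flag and an overtime flag
sample["Win"] = sample["Result"].str.startswith("W")
sample["OT"] = sample["Result"].str.contains("(OT)", regex=False)

print(sample.dtypes)
```

Note that under this encoding any result not starting with 'W' (including a tie, if one appears in the data) would be labeled as a non-win.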


In [85]:
# Standardizing non-numeric data
print(non_numeric_columns)

     Team        Date    ToP  Day Away  Opp        Result  ToP.1  Time
0     DET  2024-12-05  36:06  Thu  NaN  GNB       W 34-31  36:06  3:12
1     GNB  2024-12-05  23:54  Thu    @  DET       L 31-34  23:54  3:12
2     DEN  2024-12-02  27:50  Mon  NaN  CLE       W 41-32  27:50  3:29
3     CLE  2024-12-02  32:10  Mon    @  DEN       L 32-41  32:10  3:29
4     JAX  2024-12-01  27:43  Sun  NaN  HOU       L 20-23  27:43  3:18
5     CIN  2024-12-01  28:51  Sun  NaN  PIT       L 38-44  28:51  3:20
6     TAM  2024-12-01  39:31  Sun    @  CAR  W 26-23 (OT)  39:31  3:38
7     SFO  2024-12-01  26:33  Sun    @  BUF       L 10-35  26:33  2:46
8     LAR  2024-12-01  27:23  Sun    @  NOR       W 21-14  27:23  2:53
9     ATL  2024-12-01  35:55  Sun  NaN  LAC       L 13-17  35:55  2:44
10    IND  2024-12-01  25:48  Sun    @  NWE       W 25-24  25:48  2:59
11    NWE  2024-12-01  34:12  Sun  NaN  IND       L 24-25  34:12  2:59
12    NYJ  2024-12-01  28:39  Sun  NaN  SEA       L 21-26  28:39  3:06
13    

In [None]:
# Standardizing non-numeric data

#### Remove Unnecessary Columns

To simplify our dataset, we check whether it contains any unnecessary columns. These can include:
- Columns containing irrelevant information
  - Upon inspection of the data, we can see that the 'Rk' column served as an index column.
  - Therefore, the 'Rk' column carries no information relevant to our model and can be removed.
- Columns with constant data
  - The check for constant columns flags only the 'Away' column, because `nunique()` ignores NaN and '@' is the column's only non-null value.
  - Since this column effectively acts as a boolean (home vs. away), we do not remove it.
- Columns with excessive missing data
  - Since we found no missing values that need handling, we do not need to account for this case.
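- Columns with identical data
  - The check below reports several pairs of identical columns, such as 'Pts' and 'Pts.1' or 'Att' and 'Rush_Att'.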

In [58]:
df.drop('Rk', axis=1, inplace=True)
df.head(5)

Unnamed: 0,Team,Date,Season,Pts,PtsO,Rate,TO,Y/P,DY/P,ToP,Rate.1,Att,Att.1,Day,G#,Week,Away,Opp,Result,Pts.1,PtsO.1,PtDif,PC,Cmp,Att.2,Inc,Cmp%,Yds,TD,Int,TD%,Int%,Rate.2,Sk,Yds.1,Sk%,Y/A,NY/A,AY/A,ANY/A,Y/C,Rush_Att,Rush_Yds,Rush_Y/A,TD.1,Tot,Ply,Y/P.1,DPly,DY/P.1,TO.1,ToP.1,Time,Cmp.1,Att.3,Cmp%.1,Yds.2,TD.2,Sk.1,Yds.3,Int.1,Rate.3,Opp_Rush_Att,Opp_Rush_Yds,Y/A.1,TD.3,Rush,Pass,Tot.1,TO.2
0,DET,2024-12-05,2024,34,31,109.7,0,5.14,6.62,36:06,110.2,34,24,Thu,13,14,,GNB,W 34-31,34,31,3,65,32,41,9,78.0,280,3,1,7.3,2.4,109.7,1,3,2.38,6.8,6.67,7.2,7.02,8.8,34,111,3.3,1,391,76,5.14,45,6.62,1,36:06,3:12,12,20,60.0,199,1,1,7,0,110.2,24,99,4.1,3,12,81,93,0
1,GNB,2024-12-05,2024,31,34,111.7,0,6.62,5.14,23:54,109.3,24,34,Thu,13,14,@,DET,L 31-34,31,34,-3,65,12,20,8,60.0,199,1,0,5.0,0.0,111.7,1,7,4.76,10.0,9.48,10.95,10.43,16.6,24,99,4.1,3,298,45,6.62,76,5.14,1,23:54,3:12,32,41,78.0,280,3,1,3,1,109.3,34,111,3.3,1,-12,-81,-93,0
2,DEN,2024-12-02,2024,41,32,65.7,1,6.56,6.57,27:50,86.5,26,23,Mon,13,13,,CLE,W 41-32,41,32,9,73,18,35,17,51.4,294,1,2,2.9,5.7,65.7,0,0,0.0,8.4,8.4,6.4,6.4,16.3,26,106,4.1,2,400,61,6.56,84,6.57,2,27:50,3:29,34,58,58.6,475,4,3,22,3,86.5,23,77,3.3,0,29,-181,-152,1
3,CLE,2024-12-02,2024,32,41,88.1,-1,6.57,6.56,32:10,65.7,23,26,Mon,12,13,@,DEN,L 32-41,32,41,-9,73,34,58,24,58.6,475,4,3,6.9,5.2,88.1,3,22,4.92,8.2,7.79,7.24,6.89,14.0,23,77,3.3,0,552,84,6.57,61,6.56,3,32:10,3:29,18,35,51.4,294,1,0,0,2,65.7,26,106,4.1,2,-29,181,152,-1
4,JAX,2024-12-01,2024,20,23,83.0,-1,5.57,5.34,27:43,92.5,25,25,Sun,12,13,,HOU,L 20-23,20,23,-3,43,24,42,18,57.1,276,2,1,4.8,2.4,83.0,0,0,0.0,6.6,6.57,6.45,6.45,11.5,25,97,3.9,0,373,67,5.57,61,5.34,1,27:43,3:18,22,34,64.7,218,1,2,24,0,92.5,25,108,4.3,1,-11,58,47,-1


In [59]:
constant_columns = df.columns[df.nunique() == 1]
print('Columns with constant data:', constant_columns)

Columns with constant data: Index(['Away'], dtype='object')
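
'Away' is reported as constant because `nunique()` skips NaN by default. A small sketch of the difference, using a hypothetical frame (`toy`) with a hypothetical, truly constant 'League' column for contrast:

```python
import numpy as np
import pandas as pd

# Hypothetical frame: 'Away' mixes NaN and '@', 'League' is truly constant
toy = pd.DataFrame({
    "Away": [np.nan, "@", np.nan, "@"],
    "League": ["NFL", "NFL", "NFL", "NFL"],
    "Pts": [34, 31, 41, 32],
})

# Default behavior: NaN is ignored, so 'Away' appears to have one value
ignoring_nan = toy.columns[toy.nunique() == 1].tolist()

# Counting NaN as a value: 'Away' now has two values and is not flagged
counting_nan = toy.columns[toy.nunique(dropna=False) == 1].tolist()

print(ignoring_nan)
print(counting_nan)
```

Passing `dropna=False` is one way to keep meaningful NaN values from producing false positives in this check.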


In [71]:
# Alternative: the approach below also finds duplicate columns,
# but it does not report which column each one is a duplicate of.
# duplicate_columns = df.columns[df.T.duplicated()]
# print("Columns with identical data:", duplicate_columns)

duplicate_columns = {}

for col1 in df.columns:
    for col2 in df.columns:
        if col1 != col2 and df[col1].equals(df[col2]):
            if col1 not in duplicate_columns:
                duplicate_columns[col1] = col2

for col, duplicate in duplicate_columns.items():
    print(f"Column '{col}' is a duplicate of column '{duplicate}'")
    
df.head(10).T.sort_index(axis=0)

Column 'Pts' is a duplicate of column 'Pts.1'
Column 'PtsO' is a duplicate of column 'PtsO.1'
Column 'Rate' is a duplicate of column 'Rate.2'
Column 'TO' is a duplicate of column 'TO.2'
Column 'Y/P' is a duplicate of column 'Y/P.1'
Column 'DY/P' is a duplicate of column 'DY/P.1'
Column 'ToP' is a duplicate of column 'ToP.1'
Column 'Rate.1' is a duplicate of column 'Rate.3'
Column 'Att' is a duplicate of column 'Rush_Att'
Column 'Att.1' is a duplicate of column 'Opp_Rush_Att'
Column 'Pts.1' is a duplicate of column 'Pts'
Column 'PtsO.1' is a duplicate of column 'PtsO'
Column 'Rate.2' is a duplicate of column 'Rate'
Column 'Rush_Att' is a duplicate of column 'Att'
Column 'Y/P.1' is a duplicate of column 'Y/P'
Column 'DY/P.1' is a duplicate of column 'DY/P'
Column 'ToP.1' is a duplicate of column 'ToP'
Column 'Rate.3' is a duplicate of column 'Rate.1'
Column 'Opp_Rush_Att' is a duplicate of column 'Att.1'
Column 'TO.2' is a duplicate of column 'TO'


Unnamed: 0,0,1,2,3,4,5,6,7,8,9
ANY/A,7.02,10.43,6.4,6.89,6.45,7.07,3.66,4.3,7.92,1.35
AY/A,7.2,10.95,6.4,7.24,6.45,7.82,4.09,4.78,8.58,1.38
Att,34,24,26,23,25,15,39,27,29,37
Att.1,24,34,23,26,25,26,21,38,31,17
Att.2,41,20,35,58,42,38,34,18,24,39
Att.3,20,41,58,35,34,38,46,18,37,24
Away,,@,,@,,,@,@,@,
Cmp,32,12,18,34,24,28,22,11,14,24
Cmp%,78.0,60.0,51.4,58.6,57.1,73.7,64.7,61.1,58.3,61.5
Cmp%.1,60.0,78.0,58.6,51.4,64.7,76.3,56.5,77.8,64.9,70.8


#### Remove Duplicates
Inspecting the dataset shows that there are repeated entries for each game.
More specifically, each game has two entries with:
- 'Team': A and 'Opp': B
- 'Team': B and 'Opp': A

This can be seen in rows 0 and 1 of the previous head() output.
Further inspection of the CSV file confirms that this holds in general.

To remove duplicates, we can use the fact that no team plays more than one
game on any given day. Therefore, for each day, we can identify duplicate
rows by comparing the set of team names ('Team' and 'Opp') in each row. If two
rows on the same day have equal sets, they describe the same game and are
considered duplicates.
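
The matching rule above can be sketched as follows. The frame `games` and the `matchup` column are hypothetical. The sorted, joined team names stand in for the set comparison, since they give the same key regardless of which team is listed first.

```python
import pandas as pd

# Hypothetical frame: each game appears once per team
games = pd.DataFrame({
    "Team": ["DET", "GNB", "DEN", "CLE"],
    "Opp": ["GNB", "DET", "CLE", "DEN"],
    "Date": ["2024-12-05", "2024-12-05", "2024-12-02", "2024-12-02"],
    "Result": ["W 34-31", "L 31-34", "W 41-32", "L 32-41"],
})

# Order-independent key: 'DET' vs 'GNB' and 'GNB' vs 'DET' both give 'DET|GNB'
games["matchup"] = [
    "|".join(sorted(pair)) for pair in zip(games["Team"], games["Opp"])
]

# Rows sharing a date and a matchup describe the same game
one_per_game = games.drop_duplicates(subset=["Date", "matchup"], keep="first")
print(one_per_game[["Team", "Opp", "Date", "Result"]])
```

Here `keep="first"` is an arbitrary choice of which of the two rows survives.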

One approach would be to simply remove one of the two matches. However,
this could potentially lead to a class imbalance of Win/Loss since the matches
are not ordered 
