# NFL Passing Offense Performance - Exploratory Data Analysis

## Project Overview
**Knowledge Domain:** NFL Passing Offense Performance  
**Single Aspect:** Passing statistics (yards, attempts, touchdowns, interceptions), recorded per player and attributable to teams through the `Team` column  

### Research Questions
1. Which NFL teams have the highest average passing yards per game in a season?
2. Is there a relationship between passing touchdowns and total team wins?
3. How does interception rate affect a team’s overall passing efficiency?
4. Do teams with higher pass attempts per game score more points?
5. How does a team’s passing performance change across multiple seasons?

### Client & Value
**Potential Client:** NFL coaching staffs, offensive coordinators, or sports analysts.  
**Value Provided:** This analysis helps identify which passing metrics most contribute to team success. Coaches and analysts can use the results to evaluate offensive strategies, compare teams, and understand how passing efficiency impacts overall performance.

## Data Source
**Pro-Football-Reference: 2023 NFL Passing Statistics (one row per player)**  
Link: [https://www.pro-football-reference.com/years/2023/passing.htm](https://www.pro-football-reference.com/years/2023/passing.htm)

---

## 1. Import Libraries

In [1]:
import pandas as pd

## 2. Load Data
Loading the dataset from the CSV file. Please ensure `nfl_passing_2023.csv` is in the same directory.

In [2]:
# Load the dataset
df = pd.read_csv('nfl_passing_2023.csv')

# Display the first few rows to verify loading
df.head()

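|  | Rk | Player | Age | Team | Pos | G | GS | QBrec | Cmp | Att | ... | QBR | Sk | Yds.1 | Sk% | NY/A | ANY/A | 4QC | GWD | Awards | Player-additional |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Tua Tagovailoa | 25 | MIA | QB | 17 | 17 | 11-6-0 | 388 | 560 | ... | 61.5 | 29 | 171 | 4.92 | 7.56 | 7.48 | 2 | 2 | PBAP CPoY-5 | TagoTu00 |
| 1 | 2 | Jared Goff | 29 | DET | QB | 17 | 17 | 12-5-0 | 407 | 605 | ... | 61.1 | 30 | 197 | 4.72 | 6.89 | 6.99 | 2 | 3 |  | GoffJa00 |
| 2 | 3 | Dak Prescott | 30 | DAL | QB | 17 | 17 | 12-5-0 | 410 | 590 | ... | 73.4 | 39 | 255 | 6.2 | 6.77 | 7.28 | 2 | 3 | PBAP-2AP MVP-2AP OPoY-5 | PresDa01 |
| 3 | 4 | Josh Allen | 27 | BUF | QB | 17 | 17 | 11-6-0 | 385 | 579 | ... | 70.3 | 24 | 152 | 3.98 | 6.89 | 6.51 | 2 | 4 | AP MVP-5AP OPoY-6 | AlleJo02 |
| 4 | 5 | Brock Purdy | 24 | SFO | QB | 16 | 16 | 12-4-0 | 308 | 444 | ... | 73.4 | 28 | 153 | 5.93 | 8.74 | 9.01 | 0 | 0 | PBAP MVP-4AP OPoY-6AP CPoY-6 | PurdBr00 |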
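
The preview shows one row per player, while several research questions are phrased at team level. A minimal sketch of one way to roll player rows up to teams is shown here. The frame, its values, the team labels, and the `Yds_Per_Game` and `Int_Pct` names are illustrative assumptions, not part of the notebook. Players listed under a combined multi-team label would need separate handling.

```python
import pandas as pd

# Hypothetical player-level rows; the numbers are made up for illustration.
players = pd.DataFrame({
    "Player": ["QB A", "QB B", "QB C"],
    "Team": ["AAA", "AAA", "BBB"],
    "Att": [500, 20, 600],
    "Yds": [4000, 80, 4250],
    "TD": [28, 0, 30],
    "Int": [12, 1, 12],
})

# Sum the counting stats per team, then derive rates from the totals
# (averaging per-player rates would over-weight low-volume passers).
team = players.groupby("Team", as_index=False)[["Att", "Yds", "TD", "Int"]].sum()

GAMES_PER_SEASON = 17  # 2023 regular season length
team["Yds_Per_Game"] = (team["Yds"] / GAMES_PER_SEASON).round(1)
team["Int_Pct"] = (team["Int"] / team["Att"] * 100).round(2)

print(team)
```

Deriving the rates from summed totals keeps a backup's handful of attempts from distorting the team figure.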


## 3. Exploratory Data Analysis (EDA)
We will explore the basic properties of the dataset including its shape, columns, data types, and summary statistics.

### 3.1 Dataset Info
Checking column names, non-null counts, and data types.

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 117 entries, 0 to 116
Data columns (total 34 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Rk                 117 non-null    int64  
 1   Player             117 non-null    object 
 2   Age                117 non-null    int64  
 3   Team               117 non-null    object 
 4   Pos                117 non-null    object 
 5   G                  117 non-null    int64  
 6   GS                 117 non-null    int64  
 7   QBrec              68 non-null     object 
 8   Cmp                117 non-null    int64  
 9   Att                117 non-null    int64  
 10  Cmp%               106 non-null    float64
 11  Yds                117 non-null    int64  
 12  TD                 117 non-null    int64  
 13  TD%                106 non-null    float64
 14  Int                117 non-null    int64  
 15  Int%               106 non-null    float64
 16  1D                 117 non

### 3.2 Summary Statistics
Descriptive statistics for numerical columns.

In [4]:
df.describe()

|  | Rk | Age | G | GS | Cmp | Att | Cmp% | Yds | TD | TD% | ... | Y/G | Rate | QBR | Sk | Yds.1 | Sk% | NY/A | ANY/A | 4QC | GWD |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 117.0 | 117.0 | 117.0 | 117.0 | 117.0 | 117.0 | 106.0 | 117.0 | 117.0 | 106.0 | ... | 117.0 | 106.0 | 106.0 | 117.0 | 117.0 | 117.0 | 117.0 | 117.0 | 117.0 | 117.0 |
| mean | 57.401709 | 27.65812 | 10.444444 | 7.025641 | 103.162393 | 160.102564 | 61.724528 | 1119.923077 | 6.555556 | 6.063208 | ... | 105.801709 | 83.439623 | 39.873585 | 12.307692 | 82.811966 | 15.350684 | 5.253333 | 5.073419 | 0.555556 | 0.735043 |
| std | 33.365546 | 3.959194 | 5.591626 | 6.341518 | 132.096292 | 201.463327 | 25.257406 | 1465.057308 | 9.449908 | 15.692615 | ... | 101.596009 | 26.900195 | 29.498017 | 15.216519 | 104.443181 | 28.315422 | 6.766101 | 9.042257 | 0.91392 | 1.132537 |
| min | 1.0 | 21.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | -7.0 | 0.0 | 0.0 | ... | -0.4 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | -8.0 | -45.0 | 0.0 | 0.0 |
| 25% | 28.0 | 25.0 | 5.0 | 1.0 | 1.0 | 1.0 | 58.45 | 12.0 | 0.0 | 0.0 | ... | 0.8 | 70.725 | 8.275 | 0.0 | 0.0 | 0.0 | 3.9 | 1.86 | 0.0 | 0.0 |
| 50% | 57.0 | 27.0 | 11.0 | 6.0 | 28.0 | 47.0 | 64.2 | 291.0 | 2.0 | 3.1 | ... | 86.0 | 85.35 | 43.45 | 5.0 | 28.0 | 6.2 | 5.44 | 5.27 | 0.0 | 0.0 |
| 75% | 86.0 | 30.0 | 16.0 | 13.0 | 176.0 | 268.0 | 69.375 | 1936.0 | 10.0 | 4.6 | ... | 194.4 | 99.15 | 60.375 | 24.0 | 169.0 | 10.17 | 6.33 | 6.56 | 1.0 | 1.0 |
| max | 115.0 | 40.0 | 17.0 | 17.0 | 410.0 | 612.0 | 100.0 | 4624.0 | 36.0 | 100.0 | ... | 323.2 | 146.8 | 100.0 | 65.0 | 477.0 | 100.0 | 37.0 | 37.0 | 4.0 | 5.0 |


## 4. Data Cleaning
The initial EDA highlighted several issues that need to be addressed before analysis:

1.  **Summary Row**: The source table can include a 'League Average' row (where 'Rk' is NaN), which must be removed if present. In this load `Rk` has no missing values, so the step acts as a safeguard.
2.  **Column Names**: Several columns (`Yds.1`, `Player-additional`, and those with symbols like `%` or `/`) need to be renamed for clarity and ease of use in code.
3.  **Missing Values (NaNs)**: Rate columns such as `Cmp%`, `TD%`, and `Int%` have 106 non-null values out of 117, and `QBrec` has 68. The gaps belong mostly to players with few or zero passing attempts and will be filled.
4.  **Column Splitting**: The `QBrec` column contains 'Wins-Losses-Ties' as a string and needs to be split into three separate numerical columns.
5.  **Data Types**: Count and whole-number columns should be stored as integers; any that load as `float64` are converted.
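
The checks behind this list can be sketched as a small audit function. This is an illustration under stated assumptions: the `audit` helper and the `sample` frame are hypothetical and only mimic the shape of the real data.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the loaded data; values are illustrative only.
sample = pd.DataFrame({
    "Rk": [1.0, 2.0, np.nan],
    "Player": ["Passer A", "Passer B", "League Average"],
    "Att": [560, 0, 300],
    "Cmp%": [69.3, np.nan, 64.0],
    "QBrec": ["11-6-0", None, None],
})

def audit(frame: pd.DataFrame) -> dict:
    """Summarise the kinds of issues the cleaning steps target."""
    null_counts = frame.isna().sum()
    return {
        # Rows without a rank are candidate summary rows.
        "rows_missing_rank": int(frame["Rk"].isna().sum()),
        # Column names containing symbols that are awkward in code.
        "awkward_columns": [c for c in frame.columns
                            if any(ch in c for ch in "%/.-")],
        # Only columns that actually contain NaNs.
        "nulls_per_column": null_counts[null_counts > 0].to_dict(),
        # Float columns, to review for whole-number data.
        "float_columns": frame.select_dtypes(include="float").columns.tolist(),
    }

report = audit(sample)
print(report)
```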

### 4.1 Remove Summary Row
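Any row with a missing 'Rk' (Rank) would be a 'League Average' summary row and is dropped. The identical shapes printed below show that this file contained no such row, so the step changes nothing here.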

In [5]:
# Drop the summary row (where 'Rk' is NaN)
df_cleaned = df.dropna(subset=['Rk']).copy()

# Convert Rk to integer
df_cleaned['Rk'] = df_cleaned['Rk'].astype(int)

print(f"Original shape: {df.shape}")
print(f"New shape after dropping summary row: {df_cleaned.shape}")

Original shape: (117, 34)
New shape after dropping summary row: (117, 34)
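
To see what this step does when a summary row is present, here is a minimal sketch on a hypothetical three-row frame (the player labels and values are made up). Counting the dropped rows makes a no-op easy to spot.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one unranked summary row.
toy = pd.DataFrame({
    "Rk": [1.0, 2.0, np.nan],
    "Player": ["Passer A", "Passer B", "League Average"],
    "Yds": [4000, 300, 1100],
})

# Drop only rows whose rank is missing; other NaNs are left alone.
kept = toy.dropna(subset=["Rk"]).copy()
rows_dropped = len(toy) - len(kept)

# With the NaN gone, the rank column can be stored as an integer.
kept["Rk"] = kept["Rk"].astype("int64")

print(f"Rows dropped: {rows_dropped}")
print(kept)
```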


### 4.2 Standardize Column Names
Renaming columns to be consistent and descriptive.

In [6]:
df_cleaned = df_cleaned.rename(columns={
    'Yds.1': 'Yds_Lost_Sk', # Yards lost to sacks
    'Player-additional': 'Player_ID',
    'Cmp%': 'Cmp_Pct',
    'TD%': 'TD_Pct',
    'Int%': 'Int_Pct',
    'Succ%': 'Succ_Pct',
    'Y/A': 'Yards_Per_Att',
    'AY/A': 'Adj_Yards_Per_Att',
    'Y/C': 'Yards_Per_Cmp',
    'Y/G': 'Yards_Per_Game',
    'Rate': 'Passer_Rating',
    'QBR': 'Total_QBR',
    'Sk%': 'Sk_Pct',
    'NY/A': 'Net_Yards_Per_Att',
    'ANY/A': 'Adj_Net_Yards_Per_Att',
    '4QC': 'Game_Tying_Drives', # Note: on Pro-Football-Reference, 4QC counts fourth-quarter comebacks
    'GWD': 'Game_Winning_Drives'
})
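
`DataFrame.rename` silently ignores mapping keys that are not present in the frame, so a typo in a source column name would go unnoticed. A minimal sketch of a stricter variant uses the `errors="raise"` option; the `toy` frame here is hypothetical.

```python
import pandas as pd

# Hypothetical frame with a few of the awkward source names.
toy = pd.DataFrame({"Cmp%": [69.3], "Y/A": [8.3], "Yds.1": [171]})

mapping = {"Cmp%": "Cmp_Pct", "Y/A": "Yards_Per_Att", "Yds.1": "Yds_Lost_Sk"}

# errors="raise" turns an unmatched key into a KeyError instead of a no-op.
renamed = toy.rename(columns=mapping, errors="raise")
print(renamed.columns.tolist())

# A key that does not exist in the frame is now reported.
try:
    toy.rename(columns={"Succ%": "Succ_Pct"}, errors="raise")
    missing_detected = False
except KeyError:
    missing_detected = True

print(f"Missing column detected: {missing_detected}")
```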

### 4.3 Handle Missing Values and Column Splitting
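NaN values in derived statistics are set to 0.0. These rates are typically undefined for players with no pass attempts, so 0.0 is a placeholder rather than a measured value; filter on `Att` before averaging them. The `QBrec` column is split into `Wins`, `Losses`, and `Ties`, and the original column is dropped.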

In [7]:
# a. Fill NaN in QBrec with '0-0-0' for players without a record, then split
df_cleaned['QBrec'] = df_cleaned['QBrec'].fillna('0-0-0')
df_cleaned[['Wins', 'Losses', 'Ties']] = df_cleaned['QBrec'].str.split('-', expand=True).astype(int)
df_cleaned = df_cleaned.drop('QBrec', axis=1) # Drop original column

# b. Fill NaN in Awards with 'None'
df_cleaned['Awards'] = df_cleaned['Awards'].fillna('None')

# c. Fill NaNs for derived statistics with 0.0
derived_stats = [
    'Cmp_Pct', 'TD_Pct', 'Int_Pct', 'Succ_Pct', 'Lng',
    'Yards_Per_Att', 'Adj_Yards_Per_Att', 'Yards_Per_Cmp',
    'Passer_Rating', 'Total_QBR'
]
df_cleaned[derived_stats] = df_cleaned[derived_stats].fillna(0.0)

print("Remaining NaNs after cleaning (should be zero):")
print(df_cleaned.isnull().sum().sum())

Remaining NaNs after cleaning (should be zero):
0
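
The record split and the zero-attempt case can be seen on a hypothetical three-player frame. This is a sketch, not the notebook's method: it recomputes a completion percentage from `Cmp` and `Att` to show where the NaN comes from, and the `Cmp_Pct_filled` name is an assumption introduced for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical players: a starter, a backup with no starts, a non-passer.
toy = pd.DataFrame({
    "Player": ["Starter", "Backup", "Non-passer"],
    "QBrec": ["11-6-0", None, None],
    "Att": [560, 12, 0],
    "Cmp": [388, 7, 0],
})

# Split 'W-L-T' into three integer columns; a missing record becomes 0-0-0.
record = (toy["QBrec"].fillna("0-0-0")
          .str.split("-", expand=True)
          .astype("int64"))
record.columns = ["Wins", "Losses", "Ties"]
toy = toy.drop(columns="QBrec").join(record)

# A rate with zero attempts is undefined, so it comes out as NaN.
toy["Cmp_Pct"] = (toy["Cmp"] / toy["Att"].replace(0, np.nan) * 100).round(1)

# Filling with 0.0 removes the NaN but makes "no attempts" look like 0%.
toy["Cmp_Pct_filled"] = toy["Cmp_Pct"].fillna(0.0)

print(toy)
```

Keeping both columns side by side shows the trade-off: the filled version is convenient for export, while the NaN version keeps non-passers out of averages.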


### 4.4 Final Data Type Conversion
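Converting all count columns to integer type. `astype(int)` uses the platform's default integer width, which is why the output below reports `int32`; passing `'int64'` instead fixes the width across platforms.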

In [8]:
int_cols = [
    'Rk', 'Age', 'G', 'GS', 'Cmp', 'Att', 'Yds', 'TD', 'Int', '1D',
    'Sk', 'Yds_Lost_Sk', 'Game_Tying_Drives', 'Game_Winning_Drives',
    'Wins', 'Losses', 'Ties'
]
df_cleaned[int_cols] = df_cleaned[int_cols].astype(int)

# Save the cleaned DataFrame for future use
df_cleaned.to_csv('nfl_passing_2023_cleaned.csv', index=False)

# Display final info and head to show cleaned data
# (a bare expression is only rendered when it is the last line of the cell)
df_cleaned.info()
df_cleaned.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 117 entries, 0 to 116
Data columns (total 36 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   Rk                     117 non-null    int32  
 1   Player                 117 non-null    object 
 2   Age                    117 non-null    int32  
 3   Team                   117 non-null    object 
 4   Pos                    117 non-null    object 
 5   G                      117 non-null    int32  
 6   GS                     117 non-null    int32  
 7   Cmp                    117 non-null    int32  
 8   Att                    117 non-null    int32  
 9   Cmp_Pct                117 non-null    float64
 10  Yds                    117 non-null    int32  
 11  TD                     117 non-null    int32  
 12  TD_Pct                 117 non-null    float64
 13  Int                    117 non-null    int32  
 14  Int_Pct                117 non-null    float64
 15  1D    