# Workshop Title: Analyzing NFL Data with Python

## Introduction
Welcome to the "Analyzing NFL Data with Python" workshop! In this workshop, you will learn how to retrieve, process, and analyze NFL play-by-play data for the 2023 season using Python. We will focus on passing and rushing plays, calculate Expected Points Added (EPA), and visualize the data to gain insights.

## Getting Started
Before we begin, make sure you have the required Python packages installed. You can install nfl_data_py, the package that provides the data, with the following command:

In [1]:
!pip install nfl_data_py

Collecting nfl_data_py
  Downloading nfl_data_py-0.3.1.tar.gz (16 kB)
  Preparing metadata (setup.py) ... done
Collecting fastparquet>0.5 (from nfl_data_py)
  Obtaining dependency information for fastparquet>0.5 from https://files.pythonhosted.org/packages/09/ea/6bf8718363e6fc8e204db6ff2bff5a18cc78859ad81a625745d68f5e2ed2/fastparquet-2023.10.1-cp311-cp311-macosx_11_0_arm64.whl.metadata
  Downloading fastparquet-2023.10.1-cp311-cp311-macosx_11_0_arm64.whl.metadata (4.1 kB)
Collecting cramjam>=2.3 (from fastparquet>0.5->nfl_data_py)
  Obtaining dependency information for cramjam>=2.3 from https://files.pythonhosted.org/packages/58/8a/7f8b283bb29713fb9c0d548b9d6cbe2f48da05084bdf721b7aa4a399b639/cramjam-2.7.0-cp311-cp311-macosx_10_9_x86_64.macosx_11_0_arm64.macosx_10_9_universal2.whl.metadata
  Downloading cramjam-2.7.0-cp311-cp311-macosx_10_9_x86_64.macosx_11_0_arm64.macosx_10_9_universal2.whl.metadata (4.0 kB)
Downloading fastparquet-2023.10.1-cp311-cp311-macosx_11_0_arm64.whl (682

Now that you have installed the necessary packages, let's import them:

In [2]:
# Dependencies
import pandas as pd
import nfl_data_py as nfl
import matplotlib.pyplot as plt
from matplotlib import style
from matplotlib.offsetbox import AnnotationBbox, OffsetImage
import os
import urllib.request

## Part 1: Retrieving NFL Play-By-Play Data
We will start by retrieving the NFL play-by-play data for the 2023 season:

In [4]:
# define year variable for 2023 season
year = 2023
# import data for this year
pbp = nfl.import_pbp_data([year])

2023 done.
Downcasting floats.


We can take a look at the size of the dataset and a preview of its structure:

In [5]:
# checking the size of the data
pbp.shape

(23846, 384)

In [None]:
# preview of data
pbp.head()

There are too many columns to display here. We can show them all by using:

In [None]:
# Remove max_columns limits of pandas
pd.set_option('display.max_columns', None)
pbp.head()

In [None]:
# Or directly print list of columns
print(pbp.columns.tolist())

## Part 2: Processing and Calculating EPA

**Expected Points Added (EPA) measures how well a team performs relative to expectation on a particular play. It is the change in expected points between the start and the end of the play, where expected points are estimated from the down, distance, and field position.**

For example, if a team starts a drive on the 50-yard line, its expected points to start the drive would be about 2.5. If the team ends the drive with a field goal, gaining 3 points, its EPA for that drive is the points it actually gained minus its expected points: 3 - 2.5 = **0.5 EPA**. However, if the team scores a 50-yard touchdown, the EPA of the play would be 7 - 2.5 = **4.5 EPA**.
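
The arithmetic in this example can be written out as a short sketch. The function name `simple_epa` is hypothetical, and the 2.5 figure is just the approximate value quoted above, not a number taken from the actual expected-points model.

```python
def simple_epa(points_scored: float, expected_points_before: float) -> float:
    """EPA in the simplified example: actual points minus expected points."""
    return points_scored - expected_points_before


# Approximate expected points for a drive starting at midfield (from the text)
expected_at_midfield = 2.5

field_goal_epa = simple_epa(3, expected_at_midfield)
touchdown_epa = simple_epa(7, expected_at_midfield)

print(field_goal_epa)  # 0.5
print(touchdown_epa)   # 4.5
```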

In this section, we'll filter the data for passing and rushing plays for each team, calculate the average EPA for both, and visualize the results.

First, we filter the data for passing and rushing plays:

In [None]:
# Filter down for passing and rushing plays

# Remove the rows that contain NULL values
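
One possible way to approach this step is sketched below on a small toy table rather than the real data. It assumes the play-by-play table has `play_type`, `epa`, and `posteam` columns; check `pbp.columns` before relying on those names. The variable names `toy_pbp` and `pass_rush` are illustrative only.

```python
import pandas as pd

# Toy stand-in for the play-by-play table (column names are assumptions)
toy_pbp = pd.DataFrame({
    "posteam":   ["KC", "KC", "DET", "DET", "KC", "DET"],
    "play_type": ["pass", "run", "pass", "punt", "pass", "run"],
    "epa":       [0.8, -0.2, 1.1, 0.1, None, 0.4],
})

# Keep only passing and rushing plays
pass_rush = toy_pbp[toy_pbp["play_type"].isin(["pass", "run"])]

# Drop plays that have no EPA value
pass_rush = pass_rush.dropna(subset=["epa"])

print(pass_rush)
```

Restricting `dropna` to the columns you actually need matters on a wide table: calling it with no arguments drops every row that has a missing value in any column.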

Next, we calculate the average EPA for passing plays:

In [None]:
# Select only passing plays

# Group the plays by the offensive team running the play

# Calculate the average passing epa for each team
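
As a hedged sketch of these three steps, the toy example below selects passing plays, groups them by offensive team, and averages EPA. The column names `play_type`, `posteam`, and `epa` are assumptions about the play-by-play table, and the numbers are made up.

```python
import pandas as pd

# Toy plays; values are invented for illustration
toy_plays = pd.DataFrame({
    "posteam":   ["KC", "KC", "DET", "DET", "KC"],
    "play_type": ["pass", "pass", "pass", "pass", "run"],
    "epa":       [0.5, 0.25, 1.0, -0.5, 2.0],
})

# Select only passing plays
passes = toy_plays[toy_plays["play_type"] == "pass"]

# Group by the offensive team and take the mean EPA per team
pass_epa = passes.groupby("posteam")["epa"].mean()

print(pass_epa)
```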

Next, we clean up the table and sort the teams from highest to lowest passing EPA:

In [None]:
# Rename and sort list
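
A minimal sketch of the clean-up, assuming the previous step produced a Series of per-team averages indexed by `posteam`. The output column names `team` and `pass_epa` are choices made for this example, not names required by the library.

```python
import pandas as pd

# Hypothetical per-team averages, shaped like a groupby(...).mean() result
pass_epa = pd.Series({"DET": 0.25, "KC": 0.375, "NYJ": -0.125}, name="epa")
pass_epa.index.name = "posteam"

pass_table = (
    pass_epa.rename("pass_epa")               # name the value column
    .reset_index()                            # turn the team index into a column
    .rename(columns={"posteam": "team"})
    .sort_values("pass_epa", ascending=False) # best passing offense first
    .reset_index(drop=True)
)

print(pass_table)
```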

Now that we have the passing EPA for each team, let's find the rushing EPA for each team:

In [None]:
# Isolate rushing plays

# Group them by team

# Calculate the average

# Rename the column
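
Because the rushing steps mirror the passing ones, one option is to wrap them in a small helper. The function `mean_epa_by_team` below is hypothetical (it is not part of nfl_data_py), and it assumes `play_type`, `posteam`, and `epa` columns with rushing plays labelled `"run"`.

```python
import pandas as pd


def mean_epa_by_team(plays: pd.DataFrame, play_type: str, column: str) -> pd.DataFrame:
    """Average EPA per offensive team for one play type (illustrative helper)."""
    subset = plays[plays["play_type"] == play_type]
    return (
        subset.groupby("posteam")["epa"].mean()
        .rename(column)
        .reset_index()
        .rename(columns={"posteam": "team"})
    )


# Toy plays; values are invented for illustration
toy_plays = pd.DataFrame({
    "posteam":   ["KC", "KC", "DET", "DET", "DET"],
    "play_type": ["run", "run", "run", "pass", "run"],
    "epa":       [0.5, -0.25, 0.75, 1.0, 0.25],
})

rush_epa = mean_epa_by_team(toy_plays, "run", "rush_epa")
print(rush_epa)
```

The same helper could be called with `"pass"` and `"pass_epa"` to produce the passing table.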

Now that we have both the rushing and passing EPA for each team, we can combine the data and download team logos:

In [None]:
# Combine two lists
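
Combining the two tables can be sketched with a merge on the team column. The frames and the column names `team`, `pass_epa`, and `rush_epa` below are made-up stand-ins for the tables built in the previous cells.

```python
import pandas as pd

# Stand-in tables; in the workshop these come from the earlier cells
pass_epa = pd.DataFrame({"team": ["KC", "DET", "NYJ"], "pass_epa": [0.375, 0.25, -0.125]})
rush_epa = pd.DataFrame({"team": ["DET", "KC", "NYJ"], "rush_epa": [0.5, 0.125, -0.25]})

# Merge matches rows by team, so the row order of the inputs does not matter
epa = pd.merge(pass_epa, rush_epa, on="team")

print(epa)
```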

In [None]:
# Fetch Team Logos

In [None]:
# Download team logos
logo_paths = []
team_abbr = []
os.makedirs("logos", exist_ok=True)

# Save each team's logo locally and remember where it was stored
for abbr, url in zip(logos['team_abbr'], logos['team_logo_espn']):
    path = f"logos/{abbr}.tif"
    urllib.request.urlretrieve(url, path)
    logo_paths.append(path)
    team_abbr.append(abbr)

# Create table for team logo and its file path

# Combine logo paths to epa data
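
A hedged sketch of these last two steps: build a small table from the `team_abbr` and `logo_paths` lists, then attach it to the EPA table. The lists and the EPA values here are stand-ins, and the column names `team` and `path` are assumptions made for the example.

```python
import pandas as pd

# Stand-ins for the lists filled by the download loop
team_abbr = ["KC", "DET"]
logo_paths = ["logos/KC.tif", "logos/DET.tif"]

# Table mapping each team to its logo file
logo_table = pd.DataFrame({"team": team_abbr, "path": logo_paths})

# Stand-in for the combined EPA table
epa = pd.DataFrame({"team": ["DET", "KC"], "pass_epa": [0.25, 0.375], "rush_epa": [0.5, 0.125]})

# A left merge keeps exactly the teams that appear in the EPA table
epa_with_logos = epa.merge(logo_table, on="team", how="left")

print(epa_with_logos)
```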


## Part 3: Visualizing the Data
In this part, we'll create a visualization of the data to compare the EPA for passing and rushing plays. We'll use team logos for the plot.

Let's start by defining the chart:

In [67]:
# Define plot size and autolayout


# Set the axes to each epa type


Let's define a function to load the image into the chart:

In [68]:
# Load image into the chart
def getImage(path):
    return OffsetImage(plt.imread(path, format="tif"), zoom=.1)

Putting everything together, we can start adding the points to the chart:

In [None]:
fig, ax = plt.subplots()

# Add points using logo
for x0, y0, path in zip(x, y, paths):
    ab = AnnotationBbox(getImage(path), (x0, y0), frameon=False)
    ax.add_artist(ab)

# Move left y-axis and bottom x-axis to centre, passing through (0,0)

# Eliminate upper and right axes

# Label the chart


## Additional Exploration

The nfl_data_py library also provides more advanced data and metrics, such as Next Gen Stats. We can access the passing data with:

In [70]:
df = nfl.import_ngs_data(stat_type='passing')

In [None]:
df.columns

In [None]:
# Filter down to week = 0, full season data for the year(s) specified
df = df[df['week'] == 0]
df = df[df['season'] == year]
df = df.reset_index()
df

In [None]:
# Calculate the average time to throw and completion % above expectation
average_ttt = df['avg_time_to_throw'].mean()
average_cpae = df['completion_percentage_above_expectation'].mean()

average_cpae

In [None]:
# Visualize the data

# Define plot size and autolayout
plt.rcParams["figure.figsize"] = [20, 14]
plt.rcParams["figure.autolayout"] = True


# Center each metric on the league average and store the result in a DataFrame
xy = pd.DataFrame({
    'x': df['avg_time_to_throw'] - average_ttt,
    'y': df['completion_percentage_above_expectation'] - average_cpae,
})

# Define the plot
fig, ax = plt.subplots()

ax.scatter(xy['x'], xy['y'], s=800, c='blue')

# Move left y-axis and bottom x-axis to centre, passing through (0,0)
ax.spines['left'].set_position('center')
ax.spines['bottom'].set_position('center')

# Eliminate upper and right axes
ax.spines['right'].set_color('none')
ax.spines['top'].set_color('none')

# Set x and y axis limits
plt.xlim((-0.5,0.5))
plt.ylim((-8,8))


# Annotate each point with the QB's name
for name in xy.index:
    plt.annotate(df['player_display_name'][name],
                 (xy['x'][name] + 0.015, xy['y'][name]),
                 fontsize=15)

# Annotate Quadrants
plt.annotate('Lots of time to throw,\nreceivers making great catches', (0.3,6.5), fontsize=18)
plt.annotate('Limited time to throw,\nreceivers making great catches', (-0.45,6.5), fontsize=18)
plt.annotate('Limited time to throw,\nreceivers not making great catches', (-0.45,-6.5), fontsize=18)
plt.annotate('Lots of time to throw,\nreceivers not making great catches', (0.3,-6.5), fontsize=18)
    

# Add a title
plt.title(f'QB Average Time to Throw (s) vs. Completion % Above/Below Expectation, {year}', fontsize=20)

    
# Display the chart
plt.show()

Sources: https://www.youtube.com/watch?v=auyOjPoURRg&ab_channel=MFANS, https://www.youtube.com/watch?v=wWgGgmqijNU&ab_channel=TimBryan
