In [29]:
import seaborn as sns
import pandas as pd 
import re
DATA_ROOT = 'data'

# Big Data Bowl 2022

In this notebook, I practice basic exploratory data analysis (EDA) skills and compare my work against the gold-medal notebooks submitted to the competition. This repo serves as a sandbox for developing and researching EDA concepts.

The goal is to generate actionable, practical, and novel insights from player tracking data that corresponds to special teams play. There are several potential topics cited. These include, but are not limited to:

1. Create a new special teams metric. The winning algorithm from the 2020 Big Data Bowl has been adopted by the NFL/NFL Network for on-air distribution, and we are hopeful that there could be a new stat for special teams plays that could come from this year’s competition.

2. Quantify special teams strategy. Special teams’ coaches are among the most creative and innovative in the league. Compare/contrast how each team game plans. Which strategies yield the best results? What are other strategies that could be adopted?

3. Rank special teams players. Each team employs a variety of players (including long snappers, kickers, punters, and other utility special teams players). How do they stack up with respect to one another?

The above list is not comprehensive, nor is it meant to be a guide for participants to cover. We are open to all special teams related ideas in this year’s competition.

The winning presentation was on creating the optimal path for special teams punt returners to follow.

## Exploratory Data Analysis

### Loading Data and Initial Impressions

First, we read in the data:

In [30]:
games = pd.read_csv(f'{DATA_ROOT}/games.csv')
plays = pd.read_csv(f'{DATA_ROOT}/plays.csv')
players = pd.read_csv(f'{DATA_ROOT}/players.csv')
tracking_2018 = pd.read_csv(f'{DATA_ROOT}/tracking2018.csv')
tracking_2019 = pd.read_csv(f'{DATA_ROOT}/tracking2019.csv')
tracking_2020 = pd.read_csv(f'{DATA_ROOT}/tracking2020.csv')
scouting = pd.read_csv(f'{DATA_ROOT}/PFFScoutingData.csv')

Now, let's get a rough idea of what each dataset looks like. For each one, we also reproduce the column metadata straight from https://www.kaggle.com/competitions/nfl-big-data-bowl-2022/data

#### Game Data

`gameId`: Game identifier, unique (numeric)

`season`: Season of game

`week`: Week of game

`gameDate`: Game Date (time, mm/dd/yyyy)

`gameTimeEastern`: Start time of game (time, HH:MM:SS, EST)

`homeTeamAbbr`: Home team three-letter code (text)

`visitorTeamAbbr`: Visiting team three-letter code (text)

In [None]:
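def camel_to_snake(name):
    """Convert a camelCase column name to snake_case."""
    # Put an underscore before each capitalized word, e.g. 'gameDate' -> 'game_Date'
    s1 = re.sub('(.)([A-Z][a-z]+)', r'\1_\2', name)
    # Split any remaining lowercase/digit-to-uppercase boundary, then lowercase everything
    return re.sub('([a-z0-9])([A-Z])', r'\1_\2', s1).lower()

# Shorten the team columns before converting every name to snake_case
games = games.rename(columns={
    'homeTeamAbbr' : 'home',
    'visitorTeamAbbr' : 'away'
})

games.columns = [camel_to_snake(col) for col in games.columns]

# Parse the date and time columns.
# Note: game_time_eastern holds only a time of day, so pd.to_datetime
# fills in the date on which this cell is run.
games['game_date'] = pd.to_datetime(games['game_date'])
games['game_time_eastern'] = pd.to_datetime(games['game_time_eastern'])
# astype(str) makes every value a Python str; the dtype reported below is still object
games['home'] = games['home'].astype(str)
games['away'] = games['away'].astype(str)

# Preview the rows, then check dtypes and null counts
display(games.head())
print(games.dtypes)
print(games.isnull().sum())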

Unnamed: 0,game_id,season,week,game_date,game_time_eastern,home,away
0,2018090600,2018,1,2018-09-06,2024-07-10 20:20:00,PHI,ATL
1,2018090900,2018,1,2018-09-09,2024-07-10 13:00:00,BAL,BUF
2,2018090901,2018,1,2018-09-09,2024-07-10 13:00:00,CLE,PIT
3,2018090902,2018,1,2018-09-09,2024-07-10 13:00:00,IND,CIN
4,2018090903,2018,1,2018-09-09,2024-07-10 13:00:00,MIA,TEN


game_id                       int64
season                        int64
week                          int64
game_date            datetime64[ns]
game_time_eastern    datetime64[ns]
home                         object
away                         object
dtype: object
game_id              0
season               0
week                 0
game_date            0
game_time_eastern    0
home                 0
away                 0
dtype: int64


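So `games` is just the schedule. Teams are identified by their abbreviations; we may later want to replace these with full team names. The date and time columns are read in as generic `object` columns, which can cause problems later, so we convert them to datetimes. Note that `game_time_eastern` contains only a time of day, so `pd.to_datetime` attaches the date on which the cell was run (2024-07-10 in the output above). The team columns still report the `object` dtype after `astype(str)`. There are no null values, so nothing needs to be done there. Finally, we rename some columns for convenience.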
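Since the date and the kickoff time live in separate columns, one way to get a single meaningful timestamp is to parse the date and add the time of day as an offset. The sketch below uses a toy frame with values copied from the preview above; the `games_demo` frame and the `kickoff_eastern` column name are made up for illustration, and it assumes the raw columns still hold `mm/dd/yyyy` and `HH:MM:SS` strings as described in the metadata.

```python
import pandas as pd

# Toy stand-in for the raw games table (hypothetical frame, values from the preview)
games_demo = pd.DataFrame({
    "game_id": [2018090600, 2018090900],
    "game_date": ["09/06/2018", "09/09/2018"],
    "game_time_eastern": ["20:20:00", "13:00:00"],
})

# Parse the calendar date with an explicit format
game_date = pd.to_datetime(games_demo["game_date"], format="%m/%d/%Y")

# Parse the time of day as an offset from midnight instead of a full datetime
kickoff_offset = pd.to_timedelta(games_demo["game_time_eastern"])

# Date + offset gives one timestamp per game (hypothetical column name)
games_demo["kickoff_eastern"] = game_date + kickoff_offset
print(games_demo[["game_id", "kickoff_eastern"]])
```

This keeps the real game date attached to the kickoff time, rather than the date the notebook happened to be run.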

We could use this table to see how special teams performance changes over time, but on its own this dataset does not offer much for our initial discovery.

Let's follow a similar process for the other datasets.

#### Play Data

`gameId`: Game identifier, unique (numeric)

`playId`: Play identifier, not unique across games (numeric)

`playDescription`: Description of play (text)

`quarter`: Game quarter (numeric)

`down`: Down (numeric)

`yardsToGo`: Distance needed for a first down (numeric)

`possessionTeam`: Team punting, placekicking or kicking off the ball (text)

`specialTeamsPlayType`: Formation of play: Extra Point, Field Goal, Kickoff or Punt (text)

`specialTeamsResult`: Special Teams outcome of play dependent on play type: Blocked Kick Attempt, Blocked Punt, Downed, Fair Catch, Kick Attempt Good, Kick Attempt No Good, Kickoff Team Recovery, Muffed, Non-Special Teams Result, Out of Bounds, Return or Touchback (text)

`kickerId`: nflId of placekicker, punter or kickoff specialist on play (numeric)

`returnerId`: nflId(s) of returner(s) on play if there was a special teams return. Multiple returners on a play are separated by a ; (text)

`kickBlockerId`: nflId of blocker of kick on play if there was a blocked field goal or blocked punt (numeric)

`yardlineSide`: 3-letter team code corresponding to line-of-scrimmage (text)

`yardlineNumber`: Yard line at line-of-scrimmage (numeric)

`gameClock`: Time on clock of play (MM:SS)

`penaltyCodes`: NFL categorization of the penalties that occurred on the play. A standard penalty code followed by a d means the penalty was on the defense. Multiple penalties on a play are separated by a ; (text)

`penaltyJerseyNumbers`: Jersey number and team code of the player committing each penalty. Multiple penalties on a play are separated by a ; (text)

`penaltyYards`: yards gained by possessionTeam by penalty (numeric)

`preSnapHomeScore`: Home score prior to the play (numeric)

`preSnapVisitorScore`: Visiting team score prior to the play (numeric)

`passResult`: Scrimmage outcome of the play if `specialTeamsResult` is "Non-Special Teams Result" (C: Complete pass, I: Incomplete pass, S: Quarterback sack, IN: Intercepted pass, R: Scramble, ' ': Designed Rush, text)

`kickLength`: Kick length in air of kickoff, field goal or punt (numeric)

`kickReturnYardage`: Yards gained by return team if there was a return on a kickoff or punt (numeric)

`playResult`: Net yards gained by the kicking team, including penalty yardage (numeric)

`absoluteYardlineNumber`: Location of ball downfield in tracking data coordinates (numeric)
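Several of these columns (`returnerId`, `penaltyCodes`, `penaltyJerseyNumbers`) pack multiple values into one string separated by `;`. A minimal sketch of how such a column could be unpacked into one row per value is shown below. The `plays_demo` frame is toy data invented for illustration, and it assumes the columns have already been renamed to snake_case.

```python
import pandas as pd

# Toy stand-in for the plays table (hypothetical values)
plays_demo = pd.DataFrame({
    "game_id": [1, 1, 2],
    "play_id": [10, 20, 10],
    "returner_id": ["44966", "45603;37267", None],  # None = no return on the play
})

returners = (
    plays_demo
    # Turn "a;b" into ["a", "b"]; missing values stay missing
    .assign(returner_id=plays_demo["returner_id"].str.split(";"))
    # One row per list element
    .explode("returner_id")
    # Drop plays without a returner
    .dropna(subset=["returner_id"])
)

# Clean up stray whitespace and convert the ids to integers
returners["returner_id"] = returners["returner_id"].str.strip().astype(int)
print(returners)
```

Each returner now sits on its own row alongside `game_id` and `play_id`, which makes it straightforward to join against the players table later.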

In [None]:
plays.columns = [camel_to_snake(col) for col in plays.columns]
display(plays)
print(plays.dtypes)
print(plays.isnull().sum())

Unnamed: 0,game_id,play_id,play_description,quarter,down,yards_to_go,possession_team,special_teams_play_type,special_teams_result,kicker_id,...,penalty_codes,penalty_jersey_numbers,penalty_yards,pre_snap_home_score,pre_snap_visitor_score,pass_result,kick_length,kick_return_yardage,play_result,absolute_yardline_number
0,2018090600,37,J.Elliott kicks 65 yards from PHI 35 to end zo...,1,0,0,PHI,Kickoff,Touchback,44966.0,...,,,,0,0,,66.0,,40,45
1,2018090600,366,"(9:20) C.Johnston punts 56 yards to ATL 36, Ce...",1,4,4,PHI,Punt,Return,45603.0,...,UNSd,PHI 18,-15.0,0,0,,56.0,5.0,36,18
2,2018090600,658,"(5:03) M.Bryant 21 yard field goal is GOOD, Ce...",1,4,3,ATL,Field Goal,Kick Attempt Good,27091.0,...,,,,0,0,,21.0,,0,13
3,2018090600,677,M.Bosher kicks 64 yards from ATL 35 to PHI 1. ...,1,0,0,ATL,Kickoff,Return,37267.0,...,,,,0,3,,64.0,30.0,34,75
4,2018090600,872,"(:33) C.Johnston punts 65 yards to end zone, C...",1,4,18,PHI,Punt,Touchback,45603.0,...,,,,0,3,,65.0,,45,45
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
19974,2021010315,3683,J.Myers kicks 65 yards from SEA 35 to end zone...,4,0,0,SEA,Kickoff,Touchback,41175.0,...,,,,16,19,,75.0,,40,75
19975,2021010315,3870,"J.Myers extra point is GOOD, Center-T.Ott, Hol...",4,0,0,SEA,Extra Point,Kick Attempt Good,41175.0,...,,,,16,25,,,,0,25
19976,2021010315,3886,J.Myers kicks 65 yards from SEA 35 to end zone...,4,0,0,SEA,Kickoff,Touchback,41175.0,...,,,,16,26,,75.0,,40,75
19977,2021010315,4166,"T.Vizcaino extra point is GOOD, Center-C.Holba...",4,0,0,SF,Extra Point,Kick Attempt Good,47590.0,...,,,,22,26,,,,0,95


game_id                       int64
play_id                       int64
play_description             object
quarter                       int64
down                          int64
yards_to_go                   int64
possession_team              object
special_teams_play_type      object
special_teams_result         object
kicker_id                   float64
returner_id                  object
kick_blocker_id             float64
yardline_side                object
yardline_number               int64
game_clock                   object
penalty_codes                object
penalty_jersey_numbers       object
penalty_yards               float64
pre_snap_home_score           int64
pre_snap_visitor_score        int64
pass_result                  object
kick_length                 float64
kick_return_yardage         float64
play_result                   int64
absolute_yardline_number      int64
dtype: object
game_id                         0
play_id                         0
play_description  

In [None]:
# special_teams_play_type is a column of plays, not scouting
sns.histplot(data=plays, x='special_teams_play_type', stat='density')
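The play type lives in the plays table rather than in the scouting table, and `playId` is not unique across games, so combining the two needs a join on both `game_id` and `play_id`. Here is a minimal sketch with toy frames; `plays_demo` and `scouting_demo` are invented for illustration, and the third row is made up to show a repeated `play_id` in a different game.

```python
import pandas as pd

# Toy stand-ins for the snake_case plays and scouting tables (hypothetical values)
plays_demo = pd.DataFrame({
    "game_id": [2018090600, 2018090600, 2018090900],
    "play_id": [37, 366, 37],
    "special_teams_play_type": ["Kickoff", "Punt", "Punt"],
})
scouting_demo = pd.DataFrame({
    "game_id": [2018090600, 2018090600, 2018090900],
    "play_id": [37, 366, 37],
    "hang_time": [3.85, 4.46, 4.10],
})

# Join on the composite key; validate raises if either side has duplicate keys
merged = scouting_demo.merge(
    plays_demo[["game_id", "play_id", "special_teams_play_type"]],
    on=["game_id", "play_id"],
    how="left",
    validate="one_to_one",
)

# Scouting measurements can now be grouped or plotted by play type
print(merged.groupby("special_teams_play_type")["hang_time"].mean())
```

Joining on `play_id` alone would silently match play 37 from one game with play 37 from another, which is why both keys are listed.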

#### PFF Scouting Data

`gameId`: Game identifier, unique (numeric)

`playId`: Play identifier, not unique across games (numeric)

`snapDetail`: On Punts, whether the snap was on target and if not, provides detail (H: High, L: Low, <: Left, >: Right, OK: Accurate Snap, text)

`operationTime`: Timing from snap to kick on punt plays in seconds (numeric)

`hangTime`: Hangtime of player's punt or kickoff attempt in seconds. Timing is taken from impact with foot to impact with the ground or a player. (numeric)

`kickType`: Kickoff or Punt Type (text). Depending on whether it is a kickoff or punt, this column can take any of the following values

<ul>
    <li> Possible values for kickoff plays </li>
        <ul>
            <li> D: Deep - your normal deep kick with decent hang time</li>
            <li> F: Flat - different than a Squib in that it will have some hang time and no roll but has a lower trajectory and hang time than a Deep kick off</li>
            <li> K: Free Kick - Kick after a safety</li>
            <li> O: Obvious Onside - score and situation dictates the need to regain possession. Also the hands team is on for the returning team</li>
            <li> P: Pooch kick - high for hangtime but not a lot of distance - usually targeting an upman</li>
            <li> Q: Squib - low-line drive kick that bounces or rolls considerably, with virtually no hang time</li>
            <li> S: Surprise Onside - accounting for score and situation an onsides kick that the returning team doesn’t expect. Hands teams probably aren't on the field</li>
            <li> B: Deep Direct OOB - Kickoff that is aimed deep (regular kickoff) that goes OOB directly (doesn't bounce)</li>
        </ul>
    <li>Possible values for punt plays:</li>
        <ul>
            <li>N: Normal - standard punt style</li>
            <li>R: Rugby style punt</li>
            <li>A: Nose down or Aussie-style punts</li>
        </ul>
</ul>

`kickDirectionIntended`: Intended kick direction from the kicking team's perspective - based on how coverage unit sets up and other factors (L: Left, R: Right, C: Center, text).

`kickDirectionActual`: Actual kick direction from the kicking team's perspective (L: Left, R: Right, C: Center, text).

`returnDirectionIntended`: The return direction the punt return or kick off return unit is set up for from the return team's perspective (L: Left, R: Right, C: Center, text).

`returnDirectionActual`: Actual return direction from the return team's perspective (L: Left, R: Right, C: Center, text).

`missedTackler`: Jersey number and team code of player(s) charged with a missed tackle on the play. It will be reasonable to assume that he should have brought down the ball carrier and failed to do so. This situation does not have to entail contact, but it most frequently does. Missed tackles on a QB by a pass rusher are also included here. Multiple missed tacklers on a play are separated by a ; (text).

`assistTackler`: Jersey number and team code of player(s) assisting on the tackle. Multiple assist tacklers on a play are separated by a ; (text).

`tackler`: Jersey number and team code of player making the tackle (text).

`kickoffReturnFormation`: 3 digit code indicating the number of players in the Front Wall, Mid Wall and Back Wall (text).

`gunners`: Jersey number and team code of player(s) lined up as gunner on punt unit. Multiple gunners on a play are separated by a ; (text).

`puntRushers`: Jersey number and team code of player(s) on the punt return unit with "Punt Rush" role for actively trying to block the punt. Does not include players crossing the line of scrimmage to engage in punt coverage players in a "Hold Up" role. Multiple punt rushers on a play are separated by a ; (text).

`specialTeamsSafeties`: Jersey number and team code for player(s) with "Safety" roles on kickoff coverage and field goal/extra point block units - and those not actively advancing towards the line of scrimmage on the punt return unit. Multiple special teams safeties on a play are separated by a ; (text).

`vises`: Jersey number and team code for player(s) with a "Vise" role on the punt return unit. Multiple vises on a play are separated by a ; (text).

`kickContactType`: Detail on how a punt was fielded, or what happened when it wasn't fielded (text). Possible values below
<ul>
    <li>BB: Bounced Backwards</li>
    <li>BC: Bobbled Catch from Air</li>
    <li>BF: Bounced Forwards</li>
    <li>BOG: Bobbled on Ground</li>
    <li>CC: Clean Catch from Air</li>
    <li>CFFG: Clean Field From Ground</li>
    <li>DEZ: Direct to Endzone</li>
    <li>ICC: Incidental Coverage Team Contact</li>
    <li>KTB: Kick Team Knocked Back</li>
    <li>KTC: Kick Team Catch</li>
    <li>KTF: Kick Team Knocked Forward</li>
    <li>MBC: Muffed by Contact with Non-Designated Returner</li>
    <li>MBDR: Muffed by Designated Returner</li>
    <li>OOB: Directly Out Of Bounds</li>
</ul>
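Two of the encodings described above are easier to work with once they are split into proper columns: the formation code (for example `8-0-2`) and the "team code plus jersey number" tags (for example `PHI 18; PHI 29`). The sketch below shows one possible way to parse them. The `scouting_demo` frame is a toy example built from values in the preview, and the output column names (`front_wall`, `mid_wall`, `back_wall`, `team`, `jersey`) are my own choices, not part of the dataset.

```python
import pandas as pd

# Toy stand-in for two scouting columns (values taken from the preview rows)
scouting_demo = pd.DataFrame({
    "kickoff_return_formation": ["8-0-2", None, "8-0-2"],
    "gunners": [None, "PHI 18; PHI 29", None],
})

# Formation code -> three nullable integer columns (hypothetical names)
walls = (
    scouting_demo["kickoff_return_formation"]
    .str.split("-", expand=True)
    .apply(pd.to_numeric)
    .astype("Int64")
)
walls.columns = ["front_wall", "mid_wall", "back_wall"]

# Player tags -> one row per player with separate team and jersey columns
gunners = (
    scouting_demo["gunners"]
    .str.split(";")
    .explode()
    .dropna()
    .str.strip()
    .str.extract(r"^(?P<team>[A-Z]+) (?P<jersey>\d+)$")
)
gunners["jersey"] = gunners["jersey"].astype(int)

print(walls)
print(gunners)
```

The index of `gunners` still points back at the original scouting row, so the parsed players can be tied to their play.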

In [None]:
scouting.columns = [camel_to_snake(col) for col in scouting.columns]
display(scouting.head())
print(scouting.dtypes)
print(scouting.isnull().sum())

Unnamed: 0,game_id,play_id,snap_detail,snap_time,operation_time,hang_time,kick_type,kick_direction_intended,kick_direction_actual,return_direction_intended,return_direction_actual,missed_tackler,assist_tackler,tackler,kickoff_return_formation,gunners,punt_rushers,special_teams_safeties,vises,kick_contact_type
0,2018090600,37,,,,3.85,D,R,R,,,,,,8-0-2,,,PHI 23; PHI 27,,
1,2018090600,366,OK,0.84,2.12,4.46,N,C,C,C,R,PHI 57,,PHI 54,,PHI 18; PHI 29,,,ATL 83; ATL 27; ATL 34; ATL 21,CC
2,2018090600,658,,,,,,,,,,,,,,,,PHI 58,,
3,2018090600,677,,,,4.06,D,R,R,C,C,ATL 83,ATL 22,ATL 27,8-0-2,,,ATL 17; ATL 22,,
4,2018090600,872,OK,0.84,2.0,4.35,N,C,L,,,,,,,PHI 18; PHI 29,ATL 85,ATL 37,ATL 83; ATL 34; ATL 21,BF


game_id                        int64
play_id                        int64
snap_detail                   object
snap_time                    float64
operation_time               float64
hang_time                    float64
kick_type                     object
kick_direction_intended       object
kick_direction_actual         object
return_direction_intended     object
return_direction_actual       object
missed_tackler                object
assist_tackler                object
tackler                       object
kickoff_return_formation      object
gunners                       object
punt_rushers                  object
special_teams_safeties        object
vises                         object
kick_contact_type             object
dtype: object
game_id                          0
play_id                          0
snap_detail                  14060
snap_time                    14061
operation_time               14061
hang_time                     6881
kick_type                     6256
k

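We have tons of missing data here! Much of it appears to be structural rather than accidental: for example, `snap_detail` and `operation_time` are described as punt-only fields, so they should be empty for kickoffs, field goals, and extra points. Let's see what insights we can start with.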
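One way to check how much of the missingness depends on the kind of play is to compute the share of null values per column within each play type. Below is a minimal sketch on a toy frame built from the five preview rows above; `demo` is a hypothetical frame that assumes the scouting data has already been joined to the play type from the plays table.

```python
import pandas as pd

# Toy frame: play type from plays plus two scouting columns (values from the previews)
demo = pd.DataFrame({
    "special_teams_play_type": ["Kickoff", "Punt", "Field Goal", "Kickoff", "Punt"],
    "snap_detail": [None, "OK", None, None, "OK"],
    "hang_time": [3.85, 4.46, None, 4.06, 4.35],
})

# Fraction of missing values per column, broken down by play type
missing_share = (
    demo.drop(columns="special_teams_play_type")
    .isnull()
    .groupby(demo["special_teams_play_type"])
    .mean()
)
print(missing_share)
```

A column that is fully missing for some play types and fully present for others is missing by design, and does not need imputation.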