<font size = "6"> Fordham Sports Analytics Society Big Data Bowl 2023 - Feature Creation </font>

<font size = "4"> Convert data into one cleaned data frame that can be used in model creation. </font>

- Authors:  Peter Majors, Chris Orlando, Jack Townsend, and Etienne Busnel
- Kaggle:  https://www.kaggle.com/competitions/nfl-big-data-bowl-2023/overview (Resources)
- Our Github:  https://github.com/peterlmajors/FSAS_BigDataBowl_2023 (Up-To-Date Code)

In [20]:
#Import Required Packages

#Data Manipulation
import pandas as pd
import numpy as np
import math

#Notebook Settings
pd.set_option('display.max_columns', 1000)
pd.set_option('display.max_rows', 1000)

In [21]:
#Import Our Merged DataFrames

#All Player Tracking
ptrack = pd.read_csv("C:/Users/Peter/Python Scripts/Case Competitions/NFL Big Data Bowl 2023/merged_data/ptrack.csv")

#Player Tracking Only On Frames When The QB Is The Target Of The Pass Rusher, The Pass Rusher Is Known, The Pass Rusher Is In The Immediate Zone, And A Block Occurs
ptrack_qb_poss_block = pd.read_csv("C:/Users/Peter/Python Scripts/Case Competitions/NFL Big Data Bowl 2023/merged_data/ptrack_qb_poss.csv")

#Play-By-Play Data
pbp = pd.read_csv("C:/Users/Peter/Python Scripts/Case Competitions/NFL Big Data Bowl 2023/merged_data/pbp.csv")

#Import PFF Data On QB Pressures
pff_qb_pressure = pd.read_csv("C:/Users/Peter/Python Scripts/Case Competitions/NFL Big Data Bowl 2023/merged_data/pff_qb_pressure.csv")

<font size="5"> Feature 1 & 2: Speed and Acceleration Of Pass Rusher Coming Into A Block </font>

 - With an Immediate Zone 1.5 Yards In Depth and 1.5 Yards Across
 
 - Speed Is More Likely To Be Used For Tackles, While Guards And Centers Will Almost Definitely Use Acceleration

- Speed and Acceleration On First Frame In The Immediate Zone

- How Explosive Was The Pass Rusher On The Selected Frame?

In [22]:
#Find All Times Rusher In The Box
ptrack_qb_poss_block_1 = ptrack_qb_poss_block[ptrack_qb_poss_block.rusher_in_imm_box == 1][['game_play_nfl_Id', 'frameId', 'rusher_in_imm_box']]

#Sort By Frame So That The First Row In Each Group Is The Earliest Frame In The Immediate Zone
ff_in_imm_zone = ptrack_qb_poss_block_1.sort_values('frameId').groupby('game_play_nfl_Id').first().reset_index()[['game_play_nfl_Id', 'frameId']]

#Rename Frame Id To First Frame In The Immediate Zone
ff_in_imm_zone = ff_in_imm_zone.rename(columns = {'frameId': 'ff_in_imm_zone'})

#Merge Onto The Main Data Frame
ptrack_qb_poss_block = ptrack_qb_poss_block.merge(ff_in_imm_zone, on = 'game_play_nfl_Id', how = 'left')
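The first-frame lookup above depends on the rows being ordered by frame. A minimal sketch on toy data, assuming the same column names as the notebook (the values and keys are hypothetical), shows why an explicit sort is a cheap safeguard:

```python
import pandas as pd

# Toy tracking frames (hypothetical values); column names follow the notebook
toy = pd.DataFrame({
    'game_play_nfl_Id':  ['g1_p1_n1'] * 4 + ['g1_p2_n2'] * 3,
    'frameId':           [4, 2, 3, 1, 2, 1, 3],   # deliberately unsorted
    'rusher_in_imm_box': [1, 0, 1, 0, 1, 0, 1],
})

# Keep only frames where the rusher is in the zone, then sort so that
# "first" really means "earliest frame"
in_box = toy[toy.rusher_in_imm_box == 1].sort_values('frameId')

first_in = (in_box.groupby('game_play_nfl_Id')['frameId']
                  .first()
                  .rename('ff_in_imm_zone')
                  .reset_index())
print(first_in)
```

Without the `sort_values` call, the first group would report frame 4 instead of frame 3, because `first()` returns whichever row happens to come first in the frame.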

In [23]:
#Speed And Acceleration at First Frame In The Imm Zone
sa_ff_in_imm_zone = ptrack_qb_poss_block[ptrack_qb_poss_block.ff_in_imm_zone == ptrack_qb_poss_block.frameId][['game_play_nfl_Id', 's_rusher', 'a_rusher']] 

#Rename Speed and Acceleration Columns
sa_ff_in_imm_zone = sa_ff_in_imm_zone.rename(columns = {'s_rusher': 's_rusher_ff_imm_box', 'a_rusher': 'a_rusher_ff_imm_box'}).drop_duplicates()

#Merge Speed of The Rusher In The First Frame Imm Zone Onto Main
ptrack_qb_poss_block = ptrack_qb_poss_block.merge(sa_ff_in_imm_zone, on = 'game_play_nfl_Id', how = 'left')

<font size="5"> Feature 3: Distance Traveled From Start To End Of Immediate Zone To QB </font>

- Attempts To Answer The Question of How Much A Pass Blocker Was "Pushed" During Their Handling Of A Rusher

- Only Concerns The First Time (If There Is More Than One) That A Rusher Was In The Pass Blocker's Immediate Zone 

- If The Pass Rusher Never Leaves The Immediate Zone, The End of The Immediate Zone Is The End Of QB Possession

- What Is The Raw Stopping Power of Our Offensive Linemen?

In [24]:
#Find All Frames After The Rusher Has Entered The Immediate Zone And Where They Are Not In The Immediate Zone
ptrack_qb_poss_block_after_ff_in_imm_zone = ptrack_qb_poss_block.loc[(ptrack_qb_poss_block['frameId'] > ptrack_qb_poss_block['ff_in_imm_zone']) & (ptrack_qb_poss_block['rusher_in_imm_box'] == 0)]

#Now Group This By Play And Find The First Frame, Having Sorted By FrameId
ptrack_qb_poss_block_after_ff_in_imm_zone = ptrack_qb_poss_block_after_ff_in_imm_zone.sort_values('frameId').groupby('game_play_nfl_Id')['frameId'].first().reset_index()

#Rename To First Frame Out Of The Imm Zone After First Entering
ptrack_qb_poss_block_after_ff_in_imm_zone = ptrack_qb_poss_block_after_ff_in_imm_zone.rename(columns = {'frameId': 'ff_out_imm_zone'})

#Merge First Frame Out Of The Imm Zone Onto Main
ptrack_qb_poss_block = ptrack_qb_poss_block.merge(ptrack_qb_poss_block_after_ff_in_imm_zone, on = 'game_play_nfl_Id', how = 'left')

#Fill Null Cells With Last Frame of QB Possession
ptrack_qb_poss_block['ff_out_imm_zone'] = ptrack_qb_poss_block['ff_out_imm_zone'].fillna(ptrack_qb_poss_block['frame_last'])
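The exit-frame rule above (first frame out of the zone after entry, otherwise the last frame of QB possession) can be checked on toy data. This is only a sketch: the column names mirror the notebook, while the keys and values are hypothetical.

```python
import pandas as pd

# Play 'a' leaves the zone at frame 3; play 'b' never leaves after entering
toy = pd.DataFrame({
    'game_play_nfl_Id':  ['a'] * 5 + ['b'] * 4,
    'frameId':           [1, 2, 3, 4, 5, 1, 2, 3, 4],
    'rusher_in_imm_box': [0, 1, 0, 1, 1, 0, 1, 1, 1],
    'frame_last':        [5] * 5 + [4] * 4,
    'ff_in_imm_zone':    [2] * 9,
})

# Frames after entry where the rusher is outside the zone
after_entry = toy[(toy.frameId > toy.ff_in_imm_zone) & (toy.rusher_in_imm_box == 0)]

first_out = (after_entry.sort_values('frameId')
                        .groupby('game_play_nfl_Id')['frameId']
                        .first()
                        .rename('ff_out_imm_zone')
                        .reset_index())

# Plays with no exit frame come back as NaN and fall back to frame_last
toy = toy.merge(first_out, on='game_play_nfl_Id', how='left')
toy['ff_out_imm_zone'] = toy['ff_out_imm_zone'].fillna(toy['frame_last'])

exit_frames = toy.groupby('game_play_nfl_Id')['ff_out_imm_zone'].first().to_dict()
print(exit_frames)
```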

In [25]:
#Find Difference In Distance From Start To End of Immediate Zone

#Create Rusher To QB Distance
ptrack_qb_poss_block['rusher_dist_from_qb'] = np.hypot((ptrack_qb_poss_block.x_rusher - ptrack_qb_poss_block.x_qb), (ptrack_qb_poss_block.y_rusher - ptrack_qb_poss_block.y_qb))

#Distance Beginning & End of Imm Zone
rusher_dist_qb_beg_imm_zone = ptrack_qb_poss_block[ptrack_qb_poss_block.ff_in_imm_zone == ptrack_qb_poss_block.frameId][['game_play_nfl_Id', 'rusher_dist_from_qb']]
rusher_dist_qb_end_imm_zone = ptrack_qb_poss_block[ptrack_qb_poss_block.ff_out_imm_zone == ptrack_qb_poss_block.frameId][['game_play_nfl_Id', 'rusher_dist_from_qb']]

#Rename Distance DFs Beginning & End of Imm Zone
rusher_dist_qb_beg_imm_zone = rusher_dist_qb_beg_imm_zone.rename(columns = {'rusher_dist_from_qb': 'rusher_dist_from_qb_ff_in_imm'}).drop_duplicates()
rusher_dist_qb_end_imm_zone = rusher_dist_qb_end_imm_zone.rename(columns = {'rusher_dist_from_qb': 'rusher_dist_from_qb_ff_out_imm'}).drop_duplicates()
 
#Merge Distance At Beginning & End Of Imm Zone
ptrack_qb_poss_block = ptrack_qb_poss_block.merge(rusher_dist_qb_beg_imm_zone, on = 'game_play_nfl_Id', how = 'left')
ptrack_qb_poss_block = ptrack_qb_poss_block.merge(rusher_dist_qb_end_imm_zone, on = 'game_play_nfl_Id', how = 'left')

#Difference In Distance From Beginning To End of Imm Zone (Positive = Gained Ground)
ptrack_qb_poss_block['rusher_dist_from_qb_diff_beg_end_imm_zone'] = ptrack_qb_poss_block.rusher_dist_from_qb_ff_in_imm - ptrack_qb_poss_block.rusher_dist_from_qb_ff_out_imm 
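A worked example of the sign convention may help. This sketch uses hypothetical field coordinates in yards and a hypothetical helper `dist`; it is not part of the original pipeline.

```python
import numpy as np

def dist(p, q):
    """Euclidean distance between two (x, y) points, in yards."""
    return float(np.hypot(p[0] - q[0], p[1] - q[1]))

# Hypothetical positions at the first frame in the zone and the first frame out
dist_in  = dist((30.0, 20.0), (24.0, 20.0))   # rusher vs QB at entry
dist_out = dist((27.0, 24.0), (24.0, 20.0))   # rusher vs QB at exit

# Entry distance minus exit distance: positive means the rusher closed on the QB
gained = dist_in - dist_out
print(dist_in, dist_out, gained)
```

Here the rusher starts 6 yards from the QB and ends 5 yards away, so the feature is +1 yard, meaning the rusher gained ground.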

<font size="5"> Feature 4: Difference Between The Direction Of The Rusher And Orientation Of Pass Blocker At The First Frame Of The Immediate Zone </font>

- Attempts To Determine If A Pass Rusher Ran Into The Blocker Straight On Or At An Angle

- Blocks Where Rushers Come In From Creative Angles But Are Handled Well By The Pass Blocker Should Be Factored

- Can Our Linemen Handle Rushers Coming From The Left Or Right?

In [26]:
#Difference Between Rusher Direction and Blocker Orientation At First Frame Beginning Of Immediate Zone
diff_btw_rusher_dir_blocker_o_beg_imm_zone = ptrack_qb_poss_block[ptrack_qb_poss_block.ff_in_imm_zone == ptrack_qb_poss_block.frameId][['game_play_nfl_Id', 'diff_btw_rusher_dir_blocker_o']]

#Rename to Difference Rusher Direction and Blocker Orientation At First Frame In Immediate Zone
diff_btw_rusher_dir_blocker_o_beg_imm_zone = diff_btw_rusher_dir_blocker_o_beg_imm_zone.rename(columns = {'diff_btw_rusher_dir_blocker_o': 'diff_btw_rusher_dir_blocker_o_ff_in_imm'}).drop_duplicates()
 
#Merge Difference Rusher Direction and Blocker Orientation At First Frame In Immediate Zone Onto Main Dataframe
ptrack_qb_poss_block = ptrack_qb_poss_block.merge(diff_btw_rusher_dir_blocker_o_beg_imm_zone, on = 'game_play_nfl_Id', how = 'left')

<font size="5"> Feature 5: Difference In Orientation Of Pass Blocker Between First And Last Frame Of Rusher Being In Immediate Zone </font>

- Attempts To Answer How Much An Offensive Lineman Was "Spun" By The Pass Rusher

- The Difference Is Measured In Absolute Value, So The Change In Orientation Is Agnostic To Direction "Spun"

- How Well Can A Pass Blocker Hold Shoulder Angle When Engaged?

In [27]:
#Orientation Beginning & End of Imm Zone
o_beg_imm_zone = ptrack_qb_poss_block[ptrack_qb_poss_block.ff_in_imm_zone == ptrack_qb_poss_block.frameId][['game_play_nfl_Id', 'o']]
o_end_imm_zone = ptrack_qb_poss_block[ptrack_qb_poss_block.ff_out_imm_zone == ptrack_qb_poss_block.frameId][['game_play_nfl_Id', 'o']]

#Rename Orientation DFs Beginning & End of Imm Zone
o_beg_imm_zone = o_beg_imm_zone.rename(columns = {'o': 'o_ff_in_imm'}).drop_duplicates()
o_end_imm_zone = o_end_imm_zone.rename(columns = {'o': 'o_ff_out_imm'}).drop_duplicates()
 
#Merge Orientation At Beginning & End Of Imm Zone
ptrack_qb_poss_block = ptrack_qb_poss_block.merge(o_beg_imm_zone, on = 'game_play_nfl_Id', how = 'left')
ptrack_qb_poss_block = ptrack_qb_poss_block.merge(o_end_imm_zone, on = 'game_play_nfl_Id', how = 'left')

#Difference In Orientation From Beginning To End of Imm Zone (Absolute Value)
ptrack_qb_poss_block['blocker_o_diff_beg_end_imm_zone'] = abs(ptrack_qb_poss_block.o_ff_in_imm - ptrack_qb_poss_block.o_ff_out_imm)
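One caveat with a plain absolute difference is that orientation is an angle. Assuming it is recorded in degrees from 0 to 360, a blocker turning from 350 to 10 degrees has rotated 20 degrees, not 340. A hedged sketch of a wraparound-aware alternative follows; the helper name `angular_diff` is hypothetical and is not used in the notebook.

```python
import numpy as np

def angular_diff(a_deg, b_deg):
    """Smallest absolute difference between two headings, in degrees (0 to 180)."""
    d = np.abs(np.asarray(a_deg, dtype=float) - np.asarray(b_deg, dtype=float)) % 360.0
    return np.minimum(d, 360.0 - d)

print(angular_diff(350.0, 10.0))               # a plain abs() would report 340
print(angular_diff([90.0, 0.0], [45.0, 180.0]))
```

Because the function is vectorized, it could be applied directly to the `o_ff_in_imm` and `o_ff_out_imm` columns if the wraparound behavior is wanted.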

<font size="5"> Feature 6: Time The Pass Rusher Was Initially Kept In The Immediate Zone </font>

- For Plays When The Rusher Enters The Immediate Zone Multiple Times, The Number of Frames They Were In The Immediate Zone The First Time

- If The Rusher Only Enters The Immediate Zone Once In A Play, The Total Number of Frames They Were In The Immediate Zone

- How Long Can A Pass Blocker Initially Maintain A Pass Rusher In Their Immediate Zone?

In [28]:
#Find The Number of Frames Between Entering And First Leaving The Immediate Zone
ptrack_qb_poss_block['time_rusher_in_imm_zone'] = ptrack_qb_poss_block.ff_out_imm_zone - ptrack_qb_poss_block.ff_in_imm_zone 
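The value above is a difference of frame IDs, so it is measured in frames. If seconds are preferred, a small conversion is enough. This sketch assumes the tracking data is sampled at 10 frames per second; the constant and helper names are hypothetical.

```python
# Assumption: tracking data is recorded at 10 frames per second
FRAMES_PER_SECOND = 10

def frames_to_seconds(n_frames):
    """Convert a frame count into seconds at the assumed sampling rate."""
    return n_frames / FRAMES_PER_SECOND

print(frames_to_seconds(25))
```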

<font size="5"> Feature 7: Time After Initially Leaving The Immediate Zone A Rusher Is In The Immediate Zone For Remainder Of Play </font>

- When Rusher Enters The Immediate Zone Multiple Times In A Play, Total Number of Frames They Were In The Immediate Zone For The Remainder Of The Play

- If The Rusher Only Enters The Immediate Zone Once In A Play, This Feature Is Left Null

- Acts As Insurance In Case A Rusher Comes Out Of The Immediate Zone For A Frame Or Two But Is Still Engaged With The Blocker

- How Well Can Our Pass Blocker Recover On Their Target If They Originally "Failed" In Keeping Them In Front of Them?

In [29]:
#Filter By Frames After The Rusher First Left The Immediate Zone
ptrack_qb_poss_block_after_ff_out_imm_zone = ptrack_qb_poss_block.loc[ptrack_qb_poss_block['frameId'] > ptrack_qb_poss_block['ff_out_imm_zone']]

#Sum The 1/0 Flag To Count The Frames A Rusher Is Back In The Immediate Zone After The Frame In Which They Initially Left
ptrack_qb_poss_block_after_ff_out_imm_zone = ptrack_qb_poss_block_after_ff_out_imm_zone.groupby('game_play_nfl_Id')['rusher_in_imm_box'].sum().reset_index() 

#Rename Time In The Immediate Zone After First Time
ptrack_qb_poss_block_after_ff_out_imm_zone = ptrack_qb_poss_block_after_ff_out_imm_zone.rename(columns = {'rusher_in_imm_box': 'time_in_imm_zone_after_out'}).drop_duplicates()

#Merge Time Each Play In Immediate Zone Post Immediate Zone
ptrack_qb_poss_block = ptrack_qb_poss_block.merge(ptrack_qb_poss_block_after_ff_out_imm_zone, on = 'game_play_nfl_Id', how = 'left')
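When aggregating a 1/0 flag such as `rusher_in_imm_box`, `count()` and `sum()` answer different questions. A minimal sketch on hypothetical frames after the initial exit:

```python
import pandas as pd

# Four frames after the rusher first left the zone; back in the zone for two of them
toy = pd.DataFrame({
    'game_play_nfl_Id':  ['a'] * 4,
    'frameId':           [6, 7, 8, 9],
    'rusher_in_imm_box': [0, 1, 1, 0],
})

grouped = toy.groupby('game_play_nfl_Id')['rusher_in_imm_box']
n_rows   = grouped.count()['a']   # every non-null row, regardless of the flag
n_in_box = grouped.sum()['a']     # only frames where the flag is 1
print(n_rows, n_in_box)
```

`count()` measures how many frames remain in the play, while `sum()` measures how many of those frames the rusher spent back in the immediate zone.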

<font size="5"> Feature 8: Block Type Employed By The Pass Blocker </font>

- Aside From The Context Of Field Position, Block Types Can Have A Noticeable Impact On The Distance A Rusher Gets To A Quarterback

- Allows Us To Control For Pass Blockers Who Are Routinely Given Easier or More Difficult Blocking Assignments As It Pertains To Rusher Distance From Quarterback

In [30]:
#No Code Required For "pff_blockType"

<font size="5"> Feature 9: Number of Pass Rushers In The Box </font>

- Data On Double Teams Was Not Provided, So This Is Used Instead

- How Much Total Opposition Was The O-Line, And Consequently The Lineman, Facing?

In [31]:
#Find The Number of Defenders In The Box
pbp_defendersInBox = pbp[['gameId','playId','defendersInBox']]

#Change Data Type of gameId and playId to int
ptrack_qb_poss_block[['gameId', 'playId']] = ptrack_qb_poss_block[['gameId', 'playId']].apply(pd.to_numeric)

#Merge Defenders In The Box Onto The Main Data Frame
ptrack_qb_poss_block = ptrack_qb_poss_block.merge(pbp_defendersInBox, on = ['gameId','playId'], how = 'left')
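Each left merge in this notebook assumes one row per key on the right-hand side; a duplicated key would silently multiply tracking rows. As a sketch, pandas can enforce that assumption with the `validate` argument of `merge`. The frames below are toy data with hypothetical values.

```python
import pandas as pd

tracking = pd.DataFrame({'gameId': [1, 1, 1], 'playId': [10, 10, 20], 'frameId': [1, 2, 1]})
plays = pd.DataFrame({'gameId': [1, 1], 'playId': [10, 20], 'defendersInBox': [6, 7]})

# many_to_one: many tracking rows per play, exactly one play-level row per key
merged = tracking.merge(plays, on=['gameId', 'playId'], how='left', validate='many_to_one')

# Simulate a duplicated play-level row and confirm the merge refuses it
plays_dup = pd.concat([plays, plays.iloc[[0]]], ignore_index=True)
try:
    tracking.merge(plays_dup, on=['gameId', 'playId'], how='left', validate='many_to_one')
    duplicate_caught = False
except pd.errors.MergeError:
    duplicate_caught = True

print(len(merged), duplicate_caught)
```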

<font size="5"> Feature 10: Pressure To Sack Rate - <i> Courtesy of PFF Premium </i> </font>

- A Statistic That Measures The Evasiveness Of A QB While Not Necessarily Being Outside The Pocket

- This Is A Quarterback's Pressure To Sack Rate Over The First 8 Weeks Of The 2021 Season

- Rusher Distance From The QB Is, Among Other Factors, Impacted By A QB's Ability To Evade Sacks

In [32]:
#Find All Game And Play ID Combos And Their QBs
qb_on_play = ptrack[ptrack.pff_positionLinedUp == "QB"][['gameId', 'playId', 'displayName']].drop_duplicates()

#Rename So That It Reads "QB_on_play"
qb_on_play = qb_on_play.rename(columns = {'displayName': 'QB_on_play'}).drop_duplicates()

#Change Object Columns To Int
ptrack_qb_poss_block[['gameId', 'playId']] = ptrack_qb_poss_block[['gameId', 'playId']].apply(pd.to_numeric)

#Now, Merge The QB On Each Play Onto The Main Data Frame
ptrack_qb_poss_block = ptrack_qb_poss_block.merge(qb_on_play, on = ['gameId', 'playId'], how = 'left')

In [33]:
#Find QB Pressure To Sack Rate
qb_p2s = pff_qb_pressure.groupby('player')['pressure_sack_percent'].mean().reset_index()

#Rename The Column To Include That It's For That QB
qb_p2s = qb_p2s.rename(columns = {'pressure_sack_percent': 'qb_pressure_sack_percent'})

#Merge These Rates Onto The Main Data Frame
ptrack_qb_poss_block = ptrack_qb_poss_block.merge(qb_p2s, left_on = 'QB_on_play', right_on = 'player', how = 'left')
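Joining on player names can miss rows when the two sources spell a name differently. A quick way to spot this, sketched here with hypothetical names and rates, is the `indicator` argument of `merge`:

```python
import pandas as pd

# Hypothetical names and rates; the second name differs by a suffix across sources
plays = pd.DataFrame({'QB_on_play': ['QB A', 'QB B', 'QB C']})
rates = pd.DataFrame({'player': ['QB A', 'QB B Jr.'],
                      'qb_pressure_sack_percent': [12.0, 15.0]})

merged = plays.merge(rates, left_on='QB_on_play', right_on='player',
                     how='left', indicator=True)

# Rows marked left_only found no matching rate and will carry NaN into the model
unmatched = merged.loc[merged['_merge'] == 'left_only', 'QB_on_play'].unique().tolist()
print(unmatched)
```

Any names listed in `unmatched` would need a manual mapping before the rate can be trusted as a feature.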

<font size="5"> Feature 11: Average Distance That The Rusher Is From The Quarterback At The End Of Possession On All Plays </font>

- An Understanding Of The Opposition Is Also Critical In Evaluating Offensive Linemen

- Despite The Fact That The Best Pass Rushers Are Often Double Teamed, This Provides Our Model With 

In [34]:
#Average Rusher Distance From The QB At The First Frame Out Of The Immediate Zone, Grouped By Rusher ID
avg_dist_rusher_from_qb_end_poss = ptrack_qb_poss_block[ptrack_qb_poss_block.frameId == ptrack_qb_poss_block.ff_out_imm_zone].groupby("nflId_rusher")['rusher_dist_from_qb'].mean().reset_index()

#Rename The Column
avg_dist_rusher_from_qb_end_poss = avg_dist_rusher_from_qb_end_poss.rename(columns = {'rusher_dist_from_qb': 'rusher_avg_dist_from_qb_end_poss'})

#Merge Onto The Main Data Frame
ptrack_qb_poss_block = ptrack_qb_poss_block.merge(avg_dist_rusher_from_qb_end_poss, on = 'nflId_rusher', how = 'left')

<font size="5"> Filter Columns In Main Data Frame To Prep For Model Building </font>

In [35]:
#Simplify The Data Frame To Only Include Columns Of Interest For The Model
df = ptrack_qb_poss_block[['displayName', "game_play_nfl_Id", 'pff_positionLinedUp',"ff_in_imm_zone", "ff_out_imm_zone", 's_rusher_ff_imm_box', 'a_rusher_ff_imm_box',
                    'rusher_dist_from_qb_diff_beg_end_imm_zone', 'diff_btw_rusher_dir_blocker_o_ff_in_imm', 'blocker_o_diff_beg_end_imm_zone', 'time_rusher_in_imm_zone', 
                    'time_in_imm_zone_after_out', 'pff_blockType', 'defendersInBox', 'qb_pressure_sack_percent', 'rusher_avg_dist_from_qb_end_poss', 'pff_beatenByDefender', 
                    'pff_hitAllowed', 'pff_hurryAllowed', 'pff_sackAllowed', 'nflId_blocker', 'nflId_rusher', 'displayName_rusher', 'pff_positionLinedUp_rusher', 'QB_on_play', 'team']]

<font size="5"> Export The Final Data Frame For The Model </font>

In [36]:
#Export The df For The Model
df.to_csv("C:/Users/Peter/Python Scripts/Case Competitions/NFL Big Data Bowl 2023/merged_data/df_model.csv")