# Box count/BMI Analysis

We engineer features to detect, within a game, the number of players in the "box" a team faces.

We also check whether the BMI/weight of position groups (DL, LBs) has any impact.

Finally, we analyze game-long pressure rates and their impact on passing.

In [1]:
import pandas as pd
import os
from utility_db_25 import get_momentum_cols, create_momentum_index,  get_motion_cols, motion_complexity_score

# Load data

We load the train, play, player-play, and player data; man/zone coverage is then labeled on df_play.

In [2]:
root_dir = os.getcwd()
train_data=pd.read_csv(os.path.join(root_dir, "data/train_data.csv")).sort_values(by=['gameId','possessionTeam','playId'])
df_play = pd.read_csv(os.path.join(root_dir,'data/plays.csv')).sort_values(by=['gameId','possessionTeam','playId'])
df_player_play = pd.read_csv(os.path.join(root_dir,'data/player_play.csv'))
df_players = pd.read_csv(os.path.join(root_dir,'data/players.csv'))

Label man/zone coverage (missing labels become 'UNK' and count as non-man):

In [3]:
df_play['pff_manZone'] = df_play['pff_manZone'].fillna('UNK')
df_play['is_man'] = 0
df_play.loc[df_play['pff_manZone'] == 'Man','is_man'] = 1

# Get rolling man/zone ratios

We start by getting rolling man-defense counts; we also number each play of the game (per possession team):

In [4]:
grp = df_play.groupby(['gameId','possessionTeam'])
df_play['man_ct_game'] = grp['is_man'].cumsum() # running count of man-coverage plays, including the current one
df_play['off_play'] = grp.cumcount() + 1 # cumcount starts at 0, but we want to start at 1

We now use this data to, for each play, get the ratio of man coverage used to that point:

In [5]:
df_play['man_ratio_pre'] = df_play['man_ct_game']/df_play['off_play']
df_play['man_ratio'] = df_play.groupby(['gameId','possessionTeam']).man_ratio_pre.shift(1)
df_play['man_ratio'] = df_play['man_ratio'].fillna(.2)
df_play[['gameId','playId','possessionTeam','is_man','man_ct_game','off_play','man_ratio_pre','man_ratio']].head(8)

Unnamed: 0,gameId,playId,possessionTeam,is_man,man_ct_game,off_play,man_ratio_pre,man_ratio
13368,2022090800,56,BUF,0,0,1,0.0,0.2
2660,2022090800,80,BUF,0,0,2,0.0,0.0
7042,2022090800,101,BUF,0,0,3,0.0,0.0
2909,2022090800,122,BUF,0,0,4,0.0,0.0
10143,2022090800,167,BUF,0,0,5,0.0,0.0
5958,2022090800,191,BUF,0,0,6,0.0,0.0
16048,2022090800,212,BUF,1,1,7,0.142857,0.0
13120,2022090800,236,BUF,0,1,8,0.125,0.142857
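
The shift-by-one step is what keeps a play's own coverage label out of its feature. A minimal sketch on a toy frame follows; the rows are made up for illustration, and the column names simply mirror the ones used above.

```python
import pandas as pd

# Toy data (hypothetical): two offenses in one game; is_man marks plays that faced man coverage
toy = pd.DataFrame({
    'gameId': [1] * 6,
    'possessionTeam': ['BUF', 'BUF', 'BUF', 'LA', 'LA', 'LA'],
    'playId': [10, 20, 30, 15, 25, 35],
    'is_man': [0, 1, 1, 1, 0, 0],
})

grp = toy.groupby(['gameId', 'possessionTeam'])
toy['man_ct_game'] = grp['is_man'].cumsum()   # running count, including this play
toy['off_play'] = grp.cumcount() + 1          # 1-based play number per offense
toy['man_ratio_pre'] = toy['man_ct_game'] / toy['off_play']

# Shift within each group so a play only sees ratios from earlier plays;
# the first play of each offense has no history and gets the 0.2 prior used above
toy['man_ratio'] = (
    toy.groupby(['gameId', 'possessionTeam'])['man_ratio_pre'].shift(1).fillna(0.2)
)
print(toy[['possessionTeam', 'is_man', 'man_ratio_pre', 'man_ratio']])
```

Grouping before the shift matters: a plain `shift(1)` on the whole column would leak the last BUF ratio into the first LA play.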


Incorporate man ratio into train data:

In [6]:
train_data = train_data.merge(df_play[['gameId','playId','man_ratio']],how='left')

# Get box-count EWM

On the team-game level, we get the exponentially weighted mean (EWM) of box count:

In [7]:
train_data['box_ewm_pre'] = train_data.groupby(['gameId','possessionTeam'])['n_defense_box'].transform(lambda x: x.ewm(alpha=.1).mean())
train_data['box_ewm'] = train_data.groupby(['gameId','possessionTeam']).box_ewm_pre.shift(1)
train_data['box_ewm'] = train_data['box_ewm'].fillna(6)
train_data[['gameId','playId','possessionTeam','n_defense_box','box_ewm']].head(6)

Unnamed: 0,gameId,playId,possessionTeam,n_defense_box,box_ewm
0,2022090800,56,BUF,6.0,6.0
1,2022090800,80,BUF,6.0,6.0
2,2022090800,101,BUF,7.0,6.0
3,2022090800,122,BUF,6.0,6.369004
4,2022090800,167,BUF,5.0,6.261704
5,2022090800,191,BUF,6.0,5.953603
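
To make the EWM column easier to audit, here is a small sketch that recomputes it by hand for the six box counts shown above. It assumes pandas' default `adjust=True` weighting, where a value from k plays ago gets weight (1 - alpha)**k; the helper name `ewm_by_hand` is ours, not part of the notebook.

```python
import pandas as pd

box = pd.Series([6.0, 6.0, 7.0, 6.0, 5.0, 6.0])  # n_defense_box for the six plays shown above
alpha = 0.1

# pandas default (adjust=True): weighted mean with weights (1 - alpha)**k, k = plays ago
ewm_pre = box.ewm(alpha=alpha).mean()

def ewm_by_hand(values, alpha):
    """Recompute the adjusted EWM explicitly (illustrative helper)."""
    out = []
    for t in range(len(values)):
        weights = [(1 - alpha) ** (t - i) for i in range(t + 1)]
        weighted_sum = sum(w * v for w, v in zip(weights, values[: t + 1]))
        out.append(weighted_sum / sum(weights))
    return out

manual = ewm_by_hand(box.tolist(), alpha)

# Shift so each play only uses earlier plays; 6 is the prior for the first play
box_ewm = ewm_pre.shift(1).fillna(6)
print(box_ewm.round(6).tolist())
```

With alpha = 0.1 the most recent play carries only modestly more weight than older ones, so the feature moves slowly; a larger alpha would react faster to recent box counts.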


Note: the non-EWM mean box count has since been excised from this notebook; when compared, the EWM version significantly outperformed it (almost 2:1).

In [8]:
train_data[['man_ratio','box_ewm','pass']].corr()

Unnamed: 0,man_ratio,box_ewm,pass
man_ratio,1.0,0.136,-0.009945
box_ewm,0.136,1.0,-0.086355
pass,-0.009945,-0.086355,1.0


Knowing how often a team has faced man coverage is close to useless (correlation with pass of about -0.01); box-count EWM is more useful (about -0.09) and not too cross-correlated with other features:

In [9]:
corr_matrix = train_data.corr(numeric_only=True) # compute once; skips string columns such as possessionTeam
box_ewm_corrs = corr_matrix['box_ewm'].sort_values(ascending=False)
weight_corrs = corr_matrix['weight_all_sum'].sort_values(ascending=False)
box_ewm_corrs.head(5)

box_ewm           1.000000
box_ewm_pre       0.897657
n_defense_box     0.251735
time_remaining    0.147769
man_ratio         0.136000
Name: box_ewm, dtype: float64

# Process BMI data

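First, we convert height to inches, then get BMI. Note that this is the raw ratio of weight in pounds to height in inches squared, without the 703 factor of the standard imperial formula, so values are proportional to conventional BMI rather than equal to it: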

In [10]:
# calc height, bmi
df_players = pd.concat([df_players,df_players['height'].str.split('-',n=1,expand=True).rename(columns={0:'h_ft',1:'h_in_pre'})],axis=1)
df_players['height_inches'] = df_players['h_ft'].astype(int)*12 + df_players['h_in_pre'].astype(int)
df_players['bmi'] = df_players['weight'] /(df_players['height_inches']**2) # weight/height squared

# incorporate data back into player-play
df_bmi = df_player_play[['gameId','playId','nflId']].merge(df_players[['nflId','bmi','height_inches','weight','position']])
df_bmi.head(1)

Unnamed: 0,gameId,playId,nflId,bmi,height_inches,weight,position
0,2022090800,56,35472,0.054815,77,325,G
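
As a sanity check on the numbers above, the following sketch reproduces the 77-inch, 325 lb row and shows how the notebook's ratio relates to the standard imperial BMI. The `'6-5'` height string and the helper `height_to_inches` are assumptions made for illustration.

```python
import pandas as pd

def height_to_inches(height):
    """Convert a 'feet-inches' string such as '6-5' to total inches."""
    feet, inches = height.split('-', 1)
    return int(feet) * 12 + int(inches)

# One hypothetical player consistent with the row shown above (77 in, 325 lb)
players = pd.DataFrame({'height': ['6-5'], 'weight': [325]})
players['height_inches'] = players['height'].map(height_to_inches)

# Ratio used in this notebook: pounds divided by inches squared
players['bmi'] = players['weight'] / players['height_inches'] ** 2

# Standard imperial BMI multiplies the same ratio by 703
players['bmi_standard'] = 703 * players['bmi']
print(players)
```

Because the two versions differ only by a constant factor, correlations computed on `bmi` alone are unaffected; the scale only matters once BMI is added to or subtracted from other quantities.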


### Get BMI by position

For each play, we get the mean BMI, weight, and height, by position group:

In [11]:
def position_means(positions, label):
    # Mean weight, height, and BMI per play for one position group
    cols = {'weight': f'mean_{label}_weight', 'height_inches': f'mean_{label}_height', 'bmi': f'mean_{label}_bmi'}
    subset = df_bmi[df_bmi['position'].isin(positions)]
    return subset.groupby(['gameId','playId'])[list(cols)].mean().reset_index().rename(columns=cols)

ol_df = position_means(['C','G','T'], 'OL')
dl_df = position_means(['DT','NT','DE'], 'DL')
lb_df = position_means(['LB','OLB','ILB'], 'LB')
cb_df = position_means(['CB'], 'CB')
wr_df = position_means(['WR'], 'WR')
te_df = position_means(['TE'], 'TE')

Then, we integrate all these positional BMIs into the play data:

In [12]:
df_play = df_play.merge(ol_df,how='left')
df_play = df_play.merge(dl_df,how='left')
df_play = df_play.merge(lb_df,how='left')
df_play = df_play.merge(cb_df,how='left')
df_play = df_play.merge(wr_df,how='left')
df_play = df_play.merge(te_df,how='left')
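
The merges above rely on pandas inferring the join keys from shared column names. A slightly more defensive variant, sketched here on made-up frames, names the keys explicitly and uses the `validate` argument of `DataFrame.merge` so that duplicated keys raise an error instead of silently multiplying rows. The helper `merge_play_features` is hypothetical.

```python
from functools import reduce
import pandas as pd

# Toy per-play frames (hypothetical values)
plays = pd.DataFrame({'gameId': [1, 1], 'playId': [10, 20]})
ol = pd.DataFrame({'gameId': [1, 1], 'playId': [10, 20], 'mean_OL_bmi': [0.054, 0.055]})
dl = pd.DataFrame({'gameId': [1], 'playId': [10], 'mean_DL_bmi': [0.052]})  # play 20 has no DL rows

def merge_play_features(base, feature_frames):
    """Left-merge each per-play frame on explicit keys; raise if keys repeat."""
    return reduce(
        lambda left, right: left.merge(
            right, on=['gameId', 'playId'], how='left', validate='one_to_one'
        ),
        feature_frames,
        base,
    )

merged = merge_play_features(plays, [ol, dl])
print(merged)
```

Plays with no players from a position group keep their row and get NaN for that group's columns, which is the same behavior as the left merges above.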

We next calculate rough "delta" BMIs between opposing position groups (e.g., WR/CB), and also add 'box' data (DL + LB):

In [13]:
df_play['wr_cb_bmi_delta'] = df_play['mean_WR_bmi']-df_play['mean_CB_bmi']
df_play['ol_dl_bmi_delta'] = df_play['mean_OL_bmi']-df_play['mean_DL_bmi']
df_play['ol_box_delta'] = df_play['mean_OL_bmi']-((df_play['mean_DL_bmi'] + df_play['mean_LB_bmi']) /2)
df_play['ol_plus_box_delta'] = ((df_play['mean_OL_bmi']+df_play['mean_TE_bmi'])/2)-((df_play['mean_DL_bmi'] + df_play['mean_LB_bmi']) /2)
df_play['box_weight'] = (df_play['mean_DL_weight'] + df_play['mean_LB_weight']) /2
df_play['box_bmi'] = (df_play['mean_DL_bmi'] + df_play['mean_LB_bmi']) /2

In [14]:
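new_cols = [c for c in df_play.columns[-28:] if c not in train_data.columns] # skip columns already merged in (e.g. man_ratio)
train_data = train_data.merge(df_play[['gameId','playId']+new_cols],on=['gameId','playId'],how='left')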

### Compare new features to final features in model

We want to see if there's too much cross-correlation between our new features and our extant useful ones:

In [15]:
motion_cols=get_motion_cols(train_data.columns)
momentum_cols=get_momentum_cols(train_data.columns)
train_data=create_momentum_index(train_data, momentum_cols)
train_data=motion_complexity_score(train_data, motion_cols)
final_features=['xpass_situational',  'QB_RB1_offset','off_xpass','n_offense_backfield','motion-momentum','neg_Formations', 'mean_pairwise_dist']

Create a few more composite features, trying to reconcile box count & DL/box weight:

In [16]:
train_data['box_ewm_dl_weight'] = train_data['box_ewm']*train_data['mean_DL_weight']
train_data['box_ewm_dl_bmi'] = train_data['box_ewm']*train_data['mean_DL_bmi']
train_data['box_ewm_weight'] = train_data['box_ewm']*train_data['box_weight']
train_data['box_ewm_bmi'] = train_data['box_ewm']*train_data['box_bmi']

Initial takeaway: among the new features, rolling box EWM multiplied by mean DL BMI tells us the most about pass likelihood:

In [17]:
train_data[final_features+['mean_DL_weight','box_ewm','box_ewm_dl_weight','box_ewm_weight',
                           'box_ewm_dl_bmi','box_ewm_bmi','pass']].corr()

Unnamed: 0,xpass_situational,QB_RB1_offset,off_xpass,n_offense_backfield,motion-momentum,neg_Formations,mean_pairwise_dist,mean_DL_weight,box_ewm,box_ewm_dl_weight,box_ewm_weight,box_ewm_dl_bmi,box_ewm_bmi,pass
xpass_situational,1.0,0.092432,0.12328,-0.278399,0.055583,-0.459718,0.38733,-0.211035,-0.10736,-0.200521,-0.147466,-0.193342,-0.157835,0.488006
QB_RB1_offset,0.092432,1.0,-0.040572,-0.210775,0.027452,-0.133448,0.115273,0.00314,-0.005813,-0.00207,0.003596,0.003639,0.004931,0.097237
off_xpass,0.12328,-0.040572,1.0,-0.096844,0.053168,-0.157291,0.122287,-0.062612,-0.123109,-0.13443,-0.127967,-0.150956,-0.141282,0.162012
n_offense_backfield,-0.278399,-0.210775,-0.096844,1.0,0.004917,0.341777,-0.321354,0.088789,0.06975,0.104886,0.084822,0.09976,0.087282,-0.312946
motion-momentum,0.055583,0.027452,0.053168,0.004917,1.0,-0.062888,-0.003989,0.015853,-0.015019,-0.0046,-0.012381,-0.017652,-0.021914,0.237531
neg_Formations,-0.459718,-0.133448,-0.157291,0.341777,-0.062888,1.0,-0.386887,0.130078,0.107366,0.158068,0.127719,0.162643,0.138057,-0.409078
mean_pairwise_dist,0.38733,0.115273,0.122287,-0.321354,-0.003989,-0.386887,1.0,-0.112187,-0.155697,-0.187754,-0.163509,-0.181298,-0.166011,0.309695
mean_DL_weight,-0.211035,0.00314,-0.062612,0.088789,0.015853,0.130078,-0.112187,1.0,0.026178,0.551396,0.395767,0.53092,0.343433,-0.104301
box_ewm,-0.10736,-0.005813,-0.123109,0.06975,-0.015019,0.107366,-0.155697,0.026178,1.0,0.847516,0.910149,0.814301,0.916064,-0.086355
box_ewm_dl_weight,-0.200521,-0.00207,-0.13443,0.104886,-0.0046,0.158068,-0.187754,0.551396,0.847516,1.0,0.971311,0.961509,0.948933,-0.126365
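
Reading a wide correlation matrix by eye is error-prone, so a small helper that ranks candidate features by the absolute value of their correlation with the target can help. This is only a sketch on made-up data; `rank_by_abs_corr` is a hypothetical name, not a function from `utility_db_25`.

```python
import pandas as pd

# Toy data (hypothetical): one strongly negative feature and one weak one
toy = pd.DataFrame({
    'pass': [1, 0, 1, 0, 1, 0],
    'feat_strong_neg': [0, 2, 0, 2, 0, 2],   # perfectly anti-correlated with pass
    'feat_weak': [1, 1, 0, 0, 1, 0],
})

def rank_by_abs_corr(df, target):
    """Correlation of every numeric column with target, strongest first by magnitude."""
    corrs = df.select_dtypes('number').corr()[target].drop(target)
    order = corrs.abs().sort_values(ascending=False).index
    return corrs.reindex(order)  # keep the sign, sort by magnitude

ranked = rank_by_abs_corr(toy, 'pass')
print(ranked)
```

Sorting by magnitude rather than raw value keeps strongly negative features, such as the box-count ones here, from sinking to the bottom of the list.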


# Pressure Analysis

For each play, we see if any player caused pressure on it:

In [18]:
df_pp_cp = df_player_play.groupby(['gameId','playId']).agg(pressure_play=('causedPressure','any')).reset_index()
df_play = df_play.merge(df_pp_cp,how='left')
df_play.sort_values(by=['gameId','possessionTeam','playId'],inplace=True)

Then, we get the rolling pressure rate for each offensive team in the game:

In [19]:
df_play['pressure_ct'] = df_play.groupby(['gameId','possessionTeam'])['pressure_play'].cumsum() # running count of pressured plays
df_play['mean_pr_pre'] = df_play['pressure_ct']/df_play['off_play']
df_play['mean_pressure_ratio'] = df_play.groupby(['gameId','possessionTeam']).mean_pr_pre.shift(1)
df_play['mean_pressure_ratio'] = df_play['mean_pressure_ratio'].fillna(.19)

In [20]:
df_play[['gameId','playId','possessionTeam','pressure_play','pressure_ct','off_play','mean_pr_pre','mean_pressure_ratio']].head(11)

Unnamed: 0,gameId,playId,possessionTeam,pressure_play,pressure_ct,off_play,mean_pr_pre,mean_pressure_ratio
0,2022090800,56,BUF,False,0,1,0.0,0.19
1,2022090800,80,BUF,True,1,2,0.5,0.0
2,2022090800,101,BUF,False,1,3,0.333333,0.5
3,2022090800,122,BUF,True,2,4,0.5,0.333333
4,2022090800,167,BUF,False,2,5,0.4,0.5
5,2022090800,191,BUF,False,2,6,0.333333,0.4
6,2022090800,212,BUF,False,2,7,0.285714,0.333333
7,2022090800,236,BUF,False,2,8,0.25,0.285714
8,2022090800,529,BUF,False,2,9,0.222222,0.25
9,2022090800,550,BUF,True,3,10,0.3,0.222222


### Attempt integration w/OL, box & DL BMI info

By itself, the pressure rate a team faces, somewhat surprisingly, tells us little about pass rates going forward.

In [21]:
df_play[['mean_pressure_ratio','isDropback']].corr()

Unnamed: 0,mean_pressure_ratio,isDropback
mean_pressure_ratio,1.0,0.023543
isDropback,0.023543,1.0


We thus try to integrate it into our previously engineered Box/OL/DL BMI data:

In [22]:
train_data = train_data.merge(df_play[['gameId','playId','mean_pressure_ratio']],how='left')
train_data['box_pressure'] = 5*train_data['box_ewm']+.05*train_data['mean_pressure_ratio']
train_data['OL_pc'] = train_data['mean_OL_bmi']*train_data['mean_pressure_ratio']
train_data['DL_pc'] = train_data['mean_DL_bmi']*train_data['mean_pressure_ratio']

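This still bears little fruit (box_pressure is almost perfectly correlated with box_ewm, since the 5 : 0.05 weighting lets box count dominate), so we'll try EWM next: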

In [23]:
train_data[['box_ewm_dl_bmi','box_ewm','box_pressure','mean_pressure_ratio','OL_pc','DL_pc','pass']].corr()

Unnamed: 0,box_ewm_dl_bmi,box_ewm,box_pressure,mean_pressure_ratio,OL_pc,DL_pc,pass
box_ewm_dl_bmi,1.0,0.814301,0.814267,-0.163002,-0.16395,-0.108692,-0.128618
box_ewm,0.814301,1.0,0.999997,-0.184197,-0.187378,-0.183689,-0.086355
box_pressure,0.814267,0.999997,1.0,-0.181879,-0.185064,-0.181385,-0.086345
mean_pressure_ratio,-0.163002,-0.184197,-0.181879,1.0,0.998946,0.994276,0.019962
OL_pc,-0.16395,-0.187378,-0.185064,0.998946,1.0,0.993638,0.019736
DL_pc,-0.108692,-0.183689,-0.181385,0.994276,0.993638,1.0,0.010115
pass,-0.128618,-0.086355,-0.086345,0.019962,0.019736,0.010115,1.0
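
The near-perfect correlation between box_pressure and box_ewm follows from the scales involved: the box term varies by whole units while the pressure term varies by hundredths. The simulation below illustrates this on synthetic data (the distributions are invented, not fitted to the real columns) and shows one possible alternative, standardizing both inputs before combining them.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 1000
sim = pd.DataFrame({
    'box_ewm': rng.normal(6.0, 0.5, n),               # made-up spread around 6 defenders
    'mean_pressure_ratio': rng.uniform(0.0, 0.6, n),  # made-up pressure rates
})

# Same weighting as above: the first term varies far more than the second
sim['box_pressure'] = 5 * sim['box_ewm'] + 0.05 * sim['mean_pressure_ratio']

def zscore(s):
    """Center to mean 0 and scale to unit standard deviation."""
    return (s - s.mean()) / s.std()

# Alternative (an assumption, not the notebook's method): standardize first
# so that both inputs contribute equally to the composite
sim['box_pressure_z'] = zscore(sim['box_ewm']) + zscore(sim['mean_pressure_ratio'])

print(sim.corr().round(3))
```

Whether an equally weighted composite is actually more predictive of pass is a separate question that would need to be checked against the real data.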


# Pressure EWM analysis

Again, as with box count, EWM makes pressure rate considerably more valuable to our model (its correlation with pass rises from about 0.02 to about 0.06):

In [24]:
df_play['pressure_ewm_pre'] = df_play.groupby(['gameId','possessionTeam'])['pressure_play'].transform(lambda x: x.ewm(alpha=.1).mean())
df_play['pressure_ewm'] = df_play.groupby(['gameId','possessionTeam']).pressure_ewm_pre.shift(1)
df_play['pressure_ewm'] = df_play['pressure_ewm'].fillna(.19)

In [25]:
train_data = train_data.merge(df_play[['gameId','playId','pressure_ewm']],how='left')

In [26]:
train_data[['pressure_ewm','pass']].corr()

Unnamed: 0,pressure_ewm,pass
pressure_ewm,1.0,0.057065
pass,0.057065,1.0


In [27]:
# pressure_ewm is already in train_data from the merge in In [25], so no second merge is needed
# note: these subtract pressure_ewm, so OL_pc_ewm and DL_pc_ewm end up tracking -pressure_ewm almost exactly
train_data['box_pressure_ewm'] = 5*train_data['box_ewm']+.05-train_data['pressure_ewm']
train_data['OL_pc_ewm'] = train_data['mean_OL_bmi']-train_data['pressure_ewm']
train_data['DL_pc_ewm'] = train_data['mean_DL_bmi']-train_data['pressure_ewm']

In [28]:
train_data[['box_ewm_dl_bmi','box_ewm','box_pressure_ewm','pressure_ewm','OL_pc_ewm','DL_pc_ewm','pass']].corr()

Unnamed: 0,box_ewm_dl_bmi,box_ewm,box_pressure_ewm,pressure_ewm,OL_pc_ewm,DL_pc_ewm,pass
box_ewm_dl_bmi,1.0,0.814301,0.814152,-0.189285,0.189146,0.201797,-0.128618
box_ewm,0.814301,1.0,0.998531,-0.209294,0.208836,0.209365,-0.086355
box_pressure_ewm,0.814152,0.998531,1.0,-0.261975,0.261521,0.262022,-0.088389
pressure_ewm,-0.189285,-0.209294,-0.261975,1.0,-0.999956,-0.99977,0.057065
OL_pc_ewm,0.189146,0.208836,0.261521,-0.999956,1.0,0.999736,-0.057073
DL_pc_ewm,0.201797,0.209365,0.262022,-0.99977,0.999736,1.0,-0.059128
pass,-0.128618,-0.086355,-0.088389,0.057065,-0.057073,-0.059128,1.0


# Gather best features, write to CSV

Here, we take everything useful we've gathered, and write it out:

In [29]:
train_data[['gameId','playId','pressure_ewm','box_ewm_dl_bmi']]

Unnamed: 0,gameId,playId,pressure_ewm,box_ewm_dl_bmi
0,2022090800,56,0.190000,0.313053
1,2022090800,80,0.000000,0.313053
2,2022090800,101,0.526316,0.313053
3,2022090800,122,0.332103,0.332306
4,2022090800,167,0.526316,0.326708
...,...,...,...,...
14551,2022103100,3596,0.034831,0.354598
14552,2022103100,3674,0.031343,0.376992
14553,2022103100,3697,0.028206,0.381778
14554,2022103100,3727,0.025383,0.386084


In [31]:
train_data[['gameId','playId','pressure_ewm','box_ewm_dl_bmi']].to_csv(os.path.join(root_dir, "data/box_bmi_pressure.csv"))
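
One detail worth knowing when this file is read back in: by default `to_csv` also writes the row index, which `read_csv` then loads as a column named `Unnamed: 0`. The sketch below shows the round trip both ways on a tiny made-up frame, using in-memory buffers instead of real files.

```python
import io
import pandas as pd

# Tiny stand-in for the exported features
features = pd.DataFrame({
    'gameId': [2022090800, 2022090800],
    'playId': [56, 80],
    'pressure_ewm': [0.19, 0.0],
})

with_index = io.StringIO()
features.to_csv(with_index)                  # default: row index written as an unnamed first column
without_index = io.StringIO()
features.to_csv(without_index, index=False)  # only the named columns are written

with_index.seek(0)
without_index.seek(0)
reloaded_with = pd.read_csv(with_index)
reloaded_without = pd.read_csv(without_index)

print(list(reloaded_with.columns))
print(list(reloaded_without.columns))
```

If downstream notebooks already expect the extra index column, the current call can stay as is; otherwise `index=False` keeps the file limited to the merge keys and the two features.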