[Kaggle Competition](https://www.kaggle.com/c/nfl-big-data-bowl-2021)

Purpose of this notebook is to generate prediction models for the coverage type, based on 2018 week 1 labels provided in the competition.

Reason behind building a coverage label prediction model is to label plays for all other weeks (2-17) and then use that as an additional feature of downstream analytics, either at the play-level or tracking data within plays of a specific coverage type.

# Imports

In [1]:
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from matplotlib import animation
import seaborn as sns

In [2]:
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import make_pipeline, Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import OrdinalEncoder
from sklearn.metrics import accuracy_score, confusion_matrix

In [3]:
# add local directory to import path
import os
import sys
module_path = os.path.abspath(os.path.join('.'))
if module_path not in sys.path:
    sys.path.append(module_path)
    
import nflutil

Loading the data:

In [4]:
week_num = 1

track_df = pd.read_csv(f'csv/week{week_num}.csv')
play_df = pd.read_csv('csv/plays.csv')
game_df = pd.read_csv('csv/games.csv')
player_df = pd.read_csv('csv/players.csv')
coverage_df = pd.read_csv('csv/coverages_week1.csv')

# Generate Features

Features are generated on a per-play basis to predict the overall defensive coverage scheme.

Unless specified as the value at a particular stage of the play (snap, throw, etc.), the features are calculated at each time step between the snap and a "pivot" event that will alter the behavior of the defense (i.e. throw or sack) or a maximum elapsed time after the snap, whichever is first. The maximum time is meant to limit the effect of the play breaking down on the planned movements of the players.

All defenders are evaluated at a threshold time in the play (e.g. 1.5 seconds after snap) if they are still beyond the line of scrimmage to determine which defenders are actually in coverage vs. blitzing. Blitzing player values would skew average results, and the movement of the rusher does not matter beyond the fact that they are not in coverage.

The following features are generated at a per-defender level (inspired by [this paper](https://arxiv.org/abs/1906.11373)), then averaged across all coverage defenders:

* **depth_mean**: average yards downfield of the defender (same as X_mean but normalizes for line of scrimmage)
* **depth_var**: variance in downfield movement (same as X_var, lower value expected for zone)
* **y_mean**: average horizontal location of defender
* **y_var**: variance in horizontal movement (lower value expected for zone)
* **speed_mean**: average speed of the defender, yd/sec (lower value expected for zone)
* **speed_var**: variance in speed of the defender (lower expected for zone)
* **off_mean**: average distance to the closest offensive player at each frame during a play (lower expected for man)
* **off_var**: variance in distance to the closest offensive player at each frame (lower expected for man)
* **def_mean**: average distance to the closest defender at each frame
* **def_var**: variance in distance to the closest defender at each frame
* **rat_mean**: see paper for better description
* **rat_var**: see paper for better description
* **rat_o_los**: ratio of time during a play that a defender is facing the QB/line of scrimmage. Expected to be near 1 for zone (reading the QB's eyes, keeping receivers in front of them vs. turning and running with the receiver) **Current threshold is 180 degrees; may want to make this smaller like 90 degrees to account for man coverage following crossing routes or moving mostly sideways but slightly towards the LOS**

The following play-level features are generated:

* **n_cover**: the number of defenders with tracking data in coverage (beyond the LOS at t_defender_thresh sec)
* **n_cb**: the number of cornerbacks in the defensive formation
* **n_def_excess**: the number of coverage defenders minus eligible receivers (players available for zone or double coverage)

The following features are evaluated based on player locations at the snap (leverage info when defense do not disguise coverage pre-snap):
* **n_deep_snap**: the number of defenders over 10 yards downfield at the snap
* **cb_depth_snap_min**: depth of closest CB to the LOS (press vs. soft)
* **cb_depth_snap_mean**: average depth of CBs
* **cb_depth_snap_max**: depth of farther CB to the LOS (zone expected for far off the ball)

The following features are evaluated at the "freeze" time (determining which players are in coverage, coverage scheme is assumed to be materialized regardless if it was being disguised pre-snap or not):
* **n_deep_frz**: number of defenders over 10 yards downfield at "freeze"
* **def_spac_frz_avg**: spacing to nearest defender at "freeze", averaged over all coverage players
* **def_spac_frz_min**: same source as above, but minimum (larger expected for zone)
* **def_spac_frz_max**: same source as above, but maximum (smaller expected for zone)
* **cb_depth_frz_min**: same as at snap but at the "freeze" time
* **cb_depth_frz_mean**: same as at snap but at the "freeze" time
* **cb_depth_frz_max**: same as at snap but at the "freeze" time

In [22]:
def create_play_features(play_track_df, t_defender_thresh=1.5, t_scheme_develop=3):
    ### THE INPUT TRACKING DATA MUST BE NORMALIZED FOR DIRECTION BEFORE INPUT INTO THIS FUNCTION
    # inputs:
    #     - play_track_df: DataFrame of the raw player tracking data for an individual play.
    #                      MUST ONLY BE FOR A SINGLE PLAY, CANNOT HANDLE MULTIPLE PLAYS.
    #     - t_defender_thresh: time in seconds after the snap to determine which players are in coverage
    #     - t_scheme_develop: time in seconds after the snap to set as a max time threshold
    #                         (i.e. before the play breaks down, after which the movement 
    #                          is not always indicative of the coverage scheme)
    
    # local constants
    DEF_DEEP_THRESH = 10  # yards behind the line of scrimmage considered "deep" coverage

    # work on a copy of the data rather than the actual data (for temporary features)
    play_track_df = play_track_df.copy()
    
    # ------------- FEATURE GENERATION SETUP/INTERMEDIATE CALCULATIONS ----------------------
    
    # make a dictionary of nflId to position for each player in the play
    player_positions = play_track_df.groupby('nflId')[['nflId', 'position']].head(1).dropna()  # football is np.nan
    position_dict = dict(zip(player_positions['nflId'].tolist(), player_positions['position'].tolist()))
    
    # get play information
    x_los = play_track_df.x[(play_track_df.team == 'football') & (play_track_df.frameId == 1)].iloc[0]
    
    # save the distance downfield of all observations relative to the line of scrimmage
    play_track_df['depth'] = play_track_df['x'] - x_los
    
    # get frameId for specific points in the play (exclude handoff: not a material pivot part of the play,
    # also sometimes occurs prior to the snap)
    pivot_events = ['pass_forward', 'qb_sack', 'fumble', 'qb_strip_sack', 'pass_shovel']
    
    frame_max = play_track_df.frameId.max()
    frame_snap = play_track_df[play_track_df.event=='ball_snap']['frameId'].iloc[0]
    if np.any(play_track_df.event.isin(pivot_events)):
        frame_pivot = play_track_df.frameId[play_track_df.event.isin(pivot_events)].iloc[0]
    else:
        frame_pivot = frame_max
    
    # important frameId's in the play:
    frame_start = frame_snap
    frame_cover_freeze = min(frame_max, frame_pivot, int(frame_snap + 10*t_defender_thresh))
    frame_scheme_develop = int(frame_snap + 10*t_scheme_develop)
    frame_end = min(frame_pivot, frame_scheme_develop, frame_max)
    
#     print(f'frame_start: {frame_start}')
#     print(f'frame_cover_freeze: {frame_cover_freeze}')
#     print(f'frame_scheme_develop: {frame_scheme_develop}')
#     print(f'frame_end: {frame_end}')
#     print(f'frame_max: {frame_max}')
    
    
    # filter out data from frames outside of the range (frame_start <= F <= frame_end)
    play_track_df = play_track_df[(play_track_df.frameId >= frame_start) & (play_track_df.frameId <= frame_end)]
    
    
    # ----- SAVE SLICES OF DATAFRAME FOR DEFENDERS AND COVERAGE AND ELIGIBLE RECEIVERS ----
    
    # get defensive player tracks that are in coverage (i.e. not blitzing/rushing the passer)
    def_positions = ['DE', 'DL', 'NT', 'LB', 'MLB', 'ILB', 'OLB', 'DB', 'CB', 'FS', 'SS', 'S']
    cover_players = play_track_df.nflId[(play_track_df.frameId == frame_cover_freeze) &
                          play_track_df.position.isin(def_positions) & 
                          (play_track_df.depth > 0)]
    def_track = play_track_df[play_track_df.nflId.isin(cover_players)].pivot(
        index='frameId', columns='nflId', values=['x', 'depth', 'y', 's', 'a', 'dir', 'o'])
    
    # get offensive player tracks of eligible receivers (minus QB)
    off_positions = ['WR', 'RB', 'TE', 'FB', 'HB']
    off_track = play_track_df[play_track_df.position.isin(off_positions)].pivot(
        index='frameId', columns='nflId', values=['x', 'depth', 'y', 's', 'a', 'dir', 'o'])
    
    
    # ------ PLAY CHARACTERISTICS AT SPECIFIC FRAMES/POINTS IN TIME -----------------------
    
    # find characteristics of scheme at the snap (line of scrimmage naturally divides offense + defense)
    n_deep_snap = np.sum((play_track_df.depth >= DEF_DEEP_THRESH) & (play_track_df.frameId == frame_start))
    n_cb = len(play_track_df[play_track_df.position=='CB'].groupby('nflId').head(1))
    cb_depth_at_snap = play_track_df.loc[(play_track_df.frameId == frame_start) & (play_track_df.position=='CB'), 'depth']
    
    # find characteristics of players in coverage at the "cover freeze time"
    n_deep_freeze = np.sum((play_track_df.nflId.isin(cover_players)) & 
                           (play_track_df.depth >= DEF_DEEP_THRESH) &
                           (play_track_df.frameId == frame_cover_freeze))
    cb_depth_at_freeze = play_track_df.loc[(play_track_df.frameId == frame_cover_freeze) & (play_track_df.position=='CB'), 'depth']
    
    # calculate the number of "excess defenders" (available for free zone or double coverage)
    # - negative either means uncovered, or potential receiver stays back in protection
    n_def_excess = len(cover_players) - len(play_track_df.nflId[play_track_df.position.isin(off_positions)].unique())
    
    
    # ------GENERATE FEATURES FOR EACH COVERAGE PLAYER AT EACH FRAME ---------------------
    
    feature_data = {'depth_mean': [],
                    'depth_var': [],
                    'y_mean': [],
                    'y_var': [],
                    'speed_mean': [],
                    'speed_var': [],
                    'off_mean': [],
                    'off_var': [],
                    'off_dir_mean': [],
                    'off_dir_var': [],
                    'def_mean': [],
                    'def_var': [],
                    'rat_mean': [],
                    'rat_var': [],
                    'rat_o_los': []
                   }
    
    # data that is not dependent on the specific player
    x_off = off_track['x'].to_numpy()  # (n_frame, n_off) array
    y_off = off_track['y'].to_numpy()  # (n_frame, n_off) array
    dir_off = off_track['dir'].to_numpy()  # (n_frame, n_off array)
    
    # preallocate numpy array for minimum distance to nearest defensive player for each frame
    dist_def_min_array = np.empty([len(play_track_df.frameId.unique()), len(cover_players)])
    dist_def_min_array[:] = np.nan
    
    # loop over each defensive player
    for i, player in enumerate(cover_players):
        x_player = def_track['x'][player].to_numpy().reshape(-1, 1)  # (n_frame,1) array
        depth_player = def_track['depth'][player].to_numpy().reshape(-1, 1)  # (n_frame,1) array
        y_player = def_track['y'][player].to_numpy().reshape(-1, 1)  # (n_frame,1) array
        s_player = def_track['s'][player].to_numpy().reshape(-1, 1)  # (n_frame,1) array
        o_player = def_track['o'][player].to_numpy().reshape(-1, 1)  # (n_frame,1) array
        dir_player = def_track['dir'][player].to_numpy().reshape(-1, 1)  # (n_frame,1) array
        
        x_def = def_track['x'].drop(columns=player).to_numpy()  # (n_frame, n_def-1) array
        y_def = def_track['y'].drop(columns=player).to_numpy()  # (n_frame, n_def-1) array
        
        # calculate distance to each player at each time
        dist_off = np.sqrt((x_player - x_off)**2 + (y_player - y_off)**2)  # (n_frame, n_off) array
        dist_off_min = np.nanmin(dist_off, axis=1) # (n_frame,) array
        dist_def = np.sqrt((x_player - x_def)**2 + (y_player - y_def)**2)  # (n_frame, n_off) array
        dist_def_min = np.nanmin(dist_def, axis=1) # (n_frame,) array
        
        # save minimum distances for each player in external array
        dist_def_min_array[:, i] = dist_def_min
        
        # get the direction of the closest offensive player at each time
        dir_off_min = dir_off[dist_off == dist_off_min.reshape(-1,1)]  # (n_frame,) array
        # calculate the difference between the defender direction and offense direction, accounting for 0-360 wrap
        dir_diff_m360 = dir_player - dir_off_min - 360  # min when defender dir = 359, offense dir = 0
        dir_diff = dir_player - dir_off_min             # min when wrap is not an issue (defender = 10, offense = 11)
        dir_diff_p360 = dir_player - dir_off_min + 360  # min when defender dir = 0, offense dir = 359
        temp_dir_compare = np.where(np.abs(dir_diff_m360) < np.abs(dir_diff), dir_diff_m360, dir_diff)
        dir_diff_min = np.where(np.abs(temp_dir_compare) < np.abs(dir_diff_p360), temp_dir_compare, dir_diff_p360)
        
        # calculate ratio of distance to closest offensive player "j" (d_i-j) and distance from same offensive player
        # to defensive player "k" nearest to current player (d_j-k)
        x_j = x_off[dist_off == dist_off_min.reshape(-1,1)]  # (n_frame,) array
        y_j = y_off[dist_off == dist_off_min.reshape(-1,1)]  # (n_frame,) array
        x_k = x_def[dist_def == dist_def_min.reshape(-1,1)]  # (n_frame,) array
        y_k = y_def[dist_def == dist_def_min.reshape(-1,1)]  # (n_frame,) array
        ratio = dist_off_min / np.sqrt((x_j - x_k)**2 + (y_j - y_k)**2)
        
        # calculate the ratio of frames a defender is facing the line of scrimmage
        rat_o_los = np.nanmean((o_player > 180) & (o_player < 360))
        
        # save average distance
        feature_data['depth_mean'].append(np.nanmean(depth_player))
        feature_data['depth_var'].append(np.nanvar(depth_player))
        feature_data['y_mean'].append(np.nanmean(y_player))
        feature_data['y_var'].append(np.nanvar(y_player))
        feature_data['speed_mean'].append(np.nanmean(s_player))
        feature_data['speed_var'].append(np.nanvar(s_player))
        feature_data['off_mean'].append(np.nanmean(dist_off_min))
        feature_data['off_var'].append(np.nanvar(dist_off_min))
        feature_data['off_dir_mean'].append(np.nanmean(dir_diff_min))
        feature_data['off_dir_var'].append(np.nanvar(dir_diff_min))
        feature_data['def_mean'].append(np.nanmean(dist_def_min))
        feature_data['def_var'].append(np.nanvar(dist_def_min))
        feature_data['rat_mean'].append(np.nanmean(ratio))
        feature_data['rat_var'].append(np.nanvar(ratio))
        feature_data['rat_o_los'].append(rat_o_los)
        
    # put results into a dataframe
    dist_df = pd.DataFrame(feature_data, index=cover_players)
    
    # return averages of all coverage players
    out_data = dist_df.mean()
    out_data['n_cover'] = len(cover_players)
    out_data['n_cb'] = n_cb
    out_data['n_deep_snap'] = n_deep_snap
    out_data['n_deep_frz'] = n_deep_freeze
    out_data['n_def_excess'] = n_def_excess
    out_data['def_spac_frz_avg'] = np.nanmean(dist_def_min_array[frame_cover_freeze - frame_start, :])
    out_data['def_spac_frz_min'] = np.nanmin(dist_def_min_array[frame_cover_freeze - frame_start, :])
    
    # add in CB-specific features
    if n_cb > 0:
        out_data['cb_depth_snap_min'] = np.nanmin(cb_depth_at_snap)
        out_data['cb_depth_snap_mean'] = np.nanmean(cb_depth_at_snap)
        out_data['cb_depth_snap_max'] = np.nanmax(cb_depth_at_snap)
        out_data['cb_depth_frz_min'] = np.nanmin(cb_depth_at_freeze)
        out_data['cb_depth_frz_mean'] = np.nanmean(cb_depth_at_freeze)
        out_data['cb_depth_frz_max'] = np.nanmax(cb_depth_at_freeze)
    else:
        out_data['cb_depth_snap_min'] = 0
        out_data['cb_depth_snap_mean'] = 0
        out_data['cb_depth_snap_max'] = 0
        out_data['cb_depth_frz_min'] = 0
        out_data['cb_depth_frz_mean'] = 0
        out_data['cb_depth_frz_max'] = 0
    
    return out_data
    
# # test the function
# game_id = 2018090600
# play_id = 190
# play_track_df = track_df[(track_df.gameId==game_id) & (track_df.playId == play_id)]
# test_df = nflutil.transform_tracking_data(play_track_df)
# out=create_play_features(test_df)
# out

In [6]:
# Transform the raw tracking data so that all offensive plays face the same direction,
# group the tracking data for each play together
test_df_group = nflutil.transform_tracking_data(track_df).groupby(['gameId', 'playId'])

# ------ Create the features for each play ---------------------
feature_df = pd.DataFrame()

col_names = []
values = []

# loop over each play
for (loop_game_id, loop_play_id), loop_track_df in test_df_group:
    
    # error block for easier debugging if a particular play runs into an error
    try:
        features = create_play_features(loop_track_df)
    except Exception as err:
        print(f'error in gameId {loop_game_id}, playId {loop_play_id}')
        raise err
    
    # first loop: save the output dataframe column names (gameId, playId, all feature names)
    if not col_names: # empty
        col_names.extend(['gameId', 'playId'])
        col_names.extend(features.index.tolist())
    
    # save the gameId, playId, and all feature values into a list
    loop_values = [loop_game_id, loop_play_id]
    loop_values.extend(features.values.tolist())
    values.append(loop_values)
    
# convert the features into a dataframe (1 row per play), inner join on plays with labeled coverages
feature_df = pd.DataFrame(values, columns=col_names)
labeled_play_df = pd.merge(feature_df, coverage_df.dropna(), on=['gameId', 'playId'])
labeled_play_df

Unnamed: 0,gameId,playId,depth_mean,depth_var,y_mean,y_var,speed_mean,speed_var,off_mean,off_var,...,n_def_excess,def_spac_frz_avg,def_spac_frz_min,cb_depth_snap_min,cb_depth_snap_mean,cb_depth_snap_max,cb_depth_frz_min,cb_depth_frz_mean,cb_depth_frz_max,coverage
0,2018090600,75,6.875714,1.619591,23.388571,0.415971,1.818681,2.054815,6.534640,1.894467,...,2.0,7.121844,5.312721,6.19,6.835000,7.48,8.31,8.485000,8.66,Cover 3 Zone
1,2018090600,146,6.649796,0.282480,24.884422,0.608430,1.509456,1.219698,6.675726,1.541242,...,2.0,7.121623,2.633040,2.89,5.983333,7.56,2.67,6.923333,9.50,Cover 3 Zone
2,2018090600,168,6.391845,1.327734,27.503036,1.045686,1.901667,1.471704,7.440575,3.672153,...,2.0,7.427226,3.211386,1.67,4.216667,7.45,2.20,6.480000,9.37,Cover 3 Zone
3,2018090600,190,8.516954,3.158885,29.036667,1.217060,2.337011,1.311433,8.083737,4.916043,...,1.0,7.582981,5.183763,1.75,4.716667,7.18,4.97,6.453333,9.24,Cover 3 Zone
4,2018090600,256,2.706952,0.302289,26.748571,0.789577,1.679143,2.084354,3.853772,0.562736,...,0.0,8.832102,5.894277,0.00,0.000000,0.00,0.00,0.000000,0.00,Cover 0 Man
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1023,2018091001,3787,6.136406,1.091074,27.771244,5.738808,2.570922,2.774239,4.573031,3.629912,...,2.0,4.222899,3.788575,2.47,4.896667,6.48,3.00,5.743333,7.45,Cover 3 Zone
1024,2018091001,3904,11.428341,5.720115,29.338894,2.328007,2.858986,2.083819,9.410524,3.283435,...,2.0,9.470289,7.590415,5.45,7.403333,9.46,6.68,9.573333,12.56,Cover 4 Zone
1025,2018091001,3928,10.805048,5.260766,26.655238,2.885616,3.017714,2.772459,8.939152,5.059852,...,2.0,8.178501,5.736009,5.50,7.116667,7.94,7.05,9.636667,10.96,Cover 3 Zone
1026,2018091001,3952,10.190000,3.706458,24.972442,2.375189,2.584470,1.789000,7.638372,3.746163,...,2.0,9.830320,7.565481,5.46,8.620000,10.56,7.10,10.553333,13.05,Cover 4 Zone


Labeled coverage counts:

In [7]:
labeled_play_df.coverage.value_counts()

Cover 3 Zone    352
Cover 1 Man     296
Cover 4 Zone    152
Cover 2 Zone    113
Cover 6 Zone     69
Cover 2 Man      32
Cover 0 Man      13
Prevent Zone      1
Name: coverage, dtype: int64

For labeled coverages, prevent won't be trained. It will be easy to spot with large average depth of the defenders at the snap.

Create extra labels if the classification problem is split into separate "Cover X" and binary "Man/Zone" models

In [9]:
# split coverage into "Cover X" and Zone labels

# Zone
labeled_play_df['zone'] = 0
labeled_play_df.loc[labeled_play_df.coverage.str.contains('Zone'), 'zone'] = 1

# Cover X (Prevent will be listed as np.nan)
labeled_play_df['cover'] = np.nan
labeled_play_df.loc[labeled_play_df.coverage.str.contains('Cover'), 'cover'] = (
    labeled_play_df.loc[labeled_play_df.coverage.str.contains('Cover'), 'coverage'].apply(
    lambda x: int(x.split()[1]))
)

# Classification of All Coverage Types (Multi-class, abbr = "_ac")

## Train-test split

Coverage label is encoded into integer labels

In [14]:
# drop the prevent defense example from training (will handle that case outside of algorithm)
ac_df = labeled_play_df.copy()
ac_df = ac_df[ac_df.coverage != 'Prevent Zone']

# train test split
y_ac_encoder = OrdinalEncoder().fit(ac_df.coverage.to_numpy().reshape(-1,1))
y_ac = y_ac_encoder.transform(ac_df.coverage.to_numpy().reshape(-1,1)).ravel()
X_ac = ac_df.drop(columns=['gameId', 'playId', 'coverage', 'zone', 'cover'])
X_ac_train, X_ac_test, y_ac_train, y_ac_test = train_test_split(X_ac, y_ac, 
                                                                test_size=0.2, stratify=y_ac, random_state=123456)

Confirm no leakage variables in feature set (gameId, playId, labels, etc.)

In [15]:
X_ac.columns

Index(['depth_mean', 'depth_var', 'y_mean', 'y_var', 'speed_mean', 'speed_var',
       'off_mean', 'off_var', 'off_dir_mean', 'off_dir_var', 'def_mean',
       'def_var', 'rat_mean', 'rat_var', 'rat_o_los', 'n_cover', 'n_cb',
       'n_deep_snap', 'n_deep_frz', 'n_def_excess', 'def_spac_frz_avg',
       'def_spac_frz_min', 'cb_depth_snap_min', 'cb_depth_snap_mean',
       'cb_depth_snap_max', 'cb_depth_frz_min', 'cb_depth_frz_mean',
       'cb_depth_frz_max'],
      dtype='object')

In [16]:
y_ac_encoder.categories_

[array(['Cover 0 Man', 'Cover 1 Man', 'Cover 2 Man', 'Cover 2 Zone',
        'Cover 3 Zone', 'Cover 4 Zone', 'Cover 6 Zone'], dtype=object)]

## Train model

StandardScaler isn't necessary for random forest, but included in case the pipeline is modified later to include another algorithm that does need feature scaling

In [17]:
# build pipeline
estimators_ac = [('normalize', StandardScaler()),
              ('clf', RandomForestClassifier(n_estimators=100, random_state=123456))]
pipe_ac = Pipeline(estimators_ac)

# build grid search
param_grid_ac = {
    'clf__criterion': ['gini', 'entropy'],
    'clf__max_depth': [None, 2, 4, 8],
    'clf__max_features': ['auto', 'sqrt', 'log2'],
    'clf__min_samples_leaf': [5, 10]  # min > 1 to help control overfitting
}

clf_ac = GridSearchCV(pipe_ac, param_grid=param_grid_ac)
clf_ac.fit(X_ac_train, y_ac_train)
print(f'Train accuracy: {accuracy_score(y_ac_train, clf_ac.predict(X_ac_train)):.3f}')
print(f'CV average accuracy: {clf_ac.best_score_:.3f}')

Train accuracy: 0.853
CV average accuracy: 0.604


In [18]:
clf_ac.best_params_

{'clf__criterion': 'entropy',
 'clf__max_depth': None,
 'clf__max_features': 'auto',
 'clf__min_samples_leaf': 5}

In [19]:
y_ac_pred = clf_ac.predict(X_ac_test)
print(f'Test accuracy: {accuracy_score(y_ac_test, y_ac_pred):.3f}')
print('Confusion Matrix: ')
pd.DataFrame(confusion_matrix(y_ac_test, y_ac_pred),
                   index=('(A) ' + y_ac_encoder.categories_[0]),
                   columns=('(P) ' + y_ac_encoder.categories_[0]))

Test accuracy: 0.549
Confusion Matrix: 


Unnamed: 0,(P) Cover 0 Man,(P) Cover 1 Man,(P) Cover 2 Man,(P) Cover 2 Zone,(P) Cover 3 Zone,(P) Cover 4 Zone,(P) Cover 6 Zone
(A) Cover 0 Man,0,1,0,0,1,1,0
(A) Cover 1 Man,0,37,0,3,19,0,0
(A) Cover 2 Man,0,3,0,3,0,0,0
(A) Cover 2 Zone,0,3,0,10,6,4,0
(A) Cover 3 Zone,0,12,0,2,56,1,0
(A) Cover 4 Zone,0,6,0,4,10,10,0
(A) Cover 6 Zone,0,1,0,2,8,3,0


In [20]:
pd.Series(clf_ac.best_estimator_['clf'].feature_importances_, index=X_ac.columns).sort_values(ascending=False)

n_deep_frz            0.089372
depth_mean            0.074819
cb_depth_snap_min     0.066593
cb_depth_frz_max      0.062645
cb_depth_frz_mean     0.054000
off_mean              0.053256
n_deep_snap           0.049379
speed_var             0.042502
cb_depth_frz_min      0.036005
def_spac_frz_min      0.033978
def_spac_frz_avg      0.032988
cb_depth_snap_max     0.032573
cb_depth_snap_mean    0.032443
def_mean              0.031342
off_dir_var           0.029778
rat_mean              0.029770
speed_mean            0.027574
def_var               0.026961
off_var               0.026206
depth_var             0.024261
rat_var               0.023554
rat_o_los             0.022582
off_dir_mean          0.020501
y_mean                0.020008
y_var                 0.019924
n_cover               0.017021
n_def_excess          0.014789
n_cb                  0.005178
dtype: float64

# Classification of Zone vs. Man (Binary)

## Train-test split

In [24]:
# use whole set of data
zone_df = labeled_play_df.copy()

# train test split
y_z = zone_df.zone
X_z = zone_df.drop(columns=['gameId', 'playId', 'coverage', 'zone', 'cover'])
X_z_train, X_z_test, y_z_train, y_z_test = train_test_split(X_z, y_z, test_size=0.2, stratify=y_z, random_state=123456)

## Train model

StandardScaler isn't necessary for random forest, but included in case the pipeline is modified later to include another algorithm that does need feature scaling

In [25]:
# build pipeline
estimators_z = [('normalize', StandardScaler()),
              ('clf', RandomForestClassifier(n_estimators=100, random_state=123456))]
pipe_z = Pipeline(estimators_z)

# build grid search
param_grid_z = {
    'clf__criterion': ['gini', 'entropy'],
    'clf__max_depth': [None, 2, 4, 8],
    'clf__max_features': ['auto', 'sqrt', 'log2'],
    'clf__min_samples_leaf': [5, 10]  # min > 1 to help control overfitting
}

clf_z = GridSearchCV(pipe_z, param_grid=param_grid_z)
clf_z.fit(X_z_train, y_z_train)
print(f'Train accuracy: {accuracy_score(y_z_train, clf_z.predict(X_z_train)):.3f}')
print(f'CV average accuracy: {clf_z.best_score_:.3f}')

Train accuracy: 0.939
CV average accuracy: 0.820


In [26]:
clf_z.best_params_

{'clf__criterion': 'gini',
 'clf__max_depth': None,
 'clf__max_features': 'auto',
 'clf__min_samples_leaf': 5}

In [27]:
y_z_pred = clf_z.predict(X_z_test)
print(f'Test accuracy: {accuracy_score(y_z_test, y_z_pred):.3f}')
print('Confusion Matrix: ')
pd.DataFrame(confusion_matrix(y_z_test, y_z_pred),
                   index=['Actual Man', 'Actual Zone'],
                   columns=['Pred. Man', 'Pred. Zone'])

Test accuracy: 0.825
Confusion Matrix: 


Unnamed: 0,Pred. Man,Pred. Zone
Actual Man,40,28
Actual Zone,8,130


In [28]:
pd.Series(clf_z.best_estimator_['clf'].feature_importances_, index=X_z.columns).sort_values(ascending=False)

cb_depth_snap_min     0.125662
depth_mean            0.108432
off_mean              0.087439
speed_var             0.058136
cb_depth_snap_mean    0.047357
cb_depth_frz_mean     0.044158
cb_depth_frz_max      0.037681
def_spac_frz_min      0.035857
rat_var               0.035204
cb_depth_frz_min      0.034542
cb_depth_snap_max     0.034204
n_deep_frz            0.033848
n_cover               0.033338
y_var                 0.029963
rat_mean              0.028480
off_var               0.024495
def_var               0.023634
def_mean              0.023052
depth_var             0.022786
def_spac_frz_avg      0.019047
y_mean                0.019022
n_def_excess          0.018677
off_dir_var           0.017568
off_dir_mean          0.016782
rat_o_los             0.015651
speed_mean            0.015331
n_deep_snap           0.005777
n_cb                  0.003881
dtype: float64