## Introduction

In the fast-paced realm of NFL football, where every yard gained or lost can be the difference between victory and defeat, evaluating defensive performance becomes paramount. The primary goal of this project is to delve into the intricacies of a team's defensive capabilities, measured through the lens of **"Yards Saved Above Expected"** and the consequential yards allowed due to missed tackles. By analyzing these metrics, these insights aim to provide coaches and teams with a comprehensive understanding of their defensive strengths and weaknesses, ultimately offering valuable insights into the plays that shape the outcome of a game.

Once the ball is snapped and in play, the defense's mission is clear: minimize the yardage gained by the opposing team. A single well-executed tackle or, conversely, a missed tackle, can be a game-changer. The significance of these moments cannot be overstated; a successful tackle can prevent a potential scoring play, while a failed one might allow the offense to seize a crucial advantage. In this context, effective tackle performance emerges as a pivotal indicator of a team's overall defensive prowess.To achieve these insights, we employed a sophisticated approach, utilizing a histogram-based gradient boosting regression tree. This machine learning model is trained on data from weeks 1 to 8, incorporating essential factors from each play. Subsequently, the model is applied to week 9 data, enabling a comprehensive evaluation of team performances during this pivotal period.

In the following sections, I will delve into the methodology, present key findings, and illuminate plays of significant impact. This analysis strives to empower teams and coaches with actionable information, fostering a deeper understanding of their defensive strategies and setting the stage for continuous improvement.

In [1]:
import numpy as np
import pandas as pd
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.preprocessing import LabelEncoder

## Data Preparation

In the pursuit of assessing defensive performance in the NFL, it was necessary to clean up the data and be meticulous about what exactly should be evaluated. The key features in the data that were considered are as follows:
* **Tracking Data**: gameId, playId, nflId, frameId, playDirection, x, y, event
* **Game Data**: gameId, week, homeTeamAbbr, visitorTeamAbbr
* **Play Data**: gameId, playId, ballCarrierId, yardsToGo, defensiveTeam, passResult, playResult, playNullifiedByPenalty, offenseFormation, defendersInTheBox
* **Player Data**: nflId, displayName
* **Tackles Data**: gameId, playId, nflId, tackle, assist, forcedFumble, pff_missedTackle

As mentioned in the introduction, the data was split into two subsets: data from Weeks 1-8 and data from Week 9. Data from Weeks 1-8 was used to train the ML model used. The model was then applied to Week 9 to generate insights.

In [2]:
tracking_data = pd.DataFrame()

for week in range(1, 10):
    file_path = f'../input/nfl-big-data-bowl-2024/tracking_week_{week}.csv'  
    week_data = pd.read_csv(file_path)
    tracking_data = pd.concat([tracking_data, week_data], ignore_index=True)
    raw_tracking_data = pd.concat([tracking_data, week_data], ignore_index=True)
    
tracking_data = tracking_data.drop(columns=["displayName", "frameId", "time", "jerseyNumber", "playDirection", "s", "a", "dis","o","dir" ])
tracking_data = tracking_data[(tracking_data['club'] != 'football')]
tracking_data = tracking_data[(tracking_data['event'] == 'pass_outcome_caught') | (tracking_data['event'] == 'handoff') | (tracking_data['event'] == 'run') | (tracking_data['event'] == 'lateral') | (tracking_data['event'] == 'pass_shovel') | (tracking_data['event'] == 'snap_direct')]

In [3]:
plays_df = pd.read_csv('../input/nfl-big-data-bowl-2024/plays.csv')
games_df = pd.read_csv('../input/nfl-big-data-bowl-2024/games.csv')
tackles_df = pd.read_csv('../input/nfl-big-data-bowl-2024/tackles.csv')
players_df = pd.read_csv('../input/nfl-big-data-bowl-2024/players.csv')

plays_df = plays_df.drop(columns=["quarter", "down", "gameClock", "yardlineSide", "yardlineNumber", "ballCarrierDisplayName", "penaltyYards", "preSnapHomeScore", "passProbability", "preSnapVisitorScore","absoluteYardlineNumber", "preSnapHomeTeamWinProbability", "preSnapVisitorTeamWinProbability", "homeTeamWinProbabilityAdded", "visitorTeamWinProbilityAdded", "expectedPoints", "expectedPointsAdded", "foulName1", "foulName2", "foulNFLId1", "foulNFLId2"])
plays_df = plays_df[(plays_df['playNullifiedByPenalty'] == 'N')]
plays_df["passResult"] = plays_df["passResult"].fillna("RUN")

games_df = games_df.drop(columns=["season", "gameDate", "gameTimeEastern", "homeFinalScore", "visitorFinalScore"])

In [4]:
week_1_to_8_gameIds = games_df[games_df['week'].isin(range(1, 9))]['gameId']
week_9_gameIds = games_df[games_df['week'] == 9]['gameId']

In [5]:
game_tackles_df = pd.merge(plays_df, tackles_df, on=['gameId', 'playId'])
game_tackles_df = game_tackles_df[(game_tackles_df['tackle'] == 1) | (game_tackles_df['assist'] == 1)]

### Custom Data Insight: DistanceFromClosestDefender
One important factor when it comes to tackling is how close a defender is to the ball carrier. If the ball is thrown down the field and there is no defender within 5 yards of the receiver, the receiver will typically gain more yards than they would if the defender was running alongside the receiver, just half a yard away. Since this data was not provided alongside the tracking data, it was important to quantify the proximity of the ball carrier to the nearest defender at the moment the ball is in play and no longer with the quarterback (unless in the case of a QB scramble).

The distance metric serves as a crucial input for our machine learning model, capturing the effectiveness of defensive positioning in relation to typical yards gained during a play. The provided **getDistance** function takes the x and y coordinates of two player positions on the field - one representing the ball carrier and the other a defender, and calculates the Euclidean distance between them using the Pythagorean theorem.

The **getClosestDefender** function is applied to each tackle row in the data. It identifies the home and visitor team in order to not take in account offensive players proximities to the ball carrier. The function then selects the position data for all defensive players on that same play and calculates the distance to the ball carrier using the getDistance function outlined above. The minimum distance among all defenders is determined and returned as the result. This value is then used in the model to help evaluate the effectiveness of the tackles that are occurring. 

In [6]:
def getDistance(player_x, player_y, ball_carrier_x, ball_carrier_y):
    return ((player_x - ball_carrier_x) ** 2 + (player_y - ball_carrier_y) ** 2) ** 0.5

def getClosestDefender(row):
    selected_game = games_df[games_df['gameId'] == row['gameId']].values[0]
    game_tracking_df = tracking_data[tracking_data['gameId'] == row['gameId']]
    
    homeTeam = selected_game[2]
    visitorTeam = selected_game[3]
    
    playId = row['playId']
    ballCarrierId = row['ballCarrierId']
    ball_in_hand_play_df = game_tracking_df[(game_tracking_df['nflId'] == ballCarrierId) & (game_tracking_df['playId'] == playId)]
    if len(ball_in_hand_play_df['club'].values) > 0:
        ballCarrierClub = ball_in_hand_play_df['club'].values[0]
        
        if ballCarrierClub == homeTeam:
            opponent_club = visitorTeam
        else:
            opponent_club = homeTeam

        other_players_position_df = game_tracking_df[(game_tracking_df['nflId'] != ballCarrierId) & (game_tracking_df['playId'] == playId) & (game_tracking_df['club'] == opponent_club)]
    
        if not ball_in_hand_play_df.empty:
            ball_carrier_x = ball_in_hand_play_df['x'].values[0]
            ball_carrier_y = ball_in_hand_play_df['y'].values[0]

            other_players_position_df['distance_to_ball_carrier'] = getDistance(other_players_position_df['x'], other_players_position_df['y'], ball_carrier_x, ball_carrier_y)
            min_distance = other_players_position_df['distance_to_ball_carrier'].min()

            return min_distance
    else:
        return 'NaN'
    
    
game_tackles_df['distanceFromClosestDefender'] = game_tackles_df.apply(getClosestDefender, axis=1)

In [7]:
game_tackles_df = game_tackles_df.dropna(subset=['distanceFromClosestDefender'])
game_tackles_df = game_tackles_df.drop(columns=['possessionTeam', 'passLength', 'playNullifiedByPenalty'])

## Using ML To Calculate 'yardsSavedAboveExpected'

### Variables Used in the Model
* **offenseFormation**: Formation used by the possession team, transformed into a numerical representation using a LabelEncoder
* **passResult**: Dropback outcome of the play, all NA values were treated as run plays, transformed into a numerical representation using a LabelEncoder
* **defendersInTheBox**: Number of defenders in close proximinity to line-of-scrimmage
* **yardsToGo**: Distance needed for a first down
* **distanceFromClosestDefender**: Metric calculated to represent the proximity of the ball carrier to the nearest defender at the moment the ball is in play
* **prePenaltyPlayResult**: Net yards gained by the offense, before penalty yardage

### Histogram-Based Gradient Boosting Regressor
The choice of using a Histogram-Based Gradient Boosting Regressor for this specific use case was due to the following factors:
* **Robustness to Outliers**
    * Gradient Boosting models, in general, are known for their robustness to outliers. The ensemble nature of boosting helps in reducing the impact of individual data points that might have extreme values, making the model more resilient to noisy data.
* **Non-Linear Relationships**
    * A histogram-based approach is well-suited for capturing non-linear relationships within the data. In the context of NFL tackle performance, where the impact of various factors on play outcomes may not follow a linear pattern, a non-linear model can better capiture the complexities of the relationships.
* **Predicting Continuous Outputs**
    * Since the task involves predicting yards gained on tackling plays, which is a continuous variable, a regression model is appropriate. Gradient Boosting Regressors excelt at predicting continuous outputs.
* **Consistent Model Performance**
    * Gradient Boosting models, including the histogram-based variant, often exhibit stable and reliable performance across different types of datasets. This consistency is valuable for generating reliable insights in a sports analytics context.


In [8]:
le = LabelEncoder()

game_tackles_df['offenseFormation_idx'] = le.fit_transform(game_tackles_df['offenseFormation'])
game_tackles_df['passResult_idx'] = le.fit_transform(game_tackles_df['passResult'])

mean_value = game_tackles_df['defendersInTheBox'].mean()
game_tackles_df.loc[game_tackles_df['defendersInTheBox'].isna(), 'defendersInTheBox'] = mean_value

weeks_1_to_8_tackles_df = game_tackles_df[game_tackles_df['gameId'].isin(week_1_to_8_gameIds)]
week_9_tackles_df = game_tackles_df[game_tackles_df['gameId'].isin(week_9_gameIds)]

X_train = weeks_1_to_8_tackles_df[['yardsToGo', 'distanceFromClosestDefender', 'defendersInTheBox', 'offenseFormation_idx', 'passResult_idx']]
y_train = weeks_1_to_8_tackles_df['prePenaltyPlayResult']

X_test = week_9_tackles_df[['yardsToGo', 'distanceFromClosestDefender', 'defendersInTheBox', 'offenseFormation_idx', 'passResult_idx']]
y_test = week_9_tackles_df['prePenaltyPlayResult']

X_train.loc[X_train['defendersInTheBox'].isna(), 'defendersInTheBox'] = X_train['defendersInTheBox'].mean()
X_test.loc[X_test['defendersInTheBox'].isna(), 'defendersInTheBox'] = X_test['defendersInTheBox'].mean()

In [9]:
regr = HistGradientBoostingRegressor()

regr.fit(X_train, y_train)

def getExpectedYards(row, regr):
    columns = ['yardsToGo', 'distanceFromClosestDefender', 'defendersInTheBox', 'offenseFormation_idx', 'passResult_idx']
    data = row[columns].values.reshape(1, -1)
    df = pd.DataFrame(data, columns=columns)
    
    return round(regr.predict(df)[0], 2)

week_9_tackles_df.loc[:, 'expectedYards'] = week_9_tackles_df.apply(getExpectedYards, axis=1, regr=regr)
week_9_tackles_df.loc[:, 'yardsSavedAboveExpected'] = week_9_tackles_df['expectedYards'] - week_9_tackles_df['playResult']
week_9_tackles_df.loc[:, 'yardsSavedAboveExpected'] = week_9_tackles_df['yardsSavedAboveExpected'].apply(lambda x: max(0, x))

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  week_9_tackles_df.loc[:, 'expectedYards'] = week_9_tackles_df.apply(getExpectedYards, axis=1, regr=regr)
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  week_9_tackles_df.loc[:, 'yardsSavedAboveExpected'] = week_9_tackles_df['expectedYards'] - week_9_tackles_df['playResult']


## Determining 'yardsAllowedByMissedTackle'

The function **getYardsAllowed** identifies plays with missed tackles during week 9, extracts relevant tracking data, and calculates the yards allowed due to each missed tackle. This function makes the assumption that the point that when the defender is at the position where they are closest to the football, they are also at the position that the missed tackle takes place.

In [10]:
missed_tackles_df = week_9_tackles_df[week_9_tackles_df['pff_missedTackle'] == 1]
missed_tackles_plays = missed_tackles_df[['gameId', 'playId', 'nflId']].drop_duplicates()
missed_tackle_players = missed_tackles_plays['nflId'].drop_duplicates()

week_9_tracking_df = raw_tracking_data[raw_tracking_data['gameId'].isin(week_9_gameIds)]
week_9_tracking_df = week_9_tracking_df[(week_9_tracking_df['nflId'].isin(missed_tackle_players)) | (week_9_tracking_df['club'] == 'football')]

missed_tracking_df = pd.merge(missed_tackles_plays, week_9_tracking_df, on=['gameId', 'playId'], how='left')

def getYardsAllowed(row):
    selected_game_df = games_df[games_df['gameId'] == row['gameId']]
    selected_game = games_df[games_df['gameId'] == row['gameId']]['gameId'].values[0]
    playId = row['playId']
    playerId = row['nflId']
    homeTeam = selected_game_df['homeTeamAbbr'].values[0]
    awayTeam = selected_game_df['visitorTeamAbbr'].values[0]
    missed_tackle_player_tracking_df = missed_tracking_df[(missed_tracking_df['gameId'] == selected_game) & (missed_tracking_df['playId'] == playId) & (missed_tracking_df['nflId_y'] == playerId)]
    missed_tackle_football_tracking_df = missed_tracking_df[(missed_tracking_df['gameId'] == selected_game) & (missed_tracking_df['playId'] == playId) & (missed_tracking_df['displayName'] == 'football')]
    
    min_distance_frameId = None
    min_distance = np.inf

    for index, row in missed_tackle_player_tracking_df.iterrows():
        distances = np.sqrt((missed_tackle_football_tracking_df['x'] - row['x'])**2 + (missed_tackle_football_tracking_df['y'] - row['y'])**2)
        closest_frameId = missed_tackle_football_tracking_df.loc[distances.idxmin(), 'frameId']
        if distances.min() < min_distance:
            min_distance = distances.min()
            min_distance_frameId = closest_frameId
    
    missed_tackle_x_location = missed_tackle_football_tracking_df[missed_tackle_football_tracking_df['frameId'] == min_distance_frameId]['x'].values[0]
    playDirection = missed_tackle_football_tracking_df[missed_tackle_football_tracking_df['frameId'] == min_distance_frameId]['playDirection'].values[0]
    end_of_play_location = missed_tackle_football_tracking_df['x'].iloc[-6]
    
    if playDirection == 'left':
        return round(missed_tackle_x_location - end_of_play_location, 2)
    else:
        return round(end_of_play_location - missed_tackle_x_location, 2)
    
missed_tackles_df['yardsAllowedByMissedTackle'] = missed_tackles_df.apply(getYardsAllowed, axis=1)


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  missed_tackles_df['yardsAllowedByMissedTackle'] = missed_tackles_df.apply(getYardsAllowed, axis=1)


In [11]:
columns_to_include = ['gameId', 'playId', 'nflId', 'yardsAllowedByMissedTackle']
missed_tackles_df = missed_tackles_df[columns_to_include]
display(missed_tackles_df)

Unnamed: 0,gameId,playId,nflId,yardsAllowedByMissedTackle
112,2022110610,2629,46846,1.18
1130,2022110601,3266,43503,2.98
1964,2022110605,1030,53450,2.84
2398,2022110603,441,54653,4.4
4737,2022110610,2875,47971,1.76
5183,2022110609,445,43335,0.04
6712,2022110700,588,44851,2.78
7633,2022110600,2782,52594,4.19
10531,2022110608,1535,44848,0.8
13744,2022110604,2888,47790,0.05


In [12]:
processed_tackles_df = week_9_tackles_df.drop(columns=["ballCarrierId", "yardsToGo", "playResult", "offenseFormation", "defendersInTheBox", "distanceFromClosestDefender", "offenseFormation_idx", "passResult_idx", "expectedYards"])
processed_tackles_df = pd.merge(processed_tackles_df, missed_tackles_df, on=['nflId', 'gameId', 'playId'], how='left')
processed_tackles_df.fillna(0, inplace=True)

In [13]:
most_yards_saved = processed_tackles_df.groupby('gameId')['yardsSavedAboveExpected'].idxmax()
most_yards_given_up = processed_tackles_df.groupby('gameId')['yardsAllowedByMissedTackle'].idxmax()

most_yards_saved_df = processed_tackles_df.loc[most_yards_saved, ['gameId', 'playId', 'yardsSavedAboveExpected']]
most_yards_given_up_df = processed_tackles_df.loc[most_yards_given_up, ['gameId', 'playId', 'yardsAllowedByMissedTackle']]

In [14]:
tackle_df = processed_tackles_df[(processed_tackles_df['tackle'] == 1) & (processed_tackles_df['pff_missedTackle'] != 1)]
sum_tackle = tackle_df.groupby('nflId')['yardsSavedAboveExpected'].sum()

assist_df = processed_tackles_df[(processed_tackles_df['assist'] == 1)]
sum_assist = assist_df.groupby('nflId')['yardsSavedAboveExpected'].sum() * 0.5

missed_tackle_df = processed_tackles_df[processed_tackles_df['pff_missedTackle'] == 1]
sum_missed_tackle = missed_tackle_df.groupby('nflId')['yardsAllowedByMissedTackle'].sum()

sum_tackle_reset = sum_tackle.reset_index()
sum_assist_reset = sum_assist.reset_index()
sum_missed_reset = sum_missed_tackle.reset_index()

merged_left = pd.merge(sum_tackle_reset, sum_assist_reset, on='nflId', how='left', suffixes=('_tackle', '_assist'))
merged_right = pd.merge(sum_tackle_reset, sum_assist_reset, on='nflId', how='right', suffixes=('_tackle', '_assist'))

final_sum = pd.concat([merged_left, merged_right], ignore_index=True)
final_sum = pd.merge(final_sum, sum_missed_reset, on='nflId', how='left')
final_sum = final_sum.fillna(0)

final_sum.loc[:, 'yardsSavedAboveExpected_total'] = final_sum['yardsSavedAboveExpected_tackle'] + final_sum['yardsSavedAboveExpected_assist']

In [15]:
sum_columns = ['tackle', 'assist', 'forcedFumble', 'pff_missedTackle']

total_sum = processed_tackles_df.groupby('nflId')[sum_columns].sum().reset_index()
total_sum.columns = ['nflId'] + ['total' + col.title() for col in sum_columns]
total_sum.fillna(0, inplace=True)

total_sum = pd.merge(total_sum, final_sum, on='nflId', how='left')
total_sum = pd.merge(total_sum, players_df[['nflId', 'displayName']], on='nflId', how='left')

In [16]:
unique_team_player_list_df = week_9_tackles_df.groupby('defensiveTeam')['nflId'].agg(list).reset_index()

team_list = []

for index, row in unique_team_player_list_df.iterrows():
    filtered_rows = total_sum[total_sum['nflId'].isin(row['nflId'])]
    filtered_rows = filtered_rows.drop_duplicates()
    
    sum_tackles = filtered_rows['totalTackle'].sum()
    sum_assists = filtered_rows['totalAssist'].sum()
    sum_forced_fumbles = filtered_rows['totalForcedfumble'].sum()
    sum_missed_tackle = filtered_rows['totalPff_Missedtackle'].sum()
    sum_yards_saved = filtered_rows['yardsSavedAboveExpected_tackle'].sum()
    sum_yards_allowed = filtered_rows['yardsAllowedByMissedTackle'].sum()
    
    team_list.append({
        'defensiveTeam': row['defensiveTeam'],
        'teamTotalTackles': sum_tackles,
        'teamTotalAssists': sum_assists,
        'teamTotalForcedFumbles': sum_forced_fumbles,
        'teamTotalMissedTackles': sum_missed_tackle,
        'teamTotalYardsSavedFromTackle': sum_yards_saved,
        'teamTotalYardsAllowed': sum_yards_allowed,
        'averageYardsSavedPerTackle': round(sum_yards_saved / sum_tackles, 2)
    })
    
teams_data_df = pd.DataFrame(team_list)
teams_data_df =teams_data_df.sort_values(by='teamTotalYardsSavedFromTackle', ascending=False)
display(teams_data_df)

Unnamed: 0,defensiveTeam,teamTotalTackles,teamTotalAssists,teamTotalForcedFumbles,teamTotalMissedTackles,teamTotalYardsSavedFromTackle,teamTotalYardsAllowed,averageYardsSavedPerTackle
10,IND,40,9,2,1,136.57,2.84,3.41
1,ATL,35,36,1,0,133.7,0.0,3.82
13,LA,39,28,0,1,127.26,0.04,3.26
25,WAS,37,21,0,0,121.2,0.0,3.28
22,SEA,34,9,1,0,116.58,0.0,3.43
0,ARI,45,23,0,1,111.25,0.8,2.47
24,TEN,47,30,0,2,107.77,2.94,2.29
14,LAC,37,22,1,1,103.93,4.19,2.81
16,MIA,43,23,0,1,101.42,2.98,2.36
23,TB,24,18,0,0,101.0,0.0,4.21


In [17]:
columns = ['defensiveTeam','teamTotalTackles','averageYardsSavedPerTackle','teamTotalYardsSavedFromTackle']
teams_data_df_saved = teams_data_df[columns]
display(teams_data_df_saved.style.background_gradient(axis=0, subset='teamTotalYardsSavedFromTackle'))

Unnamed: 0,defensiveTeam,teamTotalTackles,averageYardsSavedPerTackle,teamTotalYardsSavedFromTackle
10,IND,40,3.41,136.57
1,ATL,35,3.82,133.7
13,LA,39,3.26,127.26
25,WAS,37,3.28,121.2
22,SEA,34,3.43,116.58
0,ARI,45,2.47,111.25
24,TEN,47,2.29,107.77
14,LAC,37,2.81,103.93
16,MIA,43,2.36,101.42
23,TB,24,4.21,101.0


In [18]:
columns = ['defensiveTeam', 'teamTotalMissedTackles', 'teamTotalYardsAllowed']
teams_data_df_allowed = teams_data_df[teams_data_df['teamTotalMissedTackles'] != 0].sort_values(by='teamTotalYardsAllowed', ascending=False)
display(teams_data_df_allowed[columns].style.background_gradient(axis=0, subset='teamTotalYardsAllowed'))

Unnamed: 0,defensiveTeam,teamTotalMissedTackles,teamTotalYardsAllowed
7,DET,1,4.4
14,LAC,1,4.19
16,MIA,1,2.98
24,TEN,2,2.94
10,IND,1,2.84
19,NO,1,2.78
0,ARI,1,0.8
11,JAX,1,0.05
13,LA,1,0.04


In [19]:
# show play with most yards saved
# show play with most yards given up