# **Tackles Over Expected and Broken Tackles Over Expected Generation**

## **1.0: Introduction**
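This notebook runs each player's test-set observations through the Logistic Regression model and records the true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN) for that player. Defensive players are grouped by `nflId` and ball carriers by `ballCarrierId`. In the definitions below, the positive class is a made tackle.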

For Defense:
- TP = Tackles that should have been made and were
- *FP = Tackles that should have been made and were missed*
- TN = Tackles that should have been missed and were
- **FN = Tackles that should have been missed and were made**

For Ball Carriers:
- TP = They should have been tackled and were
- **FP = They should have been tackled but broke the tackle**
- TN = They should have broken the tackle and did
- *FN = They should have broken the tackle but were tackled*

Key:
- **Bold = positive outcome for the player**
- *Italic = negative outcome for the player*
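
The mapping from an (actual, predicted) pair to one of these four cells can be sketched as follows. This is a minimal illustration, assuming `tackle_status` is coded 1 for a made tackle and 0 for a missed one (implied by the definitions but not stated outright); `confusion_cell` is a hypothetical helper name, not part of the notebook.

```python
def confusion_cell(actual, predicted):
    """Name the confusion-matrix cell for one observation.

    Assumes 1 = tackle made and 0 = tackle missed (an assumption about
    how tackle_status is coded).
    """
    if predicted == 1:
        return 'tp' if actual == 1 else 'fp'
    return 'fn' if actual == 1 else 'tn'


# Model expected a tackle (1) but it was missed (0): a false positive,
# which counts against the defender and in favor of the ball carrier.
print(confusion_cell(actual=0, predicted=1))  # fp
# Model expected a miss (0) but the tackle was made (1): a false negative,
# which counts in favor of the defender.
print(confusion_cell(actual=1, predicted=0))  # fn
```

The same false positive is a negative for the defender and a positive for the ball carrier, which is why the bold and italic markers swap between the two lists.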

### **1.1: Data and Module Load**

In [1]:
import pandas as pd 
import numpy as np
import pickle
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_score, recall_score, f1_score, accuracy_score, confusion_matrix, make_scorer 



In [3]:
test_features = pd.read_csv('/Users/alexiainman/Documents/Big Data Bowl/Data/Modeling_Data/test_features.csv')
test_target = pd.read_csv('/Users/alexiainman/Documents/Big Data Bowl/Data/Modeling_Data/test_target.csv')
players = pd.read_csv('/Users/alexiainman/Documents/Big Data Bowl/Data/players.csv')

In [4]:
test_features.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2917 entries, 0 to 2916
Data columns (total 14 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   nflId          2917 non-null   int64  
 1   s_avg_05       2917 non-null   float64
 2   s_avg_1        2917 non-null   float64
 3   bc_s_avg_05    2917 non-null   float64
 4   bc_s_avg_1     2917 non-null   float64
 5   cos_avg_05     2917 non-null   float64
 6   cos_avg_1      2917 non-null   float64
 7   do_avg_05      2917 non-null   float64
 8   do_avg_1       2917 non-null   float64
 9   bc_do_avg_05   2917 non-null   float64
 10  bc_do_avg_1    2917 non-null   float64
 11  cos_05_null    2917 non-null   int64  
 12  cos_1_null     2917 non-null   int64  
 13  ballCarrierId  2917 non-null   int64  
dtypes: float64(10), int64(4)
memory usage: 319.2 KB


In [5]:
test_target.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2917 entries, 0 to 2916
Data columns (total 1 columns):
 #   Column         Non-Null Count  Dtype
---  ------         --------------  -----
 0   tackle_status  2917 non-null   int64
dtypes: int64(1)
memory usage: 22.9 KB


### **1.2: Loading in Models**

#### **1.2.1: Logistic Regression Model**

In [6]:
# Load the Logistic Regression model
# Define the file path of the saved model
model_file_path = '/Users/alexiainman/Documents/Big Data Bowl/Models/LogisticRegression.pkl'

# Load the model from the file
with open(model_file_path, 'rb') as file:
    logr_model = pickle.load(file)

#### **1.2.2: RandomForestClassifier Model**

In [7]:
# Load the Random Forest model
# Define the file path of the saved model
model_file_path = '/Users/alexiainman/Documents/Big Data Bowl/Models/RandomForestClassifier.pkl'

# Load the model from the file
with open(model_file_path, 'rb') as file:
    rfc_model = pickle.load(file)

## **2.0: Creating a function to run each player's test set observations through a model and save the results**

In [20]:
test = pd.read_csv('/Users/alexiainman/Documents/Big Data Bowl/Data/Processed/test.csv')

In [21]:
test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2917 entries, 0 to 2916
Data columns (total 23 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   gameId            2917 non-null   int64  
 1   playId            2917 non-null   int64  
 2   nflId             2917 non-null   int64  
 3   tackle            2917 non-null   int64  
 4   assist            2917 non-null   int64  
 5   forcedFumble      2917 non-null   int64  
 6   pff_missedTackle  2917 non-null   int64  
 7   game_play_Id      2917 non-null   object 
 8   event_type        2917 non-null   float64
 9   s_avg_05          2917 non-null   float64
 10  s_avg_1           2917 non-null   float64
 11  bc_s_avg_05       2917 non-null   float64
 12  bc_s_avg_1        2917 non-null   float64
 13  cos_avg_05        2917 non-null   float64
 14  cos_avg_1         2917 non-null   float64
 15  do_avg_05         2917 non-null   float64
 16  do_avg_1          2917 non-null   float64


In [34]:
def generate_model_results(df, column, model):
    """Return per-player confusion-matrix counts (tp, fp, tn, fn).

    df     -- test dataframe holding both the features and 'tackle_status'
    column -- grouping key: 'nflId' (defense) or 'ballCarrierId' (ball carriers)
    model  -- fitted classifier exposing .predict()

    Relies on pd, np, and confusion_matrix imported in section 1.1.
    """
    # Identifier, bookkeeping, and target columns that are not model features
    non_feature_cols = ['nflId', 'ballCarrierId', 'gameId', 'playId', 'game_play_Id',
                        'tackle', 'assist', 'forcedFumble', 'pff_missedTackle',
                        'event_type', 'tackle_status']

    unique_values = df[column].unique()
    players_df = pd.DataFrame({column: unique_values, 'tp': np.nan, 'fp': np.nan, 'tn': np.nan, 'fn': np.nan})

    # Leftover counter for 1x1 matrices: never incremented, because
    # labels=[0, 1] below always yields a 2x2 matrix
    count = 0

    for player_Id in unique_values:
        player_df = df[df[column] == player_Id]
        player_df_features = player_df.drop(columns=non_feature_cols)
        player_df_target = player_df['tackle_status']
        player_df_pred = model.predict(player_df_features)

        # Rows are actual values, columns are predictions; labels=[0, 1] fixes the order
        confusion_mat = confusion_matrix(player_df_target, player_df_pred, labels=[0, 1])

        tp = confusion_mat[1, 1]
        fp = confusion_mat[0, 1]
        tn = confusion_mat[0, 0]
        fn = confusion_mat[1, 0]

        # Write all four counts into this player's row at once
        players_df.loc[players_df[column] == player_Id, ['tp', 'fp', 'tn', 'fn']] = [tp, fp, tn, fn]

    print(f'1x1 Matrix count: {count}')
    return players_df
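
To make the grouping logic concrete without a fitted model, the sketch below tallies the same four counts per player from plain `(player_id, actual, predicted)` tuples. The function name `tally_by_player` and the toy records are hypothetical, and the code assumes 1 means the tackle was made.

```python
from collections import defaultdict


def tally_by_player(records):
    """Count tp/fp/tn/fn per player.

    records: iterable of (player_id, actual, predicted) with 0/1 labels,
    where 1 is assumed to mean the tackle was made.
    """
    cell_names = {(1, 1): 'tp', (0, 1): 'fp', (0, 0): 'tn', (1, 0): 'fn'}
    counts = defaultdict(lambda: {'tp': 0, 'fp': 0, 'tn': 0, 'fn': 0})
    for player_id, actual, predicted in records:
        counts[player_id][cell_names[(actual, predicted)]] += 1
    return dict(counts)


# Toy data: player 1 has three events, player 2 has one.
toy_records = [(1, 1, 1), (1, 1, 0), (1, 0, 1), (2, 0, 0)]
toy_counts = tally_by_player(toy_records)
print(toy_counts[1])  # {'tp': 1, 'fp': 1, 'tn': 0, 'fn': 1}
print(toy_counts[2])  # {'tp': 0, 'fp': 0, 'tn': 1, 'fn': 0}
```

Every player gets all four keys even when a cell is empty, which mirrors what `labels=[0, 1]` guarantees for the confusion matrix.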








### **2.1: Running LogisticRegression for Defensive Players**

In [35]:
defensive_players = generate_model_results(test, 'nflId', logr_model)

1x1 Matrix count: 0


In [36]:
defensive_players.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 687 entries, 0 to 686
Data columns (total 5 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   nflId   687 non-null    int64  
 1   tp      687 non-null    float64
 2   fp      687 non-null    float64
 3   tn      687 non-null    float64
 4   fn      687 non-null    float64
dtypes: float64(4), int64(1)
memory usage: 27.0 KB


In [38]:
defensive_players['num_events'] = defensive_players['tp'] + defensive_players['tn'] + defensive_players['fp'] + defensive_players['fn']


In [40]:
defensive_players['tackles_over_exp'] = defensive_players['fn'] - defensive_players['fp']

In [51]:
max_attempts = defensive_players['num_events'].max()

In [52]:
defensive_players['weighted_toe'] = defensive_players['tackles_over_exp'] * (defensive_players['num_events'] / max_attempts)
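
As a worked check of the weighting, the sketch below recomputes `weighted_toe` for two rows of the results table in this section: Tremaine Edmunds (fn = 3, fp = 0, 12 events) and Foyesade Oluokun (fn = 1, fp = 0, 19 events). It assumes `max_attempts` is 19, the largest `num_events` among the rows displayed, which is consistent with the printed values. The standalone function `weighted_toe` is hypothetical and only mirrors the column arithmetic.

```python
def weighted_toe(fn, fp, num_events, max_attempts):
    """Tackles over expected, scaled by the player's share of the maximum event count."""
    tackles_over_exp = fn - fp
    return tackles_over_exp * (num_events / max_attempts)


max_attempts = 19  # assumed: largest num_events among the displayed rows

edmunds = weighted_toe(fn=3, fp=0, num_events=12, max_attempts=max_attempts)
oluokun = weighted_toe(fn=1, fp=0, num_events=19, max_attempts=max_attempts)
print(round(edmunds, 6))  # 1.894737
print(round(oluokun, 6))  # 1.0
```

Because the raw difference is multiplied by a volume factor, two players with the same `tackles_over_exp` rank differently when one of them has more events.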

In [58]:
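# 'toe_per_event' is not created in any cell shown here; errors='ignore' makes the drop a no-op when the column is absent
defensive_players.drop(columns=['toe_per_event'], inplace=True, errors='ignore')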

In [85]:
defensive_players = defensive_players.merge(players[['nflId', 'displayName']], how='left', on='nflId')

In [87]:
defensive_players.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 687 entries, 0 to 686
Data columns (total 9 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   nflId             687 non-null    int64  
 1   tp                687 non-null    float64
 2   fp                687 non-null    float64
 3   tn                687 non-null    float64
 4   fn                687 non-null    float64
 5   num_events        687 non-null    float64
 6   tackles_over_exp  687 non-null    float64
 7   weighted_toe      687 non-null    float64
 8   displayName       687 non-null    object 
dtypes: float64(7), int64(1), object(1)
memory usage: 48.4+ KB


In [88]:
defensive_players.sort_values('weighted_toe', ascending=False).head(50)

| | nflId | tp | fp | tn | fn | num_events | tackles_over_exp | weighted_toe | displayName |
|---|---|---|---|---|---|---|---|---|---|
| 251 | 46085 | 9.0 | 0.0 | 0.0 | 3.0 | 12.0 | 3.0 | 1.894737 | Tremaine Edmunds |
| 326 | 46669 | 9.0 | 0.0 | 2.0 | 2.0 | 13.0 | 2.0 | 1.368421 | Jonathan Owens |
| 26 | 38577 | 8.0 | 1.0 | 0.0 | 3.0 | 12.0 | 2.0 | 1.263158 | Bobby Wagner |
| 461 | 52473 | 7.0 | 0.0 | 1.0 | 2.0 | 10.0 | 2.0 | 1.052632 | Logan Wilson |
| 309 | 46269 | 15.0 | 0.0 | 3.0 | 1.0 | 19.0 | 1.0 | 1.0 | Foyesade Oluokun |
| 355 | 47804 | 6.0 | 0.0 | 1.0 | 2.0 | 9.0 | 2.0 | 0.947368 | Darnell Savage |
| 445 | 52435 | 16.0 | 0.0 | 1.0 | 1.0 | 18.0 | 1.0 | 0.947368 | Jordyn Brooks |
| 647 | 54574 | 6.0 | 0.0 | 1.0 | 2.0 | 9.0 | 2.0 | 0.947368 | Coby Bryant |
| 264 | 46124 | 4.0 | 0.0 | 2.0 | 2.0 | 8.0 | 2.0 | 0.842105 | Donte Jackson |
| 411 | 47996 | 10.0 | 1.0 | 1.0 | 2.0 | 14.0 | 1.0 | 0.736842 | Donovan Wilson |


### **2.2: Running LogisticRegression for Ball Carriers**

In [60]:
ball_carriers = generate_model_results(test, 'ballCarrierId', logr_model)

1x1 Matrix count: 0


In [61]:
ball_carriers.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 380 entries, 0 to 379
Data columns (total 5 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   ballCarrierId  380 non-null    int64  
 1   tp             380 non-null    float64
 2   fp             380 non-null    float64
 3   tn             380 non-null    float64
 4   fn             380 non-null    float64
dtypes: float64(4), int64(1)
memory usage: 15.0 KB


In [63]:
ball_carriers['num_events'] = ball_carriers['tp'] + ball_carriers['tn'] + ball_carriers['fp'] + ball_carriers['fn']

In [71]:
ball_carriers['broken_tackles_over_exp'] = ball_carriers['fp'] - ball_carriers['fn']

In [72]:
max_attempts_bc = ball_carriers['num_events'].max()

In [73]:
ball_carriers['weighted_btoe'] = ball_carriers['broken_tackles_over_exp'] * (ball_carriers['num_events'] / max_attempts_bc)
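
Written out, the two ball-carrier columns computed above take the form below, where $N_i$ is the player's `num_events` and $N_{\max}$ is `max_attempts_bc`. The symbols are notation introduced here for illustration. The worked line uses the top row of the results table in this section (fp = 9, fn = 0, 52 events), with $N_{\max} = 70$ inferred from the largest `num_events` among the displayed rows.

```latex
\begin{aligned}
\mathrm{BTOE}_i &= \mathrm{FP}_i - \mathrm{FN}_i \\
\mathrm{wBTOE}_i &= \mathrm{BTOE}_i \cdot \frac{N_i}{N_{\max}} \\
\text{top row:}\quad \mathrm{wBTOE} &= (9 - 0) \cdot \frac{52}{70} \approx 6.6857
\end{aligned}
```

The sign is reversed relative to the defensive metric (`fn - fp`) because a false positive, a tackle the model expected that did not happen, is a good outcome for the ball carrier.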

In [76]:
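# Neither column is created for ball_carriers in any cell shown here; errors='ignore' makes the drop a no-op when they are absent
ball_carriers.drop(columns=['tackles_over_exp', 'weighted_toe'], inplace=True, errors='ignore')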

In [77]:
ball_carriers.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 380 entries, 0 to 379
Data columns (total 8 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   ballCarrierId            380 non-null    int64  
 1   tp                       380 non-null    float64
 2   fp                       380 non-null    float64
 3   tn                       380 non-null    float64
 4   fn                       380 non-null    float64
 5   num_events               380 non-null    float64
 6   broken_tackles_over_exp  380 non-null    float64
 7   weighted_btoe            380 non-null    float64
dtypes: float64(7), int64(1)
memory usage: 23.9 KB


In [79]:
ball_carriers.sort_values('weighted_btoe', ascending=False).head(50)


| | ballCarrierId | tp | fp | tn | fn | num_events | broken_tackles_over_exp | weighted_btoe |
|---|---|---|---|---|---|---|---|---|
| 27 | 54572 | 36.0 | 9.0 | 7.0 | 0.0 | 52.0 | 9.0 | 6.685714 |
| 1 | 46104 | 32.0 | 8.0 | 6.0 | 0.0 | 46.0 | 8.0 | 5.257143 |
| 66 | 45573 | 34.0 | 7.0 | 8.0 | 0.0 | 49.0 | 7.0 | 4.9 |
| 22 | 43334 | 57.0 | 4.0 | 9.0 | 0.0 | 70.0 | 4.0 | 4.0 |
| 11 | 53454 | 29.0 | 5.0 | 12.0 | 0.0 | 46.0 | 5.0 | 3.285714 |
| 16 | 44995 | 27.0 | 7.0 | 5.0 | 2.0 | 41.0 | 5.0 | 2.928571 |
| 23 | 52449 | 26.0 | 6.0 | 7.0 | 1.0 | 40.0 | 5.0 | 2.857143 |
| 21 | 53646 | 22.0 | 6.0 | 5.0 | 0.0 | 33.0 | 6.0 | 2.828571 |
| 84 | 44816 | 32.0 | 5.0 | 4.0 | 1.0 | 42.0 | 4.0 | 2.4 |
| 102 | 47807 | 25.0 | 4.0 | 9.0 | 1.0 | 39.0 | 3.0 | 1.671429 |


In [81]:
players.head()

| | nflId | height | weight | birthDate | collegeName | position | displayName |
|---|---|---|---|---|---|---|---|
| 0 | 25511 | 6-4 | 225 | 1977-08-03 | Michigan | QB | Tom Brady |
| 1 | 29550 | 6-4 | 328 | 1982-01-22 | Arkansas | T | Jason Peters |
| 2 | 29851 | 6-2 | 225 | 1983-12-02 | California | QB | Aaron Rodgers |
| 3 | 30842 | 6-6 | 267 | 1984-05-19 | UCLA | TE | Marcedes Lewis |
| 4 | 33084 | 6-4 | 217 | 1985-05-17 | Boston College | QB | Matt Ryan |


In [82]:
ball_carriers = ball_carriers.merge(players[['nflId', 'displayName']], how='left', left_on='ballCarrierId', right_on='nflId')
ball_carriers.drop(columns=['nflId'], inplace=True)

In [84]:
ball_carriers.sort_values('weighted_btoe', ascending=False).head(50)

| | ballCarrierId | tp | fp | tn | fn | num_events | broken_tackles_over_exp | weighted_btoe | displayName |
|---|---|---|---|---|---|---|---|---|---|
| 27 | 54572 | 36.0 | 9.0 | 7.0 | 0.0 | 52.0 | 9.0 | 6.685714 | Dameon Pierce |
| 1 | 46104 | 32.0 | 8.0 | 6.0 | 0.0 | 46.0 | 8.0 | 5.257143 | Nick Chubb |
| 66 | 45573 | 34.0 | 7.0 | 8.0 | 0.0 | 49.0 | 7.0 | 4.9 | Austin Ekeler |
| 22 | 43334 | 57.0 | 4.0 | 9.0 | 0.0 | 70.0 | 4.0 | 4.0 | Derrick Henry |
| 11 | 53454 | 29.0 | 5.0 | 12.0 | 0.0 | 46.0 | 5.0 | 3.285714 | Travis Etienne |
| 16 | 44995 | 27.0 | 7.0 | 5.0 | 2.0 | 41.0 | 5.0 | 2.928571 | Aaron Jones |
| 23 | 52449 | 26.0 | 6.0 | 7.0 | 1.0 | 40.0 | 5.0 | 2.857143 | Jonathan Taylor |
| 21 | 53646 | 22.0 | 6.0 | 5.0 | 0.0 | 33.0 | 6.0 | 2.828571 | Khalil Herbert |
| 84 | 44816 | 32.0 | 5.0 | 4.0 | 1.0 | 42.0 | 4.0 | 2.4 | Leonard Fournette |
| 102 | 47807 | 25.0 | 4.0 | 9.0 | 1.0 | 39.0 | 3.0 | 1.671429 | Josh Jacobs |
