<h1>Offensive Linemen Evaluation</h1>
<p>
    Let's begin by importing some data from the 2017 NFL season. This data set includes the coordinates of
    all the players on the field for each frame of each play. We matched the offensive linemen with defensive
    linemen in order to gauge how well they are blocking. The data set is available on
    <a href="https://www.dropbox.com/s/eq1wusdhnifl5zk/netdata.csv?dl=0">Dropbox</a>.
</p>
<p>But first, let's import some packages.</p>

In [1]:
%matplotlib inline

In [2]:
import math
import random
import re

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import tensorflow as tf
from keras import backend as K


random.seed(1)
np.random.seed(2)
tf.random.set_seed(3)
tf.keras.backend.set_floatx('float32')

Using TensorFlow backend.


A few notes on the data set:
<ol>
    <li>
        The PassResult column uses the following codes:
        <table>
            <tr>
                <td>I</td>
                <td>Incomplete</td>
                <td>C</td>
                <td>Complete</td>
            </tr>
            <tr>
                <td>S</td>
                <td>Sack</td>
                <td>R</td>
                <td>Rush</td>
            </tr>
            <tr>
                <td>IN</td>
                <td>Interception</td>
                <td>nan</td>
                <td>Run</td>
            </tr>
        </table>
    </li>
    <li>
        The sack.ind column is 1 if there was a sack in that play and 0 if not.
    </li>
</ol>

Notice that a nan value represents a run, so we will convert all of those to "Run".
Then we keep only the frames from the first 10 play IDs so we don't crash our machines.

In [3]:
data = pd.read_csv('netdata.csv')
# Runs have no PassResult (parsed as NaN), so label them explicitly
data["PassResult"] = data["PassResult"].fillna("Run")
# Keep only the frames from the first 10 distinct playIds
data = data[data["playId"].isin(data["playId"].unique()[:10])]
data.head()

Unnamed: 0,X,gameId,playId,frame.id,OL_C_x,OL_C_y,OL_LG_x,OL_LG_y,OL_LT_x,OL_LT_y,...,Def_9_x,Def_9_y,Def_10_x,Def_10_y,Def_11_x,Def_11_y,PlayResult,sack.ind,PassResult,num_vec
0,1,2017090700,68,1,26.59,23.433333,26.16,24.993333,26.03,26.723333,...,35.38,35.143333,42.49,24.393333,28.47,43.843333,0,0,I,1
1,2,2017090700,68,2,26.59,23.423333,26.16,24.993333,26.03,26.723333,...,35.37,35.143333,42.49,24.403333,28.47,43.843333,0,0,I,2
2,3,2017090700,68,3,26.58,23.423333,26.16,24.993333,26.03,26.723333,...,35.37,35.143333,42.5,24.413333,28.47,43.843333,0,0,I,3
3,4,2017090700,68,4,26.58,23.423333,26.16,24.993333,26.03,26.723333,...,35.37,35.143333,42.52,24.453333,28.47,43.843333,0,0,I,4
4,5,2017090700,68,5,26.57,23.423333,26.16,24.983333,26.03,26.723333,...,35.37,35.153333,42.64,24.673333,28.48,43.843333,0,0,I,5
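
As a small illustration of the two clean-up steps above (labeling missing pass results as runs and keeping only the first few play IDs), here is a minimal sketch on a made-up, in-memory CSV. The toy rows and the choice of 2 play IDs are assumptions for the example, not values from netdata.csv.

```python
import io

import pandas as pd

# Hypothetical miniature of netdata.csv; a blank PassResult marks a run play
csv_text = """gameId,playId,frame.id,PassResult
1,68,1,I
1,68,2,I
1,94,1,
1,118,1,C
"""

toy = pd.read_csv(io.StringIO(csv_text))

# Blank PassResult cells are parsed as NaN; label them "Run"
toy["PassResult"] = toy["PassResult"].fillna("Run")

# Keep only the frames belonging to the first 2 distinct playIds
first_ids = toy["playId"].unique()[:2]
toy_subset = toy[toy["playId"].isin(first_ids)]

print(toy_subset["PassResult"].tolist())
```

Filtering with a boolean mask keeps the original row labels, which is why the index of the filtered frame does not have to start at 0.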


The X column used to be the index of the file, so let's just drop it. 

In [4]:
data.drop('X', inplace=True, axis=1)
print(f'New dimensions of DataFrame: {data.shape}')

New dimensions of DataFrame: (2490, 61)


Now, let's see what's in this data.

In [5]:
num_plays = data["playId"].unique().size
print(f'Number of unique plays: {num_plays}')
print(f'Average number of frames per play {round(data.shape[0] / num_plays, 2)}')

Number of unique plays: 10
Average number of frames per play 249.0


That tells us a little about how many plays this data comes from. 
Now, let's watch one of these plays to see what all this data can actually tell us.

![Title](play68.gif)

Now that we understand a little more about what we are working with, let's split the data into a test and train set.
Annoyingly, a playId is only unique within a game, so several games can share the same playId;
we need to make sure we take our test plays from just 1 game.
First, let's see how many plays we have from each game.

In [6]:
play_ids_by_game = data.groupby('gameId')["playId"].unique().reset_index()
print(play_ids_by_game.to_string())

        gameId                                            playId
0   2017090700  [68, 94, 118, 139, 160, 189, 210, 395, 427, 449]
1   2017091004                                             [118]
2   2017091008                                              [68]
3   2017091009                                             [395]
4   2017091700                                             [118]
5   2017091708                                             [160]
6   2017091800                                        [189, 210]
7   2017092100                                             [427]
8   2017092404                                    [94, 189, 210]
9   2017092500                                             [160]
10  2017100100                                         [94, 118]
11  2017100101                                        [160, 395]
12  2017100804                                              [94]
13  2017100807                                             [139]
14  2017100900           
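
Because the same playId shows up under several games in the listing above, counting distinct playId values undercounts the plays. A minimal sketch, using made-up rows, of counting plays as (gameId, playId) pairs instead:

```python
import pandas as pd

# Toy frames: playId 68 appears in two different games
frames = pd.DataFrame({
    "gameId": [1, 1, 1, 2, 2],
    "playId": [68, 68, 94, 68, 68],
})

# Counting by playId alone merges plays from different games
plays_by_id = frames["playId"].nunique()

# Counting (gameId, playId) pairs treats each play separately
plays_by_pair = len(frames[["gameId", "playId"]].drop_duplicates())

avg_frames_per_play = len(frames) / plays_by_pair
print(plays_by_id, plays_by_pair, round(avg_frames_per_play, 2))
```

The same idea applies to the average number of frames per play: dividing by the number of pairs gives the per-play figure, while dividing by the number of distinct playIds inflates it.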

To make things easier, let's pick a game with more than 5 plays.
Then we'll make the first 5 plays of that game the test data and use everything else as training data.

In [7]:
# Find games with more than 5 plays
game_play = play_ids_by_game[play_ids_by_game["playId"].apply(len) > 5]

# Get the gameId for the first one of those games (by position, not by index label)
game = game_play["gameId"].iloc[0]

# Get the first 5 playIds from that game
play_ids = game_play["playId"].iloc[0][:5]

# A frame belongs to the test set only if both its game and its play match
is_test = (data["gameId"] == game) & data["playId"].isin(play_ids)

# Make a test data frame with those 5 plays
test_data = data[is_test].copy()

# Make a train data frame with every other play
train_data = data[~is_test].copy()

# Check to make sure we have 5 distinct plays from the same game
test_data.groupby(["gameId", "playId"])[["gameId", "playId"]].first().reset_index(drop=True)


Unnamed: 0,gameId,playId
0,2017090700,68
1,2017090700,94
2,2017090700,118
3,2017090700,139
4,2017090700,160
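
The split logic can be checked on a toy frame. This is a sketch with made-up game and play IDs; it shows that a play sharing a playId with a test play, but coming from a different game, stays in the training set.

```python
import pandas as pd

toy = pd.DataFrame({
    "gameId": [1, 1, 1, 2, 2],
    "playId": [68, 94, 118, 68, 94],
})

test_game, test_plays = 1, [68, 94]

# A row is test data only if BOTH its game and its play match
is_test = (toy["gameId"] == test_game) & toy["playId"].isin(test_plays)
toy_test = toy[is_test].copy()
toy_train = toy[~is_test].copy()

print(toy_test)
print(toy_train)
```

Negating the combined mask is equivalent to "the play is not in the list, or the game is different", so the two sets are disjoint and together cover every row.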


Some of the data we have won't help us create a model, so let's make a function below to remove any
columns that we do not want the model to train on. 
Then create a new data frame without those columns for both the test and train set.

In [8]:
def get_data_for_model(df):
    keep_regex = r'(Off|OL_(?:C|LG|RG|RT|LT)_(?:x|y)$|Def|frame|Match)'
    drop_cols = [c for c in df.columns if not re.search(keep_regex, c)]
    return df.drop(drop_cols, axis=1)

train_data_for_model = get_data_for_model(train_data)
test_data_for_model = get_data_for_model(test_data)
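
To see which columns survive the filter, here is a minimal sketch that applies a simplified form of the pattern (assumed to keep the same columns) to a hand-picked list of column names. The list is illustrative and mixes names from the data set with columns that are added later in this notebook.

```python
import re

# Simplified form of the keep pattern; only OL position columns ending in _x or _y match
keep_regex = r'(Off|OL_(?:C|LG|RG|RT|LT)_(?:x|y)$|Def|frame|Match)'

columns = ["gameId", "playId", "frame.id", "OL_C_x", "OL_C_y", "OL_C_score",
           "Def_1_x", "PlayResult", "sack.ind", "PassResult", "Predicted"]

kept = [c for c in columns if re.search(keep_regex, c)]
print(kept)
```

Two things are worth noticing: the target column PlayResult is dropped, so it cannot leak into the inputs, and frame.id is kept, so the model also sees how far into the play each frame is.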

<p>
Next, let's begin creating a Keras model.
We want this model to predict how many yards a play will gain or lose based on the positions of the players.
Then, we move the offensive linemen a little bit in each direction and see how the prediction changes.
This process can tell us if a lineman is in the correct position or if he could have been in a better spot
that would result in the offense gaining more yards on that play.
</p>

<p>
That being said, let's begin configuring the model.
We look at the number of columns to determine the input shape for the model.
Then we create a Sequential model with a 25-node layer that receives the input, 2 hidden layers each with 25 nodes,
and an output layer with a single node.
</p>

In [9]:
input_shape = (train_data_for_model.shape[1],)

model = tf.keras.Sequential()
num_input_nodes = 25
num_output_nodes = 1

# Input layer
model.add(tf.keras.layers.Dense(num_input_nodes, input_shape=input_shape, activation=tf.nn.sigmoid))

# Hidden layers
model.add(tf.keras.layers.Dense(num_input_nodes, activation=tf.nn.sigmoid))
model.add(tf.keras.layers.Dense(num_input_nodes, activation=tf.nn.sigmoid))

# Output layer
model.add(tf.keras.layers.Dense(num_output_nodes, activation=tf.keras.activations.linear))
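
For intuition about what this stack of Dense layers computes, here is a rough NumPy sketch of the forward pass: three 25-node sigmoid layers followed by one linear output node. The feature count of 6 and the random weights are assumptions for illustration; Keras uses its own initializers and the real input width is the number of kept columns.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, layers):
    """Run x through dense layers: sigmoid on all but the last, which is linear."""
    for i, (weights, bias) in enumerate(layers):
        x = x @ weights + bias
        if i < len(layers) - 1:
            x = sigmoid(x)
    return x

rng = np.random.default_rng(0)
n_features = 6  # hypothetical input width
sizes = [n_features, 25, 25, 25, 1]
layers = [(rng.normal(size=(m, n)), np.zeros(n))
          for m, n in zip(sizes[:-1], sizes[1:])]

batch = rng.normal(size=(4, n_features))
out = forward(batch, layers)
n_params = sum(w.size + b.size for w, b in layers)
print(out.shape, n_params)
```

Each frame in the batch yields a single number, which the notebook interprets as predicted yards.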

Let's define our own loss function. We are going to use a classic mean squared error loss, but we will
account for the initial randomness of the net so that it starts out predicting 4 yards for each play.
This way the model needs less training and, even without any training, predicts a gain of 4 yards for each play.
Here, `avg_of_play_no_noise` is the offset subtracted from the model's raw output before it is compared with the actual yards gained.

In [10]:
def mse_loss_with_prior(avg_of_play_no_noise):
    def mse(y_true, y_pred):
        return K.mean(K.square((y_pred - avg_of_play_no_noise) - y_true))
    return mse

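# Compile the model with a zero offset for now; mean absolute error is tracked
# because accuracy is not meaningful for a regression target
model.compile(optimizer='adam',
              loss=mse_loss_with_prior(0.0),
              metrics=['mae'])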
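
One way to read this loss, sketched in NumPy under the assumption that the offset is a per-frame value: subtracting an offset from the prediction gives exactly the same error as adding that offset to the target. The arrays below are random stand-ins, not model outputs.

```python
import numpy as np

def mse_with_offset(y_true, y_pred, offset):
    """Mean squared error of the offset-corrected prediction."""
    return np.mean(((y_pred - offset) - y_true) ** 2)

def mse(y_true, y_pred):
    return np.mean((y_pred - y_true) ** 2)

rng = np.random.default_rng(0)
y_true = rng.integers(-5, 20, size=8).astype("float64")  # yards, cast from integers
y_pred = rng.normal(size=8)
offset = rng.normal(size=8)  # hypothetical per-frame offset

loss_corrected_prediction = mse_with_offset(y_true, y_pred, offset)
loss_shifted_target = mse(y_true + offset, y_pred)
print(loss_corrected_prediction, loss_shifted_target)
```

The equivalence is useful in practice because Keras slices the inputs and targets into batches together, whereas an array captured inside a loss closure is not sliced per batch. Shifting the targets is therefore one way to keep a per-frame offset aligned with its frame.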

Before training, we need to save a copy of the model so we can find how much we need to change the model's prediction
in order to get that initial prediction we want.

In [11]:
initial_model = tf.keras.models.clone_model(model)
initial_weights = model.get_weights()
initial_model.set_weights(initial_weights)

Let's see the predictions before training for the first 10 frames of a play.

In [12]:
initial_predictions = initial_model.predict(train_data_for_model)
print(initial_predictions.flatten()[:10])

[-1.8857856 -1.8855824 -1.8853477 -1.8851091 -1.8848432 -1.8845462
 -1.8840744 -1.8835262 -1.8828763 -1.8823375]


As you can see, these predictions for how many yards a team will gain or lose (about -1.9) are not as close to
the average number of yards gained as a flat guess of 4 would be.
So to make these predictions better, we cancel out the initial randomness of the model when we make our predictions.
Since we'll be making a lot of predictions, let's define a function that can do this extra work for us.

In [13]:
def predict(model_for_pred, initial_model_for_pred, prior_for_pred, df, label="Predicted"):
    # Remove unwanted columns
    model_data = get_data_for_model(df)

    # Net noise: how far the untrained model's output is from the prior
    initial_pred = initial_model_for_pred.predict(model_data).flatten().astype("float64")
    df["NetNoise"] = initial_pred - prior_for_pred

    # Use the trained model to predict, then subtract the initial net noise
    trained_pred = model_for_pred.predict(model_data).flatten().astype("float64")
    df[label] = trained_pred - df["NetNoise"]
    return df
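
The noise-cancelling idea can be illustrated without Keras. In this sketch a random linear map stands in for the network (an assumption made purely for the example): before any training the corrected prediction equals the prior, and after the weights change the correction leaves only what training added.

```python
import numpy as np

rng = np.random.default_rng(1)
features = rng.normal(size=(5, 3))

initial_weights = rng.normal(size=3)      # stand-in for the untrained net
trained_weights = initial_weights.copy()  # no training has happened yet

def net(weights, x):
    return x @ weights

prior = 4.0

# Net noise: what the untrained net outputs beyond the prior
net_noise = net(initial_weights, features) - prior

# Before training, the corrected prediction is the prior for every row
adjusted_before = net(trained_weights, features) - net_noise

# Pretend training nudged the weights; only that change shows up after correction
trained_weights = trained_weights + 0.1
adjusted_after = net(trained_weights, features) - net_noise

print(adjusted_before)
print(adjusted_after)
```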

Now let's see what our predictions look like using our new method and still before training the model.

In [16]:
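prior = 4.0
train_data["NetNoise"] = initial_predictions.flatten() - prior
# Recompile with the net noise as the loss offset, keeping the same optimizer as before
model.compile(optimizer='adam', loss=mse_loss_with_prior(train_data["NetNoise"]))
train_data = predict(model, initial_model, prior, train_data)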

train_data["Predicted"].head()

363    4.0
364    4.0
365    4.0
366    4.0
367    4.0
Name: Predicted, dtype: float64

Good, the model now predicts that every play will gain 4 yards.
Now let's actually train the model.
Since we have ~500,000 frames we can train on in the full data set (the small subset loaded above has only about 2,000),
we don't need to run the data through the model too many times, or else it will begin to memorize the training data.
To help catch this overfitting, we use the last 20% of the training data as validation data.


In [15]:
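# Keras batches the inputs and targets together, but not an offset captured by the loss,
# so fold the per-frame net noise into the targets: (y_pred - noise) - y == y_pred - (y + noise)
targets = (train_data["PlayResult"] + train_data["NetNoise"]).astype("float32")
model.compile(optimizer='adam', loss='mse')
history = model.fit(train_data_for_model, targets,
                    validation_split=.2,
                    epochs=10,
                    batch_size=1000)

Train on 1701 samples, validate on 426 samples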

Let's plot the loss and validation loss to make sure the model doesn't overfit.

In [None]:
plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.title('Model loss')
plt.ylabel('Loss')
plt.xlabel('Epoch')
plt.legend(['Train', 'Validation'], loc='upper right')
plt.show()
plt.close()

Looks good. If the validation loss started to increase, that would mean the model was memorizing the data instead of
learning to predict for new plays.
After training, the model should hopefully have better predictions for the number of yards gained per play.
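
The "validation loss starts to increase" check can also be done programmatically. Below is a minimal sketch with a hypothetical helper, `best_epoch_before_overfit`, and made-up loss values; in the notebook the list would come from `history.history['val_loss']`.

```python
def best_epoch_before_overfit(val_losses, patience=2):
    """Return the 1-based epoch with the lowest validation loss just before it
    rose for `patience` consecutive epochs, or None if that never happens."""
    rises = 0
    for epoch in range(1, len(val_losses)):
        if val_losses[epoch] > val_losses[epoch - 1]:
            rises += 1
            if rises == patience:
                # Step back to the last epoch before the rises began
                return epoch - patience + 1
        else:
            rises = 0
    return None

val_loss = [30.0, 22.0, 18.0, 17.5, 18.2, 19.0, 21.0]  # made-up values
print(best_epoch_before_overfit(val_loss))
```

Keras also ships an `EarlyStopping` callback that monitors `val_loss` with a `patience` setting and stops training automatically, which is the usual way to act on this signal.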

In [None]:
test_data = predict(model, initial_model, prior, test_data)
test_data.head()

Now that we have a model that can give a decent prediction for yards gained, let's evaluate the offensive linemen's
impact on the play.
To explain this process, we are going to look at 1 play from 1 game, so let's make a function for that.

In [None]:
def get_play_from_game(p_id, g_id, df):
    return df[(df["playId"] == p_id) & (df["gameId"] == g_id)]

Let's get a gameId and a playId from that game.
Just to see if our model is working, we'll plot the actual yards gained vs. the predicted yards for this play.

In [17]:
# Get first gameId
game_id = test_data["gameId"].unique()[0]

# Get first playId from the gameId
play_id = test_data.loc[test_data["gameId"] == game_id, "playId"].unique()[0]

# Create a data frame of all the frames of the play
play_df = get_play_from_game(play_id, game_id, test_data)

fig, ax = plt.subplots()
ax.plot(play_df["frame.id"], play_df[["Predicted", "PlayResult"]])
ax.set_title(f'Predicted Yards for Play {int(play_id)}')
ax.legend(['Predicted', 'Actual'])
ax.set_xlabel('Frame ID')
ax.set_ylabel('Yards Gained')
plt.show()


This is where things get interesting.
For this example, we are going to look at the center (i.e., OL_C).
We are going to move his position .01 yards to the right and see how the predicted number of yards changes.
We do the same thing by moving him forward .01 yards.
Then we calculate the magnitude of how much these moves changed the prediction.
Finally, his score will be that magnitude, or leverage, times the amount the prediction changed from the previous frame.
This score can be positive or negative depending on whether the change resulted in a better or worse prediction.
Negative scores mean the player could've been in a better spot, while positive scores mean he was in a good spot.


In [None]:
fig, ax = plt.subplots()

leverages = []
player_id = 'OL_C'
delta = .01

for frame_id in play_df["frame.id"].unique():
    game_play_frame_id = (test_data["playId"] == play_id) \
                         & (test_data["frame.id"] == frame_id) \
                         & (test_data["gameId"] == game_id)
    frame_df = test_data[game_play_frame_id]
    frame_prediction = frame_df.iloc[0]["Predicted"]

    # Move the player on the x-axis and find the new prediction
    move_x_df = frame_df.copy()
    move_x_df[f"{player_id}_x"] = move_x_df[f"{player_id}_x"].apply(lambda x: x + delta)
    move_x_df = predict(model, initial_model, prior, move_x_df, 'Predicted_x')

    # Move the player on the y-axis and find the new prediction
    move_y_df = frame_df.copy()
    move_y_df[f"{player_id}_y"] = move_y_df[f"{player_id}_y"].apply(lambda x: x + delta)
    move_y_df = predict(model, initial_model, prior, move_y_df, 'Predicted_y')
    
    # Calculate the magnitude of change in prediction for both the x and y move
    dx = (frame_prediction - move_x_df.iloc[0]["Predicted_x"]) / delta
    dy = (frame_prediction - move_y_df.iloc[0]["Predicted_y"]) / delta
    leverage = math.sqrt(dx ** 2 + dy ** 2)
    leverages.append(leverage)
    
    test_data.loc[game_play_frame_id, f'{player_id}_leverage'] = leverage
    
    # After the first frame, calculate a player's score for that frame
    if frame_id != 1:
        prev_frame_id = (test_data["playId"] == play_id) \
                         & (test_data["frame.id"] == frame_id - 1) \
                         & (test_data["gameId"] == game_id)
        prev_frame_pred = test_data.loc[prev_frame_id, "Predicted"].iloc[0]
        score = leverage * (frame_prediction - prev_frame_pred)
        test_data.loc[game_play_frame_id, f"{player_id}_score"] = score
        
# This plot shows how much impact the player had in each frame of the play
ax.plot(play_df["frame.id"].unique(), leverages)
ax.set_title(f'Rating vs FrameId for Play {play_id} and Player {player_id}')
ax.set_xlabel('Frame ID')
ax.set_ylabel('Rating')
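
Stripped of the data frame bookkeeping, the leverage is a finite-difference estimate of the gradient magnitude of the prediction with respect to one player's position. Here is a minimal sketch in which `toy_prediction` is a hypothetical stand-in for the trained model and the coordinates and predictions are made up:

```python
import math

def leverage(predict_fn, x, y, delta=0.01):
    """Estimate how strongly the prediction reacts to moving a player from (x, y)."""
    base = predict_fn(x, y)
    dx = (predict_fn(x + delta, y) - base) / delta
    dy = (predict_fn(x, y + delta) - base) / delta
    return math.sqrt(dx ** 2 + dy ** 2)

# Hypothetical stand-in for the model: a plane whose gradient is (3, 4)
def toy_prediction(x, y):
    return 3.0 * x + 4.0 * y + 1.0

lev = leverage(toy_prediction, 26.59, 23.43)

# Score for a frame: leverage times the change in prediction since the previous frame
prev_pred, curr_pred = 4.2, 3.9  # made-up predictions
score = lev * (curr_pred - prev_pred)

print(round(lev, 6), round(score, 6))
```

Because the leverage is a magnitude, it is never negative; the sign of the score comes entirely from whether the prediction rose or fell between frames.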

Let's check out some of these leverages (i.e., how much impact a player had on the play) and scores (i.e.,
whether the player was in the right position or not).


In [None]:
test_data.head()

Let's see how a player's score changes over the play.
Did his score drop, resulting in a decrease in the prediction for yards gained?
Maybe his score increased, causing an increase in the prediction.

In [None]:
fig, ax = plt.subplots()

play_df = get_play_from_game(play_id, game_id, test_data)
ax.plot(play_df["frame.id"], play_df[f"{player_id}_score"], c='red')
ax.set_title(f'{player_id} Score for Play {int(play_id)}')
plt.xlabel('Frame ID')
plt.ylabel('Score')

Finally, if we sum the score for a player over a whole play, we can see how well he did over the whole play,
not just at a specific instant in the play.

In [None]:
play_df[f"{player_id}_score"].sum()
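
The same sum can be taken for many plays at once by grouping on both identifiers. This sketch uses made-up scores; the first frame of each play has no score (NaN) because there is no previous frame to compare against, and pandas skips NaN when summing.

```python
import numpy as np
import pandas as pd

scores = pd.DataFrame({
    "gameId": [1, 1, 1, 2, 2],
    "playId": [68, 68, 68, 68, 68],
    "OL_C_score": [np.nan, 0.5, -0.2, np.nan, -1.0],  # made-up values
})

# Group on (gameId, playId) because a playId is only unique within a game
totals = scores.groupby(["gameId", "playId"])["OL_C_score"].sum()
print(totals)
```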
