<a href="https://www.kaggle.com/patrickseminatore/quantifying-the-x-factor?scriptVersionId=88527128" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

#  The X-Factor
As scouting departments around the NFL will tell you, predicting a player's performance at the professional level is extremely difficult.  Part of the reason for this is that factors for success are different at every level of the sport.  In college, a player can dominate simply because he is able to physically outclass his competition.  However, once you get to the NFL, almost every player has elite physical tools.  Especially on defense, where the game is based around disrupting the other team's plan, the game becomes much more complex than simply being stronger or faster than your opponent.  Thus, the determining factor in a defender's success in pass defense must be some other X-Factor.  

# What Makes A Successful Defender?
   "Do Your Job."  An iconic saying in the football world, often associated with the great Bill Belichick (although he is certainly not the first to articulate this).  The idea is that each player must make the plays he is responsible for making, no more and no less.  What does this mean in terms of pass coverage? Sure, each defensive scheme might prescribe a zone to defend or a man to guard, but pass defense is not as simple as just being in a specific area or standing near a receiver.  In our estimation, the "job" of each defender is to make sure that as many attempted passes as possible fall incomplete.  Of course, another element of successful pass defense is dissuading the pass from even being thrown in the first place, but we can save that for further investigation.  
   



# True Completion Percentage Allowed
Due to the variety of defensive schemes utilized by NFL teams, it can be difficult to quantify exactly how effective defenders are at making sure passes are not completed.  A statistic that is often used is *Completion Percentage Allowed As Closest Defender*.  However, particularly in zone coverage, two defenders could easily be close enough to the receiver to make a play on the ball, and should be credited as such.  To account for this, we are going to borrow a concept from another NFL Big Data Bowl Submission (*that has since been deleted*): **Radius of Influence**.  The idea is that there is a given area on the field that each defender can realistically influence, based on how far away the defender is from the ball.  This will allow us to calculate a defender's completion percentage allowed, given that they could realistically make a play on the ball.  For our purposes, we can call this value TCPA, or **True Completion Percentage Allowed**.  
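
As a minimal sketch of the bookkeeping behind TCPA, assume each play yields one record per defender of the form `(defender_id, in_roi, pass_completed)`. The record layout and the `tcpa_by_defender` helper are illustrative assumptions, not the notebook's actual schema. TCPA is then the completion rate over only those plays on which the targeted receiver was inside the defender's Radius of Influence.

```python
from collections import defaultdict

def tcpa_by_defender(records, min_attempts=1):
    """Return {defender_id: completions / attempts}, counting only plays
    where the targeted receiver was inside the defender's RoI."""
    attempts = defaultdict(int)
    completions = defaultdict(int)
    for defender_id, in_roi, pass_completed in records:
        if not in_roi:
            continue  # defender could not realistically influence this pass
        attempts[defender_id] += 1
        completions[defender_id] += int(pass_completed)
    return {
        d: completions[d] / attempts[d]
        for d in attempts
        if attempts[d] >= min_attempts
    }

# Hypothetical records: (defender_id, in_roi, pass_completed)
plays = [
    ("A", 1, 1), ("A", 1, 0), ("A", 0, 1), ("A", 1, 0),
    ("B", 1, 1), ("B", 0, 0),
]
print(tcpa_by_defender(plays))
```

Plays where `in_roi` is 0 are ignored entirely, so a completion far away from a defender does not count against him.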

# Radius of Influence
As the ball travels closer to the defender and targeted receiver, the time to react decreases, and thus the area that the defender can cover in that time decreases.  Past a certain point, however, that area reaches a limit, and the RoI remains constant.  The original concept stems from [this](http://www.lukebornn.com/papers/fernandez_ssac_2018.pdf) paper from Luke Bornn of Simon Fraser University and the Sacramento Kings and Javier Fernandez of F.C. Barcelona.  Obviously, field control in soccer is different from field control in football, so the scale has been adjusted.  Surely the numbers could be optimized with further observation and analysis, but this model is functional enough for now:

* If the ball is more than 18 yards away, the defender's Radius of Influence is fixed at 10 yards
* Otherwise, the Radius of Influence is modeled by the following curve, where *x* is the defender's distance from the ball in yards: $$RoI = 4 + \frac{6}{18^{2}}x^{2}$$ 
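
As a quick check of the curve, plugging in a few distances gives:

$$RoI(0) = 4, \qquad RoI(9) = 4 + \frac{6}{18^{2}}(9)^{2} = 4 + \frac{486}{324} = 5.5, \qquad RoI(18) = 4 + \frac{6}{18^{2}}(18)^{2} = 10$$

So the radius starts at 4 yards when the ball is at the defender and reaches 10 yards at a distance of exactly 18 yards, with no jump at the boundary.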

In [None]:
import matplotlib.pyplot as plt
import numpy as np

def calculate_roi(distance):
    if distance > 18:
        return 10
    else:
        return 4 + (6/(18 ** 2)) * (distance ** 2)

distances = np.arange(0.0, 25.0, 0.5)
rois = [calculate_roi(distance) for distance in distances]
plt.plot(distances, rois)
plt.ylabel('Radius of Influence')
plt.xlabel('Distance from Ball')
plt.show()

Using this formula, we can now calculate the completion percentage allowed by each defender, given that they were in a position to influence the play.  
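
The check behind "in a position to influence the play" can be sketched in a few lines. The version below is a self-contained illustration using plain `(x, y)` tuples in yards. The coordinates and the `can_influence` name are made up for the example and are not part of the notebook's `datastore` module.

```python
import math

def radius_of_influence(dist_from_ball):
    """RoI in yards: 4 at the ball, growing quadratically to a cap of 10 beyond 18 yards."""
    if dist_from_ball > 18:
        return 10.0
    return 4 + (6 / 18 ** 2) * dist_from_ball ** 2

def can_influence(defender_xy, ball_xy, receiver_xy):
    """True when the targeted receiver is within the defender's RoI."""
    roi = radius_of_influence(math.dist(defender_xy, ball_xy))
    return math.dist(defender_xy, receiver_xy) <= roi

# Hypothetical positions (yards)
ball = (30.0, 26.0)
receiver = (42.0, 26.0)
near_defender = (40.0, 23.0)
far_defender = (40.0, 10.0)

print(can_influence(near_defender, ball, receiver))
print(can_influence(far_defender, ball, receiver))
```

Note that `math.dist` requires Python 3.8 or newer.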

In [None]:
%%script false --no-raise-error
## Due to the extensive runtime, execution of this code has been disabled, and the resulting data has been uploaded for convenience.  To
## run the code anyway, simply remove the preceding line
import math
import datastore as dataStore
def project_zoi_forward(cursor, defender, target_receiver, gameId, playId, frameId, frames_project_forward):
    try:
        proj_frameId = frameId + frames_project_forward
        proj_defender_location = dataStore.get_target_defender_location(cursor, gameId, playId, proj_frameId, defender[0])[0]
        proj_ball_location = dataStore.get_ball_location(cursor, gameId, playId, proj_frameId)[0]
        proj_zoi_rad = calculate_zoi(proj_defender_location, proj_ball_location)
        return proj_zoi_rad, proj_defender_location
    except Exception:
        return -1, -1
    
def calculate_zoi(proj_defender_location, proj_ball_location):
    dist_from_ball = calculate_distance(proj_defender_location, proj_ball_location)
    if dist_from_ball > 18:
        return 10
    else:
        return 4 + (6/(18 ** 2)) * (dist_from_ball ** 2)

def calculate_distance(proj_defender_location, proj_ball_location):
    # Euclidean distance between two (x, y) locations
    dx = proj_defender_location[0] - proj_ball_location[0]
    dy = proj_defender_location[1] - proj_ball_location[1]
    return math.hypot(dx, dy)

def is_receiver_in_zoi(rad_influence, proj_defender_location, proj_receiver_location):
    # The receiver row is read at indices 1 and 2, unlike the defender row (0 and 1)
    dx = proj_defender_location[0] - proj_receiver_location[1]
    dy = proj_defender_location[1] - proj_receiver_location[2]
    return math.hypot(dx, dy) <= rad_influence
    
frames_project_forward=5    
connection = dataStore.create_data_store()
cursor = connection.cursor()
counter = 0

gameIds = dataStore.get_all_gameIds(cursor)

for gameId_tuple in gameIds:

    gameId = int(gameId_tuple[0])
    playIds = dataStore.get_playIds_from_game(cursor, gameId)

    for playId_tuple in playIds:

        playId = int(playId_tuple[0])

        ## Find target receiver on each play
        raw_targ = dataStore.get_target_receiver_by_play(cursor, gameId, playId)

        if raw_targ:
            target_receiver = raw_targ[0][0]

            ## Find frame where pass thrown
            was_thrown = dataStore.get_frameId_and_time_where_pass_attempted(cursor, gameId, playId)
            if isinstance(was_thrown, tuple):

                frame_id_pass_thrown = int(was_thrown[0])

                defenders = dataStore.get_nflIds_from_play(cursor, gameId, playId, frame_id_pass_thrown)

                for defender in defenders:

                    # Look forward to account for reaction time
                    rad_influence, proj_defender_location = project_zoi_forward(cursor, defender, target_receiver, gameId, playId, frame_id_pass_thrown, frames_project_forward)
                    # Skip defenders whose projected location could not be found
                    if not (rad_influence == -1 and proj_defender_location == -1):
                        raw_rec_loc = dataStore.get_target_receiver_location(cursor, gameId, playId, frame_id_pass_thrown + frames_project_forward, target_receiver)

                        if isinstance(raw_rec_loc, list):
                            proj_receiver_location = raw_rec_loc[0]

                            # Can the defender influence the play?
                            in_zoi = int(is_receiver_in_zoi(rad_influence, proj_defender_location, proj_receiver_location))

                            # Was the pass completed?
                            query_pass_completed = dataStore.get_is_pass_completed(cursor, gameId, playId)

                            if query_pass_completed:

                                pass_completed = query_pass_completed[0][1]
                                data = [defender[0], playId, gameId, in_zoi, pass_completed]
                                dataStore.record_zoi_completion_by_play(cursor, data)

connection.commit()


In [None]:
import datastore as dataStore
import sqlite3
import os
from operator import itemgetter
import plotly.express as px
import plotly.offline as pyo
from IPython.display import HTML

def grab_best_players(cursor, min_attempts, num_players):
    players = dataStore.get_zoi_players(cursor, min_attempts)
    raw_zoi_arr = create_zoi_array(cursor, players)
    sorted_best_players = sorted(raw_zoi_arr, key=itemgetter(1))[:num_players]
    sorted_with_names = get_player_names(cursor, sorted_best_players)
    names_arr = [name[0] for name in sorted_with_names]
    eff_arr = [name[1] for name in sorted_with_names]
    return names_arr, eff_arr
    
def create_zoi_array(c, players):
    zoi_eff_arr = []
    for player in players:
        sum_comp = 0
        completions = dataStore.get_zoi_plays_by_player(c, player[0])
        for entry in completions:
            sum_comp += entry[0]
        eff_data = [player[0], sum_comp/len(completions)]
        zoi_eff_arr.append(eff_data)
    return zoi_eff_arr    

def get_player_names(c, sorted_players):
    names_efficiency = []
    for player in sorted_players:
        name = dataStore.get_name_by_nfl(c, player[0])[0][0]
        name_eff_data = [name, player[1]]
        names_efficiency.append(name_eff_data)
    return names_efficiency

# Set notebook mode to work in offline
pyo.init_notebook_mode()

connection = sqlite3.connect("../input/datastore-db/datastore.sqlite")
cursor = connection.cursor()
names_arr, eff_arr = grab_best_players(cursor, 30, 15)
fig = px.bar( x=names_arr, y=eff_arr, labels={'x': 'Defender', 'y': 'Completion Percentage Allowed'})
fig.update_layout(
        yaxis=dict(tickformat=".2%"),
        title={
        'text': "Lowest Completion % Allowed When Targeted Receiver Within Defender RoI, min 30 attempts",
        'y':0.99,
        'x':0.5,
        'xanchor': 'center',
        'yanchor': 'top'},
        )
display(HTML(fig.show("notebook")))

# Linear Regression Model
Now that we have a more accurate representation of how often defenders are allowing passes to be completed, we can begin to model this behavior, and hopefully gain some insight into what makes defenders successful.  For a quick overview of how multivariate linear regression works, take a look at [this](http://mezeylab.cb.bscb.cornell.edu/labmembers/documents/supplement%205%20-%20multiple%20regression.pdf) paper from Dr. Martina Bremer. At a high level, all we are attempting to do is construct the formula: 

$$y = \beta_0 + \beta_1x_1 + \beta_2x_2 + ... + \beta_nx_n + \epsilon$$

Where *y* in this case is the TCPA we are modeling, each *x* is one of our *n* input features with a corresponding $\beta$ as its *regression coefficient* (the weight applied to each *x* to make the model fit), $\beta_0$ is the y-intercept of our line, and $\epsilon$ represents the residual term of the model.  This $\epsilon$ is what we would like to minimize to obtain a more accurate model given our parameters.  We will use a metric called *mean squared error (mse)* to summarize the average error of the model across all data points; the code below reports its square root (the root mean squared error), which is expressed in the same units as TCPA.  
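
To make the error metric concrete, here is a small sketch with made-up numbers showing the mean squared error and its square root (RMSE). The values are purely illustrative and do not come from the tracking data.

```python
import math

def mean_squared_error(y_true, y_pred):
    """Average of the squared residuals."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

# Hypothetical observed and predicted TCPA values
observed = [0.50, 0.60, 0.70, 0.40]
predicted = [0.55, 0.55, 0.65, 0.45]

mse = mean_squared_error(observed, predicted)
rmse = math.sqrt(mse)
print(f"MSE = {mse:.4f}, RMSE = {rmse:.4f}")
```

Every prediction here is off by 0.05, so the RMSE comes out at about 0.05, which reads directly as "five percentage points of TCPA".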

Now, we can plug in our input variables.  We will be using:

* **Top Speed** - The player's highest recorded speed in the provided tracking data.
* **Max Acceleration** - The player's highest recorded acceleration in the provided tracking data.  This also affects a player's explosiveness and change of direction.
* **Height** - The player's reported height in the provided dataset.
* **Weight** - The player's reported weight in the provided dataset.
* **Age** - The player's age based on reported birthdate as of Sept. 1st of 2018.  This also acts as a proxy for a player's experience.  

Thus, the equation for our model will be of the form:

$$TCPA = \beta_0 + \beta_1(TopSpeed) + \beta_2(MaxAccel) + \beta_3(Height) + \beta_4(Weight) + \beta_5(Age) + \epsilon$$

Which will allow us to predict the TCPA based on raw physical characteristics.  
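
The fit itself can be sketched without the tracking data. The example below generates synthetic values for the five features (the ranges, coefficients, and noise level are invented for illustration only) and solves the least-squares problem with NumPy, which is the same objective that scikit-learn's `LinearRegression` minimizes.

```python
import numpy as np

rng = np.random.default_rng(0)
n_players = 50

# Synthetic features: TopSpeed, MaxAccel, Height, Weight, Age (invented ranges)
X = np.column_stack([
    rng.uniform(8.5, 10.5, n_players),   # top speed
    rng.uniform(4.0, 7.0, n_players),    # max acceleration
    rng.uniform(69, 77, n_players),      # height
    rng.uniform(180, 250, n_players),    # weight
    rng.uniform(21, 34, n_players),      # age
])
true_beta = np.array([-0.02, -0.01, 0.001, 0.0005, 0.002])  # invented
y = 0.6 + X @ true_beta + rng.normal(0, 0.05, n_players)

# Prepend a column of ones so beta[0] plays the role of the intercept
A = np.column_stack([np.ones(n_players), X])
beta, *_ = np.linalg.lstsq(A, y, rcond=None)

y_pred = A @ beta
rmse = np.sqrt(np.mean((y - y_pred) ** 2))
print("intercept:", beta[0])
print("coefficients:", beta[1:])
print("RMSE:", rmse)
```

With only 50 noisy samples the recovered coefficients will not match `true_beta` exactly; the point is the shape of the computation, not the numbers.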

In [None]:
import pandas as pd
import numpy as np
import datastore as dataStore
from operator import itemgetter
import plotly.express as px
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn import metrics
from scipy.optimize import minimize
from IPython.display import HTML

def grab_best_cov_players(cursor, min_attempts):
    players = dataStore.get_zoi_players(cursor, min_attempts)
    raw_zoi_arr = create_zoi_array(cursor, players)
    sorted_best_players = sorted(raw_zoi_arr, key=itemgetter(1))
    nflId_arr = [name[0] for name in sorted_best_players]
    eff_arr = [name[1] for name in sorted_best_players]
    return nflId_arr, eff_arr

def create_zoi_array(c, players):
    zoi_eff_arr = []
    for player in players:
        sum_comp = 0
        completions = dataStore.get_zoi_plays_by_player(c, player[0])
        for entry in completions:
            sum_comp += entry[0]
        eff_data = [player[0], sum_comp/len(completions)]
        zoi_eff_arr.append(eff_data)
    return zoi_eff_arr  

def convert_positions(positions):
    single_positions = []
    for position in positions:
        if position in ['CB']:
            single_positions.append('CB')
        elif position in ['SS', 'FS']:
            single_positions.append('S')
        elif position in ['LB', 'MLB']:
            single_positions.append('LB')
    return single_positions

def get_names(c, nflIds):
    names = []
    for nflId in nflIds:
        name = dataStore.get_name_by_nfl(c, nflId)[0][0]
        names.append(name)
    return names

def get_positions(c, nflIds):
    names = []
    for nflId in nflIds:
        name = dataStore.get_position_by_nflId(c, nflId)[0]
        names.append(name)
    return names


## Turn off chained assignment warning
pd.set_option('mode.chained_assignment', None)

## Connect to database
connection = dataStore.create_data_store()
cursor = connection.cursor()

## Grab info from database
nflId_arr, eff_arr = grab_best_cov_players(cursor, 30) # Minimum 30 attempts
data_list = dataStore.get_modelinfo_by_nflId(cursor, nflId_arr)

## Build Dataframe
df = pd.DataFrame(data_list, columns=['NflId', 'Completion%', 'TopSpeed', 'MaxAccel', 'Height', 'Weight', 'Age'])
positions = get_positions(cursor, df['NflId'].values)
df['Position'] = pd.Series(positions, index=df.index)

new_positions = convert_positions(df['Position'].tolist())
df['GenPosition'] = pd.Series(new_positions, index=df.index)

## Make Predictions
linear_regression = LinearRegression()
y = df['Completion%']
x = df[['TopSpeed', 'MaxAccel', 'Height', 'Weight', 'Age']]
## NOTE: the train/test split below is created but never used; the model is fit and scored on the full dataset
xtrain, xtest, ytrain, ytest = train_test_split(x, y, test_size=0.3)
linear_regression.fit(x, y)
y_pred = linear_regression.predict(x)

## Display Results (np.sqrt of the MSE gives the root mean squared error)
rmse = np.sqrt(metrics.mean_squared_error(y, y_pred))
print('Root Mean Squared Error: ( %r )' % rmse)

## Build graph Dataframe
best_df = df[['NflId', 'Completion%', 'GenPosition']]
names = get_names(cursor, df['NflId'].values)
best_df['DisplayName'] = pd.Series(names, index=best_df.index)
best_df['PredCompletion%'] = pd.Series(y_pred, index=best_df.index)
## Compute the residual before sorting so each row keeps its own error
best_df['Epsilon'] = best_df['Completion%'] - best_df['PredCompletion%']
best_df.sort_values('Completion%', ascending=True, inplace=True, ignore_index=True)
display(HTML(best_df.head(15).to_html()))


## Display graph
fig = px.scatter(best_df, x='Completion%', y='PredCompletion%', hover_data=['DisplayName'], color='GenPosition')
fig.update_layout(
        yaxis=dict(tickformat=".2%"),
        xaxis=dict(tickformat=".2%"),
        title={
        'text': "Predicted Completion % Allowed vs Completion % Allowed",
        'y':0.99,
        'x':0.5,
        'xanchor': 'center',
        'yanchor': 'top'},
        )
display(HTML(fig.show("notebook")))

# Additional Term
As you can see in the chart above, our model is not quite accurate enough to be meaningful.  There must be some other factor that needs to be accounted for.  This makes sense, as there is obviously more to football than just raw physical characteristics.   Unfortunately, we run into the problem here of simply not having enough information.  There is just too much that goes into defending a pass to calculate from the information we have been given.  

Since we already know the TCPA for each player, we can actually quantify how far our model's predictions are from the observed result.  Recall the $\epsilon$ from our regression equation.  This is the *error*, or difference between the predicted result and the observed result.  Since we know we are missing some sort of feature for each player, we can deduce that the feature and its coefficient must be a part of that error.  If we then factor out our feature *xr* and its coefficient $\beta_6$ from $\epsilon$, our new equation will look like this:

$$TCPA = \beta_0 + \beta_1(TopSpeed) + \beta_2(MaxAccel) + \beta_3(Height) + \beta_4(Weight) + \beta_5(Age) + \beta_6(xr) + \epsilon$$

Now that we have our equation, we need to calculate $\beta_6$, as well as the *xr* for each player.  To do this, we will use the `optimize.minimize` function from SciPy on our regression, running a [BFGS](https://docs.scipy.org/doc/scipy/reference/tutorial/optimize.html#broyden-fletcher-goldfarb-shanno-algorithm-method-bfgs) quasi-Newton minimization algorithm.  Given an initial guess for *xr* (the same initial guess for each player), this will find the $\beta_6$ and *xr* pairing that minimizes the *mse* of our model, essentially fitting the model as closely as possible.  
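
One way to build intuition for what the optimizer can find: when every player gets his own free *xr* value, setting *xr* equal to the residual of the five-feature model already makes the six-feature regression fit the data it was trained on essentially exactly. The sketch below demonstrates this on synthetic data (all numbers are invented). It is a closed-form illustration, not the BFGS procedure used in this notebook, and the values BFGS returns may differ from these residuals, since rescaling or shifting *xr* leaves the fit unchanged.

```python
import numpy as np

rng = np.random.default_rng(1)
n_players = 40

# Synthetic stand-ins for the five physical features and the observed TCPA
X = rng.normal(size=(n_players, 5))
y = 0.6 + X @ np.array([0.03, -0.02, 0.01, 0.0, 0.02]) + rng.normal(0, 0.05, n_players)

def fit_rmse(features, target):
    """Ordinary least squares with an intercept; returns (predictions, RMSE)."""
    A = np.column_stack([np.ones(len(target)), features])
    beta, *_ = np.linalg.lstsq(A, target, rcond=None)
    pred = A @ beta
    return pred, np.sqrt(np.mean((target - pred) ** 2))

# Step 1: fit the base model and keep its residuals (the epsilon term)
base_pred, base_rmse = fit_rmse(X, y)
xr = y - base_pred

# Step 2: refit with xr as a sixth feature
_, xr_rmse = fit_rmse(np.column_stack([X, xr]), y)

print("RMSE without xr:", base_rmse)
print("RMSE with xr:   ", xr_rmse)
```

Read this way, *xr* is a per-player summary of whatever the physical features leave unexplained.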

In [None]:
import numpy as np
import pandas as pd
from sklearn import metrics
from scipy.optimize import minimize
import datastore as dataStore
from sklearn.linear_model import LinearRegression
from operator import itemgetter
from IPython.display import HTML



def calculate_mse(xr, df):
    linear_regression = LinearRegression()
    df.drop(labels='xr', axis="columns", inplace=True)
    df['xr'] = pd.Series(xr, index=df.index)
    y = df['CompPer']
    x = df[['TopSpeed', 'MaxAccel', 'Height', 'Weight', 'Age', 'xr']]
    linear_regression.fit(x, y)
    y_pred = linear_regression.predict(x)
    mse = np.sqrt(metrics.mean_squared_error(y, y_pred))
    #print(mse)
    return mse

def grab_best_cov_players(cursor, min_attempts):
    players = dataStore.get_zoi_players(cursor, min_attempts)
    raw_zoi_arr = create_zoi_array(cursor, players)
    sorted_best_players = sorted(raw_zoi_arr, key=itemgetter(1))
    nflId_arr = [name[0] for name in sorted_best_players]
    eff_arr = [name[1] for name in sorted_best_players]
    return nflId_arr, eff_arr

def create_zoi_array(c, players):
    zoi_eff_arr = []
    for player in players:
        sum_comp = 0
        completions = dataStore.get_zoi_plays_by_player(c, player[0])
        for entry in completions:
            sum_comp += entry[0]
        eff_data = [player[0], sum_comp/len(completions)]
        zoi_eff_arr.append(eff_data)
    return zoi_eff_arr 

def get_names(c, nflIds):
    names = []
    for nflId in nflIds:
        name = dataStore.get_name_by_nfl(c, nflId)[0][0]
        names.append(name)
    return names

## Turn off chained assignment warning
pd.set_option('mode.chained_assignment', None)

## Connect to database
connection = dataStore.create_data_store()
cursor = connection.cursor()

## Grab info from database
nflId_arr, eff_arr = grab_best_cov_players(cursor, 30) # Minimum 30 attempts
data_list = dataStore.get_modelinfo_by_nflId(cursor, nflId_arr)

## Build Dataframe
df = pd.DataFrame(data_list, columns=['NflId', 'CompPer', 'TopSpeed', 'MaxAccel', 'Height', 'Weight', 'Age'])

## Initial Guess for additional term (xr): one value per player, all starting at 1
x0 = np.ones(len(df))
df['xr'] = pd.Series(x0, index=df.index)
print("fitting...")

## SciPy Optimization, including optimized xr
res = minimize(calculate_mse, x0, args=(df,))
xr_df = pd.DataFrame(res.x, columns=['xr'])
names = get_names(cursor, df['NflId'].values)
xr_df['Name'] = pd.Series(names, index=xr_df.index)
xr_df.sort_values('xr', ascending=False, inplace=True, ignore_index=True)
display(HTML(xr_df.head(15).to_html()))

Now that we have our additional term, we can plug it back into our model and see what difference it makes.  

In [None]:
import datastore as dataStore
import sqlite3
from sqlite3 import Error
import numpy as np
from operator import itemgetter
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn import metrics
from scipy.optimize import minimize
import plotly.express as px

def calculate_mse(xr, df):
    linear_regression = LinearRegression()
    df.drop(labels='xr', axis="columns", inplace=True)
    df['xr'] = pd.Series(xr, index=df.index)
    y = df['CompPer']
    x = df[['TopSpeed', 'MaxAccel', 'Height', 'Weight', 'Age', 'xr']]
    linear_regression.fit(x, y)
    y_pred = linear_regression.predict(x)
    mse = np.sqrt(metrics.mean_squared_error(y, y_pred))
    #print(mse)
    return mse


def grab_best_cov_players(cursor, min_attempts):
    players = dataStore.get_zoi_players(cursor, min_attempts)
    raw_zoi_arr = create_zoi_array(cursor, players)
    sorted_best_players = sorted(raw_zoi_arr, key=itemgetter(1))
    nflId_arr = [name[0] for name in sorted_best_players]
    eff_arr = [name[1] for name in sorted_best_players]
    return nflId_arr, eff_arr

def create_zoi_array(c, players):
    zoi_eff_arr = []
    for player in players:
        sum_comp = 0
        completions = dataStore.get_zoi_plays_by_player(c, player[0])
        for entry in completions:
            sum_comp += entry[0]
        eff_data = [player[0], sum_comp/len(completions)]
        zoi_eff_arr.append(eff_data)
    return zoi_eff_arr    

def convert_positions(positions):
    single_positions = []
    for position in positions:
        if position in ['CB']:
            single_positions.append('CB')
        elif position in ['SS', 'FS']:
            single_positions.append('S')
        elif position in ['LB', 'MLB']:
            single_positions.append('LB')
    return single_positions

def get_names(c, nflIds):
    names = []
    for nflId in nflIds:
        name = dataStore.get_name_by_nfl(c, nflId)[0][0]
        names.append(name)
    return names

def get_positions(c, nflIds):
    names = []
    for nflId in nflIds:
        name = dataStore.get_position_by_nflId(c, nflId)[0]
        names.append(name)
    return names

## Turn off chained assignment warning
pd.set_option('mode.chained_assignment', None)
pd.set_option('display.max_columns', 7)

## Connect to database
connection = dataStore.create_data_store()
cursor = connection.cursor()

## Grab info from database
nflId_arr, eff_arr = grab_best_cov_players(cursor, 30) # Minimum 30 attempts
data_list = dataStore.get_modelinfo_by_nflId(cursor, nflId_arr)

## Build Dataframe
df = pd.DataFrame(data_list, columns=['NflId', 'CompPer', 'TopSpeed', 'MaxAccel', 'Height', 'Weight', 'Age'])
positions = get_positions(cursor, df['NflId'].values)
df['Position'] = pd.Series(positions, index=df.index)

## Collapse specific positions into general groups (CB, S, LB) for plotting
int_positions = convert_positions(df['Position'].tolist())
df['GenPosition'] = pd.Series(int_positions, index=df.index)


linear_regression = LinearRegression()


## Initial Guess for additional term (xr): one value per player, all starting at 1
x0 = np.ones(len(df))
df['xr'] = pd.Series(x0, index=df.index)
print("fitting...")

## SciPy Optimization, including optimized xr
res = minimize(calculate_mse, x0, args=(df,))
#print(res.x)

## Test new xr Hypothesis
df.drop(labels='xr', axis="columns", inplace=True)
df['xr'] = pd.Series(res.x, index=df.index)
y = df['CompPer']
x = df[['TopSpeed', 'MaxAccel', 'Height', 'Weight', 'Age', 'xr']]
#xtrain, xtest, ytrain, ytest = train_test_split(x, y, test_size=0.3)
linear_regression.fit(x, y)
y_pred = linear_regression.predict(x)

## Display Results (np.sqrt of the MSE gives the root mean squared error)
rmse = np.sqrt(metrics.mean_squared_error(y, y_pred))
print('Root Mean Squared Error: ( %r )' % rmse)

## Build graph Dataframe
best_xr_df = df[['CompPer', 'xr', 'GenPosition']]
names = get_names(cursor, df['NflId'].values)
best_xr_df['DisplayName'] = pd.Series(names, index=best_xr_df.index)
best_xr_df['CompPer_pred'] = pd.Series(y_pred, index=best_xr_df.index)
best_xr_df.sort_values('xr', ascending=False, inplace=True, ignore_index=True)

## Display graph
fig = px.scatter(best_xr_df, x='CompPer', y='CompPer_pred', hover_data=['DisplayName'], color='GenPosition')
fig.update_layout(
        yaxis=dict(tickformat=".2%"),
        xaxis=dict(tickformat=".2%"),
        title={
        'text': "Predicted Completion % Allowed vs Completion % Allowed, Additional Term Included",
        'y':0.99,
        'x':0.5,
        'xanchor': 'center',
        'yanchor': 'top'},
        )
display(HTML(fig.show("notebook")))

# Results
Clearly, the addition of the *xr* term makes the model fit the observed data much more closely.  But what does this value mean?  In short, the *xr* term represents everything that goes into defending a pass, other than the raw physical characteristics we are provided.  This includes things like reaction time, skill at deflecting a pass, mental preparedness, etc.  These are qualities that are difficult to quantify, but as the game and data science evolve and become more intertwined, we may be able to factor them out into their own features in the model, similar to how we factored *xr* out of $\epsilon$.  

In [None]:
import datastore as dataStore
import sqlite3
from sqlite3 import Error
import numpy as np
from operator import itemgetter
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn import metrics
from scipy.optimize import minimize
import plotly.express as px

def calculate_mse(xr, df):
    linear_regression = LinearRegression()
    df.drop(labels='xr', axis="columns", inplace=True)
    df['xr'] = pd.Series(xr, index=df.index)
    y = df['CompPer']
    x = df[['TopSpeed', 'MaxAccel', 'Height', 'Weight', 'Age', 'xr']]
    linear_regression.fit(x, y)
    y_pred = linear_regression.predict(x)
    mse = np.sqrt(metrics.mean_squared_error(y, y_pred))
    #print(mse)
    return mse


def grab_best_cov_players(cursor, min_attempts):
    players = dataStore.get_zoi_players(cursor, min_attempts)
    raw_zoi_arr = create_zoi_array(cursor, players)
    sorted_best_players = sorted(raw_zoi_arr, key=itemgetter(1))
    nflId_arr = [name[0] for name in sorted_best_players]
    eff_arr = [name[1] for name in sorted_best_players]
    return nflId_arr, eff_arr

def create_zoi_array(c, players):
    zoi_eff_arr = []
    for player in players:
        sum_comp = 0
        completions = dataStore.get_zoi_plays_by_player(c, player[0])
        for entry in completions:
            sum_comp += entry[0]
        eff_data = [player[0], sum_comp/len(completions)]
        zoi_eff_arr.append(eff_data)
    return zoi_eff_arr    

def convert_positions(positions):
    single_positions = []
    for position in positions:
        if position in ['CB']:
            single_positions.append('CB')
        elif position in ['SS', 'FS']:
            single_positions.append('S')
        elif position in ['LB', 'MLB']:
            single_positions.append('LB')
    return single_positions

def get_names(c, nflIds):
    names = []
    for nflId in nflIds:
        name = dataStore.get_name_by_nfl(c, nflId)[0][0]
        names.append(name)
    return names

def get_positions(c, nflIds):
    names = []
    for nflId in nflIds:
        name = dataStore.get_position_by_nflId(c, nflId)[0]
        names.append(name)
    return names

## Turn off chained assignment warning
pd.set_option('mode.chained_assignment', None)
pd.set_option('display.max_columns', 7)

## Connect to database
connection = dataStore.create_data_store()
cursor = connection.cursor()

## Grab info from database
nflId_arr, eff_arr = grab_best_cov_players(cursor, 30) # Minimum 30 attempts
data_list = dataStore.get_modelinfo_by_nflId(cursor, nflId_arr)

## Build Dataframe
df = pd.DataFrame(data_list, columns=['NflId', 'CompPer', 'TopSpeed', 'MaxAccel', 'Height', 'Weight', 'Age'])
positions = get_positions(cursor, df['NflId'].values)
df['Position'] = pd.Series(positions, index=df.index)

## Collapse specific positions into general groups (CB, S, LB) for plotting
int_positions = convert_positions(df['Position'].tolist())
df['GenPosition'] = pd.Series(int_positions, index=df.index)


linear_regression = LinearRegression()


## Initial Guess for additional term (xr): one value per player, all starting at 1
x0 = np.ones(len(df))
df['xr'] = pd.Series(x0, index=df.index)
print("fitting...")

## SciPy Optimization, including optimized xr
res = minimize(calculate_mse, x0, args=(df,))
#print(res.x)

## Test new xr Hypothesis
df.drop(labels='xr', axis="columns", inplace=True)
df['xr'] = pd.Series(res.x, index=df.index)
y = df['CompPer']
x = df[['TopSpeed', 'MaxAccel', 'Height', 'Weight', 'Age', 'xr']]
#xtrain, xtest, ytrain, ytest = train_test_split(x, y, test_size=0.3)
linear_regression.fit(x, y)
y_pred = linear_regression.predict(x)

## Display Results (np.sqrt of the MSE gives the root mean squared error)
rmse = np.sqrt(metrics.mean_squared_error(y, y_pred))
print('Root Mean Squared Error: ( %r )' % rmse)

## Build graph Dataframe
best_xr_df = df[['CompPer', 'xr', 'GenPosition']]
names = get_names(cursor, df['NflId'].values)
best_xr_df['DisplayName'] = pd.Series(names, index=best_xr_df.index)
best_xr_df['CompPer_pred'] = pd.Series(y_pred, index=best_xr_df.index)
## Compute the residual before sorting so each row keeps its own error
best_xr_df['Epsilon'] = best_xr_df['CompPer'] - best_xr_df['CompPer_pred']
best_xr_df.sort_values('xr', ascending=False, inplace=True, ignore_index=True)
display(HTML(best_xr_df.head(15).to_html()))

## Display graph
fig = px.scatter(best_xr_df, x='CompPer', y='xr', hover_data=['DisplayName'], color='GenPosition')
fig.update_layout(
        xaxis=dict(tickformat=".2%"),
        title={
        'text': "Additional Term XR vs Completion % Allowed",
        'y':0.99,
        'x':0.5,
        'xanchor': 'center',
        'yanchor': 'top'},
        )
display(HTML(fig.show("notebook")))

It is interesting to note that as a group, the Linebackers tend to have the highest *xr*.  This could be because, as bigger and heavier players, LBs might have to rely on getting to the receiver and breaking up the pass as it arrives.  On the other hand, Cornerbacks tend to have the lowest *xr*.  As faster and more agile players, CBs might tend to find themselves in position to make plays on the ball more often.  Therefore, they are likely expected to make plays at a much higher rate than other players, and with those higher expectations come more opportunities to fall short of them.  

If we separate out by position, here is how the top players break down:
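
Splitting the ranking by position group amounts to a group-then-sort step. Here is a minimal sketch with invented players and *xr* values (the names, numbers, and the `top_by_position` helper are hypothetical):

```python
from collections import defaultdict

def top_by_position(rows, n=2):
    """Group (name, position, xr) rows by position and keep the n highest xr per group."""
    groups = defaultdict(list)
    for name, position, xr in rows:
        groups[position].append((name, xr))
    return {
        pos: sorted(players, key=lambda p: p[1], reverse=True)[:n]
        for pos, players in groups.items()
    }

# Hypothetical data
rows = [
    ("Player A", "CB", 0.8), ("Player B", "CB", 1.4), ("Player C", "CB", 1.1),
    ("Player D", "S", 1.6), ("Player E", "S", 0.9),
    ("Player F", "LB", 2.0),
]
for pos, players in top_by_position(rows).items():
    print(pos, players)
```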

In [None]:
import datastore as dataStore
import sqlite3
from sqlite3 import Error
import numpy as np
from operator import itemgetter
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn import metrics
from scipy.optimize import minimize
import plotly.express as px

def calculate_mse(xr, df):
    linear_regression = LinearRegression()
    df.drop(labels='xr', axis="columns", inplace=True)
    df['xr'] = pd.Series(xr, index=df.index)
    y = df['CompPer']
    x = df[['TopSpeed', 'MaxAccel', 'Height', 'Weight', 'Age', 'xr']]
    linear_regression.fit(x, y)
    y_pred = linear_regression.predict(x)
    mse = np.sqrt(metrics.mean_squared_error(y, y_pred))
    #print(mse)
    return mse


def grab_best_cov_players(cursor, min_attempts):
    players = dataStore.get_zoi_players(cursor, min_attempts)
    raw_zoi_arr = create_zoi_array(cursor, players)
    sorted_best_players = sorted(raw_zoi_arr, key=itemgetter(1))
    nflId_arr = [name[0] for name in sorted_best_players]
    eff_arr = [name[1] for name in sorted_best_players]
    return nflId_arr, eff_arr

def create_zoi_array(c, players):
    zoi_eff_arr = []
    for player in players:
        sum_comp = 0
        completions = dataStore.get_zoi_plays_by_player(c, player[0])
        for entry in completions:
            sum_comp += entry[0]
        eff_data = [player[0], sum_comp/len(completions)]
        zoi_eff_arr.append(eff_data)
    return zoi_eff_arr    

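def convert_positions(positions):
    # Collapse specific positions into general groups (CB, S, LB)
    single_positions = []
    for position in positions:
        if position in ['CB']:
            single_positions.append('CB')
        elif position in ['SS', 'FS']:
            single_positions.append('S')
        elif position in ['LB', 'MLB']:
            single_positions.append('LB')
        else:
            # Keep one entry per player so the result still lines up with
            # df.index; 'Other' is a placeholder label for unmapped positions
            single_positions.append('Other')
    return single_positions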

def get_names(c, nflIds):
    names = []
    for nflId in nflIds:
        name = dataStore.get_name_by_nfl(c, nflId)[0][0]
        names.append(name)
    return names

def get_positions(c, nflIds):
    # Look up the listed position for each player id
    positions = []
    for nflId in nflIds:
        position = dataStore.get_position_by_nflId(c, nflId)[0]
        positions.append(position)
    return positions

## Turn off chained assignment warning
pd.set_option('mode.chained_assignment', None)
pd.set_option('display.max_columns', 7)

## Connect to database
connection = dataStore.create_data_store()
cursor = connection.cursor()

## Grab info from database
nflId_arr, eff_arr = grab_best_cov_players(cursor, 30) # Minimum 30 attempts
data_list = dataStore.get_modelinfo_by_nflId(cursor, nflId_arr)

## Build Dataframe
df = pd.DataFrame(data_list, columns=['NflId', 'CompPer', 'TopSpeed', 'MaxAccel', 'Height', 'Weight', 'Age'])
positions = get_positions(cursor, df['NflId'].values)
df['Position'] = pd.Series(positions, index=df.index)

## Collapse each position into a general group (CB, S, LB) for the breakdown below
gen_positions = convert_positions(df['Position'].tolist())
df['GenPosition'] = pd.Series(gen_positions, index=df.index)


linear_regression = LinearRegression()


## Initial Guess for additional term (xr)
x0 = np.ones(len(df))  # one starting value per player, sized from the dataframe
df['xr'] = pd.Series(x0, index=df.index)
print("fitting...")

## SciPy Optimization, including optimized xr
res = minimize(calculate_mse, x0, args=(df,))
#print(res.x)

## Test new xr Hypothesis
df.drop(labels='xr', axis="columns", inplace=True)
df['xr'] = pd.Series(res.x, index=df.index)
y = df['CompPer']
x = df[['TopSpeed', 'MaxAccel', 'Height', 'Weight', 'Age', 'xr']]
#xtrain, xtest, ytrain, ytest = train_test_split(x, y, test_size=0.3)
linear_regression.fit(x, y)
y_pred = linear_regression.predict(x)

## Display Results (the value is an RMSE, since the square root is taken)
mse = np.sqrt(metrics.mean_squared_error(y, y_pred))
print("RMSE:", mse)

## Build graph Dataframe
best_xr_df = df[['CompPer', 'xr', 'GenPosition']].copy()
names = get_names(cursor, df['NflId'].values)
best_xr_df['DisplayName'] = pd.Series(names, index=best_xr_df.index)
best_xr_df['CompPer_pred'] = pd.Series(y_pred, index=best_xr_df.index)
## Attach the residual before sorting so each value stays on its own player's row
error = y - y_pred
best_xr_df['Epsilon'] = error
best_xr_df.sort_values('xr', ascending=False, inplace=True, ignore_index=True)
best_xr_df.to_csv('submission.csv', index=False)

cb_xr_df = best_xr_df.loc[best_xr_df['GenPosition'] == 'CB', ['CompPer', 'CompPer_pred', 'xr', 'GenPosition', 'DisplayName', 'Epsilon']]
s_xr_df = best_xr_df.loc[best_xr_df['GenPosition'] == 'S', ['CompPer', 'CompPer_pred', 'xr', 'GenPosition', 'DisplayName', 'Epsilon']]
lb_xr_df = best_xr_df.loc[best_xr_df['GenPosition'] == 'LB', ['CompPer', 'CompPer_pred', 'xr', 'GenPosition', 'DisplayName', 'Epsilon']]

## Display best xr for each position group
display(HTML("Top Cornerbacks by XR"))
display(HTML(cb_xr_df.head(15).to_html()))
display(HTML("Top Safeties by XR"))
display(HTML(s_xr_df.head(15).to_html()))
display(HTML("Top Linebackers by XR"))
display(HTML(lb_xr_df.head(15).to_html()))

Looking at the results, they seem to pass a rough eye test: many of the players with high *xr* ratings are widely regarded as some of the better coverage players in the league. While this rating is not an exact science, it offers a way to look at pass defense that is not obvious from the raw physical data collected today. As data science becomes more intertwined with the league and data collection becomes more prevalent, this method can be refined, and perhaps the "X-Factor" can be demystified.
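For readers who want to experiment with refining the method, the *xr* fit used above can be sketched on synthetic data. This is a minimal sketch and not the notebook's pipeline. The features and completion rates are randomly generated, `rmse_with_xr` is a hypothetical helper name, and the optimizer is capped at 50 iterations to keep it quick. It also starts from small random values instead of a constant guess, because a constant column is collinear with the regression intercept. That starting point is an assumption of this sketch.

```python
import numpy as np
from scipy.optimize import minimize
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n_players = 10

# Synthetic stand-ins for two physical measurements and a completion rate.
physical = rng.normal(size=(n_players, 2))
comp_pct = 0.6 + 0.05 * physical[:, 0] + rng.normal(scale=0.05, size=n_players)


def rmse_with_xr(xr, physical, target):
    """Refit the regression with xr as an extra column; return in-sample RMSE."""
    features = np.column_stack([physical, xr])
    model = LinearRegression().fit(features, target)
    residual = target - model.predict(features)
    return float(np.sqrt(np.mean(residual ** 2)))


# Assumption of this sketch: a small random initial guess, one value per player.
x0 = rng.normal(scale=0.1, size=n_players)
start_rmse = rmse_with_xr(x0, physical, comp_pct)

# Let SciPy adjust every player's xr to reduce the regression error.
result = minimize(
    rmse_with_xr, x0, args=(physical, comp_pct), options={"maxiter": 50}
)
fitted_rmse = rmse_with_xr(result.x, physical, comp_pct)

print("RMSE at initial guess:", start_rmse)
print("RMSE after optimizing xr:", fitted_rmse)
```

Because there is one free *xr* value per player, the optimizer can in principle push the in-sample error very low. One possible refinement would be to judge the fit on plays that were held out from the optimization.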