# RunPassBot: Data Munging

To build a model, we first need data. This project grew out of a dataset released on Kaggle: the 2015 play-by-play data. It gave me the idea of predicting the type of play a team will call based on what was called in similar situations. Removing every play that was not a run or a pass left a single season too small to yield any insights, so I used the nflscrapR tool in R to pull the previous seasons as well.

In [2]:
import glob
import os
import pandas as pd
from sklearn.model_selection import train_test_split

All of the datasets for the 2009-2015 seasons were placed in the data/raw directory to keep the raw data immutable: the data passes through a pipeline and is manipulated in memory, but the original files stay the same.

In [3]:
path_to_raw_data = '../data/raw/'
path_to_processed_data = '../data/processed/'

We could open each file individually and concatenate the dataframes in a separate step. Instead, we use the built-in glob and os modules to find all the files and concatenate them into a single dataframe.

In [4]:
all_data_files = glob.glob(os.path.join(path_to_raw_data, "*.csv"))
all_raw_dataframe = pd.concat(pd.read_csv(f) for f in all_data_files)



In [5]:
all_data_files

['../data/raw/season2009playby.csv',
 '../data/raw/season2010playby.csv',
 '../data/raw/season2011playby.csv',
 '../data/raw/season2012playby.csv',
 '../data/raw/season2013playby.csv',
 '../data/raw/season2014playby.csv',
 '../data/raw/season2015playby.csv']
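The listing above happens to be in season order, but `glob.glob` returns files in arbitrary order, and `pd.concat` keeps each file's own row index by default. A minimal sketch, assuming we want a deterministic file order and one running index; the helper name `load_seasons` and the tiny CSV files are made up for illustration:

```python
import glob
import os
import tempfile

import pandas as pd


def load_seasons(directory):
    """Read every CSV in `directory` (sorted by name) into one dataframe."""
    files = sorted(glob.glob(os.path.join(directory, "*.csv")))
    # ignore_index=True renumbers the rows 0..n-1 across all files.
    return pd.concat((pd.read_csv(f) for f in files), ignore_index=True)


# Two tiny stand-in files, only to show the behaviour.
demo_dir = tempfile.mkdtemp()
pd.DataFrame({"qtr": [1, 2], "Season": [2009, 2009]}).to_csv(
    os.path.join(demo_dir, "season2009playby.csv"), index=False)
pd.DataFrame({"qtr": [3], "Season": [2010]}).to_csv(
    os.path.join(demo_dir, "season2010playby.csv"), index=False)

combined = load_seasons(demo_dir)
print(combined)
```

Whether to renumber the index is a design choice; it mainly matters if rows are later selected by label.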

In [6]:
all_raw_dataframe.head()

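| | Unnamed: 0 | Date | GameID | Drive | qtr | down | time | TimeUnder | TimeSecs | PlayTimeDiff | ... | Accepted.Penalty | PenalizedTeam | PenaltyType | PenalizedPlayer | Penalty.Yards | PosTeamScore | DefTeamScore | ScoreDiff | AbsScoreDiff | Season |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 46 | 2009-09-10 | 2009091000 | 1 | 1 | NaN | 15:00 | 15.0 | 3600.0 | 0.0 | ... | 0 | NaN | NaN | NaN | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 2009 |
| 1 | 68 | 2009-09-10 | 2009091000 | 1 | 1 | 1.0 | 14:53 | 15.0 | 3593.0 | 7.0 | ... | 0 | NaN | NaN | NaN | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 2009 |
| 2 | 92 | 2009-09-10 | 2009091000 | 1 | 1 | 2.0 | 14:16 | 15.0 | 3556.0 | 37.0 | ... | 0 | NaN | NaN | NaN | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 2009 |
| 3 | 113 | 2009-09-10 | 2009091000 | 1 | 1 | 3.0 | 13:35 | 14.0 | 3515.0 | 41.0 | ... | 0 | NaN | NaN | NaN | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 2009 |
| 4 | 139 | 2009-09-10 | 2009091000 | 1 | 1 | 4.0 | 13:27 | 14.0 | 3507.0 | 8.0 | ... | 0 | NaN | NaN | NaN | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 2009 |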


The dataframe has 63 columns, many of which are unnecessary for this project. The following list contains every column we do not need and will drop from the dataframe.

In [7]:
columns_to_remove = ['Unnamed: 0', 'Date', 'GameID', 'Drive', 'time', 'TimeUnder', 'TimeSecs',
 					'PlayTimeDiff', 'SideofField', 'yrdln','ydsnet','GoalToGo', 'FirstDown','posteam',
 					'DefensiveTeam','desc','PlayAttempted','Yards.Gained','sp','Touchdown','ExPointResult',
 					'TwoPointConv','DefTwoPoint','Safety','Passer','PassAttempt','PassOutcome','PassLength',
 					'PassLocation','InterceptionThrown','Interceptor','Rusher','RushAttempt','RunLocation','RunGap',
 					'Receiver','Reception','ReturnResult','Returner','Tackler1','Tackler2','FieldGoalResult',
 					'FieldGoalDistance','Fumble','RecFumbTeam','RecFumbPlayer','Sack','Challenge.Replay',
 					'ChalReplayResult','Accepted.Penalty','PenalizedTeam','PenaltyType','PenalizedPlayer',
 					'Penalty.Yards','PosTeamScore','DefTeamScore','AbsScoreDiff','Season']

In [8]:
all_raw_dataframe.drop(columns_to_remove, inplace=True, axis=1)

We also want the dataframe to be free of NaNs, as the model will choke on them. First, though, we need to know how many NaNs are present, to see whether dropping them will noticeably reduce the number of data points.

In [9]:
all_raw_dataframe.isnull().sum()

qtr               0
down          47098
yrdline100      622
ydstogo           0
PlayType          0
ScoreDiff     20459
dtype: int64

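Having NaNs in the down field is not a big problem; a NaN in this field indicates an event that we don't care about, so we can safely drop these rows. Note that dropna() with no arguments removes every row that has a NaN in any column, so the rows missing yrdline100 or ScoreDiff are dropped as well.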

In [10]:
all_raw_dataframe.dropna(inplace=True)
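To make the effect of the call above concrete, here is a minimal sketch on made-up rows. It contrasts the default `dropna()` with the `subset` argument, which would restrict the check to chosen columns if that were wanted; the notebook itself uses the default.

```python
import numpy as np
import pandas as pd

# Made-up rows: one complete row, and three rows each missing a different value.
demo = pd.DataFrame({
    "down": [1.0, np.nan, 2.0, 3.0],
    "yrdline100": [58.0, 35.0, np.nan, 56.0],
    "ScoreDiff": [0.0, 0.0, 3.0, np.nan],
})

any_nan_dropped = demo.dropna()            # keeps only fully populated rows
down_only = demo.dropna(subset=["down"])   # only requires `down` to be present

print(len(any_nan_dropped), len(down_only))
```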

With the NaNs gone, we still need to remove the plays that aren't run or pass plays. A look at the first few rows shows that other play types, such as punts, are still present. We then create a short list of the play types we want to keep and filter out every play that isn't in it.

In [11]:
all_raw_dataframe.head()

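| | qtr | down | yrdline100 | ydstogo | PlayType | ScoreDiff |
|---|---|---|---|---|---|---|
| 1 | 1 | 1.0 | 58.0 | 10 | Pass | 0.0 |
| 2 | 1 | 2.0 | 53.0 | 5 | Run | 0.0 |
| 3 | 1 | 3.0 | 56.0 | 8 | Pass | 0.0 |
| 4 | 1 | 4.0 | 56.0 | 8 | Punt | 0.0 |
| 5 | 1 | 1.0 | 98.0 | 10 | Run | 0.0 |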


In [12]:
play_list = ['Run', 'Pass']
final_clean_dataset = all_raw_dataframe[all_raw_dataframe['PlayType'].isin(play_list)]

The .describe() and .head() output below shows what the dataframe looks like now that we have cleaned it up.

In [13]:
final_clean_dataset.describe()

| | qtr | down | yrdline100 | ydstogo | ScoreDiff |
|---|---|---|---|---|---|
| count | 217699.0 | 217699.0 | 217699.0 | 217699.0 | 217699.0 |
| mean | 2.552616 | 1.787652 | 52.637431 | 8.613287 | -1.116027 |
| std | 1.131997 | 0.815865 | 24.739398 | 3.913957 | 11.17485 |
| min | 1.0 | 1.0 | 1.0 | 1.0 | -59.0 |
| 25% | 2.0 | 1.0 | 34.0 | 6.0 | -7.0 |
| 50% | 3.0 | 2.0 | 56.0 | 10.0 | 0.0 |
| 75% | 4.0 | 2.0 | 73.0 | 10.0 | 5.0 |
| max | 5.0 | 4.0 | 99.0 | 50.0 | 59.0 |


In [14]:
final_clean_dataset.head()

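| | qtr | down | yrdline100 | ydstogo | PlayType | ScoreDiff |
|---|---|---|---|---|---|---|
| 1 | 1 | 1.0 | 58.0 | 10 | Pass | 0.0 |
| 2 | 1 | 2.0 | 53.0 | 5 | Run | 0.0 |
| 3 | 1 | 3.0 | 56.0 | 8 | Pass | 0.0 |
| 5 | 1 | 1.0 | 98.0 | 10 | Run | 0.0 |
| 6 | 1 | 2.0 | 98.0 | 10 | Pass | 0.0 |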


In the code for the project, I save the dataframe to a csv file, so I can load only the clean dataset instead of continually running the above code.
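A minimal sketch of that save-and-reload step. The file name `clean_plays.csv` is an assumption (the notebook does not name the file), the rows are made up, and a temporary directory stands in for `path_to_processed_data`:

```python
import os
import tempfile

import pandas as pd

# Stand-in for final_clean_dataset, with made-up rows.
clean = pd.DataFrame({
    "qtr": [1, 1], "down": [1.0, 2.0], "yrdline100": [58.0, 53.0],
    "ydstogo": [10, 5], "PlayType": ["Pass", "Run"], "ScoreDiff": [0.0, 0.0],
})

processed_dir = tempfile.mkdtemp()  # stand-in for path_to_processed_data
out_path = os.path.join(processed_dir, "clean_plays.csv")  # hypothetical file name

# index=False keeps the row index out of the file, so reloading does not
# produce an extra "Unnamed: 0" column.
clean.to_csv(out_path, index=False)
reloaded = pd.read_csv(out_path)
print(reloaded.head())
```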

# RunPassBot: The Model

TPOT, an automated machine learning tool, selected the Gradient Boosting Classifier as the best model, so that is the classifier we use for this project.

In [15]:
from sklearn.ensemble import GradientBoostingClassifier

In [16]:
gbc = GradientBoostingClassifier(learning_rate=0.16, max_features=1.0, 
								 min_weight_fraction_leaf=1e-06, n_estimators=1000, random_state=42)

Before we start training the model, we need to split the dataset into training and test sets. We use scikit-learn's built-in train_test_split function for this, after naming the feature columns and the target column.

In [17]:
features = ['ScoreDiff', 'down', 'qtr', 'ydstogo', 'yrdline100']
target = 'PlayType'

In [18]:
# Work on an explicit copy so the assignment does not trigger SettingWithCopyWarning.
final_clean_dataset = final_clean_dataset.copy()
final_clean_dataset['PlayType'] = final_clean_dataset['PlayType'].map({'Run': 0, 'Pass': 1})


In [19]:
(train_X, test_X, train_y, test_y) = train_test_split(final_clean_dataset[features], final_clean_dataset[target], test_size = 0.2)
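As written, the split changes on every run because no `random_state` is passed. A sketch, on synthetic data, of a reproducible split that also keeps the Run/Pass ratio the same in both sets via `stratify`; these two arguments are suggestions, not what the notebook ran:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(0)
demo = pd.DataFrame({
    "down": rng.randint(1, 5, size=100),
    "ydstogo": rng.randint(1, 20, size=100),
    "PlayType": [0] * 40 + [1] * 60,   # 0 = Run, 1 = Pass
})

train_X, test_X, train_y, test_y = train_test_split(
    demo[["down", "ydstogo"]], demo["PlayType"],
    test_size=0.2, random_state=42, stratify=demo["PlayType"])

# The 40/60 class ratio is preserved in the 20-row test set.
print(len(train_X), len(test_X), test_y.mean())
```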

In [20]:
%time gbc.fit(train_X, train_y)

CPU times: user 1min 50s, sys: 5.43 s, total: 1min 55s
Wall time: 2min 5s


GradientBoostingClassifier(criterion='friedman_mse', init=None,
              learning_rate=0.16, loss='deviance', max_depth=3,
              max_features=1.0, max_leaf_nodes=None,
              min_impurity_split=1e-07, min_samples_leaf=1,
              min_samples_split=2, min_weight_fraction_leaf=1e-06,
              n_estimators=1000, presort='auto', random_state=42,
              subsample=1.0, verbose=0, warm_start=False)

In [21]:
%time gbc.predict(test_X)

CPU times: user 534 ms, sys: 3.93 ms, total: 538 ms
Wall time: 539 ms


array([0, 0, 0, ..., 1, 1, 0])

In [22]:
gbc.score(test_X, test_y)

0.67429949471750117
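For a classifier, `score` returns mean accuracy: the fraction of test rows whose predicted label equals the true label. A small sketch with a made-up model and synthetic data showing that the two computations agree:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic binary problem, unrelated to the play-by-play data.
rng = np.random.RandomState(0)
X = rng.rand(200, 3)
y = (X[:, 0] + 0.2 * rng.randn(200) > 0.5).astype(int)

model = GradientBoostingClassifier(n_estimators=20, random_state=0)
model.fit(X[:150], y[:150])

# Accuracy computed by hand versus the built-in score method.
accuracy_by_hand = np.mean(model.predict(X[150:]) == y[150:])
built_in_score = model.score(X[150:], y[150:])
print(accuracy_by_hand, built_in_score)
```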

# Making it Interactive

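To try the model on an arbitrary game situation, we build a small form with ipywidgets: text boxes for the down, quarter, distance to the first down, field position and score difference, a checkbox for the side of the field, and a button that triggers the prediction.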

In [32]:
from ipywidgets import widgets, interact
from IPython.display import display

In [41]:
downNumber = widgets.Text(description="Down")
quarter = widgets.Text(description = "Quarter")
distanceToFirst = widgets.Text(description = "Distance to 1st down")
fieldPos = widgets.Text(description= "Field Position")
scoreDiff = widgets.Text(description = "Score Diff")
predictButton = widgets.Button(description = "Predict!")

In [37]:
field_side = widgets.Checkbox(description = "Own")

In [43]:
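def predict():
    # Text widgets return strings, so convert before doing arithmetic.
    down = int(downNumber.value)
    distanceToFirstDown = int(distanceToFirst.value)
    fieldPosition = int(fieldPos.value)
    quarterNum = int(quarter.value)
    scorediff = int(scoreDiff.value)

    # On its own side of the field, the offense is 100 - yard line away from the far goal line.
    if field_side.value:
        yrdsto100 = 100 - fieldPosition
    else:
        yrdsto100 = fieldPosition

    # Column order must match the `features` list used for training.
    row = pd.DataFrame([[scorediff, down, quarterNum, distanceToFirstDown, yrdsto100]],
                       columns=features)
    print('Pass' if gbc.predict(row)[0] == 1 else 'Run')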
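The field-position conversion can be isolated in a small pure function, which is easier to check than logic inside a widget callback. A sketch, assuming `yrdline100` counts yards to the opponent's goal line (which is what the `100 - fieldPosition` branch above implies); the name `to_yrdline100` is hypothetical:

```python
def to_yrdline100(field_position, own_side):
    """Convert a yard line (1-50) plus side of field to yards from the opponent's goal line."""
    field_position = int(field_position)   # widget text arrives as a string
    if not 1 <= field_position <= 50:
        raise ValueError("field position must be between 1 and 50")
    return 100 - field_position if own_side else field_position


print(to_yrdline100("20", own_side=True))    # offense at its own 20-yard line
print(to_yrdline100("20", own_side=False))   # offense at the opponent's 20-yard line
```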
    

In [42]:
display(downNumber)
display(quarter)
display(scoreDiff)
display(distanceToFirst)
display(fieldPos)
display(field_side)
display(predictButton)

def on_button_clicked(b):
    predict()

predictButton.on_click(on_button_clicked)