# RunPassBot: Data Munging

To build a model, we first need data. The genesis of this project was a dataset released on Kaggle: the 2015 play-by-play data. That dataset gave me the idea that the type of play a team will call could be predicted from similar past situations. I used the nflscrapR tool in R to pull the earlier seasons as well, because removing plays that were neither runs nor passes shrank the 2015 data to the point where I felt no insights could be gained from it alone.

In [1]:
import feather
import glob
import os
import pandas as pd
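# sklearn.cross_validation was removed in scikit-learn 0.20; train_test_split now lives in model_selection
from sklearn.model_selection import train_test_split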

All of the datasets for the 2009-2015 seasons were placed in the /data/raw directory to keep the raw data immutable: the data passes through a pipeline and is manipulated in memory, but the original files stay the same.

In [2]:
path_to_raw_data = '../data/raw/'
path_to_processed_data = '../data/processed/'

We could open each file individually and then concatenate the dataframes in a separate step. Instead, we will use the built-in glob and os modules to find all the files and concatenate them into a single dataframe in one go.

In [3]:
all_data_files = glob.glob(os.path.join(path_to_raw_data, "*.csv"))
all_raw_dataframe = pd.concat(pd.read_csv(f) for f in all_data_files)
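
As a small self-contained illustration of the glob-and-concat pattern, the sketch below writes two tiny CSV files to a temporary directory and reads them back as one dataframe. The file names and columns are made up for the example, and sorting the paths and passing `ignore_index=True` are optional choices that are not part of the cell above.

```python
import glob
import os
import tempfile

import pandas as pd

# Hypothetical stand-ins for the per-season play-by-play files.
season_a = pd.DataFrame({"qtr": [1, 1], "PlayType": ["Run", "Pass"]})
season_b = pd.DataFrame({"qtr": [2], "PlayType": ["Punt"]})

with tempfile.TemporaryDirectory() as raw_dir:
    season_a.to_csv(os.path.join(raw_dir, "season_a.csv"), index=False)
    season_b.to_csv(os.path.join(raw_dir, "season_b.csv"), index=False)

    # glob returns paths in arbitrary order, so sort them for a stable row order.
    csv_paths = sorted(glob.glob(os.path.join(raw_dir, "*.csv")))

    # ignore_index=True gives the combined frame a fresh 0..n-1 index
    # instead of repeating each file's own row numbers.
    combined = pd.concat((pd.read_csv(p) for p in csv_paths), ignore_index=True)

print(combined.shape)
```

Without `ignore_index=True`, each file's row numbers are kept, so the combined index can contain duplicates.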

The dataframe has 63 columns, many of which are unnecessary for this project. The following list names every column that we will drop from the dataframe.

In [4]:
columns_to_remove = [
    'Unnamed: 0', 'Date', 'GameID', 'Drive', 'time', 'TimeUnder', 'TimeSecs',
    'PlayTimeDiff', 'SideofField', 'yrdln', 'ydsnet', 'GoalToGo', 'FirstDown', 'posteam',
    'DefensiveTeam', 'desc', 'PlayAttempted', 'Yards.Gained', 'sp', 'Touchdown', 'ExPointResult',
    'TwoPointConv', 'DefTwoPoint', 'Safety', 'Passer', 'PassAttempt', 'PassOutcome', 'PassLength',
    'PassLocation', 'InterceptionThrown', 'Interceptor', 'Rusher', 'RushAttempt', 'RunLocation', 'RunGap',
    'Receiver', 'Reception', 'ReturnResult', 'Returner', 'Tackler1', 'Tackler2', 'FieldGoalResult',
    'FieldGoalDistance', 'Fumble', 'RecFumbTeam', 'RecFumbPlayer', 'Sack', 'Challenge.Replay',
    'ChalReplayResult', 'Accepted.Penalty', 'PenalizedTeam', 'PenaltyType', 'PenalizedPlayer',
    'Penalty.Yards', 'PosTeamScore', 'DefTeamScore', 'AbsScoreDiff', 'Season',
]

In [5]:
all_raw_dataframe.drop(columns_to_remove, inplace=True, axis=1)
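
Since only a handful of columns survive, an alternative worth considering is to name the columns to keep rather than the ones to drop. The sketch below shows both routes on a toy frame; `toy` and `columns_to_keep` are hypothetical names introduced for illustration, not part of the project code.

```python
import pandas as pd

# Toy frame with a couple of raw columns plus the ones the model needs.
toy = pd.DataFrame({
    "Unnamed: 0": [0, 1],
    "GameID": [2009091000, 2009091000],
    "qtr": [1, 1],
    "down": [1.0, 2.0],
    "yrdline100": [58.0, 53.0],
    "ydstogo": [10, 5],
    "PlayType": ["Pass", "Run"],
    "ScoreDiff": [0.0, 0.0],
})

columns_to_keep = ["qtr", "down", "yrdline100", "ydstogo", "PlayType", "ScoreDiff"]

# Selecting with a list returns a new dataframe; the original is left untouched.
trimmed = toy[columns_to_keep]

# Same result via drop, deriving the unwanted columns from the keep list.
dropped = toy.drop(columns=[c for c in toy.columns if c not in columns_to_keep])

print(trimmed.columns.tolist())
```

A keep list is shorter to maintain here, while an explicit drop list documents exactly what was discarded.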

We also want to make sure that the dataframe is free of NaNs, as the model will choke on them. But first, we need to know how many NaNs are present, to see whether dropping them will noticeably reduce the number of data points.

In [6]:
all_raw_dataframe.isnull().sum()

qtr               0
down          47098
yrdline100      622
ydstogo           0
PlayType          0
ScoreDiff     20459
dtype: int64
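
Raw counts are easier to judge next to the size of the dataset. As a minimal sketch on made-up data, the share of missing values per column and the number of rows that contain at least one NaN can be computed like this:

```python
import numpy as np
import pandas as pd

# Toy data: two missing 'down' values and one missing 'ScoreDiff'.
toy = pd.DataFrame({
    "qtr": [1, 1, 2, 2],
    "down": [1.0, np.nan, 2.0, np.nan],
    "ScoreDiff": [0.0, 0.0, np.nan, 3.0],
})

nan_counts = toy.isnull().sum()        # NaNs per column
nan_share = toy.isnull().mean()        # fraction of rows missing, per column
rows_with_any_nan = int(toy.isnull().any(axis=1).sum())  # rows a plain dropna() would remove

print(nan_share)
print(rows_with_any_nan, "of", len(toy), "rows contain at least one NaN")
```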

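Having NaNs in the down field is not a big problem: a NaN there marks an event that we don't care about for this model, so those rows can be dropped without losing any plays of interest. Note that dropna() with no arguments also removes any row that is missing yrdline100 or ScoreDiff.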

In [7]:
all_raw_dataframe.dropna(inplace=True)
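
Called without arguments, `dropna()` removes a row if any of its columns is missing. The sketch below, on made-up data, contrasts that with restricting the check to a single column through the `subset` parameter; whether the narrower variant is preferable for this project is a judgment call, not something the original pipeline does.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "down": [1.0, np.nan, 2.0, 3.0],
    "ScoreDiff": [0.0, 0.0, np.nan, 7.0],
    "PlayType": ["Run", "Kickoff", "Pass", "Pass"],
})

# Drops rows 1 and 2: each has a NaN in some column.
drop_any = toy.dropna()

# Drops only row 1: the check is limited to the 'down' column.
drop_down_only = toy.dropna(subset=["down"])

print(len(toy), len(drop_any), len(drop_down_only))
```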

With the NaNs gone, we still need to remove the plays that aren't run or pass plays. First we create a short list of the play types we want to keep, and then we filter out every play that isn't in that list.

In [8]:
play_list = ['Run', 'Pass']
final_clean_dataset = all_raw_dataframe[all_raw_dataframe['PlayType'].isin(play_list)]
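
To make the filtering step concrete, here is a minimal sketch on a toy frame. The play types other than `Run` and `Pass` are illustrative values, and the trailing `.copy()` is an optional addition that makes the filtered result an independent dataframe.

```python
import pandas as pd

plays = pd.DataFrame({
    "PlayType": ["Run", "Pass", "Punt", "Kickoff", "Pass"],
    "ydstogo": [10, 7, 12, 0, 3],
})

play_list = ["Run", "Pass"]

# isin() builds a boolean mask: True where PlayType is one of the listed values.
mask = plays["PlayType"].isin(play_list)

# Boolean indexing keeps the True rows; .copy() detaches the result from `plays`.
run_pass = plays[mask].copy()

print(run_pass["PlayType"].value_counts())
```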

The .head() method now shows what the dataframe looks like after cleaning.

In [9]:
final_clean_dataset.head()

   qtr  down  yrdline100  ydstogo  PlayType  ScoreDiff
1    1   1.0        58.0       10      Pass        0.0
2    1   2.0        53.0        5       Run        0.0
3    1   3.0        56.0        8      Pass        0.0
5    1   1.0        98.0       10       Run        0.0
6    1   2.0        98.0       10      Pass        0.0


In the project code, I save the dataframe to a feather file so that the dtypes and other pandas metadata are preserved.

In [10]:
# feather.write_dataframe(final_clean_dataset, path_to_processed_data+ 'clean_dataset.feather')

# RunPassBot: The Model

The model that TPOT selected as the best is the Gradient Boosting Classifier. We will use that classifier for our project.

In [11]:
from sklearn.ensemble import GradientBoostingClassifier

In [12]:
gbc = GradientBoostingClassifier(learning_rate=0.16, max_features=1.0,
                                 min_weight_fraction_leaf=1e-06, n_estimators=500, random_state=42)

Before we start training our model, we need to split the dataset into training and test sets. We use scikit-learn's built-in train_test_split function, holding out 20% of the plays for testing.

In [13]:
features = ['ScoreDiff', 'down', 'qtr', 'ydstogo', 'yrdline100']
target = 'PlayType'

In [14]:
final_clean_dataset = final_clean_dataset.copy()  # explicit copy avoids SettingWithCopyWarning
final_clean_dataset['PlayType'] = final_clean_dataset['PlayType'].map({'Run': 0, 'Pass': 1})
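
One thing to keep in mind with `Series.map` and a dictionary is that any label missing from the dictionary becomes NaN. That cannot happen here because the data was already filtered to `Run` and `Pass`, but a quick check is cheap. The sketch below uses a made-up `Sack` label to show the effect.

```python
import pandas as pd

plays = pd.DataFrame({"PlayType": ["Run", "Pass", "Pass", "Sack"]})
encoding = {"Run": 0, "Pass": 1}

encoded = plays["PlayType"].map(encoding)

# Labels that are not keys of the dict come back as NaN.
unmapped = plays.loc[encoded.isnull(), "PlayType"].unique().tolist()

print(encoded.tolist())
print("unmapped labels:", unmapped)
```

If `unmapped` is non-empty, either the filter or the dictionary needs another look before training.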


In [16]:
(train_X, test_X, train_y, test_y) = train_test_split(final_clean_dataset[features], final_clean_dataset[target], test_size = 0.2)
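
The split above is random, so the exact rows in each set (and therefore the final score) change from run to run. As a sketch of one way to make it repeatable, `train_test_split` accepts a `random_state` seed, and its `stratify` parameter keeps the run/pass ratio similar in both sets. The toy data and the seed value are assumptions for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data: five runs (0) and five passes (1).
toy = pd.DataFrame({
    "down": [1, 2, 3, 1, 2, 3, 1, 2, 3, 4],
    "ydstogo": [10, 7, 3, 10, 5, 1, 10, 8, 2, 1],
    "PlayType": [0, 1, 1, 0, 0, 0, 1, 1, 1, 0],
})
features = ["down", "ydstogo"]
target = "PlayType"

train_X, test_X, train_y, test_y = train_test_split(
    toy[features],
    toy[target],
    test_size=0.2,
    random_state=42,        # fixed seed: the same split on every run
    stratify=toy[target],   # preserve the class ratio in train and test
)

print(len(train_X), len(test_X))
```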

In [17]:
%time gbc.fit(train_X, train_y)

CPU times: user 44.3 s, sys: 2 s, total: 46.3 s
Wall time: 46.3 s


GradientBoostingClassifier(init=None, learning_rate=0.16, loss='deviance',
              max_depth=3, max_features=1.0, max_leaf_nodes=None,
              min_samples_leaf=1, min_samples_split=2,
              min_weight_fraction_leaf=1e-06, n_estimators=500,
              presort='auto', random_state=42, subsample=1.0, verbose=0,
              warm_start=False)

In [19]:
gbc.predict(test_X)

array([0, 1, 0, ..., 1, 1, 0])
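
`predict` returns hard class labels (0 for run, 1 for pass). When a confidence estimate is more useful, scikit-learn classifiers such as this one also expose `predict_proba`. The sketch below trains a small model on synthetic data, where the single feature and the rule that generates the labels are invented purely for illustration.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.RandomState(0)

# Synthetic stand-in: one feature (yards to go); passes are made more likely on long distances.
ydstogo = rng.randint(1, 21, size=200)
is_pass = (ydstogo + rng.normal(0, 4, size=200) > 8).astype(int)
X = ydstogo.reshape(-1, 1)

clf = GradientBoostingClassifier(n_estimators=50, random_state=42)
clf.fit(X, is_pass)

situations = np.array([[1], [15]])              # 1 yard to go vs. 15 yards to go
labels = clf.predict(situations)                # hard 0/1 labels
probabilities = clf.predict_proba(situations)   # one column per class, ordered as clf.classes_

print(clf.classes_)
print(labels)
print(probabilities.round(2))
```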

In [20]:
gbc.score(test_X, test_y)

0.67296738631143771
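
For a classifier, `score` returns mean accuracy, so the model classifies roughly two out of three test plays correctly. How good that is depends on the class balance, which is why it helps to compare against a baseline that always predicts the most common class, and to look at a confusion matrix. The sketch below shows the mechanics on ten made-up labels; the numbers are not results from this project.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical labels for ten plays (0 = run, 1 = pass) and hypothetical model predictions.
true_y = np.array([1, 1, 1, 1, 1, 1, 0, 0, 0, 0])
pred_y = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 1])

model_accuracy = accuracy_score(true_y, pred_y)

# Baseline that ignores the features and always predicts the most frequent class.
# For brevity it is fit and scored on the same labels; in practice, fit on the training set.
dummy_X = np.zeros((len(true_y), 1))
baseline = DummyClassifier(strategy="most_frequent").fit(dummy_X, true_y)
baseline_accuracy = baseline.score(dummy_X, true_y)

# Rows are true classes, columns are predicted classes.
matrix = confusion_matrix(true_y, pred_y)

print(model_accuracy, baseline_accuracy)
print(matrix)
```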