#NFL Game Prediction Engine

When I started looking for something to do the dev talk on, I jumped from topic to topic. I stopped when I realized that a good number of us share an interest in football, so why not make this fun and use football stats to build a simple game prediction engine? So without further ado...

###Import graphlab
This is the library that we are going to use as the backbone for our work.

In [4]:
import graphlab
import graphlab.aggregate

In [5]:
teams = graphlab.SFrame.read_csv('nfl_00-15/csv/TEAM.csv', header=True)
games = graphlab.SFrame.read_csv('nfl_00-15/csv/GAME.csv', header=True)

------------------------------------------------------
Inferred types from first line of file as 
column_type_hints=[int,int,str,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,float,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,float,float,int,int,int,int,int,int,float,float,float,float,float,float,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int]
If parsing fails due to incorrect types, you can correct
the inferred type list above and pass it to read_csv in
the column_type_hints argument
------------------------------------------------------

------------------------------------------------------
Inferred types from first line of file as 
column_type_hints=[int,int,int,str,str,str,str,int,int,int,str,str,str,float,float,int,int]
If parsing fails due to incorrect types, you can correct
the inferred type list above and pass it to read_csv in
the column_type_hints argument
------------------------------------------------------


## This block of code has had a TON of revisions. Here is where I go through and create the data set that I will be using to create the model. 

###I went through many iterations of operations on the data trying to improve the end accuracy of the generated model.

Initially I created a model that focused on the in-game stats of each game. That model had a 97% accuracy on its test data evaluation!!! But... it knew the number of passing yards, pass attempts, first downs, penalties, etc. of each game it was evaluating. This was impractical, since you couldn't conceivably have this information before a game even starts. Then I migrated to some more superstitious / non-game-related statistics. This tanked the accuracy down to 17%, which is about as atrocious as it gets. From there it made more sense to bring some of the game data back in. This improved the accuracy a great deal, and it is still being improved.

In [6]:
# sea_visit = games.filter_by("SEA", 'v', exclude=False)
# sea_games = sea_visit.append(games.filter_by("SEA", 'h', exclude=False))

# data = teams.join(sea_games, 'gid')
data = teams.join(games, 'gid')

# valid_seasons = range(2002, 2015) # We are going to use the previous 2 seasons to draw stats from for predictions.
valid_weeks = range(0, 17) 
# data = data.filter_by(valid_seasons, 'seas')
data = data.filter_by(valid_weeks, 'wk')

winners = data.groupby(key_columns='gid', operations={'win_tid': graphlab.aggregate.ARGMAX('pts', 'tid')})
data = data.join(winners, 'gid')
data['win'] = data['tid'] == data['win_tid']

prev_season_features = ['ry', 'ra', 'py', 'pa', 'pts', 'win', 'ints', 'rfd', 'pfd', 'ir', 'sky', 'top', 'tdt']
season = data.groupby(['seas', 'tname'], {'%s_sum' % x : graphlab.aggregate.SUM(x) for x in prev_season_features})
season['seas'] = season.apply(lambda x: x['seas'] + 1)

data = data.join(season, how='outer')
data = data.filter_by([2000, 2016], 'seas', exclude=True)
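# Yards per pass attempt (yppa) and yards per rush attempt (ypra), guarding against division by zero.
# Note: these are computed from the current game's own py/pa and ry/ra, so they are in-game stats.
data['yppa'] = data.apply(lambda x: float(x['py']) / float(x['pa']) if x['pa'] != 0 else 0)
data['ypra'] = data.apply(lambda x: float(x['ry']) / float(x['ra']) if x['ra'] != 0 else 0)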
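
To make the two main steps in the cell above easier to follow, here is a minimal plain-Python sketch of the same idea: label the winner of each game, then attach each team's totals from the previous season. The toy rows and their values are made up for illustration, and only the `pts` stat is summed. It is not a drop-in replacement for the SFrame code.

```python
from collections import defaultdict

# Hypothetical toy rows: one row per team per game, mirroring the
# gid / tid / seas / tname / pts columns used in the cell above.
rows = [
    {"gid": 1, "tid": 10, "seas": 2001, "tname": "SEA", "pts": 24},
    {"gid": 1, "tid": 11, "seas": 2001, "tname": "DEN", "pts": 17},
    {"gid": 2, "tid": 12, "seas": 2002, "tname": "SEA", "pts": 10},
    {"gid": 2, "tid": 13, "seas": 2002, "tname": "DEN", "pts": 13},
]

# Step 1: per game, find the row with the most points (the ARGMAX step),
# then mark each row as a win if its tid matches that row's tid.
best = {}
for r in rows:
    if r["gid"] not in best or r["pts"] > best[r["gid"]]["pts"]:
        best[r["gid"]] = r
for r in rows:
    r["win"] = int(r["tid"] == best[r["gid"]]["tid"])

# Step 2: sum a stat per (season, team), but file it under season + 1
# so that it lines up with the games of the following season.
prev_season_pts = defaultdict(int)
for r in rows:
    prev_season_pts[(r["seas"] + 1, r["tname"])] += r["pts"]

# Step 3: attach the previous-season total to every row (the join step).
# Rows from the first season have nothing to look back on, so they get None.
for r in rows:
    r["pts_sum"] = prev_season_pts.get((r["seas"], r["tname"]))
```

The rows from the earliest season end up with no previous-season totals, which is why the notebook drops the 2000 season after the join.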

##I created a win column to use as the label that the model will learn to predict. Now I will move on to the fun part: picking the variables I will use to build the boosted tree classifier (prediction model).

In [7]:
data = data.select_columns(['tid', 'gid', 'tname', 'seas', 'wk', 'day', 'v', 'h', 'stad', 'temp', 'humd', 'wspd', 'wdir', 'cond', 'surf', 'ou', 'win', 'ry_sum', 'ra_sum', 'py_sum', 'pa_sum', 'pts_sum', 'win_sum', 'ints_sum', 'rfd_sum', 'pfd_sum', 'ir_sum', 'sky_sum', 'top_sum', 'tdt_sum', 'yppa', 'ypra'])

In [8]:
features = ['seas', 'wk', 'day', 'stad', 'v', 'h', 'temp', 'humd', 'wspd', 'wdir', 'cond', 'surf', 'ou', 'ry_sum', 'ra_sum', 'py_sum', 'pa_sum', 'pts_sum', 'win_sum', 'ints_sum', 'rfd_sum', 'pfd_sum', 'ir_sum', 'sky_sum', 'top_sum', 'tdt_sum', 'yppa', 'ypra']
data_train, data_test = data.random_split(0.8, seed=0) 

##What is a boosted tree classifier?!?!

Well, a boosted tree classifier basically builds a bunch of small decision trees, one after another. Each new tree tries to correct the mistakes that the trees before it made on the training data, and the final prediction combines the output of all of the trees.
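
A common way to picture boosting is as a loop: start from a constant guess, fit a tiny tree to whatever error is left over, add it to the model, and repeat. The sketch below does this with one-split trees ("stumps") on a single made-up feature, using squared error for simplicity. The library's own classifier presumably uses deeper trees and a classification loss, so treat this only as an illustration of the idea, with all names and numbers being my own.

```python
def fit_stump(xs, residuals):
    """Pick the threshold on x that best splits the residuals into two means."""
    best = None
    for t in sorted(set(xs)):
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        if not left or not right:
            continue  # a split must put at least one point on each side
        lm = sum(left) / float(len(left))
        rm = sum(right) / float(len(right))
        err = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or err < best[0]:
            best = (err, t, lm, rm)
    return best[1], best[2], best[3]


def boost(xs, ys, n_trees=10, learning_rate=0.5):
    """Fit n_trees stumps, each one on the residuals left by the previous ones."""
    base = sum(ys) / float(len(ys))
    preds = [base] * len(xs)
    stumps = []
    for _ in range(n_trees):
        residuals = [y - p for y, p in zip(ys, preds)]
        t, lm, rm = fit_stump(xs, residuals)
        stumps.append((t, lm, rm))
        preds = [p + learning_rate * (lm if x <= t else rm)
                 for x, p in zip(xs, preds)]
    return base, learning_rate, stumps


def predict(model, x):
    """Sum the base guess and every stump's (scaled) correction."""
    base, learning_rate, stumps = model
    return base + sum(learning_rate * (lm if x <= t else rm)
                      for t, lm, rm in stumps)


# Made-up data: yards per pass attempt vs. whether the team won (1) or lost (0).
xs = [4.0, 5.0, 6.0, 7.0, 8.0, 9.0]
ys = [0, 0, 0, 1, 1, 1]
model = boost(xs, ys)
```

Calling `predict(model, 8.5)` gives a score above 0.5 and `predict(model, 4.5)` gives one below 0.5, which you could read as "win" and "loss" respectively.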

##What is a decision tree?!??!?!?!?!?

A binary tree... that you use... to make decisions...
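
A little more concretely: every inner node asks one yes/no question about a feature, and every leaf holds an answer. Here is a minimal sketch with a hand-written tree. The thresholds and the dictionary layout are invented for illustration and are not taken from the trained model.

```python
# Hypothetical hand-written tree; the feature thresholds are made up.
tree = {
    "feature": "yppa",
    "threshold": 6.5,
    "below": {"prediction": 0},
    "at_or_above": {
        "feature": "ypra",
        "threshold": 3.5,
        "below": {"prediction": 0},
        "at_or_above": {"prediction": 1},
    },
}


def decide(node, row):
    """Walk from the root to a leaf, answering one question per node."""
    while "prediction" not in node:
        if row[node["feature"]] < node["threshold"]:
            node = node["below"]
        else:
            node = node["at_or_above"]
    return node["prediction"]
```

For example, `decide(tree, {"yppa": 7.2, "ypra": 4.1})` walks right twice and lands on the leaf that predicts a win.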

##Next we generate this tree! 

We use our training data set, with the value in the 'win' column as the result the model should predict from the input variables, and we cap the number of decisions each tree can make before reaching a result with max_depth (set to 7 in the call below).

In [12]:
claz = graphlab.boosted_trees_classifier.create(data_train, target='win', features=features, max_depth=7)

PROGRESS: Creating a validation set from 5 percent of training data. This may take a while.
          You can set ``validation_set=None`` to disable validation tracking.



##Here is one of the decision trees that we generated from the input data

In [13]:
claz.show(view="Tree", tree_id=1)

Canvas is updated and available in a tab in the default browser.


In [14]:
claz.get_feature_importance()

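| name | index | count |
|------|-------|-------|
| yppa | | 81 |
| ypra | | 51 |
| ou | | 47 |
| temp | | 41 |
| humd | | 39 |
| wk | | 38 |
| sky_sum | | 37 |
| pts_sum | | 37 |
| seas | | 35 |
| ry_sum | | 33 |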
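
As far as I understand it, the count column is the number of times a feature is used as a decision node across all of the trees. Assuming that reading is right, the sketch below shows how such a count could be tallied over a couple of hand-written trees. The tree layout, features, and thresholds are invented for illustration and are not the library's internal format.

```python
from collections import Counter

# Two hypothetical trees; the features and thresholds are made up.
trees = [
    {"feature": "yppa", "threshold": 6.5,
     "below": {"prediction": 0},
     "at_or_above": {"feature": "ypra", "threshold": 3.5,
                     "below": {"prediction": 0},
                     "at_or_above": {"prediction": 1}}},
    {"feature": "yppa", "threshold": 7.0,
     "below": {"feature": "temp", "threshold": 40,
               "below": {"prediction": 0},
               "at_or_above": {"prediction": 1}},
     "at_or_above": {"prediction": 1}},
]


def count_splits(node, counts):
    """Add one to the feature's count for every decision node in the tree."""
    if "prediction" in node:
        return  # leaves do not split on anything
    counts[node["feature"]] += 1
    count_splits(node["below"], counts)
    count_splits(node["at_or_above"], counts)


importance = Counter()
for tree in trees:
    count_splits(tree, importance)
```

Here `importance.most_common()` would list `yppa` first, since both toy trees split on it.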


In [299]:
claz.summary()

Class                         : BoostedTreesClassifier

Schema
------
Number of examples            : 5506
Number of feature columns     : 28
Number of unpacked features   : 28
Number of classes             : 2

Settings
--------
Number of trees               : 10
Max tree depth                : 3
Training time (sec)           : 0.0645
Training accuracy             : 0.6829
Validation accuracy           : 0.705
Training log_loss             : 0.5889
Validation log_loss           : 0.5841



In [11]:
claz.evaluate(data_test, 'accuracy')

{'accuracy': 0.6840228245363766}
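
For reference, accuracy here is just the fraction of test rows where the predicted label matches the real one. Below is a minimal sketch of that calculation, with made-up labels. It also shows the baseline I would compare against: assuming every game contributes one winning row and one losing row, always guessing "loss" should land near 50%.

```python
def accuracy(actual, predicted):
    """Fraction of positions where the prediction equals the actual label."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    correct = sum(1 for a, p in zip(actual, predicted) if a == p)
    return correct / float(len(actual))


# Made-up labels: two games, one winning row and one losing row each.
actual = [1, 0, 1, 0]
model_guess = [1, 0, 0, 0]    # hypothetical model output
always_loss = [0, 0, 0, 0]    # baseline that never predicts a win

model_accuracy = accuracy(actual, model_guess)
baseline_accuracy = accuracy(actual, always_loss)
```

On these toy labels the model scores 0.75 and the baseline scores 0.5, so the gap between the two is the part the model actually earned.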

### https://www.youtube.com/watch?v=Z90nZtd1AmM