Natalie LaLuzerne

Imports

In [1]:
import pandas as pd
import numpy as np
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor

Constants

In [2]:
data_path = r'..\Data\nfl_games.csv'

Load the data set

In [3]:
data = pd.read_csv( data_path )

Remove the date, season, neutral, and result1 columns as these will not be considered when performing regression.

In [4]:
data = data.drop( columns = [ 'date', 'season', 'neutral', 'result1' ] )

Remove any rows with missing data so that missing data does not ne

In [5]:
data = data.dropna()
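
Before dropping rows, it can help to see how much data is actually missing. The following is a minimal sketch on a made-up toy table (the values and the name `toy` are assumptions for illustration, not the NFL data) that counts missing cells per column and the number of rows `dropna()` removes.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the NFL table; the values are made up.
toy = pd.DataFrame( {
    'elo1': [ 1500.0, np.nan, 1620.0 ],
    'score1': [ 21.0, 17.0, np.nan ],
} )

missing_per_column = toy.isna().sum()    # missing cells in each column
rows_before = len( toy )
toy_clean = toy.dropna()                 # drop every row with any missing cell
rows_dropped = rows_before - len( toy_clean )

print( missing_per_column )
print( 'Rows dropped: {0}'.format( rows_dropped ) )
```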

Transform the team names from categorical data to numeric data to 

In [6]:
labelEncoder = preprocessing.LabelEncoder()
data[ 'team1' ] = labelEncoder.fit_transform( data[ 'team1' ] )
data[ 'team2' ] = labelEncoder.fit_transform( data[ 'team2' ] )
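
Calling fit_transform() separately on each column refits the encoder each time, so a team can receive one code in team1 and a different code in team2 whenever the two columns do not contain exactly the same set of teams. One possible alternative, sketched here on made-up team abbreviations (the name `shared_encoder` and the toy table are assumptions), is to fit a single encoder on both columns together and then transform each column.

```python
import pandas as pd
from sklearn import preprocessing

# Toy matchups; the two columns deliberately contain different sets of teams.
toy = pd.DataFrame( {
    'team1': [ 'GB', 'CHI', 'DAL' ],
    'team2': [ 'CHI', 'DAL', 'NYG' ],
} )

# Fit once on the union of both columns so every team gets exactly one code.
shared_encoder = preprocessing.LabelEncoder()
shared_encoder.fit( pd.concat( [ toy[ 'team1' ], toy[ 'team2' ] ] ) )

toy[ 'team1' ] = shared_encoder.transform( toy[ 'team1' ] )
toy[ 'team2' ] = shared_encoder.transform( toy[ 'team2' ] )
print( toy )
```

With this approach a given team maps to the same integer in both columns, and shared_encoder.inverse_transform() can recover the names from either one.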

Predict scores for team 1

Create the features and labels for regression. The features are the columns of the table other than the result we want to predict. Features for team 1 include: playoff, team1, team2, elo1, elo2, and elo_prob1. Even though we are trying to predict team1's score, we do not include team2's score in the feature set because the two teams' scores are treated here as independent of each other. The label is the result we want from the regression, and for this iteration it is score1.

In [7]:
x = data[ [ 'playoff', 'team1', 'team2', 'elo1', 'elo2', 'elo_prob1' ] ]
y = data[ 'score1' ]

Split the labels and features into testing and training sets. The training sets will be used to train the regressor and the testing sets will be used to test the accuracy of the predictions that the regressor makes. For this project I chose to use a ratio of 80/20; 80% of the data is used for training the regressor and 20% of the data is used for testing the regressor.

In [8]:
x_train, x_test, y_train, y_test = train_test_split( x, y, test_size = 0.2 )
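
Because train_test_split() shuffles the rows randomly, each run of the notebook produces a different split and therefore slightly different error values. A minimal sketch, on made-up arrays, of how the `random_state` parameter makes an 80/20 split repeatable (the seed value 42 is an arbitrary choice):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 10 samples with 2 features each; the values are placeholders.
toy_x = np.arange( 20 ).reshape( 10, 2 )
toy_y = np.arange( 10 )

# The same seed gives the same 80/20 split on every call.
split_a = train_test_split( toy_x, toy_y, test_size = 0.2, random_state = 42 )
split_b = train_test_split( toy_x, toy_y, test_size = 0.2, random_state = 42 )

x_train_a, x_test_a, y_train_a, y_test_a = split_a
print( 'Training rows: {0}, testing rows: {1}'.format( len( x_train_a ), len( x_test_a ) ) )
```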

Create the regressor. For this project, I decided to try a Random Forest Regressor because random forests are known for being fast and accurate. I used scikit-learn's RandomForestRegressor() to implement my random forest and used 500 trees. I left the number of features considered at each split at the library's default.

In [9]:
rfr = RandomForestRegressor( n_estimators = 500 )

Train the Random Forest Regressor and predict the scores for team1. I trained the regressor using the training sets I created above. Once the regressor was trained, I used the test features set to try to predict team1's scores correctly.

In [10]:
rfr.fit( x_train, y_train )
y_predict = rfr.predict( x_test )

Calculate the average score prediction error for team1. To test the regressor's accuracy, I compared the score predictions from the regressor with the actual scores from the test set. To do this, I took the difference between the predicted score and the actual score for each test case, then took the mean of the differences. Because the differences are signed, over-predictions and under-predictions offset each other in this mean.

In [11]:
err = y_predict - y_test.to_numpy()  # signed error: predicted minus actual
avg_err = np.mean( err )
print( 'Average Team 1 Prediction Error: {0:.5f} points'.format( avg_err ) )

Average Team 1 Prediction Error: 0.02429 points
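
The mean of signed differences measures bias rather than typical miss size, since positive and negative errors cancel. The sketch below uses made-up scores (not results from this data set) to contrast the mean signed error with the mean absolute error, which averages the size of each miss regardless of direction.

```python
import numpy as np

# Made-up scores chosen so that over- and under-predictions cancel exactly.
actual = np.array( [ 10.0, 20.0, 30.0, 40.0 ] )
predicted = np.array( [ 17.0, 13.0, 37.0, 33.0 ] )

residuals = predicted - actual                          # +7, -7, +7, -7
mean_signed_error = np.mean( residuals )                # 0.0: no bias
mean_absolute_error = np.mean( np.abs( residuals ) )    # 7.0: typical miss

print( 'Mean signed error: {0:.5f} points'.format( mean_signed_error ) )
print( 'Mean absolute error: {0:.5f} points'.format( mean_absolute_error ) )
```

In the notebook, the same comparison could be made by applying np.abs() to the error values before taking the mean.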


Predict scores for team 2

Create the features and labels for regression. The features are the columns of the table other than the result we want to predict. Features for team 2 include: playoff, team1, team2, elo1, elo2, and elo_prob1. Even though we are trying to predict team2's score, we do not include team1's score in the feature set because the two teams' scores are treated here as independent of each other. The label is the result we want from the regression, and for this iteration it is score2.

In [12]:
x = data[ [ 'playoff', 'team1', 'team2', 'elo1', 'elo2', 'elo_prob1' ] ]
y = data[ 'score2' ]

Split the labels and features into testing and training sets. The training sets will be used to train the regressor and the testing sets will be used to test the accuracy of the predictions that the regressor makes. For this project I chose to use a ratio of 80/20; 80% of the data is used for training the regressor and 20% of the data is used for testing the regressor.

In [13]:
x_train, x_test, y_train, y_test = train_test_split( x, y, test_size = 0.2 )

Create the regressor. For this project, I decided to try a Random Forest Regressor because random forests are known for being fast and accurate. I used scikit-learn's RandomForestRegressor() to implement my random forest and used 500 trees. I left the number of features considered at each split at the library's default.

In [14]:
rfr = RandomForestRegressor( n_estimators = 500 )

Train the Random Forest Regressor and predict the scores for team2. I trained the regressor using the training sets I created above. Once the regressor was trained, I used the test features set to try to predict team2's scores correctly.

In [15]:
rfr.fit( x_train, y_train )
y_predict = rfr.predict( x_test )

Calculate the average score prediction error for team2. To test the regressor's accuracy, I compared the score predictions from the regressor with the actual scores from the test set. To do this, I took the difference between the predicted score and the actual score for each test case, then took the mean of the differences. Because the differences are signed, over-predictions and under-predictions offset each other in this mean.

In [16]:
err = y_predict - y_test.to_numpy()  # signed error: predicted minus actual
avg_err = np.mean( err )
print( 'Average Team 2 Prediction Error: {0:.5f} points'.format( avg_err ) )

Average Team 2 Prediction Error: -0.08702 points


Results

Using a Random Forest Regressor with 500 trees, I obtained an average prediction error of 0.02429 points for team1's scores and -0.08702 points for team2's scores.

Averaged over the test set, the Random Forest Regressor's predictions were within less than a point of the actual scores for both teams, which indicates very little systematic over- or under-prediction. Because positive and negative errors offset each other in this average, it does not by itself show how close the prediction is for any given team at any given matchup. The positive-valued errors represent an over-prediction by the regressor (the regressor predicted a score higher than what was actually scored) and negative-valued errors represent an under-prediction by the regressor (the regressor predicted a score lower than what was actually scored).

To improve the accuracy of the regressor, one thing that could be easily adjusted is the number of trees used in the random forest. Another potential improvement would be to check whether the date and season data actually affect the number of points a team scores.
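
A minimal sketch of how the number of trees could be compared, using synthetic data in place of the NFL table. The data, the candidate tree counts, the seeds, and the use of mean absolute error as the comparison metric are all assumptions made for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic regression problem: the target depends on the first feature plus noise.
rng = np.random.default_rng( 0 )
toy_x = rng.normal( size = ( 300, 4 ) )
toy_y = 3.0 * toy_x[ :, 0 ] + rng.normal( size = 300 )

x_train, x_test, y_train, y_test = train_test_split(
    toy_x, toy_y, test_size = 0.2, random_state = 0 )

# Train one forest per candidate tree count and record its test error.
mae_by_trees = {}
for n_trees in [ 10, 50, 200 ]:
    model = RandomForestRegressor( n_estimators = n_trees, random_state = 0 )
    model.fit( x_train, y_train )
    mae_by_trees[ n_trees ] = mean_absolute_error( y_test, model.predict( x_test ) )
    print( '{0} trees: MAE {1:.5f}'.format( n_trees, mae_by_trees[ n_trees ] ) )
```

Fixing the split and the forest seeds keeps the comparison between tree counts from being dominated by run-to-run randomness.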