## EECS 731 Project 4: Regression
### by Matthew Taylor

### Import required modules

In [1]:
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

### Load datasets
The first dataframe contains a mapping between team names and team identifiers, as well as initial Elo ratings. The second contains a history of NFL games and final scores.

In [2]:
initial_elos = pd.read_csv('data/initial_elos.csv')
nfl_games = pd.read_csv('data/nfl_games.csv')

In [3]:
# The 'initial_elos' dataset pairs each team abbreviation with an integer index and an initial Elo rating

initial_elos.head()

Unnamed: 0,team,elo
0,RII,1503.947
1,STP,1300.0
2,BFF,1478.004
3,WBU,1300.0
4,RCH,1503.42


In [4]:
# Most of the data is already numerical, which simplifies data preparation

nfl_games.head()

Unnamed: 0,date,season,neutral,playoff,team1,team2,elo1,elo2,elo_prob1,score1,score2,result1
0,1920-09-26,1920,0,0,RII,STP,1503.947,1300.0,0.824651,48,0,1.0
1,1920-10-03,1920,0,0,AKR,WHE,1503.42,1300.0,0.824212,43,0,1.0
2,1920-10-03,1920,0,0,RCH,ABU,1503.42,1300.0,0.824212,10,0,1.0
3,1920-10-03,1920,0,0,DAY,COL,1493.002,1504.908,0.575819,14,0,1.0
4,1920-10-03,1920,0,0,RII,MUN,1516.108,1478.004,0.644171,45,0,1.0


### Data Preparation
The feature engineering in this project is simple. I replace the team abbreviations in the second dataframe with their integer row indices from the first dataframe, then split the date column into year, month, and day columns. Splitting the date lets the models pick up on time-dependent trends, such as a team having a strong season or a good run of games.

In [5]:
years = []
months = []
days = []

team1_ids = []
team2_ids = []

for index, row in nfl_games.iterrows():
    # Break date column up into three separate integer columns
    date = row.date.split('-')
    
    years.append(int(date[0]))
    months.append(int(date[1]))
    days.append(int(date[2]))
    
    # Replace team abbreviations with integer IDs
    team1_ids.append(initial_elos.index[initial_elos['team'] == row['team1']].values[0])
    team2_ids.append(initial_elos.index[initial_elos['team'] == row['team2']].values[0])
    
# Add new columns to the dataframe
nfl_games.insert(1, 'year', years)
nfl_games.insert(2, 'month', months)
nfl_games.insert(3, 'day', days)
nfl_games.insert(9, 'team1_id', team1_ids)
nfl_games.insert(10, 'team2_id', team2_ids)

# Remove unwanted columns
nfl_games = nfl_games.drop(columns=['date', 'team1', 'team2'])

# Inspect resulting dataframe to verify data preparation steps
nfl_games.head()

Unnamed: 0,year,month,day,season,neutral,playoff,team1_id,team2_id,elo1,elo2,elo_prob1,score1,score2,result1
0,1920,9,26,1920,0,0,0,1,1503.947,1300.0,0.824651,48,0,1.0
1,1920,10,3,1920,0,0,13,14,1503.42,1300.0,0.824212,43,0,1.0
2,1920,10,3,1920,0,0,4,5,1503.42,1300.0,0.824212,10,0,1.0
3,1920,10,3,1920,0,0,6,7,1493.002,1504.908,0.575819,14,0,1.0
4,1920,10,3,1920,0,0,0,8,1516.108,1478.004,0.644171,45,0,1.0
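
The same preparation can also be written without an explicit loop. The following is a minimal sketch on a tiny made-up sample: the two small dataframes are illustrative stand-ins rather than the real CSV files, and the name `team_to_id` is introduced here only for illustration. It assumes every team abbreviation in the games table also appears in the Elo table. Unlike the cell above, the new columns are appended at the end rather than inserted at fixed positions.

```python
import pandas as pd

# Tiny illustrative stand-ins for the two CSV files (not the real data)
elos = pd.DataFrame({'team': ['RII', 'STP', 'BFF'],
                     'elo': [1503.947, 1300.0, 1478.004]})
games = pd.DataFrame({
    'date': ['1920-09-26', '1920-10-03'],
    'team1': ['RII', 'BFF'],
    'team2': ['STP', 'RII'],
})

# Parse the date once, then pull out the integer components
dates = pd.to_datetime(games['date'])
games['year'] = dates.dt.year
games['month'] = dates.dt.month
games['day'] = dates.dt.day

# Build an abbreviation -> row index lookup and apply it to both team columns
team_to_id = {team: idx for idx, team in elos['team'].items()}
games['team1_id'] = games['team1'].map(team_to_id)
games['team2_id'] = games['team2'].map(team_to_id)

# Drop the original text columns now that they are encoded
games = games.drop(columns=['date', 'team1', 'team2'])
print(games)
```

Building the lookup once avoids scanning `initial_elos` for every row, which matters more as the games table grows.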


### Train-Test Split
I perform an 80/20 train/test split on the data, which lets me fit the models on one portion and evaluate them on held-out games. Because each regression model here produces a single output, I train two models, one for each team's score.

In [6]:
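# Separate the data into training and testing sets (80/20).
# Two output arrays are needed, one per team's score,
# since each regression model here produces a single output.
# Note: 'result1' is the game outcome, so including it as an input
# gives the models information that is only known after the game.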

input_columns = nfl_games[['year', 'month', 'day', 'season', 'neutral', 'playoff', 'team1_id', 'team2_id', 'elo1', 'elo2', 'elo_prob1', 'result1']].values
output_columns1 = nfl_games['score1'].values
output_columns2 = nfl_games['score2'].values

# A single call splits all three arrays with the same shuffle, keeping rows aligned
(train_input, test_input,
 train_output1, test_output1,
 train_output2, test_output2) = train_test_split(
    input_columns, output_columns1, output_columns2, test_size=0.2, random_state=0)

### Linear Regression
Now that the data is ready, I can begin training models, starting with simple linear regression. Note that `LinearRegression.score` returns the coefficient of determination (R²), so the percentages reported below are the share of variance in the final scores that each model explains on the test set, not the share of games whose score was predicted exactly. The linear models reach an average R² of about 36%. At first, this figure may seem underwhelming. However, it's important to remember what these models are predicting. They aren't predicting which team wins the game. Rather, they are predicting the final score of each team, which is a much harder target. Explaining roughly a third of the variance in final scores is a reasonable result for such simple models.

In [7]:
# Train linear regression model

lr1 = LinearRegression().fit(train_input, train_output1)
lr2 = LinearRegression().fit(train_input, train_output2)

In [8]:
# Score linear regression models (score() returns R^2 on the test set)

lr1_score = round(lr1.score(test_input, test_output1) * 100, 2)
lr2_score = round(lr2.score(test_input, test_output2) * 100, 2)
lr_average = round((lr1_score + lr2_score) / 2, 2)

print('Team 1 score model R^2: {}%'.format(lr1_score))
print('Team 2 score model R^2: {}%\n'.format(lr2_score))
print('Average R^2: {}%'.format(lr_average))

Team 1 score model R^2: 34.29%
Team 2 score model R^2: 37.69%

Average R^2: 35.99%
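
To make the meaning of the `score` value concrete, here is a minimal sketch of the standard R² definition next to an exact-match rate, using made-up numbers. The function names `r_squared` and `exact_match_rate` are hypothetical helpers written for this illustration and are not part of scikit-learn.

```python
def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


def exact_match_rate(actual, predicted):
    """Fraction of games whose rounded prediction equals the real score."""
    hits = sum(1 for a, p in zip(actual, predicted) if round(p) == a)
    return hits / len(actual)


# Made-up scores and predictions, for illustration only
actual = [10, 20, 30, 40]
predicted = [12.0, 18.0, 33.0, 37.0]

print(r_squared(actual, predicted))
print(exact_match_rate(actual, predicted))
```

In this toy example every prediction is off by two or three points, so no score is matched exactly, yet R² is high because the predictions track most of the variation. The two measures answer different questions.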


### Random Forest Regressor
To see how much the fit can be improved, I also trained random forest regressors. These models can capture non-linear relationships that linear regression cannot, so they should fit the scores somewhat better. Sure enough, the average R² rose by about 2.3 percentage points, from 35.99% to 38.32% (both the linear regression and random forest models used the same training and testing data).

In [9]:
# Train random forest

rf1 = RandomForestRegressor(n_estimators=100, random_state=0).fit(train_input, train_output1)
rf2 = RandomForestRegressor(n_estimators=100, random_state=0).fit(train_input, train_output2)

In [10]:
# Score random forest models (score() returns R^2 on the test set)

rf1_score = round(rf1.score(test_input, test_output1) * 100, 2)
rf2_score = round(rf2.score(test_input, test_output2) * 100, 2)
rf_average = round((rf1_score + rf2_score) / 2, 2)

print('Team 1 score model R^2: {}%'.format(rf1_score))
print('Team 2 score model R^2: {}%\n'.format(rf2_score))
print('Average R^2: {}%'.format(rf_average))

Team 1 score model R^2: 36.13%
Team 2 score model R^2: 40.51%

Average R^2: 38.32%
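
Because R² is unitless, an error measured in points can make model comparisons easier to interpret. The following is a minimal sketch with made-up numbers that compares a model's mean absolute error against a baseline that always predicts the mean training score. The helper `mean_absolute_error` and all values here are illustrative assumptions. In the notebook, the predictions would presumably come from calls such as `rf1.predict(test_input)`.

```python
def mean_absolute_error(actual, predicted):
    """Average absolute difference between real and predicted scores, in points."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# Made-up scores and predictions, for illustration only
train_scores = [7, 14, 21, 28]
test_scores = [10, 17, 24]
model_predictions = [12.0, 16.0, 21.0]

# Baseline: always predict the mean score seen during training
baseline = sum(train_scores) / len(train_scores)
baseline_predictions = [baseline] * len(test_scores)

model_mae = mean_absolute_error(test_scores, model_predictions)
baseline_mae = mean_absolute_error(test_scores, baseline_predictions)

print('Model MAE: {:.2f} points'.format(model_mae))
print('Baseline MAE: {:.2f} points'.format(baseline_mae))
```

A model is only useful to the extent that its error is clearly below that of such a constant baseline.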


### Footnote
As a sanity check on the data, I calculated the percentage of games in the testing set in which either team finished with a score of zero. Such games turn out to be a small minority (about 8.5% of the test set), so the models cannot do well simply by predicting zero for every game. More generally, a model that predicts the same constant for every game has an R² of at most zero on the test set, so the positive R² values above already rule out that kind of trivial predictor.

In [11]:
n_zeros = 0

for i in range(len(test_output1)):
    if test_output1[i] == 0 or test_output2[i] == 0:
        n_zeros += 1

p_zeros = round(n_zeros / len(test_output1) * 100, 2)

print('{}% of games in the test set ended with at least one team scoring 0'.format(p_zeros))

8.51% of games in the test set ended with at least one team scoring 0
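
The same count can be computed without a loop by using a NumPy boolean mask. This is a minimal sketch on made-up arrays. The names `scores1` and `scores2` stand in for `test_output1` and `test_output2` and are assumptions for illustration.

```python
import numpy as np

# Made-up test scores, for illustration only
scores1 = np.array([48, 0, 10, 14, 7])
scores2 = np.array([0, 0, 3, 21, 7])

# True wherever at least one of the two teams scored zero
either_zero = (scores1 == 0) | (scores2 == 0)

# The mean of a boolean mask is the fraction of True entries
p_zeros = round(either_zero.mean() * 100, 2)
print('{}% of games had at least one team scoring 0'.format(p_zeros))
```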
