Predict Sports Scores with Regression
========================================

Background: Who doesn't want to know if their favorite team will win on gameday? This notebook attempts to predict the scores of various sporting events using regression models.

All data is courtesy of FiveThirtyEight, one of my favorite websites.

In [1]:
import pandas as pd

In [2]:
# Get data into dataframes
# I'm going to start out with Soccer
matches = pd.read_csv("../data/raw/soccer/spi_matches.csv")
rankings = pd.read_csv("../data/raw/soccer/spi_global_rankings.csv")
intl_rankings = pd.read_csv("../data/raw/soccer/spi_global_rankings_intl.csv")

In [3]:
# Take a look at each Data Frame
matches.head(5)

Unnamed: 0,date,league_id,league,team1,team2,spi1,spi2,prob1,prob2,probtie,...,importance1,importance2,score1,score2,xg1,xg2,nsxg1,nsxg2,adj_score1,adj_score2
0,2016-08-12,1843,French Ligue 1,Bastia,Paris Saint-Germain,51.16,85.68,0.0463,0.838,0.1157,...,32.4,67.7,0.0,1.0,0.97,0.63,0.43,0.45,0.0,1.05
1,2016-08-12,1843,French Ligue 1,AS Monaco,Guingamp,68.85,56.48,0.5714,0.1669,0.2617,...,53.7,22.9,2.0,2.0,2.45,0.77,1.75,0.42,2.1,2.1
2,2016-08-13,2411,Barclays Premier League,Hull City,Leicester City,53.57,66.81,0.3459,0.3621,0.2921,...,38.1,22.2,2.0,1.0,0.85,2.77,0.17,1.25,2.1,1.05
3,2016-08-13,2411,Barclays Premier League,Burnley,Swansea City,58.98,59.74,0.4482,0.2663,0.2854,...,36.5,29.1,0.0,1.0,1.24,1.84,1.71,1.56,0.0,1.05
4,2016-08-13,2411,Barclays Premier League,Middlesbrough,Stoke City,56.32,60.35,0.438,0.2692,0.2927,...,33.9,32.5,1.0,1.0,1.4,0.55,1.13,1.06,1.05,1.05


In [4]:
rankings.head(5)

Unnamed: 0,rank,prev_rank,name,league,off,def,spi
0,1,1,Manchester City,Barclays Premier League,2.98,0.23,93.72
1,2,3,Barcelona,Spanish Primera Division,3.16,0.39,92.6
2,3,4,Real Madrid,Spanish Primera Division,2.99,0.38,91.75
3,4,2,Bayern Munich,German Bundesliga,2.94,0.4,90.93
4,5,6,Juventus,Italy Serie A,2.66,0.29,90.72


In [5]:
intl_rankings.head(5)

Unnamed: 0,rank,name,confed,off,def,spi
0,1,Brazil,CONMEBOL,3.11,0.29,92.96
1,2,Spain,UEFA,3.46,0.48,92.54
2,3,Belgium,UEFA,3.06,0.54,89.1
3,4,France,UEFA,2.84,0.46,88.57
4,5,Germany,UEFA,2.96,0.56,87.93


From the dataframes above, we can start to pick out the most useful pieces of information. The "matches" dataframe seems to have just about everything we could want, so there is no need to join team rankings from the rankings dataframes into it.

One potentially useful feature would be a team's all-time average score. This could be calculated by looking at all matches played by a team and taking a simple average. To go even a step further, we could break this down by season.
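As a rough sketch of the averaging idea above, the cell below stacks home and away rows into one long table and then groups by team and by season. The `sample` frame is made-up data standing in for `matches`, the helper name `to_long` is hypothetical, and the rule that a season starts in July is an assumption that will not fit every league.

```python
import pandas as pd

# Tiny made-up stand-in for the real `matches` dataframe.
sample = pd.DataFrame({
    "date": ["2016-08-12", "2016-08-20", "2017-02-04", "2017-08-11"],
    "team1": ["Bastia", "AS Monaco", "Bastia", "AS Monaco"],
    "team2": ["AS Monaco", "Bastia", "AS Monaco", "Bastia"],
    "score1": [0.0, 2.0, 1.0, 3.0],
    "score2": [1.0, 2.0, 1.0, 1.0],
})

def to_long(df):
    """Return one row per (match, team) with the goals that team scored."""
    home = df[["date", "team1", "score1"]].rename(
        columns={"team1": "team", "score1": "goals"})
    away = df[["date", "team2", "score2"]].rename(
        columns={"team2": "team", "score2": "goals"})
    long = pd.concat([home, away], ignore_index=True)
    dates = pd.to_datetime(long["date"])
    # Assumption: a season starts in July, so Jan-Jun matches belong
    # to the season that began in the previous calendar year.
    long["season"] = dates.dt.year - (dates.dt.month < 7).astype(int)
    return long

long = to_long(sample)
all_time_avg = long.groupby("team")["goals"].mean()
season_avg = long.groupby(["team", "season"])["goals"].mean()
```

One caveat with this sketch: an average taken over every match includes the match being predicted, so a leak-free version would probably only use matches played before each fixture's date.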

In [6]:
# Take a look at our columns
list(matches.columns.values)

['date',
 'league_id',
 'league',
 'team1',
 'team2',
 'spi1',
 'spi2',
 'prob1',
 'prob2',
 'probtie',
 'proj_score1',
 'proj_score2',
 'importance1',
 'importance2',
 'score1',
 'score2',
 'xg1',
 'xg2',
 'nsxg1',
 'nsxg2',
 'adj_score1',
 'adj_score2']

From this list of columns, it seems that we may be able to eliminate some of the features. For instance, 'league_id' carries the same information as the 'league' attribute, so we do not need both.
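Before dropping 'league', it may be worth confirming that each 'league_id' really maps to exactly one 'league' name and vice versa. The check below is a minimal sketch on a made-up `sample` frame; the same two `groupby` calls should work on `matches`.

```python
import pandas as pd

# Made-up rows standing in for the league columns of `matches`.
sample = pd.DataFrame({
    "league_id": [1843, 1843, 2411, 2411],
    "league": ["French Ligue 1", "French Ligue 1",
               "Barclays Premier League", "Barclays Premier League"],
})

# Count distinct names per id and distinct ids per name.
names_per_id = sample.groupby("league_id")["league"].nunique()
ids_per_name = sample.groupby("league")["league_id"].nunique()

# The two columns are redundant only if both counts are always 1.
is_one_to_one = bool((names_per_id == 1).all() and (ids_per_name == 1).all())
```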

Additionally, the score and adjusted score columns should not be used as features in our model, since they are what we are trying to predict.

In [7]:
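# Run a simple test on the barebones dataframe with no
# feature engineering other than dropping some features.
from sklearn.preprocessing import LabelEncoder

# Fit the team encoder on both columns so that a team appearing only
# in team2 does not raise an unseen-label error in transform().
e = LabelEncoder()
e.fit(pd.concat([matches['team1'], matches['team2']]))
matches['team1'] = e.transform(matches['team1'])
matches['team2'] = e.transform(matches['team2'])

# Encode dates as integers; sorted ISO date strings keep chronological order.
e.fit(matches['date'])
matches['date'] = e.transform(matches['date'])

# 'league' duplicates 'league_id'. Use the keyword form of drop(),
# since newer pandas versions no longer accept a positional axis.
matches = matches.drop(columns='league')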

matches = matches.dropna()

matches.head(5)

Unnamed: 0,date,league_id,team1,team2,spi1,spi2,prob1,prob2,probtie,proj_score1,...,importance1,importance2,score1,score2,xg1,xg2,nsxg1,nsxg2,adj_score1,adj_score2
0,0,1843,78,473,51.16,85.68,0.0463,0.838,0.1157,0.91,...,32.4,67.7,0.0,1.0,0.97,0.63,0.43,0.45,0.0,1.05
1,0,1843,16,293,68.85,56.48,0.5714,0.1669,0.2617,1.82,...,53.7,22.9,2.0,2.0,2.45,0.77,1.75,0.42,2.1,2.1
2,1,2411,317,366,53.57,66.81,0.3459,0.3621,0.2921,1.16,...,38.1,22.2,2.0,1.0,0.85,2.77,0.17,1.25,2.1,1.05
3,1,2411,118,610,58.98,59.74,0.4482,0.2663,0.2854,1.37,...,36.5,29.1,0.0,1.0,1.24,1.84,1.71,1.56,0.0,1.05
4,1,2411,404,605,56.32,60.35,0.438,0.2692,0.2927,1.3,...,33.9,32.5,1.0,1.0,1.4,0.55,1.13,1.06,1.05,1.05


In [8]:
# set up model by choosing labels and features
decision = matches[['score1']]
features = matches[['date',
                   'league_id',
                   'team1',
                   'team2',
                   'prob1',
                   'prob2',
                   'spi1',
                   'spi2',
                   'probtie',
                   'proj_score1',
                   'proj_score2',
                   'importance1',
                   'importance2',
                   'xg1',
                   'xg2']]

In [9]:
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

train, test, train_d, test_d = train_test_split(features,
                                                decision,
                                                test_size = 0.2,
                                                random_state = 14)

cls = RandomForestClassifier()
mdl = cls.fit(train, train_d.values.ravel())

In [10]:
# Time to Make some Predictions
results = cls.predict(test)

In [11]:
from sklearn.metrics import accuracy_score

accuracy_score(test_d, results)

0.3402430307362402
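The cells above treat score1 as a class label and report classification accuracy. Since the stated goal of this notebook is regression, the sketch below shows what the same split could look like with `RandomForestRegressor` and mean absolute error. It runs on synthetic data, not the FiveThirtyEight files; the two feature columns and the Poisson target are assumptions made only so the example runs on its own.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(14)
n = 400

# Synthetic stand-ins for two pre-match features (not real data).
X = np.column_stack([rng.uniform(0, 3, n), rng.uniform(0, 1, n)])
# Goal counts drawn from a Poisson whose mean depends on the first feature.
y = rng.poisson(lam=0.5 + X[:, 0])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=14)

reg = RandomForestRegressor(n_estimators=100, random_state=14)
reg.fit(X_train, y_train)
pred = reg.predict(X_test)

# Mean absolute error is in goals, which is easier to read than accuracy.
mae = mean_absolute_error(y_test, pred)
# Baseline: always predict the mean number of goals seen in training.
baseline_mae = mean_absolute_error(
    y_test, np.full(len(y_test), y_train.mean()))
```

Comparing `mae` against `baseline_mae` gives a quick sense of whether the model has learned anything beyond the average score.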

In [12]:
# determine importance of features for a refactoring
imp = {}
for feature, importance in zip(train.columns, cls.feature_importances_):
    imp[feature] = importance
df = pd.Series(imp)
df = df.sort_values(ascending=False).head(5)
df

xg1      0.125028
prob1    0.072747
xg2      0.068402
spi1     0.067980
date     0.067938
dtype: float64

Examining these importances suggests that eliminating some features may give better results. For example, league_id does not appear in the top 5, so it may have a much smaller impact on the predicted score than previously thought.

I am going to select the top 5 features and run the classifier again to see whether it becomes more accurate.

In [13]:
# these are the top 5 features
keys = df.keys()
keys

Index(['xg1', 'prob1', 'xg2', 'spi1', 'date'], dtype='object')

In [14]:
# copy the selection process from earlier, but pick the best 5 features
decision = matches[['score1']]
features = matches[keys]

In [15]:
train, test, train_d, test_d = train_test_split(features,
                                                decision,
                                                test_size = 0.2,
                                                random_state = 7)

cls = RandomForestClassifier()
mdl = cls.fit(train, train_d.values.ravel())

In [16]:
results = cls.predict(test)

In [17]:
accuracy_score(test_d, results)

0.33666904932094355

In [18]:
imp = {}
for feature, importance in zip(train.columns, cls.feature_importances_):
    imp[feature] = importance
df = pd.Series(imp)
df.sort_values(ascending=False)

xg1      0.243119
prob1    0.193450
spi1     0.192340
xg2      0.186462
date     0.184629
dtype: float64

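This predictive model has essentially the same accuracy as the original model (about 0.337 versus 0.340), and in this run it is even slightly worse. Note that the two runs use different random_state values in train_test_split (14 and 7), so they are not scored on the same test rows.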
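To compare two feature sets on equal footing, one option is to score both on the same cross-validation folds. The cell below is a minimal sketch on synthetic data; the column names `f1` to `f3`, the target, and the helper `mean_accuracy` are all hypothetical stand-ins for the real features and score1.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(0)
n = 300

# Synthetic features: only f1 carries signal, f2 and f3 are noise.
data = pd.DataFrame({
    "f1": rng.normal(size=n),
    "f2": rng.normal(size=n),
    "f3": rng.normal(size=n),
})
target = (data["f1"] + rng.normal(scale=0.5, size=n) > 0).astype(int)

# A fixed random_state gives every feature set the same five folds.
cv = KFold(n_splits=5, shuffle=True, random_state=14)

def mean_accuracy(columns):
    """Mean cross-validated accuracy using only the given columns."""
    model = RandomForestClassifier(n_estimators=50, random_state=14)
    scores = cross_val_score(model, data[columns], target,
                             cv=cv, scoring="accuracy")
    return scores.mean()

all_acc = mean_accuracy(["f1", "f2", "f3"])
subset_acc = mean_accuracy(["f1"])
```

Because both calls share `cv` and the model seed, any gap between `all_acc` and `subset_acc` comes from the feature choice and not from a different split.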

It seems that the best way to build a better model may be to do some feature engineering and create new features that may be of interest.