# Major Leagues
## General Information
1. Author: Patrick McNamee
2. Date: 10/11/2019
## Description
Analyzing the soccer (i.e. "football") Soccer Power Index (SPI) data from [538](https://github.com/fivethirtyeight/data/tree/master/soccer-spi). The goal of this notebook is to build models that predict the score of each team in a match.

## Data Engineering
First we need to load the data into a data frame and then examine what information we have available.

In [1]:
import pandas as pd

# load the match data
df = pd.read_csv("./data/spi_matches.csv")
# drop every row that has a missing value in any column
df = df.dropna()
df.head()

Unnamed: 0,date,league_id,league,team1,team2,spi1,spi2,prob1,prob2,probtie,...,importance1,importance2,score1,score2,xg1,xg2,nsxg1,nsxg2,adj_score1,adj_score2
0,2016-08-12,1843,French Ligue 1,Bastia,Paris Saint-Germain,51.16,85.68,0.0463,0.838,0.1157,...,32.4,67.7,0.0,1.0,0.97,0.63,0.43,0.45,0.0,1.05
1,2016-08-12,1843,French Ligue 1,AS Monaco,Guingamp,68.85,56.48,0.5714,0.1669,0.2617,...,53.7,22.9,2.0,2.0,2.45,0.77,1.75,0.42,2.1,2.1
2,2016-08-13,2411,Barclays Premier League,Hull City,Leicester City,53.57,66.81,0.3459,0.3621,0.2921,...,38.1,22.2,2.0,1.0,0.85,2.77,0.17,1.25,2.1,1.05
3,2016-08-13,2411,Barclays Premier League,Crystal Palace,West Bromwich Albion,55.19,58.66,0.4214,0.2939,0.2847,...,43.6,34.6,0.0,1.0,1.11,0.68,0.84,1.6,0.0,1.05
4,2016-08-13,2411,Barclays Premier League,Everton,Tottenham Hotspur,68.02,73.25,0.391,0.3401,0.2689,...,31.9,48.0,1.0,1.0,0.73,1.11,0.88,1.81,1.05,1.05
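
Because `dropna()` removes every row with a missing value in any column, it can be worth checking how much data that costs before relying on it. The following is a minimal sketch on a small made-up frame; the frame `matches` and all of its values are illustrative assumptions, not rows from `spi_matches.csv`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the match data; the values are made up for illustration.
matches = pd.DataFrame({
    "team1": ["Team A", "Team B", "Team C"],
    "score1": [0.0, 2.0, np.nan],   # NaN stands in for a missing score
    "xg1": [0.97, np.nan, np.nan],  # NaN stands in for a missing xg value
})

# Count missing values per column before dropping anything.
missing_per_column = matches.isna().sum()

# Compare row counts before and after dropna() to see how many rows are lost.
rows_before = len(matches)
rows_after = len(matches.dropna())

print(missing_per_column)
print(f"dropna() keeps {rows_after} of {rows_before} rows")
```

Running the same two checks on the real `df` before the `dropna()` call would show which columns are responsible for the dropped rows.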


538 appears to have done some modeling of its own, as the data contains scores, projected scores, and adjusted scores. The first step is to remove both the projected score and the adjusted score columns, since they could bias any models built from this data.

In [2]:
# remove 538's adjusted and projected scores in a single call
df = df.drop(columns=["adj_score1", "adj_score2", "proj_score1", "proj_score2"])
df.head()

Unnamed: 0,date,league_id,league,team1,team2,spi1,spi2,prob1,prob2,probtie,importance1,importance2,score1,score2,xg1,xg2,nsxg1,nsxg2
0,2016-08-12,1843,French Ligue 1,Bastia,Paris Saint-Germain,51.16,85.68,0.0463,0.838,0.1157,32.4,67.7,0.0,1.0,0.97,0.63,0.43,0.45
1,2016-08-12,1843,French Ligue 1,AS Monaco,Guingamp,68.85,56.48,0.5714,0.1669,0.2617,53.7,22.9,2.0,2.0,2.45,0.77,1.75,0.42
2,2016-08-13,2411,Barclays Premier League,Hull City,Leicester City,53.57,66.81,0.3459,0.3621,0.2921,38.1,22.2,2.0,1.0,0.85,2.77,0.17,1.25
3,2016-08-13,2411,Barclays Premier League,Crystal Palace,West Bromwich Albion,55.19,58.66,0.4214,0.2939,0.2847,43.6,34.6,0.0,1.0,1.11,0.68,0.84,1.6
4,2016-08-13,2411,Barclays Premier League,Everton,Tottenham Hotspur,68.02,73.25,0.391,0.3401,0.2689,31.9,48.0,1.0,1.0,0.73,1.11,0.88,1.81


A couple of features can be derived from the existing columns in the data frame. The first set converts the year, month, and day of the match into numerical values for evaluation. The second set uses the fact that the teams are given as an ordered pair, so it is useful to know the average and the signed difference (team 1 minus team 2) of an attribute across the two teams. These are computed for the SPI and the score.

In [4]:
# numeric dates: split the 'YYYY-MM-DD' string into integer parts
df['year'] = df['date'].map(lambda x: int(x.split('-')[0]))
df['month'] = df['date'].map(lambda x: int(x.split('-')[1]))
df['day'] = df['date'].map(lambda x: int(x.split('-')[2]))

# averages and signed differences (team1 minus team2)
df['avg_spi'] = (df['spi1'] + df['spi2'])/2
df['delta_spi'] = (df['spi1'] - df['spi2'])
df['avg_score'] = (df['score1'] + df['score2'])/2
df['delta_score'] = (df['score1'] - df['score2'])

df.head()

Unnamed: 0,date,league_id,league,team1,team2,spi1,spi2,prob1,prob2,probtie,...,xg2,nsxg1,nsxg2,year,month,day,avg_spi,delta_spi,avg_score,delta_score
0,2016-08-12,1843,French Ligue 1,Bastia,Paris Saint-Germain,51.16,85.68,0.0463,0.838,0.1157,...,0.63,0.43,0.45,2016,8,12,68.42,-34.52,0.5,-1.0
1,2016-08-12,1843,French Ligue 1,AS Monaco,Guingamp,68.85,56.48,0.5714,0.1669,0.2617,...,0.77,1.75,0.42,2016,8,12,62.665,12.37,2.0,0.0
2,2016-08-13,2411,Barclays Premier League,Hull City,Leicester City,53.57,66.81,0.3459,0.3621,0.2921,...,2.77,0.17,1.25,2016,8,13,60.19,-13.24,1.5,1.0
3,2016-08-13,2411,Barclays Premier League,Crystal Palace,West Bromwich Albion,55.19,58.66,0.4214,0.2939,0.2847,...,0.68,0.84,1.6,2016,8,13,56.925,-3.47,0.5,-1.0
4,2016-08-13,2411,Barclays Premier League,Everton,Tottenham Hotspur,68.02,73.25,0.391,0.3401,0.2689,...,1.11,0.88,1.81,2016,8,13,70.635,-5.23,1.0,0.0
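
As an alternative to splitting the date string by hand, the same three columns can be derived by parsing the dates once with `pd.to_datetime` and reading the parts from the `.dt` accessor. This is only a sketch on a toy frame (the name `dates` is an assumption for illustration); it is not the approach used in the cell above.

```python
import pandas as pd

# Toy frame with 'date' strings in the same YYYY-MM-DD format as the match data.
dates = pd.DataFrame({"date": ["2016-08-12", "2018-12-30", "2019-10-09"]})

# Parse the strings once, then read each component from the .dt accessor.
parsed = pd.to_datetime(dates["date"], format="%Y-%m-%d")
dates["year"] = parsed.dt.year
dates["month"] = parsed.dt.month
dates["day"] = parsed.dt.day

print(dates)
```

Parsing with an explicit `format` raises an error on a malformed date instead of silently producing a wrong integer, which is the main reason to prefer it.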


To keep the modeling simple, I will use all data from before 2019 as training data and all data from 2019 as testing data for evaluating the various models.

In [5]:
training = df[df['year'] <= 2018]
training = training.reset_index(drop=True)
training.tail()

Unnamed: 0,date,league_id,league,team1,team2,spi1,spi2,prob1,prob2,probtie,...,xg2,nsxg1,nsxg2,year,month,day,avg_spi,delta_spi,avg_score,delta_score
8417,2018-12-30,2411,Barclays Premier League,Southampton,Manchester City,66.79,91.81,0.1363,0.6801,0.1837,...,3.12,0.4,2.3,2018,12,30,79.3,-25.02,2.0,-2.0
8418,2018-12-30,1856,Italy Serie B,Benevento,Brescia,31.06,34.62,0.3902,0.3223,0.2875,...,0.66,1.55,0.75,2018,12,30,32.84,-3.56,1.0,0.0
8419,2018-12-30,2411,Barclays Premier League,Manchester United,AFC Bournemouth,78.36,64.51,0.6312,0.1616,0.2072,...,0.82,1.71,1.25,2018,12,30,71.435,13.85,2.5,3.0
8420,2018-12-30,1856,Italy Serie B,Livorno,Padova,21.62,17.82,0.4906,0.2134,0.296,...,0.4,1.04,0.49,2018,12,30,19.72,3.8,1.0,0.0
8421,2018-12-30,1856,Italy Serie B,Spezia,Lecce,29.41,26.79,0.4496,0.246,0.3044,...,1.04,1.04,2.01,2018,12,30,28.1,2.62,1.0,0.0


In [6]:
testing = df[df['year'] == 2019]
testing = testing.reset_index(drop=True)
testing.tail()

Unnamed: 0,date,league_id,league,team1,team2,spi1,spi2,prob1,prob2,probtie,...,xg2,nsxg1,nsxg2,year,month,day,avg_spi,delta_spi,avg_score,delta_score
3922,2019-10-09,2105,Brasileiro Série A,Bahía,São Paulo,55.32,57.6,0.4429,0.2564,0.3007,...,0.36,0.92,0.56,2019,10,9,56.46,-2.28,0.0,0.0
3923,2019-10-09,2105,Brasileiro Série A,Cruzeiro,Fluminense,50.04,49.45,0.5202,0.2292,0.2506,...,0.3,1.96,0.55,2019,10,9,49.745,0.59,0.0,0.0
3924,2019-10-09,2105,Brasileiro Série A,Santos,Palmeiras,60.26,68.3,0.362,0.3643,0.2737,...,0.35,0.67,0.55,2019,10,9,64.28,-8.04,1.0,2.0
3925,2019-10-10,2105,Brasileiro Série A,Avaí,Vasco da Gama,38.77,51.21,0.3604,0.3573,0.2823,...,1.39,1.34,1.26,2019,10,10,44.99,-12.44,0.0,0.0
3926,2019-10-10,2105,Brasileiro Série A,Flamengo,Atletico Mineiro,73.23,54.1,0.7637,0.0787,0.1576,...,0.82,3.02,1.03,2019,10,10,63.665,19.13,2.0,2.0


From the final row indices (8421 for training and 3926 for testing), the split is 8,422 training rows to 3,927 testing rows, or roughly 68% train and 32% test, which is a decent initial split. The testing set will keep growing as more 2019 games are played, so the ratio will shift over time. The next step is to save the training and testing data and begin building the models.
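
The split ratio can be computed directly rather than read off the index, and the two frames can then be written to disk. The sketch below uses a toy frame and a temporary directory; the names `toy`, `train`, `test`, `training.csv`, and `testing.csv` are hypothetical and chosen only for illustration.

```python
import os
import tempfile

import pandas as pd

# Toy frame standing in for df; only the 'year' column drives the split.
toy = pd.DataFrame({
    "year": [2016, 2017, 2018, 2018, 2019, 2019],
    "delta_spi": [-34.52, 12.37, -13.24, -3.47, -2.28, 0.59],
})

# Same year-based split as above: up to 2018 for training, 2019 for testing.
train = toy[toy["year"] <= 2018].reset_index(drop=True)
test = toy[toy["year"] == 2019].reset_index(drop=True)

# Fraction of all rows that ends up in the training set.
train_fraction = len(train) / (len(train) + len(test))
print(f"train fraction: {train_fraction:.2f}")

# Hypothetical output location; a temporary directory keeps the sketch self-contained.
out_dir = tempfile.mkdtemp()
train_path = os.path.join(out_dir, "training.csv")
test_path = os.path.join(out_dir, "testing.csv")

# index=False avoids writing the row index as an extra column.
train.to_csv(train_path, index=False)
test.to_csv(test_path, index=False)
```

Applied to the real `training` and `testing` frames, the same calculation gives 8422 / (8422 + 3927), which is about 0.68.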