# Major Leagues
## General Information
1. Author: Patrick McNamee
2. Date: 10/11/2019
## Description
Analyzing the soccer (i.e. "football") SPI (Soccer Power Index) data from [538](https://github.com/fivethirtyeight/data/tree/master/soccer-spi). The goal of this notebook is to build models that predict the score of each team in a given match.

## Data Engineering
First we need to load the data into a data frame and then examine what information we have available.

In [3]:
import pandas as pd

df = pd.read_csv("./data/spi_matches.csv")
df.dropna()  # returns a new frame; the result is not assigned, so df itself is unchanged
df.head()

Unnamed: 0,date,league_id,league,team1,team2,spi1,spi2,prob1,prob2,probtie,...,importance1,importance2,score1,score2,xg1,xg2,nsxg1,nsxg2,adj_score1,adj_score2
0,2016-08-12,1843,French Ligue 1,Bastia,Paris Saint-Germain,51.16,85.68,0.0463,0.838,0.1157,...,32.4,67.7,0.0,1.0,0.97,0.63,0.43,0.45,0.0,1.05
1,2016-08-12,1843,French Ligue 1,AS Monaco,Guingamp,68.85,56.48,0.5714,0.1669,0.2617,...,53.7,22.9,2.0,2.0,2.45,0.77,1.75,0.42,2.1,2.1
2,2016-08-13,2411,Barclays Premier League,Hull City,Leicester City,53.57,66.81,0.3459,0.3621,0.2921,...,38.1,22.2,2.0,1.0,0.85,2.77,0.17,1.25,2.1,1.05
3,2016-08-13,2411,Barclays Premier League,Crystal Palace,West Bromwich Albion,55.19,58.66,0.4214,0.2939,0.2847,...,43.6,34.6,0.0,1.0,1.11,0.68,0.84,1.6,0.0,1.05
4,2016-08-13,2411,Barclays Premier League,Everton,Tottenham Hotspur,68.02,73.25,0.391,0.3401,0.2689,...,31.9,48.0,1.0,1.0,0.73,1.11,0.88,1.81,1.05,1.05
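To see what information is actually available, it can help to count the missing values per column before dropping anything. The following is a minimal sketch on a small stand-in frame built from a few rows shown in this notebook; the names `sample`, `missing_per_column`, and `played` are illustrative and not part of the original analysis.

```python
import pandas as pd

# Small stand-in frame built from rows shown in this notebook:
# two played matches and one future match with no score yet.
sample = pd.DataFrame({
    "date": ["2016-08-12", "2016-08-12", "2020-05-24"],
    "team1": ["Bastia", "AS Monaco", "Levante"],
    "team2": ["Paris Saint-Germain", "Guingamp", "Getafe"],
    "score1": [0.0, 2.0, None],
    "score2": [1.0, 2.0, None],
})

# Count missing values in each column.
missing_per_column = sample.isna().sum()
print(missing_per_column)

# dropna() returns a new frame; `sample` stays unchanged unless reassigned.
played = sample.dropna(subset=["score1", "score2"])
print(len(sample), len(played))
```

Because `dropna` does not modify the frame in place by default, the result has to be assigned to a name if the filtered rows are needed later.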


538 appears to have done some previous work on the data, as there are both scores and adjusted scores. The first thing to do is remove the two adjusted score columns.

In [5]:
df = df.drop(columns=["adj_score1", "adj_score2"])
df.head()

Unnamed: 0,date,league_id,league,team1,team2,spi1,spi2,prob1,prob2,probtie,proj_score1,proj_score2,importance1,importance2,score1,score2,xg1,xg2,nsxg1,nsxg2
0,2016-08-12,1843,French Ligue 1,Bastia,Paris Saint-Germain,51.16,85.68,0.0463,0.838,0.1157,0.91,2.36,32.4,67.7,0.0,1.0,0.97,0.63,0.43,0.45
1,2016-08-12,1843,French Ligue 1,AS Monaco,Guingamp,68.85,56.48,0.5714,0.1669,0.2617,1.82,0.86,53.7,22.9,2.0,2.0,2.45,0.77,1.75,0.42
2,2016-08-13,2411,Barclays Premier League,Hull City,Leicester City,53.57,66.81,0.3459,0.3621,0.2921,1.16,1.24,38.1,22.2,2.0,1.0,0.85,2.77,0.17,1.25
3,2016-08-13,2411,Barclays Premier League,Crystal Palace,West Bromwich Albion,55.19,58.66,0.4214,0.2939,0.2847,1.35,1.14,43.6,34.6,0.0,1.0,1.11,0.68,0.84,1.6
4,2016-08-13,2411,Barclays Premier League,Everton,Tottenham Hotspur,68.02,73.25,0.391,0.3401,0.2689,1.47,1.38,31.9,48.0,1.0,1.0,0.73,1.11,0.88,1.81


It would probably be useful to isolate the year and month in which each match occurs (the day could be extracted the same way). Perhaps matches become closer as the end of a season approaches.

In [24]:
df['year'] = df['date'].map(lambda x: int(x.split('-')[0]))
df['month'] = df['date'].map(lambda x: int(x.split('-')[1]))
df.tail()

Unnamed: 0,date,league_id,league,team1,team2,spi1,spi2,prob1,prob2,probtie,...,importance1,importance2,score1,score2,xg1,xg2,nsxg1,nsxg2,year,month
32101,2020-05-24,1871,Spanish Segunda Division,Deportivo La Coruña,Fuenlabrada,35.34,32.51,0.4824,0.2422,0.2754,...,,,,,,,,,2020,5
32102,2020-05-24,1871,Spanish Segunda Division,Almeria,Málaga,41.12,36.79,0.4774,0.2199,0.3027,...,,,,,,,,,2020,5
32103,2020-05-24,1871,Spanish Segunda Division,Numancia,Tenerife,30.59,30.92,0.4495,0.2582,0.2923,...,,,,,,,,,2020,5
32104,2020-05-24,1871,Spanish Segunda Division,Lugo,Mirandes,25.2,24.83,0.4588,0.2679,0.2733,...,,,,,,,,,2020,5
32105,2020-05-24,1869,Spanish Primera Division,Levante,Getafe,64.24,75.84,0.3617,0.3764,0.2619,...,,,,,,,,,2020,5
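As an alternative to splitting the date string by hand, the same parts can be pulled out with pandas' datetime accessor, which also makes the day easy to add. This is a sketch on a tiny frame holding three dates that appear in this notebook; the frame name `dates` and the `day` column are assumptions for illustration.

```python
import pandas as pd

# Three dates taken from the rows shown in this notebook.
dates = pd.DataFrame({"date": ["2016-08-12", "2016-08-13", "2020-05-24"]})

# Parse the ISO-style strings once, then read the parts off the .dt accessor.
parsed = pd.to_datetime(dates["date"], format="%Y-%m-%d")
dates["year"] = parsed.dt.year
dates["month"] = parsed.dt.month
dates["day"] = parsed.dt.day

print(dates)
```

Parsing to a datetime also fails loudly on malformed dates, whereas the string split would only fail if a piece could not be converted to an integer.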


Now there appear to be games that have not happened yet, so we will separate everything that happens in 2020 into a prediction dataframe, kept for fun, and remove those rows from the training set.
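One way to make this split is with a boolean mask on the `year` column. The sketch below uses a small stand-in frame with rows shown earlier in this notebook; the names `matches`, `predict_df`, and `train_df` are hypothetical, not names used by the original analysis.

```python
import pandas as pd

# Stand-in frame: two played 2016 matches and two unplayed 2020 matches.
matches = pd.DataFrame({
    "date": ["2016-08-12", "2016-08-12", "2020-05-24", "2020-05-24"],
    "team1": ["Bastia", "AS Monaco", "Lugo", "Levante"],
    "team2": ["Paris Saint-Germain", "Guingamp", "Mirandes", "Getafe"],
    "score1": [0.0, 2.0, None, None],
    "score2": [1.0, 2.0, None, None],
    "year": [2016, 2016, 2020, 2020],
})

# Mask selecting every match scheduled in 2020.
is_2020 = matches["year"] == 2020

# Copy each side so later edits do not touch the original frame.
predict_df = matches[is_2020].copy()
train_df = matches[~is_2020].copy()

print(len(train_df), len(predict_df))
```

Matches from late 2019 may also lack scores, so filtering on `matches["score1"].isna()` instead of the year is an alternative worth considering.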