# Project 2
#### By: David Hoffman and Kyle Kolodziej

In [1]:
import numpy as np 
import pandas as pd

# read in combined dataset into a dataframe
df = pd.read_csv("combined_data.csv") 

df.head(5)

Unnamed: 0.1,Unnamed: 0,Home Team,Away Team,Score,Half Time Score,Match Excitement,Home Team Rating,Away Team Rating,Home Team Possession %,Away Team Possession %,...,Away Team Fouls,Away Team Yellow Cards,Away Team Second Yellow Cards,Away Team Red Cards,Home Team Goals Scored,Away Team Goals Scored,Home Team Goals Conceeded,Away Team Goals Conceeded,year,league
0,0,MAN UTD,SWANSEA,1-2,0-1,5.9,5.6,7.6,60,40,...,20.0,4.0,0.0,0.0,1,2,2,1,2014,english
1,1,WEST BROM,SUNDERLAND,2-2,1-1,7.3,6.5,7.4,58,42,...,9.0,1.0,0.0,0.0,2,2,2,2,2014,english
2,2,LEICESTER CITY,EVERTON,2-2,1-2,7.0,6.5,6.3,37,63,...,10.0,1.0,0.0,0.0,2,2,2,2,2014,english
3,3,WEST HAM,TOTTENHAM,0-1,0-0,4.8,5.9,6.4,47,53,...,10.0,0.0,0.0,1.0,0,1,1,0,2014,english
4,4,QPR,HULL CITY,0-1,0-0,3.8,5.7,6.6,51,49,...,10.0,2.0,0.0,0.0,0,1,1,0,2014,english


The first of our datasets provides match data from each of the top five soccer leagues in the world: the Premier League, Ligue 1, the Bundesliga, Serie A, and La Liga. It covers games from 2014-2020 and includes many match statistics, such as each team's rating, the match excitement, team possession percentages, and shots on goal. To get this dataset into a pandas dataframe, we first downloaded it from Kaggle, where it arrived in our local environment as a .zip file containing several different files. We then extracted the individual .csv files from the .zip file and loaded the combined .csv into our first dataframe.
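The text above does not say how the individual .csv files were combined. One possible way, shown here as a minimal sketch, is to concatenate them with `pd.concat`. The helper `combine_csvs` and the file names `english.csv` and `spanish.csv` are hypothetical and exist only for this demo.

```python
import glob
import os
import tempfile

import pandas as pd


def combine_csvs(folder):
    """Concatenate every .csv file in `folder` into a single dataframe."""
    paths = sorted(glob.glob(os.path.join(folder, "*.csv")))
    frames = [pd.read_csv(path) for path in paths]
    return pd.concat(frames, ignore_index=True)


# Demo on two tiny throwaway files (hypothetical names and contents).
with tempfile.TemporaryDirectory() as tmp:
    pd.DataFrame({"Home Team": ["MAN UTD"], "year": [2014]}).to_csv(
        os.path.join(tmp, "english.csv"), index=False
    )
    pd.DataFrame({"Home Team": ["BARCELONA"], "year": [2014]}).to_csv(
        os.path.join(tmp, "spanish.csv"), index=False
    )
    combined = combine_csvs(tmp)

print(combined)
```

Writing with `index=False` keeps the row index out of the file. Extra columns named like `Unnamed: 0` typically appear when a saved index is read back in as data.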

In [2]:
# read in second dataset into a dataframe
df2 = pd.read_csv("Football_Data.csv") 

df2.head(5)

Unnamed: 0,League,Year,position,Team,matches,wins,draws,loses,scored,pts,xG,xGA,%LoseR,%DrawR,Shots,Yellow,Red,Fouls,S_OnTarget
0,La_liga,2014,1,Barcelona,38,30,4,4,110,94,102.980152,28.444293,0.25,0.714286,626.0,66.0,3.0,369.0,273.0
1,La_liga,2014,2,Real Madrid,38,30,2,6,118,92,95.766243,42.607198,0.0,0.607143,688.0,85.0,3.0,407.0,282.0
2,La_liga,2014,3,Atletico Madrid,38,23,9,6,67,78,57.04767,29.069107,0.166667,0.261364,438.0,109.0,5.0,543.0,187.0
3,La_liga,2014,4,Valencia,38,22,11,5,70,77,55.0625,39.392572,0.25,0.452381,439.0,101.0,9.0,566.0,159.0
4,La_liga,2014,5,Sevilla,38,23,7,8,71,76,69.526624,47.862742,0.166667,0.666667,455.0,115.0,2.0,604.0,181.0


Rather than showing game-to-game statistics, our second dataset gives season totals for each team in the five biggest soccer leagues in the world. It contains information such as total matches, total wins, total losses, goals scored, points, foul statistics, and shooting statistics. Unlike the first dataset, it only covers 2014-2018, so it is missing the 2019 and 2020 seasons that the first dataset contains. We hope this will not be a problem moving forward; if it is, we may need to impute values for the missing years or drop 2019 and 2020 from the first dataset altogether.
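If we end up dropping 2019 and 2020, the filter could look like the sketch below. The `matches` frame is a toy stand-in for the first dataset; only its `year` column mirrors the real data.

```python
import pandas as pd

# Toy stand-in for the match-level dataframe; team names are illustrative.
matches = pd.DataFrame(
    {
        "Home Team": ["MAN UTD", "ARSENAL", "CHELSEA"],
        "year": [2014, 2019, 2020],
    }
)

# Seasons covered by the season-level dataset (2014 through 2018 inclusive).
season_years = list(range(2014, 2019))

# Keep only matches from years that both datasets cover.
overlap = matches[matches["year"].isin(season_years)]
print(overlap)
```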

Questions we hope to answer after data exploration:

* What individual match factors are most correlated with match wins?
* What season statistics are most correlated with total match wins?
* Do expected goals for/against provide a direct correlation to a team's finishing position that year?
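As a first pass at these questions, a pairwise correlation matrix is one option. The sketch below uses only the five La Liga 2014 rows shown in the `df2.head(5)` output, so the values it produces are illustrative and not a result for the full dataset.

```python
import pandas as pd

# First five season rows as displayed earlier (La Liga, 2014).
season = pd.DataFrame(
    {
        "position": [1, 2, 3, 4, 5],
        "wins": [30, 30, 23, 22, 23],
        "xG": [102.980152, 95.766243, 57.04767, 55.0625, 69.526624],
        "xGA": [28.444293, 42.607198, 29.069107, 39.392572, 47.862742],
    }
)

# Pearson correlation between every pair of numeric columns.
corr = season.corr()

# Correlation of each statistic with wins and with finishing position.
print(corr["wins"].sort_values(ascending=False))
print(corr["position"])
```

A negative correlation between `xG` and `position` would mean that teams with more expected goals tend to finish higher, since position 1 is the best finish.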

## 2. Data Inspection

In [3]:
# data types of first dataset attributes
df.dtypes

Unnamed: 0                         int64
Home Team                         object
Away Team                         object
Score                             object
Half Time Score                   object
Match Excitement                 float64
Home Team Rating                 float64
Away Team Rating                 float64
Home Team Possession %             int64
Away Team Possession %             int64
Home Team Off Target Shots       float64
Home Team On Target Shots        float64
Home Team Total Shots            float64
Home Team Blocked Shots          float64
Home Team Corners                float64
Home Team Throw Ins              float64
Home Team Pass Success %         float64
Home Team Aerials Won            float64
Home Team Clearances             float64
Home Team Fouls                  float64
Home Team Yellow Cards           float64
Home Team Second Yellow Cards    float64
Home Team Red Cards              float64
Away Team Off Target Shots       float64
Away Team On Tar

In the first dataset, nearly all of the features are numerical, as many of them are totals for a statistical category over the course of a game. There are also several features that currently contain categorical data, the obvious ones being the league and the home and away team names. In addition, the final score and halftime score are currently stored as strings and could be interpreted as categorical or numerical data depending on the context. They are stored as strings because each one contains both the home and away team's goal totals separated by a hyphen. It may be beneficial to divide each of these features into two (home score and away score), but we believe that the combined representation also has merit because it shows the entire picture, including how close the game was.
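Splitting a hyphenated score into two numeric features could be done as in the sketch below. The column names `home_goals`, `away_goals`, and `goal_diff` are hypothetical. The head output suggests the final score already has split counterparts (`Home Team Goals Scored`, `Away Team Goals Scored`), so this would mainly apply to `Half Time Score`.

```python
import pandas as pd

# Final scores from the first three rows shown in the head output.
scores = pd.DataFrame({"Score": ["1-2", "2-2", "0-1"]})

# Split "home-away" on the hyphen and convert both parts to integers.
goals = scores["Score"].str.split("-", expand=True).astype(int)
scores["home_goals"] = goals[0]
scores["away_goals"] = goals[1]

# The goal difference keeps the "closeness of the game" information.
scores["goal_diff"] = scores["home_goals"] - scores["away_goals"]
print(scores)
```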

In [4]:
# data types of second dataset attributes
df2.dtypes

League         object
Year            int64
position        int64
Team           object
matches         int64
wins            int64
draws           int64
loses           int64
scored          int64
pts             int64
xG            float64
xGA           float64
%LoseR        float64
%DrawR        float64
Shots         float64
Yellow        float64
Red           float64
Fouls         float64
S_OnTarget    float64
dtype: object

Like our first dataset, the second dataset contains mostly numerical data representing totals in each statistical category across an entire season. The exceptions are again the league and team name features, which are categorical variables represented as strings.

In [5]:
print(f"The first dataset contains {df.size} elements and {df.shape[0]} rows.")

The first dataset contains 506604 elements and 12062 rows.


In [6]:
print(f"The second dataset contains {df2.size} elements and {df2.shape[0]} rows.")

The second dataset contains 9310 elements and 490 rows.
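Because `DataFrame.size` is the number of rows times the number of columns, the printed totals also tell us how many columns each dataframe has. The sketch below checks the identity on a toy frame and then applies it to the numbers printed above.

```python
import pandas as pd

# Toy frame to confirm that size == rows * columns.
toy = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
rows, cols = toy.shape
assert toy.size == rows * cols

# Apply the same identity to the totals printed for the two datasets.
cols_first = 506604 // 12062   # elements / rows for the first dataset
cols_second = 9310 // 490      # elements / rows for the second dataset
print(cols_first, cols_second)
```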


In [7]:
# check for null values
df.isnull().sum()

Unnamed: 0                       0
Home Team                        0
Away Team                        0
Score                            0
Half Time Score                  0
Match Excitement                 0
Home Team Rating                 0
Away Team Rating                 0
Home Team Possession %           0
Away Team Possession %           0
Home Team Off Target Shots       0
Home Team On Target Shots        0
Home Team Total Shots            0
Home Team Blocked Shots          0
Home Team Corners                0
Home Team Throw Ins              0
Home Team Pass Success %         0
Home Team Aerials Won            0
Home Team Clearances             0
Home Team Fouls                  0
Home Team Yellow Cards           0
Home Team Second Yellow Cards    0
Home Team Red Cards              0
Away Team Off Target Shots       0
Away Team On Target Shots        0
Away Team Total Shots            0
Away Team Blocked Shots          0
Away Team Corners                0
Away Team Throw Ins 

In [8]:
# check for null values
df2.isnull().sum()

League        0
Year          0
position      0
Team          0
matches       0
wins          0
draws         0
loses         0
scored        0
pts           0
xG            0
xGA           0
%LoseR        6
%DrawR        6
Shots         6
Yellow        6
Red           6
Fouls         6
S_OnTarget    6
dtype: int64
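The counts above show that the second dataset is missing six values in each of seven columns, while the first dataset has no nulls in the columns shown. Two common ways to handle this are dropping the incomplete rows or filling them with a group median. The sketch below shows both on toy data with hypothetical league labels and values. We have not yet decided which approach to use.

```python
import numpy as np
import pandas as pd

# Toy season table with one missing value (hypothetical numbers).
season = pd.DataFrame(
    {
        "League": ["A", "A", "A", "B"],
        "Shots": [400.0, np.nan, 500.0, 450.0],
    }
)

# Inspect the rows that contain at least one null.
incomplete = season[season.isnull().any(axis=1)]

# Option 1: drop incomplete rows.
dropped = season.dropna()

# Option 2: fill each gap with the median of its own league.
filled = season.copy()
filled["Shots"] = filled.groupby("League")["Shots"].transform(
    lambda s: s.fillna(s.median())
)
print(filled)
```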

## 3. Data Description

Let's start with the first dataset...

In [14]:
df.head(10)

Unnamed: 0.1,Unnamed: 0,Home Team,Away Team,Score,Half Time Score,Match Excitement,Home Team Rating,Away Team Rating,Home Team Possession %,Away Team Possession %,...,Away Team Fouls,Away Team Yellow Cards,Away Team Second Yellow Cards,Away Team Red Cards,Home Team Goals Scored,Away Team Goals Scored,Home Team Goals Conceeded,Away Team Goals Conceeded,year,league
0,0,MAN UTD,SWANSEA,1-2,0-1,5.9,5.6,7.6,60,40,...,20.0,4.0,0.0,0.0,1,2,2,1,2014,english
1,1,WEST BROM,SUNDERLAND,2-2,1-1,7.3,6.5,7.4,58,42,...,9.0,1.0,0.0,0.0,2,2,2,2,2014,english
2,2,LEICESTER CITY,EVERTON,2-2,1-2,7.0,6.5,6.3,37,63,...,10.0,1.0,0.0,0.0,2,2,2,2,2014,english
3,3,WEST HAM,TOTTENHAM,0-1,0-0,4.8,5.9,6.4,47,53,...,10.0,0.0,0.0,1.0,0,1,1,0,2014,english
4,4,QPR,HULL CITY,0-1,0-0,3.8,5.7,6.6,51,49,...,10.0,2.0,0.0,0.0,0,1,1,0,2014,english
5,5,STOKE,ASTON VILLA,0-1,0-0,2.8,6.5,7.0,63,37,...,9.0,3.0,0.0,0.0,0,1,1,0,2014,english
6,6,ARSENAL,CRYSTAL PALACE,2-1,1-1,5.8,7.7,5.6,76,24,...,19.0,3.0,1.0,0.0,2,1,1,2,2014,english
7,7,LIVERPOOL,SOUTHAMPTON,2-1,1-0,6.0,7.1,6.7,56,44,...,11.0,2.0,0.0,0.0,2,1,1,2,2014,english
8,8,NEWCASTLE,MAN CITY,0-2,0-1,4.6,5.5,8.1,44,56,...,11.0,5.0,0.0,0.0,0,2,2,0,2014,english
9,9,BURNLEY,CHELSEA,1-3,1-3,5.3,5.3,7.4,39,61,...,7.0,1.0,0.0,0.0,1,3,3,1,2014,english


In [11]:
# data types of first dataset attributes
df.dtypes

Unnamed: 0                         int64
Home Team                         object
Away Team                         object
Score                             object
Half Time Score                   object
Match Excitement                 float64
Home Team Rating                 float64
Away Team Rating                 float64
Home Team Possession %             int64
Away Team Possession %             int64
Home Team Off Target Shots       float64
Home Team On Target Shots        float64
Home Team Total Shots            float64
Home Team Blocked Shots          float64
Home Team Corners                float64
Home Team Throw Ins              float64
Home Team Pass Success %         float64
Home Team Aerials Won            float64
Home Team Clearances             float64
Home Team Fouls                  float64
Home Team Yellow Cards           float64
Home Team Second Yellow Cards    float64
Home Team Red Cards              float64
Away Team Off Target Shots       float64
Away Team On Tar

#### Attribute Information

* Unnamed: 0 : index
* Home Team: club name of team playing at home
* Away Team: club name of team playing on the road
* Score: final score of the game
* Half Time Score: score of the game at halftime
* Match Excitement: excitement rating of the match
    - We are not entirely sure how this was derived; we assume it is a combination of attendance, crowd noise, and TV viewership
* Home Team Rating: match rating of the home team
* Away Team Rating: match rating of the away team
    - Team rating note: we assume this is a calculated value of how well a team performed in a match
* Home Team Possession %: percent of the match the home team had possession of the ball
* Away Team Possession %: percent of the match the away team had possession of the ball
* Home Team Off Target Shots: number of shots off target for the home team
* Home Team On Target Shots: number of shots on target for the home team
* Home Team Total Shots: total number of shots for the home team
* Home Team Blocked Shots: number of blocked shots by the home team
* Home Team Corners: number of corners for the home team
* Home Team Throw Ins: number of throw ins for the home team
* Home Team Pass Success %: percent of successful passes for the home team
* Home Team Aerials Won: number of balls won in the air by the home team
* Home Team Clearances: number of balls cleared by the home team
* Home Team Fouls: number of fouls committed by the home team
* Home Team Yellow Cards: number of yellow cards for the home team
* Home Team Second Yellow Cards: number of times a second yellow card was given to a home team player
* Home Team Red Cards: number of red cards for the home team
* Away Team Off Target Shots: number of shots off target for the away team
* Away Team On Target Shots: number of shots on target for the away team
* Away Team Total Shots: total number of shots for the away team
* Away Team Blocked Shots: number of blocked shots by the away team
* Away Team Corners: number of corners for the away team
* Away Team Throw Ins: number of throw ins for the away team
* Away Team Pass Success %: percent of successful passes for the away team
* Away Team Aerials Won: number of balls won in the air by the away team
* Away Team Clearances: number of balls cleared by the away team
* Away Team Fouls: number of fouls committed by the away team
* Away Team Yellow Cards: number of yellow cards for the away team
* Away Team Second Yellow Cards: number of times a second yellow card was given to an away team player
* Away Team Red Cards: number of red cards for the away team
* Home Team Goals Scored: number of goals scored by the home team
* Away Team Goals Scored: number of goals scored by the away team
* Home Team Goals Conceeded: number of goals conceded by the home team
* Away Team Goals Conceeded: number of goals conceded by the away team
* year: year
* league: soccer league

Categorical variables:
* Home Team
* Away Team
* Score
* Half Time Score
* league

In [16]:
# Let's look at these numerical variables
df.describe()

Unnamed: 0.1,Unnamed: 0,Match Excitement,Home Team Rating,Away Team Rating,Home Team Possession %,Away Team Possession %,Home Team Off Target Shots,Home Team On Target Shots,Home Team Total Shots,Home Team Blocked Shots,...,Away Team Clearances,Away Team Fouls,Away Team Yellow Cards,Away Team Second Yellow Cards,Away Team Red Cards,Home Team Goals Scored,Away Team Goals Scored,Home Team Goals Conceeded,Away Team Goals Conceeded,year
count,12062.0,12062.0,12062.0,12062.0,12062.0,12062.0,12062.0,12062.0,12062.0,12062.0,...,12062.0,12062.0,12062.0,12062.0,12062.0,12062.0,12062.0,12062.0,12062.0,12062.0
mean,6030.5,5.391494,6.399702,6.142539,51.199469,48.800531,5.644338,4.72318,13.690764,3.338501,...,22.792323,13.083154,2.162245,0.060852,0.057785,1.531172,1.206351,1.206351,1.531172,2017.109352
std,3482.143808,1.719655,0.988037,0.992074,11.275257,11.275257,2.78152,2.559607,5.244448,2.246498,...,10.141781,4.205046,1.365581,0.244216,0.243772,1.305178,1.168266,1.168266,1.305178,1.965096
min,0.0,1.4,2.1,1.3,17.0,16.0,0.0,0.0,0.0,0.0,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2014.0
25%,3015.25,4.1,5.7,5.5,43.0,41.0,4.0,3.0,10.0,2.0,...,15.0,10.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,2015.0
50%,6030.5,5.3,6.4,6.1,51.0,49.0,5.0,4.0,13.0,3.0,...,22.0,13.0,2.0,0.0,0.0,1.0,1.0,1.0,1.0,2017.0
75%,9045.75,6.4,7.0,6.8,59.0,57.0,7.0,6.0,17.0,5.0,...,29.0,16.0,3.0,0.0,0.0,2.0,2.0,2.0,2.0,2019.0
max,12061.0,10.0,10.0,10.0,84.0,83.0,20.0,18.0,47.0,19.0,...,79.0,32.0,9.0,2.0,2.0,10.0,9.0,9.0,10.0,2020.0


Now let's look at the second data set...

In [19]:
df2.head(5)

Unnamed: 0,League,Year,position,Team,matches,wins,draws,loses,scored,pts,xG,xGA,%LoseR,%DrawR,Shots,Yellow,Red,Fouls,S_OnTarget
0,La_liga,2014,1,Barcelona,38,30,4,4,110,94,102.980152,28.444293,0.25,0.714286,626.0,66.0,3.0,369.0,273.0
1,La_liga,2014,2,Real Madrid,38,30,2,6,118,92,95.766243,42.607198,0.0,0.607143,688.0,85.0,3.0,407.0,282.0
2,La_liga,2014,3,Atletico Madrid,38,23,9,6,67,78,57.04767,29.069107,0.166667,0.261364,438.0,109.0,5.0,543.0,187.0
3,La_liga,2014,4,Valencia,38,22,11,5,70,77,55.0625,39.392572,0.25,0.452381,439.0,101.0,9.0,566.0,159.0
4,La_liga,2014,5,Sevilla,38,23,7,8,71,76,69.526624,47.862742,0.166667,0.666667,455.0,115.0,2.0,604.0,181.0


In [18]:
df2.dtypes

League         object
Year            int64
position        int64
Team           object
matches         int64
wins            int64
draws           int64
loses           int64
scored          int64
pts             int64
xG            float64
xGA           float64
%LoseR        float64
%DrawR        float64
Shots         float64
Yellow        float64
Red           float64
Fouls         float64
S_OnTarget    float64
dtype: object

#### Attribute Information

* League: league
* Year: year
* position: finishing position in that league for that year
* Team: club name
* matches: matches played
* wins: wins
* draws: draws/ties
* loses: losses
* scored: goals for
* pts: points
* xG: expected goals for
* xGA: expected goals against
* %LoseR: % games lost
* %DrawR: % games drawn
* Shots: shots
* Yellow: yellow cards
* Red: red cards
* Fouls: fouls committed
* S_OnTarget: shots on target

In [20]:
df2.describe()

Unnamed: 0,Year,position,matches,wins,draws,loses,scored,pts,xG,xGA,%LoseR,%DrawR,Shots,Yellow,Red,Fouls,S_OnTarget
count,490.0,490.0,490.0,490.0,490.0,490.0,490.0,490.0,490.0,490.0,484.0,484.0,484.0,484.0,484.0,484.0,484.0
mean,2016.0,10.316327,37.265306,13.965306,9.334694,13.965306,50.640816,51.230612,49.266015,49.266015,0.08545,0.323255,461.863636,75.743802,3.987603,488.258264,163.10124
std,1.415659,5.683537,1.550454,6.008925,2.957473,5.585259,17.409702,17.162578,13.843452,10.395048,0.116161,0.166336,80.30564,18.935558,2.420549,74.619403,38.412441
min,2014.0,1.0,34.0,2.0,2.0,1.0,22.0,15.0,26.527059,20.794377,0.0,0.0,281.0,37.0,0.0,278.0,94.0
25%,2015.0,5.0,38.0,10.0,7.0,10.0,38.25,39.0,39.834251,42.673362,0.0,0.206931,407.75,62.0,2.0,433.0,136.0
50%,2016.0,10.0,38.0,12.5,9.0,14.0,47.0,48.0,46.184655,48.964268,0.055556,0.2875,445.0,72.0,4.0,492.0,154.0
75%,2017.0,15.0,38.0,17.0,11.0,18.0,58.0,61.0,54.927071,55.741148,0.125,0.430871,498.25,90.0,5.0,541.0,181.0
max,2018.0,20.0,38.0,32.0,18.0,29.0,118.0,100.0,113.59827,78.863255,0.75,0.916667,710.0,133.0,14.0,678.0,290.0


## 4. Merge the Data

In [21]:
df.head(2)

Unnamed: 0.1,Unnamed: 0,Home Team,Away Team,Score,Half Time Score,Match Excitement,Home Team Rating,Away Team Rating,Home Team Possession %,Away Team Possession %,...,Away Team Fouls,Away Team Yellow Cards,Away Team Second Yellow Cards,Away Team Red Cards,Home Team Goals Scored,Away Team Goals Scored,Home Team Goals Conceeded,Away Team Goals Conceeded,year,league
0,0,MAN UTD,SWANSEA,1-2,0-1,5.9,5.6,7.6,60,40,...,20.0,4.0,0.0,0.0,1,2,2,1,2014,english
1,1,WEST BROM,SUNDERLAND,2-2,1-1,7.3,6.5,7.4,58,42,...,9.0,1.0,0.0,0.0,2,2,2,2,2014,english


In [22]:
df2.head(2)

Unnamed: 0,League,Year,position,Team,matches,wins,draws,loses,scored,pts,xG,xGA,%LoseR,%DrawR,Shots,Yellow,Red,Fouls,S_OnTarget
0,La_liga,2014,1,Barcelona,38,30,4,4,110,94,102.980152,28.444293,0.25,0.714286,626.0,66.0,3.0,369.0,273.0
1,La_liga,2014,2,Real Madrid,38,30,2,6,118,92,95.766243,42.607198,0.0,0.607143,688.0,85.0,3.0,407.0,282.0


In [23]:
aggData = pd.merge(df, df2, on=['Year', 'Team'], how='inner')
aggData.head()

KeyError: 'Year'

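The merge above fails with a KeyError because the first dataset has no Year or Team columns: its year column is the lowercase year, and its teams appear under Home Team and Away Team. Our strategy for merging will be to combine the two datasets on team and year.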
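A merge on team and year needs explicit `left_on` and `right_on` keys because the column names differ between the two datasets. The team names also differ in casing in the outputs above (for example `MAN UTD` versus `Barcelona`), so some normalisation is likely needed. The sketch below uses toy frames: the team spellings in `matches`, the `team_key` helper column, and the `home_season_` prefix are assumptions for illustration, and real club names may need a manual mapping beyond uppercasing.

```python
import pandas as pd

# Toy match-level data (team spellings are hypothetical).
matches = pd.DataFrame(
    {
        "Home Team": ["BARCELONA", "REAL MADRID", "BARCELONA"],
        "Away Team": ["REAL MADRID", "BARCELONA", "VALENCIA"],
        "year": [2014, 2014, 2019],
    }
)

# Toy season-level data using the points shown in the df2 output.
seasons = pd.DataFrame(
    {"Team": ["Barcelona", "Real Madrid"], "Year": [2014, 2014], "pts": [94, 92]}
)

# Normalise casing so the join keys are comparable.
seasons = seasons.assign(team_key=seasons["Team"].str.upper())

# Attach the home team's season totals to each match.
merged = matches.merge(
    seasons.add_prefix("home_season_"),
    left_on=["year", "Home Team"],
    right_on=["home_season_Year", "home_season_team_key"],
    how="left",
)
print(merged[["Home Team", "year", "home_season_pts"]])
```

With `how="left"`, matches from seasons that the second dataset does not cover are kept with missing season values instead of being dropped. Repeating the merge with `Away Team` and a different prefix would attach the away team's totals.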

## 5. Explore Bivariate Relationships

In [10]:
# Predict scores of 2019 and 2020
# Two models: one for home goals, one for away
# Eval on accuracy of score, goal differential, match winner
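
The three evaluation criteria in the plan above can be written as simple accuracy measures. This is a minimal sketch: the function `evaluate` is hypothetical, the true scores come from rows shown in the head output, and the predictions are made up to exercise the function, not the output of any model.

```python
import numpy as np


def evaluate(home_true, away_true, home_pred, away_pred):
    """Return accuracy on exact score, goal differential, and match winner."""
    home_true, away_true = np.asarray(home_true), np.asarray(away_true)
    home_pred, away_pred = np.asarray(home_pred), np.asarray(away_pred)

    diff_true = home_true - away_true
    diff_pred = home_pred - away_pred

    exact = (home_true == home_pred) & (away_true == away_pred)
    same_diff = diff_true == diff_pred
    # The sign of the goal difference encodes home win, draw, or away win.
    same_winner = np.sign(diff_true) == np.sign(diff_pred)

    return {
        "exact_score": float(exact.mean()),
        "goal_diff": float(same_diff.mean()),
        "winner": float(same_winner.mean()),
    }


# True scores: 1-2, 2-2, 0-1, 2-1. Predictions are invented for the demo.
metrics = evaluate(
    home_true=[1, 2, 0, 2],
    away_true=[2, 2, 1, 1],
    home_pred=[0, 1, 0, 3],
    away_pred=[1, 1, 1, 1],
)
print(metrics)
```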


## 6. References

[1] Football Data : Top 5 Leagues. https://www.kaggle.com/sanjeetsinghnaik/football-data-top-5-leagues

[2] Points Table: Top 5 Leagues. https://www.kaggle.com/thegreatcoder/points-table-of-5-leagues-in-football-20142018