# Predicting Soccer Match Results

As a fan of the English Premier League (EPL), particularly Tottenham Hotspur (Come On You Spurs!), I've always been obsessed with tracking the stats of different players and teams. In this project, I'm going to channel that obsession to see if I am able to predict the outcome of a match based on certain team stats. This project will involve a heavy amount of web scraping to pull in the data I need from [FBRef](fbref.com), as well as machine learning to then analyze and then make predictions off of that data.

In [3]:
import requests #importing library for sending HTTP requests
from bs4 import BeautifulSoup #importing library for pulling data out of HTML and XML files
import pandas as pd #importing library for data manipulation
pd.set_option('display.max_columns', None) #display all columns of the dataframe
pd.set_option('display.max_rows', 100) #display 100 rows of the dataframe
import time #importing time library
import random #importing random library
import numpy as np #importing library for numerical computations

#importing libraries for machine learning
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import accuracy_score
from sklearn.metrics import r2_score
from sklearn.feature_selection import RFECV
from sklearn.metrics import precision_score
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC, SVR
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier

In [2]:
# The years I will use for scraping data
years = ['2021-2022','2020-2021','2019-2020','2018-2019','2017-2018']

In [3]:
# This list will hold my dataframes
dfs = []

In [4]:
domain = "https://fbref.com/"

In [5]:
# Creating a function that will pull all of the links to the individual teams' pages for each season
def pull_team_links(url):
    r = requests.get(url)
    soup = BeautifulSoup(r.text, 'html.parser')
    # The top table is the one we want. It is the final season standings with all of the teams
    table = soup.table
    a_tags = table.find_all('a')
    links = []
    teams = []
    for tag in a_tags:
        if 'squads' in tag['href']:
            team = tag.text
            teams.append(team)
            link = tag['href']
            links.append(link)
    team_links = [link for link in links if 'squads' in link]
    usable_links = [domain + link for link in team_links]
    return teams, usable_links

In [6]:
# Scrape each team's matchlogs into a dataframe
def scrape_team(team, url):
    r = requests.get(url)
    html = r.text
    soup = BeautifulSoup(html, 'html.parser')
    matchlogs = soup.find("table", {"id": "matchlogs_for"})
    match_df = pd.read_html(str(matchlogs))[0]
    # I am now going to pull the shooting stats, which I will merge with the matchlogs I just pulled
    shooting_links = soup.find_all("a", text="Shooting")
    shooting_url = domain + shooting_links[0]['href']
    r_2 = requests.get(shooting_url)
    html_2 = r_2.text
    soup_2 = BeautifulSoup(html_2, 'html.parser')
    shooting_matchlogs = soup_2.find("table", {"id": "matchlogs_for"})
    shooting_df = pd.read_html(str(shooting_matchlogs))[0]
    shooting_df.columns = shooting_df.columns.droplevel()
    combined_df = match_df.join(shooting_df[['Sh', 'SoT', 'Dist','PKatt']], how='left')
    combined_df['Team'] = team
    combined_df['Season'] = year
    epl_df = combined_df.loc[combined_df['Comp']=='Premier League']
    return epl_df

In [7]:
for year in years:
    year_url = "https://fbref.com/en/comps/9/" + year + "/" + year + "-Premier-League-Stats" # Constructing the URL using the year value
    teams, links = pull_team_links(year_url) # Calling the pull_team_links function with the constructed URL
    for team, link in zip(teams, links):
        df = scrape_team(team, link) # Calling the scrape_team function for each team and link
        dfs.append(df) # Appending the returned dataframe to the dfs list
        time.sleep(random.randint(1,11)) # sleeping for a random amount of time between 1 and 11 seconds before moving on to the next iteration

In [9]:
master_df = dfs[0] #start with the first dataframe
for df in dfs[1:]: 
    master_df = master_df.append(df) #append the rest of the dataframes
master_df.reset_index(inplace=True, drop=True) # reset the index
master_df.shape #display shape

(3800, 25)

In [10]:
master_df.tail()

Unnamed: 0,Date,Time,Comp,Round,Day,Venue,Result,GF,GA,Opponent,xG,xGA,Poss,Attendance,Captain,Formation,Referee,Match Report,Notes,Sh,SoT,Dist,PKatt,Team,Season
3795,2018-04-15,16:00,Premier League,Matchweek 34,Sun,Away,W,1,0,Manchester Utd,0.7,0.7,30.0,75095.0,Chris Brunt,4-4-1-1,Paul Tierney,Match Report,,10.0,4.0,18.1,0.0,West Brom,2017-2018
3796,2018-04-21,12:30,Premier League,Matchweek 35,Sat,Home,D,2,2,Liverpool,1.3,1.3,39.0,24520.0,Chris Brunt,4-4-1-1,Stuart Attwell,Match Report,,13.0,6.0,17.7,0.0,West Brom,2017-2018
3797,2018-04-28,15:00,Premier League,Matchweek 36,Sat,Away,W,1,0,Newcastle Utd,0.7,1.8,38.0,52283.0,Chris Brunt,4-4-1-1,David Coote,Match Report,,9.0,2.0,20.1,0.0,West Brom,2017-2018
3798,2018-05-05,15:00,Premier League,Matchweek 37,Sat,Home,W,1,0,Tottenham,1.6,1.2,26.0,23685.0,Chris Brunt,4-4-1-1,Mike Jones,Match Report,,9.0,1.0,10.2,0.0,West Brom,2017-2018
3799,2018-05-13,15:00,Premier League,Matchweek 38,Sun,Away,L,0,2,Crystal Palace,0.2,2.2,41.0,25357.0,Chris Brunt,4-4-1-1,Jonathan Moss,Match Report,,7.0,1.0,24.8,0.0,West Brom,2017-2018


In [2]:
master_df.to_csv('EPL_Stats.csv', index=False)

NameError: name 'master_df' is not defined

The reason I am saving it to csv here is so that if I want to come back and do more analysis, I do not have to re-scrape the data.

In [4]:
matches = pd.read_csv('EPL_Stats.csv')
# Previewing the data
matches.head()

Unnamed: 0,Date,Time,Comp,Round,Day,Venue,Result,GF,GA,Opponent,xG,xGA,Poss,Attendance,Captain,Formation,Referee,Match Report,Notes,Sh,SoT,Dist,PKatt,Team,Season
0,2021-08-15,16:30,Premier League,Matchweek 1,Sun,Away,L,0,1,Tottenham,2.0,1.0,65.0,58262.0,Fernandinho,4-3-3,Anthony Taylor,Match Report,,18.0,4.0,17.3,0.0,Manchester City,2021-2022
1,2021-08-21,15:00,Premier League,Matchweek 2,Sat,Home,W,5,0,Norwich City,2.7,0.1,67.0,51437.0,İlkay Gündoğan,4-3-3,Graham Scott,Match Report,,16.0,4.0,18.5,0.0,Manchester City,2021-2022
2,2021-08-28,12:30,Premier League,Matchweek 3,Sat,Home,W,5,0,Arsenal,4.0,0.2,80.0,52276.0,İlkay Gündoğan,4-3-3,Martin Atkinson,Match Report,,25.0,10.0,14.8,0.0,Manchester City,2021-2022
3,2021-09-11,15:00,Premier League,Matchweek 4,Sat,Away,W,1,0,Leicester City,3.3,0.6,61.0,32087.0,İlkay Gündoğan,4-3-3,Paul Tierney,Match Report,,25.0,8.0,14.3,0.0,Manchester City,2021-2022
4,2021-09-18,15:00,Premier League,Matchweek 5,Sat,Home,D,0,0,Southampton,1.2,0.5,64.0,52698.0,Fernandinho,4-3-3,Jonathan Moss,Match Report,,16.0,1.0,16.4,0.0,Manchester City,2021-2022


In [5]:
# Inspecting the variables for irregularities
matches.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3800 entries, 0 to 3799
Data columns (total 25 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Date          3800 non-null   object 
 1   Time          3800 non-null   object 
 2   Comp          3800 non-null   object 
 3   Round         3800 non-null   object 
 4   Day           3800 non-null   object 
 5   Venue         3800 non-null   object 
 6   Result        3800 non-null   object 
 7   GF            3800 non-null   int64  
 8   GA            3800 non-null   int64  
 9   Opponent      3800 non-null   object 
 10  xG            3800 non-null   float64
 11  xGA           3800 non-null   float64
 12  Poss          3800 non-null   float64
 13  Attendance    2920 non-null   float64
 14  Captain       3800 non-null   object 
 15  Formation     3800 non-null   object 
 16  Referee       3800 non-null   object 
 17  Match Report  3800 non-null   object 
 18  Notes         0 non-null    

The distance column has some null values, let's get rid of those so we can use it as a float instead of an object. Let's see what the null values are.

In [7]:
matches[matches['Dist'].isnull()]

Unnamed: 0,Date,Time,Comp,Round,Day,Venue,Result,GF,GA,Opponent,xG,xGA,Poss,Attendance,Captain,Formation,Referee,Match Report,Notes,Sh,SoT,Dist,PKatt,Team,Season
1258,2020-10-18,14:00,Premier League,Matchweek 5,Sun,Home,D,1,1,Brighton,0.8,1.7,34.0,,Gary Cahill,4-4-2,Stuart Attwell,Match Report,,0.0,0.0,,1.0,Crystal Palace,2020-2021
2802,2019-03-02,15:00,Premier League,Matchweek 29,Sat,Home,L,0,1,Manchester City,0.0,1.2,19.0,10699.0,Andrew Surman,5-4-1,Kevin Friend,Match Report,,0.0,0.0,,0.0,Bournemouth,2018-2019
3715,2018-03-10,15:00,Premier League,Matchweek 30,Sat,Away,D,0,0,Huddersfield,0.0,1.4,20.0,23567.0,Federico Fernández,5-4-1,Michael Oliver,Match Report,,0.0,0.0,,0.0,Swansea City,2017-2018


I want to fill these values, but not as 0 and not as a universal mean. Instead, I will fill it with the average for each team, as that is likely considerably more representative.

In [8]:
matches['Dist'] = matches['Dist'].fillna(matches.groupby('Team')['Dist'].transform('mean'))

In [9]:
match_stats = matches[['Date','Venue','Result','GF','GA','Opponent','xG','xGA','Poss','Sh','SoT','Dist','PKatt','Team','Referee']].copy()
match_stats.head()

Unnamed: 0,Date,Venue,Result,GF,GA,Opponent,xG,xGA,Poss,Sh,SoT,Dist,PKatt,Team,Referee
0,2021-08-15,Away,L,0,1,Tottenham,2.0,1.0,65.0,18.0,4.0,17.3,0.0,Manchester City,Anthony Taylor
1,2021-08-21,Home,W,5,0,Norwich City,2.7,0.1,67.0,16.0,4.0,18.5,0.0,Manchester City,Graham Scott
2,2021-08-28,Home,W,5,0,Arsenal,4.0,0.2,80.0,25.0,10.0,14.8,0.0,Manchester City,Martin Atkinson
3,2021-09-11,Away,W,1,0,Leicester City,3.3,0.6,61.0,25.0,8.0,14.3,0.0,Manchester City,Paul Tierney
4,2021-09-18,Home,D,0,0,Southampton,1.2,0.5,64.0,16.0,1.0,16.4,0.0,Manchester City,Jonathan Moss


I mostly only kept the actual stats of the game. I kept date for two reasons. First, because I am going to use that to join this table with itself, matching on date and opponent=team so I can have defensive stats as well as offensive. Basically, I want to consider not just the stats of the team playing, but also of the team they're playing against. Additionally, I kept referee because I will need to delete duplicates of the same game under different teams. For example, Arsenal vs Bournemouth is the same as Bournemouth vs Arsenal. The most accurate way to find that duplicate is by date and referee, as a referee cannot referee two games at once.

Before we get to that, first we need to create rolling averages to get a sense of how the teams are playing going into the game. I wound up testing this several times, and found that generally using a half season's averages (19 games) worked quite well, so we'll use that.

I also need date because in order to calculate the form each team is.

In [10]:
match_stats = match_stats.sort_values(['Team','Date']) # Sorting by team and date to set up rolling average
match_stats.head(40)

Unnamed: 0,Date,Venue,Result,GF,GA,Opponent,xG,xGA,Poss,Sh,SoT,Dist,PKatt,Team,Referee
3230,2017-08-11,Home,W,4,3,Leicester City,2.5,1.5,68.0,27.0,10.0,19.7,0.0,Arsenal,Mike Dean
3231,2017-08-19,Away,L,0,1,Stoke City,1.5,0.7,76.0,18.0,6.0,18.1,0.0,Arsenal,Andre Marriner
3232,2017-08-27,Away,L,0,4,Liverpool,0.6,3.1,52.0,8.0,0.0,16.6,0.0,Arsenal,Craig Pawson
3233,2017-09-09,Home,W,3,0,Bournemouth,2.2,0.6,58.0,17.0,9.0,15.6,0.0,Arsenal,Anthony Taylor
3234,2017-09-17,Away,D,0,0,Chelsea,1.4,0.8,49.0,11.0,2.0,17.5,0.0,Arsenal,Michael Oliver
3235,2017-09-25,Home,W,2,0,West Brom,2.2,0.9,69.0,15.0,5.0,19.1,1.0,Arsenal,Robert Madley
3236,2017-10-01,Home,W,2,0,Brighton,2.4,0.4,64.0,25.0,8.0,18.4,0.0,Arsenal,Kevin Friend
3237,2017-10-14,Away,L,1,2,Watford,1.0,1.6,54.0,9.0,6.0,20.6,0.0,Arsenal,Niel Swarbrick
3238,2017-10-22,Away,W,5,2,Everton,3.5,1.0,67.0,30.0,14.0,16.6,0.0,Arsenal,Craig Pawson
3239,2017-10-28,Home,W,2,1,Swansea City,2.0,0.9,72.0,17.0,5.0,16.4,0.0,Arsenal,Lee Mason


In [11]:
# Creating a 'points' column to measure the points
match_stats.loc[match_stats['Result'] == 'W', 'points'] = 2
match_stats.loc[match_stats['Result'] == 'D', 'points'] = 1
match_stats.loc[match_stats['Result'] == 'L', 'points'] = 0

A note here. Normally in soccer, a win is worth 3 points. However, I believe this would skew the data unfairly, as in terms of actual precitions, the gap from win to draw is the same as from draw to loss, not bigger.

In [13]:
def rolling_stats(df, team_name):
    team_df = df.loc[df['Team']==team_name]
    team_rolling = team_df.rolling(19,min_periods=10,closed='left').mean()
    return team_rolling

I made a function to provide the rolling stats for each team. The reason I need to do it team by team is to make sure one team's stats don't bleed into another's, which would happen with a standard rolling operation.

In [14]:
for team in match_stats['Team'].unique():
    team_rolling = rolling_stats(match_stats, team)
    # Create a new df for the first team in the list
    if team == 'Arsenal':
        rolling_df = team_rolling
    # Otherwise append to the existing df
    else:
        rolling_df = pd.concat([rolling_df, team_rolling])
print(rolling_df.head())
print(rolling_df.shape)

      GF  GA  xG  xGA  Poss  Sh  SoT  Dist  PKatt  points
3230 NaN NaN NaN  NaN   NaN NaN  NaN   NaN    NaN     NaN
3231 NaN NaN NaN  NaN   NaN NaN  NaN   NaN    NaN     NaN
3232 NaN NaN NaN  NaN   NaN NaN  NaN   NaN    NaN     NaN
3233 NaN NaN NaN  NaN   NaN NaN  NaN   NaN    NaN     NaN
3234 NaN NaN NaN  NaN   NaN NaN  NaN   NaN    NaN     NaN
(3800, 10)


Great! Now we have rolling averages that look backwards at the previous five matches for each of our numerical columns. Now we need to join these to the original dataframe so we can make predictions based off of them. These columns are all listed as NaN because I set a limit of at leeast 10 games of data before making a rolling average.

In [17]:
rolling_match_stats = pd.merge(match_stats, rolling_df, left_index=True, right_index=True, suffixes=['','_last19'])

In [18]:
rolling_match_stats.head()

Unnamed: 0,Date,Venue,Result,GF,GA,Opponent,xG,xGA,Poss,Sh,SoT,Dist,PKatt,Team,Referee,points,GF_last19,GA_last19,xG_last19,xGA_last19,Poss_last19,Sh_last19,SoT_last19,Dist_last19,PKatt_last19,points_last19
3230,2017-08-11,Home,W,4,3,Leicester City,2.5,1.5,68.0,27.0,10.0,19.7,0.0,Arsenal,Mike Dean,2.0,,,,,,,,,,
3231,2017-08-19,Away,L,0,1,Stoke City,1.5,0.7,76.0,18.0,6.0,18.1,0.0,Arsenal,Andre Marriner,0.0,,,,,,,,,,
3232,2017-08-27,Away,L,0,4,Liverpool,0.6,3.1,52.0,8.0,0.0,16.6,0.0,Arsenal,Craig Pawson,0.0,,,,,,,,,,
3233,2017-09-09,Home,W,3,0,Bournemouth,2.2,0.6,58.0,17.0,9.0,15.6,0.0,Arsenal,Anthony Taylor,2.0,,,,,,,,,,
3234,2017-09-17,Away,D,0,0,Chelsea,1.4,0.8,49.0,11.0,2.0,17.5,0.0,Arsenal,Michael Oliver,1.0,,,,,,,,,,


Now I need to drop all of the non-rolling stats (besides the results, which I will use as targets). These stats tell me how the game actually went, which I obviously would not have if I was trying to predict the game in advance. I will also need to drop the NA columns (these are the first 10 matches played by a team in the dataframe, so we couldn't really make a viable prediction anyway.

In [19]:
rolling_match_stats = rolling_match_stats.drop(columns=['xG','xGA','Poss','Sh','SoT','Dist','PKatt','GF','GA'])

In [20]:
rolling_match_stats = rolling_match_stats.dropna()

In [21]:
rolling_match_stats.head()

Unnamed: 0,Date,Venue,Result,Opponent,Team,Referee,points,GF_last19,GA_last19,xG_last19,xGA_last19,Poss_last19,Sh_last19,SoT_last19,Dist_last19,PKatt_last19,points_last19
3240,2017-11-05,Away,L,Manchester City,Arsenal,Michael Oliver,0.0,1.9,1.3,1.93,1.15,62.9,17.7,6.5,17.86,0.1,1.3
3241,2017-11-18,Home,W,Tottenham,Arsenal,Mike Dean,2.0,1.818182,1.454545,1.781818,1.209091,61.090909,16.636364,6.181818,18.1,0.090909,1.181818
3242,2017-11-26,Away,W,Burnley,Arsenal,Lee Mason,2.0,1.833333,1.333333,1.808333,1.166667,59.583333,16.416667,6.083333,17.958333,0.083333,1.25
3243,2017-11-29,Home,W,Huddersfield,Arsenal,Graham Scott,2.0,1.769231,1.230769,1.807692,1.107692,59.923077,16.461538,5.692308,18.307692,0.153846,1.307692
3244,2017-12-02,Home,L,Manchester Utd,Arsenal,Andre Marriner,0.0,2.0,1.142857,1.964286,1.064286,60.571429,16.785714,5.714286,18.085714,0.142857,1.357143


We now have all the rolling stats and the outcomes in the dataframe. The last thing to do before building models is joining the table on itself to get the opponent data. The notion here is that my prediction model will be more accurate if I consider not just the form of the team in question, but also the opponent.

In [22]:
df = pd.merge(rolling_match_stats, rolling_match_stats, left_on=['Date','Team'], right_on=['Date','Opponent'], suffixes=['','_opp'])

The last thing we need to do is to convert Venue into a categorical column, with 1 representing Home and 0 representing away so that we can use it with the models.

In [23]:
df['Venue'].replace(['Home','Away'], [1,0], inplace=True)

One more thing: We need to delete the duplicates, which we'll do by deleting any rows that have the same data and the same referee. There are duplicates because Arsenal playing Liverpool is the same as Liverpool playing Arsenal since we merged the opponent dataframe.

In [25]:
df.shape

(3444, 33)

In [26]:
df = df.drop_duplicates(subset=['Date','Referee'])

In [27]:
df.shape

(1722, 33)

In [28]:
df.head()

Unnamed: 0,Date,Venue,Result,Opponent,Team,Referee,points,GF_last19,GA_last19,xG_last19,xGA_last19,Poss_last19,Sh_last19,SoT_last19,Dist_last19,PKatt_last19,points_last19,Venue_opp,Result_opp,Opponent_opp,Team_opp,Referee_opp,points_opp,GF_last19_opp,GA_last19_opp,xG_last19_opp,xGA_last19_opp,Poss_last19_opp,Sh_last19_opp,SoT_last19_opp,Dist_last19_opp,PKatt_last19_opp,points_last19_opp
0,2017-11-05,0,L,Manchester City,Arsenal,Michael Oliver,0.0,1.9,1.3,1.93,1.15,62.9,17.7,6.5,17.86,0.1,1.3,Home,W,Arsenal,Manchester City,Michael Oliver,2.0,3.5,0.6,2.47,0.59,70.9,18.3,7.6,16.96,0.2,1.9
1,2017-11-18,1,W,Tottenham,Arsenal,Mike Dean,2.0,1.818182,1.454545,1.781818,1.209091,61.090909,16.636364,6.181818,18.1,0.090909,1.181818,Away,L,Arsenal,Tottenham,Mike Dean,0.0,1.818182,0.636364,1.545455,0.609091,61.090909,17.363636,5.454545,18.772727,0.0,1.454545
2,2017-11-26,0,W,Burnley,Arsenal,Lee Mason,2.0,1.833333,1.333333,1.808333,1.166667,59.583333,16.416667,6.083333,17.958333,0.083333,1.25,Home,L,Arsenal,Burnley,Lee Mason,0.0,1.0,0.75,0.675,1.341667,41.75,10.083333,2.916667,18.358333,0.0,1.333333
3,2017-11-29,1,W,Huddersfield,Arsenal,Graham Scott,2.0,1.769231,1.230769,1.807692,1.107692,59.923077,16.461538,5.692308,18.307692,0.153846,1.307692,Away,L,Arsenal,Huddersfield,Graham Scott,0.0,0.692308,1.461538,0.684615,1.184615,45.923077,8.846154,2.615385,20.553846,0.0,0.846154
4,2017-12-02,1,L,Manchester Utd,Arsenal,Andre Marriner,0.0,2.0,1.142857,1.964286,1.064286,60.571429,16.785714,5.714286,18.085714,0.142857,1.357143,Away,W,Arsenal,Manchester Utd,Andre Marriner,2.0,2.285714,0.571429,1.778571,1.0,54.5,14.357143,5.0,17.835714,0.142857,1.571429


In [30]:
# Randomizing the dataset for predictive purposes
df = df.sample(n=1722, random_state = 10)
df.head()

Unnamed: 0,Date,Venue,Result,Opponent,Team,Referee,points,GF_last19,GA_last19,xG_last19,xGA_last19,Poss_last19,Sh_last19,SoT_last19,Dist_last19,PKatt_last19,points_last19,Venue_opp,Result_opp,Opponent_opp,Team_opp,Referee_opp,points_opp,GF_last19_opp,GA_last19_opp,xG_last19_opp,xGA_last19_opp,Poss_last19_opp,Sh_last19_opp,SoT_last19_opp,Dist_last19_opp,PKatt_last19_opp,points_last19_opp
26,2018-05-09,0,L,Leicester City,Arsenal,Graham Scott,0.0,2.210526,1.473684,1.636842,1.342105,61.157895,13.947368,5.736842,17.926316,0.157895,1.105263,Home,W,Arsenal,Leicester City,Graham Scott,2.0,1.157895,1.631579,1.321053,1.252632,52.684211,11.052632,3.263158,16.763158,0.105263,0.736842
1876,2018-04-29,0,W,West Ham,Manchester City,Niel Swarbrick,2.0,2.736842,0.789474,2.010526,0.663158,70.315789,17.052632,6.736842,17.636842,0.210526,1.684211,Home,L,Manchester City,West Ham,Niel Swarbrick,0.0,1.526316,1.631579,1.0,1.473684,43.842105,9.473684,3.526316,17.484211,0.157895,1.0
925,2021-09-25,1,L,Manchester City,Chelsea,Michael Oliver,0.0,1.578947,0.684211,1.784211,0.763158,58.789474,15.105263,5.631579,17.389474,0.157895,1.421053,Away,W,Chelsea,Manchester City,Michael Oliver,2.0,2.368421,0.947368,2.0,0.8,63.473684,17.157895,5.578947,16.821053,0.105263,1.421053
195,2020-06-27,1,L,Wolves,Aston Villa,Craig Pawson,0.0,1.0,2.052632,1.152632,1.984211,45.947368,11.473684,3.263158,16.315789,0.105263,0.631579,Away,W,Aston Villa,Wolves,Craig Pawson,2.0,1.473684,1.0,1.531579,0.873684,49.421053,13.526316,4.263158,17.2,0.052632,1.263158
1155,2018-04-28,0,W,Huddersfield,Everton,Lee Probert,2.0,1.052632,1.315789,1.094737,1.436842,45.947368,9.0,2.631579,16.405263,0.157895,1.0,Home,L,Everton,Huddersfield,Lee Probert,0.0,0.947368,1.473684,0.973684,1.221053,48.0,10.368421,2.947368,19.547368,0.105263,0.789474


In [31]:
corr = df.corr()['points'].abs().sort_values(ascending=False)
corr

points               1.000000
points_opp           1.000000
xG_last19_opp        0.315401
Sh_last19_opp        0.307436
GF_last19_opp        0.299568
SoT_last19_opp       0.299231
points_last19_opp    0.290775
Poss_last19_opp      0.278638
xGA_last19_opp       0.264509
points_last19        0.254063
Poss_last19          0.248378
GA_last19_opp        0.246765
GF_last19            0.243617
xGA_last19           0.236678
xG_last19            0.233553
SoT_last19           0.226958
Sh_last19            0.214825
GA_last19            0.204495
Venue                0.118839
Dist_last19_opp      0.117798
PKatt_last19_opp     0.115364
PKatt_last19         0.094610
Dist_last19          0.031504
Name: points, dtype: float64

I'm using correlation to pick out the features that are most correlated with points.

In [32]:
features = corr[corr.between(0.15,0.35)].index
features

Index(['xG_last19_opp', 'Sh_last19_opp', 'GF_last19_opp', 'SoT_last19_opp',
       'points_last19_opp', 'Poss_last19_opp', 'xGA_last19_opp',
       'points_last19', 'Poss_last19', 'GA_last19_opp', 'GF_last19',
       'xGA_last19', 'xG_last19', 'SoT_last19', 'Sh_last19', 'GA_last19'],
      dtype='object')

In [33]:
X = df[features]
X.head()

Unnamed: 0,xG_last19_opp,Sh_last19_opp,GF_last19_opp,SoT_last19_opp,points_last19_opp,Poss_last19_opp,xGA_last19_opp,points_last19,Poss_last19,GA_last19_opp,GF_last19,xGA_last19,xG_last19,SoT_last19,Sh_last19,GA_last19
26,1.321053,11.052632,1.157895,3.263158,0.736842,52.684211,1.252632,1.105263,61.157895,1.631579,2.210526,1.342105,1.636842,5.736842,13.947368,1.473684
1876,1.0,9.473684,1.526316,3.526316,1.0,43.842105,1.473684,1.684211,70.315789,1.631579,2.736842,0.663158,2.010526,6.736842,17.052632,0.789474
925,2.0,17.157895,2.368421,5.578947,1.421053,63.473684,0.8,1.421053,58.789474,0.947368,1.578947,0.763158,1.784211,5.631579,15.105263,0.684211
195,1.531579,13.526316,1.473684,4.263158,1.263158,49.421053,0.873684,0.631579,45.947368,1.0,1.0,1.984211,1.152632,3.263158,11.473684,2.052632
1155,0.973684,10.368421,0.947368,2.947368,0.789474,48.0,1.221053,1.0,45.947368,1.473684,1.052632,1.436842,1.094737,2.631579,9.0,1.315789


In [34]:
y = df['points']

In [35]:
lr = LinearRegression()

In [36]:
rfr = RandomForestRegressor()

In [37]:
# Predicting the number of points
lr_scores = cross_val_score(lr, X, y)
#print each lr score (accuracy) and average them
print(lr_scores)
print('lr_scores mean:{}'.format(np.mean(lr_scores)))

[0.09905808 0.20808801 0.12358092 0.17883499 0.2092135 ]
lr_scores mean:0.16375509971591276


In [38]:
lr_pts_predictions = cross_val_predict(lr, X, y)

In [39]:
# Predicting the number of goals scored by the team
rfr_scores = cross_val_score(rfr, X, y)
#print each rfr score (accuracy) and average them
print(rfr_scores)
print('rfr_scores mean:{}'.format(np.mean(rfr_scores)))

[0.04148099 0.18558763 0.09723939 0.15074922 0.16788098]
rfr_scores mean:0.12858764157826783


In [40]:
df['predicted_pts'] = lr_pts_predictions

In [41]:
df.head()

Unnamed: 0,Date,Venue,Result,Opponent,Team,Referee,points,GF_last19,GA_last19,xG_last19,xGA_last19,Poss_last19,Sh_last19,SoT_last19,Dist_last19,PKatt_last19,points_last19,Venue_opp,Result_opp,Opponent_opp,Team_opp,Referee_opp,points_opp,GF_last19_opp,GA_last19_opp,xG_last19_opp,xGA_last19_opp,Poss_last19_opp,Sh_last19_opp,SoT_last19_opp,Dist_last19_opp,PKatt_last19_opp,points_last19_opp,predicted_pts
26,2018-05-09,0,L,Leicester City,Arsenal,Graham Scott,0.0,2.210526,1.473684,1.636842,1.342105,61.157895,13.947368,5.736842,17.926316,0.157895,1.105263,Home,W,Arsenal,Leicester City,Graham Scott,2.0,1.157895,1.631579,1.321053,1.252632,52.684211,11.052632,3.263158,16.763158,0.105263,0.736842,1.240739
1876,2018-04-29,0,W,West Ham,Manchester City,Niel Swarbrick,2.0,2.736842,0.789474,2.010526,0.663158,70.315789,17.052632,6.736842,17.636842,0.210526,1.684211,Home,L,Manchester City,West Ham,Niel Swarbrick,0.0,1.526316,1.631579,1.0,1.473684,43.842105,9.473684,3.526316,17.484211,0.157895,1.0,1.900673
925,2021-09-25,1,L,Manchester City,Chelsea,Michael Oliver,0.0,1.578947,0.684211,1.784211,0.763158,58.789474,15.105263,5.631579,17.389474,0.157895,1.421053,Away,W,Chelsea,Manchester City,Michael Oliver,2.0,2.368421,0.947368,2.0,0.8,63.473684,17.157895,5.578947,16.821053,0.105263,1.421053,0.74959
195,2020-06-27,1,L,Wolves,Aston Villa,Craig Pawson,0.0,1.0,2.052632,1.152632,1.984211,45.947368,11.473684,3.263158,16.315789,0.105263,0.631579,Away,W,Aston Villa,Wolves,Craig Pawson,2.0,1.473684,1.0,1.531579,0.873684,49.421053,13.526316,4.263158,17.2,0.052632,1.263158,0.448542
1155,2018-04-28,0,W,Huddersfield,Everton,Lee Probert,2.0,1.052632,1.315789,1.094737,1.436842,45.947368,9.0,2.631579,16.405263,0.157895,1.0,Home,L,Everton,Huddersfield,Lee Probert,0.0,0.947368,1.473684,0.973684,1.221053,48.0,10.368421,2.947368,19.547368,0.105263,0.789474,1.096563


In [42]:
r2_score(df['points'],df['predicted_pts'])

0.16526562340490403

In [43]:
accuracy_score(df['points'], df['predicted_pts'].round())

0.34843205574912894

In [44]:
pd.crosstab(df['points'],df['predicted_pts'].round(), margins=True, normalize=True)

predicted_pts,0.0,1.0,2.0,All
points,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
0.0,0.078978,0.295006,0.011034,0.385017
1.0,0.012776,0.198026,0.015679,0.226481
2.0,0.009292,0.307782,0.071429,0.388502
All,0.101045,0.800813,0.098142,1.0


Hmmm. The accuracy score is low because when I round, a remarkable 84% of the time the predicted points rounds to a draw. Let's narrow the prediction for drawing from anything from -0.5 to 0.5 to a tighter amount until the proporiton of draws is correct.

In [45]:
df.loc[df['predicted_pts'] >= 1.105, 'pred_pts_2'] = 2
df.loc[df['predicted_pts'].between(0.895, 1.105, inclusive='left'), 'pred_pts_2'] = 1
df.loc[df['predicted_pts'] < 0.895, 'pred_pts_2'] = 0

In [46]:
pd.crosstab(df['points'],df['pred_pts_2'], margins=True, normalize=True)

pred_pts_2,0.0,1.0,2.0,All
points,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
0.0,0.21777,0.081301,0.085947,0.385017
1.0,0.082462,0.053426,0.090592,0.226481
2.0,0.076655,0.083624,0.228223,0.388502
All,0.376887,0.218351,0.404762,1.0


In [47]:
accuracy_score(df['points'], df['pred_pts_2'])

0.4994192799070848

In [48]:
precision_score(df['points'], df['pred_pts_2'], average='weighted')

0.496938001607275

That's it! Overall, we did a good job building a predictor that could predict wins and losses, but perhaps struggled a little bit more with predicting draws. This makes sense for a couple of reasons. First, draws are the least common outcome, only occuring about 22.6% of the time over the last 5 seaasons. Secondly, a draw means the teams are relatively evenly matched, which in turn means the outcome is a bit more of a tossup.

I'm creating one more prediction just to see if a team will pick up any points (win or draw).

In [49]:
df.loc[df['predicted_pts'] >= 0.89, 'pred_pts_3'] = 1
df.loc[df['predicted_pts'] < 0.89, 'pred_pts_3'] = 0

In [50]:
# Creating a 'points' column to measure the points
df.loc[df['Result'] == 'W', 'points_2'] = 1
df.loc[df['Result'] == 'D', 'points_2'] = 1
df.loc[df['Result'] == 'L', 'points_2'] = 0

In [51]:
accuracy_score(df['points_2'], df['pred_pts_3'])

0.6753774680603949

In [52]:
precision_score(df['points_2'], df['pred_pts_3'])

0.7310536044362292

In [53]:
pd.crosstab(df['points_2'],df['pred_pts_3'], margins=True, normalize=True)

pred_pts_3,0.0,1.0,All
points_2,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
0.0,0.216028,0.16899,0.385017
1.0,0.155633,0.45935,0.614983
All,0.371661,0.628339,1.0


Looks like we can say with about 67.7% accuracy whether a team will pick up a win or a draw. That's a pretty good score!

Let's try using classification algorithms. Basically, instead of trying to predict the score and then calculate points from there, let's skip the first step and just try to classify the outcome based on previous outcomes. First svc, then random forest.

In [54]:
clf_features = corr[corr.between(0.01,0.35)].index
clf_X = df[clf_features]

In [55]:
estimator = SVC(kernel="linear")
selector = RFECV(estimator, step=1, cv=5)
selector = selector.fit(clf_X, y)

In [56]:
top_features = clf_X.columns[selector.support_]
print(top_features)

Index(['xG_last19_opp', 'Sh_last19_opp', 'GF_last19_opp', 'xGA_last19_opp',
       'points_last19', 'Poss_last19', 'GA_last19_opp', 'GF_last19',
       'xGA_last19', 'xG_last19', 'SoT_last19', 'Venue', 'Dist_last19_opp',
       'PKatt_last19_opp', 'PKatt_last19', 'Dist_last19'],
      dtype='object')


In [57]:
X2 = clf_X[top_features]

In [58]:
svc = make_pipeline(StandardScaler(), SVC())

In [59]:
scores = cross_val_score(svc, X2, y)
print(scores)
print(scores.mean())

[0.51304348 0.55072464 0.52906977 0.5377907  0.53488372]
0.5331024603977081


In [60]:
predictions_svc = cross_val_predict(svc, X2, y)

In [61]:
pd.crosstab(predictions_svc, y, normalize=True, margins=True)

points,0.0,1.0,2.0,All
row_0,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
0.0,0.264808,0.103949,0.120209,0.488966
1.0,0.001161,0.002323,0.002323,0.005807
2.0,0.119048,0.120209,0.26597,0.505226
All,0.385017,0.226481,0.388502,1.0


Fascinatingly, using a classifier actually got around the difficulty of predicting draws... by not predicting any draws (technically 4 out of all the matches). That being said, it did get a higher accuracy score still!

In [62]:
precision_score(y, predictions_svc, average = 'micro')

0.5331010452961672

In [63]:
rfc = RandomForestClassifier()

In [64]:
scores = cross_val_score(rfc, X2, y)
print(scores)
print(scores.mean())

[0.50724638 0.56811594 0.50872093 0.5494186  0.50872093]
0.5284445567913718


Looks like our svc model produced the most accurate results, with a precision score of 0.533. Let's see how it does with predicting between loss and win or draw.

In [67]:
y2 = df['points_2']

In [68]:
estimator2 = SVC(kernel="linear")
selector2 = RFECV(estimator2, step=1, cv=5)
selector2 = selector.fit(clf_X, y2)

In [69]:
top_features_2 = clf_X.columns[selector.support_]
X3 = clf_X[top_features]

In [70]:
scores = cross_val_score(svc, X3, y2)
print(scores)
print(scores.mean())

[0.68985507 0.71014493 0.68023256 0.69767442 0.68895349]
0.6933720930232558


In [71]:
scores = cross_val_score(rfc, X3, y2)
print(scores)
print(scores.mean())

[0.66956522 0.72173913 0.66860465 0.68604651 0.69476744]
0.68814459049545


Once again, the svc model did the best.

# Conclusion

Using the last 5 years of Premier League Data, I was able to predict with about 69% accuracy whether a premier league team would win or not. I found that using classification to predict results was more accurate than trying to predict the goals scored using regression, and specifically that Support Vector Classification (SVC) worked best.