# Predicting EPL Winners

This notebook builds a model that predicts the winners of English Premier League matches. The data comes from the 'matches' dataset, which covers the entire 2020-2021 season and part of the 2021-2022 season.

The data was scraped during matchweek 30 of the 2021-2022 season.

In [1]:
import pandas as pd
import requests
from bs4 import BeautifulSoup
# scikit-learn classes and metrics are imported explicitly in the cells that use them

## Exploring the data

The data contains 1389 entries. Since there are 20 teams in the league and each team plays 38 matches per season, two complete seasons would give (2 * 38 * 20), or 1520, entries. Three teams were relegated at the end of the 2020-2021 season and three were promoted, so the dataset contains 23 distinct teams: with complete data there would be 6 teams with 38 matches and 17 teams with 76 matches. The 'missing' entries come mainly from the 2021-2022 season, which had not finished when the data was scraped.
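
The expected totals can be checked with a few lines of arithmetic. This is only a sketch, assuming 20 teams that each play 38 matches per season and two complete seasons; the variable names are illustrative and not part of the notebook.

```python
# Assumption: two complete seasons, 20 teams, 38 matches per team per season
teams_per_season = 20
matches_per_team = 38
seasons = 2

expected_rows = seasons * teams_per_season * matches_per_team

# Three relegated clubs and three promoted clubs appear in one season only
one_season_teams = 3 + 3
two_season_teams = teams_per_season - 3

rows_by_team = (one_season_teams * matches_per_team
                + two_season_teams * seasons * matches_per_team)

print(expected_rows, rows_by_team)           # 1520 1520
print(one_season_teams + two_season_teams)   # 23 distinct teams
```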

In [2]:
match_data = pd.read_csv('matches.csv', index_col=0)
match_data.head()


Unnamed: 0,date,time,comp,round,day,venue,result,gf,ga,opponent,...,match report,notes,sh,sot,dist,fk,pk,pkatt,season,team
1,2021-08-15,16:30,Premier League,Matchweek 1,Sun,Away,L,0.0,1.0,Tottenham,...,Match Report,,18.0,4.0,16.9,1.0,0.0,0.0,2022,Manchester City
2,2021-08-21,15:00,Premier League,Matchweek 2,Sat,Home,W,5.0,0.0,Norwich City,...,Match Report,,16.0,4.0,17.3,1.0,0.0,0.0,2022,Manchester City
3,2021-08-28,12:30,Premier League,Matchweek 3,Sat,Home,W,5.0,0.0,Arsenal,...,Match Report,,25.0,10.0,14.3,0.0,0.0,0.0,2022,Manchester City
4,2021-09-11,15:00,Premier League,Matchweek 4,Sat,Away,W,1.0,0.0,Leicester City,...,Match Report,,25.0,8.0,14.0,0.0,0.0,0.0,2022,Manchester City
6,2021-09-18,15:00,Premier League,Matchweek 5,Sat,Home,D,0.0,0.0,Southampton,...,Match Report,,16.0,1.0,15.7,1.0,0.0,0.0,2022,Manchester City


In [3]:
match_data.shape

(1389, 27)

On further inspection, the 2021-2022 data for Liverpool also seems to be missing, even though they were not relegated from the league. 'Liverpool' only has entries from the 2020-2021 season.

In [4]:
match_data['team'].value_counts()

Southampton                 72
Brighton and Hove Albion    72
Manchester United           72
West Ham United             72
Newcastle United            72
Burnley                     71
Leeds United                71
Crystal Palace              71
Manchester City             71
Wolverhampton Wanderers     71
Tottenham Hotspur           71
Arsenal                     71
Leicester City              70
Chelsea                     70
Aston Villa                 70
Everton                     70
Liverpool                   38
Fulham                      38
West Bromwich Albion        38
Sheffield United            38
Brentford                   34
Watford                     33
Norwich City                33
Name: team, dtype: int64

In [5]:
match_data[match_data['team'] == 'Liverpool']

Unnamed: 0,date,time,comp,round,day,venue,result,gf,ga,opponent,...,match report,notes,sh,sot,dist,fk,pk,pkatt,season,team
1,2020-09-12,17:30,Premier League,Matchweek 1,Sat,Home,W,4.0,3.0,Leeds United,...,Match Report,,20.0,4.0,17.0,0.0,2.0,2.0,2021,Liverpool
2,2020-09-20,16:30,Premier League,Matchweek 2,Sun,Away,W,2.0,0.0,Chelsea,...,Match Report,,17.0,5.0,17.7,1.0,0.0,0.0,2021,Liverpool
4,2020-09-28,20:00,Premier League,Matchweek 3,Mon,Home,W,3.0,1.0,Arsenal,...,Match Report,,21.0,9.0,16.8,0.0,0.0,0.0,2021,Liverpool
6,2020-10-04,19:15,Premier League,Matchweek 4,Sun,Away,L,2.0,7.0,Aston Villa,...,Match Report,,14.0,8.0,15.8,1.0,0.0,0.0,2021,Liverpool
7,2020-10-17,12:30,Premier League,Matchweek 5,Sat,Away,D,2.0,2.0,Everton,...,Match Report,,22.0,8.0,15.0,1.0,0.0,0.0,2021,Liverpool
9,2020-10-24,20:00,Premier League,Matchweek 6,Sat,Home,W,2.0,1.0,Sheffield Utd,...,Match Report,,17.0,5.0,18.2,1.0,0.0,0.0,2021,Liverpool
11,2020-10-31,17:30,Premier League,Matchweek 7,Sat,Home,W,2.0,1.0,West Ham,...,Match Report,,8.0,2.0,18.6,1.0,1.0,1.0,2021,Liverpool
13,2020-11-08,16:30,Premier League,Matchweek 8,Sun,Away,D,1.0,1.0,Manchester City,...,Match Report,,9.0,2.0,21.5,0.0,1.0,1.0,2021,Liverpool
14,2020-11-22,19:15,Premier League,Matchweek 9,Sun,Home,W,3.0,0.0,Leicester City,...,Match Report,,24.0,12.0,11.9,0.0,0.0,0.0,2021,Liverpool
16,2020-11-28,12:30,Premier League,Matchweek 10,Sat,Away,D,1.0,1.0,Brighton,...,Match Report,,6.0,2.0,20.9,0.0,0.0,0.0,2021,Liverpool


### Matchweek Data
The data was scraped partway through the 2021-22 season, so some matchweeks have fewer than 39 entries.

In [6]:
match_data["round"].value_counts()

Matchweek 1     39
Matchweek 16    39
Matchweek 34    39
Matchweek 32    39
Matchweek 31    39
Matchweek 29    39
Matchweek 28    39
Matchweek 26    39
Matchweek 25    39
Matchweek 24    39
Matchweek 23    39
Matchweek 2     39
Matchweek 19    39
Matchweek 17    39
Matchweek 20    39
Matchweek 15    39
Matchweek 5     39
Matchweek 3     39
Matchweek 13    39
Matchweek 12    39
Matchweek 4     39
Matchweek 11    39
Matchweek 10    39
Matchweek 9     39
Matchweek 8     39
Matchweek 14    39
Matchweek 7     39
Matchweek 6     39
Matchweek 30    37
Matchweek 27    37
Matchweek 22    37
Matchweek 21    37
Matchweek 18    37
Matchweek 33    32
Matchweek 35    20
Matchweek 36    20
Matchweek 37    20
Matchweek 38    20
Name: round, dtype: int64

## Data cleaning
Create numeric columns so that they can be used as inputs for the predictor.

In [7]:
# Convert dates column values to "datetime"

match_data["date"] = pd.to_datetime(match_data["date"])
match_data.dtypes

date            datetime64[ns]
time                    object
comp                    object
round                   object
day                     object
venue                   object
result                  object
gf                     float64
ga                     float64
opponent                object
xg                     float64
xga                    float64
poss                   float64
attendance             float64
captain                 object
formation               object
referee                 object
match report            object
notes                  float64
sh                     float64
sot                    float64
dist                   float64
fk                     float64
pk                     float64
pkatt                  float64
season                   int64
team                    object
dtype: object

## Creating the predictors

#### Venue Code
Convert the "Home" or "Away" venue for each team to a value that can be used for the predictor. This will be of type "int".

In [8]:
# Away matches will be 0 and home matches will be 1

match_data["venue_code"] = match_data["venue"].astype("category").cat.codes
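
As a quick illustration of how the category codes are assigned, here is a toy sketch on a made-up series (not the notebook's data). For string values, pandas sorts the categories, so "Away" comes before "Home".

```python
import pandas as pd

# Toy data for illustration only
venues = pd.Series(["Away", "Home", "Home", "Away"])
as_category = venues.astype("category")

categories = list(as_category.cat.categories)
codes = as_category.cat.codes.tolist()

print(categories)  # ['Away', 'Home']
print(codes)       # [0, 1, 1, 0]
```

The same mechanism is used for the opponent codes, so each code depends on the sorted set of names present in the column.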

#### Opponent code

Convert each opponent to an int value

In [9]:
# Each team will get its own number

match_data["opp_code"] = match_data["opponent"].astype("category").cat.codes

#### Time of match
Create a numeric feature from the kick-off hour. Some teams may play better in day or night matches.

In [10]:
# Keep only the hour in the time column. Convert the hour to an integer

match_data["hour"] = match_data["time"].str.replace(":.+", "", regex=True).astype("int")
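
A toy sketch of the same transformation, using made-up kick-off times, shows that the regular expression removes the colon and everything after it:

```python
import pandas as pd

# Toy kick-off times for illustration only
times = pd.Series(["16:30", "12:30", "20:00"])

# Drop ":" and the minutes, then convert the remaining hour to an integer
hours = times.str.replace(":.+", "", regex=True).astype("int")

print(hours.tolist())  # [16, 12, 20]
```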

#### Day of The Week

Give each day of the week a unique number (Monday is 0 and Sunday is 6)

In [11]:
match_data["day_code"] = match_data["date"].dt.dayofweek
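
The numbering can be checked on three dates that appear in the dataset (a Sunday, a Saturday and a Friday); the series below is built by hand for illustration.

```python
import pandas as pd

# Dates taken from the match listings: Sun, Sat, Fri
dates = pd.to_datetime(pd.Series(["2021-08-15", "2021-08-21", "2022-03-18"]))

day_codes = dates.dt.dayofweek.tolist()
print(day_codes)  # [6, 5, 4]
```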

In [12]:
match_data.head() # check the data

Unnamed: 0,date,time,comp,round,day,venue,result,gf,ga,opponent,...,dist,fk,pk,pkatt,season,team,venue_code,opp_code,hour,day_code
1,2021-08-15,16:30,Premier League,Matchweek 1,Sun,Away,L,0.0,1.0,Tottenham,...,16.9,1.0,0.0,0.0,2022,Manchester City,0,18,16,6
2,2021-08-21,15:00,Premier League,Matchweek 2,Sat,Home,W,5.0,0.0,Norwich City,...,17.3,1.0,0.0,0.0,2022,Manchester City,1,15,15,5
3,2021-08-28,12:30,Premier League,Matchweek 3,Sat,Home,W,5.0,0.0,Arsenal,...,14.3,0.0,0.0,0.0,2022,Manchester City,1,0,12,5
4,2021-09-11,15:00,Premier League,Matchweek 4,Sat,Away,W,1.0,0.0,Leicester City,...,14.0,0.0,0.0,0.0,2022,Manchester City,0,10,15,5
6,2021-09-18,15:00,Premier League,Matchweek 5,Sat,Home,D,0.0,0.0,Southampton,...,15.7,1.0,0.0,0.0,2022,Manchester City,1,17,15,5


In [13]:
# Return a 1 if a team won or a 0 if the team lost or played a draw

match_data["target"] = (match_data['result'] == 'W').astype("int")

## Creating the Initial Model

The following Machine Learning section was completed with the help of online tutorials. It is included here to finish off the project.

In [14]:
from sklearn.ensemble import RandomForestClassifier

In [15]:
rf = RandomForestClassifier(n_estimators=50, min_samples_split=10, random_state=1)

## Training and Testing Our Model
We trained our algorithm on all matches before 2022 and then tested it on all matches played in 2022.

In [16]:
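# Train on matches before 1 January 2022 and test on matches after that date.
# Note: the strict ">" leaves matches played on 2022-01-01 out of both sets.
train = match_data[match_data["date"] < '2022-01-01']
test = match_data[match_data["date"] > '2022-01-01']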
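
The boundary behaviour of this date split can be seen on a toy frame (made-up dates, not the notebook's data). With strict comparisons on both sides, a match played on 2022-01-01 lands in neither set; using `>=` for the test set is one possible way to keep it.

```python
import pandas as pd

# Toy frame for illustration only
toy = pd.DataFrame(
    {"date": pd.to_datetime(["2021-12-28", "2022-01-01", "2022-01-02"])}
)

train_toy = toy[toy["date"] < "2022-01-01"]
test_strict = toy[toy["date"] > "2022-01-01"]       # excludes 2022-01-01
test_inclusive = toy[toy["date"] >= "2022-01-01"]   # includes 2022-01-01

print(len(train_toy), len(test_strict), len(test_inclusive))  # 1 1 2
```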

In [17]:
predictors = ["venue_code", "opp_code", "hour", "day_code"]

In [18]:
rf.fit(train[predictors], train["target"])

In [19]:
preds = rf.predict(test[predictors])

#### Determine Our Measure of Accuracy

In [20]:
from sklearn.metrics import accuracy_score

In [21]:
accuracy = accuracy_score(test["target"], preds)
accuracy

0.6123188405797102

### When Was Accuracy High?

The model was more accurate at predicting non-wins (draws and losses) than wins. We need to revisit this in order to reach our goal of predicting winners accurately.

In [22]:
combined = pd.DataFrame(dict(actual=test["target"], prediction=preds))

# Cross-tabulate actual results against predictions (pass the column, not its name)
pd.crosstab(index=combined["actual"], columns=combined["prediction"])

prediction,0,1
actual,,
0,141,31
1,76,28


In [23]:
from sklearn.metrics import precision_score

## Our predicted wins were correct only 47% of the time

In [24]:
precision_score(test["target"], preds)

0.4745762711864407
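
The accuracy and precision reported above, together with the class counts in the test set (172 non-wins and 104 wins), are enough to recover the full confusion matrix. The sketch below does this with plain arithmetic; `recover_confusion` is a hypothetical helper written for this illustration, not part of the notebook or of scikit-learn.

```python
def recover_confusion(n_neg, n_pos, accuracy, precision, tol=1e-9):
    """Return (tn, fp, fn, tp) consistent with the reported scores."""
    correct = round(accuracy * (n_neg + n_pos))  # tp + tn
    for tp in range(n_pos + 1):
        tn = correct - tp
        fp = n_neg - tn
        fn = n_pos - tp
        if tn < 0 or fp < 0 or tp + fp == 0:
            continue
        if abs(tp / (tp + fp) - precision) < tol:
            return tn, fp, fn, tp
    raise ValueError("no confusion matrix matches the reported scores")


tn, fp, fn, tp = recover_confusion(
    172, 104, 0.6123188405797102, 0.4745762711864407
)
print(tn, fp, fn, tp)             # 141 31 76 28
print(round(tp / (tp + fn), 3))   # share of actual wins found: 0.269
print(round(tn / (tn + fp), 3))   # share of non-wins found: 0.82
```

On these numbers the model picks out about 82% of the non-wins but only about 27% of the actual wins, which is consistent with it being better at predicting draws and losses.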

## Improving Accuracy With Rolling Averages

In [25]:
# Group the data for each team

grouped_matches = match_data.groupby("team")

In [26]:
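def rolling_averages(group, cols, new_cols):
    """Add rolling averages of `cols` over a team's three previous matches as `new_cols`."""
    # Sort chronologically so each window covers the most recent matches
    group = group.sort_values("date")
    # closed='left' excludes the current match, so only earlier results are used
    rolling_stats = group[cols].rolling(3, closed='left').mean()
    group[new_cols] = rolling_stats
    # Rows without three earlier matches have no average, so drop them
    group = group.dropna(subset=new_cols)
    return group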
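
To see what `rolling(3, closed='left')` produces, here is a toy sketch on a made-up series of goals (not the notebook's data). Each value is the mean of the three preceding rows, so the first three rows have no complete window and come out as NaN.

```python
import pandas as pd

# Toy goals-for values for five consecutive matches
goals = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])

# Mean of the three previous matches, excluding the current one
rolling = goals.rolling(3, closed="left").mean()

print(rolling.tolist())  # [nan, nan, nan, 2.0, 3.0]
```

This matches the output below, where the first retained row for a team is Matchweek 4.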

In [27]:
cols = ["gf", "ga", "sh", "sot", "dist", "fk", "pk", "pkatt"]
new_cols = [f"{c}_rolling" for c in cols]

In [28]:
matches_rolling = match_data.groupby("team").apply(lambda x: rolling_averages(x, cols, new_cols))

Show the rolling averages for each matchweek

In [29]:
matches_rolling = matches_rolling.droplevel("team")
matches_rolling

Unnamed: 0,date,time,comp,round,day,venue,result,gf,ga,opponent,...,day_code,target,gf_rolling,ga_rolling,sh_rolling,sot_rolling,dist_rolling,fk_rolling,pk_rolling,pkatt_rolling
6,2020-10-04,14:00,Premier League,Matchweek 4,Sun,Home,W,2.0,1.0,Sheffield Utd,...,6,1,2.000000,1.333333,7.666667,3.666667,14.733333,0.666667,0.000000,0.000000
7,2020-10-17,17:30,Premier League,Matchweek 5,Sat,Away,L,0.0,1.0,Manchester City,...,5,0,1.666667,1.666667,5.333333,3.666667,15.766667,0.000000,0.000000,0.000000
9,2020-10-25,19:15,Premier League,Matchweek 6,Sun,Home,L,0.0,1.0,Leicester City,...,6,0,1.000000,1.666667,7.000000,3.666667,16.733333,0.666667,0.000000,0.000000
11,2020-11-01,16:30,Premier League,Matchweek 7,Sun,Away,W,1.0,0.0,Manchester Utd,...,6,1,0.666667,1.000000,9.666667,4.000000,16.033333,1.000000,0.000000,0.000000
13,2020-11-08,19:15,Premier League,Matchweek 8,Sun,Home,L,0.0,3.0,Aston Villa,...,6,0,0.333333,0.666667,9.666667,2.666667,18.033333,1.000000,0.333333,0.333333
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
32,2022-03-13,14:00,Premier League,Matchweek 29,Sun,Away,W,1.0,0.0,Everton,...,6,1,1.333333,1.000000,12.333333,3.666667,19.300000,0.000000,0.000000,0.000000
33,2022-03-18,20:00,Premier League,Matchweek 30,Fri,Home,L,2.0,3.0,Leeds United,...,4,0,1.666667,0.666667,12.333333,4.333333,19.600000,0.000000,0.000000,0.000000
34,2022-04-02,15:00,Premier League,Matchweek 31,Sat,Home,W,2.0,1.0,Aston Villa,...,5,1,2.333333,1.000000,13.000000,5.333333,19.833333,0.000000,0.000000,0.000000
35,2022-04-08,20:00,Premier League,Matchweek 32,Fri,Away,L,0.0,1.0,Newcastle Utd,...,4,0,1.666667,1.333333,13.000000,5.000000,18.533333,0.000000,0.000000,0.000000


In [30]:
def make_predictions(data, predictors):
    train = data[data["date"] < '2022-01-01']
    test = data[data["date"] > '2022-01-01']
    rf.fit(train[predictors], train["target"])
    preds = rf.predict(test[predictors])
    combined = pd.DataFrame(dict(actual=test["target"], prediction=preds), index=test.index)
    precision = precision_score(test["target"], preds)
    return combined, precision

In [31]:
combined, precision = make_predictions(matches_rolling, predictors + new_cols)

## Precision Score

With the rolling averages added as predictors, the precision has risen from 47% to 62.5%

In [32]:
precision

0.625

In [33]:
# Attach the date, team, opponent and result to each prediction

combined = combined.merge(matches_rolling[["date", "team", "opponent", "result"]], left_index=True, right_index=True)

In [34]:
combined

Unnamed: 0,actual,prediction,date,team,opponent,result
19,0,0,2021-12-15,Arsenal,West Ham,W
19,0,0,2021-01-20,Aston Villa,Manchester City,L
19,0,0,2021-12-26,Aston Villa,Chelsea,L
19,0,0,2021-01-02,Brighton and Hove Albion,Wolves,D
19,0,0,2021-12-26,Brighton and Hove Albion,Brentford,W
...,...,...,...,...,...,...
52,0,1,2021-05-23,Liverpool,Crystal Palace,W
52,0,1,2021-04-25,Manchester United,Leeds United,D
52,0,1,2021-04-21,Tottenham Hotspur,Southampton,W
53,1,1,2021-05-08,Chelsea,Manchester City,W


## Combine Home and Away Matches

In [35]:
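# A dict that returns the key itself when it has no entry,
# so team names without a mapping pass through unchanged
class MissingDict(dict):
    __missing__ = lambda self, key: key

# Map the full names in the "team" column to the short names used in "opponent"
map_vals = {
    "Brighton and Hove Albion": "Brighton",
    "Manchester United": "Manchester Utd",
    "Newcastle United": "Newcastle Utd",
    "Tottenham Hotspur": "Tottenham",
    "West Ham United": "West Ham",
    "Wolverhampton Wanderers": "Wolves"
}

mapping = MissingDict(**map_vals)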

In [36]:
mapping["West Ham United"]

'West Ham'

In [37]:
combined["new_team"] = combined["team"].map(mapping)

In [38]:
# Pair each team's row with its opponent's row for the same match
merged = combined.merge(combined, left_on=["date", "new_team"], right_on=["date", "opponent"])
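
The self-merge is easier to follow on a single match. The sketch below uses a made-up two-row frame (the prediction values are illustrative) and a one-entry mapping; it assumes each match appears exactly twice, once from each team's side.

```python
import pandas as pd


class MissingDict(dict):
    # Fall back to the key itself when no mapping exists
    __missing__ = lambda self, key: key


toy_mapping = MissingDict({"West Ham United": "West Ham"})

# One match seen from both sides (toy values)
toy = pd.DataFrame({
    "date": pd.to_datetime(["2022-04-24", "2022-04-24"]),
    "team": ["Chelsea", "West Ham United"],
    "opponent": ["West Ham", "Chelsea"],
    "prediction": [1, 0],
})
toy["new_team"] = toy["team"].map(toy_mapping)

# A row matches the row whose "opponent" equals its own "new_team" on the same date
toy_merged = toy.merge(
    toy, left_on=["date", "new_team"], right_on=["date", "opponent"]
)

print(toy_merged[["team_x", "team_y", "prediction_x", "prediction_y"]])
```

Each output row now holds both sides' predictions for the same match, which is what the filter on `prediction_x` and `prediction_y` relies on.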

In [39]:
merged

Unnamed: 0,actual_x,prediction_x,date,team_x,opponent_x,result_x,new_team_x,actual_y,prediction_y,team_y,opponent_y,result_y,new_team_y
0,0,0,2021-12-15,Arsenal,West Ham,W,Arsenal,0,0,West Ham United,Arsenal,L,West Ham
1,0,0,2021-12-15,Arsenal,West Ham,W,Arsenal,0,0,West Ham United,Arsenal,L,West Ham
2,0,0,2021-12-15,Arsenal,West Ham,W,Arsenal,0,0,West Ham United,Arsenal,L,West Ham
3,0,0,2021-12-15,Arsenal,West Ham,W,Arsenal,0,0,West Ham United,Arsenal,L,West Ham
4,0,0,2021-12-15,Arsenal,West Ham,W,Arsenal,0,0,West Ham United,Arsenal,L,West Ham
...,...,...,...,...,...,...,...,...,...,...,...,...,...
74444,0,1,2021-04-21,Tottenham Hotspur,Southampton,W,Tottenham,1,1,Southampton,Tottenham,L,Southampton
74445,1,1,2022-04-24,Chelsea,West Ham,W,Chelsea,1,0,West Ham United,Chelsea,L,West Ham
74446,1,1,2022-04-24,Chelsea,West Ham,W,Chelsea,0,0,West Ham United,Chelsea,L,West Ham
74447,1,1,2022-04-24,Chelsea,West Ham,W,Chelsea,1,1,West Ham United,Chelsea,L,West Ham


In [40]:
# Keep matches where one side is predicted to win and the other is not
merged[(merged['prediction_x'] == 1) & (merged['prediction_y'] == 0)]

Unnamed: 0,actual_x,prediction_x,date,team_x,opponent_x,result_x,new_team_x,actual_y,prediction_y,team_y,opponent_y,result_y,new_team_y
3518,1,1,2022-01-01,Arsenal,Manchester City,L,Arsenal,1,0,Manchester City,Arsenal,W,Manchester City
3520,1,1,2022-01-01,Arsenal,Manchester City,L,Arsenal,0,0,Manchester City,Arsenal,W,Manchester City
3521,1,1,2022-01-01,Arsenal,Manchester City,L,Arsenal,0,0,Manchester City,Arsenal,W,Manchester City
3522,1,1,2022-01-01,Arsenal,Manchester City,L,Arsenal,0,0,Manchester City,Arsenal,W,Manchester City
3523,1,1,2022-01-01,Arsenal,Manchester City,L,Arsenal,0,0,Manchester City,Arsenal,W,Manchester City
...,...,...,...,...,...,...,...,...,...,...,...,...,...
74442,0,1,2021-04-21,Tottenham Hotspur,Southampton,W,Tottenham,0,0,Southampton,Tottenham,L,Southampton
74443,0,1,2021-04-21,Tottenham Hotspur,Southampton,W,Tottenham,1,0,Southampton,Tottenham,L,Southampton
74445,1,1,2022-04-24,Chelsea,West Ham,W,Chelsea,1,0,West Ham United,Chelsea,L,West Ham
74446,1,1,2022-04-24,Chelsea,West Ham,W,Chelsea,0,0,West Ham United,Chelsea,L,West Ham
