In [1]:
import pandas as pd
from weekly_prediction_functions import *
from data_preparation_functions import *
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import log_loss, confusion_matrix
import warnings
warnings.filterwarnings('ignore')
pd.set_option('display.max_columns', 100)

# EPL Machine Learning Walkthrough

## 03. Weekly Predictions
Welcome to the third part of this Machine Learning Walkthrough. In this tutorial we will create weekly EPL predictions using the basic logistic regression model we built in the previous tutorial. In the next tutorial we will analyse those predictions and create staking strategies.

Specifically, this tutorial covers three things:

1. Obtaining Weekly Odds / Game Info Using Betfair's API
2. Data Wrangling This Week's Game Info Into Our Feature Set
3. Predicting Next Gameweek's Results

### Obtaining Weekly Odds / Game Info Using Betfair's API
The first thing we need to do to create weekly predictions is get both the games being played this week and the match odds from Betfair, which we will use as features.

To make this process easier, I have created a CSV file with the fixture list for the 2019/20 season. Let's load that now.

In [2]:
fixture = (pd.read_csv('data/fixture_epl_1920.csv')
              .assign(Date=lambda df: pd.to_datetime(df.Date)))

In [3]:
fixture.head(12)

Unnamed: 0,Date,Time,HomeTeam,AwayTeam,Location,TV,Year,round,season
0,2019-09-08,08:00:00 PM,Liverpool,Norwich,Anfield,,2019,1,1920
1,2019-10-08,12:30:00 PM,West Ham,Man City,London Stadium,,2019,1,1920
2,2019-10-08,03:00:00 PM,Bournemouth,Sheffield United,Vitality Stadium,,2019,1,1920
3,2019-10-08,03:00:00 PM,Burnley,Southampton,Turf Moor,,2019,1,1920
4,2019-10-08,03:00:00 PM,Crystal Palace,Everton,Selhurst Park,,2019,1,1920
5,2019-10-08,03:00:00 PM,Watford,Brighton,Vicarage Road,,2019,1,1920
6,2019-10-08,05:30:00 PM,Tottenham,Aston Villa,Tottenham Hotspur Stadium,,2019,1,1920
7,2019-11-08,02:00:00 PM,Leicester,Wolves,King Power Stadium,,2019,1,1920
8,2019-11-08,02:00:00 PM,Newcastle,Arsenal,St. James' Park,,2019,1,1920
9,2019-11-08,04:30:00 PM,Man United,Chelsea,Old Trafford,,2019,1,1920
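
One thing worth checking in the output above: the opening fixture, Liverpool v Norwich, was played on 9 August 2019, yet it appears as 2019-09-08. That suggests the day and month were swapped for dates whose day is 12 or lower. Below is a minimal sketch of a likely fix, assuming the CSV stores dates as day/month/year strings such as `09/08/2019` (the raw file is not shown here, so that format is an assumption).

```python
import pandas as pd

# Hypothetical raw values, assuming the fixture CSV uses day/month/year strings.
raw_dates = pd.Series(['09/08/2019', '10/08/2019', '21/09/2019'])

# An explicit format removes any ambiguity between day-first and month-first.
parsed = pd.to_datetime(raw_dates, format='%d/%m/%Y')

print(parsed.dt.strftime('%Y-%m-%d').tolist())
```

The round 6 dates used later in this tutorial (20 to 22 September) appear unaffected, since a day above 12 cannot be mistaken for a month.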


Now we will connect to the API and retrieve game-level information for the next week using an R script, weekly_game_info_puller.R. If you are not familiar with R, don't worry; the script is relatively simple to read through. Go ahead and run it now.

Note that this step requires a Betfair API App Key. If you don't have one, visit the [Betfair developer program page](https://www.betfair.com.au/hub/tools/betting-tools/developer-program/).

I will upload an updated weekly file, so you can follow along regardless of whether you have an App Key. Let's load that file in now.

In [4]:
#game_info = create_game_info_df("data/weekly_game_info.csv")

In [5]:
#game_info.head(3)

Finally, we will use the API to grab the weekly odds. The R script for this is provided too, and I have included the weekly odds CSV for convenience.

In [6]:
odds = (pd.read_csv('data/weekly_epl_odds.csv')
           .replace({
                'Man Utd': 'Man United',
                'C Palace': 'Crystal Palace',
                'Sheff Utd': 'Sheffield United'
           }))

In [7]:
odds.head(10)

Unnamed: 0,HomeTeam,AwayTeam,f_homeOdds,f_drawOdds,f_awayOdds
0,Southampton,Bournemouth,2.08,3.7,3.9
1,Leicester,Tottenham,3.0,3.6,2.52
2,Burnley,Norwich,2.04,3.85,3.85
3,Everton,Sheffield United,1.68,3.95,6.2
4,Man City,Watford,1.12,12.5,30.0
5,Newcastle,Brighton,2.58,3.25,3.25
6,Crystal Palace,Wolves,2.84,3.2,2.9
7,West Ham,Man United,3.45,3.75,2.2
8,Arsenal,Aston Villa,1.43,5.2,8.6
9,Chelsea,Liverpool,3.8,3.95,2.06


### Data Wrangling This Week's Game Info Into Our Feature Set
Now we have the arduous task of wrangling all of this info into a feature set that we can use to predict this week's games. Luckily, the functions we created earlier should work if we simply append this week's games, which have no results or match statistics yet, to our main DataFrame.

In [8]:
df = create_df('data/epl_data.csv')

In [9]:
df.head()

Unnamed: 0,AC,AF,AHh,AR,AS,AST,AY,Avg<2.5,Avg>2.5,AvgA,AvgAHA,AvgAHH,AvgD,AvgH,AwayTeam,B365A,B365D,B365H,BWA,BWD,BWH,Date,Day,Div,FTAG,FTHG,FTR,HC,HF,HR,HS,HST,HTAG,HTHG,HTR,HY,HomeTeam,IWA,IWD,IWH,Max<2.5,Max>2.5,MaxA,MaxAHA,MaxAHH,MaxD,MaxH,Month,Referee,VCA,VCD,VCH,Year,season,gameId,homeWin,awayWin,result
0,6.0,14.0,0.0,1.0,11.0,5.0,1.0,2.02,1.71,2.74,2.04,1.82,3.16,2.4,Blackburn,2.75,3.2,2.5,2.9,3.3,2.2,2005-08-13,13,E0,1.0,3.0,H,2.0,11.0,0.0,13.0,5.0,1.0,0.0,A,0.0,West Ham,2.7,3.0,2.3,1.8,2.25,2.9,2.08,1.86,3.35,2.6,8,A Wiley,2.75,3.25,2.4,2005,506,1,1,0,home
1,8.0,16.0,-0.25,0.0,13.0,6.0,2.0,2.01,1.7,3.05,1.84,2.01,3.16,2.2,Bolton,3.0,3.25,2.3,3.15,3.25,2.1,2005-08-13,13,E0,2.0,2.0,D,7.0,14.0,0.0,3.0,2.0,2.0,2.0,D,0.0,Aston Villa,3.1,3.0,2.1,1.87,2.2,3.4,1.92,2.1,3.3,2.4,8,M Riley,3.1,3.25,2.2,2005,506,2,0,0,draw
2,6.0,14.0,0.75,0.0,12.0,5.0,1.0,1.93,1.79,1.69,1.86,2.0,3.36,4.69,Man United,1.72,3.4,5.0,1.75,3.35,4.35,2005-08-13,13,E0,2.0,0.0,A,8.0,15.0,0.0,10.0,5.0,1.0,0.0,A,3.0,Everton,1.8,3.1,3.8,1.87,2.1,1.8,1.93,2.05,3.7,5.65,8,G Poll,1.8,3.3,4.5,2005,506,3,0,1,away
3,6.0,13.0,0.0,0.0,7.0,4.0,2.0,2.04,1.69,2.87,2.05,1.81,3.16,2.31,Birmingham,2.87,3.25,2.37,2.8,3.2,2.3,2005-08-13,13,E0,0.0,0.0,D,6.0,12.0,0.0,15.0,7.0,0.0,0.0,D,1.0,Fulham,2.9,3.0,2.2,1.77,2.24,3.05,2.11,1.85,3.3,2.6,8,R Styles,2.8,3.25,2.35,2005,506,4,0,0,draw
4,6.0,11.0,-0.75,0.0,13.0,3.0,3.0,1.94,1.77,4.79,1.76,2.1,3.38,1.69,West Brom,5.0,3.4,1.72,4.8,3.45,1.65,2005-08-13,13,E0,0.0,0.0,D,3.0,13.0,0.0,15.0,8.0,0.0,0.0,D,2.0,Man City,4.2,3.2,1.7,1.9,2.1,5.6,1.83,2.19,3.63,1.8,8,C Foy,5.0,3.25,1.75,2005,506,5,0,0,draw


Now we need to specify which game week we would like to predict. We will then filter the fixture list for this game week and append this info to the main DataFrame.

In [10]:
round_to_predict = int(input("Which game week would you like to predict? Please input next week's Game Week\n"))

Which game week would you like to predict? Please input next week's Game Week
6


In [11]:
future_predictions = (fixture.loc[fixture['round'] == round_to_predict, ['Date', 'HomeTeam', 'AwayTeam', 'season']]
                             .pipe(pd.merge, odds, on=['HomeTeam', 'AwayTeam'])
                             .rename(columns={
                                 'f_homeOdds': 'B365H',
                                 'f_awayOdds': 'B365A',
                                 'f_drawOdds': 'B365D'})
                             .assign(season=lambda df: df.season.astype(str)))
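
Because the merge above is an inner join on team names, any fixture whose names differ between the two files (for example `Man Utd` versus `Man United`) would be dropped silently. One quick sanity check, sketched here on small made-up frames rather than the real data, is a left join with `indicator=True` that lists the fixtures which found no odds. The helper name `unmatched_fixtures` is hypothetical.

```python
import pandas as pd

# Toy stand-ins for the fixture and odds DataFrames (values are illustrative).
fixture_week = pd.DataFrame({
    'HomeTeam': ['West Ham', 'Everton'],
    'AwayTeam': ['Man United', 'Sheffield United'],
})
odds_raw = pd.DataFrame({
    'HomeTeam': ['West Ham', 'Everton'],
    'AwayTeam': ['Man Utd', 'Sheff Utd'],  # names not yet normalised
    'f_homeOdds': [3.45, 1.68],
})


def unmatched_fixtures(fixtures, odds):
    """Return the fixtures that have no matching row in the odds table."""
    merged = fixtures.merge(odds, on=['HomeTeam', 'AwayTeam'],
                            how='left', indicator=True)
    return merged.loc[merged['_merge'] == 'left_only', ['HomeTeam', 'AwayTeam']]


# Before renaming, neither fixture finds a match.
print(len(unmatched_fixtures(fixture_week, odds_raw)))

# After applying the same kind of name mapping as above, every fixture matches.
odds_clean = odds_raw.replace({'Man Utd': 'Man United',
                               'Sheff Utd': 'Sheffield United'})
print(len(unmatched_fixtures(fixture_week, odds_clean)))
```

If the check returns any rows for the real data, the name mapping in the `replace` call probably needs another entry.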

In [12]:
df_including_future_games = (pd.read_csv('data/epl_data.csv', dtype={'season': str})
                .assign(Date=lambda df: pd.to_datetime(df.Date))
                .pipe(lambda df: df.dropna(thresh=len(df) - 2, axis=1))  # Drop cols with NAs
                .dropna(axis=0)  # Drop rows with NAs
                .sort_values('Date')
                .pipe(lambda df: pd.concat([df, future_predictions], sort=True))  # DataFrame.append was removed in pandas 2.0
                .reset_index(drop=True)
                .assign(gameId=lambda df: list(df.index + 1),
                            Year=lambda df: df.Date.apply(lambda row: row.year),
                            homeWin=lambda df: df.apply(lambda row: 1 if row.FTHG > row.FTAG else 0, axis=1),
                            awayWin=lambda df: df.apply(lambda row: 1 if row.FTAG > row.FTHG else 0, axis=1),
                            result=lambda df: df.apply(lambda row: 'home' if row.FTHG > row.FTAG else ('draw' if row.FTHG == row.FTAG else 'away'), axis=1)))

In [13]:
df_including_future_games.tail(12)

Unnamed: 0,AC,AF,AHh,AR,AS,AST,AY,Avg<2.5,Avg>2.5,AvgA,AvgAHA,AvgAHH,AvgD,AvgH,AwayTeam,B365A,B365D,B365H,BWA,BWD,BWH,Date,Day,Div,FTAG,FTHG,FTR,HC,HF,HR,HS,HST,HTAG,HTHG,HTR,HY,HomeTeam,IWA,IWD,IWH,Max<2.5,Max>2.5,MaxA,MaxAHA,MaxAHH,MaxD,MaxH,Month,Referee,VCA,VCD,VCH,Year,season,gameId,homeWin,awayWin,result
5352,1.0,4.0,0.5,0.0,10.0,4.0,3.0,2.41,1.57,2.01,2.02,1.87,3.74,3.64,Arsenal,2.0,3.6,3.6,1.91,3.8,3.8,2019-09-15,15.0,E0,2.0,2.0,D,7.0,14.0,0.0,31.0,7.0,2.0,0.0,A,3.0,Watford,1.9,3.7,3.9,2.56,1.61,2.08,2.08,1.91,3.9,3.9,9.0,A Taylor,2.05,3.75,3.6,2019,1920,5353,0,0,draw
5353,4.0,12.0,0.0,1.0,13.0,1.0,1.0,2.31,1.62,2.63,1.95,1.94,3.56,2.64,West Ham,2.6,3.5,2.62,2.7,3.4,2.6,2019-09-16,16.0,E0,0.0,0.0,D,2.0,13.0,0.0,10.0,5.0,0.0,0.0,D,2.0,Aston Villa,2.55,3.6,2.6,2.42,1.67,2.7,1.99,1.97,3.67,2.7,9.0,M Dean,2.7,3.6,2.63,2019,1920,5354,0,0,draw
5354,,,,,,,,,,,,,,,Bournemouth,3.9,3.7,2.08,,,,2019-09-20,,,,,,,,,,,,,,,Southampton,,,,,,,,,,,,,,,,2019,1920,5355,0,0,away
5355,,,,,,,,,,,,,,,Tottenham,2.52,3.6,3.0,,,,2019-09-21,,,,,,,,,,,,,,,Leicester,,,,,,,,,,,,,,,,2019,1920,5356,0,0,away
5356,,,,,,,,,,,,,,,Norwich,3.85,3.85,2.04,,,,2019-09-21,,,,,,,,,,,,,,,Burnley,,,,,,,,,,,,,,,,2019,1920,5357,0,0,away
5357,,,,,,,,,,,,,,,Sheffield United,6.2,3.95,1.68,,,,2019-09-21,,,,,,,,,,,,,,,Everton,,,,,,,,,,,,,,,,2019,1920,5358,0,0,away
5358,,,,,,,,,,,,,,,Watford,30.0,12.5,1.12,,,,2019-09-21,,,,,,,,,,,,,,,Man City,,,,,,,,,,,,,,,,2019,1920,5359,0,0,away
5359,,,,,,,,,,,,,,,Brighton,3.25,3.25,2.58,,,,2019-09-21,,,,,,,,,,,,,,,Newcastle,,,,,,,,,,,,,,,,2019,1920,5360,0,0,away
5360,,,,,,,,,,,,,,,Wolves,2.9,3.2,2.84,,,,2019-09-22,,,,,,,,,,,,,,,Crystal Palace,,,,,,,,,,,,,,,,2019,1920,5361,0,0,away
5361,,,,,,,,,,,,,,,Man United,2.2,3.75,3.45,,,,2019-09-22,,,,,,,,,,,,,,,West Ham,,,,,,,,,,,,,,,,2019,1920,5362,0,0,away


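As we can see, we have appended this week's game information, including the Betfair odds (stored in the B365 columns), to our main DataFrame. The remaining columns are left as NAs, but the features for these games will be filled in when we create our rolling average features. Note that `homeWin`, `awayWin` and `result` show 0, 0 and 'away' for the future games only because comparisons against missing goal counts evaluate to False; these values are placeholders, and the rows are excluded from the training set below. This is a somewhat 'hacky' way to complete the task, but it works well because we can reuse the functions we created in the previous tutorials on this DataFrame. With the odds already merged in, we can just run our feature creation functions as usual.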
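
To see why rows with no match statistics can still end up with features, here is a minimal sketch of a rolling average that only looks at previous games. It is not the `create_feature_df` implementation from the earlier tutorial; the column names and the three-game window are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

# One team's games in date order; the last row is a future game with no stats yet.
games = pd.DataFrame({
    'team': ['Everton'] * 4,
    'goals': [2.0, 0.0, 1.0, np.nan],
})

# shift(1) excludes the current game, so each row's feature is built
# only from games that had already been played.
games['f_avg_goals'] = (
    games.groupby('team')['goals']
         .transform(lambda s: s.shift(1).rolling(window=3, min_periods=1).mean())
)

print(games)
```

The future game gets a value because its feature depends only on the three earlier rows, not on its own missing `goals` entry.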

### Predicting Next Gameweek's Results
Now that we have our feature DataFrame, all we need to do is split the feature DataFrame up into a training set and next week's games, then use the model we tuned in the last tutorial to create predictions!

In [14]:
features = create_feature_df(df=df_including_future_games)

Creating all games feature DataFrame
Creating stats feature DataFrame
Creating odds feature DataFrame
Creating market values feature DataFrame
Filling NAs
Merging stats, odds and market values into one features DataFrame
Complete.


In [15]:
# Create a feature DataFrame for this week's games.
production_df = pd.merge(future_predictions, features, on=['Date', 'HomeTeam', 'AwayTeam', 'season'])

In [16]:
# Create a training DataFrame
training_df = features[~features.gameId.isin(production_df.gameId)]
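
The split relies on `gameId` being unique per match: any row whose id appears in `production_df` is held out, and everything else is used for training. A toy illustration with made-up ids and a made-up feature column:

```python
import pandas as pd

# Illustrative feature table; ids 3 and 4 stand in for this week's games.
features_toy = pd.DataFrame({
    'gameId': [1, 2, 3, 4],
    'f_example': [0.1, 0.4, 0.3, 0.9],
})
production_toy = features_toy[features_toy.gameId.isin([3, 4])]

# ~ negates the boolean mask, keeping only games not in the production set.
training_toy = features_toy[~features_toy.gameId.isin(production_toy.gameId)]

# The two sets should not overlap and should cover every row between them.
overlap = set(training_toy.gameId) & set(production_toy.gameId)
print(overlap, len(training_toy) + len(production_toy) == len(features_toy))
```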

In [17]:
feature_names = [col for col in training_df if col.startswith('f_')]

le = LabelEncoder()
train_y = le.fit_transform(training_df.result)
train_x = training_df[feature_names]

In [18]:
#lr = LogisticRegression(C=0.01, solver='liblinear')
lr = LogisticRegression(C=0.05, solver='lbfgs')
lr.fit(train_x, train_y)
predicted_probs = lr.predict_proba(production_df[feature_names])
predicted_odds = 1 / predicted_probs
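
Decimal odds are the reciprocal of a probability, which is all the last line above does. The sketch below works through the same arithmetic in the other direction for the Everton v Sheffield United odds shown earlier. It also shows that the market's implied probabilities sum to slightly more than 1, whereas the model's probabilities for each game sum to 1. Rescaling by the total is one simple way to remove that margin, not the only one.

```python
# Betfair odds for Everton v Sheffield United, taken from the odds table above.
market_odds = {'home': 1.68, 'draw': 3.95, 'away': 6.2}

# Implied probability is the reciprocal of the decimal odds.
implied_probs = {outcome: 1 / price for outcome, price in market_odds.items()}
overround = sum(implied_probs.values())  # above 1 when the market includes a margin

# Dividing by the total rescales the probabilities so they sum to 1.
fair_probs = {outcome: p / overround for outcome, p in implied_probs.items()}

# Taking the reciprocal again gives margin-free odds.
fair_odds = {outcome: 1 / p for outcome, p in fair_probs.items()}

print(round(overround, 4))
print({outcome: round(price, 2) for outcome, price in fair_odds.items()})
```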

In [19]:
# Assign the modelled odds to our predictions df.
# LabelEncoder sorts classes alphabetically, so column 0 is away, 1 is draw and 2 is home.
predictions_df = (production_df.loc[:, ['Date', 'HomeTeam', 'AwayTeam', 'B365H', 'B365D', 'B365A']]
                               .assign(homeModelledOdds=[i[2] for i in predicted_odds],
                                      drawModelledOdds=[i[1] for i in predicted_odds],
                                      awayModelledOdds=[i[0] for i in predicted_odds])
                               .rename(columns={
                                   'B365H': 'BetfairHomeOdds',
                                   'B365D': 'BetfairDrawOdds',
                                   'B365A': 'BetfairAwayOdds'}))
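
The hard-coded positions 2, 1 and 0 work because `LabelEncoder` numbers classes in sorted order, so `away` is 0, `draw` is 1 and `home` is 2, and `predict_proba` returns its columns in that same order. Here is a sketch that looks the positions up instead of hard-coding them; the small probability array is made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
le.fit(['home', 'draw', 'away', 'home'])
print(list(le.classes_))

# Made-up stand-in for the output of predict_proba for two games.
predicted_probs = np.array([[0.2, 0.3, 0.5],
                            [0.6, 0.25, 0.15]])
predicted_odds = 1 / predicted_probs

# Map each outcome name to its column instead of relying on 0/1/2.
col = {label: idx for idx, label in enumerate(le.classes_)}
home_odds = predicted_odds[:, col['home']]
away_odds = predicted_odds[:, col['away']]
print(home_odds, away_odds)
```

Looking the columns up by name keeps the assignment correct even if the labels were to change.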

In [20]:
predictions_df

Unnamed: 0,Date,HomeTeam,AwayTeam,BetfairHomeOdds,BetfairDrawOdds,BetfairAwayOdds,homeModelledOdds,drawModelledOdds,awayModelledOdds
0,2019-09-20,Southampton,Bournemouth,2.08,3.7,3.9,2.13863,3.837353,3.678976
1,2019-09-21,Leicester,Tottenham,3.0,3.6,2.52,2.675024,3.319227,3.077902
2,2019-09-21,Burnley,Norwich,2.04,3.85,3.85,2.199465,3.478334,3.878222
3,2019-09-21,Everton,Sheffield United,1.68,3.95,6.2,1.695623,3.984884,6.277549
4,2019-09-21,Man City,Watford,1.12,12.5,30.0,1.034685,50.792203,72.283098
5,2019-09-21,Newcastle,Brighton,2.58,3.25,3.25,2.086903,3.482114,4.280104
6,2019-09-22,Crystal Palace,Wolves,2.84,3.2,2.9,2.374476,3.319259,3.602533
7,2019-09-22,West Ham,Man United,3.45,3.75,2.2,3.22441,3.783717,2.34976
8,2019-09-22,Arsenal,Aston Villa,1.43,5.2,8.6,1.432229,5.023309,9.735624
9,2019-09-22,Chelsea,Liverpool,3.8,3.95,2.06,3.705022,2.759913,2.719122


Above are the predictions for this Gameweek's matches, shown as modelled odds alongside the Betfair odds. In the next tutorial we will explore the errors our model has made and work on creating a profitable betting strategy.