# A Study Into Prediction Models - Premier League 2022/2023 Season
By Sam Grant

---

# Introduction

In this notebook I will create 3 different prediction models and compare and contrast them to see what emerges. The first model uses a formula that, for each game, takes inputs specific to that game and spits out a predicted result. This formula is based on the statistics of the 5 previous games of each team: it takes a mean of xGf (expected goals for) and xGa (expected goals against), with every game weighted equally, to compute a predicted xG for each team and then generates a result based on this. The second model is an ordered logit regression model, and the third model is based on machine learning using the library sklearn. I will be using a dataset of all results in the Premier League in the 2022/23 season, which I will split into games played in the first half of the season and games played in the second half. The first half will be used as information to input/regress/train the models and the second half will be the 'test' data to see how each model performs.

This will be done by looking at the success rate of each model. As there are 3 possible outcomes (win, lose or draw) the benchmark will be $33.33\%$, since if you chose results uniformly at random you would expect to get a third of them correct in the long run. So any value larger than this means the model performs better than choosing at random, but this does not necessarily mean the model has performed well. That will be down to our discretion. For the second model, we will also look at the Brier score, which is the average over all games of the sum of the squared differences between the predicted probability of each outcome and the indicator function of that outcome. This will be shown more clearly when we come to calculating it, but the benchmark here (using a model that gives each result equal probability) is $2/3 \approx 0.667$ as shown below.

$$(0-1/3)^2+(0-1/3)^2+(1-1/3)^2 = 1/9+1/9+4/9 = 2/3$$

So again, as above, a value less than this (lower means better performance) means the model performed better than choosing at random and, beyond that, it will be down to discretion. We only do this for the second model, as it is the only one for which we compute the probabilities of each outcome.
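As a quick check of the benchmark above, the short sketch below evaluates the Brier score of a forecast that gives each result probability $1/3$, for each of the three possible outcomes. The function name `brier_score` and the dictionary layout are my own choices for illustration, not part of the notebook's code.

```python
from fractions import Fraction


def brier_score(probs, outcome):
    """Sum of squared differences between forecast probabilities and 0/1 outcome indicators."""
    return sum(
        (p - (1 if label == outcome else 0)) ** 2
        for label, p in probs.items()
    )


# A forecast that treats home win, draw and away win as equally likely
uniform = {"H": Fraction(1, 3), "D": Fraction(1, 3), "A": Fraction(1, 3)}

# Score the uniform forecast against each possible actual result
benchmark_scores = {outcome: brier_score(uniform, outcome) for outcome in uniform}
print(benchmark_scores)
```

Whichever result occurs, the uniform forecast scores $2/3$, so the average over any set of games is also $2/3$.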

One crucial thing I have not mentioned yet is the fundamental principle that the data/information we apply in each case must be things we know about the games we want to predict before they kick off. Obviously, this is being done retrospectively so we do actually know a lot more; without this principle we could simply use the goals scored by each team to predict each result with 100% accuracy. We are interested in variables that are known before a game, building a model from games whose results we already have, and using it to predict later games without any data that was not available before kick-off.

---

## Preparing the Data

The code below uses a technique called web scraping: downloading the HTML of a webpage that contains the data you want and then using Python to find the data and convert it into a workable data set. The website we are using is __fbref.com__, a massive database of data and statistics from a huge array of football games in various years and competitions. Specifically, as indicated by the title and intro, we will be looking at each game in the 2022/23 Premier League season, which is found in a table on the page requested in the code below. We will 'scrape' this table and it will be the basis for the data sets used for each model, which will differ by what extra information needs adding.

In [1]:
import requests
import numpy as np
from bs4 import BeautifulSoup
import pandas as pd
from statsmodels.miscmodels.ordinal_model import OrderedModel
from sklearn.linear_model import LogisticRegression

In [2]:
headers = {'User-Agent': 
           'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36'}

page = "https://fbref.com/en/comps/9/2022-2023/schedule/2022-2023-Premier-League-Scores-and-Fixtures"
pageTree = requests.get(page, headers=headers)
pageSoup = BeautifulSoup(pageTree.content, 'html.parser')

In [3]:
table1 = pageSoup.find("table", id="sched_2022-2023_9_1")

In [4]:
headers = []
for i in table1.find_all("th"):
    title = i.text
    headers.append(title)
headers = headers[1:14]
headers[4] = "xG_h"
headers[6] = "xG_a"

In [5]:
df = pd.DataFrame(columns = headers)

In [6]:
for j in table1.find_all("tr")[1:]:
    row_data = j.find_all("td")
    row = [i.text for i in row_data]
    length = len(df)
    df.loc[length] = row

In [7]:
df = df[['Home','xG_h','Score','xG_a','Away']]
df = df.iloc[np.where(df['Home'] != "")]
df = df.reset_index(drop=True)

In [8]:
df.to_csv("results_22_23.csv", index=False)
df = pd.read_csv("results_22_23.csv")

In [9]:
columns = np.array(['Home','Away','Score','xG_h','xG_a'])
df = df[columns]
df

Unnamed: 0,Home,Away,Score,xG_h,xG_a
0,Crystal Palace,Arsenal,0–2,1.2,1.0
1,Fulham,Liverpool,2–2,1.2,1.2
2,Tottenham,Southampton,4–1,1.5,0.5
3,Newcastle Utd,Nott'ham Forest,2–0,1.7,0.3
4,Leeds United,Wolves,2–1,0.8,1.3
...,...,...,...,...,...
375,Everton,Bournemouth,1–0,1.0,0.5
376,Leicester City,West Ham,2–1,1.4,1.4
377,Aston Villa,Brighton,2–1,2.8,1.4
378,Leeds United,Tottenham,1–4,1.5,2.2


In [10]:
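# Split each 'home–away' score on the en dash that appears in the scraped data
# and convert to integers, so goals are compared numerically rather than as
# text (taking single characters would break on a double-digit score)
goals = df['Score'].str.split('–', expand=True).astype(int)
df['Home_goals'] = goals[0]
df['Away_goals'] = goals[1]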
    
df

Unnamed: 0,Home,Away,Score,xG_h,xG_a,Home_goals,Away_goals
0,Crystal Palace,Arsenal,0–2,1.2,1.0,0,2
1,Fulham,Liverpool,2–2,1.2,1.2,2,2
2,Tottenham,Southampton,4–1,1.5,0.5,4,1
3,Newcastle Utd,Nott'ham Forest,2–0,1.7,0.3,2,0
4,Leeds United,Wolves,2–1,0.8,1.3,2,1
...,...,...,...,...,...,...,...
375,Everton,Bournemouth,1–0,1.0,0.5,1,0
376,Leicester City,West Ham,2–1,1.4,1.4,2,1
377,Aston Villa,Brighton,2–1,2.8,1.4,2,1
378,Leeds United,Tottenham,1–4,1.5,2.2,1,4


In [11]:
df['Result'] = np.where(df['Home_goals'] > df['Away_goals'], 'H', np.where(df['Home_goals'] == df['Away_goals'], 'D', 'A'))
df

Unnamed: 0,Home,Away,Score,xG_h,xG_a,Home_goals,Away_goals,Result
0,Crystal Palace,Arsenal,0–2,1.2,1.0,0,2,A
1,Fulham,Liverpool,2–2,1.2,1.2,2,2,D
2,Tottenham,Southampton,4–1,1.5,0.5,4,1,H
3,Newcastle Utd,Nott'ham Forest,2–0,1.7,0.3,2,0,H
4,Leeds United,Wolves,2–1,0.8,1.3,2,1,H
...,...,...,...,...,...,...,...,...
375,Everton,Bournemouth,1–0,1.0,0.5,1,0,H
376,Leicester City,West Ham,2–1,1.4,1.4,2,1,H
377,Aston Villa,Brighton,2–1,2.8,1.4,2,1,H
378,Leeds United,Tottenham,1–4,1.5,2.2,1,4,A


---

## Model 1 - Formula based on Expected Goal Statistics

The code below defines a function that acts on a dataset and spits out another one with the desired additions so we can see how the model has performed. Lines starting with # are comments that explain the bits that are not immediately obvious; they are not executed.

In [12]:
def model1(df) :
    
    df1 = df.copy()
    # initialise 2 variables/columns in our dataset that we will need later
    
    df1['pred_home_goals'] = np.repeat(0.1,len(df)) 
    df1['pred_away_goals'] = np.repeat(0.1,len(df))
    
    # for each game in our test set (second half of the games) this for loop creates 2 datasets with each teams results
    # and defines more needed variables
    
    for i in range(len(df1))[190:]:
        Home_Team = df1['Home'][i]
        Away_Team = df1['Away'][i]
        
        dfH = df1[(df1['Home'] == Home_Team) | (df1['Away'] == Home_Team)]
        dfH = dfH.reset_index(drop=True)
        dfH['xGf'] = np.repeat(0.1,len(dfH)) 
        dfH['xGa'] = np.repeat(0.1,len(dfH))
        
        # this for loop, like the one below, sets the expected goals for/against equal to the home/away team expected
        # goals based on if the team of interest was at home/away
        
        for j in range(len(dfH)):
            if (dfH['Home'][j] == Home_Team): 
                dfH.loc[j,('xGf')] = dfH.loc[j, ('xG_h')]
                dfH.loc[j,('xGa')] = dfH.loc[j, ('xG_a')]
        
            else: 
                dfH.loc[j,('xGf')] = dfH.loc[j,('xG_a')]
                dfH.loc[j,('xGa')] = dfH.loc[j,('xG_h')]
    
        dfA = df1[(df1['Home'] == Away_Team) | (df1['Away'] == Away_Team)]
        dfA = dfA.reset_index(drop=True)
        dfA['xGf'] = np.repeat(0.1,len(dfA)) 
        dfA['xGa'] = np.repeat(0.1,len(dfA))
        
        for k in range(len(dfA)):
            if (dfA['Home'][k] == Away_Team): 
                dfA.loc[k,('xGf')] = dfA.loc[k,('xG_h')] 
                dfA.loc[k,('xGa')] = dfA.loc[k,('xG_a')]
        
            else: 
                dfA.loc[k,('xGf')] = dfA.loc[k,('xG_a')] 
                dfA.loc[k,('xGa')] = dfA.loc[k,('xG_h')]
        
        # find the index of our game of interest in these datasets
        
        b   = np.where((dfH['Home'] == Home_Team) & (dfH['Away'] == Away_Team))[0][0]
        c   = np.where((dfA['Home'] == Home_Team) & (dfA['Away'] == Away_Team))[0][0]
        
        # use a formula, taking the mean (all ten values weighted equally) of expected goals for in the last 5 of a
        # team's games and the expected goals against in the last 5 of the opposition's games, and assign this value
        # to our predicted goals variables in the original data set
        
        df1.loc[i,('pred_home_goals')] = round((dfH.loc[b-5,('xGf')]+dfH.loc[b-4,('xGf')]+dfH.loc[b-3,('xGf')]+dfH.loc[b-2,('xGf')]+dfH.loc[b-1,('xGf')]+
                                               dfA.loc[c-5,('xGa')]+dfA.loc[c-4,('xGa')]+dfA.loc[c-3,('xGa')]+dfA.loc[c-2,('xGa')]+dfA.loc[c-1,('xGa')])/10,1)
        df1.loc[i,('pred_away_goals')] = round((dfA.loc[c-5,('xGf')]+dfA.loc[c-4,('xGf')]+dfA.loc[c-3,('xGf')]+dfA.loc[c-2,('xGf')]+dfA.loc[c-1,('xGf')]+
                                               dfH.loc[b-5,('xGa')]+dfH.loc[b-4,('xGa')]+dfH.loc[b-3,('xGa')]+dfH.loc[b-2,('xGa')]+dfH.loc[b-1,('xGa')])/10,1)
    
    df1['pred_result_1'] = np.where(df1['pred_home_goals'] > df1['pred_away_goals'], 'H', np.where(df1['pred_home_goals'] == df1['pred_away_goals'], 'D', 'A'))
    
    return(df1.iloc[190:])

In [13]:
df1 = model1(df)
df1

Unnamed: 0,Home,Away,Score,xG_h,xG_a,Home_goals,Away_goals,Result,pred_home_goals,pred_away_goals,pred_result_1
190,Bournemouth,Nott'ham Forest,1–1,1.4,1.9,1,1,D,1.1,1.5,A
191,Southampton,Aston Villa,0–1,0.6,1.4,0,1,A,1.3,1.4,A
192,Leicester City,Brighton,2–2,0.9,1.7,2,2,D,1.5,1.6,A
193,West Ham,Everton,2–0,2.2,0.5,2,0,H,1.4,1.3,H
194,Crystal Palace,Newcastle Utd,0–0,0.3,1.2,0,0,D,0.7,1.6,A
...,...,...,...,...,...,...,...,...,...,...,...
375,Everton,Bournemouth,1–0,1.0,0.5,1,0,H,1.6,1.7,A
376,Leicester City,West Ham,2–1,1.4,1.4,2,1,H,1.9,1.6,H
377,Aston Villa,Brighton,2–1,2.8,1.4,2,1,H,1.5,1.4,H
378,Leeds United,Tottenham,1–4,1.5,2.2,1,4,A,1.3,2.0,A


In [14]:
df1['correct_pred_1'] = np.where(df1['Result'] == df1['pred_result_1'],1,0)
df1['correct_pred_1'].describe()

count    190.000000
mean       0.473684
std        0.500626
min        0.000000
25%        0.000000
50%        0.000000
75%        1.000000
max        1.000000
Name: correct_pred_1, dtype: float64

---

The last bit of code applies the formula/model to the dataset and returns the test data set, now with predicted values for each team's goals and a predicted result based on these. Then we can define a variable equal to 1 if the predicted result agrees with the actual result and 0 otherwise, and look at the summary statistics of this variable - most importantly the mean. This represents the proportion of the values that are 1 and hence the success rate of the model. In this case it is $47.37\%$ (to $2$ d.p.), so about $14$ percentage points higher than our benchmark value. This means it is better than simply choosing at random; however, I would argue it is not a very successful model as it predicts on average less than 1 in 2 correctly - you would want to be a bit more confident in your prediction than that.
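To make the calculation inside `model1` easier to follow, here is a minimal stand-alone sketch of the same arithmetic for a single fixture. The helper names `predicted_goals` and `predicted_result` and the recent-form numbers are hypothetical, invented purely for illustration; only the formula (ten values summed and divided by 10, rounded to 1 d.p.) is taken from the code above.

```python
def predicted_goals(attack_xgf_last5, defence_xga_last5):
    """Mean of a team's last five xG-for values and its opponent's last five xG-against values."""
    if len(attack_xgf_last5) != 5 or len(defence_xga_last5) != 5:
        raise ValueError("expected exactly five values from each side")
    total = sum(attack_xgf_last5) + sum(defence_xga_last5)
    return round(total / 10, 1)


def predicted_result(home_goals, away_goals):
    """Turn two predicted goal values into 'H', 'D' or 'A'."""
    if home_goals > away_goals:
        return "H"
    if home_goals == away_goals:
        return "D"
    return "A"


# Hypothetical recent form for the two teams (not taken from the dataset)
home_xgf = [1.2, 0.8, 1.5, 2.0, 1.0]
home_xga = [1.0, 1.4, 0.9, 1.1, 1.6]
away_xgf = [0.7, 1.1, 0.9, 1.3, 1.0]
away_xga = [1.8, 1.2, 1.5, 1.0, 1.6]

pred_home = predicted_goals(home_xgf, away_xga)
pred_away = predicted_goals(away_xgf, home_xga)
print(pred_home, pred_away, predicted_result(pred_home, pred_away))
```

Each team's prediction combines its own attacking numbers with the opposition's defensive numbers, which is why the two lists passed to each call come from different teams.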

At the last step of writing the formula I had the choice of how to compute the predicted goals and then the result. Originally, I chose to round the means to a whole number and define results by this, which to me intuitively made the most sense. However, what you find is that this predicts a lot more draws than the finer-grained version ($65$ vs $9$), which is not what you tend to find in football games in practice. This happens because rounding to one decimal place increases the number of values the variables can take (e.g. $0,0.1,0.2,0.3,...$ rather than $0,1,2,...$), so you are less likely to find a game where both teams have the same value. Whole-number rounding also leads to a lower success rate ($36.84\%$ vs $47.37\%$) because it predicts many more draws than actually occur.
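The effect of the rounding choice on the number of predicted draws can be seen in a toy example. The predictions below are hypothetical values invented for illustration, and `count_draws` is a helper name of my own; the point is only that coarser rounding makes ties between the two predicted values more likely.

```python
def count_draws(predictions, ndigits):
    """Count fixtures whose rounded home and away predictions are equal."""
    return sum(
        1
        for home, away in predictions
        if round(home, ndigits) == round(away, ndigits)
    )


# Hypothetical unrounded (home, away) predicted goals, not taken from the dataset
predictions = [(1.32, 1.41), (0.84, 1.27), (1.66, 1.58), (1.12, 1.94), (1.73, 1.22)]

draws_whole = count_draws(predictions, 0)  # round to whole goals
draws_tenth = count_draws(predictions, 1)  # round to one decimal place
print(draws_whole, draws_tenth)
```

With these made-up numbers, three of the five fixtures become draws under whole-number rounding and none do at one decimal place.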

After this discovery, I modified the model by rounding to $1$ d.p., which produces the results discussed above. However, this introduces its own issues, as we are predicting values of goals that are impossible to achieve in reality. One could argue this is irrelevant as the only prediction we are concerned about is the result we spit out, but I think a refinement of the model would modify the formula to produce more accurate predictions of the goals __in whole numbers__ and then a predicted result based on that. Ultimately, I think this model falls short because, despite the logic making sense - taking into account the recent performances/form of both teams defensively and offensively - the process of averaging squeezes most of these values into a narrower distribution. This makes it harder to see a clear difference between the two teams and thus to make a confident prediction of the result. Averaging removes some of the information captured by those statistics, which makes the average less effective as a predictor variable.

---

## Model 2 - Ordered Logistic Regression

For this model I need to do a bit of variable selection to decide what to include before I prepare the data and regress on it. To explain the main variation in results, one of the independent variables will be the log of the home team's average player value over the away team's average player value, as money clearly contributes to explaining who beats who and is a common predictor variable in sports prediction models. I was originally going to use squad value but then thought about how a team like Forest has a higher value than Brentford purely because it has more players. We can remove this effect by dividing by the size of the squad, which leaves a value that more accurately reflects a club's level of performance in monetary terms. The data is taken from the website __transfermarkt.co.uk__ as of the start of the season (which means it won't take into account the January transfer window, a flaw in the model). It will not be scraped like above; I will just manually create the data frame as seen below.
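As a worked example of this predictor, the sketch below computes the log ratio for the first fixture of the season, Crystal Palace (home) v Arsenal (away), using the squad values and squad sizes entered in the data frame that follows. The function name `log_value_ratio` is a hypothetical helper for illustration only.

```python
import math


def log_value_ratio(home_value, home_squad, away_value, away_squad):
    """Log of (home average player value / away average player value)."""
    home_avg = home_value / home_squad
    away_avg = away_value / away_squad
    return math.log(home_avg / away_avg)


# Crystal Palace: value 323.05, squad of 39; Arsenal: value 1000, squad of 42
ratio = log_value_ratio(323.05, 39, 1000, 42)
print(round(ratio, 4))
```

The value is negative because the home side has the lower average player value. Taking the log makes the variable symmetric: swapping home and away flips the sign but keeps the size.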

In [15]:
df_SV = pd.DataFrame({'Team' : ["Manchester City","Arsenal","Chelsea","Manchester Utd","Liverpool","Tottenham","Newcastle Utd","Brighton","Aston Villa","Wolves","Leicester City",
                                "West Ham","Southampton","Everton","Nott'ham Forest", "Brentford", "Leeds United", "Crystal Palace", "Fulham", "Bournemouth"],
                      'Value' : [1150,1000,994.95,848,811.85,649.1,541.6,529.83,509.55,497.65,490.8,465.75,419.95,413.15,376.25,371.2,345.15,323.05,295.25,287.20],
                      'Squad' : [34,42,43,48,39,34,38,42,43,43,38,44,46,40,50,36,47,39,47,47]})
df_SV['Av_value'] = df_SV['Value']/df_SV['Squad']
df_SV

Unnamed: 0,Team,Value,Squad,Av_value
0,Manchester City,1150.0,34,33.823529
1,Arsenal,1000.0,42,23.809524
2,Chelsea,994.95,43,23.138372
3,Manchester Utd,848.0,48,17.666667
4,Liverpool,811.85,39,20.816667
5,Tottenham,649.1,34,19.091176
6,Newcastle Utd,541.6,38,14.252632
7,Brighton,529.83,42,12.615
8,Aston Villa,509.55,43,11.85
9,Wolves,497.65,43,11.573256


In [16]:
df_SV['Home'] = df_SV['Team']
df_SV['Away'] = df_SV['Team']
df2 = pd.merge(df,df_SV[['Home','Av_value']], on= 'Home', how = 'left').rename(columns={'Av_value':'hAV'})
df2 = pd.merge(df2,df_SV[['Away','Av_value']], on= 'Away', how = 'left').rename(columns={'Av_value':'aAV'})
df2['lAVratio'] = np.log(df2['hAV']/df2['aAV'])
df2['winvalue'] = np.where(df2['Result']=='H',2,np.where(df2['Result']=='D',1,0))
df2

Unnamed: 0,Home,Away,Score,xG_h,xG_a,Home_goals,Away_goals,Result,hAV,aAV,lAVratio,winvalue
0,Crystal Palace,Arsenal,0–2,1.2,1.0,0,2,A,8.283333,23.809524,-1.055840,0
1,Fulham,Liverpool,2–2,1.2,1.2,2,2,D,6.281915,20.816667,-1.198079,1
2,Tottenham,Southampton,4–1,1.5,0.5,4,1,H,19.091176,9.129348,0.737732,2
3,Newcastle Utd,Nott'ham Forest,2–0,1.7,0.3,2,0,H,14.252632,7.525000,0.638711,2
4,Leeds United,Wolves,2–1,0.8,1.3,2,1,H,7.343617,11.573256,-0.454865,2
...,...,...,...,...,...,...,...,...,...,...,...,...
375,Everton,Bournemouth,1–0,1.0,0.5,1,0,H,10.328750,6.110638,0.524900,2
376,Leicester City,West Ham,2–1,1.4,1.4,2,1,H,12.915789,10.585227,0.198991,2
377,Aston Villa,Brighton,2–1,2.8,1.4,2,1,H,11.850000,12.615000,-0.062559,2
378,Leeds United,Tottenham,1–4,1.5,2.2,1,4,A,7.343617,19.091176,-0.955395,0


In [17]:
df_regress_1 = df2[:190]
model2 = OrderedModel(df_regress_1['winvalue'],
                        df_regress_1[['lAVratio']],
                        distr='logit')
 
res_log = model2.fit(method='bfgs')
res_log.summary()

Optimization terminated successfully.
         Current function value: 0.981512
         Iterations: 11
         Function evaluations: 12
         Gradient evaluations: 12


Dep. Variable:,winvalue,Log-Likelihood:,-186.49
Model:,OrderedModel,AIC:,379.0
Method:,Maximum Likelihood,BIC:,388.7
Date:,"Fri, 20 Oct 2023",,
Time:,13:51:11,,
No. Observations:,190,,
Df Residuals:,187,,
Df Model:,1,,

,coef,std err,z,P>|z|,[0.025,0.975]
lAVratio,1.1386,0.230,4.946,0.000,0.687,1.590
0/1,-0.9404,0.168,-5.595,0.000,-1.270,-0.611
1/2,0.0649,0.140,0.464,0.642,-0.209,0.339


---

The above code creates the data frame with the relevant monetary values as discussed, combines this with our original data set of results, adds a couple more variables for the regression, and finally fits the model on the regression subset of the data and prints the summary. As we can see in the output, the coefficient for $lAVratio$ is $1.1386$ with a p-value reported as $0.000$ (that is, below $0.0005$), so it is statistically significant at any common significance level. As the coefficient is positive, the higher the ratio, the better the expected outcome from the home team's point of view.

To generate the forecast probabilities, which allow us to predict results in the test subset of our data, we need to manipulate the coefficients in the table above, which will be done below.

---

In [18]:
print(f'beta = {res_log.params.values[0]:.4f}')
print(f'interceptAD = {res_log.params.values[1]:.4f}')
print(f'interceptDH = {res_log.params.values[2]:.4f}')

beta = 1.1386
interceptAD = -0.9404
interceptDH = 0.0649


In [19]:
# Work on an explicit copy so the new columns are not written to a slice of df2
df_test_1 = df2[190:].copy()
beta, interceptAD, interceptDH = res_log.params.values[:3]
linear = beta*df_test_1['lAVratio']
df_test_1['predA'] = 1/(1+np.exp(-(interceptAD - linear)))
df_test_1['predD'] = 1/(1+np.exp(-(interceptDH - linear))) - df_test_1['predA']
df_test_1['predH'] = 1 - df_test_1['predA'] - df_test_1['predD']
df_test_1


Unnamed: 0,Home,Away,Score,xG_h,xG_a,Home_goals,Away_goals,Result,hAV,aAV,lAVratio,winvalue,predA,predD,predH
190,Bournemouth,Nott'ham Forest,1–1,1.4,1.9,1,1,D,6.110638,7.525000,-0.208200,1,0.331071,0.243847,0.425082
191,Southampton,Aston Villa,0–1,0.6,1.4,0,1,A,9.129348,11.850000,-0.260834,0,0.344474,0.245020,0.410506
192,Leicester City,Brighton,2–2,0.9,1.7,2,2,D,12.915789,12.615000,0.023564,1,0.275433,0.234081,0.490486
193,West Ham,Everton,2–0,2.2,0.5,2,0,H,10.585227,10.328750,0.024528,2,0.275214,0.234026,0.490761
194,Crystal Palace,Newcastle Utd,0–0,0.3,1.2,0,0,D,8.283333,14.252632,-0.542696,1,0.420069,0.244295,0.335636
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
375,Everton,Bournemouth,1–0,1.0,0.5,1,0,H,10.328750,6.110638,0.524900,2,0.176819,0.193055,0.630126
376,Leicester City,West Ham,2–1,1.4,1.4,2,1,H,12.915789,10.585227,0.198991,2,0.237403,0.222265,0.540332
377,Aston Villa,Brighton,2–1,2.8,1.4,2,1,H,11.850000,12.615000,-0.062559,2,0.295427,0.238551,0.466023
378,Leeds United,Tottenham,1–4,1.5,2.2,1,4,A,7.343617,19.091176,-0.955395,0,0.536785,0.223218,0.239997


In [20]:
df_test_1.loc[:,'Maxprob'] =df_test_1[['predA','predD','predH']].max(axis=1)
df_test_1.loc[:,'pred_result_2']=np.where(df_test_1['Maxprob']==df_test_1['predA'],'A',\
                               np.where(df_test_1['Maxprob']==df_test_1['predD'],'D','H'))
df_test_1.loc[:,'correct_pred_2']= np.where(df_test_1['pred_result_2'] == df_test_1['Result'],1,0)
df_test_1



Unnamed: 0,Home,Away,Score,xG_h,xG_a,Home_goals,Away_goals,Result,hAV,aAV,lAVratio,winvalue,predA,predD,predH,Maxprob,pred_result_2,correct_pred_2
190,Bournemouth,Nott'ham Forest,1–1,1.4,1.9,1,1,D,6.110638,7.525000,-0.208200,1,0.331071,0.243847,0.425082,0.425082,H,0
191,Southampton,Aston Villa,0–1,0.6,1.4,0,1,A,9.129348,11.850000,-0.260834,0,0.344474,0.245020,0.410506,0.410506,H,0
192,Leicester City,Brighton,2–2,0.9,1.7,2,2,D,12.915789,12.615000,0.023564,1,0.275433,0.234081,0.490486,0.490486,H,0
193,West Ham,Everton,2–0,2.2,0.5,2,0,H,10.585227,10.328750,0.024528,2,0.275214,0.234026,0.490761,0.490761,H,1
194,Crystal Palace,Newcastle Utd,0–0,0.3,1.2,0,0,D,8.283333,14.252632,-0.542696,1,0.420069,0.244295,0.335636,0.420069,A,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
375,Everton,Bournemouth,1–0,1.0,0.5,1,0,H,10.328750,6.110638,0.524900,2,0.176819,0.193055,0.630126,0.630126,H,1
376,Leicester City,West Ham,2–1,1.4,1.4,2,1,H,12.915789,10.585227,0.198991,2,0.237403,0.222265,0.540332,0.540332,H,1
377,Aston Villa,Brighton,2–1,2.8,1.4,2,1,H,11.850000,12.615000,-0.062559,2,0.295427,0.238551,0.466023,0.466023,H,1
378,Leeds United,Tottenham,1–4,1.5,2.2,1,4,A,7.343617,19.091176,-0.955395,0,0.536785,0.223218,0.239997,0.536785,A,1


In [21]:
df_test_1['correct_pred_2'].mean()

0.5473684210526316

In [22]:
df_test_1.loc[:,'Houtcome']= np.where(df_test_1['Result']=='H',1,0)
df_test_1.loc[:,'Doutcome']= np.where(df_test_1['Result']=='D',1,0)
df_test_1.loc[:,'Aoutcome']= np.where(df_test_1['Result']=='A',1,0)
df_test_1



Unnamed: 0,Home,Away,Score,xG_h,xG_a,Home_goals,Away_goals,Result,hAV,aAV,...,winvalue,predA,predD,predH,Maxprob,pred_result_2,correct_pred_2,Houtcome,Doutcome,Aoutcome
190,Bournemouth,Nott'ham Forest,1–1,1.4,1.9,1,1,D,6.110638,7.525000,...,1,0.331071,0.243847,0.425082,0.425082,H,0,0,1,0
191,Southampton,Aston Villa,0–1,0.6,1.4,0,1,A,9.129348,11.850000,...,0,0.344474,0.245020,0.410506,0.410506,H,0,0,0,1
192,Leicester City,Brighton,2–2,0.9,1.7,2,2,D,12.915789,12.615000,...,1,0.275433,0.234081,0.490486,0.490486,H,0,0,1,0
193,West Ham,Everton,2–0,2.2,0.5,2,0,H,10.585227,10.328750,...,2,0.275214,0.234026,0.490761,0.490761,H,1,1,0,0
194,Crystal Palace,Newcastle Utd,0–0,0.3,1.2,0,0,D,8.283333,14.252632,...,1,0.420069,0.244295,0.335636,0.420069,A,0,0,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
375,Everton,Bournemouth,1–0,1.0,0.5,1,0,H,10.328750,6.110638,...,2,0.176819,0.193055,0.630126,0.630126,H,1,1,0,0
376,Leicester City,West Ham,2–1,1.4,1.4,2,1,H,12.915789,10.585227,...,2,0.237403,0.222265,0.540332,0.540332,H,1,1,0,0
377,Aston Villa,Brighton,2–1,2.8,1.4,2,1,H,11.850000,12.615000,...,2,0.295427,0.238551,0.466023,0.466023,H,1,1,0,0
378,Leeds United,Tottenham,1–4,1.5,2.2,1,4,A,7.343617,19.091176,...,0,0.536785,0.223218,0.239997,0.536785,A,1,0,0,1


In [23]:
Brier_model2 = ((df_test_1['predH'] - df_test_1['Houtcome'])**2 +(df_test_1['predD'] - df_test_1['Doutcome'])**2 +\
             (df_test_1['predA'] - df_test_1['Aoutcome'])**2).sum()/190
Brier_model2

0.5820744742362601

---

In this last chunk of code, I separated the test subset from our overall data set and then, using the model's coefficients (essentially using the model to predict), computed probabilities for each outcome in each game. Then, to generate our predicted result, we took the maximum of these probabilities (the most likely outcome) and set the predicted result equal to the outcome associated with it. Similar to the first model, we defined a variable that is equal to 1 if the prediction matched the outcome and 0 otherwise. From this, we take the mean, which is equivalent to the success rate of the model, and obtain $54.74\%$ - this is roughly $7$ percentage points higher than the first model and hence about $21$ percentage points above our benchmark value. So clearly this model performs better, but again, is that enough to deem it successful? We have now risen above the $50\%$ mark, meaning we average more than one correct prediction in every 2, but again I feel it is slightly disappointing. I think it's successful enough to see it has some merit but not enough to put out into the world and tell other people to trust what it says.
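The probability calculation described above can be checked by hand for a single fixture. The sketch below is a stand-alone illustration using the coefficients printed earlier and the $lAVratio$ of the first test fixture (Bournemouth v Nott'ham Forest); the function names are my own. It also includes a caveat that is worth verifying: my reading of the statsmodels documentation is that `OrderedModel` stores every cut point after the first as the log of its increment over the previous one (converted back by the model's `transform_threshold_params` method). If that reading is right, the second cut point would be $-0.9404 + e^{0.0649}$ rather than $0.0649$, which is shown as an alternative and labelled as an assumption.

```python
import math


def logistic(z):
    """Standard logistic function."""
    return 1.0 / (1.0 + math.exp(-z))


def outcome_probs(x, beta, cut_ad, cut_dh):
    """Return (P(away), P(draw), P(home)) from an ordered logit with two cut points."""
    p_away = logistic(cut_ad - beta * x)
    p_away_or_draw = logistic(cut_dh - beta * x)
    return p_away, p_away_or_draw - p_away, 1.0 - p_away_or_draw


# Coefficients as printed in the notebook output
beta, raw_ad, raw_dh = 1.1386, -0.9404, 0.0649
x = -0.208200  # lAVratio for Bournemouth v Nott'ham Forest

# Raw fitted parameters used directly as cut points, as in the cells above
probs_raw = outcome_probs(x, beta, raw_ad, raw_dh)

# ASSUMPTION: the second parameter is a log-increment, so the cut point is
# the first cut point plus exp(second parameter)
cut_dh_alt = raw_ad + math.exp(raw_dh)
probs_alt = outcome_probs(x, beta, raw_ad, cut_dh_alt)

print(probs_raw)
print(probs_alt)
```

The first call reproduces the table values for this fixture (about $0.331$, $0.244$ and $0.425$). Under the assumed parameterisation the away probability is unchanged but the draw probability rises and the home probability falls, so it would be sensible to compare the manual columns with `res_log.predict` on the test rows before relying on them.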

Additionally, because we computed probabilities here, we can calculate the Brier score for this model, which is found in the last few lines of code; it was found to be $0.582$, which is about $0.085$ below the benchmark value of $0.667$. This supports the claim above that the model performs better than choosing at random, but it doesn't seem as impressive as being $21$ percentage points better than choosing at random. This is because the Brier score is not linear, so it is much harder to reduce it from $0.582$ to $0.482$ than it is from $0.682$ to $0.582$, so in actual fact the two measures represent similar levels of improvement over choosing at random.
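To show what goes into that average, the sketch below computes the Brier contribution of two individual test fixtures, using the probabilities and results from the table above. The function name `match_brier` is a hypothetical helper for illustration.

```python
def match_brier(pred_home, pred_draw, pred_away, result):
    """Brier contribution of one match: squared errors summed over the three outcomes."""
    forecast = {"H": pred_home, "D": pred_draw, "A": pred_away}
    return sum(
        (p - (1.0 if label == result else 0.0)) ** 2
        for label, p in forecast.items()
    )


# Bournemouth v Nott'ham Forest finished 1–1, so the result is 'D'
drawn_match = match_brier(0.425082, 0.243847, 0.331071, "D")

# Leeds United v Tottenham finished 1–4, so the result is 'A'
away_win = match_brier(0.239997, 0.223218, 0.536785, "A")

print(round(drawn_match, 3), round(away_win, 3))
```

The drawn match contributes roughly $0.86$, worse than the $2/3$ benchmark because the model gave the draw a low probability, while the correctly called away win contributes roughly $0.32$. The overall score of $0.582$ is the mean of contributions like these over all 190 test games.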

I think where this model falters is that it doesn't have enough input variables to capture the different nuances in results that would lead to more accurate predictions. This will be discussed more in the conclusion, but unfortunately I could not decide on my own which other variables to include, and the ones I tried I could not implement. I would have liked to extend this model and explore how to improve it; however, it was still a worthwhile exercise in seeing how a basic logistic regression model performs.

I will now proceed to implement a machine learning model and see how this stacks up against the 2 previous ones.

---

## Model 3 - Machine Learning

In [24]:
df3 = df2.copy()
X = np.array(df3['lAVratio'])
X = X.reshape(-1, 1)
X_train = X[:190]
X_test  = X[190:]
y = df3['winvalue']
y_train = y[:190]
y_test  = y[190:]

In [25]:
model3 = LogisticRegression()
model3.fit(X_train, y_train)
y_pred = model3.predict(X_test)

In [26]:
y_test_1 = np.repeat('',len(y_test))
y_pred_1 = np.repeat('',len(y_pred))
y_test_1[y_test == 0] = 'A'
y_test_1[y_test == 1] = 'D'
y_test_1[y_test == 2] = 'H'
y_pred_1[y_pred == 0] = 'A'
y_pred_1[y_pred == 1] = 'D'
y_pred_1[y_pred == 2] = 'H'


pd.crosstab(y_test_1,y_pred_1,margins=True,rownames=['Result'],colnames=['Pred'])

Result (rows) / Pred (columns),A,H,All
A,24,27,51
D,16,28,44
H,14,81,95
All,54,136,190


In [27]:
(24+81)/190

0.5526315789473685

In [28]:
pd.crosstab(df_test_1['Result'],df_test_1['pred_result_2'],margins=True,rownames=['Result'],colnames=['Pred'])

Result (rows) / Pred (columns),A,H,All
A,23,28,51
D,16,28,44
H,14,81,95
All,53,137,190


In [29]:
np.where(y_pred_1 == df_test_1['pred_result_2'],0,1).sum()

1

---

This section of code is much shorter than the previous 2 and also doesn't include as much description about what different bits are, only this bit at the end here. This will be explained when I look back on the notebook overall but for now I will just explain what the code does and review how the model has performed. 

Firstly, like always, the code subsets the data into what we will use to train the model and what we will test it on. Then the model is called and fitted and, unlike the previous model, where we computed probabilities, we directly predict/generate the results. This produces a vector of numeric values, so along with the test data, I converted this into strings matching the results they predict, essentially undoing the stage where the results are converted into numbers, so that the crosstab is in a nicer format and it is much clearer what is going on. The crosstab displays all of the games in our test set and shows how many of each result was predicted compared to what the actual result was - so for example the model predicted $14$ away wins where the home team actually won. Note that there is no D column: the model never predicts a draw. The crosstab allows us to compute the success rate of the model by adding up all the games where the correct result was predicted and dividing by the total, giving us a value of $55.26\%$.
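The success-rate calculation from the crosstab can be written out explicitly. The sketch below copies the counts from the third model's crosstab into a nested dictionary (the variable names are my own, and the zero entries for predicted draws reflect the absence of a D column) and sums the diagonal.

```python
# Rows are actual results, columns are predicted results (counts from the crosstab)
confusion = {
    "A": {"A": 24, "D": 0, "H": 27},
    "D": {"A": 16, "D": 0, "H": 28},
    "H": {"A": 14, "D": 0, "H": 81},
}

# Correct predictions sit on the diagonal: actual label equals predicted label
correct = sum(confusion[label][label] for label in confusion)
total = sum(sum(row.values()) for row in confusion.values())
accuracy = correct / total

# Share of test games that were home wins, i.e. the score of always predicting 'H'
always_home = sum(confusion["H"].values()) / total

print(correct, total, round(accuracy, 4), round(always_home, 4))
```

The same table shows that home wins made up 95 of the 190 test games, so always predicting a home win would already score $50\%$. That is a tougher baseline than the $33.33\%$ from choosing at random and is worth keeping in mind when judging the $55.26\%$.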

I thought that this was especially close to the second model, which made me think it would be a good idea to compare the crosstabs of the respective models. It can clearly be seen that not only are the models very similar in success rate, they predict virtually the same number of each result for each actual result. Simply put, as the last line shows, they disagree on only one game. Now this was a very long way of showing what is fairly explicit in the code - the models themselves are only marginally different. So why is that? Well, now I'm going to move on to the conclusion where that will be answered.

---

# Conclusion

Unfortunately, what I realised doing the third model was that I don't have enough knowledge of machine learning models to create one different to what I had already done. I know there is some sort of difference, but I naively thought I could just use the package, create the model and, just like that, there would be a clear progression from 1 to 2 to 3.

The first model was a useful exercise in writing an arguably complex function on a dataset (something I hadn't done previously) and doing the usual process of generating results, calculating the success rate and evaluating. Then when it came to the second model, as mentioned, that was also a useful exercise in building a simple version of a model built on a complex idea and going through the same process in this different scenario. There was also a clear difference between the 2, which you can see in the code and also in the results found. There was also some discussion of variable selection, which is something I could definitely change after more thinking, research and planning. I would have liked to include the effect of home advantage on predictions and therefore accuracy, but I was unsuccessful: I defined a variable that equalled 1 if the home team won and 0 otherwise, but later realised that my predictions were then based on information that wasn't available prior to each game - a violation of the fundamental principle from the intro. So while it was worthwhile, there is definitely an extension I could potentially do at some point, and that would start a further discussion about which subset of a collection of variables should be included. Like my first project, it starts to sound like an overly arduous task that I don't have the time or headspace for.

For the machine learning model, I only started thinking about the variables and data that would be needed as I was finishing the second model and about to start the third one. For a long time, I didn't even think I needed any sort of additional data, and thought that I would just call the functions needed, it would do its thing on just the dataset of results, and give me predicted results with the highest accuracy out of all the models. This was clearly my inexperience and lack of knowledge. When it came to writing it, I simply decided to use the same variable as I did in the second model and therefore just see the difference between using an ordinal logistic regression model and a machine learning one. So I prepared the data and then looked for what function to call to produce the model, and it was a logistic regression. Again it seemed like there was going to be less and less difference, and after fitting it, predicting with it and comparing them, that was exactly right. This is why there was no explanation in that section, as I wanted to go through my entire thought process retrospectively.

So I had basically created 2 near identical models simply from 2 different packages. They are not completely identical, as they differ on one result, which I think is down to how the models are fitted: while they both fall under the category of logistic regression, there are points where different methods can be chosen for exactly how it will work. For example, scikit-learn's `LogisticRegression` applies L2 regularisation by default and treats the three results as unordered classes, whereas the statsmodels model is ordinal and unregularised. These are the packages' defaults, not choices I actively made. Like the second model, then, this section requires an extension of incorporating more variables and seeing what that does, which comes with all the caveats mentioned there. Then, to make this third section distinct from what the second involves, I need to go away and improve my knowledge and understanding of machine learning models so that if I ever revisit this, I can clearly demonstrate what the difference is and then compare, contrast and critically evaluate them again. Until then, this is the end of this project.

----

#### Last edited 20/10/23