<a href="https://colab.research.google.com/github/ola-sumbo/NG-task3/blob/master/Predicting_Football.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**A Poisson Regression Model:**
Our model is founded on the assumption that the number of goals a team scores in a match can be approximated by a Poisson distribution.

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn
from scipy.stats import poisson,skellam

In [2]:
epl_1617 = pd.read_csv("http://www.football-data.co.uk/mmz4281/1617/E0.csv")
epl_1617 = epl_1617[['HomeTeam','AwayTeam','FTHG','FTAG']] # select columns needed
epl_1617 = epl_1617.rename(columns={'FTHG': 'HomeGoals', 'FTAG': 'AwayGoals'})
epl_1617.head()

,HomeTeam,AwayTeam,HomeGoals,AwayGoals
0,Burnley,Swansea,0,1
1,Crystal Palace,West Brom,0,1
2,Everton,Tottenham,1,1
3,Hull,Leicester,2,1
4,Man City,Sunderland,2,1


In [3]:
# hold out the last 10 games of the season; they are the games we want to predict
epl_1617 = epl_1617[:-10]
epl_1617[['HomeGoals', 'AwayGoals']].mean()  # average only the numeric goal columns

HomeGoals    1.591892
AwayGoals    1.183784
dtype: float64
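
To make the Poisson assumption concrete, here is a minimal sketch of the goal-count probabilities that these two averages imply. The means are copied from the output above; the variable names are illustrative and not part of the original notebook.

```python
import numpy as np
from scipy.stats import poisson

# Average goals per game, copied from the output above (first 370 games).
home_mean, away_mean = 1.591892, 1.183784

goals = np.arange(0, 6)  # 0 to 5 goals
home_pmf = poisson.pmf(goals, home_mean)
away_pmf = poisson.pmf(goals, away_mean)

# Under the Poisson assumption, the mean alone fixes every probability.
for k, p_home, p_away in zip(goals, home_pmf, away_pmf):
    print(f"{k} goals: home {p_home:.3f}, away {p_away:.3f}")
```

Because the away mean is lower, the away side is more likely than the home side to finish with zero goals.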

In [4]:
# probability of a draw: the goal difference (home minus away) follows a Skellam distribution
skellam.pmf(0, epl_1617['HomeGoals'].mean(), epl_1617['AwayGoals'].mean())

0.24809376810717076

In [5]:
# probability of the home team winning by exactly one goal
skellam.pmf(1, epl_1617['HomeGoals'].mean(), epl_1617['AwayGoals'].mean())

0.22706765807563964
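
The Skellam distribution describes the difference of two independent Poisson variables, so the two probabilities above can be cross-checked by summing products of Poisson probabilities directly. This is a sketch under that independence assumption, with the means copied from the earlier output and the sums truncated at 30 goals.

```python
import numpy as np
from scipy.stats import poisson, skellam

# Means copied from the earlier output (first 370 games of the season).
home_mean, away_mean = 1.591892, 1.183784

goals = np.arange(0, 31)  # truncate at 30 goals; the tail beyond that is negligible
home_pmf = poisson.pmf(goals, home_mean)
away_pmf = poisson.pmf(goals, away_mean)

# Draw: both sides score k goals, summed over k.
draw_by_sum = np.sum(home_pmf * away_pmf)
# Home win by one: home scores k + 1 while away scores k.
home_by_one_sum = np.sum(home_pmf[1:] * away_pmf[:-1])

print(draw_by_sum, skellam.pmf(0, home_mean, away_mean))
print(home_by_one_sum, skellam.pmf(1, home_mean, away_mean))
```

Each printed pair should agree closely, and both should match the values of roughly 0.248 and 0.227 shown above.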

In [6]:
# importing the tools required for the Poisson regression model
import statsmodels.api as sm
import statsmodels.formula.api as smf


In [7]:
goal_model_data = pd.concat([epl_1617[['HomeTeam', 'AwayTeam', 'HomeGoals']].assign(home=1).rename(columns={'HomeTeam':'team','AwayTeam':'opponent', 'HomeGoals':'goals'}), 
                             epl_1617[['AwayTeam','HomeTeam','AwayGoals']].assign(home=0).rename(columns={'AwayTeam':'team', 'HomeTeam':'opponent','AwayGoals':'goals'})])


In [8]:
poisson_model = smf.glm(formula = "goals ~ home + team + opponent ", data = goal_model_data, family = sm.families.Poisson()).fit()

In [9]:
poisson_model.summary()

Dep. Variable:,goals,No. Observations:,740.0
Model:,GLM,Df Residuals:,700.0
Model Family:,Poisson,Df Model:,39.0
Link Function:,log,Scale:,1.0
Method:,IRLS,Log-Likelihood:,-1042.4
Date:,"Tue, 11 May 2021",Deviance:,776.11
Time:,21:42:33,Pearson chi2:,659.0
No. Iterations:,5,,
Covariance Type:,nonrobust,,

,coef,std err,z,P>|z|,[0.025,0.975]
Intercept,0.3725,0.198,1.880,0.060,-0.016,0.761
team[T.Bournemouth],-0.2891,0.179,-1.612,0.107,-0.641,0.062
team[T.Burnley],-0.6458,0.200,-3.230,0.001,-1.038,-0.254
team[T.Chelsea],0.0789,0.162,0.488,0.626,-0.238,0.396
team[T.Crystal Palace],-0.3865,0.183,-2.107,0.035,-0.746,-0.027
team[T.Everton],-0.2008,0.173,-1.161,0.246,-0.540,0.138
team[T.Hull],-0.7006,0.204,-3.441,0.001,-1.100,-0.302
team[T.Leicester],-0.4204,0.187,-2.249,0.025,-0.787,-0.054
team[T.Liverpool],0.0162,0.164,0.099,0.921,-0.306,0.338


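**Explain:** The `coef` column holds the fitted coefficients of the Poisson regression (not a logistic regression), and it is the output we are interested in. Because the model uses a log link, each coefficient acts on the log of the expected number of goals, so exponentiating it gives a multiplicative effect: a positive value means more goals, a negative value means fewer, and a value close to 0 means almost no effect. Here `home` is 0.2969, so a team playing at home is expected to score about exp(0.2969) ≈ 1.35 times as many goals as it would away, all else being equal. For more on GLMs, see the [statsmodels GLM formula example](https://www.statsmodels.org/stable/examples/notebooks/generated/glm_formula.html).
Finally, the `opponent*` values penalize or reward teams based on the quality of the opposition, which reflects the defensive strength of each opponent (Chelsea: -0.3036; Sunderland: 0.3707). In other words, you’re less likely to score against Chelsea and more likely to score against Sunderland. Hopefully, that all makes both statistical and intuitive sense.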
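
As a worked example of reading the coefficients, the sketch below adds up the relevant terms on the log scale and exponentiates the total. The four numbers are the ones quoted in this notebook; the variable names are illustrative only.

```python
import math

# Coefficients quoted in the model summary and the discussion above.
intercept = 0.3725
home_adv = 0.2969
team_chelsea = 0.0789
opponent_sunderland = 0.3707

# A log link means effects add on the log scale and multiply on the goal scale.
home_multiplier = math.exp(home_adv)
chelsea_home_vs_sunderland = math.exp(
    intercept + home_adv + team_chelsea + opponent_sunderland
)

print(round(home_multiplier, 2))             # home-advantage multiplier
print(round(chelsea_home_vs_sunderland, 2))  # expected Chelsea goals at home to Sunderland
```

The second figure should land close to the model's own prediction for this fixture (about 3.06), with any small gap coming from the coefficients being rounded to four decimals.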


In [10]:
poisson_model.predict(pd.DataFrame(data={'team': 'Chelsea', 'opponent': 'Sunderland',
                                       'home':1},index=[1])) # expected goals for Chelsea at home to Sunderland: about 3.06

1    3.061662
dtype: float64

In [11]:
poisson_model.predict(pd.DataFrame(data={'team': 'Sunderland', 'opponent': 'Chelsea',
                                       'home':0},index=[1]))

1    0.409373
dtype: float64

**Using two independent Poisson distributions, one per team, I will wrap the scoreline probabilities in a function called simulate_match**

In [20]:
def simulate_match(foot_model, homeTeam, awayTeam, max_goals=10):
    # expected goals for each side from the fitted model
    home_goals_avg = foot_model.predict(pd.DataFrame(data={'team': homeTeam, 'opponent': awayTeam, 'home': 1}, index=[1])).values[0]
    away_goals_avg = foot_model.predict(pd.DataFrame(data={'team': awayTeam, 'opponent': homeTeam, 'home': 0}, index=[1])).values[0]
    # Poisson probability of each side scoring 0..max_goals goals
    team_pred = [[poisson.pmf(i, team_avg) for i in range(0, max_goals+1)] for team_avg in [home_goals_avg, away_goals_avg]]
    # rows index home goals, columns index away goals; the two scores are treated as independent
    return np.outer(np.array(team_pred[0]), np.array(team_pred[1]))

In [22]:
simulate_match(poisson_model, 'Chelsea', 'Sunderland', max_goals=3)

array([[0.03108485, 0.01272529, 0.00260469, 0.00035543],
       [0.0951713 , 0.03896054, 0.00797469, 0.00108821],
       [0.14569118, 0.059642  , 0.01220791, 0.00166586],
       [0.14868571, 0.06086788, 0.01245883, 0.0017001 ]])
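
One thing to keep in mind is that `max_goals` truncates the matrix, so its entries do not sum to 1. The sketch below rebuilds the matrix from the two expected-goal values predicted above (3.061662 and 0.409373) without needing the fitted model, and compares the probability mass kept by two cut-offs. The helper name `scoreline_matrix` is hypothetical and not part of the original notebook.

```python
import numpy as np
from scipy.stats import poisson

# Expected goals copied from the two model predictions above.
home_avg, away_avg = 3.061662, 0.409373

def scoreline_matrix(home_avg, away_avg, max_goals):
    """Outer product of two Poisson pmfs: rows = home goals, columns = away goals."""
    goals = np.arange(max_goals + 1)
    return np.outer(poisson.pmf(goals, home_avg), poisson.pmf(goals, away_avg))

matrix_3 = scoreline_matrix(home_avg, away_avg, 3)
mass_3 = matrix_3.sum()
mass_10 = scoreline_matrix(home_avg, away_avg, 10).sum()

print(matrix_3[0, 0])   # P(0-0), to compare with the first entry above
print(mass_3, mass_10)  # share of total probability captured by each cut-off
```

With `max_goals=3` a sizeable share of the probability is left out because Chelsea are expected to score about three goals, which is why a larger cut-off is the safer choice when summing outcome probabilities.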

**Each entry is the probability of one scoreline, with rows indexing Chelsea's goals and columns indexing Sunderland's. For example, along the diagonal, both teams score the same number of goals (e.g. P(0-0)=0.031).**

Everything below the diagonal represents a Chelsea victory (e.g. P(3-0)=0.149). If you prefer Over/Under markets, you can estimate P(Under 2.5 goals) by summing the entries where the sum of the column number and row number (both starting at zero) is less than 3 (i.e. the 6 values that form the upper left triangle).
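
The Under 2.5 calculation can be sketched with a boolean mask over the row and column indices. The array below is the `max_goals=3` matrix copied from the output above, and the variable names are illustrative.

```python
import numpy as np

# Scoreline matrix copied from the simulate_match output above
# (rows: Chelsea goals, columns: Sunderland goals).
chel_sun_3 = np.array([
    [0.03108485, 0.01272529, 0.00260469, 0.00035543],
    [0.0951713 , 0.03896054, 0.00797469, 0.00108821],
    [0.14569118, 0.059642  , 0.01220791, 0.00166586],
    [0.14868571, 0.06086788, 0.01245883, 0.0017001 ],
])

# Index grids: home_goals[i, j] = i and away_goals[i, j] = j.
home_goals, away_goals = np.indices(chel_sun_3.shape)

# Under 2.5 goals means the total is 0, 1 or 2: the six upper-left entries.
under_mask = home_goals + away_goals < 3
under_2_5 = chel_sun_3[under_mask].sum()
over_2_5 = 1 - under_2_5  # valid because every total below 3 lies inside the matrix

print(round(under_2_5, 3), round(over_2_5, 3))
```

For this fixture the model puts Under 2.5 goals at roughly one in three.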

In [23]:
# Simulate Chelsea_Sunderland
chel_sun = simulate_match(poisson_model, "Chelsea", "Sunderland", max_goals=10)

In [25]:
# chelsea win
np.sum(np.tril(chel_sun, -1))

0.8885986612364136

In [26]:
# draw
np.sum(np.diag(chel_sun))

0.08409349268649578

In [27]:
# sunderland win
np.sum(np.triu(chel_sun, 1)) # our model gives Sunderland a 2.7% chance of winning

0.02696181994285301

All of this rests on the assumption that goals follow a Poisson distribution. If that assumption is misguided, then the model outputs will be unreliable. One consequence of the assumption is easy to test: if events occur at a constant rate and their count over a time period follows a Poisson distribution with mean λ, then the count over half that period follows a Poisson distribution with mean λ/2. In football terms, according to our Poisson model, we should expect an equal number of goals in the first and second halves. Unfortunately, that doesn’t appear to hold true.
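
The halving property itself can be checked numerically, independent of the match data: adding two independent Poisson(λ/2) counts should reproduce a Poisson(λ) count. In this sketch λ is assumed to be the sum of the home and away averages computed earlier, although any positive value would work.

```python
import numpy as np
from scipy.stats import poisson

# Assumption: total goals per match, taken as the sum of the home and away
# averages computed earlier (1.591892 + 1.183784).
lam = 1.591892 + 1.183784
goals = np.arange(0, 31)  # truncate the support at 30 goals

half_pmf = poisson.pmf(goals, lam / 2)  # goals in a single half
# Distribution of the sum of two independent halves (a discrete convolution).
two_halves_pmf = np.convolve(half_pmf, half_pmf)[:len(goals)]
full_pmf = poisson.pmf(goals, lam)      # goals in the full match

max_gap = np.max(np.abs(two_halves_pmf - full_pmf))
print(max_gap)  # largest pointwise difference between the two distributions
```

The difference is at the level of floating-point error, so any imbalance between halves in the real data points to a failure of the constant-rate assumption rather than of the arithmetic.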

In [29]:
epl_1617_halves = pd.read_csv("http://www.football-data.co.uk/mmz4281/1617/E0.csv")
epl_1617_halves = epl_1617_halves[['FTHG', 'FTAG', 'HTHG', 'HTAG']]
epl_1617_halves['FHgoals'] = epl_1617_halves['HTHG'] + epl_1617_halves['HTAG']
epl_1617_halves['SHgoals'] = epl_1617_halves['FTHG'] + epl_1617_halves['FTAG'] - epl_1617_halves['FHgoals']
epl_1617_halves = epl_1617_halves[['FHgoals', 'SHgoals']]