# Do Leeds United Perform Worse Over the Christmas Period?

My father is a mad-keen Leeds United supporter - he's been through all of the highs and the lows of the last 16 years that they've been in the footballing wilderness. It's been a staple of my life that Leeds would have a promising start to the season, and then my father would grumble pessimistically - "Just wait until Christmas". One day he asked me - is it really true that Leeds perform worse over the Christmas period? 

To answer this question, I collected data on Leeds's league results over the five years up to the end of 2019, a period they spent in the Championship rather than the Premier League. I then set my target period as the month or so following Christmas, from 26th December to 31st January of the following year.

In [10]:
from datetime import datetime, timedelta
from math import sqrt

import pandas as pd
import numpy as np

data = pd.read_excel("data.xlsx")

In [11]:
data.head()

Unnamed: 0,Season,Date,Home,Home_Score,Away_Score,Away,Leeds_Score,Opponent_Score,Points,In_Window,percent_available
0,2019,2019-08-10,Leeds,1,1,Nott Forest,1,1,1,0,0.333333
1,2019,2019-08-22,Leeds,1,0,Brentford,1,0,3,0,1.0
2,2019,2019-08-31,Leeds,0,1,Swansea,0,1,0,0,0.0
3,2019,2019-09-21,Leeds,1,1,Derby,1,1,1,0,0.333333
4,2019,2019-10-01,Leeds,1,0,West Brom,1,0,3,0,1.0
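
The `In_Window` and `percent_available` columns arrive ready-made in the spreadsheet, so how they were built is not shown here. As a minimal sketch of one way they could be derived, assuming the window is inclusive at both ends and that `percent_available` is simply points won divided by the 3 on offer (which matches the values printed above), the helper names below are hypothetical:

```python
from datetime import date


def in_christmas_window(match_date: date) -> int:
    """Return 1 if the match falls between 26 December and 31 January, else 0.

    Assumption: both end dates are inclusive.
    """
    if match_date.month == 12 and match_date.day >= 26:
        return 1
    if match_date.month == 1:
        return 1
    return 0


def percent_available(points: int) -> float:
    """Share of the 3 points on offer that were won in a single match."""
    return points / 3


print(in_christmas_window(date(2019, 8, 10)))   # opening fixture in the table above
print(in_christmas_window(date(2019, 12, 26)))  # Boxing Day
print(round(percent_available(1), 6))           # a draw
```

If the `Date` column has been parsed as datetimes, something like `data['Date'].apply(in_christmas_window)` should reproduce the flag, though that is untested against the real file.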


In [22]:
points_proportions = data.groupby('In_Window')[['percent_available']].mean()
christmas_points_proportion = points_proportions.loc[1, 'percent_available']
non_christmas_points_proportion = points_proportions.loc[0, 'percent_available']

points_proportions

In_Window,percent_available
0,0.513043
1,0.45045


The table above shows the proportion of available points that Leeds won outside and inside the Christmas period: 51.3% and 45.0% respectively. At first glance this lends some merit to our hypothesis.
However, let's make this more formal with a hypothesis test.

Let $p_{Xmas}$ and $p_{Other}$ be the true proportions of games with a given outcome (for example, a win) inside and outside the Christmas period respectively.

## Hypothesis Testing

Our hypotheses are:

$H_{0}: p_{Xmas} = p_{Other}$

$H_{1}: p_{Xmas} < p_{Other}$

We will use a two-sample Z test of proportions. We're going to do this twice: once for the case of drawing or winning, and once for the case of winning outright.

In [13]:
data['draw_or_win'] = np.where(data['Points'] > 0, 1, 0)

In [14]:
data.head()

Unnamed: 0,Season,Date,Home,Home_Score,Away_Score,Away,Leeds_Score,Opponent_Score,Points,In_Window,percent_available,draw_or_win
0,2019,2019-08-10,Leeds,1,1,Nott Forest,1,1,1,0,0.333333,1
1,2019,2019-08-22,Leeds,1,0,Brentford,1,0,3,0,1.0,1
2,2019,2019-08-31,Leeds,0,1,Swansea,0,1,0,0,0.0,0
3,2019,2019-09-21,Leeds,1,1,Derby,1,1,1,0,0.333333,1
4,2019,2019-10-01,Leeds,1,0,West Brom,1,0,3,0,1.0,1


In [25]:
draw_or_win_proportions = data.groupby('In_Window')['draw_or_win'].mean()
christmas_draw_or_win = draw_or_win_proportions[1]
non_christmas_draw_or_win = draw_or_win_proportions[0]

draw_or_win_proportions

In_Window
0    0.660870
1    0.648649
Name: draw_or_win, dtype: float64

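The test statistic, computed from the sample proportions $\hat{p}$ and the number of games $n$ in each period, is:

$Z = \frac{\hat{p}_{Xmas} - \hat{p}_{Other}}{\sqrt{\frac{\hat{p}_{Xmas}(1-\hat{p}_{Xmas})}{n_{Xmas}}+\frac{\hat{p}_{Other}(1-\hat{p}_{Other})}{n_{Other}}}}$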

In [26]:
# These give us the sample sizes - we can see the sample size for the Christmas period
# is lower as expected.
sample_sizes = data.groupby('In_Window')[['draw_or_win']].count()
christmas_sample_size = sample_sizes.loc[1, 'draw_or_win']
non_christmas_sample_size = sample_sizes.loc[0, 'draw_or_win']

In [28]:
# Calculate the test statistic (unpooled variance of the difference in proportions)
variance = (non_christmas_draw_or_win * (1 - non_christmas_draw_or_win) / non_christmas_sample_size
            + christmas_draw_or_win * (1 - christmas_draw_or_win) / christmas_sample_size)
Z = (christmas_draw_or_win - non_christmas_draw_or_win) / sqrt(variance)
round(Z, 2)

-0.14

This Z score is nowhere near large enough in magnitude to call the difference statistically significant: a one-sided test at the 5% level needs $Z < -1.645$. That is unsurprising given the relatively small sample size and a gap of only about one percentage point.
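
To put a number on "not nearly large enough", the Z score can be converted into a one-sided p-value. The sketch below is one way to do that with the standard library. The function name is hypothetical, and the counts (24 of 37 games in the window, 152 of 230 outside it) are an assumption: they are back-calculated from the printed proportions and Z score rather than read from the data.

```python
from math import sqrt
from statistics import NormalDist


def unpooled_z_test(p_xmas, n_xmas, p_other, n_other):
    """Two-sample Z test of proportions with an unpooled standard error.

    Returns the Z statistic and the lower-tail p-value for H1: p_xmas < p_other.
    """
    standard_error = sqrt(
        p_xmas * (1 - p_xmas) / n_xmas + p_other * (1 - p_other) / n_other
    )
    z = (p_xmas - p_other) / standard_error
    p_value = NormalDist().cdf(z)  # area to the left of z under N(0, 1)
    return z, p_value


# Assumed counts, inferred from the proportions shown above
z, p_value = unpooled_z_test(24 / 37, 37, 152 / 230, 230)
critical_value = NormalDist().inv_cdf(0.05)  # one-sided 5% cut-off

print(round(z, 2), round(p_value, 2), round(critical_value, 3))
```

A p-value far above 0.05 says the same thing as the Z score: a gap this small would turn up very often by chance even if Christmas made no difference.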

Let's see if there is a larger effect for winning games.

In [29]:
data['win'] = np.where(data['Points'] > 2, 1, 0)

In [30]:
data.head()

Unnamed: 0,Season,Date,Home,Home_Score,Away_Score,Away,Leeds_Score,Opponent_Score,Points,In_Window,percent_available,draw_or_win,win
0,2019,2019-08-10,Leeds,1,1,Nott Forest,1,1,1,0,0.333333,1,0
1,2019,2019-08-22,Leeds,1,0,Brentford,1,0,3,0,1.0,1,1
2,2019,2019-08-31,Leeds,0,1,Swansea,0,1,0,0,0.0,0,0
3,2019,2019-09-21,Leeds,1,1,Derby,1,1,1,0,0.333333,1,0
4,2019,2019-10-01,Leeds,1,0,West Brom,1,0,3,0,1.0,1,1


In [35]:
win_proportions = data.groupby('In_Window')['win'].mean()
christmas_win_proportion = win_proportions[1]
non_christmas_win_proportion = win_proportions[0]

win_proportions

In_Window
0    0.439130
1    0.351351
Name: win, dtype: float64

We can reuse the sample sizes from before, as the number of games inside and outside of the window has not changed.

In [36]:
# Calculate the test statistic (unpooled variance of the difference in proportions)
variance = (non_christmas_win_proportion * (1 - non_christmas_win_proportion) / non_christmas_sample_size
            + christmas_win_proportion * (1 - christmas_win_proportion) / christmas_sample_size)
Z = (christmas_win_proportion - non_christmas_win_proportion) / sqrt(variance)
round(Z, 2)

-1.03

Once again there is insufficient evidence at the 5% significance level (95% confidence) to reject our null hypothesis, since $Z = -1.03$ is well above the one-sided critical value of $-1.645$.
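
The statistic used in this post estimates the variance separately for each period. A common textbook alternative pools the two samples into a single proportion, since under $H_{0}$ both periods share the same win rate. The sketch below shows that variant as a cross-check, not as the method used above. The function name is hypothetical, and the win counts (13 of 37 in the window, 101 of 230 outside it) are an assumption back-calculated from the printed proportions.

```python
from math import sqrt
from statistics import NormalDist


def pooled_z_test(wins_xmas, n_xmas, wins_other, n_other):
    """Two-sample Z test of proportions using the pooled proportion under H0.

    Returns the Z statistic and the lower-tail p-value for H1: p_xmas < p_other.
    """
    p_xmas = wins_xmas / n_xmas
    p_other = wins_other / n_other
    # Under H0 both periods share one win rate, estimated from all games
    p_pooled = (wins_xmas + wins_other) / (n_xmas + n_other)
    standard_error = sqrt(p_pooled * (1 - p_pooled) * (1 / n_xmas + 1 / n_other))
    z = (p_xmas - p_other) / standard_error
    return z, NormalDist().cdf(z)


# Assumed counts, inferred from the win proportions shown above
z_pooled, p_value_pooled = pooled_z_test(13, 37, 101, 230)
print(round(z_pooled, 2), round(p_value_pooled, 2))
```

With these counts the pooled statistic comes out close to the unpooled one, so the conclusion does not depend on which version is used.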

This is a highly simplistic model that only takes into account the end result of each game. To do this analysis more thoroughly, we should also take the following factors into account:

- Table Position
- Opponent Table Position
- Number of Injuries
- Home / Away

An analysis can then be done on whether Leeds underperform or outperform their expected performance during the Christmas period.
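
Of the factors listed, home versus away is the only one already present in the data, since the `Home` column says who hosted. As a rough sketch of a first step, the points proportion can be split by both window and venue. The rows below are made-up toy values in the same shape as the real table, not actual results, and `is_home` is a hypothetical column name.

```python
import pandas as pd

# Made-up toy rows for illustration only; these are NOT real Leeds results
toy = pd.DataFrame({
    "Home": ["Leeds", "Team A", "Leeds", "Team B", "Leeds", "Team C"],
    "Away": ["Team A", "Leeds", "Team B", "Leeds", "Team C", "Leeds"],
    "Points": [3, 0, 1, 3, 0, 1],
    "In_Window": [0, 0, 0, 1, 1, 1],
})

# Flag home games and recompute the share of available points per match
toy["is_home"] = (toy["Home"] == "Leeds").astype(int)
toy["percent_available"] = toy["Points"] / 3

# Rows: outside/inside the window; columns: away (0) / home (1)
breakdown = (
    toy.groupby(["In_Window", "is_home"])["percent_available"]
    .mean()
    .unstack("is_home")
)
print(breakdown)
```

On the real data, each cell would hold far fewer games than the overall comparison, so any differences would need even more caution than the tests above.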

Until then, all I can say to my dad is that it's all in his head! Cheer up!

Cheers,

Justin Smallwood

justin.d.smallwood@gmail.com