# The Pythagorean Expectation
"Pythagorean expectation is a sports analytics formula devised by Bill James to estimate the percentage of games a baseball team "should" have won based on the number of runs they scored and allowed. Comparing a team's actual and Pythagorean winning percentage can be used to make predictions and evaluate which teams are over-performing and under-performing. The name comes from the formula's resemblance to the Pythagorean theorem."

## Goal:
The goal of this analysis is to apply the Pythagorean expectation to predict which teams will under- or overperform in the second half of the season. I will not go into the details of every match statistic, only the Pythagorean expectation analysis.
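
As a rough sketch of the formula behind this goal: in its classic baseball form, the expected win fraction is the points scored squared, divided by the sum of points scored squared and points allowed squared. The function name `pythagorean_expectation` and the `exponent` parameter are illustrative and not part of this notebook; the exponent of 2 is the baseball value and is only an assumption for football, where a different exponent may fit better.

```python
def pythagorean_expectation(scored: float, allowed: float, exponent: float = 2.0) -> float:
    """Expected win fraction from points (runs or goals) scored and allowed.

    exponent=2 is the classic baseball value; treating it as valid for
    football is an assumption made for this sketch.
    """
    if scored == 0 and allowed == 0:
        raise ValueError("undefined when nothing was scored or allowed")
    return scored**exponent / (scored**exponent + allowed**exponent)


# Hypothetical team that scored 30 goals and conceded 20
print(round(pythagorean_expectation(30, 20), 3))
```

Comparing this value with a team's actual win fraction is what flags it as over- or under-performing.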

## League:
I am going to focus on the English Premier League, specifically on last season, which started on 11/08/2023 and ended on 19/05/2024.

"The Premier League is the highest level of the English football league system. Contested by 20 clubs, it operates on a system of promotion and relegation with the English Football League (EFL). Seasons usually run from August to May, with each team playing 38 matches: two against each other, one home and one away. Most games are played on weekend afternoons, with occasional weekday evening fixtures."

# Import Packages

In [1]:
import warnings
warnings.filterwarnings('ignore')

import pandas as pd
import numpy as np
import statsmodels.formula.api as smf
import scipy.stats as stats
import matplotlib.pyplot as plt
import seaborn as sns
import datetime as dt

# Import data

In [2]:
fixtures = pd.read_csv('PremierLeague.csv')

pd.set_option('display.max_columns', None)

# Gather basic Information

In [3]:
# Data Size
print(f'Stats\nRows: {fixtures.shape[0]} \nCols: {fixtures.shape[1]}\n')

Stats
Rows: 380 
Cols: 8
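
The row count can be sanity-checked against the league format quoted earlier (20 clubs, each playing every other club home and away). A small sketch of that arithmetic; the variable names are illustrative, and the ten-matches-per-round figure assumes every club plays exactly once per round.

```python
CLUBS = 20

matches_per_team = 2 * (CLUBS - 1)   # home and away against each other club
total_matches = CLUBS * (CLUBS - 1)  # one home fixture per ordered pair of clubs
matches_per_round = CLUBS // 2       # assumes each club plays once per round

print(matches_per_team, total_matches, matches_per_round)
```

A full season therefore has 380 fixtures, which agrees with the 380 rows loaded here.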



In [4]:
fixtures.head()

| | round_number | league_name | name | starting_at | home_team_name | away_team_name | home_team_goals | away_team_goals |
|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Premier League | Burnley vs Manchester City | 8/11/2023 | Burnley | Manchester City | 0 | 3 |
| 1 | 1 | Premier League | Arsenal vs Nottingham Forest | 8/12/2023 | Arsenal | Nottingham Forest | 2 | 1 |
| 2 | 1 | Premier League | AFC Bournemouth vs West Ham United | 8/12/2023 | AFC Bournemouth | West Ham United | 1 | 1 |
| 3 | 1 | Premier League | Brighton & Hove Albion vs Luton Town | 8/12/2023 | Brighton & Hove Albion | Luton Town | 4 | 1 |
| 4 | 1 | Premier League | Everton vs Fulham | 8/12/2023 | Everton | Fulham | 0 | 1 |


In [5]:
fixtures.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 380 entries, 0 to 379
Data columns (total 8 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   round_number     380 non-null    int64 
 1   league_name      380 non-null    object
 2   name             380 non-null    object
 3   starting_at      380 non-null    object
 4   home_team_name   380 non-null    object
 5   away_team_name   380 non-null    object
 6   home_team_goals  380 non-null    int64 
 7   away_team_goals  380 non-null    int64 
dtypes: int64(3), object(5)
memory usage: 23.9+ KB
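
The `info()` output lists `starting_at` with dtype `object`, so the dates are still plain strings. A minimal sketch of parsing them, assuming the month/day/year layout suggested by the `head()` rows (`8/11/2023` for the 11 August opener). The two sample strings are copied from that output; the variable names are illustrative.

```python
import pandas as pd

# Two sample values copied from the head() output
starting_at = pd.Series(["8/11/2023", "8/12/2023"])

# Assumption: month/day/year order, inferred from the season starting on 11 August 2023
parsed = pd.to_datetime(starting_at, format="%m/%d/%Y")

# Show the dates in the day/month/year style used in the League section
print(parsed.dt.strftime("%d/%m/%Y").tolist())
```

Passing an explicit `format` avoids pandas guessing between day-first and month-first for ambiguous values such as `8/11/2023`.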


In [6]:
fixtures.describe().T.round(2)

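| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| round_number | 380.0 | 19.5 | 10.98 | 1.0 | 10.0 | 19.5 | 29.0 | 38.0 |
| home_team_goals | 380.0 | 1.8 | 1.37 | 0.0 | 1.0 | 2.0 | 3.0 | 6.0 |
| away_team_goals | 380.0 | 1.48 | 1.28 | 0.0 | 1.0 | 1.0 | 2.0 | 8.0 |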


In [7]:
fixtures[fixtures.duplicated()]

Unnamed: 0,round_number,league_name,name,starting_at,home_team_name,away_team_name,home_team_goals,away_team_goals


In [8]:
fixtures.isna().sum()

round_number       0
league_name        0
name               0
starting_at        0
home_team_name     0
away_team_name     0
home_team_goals    0
away_team_goals    0
dtype: int64

## Data Prep Pythagorean Expectation (First Half of the Season)

## Season 2023/2024 Premier League

### Filter rounds from 1 to 19 (half season)

In [9]:
# .copy() so the columns added below go into an independent frame, not a slice of fixtures
half_season = fixtures[fixtures['round_number'].astype(int) <= 19].copy()

In [10]:
# Home result score: 1 for a home win, 0.5 for a draw, 0 for a home loss
half_season['home_win'] = [1 if h > v else 0.5 if h == v else 0
                           for h, v in zip(half_season['home_team_goals'], half_season['away_team_goals'])]


# Away result score: 1 for an away win, 0.5 for a draw, 0 for an away loss
half_season['away_win'] = [1 if v > h else 0.5 if h == v else 0
                           for h, v in zip(half_season['home_team_goals'], half_season['away_team_goals'])]

half_season['count'] = 1

In [11]:
# Points System
WIN = 3
DRAW = 1

In [12]:
# Calculate the number of points playing home
half_season['home_points'] = [WIN if h > a else DRAW if h == a else 0
                              for h, a in zip(half_season['home_win'], half_season['away_win'])]

In [13]:
# Calculate the number of points playing away
half_season['away_points'] = [WIN if a > h else DRAW if a == h else 0
                              for h, a in zip(half_season['home_win'], half_season['away_win'])]
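
The points columns can also be derived straight from the goal columns without a Python-level loop. The following is only a sketch of that alternative using `numpy.select`, not the approach this notebook takes; the `toy` frame holds made-up scores rather than rows from `PremierLeague.csv`, and `LOSS` is an illustrative constant.

```python
import numpy as np
import pandas as pd

WIN, DRAW, LOSS = 3, 1, 0

# Toy fixtures with made-up scores (not taken from PremierLeague.csv)
toy = pd.DataFrame({
    "home_team_goals": [2, 1, 0],
    "away_team_goals": [1, 1, 3],
})

home = toy["home_team_goals"]
away = toy["away_team_goals"]

# The first matching condition wins; anything left over is a loss
toy["home_points"] = np.select([home > away, home == away], [WIN, DRAW], default=LOSS)
toy["away_points"] = np.select([away > home, home == away], [WIN, DRAW], default=LOSS)

print(toy)
```

For 190 rows the speed difference is negligible; the main benefit is that each outcome rule is stated once, next to its points value.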

In [14]:
# Summarize total matches won at home and total goals
home_team = (half_season[['home_team_name','home_win','home_team_goals','away_team_goals','home_points','count']]
                 .groupby(['home_team_name'])
                 .sum()
                 .reset_index()
                 .rename(columns={'home_team_name': 'team', 'count':'Ph', 'home_team_goals':'FTHGh','away_team_goals':'FTAGh' })
            )

In [15]:
# Summarize total matches won away and total goals
away_team = (half_season[['away_team_name','away_win','away_team_goals','home_team_goals','away_points','count']]
                 .groupby(['away_team_name'])
                 .sum()
                 .reset_index()
                 .rename(columns={'away_team_name': 'team', 'count':'Pa','away_team_goals':'FTAGa', 'home_team_goals':'FTHGa'})
            )

In [16]:
# Count home wins per team
total_home_win = (half_season[['home_team_name', 'home_win']][half_season['home_win'] == 1]
                        .groupby('home_team_name')
                        .count()
                        .reset_index()
                    )
# Count home losses per team
total_home_lost = (half_season[['home_team_name', 'home_win']][half_season['home_win'] == 0]
                        .groupby('home_team_name')
                        .count()
                        .reset_index()
                    )

# Count home draws per team
total_home_draw = (half_season[['home_team_name', 'home_win']][half_season['home_win'] == 0.5]
                        .groupby('home_team_name')
                        .count()
                        .reset_index()
                    )
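
Each of the three filtered tables above keeps only teams that have at least one match with that outcome, so a team with no home draws, for example, would be missing from `total_home_draw`. One possible way to get wins, draws and losses in a single table with explicit zeros is `pandas.crosstab`. This is a sketch on a made-up frame; `toy`, `labels` and `home_record` are illustrative names, not part of the notebook.

```python
import pandas as pd

# Made-up home results using the same 1 / 0.5 / 0 encoding as home_win
toy = pd.DataFrame({
    "home_team_name": ["A", "A", "B", "B", "A"],
    "home_win": [1, 0.5, 0, 1, 1],
})

labels = {1.0: "won", 0.5: "drawn", 0.0: "lost"}

# One row per team, one column per outcome; outcomes that never occur become 0
home_record = (
    pd.crosstab(toy["home_team_name"], toy["home_win"].map(labels))
      .reindex(columns=["won", "drawn", "lost"], fill_value=0)
)

print(home_record)
```

Here team A has no home losses and team B has no home draws, yet both appear with a zero in the relevant column.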