# EPL DATA ANALYSIS

## Introduction 

The purpose of this notebook is to analyze the impact of COVID-19 on the performance of English Premier League (EPL) teams. The dataset contains match information from:
* 2018/2019 season (the season before the COVID-19 interruption)
* 2019/2020 season (the season interrupted by COVID-19)
* 2020/2021 season (the season after the COVID-19 interruption)

Dates of the COVID-19 interruption:
* last match before the interruption = Leicester vs Aston Villa, 10 March 2020 (match ID 46889)
* first match after the interruption = Aston Villa vs Sheffield United, 18 June 2020 (match ID 46875)

Analysis objectives:
* How the absence of spectators affected the performance of EPL teams, measured by:
    * Home/Away Possession
    * Home/Away Shots on Target
    * Home/Away Shots
    * Home/Away Goals Scored per Game
    * Home/Away Win Rate
    * Shot on Target % = Shots on Target / Shots (Home/Away)
    * Quantity Conversion Rate = Goals / Shots (Home/Away)
    * Quality Conversion Rate  = Goals / Shots on Target (Home/Away)
    * Win Rate = No. of wins / No. of games (Home/Away)
    * Points per game (Ppg) = Average points per game (Home/Away)
* Relation of Ball Possession % vs Goal Conversion Rate
* A similar analysis restricted to matches from the 2019/2020 season onward (when VAR was introduced) was also attempted, but the sample size is too small and the before/after samples may be imbalanced.
* Compare the performance metrics between "big-6" and "non big-6" teams before and after COVID-19. (Note: Big-6 = "Manchester United", "Manchester City", "Liverpool", "Arsenal", "Chelsea", "Tottenham Hotspur")
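
The ratio metrics listed above can be sketched on a few rows as follows. This is only an illustration: the `toy` frame and its values are made up (the column names follow the dataset), and computing each ratio from summed totals rather than averaging per-match ratios is an assumption, since the objectives do not say which aggregation is intended. A win is taken to be a match worth 3 points.

```python
import pandas as pd

# Toy rows for illustration only; the values are made up, not taken from the dataset.
toy = pd.DataFrame({
    "home_score": [2, 0, 1],
    "home_shots_on_target": [4, 6, 2],
    "home_shots": [12, 15, 8],
    "home_points": [3, 0, 1],
})

# Assumption: ratios are computed from totals over all matches.
totals = toy.sum()
home_metrics = pd.Series({
    "shot_on_target_pct": totals["home_shots_on_target"] / totals["home_shots"],
    "quantity_conversion": totals["home_score"] / totals["home_shots"],
    "quality_conversion": totals["home_score"] / totals["home_shots_on_target"],
    "win_rate": (toy["home_points"] == 3).mean(),  # share of matches won
    "ppg": toy["home_points"].mean(),              # average points per game
})
print(home_metrics)
```

The away-side metrics would follow the same pattern with the `away_*` columns.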

In [78]:
import pandas as pd
# Display floats with two decimal places
pd.set_option('display.float_format', '{:.2f}'.format)

In [79]:
epl_data = pd.read_csv('epl_data_cleaned.csv', index_col=[0])

## Change Datatype

In [80]:
epl_data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1140 entries, 0 to 1139
Data columns (total 16 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   match_id              1140 non-null   int64  
 1   match_date            1140 non-null   object 
 2   matchweek             1140 non-null   object 
 3   home_team             1140 non-null   object 
 4   away_team             1140 non-null   object 
 5   season                1140 non-null   object 
 6   home_score            1140 non-null   int64  
 7   away_score            1140 non-null   int64  
 8   home_possession       1140 non-null   float64
 9   away_possession       1140 non-null   float64
 10  home_shots_on_target  1140 non-null   int64  
 11  away_shots_on_target  1140 non-null   int64  
 12  home_shots            1140 non-null   int64  
 13  away_shots            1140 non-null   int64  
 14  home_points           1140 non-null   float64
 15  away_points          

In [81]:
string_cols = ['matchweek','home_team','away_team','season']
int_cols = ['home_score','away_score','home_shots_on_target','away_shots_on_target',
            'home_shots','away_shots','home_points','away_points']
float_cols = ['home_possession','away_possession']
datetime_col = 'match_date'

# convert each group of columns to its corresponding datatype
epl_data[string_cols] = epl_data[string_cols].astype("string")
epl_data[int_cols] = epl_data[int_cols].astype("int")
epl_data[float_cols] = epl_data[float_cols].astype("float")

# parse the date strings into datetime values
epl_data[datetime_col] = pd.to_datetime(epl_data[datetime_col], format='%Y-%m-%d')

epl_data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1140 entries, 0 to 1139
Data columns (total 16 columns):
 #   Column                Non-Null Count  Dtype         
---  ------                --------------  -----         
 0   match_id              1140 non-null   int64         
 1   match_date            1140 non-null   datetime64[ns]
 2   matchweek             1140 non-null   string        
 3   home_team             1140 non-null   string        
 4   away_team             1140 non-null   string        
 5   season                1140 non-null   string        
 6   home_score            1140 non-null   int32         
 7   away_score            1140 non-null   int32         
 8   home_possession       1140 non-null   float64       
 9   away_possession       1140 non-null   float64       
 10  home_shots_on_target  1140 non-null   int32         
 11  away_shots_on_target  1140 non-null   int32         
 12  home_shots            1140 non-null   int32         
 13  away_shots        
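
The `int32` columns in the output above are most likely a result of `astype("int")` picking the platform's default integer width, which is 32-bit on Windows with older NumPy versions. If a fixed width matters, the target dtype can be named explicitly. The following is a minimal sketch on a toy frame (the values are illustrative, not from the dataset) that also passes one column-to-dtype mapping to `astype` instead of converting group by group.

```python
import pandas as pd

# Toy frame standing in for epl_data; the values are illustrative.
df = pd.DataFrame({
    "home_score": [2.0, 0.0],
    "home_points": [3.0, 0.0],
    "match_date": ["2018-08-11", "2018-08-12"],
})

# One mapping from column name to target dtype; "int64" fixes the width explicitly.
dtypes = {"home_score": "int64", "home_points": "int64"}
df = df.astype(dtypes)

# Dates are parsed separately because they need a format string.
df["match_date"] = pd.to_datetime(df["match_date"], format="%Y-%m-%d")

print(df.dtypes)
```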

In [82]:
# label each match as played before or after the covid-19 interruption (last match before: 2020-03-10)
epl_data.loc[epl_data['match_date']<='2020-03-10','covid19'] = 'before'
epl_data.loc[epl_data['match_date']>'2020-03-10','covid19'] = 'after'

epl_data

,match_id,match_date,matchweek,home_team,away_team,season,home_score,away_score,home_possession,away_possession,home_shots_on_target,away_shots_on_target,home_shots,away_shots,home_points,away_points,covid19
0,38309,2018-08-11,Matchweek 1,AFC Bournemouth,Cardiff City,2018/2019,2,0,62.90,37.10,4,1,12,10,3,0,before
1,38310,2018-08-11,Matchweek 1,Fulham,Crystal Palace,2018/2019,0,2,66.30,33.70,6,10,15,12,0,3,before
2,38311,2018-08-11,Matchweek 1,Huddersfield Town,Chelsea,2018/2019,0,3,37.20,62.80,1,4,6,13,0,3,before
3,38313,2018-08-11,Matchweek 1,Manchester United,Leicester City,2018/2019,2,1,46.30,53.70,6,4,8,13,3,0,before
4,38314,2018-08-11,Matchweek 1,Newcastle United,Tottenham Hotspur,2018/2019,1,2,40.40,59.60,2,5,15,15,0,3,before
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1135,59272,2021-05-23,Matchweek 38,Manchester City,Everton,2020/2021,5,0,67.70,32.30,11,3,21,8,3,0,after
1136,59273,2021-05-23,Matchweek 38,Sheffield United,Burnley,2020/2021,1,0,43.00,57.00,3,3,12,10,3,0,after
1137,59274,2021-05-23,Matchweek 38,West Ham United,Southampton,2020/2021,3,0,36.90,63.10,7,5,14,17,3,0,after
1138,59275,2021-05-23,Matchweek 38,Wolverhampton Wanderers,Manchester United,2020/2021,1,2,57.20,42.80,4,4,14,9,0,3,after
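
As an alternative to the two `.loc` assignments, the label can be built in a single expression with `np.where`, followed by a check that the two boundary dates from the introduction land on the expected side. This is a sketch on toy dates, and the constant name `LAST_MATCH_BEFORE` is hypothetical. It relies on `match_date` having no missing values, which the `info()` output suggests.

```python
import numpy as np
import pandas as pd

# Toy dates around the interruption; the two boundary dates come from the introduction.
df = pd.DataFrame({
    "match_date": pd.to_datetime(
        ["2020-03-07", "2020-03-10", "2020-06-18", "2021-05-23"]
    )
})

LAST_MATCH_BEFORE = pd.Timestamp("2020-03-10")  # Leicester vs Aston Villa
FIRST_MATCH_AFTER = pd.Timestamp("2020-06-18")  # Aston Villa vs Sheffield United

# Every row receives exactly one of the two labels.
df["covid19"] = np.where(df["match_date"] <= LAST_MATCH_BEFORE, "before", "after")

# Sanity check: the boundary matches fall on the expected side of the split.
assert df.loc[df["match_date"] == LAST_MATCH_BEFORE, "covid19"].eq("before").all()
assert df.loc[df["match_date"] == FIRST_MATCH_AFTER, "covid19"].eq("after").all()
```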


## Data Analysis

### Home / Away Ball Possession %

In [100]:
#home possession before and after covid
h_pos = pd.DataFrame()
h_pos['home_possession_before'] = epl_data.loc[epl_data['covid19']=='before'].groupby('home_team')\
                                        ['home_possession'].mean()
h_pos['home_possession_after'] = epl_data.loc[epl_data['covid19']=='after'].groupby('home_team')\
                                        ['home_possession'].mean()
h_pos['home_possession_changes'] = h_pos['home_possession_after']-h_pos['home_possession_before']

# drop rows with NaN, i.e. teams that did not play both before and after covid-19
h_pos.dropna(inplace=True)

#define covid19 effect on home possession
h_pos.loc[h_pos['home_possession_changes']>0,'effect'] = 'positive'  
h_pos.loc[h_pos['home_possession_changes']<0,'effect'] = 'negative'  
h_pos.loc[h_pos['home_possession_changes']==0,'effect'] = 'not affected'  

h_pos.sort_values('home_possession_changes',ascending=True)

home_team,home_possession_before,home_possession_after,home_possession_changes,effect
Tottenham Hotspur,58.73,50.82,-7.91,negative
Watford,45.67,38.72,-6.95,negative
Arsenal,59.0,52.74,-6.26,negative
Sheffield United,47.81,42.5,-5.3,negative
Everton,52.55,47.93,-4.63,negative
Crystal Palace,45.98,41.8,-4.18,negative
West Ham United,47.63,43.6,-4.03,negative
Chelsea,62.71,60.49,-2.21,negative
Aston Villa,48.42,46.3,-2.12,negative
Manchester City,67.76,66.72,-1.04,negative


In [96]:
h_pos.value_counts('effect')

effect
negative    11
positive    10
dtype: int64
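
The three `.loc` assignments that fill the `effect` column can also be expressed as one mapping over the sign of the change. A minimal sketch, with hypothetical team names and values:

```python
import numpy as np
import pandas as pd

# Hypothetical changes in possession (percentage points) for three teams.
changes = pd.Series([-1.5, 0.0, 2.5], index=["Team A", "Team B", "Team C"])

# np.sign yields -1.0, 0.0 or 1.0, which is then mapped to a label.
effect = np.sign(changes).map(
    {-1.0: "negative", 0.0: "not affected", 1.0: "positive"}
)
print(effect)
```

Rows whose change is NaN stay NaN under this mapping, the same as with the `.loc` version.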

In [104]:
#away possession before and after covid
a_pos = pd.DataFrame()
a_pos['away_possession_before'] = epl_data.loc[epl_data['covid19']=='before'].groupby('away_team')\
                                        ['away_possession'].mean()
a_pos['away_possession_after'] = epl_data.loc[epl_data['covid19']=='after'].groupby('away_team')\
                                        ['away_possession'].mean()
a_pos['away_possession_changes'] = a_pos['away_possession_after']-a_pos['away_possession_before']

# drop rows with NaN, i.e. teams that did not play both before and after covid-19
a_pos.dropna(inplace=True)

#define covid19 effect on away possession
a_pos.loc[a_pos['away_possession_changes']>0,'effect'] = 'positive'  
a_pos.loc[a_pos['away_possession_changes']<0,'effect'] = 'negative'  
a_pos.loc[a_pos['away_possession_changes']==0,'effect'] = 'not affected'  

a_pos.sort_values('away_possession_changes',ascending=True)

away_team,away_possession_before,away_possession_after,away_possession_changes,effect
AFC Bournemouth,47.08,33.27,-13.81,negative
Norwich City,51.31,41.02,-10.28,negative
Everton,49.26,45.14,-4.12,negative
Manchester City,66.74,62.74,-4.0,negative
West Ham United,45.55,42.45,-3.1,negative
Crystal Palace,43.43,40.45,-2.98,negative
Arsenal,55.41,52.71,-2.69,negative
Tottenham Hotspur,54.25,51.94,-2.31,negative
Liverpool,62.07,61.16,-0.91,negative
Fulham,48.42,47.85,-0.57,negative


In [105]:
a_pos.value_counts('effect')

effect
positive    11
negative    10
dtype: int64
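
The home and away computations above follow the same steps, so they could be folded into one helper. The function below is a sketch, not the notebook's own code: the name `before_after_change` is hypothetical, and it assumes a `covid19` column holding `'before'` / `'after'` labels as created earlier. The toy frame uses made-up values.

```python
import pandas as pd

def before_after_change(df, team_col, value_col):
    """Per-team mean of value_col before and after, plus the difference."""
    # One groupby over (team, period), then pivot the period into columns.
    means = df.groupby([team_col, "covid19"])[value_col].mean().unstack("covid19")
    out = pd.DataFrame({
        f"{value_col}_before": means["before"],
        f"{value_col}_after": means["after"],
    })
    out[f"{value_col}_changes"] = out[f"{value_col}_after"] - out[f"{value_col}_before"]
    # Teams missing from either period have NaN and are dropped.
    return out.dropna().sort_values(f"{value_col}_changes")

# Toy data with made-up values; team C only played "before" and is dropped.
toy = pd.DataFrame({
    "home_team": ["A", "A", "A", "B", "B", "C"],
    "covid19": ["before", "before", "after", "before", "after", "before"],
    "home_possession": [60.0, 50.0, 45.0, 40.0, 48.0, 55.0],
})
result = before_after_change(toy, "home_team", "home_possession")
print(result)
```

With the real data, the away table would come from calling the same helper with `'away_team'` and `'away_possession'`.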