---
title: Preprocessing
description: Preprocessing the data for future use
---

In [39]:
import pandas as pd

### Preprocessing match results

In [40]:
# Column names : 'League', 'Country', 'Season', 'Date', 'Home', 'HomeGoals', 'Away', 'AwayGoals'
match_results = pd.read_csv('data/extracted_match_results.csv', parse_dates=['Date'])

# Fix encoding issue: rename the mis-decoded 'Fußball-Bundesliga' (UTF-8 bytes \303\237 read as two characters) to 'Bundesliga'
match_results['League'] = match_results['League'].replace('Fu\u00c3\u009fball-Bundesliga', 'Bundesliga')

# Map country codes to country names
match_country = {'ENG': 'England', 'ITA': 'Italy', 'FRA': 'France', 'GER': 'Germany', 'ESP': 'Spain'}
match_results['Country'] = match_results['Country'].map(match_country)

In [41]:
#| label: match_results
match_results.head()

| | League | Country | Season | Date | Home | HomeGoals | Away | AwayGoals |
|---|---|---|---|---|---|---|---|---|
| 0 | Premier League | England | 2018 | 2017-08-11 | Arsenal | 4.0 | Leicester City | 3.0 |
| 1 | Premier League | England | 2018 | 2017-08-12 | Watford | 3.0 | Liverpool | 3.0 |
| 2 | Premier League | England | 2018 | 2017-08-12 | Crystal Palace | 0.0 | Huddersfield | 3.0 |
| 3 | Premier League | England | 2018 | 2017-08-12 | West Brom | 1.0 | Bournemouth | 0.0 |
| 4 | Premier League | England | 2018 | 2017-08-12 | Chelsea | 2.0 | Burnley | 3.0 |


Since we are interested in each individual team's result rather than in the pairing of opponents, we will reshape this dataframe by splitting every match into two rows, one per team. This will allow us to compute statistics for each team separately.

In [42]:
def return_result(goal1, goal2):
    if goal1 > goal2:
        return 'win'
    elif goal1 < goal2:
        return 'loss'
    else:
        return 'draw'
    
match_results['HomeResult'] = match_results.apply(lambda x: return_result(x['HomeGoals'], x['AwayGoals']), axis=1)
match_results['AwayResult'] = match_results.apply(lambda x: return_result(x['AwayGoals'], x['HomeGoals']), axis=1)

home_results = match_results[['League', 'Country', 'Date', 'Home', 'HomeGoals', 'HomeResult']]
home_results = home_results.rename(columns={'Home': 'Team', 'HomeGoals': 'Goals', 'HomeResult': 'Result'})
home_results['isHome'] = True

away_results = match_results[['League', 'Country', 'Away', 'Date', 'AwayGoals', 'AwayResult']]
away_results = away_results.rename(columns={'Away': 'Team', 'AwayGoals': 'Goals', 'AwayResult': 'Result'})
away_results['isHome'] = False

match_results = pd.concat([home_results, away_results], ignore_index=True)
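
As an aside, the same reshaping can be sketched without a row-wise `apply`, by taking the sign of the goal difference. This is only an illustrative alternative on made-up toy data, not the code used in this notebook; it assumes there are no missing goal values, and the names `matches`, `labels` and `long_format` are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy data with the same column names as match_results (values are made up)
matches = pd.DataFrame({
    'Date': pd.to_datetime(['2017-08-11', '2017-08-12']),
    'Home': ['A', 'C'],
    'HomeGoals': [4, 1],
    'Away': ['B', 'D'],
    'AwayGoals': [3, 1],
})

# Sign of the goal difference from the home side: 1 -> win, 0 -> draw, -1 -> loss
# (assumption: no missing goals, otherwise astype(int) would fail)
labels = {1: 'win', 0: 'draw', -1: 'loss'}
diff = np.sign(matches['HomeGoals'] - matches['AwayGoals']).astype(int)

home = pd.DataFrame({
    'Date': matches['Date'], 'Team': matches['Home'], 'Goals': matches['HomeGoals'],
    'Result': diff.map(labels), 'isHome': True,
})
# The away side sees the opposite sign of the same goal difference
away = pd.DataFrame({
    'Date': matches['Date'], 'Team': matches['Away'], 'Goals': matches['AwayGoals'],
    'Result': (-diff).map(labels), 'isHome': False,
})

# One row per team and per match
long_format = pd.concat([home, away], ignore_index=True)
print(long_format)
```

Each match contributes exactly two rows, so the reshaped frame has twice as many rows as the original one.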

### Preprocessing head coach

In [43]:
#| label: head_coach

# Column names : 'Team', 'League', 'Country', 'HeadCoach', 'Appointed', 'EndDate', 'Tenure', 'Matches', 'Wins', 'Draws', 'Losses'
head_coach = pd.read_csv('data/extracted_head_coach.csv', parse_dates=['Appointed', 'EndDate'])
head_coach.head()

| | Team | League | Country | HeadCoach | Appointed | EndDate | Tenure | Matches | Wins | Draws | Losses |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Manchester City | Premier League | England | Pep Guardiola | 2016-07-01 | NaT | 2838 | 461 | 341 | 56 | 64 |
| 1 | Manchester City | Premier League | England | Manuel Pellegrini | 2013-07-01 | 2016-06-30 | 1095 | 166 | 101 | 27 | 38 |
| 2 | Manchester City | Premier League | England | Roberto Mancini | 2009-12-19 | 2013-05-13 | 1241 | 191 | 113 | 38 | 40 |
| 3 | Manchester City | Premier League | England | Mark Hughes | 2008-06-04 | 2009-12-19 | 563 | 77 | 37 | 15 | 25 |
| 4 | Manchester City | Premier League | England | Sven-Göran Eriksson | 2007-07-06 | 2008-06-02 | 332 | 45 | 19 | 11 | 15 |


We need to filter out head coaches who were not active between 2018 and 2022.

In [44]:
# Keep head coaches who were appointed on or before 2022-12-31
head_coach = head_coach[head_coach['Appointed'] <= '2022-12-31']
# Keep head coaches who were dismissed on or after 2018-01-01 or who are still active
head_coach = head_coach[(head_coach['EndDate'] >= '2018-01-01') | (head_coach['EndDate'].isna())]

We have no information on matches played after 2022-12-31, but a head coach's tenure may extend beyond that date.
We need to cap head coach tenure at 2022-12-31.
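
One possible way to cap tenure at the cut-off date is sketched below on made-up toy data. It is an assumption about how the cap could be implemented, not the notebook's own code: the column `TenureAtCutoff` and the constant `CUTOFF` are hypothetical names introduced for illustration.

```python
import pandas as pd

CUTOFF = pd.Timestamp('2022-12-31')  # last date covered by the match results

# Toy appointments (made-up coaches) using the same column names as head_coach
coaches = pd.DataFrame({
    'HeadCoach': ['Coach A', 'Coach B'],
    'Appointed': pd.to_datetime(['2016-07-01', '2019-07-04']),
    'EndDate': pd.to_datetime([None, '2021-01-25']),
})

# Treat ongoing appointments (NaT) as ending at the cut-off, and cap later end dates
capped_end = coaches['EndDate'].fillna(CUTOFF).clip(upper=CUTOFF)

# Hypothetical column: days in post counted only up to the cut-off date
coaches['TenureAtCutoff'] = (capped_end - coaches['Appointed']).dt.days
print(coaches)
```

Appointments that ended before the cut-off keep their full length, while ongoing or later-ending ones are truncated.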

:::{caution}
Head coach appointment records, extracted from TransferMarkt, contain data about each head coach's spell at a specific club: tenure, number of matches played, number of matches won, etc. These data points extend beyond our cut-off date (2022-12-31).

Importantly, the features that extend beyond the cut-off date still relate to a head coach appointment present in our records. This guarantees that metrics such as the *number of clubs a head coach has managed* are properly reflected and still relate to head coach performance.

However, this creates an asymmetry in our data, as some data points are limited to a time frame and others are not.
Moreover, we must be careful when comparing these data points with other datasets such as match results, as doing so could easily bias our statistical study.

The only way we combine this dataset with match results is by extracting the head coach's tenure on the day of a match. This does not bias our statistical study, as that feature is bounded by our cut-off date.
:::

:::{note}
I considered computing head coach performance metrics from match results, but we would lose information on prior records and create imbalanced data for plots such as a linear regression of head coach performance over head coach tenure: a long-standing coach who was dismissed soon after our start date would have a low number of matches, and thus a performance metric with higher variance that would bias the linear regression because of the long tenure.
:::

We need to ensure that the data is coherent and that each team has only one head coach at a time.

In [45]:
# Verify that a given team has only one head coach at a time
# Each row records one head coach appointment, which runs from Appointed to EndDate

head_coach_bis = head_coach.copy()
# Sort data by 'Team' and 'Appointed'
head_coach_bis = head_coach_bis.sort_values(['Team', 'Appointed'])
# Fill missing end dates with the last date of the dataset
head_coach_bis['EndDate'] = head_coach_bis['EndDate'].fillna(pd.Timestamp('2022-12-31'))
# Check whether the next appointment overlaps with the current one
head_coach_bis['Overlap'] = head_coach_bis.groupby('Team')['Appointed'].shift(-1) < head_coach_bis['EndDate']

# Show teams with overlapping appointments
overlapping = head_coach_bis[head_coach_bis['Overlap']]

There are {eval}`overlapping.shape[0]` inconsistent head coach record(s), affecting the following team(s): {eval}`', '.join(overlapping['Team'].unique())`.

In [46]:
#| label: hc_inconsistency
head_coach_bis[head_coach_bis['Team'] == 'Stade Reims'][['Team', 'Appointed', 'EndDate', 'Overlap']]

| | Team | Appointed | EndDate | Overlap |
|---|---|---|---|---|
| 3497 | Stade Reims | 2017-05-22 | 2021-05-25 | True |
| 3496 | Stade Reims | 2018-07-01 | 2019-03-30 | False |
| 3495 | Stade Reims | 2021-06-23 | 2022-10-13 | False |
| 3494 | Stade Reims | 2022-10-13 | 2022-12-31 | False |
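
To make the mechanics of the check explicit, here is a minimal sketch that reproduces the `Overlap` flag on the first three Stade Reims date pairs shown above. The frame `appointments` and the variable `next_start` are hypothetical names. Note that this approach only compares each appointment with the one that starts immediately after it, so it would not flag every possible overlap.

```python
import pandas as pd

# Dates taken from the Stade Reims rows above; the frame itself is illustrative
appointments = pd.DataFrame({
    'Team': ['Stade Reims'] * 3,
    'Appointed': pd.to_datetime(['2017-05-22', '2018-07-01', '2021-06-23']),
    'EndDate': pd.to_datetime(['2021-05-25', '2019-03-30', '2022-10-13']),
}).sort_values(['Team', 'Appointed'])

# Start date of the next appointment within the same team (NaT for the last one)
next_start = appointments.groupby('Team')['Appointed'].shift(-1)

# A record is flagged when the next coach starts before the current one ends;
# comparisons against NaT evaluate to False, so the last record is never flagged
appointments['Overlap'] = next_start < appointments['EndDate']
print(appointments)
```

Only the 2017-05-22 appointment is flagged, because the following appointment (2018-07-01) begins before its end date (2021-05-25).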


In [47]:
# Filter out the overlapping records
head_coach = head_coach[~head_coach.index.isin(overlapping.index)]

In [48]:
# Count records where Matches != Wins + Draws + Losses
head_coach[head_coach['Matches'] != head_coach['Wins'] + head_coach['Draws'] + head_coach['Losses']].shape[0]

0

In [49]:
# Count head coach records with 0 matches
display(head_coach[head_coach['Matches'] == 0].shape[0])
# Remove these records
head_coach = head_coach[head_coach['Matches'] > 0]

1

### Joining head coach with match results

We would like to add information about how long the head coach had been in charge of the team when each match was played. This will allow us to see whether the head coach's tenure has any impact on the match result.

However, when trying to join the two dataframes, we found that team names are not consistent between the two dataframes. We will need to fix this before we can join the two dataframes.

In [50]:
# Count the teams that appear in only one of head_coach and match_results
coach_teams = set(head_coach['Team'])
match_teams = set(match_results['Team'])

coach_team_not_in_match = coach_teams - match_teams
match_team_not_in_coach = match_teams - coach_teams

len(coach_team_not_in_match), len(match_team_not_in_coach)

(63, 132)

In total, the match_results dataset contains {eval}`len(match_teams)` teams and the head_coach dataset contains {eval}`len(coach_teams)` teams. However, some team names differ between the two datasets. For example, 'Liverpool' in match_results is 'Liverpool FC' in head_coach. This is problematic because we need to join the data on the team column.

In total, there are {eval}`len(coach_team_not_in_match)` teams present in the head coach records that are not in the match results, and {eval}`len(match_team_not_in_coach)` teams present in the match results but not in the head coach records.

We will use the Levenshtein distance to find, for each team in the head coach records, the closest team name in *match_results*. We will then manually check the results to ensure that the matches are correct.
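
For readers unfamiliar with the Levenshtein distance, here is a minimal pure-Python sketch of the idea. It is an illustration only: the helper names `levenshtein` and `similarity` are hypothetical, and the score is a simple normalisation chosen for this example, which is not the same as the scoring that `thefuzz` applies.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions or substitutions."""
    # previous[j] holds the distance between the processed prefix of a and b[:j]
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or keep a matching character)
            ))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Hypothetical 0-100 score: 100 means the two strings are identical."""
    longest = max(len(a), len(b))
    return 100.0 if longest == 0 else 100.0 * (1 - levenshtein(a, b) / longest)


# Pick the candidate closest to a head coach team name
candidates = ['Liverpool', 'Juventus', 'Villarreal']
best = max(candidates, key=lambda name: similarity('Liverpool FC', name))
print(best, levenshtein('Liverpool FC', 'Liverpool'))
```

Removing the three characters of the " FC" suffix turns 'Liverpool FC' into 'Liverpool', which is why suffix differences like this one still produce a close match.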

In [51]:
from thefuzz import process

team_name_mapping = {}

for coach_team in coach_teams:
    matching_scores = process.extract(coach_team, match_teams, limit=1)
    
    if len(matching_scores) != 0 and matching_scores[0][1] >= 60:
        team_name_mapping[coach_team] = matching_scores[0][0]
    else:
        team_name_mapping[coach_team] = None
        print(f"No match found for {coach_team}")

In [52]:
#| label: team_match_table

name_match = pd.DataFrame(team_name_mapping.items(), columns=['Team in head coach records', 'Team in match results'])
name_match.head()

| | Team in head coach records | Team in match results |
|---|---|---|
| 0 | Manchester City | Manchester City |
| 1 | Juventus FC | Juventus |
| 2 | Crystal Palace | Crystal Palace |
| 3 | Villarreal CF | Villarreal |
| 4 | Udinese Calcio | Udinese |


In [53]:
# Map head_coach['Team'] to the matching team name in match_results
head_coach['Team'] = head_coach['Team'].map(team_name_mapping)
head_coach

| | Team | League | Country | HeadCoach | Appointed | EndDate | Tenure | Matches | Wins | Draws | Losses |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Manchester City | Premier League | England | Pep Guardiola | 2016-07-01 | NaT | 2838 | 461 | 341 | 56 | 64 |
| 36 | Liverpool | Premier League | England | Jürgen Klopp | 2015-10-08 | 2024-06-30 | 3188 | 480 | 300 | 98 | 82 |
| 57 | Chelsea | Premier League | England | Graham Potter | 2022-09-08 | 2023-04-02 | 206 | 31 | 12 | 8 | 11 |
| 58 | Chelsea | Premier League | England | Thomas Tuchel | 2021-01-26 | 2022-09-07 | 589 | 100 | 63 | 19 | 18 |
| 59 | Chelsea | Premier League | England | Frank Lampard | 2019-07-04 | 2021-01-25 | 571 | 84 | 44 | 15 | 25 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 3405 | Montpellier | Ligue 1 | France | Michel Der Zakarian | 2017-05-23 | 2021-05-24 | 1462 | 161 | 63 | 51 | 47 |
| 3439 | Strasbourg | Ligue 1 | France | Julien Stéphan | 2021-05-28 | 2023-01-09 | 591 | 58 | 19 | 20 | 19 |
| 3440 | Strasbourg | Ligue 1 | France | Thierry Laurey | 2016-07-01 | 2021-05-24 | 1788 | 209 | 81 | 51 | 77 |
| 3494 | Reims | Ligue 1 | France | Will Still | 2022-10-13 | NaT | 543 | 62 | 26 | 17 | 19 |


We can now add the head coach's number of days in post to the match results.

In [54]:
#| label: final_match_results

# Merge head_coach with match_results: one row per (match, appointment) pair for the same team
match_results_bis = match_results.merge(head_coach[['Team', 'HeadCoach', 'Appointed', 'EndDate']], on=['Team'], how='left')
# Set aside the teams that have no head coach record
no_headcoach = match_results_bis[match_results_bis['HeadCoach'].isna()]
match_results_bis = match_results_bis[~match_results_bis['HeadCoach'].isna()]
# Keep only the head coach who was appointed before the match and whose end date is missing or falls after the match
match_results_bis = match_results_bis[(match_results_bis['Date'] >= match_results_bis['Appointed']) & ((match_results_bis['EndDate'].isna()) | (match_results_bis['EndDate'] >= match_results_bis['Date']))]
# Add back the teams that have no head coach record
match_results_bis = pd.concat([match_results_bis, no_headcoach], ignore_index=True)
# Compute DaysInPost: days between the appointment and the match
match_results_bis['DaysInPost'] = (match_results_bis['Date'] - match_results_bis['Appointed']).dt.days
match_results_bis = match_results_bis.drop(columns=['Appointed', 'EndDate'])
match_results_bis.head()

| | League | Country | Date | Team | Goals | Result | isHome | HeadCoach | DaysInPost |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Premier League | England | 2017-08-11 | Arsenal | 4.0 | win | True | Arsène Wenger | 7619.0 |
| 1 | Premier League | England | 2017-08-12 | Chelsea | 2.0 | loss | True | Antonio Conte | 407.0 |
| 2 | Premier League | England | 2017-08-12 | Brighton | 0.0 | loss | True | Chris Hughton | 955.0 |
| 3 | Premier League | England | 2017-08-13 | Newcastle Utd | 0.0 | loss | True | Rafael Benítez | 520.0 |
| 4 | Premier League | England | 2017-08-13 | Manchester Utd | 4.0 | win | True | José Mourinho | 408.0 |
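
The interval join can be sanity-checked on a small example. The sketch below uses made-up teams and coaches with the notebook's column names, and adds an assertion that no match row ends up attached to more than one coach. This check is a suggestion rather than part of the original pipeline. Because both date comparisons are inclusive, a match played on a handover day, where one coach's `EndDate` equals the next coach's `Appointed`, would match two coaches and trip the assertion.

```python
import pandas as pd

# Toy data (made-up values) with the same column names as the notebook
matches = pd.DataFrame({
    'Team': ['Team X', 'Team X', 'Team Y'],
    'Date': pd.to_datetime(['2019-01-10', '2021-03-01', '2019-01-10']),
})
coaches = pd.DataFrame({
    'Team': ['Team X', 'Team X'],
    'HeadCoach': ['Coach A', 'Coach B'],
    'Appointed': pd.to_datetime(['2018-06-01', '2020-01-01']),
    'EndDate': pd.to_datetime(['2019-12-31', None]),
})

# One row per (match, appointment) pair for the same team
pairs = matches.merge(coaches, on='Team', how='left')

# Keep the appointment that was ongoing on the match date (NaT = still in post)
active = (pairs['Date'] >= pairs['Appointed']) & (
    pairs['EndDate'].isna() | (pairs['EndDate'] >= pairs['Date'])
)

# Also keep matches of teams that have no coach record at all
joined = pairs[active | pairs['HeadCoach'].isna()].copy()
joined['DaysInPost'] = (joined['Date'] - joined['Appointed']).dt.days

# Sanity check: no match row should be attached to more than one coach
assert not joined.duplicated(['Team', 'Date']).any()
print(joined[['Team', 'Date', 'HeadCoach', 'DaysInPost']])
```

Team Y has no coach record, so its row is kept with a missing `HeadCoach` and a missing `DaysInPost`.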


### Saving preprocessed data

In [55]:
# Save the preprocessed match results and head coach records
match_results_bis.to_csv('data/match_results.csv', index=False)
head_coach.to_csv('data/head_coach.csv', index=False)