---
title: Preprocessing
description: ...
---

In [9]:
import pandas as pd

match_results = pd.read_csv('data/extracted_match_results.csv', parse_dates=['date'])
head_coach = pd.read_csv('data/extracted_head_coach.csv', parse_dates=['appointed', 'end_date'])

match_results.drop(columns = ['match_url'], inplace = True)
match_results.rename(columns = {'home': 'home_team', 'away': 'away_team'}, inplace = True)
head_coach.drop(columns = ['staff_url'], inplace = True)
head_coach.rename(columns = {'team_name': 'team'}, inplace = True)

display(match_results.head())
display(head_coach.head())

Unnamed: 0,league,country,season_year,date,home_team,home_goals,away_team,away_goals
0,Premier League,England,2018,2017-08-11,Arsenal,4.0,Leicester City,3.0
1,Premier League,England,2018,2017-08-12,Watford,3.0,Liverpool,3.0
2,Premier League,England,2018,2017-08-12,Crystal Palace,0.0,Huddersfield,3.0
3,Premier League,England,2018,2017-08-12,West Brom,1.0,Bournemouth,0.0
4,Premier League,England,2018,2017-08-12,Chelsea,2.0,Burnley,3.0


Unnamed: 0,team,league,country,coach_name,staff_dob,staff_nationality,staff_nationality_secondary,appointed,end_date,days_in_post,matches,wins,draws,losses
0,Manchester City,Premier League,England,Pep Guardiola,"Jan 18, 1971",Spain,,2016-07-01,NaT,2784,450,333,53,64
1,Liverpool FC,Premier League,England,Jürgen Klopp,"Jun 16, 1967",Germany,,2015-10-08,2024-06-30,3188,468,291,96,81
2,Chelsea FC,Premier League,England,Graham Potter,"May 20, 1975",England,,2022-09-08,2023-04-02,206,31,12,8,11
3,Chelsea FC,Premier League,England,Thomas Tuchel,"Aug 29, 1973",Germany,,2021-01-26,2022-09-07,589,100,63,19,18
4,Chelsea FC,Premier League,England,Frank Lampard,"Jun 20, 1978",England,,2019-07-04,2021-01-25,571,84,44,15,25


### Team names

In [10]:
# Compute the team names that appear in one dataset but not in the other
coach_team = set(head_coach['team'])
match_team = set(match_results['home_team']) | set(match_results['away_team'])
coach_team_not_in_match = coach_team - match_team
match_team_not_in_coach = match_team - coach_team

len(coach_team_not_in_match), len(match_team_not_in_coach)

(63, 132)

In total, the match_results dataset contains {eval}`len(match_team)` teams and the head_coach dataset contains {eval}`len(coach_team)` teams. However, some team names differ between the two datasets. For example, 'Liverpool' in match_results is 'Liverpool FC' in head_coach. This is a problem because we will need to join the two datasets on their team columns.

There are {eval}`len(coach_team_not_in_match)` teams present in head_coach but not in match_results, and {eval}`len(match_team_not_in_coach)` teams present in match_results but not in head_coach. Even if every unmatched head_coach name corresponds to a differently spelled match_results name, the second count exceeds the first, so several teams in match_results have no coach record at all. (Needs more explanation in Data Extraction about the data and why this is surprising, given that we filter head coaches to include at least the latest head coach.)

Addressing this surprise ...

To address the mismatched team names, we will use the Levenshtein distance (add reference to paper) to match the team names in head_coach to the team names in match_results.
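As a rough illustration of the idea, the Levenshtein distance between two strings is the minimum number of single-character insertions, deletions, or substitutions needed to turn one into the other. The sketch below is a plain dynamic-programming version written for this explanation; the function name `levenshtein` is an assumption and is not part of the notebook. Note that thefuzz reports similarity scores from 0 to 100 rather than raw distances, so the numbers here are not directly comparable to the `min_score` threshold.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character edits turning a into b."""
    # Keep the shorter string in the inner loop to limit memory use
    if len(a) < len(b):
        a, b = b, a

    # previous[j] holds the distance between the processed prefix of a and b[:j]
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or keep if equal)
            ))
        previous = current
    return previous[-1]


# Dropping the " FC" suffix takes three deletions
print(levenshtein("Liverpool FC", "Liverpool"))  # 3
```

Short suffixes such as "FC" cost only a few edits, which is why this family of measures suits club names that differ mostly by prefixes and suffixes.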

In [11]:
from thefuzz import process

def match_names(name, list_names, min_score=0):
    """Return the best fuzzy match for name in list_names, or None if its score is below min_score."""
    scores = process.extract(name, list_names, limit=1)

    if len(scores) > 0 and scores[0][1] >= min_score:
        return scores[0][0]
    return None

# Map each head_coach team name to its closest match_results team name
name_match = {}

for team in coach_team:
    match = match_names(team, match_team, min_score=60)
    name_match[team] = match  # None when no candidate reaches min_score
    if match is None:
        print(f"No match found for {team}")

# Show name_match
for team, match in name_match.items():
    print(f"{team:30} matched with {match}")

Arsenal FC                     matched with Arsenal
FC Nantes                      matched with Nantes
Frosinone Calcio               matched with Frosinone
Rayo Vallecano                 matched with Rayo Vallecano
Stade Reims                    matched with Reims
SS Lazio                       matched with Lazio
Inter Milan                    matched with Inter
Brighton & Hove Albion         matched with Brighton
Sevilla FC                     matched with Sevilla
RB Leipzig                     matched with RB Leipzig
Borussia Mönchengladbach       matched with M'Gladbach
FC Augsburg                    matched with Augsburg
OGC Nice                       matched with Nice
Genoa CFC                      matched with Genoa
Chelsea FC                     matched with Chelsea
Deportivo Alavés               matched with Alavés
Newcastle United               matched with Newcastle Utd
Manchester United              matched with Manchester Utd
Stade Rennais FC               matched with Ren

In [12]:
# # Fix some names
# name_match['Inter Milan'] = 'Inter'
# name_match['AC Milan'] = 'Milan'
# name_match['Stade Rennais FC'] = 'Rennes'

# Ensure every team has a match
for team in coach_team:
    if name_match[team] is None:
        print(f"No match found for {team}")

In [13]:
# Replace head_coach team names with their match_results equivalents
head_coach['team'] = head_coach['team'].map(name_match)
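Fuzzy matching can send two different source names to the same target, which would silently merge the coach records of two clubs. A minimal sketch of a sanity check for a mapping shaped like `name_match` follows; the helper `find_collisions` and the example mapping are hypothetical and only illustrate the check, they are not the notebook's actual output.

```python
from collections import defaultdict


def find_collisions(mapping):
    """Return each target name that more than one source name maps to."""
    sources_by_target = defaultdict(list)
    for source, target in mapping.items():
        if target is not None:  # unmatched names are reported separately
            sources_by_target[target].append(source)
    return {
        target: sorted(sources)
        for target, sources in sources_by_target.items()
        if len(sources) > 1
    }


# Hypothetical mapping for illustration only
example_mapping = {
    "Inter Milan": "Inter",
    "AC Milan": "Inter",
    "Chelsea FC": "Chelsea",
}
print(find_collisions(example_mapping))  # {'Inter': ['AC Milan', 'Inter Milan']}
```

Any target reported here would need a manual override before the mapping is applied.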

### To-Do

- investigate NaN values
- investigate inf and -inf values
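One possible starting point for both items is a per-column count of missing and infinite values. The sketch below assumes pandas and NumPy are available; the helper name `summarize_missing_and_infinite` and the toy DataFrame are made up for illustration and do not reflect the real datasets.

```python
import numpy as np
import pandas as pd


def summarize_missing_and_infinite(df):
    """Count NaN values and +/-inf values for every column of df."""
    # Infinite values can only occur in numeric columns
    numeric = df.select_dtypes(include="number")
    inf_count = np.isinf(numeric).sum().reindex(df.columns, fill_value=0)
    return pd.DataFrame({
        "nan_count": df.isna().sum(),
        "inf_count": inf_count,
    })


# Toy data for illustration only
toy = pd.DataFrame({
    "team": ["A", "B", None],
    "home_goals": [1.0, np.nan, np.inf],
    "away_goals": [0.0, -np.inf, 2.0],
})
summary = summarize_missing_and_infinite(toy)
print(summary)
```

The same function could be applied to `match_results` and `head_coach` before saving, to decide whether rows need to be dropped or imputed.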

### Saving preprocessed data

In [14]:
# Save the preprocessed datasets
match_results.to_csv('data/match_results.csv', index=False)
head_coach.to_csv('data/head_coach.csv', index=False)