---
title: Preprocessing
description: Preprocessing the data for future use
output_matplotlib_strings: remove
---

In [None]:
import pandas as pd

match_results = pd.read_csv('data/extracted_match_results.csv', parse_dates=['date'])
head_coach = pd.read_csv('data/extracted_head_coach.csv', parse_dates=['appointed', 'end_date'])

match_results.drop(columns = ['match_url'], inplace = True)
match_results.rename(columns = {'home': 'home_team', 'away': 'away_team'}, inplace = True)
head_coach.drop(columns = ['staff_url'], inplace = True)
head_coach.rename(columns = {'team_name': 'team'}, inplace = True)

display(match_results.head())
display(head_coach.head())

### Team names

In [None]:
# Find the teams that appear in one dataset but not in the other
coach_team = set(head_coach['team'])
match_team = set(match_results['home_team']) | set(match_results['away_team'])
coach_team_not_in_match = coach_team - match_team
match_team_not_in_coach = match_team - coach_team

len(coach_team_not_in_match), len(match_team_not_in_coach)

In total, the match_results dataset contains {eval}`len(match_team)` teams and the head_coach dataset contains {eval}`len(coach_team)` teams. However, some team names differ between the two datasets. For example, 'Liverpool' in match_results is 'Liverpool FC' in head_coach. This is a problem because we will need to join the data on the team columns.

In total, {eval}`len(coach_team_not_in_match)` teams are present in head_coach but not in match_results, and {eval}`len(match_team_not_in_coach)` teams are present in match_results but not in head_coach. A team whose name is spelled differently in the two datasets is counted in both numbers. Even allowing for mismatched names, this indicates that several teams in match_results have no coach record. (Needs more explanation in Data Extraction about the data and why this is surprising, given that we filter head coaches to include at least the latest head coach.)
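
To make the direction of each set difference concrete, here is a toy example. The team sets `toy_coach_team` and `toy_match_team` are made up for illustration and are not taken from the data. Note how a single team spelled two ways shows up in both differences.

```python
# Hypothetical miniature versions of the two team sets
toy_coach_team = {"Liverpool FC", "Arsenal", "Chelsea"}
toy_match_team = {"Liverpool", "Arsenal", "Everton"}

# In head_coach but not in match_results
toy_coach_not_in_match = toy_coach_team - toy_match_team
# In match_results but not in head_coach
toy_match_not_in_coach = toy_match_team - toy_coach_team

print(sorted(toy_coach_not_in_match))  # ['Chelsea', 'Liverpool FC']
print(sorted(toy_match_not_in_coach))  # ['Everton', 'Liverpool']
```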

Addressing this surprise ...

To address the mismatched team names, we will use the Levenshtein distance (add reference to paper) to match each team name in head_coach with the closest team name in match_results.
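
As a rough illustration of the metric itself, the sketch below computes the Levenshtein distance with the standard dynamic-programming recurrence and turns it into a 0-100 similarity. The function names `levenshtein` and `similarity` are hypothetical and not part of thefuzz, and normalizing by the longer string's length is an assumption. thefuzz's scorers compute their scores differently, so these values will not match the ones used in the matching code.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, or substitutions turning a into b."""
    # Row for the empty prefix of a: turning "" into b[:j] takes j insertions
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        # Turning a[:i] into "" takes i deletions
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or keep if equal)
            ))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Similarity in [0, 100]; 100 means identical strings (illustrative normalization)."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 100.0
    return 100.0 * (1 - levenshtein(a, b) / longest)


print(levenshtein("Liverpool", "Liverpool FC"))  # 3: insert " ", "F", "C"
print(similarity("Liverpool", "Liverpool FC"))   # 75.0
```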

In [None]:
from thefuzz import process

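def match_names(name, list_names, min_score=0):
    """Return the entry of list_names closest to name, or None if its score is below min_score."""
    # process.extract returns a list of (candidate, score) tuples, best match first
    scores = process.extract(name, list_names, limit=1)

    if len(scores) > 0 and scores[0][1] >= min_score:
        return scores[0][0]
    return None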

# Map each head_coach team name to its best match in match_results (None if below the threshold)
name_match = {}

for team in coach_team:
    match = match_names(team, match_team, min_score=60)
    name_match[team] = match
    if match is None:
        print(f"No match found for {team}")

# Show name_match
for team, match in name_match.items():
    print(f"{team:30} matched with {match}")

In [None]:
# # Fix some names
# name_match['Inter Milan'] = 'Inter'
# name_match['AC Milan'] = 'Milan'
# name_match['Stade Rennais FC'] = 'Rennes'

# Ensure every team is mapped
for team in coach_team:
    if name_match[team] is None:
        print(f"No match found for {team}")
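
Because every head_coach name is matched independently, two different names could end up mapped to the same match_results team, which a check for `None` alone would not reveal. The sketch below is one way to surface such collisions. `find_collisions` is a hypothetical helper, and the sample mapping is made up for illustration rather than being actual output of the matching step.

```python
from collections import defaultdict


def find_collisions(mapping):
    """Return {target: [sources]} for targets reached from more than one source name."""
    sources_by_target = defaultdict(list)
    for source, target in mapping.items():
        if target is not None:  # unmatched names are reported separately
            sources_by_target[target].append(source)
    return {
        target: sorted(sources)
        for target, sources in sources_by_target.items()
        if len(sources) > 1
    }


# Made-up mapping in the same shape as name_match
example_match = {
    "Liverpool FC": "Liverpool",
    "AC Milan": "Milan",
    "Inter Milan": "Milan",
}
print(find_collisions(example_match))  # {'Milan': ['AC Milan', 'Inter Milan']}
```

Calling `find_collisions(name_match)` before applying the mapping would list any match_results team claimed by several head_coach names.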

In [None]:
# Replace head_coach team names with their match_results equivalents
head_coach['team'] = head_coach['team'].map(name_match)

### To-Do

- investigate NaN values
- investigate inf and -inf values

### Saving preprocessed data

In [None]:
# Save the preprocessed datasets
match_results.to_csv('data/match_results.csv', index=False)
head_coach.to_csv('data/head_coach.csv', index=False)