<center><h1><font size=6> Data Processing </h1></center>

This notebook takes the raw EPL match data that was scraped in notebook 01 and processes the data to make sure it is in a clean and consistent format. More machine learning specific processing, filtering and cleaning will need to be done at the feature engineering stage, but this notebook carries out some basic functionality, including:
* Cleaning up column names and descriptions
* Cleaning column types
* Ensuring team names are consistent across the dataset

### Load libraries and setup notebook configuration

In [1]:
# import packages
import pandas as pd 
import numpy as np
import os
from pathlib import Path


# set pandas configurations
pd.set_option("display.precision", 2) # display to 1 decimpal place
pd.set_option("display.max.columns", None) # display all columns so we can view the whole dataset


# set directories
os.chdir('..') # change current working directory to the parent directory to help access files/directories at a higher level
DATAPATH = Path(r'data') # set data path


# import from source directory
from src import constants

### Load data from local data file

In [5]:
# load EPL match data
matches_raw = pd.read_csv(f"{DATAPATH}/raw/matches_raw.csv")
matches_raw.head(5)

Unnamed: 0.1,Unnamed: 0,Date,Time,Comp,Round,Day,Venue,Result,GF,GA,Opponent,xG,xGA,Poss,Attendance,Captain,Formation,Referee,Match Report,Notes,season,team,date_donwloaded
0,0,2022-07-30,17:00,Community Shield,FA Community Shield,Sat,Neutral,L,1,3,Liverpool,,,57.0,,Rúben Dias,4-3-3,Craig Pawson,Match Report,,2022,Manchester City,2023-06-19
1,1,2022-08-07,16:30,Premier League,Matchweek 1,Sun,Away,W,2,0,West Ham,2.2,0.5,75.0,62443.0,İlkay Gündoğan,4-3-3,Michael Oliver,Match Report,,2022,Manchester City,2023-06-19
2,2,2022-08-13,15:00,Premier League,Matchweek 2,Sat,Home,W,4,0,Bournemouth,1.7,0.1,67.0,53453.0,İlkay Gündoğan,4-2-3-1,David Coote,Match Report,,2022,Manchester City,2023-06-19
3,3,2022-08-21,16:30,Premier League,Matchweek 3,Sun,Away,D,3,3,Newcastle Utd,2.1,1.8,69.0,52258.0,İlkay Gündoğan,4-3-3,Jarred Gillett,Match Report,,2022,Manchester City,2023-06-19
4,4,2022-08-27,15:00,Premier League,Matchweek 4,Sat,Home,W,4,2,Crystal Palace,2.2,0.1,74.0,53112.0,Kevin De Bruyne,4-2-3-1,Darren England,Match Report,,2022,Manchester City,2023-06-19


### Clean data

In [24]:
matches_clean = matches_raw.iloc[:, 1:] # drop first column as it is an empty unnamed index


# clean column names
matches_clean.columns = map(str.lower, matches_clean.columns) # make columns lower case
matches_clean.columns = matches_clean.columns.str.replace(' ', '_') # replace spaces in column names


# clean column types
matches_clean['date'] = pd.to_datetime(matches_clean['date']) # convert date into datetime object
matches_clean['time'] = pd.to_datetime(matches_clean['time'], format='%H:%M').dt.time # convert date into time object
matches_clean['date_downloaded'] = pd.to_datetime(matches_clean['date_downloaded']) # convert date downloaded into datetime object
matches_clean['gf'] = pd.to_numeric(matches_clean['gf'], errors='coerce', downcast='integer') # convert goals for to numeric
matches_clean['ga'] = pd.to_numeric(matches_clean['gf'], errors='coerce', downcast='integer') # convert goals against to numeric

In [25]:
matches_clean.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5732 entries, 0 to 5731
Data columns (total 22 columns):
 #   Column           Non-Null Count  Dtype         
---  ------           --------------  -----         
 0   date             5732 non-null   datetime64[ns]
 1   time             5732 non-null   object        
 2   comp             5732 non-null   object        
 3   round            5732 non-null   object        
 4   day              5732 non-null   object        
 5   venue            5732 non-null   object        
 6   result           5732 non-null   object        
 7   gf               5637 non-null   float64       
 8   ga               5637 non-null   float64       
 9   opponent         5732 non-null   object        
 10  xg               4996 non-null   float64       
 11  xga              4996 non-null   float64       
 12  poss             5710 non-null   float64       
 13  attendance       4639 non-null   float64       
 14  captain          5730 non-null   object 

### Cleaning team names

One issue we have with the data in its raw format is that the team names are different in the 'team' and 'opponent' column, which makes it dificult to map across data in future. So let's create a mapping to ensure all our team names are consistent.

In [31]:
# collect team names
teams = np.sort(matches_clean.team.unique())


# collect opponent names (just premier league)
pl_matches = matches_clean[matches_clean['comp'] == "Premier League"] # filter to just contain PL opponents otherwise we get european and cup game opponents as well
opponents = np.sort(pl_matches.opponent.unique())


# find the index points within each list where the names are not equal
unequal_indexes = [i for i, (x, y) in enumerate(zip(teams, opponents)) if x != y]

for index in unequal_indexes:
    print(f"Team name [{index}]: {teams[index]}, Opponenent name [{index}]: {opponents[index]}")

Team name [4]: Brighton and Hove Albion, Opponenent name [4]: Brighton
Team name [11]: Huddersfield Town, Opponenent name [11]: Huddersfield
Team name [16]: Manchester United, Opponenent name [16]: Manchester Utd
Team name [17]: Newcastle United, Opponenent name [17]: Newcastle Utd
Team name [19]: Nottingham Forest, Opponenent name [19]: Nott'ham Forest
Team name [20]: Sheffield United, Opponenent name [20]: Sheffield Utd
Team name [24]: Tottenham Hotspur, Opponenent name [24]: Tottenham
Team name [26]: West Bromwich Albion, Opponenent name [26]: West Brom
Team name [27]: West Ham United, Opponenent name [27]: West Ham
Team name [28]: Wolverhampton Wanderers, Opponenent name [28]: Wolves


In [42]:
# map the opponent team names to the team names to ensure consistency
team_mapping = dict(zip(opponents, teams)) # create a dictionary mapping each unique opponent name to the team name
matches_clean['opponent'] = matches_clean['opponent'].map(lambda x: team_mapping.get(x, x)) # map to new names but leave original name if no mapping found (e.g., champions league games)

In [43]:
matches_clean.head()

Unnamed: 0,date,time,comp,round,day,venue,result,gf,ga,opponent,xg,xga,poss,attendance,captain,formation,referee,match_report,notes,season,team,date_donwloaded,opponent_mapped
0,2022-07-30,17:00:00,Community Shield,FA Community Shield,Sat,Neutral,L,1.0,1.0,Liverpool,,,57.0,,Rúben Dias,4-3-3,Craig Pawson,Match Report,,2022,Manchester City,2023-06-19,Liverpool
1,2022-08-07,16:30:00,Premier League,Matchweek 1,Sun,Away,W,2.0,2.0,West Ham United,2.2,0.5,75.0,62443.0,İlkay Gündoğan,4-3-3,Michael Oliver,Match Report,,2022,Manchester City,2023-06-19,West Ham United
2,2022-08-13,15:00:00,Premier League,Matchweek 2,Sat,Home,W,4.0,4.0,Bournemouth,1.7,0.1,67.0,53453.0,İlkay Gündoğan,4-2-3-1,David Coote,Match Report,,2022,Manchester City,2023-06-19,Bournemouth
3,2022-08-21,16:30:00,Premier League,Matchweek 3,Sun,Away,D,3.0,3.0,Newcastle United,2.1,1.8,69.0,52258.0,İlkay Gündoğan,4-3-3,Jarred Gillett,Match Report,,2022,Manchester City,2023-06-19,Newcastle United
4,2022-08-27,15:00:00,Premier League,Matchweek 4,Sat,Home,W,4.0,4.0,Crystal Palace,2.2,0.1,74.0,53112.0,Kevin De Bruyne,4-2-3-1,Darren England,Match Report,,2022,Manchester City,2023-06-19,Crystal Palace


In [45]:
# save clean data in processed data file
matches_clean.to_csv(f"{DATAPATH}/processed/matches_processed.csv")