<center><h1><font size=6> Basic Feature Engineering </font></h1></center>

This notebook takes the pre-processed data and calculates some simple features that I expected to be useful based on my domain knowledge of the English Premier League (EPL). More advanced feature engineering and feature selection came later, after Exploratory Data Analysis (EDA), because I wanted to do that work only after splitting the data into training and testing sets.

The main bits of feature engineering this script does are:
* **Days since last game**: this is an important indicator of fatigue.
* **Points, total points and goal difference**.
* **League position**: this is an important indicator of form. 
* **Last head to head result**: the result of the last game between the same two sides, and of their last game at the same venue.

### Load libraries and setup notebook configuration

In [1]:
# import packages
import pandas as pd 
import numpy as np
import os
from pathlib import Path


# set pandas configurations
pd.set_option("display.precision", 2) # display to 2 decimal places
pd.set_option("display.max_columns", None) # display all columns so we can view the whole dataset
pd.set_option('display.float_format', '{:.2f}'.format) # Disable scientific notation for pandas


# set directories
os.chdir('..') # change current working directory to the parent directory to help access files/directories at a higher level
DATAPATH = Path(r'data') # set data path


# import from source directory
from src import constants

### Load processed data from local data file

In [53]:
matches = pd.read_csv(f"{DATAPATH}/processed/matches_processed.csv")

In [54]:
matches.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 26234 entries, 0 to 26233
Data columns (total 23 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   unique_match_id  26234 non-null  int64  
 1   date             26234 non-null  object 
 2   time             7679 non-null   object 
 3   comp             26234 non-null  object 
 4   round            26234 non-null  object 
 5   day              26234 non-null  object 
 6   venue            26234 non-null  object 
 7   result           26234 non-null  object 
 8   gf               26113 non-null  float64
 9   ga               26113 non-null  float64
 10  opponent         26234 non-null  object 
 11  xg               4215 non-null   float64
 12  xga              4215 non-null   float64
 13  poss             7339 non-null   float64
 14  attendance       6588 non-null   float64
 15  captain          6558 non-null   object 
 16  formation        12062 non-null  object 
 17  referee     

### Days since last game

In [55]:
# convert date to date time object
matches['date'] = pd.to_datetime(matches['date'])

# sort by team, season and date
matches = matches.sort_values(['team', 'season', 'date'])

# calculate days since last game
matches['days_since_last_game'] = matches.groupby(['team', 'season'])['date'].diff().dt.days
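
To make the behaviour of the grouped `diff()` concrete, here is a small sketch on made-up data (the `toy` frame and its values are hypothetical, not taken from the dataset). The first game in each team/season group has no earlier game to compare against, so it comes out as `NaN`.

```python
import pandas as pd

# Hypothetical toy data: two teams, one season
toy = pd.DataFrame({
    'team': ['A', 'A', 'A', 'B', 'B'],
    'season': [2023, 2023, 2023, 2023, 2023],
    'date': pd.to_datetime(['2022-08-06', '2022-08-13', '2022-08-16',
                            '2022-08-07', '2022-08-14']),
})

# Same pattern as above: sort, then take the gap to the previous game within each group
toy = toy.sort_values(['team', 'season', 'date'])
toy['days_since_last_game'] = toy.groupby(['team', 'season'])['date'].diff().dt.days

print(toy)
```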

### Points, total points and goal difference

In [56]:
# Function to calculate the points earned in a game from its result (3 for a win, 1 for a draw, 0 for a loss)
def calculate_points(row):
    if row['result'] == 'W':
        return 3
    elif row['result'] == 'D':
        return 1
    else:
        return 0

# calculate points
matches['points'] = matches.apply(calculate_points, axis=1)


# define a function to calculate the cumulative amount of a certain variable over the course of a PL season (up to but not including the game of the row)
def calculate_cumulative_pl_value(data, new_column_name, column):
    data[new_column_name] = data[data['comp'] == 'Premier League'].groupby(['team', 'season'])[column].transform(lambda x: x.shift().cumsum())
    # Fill NaN values with the previous row's value (for games not in the PL).
    # Note: this fill is not grouped, so a value can carry over from the previous team/season in the sort order.
    data[new_column_name] = data[new_column_name].ffill()
    # Fill remaining NaN values with 0 (leading rows with no earlier value to carry forward)
    data[new_column_name] = data[new_column_name].fillna(0)

    return data

    
# calculate total cumulative points, goals for and goals against
matches = calculate_cumulative_pl_value(data=matches, new_column_name='pl_total_points', column='points')
matches = calculate_cumulative_pl_value(data=matches, new_column_name='pl_total_gf', column='gf')
matches = calculate_cumulative_pl_value(data=matches, new_column_name='pl_total_ga', column='ga')


# calculate total cumulative goal difference
matches['pl_total_goal_diff'] = matches['pl_total_gf'] - matches['pl_total_ga']
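
As a hedged alternative, the sketch below keeps the forward fill inside each team/season group, so a running total is never carried over from a different team or an earlier season. The function name `calculate_cumulative_pl_value_grouped` and the toy data are hypothetical; this is a variant to compare against, not the version that produced the outputs shown in this notebook.

```python
import pandas as pd

def calculate_cumulative_pl_value_grouped(data, new_column_name, column):
    """Running PL total before each game, filled within each team/season only."""
    data = data.copy()
    pl_rows = data[data['comp'] == 'Premier League']
    # Running total up to, but not including, the current PL game
    data[new_column_name] = pl_rows.groupby(['team', 'season'])[column].transform(
        lambda x: x.shift().cumsum()
    )
    # Carry the total across non-PL games, but only within the same team and season
    data[new_column_name] = data.groupby(['team', 'season'])[new_column_name].ffill()
    # Anything still missing comes before the first completed PL game of that season
    data[new_column_name] = data[new_column_name].fillna(0)
    return data

# Hypothetical toy data, already sorted by team, season and date
toy = pd.DataFrame({
    'team': ['A', 'A', 'A', 'B', 'B'],
    'season': [2023] * 5,
    'comp': ['Premier League', 'FA Cup', 'Premier League', 'Premier League', 'Premier League'],
    'points': [3, 3, 1, 0, 3],
})
toy = calculate_cumulative_pl_value_grouped(toy, 'pl_total_points', 'points')
print(toy)
```

In this toy frame, team B's first game gets 0 rather than inheriting team A's running total of 3.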

### Relative position in the PL table at each matchweek

In [57]:
def calculate_league_position(data):
    # Create a copy of the dataframe to avoid modifying the original one
    temp_df = data[data['comp'] == 'Premier League'].copy()

    # Convert 'round' to integer matchweek number
    temp_df['matchweek'] = temp_df['round'].apply(lambda x: int(x.split(' ')[1]))

    # Collect the ranked rows for each season/matchweek in a list
    rank_frames = []

    # For each season and each matchweek, sort teams by total points, then by goal difference and finally by team name
    for season in temp_df['season'].unique():
        for matchweek in temp_df['matchweek'].unique():
            temp = temp_df[(temp_df['season'] == season) & (temp_df['matchweek'] == matchweek)].copy()
            if temp.empty:
                continue  # this matchweek does not occur in this season
            temp.sort_values(['pl_total_points', 'pl_total_goal_diff', 'team'], ascending=[False, False, True], inplace=True)
            temp['rank'] = np.arange(1, len(temp) + 1, dtype=float)  # float so NaN can be assigned below
            temp = temp[temp['rank'] <= 20]  # Discard ranks greater than 20
            rank_frames.append(temp)

    # Concatenate once at the end rather than growing a dataframe inside the loop
    ranks = pd.concat(rank_frames)

    # Replace rank of first matchweek with NaN in the 'ranks' dataframe (no games have been played yet)
    ranks.loc[ranks['matchweek'] == 1, 'rank'] = np.nan

    # For each matchweek and each team, take the minimum rank
    ranks['rank'] = ranks.groupby(['season', 'matchweek', 'team'])['rank'].transform('min')
    
    # Merge back with the original dataframe
    matches = data.merge(ranks[['unique_match_id', 'rank']], on='unique_match_id', how='left')

    return matches

matches = calculate_league_position(matches)
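
The per-matchweek ranking can also be expressed without explicit loops. The sketch below is one possible vectorised variant, assuming the same column names as above; the function name `rank_within_matchweek` and the toy data are hypothetical, and it leaves out the matchweek-1 masking and the 20-team cut-off for brevity.

```python
import pandas as pd

def rank_within_matchweek(pl_games):
    """Rank teams inside each season/matchweek by points, goal difference, then name."""
    ordered = pl_games.sort_values(
        ['season', 'matchweek', 'pl_total_points', 'pl_total_goal_diff', 'team'],
        ascending=[True, True, False, False, True],
    )
    # cumcount numbers the rows of each group in sorted order, starting at 0
    ordered['rank'] = ordered.groupby(['season', 'matchweek']).cumcount() + 1
    # Restore the original row order
    return ordered.sort_index()

# Hypothetical toy data: one season, one matchweek, four teams
toy = pd.DataFrame({
    'season': [2023] * 4,
    'matchweek': [2] * 4,
    'team': ['A', 'B', 'C', 'D'],
    'pl_total_points': [3, 3, 0, 1],
    'pl_total_goal_diff': [1, 2, -2, 0],
})
ranked = rank_within_matchweek(toy)
print(ranked[['team', 'rank']])
```

Teams A and B are level on points here, so B's better goal difference puts it first.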

### Head to Head

In [58]:
# Create a sorted copy of the data
matches_sorted = matches.sort_values(['date']).copy()

# Create a set identifier for each match (sorted alphabetically to ensure that the same set represents the same pair of teams regardless of home/away status)
matches_sorted['set_identifier'] = matches_sorted.apply(lambda row: ''.join(sorted([row['team'], row['opponent']])), axis=1)

# Calculate last h2h result
matches_sorted['last_h2h'] = matches_sorted.groupby('set_identifier')['points'].transform(lambda x: x.shift())

# Create a venue identifier (including the venue in the identifier)
matches_sorted['venue_identifier'] = matches_sorted.apply(lambda row: ''.join(sorted([row['team'], row['opponent'], row['venue']])), axis=1)

# Calculate last h2h result at same venue
matches_sorted['last_h2h_venue'] = matches_sorted.groupby('venue_identifier')['points'].transform(lambda x: x.shift())
matches = matches_sorted.sort_values(['date']).copy()
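
A small sketch of the same shift-within-pair idea on made-up data. The `toy` frame and the `pair_key` column are hypothetical, and the `'|'` separator is an assumption of this sketch (the code above joins the names directly); a separator simply keeps two team names from running together.

```python
import pandas as pd

# Hypothetical toy data: Arsenal and Everton meet twice, Arsenal and Chelsea once
toy = pd.DataFrame({
    'date': pd.to_datetime(['2022-09-01', '2023-01-10', '2023-04-02']),
    'team': ['Arsenal', 'Everton', 'Arsenal'],
    'opponent': ['Everton', 'Arsenal', 'Chelsea'],
    'points': [3, 1, 0],
}).sort_values('date')

# Order-independent key for the pair of teams
toy['pair_key'] = [
    '|'.join(sorted(pair)) for pair in zip(toy['team'], toy['opponent'])
]

# Points from the previous meeting of the same pair (NaN if they have not met before)
toy['last_h2h'] = toy.groupby('pair_key')['points'].shift()
print(toy)
```

One thing this makes visible: `points` is recorded from the perspective of the row's own team, so the second row (Everton's) receives the 3 points that Arsenal earned in the earlier meeting. Whether that is the intended meaning of the feature is worth checking before modelling with it.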

In [61]:
matches

Unnamed: 0,unique_match_id,date,time,comp,round,day,venue,result,gf,ga,opponent,xg,xga,poss,attendance,captain,formation,referee,match_report,notes,season,team,date_downloaded,days_since_last_game,points,pl_total_points,pl_total_gf,pl_total_ga,pl_total_goal_diff,rank,set_identifier,last_h2h,venue_identifier,last_h2h_venue
0,199108172488,1991-08-17,,First Division,Matchweek 1,Sat,Home,D,1.00,1.00,Queens Park Rangers,,,,,,,,Match Report,,1992,Arsenal,2023-06-20,,1,0.00,0.00,0.00,0.00,,ArsenalQueens Park Rangers,,ArsenalHomeQueens Park Rangers,
11741,199108172182,1991-08-17,,First Division,Matchweek 1,Sat,Home,W,2.00,1.00,Oldham Athletic,,,,,,,,Match Report,,1992,Liverpool,2023-06-20,,3,49.00,58.00,58.00,0.00,,LiverpoolOldham Athletic,,HomeLiverpoolOldham Athletic,
1,199108201419,1991-08-20,,First Division,Matchweek 2,Tue,Away,L,1.00,3.00,Everton,,,,,,,,Match Report,,1992,Arsenal,2023-06-20,3.00,0,0.00,0.00,0.00,0.00,,ArsenalEverton,,ArsenalAwayEverton,
11742,199108211140,1991-08-21,,First Division,Matchweek 2,Wed,Away,L,1.00,2.00,Manchester City,,,,,,,,Match Report,,1992,Liverpool,2023-06-20,4.00,0,49.00,58.00,58.00,0.00,,LiverpoolManchester City,,AwayLiverpoolManchester City,
11743,199108241152,1991-08-24,,First Division,Matchweek 3,Sat,Away,D,0.00,0.00,Luton Town,,,,,,,,Match Report,,1992,Liverpool,2023-06-20,3.00,1,49.00,58.00,58.00,0.00,,LiverpoolLuton Town,,AwayLiverpoolLuton Town,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
14494,202305212016,2023-05-21,16:00:00,Premier League,Matchweek 37,Sun,Home,W,1.00,0.00,Chelsea,1.20,1.20,64.00,53490.00,Kyle Walker,3-4-3◆,Michael Oliver,Match Report,,2022,Manchester City,2023-06-20,4.00,3,178.00,191.00,57.00,134.00,1.00,ChelseaManchester City,3.00,ChelseaHomeManchester City,3.00
14495,202305241013,2023-05-24,20:00:00,Premier League,Matchweek 32,Wed,Away,D,1.00,1.00,Brighton and Hove Albion,1.80,2.20,60.00,31388.00,İlkay Gündoğan,4-3-3,Simon Hooper,Match Report,,2022,Manchester City,2023-06-20,3.00,1,181.00,192.00,57.00,135.00,1.00,Brighton and Hove AlbionManchester City,3.00,AwayBrighton and Hove AlbionManchester City,0.00
14496,202305281017,2023-05-28,16:30:00,Premier League,Matchweek 38,Sun,Away,L,0.00,1.00,Brentford,1.60,1.30,65.00,17120.00,Kyle Walker,3-2-4-1,John Brooks,Match Report,,2022,Manchester City,2023-06-20,4.00,0,182.00,193.00,58.00,135.00,1.00,BrentfordManchester City,0.00,AwayBrentfordManchester City,0.00
14497,202306030010,2023-06-03,15:00:00,FA Cup,Final,Sat,Neutral,W,2.00,1.00,Manchester United,,,60.00,83179.00,İlkay Gündoğan,3-2-4-1,Paul Tierney,Match Report,,2022,Manchester City,2023-06-20,6.00,3,182.00,193.00,58.00,135.00,,Manchester CityManchester United,0.00,Manchester CityManchester UnitedNeutral,


### PL points in previous season

In [62]:
# sum all the points for each team in each season
season_points = matches[matches['comp'] == 'Premier League'].groupby(['season', 'team'])['points'].sum().reset_index()


# add one to the season so each total lines up with the following season, then rename the column to prev_season_points
season_points['season'] = season_points['season'] + 1
season_points = season_points.rename(columns={'points': 'prev_season_points'})

# merge with match data (the result is displayed here but not assigned back to matches)
matches.merge(season_points, on=['season', 'team'], how='left')

Unnamed: 0,unique_match_id,date,time,comp,round,day,venue,result,gf,ga,opponent,xg,xga,poss,attendance,captain,formation,referee,match_report,notes,season,team,date_downloaded,days_since_last_game,points,pl_total_points,pl_total_gf,pl_total_ga,pl_total_goal_diff,rank,set_identifier,last_h2h,venue_identifier,last_h2h_venue,prev_season_points
0,199108172488,1991-08-17,,First Division,Matchweek 1,Sat,Home,D,1.00,1.00,Queens Park Rangers,,,,,,,,Match Report,,1992,Arsenal,2023-06-20,,1,0.00,0.00,0.00,0.00,,ArsenalQueens Park Rangers,,ArsenalHomeQueens Park Rangers,,
1,199108172182,1991-08-17,,First Division,Matchweek 1,Sat,Home,W,2.00,1.00,Oldham Athletic,,,,,,,,Match Report,,1992,Liverpool,2023-06-20,,3,49.00,58.00,58.00,0.00,,LiverpoolOldham Athletic,,HomeLiverpoolOldham Athletic,,
2,199108201419,1991-08-20,,First Division,Matchweek 2,Tue,Away,L,1.00,3.00,Everton,,,,,,,,Match Report,,1992,Arsenal,2023-06-20,3.00,0,0.00,0.00,0.00,0.00,,ArsenalEverton,,ArsenalAwayEverton,,
3,199108211140,1991-08-21,,First Division,Matchweek 2,Wed,Away,L,1.00,2.00,Manchester City,,,,,,,,Match Report,,1992,Liverpool,2023-06-20,4.00,0,49.00,58.00,58.00,0.00,,LiverpoolManchester City,,AwayLiverpoolManchester City,,
4,199108241152,1991-08-24,,First Division,Matchweek 3,Sat,Away,D,0.00,0.00,Luton Town,,,,,,,,Match Report,,1992,Liverpool,2023-06-20,3.00,1,49.00,58.00,58.00,0.00,,LiverpoolLuton Town,,AwayLiverpoolLuton Town,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
26235,202305212016,2023-05-21,16:00:00,Premier League,Matchweek 37,Sun,Home,W,1.00,0.00,Chelsea,1.20,1.20,64.00,53490.00,Kyle Walker,3-4-3◆,Michael Oliver,Match Report,,2022,Manchester City,2023-06-20,4.00,3,178.00,191.00,57.00,134.00,1.00,ChelseaManchester City,3.00,ChelseaHomeManchester City,3.00,86.00
26236,202305241013,2023-05-24,20:00:00,Premier League,Matchweek 32,Wed,Away,D,1.00,1.00,Brighton and Hove Albion,1.80,2.20,60.00,31388.00,İlkay Gündoğan,4-3-3,Simon Hooper,Match Report,,2022,Manchester City,2023-06-20,3.00,1,181.00,192.00,57.00,135.00,1.00,Brighton and Hove AlbionManchester City,3.00,AwayBrighton and Hove AlbionManchester City,0.00,86.00
26237,202305281017,2023-05-28,16:30:00,Premier League,Matchweek 38,Sun,Away,L,0.00,1.00,Brentford,1.60,1.30,65.00,17120.00,Kyle Walker,3-2-4-1,John Brooks,Match Report,,2022,Manchester City,2023-06-20,4.00,0,182.00,193.00,58.00,135.00,1.00,BrentfordManchester City,0.00,AwayBrentfordManchester City,0.00,86.00
26238,202306030010,2023-06-03,15:00:00,FA Cup,Final,Sat,Neutral,W,2.00,1.00,Manchester United,,,60.00,83179.00,İlkay Gündoğan,3-2-4-1,Paul Tierney,Match Report,,2022,Manchester City,2023-06-20,6.00,3,182.00,193.00,58.00,135.00,,Manchester CityManchester United,0.00,Manchester CityManchester UnitedNeutral,,86.00
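
As a small sketch of the previous-season lookup on made-up data (the `toy` frame is hypothetical): the season totals are shifted forward by one season and merged back. Here the merged frame is assigned to a variable so that the new column is kept.

```python
import pandas as pd

# Hypothetical toy data: one team, two seasons
toy = pd.DataFrame({
    'season': [2021, 2021, 2022, 2022],
    'team': ['A', 'A', 'A', 'A'],
    'comp': ['Premier League'] * 4,
    'points': [3, 1, 0, 3],
})

# Total PL points per team and season
season_points = (
    toy[toy['comp'] == 'Premier League']
    .groupby(['season', 'team'])['points'].sum()
    .reset_index()
    .rename(columns={'points': 'prev_season_points'})
)
# Shift the label forward so the 2021 total attaches to the 2022 rows
season_points['season'] = season_points['season'] + 1

# Assigning the merge result keeps the new column on the frame
toy = toy.merge(season_points, on=['season', 'team'], how='left')
print(toy)
```

Rows from the first season in the data have no earlier season to look up, so their `prev_season_points` is `NaN`.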


### Form over last 5 games
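
One way this feature could be computed, as a sketch only: sum the points from the previous five games within each team and season, shifting first so the current game's result is not included. The column name `form_last_5` and the toy data are hypothetical.

```python
import pandas as pd

# Hypothetical toy data: one team, seven games in date order
toy = pd.DataFrame({
    'team': ['A'] * 7,
    'season': [2023] * 7,
    'points': [3, 0, 1, 3, 3, 0, 1],
})

# Sum of points over the previous (up to) five games, excluding the current one
toy['form_last_5'] = (
    toy.groupby(['team', 'season'])['points']
    .transform(lambda s: s.shift().rolling(window=5, min_periods=1).sum())
)
print(toy)
```

With `min_periods=1` the early games of a season use however many earlier games exist; the first game has none and stays `NaN`.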

### Head to head form over last 5 games

### Number of games played in last 21 days
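
A possible sketch for this feature, assuming dates are sorted within each team: for every game, count how many of that team's earlier games fall within the preceding 21 days. The helper `games_in_window`, the column name `games_last_21_days` and the toy data are all hypothetical.

```python
import numpy as np
import pandas as pd

def games_in_window(dates, days=21):
    """Count earlier games in the `days` days before each game (dates sorted ascending)."""
    values = dates.to_numpy()
    window_start = values - np.timedelta64(days, 'D')
    # Position of the first game that falls on or after the start of the window
    first_in_window = np.searchsorted(values, window_start, side='left')
    # Games between that position and the current game (exclusive) are inside the window
    return pd.Series(np.arange(len(values)) - first_in_window, index=dates.index)

# Hypothetical toy data: two teams
toy = pd.DataFrame({
    'team': ['A', 'A', 'A', 'A', 'B', 'B'],
    'date': pd.to_datetime(['2023-01-01', '2023-01-08', '2023-01-20', '2023-02-15',
                            '2023-01-02', '2023-01-30']),
}).sort_values(['team', 'date'])

# Apply per team and stitch the pieces back together; assignment aligns on the index
toy['games_last_21_days'] = pd.concat(
    [games_in_window(team_dates) for _, team_dates in toy.groupby('team')['date']]
)
print(toy)
```

Grouping by team only, rather than by team and season, lets the window reach back across a season boundary; whether that is desirable depends on how the feature is meant to capture fatigue.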

In [14]:
matches.info()

<class 'pandas.core.frame.DataFrame'>
Index: 26240 entries, 0 to 14498
Data columns (total 34 columns):
 #   Column                Non-Null Count  Dtype         
---  ------                --------------  -----         
 0   unique_match_id       26240 non-null  int64         
 1   date                  26240 non-null  datetime64[ns]
 2   time                  7683 non-null   object        
 3   comp                  26240 non-null  object        
 4   round                 26240 non-null  object        
 5   day                   26240 non-null  object        
 6   venue                 26240 non-null  object        
 7   result                26240 non-null  object        
 8   gf                    26119 non-null  float64       
 9   ga                    26119 non-null  float64       
 10  opponent              26240 non-null  object        
 11  xg                    4217 non-null   float64       
 12  xga                   4217 non-null   float64       
 13  poss                 