<center><h1><font size=6> Basic Feature Engineering </h1></center>

This notebook takes the pre-processed data and calculates some simple features that I thought were useful based on my domain knowledge of the EPL. Some more advanced feature engineering and feature selection was done after I ran some Exploratory Data Analysis (EDA), but I wanted to do this after I split the data into training and testing sets.

The main bits of feature engineering this script does are:
* **Days since last game and number of games in last 21 days**: this is an important indicator of fatigue.
* **Points, total points and goal difference**. This is a good indicator of how well a team is doing in the current season.
* **League position**: this is an important indicator of form. 
* **Stats from previous season**: this is a good indicator of how good a team is.
* **Last head to head result**: the most recent result and form of the team against the same opponent (and at the same venue)

### Load libraries and setup notebook configuration

In [1]:
# import packages
import pandas as pd 
import numpy as np
import os
from pathlib import Path


# set pandas configurations
pd.set_option("display.precision", 2) # display to 1 decimpal place
pd.set_option("display.max.columns", None) # display all columns so we can view the whole dataset
pd.set_option('display.float_format', '{:.2f}'.format) # Disable scientific notation for pandas


# set directories
os.chdir('..') # change current working directory to the parent directory to help access files/directories at a higher level
DATAPATH = Path(r'data') # set data path


# import from source directory
from src import constants

### Load processed data from local data file

In [2]:
matches = pd.read_csv(f"{DATAPATH}/processed/matches_processed.csv")

In [3]:
matches.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 26234 entries, 0 to 26233
Data columns (total 23 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   unique_match_id  26234 non-null  int64  
 1   date             26234 non-null  object 
 2   time             7679 non-null   object 
 3   comp             26234 non-null  object 
 4   round            26234 non-null  object 
 5   day              26234 non-null  object 
 6   venue            26234 non-null  object 
 7   result           26234 non-null  object 
 8   gf               26113 non-null  float64
 9   ga               26113 non-null  float64
 10  opponent         26234 non-null  object 
 11  xg               4215 non-null   float64
 12  xga              4215 non-null   float64
 13  poss             7339 non-null   float64
 14  attendance       6588 non-null   float64
 15  captain          6558 non-null   object 
 16  formation        12062 non-null  object 
 17  referee     

### Days since last game

In [4]:
# convert date to date time object
matches['date'] = pd.to_datetime(matches['date'])

# sort by team, season and date
matches = matches.sort_values(['team', 'season', 'date'])

# calculate days since last game
matches['days_since_last_game'] = matches.groupby(['team', 'season'])['date'].diff().dt.days

### Amount of games played in last 21 days

In [5]:
# sort data into right order
matches = matches.sort_values(['team', 'season', 'date'])

# calculate count of games in previous 21 days up to current day for each team
matches['games_played_last_21_days'] = matches.set_index('date')\
                .groupby('team', sort=False)['unique_match_id']\
                .rolling('21d', closed='left').count().tolist()

### Points, total points and goal difference

In [6]:
# calculate the number of points in each game based on the result
# Function to calculate points from a game
def calculate_points(row):
    if row['result'] == 'W':
        return 3
    elif row['result'] == 'D':
        return 1
    else:
        return 0

# calculate points
matches['points'] = matches.apply(calculate_points, axis=1)


# define a function to calculate the cumulative amount of a certain variable over the course of a PL season (up to but not including the game of the row)
def calculate_cumulative_pl_value(data, new_column_name, column):
    data[new_column_name] = data[data['comp'] == 'Premier League'].groupby(['team', 'season'])[column].transform(lambda x: x.shift().cumsum())
    data[new_column_name].fillna(method='ffill', inplace=True)  # Fill NaN values with the previous row's value (for games not in the PL)
    data[new_column_name].fillna(0, inplace=True)  # Fill remaining NaN values with 0 (first games of the season)
    
    return data

    
# calculate total cumulative points, goals for and goals against
matches = calculate_cumulative_pl_value(data=matches, new_column_name='pl_total_points', column='points')
matches = calculate_cumulative_pl_value(data=matches, new_column_name='pl_total_gf', column='gf')
matches = calculate_cumulative_pl_value(data=matches, new_column_name='pl_total_ga', column='ga')


# calculate total cumulative goal difference
matches['pl_total_goal_diff'] = matches['pl_total_gf'] - matches['pl_total_ga']

### Relative position in the PL table at each matchweek

In [7]:
def calculate_league_position(data):
    # Create a copy of the dataframe to avoid modifying the original one
    temp_df = data[data['comp'] == 'Premier League'].copy()

    # Convert 'round' to integer matchweek number
    temp_df['matchweek'] = temp_df['round'].apply(lambda x: int(x.split(' ')[1]))

    # Create a new dataframe to hold the ranks
    ranks = pd.DataFrame()

    # For each season and each matchweek, sort teams by total points, then by goal difference and finally by team name
    for season in temp_df['season'].unique():
        for matchweek in temp_df['matchweek'].unique():
            temp = temp_df[(temp_df['season'] == season) & (temp_df['matchweek'] == matchweek)].copy()
            temp.sort_values(['pl_total_points', 'pl_total_goal_diff', 'team'], ascending=[False, False, True], inplace=True)
            temp['rank'] = range(1, len(temp) + 1)
            temp = temp[temp['rank'] <= 20]  # Discard ranks greater than 20
            ranks = pd.concat([ranks, temp])

    # Replace rank of first matchweek with NaN in the 'temp_df' dataframe
    ranks.loc[ranks['matchweek'] == 1, 'rank'] = np.nan

    # For each matchweek and each team, take the minimum rank
    ranks['rank'] = ranks.groupby(['season', 'matchweek', 'team'])['rank'].transform('min')
    
    # Merge back with the original dataframe
    matches = data.merge(ranks[['unique_match_id', 'rank']], on='unique_match_id', how='left')

    return matches

matches = calculate_league_position(matches)

matches = matches.rename(columns = {'rank': 'pl_position'})

### Head to Head

In [8]:
# Create a sorted copy of the data
matches_sorted = matches.sort_values(['date']).copy()

# Create a set identifier for each match (sorted alphabetically to ensure that the same set represents the same pair of teams regardless of home/away status)
matches_sorted['set_identifier'] = matches_sorted.apply(lambda row: ''.join(sorted([row['team'], row['opponent']])), axis=1)

# Calculate last h2h result
matches_sorted['last_h2h'] = matches_sorted.groupby('set_identifier')['points'].transform(lambda x: x.shift())

# Calculate average points from last 5 head-to-head games
matches_sorted['last_h2h_form'] = matches_sorted.groupby('set_identifier')['points'].rolling(window=5, min_periods=1, closed = 'left').mean().reset_index(level=0, drop=True)

# Create a venue identifier (including the venue in the identifier)
matches_sorted['venue_identifier'] = matches_sorted.apply(lambda row: ''.join(sorted([row['team'], row['opponent'], row['venue']])), axis=1)

# Calculate last h2h result at same venue
matches_sorted['last_h2h_venue'] = matches_sorted.groupby('venue_identifier')['points'].transform(lambda x: x.shift())

# Calculate average points from last 5 head-to-head games at the same venue
matches_sorted['last_h2h_venue_form'] = matches_sorted.groupby('venue_identifier')['points'].rolling(window=5, min_periods=1, closed = 'left').mean().reset_index(level=0, drop=True)

matches = matches_sorted.sort_values(['date']).copy()

### PL stats from previous season

In [9]:
# Sum all the points, goals scored, goals conceded, and calculate goal difference for each team in each season
season_stats = matches[matches['comp'] == 'Premier League'].groupby(['season', 'team']).agg({
    'points': 'sum',
    'gf': 'sum',
    'ga': 'sum'
}).reset_index()

# Rename the variables and increment the season by one
season_stats['season'] = season_stats['season'] + 1
season_stats = season_stats.rename(columns={'points': 'prev_season_points', 'gf': 'prev_season_gf', 'ga': 'prev_season_ga'})

# Merge with the match data
matches = matches.merge(season_stats, on=['season', 'team'], how='left')

# calculate previous season goal difference
matches['prev_season_goal_diff'] = matches['prev_season_gf'] - matches['prev_season_ga']

### Form over last 5 games

In [10]:
# Define a list of variables to calculate the form
variables = ['points', 'gf', 'ga', 'poss', 'xg', 'xga']

# Filter the dataframe to include only Premier League games
premier_league_matches = matches[matches['comp'] == 'Premier League'].copy()

# Calculate the form for each variable
for variable in variables:
    # Create a new column for the variable's form
    premier_league_matches[f'{variable}_pl_form'] = premier_league_matches.groupby('team')[variable].rolling(window=5, min_periods=1, closed='left').mean().reset_index(level=0, drop=True)


# create a list of form variables
form_variables =  [f'{variable}_pl_form' for variable in variables]
    
# Merge the form data into the original matches dataframe
matches = matches.merge(premier_league_matches[['unique_match_id'] + form_variables], on='unique_match_id', how='left')

### Dummy variable for promoted teams

In [11]:
# Create an empty column 'promoted' in the matches dataframe
matches['promoted'] = 0

# Get the unique seasons in the dataframe
unique_seasons = matches['season'].unique()

# Iterate over the unique seasons
for i, season in enumerate(unique_seasons):
    if i > 0: # ignore the first season
        prev_season_teams = matches[matches['season'] == unique_seasons[i - 1]]['team'].unique()
        current_season_teams = matches[matches['season'] == season]['team']
        
        # Update the 'promoted' column for teams not present in the previous season
        matches.loc[(matches['season'] == season) & (~current_season_teams.isin(prev_season_teams)), 'promoted'] = 1

### Collecting opponent data


In [16]:
# Merge on team and opponent
team_data = matches.merge(matches, left_on=['date', 'team', 'opponent'], right_on=['date', 'opponent', 'team'],
                          suffixes=('', '_opponent'))
opponent_data = matches.merge(matches, left_on=['date', 'team', 'opponent'], right_on=['date', 'opponent', 'team'],
                              suffixes=('_opponent', ''))

# Select the columns to calculate for the opponent
opponent_columns = ['days_since_last_game', 'games_played_last_21_days', 'pl_total_points', 'pl_total_gf',
                    'pl_total_ga', 'pl_total_goal_diff', 'pl_position', 'points_pl_form', 'gf_pl_form',
                    'ga_pl_form', 'poss_pl_form', 'xg_pl_form', 'xga_pl_form', 'prev_season_points',
                    'prev_season_gf', 'prev_season_ga', 'prev_season_goal_diff', 'promoted']

# Append "_opponent" to the opponent column names
opponent_columns = [f"{column}_opponent" for column in opponent_columns]

# Select the opponent data and rename columns
opponent_data = opponent_data[['unique_match_id'] + opponent_columns]

# Merge the opponent data into the matches dataframe
barp = matches.merge(opponent_data, on='unique_match_id', how='left')

barp

Unnamed: 0,unique_match_id,date,time,comp,round,day,venue,result,gf,ga,opponent,xg,xga,poss,attendance,captain,formation,referee,match_report,notes,season,team,date_downloaded,days_since_last_game,games_played_last_21_days,points,pl_total_points,pl_total_gf,pl_total_ga,pl_total_goal_diff,pl_position,set_identifier,last_h2h,last_h2h_form,venue_identifier,last_h2h_venue,last_h2h_venue_form,prev_season_points,prev_season_gf,prev_season_ga,prev_season_goal_diff,points_pl_form,gf_pl_form,ga_pl_form,poss_pl_form,xg_pl_form,xga_pl_form,promoted,days_since_last_game_opponent,games_played_last_21_days_opponent,pl_total_points_opponent,pl_total_gf_opponent,pl_total_ga_opponent,pl_total_goal_diff_opponent,pl_position_opponent,points_pl_form_opponent,gf_pl_form_opponent,ga_pl_form_opponent,poss_pl_form_opponent,xg_pl_form_opponent,xga_pl_form_opponent,prev_season_points_opponent,prev_season_gf_opponent,prev_season_ga_opponent,prev_season_goal_diff_opponent,promoted_opponent
0,199108172488,1991-08-17,,First Division,Matchweek 1,Sat,Home,D,1.00,1.00,Queens Park Rangers,,,,,,,,Match Report,,1992,Arsenal,2023-06-20,,,1,0.00,0.00,0.00,0.00,,ArsenalQueens Park Rangers,,,ArsenalHomeQueens Park Rangers,,,,,,,,,,,,,0,,,,,,,,,,,,,,,,,,
1,199108172182,1991-08-17,,First Division,Matchweek 1,Sat,Home,W,2.00,1.00,Oldham Athletic,,,,,,,,Match Report,,1992,Liverpool,2023-06-20,,,3,49.00,58.00,58.00,0.00,,LiverpoolOldham Athletic,,,HomeLiverpoolOldham Athletic,,,,,,,,,,,,,0,,,,,,,,,,,,,,,,,,
2,199108201419,1991-08-20,,First Division,Matchweek 2,Tue,Away,L,1.00,3.00,Everton,,,,,,,,Match Report,,1992,Arsenal,2023-06-20,3.00,1.00,0,0.00,0.00,0.00,0.00,,ArsenalEverton,,,ArsenalAwayEverton,,,,,,,,,,,,,0,,,,,,,,,,,,,,,,,,
3,199108211140,1991-08-21,,First Division,Matchweek 2,Wed,Away,L,1.00,2.00,Manchester City,,,,,,,,Match Report,,1992,Liverpool,2023-06-20,4.00,1.00,0,49.00,58.00,58.00,0.00,,LiverpoolManchester City,,,AwayLiverpoolManchester City,,,,,,,,,,,,,0,,,,,,,,,,,,,,,,,,
4,199108241152,1991-08-24,,First Division,Matchweek 3,Sat,Away,D,0.00,0.00,Luton Town,,,,,,,,Match Report,,1992,Liverpool,2023-06-20,3.00,2.00,1,49.00,58.00,58.00,0.00,,LiverpoolLuton Town,,,AwayLiverpoolLuton Town,,,,,,,,,,,,,0,,,,,,,,,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
27033,202305212016,2023-05-21,16:00:00,Premier League,Matchweek 37,Sun,Home,W,1.00,0.00,Chelsea,1.20,1.20,64.00,53490.00,Kyle Walker,3-4-3◆,Michael Oliver,Match Report,,2022,Manchester City,2023-06-20,4.00,6.00,3,178.00,191.00,57.00,134.00,1.00,ChelseaManchester City,3.00,2.40,ChelseaHomeManchester City,3.00,1.80,86.00,83.00,32.00,51.00,3.00,2.80,0.60,65.20,2.14,0.58,0,,,,,,,,,,,,,,,,,,
27034,202305241013,2023-05-24,20:00:00,Premier League,Matchweek 32,Wed,Away,D,1.00,1.00,Brighton and Hove Albion,1.80,2.20,60.00,31388.00,İlkay Gündoğan,4-3-3,Simon Hooper,Match Report,,2022,Manchester City,2023-06-20,3.00,6.00,1,181.00,192.00,57.00,135.00,1.00,Brighton and Hove AlbionManchester City,3.00,1.80,AwayBrighton and Hove AlbionManchester City,0.00,1.20,86.00,83.00,32.00,51.00,3.00,2.20,0.40,67.60,1.88,0.72,0,,,,,,,,,,,,,,,,,,
27035,202305281017,2023-05-28,16:30:00,Premier League,Matchweek 38,Sun,Away,L,0.00,1.00,Brentford,1.60,1.30,65.00,17120.00,Kyle Walker,3-2-4-1,John Brooks,Match Report,,2022,Manchester City,2023-06-20,4.00,5.00,0,182.00,193.00,58.00,135.00,1.00,BrentfordManchester City,0.00,1.20,AwayBrentfordManchester City,0.00,1.50,86.00,83.00,32.00,51.00,2.60,2.00,0.40,67.20,1.74,1.12,0,,,,,,,,,,,,,,,,,,
27036,202306030010,2023-06-03,15:00:00,FA Cup,Final,Sat,Neutral,W,2.00,1.00,Manchester United,,,60.00,83179.00,İlkay Gündoğan,3-2-4-1,Paul Tierney,Match Report,,2022,Manchester City,2023-06-20,6.00,5.00,3,182.00,193.00,58.00,135.00,,Manchester CityManchester United,0.00,1.80,Manchester CityManchester UnitedNeutral,,,86.00,83.00,32.00,51.00,,,,,,,0,,,,,,,,,,,,,,,,,,


In [12]:
matches.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 26276 entries, 0 to 26275
Data columns (total 48 columns):
 #   Column                     Non-Null Count  Dtype         
---  ------                     --------------  -----         
 0   unique_match_id            26276 non-null  int64         
 1   date                       26276 non-null  datetime64[ns]
 2   time                       7707 non-null   object        
 3   comp                       26276 non-null  object        
 4   round                      26276 non-null  object        
 5   day                        26276 non-null  object        
 6   venue                      26276 non-null  object        
 7   result                     26276 non-null  object        
 8   gf                         26155 non-null  float64       
 9   ga                         26155 non-null  float64       
 10  opponent                   26276 non-null  object        
 11  xg                         4229 non-null   float64       
 12  xga 

In [13]:
# save clean data in processed data file
matches.to_csv(f"{DATAPATH}/processed/matches_processed_basic_features.csv", index=False)