<a href="https://colab.research.google.com/github/lbrogna/football_data_for_ml/blob/main/processing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This notebook was written to process data from football-data.co.uk into a more usable format for machine learning.

The catalogue of features contained in the raw dataset is here: https://www.football-data.co.uk/notes.txt

Up-to-date data can be downloaded from this URL: https://www.football-data.co.uk/downloadm.php

Data can be downloaded by season. Files downloaded from the archives follow the naming convention used in the code cell below. To use this notebook, download the desired seasons and upload them to your Google Drive.

NOTE: the code must be tweaked to use seasons before 2019-2020. More data has been collected since then, so earlier seasons have slightly different columns.
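
One way to see which columns differ between seasons before tweaking the code is to compare each file's column set against a reference season. The sketch below uses small made-up frames in place of the real workbooks. The helper name `column_differences` and the sample columns are illustrative assumptions, not part of this notebook.

```python
import pandas as pd


def column_differences(frames, reference_key):
    """Return, per season, the columns missing from or extra to the reference season."""
    reference_cols = set(frames[reference_key].columns)
    report = {}
    for key, frame in frames.items():
        cols = set(frame.columns)
        report[key] = {
            "missing": sorted(reference_cols - cols),
            "extra": sorted(cols - reference_cols),
        }
    return report


# Toy stand-ins for two seasons (hypothetical data, not the real files)
frames = {
    "2019-2020": pd.DataFrame({"Div": ["E0"], "FTHG": [1], "AHh": [0.5]}),
    "2018-2019": pd.DataFrame({"Div": ["E0"], "FTHG": [2], "BbAHh": [0.25]}),
}

report = column_differences(frames, reference_key="2019-2020")
print(report["2018-2019"])
```

With the real data, `frames` could be built from the sheets returned by `pd.read_excel(..., sheet_name=None)`.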

The dataset comes to us in the form of match by match stats.

Along with cleaning, the data will be augmented in this notebook in the following ways:

1. Add to every match row historical statistics for both teams.
2. Add to every match row columns which reflect each team's position in their league at the time of the match (like a league table).

I will also reduce dimensionality for smoother processing. For a discussion of the processed data, and for the download link to an example dataset that has already been processed, scroll to the bottom.

In [None]:
import pandas as pd
import numpy as np

In [None]:
from google.colab import drive

drive.mount('/content/drive')

file_paths = ['/content/drive/MyDrive/improved_odds/all-euro-data-2022-2023.xlsx', 
              '/content/drive/MyDrive/improved_odds/all-euro-data-2021-2022.xlsx', 
              '/content/drive/MyDrive/improved_odds/all-euro-data-2020-2021.xlsx', 
              '/content/drive/MyDrive/improved_odds/all-euro-data-2019-2020.xlsx']


# Collect one dataframe per sheet (league) per season file
df_list = []

# Season label added as a column: '1' is the first file in file_paths (the most recent season)
season_num = '1'

# Loop through each file path in the list
for file_path in file_paths:
    # sheet_name=None reads every sheet into a dict of {sheet name: dataframe}
    excel_file = pd.read_excel(file_path, sheet_name=None)

    # Loop through each sheet in the Excel file
    for sheet_name, df in excel_file.items():
        # Tag every row with the season label of the file it came from
        df['season'] = season_num
        # Append the dataframe to the list
        df_list.append(df)

    # Move on to the next season label for the next file
    season_num = str(int(season_num) + 1)

# Concatenate all dataframes into a master dataframe
master_df = pd.concat(df_list, ignore_index=True)

master_df

Mounted at /content/drive


Unnamed: 0,Div,Date,Time,HomeTeam,AwayTeam,FTHG,FTAG,FTR,HTHG,HTAG,...,AHCh,B365CAHH,B365CAHA,PCAHH,PCAHA,MaxCAHH,MaxCAHA,AvgCAHH,AvgCAHA,season
0,E0,2022-08-05,20:00:00,Crystal Palace,Arsenal,0,2,A,0.0,1.0,...,0.50,2.09,1.84,2.04,1.88,2.09,1.88,2.03,1.85,1
1,E0,2022-08-06,12:30:00,Fulham,Liverpool,2,2,D,1.0,0.0,...,1.75,1.90,2.03,1.91,2.02,2.01,2.06,1.89,1.99,1
2,E0,2022-08-06,15:00:00,Bournemouth,Aston Villa,2,0,H,1.0,0.0,...,0.50,1.93,2.00,1.93,2.00,1.94,2.04,1.88,2.00,1
3,E0,2022-08-06,15:00:00,Leeds,Wolves,2,1,H,1.0,1.0,...,-0.25,2.08,1.85,2.10,1.84,2.14,1.87,2.08,1.81,1
4,E0,2022-08-06,15:00:00,Newcastle,Nott'm Forest,2,0,H,0.0,0.0,...,-1.00,1.97,1.96,1.99,1.93,2.19,1.97,2.03,1.86,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
28161,G1,2020-07-18,17:15:00,Larisa,Xanthi,0,0,D,0.0,0.0,...,0.00,2.05,1.80,2.07,1.78,2.14,1.83,2.06,1.78,4
28162,G1,2020-07-18,17:15:00,Panetolikos,Volos NFC,1,0,H,1.0,0.0,...,-1.00,1.83,2.02,1.85,2.00,1.88,2.07,1.82,2.01,4
28163,G1,2020-07-19,18:00:00,Olympiakos,AEK,3,0,H,2.0,0.0,...,-0.75,1.95,1.90,1.93,1.93,2.00,1.97,1.92,1.91,4
28164,G1,2020-07-19,18:00:00,Panathinaikos,OFI Crete,3,2,H,3.0,2.0,...,-1.00,1.93,1.93,1.90,1.95,1.99,2.00,1.90,1.92,4


In [None]:
def update_league_table(table, home_team, away_team, home_score, away_score):
    """
    Update the league table with the result of a match.

    Args:
        table (pandas.DataFrame): The league table to update.
        home_team (str): The name of the home team.
        away_team (str): The name of the away team.
        home_score (int): The number of goals scored by the home team.
        away_score (int): The number of goals scored by the away team.

    Returns:
        None
    """
    # Update the goal differential
    table.at[home_team, 'GF'] += home_score
    table.at[home_team, 'GA'] += away_score
    table.at[home_team, 'GD'] = table.at[home_team, 'GF'] - table.at[home_team, 'GA']
    
    table.at[away_team, 'GF'] += away_score
    table.at[away_team, 'GA'] += home_score
    table.at[away_team, 'GD'] = table.at[away_team, 'GF'] - table.at[away_team, 'GA']

    # update the matches played
    table.at[home_team, 'MP'] += 1
    table.at[away_team, 'MP'] += 1

    # Update the points based on the result
    if home_score > away_score:
        table.at[home_team, 'W'] += 1
        table.at[home_team, 'P'] += 3
        table.at[away_team, 'L'] += 1
    elif home_score == away_score:
        table.at[home_team, 'D'] += 1
        table.at[home_team, 'P'] += 1
        table.at[away_team, 'D'] += 1
        table.at[away_team, 'P'] += 1
    else:
        table.at[home_team, 'L'] += 1
        table.at[away_team, 'W'] += 1
        table.at[away_team, 'P'] += 3


def create_empty_league_table(df, league, season):
    """
    Create an empty league table with columns for team names, matches played, wins, draws, losses,     
    goals for, goals against, goal differential, and points.

    Args:
        df (pandas.DataFrame): The dataframe containing the match data.
        league (str): The name of the league.
        season (str): The season of the league.

    Returns:
        pandas.DataFrame: An empty league table.
    """
    # Get unique team names for the given league and season
    teams = df.loc[(df['Div'] == league) & (df['season'] == season), ['HomeTeam', 'AwayTeam']].stack().unique()
    
    # Create an empty dataframe with columns for team names, matches played, wins, draws, losses,     
    # goals for, goals against, goal differential, and points
    table = pd.DataFrame(columns=['team', 'MP', 'W', 'D', 'L', 'GF', 'GA', 'GD', 'P'])
    table['team'] = teams
    table.set_index('team', inplace=True)
    table[['MP', 'W', 'D', 'L', 'GF', 'GA', 'GD', 'P']] = 0
    
    return table


def add_league_table_columns(df, ranks):
    """
    Add columns derived from the league table as it stood before each match.

    Args:
        df (pandas.DataFrame): The dataframe containing the match data.
        ranks (int): The number of table positions to compare each team against.

    Returns:
        pandas.DataFrame: The input dataframe, with the new columns added in place.
    """
    # Create an empty dictionary to hold league tables for each league/season combination
    league_tables = {}

    # Loop through each row in the dataframe in chronological order
    for _, row in df.sort_values('Date').iterrows():
        # Get the league and season for the current row
        league = row['Div']
        season = row['season']

        # Create a new league table if one doesn't exist for this league/season combination
        if (league, season) not in league_tables:
            league_tables[(league, season)] = create_empty_league_table(df, league, season)

        # game_count = len(df[(df['Div'] == league) & (df['season'] == season)])

        # Read the teams and final score of the current match
        home_team = row['HomeTeam']
        away_team = row['AwayTeam']
        home_score = row['FTHG']
        away_score = row['FTAG']

        # Get the league table for the current league/season combination
        table = league_tables[(league, season)]

        # Calculate the completion percentage of the season so far
        # matches_per_team = game_count / len(table)
        current_match = row.name
        # season_completion = ((table.at[home_team, 'MP'] + table.at[away_team, 'MP']) / 2) / matches_per_team

        # Add the season completion percentage column to the dataframe
        # df.at[current_match, 'season_completion'] = season_completion

        # Fractions of the table (1/len ... 1) to sample, and the rank labels used in column names
        table_positions = np.linspace(1/len(table), 1, ranks)
        position_list = list(range(ranks, 0, -1))

        # Add columns representing the absolute difference in points and the goal difference gap
        # between the given team and the sampled table positions.
        # NOTE: the table is never re-sorted, so a "rank" here is a team's fixed row position in the table.
        home_team_rank = table.index.get_loc(home_team) + 1  # Add 1 to start from rank 1 instead of 0
        away_team_rank = table.index.get_loc(away_team) + 1
        name_pos = 0
        for rank in table_positions:
            if rank == home_team_rank/len(table):
                df.at[current_match, f'H_rank_{position_list[name_pos]}_abs_diff_points'] = 0
                df.at[current_match, f'H_rank_{position_list[name_pos]}_diff_gd'] = 0
            else:
                rank_team = table.iloc[int(len(table) * rank) - 1].name  # Get the team name at the current rank
                df.at[current_match, f'H_rank_{position_list[name_pos]}_abs_diff_points'] = abs(table.at[home_team, 'P'] - table.at[rank_team, 'P'])
                df.at[current_match, f'H_rank_{position_list[name_pos]}_diff_gd'] = table.at[home_team, 'GD'] - table.at[rank_team, 'GD']

            if rank == away_team_rank/len(table):
                df.at[current_match, f'A_rank_{position_list[name_pos]}_abs_diff_points'] = 0
                df.at[current_match, f'A_rank_{position_list[name_pos]}_diff_gd'] = 0
            else:
                rank_team = table.iloc[int(len(table) * rank) - 1].name  # Get the team name at the current rank
                df.at[current_match, f'A_rank_{position_list[name_pos]}_abs_diff_points'] = abs(table.at[away_team, 'P'] - table.at[rank_team, 'P'])
                df.at[current_match, f'A_rank_{position_list[name_pos]}_diff_gd'] = table.at[away_team, 'GD'] - table.at[rank_team, 'GD']
            name_pos += 1

        # Only now record the result, so the features above never include the current match
        update_league_table(table, home_team, away_team, home_score, away_score)

    return df

In [None]:
# df = master_df[master_df['Div'] == 'E0']

Below, I'm dropping the 'Referee' column. I'm assuming we don't have enough matches for each individual referee to learn anything meaningful. We only have their names, so the only option would be to one-hot encode, which I suspect would only stunt the model.

In [None]:
df = master_df.drop(['Referee'], axis=1)
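
The assumption that individual referees appear too rarely to be informative can be checked with a quick count of matches per referee. This is a minimal sketch on made-up rows. The referee names and the `min_matches` threshold are illustrative assumptions.

```python
import pandas as pd

# Hypothetical sample; in the notebook this would be master_df['Referee']
matches = pd.DataFrame({
    "Referee": ["A Taylor", "M Oliver", "A Taylor", None, "M Oliver", "A Taylor"],
})

# Matches officiated per referee (missing values are excluded by default)
per_referee = matches["Referee"].value_counts()

min_matches = 3  # arbitrary cut-off for "enough data"
frequent = per_referee[per_referee >= min_matches]

print(per_referee)
print(f"{len(frequent)} of {len(per_referee)} referees have at least {min_matches} matches")
```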

In [None]:
# df_home_sort = df.sort_values(by=['HomeTeam', 'Date'])
# df_away_sort = df.sort_values(by=['AwayTeam', 'Date'])
# df_home_sort.drop(columns='AwayTeam', inplace=True)
# df_away_sort.drop(columns='HomeTeam', inplace=True)
# df_home_sort.rename(columns={'HomeTeam': 'team'}, inplace=True)
# df_away_sort.rename(columns={'AwayTeam': 'team'}, inplace=True)
# df_home_sort['is_home'] = 1
# df_away_sort['is_home'] = 0
# # df_away_sort.rename(columns=merged_dict, inplace=True)
# # df_sort = pd.concat([df_home_sort, df_away_sort])
# # df_sort.sort_values(by=['team', 'Date'], inplace=True)
# # df_sort.reset_index(inplace=True)
# # df_sort

In [None]:
nan_cols = df.columns[df.isna().any()].tolist()

# Print the names of the columns with NaN values and their counts
for col in nan_cols:
    print(col, df[col].isna().sum())

HTHG 5
HTAG 5
HTR 5
HS 1880
AS 1880
HST 1880
AST 1880
HF 1882
AF 1882
HC 1880
AC 1880
HY 5
AY 5
HR 5
AR 5
B365H 76
B365D 76
B365A 76
BWH 675
BWD 675
BWA 675
IWH 108
IWD 108
IWA 108
PSH 144
PSD 144
PSA 144
WHH 734
WHD 734
WHA 734
VCH 181
VCD 181
VCA 181
MaxH 17
MaxD 17
MaxA 17
AvgH 17
AvgD 17
AvgA 17
B365>2.5 88
B365<2.5 88
P>2.5 176
P<2.5 176
Max>2.5 18
Max<2.5 18
Avg>2.5 18
Avg<2.5 18
AHh 19
B365AHH 255
B365AHA 256
PAHH 146
PAHA 145
MaxAHH 19
MaxAHA 19
AvgAHH 19
AvgAHA 19
B365CH 33
B365CD 33
B365CA 33
BWCH 486
BWCD 486
BWCA 486
IWCH 82
IWCD 82
IWCA 82
PSCH 26
PSCD 26
PSCA 26
WHCH 691
WHCD 691
WHCA 691
VCCH 35
VCCD 35
VCCA 35
MaxCH 3
MaxCD 3
MaxCA 3
AvgCH 3
AvgCD 3
AvgCA 3
B365C>2.5 37
B365C<2.5 37
PC>2.5 51
PC<2.5 51
MaxC>2.5 3
MaxC<2.5 3
AvgC>2.5 3
AvgC<2.5 3
AHCh 3
B365CAHH 66
B365CAHA 66
PCAHH 27
PCAHA 27
MaxCAHH 3
MaxCAHA 3
AvgCAHH 3
AvgCAHA 3


I want to represent the 'FTR' and 'HTR' columns (full-time and half-time results) as three one-hot columns each, one per outcome (home win, draw, away win). I use this code to do that:

In [None]:
df = pd.get_dummies(df, columns=['FTR'])
df = pd.get_dummies(df, columns=['HTR'])
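
To illustrate what this encoding produces, here is a small sketch on a made-up frame. Encoding both columns in one call and passing `dtype=int` are optional choices made here for readability, not something the notebook does.

```python
import pandas as pd

# Hypothetical results: H = home win, D = draw, A = away win
toy = pd.DataFrame({"FTR": ["H", "D", "A", "H"], "HTR": ["D", "D", "A", "H"]})

# One call handles both columns; dtype=int gives 0/1 instead of booleans
encoded = pd.get_dummies(toy, columns=["FTR", "HTR"], dtype=int)

print(encoded.columns.to_list())
print(encoded.iloc[0].to_dict())
```

Each original column is replaced by one indicator column per outcome, named `<column>_<value>`.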

Now, I'll use a function I coded above to add columns that represent position on the league table (or a suitable proxy) for each team on a given match day. Not every league has the same number of teams, so I opted for 10 league table position columns, indicating the distance in points and goal difference from 10 locations on the table (first place, last place, and 8 others in between). The function was also meant to provide a column for each match representing the season completion percentage (0-1, where 0 is the first game of the season and 1 is the last) at the time of the match, but that calculation seems a bit bugged, so it is left commented out.

I say suitable proxy above because not every league has the same rules regarding tiebreakers. Still, points and goal difference should be plenty to get a good idea of the strength of each team compared to the league that season, as well as to their opponent.

In [None]:
add_league_table_columns(df, ranks=10)

Unnamed: 0,Div,Date,Time,HomeTeam,AwayTeam,FTHG,FTAG,HTHG,HTAG,HS,...,away_rank_3_abs_diff_points,away_rank_3_diff_gd,home_rank_2_abs_diff_points,home_rank_2_diff_gd,away_rank_2_abs_diff_points,away_rank_2_diff_gd,home_rank_1_abs_diff_points,home_rank_1_diff_gd,away_rank_1_abs_diff_points,away_rank_1_diff_gd
0,E0,2022-08-05,20:00:00,Crystal Palace,Arsenal,0,2,0.0,1.0,10.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,E0,2022-08-06,12:30:00,Fulham,Liverpool,2,2,1.0,0.0,9.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,E0,2022-08-06,15:00:00,Bournemouth,Aston Villa,2,0,1.0,0.0,7.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,E0,2022-08-06,15:00:00,Leeds,Wolves,2,1,1.0,1.0,12.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,E0,2022-08-06,15:00:00,Newcastle,Nott'm Forest,2,0,0.0,0.0,23.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
28161,G1,2020-07-18,17:15:00,Larisa,Xanthi,0,0,0.0,0.0,8.0,...,37.0,-42.0,6.0,8.0,6.0,5.0,0.0,3.0,0.0,0.0
28162,G1,2020-07-18,17:15:00,Panetolikos,Volos NFC,1,0,1.0,0.0,7.0,...,41.0,-55.0,0.0,0.0,5.0,-7.0,9.0,-6.0,4.0,-13.0
28163,G1,2020-07-19,18:00:00,Olympiakos,AEK,3,0,2.0,0.0,8.0,...,4.0,1.0,59.0,73.0,40.0,48.0,52.0,68.0,33.0,43.0
28164,G1,2020-07-19,18:00:00,Panathinaikos,OFI Crete,3,2,3.0,2.0,6.0,...,36.0,-41.0,26.0,28.0,7.0,6.0,19.0,23.0,0.0,1.0
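
For intuition about the quantities involved, the sketch below builds a tiny league table from three made-up results and measures one team's points gap to first place. It is a vectorised illustration, not the row-by-row function used in this notebook. The team names and the sort order (points first, then goal difference) are assumptions.

```python
import pandas as pd

# Hypothetical results: (home, away, home goals, away goals)
results = pd.DataFrame(
    [("Alpha", "Beta", 2, 0), ("Gamma", "Alpha", 1, 1), ("Beta", "Gamma", 0, 3)],
    columns=["HomeTeam", "AwayTeam", "FTHG", "FTAG"],
)


def standings(results):
    """Build a table of points (P) and goal difference (GD), best team first."""
    home = results.rename(columns={"HomeTeam": "team", "FTHG": "GF", "FTAG": "GA"})
    away = results.rename(columns={"AwayTeam": "team", "FTAG": "GF", "FTHG": "GA"})
    long = pd.concat(
        [home[["team", "GF", "GA"]], away[["team", "GF", "GA"]]], ignore_index=True
    )
    # 3 points for a win, 1 for a draw, 0 for a loss
    long["P"] = (long["GF"] > long["GA"]) * 3 + (long["GF"] == long["GA"]) * 1
    table = long.groupby("team")[["GF", "GA", "P"]].sum()
    table["GD"] = table["GF"] - table["GA"]
    return table.sort_values(["P", "GD"], ascending=False)


table = standings(results)
print(table)

# Points gap between one team and the side currently in first place
leader = table.index[0]
gap_to_leader = abs(table.at["Beta", "P"] - table.at[leader, "P"])
print(gap_to_leader)
```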


Below, I change the 'Date' column to be a proper datetime data type. I also drop the 'Time' column because I'm not that interested in when during the day each game was played.

In [None]:
df['Date'] = pd.to_datetime(df['Date'])
df = df.drop(columns='Time')

In [None]:
def H_to_self(strings):
    replaced_strings = []
    for string in strings:
        # Find the index of the second occurrence of 'H' in the string
        second_h_index = string.find('H', string.find('H') + 1)
        
        if second_h_index != -1:
            # Replace the second occurrence of 'H' with 'self'
            replaced_string = string[:second_h_index] + 'self' + string[second_h_index + 1:]
            replaced_strings.append(replaced_string)
        else:
            replaced_strings.append(string)
    
    return replaced_strings

def A_to_opp(strings):
    replaced_strings = []
    for string in strings:
        # Temporarily mask 'Avg' so that its 'A' is not counted
        masked = string.replace('Avg', '~')
        # Find the index of the second occurrence of 'A' in the string
        second_a_index = masked.find('A', masked.find('A') + 1)

        if second_a_index != -1:
            # Replace the second occurrence of 'A' with 'opp'
            masked = masked[:second_a_index] + 'opp' + masked[second_a_index + 1:]

        # Restore 'Avg' before storing the result
        replaced_strings.append(masked.replace('~', 'Avg'))

    return replaced_strings

def disambiguate_home_away(df):
    # TODO: not yet implemented
    pass

In [None]:
home_cols = ['FTHG', 'HTHG', 'HS', 'HST', 'HF', 'HC', 'HY', 'HR', 'B365H', 'BWH', 'IWH', 'PSH', 'WHH', 'WHD', 'VCH', 'MaxH', 'AvgH', 'B365AHH', 'PAHH', 'MaxAHH', 'AvgAHH', 'BWCH', 'IWCH', 'PSCH', 'WHCH', 'WHCD', 'VCCH', 'MaxCH', 'AvgCH', 'B365CAHH', 'PCAHH', 'MaxCAHH', 'AvgCAHH', 'home_rank_10_abs_diff_points', 'home_rank_10_diff_gd', 'home_rank_9_abs_diff_points', 'home_rank_9_diff_gd', 'home_rank_8_abs_diff_points', 'home_rank_8_diff_gd', 'home_rank_7_abs_diff_points', 'home_rank_7_diff_gd', 'home_rank_6_abs_diff_points', 'home_rank_6_diff_gd', 'home_rank_5_abs_diff_points', 'home_rank_5_diff_gd', 'home_rank_4_abs_diff_points', 'home_rank_4_diff_gd', 'home_rank_3_abs_diff_points', 'home_rank_3_diff_gd', 'home_rank_2_abs_diff_points', 'home_rank_2_diff_gd', 'home_rank_1_abs_diff_points', 'home_rank_1_diff_gd']
away_cols = ['FTAG', 'HTAG', 'AS', 'AST', 'AF', 'AC', 'AY', 'AR', 'B365A', 'BWA', 'IWA', 'PSA', 'WHA', 'VCA', 'MaxA', 'AvgA', 'B365AHA', 'PAHA', 'MaxAHA', 'AvgAHA', 'B365CA', 'BWCA', 'IWCA', 'PSCA', 'WHCA', 'VCCA', 'MaxCA', 'AvgCA', 'B365CAHA', 'PCAHA', 'MaxCAHA', 'AvgCAHA', 'away_rank_10_abs_diff_points', 'away_rank_10_diff_gd', 'away_rank_9_abs_diff_points', 'away_rank_9_diff_gd', 'away_rank_8_abs_diff_points', 'away_rank_8_diff_gd', 'away_rank_7_abs_diff_points', 'away_rank_7_diff_gd', 'away_rank_6_abs_diff_points', 'away_rank_6_diff_gd', 'away_rank_5_abs_diff_points', 'away_rank_5_diff_gd', 'away_rank_4_abs_diff_points', 'away_rank_4_diff_gd', 'away_rank_3_abs_diff_points', 'away_rank_3_diff_gd', 'away_rank_2_abs_diff_points', 'away_rank_2_diff_gd', 'away_rank_1_abs_diff_points', 'away_rank_1_diff_gd']
neutral_cols = ['B365D', 'BWD', 'IWD', 'PSD', 'VCD', 'MaxD', 'AvgD', 'B365>2.5', 'B365<2.5', 'P>2.5', 'P<2.5', 'Max>2.5', 'Max<2.5', 'Avg>2.5', 'Avg<2.5', 'AHh', 'B365CD', 'BWCD', 'IWCD', 'PSCD', 'VCCD', 'MaxCD', 'AvgCD', 'B365C>2.5', 'B365C<2.5', 'PC>2.5', 'PC<2.5', 'MaxC>2.5', 'MaxC<2.5', 'AvgC>2.5', 'AvgC<2.5', 'AHCh']

In [None]:

def ha_to_so(df, is_home):
    if is_home == 1:
        # TODO: not yet implemented
        pass


The next thing I want to do is add historical match data to each row. I want to reduce some of the bulk in these features if I'm going to be adding information about past games to every row. Below I'm preparing to perform a PCA by removing certain categorical columns and odds and ends.

In [None]:
df['FTHG'] = df['FTHG'].astype('float64')
df['FTAG'] = df['FTAG'].astype('float64')
# Every column that is not float64 is kept out of the PCA
leave_out_of_pca = [col for col in df.columns if df[col].dtype != 'float64']
leave_out_of_pca

['Div',
 'Date',
 'HomeTeam',
 'AwayTeam',
 'B365CH',
 'season',
 'FTR_A',
 'FTR_D',
 'FTR_H',
 'HTR_A',
 'HTR_D',
 'HTR_H']

The function below will allow me to get the column names for stats that correspond to the home team and stats that correspond to the away team into different lists. This is so I can perform different PCAs on these different subsets. I want to preserve this H/A encoding when I reduce the dimensionality, because many of the questions I want to ask my eventual model will be related to this. I care about how each team performed during a match, and the columns mean different things with respect to each team.

In [None]:
def get_home_away(stats):
    """Split column names into home-team and away-team lists based on the letters they contain."""
    home_cols = []
    away_cols = []
    for stat in stats:
        # Mask 'Avg' and 'AH' (Asian handicap) so their letters are not mistaken for A/H markers
        masked = stat.replace('Avg', '~').replace('AH', '-')
        if 'A' in masked or 'away' in masked:
            away_cols.append(stat)
        elif 'H' in masked or 'home' in masked:
            # Note: 'WHD' and 'WHCD' also land here because of the 'H' in the 'WH' prefix
            home_cols.append(stat)

    return (home_cols, away_cols)

I don't want to include in my PCA any variables that are one-hot encoded, or that represent any categorical dtypes. Those columns are the ones collected in `leave_out_of_pca` above.

Here I am getting the names of the home- and away-specific columns that I plan on using in the PCA. I then print just the home columns to see how I did.

In [None]:
stat_cols = [col for col in df.columns.to_list() if col not in leave_out_of_pca]

home_cols, away_cols = get_home_away(stat_cols)

print(home_cols)

['FTHG', 'HTHG', 'HS', 'HST', 'HF', 'HC', 'HY', 'HR', 'B365H', 'BWH', 'IWH', 'PSH', 'WHH', 'WHD', 'VCH', 'MaxH', 'AvgH', 'B365AHH', 'PAHH', 'MaxAHH', 'AvgAHH', 'BWCH', 'IWCH', 'PSCH', 'WHCH', 'WHCD', 'VCCH', 'MaxCH', 'AvgCH', 'B365CAHH', 'PCAHH', 'MaxCAHH', 'AvgCAHH', 'home_rank_10_abs_diff_points', 'home_rank_10_diff_gd', 'home_rank_9_abs_diff_points', 'home_rank_9_diff_gd', 'home_rank_8_abs_diff_points', 'home_rank_8_diff_gd', 'home_rank_7_abs_diff_points', 'home_rank_7_diff_gd', 'home_rank_6_abs_diff_points', 'home_rank_6_diff_gd', 'home_rank_5_abs_diff_points', 'home_rank_5_diff_gd', 'home_rank_4_abs_diff_points', 'home_rank_4_diff_gd', 'home_rank_3_abs_diff_points', 'home_rank_3_diff_gd', 'home_rank_2_abs_diff_points', 'home_rank_2_diff_gd', 'home_rank_1_abs_diff_points', 'home_rank_1_diff_gd']


In [None]:
df

Unnamed: 0,Div,Date,HomeTeam,AwayTeam,FTHG,FTAG,HTHG,HTAG,HS,AS,...,away_rank_3_abs_diff_points,away_rank_3_diff_gd,home_rank_2_abs_diff_points,home_rank_2_diff_gd,away_rank_2_abs_diff_points,away_rank_2_diff_gd,home_rank_1_abs_diff_points,home_rank_1_diff_gd,away_rank_1_abs_diff_points,away_rank_1_diff_gd
0,E0,2022-08-05,Crystal Palace,Arsenal,0.0,2.0,0.0,1.0,10.0,10.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,E0,2022-08-06,Fulham,Liverpool,2.0,2.0,1.0,0.0,9.0,11.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,E0,2022-08-06,Bournemouth,Aston Villa,2.0,0.0,1.0,0.0,7.0,15.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,E0,2022-08-06,Leeds,Wolves,2.0,1.0,1.0,1.0,12.0,15.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,E0,2022-08-06,Newcastle,Nott'm Forest,2.0,0.0,0.0,0.0,23.0,5.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
28161,G1,2020-07-18,Larisa,Xanthi,0.0,0.0,0.0,0.0,8.0,3.0,...,37.0,-42.0,6.0,8.0,6.0,5.0,0.0,3.0,0.0,0.0
28162,G1,2020-07-18,Panetolikos,Volos NFC,1.0,0.0,1.0,0.0,7.0,12.0,...,41.0,-55.0,0.0,0.0,5.0,-7.0,9.0,-6.0,4.0,-13.0
28163,G1,2020-07-19,Olympiakos,AEK,3.0,0.0,2.0,0.0,8.0,12.0,...,4.0,1.0,59.0,73.0,40.0,48.0,52.0,68.0,33.0,43.0
28164,G1,2020-07-19,Panathinaikos,OFI Crete,3.0,2.0,3.0,2.0,6.0,16.0,...,36.0,-41.0,26.0,28.0,7.0,6.0,19.0,23.0,0.0,1.0


I want to save a copy of my semi-processed data set before running PCA. I'll be adding my engineered features back to this dataframe after I create them.

In [None]:
raw_df = df.copy()

In [None]:
raw_df

Unnamed: 0,Div,Date,HomeTeam,AwayTeam,FTHG,FTAG,HTHG,HTAG,HS,AS,...,away_rank_3_abs_diff_points,away_rank_3_diff_gd,home_rank_2_abs_diff_points,home_rank_2_diff_gd,away_rank_2_abs_diff_points,away_rank_2_diff_gd,home_rank_1_abs_diff_points,home_rank_1_diff_gd,away_rank_1_abs_diff_points,away_rank_1_diff_gd
0,E0,2022-08-05,Crystal Palace,Arsenal,0.0,2.0,0.0,1.0,10.0,10.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,E0,2022-08-06,Fulham,Liverpool,2.0,2.0,1.0,0.0,9.0,11.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,E0,2022-08-06,Bournemouth,Aston Villa,2.0,0.0,1.0,0.0,7.0,15.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,E0,2022-08-06,Leeds,Wolves,2.0,1.0,1.0,1.0,12.0,15.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,E0,2022-08-06,Newcastle,Nott'm Forest,2.0,0.0,0.0,0.0,23.0,5.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
28161,G1,2020-07-18,Larisa,Xanthi,0.0,0.0,0.0,0.0,8.0,3.0,...,37.0,-42.0,6.0,8.0,6.0,5.0,0.0,3.0,0.0,0.0
28162,G1,2020-07-18,Panetolikos,Volos NFC,1.0,0.0,1.0,0.0,7.0,12.0,...,41.0,-55.0,0.0,0.0,5.0,-7.0,9.0,-6.0,4.0,-13.0
28163,G1,2020-07-19,Olympiakos,AEK,3.0,0.0,2.0,0.0,8.0,12.0,...,4.0,1.0,59.0,73.0,40.0,48.0,52.0,68.0,33.0,43.0
28164,G1,2020-07-19,Panathinaikos,OFI Crete,3.0,2.0,3.0,2.0,6.0,16.0,...,36.0,-41.0,26.0,28.0,7.0,6.0,19.0,23.0,0.0,1.0


Here I perform a PCA on all the stats in each row except for the ones that I mentioned. I first scale the data, then fill the NaNs with their column means so that the PCA can run. 3,312 of the 28,165 rows contain at least one NaN (counted in the next cell), so the imputation adds some noise, which I expect the PCA to tolerate. After the PCA is fitted, I restore the NaNs to their original locations.

The number of components yielded by each PCA is printed at the end of the block. There are substantially fewer components than input variables, even when preserving 95% of the variance.

In [None]:
na_count = df.isna().sum(axis=1)
na_ind_list = na_count[na_count>0]
print(len(na_ind_list))

3312


In [None]:
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

columns_for_pca = [col for col in df.columns.to_list() if col not in leave_out_of_pca]

# Standardise each column before the PCA
scaler = StandardScaler()

df[columns_for_pca] = scaler.fit_transform(df[columns_for_pca])

# Save the original NaN locations in a boolean mask
nan_mask = df.isna()

# Fill NaN values with column averages (numeric columns only)
column_averages = df.mean(numeric_only=True)
df = df.fillna(column_averages)

# getting column names for home, away, and neutral stats
h_cols = home_cols
a_cols = away_cols
ha_cols = h_cols + a_cols
n_cols = [col for col in columns_for_pca if col not in ha_cols]

# PCA: keep as many components as needed to explain 95% of the variance
hpca = PCA(n_components = 0.95)
apca = PCA(n_components = 0.95)
npca = PCA(n_components = 0.95)
hpca_components = hpca.fit_transform(df[h_cols])
apca_components = apca.fit_transform(df[a_cols])
npca_components = npca.fit_transform(df[n_cols])

# Restore NaN values to their original locations
df = df.where(~nan_mask, pd.NA)

# Create new column names for the PCA components
home_pca_column_names = ['HPCA_'+ str(i+1) for i in range(hpca_components.shape[1])]
away_pca_column_names = ['APCA_'+ str(i+1) for i in range(apca_components.shape[1])]
neutral_pca_column_names = ['NPCA_'+ str(i+1) for i in range(npca_components.shape[1])]

# Add the PCA components to the dataframe as new columns
df[home_pca_column_names] = hpca_components
df[away_pca_column_names] = apca_components
df[neutral_pca_column_names] = npca_components

# get the number of components
print('home stat components: ' + str(hpca_components.shape[1]))
print('away stat components: ' + str(apca_components.shape[1]))
print('neutral stat components: ' + str(npca_components.shape[1]))


home stat components: 27
away stat components: 26
neutral stat components: 4
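
To make the 95% threshold concrete, the sketch below walks through the same scale, impute, and reduce steps on synthetic data. It uses a plain NumPy SVD rather than scikit-learn, as an approximation of how a fractional `n_components` picks the number of components. The data, noise level, and missing-value rate are all made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: 200 rows, 6 features driven by only 2 latent factors plus small noise
latent = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 6))
X = latent @ mixing + 0.05 * rng.normal(size=(200, 6))
X[rng.random(X.shape) < 0.02] = np.nan  # sprinkle in a few missing values

# Standardise each column while ignoring NaNs, then impute with the column mean (0 after scaling)
X_scaled = (X - np.nanmean(X, axis=0)) / np.nanstd(X, axis=0)
X_filled = np.where(np.isnan(X_scaled), 0.0, X_scaled)

# PCA via SVD of the centred matrix; squared singular values are proportional to explained variance
centred = X_filled - X_filled.mean(axis=0)
_, singular_values, _ = np.linalg.svd(centred, full_matrices=False)
explained_ratio = singular_values**2 / np.sum(singular_values**2)

# Smallest number of components whose cumulative explained variance reaches 95%
cumulative = np.cumsum(explained_ratio)
n_components = int(np.searchsorted(cumulative, 0.95) + 1)
print(n_components, "of", X.shape[1], "components kept")
```

Because the six synthetic features share two underlying factors, far fewer than six components are needed, which is the same effect seen with the correlated betting odds in the real data.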


In [None]:
# for reference, here are the stats used to train each of the 3 PCAs I just did.
print(h_cols)
print(a_cols)
print(n_cols)

['FTHG', 'HTHG', 'HS', 'HST', 'HF', 'HC', 'HY', 'HR', 'B365H', 'BWH', 'IWH', 'PSH', 'WHH', 'WHD', 'VCH', 'MaxH', 'AvgH', 'B365AHH', 'PAHH', 'MaxAHH', 'AvgAHH', 'BWCH', 'IWCH', 'PSCH', 'WHCH', 'WHCD', 'VCCH', 'MaxCH', 'AvgCH', 'B365CAHH', 'PCAHH', 'MaxCAHH', 'AvgCAHH', 'home_rank_10_abs_diff_points', 'home_rank_10_diff_gd', 'home_rank_9_abs_diff_points', 'home_rank_9_diff_gd', 'home_rank_8_abs_diff_points', 'home_rank_8_diff_gd', 'home_rank_7_abs_diff_points', 'home_rank_7_diff_gd', 'home_rank_6_abs_diff_points', 'home_rank_6_diff_gd', 'home_rank_5_abs_diff_points', 'home_rank_5_diff_gd', 'home_rank_4_abs_diff_points', 'home_rank_4_diff_gd', 'home_rank_3_abs_diff_points', 'home_rank_3_diff_gd', 'home_rank_2_abs_diff_points', 'home_rank_2_diff_gd', 'home_rank_1_abs_diff_points', 'home_rank_1_diff_gd']
['FTAG', 'HTAG', 'AS', 'AST', 'AF', 'AC', 'AY', 'AR', 'B365A', 'BWA', 'IWA', 'PSA', 'WHA', 'VCA', 'MaxA', 'AvgA', 'B365AHA', 'PAHA', 'MaxAHA', 'AvgAHA', 'B365CA', 'BWCA', 'IWCA', 'PSCA', 'W

In [None]:
print('num cols: ' + str(df.shape[1]))

num cols: 206


In [None]:
df.columns.to_list()

['Div',
 'Date',
 'HomeTeam',
 'AwayTeam',
 'FTHG',
 'FTAG',
 'HTHG',
 'HTAG',
 'HS',
 'AS',
 'HST',
 'AST',
 'HF',
 'AF',
 'HC',
 'AC',
 'HY',
 'AY',
 'HR',
 'AR',
 'B365H',
 'B365D',
 'B365A',
 'BWH',
 'BWD',
 'BWA',
 'IWH',
 'IWD',
 'IWA',
 'PSH',
 'PSD',
 'PSA',
 'WHH',
 'WHD',
 'WHA',
 'VCH',
 'VCD',
 'VCA',
 'MaxH',
 'MaxD',
 'MaxA',
 'AvgH',
 'AvgD',
 'AvgA',
 'B365>2.5',
 'B365<2.5',
 'P>2.5',
 'P<2.5',
 'Max>2.5',
 'Max<2.5',
 'Avg>2.5',
 'Avg<2.5',
 'AHh',
 'B365AHH',
 'B365AHA',
 'PAHH',
 'PAHA',
 'MaxAHH',
 'MaxAHA',
 'AvgAHH',
 'AvgAHA',
 'B365CH',
 'B365CD',
 'B365CA',
 'BWCH',
 'BWCD',
 'BWCA',
 'IWCH',
 'IWCD',
 'IWCA',
 'PSCH',
 'PSCD',
 'PSCA',
 'WHCH',
 'WHCD',
 'WHCA',
 'VCCH',
 'VCCD',
 'VCCA',
 'MaxCH',
 'MaxCD',
 'MaxCA',
 'AvgCH',
 'AvgCD',
 'AvgCA',
 'B365C>2.5',
 'B365C<2.5',
 'PC>2.5',
 'PC<2.5',
 'MaxC>2.5',
 'MaxC<2.5',
 'AvgC>2.5',
 'AvgC<2.5',
 'AHCh',
 'B365CAHH',
 'B365CAHA',
 'PCAHH',
 'PCAHA',
 'MaxCAHH',
 'MaxCAHA',
 'AvgCAHH',
 'AvgCAHA',
 'season',

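Here I am creating a dictionary that swaps the home and away PCA column names. I apply it to the away-team rows below, so that the 'HPCA' columns always describe the team the row belongs to and the 'APCA' columns describe its opponent.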

In [None]:
home_comp = [col for col in df.columns if 'HPCA' in col]
away_comp = [col for col in df.columns if 'APCA' in col]
# zip stops at the shorter list, so with 27 home and 26 away components
# HPCA_27 has no partner and is not renamed
home_away_dict = dict(zip(home_comp, away_comp))
away_home_dict = dict(zip(away_comp, home_comp))
# Merge both directions into one mapping: HPCA_i -> APCA_i and APCA_i -> HPCA_i
merged_dict = home_away_dict.copy()
merged_dict.update(away_home_dict)

Here I am processing my dataframe so that there are two rows for every match, one for each team. I'll sort this dataframe and use it to generate team-specific historical data from the past n games played by that team.

In [None]:
df_home_sort = df.sort_values(by=['HomeTeam', 'Date'])
df_away_sort = df.sort_values(by=['AwayTeam', 'Date'])
df_home_sort.drop(columns='AwayTeam', inplace=True)
df_away_sort.drop(columns='HomeTeam', inplace=True)
df_home_sort.rename(columns={'HomeTeam': 'team'}, inplace=True)
df_away_sort.rename(columns={'AwayTeam': 'team'}, inplace=True)
df_home_sort['is_home'] = 1
df_away_sort['is_home'] = 0
df_away_sort.rename(columns=merged_dict, inplace=True)
df_sort = pd.concat([df_home_sort, df_away_sort])
df_sort.sort_values(by=['team', 'Date'], inplace=True)
df_sort.reset_index(inplace=True)
df_sort

Unnamed: 0,index,Div,Date,team,FTHG,FTAG,HTHG,HTAG,HS,AS,...,APCA_22,APCA_23,APCA_24,APCA_25,APCA_26,NPCA_1,NPCA_2,NPCA_3,NPCA_4,is_home
0,27932,G1,2019-08-25,AEK,-0.356261,0.717236,-0.794635,-0.714088,-0.920563,-1.465858,...,-0.056798,0.058765,-0.113044,-0.064619,0.005526,1.125657,4.255837,-1.240545,-1.540191,1
1,27939,G1,2019-09-01,AEK,0.445277,1.598058,0.432954,0.641269,-1.713943,0.049972,...,-0.345385,0.007138,-0.067301,-0.021156,-0.093346,-5.222869,1.415725,1.051449,0.213280,0
2,27944,G1,2019-09-15,AEK,0.445277,-1.044409,1.660542,-0.714088,0.666199,-1.898953,...,0.423860,0.043590,0.066812,-0.306406,-0.213338,6.773529,4.645677,-1.491419,-2.322681,1
3,27949,G1,2019-09-21,AEK,-1.157799,-0.163587,-0.794635,-0.714088,-0.920563,1.132709,...,-0.299986,-0.029241,-0.020661,-0.286825,-0.192476,-2.839016,1.562192,2.291168,-0.868373,0
4,27960,G1,2019-09-29,AEK,0.445277,0.717236,-0.794635,0.641269,-0.523872,0.049972,...,0.522068,0.459493,0.278751,0.442878,0.086940,-4.046526,1.388628,0.020160,-0.008109,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
56327,12646,N1,2022-04-23,Zwolle,-1.157799,0.717236,-0.794635,0.641269,0.071163,-0.166575,...,1.941618,0.340003,-0.329057,-0.739140,0.111885,-1.684819,-0.939196,0.031255,-0.172333,0
56328,12653,N1,2022-04-30,Zwolle,1.246815,-1.044409,0.432954,-0.714088,1.459579,-0.166575,...,2.005974,-1.170278,1.060080,1.176015,1.180836,20.860383,5.682536,-0.403558,1.336156,0
56329,12664,N1,2022-05-07,Zwolle,-0.356261,-0.163587,-0.794635,0.641269,1.062889,0.916161,...,0.042925,-0.281075,-0.448215,0.366662,-0.803826,-0.200125,-1.640323,-0.028291,-0.106660,1
56330,12675,N1,2022-05-11,Zwolle,0.445277,-1.044409,1.660542,-0.714088,-1.317253,2.431992,...,1.432467,1.060465,-0.349343,-0.612129,0.310661,-1.012325,-1.034790,-0.460326,-0.050526,0


Here I add columns to each row giving the time delta, in days, between that row's match date and the dates of each of the previous n matches. I'll save these columns to add to my final dataframe later.

In [None]:
# Define the number of columns you want to add
n = 7

# Create n new columns, each representing the time delta in days from the current row to one of the previous n rows
# Note: shift() runs over the whole frame, not per team, so the first n rows of each team
# are measured against the previous team's matches (which can produce large or negative values)
for i in range(1, n+1):
    df_sort[f'delta_{i}'] = (df_sort['Date'] - df_sort['Date'].shift(i)).dt.days

delta_cols = [f'delta_{i}' for i in range(1, n+1)]

deltas = df_sort[delta_cols + ['index', 'is_home']]

df_sort = df_sort.drop(columns=delta_cols)
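
A minimal sketch of a per-team variant of the delta calculation, assuming the frame is sorted by team and date as above. The toy frame and its team names are made up for illustration. Shifting inside a `groupby` leaves NaN on a team's first rows instead of borrowing dates from the previous team.

```python
import pandas as pd

# Toy frame with two hypothetical teams, already sorted by team and date.
toy = pd.DataFrame({
    'team': ['A', 'A', 'A', 'B', 'B'],
    'Date': pd.to_datetime(['2020-01-01', '2020-01-08', '2020-01-22',
                            '2019-09-01', '2019-09-15']),
})

n = 2
for i in range(1, n + 1):
    # Shift within each team so a team's first i rows get NaT (and so a NaN delta)
    # rather than a date taken from the previous team in the frame.
    prev_date = toy.groupby('team')['Date'].shift(i)
    toy[f'delta_{i}'] = (toy['Date'] - prev_date).dt.days

print(toy)
```

With this variant, rows without enough history can be removed with a plain `dropna` on the delta columns.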

In [None]:
df_sort

Unnamed: 0,index,Div,Date,team,FTHG,FTAG,HTHG,HTAG,HS,AS,...,APCA_22,APCA_23,APCA_24,APCA_25,APCA_26,NPCA_1,NPCA_2,NPCA_3,NPCA_4,is_home
0,27932,G1,2019-08-25,AEK,-0.356261,0.717236,-0.794635,-0.714088,-0.920563,-1.465858,...,-0.056798,0.058765,-0.113044,-0.064619,0.005526,1.125657,4.255837,-1.240545,-1.540191,1
1,27939,G1,2019-09-01,AEK,0.445277,1.598058,0.432954,0.641269,-1.713943,0.049972,...,-0.345385,0.007138,-0.067301,-0.021156,-0.093346,-5.222869,1.415725,1.051449,0.213280,0
2,27944,G1,2019-09-15,AEK,0.445277,-1.044409,1.660542,-0.714088,0.666199,-1.898953,...,0.423860,0.043590,0.066812,-0.306406,-0.213338,6.773529,4.645677,-1.491419,-2.322681,1
3,27949,G1,2019-09-21,AEK,-1.157799,-0.163587,-0.794635,-0.714088,-0.920563,1.132709,...,-0.299986,-0.029241,-0.020661,-0.286825,-0.192476,-2.839016,1.562192,2.291168,-0.868373,0
4,27960,G1,2019-09-29,AEK,0.445277,0.717236,-0.794635,0.641269,-0.523872,0.049972,...,0.522068,0.459493,0.278751,0.442878,0.086940,-4.046526,1.388628,0.020160,-0.008109,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
56327,12646,N1,2022-04-23,Zwolle,-1.157799,0.717236,-0.794635,0.641269,0.071163,-0.166575,...,1.941618,0.340003,-0.329057,-0.739140,0.111885,-1.684819,-0.939196,0.031255,-0.172333,0
56328,12653,N1,2022-04-30,Zwolle,1.246815,-1.044409,0.432954,-0.714088,1.459579,-0.166575,...,2.005974,-1.170278,1.060080,1.176015,1.180836,20.860383,5.682536,-0.403558,1.336156,0
56329,12664,N1,2022-05-07,Zwolle,-0.356261,-0.163587,-0.794635,0.641269,1.062889,0.916161,...,0.042925,-0.281075,-0.448215,0.366662,-0.803826,-0.200125,-1.640323,-0.028291,-0.106660,1
56330,12675,N1,2022-05-11,Zwolle,0.445277,-1.044409,1.660542,-0.714088,-1.317253,2.431992,...,1.432467,1.060465,-0.349343,-0.612129,0.310661,-1.012325,-1.034790,-0.460326,-0.050526,0


Here I remove columns which I don't want to be included in the next part, mainly the ones I previously PCAed. I want to add some of those columns back in later, but for now I just want to use my PCA columns to add historical data to each row.

In [None]:
df_hist = df_sort.drop(columns=columns_for_pca)
df_hist = df_hist.drop(columns=['Div','Date','season'], axis=1)
df_hist.columns.to_list()

['index',
 'team',
 'B365CH',
 'FTR_A',
 'FTR_D',
 'FTR_H',
 'HTR_A',
 'HTR_D',
 'HTR_H',
 'HPCA_1',
 'HPCA_2',
 'HPCA_3',
 'HPCA_4',
 'HPCA_5',
 'HPCA_6',
 'HPCA_7',
 'HPCA_8',
 'HPCA_9',
 'HPCA_10',
 'HPCA_11',
 'HPCA_12',
 'HPCA_13',
 'HPCA_14',
 'HPCA_15',
 'HPCA_16',
 'HPCA_17',
 'HPCA_18',
 'HPCA_19',
 'HPCA_20',
 'HPCA_21',
 'HPCA_22',
 'HPCA_23',
 'HPCA_24',
 'HPCA_25',
 'HPCA_26',
 'HPCA_27',
 'APCA_1',
 'APCA_2',
 'APCA_3',
 'APCA_4',
 'APCA_5',
 'APCA_6',
 'APCA_7',
 'APCA_8',
 'APCA_9',
 'APCA_10',
 'APCA_11',
 'APCA_12',
 'APCA_13',
 'APCA_14',
 'APCA_15',
 'APCA_16',
 'APCA_17',
 'APCA_18',
 'APCA_19',
 'APCA_20',
 'APCA_21',
 'APCA_22',
 'APCA_23',
 'APCA_24',
 'APCA_25',
 'APCA_26',
 'NPCA_1',
 'NPCA_2',
 'NPCA_3',
 'NPCA_4',
 'is_home']

Here I am converting the categorical results columns (Home,Draw,Away) to (Win,Draw,Loss) columns in relation to the team in question.

In [None]:
df_hist['win'] = ((df_hist['is_home'] == 1) & (df_hist['FTR_H'] == 1)) | ((df_hist['is_home'] == 0) & (df_hist['FTR_A'] == 1))
df_hist['loss'] = ((df_hist['is_home'] == 1) & (df_hist['FTR_A'] == 1)) | ((df_hist['is_home'] == 0) & (df_hist['FTR_H'] == 1))
df_hist['draw'] = df_hist['FTR_D'] == 1
df_hist['ht_win'] = ((df_hist['is_home'] == 1) & (df_hist['HTR_H'] == 1)) | ((df_hist['is_home'] == 0) & (df_hist['HTR_A'] == 1))
df_hist['ht_loss'] = ((df_hist['is_home'] == 1) & (df_hist['HTR_A'] == 1)) | ((df_hist['is_home'] == 0) & (df_hist['HTR_H'] == 1))
df_hist['ht_draw'] = df_hist['HTR_D'] == 1
df_hist = df_hist.drop(columns=['FTR_A', 'FTR_D', 'FTR_H', 'HTR_A', 'HTR_D', 'HTR_H'], axis=1)
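
As a quick sanity check of the conversion, here is a small sketch on made-up rows. The toy values are assumptions for illustration only; the column names follow the cell above. Each row should end up with exactly one of win, loss or draw set.

```python
import pandas as pd

# Four hypothetical team-rows: home win, away loss, home draw, away win.
toy = pd.DataFrame({
    'is_home': [1, 0, 1, 0],
    'FTR_H':   [1, 1, 0, 0],
    'FTR_D':   [0, 0, 1, 0],
    'FTR_A':   [0, 0, 0, 1],
})

home = toy['is_home'] == 1
# A team wins if it is at home and the home side won, or away and the away side won.
toy['win'] = (home & (toy['FTR_H'] == 1)) | (~home & (toy['FTR_A'] == 1))
toy['loss'] = (home & (toy['FTR_A'] == 1)) | (~home & (toy['FTR_H'] == 1))
toy['draw'] = toy['FTR_D'] == 1

print(toy[['is_home', 'win', 'loss', 'draw']])
```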

In [None]:
df_hist

Unnamed: 0,index,team,B365CH,HPCA_1,HPCA_2,HPCA_3,HPCA_4,HPCA_5,HPCA_6,HPCA_7,...,NPCA_2,NPCA_3,NPCA_4,is_home,win,loss,draw,ht_win,ht_loss,ht_draw
0,27932,AEK,1.28,-2.626024,-2.239404,0.268356,-1.627834,1.506976,-1.421845,0.915675,...,4.255837,-1.240545,-1.540191,1,False,True,False,False,False,True
1,27939,AEK,3.2,-2.154906,-1.488635,-3.902671,-1.298224,1.563570,1.352575,0.798639,...,1.415725,1.051449,0.213280,0,True,False,False,False,False,True
2,27944,AEK,1.25,-3.700222,-0.055011,3.534565,-1.572209,0.625023,3.740218,-0.644976,...,4.645677,-1.491419,-2.322681,1,True,False,False,True,False,False
3,27949,AEK,6.0,-3.234368,-2.448380,3.504832,-0.992870,-0.157754,0.324694,0.362158,...,1.562192,2.291168,-0.868373,0,True,False,False,False,False,True
4,27960,AEK,2.1,-0.818088,-1.041503,0.763353,-1.228417,-2.435420,-0.954197,0.374043,...,1.388628,0.020160,-0.008109,1,False,False,True,False,True,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
56327,12646,Zwolle,2.87,-0.615706,-0.517882,2.888977,6.212589,2.224750,0.936749,0.588284,...,-0.939196,0.031255,-0.172333,0,True,False,False,True,False,False
56328,12653,Zwolle,1.18,17.249583,3.778193,0.324355,2.579904,-1.272104,0.406960,-0.747143,...,5.682536,-0.403558,1.336156,0,False,True,False,False,True,False
56329,12664,Zwolle,2.45,1.001490,-0.355490,-0.816281,5.874112,-2.817475,1.918742,-2.025535,...,-1.640323,-0.028291,-0.106660,1,False,False,True,False,True,False
56330,12675,Zwolle,2.45,-0.233207,-0.546963,3.352708,6.180005,1.411629,1.154587,-3.173929,...,-1.034790,-0.460326,-0.050526,0,False,True,False,False,True,False


This function adds the stats from the previous n rows to each row as new columns. Because the shift is applied to the whole frame rather than per team, the first n rows of each team receive stats from a different team where they should have NaNs, so after running it I flag those rows with a separate function.

In [None]:
def add_last_n_rows(df, n, columns_to_add):
    for i in range(1, n + 1):
        temp_df = df.shift(i)
        for col in columns_to_add:
            df[f"{col}_last_{i}"] = temp_df[col]
    df = df.reset_index(drop=True)
    return df
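
A hedged sketch of a group-aware alternative, not the function used in this notebook: `add_last_n_rows_by_team` is a hypothetical name, and it assumes the grouping column is not itself in `columns_to_add`. It shifts inside each team and attaches all lag columns with one `pd.concat`, which avoids inserting columns one at a time.

```python
import pandas as pd

def add_last_n_rows_by_team(df, n, columns_to_add, group_col='team'):
    """Hypothetical variant of add_last_n_rows that shifts within each team."""
    grouped = df.groupby(group_col)[columns_to_add]
    # One lagged block per i; rows without i earlier matches for the team get NaN.
    lagged = [grouped.shift(i).add_suffix(f'_last_{i}') for i in range(1, n + 1)]
    return pd.concat([df] + lagged, axis=1)

# Toy data with made-up values, sorted by team and then chronologically.
toy = pd.DataFrame({'team': ['A', 'A', 'A', 'B', 'B'],
                    'goals': [1, 2, 3, 4, 5]})
out = add_last_n_rows_by_team(toy, 2, ['goals'])
print(out)
```

Here team B's first row gets NaN in `goals_last_1` rather than team A's last value, so no separate flagging step is needed to find rows with incomplete history.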

In [None]:
add_last_n_rows(df_hist, n, df_hist.columns.to_list())

Unnamed: 0,index,team,B365CH,HPCA_1,HPCA_2,HPCA_3,HPCA_4,HPCA_5,HPCA_6,HPCA_7,...,NPCA_2_last_7,NPCA_3_last_7,NPCA_4_last_7,is_home_last_7,win_last_7,loss_last_7,draw_last_7,ht_win_last_7,ht_loss_last_7,ht_draw_last_7
0,27932,AEK,1.28,-2.626024,-2.239404,0.268356,-1.627834,1.506976,-1.421845,0.915675,...,,,,,,,,,,
1,27939,AEK,3.2,-2.154906,-1.488635,-3.902671,-1.298224,1.563570,1.352575,0.798639,...,,,,,,,,,,
2,27944,AEK,1.25,-3.700222,-0.055011,3.534565,-1.572209,0.625023,3.740218,-0.644976,...,,,,,,,,,,
3,27949,AEK,6.0,-3.234368,-2.448380,3.504832,-0.992870,-0.157754,0.324694,0.362158,...,,,,,,,,,,
4,27960,AEK,2.1,-0.818088,-1.041503,0.763353,-1.228417,-2.435420,-0.954197,0.374043,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
56327,12646,Zwolle,2.87,-0.615706,-0.517882,2.888977,6.212589,2.224750,0.936749,0.588284,...,0.019254,-0.253290,-0.110817,1.0,False,False,True,False,True,False
56328,12653,Zwolle,1.18,17.249583,3.778193,0.324355,2.579904,-1.272104,0.406960,-0.747143,...,-0.362609,-0.306934,-0.143700,0.0,False,True,False,False,True,False
56329,12664,Zwolle,2.45,1.001490,-0.355490,-0.816281,5.874112,-2.817475,1.918742,-2.025535,...,-1.285321,0.056645,-0.021792,0.0,True,False,False,False,False,True
56330,12675,Zwolle,2.45,-0.233207,-0.546963,3.352708,6.180005,1.411629,1.154587,-3.173929,...,-1.513018,3.069199,-1.047415,1.0,False,True,False,True,False,False


In [None]:
def rolling_window_check(df, column, n):

    # Convert the column to a numeric representation using pd.factorize
    cat_codes, _ = pd.factorize(df[column])
    
    # Create a rolling window of size n over the column
    window = pd.Series(cat_codes).rolling(n)
    
    # Check if all values in the window are equal
    result = window.apply(lambda x: len(set(x)) == 1, raw=True)
    
    # Return a boolean series indicating if the condition is True or False for each row
    return result.fillna(False)
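
Note that a rolling window of size n spans the current row and the n - 1 rows before it, so a window of 7 confirms six earlier rows from the same team, not seven. If the goal is to keep only rows whose n lagged rows all belong to the same team, a window of n + 1 would be needed. A minimal sketch of an alternative check follows; `full_history_mask` is a hypothetical helper, and it assumes one row per team per match.

```python
import pandas as pd

def full_history_mask(df, n, group_col='team'):
    """Hypothetical helper: True where a row has at least n earlier rows of its team."""
    # cumcount() numbers each team's rows 0, 1, 2, ... in frame order,
    # so the value equals the number of earlier rows for that team.
    return df.groupby(group_col).cumcount() >= n

# Toy frame: team A has four rows, team B has three.
toy = pd.DataFrame({'team': ['A'] * 4 + ['B'] * 3})
toy['check'] = full_history_mask(toy, 2)
print(toy)
```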


In [None]:
df_hist['check'] = rolling_window_check(df_hist, 'team', 7)


In [None]:
# Get the counts of Trues and Falses in the 'check' column
df_hist['check'] = df_hist['check'] == 1.0
value_counts = df_hist['check'].value_counts()

# Print the result
print(value_counts)

True     53499
False     2833
Name: check, dtype: int64


Here I am preparing the historical data to be added to the main dataframe. The rows were previously split so that each contained only one team in order to generate the historical data. Now the rows must be rebuilt to represent matches between two teams again. I make sure to remove the columns that represent the current match in each row. I will add current stats when I'm assembling the final data.

In [None]:
indices = df_hist['index']
df_hist = df_hist.drop(columns=['team'], axis=1)
df_hist = df_hist[[col for col in df_hist.columns if 'index' not in col]]
df_hist['index'] = indices
df_hist_h = df_hist[df_hist['is_home'] == 1]
df_hist_a = df_hist[df_hist['is_home'] == 0]
df_hist_h = df_hist_h.add_prefix('home_')
df_hist_a = df_hist_a.add_prefix('away_')
df_hist_h = df_hist_h.rename(columns={'home_index': 'index'})
df_hist_a = df_hist_a.rename(columns={'away_index': 'index'})
hist_df = pd.merge(df_hist_h, df_hist_a, on='index')
indices = hist_df['index']
col_list = [col for col in hist_df.columns if 'last' in col]
col_list.append('home_check')
col_list.append('away_check')
hist_df = hist_df[col_list]
hist_df['index'] = indices
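
The split, prefix and merge steps can be sketched on a toy frame as below. The helper `side_frame` and the data are hypothetical. Setting the match key as the index before `add_prefix` keeps the key from being prefixed, so no rename is needed afterwards.

```python
import pandas as pd

# Toy per-team rows: 'index' identifies the original match (values are made up).
team_rows = pd.DataFrame({
    'index':      [0, 0, 1, 1],
    'is_home':    [1, 0, 0, 1],
    'win_last_1': [True, False, True, False],
})

def side_frame(df, is_home, prefix):
    """Hypothetical helper: keep one side's rows and prefix every column but the key."""
    side = df[df['is_home'] == is_home].drop(columns='is_home')
    return side.set_index('index').add_prefix(prefix).reset_index()

# One row per match again, with home_* and away_* columns side by side.
match_rows = side_frame(team_rows, 1, 'home_').merge(
    side_frame(team_rows, 0, 'away_'), on='index')
print(match_rows)
```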


I need to reformat the time deltas in a similar way, which I do below.

In [None]:
deltas_h = deltas[deltas['is_home'] == 1]
deltas_a = deltas[deltas['is_home'] == 0]
deltas_h = deltas_h.add_prefix('home_')
deltas_a = deltas_a.add_prefix('away_')
deltas_h = deltas_h.rename(columns={'home_index': 'index'})
deltas_a = deltas_a.rename(columns={'away_index': 'index'})
deltas = pd.merge(deltas_h, deltas_a, on='index')
indices = deltas['index']
deltas = deltas[[col for col in deltas.columns if 'is_home' not in col]]
deltas['index'] = indices
deltas

Unnamed: 0,home_delta_1,home_delta_2,home_delta_3,home_delta_4,home_delta_5,home_delta_6,home_delta_7,index,away_delta_1,away_delta_2,away_delta_3,away_delta_4,away_delta_5,away_delta_6,away_delta_7
0,,,,,,,,27932,-1297.0,-1294.0,-1290.0,-1287.0,-1280.0,-1273.0,-1269.0
1,14.0,21.0,,,,,,27944,14.0,22.0,-327.0,-306.0,-301.0,-297.0,-294.0
2,8.0,14.0,28.0,35.0,,,,27960,7.0,15.0,28.0,35.0,-1259.0,-1252.0,-1245.0
3,15.0,21.0,29.0,35.0,49.0,56.0,,27973,14.0,21.0,29.0,36.0,50.0,56.0,-1239.0
4,7.0,14.0,29.0,35.0,43.0,49.0,63.0,27986,7.0,14.0,28.0,36.0,43.0,50.0,63.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
28161,7.0,15.0,21.0,29.0,36.0,50.0,58.0,12612,8.0,14.0,21.0,28.0,35.0,49.0,57.0
28162,15.0,21.0,28.0,36.0,42.0,50.0,57.0,12628,15.0,22.0,29.0,35.0,42.0,49.0,56.0
28163,7.0,22.0,28.0,35.0,43.0,49.0,57.0,12640,8.0,21.0,28.0,35.0,42.0,50.0,56.0
28164,7.0,14.0,27.0,34.0,49.0,55.0,62.0,12664,8.0,13.0,28.0,35.0,48.0,55.0,63.0


The historical data is quite wide. I have already applied PCA to the data, so if I want to reduce the features further I want to do it in a smarter way.

In [None]:
hist_df = hist_df[[col for col in hist_df.columns if 'team' not in col]]
hist_df

Unnamed: 0,home_B365CH_last_1,home_HPCA_1_last_1,home_HPCA_2_last_1,home_HPCA_3_last_1,home_HPCA_4_last_1,home_HPCA_5_last_1,home_HPCA_6_last_1,home_HPCA_7_last_1,home_HPCA_8_last_1,home_HPCA_9_last_1,...,away_is_home_last_7,away_win_last_7,away_loss_last_7,away_draw_last_7,away_ht_win_last_7,away_ht_loss_last_7,away_ht_draw_last_7,home_check,away_check,index
0,,,,,,,,,,,...,0.0,True,False,False,True,False,False,False,False,27932
1,3.2,-2.154906,-1.488635,-3.902671,-1.298224,1.563570,1.352575,0.798639,0.431349,0.296377,...,1.0,True,False,False,True,False,False,False,False,27944
2,6.0,-3.234368,-2.448380,3.504832,-0.992870,-0.157754,0.324694,0.362158,1.226640,0.203568,...,1.0,False,True,False,False,True,False,False,False,27960
3,4.5,-2.467690,-1.519434,0.252090,-0.588082,-0.068462,-2.654724,1.397384,0.807580,0.206605,...,0.0,False,True,False,True,False,False,True,True,27973
4,1.55,3.025984,0.188062,1.573031,-1.415786,-0.828914,-1.437629,0.618944,1.546741,-0.073723,...,0.0,False,False,True,False,False,True,True,True,27986
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
28161,2.7,-0.831588,-0.234788,1.419606,5.726112,1.074039,0.704001,-0.942589,0.277295,4.672967,...,1.0,False,True,False,False,False,True,True,True,12612
28162,1.57,3.417621,0.484356,0.487427,4.784783,2.694679,-0.081524,-1.223260,-0.452930,4.479893,...,0.0,False,True,False,False,True,False,True,True,12628
28163,1.9,-0.068878,-0.654773,2.025644,6.699371,-1.445553,-0.334620,-1.481324,-1.159532,0.462748,...,0.0,True,False,False,True,False,False,True,True,12640
28164,1.18,17.249583,3.778193,0.324355,2.579904,-1.272104,0.406960,-0.747143,-0.655120,4.817965,...,0.0,False,False,True,False,True,False,True,True,12664


In [None]:
h_check = hist_df['home_check']
a_check = hist_df['away_check']
indices = hist_df['index']

To reduce the number of columns while keeping the relevant information, I'm going to aggregate the stats of the previous n matches by averaging them. I'll keep the most recent game intact, then calculate averages over the last 3, 5 and 7 matches for each PCA component for each team. I will still retain all 7 lags for a few of the columns, such as the time deltas and the wins and losses.

In [None]:
home_away  = ['home_', 'away_']
# Component counts for each PCA group, keyed by the letter used in the column names
pca_comps = {'H': len(hpca_components[1]),
             'A': len(apca_components[1]),
             'N': len(npca_components[1])}
agg_list = [1,3,5,7]

for side in home_away:
  for pca, n_comps in pca_comps.items():
    for comp in range(1, n_comps+1):
      col_name = side + pca + 'PCA_' + str(comp) + '_'
      for span in agg_list:
        new_col = col_name + 'avg' + str(span)
        # Average the lag columns whose lag number is <= span, skipping the averages already added
        # (int(col[-1]) reads only the final character, so this assumes n <= 9)
        hist_df[new_col] = hist_df[[col for col in hist_df.columns if col_name in col and int(col[-1]) <= span and 'avg' not in col]].mean(axis=1)
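
A small worked example of the averaging, using one made-up component and three lags. It is a sketch rather than the cell above: it builds the lag column names explicitly instead of filtering on the last character, so it would also work for more than 9 lags.

```python
import pandas as pd

# Toy lag columns for a single hypothetical component.
toy = pd.DataFrame({
    'home_HPCA_1_last_1': [1.0, 2.0],
    'home_HPCA_1_last_2': [3.0, 4.0],
    'home_HPCA_1_last_3': [5.0, 6.0],
})

base = 'home_HPCA_1_'
for span in [1, 3]:
    # Columns for lags 1..span, built by name.
    lag_cols = [f'{base}last_{i}' for i in range(1, span + 1)]
    toy[f'{base}avg{span}'] = toy[lag_cols].mean(axis=1)

print(toy[[f'{base}avg1', f'{base}avg3']])
```

In the first row, `avg1` is just the most recent value (1.0) and `avg3` is the mean of 1.0, 3.0 and 5.0.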

In [None]:
hist_df['index'] = indices
hist_df['home_check'] = h_check
hist_df['away_check'] = a_check

In [None]:
hist_df = hist_df.loc[(hist_df['home_check'] != False) & (hist_df['away_check'] != False)]
hist_df = hist_df.drop(columns=['home_check', 'away_check'], axis=1)
hist_df

Unnamed: 0,home_B365CH_last_1,home_HPCA_1_last_1,home_HPCA_2_last_1,home_HPCA_3_last_1,home_HPCA_4_last_1,home_HPCA_5_last_1,home_HPCA_6_last_1,home_HPCA_7_last_1,home_HPCA_8_last_1,home_HPCA_9_last_1,...,away_NPCA_2_avg5,away_NPCA_2_avg7,away_NPCA_3_avg1,away_NPCA_3_avg3,away_NPCA_3_avg5,away_NPCA_3_avg7,away_NPCA_4_avg1,away_NPCA_4_avg3,away_NPCA_4_avg5,away_NPCA_4_avg7
3,4.5,-2.467690,-1.519434,0.252090,-0.588082,-0.068462,-2.654724,1.397384,0.807580,0.206605,...,2.535434,3.269829,0.571287,0.115285,0.159201,0.170551,-0.015257,-0.056495,-0.679325,-0.195938
4,1.55,3.025984,0.188062,1.573031,-1.415786,-0.828914,-1.437629,0.618944,1.546741,-0.073723,...,1.670432,1.688226,-1.155519,-0.093589,-0.205589,0.313870,-0.565120,-0.103722,-0.124934,-0.113678
5,2.75,-2.296139,-0.484742,-0.153611,-0.792241,2.379758,1.886652,1.837519,-0.305593,0.079720,...,1.386458,1.769826,-1.193657,-0.297553,0.496990,0.123394,-0.921335,-0.951352,-0.731783,-0.665008
6,3.1,-1.901171,-0.723673,-2.007288,-0.043212,-0.428673,-2.814701,1.784207,0.937590,0.478202,...,2.193178,2.166407,-0.244138,2.212797,1.393036,0.929103,0.069134,-1.181036,-0.949915,-0.755426
7,5.5,-3.083875,-1.072719,3.314907,-0.480648,-1.983743,-0.893700,-0.171920,-0.456283,0.282408,...,1.951746,2.018499,2.831207,0.609627,-0.069793,-0.089804,-1.690832,-0.633527,-0.697349,-0.504190
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
28161,2.7,-0.831588,-0.234788,1.419606,5.726112,1.074039,0.704001,-0.942589,0.277295,4.672967,...,2.373429,1.747055,-1.158933,-0.443844,0.493908,0.750159,-1.044294,0.038146,-0.594714,-0.733511
28162,1.57,3.417621,0.484356,0.487427,4.784783,2.694679,-0.081524,-1.223260,-0.452930,4.479893,...,0.257558,0.250538,-0.600735,-0.002524,1.282658,1.209133,-0.129099,-0.297718,-0.517836,-0.521316
28163,1.9,-0.068878,-0.654773,2.025644,6.699371,-1.445553,-0.334620,-1.481324,-1.159532,0.462748,...,-0.414665,-0.162863,-1.247901,0.035197,0.545260,0.570303,-0.363606,-0.494430,-0.506616,-0.646680
28164,1.18,17.249583,3.778193,0.324355,2.579904,-1.272104,0.406960,-0.747143,-0.655120,4.817965,...,-0.590556,-1.029693,-1.179202,-1.425156,-0.828913,-0.226053,0.144300,0.248532,0.039657,-0.022201


In [None]:
drop_cols = [col for col in hist_df.columns if 'PCA' in col and 'last' in col]
historical = hist_df[[col for col in hist_df.columns if col not in drop_cols]]
historical

Unnamed: 0,home_B365CH_last_1,home_is_home_last_1,home_win_last_1,home_loss_last_1,home_draw_last_1,home_ht_win_last_1,home_ht_loss_last_1,home_ht_draw_last_1,home_B365CH_last_2,home_is_home_last_2,...,away_NPCA_2_avg5,away_NPCA_2_avg7,away_NPCA_3_avg1,away_NPCA_3_avg3,away_NPCA_3_avg5,away_NPCA_3_avg7,away_NPCA_4_avg1,away_NPCA_4_avg3,away_NPCA_4_avg5,away_NPCA_4_avg7
3,4.5,0.0,False,False,True,False,False,True,2.1,1.0,...,2.535434,3.269829,0.571287,0.115285,0.159201,0.170551,-0.015257,-0.056495,-0.679325,-0.195938
4,1.55,0.0,False,True,False,False,True,False,1.25,1.0,...,1.670432,1.688226,-1.155519,-0.093589,-0.205589,0.313870,-0.565120,-0.103722,-0.124934,-0.113678
5,2.75,0.0,False,True,False,True,False,False,1.4,1.0,...,1.386458,1.769826,-1.193657,-0.297553,0.496990,0.123394,-0.921335,-0.951352,-0.731783,-0.665008
6,3.1,0.0,False,True,False,False,False,True,1.55,1.0,...,2.193178,2.166407,-0.244138,2.212797,1.393036,0.929103,0.069134,-1.181036,-0.949915,-0.755426
7,5.5,0.0,True,False,False,False,False,True,1.28,1.0,...,1.951746,2.018499,2.831207,0.609627,-0.069793,-0.089804,-1.690832,-0.633527,-0.697349,-0.504190
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
28161,2.7,0.0,True,False,False,False,False,True,2.25,0.0,...,2.373429,1.747055,-1.158933,-0.443844,0.493908,0.750159,-1.044294,0.038146,-0.594714,-0.733511
28162,1.57,0.0,False,True,False,False,True,False,7.0,1.0,...,0.257558,0.250538,-0.600735,-0.002524,1.282658,1.209133,-0.129099,-0.297718,-0.517836,-0.521316
28163,1.9,1.0,False,True,False,False,False,True,1.57,0.0,...,-0.414665,-0.162863,-1.247901,0.035197,0.545260,0.570303,-0.363606,-0.494430,-0.506616,-0.646680
28164,1.18,0.0,False,True,False,False,True,False,2.87,0.0,...,-0.590556,-1.029693,-1.179202,-1.425156,-0.828913,-0.226053,0.144300,0.248532,0.039657,-0.022201


Now I'll add my time deltas to the historical data.

In [None]:
historical = pd.merge(historical, deltas, on='index')
historical

Unnamed: 0,home_B365CH_last_1,home_is_home_last_1,home_win_last_1,home_loss_last_1,home_draw_last_1,home_ht_win_last_1,home_ht_loss_last_1,home_ht_draw_last_1,home_B365CH_last_2,home_is_home_last_2,...,home_delta_5,home_delta_6,home_delta_7,away_delta_1,away_delta_2,away_delta_3,away_delta_4,away_delta_5,away_delta_6,away_delta_7
0,4.5,0.0,False,False,True,False,False,True,2.1,1.0,...,49.0,56.0,,14.0,21.0,29.0,36.0,50.0,56.0,-1239.0
1,1.55,0.0,False,True,False,False,True,False,1.25,1.0,...,43.0,49.0,63.0,7.0,14.0,28.0,36.0,43.0,50.0,63.0
2,2.75,0.0,False,True,False,True,False,False,1.4,1.0,...,50.0,56.0,64.0,14.0,22.0,28.0,35.0,49.0,57.0,63.0
3,3.1,0.0,False,True,False,False,False,True,1.55,1.0,...,41.0,48.0,63.0,7.0,14.0,27.0,35.0,41.0,48.0,62.0
4,5.5,0.0,True,False,False,False,False,True,1.28,1.0,...,38.0,45.0,52.0,3.0,10.0,17.0,24.0,38.0,45.0,51.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
26535,2.7,0.0,True,False,False,False,False,True,2.25,0.0,...,36.0,50.0,58.0,8.0,14.0,21.0,28.0,35.0,49.0,57.0
26536,1.57,0.0,False,True,False,False,True,False,7.0,1.0,...,42.0,50.0,57.0,15.0,22.0,29.0,35.0,42.0,49.0,56.0
26537,1.9,1.0,False,True,False,False,False,True,1.57,0.0,...,43.0,49.0,57.0,8.0,21.0,28.0,35.0,42.0,50.0,56.0
26538,1.18,0.0,False,True,False,False,True,False,2.87,0.0,...,49.0,55.0,62.0,8.0,13.0,28.0,35.0,48.0,55.0,63.0


In [None]:
historical.columns.to_list()

['home_B365CH_last_1',
 'home_is_home_last_1',
 'home_win_last_1',
 'home_loss_last_1',
 'home_draw_last_1',
 'home_ht_win_last_1',
 'home_ht_loss_last_1',
 'home_ht_draw_last_1',
 'home_B365CH_last_2',
 'home_is_home_last_2',
 'home_win_last_2',
 'home_loss_last_2',
 'home_draw_last_2',
 'home_ht_win_last_2',
 'home_ht_loss_last_2',
 'home_ht_draw_last_2',
 'home_B365CH_last_3',
 'home_is_home_last_3',
 'home_win_last_3',
 'home_loss_last_3',
 'home_draw_last_3',
 'home_ht_win_last_3',
 'home_ht_loss_last_3',
 'home_ht_draw_last_3',
 'home_B365CH_last_4',
 'home_is_home_last_4',
 'home_win_last_4',
 'home_loss_last_4',
 'home_draw_last_4',
 'home_ht_win_last_4',
 'home_ht_loss_last_4',
 'home_ht_draw_last_4',
 'home_B365CH_last_5',
 'home_is_home_last_5',
 'home_win_last_5',
 'home_loss_last_5',
 'home_draw_last_5',
 'home_ht_win_last_5',
 'home_ht_loss_last_5',
 'home_ht_draw_last_5',
 'home_B365CH_last_6',
 'home_is_home_last_6',
 'home_win_last_6',
 'home_loss_last_6',
 'home_draw_

Now I am assembling the final data. I want the final dataset to be useful for multiple things. 

Below, I add the historical data to the raw dataframe I saved earlier and drop the index column because we won't be needing it any longer.

In [None]:
data = raw_df.merge(historical, left_index=True, right_on='index')
data = data.drop(columns='index')
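
A hedged sketch of the same join on toy data, with made-up frames and values. The `validate='one_to_one'` argument is an addition not used above; it makes pandas raise an error if a match key appears more than once on either side. Because the join is an inner join, matches without historical rows are dropped.

```python
import pandas as pd

# Toy stand-ins: 'raw' is keyed by its index, 'hist' carries the key in an 'index' column.
raw = pd.DataFrame({'HomeTeam': ['A', 'B', 'C']}, index=[10, 11, 12])
hist = pd.DataFrame({'index': [12, 10], 'home_win_last_1': [True, False]})

# Match the left frame's index against the right frame's 'index' column.
merged = raw.merge(hist, left_index=True, right_on='index', validate='one_to_one')
merged = merged.drop(columns='index')
print(merged)
```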

In [None]:
nan_cols = data.columns[data.isna().any()].tolist()

# Print the names of the columns with NaN values and their counts
for col in nan_cols:
    print(col, data[col].isna().sum())

HTHG 4
HTAG 4
HS 1765
AS 1765
HST 1765
AST 1765
HF 1767
AF 1767
HC 1765
AC 1765
HY 4
AY 4
HR 4
AR 4
B365H 70
B365D 70
B365A 70
BWH 664
BWD 664
BWA 664
IWH 106
IWD 106
IWA 106
PSH 130
PSD 130
PSA 130
WHH 688
WHD 688
WHA 688
VCH 176
VCD 176
VCA 176
MaxH 15
MaxD 15
MaxA 15
AvgH 15
AvgD 15
AvgA 15
B365>2.5 83
B365<2.5 83
P>2.5 156
P<2.5 156
Max>2.5 16
Max<2.5 16
Avg>2.5 16
Avg<2.5 16
AHh 17
B365AHH 239
B365AHA 240
PAHH 132
PAHA 131
MaxAHH 17
MaxAHA 17
AvgAHH 17
AvgAHA 17
B365CH 32
B365CD 32
B365CA 32
BWCH 482
BWCD 482
BWCA 482
IWCH 80
IWCD 80
IWCA 80
PSCH 23
PSCD 23
PSCA 23
WHCH 652
WHCD 652
WHCA 652
VCCH 33
VCCD 33
VCCA 33
MaxCH 2
MaxCD 2
MaxCA 2
AvgCH 2
AvgCD 2
AvgCA 2
B365C>2.5 36
B365C<2.5 36
PC>2.5 46
PC<2.5 46
MaxC>2.5 2
MaxC<2.5 2
AvgC>2.5 2
AvgC<2.5 2
AHCh 2
B365CAHH 60
B365CAHA 60
PCAHH 24
PCAHA 24
MaxCAHH 2
MaxCAHA 2
AvgCAHH 2
AvgCAHA 2
home_B365CH_last_1 34
home_B365CH_last_2 28
home_B365CH_last_3 28
home_B365CH_last_4 32
home_B365CH_last_5 29
home_B365CH_last_6 32
home_B365CH_l

Last, I'll drop any rows containing NaN values. The data still needs a tiny bit of processing before it's ready to go (one-hot encoding, creation of targets, etc.) but I'll leave that for the next notebook.

In [None]:
data = data.dropna()
data

Unnamed: 0,Div,Date,HomeTeam,AwayTeam,FTHG,FTAG,HTHG,HTAG,HS,AS,...,home_delta_5,home_delta_6,home_delta_7,away_delta_1,away_delta_2,away_delta_3,away_delta_4,away_delta_5,away_delta_6,away_delta_7
7021,E0,2022-08-05,Crystal Palace,Arsenal,0.0,2.0,0.0,1.0,10.0,10.0,...,97.0,102.0,107.0,75.0,81.0,85.0,89.0,96.0,104.0,107.0
9761,E0,2022-08-06,Fulham,Liverpool,2.0,2.0,1.0,0.0,9.0,11.0,...,109.0,113.0,118.0,76.0,81.0,88.0,91.0,98.0,104.0,109.0
3977,E0,2022-08-06,Bournemouth,Aston Villa,2.0,0.0,1.0,0.0,7.0,15.0,...,105.0,110.0,113.0,76.0,79.0,83.0,88.0,91.0,98.0,105.0
13517,E0,2022-08-06,Leeds,Wolves,2.0,1.0,1.0,1.0,12.0,15.0,...,98.0,103.0,119.0,76.0,83.0,87.0,91.0,98.0,104.0,120.0
16514,E0,2022-08-06,Newcastle,Nott'm Forest,2.0,0.0,0.0,0.0,23.0,5.0,...,105.0,108.0,111.0,91.0,95.0,98.0,102.0,105.0,110.0,113.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
25091,G1,2020-07-11,Volos NFC,Atromitos,2.0,3.0,0.0,3.0,8.0,5.0,...,35.0,132.0,139.0,5.0,12.0,17.0,20.0,28.0,132.0,139.0
26280,G1,2020-07-11,Xanthi,Asteras Tripolis,1.0,2.0,0.0,1.0,12.0,7.0,...,28.0,132.0,139.0,7.0,13.0,20.0,26.0,34.0,132.0,140.0
17225,G1,2020-07-11,OFI Crete,Aris,0.0,1.0,0.0,1.0,10.0,9.0,...,21.0,27.0,35.0,3.0,6.0,10.0,13.0,21.0,27.0,35.0
18037,G1,2020-07-12,Panathinaikos,AEK,1.0,3.0,1.0,2.0,8.0,9.0,...,21.0,29.0,35.0,4.0,7.0,11.0,14.0,22.0,28.0,35.0


In [None]:
import os
os.chdir('/content/drive/MyDrive/WorkingFolder')
data.to_csv('processed-data.csv', index=False)

Notes for Professor Basye:

In my opinion, the following changes and additions should be made to the code:

1. The historical data should be added before the PCA. 
  
    I had wanted to reduce the columns before running my 'add_last_n_rows' function on the dataframe because earlier iterations of that function took so long to run. By reducing the number of columns I was looking to reduce the processing time for running that specific function, which worked.

    I've since made that function a lot more efficient, so now I think it makes more sense to do the PCA after everything, for a few reasons.

    Mainly, I have to fill NaNs to perform the PCA. I can't drop any rows before adding the historical data because chronological gaps are not acceptable. This means that these filled NaNs end up polluting lots of rows with dubious data when the historical data is added in the form of PCA components in the last n rows. I could track which rows have NaNs and then at the end remove every row that has data from those rows, but I think a less complicated solution would be to simply not PCA these columns before using them for the historical data generation. Then, since there is no longer a need for uninterrupted chronological data, rows containing NaN can be dropped without worry.

    The columns then can be PCAed if the user of my data wishes to do so.


2. The part of the code where I use the dictionaries to rename the columns in the separated home and away dataframes, so that they reflect the same statistic when merged, needs a better explanation.

    I should do a better job of explaining why I do this because, although the code does what it is intended to do, it's pretty confusing. Instead of just using H and A and swapping them if a team is away, I should rename the columns to self_stat and opp_stat (for every stat). This way, the column names will be more intuitive, especially when more prefixes are added, which I do. For example, to represent how many shots a given team's opponent took in the match played 7 games ago, home_L7_opp_shots is easier to understand than home_L7_away_shots.

3. If I plan to market this as an off-the-shelf preprocessor for the data in the archive (link at the beginning) to be used for machine learning, the entire notebook must be more robust and generalizable. Currently, it's pretty customized to my own ideas about what I was going to use it for.
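
The self/opp renaming proposed in note 2 could look roughly like the sketch below. `STAT_PAIRS` and `to_team_view` are hypothetical names, and only two stat pairs are shown; the real mapping would cover every home/away column pair.

```python
import pandas as pd

# Hypothetical mapping from a stat name to its (home column, away column) pair.
STAT_PAIRS = {'shots': ('HS', 'AS'), 'goals': ('FTHG', 'FTAG')}

def to_team_view(matches, home):
    """Return one row per match from the home (or away) team's perspective."""
    team_col = 'HomeTeam' if home else 'AwayTeam'
    rename = {team_col: 'team'}
    for stat, (h_col, a_col) in STAT_PAIRS.items():
        # The team's own stat is the home column when it plays at home, else the away column.
        self_col, opp_col = (h_col, a_col) if home else (a_col, h_col)
        rename[self_col] = f'self_{stat}'
        rename[opp_col] = f'opp_{stat}'
    view = matches.rename(columns=rename)
    view['is_home'] = int(home)
    stat_cols = [f'{p}_{s}' for s in STAT_PAIRS for p in ('self', 'opp')]
    return view[['team', 'is_home'] + stat_cols]

# One made-up match: A beat B 2-1 at home, with 10 shots to 4.
matches = pd.DataFrame({'HomeTeam': ['A'], 'AwayTeam': ['B'],
                        'FTHG': [2], 'FTAG': [1], 'HS': [10], 'AS': [4]})
team_rows = pd.concat([to_team_view(matches, True), to_team_view(matches, False)],
                      ignore_index=True)
print(team_rows)
```

With this layout both the home and the away rows share the same column names, so no swap dictionary is needed before concatenating them.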

Let me know if you have any other ideas; I'm probably forgetting some problems.



