# Data Preparation Part 1

In this section of the code, our goal is to create a dataset containing basic variables related to individual Serie A matches. The variables that will compose this dataset are:

- **Date**: The date of the match.  
- **Season**: The football season (e.g., 2022/23).  
- **HomeTeam**: The name of the home team.  
- **AwayTeam**: The name of the away team.  
- **FTHG**: Full-time goals scored by the home team.  
- **FTAG**: Full-time goals scored by the away team.  
- **FTR**: Full-time result (H = Home Win, D = Draw, A = Away Win).  
- **HTGS**: Total goals scored by the home team up to the match.  
- **ATGS**: Total goals scored by the away team up to the match.  
- **HTGC**: Total goals conceded by the home team up to the match.  
- **ATGC**: Total goals conceded by the away team up to the match.  
- **HTP**: Points accumulated by the home team up to the match.  
- **ATP**: Points accumulated by the away team up to the match.  
- **B365H**: Betting odds for a home win (Bet365).  
- **B365D**: Betting odds for a draw (Bet365).  
- **B365A**: Betting odds for an away win (Bet365).  
- **HM1, HM2, HM3, HM4, HM5**: Outcomes of the home team’s 1st to 5th most recent matches.  
- **AM1, AM2, AM3, AM4, AM5**: Outcomes of the away team’s 1st to 5th most recent matches.  
- **MW**: Matchweek (round number of the season).  
- **gameId**: Unique identifier for the match.  
- **HTFormPtsStr**: Form of the home team as a string (e.g., WWDLD).  
- **ATFormPtsStr**: Form of the away team as a string.  
- **HTFormPts**: Points earned by the home team in the last 5 matches.  
- **ATFormPts**: Points earned by the away team in the last 5 matches.  
- **HTGD**: Goal difference for the home team up to the match.  
- **ATGD**: Goal difference for the away team up to the match.  
- **DiffPts**: Difference in points between the two teams.  
- **DiffFormPts**: Difference in recent form points between the teams.

Among these, **HTGD**, **ATGD**, **DiffPts**, and **DiffFormPts** are important variables that will be used in the predictive section and were derived in this section.

It is also important to note that the data collection was carried out using `DataScraper.py`, which downloads Serie A CSV files from the [football-data.co.uk](https://football-data.co.uk) website for the specified range of seasons and returns a dictionary of DataFrames, each corresponding to a season.

The final dataset, **GENERAL_STATS**, will be saved in the `data` directory.


In [52]:
from DataScraper import *
import pandas as pd
from datetime import datetime
import numpy as np
import os
import csv
%matplotlib inline

DATA_PATH = 'data'

In [53]:
# Extract the dataset for each season from the dictionary "seasons_data"
seasons_data = download_serie_a_data_by_season(start_season="1617", end_season="2425", output_folder="data")
df17 = seasons_data['1617']
df18 = seasons_data['1718']
df19 = seasons_data['1819']
df20 = seasons_data['1920']
df21 = seasons_data['2021']
df22 = seasons_data['2122']
df23 = seasons_data['2223']
df24 = seasons_data['2324']
df25 = seasons_data['2425']
df25.head()

Unnamed: 0,Div,Date,Time,HomeTeam,AwayTeam,FTHG,FTAG,FTR,HTHG,HTAG,...,B365CAHA,PCAHH,PCAHA,MaxCAHH,MaxCAHA,AvgCAHH,AvgCAHA,BFECAHH,BFECAHA,Season
0,I1,17/08/2024,17:30,Genoa,Inter,2,2,D,1,1,...,1.87,2.08,1.85,2.13,1.89,2.05,1.82,2.09,1.9,2425
1,I1,17/08/2024,17:30,Parma,Fiorentina,1,1,D,1,0,...,1.86,2.05,1.88,2.07,1.91,2.01,1.84,2.09,1.9,2425
2,I1,17/08/2024,19:45,Empoli,Monza,0,0,D,0,0,...,1.83,2.12,1.83,2.14,1.86,2.08,1.79,2.16,1.85,2425
3,I1,17/08/2024,19:45,Milan,Torino,2,2,D,0,1,...,2.04,1.88,2.04,1.92,2.1,1.84,2.02,1.9,2.08,2425
4,I1,18/08/2024,17:30,Bologna,Udinese,1,1,D,0,0,...,1.83,2.09,1.85,2.15,1.86,2.08,1.81,2.12,1.87,2425
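Because the scraper returns a dictionary keyed by season code, the per-season steps in this notebook could also be applied in a loop instead of once per variable. The sketch below only illustrates that pattern: `apply_to_all` and `add_row_count` are hypothetical helpers that are not part of the notebook, and the two small frames are made-up stand-ins for the real season data.

```python
import pandas as pd

def apply_to_all(seasons, step):
    """Apply one preparation step to every season and return a new dictionary."""
    return {code: step(df) for code, df in seasons.items()}

def add_row_count(df):
    """Toy step: record how many matches the season frame contains."""
    out = df.copy()
    out["n_matches"] = len(out)
    return out

# Stand-in for the dictionary returned by the scraper (made-up rows).
toy_seasons = {
    "2324": pd.DataFrame({"HomeTeam": ["A", "C"], "AwayTeam": ["B", "D"]}),
    "2425": pd.DataFrame({"HomeTeam": ["A"], "AwayTeam": ["C"]}),
}

prepared = apply_to_all(toy_seasons, add_row_count)
print({code: int(df["n_matches"].iloc[0]) for code, df in prepared.items()})
```

Any of the functions defined later (for example `get_matchweek`) could be passed as `step` in the same way, since each takes a DataFrame and returns one.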


## Creation of the GENERAL_STATS dataset

Here, we primarily use functions to manipulate the data and format it as needed.

In [55]:
# Adjust the date format in each row of the dataset
def parse_date_auto(date):
    """
    Automatically detects the date format and converts the date to a datetime object.
    """
    # Try parsing with the first format
    try:
        return datetime.strptime(date, '%d/%m/%y').date()
    except ValueError:
        # If the first format fails, try the second format
        return datetime.strptime(date, '%d/%m/%Y').date()

# Apply the function to the Date columns of the dataframes
df17.Date = df17.Date.apply(parse_date_auto)
df18.Date = df18.Date.apply(parse_date_auto)
df19.Date = df19.Date.apply(parse_date_auto)
df20.Date = df20.Date.apply(parse_date_auto)
df21.Date = df21.Date.apply(parse_date_auto)
df22.Date = df22.Date.apply(parse_date_auto)
df23.Date = df23.Date.apply(parse_date_auto)
df24.Date = df24.Date.apply(parse_date_auto)
df25.Date = df25.Date.apply(parse_date_auto)

In [56]:
df25.Date

0      2024-08-17
1      2024-08-17
2      2024-08-17
3      2024-08-17
4      2024-08-18
          ...    
163    2024-12-22
164    2024-12-22
165    2024-12-22
166    2024-12-23
167    2024-12-23
Name: Date, Length: 168, dtype: object
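As an alternative to parsing one string at a time, the same two-format logic can be expressed on a whole column. This is a minimal sketch, not the notebook's method: `parse_dates` is a hypothetical helper, and it assumes every date is day-first with either a 2-digit or a 4-digit year, as in the two formats handled by `parse_date_auto`.

```python
import pandas as pd

def parse_dates(series):
    """Parse day-first dates written with a 2-digit or a 4-digit year."""
    # errors="coerce" turns strings that do not match the format into NaT
    four_digit = pd.to_datetime(series, format="%d/%m/%Y", errors="coerce")
    two_digit = pd.to_datetime(series, format="%d/%m/%y", errors="coerce")
    # Use the 4-digit parse where it succeeded, otherwise the 2-digit one
    return four_digit.fillna(two_digit).dt.date

# Made-up sample covering both formats
example = parse_dates(pd.Series(["21/08/16", "17/08/2024"]))
print(example.tolist())
```

Like `parse_date_auto`, this returns `datetime.date` objects, so the column keeps the `object` dtype seen in the output above.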

In [57]:
# Keep only the columns of interest
col = ["Date","Season", "HomeTeam", "AwayTeam", "FTHG", "FTAG", "FTR", "HTHG", "HTAG",
       "HTR","HS","AS","HST","AST","HF","AF","HC","AC","HY","AY","HR","AR","B365H","B365D","B365A"]

df17 = df17.loc[:, col]
df18 = df18.loc[:, col]
df19 = df19.loc[:, col]
df20 = df20.loc[:, col]
df21 = df21.loc[:, col]
df22 = df22.loc[:, col]
df23 = df23.loc[:, col]
df24 = df24.loc[:, col]
df25 = df25.loc[:, col]

In [58]:
# Derive the matchweek (MW) from the row order, assuming 10 matches per matchweek
def get_matchweek(playing_stat):
    """
    Adds matchweek feature to dataset
    (different MW for each 10 matches)
    """
    j = 1
    MatchWeek = []
    for i in range(len(playing_stat)):
        MatchWeek.append(j)
        if ((i + 1)% 10) == 0:
            j += 1
    playing_stat['MW'] = MatchWeek
    return playing_stat

df17 = get_matchweek(df17)
df18 = get_matchweek(df18)
df19 = get_matchweek(df19)
df20 = get_matchweek(df20)
df21 = get_matchweek(df21)
df22 = get_matchweek(df22)
df23 = get_matchweek(df23)
df24 = get_matchweek(df24)
df25 = get_matchweek(df25)
df25["MW"]

0       1
1       1
2       1
3       1
4       1
       ..
163    17
164    17
165    17
166    17
167    17
Name: MW, Length: 168, dtype: int64
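The matchweek label depends only on the row position, so it can also be written as an integer division. The sketch below assumes, as `get_matchweek` does, that rows are in chronological order and that every matchweek occupies 10 consecutive rows; `add_matchweek` is a hypothetical name and the frame contains placeholder rows only.

```python
import numpy as np
import pandas as pd

def add_matchweek(df, matches_per_week=10):
    """Label rows 0-9 as matchweek 1, rows 10-19 as matchweek 2, and so on."""
    out = df.copy()
    out["MW"] = np.arange(len(out)) // matches_per_week + 1
    return out

# 168 placeholder rows, the same length as the 2024/25 frame
toy = add_matchweek(pd.DataFrame({"match": range(168)}))
print(toy["MW"].value_counts().sort_index().tail(2))
```

With 168 rows, the last label is 17 and it covers only 8 rows, which is why the incomplete matchweek has to be filtered out afterwards.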

Since the data was collected during a matchweek of the current season (2024/25), not all of that matchweek's matches had been played. Therefore, we define a function that keeps only the completed matchweeks.

In [60]:
def filter_complete_weeks(df):
    week_counts = df['MW'].value_counts()
    # Find the weeks with exactly 10 matches played (completed match-week)
    complete_weeks = week_counts[week_counts == 10].index
    # Filter the original dataset
    return df[df['MW'].isin(complete_weeks)]

df17 = filter_complete_weeks(df17)
df18 = filter_complete_weeks(df18)
df19 = filter_complete_weeks(df19)
df20 = filter_complete_weeks(df20)
df21 = filter_complete_weeks(df21)
df22 = filter_complete_weeks(df22)
df23 = filter_complete_weeks(df23)
df24 = filter_complete_weeks(df24)
df25 = filter_complete_weeks(df25)

df25["MW"]

0       1
1       1
2       1
3       1
4       1
       ..
155    16
156    16
157    16
158    16
159    16
Name: MW, Length: 160, dtype: int64

Goals scored and conceded: The code below calculates the cumulative goals scored and conceded by the home team and the away team at each matchweek. It first aggregates the goals scored and conceded by each team in every matchweek, takes the running total across matchweeks, and then appends these cumulative statistics to the dataset. The resulting columns (`HTGS`, `ATGS`, `HTGC`, `ATGC`) hold the cumulative goals scored and conceded by the home and away teams for each match. Because each match looks up the running total of its own matchweek, these values include the goals scored in that match.

In [62]:
# Gets the goals scored aggregated by teams and matchweek
def get_goals_scored(playing_stat):
    mw = max(playing_stat['MW'])
    teams = {team: [0] * mw for team in playing_stat['HomeTeam'].unique()}
    
    # Iterate through matches and update goals scored for each team
    for i in range(len(playing_stat)):
        week = playing_stat.iloc[i]['MW'] - 1  # 0-based index
        HTGS = playing_stat.iloc[i]['FTHG']
        ATGS = playing_stat.iloc[i]['FTAG']
        teams[playing_stat.iloc[i].HomeTeam][week] += HTGS
        teams[playing_stat.iloc[i].AwayTeam][week] += ATGS
    
    # Create the DataFrame
    GoalsScored = pd.DataFrame(teams).T
    
    # Calculate cumulative goals scored
    for i in range(1, mw):
        GoalsScored[i] = GoalsScored[i] + GoalsScored[i-1]
    
    return GoalsScored

def get_goals_conceded(playing_stat):
    mw = max(playing_stat['MW'])
    teams = {team: [0] * mw for team in playing_stat['HomeTeam'].unique()}
    
    # Iterate through matches and update goals conceded for each team
    for i in range(len(playing_stat)):
        week = playing_stat.iloc[i]['MW'] - 1  # 0-based index
        ATGC = playing_stat.iloc[i]['FTHG']  # Goals conceded by the away team
        HTGC = playing_stat.iloc[i]['FTAG']  # Goals conceded by the home team
        teams[playing_stat.iloc[i].HomeTeam][week] += HTGC
        teams[playing_stat.iloc[i].AwayTeam][week] += ATGC
    
    # Create the DataFrame
    GoalsConceded = pd.DataFrame(teams).T
    
    # Calculate cumulative goals conceded
    for i in range(1, mw):
        GoalsConceded[i] = GoalsConceded[i] + GoalsConceded[i-1]
    
    return GoalsConceded

def get_goal_stats(playing_stat):
    GC = get_goals_conceded(playing_stat)
    GS = get_goals_scored(playing_stat)
   
    j = 0
    HTGS = []
    ATGS = []
    HTGC = []
    ATGC = []
    
    # Associate goals scored and goals conceded for each team in each match
    for i in range(len(playing_stat)):
        ht = playing_stat.iloc[i].HomeTeam
        at = playing_stat.iloc[i].AwayTeam
        week = j
        
        HTGS.append(GS.loc[ht][week])
        ATGS.append(GS.loc[at][week])
        HTGC.append(GC.loc[ht][week])
        ATGC.append(GC.loc[at][week])
        
        # Increment the matchweek index every 10 matches
        if ((i + 1) % 10) == 0:
            j += 1
    
    # Add the columns with goals scored and goals conceded to the original dataframe
    playing_stat['HTGS'] = HTGS
    playing_stat['ATGS'] = ATGS
    playing_stat['HTGC'] = HTGC
    playing_stat['ATGC'] = ATGC
    
    return playing_stat

# Apply the function to all dataframes
df17 = get_goal_stats(df17)
df18 = get_goal_stats(df18)
df19 = get_goal_stats(df19)
df20 = get_goal_stats(df20)
df21 = get_goal_stats(df21)
df22 = get_goal_stats(df22)
df23 = get_goal_stats(df23)
df24 = get_goal_stats(df24)
df25 = get_goal_stats(df25)
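The same running totals can be obtained with a grouped cumulative sum on a "long" table that has one row per team and matchweek. This is only a sketch under stated assumptions: `cumulative_goals` and the column names `Team`, `GF`, `GA`, `GS`, `GC` are hypothetical, and the four matches are made up.

```python
import pandas as pd

def cumulative_goals(matches):
    """Return one row per team and matchweek with running goals scored (GS) and conceded (GC)."""
    # Goals for/against from the home side's point of view
    home = matches[["MW", "HomeTeam", "FTHG", "FTAG"]].rename(
        columns={"HomeTeam": "Team", "FTHG": "GF", "FTAG": "GA"})
    # Goals for/against from the away side's point of view
    away = matches[["MW", "AwayTeam", "FTAG", "FTHG"]].rename(
        columns={"AwayTeam": "Team", "FTAG": "GF", "FTHG": "GA"})
    long = pd.concat([home, away], ignore_index=True)

    per_week = (long.groupby(["Team", "MW"], as_index=False)[["GF", "GA"]]
                    .sum()
                    .sort_values(["Team", "MW"]))
    running = per_week.groupby("Team")[["GF", "GA"]].cumsum()
    per_week["GS"] = running["GF"]
    per_week["GC"] = running["GA"]
    return per_week.reset_index(drop=True)

# Made-up fixtures: two matchweeks, two matches each
toy = pd.DataFrame({
    "MW": [1, 1, 2, 2],
    "HomeTeam": ["A", "C", "B", "D"],
    "AwayTeam": ["B", "D", "C", "A"],
    "FTHG": [2, 0, 3, 1],
    "FTAG": [1, 0, 0, 1],
})
table = cumulative_goals(toy)
print(table)
```

As in the notebook, `GS` and `GC` include the goals of the matchweek they are attached to; subtracting `GF` and `GA` from them would give the totals before that matchweek.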

In [63]:
# Here we can see that there is a mismatch in the number of matches played by each team
match_counts = df25['HomeTeam'].value_counts() + df25['AwayTeam'].value_counts()
match_counts

Atalanta      16
Bologna       16
Cagliari      16
Como          16
Empoli        16
Fiorentina    15
Genoa         16
Inter         15
Juventus      16
Lazio         16
Lecce         16
Milan         16
Monza         16
Napoli        16
Parma         16
Roma          16
Torino        17
Udinese       16
Venezia       16
Verona        17
Name: count, dtype: int64

We encounter an issue when calculating the number of points for each team at each matchweek. The problem arises due to potential mismatches in the number of matches played by different teams during certain matchweeks. This is particularly relevant for the most recent season in the dataset, as the season is still ongoing.

To ensure accuracy and avoid results that deviate significantly from reality, we need to identify the last matchweek where all teams have played the same number of matches. This matchweek will serve as the final data point for our calculations. By doing so, we can eliminate inconsistencies caused by uneven match schedules and produce results that better reflect the actual standings and performance of the teams.

In [65]:
def find_balanced_matchweek(df):
    """
    Find the most recent matchweek where all teams had played the same number of matches.

    Args:
    df (pd.DataFrame): DataFrame containing the columns 'HomeTeam', 'AwayTeam', 'MW'.

    Returns:
    int: The matchweek number where all teams were balanced.
    """
    # Get all unique matchweeks sorted
    matchweeks = sorted(df['MW'].unique())
    
    for mw in reversed(matchweeks):  # Iterate from the most recent matchweeks
        # Filter for matchweeks up to the current one
        filtered_df = df[df['MW'] <= mw]
        
        # Count the number of matches per team
        match_counts = filtered_df['HomeTeam'].value_counts() + filtered_df['AwayTeam'].value_counts()
        
        # Check if all teams have the same number of matches
        if match_counts.nunique() == 1:
            return mw  # Return the most recent balanced matchweek
    
    return None  # No balanced matchweek found
find_balanced_matchweek(df25)

8
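One detail of the match count is worth illustrating. Adding two `value_counts()` results aligns them on the team name, so a team that appears in only one of the two columns gets `NaN`, and `nunique()` ignores `NaN` by default. The sketch below shows the effect on made-up fixtures and a variant that treats a missing count as zero; `matches_played` is a hypothetical helper, and whether this changes the result on the real data is not checked here.

```python
import pandas as pd

def matches_played(df):
    """Count matches per team, treating a missing home or away count as zero."""
    home = df["HomeTeam"].value_counts()
    away = df["AwayTeam"].value_counts()
    return home.add(away, fill_value=0).astype(int)

# Made-up fixtures: A is never away, B and D are never at home
toy = pd.DataFrame({"HomeTeam": ["A", "C", "A"], "AwayTeam": ["B", "D", "C"]})

plain_sum = toy["HomeTeam"].value_counts() + toy["AwayTeam"].value_counts()
print(plain_sum.isna().sum())          # number of teams whose count became NaN
print(matches_played(toy).to_dict())   # counts with missing sides filled as 0
```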

In [66]:
df25 = df25[df25["MW"]<=8]
df25.tail()

Unnamed: 0,Date,Season,HomeTeam,AwayTeam,FTHG,FTAG,FTR,HTHG,HTAG,HTR,...,HR,AR,B365H,B365D,B365A,MW,HTGS,ATGS,HTGC,ATGC
75,2024-10-20,2425,Lecce,Fiorentina,0,6,A,0,3,A,...,1,0,3.2,3.3,2.3,8,3,15,18,8
76,2024-10-20,2425,Venezia,Atalanta,0,2,A,0,1,A,...,0,0,5.5,4.0,1.62,8,5,18,14,13
77,2024-10-20,2425,Cagliari,Torino,3,2,H,1,1,D,...,0,0,2.8,3.1,2.7,8,8,14,13,14
78,2024-10-20,2425,Roma,Inter,0,1,A,0,0,D,...,0,0,4.2,3.7,1.85,8,8,17,6,9
79,2024-10-21,2425,Verona,Monza,0,3,A,0,1,A,...,0,0,2.3,3.0,3.5,8,12,8,15,9


Then, we calculate the cumulative points earned by the home team and away team up to the given matchweek for each match. The function aggregates match results (win, draw, or loss) into cumulative points for each team and adds these as new columns to the dataset (`HTP` for home team points and `ATP` for away team points).

In [68]:
# Function to calculate points
def get_points(result):
    if result == 'W':
        return 3
    elif result == 'D':
        return 1
    else:
        return 0

# Function to calculate cumulative points
def get_cumulative_points(matchres, mw):
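    # Element-wise mapping via Series.map, which avoids the FutureWarning raised by DataFrame.applymap
    matchres_points = matchres.apply(lambda col: col.map(get_points))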
    for i in range(2, mw+1):
        matchres_points[i] = matchres_points[i] + matchres_points[i-1]
        
    matchres_points.insert(column=0, loc=0, value=0)  # every team has 0 points before matchweek 1
    return matchres_points

# Modified function to get match results
def get_match_result(playing_stat):
    # Create a dictionary with team names as keys
    teams = {team: [] for team in playing_stat['HomeTeam'].unique()}
    
    # Build the dictionary where the value is a list of match results
    for i in range(len(playing_stat)):
        if playing_stat.iloc[i].FTR == 'H':
            teams[playing_stat.iloc[i].HomeTeam].append('W')
            teams[playing_stat.iloc[i].AwayTeam].append('L')
        elif playing_stat.iloc[i].FTR == 'A':
            teams[playing_stat.iloc[i].AwayTeam].append('W')
            teams[playing_stat.iloc[i].HomeTeam].append('L')
        else:
            teams[playing_stat.iloc[i].AwayTeam].append('D')
            teams[playing_stat.iloc[i].HomeTeam].append('D')
    
    # Create the DataFrame with the results
    return pd.DataFrame(data=teams, index=[i for i in range(1, max(playing_stat['MW'])+1)]).T

# Function to aggregate points
def get_agg_points(playing_stat):
    matchres = get_match_result(playing_stat)
    cum_pts = get_cumulative_points(matchres, max(playing_stat['MW']))
    
    HTP = []
    ATP = []
    j = 0
    
    for i in range(len(playing_stat)):
        ht = playing_stat.iloc[i].HomeTeam
        at = playing_stat.iloc[i].AwayTeam
        HTP.append(cum_pts.loc[ht][j])
        ATP.append(cum_pts.loc[at][j])
        
        if ((i + 1)% 10) == 0:
            j += 1
            
    playing_stat['HTP'] = HTP
    playing_stat['ATP'] = ATP
    return playing_stat

# Apply the function to all dataframes
df17 = get_agg_points(df17)
df18 = get_agg_points(df18)
df19 = get_agg_points(df19)
df20 = get_agg_points(df20)
df21 = get_agg_points(df21)
df22 = get_agg_points(df22)
df23 = get_agg_points(df23)
df24 = get_agg_points(df24)
df25 = get_agg_points(df25)
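To make the layout of the cumulative points table concrete, here is a small sketch on made-up results for four teams over three matchweeks. It follows the same idea as `get_cumulative_points` (map results to points, accumulate across matchweeks, prepend a column of zeros) but is not the notebook's code; the name `POINTS` is an assumption.

```python
import pandas as pd

POINTS = {"W": 3, "D": 1, "L": 0}

# Made-up results: one row per team, one column per matchweek
results = pd.DataFrame(
    {1: ["W", "L", "D", "D"], 2: ["D", "W", "L", "D"], 3: ["W", "W", "L", "L"]},
    index=["A", "B", "C", "D"],
)

points = results.apply(lambda col: col.map(POINTS))  # W/D/L -> 3/1/0
cum_pts = points.cumsum(axis=1)                      # running total after each matchweek
cum_pts.insert(0, 0, 0)                              # column 0: no points before matchweek 1

print(cum_pts)
```

Column `j` holds the points a team has after `j` matchweeks, so a match in matchweek `m` reads column `m - 1`. This is why `HTP` and `ATP` exclude the result of the match they are attached to.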


The code below computes recent-form features for each team from its previous results. `get_form` builds, for every team and matchweek, a string with the results of the last `num` games, most recent first (e.g., "WWL"). `add_form` adds the columns `HM<num>` (home team) and `AM<num>` (away team), which hold the result of each team's `num`-th most recent game; rows in the first `num` matchweeks, where that game does not exist yet, are filled with "M". `add_form_df` repeats this for `num` from 1 to 5, creating the form columns for the five most recent games. Finally, these columns are added to the season datasets (`df17` to `df25`), providing insight into each team's performance trend over its most recent matches.

In [70]:
# See how each team in a match has performed in previous games
def get_form(playing_stat,num):
    form = get_match_result(playing_stat)
    form_final = form.copy()
    for i in range(num, max(playing_stat['MW'])+1):
        form_final[i] = ''
        j = 0
        while j < num:
            form_final[i] += form[i-j]
            j += 1
    return form_final

def add_form(playing_stat,num):
    form = get_form(playing_stat,num)
    h = ['M' for i in range(num * 10)]  # since form is not available for n MW (n*10)
    a = ['M' for i in range(num * 10)]
    
    j = num
    for i in range((num*10),playing_stat.shape[0]):
        ht = playing_stat.iloc[i].HomeTeam
        at = playing_stat.iloc[i].AwayTeam
        
        past = form.loc[ht][j]  # get past n results
        h.append(past[num-1])   # 0 index is most recent
        
        past = form.loc[at][j]  # get past n results.
        a.append(past[num-1])   # 0 index is most recent
        
        if ((i + 1)% 10) == 0:
            j = j + 1

    playing_stat['HM' + str(num)] = h                 
    playing_stat['AM' + str(num)] = a

    
    return playing_stat

def add_form_df(playing_statistics):
    playing_statistics = add_form(playing_statistics,1)
    playing_statistics = add_form(playing_statistics,2)
    playing_statistics = add_form(playing_statistics,3)
    playing_statistics = add_form(playing_statistics,4)
    playing_statistics = add_form(playing_statistics,5)
    return playing_statistics

df17 = add_form_df(df17)
df18 = add_form_df(df18)
df19 = add_form_df(df19)
df20 = add_form_df(df20)
df21 = add_form_df(df21)
df22 = add_form_df(df22)
df23 = add_form_df(df23)
df24 = add_form_df(df24)
df25 = add_form_df(df25)
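The form lookup for a single team can be sketched in plain Python. The helper below is hypothetical and works on one team's chronological list of results; it assumes one result per matchweek and pads with "M" when fewer than `n` earlier matches exist, which mirrors the "M" placeholder used in `add_form`.

```python
def recent_form(results, matchweek, n=5, missing="M"):
    """Return the last n results before `matchweek`, most recent first.

    `results` is chronological: results[0] is the outcome of matchweek 1.
    """
    played = list(results[: matchweek - 1])     # only matches before this matchweek
    latest_first = played[::-1][:n]             # most recent result comes first
    padding = [missing] * (n - len(latest_first))
    return "".join(latest_first + padding)

# Made-up sequence of results for one team
team_results = list("WWDLDW")
print(recent_form(team_results, 7))  # all five previous results are available
print(recent_form(team_results, 3))  # only two earlier matches, padded with "M"
```

Character `k` of the returned string (counting from 1) corresponds to the `HM<k>` or `AM<k>` column for that match.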


In [71]:
# Rearranging the columns
cols = ['Date', 'Season', 'HomeTeam', 'AwayTeam', 'FTHG', 'FTAG', 'FTR', 'HTGS', 'ATGS', 
        'HTGC', 'ATGC', 'HTP', 'ATP', 'B365H', 'B365D', 
        'B365A', 'HM1', 'HM2', 'HM3', 'HM4', 'HM5', 'AM1', 'AM2', 'AM3', 'AM4', 'AM5', 'MW']

df17 = df17[cols]
df18 = df18[cols]
df19 = df19[cols]
df20 = df20[cols]
df21 = df21[cols]
df22 = df22[cols]
df23 = df23[cols]
df24 = df24[cols]
df25 = df25[cols]

In [72]:
df24.tail()

Unnamed: 0,Date,Season,HomeTeam,AwayTeam,FTHG,FTAG,FTR,HTGS,ATGS,HTGC,...,HM2,HM3,HM4,HM5,AM1,AM2,AM3,AM4,AM5,MW
375,2024-05-26,2324,Empoli,Roma,2,1,H,29,65,54,...,L,D,L,W,W,L,D,D,W,38
376,2024-05-26,2324,Frosinone,Udinese,0,1,A,44,37,69,...,L,D,W,D,D,W,D,D,L,38
377,2024-05-26,2324,Lazio,Sassuolo,1,1,D,49,43,39,...,W,D,W,W,L,L,W,L,L,38
378,2024-05-26,2324,Verona,Inter,2,2,D,38,89,51,...,L,W,L,W,D,W,L,W,W,38
379,2024-06-02,2324,Atalanta,Fiorentina,2,3,A,72,61,42,...,W,W,W,W,W,D,W,L,W,38


In [73]:
# Build a single dataset from all seasons and add a sequential gameId (via a lambda function)
playing_stats = (pd.concat([df17,df18,df19,df20,df21,df22,df23,df24,df25], ignore_index=True)
                         .assign(gameId=lambda df: list(df.index + 1))
                         .sort_values('gameId'))

playing_stats.tail()

Unnamed: 0,Date,Season,HomeTeam,AwayTeam,FTHG,FTAG,FTR,HTGS,ATGS,HTGC,...,HM3,HM4,HM5,AM1,AM2,AM3,AM4,AM5,MW,gameId
3115,2024-10-20,2425,Lecce,Fiorentina,0,6,A,3,15,18,...,D,D,W,W,D,W,L,D,8,3116
3116,2024-10-20,2425,Venezia,Atalanta,0,2,A,5,18,14,...,W,L,L,W,D,L,W,L,8,3117
3117,2024-10-20,2425,Cagliari,Torino,3,2,H,8,14,13,...,L,L,L,L,L,W,D,W,8,3118
3118,2024-10-20,2425,Roma,Inter,0,1,A,8,17,6,...,W,D,D,W,W,L,D,W,8,3119
3119,2024-10-21,2425,Verona,Monza,0,3,A,12,8,15,...,L,L,W,D,L,L,D,D,8,3120


The code below calculates additional metrics to analyze team performance. It concatenates the results of the last five matches of the home and away teams into strings (`HTFormPtsStr` and `ATFormPtsStr`) and converts them into the points earned over those matches (`HTFormPts` and `ATFormPts`). It also calculates the goal difference of the home and away teams (`HTGD` and `ATGD`), the difference in total points (`DiffPts`), and the difference in form points over the last five matches (`DiffFormPts`) between the home and away teams. These metrics are added to the `playing_stats` dataframe for further analysis.

In [75]:
# Gets the form points.
def get_form_points(string):
    total = 0
    for letter in string:
        total += get_points(letter)
    return total

# Previous five matches result for each team (home and away team) in a single column
playing_stats['HTFormPtsStr'] = playing_stats['HM1'] + playing_stats['HM2'] + playing_stats['HM3'] + playing_stats['HM4'] + playing_stats['HM5']
playing_stats['ATFormPtsStr'] = playing_stats['AM1'] + playing_stats['AM2'] + playing_stats['AM3'] + playing_stats['AM4'] + playing_stats['AM5']

# Same as before, but considering the cumulative points in the last 5 matches for home and away team
playing_stats['HTFormPts'] = playing_stats['HTFormPtsStr'].apply(get_form_points)
playing_stats['ATFormPts'] = playing_stats['ATFormPtsStr'].apply(get_form_points)

In [76]:
# Get goal difference
playing_stats['HTGD'] = playing_stats['HTGS'] - playing_stats['HTGC']
playing_stats['ATGD'] = playing_stats['ATGS'] - playing_stats['ATGC']

# Diff in points
playing_stats['DiffPts'] = playing_stats['HTP'] - playing_stats['ATP']

# Difference in form points over the last 5 games
playing_stats['DiffFormPts'] = playing_stats['HTFormPts'] - playing_stats['ATFormPts']
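A short worked example may help to read the form columns. The sketch below uses a dictionary lookup instead of `get_points`, which gives the same values (3 for "W", 1 for "D", 0 for anything else, including the "M" placeholder); the two form strings are made up.

```python
POINTS = {"W": 3, "D": 1, "L": 0}

def form_points(form):
    """Sum the points of a form string; unknown letters such as 'M' count as 0."""
    return sum(POINTS.get(letter, 0) for letter in form)

home_form = "WWDLD"   # made-up home form: 3 + 3 + 1 + 0 + 1
away_form = "WMMMM"   # made-up away form early in a season: 3 + 0 + 0 + 0 + 0

diff_form_pts = form_points(home_form) - form_points(away_form)
print(form_points(home_form), form_points(away_form), diff_form_pts)
```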


In this step, we normalize key performance metrics by matchweek (MW) to standardize them as per-match averages. This allows us to make fair comparisons between teams, regardless of how many matches they’ve played. Metrics like goal differences (HTGD, ATGD), point differences (DiffPts, DiffFormPts), and total points (HTP, ATP) were chosen because they capture critical aspects of team performance. By scaling them, we focus on efficiency rather than raw totals, avoiding biases that arise from the cumulative nature of these metrics.

We included goal differences (HTGD, ATGD) to evaluate a team's ability to outscore opponents consistently. Point differences (DiffPts) highlight gaps in performance between home and away teams, while DiffFormPts captures short-term momentum over recent matches. Normalizing total points (HTP, ATP) ensures we assess a team’s average performance rate instead of inflated totals.

This normalization helps us compare teams at any point in the season, even when the number of matches played varies. It also facilitates cross-season analysis, making our data more interpretable and reliable for predictive modeling.

In [78]:
# Scale HTGD, ATGD, DiffPts, DiffFormPts, HTP and ATP by matchweek
cols = ['HTGD','ATGD','DiffPts','DiffFormPts','HTP','ATP']
playing_stats.MW = playing_stats.MW.astype(float)

for col in cols:
    playing_stats[col] = playing_stats[col] / playing_stats.MW
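The per-matchweek scaling can also be written as a single row-wise division. This is a minimal sketch on a made-up frame with a subset of the columns; `scaled` and `stats` are illustrative names only.

```python
import pandas as pd

# Made-up rows: the second team has twice the totals after twice the matchweeks
stats = pd.DataFrame({"MW": [4, 8], "HTP": [6, 12], "ATP": [10, 20], "HTGD": [2, -4]})
cols = ["HTP", "ATP", "HTGD"]

scaled = stats.copy()
# axis=0 aligns the divisor with the rows, so each row is divided by its own MW
scaled[cols] = stats[cols].div(stats["MW"], axis=0)

print(scaled)
```

Both rows end up with the same per-matchweek rates, which is the kind of comparison across different points in the season that the normalization is meant to allow.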

In [79]:
playing_stats.columns

Index(['Date', 'Season', 'HomeTeam', 'AwayTeam', 'FTHG', 'FTAG', 'FTR', 'HTGS',
       'ATGS', 'HTGC', 'ATGC', 'HTP', 'ATP', 'B365H', 'B365D', 'B365A', 'HM1',
       'HM2', 'HM3', 'HM4', 'HM5', 'AM1', 'AM2', 'AM3', 'AM4', 'AM5', 'MW',
       'gameId', 'HTFormPtsStr', 'ATFormPtsStr', 'HTFormPts', 'ATFormPts',
       'HTGD', 'ATGD', 'DiffPts', 'DiffFormPts'],
      dtype='object')

In [80]:
playing_stats = playing_stats.dropna().reset_index(drop=True)  # drop the rows with at least one NaN

In [102]:
playing_stats = playing_stats.drop(['HM1', 'HM2', 'HM3', 'HM4', 'HM5', 
                                    'AM1', 'AM2', 'AM3', 'AM4', 'AM5'], axis=1)

playing_stats.to_csv(os.path.join(DATA_PATH, 'GENERAL_STATS.csv'), index=False)

Finally, we saved the GENERAL_STATS dataset in the data directory.

Next, we continue the analysis in the notebook named "Data_preparation_2".