<a href="https://colab.research.google.com/github/khiemtranngoc/GoalNetAI-Multi-League-Football-Predictions/blob/main/england.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Methodology

Because our goal is to predict football match results for the 2023 season, we should not use features that only become available after a match has ended, such as match statistics and final scores. Such features do not exist for matches that have not yet been played.

To predict matches before they happen, we must build prediction models from data that is available before kick-off. However, the data we have describes the end of each match, such as the number of goals and shots per team. It cannot be used directly to train the models, so we transform it into pre-match features derived from historical data.

* In the test set (season 2023) we do not have information such as FTHG, FTAG, ...

### Features Not Suitable for Pre-Match Prediction:
* Goals and Results (FTHG, FTAG, FTR, HTHG, HTAG, HTR): These are outcomes of the match, not available before it starts.

* In-Match Statistics (HS, AS, HST, AST, HHW, AHW, HC, AC, HF, AF, HFKC, AFKC, HO, AO, HY, AY, HR, AR): These are also outcomes or events that occur during the match.
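
One way to turn end-of-match columns into pre-match features is to average only the matches a team has already played. The following is a minimal sketch on a toy frame that reuses the `HomeTeam` and `FTHG` column names; the column `home_goals_avg_before` is a hypothetical name, and this is an illustration rather than the method used in the cells below.

```python
import pandas as pd

matches = pd.DataFrame({
    "HomeTeam": ["Arsenal", "Chelsea", "Arsenal", "Arsenal"],
    "FTHG": [2, 1, 0, 4],
})

# shift(1) moves each team's goals down by one match, so the expanding mean
# for a given row only sees matches played before that row.
matches["home_goals_avg_before"] = (
    matches.groupby("HomeTeam")["FTHG"]
    .transform(lambda s: s.shift(1).expanding().mean())
)
print(matches)
```

A team's first match has no history, so its value is missing and has to be filled or dropped before training.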

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
from google.colab import drive
drive.mount('/content/drive/')

Mounted at /content/drive/


In [None]:
file_path = '/content/drive/My Drive/test/england/0/2223.csv'
df2023 = pd.read_csv(file_path)

In [None]:
file_path_1 = '/content/drive/My Drive/test/england/1/2223.csv'
df2023_eng1 = pd.read_csv(file_path_1)

In [None]:
file_path_2 = '/content/drive/My Drive/test/england/2/2223.csv'
df2023_eng2 = pd.read_csv(file_path_2)

In [None]:
file_path_3 = '/content/drive/My Drive/test/england/3/2223.csv'
df2023_eng3 = pd.read_csv(file_path_3)

In [None]:
# Load every seasonal dataset of one league from the training folder

def load_seasonal_data(base_path, country, league, start_season, end_season):
    seasonal_data = {}

    for season_start_year in range(start_season, end_season + 1):

        start_year_suffix = (season_start_year - 1) % 100
        end_year_suffix = season_start_year % 100

        season_str = f"{start_year_suffix:02d}{end_year_suffix:02d}"

        file_path = f"{base_path}/{country}/{league}/{season_str}.csv"

        seasonal_data[f'{league}{season_str}'] = pd.read_csv(file_path)

    return seasonal_data


base_path = "/content/drive/MyDrive/train"
country = "england"
league = "0"
seasonal_datasets = load_seasonal_data(base_path, country, league, 1, 22)

# Example: access the data for the 2001/2002 season of league 0
# eng00102 = seasonal_datasets['00102']
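
To make the file-name convention concrete, the sketch below repeats the season-string arithmetic from `load_seasonal_data` on its own. The helper name `season_code` is hypothetical and exists only for this illustration.

```python
def season_code(season_end_year: int) -> str:
    """Build the four-digit file stem, e.g. 16 -> '1516' for season 2015/16."""
    start = (season_end_year - 1) % 100
    end = season_end_year % 100
    return f"{start:02d}{end:02d}"


league = "0"
for year in (1, 16, 22):
    # The dictionary key is the league code followed by the season stem
    print(year, "->", f"{league}{season_code(year)}")
```

The loop argument is therefore the year in which a season ends, which is why `seasonal_datasets['01516']` holds season 2015/16 of league 0.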


In [None]:

eng01516 = seasonal_datasets['01516']
eng01617 = seasonal_datasets['01617']
eng01718 = seasonal_datasets['01718']
eng01819 = seasonal_datasets['01819']
eng01920 = seasonal_datasets['01920']
eng02021 = seasonal_datasets['02021']
eng02122 = seasonal_datasets['02122']

In [None]:
columns = ['Date', 'HomeTeam', 'AwayTeam', 'FTHG', 'FTAG', 'FTR', 'HS', 'AS',
           'HST', 'AST', 'HC', 'AC', 'B365H', 'B365D', 'B365A']

In [None]:
df2016 = eng01516[columns]
df2017 = eng01617[columns]
df2018 = eng01718[columns]
df2019 = eng01819[columns]
df2020 = eng01920[columns]
df2021 = eng02021[columns]
df2022 = eng02122[columns]

In [None]:
# This function shows which columns of a DataFrame contain missing values

def missing_values_summary(df):

    missing_counts = df.isnull().sum()

    missing_counts = missing_counts[missing_counts > 0]

    summary_df = pd.DataFrame(missing_counts, columns=['Missing Values Count'])
    summary_df.index.name = 'Column'

    return summary_df

In [None]:
summary = missing_values_summary(df2016)
print(summary)

Empty DataFrame
Columns: [Missing Values Count]
Index: []


In [None]:
summary = missing_values_summary(df2017)
print(summary)

Empty DataFrame
Columns: [Missing Values Count]
Index: []


In [None]:
summary = missing_values_summary(df2018)
print(summary)

Empty DataFrame
Columns: [Missing Values Count]
Index: []


In [None]:
summary = missing_values_summary(df2019)
print(summary)

Empty DataFrame
Columns: [Missing Values Count]
Index: []


In [None]:
summary = missing_values_summary(df2020)
print(summary)

        Missing Values Count
Column                      
HST                       11


In [None]:
summary = missing_values_summary(df2021)
print(summary)

Empty DataFrame
Columns: [Missing Values Count]
Index: []


In [None]:
summary = missing_values_summary(df2022)
print(summary)

Empty DataFrame
Columns: [Missing Values Count]
Index: []


In [None]:
# Function to display rows with missing values from a DataFrame.

def show_rows_with_missing_values(df):

    rows_with_missing_values = df[df.isnull().any(axis=1)]

    return rows_with_missing_values


In [None]:
show_rows_with_missing_values(df2020)

Unnamed: 0,Date,HomeTeam,AwayTeam,FTHG,FTAG,FTR,HS,AS,HST,AST,HC,AC,B365H,B365D,B365A
0,21/09/2019,Burnley,Norwich,2,0,H,13,11,,2,7,3,2.0,3.8,3.5
1,02/01/2020,Liverpool,Sheffield United,2,0,H,19,3,,2,8,4,1.2,6.5,13.0
2,16/07/2020,Southampton,Brighton,1,1,D,21,10,,2,8,2,2.15,3.3,3.5
3,12/01/2020,Bournemouth,Watford,0,3,A,10,18,,6,5,5,2.6,3.3,2.7
4,14/09/2019,Wolves,Chelsea,2,5,A,11,15,,6,7,5,2.9,3.3,2.5
5,26/12/2019,Man United,Newcastle,4,1,H,22,7,,2,5,0,1.33,5.25,9.0
6,06/10/2019,Man City,Wolves,0,2,A,18,7,,2,9,1,1.12,9.0,21.0
7,14/12/2019,Leicester,Norwich,1,1,D,18,10,,3,12,4,1.22,6.5,13.0
8,12/07/2020,Wolves,Everton,3,0,H,14,6,,2,5,2,2.05,3.3,3.8
9,08/12/2019,Aston Villa,Leicester,1,4,A,15,23,,8,8,5,4.33,3.8,1.8


In [None]:
def transform_goals_to_absolute(df):
    """
    Function to transform values in 'FTHG' (Full Time Home Team Goals) and
    'FTAG' (Full Time Away Team Goals) columns to their absolute values.

    Args:
    df (pd.DataFrame): DataFrame containing the match data.

    Returns:
    pd.DataFrame: Updated DataFrame with absolute values in the specified columns.
    """
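    # Work on a copy so that a DataFrame sliced from another one is not
    # modified in place (this avoids the SettingWithCopyWarning)
    df = df.copy()

    # Convert to absolute values
    df['FTHG'] = df['FTHG'].abs()
    df['FTAG'] = df['FTAG'].abs()

    return df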

I created this function because the goal columns (for example the away-team goals, FTAG) sometimes contain negative values, and a number of goals cannot be negative.

In [None]:
df2016 = transform_goals_to_absolute(df2016)
df2017 = transform_goals_to_absolute(df2017)
df2018 = transform_goals_to_absolute(df2018)
df2019 = transform_goals_to_absolute(df2019)
df2020 = transform_goals_to_absolute(df2020)
df2021 = transform_goals_to_absolute(df2021)
df2022 = transform_goals_to_absolute(df2022)



Function showing extreme outliers

In [None]:
def find_and_print_outlier_rows(df):
    """
    Identifies and prints rows containing outliers for all numerical columns in the DataFrame.

    Parameters:
    df (pd.DataFrame): The dataset.
    """
    for column in df.select_dtypes(include=['number']).columns:
        Q1 = df[column].quantile(0.25)
        Q3 = df[column].quantile(0.75)
        IQR = Q3 - Q1
        lower_bound = Q1 - 3 * IQR
        upper_bound = Q3 + 3 * IQR

        outliers = df[(df[column] < lower_bound) | (df[column] > upper_bound)]

        if not outliers.empty:
            print(f"Rows with outliers in column '{column}':")
            print(outliers)
            print("\n")
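
To see how the 3 × IQR fences behave, here is a small worked sketch on made-up goal counts with one corrupted value. The series is invented for illustration; only the multiplier is taken from the function above.

```python
import pandas as pd

goals = pd.Series([0, 1, 1, 2, 2, 3, 291])

q1 = goals.quantile(0.25)
q3 = goals.quantile(0.75)
iqr = q3 - q1

# Same 3 * IQR multiplier as find_and_print_outlier_rows
lower_bound = q1 - 3 * iqr
upper_bound = q3 + 3 * iqr

outliers = goals[(goals < lower_bound) | (goals > upper_bound)]
print(lower_bound, upper_bound)
print(outliers.tolist())
```

With a multiplier of 3 rather than the more common 1.5, only values far outside the bulk of the data are flagged.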

In [None]:
find_and_print_outlier_rows(df2018)

Rows with outliers in column 'FTHG':
           Date   HomeTeam      AwayTeam  FTHG  FTAG FTR  HS  AS  HST  AST  \
91   14/10/2017   Man City         Stoke     7     2   H  20   5   11    1   
272  21/04/2018  West Brom     Liverpool   291     2   D  13   9    6    3   
332  24/02/2018  West Brom  Huddersfield   291     2   A  10  16    3    7   

     HC  AC  B365H  B365D  B365A  
91    5   0   1.14   9.50  21.00  
272   7   4   7.50   4.50   1.50  
332   6   3   1.95   3.29   4.75  


Rows with outliers in column 'B365H':
           Date        HomeTeam    AwayTeam  FTHG  FTAG FTR  HS  AS  HST  AST  \
11   13/05/2018     Southampton    Man City     0     1   A   8  13    3    2   
34   30/09/2017    Huddersfield   Tottenham     0     4   A   6  14    1    7   
77   12/03/2018           Stoke    Man City     0     2   A   3  17    0    6   
80   01/01/2018         Burnley   Liverpool     1     2   A  13  19    4    5   
83   13/12/2017         Swansea    Man City     0     4   A   7  

In [None]:
def filter_goals_under_30(df):
    """
    Filter the DataFrame to select rows where both 'FTHG' and 'FTAG' are smaller than 30,
    including rows where 'FTHG' or 'FTAG' might be NA.

    Parameters:
    df (pandas.DataFrame): The input DataFrame with football data.

    Returns:
    pandas.DataFrame: The filtered DataFrame.
    """
    filtered_df = df[((df['FTHG'] < 30) & (df['FTAG'] < 30)) | df['FTHG'].isna() | df['FTAG'].isna()]
    return filtered_df


In [None]:
df2016 = filter_goals_under_30(df2016)
df2017 = filter_goals_under_30(df2017)
df2018 = filter_goals_under_30(df2018)
df2019 = filter_goals_under_30(df2019)
df2020 = filter_goals_under_30(df2020)
df2021 = filter_goals_under_30(df2021)
df2022 = filter_goals_under_30(df2022)


In [None]:
# Function to selectively impute missing values in a DataFrame using KNNImputer.
# The imputation is applied only to columns with missing values, and results are rounded to integers.


from sklearn.impute import KNNImputer

def impute_missing_values_knn(df, n_neighbors=5):
    cols_with_missing = df.columns[df.isnull().any()]
    numeric_cols_with_missing = df[cols_with_missing].select_dtypes(include=[np.number]).columns

    imputer = KNNImputer(n_neighbors=n_neighbors)

    df_numeric_imputed = df.copy()
    if len(numeric_cols_with_missing) > 0:
        imputed_data = imputer.fit_transform(df[numeric_cols_with_missing])
        df_imputed = pd.DataFrame(imputed_data, columns=numeric_cols_with_missing, index=df.index)

        for col in numeric_cols_with_missing:
            df_numeric_imputed[col] = df_numeric_imputed[col].fillna(np.round(df_imputed[col]))

    return df_numeric_imputed
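
The function above gives the imputer only the columns that contain gaps. A possible variation, sketched here on made-up numbers, also passes a complete column so that neighbours are chosen by similarity on it. Using shots (`HS`) as the helper column is an assumption made for this example, not something the notebook does.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

toy = pd.DataFrame({
    "HS":  [10, 12, 20, 22, 21],
    "HST": [3, 4, 9, np.nan, 7],
})

# Passing HS alongside HST lets the imputer pick neighbours by shot count
imputer = KNNImputer(n_neighbors=2)
filled = pd.DataFrame(
    imputer.fit_transform(toy), columns=toy.columns, index=toy.index
)

# Shots on target are counts, so round the imputed values to integers
filled["HST"] = filled["HST"].round().astype(int)
print(filled)
```

The missing row has 22 shots, so its two nearest neighbours are the rows with 21 and 20 shots, and the imputed value is the mean of their `HST` values.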


In [None]:
# Fill the missing values of the 2019/20 season with the KNN imputer

df2020 = impute_missing_values_knn(df2020)

Function to preprocess football data and create new features:

* Home and away team win rates over a season
* Home and away team average goals per match over a season
* Implied winning probabilities from the bookmaker's betting odds (B365 columns)
* Goal ratio per shot on target (total goals / total shots on target) for each team
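
The implied probabilities in the list above are the inverse of the decimal odds. The sketch below uses the odds of one match from this data set and additionally shows an optional normalisation step, which is not part of the notebook: inverse odds typically sum to more than 1 because of the bookmaker's margin, so dividing by that sum yields values that add up to 1.

```python
import pandas as pd

odds = pd.DataFrame({"B365H": [2.45], "B365D": [3.5], "B365A": [3.0]})

raw = 1 / odds                           # inverse decimal odds
overround = raw.sum(axis=1)              # typically above 1 (bookmaker margin)
normalised = raw.div(overround, axis=0)  # rescaled so each row sums to 1

print(raw.round(2))
print(overround.round(3))
print(normalised.round(2))
```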

In [None]:
def preprocess_football_data(df):


    # Calculating win rates and average goals
    home_win_rate = df.groupby('HomeTeam')['FTR'].apply(lambda x: round((x == 'H').mean(), 2)).to_dict()
    away_win_rate = df.groupby('AwayTeam')['FTR'].apply(lambda x: round((x == 'A').mean(), 2)).to_dict()
    home_goals_avg = df.groupby('HomeTeam')['FTHG'].mean().apply(lambda x: round(x, 2)).to_dict()
    away_goals_avg = df.groupby('AwayTeam')['FTAG'].mean().apply(lambda x: round(x, 2)).to_dict()
    home_goals_conceded_avg = df.groupby('HomeTeam')['FTAG'].mean().apply(lambda x: round(x, 2)).to_dict()
    away_goals_conceded_avg = df.groupby('AwayTeam')['FTHG'].mean().apply(lambda x: round(x, 2)).to_dict()
    goal_ratio_H = df.groupby('HomeTeam').apply(lambda x: round(x['FTHG'].sum() / x['HST'].sum(),2) if x['HST'].sum() > 0 else 0)
    goal_ratio_A = df.groupby('AwayTeam').apply(lambda x: round(x['FTAG'].sum() / x['AST'].sum(),2) if x['AST'].sum() > 0 else 0)


    # Mapping the win rates and average goals to the main DataFrame
    df['HomeTeam_WinRate'] = df['HomeTeam'].map(home_win_rate)
    df['AwayTeam_WinRate'] = df['AwayTeam'].map(away_win_rate)
    df['HomeTeam_GoalsAvg'] = df['HomeTeam'].map(home_goals_avg)
    df['AwayTeam_GoalsAvg'] = df['AwayTeam'].map(away_goals_avg)
    df['HomeTeam_goals_conceded_avg'] = df['HomeTeam'].map(home_goals_conceded_avg)
    df['AwayTeam_goals_conceded_avg'] = df['AwayTeam'].map(away_goals_conceded_avg)


    # Calculating implied probabilities from betting odds
    df['Broker_prob_H'] = round(1 / df['B365H'], 2)
    df['Broker_prob_D'] = round(1 / df['B365D'], 2)
    df['Broker_prob_A'] = round(1 / df['B365A'], 2)

    # Calculate the total goals for each match
    df['total_goal'] = df['FTHG'] + df['FTAG']


    # Map the conversion rates back to the original DataFrame
    df['H_goal_ratio'] = df['HomeTeam'].map(goal_ratio_H)
    df['A_goal_ratio'] = df['AwayTeam'].map(goal_ratio_A)

    # Drop malformed rows in which a team is listed as playing against itself
    clean_df = df[df['HomeTeam'] != df['AwayTeam']]

    return clean_df

In [None]:
def add_adjusted_win_loss_ratio(df):
    def adjusted_win_loss_ratio(wins, draws, losses, total_matches):
        return ((3*wins + draws) - losses) / total_matches if total_matches > 0 else 0

    # Initialize a dictionary to track head-to-head stats
    head_to_head_stats = {}

    # Update head-to-head stats
    for index, row in df.iterrows():
        teams = tuple(sorted([row['HomeTeam'], row['AwayTeam']]))
        if teams not in head_to_head_stats:
            head_to_head_stats[teams] = {'wins': {teams[0]: 0, teams[1]: 0},
                                         'draws': 0,
                                         'total_matches': 0}

        head_to_head_stats[teams]['total_matches'] += 1
        if row['FTR'] == 'H':
            head_to_head_stats[teams]['wins'][row['HomeTeam']] += 1
        elif row['FTR'] == 'D':
            head_to_head_stats[teams]['draws'] += 1
        elif row['FTR'] == 'A':
            head_to_head_stats[teams]['wins'][row['AwayTeam']] += 1

    # Calculate and add the adjusted win-loss ratio to the DataFrame
    def calculate_ratio_for_match(row):
        teams = tuple(sorted([row['HomeTeam'], row['AwayTeam']]))
        stats = head_to_head_stats[teams]
        home_wins = stats['wins'][row['HomeTeam']]
        away_wins = stats['wins'][row['AwayTeam']]
        draws = stats['draws']
        total_matches = stats['total_matches']
        home_ratio = adjusted_win_loss_ratio(home_wins, draws, total_matches - home_wins - draws, total_matches)
        away_ratio = adjusted_win_loss_ratio(away_wins, draws, total_matches - away_wins - draws, total_matches)
        return pd.Series([home_ratio, away_ratio])

    df[['adjusted_win_lost_ratio_H', 'adjusted_win_lost_ratio_A']] = df.apply(calculate_ratio_for_match, axis=1)

    return df
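
As a worked example of the formula used above, consider a hypothetical pairing that met twice in a season, with one home win and one draw. The numbers are invented; the function body is copied from `add_adjusted_win_loss_ratio`.

```python
def adjusted_win_loss_ratio(wins, draws, losses, total_matches):
    # Same formula as in add_adjusted_win_loss_ratio
    return ((3 * wins + draws) - losses) / total_matches if total_matches > 0 else 0


# Team H: one win and one draw. Team A: one draw and one loss.
home = adjusted_win_loss_ratio(wins=1, draws=1, losses=0, total_matches=2)
away = adjusted_win_loss_ratio(wins=0, draws=1, losses=1, total_matches=2)
print(home, away)
```

Each draw counts for both sides, and a pairing with no recorded matches falls back to 0.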

I developed the features {'attack_strength_home_team'} and {'attack_strength_away_team'} for every team in the league. These features measure a team's ability to score goals compared to the league average, offering a consistent way to gauge their attacking strength.

In [None]:
def calculate_attack_strength(df):
    # Calculate total goals for each team
    total_home_goals = df.groupby('HomeTeam')['FTHG'].sum()
    total_away_goals = df.groupby('AwayTeam')['FTAG'].sum()

    # Calculate league averages for home and away goals
    average_home_goals = df['FTHG'].mean()
    average_away_goals = df['FTAG'].mean()

    # Calculate attack strength
    df['attack_strength_home_team'] = df['HomeTeam'].apply(lambda x: round(total_home_goals[x] / average_home_goals,2))
    df['attack_strength_away_team'] = df['AwayTeam'].apply(lambda x: round(total_away_goals[x] / average_away_goals,2))

    return df
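
A small worked sketch of the same calculation on invented data may help when reading the resulting values. Note that the numerator is a team's season total while the denominator is the league's average per match, so the result grows with the number of matches played.

```python
import pandas as pd

toy = pd.DataFrame({
    "HomeTeam": ["A", "A", "B", "B"],
    "FTHG": [3, 1, 0, 2],
})

total_home_goals = toy.groupby("HomeTeam")["FTHG"].sum()  # A: 4, B: 2
average_home_goals = toy["FTHG"].mean()                    # 1.5 goals per match

toy["attack_strength_home_team"] = (
    toy["HomeTeam"].map(total_home_goals) / average_home_goals
).round(2)
print(toy)
```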


In [None]:
df2016 =  preprocess_football_data(df2016)
df2017 =  preprocess_football_data(df2017)
df2018 =  preprocess_football_data(df2018)
df2019 =  preprocess_football_data(df2019)
df2020 =  preprocess_football_data(df2020)
df2021 =  preprocess_football_data(df2021)
df2022 =  preprocess_football_data(df2022)

In [None]:
df2016 = calculate_attack_strength(df2016)
df2017 = calculate_attack_strength(df2017)
df2018 = calculate_attack_strength(df2018)
df2019 = calculate_attack_strength(df2019)
df2020 = calculate_attack_strength(df2020)
df2021 = calculate_attack_strength(df2021)
df2022 = calculate_attack_strength(df2022)

In [None]:
df2016 =  add_adjusted_win_loss_ratio(df2016)
df2017 =  add_adjusted_win_loss_ratio(df2017)
df2018 =  add_adjusted_win_loss_ratio(df2018)
df2019 =  add_adjusted_win_loss_ratio(df2019)
df2020 =  add_adjusted_win_loss_ratio(df2020)
df2021 =  add_adjusted_win_loss_ratio(df2021)
df2022 =  add_adjusted_win_loss_ratio(df2022)

In [None]:


def process_time_data(df, target_year):
    # Dates are written day first (e.g. 21/09/2019), so parse them that way
    df['Date'] = pd.to_datetime(df['Date'], dayfirst=True)

    # Label every match with the season's end year, whatever its calendar year
    df['Year'] = target_year

    # The raw date is not used as a feature
    df.drop(['Date'], axis=1, inplace=True)

    return df
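
The `Date` strings in this data are written day first (for example `21/09/2019`). A small sketch of parsing such strings explicitly with the `dayfirst` argument of `pd.to_datetime`, using two dates that appear in the 2019/20 season:

```python
import pandas as pd

dates = pd.Series(["21/09/2019", "02/01/2020"])

# dayfirst=True reads the strings as day/month/year
parsed = pd.to_datetime(dates, dayfirst=True)
print(parsed.dt.day.tolist())
print(parsed.dt.month.tolist())
```

Without this hint, an ambiguous string such as `02/01/2020` can be read month first.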


In [None]:
df2016 = process_time_data(df2016, 2016)
df2017 = process_time_data(df2017, 2017)
df2018 = process_time_data(df2018, 2018)
df2019 = process_time_data(df2019, 2019)
df2020 = process_time_data(df2020, 2020)
df2021 = process_time_data(df2021, 2021)
df2022 = process_time_data(df2022, 2022)



In [None]:
columns_to_drop = ['FTHG', 'FTAG', 'HTR', 'HC', 'AC', 'HST', 'AST', 'HS', 'AS']

In [None]:
eng0 = pd.concat([df2016, df2017, df2018, df2019, df2020, df2021, df2022], ignore_index=True)

In [None]:
eng0 = eng0.drop(columns=columns_to_drop, errors='ignore')

In [None]:
eng0.head()

Unnamed: 0,HomeTeam,AwayTeam,FTR,B365H,B365D,B365A,HomeTeam_WinRate,AwayTeam_WinRate,HomeTeam_GoalsAvg,AwayTeam_GoalsAvg,...,Broker_prob_D,Broker_prob_A,total_goal,H_goal_ratio,A_goal_ratio,attack_strength_home_team,attack_strength_away_team,adjusted_win_lost_ratio_H,adjusted_win_lost_ratio_A,Year
0,Chelsea,Arsenal,H,2.45,3.5,3.0,0.26,0.42,1.68,1.84,...,0.29,0.33,3,0.34,0.37,21.49,29.17,3.0,-1.0,2016
1,Swansea,Sunderland,A,1.8,3.6,5.25,0.42,0.16,1.05,1.16,...,0.28,0.19,3,0.32,0.31,13.43,18.33,0.0,2.0,2016
2,Crystal Palace,Stoke,H,2.05,3.5,3.9,0.28,0.32,0.94,1.0,...,0.29,0.26,3,0.2,0.29,11.42,15.83,3.0,-1.0,2016
3,Liverpool,Man United,A,2.3,3.25,3.5,0.42,0.37,1.74,1.16,...,0.31,0.29,1,0.31,0.32,22.16,18.33,-1.0,3.0,2016
4,Watford,Swansea,H,2.75,3.4,2.75,0.32,0.21,1.05,1.16,...,0.29,0.36,1,0.27,0.3,13.43,18.33,1.0,1.0,2016


In [None]:
eng0.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2596 entries, 0 to 2595
Data columns (total 23 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   HomeTeam                     2596 non-null   object 
 1   AwayTeam                     2596 non-null   object 
 2   FTR                          2596 non-null   object 
 3   B365H                        2596 non-null   float64
 4   B365D                        2596 non-null   float64
 5   B365A                        2596 non-null   float64
 6   HomeTeam_WinRate             2596 non-null   float64
 7   AwayTeam_WinRate             2596 non-null   float64
 8   HomeTeam_GoalsAvg            2596 non-null   float64
 9   AwayTeam_GoalsAvg            2596 non-null   float64
 10  HomeTeam_goals_conceded_avg  2596 non-null   float64
 11  AwayTeam_goals_conceded_avg  2596 non-null   float64
 12  Broker_prob_H                2596 non-null   float64
 13  Broker_prob_D     

In [None]:
find_and_print_outlier_rows(eng0)

Rows with outliers in column 'B365H':
         HomeTeam   AwayTeam FTR  B365H  B365D  B365A  HomeTeam_WinRate  \
415          Hull   Man City   A   10.0   5.50   1.30              0.42   
426      West Ham  Tottenham   H    8.5   5.25   1.40              0.37   
437    Sunderland  Tottenham   D    9.5   5.25   1.36              0.16   
487       Burnley    Chelsea   D    9.0   5.00   1.40              0.53   
510    Sunderland  Liverpool   D   11.0   5.50   1.33              0.16   
...           ...        ...  ..    ...    ...    ...               ...   
2542      Everton   Man City   A   10.0   5.75   1.28              0.47   
2543      Norwich  Liverpool   A    9.0   5.75   1.30              0.17   
2554    Brentford   Man City   A   19.0   8.00   1.14              0.39   
2573      Watford   Man City   A   12.0   7.00   1.20              0.11   
2580  Southampton   Man City   D   10.0   6.50   1.25              0.32   

      AwayTeam_WinRate  HomeTeam_GoalsAvg  AwayTeam_GoalsAvg 

Load the second division (England, league 1)

In [None]:
league = "1"
seasonal_datasets = load_seasonal_data(base_path, country, league, 1, 22)

In [None]:

eng11516 = seasonal_datasets['11516']
eng11617 = seasonal_datasets['11617']
eng11718 = seasonal_datasets['11718']
eng11819 = seasonal_datasets['11819']
eng11920 = seasonal_datasets['11920']
eng12021 = seasonal_datasets['12021']
eng12122 = seasonal_datasets['12122']

In [None]:
df20161 = eng11516[columns]
df20171 = eng11617[columns]
df20181 = eng11718[columns]
df20191 = eng11819[columns]
df20201 = eng11920[columns]
df20211 = eng12021[columns]
df20221 = eng12122[columns]

In [None]:
df20191 = df20191.drop(217)

In [None]:
df20161 = transform_goals_to_absolute(df20161)
df20171 = transform_goals_to_absolute(df20171)
df20181 = transform_goals_to_absolute(df20181)
df20191 = transform_goals_to_absolute(df20191)
df20201 = transform_goals_to_absolute(df20201)
df20211 = transform_goals_to_absolute(df20211)
df20221 = transform_goals_to_absolute(df20221)



In [None]:
df20161 = filter_goals_under_30(df20161)
df20171 = filter_goals_under_30(df20171)
df20181 = filter_goals_under_30(df20181)
df20191 = filter_goals_under_30(df20191)
df20201 = filter_goals_under_30(df20201)
df20211 = filter_goals_under_30(df20211)
df20221 = filter_goals_under_30(df20221)


In [None]:
def remove_rows_with_inf(df):
    """
    Remove rows from a DataFrame that contain infinite or missing values.

    Parameters:
    df (pd.DataFrame): The DataFrame to process.

    Returns:
    pd.DataFrame: A new DataFrame without rows containing inf, -inf, or NaN.
    """
    # Treat inf/-inf as missing, without modifying the caller's DataFrame
    df_cleaned = df.replace([np.inf, -np.inf], np.nan)

    # Drop every row with a missing value (original NaN or converted inf)
    df_cleaned = df_cleaned.dropna()

    return df_cleaned

In [None]:
df20161 =  preprocess_football_data(df20161)
df20171 =  preprocess_football_data(df20171)
df20181 =  preprocess_football_data(df20181)
df20191 =  preprocess_football_data(df20191)
df20201 =  preprocess_football_data(df20201)
df20211 =  preprocess_football_data(df20211)
df20221 =  preprocess_football_data(df20221)

In [None]:
df20161 = calculate_attack_strength(df20161)
df20171 = calculate_attack_strength(df20171)
df20181 = calculate_attack_strength(df20181)
df20191 = calculate_attack_strength(df20191)
df20201 = calculate_attack_strength(df20201)
df20211 = calculate_attack_strength(df20211)
df20221 = calculate_attack_strength(df20221)

In [None]:
df20161 =  add_adjusted_win_loss_ratio(df20161)
df20171 =  add_adjusted_win_loss_ratio(df20171)
df20181 =  add_adjusted_win_loss_ratio(df20181)
df20191 =  add_adjusted_win_loss_ratio(df20191)
df20201 =  add_adjusted_win_loss_ratio(df20201)
df20211 =  add_adjusted_win_loss_ratio(df20211)
df20221 =  add_adjusted_win_loss_ratio(df20221)

In [None]:
df20161 = process_time_data(df20161, 2016)
df20171 = process_time_data(df20171, 2017)
df20181 = process_time_data(df20181, 2018)
df20191 = process_time_data(df20191, 2019)
df20201 = process_time_data(df20201, 2020)
df20211 = process_time_data(df20211, 2021)
df20221 = process_time_data(df20221, 2022)



In [None]:
eng1 = pd.concat([df20161, df20171, df20181, df20191, df20201, df20211, df20221], ignore_index=True)

In [None]:
eng1 = eng1.drop(columns=columns_to_drop, errors='ignore')

In [None]:
eng1.head()

Unnamed: 0,HomeTeam,AwayTeam,FTR,B365H,B365D,B365A,HomeTeam_WinRate,AwayTeam_WinRate,HomeTeam_GoalsAvg,AwayTeam_GoalsAvg,...,Broker_prob_D,Broker_prob_A,total_goal,H_goal_ratio,A_goal_ratio,attack_strength_home_team,attack_strength_away_team,adjusted_win_lost_ratio_H,adjusted_win_lost_ratio_A,Year
0,Wolves,Nott'm Forest,D,2.3,3.4,3.4,0.32,0.19,1.18,0.76,...,0.29,0.29,2,0.33,0.19,19.24,15.03,1.0,1.0,2016
1,Charlton,Huddersfield,A,2.3,3.4,3.4,0.18,0.26,0.95,1.13,...,0.29,0.29,3,0.21,0.28,15.54,24.42,-1.0,3.0,2016
2,Ipswich,Wolves,D,2.25,3.4,3.5,0.41,0.3,1.18,1.17,...,0.29,0.29,4,0.27,0.33,19.24,25.36,1.0,1.0,2016
3,Huddersfield,Birmingham,D,2.1,3.6,3.75,0.3,0.29,1.43,1.1,...,0.28,0.27,2,0.33,0.26,24.42,21.6,2.0,0.0,2016
4,Preston,Leeds,D,2.05,3.6,3.9,0.3,0.3,0.91,1.17,...,0.28,0.26,2,0.24,0.32,15.54,25.36,0.0,2.0,2016


In [None]:
eng1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3777 entries, 0 to 3776
Data columns (total 23 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   HomeTeam                     3777 non-null   object 
 1   AwayTeam                     3777 non-null   object 
 2   FTR                          3777 non-null   object 
 3   B365H                        3777 non-null   float64
 4   B365D                        3777 non-null   float64
 5   B365A                        3777 non-null   float64
 6   HomeTeam_WinRate             3777 non-null   float64
 7   AwayTeam_WinRate             3777 non-null   float64
 8   HomeTeam_GoalsAvg            3777 non-null   float64
 9   AwayTeam_GoalsAvg            3777 non-null   float64
 10  HomeTeam_goals_conceded_avg  3777 non-null   float64
 11  AwayTeam_goals_conceded_avg  3777 non-null   float64
 12  Broker_prob_H                3777 non-null   float64
 13  Broker_prob_D     

Merge the two divisions together

In [None]:
data2016 = pd.concat([df2016, df20161,], ignore_index=True)
data2017 = pd.concat([df2017, df20171,], ignore_index=True)
data2018 = pd.concat([df2018, df20181,], ignore_index=True)
data2019 = pd.concat([df2019, df20191,], ignore_index=True)
data2020 = pd.concat([df2020, df20201,], ignore_index=True)
data2021 = pd.concat([df2021, df20211,], ignore_index=True)
data2022 = pd.concat([df2022, df20221,], ignore_index=True)

In [None]:
columns_test = ['Date', 'HomeTeam', 'AwayTeam',"B365H", "B365D", "B365A" ]

In [None]:
df2023 = df2023[columns_test]

In [None]:
df2023.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 380 entries, 0 to 379
Data columns (total 6 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Date      380 non-null    object 
 1   HomeTeam  380 non-null    object 
 2   AwayTeam  380 non-null    object 
 3   B365H     380 non-null    float64
 4   B365D     380 non-null    float64
 5   B365A     380 non-null    float64
dtypes: float64(3), object(3)
memory usage: 17.9+ KB


In [None]:
df2023.head()

Unnamed: 0,Date,HomeTeam,AwayTeam,B365H,B365D,B365A
0,05/08/2022,Crystal Palace,Arsenal,4.2,3.6,1.85
1,06/08/2022,Fulham,Liverpool,11.0,6.0,1.25
2,06/08/2022,Bournemouth,Aston Villa,3.75,3.5,2.0
3,06/08/2022,Leeds,Wolves,2.25,3.4,3.2
4,06/08/2022,Newcastle,Nott'm Forest,1.66,3.8,5.25


In [None]:
show_rows_with_missing_values(df2023)

Unnamed: 0,Date,HomeTeam,AwayTeam,B365H,B365D,B365A


In [None]:


def calculate_and_apply_overall_averages(season_dfs, new_season_df):
    # Initialize dictionaries for each metric
    metrics = {
        'HomeTeam_WinRate': 'HomeTeam', 'AwayTeam_WinRate': 'AwayTeam',
        'HomeTeam_GoalsAvg': 'HomeTeam', 'AwayTeam_GoalsAvg': 'AwayTeam',
        'HomeTeam_goals_conceded_avg': 'HomeTeam', 'AwayTeam_goals_conceded_avg': 'AwayTeam',
        'H_goal_ratio': 'HomeTeam', 'A_goal_ratio': 'AwayTeam',
        'attack_strength_home_team': 'HomeTeam', 'attack_strength_away_team': 'AwayTeam'
    }
    averages_dict = {metric: {} for metric in metrics}

    # Collect one value per team and metric. Each season overwrites the entry
    # written by earlier seasons, so a team ends up with its mean for the most
    # recent season in which it appears.
    for df in season_dfs:
        for metric, team_col in metrics.items():
            for team in df[team_col].unique():
                averages_dict[metric][team] = df[df[team_col] == team][metric].mean()

    # Map the collected values onto the new season (unseen teams stay NaN)
    for metric, team_col in metrics.items():
        new_season_df[metric] = new_season_df[team_col].map(averages_dict[metric])

    return new_season_df

# List of DataFrames from 2016 to 2022
season_dfs = [data2016, data2017, data2018, data2019, data2020, data2021, data2022]
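
The loop in `calculate_and_apply_overall_averages` stores, for each team, the value from the last season in the list that contains it. If an average across seasons is wanted instead, one possible sketch on toy data is shown below. The frames `s1`, `s2` and `new_season` are hypothetical, and this is an alternative, not the notebook's approach.

```python
import pandas as pd

# Two toy "seasons" carrying one per-team metric
s1 = pd.DataFrame({"HomeTeam": ["A", "A", "B"],
                   "HomeTeam_WinRate": [0.40, 0.40, 0.20]})
s2 = pd.DataFrame({"HomeTeam": ["A", "C"],
                   "HomeTeam_WinRate": [0.60, 0.30]})

# One value per team per season, then the mean over the seasons
per_season = [df.groupby("HomeTeam")["HomeTeam_WinRate"].mean() for df in (s1, s2)]
across_seasons = pd.concat(per_season).groupby(level=0).mean()

new_season = pd.DataFrame({"HomeTeam": ["A", "B", "C"]})
new_season["HomeTeam_WinRate"] = new_season["HomeTeam"].map(across_seasons)
print(new_season)
```

Averaging per season first gives every season the same weight, regardless of how many matches a team played in it.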


In [None]:
# Apply the overall averages to df2023
df2023 = calculate_and_apply_overall_averages(season_dfs, df2023)

In [None]:
df2023.head()

Unnamed: 0,Date,HomeTeam,AwayTeam,B365H,B365D,B365A,HomeTeam_WinRate,AwayTeam_WinRate,HomeTeam_GoalsAvg,AwayTeam_GoalsAvg,HomeTeam_goals_conceded_avg,AwayTeam_goals_conceded_avg,H_goal_ratio,A_goal_ratio,attack_strength_home_team,attack_strength_away_team
0,05/08/2022,Crystal Palace,Arsenal,4.2,3.6,1.85,0.37,0.5,1.42,1.44,0.89,1.44,0.33,0.32,17.9,20.11
1,06/08/2022,Fulham,Liverpool,11.0,6.0,1.25,0.61,0.68,2.43,2.37,0.87,0.89,0.36,0.38,39.92,34.8
2,06/08/2022,Bournemouth,Aston Villa,3.75,3.5,2.0,0.52,0.33,1.71,1.11,0.86,1.39,0.35,0.3,25.66,15.47
3,06/08/2022,Leeds,Wolves,2.25,3.4,3.2,0.18,0.39,0.94,0.83,2.0,0.89,0.24,0.25,10.61,11.6
4,06/08/2022,Newcastle,Nott'm Forest,1.66,3.8,5.25,0.42,0.43,1.37,1.3,1.42,0.78,0.34,0.34,17.24,27.46


In [None]:
df2023.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 380 entries, 0 to 379
Data columns (total 16 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   Date                         380 non-null    object 
 1   HomeTeam                     380 non-null    object 
 2   AwayTeam                     380 non-null    object 
 3   B365H                        380 non-null    float64
 4   B365D                        380 non-null    float64
 5   B365A                        380 non-null    float64
 6   HomeTeam_WinRate             380 non-null    float64
 7   AwayTeam_WinRate             380 non-null    float64
 8   HomeTeam_GoalsAvg            380 non-null    float64
 9   AwayTeam_GoalsAvg            380 non-null    float64
 10  HomeTeam_goals_conceded_avg  380 non-null    float64
 11  AwayTeam_goals_conceded_avg  380 non-null    float64
 12  H_goal_ratio                 380 non-null    float64
 13  A_goal_ratio        

In [None]:
def calculate_head_to_head_stats(merged_df):
    # Initialize a dictionary to track head-to-head stats
    head_to_head_stats = {}

    # Update head-to-head stats using merged_df
    for index, row in merged_df.iterrows():
        teams = tuple(sorted([row['HomeTeam'], row['AwayTeam']]))
        if teams not in head_to_head_stats:
            head_to_head_stats[teams] = {'wins': {teams[0]: 0, teams[1]: 0},
                                         'draws': 0,
                                         'total_matches': 0}

        head_to_head_stats[teams]['total_matches'] += 1
        if row['FTR'] == 'H':
            head_to_head_stats[teams]['wins'][row['HomeTeam']] += 1
        elif row['FTR'] == 'D':
            head_to_head_stats[teams]['draws'] += 1
        elif row['FTR'] == 'A':
            head_to_head_stats[teams]['wins'][row['AwayTeam']] += 1

    return head_to_head_stats

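def adjusted_win_loss_ratio(wins, draws, losses, total_matches):
    # Points-style score per head-to-head match: +3 per win, +1 per draw, -1 per loss.
    # Returns 0 when the two teams have no recorded meetings.
    ratio = ((3*wins + draws) - losses) / total_matches if total_matches > 0 else 0
    return round(ratio, 1)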

def apply_adjusted_win_loss_ratio_to_2023(df2023, head_to_head_stats):
    def calculate_ratio_for_match(row):
        teams = tuple(sorted([row['HomeTeam'], row['AwayTeam']]))
        stats = head_to_head_stats.get(teams, {'wins': {row['HomeTeam']: 0, row['AwayTeam']: 0}, 'draws': 0, 'total_matches': 0})
        home_wins = stats['wins'].get(row['HomeTeam'], 0)
        away_wins = stats['wins'].get(row['AwayTeam'], 0)
        draws = stats['draws']
        total_matches = stats['total_matches']
        home_ratio = adjusted_win_loss_ratio(home_wins, draws, total_matches - home_wins - draws, total_matches)
        away_ratio = adjusted_win_loss_ratio(away_wins, draws, total_matches - away_wins - draws, total_matches)
        return pd.Series([home_ratio, away_ratio])

    df2023[['adjusted_win_lost_ratio_H', 'adjusted_win_lost_ratio_A']] = df2023.apply(calculate_ratio_for_match, axis=1)
    return df2023

# Assuming merged_df is the DataFrame that contains data from 2015 to 2022
merged_df = eng0.copy()

# Calculate head-to-head stats using merged data
head_to_head_stats = calculate_head_to_head_stats(merged_df)

# Apply the adjusted win-loss ratio to df2023
df2023 = apply_adjusted_win_loss_ratio_to_2023(df2023, head_to_head_stats)
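
To make the head-to-head score concrete, the sketch below restates the formula from `adjusted_win_loss_ratio` in plain Python and applies it to a made-up record. The helper name `adjusted_ratio` and the five-match tally are illustrative assumptions, not taken from the dataset.

```python
def adjusted_ratio(wins, draws, losses):
    """(3*wins + draws - losses) / matches, rounded to 1 decimal; 0 with no history."""
    total = wins + draws + losses
    if total == 0:
        return 0
    return round((3 * wins + draws - losses) / total, 1)

# Hypothetical record over 5 meetings, seen from each side:
# the home team won 1, drew 1 and lost 3, so the away team won 3, drew 1 and lost 1.
home_ratio = adjusted_ratio(wins=1, draws=1, losses=3)   # (3 + 1 - 3) / 5
away_ratio = adjusted_ratio(wins=3, draws=1, losses=1)   # (9 + 1 - 1) / 5

print(home_ratio, away_ratio)
```

Before rounding, the two scores add up to 2 whenever the teams have met at least once, because each side's wins are the other side's losses. The two columns therefore carry largely the same information for pairs with history, which may be worth keeping in mind when reading feature importances.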


In [None]:
df2023.head()

Unnamed: 0,Date,HomeTeam,AwayTeam,B365H,B365D,B365A,HomeTeam_WinRate,AwayTeam_WinRate,HomeTeam_GoalsAvg,AwayTeam_GoalsAvg,HomeTeam_goals_conceded_avg,AwayTeam_goals_conceded_avg,H_goal_ratio,A_goal_ratio,attack_strength_home_team,attack_strength_away_team,adjusted_win_lost_ratio_H,adjusted_win_lost_ratio_A
0,05/08/2022,Crystal Palace,Arsenal,4.2,3.6,1.85,0.37,0.5,1.42,1.44,0.89,1.44,0.33,0.32,17.9,20.11,0.8,1.2
1,06/08/2022,Fulham,Liverpool,11.0,6.0,1.25,0.61,0.68,2.43,2.37,0.87,0.89,0.36,0.38,39.92,34.8,-0.3,2.3
2,06/08/2022,Bournemouth,Aston Villa,3.75,3.5,2.0,0.52,0.33,1.71,1.11,0.86,1.39,0.35,0.3,25.66,15.47,2.0,0.0
3,06/08/2022,Leeds,Wolves,2.25,3.4,3.2,0.18,0.39,0.94,0.83,2.0,0.89,0.24,0.25,10.61,11.6,0.5,1.5
4,06/08/2022,Newcastle,Nott'm Forest,1.66,3.8,5.25,0.42,0.43,1.37,1.3,1.42,0.78,0.34,0.34,17.24,27.46,0.0,0.0


In [None]:
df2023.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 380 entries, 0 to 379
Data columns (total 18 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   Date                         380 non-null    object 
 1   HomeTeam                     380 non-null    object 
 2   AwayTeam                     380 non-null    object 
 3   B365H                        380 non-null    float64
 4   B365D                        380 non-null    float64
 5   B365A                        380 non-null    float64
 6   HomeTeam_WinRate             380 non-null    float64
 7   AwayTeam_WinRate             380 non-null    float64
 8   HomeTeam_GoalsAvg            380 non-null    float64
 9   AwayTeam_GoalsAvg            380 non-null    float64
 10  HomeTeam_goals_conceded_avg  380 non-null    float64
 11  AwayTeam_goals_conceded_avg  380 non-null    float64
 12  H_goal_ratio                 380 non-null    float64
 13  A_goal_ratio        

In [None]:
df2023 = process_time_data(df2023, 2023)



In [None]:
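def add_probability_B365(df):
    # Convert Bet365 decimal odds into implied probabilities (1 / odds).
    # The three values can sum to slightly more than 1 because of the bookmaker's margin.
    df['Broker_prob_H'] = round(1 / df['B365H'], 2)
    df['Broker_prob_D'] = round(1 / df['B365D'], 2)
    df['Broker_prob_A'] = round(1 / df['B365A'], 2)
    return df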

In [None]:
df2023 = add_probability_B365(df2023)
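
The inverse of a decimal odd is only an implied probability: the three values for a match usually add up to more than 1 because the bookmaker keeps a margin. The sketch below shows the raw conversion used above next to a simple proportional normalization, using the odds of the first test row (4.2, 3.6, 1.85). The function `implied_probabilities` is a hypothetical helper for illustration; the notebook itself keeps the raw values.

```python
def implied_probabilities(odds_home, odds_draw, odds_away):
    """Return raw implied probabilities, margin-free probabilities, and the overround."""
    raw = [1 / odds_home, 1 / odds_draw, 1 / odds_away]
    overround = sum(raw)                      # above 1 when the bookmaker keeps a margin
    fair = [p / overround for p in raw]       # rescale so the three values sum to 1
    return raw, fair, overround

# Odds taken from the Crystal Palace v Arsenal row of the test set.
raw, fair, overround = implied_probabilities(4.2, 3.6, 1.85)

print([round(p, 2) for p in raw])    # same values as the Broker_prob_* columns
print([round(p, 2) for p in fair])
print(round(overround, 3))
```

Whether normalizing helps the models is an open question here; tree-based models are insensitive to this kind of monotone rescaling within a single column, but the rescaling differs per match, so it is not a pure no-op.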

In [None]:


def fill_missing_with_mean(df):
    """
    Fill missing values in each column of the DataFrame with the mean of that column.

    Parameters:
    df (pd.DataFrame): The dataset with missing values.

    Returns:
    pd.DataFrame: The DataFrame with missing values filled.
    """
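    for column in df.columns:
        if df[column].dtype in ['float64', 'int64']:
            mean_value = round(df[column].mean(), 2)
            # Assign back instead of calling fillna(inplace=True) on a column selection,
            # which newer pandas versions warn about and may not propagate to df.
            df[column] = df[column].fillna(mean_value)
    return df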


In [None]:
#df2023 = fill_missing_with_mean(df2023)

In [None]:
df2023.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 380 entries, 0 to 379
Data columns (total 21 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   HomeTeam                     380 non-null    object 
 1   AwayTeam                     380 non-null    object 
 2   B365H                        380 non-null    float64
 3   B365D                        380 non-null    float64
 4   B365A                        380 non-null    float64
 5   HomeTeam_WinRate             380 non-null    float64
 6   AwayTeam_WinRate             380 non-null    float64
 7   HomeTeam_GoalsAvg            380 non-null    float64
 8   AwayTeam_GoalsAvg            380 non-null    float64
 9   HomeTeam_goals_conceded_avg  380 non-null    float64
 10  AwayTeam_goals_conceded_avg  380 non-null    float64
 11  H_goal_ratio                 380 non-null    float64
 12  A_goal_ratio                 380 non-null    float64
 13  attack_strength_home

In [None]:
df2023.head()

Unnamed: 0,HomeTeam,AwayTeam,B365H,B365D,B365A,HomeTeam_WinRate,AwayTeam_WinRate,HomeTeam_GoalsAvg,AwayTeam_GoalsAvg,HomeTeam_goals_conceded_avg,...,H_goal_ratio,A_goal_ratio,attack_strength_home_team,attack_strength_away_team,adjusted_win_lost_ratio_H,adjusted_win_lost_ratio_A,Year,Broker_prob_H,Broker_prob_D,Broker_prob_A
0,Crystal Palace,Arsenal,4.2,3.6,1.85,0.37,0.5,1.42,1.44,0.89,...,0.33,0.32,17.9,20.11,0.8,1.2,2023,0.24,0.28,0.54
1,Fulham,Liverpool,11.0,6.0,1.25,0.61,0.68,2.43,2.37,0.87,...,0.36,0.38,39.92,34.8,-0.3,2.3,2023,0.09,0.17,0.8
2,Bournemouth,Aston Villa,3.75,3.5,2.0,0.52,0.33,1.71,1.11,0.86,...,0.35,0.3,25.66,15.47,2.0,0.0,2023,0.27,0.29,0.5
3,Leeds,Wolves,2.25,3.4,3.2,0.18,0.39,0.94,0.83,2.0,...,0.24,0.25,10.61,11.6,0.5,1.5,2023,0.44,0.29,0.31
4,Newcastle,Nott'm Forest,1.66,3.8,5.25,0.42,0.43,1.37,1.3,1.42,...,0.34,0.34,17.24,27.46,0.0,0.0,2023,0.6,0.26,0.19


Here we prepare the training, validation, and test sets for the classification problem.

In [None]:
train = eng0[eng0['Year'] < 2022]
validation = eng0[eng0['Year'] == 2022]

In [None]:
X_train = train.drop(['FTR', 'total_goal'], axis=1)
y_train = train['FTR']
X_validation = validation.drop(['FTR', 'total_goal'], axis=1)
y_validation = validation['FTR']


In [None]:
X_test = df2023.copy()
X_test = X_test[X_train.columns]

In [None]:
X_train.shape , y_train.shape, X_validation.shape, y_validation.shape, X_test.shape

((2224, 21), (2224,), (372, 21), (372,), (380, 21))

In [None]:
# Encode the match result as integers: draw = 0, home win = 1, away win = 2
y_train_enc = y_train.map({'H': 1, 'D': 0, 'A': 2})
y_validation_enc = y_validation.map({'H': 1, 'D': 0, 'A': 2})

In [None]:

from sklearn.preprocessing import LabelEncoder
# Update the set of all teams to include teams from X_test
all_teams = set(X_train['HomeTeam'].unique()).union(set(X_train['AwayTeam'].unique()))
all_teams = all_teams.union(set(X_validation['HomeTeam'].unique())).union(set(X_validation['AwayTeam'].unique()))
all_teams = all_teams.union(set(X_test['HomeTeam'].unique())).union(set(X_test['AwayTeam'].unique()))

# Convert the set to a list
all_teams_list = list(all_teams)

# Fit the LabelEncoder with the updated list of all teams
encoder = LabelEncoder()
encoder.fit(all_teams_list)

# Transform 'HomeTeam' and 'AwayTeam' in all datasets
X_train['HomeTeam'] = encoder.transform(X_train['HomeTeam'])
X_train['AwayTeam'] = encoder.transform(X_train['AwayTeam'])
X_validation['HomeTeam'] = encoder.transform(X_validation['HomeTeam'])
X_validation['AwayTeam'] = encoder.transform(X_validation['AwayTeam'])
X_test['HomeTeam'] = encoder.transform(X_test['HomeTeam'])
X_test['AwayTeam'] = encoder.transform(X_test['AwayTeam'])


In [None]:
!pip install lazypredict

Collecting lazypredict
  Downloading lazypredict-0.2.12-py2.py3-none-any.whl (12 kB)
Installing collected packages: lazypredict
Successfully installed lazypredict-0.2.12


In [None]:
from lazypredict.Supervised import LazyClassifier
clf = LazyClassifier(verbose = 0, ignore_warnings = False, custom_metric = None)
models,pred = clf.fit(X_train, X_validation, y_train_enc, y_validation_enc)

  7%|▋         | 2/29 [00:00<00:04,  5.68it/s]

ROC AUC couldn't be calculated for AdaBoostClassifier
multi_class must be in ('ovo', 'ovr')
ROC AUC couldn't be calculated for BaggingClassifier
multi_class must be in ('ovo', 'ovr')
ROC AUC couldn't be calculated for BernoulliNB
multi_class must be in ('ovo', 'ovr')


 14%|█▍        | 4/29 [00:02<00:15,  1.64it/s]

ROC AUC couldn't be calculated for CalibratedClassifierCV
multi_class must be in ('ovo', 'ovr')
CategoricalNB model failed to execute
Negative values in data passed to CategoricalNB (input X)
ROC AUC couldn't be calculated for DecisionTreeClassifier
multi_class must be in ('ovo', 'ovr')
ROC AUC couldn't be calculated for DummyClassifier
multi_class must be in ('ovo', 'ovr')
ROC AUC couldn't be calculated for ExtraTreeClassifier
multi_class must be in ('ovo', 'ovr')


 31%|███       | 9/29 [00:02<00:04,  4.15it/s]

ROC AUC couldn't be calculated for ExtraTreesClassifier
multi_class must be in ('ovo', 'ovr')
ROC AUC couldn't be calculated for GaussianNB
multi_class must be in ('ovo', 'ovr')
ROC AUC couldn't be calculated for KNeighborsClassifier
multi_class must be in ('ovo', 'ovr')


 41%|████▏     | 12/29 [00:02<00:03,  5.19it/s]

ROC AUC couldn't be calculated for LabelPropagation
multi_class must be in ('ovo', 'ovr')


 45%|████▍     | 13/29 [00:03<00:03,  4.68it/s]

ROC AUC couldn't be calculated for LabelSpreading
multi_class must be in ('ovo', 'ovr')
ROC AUC couldn't be calculated for LinearDiscriminantAnalysis
multi_class must be in ('ovo', 'ovr')


 52%|█████▏    | 15/29 [00:03<00:03,  4.29it/s]

ROC AUC couldn't be calculated for LinearSVC
multi_class must be in ('ovo', 'ovr')
ROC AUC couldn't be calculated for LogisticRegression
multi_class must be in ('ovo', 'ovr')
ROC AUC couldn't be calculated for NearestCentroid
multi_class must be in ('ovo', 'ovr')


 62%|██████▏   | 18/29 [00:04<00:02,  5.22it/s]

ROC AUC couldn't be calculated for NuSVC
multi_class must be in ('ovo', 'ovr')
ROC AUC couldn't be calculated for PassiveAggressiveClassifier
multi_class must be in ('ovo', 'ovr')
ROC AUC couldn't be calculated for Perceptron
multi_class must be in ('ovo', 'ovr')
ROC AUC couldn't be calculated for QuadraticDiscriminantAnalysis
multi_class must be in ('ovo', 'ovr')


 86%|████████▌ | 25/29 [00:04<00:00,  7.44it/s]

ROC AUC couldn't be calculated for RandomForestClassifier
multi_class must be in ('ovo', 'ovr')
ROC AUC couldn't be calculated for RidgeClassifier
multi_class must be in ('ovo', 'ovr')
ROC AUC couldn't be calculated for RidgeClassifierCV
multi_class must be in ('ovo', 'ovr')
ROC AUC couldn't be calculated for SGDClassifier
multi_class must be in ('ovo', 'ovr')


 93%|█████████▎| 27/29 [00:05<00:00,  7.78it/s]

ROC AUC couldn't be calculated for SVC
multi_class must be in ('ovo', 'ovr')
StackingClassifier model failed to execute
StackingClassifier.__init__() missing 1 required positional argument: 'estimators'


 97%|█████████▋| 28/29 [00:05<00:00,  6.25it/s]

ROC AUC couldn't be calculated for XGBClassifier
multi_class must be in ('ovo', 'ovr')
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000315 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 1037
[LightGBM] [Info] Number of data points in the train set: 2224, number of used features: 21
[LightGBM] [Info] Start training from score -1.439862
[LightGBM] [Info] Start training from score -0.811380
[LightGBM] [Info] Start training from score -1.143207


100%|██████████| 29/29 [00:05<00:00,  5.08it/s]

ROC AUC couldn't be calculated for LGBMClassifier
multi_class must be in ('ovo', 'ovr')





In [None]:
models

Model,Accuracy,Balanced Accuracy,ROC AUC,F1 Score,Time Taken
LGBMClassifier,0.72,0.68,,0.71,0.31
XGBClassifier,0.72,0.68,,0.71,0.35
ExtraTreesClassifier,0.72,0.66,,0.7,0.32
RandomForestClassifier,0.72,0.66,,0.7,0.49
CalibratedClassifierCV,0.73,0.66,,0.69,1.71
QuadraticDiscriminantAnalysis,0.65,0.65,,0.67,0.04
LinearDiscriminantAnalysis,0.69,0.65,,0.69,0.06
SGDClassifier,0.73,0.65,,0.68,0.09
LogisticRegression,0.69,0.65,,0.69,0.08
LinearSVC,0.72,0.65,,0.69,0.49
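
The ROC AUC column is empty because, as the log messages say, scikit-learn's `roc_auc_score` needs `multi_class` set to `'ovr'` or `'ovo'` when the target has more than two classes, and it expects class probabilities rather than predicted labels. A minimal sketch of computing it separately is shown below. It runs on synthetic data, and `make_classification` and `LogisticRegression` are stand-ins chosen for illustration, not the notebook's data or models.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic 3-class data standing in for the draw / home win / away win target (0/1/2).
X, y = make_classification(n_samples=600, n_features=8, n_informative=5,
                           n_classes=3, random_state=42)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.25,
                                            stratify=y, random_state=42)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# Multiclass ROC AUC needs per-class probabilities of shape (n_samples, n_classes).
proba = model.predict_proba(X_val)
auc_ovr = roc_auc_score(y_val, proba, multi_class='ovr', average='macro')

print(f'One-vs-rest macro ROC AUC: {auc_ovr:.3f}')
```

The same call should work with any fitted classifier that exposes `predict_proba`, for example the tree ensembles compared in this notebook.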


In [None]:
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier, ExtraTreesClassifier
from sklearn.metrics import accuracy_score, f1_score
from sklearn.tree import DecisionTreeClassifier
from xgboost import XGBClassifier

# Initialize the models
bagging_clf = BaggingClassifier(n_estimators=100, random_state=42)
extra_trees_clf = ExtraTreesClassifier(n_estimators=100, random_state=42)
random_forest_clf = RandomForestClassifier(n_estimators=100, random_state=42)
decision_tree_clf = DecisionTreeClassifier(random_state=42)
xgb_clf = XGBClassifier(random_state=42)  # Default parameters, adjust as necessary

# Fit the models
bagging_clf.fit(X_train, y_train_enc)
extra_trees_clf.fit(X_train, y_train_enc)
random_forest_clf.fit(X_train, y_train_enc)
decision_tree_clf.fit(X_train, y_train_enc)
xgb_clf.fit(X_train, y_train_enc)

# Predict y_validation
y_pred_bagging = bagging_clf.predict(X_validation)
y_pred_extra_trees = extra_trees_clf.predict(X_validation)
y_pred_random_forest = random_forest_clf.predict(X_validation)
y_pred_decision_tree = decision_tree_clf.predict(X_validation)
y_pred_xgb = xgb_clf.predict(X_validation)

# Calculate accuracy and F1 score for each model
accuracy_bagging = accuracy_score(y_validation_enc, y_pred_bagging)
f1_bagging = f1_score(y_validation_enc, y_pred_bagging, average='macro')

accuracy_extra_trees = accuracy_score(y_validation_enc, y_pred_extra_trees)
f1_extra_trees = f1_score(y_validation_enc, y_pred_extra_trees, average='macro')

accuracy_random_forest = accuracy_score(y_validation_enc, y_pred_random_forest)
f1_random_forest = f1_score(y_validation_enc, y_pred_random_forest, average='macro')

accuracy_decision_tree = accuracy_score(y_validation_enc, y_pred_decision_tree)
f1_decision_tree = f1_score(y_validation_enc, y_pred_decision_tree, average='macro')

accuracy_xgb = accuracy_score(y_validation_enc, y_pred_xgb)
f1_xgb = f1_score(y_validation_enc, y_pred_xgb, average='macro')

# Print out the performance
print(f'Bagging Classifier - Accuracy: {accuracy_bagging}, F1 Score: {f1_bagging}')
print(f'Extra Trees Classifier - Accuracy: {accuracy_extra_trees}, F1 Score: {f1_extra_trees}')
print(f'Random Forest Classifier - Accuracy: {accuracy_random_forest}, F1 Score: {f1_random_forest}')
print(f'Decision Tree Classifier - Accuracy: {accuracy_decision_tree}, F1 Score: {f1_decision_tree}')
print(f'XGB Classifier - Accuracy: {accuracy_xgb}, F1 Score: {f1_xgb}')


Bagging Classifier - Accuracy: 0.7150537634408602, F1 Score: 0.6757859756277478
Extra Trees Classifier - Accuracy: 0.7150537634408602, F1 Score: 0.6689921123241107
Random Forest Classifier - Accuracy: 0.7096774193548387, F1 Score: 0.6583018680875515
Decision Tree Classifier - Accuracy: 0.6397849462365591, F1 Score: 0.6055871766644024
XGB Classifier - Accuracy: 0.7150537634408602, F1 Score: 0.6807346579397772
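
The F1 scores above use `average='macro'`, which takes the plain mean of the per-class F1 values, so a class that is predicted poorly pulls the score down regardless of how often it occurs. The sketch below computes macro F1 by hand on made-up labels to show how it can sit below accuracy. The function `macro_f1` and the label lists are illustrative assumptions, not the notebook's validation data.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores."""
    scores = []
    for cls in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
        fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
        fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

# Made-up labels in the notebook's encoding: 0 = draw, 1 = home win, 2 = away win.
y_true = [1, 1, 1, 1, 2, 2, 2, 0, 0, 0]
y_pred = [1, 1, 1, 1, 2, 2, 1, 1, 2, 0]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
score = macro_f1(y_true, y_pred)

print(accuracy, round(score, 3))
```

In this toy case the draws are mostly missed, so their low F1 drags the macro average below the overall accuracy.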


In [None]:
from sklearn.model_selection import GridSearchCV

param_grid_xgb = {
    'n_estimators': [50, 100, 200],
    'max_depth': [3, 6, 10],
    'learning_rate': [0.01, 0.1, 0.2],
    'subsample': [0.5, 0.7, 1],
    'colsample_bytree': [0.5, 0.7, 1]
}

param_grid_random_forest = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

param_grid_extra_trees = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}



grid_search_xgb = GridSearchCV(xgb_clf, param_grid_xgb, cv=5, scoring='f1_macro', verbose=1, n_jobs=-1)
grid_search_random_forest = GridSearchCV(random_forest_clf, param_grid_random_forest, cv=5, scoring='f1_macro', verbose=1, n_jobs=-1)
grid_search_extra_trees = GridSearchCV(extra_trees_clf, param_grid_extra_trees, cv=5, scoring='f1_macro', verbose=1, n_jobs=-1)

grid_search_xgb.fit(X_train, y_train_enc)
grid_search_random_forest.fit(X_train, y_train_enc)
grid_search_extra_trees.fit(X_train, y_train_enc)

best_params_xgb = grid_search_xgb.best_params_
best_score_xgb = grid_search_xgb.best_score_

best_params_random_forest = grid_search_random_forest.best_params_
best_score_random_forest = grid_search_random_forest.best_score_

best_params_extra_trees = grid_search_extra_trees.best_params_
best_score_extra_trees = grid_search_extra_trees.best_score_


y_pred_xgb = grid_search_xgb.best_estimator_.predict(X_validation)
f1_score_xgb = f1_score(y_validation_enc, y_pred_xgb, average='macro')

y_pred_random_forest = grid_search_random_forest.best_estimator_.predict(X_validation)
f1_score_random_forest = f1_score(y_validation_enc, y_pred_random_forest, average='macro')

y_pred_extra_trees = grid_search_extra_trees.best_estimator_.predict(X_validation)
f1_score_extra_trees = f1_score(y_validation_enc, y_pred_extra_trees, average='macro')


results = {
    "XGB Classifier": {
        "Best Parameters": best_params_xgb,
        "Best Score": best_score_xgb,
        "F1 Score on Validation": f1_score_xgb
    },
    "Random Forest Classifier": {
        "Best Parameters": best_params_random_forest,
        "Best Score": best_score_random_forest,
        "F1 Score on Validation": f1_score_random_forest
    },
    "Extra Trees Classifier": {
        "Best Parameters": best_params_extra_trees,
        "Best Score": best_score_extra_trees,
        "F1 Score on Validation": f1_score_extra_trees
    }
}





Fitting 5 folds for each of 243 candidates, totalling 1215 fits
Fitting 5 folds for each of 108 candidates, totalling 540 fits
Fitting 5 folds for each of 108 candidates, totalling 540 fits


In [None]:
results

{'XGB Classifier': {'Best Parameters': {'colsample_bytree': 1,
   'learning_rate': 0.01,
   'max_depth': 10,
   'n_estimators': 200,
   'subsample': 0.5},
  'Best Score': 0.6878814473501809,
  'F1 Score on Validation': 0.6757704510981255},
 'Random Forest Classifier': {'Best Parameters': {'max_depth': 10,
   'min_samples_leaf': 2,
   'min_samples_split': 2,
   'n_estimators': 50},
  'Best Score': 0.6756194834376436,
  'F1 Score on Validation': 0.6657406685911479},
 'Extra Trees Classifier': {'Best Parameters': {'max_depth': None,
   'min_samples_leaf': 4,
   'min_samples_split': 10,
   'n_estimators': 200},
  'Best Score': 0.6791089005560773,
  'F1 Score on Validation': 0.6403738000865223}}

In [None]:
optimal_xgb_clf = XGBClassifier(
    colsample_bytree=1,
    learning_rate=0.01,
    max_depth=10,
    n_estimators=200,
    subsample=0.5,
    random_state=42  # Optional for reproducibility
)

In [None]:
optimal_xgb_clf.fit(X_train, y_train_enc)

In [None]:
y_pred_test_xgb = optimal_xgb_clf.predict(X_test)

In [None]:
# Define the inverse mapping
inverse_mapping = {1: 'H', 0: 'D', 2: 'A'}

# Convert y_pred_test_xgb back to original form
y_pred_test_xgb_original = [inverse_mapping[label] for label in y_pred_test_xgb]


In [None]:
predictions_df = pd.DataFrame(y_pred_test_xgb_original, columns=['Predictions'])

In [None]:
predictions_df.to_csv('england_0.csv', index=False)

In [None]:
train = eng0[eng0['Year'] < 2022]
validation = eng0[eng0['Year'] == 2022]

In [None]:
X_train = train.drop(['FTR', 'total_goal'], axis=1)
y_train = train['total_goal']
X_validation = validation.drop(['FTR', 'total_goal'], axis=1)
y_validation = validation['total_goal']

In [None]:
X_test = df2023.copy()
X_test = X_test[X_train.columns]

In [None]:
X_train.shape , y_train.shape, X_validation.shape, y_validation.shape, X_test.shape

((2224, 21), (2224,), (372, 21), (372,), (380, 21))

In [None]:

from sklearn.preprocessing import LabelEncoder
# Update the set of all teams to include teams from X_test
all_teams = set(X_train['HomeTeam'].unique()).union(set(X_train['AwayTeam'].unique()))
all_teams = all_teams.union(set(X_validation['HomeTeam'].unique())).union(set(X_validation['AwayTeam'].unique()))
all_teams = all_teams.union(set(X_test['HomeTeam'].unique())).union(set(X_test['AwayTeam'].unique()))

# Convert the set to a list
all_teams_list = list(all_teams)

# Fit the LabelEncoder with the updated list of all teams
encoder = LabelEncoder()
encoder.fit(all_teams_list)

# Transform 'HomeTeam' and 'AwayTeam' in all datasets
X_train['HomeTeam'] = encoder.transform(X_train['HomeTeam'])
X_train['AwayTeam'] = encoder.transform(X_train['AwayTeam'])
X_validation['HomeTeam'] = encoder.transform(X_validation['HomeTeam'])
X_validation['AwayTeam'] = encoder.transform(X_validation['AwayTeam'])
X_test['HomeTeam'] = encoder.transform(X_test['HomeTeam'])
X_test['AwayTeam'] = encoder.transform(X_test['AwayTeam'])

In [None]:
from lazypredict.Supervised import LazyRegressor
# Create an instance of LazyRegressor
reg = LazyRegressor(verbose=0, ignore_warnings=False, custom_metric=None)

# Fit the model
models, predictions = reg.fit(X_train, X_validation, y_train, y_validation)

 21%|██▏       | 9/42 [00:04<00:22,  1.50it/s]

GammaRegressor model failed to execute
Some value(s) of y are out of the valid range of the loss 'HalfGammaLoss'.


 74%|███████▍  | 31/42 [00:10<00:03,  3.49it/s]

QuantileRegressor model failed to execute
Solver interior-point is not anymore available in SciPy >= 1.11.0.


100%|██████████| 42/42 [00:13<00:00,  3.05it/s]

[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000316 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 1037
[LightGBM] [Info] Number of data points in the train set: 2224, number of used features: 21
[LightGBM] [Info] Start training from score 2.721223





In [None]:
models

Model,Adjusted R-Squared,R-Squared,RMSE,Time Taken
GradientBoostingRegressor,0.07,0.13,1.52,0.59
RandomForestRegressor,0.07,0.13,1.52,1.79
LassoLarsIC,0.07,0.12,1.52,0.04
LassoLarsCV,0.07,0.12,1.52,0.1
LassoCV,0.07,0.12,1.52,0.22
ElasticNetCV,0.07,0.12,1.52,0.43
LarsCV,0.07,0.12,1.52,0.08
OrthogonalMatchingPursuitCV,0.07,0.12,1.52,0.02
BayesianRidge,0.07,0.12,1.52,0.07
NuSVR,0.07,0.12,1.52,0.28


In [None]:
from sklearn.linear_model import ElasticNetCV
from sklearn.svm import SVR
from sklearn.ensemble import RandomForestRegressor
from xgboost import XGBRegressor
from sklearn.linear_model import PoissonRegressor
from sklearn.metrics import mean_absolute_error, r2_score

# Initialize the models
elastic_net_cv = ElasticNetCV(cv=5, random_state=42)  # Adjust parameters as necessary
poisson_regressor = PoissonRegressor()
svr = SVR()  # Default parameters, adjust as necessary
random_forest_reg = RandomForestRegressor(n_estimators=100, random_state=42)
xgb_reg = XGBRegressor(random_state=42)  # Default parameters, adjust as necessary

# Fit the models
elastic_net_cv.fit(X_train, y_train)
poisson_regressor.fit(X_train, y_train)
svr.fit(X_train, y_train)
random_forest_reg.fit(X_train, y_train)
xgb_reg.fit(X_train, y_train)

# Predict y_validation
y_pred_elastic_net_cv = elastic_net_cv.predict(X_validation)
y_pred_poisson_regressor = poisson_regressor.predict(X_validation)
y_pred_svr = svr.predict(X_validation)
y_pred_random_forest = random_forest_reg.predict(X_validation)
y_pred_xgb = xgb_reg.predict(X_validation)

# Round the predictions to the nearest integer (total goals are whole numbers)
y_pred_elastic_net_cv_rounded = np.rint(y_pred_elastic_net_cv)
y_pred_poisson_regressor_rounded = np.rint(y_pred_poisson_regressor)
y_pred_svr_rounded = np.rint(y_pred_svr)
y_pred_random_forest_rounded = np.rint(y_pred_random_forest)
y_pred_xgb_rounded = np.rint(y_pred_xgb)

# Calculate MAE and R2 score using rounded predictions
mae_elastic_net_cv = mean_absolute_error(y_validation, y_pred_elastic_net_cv_rounded)
r2_elastic_net_cv = r2_score(y_validation, y_pred_elastic_net_cv_rounded)

mae_poisson_regressor = mean_absolute_error(y_validation, y_pred_poisson_regressor_rounded)
r2_poisson_regressor = r2_score(y_validation, y_pred_poisson_regressor_rounded)

mae_svr = mean_absolute_error(y_validation, y_pred_svr_rounded)
r2_svr = r2_score(y_validation, y_pred_svr_rounded)

mae_random_forest = mean_absolute_error(y_validation, y_pred_random_forest_rounded)
r2_random_forest = r2_score(y_validation, y_pred_random_forest_rounded)

mae_xgb = mean_absolute_error(y_validation, y_pred_xgb_rounded)
r2_xgb = r2_score(y_validation, y_pred_xgb_rounded)

# Print out the performance with rounded predictions
print(f'ElasticNetCV - MAE: {mae_elastic_net_cv}, R2 Score: {r2_elastic_net_cv}')
print(f'Poisson Regressor - MAE: {mae_poisson_regressor}, R2 Score: {r2_poisson_regressor}')
print(f'SVR - MAE: {mae_svr}, R2 Score: {r2_svr}')
print(f'Random Forest Regressor - MAE: {mae_random_forest}, R2 Score: {r2_random_forest}')
print(f'XGB Regressor - MAE: {mae_xgb}, R2 Score: {r2_xgb}')

ElasticNetCV - MAE: 1.2096774193548387, R2 Score: 0.09689977582153209
Poisson Regressor - MAE: 1.2553763440860215, R2 Score: 0.04299852370277213
SVR - MAE: 1.3172043010752688, R2 Score: -0.014970747443818766
Random Forest Regressor - MAE: 1.2634408602150538, R2 Score: 0.05418557602930718
XGB Regressor - MAE: 1.3091397849462365, R2 Score: -0.03836185685384663
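
The MAE and R2 values above are computed on rounded predictions, while the grid search later in this notebook reports MAE on the unrounded outputs, so the two sets of numbers are not directly comparable. The sketch below shows on made-up values how rounding can shift MAE. The helper `mae` and the sample numbers are illustrative assumptions, and Python's built-in `round` stands in for `np.rint`.

```python
def mae(y_true, y_pred):
    """Average absolute difference between targets and predictions."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

# Made-up total-goal targets and raw model outputs, for illustration only.
y_true = [2, 3, 1, 4]
y_raw = [2.6, 2.7, 2.4, 3.1]
y_rounded = [round(p) for p in y_raw]   # nearest integer

mae_raw = mae(y_true, y_raw)
mae_rounded = mae(y_true, y_rounded)

print(round(mae_raw, 2), round(mae_rounded, 2))
```

Rounding can move the error in either direction depending on the data, so it is safest to compare models with the same convention throughout.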


In [None]:
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import mean_absolute_error

# Parameter grids
param_grid_poisson = {
    'alpha': [0.01, 0.1, 1, 10],
    'max_iter': [100, 300, 500]
}

param_grid_random_forest = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

# GridSearchCV setup
grid_search_poisson = GridSearchCV(poisson_regressor, param_grid_poisson, cv=5, scoring='neg_mean_absolute_error', verbose=1, n_jobs=-1)
grid_search_random_forest = GridSearchCV(random_forest_reg, param_grid_random_forest, cv=5, scoring='neg_mean_absolute_error', verbose=1, n_jobs=-1)

# Fitting models
grid_search_poisson.fit(X_train, y_train)
grid_search_random_forest.fit(X_train, y_train)

# Best parameters and scores
best_params_poisson = grid_search_poisson.best_params_
best_score_poisson = grid_search_poisson.best_score_

best_params_random_forest = grid_search_random_forest.best_params_
best_score_random_forest = grid_search_random_forest.best_score_

# Predict and calculate MAE
y_pred_poisson = grid_search_poisson.best_estimator_.predict(X_validation)
mae_poisson = mean_absolute_error(y_validation, y_pred_poisson)

y_pred_random_forest = grid_search_random_forest.best_estimator_.predict(X_validation)
mae_random_forest = mean_absolute_error(y_validation, y_pred_random_forest)

# Results
results = {
    "Poisson Regressor": {
        "Best Parameters": best_params_poisson,
        "Best Score (Negative MAE)": best_score_poisson,
        "MAE on Validation": mae_poisson
    },
    "Random Forest Regressor": {
        "Best Parameters": best_params_random_forest,
        "Best Score (Negative MAE)": best_score_random_forest,
        "MAE on Validation": mae_random_forest
    }
}

# ElasticNetCV already uses cross-validation for parameter tuning, so we directly fit it and predict
elastic_net_cv = ElasticNetCV(cv=5, random_state=42).fit(X_train, y_train)
y_pred_elastic_net_cv = elastic_net_cv.predict(X_validation)
mae_elastic_net_cv = mean_absolute_error(y_validation, y_pred_elastic_net_cv)

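# Note: get_params() lists the constructor settings, not the tuned values;
# the alpha chosen by cross-validation is stored in elastic_net_cv.alpha_.
results["ElasticNetCV"] = {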
    "Best Parameters": elastic_net_cv.get_params(),
    "MAE on Validation": mae_elastic_net_cv
}

# Print results
for model, info in results.items():
    print(f"{model}:")
    for key, value in info.items():
        print(f"  {key}: {value}")
    print()

Fitting 5 folds for each of 12 candidates, totalling 60 fits
Fitting 5 folds for each of 108 candidates, totalling 540 fits
Poisson Regressor:
  Best Parameters: {'alpha': 0.01, 'max_iter': 500}
  Best Score (Negative MAE): -1.2172429878846054
  MAE on Validation: 1.239164861229366

Random Forest Regressor:
  Best Parameters: {'max_depth': 10, 'min_samples_leaf': 4, 'min_samples_split': 10, 'n_estimators': 200}
  Best Score (Negative MAE): -1.2437080342044213
  MAE on Validation: 1.2454685112476562

ElasticNetCV:
  Best Parameters: {'alphas': None, 'copy_X': True, 'cv': 5, 'eps': 0.001, 'fit_intercept': True, 'l1_ratio': 0.5, 'max_iter': 1000, 'n_alphas': 100, 'n_jobs': None, 'positive': False, 'precompute': 'auto', 'random_state': 42, 'selection': 'cyclic', 'tol': 0.0001, 'verbose': 0}
  MAE on Validation: 1.2421628590874898



In [None]:

# Set the best parameters for PoissonRegressor
best_params_poisson = {
    'alpha': 0.01,
    'max_iter': 500
}

# Initialize and fit the PoissonRegressor with the best parameters
poisson_model = PoissonRegressor(**best_params_poisson)
poisson_model.fit(X_train, y_train)

# Make predictions on the test set
y_pred_test_poisson = poisson_model.predict(X_test)

# Round the predictions to the nearest integer and convert to int type
y_pred_test_poisson_rounded = np.rint(y_pred_test_poisson).astype(int)

# y_pred_test_poisson_rounded contains the final integer predictions for X_test


In [None]:


# Convert the predictions to a DataFrame
predictions_df_poisson = pd.DataFrame(y_pred_test_poisson_rounded, columns=['Predicted_Total_Goals'])

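# Save the DataFrame to a CSV file
# Note: this reuses the name 'england_0.csv' from the classification step,
# so the match-result predictions saved earlier are overwritten.
predictions_df_poisson.to_csv('england_0.csv', index=False)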


Next, we apply the same feature engineering to the 2023 test set for Eng1 (division E1).

In [None]:
df2023_eng1.head()

Unnamed: 0,Div,Date,Time,HomeTeam,AwayTeam,Referee,B365H,B365D,B365A,BWH,...,AvgC<2.5,AHCh,B365CAHH,B365CAHA,PCAHH,PCAHA,MaxCAHH,MaxCAHA,AvgCAHH,AvgCAHA
0,E1,29/07/2022,20:00,Huddersfield,Burnley,J Linington,2.9,3.2,2.5,2.9,...,1.63,0.0,2.09,1.81,2.1,1.82,2.14,1.83,2.09,1.78
1,E1,30/07/2022,15:00,Blackburn,QPR,T Bramall,2.0,3.5,3.75,2.05,...,1.86,-0.5,1.99,1.91,2.01,1.9,2.01,1.95,1.97,1.88
2,E1,30/07/2022,15:00,Blackpool,Reading,D Webb,1.95,3.5,4.0,1.98,...,1.82,-0.5,2.08,1.82,2.06,1.86,2.11,1.86,2.04,1.82
3,E1,30/07/2022,15:00,Cardiff,Norwich,T Robinson,3.1,3.4,2.3,3.0,...,1.71,0.25,2.01,1.89,2.0,1.9,2.03,2.0,1.95,1.89
4,E1,30/07/2022,15:00,Hull,Bristol City,D Whitestone,2.37,3.25,3.1,2.35,...,1.73,-0.25,2.1,1.7,2.12,1.81,2.17,1.81,2.09,1.77


In [None]:
df2023_eng1 = df2023_eng1[columns_test]

In [None]:
df2023_eng1.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 552 entries, 0 to 551
Data columns (total 6 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Date      552 non-null    object 
 1   HomeTeam  552 non-null    object 
 2   AwayTeam  552 non-null    object 
 3   B365H     552 non-null    float64
 4   B365D     552 non-null    float64
 5   B365A     552 non-null    float64
dtypes: float64(3), object(3)
memory usage: 26.0+ KB


In [None]:
season_dfs_2 = [df20161, df20171, df20181, df20191, df20201, df20211, df20221]

In [None]:
# Apply the overall averages to df2023
df2023_eng1 = calculate_and_apply_overall_averages(season_dfs_2, df2023_eng1)

In [None]:
df2023_eng1.head()

Unnamed: 0,Date,HomeTeam,AwayTeam,B365H,B365D,B365A,HomeTeam_WinRate,AwayTeam_WinRate,HomeTeam_GoalsAvg,AwayTeam_GoalsAvg,HomeTeam_goals_conceded_avg,AwayTeam_goals_conceded_avg,H_goal_ratio,A_goal_ratio,attack_strength_home_team,attack_strength_away_team
0,29/07/2022,Huddersfield,Burnley,2.9,3.2,2.5,0.55,0.48,1.45,1.48,0.95,0.91,0.36,0.41,22.81,31.93
1,30/07/2022,Blackburn,QPR,2.0,3.5,3.75,0.52,0.39,1.57,1.3,1.13,1.48,0.31,0.37,25.66,27.46
2,30/07/2022,Blackpool,Reading,1.95,3.5,4.0,0.48,0.27,1.26,0.95,1.13,1.77,0.32,0.28,20.67,19.22
3,30/07/2022,Cardiff,Norwich,3.1,3.4,2.3,0.3,0.65,0.96,1.57,1.26,0.91,0.26,0.34,15.68,34.29
4,30/07/2022,Hull,Bristol City,2.37,3.25,3.1,0.3,0.32,0.96,1.23,1.22,2.05,0.32,0.36,15.68,24.72


In [None]:
df2023_eng1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 552 entries, 0 to 551
Data columns (total 16 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   Date                         552 non-null    object 
 1   HomeTeam                     552 non-null    object 
 2   AwayTeam                     552 non-null    object 
 3   B365H                        552 non-null    float64
 4   B365D                        552 non-null    float64
 5   B365A                        552 non-null    float64
 6   HomeTeam_WinRate             552 non-null    float64
 7   AwayTeam_WinRate             552 non-null    float64
 8   HomeTeam_GoalsAvg            552 non-null    float64
 9   AwayTeam_GoalsAvg            552 non-null    float64
 10  HomeTeam_goals_conceded_avg  552 non-null    float64
 11  AwayTeam_goals_conceded_avg  552 non-null    float64
 12  H_goal_ratio                 552 non-null    float64
 13  A_goal_ratio        

In [None]:
merged_df = eng1.copy()

# Calculate head-to-head stats using merged data
head_to_head_stats = calculate_head_to_head_stats(merged_df)

# Apply the adjusted win-loss ratio to df2023
df2023_eng1 = apply_adjusted_win_loss_ratio_to_2023(df2023_eng1, head_to_head_stats)

In [None]:
df2023_eng1.head()

Unnamed: 0,Date,HomeTeam,AwayTeam,B365H,B365D,B365A,HomeTeam_WinRate,AwayTeam_WinRate,HomeTeam_GoalsAvg,AwayTeam_GoalsAvg,HomeTeam_goals_conceded_avg,AwayTeam_goals_conceded_avg,H_goal_ratio,A_goal_ratio,attack_strength_home_team,attack_strength_away_team,adjusted_win_lost_ratio_H,adjusted_win_lost_ratio_A
0,29/07/2022,Huddersfield,Burnley,2.9,3.2,2.5,0.55,0.48,1.45,1.48,0.95,0.91,0.36,0.41,22.81,31.93,-1.0,3.0
1,30/07/2022,Blackburn,QPR,2.0,3.5,3.75,0.52,0.39,1.57,1.3,1.13,1.48,0.31,0.37,25.66,27.46,1.5,0.5
2,30/07/2022,Blackpool,Reading,1.95,3.5,4.0,0.48,0.27,1.26,0.95,1.13,1.77,0.32,0.28,20.67,19.22,3.0,-1.0
3,30/07/2022,Cardiff,Norwich,3.1,3.4,2.3,0.3,0.65,0.96,1.57,1.26,0.91,0.26,0.34,15.68,34.29,0.3,1.7
4,30/07/2022,Hull,Bristol City,2.37,3.25,3.1,0.3,0.32,0.96,1.23,1.22,2.05,0.32,0.36,15.68,24.72,0.2,1.8


In [None]:
df2023_eng1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 552 entries, 0 to 551
Data columns (total 18 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   Date                         552 non-null    object 
 1   HomeTeam                     552 non-null    object 
 2   AwayTeam                     552 non-null    object 
 3   B365H                        552 non-null    float64
 4   B365D                        552 non-null    float64
 5   B365A                        552 non-null    float64
 6   HomeTeam_WinRate             552 non-null    float64
 7   AwayTeam_WinRate             552 non-null    float64
 8   HomeTeam_GoalsAvg            552 non-null    float64
 9   AwayTeam_GoalsAvg            552 non-null    float64
 10  HomeTeam_goals_conceded_avg  552 non-null    float64
 11  AwayTeam_goals_conceded_avg  552 non-null    float64
 12  H_goal_ratio                 552 non-null    float64
 13  A_goal_ratio        

In [None]:
df2023_eng1 = process_time_data(df2023_eng1, 2023)

In [None]:
df2023_eng1 = add_probability_B365(df2023_eng1)

In [None]:
df2023_eng1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 552 entries, 0 to 551
Data columns (total 21 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   HomeTeam                     552 non-null    object 
 1   AwayTeam                     552 non-null    object 
 2   B365H                        552 non-null    float64
 3   B365D                        552 non-null    float64
 4   B365A                        552 non-null    float64
 5   HomeTeam_WinRate             552 non-null    float64
 6   AwayTeam_WinRate             552 non-null    float64
 7   HomeTeam_GoalsAvg            552 non-null    float64
 8   AwayTeam_GoalsAvg            552 non-null    float64
 9   HomeTeam_goals_conceded_avg  552 non-null    float64
 10  AwayTeam_goals_conceded_avg  552 non-null    float64
 11  H_goal_ratio                 552 non-null    float64
 12  A_goal_ratio                 552 non-null    float64
 13  attack_strength_home

In [None]:
train = eng1[eng1['Year'] < 2022]
validation = eng1[eng1['Year'] == 2022]

In [None]:
train = remove_rows_with_inf(train)

In [None]:
X_train = train.drop(['FTR', 'total_goal'], axis=1)
y_train = train['FTR']
X_validation = validation.drop(['FTR', 'total_goal'], axis=1)
y_validation = validation['FTR']

In [None]:
X_test = df2023_eng1.copy()
X_test = X_test[X_train.columns]

In [None]:
X_train.shape , y_train.shape, X_validation.shape, y_validation.shape, X_test.shape

((3235, 21), (3235,), (541, 21), (541,), (552, 21))

In [None]:
y_train_enc = y_train.map({'H': 1, 'D': 0, 'A': 2})
y_validation_enc = y_validation.map({'H': 1, 'D': 0, 'A': 2})
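
The encoding above has to be reversed later when predictions are written out. A small sketch, using toy `FTR` values rather than the notebook's data, derives the decoder from the encoder so that the two mappings cannot drift apart. The names `outcome_to_code` and `code_to_outcome` are illustrative and do not appear in the notebook.

```python
import pandas as pd

# Same encoding as the notebook: draw = 0, home win = 1, away win = 2
outcome_to_code = {'H': 1, 'D': 0, 'A': 2}
# Build the inverse mapping from the forward one
code_to_outcome = {code: outcome for outcome, code in outcome_to_code.items()}

results = pd.Series(['H', 'A', 'D', 'H'])  # toy FTR values
encoded = results.map(outcome_to_code)
decoded = encoded.map(code_to_outcome)

# Series.map yields NaN for any label missing from the dict, so check for that
print(encoded.isna().sum())
print(encoded.tolist(), decoded.tolist())
```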

In [None]:
from sklearn.preprocessing import LabelEncoder
# Update the set of all teams to include teams from X_test
all_teams = set(X_train['HomeTeam'].unique()).union(set(X_train['AwayTeam'].unique()))
all_teams = all_teams.union(set(X_validation['HomeTeam'].unique())).union(set(X_validation['AwayTeam'].unique()))
all_teams = all_teams.union(set(X_test['HomeTeam'].unique())).union(set(X_test['AwayTeam'].unique()))

# Convert the set to a list
all_teams_list = list(all_teams)

# Fit the LabelEncoder with the updated list of all teams
encoder = LabelEncoder()
encoder.fit(all_teams_list)

# Transform 'HomeTeam' and 'AwayTeam' in all datasets
X_train['HomeTeam'] = encoder.transform(X_train['HomeTeam'])
X_train['AwayTeam'] = encoder.transform(X_train['AwayTeam'])
X_validation['HomeTeam'] = encoder.transform(X_validation['HomeTeam'])
X_validation['AwayTeam'] = encoder.transform(X_validation['AwayTeam'])
X_test['HomeTeam'] = encoder.transform(X_test['HomeTeam'])
X_test['AwayTeam'] = encoder.transform(X_test['AwayTeam'])
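
The cell above fits one encoder on every team seen in the training, validation, and test sets, so a team that only appears in the test season still gets a code. The same idea can be sketched as a small helper; `encode_teams` and the toy frames are hypothetical names used only for illustration.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

def encode_teams(frames, columns=('HomeTeam', 'AwayTeam')):
    """Fit one LabelEncoder on every team in any frame, then encode copies of all frames."""
    all_teams = pd.concat([frame[col] for frame in frames for col in columns]).unique()
    encoder = LabelEncoder().fit(all_teams)
    encoded_frames = []
    for frame in frames:
        frame = frame.copy()  # leave the caller's data untouched
        for col in columns:
            frame[col] = encoder.transform(frame[col])
        encoded_frames.append(frame)
    return encoder, encoded_frames

# Toy frames standing in for X_train and X_test
toy_train = pd.DataFrame({'HomeTeam': ['Hull', 'Cardiff'], 'AwayTeam': ['Cardiff', 'Hull']})
toy_test = pd.DataFrame({'HomeTeam': ['Burnley'], 'AwayTeam': ['Hull']})

encoder, (toy_train_enc, toy_test_enc) = encode_teams([toy_train, toy_test])
print(list(encoder.classes_))
print(toy_test_enc)
```

`LabelEncoder` assigns codes in sorted order of the team names, so the integers carry no football meaning. Tree-based models can split on them freely, whereas linear models would read them as an ordinal scale.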


In [None]:
from lazypredict.Supervised import LazyClassifier
clf = LazyClassifier(verbose = 0, ignore_warnings = False, custom_metric = None)
models,pred = clf.fit(X_train, X_validation, y_train_enc, y_validation_enc)

  3%|▎         | 1/29 [00:00<00:11,  2.50it/s]

ROC AUC couldn't be calculated for AdaBoostClassifier
multi_class must be in ('ovo', 'ovr')


  7%|▋         | 2/29 [00:00<00:09,  2.89it/s]

ROC AUC couldn't be calculated for BaggingClassifier
multi_class must be in ('ovo', 'ovr')
ROC AUC couldn't be calculated for BernoulliNB
multi_class must be in ('ovo', 'ovr')


 14%|█▍        | 4/29 [00:04<00:31,  1.27s/it]

ROC AUC couldn't be calculated for CalibratedClassifierCV
multi_class must be in ('ovo', 'ovr')
CategoricalNB model failed to execute
Negative values in data passed to CategoricalNB (input X)
ROC AUC couldn't be calculated for DecisionTreeClassifier
multi_class must be in ('ovo', 'ovr')
ROC AUC couldn't be calculated for DummyClassifier
multi_class must be in ('ovo', 'ovr')
ROC AUC couldn't be calculated for ExtraTreeClassifier
multi_class must be in ('ovo', 'ovr')


 31%|███       | 9/29 [00:04<00:09,  2.18it/s]

ROC AUC couldn't be calculated for ExtraTreesClassifier
multi_class must be in ('ovo', 'ovr')
ROC AUC couldn't be calculated for GaussianNB
multi_class must be in ('ovo', 'ovr')
ROC AUC couldn't be calculated for KNeighborsClassifier
multi_class must be in ('ovo', 'ovr')


 41%|████▏     | 12/29 [00:05<00:06,  2.76it/s]

ROC AUC couldn't be calculated for LabelPropagation
multi_class must be in ('ovo', 'ovr')


 45%|████▍     | 13/29 [00:06<00:06,  2.45it/s]

ROC AUC couldn't be calculated for LabelSpreading
multi_class must be in ('ovo', 'ovr')
ROC AUC couldn't be calculated for LinearDiscriminantAnalysis
multi_class must be in ('ovo', 'ovr')


 59%|█████▊    | 17/29 [00:07<00:03,  3.30it/s]

ROC AUC couldn't be calculated for LinearSVC
multi_class must be in ('ovo', 'ovr')
ROC AUC couldn't be calculated for LogisticRegression
multi_class must be in ('ovo', 'ovr')
ROC AUC couldn't be calculated for NearestCentroid
multi_class must be in ('ovo', 'ovr')


 62%|██████▏   | 18/29 [00:07<00:03,  2.75it/s]

ROC AUC couldn't be calculated for NuSVC
multi_class must be in ('ovo', 'ovr')
ROC AUC couldn't be calculated for PassiveAggressiveClassifier
multi_class must be in ('ovo', 'ovr')
ROC AUC couldn't be calculated for Perceptron
multi_class must be in ('ovo', 'ovr')
ROC AUC couldn't be calculated for QuadraticDiscriminantAnalysis
multi_class must be in ('ovo', 'ovr')


 86%|████████▌ | 25/29 [00:08<00:00,  5.01it/s]

ROC AUC couldn't be calculated for RandomForestClassifier
multi_class must be in ('ovo', 'ovr')
ROC AUC couldn't be calculated for RidgeClassifier
multi_class must be in ('ovo', 'ovr')
ROC AUC couldn't be calculated for RidgeClassifierCV
multi_class must be in ('ovo', 'ovr')
ROC AUC couldn't be calculated for SGDClassifier
multi_class must be in ('ovo', 'ovr')


 90%|████████▉ | 26/29 [00:09<00:00,  4.29it/s]

ROC AUC couldn't be calculated for SVC
multi_class must be in ('ovo', 'ovr')
StackingClassifier model failed to execute
StackingClassifier.__init__() missing 1 required positional argument: 'estimators'


 97%|█████████▋| 28/29 [00:09<00:00,  4.69it/s]

ROC AUC couldn't be calculated for XGBClassifier
multi_class must be in ('ovo', 'ovr')
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000321 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 992
[LightGBM] [Info] Number of data points in the train set: 3235, number of used features: 21
[LightGBM] [Info] Start training from score -1.306418
[LightGBM] [Info] Start training from score -0.842569
[LightGBM] [Info] Start training from score -1.208620


100%|██████████| 29/29 [00:09<00:00,  3.00it/s]

ROC AUC couldn't be calculated for LGBMClassifier
multi_class must be in ('ovo', 'ovr')
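
The repeated "ROC AUC couldn't be calculated" message appears to come from scikit-learn's `roc_auc_score`, which for more than two classes needs per-class probabilities and an explicit `multi_class` setting of `'ovr'` or `'ovo'`. A minimal sketch on synthetic three-class data shows one way to compute it separately; the logistic regression model and the data are assumptions chosen only to keep the example short.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic three-class data standing in for the H/D/A problem
X, y = make_classification(n_samples=600, n_features=10, n_informative=5,
                           n_classes=3, random_state=42)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.25,
                                            stratify=y, random_state=42)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# Multi-class AUC needs per-class probabilities, not hard labels
proba = model.predict_proba(X_val)
auc_ovr = roc_auc_score(y_val, proba, multi_class='ovr', average='macro')
print(f'One-vs-rest macro ROC AUC: {auc_ovr:.3f}')
```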





In [None]:
models

Unnamed: 0_level_0,Accuracy,Balanced Accuracy,ROC AUC,F1 Score,Time Taken
Model,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
ExtraTreesClassifier,0.71,0.66,,0.7,0.41
LGBMClassifier,0.7,0.66,,0.69,0.3
BaggingClassifier,0.68,0.66,,0.68,0.3
RandomForestClassifier,0.7,0.65,,0.69,0.65
XGBClassifier,0.7,0.65,,0.68,0.33
NuSVC,0.68,0.62,,0.65,0.64
SVC,0.67,0.61,,0.64,0.43
ExtraTreeClassifier,0.63,0.61,,0.63,0.02
DecisionTreeClassifier,0.63,0.61,,0.63,0.04
PassiveAggressiveClassifier,0.61,0.6,,0.62,0.03


In [None]:
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier, ExtraTreesClassifier
from sklearn.metrics import accuracy_score, f1_score
from sklearn.tree import DecisionTreeClassifier
from xgboost import XGBClassifier

# Initialize the models
bagging_clf = BaggingClassifier(n_estimators=100, random_state=42)
extra_trees_clf = ExtraTreesClassifier(n_estimators=100, random_state=42)
random_forest_clf = RandomForestClassifier(n_estimators=100, random_state=42)
decision_tree_clf = DecisionTreeClassifier(random_state=42)
xgb_clf = XGBClassifier(random_state=42)  # Default parameters, adjust as necessary

# Fit the models
bagging_clf.fit(X_train, y_train_enc)
extra_trees_clf.fit(X_train, y_train_enc)
random_forest_clf.fit(X_train, y_train_enc)
decision_tree_clf.fit(X_train, y_train_enc)
xgb_clf.fit(X_train, y_train_enc)

# Predict y_validation
y_pred_bagging = bagging_clf.predict(X_validation)
y_pred_extra_trees = extra_trees_clf.predict(X_validation)
y_pred_random_forest = random_forest_clf.predict(X_validation)
y_pred_decision_tree = decision_tree_clf.predict(X_validation)
y_pred_xgb = xgb_clf.predict(X_validation)

# Calculate accuracy and F1 score for each model
accuracy_bagging = accuracy_score(y_validation_enc, y_pred_bagging)
f1_bagging = f1_score(y_validation_enc, y_pred_bagging, average='macro')

accuracy_extra_trees = accuracy_score(y_validation_enc, y_pred_extra_trees)
f1_extra_trees = f1_score(y_validation_enc, y_pred_extra_trees, average='macro')

accuracy_random_forest = accuracy_score(y_validation_enc, y_pred_random_forest)
f1_random_forest = f1_score(y_validation_enc, y_pred_random_forest, average='macro')

accuracy_decision_tree = accuracy_score(y_validation_enc, y_pred_decision_tree)
f1_decision_tree = f1_score(y_validation_enc, y_pred_decision_tree, average='macro')

accuracy_xgb = accuracy_score(y_validation_enc, y_pred_xgb)
f1_xgb = f1_score(y_validation_enc, y_pred_xgb, average='macro')

# Print out the performance
print(f'Bagging Classifier - Accuracy: {accuracy_bagging}, F1 Score: {f1_bagging}')
print(f'Extra Trees Classifier - Accuracy: {accuracy_extra_trees}, F1 Score: {f1_extra_trees}')
print(f'Random Forest Classifier - Accuracy: {accuracy_random_forest}, F1 Score: {f1_random_forest}')
print(f'Decision Tree Classifier - Accuracy: {accuracy_decision_tree}, F1 Score: {f1_decision_tree}')
print(f'XGB Classifier - Accuracy: {accuracy_xgb}, F1 Score: {f1_xgb}')

Bagging Classifier - Accuracy: 0.7005545286506469, F1 Score: 0.6642997287636857
Extra Trees Classifier - Accuracy: 0.711645101663586, F1 Score: 0.673195183351374
Random Forest Classifier - Accuracy: 0.7024029574861368, F1 Score: 0.658228365610221
Decision Tree Classifier - Accuracy: 0.6303142329020333, F1 Score: 0.6094727831524919
XGB Classifier - Accuracy: 0.6950092421441775, F1 Score: 0.6573774497172099


In [None]:
from sklearn.model_selection import GridSearchCV

param_grid_xgb = {
    'n_estimators': [50, 100, 200],
    'max_depth': [3, 6, 10],
    'learning_rate': [0.01, 0.1, 0.2],
    'subsample': [0.5, 0.7, 1],
    'colsample_bytree': [0.5, 0.7, 1]
}

param_grid_random_forest = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

param_grid_extra_trees = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}



grid_search_xgb = GridSearchCV(xgb_clf, param_grid_xgb, cv=5, scoring='f1_macro', verbose=1, n_jobs=-1)
grid_search_random_forest = GridSearchCV(random_forest_clf, param_grid_random_forest, cv=5, scoring='f1_macro', verbose=1, n_jobs=-1)
grid_search_extra_trees = GridSearchCV(extra_trees_clf, param_grid_extra_trees, cv=5, scoring='f1_macro', verbose=1, n_jobs=-1)

grid_search_xgb.fit(X_train, y_train_enc)
grid_search_random_forest.fit(X_train, y_train_enc)
grid_search_extra_trees.fit(X_train, y_train_enc)

best_params_xgb = grid_search_xgb.best_params_
best_score_xgb = grid_search_xgb.best_score_

best_params_random_forest = grid_search_random_forest.best_params_
best_score_random_forest = grid_search_random_forest.best_score_

best_params_extra_trees = grid_search_extra_trees.best_params_
best_score_extra_trees = grid_search_extra_trees.best_score_


y_pred_xgb = grid_search_xgb.best_estimator_.predict(X_validation)
f1_score_xgb = f1_score(y_validation_enc, y_pred_xgb, average='macro')

y_pred_random_forest = grid_search_random_forest.best_estimator_.predict(X_validation)
f1_score_random_forest = f1_score(y_validation_enc, y_pred_random_forest, average='macro')

y_pred_extra_trees = grid_search_extra_trees.best_estimator_.predict(X_validation)
f1_score_extra_trees = f1_score(y_validation_enc, y_pred_extra_trees, average='macro')


results = {
    "XGB Classifier": {
        "Best Parameters": best_params_xgb,
        "Best Score": best_score_xgb,
        "F1 Score on Validation": f1_score_xgb
    },
    "Random Forest Classifier": {
        "Best Parameters": best_params_random_forest,
        "Best Score": best_score_random_forest,
        "F1 Score on Validation": f1_score_random_forest
    },
    "Extra Trees Classifier": {
        "Best Parameters": best_params_extra_trees,
        "Best Score": best_score_extra_trees,
        "F1 Score on Validation": f1_score_extra_trees
    }
}




Fitting 5 folds for each of 243 candidates, totalling 1215 fits
Fitting 5 folds for each of 108 candidates, totalling 540 fits
Fitting 5 folds for each of 108 candidates, totalling 540 fits
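
The grid searches above pass `cv=5`, which for a classifier means stratified 5-fold splitting without shuffling. Since the rows are matches from successive seasons, some folds may train on later matches and validate on earlier ones. One possible alternative is `TimeSeriesSplit`, sketched below on a toy index; it assumes the rows are sorted chronologically, which the notebook does not state.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Toy stand-in for 20 matches already sorted by date (an assumption)
X_toy = np.arange(20).reshape(-1, 1)

splitter = TimeSeriesSplit(n_splits=4)
folds = list(splitter.split(X_toy))
for fold, (train_idx, val_idx) in enumerate(folds):
    # Every training index precedes every validation index
    print(fold, train_idx.max(), val_idx.min(), val_idx.max())

# The splitter can be handed to GridSearchCV through its cv argument, e.g.
# GridSearchCV(estimator, param_grid, cv=TimeSeriesSplit(n_splits=5), scoring='f1_macro')
```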


In [None]:
results

{'XGB Classifier': {'Best Parameters': {'colsample_bytree': 1,
   'learning_rate': 0.01,
   'max_depth': 6,
   'n_estimators': 50,
   'subsample': 0.5},
  'Best Score': 0.6734051003203225,
  'F1 Score on Validation': 0.6809656629254962},
 'Random Forest Classifier': {'Best Parameters': {'max_depth': 20,
   'min_samples_leaf': 4,
   'min_samples_split': 10,
   'n_estimators': 50},
  'Best Score': 0.6596997148157182,
  'F1 Score on Validation': 0.654858734416592},
 'Extra Trees Classifier': {'Best Parameters': {'max_depth': None,
   'min_samples_leaf': 2,
   'min_samples_split': 2,
   'n_estimators': 50},
  'Best Score': 0.6582164692467405,
  'F1 Score on Validation': 0.6799438501880362}}

In [None]:
optimal_xgb_clf = XGBClassifier(
    colsample_bytree=1,
    learning_rate=0.01,
    max_depth=6,
    n_estimators=50,
    subsample=0.5,
    random_state=42  # Optional for reproducibility
)

In [None]:
optimal_xgb_clf.fit(X_train, y_train_enc)

In [None]:
# Make predictions on the test set
y_pred_test_xgb = optimal_xgb_clf.predict(X_test)

In [None]:
# Define the inverse mapping
inverse_mapping = {1: 'H', 0: 'D', 2: 'A'}

# Convert y_pred_test_xgb back to original form
y_pred_test_xgb_original = [inverse_mapping[label] for label in y_pred_test_xgb]


In [None]:
predictions_df = pd.DataFrame(y_pred_test_xgb_original, columns=['Predictions'])

In [None]:
predictions_df.to_csv('england_1.csv', index=False)

In [None]:
train = eng1[eng1['Year'] < 2022]
validation = eng1[eng1['Year'] == 2022]

In [None]:
X_train = train.drop(['FTR', 'total_goal'], axis=1)
y_train = train['total_goal']
X_validation = validation.drop(['FTR', 'total_goal'], axis=1)
y_validation = validation['total_goal']

In [None]:
X_test = df2023_eng1.copy()
X_test = X_test[X_train.columns]

In [None]:
X_train.shape , y_train.shape, X_validation.shape, y_validation.shape, X_test.shape

((3235, 21), (3235,), (541, 21), (541,), (552, 21))

In [None]:

from sklearn.preprocessing import LabelEncoder
# Update the set of all teams to include teams from X_test
all_teams = set(X_train['HomeTeam'].unique()).union(set(X_train['AwayTeam'].unique()))
all_teams = all_teams.union(set(X_validation['HomeTeam'].unique())).union(set(X_validation['AwayTeam'].unique()))
all_teams = all_teams.union(set(X_test['HomeTeam'].unique())).union(set(X_test['AwayTeam'].unique()))

# Convert the set to a list
all_teams_list = list(all_teams)

# Fit the LabelEncoder with the updated list of all teams
encoder = LabelEncoder()
encoder.fit(all_teams_list)

# Transform 'HomeTeam' and 'AwayTeam' in all datasets
X_train['HomeTeam'] = encoder.transform(X_train['HomeTeam'])
X_train['AwayTeam'] = encoder.transform(X_train['AwayTeam'])
X_validation['HomeTeam'] = encoder.transform(X_validation['HomeTeam'])
X_validation['AwayTeam'] = encoder.transform(X_validation['AwayTeam'])
X_test['HomeTeam'] = encoder.transform(X_test['HomeTeam'])
X_test['AwayTeam'] = encoder.transform(X_test['AwayTeam'])

In [None]:
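from lazypredict.Supervised import LazyRegressor

# Create an instance of LazyRegressor
reg = LazyRegressor(verbose=0, ignore_warnings=False, custom_metric=None)

# Fit all candidate regressors on the training set and score them on the validation set
models, predictions = reg.fit(X_train, X_validation, y_train, y_validation)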

 21%|██▏       | 9/42 [00:03<00:15,  2.07it/s]

GammaRegressor model failed to execute
Some value(s) of y are out of the valid range of the loss 'HalfGammaLoss'.


 76%|███████▌  | 32/42 [00:15<00:05,  1.89it/s]

QuantileRegressor model failed to execute
Solver interior-point is not anymore available in SciPy >= 1.11.0.


100%|██████████| 42/42 [00:20<00:00,  2.09it/s]

[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000185 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 992
[LightGBM] [Info] Number of data points in the train set: 3235, number of used features: 21
[LightGBM] [Info] Start training from score 2.532921





In [None]:
models

Unnamed: 0_level_0,Adjusted R-Squared,R-Squared,RMSE,Time Taken
Model,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
OrthogonalMatchingPursuitCV,0.07,0.1,1.48,0.05
LassoLarsIC,0.07,0.1,1.48,0.03
LassoLarsCV,0.07,0.1,1.48,0.05
LarsCV,0.07,0.1,1.48,0.05
LassoCV,0.07,0.1,1.48,0.22
ElasticNetCV,0.07,0.1,1.48,0.26
BayesianRidge,0.07,0.1,1.48,0.02
RidgeCV,0.06,0.1,1.48,0.02
LinearRegression,0.06,0.1,1.48,0.02
TransformedTargetRegressor,0.06,0.1,1.48,0.02


In [None]:
from sklearn.linear_model import ElasticNetCV
from sklearn.svm import SVR
from sklearn.ensemble import RandomForestRegressor
from xgboost import XGBRegressor
from sklearn.linear_model import PoissonRegressor
from sklearn.metrics import mean_absolute_error, r2_score

# Initialize the models
elastic_net_cv = ElasticNetCV(cv=5, random_state=42)  # Adjust parameters as necessary
poisson_regressor = PoissonRegressor()
svr = SVR()  # Default parameters, adjust as necessary
random_forest_reg = RandomForestRegressor(n_estimators=100, random_state=42)
xgb_reg = XGBRegressor(random_state=42)  # Default parameters, adjust as necessary

# Fit the models
elastic_net_cv.fit(X_train, y_train)
poisson_regressor.fit(X_train, y_train)
svr.fit(X_train, y_train)
random_forest_reg.fit(X_train, y_train)
xgb_reg.fit(X_train, y_train)

# Predict on the validation set
y_pred_elastic_net_cv = elastic_net_cv.predict(X_validation)
y_pred_poisson_regressor = poisson_regressor.predict(X_validation)
y_pred_svr = svr.predict(X_validation)
y_pred_random_forest = random_forest_reg.predict(X_validation)
y_pred_xgb = xgb_reg.predict(X_validation)

# Round the predictions to the nearest integer
y_pred_elastic_net_cv_rounded = np.rint(y_pred_elastic_net_cv)
y_pred_poisson_regressor_rounded = np.rint(y_pred_poisson_regressor)
y_pred_svr_rounded = np.rint(y_pred_svr)
y_pred_random_forest_rounded = np.rint(y_pred_random_forest)
y_pred_xgb_rounded = np.rint(y_pred_xgb)

# Calculate MAE and R2 score using rounded predictions
mae_elastic_net_cv = mean_absolute_error(y_validation, y_pred_elastic_net_cv_rounded)
r2_elastic_net_cv = r2_score(y_validation, y_pred_elastic_net_cv_rounded)

mae_poisson_regressor = mean_absolute_error(y_validation, y_pred_poisson_regressor_rounded)
r2_poisson_regressor = r2_score(y_validation, y_pred_poisson_regressor_rounded)

mae_svr = mean_absolute_error(y_validation, y_pred_svr_rounded)
r2_svr = r2_score(y_validation, y_pred_svr_rounded)

mae_random_forest = mean_absolute_error(y_validation, y_pred_random_forest_rounded)
r2_random_forest = r2_score(y_validation, y_pred_random_forest_rounded)

mae_xgb = mean_absolute_error(y_validation, y_pred_xgb_rounded)
r2_xgb = r2_score(y_validation, y_pred_xgb_rounded)

# Print out the performance with rounded predictions
print(f'ElasticNetCV - MAE: {mae_elastic_net_cv}, R2 Score: {r2_elastic_net_cv}')
print(f'Poisson Regressor - MAE: {mae_poisson_regressor}, R2 Score: {r2_poisson_regressor}')
print(f'SVR - MAE: {mae_svr}, R2 Score: {r2_svr}')
print(f'Random Forest Regressor - MAE: {mae_random_forest}, R2 Score: {r2_random_forest}')
print(f'XGB Regressor - MAE: {mae_xgb}, R2 Score: {r2_xgb}')

ElasticNetCV - MAE: 1.1589648798521257, R2 Score: 0.06970659559250825
Poisson Regressor - MAE: 1.1996303142329021, R2 Score: 0.003203086585311432
SVR - MAE: 1.2273567467652495, R2 Score: -0.10033078539180185
Random Forest Regressor - MAE: 1.1737523105360443, R2 Score: 0.025874737383219437
XGB Regressor - MAE: 1.2920517560073936, R2 Score: -0.15549846900004471
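
All of the R2 scores above are close to zero, which is roughly what a constant prediction achieves. A sketch of such a baseline, using toy goal totals rather than the notebook's data, shows how to compute a reference MAE and R2 for a model that always predicts the training mean. The helper names `mae` and `r2` are illustrative.

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean absolute error."""
    return float(np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred))))

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)

# Toy goal totals, for illustration only
y_train_toy = np.array([2, 3, 1, 4, 2, 3, 0, 5])
y_val_toy = np.array([1, 2, 4, 3, 2])

# Baseline: predict the training mean for every validation match
baseline_pred = np.full(len(y_val_toy), y_train_toy.mean())

baseline_mae = mae(y_val_toy, baseline_pred)
baseline_r2 = r2(y_val_toy, baseline_pred)
print(f'Baseline MAE: {baseline_mae:.3f}')
print(f'Baseline R2:  {baseline_r2:.3f}')
```

Comparing each model's validation MAE against a baseline of this kind gives a sense of how much the features add beyond the average number of goals per match.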


In [None]:
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import mean_absolute_error

# Parameter grids
param_grid_poisson = {
    'alpha': [0.01, 0.1, 1, 10],
    'max_iter': [100, 300, 500]
}

param_grid_random_forest = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

# GridSearchCV setup
grid_search_poisson = GridSearchCV(poisson_regressor, param_grid_poisson, cv=5, scoring='neg_mean_absolute_error', verbose=1, n_jobs=-1)
grid_search_random_forest = GridSearchCV(random_forest_reg, param_grid_random_forest, cv=5, scoring='neg_mean_absolute_error', verbose=1, n_jobs=-1)

# Fitting models
grid_search_poisson.fit(X_train, y_train)
grid_search_random_forest.fit(X_train, y_train)

# Best parameters and scores
best_params_poisson = grid_search_poisson.best_params_
best_score_poisson = grid_search_poisson.best_score_

best_params_random_forest = grid_search_random_forest.best_params_
best_score_random_forest = grid_search_random_forest.best_score_

# Predict and calculate MAE
y_pred_poisson = grid_search_poisson.best_estimator_.predict(X_validation)
mae_poisson = mean_absolute_error(y_validation, y_pred_poisson)

y_pred_random_forest = grid_search_random_forest.best_estimator_.predict(X_validation)
mae_random_forest = mean_absolute_error(y_validation, y_pred_random_forest)

# Results
results = {
    "Poisson Regressor": {
        "Best Parameters": best_params_poisson,
        "Best Score (Negative MAE)": best_score_poisson,
        "MAE on Validation": mae_poisson
    },
    "Random Forest Regressor": {
        "Best Parameters": best_params_random_forest,
        "Best Score (Negative MAE)": best_score_random_forest,
        "MAE on Validation": mae_random_forest
    }
}

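# ElasticNetCV already tunes alpha by cross-validation internally, so it is fitted directly without GridSearchCV.
# Note: get_params() below reports the constructor arguments; the alpha selected during fit is stored in elastic_net_cv.alpha_.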
elastic_net_cv = ElasticNetCV(cv=5, random_state=42).fit(X_train, y_train)
y_pred_elastic_net_cv = elastic_net_cv.predict(X_validation)
mae_elastic_net_cv = mean_absolute_error(y_validation, y_pred_elastic_net_cv)

results["ElasticNetCV"] = {
    "Best Parameters": elastic_net_cv.get_params(),
    "MAE on Validation": mae_elastic_net_cv
}

# Print results
for model, info in results.items():
    print(f"{model}:")
    for key, value in info.items():
        print(f"  {key}: {value}")
    print()

Fitting 5 folds for each of 12 candidates, totalling 60 fits
Fitting 5 folds for each of 108 candidates, totalling 540 fits
Poisson Regressor:
  Best Parameters: {'alpha': 0.01, 'max_iter': 500}
  Best Score (Negative MAE): -1.1760288561025591
  MAE on Validation: 1.1752358429133842

Random Forest Regressor:
  Best Parameters: {'max_depth': 10, 'min_samples_leaf': 4, 'min_samples_split': 10, 'n_estimators': 200}
  Best Score (Negative MAE): -1.1917588688068341
  MAE on Validation: 1.1820894700518194

ElasticNetCV:
  Best Parameters: {'alphas': None, 'copy_X': True, 'cv': 5, 'eps': 0.001, 'fit_intercept': True, 'l1_ratio': 0.5, 'max_iter': 1000, 'n_alphas': 100, 'n_jobs': None, 'positive': False, 'precompute': 'auto', 'random_state': 42, 'selection': 'cyclic', 'tol': 0.0001, 'verbose': 0}
  MAE on Validation: 1.178936757572926
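
The "Best Parameters" printed for ElasticNetCV are its constructor arguments, because that is what `get_params()` returns. The regularization strength chosen by cross-validation lives in the fitted attributes `alpha_` and `l1_ratio_`. A short sketch on synthetic regression data (not the notebook's data) shows the difference.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNetCV

# Synthetic regression data, for illustration only
X, y = make_regression(n_samples=200, n_features=8, noise=10.0, random_state=42)

model = ElasticNetCV(cv=5, random_state=42).fit(X, y)

# Constructor argument, as reported by get_params()
print(model.get_params()['l1_ratio'])
# Values actually selected by cross-validation during fit
print(model.alpha_, model.l1_ratio_)
```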



In [None]:


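# ElasticNetCV constructor arguments, as reported by get_params() above.
# The regularization strength (alpha) is selected again by cross-validation during fit.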
best_params_elastic_net = {
    'alphas': None,
    'copy_X': True,
    'cv': 5,
    'eps': 0.001,
    'fit_intercept': True,
    'l1_ratio': 0.5,
    'max_iter': 1000,
    'n_alphas': 100,
    'n_jobs': None,
    'positive': False,
    'precompute': 'auto',
    'random_state': 42,
    'selection': 'cyclic',
    'tol': 0.0001,
    'verbose': 0
}

# Initialize and fit the ElasticNetCV model
elastic_net_model = ElasticNetCV(**best_params_elastic_net)
elastic_net_model.fit(X_train, y_train)

# Predict on the test set
y_pred_test_elastic_net = elastic_net_model.predict(X_test)

# Round predictions to nearest integer and convert to int type
y_pred_test_elastic_net_rounded = np.rint(y_pred_test_elastic_net).astype(int)

# y_pred_test_elastic_net_rounded contains the final integer predictions for X_test


In [None]:


# Convert predictions to a DataFrame
predictions_df_elastic_net = pd.DataFrame(y_pred_test_elastic_net_rounded, columns=['Predicted_Total_Goals'])

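# Save to CSV
# Note: this reuses the file name 'england_1.csv' from the match-result predictions above, so that file is overwritten
predictions_df_elastic_net.to_csv('england_1.csv', index=False)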


Load the third division (England 2)

In [None]:

country = "england"
league = "2"
seasonal_datasets = load_seasonal_data(base_path, country, league, 1, 22)

In [None]:
eng21516 = seasonal_datasets['21516']
eng21617 = seasonal_datasets['21617']
eng21718 = seasonal_datasets['21718']
eng21819 = seasonal_datasets['21819']
eng21920 = seasonal_datasets['21920']
eng22021 = seasonal_datasets['22021']
eng22122 = seasonal_datasets['22122']

In [None]:
df20162 = eng21516[columns]
df20172 = eng21617[columns]
df20182 = eng21718[columns]
df20192 = eng21819[columns]
df20202 = eng21920[columns]
df20212 = eng22021[columns]
df20222 = eng22122[columns]

In [None]:
summary = missing_values_summary(df20162)
print(summary)

          Missing Values Count
Column                        
Date                         1
HomeTeam                     1
AwayTeam                     1
FTHG                         1
FTAG                         1
FTR                          1
HS                           1
AS                           1
HST                          1
AST                          1
HC                           1
AC                           1
B365H                        1
B365D                        1
B365A                        1
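
The helper `missing_values_summary` is defined elsewhere in the notebook. As a rough guide to reading its output, the following is an assumed equivalent (named `missing_values_summary_sketch` to mark it as hypothetical) that counts missing values per column and keeps only the columns that have any.

```python
import numpy as np
import pandas as pd

def missing_values_summary_sketch(df):
    """Hypothetical stand-in: count NaNs per column, keeping only columns that have any."""
    counts = df.isna().sum()
    counts = counts[counts > 0]
    summary = counts.to_frame(name='Missing Values Count')
    summary.index.name = 'Column'
    return summary

# Toy frame with one missing Date and one missing FTHG
toy = pd.DataFrame({
    'Date': ['29/07/2022', None, '30/07/2022'],
    'FTHG': [1.0, np.nan, 2.0],
    'FTAG': [0.0, 1.0, 3.0],
})
toy_summary = missing_values_summary_sketch(toy)
print(toy_summary)
```

With this version, a frame without missing values yields an empty DataFrame, which matches the shape of the empty outputs shown for some seasons.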


In [None]:
summary = missing_values_summary(df20172)
print(summary)

        Missing Values Count
Column                      
Date                      16


In [None]:
summary = missing_values_summary(df20182)
print(summary)

Empty DataFrame
Columns: [Missing Values Count]
Index: []


In [None]:
summary = missing_values_summary(df20192)
print(summary)

        Missing Values Count
Column                      
Date                      16


In [None]:
summary = missing_values_summary(df20202)
print(summary)

        Missing Values Count
Column                      
B365H                      2
B365D                      2
B365A                      2


In [None]:
summary = missing_values_summary(df20212)
print(summary)

Empty DataFrame
Columns: [Missing Values Count]
Index: []


In [None]:
summary = missing_values_summary(df20222)
print(summary)

Empty DataFrame
Columns: [Missing Values Count]
Index: []


In [None]:
show_rows_with_missing_values(df20162)

Unnamed: 0,Date,HomeTeam,AwayTeam,FTHG,FTAG,FTR,HS,AS,HST,AST,HC,AC,B365H,B365D,B365A
423,,,,,,,,,,,,,,,


In [None]:
df20162 = df20162.drop(423)

In [None]:
show_rows_with_missing_values(df20172)

Unnamed: 0,Date,HomeTeam,AwayTeam,FTHG,FTAG,FTR,HS,AS,HST,AST,HC,AC,B365H,B365D,B365A
0,,Chesterfield,Bristol Rvs,3,-1,H,6,17,5,5,1,7,3.8,3.6,2.05
1,,Northampton,Walsall,2,-1,H,10,7,7,2,7,5,2.2,3.5,3.5
2,,Port Vale,Fleetwood Town,2,-1,H,6,13,3,5,7,9,2.9,3.4,2.6
3,,Rochdale,Coventry,2,0,H,10,5,5,0,5,2,1.53,4.5,6.5
4,,Sheffield United,Chesterfield,3,2,H,16,10,8,3,11,3,1.18,6.5,15.0
5,,Swindon,Oldham,0,0,D,26,8,6,0,6,5,2.63,3.25,2.63
6,,Oldham,Fleetwood Town,2,0,H,8,6,5,2,6,2,3.0,3.3,2.55
7,,AFC Wimbledon,Southend,0,2,A,10,7,0,6,3,3,3.0,3.25,2.6
8,,Sheffield United,Scunthorpe,1,1,D,19,6,4,1,6,1,1.6,4.0,5.25
9,,Oldham,Bristol Rvs,0,2,A,9,15,4,5,6,4,3.1,3.3,2.5


In [None]:
df20162 = transform_goals_to_absolute(df20162)
df20172 = transform_goals_to_absolute(df20172)
df20182 = transform_goals_to_absolute(df20182)
df20192 = transform_goals_to_absolute(df20192)
df20202 = transform_goals_to_absolute(df20202)
df20212 = transform_goals_to_absolute(df20212)
df20222 = transform_goals_to_absolute(df20222)
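The rows listed for `df20172` show away goals recorded as -1 (for example Chesterfield vs Bristol Rvs), which cannot be a real score. `transform_goals_to_absolute` is defined earlier in the notebook and its body is not shown in this part. The sketch below is only an assumption about what such a step might look like: a hypothetical `abs_goals_sketch` that takes the absolute value of `FTHG` and `FTAG` on a copy of the frame.

```python
import pandas as pd

def abs_goals_sketch(df, goal_columns=("FTHG", "FTAG")):
    """Hypothetical stand-in: return a copy with goal counts made non-negative."""
    out = df.copy()
    for col in goal_columns:
        out[col] = out[col].abs()
    return out

# Two rows modelled on the df20172 listing above
sample = pd.DataFrame({
    "HomeTeam": ["Chesterfield", "Rochdale"],
    "AwayTeam": ["Bristol Rvs", "Coventry"],
    "FTHG": [3, 2],
    "FTAG": [-1, 0],
})
cleaned = abs_goals_sketch(sample)
print(cleaned[["FTHG", "FTAG"]])
```

Working on a copy keeps the raw season frame intact, so the sign errors can still be inspected afterwards.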

In [None]:
df20162 = filter_goals_under_30(df20162)
df20172 = filter_goals_under_30(df20172)
df20182 = filter_goals_under_30(df20182)
df20192 = filter_goals_under_30(df20192)
df20202 = filter_goals_under_30(df20202)
df20212 = filter_goals_under_30(df20212)
df20222 = filter_goals_under_30(df20222)


In [None]:
df20162 = process_time_data(df20162, 2016)
df20172 = process_time_data(df20172, 2017)
df20182 = process_time_data(df20182, 2018)
df20192 = process_time_data(df20192, 2019)
df20202 = process_time_data(df20202, 2020)
df20212 = process_time_data(df20212, 2021)
df20222 = process_time_data(df20222, 2022)

In [None]:

df20202 = impute_missing_values_knn(df20202)
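Only `df20202` has missing Bet365 odds in this division, so it is the only frame passed to `impute_missing_values_knn`. That helper is defined earlier in the notebook and not shown here. As a rough sketch, assuming it relies on scikit-learn's `KNNImputer` over the numeric columns, the step might look like the following; `knn_impute_sketch` and the `n_neighbors` value are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

def knn_impute_sketch(df, n_neighbors=5):
    """Hypothetical stand-in: fill NaNs in numeric columns from the nearest rows."""
    out = df.copy()
    numeric_cols = out.select_dtypes(include="number").columns
    imputer = KNNImputer(n_neighbors=n_neighbors)
    out[numeric_cols] = imputer.fit_transform(out[numeric_cols])
    return out

# Toy frame: one match has no odds, the others do
toy = pd.DataFrame({
    "HomeTeam": ["A", "B", "C", "D"],
    "HS": [10.0, 12.0, 11.0, 9.0],
    "B365H": [2.0, np.nan, 2.2, 2.4],
    "B365D": [3.4, np.nan, 3.3, 3.5],
    "B365A": [3.6, np.nan, 3.2, 3.0],
})
filled = knn_impute_sketch(toy, n_neighbors=2)
print(filled)
```

The imputed odds are averages of the closest rows, so they always fall within the range of the observed values in that column.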

In [None]:
df20162 = preprocess_football_data(df20162)
df20172 = preprocess_football_data(df20172)
df20182 = preprocess_football_data(df20182)
df20192 = preprocess_football_data(df20192)
df20202 = preprocess_football_data(df20202)
df20212 = preprocess_football_data(df20212)
df20222 = preprocess_football_data(df20222)

In [None]:
df20162 = calculate_attack_strength(df20162)
df20172 = calculate_attack_strength(df20172)
df20182 = calculate_attack_strength(df20182)
df20192 = calculate_attack_strength(df20192)
df20202 = calculate_attack_strength(df20202)
df20212 = calculate_attack_strength(df20212)
df20222 = calculate_attack_strength(df20222)

In [None]:
df20162 = add_adjusted_win_loss_ratio(df20162)
df20172 = add_adjusted_win_loss_ratio(df20172)
df20182 = add_adjusted_win_loss_ratio(df20182)
df20192 = add_adjusted_win_loss_ratio(df20192)
df20202 = add_adjusted_win_loss_ratio(df20202)
df20212 = add_adjusted_win_loss_ratio(df20212)
df20222 = add_adjusted_win_loss_ratio(df20222)
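The cells above apply the same chain of helpers to each of the seven season frames, one line per season. The same sequence could be expressed as a loop over a dictionary of frames. The sketch below illustrates the idea with two hypothetical placeholder steps (`drop_empty_rows`, `add_total_goal`); in the notebook, the real helpers would take their place.

```python
from functools import reduce

import pandas as pd

def run_pipeline(df, steps):
    """Apply each step function to the frame in order and return the result."""
    return reduce(lambda frame, step: step(frame), steps, df)

# Hypothetical placeholder steps; the notebook's own helpers would go here.
def drop_empty_rows(df):
    return df.dropna(how="all")

def add_total_goal(df):
    return df.assign(total_goal=df["FTHG"] + df["FTAG"])

# Toy season frames keyed by year
seasons = {
    2016: pd.DataFrame({"FTHG": [1, None], "FTAG": [2, None]}),
    2017: pd.DataFrame({"FTHG": [0, 3], "FTAG": [0, 1]}),
}
steps = [drop_empty_rows, add_total_goal]
processed = {year: run_pipeline(df, steps) for year, df in seasons.items()}
print(processed[2016])
```

A step that also needs the season year, as `process_time_data` does, could be wrapped in a small lambda inside the loop. Keeping the frames in a dictionary also avoids editing seven lines whenever a step is added or reordered.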

In [None]:
eng2 = pd.concat([df20162, df20172, df20182, df20192, df20202, df20212, df20222], ignore_index=True)

In [None]:
eng2 = eng2.drop(columns=columns_to_drop, errors='ignore')

In [None]:
eng2.head()

Unnamed: 0,HomeTeam,AwayTeam,FTR,B365H,B365D,B365A,Year,HomeTeam_WinRate,AwayTeam_WinRate,HomeTeam_GoalsAvg,...,Broker_prob_H,Broker_prob_D,Broker_prob_A,total_goal,H_goal_ratio,A_goal_ratio,attack_strength_home_team,attack_strength_away_team,adjusted_win_lost_ratio_H,adjusted_win_lost_ratio_A
0,Swindon,Walsall,H,4.2,3.8,1.91,2016.0,0.45,0.57,1.73,...,0.24,0.26,0.52,3.0,0.37,0.31,25.99,33.5,2.0,0.0
1,Oldham,Peterboro,A,2.7,3.4,2.8,2016.0,0.32,0.41,1.09,...,0.37,0.29,0.36,6.0,0.26,0.33,16.41,31.82,1.0,1.0
2,Sheffield United,Rochdale,H,2.1,3.75,3.5,2016.0,0.48,0.32,1.61,...,0.48,0.27,0.29,5.0,0.35,0.28,25.31,22.61,1.0,1.0
3,Walsall,Bury,A,2.25,3.5,3.4,2016.0,0.48,0.23,1.35,...,0.44,0.29,0.29,1.0,0.27,0.22,21.2,15.91,1.0,1.0
4,Scunthorpe,Doncaster,H,1.95,3.5,4.33,2016.0,0.5,0.17,1.18,...,0.51,0.29,0.23,2.0,0.21,0.25,17.78,17.59,3.0,-1.0


In [None]:
eng2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3630 entries, 0 to 3629
Data columns (total 23 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   HomeTeam                     3630 non-null   object 
 1   AwayTeam                     3630 non-null   object 
 2   FTR                          3630 non-null   object 
 3   B365H                        3630 non-null   float64
 4   B365D                        3630 non-null   float64
 5   B365A                        3630 non-null   float64
 6   Year                         3630 non-null   float64
 7   HomeTeam_WinRate             3630 non-null   float64
 8   AwayTeam_WinRate             3630 non-null   float64
 9   HomeTeam_GoalsAvg            3630 non-null   float64
 10  AwayTeam_GoalsAvg            3630 non-null   float64
 11  HomeTeam_goals_conceded_avg  3630 non-null   float64
 12  AwayTeam_goals_conceded_avg  3630 non-null   float64
 13  Broker_prob_H     

Load the fourth division (England3)

In [None]:

country = "england"
league = "3"
seasonal_datasets = load_seasonal_data(base_path, country, league, 1, 22)

In [None]:
eng31516 = seasonal_datasets['31516']
eng31617 = seasonal_datasets['31617']
eng31718 = seasonal_datasets['31718']
eng31819 = seasonal_datasets['31819']
eng31920 = seasonal_datasets['31920']
eng32021 = seasonal_datasets['32021']
eng32122 = seasonal_datasets['32122']

In [None]:
df20163 = eng31516[columns]
df20173 = eng31617[columns]
df20183 = eng31718[columns]
df20193 = eng31819[columns]
df20203 = eng31920[columns]
df20213 = eng32021[columns]
df20223 = eng32122[columns]

In [None]:
summary = missing_values_summary(df20163)
print(summary)


          Missing Values Count
Column                        
Date                         1
HomeTeam                     1
AwayTeam                     1
FTHG                         1
FTAG                         1
FTR                          1
HS                           1
AS                           1
HST                          1
AST                          1
HC                           1
AC                           1
B365H                        1
B365D                        1
B365A                        1


In [None]:
summary = missing_values_summary(df20173)
print(summary)


Empty DataFrame
Columns: [Missing Values Count]
Index: []


In [None]:
summary = missing_values_summary(df20183)
print(summary)

Empty DataFrame
Columns: [Missing Values Count]
Index: []


In [None]:
summary = missing_values_summary(df20193)
print(summary)

Empty DataFrame
Columns: [Missing Values Count]
Index: []


In [None]:
summary = missing_values_summary(df20203)
print(summary)

        Missing Values Count
Column                      
B365H                      4
B365D                      4
B365A                      4


In [None]:
summary = missing_values_summary(df20213)
print(summary)

Empty DataFrame
Columns: [Missing Values Count]
Index: []


In [None]:
summary = missing_values_summary(df20223)
print(summary)

        Missing Values Count
Column                      
B365H                      1
B365D                      1
B365A                      1


In [None]:
show_rows_with_missing_values(df20163)

Unnamed: 0,Date,HomeTeam,AwayTeam,FTHG,FTAG,FTR,HS,AS,HST,AST,HC,AC,B365H,B365D,B365A
423,,,,,,,,,,,,,,,


In [None]:
df20163 = df20163.drop(423)

In [None]:
df20163 = transform_goals_to_absolute(df20163)
df20173 = transform_goals_to_absolute(df20173)
df20183 = transform_goals_to_absolute(df20183)
df20193 = transform_goals_to_absolute(df20193)
df20203 = transform_goals_to_absolute(df20203)
df20213 = transform_goals_to_absolute(df20213)
df20223 = transform_goals_to_absolute(df20223)

In [None]:
df20163 = filter_goals_under_30(df20163)
df20173 = filter_goals_under_30(df20173)
df20183 = filter_goals_under_30(df20183)
df20193 = filter_goals_under_30(df20193)
df20203 = filter_goals_under_30(df20203)
df20213 = filter_goals_under_30(df20213)
df20223 = filter_goals_under_30(df20223)


In [None]:
df20163 = process_time_data(df20163, 2016)
df20173 = process_time_data(df20173, 2017)
df20183 = process_time_data(df20183, 2018)
df20193 = process_time_data(df20193, 2019)
df20203 = process_time_data(df20203, 2020)
df20213 = process_time_data(df20213, 2021)
df20223 = process_time_data(df20223, 2022)

In [None]:
df20203 = impute_missing_values_knn(df20203)
df20223 = impute_missing_values_knn(df20223)

In [None]:
df20163 = preprocess_football_data(df20163)
df20173 = preprocess_football_data(df20173)
df20183 = preprocess_football_data(df20183)
df20193 = preprocess_football_data(df20193)
df20203 = preprocess_football_data(df20203)
df20213 = preprocess_football_data(df20213)
df20223 = preprocess_football_data(df20223)

In [None]:
df20163 = calculate_attack_strength(df20163)
df20173 = calculate_attack_strength(df20173)
df20183 = calculate_attack_strength(df20183)
df20193 = calculate_attack_strength(df20193)
df20203 = calculate_attack_strength(df20203)
df20213 = calculate_attack_strength(df20213)
df20223 = calculate_attack_strength(df20223)

In [None]:
df20163 = add_adjusted_win_loss_ratio(df20163)
df20173 = add_adjusted_win_loss_ratio(df20173)
df20183 = add_adjusted_win_loss_ratio(df20183)
df20193 = add_adjusted_win_loss_ratio(df20193)
df20203 = add_adjusted_win_loss_ratio(df20203)
df20213 = add_adjusted_win_loss_ratio(df20213)
df20223 = add_adjusted_win_loss_ratio(df20223)

In [None]:
eng3 = pd.concat([df20163, df20173, df20183, df20193, df20203, df20213, df20223], ignore_index=True)

In [None]:
eng3 = eng3.drop(columns=columns_to_drop, errors='ignore')

In [None]:
eng3.head()

Unnamed: 0,HomeTeam,AwayTeam,FTR,B365H,B365D,B365A,Year,HomeTeam_WinRate,AwayTeam_WinRate,HomeTeam_GoalsAvg,...,Broker_prob_H,Broker_prob_D,Broker_prob_A,total_goal,H_goal_ratio,A_goal_ratio,attack_strength_home_team,attack_strength_away_team,adjusted_win_lost_ratio_H,adjusted_win_lost_ratio_A
0,Barnet,Mansfield,A,2.6,3.4,2.9,2016,0.57,0.45,1.61,...,0.38,0.29,0.34,4.0,0.41,0.35,26.83,19.98,0.0,2.0
1,Newport County,Morecambe,A,2.6,3.4,2.9,2016,0.17,0.23,0.91,...,0.38,0.29,0.34,3.0,0.24,0.39,15.23,25.36,1.0,1.0
2,Luton,York,D,2.05,3.8,3.6,2016,0.27,0.04,1.14,...,0.49,0.26,0.28,2.0,0.29,0.32,18.13,13.83,2.0,0.0
3,Wycombe,Hartlepool,H,1.83,3.6,5.0,2016,0.39,0.24,1.09,...,0.55,0.28,0.2,3.0,0.36,0.27,18.13,15.37,1.0,1.0
4,Accrington,Plymouth,H,2.25,3.5,3.4,2016,0.48,0.52,1.87,...,0.44,0.29,0.29,3.0,0.3,0.32,31.18,25.36,1.0,1.0


In [None]:
eng3.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3663 entries, 0 to 3662
Data columns (total 23 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   HomeTeam                     3663 non-null   object 
 1   AwayTeam                     3663 non-null   object 
 2   FTR                          3663 non-null   object 
 3   B365H                        3663 non-null   float64
 4   B365D                        3663 non-null   float64
 5   B365A                        3663 non-null   float64
 6   Year                         3663 non-null   int64  
 7   HomeTeam_WinRate             3663 non-null   float64
 8   AwayTeam_WinRate             3663 non-null   float64
 9   HomeTeam_GoalsAvg            3663 non-null   float64
 10  AwayTeam_GoalsAvg            3663 non-null   float64
 11  HomeTeam_goals_conceded_avg  3663 non-null   float64
 12  AwayTeam_goals_conceded_avg  3663 non-null   float64
 13  Broker_prob_H     

Merge the datasets from the second, third and fourth divisions

In [None]:
data20162 = pd.concat([df20161, df20162, df20163], ignore_index=True)
data20172 = pd.concat([df20171, df20172, df20173], ignore_index=True)
data20182 = pd.concat([df20181, df20182, df20183], ignore_index=True)
data20192 = pd.concat([df20191, df20192, df20193], ignore_index=True)
data20202 = pd.concat([df20201, df20202, df20203], ignore_index=True)
data20212 = pd.concat([df20211, df20212, df20213], ignore_index=True)
data20222 = pd.concat([df20221, df20222, df20223], ignore_index=True)

In [None]:
df2023_eng2.head()

Unnamed: 0,Div,Date,Time,HomeTeam,AwayTeam,Referee,B365H,B365D,B365A,BWH,...,AvgC<2.5,AHCh,B365CAHH,B365CAHA,PCAHH,PCAHA,MaxCAHH,MaxCAHA,AvgCAHH,AvgCAHA
0,E2,30/07/2022,15:00,Accrington,Charlton,B Speedie,2.55,3.4,2.7,2.6,...,1.92,0.0,2.05,1.75,2.08,1.83,2.16,1.83,2.06,1.77
1,E2,30/07/2022,15:00,Bristol Rvs,Forest Green,R Lewis,2.3,3.3,3.2,2.25,...,1.75,-0.25,2.1,1.77,2.09,1.82,2.13,1.82,2.07,1.76
2,E2,30/07/2022,15:00,Cambridge,Milton Keynes Dons,J Busby,4.0,3.5,1.9,3.9,...,1.61,0.25,1.7,2.1,1.76,2.14,1.86,2.14,1.78,2.05
3,E2,30/07/2022,15:00,Cheltenham,Peterboro,T Nield,3.1,3.4,2.3,2.9,...,2.04,0.25,1.93,1.93,1.98,1.9,1.98,1.99,1.89,1.92
4,E2,30/07/2022,15:00,Derby,Oxford,A Backhouse,2.2,3.4,3.3,2.3,...,1.81,-0.25,1.75,2.05,1.79,2.1,1.86,2.14,1.79,2.03


In [None]:
df2023_eng2 = df2023_eng2[columns_test]

In [None]:
df2023_eng2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 552 entries, 0 to 551
Data columns (total 6 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Date      552 non-null    object 
 1   HomeTeam  552 non-null    object 
 2   AwayTeam  552 non-null    object 
 3   B365H     552 non-null    float64
 4   B365D     552 non-null    float64
 5   B365A     552 non-null    float64
dtypes: float64(3), object(3)
memory usage: 26.0+ KB


In [None]:
season_dfs_3 = [data20162, data20172, data20182, data20192, data20202, data20212, data20222]

In [None]:
# Apply the overall averages from past seasons to the 2023 test frame (df2023_eng2)
df2023_eng2 = calculate_and_apply_overall_averages(season_dfs_3, df2023_eng2)

In [None]:
df2023_eng2.head()

Unnamed: 0,Date,HomeTeam,AwayTeam,B365H,B365D,B365A,HomeTeam_WinRate,AwayTeam_WinRate,HomeTeam_GoalsAvg,AwayTeam_GoalsAvg,HomeTeam_goals_conceded_avg,AwayTeam_goals_conceded_avg,H_goal_ratio,A_goal_ratio,attack_strength_home_team,attack_strength_away_team
0,30/07/2022,Accrington,Charlton,2.55,3.4,2.7,0.52,0.3,1.78,1.0,1.43,1.35,0.33,0.26,27.45,19.08
1,30/07/2022,Bristol Rvs,Forest Green,2.3,3.3,3.2,0.61,0.39,1.65,1.78,0.87,1.13,0.38,0.35,27.97,37.09
2,30/07/2022,Cambridge,Milton Keynes Dons,4.0,3.5,1.9,0.35,0.59,1.22,1.95,1.26,0.95,0.37,0.37,18.75,35.68
3,30/07/2022,Cheltenham,Peterboro,3.1,3.4,2.3,0.43,,1.43,,1.3,,0.32,,22.1,
4,30/07/2022,Derby,Oxford,2.2,3.4,3.3,,0.39,,1.52,,1.39,,0.35,,29.04


In [None]:
df2023_eng2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 552 entries, 0 to 551
Data columns (total 16 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   Date                         552 non-null    object 
 1   HomeTeam                     552 non-null    object 
 2   AwayTeam                     552 non-null    object 
 3   B365H                        552 non-null    float64
 4   B365D                        552 non-null    float64
 5   B365A                        552 non-null    float64
 6   HomeTeam_WinRate             552 non-null    float64
 7   AwayTeam_WinRate             552 non-null    float64
 8   HomeTeam_GoalsAvg            552 non-null    float64
 9   AwayTeam_GoalsAvg            552 non-null    float64
 10  HomeTeam_goals_conceded_avg  552 non-null    float64
 11  AwayTeam_goals_conceded_avg  552 non-null    float64
 12  H_goal_ratio                 552 non-null    float64
 13  A_goal_ratio        

In [None]:
merged_df = eng2.copy()

# Calculate head-to-head stats using merged data
head_to_head_stats = calculate_head_to_head_stats(merged_df)

# Apply the adjusted win-loss ratio to df2023
df2023_eng2 = apply_adjusted_win_loss_ratio_to_2023(df2023_eng2, head_to_head_stats)

In [None]:
df2023_eng2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 552 entries, 0 to 551
Data columns (total 18 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   Date                         552 non-null    object 
 1   HomeTeam                     552 non-null    object 
 2   AwayTeam                     552 non-null    object 
 3   B365H                        552 non-null    float64
 4   B365D                        552 non-null    float64
 5   B365A                        552 non-null    float64
 6   HomeTeam_WinRate             552 non-null    float64
 7   AwayTeam_WinRate             552 non-null    float64
 8   HomeTeam_GoalsAvg            552 non-null    float64
 9   AwayTeam_GoalsAvg            552 non-null    float64
 10  HomeTeam_goals_conceded_avg  552 non-null    float64
 11  AwayTeam_goals_conceded_avg  552 non-null    float64
 12  H_goal_ratio                 552 non-null    float64
 13  A_goal_ratio        

In [None]:
df2023_eng2 = process_time_data(df2023_eng2, 2023)

In [None]:
df2023_eng2 = add_probability_B365(df2023_eng2)
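`add_probability_B365` is defined earlier in the notebook and not shown here. The `Broker_prob_*` values in the tables are consistent with the inverse of the decimal odds rounded to two decimals (for instance `B365H` = 2.55 gives 0.39). A minimal sketch under that assumption, with the hypothetical name `implied_probabilities_sketch`:

```python
import pandas as pd

def implied_probabilities_sketch(df):
    """Hypothetical stand-in: inverse decimal odds, rounded to two decimals."""
    out = df.copy()
    for outcome in ("H", "D", "A"):
        out[f"Broker_prob_{outcome}"] = (1.0 / out[f"B365{outcome}"]).round(2)
    return out

# Odds taken from the first and third rows of df2023_eng2.head()
odds = pd.DataFrame({
    "B365H": [2.55, 4.0],
    "B365D": [3.4, 3.5],
    "B365A": [2.7, 1.9],
})
probs = implied_probabilities_sketch(odds)
print(probs)
```

The three values in a row add up to slightly more than 1 because the bookmaker's margin is built into the odds. Dividing each value by the row sum would turn them into proper probabilities, although the tables in this notebook appear to keep the raw inverses.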

In [None]:
df2023_eng2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 552 entries, 0 to 551
Data columns (total 21 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   HomeTeam                     552 non-null    object 
 1   AwayTeam                     552 non-null    object 
 2   B365H                        552 non-null    float64
 3   B365D                        552 non-null    float64
 4   B365A                        552 non-null    float64
 5   HomeTeam_WinRate             552 non-null    float64
 6   AwayTeam_WinRate             552 non-null    float64
 7   HomeTeam_GoalsAvg            552 non-null    float64
 8   AwayTeam_GoalsAvg            552 non-null    float64
 9   HomeTeam_goals_conceded_avg  552 non-null    float64
 10  AwayTeam_goals_conceded_avg  552 non-null    float64
 11  H_goal_ratio                 552 non-null    float64
 12  A_goal_ratio                 552 non-null    float64
 13  attack_strength_home

In [None]:


def fill_missing_with_mean(df):
    """
    Fill missing values in each column of the DataFrame with the mean of that column.

    Parameters:
    df (pd.DataFrame): The dataset with missing values.

    Returns:
    pd.DataFrame: The DataFrame with missing values filled.
    """
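    for column in df.columns:
        if df[column].dtype in ['float64', 'int64']:  # Check if the column is numerical
            # Assign the result back: calling fillna(inplace=True) on a selected
            # column is unreliable and deprecated in recent pandas versions
            df[column] = df[column].fillna(df[column].mean())
    return df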


In [None]:
df2023_eng2 = fill_missing_with_mean(df2023_eng2)

In [None]:
df2023_eng2.head(10)

Unnamed: 0,HomeTeam,AwayTeam,B365H,B365D,B365A,HomeTeam_WinRate,AwayTeam_WinRate,HomeTeam_GoalsAvg,AwayTeam_GoalsAvg,HomeTeam_goals_conceded_avg,...,H_goal_ratio,A_goal_ratio,attack_strength_home_team,attack_strength_away_team,adjusted_win_lost_ratio_H,adjusted_win_lost_ratio_A,Year,Broker_prob_H,Broker_prob_D,Broker_prob_A
0,Accrington,Charlton,2.55,3.4,2.7,0.52,0.3,1.78,1.0,1.43,...,0.33,0.26,27.45,19.08,1.7,0.3,2023,0.39,0.29,0.37
1,Bristol Rvs,Forest Green,2.3,3.3,3.2,0.61,0.39,1.65,1.78,0.87,...,0.38,0.35,27.97,37.09,0.0,0.0,2023,0.43,0.3,0.31
2,Cambridge,Milton Keynes Dons,4.0,3.5,1.9,0.35,0.59,1.22,1.95,1.26,...,0.37,0.37,18.75,35.68,-1.0,3.0,2023,0.25,0.29,0.53
3,Cheltenham,Peterboro,3.1,3.4,2.3,0.43,0.13,1.43,0.7,1.3,...,0.32,0.3,22.1,14.65,0.0,0.0,2023,0.32,0.29,0.43
4,Derby,Oxford,2.2,3.4,3.3,0.48,0.39,1.3,1.52,0.96,...,0.33,0.35,21.38,29.04,0.0,0.0,2023,0.45,0.29,0.3
5,Ipswich,Bolton,1.9,3.4,4.2,0.45,0.38,1.68,1.19,1.0,...,0.45,0.33,24.77,20.74,0.3,1.7,2023,0.53,0.29,0.24
6,Lincoln,Exeter,2.37,3.2,3.2,0.32,0.39,1.09,1.22,1.23,...,0.24,0.23,16.07,25.33,0.0,0.0,2023,0.42,0.31,0.31
7,Morecambe,Shrewsbury,2.9,3.3,2.45,0.3,0.13,1.43,0.74,1.52,...,0.37,0.23,22.1,14.11,1.0,1.0,2023,0.34,0.3,0.41
8,Plymouth,Barnsley,2.1,3.4,3.6,0.55,0.04,1.2,0.65,0.95,...,0.26,0.2,16.07,13.73,-1.0,3.0,2023,0.48,0.29,0.28
9,Port Vale,Fleetwood Town,1.95,3.4,4.0,0.48,0.14,1.52,1.27,0.96,...,0.42,0.32,25.76,23.23,2.0,0.0,2023,0.51,0.29,0.25


In [None]:
train = eng2[eng2['Year'] < 2022]
validation = eng2[eng2['Year'] == 2022]

In [None]:
train = remove_rows_with_inf(train)

In [None]:
X_train = train.drop(['FTR', 'total_goal'], axis=1)
y_train = train['FTR']
X_validation = validation.drop(['FTR', 'total_goal'], axis=1)
y_validation = validation['FTR']

In [None]:
X_test = df2023_eng2.copy()
X_test = X_test[X_train.columns]

In [None]:
X_train.shape , y_train.shape, X_validation.shape, y_validation.shape, X_test.shape

((3087, 21), (3087,), (541, 21), (541,), (552, 21))

In [None]:
y_train_enc = y_train.map({'H': 1, 'D': 0, 'A': 2})
y_validation_enc = y_validation.map({'H': 1, 'D': 0, 'A': 2})

In [None]:

from sklearn.preprocessing import LabelEncoder
# Update the set of all teams to include teams from X_test
all_teams = set(X_train['HomeTeam'].unique()).union(set(X_train['AwayTeam'].unique()))
all_teams = all_teams.union(set(X_validation['HomeTeam'].unique())).union(set(X_validation['AwayTeam'].unique()))
all_teams = all_teams.union(set(X_test['HomeTeam'].unique())).union(set(X_test['AwayTeam'].unique()))

# Convert the set to a list
all_teams_list = list(all_teams)

# Fit the LabelEncoder with the updated list of all teams
encoder = LabelEncoder()
encoder.fit(all_teams_list)

# Transform 'HomeTeam' and 'AwayTeam' in all datasets
X_train['HomeTeam'] = encoder.transform(X_train['HomeTeam'])
X_train['AwayTeam'] = encoder.transform(X_train['AwayTeam'])
X_validation['HomeTeam'] = encoder.transform(X_validation['HomeTeam'])
X_validation['AwayTeam'] = encoder.transform(X_validation['AwayTeam'])
X_test['HomeTeam'] = encoder.transform(X_test['HomeTeam'])
X_test['AwayTeam'] = encoder.transform(X_test['AwayTeam'])

In [None]:
from lazypredict.Supervised import LazyClassifier
clf = LazyClassifier(verbose=0, ignore_warnings=False, custom_metric=None)
models, pred = clf.fit(X_train, X_validation, y_train_enc, y_validation_enc)

  7%|▋         | 2/29 [00:00<00:05,  4.65it/s]

ROC AUC couldn't be calculated for AdaBoostClassifier
multi_class must be in ('ovo', 'ovr')
ROC AUC couldn't be calculated for BaggingClassifier
multi_class must be in ('ovo', 'ovr')
ROC AUC couldn't be calculated for BernoulliNB
multi_class must be in ('ovo', 'ovr')


 14%|█▍        | 4/29 [00:03<00:22,  1.13it/s]

ROC AUC couldn't be calculated for CalibratedClassifierCV
multi_class must be in ('ovo', 'ovr')
CategoricalNB model failed to execute
Negative values in data passed to CategoricalNB (input X)
ROC AUC couldn't be calculated for DecisionTreeClassifier
multi_class must be in ('ovo', 'ovr')
ROC AUC couldn't be calculated for DummyClassifier
multi_class must be in ('ovo', 'ovr')
ROC AUC couldn't be calculated for ExtraTreeClassifier
multi_class must be in ('ovo', 'ovr')


 31%|███       | 9/29 [00:03<00:06,  2.89it/s]

ROC AUC couldn't be calculated for ExtraTreesClassifier
multi_class must be in ('ovo', 'ovr')
ROC AUC couldn't be calculated for GaussianNB
multi_class must be in ('ovo', 'ovr')


 38%|███▊      | 11/29 [00:03<00:05,  3.54it/s]

ROC AUC couldn't be calculated for KNeighborsClassifier
multi_class must be in ('ovo', 'ovr')


 41%|████▏     | 12/29 [00:04<00:05,  2.93it/s]

ROC AUC couldn't be calculated for LabelPropagation
multi_class must be in ('ovo', 'ovr')


 48%|████▊     | 14/29 [00:05<00:05,  2.76it/s]

ROC AUC couldn't be calculated for LabelSpreading
multi_class must be in ('ovo', 'ovr')
ROC AUC couldn't be calculated for LinearDiscriminantAnalysis
multi_class must be in ('ovo', 'ovr')


 55%|█████▌    | 16/29 [00:06<00:05,  2.49it/s]

ROC AUC couldn't be calculated for LinearSVC
multi_class must be in ('ovo', 'ovr')
ROC AUC couldn't be calculated for LogisticRegression
multi_class must be in ('ovo', 'ovr')
ROC AUC couldn't be calculated for NearestCentroid
multi_class must be in ('ovo', 'ovr')


 62%|██████▏   | 18/29 [00:07<00:04,  2.37it/s]

ROC AUC couldn't be calculated for NuSVC
multi_class must be in ('ovo', 'ovr')
ROC AUC couldn't be calculated for PassiveAggressiveClassifier
multi_class must be in ('ovo', 'ovr')
ROC AUC couldn't be calculated for Perceptron
multi_class must be in ('ovo', 'ovr')
ROC AUC couldn't be calculated for QuadraticDiscriminantAnalysis
multi_class must be in ('ovo', 'ovr')


 76%|███████▌  | 22/29 [00:08<00:02,  3.07it/s]

ROC AUC couldn't be calculated for RandomForestClassifier
multi_class must be in ('ovo', 'ovr')
ROC AUC couldn't be calculated for RidgeClassifier
multi_class must be in ('ovo', 'ovr')
ROC AUC couldn't be calculated for RidgeClassifierCV
multi_class must be in ('ovo', 'ovr')


 86%|████████▌ | 25/29 [00:08<00:00,  4.22it/s]

ROC AUC couldn't be calculated for SGDClassifier
multi_class must be in ('ovo', 'ovr')


 90%|████████▉ | 26/29 [00:08<00:00,  3.82it/s]

ROC AUC couldn't be calculated for SVC
multi_class must be in ('ovo', 'ovr')
StackingClassifier model failed to execute
StackingClassifier.__init__() missing 1 required positional argument: 'estimators'


 97%|█████████▋| 28/29 [00:09<00:00,  4.28it/s]

ROC AUC couldn't be calculated for XGBClassifier
multi_class must be in ('ovo', 'ovr')
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000186 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 1010
[LightGBM] [Info] Number of data points in the train set: 3087, number of used features: 21
[LightGBM] [Info] Start training from score -1.345356
[LightGBM] [Info] Start training from score -0.843526
[LightGBM] [Info] Start training from score -1.173244


100%|██████████| 29/29 [00:09<00:00,  3.04it/s]

ROC AUC couldn't be calculated for LGBMClassifier
multi_class must be in ('ovo', 'ovr')





In [None]:
models

Model,Accuracy,Balanced Accuracy,ROC AUC,F1 Score,Time Taken
BaggingClassifier,0.7,0.67,,0.7,0.19
ExtraTreesClassifier,0.72,0.67,,0.7,0.45
LGBMClassifier,0.7,0.66,,0.69,0.29
CalibratedClassifierCV,0.71,0.66,,0.66,2.54
LogisticRegression,0.69,0.66,,0.68,0.12
LinearSVC,0.71,0.66,,0.66,0.92
RandomForestClassifier,0.7,0.65,,0.68,0.89
SGDClassifier,0.7,0.65,,0.67,0.17
XGBClassifier,0.69,0.65,,0.68,0.33
LinearDiscriminantAnalysis,0.68,0.64,,0.66,0.1
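The `ROC AUC` column is empty because the target has three classes and, as the warnings state, the metric needs `multi_class` set to `'ovo'` or `'ovr'`. One way to obtain the figure is to compute it directly with scikit-learn from predicted class probabilities. The sketch below uses synthetic data from `make_classification` and a `LogisticRegression` as stand-ins for the match features and models, so the printed value says nothing about the football data.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic three-class data standing in for the match features (assumption).
X, y = make_classification(
    n_samples=600, n_features=10, n_informative=5, n_classes=3, random_state=42
)
X_tr, X_val, y_tr, y_val = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
proba = model.predict_proba(X_val)  # shape (n_samples, 3), each row sums to 1

# multi_class="ovr" scores each class against the rest and averages the results.
auc_ovr = roc_auc_score(y_val, proba, multi_class="ovr", average="macro")
print(f"One-vs-rest macro ROC AUC: {auc_ovr:.3f}")
```

The metric needs probabilities rather than hard labels, so only models exposing `predict_proba` can be scored this way.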


In [None]:
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier, ExtraTreesClassifier
from sklearn.metrics import accuracy_score, f1_score
from sklearn.tree import DecisionTreeClassifier
from xgboost import XGBClassifier

# Initialize the models
bagging_clf = BaggingClassifier(n_estimators=100, random_state=42)
extra_trees_clf = ExtraTreesClassifier(n_estimators=100, random_state=42)
random_forest_clf = RandomForestClassifier(n_estimators=100, random_state=42)
decision_tree_clf = DecisionTreeClassifier(random_state=42)
xgb_clf = XGBClassifier(random_state=42)  # Default parameters, adjust as necessary

# Fit the models
bagging_clf.fit(X_train, y_train_enc)
extra_trees_clf.fit(X_train, y_train_enc)
random_forest_clf.fit(X_train, y_train_enc)
decision_tree_clf.fit(X_train, y_train_enc)
xgb_clf.fit(X_train, y_train_enc)

# Predict y_validation
y_pred_bagging = bagging_clf.predict(X_validation)
y_pred_extra_trees = extra_trees_clf.predict(X_validation)
y_pred_random_forest = random_forest_clf.predict(X_validation)
y_pred_decision_tree = decision_tree_clf.predict(X_validation)
y_pred_xgb = xgb_clf.predict(X_validation)

# Calculate accuracy and F1 score for each model
accuracy_bagging = accuracy_score(y_validation_enc, y_pred_bagging)
f1_bagging = f1_score(y_validation_enc, y_pred_bagging, average='macro')

accuracy_extra_trees = accuracy_score(y_validation_enc, y_pred_extra_trees)
f1_extra_trees = f1_score(y_validation_enc, y_pred_extra_trees, average='macro')

accuracy_random_forest = accuracy_score(y_validation_enc, y_pred_random_forest)
f1_random_forest = f1_score(y_validation_enc, y_pred_random_forest, average='macro')

accuracy_decision_tree = accuracy_score(y_validation_enc, y_pred_decision_tree)
f1_decision_tree = f1_score(y_validation_enc, y_pred_decision_tree, average='macro')

accuracy_xgb = accuracy_score(y_validation_enc, y_pred_xgb)
f1_xgb = f1_score(y_validation_enc, y_pred_xgb, average='macro')

# Print out the performance
print(f'Bagging Classifier - Accuracy: {accuracy_bagging}, F1 Score: {f1_bagging}')
print(f'Extra Trees Classifier - Accuracy: {accuracy_extra_trees}, F1 Score: {f1_extra_trees}')
print(f'Random Forest Classifier - Accuracy: {accuracy_random_forest}, F1 Score: {f1_random_forest}')
print(f'Decision Tree Classifier - Accuracy: {accuracy_decision_tree}, F1 Score: {f1_decision_tree}')
print(f'XGB Classifier - Accuracy: {accuracy_xgb}, F1 Score: {f1_xgb}')


Bagging Classifier - Accuracy: 0.7264325323475046, F1 Score: 0.6902399155727936
Extra Trees Classifier - Accuracy: 0.7153419593345656, F1 Score: 0.6691941102526071
Random Forest Classifier - Accuracy: 0.7042513863216266, F1 Score: 0.6448098635438578
Decision Tree Classifier - Accuracy: 0.6598890942698706, F1 Score: 0.6370709292948488
XGB Classifier - Accuracy: 0.6894639556377079, F1 Score: 0.6450806271428021


In [None]:
from sklearn.model_selection import GridSearchCV

param_grid_xgb = {
    'n_estimators': [50, 100, 200],
    'max_depth': [3, 6, 10],
    'learning_rate': [0.01, 0.1, 0.2],
    'subsample': [0.5, 0.7, 1],
    'colsample_bytree': [0.5, 0.7, 1]
}

param_grid_bagging = {
    'n_estimators': [50, 100, 200],
    'max_samples': [0.5, 1.0],
    'max_features': [0.5, 1.0],
    'bootstrap': [True, False],
    'bootstrap_features': [True, False]
}


param_grid_extra_trees = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}



grid_search_xgb = GridSearchCV(xgb_clf, param_grid_xgb, cv=5, scoring='f1_macro', verbose=1, n_jobs=-1)
grid_search_bagging = GridSearchCV(bagging_clf, param_grid_bagging, cv=5, scoring='f1_macro', verbose=1, n_jobs=-1)
grid_search_extra_trees = GridSearchCV(extra_trees_clf, param_grid_extra_trees, cv=5, scoring='f1_macro', verbose=1, n_jobs=-1)

grid_search_xgb.fit(X_train, y_train_enc)
grid_search_bagging.fit(X_train, y_train_enc)
grid_search_extra_trees.fit(X_train, y_train_enc)

best_params_xgb = grid_search_xgb.best_params_
best_score_xgb = grid_search_xgb.best_score_

best_params_bagging = grid_search_bagging.best_params_
best_score_bagging = grid_search_bagging.best_score_

best_params_extra_trees = grid_search_extra_trees.best_params_
best_score_extra_trees = grid_search_extra_trees.best_score_


y_pred_xgb = grid_search_xgb.best_estimator_.predict(X_validation)
f1_score_xgb = f1_score(y_validation_enc, y_pred_xgb, average='macro')

y_pred_bagging = grid_search_bagging.best_estimator_.predict(X_validation)
f1_score_bagging = f1_score(y_validation_enc, y_pred_bagging, average='macro')

y_pred_extra_trees = grid_search_extra_trees.best_estimator_.predict(X_validation)
f1_score_extra_trees = f1_score(y_validation_enc, y_pred_extra_trees, average='macro')


results = {
    "XGB Classifier": {
        "Best Parameters": best_params_xgb,
        "Best Score": best_score_xgb,
        "F1 Score on Validation": f1_score_xgb
    },
    "Bagging Classifier": {
        "Best Parameters": best_params_bagging,
        "Best Score": best_score_bagging,
        "F1 Score on Validation": f1_score_bagging
    },
    "Extra Trees Classifier": {
        "Best Parameters": best_params_extra_trees,
        "Best Score": best_score_extra_trees,
        "F1 Score on Validation": f1_score_extra_trees
    }
}




Fitting 5 folds for each of 243 candidates, totalling 1215 fits
Fitting 5 folds for each of 48 candidates, totalling 240 fits
Fitting 5 folds for each of 108 candidates, totalling 540 fits
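The candidate counts in the log follow from the grid sizes: each count is the product of the number of values per parameter, and the number of fits is that count times the five folds. As a quick check, scikit-learn's `ParameterGrid` can enumerate the same grids; the dictionary name `grids` below is only illustrative.

```python
from sklearn.model_selection import ParameterGrid

# The three grids from the cell above, copied verbatim
grids = {
    "xgb": {
        "n_estimators": [50, 100, 200],
        "max_depth": [3, 6, 10],
        "learning_rate": [0.01, 0.1, 0.2],
        "subsample": [0.5, 0.7, 1],
        "colsample_bytree": [0.5, 0.7, 1],
    },
    "bagging": {
        "n_estimators": [50, 100, 200],
        "max_samples": [0.5, 1.0],
        "max_features": [0.5, 1.0],
        "bootstrap": [True, False],
        "bootstrap_features": [True, False],
    },
    "extra_trees": {
        "n_estimators": [50, 100, 200],
        "max_depth": [None, 10, 20, 30],
        "min_samples_split": [2, 5, 10],
        "min_samples_leaf": [1, 2, 4],
    },
}

cv_folds = 5
counts = {name: len(ParameterGrid(grid)) for name, grid in grids.items()}
for name, n in counts.items():
    print(f"{name}: {n} candidates, {n * cv_folds} fits")
```

Counting candidates before launching a search is a cheap way to estimate run time, since every extra value in one parameter multiplies the total.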


In [None]:
results

{'XGB Classifier': {'Best Parameters': {'colsample_bytree': 1,
   'learning_rate': 0.01,
   'max_depth': 10,
   'n_estimators': 200,
   'subsample': 0.5},
  'Best Score': 0.6642928593728763,
  'F1 Score on Validation': 0.6726827814892725},
 'Bagging Classifier': {'Best Parameters': {'bootstrap': False,
   'bootstrap_features': True,
   'max_features': 0.5,
   'max_samples': 0.5,
   'n_estimators': 50},
  'Best Score': 0.6575368663024754,
  'F1 Score on Validation': 0.6637285934333448},
 'Extra Trees Classifier': {'Best Parameters': {'max_depth': 20,
   'min_samples_leaf': 1,
   'min_samples_split': 10,
   'n_estimators': 100},
  'Best Score': 0.6559756967500191,
  'F1 Score on Validation': 0.6623357575853291}}

In [None]:
optimal_xgb_clf = XGBClassifier(
    colsample_bytree=1,
    learning_rate=0.01,
    max_depth=10,
    n_estimators=200,
    subsample=0.5,
    random_state=42  # Optional for reproducibility
)

In [None]:
optimal_xgb_clf.fit(X_train, y_train_enc)

In [None]:
y_pred_test_xgb = optimal_xgb_clf.predict(X_test)

In [None]:
# Define the inverse mapping
inverse_mapping = {1: 'H', 0: 'D', 2: 'A'}

# Convert y_pred_test_xgb back to original form
y_pred_test_xgb_original = [inverse_mapping[label] for label in y_pred_test_xgb]
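
The inverse mapping above is typed by hand, so it could drift from the forward mapping used to encode `FTR`. A minimal sketch of deriving one from the other is shown here; `label_mapping` and `encoded_predictions` are illustrative names and values, not objects from this notebook.

```python
# Forward mapping, as used when encoding the FTR column
label_mapping = {'H': 1, 'D': 0, 'A': 2}

# Derive the inverse instead of retyping it
inverse_mapping = {code: label for label, code in label_mapping.items()}

# If two labels shared a code, the inverse would silently lose one of them
assert len(inverse_mapping) == len(label_mapping)

# Hypothetical encoded predictions, standing in for the model output
encoded_predictions = [1, 0, 2, 1]
decoded_predictions = [inverse_mapping[code] for code in encoded_predictions]
print(decoded_predictions)  # ['H', 'D', 'A', 'H']
```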


In [None]:
predictions_df = pd.DataFrame(y_pred_test_xgb_original, columns=['Predictions'])

In [None]:
predictions_df.to_csv('england_2.csv', index=False)

In [None]:
train = eng2[eng2['Year'] < 2022]
validation = eng2[eng2['Year'] == 2022]

In [None]:
X_train = train.drop(['FTR', 'total_goal'], axis=1)
y_train = train['total_goal']
X_validation = validation.drop(['FTR', 'total_goal'], axis=1)
y_validation = validation['total_goal']

In [None]:
X_test = df2023_eng2.copy()
X_test = X_test[X_train.columns]

In [None]:
X_train.shape , y_train.shape, X_validation.shape, y_validation.shape, X_test.shape

((3087, 21), (3087,), (541, 21), (541,), (552, 21))

In [None]:

from sklearn.preprocessing import LabelEncoder
# Update the set of all teams to include teams from X_test
all_teams = set(X_train['HomeTeam'].unique()).union(set(X_train['AwayTeam'].unique()))
all_teams = all_teams.union(set(X_validation['HomeTeam'].unique())).union(set(X_validation['AwayTeam'].unique()))
all_teams = all_teams.union(set(X_test['HomeTeam'].unique())).union(set(X_test['AwayTeam'].unique()))

# Convert the set to a list
all_teams_list = list(all_teams)

# Fit the LabelEncoder with the updated list of all teams
encoder = LabelEncoder()
encoder.fit(all_teams_list)

# Transform 'HomeTeam' and 'AwayTeam' in all datasets
X_train['HomeTeam'] = encoder.transform(X_train['HomeTeam'])
X_train['AwayTeam'] = encoder.transform(X_train['AwayTeam'])
X_validation['HomeTeam'] = encoder.transform(X_validation['HomeTeam'])
X_validation['AwayTeam'] = encoder.transform(X_validation['AwayTeam'])
X_test['HomeTeam'] = encoder.transform(X_test['HomeTeam'])
X_test['AwayTeam'] = encoder.transform(X_test['AwayTeam'])

In [None]:
from lazypredict.Supervised import LazyRegressor

# Create an instance of LazyRegressor
reg = LazyRegressor(verbose=0, ignore_warnings=False, custom_metric=None)

# Fit the model
models, predictions = reg.fit(X_train, X_validation, y_train, y_validation)

 21%|██▏       | 9/42 [00:05<00:33,  1.01s/it]

GammaRegressor model failed to execute
Some value(s) of y are out of the valid range of the loss 'HalfGammaLoss'.


 74%|███████▍  | 31/42 [00:15<00:04,  2.54it/s]

QuantileRegressor model failed to execute
Solver interior-point is not anymore available in SciPy >= 1.11.0.


100%|██████████| 42/42 [00:20<00:00,  2.06it/s]

[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000721 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 1010
[LightGBM] [Info] Number of data points in the train set: 3087, number of used features: 21
[LightGBM] [Info] Start training from score 2.609653





In [None]:
models

Model,Adjusted R-Squared,R-Squared,RMSE,Time Taken
HuberRegressor,0.07,0.11,1.55,0.09
LarsCV,0.07,0.1,1.56,0.06
LassoLarsCV,0.07,0.1,1.56,0.07
LassoCV,0.07,0.1,1.56,0.24
LassoLarsIC,0.07,0.1,1.56,0.03
ElasticNetCV,0.07,0.1,1.56,0.56
LinearSVR,0.07,0.1,1.56,0.1
OrthogonalMatchingPursuitCV,0.07,0.1,1.56,0.03
BayesianRidge,0.06,0.1,1.56,0.06
RidgeCV,0.06,0.1,1.56,0.03


In [None]:
from sklearn.linear_model import ElasticNetCV
from sklearn.svm import SVR
from sklearn.ensemble import RandomForestRegressor
from xgboost import XGBRegressor
from sklearn.linear_model import PoissonRegressor
from sklearn.metrics import mean_absolute_error, r2_score

# Initialize the models
elastic_net_cv = ElasticNetCV(cv=5, random_state=42)  # Adjust parameters as necessary
poisson_regressor = PoissonRegressor()
svr = SVR()  # Default parameters, adjust as necessary
random_forest_reg = RandomForestRegressor(n_estimators=100, random_state=42)
xgb_reg = XGBRegressor(random_state=42)  # Default parameters, adjust as necessary

# Fit the models
elastic_net_cv.fit(X_train, y_train)
poisson_regressor.fit(X_train, y_train)
svr.fit(X_train, y_train)
random_forest_reg.fit(X_train, y_train)
xgb_reg.fit(X_train, y_train)

# Predict on the validation set
y_pred_elastic_net_cv = elastic_net_cv.predict(X_validation)
y_pred_poisson_regressor = poisson_regressor.predict(X_validation)
y_pred_svr = svr.predict(X_validation)
y_pred_random_forest = random_forest_reg.predict(X_validation)
y_pred_xgb = xgb_reg.predict(X_validation)

# Round predictions to the nearest integer (goal totals are whole numbers)
y_pred_elastic_net_cv_rounded = np.rint(y_pred_elastic_net_cv)
y_pred_poisson_regressor_rounded = np.rint(y_pred_poisson_regressor)
y_pred_svr_rounded = np.rint(y_pred_svr)
y_pred_random_forest_rounded = np.rint(y_pred_random_forest)
y_pred_xgb_rounded = np.rint(y_pred_xgb)

# Calculate MAE and R2 score using rounded predictions
mae_elastic_net_cv = mean_absolute_error(y_validation, y_pred_elastic_net_cv_rounded)
r2_elastic_net_cv = r2_score(y_validation, y_pred_elastic_net_cv_rounded)

mae_poisson_regressor = mean_absolute_error(y_validation, y_pred_poisson_regressor_rounded)
r2_poisson_regressor = r2_score(y_validation, y_pred_poisson_regressor_rounded)

mae_svr = mean_absolute_error(y_validation, y_pred_svr_rounded)
r2_svr = r2_score(y_validation, y_pred_svr_rounded)

mae_random_forest = mean_absolute_error(y_validation, y_pred_random_forest_rounded)
r2_random_forest = r2_score(y_validation, y_pred_random_forest_rounded)

mae_xgb = mean_absolute_error(y_validation, y_pred_xgb_rounded)
r2_xgb = r2_score(y_validation, y_pred_xgb_rounded)

# Print out the performance with rounded predictions
print(f'ElasticNetCV - MAE: {mae_elastic_net_cv}, R2 Score: {r2_elastic_net_cv}')
print(f'Poisson Regressor - MAE: {mae_poisson_regressor}, R2 Score: {r2_poisson_regressor}')
print(f'SVR - MAE: {mae_svr}, R2 Score: {r2_svr}')
print(f'Random Forest Regressor - MAE: {mae_random_forest}, R2 Score: {r2_random_forest}')
print(f'XGB Regressor - MAE: {mae_xgb}, R2 Score: {r2_xgb}')

ElasticNetCV - MAE: 1.2384473197781884, R2 Score: 0.07096787227702273
Poisson Regressor - MAE: 1.268022181146026, R2 Score: -0.00963932704011805
SVR - MAE: 1.3049907578558226, R2 Score: -0.18041729169507703
Random Forest Regressor - MAE: 1.2754158964879851, R2 Score: 0.021783818456394433
XGB Regressor - MAE: 1.3475046210720887, R2 Score: -0.15104348177442417
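
An R2 score close to zero means a model is barely better than always predicting the mean, so it helps to put a constant baseline next to these numbers. The sketch below uses plain Python and made-up goal totals; `train_goals`, `val_goals`, `mae` and `r2` are illustrative names, not part of the notebook.

```python
from statistics import mean, median

def mae(y_true, y_pred):
    """Mean absolute error between two equal-length sequences."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_bar = mean(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - y_bar) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

# Hypothetical total-goal counts (not taken from the dataset)
train_goals = [2, 3, 1, 4, 2, 0, 3, 5, 2, 3]
val_goals = [1, 2, 4, 3, 2, 6]

# Constant baselines learned from the training targets only
mean_baseline = [mean(train_goals)] * len(val_goals)
median_baseline = [median(train_goals)] * len(val_goals)

baseline_mae = mae(val_goals, mean_baseline)
print(f'Mean baseline   - MAE: {baseline_mae:.3f}, R2: {r2(val_goals, mean_baseline):.3f}')
print(f'Median baseline - MAE: {mae(val_goals, median_baseline):.3f}')
```

If the tuned regressors do not clearly beat such a baseline on the real validation season, the pre-match features probably carry little signal for total goals.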


In [None]:
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import mean_absolute_error

# Parameter grids
param_grid_poisson = {
    'alpha': [0.01, 0.1, 1, 10],
    'max_iter': [100, 300, 500]
}

param_grid_random_forest = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

# GridSearchCV setup
grid_search_poisson = GridSearchCV(poisson_regressor, param_grid_poisson, cv=5, scoring='neg_mean_absolute_error', verbose=1, n_jobs=-1)
grid_search_random_forest = GridSearchCV(random_forest_reg, param_grid_random_forest, cv=5, scoring='neg_mean_absolute_error', verbose=1, n_jobs=-1)

# Fitting models
grid_search_poisson.fit(X_train, y_train)
grid_search_random_forest.fit(X_train, y_train)

# Best parameters and scores
best_params_poisson = grid_search_poisson.best_params_
best_score_poisson = grid_search_poisson.best_score_

best_params_random_forest = grid_search_random_forest.best_params_
best_score_random_forest = grid_search_random_forest.best_score_

# Predict and calculate MAE
y_pred_poisson = grid_search_poisson.best_estimator_.predict(X_validation)
mae_poisson = mean_absolute_error(y_validation, y_pred_poisson)

y_pred_random_forest = grid_search_random_forest.best_estimator_.predict(X_validation)
mae_random_forest = mean_absolute_error(y_validation, y_pred_random_forest)

# Results
results = {
    "Poisson Regressor": {
        "Best Parameters": best_params_poisson,
        "Best Score (Negative MAE)": best_score_poisson,
        "MAE on Validation": mae_poisson
    },
    "Random Forest Regressor": {
        "Best Parameters": best_params_random_forest,
        "Best Score (Negative MAE)": best_score_random_forest,
        "MAE on Validation": mae_random_forest
    }
}

# ElasticNetCV already uses cross-validation for parameter tuning, so we directly fit it and predict
elastic_net_cv = ElasticNetCV(cv=5, random_state=42).fit(X_train, y_train)
y_pred_elastic_net_cv = elastic_net_cv.predict(X_validation)
mae_elastic_net_cv = mean_absolute_error(y_validation, y_pred_elastic_net_cv)

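# Note: get_params() returns the constructor arguments, not the values selected by
# cross-validation; the chosen regularization strength is stored in elastic_net_cv.alpha_
results["ElasticNetCV"] = {
    "Best Parameters": elastic_net_cv.get_params(),
    "MAE on Validation": mae_elastic_net_cv
}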

# Print results
for model, info in results.items():
    print(f"{model}:")
    for key, value in info.items():
        print(f"  {key}: {value}")
    print()

Fitting 5 folds for each of 12 candidates, totalling 60 fits
Fitting 5 folds for each of 108 candidates, totalling 540 fits
Poisson Regressor:
  Best Parameters: {'alpha': 0.01, 'max_iter': 500}
  Best Score (Negative MAE): -1.225078562380644
  MAE on Validation: 1.2412785384895986

Random Forest Regressor:
  Best Parameters: {'max_depth': 10, 'min_samples_leaf': 4, 'min_samples_split': 10, 'n_estimators': 100}
  Best Score (Negative MAE): -1.241475374075374
  MAE on Validation: 1.275200364342873

ElasticNetCV:
  Best Parameters: {'alphas': None, 'copy_X': True, 'cv': 5, 'eps': 0.001, 'fit_intercept': True, 'l1_ratio': 0.5, 'max_iter': 1000, 'n_alphas': 100, 'n_jobs': None, 'positive': False, 'precompute': 'auto', 'random_state': 42, 'selection': 'cyclic', 'tol': 0.0001, 'verbose': 0}
  MAE on Validation: 1.2519290654884558



In [None]:

# Setting the best parameters for PoissonRegressor
best_params_poisson = {
    'alpha': 0.01,
    'max_iter': 500
}

# Initializing and fitting the PoissonRegressor with the best parameters
poisson_model = PoissonRegressor(**best_params_poisson)
poisson_model.fit(X_train, y_train)

# Making predictions on the test set
y_pred_test_poisson = poisson_model.predict(X_test)

# Rounding the predictions to the nearest integer and converting to int type
y_pred_test_poisson_rounded = np.rint(y_pred_test_poisson).astype(int)

# The variable y_pred_test_poisson_rounded contains the final integer predictions for X_test
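
Rounding the predicted mean is one way to turn a Poisson rate into an integer goal count, but the rounded mean is not always the most probable count. The sketch below compares the two for a single assumed rate; `lam`, `poisson_pmf` and `most_likely_count` are illustrative and not part of the notebook, and which choice is better depends on the metric being scored.

```python
import math

def poisson_pmf(k, lam):
    """Probability of observing exactly k events under a Poisson(lam) model."""
    return math.exp(-lam) * lam ** k / math.factorial(k)

def most_likely_count(lam, max_goals=15):
    """Count with the highest Poisson probability, searched over 0..max_goals."""
    return max(range(max_goals + 1), key=lambda k: poisson_pmf(k, lam))

lam = 2.6                      # assumed predicted mean of total goals
rounded = round(lam)           # nearest integer to the mean
mode = most_likely_count(lam)  # most probable single outcome

print(rounded, mode)  # 3 2
```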


In [None]:


# Converting the predictions to a DataFrame
predictions_df_poisson = pd.DataFrame(y_pred_test_poisson_rounded, columns=['Predicted_Total_Goals'])

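# Saving the DataFrame to a CSV file
# Note: 'england_2.csv' is the same filename used earlier for the FTR predictions,
# so this call overwrites that file
predictions_df_poisson.to_csv('england_2.csv', index=False)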


Apply the same feature engineering to Eng3

In [None]:
df2023_eng3.head()

Unnamed: 0,Div,Date,Time,HomeTeam,AwayTeam,Referee,B365H,B365D,B365A,BWH,...,AvgC<2.5,AHCh,B365CAHH,B365CAHA,PCAHH,PCAHA,MaxCAHH,MaxCAHA,AvgCAHH,AvgCAHA
0,E3,30/07/2022,15:00,AFC Wimbledon,Gillingham,S Mather,2.37,3.3,3.1,2.37,...,1.48,-0.25,2.05,1.8,2.09,1.81,2.13,1.81,2.06,1.76
1,E3,30/07/2022,15:00,Bradford,Doncaster,B Madden,1.85,3.6,4.2,1.85,...,1.94,-0.75,1.9,1.95,1.9,1.99,1.96,1.99,1.9,1.91
2,E3,30/07/2022,15:00,Carlisle,Crawley Town,T Kirk,2.4,3.2,3.0,2.37,...,1.77,0.0,1.8,2.05,1.88,2.02,1.88,2.11,1.79,2.03
3,E3,30/07/2022,15:00,Harrogate,Swindon,T Reeves,3.4,3.5,2.1,3.1,...,2.11,0.25,1.77,2.1,1.77,2.14,1.84,2.14,1.76,2.06
4,E3,30/07/2022,15:00,Leyton Orient,Grimsby,C Pollard,2.2,3.25,3.4,2.15,...,1.66,-0.25,1.8,2.05,1.79,2.1,1.86,2.1,1.8,2.02


In [None]:
df2023_eng3 = df2023_eng3[columns_test]

In [None]:
df2023_eng3.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 552 entries, 0 to 551
Data columns (total 6 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Date      552 non-null    object 
 1   HomeTeam  552 non-null    object 
 2   AwayTeam  552 non-null    object 
 3   B365H     551 non-null    float64
 4   B365D     551 non-null    float64
 5   B365A     551 non-null    float64
dtypes: float64(3), object(3)
memory usage: 26.0+ KB


In [None]:
df2023_eng3 = fill_missing_with_mean(df2023_eng3)

In [None]:
#season_dfs_4 = [df20163, df20173, df20183, df20193, df20203, df20213, df20223]

In [None]:
# Apply the overall historical averages to df2023_eng3
df2023_eng3 = calculate_and_apply_overall_averages(season_dfs_3, df2023_eng3)

In [None]:
df2023_eng3.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 552 entries, 0 to 551
Data columns (total 16 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   Date                         552 non-null    object 
 1   HomeTeam                     552 non-null    object 
 2   AwayTeam                     552 non-null    object 
 3   B365H                        552 non-null    float64
 4   B365D                        552 non-null    float64
 5   B365A                        552 non-null    float64
 6   HomeTeam_WinRate             529 non-null    float64
 7   AwayTeam_WinRate             529 non-null    float64
 8   HomeTeam_GoalsAvg            529 non-null    float64
 9   AwayTeam_GoalsAvg            529 non-null    float64
 10  HomeTeam_goals_conceded_avg  529 non-null    float64
 11  AwayTeam_goals_conceded_avg  529 non-null    float64
 12  H_goal_ratio                 529 non-null    float64
 13  A_goal_ratio        

In [None]:
merged_df = eng3.copy()

# Calculate head-to-head stats using merged data
head_to_head_stats = calculate_head_to_head_stats(merged_df)

# Apply the adjusted win-loss ratio to df2023_eng3
df2023_eng3 = apply_adjusted_win_loss_ratio_to_2023(df2023_eng3, head_to_head_stats)

In [None]:
df2023_eng3.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 552 entries, 0 to 551
Data columns (total 18 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   Date                         552 non-null    object 
 1   HomeTeam                     552 non-null    object 
 2   AwayTeam                     552 non-null    object 
 3   B365H                        552 non-null    float64
 4   B365D                        552 non-null    float64
 5   B365A                        552 non-null    float64
 6   HomeTeam_WinRate             529 non-null    float64
 7   AwayTeam_WinRate             529 non-null    float64
 8   HomeTeam_GoalsAvg            529 non-null    float64
 9   AwayTeam_GoalsAvg            529 non-null    float64
 10  HomeTeam_goals_conceded_avg  529 non-null    float64
 11  AwayTeam_goals_conceded_avg  529 non-null    float64
 12  H_goal_ratio                 529 non-null    float64
 13  A_goal_ratio        

In [None]:
df2023_eng3 = fill_missing_with_mean(df2023_eng3)

In [None]:
df2023_eng3 = process_time_data(df2023_eng3, 2023)

In [None]:
df2023_eng3 = add_probability_B365(df2023_eng3)
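
The body of `add_probability_B365` is not shown in this part of the notebook. A common way to turn decimal bookmaker odds into probabilities is to invert them and divide by their sum, which removes the bookmaker margin. The sketch below does that for the first Eng3 fixture (B365H 2.37, B365D 3.3, B365A 3.1); `implied_probabilities` is a hypothetical helper and may differ from what the notebook's function does.

```python
def implied_probabilities(odds_home, odds_draw, odds_away):
    """Convert decimal odds to probabilities that sum to 1.

    Returns the normalized probabilities and the overround
    (sum of raw inverse odds; above 1 when the bookmaker takes a margin).
    """
    raw = [1.0 / odds_home, 1.0 / odds_draw, 1.0 / odds_away]
    overround = sum(raw)
    return [p / overround for p in raw], overround

# Odds from the first row of df2023_eng3.head()
probs, overround = implied_probabilities(2.37, 3.3, 3.1)
print([round(p, 3) for p in probs], round(overround, 4))
```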

In [None]:
df2023_eng3.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 552 entries, 0 to 551
Data columns (total 21 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   HomeTeam                     552 non-null    object 
 1   AwayTeam                     552 non-null    object 
 2   B365H                        552 non-null    float64
 3   B365D                        552 non-null    float64
 4   B365A                        552 non-null    float64
 5   HomeTeam_WinRate             552 non-null    float64
 6   AwayTeam_WinRate             552 non-null    float64
 7   HomeTeam_GoalsAvg            552 non-null    float64
 8   AwayTeam_GoalsAvg            552 non-null    float64
 9   HomeTeam_goals_conceded_avg  552 non-null    float64
 10  AwayTeam_goals_conceded_avg  552 non-null    float64
 11  H_goal_ratio                 552 non-null    float64
 12  A_goal_ratio                 552 non-null    float64
 13  attack_strength_home

In [None]:
train = eng3[eng3['Year'] < 2022]
validation = eng3[eng3['Year'] == 2022]
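
The split above is chronological: earlier seasons train the model and 2022 validates it. An integer `cv` in scikit-learn's `GridSearchCV`, as used later with `cv=5`, builds k-fold splits over row order and does not know about seasons. A season-aware alternative can be sketched in plain Python as below; the season list and `expanding_season_folds` are assumptions for illustration only.

```python
def expanding_season_folds(years, min_train_seasons=3):
    """Yield (train_years, validation_year) pairs where training always precedes validation."""
    ordered = sorted(set(years))
    folds = []
    for i in range(min_train_seasons, len(ordered)):
        folds.append((ordered[:i], ordered[i]))
    return folds

# Assumed seasons present in the 'Year' column
seasons = [2016, 2017, 2018, 2019, 2020, 2021, 2022]

folds = expanding_season_folds(seasons)
for train_years, val_year in folds:
    print(f'train on {train_years[0]}-{train_years[-1]}, validate on {val_year}')
```

Each pair could be turned into row indices and passed to `GridSearchCV` through its `cv` argument, which also accepts an iterable of (train, test) index arrays.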

In [None]:
X_train = train.drop(['FTR', 'total_goal'], axis=1)
y_train = train['FTR']
X_validation = validation.drop(['FTR', 'total_goal'], axis=1)
y_validation = validation['FTR']

In [None]:
X_test = df2023_eng3.copy()
X_test = X_test[X_train.columns]

In [None]:
X_train.shape , y_train.shape, X_validation.shape, y_validation.shape, X_test.shape

((3122, 21), (3122,), (541, 21), (541,), (552, 21))

In [None]:
y_train_enc = y_train.map({'H': 1, 'D': 0, 'A': 2})
y_validation_enc = y_validation.map({'H': 1, 'D': 0, 'A': 2})

In [None]:

from sklearn.preprocessing import LabelEncoder
# Update the set of all teams to include teams from X_test
all_teams = set(X_train['HomeTeam'].unique()).union(set(X_train['AwayTeam'].unique()))
all_teams = all_teams.union(set(X_validation['HomeTeam'].unique())).union(set(X_validation['AwayTeam'].unique()))
all_teams = all_teams.union(set(X_test['HomeTeam'].unique())).union(set(X_test['AwayTeam'].unique()))

# Convert the set to a list
all_teams_list = list(all_teams)

# Fit the LabelEncoder with the updated list of all teams
encoder = LabelEncoder()
encoder.fit(all_teams_list)

# Transform 'HomeTeam' and 'AwayTeam' in all datasets
X_train['HomeTeam'] = encoder.transform(X_train['HomeTeam'])
X_train['AwayTeam'] = encoder.transform(X_train['AwayTeam'])
X_validation['HomeTeam'] = encoder.transform(X_validation['HomeTeam'])
X_validation['AwayTeam'] = encoder.transform(X_validation['AwayTeam'])
X_test['HomeTeam'] = encoder.transform(X_test['HomeTeam'])
X_test['AwayTeam'] = encoder.transform(X_test['AwayTeam'])
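
For reference, the integer ids produced here are simply positions in the sorted list of team names. The plain-Python sketch below mirrors that idea on a handful of Eng3 team names; `build_team_index` and the small lists are hypothetical. Because the ids are arbitrary, tree-based models can split on them freely, while linear models will read them as magnitudes.

```python
def build_team_index(*team_columns):
    """Map every team name seen in any column to its position in sorted order."""
    teams = set()
    for column in team_columns:
        teams.update(column)
    return {team: idx for idx, team in enumerate(sorted(teams))}

# Small illustrative columns (names taken from the Eng3 preview)
train_home = ['Bradford', 'Carlisle']
train_away = ['Doncaster', 'Crawley Town']
test_home = ['AFC Wimbledon']
test_away = ['Gillingham']

team_index = build_team_index(train_home, train_away, test_home, test_away)
encoded_test_home = [team_index[name] for name in test_home]
print(team_index)
```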


In [None]:
from lazypredict.Supervised import LazyClassifier
clf = LazyClassifier(verbose = 0, ignore_warnings = False, custom_metric = None)
models,pred = clf.fit(X_train, X_validation, y_train_enc, y_validation_enc)

  3%|▎         | 1/29 [00:00<00:13,  2.08it/s]

ROC AUC couldn't be calculated for AdaBoostClassifier
multi_class must be in ('ovo', 'ovr')


  7%|▋         | 2/29 [00:00<00:10,  2.68it/s]

ROC AUC couldn't be calculated for BaggingClassifier
multi_class must be in ('ovo', 'ovr')
ROC AUC couldn't be calculated for BernoulliNB
multi_class must be in ('ovo', 'ovr')


 21%|██        | 6/29 [00:07<00:27,  1.21s/it]

ROC AUC couldn't be calculated for CalibratedClassifierCV
multi_class must be in ('ovo', 'ovr')
CategoricalNB model failed to execute
Negative values in data passed to CategoricalNB (input X)
ROC AUC couldn't be calculated for DecisionTreeClassifier
multi_class must be in ('ovo', 'ovr')
ROC AUC couldn't be calculated for DummyClassifier
multi_class must be in ('ovo', 'ovr')


 28%|██▊       | 8/29 [00:07<00:15,  1.34it/s]

ROC AUC couldn't be calculated for ExtraTreeClassifier
multi_class must be in ('ovo', 'ovr')


 34%|███▍      | 10/29 [00:08<00:11,  1.59it/s]

ROC AUC couldn't be calculated for ExtraTreesClassifier
multi_class must be in ('ovo', 'ovr')
ROC AUC couldn't be calculated for GaussianNB
multi_class must be in ('ovo', 'ovr')


 38%|███▊      | 11/29 [00:08<00:10,  1.79it/s]

ROC AUC couldn't be calculated for KNeighborsClassifier
multi_class must be in ('ovo', 'ovr')


 41%|████▏     | 12/29 [00:09<00:09,  1.74it/s]

ROC AUC couldn't be calculated for LabelPropagation
multi_class must be in ('ovo', 'ovr')


 45%|████▍     | 13/29 [00:10<00:09,  1.72it/s]

ROC AUC couldn't be calculated for LabelSpreading
multi_class must be in ('ovo', 'ovr')
ROC AUC couldn't be calculated for LinearDiscriminantAnalysis
multi_class must be in ('ovo', 'ovr')


 59%|█████▊    | 17/29 [00:10<00:03,  3.03it/s]

ROC AUC couldn't be calculated for LinearSVC
multi_class must be in ('ovo', 'ovr')
ROC AUC couldn't be calculated for LogisticRegression
multi_class must be in ('ovo', 'ovr')
ROC AUC couldn't be calculated for NearestCentroid
multi_class must be in ('ovo', 'ovr')


 62%|██████▏   | 18/29 [00:11<00:04,  2.68it/s]

ROC AUC couldn't be calculated for NuSVC
multi_class must be in ('ovo', 'ovr')
ROC AUC couldn't be calculated for PassiveAggressiveClassifier
multi_class must be in ('ovo', 'ovr')
ROC AUC couldn't be calculated for Perceptron
multi_class must be in ('ovo', 'ovr')
ROC AUC couldn't be calculated for QuadraticDiscriminantAnalysis
multi_class must be in ('ovo', 'ovr')


 86%|████████▌ | 25/29 [00:12<00:00,  5.36it/s]

ROC AUC couldn't be calculated for RandomForestClassifier
multi_class must be in ('ovo', 'ovr')
ROC AUC couldn't be calculated for RidgeClassifier
multi_class must be in ('ovo', 'ovr')
ROC AUC couldn't be calculated for RidgeClassifierCV
multi_class must be in ('ovo', 'ovr')
ROC AUC couldn't be calculated for SGDClassifier
multi_class must be in ('ovo', 'ovr')


 93%|█████████▎| 27/29 [00:12<00:00,  5.28it/s]

ROC AUC couldn't be calculated for SVC
multi_class must be in ('ovo', 'ovr')
StackingClassifier model failed to execute
StackingClassifier.__init__() missing 1 required positional argument: 'estimators'


 97%|█████████▋| 28/29 [00:13<00:00,  4.74it/s]

ROC AUC couldn't be calculated for XGBClassifier
multi_class must be in ('ovo', 'ovr')
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000222 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 1037
[LightGBM] [Info] Number of data points in the train set: 3122, number of used features: 21
[LightGBM] [Info] Start training from score -1.315211
[LightGBM] [Info] Start training from score -0.877649
[LightGBM] [Info] Start training from score -1.152573


100%|██████████| 29/29 [00:13<00:00,  2.17it/s]

ROC AUC couldn't be calculated for LGBMClassifier
multi_class must be in ('ovo', 'ovr')





In [None]:
models

Model,Accuracy,Balanced Accuracy,ROC AUC,F1 Score,Time Taken
LGBMClassifier,0.68,0.65,,0.67,0.31
RandomForestClassifier,0.68,0.65,,0.67,0.63
ExtraTreesClassifier,0.67,0.63,,0.66,0.82
XGBClassifier,0.65,0.62,,0.64,0.32
LogisticRegression,0.64,0.61,,0.63,0.09
SGDClassifier,0.64,0.61,,0.62,0.1
LinearDiscriminantAnalysis,0.63,0.6,,0.63,0.05
CalibratedClassifierCV,0.64,0.6,,0.61,6.67
NuSVC,0.63,0.6,,0.62,0.53
LinearSVC,0.65,0.6,,0.6,0.69
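
The ROC AUC column is empty because the run logged `multi_class must be in ('ovo', 'ovr')` for every model. This is the error `roc_auc_score` raises for a three-class target when no multiclass strategy is given. A minimal sketch of computing a one-vs-rest ROC AUC from class probabilities follows; it uses synthetic data from `make_classification`, not the football features, so the printed value says nothing about this dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic three-class problem standing in for H / D / A
X, y = make_classification(
    n_samples=600, n_features=10, n_informative=5, n_classes=3, random_state=42
)
X_tr, X_va, y_tr, y_va = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

clf = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_tr, y_tr)

# Multiclass ROC AUC needs per-class probabilities, not hard labels
proba = clf.predict_proba(X_va)
auc_ovr = roc_auc_score(y_va, proba, multi_class='ovr', average='macro')
print(f'Macro one-vs-rest ROC AUC: {auc_ovr:.3f}')
```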


In [None]:
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier, ExtraTreesClassifier
from sklearn.metrics import accuracy_score, f1_score
from sklearn.tree import DecisionTreeClassifier
from xgboost import XGBClassifier

# Initialize the models
bagging_clf = BaggingClassifier(n_estimators=100, random_state=42)
extra_trees_clf = ExtraTreesClassifier(n_estimators=100, random_state=42)
random_forest_clf = RandomForestClassifier(n_estimators=100, random_state=42)
decision_tree_clf = DecisionTreeClassifier(random_state=42)
xgb_clf = XGBClassifier(random_state=42)  # Default parameters, adjust as necessary

# Fit the models
bagging_clf.fit(X_train, y_train_enc)
extra_trees_clf.fit(X_train, y_train_enc)
random_forest_clf.fit(X_train, y_train_enc)
decision_tree_clf.fit(X_train, y_train_enc)
xgb_clf.fit(X_train, y_train_enc)

# Predict on the validation set
y_pred_bagging = bagging_clf.predict(X_validation)
y_pred_extra_trees = extra_trees_clf.predict(X_validation)
y_pred_random_forest = random_forest_clf.predict(X_validation)
y_pred_decision_tree = decision_tree_clf.predict(X_validation)
y_pred_xgb = xgb_clf.predict(X_validation)

# Calculate accuracy and F1 score for each model
accuracy_bagging = accuracy_score(y_validation_enc, y_pred_bagging)
f1_bagging = f1_score(y_validation_enc, y_pred_bagging, average='macro')

accuracy_extra_trees = accuracy_score(y_validation_enc, y_pred_extra_trees)
f1_extra_trees = f1_score(y_validation_enc, y_pred_extra_trees, average='macro')

accuracy_random_forest = accuracy_score(y_validation_enc, y_pred_random_forest)
f1_random_forest = f1_score(y_validation_enc, y_pred_random_forest, average='macro')

accuracy_decision_tree = accuracy_score(y_validation_enc, y_pred_decision_tree)
f1_decision_tree = f1_score(y_validation_enc, y_pred_decision_tree, average='macro')

accuracy_xgb = accuracy_score(y_validation_enc, y_pred_xgb)
f1_xgb = f1_score(y_validation_enc, y_pred_xgb, average='macro')

# Print out the performance
print(f'Bagging Classifier - Accuracy: {accuracy_bagging}, F1 Score: {f1_bagging}')
print(f'Extra Trees Classifier - Accuracy: {accuracy_extra_trees}, F1 Score: {f1_extra_trees}')
print(f'Random Forest Classifier - Accuracy: {accuracy_random_forest}, F1 Score: {f1_random_forest}')
print(f'Decision Tree Classifier - Accuracy: {accuracy_decision_tree}, F1 Score: {f1_decision_tree}')
print(f'XGB Classifier - Accuracy: {accuracy_xgb}, F1 Score: {f1_xgb}')


Bagging Classifier - Accuracy: 0.6857670979667283, F1 Score: 0.6625865661381513
Extra Trees Classifier - Accuracy: 0.6709796672828097, F1 Score: 0.6357733243639284
Random Forest Classifier - Accuracy: 0.6765249537892791, F1 Score: 0.6499791750170734
Decision Tree Classifier - Accuracy: 0.6025878003696857, F1 Score: 0.5931177215580885
XGB Classifier - Accuracy: 0.6487985212569316, F1 Score: 0.6245146942138872
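
Every model above scores lower on macro F1 than on accuracy. Macro F1 averages the per-class F1 scores with equal weight, so a class that is rarely predicted correctly pulls it down even when overall accuracy looks fine. That may be part of what happens here, although the outputs shown do not confirm it. The plain-Python sketch below works through a toy example with made-up labels; `macro_f1` is an illustrative reimplementation, not the scikit-learn function.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores over all labels seen."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

# Toy labels (1 = home win, 0 = draw, 2 = away win); draws are never predicted
y_true = [1, 1, 1, 1, 0, 0, 2, 2]
y_pred = [1, 1, 1, 1, 1, 1, 2, 2]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
score = macro_f1(y_true, y_pred)
print(accuracy, round(score, 3))  # 0.75 0.6
```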


In [None]:
from sklearn.model_selection import GridSearchCV

param_grid_xgb = {
    'n_estimators': [50, 100, 200],
    'max_depth': [3, 6, 10],
    'learning_rate': [0.01, 0.1, 0.2],
    'subsample': [0.5, 0.7, 1],
    'colsample_bytree': [0.5, 0.7, 1]
}

param_grid_random_forest = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

param_grid_extra_trees = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}



grid_search_xgb = GridSearchCV(xgb_clf, param_grid_xgb, cv=5, scoring='f1_macro', verbose=1, n_jobs=-1)
grid_search_random_forest = GridSearchCV(random_forest_clf, param_grid_random_forest, cv=5, scoring='f1_macro', verbose=1, n_jobs=-1)
grid_search_extra_trees = GridSearchCV(extra_trees_clf, param_grid_extra_trees, cv=5, scoring='f1_macro', verbose=1, n_jobs=-1)

grid_search_xgb.fit(X_train, y_train_enc)
grid_search_random_forest.fit(X_train, y_train_enc)
grid_search_extra_trees.fit(X_train, y_train_enc)

best_params_xgb = grid_search_xgb.best_params_
best_score_xgb = grid_search_xgb.best_score_

best_params_random_forest = grid_search_random_forest.best_params_
best_score_random_forest = grid_search_random_forest.best_score_

best_params_extra_trees = grid_search_extra_trees.best_params_
best_score_extra_trees = grid_search_extra_trees.best_score_


y_pred_xgb = grid_search_xgb.best_estimator_.predict(X_validation)
f1_score_xgb = f1_score(y_validation_enc, y_pred_xgb, average='macro')

y_pred_random_forest = grid_search_random_forest.best_estimator_.predict(X_validation)
f1_score_random_forest = f1_score(y_validation_enc, y_pred_random_forest, average='macro')

y_pred_extra_trees = grid_search_extra_trees.best_estimator_.predict(X_validation)
f1_score_extra_trees = f1_score(y_validation_enc, y_pred_extra_trees, average='macro')


results = {
    "XGB Classifier": {
        "Best Parameters": best_params_xgb,
        "Best Score": best_score_xgb,
        "F1 Score on Validation": f1_score_xgb
    },
    "Random Forest Classifier": {
        "Best Parameters": best_params_random_forest,
        "Best Score": best_score_random_forest,
        "F1 Score on Validation": f1_score_random_forest
    },
    "Extra Trees Classifier": {
        "Best Parameters": best_params_extra_trees,
        "Best Score": best_score_extra_trees,
        "F1 Score on Validation": f1_score_extra_trees
    }
}








Fitting 5 folds for each of 243 candidates, totalling 1215 fits
Fitting 5 folds for each of 108 candidates, totalling 540 fits
Fitting 5 folds for each of 108 candidates, totalling 540 fits
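
The candidate counts in this log follow directly from the parameter grids: each grid is a Cartesian product of its value lists, and every candidate is fitted once per fold. The sketch below recomputes them with `itertools.product`; `count_fits` is an illustrative helper and `param_grid_tree` stands for the identical Random Forest and Extra Trees grids.

```python
from itertools import product

param_grid_xgb = {
    'n_estimators': [50, 100, 200],
    'max_depth': [3, 6, 10],
    'learning_rate': [0.01, 0.1, 0.2],
    'subsample': [0.5, 0.7, 1],
    'colsample_bytree': [0.5, 0.7, 1],
}

# Same grid is used for Random Forest and Extra Trees
param_grid_tree = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
}

def count_fits(param_grid, n_folds=5):
    """Return (number of candidates, total fits) for an exhaustive grid search."""
    n_candidates = sum(1 for _ in product(*param_grid.values()))
    return n_candidates, n_candidates * n_folds

xgb_counts = count_fits(param_grid_xgb)
tree_counts = count_fits(param_grid_tree)
print(xgb_counts)   # (243, 1215)
print(tree_counts)  # (108, 540)
```

Adding one more three-value parameter to the XGBoost grid would triple the work to 3645 fits.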


In [None]:
results

{'XGB Classifier': {'Best Parameters': {'colsample_bytree': 1,
   'learning_rate': 0.2,
   'max_depth': 3,
   'n_estimators': 50,
   'subsample': 0.5},
  'Best Score': 0.6940952479758251,
  'F1 Score on Validation': 0.6572533681092159},
 'Random Forest Classifier': {'Best Parameters': {'max_depth': None,
   'min_samples_leaf': 4,
   'min_samples_split': 10,
   'n_estimators': 200},
  'Best Score': 0.6745237488016496,
  'F1 Score on Validation': 0.6460861281395862},
 'Extra Trees Classifier': {'Best Parameters': {'max_depth': None,
   'min_samples_leaf': 1,
   'min_samples_split': 10,
   'n_estimators': 100},
  'Best Score': 0.6733647770301773,
  'F1 Score on Validation': 0.6529363980150712}}

In [None]:
optimal_xgb_clf = XGBClassifier(
    colsample_bytree=1,
    learning_rate=0.2,
    max_depth=3,
    n_estimators=50,
    subsample=0.5,
    random_state=42  # Optional for reproducibility
)

In [None]:
optimal_xgb_clf.fit(X_train, y_train_enc)

In [None]:
y_pred_test_xgb = optimal_xgb_clf.predict(X_test)

In [None]:
# Define the inverse mapping
inverse_mapping = {1: 'H', 0: 'D', 2: 'A'}

# Convert y_pred_test_xgb back to original form
y_pred_test_xgb_original = [inverse_mapping[label] for label in y_pred_test_xgb]


In [None]:
predictions_df = pd.DataFrame(y_pred_test_xgb_original, columns=['Predictions'])

In [None]:
predictions_df.to_csv('england_3.csv', index=False)

In [None]:
train = eng3[eng3['Year'] < 2022]
validation = eng3[eng3['Year'] == 2022]

In [None]:
train = remove_rows_with_inf(train)

In [None]:
X_train = train.drop(['FTR', 'total_goal'], axis=1)
y_train = train['total_goal']
X_validation = validation.drop(['FTR', 'total_goal'], axis=1)
y_validation = validation['total_goal']

In [None]:
X_test = df2023_eng3.copy()
X_test = X_test[X_train.columns]

In [None]:
X_train.shape , y_train.shape, X_validation.shape, y_validation.shape, X_test.shape

((3122, 21), (3122,), (541, 21), (541,), (552, 21))

In [None]:
from sklearn.preprocessing import LabelEncoder
# Update the set of all teams to include teams from X_test
all_teams = set(X_train['HomeTeam'].unique()).union(set(X_train['AwayTeam'].unique()))
all_teams = all_teams.union(set(X_validation['HomeTeam'].unique())).union(set(X_validation['AwayTeam'].unique()))
all_teams = all_teams.union(set(X_test['HomeTeam'].unique())).union(set(X_test['AwayTeam'].unique()))

# Convert the set to a list
all_teams_list = list(all_teams)

# Fit the LabelEncoder with the updated list of all teams
encoder = LabelEncoder()
encoder.fit(all_teams_list)

# Transform 'HomeTeam' and 'AwayTeam' in all datasets
X_train['HomeTeam'] = encoder.transform(X_train['HomeTeam'])
X_train['AwayTeam'] = encoder.transform(X_train['AwayTeam'])
X_validation['HomeTeam'] = encoder.transform(X_validation['HomeTeam'])
X_validation['AwayTeam'] = encoder.transform(X_validation['AwayTeam'])
X_test['HomeTeam'] = encoder.transform(X_test['HomeTeam'])
X_test['AwayTeam'] = encoder.transform(X_test['AwayTeam'])

In [None]:
from lazypredict.Supervised import LazyRegressor

# Create an instance of LazyRegressor
reg = LazyRegressor(verbose=0, ignore_warnings=False, custom_metric=None)

# Fit the model
models, predictions = reg.fit(X_train, X_validation, y_train, y_validation)

 21%|██▏       | 9/42 [00:03<00:15,  2.11it/s]

GammaRegressor model failed to execute
Some value(s) of y are out of the valid range of the loss 'HalfGammaLoss'.


 74%|███████▍  | 31/42 [00:16<00:04,  2.43it/s]

QuantileRegressor model failed to execute
Solver interior-point is not anymore available in SciPy >= 1.11.0.


 98%|█████████▊| 41/42 [00:27<00:01,  1.12s/it]

[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.001099 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 1037
[LightGBM] [Info] Number of data points in the train set: 3122, number of used features: 21
[LightGBM] [Info] Start training from score 2.567265


100%|██████████| 42/42 [00:28<00:00,  1.50it/s]


In [None]:
models

Model,Adjusted R-Squared,R-Squared,RMSE,Time Taken
HuberRegressor,0.06,0.09,1.49,0.19
LinearSVR,0.05,0.09,1.49,0.1
SGDRegressor,0.05,0.09,1.49,0.05
BayesianRidge,0.05,0.08,1.49,0.07
RidgeCV,0.05,0.08,1.49,0.05
Ridge,0.05,0.08,1.49,0.03
LinearRegression,0.05,0.08,1.49,0.03
TransformedTargetRegressor,0.05,0.08,1.49,0.09
LassoCV,0.05,0.08,1.49,0.2
LassoLarsCV,0.05,0.08,1.49,0.1


In [None]:
from sklearn.linear_model import ElasticNetCV
from sklearn.svm import SVR
from sklearn.ensemble import RandomForestRegressor
from xgboost import XGBRegressor
from sklearn.linear_model import PoissonRegressor
from sklearn.metrics import mean_absolute_error, r2_score

# Initialize the models
elastic_net_cv = ElasticNetCV(cv=5, random_state=42)  # Adjust parameters as necessary
poisson_regressor = PoissonRegressor()
svr = SVR()  # Default parameters, adjust as necessary
random_forest_reg = RandomForestRegressor(n_estimators=100, random_state=42)
xgb_reg = XGBRegressor(random_state=42)  # Default parameters, adjust as necessary

# Fit the models
elastic_net_cv.fit(X_train, y_train)
poisson_regressor.fit(X_train, y_train)
svr.fit(X_train, y_train)
random_forest_reg.fit(X_train, y_train)
xgb_reg.fit(X_train, y_train)

# Predict on the validation set
y_pred_elastic_net_cv = elastic_net_cv.predict(X_validation)
y_pred_poisson_regressor = poisson_regressor.predict(X_validation)
y_pred_svr = svr.predict(X_validation)
y_pred_random_forest = random_forest_reg.predict(X_validation)
y_pred_xgb = xgb_reg.predict(X_validation)

# Round predictions to the nearest integer, since goal totals are whole numbers
y_pred_elastic_net_cv_rounded = np.rint(y_pred_elastic_net_cv)
y_pred_poisson_regressor_rounded = np.rint(y_pred_poisson_regressor)
y_pred_svr_rounded = np.rint(y_pred_svr)
y_pred_random_forest_rounded = np.rint(y_pred_random_forest)
y_pred_xgb_rounded = np.rint(y_pred_xgb)

# Calculate MAE and R2 score using rounded predictions
mae_elastic_net_cv = mean_absolute_error(y_validation, y_pred_elastic_net_cv_rounded)
r2_elastic_net_cv = r2_score(y_validation, y_pred_elastic_net_cv_rounded)

mae_poisson_regressor = mean_absolute_error(y_validation, y_pred_poisson_regressor_rounded)
r2_poisson_regressor = r2_score(y_validation, y_pred_poisson_regressor_rounded)

mae_svr = mean_absolute_error(y_validation, y_pred_svr_rounded)
r2_svr = r2_score(y_validation, y_pred_svr_rounded)

mae_random_forest = mean_absolute_error(y_validation, y_pred_random_forest_rounded)
r2_random_forest = r2_score(y_validation, y_pred_random_forest_rounded)

mae_xgb = mean_absolute_error(y_validation, y_pred_xgb_rounded)
r2_xgb = r2_score(y_validation, y_pred_xgb_rounded)

# Print out the performance with rounded predictions
print(f'ElasticNetCV - MAE: {mae_elastic_net_cv}, R2 Score: {r2_elastic_net_cv}')
print(f'Poisson Regressor - MAE: {mae_poisson_regressor}, R2 Score: {r2_poisson_regressor}')
print(f'SVR - MAE: {mae_svr}, R2 Score: {r2_svr}')
print(f'Random Forest Regressor - MAE: {mae_random_forest}, R2 Score: {r2_random_forest}')
print(f'XGB Regressor - MAE: {mae_xgb}, R2 Score: {r2_xgb}')

ElasticNetCV - MAE: 1.1682070240295748, R2 Score: 0.04143196509100722
Poisson Regressor - MAE: 1.2051756007393715, R2 Score: -0.0026226672582174704
SVR - MAE: 1.2033271719038816, R2 Score: -0.08845324407653465
Random Forest Regressor - MAE: 1.1737523105360443, R2 Score: 0.0026942711287579746
XGB Regressor - MAE: 1.2402957486136783, R2 Score: -0.10668274711759307
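
R2 scores this close to zero mean the models barely beat a constant prediction. One way to make that comparison explicit is to score a trivial baseline with the same metrics. The sketch below uses scikit-learn's `DummyRegressor` on synthetic Poisson-distributed goal counts; the data and the variable names are assumptions for illustration, not the notebook's `X_train` and `y_train`.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_absolute_error, r2_score

rng = np.random.default_rng(42)

# Synthetic stand-ins: 21 uninformative features, Poisson goal totals.
X_tr, X_val = rng.normal(size=(500, 21)), rng.normal(size=(200, 21))
y_tr, y_val = rng.poisson(2.6, size=500), rng.poisson(2.6, size=200)

# Baseline that always predicts the training mean.
baseline = DummyRegressor(strategy="mean").fit(X_tr, y_tr)
y_base = baseline.predict(X_val)

baseline_mae = mean_absolute_error(y_val, y_base)
baseline_r2 = r2_score(y_val, y_base)
print(f"Baseline - MAE: {baseline_mae:.3f}, R2 Score: {baseline_r2:.3f}")
```

A model is only useful on this target if its validation MAE is clearly below the baseline's.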


In [None]:
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import mean_absolute_error

# Parameter grids
param_grid_poisson = {
    'alpha': [0.01, 0.1, 1, 10],
    'max_iter': [100, 300, 500]
}

param_grid_random_forest = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

# GridSearchCV setup
grid_search_poisson = GridSearchCV(poisson_regressor, param_grid_poisson, cv=5, scoring='neg_mean_absolute_error', verbose=1, n_jobs=-1)
grid_search_random_forest = GridSearchCV(random_forest_reg, param_grid_random_forest, cv=5, scoring='neg_mean_absolute_error', verbose=1, n_jobs=-1)

# Fitting models
grid_search_poisson.fit(X_train, y_train)
grid_search_random_forest.fit(X_train, y_train)

# Best parameters and scores
best_params_poisson = grid_search_poisson.best_params_
best_score_poisson = grid_search_poisson.best_score_

best_params_random_forest = grid_search_random_forest.best_params_
best_score_random_forest = grid_search_random_forest.best_score_

# Predict and calculate MAE
y_pred_poisson = grid_search_poisson.best_estimator_.predict(X_validation)
mae_poisson = mean_absolute_error(y_validation, y_pred_poisson)

y_pred_random_forest = grid_search_random_forest.best_estimator_.predict(X_validation)
mae_random_forest = mean_absolute_error(y_validation, y_pred_random_forest)

# Results
results = {
    "Poisson Regressor": {
        "Best Parameters": best_params_poisson,
        "Best Score (Negative MAE)": best_score_poisson,
        "MAE on Validation": mae_poisson
    },
    "Random Forest Regressor": {
        "Best Parameters": best_params_random_forest,
        "Best Score (Negative MAE)": best_score_random_forest,
        "MAE on Validation": mae_random_forest
    }
}

# ElasticNetCV already uses cross-validation for parameter tuning, so we directly fit it and predict
elastic_net_cv = ElasticNetCV(cv=5, random_state=42).fit(X_train, y_train)
y_pred_elastic_net_cv = elastic_net_cv.predict(X_validation)
mae_elastic_net_cv = mean_absolute_error(y_validation, y_pred_elastic_net_cv)

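# get_params() returns the constructor settings, not the tuned values;
# the alpha selected by cross-validation is stored in elastic_net_cv.alpha_.
results["ElasticNetCV"] = {
    "Best Parameters": elastic_net_cv.get_params(),
    "MAE on Validation": mae_elastic_net_cv
}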

# Print results
for model, info in results.items():
    print(f"{model}:")
    for key, value in info.items():
        print(f"  {key}: {value}")
    print()

Fitting 5 folds for each of 12 candidates, totalling 60 fits
Fitting 5 folds for each of 108 candidates, totalling 540 fits
Poisson Regressor:
  Best Parameters: {'alpha': 0.01, 'max_iter': 500}
  Best Score (Negative MAE): -1.2246273473961609
  MAE on Validation: 1.157360881111414

Random Forest Regressor:
  Best Parameters: {'max_depth': 10, 'min_samples_leaf': 4, 'min_samples_split': 10, 'n_estimators': 200}
  Best Score (Negative MAE): -1.2403224739396108
  MAE on Validation: 1.1732153729427395

ElasticNetCV:
  Best Parameters: {'alphas': None, 'copy_X': True, 'cv': 5, 'eps': 0.001, 'fit_intercept': True, 'l1_ratio': 0.5, 'max_iter': 1000, 'n_alphas': 100, 'n_jobs': None, 'positive': False, 'precompute': 'auto', 'random_state': 42, 'selection': 'cyclic', 'tol': 0.0001, 'verbose': 0}
  MAE on Validation: 1.1764156280411942
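
The searches above use `cv=5`, which for a regressor splits the rows into five consecutive, unshuffled folds, so some folds validate on rows that come before part of their training data. If the rows are sorted by match date, `TimeSeriesSplit` is a possible alternative that always validates on later rows than it trains on. A minimal sketch on synthetic data (the arrays and the small grid are placeholders, and the chronological ordering is an assumption):

```python
import numpy as np
from sklearn.linear_model import PoissonRegressor
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit

rng = np.random.default_rng(0)

# Placeholder data, assumed to be ordered from oldest to newest match.
X = rng.normal(size=(300, 5))
y = rng.poisson(2.6, size=300)

# Each split trains on an initial segment and validates on the rows after it.
tscv = TimeSeriesSplit(n_splits=5)

search = GridSearchCV(
    PoissonRegressor(max_iter=300),
    param_grid={"alpha": [0.01, 0.1, 1, 10]},
    cv=tscv,
    scoring="neg_mean_absolute_error",
)
search.fit(X, y)

print(search.best_params_, -search.best_score_)
```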



In [None]:
from sklearn.linear_model import ElasticNetCV
import numpy as np

# ElasticNetCV settings as reported by get_params() above
# (the constructor defaults plus cv=5 and random_state=42)
best_params_elastic_net = {
    'alphas': None,
    'copy_X': True,
    'cv': 5,
    'eps': 0.001,
    'fit_intercept': True,
    'l1_ratio': 0.5,
    'max_iter': 1000,
    'n_alphas': 100,
    'n_jobs': None,
    'positive': False,
    'precompute': 'auto',
    'random_state': 42,
    'selection': 'cyclic',
    'tol': 0.0001,
    'verbose': 0
}

# Initialize and fit the ElasticNetCV model
elastic_net_model = ElasticNetCV(**best_params_elastic_net)
elastic_net_model.fit(X_train, y_train)

# Predict on the test set
y_pred_test_elastic_net = elastic_net_model.predict(X_test)

# Round predictions to nearest integer and convert to int type
y_pred_test_elastic_net_rounded = np.rint(y_pred_test_elastic_net).astype(int)

# y_pred_test_elastic_net_rounded contains the final integer predictions for X_test


In [None]:


# Convert predictions to a DataFrame
predictions_df_elastic_net = pd.DataFrame(y_pred_test_elastic_net_rounded, columns=['Predicted_Total_Goals'])

# Save to CSV
predictions_df_elastic_net.to_csv('england_3.csv', index=False)


In [None]:

from sklearn.preprocessing import LabelEncoder

def label_encode(df):
    """Return a copy of df with team names and the full-time result integer-encoded."""
    # Work on a copy so a sliced DataFrame is not modified in place
    # (this avoids pandas' SettingWithCopyWarning).
    df = df.copy()

    # One encoder for both team columns, so a team gets the same code home and away
    team_encoder = LabelEncoder()
    team_encoder.fit(pd.concat([df['HomeTeam'], df['AwayTeam']]))
    df['HomeTeam'] = team_encoder.transform(df['HomeTeam'])
    df['AwayTeam'] = team_encoder.transform(df['AwayTeam'])

    # Separate encoder for the result labels
    df['FTR'] = LabelEncoder().fit_transform(df['FTR'])

    return df

In [None]:
ita1 = label_encode(ita1)


In [None]:
# Standardise every column except identifiers, targets and date parts.

from sklearn.preprocessing import StandardScaler

def scale_dataframe(df, columns_to_exclude=('Date', 'HomeTeam', 'AwayTeam', 'FTR', 'total_goal', 'Year', 'Month', 'Day')):
    """Return a copy of df with the non-excluded columns scaled to zero mean and unit variance."""
    columns_to_scale = [col for col in df.columns if col not in columns_to_exclude]

    scaler = StandardScaler()

    df_scaled = df.copy()
    df_scaled[columns_to_scale] = scaler.fit_transform(df[columns_to_scale])

    return df_scaled


In [None]:
ita1 = scale_dataframe(ita1)
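
`scale_dataframe` fits the scaler on every row it receives, so when it is applied before the train/test split the test-season statistics influence the scaling. A possible variant, sketched below with hypothetical names (`scale_train_test` and a toy frame), fits `StandardScaler` on the training rows only and reuses it for the test rows:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

def scale_train_test(train_df, test_df, columns_to_scale):
    """Fit the scaler on train_df only, then apply it to both frames."""
    scaler = StandardScaler().fit(train_df[columns_to_scale])
    train_scaled, test_scaled = train_df.copy(), test_df.copy()
    train_scaled[columns_to_scale] = scaler.transform(train_df[columns_to_scale])
    test_scaled[columns_to_scale] = scaler.transform(test_df[columns_to_scale])
    return train_scaled, test_scaled

# Toy data: odds for three earlier years and one held-out year.
toy = pd.DataFrame({
    "Year": [2019, 2020, 2021, 2022],
    "B365H": [1.5, 2.0, 2.5, 4.0],
})
train_part = toy[toy["Year"] != 2022]
test_part = toy[toy["Year"] == 2022]

train_scaled, test_scaled = scale_train_test(train_part, test_part, ["B365H"])
print(train_scaled["B365H"].round(3).tolist(), test_scaled["B365H"].round(3).tolist())
```

The scaled training column has zero mean, while the held-out row is expressed in training-set units rather than contributing to them.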

In [None]:
ita1.head()

Unnamed: 0,HomeTeam,AwayTeam,FTR,B365H,B365D,B365A,HomeTeam_WinRate,AwayTeam_WinRate,HomeTeam_GoalsAvg,AwayTeam_GoalsAvg,...,total_goal,H_goal_ratio,A_goal_ratio,attack_strength_home_team,attack_strength_away_team,adjusted_win_lost_ratio_H,adjusted_win_lost_ratio_A,Year,Month,Day
0,0,14,2,0.007422,-0.70658,-0.594024,-0.0571,-0.355353,-0.182728,-0.522711,...,3.0,-0.122001,0.452391,-0.009147,-0.094845,0.000571,-0.000571,2015,10,28
1,8,18,1,-0.589648,-0.307719,0.099611,-0.316761,-0.844907,-0.222474,-1.239729,...,0.0,-0.157422,-0.64888,-0.168166,-1.059628,0.743978,-0.743978,2016,3,19
2,11,18,2,-0.539893,-0.529308,0.033424,0.514156,-0.844907,-0.165912,-1.239729,...,4.0,-0.130856,-0.64888,0.054275,-1.059628,0.000571,-0.000571,2016,1,17
3,9,2,2,-0.68916,-0.08613,0.761476,0.669953,-0.028984,-0.120051,-1.32664,...,2.0,-0.130856,-0.64888,0.181117,-1.059628,0.743978,-0.743978,2015,9,23
4,12,0,2,-0.714038,0.206368,0.761476,1.29314,-0.844907,-0.165912,-1.109362,...,1.0,-0.139711,-0.64888,0.054275,-0.92202,0.743978,-0.743978,2015,8,23


In [None]:
df = ita1.copy()

In [None]:
train_df = df[df['Year'] != 2022]
test_df = df[df['Year'] == 2022]

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report

# Features: every column except the two targets (match result and total goals)
X_train = train_df.drop(['FTR', 'total_goal'], axis=1)
y_train = train_df['FTR']
X_test = test_df.drop(['FTR', 'total_goal'], axis=1)
y_test = test_df['FTR']




In [None]:
X_train.shape, X_test.shape, y_train.shape, y_test.shape

((2415, 23), (188, 23), (2415,), (188,))

In [None]:
y_test

2234    0
2235    0
2236    2
2241    2
2243    1
       ..
2596    0
2599    2
2600    2
2602    0
2604    0
Name: FTR, Length: 188, dtype: int64

In [None]:
rf = RandomForestClassifier(n_estimators=100, random_state=42)


rf.fit(X_train, y_train)

# Predictions
y_pred = rf.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
report = classification_report(y_test, y_pred)

print("Accuracy:", accuracy)
print("Classification Report:\n", report)

Accuracy: 0.6808510638297872
Classification Report:
               precision    recall  f1-score   support

           0       0.76      0.68      0.72        66
           1       0.58      0.41      0.48        51
           2       0.67      0.87      0.76        71

    accuracy                           0.68       188
   macro avg       0.67      0.66      0.65       188
weighted avg       0.68      0.68      0.67       188



In [None]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, classification_report

# Initialize the Decision Tree Classifier
dt = DecisionTreeClassifier(random_state=42)

# Fit the model to the training data
dt.fit(X_train, y_train)

# Predictions
y_pred = dt.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
report = classification_report(y_test, y_pred)

print("Accuracy:", accuracy)
print("Classification Report:\n", report)

Accuracy: 0.6808510638297872
Classification Report:
               precision    recall  f1-score   support

           0       0.69      0.67      0.68        66
           1       0.59      0.51      0.55        51
           2       0.72      0.82      0.77        71

    accuracy                           0.68       188
   macro avg       0.67      0.66      0.66       188
weighted avg       0.68      0.68      0.68       188



In [None]:
feature_importances = dt.feature_importances_
features = X_train.columns

importances = pd.DataFrame({'Feature': features, 'Importance': feature_importances})
importances = importances.sort_values(by='Importance', ascending=False)

print(importances)

                        Feature  Importance
19    adjusted_win_lost_ratio_A    0.470414
22                          Day    0.047313
21                        Month    0.045298
10  AwayTeam_goals_conceded_avg    0.042106
5              HomeTeam_WinRate    0.034849
14                 H_goal_ratio    0.030963
15                 A_goal_ratio    0.027428
4                         B365A    0.027374
9   HomeTeam_goals_conceded_avg    0.025929
17    attack_strength_away_team    0.024716
6              AwayTeam_WinRate    0.022080
8             AwayTeam_GoalsAvg    0.021817
0                      HomeTeam    0.021344
13               Broker_prob__A    0.020772
2                         B365H    0.019427
1                      AwayTeam    0.018953
16    attack_strength_home_team    0.018418
3                         B365D    0.018322
20                         Year    0.017831
7             HomeTeam_GoalsAvg    0.015276
11                Broker_prob_H    0.014713
12               Broker_prob__D 
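
Impurity-based importances from a single decision tree are computed on the training data and can overstate one dominant feature. As a cross-check, scikit-learn's `permutation_importance` measures how much held-out accuracy drops when each column is shuffled. The sketch below uses a synthetic dataset from `make_classification`, not the match data, so its numbers say nothing about the features listed above:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic three-class problem standing in for the home/draw/away labels.
X, y = make_classification(
    n_samples=600, n_features=6, n_informative=3, n_redundant=0,
    n_classes=3, random_state=42,
)
X = pd.DataFrame(X, columns=[f"feature_{i}" for i in range(6)])
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=42)

tree = DecisionTreeClassifier(max_depth=5, random_state=42).fit(X_tr, y_tr)

# Shuffle each column 10 times on the held-out split and record the accuracy drop.
result = permutation_importance(tree, X_te, y_te, n_repeats=10, random_state=42)

perm_importances = pd.DataFrame({
    "Feature": X.columns,
    "Importance": result.importances_mean,
}).sort_values(by="Importance", ascending=False)
print(perm_importances)
```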

In [None]:
!pip install lazypredict

Collecting lazypredict
  Downloading lazypredict-0.2.12-py2.py3-none-any.whl (12 kB)
Installing collected packages: lazypredict
Successfully installed lazypredict-0.2.12


In [None]:
from lazypredict.Supervised import LazyClassifier
clf = LazyClassifier(verbose=0, ignore_warnings=True, custom_metric=None)
models, pred = clf.fit(X_train, X_test, y_train, y_test)

 97%|█████████▋| 28/29 [00:20<00:00,  1.30it/s]

[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000871 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 1128
[LightGBM] [Info] Number of data points in the train set: 2415, number of used features: 23
[LightGBM] [Info] Start training from score -1.130161
[LightGBM] [Info] Start training from score -1.419554
[LightGBM] [Info] Start training from score -0.831957


100%|██████████| 29/29 [00:21<00:00,  1.33it/s]


In [None]:
models

Model,Accuracy,Balanced Accuracy,ROC AUC,F1 Score,Time Taken
DecisionTreeClassifier,0.68,0.66,,0.68,0.15
RandomForestClassifier,0.68,0.66,,0.67,1.4
BaggingClassifier,0.68,0.65,,0.67,0.48
LabelPropagation,0.68,0.65,,0.67,1.18
LabelSpreading,0.68,0.65,,0.67,1.69
ExtraTreesClassifier,0.68,0.65,,0.66,1.72
KNeighborsClassifier,0.66,0.64,,0.65,0.21
LGBMClassifier,0.66,0.64,,0.65,0.97
RidgeClassifierCV,0.68,0.63,,0.6,0.04
RidgeClassifier,0.68,0.63,,0.6,0.13


In [None]:
from sklearn.model_selection import GridSearchCV

# Define the parameter grid
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

# Create a GridSearchCV object
grid_search = GridSearchCV(estimator=rf, param_grid=param_grid, cv=5, n_jobs=-1, verbose=2)

# Fit the grid search to the data
grid_search.fit(X_train, y_train)

# Best parameters
best_params = grid_search.best_params_
print("Best Parameters:", best_params)

# Refit a RandomForest with the best parameters
# (no random_state is set here, so results can vary between runs)
best_rf = RandomForestClassifier(**best_params)
best_rf.fit(X_train, y_train)

# Predict and evaluate with the optimized model
optimized_y_pred = best_rf.predict(X_test)
optimized_accuracy = accuracy_score(y_test, optimized_y_pred)
optimized_report = classification_report(y_test, optimized_y_pred)

print("Optimized Accuracy:", optimized_accuracy)
print("Optimized Classification Report:\n", optimized_report)

Fitting 5 folds for each of 108 candidates, totalling 540 fits
Best Parameters: {'max_depth': None, 'min_samples_leaf': 4, 'min_samples_split': 2, 'n_estimators': 100}
Optimized Accuracy: 0.675531914893617
Optimized Classification Report:
               precision    recall  f1-score   support

           0       0.80      0.65      0.72        66
           1       0.59      0.37      0.46        51
           2       0.64      0.92      0.75        71

    accuracy                           0.68       188
   macro avg       0.68      0.65      0.64       188
weighted avg       0.68      0.68      0.66       188



In [None]:
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

param_grid = {
    'criterion': ['gini', 'entropy'],         # Function to measure the quality of a split
    'max_depth': [None, 10, 20, 30, 40, 50],  # Maximum depth of the tree (None = unlimited)
    'min_samples_split': [2, 5, 10],          # Minimum number of samples required to split a node
    'min_samples_leaf': [1, 2, 4],            # Minimum number of samples required at each leaf node
}


# Create GridSearchCV
grid_search = GridSearchCV(estimator=dt, param_grid=param_grid, cv=5, n_jobs=-1, verbose=2)
grid_search.fit(X_train, y_train)

# Best parameters
best_params = grid_search.best_params_
print("Best Parameters:", best_params)

# Refit a DecisionTree with the best parameters
best_dt = DecisionTreeClassifier(**best_params)
best_dt.fit(X_train, y_train)

# Predict and evaluate with the optimized model
optimized_y_pred = best_dt.predict(X_test)
optimized_accuracy = accuracy_score(y_test, optimized_y_pred)
optimized_report = classification_report(y_test, optimized_y_pred)

print("Optimized Accuracy:", optimized_accuracy)
print("Optimized Classification Report:\n", optimized_report)

Fitting 5 folds for each of 108 candidates, totalling 540 fits
Best Parameters: {'criterion': 'gini', 'max_depth': 10, 'min_samples_leaf': 1, 'min_samples_split': 5}
Optimized Accuracy: 0.6808510638297872
Optimized Classification Report:
               precision    recall  f1-score   support

           0       0.71      0.70      0.70        66
           1       0.63      0.51      0.57        51
           2       0.68      0.79      0.73        71

    accuracy                           0.68       188
   macro avg       0.67      0.67      0.67       188
weighted avg       0.68      0.68      0.68       188
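
Because `GridSearchCV` defaults to `refit=True`, the search object already holds a model refitted on the full training set with the winning parameters, available as `best_estimator_`. Using it avoids rebuilding the classifier by hand and keeps fixed settings such as `random_state` from the original estimator. A small sketch on scikit-learn's built-in iris data (chosen only to keep the example self-contained):

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=42, stratify=y)

search = GridSearchCV(
    DecisionTreeClassifier(random_state=42),
    param_grid={"max_depth": [None, 2, 4], "min_samples_leaf": [1, 2, 4]},
    cv=5,
)
search.fit(X_tr, y_tr)

# Already refitted on all of X_tr with the best parameters; random_state is preserved.
best_tree = search.best_estimator_
test_accuracy = accuracy_score(y_te, best_tree.predict(X_te))
print(search.best_params_, round(test_accuracy, 3))
```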



In [None]:
optimized_y_pred

array([2, 0, 2, 2, 0, 2, 0, 2, 2, 2, 2, 2, 2, 2, 0, 2, 0, 1, 0, 2, 2, 1,
       2, 0, 2, 1, 2, 2, 0, 1, 2, 2, 2, 0, 0, 2, 1, 0, 1, 2, 2, 2, 2, 2,
       2, 0, 2, 2, 0, 0, 2, 0, 2, 2, 2, 0, 2, 0, 0, 2, 2, 0, 2, 1, 2, 1,
       1, 0, 2, 0, 2, 2, 0, 2, 1, 2, 1, 2, 2, 2, 0, 0, 2, 0, 2, 2, 2, 2,
       2, 2, 2, 2, 0, 2, 2, 2, 2, 0, 2, 0, 0, 0, 2, 2, 0, 2, 2, 2, 1, 2,
       2, 2, 0, 0, 2, 0, 2, 2, 0, 2, 2, 0, 0, 2, 2, 0, 2, 2, 0, 2, 1, 0,
       1, 2, 0, 1, 0, 0, 1, 2, 1, 1, 0, 2, 2, 0, 0, 2, 2, 2, 0, 2, 1, 2,
       2, 2, 0, 2, 2, 2, 2, 0, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 0, 0, 2, 2,
       0, 0, 2, 1, 0, 2, 0, 0, 2, 2, 2, 0])

In [None]:

# Same features as before, but the target is now the total number of goals
X_train = train_df.drop(['FTR', 'total_goal'], axis=1)
y_train = train_df['total_goal']
X_test = test_df.drop(['FTR', 'total_goal'], axis=1)
y_test = test_df['total_goal']

In [None]:
from lazypredict.Supervised import LazyRegressor

reg = LazyRegressor(verbose=0, ignore_warnings=False, custom_metric=None)
models, pred = reg.fit(X_train, X_test, y_train, y_test)
models

 21%|██▏       | 9/42 [00:02<00:10,  3.15it/s]

GammaRegressor model failed to execute
Some value(s) of y are out of the valid range of the loss 'HalfGammaLoss'.


 74%|███████▍  | 31/42 [00:14<00:04,  2.37it/s]

QuantileRegressor model failed to execute
Solver interior-point is not anymore available in SciPy >= 1.11.0.


100%|██████████| 42/42 [00:18<00:00,  2.26it/s]

[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000662 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 1128
[LightGBM] [Info] Number of data points in the train set: 2417, number of used features: 23
[LightGBM] [Info] Start training from score 3.996276





Model,Adjusted R-Squared,R-Squared,RMSE,Time Taken
LinearSVR,-0.02,0.11,1.59,0.13
HuberRegressor,-0.03,0.09,1.6,0.23
SVR,-0.05,0.08,1.61,0.45
NuSVR,-0.05,0.08,1.61,0.36
GradientBoostingRegressor,-0.05,0.08,1.61,0.99
PoissonRegressor,-0.07,0.06,1.63,0.04
HistGradientBoostingRegressor,-0.1,0.03,1.65,2.26
LGBMRegressor,-0.11,0.03,1.66,0.14
BaggingRegressor,-0.11,0.03,1.66,0.27
RandomForestRegressor,-0.11,0.03,1.66,2.62
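
The negative Adjusted R-Squared values next to small positive R-Squared values follow from the usual adjustment for the number of predictors, `1 - (1 - R2) * (n - 1) / (n - p - 1)`. Assuming LazyPredict applies this standard formula to the 188 evaluation rows and 23 features, a quick check looks like this (the helper name is illustrative):

```python
def adjusted_r2(r2, n_samples, n_features):
    """Standard adjusted R-squared: penalises R2 for the number of predictors."""
    return 1 - (1 - r2) * (n_samples - 1) / (n_samples - n_features - 1)

# 188 evaluation rows and 23 features, as in the shapes printed earlier.
value = adjusted_r2(0.11, n_samples=188, n_features=23)
print(round(value, 4))
```

With R2 = 0.11 the result is slightly below zero, so an R-Squared near 0.1 leaves essentially no explained variance once 23 predictors are accounted for.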


In [None]:
ita1.to_csv("ita1.csv", index=False)

In [None]:
# Load every seasonal CSV for one league from the training folder

def load_seasonal_data(base_path, country, league, start_season, end_season):
    """Return a dict of DataFrames keyed by '<league><season>', e.g. '22122'.

    start_season and end_season are the two-digit end years of the first and
    last season to load (1 -> '0001.csv', 22 -> '2122.csv').
    """
    seasonal_data = {}

    for season_end_year in range(start_season, end_season + 1):

        start_year_suffix = (season_end_year - 1) % 100
        end_year_suffix = season_end_year % 100

        season_str = f"{start_year_suffix:02d}{end_year_suffix:02d}"

        file_path = f"{base_path}/{country}/{league}/{season_str}.csv"

        seasonal_data[f'{league}{season_str}'] = pd.read_csv(file_path)

    return seasonal_data


base_path = "/content/drive/MyDrive/train"
country = "italy"
league = "2"
seasonal_datasets = load_seasonal_data(base_path, country, league, 1, 22)
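
The file names are built from the two-digit end year of each season, so the argument `1` maps to `0001.csv` and `22` to `2122.csv`. The naming rule can be checked without touching Drive using a small helper (`season_code` is a hypothetical name, not part of the notebook):

```python
def season_code(season_end_year):
    """Return the four-digit file stem for a season ending in the given two-digit year."""
    start_suffix = (season_end_year - 1) % 100
    end_suffix = season_end_year % 100
    return f"{start_suffix:02d}{end_suffix:02d}"

# Same range as the call above: seasons ending in years 1 through 22.
codes = [season_code(year) for year in range(1, 23)]
print(codes[0], codes[-1], len(codes))
```

The modulo keeps the stem at four digits even across a century boundary: `season_code(0)` gives `9900`.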

In [None]:
ita2.to_csv("ita2.csv", index=False)