<a href="https://colab.research.google.com/github/khiemtranngoc/GoalNetAI-Multi-League-Football-Predictions/blob/main/spain.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Methodology

Our goal is to predict the results of football matches in the 2023 season, so we must not use features that are only available after a match has ended, such as match statistics and goals scored. These features do not exist for matches that have not yet been played.

To predict matches before they happen, we must build prediction models from data that is available before each match starts. However, the data we have describes the end of each match, such as the number of goals and shots per team. This data cannot be used directly to train prediction models, so we transform it into pre-match features derived from historical data.

* In the test set (season 2023) we do not have information such as FTHG, FTAG, ...

### Features Not Suitable for Pre-Match Prediction:
* Goals and Results (FTHG, FTAG, FTR, HTHG, HTAG, HTR): These are outcomes of the match, not available before it starts.

* In-Match Statistics (HS, AS, HST, AST, HHW, AHW, HC, AC, HF, AF, HFKC, AFKC, HO, AO, HY, AY, HR, AR): These are also outcomes or events that occur during the match.
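
As a minimal sketch of this rule, the snippet below filters a list of column names down to those known before kick-off. The constant `POST_MATCH_COLUMNS` and the helper `pre_match_columns` are hypothetical names introduced for illustration; they are not part of the notebook. Note that FTR is still needed as the training target, so in practice it would be kept aside rather than discarded.

```python
# Hypothetical constant: columns that only exist once a match has been played
POST_MATCH_COLUMNS = {
    "FTHG", "FTAG", "FTR", "HTHG", "HTAG", "HTR",          # goals and results
    "HS", "AS", "HST", "AST", "HHW", "AHW", "HC", "AC",    # in-match statistics
    "HF", "AF", "HFKC", "AFKC", "HO", "AO", "HY", "AY", "HR", "AR",
}


def pre_match_columns(columns):
    """Return the column names known before kick-off, in their original order."""
    return [name for name in columns if name not in POST_MATCH_COLUMNS]


available = pre_match_columns(
    ["Date", "HomeTeam", "AwayTeam", "FTHG", "FTAG", "FTR", "HS", "B365H"]
)
print(available)
```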

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
from google.colab import drive
drive.mount('/content/drive/')

Mounted at /content/drive/


In [None]:
# Function that loads all the seasonal datasets from the train folder

def load_seasonal_data(base_path, country, league, start_season, end_season):
    seasonal_data = {}

    # Each value is the two-digit year in which a season ends (e.g. 16 for 2015/16)
    for season_end_year in range(start_season, end_season + 1):

        start_year_suffix = (season_end_year - 1) % 100
        end_year_suffix = season_end_year % 100

        # File names join both two-digit years, e.g. "1516" for 2015/16
        season_str = f"{start_year_suffix:02d}{end_year_suffix:02d}"

        file_path = f"{base_path}/{country}/{league}/{season_str}.csv"

        # Keys are the league number followed by the season code, e.g. "11516"
        seasonal_data[f'{league}{season_str}'] = pd.read_csv(file_path)

    return seasonal_data


base_path = "/content/drive/MyDrive/train"
country = "spain"
league = "1"
seasonal_datasets = load_seasonal_data(base_path, country, league, 1, 22)

# Example: access the data for the 2001/2002 season
# spn10102 = seasonal_datasets['10102']
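
To make the file-naming scheme used by `load_seasonal_data` easier to follow, here is a small standalone sketch of how the season code is built. The helper name `season_code` is hypothetical; the arithmetic mirrors the loop body above.

```python
def season_code(season_end_year):
    """Build the four-digit file code for the season ending in the given two-digit year."""
    start_year_suffix = (season_end_year - 1) % 100
    end_year_suffix = season_end_year % 100
    return f"{start_year_suffix:02d}{end_year_suffix:02d}"


# Seasons ending in 2001, 2016 and 2022
codes = [season_code(year) for year in (1, 16, 22)]
print(codes)  # ['0001', '1516', '2122']
```

With `league = "1"`, the dictionary key for the 2015/16 season is therefore `'1' + '1516'`, that is `'11516'`.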

In [None]:

spn11516 = seasonal_datasets['11516']
spn11617 = seasonal_datasets['11617']
spn11718 = seasonal_datasets['11718']
spn11819 = seasonal_datasets['11819']
spn11920 = seasonal_datasets['11920']
spn12021 = seasonal_datasets['12021']
spn12122 = seasonal_datasets['12122']

In [None]:
columns = ['Date', 'HomeTeam', 'AwayTeam', 'FTHG', 'FTAG', 'FTR', 'HS', 'AS',
        'HST', 'AST', 'HC', 'AC',
         "B365H", "B365D", "B365A" ]

Because we aim to predict 2023 match results, recent data is likely to reflect the current state of the teams and the league better than older data. Focusing on recent seasons may also reduce the risk of overfitting to historical trends that are no longer relevant. For these reasons, we use the seasons from 2016 to 2022.

In [None]:
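# .copy() gives each season its own DataFrame, so later column assignments
# do not operate on a slice of the original data
df2016 = spn11516[columns].copy()
df2017 = spn11617[columns].copy()
df2018 = spn11718[columns].copy()
df2019 = spn11819[columns].copy()
df2020 = spn11920[columns].copy()
df2021 = spn12021[columns].copy()
df2022 = spn12122[columns].copy()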

In [None]:
# This function shows which columns of a DataFrame contain missing values

def missing_values_summary(df):

    missing_counts = df.isnull().sum()

    missing_counts = missing_counts[missing_counts > 0]

    summary_df = pd.DataFrame(missing_counts, columns=['Missing Values Count'])
    summary_df.index.name = 'Column'

    return summary_df


In [None]:
summary = missing_values_summary(df2016)
print(summary)

Empty DataFrame
Columns: [Missing Values Count]
Index: []


In [None]:
summary = missing_values_summary(df2017)
print(summary)

        Missing Values Count
Column                      
FTR                       11


In [None]:
summary = missing_values_summary(df2018)
print(summary)

Empty DataFrame
Columns: [Missing Values Count]
Index: []


In [None]:
summary = missing_values_summary(df2019)
print(summary)

Empty DataFrame
Columns: [Missing Values Count]
Index: []


In [None]:
summary = missing_values_summary(df2020)
print(summary)

Empty DataFrame
Columns: [Missing Values Count]
Index: []


In [None]:
summary = missing_values_summary(df2021)
print(summary)

Empty DataFrame
Columns: [Missing Values Count]
Index: []


In [None]:
summary = missing_values_summary(df2022)
print(summary)

Empty DataFrame
Columns: [Missing Values Count]
Index: []


In [None]:
# Function to display rows with missing values from a DataFrame.

def show_rows_with_missing_values(df):

    rows_with_missing_values = df[df.isnull().any(axis=1)]

    return rows_with_missing_values



In [None]:
print(show_rows_with_missing_values(df2017))

        Date     HomeTeam    AwayTeam  FTHG  FTAG  FTR  HS  AS  HST  AST  HC  \
0   24/09/16        Eibar    Sociedad     2     0  NaN  17   3    3    0   5   
1   11/02/17       Alaves   Barcelona     0     6  NaN   9  21    4    9   3   
2   08/05/17      Leganes       Betis     4     0  NaN   6  13    4    3   2   
3   17/02/17      Granada       Betis     4     1  NaN  10   6    8    2   4   
4   21/09/16  Real Madrid  Villarreal     1     1  NaN  22   8    6    3  17   
5   22/01/17        Eibar   Barcelona     0     4  NaN  14  14    6    7   8   
6   16/10/16     Sp Gijon    Valencia     1     2  NaN  10   7    2    3   4   
7   07/01/17     Sociedad     Sevilla     0     4  NaN  11  11    5   10   7   
8   30/04/17        Celta  Ath Bilbao     0     3  NaN   5  22    0   12   0   
9   18/12/16    Barcelona     Espanol     4     1  NaN  14   4    7    1   5   
10  19/03/17    Barcelona    Valencia     4     2  NaN  28   6   12    5   6   

    AC  B365H  B365D  B365A  
0    3   

In the 2017 season dataset, the winner of each of these matches is clear from the 'FTHG' and 'FTAG' columns. I do not want to remove these rows, so I simply impute the missing result from those two columns.

In [None]:
df2017.loc[0, 'FTR'] = 'H'
df2017.loc[1, 'FTR'] = 'A'
df2017.loc[2, 'FTR'] = 'H'
df2017.loc[3, 'FTR'] = 'H'
df2017.loc[4, 'FTR'] = 'D'
df2017.loc[5, 'FTR'] = 'A'
df2017.loc[6, 'FTR'] = 'A'
df2017.loc[7, 'FTR'] = 'A'
df2017.loc[8, 'FTR'] = 'A'
df2017.loc[9, 'FTR'] = 'H'
df2017.loc[10, 'FTR'] = 'H'
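
The manual assignments above can be cross-checked with a rule that derives the result from the score. The sketch below is an illustration only: the helper `result_from_goals` is a hypothetical name, and the scores are copied from the eleven rows printed earlier.

```python
def result_from_goals(home_goals, away_goals):
    """Return 'H', 'D' or 'A' from the full-time goals of the home and away team."""
    if home_goals > away_goals:
        return "H"
    if home_goals < away_goals:
        return "A"
    return "D"


# (FTHG, FTAG) for rows 0..10 of the 2017 output shown above
scores = [(2, 0), (0, 6), (4, 0), (4, 1), (1, 1), (0, 4),
          (1, 2), (0, 4), (0, 3), (4, 1), (4, 2)]
results = [result_from_goals(home, away) for home, away in scores]
print(results)
```

The derived list agrees with the values assigned by hand, and the same rule could be applied to any other season with missing FTR values.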

In [None]:
# Function to replace the values in 'FTHG' (Full Time Home Team Goals) and 'FTAG' (Full Time Away Team Goals) with their absolute values.

def transform_goals_to_absolute(df):

    df['FTHG'] = df['FTHG'].abs()
    df['FTAG'] = df['FTAG'].abs()

    return df

I created this function because the 'FTHG' and 'FTAG' columns sometimes contain negative values, and a number of goals cannot be negative.

In [None]:
df2016 = transform_goals_to_absolute(df2016)
df2017 = transform_goals_to_absolute(df2017)
df2018 = transform_goals_to_absolute(df2018)
df2019 = transform_goals_to_absolute(df2019)
df2020 = transform_goals_to_absolute(df2020)
df2021 = transform_goals_to_absolute(df2021)
df2022 = transform_goals_to_absolute(df2022)



In [None]:
#  Identifies and prints rows containing outliers for all numerical columns in the DataFrame.

def find_and_print_outlier_rows(df):

    for column in df.select_dtypes(include=['number']).columns:
        Q1 = df[column].quantile(0.25)
        Q3 = df[column].quantile(0.75)
        IQR = Q3 - Q1
        lower_bound = Q1 - 3 * IQR
        upper_bound = Q3 + 3 * IQR

        outliers = df[(df[column] < lower_bound) | (df[column] > upper_bound)]

        if not outliers.empty:
            print(f"Rows with outliers in column '{column}':")
            print(outliers)
            print("\n")
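
To illustrate the 3 × IQR rule used in `find_and_print_outlier_rows`, here is a small standalone sketch on made-up values. The helper `iqr_bounds` is hypothetical, and it uses the standard library's `statistics.quantiles` with `method="inclusive"`, which should correspond to the linear interpolation that pandas' `quantile` uses by default.

```python
import statistics


def iqr_bounds(values, k=3.0):
    """Return (lower, upper) bounds lying k * IQR beyond the first and third quartiles."""
    # method="inclusive" interpolates linearly between the sorted data points
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


home_goals = [0, 1, 1, 2, 2, 3, 10]  # illustrative values, not taken from the data
lower, upper = iqr_bounds(home_goals)
outliers = [goals for goals in home_goals if goals < lower or goals > upper]
print(lower, upper, outliers)  # -3.5 7.0 [10]
```

A multiplier of 3 is stricter than the common 1.5, so only extreme scorelines are flagged.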

In [None]:
find_and_print_outlier_rows(df2016)

Rows with outliers in column 'FTHG':
         Date     HomeTeam    AwayTeam  FTHG  FTAG FTR  HS  AS  HST  AST  HC  \
63   23/04/16    Barcelona    Sp Gijon     6     0   H  24   5   10    3   6   
87   31/01/16  Real Madrid     Espanol     6     0   H  18  15    7    5   3   
100  05/03/16  Real Madrid       Celta     7     1   H  20   8   12    3  10   
138  14/02/16    Barcelona       Celta     6     1   H  19  10   11    4   1   
142  17/01/16    Barcelona  Ath Bilbao     6     0   H  20   9    8    3   6   
160  20/12/15  Real Madrid   Vallecano    10     2   H  30   9   15    4  10   
359  12/03/16    Barcelona      Getafe     6     0   H  16   6    9    1   5   

     AC  B365H  B365D  B365A  
63    1   1.04   17.0   34.0  
87    1   1.10   11.0   21.0  
100   5   1.30    6.0    8.0  
138   1   1.08   11.0   26.0  
142   2   1.14    8.0   19.0  
160   4   1.10   11.0   21.0  
359   3   1.06   15.0   26.0  


Rows with outliers in column 'HC':
         Date HomeTeam  AwayTeam  FTH

In [None]:
# Keep the rows where both 'FTHG' and 'FTAG' are smaller than 30, including rows where 'FTHG' or 'FTAG' is missing.

def filter_goals_under_30(df):

    filtered_df = df[((df['FTHG'] < 30) & (df['FTAG'] < 30)) | df['FTHG'].isna() | df['FTAG'].isna()]
    return filtered_df

Occasionally, the 'FTHG' and 'FTAG' columns hold implausibly high values. For example, a record suggesting that Liverpool scored 608 goals in a single match is clearly a data error.

In [None]:
df2016 = filter_goals_under_30(df2016)
df2017 = filter_goals_under_30(df2017)
df2018 = filter_goals_under_30(df2018)
df2019 = filter_goals_under_30(df2019)
df2020 = filter_goals_under_30(df2020)
df2021 = filter_goals_under_30(df2021)
df2022 = filter_goals_under_30(df2022)

In [None]:
# Function to selectively impute missing values in a DataFrame using KNNImputer.
# The imputation is applied only to columns with missing values, and results are rounded to integers.


from sklearn.impute import KNNImputer

def impute_missing_values_knn(df, n_neighbors=5):
    cols_with_missing = df.columns[df.isnull().any()]
    numeric_cols_with_missing = df[cols_with_missing].select_dtypes(include=[np.number]).columns

    imputer = KNNImputer(n_neighbors=n_neighbors)

    df_numeric_imputed = df.copy()
    if len(numeric_cols_with_missing) > 0:
        imputed_data = imputer.fit_transform(df[numeric_cols_with_missing])
        df_imputed = pd.DataFrame(imputed_data, columns=numeric_cols_with_missing, index=df.index)

        for col in numeric_cols_with_missing:
            df_numeric_imputed[col] = df_numeric_imputed[col].fillna(np.round(df_imputed[col]))

    return df_numeric_imputed

Function to preprocess football data and create new features:

* Home and away team win rates
* Home and away team average goals scored and conceded
* Implied winning probabilities from the bookmaker's betting odds
* Goal ratio for shots on target (total goals / total shots on target)
* Removal of rows where the home team is the same as the away team

In [None]:
def preprocess_football_data(df):


    # Calculating win rates and average goals
    home_win_rate = df.groupby('HomeTeam')['FTR'].apply(lambda x: round((x == 'H').mean(), 2)).to_dict()
    away_win_rate = df.groupby('AwayTeam')['FTR'].apply(lambda x: round((x == 'A').mean(), 2)).to_dict()
    home_goals_avg = df.groupby('HomeTeam')['FTHG'].mean().apply(lambda x: round(x, 2)).to_dict()
    away_goals_avg = df.groupby('AwayTeam')['FTAG'].mean().apply(lambda x: round(x, 2)).to_dict()
    home_goals_conceded_avg = df.groupby('HomeTeam')['FTAG'].mean().apply(lambda x: round(x, 2)).to_dict()
    away_goals_conceded_avg = df.groupby('AwayTeam')['FTHG'].mean().apply(lambda x: round(x, 2)).to_dict()
    goal_ratio_H = df.groupby('HomeTeam').apply(lambda x: round(x['FTHG'].sum() / x['HST'].sum(),2) if x['HST'].sum() > 0 else 0)
    goal_ratio_A = df.groupby('AwayTeam').apply(lambda x: round(x['FTAG'].sum() / x['AST'].sum(),2) if x['AST'].sum() > 0 else 0)


    # Mapping the win rates and average goals to the main DataFrame
    df['HomeTeam_WinRate'] = df['HomeTeam'].map(home_win_rate)
    df['AwayTeam_WinRate'] = df['AwayTeam'].map(away_win_rate)
    df['HomeTeam_GoalsAvg'] = df['HomeTeam'].map(home_goals_avg)
    df['AwayTeam_GoalsAvg'] = df['AwayTeam'].map(away_goals_avg)
    df['HomeTeam_goals_conceded_avg'] = df['HomeTeam'].map(home_goals_conceded_avg)
    df['AwayTeam_goals_conceded_avg'] = df['AwayTeam'].map(away_goals_conceded_avg)


    # Calculating implied probabilities from betting odds
    df['Broker_prob_H'] = round(1 / df['B365H'], 2)
    df['Broker_prob_D'] = round(1 / df['B365D'], 2)
    df['Broker_prob_A'] = round(1 / df['B365A'], 2)

    # Calculate the total goals for each match
    df['total_goal'] = df['FTHG'] + df['FTAG']


    # Map the conversion rates back to the original DataFrame
    df['H_goal_ratio'] = df['HomeTeam'].map(goal_ratio_H)
    df['A_goal_ratio'] = df['AwayTeam'].map(goal_ratio_A)

    # Remove rows where the home team is the same as the away team
    clean_df = df[df['HomeTeam'] != df['AwayTeam']].copy()

    return clean_df
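
The `Broker_prob_*` features are plain inverse odds. Because bookmaker odds include a margin, the three inverse values usually sum to slightly more than 1. The sketch below shows this for the odds of the first match in the training output (3.75, 3.4, 2.05) and, as an optional step that the notebook does not perform, rescales them to sum to 1. The helper `implied_probabilities` is a hypothetical name.

```python
def implied_probabilities(odds_home, odds_draw, odds_away, normalize=False):
    """Convert decimal odds to implied probabilities, optionally rescaled to sum to 1."""
    raw = [1 / odds_home, 1 / odds_draw, 1 / odds_away]
    if not normalize:
        return raw
    total = sum(raw)
    return [probability / total for probability in raw]


raw = implied_probabilities(3.75, 3.4, 2.05)
fair = implied_probabilities(3.75, 3.4, 2.05, normalize=True)

print([round(p, 2) for p in raw])   # [0.27, 0.29, 0.49]
print(sum(raw) > 1)                 # True: the excess is the bookmaker margin
print([round(p, 2) for p in fair])
```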

The function below analyzes matches using the head-to-head record between the two teams rather than their overall performance. This captures how a team performs against a specific opponent, which can differ from its general trend: a team might be strong overall but consistently struggle against a particular rival.

The formula used is ((3*wins + draws) - losses) / total_matches

It awards 3 points for a win and 1 point for a draw, and subtracts 1 point for each loss.

For example, suppose Barcelona met Getafe 4 times and won every match. Barcelona's adjusted win-loss ratio against Getafe is:
((3*4 + 0) - 0) / 4 = 3

Getafe lost all 4 of those matches, so its adjusted win-loss ratio against Barcelona is:
((3*0 + 0) - 4) / 4 = -1

The values would be different if Barcelona played a strong rival such as Real Madrid.

This feature is therefore a head-to-head performance indicator.
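
The arithmetic of the formula can be checked with a few lines of Python. This is a simplified sketch: unlike the notebook's helper, this version takes no `total_matches` argument and derives the total from the three counts.

```python
def adjusted_win_loss_ratio(wins, draws, losses):
    """Return ((3 * wins + draws) - losses) / total_matches, or 0 with no matches."""
    total_matches = wins + draws + losses
    if total_matches == 0:
        return 0
    return ((3 * wins + draws) - losses) / total_matches


barcelona = adjusted_win_loss_ratio(wins=4, draws=0, losses=0)
getafe = adjusted_win_loss_ratio(wins=0, draws=0, losses=4)
mixed = adjusted_win_loss_ratio(wins=1, draws=1, losses=2)

print(barcelona, getafe, mixed)  # 3.0 -1.0 0.5
```

The ratio therefore ranges from -1 (all matches lost) to 3 (all matches won).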



In [None]:
def add_adjusted_win_loss_ratio(df):
    def adjusted_win_loss_ratio(wins, draws, losses, total_matches):
        return ((3*wins + draws) - losses) / total_matches if total_matches > 0 else 0

    # Initialize a dictionary to track head-to-head stats
    head_to_head_stats = {}

    # Update head-to-head stats
    for index, row in df.iterrows():
        teams = tuple(sorted([row['HomeTeam'], row['AwayTeam']]))
        if teams not in head_to_head_stats:
            head_to_head_stats[teams] = {'wins': {teams[0]: 0, teams[1]: 0},
                                         'draws': 0,
                                         'total_matches': 0}

        head_to_head_stats[teams]['total_matches'] += 1
        if row['FTR'] == 'H':
            head_to_head_stats[teams]['wins'][row['HomeTeam']] += 1
        elif row['FTR'] == 'D':
            head_to_head_stats[teams]['draws'] += 1
        elif row['FTR'] == 'A':
            head_to_head_stats[teams]['wins'][row['AwayTeam']] += 1

    # Calculate and add the adjusted win-loss ratio to the DataFrame
    def calculate_ratio_for_match(row):
        teams = tuple(sorted([row['HomeTeam'], row['AwayTeam']]))
        stats = head_to_head_stats[teams]
        home_wins = stats['wins'][row['HomeTeam']]
        away_wins = stats['wins'][row['AwayTeam']]
        draws = stats['draws']
        total_matches = stats['total_matches']
        home_ratio = adjusted_win_loss_ratio(home_wins, draws, total_matches - home_wins - draws, total_matches)
        away_ratio = adjusted_win_loss_ratio(away_wins, draws, total_matches - away_wins - draws, total_matches)
        return pd.Series([home_ratio, away_ratio])

    df[['adjusted_win_lost_ratio_H', 'adjusted_win_lost_ratio_A']] = df.apply(calculate_ratio_for_match, axis=1)

    return df

I created the features 'attack_strength_home_team' and 'attack_strength_away_team' for every team in the league.

The function calculate_attack_strength evaluates each team's offensive capability, its "attack strength", by dividing the team's total goals by the league's average goals per match. This is done separately for home and away matches.


In [None]:
def calculate_attack_strength(df):
    # Calculate total goals for each team
    total_home_goals = df.groupby('HomeTeam')['FTHG'].sum()
    total_away_goals = df.groupby('AwayTeam')['FTAG'].sum()

    # Calculate league averages for home and away goals
    average_home_goals = df['FTHG'].mean()
    average_away_goals = df['FTAG'].mean()

    # Calculate attack strength
    df['attack_strength_home_team'] = df['HomeTeam'].apply(lambda x: round(total_home_goals[x] / average_home_goals,2))
    df['attack_strength_away_team'] = df['AwayTeam'].apply(lambda x: round(total_away_goals[x] / average_away_goals,2))

    return df
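
The following sketch reproduces the home-side calculation on four made-up matches using plain Python, so the numbers can be checked by hand. The match list and variable names are illustrative only.

```python
from collections import defaultdict

# (HomeTeam, AwayTeam, FTHG, FTAG) for a made-up mini league
matches = [
    ("A", "B", 2, 1),
    ("B", "A", 0, 0),
    ("A", "C", 4, 0),
    ("C", "B", 1, 3),
]

# Total home goals per team, as in df.groupby('HomeTeam')['FTHG'].sum()
total_home_goals = defaultdict(int)
for home_team, _, home_goals, _ in matches:
    total_home_goals[home_team] += home_goals

# League average of home goals per match, as in df['FTHG'].mean()
average_home_goals = sum(match[2] for match in matches) / len(matches)

attack_strength_home = {
    team: round(goals / average_home_goals, 2)
    for team, goals in total_home_goals.items()
}
print(attack_strength_home)  # {'A': 3.43, 'B': 0.0, 'C': 0.57}
```

Because the numerator is a season total rather than a per-match average, the value grows with the number of matches a team plays, which would explain why the attack-strength values in the notebook's output are well above 1.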


In [None]:
df2016 =  preprocess_football_data(df2016)
df2017 =  preprocess_football_data(df2017)
df2018 =  preprocess_football_data(df2018)
df2019 =  preprocess_football_data(df2019)
df2020 =  preprocess_football_data(df2020)
df2021 =  preprocess_football_data(df2021)
df2022 =  preprocess_football_data(df2022)

In [None]:
df2016 = calculate_attack_strength(df2016)
df2017 = calculate_attack_strength(df2017)
df2018 = calculate_attack_strength(df2018)
df2019 = calculate_attack_strength(df2019)
df2020 = calculate_attack_strength(df2020)
df2021 = calculate_attack_strength(df2021)
df2022 = calculate_attack_strength(df2022)

In [None]:
df2016 =  add_adjusted_win_loss_ratio(df2016)
df2017 =  add_adjusted_win_loss_ratio(df2017)
df2018 =  add_adjusted_win_loss_ratio(df2018)
df2019 =  add_adjusted_win_loss_ratio(df2019)
df2020 =  add_adjusted_win_loss_ratio(df2020)
df2021 =  add_adjusted_win_loss_ratio(df2021)
df2022 =  add_adjusted_win_loss_ratio(df2022)

In [None]:
def process_time_data(df, target_year):
    # Convert 'Date' to datetime; the source files store dates day-first (e.g. 24/09/16)
    df['Date'] = pd.to_datetime(df['Date'], dayfirst=True)

    # Extract the calendar year from 'Date'
    df['Year'] = df['Date'].dt.year

    # A season spans two calendar years, so label every match with the season's target year
    df['Year'] = df['Year'].apply(lambda x: target_year if x != target_year else x)

    # Drop 'Date', which is no longer needed
    df.drop(['Date'], axis=1, inplace=True)

    return df
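
The dates in the earlier output, such as 24/09/16, are written day-first. As a small illustration of why the day/month order matters, the sketch below parses an ambiguous date from that output with the standard library. The helper `parse_match_date` is hypothetical, and its format string assumes two-digit years as in the training output.

```python
from datetime import datetime


def parse_match_date(text):
    """Parse a day-first date with a two-digit year, e.g. '24/09/16'."""
    return datetime.strptime(text, "%d/%m/%y")


# '11/02/17' is 11 February 2017, not 2 November 2017
ambiguous = parse_match_date("11/02/17")
unambiguous = parse_match_date("24/09/16")
print(ambiguous.date(), unambiguous.date())  # 2017-02-11 2016-09-24
```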


In [None]:
df2016 = process_time_data(df2016, 2016)
df2017 = process_time_data(df2017, 2017)
df2018 = process_time_data(df2018, 2018)
df2019 = process_time_data(df2019, 2019)
df2020 = process_time_data(df2020, 2020)
df2021 = process_time_data(df2021, 2021)
df2022 = process_time_data(df2022, 2022)



In [None]:
columns_to_drop = ['FTHG', 'FTAG', 'HTR', 'HC', 'AC', 'HST', 'AST', 'HS', 'AS'] # these columns would not be available in test set

In [None]:
spn1 = pd.concat( [df2016, df2017, df2018, df2019, df2020, df2021, df2022], ignore_index=True)

In [None]:
spn1 = spn1.drop(columns=columns_to_drop, errors='ignore')

In [None]:
spn1.head()

Unnamed: 0,HomeTeam,AwayTeam,FTR,B365H,B365D,B365A,HomeTeam_WinRate,AwayTeam_WinRate,HomeTeam_GoalsAvg,AwayTeam_GoalsAvg,...,Broker_prob_D,Broker_prob_A,total_goal,H_goal_ratio,A_goal_ratio,attack_strength_home_team,attack_strength_away_team,adjusted_win_lost_ratio_H,adjusted_win_lost_ratio_A,Year
0,Eibar,Celta,D,3.75,3.4,2.05,0.39,0.42,1.28,1.16,...,0.29,0.49,2,0.3,0.34,14.21,19.49,0.0,2.0,2016
1,Levante,Las Palmas,H,2.38,3.1,3.25,0.39,0.21,1.22,1.05,...,0.32,0.31,5,0.3,0.23,13.59,17.71,2.0,0.0,2016
2,Valencia,Villarreal,A,2.2,3.3,3.4,0.32,0.32,1.32,0.95,...,0.3,0.29,2,0.32,0.32,15.45,15.94,-1.0,3.0,2016
3,Valencia,Sp Gijon,A,1.75,3.75,4.75,0.32,0.17,1.32,0.67,...,0.27,0.21,1,0.32,0.33,15.45,10.63,1.0,1.0,2016
4,Las Palmas,Sevilla,H,3.3,3.4,2.2,0.42,0.0,1.32,0.65,...,0.29,0.45,2,0.32,0.17,15.45,9.74,1.0,1.0,2016


In [None]:
spn1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2602 entries, 0 to 2601
Data columns (total 23 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   HomeTeam                     2602 non-null   object 
 1   AwayTeam                     2602 non-null   object 
 2   FTR                          2602 non-null   object 
 3   B365H                        2602 non-null   float64
 4   B365D                        2602 non-null   float64
 5   B365A                        2602 non-null   float64
 6   HomeTeam_WinRate             2602 non-null   float64
 7   AwayTeam_WinRate             2602 non-null   float64
 8   HomeTeam_GoalsAvg            2602 non-null   float64
 9   AwayTeam_GoalsAvg            2602 non-null   float64
 10  HomeTeam_goals_conceded_avg  2602 non-null   float64
 11  AwayTeam_goals_conceded_avg  2602 non-null   float64
 12  Broker_prob_H                2602 non-null   float64
 13  Broker_prob_D     

Apply the same process to the Spanish second division

Load the Spanish second-tier data

In [None]:
base_path = "/content/drive/MyDrive/train"
country = "spain"
league = "2"
seasonal_datasets = load_seasonal_data(base_path, country, league, 1, 22)

In [None]:

spn21718 = seasonal_datasets['21718']
spn21819 = seasonal_datasets['21819']
spn21920 = seasonal_datasets['21920']
spn22021 = seasonal_datasets['22021']
spn22122 = seasonal_datasets['22122']

In [None]:
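# .copy() avoids assigning to a slice of the original DataFrames later on
df20182 = spn21718[columns].copy()
df20192 = spn21819[columns].copy()
df20202 = spn21920[columns].copy()
df20212 = spn22021[columns].copy()
df20222 = spn22122[columns].copy()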

In [None]:
summary = missing_values_summary(df20182)
print(summary)

        Missing Values Count
Column                      
B365H                      1
B365D                      1
B365A                      1


In [None]:
summary = missing_values_summary(df20192)
print(summary)

        Missing Values Count
Column                      
HS                        21
AS                        21
HST                       21
AST                       21
HC                        21
AC                        21
B365H                     21
B365D                     21
B365A                     21


In [None]:
summary = missing_values_summary(df20202)
print(summary)

        Missing Values Count
Column                      
B365H                      2
B365D                      2
B365A                      2


In [None]:
summary = missing_values_summary(df20212)
print(summary)

        Missing Values Count
Column                      
B365H                      8
B365D                      8
B365A                      8


In [None]:
summary = missing_values_summary(df20222)
print(summary)

        Missing Values Count
Column                      
B365H                      2
B365D                      2
B365A                      2


In [None]:
df20182 = transform_goals_to_absolute(df20182)
df20192 = transform_goals_to_absolute(df20192)
df20202 = transform_goals_to_absolute(df20202)
df20212 = transform_goals_to_absolute(df20212)
df20222 = transform_goals_to_absolute(df20222)



In [None]:
df20182 = filter_goals_under_30(df20182)
df20192 = filter_goals_under_30(df20192)
df20202 = filter_goals_under_30(df20202)
df20212 = filter_goals_under_30(df20212)
df20222 = filter_goals_under_30(df20222)

In [None]:
df20182 = process_time_data(df20182, 2018)
df20192 = process_time_data(df20192, 2019)
df20202 = process_time_data(df20202, 2020)
df20212 = process_time_data(df20212, 2021)
df20222 = process_time_data(df20222, 2022)



In [None]:

df20182 = impute_missing_values_knn(df20182)
df20192 = impute_missing_values_knn(df20192)
df20202 = impute_missing_values_knn(df20202)
df20212 = impute_missing_values_knn(df20212)
df20222 = impute_missing_values_knn(df20222)

In [None]:
df20182 =  preprocess_football_data(df20182)
df20192 =  preprocess_football_data(df20192)
df20202 =  preprocess_football_data(df20202)
df20212 =  preprocess_football_data(df20212)
df20222 =  preprocess_football_data(df20222)

In [None]:
df20182 = calculate_attack_strength(df20182)
df20192 = calculate_attack_strength(df20192)
df20202 = calculate_attack_strength(df20202)
df20212 = calculate_attack_strength(df20212)
df20222 = calculate_attack_strength(df20222)

In [None]:
df20182 =  add_adjusted_win_loss_ratio(df20182)
df20192 =  add_adjusted_win_loss_ratio(df20192)
df20202 =  add_adjusted_win_loss_ratio(df20202)
df20212 =  add_adjusted_win_loss_ratio(df20212)
df20222 =  add_adjusted_win_loss_ratio(df20222)

In [None]:
spn2 = pd.concat([df20182, df20192, df20202, df20212, df20222], ignore_index=True)

In [None]:
spn2 = spn2.drop(columns=columns_to_drop, errors='ignore')

In [None]:
spn2.head()

Unnamed: 0,HomeTeam,AwayTeam,FTR,B365H,B365D,B365A,Year,HomeTeam_WinRate,AwayTeam_WinRate,HomeTeam_GoalsAvg,...,Broker_prob_H,Broker_prob_D,Broker_prob_A,total_goal,H_goal_ratio,A_goal_ratio,attack_strength_home_team,attack_strength_away_team,adjusted_win_lost_ratio_H,adjusted_win_lost_ratio_A
0,Reus Deportiu,Valladolid,D,3.2,3.0,2.39,2018,0.35,0.25,1.0,...,0.31,0.33,0.42,4,0.28,0.31,14.45,27.16,0.0,2.0
1,Sp Gijon,Sevilla B,H,1.57,3.75,6.25,2018,0.71,0.1,1.86,...,0.64,0.27,0.16,3,0.38,0.15,28.18,11.95,3.0,-1.0
2,Tenerife,Granada,D,1.95,3.3,4.0,2018,0.55,0.15,1.85,...,0.51,0.3,0.25,4,0.33,0.26,26.73,19.55,0.0,2.0
3,Cadiz,Lorca,D,1.5,3.6,9.0,2018,0.48,0.05,1.19,...,0.67,0.28,0.11,0,0.32,0.17,18.06,13.04,0.0,2.0
4,Lugo,Huesca,A,4.5,3.0,1.95,2018,0.48,0.38,1.1,...,0.22,0.33,0.51,2,0.26,0.34,16.62,32.59,-1.0,3.0


In [None]:
spn2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2263 entries, 0 to 2262
Data columns (total 23 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   HomeTeam                     2263 non-null   object 
 1   AwayTeam                     2263 non-null   object 
 2   FTR                          2263 non-null   object 
 3   B365H                        2263 non-null   float64
 4   B365D                        2263 non-null   float64
 5   B365A                        2263 non-null   float64
 6   Year                         2263 non-null   int64  
 7   HomeTeam_WinRate             2263 non-null   float64
 8   AwayTeam_WinRate             2263 non-null   float64
 9   HomeTeam_GoalsAvg            2263 non-null   float64
 10  AwayTeam_GoalsAvg            2263 non-null   float64
 11  HomeTeam_goals_conceded_avg  2263 non-null   float64
 12  AwayTeam_goals_conceded_avg  2263 non-null   float64
 13  Broker_prob_H     

Merge the two divisions for each season

In [None]:
data2018 = pd.concat([df2018, df20182,], ignore_index=True)
data2019 = pd.concat([df2019, df20192,], ignore_index=True)
data2020 = pd.concat([df2020, df20202,], ignore_index=True)
data2021 = pd.concat([df2021, df20212,], ignore_index=True)
data2022 = pd.concat([df2022, df20222,], ignore_index=True)

The issue with the 2023 test set is the promotion of new teams from the second division to the first, information that isn't included in my first division training set. Therefore, I need to merge data from both divisions to incorporate details about the promoted teams. Note that this merged data is exclusively for reference and not used in training.
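
A quick way to see which teams need second-division history is to compare team sets. The sketch below uses made-up team lists, not the real league tables; in the notebook the sets would come from the `HomeTeam` columns of the corresponding DataFrames.

```python
# Illustrative team sets, not the real league tables
first_division_train = {"Barcelona", "Real Madrid", "Eibar", "Celta"}
second_division_train = {"Valladolid", "Lugo", "Huesca"}
test_season = {"Barcelona", "Real Madrid", "Celta", "Valladolid"}

# Teams in the test season that never appear in the first-division training data
unseen_in_first = test_season - first_division_train

# Of those, the teams whose history is available from the second division
covered_by_second = unseen_in_first & second_division_train

# Teams with no history in either division
still_missing = unseen_in_first - second_division_train

print(unseen_in_first, covered_by_second, still_missing)
```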

In [None]:
file_path = '/content/drive/My Drive/test/spain/1/2223.csv'
df2023 = pd.read_csv(file_path)

In [None]:
columns_test = ['Date', 'HomeTeam', 'AwayTeam',"B365H", "B365D", "B365A" ]

In [None]:
df2023 = df2023[columns_test].copy()  # copy so new feature columns can be added safely

In [None]:
df2023.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 380 entries, 0 to 379
Data columns (total 6 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Date      380 non-null    object 
 1   HomeTeam  380 non-null    object 
 2   AwayTeam  380 non-null    object 
 3   B365H     380 non-null    float64
 4   B365D     380 non-null    float64
 5   B365A     380 non-null    float64
dtypes: float64(3), object(3)
memory usage: 17.9+ KB


The primary aim of the function below is to enrich the new season's data with past performance metrics, which supports the prediction of team performance in the upcoming season.

For example, if Manchester United's average goals per home game from 2018 to 2022 were 1.8, 2.0, 2.1, 1.9, and 2.0, respectively, I would calculate the average as (1.8 + 2.0 + 2.1 + 1.9 + 2.0) / 5 = 1.96. So, for the 2023 season, Manchester United's "Average Goals Scored at Home" feature would be set to 1.96.
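
The averaging step can be verified with a short calculation. The values are the illustrative seasonal averages 1.8, 2.0, 2.1, 1.9 and 2.0 from the example; the variable names are hypothetical. The snippet also shows why the parentheses matter.

```python
seasonal_averages = [1.8, 2.0, 2.1, 1.9, 2.0]

# Mean of the five seasonal values
overall = round(sum(seasonal_averages) / len(seasonal_averages), 2)

# Without parentheses, only the last term is divided by 5
without_parentheses = round(1.8 + 2.0 + 2.1 + 1.9 + 2.0 / 5, 2)

print(overall)              # 1.96
print(without_parentheses)  # 8.2
```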

In [None]:
def calculate_and_apply_overall_averages(season_dfs, new_season_df):
    # Map each metric to the team column it is keyed on
    metrics = {
        'HomeTeam_WinRate': 'HomeTeam', 'AwayTeam_WinRate': 'AwayTeam',
        'HomeTeam_GoalsAvg': 'HomeTeam', 'AwayTeam_GoalsAvg': 'AwayTeam',
        'HomeTeam_goals_conceded_avg': 'HomeTeam', 'AwayTeam_goals_conceded_avg': 'AwayTeam',
        'H_goal_ratio': 'HomeTeam', 'A_goal_ratio': 'AwayTeam',
        'attack_strength_home_team': 'HomeTeam', 'attack_strength_away_team': 'AwayTeam'
    }

    for metric, team_col in metrics.items():
        # One value per team and season (the metric is constant within a season)
        per_season = [df.groupby(team_col)[metric].mean() for df in season_dfs]

        # Average the seasonal values of each team across all seasons.
        # The previous loop overwrote the value season by season, so only the
        # last season in which a team appeared was kept.
        overall_average = pd.concat(per_season).groupby(level=0).mean().round(2)

        # Apply the overall averages to the new season's data
        new_season_df[metric] = new_season_df[team_col].map(overall_average)

    return new_season_df

# List of DataFrames from 2016 to 2022
season_dfs = [df2016, df2017, data2018, data2019, data2020, data2021, data2022]


In [None]:
df2023 = calculate_and_apply_overall_averages(season_dfs, df2023)

The function below analyzes matches using the historical head-to-head record between two teams rather than their overall performance. Unlike the earlier version, which analyzed a single season at a time, it uses all records from 2016 to 2022 (the training set).

In [None]:

def calculate_head_to_head_stats(merged_df):
    # Initialize a dictionary to track head-to-head stats
    head_to_head_stats = {}

    # Update head-to-head stats using merged_df
    for index, row in merged_df.iterrows():
        teams = tuple(sorted([row['HomeTeam'], row['AwayTeam']]))
        if teams not in head_to_head_stats:
            head_to_head_stats[teams] = {'wins': {teams[0]: 0, teams[1]: 0},
                                         'draws': 0,
                                         'total_matches': 0}

        head_to_head_stats[teams]['total_matches'] += 1
        if row['FTR'] == 'H':
            head_to_head_stats[teams]['wins'][row['HomeTeam']] += 1
        elif row['FTR'] == 'D':
            head_to_head_stats[teams]['draws'] += 1
        elif row['FTR'] == 'A':
            head_to_head_stats[teams]['wins'][row['AwayTeam']] += 1

    return head_to_head_stats

def adjusted_win_loss_ratio(wins, draws, losses, total_matches):
    ratio = ((3*wins + draws) - losses) / total_matches if total_matches > 0 else 0
    return round(ratio, 1)

def apply_adjusted_win_loss_ratio_to_2023(df2023, head_to_head_stats):
    def calculate_ratio_for_match(row):
        teams = tuple(sorted([row['HomeTeam'], row['AwayTeam']]))
        stats = head_to_head_stats.get(teams, {'wins': {row['HomeTeam']: 0, row['AwayTeam']: 0}, 'draws': 0, 'total_matches': 0})
        home_wins = stats['wins'].get(row['HomeTeam'], 0)
        away_wins = stats['wins'].get(row['AwayTeam'], 0)
        draws = stats['draws']
        total_matches = stats['total_matches']
        home_ratio = adjusted_win_loss_ratio(home_wins, draws, total_matches - home_wins - draws, total_matches)
        away_ratio = adjusted_win_loss_ratio(away_wins, draws, total_matches - away_wins - draws, total_matches)
        return pd.Series([home_ratio, away_ratio])

    df2023[['adjusted_win_lost_ratio_H', 'adjusted_win_lost_ratio_A']] = df2023.apply(calculate_ratio_for_match, axis=1)
    return df2023


merged_df = spn1.copy()


head_to_head_stats = calculate_head_to_head_stats(merged_df)

# Apply the adjusted win-loss ratio to df2023
df2023 = apply_adjusted_win_loss_ratio_to_2023(df2023, head_to_head_stats)
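
As a worked example of the scoring in `adjusted_win_loss_ratio`, the sketch below copies the formula and applies it to a made-up record of five meetings (the counts are hypothetical, not taken from the data).

```python
def adjusted_win_loss_ratio(wins, draws, losses, total_matches):
    # Same formula as in the notebook: 3 per win, 1 per draw, minus 1 per loss
    ratio = ((3 * wins + draws) - losses) / total_matches if total_matches > 0 else 0
    return round(ratio, 1)

# Hypothetical record over 5 meetings: the home side has 3 wins, 1 draw, 1 loss
home = adjusted_win_loss_ratio(wins=3, draws=1, losses=1, total_matches=5)
# Seen from the away side, the same meetings are 1 win, 1 draw, 3 losses
away = adjusted_win_loss_ratio(wins=1, draws=1, losses=3, total_matches=5)
# A pair of teams with no shared history falls back to 0
no_history = adjusted_win_loss_ratio(wins=0, draws=0, losses=0, total_matches=0)

print(home, away, no_history)  # 1.8 0.2 0
```

Before rounding, the two ratios always add up to 2 whenever the teams have met at least once, since (3w_h + d - w_a) + (3w_a + d - w_h) = 2(w_h + w_a + d). The two columns therefore carry the same information for any pair with a history.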

In [None]:
df2023.head()

Unnamed: 0,Date,HomeTeam,AwayTeam,B365H,B365D,B365A,HomeTeam_WinRate,AwayTeam_WinRate,HomeTeam_GoalsAvg,AwayTeam_GoalsAvg,HomeTeam_goals_conceded_avg,AwayTeam_goals_conceded_avg,H_goal_ratio,A_goal_ratio,attack_strength_home_team,attack_strength_away_team,adjusted_win_lost_ratio_H,adjusted_win_lost_ratio_A
0,12/08/2022,Osasuna,Sevilla,3.2,3.1,2.4,0.26,0.32,0.89,0.89,1.37,0.68,0.26,0.27,11.93,15.61,-0.5,2.5
1,13/08/2022,Celta,Espanol,1.8,3.75,4.75,0.37,0.06,1.37,0.82,1.21,1.88,0.34,0.22,18.25,12.86,1.2,0.8
2,13/08/2022,Valladolid,Villarreal,3.9,3.6,1.9,0.65,0.32,1.85,1.21,0.75,1.0,0.32,0.33,26.48,21.13,0.3,1.7
3,13/08/2022,Barcelona,Vallecano,1.22,7.0,12.0,0.63,0.16,1.95,0.68,1.0,1.47,0.35,0.23,25.97,11.94,1.7,0.3
4,14/08/2022,Cadiz,Sociedad,3.6,3.2,2.2,0.17,0.37,1.06,1.26,1.33,1.47,0.28,0.3,13.34,22.04,-1.0,3.0


In [None]:
df2023.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 380 entries, 0 to 379
Data columns (total 18 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   Date                         380 non-null    object 
 1   HomeTeam                     380 non-null    object 
 2   AwayTeam                     380 non-null    object 
 3   B365H                        380 non-null    float64
 4   B365D                        380 non-null    float64
 5   B365A                        380 non-null    float64
 6   HomeTeam_WinRate             380 non-null    float64
 7   AwayTeam_WinRate             380 non-null    float64
 8   HomeTeam_GoalsAvg            380 non-null    float64
 9   AwayTeam_GoalsAvg            380 non-null    float64
 10  HomeTeam_goals_conceded_avg  380 non-null    float64
 11  AwayTeam_goals_conceded_avg  380 non-null    float64
 12  H_goal_ratio                 380 non-null    float64
 13  A_goal_ratio        

In [None]:
df2023 = process_time_data(df2023, 2023)

  df['Date'] = pd.to_datetime(df['Date'])


In [None]:
def add_probability_B365(df):

    df['Broker_prob_H'] = round(1 / df['B365H'], 2)
    df['Broker_prob_D'] = round(1 / df['B365D'], 2)
    df['Broker_prob_A'] = round(1 / df['B365A'], 2)
    return df

In [None]:
df2023 = add_probability_B365(df2023)
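
`add_probability_B365` turns decimal odds into implied probabilities with `1 / odds`. As a worked example on the odds of the first `df2023` row shown earlier (3.2, 3.1, 2.4), the three values add up to slightly more than 1 because of the bookmaker's margin. Dividing by the sum is one common way to rescale them; that normalization is an assumption added here for illustration and is not part of the notebook's pipeline.

```python
odds = {"H": 3.2, "D": 3.1, "A": 2.4}  # B365 odds of the first df2023 row

raw = {k: 1 / v for k, v in odds.items()}        # what the notebook stores (rounded to 2 dp)
overround = sum(raw.values())                    # greater than 1: bookmaker margin
normalized = {k: p / overround for k, p in raw.items()}  # optional rescaling (assumption)

print({k: round(p, 2) for k, p in raw.items()})  # {'H': 0.31, 'D': 0.32, 'A': 0.42}
print(round(overround, 3))
print({k: round(p, 3) for k, p in normalized.items()})
```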

In [None]:
df2023.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 380 entries, 0 to 379
Data columns (total 21 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   HomeTeam                     380 non-null    object 
 1   AwayTeam                     380 non-null    object 
 2   B365H                        380 non-null    float64
 3   B365D                        380 non-null    float64
 4   B365A                        380 non-null    float64
 5   HomeTeam_WinRate             380 non-null    float64
 6   AwayTeam_WinRate             380 non-null    float64
 7   HomeTeam_GoalsAvg            380 non-null    float64
 8   AwayTeam_GoalsAvg            380 non-null    float64
 9   HomeTeam_goals_conceded_avg  380 non-null    float64
 10  AwayTeam_goals_conceded_avg  380 non-null    float64
 11  H_goal_ratio                 380 non-null    float64
 12  A_goal_ratio                 380 non-null    float64
 13  attack_strength_home

Classification task

I use the 2022 season as the validation set.

In [None]:
train = spn1[spn1['Year'] < 2022]
validation = spn1[spn1['Year'] == 2022]

In [None]:
X_train = train.drop(['FTR', 'total_goal'], axis=1)
y_train = train['FTR']
X_validation = validation.drop(['FTR', 'total_goal'], axis=1)
y_validation = validation['FTR']

In [None]:
X_test = df2023.copy()
X_test = X_test[X_train.columns]

In [None]:
X_train.shape , y_train.shape, X_validation.shape, y_validation.shape, X_test.shape

((2230, 21), (2230,), (372, 21), (372,), (380, 21))

In [None]:
y_train_enc = y_train.map({'H': 1, 'D': 0, 'A': 2})
y_validation_enc = y_validation.map({'H': 1, 'D': 0, 'A': 2})

In [None]:

from sklearn.preprocessing import LabelEncoder
# Update the set of all teams to include teams from X_test
all_teams = set(X_train['HomeTeam'].unique()).union(set(X_train['AwayTeam'].unique()))
all_teams = all_teams.union(set(X_validation['HomeTeam'].unique())).union(set(X_validation['AwayTeam'].unique()))
all_teams = all_teams.union(set(X_test['HomeTeam'].unique())).union(set(X_test['AwayTeam'].unique()))

# Convert the set to a list
all_teams_list = list(all_teams)

# Fit the LabelEncoder with the updated list of all teams
encoder = LabelEncoder()
encoder.fit(all_teams_list)

# Transform 'HomeTeam' and 'AwayTeam' in all datasets
X_train['HomeTeam'] = encoder.transform(X_train['HomeTeam'])
X_train['AwayTeam'] = encoder.transform(X_train['AwayTeam'])
X_validation['HomeTeam'] = encoder.transform(X_validation['HomeTeam'])
X_validation['AwayTeam'] = encoder.transform(X_validation['AwayTeam'])
X_test['HomeTeam'] = encoder.transform(X_test['HomeTeam'])
X_test['AwayTeam'] = encoder.transform(X_test['AwayTeam'])
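
`LabelEncoder` sorts the distinct labels and maps each one to its position in that order. It is fitted on the union of teams from train, validation and test because `transform` raises an error for a label it has not seen. A small sketch with team names from the `df2023.head()` output:

```python
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()
encoder.fit(["Sevilla", "Barcelona", "Osasuna"])

print(list(encoder.classes_))   # ['Barcelona', 'Osasuna', 'Sevilla']
codes = encoder.transform(["Sevilla", "Barcelona"])
print(list(codes))              # [2, 0]

# A team missing from the fitted set cannot be transformed
try:
    encoder.transform(["Cadiz"])
    unseen_ok = True
except ValueError:
    unseen_ok = False
print(unseen_ok)                # False
```

The integer codes only reflect alphabetical order. Tree-based models can split on them freely, but linear or distance-based models may read them as magnitudes.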

I utilize LazyPredict to gain an overview of various models, helping me identify which ones are most promising. Based on this analysis, I will select the five best models.

In [None]:
!pip install lazypredict

Collecting lazypredict
  Downloading lazypredict-0.2.12-py2.py3-none-any.whl (12 kB)
Installing collected packages: lazypredict
Successfully installed lazypredict-0.2.12


In [None]:
from lazypredict.Supervised import LazyClassifier
clf = LazyClassifier(verbose = 0, ignore_warnings = False, custom_metric = None)
models,pred = clf.fit(X_train, X_validation, y_train_enc, y_validation_enc)

  7%|▋         | 2/29 [00:00<00:05,  5.26it/s]

ROC AUC couldn't be calculated for AdaBoostClassifier
multi_class must be in ('ovo', 'ovr')
ROC AUC couldn't be calculated for BaggingClassifier
multi_class must be in ('ovo', 'ovr')
ROC AUC couldn't be calculated for BernoulliNB
multi_class must be in ('ovo', 'ovr')


 14%|█▍        | 4/29 [00:02<00:16,  1.50it/s]

ROC AUC couldn't be calculated for CalibratedClassifierCV
multi_class must be in ('ovo', 'ovr')
CategoricalNB model failed to execute
Negative values in data passed to CategoricalNB (input X)
ROC AUC couldn't be calculated for DecisionTreeClassifier
multi_class must be in ('ovo', 'ovr')
ROC AUC couldn't be calculated for DummyClassifier
multi_class must be in ('ovo', 'ovr')
ROC AUC couldn't be calculated for ExtraTreeClassifier
multi_class must be in ('ovo', 'ovr')


 31%|███       | 9/29 [00:02<00:05,  3.79it/s]

ROC AUC couldn't be calculated for ExtraTreesClassifier
multi_class must be in ('ovo', 'ovr')
ROC AUC couldn't be calculated for GaussianNB
multi_class must be in ('ovo', 'ovr')
ROC AUC couldn't be calculated for KNeighborsClassifier
multi_class must be in ('ovo', 'ovr')


 41%|████▏     | 12/29 [00:03<00:03,  4.79it/s]

ROC AUC couldn't be calculated for LabelPropagation
multi_class must be in ('ovo', 'ovr')


 45%|████▍     | 13/29 [00:03<00:03,  4.48it/s]

ROC AUC couldn't be calculated for LabelSpreading
multi_class must be in ('ovo', 'ovr')
ROC AUC couldn't be calculated for LinearDiscriminantAnalysis
multi_class must be in ('ovo', 'ovr')


 59%|█████▊    | 17/29 [00:04<00:02,  5.27it/s]

ROC AUC couldn't be calculated for LinearSVC
multi_class must be in ('ovo', 'ovr')
ROC AUC couldn't be calculated for LogisticRegression
multi_class must be in ('ovo', 'ovr')
ROC AUC couldn't be calculated for NearestCentroid
multi_class must be in ('ovo', 'ovr')


 62%|██████▏   | 18/29 [00:04<00:02,  4.66it/s]

ROC AUC couldn't be calculated for NuSVC
multi_class must be in ('ovo', 'ovr')
ROC AUC couldn't be calculated for PassiveAggressiveClassifier
multi_class must be in ('ovo', 'ovr')
ROC AUC couldn't be calculated for Perceptron
multi_class must be in ('ovo', 'ovr')
ROC AUC couldn't be calculated for QuadraticDiscriminantAnalysis
multi_class must be in ('ovo', 'ovr')


 86%|████████▌ | 25/29 [00:05<00:00,  7.41it/s]

ROC AUC couldn't be calculated for RandomForestClassifier
multi_class must be in ('ovo', 'ovr')
ROC AUC couldn't be calculated for RidgeClassifier
multi_class must be in ('ovo', 'ovr')
ROC AUC couldn't be calculated for RidgeClassifierCV
multi_class must be in ('ovo', 'ovr')
ROC AUC couldn't be calculated for SGDClassifier
multi_class must be in ('ovo', 'ovr')


 93%|█████████▎| 27/29 [00:05<00:00,  7.51it/s]

ROC AUC couldn't be calculated for SVC
multi_class must be in ('ovo', 'ovr')
StackingClassifier model failed to execute
StackingClassifier.__init__() missing 1 required positional argument: 'estimators'
ROC AUC couldn't be calculated for XGBClassifier
multi_class must be in ('ovo', 'ovr')
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000477 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 1026
[LightGBM] [Info] Number of data points in the train set: 2230, number of used features: 21
[LightGBM] [Info] Start training from score -1.351915
[LightGBM] [Info] Start training from score -0.782199
[LightGBM] [Info] Start training from score -1.259286


100%|██████████| 29/29 [00:05<00:00,  4.91it/s]

ROC AUC couldn't be calculated for LGBMClassifier
multi_class must be in ('ovo', 'ovr')





In [None]:
models

Unnamed: 0_level_0,Accuracy,Balanced Accuracy,ROC AUC,F1 Score,Time Taken
Model,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
XGBClassifier,0.72,0.7,,0.72,0.26
SVC,0.73,0.68,,0.71,0.25
RandomForestClassifier,0.72,0.68,,0.71,0.5
ExtraTreesClassifier,0.71,0.68,,0.7,0.35
BaggingClassifier,0.69,0.67,,0.69,0.16
LGBMClassifier,0.69,0.67,,0.69,0.23
NuSVC,0.7,0.66,,0.69,0.33
SGDClassifier,0.7,0.66,,0.68,0.07
LogisticRegression,0.68,0.65,,0.67,0.08
Perceptron,0.68,0.65,,0.67,0.02
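
The `ROC AUC` column is empty because, as the log above reports, scikit-learn's `roc_auc_score` needs an explicit `multi_class` setting (`'ovr'` or `'ovo'`) together with class probabilities when there are more than two classes. A minimal sketch on synthetic three-class data follows; the dataset and the logistic regression model are placeholders, not the notebook's features or models.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a three-class problem (H / D / A)
X, y = make_classification(n_samples=600, n_features=10, n_informative=5,
                           n_classes=3, random_state=42)
X_tr, X_va, y_tr, y_va = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
proba = model.predict_proba(X_va)   # one column per class, rows sum to 1

# One-vs-rest AUC, macro-averaged over the three classes
auc_ovr = roc_auc_score(y_va, proba, multi_class="ovr")
print(round(auc_ovr, 3))
```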


In [None]:
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier, ExtraTreesClassifier
from sklearn.metrics import accuracy_score, f1_score
from sklearn.tree import DecisionTreeClassifier
from xgboost import XGBClassifier

# Initialize the models
bagging_clf = BaggingClassifier(n_estimators=100, random_state=42)
extra_trees_clf = ExtraTreesClassifier(n_estimators=100, random_state=42)
random_forest_clf = RandomForestClassifier(n_estimators=100, random_state=42)
decision_tree_clf = DecisionTreeClassifier(random_state=42)
xgb_clf = XGBClassifier(random_state=42)

# Fit the models
bagging_clf.fit(X_train, y_train_enc)
extra_trees_clf.fit(X_train, y_train_enc)
random_forest_clf.fit(X_train, y_train_enc)
decision_tree_clf.fit(X_train, y_train_enc)
xgb_clf.fit(X_train, y_train_enc)

# Predict y_validation
y_pred_bagging = bagging_clf.predict(X_validation)
y_pred_extra_trees = extra_trees_clf.predict(X_validation)
y_pred_random_forest = random_forest_clf.predict(X_validation)
y_pred_decision_tree = decision_tree_clf.predict(X_validation)
y_pred_xgb = xgb_clf.predict(X_validation)

# Calculate accuracy and F1 score for each model
accuracy_bagging = accuracy_score(y_validation_enc, y_pred_bagging)
f1_bagging = f1_score(y_validation_enc, y_pred_bagging, average='macro')

accuracy_extra_trees = accuracy_score(y_validation_enc, y_pred_extra_trees)
f1_extra_trees = f1_score(y_validation_enc, y_pred_extra_trees, average='macro')

accuracy_random_forest = accuracy_score(y_validation_enc, y_pred_random_forest)
f1_random_forest = f1_score(y_validation_enc, y_pred_random_forest, average='macro')

accuracy_decision_tree = accuracy_score(y_validation_enc, y_pred_decision_tree)
f1_decision_tree = f1_score(y_validation_enc, y_pred_decision_tree, average='macro')

accuracy_xgb = accuracy_score(y_validation_enc, y_pred_xgb)
f1_xgb = f1_score(y_validation_enc, y_pred_xgb, average='macro')

# Print out the performance
print(f'Bagging Classifier - Accuracy: {accuracy_bagging}, F1 Score: {f1_bagging}')
print(f'Extra Trees Classifier - Accuracy: {accuracy_extra_trees}, F1 Score: {f1_extra_trees}')
print(f'Random Forest Classifier - Accuracy: {accuracy_random_forest}, F1 Score: {f1_random_forest}')
print(f'Decision Tree Classifier - Accuracy: {accuracy_decision_tree}, F1 Score: {f1_decision_tree}')
print(f'XGB Classifier - Accuracy: {accuracy_xgb}, F1 Score: {f1_xgb}')

Bagging Classifier - Accuracy: 0.6881720430107527, F1 Score: 0.6693389051561406
Extra Trees Classifier - Accuracy: 0.706989247311828, F1 Score: 0.6904039625082244
Random Forest Classifier - Accuracy: 0.7231182795698925, F1 Score: 0.7008498735818188
Decision Tree Classifier - Accuracy: 0.6397849462365591, F1 Score: 0.6351089928922348
XGB Classifier - Accuracy: 0.7204301075268817, F1 Score: 0.7106470772726663


Based on these results, I pick the three best models out of the five above for hyperparameter tuning with GridSearchCV.

In [None]:
from sklearn.model_selection import GridSearchCV

param_grid_xgb = {
    'n_estimators': [50, 100, 200],
    'max_depth': [3, 6, 10],
    'learning_rate': [0.01, 0.1, 0.2],
    'subsample': [0.5, 0.7, 1],
    'colsample_bytree': [0.5, 0.7, 1]
}

param_grid_random_forest = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

param_grid_extra_trees = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}



grid_search_xgb = GridSearchCV(xgb_clf, param_grid_xgb, cv=5, scoring='f1_macro', verbose=1, n_jobs=-1)
grid_search_random_forest = GridSearchCV(random_forest_clf, param_grid_random_forest, cv=5, scoring='f1_macro', verbose=1, n_jobs=-1)
grid_search_extra_trees = GridSearchCV(extra_trees_clf, param_grid_extra_trees, cv=5, scoring='f1_macro', verbose=1, n_jobs=-1)

grid_search_xgb.fit(X_train, y_train_enc)
grid_search_random_forest.fit(X_train, y_train_enc)
grid_search_extra_trees.fit(X_train, y_train_enc)

best_params_xgb = grid_search_xgb.best_params_
best_score_xgb = grid_search_xgb.best_score_

best_params_random_forest = grid_search_random_forest.best_params_
best_score_random_forest = grid_search_random_forest.best_score_

best_params_extra_trees = grid_search_extra_trees.best_params_
best_score_extra_trees = grid_search_extra_trees.best_score_


y_pred_xgb = grid_search_xgb.best_estimator_.predict(X_validation)
f1_score_xgb = f1_score(y_validation_enc, y_pred_xgb, average='macro')

y_pred_random_forest = grid_search_random_forest.best_estimator_.predict(X_validation)
f1_score_random_forest = f1_score(y_validation_enc, y_pred_random_forest, average='macro')

y_pred_extra_trees = grid_search_extra_trees.best_estimator_.predict(X_validation)
f1_score_extra_trees = f1_score(y_validation_enc, y_pred_extra_trees, average='macro')


results = {
    "XGB Classifier": {
        "Best Parameters": best_params_xgb,
        "Best Score": best_score_xgb,
        "F1 Score on Validation": f1_score_xgb
    },
    "Random Forest Classifier": {
        "Best Parameters": best_params_random_forest,
        "Best Score": best_score_random_forest,
        "F1 Score on Validation": f1_score_random_forest
    },
    "Extra Trees Classifier": {
        "Best Parameters": best_params_extra_trees,
        "Best Score": best_score_extra_trees,
        "F1 Score on Validation": f1_score_extra_trees
    }
}



Fitting 5 folds for each of 243 candidates, totalling 1215 fits
Fitting 5 folds for each of 108 candidates, totalling 540 fits
Fitting 5 folds for each of 108 candidates, totalling 540 fits
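
The candidate counts in this log follow directly from the grids: the number of candidates is the product of the list lengths, and each candidate is fitted once per fold. A short check (`n_fits` is a hypothetical helper written for this illustration):

```python
from math import prod

param_grid_xgb = {
    'n_estimators': [50, 100, 200],
    'max_depth': [3, 6, 10],
    'learning_rate': [0.01, 0.1, 0.2],
    'subsample': [0.5, 0.7, 1],
    'colsample_bytree': [0.5, 0.7, 1],
}
param_grid_forest = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
}

def n_fits(grid, cv=5):
    """Return (number of candidates, total fits) for an exhaustive grid search."""
    candidates = prod(len(values) for values in grid.values())
    return candidates, candidates * cv

print(n_fits(param_grid_xgb))     # (243, 1215)
print(n_fits(param_grid_forest))  # (108, 540)
```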


In [None]:
results

{'XGB Classifier': {'Best Parameters': {'colsample_bytree': 1,
   'learning_rate': 0.1,
   'max_depth': 3,
   'n_estimators': 50,
   'subsample': 0.5},
  'Best Score': 0.6941905203729608,
  'F1 Score on Validation': 0.6996657034392882},
 'Random Forest Classifier': {'Best Parameters': {'max_depth': 10,
   'min_samples_leaf': 1,
   'min_samples_split': 10,
   'n_estimators': 200},
  'Best Score': 0.6741715360960414,
  'F1 Score on Validation': 0.6901265113642401},
 'Extra Trees Classifier': {'Best Parameters': {'max_depth': 10,
   'min_samples_leaf': 4,
   'min_samples_split': 2,
   'n_estimators': 100},
  'Best Score': 0.6869174550510095,
  'F1 Score on Validation': 0.7002351044690635}}

Based on these results, I pick the final model, XGBoost with its best hyperparameters, and apply it to the test set.

In [None]:
optimal_xgb_clf = XGBClassifier(
    colsample_bytree=1,
    learning_rate=0.1,
    max_depth=3,
    n_estimators=50,
    subsample=0.5,
    random_state=42
)

In [None]:
optimal_xgb_clf.fit(X_train, y_train_enc)

In [None]:
y_pred_test_xgb = optimal_xgb_clf.predict(X_test)

In [None]:
# Define the inverse mapping
inverse_mapping = {1: 'H', 0: 'D', 2: 'A'}

# Convert y_pred_test_xgb back to original form
y_pred_test_xgb_original = [inverse_mapping[label] for label in y_pred_test_xgb]
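
The inverse mapping has to stay in sync with the encoding used for `y_train_enc`. One way to guarantee that, sketched here with made-up predictions, is to derive it from the forward mapping instead of typing it a second time:

```python
label_mapping = {'H': 1, 'D': 0, 'A': 2}   # same encoding as y_train_enc
inverse_mapping = {code: label for label, code in label_mapping.items()}

predicted_codes = [1, 0, 2, 1]             # made-up model output
predicted_labels = [inverse_mapping[code] for code in predicted_codes]
print(predicted_labels)  # ['H', 'D', 'A', 'H']
```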


In [None]:
predictions_df = pd.DataFrame(y_pred_test_xgb_original, columns=['Predictions'])

In [None]:
predictions_df.to_csv('spain_1.csv', index=False)

Regression task

In [None]:
train = spn1[spn1['Year'] < 2022]
validation = spn1[spn1['Year'] == 2022]

In [None]:
X_train = train.drop(['FTR', 'total_goal'], axis=1)
y_train = train['total_goal']
X_validation = validation.drop(['FTR', 'total_goal'], axis=1)
y_validation = validation['total_goal']

In [None]:
X_test = df2023.copy()
X_test = X_test[X_train.columns]

In [None]:
X_train.shape , y_train.shape, X_validation.shape, y_validation.shape, X_test.shape

((2230, 21), (2230,), (372, 21), (372,), (380, 21))

In [None]:

from sklearn.preprocessing import LabelEncoder
# Update the set of all teams to include teams from X_test
all_teams = set(X_train['HomeTeam'].unique()).union(set(X_train['AwayTeam'].unique()))
all_teams = all_teams.union(set(X_validation['HomeTeam'].unique())).union(set(X_validation['AwayTeam'].unique()))
all_teams = all_teams.union(set(X_test['HomeTeam'].unique())).union(set(X_test['AwayTeam'].unique()))

# Convert the set to a list
all_teams_list = list(all_teams)

# Fit the LabelEncoder with the updated list of all teams
encoder = LabelEncoder()
encoder.fit(all_teams_list)

# Transform 'HomeTeam' and 'AwayTeam' in all datasets
X_train['HomeTeam'] = encoder.transform(X_train['HomeTeam'])
X_train['AwayTeam'] = encoder.transform(X_train['AwayTeam'])
X_validation['HomeTeam'] = encoder.transform(X_validation['HomeTeam'])
X_validation['AwayTeam'] = encoder.transform(X_validation['AwayTeam'])
X_test['HomeTeam'] = encoder.transform(X_test['HomeTeam'])
X_test['AwayTeam'] = encoder.transform(X_test['AwayTeam'])

In [None]:
from lazypredict.Supervised import LazyRegressor


reg = LazyRegressor(verbose=0, ignore_warnings=False, custom_metric=None)

models, predictions = reg.fit(X_train, X_validation, y_train, y_validation)

 21%|██▏       | 9/42 [00:02<00:08,  3.78it/s]

GammaRegressor model failed to execute
Some value(s) of y are out of the valid range of the loss 'HalfGammaLoss'.


 74%|███████▍  | 31/42 [00:16<00:04,  2.62it/s]

QuantileRegressor model failed to execute
Solver interior-point is not anymore available in SciPy >= 1.11.0.


100%|██████████| 42/42 [00:21<00:00,  1.99it/s]

[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000543 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 1026
[LightGBM] [Info] Number of data points in the train set: 2230, number of used features: 21
[LightGBM] [Info] Start training from score 2.658744





In [None]:
models

Unnamed: 0_level_0,Adjusted R-Squared,R-Squared,RMSE,Time Taken
Model,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
GradientBoostingRegressor,0.13,0.18,1.56,0.56
NuSVR,0.11,0.16,1.58,0.27
RandomForestRegressor,0.11,0.16,1.58,1.74
LGBMRegressor,0.1,0.15,1.58,0.14
ExtraTreesRegressor,0.1,0.15,1.59,1.02
HistGradientBoostingRegressor,0.1,0.15,1.59,7.43
MLPRegressor,0.1,0.15,1.59,3.45
SVR,0.09,0.14,1.59,0.35
LassoLarsCV,0.09,0.14,1.6,0.1
LassoCV,0.09,0.14,1.6,0.37


In [None]:
from sklearn.linear_model import ElasticNetCV
from sklearn.svm import SVR
from sklearn.ensemble import RandomForestRegressor
from xgboost import XGBRegressor
from sklearn.linear_model import PoissonRegressor
from sklearn.metrics import mean_absolute_error, r2_score

# Initialize the models
elastic_net_cv = ElasticNetCV(cv=5, random_state=42)
poisson_regressor = PoissonRegressor()
svr = SVR()
random_forest_reg = RandomForestRegressor(n_estimators=100, random_state=42)
xgb_reg = XGBRegressor(random_state=42)

# Fit the models
elastic_net_cv.fit(X_train, y_train)
poisson_regressor.fit(X_train, y_train)
svr.fit(X_train, y_train)
random_forest_reg.fit(X_train, y_train)
xgb_reg.fit(X_train, y_train)

# Predict y_validation
y_pred_elastic_net_cv = elastic_net_cv.predict(X_validation)
y_pred_poisson_regressor = poisson_regressor.predict(X_validation)
y_pred_svr = svr.predict(X_validation)
y_pred_random_forest = random_forest_reg.predict(X_validation)
y_pred_xgb = xgb_reg.predict(X_validation)

# Round the predictions to the nearest whole number of goals
y_pred_elastic_net_cv_rounded = np.rint(y_pred_elastic_net_cv)
y_pred_poisson_regressor_rounded = np.rint(y_pred_poisson_regressor)
y_pred_svr_rounded = np.rint(y_pred_svr)
y_pred_random_forest_rounded = np.rint(y_pred_random_forest)
y_pred_xgb_rounded = np.rint(y_pred_xgb)

# Calculate MAE and R2 score using rounded predictions
mae_elastic_net_cv = mean_absolute_error(y_validation, y_pred_elastic_net_cv_rounded)
r2_elastic_net_cv = r2_score(y_validation, y_pred_elastic_net_cv_rounded)

mae_poisson_regressor = mean_absolute_error(y_validation, y_pred_poisson_regressor_rounded)
r2_poisson_regressor = r2_score(y_validation, y_pred_poisson_regressor_rounded)

mae_svr = mean_absolute_error(y_validation, y_pred_svr_rounded)
r2_svr = r2_score(y_validation, y_pred_svr_rounded)

mae_random_forest = mean_absolute_error(y_validation, y_pred_random_forest_rounded)
r2_random_forest = r2_score(y_validation, y_pred_random_forest_rounded)

mae_xgb = mean_absolute_error(y_validation, y_pred_xgb_rounded)
r2_xgb = r2_score(y_validation, y_pred_xgb_rounded)

# Print out the performance with rounded predictions
print(f'ElasticNetCV - MAE: {mae_elastic_net_cv}, R2 Score: {r2_elastic_net_cv}')
print(f'Poisson Regressor - MAE: {mae_poisson_regressor}, R2 Score: {r2_poisson_regressor}')
print(f'SVR - MAE: {mae_svr}, R2 Score: {r2_svr}')
print(f'Random Forest Regressor - MAE: {mae_random_forest}, R2 Score: {r2_random_forest}')
print(f'XGB Regressor - MAE: {mae_xgb}, R2 Score: {r2_xgb}')

ElasticNetCV - MAE: 1.2795698924731183, R2 Score: 0.06975293145596662
Poisson Regressor - MAE: 1.2661290322580645, R2 Score: 0.07609964634856725
SVR - MAE: 1.346774193548387, R2 Score: -0.08891494085904883
Random Forest Regressor - MAE: 1.228494623655914, R2 Score: 0.1395667952745735
XGB Regressor - MAE: 1.303763440860215, R2 Score: 0.010819150310389491
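
These scores are computed on predictions rounded with `np.rint`, whereas the grid search cell that follows scores the raw, unrounded predictions, so the two sets of MAE values are not directly comparable. A toy illustration with made-up numbers shows how far the two can differ:

```python
import numpy as np

y_true = np.array([2, 3, 1, 4, 0])
y_pred = np.array([2.4, 2.6, 1.4, 3.2, 0.6])   # made-up regression outputs

mae_raw = np.mean(np.abs(y_true - y_pred))                # errors 0.4, 0.4, 0.4, 0.8, 0.6
mae_rounded = np.mean(np.abs(y_true - np.rint(y_pred)))   # errors 0, 0, 0, 1, 1
print(mae_raw, mae_rounded)
```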


In [None]:
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import mean_absolute_error

# Parameter grids
param_grid_poisson = {
    'alpha': [0.01, 0.1, 1, 10],
    'max_iter': [100, 300, 500]
}

param_grid_random_forest = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

# GridSearchCV setup
grid_search_poisson = GridSearchCV(poisson_regressor, param_grid_poisson, cv=5, scoring='neg_mean_absolute_error', verbose=1, n_jobs=-1)
grid_search_random_forest = GridSearchCV(random_forest_reg, param_grid_random_forest, cv=5, scoring='neg_mean_absolute_error', verbose=1, n_jobs=-1)

# Fitting models
grid_search_poisson.fit(X_train, y_train)
grid_search_random_forest.fit(X_train, y_train)

# Best parameters and scores
best_params_poisson = grid_search_poisson.best_params_
best_score_poisson = grid_search_poisson.best_score_

best_params_random_forest = grid_search_random_forest.best_params_
best_score_random_forest = grid_search_random_forest.best_score_

# Predict and calculate MAE
y_pred_poisson = grid_search_poisson.best_estimator_.predict(X_validation)
mae_poisson = mean_absolute_error(y_validation, y_pred_poisson)

y_pred_random_forest = grid_search_random_forest.best_estimator_.predict(X_validation)
mae_random_forest = mean_absolute_error(y_validation, y_pred_random_forest)

# Results
results = {
    "Poisson Regressor": {
        "Best Parameters": best_params_poisson,
        "Best Score (Negative MAE)": best_score_poisson,
        "MAE on Validation": mae_poisson
    },
    "Random Forest Regressor": {
        "Best Parameters": best_params_random_forest,
        "Best Score (Negative MAE)": best_score_random_forest,
        "MAE on Validation": mae_random_forest
    }
}


elastic_net_cv = ElasticNetCV(cv=5, random_state=42).fit(X_train, y_train)
y_pred_elastic_net_cv = elastic_net_cv.predict(X_validation)
mae_elastic_net_cv = mean_absolute_error(y_validation, y_pred_elastic_net_cv)

results["ElasticNetCV"] = {
    "Best Parameters": elastic_net_cv.get_params(),
    "MAE on Validation": mae_elastic_net_cv
}

# Print results
for model, info in results.items():
    print(f"{model}:")
    for key, value in info.items():
        print(f"  {key}: {value}")
    print()

Fitting 5 folds for each of 12 candidates, totalling 60 fits
Fitting 5 folds for each of 108 candidates, totalling 540 fits
Poisson Regressor:
  Best Parameters: {'alpha': 0.01, 'max_iter': 500}
  Best Score (Negative MAE): -1.1967427766571797
  MAE on Validation: 1.2651641198849248

Random Forest Regressor:
  Best Parameters: {'max_depth': 10, 'min_samples_leaf': 2, 'min_samples_split': 5, 'n_estimators': 50}
  Best Score (Negative MAE): -1.216321108304
  MAE on Validation: 1.2474086003903386

ElasticNetCV:
  Best Parameters: {'alphas': None, 'copy_X': True, 'cv': 5, 'eps': 0.001, 'fit_intercept': True, 'l1_ratio': 0.5, 'max_iter': 1000, 'n_alphas': 100, 'n_jobs': None, 'positive': False, 'precompute': 'auto', 'random_state': 42, 'selection': 'cyclic', 'tol': 0.0001, 'verbose': 0}
  MAE on Validation: 1.264681409800629



In [None]:


# Set the best parameters for PoissonRegressor
best_params_poisson = {
    'alpha': 0.01,
    'max_iter': 500
}

# Initialize and fit the PoissonRegressor with the best parameters
poisson_model = PoissonRegressor(**best_params_poisson)
poisson_model.fit(X_train, y_train)

# Make predictions on the test set
y_pred_test_poisson = poisson_model.predict(X_test)

# Round the predictions to the nearest integer and convert to int type
y_pred_test_poisson_rounded = np.rint(y_pred_test_poisson).astype(int)




In [None]:


# Convert predictions to a DataFrame
predictions_df_poisson = pd.DataFrame(y_pred_test_poisson_rounded, columns=['Predicted_Total_Goals'])

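# Save to CSV (note: this is the same file name used for the classification
# predictions above, so running this cell overwrites that file)
predictions_df_poisson.to_csv('spain_1.csv', index=False)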


In [None]:
file_path_2 = '/content/drive/My Drive/test/spain/2/2223.csv'
df2023_spn2 = pd.read_csv(file_path_2)

In [None]:
df2023_spn2 = df2023_spn2[columns_test]

In [None]:
df2023_spn2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 462 entries, 0 to 461
Data columns (total 6 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Date      462 non-null    object 
 1   HomeTeam  462 non-null    object 
 2   AwayTeam  462 non-null    object 
 3   B365H     459 non-null    float64
 4   B365D     459 non-null    float64
 5   B365A     459 non-null    float64
dtypes: float64(3), object(3)
memory usage: 21.8+ KB


In [None]:
df2023_spn2 = impute_missing_values_knn(df2023_spn2)

In [None]:
df2023_spn2 = calculate_and_apply_overall_averages(season_dfs, df2023_spn2)

In [None]:
df2023_spn2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 462 entries, 0 to 461
Data columns (total 16 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   Date                         462 non-null    object 
 1   HomeTeam                     462 non-null    object 
 2   AwayTeam                     462 non-null    object 
 3   B365H                        462 non-null    float64
 4   B365D                        462 non-null    float64
 5   B365A                        462 non-null    float64
 6   HomeTeam_WinRate             420 non-null    float64
 7   AwayTeam_WinRate             420 non-null    float64
 8   HomeTeam_GoalsAvg            420 non-null    float64
 9   AwayTeam_GoalsAvg            420 non-null    float64
 10  HomeTeam_goals_conceded_avg  420 non-null    float64
 11  AwayTeam_goals_conceded_avg  420 non-null    float64
 12  H_goal_ratio                 420 non-null    float64
 13  A_goal_ratio        

In [None]:
merged_df = spn2.copy()

# Calculate head-to-head stats using merged data
head_to_head_stats = calculate_head_to_head_stats(merged_df)

# Apply the adjusted win-loss ratio to df2023_spn2
df2023_spn2 = apply_adjusted_win_loss_ratio_to_2023(df2023_spn2, head_to_head_stats)

In [None]:

def fill_missing_with_mean(df):
    """
    Fill missing values in each column of the DataFrame with the mean of that column.

    Parameters:
    df (pd.DataFrame): The dataset with missing values.

    Returns:
    pd.DataFrame: The DataFrame with missing values filled.
    """
    for column in df.columns:
        if df[column].dtype in ['float64', 'int64']:
            mean_value = round(df[column].mean(), 2)
            # Assign the result back instead of calling fillna(inplace=True) on a column selection
            df[column] = df[column].fillna(mean_value)
    return df

In [None]:
df2023_spn2 = fill_missing_with_mean(df2023_spn2)
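
The 420 non-null values out of 462 in the `info()` output above are what `Series.map` produces when some teams have no entry in the lookup dictionary: those rows become `NaN`. A small sketch of that behaviour and of the column-mean fill, using made-up values and a hypothetical `NewTeam`:

```python
import pandas as pd

team_avg = {"Alaves": 1.2, "Granada": 1.6}   # made-up historical averages
fixtures = pd.DataFrame({"HomeTeam": ["Alaves", "Granada", "NewTeam"]})

# Teams missing from the dictionary map to NaN
fixtures["HomeTeam_GoalsAvg"] = fixtures["HomeTeam"].map(team_avg)
n_missing = int(fixtures["HomeTeam_GoalsAvg"].isna().sum())

# Column mean (NaN is skipped), rounded as in fill_missing_with_mean
col_mean = round(fixtures["HomeTeam_GoalsAvg"].mean(), 2)
fixtures["HomeTeam_GoalsAvg"] = fixtures["HomeTeam_GoalsAvg"].fillna(col_mean)
print(n_missing, col_mean)
print(fixtures)
```

With this fill, a team without history is treated as an average side on every feature.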

In [None]:
df2023_spn2 = process_time_data(df2023_spn2, 2023)

In [None]:
df2023_spn2 = add_probability_B365(df2023_spn2)

In [None]:
df2023_spn2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 462 entries, 0 to 461
Data columns (total 21 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   HomeTeam                     462 non-null    object 
 1   AwayTeam                     462 non-null    object 
 2   B365H                        462 non-null    float64
 3   B365D                        462 non-null    float64
 4   B365A                        462 non-null    float64
 5   HomeTeam_WinRate             462 non-null    float64
 6   AwayTeam_WinRate             462 non-null    float64
 7   HomeTeam_GoalsAvg            462 non-null    float64
 8   AwayTeam_GoalsAvg            462 non-null    float64
 9   HomeTeam_goals_conceded_avg  462 non-null    float64
 10  AwayTeam_goals_conceded_avg  462 non-null    float64
 11  H_goal_ratio                 462 non-null    float64
 12  A_goal_ratio                 462 non-null    float64
 13  attack_strength_home

In [None]:
train = spn2[spn2['Year'] < 2022]
validation = spn2[spn2['Year'] == 2022]

In [None]:
X_train = train.drop(['FTR', 'total_goal'], axis=1)
y_train = train['FTR']
X_validation = validation.drop(['FTR', 'total_goal'], axis=1)
y_validation = validation['FTR']

In [None]:
X_test = df2023_spn2.copy()
X_test = X_test[X_train.columns]

In [None]:
X_train.shape , y_train.shape, X_validation.shape, y_validation.shape, X_test.shape

((1810, 21), (1810,), (453, 21), (453,), (462, 21))

In [None]:
y_train_enc = y_train.map({'H': 1, 'D': 0, 'A': 2})
y_validation_enc = y_validation.map({'H': 1, 'D': 0, 'A': 2})
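
The same label mapping is needed again later to decode the model's predictions. One way to keep the two directions in sync is to define the mapping once and derive the inverse from it, as in this minimal sketch (`label_mapping` and `sample_results` are illustrative names, not taken from the notebook):

```python
# Forward mapping, identical to the one used in the cell above.
label_mapping = {'H': 1, 'D': 0, 'A': 2}

# Derive the inverse instead of typing it out a second time.
inverse_mapping = {code: label for label, code in label_mapping.items()}

# Round-trip check on a few hypothetical match results.
sample_results = ['H', 'A', 'D', 'H']
encoded = [label_mapping[result] for result in sample_results]
decoded = [inverse_mapping[code] for code in encoded]

print(encoded)
print(decoded)
```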

In [None]:

from sklearn.preprocessing import LabelEncoder
# Update the set of all teams to include teams from X_test
all_teams = set(X_train['HomeTeam'].unique()).union(set(X_train['AwayTeam'].unique()))
all_teams = all_teams.union(set(X_validation['HomeTeam'].unique())).union(set(X_validation['AwayTeam'].unique()))
all_teams = all_teams.union(set(X_test['HomeTeam'].unique())).union(set(X_test['AwayTeam'].unique()))

# Convert the set to a list
all_teams_list = list(all_teams)

# Fit the LabelEncoder with the updated list of all teams
encoder = LabelEncoder()
encoder.fit(all_teams_list)

# Transform 'HomeTeam' and 'AwayTeam' in all datasets
X_train['HomeTeam'] = encoder.transform(X_train['HomeTeam'])
X_train['AwayTeam'] = encoder.transform(X_train['AwayTeam'])
X_validation['HomeTeam'] = encoder.transform(X_validation['HomeTeam'])
X_validation['AwayTeam'] = encoder.transform(X_validation['AwayTeam'])
X_test['HomeTeam'] = encoder.transform(X_test['HomeTeam'])
X_test['AwayTeam'] = encoder.transform(X_test['AwayTeam'])

In [None]:
from lazypredict.Supervised import LazyClassifier
clf = LazyClassifier(verbose = 0, ignore_warnings = False, custom_metric = None)
models,pred = clf.fit(X_train, X_validation, y_train_enc, y_validation_enc)

  7%|▋         | 2/29 [00:00<00:04,  5.87it/s]

ROC AUC couldn't be calculated for AdaBoostClassifier
multi_class must be in ('ovo', 'ovr')
ROC AUC couldn't be calculated for BaggingClassifier
multi_class must be in ('ovo', 'ovr')
ROC AUC couldn't be calculated for BernoulliNB
multi_class must be in ('ovo', 'ovr')


 14%|█▍        | 4/29 [00:01<00:14,  1.77it/s]

ROC AUC couldn't be calculated for CalibratedClassifierCV
multi_class must be in ('ovo', 'ovr')
CategoricalNB model failed to execute
Negative values in data passed to CategoricalNB (input X)
ROC AUC couldn't be calculated for DecisionTreeClassifier
multi_class must be in ('ovo', 'ovr')
ROC AUC couldn't be calculated for DummyClassifier
multi_class must be in ('ovo', 'ovr')
ROC AUC couldn't be calculated for ExtraTreeClassifier
multi_class must be in ('ovo', 'ovr')


 31%|███       | 9/29 [00:02<00:04,  4.34it/s]

ROC AUC couldn't be calculated for ExtraTreesClassifier
multi_class must be in ('ovo', 'ovr')
ROC AUC couldn't be calculated for GaussianNB
multi_class must be in ('ovo', 'ovr')


 38%|███▊      | 11/29 [00:02<00:03,  5.01it/s]

ROC AUC couldn't be calculated for KNeighborsClassifier
multi_class must be in ('ovo', 'ovr')


 41%|████▏     | 12/29 [00:02<00:03,  4.97it/s]

ROC AUC couldn't be calculated for LabelPropagation
multi_class must be in ('ovo', 'ovr')


 45%|████▍     | 13/29 [00:03<00:03,  4.76it/s]

ROC AUC couldn't be calculated for LabelSpreading
multi_class must be in ('ovo', 'ovr')
ROC AUC couldn't be calculated for LinearDiscriminantAnalysis
multi_class must be in ('ovo', 'ovr')


 52%|█████▏    | 15/29 [00:03<00:03,  4.33it/s]

ROC AUC couldn't be calculated for LinearSVC
multi_class must be in ('ovo', 'ovr')
ROC AUC couldn't be calculated for LogisticRegression
multi_class must be in ('ovo', 'ovr')
ROC AUC couldn't be calculated for NearestCentroid
multi_class must be in ('ovo', 'ovr')


 62%|██████▏   | 18/29 [00:03<00:01,  5.60it/s]

ROC AUC couldn't be calculated for NuSVC
multi_class must be in ('ovo', 'ovr')
ROC AUC couldn't be calculated for PassiveAggressiveClassifier
multi_class must be in ('ovo', 'ovr')
ROC AUC couldn't be calculated for Perceptron
multi_class must be in ('ovo', 'ovr')
ROC AUC couldn't be calculated for QuadraticDiscriminantAnalysis
multi_class must be in ('ovo', 'ovr')


 86%|████████▌ | 25/29 [00:04<00:00,  8.29it/s]

ROC AUC couldn't be calculated for RandomForestClassifier
multi_class must be in ('ovo', 'ovr')
ROC AUC couldn't be calculated for RidgeClassifier
multi_class must be in ('ovo', 'ovr')
ROC AUC couldn't be calculated for RidgeClassifierCV
multi_class must be in ('ovo', 'ovr')
ROC AUC couldn't be calculated for SGDClassifier
multi_class must be in ('ovo', 'ovr')


 93%|█████████▎| 27/29 [00:04<00:00,  8.56it/s]

ROC AUC couldn't be calculated for SVC
multi_class must be in ('ovo', 'ovr')
StackingClassifier model failed to execute
StackingClassifier.__init__() missing 1 required positional argument: 'estimators'
ROC AUC couldn't be calculated for XGBClassifier
multi_class must be in ('ovo', 'ovr')
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000315 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 753
[LightGBM] [Info] Number of data points in the train set: 1810, number of used features: 21
[LightGBM] [Info] Start training from score -1.173145
[LightGBM] [Info] Start training from score -0.822740
[LightGBM] [Info] Start training from score -1.380785


100%|██████████| 29/29 [00:06<00:00,  4.33it/s]

ROC AUC couldn't be calculated for LGBMClassifier
multi_class must be in ('ovo', 'ovr')
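
The repeated `multi_class must be in ('ovo', 'ovr')` message comes from scikit-learn's `roc_auc_score`, which needs class probabilities and an explicit multiclass strategy when the target has more than two classes. The sketch below shows how the metric could be computed by hand for a three-class problem; it uses synthetic data and a `LogisticRegression` purely for illustration, not the notebook's features or models.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# Synthetic three-class data standing in for the D/H/A target (0/1/2).
X, y = make_classification(
    n_samples=600, n_features=8, n_informative=5,
    n_classes=3, n_clusters_per_class=1, random_state=42,
)
X_tr, X_va, y_tr, y_va = X[:450], X[450:], y[:450], y[450:]

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# Multiclass ROC AUC takes the full probability matrix (one column per class)
# together with an explicit multi_class strategy.
proba = model.predict_proba(X_va)
auc_ovr = roc_auc_score(y_va, proba, multi_class='ovr', average='macro')
print(f'One-vs-rest macro ROC AUC: {auc_ovr:.3f}')
```

Models without `predict_proba` (for example `LinearSVC`) would need a different route, such as calibration, before this metric applies.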





In [None]:
models

| Model | Accuracy | Balanced Accuracy | ROC AUC | F1 Score | Time Taken (s) |
|---|---|---|---|---|---|
| LinearDiscriminantAnalysis | 0.69 | 0.68 | | 0.69 | 0.09 |
| LinearSVC | 0.69 | 0.66 | | 0.66 | 0.44 |
| LogisticRegression | 0.68 | 0.66 | | 0.65 | 0.07 |
| RidgeClassifier | 0.69 | 0.66 | | 0.66 | 0.02 |
| CalibratedClassifierCV | 0.68 | 0.66 | | 0.65 | 1.58 |
| RidgeClassifierCV | 0.67 | 0.64 | | 0.63 | 0.03 |
| RandomForestClassifier | 0.68 | 0.64 | | 0.66 | 0.46 |
| BaggingClassifier | 0.66 | 0.64 | | 0.66 | 0.14 |
| Perceptron | 0.66 | 0.64 | | 0.65 | 0.02 |
| SVC | 0.66 | 0.63 | | 0.62 | 0.21 |


In [None]:
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier, ExtraTreesClassifier
from sklearn.metrics import accuracy_score, f1_score
from sklearn.tree import DecisionTreeClassifier
from xgboost import XGBClassifier

# Initialize the models
bagging_clf = BaggingClassifier(n_estimators=100, random_state=42)
extra_trees_clf = ExtraTreesClassifier(n_estimators=100, random_state=42)
random_forest_clf = RandomForestClassifier(n_estimators=100, random_state=42)
decision_tree_clf = DecisionTreeClassifier(random_state=42)
xgb_clf = XGBClassifier(random_state=42)  # Default parameters, adjust as necessary

# Fit the models
bagging_clf.fit(X_train, y_train_enc)
extra_trees_clf.fit(X_train, y_train_enc)
random_forest_clf.fit(X_train, y_train_enc)
decision_tree_clf.fit(X_train, y_train_enc)
xgb_clf.fit(X_train, y_train_enc)

# Predict y_validation
y_pred_bagging = bagging_clf.predict(X_validation)
y_pred_extra_trees = extra_trees_clf.predict(X_validation)
y_pred_random_forest = random_forest_clf.predict(X_validation)
y_pred_decision_tree = decision_tree_clf.predict(X_validation)
y_pred_xgb = xgb_clf.predict(X_validation)

# Calculate accuracy and F1 score for each model
accuracy_bagging = accuracy_score(y_validation_enc, y_pred_bagging)
f1_bagging = f1_score(y_validation_enc, y_pred_bagging, average='macro')

accuracy_extra_trees = accuracy_score(y_validation_enc, y_pred_extra_trees)
f1_extra_trees = f1_score(y_validation_enc, y_pred_extra_trees, average='macro')

accuracy_random_forest = accuracy_score(y_validation_enc, y_pred_random_forest)
f1_random_forest = f1_score(y_validation_enc, y_pred_random_forest, average='macro')

accuracy_decision_tree = accuracy_score(y_validation_enc, y_pred_decision_tree)
f1_decision_tree = f1_score(y_validation_enc, y_pred_decision_tree, average='macro')

accuracy_xgb = accuracy_score(y_validation_enc, y_pred_xgb)
f1_xgb = f1_score(y_validation_enc, y_pred_xgb, average='macro')

# Print out the performance
print(f'Bagging Classifier - Accuracy: {accuracy_bagging}, F1 Score: {f1_bagging}')
print(f'Extra Trees Classifier - Accuracy: {accuracy_extra_trees}, F1 Score: {f1_extra_trees}')
print(f'Random Forest Classifier - Accuracy: {accuracy_random_forest}, F1 Score: {f1_random_forest}')
print(f'Decision Tree Classifier - Accuracy: {accuracy_decision_tree}, F1 Score: {f1_decision_tree}')
print(f'XGB Classifier - Accuracy: {accuracy_xgb}, F1 Score: {f1_xgb}')

Bagging Classifier - Accuracy: 0.6534216335540839, F1 Score: 0.6095258984533376
Extra Trees Classifier - Accuracy: 0.6710816777041942, F1 Score: 0.6268114397210746
Random Forest Classifier - Accuracy: 0.6754966887417219, F1 Score: 0.6340491112230243
Decision Tree Classifier - Accuracy: 0.6379690949227373, F1 Score: 0.6256919620072238
XGB Classifier - Accuracy: 0.6534216335540839, F1 Score: 0.6166563664326027


In [None]:
from sklearn.model_selection import GridSearchCV

param_grid_xgb = {
    'n_estimators': [50, 100, 200],
    'max_depth': [3, 6, 10],
    'learning_rate': [0.01, 0.1, 0.2],
    'subsample': [0.5, 0.7, 1],
    'colsample_bytree': [0.5, 0.7, 1]
}

param_grid_random_forest = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

param_grid_extra_trees = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}



grid_search_xgb = GridSearchCV(xgb_clf, param_grid_xgb, cv=5, scoring='f1_macro', verbose=1, n_jobs=-1)
grid_search_random_forest = GridSearchCV(random_forest_clf, param_grid_random_forest, cv=5, scoring='f1_macro', verbose=1, n_jobs=-1)
grid_search_extra_trees = GridSearchCV(extra_trees_clf, param_grid_extra_trees, cv=5, scoring='f1_macro', verbose=1, n_jobs=-1)

grid_search_xgb.fit(X_train, y_train_enc)
grid_search_random_forest.fit(X_train, y_train_enc)
grid_search_extra_trees.fit(X_train, y_train_enc)

best_params_xgb = grid_search_xgb.best_params_
best_score_xgb = grid_search_xgb.best_score_

best_params_random_forest = grid_search_random_forest.best_params_
best_score_random_forest = grid_search_random_forest.best_score_

best_params_extra_trees = grid_search_extra_trees.best_params_
best_score_extra_trees = grid_search_extra_trees.best_score_


y_pred_xgb = grid_search_xgb.best_estimator_.predict(X_validation)
f1_score_xgb = f1_score(y_validation_enc, y_pred_xgb, average='macro')

y_pred_random_forest = grid_search_random_forest.best_estimator_.predict(X_validation)
f1_score_random_forest = f1_score(y_validation_enc, y_pred_random_forest, average='macro')

y_pred_extra_trees = grid_search_extra_trees.best_estimator_.predict(X_validation)
f1_score_extra_trees = f1_score(y_validation_enc, y_pred_extra_trees, average='macro')


results = {
    "XGB Classifier": {
        "Best Parameters": best_params_xgb,
        "Best Score": best_score_xgb,
        "F1 Score on Validation": f1_score_xgb
    },
    "Random Forest Classifier": {
        "Best Parameters": best_params_random_forest,
        "Best Score": best_score_random_forest,
        "F1 Score on Validation": f1_score_random_forest
    },
    "Extra Trees Classifier": {
        "Best Parameters": best_params_extra_trees,
        "Best Score": best_score_extra_trees,
        "F1 Score on Validation": f1_score_extra_trees
    }
}





Fitting 5 folds for each of 243 candidates, totalling 1215 fits
Fitting 5 folds for each of 108 candidates, totalling 540 fits
Fitting 5 folds for each of 108 candidates, totalling 540 fits


In [None]:
results

{'XGB Classifier': {'Best Parameters': {'colsample_bytree': 0.5,
   'learning_rate': 0.1,
   'max_depth': 3,
   'n_estimators': 100,
   'subsample': 0.5},
  'Best Score': 0.6763730021425417,
  'F1 Score on Validation': 0.6349898276893151},
 'Random Forest Classifier': {'Best Parameters': {'max_depth': 10,
   'min_samples_leaf': 1,
   'min_samples_split': 10,
   'n_estimators': 50},
  'Best Score': 0.6611250452228863,
  'F1 Score on Validation': 0.6290708883441837},
 'Extra Trees Classifier': {'Best Parameters': {'max_depth': 10,
   'min_samples_leaf': 1,
   'min_samples_split': 2,
   'n_estimators': 50},
  'Best Score': 0.6645691434303042,
  'F1 Score on Validation': 0.6368114861057509}}
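
The grid searches above use `cv=5`, which for classifiers means stratified folds that ignore match order, so some folds are validated on matches played before part of their training data. Since the train/validation split in this notebook is chronological, a time-ordered splitter may give cross-validation scores closer to the validation scores. The sketch below shows the mechanics with `TimeSeriesSplit` on random stand-in data; `X_demo` and `y_demo` are hypothetical, and the rows are assumed to be sorted by match date.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit

rng = np.random.default_rng(42)
# Hypothetical stand-in for X_train / y_train_enc, assumed sorted by date.
X_demo = rng.normal(size=(300, 5))
y_demo = rng.integers(0, 3, size=300)

# Each fold trains on earlier rows and validates on the rows that follow.
tscv = TimeSeriesSplit(n_splits=5)
fold_is_ordered = [
    train_idx.max() < val_idx.min() for train_idx, val_idx in tscv.split(X_demo)
]

search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    {'n_estimators': [20, 50], 'max_depth': [3, None]},  # small illustrative grid
    cv=tscv,
    scoring='f1_macro',
)
search.fit(X_demo, y_demo)
print(search.best_params_)
```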

In [None]:
# Initialize the XGBClassifier with the best hyperparameters
optimal_xgb_clf = XGBClassifier(
    colsample_bytree=0.5,
    learning_rate=0.1,
    max_depth=3,
    n_estimators=100,
    subsample=0.5,
    random_state=42  # Optional for reproducibility
)

In [None]:
optimal_xgb_clf.fit(X_train, y_train_enc)

In [None]:
y_pred_test_xgb = optimal_xgb_clf.predict(X_test)

In [None]:
# Define the inverse mapping
inverse_mapping = {1: 'H', 0: 'D', 2: 'A'}

# Convert predictions back to the original form
y_pred_test_xgb_original = [inverse_mapping[label] for label in y_pred_test_xgb]


In [None]:
predictions_df = pd.DataFrame(y_pred_test_xgb_original, columns=['Predictions'])

In [None]:
predictions_df.to_csv('spain_2.csv', index=False)

In [None]:
train = spn2[spn2['Year'] < 2022]
validation = spn2[spn2['Year'] == 2022]

In [None]:
X_train = train.drop(['FTR', 'total_goal'], axis=1)
y_train = train['total_goal']
X_validation = validation.drop(['FTR', 'total_goal'], axis=1)
y_validation = validation['total_goal']

In [None]:
X_test = df2023_spn2.copy()
X_test = X_test[X_train.columns]

In [None]:
X_train.shape , y_train.shape, X_validation.shape, y_validation.shape, X_test.shape

((1810, 21), (1810,), (453, 21), (453,), (462, 21))

In [None]:

from sklearn.preprocessing import LabelEncoder
# Update the set of all teams to include teams from X_test
all_teams = set(X_train['HomeTeam'].unique()).union(set(X_train['AwayTeam'].unique()))
all_teams = all_teams.union(set(X_validation['HomeTeam'].unique())).union(set(X_validation['AwayTeam'].unique()))
all_teams = all_teams.union(set(X_test['HomeTeam'].unique())).union(set(X_test['AwayTeam'].unique()))

# Convert the set to a list
all_teams_list = list(all_teams)

# Fit the LabelEncoder with the updated list of all teams
encoder = LabelEncoder()
encoder.fit(all_teams_list)

# Transform 'HomeTeam' and 'AwayTeam' in all datasets
X_train['HomeTeam'] = encoder.transform(X_train['HomeTeam'])
X_train['AwayTeam'] = encoder.transform(X_train['AwayTeam'])
X_validation['HomeTeam'] = encoder.transform(X_validation['HomeTeam'])
X_validation['AwayTeam'] = encoder.transform(X_validation['AwayTeam'])
X_test['HomeTeam'] = encoder.transform(X_test['HomeTeam'])
X_test['AwayTeam'] = encoder.transform(X_test['AwayTeam'])

In [None]:

from lazypredict.Supervised import LazyRegressor

# Create an instance of LazyRegressor
reg = LazyRegressor(verbose=0, ignore_warnings=False, custom_metric=None)

# Fit the model
models, predictions = reg.fit(X_train, X_validation, y_train, y_validation)

 21%|██▏       | 9/42 [00:01<00:07,  4.70it/s]

GammaRegressor model failed to execute
Some value(s) of y are out of the valid range of the loss 'HalfGammaLoss'.


 74%|███████▍  | 31/42 [00:08<00:03,  3.05it/s]

QuantileRegressor model failed to execute
Solver interior-point is not anymore available in SciPy >= 1.11.0.


100%|██████████| 42/42 [00:14<00:00,  2.86it/s]

[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000303 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 753
[LightGBM] [Info] Number of data points in the train set: 1810, number of used features: 21
[LightGBM] [Info] Start training from score 2.195028





In [None]:
models

| Model | Adjusted R-Squared | R-Squared | RMSE | Time Taken (s) |
|---|---|---|---|---|
| LassoLarsIC | 0.07 | 0.11 | 1.42 | 0.04 |
| PoissonRegressor | 0.07 | 0.11 | 1.42 | 0.72 |
| LassoCV | 0.07 | 0.11 | 1.42 | 0.21 |
| LassoLarsCV | 0.07 | 0.11 | 1.42 | 0.08 |
| ElasticNetCV | 0.07 | 0.11 | 1.42 | 0.3 |
| LarsCV | 0.06 | 0.11 | 1.42 | 0.06 |
| OrthogonalMatchingPursuitCV | 0.06 | 0.1 | 1.42 | 0.02 |
| HuberRegressor | 0.06 | 0.1 | 1.42 | 0.06 |
| BayesianRidge | 0.06 | 0.1 | 1.43 | 0.03 |
| LinearSVR | 0.06 | 0.1 | 1.43 | 0.05 |


In [None]:
from sklearn.linear_model import ElasticNetCV
from sklearn.svm import SVR
from sklearn.ensemble import RandomForestRegressor
from xgboost import XGBRegressor
from sklearn.linear_model import PoissonRegressor
from sklearn.metrics import mean_absolute_error, r2_score

# Initialize the models
elastic_net_cv = ElasticNetCV(cv=5, random_state=42)  # Adjust parameters as necessary
poisson_regressor = PoissonRegressor()
svr = SVR()  # Default parameters, adjust as necessary
random_forest_reg = RandomForestRegressor(n_estimators=100, random_state=42)
xgb_reg = XGBRegressor(random_state=42)  # Default parameters, adjust as necessary

# Fit the models
elastic_net_cv.fit(X_train, y_train)
poisson_regressor.fit(X_train, y_train)
svr.fit(X_train, y_train)
random_forest_reg.fit(X_train, y_train)
xgb_reg.fit(X_train, y_train)

# Predict y_validation
y_pred_elastic_net_cv = elastic_net_cv.predict(X_validation)
y_pred_poisson_regressor = poisson_regressor.predict(X_validation)
y_pred_svr = svr.predict(X_validation)
y_pred_random_forest = random_forest_reg.predict(X_validation)
y_pred_xgb = xgb_reg.predict(X_validation)

# Round the predictions to the nearest integer, since total goals is a count
y_pred_elastic_net_cv_rounded = np.rint(y_pred_elastic_net_cv)
y_pred_poisson_regressor_rounded = np.rint(y_pred_poisson_regressor)
y_pred_svr_rounded = np.rint(y_pred_svr)
y_pred_random_forest_rounded = np.rint(y_pred_random_forest)
y_pred_xgb_rounded = np.rint(y_pred_xgb)

# Calculate MAE and R2 score using rounded predictions
mae_elastic_net_cv = mean_absolute_error(y_validation, y_pred_elastic_net_cv_rounded)
r2_elastic_net_cv = r2_score(y_validation, y_pred_elastic_net_cv_rounded)

mae_poisson_regressor = mean_absolute_error(y_validation, y_pred_poisson_regressor_rounded)
r2_poisson_regressor = r2_score(y_validation, y_pred_poisson_regressor_rounded)

mae_svr = mean_absolute_error(y_validation, y_pred_svr_rounded)
r2_svr = r2_score(y_validation, y_pred_svr_rounded)

mae_random_forest = mean_absolute_error(y_validation, y_pred_random_forest_rounded)
r2_random_forest = r2_score(y_validation, y_pred_random_forest_rounded)

mae_xgb = mean_absolute_error(y_validation, y_pred_xgb_rounded)
r2_xgb = r2_score(y_validation, y_pred_xgb_rounded)

# Print out the performance with rounded predictions
print(f'ElasticNetCV - MAE: {mae_elastic_net_cv}, R2 Score: {r2_elastic_net_cv}')
print(f'Poisson Regressor - MAE: {mae_poisson_regressor}, R2 Score: {r2_poisson_regressor}')
print(f'SVR - MAE: {mae_svr}, R2 Score: {r2_svr}')
print(f'Random Forest Regressor - MAE: {mae_random_forest}, R2 Score: {r2_random_forest}')
print(f'XGB Regressor - MAE: {mae_xgb}, R2 Score: {r2_xgb}')

ElasticNetCV - MAE: 1.1324503311258278, R2 Score: 0.048962778899861004
Poisson Regressor - MAE: 1.1567328918322295, R2 Score: 0.0070195988923676955
SVR - MAE: 1.1788079470198676, R2 Score: -0.0827192978678506
Random Forest Regressor - MAE: 1.1523178807947019, R2 Score: 0.030429745873294167
XGB Regressor - MAE: 1.2472406181015452, R2 Score: -0.14417093369278255
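
With R2 scores this close to zero, it helps to compare the MAE values against a constant prediction. The sketch below computes two such baselines (always predicting the training mean, and always predicting the training median) on simulated goal counts. The data are hypothetical Poisson draws with a mean of 2.2, chosen only because the training target mean reported in the LightGBM log is about 2.2; they are not the notebook's data.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical total-goal counts standing in for y_train / y_validation.
y_train_demo = rng.poisson(lam=2.2, size=1000)
y_val_demo = rng.poisson(lam=2.2, size=300)

# Baseline 1: always predict the training mean.
mean_pred = np.full(len(y_val_demo), y_train_demo.mean())
mae_mean = np.abs(y_val_demo - mean_pred).mean()

# Baseline 2: always predict the training median, the constant that
# minimises MAE on the training data.
median_pred = np.full(len(y_val_demo), np.median(y_train_demo))
mae_median = np.abs(y_val_demo - median_pred).mean()

print(f'Mean baseline MAE:   {mae_mean:.3f}')
print(f'Median baseline MAE: {mae_median:.3f}')
```

Running the same two lines on the real `y_train` and `y_validation` would show how much of the reported MAE is explained by the features rather than by the average number of goals.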


In [None]:
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import mean_absolute_error

# Parameter grids
param_grid_poisson = {
    'alpha': [0.01, 0.1, 1, 10],
    'max_iter': [100, 300, 500]
}

param_grid_random_forest = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

# GridSearchCV setup
grid_search_poisson = GridSearchCV(poisson_regressor, param_grid_poisson, cv=5, scoring='neg_mean_absolute_error', verbose=1, n_jobs=-1)
grid_search_random_forest = GridSearchCV(random_forest_reg, param_grid_random_forest, cv=5, scoring='neg_mean_absolute_error', verbose=1, n_jobs=-1)

# Fitting models
grid_search_poisson.fit(X_train, y_train)
grid_search_random_forest.fit(X_train, y_train)

# Best parameters and scores
best_params_poisson = grid_search_poisson.best_params_
best_score_poisson = grid_search_poisson.best_score_

best_params_random_forest = grid_search_random_forest.best_params_
best_score_random_forest = grid_search_random_forest.best_score_

# Predict and calculate MAE
y_pred_poisson = grid_search_poisson.best_estimator_.predict(X_validation)
mae_poisson = mean_absolute_error(y_validation, y_pred_poisson)

y_pred_random_forest = grid_search_random_forest.best_estimator_.predict(X_validation)
mae_random_forest = mean_absolute_error(y_validation, y_pred_random_forest)

# Results
results = {
    "Poisson Regressor": {
        "Best Parameters": best_params_poisson,
        "Best Score (Negative MAE)": best_score_poisson,
        "MAE on Validation": mae_poisson
    },
    "Random Forest Regressor": {
        "Best Parameters": best_params_random_forest,
        "Best Score (Negative MAE)": best_score_random_forest,
        "MAE on Validation": mae_random_forest
    }
}

# ElasticNetCV already uses cross-validation for parameter tuning, so we directly fit it and predict
elastic_net_cv = ElasticNetCV(cv=5, random_state=42).fit(X_train, y_train)
y_pred_elastic_net_cv = elastic_net_cv.predict(X_validation)
mae_elastic_net_cv = mean_absolute_error(y_validation, y_pred_elastic_net_cv)

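# Note: get_params() returns the constructor settings, not the fitted values;
# the alpha selected by cross-validation is stored in elastic_net_cv.alpha_.
results["ElasticNetCV"] = {
    "Best Parameters": elastic_net_cv.get_params(),
    "MAE on Validation": mae_elastic_net_cv
}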

# Print results
for model, info in results.items():
    print(f"{model}:")
    for key, value in info.items():
        print(f"  {key}: {value}")
    print()

Fitting 5 folds for each of 12 candidates, totalling 60 fits
Fitting 5 folds for each of 108 candidates, totalling 540 fits
Poisson Regressor:
  Best Parameters: {'alpha': 0.01, 'max_iter': 300}
  Best Score (Negative MAE): -1.0712823623934493
  MAE on Validation: 1.1427228947764518

Random Forest Regressor:
  Best Parameters: {'max_depth': 10, 'min_samples_leaf': 4, 'min_samples_split': 10, 'n_estimators': 100}
  Best Score (Negative MAE): -1.1080347800221886
  MAE on Validation: 1.1547492304012426

ElasticNetCV:
  Best Parameters: {'alphas': None, 'copy_X': True, 'cv': 5, 'eps': 0.001, 'fit_intercept': True, 'l1_ratio': 0.5, 'max_iter': 1000, 'n_alphas': 100, 'n_jobs': None, 'positive': False, 'precompute': 'auto', 'random_state': 42, 'selection': 'cyclic', 'tol': 0.0001, 'verbose': 0}
  MAE on Validation: 1.1494289415085541



In [None]:


# Set the best parameters for PoissonRegressor
best_params_poisson = {
    'alpha': 0.01,
    'max_iter': 300
}

# Initialize and fit the PoissonRegressor with the best parameters
poisson_model = PoissonRegressor(**best_params_poisson)
poisson_model.fit(X_train, y_train)

# Make predictions on the test set
y_pred_test_poisson = poisson_model.predict(X_test)

# Round the predictions to the nearest integer and convert to int type
y_pred_test_poisson_rounded = np.rint(y_pred_test_poisson).astype(int)

# y_pred_test_poisson_rounded contains the final integer predictions for X_test
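
A Poisson regressor predicts the expected number of goals, so rounding discards the uncertainty around that value. As a rough illustration, the sketch below turns a single predicted mean into a probability for each goal count, assuming total goals really follow a Poisson distribution with that mean. The value `mu = 2.2` and the helper names are hypothetical.

```python
import math

def poisson_pmf(k, mu):
    """Probability of exactly k goals for a Poisson distribution with mean mu."""
    return math.exp(-mu) * mu**k / math.factorial(k)

def prob_over(line, mu):
    """Probability that total goals exceed a half-goal line such as 2.5."""
    max_under = math.floor(line)
    return 1.0 - sum(poisson_pmf(k, mu) for k in range(max_under + 1))

mu = 2.2  # hypothetical predicted mean for one match
probs = [poisson_pmf(k, mu) for k in range(7)]
most_likely = max(range(7), key=lambda k: probs[k])

for k, p in enumerate(probs):
    print(f'P(total goals = {k}) = {p:.3f}')
print(f'Most likely count: {most_likely}')
print(f'P(total goals > 2.5) = {prob_over(2.5, mu):.3f}')
```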


In [None]:


# Convert predictions to a DataFrame
predictions_df_poisson = pd.DataFrame(y_pred_test_poisson_rounded, columns=['Predicted_Total_Goals'])

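# Save to CSV under a separate file name (chosen here for illustration):
# the match-result predictions were already written to 'spain_2.csv',
# and reusing that path would overwrite them.
predictions_df_poisson.to_csv('spain_2_total_goals.csv', index=False)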
