<a href="https://colab.research.google.com/github/khiemtranngoc/GoalNetAI-Multi-League-Football-Predictions/blob/main/france.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Methodology

Because our goal is to predict the results of football matches in the 2023 season, we must not use features that only become available after a match has ended, such as match statistics and goals scored. These features do not exist for matches that have not yet been played.

To predict matches before they happen, we must build the prediction models from data that is available before kick-off. However, the data we have describes the end of each match, such as the number of goals and shots per team. It cannot be used directly to train the models, so we transform it into pre-match features derived from historical data.

* In the test set (season 2023) we do not have information such as FTHG, FTAG, ...

### Features Not Suitable for Pre-Match Prediction:
* Goals and Results (FTHG, FTAG, FTR, HTHG, HTAG, HTR): These are outcomes of the match, not available before it starts.

* In-Match Statistics (HS, AS, HST, AST, HHW, AHW, HC, AC, HF, AF, HFKC, AFKC, HO, AO, HY, AY, HR, AR): These are also outcomes or events that occur during the match.
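As a minimal sketch of what "pre-match features from historical data" can mean, the post-match column FTHG can be averaged over a team's earlier home matches only. This is an illustration on toy data; the notebook itself aggregates over whole seasons, and the helper name `add_prior_home_goal_avg` is an assumption, not part of the original code.

```python
import pandas as pd

def add_prior_home_goal_avg(df):
    """Add each home team's mean FTHG over its earlier home matches only.

    Assumes rows are already sorted chronologically. A team's first home
    match has no history, so its value is NaN.
    """
    out = df.copy()
    out["HomeGoalsAvg_prior"] = (
        out.groupby("HomeTeam")["FTHG"]
        # shift(1) excludes the current match, so no post-match data is used
        .transform(lambda s: s.shift(1).expanding().mean())
    )
    return out

# Toy data (hypothetical results, for illustration only)
toy = pd.DataFrame({
    "HomeTeam": ["Lyon", "Nice", "Lyon", "Lyon"],
    "FTHG": [2, 1, 0, 4],
})
result = add_prior_home_goal_avg(toy)
print(result)
```

In this toy frame, Lyon's third home match gets the average of its first two (2 and 0), and the goals scored in that third match are not used.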

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
from google.colab import drive
drive.mount('/content/drive/')

Mounted at /content/drive/


In [None]:
# Load every seasonal dataset of one league from the training folder

def load_seasonal_data(base_path, country, league, start_season, end_season):
    seasonal_data = {}

    for season_start_year in range(start_season, end_season + 1):

        start_year_suffix = (season_start_year - 1) % 100
        end_year_suffix = season_start_year % 100

        season_str = f"{start_year_suffix:02d}{end_year_suffix:02d}"

        file_path = f"{base_path}/{country}/{league}/{season_str}.csv"

        seasonal_data[f'{league}{season_str}'] = pd.read_csv(file_path)

    return seasonal_data


base_path = "/content/drive/MyDrive/train"
country = "france"
league = "1"
seasonal_datasets = load_seasonal_data(base_path, country, league, 1, 22)

# Example: access the data for the 2001/2002 season
# fra10102 = seasonal_datasets['10102']
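To make the file-naming scheme explicit: although the loop variable is called `season_start_year`, it acts as the two-digit year in which the season ends, so 16 becomes the code `1516`. The frame is then stored under the league number followed by that code. The sketch below isolates this step; `season_code` is a hypothetical helper name, not part of the notebook.

```python
def season_code(season_end_year):
    """Return the four-digit season code used in the CSV file names.

    season_end_year is the two-digit year in which the season ends,
    e.g. 16 for the 2015/2016 season.
    """
    start_suffix = (season_end_year - 1) % 100
    end_suffix = season_end_year % 100
    return f"{start_suffix:02d}{end_suffix:02d}"

league = "1"
# Dictionary keys are the league number followed by the season code
keys = [f"{league}{season_code(y)}" for y in (1, 16, 22)]
print(keys)  # ['10001', '11516', '12122']
```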


In [None]:

fra11516 = seasonal_datasets['11516']
fra11617 = seasonal_datasets['11617']
fra11718 = seasonal_datasets['11718']
fra11819 = seasonal_datasets['11819']
fra11920 = seasonal_datasets['11920']
fra12021 = seasonal_datasets['12021']
fra12122 = seasonal_datasets['12122']

In [None]:
columns = ['Date', 'HomeTeam', 'AwayTeam', 'FTHG', 'FTAG', 'FTR', 'HS', 'AS',
        'HST', 'AST', 'HC', 'AC',
         "B365H", "B365D", "B365A" ]

In [None]:
df2016 = fra11516[columns]
df2017 = fra11617[columns]
df2018 = fra11718[columns]
df2019 = fra11819[columns]
df2020 = fra11920[columns]
df2021 = fra12021[columns]
df2022 = fra12122[columns]

In [None]:
# Summarize which columns of a DataFrame contain missing values, and how many

def missing_values_summary(df):

    missing_counts = df.isnull().sum()

    missing_counts = missing_counts[missing_counts > 0]

    summary_df = pd.DataFrame(missing_counts, columns=['Missing Values Count'])
    summary_df.index.name = 'Column'

    return summary_df

In [None]:
summary = missing_values_summary(df2016)
print(summary)

        Missing Values Count
Column                      
B365H                      1
B365D                      1
B365A                      1


In [None]:
summary = missing_values_summary(df2017)
print(summary)

        Missing Values Count
Column                      
HS                         1
AS                         1
HST                        1
AST                        1
HC                         1
AC                         1


In [None]:
summary = missing_values_summary(df2018)
print(summary)

Empty DataFrame
Columns: [Missing Values Count]
Index: []


In [None]:
summary = missing_values_summary(df2019)
print(summary)

        Missing Values Count
Column                      
AST                       11


In [None]:
summary = missing_values_summary(df2020)
print(summary)

Empty DataFrame
Columns: [Missing Values Count]
Index: []


In [None]:
summary = missing_values_summary(df2021)
print(summary)

        Missing Values Count
Column                      
B365H                      2
B365D                      2
B365A                      2


In [None]:
summary = missing_values_summary(df2022)
print(summary)

Empty DataFrame
Columns: [Missing Values Count]
Index: []


In [None]:
# Function to display rows with missing values from a DataFrame.

def show_rows_with_missing_values(df):

    rows_with_missing_values = df[df.isnull().any(axis=1)]

    return rows_with_missing_values


In [None]:
show_rows_with_missing_values(df2017)

Unnamed: 0,Date,HomeTeam,AwayTeam,FTHG,FTAG,FTR,HS,AS,HST,AST,HC,AC,B365H,B365D,B365A
320,16/04/17,Bastia,Lyon,0,3,A,,,,,,,5.5,4.0,1.57


In [None]:
def transform_goals_to_absolute(df):
    """
    Function to transform values in 'FTHG' (Full Time Home Team Goals) and
    'FTAG' (Full Time Away Team Goals) columns to their absolute values.

    Args:
    df (pd.DataFrame): DataFrame containing the match data.

    Returns:
    pd.DataFrame: Updated DataFrame with absolute values in the specified columns.
    """
    # Convert to absolute values
    df['FTHG'] = df['FTHG'].abs()
    df['FTAG'] = df['FTAG'].abs()

    return df

I created this function because the away-team goals column (FTAG) sometimes contains negative values, and a number of goals cannot be negative.

In [None]:
df2016 = transform_goals_to_absolute(df2016)
df2017 = transform_goals_to_absolute(df2017)
df2018 = transform_goals_to_absolute(df2018)
df2019 = transform_goals_to_absolute(df2019)
df2020 = transform_goals_to_absolute(df2020)
df2021 = transform_goals_to_absolute(df2021)
df2022 = transform_goals_to_absolute(df2022)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['FTHG'] = df['FTHG'].abs()
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['FTAG'] = df['FTAG'].abs()
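The warning above appears because `df2016` and the other frames were created by selecting columns from a larger DataFrame, so pandas cannot tell whether the assignment is meant to affect the original. One common way to avoid it, sketched here under the assumption that modifying the original frame is not intended, is to work on an explicit copy. `goals_to_absolute_copy` is a hypothetical name and the toy frame stands in for a season file.

```python
import pandas as pd

def goals_to_absolute_copy(df):
    """Return a copy of df with non-negative FTHG and FTAG."""
    out = df.copy()  # explicit copy, so the assignment target is unambiguous
    out[["FTHG", "FTAG"]] = out[["FTHG", "FTAG"]].abs()
    return out

# Toy data with two negative goal counts (made up for illustration)
raw = pd.DataFrame({"FTHG": [1, -2], "FTAG": [-3, 0], "HC": [5, 6]})
subset = raw[["FTHG", "FTAG"]]  # column selection, as in the notebook
cleaned = goals_to_absolute_copy(subset)
print(cleaned)
```

The original `raw` frame is left untouched, and only the returned copy carries the corrected values.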


Function showing extreme outliers

In [None]:
def find_and_print_outlier_rows(df):
    """
    Identifies and prints rows containing outliers for all numerical columns in the DataFrame.

    Parameters:
    df (pd.DataFrame): The dataset.
    """
    for column in df.select_dtypes(include=['number']).columns:
        Q1 = df[column].quantile(0.25)
        Q3 = df[column].quantile(0.75)
        IQR = Q3 - Q1
        lower_bound = Q1 - 3 * IQR
        upper_bound = Q3 + 3 * IQR

        outliers = df[(df[column] < lower_bound) | (df[column] > upper_bound)]

        if not outliers.empty:
            print(f"Rows with outliers in column '{column}':")
            print(outliers)
            print("\n")
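As a small worked example of the rule used in `find_and_print_outlier_rows`: a value is flagged when it lies more than 3 × IQR below the first quartile or above the third. The goal counts below are made up for illustration; the value 468 imitates a data-entry error.

```python
import pandas as pd

goals = pd.Series([0, 1, 1, 2, 2, 3, 468], name="FTHG")

q1 = goals.quantile(0.25)      # 1.0
q3 = goals.quantile(0.75)      # 2.5
iqr = q3 - q1                  # 1.5
lower_bound = q1 - 3 * iqr     # -3.5
upper_bound = q3 + 3 * iqr     # 7.0

# Keep only the values outside the [lower_bound, upper_bound] interval
outliers = goals[(goals < lower_bound) | (goals > upper_bound)]
print(lower_bound, upper_bound)
print(outliers.tolist())  # [468]
```

The factor 3 (rather than the more common 1.5) means that only extreme values are reported.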

In [None]:
find_and_print_outlier_rows(df2019)

Rows with outliers in column 'FTHG':
           Date  HomeTeam  AwayTeam  FTHG  FTAG FTR  HS  AS  HST  AST  HC  AC  \
68   19/01/2019  Paris SG  Guingamp     9     0   H  26   2   12  0.0   8   1   
272  01/05/2019    Rennes    Monaco   468     2   D  10  16    4  5.0   5   9   
332  10/03/2019  Toulouse  Guingamp   468     0   H  19   4    4  0.0   8   2   

     B365H  B365D  B365A  
68    1.10   10.0  26.00  
272   3.50    3.3   2.15  
332   2.15    3.2   3.50  


Rows with outliers in column 'B365H':
           Date    HomeTeam  AwayTeam  FTHG  FTAG FTR  HS  AS  HST   AST  HC  \
39   02/12/2018    Bordeaux  Paris SG     2     2   D  18   9    4   2.0   4   
144  29/09/2018        Nice  Paris SG     0     3   A   9  21    1   8.0   3   
156  11/05/2019      Angers  Paris SG     1     2   A  14   9    4   4.0   2   
179  11/11/2018      Monaco  Paris SG     0     4   A  15  14    2   9.0   6   
182  28/10/2018   Marseille  Paris SG     0     2   A  12  11    2   5.0   5   
198  23/09

In [None]:
def filter_goals_under_30(df):
    """
    Filter the DataFrame to select rows where both 'FTHG' and 'FTAG' are smaller than 30,
    including rows where 'FTHG' or 'FTAG' might be NA.

    Parameters:
    df (pandas.DataFrame): The input DataFrame with football data.

    Returns:
    pandas.DataFrame: The filtered DataFrame.
    """
    filtered_df = df[((df['FTHG'] < 30) & (df['FTAG'] < 30)) | df['FTHG'].isna() | df['FTAG'].isna()]
    return filtered_df

In [None]:
df2016 = filter_goals_under_30(df2016)
df2017 = filter_goals_under_30(df2017)
df2018 = filter_goals_under_30(df2018)
df2019 = filter_goals_under_30(df2019)
df2020 = filter_goals_under_30(df2020)
df2021 = filter_goals_under_30(df2021)
df2022 = filter_goals_under_30(df2022)


In [None]:
# Function to selectively impute missing values in a DataFrame using KNNImputer.
# The imputation is applied only to columns with missing values, and results are rounded to integers.


from sklearn.impute import KNNImputer

def impute_missing_values_knn(df, n_neighbors=5):
    cols_with_missing = df.columns[df.isnull().any()]
    numeric_cols_with_missing = df[cols_with_missing].select_dtypes(include=[np.number]).columns

    imputer = KNNImputer(n_neighbors=n_neighbors)

    df_numeric_imputed = df.copy()
    if len(numeric_cols_with_missing) > 0:
        imputed_data = imputer.fit_transform(df[numeric_cols_with_missing])
        df_imputed = pd.DataFrame(imputed_data, columns=numeric_cols_with_missing, index=df.index)

        for col in numeric_cols_with_missing:
            df_numeric_imputed[col] = df_numeric_imputed[col].fillna(np.round(df_imputed[col]))

    return df_numeric_imputed


In [None]:
# Fill NaN values with the KNN imputer (only the seasons that have missing values)
df2016 = impute_missing_values_knn(df2016)
df2017 = impute_missing_values_knn(df2017)
df2019 = impute_missing_values_knn(df2019)
df2021 = impute_missing_values_knn(df2021)

Function to preprocess football data and create new features:

* Home and away team win rates over a season
* Home and away team average goals per match over a season
* Winning probabilities implied by the bookmaker's betting odds (Bet365, the B365 columns)
* Goal ratio for shots on target (total goals / total shots on target) for each team

In [None]:
def preprocess_football_data(df):


    # Calculating win rates and average goals
    home_win_rate = df.groupby('HomeTeam')['FTR'].apply(lambda x: round((x == 'H').mean(), 2)).to_dict()
    away_win_rate = df.groupby('AwayTeam')['FTR'].apply(lambda x: round((x == 'A').mean(), 2)).to_dict()
    home_goals_avg = df.groupby('HomeTeam')['FTHG'].mean().apply(lambda x: round(x, 2)).to_dict()
    away_goals_avg = df.groupby('AwayTeam')['FTAG'].mean().apply(lambda x: round(x, 2)).to_dict()
    home_goals_conceded_avg = df.groupby('HomeTeam')['FTAG'].mean().apply(lambda x: round(x, 2)).to_dict()
    away_goals_conceded_avg = df.groupby('AwayTeam')['FTHG'].mean().apply(lambda x: round(x, 2)).to_dict()
    goal_ratio_H = df.groupby('HomeTeam').apply(lambda x: round(x['FTHG'].sum() / x['HST'].sum(),2) if x['HST'].sum() > 0 else 0)
    goal_ratio_A = df.groupby('AwayTeam').apply(lambda x: round(x['FTAG'].sum() / x['AST'].sum(),2) if x['AST'].sum() > 0 else 0)


    # Mapping the win rates and average goals to the main DataFrame
    df['HomeTeam_WinRate'] = df['HomeTeam'].map(home_win_rate)
    df['AwayTeam_WinRate'] = df['AwayTeam'].map(away_win_rate)
    df['HomeTeam_GoalsAvg'] = df['HomeTeam'].map(home_goals_avg)
    df['AwayTeam_GoalsAvg'] = df['AwayTeam'].map(away_goals_avg)
    df['HomeTeam_goals_conceded_avg'] = df['HomeTeam'].map(home_goals_conceded_avg)
    df['AwayTeam_goals_conceded_avg'] = df['AwayTeam'].map(away_goals_conceded_avg)


    # Calculating implied probabilities from betting odds
    df['Broker_prob_H'] = round(1 / df['B365H'], 2)
    df['Broker_prob_D'] = round(1 / df['B365D'], 2)
    df['Broker_prob_A'] = round(1 / df['B365A'], 2)

    # Calculate the total goals for each match
    df['total_goal'] = df['FTHG'] + df['FTAG']

    # Map the goal ratios (goals per shot on target) back to the original DataFrame
    df['H_goal_ratio'] = df['HomeTeam'].map(goal_ratio_H)
    df['A_goal_ratio'] = df['AwayTeam'].map(goal_ratio_A)

    # Drop malformed rows in which a team is listed as playing against itself
    clean_df = df[df['HomeTeam'] != df['AwayTeam']]

    return clean_df
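For the odds-based features, the implied probability of an outcome is the reciprocal of its decimal odds. Because the bookmaker's margin is built into the odds, the three reciprocals usually sum to slightly more than 1. The sketch below shows the raw values next to a normalised variant. The normalisation is an optional extra and is not what the `Broker_prob_*` columns contain; `implied_probabilities` is a hypothetical helper.

```python
def implied_probabilities(odds_home, odds_draw, odds_away):
    """Return (raw, normalised) implied probabilities from decimal odds."""
    raw = [1 / odds_home, 1 / odds_draw, 1 / odds_away]
    total = sum(raw)  # exceeds 1 by the bookmaker's margin
    normalised = [p / total for p in raw]
    return raw, normalised

# Example decimal odds for home win, draw and away win
raw, normalised = implied_probabilities(2.6, 3.1, 2.9)
print([round(p, 2) for p in raw])         # [0.38, 0.32, 0.34]
print([round(p, 2) for p in normalised])  # [0.37, 0.31, 0.33]
```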

In [None]:
def add_adjusted_win_loss_ratio(df):
    def adjusted_win_loss_ratio(wins, draws, losses, total_matches):
        return ((3*wins + draws) - losses) / total_matches if total_matches > 0 else 0

    # Initialize a dictionary to track head-to-head stats
    head_to_head_stats = {}

    # Update head-to-head stats
    for index, row in df.iterrows():
        teams = tuple(sorted([row['HomeTeam'], row['AwayTeam']]))
        if teams not in head_to_head_stats:
            head_to_head_stats[teams] = {'wins': {teams[0]: 0, teams[1]: 0},
                                         'draws': 0,
                                         'total_matches': 0}

        head_to_head_stats[teams]['total_matches'] += 1
        if row['FTR'] == 'H':
            head_to_head_stats[teams]['wins'][row['HomeTeam']] += 1
        elif row['FTR'] == 'D':
            head_to_head_stats[teams]['draws'] += 1
        elif row['FTR'] == 'A':
            head_to_head_stats[teams]['wins'][row['AwayTeam']] += 1

    # Calculate and add the adjusted win-loss ratio to the DataFrame
    def calculate_ratio_for_match(row):
        teams = tuple(sorted([row['HomeTeam'], row['AwayTeam']]))
        stats = head_to_head_stats[teams]
        home_wins = stats['wins'][row['HomeTeam']]
        away_wins = stats['wins'][row['AwayTeam']]
        draws = stats['draws']
        total_matches = stats['total_matches']
        home_ratio = adjusted_win_loss_ratio(home_wins, draws, total_matches - home_wins - draws, total_matches)
        away_ratio = adjusted_win_loss_ratio(away_wins, draws, total_matches - away_wins - draws, total_matches)
        return pd.Series([home_ratio, away_ratio])

    df[['adjusted_win_lost_ratio_H', 'adjusted_win_lost_ratio_A']] = df.apply(calculate_ratio_for_match, axis=1)

    return df
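To make the head-to-head score concrete: for each pair of teams the function counts wins, draws and losses over all their meetings in the frame, then computes (3 × wins + draws − losses) / matches. A minimal stand-alone version of that formula, with made-up records:

```python
def adjusted_win_loss_ratio(wins, draws, losses, total_matches):
    """(3 * wins + draws - losses) / total_matches, or 0 with no matches."""
    if total_matches == 0:
        return 0
    return (3 * wins + draws - losses) / total_matches

# Two meetings, one win each: both sides score (3 - 1) / 2 = 1.0
even = adjusted_win_loss_ratio(wins=1, draws=0, losses=1, total_matches=2)
# One win and one draw: (3 + 1 - 0) / 2 = 2.0
stronger = adjusted_win_loss_ratio(wins=1, draws=1, losses=0, total_matches=2)
# The opponent in that pairing: (0 + 1 - 1) / 2 = 0.0
weaker = adjusted_win_loss_ratio(wins=0, draws=1, losses=1, total_matches=2)
print(even, stronger, weaker)  # 1.0 2.0 0.0
```

The score ranges from -1 (every meeting lost) to 3 (every meeting won).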

I developed the features attack_strength_home_team and attack_strength_away_team for every team in the league. Each one is the team's total home (or away) goals in the season divided by the league's average home (or away) goals per match. It measures a team's ability to score goals relative to the league average, offering a consistent way to gauge attacking strength.

In [None]:
def calculate_attack_strength(df):
    # Calculate total goals for each team
    total_home_goals = df.groupby('HomeTeam')['FTHG'].sum()
    total_away_goals = df.groupby('AwayTeam')['FTAG'].sum()

    # Calculate league averages for home and away goals
    average_home_goals = df['FTHG'].mean()
    average_away_goals = df['FTAG'].mean()

    # Calculate attack strength
    df['attack_strength_home_team'] = df['HomeTeam'].apply(lambda x: round(total_home_goals[x] / average_home_goals,2))
    df['attack_strength_away_team'] = df['AwayTeam'].apply(lambda x: round(total_away_goals[x] / average_away_goals,2))

    return df
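Note that `calculate_attack_strength` divides a season total by a per-match average, so its values scale with the number of matches played. A commonly used alternative puts both sides on a per-match basis, so that 1.0 means league-average scoring. The sketch below shows that variant on toy data. It is an illustration rather than the notebook's feature, and `per_match_attack_strength` is a hypothetical name.

```python
import pandas as pd

def per_match_attack_strength(df):
    """Home attack strength as team mean FTHG divided by league mean FTHG."""
    team_avg_home_goals = df.groupby("HomeTeam")["FTHG"].mean()
    league_avg_home_goals = df["FTHG"].mean()
    return (team_avg_home_goals / league_avg_home_goals).round(2)

# Toy data (hypothetical results, for illustration only)
toy = pd.DataFrame({
    "HomeTeam": ["Lyon", "Lyon", "Nice", "Nice"],
    "FTHG": [3, 1, 1, 1],
})
strength = per_match_attack_strength(toy)
print(strength)
```

Here Lyon averages 2 home goals against a league average of 1.5, giving about 1.33, while Nice comes out at about 0.67.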


In [None]:
df2016 =  preprocess_football_data(df2016)
df2017 =  preprocess_football_data(df2017)
df2018 =  preprocess_football_data(df2018)
df2019 =  preprocess_football_data(df2019)
df2020 =  preprocess_football_data(df2020)
df2021 =  preprocess_football_data(df2021)
df2022 =  preprocess_football_data(df2022)

In [None]:
df2016 = calculate_attack_strength(df2016)
df2017 = calculate_attack_strength(df2017)
df2018 = calculate_attack_strength(df2018)
df2019 = calculate_attack_strength(df2019)
df2020 = calculate_attack_strength(df2020)
df2021 = calculate_attack_strength(df2021)
df2022 = calculate_attack_strength(df2022)

In [None]:
df2016 =  add_adjusted_win_loss_ratio(df2016)
df2017 =  add_adjusted_win_loss_ratio(df2017)
df2018 =  add_adjusted_win_loss_ratio(df2018)
df2019 =  add_adjusted_win_loss_ratio(df2019)
df2020 =  add_adjusted_win_loss_ratio(df2020)
df2021 =  add_adjusted_win_loss_ratio(df2021)
df2022 =  add_adjusted_win_loss_ratio(df2022)

In [None]:


def process_time_data(df, target_year):
    # Convert 'Date' to datetime; the files store dates day first (DD/MM/YY or DD/MM/YYYY)
    df['Date'] = pd.to_datetime(df['Date'], dayfirst=True)

    # Extract 'Day', 'Month', and 'Year' from 'Date'
    df['Day'] = df['Date'].dt.day
    df['Month'] = df['Date'].dt.month
    df['Year'] = df['Date'].dt.year

    # Label every match with the year in which its season ends
    df['Year'] = df['Year'].apply(lambda x: target_year if x != target_year else x)

    # Drop 'Day', 'Month', and 'Date'; only 'Year' is kept
    df.drop(['Day', 'Month', 'Date'], axis=1, inplace=True)

    return df
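The dates in these files are written day first (for example 16/04/17 or 19/01/2019), whereas `pd.to_datetime` reads an ambiguous string month first unless told otherwise. A short sketch of the difference, using one date from the test file:

```python
import pandas as pd

date_text = "05/08/2022"  # 5 August 2022 in the source files

month_first = pd.to_datetime(date_text)               # default: read as 8 May
day_first = pd.to_datetime(date_text, dayfirst=True)  # read as 5 August

print(month_first.month, day_first.month)  # 5 8
```

Since only the season year is kept in the end, the misreading does not change the final features here, but it would matter for any feature that depends on the day or month.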


In [None]:
df2016 = process_time_data(df2016, 2016)
df2017 = process_time_data(df2017, 2017)
df2018 = process_time_data(df2018, 2018)
df2019 = process_time_data(df2019, 2019)
df2020 = process_time_data(df2020, 2020)
df2021 = process_time_data(df2021, 2021)
df2022 = process_time_data(df2022, 2022)



In [None]:
columns_to_drop = ['FTHG', 'FTAG', 'HTR', 'HC', 'AC', 'HST', 'AST', 'HS', 'AS']

In [None]:
fra1 = pd.concat([df2016, df2017, df2018, df2019, df2020, df2021, df2022], ignore_index=True)

In [None]:
fra1 = fra1.drop(columns=columns_to_drop, errors='ignore')

In [None]:
fra1.head()

Unnamed: 0,HomeTeam,AwayTeam,FTR,B365H,B365D,B365A,HomeTeam_WinRate,AwayTeam_WinRate,HomeTeam_GoalsAvg,AwayTeam_GoalsAvg,...,Broker_prob_D,Broker_prob_A,total_goal,H_goal_ratio,A_goal_ratio,attack_strength_home_team,attack_strength_away_team,adjusted_win_lost_ratio_H,adjusted_win_lost_ratio_A,Year
0,Guingamp,Lille,D,2.6,3.1,2.9,0.33,0.32,1.61,0.95,...,0.32,0.34,2.0,0.46,0.25,20.07,16.5,1.0,1.0,2016
1,Monaco,Reims,D,1.62,3.8,6.0,0.53,0.16,1.58,0.84,...,0.26,0.17,4.0,0.33,0.23,20.76,14.66,2.0,0.0,2016
2,Reims,Toulouse,A,2.6,3.1,2.9,0.37,0.16,1.47,0.84,...,0.32,0.34,4.0,0.31,0.21,19.38,14.66,0.0,2.0,2016
3,Montpellier,St Etienne,A,2.5,3.1,3.1,0.5,0.37,1.44,0.89,...,0.32,0.32,3.0,0.32,0.31,17.99,15.58,-1.0,3.0,2016
4,Nice,Monaco,A,4.0,3.1,2.05,0.63,0.37,1.68,1.42,...,0.32,0.49,3.0,0.38,0.35,22.14,24.74,-1.0,3.0,2016


In [None]:
fra1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2500 entries, 0 to 2499
Data columns (total 23 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   HomeTeam                     2500 non-null   object 
 1   AwayTeam                     2500 non-null   object 
 2   FTR                          2500 non-null   object 
 3   B365H                        2500 non-null   float64
 4   B365D                        2500 non-null   float64
 5   B365A                        2500 non-null   float64
 6   HomeTeam_WinRate             2500 non-null   float64
 7   AwayTeam_WinRate             2500 non-null   float64
 8   HomeTeam_GoalsAvg            2500 non-null   float64
 9   AwayTeam_GoalsAvg            2500 non-null   float64
 10  HomeTeam_goals_conceded_avg  2500 non-null   float64
 11  AwayTeam_goals_conceded_avg  2500 non-null   float64
 12  Broker_prob_H                2500 non-null   float64
 13  Broker_prob_D     

In [None]:
def find_and_print_outliers(df):
    """
    Identifies and prints outliers for all numerical columns in the DataFrame.

    Parameters:
    df (pd.DataFrame): The dataset.
    """
    for column in df.select_dtypes(include=['number']).columns:
        Q1 = df[column].quantile(0.25)
        Q3 = df[column].quantile(0.75)
        IQR = Q3 - Q1
        lower_bound = Q1 - 3 * IQR
        upper_bound = Q3 + 3 * IQR

        outliers = df[(df[column] < lower_bound) | (df[column] > upper_bound)]

        if not outliers.empty:
            print(f"Outliers for {column}:")
            print(outliers[column])
            print("\n")

In [None]:
find_and_print_outliers(fra1)

Outliers for B365H:
44       7.5
54      10.0
57       8.0
80      10.0
177      7.5
        ... 
2369     8.5
2430     7.5
2436    17.0
2448     7.5
2464     7.5
Name: B365H, Length: 114, dtype: float64


Outliers for B365D:
19       6.5
39       7.0
66       6.0
85       7.0
127     10.0
        ... 
2356     8.0
2426     6.5
2436     8.0
2472     8.5
2473     6.5
Name: B365D, Length: 164, dtype: float64


Outliers for B365A:
19      13.0
39      17.0
66      12.0
85      17.0
127     34.0
        ... 
2341    17.0
2349    13.0
2356    12.0
2433    13.0
2473    12.0
Name: B365A, Length: 99, dtype: float64


Outliers for HomeTeam_GoalsAvg:
745     3.82
770     3.82
787     3.82
789     3.82
793     3.82
797     3.82
805     3.82
849     3.82
851     3.82
869     3.82
887     3.82
888     3.82
927     3.82
978     3.82
1014    3.82
1035    3.82
1062    3.82
Name: HomeTeam_GoalsAvg, dtype: float64


Outliers for Broker_prob_D:
127     0.10
151     0.10
157     0.10
397     0.11
590     

Load the second division (France 2)

In [None]:

country = "france"
league = "2"
seasonal_datasets = load_seasonal_data(base_path, country, league, 1, 22)

In [None]:
fra21718 = seasonal_datasets['21718']
fra21819 = seasonal_datasets['21819']
fra21920 = seasonal_datasets['21920']
fra22021 = seasonal_datasets['22021']
fra22122 = seasonal_datasets['22122']

In [None]:
df20182 = fra21718[columns]
df20192 = fra21819[columns]
df20202 = fra21920[columns]
df20212 = fra22021[columns]
df20222 = fra22122[columns]

In [None]:
summary = missing_values_summary(df20182)
print(summary)

        Missing Values Count
Column                      
B365A                     11


In [None]:
summary = missing_values_summary(df20192)
print(summary)

        Missing Values Count
Column                      
FTHG                      11


In [None]:
summary = missing_values_summary(df20202)
print(summary)

Empty DataFrame
Columns: [Missing Values Count]
Index: []


In [None]:
summary = missing_values_summary(df20212)
print(summary)

        Missing Values Count
Column                      
HS                         1
AS                         1
HST                        1
AST                        1
HC                         1
AC                         1
B365H                      3
B365D                      3
B365A                      3


In [None]:
summary = missing_values_summary(df20222)
print(summary)

        Missing Values Count
Column                      
HS                         1
AS                         1
HST                        1
AST                        1
HC                        12
AC                         1
B365H                      3
B365D                      3
B365A                      3


In [None]:
df20182 = transform_goals_to_absolute(df20182)
df20192 = transform_goals_to_absolute(df20192)
df20202 = transform_goals_to_absolute(df20202)
df20212 = transform_goals_to_absolute(df20212)
df20222 = transform_goals_to_absolute(df20222)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['FTHG'] = df['FTHG'].abs()
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['FTAG'] = df['FTAG'].abs()


In [None]:
show_rows_with_missing_values(df20182)

Unnamed: 0,Date,HomeTeam,AwayTeam,FTHG,FTAG,FTR,HS,AS,HST,AST,HC,AC,B365H,B365D,B365A
0,08/09/17,Auxerre,Tours,1,1,D,10,8,4,5,6,2,2.0,3.0,
1,19/01/18,Bourg Peronnas,Reims,0,1,A,7,12,2,6,7,6,4.75,3.75,
2,30/04/18,Lens,Paris FC,1,0,H,12,4,4,3,8,3,2.2,3.2,
3,22/01/18,Paris FC,Ajaccio,2,1,H,6,9,4,3,3,4,2.29,2.89,
4,25/08/17,Sochaux,Chateauroux,1,5,A,6,15,1,10,9,6,1.67,3.5,
5,15/12/17,Quevilly Rouen,Brest,1,4,A,10,13,5,10,7,3,3.2,3.39,
6,19/09/17,Paris FC,Orleans,1,0,H,17,1,5,0,8,2,2.3,3.0,
7,28/11/17,Chateauroux,Paris FC,0,0,D,10,5,5,2,3,4,2.87,2.87,
8,24/04/18,Orleans,Bourg Peronnas,5,1,H,9,11,8,4,3,3,2.04,3.39,
9,24/11/17,Orleans,Clermont,1,2,A,11,11,8,4,4,4,2.89,3.1,


In [None]:
df20182 = impute_missing_values_knn(df20182)

In [None]:


def impute_goals_based_on_result(df):
    """
    Impute missing values in 'FTHG' and 'FTAG' based on 'FTR', following specific rules.

    Parameters:
    df (pd.DataFrame): The dataset containing the columns 'FTHG', 'FTAG', and 'FTR'.

    Returns:
    pd.DataFrame: The DataFrame with imputed values.
    """
    for index, row in df.iterrows():
        if row['FTR'] == 'H':
            # Home team wins
            if pd.isna(row['FTHG']) and not pd.isna(row['FTAG']):
                df.at[index, 'FTHG'] = row['FTAG'] + 1
            elif not pd.isna(row['FTHG']) and pd.isna(row['FTAG']):
                df.at[index, 'FTAG'] = row['FTHG'] - 1 if row['FTHG'] > 0 else 0

        elif row['FTR'] == 'A':
            # Away team wins
            if pd.isna(row['FTAG']) and not pd.isna(row['FTHG']):
                df.at[index, 'FTAG'] = row['FTHG'] + 1
            elif not pd.isna(row['FTAG']) and pd.isna(row['FTHG']):
                df.at[index, 'FTHG'] = row['FTAG'] - 1 if row['FTAG'] > 0 else 0

        elif row['FTR'] == 'D':
            # Draw
            if pd.isna(row['FTHG']) and not pd.isna(row['FTAG']):
                df.at[index, 'FTHG'] = row['FTAG']
            elif pd.isna(row['FTAG']) and not pd.isna(row['FTHG']):
                df.at[index, 'FTAG'] = row['FTHG']

    return df

# Example usage:
# df = pd.read_csv('your_dataset.csv')
# df_imputed = impute_goals_based_on_result(df)



In [None]:
show_rows_with_missing_values(df20192)

Unnamed: 0,Date,HomeTeam,AwayTeam,FTHG,FTAG,FTR,HS,AS,HST,AST,HC,AC,B365H,B365D,B365A
0,31/08/2018,Le Havre,Orleans,,1,H,12,14,4,4,3,6,1.7,3.6,5.0
1,21/01/2019,Paris FC,Brest,,1,A,7,13,1,4,4,5,2.75,3.0,2.75
2,04/05/2019,Lens,Clermont,,0,H,6,18,4,3,3,8,1.75,3.6,4.75
3,28/01/2019,Clermont,Le Havre,,0,D,17,5,5,2,7,1,2.0,3.1,4.33
4,24/08/2018,Orleans,Paris FC,,0,H,12,7,5,4,2,1,3.25,2.9,2.4
5,22/12/2018,Lens,Ajaccio,,2,A,15,7,4,2,3,2,1.66,3.4,6.0
6,22/09/2018,Paris FC,Metz,,1,H,7,16,3,4,2,4,3.4,3.0,2.3
7,04/12/2018,Le Havre,Chateauroux,,1,H,18,8,6,3,3,1,1.75,3.4,5.0
8,26/04/2019,Orleans,Troyes,,1,A,10,10,3,5,4,4,2.55,3.2,2.87
9,30/11/2018,Lorient,Lens,,2,D,6,13,4,4,6,4,2.3,3.0,3.3


In [None]:
df20192 = impute_goals_based_on_result(df20192)

In [None]:
df20192.head(10)

Unnamed: 0,Date,HomeTeam,AwayTeam,FTHG,FTAG,FTR,HS,AS,HST,AST,HC,AC,B365H,B365D,B365A
0,31/08/2018,Le Havre,Orleans,2.0,1,H,12,14,4,4,3,6,1.7,3.6,5.0
1,21/01/2019,Paris FC,Brest,0.0,1,A,7,13,1,4,4,5,2.75,3.0,2.75
2,04/05/2019,Lens,Clermont,1.0,0,H,6,18,4,3,3,8,1.75,3.6,4.75
3,28/01/2019,Clermont,Le Havre,0.0,0,D,17,5,5,2,7,1,2.0,3.1,4.33
4,24/08/2018,Orleans,Paris FC,1.0,0,H,12,7,5,4,2,1,3.25,2.9,2.4
5,22/12/2018,Lens,Ajaccio,1.0,2,A,15,7,4,2,3,2,1.66,3.4,6.0
6,22/09/2018,Paris FC,Metz,2.0,1,H,7,16,3,4,2,4,3.4,3.0,2.3
7,04/12/2018,Le Havre,Chateauroux,2.0,1,H,18,8,6,3,3,1,1.75,3.4,5.0
8,26/04/2019,Orleans,Troyes,0.0,1,A,10,10,3,5,4,4,2.55,3.2,2.87
9,30/11/2018,Lorient,Lens,2.0,2,D,6,13,4,4,6,4,2.3,3.0,3.3


In [None]:
show_rows_with_missing_values(df20212)

Unnamed: 0,Date,HomeTeam,AwayTeam,FTHG,FTAG,FTR,HS,AS,HST,AST,HC,AC,B365H,B365D,B365A
42,14/04/2021,Clermont,Amiens,3,0,H,10.0,7.0,4.0,1.0,3.0,7.0,,,
83,22/12/2020,Niort,Valenciennes,0,3,A,,,,,,,2.55,3.1,2.75
126,13/03/2021,Toulouse,Chambly,4,0,H,18.0,9.0,8.0,1.0,8.0,2.0,,,
226,13/02/2021,Clermont,Chambly,1,0,H,15.0,3.0,4.0,1.0,6.0,3.0,,,


In [None]:
df20212 = df20212.drop(83)

In [None]:
df20212 = impute_missing_values_knn(df20212)

In [None]:
show_rows_with_missing_values(df20222)

Unnamed: 0,Date,HomeTeam,AwayTeam,FTHG,FTAG,FTR,HS,AS,HST,AST,HC,AC,B365H,B365D,B365A
0,28/08/2021,Guingamp,Ajaccio,1,1,D,8.0,8.0,3.0,4.0,,5.0,2.37,3.1,3.1
1,22/01/2022,Le Havre,Rodez,0,0,D,14.0,3.0,3.0,1.0,,1.0,2.25,3.0,3.0
2,30/04/2022,Sochaux,Bastia,2,1,H,9.0,4.0,3.0,0.0,,1.0,1.7,3.4,4.33
3,01/02/2022,Pau FC,Nimes,0,3,A,21.0,15.0,6.0,8.0,,3.0,2.25,2.9,3.1
4,21/08/2021,Pau FC,Bastia,2,0,H,16.0,11.0,5.0,1.0,,8.0,2.37,3.1,3.1
5,21/12/2021,Paris FC,Amiens,1,0,H,5.0,12.0,2.0,3.0,,1.0,1.83,3.25,4.0
6,18/09/2021,Pau FC,Valenciennes,1,1,D,6.0,15.0,3.0,4.0,,5.0,2.0,3.2,4.0
7,03/12/2021,Guingamp,Dijon,3,2,H,9.0,8.0,6.0,4.0,,3.0,2.37,3.2,3.0
8,22/04/2022,Nimes,Rodez,3,2,H,6.0,16.0,5.0,5.0,,7.0,2.87,2.9,2.45
9,20/11/2021,Nancy,Rodez,0,2,A,17.0,14.0,10.0,9.0,,3.0,2.7,3.1,2.75


In [None]:
df20222 = df20222.drop(58)

In [None]:
df20222 = impute_missing_values_knn(df20222)

In [None]:
df20182 = filter_goals_under_30(df20182)
df20192 = filter_goals_under_30(df20192)
df20202 = filter_goals_under_30(df20202)
df20212 = filter_goals_under_30(df20212)
df20222 = filter_goals_under_30(df20222)

In [None]:
df20182 = process_time_data(df20182, 2018)
df20192 = process_time_data(df20192, 2019)
df20202 = process_time_data(df20202, 2020)
df20212 = process_time_data(df20212, 2021)
df20222 = process_time_data(df20222, 2022)



In [None]:
df20182 =  preprocess_football_data(df20182)
df20192 =  preprocess_football_data(df20192)
df20202 =  preprocess_football_data(df20202)
df20212 =  preprocess_football_data(df20212)
df20222 =  preprocess_football_data(df20222)

In [None]:
df20182 = calculate_attack_strength(df20182)
df20192 = calculate_attack_strength(df20192)
df20202 = calculate_attack_strength(df20202)
df20212 = calculate_attack_strength(df20212)
df20222 = calculate_attack_strength(df20222)

In [None]:
df20182 =  add_adjusted_win_loss_ratio(df20182)
df20192 =  add_adjusted_win_loss_ratio(df20192)
df20202 =  add_adjusted_win_loss_ratio(df20202)
df20212 =  add_adjusted_win_loss_ratio(df20212)
df20222 =  add_adjusted_win_loss_ratio(df20222)

In [None]:
fra2 = pd.concat([df20182, df20192, df20202, df20212, df20222], ignore_index=True)

In [None]:
fra2 = fra2.drop(columns=columns_to_drop, errors='ignore')

In [None]:
fra2.head()

Unnamed: 0,HomeTeam,AwayTeam,FTR,B365H,B365D,B365A,Year,HomeTeam_WinRate,AwayTeam_WinRate,HomeTeam_GoalsAvg,...,Broker_prob_H,Broker_prob_D,Broker_prob_A,total_goal,H_goal_ratio,A_goal_ratio,attack_strength_home_team,attack_strength_away_team,adjusted_win_lost_ratio_H,adjusted_win_lost_ratio_A
0,Auxerre,Tours,D,2.0,3.0,4.0,2018,0.37,0.06,1.37,...,0.5,0.33,0.25,2.0,0.26,0.23,16.91,10.83,2.0,0.0
1,Bourg Peronnas,Reims,A,4.75,3.75,4.0,2018,0.37,0.74,1.79,...,0.21,0.27,0.25,1.0,0.39,0.32,22.11,26.67,-1.0,3.0
2,Lens,Paris FC,H,2.2,3.2,4.0,2018,0.39,0.33,1.33,...,0.45,0.31,0.25,1.0,0.26,0.28,15.61,19.17,2.0,0.0
3,Paris FC,Ajaccio,H,2.29,2.89,4.0,2018,0.53,0.32,1.21,...,0.44,0.35,0.25,3.0,0.28,0.35,14.96,21.67,1.0,1.0
4,Sochaux,Chateauroux,A,1.67,3.5,4.0,2018,0.5,0.37,1.61,...,0.6,0.29,0.25,6.0,0.33,0.25,18.86,19.17,0.0,2.0


In [None]:
fra2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1756 entries, 0 to 1755
Data columns (total 23 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   HomeTeam                     1756 non-null   object 
 1   AwayTeam                     1756 non-null   object 
 2   FTR                          1756 non-null   object 
 3   B365H                        1756 non-null   float64
 4   B365D                        1756 non-null   float64
 5   B365A                        1756 non-null   float64
 6   Year                         1756 non-null   int64  
 7   HomeTeam_WinRate             1756 non-null   float64
 8   AwayTeam_WinRate             1756 non-null   float64
 9   HomeTeam_GoalsAvg            1756 non-null   float64
 10  AwayTeam_GoalsAvg            1756 non-null   float64
 11  HomeTeam_goals_conceded_avg  1756 non-null   float64
 12  AwayTeam_goals_conceded_avg  1756 non-null   float64
 13  Broker_prob_H     

Merge the two divisions together

In [None]:
data2018 = pd.concat([df2018, df20182], ignore_index=True)
data2019 = pd.concat([df2019, df20192], ignore_index=True)
data2020 = pd.concat([df2020, df20202], ignore_index=True)
data2021 = pd.concat([df2021, df20212], ignore_index=True)
data2022 = pd.concat([df2022, df20222], ignore_index=True)

In [None]:
file_path = '/content/drive/My Drive/test/france/1/2223.csv'
df2023 = pd.read_csv(file_path)

In [None]:
columns_test = ['Date', 'HomeTeam', 'AwayTeam', 'B365H', 'B365D', 'B365A']

In [None]:
# Copy the selection so that later column assignments do not act on a view of the raw frame
df2023 = df2023[columns_test].copy()

In [None]:
df2023.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 380 entries, 0 to 379
Data columns (total 6 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Date      380 non-null    object 
 1   HomeTeam  380 non-null    object 
 2   AwayTeam  380 non-null    object 
 3   B365H     380 non-null    float64
 4   B365D     380 non-null    float64
 5   B365A     380 non-null    float64
dtypes: float64(3), object(3)
memory usage: 17.9+ KB


In [None]:
df2023.head()

Unnamed: 0,Date,HomeTeam,AwayTeam,B365H,B365D,B365A
0,05/08/2022,Lyon,Ajaccio,1.33,5.25,8.5
1,06/08/2022,Strasbourg,Monaco,2.62,3.25,2.75
2,06/08/2022,Clermont,Paris SG,9.5,6.5,1.25
3,07/08/2022,Toulouse,Nice,2.87,3.3,2.5
4,07/08/2022,Angers,Nantes,2.7,3.25,2.7


In [None]:
show_rows_with_missing_values(df2023)

Unnamed: 0,Date,HomeTeam,AwayTeam,B365H,B365D,B365A


In [None]:


def calculate_and_apply_overall_averages(season_dfs, new_season_df):
    # Map each metric column to the team column it is keyed by
    metrics = {
        'HomeTeam_WinRate': 'HomeTeam', 'AwayTeam_WinRate': 'AwayTeam',
        'HomeTeam_GoalsAvg': 'HomeTeam', 'AwayTeam_GoalsAvg': 'AwayTeam',
        'HomeTeam_goals_conceded_avg': 'HomeTeam', 'AwayTeam_goals_conceded_avg': 'AwayTeam',
        'H_goal_ratio': 'HomeTeam', 'A_goal_ratio': 'AwayTeam',
        'attack_strength_home_team': 'HomeTeam', 'attack_strength_away_team': 'AwayTeam'
    }
    averages_dict = {metric: {} for metric in metrics}

    # Compute each team's mean per season. Seasons are processed in order and
    # later seasons overwrite earlier ones, so the stored value is the mean from
    # the most recent season in which the team appears.
    for df in season_dfs:
        for metric, team_col in metrics.items():
            for team in df[team_col].unique():
                averages_dict[metric][team] = df[df[team_col] == team][metric].mean()

    # Map the stored averages onto the new season by team name
    for metric, team_col in metrics.items():
        new_season_df[metric] = new_season_df[team_col].map(averages_dict[metric])

    return new_season_df

# List of DataFrames from 2016 to 2022
season_dfs = [df2016, df2017, data2018, data2019, data2020, data2021, data2022]
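
Because the loop in `calculate_and_apply_overall_averages` assigns one value per team on every pass over `season_dfs`, a team ends up with the mean from the last season in which it appears. If an average over all seasons is wanted instead, one possible sketch is to concatenate the seasons and group by team. The helper name `overall_team_means` and the toy data are illustrative assumptions, not part of the notebook.

```python
import pandas as pd

def overall_team_means(season_dfs, team_col, metric):
    """Mean of `metric` per team over all rows of all seasons (hypothetical helper)."""
    combined = pd.concat(season_dfs, ignore_index=True)
    return combined.groupby(team_col)[metric].mean()

# Toy data standing in for two seasons of pre-match features
s1 = pd.DataFrame({"HomeTeam": ["Lyon", "Nice"], "HomeTeam_WinRate": [0.50, 0.40]})
s2 = pd.DataFrame({"HomeTeam": ["Lyon", "Lens"], "HomeTeam_WinRate": [0.70, 0.30]})

means = overall_team_means([s1, s2], "HomeTeam", "HomeTeam_WinRate")

new_season = pd.DataFrame({"HomeTeam": ["Lyon", "Lens", "Metz"]})
# Teams never seen before (here "Metz") map to NaN and need separate handling
new_season["HomeTeam_WinRate"] = new_season["HomeTeam"].map(means)
print(new_season)
```

This variant weights every match row equally, so a team with more rows in one season contributes more from that season.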


In [None]:
# Apply the overall averages to df2023
df2023 = calculate_and_apply_overall_averages(season_dfs, df2023)

In [None]:
df2023.head()

Unnamed: 0,Date,HomeTeam,AwayTeam,B365H,B365D,B365A,HomeTeam_WinRate,AwayTeam_WinRate,HomeTeam_GoalsAvg,AwayTeam_GoalsAvg,HomeTeam_goals_conceded_avg,AwayTeam_goals_conceded_avg,H_goal_ratio,A_goal_ratio,attack_strength_home_team,attack_strength_away_team
0,05/08/2022,Lyon,Ajaccio,1.33,5.25,8.5,0.61,0.47,2.28,0.95,1.39,0.58,0.36,0.31,25.62,17.71
1,06/08/2022,Strasbourg,Monaco,2.62,3.25,2.75,0.58,0.42,1.89,1.37,0.95,1.26,0.43,0.38,22.5,21.24
2,06/08/2022,Clermont,Paris SG,9.5,6.5,1.25,0.21,0.5,1.16,2.06,1.74,1.22,0.29,0.38,13.75,30.22
3,07/08/2022,Toulouse,Nice,2.87,3.3,2.5,0.72,0.47,2.67,1.24,0.72,0.76,0.44,0.36,38.05,17.15
4,07/08/2022,Angers,Nantes,2.7,3.25,2.7,0.42,0.21,1.21,1.16,1.21,1.47,0.32,0.36,14.38,17.97


In [None]:
df2023.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 380 entries, 0 to 379
Data columns (total 16 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   Date                         380 non-null    object 
 1   HomeTeam                     380 non-null    object 
 2   AwayTeam                     380 non-null    object 
 3   B365H                        380 non-null    float64
 4   B365D                        380 non-null    float64
 5   B365A                        380 non-null    float64
 6   HomeTeam_WinRate             380 non-null    float64
 7   AwayTeam_WinRate             380 non-null    float64
 8   HomeTeam_GoalsAvg            380 non-null    float64
 9   AwayTeam_GoalsAvg            380 non-null    float64
 10  HomeTeam_goals_conceded_avg  380 non-null    float64
 11  AwayTeam_goals_conceded_avg  380 non-null    float64
 12  H_goal_ratio                 380 non-null    float64
 13  A_goal_ratio        

In [None]:
def calculate_head_to_head_stats(merged_df):
    # Initialize a dictionary to track head-to-head stats
    head_to_head_stats = {}

    # Update head-to-head stats using merged_df
    for index, row in merged_df.iterrows():
        teams = tuple(sorted([row['HomeTeam'], row['AwayTeam']]))
        if teams not in head_to_head_stats:
            head_to_head_stats[teams] = {'wins': {teams[0]: 0, teams[1]: 0},
                                         'draws': 0,
                                         'total_matches': 0}

        head_to_head_stats[teams]['total_matches'] += 1
        if row['FTR'] == 'H':
            head_to_head_stats[teams]['wins'][row['HomeTeam']] += 1
        elif row['FTR'] == 'D':
            head_to_head_stats[teams]['draws'] += 1
        elif row['FTR'] == 'A':
            head_to_head_stats[teams]['wins'][row['AwayTeam']] += 1

    return head_to_head_stats

def adjusted_win_loss_ratio(wins, draws, losses, total_matches):
    ratio = ((3*wins + draws) - losses) / total_matches if total_matches > 0 else 0
    return round(ratio, 1)

def apply_adjusted_win_loss_ratio_to_2023(df2023, head_to_head_stats):
    def calculate_ratio_for_match(row):
        teams = tuple(sorted([row['HomeTeam'], row['AwayTeam']]))
        stats = head_to_head_stats.get(teams, {'wins': {row['HomeTeam']: 0, row['AwayTeam']: 0}, 'draws': 0, 'total_matches': 0})
        home_wins = stats['wins'].get(row['HomeTeam'], 0)
        away_wins = stats['wins'].get(row['AwayTeam'], 0)
        draws = stats['draws']
        total_matches = stats['total_matches']
        home_ratio = adjusted_win_loss_ratio(home_wins, draws, total_matches - home_wins - draws, total_matches)
        away_ratio = adjusted_win_loss_ratio(away_wins, draws, total_matches - away_wins - draws, total_matches)
        return pd.Series([home_ratio, away_ratio])

    df2023[['adjusted_win_lost_ratio_H', 'adjusted_win_lost_ratio_A']] = df2023.apply(calculate_ratio_for_match, axis=1)
    return df2023

# Assuming merged_df is the DataFrame that contains data from 2015 to 2022
merged_df = fra1.copy()

# Calculate head-to-head stats using merged data
head_to_head_stats = calculate_head_to_head_stats(merged_df)

# Apply the adjusted win-loss ratio to df2023
df2023 = apply_adjusted_win_loss_ratio_to_2023(df2023, head_to_head_stats)
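
As a quick check of the formula, the ratio is (3 * wins + draws - losses) / total_matches, so it ranges from -1 (every meeting lost) to 3 (every meeting won). The snippet below restates the function from the cell above and evaluates it on made-up records; the match counts are illustrative assumptions, not values taken from the data.

```python
def adjusted_win_loss_ratio(wins, draws, losses, total_matches):
    # Same formula as in the notebook cell above
    ratio = ((3 * wins + draws) - losses) / total_matches if total_matches > 0 else 0
    return round(ratio, 1)

# Hypothetical head-to-head records over 5 meetings
all_wins = adjusted_win_loss_ratio(5, 0, 0, 5)    # (15 + 0 - 0) / 5
all_losses = adjusted_win_loss_ratio(0, 0, 5, 5)  # (0 + 0 - 5) / 5
home_side = adjusted_win_loss_ratio(2, 1, 2, 5)   # (6 + 1 - 2) / 5
away_side = adjusted_win_loss_ratio(2, 1, 2, 5)   # the opponent's view of the same record
no_history = adjusted_win_loss_ratio(0, 0, 0, 0)  # falls back to 0

print(all_wins, all_losses, home_side, away_side, no_history)
```

Before rounding, the two ratios of a pair sum to 2 whenever the teams have met at least once, since each win for one side is a loss for the other. A pair showing 0.0 for both sides therefore has no recorded meetings in `merged_df`.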


In [None]:
df2023.head(10)

Unnamed: 0,Date,HomeTeam,AwayTeam,B365H,B365D,B365A,HomeTeam_WinRate,AwayTeam_WinRate,HomeTeam_GoalsAvg,AwayTeam_GoalsAvg,HomeTeam_goals_conceded_avg,AwayTeam_goals_conceded_avg,H_goal_ratio,A_goal_ratio,attack_strength_home_team,attack_strength_away_team,adjusted_win_lost_ratio_H,adjusted_win_lost_ratio_A
0,05/08/2022,Lyon,Ajaccio,1.33,5.25,8.5,0.61,0.47,2.28,0.95,1.39,0.58,0.36,0.31,25.62,17.71,0.0,0.0
1,06/08/2022,Strasbourg,Monaco,2.62,3.25,2.75,0.58,0.42,1.89,1.37,0.95,1.26,0.43,0.38,22.5,21.24,1.4,0.6
2,06/08/2022,Clermont,Paris SG,9.5,6.5,1.25,0.21,0.5,1.16,2.06,1.74,1.22,0.29,0.38,13.75,30.22,-1.0,3.0
3,07/08/2022,Toulouse,Nice,2.87,3.3,2.5,0.72,0.47,2.67,1.24,0.72,0.76,0.44,0.36,38.05,17.15,0.4,1.6
4,07/08/2022,Angers,Nantes,2.7,3.25,2.7,0.42,0.21,1.21,1.16,1.21,1.47,0.32,0.36,14.38,17.97,0.4,1.6
5,07/08/2022,Lens,Brest,1.57,4.33,5.25,0.47,0.32,1.84,1.11,1.05,1.47,0.3,0.31,21.88,17.15,0.5,1.5
6,07/08/2022,Lille,Auxerre,1.7,4.0,4.5,0.39,0.56,1.33,1.5,1.22,1.06,0.32,0.29,15.0,26.57,0.0,0.0
7,07/08/2022,Montpellier,Troyes,2.2,3.5,3.3,0.37,0.21,1.42,0.95,1.37,1.79,0.28,0.26,16.88,14.7,1.3,0.7
8,07/08/2022,Rennes,Lorient,1.36,5.25,8.5,0.72,0.11,2.72,0.78,0.72,2.22,0.45,0.2,30.62,11.43,1.8,0.2
9,07/08/2022,Marseille,Reims,1.57,4.2,5.5,0.47,0.32,1.74,0.95,1.16,1.11,0.37,0.25,20.62,14.7,0.8,1.2


In [None]:
df2023.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 380 entries, 0 to 379
Data columns (total 18 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   Date                         380 non-null    object 
 1   HomeTeam                     380 non-null    object 
 2   AwayTeam                     380 non-null    object 
 3   B365H                        380 non-null    float64
 4   B365D                        380 non-null    float64
 5   B365A                        380 non-null    float64
 6   HomeTeam_WinRate             380 non-null    float64
 7   AwayTeam_WinRate             380 non-null    float64
 8   HomeTeam_GoalsAvg            380 non-null    float64
 9   AwayTeam_GoalsAvg            380 non-null    float64
 10  HomeTeam_goals_conceded_avg  380 non-null    float64
 11  AwayTeam_goals_conceded_avg  380 non-null    float64
 12  H_goal_ratio                 380 non-null    float64
 13  A_goal_ratio        

In [None]:
df2023 = process_time_data(df2023, 2023)

  df['Date'] = pd.to_datetime(df['Date'])
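
The `Date` strings in this file appear to be day-first (the first rows read `05/08/2022` and `06/08/2022`, which fits an August season start). Without a format, `pd.to_datetime` treats ambiguous strings as month-first by default, so passing an explicit format may be safer. The body of `process_time_data` is not shown in this section, so the sketch below is an assumption about what the parsing step could look like, not a description of that function.

```python
import pandas as pd

# Sample strings in the same DD/MM/YYYY layout as the Date column
dates = pd.Series(["05/08/2022", "06/08/2022", "13/08/2022"])

# An explicit format removes the month/day ambiguity and raises on malformed rows
parsed = pd.to_datetime(dates, format="%d/%m/%Y")
print(parsed.dt.month.tolist())
```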


In [None]:
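def add_probability_B365(df):
    # Convert Bet365 decimal odds to implied probabilities (1 / odds).
    # The values include the bookmaker's margin, so they need not sum to 1.
    df['Broker_prob_H'] = round(1 / df['B365H'], 2)
    df['Broker_prob_D'] = round(1 / df['B365D'], 2)
    df['Broker_prob_A'] = round(1 / df['B365A'], 2)
    return df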

In [None]:
df2023 = add_probability_B365(df2023)
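
The inverse odds computed by `add_probability_B365` usually sum to slightly more than 1 per match. For the first test row (odds 1.33, 5.25, 8.5) the sum is about 1.06. If probabilities that sum to 1 are preferred, one common option is to divide each value by the row sum. The sketch below is an optional variant; the `norm_prob_*` column names are made up for illustration.

```python
import pandas as pd

# First two rows of the test odds shown earlier
odds = pd.DataFrame({"B365H": [1.33, 2.62], "B365D": [5.25, 3.25], "B365A": [8.5, 2.75]})

raw = 1 / odds[["B365H", "B365D", "B365A"]]   # implied probabilities, margin included
overround = raw.sum(axis=1)                   # above 1 when the bookmaker takes a margin
norm = raw.div(overround, axis=0)             # each row now sums to 1
norm.columns = ["norm_prob_H", "norm_prob_D", "norm_prob_A"]  # hypothetical names

print(overround.round(3).tolist())
print(norm.round(3))
```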

In [None]:


def fill_missing_with_mean(df):
    """
    Fill missing values in each numeric column of the DataFrame with the mean of that column.

    Parameters:
    df (pd.DataFrame): The dataset with missing values.

    Returns:
    pd.DataFrame: The DataFrame with missing values filled.
    """
    for column in df.columns:
        if df[column].dtype in ['float64', 'int64']:
            mean_value = round(df[column].mean(), 2)
            # Assign the result back; inplace=True on a column selection
            # may not modify df in recent pandas versions
            df[column] = df[column].fillna(mean_value)
    return df


In [None]:
#df2023 = fill_missing_with_mean(df2023)

In [None]:
df2023.head()

Unnamed: 0,HomeTeam,AwayTeam,B365H,B365D,B365A,HomeTeam_WinRate,AwayTeam_WinRate,HomeTeam_GoalsAvg,AwayTeam_GoalsAvg,HomeTeam_goals_conceded_avg,...,H_goal_ratio,A_goal_ratio,attack_strength_home_team,attack_strength_away_team,adjusted_win_lost_ratio_H,adjusted_win_lost_ratio_A,Year,Broker_prob_H,Broker_prob_D,Broker_prob_A
0,Istanbulspor,Trabzonspor,4.7,3.75,1.61,0.51,0.61,1.68,1.83,1.07,...,0.34,0.37,19.81,27.32,0.0,0.0,2023,0.21,0.27,0.62
1,Sivasspor,Gaziantep,1.95,3.5,3.3,0.33,0.05,1.5,0.68,1.17,...,0.32,0.19,17.08,10.76,1.0,1.0,2023,0.51,0.29,0.3
2,Besiktas,Kayserispor,1.5,4.33,5.25,0.53,0.11,1.68,1.16,1.11,...,0.3,0.32,20.24,18.21,2.2,-0.2,2023,0.67,0.23,0.19
3,Giresunspor,Ad. Demirspor,2.87,3.4,2.2,0.37,0.33,1.21,1.28,1.11,...,0.28,0.26,14.55,19.04,1.0,1.0,2023,0.35,0.29,0.45
4,Karagumruk,Alanyaspor,2.55,3.5,2.4,0.5,0.56,1.44,1.56,0.94,...,0.33,0.37,16.44,23.18,0.5,1.5,2023,0.39,0.29,0.42


In [None]:
train = fra1[fra1['Year'] < 2022]
validation = fra1[fra1['Year'] == 2022]

In [None]:
X_train = train.drop(['FTR', 'total_goal'], axis=1)
y_train = train['FTR']
X_validation = validation.drop(['FTR', 'total_goal'], axis=1)
y_validation = validation['FTR']

In [None]:
X_test = df2023.copy()
X_test = X_test[X_train.columns]

In [None]:
X_train.shape , y_train.shape, X_validation.shape, y_validation.shape, X_test.shape

((2130, 21), (2130,), (370, 21), (370,), (380, 21))

In [None]:
y_train_enc = y_train.map({'H': 1, 'D': 0, 'A': 2})
y_validation_enc = y_validation.map({'H': 1, 'D': 0, 'A': 2})

In [None]:
from sklearn.preprocessing import LabelEncoder
# Update the set of all teams to include teams from X_test
all_teams = set(X_train['HomeTeam'].unique()).union(set(X_train['AwayTeam'].unique()))
all_teams = all_teams.union(set(X_validation['HomeTeam'].unique())).union(set(X_validation['AwayTeam'].unique()))
all_teams = all_teams.union(set(X_test['HomeTeam'].unique())).union(set(X_test['AwayTeam'].unique()))

# Convert the set to a list
all_teams_list = list(all_teams)

# Fit the LabelEncoder with the updated list of all teams
encoder = LabelEncoder()
encoder.fit(all_teams_list)

# Transform 'HomeTeam' and 'AwayTeam' in all datasets
X_train['HomeTeam'] = encoder.transform(X_train['HomeTeam'])
X_train['AwayTeam'] = encoder.transform(X_train['AwayTeam'])
X_validation['HomeTeam'] = encoder.transform(X_validation['HomeTeam'])
X_validation['AwayTeam'] = encoder.transform(X_validation['AwayTeam'])
X_test['HomeTeam'] = encoder.transform(X_test['HomeTeam'])
X_test['AwayTeam'] = encoder.transform(X_test['AwayTeam'])

In [None]:
!pip install lazypredict

Collecting lazypredict
  Downloading lazypredict-0.2.12-py2.py3-none-any.whl (12 kB)
Installing collected packages: lazypredict
Successfully installed lazypredict-0.2.12


In [None]:
from lazypredict.Supervised import LazyClassifier
clf = LazyClassifier(verbose=0, ignore_warnings=False, custom_metric=None)
models, pred = clf.fit(X_train, X_validation, y_train_enc, y_validation_enc)

  7%|▋         | 2/29 [00:00<00:06,  4.07it/s]

ROC AUC couldn't be calculated for AdaBoostClassifier
multi_class must be in ('ovo', 'ovr')
ROC AUC couldn't be calculated for BaggingClassifier
multi_class must be in ('ovo', 'ovr')
ROC AUC couldn't be calculated for BernoulliNB
multi_class must be in ('ovo', 'ovr')


 21%|██        | 6/29 [00:03<00:12,  1.90it/s]

ROC AUC couldn't be calculated for CalibratedClassifierCV
multi_class must be in ('ovo', 'ovr')
CategoricalNB model failed to execute
Negative values in data passed to CategoricalNB (input X)
ROC AUC couldn't be calculated for DecisionTreeClassifier
multi_class must be in ('ovo', 'ovr')
ROC AUC couldn't be calculated for DummyClassifier
multi_class must be in ('ovo', 'ovr')
ROC AUC couldn't be calculated for ExtraTreeClassifier
multi_class must be in ('ovo', 'ovr')


 31%|███       | 9/29 [00:04<00:07,  2.66it/s]

ROC AUC couldn't be calculated for ExtraTreesClassifier
multi_class must be in ('ovo', 'ovr')
ROC AUC couldn't be calculated for GaussianNB
multi_class must be in ('ovo', 'ovr')


 38%|███▊      | 11/29 [00:04<00:05,  3.32it/s]

ROC AUC couldn't be calculated for KNeighborsClassifier
multi_class must be in ('ovo', 'ovr')


 41%|████▏     | 12/29 [00:04<00:06,  2.77it/s]

ROC AUC couldn't be calculated for LabelPropagation
multi_class must be in ('ovo', 'ovr')


 48%|████▊     | 14/29 [00:05<00:05,  2.76it/s]

ROC AUC couldn't be calculated for LabelSpreading
multi_class must be in ('ovo', 'ovr')
ROC AUC couldn't be calculated for LinearDiscriminantAnalysis
multi_class must be in ('ovo', 'ovr')


 55%|█████▌    | 16/29 [00:07<00:05,  2.27it/s]

ROC AUC couldn't be calculated for LinearSVC
multi_class must be in ('ovo', 'ovr')
ROC AUC couldn't be calculated for LogisticRegression
multi_class must be in ('ovo', 'ovr')
ROC AUC couldn't be calculated for NearestCentroid
multi_class must be in ('ovo', 'ovr')


 72%|███████▏  | 21/29 [00:07<00:01,  4.19it/s]

ROC AUC couldn't be calculated for NuSVC
multi_class must be in ('ovo', 'ovr')
ROC AUC couldn't be calculated for PassiveAggressiveClassifier
multi_class must be in ('ovo', 'ovr')
ROC AUC couldn't be calculated for Perceptron
multi_class must be in ('ovo', 'ovr')
ROC AUC couldn't be calculated for QuadraticDiscriminantAnalysis
multi_class must be in ('ovo', 'ovr')


 76%|███████▌  | 22/29 [00:09<00:03,  1.82it/s]

ROC AUC couldn't be calculated for RandomForestClassifier
multi_class must be in ('ovo', 'ovr')
ROC AUC couldn't be calculated for RidgeClassifier
multi_class must be in ('ovo', 'ovr')


 86%|████████▌ | 25/29 [00:10<00:01,  2.63it/s]

ROC AUC couldn't be calculated for RidgeClassifierCV
multi_class must be in ('ovo', 'ovr')
ROC AUC couldn't be calculated for SGDClassifier
multi_class must be in ('ovo', 'ovr')


 90%|████████▉ | 26/29 [00:11<00:01,  1.67it/s]

ROC AUC couldn't be calculated for SVC
multi_class must be in ('ovo', 'ovr')
StackingClassifier model failed to execute
StackingClassifier.__init__() missing 1 required positional argument: 'estimators'


 97%|█████████▋| 28/29 [00:16<00:01,  1.31s/it]

ROC AUC couldn't be calculated for XGBClassifier
multi_class must be in ('ovo', 'ovr')
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.002087 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 1069
[LightGBM] [Info] Number of data points in the train set: 2130, number of used features: 21
[LightGBM] [Info] Start training from score -1.335940
[LightGBM] [Info] Start training from score -0.817997
[LightGBM] [Info] Start training from score -1.218157


100%|██████████| 29/29 [00:17<00:00,  1.64it/s]

ROC AUC couldn't be calculated for LGBMClassifier
multi_class must be in ('ovo', 'ovr')





In [None]:
models

Model,Accuracy,Balanced Accuracy,ROC AUC,F1 Score,Time Taken
XGBClassifier,0.66,0.64,,0.66,4.85
BaggingClassifier,0.65,0.62,,0.65,0.16
CalibratedClassifierCV,0.66,0.62,,0.63,2.7
LGBMClassifier,0.65,0.62,,0.64,1.13
RandomForestClassifier,0.66,0.61,,0.64,1.91
LinearSVC,0.66,0.61,,0.63,1.01
ExtraTreesClassifier,0.65,0.61,,0.64,0.62
LogisticRegression,0.64,0.61,,0.64,0.19
RidgeClassifierCV,0.66,0.61,,0.61,0.15
RidgeClassifier,0.66,0.61,,0.61,0.07
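
The `ROC AUC` column is empty because, with three classes, scikit-learn's `roc_auc_score` needs class probabilities and an explicit `multi_class` setting (`'ovr'` or `'ovo'`), which is what the repeated message in the log refers to. If the metric is wanted, it can be computed separately. A minimal sketch on synthetic data follows; the data set and the choice of `LogisticRegression` are stand-ins, not the notebook's features or model.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic 3-class problem standing in for the H/D/A target
X, y = make_classification(n_samples=600, n_features=10, n_informative=5,
                           n_classes=3, random_state=42)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.25,
                                          random_state=42, stratify=y)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
proba = model.predict_proba(X_va)  # one column of probabilities per class

# One-vs-rest AUC, macro-averaged over the three classes
auc_ovr = roc_auc_score(y_va, proba, multi_class="ovr", average="macro")
print(round(auc_ovr, 3))
```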


In [None]:
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier, ExtraTreesClassifier
from sklearn.metrics import accuracy_score, f1_score
from sklearn.tree import DecisionTreeClassifier
from xgboost import XGBClassifier

# Initialize the models
bagging_clf = BaggingClassifier(n_estimators=100, random_state=42)
extra_trees_clf = ExtraTreesClassifier(n_estimators=100, random_state=42)
random_forest_clf = RandomForestClassifier(n_estimators=100, random_state=42)
decision_tree_clf = DecisionTreeClassifier(random_state=42)
xgb_clf = XGBClassifier(random_state=42)  # Default parameters, adjust as necessary

# Fit the models
bagging_clf.fit(X_train, y_train_enc)
extra_trees_clf.fit(X_train, y_train_enc)
random_forest_clf.fit(X_train, y_train_enc)
decision_tree_clf.fit(X_train, y_train_enc)
xgb_clf.fit(X_train, y_train_enc)

# Predict y_validation
y_pred_bagging = bagging_clf.predict(X_validation)
y_pred_extra_trees = extra_trees_clf.predict(X_validation)
y_pred_random_forest = random_forest_clf.predict(X_validation)
y_pred_decision_tree = decision_tree_clf.predict(X_validation)
y_pred_xgb = xgb_clf.predict(X_validation)

# Calculate accuracy and F1 score for each model
accuracy_bagging = accuracy_score(y_validation_enc, y_pred_bagging)
f1_bagging = f1_score(y_validation_enc, y_pred_bagging, average='macro')

accuracy_extra_trees = accuracy_score(y_validation_enc, y_pred_extra_trees)
f1_extra_trees = f1_score(y_validation_enc, y_pred_extra_trees, average='macro')

accuracy_random_forest = accuracy_score(y_validation_enc, y_pred_random_forest)
f1_random_forest = f1_score(y_validation_enc, y_pred_random_forest, average='macro')

accuracy_decision_tree = accuracy_score(y_validation_enc, y_pred_decision_tree)
f1_decision_tree = f1_score(y_validation_enc, y_pred_decision_tree, average='macro')

accuracy_xgb = accuracy_score(y_validation_enc, y_pred_xgb)
f1_xgb = f1_score(y_validation_enc, y_pred_xgb, average='macro')

# Print out the performance
print(f'Bagging Classifier - Accuracy: {accuracy_bagging}, F1 Score: {f1_bagging}')
print(f'Extra Trees Classifier - Accuracy: {accuracy_extra_trees}, F1 Score: {f1_extra_trees}')
print(f'Random Forest Classifier - Accuracy: {accuracy_random_forest}, F1 Score: {f1_random_forest}')
print(f'Decision Tree Classifier - Accuracy: {accuracy_decision_tree}, F1 Score: {f1_decision_tree}')
print(f'XGB Classifier - Accuracy: {accuracy_xgb}, F1 Score: {f1_xgb}')

Bagging Classifier - Accuracy: 0.6486486486486487, F1 Score: 0.6164742255406044
Extra Trees Classifier - Accuracy: 0.6513513513513514, F1 Score: 0.6123298970923959
Random Forest Classifier - Accuracy: 0.6594594594594595, F1 Score: 0.6182216389017422
Decision Tree Classifier - Accuracy: 0.6081081081081081, F1 Score: 0.5886000280955406
XGB Classifier - Accuracy: 0.6621621621621622, F1 Score: 0.6388929798723613
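
To put these accuracies in context, it can help to compare them against a simple baseline that always picks the outcome with the shortest Bet365 odds. The sketch below shows the idea on a few made-up rows; the results in `FTR` are illustrative, and the baseline itself is a suggestion that the notebook does not compute.

```python
import pandas as pd

# Made-up matches: odds columns named as in the data, FTR is the actual result
df = pd.DataFrame({
    "B365H": [1.33, 2.62, 9.50, 2.87],
    "B365D": [5.25, 3.25, 6.50, 3.30],
    "B365A": [8.50, 2.75, 1.25, 2.50],
    "FTR":   ["H", "A", "A", "D"],
})

# Shortest odds mark the bookmaker's favourite; the last letter of the
# column name (H, D or A) is the corresponding result label
favourite = df[["B365H", "B365D", "B365A"]].idxmin(axis=1).str[-1]
baseline_accuracy = (favourite == df["FTR"]).mean()
print(favourite.tolist(), baseline_accuracy)
```

Running the same two lines on the `validation` frame would give a reference point for the model scores above.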


In [None]:
from sklearn.model_selection import GridSearchCV

param_grid_xgb = {
    'n_estimators': [50, 100, 200],
    'max_depth': [3, 6, 10],
    'learning_rate': [0.01, 0.1, 0.2],
    'subsample': [0.5, 0.7, 1],
    'colsample_bytree': [0.5, 0.7, 1]
}

param_grid_random_forest = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

param_grid_extra_trees = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}



grid_search_xgb = GridSearchCV(xgb_clf, param_grid_xgb, cv=5, scoring='f1_macro', verbose=1, n_jobs=-1)
grid_search_random_forest = GridSearchCV(random_forest_clf, param_grid_random_forest, cv=5, scoring='f1_macro', verbose=1, n_jobs=-1)
grid_search_extra_trees = GridSearchCV(extra_trees_clf, param_grid_extra_trees, cv=5, scoring='f1_macro', verbose=1, n_jobs=-1)

grid_search_xgb.fit(X_train, y_train_enc)
grid_search_random_forest.fit(X_train, y_train_enc)
grid_search_extra_trees.fit(X_train, y_train_enc)

best_params_xgb = grid_search_xgb.best_params_
best_score_xgb = grid_search_xgb.best_score_

best_params_random_forest = grid_search_random_forest.best_params_
best_score_random_forest = grid_search_random_forest.best_score_

best_params_extra_trees = grid_search_extra_trees.best_params_
best_score_extra_trees = grid_search_extra_trees.best_score_


y_pred_xgb = grid_search_xgb.best_estimator_.predict(X_validation)
f1_score_xgb = f1_score(y_validation_enc, y_pred_xgb, average='macro')

y_pred_random_forest = grid_search_random_forest.best_estimator_.predict(X_validation)
f1_score_random_forest = f1_score(y_validation_enc, y_pred_random_forest, average='macro')

y_pred_extra_trees = grid_search_extra_trees.best_estimator_.predict(X_validation)
f1_score_extra_trees = f1_score(y_validation_enc, y_pred_extra_trees, average='macro')


results = {
    "XGB Classifier": {
        "Best Parameters": best_params_xgb,
        "Best Score": best_score_xgb,
        "F1 Score on Validation": f1_score_xgb
    },
    "Random Forest Classifier": {
        "Best Parameters": best_params_random_forest,
        "Best Score": best_score_random_forest,
        "F1 Score on Validation": f1_score_random_forest
    },
    "Extra Trees Classifier": {
        "Best Parameters": best_params_extra_trees,
        "Best Score": best_score_extra_trees,
        "F1 Score on Validation": f1_score_extra_trees
    }
}



Fitting 5 folds for each of 243 candidates, totalling 1215 fits
Fitting 5 folds for each of 108 candidates, totalling 540 fits
Fitting 5 folds for each of 108 candidates, totalling 540 fits


In [None]:
results

{'XGB Classifier': {'Best Parameters': {'colsample_bytree': 1,
   'learning_rate': 0.1,
   'max_depth': 3,
   'n_estimators': 50,
   'subsample': 0.5},
  'Best Score': 0.6958908504493688,
  'F1 Score on Validation': 0.6305943404100227},
 'Random Forest Classifier': {'Best Parameters': {'max_depth': None,
   'min_samples_leaf': 4,
   'min_samples_split': 2,
   'n_estimators': 50},
  'Best Score': 0.6767756317069551,
  'F1 Score on Validation': 0.6235016493593476},
 'Extra Trees Classifier': {'Best Parameters': {'max_depth': 10,
   'min_samples_leaf': 2,
   'min_samples_split': 10,
   'n_estimators': 100},
  'Best Score': 0.6727562429260343,
  'F1 Score on Validation': 0.5968012718938813}}

In [None]:
optimal_xgb_clf = XGBClassifier(
    colsample_bytree=1,
    learning_rate=0.1,
    max_depth=3,
    n_estimators=50,
    subsample=0.5,
    random_state=42
)

In [None]:
optimal_xgb_clf.fit(X_train, y_train_enc)

In [None]:
y_pred_test = optimal_xgb_clf.predict(X_test)

In [None]:
inverse_mapping = {1: 'H', 0: 'D', 2: 'A'}

# Convert y_pred_test back to original form
y_pred_test_original = [inverse_mapping[label] for label in y_pred_test]

In [None]:
predictions_df = pd.DataFrame(y_pred_test_original, columns=['Predictions'])

In [None]:
predictions_df.to_csv('france_1.csv', index=False)
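
The saved file contains only a `Predictions` column, so rows can be matched to fixtures only by position. A possible variant keeps the team names alongside each prediction and derives the inverse mapping from the forward one so the two cannot drift apart. The fixture rows and predicted codes below are placeholders, not model output.

```python
import pandas as pd

label_map = {"H": 1, "D": 0, "A": 2}  # same encoding as used for y_train_enc
inverse_map = {code: label for label, code in label_map.items()}

# Placeholder fixtures and encoded predictions, for illustration only
fixtures = pd.DataFrame({"HomeTeam": ["Lyon", "Strasbourg"],
                         "AwayTeam": ["Ajaccio", "Monaco"]})
y_pred_codes = [1, 2]

out = fixtures.copy()
out["Predictions"] = pd.Series(y_pred_codes).map(inverse_map).to_numpy()
print(out)
```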

In [None]:
train = fra1[fra1['Year'] < 2022]
validation = fra1[fra1['Year'] == 2022]

In [None]:
X_train = train.drop(['FTR', 'total_goal'], axis=1)
y_train = train['total_goal']
X_validation = validation.drop(['FTR', 'total_goal'], axis=1)
y_validation = validation['total_goal']

In [None]:
X_test = df2023.copy()
X_test = X_test[X_train.columns]

In [None]:
X_train.shape , y_train.shape, X_validation.shape, y_validation.shape, X_test.shape

((2130, 21), (2130,), (370, 21), (370,), (380, 21))

In [None]:

from sklearn.preprocessing import LabelEncoder
# Update the set of all teams to include teams from X_test
all_teams = set(X_train['HomeTeam'].unique()).union(set(X_train['AwayTeam'].unique()))
all_teams = all_teams.union(set(X_validation['HomeTeam'].unique())).union(set(X_validation['AwayTeam'].unique()))
all_teams = all_teams.union(set(X_test['HomeTeam'].unique())).union(set(X_test['AwayTeam'].unique()))

# Convert the set to a list
all_teams_list = list(all_teams)

# Fit the LabelEncoder with the updated list of all teams
encoder = LabelEncoder()
encoder.fit(all_teams_list)

# Transform 'HomeTeam' and 'AwayTeam' in all datasets
X_train['HomeTeam'] = encoder.transform(X_train['HomeTeam'])
X_train['AwayTeam'] = encoder.transform(X_train['AwayTeam'])
X_validation['HomeTeam'] = encoder.transform(X_validation['HomeTeam'])
X_validation['AwayTeam'] = encoder.transform(X_validation['AwayTeam'])
X_test['HomeTeam'] = encoder.transform(X_test['HomeTeam'])
X_test['AwayTeam'] = encoder.transform(X_test['AwayTeam'])

In [None]:
from lazypredict.Supervised import LazyRegressor

# Create an instance of LazyRegressor
reg = LazyRegressor(verbose=0, ignore_warnings=False, custom_metric=None)

# Fit the model
models, predictions = reg.fit(X_train, X_validation, y_train, y_validation)

 21%|██▏       | 9/42 [00:01<00:07,  4.37it/s]

GammaRegressor model failed to execute
Some value(s) of y are out of the valid range of the loss 'HalfGammaLoss'.


 74%|███████▍  | 31/42 [00:08<00:03,  2.82it/s]

QuantileRegressor model failed to execute
Solver interior-point is not anymore available in SciPy >= 1.11.0.


100%|██████████| 42/42 [00:11<00:00,  3.55it/s]

[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000432 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 1069
[LightGBM] [Info] Number of data points in the train set: 2130, number of used features: 21
[LightGBM] [Info] Start training from score 2.631455





In [None]:
models

Model,Adjusted R-Squared,R-Squared,RMSE,Time Taken
NuSVR,0.12,0.17,1.53,0.42
SVR,0.1,0.15,1.55,0.32
LinearSVR,0.1,0.15,1.55,0.09
TransformedTargetRegressor,0.1,0.15,1.55,0.02
LinearRegression,0.1,0.15,1.55,0.03
Ridge,0.1,0.15,1.55,0.02
RidgeCV,0.1,0.15,1.55,0.03
BayesianRidge,0.1,0.15,1.55,0.04
HuberRegressor,0.1,0.15,1.55,0.06
Lars,0.1,0.15,1.55,0.02


In [None]:
from sklearn.linear_model import ElasticNetCV
from sklearn.svm import SVR
from sklearn.ensemble import RandomForestRegressor
from xgboost import XGBRegressor
from sklearn.linear_model import PoissonRegressor
from sklearn.metrics import mean_absolute_error, r2_score

# Initialize the models
elastic_net_cv = ElasticNetCV(cv=5, random_state=42)  # Adjust parameters as necessary
poisson_regressor = PoissonRegressor()
svr = SVR()  # Default parameters, adjust as necessary
random_forest_reg = RandomForestRegressor(n_estimators=100, random_state=42)
xgb_reg = XGBRegressor(random_state=42)  # Default parameters, adjust as necessary

# Fit the models
elastic_net_cv.fit(X_train, y_train)
poisson_regressor.fit(X_train, y_train)
svr.fit(X_train, y_train)
random_forest_reg.fit(X_train, y_train)
xgb_reg.fit(X_train, y_train)

# Predict y_validation
y_pred_elastic_net_cv = elastic_net_cv.predict(X_validation)
y_pred_poisson_regressor = poisson_regressor.predict(X_validation)
y_pred_svr = svr.predict(X_validation)
y_pred_random_forest = random_forest_reg.predict(X_validation)
y_pred_xgb = xgb_reg.predict(X_validation)

# Round the predictions to whole goals
y_pred_elastic_net_cv_rounded = np.rint(y_pred_elastic_net_cv)
y_pred_poisson_regressor_rounded = np.rint(y_pred_poisson_regressor)
y_pred_svr_rounded = np.rint(y_pred_svr)
y_pred_random_forest_rounded = np.rint(y_pred_random_forest)
y_pred_xgb_rounded = np.rint(y_pred_xgb)

# Calculate MAE and R2 score using rounded predictions
mae_elastic_net_cv = mean_absolute_error(y_validation, y_pred_elastic_net_cv_rounded)
r2_elastic_net_cv = r2_score(y_validation, y_pred_elastic_net_cv_rounded)

mae_poisson_regressor = mean_absolute_error(y_validation, y_pred_poisson_regressor_rounded)
r2_poisson_regressor = r2_score(y_validation, y_pred_poisson_regressor_rounded)

mae_svr = mean_absolute_error(y_validation, y_pred_svr_rounded)
r2_svr = r2_score(y_validation, y_pred_svr_rounded)

mae_random_forest = mean_absolute_error(y_validation, y_pred_random_forest_rounded)
r2_random_forest = r2_score(y_validation, y_pred_random_forest_rounded)

mae_xgb = mean_absolute_error(y_validation, y_pred_xgb_rounded)
r2_xgb = r2_score(y_validation, y_pred_xgb_rounded)

# Print out the performance with rounded predictions
print(f'ElasticNetCV - MAE: {mae_elastic_net_cv}, R2 Score: {r2_elastic_net_cv}')
print(f'Poisson Regressor - MAE: {mae_poisson_regressor}, R2 Score: {r2_poisson_regressor}')
print(f'SVR - MAE: {mae_svr}, R2 Score: {r2_svr}')
print(f'Random Forest Regressor - MAE: {mae_random_forest}, R2 Score: {r2_random_forest}')
print(f'XGB Regressor - MAE: {mae_xgb}, R2 Score: {r2_xgb}')

ElasticNetCV - MAE: 1.2864864864864864, R2 Score: 0.10440458870334379
Poisson Regressor - MAE: 1.3, R2 Score: 0.0405695966215609
SVR - MAE: 1.3756756756756756, R2 Score: -0.23954215967760595
Random Forest Regressor - MAE: 1.2972972972972974, R2 Score: 0.07963280073130863
XGB Regressor - MAE: 1.3972972972972972, R2 Score: -0.0718562103284448
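
The metrics in this cell are computed on predictions rounded to whole goals, whereas the grid-search cell that follows scores unrounded predictions, so the two sets of MAE values are not directly comparable. A tiny example with made-up numbers shows how rounding alone changes the MAE:

```python
import numpy as np

y_true = np.array([2, 3, 1, 4])
y_pred = np.array([2.4, 2.6, 1.5, 3.2])  # made-up regression outputs

mae_raw = np.mean(np.abs(y_true - y_pred))
# np.rint rounds to the nearest integer, with exact halves going to the even neighbour
mae_rounded = np.mean(np.abs(y_true - np.rint(y_pred)))
print(mae_raw, mae_rounded)
```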


In [None]:
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import mean_absolute_error

# Parameter grids
param_grid_poisson = {
    'alpha': [0.01, 0.1, 1, 10],
    'max_iter': [100, 300, 500]
}

param_grid_random_forest = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

# GridSearchCV setup
grid_search_poisson = GridSearchCV(poisson_regressor, param_grid_poisson, cv=5, scoring='neg_mean_absolute_error', verbose=1, n_jobs=-1)
grid_search_random_forest = GridSearchCV(random_forest_reg, param_grid_random_forest, cv=5, scoring='neg_mean_absolute_error', verbose=1, n_jobs=-1)

# Fitting models
grid_search_poisson.fit(X_train, y_train)
grid_search_random_forest.fit(X_train, y_train)

# Best parameters and scores
best_params_poisson = grid_search_poisson.best_params_
best_score_poisson = grid_search_poisson.best_score_

best_params_random_forest = grid_search_random_forest.best_params_
best_score_random_forest = grid_search_random_forest.best_score_

# Predict and calculate MAE
y_pred_poisson = grid_search_poisson.best_estimator_.predict(X_validation)
mae_poisson = mean_absolute_error(y_validation, y_pred_poisson)

y_pred_random_forest = grid_search_random_forest.best_estimator_.predict(X_validation)
mae_random_forest = mean_absolute_error(y_validation, y_pred_random_forest)

# Results
results = {
    "Poisson Regressor": {
        "Best Parameters": best_params_poisson,
        "Best Score (Negative MAE)": best_score_poisson,
        "MAE on Validation": mae_poisson
    },
    "Random Forest Regressor": {
        "Best Parameters": best_params_random_forest,
        "Best Score (Negative MAE)": best_score_random_forest,
        "MAE on Validation": mae_random_forest
    }
}

# ElasticNetCV already uses cross-validation for parameter tuning, so we directly fit it and predict
elastic_net_cv = ElasticNetCV(cv=5, random_state=42).fit(X_train, y_train)
y_pred_elastic_net_cv = elastic_net_cv.predict(X_validation)
mae_elastic_net_cv = mean_absolute_error(y_validation, y_pred_elastic_net_cv)

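# Note: get_params() returns the constructor settings, not tuned values;
# the alpha selected by cross-validation is stored in elastic_net_cv.alpha_
results["ElasticNetCV"] = {
    "Best Parameters": elastic_net_cv.get_params(),
    "MAE on Validation": mae_elastic_net_cv
}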

# Print results
for model, info in results.items():
    print(f"{model}:")
    for key, value in info.items():
        print(f"  {key}: {value}")
    print()

Fitting 5 folds for each of 12 candidates, totalling 60 fits
Fitting 5 folds for each of 108 candidates, totalling 540 fits
Poisson Regressor:
  Best Parameters: {'alpha': 0.01, 'max_iter': 500}
  Best Score (Negative MAE): -1.1652823172811428
  MAE on Validation: 1.2783951884974707

Random Forest Regressor:
  Best Parameters: {'max_depth': 10, 'min_samples_leaf': 4, 'min_samples_split': 10, 'n_estimators': 200}
  Best Score (Negative MAE): -1.1883455207951912
  MAE on Validation: 1.304480948033359

ElasticNetCV:
  Best Parameters: {'alphas': None, 'copy_X': True, 'cv': 5, 'eps': 0.001, 'fit_intercept': True, 'l1_ratio': 0.5, 'max_iter': 1000, 'n_alphas': 100, 'n_jobs': None, 'positive': False, 'precompute': 'auto', 'random_state': 42, 'selection': 'cyclic', 'tol': 0.0001, 'verbose': 0}
  MAE on Validation: 1.2934344637688804
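
The ElasticNetCV entry printed above lists constructor arguments rather than the regularisation strength that cross-validation selected. A minimal sketch, on synthetic data with hypothetical `*_demo` names, of how the fitted values can be read from the estimator's trailing-underscore attributes:

```python
import numpy as np
from sklearn.linear_model import ElasticNetCV

# Synthetic regression data (an assumption for illustration only)
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(150, 5))
true_coef = np.array([1.0, 0.5, 0.0, 0.0, -0.3])
y_demo = X_demo @ true_coef + rng.normal(scale=0.5, size=150)

model_demo = ElasticNetCV(cv=5, random_state=42).fit(X_demo, y_demo)

# get_params() echoes what was passed to the constructor;
# alpha_ and l1_ratio_ hold what the internal cross-validation picked
chosen = {"alpha": model_demo.alpha_, "l1_ratio": model_demo.l1_ratio_}
print(chosen)
```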



In [None]:


# Set the best parameters for PoissonRegressor
best_params_poisson = {
    'alpha': 0.01,
    'max_iter': 500
}

# Initialize and fit the PoissonRegressor with the best parameters
poisson_model = PoissonRegressor(**best_params_poisson)
poisson_model.fit(X_train, y_train)

# Make predictions on the test set
y_pred_test_poisson = poisson_model.predict(X_test)

# Round the predictions to the nearest integer and convert to int type
y_pred_test_poisson_rounded = np.rint(y_pred_test_poisson).astype(int)

# y_pred_test_poisson_rounded contains the final integer predictions for X_test
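
A Poisson regressor predicts the expected number of goals, so rounding is only one way to turn that mean into a forecast. The sketch below uses a hypothetical predicted mean `mu` and only the standard library to show how the same mean could be turned into a probability for each goal total.

```python
import math

def poisson_pmf(k, mu):
    """Probability of observing exactly k goals when the expected total is mu."""
    return math.exp(-mu) * mu**k / math.factorial(k)

mu = 2.4  # hypothetical predicted mean total goals for one match

# Probabilities for totals 0..10 (the tail beyond 10 is negligible for this mean)
probs = [poisson_pmf(k, mu) for k in range(11)]

most_likely_total = max(range(11), key=lambda k: probs[k])
prob_over_2_5 = 1.0 - sum(probs[:3])  # P(total >= 3)

print(most_likely_total, round(prob_over_2_5, 3))
```

For this mean the most likely total matches the rounded prediction, but the distribution also shows how uncertain that single number is.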


In [None]:

# Convert predictions to a DataFrame
predictions_df_poisson = pd.DataFrame(y_pred_test_poisson_rounded, columns=['Predicted_Total_Goals'])

# Save to CSV
predictions_df_poisson.to_csv('france_1.csv', index=False)


In [None]:
file_path_2 = '/content/drive/My Drive/test/france/2/2223.csv'
df2023_fra2 = pd.read_csv(file_path_2)

In [None]:
df2023_fra2 = df2023_fra2[columns_test]

In [None]:
df2023_fra2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 379 entries, 0 to 378
Data columns (total 6 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Date      379 non-null    object 
 1   HomeTeam  379 non-null    object 
 2   AwayTeam  379 non-null    object 
 3   B365H     377 non-null    float64
 4   B365D     377 non-null    float64
 5   B365A     377 non-null    float64
dtypes: float64(3), object(3)
memory usage: 17.9+ KB


In [None]:
df2023_fra2 = impute_missing_values_knn(df2023_fra2)

In [None]:
df2023_fra2 = calculate_and_apply_overall_averages(season_dfs, df2023_fra2)

In [None]:
df2023_fra2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 379 entries, 0 to 378
Data columns (total 16 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   Date                         379 non-null    object 
 1   HomeTeam                     379 non-null    object 
 2   AwayTeam                     379 non-null    object 
 3   B365H                        379 non-null    float64
 4   B365D                        379 non-null    float64
 5   B365A                        379 non-null    float64
 6   HomeTeam_WinRate             341 non-null    float64
 7   AwayTeam_WinRate             341 non-null    float64
 8   HomeTeam_GoalsAvg            341 non-null    float64
 9   AwayTeam_GoalsAvg            341 non-null    float64
 10  HomeTeam_goals_conceded_avg  341 non-null    float64
 11  AwayTeam_goals_conceded_avg  341 non-null    float64
 12  H_goal_ratio                 341 non-null    float64
 13  A_goal_ratio        

In [None]:
merged_df = fra2.copy()

# Calculate head-to-head stats using merged data
head_to_head_stats = calculate_head_to_head_stats(merged_df)

# Apply the adjusted win-loss ratio to df2023
df2023_fra2 = apply_adjusted_win_loss_ratio_to_2023(df2023_fra2, head_to_head_stats)

In [None]:

def fill_missing_with_mean(df):
    """
    Fill missing values in each numeric column of the DataFrame with the mean of that column.

    Parameters:
    df (pd.DataFrame): The dataset with missing values.

    Returns:
    pd.DataFrame: The DataFrame with missing values filled.
    """
    for column in df.columns:
        if df[column].dtype in ['float64', 'int64']:
            # Column mean, rounded to two decimals
            mean_value = round(df[column].mean(), 2)
            # Assign the result back: fillna(inplace=True) on a selected column
            # is not guaranteed to modify df in recent pandas versions
            df[column] = df[column].fillna(mean_value)
    return df

In [None]:
df2023_fra2 = fill_missing_with_mean(df2023_fra2)

In [None]:
df2023_fra2 = process_time_data(df2023_fra2, 2023)

In [None]:
df2023_fra2 = add_probability_B365(df2023_fra2)

In [None]:
df2023_fra2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 379 entries, 0 to 378
Data columns (total 21 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   HomeTeam                     379 non-null    object 
 1   AwayTeam                     379 non-null    object 
 2   B365H                        379 non-null    float64
 3   B365D                        379 non-null    float64
 4   B365A                        379 non-null    float64
 5   HomeTeam_WinRate             379 non-null    float64
 6   AwayTeam_WinRate             379 non-null    float64
 7   HomeTeam_GoalsAvg            379 non-null    float64
 8   AwayTeam_GoalsAvg            379 non-null    float64
 9   HomeTeam_goals_conceded_avg  379 non-null    float64
 10  AwayTeam_goals_conceded_avg  379 non-null    float64
 11  H_goal_ratio                 379 non-null    float64
 12  A_goal_ratio                 379 non-null    float64
 13  attack_strength_home

In [None]:
train = fra2[fra2['Year'] < 2022]
validation = fra2[fra2['Year'] == 2022]

In [None]:
X_train = train.drop(['FTR', 'total_goal'], axis=1)
y_train = train['FTR']
X_validation = validation.drop(['FTR', 'total_goal'], axis=1)
y_validation = validation['FTR']

In [None]:
X_test = df2023_fra2.copy()
X_test = X_test[X_train.columns]

In [None]:
X_train.shape , y_train.shape, X_validation.shape, y_validation.shape, X_test.shape

((1385, 21), (1385,), (371, 21), (371,), (379, 21))

In [None]:
y_train_enc = y_train.map({'H': 1, 'D': 0, 'A': 2})
y_validation_enc = y_validation.map({'H': 1, 'D': 0, 'A': 2})

In [None]:

from sklearn.preprocessing import LabelEncoder
# Update the set of all teams to include teams from X_test
all_teams = set(X_train['HomeTeam'].unique()).union(set(X_train['AwayTeam'].unique()))
all_teams = all_teams.union(set(X_validation['HomeTeam'].unique())).union(set(X_validation['AwayTeam'].unique()))
all_teams = all_teams.union(set(X_test['HomeTeam'].unique())).union(set(X_test['AwayTeam'].unique()))

# Convert the set to a list
all_teams_list = list(all_teams)

# Fit the LabelEncoder with the updated list of all teams
encoder = LabelEncoder()
encoder.fit(all_teams_list)

# Transform 'HomeTeam' and 'AwayTeam' in all datasets
X_train['HomeTeam'] = encoder.transform(X_train['HomeTeam'])
X_train['AwayTeam'] = encoder.transform(X_train['AwayTeam'])
X_validation['HomeTeam'] = encoder.transform(X_validation['HomeTeam'])
X_validation['AwayTeam'] = encoder.transform(X_validation['AwayTeam'])
X_test['HomeTeam'] = encoder.transform(X_test['HomeTeam'])
X_test['AwayTeam'] = encoder.transform(X_test['AwayTeam'])


In [None]:
from lazypredict.Supervised import LazyClassifier
clf = LazyClassifier(verbose = 0, ignore_warnings = False, custom_metric = None)
models,pred = clf.fit(X_train, X_validation, y_train_enc, y_validation_enc)

  7%|▋         | 2/29 [00:00<00:06,  4.36it/s]

ROC AUC couldn't be calculated for AdaBoostClassifier
multi_class must be in ('ovo', 'ovr')
ROC AUC couldn't be calculated for BaggingClassifier
multi_class must be in ('ovo', 'ovr')
ROC AUC couldn't be calculated for BernoulliNB
multi_class must be in ('ovo', 'ovr')


 14%|█▍        | 4/29 [00:01<00:11,  2.09it/s]

ROC AUC couldn't be calculated for CalibratedClassifierCV
multi_class must be in ('ovo', 'ovr')
CategoricalNB model failed to execute
Negative values in data passed to CategoricalNB (input X)
ROC AUC couldn't be calculated for DecisionTreeClassifier
multi_class must be in ('ovo', 'ovr')
ROC AUC couldn't be calculated for DummyClassifier
multi_class must be in ('ovo', 'ovr')
ROC AUC couldn't be calculated for ExtraTreeClassifier
multi_class must be in ('ovo', 'ovr')


 31%|███       | 9/29 [00:02<00:03,  5.27it/s]

ROC AUC couldn't be calculated for ExtraTreesClassifier
multi_class must be in ('ovo', 'ovr')
ROC AUC couldn't be calculated for GaussianNB
multi_class must be in ('ovo', 'ovr')
ROC AUC couldn't be calculated for KNeighborsClassifier
multi_class must be in ('ovo', 'ovr')


 48%|████▊     | 14/29 [00:02<00:01,  7.51it/s]

ROC AUC couldn't be calculated for LabelPropagation
multi_class must be in ('ovo', 'ovr')
ROC AUC couldn't be calculated for LabelSpreading
multi_class must be in ('ovo', 'ovr')
ROC AUC couldn't be calculated for LinearDiscriminantAnalysis
multi_class must be in ('ovo', 'ovr')


 62%|██████▏   | 18/29 [00:02<00:01,  7.81it/s]

ROC AUC couldn't be calculated for LinearSVC
multi_class must be in ('ovo', 'ovr')
ROC AUC couldn't be calculated for LogisticRegression
multi_class must be in ('ovo', 'ovr')
ROC AUC couldn't be calculated for NearestCentroid
multi_class must be in ('ovo', 'ovr')
ROC AUC couldn't be calculated for NuSVC
multi_class must be in ('ovo', 'ovr')
ROC AUC couldn't be calculated for PassiveAggressiveClassifier
multi_class must be in ('ovo', 'ovr')
ROC AUC couldn't be calculated for Perceptron
multi_class must be in ('ovo', 'ovr')
ROC AUC couldn't be calculated for QuadraticDiscriminantAnalysis
multi_class must be in ('ovo', 'ovr')


 86%|████████▌ | 25/29 [00:03<00:00, 10.40it/s]

ROC AUC couldn't be calculated for RandomForestClassifier
multi_class must be in ('ovo', 'ovr')
ROC AUC couldn't be calculated for RidgeClassifier
multi_class must be in ('ovo', 'ovr')
ROC AUC couldn't be calculated for RidgeClassifierCV
multi_class must be in ('ovo', 'ovr')
ROC AUC couldn't be calculated for SGDClassifier
multi_class must be in ('ovo', 'ovr')


 93%|█████████▎| 27/29 [00:03<00:00, 11.16it/s]

ROC AUC couldn't be calculated for SVC
multi_class must be in ('ovo', 'ovr')
StackingClassifier model failed to execute
StackingClassifier.__init__() missing 1 required positional argument: 'estimators'
ROC AUC couldn't be calculated for XGBClassifier
multi_class must be in ('ovo', 'ovr')
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000104 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 803
[LightGBM] [Info] Number of data points in the train set: 1385, number of used features: 21
[LightGBM] [Info] Start training from score -1.262194
[LightGBM] [Info] Start training from score -0.863554
[LightGBM] [Info] Start training from score -1.219740


100%|██████████| 29/29 [00:04<00:00,  6.97it/s]

ROC AUC couldn't be calculated for LGBMClassifier
multi_class must be in ('ovo', 'ovr')
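
The repeated "multi_class must be in ('ovo', 'ovr')" message comes from scikit-learn's `roc_auc_score`, which needs the `multi_class` argument and class probabilities when the target has more than two classes. A minimal sketch of computing a three-class ROC AUC by hand follows. The data is synthetic, and logistic regression is only a stand-in for whichever classifier is being evaluated.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# Synthetic three-class problem standing in for draw / home win / away win
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=4,
    n_classes=3, n_clusters_per_class=1, random_state=0,
)
X_tr, X_va = X_demo[:200], X_demo[200:]
y_tr, y_va = y_demo[:200], y_demo[200:]

clf_demo = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# Multiclass ROC AUC needs one probability column per class, not hard labels
proba = clf_demo.predict_proba(X_va)
auc_ovr = roc_auc_score(y_va, proba, multi_class="ovr", average="macro")
print(f"One-vs-rest macro ROC AUC: {auc_ovr:.3f}")
```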





In [None]:
models

Model,Accuracy,Balanced Accuracy,ROC AUC,F1 Score,Time Taken
LinearSVC,0.7,0.66,,0.66,0.33
XGBClassifier,0.67,0.66,,0.68,0.22
CalibratedClassifierCV,0.69,0.66,,0.67,1.22
RidgeClassifier,0.7,0.66,,0.66,0.02
LGBMClassifier,0.67,0.65,,0.68,0.23
RidgeClassifierCV,0.7,0.65,,0.66,0.02
NuSVC,0.67,0.65,,0.67,0.13
SGDClassifier,0.69,0.65,,0.66,0.09
RandomForestClassifier,0.67,0.65,,0.67,0.39
LogisticRegression,0.67,0.64,,0.66,0.04


In [None]:
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier, ExtraTreesClassifier
from sklearn.metrics import accuracy_score, f1_score
from sklearn.tree import DecisionTreeClassifier
from xgboost import XGBClassifier

# Initialize the models
bagging_clf = BaggingClassifier(n_estimators=100, random_state=42)
extra_trees_clf = ExtraTreesClassifier(n_estimators=100, random_state=42)
random_forest_clf = RandomForestClassifier(n_estimators=100, random_state=42)
decision_tree_clf = DecisionTreeClassifier(random_state=42)
xgb_clf = XGBClassifier(random_state=42)  # Default parameters, adjust as necessary

# Fit the models
bagging_clf.fit(X_train, y_train_enc)
extra_trees_clf.fit(X_train, y_train_enc)
random_forest_clf.fit(X_train, y_train_enc)
decision_tree_clf.fit(X_train, y_train_enc)
xgb_clf.fit(X_train, y_train_enc)

# Predict y_validation
y_pred_bagging = bagging_clf.predict(X_validation)
y_pred_extra_trees = extra_trees_clf.predict(X_validation)
y_pred_random_forest = random_forest_clf.predict(X_validation)
y_pred_decision_tree = decision_tree_clf.predict(X_validation)
y_pred_xgb = xgb_clf.predict(X_validation)

# Calculate accuracy and F1 score for each model
accuracy_bagging = accuracy_score(y_validation_enc, y_pred_bagging)
f1_bagging = f1_score(y_validation_enc, y_pred_bagging, average='macro')

accuracy_extra_trees = accuracy_score(y_validation_enc, y_pred_extra_trees)
f1_extra_trees = f1_score(y_validation_enc, y_pred_extra_trees, average='macro')

accuracy_random_forest = accuracy_score(y_validation_enc, y_pred_random_forest)
f1_random_forest = f1_score(y_validation_enc, y_pred_random_forest, average='macro')

accuracy_decision_tree = accuracy_score(y_validation_enc, y_pred_decision_tree)
f1_decision_tree = f1_score(y_validation_enc, y_pred_decision_tree, average='macro')

accuracy_xgb = accuracy_score(y_validation_enc, y_pred_xgb)
f1_xgb = f1_score(y_validation_enc, y_pred_xgb, average='macro')

# Print out the performance
print(f'Bagging Classifier - Accuracy: {accuracy_bagging}, F1 Score: {f1_bagging}')
print(f'Extra Trees Classifier - Accuracy: {accuracy_extra_trees}, F1 Score: {f1_extra_trees}')
print(f'Random Forest Classifier - Accuracy: {accuracy_random_forest}, F1 Score: {f1_random_forest}')
print(f'Decision Tree Classifier - Accuracy: {accuracy_decision_tree}, F1 Score: {f1_decision_tree}')
print(f'XGB Classifier - Accuracy: {accuracy_xgb}, F1 Score: {f1_xgb}')


Bagging Classifier - Accuracy: 0.6630727762803235, F1 Score: 0.6543297983503412
Extra Trees Classifier - Accuracy: 0.6576819407008087, F1 Score: 0.6467333107763403
Random Forest Classifier - Accuracy: 0.6738544474393531, F1 Score: 0.660276933468423
Decision Tree Classifier - Accuracy: 0.6361185983827493, F1 Score: 0.6211747197000594
XGB Classifier - Accuracy: 0.6738544474393531, F1 Score: 0.6624503341159472
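
Because the Bet365 odds are among the features, one useful reference point for these accuracies is a rule that simply picks the bookmaker's favourite. The sketch below uses a few hypothetical rows, not the notebook's data. It assumes decimal odds, where the lowest value marks the most likely outcome.

```python
import pandas as pd

# Hypothetical matches: decimal odds and the actual full-time result
demo = pd.DataFrame({
    "B365H": [1.50, 3.10, 2.40, 4.00],
    "B365D": [4.00, 3.20, 3.10, 3.60],
    "B365A": [6.50, 2.30, 3.00, 1.85],
    "FTR":   ["H", "A", "D", "A"],
})

odds_to_result = {"B365H": "H", "B365D": "D", "B365A": "A"}

# The column holding the lowest odds identifies the bookmaker's favourite
favourite = demo[["B365H", "B365D", "B365A"]].idxmin(axis=1).map(odds_to_result)
baseline_accuracy = (favourite == demo["FTR"]).mean()

print(favourite.tolist(), baseline_accuracy)
```

Running the same rule on the validation season would show how much of the classifiers' accuracy is already contained in the odds.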


In [None]:
from sklearn.model_selection import GridSearchCV

param_grid_xgb = {
    'n_estimators': [50, 100, 200],
    'max_depth': [3, 6, 10],
    'learning_rate': [0.01, 0.1, 0.2],
    'subsample': [0.5, 0.7, 1],
    'colsample_bytree': [0.5, 0.7, 1]
}

param_grid_random_forest = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

param_grid_extra_trees = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}



grid_search_xgb = GridSearchCV(xgb_clf, param_grid_xgb, cv=5, scoring='f1_macro', verbose=1, n_jobs=-1)
grid_search_random_forest = GridSearchCV(random_forest_clf, param_grid_random_forest, cv=5, scoring='f1_macro', verbose=1, n_jobs=-1)
grid_search_extra_trees = GridSearchCV(extra_trees_clf, param_grid_extra_trees, cv=5, scoring='f1_macro', verbose=1, n_jobs=-1)

grid_search_xgb.fit(X_train, y_train_enc)
grid_search_random_forest.fit(X_train, y_train_enc)
grid_search_extra_trees.fit(X_train, y_train_enc)

best_params_xgb = grid_search_xgb.best_params_
best_score_xgb = grid_search_xgb.best_score_

best_params_random_forest = grid_search_random_forest.best_params_
best_score_random_forest = grid_search_random_forest.best_score_

best_params_extra_trees = grid_search_extra_trees.best_params_
best_score_extra_trees = grid_search_extra_trees.best_score_


y_pred_xgb = grid_search_xgb.best_estimator_.predict(X_validation)
f1_score_xgb = f1_score(y_validation_enc, y_pred_xgb, average='macro')

y_pred_random_forest = grid_search_random_forest.best_estimator_.predict(X_validation)
f1_score_random_forest = f1_score(y_validation_enc, y_pred_random_forest, average='macro')

y_pred_extra_trees = grid_search_extra_trees.best_estimator_.predict(X_validation)
f1_score_extra_trees = f1_score(y_validation_enc, y_pred_extra_trees, average='macro')


results = {
    "XGB Classifier": {
        "Best Parameters": best_params_xgb,
        "Best Score": best_score_xgb,
        "F1 Score on Validation": f1_score_xgb
    },
    "Random Forest Classifier": {
        "Best Parameters": best_params_random_forest,
        "Best Score": best_score_random_forest,
        "F1 Score on Validation": f1_score_random_forest
    },
    "Extra Trees Classifier": {
        "Best Parameters": best_params_extra_trees,
        "Best Score": best_score_extra_trees,
        "F1 Score on Validation": f1_score_extra_trees
    }
}







Fitting 5 folds for each of 243 candidates, totalling 1215 fits
Fitting 5 folds for each of 108 candidates, totalling 540 fits
Fitting 5 folds for each of 108 candidates, totalling 540 fits


In [None]:
results

{'XGB Classifier': {'Best Parameters': {'colsample_bytree': 0.7,
   'learning_rate': 0.01,
   'max_depth': 3,
   'n_estimators': 50,
   'subsample': 0.5},
  'Best Score': 0.6422608005763537,
  'F1 Score on Validation': 0.6523488562091503},
 'Random Forest Classifier': {'Best Parameters': {'max_depth': None,
   'min_samples_leaf': 4,
   'min_samples_split': 10,
   'n_estimators': 100},
  'Best Score': 0.6557714447062095,
  'F1 Score on Validation': 0.6388091440723019},
 'Extra Trees Classifier': {'Best Parameters': {'max_depth': 10,
   'min_samples_leaf': 4,
   'min_samples_split': 2,
   'n_estimators': 200},
  'Best Score': 0.636978349930654,
  'F1 Score on Validation': 0.6461043217680386}}
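
With an integer `cv`, the grid search builds its folds without regard to match dates, so a fold may be trained on matches played after the ones it is scored on. One possible alternative, sketched below on synthetic data, is `TimeSeriesSplit`. It assumes the rows are sorted chronologically, which the notebook does not guarantee. The small grid and `*_demo` names are illustrative only.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit

# Synthetic stand-in data; rows are assumed to be ordered by match date
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(120, 4))
y_demo = rng.integers(0, 3, size=120)  # 0 = draw, 1 = home win, 2 = away win

# Each split trains on an initial block of rows and validates on the block that follows
tscv = TimeSeriesSplit(n_splits=4)
folds = list(tscv.split(X_demo))

search_demo = GridSearchCV(
    RandomForestClassifier(n_estimators=20, random_state=42),
    {"max_depth": [3, 5]},
    cv=tscv,
    scoring="f1_macro",
)
search_demo.fit(X_demo, y_demo)
print(search_demo.best_params_)
```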

In [None]:
# Initialize XGBClassifier with the best hyperparameters
optimal_xgb_clf = XGBClassifier(
    colsample_bytree=0.7,
    learning_rate=0.01,
    max_depth=3,
    n_estimators=50,
    subsample=0.5,
    random_state=42  # Optional for reproducibility
)

In [None]:
optimal_xgb_clf.fit(X_train, y_train_enc)

In [None]:
y_pred_test_xgb = optimal_xgb_clf.predict(X_test)

In [None]:
# Define the inverse mapping
inverse_mapping = {1: 'H', 0: 'D', 2: 'A'}

# Convert predictions back to the original form
y_pred_test_xgb_original = [inverse_mapping[label] for label in y_pred_test_xgb]


In [None]:
predictions_df = pd.DataFrame(y_pred_test_xgb_original, columns=['Predictions'])

In [None]:
predictions_df.to_csv('france_2.csv', index=False)

In [None]:
train = fra2[fra2['Year'] < 2022]
validation = fra2[fra2['Year'] == 2022]

In [None]:
X_train = train.drop(['FTR', 'total_goal'], axis=1)
y_train = train['total_goal']
X_validation = validation.drop(['FTR', 'total_goal'], axis=1)
y_validation = validation['total_goal']

In [None]:
X_test = df2023_fra2.copy()
X_test = X_test[X_train.columns]

In [None]:
X_train.shape , y_train.shape, X_validation.shape, y_validation.shape, X_test.shape

((1385, 21), (1385,), (371, 21), (371,), (379, 21))

In [None]:

from sklearn.preprocessing import LabelEncoder
# Update the set of all teams to include teams from X_test
all_teams = set(X_train['HomeTeam'].unique()).union(set(X_train['AwayTeam'].unique()))
all_teams = all_teams.union(set(X_validation['HomeTeam'].unique())).union(set(X_validation['AwayTeam'].unique()))
all_teams = all_teams.union(set(X_test['HomeTeam'].unique())).union(set(X_test['AwayTeam'].unique()))

# Convert the set to a list
all_teams_list = list(all_teams)

# Fit the LabelEncoder with the updated list of all teams
encoder = LabelEncoder()
encoder.fit(all_teams_list)

# Transform 'HomeTeam' and 'AwayTeam' in all datasets
X_train['HomeTeam'] = encoder.transform(X_train['HomeTeam'])
X_train['AwayTeam'] = encoder.transform(X_train['AwayTeam'])
X_validation['HomeTeam'] = encoder.transform(X_validation['HomeTeam'])
X_validation['AwayTeam'] = encoder.transform(X_validation['AwayTeam'])
X_test['HomeTeam'] = encoder.transform(X_test['HomeTeam'])
X_test['AwayTeam'] = encoder.transform(X_test['AwayTeam'])

In [None]:

from lazypredict.Supervised import LazyRegressor

# Create an instance of LazyRegressor
reg = LazyRegressor(verbose=0, ignore_warnings=False, custom_metric=None)

# Fit the candidate regressors and score them on the validation set
models, predictions = reg.fit(X_train, X_validation, y_train, y_validation)

 21%|██▏       | 9/42 [00:01<00:07,  4.29it/s]

GammaRegressor model failed to execute
Some value(s) of y are out of the valid range of the loss 'HalfGammaLoss'.


 74%|███████▍  | 31/42 [00:07<00:03,  3.06it/s]

QuantileRegressor model failed to execute
Solver interior-point is not anymore available in SciPy >= 1.11.0.


100%|██████████| 42/42 [00:09<00:00,  4.22it/s]

[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000377 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 803
[LightGBM] [Info] Number of data points in the train set: 1385, number of used features: 21
[LightGBM] [Info] Start training from score 2.392780





In [None]:
models

Model,Adjusted R-Squared,R-Squared,RMSE,Time Taken
OrthogonalMatchingPursuitCV,0.09,0.14,1.34,0.03
LassoCV,0.08,0.13,1.35,0.2
LassoLarsCV,0.08,0.13,1.35,0.05
LarsCV,0.08,0.13,1.35,0.08
ElasticNetCV,0.08,0.13,1.35,0.24
LassoLarsIC,0.08,0.13,1.35,0.02
LinearRegression,0.08,0.13,1.35,0.02
TransformedTargetRegressor,0.08,0.13,1.35,0.02
Ridge,0.08,0.13,1.35,0.02
RidgeCV,0.08,0.13,1.35,0.04


In [None]:
from sklearn.linear_model import ElasticNetCV
from sklearn.svm import SVR
from sklearn.ensemble import RandomForestRegressor
from xgboost import XGBRegressor
from sklearn.linear_model import PoissonRegressor
from sklearn.metrics import mean_absolute_error, r2_score

# Initialize the models
elastic_net_cv = ElasticNetCV(cv=5, random_state=42)  # Adjust parameters as necessary
poisson_regressor = PoissonRegressor()
svr = SVR()  # Default parameters, adjust as necessary
random_forest_reg = RandomForestRegressor(n_estimators=100, random_state=42)
xgb_reg = XGBRegressor(random_state=42)  # Default parameters, adjust as necessary

# Fit the models
elastic_net_cv.fit(X_train, y_train)
poisson_regressor.fit(X_train, y_train)
svr.fit(X_train, y_train)
random_forest_reg.fit(X_train, y_train)
xgb_reg.fit(X_train, y_train)

# Predict on the validation set
y_pred_elastic_net_cv = elastic_net_cv.predict(X_validation)
y_pred_poisson_regressor = poisson_regressor.predict(X_validation)
y_pred_svr = svr.predict(X_validation)
y_pred_random_forest = random_forest_reg.predict(X_validation)
y_pred_xgb = xgb_reg.predict(X_validation)

# Round predictions to the nearest integer, since goal totals are whole numbers
y_pred_elastic_net_cv_rounded = np.rint(y_pred_elastic_net_cv)
y_pred_poisson_regressor_rounded = np.rint(y_pred_poisson_regressor)
y_pred_svr_rounded = np.rint(y_pred_svr)
y_pred_random_forest_rounded = np.rint(y_pred_random_forest)
y_pred_xgb_rounded = np.rint(y_pred_xgb)

# Calculate MAE and R2 score using rounded predictions
mae_elastic_net_cv = mean_absolute_error(y_validation, y_pred_elastic_net_cv_rounded)
r2_elastic_net_cv = r2_score(y_validation, y_pred_elastic_net_cv_rounded)

mae_poisson_regressor = mean_absolute_error(y_validation, y_pred_poisson_regressor_rounded)
r2_poisson_regressor = r2_score(y_validation, y_pred_poisson_regressor_rounded)

mae_svr = mean_absolute_error(y_validation, y_pred_svr_rounded)
r2_svr = r2_score(y_validation, y_pred_svr_rounded)

mae_random_forest = mean_absolute_error(y_validation, y_pred_random_forest_rounded)
r2_random_forest = r2_score(y_validation, y_pred_random_forest_rounded)

mae_xgb = mean_absolute_error(y_validation, y_pred_xgb_rounded)
r2_xgb = r2_score(y_validation, y_pred_xgb_rounded)

# Print out the performance with rounded predictions
print(f'ElasticNetCV - MAE: {mae_elastic_net_cv}, R2 Score: {r2_elastic_net_cv}')
print(f'Poisson Regressor - MAE: {mae_poisson_regressor}, R2 Score: {r2_poisson_regressor}')
print(f'SVR - MAE: {mae_svr}, R2 Score: {r2_svr}')
print(f'Random Forest Regressor - MAE: {mae_random_forest}, R2 Score: {r2_random_forest}')
print(f'XGB Regressor - MAE: {mae_xgb}, R2 Score: {r2_xgb}')

ElasticNetCV - MAE: 1.0619946091644206, R2 Score: 0.08530825380555973
Poisson Regressor - MAE: 1.0970350404312668, R2 Score: 0.024928321121376218
SVR - MAE: 1.1185983827493262, R2 Score: -0.03673629098161979
Random Forest Regressor - MAE: 1.0862533692722371, R2 Score: 0.06603806252337352
XGB Regressor - MAE: 1.1212938005390836, R2 Score: -0.025174176212308108


In [None]:
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import mean_absolute_error

# Parameter grids
param_grid_poisson = {
    'alpha': [0.01, 0.1, 1, 10],
    'max_iter': [100, 300, 500]
}

param_grid_random_forest = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

# GridSearchCV setup
grid_search_poisson = GridSearchCV(poisson_regressor, param_grid_poisson, cv=5, scoring='neg_mean_absolute_error', verbose=1, n_jobs=-1)
grid_search_random_forest = GridSearchCV(random_forest_reg, param_grid_random_forest, cv=5, scoring='neg_mean_absolute_error', verbose=1, n_jobs=-1)

# Fitting models
grid_search_poisson.fit(X_train, y_train)
grid_search_random_forest.fit(X_train, y_train)

# Best parameters and scores
best_params_poisson = grid_search_poisson.best_params_
best_score_poisson = grid_search_poisson.best_score_

best_params_random_forest = grid_search_random_forest.best_params_
best_score_random_forest = grid_search_random_forest.best_score_

# Predict and calculate MAE
y_pred_poisson = grid_search_poisson.best_estimator_.predict(X_validation)
mae_poisson = mean_absolute_error(y_validation, y_pred_poisson)

y_pred_random_forest = grid_search_random_forest.best_estimator_.predict(X_validation)
mae_random_forest = mean_absolute_error(y_validation, y_pred_random_forest)

# Results
results = {
    "Poisson Regressor": {
        "Best Parameters": best_params_poisson,
        "Best Score (Negative MAE)": best_score_poisson,
        "MAE on Validation": mae_poisson
    },
    "Random Forest Regressor": {
        "Best Parameters": best_params_random_forest,
        "Best Score (Negative MAE)": best_score_random_forest,
        "MAE on Validation": mae_random_forest
    }
}

# ElasticNetCV already uses cross-validation for parameter tuning, so we directly fit it and predict
elastic_net_cv = ElasticNetCV(cv=5, random_state=42).fit(X_train, y_train)
y_pred_elastic_net_cv = elastic_net_cv.predict(X_validation)
mae_elastic_net_cv = mean_absolute_error(y_validation, y_pred_elastic_net_cv)

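# Note: get_params() returns the constructor settings, not tuned values;
# the alpha selected by cross-validation is stored in elastic_net_cv.alpha_
results["ElasticNetCV"] = {
    "Best Parameters": elastic_net_cv.get_params(),
    "MAE on Validation": mae_elastic_net_cv
}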

# Print results
for model, info in results.items():
    print(f"{model}:")
    for key, value in info.items():
        print(f"  {key}: {value}")
    print()

Fitting 5 folds for each of 12 candidates, totalling 60 fits
Fitting 5 folds for each of 108 candidates, totalling 540 fits
Poisson Regressor:
  Best Parameters: {'alpha': 0.1, 'max_iter': 500}
  Best Score (Negative MAE): -1.2013370429101318
  MAE on Validation: 1.086936212788763

Random Forest Regressor:
  Best Parameters: {'max_depth': 10, 'min_samples_leaf': 4, 'min_samples_split': 10, 'n_estimators': 200}
  Best Score (Negative MAE): -1.235656231444986
  MAE on Validation: 1.0834296920512434

ElasticNetCV:
  Best Parameters: {'alphas': None, 'copy_X': True, 'cv': 5, 'eps': 0.001, 'fit_intercept': True, 'l1_ratio': 0.5, 'max_iter': 1000, 'n_alphas': 100, 'n_jobs': None, 'positive': False, 'precompute': 'auto', 'random_state': 42, 'selection': 'cyclic', 'tol': 0.0001, 'verbose': 0}
  MAE on Validation: 1.0830078982584197



In [None]:

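# ElasticNetCV settings as reported by get_params() above; these are constructor
# arguments, and alpha itself is re-selected by the internal cross-validation during fit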
best_params_elastic_net = {
    'alphas': None,
    'copy_X': True,
    'cv': 5,
    'eps': 0.001,
    'fit_intercept': True,
    'l1_ratio': 0.5,
    'max_iter': 1000,
    'n_alphas': 100,
    'n_jobs': None,
    'positive': False,
    'precompute': 'auto',
    'random_state': 42,
    'selection': 'cyclic',
    'tol': 0.0001,
    'verbose': 0
}

# Initialize and fit the ElasticNetCV model
elastic_net_model = ElasticNetCV(**best_params_elastic_net)
elastic_net_model.fit(X_train, y_train)

# Predict on the test set
y_pred_test_elastic_net = elastic_net_model.predict(X_test)

# Round predictions to nearest integer and convert to int type
y_pred_test_elastic_net_rounded = np.rint(y_pred_test_elastic_net).astype(int)

# y_pred_test_elastic_net_rounded contains the final integer predictions for X_test


In [None]:
import pandas as pd

# Convert predictions to a DataFrame
predictions_df_elastic_net = pd.DataFrame(y_pred_test_elastic_net_rounded, columns=['Predicted_Total_Goals'])

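# Save to CSV
# Note: 'france_2.csv' is also the filename used above for the FTR predictions,
# so this call overwrites that file
predictions_df_elastic_net.to_csv('france_2.csv', index=False)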


In [None]:
df2 = fra2.copy()

In [None]:
X_train_2 = df2.drop(['FTR', 'total_goal'], axis=1)
y_train_2 = df2['FTR']
X_test_2 = df2023_fra2.copy()
X_test_2 = X_test_2[X_train_2.columns]

In [None]:
X_train_2.shape, y_train_2.shape, X_test_2.shape

((1756, 21), (1756,), (379, 21))

In [None]:

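from sklearn.preprocessing import LabelEncoder

def label_encode(df):
    # Work on a copy so a DataFrame slice passed in is not modified
    # (this also avoids pandas' SettingWithCopyWarning)
    df = df.copy()

    # Fit one encoder on the union of home and away teams so that
    # a team gets the same integer code in both columns
    team_encoder = LabelEncoder()
    team_encoder.fit(pd.concat([df['HomeTeam'], df['AwayTeam']]))
    df['HomeTeam'] = team_encoder.transform(df['HomeTeam'])
    df['AwayTeam'] = team_encoder.transform(df['AwayTeam'])

    # Encode the match result with its own encoder (alphabetical: A=0, D=1, H=2)
    df['FTR'] = LabelEncoder().fit_transform(df['FTR'])

    return df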

In [None]:
ita1 = label_encode(ita1)


In [None]:
# Standardize every column except identifiers, targets and date parts
# (the names listed in columns_to_exclude).


from sklearn.preprocessing import StandardScaler

def scale_dataframe(df, columns_to_exclude=('Date', 'HomeTeam', 'AwayTeam', 'FTR', 'total_goal', 'Year', 'Month', 'Day')):

    # Keep only the columns that are not excluded from scaling
    columns_to_scale = [col for col in df.columns if col not in columns_to_exclude]

    scaler = StandardScaler()

    # Scale a copy so the caller's DataFrame is left untouched
    df_scaled = df.copy()
    df_scaled[columns_to_scale] = scaler.fit_transform(df[columns_to_scale])

    return df_scaled
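
`scale_dataframe` fits the scaler on whatever frame it receives, so when it is called before the train/test split the test rows contribute to the mean and standard deviation. A minimal sketch of a split-aware variant follows; `scale_train_test` and the tiny frames are hypothetical and only illustrate the idea of fitting on the training rows and reusing those statistics on the test rows.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

def scale_train_test(train_df, test_df, columns_to_scale):
    """Fit StandardScaler on train only, then apply it to both frames."""
    scaler = StandardScaler()
    train_scaled = train_df.copy()
    test_scaled = test_df.copy()
    train_scaled[columns_to_scale] = scaler.fit_transform(train_df[columns_to_scale])
    test_scaled[columns_to_scale] = scaler.transform(test_df[columns_to_scale])
    return train_scaled, test_scaled, scaler

# Made-up odds values, used only to exercise the helper.
train = pd.DataFrame({"B365H": [1.5, 2.5, 3.5], "Year": [2020, 2021, 2021]})
test = pd.DataFrame({"B365H": [2.5], "Year": [2022]})

train_s, test_s, fitted_scaler = scale_train_test(train, test, ["B365H"])
print(test_s)
```

The test value 2.5 equals the training mean, so it is mapped to 0.0, while the `Year` column is passed through unchanged.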


In [None]:
ita1 = scale_dataframe(ita1)

In [None]:
ita1.head()

Unnamed: 0,HomeTeam,AwayTeam,FTR,B365H,B365D,B365A,HomeTeam_WinRate,AwayTeam_WinRate,HomeTeam_GoalsAvg,AwayTeam_GoalsAvg,...,total_goal,H_goal_ratio,A_goal_ratio,attack_strength_home_team,attack_strength_away_team,adjusted_win_lost_ratio_H,adjusted_win_lost_ratio_A,Year,Month,Day
0,0,14,2,0.007422,-0.70658,-0.594024,-0.0571,-0.355353,-0.182728,-0.522711,...,3.0,-0.122001,0.452391,-0.009147,-0.094845,0.000571,-0.000571,2015,10,28
1,8,18,1,-0.589648,-0.307719,0.099611,-0.316761,-0.844907,-0.222474,-1.239729,...,0.0,-0.157422,-0.64888,-0.168166,-1.059628,0.743978,-0.743978,2016,3,19
2,11,18,2,-0.539893,-0.529308,0.033424,0.514156,-0.844907,-0.165912,-1.239729,...,4.0,-0.130856,-0.64888,0.054275,-1.059628,0.000571,-0.000571,2016,1,17
3,9,2,2,-0.68916,-0.08613,0.761476,0.669953,-0.028984,-0.120051,-1.32664,...,2.0,-0.130856,-0.64888,0.181117,-1.059628,0.743978,-0.743978,2015,9,23
4,12,0,2,-0.714038,0.206368,0.761476,1.29314,-0.844907,-0.165912,-1.109362,...,1.0,-0.139711,-0.64888,0.054275,-0.92202,0.743978,-0.743978,2015,8,23


In [None]:
df = ita1.copy()

In [None]:
train_df = df[df['Year'] != 2022]
test_df = df[df['Year'] == 2022]

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report


X_train = train_df.drop(['FTR', 'total_goal'], axis=1)
y_train = train_df['FTR']
X_test = test_df.drop(['FTR', 'total_goal'], axis=1)
y_test = test_df['FTR']




In [None]:
X_train.shape, X_test.shape, y_train.shape, y_test.shape

((2415, 23), (188, 23), (2415,), (188,))

In [None]:
y_test

2234    0
2235    0
2236    2
2241    2
2243    1
       ..
2596    0
2599    2
2600    2
2602    0
2604    0
Name: FTR, Length: 188, dtype: int64

In [None]:
rf = RandomForestClassifier(n_estimators=100, random_state=42)


rf.fit(X_train, y_train)

# Predictions
y_pred = rf.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
report = classification_report(y_test, y_pred)

print("Accuracy:", accuracy)
print("Classification Report:\n", report)

Accuracy: 0.6808510638297872
Classification Report:
               precision    recall  f1-score   support

           0       0.76      0.68      0.72        66
           1       0.58      0.41      0.48        51
           2       0.67      0.87      0.76        71

    accuracy                           0.68       188
   macro avg       0.67      0.66      0.65       188
weighted avg       0.68      0.68      0.67       188
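
To put the 0.68 accuracy in context, it helps to compare it with a trivial predictor. The sketch below rebuilds the test labels from the class supports printed in the report above (66, 51 and 71) and scores a model that always predicts the most frequent class. Taking the majority class from the test split is a simplification; a stricter baseline would take it from `y_train`.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Class supports copied from the classification report above.
support = {0: 66, 1: 51, 2: 71}

# Rebuild a label vector with the same class counts as y_test.
y_true = np.repeat(list(support.keys()), list(support.values()))

# Always predict the most frequent class.
majority_class = max(support, key=support.get)
y_baseline = np.full_like(y_true, majority_class)

baseline_accuracy = accuracy_score(y_true, y_baseline)
print(f"Majority-class baseline accuracy: {baseline_accuracy:.3f}")
```

The baseline scores 71/188, roughly 0.378, so the random forest is well above the always-predict-one-class level on this split.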



In [None]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, classification_report

# Initialize the Decision Tree Classifier
dt = DecisionTreeClassifier(random_state=42)

# Fit the model to the training data
dt.fit(X_train, y_train)

# Predictions
y_pred = dt.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
report = classification_report(y_test, y_pred)

print("Accuracy:", accuracy)
print("Classification Report:\n", report)

Accuracy: 0.6808510638297872
Classification Report:
               precision    recall  f1-score   support

           0       0.69      0.67      0.68        66
           1       0.59      0.51      0.55        51
           2       0.72      0.82      0.77        71

    accuracy                           0.68       188
   macro avg       0.67      0.66      0.66       188
weighted avg       0.68      0.68      0.68       188



In [None]:
feature_importances = dt.feature_importances_
features = X_train.columns

importances = pd.DataFrame({'Feature': features, 'Importance': feature_importances})
importances = importances.sort_values(by='Importance', ascending=False)

print(importances)

                        Feature  Importance
19    adjusted_win_lost_ratio_A    0.470414
22                          Day    0.047313
21                        Month    0.045298
10  AwayTeam_goals_conceded_avg    0.042106
5              HomeTeam_WinRate    0.034849
14                 H_goal_ratio    0.030963
15                 A_goal_ratio    0.027428
4                         B365A    0.027374
9   HomeTeam_goals_conceded_avg    0.025929
17    attack_strength_away_team    0.024716
6              AwayTeam_WinRate    0.022080
8             AwayTeam_GoalsAvg    0.021817
0                      HomeTeam    0.021344
13               Broker_prob__A    0.020772
2                         B365H    0.019427
1                      AwayTeam    0.018953
16    attack_strength_home_team    0.018418
3                         B365D    0.018322
20                         Year    0.017831
7             HomeTeam_GoalsAvg    0.015276
11                Broker_prob_H    0.014713
12               Broker_prob__D 
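
The impurity-based `feature_importances_` shown above are computed on the training data and can favour high-cardinality columns. As a cross-check, scikit-learn's `permutation_importance` measures the drop in score when one column is shuffled on held-out rows. The sketch below runs on synthetic data from `make_classification`, not on the match data, and the feature names `f0`..`f4` are placeholders.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.inspection import permutation_importance
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the match features (assumption: 5 columns, 2 informative).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=5, n_informative=2, n_redundant=0, random_state=42
)
X_fit, X_val = X_demo[:200], X_demo[200:]
y_fit, y_val = y_demo[:200], y_demo[200:]

tree = DecisionTreeClassifier(max_depth=4, random_state=42).fit(X_fit, y_fit)

# Shuffle each column 10 times on the held-out rows and average the score drop.
result = permutation_importance(tree, X_val, y_val, n_repeats=10, random_state=42)

perm = pd.DataFrame({
    "Feature": [f"f{i}" for i in range(X_demo.shape[1])],
    "Importance": result.importances_mean,
}).sort_values(by="Importance", ascending=False)
print(perm)
```

On the notebook's data the same call would take `dt`, `X_test` and `y_test` in place of the synthetic objects.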

In [None]:
!pip install lazypredict

Collecting lazypredict
  Downloading lazypredict-0.2.12-py2.py3-none-any.whl (12 kB)
Installing collected packages: lazypredict
Successfully installed lazypredict-0.2.12


In [None]:
from lazypredict.Supervised import LazyClassifier
clf = LazyClassifier(verbose = 0, ignore_warnings = True, custom_metric = None)
models,pred = clf.fit(X_train, X_test, y_train, y_test)

 97%|█████████▋| 28/29 [00:20<00:00,  1.30it/s]

[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000871 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 1128
[LightGBM] [Info] Number of data points in the train set: 2415, number of used features: 23
[LightGBM] [Info] Start training from score -1.130161
[LightGBM] [Info] Start training from score -1.419554
[LightGBM] [Info] Start training from score -0.831957


100%|██████████| 29/29 [00:21<00:00,  1.33it/s]


In [None]:
models

Model,Accuracy,Balanced Accuracy,ROC AUC,F1 Score,Time Taken
DecisionTreeClassifier,0.68,0.66,,0.68,0.15
RandomForestClassifier,0.68,0.66,,0.67,1.4
BaggingClassifier,0.68,0.65,,0.67,0.48
LabelPropagation,0.68,0.65,,0.67,1.18
LabelSpreading,0.68,0.65,,0.67,1.69
ExtraTreesClassifier,0.68,0.65,,0.66,1.72
KNeighborsClassifier,0.66,0.64,,0.65,0.21
LGBMClassifier,0.66,0.64,,0.65,0.97
RidgeClassifierCV,0.68,0.63,,0.6,0.04
RidgeClassifier,0.68,0.63,,0.6,0.13


In [None]:
from sklearn.model_selection import GridSearchCV

# Define the parameter grid
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

# Create a GridSearchCV object
grid_search = GridSearchCV(estimator=rf, param_grid=param_grid, cv=5, n_jobs=-1, verbose=2)

# Fit the grid search to the data
grid_search.fit(X_train, y_train)

# Best parameters
best_params = grid_search.best_params_
print("Best Parameters:", best_params)

# You can use these best parameters to create a new RandomForest model
best_rf = RandomForestClassifier(**best_params)
best_rf.fit(X_train, y_train)

# Predict and evaluate with the optimized model
optimized_y_pred = best_rf.predict(X_test)
optimized_accuracy = accuracy_score(y_test, optimized_y_pred)
optimized_report = classification_report(y_test, optimized_y_pred)

print("Optimized Accuracy:", optimized_accuracy)
print("Optimized Classification Report:\n", optimized_report)

Fitting 5 folds for each of 108 candidates, totalling 540 fits
Best Parameters: {'max_depth': None, 'min_samples_leaf': 4, 'min_samples_split': 2, 'n_estimators': 100}
Optimized Accuracy: 0.675531914893617
Optimized Classification Report:
               precision    recall  f1-score   support

           0       0.80      0.65      0.72        66
           1       0.59      0.37      0.46        51
           2       0.64      0.92      0.75        71

    accuracy                           0.68       188
   macro avg       0.68      0.65      0.64       188
weighted avg       0.68      0.68      0.66       188
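
With the default `refit=True`, `GridSearchCV` already retrains the best configuration on the full training set and exposes it as `best_estimator_`, which also keeps the `random_state` of the estimator that was passed in. The sketch below shows this on synthetic data with a deliberately small grid; the data and grid values are assumptions chosen to keep the example fast.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for X_train / y_train.
X_demo, y_demo = make_classification(
    n_samples=200, n_features=6, n_informative=3, random_state=42
)

search = GridSearchCV(
    estimator=RandomForestClassifier(random_state=42),
    param_grid={"n_estimators": [10, 20], "min_samples_leaf": [1, 4]},
    cv=3,
)
search.fit(X_demo, y_demo)

# Already fitted on all of X_demo with the best parameters and the same seed.
best_model = search.best_estimator_
demo_pred = best_model.predict(X_demo[:5])
print(search.best_params_, demo_pred)
```

Building a fresh `RandomForestClassifier(**best_params)` instead drops the seed, so its score can differ slightly from run to run.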



In [None]:
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

param_grid = {
    'criterion': ['gini', 'entropy'],         # Function to measure the quality of a split
    'max_depth': [None, 10, 20, 30, 40, 50],  # Maximum number of levels in each decision tree
    'min_samples_split': [2, 5, 10],          # Minimum number of samples required to split a node
    'min_samples_leaf': [1, 2, 4],            # Minimum number of samples required at each leaf node
}


# Create GridSearchCV
grid_search = GridSearchCV(estimator=dt, param_grid=param_grid, cv=5, n_jobs=-1, verbose=2)
grid_search.fit(X_train, y_train)

# Best parameters
best_params = grid_search.best_params_
print("Best Parameters:", best_params)

# You can use these best parameters to create a new DecisionTree model
best_dt = DecisionTreeClassifier(**best_params)
best_dt.fit(X_train, y_train)

# Predict and evaluate with the optimized model
optimized_y_pred = best_dt.predict(X_test)
optimized_accuracy = accuracy_score(y_test, optimized_y_pred)
optimized_report = classification_report(y_test, optimized_y_pred)

print("Optimized Accuracy:", optimized_accuracy)
print("Optimized Classification Report:\n", optimized_report)

Fitting 5 folds for each of 108 candidates, totalling 540 fits
Best Parameters: {'criterion': 'gini', 'max_depth': 10, 'min_samples_leaf': 1, 'min_samples_split': 5}
Optimized Accuracy: 0.6808510638297872
Optimized Classification Report:
               precision    recall  f1-score   support

           0       0.71      0.70      0.70        66
           1       0.63      0.51      0.57        51
           2       0.68      0.79      0.73        71

    accuracy                           0.68       188
   macro avg       0.67      0.67      0.67       188
weighted avg       0.68      0.68      0.68       188



In [None]:
optimized_y_pred

array([2, 0, 2, 2, 0, 2, 0, 2, 2, 2, 2, 2, 2, 2, 0, 2, 0, 1, 0, 2, 2, 1,
       2, 0, 2, 1, 2, 2, 0, 1, 2, 2, 2, 0, 0, 2, 1, 0, 1, 2, 2, 2, 2, 2,
       2, 0, 2, 2, 0, 0, 2, 0, 2, 2, 2, 0, 2, 0, 0, 2, 2, 0, 2, 1, 2, 1,
       1, 0, 2, 0, 2, 2, 0, 2, 1, 2, 1, 2, 2, 2, 0, 0, 2, 0, 2, 2, 2, 2,
       2, 2, 2, 2, 0, 2, 2, 2, 2, 0, 2, 0, 0, 0, 2, 2, 0, 2, 2, 2, 1, 2,
       2, 2, 0, 0, 2, 0, 2, 2, 0, 2, 2, 0, 0, 2, 2, 0, 2, 2, 0, 2, 1, 0,
       1, 2, 0, 1, 0, 0, 1, 2, 1, 1, 0, 2, 2, 0, 0, 2, 2, 2, 0, 2, 1, 2,
       2, 2, 0, 2, 2, 2, 2, 0, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 0, 0, 2, 2,
       0, 0, 2, 1, 0, 2, 0, 0, 2, 2, 2, 0])

In [None]:

# Switch the target from the match result (FTR) to the total number of goals
X_train = train_df.drop(['FTR', 'total_goal'], axis=1)
y_train = train_df['total_goal']
X_test = test_df.drop(['FTR', 'total_goal'], axis=1)
y_test = test_df['total_goal']

In [None]:
from lazypredict.Supervised import LazyRegressor

reg = LazyRegressor(verbose=0, ignore_warnings=False, custom_metric=None)
models,pred = reg.fit(X_train, X_test, y_train, y_test)
models

 21%|██▏       | 9/42 [00:02<00:10,  3.15it/s]

GammaRegressor model failed to execute
Some value(s) of y are out of the valid range of the loss 'HalfGammaLoss'.


 74%|███████▍  | 31/42 [00:14<00:04,  2.37it/s]

QuantileRegressor model failed to execute
Solver interior-point is not anymore available in SciPy >= 1.11.0.


100%|██████████| 42/42 [00:18<00:00,  2.26it/s]

[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000662 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 1128
[LightGBM] [Info] Number of data points in the train set: 2417, number of used features: 23
[LightGBM] [Info] Start training from score 3.996276





Model,Adjusted R-Squared,R-Squared,RMSE,Time Taken
LinearSVR,-0.02,0.11,1.59,0.13
HuberRegressor,-0.03,0.09,1.6,0.23
SVR,-0.05,0.08,1.61,0.45
NuSVR,-0.05,0.08,1.61,0.36
GradientBoostingRegressor,-0.05,0.08,1.61,0.99
PoissonRegressor,-0.07,0.06,1.63,0.04
HistGradientBoostingRegressor,-0.1,0.03,1.65,2.26
LGBMRegressor,-0.11,0.03,1.66,0.14
BaggingRegressor,-0.11,0.03,1.66,0.27
RandomForestRegressor,-0.11,0.03,1.66,2.62
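
The table reports a positive R-Squared next to a negative Adjusted R-Squared. A short worked check makes that plausible, assuming the standard adjustment 1 - (1 - R²)(n - 1)/(n - p - 1) evaluated on the test split (n = 188 rows and p = 23 features, from the shapes printed earlier). Whether lazypredict uses exactly this formula is an assumption here, and `adjusted_r2` is a hypothetical helper.

```python
def adjusted_r2(r2: float, n_samples: int, n_features: int) -> float:
    """Standard adjusted R-squared: penalises R-squared for the feature count."""
    return 1 - (1 - r2) * (n_samples - 1) / (n_samples - n_features - 1)

# R-squared of the top row (LinearSVR), as rounded in the table above.
value = adjusted_r2(r2=0.11, n_samples=188, n_features=23)
print(round(value, 4))
```

With only 188 test rows and 23 features the penalty factor is 187/164, about 1.14, which is enough to push an R² near 0.11 slightly below zero. The small gap to the tabulated -0.02 is consistent with the R² shown being rounded to two decimals.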


In [None]:
ita1.to_csv("ita1.csv", index=False)

In [None]:
# Load every seasonal dataset of one league from the train folder

def load_seasonal_data(base_path, country, league, start_season, end_season):
    seasonal_data = {}

    for season_start_year in range(start_season, end_season + 1):

        start_year_suffix = (season_start_year - 1) % 100
        end_year_suffix = season_start_year % 100

        season_str = f"{start_year_suffix:02d}{end_year_suffix:02d}"

        file_path = f"{base_path}/{country}/{league}/{season_str}.csv"

        seasonal_data[f'{league}{season_str}'] = pd.read_csv(file_path)

    return seasonal_data


base_path = "/content/drive/MyDrive/train"
country = "italy"
league = "2"
seasonal_datasets = load_seasonal_data(base_path, country, league, 1, 22)
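
The season-code arithmetic inside `load_seasonal_data` can be checked on its own. In the sketch below, `season_code` is a hypothetical helper that repeats the two suffix computations from the function above, so the file names requested by the call with `1` and `22` can be listed without touching the drive.

```python
def season_code(season_end_year: int) -> str:
    """Build the 4-digit season string, e.g. 22 -> '2122' (season 2021/22)."""
    start_year_suffix = (season_end_year - 1) % 100
    end_year_suffix = season_end_year % 100
    return f"{start_year_suffix:02d}{end_year_suffix:02d}"

# Same range as load_seasonal_data(base_path, country, league, 1, 22).
codes = [season_code(year) for year in range(1, 22 + 1)]
print(codes[0], codes[-1], len(codes))
```

The loop variable named `season_start_year` in the original function is therefore the year in which the season ends: `1` maps to `0001.csv` and `22` to `2122.csv`, for 22 files in total.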

In [None]:
ita2.to_csv("ita2.csv", index=False)