<a href="https://colab.research.google.com/github/khiemtranngoc/GoalNetAI-Multi-League-Football-Predictions/blob/main/germany.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Methodology

Because our goal is to predict football match results for the 2023 season, we should not use features that only become available after a match has ended, such as match statistics and goal counts. These features do not exist yet for matches that have not been played.

To predict football matches before they happen, we must build prediction models from data that is available before each match starts. However, the data we have describes the end of each match, such as the number of goals and shots per team. It cannot be used directly to train prediction models, so we transform it into pre-match features derived from historical data.

* In the test set (season 2023) we do not have information such as FTHG, FTAG, ...

### Features Not Suitable for Pre-Match Prediction:
* Goals and Results (FTHG, FTAG, FTR, HTHG, HTAG, HTR): These are outcomes of the match, not available before it starts.

* In-Match Statistics (HS, AS, HST, AST, HHW, AHW, HC, AC, HF, AF, HFKC, AFKC, HO, AO, HY, AY, HR, AR): These are also outcomes or events that occur during the match.
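As a minimal sketch of how a post-match column can be turned into a pre-match feature, the example below computes each home team's average home goals over its earlier home matches only. It assumes a hypothetical toy frame `matches` whose rows are already in chronological order; the helper `add_prematch_home_goals_avg` and the column `HomeGoalsAvg_before` are illustrative names, not part of this notebook. This is a stricter construction than the season-level aggregates used later here.

```python
import pandas as pd

def add_prematch_home_goals_avg(matches: pd.DataFrame) -> pd.DataFrame:
    """Add the home team's mean home goals over its previous home matches."""
    out = matches.copy()
    # shift(1) hides the current match, so each row only sees earlier results
    out["HomeGoalsAvg_before"] = (
        out.groupby("HomeTeam")["FTHG"]
        .transform(lambda s: s.shift(1).expanding().mean())
    )
    return out

# Hypothetical toy data, rows in chronological order
matches = pd.DataFrame({
    "HomeTeam": ["Dortmund", "Hertha", "Dortmund", "Dortmund"],
    "FTHG": [3, 2, 1, 4],
})
prematch = add_prematch_home_goals_avg(matches)
print(prematch)
```

A team's first home match has no history, so its value is NaN and would need a default or an imputation step.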

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
from google.colab import drive
drive.mount('/content/drive/')

Mounted at /content/drive/


In [None]:
# Load every seasonal dataset from the training folder into a dict keyed by '<league><season>', e.g. '11516'

def load_seasonal_data(base_path, country, league, start_season, end_season):
    seasonal_data = {}

    for season_start_year in range(start_season, end_season + 1):

        start_year_suffix = (season_start_year - 1) % 100
        end_year_suffix = season_start_year % 100

        season_str = f"{start_year_suffix:02d}{end_year_suffix:02d}"

        file_path = f"{base_path}/{country}/{league}/{season_str}.csv"

        seasonal_data[f'{league}{season_str}'] = pd.read_csv(file_path)

    return seasonal_data


base_path = "/content/drive/MyDrive/train"
country = "germany"
league = "1"
seasonal_datasets = load_seasonal_data(base_path, country, league, 1, 22)

# Example: access the data for the 2001/2002 season
# ger10102 = seasonal_datasets['10102']


In [None]:

ger11516 = seasonal_datasets['11516']
ger11617 = seasonal_datasets['11617']
ger11718 = seasonal_datasets['11718']
ger11819 = seasonal_datasets['11819']
ger11920 = seasonal_datasets['11920']
ger12021 = seasonal_datasets['12021']
ger12122 = seasonal_datasets['12122']

In [None]:
columns = ['Date', 'HomeTeam', 'AwayTeam', 'FTHG', 'FTAG', 'FTR', 'HS', 'AS',
           'HST', 'AST', 'HC', 'AC',
           'B365H', 'B365D', 'B365A']

Because we aim to predict 2023 match results, more recent data is likely to better reflect the current state of the teams and the league. Focusing on recent seasons may also reduce the risk of overfitting to historical trends that are no longer relevant. For these reasons, we use the datasets from season 2015/16 to season 2021/22.

In [None]:
df2016 = ger11516[columns]
df2017 = ger11617[columns]
df2018 = ger11718[columns]
df2019 = ger11819[columns]
df2020 = ger11920[columns]
df2021 = ger12021[columns]
df2022 = ger12122[columns]

In [None]:
# Return a summary of which columns of a DataFrame contain missing values, and how many

def missing_values_summary(df):

    missing_counts = df.isnull().sum()

    missing_counts = missing_counts[missing_counts > 0]

    summary_df = pd.DataFrame(missing_counts, columns=['Missing Values Count'])
    summary_df.index.name = 'Column'

    return summary_df


In [None]:
summary = missing_values_summary(df2016)
print(summary)

Empty DataFrame
Columns: [Missing Values Count]
Index: []


In [None]:
summary = missing_values_summary(df2017)
print(summary)

        Missing Values Count
Column                      
FTR                        9


In [None]:
summary = missing_values_summary(df2018)
print(summary)

Empty DataFrame
Columns: [Missing Values Count]
Index: []


In [None]:
summary = missing_values_summary(df2019)
print(summary)

Empty DataFrame
Columns: [Missing Values Count]
Index: []


In [None]:
summary = missing_values_summary(df2020)
print(summary)

Empty DataFrame
Columns: [Missing Values Count]
Index: []


In [None]:
summary = missing_values_summary(df2021)
print(summary)

        Missing Values Count
Column                      
FTAG                       7


In [None]:
summary = missing_values_summary(df2022)
print(summary)

Empty DataFrame
Columns: [Missing Values Count]
Index: []


In [None]:
# Function to display rows with missing values from a DataFrame.

def show_rows_with_missing_values(df):

    rows_with_missing_values = df[df.isnull().any(axis=1)]

    return rows_with_missing_values




In [None]:
print(show_rows_with_missing_values(df2017))

       Date    HomeTeam       AwayTeam  FTHG  FTAG  FTR  HS  AS  HST  AST  HC  \
0  20/05/17      Hertha     Leverkusen     2    -1  NaN  14  16    5   11   3   
1  02/10/16  Schalke 04     M'gladbach     4    -1  NaN  15  12    8    4   3   
2  29/10/16   Darmstadt     RB Leipzig     0     2  NaN   4  11    0    4   2   
3  21/09/16  RB Leipzig     M'gladbach     1     1  NaN   9   3    5    1   5   
4  17/09/16  Hoffenheim      Wolfsburg     0     0  NaN  20  20    5    9   3   
5  08/04/17     FC Koln     M'gladbach     2     3  NaN   9  16    2    7   5   
6  23/09/16    Dortmund       Freiburg     3     1  NaN  24   9    5    5  15   
7  18/02/17    Dortmund      Wolfsburg     3     0  NaN  24   6   10    3   9   
8  09/09/16  Schalke 04  Bayern Munich     0     2  NaN   9  18    2    2   2   

   AC  B365H  B365D  B365A  
0   6   2.05    3.4   3.80  
1   5   2.20    3.4   3.30  
2   6   6.00    3.8   1.62  
3   2   2.38    3.5   3.00  
4  11   3.10    3.6   2.25  
5   6   2.63   

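Only 9 values are missing, all in the FTR column, and each can be determined directly from the full-time score. The negative FTAG entries in rows 0 and 1 are treated as sign errors (they are converted to absolute values later), so both matches count as home wins.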

In [None]:
df2017.loc[0, 'FTR'] = 'H'
df2017.loc[1, 'FTR'] = 'H'
df2017.loc[2, 'FTR'] = 'A'
df2017.loc[3, 'FTR'] = 'D'
df2017.loc[4, 'FTR'] = 'D'
df2017.loc[5, 'FTR'] = 'A'
df2017.loc[6, 'FTR'] = 'H'
df2017.loc[7, 'FTR'] = 'H'
df2017.loc[8, 'FTR'] = 'A'
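The manual assignments above could also be derived from the score columns. The sketch below is one possible way to do that, not the notebook's method: the helper name `fill_ftr_from_goals` and the toy frame `sample` are hypothetical, and taking absolute values assumes that negative goal counts are sign errors, in line with the later `transform_goals_to_absolute` step.

```python
import numpy as np
import pandas as pd

def fill_ftr_from_goals(df: pd.DataFrame) -> pd.DataFrame:
    """Fill missing FTR values from FTHG and FTAG; existing FTR values are kept."""
    out = df.copy()
    # Assumption: negative goal counts are sign errors
    home = out["FTHG"].abs()
    away = out["FTAG"].abs()
    derived = np.select([home > away, home < away], ["H", "A"], default="D")
    out["FTR"] = out["FTR"].fillna(pd.Series(derived, index=out.index))
    return out

# Hypothetical rows shaped like the ones with missing FTR
sample = pd.DataFrame({
    "FTHG": [2, 0, 1, 3],
    "FTAG": [-1, 2, 1, 0],
    "FTR": [None, None, None, "H"],
})
filled = fill_ftr_from_goals(sample)
print(filled["FTR"].tolist())
```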

In [None]:
show_rows_with_missing_values(df2021)

Unnamed: 0,Date,HomeTeam,AwayTeam,FTHG,FTAG,FTR,HS,AS,HST,AST,HC,AC,B365H,B365D,B365A
2,28/11/2020,Dortmund,FC Koln,1,,A,14,7,4,5,6,2,1.2,7.0,13.0
3,17/10/2020,M'gladbach,Wolfsburg,1,,D,13,14,2,5,5,4,1.85,3.8,4.0
4,03/10/2020,Werder Bremen,Bielefeld,1,,H,7,10,2,5,2,1,2.15,3.5,3.3
5,10/04/2021,Ein Frankfurt,Wolfsburg,4,,H,9,20,5,7,3,7,2.37,3.5,2.9
6,23/10/2020,Stuttgart,FC Koln,1,,D,9,14,4,4,3,5,1.85,3.8,3.8
7,13/02/2021,Leverkusen,Mainz,2,,D,11,19,5,6,2,2,1.4,5.0,6.5
8,25/09/2020,Hertha,Ein Frankfurt,1,,A,12,10,6,3,4,3,2.25,3.75,2.9


In [None]:


def impute_goals_based_on_result(df):
    """
    Impute missing values in 'FTHG' and 'FTAG' based on 'FTR', following specific rules.

    Parameters:
    df (pd.DataFrame): The dataset containing the columns 'FTHG', 'FTAG', and 'FTR'.

    Returns:
    pd.DataFrame: The DataFrame with imputed values.
    """
    for index, row in df.iterrows():
        if row['FTR'] == 'H':
            # Home team wins
            if pd.isna(row['FTHG']) and not pd.isna(row['FTAG']):
                df.at[index, 'FTHG'] = row['FTAG'] + 1
            elif not pd.isna(row['FTHG']) and pd.isna(row['FTAG']):
                df.at[index, 'FTAG'] = row['FTHG'] - 1 if row['FTHG'] > 0 else 0

        elif row['FTR'] == 'A':
            # Away team wins
            if pd.isna(row['FTAG']) and not pd.isna(row['FTHG']):
                df.at[index, 'FTAG'] = row['FTHG'] + 1
            elif not pd.isna(row['FTAG']) and pd.isna(row['FTHG']):
                df.at[index, 'FTHG'] = row['FTAG'] - 1 if row['FTAG'] > 0 else 0

        elif row['FTR'] == 'D':
            # Draw
            if pd.isna(row['FTHG']) and not pd.isna(row['FTAG']):
                df.at[index, 'FTHG'] = row['FTAG']
            elif pd.isna(row['FTAG']) and not pd.isna(row['FTHG']):
                df.at[index, 'FTAG'] = row['FTHG']

    return df

In [None]:
df2021 = impute_goals_based_on_result(df2021)

In [None]:
def transform_goals_to_absolute(df):
    """
    Function to transform values in 'FTHG' (Full Time Home Team Goals) and
    'FTAG' (Full Time Away Team Goals) columns to their absolute values.

    Args:
    df (pd.DataFrame): DataFrame containing the match data.

    Returns:
    pd.DataFrame: Updated DataFrame with absolute values in the specified columns.
    """
    # Convert to absolute values
    df['FTHG'] = df['FTHG'].abs()
    df['FTAG'] = df['FTAG'].abs()

    return df

In [None]:
df2016 = transform_goals_to_absolute(df2016)
df2017 = transform_goals_to_absolute(df2017)
df2018 = transform_goals_to_absolute(df2018)
df2019 = transform_goals_to_absolute(df2019)
df2020 = transform_goals_to_absolute(df2020)
df2021 = transform_goals_to_absolute(df2021)
df2022 = transform_goals_to_absolute(df2022)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['FTHG'] = df['FTHG'].abs()
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['FTAG'] = df['FTAG'].abs()
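The message above is the text of pandas' SettingWithCopyWarning. It most likely appears because frames such as `df2016` were created by selecting columns from a larger frame and were then modified in place. A minimal sketch of the usual remedy, taking an explicit `.copy()` at selection time, is shown below on a hypothetical toy frame `raw` (the `Referee` column is invented for illustration).

```python
import pandas as pd

# Hypothetical raw frame with one column that is not selected
raw = pd.DataFrame({
    "HomeTeam": ["Hertha", "Schalke 04"],
    "FTHG": [2, 4],
    "FTAG": [-1, -1],
    "Referee": ["a", "b"],
})

# An explicit copy makes the subset an independent DataFrame, so later
# column assignments do not trigger the copy-of-a-slice warning.
subset = raw[["HomeTeam", "FTHG", "FTAG"]].copy()
subset["FTAG"] = subset["FTAG"].abs()

print(subset["FTAG"].tolist())
print(raw["FTAG"].tolist())  # the original frame is left unchanged
```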


In [None]:
def find_and_print_outlier_rows(df):
    """
    Identifies and prints rows containing outliers for all numerical columns in the DataFrame.

    Parameters:
    df (pd.DataFrame): The dataset.
    """
    for column in df.select_dtypes(include=['number']).columns:
        Q1 = df[column].quantile(0.25)
        Q3 = df[column].quantile(0.75)
        IQR = Q3 - Q1
        lower_bound = Q1 - 3 * IQR
        upper_bound = Q3 + 3 * IQR

        outliers = df[(df[column] < lower_bound) | (df[column] > upper_bound)]

        if not outliers.empty:
            print(f"Rows with outliers in column '{column}':")
            print(outliers)
            print("\n")

In [None]:
find_and_print_outlier_rows(df2016)

Rows with outliers in column 'FTHG':
         Date       HomeTeam       AwayTeam  FTHG  FTAG FTR  HS  AS  HST  AST  \
53   21/11/15      Wolfsburg  Werder Bremen     6     0   H  21   9   12    2   
102  02/05/16  Werder Bremen      Stuttgart     6     2   H  16  13    9    4   
223  12/09/15  Ein Frankfurt        FC Koln     6     2   H  18  13   11    7   

     HC  AC  B365H  B365D  B365A  
53    7   3   1.57    4.0   5.75  
102   3   5   2.05    3.6   3.50  
223   5   2   2.30    3.3   3.20  


Rows with outliers in column 'B365H':
         Date       HomeTeam       AwayTeam  FTHG  FTAG FTR  HS  AS  HST  AST  \
10   19/03/16        FC Koln  Bayern Munich     0     1   A   6   6    2    3   
26   14/02/16       Augsburg  Bayern Munich     1     3   A   7  24    2   10   
55   06/02/16         Hertha       Dortmund     0     0   D  10   9    4    2   
60   22/01/16        Hamburg  Bayern Munich     1     2   A   7  20    1    5   
64   05/12/15     M'gladbach  Bayern Munich     3    

In [None]:
def filter_goals_under_30(df):
    """
    Filter the DataFrame to select rows where both 'FTHG' and 'FTAG' are smaller than 30,
    including rows where 'FTHG' or 'FTAG' might be NA.

    Parameters:
    df (pandas.DataFrame): The input DataFrame with football data.

    Returns:
    pandas.DataFrame: The filtered DataFrame.
    """
    filtered_df = df[((df['FTHG'] < 30) & (df['FTAG'] < 30)) | df['FTHG'].isna() | df['FTAG'].isna()]
    return filtered_df

In [None]:
df2016 = filter_goals_under_30(df2016)
df2017 = filter_goals_under_30(df2017)
df2018 = filter_goals_under_30(df2018)
df2019 = filter_goals_under_30(df2019)
df2020 = filter_goals_under_30(df2020)
df2021 = filter_goals_under_30(df2021)
df2022 = filter_goals_under_30(df2022)

In [None]:
# Function to selectively impute missing values in a DataFrame using KNNImputer.
# The imputation is applied only to columns with missing values, and results are rounded to integers.


from sklearn.impute import KNNImputer

def impute_missing_values_knn(df, n_neighbors=5):
    cols_with_missing = df.columns[df.isnull().any()]
    numeric_cols_with_missing = df[cols_with_missing].select_dtypes(include=[np.number]).columns

    imputer = KNNImputer(n_neighbors=n_neighbors)

    df_numeric_imputed = df.copy()
    if len(numeric_cols_with_missing) > 0:
        imputed_data = imputer.fit_transform(df[numeric_cols_with_missing])
        df_imputed = pd.DataFrame(imputed_data, columns=numeric_cols_with_missing, index=df.index)

        for col in numeric_cols_with_missing:
            df_numeric_imputed[col] = df_numeric_imputed[col].fillna(np.round(df_imputed[col]))

    return df_numeric_imputed




Function to preprocess football data and create new features:

* Home and away team win rates
* Home and away team goal averages (scored and conceded)
* Implied winning probabilities from the bookmaker's betting odds (Bet365)
* Goal ratio for shots on target (total goals / total shots on target)
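To illustrate the odds-based feature, the sketch below converts decimal odds into implied probabilities as 1 / odds, which is what the preprocessing function does. The normalization step that follows is an addition for illustration only and is not used in this notebook: raw implied probabilities sum to more than 1 because of the bookmaker's margin, and dividing by that sum rescales them to add up to 1. The odds values are a single example row.

```python
import pandas as pd

# One example row of decimal odds (home win, draw, away win)
odds = pd.DataFrame({"B365H": [1.5], "B365D": [4.33], "B365A": [6.5]})

# Raw implied probabilities, as in the preprocessing function
raw_prob = 1 / odds[["B365H", "B365D", "B365A"]]

# Optional extra step (an assumption, not in the notebook): remove the
# bookmaker margin so the three probabilities sum to 1
overround = raw_prob.sum(axis=1)
normalized_prob = raw_prob.div(overround, axis=0)

print(raw_prob.round(2))
print(normalized_prob.round(2))
```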

In [None]:
def preprocess_football_data(df):


    # Calculating win rates and average goals
    home_win_rate = df.groupby('HomeTeam')['FTR'].apply(lambda x: round((x == 'H').mean(), 2)).to_dict()
    away_win_rate = df.groupby('AwayTeam')['FTR'].apply(lambda x: round((x == 'A').mean(), 2)).to_dict()
    home_goals_avg = df.groupby('HomeTeam')['FTHG'].mean().apply(lambda x: round(x, 2)).to_dict()
    away_goals_avg = df.groupby('AwayTeam')['FTAG'].mean().apply(lambda x: round(x, 2)).to_dict()
    home_goals_conceded_avg = df.groupby('HomeTeam')['FTAG'].mean().apply(lambda x: round(x, 2)).to_dict()
    away_goals_conceded_avg = df.groupby('AwayTeam')['FTHG'].mean().apply(lambda x: round(x, 2)).to_dict()
    goal_ratio_H = df.groupby('HomeTeam').apply(lambda x: round(x['FTHG'].sum() / x['HST'].sum(),2) if x['HST'].sum() > 0 else 0)
    goal_ratio_A = df.groupby('AwayTeam').apply(lambda x: round(x['FTAG'].sum() / x['AST'].sum(),2) if x['AST'].sum() > 0 else 0)


    # Mapping the win rates and average goals to the main DataFrame
    df['HomeTeam_WinRate'] = df['HomeTeam'].map(home_win_rate)
    df['AwayTeam_WinRate'] = df['AwayTeam'].map(away_win_rate)
    df['HomeTeam_GoalsAvg'] = df['HomeTeam'].map(home_goals_avg)
    df['AwayTeam_GoalsAvg'] = df['AwayTeam'].map(away_goals_avg)
    df['HomeTeam_goals_conceded_avg'] = df['HomeTeam'].map(home_goals_conceded_avg)
    df['AwayTeam_goals_conceded_avg'] = df['AwayTeam'].map(away_goals_conceded_avg)


    # Calculating implied probabilities from betting odds
    df['Broker_prob_H'] = round(1 / df['B365H'], 2)
    df['Broker_prob_D'] = round(1 / df['B365D'], 2)
    df['Broker_prob_A'] = round(1 / df['B365A'], 2)

    # Calculate the total goals for each match
    df['total_goal'] = df['FTHG'] + df['FTAG']


    # Map the conversion rates back to the original DataFrame
    df['H_goal_ratio'] = df['HomeTeam'].map(goal_ratio_H)
    df['A_goal_ratio'] = df['AwayTeam'].map(goal_ratio_A)

    clean_df = df[df['HomeTeam'] != df['AwayTeam']]




    return clean_df

In [None]:
def add_adjusted_win_loss_ratio(df):
    def adjusted_win_loss_ratio(wins, draws, losses, total_matches):
        return ((3*wins + draws) - losses) / total_matches if total_matches > 0 else 0

    # Initialize a dictionary to track head-to-head stats
    head_to_head_stats = {}

    # Update head-to-head stats
    for index, row in df.iterrows():
        teams = tuple(sorted([row['HomeTeam'], row['AwayTeam']]))
        if teams not in head_to_head_stats:
            head_to_head_stats[teams] = {'wins': {teams[0]: 0, teams[1]: 0},
                                         'draws': 0,
                                         'total_matches': 0}

        head_to_head_stats[teams]['total_matches'] += 1
        if row['FTR'] == 'H':
            head_to_head_stats[teams]['wins'][row['HomeTeam']] += 1
        elif row['FTR'] == 'D':
            head_to_head_stats[teams]['draws'] += 1
        elif row['FTR'] == 'A':
            head_to_head_stats[teams]['wins'][row['AwayTeam']] += 1

    # Calculate and add the adjusted win-loss ratio to the DataFrame
    def calculate_ratio_for_match(row):
        teams = tuple(sorted([row['HomeTeam'], row['AwayTeam']]))
        stats = head_to_head_stats[teams]
        home_wins = stats['wins'][row['HomeTeam']]
        away_wins = stats['wins'][row['AwayTeam']]
        draws = stats['draws']
        total_matches = stats['total_matches']
        home_ratio = adjusted_win_loss_ratio(home_wins, draws, total_matches - home_wins - draws, total_matches)
        away_ratio = adjusted_win_loss_ratio(away_wins, draws, total_matches - away_wins - draws, total_matches)
        return pd.Series([home_ratio, away_ratio])

    df[['adjusted_win_lost_ratio_H', 'adjusted_win_lost_ratio_A']] = df.apply(calculate_ratio_for_match, axis=1)

    return df
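The adjusted ratio used above is (3 * wins + draws - losses) / total matches, so it ranges from -1 (every meeting lost) to 3 (every meeting won). The standalone sketch below repeats that arithmetic for a few records; the function name `adjusted_ratio` is only for illustration.

```python
def adjusted_ratio(wins: int, draws: int, losses: int) -> float:
    """(3 * wins + draws - losses) / total matches, or 0 if there are no matches."""
    total = wins + draws + losses
    if total == 0:
        return 0.0
    return (3 * wins + draws - losses) / total

# Two meetings in a season, seen from each side
print(adjusted_ratio(2, 0, 0))  # won both
print(adjusted_ratio(0, 0, 2))  # lost both
print(adjusted_ratio(1, 1, 0))  # one win, one draw
print(adjusted_ratio(0, 1, 1))  # one draw, one loss
```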

I developed the features 'attack_strength_home_team' and 'attack_strength_away_team' for every team in the league. They compare the goals a team scores with the league's average goals per match, giving a consistent gauge of attacking strength.

In [None]:
def calculate_attack_strength(df):
    # Calculate total goals for each team
    total_home_goals = df.groupby('HomeTeam')['FTHG'].sum()
    total_away_goals = df.groupby('AwayTeam')['FTAG'].sum()

    # Calculate league averages for home and away goals
    average_home_goals = df['FTHG'].mean()
    average_away_goals = df['FTAG'].mean()

    # Calculate attack strength
    df['attack_strength_home_team'] = df['HomeTeam'].apply(lambda x: round(total_home_goals[x] / average_home_goals,2))
    df['attack_strength_away_team'] = df['AwayTeam'].apply(lambda x: round(total_away_goals[x] / average_away_goals,2))

    return df
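The function above divides a team's total goals by the league's per-match average, so its values grow with the number of matches played. As a point of comparison, the sketch below shows a per-match variant in which both sides of the ratio are averages, so that a value of 1 means league-average scoring. This is an alternative formulation under that assumption, not the one used in this notebook, and `attack_strength_per_match` is a hypothetical name.

```python
import pandas as pd

def attack_strength_per_match(df: pd.DataFrame) -> pd.Series:
    """Each team's mean home goals divided by the league's mean home goals."""
    team_avg = df.groupby("HomeTeam")["FTHG"].mean()
    league_avg = df["FTHG"].mean()
    return (team_avg / league_avg).round(2)

# Hypothetical toy league with two teams
toy = pd.DataFrame({
    "HomeTeam": ["A", "A", "B", "B"],
    "FTHG": [3, 1, 1, 1],
})
strength = attack_strength_per_match(toy)
print(strength)
```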




In [None]:
df2016 =  preprocess_football_data(df2016)
df2017 =  preprocess_football_data(df2017)
df2018 =  preprocess_football_data(df2018)
df2019 =  preprocess_football_data(df2019)
df2020 =  preprocess_football_data(df2020)
df2021 =  preprocess_football_data(df2021)
df2022 =  preprocess_football_data(df2022)

In [None]:
df2016 = calculate_attack_strength(df2016)
df2017 = calculate_attack_strength(df2017)
df2018 = calculate_attack_strength(df2018)
df2019 = calculate_attack_strength(df2019)
df2020 = calculate_attack_strength(df2020)
df2021 = calculate_attack_strength(df2021)
df2022 = calculate_attack_strength(df2022)

In [None]:
df2016 =  add_adjusted_win_loss_ratio(df2016)
df2017 =  add_adjusted_win_loss_ratio(df2017)
df2018 =  add_adjusted_win_loss_ratio(df2018)
df2019 =  add_adjusted_win_loss_ratio(df2019)
df2020 =  add_adjusted_win_loss_ratio(df2020)
df2021 =  add_adjusted_win_loss_ratio(df2021)
df2022 =  add_adjusted_win_loss_ratio(df2022)

In [None]:

def process_time_data(df, target_year):
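    # Convert 'Date' column to datetime; the source files use day-first dates (e.g. 20/05/17)
    df['Date'] = pd.to_datetime(df['Date'], dayfirst=True)

    # Extract 'Day', 'Month', and 'Year' from 'Date'
    df['Day'] = df['Date'].dt.day
    df['Month'] = df['Date'].dt.month
    df['Year'] = df['Date'].dt.year

    # Label every row with the season's target year, since matches played
    # before the winter break fall in the previous calendar year
    df['Year'] = df['Year'].apply(lambda x: target_year if x != target_year else x)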

    # Drop 'Day' and 'Month' columns
    df.drop(['Day', 'Month', 'Date'], axis=1, inplace=True)

    return df

In [None]:
df2016 = process_time_data(df2016, 2016)
df2017 = process_time_data(df2017, 2017)
df2018 = process_time_data(df2018, 2018)
df2019 = process_time_data(df2019, 2019)
df2020 = process_time_data(df2020, 2020)
df2021 = process_time_data(df2021, 2021)
df2022 = process_time_data(df2022, 2022)



In [None]:
columns_to_drop = ['FTHG', 'FTAG', 'HTR', 'HC', 'AC', 'HST', 'AST', 'HS', 'AS']



In [None]:
ger1 = pd.concat([df2016, df2017, df2018, df2019, df2020, df2021, df2022], ignore_index=True)

In [None]:
ger1 = ger1.drop(columns=columns_to_drop, errors='ignore')

In [None]:
ger1.head()

Unnamed: 0,HomeTeam,AwayTeam,FTR,B365H,B365D,B365A,HomeTeam_WinRate,AwayTeam_WinRate,HomeTeam_GoalsAvg,AwayTeam_GoalsAvg,...,Broker_prob_D,Broker_prob_A,total_goal,H_goal_ratio,A_goal_ratio,attack_strength_home_team,attack_strength_away_team,adjusted_win_lost_ratio_H,adjusted_win_lost_ratio_A,Year
0,Leverkusen,Ingolstadt,H,1.5,4.33,6.5,0.59,0.18,1.82,0.65,...,0.23,0.15,5.0,0.33,0.16,19.79,8.71,3.0,-1.0,2016
1,M'gladbach,Augsburg,H,2.15,3.4,3.4,0.76,0.35,2.47,1.41,...,0.29,0.29,6.0,0.37,0.35,26.81,19.0,2.0,0.0,2016
2,Ein Frankfurt,M'gladbach,A,2.45,3.4,2.88,0.35,0.25,1.29,1.44,...,0.29,0.35,6.0,0.32,0.32,14.04,18.21,-1.0,3.0,2016
3,Leverkusen,Darmstadt,A,1.25,6.0,13.0,0.59,0.41,1.82,1.35,...,0.17,0.08,1.0,0.33,0.37,19.79,18.21,1.0,1.0,2016
4,Mainz,Hannover,H,2.0,3.5,3.8,0.47,0.18,1.35,0.94,...,0.29,0.26,3.0,0.29,0.26,14.68,12.66,3.0,-1.0,2016


In [None]:
ger1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2096 entries, 0 to 2095
Data columns (total 23 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   HomeTeam                     2096 non-null   object 
 1   AwayTeam                     2096 non-null   object 
 2   FTR                          2096 non-null   object 
 3   B365H                        2096 non-null   float64
 4   B365D                        2096 non-null   float64
 5   B365A                        2096 non-null   float64
 6   HomeTeam_WinRate             2096 non-null   float64
 7   AwayTeam_WinRate             2096 non-null   float64
 8   HomeTeam_GoalsAvg            2096 non-null   float64
 9   AwayTeam_GoalsAvg            2096 non-null   float64
 10  HomeTeam_goals_conceded_avg  2096 non-null   float64
 11  AwayTeam_goals_conceded_avg  2096 non-null   float64
 12  Broker_prob_H                2096 non-null   float64
 13  Broker_prob_D     

Load the German second tier (2. Bundesliga)

In [None]:
base_path = "/content/drive/MyDrive/train"
country = "germany"
league = "2"
seasonal_datasets = load_seasonal_data(base_path, country, league, 1, 22)

In [None]:

ger21718 = seasonal_datasets['21718']
ger21819 = seasonal_datasets['21819']
ger21920 = seasonal_datasets['21920']
ger22021 = seasonal_datasets['22021']
ger22122 = seasonal_datasets['22122']

In [None]:

df20182 = ger21718[columns]
df20192 = ger21819[columns]
df20202 = ger21920[columns]
df20212 = ger22021[columns]
df20222 = ger22122[columns]

In [None]:
summary = missing_values_summary(df20182)
print(summary)

Empty DataFrame
Columns: [Missing Values Count]
Index: []


In [None]:
summary = missing_values_summary(df20192)
print(summary)

Empty DataFrame
Columns: [Missing Values Count]
Index: []


In [None]:
summary = missing_values_summary(df20202)
print(summary)

Empty DataFrame
Columns: [Missing Values Count]
Index: []


In [None]:
summary = missing_values_summary(df20212)
print(summary)

        Missing Values Count
Column                      
HC                         9
B365H                      2
B365D                      2
B365A                      2


In [None]:
summary = missing_values_summary(df20222)
print(summary)

Empty DataFrame
Columns: [Missing Values Count]
Index: []


In [None]:
df20182 = transform_goals_to_absolute(df20182)
df20192 = transform_goals_to_absolute(df20192)
df20202 = transform_goals_to_absolute(df20202)
df20212 = transform_goals_to_absolute(df20212)
df20222 = transform_goals_to_absolute(df20222)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['FTHG'] = df['FTHG'].abs()
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['FTAG'] = df['FTAG'].abs()


In [None]:
df20182 = filter_goals_under_30(df20182)
df20192 = filter_goals_under_30(df20192)
df20202 = filter_goals_under_30(df20202)
df20212 = filter_goals_under_30(df20212)
df20222 = filter_goals_under_30(df20222)

In [None]:
df20182 = process_time_data(df20182, 2018)
df20192 = process_time_data(df20192, 2019)
df20202 = process_time_data(df20202, 2020)
df20212 = process_time_data(df20212, 2021)
df20222 = process_time_data(df20222, 2022)



In [None]:
df20212 = impute_missing_values_knn(df20212)

In [None]:
df20182 =  preprocess_football_data(df20182)
df20192 =  preprocess_football_data(df20192)
df20202 =  preprocess_football_data(df20202)
df20212 =  preprocess_football_data(df20212)
df20222 =  preprocess_football_data(df20222)

In [None]:
df20182 = calculate_attack_strength(df20182)
df20192 = calculate_attack_strength(df20192)
df20202 = calculate_attack_strength(df20202)
df20212 = calculate_attack_strength(df20212)
df20222 = calculate_attack_strength(df20222)

In [None]:
df20182 =  add_adjusted_win_loss_ratio(df20182)
df20192 =  add_adjusted_win_loss_ratio(df20192)
df20202 =  add_adjusted_win_loss_ratio(df20202)
df20212 =  add_adjusted_win_loss_ratio(df20212)
df20222 =  add_adjusted_win_loss_ratio(df20222)

In [None]:
ger2 = pd.concat([df20182, df20192, df20202, df20212, df20222], ignore_index=True)

In [None]:
ger2 = ger2.drop(columns=columns_to_drop, errors='ignore')

In [None]:
ger2.head()

Unnamed: 0,HomeTeam,AwayTeam,FTR,B365H,B365D,B365A,Year,HomeTeam_WinRate,AwayTeam_WinRate,HomeTeam_GoalsAvg,...,Broker_prob_H,Broker_prob_D,Broker_prob_A,total_goal,H_goal_ratio,A_goal_ratio,attack_strength_home_team,attack_strength_away_team,adjusted_win_lost_ratio_H,adjusted_win_lost_ratio_A
0,Erzgebirge Aue,Union Berlin,A,2.63,3.3,2.6,2018,0.41,0.29,1.06,...,0.38,0.3,0.38,3,0.24,0.33,12.14,18.87,0.0,2.0
1,Sandhausen,Fortuna Dusseldorf,A,2.45,3.3,2.88,2018,0.41,0.53,1.12,...,0.41,0.3,0.35,3,0.28,0.31,12.81,20.44,-1.0,3.0
2,Holstein Kiel,Greuther Furth,H,2.05,3.3,3.6,2018,0.47,0.06,2.12,...,0.49,0.3,0.28,4,0.4,0.2,24.27,7.86,2.0,0.0
3,Heidenheim,Ingolstadt,A,3.5,3.39,2.04,2018,0.53,0.41,2.0,...,0.29,0.29,0.49,3,0.47,0.31,20.23,18.08,-1.0,3.0
4,Dresden,Greuther Furth,D,2.35,3.3,3.0,2018,0.31,0.06,1.19,...,0.43,0.3,0.33,2,0.26,0.2,12.81,7.86,0.0,2.0


In [None]:
ger2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1496 entries, 0 to 1495
Data columns (total 23 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   HomeTeam                     1496 non-null   object 
 1   AwayTeam                     1496 non-null   object 
 2   FTR                          1496 non-null   object 
 3   B365H                        1496 non-null   float64
 4   B365D                        1496 non-null   float64
 5   B365A                        1496 non-null   float64
 6   Year                         1496 non-null   int64  
 7   HomeTeam_WinRate             1496 non-null   float64
 8   AwayTeam_WinRate             1496 non-null   float64
 9   HomeTeam_GoalsAvg            1496 non-null   float64
 10  AwayTeam_GoalsAvg            1496 non-null   float64
 11  HomeTeam_goals_conceded_avg  1496 non-null   float64
 12  AwayTeam_goals_conceded_avg  1496 non-null   float64
 13  Broker_prob_H     

Merge the two leagues together for each season

In [None]:
data2018 = pd.concat([df2018, df20182,], ignore_index=True)
data2019 = pd.concat([df2019, df20192,], ignore_index=True)
data2020 = pd.concat([df2020, df20202,], ignore_index=True)
data2021 = pd.concat([df2021, df20212,], ignore_index=True)
data2022 = pd.concat([df2022, df20222,], ignore_index=True)

In [None]:
file_path = '/content/drive/My Drive/test/germany/1/2223.csv'
df2023 = pd.read_csv(file_path)

In [None]:
columns_test = ['Date', 'HomeTeam', 'AwayTeam', 'B365H', 'B365D', 'B365A']

In [None]:
df2023 = df2023[columns_test]

In [None]:
df2023.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 306 entries, 0 to 305
Data columns (total 6 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Date      306 non-null    object 
 1   HomeTeam  306 non-null    object 
 2   AwayTeam  306 non-null    object 
 3   B365H     306 non-null    float64
 4   B365D     306 non-null    float64
 5   B365A     306 non-null    float64
dtypes: float64(3), object(3)
memory usage: 14.5+ KB


In [None]:


def calculate_and_apply_overall_averages(season_dfs, new_season_df):
    # Initialize dictionaries for each metric
    metrics = {
        'HomeTeam_WinRate': 'HomeTeam', 'AwayTeam_WinRate': 'AwayTeam',
        'HomeTeam_GoalsAvg': 'HomeTeam', 'AwayTeam_GoalsAvg': 'AwayTeam',
        'HomeTeam_goals_conceded_avg': 'HomeTeam', 'AwayTeam_goals_conceded_avg': 'AwayTeam',
        'H_goal_ratio': 'HomeTeam', 'A_goal_ratio': 'AwayTeam',
        'attack_strength_home_team': 'HomeTeam', 'attack_strength_away_team': 'AwayTeam'
    }
    averages_dict = {metric: {} for metric in metrics}

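    # Record each team's metric season by season. Seasons are processed in order and
    # each assignment overwrites the previous one, so the stored value comes from the
    # most recent season in which the team appears.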
    for df in season_dfs:
        for metric, team_col in metrics.items():
            for team in df[team_col].unique():
                averages_dict[metric][team] = df[df[team_col] == team][metric].mean()

    # Apply the overall averages to df2023
    for metric, team_col in metrics.items():
        if metric not in new_season_df:
            new_season_df[metric] = pd.NA
        new_season_df[metric] = new_season_df[team_col].map(averages_dict[metric])

    return new_season_df

# List of DataFrames from 2016 to 2022
season_dfs = [df2016, df2017, data2018, data2019, data2020, data2021, data2022]
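The loop in `calculate_and_apply_overall_averages` assigns each season's value to the same dictionary key, so a later season replaces an earlier one. If a mean over all seasons is wanted instead, one possible sketch is shown below. The helper name `team_metric_mean_across_seasons` and the toy frames `s1` and `s2` are hypothetical; it assumes, as in this notebook, that a metric is constant for a given team within one season.

```python
import pandas as pd

def team_metric_mean_across_seasons(season_dfs, metric, team_col):
    """Average a per-season team metric over every season the team appears in."""
    per_season = [
        # The metric is constant per team within a season, so take the first value
        df.groupby(team_col)[metric].first()
        for df in season_dfs
    ]
    # Stack the seasons and average the values that share a team name
    return pd.concat(per_season).groupby(level=0).mean()

# Hypothetical toy seasons
s1 = pd.DataFrame({"HomeTeam": ["A", "A", "B"], "HomeTeam_WinRate": [0.4, 0.4, 0.2]})
s2 = pd.DataFrame({"HomeTeam": ["A", "C"], "HomeTeam_WinRate": [0.6, 0.5]})

avg = team_metric_mean_across_seasons([s1, s2], "HomeTeam_WinRate", "HomeTeam")
print(avg)
```

The resulting Series can be passed to `Series.map` on the team column of the new season's frame, in the same way the dictionary is used above.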


In [None]:
# Apply the overall averages to df2023
df2023 = calculate_and_apply_overall_averages(season_dfs, df2023)

In [None]:


def calculate_head_to_head_stats(merged_df):
    # Initialize a dictionary to track head-to-head stats
    head_to_head_stats = {}

    # Update head-to-head stats using merged_df
    for index, row in merged_df.iterrows():
        teams = tuple(sorted([row['HomeTeam'], row['AwayTeam']]))
        if teams not in head_to_head_stats:
            head_to_head_stats[teams] = {'wins': {teams[0]: 0, teams[1]: 0},
                                         'draws': 0,
                                         'total_matches': 0}

        head_to_head_stats[teams]['total_matches'] += 1
        if row['FTR'] == 'H':
            head_to_head_stats[teams]['wins'][row['HomeTeam']] += 1
        elif row['FTR'] == 'D':
            head_to_head_stats[teams]['draws'] += 1
        elif row['FTR'] == 'A':
            head_to_head_stats[teams]['wins'][row['AwayTeam']] += 1

    return head_to_head_stats

def adjusted_win_loss_ratio(wins, draws, losses, total_matches):
    ratio = ((3*wins + draws) - losses) / total_matches if total_matches > 0 else 0
    return round(ratio, 1)

def apply_adjusted_win_loss_ratio_to_2023(df2023, head_to_head_stats):
    def calculate_ratio_for_match(row):
        teams = tuple(sorted([row['HomeTeam'], row['AwayTeam']]))
        stats = head_to_head_stats.get(teams, {'wins': {row['HomeTeam']: 0, row['AwayTeam']: 0}, 'draws': 0, 'total_matches': 0})
        home_wins = stats['wins'].get(row['HomeTeam'], 0)
        away_wins = stats['wins'].get(row['AwayTeam'], 0)
        draws = stats['draws']
        total_matches = stats['total_matches']
        home_ratio = adjusted_win_loss_ratio(home_wins, draws, total_matches - home_wins - draws, total_matches)
        away_ratio = adjusted_win_loss_ratio(away_wins, draws, total_matches - away_wins - draws, total_matches)
        return pd.Series([home_ratio, away_ratio])

    df2023[['adjusted_win_lost_ratio_H', 'adjusted_win_lost_ratio_A']] = df2023.apply(calculate_ratio_for_match, axis=1)
    return df2023

# Assuming merged_df is the DataFrame that contains data from 2015 to 2022
merged_df = ger1.copy()

# Calculate head-to-head stats using merged data
head_to_head_stats = calculate_head_to_head_stats(merged_df)

# Apply the adjusted win-loss ratio to df2023
df2023 = apply_adjusted_win_loss_ratio_to_2023(df2023, head_to_head_stats)
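
As a worked check of the adjusted win-loss ratio used above, the sketch below recomputes it for a hypothetical head-to-head record (the counts are made up for illustration, not taken from the data). It also shows a property that follows from the formula: whenever the two teams have met at least once, the home and away ratios sum to 2 before rounding, so the two columns carry largely the same information.

```python
def adjusted_ratio(wins, draws, losses, total_matches):
    # Same formula as adjusted_win_loss_ratio above, but without the rounding step
    return ((3 * wins + draws) - losses) / total_matches if total_matches > 0 else 0

# Hypothetical record over 10 meetings: team A won 6, team B won 1, 3 draws
total, a_wins, b_wins, draws = 10, 6, 1, 3

ratio_a = adjusted_ratio(a_wins, draws, total - a_wins - draws, total)  # (18 + 3 - 1) / 10
ratio_b = adjusted_ratio(b_wins, draws, total - b_wins - draws, total)  # (3 + 3 - 6) / 10

print(ratio_a, ratio_b, ratio_a + ratio_b)
```

The same pattern is visible in the `df2023` preview, where each pair of `adjusted_win_lost_ratio_H` and `adjusted_win_lost_ratio_A` values adds up to 2.0.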

In [None]:
df2023.head(10)

Unnamed: 0,Date,HomeTeam,AwayTeam,B365H,B365D,B365A,HomeTeam_WinRate,AwayTeam_WinRate,HomeTeam_GoalsAvg,AwayTeam_GoalsAvg,HomeTeam_goals_conceded_avg,AwayTeam_goals_conceded_avg,H_goal_ratio,A_goal_ratio,attack_strength_home_team,attack_strength_away_team,adjusted_win_lost_ratio_H,adjusted_win_lost_ratio_A
0,05/08/2022,Ein Frankfurt,Bayern Munich,6.0,5.0,1.5,0.24,0.65,1.18,2.88,1.29,1.29,0.26,0.35,11.41,35.85,0.1,1.9
1,06/08/2022,Augsburg,Freiburg,3.3,3.5,2.15,0.44,0.41,1.5,1.53,1.56,1.24,0.35,0.34,13.69,19.02,0.2,1.8
2,06/08/2022,Bochum,Mainz,3.2,3.4,2.2,0.47,0.19,1.24,1.06,1.0,1.81,0.25,0.24,11.98,12.44,1.0,1.0
3,06/08/2022,M'gladbach,Hoffenheim,2.05,4.2,3.0,0.47,0.29,1.94,1.41,1.59,2.12,0.31,0.35,18.82,17.56,1.1,0.9
4,06/08/2022,Union Berlin,Hertha,1.72,3.75,4.5,0.59,0.18,1.47,0.76,1.0,2.24,0.32,0.21,14.26,9.51,1.3,0.7
5,06/08/2022,Wolfsburg,Werder Bremen,1.85,3.8,3.8,0.33,0.59,1.07,2.12,1.33,1.53,0.23,0.37,9.13,26.6,0.7,1.3
6,06/08/2022,Dortmund,Leverkusen,2.0,4.2,3.2,0.76,0.53,3.06,2.41,1.65,1.41,0.53,0.45,29.66,30.0,1.7,0.3
7,07/08/2022,Stuttgart,RB Leipzig,4.5,3.8,1.75,0.35,0.38,1.65,1.81,1.88,1.12,0.26,0.36,15.97,21.22,-0.8,2.8
8,07/08/2022,FC Koln,Schalke 04,1.7,4.0,4.5,0.53,0.59,1.59,1.88,1.24,1.0,0.31,0.33,15.4,23.65,1.7,0.3
9,12/08/2022,Freiburg,Dortmund,3.0,3.75,2.2,0.5,0.53,2.0,1.94,1.44,1.41,0.37,0.42,18.25,24.15,0.1,1.9


In [None]:
df2023 = process_time_data(df2023, 2023)



In [None]:
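def add_probability_B365(df):
    # Implied probabilities from Bet365 decimal odds (1 / odds), rounded to 2 decimals;
    # they are not normalized, so the three values can sum to more than 1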
    df['Broker_prob_H'] = round(1 / df['B365H'], 2)
    df['Broker_prob_D'] = round(1 / df['B365D'], 2)
    df['Broker_prob_A'] = round(1 / df['B365A'], 2)
    return df

In [None]:
df2023 = add_probability_B365(df2023)
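
The inverse odds computed above usually sum to slightly more than 1 because of the bookmaker's margin. A minimal sketch of one common way to turn them into probabilities that sum to 1 is shown below. The function name `add_normalized_probability_B365` and the `Norm_prob_*` column names are assumptions for illustration and are not part of this notebook's pipeline.

```python
import pandas as pd

def add_normalized_probability_B365(df):
    # Raw implied probabilities (inverse decimal odds)
    inv = 1 / df[['B365H', 'B365D', 'B365A']]
    # Divide each row by its sum so the three outcomes add up to 1
    norm = inv.div(inv.sum(axis=1), axis=0)
    out = df.copy()
    out['Norm_prob_H'] = norm['B365H']
    out['Norm_prob_D'] = norm['B365D']
    out['Norm_prob_A'] = norm['B365A']
    return out

# Odds taken from the first two rows of the df2023 preview
odds = pd.DataFrame({'B365H': [6.0, 3.3], 'B365D': [5.0, 3.5], 'B365A': [1.5, 2.15]})
probs = add_normalized_probability_B365(odds)
print(probs[['Norm_prob_H', 'Norm_prob_D', 'Norm_prob_A']].sum(axis=1))
```

Whether the normalized columns help the models is an open question; they would need to be added to the training data as well before being used as features.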

In [None]:
df2023.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 306 entries, 0 to 305
Data columns (total 21 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   HomeTeam                     306 non-null    object 
 1   AwayTeam                     306 non-null    object 
 2   B365H                        306 non-null    float64
 3   B365D                        306 non-null    float64
 4   B365A                        306 non-null    float64
 5   HomeTeam_WinRate             306 non-null    float64
 6   AwayTeam_WinRate             306 non-null    float64
 7   HomeTeam_GoalsAvg            306 non-null    float64
 8   AwayTeam_GoalsAvg            306 non-null    float64
 9   HomeTeam_goals_conceded_avg  306 non-null    float64
 10  AwayTeam_goals_conceded_avg  306 non-null    float64
 11  H_goal_ratio                 306 non-null    float64
 12  A_goal_ratio                 306 non-null    float64
 13  attack_strength_home

In [None]:
train = ger1[ger1['Year'] < 2022]
validation = ger1[ger1['Year'] == 2022]

In [None]:
X_train = train.drop(['FTR', 'total_goal'], axis=1)
y_train = train['FTR']
X_validation = validation.drop(['FTR', 'total_goal'], axis=1)
y_validation = validation['FTR']

In [None]:
X_test = df2023.copy()
X_test = X_test[X_train.columns]

In [None]:
X_train.shape , y_train.shape, X_validation.shape, y_validation.shape, X_test.shape

((1796, 21), (1796,), (300, 21), (300,), (306, 21))

In [None]:
y_train_enc = y_train.map({'H': 1, 'D': 0, 'A': 2})
y_validation_enc = y_validation.map({'H': 1, 'D': 0, 'A': 2})

In [None]:

from sklearn.preprocessing import LabelEncoder
# Update the set of all teams to include teams from X_test
all_teams = set(X_train['HomeTeam'].unique()).union(set(X_train['AwayTeam'].unique()))
all_teams = all_teams.union(set(X_validation['HomeTeam'].unique())).union(set(X_validation['AwayTeam'].unique()))
all_teams = all_teams.union(set(X_test['HomeTeam'].unique())).union(set(X_test['AwayTeam'].unique()))

# Convert the set to a list
all_teams_list = list(all_teams)

# Fit the LabelEncoder with the updated list of all teams
encoder = LabelEncoder()
encoder.fit(all_teams_list)

# Transform 'HomeTeam' and 'AwayTeam' in all datasets
X_train['HomeTeam'] = encoder.transform(X_train['HomeTeam'])
X_train['AwayTeam'] = encoder.transform(X_train['AwayTeam'])
X_validation['HomeTeam'] = encoder.transform(X_validation['HomeTeam'])
X_validation['AwayTeam'] = encoder.transform(X_validation['AwayTeam'])
X_test['HomeTeam'] = encoder.transform(X_test['HomeTeam'])
X_test['AwayTeam'] = encoder.transform(X_test['AwayTeam'])
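
The encoding step above is repeated for each modelling task in this notebook. A possible way to factor it out is sketched below; `encode_team_columns` is a hypothetical helper name, and the toy frames are made up for illustration. It fits a single `LabelEncoder` on every team name found in the given frames, so that each team receives the same integer everywhere.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

def encode_team_columns(frames, team_cols=('HomeTeam', 'AwayTeam')):
    """Fit one LabelEncoder on all team names in `frames` and return encoded copies."""
    all_teams = pd.concat([f[c] for f in frames for c in team_cols]).unique()
    encoder = LabelEncoder().fit(all_teams)
    encoded = []
    for f in frames:
        f = f.copy()  # leave the caller's DataFrame untouched
        for c in team_cols:
            f[c] = encoder.transform(f[c])
        encoded.append(f)
    return encoded, encoder

# Toy frames standing in for X_train and X_test
toy_train = pd.DataFrame({'HomeTeam': ['Mainz', 'Augsburg'], 'AwayTeam': ['Augsburg', 'Bochum']})
toy_test = pd.DataFrame({'HomeTeam': ['Bochum'], 'AwayTeam': ['Mainz']})

(train_enc, test_enc), encoder = encode_team_columns([toy_train, toy_test])
print(list(encoder.classes_))
print(train_enc)
```

`LabelEncoder` assigns integers in sorted order of the names, so the codes carry no footballing meaning; tree-based models tend to cope with this, while linear models may read the integers as an ordering.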


In [None]:
!pip install lazypredict

Collecting lazypredict
  Downloading lazypredict-0.2.12-py2.py3-none-any.whl (12 kB)
Installing collected packages: lazypredict
Successfully installed lazypredict-0.2.12


In [None]:
from lazypredict.Supervised import LazyClassifier
clf = LazyClassifier(verbose = 0, ignore_warnings = False, custom_metric = None)
models,pred = clf.fit(X_train, X_validation, y_train_enc, y_validation_enc)

ROC AUC couldn't be calculated for AdaBoostClassifier
multi_class must be in ('ovo', 'ovr')
(the same ROC AUC message is repeated for each of the other 26 classifiers that ran, through LGBMClassifier)
CategoricalNB model failed to execute
Negative values in data passed to CategoricalNB (input X)
StackingClassifier model failed to execute
StackingClassifier.__init__() missing 1 required positional argument: 'estimators'
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000311 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 944
[LightGBM] [Info] Number of data points in the train set: 1796, number of used features: 21
[LightGBM] [Info] Start training from score -1.402007
[LightGBM] [Info] Start training from score -0.813718
[LightGBM] [Info] Start training from score -1.168958

100%|██████████| 29/29 [00:05<00:00,  5.20it/s]




In [None]:
models

Model,Accuracy,Balanced Accuracy,ROC AUC,F1 Score,Time Taken
QuadraticDiscriminantAnalysis,0.69,0.67,,0.7,0.04
LGBMClassifier,0.71,0.66,,0.7,0.42
XGBClassifier,0.7,0.65,,0.69,0.35
ExtraTreeClassifier,0.67,0.65,,0.67,0.01
CalibratedClassifierCV,0.7,0.63,,0.66,1.59
SGDClassifier,0.67,0.63,,0.67,0.1
ExtraTreesClassifier,0.68,0.63,,0.67,0.27
LogisticRegression,0.68,0.63,,0.67,0.07
RandomForestClassifier,0.69,0.62,,0.66,0.43
LinearDiscriminantAnalysis,0.68,0.62,,0.66,0.06
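
The ROC AUC column above is empty because, as the log messages indicate, the score was requested without a multi-class strategy. A minimal sketch of how a three-class ROC AUC can be computed directly with scikit-learn is given below. It uses synthetic data from `make_classification` and a `LogisticRegression` as stand-ins, so the number it prints says nothing about the football models.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic 3-class problem standing in for the D / H / A target
X, y = make_classification(n_samples=600, n_features=10, n_informative=5,
                           n_classes=3, random_state=42)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.25,
                                          stratify=y, random_state=42)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# Multi-class ROC AUC needs class probabilities, not hard labels
proba = model.predict_proba(X_va)
auc_ovr = roc_auc_score(y_va, proba, multi_class='ovr', average='macro')
print(f'One-vs-rest macro ROC AUC: {auc_ovr:.3f}')
```

The same call should work for any fitted classifier in this notebook that exposes `predict_proba`, passing `X_validation` and `y_validation_enc` in place of the synthetic arrays.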


In [None]:
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier, ExtraTreesClassifier
from sklearn.metrics import accuracy_score, f1_score
from sklearn.tree import DecisionTreeClassifier
from xgboost import XGBClassifier

# Initialize the models
bagging_clf = BaggingClassifier(n_estimators=100, random_state=42)
extra_trees_clf = ExtraTreesClassifier(n_estimators=100, random_state=42)
random_forest_clf = RandomForestClassifier(n_estimators=100, random_state=42)
decision_tree_clf = DecisionTreeClassifier(random_state=42)
xgb_clf = XGBClassifier(random_state=42)  # Default parameters, adjust as necessary

# Fit the models
bagging_clf.fit(X_train, y_train_enc)
extra_trees_clf.fit(X_train, y_train_enc)
random_forest_clf.fit(X_train, y_train_enc)
decision_tree_clf.fit(X_train, y_train_enc)
xgb_clf.fit(X_train, y_train_enc)

# Predict y_validation
y_pred_bagging = bagging_clf.predict(X_validation)
y_pred_extra_trees = extra_trees_clf.predict(X_validation)
y_pred_random_forest = random_forest_clf.predict(X_validation)
y_pred_decision_tree = decision_tree_clf.predict(X_validation)
y_pred_xgb = xgb_clf.predict(X_validation)

# Calculate accuracy and F1 score for each model
accuracy_bagging = accuracy_score(y_validation_enc, y_pred_bagging)
f1_bagging = f1_score(y_validation_enc, y_pred_bagging, average='macro')

accuracy_extra_trees = accuracy_score(y_validation_enc, y_pred_extra_trees)
f1_extra_trees = f1_score(y_validation_enc, y_pred_extra_trees, average='macro')

accuracy_random_forest = accuracy_score(y_validation_enc, y_pred_random_forest)
f1_random_forest = f1_score(y_validation_enc, y_pred_random_forest, average='macro')

accuracy_decision_tree = accuracy_score(y_validation_enc, y_pred_decision_tree)
f1_decision_tree = f1_score(y_validation_enc, y_pred_decision_tree, average='macro')

accuracy_xgb = accuracy_score(y_validation_enc, y_pred_xgb)
f1_xgb = f1_score(y_validation_enc, y_pred_xgb, average='macro')

# Print out the performance
print(f'Bagging Classifier - Accuracy: {accuracy_bagging}, F1 Score: {f1_bagging}')
print(f'Extra Trees Classifier - Accuracy: {accuracy_extra_trees}, F1 Score: {f1_extra_trees}')
print(f'Random Forest Classifier - Accuracy: {accuracy_random_forest}, F1 Score: {f1_random_forest}')
print(f'Decision Tree Classifier - Accuracy: {accuracy_decision_tree}, F1 Score: {f1_decision_tree}')
print(f'XGB Classifier - Accuracy: {accuracy_xgb}, F1 Score: {f1_xgb}')


Bagging Classifier - Accuracy: 0.6866666666666666, F1 Score: 0.6284085454075484
Extra Trees Classifier - Accuracy: 0.6833333333333333, F1 Score: 0.629685852835758
Random Forest Classifier - Accuracy: 0.6933333333333334, F1 Score: 0.6225364712358932
Decision Tree Classifier - Accuracy: 0.6466666666666666, F1 Score: 0.610342860976919
XGB Classifier - Accuracy: 0.6966666666666667, F1 Score: 0.6549264127626109


In [None]:
from sklearn.model_selection import GridSearchCV

param_grid_xgb = {
    'n_estimators': [50, 100, 200],
    'max_depth': [3, 6, 10],
    'learning_rate': [0.01, 0.1, 0.2],
    'subsample': [0.5, 0.7, 1],
    'colsample_bytree': [0.5, 0.7, 1]
}

param_grid_random_forest = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

param_grid_extra_trees = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}



grid_search_xgb = GridSearchCV(xgb_clf, param_grid_xgb, cv=5, scoring='f1_macro', verbose=1, n_jobs=-1)
grid_search_random_forest = GridSearchCV(random_forest_clf, param_grid_random_forest, cv=5, scoring='f1_macro', verbose=1, n_jobs=-1)
grid_search_extra_trees = GridSearchCV(extra_trees_clf, param_grid_extra_trees, cv=5, scoring='f1_macro', verbose=1, n_jobs=-1)

grid_search_xgb.fit(X_train, y_train_enc)
grid_search_random_forest.fit(X_train, y_train_enc)
grid_search_extra_trees.fit(X_train, y_train_enc)

best_params_xgb = grid_search_xgb.best_params_
best_score_xgb = grid_search_xgb.best_score_

best_params_random_forest = grid_search_random_forest.best_params_
best_score_random_forest = grid_search_random_forest.best_score_

best_params_extra_trees = grid_search_extra_trees.best_params_
best_score_extra_trees = grid_search_extra_trees.best_score_


y_pred_xgb = grid_search_xgb.best_estimator_.predict(X_validation)
f1_score_xgb = f1_score(y_validation_enc, y_pred_xgb, average='macro')

y_pred_random_forest = grid_search_random_forest.best_estimator_.predict(X_validation)
f1_score_random_forest = f1_score(y_validation_enc, y_pred_random_forest, average='macro')

y_pred_extra_trees = grid_search_extra_trees.best_estimator_.predict(X_validation)
f1_score_extra_trees = f1_score(y_validation_enc, y_pred_extra_trees, average='macro')


results = {
    "XGB Classifier": {
        "Best Parameters": best_params_xgb,
        "Best Score": best_score_xgb,
        "F1 Score on Validation": f1_score_xgb
    },
    "Random Forest Classifier": {
        "Best Parameters": best_params_random_forest,
        "Best Score": best_score_random_forest,
        "F1 Score on Validation": f1_score_random_forest
    },
    "Extra Trees Classifier": {
        "Best Parameters": best_params_extra_trees,
        "Best Score": best_score_extra_trees,
        "F1 Score on Validation": f1_score_extra_trees
    }
}





Fitting 5 folds for each of 243 candidates, totalling 1215 fits
Fitting 5 folds for each of 108 candidates, totalling 540 fits
Fitting 5 folds for each of 108 candidates, totalling 540 fits


In [None]:
results

{'XGB Classifier': {'Best Parameters': {'colsample_bytree': 1,
   'learning_rate': 0.01,
   'max_depth': 10,
   'n_estimators': 50,
   'subsample': 0.5},
  'Best Score': 0.6969195433902737,
  'F1 Score on Validation': 0.6683630401215815},
 'Random Forest Classifier': {'Best Parameters': {'max_depth': None,
   'min_samples_leaf': 4,
   'min_samples_split': 2,
   'n_estimators': 100},
  'Best Score': 0.6730572169535618,
  'F1 Score on Validation': 0.6228526258584063},
 'Extra Trees Classifier': {'Best Parameters': {'max_depth': 10,
   'min_samples_leaf': 4,
   'min_samples_split': 2,
   'n_estimators': 100},
  'Best Score': 0.6761541686958532,
  'F1 Score on Validation': 0.6385610608760879}}
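
The grid searches above pass `cv=5`, which for classifiers means stratified folds that do not respect match order. Assuming the rows are in chronological order, some folds are therefore validated on matches that precede part of their training data. If chronological validation is preferred, one option is scikit-learn's `TimeSeriesSplit`. The sketch below illustrates it on synthetic, row-ordered data with a deliberately tiny grid; it is not the search used in this notebook.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit

rng = np.random.default_rng(42)
X = rng.normal(size=(300, 5))       # synthetic features, rows assumed to be in time order
y = rng.integers(0, 3, size=300)    # synthetic 3-class target (0 = D, 1 = H, 2 = A)

# Each split trains on earlier rows and validates on the rows that follow
tscv = TimeSeriesSplit(n_splits=5)

search = GridSearchCV(
    RandomForestClassifier(n_estimators=20, random_state=42),
    param_grid={'max_depth': [3, None], 'min_samples_leaf': [1, 4]},
    cv=tscv,
    scoring='f1_macro',
)
search.fit(X, y)
print(search.best_params_)

# Check that every training index precedes every validation index in each split
splits_are_ordered = all(train.max() < test.min() for train, test in tscv.split(X))
print(splits_are_ordered)
```

With this splitter the early folds train on fewer rows than the later ones, so cross-validated scores may be noisier than with ordinary 5-fold splitting.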

In [None]:
optimal_xgb_clf = XGBClassifier(
    colsample_bytree=1,
    learning_rate=0.01,
    max_depth=10,
    n_estimators=50,
    subsample=0.5,
    random_state=42  # Optional for reproducibility
)

In [None]:
optimal_xgb_clf.fit(X_train, y_train_enc)

In [None]:
y_pred_test_xgb = optimal_xgb_clf.predict(X_test)

In [None]:
# Define the inverse mapping
inverse_mapping = {1: 'H', 0: 'D', 2: 'A'}

# Convert predictions back to the original form
y_pred_test_xgb_original = [inverse_mapping[label] for label in y_pred_test_xgb]


In [None]:
predictions_df = pd.DataFrame(y_pred_test_xgb_original, columns=['Predictions'])

In [None]:
predictions_df.to_csv('germany_1.csv', index=False)

In [None]:
train = ger1[ger1['Year'] < 2022]
validation = ger1[ger1['Year'] == 2022]

In [None]:
X_train = train.drop(['FTR', 'total_goal'], axis=1)
y_train = train['total_goal']
X_validation = validation.drop(['FTR', 'total_goal'], axis=1)
y_validation = validation['total_goal']

In [None]:
X_test = df2023.copy()
X_test = X_test[X_train.columns]

In [None]:
X_train.shape , y_train.shape, X_validation.shape, y_validation.shape, X_test.shape

((1796, 21), (1796,), (300, 21), (300,), (306, 21))

In [None]:

from sklearn.preprocessing import LabelEncoder
# Update the set of all teams to include teams from X_test
all_teams = set(X_train['HomeTeam'].unique()).union(set(X_train['AwayTeam'].unique()))
all_teams = all_teams.union(set(X_validation['HomeTeam'].unique())).union(set(X_validation['AwayTeam'].unique()))
all_teams = all_teams.union(set(X_test['HomeTeam'].unique())).union(set(X_test['AwayTeam'].unique()))

# Convert the set to a list
all_teams_list = list(all_teams)

# Fit the LabelEncoder with the updated list of all teams
encoder = LabelEncoder()
encoder.fit(all_teams_list)

# Transform 'HomeTeam' and 'AwayTeam' in all datasets
X_train['HomeTeam'] = encoder.transform(X_train['HomeTeam'])
X_train['AwayTeam'] = encoder.transform(X_train['AwayTeam'])
X_validation['HomeTeam'] = encoder.transform(X_validation['HomeTeam'])
X_validation['AwayTeam'] = encoder.transform(X_validation['AwayTeam'])
X_test['HomeTeam'] = encoder.transform(X_test['HomeTeam'])
X_test['AwayTeam'] = encoder.transform(X_test['AwayTeam'])

In [None]:
from lazypredict.Supervised import LazyRegressor

# Create an instance of LazyRegressor
reg = LazyRegressor(verbose=0, ignore_warnings=False, custom_metric=None)

# Fit the model
models, predictions = reg.fit(X_train, X_validation, y_train, y_validation)

 21%|██▏       | 9/42 [00:01<00:07,  4.20it/s]

GammaRegressor model failed to execute
Some value(s) of y are out of the valid range of the loss 'HalfGammaLoss'.


 74%|███████▍  | 31/42 [00:06<00:02,  4.33it/s]

QuantileRegressor model failed to execute
Solver interior-point is not anymore available in SciPy >= 1.11.0.


100%|██████████| 42/42 [00:09<00:00,  4.47it/s]

[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000402 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 944
[LightGBM] [Info] Number of data points in the train set: 1796, number of used features: 21
[LightGBM] [Info] Start training from score 2.982739





In [None]:
models

Model,Adjusted R-Squared,R-Squared,RMSE,Time Taken
RidgeCV,0.14,0.2,1.56,0.03
Ridge,0.14,0.2,1.56,0.01
BayesianRidge,0.14,0.2,1.56,0.04
LinearRegression,0.14,0.2,1.56,0.03
TransformedTargetRegressor,0.14,0.2,1.56,0.02
SGDRegressor,0.14,0.2,1.56,0.04
ElasticNetCV,0.14,0.2,1.56,0.27
LassoLarsCV,0.14,0.2,1.56,0.08
LassoCV,0.14,0.2,1.56,0.3
OrthogonalMatchingPursuitCV,0.14,0.2,1.56,0.02


In [None]:
from sklearn.linear_model import ElasticNetCV
from sklearn.svm import SVR
from sklearn.ensemble import RandomForestRegressor
from xgboost import XGBRegressor
from sklearn.linear_model import PoissonRegressor
from sklearn.metrics import mean_absolute_error, r2_score

# Initialize the models
elastic_net_cv = ElasticNetCV(cv=5, random_state=42)  # Adjust parameters as necessary
poisson_regressor = PoissonRegressor()
svr = SVR()  # Default parameters, adjust as necessary
random_forest_reg = RandomForestRegressor(n_estimators=100, random_state=42)
xgb_reg = XGBRegressor(random_state=42)  # Default parameters, adjust as necessary

# Fit the models
elastic_net_cv.fit(X_train, y_train)
poisson_regressor.fit(X_train, y_train)
svr.fit(X_train, y_train)
random_forest_reg.fit(X_train, y_train)
xgb_reg.fit(X_train, y_train)

# Predict y_validation
y_pred_elastic_net_cv = elastic_net_cv.predict(X_validation)
y_pred_poisson_regressor = poisson_regressor.predict(X_validation)
y_pred_svr = svr.predict(X_validation)
y_pred_random_forest = random_forest_reg.predict(X_validation)
y_pred_xgb = xgb_reg.predict(X_validation)

# Round the predictions to the nearest integer number of goals
y_pred_elastic_net_cv_rounded = np.rint(y_pred_elastic_net_cv)
y_pred_poisson_regressor_rounded = np.rint(y_pred_poisson_regressor)
y_pred_svr_rounded = np.rint(y_pred_svr)
y_pred_random_forest_rounded = np.rint(y_pred_random_forest)
y_pred_xgb_rounded = np.rint(y_pred_xgb)

# Calculate MAE and R2 score using rounded predictions
mae_elastic_net_cv = mean_absolute_error(y_validation, y_pred_elastic_net_cv_rounded)
r2_elastic_net_cv = r2_score(y_validation, y_pred_elastic_net_cv_rounded)

mae_poisson_regressor = mean_absolute_error(y_validation, y_pred_poisson_regressor_rounded)
r2_poisson_regressor = r2_score(y_validation, y_pred_poisson_regressor_rounded)

mae_svr = mean_absolute_error(y_validation, y_pred_svr_rounded)
r2_svr = r2_score(y_validation, y_pred_svr_rounded)

mae_random_forest = mean_absolute_error(y_validation, y_pred_random_forest_rounded)
r2_random_forest = r2_score(y_validation, y_pred_random_forest_rounded)

mae_xgb = mean_absolute_error(y_validation, y_pred_xgb_rounded)
r2_xgb = r2_score(y_validation, y_pred_xgb_rounded)

# Print out the performance with rounded predictions
print(f'ElasticNetCV - MAE: {mae_elastic_net_cv}, R2 Score: {r2_elastic_net_cv}')
print(f'Poisson Regressor - MAE: {mae_poisson_regressor}, R2 Score: {r2_poisson_regressor}')
print(f'SVR - MAE: {mae_svr}, R2 Score: {r2_svr}')
print(f'Random Forest Regressor - MAE: {mae_random_forest}, R2 Score: {r2_random_forest}')
print(f'XGB Regressor - MAE: {mae_xgb}, R2 Score: {r2_xgb}')

ElasticNetCV - MAE: 1.2333333333333334, R2 Score: 0.17182460116816323
Poisson Regressor - MAE: 1.25, R2 Score: 0.13804376253160144
SVR - MAE: 1.3866666666666667, R2 Score: -0.0047075233196756106
Random Forest Regressor - MAE: 1.2666666666666666, R2 Score: 0.1543893296138088
XGB Regressor - MAE: 1.33, R2 Score: 0.029073315316886172
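
The metrics above are computed on predictions rounded with `np.rint`, while the later grid-search cell scores unrounded predictions, so the two sets of MAE values are not directly comparable. The small sketch below, using made-up numbers, shows how rounding can change the MAE and that `np.rint` rounds exact halves to the nearest even integer.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Made-up total-goal values and model outputs, for illustration only
y_true = np.array([2, 3, 1, 4, 0])
y_pred = np.array([2.4, 2.6, 1.5, 3.2, 0.5])

# np.rint rounds halves to the nearest even integer: 1.5 -> 2.0, 0.5 -> 0.0
rounded = np.rint(y_pred)

mae_raw = mean_absolute_error(y_true, y_pred)
mae_rounded = mean_absolute_error(y_true, rounded)
print(rounded, mae_raw, mae_rounded)
```

When comparing models, it may be simplest to pick one convention (rounded or unrounded) and apply it to every candidate.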


In [None]:
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import mean_absolute_error

# Parameter grids
param_grid_poisson = {
    'alpha': [0.01, 0.1, 1, 10],
    'max_iter': [100, 300, 500]
}

param_grid_random_forest = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

# GridSearchCV setup
grid_search_poisson = GridSearchCV(poisson_regressor, param_grid_poisson, cv=5, scoring='neg_mean_absolute_error', verbose=1, n_jobs=-1)
grid_search_random_forest = GridSearchCV(random_forest_reg, param_grid_random_forest, cv=5, scoring='neg_mean_absolute_error', verbose=1, n_jobs=-1)

# Fitting models
grid_search_poisson.fit(X_train, y_train)
grid_search_random_forest.fit(X_train, y_train)

# Best parameters and scores
best_params_poisson = grid_search_poisson.best_params_
best_score_poisson = grid_search_poisson.best_score_

best_params_random_forest = grid_search_random_forest.best_params_
best_score_random_forest = grid_search_random_forest.best_score_

# Predict and calculate MAE
y_pred_poisson = grid_search_poisson.best_estimator_.predict(X_validation)
mae_poisson = mean_absolute_error(y_validation, y_pred_poisson)

y_pred_random_forest = grid_search_random_forest.best_estimator_.predict(X_validation)
mae_random_forest = mean_absolute_error(y_validation, y_pred_random_forest)

# Results
results = {
    "Poisson Regressor": {
        "Best Parameters": best_params_poisson,
        "Best Score (Negative MAE)": best_score_poisson,
        "MAE on Validation": mae_poisson
    },
    "Random Forest Regressor": {
        "Best Parameters": best_params_random_forest,
        "Best Score (Negative MAE)": best_score_random_forest,
        "MAE on Validation": mae_random_forest
    }
}

# ElasticNetCV already uses cross-validation for parameter tuning, so we directly fit it and predict
elastic_net_cv = ElasticNetCV(cv=5, random_state=42).fit(X_train, y_train)
y_pred_elastic_net_cv = elastic_net_cv.predict(X_validation)
mae_elastic_net_cv = mean_absolute_error(y_validation, y_pred_elastic_net_cv)

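# Note: get_params() lists the constructor arguments, not the tuned values;
# the alpha selected by cross-validation is stored in elastic_net_cv.alpha_
results["ElasticNetCV"] = {
    "Best Parameters": elastic_net_cv.get_params(),
    "MAE on Validation": mae_elastic_net_cv
}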

# Print results
for model, info in results.items():
    print(f"{model}:")
    for key, value in info.items():
        print(f"  {key}: {value}")
    print()

Fitting 5 folds for each of 12 candidates, totalling 60 fits
Fitting 5 folds for each of 108 candidates, totalling 540 fits
Poisson Regressor:
  Best Parameters: {'alpha': 0.01, 'max_iter': 500}
  Best Score (Negative MAE): -1.2419591249592146
  MAE on Validation: 1.2572262057373693

Random Forest Regressor:
  Best Parameters: {'max_depth': 10, 'min_samples_leaf': 4, 'min_samples_split': 10, 'n_estimators': 50}
  Best Score (Negative MAE): -1.2937289414315958
  MAE on Validation: 1.2975259331407671

ElasticNetCV:
  Best Parameters: {'alphas': None, 'copy_X': True, 'cv': 5, 'eps': 0.001, 'fit_intercept': True, 'l1_ratio': 0.5, 'max_iter': 1000, 'n_alphas': 100, 'n_jobs': None, 'positive': False, 'precompute': 'auto', 'random_state': 42, 'selection': 'cyclic', 'tol': 0.0001, 'verbose': 0}
  MAE on Validation: 1.2540402645503033
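
For ElasticNetCV, `get_params()` returns the constructor arguments (hence `'alphas': None` in the output above) rather than the regularization strength chosen by cross-validation. In scikit-learn the selected values are exposed after fitting as `alpha_` and `l1_ratio_`. A minimal sketch on synthetic data from `make_regression`, which stands in for the football features:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNetCV

# Synthetic regression problem, for illustration only
X, y = make_regression(n_samples=200, n_features=8, noise=10.0, random_state=42)

model = ElasticNetCV(cv=5, random_state=42).fit(X, y)

# Values actually selected by the internal cross-validation
selected = {'alpha': model.alpha_, 'l1_ratio': model.l1_ratio_}
print(selected)
```

Reporting these fitted attributes next to the validation MAE would make the ElasticNetCV entry comparable with the `best_params_` shown for the other two models.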



In [None]:

# Set the best parameters for PoissonRegressor
best_params_poisson = {
    'alpha': 0.01,
    'max_iter': 500
}

# Initialize and fit the PoissonRegressor with the best parameters
poisson_model = PoissonRegressor(**best_params_poisson)
poisson_model.fit(X_train, y_train)

# Make predictions on the test set
y_pred_test_poisson = poisson_model.predict(X_test)

# Round the predictions to the nearest integer and convert to int type
y_pred_test_poisson_rounded = np.rint(y_pred_test_poisson).astype(int)

# y_pred_test_poisson_rounded contains the final integer predictions for X_test


In [None]:


# Convert the predictions to a DataFrame
predictions_df_poisson = pd.DataFrame(y_pred_test_poisson_rounded, columns=['Predicted_Total_Goals'])

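# Save the DataFrame to a CSV file
# NOTE: 'germany_1.csv' is the same filename used for the FTR predictions above,
# so this call overwrites that file; change one of the names if both outputs are needed
predictions_df_poisson.to_csv('germany_1.csv', index=False)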


In [None]:
df = ger1.copy()

In [None]:
X_train = df.drop(['FTR', 'total_goal'], axis=1)
y_train = df['FTR']
X_test = df2023.copy()
X_test = X_test[X_train.columns]

In [None]:
X_train.shape, X_test.shape

((2096, 21), (306, 21))

In [None]:
y_train_encoded = y_train.map({'H': 1, 'D': 0, 'A': 2})

In [None]:
all_teams = set(X_train['HomeTeam'].unique()).union(set(X_train['AwayTeam'].unique()))
all_teams = all_teams.union(set(X_test['HomeTeam'].unique())).union(set(X_test['AwayTeam'].unique()))

# Convert the set to a list
all_teams_list = list(all_teams)

In [None]:
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()
encoder.fit(all_teams_list)


X_train['HomeTeam'] = encoder.transform(X_train['HomeTeam'])
X_train['AwayTeam'] = encoder.transform(X_train['AwayTeam'])
X_test['HomeTeam'] = encoder.transform(X_test['HomeTeam'])
X_test['AwayTeam'] = encoder.transform(X_test['AwayTeam'])

In [None]:
X_train.head(10)

Unnamed: 0,HomeTeam,AwayTeam,B365H,B365D,B365A,HomeTeam_WinRate,AwayTeam_WinRate,HomeTeam_GoalsAvg,AwayTeam_GoalsAvg,HomeTeam_goals_conceded_avg,...,Broker_prob_H,Broker_prob_D,Broker_prob_A,H_goal_ratio,A_goal_ratio,attack_strength_home_team,attack_strength_away_team,adjusted_win_lost_ratio_H,adjusted_win_lost_ratio_A,Year
0,16,15,1.5,4.33,6.5,0.59,0.18,1.82,0.65,1.0,...,0.67,0.23,0.15,0.33,0.16,19.79,8.71,3.0,-1.0,2016
1,17,0,2.15,3.4,3.4,0.76,0.35,2.47,1.41,1.06,...,0.47,0.29,0.29,0.37,0.35,26.81,19.0,2.0,0.0,2016
2,6,17,2.45,3.4,2.88,0.35,0.25,1.29,1.44,1.41,...,0.41,0.29,0.35,0.32,0.32,14.04,18.21,-1.0,3.0,2016
3,16,4,1.25,6.0,13.0,0.59,0.41,1.82,1.35,1.0,...,0.8,0.17,0.08,0.33,0.37,19.79,18.21,1.0,1.0,2016
4,18,12,2.0,3.5,3.8,0.47,0.18,1.35,0.94,1.06,...,0.5,0.29,0.26,0.29,0.26,14.68,12.66,3.0,-1.0,2016
5,4,23,3.4,3.3,2.2,0.12,0.18,0.88,1.65,1.71,...,0.29,0.3,0.45,0.28,0.31,9.57,22.16,0.0,2.0,2016
6,18,14,2.15,3.5,3.3,0.47,0.18,1.35,1.0,1.06,...,0.47,0.29,0.3,0.29,0.27,14.68,13.46,1.0,1.0,2016
7,4,16,4.75,3.75,1.75,0.12,0.5,0.88,1.56,1.71,...,0.21,0.27,0.57,0.28,0.29,9.57,19.79,1.0,1.0,2016
8,13,25,2.05,3.5,3.6,0.53,0.27,1.41,1.33,0.88,...,0.49,0.29,0.28,0.34,0.36,15.32,15.83,1.0,1.0,2016
9,16,18,1.53,4.33,6.0,0.59,0.35,1.82,1.35,1.0,...,0.65,0.23,0.17,0.33,0.32,19.79,18.21,1.0,1.0,2016


In [None]:
X_test.head(10)

Unnamed: 0,HomeTeam,AwayTeam,B365H,B365D,B365A,HomeTeam_WinRate,AwayTeam_WinRate,HomeTeam_GoalsAvg,AwayTeam_GoalsAvg,HomeTeam_goals_conceded_avg,...,Broker_prob_H,Broker_prob_D,Broker_prob_A,H_goal_ratio,A_goal_ratio,attack_strength_home_team,attack_strength_away_team,adjusted_win_lost_ratio_H,adjusted_win_lost_ratio_A,Year
0,6,1,6.0,5.0,1.5,0.24,0.65,1.18,2.88,1.29,...,0.17,0.2,0.67,0.26,0.35,11.41,35.85,0.1,1.9,2023
1,0,9,3.3,3.5,2.15,0.44,0.41,1.5,1.53,1.56,...,0.3,0.29,0.47,0.35,0.34,13.69,19.02,0.2,1.8,2023
2,3,18,3.2,3.4,2.2,0.47,0.19,1.24,1.06,1.0,...,0.31,0.29,0.45,0.25,0.24,11.98,12.44,1.0,1.0,2023
3,17,14,2.05,4.2,3.0,0.47,0.29,1.94,1.41,1.59,...,0.49,0.24,0.33,0.31,0.35,18.82,17.56,1.1,0.9,2023
4,24,13,1.72,3.75,4.5,0.59,0.18,1.47,0.76,1.0,...,0.58,0.27,0.22,0.32,0.21,14.26,9.51,1.3,0.7,2023
5,26,25,1.85,3.8,3.8,0.33,0.59,1.07,2.12,1.33,...,0.54,0.26,0.26,0.23,0.37,9.13,26.6,0.7,1.3,2023
6,5,16,2.0,4.2,3.2,0.76,0.53,3.06,2.41,1.65,...,0.5,0.24,0.31,0.53,0.45,29.66,30.0,1.7,0.3,2023
7,23,21,4.5,3.8,1.75,0.35,0.38,1.65,1.81,1.88,...,0.22,0.26,0.57,0.26,0.36,15.97,21.22,-0.8,2.8,2023
8,7,22,1.7,4.0,4.5,0.53,0.59,1.59,1.88,1.24,...,0.59,0.25,0.22,0.31,0.33,15.4,23.65,1.7,0.3,2023
9,9,5,3.0,3.75,2.2,0.5,0.53,2.0,1.94,1.44,...,0.33,0.27,0.45,0.37,0.42,18.25,24.15,0.1,1.9,2023


In [None]:
X_train.head(10)

Unnamed: 0,HomeTeam,AwayTeam,B365H,B365D,B365A,HomeTeam_WinRate,AwayTeam_WinRate,HomeTeam_GoalsAvg,AwayTeam_GoalsAvg,HomeTeam_goals_conceded_avg,...,Broker_prob_H,Broker_prob_D,Broker_prob_A,H_goal_ratio,A_goal_ratio,attack_strength_home_team,attack_strength_away_team,adjusted_win_lost_ratio_H,adjusted_win_lost_ratio_A,Year
0,Leverkusen,Ingolstadt,1.5,4.33,6.5,0.59,0.18,1.82,0.65,1.0,...,0.67,0.23,0.15,0.33,0.16,19.79,8.71,3.0,-1.0,2016
1,M'gladbach,Augsburg,2.15,3.4,3.4,0.76,0.35,2.47,1.41,1.06,...,0.47,0.29,0.29,0.37,0.35,26.81,19.0,2.0,0.0,2016
2,Ein Frankfurt,M'gladbach,2.45,3.4,2.88,0.35,0.25,1.29,1.44,1.41,...,0.41,0.29,0.35,0.32,0.32,14.04,18.21,-1.0,3.0,2016
3,Leverkusen,Darmstadt,1.25,6.0,13.0,0.59,0.41,1.82,1.35,1.0,...,0.8,0.17,0.08,0.33,0.37,19.79,18.21,1.0,1.0,2016
4,Mainz,Hannover,2.0,3.5,3.8,0.47,0.18,1.35,0.94,1.06,...,0.5,0.29,0.26,0.29,0.26,14.68,12.66,3.0,-1.0,2016
5,Darmstadt,Stuttgart,3.4,3.3,2.2,0.12,0.18,0.88,1.65,1.71,...,0.29,0.3,0.45,0.28,0.31,9.57,22.16,0.0,2.0,2016
6,Mainz,Hoffenheim,2.15,3.5,3.3,0.47,0.18,1.35,1.0,1.06,...,0.47,0.29,0.3,0.29,0.27,14.68,13.46,1.0,1.0,2016
7,Darmstadt,Leverkusen,4.75,3.75,1.75,0.12,0.5,0.88,1.56,1.71,...,0.21,0.27,0.57,0.28,0.29,9.57,19.79,1.0,1.0,2016
8,Hertha,Werder Bremen,2.05,3.5,3.6,0.53,0.27,1.41,1.33,0.88,...,0.49,0.29,0.28,0.34,0.36,15.32,15.83,1.0,1.0,2016
9,Leverkusen,Mainz,1.53,4.33,6.0,0.59,0.35,1.82,1.35,1.0,...,0.65,0.23,0.17,0.33,0.32,19.79,18.21,1.0,1.0,2016


In [None]:
X_test.head(10)

Unnamed: 0,HomeTeam,AwayTeam,B365H,B365D,B365A,HomeTeam_WinRate,AwayTeam_WinRate,HomeTeam_GoalsAvg,AwayTeam_GoalsAvg,HomeTeam_goals_conceded_avg,...,Broker_prob_H,Broker_prob_D,Broker_prob_A,H_goal_ratio,A_goal_ratio,attack_strength_home_team,attack_strength_away_team,adjusted_win_lost_ratio_H,adjusted_win_lost_ratio_A,Year
0,Ein Frankfurt,Bayern Munich,6.0,5.0,1.5,0.24,0.65,1.18,2.88,1.29,...,0.17,0.2,0.67,0.26,0.35,11.41,35.85,0.1,1.9,2023
1,Augsburg,Freiburg,3.3,3.5,2.15,0.44,0.41,1.5,1.53,1.56,...,0.3,0.29,0.47,0.35,0.34,13.69,19.02,0.2,1.8,2023
2,Bochum,Mainz,3.2,3.4,2.2,0.47,0.19,1.24,1.06,1.0,...,0.31,0.29,0.45,0.25,0.24,11.98,12.44,1.0,1.0,2023
3,M'gladbach,Hoffenheim,2.05,4.2,3.0,0.47,0.29,1.94,1.41,1.59,...,0.49,0.24,0.33,0.31,0.35,18.82,17.56,1.1,0.9,2023
4,Union Berlin,Hertha,1.72,3.75,4.5,0.59,0.18,1.47,0.76,1.0,...,0.58,0.27,0.22,0.32,0.21,14.26,9.51,1.3,0.7,2023
5,Wolfsburg,Werder Bremen,1.85,3.8,3.8,0.33,0.59,1.07,2.12,1.33,...,0.54,0.26,0.26,0.23,0.37,9.13,26.6,0.7,1.3,2023
6,Dortmund,Leverkusen,2.0,4.2,3.2,0.76,0.53,3.06,2.41,1.65,...,0.5,0.24,0.31,0.53,0.45,29.66,30.0,1.7,0.3,2023
7,Stuttgart,RB Leipzig,4.5,3.8,1.75,0.35,0.38,1.65,1.81,1.88,...,0.22,0.26,0.57,0.26,0.36,15.97,21.22,-0.8,2.8,2023
8,FC Koln,Schalke 04,1.7,4.0,4.5,0.53,0.59,1.59,1.88,1.24,...,0.59,0.25,0.22,0.31,0.33,15.4,23.65,1.7,0.3,2023
9,Freiburg,Dortmund,3.0,3.75,2.2,0.5,0.53,2.0,1.94,1.44,...,0.33,0.27,0.45,0.37,0.42,18.25,24.15,0.1,1.9,2023


In [None]:
file_path_2 = '/content/drive/My Drive/test/germany/2/2223.csv'
df2023_ger2 = pd.read_csv(file_path_2)

In [None]:
df2023_ger2 = df2023_ger2[columns_test]

In [None]:
df2023_ger2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 306 entries, 0 to 305
Data columns (total 6 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Date      306 non-null    object 
 1   HomeTeam  306 non-null    object 
 2   AwayTeam  306 non-null    object 
 3   B365H     306 non-null    float64
 4   B365D     306 non-null    float64
 5   B365A     306 non-null    float64
dtypes: float64(3), object(3)
memory usage: 14.5+ KB


In [None]:
df2023_ger2 = calculate_and_apply_overall_averages(season_dfs, df2023_ger2)

In [None]:
df2023_ger2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 306 entries, 0 to 305
Data columns (total 16 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   Date                         306 non-null    object 
 1   HomeTeam                     306 non-null    object 
 2   AwayTeam                     306 non-null    object 
 3   B365H                        306 non-null    float64
 4   B365D                        306 non-null    float64
 5   B365A                        306 non-null    float64
 6   HomeTeam_WinRate             306 non-null    float64
 7   AwayTeam_WinRate             306 non-null    float64
 8   HomeTeam_GoalsAvg            306 non-null    float64
 9   AwayTeam_GoalsAvg            306 non-null    float64
 10  HomeTeam_goals_conceded_avg  306 non-null    float64
 11  AwayTeam_goals_conceded_avg  306 non-null    float64
 12  H_goal_ratio                 306 non-null    float64
 13  A_goal_ratio        

In [None]:
merged_df = ger2.copy()

# Calculate head-to-head stats using merged data
head_to_head_stats = calculate_head_to_head_stats(merged_df)

# Apply the adjusted win-loss ratio to df2023
df2023_ger2 = apply_adjusted_win_loss_ratio_to_2023(df2023_ger2, head_to_head_stats)

In [None]:
df2023_ger2 = process_time_data(df2023_ger2, 2023)

In [None]:
df2023_ger2 = add_probability_B365(df2023_ger2)
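
The `Broker_prob_H`, `Broker_prob_D` and `Broker_prob_A` values in the tables are consistent with the reciprocal of the Bet365 decimal odds (for example, 1/1.5 ≈ 0.67). Such raw values sum to slightly more than 1 because of the bookmaker's margin. The following is a minimal sketch, assuming that this is what `add_probability_B365` computes; the helper `implied_probabilities` is hypothetical and only illustrates the raw and the normalized variants.

```python
def implied_probabilities(odds_home, odds_draw, odds_away, normalize=True):
    """Convert decimal odds to implied probabilities (hypothetical helper)."""
    raw = [1.0 / odds_home, 1.0 / odds_draw, 1.0 / odds_away]
    if not normalize:
        return raw
    total = sum(raw)  # exceeds 1 by the bookmaker margin (overround)
    return [p / total for p in raw]


# Odds taken from the first training row shown above (Leverkusen vs Ingolstadt).
raw = implied_probabilities(1.5, 4.33, 6.5, normalize=False)
fair = implied_probabilities(1.5, 4.33, 6.5)
print([round(p, 2) for p in raw])
print(round(sum(raw), 3), round(sum(fair), 3))
```

Whether to normalize is a modelling choice: the raw values keep the margin as a signal, while the normalized ones can be read as proper probabilities.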

In [None]:
df2023_ger2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 306 entries, 0 to 305
Data columns (total 21 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   HomeTeam                     306 non-null    object 
 1   AwayTeam                     306 non-null    object 
 2   B365H                        306 non-null    float64
 3   B365D                        306 non-null    float64
 4   B365A                        306 non-null    float64
 5   HomeTeam_WinRate             306 non-null    float64
 6   AwayTeam_WinRate             306 non-null    float64
 7   HomeTeam_GoalsAvg            306 non-null    float64
 8   AwayTeam_GoalsAvg            306 non-null    float64
 9   HomeTeam_goals_conceded_avg  306 non-null    float64
 10  AwayTeam_goals_conceded_avg  306 non-null    float64
 11  H_goal_ratio                 306 non-null    float64
 12  A_goal_ratio                 306 non-null    float64
 13  attack_strength_home

In [None]:
train = ger2[ger2['Year'] < 2022]
validation = ger2[ger2['Year'] == 2022]

In [None]:
X_train = train.drop(['FTR', 'total_goal'], axis=1)
y_train = train['FTR']
X_validation = validation.drop(['FTR', 'total_goal'], axis=1)
y_validation = validation['FTR']

In [None]:
X_test = df2023_ger2.copy()
X_test = X_test[X_train.columns]

In [None]:
X_train.shape , y_train.shape, X_validation.shape, y_validation.shape, X_test.shape

((1196, 21), (1196,), (300, 21), (300,), (306, 21))

In [None]:
y_train_enc = y_train.map({'H': 1, 'D': 0, 'A': 2})
y_validation_enc = y_validation.map({'H': 1, 'D': 0, 'A': 2})

In [None]:

from sklearn.preprocessing import LabelEncoder
# Update the set of all teams to include teams from X_test
all_teams = set(X_train['HomeTeam'].unique()).union(set(X_train['AwayTeam'].unique()))
all_teams = all_teams.union(set(X_validation['HomeTeam'].unique())).union(set(X_validation['AwayTeam'].unique()))
all_teams = all_teams.union(set(X_test['HomeTeam'].unique())).union(set(X_test['AwayTeam'].unique()))

# Convert the set to a list
all_teams_list = list(all_teams)

# Fit the LabelEncoder with the updated list of all teams
encoder = LabelEncoder()
encoder.fit(all_teams_list)

# Transform 'HomeTeam' and 'AwayTeam' in all datasets
X_train['HomeTeam'] = encoder.transform(X_train['HomeTeam'])
X_train['AwayTeam'] = encoder.transform(X_train['AwayTeam'])
X_validation['HomeTeam'] = encoder.transform(X_validation['HomeTeam'])
X_validation['AwayTeam'] = encoder.transform(X_validation['AwayTeam'])
X_test['HomeTeam'] = encoder.transform(X_test['HomeTeam'])
X_test['AwayTeam'] = encoder.transform(X_test['AwayTeam'])
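
`LabelEncoder` assigns integer codes in sorted order, so a linear or distance-based model may read a numeric ordering into the team IDs that has no footballing meaning. Tree-based models are less sensitive to this. As an alternative, here is a minimal sketch of one-hot encoding with `pd.get_dummies`; the small `fixtures` frame is hypothetical and only reuses team names from the tables above.

```python
import pandas as pd

# Hypothetical fixtures; team names taken from the tables above.
fixtures = pd.DataFrame({
    "HomeTeam": ["Leverkusen", "Mainz", "Hertha"],
    "AwayTeam": ["Mainz", "Hertha", "Leverkusen"],
    "B365H": [1.53, 2.0, 2.05],
})

# Use a shared category list so home and away columns cover the same teams.
teams = sorted(set(fixtures["HomeTeam"]) | set(fixtures["AwayTeam"]))
for col in ["HomeTeam", "AwayTeam"]:
    fixtures[col] = pd.Categorical(fixtures[col], categories=teams)

encoded = pd.get_dummies(fixtures, columns=["HomeTeam", "AwayTeam"], dtype=int)
print(encoded.columns.tolist())
```

The trade-off is width: with many teams the one-hot frame gains two columns per team, which may or may not help on a data set of this size.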


In [None]:
from lazypredict.Supervised import LazyClassifier
clf = LazyClassifier(verbose = 0, ignore_warnings = False, custom_metric = None)
models,pred = clf.fit(X_train, X_validation, y_train_enc, y_validation_enc)

  3%|▎         | 1/29 [00:00<00:18,  1.51it/s]

ROC AUC couldn't be calculated for AdaBoostClassifier
multi_class must be in ('ovo', 'ovr')


  7%|▋         | 2/29 [00:00<00:12,  2.23it/s]

ROC AUC couldn't be calculated for BaggingClassifier
multi_class must be in ('ovo', 'ovr')
ROC AUC couldn't be calculated for BernoulliNB
multi_class must be in ('ovo', 'ovr')


 24%|██▍       | 7/29 [00:03<00:08,  2.56it/s]

ROC AUC couldn't be calculated for CalibratedClassifierCV
multi_class must be in ('ovo', 'ovr')
CategoricalNB model failed to execute
Negative values in data passed to CategoricalNB (input X)
ROC AUC couldn't be calculated for DecisionTreeClassifier
multi_class must be in ('ovo', 'ovr')
ROC AUC couldn't be calculated for DummyClassifier
multi_class must be in ('ovo', 'ovr')
ROC AUC couldn't be calculated for ExtraTreeClassifier
multi_class must be in ('ovo', 'ovr')


 38%|███▊      | 11/29 [00:03<00:04,  4.26it/s]

ROC AUC couldn't be calculated for ExtraTreesClassifier
multi_class must be in ('ovo', 'ovr')
ROC AUC couldn't be calculated for GaussianNB
multi_class must be in ('ovo', 'ovr')
ROC AUC couldn't be calculated for KNeighborsClassifier
multi_class must be in ('ovo', 'ovr')


 45%|████▍     | 13/29 [00:04<00:03,  4.75it/s]

ROC AUC couldn't be calculated for LabelPropagation
multi_class must be in ('ovo', 'ovr')
ROC AUC couldn't be calculated for LabelSpreading
multi_class must be in ('ovo', 'ovr')
ROC AUC couldn't be calculated for LinearDiscriminantAnalysis
multi_class must be in ('ovo', 'ovr')


 62%|██████▏   | 18/29 [00:04<00:01,  6.99it/s]

ROC AUC couldn't be calculated for LinearSVC
multi_class must be in ('ovo', 'ovr')
ROC AUC couldn't be calculated for LogisticRegression
multi_class must be in ('ovo', 'ovr')
ROC AUC couldn't be calculated for NearestCentroid
multi_class must be in ('ovo', 'ovr')
ROC AUC couldn't be calculated for NuSVC
multi_class must be in ('ovo', 'ovr')
ROC AUC couldn't be calculated for PassiveAggressiveClassifier
multi_class must be in ('ovo', 'ovr')
ROC AUC couldn't be calculated for Perceptron
multi_class must be in ('ovo', 'ovr')
ROC AUC couldn't be calculated for QuadraticDiscriminantAnalysis
multi_class must be in ('ovo', 'ovr')


 86%|████████▌ | 25/29 [00:05<00:00,  9.68it/s]

ROC AUC couldn't be calculated for RandomForestClassifier
multi_class must be in ('ovo', 'ovr')
ROC AUC couldn't be calculated for RidgeClassifier
multi_class must be in ('ovo', 'ovr')
ROC AUC couldn't be calculated for RidgeClassifierCV
multi_class must be in ('ovo', 'ovr')
ROC AUC couldn't be calculated for SGDClassifier
multi_class must be in ('ovo', 'ovr')


 93%|█████████▎| 27/29 [00:05<00:00, 10.87it/s]

ROC AUC couldn't be calculated for SVC
multi_class must be in ('ovo', 'ovr')
StackingClassifier model failed to execute
StackingClassifier.__init__() missing 1 required positional argument: 'estimators'
ROC AUC couldn't be calculated for XGBClassifier
multi_class must be in ('ovo', 'ovr')
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000321 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 688
[LightGBM] [Info] Number of data points in the train set: 1196, number of used features: 21
[LightGBM] [Info] Start training from score -1.260738
[LightGBM] [Info] Start training from score -0.860201
[LightGBM] [Info] Start training from score -1.225952


100%|██████████| 29/29 [00:05<00:00,  4.95it/s]

ROC AUC couldn't be calculated for LGBMClassifier
multi_class must be in ('ovo', 'ovr')
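
The repeated `multi_class must be in ('ovo', 'ovr')` messages come from scikit-learn's `roc_auc_score`, which requires an explicit multi-class strategy for a three-class target; this is why the ROC AUC column in the results table is empty. The sketch below shows how the metric could be computed separately from class probabilities. It uses synthetic data from `make_classification` as a stand-in for the notebook's features, so the printed score says nothing about the football models.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic three-class data standing in for the H/D/A target.
X, y = make_classification(n_samples=600, n_features=8, n_informative=5,
                           n_classes=3, random_state=42)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.25,
                                          stratify=y, random_state=42)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
proba = model.predict_proba(X_va)  # shape (n_samples, 3), rows sum to 1

# multi_class must be given explicitly when there are more than two classes.
auc_ovr = roc_auc_score(y_va, proba, multi_class="ovr")
print(f"one-vs-rest ROC AUC: {auc_ovr:.3f}")
```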





In [None]:
models

Unnamed: 0_level_0,Accuracy,Balanced Accuracy,ROC AUC,F1 Score,Time Taken
Model,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
XGBClassifier,0.68,0.66,,0.67,0.22
CalibratedClassifierCV,0.68,0.66,,0.66,2.19
LGBMClassifier,0.67,0.65,,0.67,0.26
ExtraTreesClassifier,0.68,0.65,,0.67,0.39
LogisticRegression,0.66,0.65,,0.66,0.05
DecisionTreeClassifier,0.65,0.65,,0.66,0.05
LinearDiscriminantAnalysis,0.66,0.65,,0.66,0.05
RidgeClassifier,0.68,0.65,,0.64,0.02
LinearSVC,0.67,0.65,,0.65,0.3
RandomForestClassifier,0.67,0.64,,0.66,0.37


In [None]:
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier, ExtraTreesClassifier
from sklearn.metrics import accuracy_score, f1_score
from sklearn.tree import DecisionTreeClassifier
from xgboost import XGBClassifier

# Initialize the models
bagging_clf = BaggingClassifier(n_estimators=100, random_state=42)
extra_trees_clf = ExtraTreesClassifier(n_estimators=100, random_state=42)
random_forest_clf = RandomForestClassifier(n_estimators=100, random_state=42)
decision_tree_clf = DecisionTreeClassifier(random_state=42)
xgb_clf = XGBClassifier(random_state=42)  # Default parameters, adjust as necessary

# Fit the models
bagging_clf.fit(X_train, y_train_enc)
extra_trees_clf.fit(X_train, y_train_enc)
random_forest_clf.fit(X_train, y_train_enc)
decision_tree_clf.fit(X_train, y_train_enc)
xgb_clf.fit(X_train, y_train_enc)

# Predict y_validation
y_pred_bagging = bagging_clf.predict(X_validation)
y_pred_extra_trees = extra_trees_clf.predict(X_validation)
y_pred_random_forest = random_forest_clf.predict(X_validation)
y_pred_decision_tree = decision_tree_clf.predict(X_validation)
y_pred_xgb = xgb_clf.predict(X_validation)

# Calculate accuracy and F1 score for each model
accuracy_bagging = accuracy_score(y_validation_enc, y_pred_bagging)
f1_bagging = f1_score(y_validation_enc, y_pred_bagging, average='macro')

accuracy_extra_trees = accuracy_score(y_validation_enc, y_pred_extra_trees)
f1_extra_trees = f1_score(y_validation_enc, y_pred_extra_trees, average='macro')

accuracy_random_forest = accuracy_score(y_validation_enc, y_pred_random_forest)
f1_random_forest = f1_score(y_validation_enc, y_pred_random_forest, average='macro')

accuracy_decision_tree = accuracy_score(y_validation_enc, y_pred_decision_tree)
f1_decision_tree = f1_score(y_validation_enc, y_pred_decision_tree, average='macro')

accuracy_xgb = accuracy_score(y_validation_enc, y_pred_xgb)
f1_xgb = f1_score(y_validation_enc, y_pred_xgb, average='macro')

# Print out the performance
print(f'Bagging Classifier - Accuracy: {accuracy_bagging}, F1 Score: {f1_bagging}')
print(f'Extra Trees Classifier - Accuracy: {accuracy_extra_trees}, F1 Score: {f1_extra_trees}')
print(f'Random Forest Classifier - Accuracy: {accuracy_random_forest}, F1 Score: {f1_random_forest}')
print(f'Decision Tree Classifier - Accuracy: {accuracy_decision_tree}, F1 Score: {f1_decision_tree}')
print(f'XGB Classifier - Accuracy: {accuracy_xgb}, F1 Score: {f1_xgb}')

Bagging Classifier - Accuracy: 0.6766666666666666, F1 Score: 0.6568429790843303
Extra Trees Classifier - Accuracy: 0.6766666666666666, F1 Score: 0.6498269388231476
Random Forest Classifier - Accuracy: 0.68, F1 Score: 0.6478689710403266
Decision Tree Classifier - Accuracy: 0.66, F1 Score: 0.6529057975762544
XGB Classifier - Accuracy: 0.68, F1 Score: 0.6578254027830597
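
To put these accuracies in context, it can help to compare them with a naive baseline that always predicts the bookmaker's favourite, that is, the outcome with the lowest decimal odds. The following sketch uses a small hypothetical sample; in the notebook the odds would come from the `B365H`, `B365D` and `B365A` columns of `X_validation` and the labels from `y_validation`.

```python
import numpy as np
import pandas as pd

# Hypothetical odds and results, for illustration only.
odds = pd.DataFrame({
    "B365H": [1.5, 3.4, 2.15, 4.75],
    "B365D": [4.33, 3.3, 3.5, 3.75],
    "B365A": [6.5, 2.2, 3.3, 1.75],
})
actual = pd.Series(["H", "A", "D", "A"])

# The favourite is the outcome with the lowest decimal odds.
favourite = odds.idxmin(axis=1).map({"B365H": "H", "B365D": "D", "B365A": "A"})
baseline_accuracy = float(np.mean(favourite.values == actual.values))
print(favourite.tolist(), baseline_accuracy)
```

A model that cannot beat this baseline on the validation season is adding little beyond what the odds already encode.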


In [None]:
from sklearn.model_selection import GridSearchCV

param_grid_xgb = {
    'n_estimators': [50, 100, 200],
    'max_depth': [3, 6, 10],
    'learning_rate': [0.01, 0.1, 0.2],
    'subsample': [0.5, 0.7, 1],
    'colsample_bytree': [0.5, 0.7, 1]
}

param_grid_random_forest = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

param_grid_extra_trees = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}



grid_search_xgb = GridSearchCV(xgb_clf, param_grid_xgb, cv=5, scoring='f1_macro', verbose=1, n_jobs=-1)
grid_search_random_forest = GridSearchCV(random_forest_clf, param_grid_random_forest, cv=5, scoring='f1_macro', verbose=1, n_jobs=-1)
grid_search_extra_trees = GridSearchCV(extra_trees_clf, param_grid_extra_trees, cv=5, scoring='f1_macro', verbose=1, n_jobs=-1)

grid_search_xgb.fit(X_train, y_train_enc)
grid_search_random_forest.fit(X_train, y_train_enc)
grid_search_extra_trees.fit(X_train, y_train_enc)

best_params_xgb = grid_search_xgb.best_params_
best_score_xgb = grid_search_xgb.best_score_

best_params_random_forest = grid_search_random_forest.best_params_
best_score_random_forest = grid_search_random_forest.best_score_

best_params_extra_trees = grid_search_extra_trees.best_params_
best_score_extra_trees = grid_search_extra_trees.best_score_


y_pred_xgb = grid_search_xgb.best_estimator_.predict(X_validation)
f1_score_xgb = f1_score(y_validation_enc, y_pred_xgb, average='macro')

y_pred_random_forest = grid_search_random_forest.best_estimator_.predict(X_validation)
f1_score_random_forest = f1_score(y_validation_enc, y_pred_random_forest, average='macro')

y_pred_extra_trees = grid_search_extra_trees.best_estimator_.predict(X_validation)
f1_score_extra_trees = f1_score(y_validation_enc, y_pred_extra_trees, average='macro')


results = {
    "XGB Classifier": {
        "Best Parameters": best_params_xgb,
        "Best Score": best_score_xgb,
        "F1 Score on Validation": f1_score_xgb
    },
    "Random Forest Classifier": {
        "Best Parameters": best_params_random_forest,
        "Best Score": best_score_random_forest,
        "F1 Score on Validation": f1_score_random_forest
    },
    "Extra Trees Classifier": {
        "Best Parameters": best_params_extra_trees,
        "Best Score": best_score_extra_trees,
        "F1 Score on Validation": f1_score_extra_trees
    }
}




Fitting 5 folds for each of 243 candidates, totalling 1215 fits
Fitting 5 folds for each of 108 candidates, totalling 540 fits
Fitting 5 folds for each of 108 candidates, totalling 540 fits


In [None]:
results

{'XGB Classifier': {'Best Parameters': {'colsample_bytree': 1,
   'learning_rate': 0.01,
   'max_depth': 6,
   'n_estimators': 100,
   'subsample': 0.7},
  'Best Score': 0.6439568252089848,
  'F1 Score on Validation': 0.6742240408907075},
 'Random Forest Classifier': {'Best Parameters': {'max_depth': None,
   'min_samples_leaf': 2,
   'min_samples_split': 2,
   'n_estimators': 50},
  'Best Score': 0.6447831164157104,
  'F1 Score on Validation': 0.6362768127180346},
 'Extra Trees Classifier': {'Best Parameters': {'max_depth': 20,
   'min_samples_leaf': 2,
   'min_samples_split': 5,
   'n_estimators': 200},
  'Best Score': 0.645262415520134,
  'F1 Score on Validation': 0.6471871591086178}}
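
With `cv=5`, `GridSearchCV` builds its folds without regard to match dates, so a fold may train on later matches and validate on earlier ones. If the rows are sorted chronologically (an assumption, since the notebook does not sort them here), `TimeSeriesSplit` offers folds that always validate on matches after the training window. A minimal sketch on placeholder data:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Twelve placeholder matches, assumed to be sorted by date already.
X_sorted = np.arange(12).reshape(-1, 1)

splitter = TimeSeriesSplit(n_splits=3)
folds = list(splitter.split(X_sorted))
for train_idx, val_idx in folds:
    # Every training index precedes every validation index.
    print("train:", train_idx.tolist(), "validate:", val_idx.tolist())

# The same object can be handed to GridSearchCV, e.g. cv=splitter.
```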

In [None]:
optimal_xgb_clf = XGBClassifier(
    colsample_bytree=1,
    learning_rate=0.01,
    max_depth=6,
    n_estimators=100,
    subsample=0.7,
    random_state=42  # Optional for reproducibility
)

In [None]:
optimal_xgb_clf.fit(X_train, y_train_enc)
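
Retyping the tuned values by hand can drift from what the search actually selected. With the default `refit=True`, `GridSearchCV` already refits the best configuration on the full training data and exposes it as `best_estimator_`; alternatively, `best_params_` can be unpacked into a fresh model. The sketch below illustrates both on synthetic data with a decision tree, chosen only to keep the example small.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic three-class data, for illustration only.
X, y = make_classification(n_samples=300, n_features=6, n_informative=4,
                           n_classes=3, random_state=42)

search = GridSearchCV(
    DecisionTreeClassifier(random_state=42),
    param_grid={"max_depth": [2, 4, 6]},
    cv=5,
    scoring="f1_macro",
)
search.fit(X, y)

# Rebuild an identical model from best_params_ instead of retyping values.
rebuilt = DecisionTreeClassifier(random_state=42, **search.best_params_).fit(X, y)
same_predictions = bool(
    (rebuilt.predict(X) == search.best_estimator_.predict(X)).all()
)
print(search.best_params_, same_predictions)
```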

In [None]:
# Make predictions on the test set
y_pred_test_xgb = optimal_xgb_clf.predict(X_test)

In [None]:
# Define the inverse mapping
inverse_mapping = {1: 'H', 0: 'D', 2: 'A'}

# Convert y_pred_test_xgb back to original form
y_pred_test_xgb_original = [inverse_mapping[label] for label in y_pred_test_xgb]
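
The inverse mapping has to stay in sync with the encoding applied to `y_train` earlier. One way to guarantee this, sketched below with a hypothetical list of predictions, is to derive it from the forward mapping rather than typing it a second time.

```python
# Encoding used for the target earlier in the notebook.
label_mapping = {"H": 1, "D": 0, "A": 2}

# Invert it programmatically so the two dictionaries cannot drift apart.
inverse_mapping = {code: label for label, code in label_mapping.items()}

encoded_predictions = [1, 0, 2, 1]  # hypothetical model output
decoded = [inverse_mapping[code] for code in encoded_predictions]
print(decoded)
```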


In [None]:
predictions_df = pd.DataFrame(y_pred_test_xgb_original, columns=['Predictions'])

In [None]:
predictions_df.to_csv('germany_2.csv', index=False)

In [None]:
train = ger2[ger2['Year'] < 2022]
validation = ger2[ger2['Year'] == 2022]

In [None]:
X_train = train.drop(['FTR', 'total_goal'], axis=1)
y_train = train['total_goal']
X_validation = validation.drop(['FTR', 'total_goal'], axis=1)
y_validation = validation['total_goal']

In [None]:
X_test = df2023_ger2.copy()
X_test = X_test[X_train.columns]

In [None]:
X_train.shape , y_train.shape, X_validation.shape, y_validation.shape, X_test.shape

((1196, 21), (1196,), (300, 21), (300,), (306, 21))

In [None]:

from sklearn.preprocessing import LabelEncoder
# Update the set of all teams to include teams from X_test
all_teams = set(X_train['HomeTeam'].unique()).union(set(X_train['AwayTeam'].unique()))
all_teams = all_teams.union(set(X_validation['HomeTeam'].unique())).union(set(X_validation['AwayTeam'].unique()))
all_teams = all_teams.union(set(X_test['HomeTeam'].unique())).union(set(X_test['AwayTeam'].unique()))

# Convert the set to a list
all_teams_list = list(all_teams)

# Fit the LabelEncoder with the updated list of all teams
encoder = LabelEncoder()
encoder.fit(all_teams_list)

# Transform 'HomeTeam' and 'AwayTeam' in all datasets
X_train['HomeTeam'] = encoder.transform(X_train['HomeTeam'])
X_train['AwayTeam'] = encoder.transform(X_train['AwayTeam'])
X_validation['HomeTeam'] = encoder.transform(X_validation['HomeTeam'])
X_validation['AwayTeam'] = encoder.transform(X_validation['AwayTeam'])
X_test['HomeTeam'] = encoder.transform(X_test['HomeTeam'])
X_test['AwayTeam'] = encoder.transform(X_test['AwayTeam'])

In [None]:

from lazypredict.Supervised import LazyRegressor

# Create an instance of LazyRegressor
reg = LazyRegressor(verbose=0, ignore_warnings=False, custom_metric=None)

# Fit the model
models, predictions = reg.fit(X_train, X_validation, y_train, y_validation)

 21%|██▏       | 9/42 [00:02<00:10,  3.17it/s]

GammaRegressor model failed to execute
Some value(s) of y are out of the valid range of the loss 'HalfGammaLoss'.


 79%|███████▊  | 33/42 [00:10<00:02,  4.38it/s]

QuantileRegressor model failed to execute
Solver interior-point is not anymore available in SciPy >= 1.11.0.


100%|██████████| 42/42 [00:12<00:00,  3.34it/s]

[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000384 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 688
[LightGBM] [Info] Number of data points in the train set: 1196, number of used features: 21
[LightGBM] [Info] Start training from score 2.888796





In [None]:
models

Unnamed: 0_level_0,Adjusted R-Squared,R-Squared,RMSE,Time Taken
Model,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
BayesianRidge,0.11,0.17,1.4,0.02
SGDRegressor,0.1,0.17,1.4,0.02
ElasticNetCV,0.1,0.17,1.4,0.36
LassoCV,0.1,0.17,1.4,0.37
LassoLarsCV,0.1,0.17,1.4,0.12
LassoLarsIC,0.1,0.16,1.41,0.07
OrthogonalMatchingPursuitCV,0.1,0.16,1.41,0.02
RidgeCV,0.1,0.16,1.41,0.02
PoissonRegressor,0.1,0.16,1.41,0.11
Ridge,0.09,0.15,1.42,0.01


In [None]:
from sklearn.linear_model import ElasticNetCV
from sklearn.svm import SVR
from sklearn.ensemble import RandomForestRegressor
from xgboost import XGBRegressor
from sklearn.linear_model import PoissonRegressor
from sklearn.metrics import mean_absolute_error, r2_score

# Initialize the models
elastic_net_cv = ElasticNetCV(cv=5, random_state=42)  # Adjust parameters as necessary
poisson_regressor = PoissonRegressor()
svr = SVR()  # Default parameters, adjust as necessary
random_forest_reg = RandomForestRegressor(n_estimators=100, random_state=42)
xgb_reg = XGBRegressor(random_state=42)  # Default parameters, adjust as necessary

# Fit the models
elastic_net_cv.fit(X_train, y_train)
poisson_regressor.fit(X_train, y_train)
svr.fit(X_train, y_train)
random_forest_reg.fit(X_train, y_train)
xgb_reg.fit(X_train, y_train)

# Predict y_validation
y_pred_elastic_net_cv = elastic_net_cv.predict(X_validation)
y_pred_poisson_regressor = poisson_regressor.predict(X_validation)
y_pred_svr = svr.predict(X_validation)
y_pred_random_forest = random_forest_reg.predict(X_validation)
y_pred_xgb = xgb_reg.predict(X_validation)

# Round predictions to the nearest whole number of goals
y_pred_elastic_net_cv_rounded = np.rint(y_pred_elastic_net_cv)
y_pred_poisson_regressor_rounded = np.rint(y_pred_poisson_regressor)
y_pred_svr_rounded = np.rint(y_pred_svr)
y_pred_random_forest_rounded = np.rint(y_pred_random_forest)
y_pred_xgb_rounded = np.rint(y_pred_xgb)

# Calculate MAE and R2 score using rounded predictions
mae_elastic_net_cv = mean_absolute_error(y_validation, y_pred_elastic_net_cv_rounded)
r2_elastic_net_cv = r2_score(y_validation, y_pred_elastic_net_cv_rounded)

mae_poisson_regressor = mean_absolute_error(y_validation, y_pred_poisson_regressor_rounded)
r2_poisson_regressor = r2_score(y_validation, y_pred_poisson_regressor_rounded)

mae_svr = mean_absolute_error(y_validation, y_pred_svr_rounded)
r2_svr = r2_score(y_validation, y_pred_svr_rounded)

mae_random_forest = mean_absolute_error(y_validation, y_pred_random_forest_rounded)
r2_random_forest = r2_score(y_validation, y_pred_random_forest_rounded)

mae_xgb = mean_absolute_error(y_validation, y_pred_xgb_rounded)
r2_xgb = r2_score(y_validation, y_pred_xgb_rounded)

# Print out the performance with rounded predictions
print(f'ElasticNetCV - MAE: {mae_elastic_net_cv}, R2 Score: {r2_elastic_net_cv}')
print(f'Poisson Regressor - MAE: {mae_poisson_regressor}, R2 Score: {r2_poisson_regressor}')
print(f'SVR - MAE: {mae_svr}, R2 Score: {r2_svr}')
print(f'Random Forest Regressor - MAE: {mae_random_forest}, R2 Score: {r2_random_forest}')
print(f'XGB Regressor - MAE: {mae_xgb}, R2 Score: {r2_xgb}')

ElasticNetCV - MAE: 1.0733333333333333, R2 Score: 0.15928059245327364
Poisson Regressor - MAE: 1.13, R2 Score: 0.08733983778065102
SVR - MAE: 1.1766666666666667, R2 Score: -0.0029387563183262966
Random Forest Regressor - MAE: 1.1266666666666667, R2 Score: 0.08310802868226153
XGB Regressor - MAE: 1.26, R2 Score: -0.16233689902433324


In [None]:
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import mean_absolute_error

# Parameter grids
param_grid_poisson = {
    'alpha': [0.01, 0.1, 1, 10],
    'max_iter': [100, 300, 500]
}

param_grid_random_forest = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

# GridSearchCV setup
grid_search_poisson = GridSearchCV(poisson_regressor, param_grid_poisson, cv=5, scoring='neg_mean_absolute_error', verbose=1, n_jobs=-1)
grid_search_random_forest = GridSearchCV(random_forest_reg, param_grid_random_forest, cv=5, scoring='neg_mean_absolute_error', verbose=1, n_jobs=-1)

# Fitting models
grid_search_poisson.fit(X_train, y_train)
grid_search_random_forest.fit(X_train, y_train)

# Best parameters and scores
best_params_poisson = grid_search_poisson.best_params_
best_score_poisson = grid_search_poisson.best_score_

best_params_random_forest = grid_search_random_forest.best_params_
best_score_random_forest = grid_search_random_forest.best_score_

# Predict and calculate MAE
y_pred_poisson = grid_search_poisson.best_estimator_.predict(X_validation)
mae_poisson = mean_absolute_error(y_validation, y_pred_poisson)

y_pred_random_forest = grid_search_random_forest.best_estimator_.predict(X_validation)
mae_random_forest = mean_absolute_error(y_validation, y_pred_random_forest)

# Results
results = {
    "Poisson Regressor": {
        "Best Parameters": best_params_poisson,
        "Best Score (Negative MAE)": best_score_poisson,
        "MAE on Validation": mae_poisson
    },
    "Random Forest Regressor": {
        "Best Parameters": best_params_random_forest,
        "Best Score (Negative MAE)": best_score_random_forest,
        "MAE on Validation": mae_random_forest
    }
}

# ElasticNetCV already uses cross-validation for parameter tuning, so we directly fit it and predict
elastic_net_cv = ElasticNetCV(cv=5, random_state=42).fit(X_train, y_train)
y_pred_elastic_net_cv = elastic_net_cv.predict(X_validation)
mae_elastic_net_cv = mean_absolute_error(y_validation, y_pred_elastic_net_cv)

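# Note: get_params() returns the constructor settings, not tuned values;
# the regularization strength selected by cross-validation is elastic_net_cv.alpha_
results["ElasticNetCV"] = {
    "Best Parameters": elastic_net_cv.get_params(),
    "MAE on Validation": mae_elastic_net_cv
}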

# Print results
for model, info in results.items():
    print(f"{model}:")
    for key, value in info.items():
        print(f"  {key}: {value}")
    print()

Fitting 5 folds for each of 12 candidates, totalling 60 fits
Fitting 5 folds for each of 108 candidates, totalling 540 fits
Poisson Regressor:
  Best Parameters: {'alpha': 0.01, 'max_iter': 500}
  Best Score (Negative MAE): -1.2817049386785433
  MAE on Validation: 1.1142730867841306

Random Forest Regressor:
  Best Parameters: {'max_depth': 10, 'min_samples_leaf': 4, 'min_samples_split': 10, 'n_estimators': 200}
  Best Score (Negative MAE): -1.3167001402652487
  MAE on Validation: 1.1514944972148347

ElasticNetCV:
  Best Parameters: {'alphas': None, 'copy_X': True, 'cv': 5, 'eps': 0.001, 'fit_intercept': True, 'l1_ratio': 0.5, 'max_iter': 1000, 'n_alphas': 100, 'n_jobs': None, 'positive': False, 'precompute': 'auto', 'random_state': 42, 'selection': 'cyclic', 'tol': 0.0001, 'verbose': 0}
  MAE on Validation: 1.1117038554191196
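
Note that the earlier comparison scored predictions after rounding (ElasticNetCV MAE of about 1.07), whereas this cell scores the raw predictions (about 1.11), so the two sets of numbers are not directly comparable. The sketch below, using hypothetical values and a plain-Python MAE, shows how rounding alone can change the metric.

```python
def mean_absolute_error(actual, predicted):
    """Plain-Python MAE, following the usual definition of the metric."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


actual_goals = [3, 2, 5, 1]             # hypothetical match totals
raw_predictions = [2.6, 2.4, 3.4, 2.5]  # hypothetical regressor output
# Python's round, like np.rint, rounds exact halves to the nearest even integer.
rounded_predictions = [round(p) for p in raw_predictions]

mae_raw = mean_absolute_error(actual_goals, raw_predictions)
mae_rounded = mean_absolute_error(actual_goals, rounded_predictions)
print(round(mae_raw, 3), round(mae_rounded, 3))
```

Since the submitted predictions are rounded, scoring the rounded values during model selection keeps the comparison consistent.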



In [None]:

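# Settings for ElasticNetCV, copied from the get_params() output above
# (the defaults plus cv=5 and random_state=42); alpha is re-selected by
# cross-validation when the model is fit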
best_params_elastic_net = {
    'alphas': None,
    'copy_X': True,
    'cv': 5,
    'eps': 0.001,
    'fit_intercept': True,
    'l1_ratio': 0.5,
    'max_iter': 1000,
    'n_alphas': 100,
    'n_jobs': None,
    'positive': False,
    'precompute': 'auto',
    'random_state': 42,
    'selection': 'cyclic',
    'tol': 0.0001,
    'verbose': 0
}

# Initialize and fit the ElasticNetCV model
elastic_net_model = ElasticNetCV(**best_params_elastic_net)
elastic_net_model.fit(X_train, y_train)

# Predict on the test set
y_pred_test_elastic_net = elastic_net_model.predict(X_test)

# Round predictions to nearest integer and convert to int type
y_pred_test_elastic_net_rounded = np.rint(y_pred_test_elastic_net).astype(int)

# y_pred_test_elastic_net_rounded contains the final integer predictions for X_test


In [None]:


# Convert predictions to a DataFrame
predictions_df_elastic_net = pd.DataFrame(y_pred_test_elastic_net_rounded, columns=['Predicted_Total_Goals'])

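# Save to CSV
# Note: this reuses the filename 'germany_2.csv' from the match-result
# predictions saved earlier, so that file is overwritten
predictions_df_elastic_net.to_csv('germany_2.csv', index=False)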


In [None]:
ger1 = label_encode(ger1)

In [None]:
# Scale the DataFrame columns except for the specified columns ['Date', 'HomeTeam', 'AwayTeam', 'total_goal', 'FTR'].


from sklearn.preprocessing import StandardScaler

def scale_dataframe(df, columns_to_exclude=('Date', 'HomeTeam', 'AwayTeam', 'FTR', 'total_goal')):
    # Standardize every column that is not an identifier or a target
    columns_to_scale = [col for col in df.columns if col not in columns_to_exclude]

    # Note: the scaler is fit on every row of df, including rows that are later held out
    scaler = StandardScaler()

    df_scaled = df.copy()
    df_scaled[columns_to_scale] = scaler.fit_transform(df[columns_to_scale])

    return df_scaled
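
Because `scale_dataframe` fits the scaler on the whole frame, the means and standard deviations also reflect rows that are later held out for evaluation. A common alternative, sketched below on a hypothetical frame that borrows the notebook's column names, is to fit `StandardScaler` on the training rows only and reuse it for the held-out rows.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical frame with the notebook's 'Year' column and two features.
matches = pd.DataFrame({
    "Year": [2020, 2020, 2021, 2021, 2022, 2022],
    "B365H": [1.5, 2.0, 2.5, 3.0, 4.0, 1.8],
    "HomeTeam_WinRate": [0.6, 0.5, 0.4, 0.3, 0.2, 0.55],
})
feature_cols = ["B365H", "HomeTeam_WinRate"]

train = matches[matches["Year"] < 2022].copy()
validation = matches[matches["Year"] == 2022].copy()

# Learn mean and standard deviation from the training rows only...
scaler = StandardScaler().fit(train[feature_cols])
# ...then apply the same transformation to both splits.
train[feature_cols] = scaler.transform(train[feature_cols])
validation[feature_cols] = scaler.transform(validation[feature_cols])
print(train[feature_cols].mean().round(6).tolist())
```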



In [None]:
ger1 = scale_dataframe(ger1)

In [None]:
ger1.head()

Unnamed: 0,Date,HomeTeam,AwayTeam,FTR,B365H,B365D,B365A,HomeTeam_WinRate,AwayTeam_WinRate,HomeTeam_GoalsAvg,AwayTeam_GoalsAvg,Prob_H,Prob_D,Prob_A,total_goal,H_goal_conversion_rate,A_goal_conversion_rate,attack_strength_home_team,attack_strength_away_team
0,2016-05-14,16,15,2,-0.602763,0.054858,0.494751,0.803041,-0.774454,-0.130672,-1.410317,1.038303,-0.421435,-1.010968,5,-0.162227,-2.330258,0.216026,-1.362151
1,2015-09-23,17,0,2,-0.310831,-0.565369,-0.267285,1.757673,0.246974,0.064523,0.169537,0.020763,0.718802,-0.241155,6,-0.109048,0.71648,0.696174,0.394878
2,2015-10-17,6,17,0,-0.176093,-0.565369,-0.395111,-0.544675,-0.353866,-0.289831,0.2319,-0.2845,0.718802,0.088764,6,-0.175522,0.235416,-0.177258,0.259985
3,2015-09-12,16,4,0,-0.715044,1.168599,2.09257,0.803041,0.607477,-0.130672,0.044812,1.699705,-1.561672,-1.395874,1,-0.162227,1.037189,0.216026,0.259985
4,2015-08-29,18,12,2,-0.3782,-0.498678,-0.168958,0.129183,-0.774454,-0.271813,-0.807478,0.173394,0.718802,-0.406115,3,-0.215406,-0.726712,-0.133484,-0.687684


In [None]:
ger1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2100 entries, 0 to 2099
Data columns (total 23 columns):
 #   Column                       Non-Null Count  Dtype         
---  ------                       --------------  -----         
 0   Date                         2100 non-null   datetime64[ns]
 1   HomeTeam                     2100 non-null   int64         
 2   AwayTeam                     2100 non-null   int64         
 3   FTR                          2100 non-null   int64         
 4   B365H                        2100 non-null   float64       
 5   B365D                        2100 non-null   float64       
 6   B365A                        2100 non-null   float64       
 7   HomeTeam_WinRate             2100 non-null   float64       
 8   AwayTeam_WinRate             2100 non-null   float64       
 9   HomeTeam_GoalsAvg            2100 non-null   float64       
 10  AwayTeam_GoalsAvg            2100 non-null   float64       
 11  HomeTeam_goals_conceded_avg  2100 non-null 

In [None]:
ger1['Date'] = pd.to_datetime(ger1['Date'], dayfirst=True)
ger1['Year'] = ger1['Date'].dt.year
ger1['Month'] = ger1['Date'].dt.month
ger1['Day'] = ger1['Date'].dt.day
ger1.drop('Date', axis=1, inplace=True)

In [None]:
df = ger1.copy()

In [None]:
df.head()

Unnamed: 0,HomeTeam,AwayTeam,FTR,B365H,B365D,B365A,HomeTeam_WinRate,AwayTeam_WinRate,HomeTeam_GoalsAvg,AwayTeam_GoalsAvg,...,total_goal,H_goal_ratio,A_goal_ratio,attack_strength_home_team,attack_strength_away_team,adjusted_win_lost_ratio_H,adjusted_win_lost_ratio_A,Year,Month,Day
0,16,15,2,-0.602763,0.054858,0.494751,0.803041,-0.774454,-0.130672,-1.410317,...,5,-0.162227,-2.330258,0.216026,-1.362151,1.550241,-1.550241,2016,5,14
1,17,0,2,-0.310831,-0.565369,-0.267285,1.757673,0.246974,0.064523,0.169537,...,6,-0.109048,0.71648,0.696174,0.394878,0.773641,-0.773641,2015,9,23
2,6,17,0,-0.176093,-0.565369,-0.395111,-0.544675,-0.353866,-0.289831,0.2319,...,6,-0.175522,0.235416,-0.177258,0.259985,-1.556158,1.556158,2015,10,17
3,16,4,0,-0.715044,1.168599,2.09257,0.803041,0.607477,-0.130672,0.044812,...,1,-0.162227,1.037189,0.216026,0.259985,-0.002958,0.002958,2015,9,12
4,18,12,2,-0.3782,-0.498678,-0.168958,0.129183,-0.774454,-0.271813,-0.807478,...,3,-0.215406,-0.726712,-0.133484,-0.687684,1.550241,-1.550241,2015,8,29


In [None]:
# Hold out calendar year 2022 for evaluation; all other years form the training set
train_df = df[df['Year'] != 2022]
test_df = df[df['Year'] == 2022]

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split


# FTR is the target; total_goal is excluded from the features as well
X_train = train_df.drop(['FTR', 'total_goal'], axis=1)
y_train = train_df['FTR']
X_test = test_df.drop(['FTR', 'total_goal'], axis=1)
y_test = test_df['FTR']



In [None]:
X_train.shape, X_test.shape, y_train.shape, y_test.shape

((1949, 23), (151, 23), (1949,), (151,))

In [None]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, classification_report

# Initialize the Decision Tree Classifier
dt = DecisionTreeClassifier(random_state=42)

# Fit the model to the training data
dt.fit(X_train, y_train)

# Predictions
y_pred = dt.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
report = classification_report(y_test, y_pred)

print("Accuracy:", accuracy)
print("Classification Report:\n", report)

Accuracy: 0.6622516556291391
Classification Report:
               precision    recall  f1-score   support

           0       0.75      0.82      0.78        49
           1       0.38      0.32      0.35        37
           2       0.73      0.74      0.73        65

    accuracy                           0.66       151
   macro avg       0.62      0.63      0.62       151
weighted avg       0.65      0.66      0.66       151
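
The averages in the report above can be reproduced by hand, which makes the difference between `macro avg` and `weighted avg` explicit. The sketch below is a plain-Python check that uses only the rounded per-class recall and support values printed in the report. The variable names are illustrative, and because the inputs are rounded to two decimals, the averages agree with the report only approximately.

```python
# Per-class recall and support copied from the decision tree report above
# (0, 1, 2 are the encoded FTR classes).
recall = {0: 0.82, 1: 0.32, 2: 0.74}
support = {0: 49, 1: 37, 2: 65}

total = sum(support.values())

# Macro average: every class counts equally.
macro_recall = sum(recall.values()) / len(recall)

# Weighted average: each class is weighted by its number of test matches.
weighted_recall = sum(recall[c] * support[c] for c in recall) / total

# Correctly classified matches per class, recovered from recall * support.
correct = {c: round(recall[c] * support[c]) for c in recall}
accuracy = sum(correct.values()) / total

# Accuracy of always predicting the most frequent class, for reference.
majority_baseline = max(support.values()) / total

print(f"macro recall     : {macro_recall:.2f}")
print(f"weighted recall  : {weighted_recall:.2f}")
print(f"correct per class: {correct}")
print(f"accuracy         : {accuracy:.4f}")
print(f"majority baseline: {majority_baseline:.4f}")
```

Read this way, the tree classifies 100 of the 151 test matches correctly, compared with about 0.43 for always predicting class 2, and class 1 has the lowest recall of the three classes.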



In [None]:
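# Random forest with 100 trees; random_state is fixed for reproducibility
rf = RandomForestClassifier(n_estimators=100, random_state=42)

# Fit the model to the training data
rf.fit(X_train, y_train)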

# Predictions
y_pred = rf.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
report = classification_report(y_test, y_pred)

print("Accuracy:", accuracy)
print("Classification Report:\n", report)

Accuracy: 0.7152317880794702
Classification Report:
               precision    recall  f1-score   support

           0       0.88      0.76      0.81        49
           1       0.58      0.30      0.39        37
           2       0.67      0.92      0.77        65

    accuracy                           0.72       151
   macro avg       0.71      0.66      0.66       151
weighted avg       0.71      0.72      0.69       151



In [None]:
feature_importances = rf.feature_importances_
features = X_train.columns

importances = pd.DataFrame({'Feature': features, 'Importance': feature_importances})
importances = importances.sort_values(by='Importance', ascending=False)

print(importances)

                        Feature  Importance
19    adjusted_win_lost_ratio_A    0.174602
18    adjusted_win_lost_ratio_H    0.172205
5              HomeTeam_WinRate    0.039656
2                         B365H    0.039438
4                         B365A    0.039370
13               Broker_prob__A    0.038100
22                          Day    0.036660
10  AwayTeam_goals_conceded_avg    0.035330
9   HomeTeam_goals_conceded_avg    0.033980
17    attack_strength_away_team    0.033739
11                Broker_prob_H    0.033636
6              AwayTeam_WinRate    0.032652
7             HomeTeam_GoalsAvg    0.031333
8             AwayTeam_GoalsAvg    0.031180
16    attack_strength_home_team    0.029669
14                 H_goal_ratio    0.028215
1                      AwayTeam    0.028039
15                 A_goal_ratio    0.026564
0                      HomeTeam    0.026404
21                        Month    0.025793
3                         B365D    0.024802
12               Broker_prob__D 
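
The values above are the impurity-based importances that `RandomForestClassifier` computes from the training data. A common cross-check is permutation importance measured on held-out data. This is a different technique from the one used in this notebook. The sketch below runs it on synthetic data, so the data, `feature_names` and the `perm_importances` frame are all assumptions made for illustration.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic 3-class data standing in for the match features (assumption).
X, y = make_classification(
    n_samples=600, n_features=8, n_informative=4, n_redundant=0,
    n_classes=3, random_state=42,
)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=42)

model = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_tr, y_tr)

# Shuffle one column at a time on the held-out set and measure the drop in accuracy.
result = permutation_importance(model, X_te, y_te, n_repeats=10, random_state=42)

# Put both importance measures side by side, sorted by the permutation score.
perm_importances = pd.DataFrame({
    "Feature": feature_names,
    "Impurity": model.feature_importances_,
    "Permutation": result.importances_mean,
}).sort_values(by="Permutation", ascending=False)

print(perm_importances)
```

With the objects from this notebook, the equivalent call would presumably be `permutation_importance(rf, X_test, y_test, n_repeats=10, random_state=42)`, with `X_test.columns` as the feature names.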

In [None]:
!pip install lazypredict

Collecting lazypredict
  Downloading lazypredict-0.2.12-py2.py3-none-any.whl (12 kB)
Installing collected packages: lazypredict
Successfully installed lazypredict-0.2.12


In [None]:
from lazypredict.Supervised import LazyClassifier
# Fit LazyPredict's collection of classifiers with default settings and collect their test-set scores
clf = LazyClassifier(verbose=0, ignore_warnings=True, custom_metric=None)
models, pred = clf.fit(X_train, X_test, y_train, y_test)

 97%|█████████▋| 28/29 [00:16<00:00,  1.20it/s]

[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000468 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 1054
[LightGBM] [Info] Number of data points in the train set: 1949, number of used features: 23
[LightGBM] [Info] Start training from score -1.179810
[LightGBM] [Info] Start training from score -1.409654
[LightGBM] [Info] Start training from score -0.801991


100%|██████████| 29/29 [00:16<00:00,  1.74it/s]


In [None]:
models

Model,Accuracy,Balanced Accuracy,ROC AUC,F1 Score,Time Taken
XGBClassifier,0.72,0.68,,0.71,3.38
LGBMClassifier,0.71,0.67,,0.7,0.59
SVC,0.73,0.66,,0.69,0.36
RandomForestClassifier,0.72,0.66,,0.69,1.27
LinearDiscriminantAnalysis,0.7,0.66,,0.69,0.1
NuSVC,0.72,0.65,,0.68,0.69
LogisticRegression,0.7,0.64,,0.68,0.16
BaggingClassifier,0.66,0.63,,0.65,0.49
DecisionTreeClassifier,0.66,0.63,,0.66,0.05
ExtraTreesClassifier,0.68,0.62,,0.66,0.72


In [None]:
from sklearn.model_selection import GridSearchCV

# Define the parameter grid
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

# Create a GridSearchCV object
grid_search = GridSearchCV(estimator=rf, param_grid=param_grid, cv=5, n_jobs=-1, verbose=2)

# Fit the grid search to the data
grid_search.fit(X_train, y_train)

# Best parameters
best_params = grid_search.best_params_
print("Best Parameters:", best_params)

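# Rebuild a RandomForest with the best parameters and refit it on the full training set.
# Note: no random_state is passed here, so the result can vary between runs;
# grid_search.best_estimator_ already holds a refit model that keeps rf's random_state=42.
best_rf = RandomForestClassifier(**best_params)
best_rf.fit(X_train, y_train)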

# Predict and evaluate with the optimized model
optimized_y_pred = best_rf.predict(X_test)
optimized_accuracy = accuracy_score(y_test, optimized_y_pred)
optimized_report = classification_report(y_test, optimized_y_pred)

print("Optimized Accuracy:", optimized_accuracy)
print("Optimized Classification Report:\n", optimized_report)

Fitting 5 folds for each of 108 candidates, totalling 540 fits
Best Parameters: {'max_depth': None, 'min_samples_leaf': 4, 'min_samples_split': 2, 'n_estimators': 200}
Optimized Accuracy: 0.7019867549668874
Optimized Classification Report:
               precision    recall  f1-score   support

           0       0.83      0.71      0.77        49
           1       0.53      0.24      0.33        37
           2       0.67      0.95      0.79        65

    accuracy                           0.70       151
   macro avg       0.68      0.64      0.63       151
weighted avg       0.69      0.70      0.67       151
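
The grid search above uses `cv=5`, which does not take the order of the matches into account, so a validation fold can contain matches played before some of the matches used for training. One alternative is scikit-learn's `TimeSeriesSplit`, which always validates on rows that come after the training rows. This is not what the notebook does. The sketch below uses synthetic data and a deliberately small hypothetical grid (`small_grid`), and assumes the rows are sorted from the oldest to the newest match.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit

# Synthetic stand-in data; rows are assumed to be sorted from oldest to newest match.
rng = np.random.default_rng(42)
X = rng.normal(size=(300, 5))
y = rng.integers(0, 3, size=300)

# Each split trains on an initial block of rows and validates on the block that follows.
tscv = TimeSeriesSplit(n_splits=5)
for train_idx, val_idx in tscv.split(X):
    print(f"train rows 0..{train_idx.max()}  ->  validate rows {val_idx.min()}..{val_idx.max()}")

# The splitter can be passed to GridSearchCV in place of an integer cv.
small_grid = {"n_estimators": [50, 100], "min_samples_leaf": [1, 4]}
search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid=small_grid,
    cv=tscv,
    scoring="accuracy",
)
search.fit(X, y)
print("Best Parameters:", search.best_params_)
```

To apply this to the notebook's data, `train_df` would first need to be put in chronological order, for example by sorting on the `Year`, `Month` and `Day` columns before building `X_train` and `y_train`.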



In [None]:
import xgboost as xgb
xgb_clf = xgb.XGBClassifier(random_state=42)

# Fit the model to the training data
xgb_clf.fit(X_train, y_train)

# Predictions
y_pred = xgb_clf.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
report = classification_report(y_test, y_pred)

print("Accuracy:", accuracy)
print("Classification Report:\n", report)

Accuracy: 0.7218543046357616
Classification Report:
               precision    recall  f1-score   support

           0       0.84      0.76      0.80        49
           1       0.59      0.43      0.50        37
           2       0.70      0.86      0.77        65

    accuracy                           0.72       151
   macro avg       0.71      0.68      0.69       151
weighted avg       0.72      0.72      0.71       151



In [None]:
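# Parameter grid for XGBoost (3 x 3 x 3 x 3 = 81 combinations);
# this reuses the name param_grid from the RandomForest search above
param_grid = {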
    'max_depth': [3, 4, 5],
    'learning_rate': [0.01, 0.1, 0.2],
    'n_estimators': [100, 200, 300],
    'subsample': [0.7, 0.8, 0.9]
}

# Initialize the XGBClassifier
xgb_clf = xgb.XGBClassifier(random_state=42)

# Initialize GridSearchCV
grid_clf = GridSearchCV(xgb_clf, param_grid, scoring='accuracy', cv=3, n_jobs=-1)

# Fit GridSearchCV
grid_clf.fit(X_train, y_train)

# Best parameters and best score
print("Best Parameters:\n", grid_clf.best_params_)
print("Best Score: ", grid_clf.best_score_)

# Predictions (GridSearchCV delegates to the best estimator, refit on the full training set)
y_pred = grid_clf.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
report = classification_report(y_test, y_pred)

print("Accuracy:", accuracy)
print("Classification Report:\n", report)

Best Parameters:
 {'learning_rate': 0.01, 'max_depth': 5, 'n_estimators': 300, 'subsample': 0.7}
Best Score:  0.7336983920034768
Accuracy: 0.7019867549668874
Classification Report:
               precision    recall  f1-score   support

           0       0.83      0.69      0.76        49
           1       0.58      0.30      0.39        37
           2       0.67      0.94      0.78        65

    accuracy                           0.70       151
   macro avg       0.69      0.64      0.64       151
weighted avg       0.70      0.70      0.68       151
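
Because the test set holds only 151 matches, it helps to see how much of the gap between models could be sampling noise. The sketch below converts each reported accuracy back into a count of correct predictions (accuracy times 151) and attaches a rough normal-approximation standard error. It treats matches as independent, which is an assumption, and the dictionary labels are illustrative.

```python
import math

n_test = 151  # number of test matches (total support in the reports above)

# Correct predictions implied by each reported accuracy (accuracy * 151).
correct = {
    "DecisionTree": 100,
    "RandomForest": 108,
    "RandomForest (grid search)": 106,
    "XGBoost": 109,
    "XGBoost (grid search)": 106,
}

std_errors = {}
for name, k in correct.items():
    p = k / n_test
    # Normal-approximation standard error of a proportion.
    se = math.sqrt(p * (1 - p) / n_test)
    std_errors[name] = se
    print(f"{name:<28} accuracy={p:.4f}  +/- {se:.4f}")
```

Under these assumptions each standard error is a little under 0.04, while the random forest and XGBoost variants differ by at most three matches out of 151 (about 0.02), so their ranking on this test set should be read with caution.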



In [None]:
ger1.to_csv("ger1.csv", index=False)

In [None]:
ger2.to_csv("ger2.csv", index=False)