
# Task 1 - Dataset, Background, Significance, Motivation.

For this project, I have selected the [soccer](https://www.kaggle.com/datasets/davidcariboo/player-scores?select=transfers.csv) dataset, which is publicly available on Kaggle. 


## Background and Significance




## Motivation


# Start

# Finish

# Task 2 - Exploratory Data Analysis

## 1 Player mean and red cards

# Data Preprocessing: Feature Engineering, Data Transformation, and Cleaning

In this phase, we prepare the dataset for accurate match outcome predictions by engineering features, transforming raw data, selecting relevant variables, merging multiple datasets, and cleaning the data.

## Feature Engineering: Player Data with Additional Features

This step focuses on improving the player dataset by adding performance metrics.

1. **Loading and Merging Player Data**: Player appearances, details, valuations, and game lineups are loaded and merged. The most recent valuations and positions are extracted to ensure we have up-to-date data for each player.

2. **Aggregating Player Statistics**: We aggregate player stats such as goals, assists, yellow and red cards, and minutes played. Players with fewer than 1000 minutes are filtered out.

3. **Per-Minute Performance Metrics**: Goals, assists, and cards are normalized by minutes played, and extreme values are capped for better visualization.

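4. **Performance Score Calculation**: A weighted performance score is computed from the per-minute rates (60% goals, 30% assists, 10% yellow cards) to rank players based on their contributions. In the code, the yellow-card term is added to the score rather than subtracted.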

5. **Merging Additional Features**: Recent valuations and positions are merged into the dataset, finalizing the player stats for analysis.



In [2]:
import pandas as pd

# Load the appearance, player, and valuation data
df_appearances = pd.read_csv("./data/appearances.csv")
df_players = pd.read_csv("./data/players.csv")
df_valuations = pd.read_csv("./data/player_valuations.csv")  
df_valuations['date'] = pd.to_datetime(df_valuations['date'])  
df_game_lineups = pd.read_csv("./data/game_lineups.csv")

# Extract the most recent valuation for each player
df_recent_valuations = df_valuations.sort_values(by='date').groupby('player_id').tail(1)
df_recent_position = df_game_lineups.sort_values(by='date').groupby('player_id').tail(1)[['player_id', 'position']]  

# Merge the filtered appearances data with the players data to include league info
df_merged = pd.merge(df_appearances, df_players[['player_id', 'current_club_domestic_competition_id','current_club_id', 'name']], on='player_id')

# Group by 'player_id', 'name', and 'current_club_domestic_competition_id' (league) and sum the relevant stats
player_stats = df_merged.groupby(['player_id', 'name', 'current_club_domestic_competition_id', 'current_club_id']).agg({
    'yellow_cards': 'sum',
    'red_cards': 'sum',
    'goals': 'sum',
    'assists': 'sum',
    'minutes_played': 'sum'
}).reset_index()

# Filter out players with fewer than 1000 minutes played (roughly 11 full 90-minute matches)
min_minutes_threshold = 1000
player_stats = player_stats[player_stats['minutes_played'] >= min_minutes_threshold]

# Calculate the average yellow cards, red cards, goals, and assists per minute played
player_stats['yellow_cards_per_minute'] = player_stats['yellow_cards'] / player_stats['minutes_played']
player_stats['red_cards_per_minute'] = player_stats['red_cards'] / player_stats['minutes_played']
player_stats['goals_per_minute'] = player_stats['goals'] / player_stats['minutes_played']
player_stats['assists_per_minute'] = player_stats['assists'] / player_stats['minutes_played']

# Cap extreme values for better visualization
player_stats['yellow_cards_per_minute'] = player_stats['yellow_cards_per_minute'].clip(upper=0.05)
player_stats['red_cards_per_minute'] = player_stats['red_cards_per_minute'].clip(upper=0.05)
player_stats['goals_per_minute'] = player_stats['goals_per_minute'].clip(upper=0.05)
player_stats['assists_per_minute'] = player_stats['assists_per_minute'].clip(upper=0.05)

# Calculate per-minute stats for each match
df_merged['goals_per_minute'] = df_merged['goals'] / df_merged['minutes_played']
df_merged['assists_per_minute'] = df_merged['assists'] / df_merged['minutes_played']
df_merged['yellow_cards_per_minute'] = df_merged['yellow_cards'] / df_merged['minutes_played']
df_merged['red_cards_per_minute'] = df_merged['red_cards'] / df_merged['minutes_played']

# Group by 'player_id' and calculate the standard deviation and mean for per-minute stats
player_variance_stats_per_minute = df_merged.groupby('player_id').agg({
    'goals_per_minute': ['std', 'mean'],
    'assists_per_minute': ['std', 'mean'],
    'yellow_cards_per_minute': ['std', 'mean'],
    'red_cards_per_minute': ['std', 'mean']
}).reset_index()

# Flatten the column names
player_variance_stats_per_minute.columns = ['player_id', 'goals_per_minute_std', 'goals_per_minute_mean',
                                            'assists_per_minute_std', 'assists_per_minute_mean',
                                            'yellow_cards_per_minute_std', 'yellow_cards_per_minute_mean',
                                            'red_cards_per_minute_std', 'red_cards_per_minute_mean']

# Merge the variance stats per minute with the original player_stats
player_stats = pd.merge(player_stats, player_variance_stats_per_minute, on='player_id')

# Merge the most recent player valuations into the player_stats dataframe
player_stats = pd.merge(player_stats, df_recent_valuations[['player_id', 'market_value_in_eur']], on='player_id')

# Merge the most recent player position into the player_stats dataframe
player_stats = pd.merge(player_stats, df_recent_position, on='player_id', how='left')

# Define a ranking metric (performance score)
player_stats['performance_score'] = (
    player_stats['goals_per_minute'] * 0.6 +  
    player_stats['assists_per_minute'] * 0.3 +  
    player_stats['yellow_cards_per_minute'] * 0.1  
)
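
The "most recent record per player" step above relies on sorting by date and keeping the last row of each group. A minimal sketch of that pattern on made-up data (the values below are illustrative, not taken from the dataset):

```python
import pandas as pd

# Toy valuation history: two players, several dated valuations each (made-up values)
valuations = pd.DataFrame({
    "player_id": [1, 1, 2, 2, 2],
    "date": pd.to_datetime(
        ["2021-01-01", "2023-06-01", "2020-05-01", "2022-03-01", "2021-09-01"]
    ),
    "market_value_in_eur": [1_000_000, 2_500_000, 400_000, 900_000, 650_000],
})

# Sort chronologically, then keep the last (most recent) row of every player
latest = (
    valuations.sort_values("date")
    .groupby("player_id")
    .tail(1)
    .sort_values("player_id")
    .reset_index(drop=True)
)
print(latest)
```

Because `tail(1)` keeps whichever row comes last after sorting, the date column has to sort chronologically (datetimes or ISO-formatted strings) for this to return the latest record.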


# Data Transformation: Player Position Mapping

In this step, we map specific player positions to broader categories: 'Goalkeeper', 'Defenders', 'Midfielders', and 'Attackers'. This transformation makes it easier to analyze player roles across teams.

In [3]:

# Create the position mapping
position_mapping = {
    'Goalkeeper': ['Goalkeeper'],
    'Defenders': ['Centre-Back', 'Left-Back', 'Right-Back', 'Sweeper', 'Defensive Midfield'],
    'Midfielders': ['Central Midfield', 'Left Midfield', 'Right Midfield', 'Attacking Midfield', 'midfield'],
    'Attackers': ['Centre-Forward', 'Left Winger', 'Right Winger', 'Second Striker', 'Attack']
}

# Function to map positions
def map_position(pos):
    for key, values in position_mapping.items():
        if pos in values:
            return key
    return "Unknown"

# Apply the mapping
player_stats['position'] = player_stats['position'].apply(map_position)
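
As an alternative sketch, the same mapping can be written as a reverse lookup applied with `Series.map`, which avoids looping over the categories for every row. The name `position_to_group` and the sample positions are illustrative assumptions:

```python
import pandas as pd

position_mapping = {
    'Goalkeeper': ['Goalkeeper'],
    'Defenders': ['Centre-Back', 'Left-Back', 'Right-Back', 'Sweeper', 'Defensive Midfield'],
    'Midfielders': ['Central Midfield', 'Left Midfield', 'Right Midfield', 'Attacking Midfield', 'midfield'],
    'Attackers': ['Centre-Forward', 'Left Winger', 'Right Winger', 'Second Striker', 'Attack']
}

# Invert the mapping: specific position -> broad category
position_to_group = {
    pos: group for group, positions in position_mapping.items() for pos in positions
}

# Made-up sample, including a missing value and a position not in the mapping
positions = pd.Series(['Centre-Back', 'Left Winger', 'Goalkeeper', None, 'Libero'])

# Unmapped or missing positions become NaN and are then labelled 'Unknown'
groups = positions.map(position_to_group).fillna('Unknown')
print(groups.tolist())
```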


## Feature Selection: Most Relevant Player, Team, Competition, and Match Features

In this step, to reduce complexity, we identify the most important features from player, team, competitions, and match data to create a dataset targeted at predicting match outcomes.

1. **Player Data**: From the previous feature engineering.
   - **goals**, **yellow cards**, and **red cards**
   - **performance score**
   - **market value in euros**
   - **position**
   - **current club ID**

2. **Team Data**: From the club dataset.
   - **domestic competition ID**
   - **average age**
   - **foreigners percentage**
   - **stadium name**
   - **coach name**
   - **net transfer record** and **total market value**

3. **Competitions Data**: From the competitions dataset.
   - **competition ID**
   - **country ID**

4. **Match Data**: From the games dataset
   - **home and away team statistics**
   - **game outcomes**


In [4]:
data_files = {
    "appearances": './data/appearances.csv',
    "club_games": './data/club_games.csv',
    "clubs": './data/clubs.csv',
    "competitions": './data/competitions.csv',
    "game_events": './data/game_events.csv',
    "game_lineups": './data/game_lineups.csv',
    "games": './data/games.csv',
    "player_valuations": './data/player_valuations.csv',
    "players": './data/players.csv',
    "transfers": './data/transfers.csv'
}

dfs = {key: pd.read_csv(path) for key, path in data_files.items()}

games_df = dfs['games']
player_stats = player_stats[['player_id', 'goals', 'yellow_cards', 'red_cards', 'performance_score', 'market_value_in_eur', 'position', 'current_club_id']]
clubs_df = dfs['clubs'][['club_id', 'domestic_competition_id', 'average_age', 'foreigners_percentage', 'stadium_name', 'coach_name','net_transfer_record','total_market_value']]
competitions_df = dfs['competitions'][['competition_id', 'country_id']]

## Dataset Merging: Relevant Player, Team, and Competition Features for Each Match

In this step, we merge data from multiple sources—player performance, team statistics, and competition details—to build a rich and complete view of every match for match outcome prediction. To account for player details, we include 10 players per team in each match, ordered by position: 1 goalkeeper, 3 defenders, 3 midfielders, and 3 attackers.

In [5]:

## Merging club data
home_club_df = clubs_df.rename(columns=lambda x: f"home_team_{x}")
away_club_df = clubs_df.rename(columns=lambda x: f"away_team_{x}")
merged_df = pd.merge(games_df, home_club_df, left_on='home_club_id', right_on='home_team_club_id', how='left')
merged_df = pd.merge(merged_df, away_club_df, left_on='away_club_id', right_on='away_team_club_id', how='left')

## Merging competition data
home_competition_df = competitions_df.rename(columns=lambda x: f"home_team_{x}")
away_competition_df = competitions_df.rename(columns=lambda x: f"away_team_{x}")
merged_df = pd.merge(merged_df, home_competition_df, left_on='home_team_domestic_competition_id', right_on='home_team_competition_id', how='left')
merged_df = pd.merge(merged_df, away_competition_df, left_on='away_team_domestic_competition_id', right_on='away_team_competition_id', how='left')

# Define the top player count per position category
top_counts = {
    'Goalkeeper': 1,
    'Defenders': 3,
    'Midfielders': 3,
    'Attackers': 3
}

# Mapping of positions to numeric values
position_mapping_numeric = {'Goalkeeper': 1, 'Defenders': 2, 'Midfielders': 3, 'Attackers': 4}

# Function to get top players for home and away teams
def get_top_players_optimized(df, merged_df, top_counts):

    home_club_ids = merged_df['home_club_id'].unique()
    away_club_ids = merged_df['away_club_id'].unique()
    
    home_players = df[df['current_club_id'].isin(home_club_ids)]
    away_players = df[df['current_club_id'].isin(away_club_ids)]

    player_data_list = []

    for i, row in merged_df.iterrows():

        home_top_players = home_players[home_players['current_club_id'] == row['home_club_id']]
        away_top_players = away_players[away_players['current_club_id'] == row['away_club_id']]

        home_player_data = assign_top_players(home_top_players, 'home', i, top_counts)
        away_player_data = assign_top_players(away_top_players, 'away', i, top_counts)
        
        player_data_list.append({**home_player_data, **away_player_data})
    
    player_data_df = pd.DataFrame(player_data_list)

    merged_df = pd.concat([merged_df, player_data_df], axis=1)
    
    return merged_df

# Function to assign top players to columns
def assign_top_players(top_players, prefix, row_index, top_counts):
    positions = ['Goalkeeper', 'Defenders', 'Midfielders', 'Attackers']
    top_players_list = [
        top_players[top_players['position'] == position].nlargest(top_counts[position], 'performance_score')
        for position in positions if position in top_players['position'].values
    ]
    
    top_players = pd.concat(top_players_list, ignore_index=True) if top_players_list else pd.DataFrame()
    
    player_data = {}
    
    for i, player in enumerate(top_players.itertuples(), 1):
        player_data[f'{prefix}_player_{i}_id'] = player.player_id
        player_data[f'{prefix}_player_{i}_goals'] = player.goals
        player_data[f'{prefix}_player_{i}_yellow_cards'] = player.yellow_cards
        player_data[f'{prefix}_player_{i}_red_cards'] = player.red_cards
        player_data[f'{prefix}_player_{i}_performance_score'] = player.performance_score
        player_data[f'{prefix}_player_{i}_market_value_in_eur'] = player.market_value_in_eur
        player_data[f'{prefix}_player_{i}_position'] = player.position
        player_data[f'{prefix}_player_{i}_current_club_id'] = player.current_club_id
        
        if i == 10:  
            break

    return player_data

## Merging player data
merged_df = get_top_players_optimized(player_stats, merged_df, top_counts)
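
The loop above re-filters the player table for every match. Since a club's top players do not depend on the match, one possible optimization is to compute the selection once per club and look it up afterwards. The sketch below is not the code used for the reported results; `top_players_by_club` and the toy data are hypothetical:

```python
import pandas as pd

TOP_COUNTS = {'Goalkeeper': 1, 'Defenders': 3, 'Midfielders': 3, 'Attackers': 3}

def top_players_by_club(players, top_counts):
    """Return {club_id: DataFrame of the club's top players, ordered by position group}."""
    order = {pos: rank for rank, pos in enumerate(top_counts)}
    known = players[players['position'].isin(list(top_counts))].copy()
    known['position_rank'] = known['position'].map(order)
    # Order by club, then position group, then best performance score first
    known = known.sort_values(
        ['current_club_id', 'position_rank', 'performance_score'],
        ascending=[True, True, False],
    )
    # Rank of each player within its (club, position) group, starting at 0
    known['rank_in_position'] = known.groupby(['current_club_id', 'position']).cumcount()
    selected = known[known['rank_in_position'] < known['position'].map(top_counts)]
    selected = selected.drop(columns=['position_rank', 'rank_in_position'])
    return {
        club_id: group.reset_index(drop=True)
        for club_id, group in selected.groupby('current_club_id')
    }

# Made-up players: club 10 has two goalkeepers, so only the better one is kept
players = pd.DataFrame({
    'player_id': [1, 2, 3, 4],
    'current_club_id': [10, 10, 10, 20],
    'position': ['Goalkeeper', 'Goalkeeper', 'Defenders', 'Attackers'],
    'performance_score': [0.1, 0.3, 0.2, 0.5],
})
club_top = top_players_by_club(players, TOP_COUNTS)
print(club_top[10])
```

Each match row can then read `club_top.get(home_club_id)` and `club_top.get(away_club_id)` instead of filtering the full player table again.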

## Data Cleaning: Preparing for analysis

This step focuses on preparing the dataset for match-level prediction ML models.

1. **Dropping Unnecessary Columns**:
   I removed irrelevant columns such as coach names and club IDs that were either duplicated or not useful for predictions.

2. **Handling Missing Values**:
   I filled missing values for key columns like player positions, manager names, and stadiums with default values like 'Unknown'. For numerical columns like attendance and team average age, I used the median value to fill gaps.

3. **Converting Monetary Values**:
   I converted transfer record data from strings (e.g., "€1.5m") to numerical values to standardize the dataset and filled any missing transfer records with the median value.

4. **Handling Missing Player and Competition Data**:
   Missing values in numeric player-related columns were filled with the mean, while non-numeric and competition-related columns were filled with defaults such as 'Unknown' or 0.

5. **Creating Match Outcome Labels**:
   I defined the match outcome (win, lose, or tie) based on the number of goals scored by the home and away teams. This serves as the target variable for the prediction model.

6. **Validation**:
   After processing, I validated the data by checking for any remaining missing values to ensure completeness.

7. **Saving the Cleaned Data**:
   Finally, I saved the cleaned dataframe for further analysis and model training, confirming that the match outcome and other key features were correctly processed.
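
To illustrate the monetary conversion in step 3, here is a minimal sketch on made-up strings. The helper `parse_money` is a hypothetical stand-in that follows the same rules as the conversion used in the cleaning cell, and the sample formats are assumptions about how the raw values look:

```python
import pandas as pd

def parse_money(value):
    """Convert strings such as '€-1.5m' or '€500k' to floats; pass non-strings through."""
    if isinstance(value, str):
        value = value.replace('€', '').replace('+', '').replace(',', '')
        if 'm' in value:
            return float(value.replace('m', '')) * 1e6  # millions
        elif 'k' in value:
            return float(value.replace('k', '')) * 1e3  # thousands
        return float(value)
    return value  # NaN and numeric values are left unchanged

# Made-up examples of transfer record strings, plus one missing value
samples = pd.Series(['€-1.5m', '+€2.5m', '€500k', '€0', float('nan')])
converted = samples.apply(parse_money)

# Missing entries stay NaN and can then be filled with the median
filled = converted.fillna(converted.median())
print(filled.tolist())
```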


In [6]:
# Step 1: Drop unnecessary columns
columns_to_drop = [
    'url', 
    'home_team_coach_name', 'away_team_coach_name',  
    'home_team_club_id', 'away_team_club_id',        
    'home_club_name', 'away_club_name',            
    'aggregate'
]
cleaned_df = merged_df.drop(columns=columns_to_drop)

# Step 2: Handle missing values

# Fill missing positions, manager names, formations, stadiums, and referee with 'Unknown' or a default value
cleaned_df.fillna({
    'home_club_position': 7,
    'away_club_position': 7, # 7 is the default value for unknown. Usually leagues have 15 positions so 7 is the middle
    'home_club_manager_name': 'Unknown',
    'away_club_manager_name': 'Unknown',
    'stadium': 'Unknown',
    'home_team_stadium_name': 'Unknown',
    'away_team_stadium_name': 'Unknown',
    'home_club_formation': 'Unknown',
    'away_club_formation': 'Unknown',
    'referee': 'Unknown'
}, inplace=True)

# Fill missing attendance, average age, and foreigners percentage with the median
cleaned_df['attendance'] = cleaned_df['attendance'].fillna(cleaned_df['attendance'].median())
cleaned_df['home_team_average_age'] = cleaned_df['home_team_average_age'].fillna(cleaned_df['home_team_average_age'].median())
cleaned_df['away_team_average_age'] = cleaned_df['away_team_average_age'].fillna(cleaned_df['away_team_average_age'].median())
cleaned_df['home_team_foreigners_percentage'] = cleaned_df['home_team_foreigners_percentage'].fillna(cleaned_df['home_team_foreigners_percentage'].median())
cleaned_df['away_team_foreigners_percentage'] = cleaned_df['away_team_foreigners_percentage'].fillna(cleaned_df['away_team_foreigners_percentage'].median())

# Step 3: Convert monetary values for transfer records
def convert_monetary_value(value):
    if isinstance(value, str):
        value = value.replace('€', '').replace('+', '').replace(',', '')
        if 'm' in value:
            return float(value.replace('m', '')) * 1e6
        elif 'k' in value:
            return float(value.replace('k', '')) * 1e3
        else:
            return float(value)
    return value

cleaned_df['home_team_net_transfer_record'] = cleaned_df['home_team_net_transfer_record'].apply(convert_monetary_value)
cleaned_df['away_team_net_transfer_record'] = cleaned_df['away_team_net_transfer_record'].apply(convert_monetary_value)

# Fill missing values in transfer records with the median
cleaned_df['home_team_net_transfer_record'] = cleaned_df['home_team_net_transfer_record'].fillna(cleaned_df['home_team_net_transfer_record'].median())
cleaned_df['away_team_net_transfer_record'] = cleaned_df['away_team_net_transfer_record'].fillna(cleaned_df['away_team_net_transfer_record'].median())

# Drop the team total market value columns
cleaned_df = cleaned_df.drop(['home_team_total_market_value', 'away_team_total_market_value'], axis=1)

# Step 4: Handle missing values in player and competition columns

# Fill missing player-related columns (numeric ones) with their mean
player_columns = [col for col in cleaned_df.columns if 'player_' in col]
for col in player_columns:
    if pd.api.types.is_numeric_dtype(cleaned_df[col]):
        cleaned_df[col] = cleaned_df[col].fillna(cleaned_df[col].mean())

# Fill remaining missing player-related columns and position-related columns with 0
player_columns = [col for col in cleaned_df.columns if 'player_' in col or 'position' in col]
cleaned_df[player_columns] = cleaned_df[player_columns].fillna(0)

# Fill missing competition-related columns with 'Unknown'
competition_columns = [
    'home_team_domestic_competition_id', 'away_team_domestic_competition_id',
    'home_team_competition_id', 'away_team_competition_id',
    'home_team_country_id', 'away_team_country_id'
]
cleaned_df[competition_columns] = cleaned_df[competition_columns].fillna('Unknown')

# Step 5: Create the match outcome column (win_team_1, lose_team_1, tie)

def determine_match_outcome(home_goals, away_goals):
    if home_goals > away_goals:
        return 'win_team_1'  # Home team wins
    elif home_goals < away_goals:
        return 'lose_team_1'  # Away team wins
    else:
        return 'tie'  # Match tied

# Apply the outcome function to create the new column
cleaned_df['match_outcome'] = cleaned_df.apply(lambda row: determine_match_outcome(row['home_club_goals'], row['away_club_goals']), axis=1)

# Step 6: Validate the cleaning process by checking for missing values
missing_columns = cleaned_df.isnull().sum()

# Filter to show only columns with missing values
missing_columns = missing_columns[missing_columns > 0]

# Step 7: Save the cleaned dataframe if necessary
cleaned_df.to_csv('cleaned_data.csv', index=False)
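
The row-wise `apply` used to build the outcome label can also be expressed in vectorized form. A minimal sketch with `numpy.select` on made-up scores (the toy frame `games` is illustrative only):

```python
import numpy as np
import pandas as pd

# Made-up results: home win, away win, draw
games = pd.DataFrame({
    'home_club_goals': [2, 0, 1],
    'away_club_goals': [1, 3, 1],
})

# Conditions are checked in order; rows matching neither fall back to 'tie'
conditions = [
    games['home_club_goals'] > games['away_club_goals'],
    games['home_club_goals'] < games['away_club_goals'],
]
labels = ['win_team_1', 'lose_team_1']
games['match_outcome'] = np.select(conditions, labels, default='tie')
print(games)
```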



# Reusable testing code

In [1]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.preprocessing import LabelEncoder
import pandas as pd

# Load the dataset
df = pd.read_csv('cleaned_data.csv')

# Preprocess only the categorical feature columns 
categorical_columns = df.drop(['match_outcome'], axis=1).select_dtypes(include=['object']).columns.tolist()
label_encoders = {}
for column in categorical_columns:
    le = LabelEncoder()
    df[column] = le.fit_transform(df[column].astype(str))
    label_encoders[column] = le

# Define features (X) and target (y)
X = df.drop(['home_club_goals', 'away_club_goals', 'match_outcome'], axis=1)  # Drop the goal columns and target column
y = df['match_outcome']  # Target is the match outcome (win_team_1, lose_team_1, tie)
label_encoder = LabelEncoder()
y_encoded = label_encoder.fit_transform(y)

# Split data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y_encoded, test_size=0.3, random_state=42)

# Define the reusable evaluation function
def evaluate_model(model_name, y_test, y_pred, label_encoder):
    print(f"{model_name} Model Evaluation:\n")
    print(classification_report(y_test, y_pred, target_names=label_encoder.classes_))
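
`LabelEncoder` assigns integer codes in sorted order of the class labels, which is why the reports list `lose_team_1`, `tie`, `win_team_1` in that order. A small sketch on made-up labels:

```python
from sklearn.preprocessing import LabelEncoder

# Made-up outcome labels in arbitrary order
outcomes = ['win_team_1', 'tie', 'lose_team_1', 'win_team_1']

encoder = LabelEncoder()
encoded = encoder.fit_transform(outcomes)

# classes_ holds the sorted labels; the position in this array is the integer code
class_names = list(encoder.classes_)
decoded = list(encoder.inverse_transform(encoded))
print(class_names, list(encoded))
```

Passing `label_encoder.classes_` as `target_names` therefore lines the report rows up with the encoded targets.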



# Simple Decision Tree

In [42]:
from sklearn.tree import DecisionTreeClassifier
# Train a Decision Tree model
dt_model = DecisionTreeClassifier(random_state=42)
dt_model.fit(X_train, y_train)

# Predict outcomes for the test set
y_pred_dt = dt_model.predict(X_test)

# Evaluate the Decision Tree model
evaluate_model("Decision Tree", y_test, y_pred_dt, label_encoder)

Decision Tree Model Evaluation:

              precision    recall  f1-score   support

 lose_team_1       0.48      0.49      0.49      6907
         tie       0.28      0.29      0.28      4480
  win_team_1       0.58      0.57      0.57      9504

    accuracy                           0.48     20891
   macro avg       0.45      0.45      0.45     20891
weighted avg       0.48      0.48      0.48     20891



# Random Forest with Information Gain (Via Entropy)

In [45]:

from sklearn.ensemble import RandomForestClassifier

# Train a Random Forest model with Information Gain (entropy criterion)
rf_entropy = RandomForestClassifier(criterion='entropy', n_estimators=100, random_state=42)
rf_entropy.fit(X_train, y_train)

# Predict outcomes for the test set
y_pred_entropy = rf_entropy.predict(X_test)

# Evaluate the Random Forest model
evaluate_model("Random Forest (Information Gain)", y_test, y_pred_entropy, label_encoder)

Random Forest (Information Gain) Model Evaluation:

              precision    recall  f1-score   support

 lose_team_1       0.54      0.55      0.54      6907
         tie       0.29      0.16      0.20      4480
  win_team_1       0.59      0.71      0.64      9504

    accuracy                           0.54     20891
   macro avg       0.47      0.47      0.46     20891
weighted avg       0.51      0.54      0.52     20891



# Random Forest with Gini Index

In [44]:
# Initialize the Random Forest with Gini Index
rf_gini = RandomForestClassifier(criterion='gini', n_estimators=100, random_state=42)

# Train the model and predict outcomes for the test set
rf_gini.fit(X_train, y_train)
y_pred_gini = rf_gini.predict(X_test)

# Evaluate the Random Forest model
evaluate_model("Random Forest (Gini Index)", y_test, y_pred_gini, label_encoder)

Random Forest (Gini Index) Model Evaluation:

              precision    recall  f1-score   support

 lose_team_1       0.53      0.54      0.53      6907
         tie       0.29      0.16      0.21      4480
  win_team_1       0.59      0.70      0.64      9504

    accuracy                           0.53     20891
   macro avg       0.47      0.47      0.46     20891
weighted avg       0.50      0.53      0.51     20891
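
For reference, the two split criteria compared above differ only in the impurity measure: Gini impurity is 1 minus the sum of squared class proportions, and entropy is the negative sum of p * log2(p). A minimal sketch, using the class supports from the reports as the example distribution (the function names are illustrative):

```python
import numpy as np

def gini_impurity(counts):
    """Gini impurity: 1 - sum(p_i^2) over class proportions p_i."""
    p = np.asarray(counts, dtype=float)
    p = p / p.sum()
    return 1.0 - np.sum(p ** 2)

def entropy(counts):
    """Shannon entropy in bits: -sum(p_i * log2(p_i)), skipping empty classes."""
    p = np.asarray(counts, dtype=float)
    p = p / p.sum()
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

# Test-set class supports from the reports: lose_team_1, tie, win_team_1
support = [6907, 4480, 9504]
print(f"Gini impurity: {gini_impurity(support):.3f}")
print(f"Entropy (bits): {entropy(support):.3f}")
```

Both measures are zero for a pure node and largest when the classes are evenly mixed, which is consistent with the two forests producing very similar reports.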



# Random Forest with Gini Index and Hyperparameter tuning

In [48]:
from sklearn.model_selection import RandomizedSearchCV
from sklearn.feature_selection import RFE
from scipy.stats import randint

# Set hyperparameter distribution for Random Forest (Asked chatGPT)
param_distributions = {
    'n_estimators': randint(50, 200),
    'max_depth': [3, 5, 7, None],
    'min_samples_split': randint(2, 10),
    'min_samples_leaf': randint(1, 5),
    'bootstrap': [True, False]
}

# Initialize the RandomForestClassifier used during the search (criterion='entropy')
rf_model = RandomForestClassifier(criterion='entropy', random_state=42)

# Perform Randomized Search for hyperparameter tuning
random_search = RandomizedSearchCV(estimator=rf_model, param_distributions=param_distributions, n_iter=10,
                                   scoring='accuracy', cv=2, verbose=1, random_state=42)
random_search.fit(X_train, y_train)

# Get the best parameters from the random search
best_params = random_search.best_params_
print("Best parameters found: ", best_params, "\n")

# Train the best model (criterion is not passed here, so the scikit-learn default 'gini' applies)
best_rf_model = RandomForestClassifier(**best_params, random_state=42)
best_rf_model.fit(X_train, y_train)

# Feature Importance Pruning
feature_importances = best_rf_model.feature_importances_

# Select top 100 important features
top_n = 100
important_feature_indices = feature_importances.argsort()[-top_n:][::-1]
selected_features = X.columns[important_feature_indices]

# Apply RFE on the reduced feature set
X_train_top = X_train[selected_features]
X_test_top = X_test[selected_features]

rf_model_rfe = RandomForestClassifier(n_estimators=50, max_depth=3, random_state=42)
# Select top 30 features
rfe = RFE(rf_model_rfe, n_features_to_select=30)  
X_train_rfe = rfe.fit_transform(X_train_top, y_train)
X_test_rfe = rfe.transform(X_test_top)

# Retraining model with RFE-selected features
best_rf_model.fit(X_train_rfe, y_train)

# Predicting outcomes
y_pred = best_rf_model.predict(X_test_rfe)

evaluate_model("Random Forest (Gini Index and hyperparameter tuning)", y_test, y_pred, label_encoder)


Fitting 2 folds for each of 10 candidates, totalling 20 fits
Best parameters found:  {'bootstrap': True, 'max_depth': 7, 'min_samples_leaf': 3, 'min_samples_split': 4, 'n_estimators': 137} 

Random Forest (Gini Index and hyperparameter tuning) Model Evaluation:

              precision    recall  f1-score   support

 lose_team_1       0.57      0.55      0.56      6907
         tie       0.91      0.01      0.02      4480
  win_team_1       0.57      0.85      0.69      9504

    accuracy                           0.57     20891
   macro avg       0.69      0.47      0.42     20891
weighted avg       0.64      0.57      0.50     20891
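
When a model is rebuilt from `best_params_`, any fixed argument of the searched estimator that is not part of the parameter distribution (for example `criterion`) falls back to its default. With the default `refit=True`, `RandomizedSearchCV` also exposes `best_estimator_`, which is already refit on the full training data and keeps those fixed arguments. A minimal sketch on synthetic data (the data and the small search space are assumptions chosen to keep the example fast):

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic three-class problem standing in for the match data
X_demo, y_demo = make_classification(
    n_samples=200, n_features=10, n_informative=5, n_classes=3, random_state=42
)

search = RandomizedSearchCV(
    estimator=RandomForestClassifier(criterion='entropy', random_state=42),
    param_distributions={'n_estimators': randint(10, 30), 'max_depth': [3, 5, None]},
    n_iter=3, scoring='accuracy', cv=2, random_state=42,
)
search.fit(X_demo, y_demo)

# best_estimator_ carries both the tuned parameters and the fixed criterion
best_model = search.best_estimator_
print(search.best_params_, best_model.criterion)
```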



# XGBoost with Hyperparameter Tuning and Importance Pruning

In [50]:
from xgboost import XGBClassifier
from sklearn.feature_selection import RFE
from scipy.stats import randint

# Set up hyperparameter distribution for XGBoost (Asked chatGPT)
param_distributions = {
    'n_estimators': randint(50, 200),
    'max_depth': [3, 5, 7],
    'learning_rate': [0.01, 0.1],
    'subsample': [0.8, 1.0],
    'colsample_bytree': [0.8, 1.0],
    'gamma': [0, 0.1]  
}

# Initialize the XGBClassifier
xgb_model = XGBClassifier(random_state=42)

# Perform Randomized Search for hyperparameter tuning
random_search = RandomizedSearchCV(estimator=xgb_model, param_distributions=param_distributions, n_iter=10,
                                   scoring='accuracy', cv=2, verbose=1, random_state=42)
random_search.fit(X_train, y_train)

# Get the best parameters from the random search
best_params = random_search.best_params_
print("Best parameters found: ", best_params)

# Train the best model
best_xgb_model = XGBClassifier(**best_params, random_state=42)
best_xgb_model.fit(X_train, y_train)

# Feature Importance Pruning
feature_importances = best_xgb_model.feature_importances_

# Select top 100 important features
top_n = 100
important_feature_indices = feature_importances.argsort()[-top_n:][::-1] 
selected_features = X.columns[important_feature_indices]

# Apply RFE on the reduced feature set
X_train_top = X_train[selected_features]
X_test_top = X_test[selected_features]

xgb_model_rfe = XGBClassifier(n_estimators=50, max_depth=3, random_state=42)
# Select top 30 features
rfe = RFE(xgb_model_rfe, n_features_to_select=30)  
X_train_rfe = rfe.fit_transform(X_train_top, y_train)
X_test_rfe = rfe.transform(X_test_top)

# Retraining model with RFE-selected features
best_xgb_model.fit(X_train_rfe, y_train)

# Predicting outcomes
y_pred = best_xgb_model.predict(X_test_rfe)

evaluate_model("XGBoost (with Hyperparameter Tuning and Importance Pruning)", y_test, y_pred, label_encoder)

Fitting 2 folds for each of 10 candidates, totalling 20 fits
Best parameters found:  {'colsample_bytree': 0.8, 'gamma': 0.1, 'learning_rate': 0.1, 'max_depth': 5, 'n_estimators': 58, 'subsample': 1.0}
XGBoost (with Hyperparameter Tuning and Importance Pruning) Model Evaluation:

              precision    recall  f1-score   support

 lose_team_1       0.56      0.66      0.61      6907
         tie       0.50      0.06      0.10      4480
  win_team_1       0.62      0.80      0.70      9504

    accuracy                           0.60     20891
   macro avg       0.56      0.51      0.47     20891
weighted avg       0.58      0.60      0.54     20891



# Gaussian Naive Bayes 

In [8]:
from sklearn.naive_bayes import GaussianNB

# Initialize the Naive Bayes classifier
nb_model = GaussianNB()

# Train the model and predict outcomes
nb_model.fit(X_train, y_train)
y_pred = nb_model.predict(X_test)

# Evaluate the model
evaluate_model("Gaussian Naive Bayes", y_test, y_pred, label_encoder)


Gaussian Naive Bayes Model Evaluation:

              precision    recall  f1-score   support

 lose_team_1       0.40      0.76      0.53      6907
         tie       0.25      0.06      0.10      4480
  win_team_1       0.60      0.42      0.50      9504

    accuracy                           0.46     20891
   macro avg       0.42      0.42      0.37     20891
weighted avg       0.46      0.46      0.42     20891



# Bernoulli Naive Bayes

In [13]:
from sklearn.naive_bayes import BernoulliNB

# Bernoulli Naive Bayes Model
bnb = BernoulliNB()
bnb.fit(X_train, y_train)

# Predict and evaluate
y_pred = bnb.predict(X_test)

# Evaluate the model
evaluate_model("Bernoulli Naive Bayes", y_test, y_pred, label_encoder)


Bernoulli Naive Bayes Model Evaluation:

              precision    recall  f1-score   support

 lose_team_1       0.47      0.27      0.35      6907
         tie       0.24      0.83      0.38      4480
  win_team_1       0.65      0.11      0.19      9504

    accuracy                           0.32     20891
   macro avg       0.45      0.40      0.30     20891
weighted avg       0.50      0.32      0.28     20891



# Multinomial Naive Bayes

In [11]:
from sklearn.naive_bayes import MultinomialNB

X_train_no_negatives = X_train.drop(['home_team_net_transfer_record', 'away_team_net_transfer_record'], axis=1)
X_test_no_negatives = X_test.drop(['home_team_net_transfer_record', 'away_team_net_transfer_record'], axis=1)

nb_model = MultinomialNB()

# Train the model
nb_model.fit(X_train_no_negatives, y_train)

# Predicting outcomes
y_pred = nb_model.predict(X_test_no_negatives)

# Evaluate the model
evaluate_model("Multinomial Naive Bayes", y_test, y_pred, label_encoder)

Multinomial Naive Bayes Model Evaluation:

              precision    recall  f1-score   support

 lose_team_1       0.44      0.43      0.44      6907
         tie       0.22      0.32      0.26      4480
  win_team_1       0.57      0.46      0.51      9504

    accuracy                           0.42     20891
   macro avg       0.41      0.40      0.40     20891
weighted avg       0.46      0.42      0.43     20891
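
`MultinomialNB` expects non-negative feature values, which is why the net transfer record columns are dropped above. Rather than hard-coding the column names, the columns containing negative values can be detected from the data. A minimal sketch on a made-up feature frame:

```python
import pandas as pd

# Made-up features: the transfer record columns can be negative, attendance cannot
X_demo = pd.DataFrame({
    'attendance': [30000, 45000, 12000],
    'home_team_net_transfer_record': [-1.5e6, 2.0e6, 0.0],
    'away_team_net_transfer_record': [3.0e5, -7.0e5, 1.0e6],
})

# Columns where at least one value is negative
negative_columns = X_demo.columns[(X_demo < 0).any()].tolist()

# Drop them so the remaining features satisfy the non-negativity requirement
X_non_negative = X_demo.drop(columns=negative_columns)
print(negative_columns)
```

In practice the check would be run on the training features and the same columns dropped from the test features.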



# Complement Naive Bayes 

In [12]:
from sklearn.naive_bayes import ComplementNB

# Initialize the Complement Naive Bayes classifier
nb_model = ComplementNB()

# Train the model
nb_model.fit(X_train_no_negatives, y_train)

# Predicting outcomes
y_pred = nb_model.predict(X_test_no_negatives)

# Evaluate the model
evaluate_model("Complement Naive Bayes", y_test, y_pred, label_encoder)


Complement Naive Bayes Model Evaluation:

              precision    recall  f1-score   support

 lose_team_1       0.42      0.52      0.46      6907
         tie       0.28      0.06      0.09      4480
  win_team_1       0.54      0.65      0.59      9504

    accuracy                           0.48     20891
   macro avg       0.41      0.41      0.38     20891
weighted avg       0.44      0.48      0.44     20891



# Support Vector Machine (Linear Kernel)

In [2]:
from sklearn.svm import LinearSVC
# Initialize the SVM with a linear kernel
svm_model = LinearSVC(max_iter=5000)

# Train the model
svm_model.fit(X_train, y_train)

# Predicting outcomes
y_pred = svm_model.predict(X_test)

# Evaluate the model
evaluate_model("Linear SVM", y_test, y_pred, label_encoder)
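
Linear SVMs are sensitive to feature scale, and the features here range from card counts to euro amounts. One possible variant, not the configuration run above, is to standardize the features inside a pipeline so the scaler is fit on the training split only. A minimal sketch on synthetic data (the data and the `scaled_svm` name are assumptions):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Synthetic three-class data; one feature is inflated to mimic a monetary scale
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=4, n_classes=3, random_state=0
)
X_demo[:, 0] *= 1e6

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=42)

# The scaler is fit on the training data only and reused when predicting
scaled_svm = make_pipeline(StandardScaler(), LinearSVC(max_iter=5000, random_state=0))
scaled_svm.fit(X_tr, y_tr)
y_pred_demo = scaled_svm.predict(X_te)
print(f"Demo accuracy: {scaled_svm.score(X_te, y_te):.2f}")
```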

# Support Vector Machine (Polynomial Kernel of Degree 3)

In [None]:
from sklearn.svm import SVC

# Initialize the SVM with a polynomial kernel
svm_model = SVC(kernel='poly', degree=3, max_iter=5000)  

# Train the model
svm_model.fit(X_train, y_train)

# Predicting outcomes
y_pred = svm_model.predict(X_test)

# Evaluate the model
evaluate_model("Polynomial SVM", y_test, y_pred, label_encoder)


# Support Vector Machine (Radial Basis Function Kernel)

In [None]:
from sklearn.svm import SVC

# Initialize the SVM with an RBF kernel
svm_model = SVC(kernel='rbf', max_iter=5000)

# Train the model
svm_model.fit(X_train, y_train)

# Predicting outcomes
y_pred = svm_model.predict(X_test)

# Evaluate the model
evaluate_model("RBF Kernel SVM", y_test, y_pred, label_encoder)


# Support Vector Machine (Sigmoid Kernel)

In [None]:
from sklearn.svm import SVC

# Initialize the SVM with a sigmoid kernel
svm_model = SVC(kernel='sigmoid', max_iter=5000)

# Train the model
svm_model.fit(X_train, y_train)

# Predicting outcomes
y_pred = svm_model.predict(X_test)

# Evaluate the model
evaluate_model("Sigmoid SVM", y_test, y_pred, label_encoder)

# K-Nearest Neighbours (Euclidean Distance, k = 5)

In [86]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Initialize k-NN with Euclidean distance and k=5
knn_euclidean = KNeighborsClassifier(metric='euclidean', n_neighbors=5)

# Train the model and predict outcomes.
# Note: the fit uses the unscaled X_train / X_test; the scaled arrays above are not used here.
knn_euclidean.fit(X_train, y_train)
y_pred_euclidean = knn_euclidean.predict(X_test)

# Evaluate the model
evaluate_model("K-Nearest Neighbours (Euclidean Distance, k = 5)", y_test, y_pred_euclidean, label_encoder)

K-Nearest Neighbours (Euclidean Distance, k = 5) Model Evaluation:

              precision    recall  f1-score   support

 lose_team_1       0.45      0.56      0.50      6907
         tie       0.25      0.19      0.22      4480
  win_team_1       0.57      0.53      0.55      9504

    accuracy                           0.47     20891
   macro avg       0.42      0.43      0.42     20891
weighted avg       0.46      0.47      0.46     20891
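
The cell above fits a `StandardScaler` but trains k-NN on the unscaled `X_train`. Since k-NN relies entirely on distances, features measured in large units can dominate the neighbour search. The sketch below illustrates the effect with three hypothetical rows (the feature names and values are invented for illustration and are not taken from the dataset); `standardize` mirrors the column-wise z-score that `StandardScaler` computes.

```python
import math
from statistics import mean, pstdev

# Hypothetical rows: (squad market value in euros, goals per game).
rows = [
    (250_000_000.0, 1.2),
    (251_000_000.0, 2.4),  # similar value, very different scoring rate
    (300_000_000.0, 1.2),  # different value, identical scoring rate
]


def euclidean(a, b):
    """Euclidean distance between two equal-length tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def standardize(data):
    """Column-wise z-scores using the population standard deviation."""
    columns = list(zip(*data))
    means = [mean(col) for col in columns]
    stds = [pstdev(col) for col in columns]
    return [tuple((x - m) / s for x, m, s in zip(row, means, stds)) for row in data]


scaled = standardize(rows)

raw_d01 = euclidean(rows[0], rows[1])
raw_d02 = euclidean(rows[0], rows[2])
scaled_d01 = euclidean(scaled[0], scaled[1])
scaled_d02 = euclidean(scaled[0], scaled[2])

print(f"unscaled: d(0,1) = {raw_d01:,.0f}   d(0,2) = {raw_d02:,.0f}")
print(f"scaled:   d(0,1) = {scaled_d01:.2f}   d(0,2) = {scaled_d02:.2f}")
```

On the raw values the scoring rate is effectively invisible to the distance, whereas after standardization both features contribute on a comparable scale. Whether scaling improves the k-NN results on this dataset would need to be checked by fitting on `X_train_scaled`.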



# K-Nearest Neighbours (Manhattan Distance, k = 5)

In [76]:
# Initialize k-NN with Manhattan distance and k=5
knn_manhattan = KNeighborsClassifier(metric='manhattan', n_neighbors=5)

# Train the model and predict outcomes
knn_manhattan.fit(X_train, y_train)
y_pred_manhattan = knn_manhattan.predict(X_test)

# Evaluate the model
evaluate_model("K-Nearest Neighbours (Manhattan Distance, k = 5)", y_test, y_pred_manhattan, label_encoder)

K-Nearest Neighbours (Manhattan Distance, k = 5) Model Evaluation:

              precision    recall  f1-score   support

 lose_team_1       0.45      0.56      0.50      6907
         tie       0.26      0.19      0.22      4480
  win_team_1       0.57      0.53      0.55      9504

    accuracy                           0.47     20891
   macro avg       0.43      0.43      0.42     20891
weighted avg       0.46      0.47      0.46     20891



# K-Nearest Neighbours (Euclidean Distance, k = 10)

In [79]:
# Initialize k-NN with Euclidean distance and k=10
knn_euclidean = KNeighborsClassifier(metric='euclidean', n_neighbors=10)

# Train the model and predict outcomes
knn_euclidean.fit(X_train, y_train)
y_pred_euclidean = knn_euclidean.predict(X_test)

# Evaluate the model
evaluate_model("K-Nearest Neighbours (Euclidean Distance, k = 10)", y_test, y_pred_euclidean, label_encoder)

K-Nearest Neighbours (Euclidean Distance, k = 10) Model Evaluation:

              precision    recall  f1-score   support

 lose_team_1       0.48      0.54      0.51      6907
         tie       0.26      0.14      0.19      4480
  win_team_1       0.56      0.63      0.59      9504

    accuracy                           0.49     20891
   macro avg       0.43      0.44      0.43     20891
weighted avg       0.47      0.49      0.48     20891



# K-Nearest Neighbours (Manhattan Distance, k = 10)

In [80]:
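# Initialize k-NN with Manhattan distance and k=10.
# n_neighbors must be 10 to match the heading; with n_neighbors=5, as in the
# recorded run below, this cell simply reproduces the k = 5 Manhattan results.
knn_manhattan = KNeighborsClassifier(metric='manhattan', n_neighbors=10)

# Train the model and predict outcomes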
knn_manhattan.fit(X_train, y_train)
y_pred_manhattan = knn_manhattan.predict(X_test)

# Evaluate the model
evaluate_model("K-Nearest Neighbours (Manhattan Distance, k = 10)", y_test, y_pred_manhattan, label_encoder)


K-Nearest Neighbours (Manhattan Distance, k = 10) Model Evaluation:

              precision    recall  f1-score   support

 lose_team_1       0.45      0.56      0.50      6907
         tie       0.26      0.19      0.22      4480
  win_team_1       0.57      0.53      0.55      9504

    accuracy                           0.47     20891
   macro avg       0.43      0.43      0.42     20891
weighted avg       0.46      0.47      0.46     20891



# Logistic Regression with L2 Regularization (Ridge) 

In [10]:
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Bring training and test data onto the same scale for better performance
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Initialize Logistic Regression with L2 regularization, default (unbalanced) class weights, and more iterations
log_reg_balanced = LogisticRegression(solver='lbfgs', max_iter=10000, penalty="l2")

# Train the model and predict outcomes
log_reg_balanced.fit(X_train_scaled, y_train)
y_pred_balanced = log_reg_balanced.predict(X_test_scaled)

# Evaluate the model
evaluate_model("Logistic Regression (L2 Regularization, No Class Weights)", y_test, y_pred_balanced, label_encoder)

Logistic Regression (L2 Regularization, No Class Weights) Model Evaluation:

              precision    recall  f1-score   support

 lose_team_1       0.57      0.63      0.60      6907
         tie       0.35      0.06      0.10      4480
  win_team_1       0.61      0.80      0.69      9504

    accuracy                           0.58     20891
   macro avg       0.51      0.50      0.46     20891
weighted avg       0.54      0.58      0.53     20891



# Logistic Regression with L2 Regularization (Ridge) and balanced class weights

In [7]:
# Initialize Logistic Regression with L2 regularization using lbfgs solver
log_reg_l2 = LogisticRegression(solver='lbfgs', penalty='l2', max_iter=10000, class_weight='balanced')

# Train the model and predict outcomes
log_reg_l2.fit(X_train_scaled, y_train)
y_pred_l2 = log_reg_l2.predict(X_test_scaled)

# Evaluate the model
evaluate_model("Logistic Regression (L2 Regularization)", y_test, y_pred_l2, label_encoder)

Logistic Regression (L2 Regularization) Model Evaluation:

              precision    recall  f1-score   support

 lose_team_1       0.57      0.61      0.59      6907
         tie       0.31      0.40      0.35      4480
  win_team_1       0.69      0.56      0.62      9504

    accuracy                           0.54     20891
   macro avg       0.52      0.53      0.52     20891
weighted avg       0.57      0.54      0.55     20891
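
With `class_weight='balanced'`, scikit-learn weights each class by `n_samples / (n_classes * count_c)`, which is why recall on `tie` rises from 0.06 to 0.40 at the cost of `win_team_1`. The sketch below applies that formula to the test-set support counts shown above. This is an approximation for illustration: the actual weights are computed from `y_train`, whose class distribution is assumed here, not confirmed, to be similar.

```python
# Class counts taken from the test-set support column above; the training
# distribution is assumed (not confirmed) to be similar.
counts = {"lose_team_1": 6907, "tie": 4480, "win_team_1": 9504}

n_samples = sum(counts.values())
n_classes = len(counts)

# The "balanced" heuristic: n_samples / (n_classes * count_c)
balanced_weights = {c: n_samples / (n_classes * n) for c, n in counts.items()}

for name, weight in balanced_weights.items():
    print(f"{name:>12}: {weight:.3f}")
```

Under these counts a misclassified `tie` costs roughly twice as much as a misclassified `win_team_1`, which pushes the decision boundary toward predicting more ties.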



# Logistic Regression with L1 Regularization (Lasso) and balanced class weights

In [12]:
# Initialize Logistic Regression with L1 regularization using liblinear solver
log_reg_l1 = LogisticRegression(solver='liblinear', penalty='l1', max_iter=1000, class_weight='balanced')

# Train the model and predict outcomes
log_reg_l1.fit(X_train_scaled, y_train)
y_pred_l1 = log_reg_l1.predict(X_test_scaled)

# Evaluate the model
evaluate_model("Logistic Regression (L1 Regularization)", y_test, y_pred_l1, label_encoder)



Logistic Regression (L1 Regularization) Model Evaluation:

              precision    recall  f1-score   support

 lose_team_1       0.55      0.66      0.60      6907
         tie       0.33      0.16      0.22      4480
  win_team_1       0.65      0.71      0.68      9504

    accuracy                           0.58     20891
   macro avg       0.51      0.51      0.50     20891
weighted avg       0.55      0.58      0.55     20891



# Logistic Regression with ElasticNet Regularization and balanced class weights

In [13]:
# Initialize Logistic Regression with ElasticNet regularization using saga solver
log_reg_elasticnet = LogisticRegression(solver='saga', penalty='elasticnet', l1_ratio=0.5, max_iter=10000, class_weight='balanced')

# Train the model and predict outcomes
log_reg_elasticnet.fit(X_train_scaled, y_train)
y_pred_elasticnet = log_reg_elasticnet.predict(X_test_scaled)

# Evaluate the model
evaluate_model("Logistic Regression (ElasticNet Regularization)", y_test, y_pred_elasticnet, label_encoder)


Logistic Regression (ElasticNet Regularization) Model Evaluation:

              precision    recall  f1-score   support

 lose_team_1       0.57      0.61      0.59      6907
         tie       0.31      0.40      0.35      4480
  win_team_1       0.69      0.56      0.62      9504

    accuracy                           0.54     20891
   macro avg       0.52      0.52      0.52     20891
weighted avg       0.57      0.54      0.55     20891

