
# Task 1 - Dataset, Background, Significance, Motivation.

For this project, I have selected the [soccer](https://www.kaggle.com/datasets/davidcariboo/player-scores?select=transfers.csv) dataset, which is publicly available on Kaggle. 


## Background and Significance




## Motivation


# Task 2 - Exploratory Data Analysis

## 1 Player mean and red cards

# Data Preprocessing
In this project, the objective of the data preprocessing phase was to gather and clean relevant data from multiple sources to prepare it for match outcome prediction. The data spans various levels, including team, competition, match, and player-level data. The primary steps involved in this process were:

1. **Data Loading and Integration**: 
   I began by loading multiple datasets such as player appearances, valuations, and game lineups, ensuring that all necessary information related to players and teams was available. I merged these datasets, creating a unified structure that included key data points like player positions, market value, and club associations.

2. **Data Cleaning**: 
   Missing values in categorical columns, such as manager names, stadiums, and formations, were filled with a default ('Unknown'), while numeric columns such as attendance and average age were filled with the column median. I also dropped duplicate and unnecessary columns to streamline the dataset, and converted string-based monetary values into numerical format for further analysis.

3. **Feature Engineering**: 
   To enhance the predictive power of the data, I generated new features, such as per-minute statistics for goals, assists, yellow and red cards, and calculated the standard deviation and mean for each player's performance across matches. This allowed me to capture both the consistency and overall contribution of players during the season.

4. **Data Transformation**: 
   Position data was mapped to broader categories (e.g., Goalkeepers, Defenders, Midfielders, Attackers) to simplify analysis. Extreme values were capped to minimize the impact of outliers on the model. I also created a performance score based on goals, assists, and cards, weighted according to their importance for the prediction.

5. **Team and Competition Data**: 
   Club-level data (such as average age, transfer records, and market values) and competition details were merged into the main dataset, providing team-level insights that could influence match outcomes.

6. **Match Outcome Labeling**: 
   Finally, I created a column representing match outcomes (win, lose, tie) based on the goals scored by home and away teams. This label serves as the target variable for the match prediction model.

Through these steps, the dataset was thoroughly cleaned and enriched, ensuring that both team-level and player-level data were aligned for accurate match outcome prediction.
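
The per-minute features from step 3 can be sketched on toy data as follows. This is a minimal illustration, not the exact notebook code: the values are hypothetical, and replacing zero-minute rows with `NaN` before dividing is an assumption added here so that the division cannot produce `inf`.

```python
import numpy as np
import pandas as pd

# Toy appearance rows (hypothetical values, not taken from the dataset)
toy_appearances = pd.DataFrame({
    "player_id": [1, 1, 2, 2],
    "goals": [1, 0, 0, 2],
    "minutes_played": [90, 0, 45, 90],
})

# Assumption: treat zero-minute rows as missing so the division yields NaN, not inf
minutes = toy_appearances["minutes_played"].replace(0, np.nan)
toy_appearances["goals_per_minute"] = toy_appearances["goals"] / minutes

# Per-player mean and standard deviation; NaN rows are skipped by default
per_player = toy_appearances.groupby("player_id")["goals_per_minute"].agg(["mean", "std"])
print(per_player)
```

Player 1 has only one usable match here, so the standard deviation is undefined (`NaN`); this is one reason a minimum-minutes threshold is useful before comparing consistency across players.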



## Player Data Integration and Preparation for Match Prediction

The first step gathers the player information needed to predict match outcomes. I loaded and merged player appearances, player details, and valuations, keeping each player's most recent market valuation and most recent lineup position. The pipeline was meant to focus on the 2023 season, but the cell below assigns the full appearances table to `df_last_season` without a date filter, so the statistics are aggregated over all recorded appearances. Adding each player's current club and league ties individual performance to team context. This matters for match prediction because player contributions (goals, assists, cards) feed into team performance, which in turn influences match outcomes.
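
Keeping the most recent record per player follows a sort-then-take-last pattern. The sketch below shows it on a toy valuation history; the rows are hypothetical, and only the column names mirror the dataset.

```python
import pandas as pd

# Toy valuation history (hypothetical values)
toy_valuations = pd.DataFrame({
    "player_id": [10, 10, 26],
    "date": pd.to_datetime(["2020-01-01", "2023-06-01", "2022-03-15"]),
    "market_value_in_eur": [500_000, 1_000_000, 750_000],
})

# Sort chronologically, then keep the last (most recent) row of each player
latest = (
    toy_valuations.sort_values("date")
    .groupby("player_id")
    .tail(1)
    .set_index("player_id")
)
print(latest["market_value_in_eur"])
```

Converting `date` to a datetime type before sorting avoids relying on string ordering, which is only safe when every date uses the same ISO-style format.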

In [52]:
import pandas as pd

# Load the appearance, player, and valuation data
df_appearances = pd.read_csv("./data/appearances.csv")
df_players = pd.read_csv("./data/players.csv")
df_valuations = pd.read_csv("./data/player_valuations.csv")  # Contains 'player_id', 'market_value_in_eur', and 'date'
df_valuations['date'] = pd.to_datetime(df_valuations['date'])  # Convert date to datetime format
df_game_lineups = pd.read_csv("./data/game_lineups.csv")

# Extract the most recent valuation for each player
df_recent_valuations = df_valuations.sort_values(by='date').groupby('player_id').tail(1)
df_recent_position = df_game_lineups.sort_values(by='date').groupby('player_id').tail(1)[['player_id', 'position']]  # Only keep relevant columns

# Parse appearance dates. Note: no season filter is applied here, so
# df_last_season holds every appearance despite its name.
df_appearances['date'] = pd.to_datetime(df_appearances['date'])  # Convert date to datetime format
df_last_season = df_appearances

# Merge the filtered appearances data with the players data to include league info
df_merged = pd.merge(df_last_season, df_players[['player_id', 'current_club_domestic_competition_id','current_club_id', 'name']], on='player_id')

# Group by 'player_id', 'name', and 'current_club_domestic_competition_id' (league), then sum the relevant stats
player_stats = df_merged.groupby(['player_id', 'name', 'current_club_domestic_competition_id', 'current_club_id']).agg({
    'yellow_cards': 'sum',
    'red_cards': 'sum',
    'goals': 'sum',
    'assists': 'sum',
    'minutes_played': 'sum'
}).reset_index()

# Keep only players with at least 1000 minutes played (roughly 11 full 90-minute matches)
min_minutes_threshold = 1000
player_stats = player_stats[player_stats['minutes_played'] >= min_minutes_threshold]

# Calculate the average yellow cards, red cards, goals, and assists per minute played
player_stats['yellow_cards_per_minute'] = player_stats['yellow_cards'] / player_stats['minutes_played']
player_stats['red_cards_per_minute'] = player_stats['red_cards'] / player_stats['minutes_played']
player_stats['goals_per_minute'] = player_stats['goals'] / player_stats['minutes_played']
player_stats['assists_per_minute'] = player_stats['assists'] / player_stats['minutes_played']

# Cap extreme values for better visualization
player_stats['yellow_cards_per_minute'] = player_stats['yellow_cards_per_minute'].clip(upper=0.05)
player_stats['red_cards_per_minute'] = player_stats['red_cards_per_minute'].clip(upper=0.05)
player_stats['goals_per_minute'] = player_stats['goals_per_minute'].clip(upper=0.05)
player_stats['assists_per_minute'] = player_stats['assists_per_minute'].clip(upper=0.05)

# Calculate per-minute stats for each match
df_merged['goals_per_minute'] = df_merged['goals'] / df_merged['minutes_played']
df_merged['assists_per_minute'] = df_merged['assists'] / df_merged['minutes_played']
df_merged['yellow_cards_per_minute'] = df_merged['yellow_cards'] / df_merged['minutes_played']
df_merged['red_cards_per_minute'] = df_merged['red_cards'] / df_merged['minutes_played']

# Group by 'player_id' and calculate the standard deviation and mean for per-minute stats
player_variance_stats_per_minute = df_merged.groupby('player_id').agg({
    'goals_per_minute': ['std', 'mean'],
    'assists_per_minute': ['std', 'mean'],
    'yellow_cards_per_minute': ['std', 'mean'],
    'red_cards_per_minute': ['std', 'mean']
}).reset_index()

# Flatten the column names
player_variance_stats_per_minute.columns = ['player_id', 'goals_per_minute_std', 'goals_per_minute_mean',
                                            'assists_per_minute_std', 'assists_per_minute_mean',
                                            'yellow_cards_per_minute_std', 'yellow_cards_per_minute_mean',
                                            'red_cards_per_minute_std', 'red_cards_per_minute_mean']

# Merge the variance stats per minute with the original player_stats
player_stats = pd.merge(player_stats, player_variance_stats_per_minute, on='player_id')

# Merge the most recent player valuations into the player_stats dataframe
player_stats = pd.merge(player_stats, df_recent_valuations[['player_id', 'market_value_in_eur']], on='player_id')

# Merge the most recent player position into the player_stats dataframe
player_stats = pd.merge(player_stats, df_recent_position, on='player_id', how='left')

# Define a ranking metric (performance score)
player_stats['performance_score'] = (
    player_stats['goals_per_minute'] * 0.6 +  # Higher weight for goals
    player_stats['assists_per_minute'] * 0.3 +  # Moderate weight for assists
    player_stats['yellow_cards_per_minute'] * 0.1  # Lower weight for cards; the sign is positive, so cards raise the score (can adjust or remove)
)

print(player_stats.head(1))


   player_id            name current_club_domestic_competition_id  \
0         10  Miroslav Klose                                  IT1   

   current_club_id  yellow_cards  red_cards  goals  assists  minutes_played  \
0              398            19          0     48       25            8808   

   yellow_cards_per_minute  red_cards_per_minute  goals_per_minute  \
0                 0.002157                   0.0           0.00545   

   assists_per_minute  goals_per_minute_std  goals_per_minute_mean  \
0            0.002838              0.013516               0.006194   

   assists_per_minute_std  assists_per_minute_mean  \
0                 0.01163                 0.003627   

   yellow_cards_per_minute_std  yellow_cards_per_minute_mean  \
0                     0.005105                      0.001925   

   red_cards_per_minute_std  red_cards_per_minute_mean  market_value_in_eur  \
0                       0.0                        0.0              1000000   

         position  perf

## Player Position Mapping and Performance Data Refinement

In this step, I refined the `player_stats` dataframe by selecting key attributes such as goals, yellow and red cards, market value, and performance scores. I then mapped player positions into broader categories—Goalkeeper, Defenders, Midfielders, and Attackers—using a predefined position mapping function. This categorization simplifies the analysis and model training by grouping similar roles. Finally, I calculated the percentage of unmatched positions labeled as "Unknown," ensuring that any inconsistencies in position data were accounted for before proceeding with further analysis.
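
The same grouping can also be expressed as a flat lookup applied with `Series.map`, which avoids looping over the mapping for every row. This is only a sketch: the mapping is abbreviated, and the toy positions (including "Libero") are hypothetical.

```python
import pandas as pd

# Abbreviated, hypothetical mapping: broad group -> detailed positions
group_to_positions = {
    "Defenders": ["Centre-Back", "Left-Back"],
    "Attackers": ["Centre-Forward", "Left Winger"],
}

# Invert to a flat lookup: detailed position -> broad group
position_to_group = {
    pos: group for group, positions in group_to_positions.items() for pos in positions
}

toy_positions = pd.Series(["Centre-Back", "Left Winger", "Libero", None])

# Unmapped or missing values become NaN after map(), then "Unknown"
toy_groups = toy_positions.map(position_to_group).fillna("Unknown")
unknown_share = (toy_groups == "Unknown").mean() * 100
print(toy_groups.tolist(), unknown_share)
```

Because the lookup is case-sensitive, a label such as `midfield` only matches if the mapping spells it the same way as the data.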

In [53]:
# Keep only the columns needed downstream; .copy() avoids a SettingWithCopyWarning on the later assignment
player_stats = player_stats[['player_id', 'goals', 'yellow_cards', 'red_cards', 'performance_score', 'market_value_in_eur', 'position', 'current_club_id']].copy()

# Map detailed positions to four broad groups
position_mapping = {
    'Goalkeeper': ['Goalkeeper'],
    'Defenders': ['Centre-Back', 'Left-Back', 'Right-Back', 'Sweeper', 'Defensive Midfield'],
    'Midfielders': ['Central Midfield', 'Left Midfield', 'Right Midfield', 'Attacking Midfield', 'midfield'],
    'Attackers': ['Centre-Forward', 'Left Winger', 'Right Winger', 'Second Striker', 'Attack']
}

# Function to map positions
def map_position(pos):
    for key, values in position_mapping.items():
        if pos in values:
            return key
    return "Unknown"

# Apply the mapping
player_stats['position'] = player_stats['position'].apply(map_position)

# Calculate percentage of unmatched positions (i.e., "Unknown")
unmatched_percentage = (player_stats['position'] == "Unknown").mean() * 100

# Print the resulting dataframe
print(player_stats.head(3))

# Print the percentage of unmatched values
print(f"Percentage of unmatched positions: {unmatched_percentage:.2f}%")

   player_id  goals  yellow_cards  red_cards  performance_score  \
0         10     48            19          0           0.004337   
1         26      0             4          2           0.000030   
2         65     38            11          1           0.003163   

   market_value_in_eur    position  current_club_id  
0              1000000   Attackers              398  
1               750000  Goalkeeper               16  
2              1000000   Attackers             1091  
Percentage of unmatched positions: 1.07%


## Merging Team, Competition, and Player Data for Match-Level Analysis

In this step, I merged the games table with club and competition data and with the per-player statistics built above to create a match-level dataset. The goal was to capture the team and player attributes needed for match prediction.

By integrating home and away team data (e.g., team age, market value) and competition details, each match record now includes the necessary context about team strength and dynamics. I also included the top-performing players for each team, ensuring their contributions are considered. This complete match-level data is crucial for predicting outcomes based on both team and player performance.
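
Selecting the top players per club and position can be sketched with a grouped rank instead of a row-by-row loop. The example below is an illustration under assumptions: the toy rows and the `toy_top_counts` quotas are hypothetical, and only the column names mirror `player_stats`.

```python
import pandas as pd

# Toy player table (hypothetical values); column names mirror player_stats
toy_players = pd.DataFrame({
    "player_id": [1, 2, 3, 4, 5],
    "current_club_id": [16, 16, 16, 23, 23],
    "position": ["Attackers", "Attackers", "Goalkeeper", "Attackers", "Attackers"],
    "performance_score": [0.004, 0.002, 0.0001, 0.003, 0.005],
})
toy_top_counts = {"Goalkeeper": 1, "Attackers": 1}

# Rank players within each (club, position) group by descending score
rank = toy_players.groupby(["current_club_id", "position"])["performance_score"].rank(
    method="first", ascending=False
)

# Keep a player when the rank is within the quota for that position
quota = toy_players["position"].map(toy_top_counts)
top_players = toy_players[rank <= quota]
print(top_players)
```

Computing the selection once per club, and then joining it to each match, avoids repeating the same filtering for every game a club appears in.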

In [54]:
import pandas as pd

# Load the data from CSV files
data_files = {
    "appearances": './data/appearances.csv',
    "club_games": './data/club_games.csv',
    "clubs": './data/clubs.csv',
    "competitions": './data/competitions.csv',
    "game_events": './data/game_events.csv',
    "game_lineups": './data/game_lineups.csv',
    "games": './data/games.csv',
    "player_valuations": './data/player_valuations.csv',
    "players": './data/players.csv',
    "transfers": './data/transfers.csv'
}

# Load all datasets into a dictionary (read them only once)
dfs = {key: pd.read_csv(path) for key, path in data_files.items()}

print("Data loaded successfully!")



# Optimized: Merging home and away club data in one step
clubs_df = dfs['clubs'][['club_id', 'domestic_competition_id', 'average_age', 'foreigners_percentage', 'stadium_name', 'coach_name','net_transfer_record','total_market_value']]

# Rename columns for home and away club data
home_club_df = clubs_df.rename(columns=lambda x: f"home_team_{x}")
away_club_df = clubs_df.rename(columns=lambda x: f"away_team_{x}")

# Merge all club data at once
merged_df = pd.merge(dfs['games'], home_club_df, left_on='home_club_id', right_on='home_team_club_id', how='left')
merged_df = pd.merge(merged_df, away_club_df, left_on='away_club_id', right_on='away_team_club_id', how='left')

print("Merged club data successfully!")

# Optimized: Merge competition data at once to avoid multiple merges
competitions_df = dfs['competitions'][['competition_id', 'country_id']]
home_competition_df = competitions_df.rename(columns=lambda x: f"home_team_{x}")
away_competition_df = competitions_df.rename(columns=lambda x: f"away_team_{x}")

merged_df = pd.merge(merged_df, home_competition_df, left_on='home_team_domestic_competition_id', right_on='home_team_competition_id', how='left')
merged_df = pd.merge(merged_df, away_competition_df, left_on='away_team_domestic_competition_id', right_on='away_team_competition_id', how='left')

print("Merged competition data successfully!")


# Define the top player count per position category
top_counts = {
    'Goalkeeper': 1,
    'Defenders': 3,
    'Midfielders': 3,
    'Attackers': 3
}

# Mapping of positions to numeric values
position_mapping_numeric = {'Goalkeeper': 1, 'Defenders': 2, 'Midfielders': 3, 'Attackers': 4}

# Function to get top players for home and away teams
def get_top_players_optimized(df, merged_df, top_counts):
    # Filter player data only once by club IDs
    home_club_ids = merged_df['home_club_id'].unique()
    away_club_ids = merged_df['away_club_id'].unique()
    
    # Pre-filter the data for the relevant clubs
    home_players = df[df['current_club_id'].isin(home_club_ids)]
    away_players = df[df['current_club_id'].isin(away_club_ids)]

    # List to collect all player data to avoid row-by-row assignment
    player_data_list = []

    # Process each row in merged_df
    for i, row in merged_df.iterrows():
        # Get home and away top players
        home_top_players = home_players[home_players['current_club_id'] == row['home_club_id']]
        away_top_players = away_players[away_players['current_club_id'] == row['away_club_id']]

        # Collect the top players for both home and away clubs
        home_player_data = assign_top_players(home_top_players, 'home', i, top_counts)
        away_player_data = assign_top_players(away_top_players, 'away', i, top_counts)
        
        # Combine player data
        player_data_list.append({**home_player_data, **away_player_data})
    
    # Convert the collected data into a DataFrame
    player_data_df = pd.DataFrame(player_data_list)

    # Merge the player data into the original merged_df
    merged_df = pd.concat([merged_df, player_data_df], axis=1)
    
    return merged_df

# Optimized function to assign top players directly to columns
def assign_top_players(top_players, prefix, row_index, top_counts):
    positions = ['Goalkeeper', 'Defenders', 'Midfielders', 'Attackers']
    top_players_list = [
        top_players[top_players['position'] == position].nlargest(top_counts[position], 'performance_score')
        for position in positions if position in top_players['position'].values
    ]
    
    # Concatenate all top players for all positions
    top_players = pd.concat(top_players_list, ignore_index=True) if top_players_list else pd.DataFrame()

    # Create a dictionary to hold player data for the current row
    player_data = {}
    
    # Assign players to the columns (limit to 10 players per team)
    for i, player in enumerate(top_players.itertuples(), 1):
        player_data[f'{prefix}_player_{i}_id'] = player.player_id
        player_data[f'{prefix}_player_{i}_goals'] = player.goals
        player_data[f'{prefix}_player_{i}_yellow_cards'] = player.yellow_cards
        player_data[f'{prefix}_player_{i}_red_cards'] = player.red_cards
        player_data[f'{prefix}_player_{i}_performance_score'] = player.performance_score
        player_data[f'{prefix}_player_{i}_market_value_in_eur'] = player.market_value_in_eur
        player_data[f'{prefix}_player_{i}_position'] = player.position
        player_data[f'{prefix}_player_{i}_current_club_id'] = player.current_club_id
        
        if i == 10:  # Limit to 10 players
            break

    return player_data

# Example usage
# Assuming `player_stats` contains player data and `merged_df` is your main dataframe
final_df = get_top_players_optimized(player_stats, merged_df, top_counts)

# Print the final dataframe with the player columns
pd.set_option('display.max_columns', None)
print("\nFinal dataframe with player columns:")
print(final_df.head(1))

Data loaded successfully!
Merged club data successfully!
Merged competition data successfully!

Final dataframe with player columns:
   game_id competition_id  season        round        date  home_club_id  \
0  2321044             L1    2013  2. Matchday  2013-08-18            16   

   away_club_id  home_club_goals  away_club_goals  home_club_position  \
0            23                2                1                 1.0   

   away_club_position home_club_manager_name away_club_manager_name  \
0                15.0           Jürgen Klopp   Torsten Lieberknecht   

             stadium  attendance       referee  \
0  SIGNAL IDUNA PARK     80200.0  Peter Sippel   

                                                                                                   url  \
0  https://www.transfermarkt.co.uk/borussia-dortmund_eintracht-braunschweig/index/spielbericht/2321044   

  home_club_formation away_club_formation     home_club_name  \
0             4-2-3-1             4-3-2-1  Bor

## Data Cleaning and Preprocessing for Match-Level Predictions

This step focuses on preparing the dataset for match-level predictions by cleaning and handling missing values, converting monetary data, and creating the match outcome label.

1. **Dropping Unnecessary Columns**:
   I removed irrelevant columns such as coach names and club IDs that were either duplicated or not useful for predictions.

2. **Handling Missing Values**:
   I filled missing categorical values such as manager names, stadiums, formations, and referees with the default 'Unknown', and missing club positions with -1. For numerical columns like attendance, team average age, and foreigners percentage, I used the median value to fill gaps.

3. **Converting Monetary Values**:
   I converted transfer record data from strings (e.g., "€1.5m") to numerical values to standardize the dataset and filled any missing transfer records with the median value.

4. **Handling Missing Player and Competition Data**:
   Missing values in numeric player-related columns were filled with the mean, while non-numeric and competition-related columns were filled with defaults such as 'Unknown' or 0.

5. **Creating Match Outcome Labels**:
   I defined the match outcome (win, lose, or tie) based on the number of goals scored by the home and away teams. This serves as the target variable for the prediction model.

6. **Validation**:
   After processing, I validated the data by checking for any remaining missing values to ensure completeness.

7. **Saving the Cleaned Data**:
   Finally, I saved the cleaned dataframe for further analysis and model training, confirming that the match outcome and other key features were correctly processed.

This process ensures the dataset is well-prepared for accurate match outcome predictions.
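
The monetary conversion from step 3 can be sketched as a small parser. `parse_euro_amount` is a hypothetical helper written for illustration, and the suffix rules (`k`, `m`, `bn`) are assumptions about the string format rather than something confirmed by the dataset documentation.

```python
def parse_euro_amount(text):
    """Convert strings such as '€1.5m' or '€-500k' to a float in euros.

    Hypothetical helper; the suffix rules are assumptions about the format.
    """
    if not isinstance(text, str):
        return text  # leave NaN, None, or already-numeric values untouched
    cleaned = text.replace("€", "").replace("+", "").replace(",", "").strip().lower()
    multipliers = {"bn": 1e9, "m": 1e6, "k": 1e3}
    for suffix, factor in multipliers.items():
        # Match the suffix only at the end of the string
        if cleaned.endswith(suffix):
            return float(cleaned[: -len(suffix)]) * factor
    return float(cleaned)


for raw in ["€1.5m", "€-500k", "+€20k", "0"]:
    print(raw, "->", parse_euro_amount(raw))
```

Matching the suffix with `endswith` rather than `in` keeps the parser from misreading strings where a letter appears somewhere other than the end.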

In [84]:
# Step 1: Drop unnecessary columns
columns_to_drop = [
    'url', 
    'home_team_coach_name', 'away_team_coach_name',  # Dropping coach names
    'home_team_club_id', 'away_team_club_id',        # Dropping duplicate club IDs
    'home_club_name', 'away_club_name',              # Dropping club names, ID is enough
    'aggregate'
]
cleaned_df = final_df.drop(columns=columns_to_drop)

# Step 2: Handle missing values

# Fill missing positions, manager names, formations, stadiums, and referee with 'Unknown' or a default value
cleaned_df.fillna({
    'home_club_position': -1,
    'away_club_position': -1,
    'home_club_manager_name': 'Unknown',
    'away_club_manager_name': 'Unknown',
    'stadium': 'Unknown',
    'home_team_stadium_name': 'Unknown',
    'away_team_stadium_name': 'Unknown',
    'home_club_formation': 'Unknown',
    'away_club_formation': 'Unknown',
    'referee': 'Unknown'
}, inplace=True)

# Fill missing attendance, average age, and foreigners percentage with the median
cleaned_df['attendance'] = cleaned_df['attendance'].fillna(cleaned_df['attendance'].median())
cleaned_df['home_team_average_age'] = cleaned_df['home_team_average_age'].fillna(cleaned_df['home_team_average_age'].median())
cleaned_df['away_team_average_age'] = cleaned_df['away_team_average_age'].fillna(cleaned_df['away_team_average_age'].median())
cleaned_df['home_team_foreigners_percentage'] = cleaned_df['home_team_foreigners_percentage'].fillna(cleaned_df['home_team_foreigners_percentage'].median())
cleaned_df['away_team_foreigners_percentage'] = cleaned_df['away_team_foreigners_percentage'].fillna(cleaned_df['away_team_foreigners_percentage'].median())

# Step 3: Convert monetary values (transfer records)

def convert_monetary_value(value):
    if isinstance(value, str):
        value = value.replace('€', '').replace('+', '').replace(',', '')
        if 'm' in value:
            return float(value.replace('m', '')) * 1e6
        elif 'k' in value:
            return float(value.replace('k', '')) * 1e3
        else:
            return float(value)
    return value

# Apply the conversion function to both transfer record columns
cleaned_df['home_team_net_transfer_record'] = cleaned_df['home_team_net_transfer_record'].apply(convert_monetary_value)
cleaned_df['away_team_net_transfer_record'] = cleaned_df['away_team_net_transfer_record'].apply(convert_monetary_value)

# Fill missing values in transfer records with the median
cleaned_df['home_team_net_transfer_record'] = cleaned_df['home_team_net_transfer_record'].fillna(cleaned_df['home_team_net_transfer_record'].median())
cleaned_df['away_team_net_transfer_record'] = cleaned_df['away_team_net_transfer_record'].fillna(cleaned_df['away_team_net_transfer_record'].median())

# Step 4: Handle missing values in player and competition columns

# Fill missing player-related columns (numeric ones) with their mean
player_columns = [col for col in cleaned_df.columns if 'player_' in col]
for col in player_columns:
    if pd.api.types.is_numeric_dtype(cleaned_df[col]):
        cleaned_df[col] = cleaned_df[col].fillna(cleaned_df[col].mean())

# Fill remaining missing player-related columns and position-related columns with 0
player_columns = [col for col in cleaned_df.columns if 'player_' in col or 'position' in col]
cleaned_df[player_columns] = cleaned_df[player_columns].fillna(0)

# Fill missing competition-related columns with 'Unknown'
competition_columns = [
    'home_team_domestic_competition_id', 'away_team_domestic_competition_id',
    'home_team_competition_id', 'away_team_competition_id',
    'home_team_country_id', 'away_team_country_id'
]
cleaned_df[competition_columns] = cleaned_df[competition_columns].fillna('Unknown')

# Step 5: Create the match outcome column (win_team_1, lose_team_1, tie)

def determine_match_outcome(home_goals, away_goals):
    if home_goals > away_goals:
        return 'win_team_1'  # Home team wins
    elif home_goals < away_goals:
        return 'lose_team_1'  # Away team wins
    else:
        return 'tie'  # Match tied

# Apply the outcome function to create the new column
cleaned_df['match_outcome'] = cleaned_df.apply(lambda row: determine_match_outcome(row['home_club_goals'], row['away_club_goals']), axis=1)

# Step 6: Validate the cleaning process by checking for missing values
missing_values_after_cleaning = cleaned_df.isnull().mean() * 100
print("Missing values after cleaning:", missing_values_after_cleaning)

# Step 7: Save the cleaned dataframe if necessary
cleaned_df.to_csv('cleaned_data.csv', index=False)

# Print first few rows to ensure the outcome column and cleaning are correct
print(cleaned_df[['home_club_goals', 'away_club_goals', 'match_outcome']].head())





Missing values after cleaning: game_id                               0.0
competition_id                        0.0
season                                0.0
round                                 0.0
date                                  0.0
home_club_id                          0.0
away_club_id                          0.0
home_club_goals                       0.0
away_club_goals                       0.0
home_club_position                    0.0
away_club_position                    0.0
home_club_manager_name                0.0
away_club_manager_name                0.0
stadium                               0.0
attendance                            0.0
referee                               0.0
home_club_formation                   0.0
away_club_formation                   0.0
competition_type                      0.0
home_team_domestic_competition_id     0.0
home_team_average_age                 0.0
home_team_foreigners_percentage       0.0
home_team_stadium_name                0.0
hom
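
The outcome label can also be built without a row-wise `apply`. The sketch below uses `numpy.select` on a toy frame; the goal values are hypothetical, and the label strings match the ones used in the cleaning step.

```python
import numpy as np
import pandas as pd

# Toy results (hypothetical values)
toy_games = pd.DataFrame({
    "home_club_goals": [2, 0, 1],
    "away_club_goals": [1, 3, 1],
})

# Conditions are checked in order; rows matching neither fall back to the default
conditions = [
    toy_games["home_club_goals"] > toy_games["away_club_goals"],
    toy_games["home_club_goals"] < toy_games["away_club_goals"],
]
choices = ["win_team_1", "lose_team_1"]

toy_games["match_outcome"] = np.select(conditions, choices, default="tie")
print(toy_games)
```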

# Reusable testing code

In [99]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, classification_report
from sklearn.preprocessing import LabelEncoder
from sklearn.tree import DecisionTreeClassifier

# Load the dataset
df = pd.read_csv('cleaned_data.csv')

# Preprocess only the categorical feature columns (not the target column)
categorical_columns = df.drop(['match_outcome'], axis=1).select_dtypes(include=['object']).columns.tolist()

label_encoders = {}
for column in categorical_columns:
    le = LabelEncoder()
    df[column] = le.fit_transform(df[column].astype(str))
    label_encoders[column] = le

# Define features (X) and target (y)
X = df.drop(['home_club_goals', 'away_club_goals', 'match_outcome'], axis=1)  # Drop the goal columns and target column
y = df['match_outcome']  # Target is the match outcome (win_team_1, lose_team_1, tie)

# Encode the target labels (LabelEncoder sorts classes alphabetically: lose_team_1 -> 0, tie -> 1, win_team_1 -> 2)
label_encoder = LabelEncoder()
y_encoded = label_encoder.fit_transform(y)

# Split data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y_encoded, test_size=0.3, random_state=42)

# Define the evaluation function
def evaluate_model(y_test, y_pred, label_encoder):
    accuracy = accuracy_score(y_test, y_pred)
    precision = precision_score(y_test, y_pred, average='weighted')
    recall = recall_score(y_test, y_pred, average='weighted')
    f1 = f1_score(y_test, y_pred, average='weighted')

    print(f"Accuracy: {accuracy}")
    print(f"Precision: {precision}")
    print(f"Recall: {recall}")
    print(f"F1 Score: {f1}")

    # Convert numeric predictions and test labels back to original strings
    y_test_labels = label_encoder.inverse_transform(y_test)
    y_pred_labels = label_encoder.inverse_transform(y_pred)
    
    # Use the original class labels for the classification report
    target_names = label_encoder.classes_  # Ensure it contains the original class labels
    
   # print("\nClassification Report:")
   # print(classification_report(y_test_labels, y_pred_labels, target_names=target_names))
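
When reading the encoded target, it helps to check which integer `LabelEncoder` assigns to each outcome. The sketch below does this on a toy label list (hypothetical order, same label strings as the target column).

```python
from sklearn.preprocessing import LabelEncoder

toy_labels = ["win_team_1", "tie", "lose_team_1", "win_team_1"]

encoder = LabelEncoder()
encoded = encoder.fit_transform(toy_labels)

# classes_ is sorted, so the integer codes follow alphabetical order
code_by_label = {label: int(code) for code, label in enumerate(encoder.classes_)}
print(code_by_label)     # {'lose_team_1': 0, 'tie': 1, 'win_team_1': 2}
print(encoded.tolist())  # [2, 1, 0, 2]
```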





# Simple Decision Tree

In [100]:
# Train a Decision Tree model
dt_model = DecisionTreeClassifier(random_state=42)
dt_model.fit(X_train, y_train)

# Predict outcomes for the test set
y_pred_dt = dt_model.predict(X_test)

# Evaluate the Decision Tree model
print("\nDecision Tree Model Evaluation:")
evaluate_model(y_test, y_pred_dt, label_encoder)


Decision Tree Model Evaluation:
Accuracy: 0.4796802450816141
Precision: 0.4813702795526526
Recall: 0.4796802450816141
F1 Score: 0.4804815096714646


# Random Forest with Information Gain (Entropy Criterion)

In [117]:

from sklearn.ensemble import RandomForestClassifier

# Initialize the Random Forest with the entropy criterion
rf_entropy = RandomForestClassifier(criterion='entropy', n_estimators=100, random_state=42)

# Train the model
rf_entropy.fit(X_train, y_train)

# Predict outcomes
y_pred_entropy = rf_entropy.predict(X_test)

# Evaluate the model
evaluate_model(y_test, y_pred_entropy, label_encoder)

Accuracy: 0.5374563209037384
Precision: 0.5080977434157512
Recall: 0.5374563209037384
F1 Score: 0.5157941515737863


# Random Forest with Gini Index

In [116]:
# Initialize the Random Forest with Gini Index
rf_gini = RandomForestClassifier(criterion='gini', n_estimators=100, random_state=42)

# Train the model
rf_gini.fit(X_train, y_train)

# Predicting outcomes
y_pred_gini = rf_gini.predict(X_test)

# Evaluate the model
evaluate_model(y_test, y_pred_gini, label_encoder)

Accuracy: 0.5341534632138242
Precision: 0.506396508988572
Recall: 0.5341534632138242
F1 Score: 0.5143315943279574


# Random Forest with Gini Index and hyperparameter tuning

In [119]:
import pandas as pd
from sklearn.model_selection import train_test_split, RandomizedSearchCV
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, classification_report
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE
from sklearn.preprocessing import LabelEncoder
from scipy.stats import randint

# Set up hyperparameter distribution for Random Forest
param_distributions = {
    'n_estimators': randint(50, 200),
    'max_depth': [3, 5, 7, None],
    'min_samples_split': randint(2, 10),
    'min_samples_leaf': randint(1, 5),
    'bootstrap': [True, False]
}

# Initialize the RandomForestClassifier
rf_model = RandomForestClassifier(criterion='entropy', random_state=42)

# Perform Randomized Search for hyperparameter tuning
random_search = RandomizedSearchCV(estimator=rf_model, param_distributions=param_distributions, n_iter=10,
                                   scoring='accuracy', cv=2, verbose=1, random_state=42)
random_search.fit(X_train, y_train)

# Get the best parameters from the random search
best_params = random_search.best_params_
print("Best parameters found: ", best_params)

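# Train the best model. best_params has no 'criterion' key, so this model uses
# the default 'gini' rather than the 'entropy' criterion used during the search.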
best_rf_model = RandomForestClassifier(**best_params, random_state=42)
best_rf_model.fit(X_train, y_train)

# Feature Importance Pruning
feature_importances = best_rf_model.feature_importances_

# Select top 100 important features
top_n = 100
important_feature_indices = feature_importances.argsort()[-top_n:][::-1]  # Get indices of top features
selected_features = X.columns[important_feature_indices]

# Apply RFE on the reduced feature set
X_train_top = X_train[selected_features]
X_test_top = X_test[selected_features]

rf_model_rfe = RandomForestClassifier(n_estimators=50, max_depth=3, random_state=42)
rfe = RFE(rf_model_rfe, n_features_to_select=30)  # Select top 30 features
X_train_rfe = rfe.fit_transform(X_train_top, y_train)
X_test_rfe = rfe.transform(X_test_top)

# Retraining model with RFE-selected features
best_rf_model.fit(X_train_rfe, y_train)

# Predicting outcomes
y_pred = best_rf_model.predict(X_test_rfe)

evaluate_model(y_test, y_pred, label_encoder)


Fitting 2 folds for each of 10 candidates, totalling 20 fits
Best parameters found:  {'bootstrap': True, 'max_depth': 7, 'min_samples_leaf': 3, 'min_samples_split': 4, 'n_estimators': 137}
Accuracy: 0.5716815853716911
Precision: 0.44905695065668944
Recall: 0.5716815853716911
F1 Score: 0.4975211023829888
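
The weighted precision here (0.449) is well below the weighted recall (0.572). One possible cause, which the output alone does not confirm, is that the pruned model rarely or never predicts one of the classes. The sketch below uses made-up labels to show how to list never-predicted classes and how `zero_division=0` makes the per-class precision explicit in that case:

```python
import numpy as np
from sklearn.metrics import precision_score

# Toy labels (made up for illustration); class 1 is never predicted
y_true = np.array([0, 0, 1, 1, 2, 2, 2, 2])
y_pred = np.array([0, 2, 2, 2, 2, 2, 2, 0])

# Classes that occur in the ground truth but never in the predictions
predicted_classes = np.unique(y_pred)
never_predicted = sorted(int(c) for c in set(np.unique(y_true)) - set(predicted_classes))

# Per-class precision; a class with no predictions is reported as 0 instead of warning
per_class_precision = precision_score(y_true, y_pred, average=None, zero_division=0)

print("Never predicted:", never_predicted)
print("Per-class precision:", per_class_precision)
```

Running the same check on `y_test` and `y_pred` from the cell above would show whether this explanation applies to the real model.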




# XGBoost with Hyperparameter Tuning and Feature Importance Pruning

In [101]:
import pandas as pd
from sklearn.model_selection import train_test_split, RandomizedSearchCV
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, classification_report
from xgboost import XGBClassifier
from sklearn.feature_selection import RFE
from sklearn.preprocessing import LabelEncoder
from scipy.stats import randint

# Set up hyperparameter distribution for XGBoost
param_distributions = {
    'n_estimators': randint(50, 200),
    'max_depth': [3, 5, 7],
    'learning_rate': [0.01, 0.1],
    'subsample': [0.8, 1.0],
    'colsample_bytree': [0.8, 1.0],
    'gamma': [0, 0.1]  # Regularization term
}

# Initialize the XGBClassifier
xgb_model = XGBClassifier(random_state=42)

# Perform Randomized Search for hyperparameter tuning
random_search = RandomizedSearchCV(estimator=xgb_model, param_distributions=param_distributions, n_iter=10,
                                   scoring='accuracy', cv=2, verbose=1, random_state=42)
random_search.fit(X_train, y_train)

# Get the best parameters from the random search
best_params = random_search.best_params_
print("Best parameters found: ", best_params)

# Train the best model
best_xgb_model = XGBClassifier(**best_params, random_state=42)
best_xgb_model.fit(X_train, y_train)

# Feature Importance Pruning
feature_importances = best_xgb_model.feature_importances_

# Select top 100 important features
top_n = 100
important_feature_indices = feature_importances.argsort()[-top_n:][::-1]  # Get indices of top features
selected_features = X.columns[important_feature_indices]

# Apply RFE on the reduced feature set
X_train_top = X_train[selected_features]
X_test_top = X_test[selected_features]

xgb_model_rfe = XGBClassifier(n_estimators=50, max_depth=3, random_state=42)
rfe = RFE(xgb_model_rfe, n_features_to_select=30)  # Select top 30 features
X_train_rfe = rfe.fit_transform(X_train_top, y_train)
X_test_rfe = rfe.transform(X_test_top)

# Retraining model with RFE-selected features
best_xgb_model.fit(X_train_rfe, y_train)

# Predicting outcomes
y_pred = best_xgb_model.predict(X_test_rfe)

evaluate_model(y_test, y_pred, label_encoder)



Fitting 2 folds for each of 10 candidates, totalling 20 fits
Best parameters found:  {'colsample_bytree': 0.8, 'gamma': 0.1, 'learning_rate': 0.1, 'max_depth': 5, 'n_estimators': 58, 'subsample': 1.0}
Accuracy: 0.59437078167632
Precision: 0.5748284381334055
Recall: 0.59437078167632
F1 Score: 0.5400034797900262
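
After RFE, `X_train_rfe` is a plain NumPy array, so the names of the 30 surviving features are no longer attached to the data. They can be recovered from the boolean mask `rfe.support_`. The sketch below shows this on synthetic data; the `DecisionTreeClassifier` stands in for the XGBoost estimator only to keep the example light, and the `feat_*` column names are hypothetical.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the reduced feature set (not the soccer data)
X_arr, y_demo = make_classification(n_samples=200, n_features=8, n_informative=3, random_state=0)
X_demo = pd.DataFrame(X_arr, columns=[f"feat_{i}" for i in range(8)])

# Recursive feature elimination down to 3 features
rfe_demo = RFE(DecisionTreeClassifier(max_depth=3, random_state=0), n_features_to_select=3)
X_reduced = rfe_demo.fit_transform(X_demo, y_demo)

# support_ is a boolean mask over the input columns
kept = X_demo.columns[rfe_demo.support_].tolist()
print(kept)
```

In the notebook, the equivalent expression would be `selected_features[rfe.support_]`, which lists the columns the final model was actually trained on.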


# Gaussian Naive Bayes (For Continuous Numeric Features)

In [110]:
from sklearn.naive_bayes import GaussianNB

# Initialize the Naive Bayes classifier
nb_model = GaussianNB()

# Train the model
nb_model.fit(X_train, y_train)

# Predicting outcomes
y_pred = nb_model.predict(X_test)

# Evaluate the model
evaluate_model(y_test, y_pred, label_encoder)


Accuracy: 0.45790053132928055
Precision: 0.46125514733564554
Recall: 0.45790053132928055
F1 Score: 0.4216126320110794


# Categorical Naive Bayes (For Categorical Data)

In [None]:
from sklearn.naive_bayes import CategoricalNB

# Initialize the Categorical Naive Bayes classifier
nb_model = CategoricalNB()

# Train the model
nb_model.fit(X_train, y_train)

# Predicting outcomes
y_pred = nb_model.predict(X_test)

# Evaluate the model
evaluate_model(y_test, y_pred, label_encoder)
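
This cell has no recorded output. `CategoricalNB` expects every feature to be encoded as non-negative integers that represent category indices, so it is not a natural fit for continuous per-minute statistics. A minimal sketch, assuming hypothetical categorical columns (`venue` and `formation` are made up and not taken from the dataset):

```python
import pandas as pd
from sklearn.naive_bayes import CategoricalNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical categorical match features, made up for illustration
X_cat = pd.DataFrame({
    "venue": ["home", "away", "home", "away", "home", "away"],
    "formation": ["4-4-2", "4-3-3", "4-3-3", "4-4-2", "4-4-2", "4-3-3"],
})
y_cat = ["win", "loss", "win", "loss", "win", "draw"]

# OrdinalEncoder maps each category to an integer index, as CategoricalNB requires
cat_nb = make_pipeline(OrdinalEncoder(), CategoricalNB())
cat_nb.fit(X_cat, y_cat)

preds = cat_nb.predict(X_cat)
print(preds)
```

For the mostly numeric feature matrix used in this notebook, binning the continuous columns first would be needed before this model could be applied.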

# Complement Naive Bayes (For Imbalanced Data)

In [None]:
from sklearn.naive_bayes import ComplementNB

# Initialize the Complement Naive Bayes classifier
nb_model = ComplementNB()

# Train the model
nb_model.fit(X_train, y_train)

# Predicting outcomes
y_pred = nb_model.predict(X_test)

# Evaluate the model
evaluate_model(y_test, y_pred, label_encoder)
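
This cell also has no recorded output. `ComplementNB` was designed for count-like data and rejects negative feature values, so fitting it on standardized or otherwise signed features raises a `ValueError`. One option, shown here as a sketch on synthetic data rather than as the notebook's method, is to rescale the features to the range [0, 1] first:

```python
import numpy as np
from sklearn.naive_bayes import ComplementNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(60, 4))       # synthetic features, includes negative values
y_demo = rng.integers(0, 3, size=60)    # three arbitrary classes

# Fitting directly on signed data fails
try:
    ComplementNB().fit(X_demo, y_demo)
    raised = False
except ValueError:
    raised = True

# Rescaling to [0, 1] makes the input valid; clip=True keeps unseen data in range
cnb = make_pipeline(MinMaxScaler(clip=True), ComplementNB())
cnb.fit(X_demo, y_demo)
preds = cnb.predict(X_demo)

print(raised, preds[:10])
```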


# K-Nearest Neighbour (Euclidean Distance)

In [106]:
from sklearn.neighbors import KNeighborsClassifier

# Initialize k-NN with Euclidean distance and k=5
knn_euclidean = KNeighborsClassifier(metric='euclidean', n_neighbors=5)

# Train the model with Euclidean distance and k=5
knn_euclidean.fit(X_train, y_train)

# Predicting outcomes
y_pred_euclidean = knn_euclidean.predict(X_test)

# Evaluate the model
evaluate_model(y_test, y_pred_euclidean, label_encoder)



Accuracy: 0.4691015269733378
Precision: 0.46252941519896973
Recall: 0.4691015269733378
F1 Score: 0.46232578303920147


# K-Nearest Neighbour (Manhattan Distance)

In [107]:
# Initialize k-NN with Manhattan distance and k=5
knn_manhattan = KNeighborsClassifier(metric='manhattan', n_neighbors=5)

# Train the model with Manhattan distance and k=5
knn_manhattan.fit(X_train, y_train)

# Predicting outcomes
y_pred_manhattan = knn_manhattan.predict(X_test)

# Evaluate the model
evaluate_model(y_test, y_pred_manhattan, label_encoder)



Accuracy: 0.4705854195586616
Precision: 0.4648500595715298
Recall: 0.4705854195586616
F1 Score: 0.46421959887172515


# K-Nearest Neighbour (Euclidean and Manhattan Distance, k=10)

In [109]:
# Initialize k-NN with Euclidean distance and k=10 (default metric is Euclidean)
knn_euclidean = KNeighborsClassifier(metric='euclidean', n_neighbors=10)

# Train the model with Euclidean distance and k=10
knn_euclidean.fit(X_train, y_train)

# Predicting outcomes
y_pred_euclidean = knn_euclidean.predict(X_test)

# Initialize k-NN with Manhattan distance and k=10
knn_manhattan = KNeighborsClassifier(metric='manhattan', n_neighbors=10)

# Train the model with Manhattan distance and k=10
knn_manhattan.fit(X_train, y_train)

# Predicting outcomes
y_pred_manhattan = knn_manhattan.predict(X_test)

# Evaluate the Euclidean model (first block of metrics below)
evaluate_model(y_test, y_pred_euclidean, label_encoder)

# Evaluate the Manhattan model (second block of metrics below)
evaluate_model(y_test, y_pred_manhattan, label_encoder)


Accuracy: 0.4945191709348523
Precision: 0.46940393421632187
Recall: 0.4945191709348523
F1 Score: 0.4768535476342482
Accuracy: 0.4937532908908142
Precision: 0.4693392471113894
Recall: 0.4937532908908142
F1 Score: 0.476812192117953
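
k-NN is distance-based, so features on very different scales (for example market values next to per-minute rates) can dominate the neighbour search. Whether `X_train` was scaled earlier is not visible in this section. The sketch below uses synthetic data in place of the soccer features; it wraps the classifier in a `StandardScaler` pipeline and compares both metrics over several values of k with cross-validation:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic three-class data standing in for the match features
X_demo, y_demo = make_classification(n_samples=300, n_features=10, n_informative=4,
                                     n_classes=3, random_state=0)

scores = {}
for metric in ("euclidean", "manhattan"):
    for k in (5, 10, 20):
        # Scaling happens inside each CV fold, so no information leaks from held-out data
        model = make_pipeline(StandardScaler(),
                              KNeighborsClassifier(n_neighbors=k, metric=metric))
        scores[(metric, k)] = cross_val_score(model, X_demo, y_demo,
                                              cv=3, scoring="accuracy").mean()

best = max(scores, key=scores.get)
print(best, round(scores[best], 3))
```

Comparing the settings on cross-validated scores rather than on the test set keeps the test metrics usable as a final, unbiased estimate.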


# Default Logistic Regression (scikit-learn Defaults, L2 Penalty)

In [114]:
from sklearn.linear_model import LogisticRegression

# Initialize the Logistic Regression model with default parameters
# Note: scikit-learn's default is an L2 penalty with C=1.0, so this model is regularized
log_reg_default = LogisticRegression(solver='lbfgs', max_iter=5000)

# Train the model
log_reg_default.fit(X_train, y_train)

# Predicting outcomes
y_pred_default = log_reg_default.predict(X_test)

# Evaluate the model
evaluate_model(y_test, y_pred_default, label_encoder)


Accuracy: 0.5075869991862525
Precision: 0.3957398687743344
Recall: 0.5075869991862525
F1 Score: 0.43655349738812754


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
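
The solver stopped at the iteration limit even with `max_iter=5000`, and the warning itself suggests scaling the data. The sketch below uses synthetic data, with one column inflated to mimic a large-magnitude feature such as a market value. It standardizes the features in a pipeline and then checks that no `ConvergenceWarning` is raised:

```python
import warnings
from sklearn.datasets import make_classification
from sklearn.exceptions import ConvergenceWarning
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic three-class data; the first column is put on a much larger scale
X_demo, y_demo = make_classification(n_samples=300, n_features=10, n_informative=4,
                                     n_classes=3, random_state=0)
X_demo[:, 0] *= 1e6

# Standardize first, then fit the same lbfgs model as in the cell above
scaled_lr = make_pipeline(StandardScaler(),
                          LogisticRegression(solver="lbfgs", max_iter=5000))

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    scaled_lr.fit(X_demo, y_demo)

converged = not any(issubclass(w.category, ConvergenceWarning) for w in caught)
iterations = scaled_lr[-1].n_iter_   # iterations actually used by the solver
print(converged, iterations)
```

If the real features converge after scaling, the reported metrics would describe a fully fitted model rather than one stopped at the iteration limit.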


# Logistic Regression with L2 Regularization (Ridge)

In [112]:
# Initialize the Logistic Regression model with L2 regularization (default is L2)
log_reg_l2 = LogisticRegression(penalty='l2', solver='lbfgs', max_iter=5000)

# Train the model
log_reg_l2.fit(X_train, y_train)

# Predicting outcomes
y_pred_l2 = log_reg_l2.predict(X_test)

# Evaluate the model
evaluate_model(y_test, y_pred_l2, label_encoder)

Accuracy: 0.501124886314681
Precision: 0.3905773996051592
Recall: 0.501124886314681
F1 Score: 0.42594616334571206


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


# Logistic Regression with L1 Regularization (Lasso)

In [None]:
# Initialize the Logistic Regression model with L1 regularization
log_reg_l1 = LogisticRegression(penalty='l1', solver='liblinear', max_iter=5000)

# Train the model
log_reg_l1.fit(X_train, y_train)

# Predicting outcomes
y_pred_l1 = log_reg_l1.predict(X_test)

# Evaluate the model
evaluate_model(y_test, y_pred_l1, label_encoder)
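
This cell has no recorded output either. The practical difference from the L2 model is that an L1 penalty can drive coefficients to exactly zero, which acts as implicit feature selection. The sketch below uses synthetic binary data in which only two of ten features carry signal; the value `C=0.05` is chosen purely for illustration:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 10))
# Only the first two features influence the label; the other eight are noise
y_demo = (X_demo[:, 0] + 0.5 * X_demo[:, 1] > 0).astype(int)

# Small C means strong regularization
l1_model = LogisticRegression(penalty="l1", solver="liblinear", C=0.05, max_iter=5000)
l1_model.fit(X_demo, y_demo)

# Count the coefficients that the L1 penalty set to exactly zero
n_zero = int((l1_model.coef_ == 0).sum())
print(f"{n_zero} of {l1_model.coef_.size} coefficients are exactly zero")
```

Counting the zeroed coefficients of `log_reg_l1` in the same way would show how many of the engineered features the model actually uses.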
