# Excerpt of Machine Learning Project — Università Bocconi (Spring 2025)

This notebook presents selected excerpts of the code I developed for my Machine Learning coursework project at Università Bocconi.

**Note:** The dataset used in the original project is not included here for confidentiality reasons.  
The purpose of this notebook is to demonstrate the main preprocessing and modeling pipelines (Ridge and Random Forest regressions) I implemented.

In [None]:
import numpy as np
import pandas as pd

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Preprocessing
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OrdinalEncoder, StandardScaler

# Model construction and evaluation
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV, cross_val_score, train_test_split

# Regression
from sklearn.linear_model import LinearRegression, Ridge

# Ensembles
from sklearn.ensemble import RandomForestRegressor
from xgboost import XGBRegressor

## Data Preprocessing

Below are the preprocessing steps applied to clean and prepare the dataset prior to model training.
These included handling missing values, encoding categorical features, and log-transforming the target variable.
The data were then partitioned into goalkeepers and outfielders so that each group could be modeled with its own position-specific features.

*Log-transform target 'value_eur'*

In [None]:
# Define new dataset
df_train_log = df_train_raw.copy()

# Apply logarithmic transformation
df_train_log['log_value_eur'] = np.log1p(df_train_log['value_eur'])

# Drop the original 'value_eur' column
df_train_log = df_train_log.drop(columns=['value_eur'])

# Drop columns with >30% missing values
# Drop rows with a missing target (a missing 'value_eur' becomes NaN in 'log_value_eur')
df_train_clean = ...
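
The cleaning step itself is elided in this excerpt. As a rough illustration of what the two comments describe, the sketch below applies both rules to a toy frame. The helper name `drop_sparse_and_untargeted` and the toy columns are hypothetical and are not taken from the original project; only the 30% threshold comes from the comment above.

```python
import numpy as np
import pandas as pd


def drop_sparse_and_untargeted(df, target="log_value_eur", max_missing=0.30):
    """Hypothetical helper: drop sparse columns, then rows without a target."""
    # Share of missing values per column
    missing_share = df.isna().mean()
    # Keep the target unconditionally; keep other columns only if they are dense enough
    keep_cols = [c for c in df.columns if c == target or missing_share[c] <= max_missing]
    # Remove rows where the target is missing
    return df[keep_cols].dropna(subset=[target])


# Toy data (hypothetical values)
toy = pd.DataFrame({
    "overall": [80, 75, 70, 65],
    "mostly_missing": [np.nan, np.nan, 1.0, np.nan],  # 75% missing, so it is dropped
    "log_value_eur": [15.2, np.nan, 13.9, 13.1],      # one row lacks the target
})

toy_clean = drop_sparse_and_untargeted(toy)
print(toy_clean)
```

Keeping the target out of the column filter avoids accidentally discarding it when many target values are missing.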

*One-Hot Encode Dummies*

In [None]:
df_train_clean = pd.get_dummies(df_train_clean, columns=['preferred_foot','work_rate','body_type'], drop_first=True)

*Impute missing values*

In [None]:
# Identify numeric columns excluding subset-specific columns
numeric_cols = df_train_clean.select_dtypes(include=['number']).columns
excluded_cols = ['pace', 'shooting', 'passing', 'dribbling', 'defending', 'physic', 'goalkeeping_speed']
cols_to_impute = [col for col in numeric_cols if col not in excluded_cols]

# Initialize the median imputer
imputer = SimpleImputer(strategy="median")

# Apply imputation only to the selected numeric columns
df_train_clean[cols_to_impute] = imputer.fit_transform(df_train_clean[cols_to_impute])

*Partition into 'Goalkeepers' and 'Outfielders'*

In [None]:
# Step 1: Define unique features
gk_features = ['goalkeeping_speed']
of_features = ['pace', 'shooting', 'passing', 'dribbling', 'defending', 'physic']

# Step 2: Identify shared features
exclude_features = gk_features + of_features + ['player_positions']
shared_features = [col for col in df_train_clean.columns if col not in exclude_features]

# Step 3: Define masks
gk_mask = df_train_clean['player_positions'].str.contains('GK', na=False)
of_mask = ~gk_mask

# Step 4: Build feature sets
gk_full_features = shared_features + gk_features
of_full_features = shared_features + of_features

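# Step 5: Partition the data
# shared_features already contains the target, so duplicates must be removed;
# dict.fromkeys does this while keeping a deterministic column order (set() does not)
gk_cols = list(dict.fromkeys(gk_full_features + ['log_value_eur']))
of_cols = list(dict.fromkeys(of_full_features + ['log_value_eur']))
df_train_gk = df_train_clean.loc[gk_mask, gk_cols].copy()
df_train_of = df_train_clean.loc[of_mask, of_cols].copy()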

## Ridge Regression

To address the multicollinearity identified in the feature space, Ridge regression was fitted separately for the two player subsets, **Goalkeepers** and **Outfielders**, because their position-specific features are mutually exclusive. Each pipeline applied z-score standardization (`StandardScaler`) before the Ridge estimator, so that the penalty acts on all features on a common scale.

A **GridSearchCV** procedure with 5-fold cross-validation was used to tune the regularization parameter `alpha` over 100 equally spaced values between 0.05 and 5.0. The scoring metric was the coefficient of determination (R²), so the selected `alpha` is the one with the best mean held-out R² across folds. After the optimal `alpha` was identified for each subset, the final models were refitted on the full training data and **in-sample RMSE** was computed on the log-transformed target variable. Because in-sample RMSE is measured on the data the model was fitted to, it tends to understate out-of-sample error. For interpretability, the log-scale RMSE was later converted to an approximate normal (euro) scale figure using the mean of the target variable.
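
For reference, the quantities involved can be written out as follows. This is the standard Ridge objective as minimized by scikit-learn's `Ridge` (here applied to the standardized features; the intercept is omitted for brevity and is not penalized), together with the usual RMSE definition. The symbols $\beta$, $\hat{y}_i$, and $n$ are notation introduced here and do not appear in the original project.

$$
\hat{\beta} = \arg\min_{\beta} \; \lVert y - X\beta \rVert_2^2 + \alpha \lVert \beta \rVert_2^2,
\qquad
\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n} \left(y_i - \hat{y}_i\right)^2}
$$

A larger `alpha` shrinks the coefficients more strongly, which is what stabilizes the estimates when features are highly correlated.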

In [None]:
# Split features and targets
X_gk = df_train_gk.drop(columns=['log_value_eur'])
y_gk = df_train_gk['log_value_eur']

X_of = df_train_of.drop(columns=['log_value_eur'])
y_of = df_train_of['log_value_eur']

# Create dictionary datasets
datasets = {
    'Goalkeepers': (X_gk, y_gk),
    'Outfielders': (X_of, y_of)
}

In [None]:
# Define ridge pipeline
ridge_pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('ridge', Ridge())
])

# Define parameter grid
param_grid_ridge = {
    'ridge__alpha': np.linspace(0.05, 5.0, 100)
}

# Run GridSearchCV for each dataset
results = {}

for name, (X, y) in datasets.items():
    ridge_grid = GridSearchCV(
        estimator=ridge_pipe,
        param_grid=param_grid_ridge,
        scoring='r2',
        cv=5,
        n_jobs=-1
    )
    ridge_grid.fit(X, y)
    results[name] = {
        'best_alpha_ridge': ridge_grid.best_params_['ridge__alpha'],
        'best_score_ridge': ridge_grid.best_score_
    }
    print(f"{name} — Best alpha: {results[name]['best_alpha_ridge']}")

In [None]:
# Ridge model evaluation
ridge_results = {}

for name, (X, y) in datasets.items():
    best_alpha_ridge = results[name]['best_alpha_ridge']
    
    # Create final pipeline with best alpha
    final_pipe_ridge = Pipeline([
        ('scaler', StandardScaler()),
        ('ridge', Ridge(alpha=best_alpha_ridge))
    ])
    
    # Fit model on full training data
    final_pipe_ridge.fit(X, y)
    
    # Predict on training data
    preds = final_pipe_ridge.predict(X)
    
    # Compute in-sample RMSE
    rmse_in_sample = np.sqrt(mean_squared_error(y, preds))

    ridge_results[name] = {
        'best_alpha_ridge': best_alpha_ridge,
        'rmse_in_sample': rmse_in_sample,
    }

    print(f"{name} — Best alpha: {best_alpha_ridge}, In-sample RMSE: {rmse_in_sample:.4f}")
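
In-sample RMSE is usually optimistic, so a cross-validated RMSE can be a useful companion figure. The sketch below is not part of the original project: it uses synthetic data from `make_regression` as a stand-in for `(X, y)`, and the `alpha` value is arbitrary. It relies on scikit-learn's `neg_root_mean_squared_error` scorer.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for (X, y); the project data is not included
X_demo, y_demo = make_regression(n_samples=200, n_features=10, noise=5.0, random_state=0)

demo_pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("ridge", Ridge(alpha=1.0)),  # alpha chosen arbitrarily for this demo
])

# scikit-learn maximizes scores, so RMSE is exposed as a negative value
neg_rmse = cross_val_score(
    demo_pipe, X_demo, y_demo, cv=5, scoring="neg_root_mean_squared_error"
)
cv_rmse = -neg_rmse.mean()

# In-sample RMSE for comparison: fit and predict on the same data
demo_preds = demo_pipe.fit(X_demo, y_demo).predict(X_demo)
in_sample_rmse = np.sqrt(np.mean((y_demo - demo_preds) ** 2))

print(f"CV RMSE: {cv_rmse:.3f} | in-sample RMSE: {in_sample_rmse:.3f}")
```

Because the scaler sits inside the pipeline, it is refitted on each training fold, so the held-out fold does not influence the standardization.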

In [None]:
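# Log-scale in-sample RMSE per subset (from the evaluation loop above)
ridge_rmse_gk = ridge_results['Goalkeepers']['rmse_in_sample']
ridge_rmse_of = ridge_results['Outfielders']['rmse_in_sample']

# Subset sizes, used as weights
n_gk, n_of = len(y_gk), len(y_of)

# Combined RMSE (sample-size-weighted, still on the log scale)
ridge_rmse_c = np.sqrt((n_gk * ridge_rmse_gk**2 + n_of * ridge_rmse_of**2) / (n_gk + n_of))

# Convert log-scale RMSE to an approximate euro-scale value
# mean_value is assumed to be the mean of 'value_eur' in euros (per the description above);
# it is defined elsewhere in the full project and not shown in this excerpt
ridge_rmse = mean_value * (np.exp(ridge_rmse_c) - 1)

print(f"RIDGE RMSE: €{ridge_rmse:,.2f}")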
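
The rescaling above is an approximation. One alternative, which is not what the original project did, is to invert the `log1p` transform on the predictions with `np.expm1` and compute RMSE directly in euros. The sketch below compares the two on simulated values; `y_log` and `preds_log` are hypothetical stand-ins for the log-scale targets and predictions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical log-scale targets and predictions (stand-ins for y and preds)
y_log = rng.normal(loc=14.0, scale=1.0, size=1000)
preds_log = y_log + rng.normal(loc=0.0, scale=0.2, size=1000)

# Invert log1p to recover euro values
y_eur = np.expm1(y_log)
preds_eur = np.expm1(preds_log)

# RMSE on the log scale and directly on the euro scale
rmse_log = np.sqrt(np.mean((y_log - preds_log) ** 2))
rmse_eur_direct = np.sqrt(np.mean((y_eur - preds_eur) ** 2))

# Approximation used in the notebook: mean euro value times (exp(RMSE) - 1)
rmse_eur_approx = y_eur.mean() * (np.exp(rmse_log) - 1)

print(f"log-scale RMSE:        {rmse_log:.4f}")
print(f"euro RMSE (direct):    {rmse_eur_direct:,.2f}")
print(f"euro RMSE (approx.):   {rmse_eur_approx:,.2f}")
```

The two euro figures generally differ, because the direct version is dominated by errors on the most valuable players while the approximation scales a typical relative error by the mean value.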

## Random Forest

To capture nonlinear relationships and higher-order feature interactions that linear models cannot represent, separate **Random Forest Regressors** were trained for the **Goalkeeper** and **Outfielder** subsets. Each model used bootstrap sampling, so every tree leaves out part of the training rows. Predictions on those out-of-bag (OOB) rows estimate generalization error without an explicit validation split, which keeps all rows available for training.

The models used feature subsampling (`max_features='sqrt'`) and averaging over 100 trees (`n_estimators=100`) to reduce variance and mitigate overfitting. OOB predictions provided an estimate of out-of-sample RMSE for each subset, and combined OOB performance was computed by concatenating the predictions from both models. The Random Forest approach achieved the lowest RMSE and highest R² (up to **0.996**) among all tested algorithms. Model errors were subsequently converted from the log scale to an approximate euro scale for interpretability.

In [None]:
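# Note: gk_features / of_features were lists of column names in the partition step;
# from here on they are rebound to the feature DataFrames

# Goalkeepers
gk_features = df_train_gk.drop(columns=['log_value_eur'])
gk_target   = df_train_gk['log_value_eur']

# Outfielders
of_features = df_train_of.drop(columns=['log_value_eur'])
of_target   = df_train_of['log_value_eur']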

In [None]:
# Define a pipeline wrapping the Random Forest (OOB enabled); tree-based models need no scaler
def build_rf_pipeline_oob(n_estimators=100, max_depth=None, max_features='sqrt'):
    return Pipeline([
        ('rfoob', RandomForestRegressor(
            n_estimators=n_estimators,
            max_depth=max_depth,
            max_features=max_features,
            bootstrap=True,
            oob_score=True,
            random_state=42,
            n_jobs=-1
        ))
    ])

# Train for Goalkeepers
print("Training Random Forest with OOB for Goalkeepers...")
rf_gk_pipeline = build_rf_pipeline_oob()
rf_gk_pipeline.fit(gk_features, gk_target)
rf_gk_model = rf_gk_pipeline.named_steps['rfoob']
# np.sqrt(MSE) is used because the `squared` argument was removed in scikit-learn 1.6
gk_oob_rmse = np.sqrt(mean_squared_error(gk_target, rf_gk_model.oob_prediction_))
print(f"Goalkeepers OOB RMSE: {gk_oob_rmse:.4f}")

# Train for Outfielders
print("\nTraining Random Forest with OOB for Outfielders...")
rf_of_pipeline = build_rf_pipeline_oob()
rf_of_pipeline.fit(of_features, of_target)
rf_of_model = rf_of_pipeline.named_steps['rfoob']
of_oob_rmse = np.sqrt(mean_squared_error(of_target, rf_of_model.oob_prediction_))
print(f"Outfielders OOB RMSE: {of_oob_rmse:.4f}")

In [None]:
# Combine predictions and truths
rf_oob_preds = np.concatenate([rf_gk_model.oob_prediction_, rf_of_model.oob_prediction_])
rf_oob_truth = np.concatenate([gk_target, of_target])

overall_rf_oob_rmse = np.sqrt(mean_squared_error(rf_oob_truth, rf_oob_preds))
print(f"\nCombined Random Forest OOB RMSE: {overall_rf_oob_rmse:.4f}")

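# Approximate euro-scale RMSE; mean_value is assumed to be the mean of 'value_eur' in euros,
# defined elsewhere in the full project
rmse_rf_oob_normal = mean_value * (np.exp(overall_rf_oob_rmse) - 1)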

print(f"Random Forest RMSE: €{rmse_rf_oob_normal:,.2f}")
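
The notebook combines subset errors in two ways: a sample-size-weighted formula for Ridge and concatenated predictions for Random Forest. The sketch below, using simulated residuals rather than project data, checks that the two give the same combined RMSE. The residual arrays and subset sizes are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(42)

# Simulated log-scale residuals for the two subsets (hypothetical sizes and spreads)
err_gk = rng.normal(loc=0.0, scale=0.3, size=50)
err_of = rng.normal(loc=0.0, scale=0.2, size=450)

# Per-subset RMSE
rmse_gk = np.sqrt(np.mean(err_gk ** 2))
rmse_of = np.sqrt(np.mean(err_of ** 2))
n_gk, n_of = err_gk.size, err_of.size

# Route 1: sample-size-weighted combination of the subset RMSEs
rmse_weighted = np.sqrt((n_gk * rmse_gk**2 + n_of * rmse_of**2) / (n_gk + n_of))

# Route 2: RMSE over the concatenated residuals
rmse_concat = np.sqrt(np.mean(np.concatenate([err_gk, err_of]) ** 2))

print(f"weighted: {rmse_weighted:.6f} | concatenated: {rmse_concat:.6f}")
```

Both routes sum the same squared errors and divide by the same total count, so the combined Ridge and Random Forest figures are directly comparable.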