<div style="background-color:#2b0000; color:white; padding:25px; border-radius:10px; 
            text-align:center; font-family:'Segoe UI', sans-serif;">

  <h1 style="margin-bottom:8px;"> Cars 4 You 🏎️💨</h1>
  <h3 style="margin-top:0; font-style:italic; font-weight:normal; color:#f05a5a;">
    Auxiliary Notebook – Neural Network Creation
  </h3>

  <hr style="width:60%; border:1px solid #700000; margin:15px auto;">

  <p style="margin:5px 0; font-size:15px;">
    <b>Group 4</b> - Machine Learning Project (2025/2026)
  </p>
  <p style="margin:0; font-size:13px; color:#e3bdbd;">
    Master in Data Science and Advanced Analytics - Nova Information Management School
  </p>
</div>

<br>

<div style="background-color:#3a0808; color:#f4eaea; padding:15px 20px; border-left:5px solid #700000; 
            border-radius:6px; font-family:'Segoe UI', sans-serif; font-size:14px;">

  <b> Notebook Context</b><br>
  In this auxiliary notebook, we developed a pipeline that automates neural network creation and hyperparameter tuning using Optuna.
</div>

<br>

<div style="text-align:center; margin-top:10px;">
  <a href="main.ipynb" 
     style="display:inline-block; background-color:#700000; color:#fff; 
            padding:8px 16px; text-decoration:none; border-radius:6px; 
            font-family:'Segoe UI', sans-serif; font-size:13px;">
     <- Back to Main Notebook
  </a>
</div>

<br>

<div style="text-align:right; font-size:12px; color:#d8bfbf;">
  Last updated: November 2025
</div>


In [45]:
import numpy as np
import pandas as pd
from pathlib import Path
import os
import pickle
import matplotlib.pyplot as plt
import optuna
from joblib import Parallel, delayed
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

In [46]:
import tensorflow as tf
print("TensorFlow version:", tf.__version__)
print("Available devices:", tf.config.list_physical_devices()) 

TensorFlow version: 2.10.0
Available devices: [PhysicalDevice(name='/physical_device:CPU:0', device_type='CPU')]


In [47]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout, Input
from tensorflow.keras.optimizers import Adam, RMSprop
from tensorflow.keras.callbacks import EarlyStopping


### Keras Wrapper

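`KerasWrapper` is a lightweight class that gives a Keras neural network the same basic interface as a scikit-learn estimator.

By implementing `.fit()` and `.predict()`, it lets the rest of the pipeline (fold evaluation, final training, prediction) treat the network like any other regressor. Note that scikit-learn utilities that clone estimators (e.g., `cross_val_score`, `GridSearchCV`) additionally expect `get_params()` and `set_params()`, which this wrapper does not implement.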
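
As a minimal sketch of why this interface is convenient: any object exposing `fit()` and `predict()` can be handed to the same evaluation helper. `MeanRegressor` and `holdout_rmse` are hypothetical names used only for illustration and are not part of the project code.

```python
import math


class MeanRegressor:
    """Toy stand-in for KerasWrapper: always predicts the training mean."""

    def fit(self, X, y):
        self.mean_ = sum(y) / len(y)
        return self

    def predict(self, X):
        return [self.mean_ for _ in X]


def holdout_rmse(estimator, X_train, y_train, X_val, y_val):
    """Fit on the training split and return RMSE on the validation split.

    Relies only on the estimator exposing fit() and predict().
    """
    estimator.fit(X_train, y_train)
    preds = estimator.predict(X_val)
    mse = sum((p - t) ** 2 for p, t in zip(preds, y_val)) / len(y_val)
    return math.sqrt(mse)


rmse = holdout_rmse(MeanRegressor(), [[0], [1], [2]], [1.0, 2.0, 3.0],
                    [[3], [4]], [2.0, 4.0])
print(rmse)
```

Swapping `MeanRegressor()` for another object with the same two methods would leave `holdout_rmse` unchanged, which is the property the wrapper relies on.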

In [48]:
def build_keras_model(input_shape, hidden_layers=1, units=64, activation='relu', dropout=0.0, lr=0.001, optimizer='adam'):
    model = Sequential()
    model.add(Input(shape=(input_shape,)))
    model.add(Dense(units, activation=activation))
    
    for _ in range(hidden_layers - 1): # add hidden layers
        model.add(Dense(units, activation=activation))
        if dropout > 0:
            model.add(Dropout(dropout))
    
    model.add(Dense(1))
    
    # choose optimizer
    if optimizer.lower() == 'adam':
        opt = Adam(learning_rate=lr)
    elif optimizer.lower() == 'rmsprop':
        opt = RMSprop(learning_rate=lr)
    else:
        raise ValueError("Unsupported optimizer")
    
    model.compile(optimizer=opt, loss='mse') # compile, loss function = minimize mean squared error
    return model
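
To make the resulting architecture easier to inspect, the following sketch mirrors the layer-stacking loop of `build_keras_model` without building a TensorFlow model. `layer_plan` is a hypothetical helper, and the `Input` layer is left out for brevity. It shows that, as the loop is written, dropout is only inserted after the second and later hidden layers, so a network with `hidden_layers=1` contains no dropout regardless of the `dropout` value.

```python
def layer_plan(hidden_layers=1, units=64, dropout=0.0):
    """Return the layer sequence that build_keras_model would stack, as strings."""
    layers = [f"Dense({units})"]          # first hidden layer, never followed by dropout
    for _ in range(hidden_layers - 1):    # remaining hidden layers
        layers.append(f"Dense({units})")
        if dropout > 0:
            layers.append(f"Dropout({dropout})")
    layers.append("Dense(1)")             # single linear output for regression
    return layers


print(layer_plan(hidden_layers=1, dropout=0.2))
print(layer_plan(hidden_layers=3, dropout=0.2))
```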

In [49]:
class KerasWrapper:
    def __init__(self, input_shape=None, hidden_layers=1, units=64, activation='relu', dropout=0.0,
                 lr=0.001, optimizer='adam', epochs=50, batch_size=32, verbose=0):
        # Initialize hyperparameters and model settings
        self.input_shape = input_shape
        self.hidden_layers = hidden_layers
        self.units = units
        self.activation = activation
        self.dropout = dropout
        self.lr = lr
        self.optimizer = optimizer
        self.epochs = epochs
        self.batch_size = batch_size
        self.verbose = verbose
        self.model = None  # placeholder for Keras model

    def fit(self, X, y, trial=None):
        if self.input_shape is None:
            self.input_shape = X.shape[1]

        self.model = build_keras_model(
            input_shape=self.input_shape,
            hidden_layers=self.hidden_layers,
            units=self.units,
            activation=self.activation,
            dropout=self.dropout,
            lr=self.lr,
            optimizer=self.optimizer
        )

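        # Early stopping (within trial). The model is compiled with loss='mse'
        # and no extra metrics, so the validation quantity Keras logs is 'val_loss'.
        early_stop = EarlyStopping(
            monitor='val_loss',
            patience=10,
            restore_best_weights=True
        )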

        callbacks = [early_stop]

        self.model.fit(
            X, y,
            epochs=self.epochs,
            batch_size=self.batch_size,
            verbose=self.verbose,
            validation_split=0.1,
            callbacks=callbacks
        )
        return self  # scikit-learn convention: fit() returns the estimator

    def predict(self, X):
        # Predict outputs for new data, flatten to 1D array (metric functions require that)
        return self.model.predict(X, verbose=0).flatten()
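
The effect of `patience=10` combined with `restore_best_weights=True` can be illustrated with a simplified, framework-free simulation. This is a sketch of the idea rather than Keras's implementation: `simulate_early_stopping` is a hypothetical helper, and it ignores options such as `min_delta`. Training stops once the monitored validation quantity has failed to improve for `patience` consecutive epochs, and the weights from the best epoch are the ones that are kept.

```python
def simulate_early_stopping(val_losses, patience):
    """Return (stop_epoch, best_epoch) as 0-based indices.

    Simplified: any strict decrease counts as an improvement (no min_delta).
    """
    best_loss = float("inf")
    best_epoch = -1
    wait = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch    # training halts here
    return len(val_losses) - 1, best_epoch  # ran through every epoch


stop_epoch, best_epoch = simulate_early_stopping([1.0, 0.8, 0.9, 0.95, 0.97], patience=2)
print(stop_epoch, best_epoch)
```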

### Fold Evaluation with Fixed Hyperparameters

The function `evaluate_fold_with_hyperparams()` evaluates the performance of the NN model on a single CV fold using a predefined set of hyperparameters. For the specified fold, it loads the training and validation data, fits a Keras-based model, and generates predictions on the validation set.

Predictions and true values are first mapped back from the min-max scaled range to log-prices and then passed through `expm1` (the inverse of `log1p`) to recover prices in their original units. Model performance is finally assessed using the **Root Mean Squared Error (RMSE)** on the validation data, which is returned as the evaluation metric.
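
The two inverse steps can be checked on a small round trip. This sketch assumes the target was transformed with `log1p` and then min-max scaled to `[0, 1]`, which is what the use of `np.expm1` and of the scaler's `data_min_`/`data_max_` attributes suggests; the prices are made up.

```python
import numpy as np

prices = np.array([5000.0, 12000.0, 30000.0])   # made-up prices
log_prices = np.log1p(prices)                   # forward log transform

# Forward min-max scaling to [0, 1]
data_min, data_max = log_prices.min(), log_prices.max()
data_range = data_max - data_min
scaled = (log_prices - data_min) / data_range

# Inverse: undo the scaling first, then undo the log
recovered_log = scaled * data_range + data_min
recovered = np.expm1(recovered_log)

print(recovered)
```

The order matters: applying `expm1` before undoing the scaling would exponentiate values in `[0, 1]` rather than log-prices.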


In [50]:
def evaluate_fold_with_hyperparams(fold_idx, folds_dir, hyperparams):
    """
    Evaluate a single fold with given hyperparameters.
    Returns validation RMSE on the original scale (after unscaling/unlogging).
    """
    # Load training and validation data for this fold
    fold_path = f'{folds_dir}/fold_{fold_idx}'
    train_df = pd.read_csv(f'{fold_path}/train{fold_idx}_FINAL.csv')
    val_df = pd.read_csv(f'{fold_path}/validation{fold_idx}_FINAL.csv')
    
    # Prepare input features (X) and target (y)
    X_train = train_df.drop(columns=['price_log']).values
    y_train = train_df['price_log'].values
    X_val = val_df.drop(columns=['price_log']).values
    y_val = val_df['price_log'].values
    
    # Initialize and train model with the given hyperparameters
    model = KerasWrapper(
        input_shape=X_train.shape[1],
        hidden_layers=hyperparams['hidden_layers'],
        units=hyperparams['units'],
        activation=hyperparams['activation'],
        dropout=hyperparams['dropout'],
        lr=hyperparams['lr'],
        optimizer=hyperparams['optimizer'],
        epochs=hyperparams['epochs'],
        batch_size=hyperparams['batch_size'],
        verbose=1
    )
    
    # Fit the model on training data
    model.fit(X_train, y_train)
    
    # Predict on validation data
    y_pred = model.predict(X_val)

    # ========================

    # Load the scaler used for preprocessing
    scaler_path = f'{fold_path}/scaler.pkl'
    with open(scaler_path, 'rb') as f:
        scaler = pickle.load(f)
    
    price_log_idx = train_df.columns.tolist().index('price_log')
    print(f"price_log index: {price_log_idx}") 

    data_min = scaler.data_min_[price_log_idx]
    data_max = scaler.data_max_[price_log_idx]
    data_range = data_max - data_min
    
    # Reverse scaling
    y_val_log = y_val * data_range + data_min
    y_pred_log = y_pred * data_range + data_min
    
    # Reverse log transform to get original prices
    y_val_actual = np.expm1(y_val_log)
    y_pred_actual = np.expm1(y_pred_log)
    
    # Calculate RMSE on the original price scale
    rmse = np.sqrt(mean_squared_error(y_val_actual, y_pred_actual))
    
    return rmse

### Optuna Objective Function

Instead of using GridSearch or RandomSearch, we decided to use **Optuna** for hyperparameter tuning. Rather than brute-forcing or randomly sampling hyperparameters, Optuna uses a *Bayesian optimization–based strategy* to explore the search space efficiently by **learning from the performance of previous trials** and focusing subsequent trials on the most promising regions.

The `objective()` function defines the optimization target for Optuna. For each trial, it samples a set of NN hyperparameters and evaluates them with cross-validation over the 5 folds; the average RMSE across folds is returned as the objective value to be minimized. **Because the folds are trained independently of each other, they could be trained in parallel** (for example with `joblib`). In `objective()` as implemented here, however, the folds are evaluated sequentially to avoid GPU conflicts and race conditions. The trials themselves are also run one after another, because the sampler uses the results of earlier trials to propose the next set of hyperparameters.
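
Since fold evaluations do not depend on one another, they can be dispatched concurrently and averaged afterwards. The sketch below uses the standard library's `ThreadPoolExecutor` as a dependency-free stand-in for `joblib`; `fake_fold_rmse` is a hypothetical placeholder for training and scoring one fold. Whether concurrent or sequential evaluation is appropriate for real TensorFlow training depends on the hardware available.

```python
from concurrent.futures import ThreadPoolExecutor
from statistics import mean


def fake_fold_rmse(fold_idx):
    """Hypothetical placeholder for training and scoring one fold."""
    return 1000.0 + 10.0 * fold_idx


def average_rmse_parallel(n_folds=5, max_workers=5):
    fold_ids = range(1, n_folds + 1)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() preserves input order, so scores line up with fold_ids
        scores = list(pool.map(fake_fold_rmse, fold_ids))
    return mean(scores), scores


avg, scores = average_rmse_parallel()
print(avg, scores)
```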

In [51]:
def objective(trial, folds_dir='./folds', n_folds=5):
    """
    Optuna objective function: suggest hyperparameters and evaluate on CV folds
    Sequential fold evaluation (no parallelization) to avoid GPU conflicts and race conditions.
    """  
    # Suggest hyperparameters for this trial
    hyperparams = {
        'hidden_layers': trial.suggest_int('hidden_layers', 1, 4),
        'units': trial.suggest_int('units', 32, 256, step=32),
        'activation': trial.suggest_categorical('activation', ['relu', 'tanh', 'sigmoid']),
        'dropout': trial.suggest_float('dropout', 0.0, 0.5, step=0.1),
        'lr': trial.suggest_float('lr', 1e-5, 1e-2, log=True),
        'optimizer': trial.suggest_categorical('optimizer', ['adam', 'rmsprop']),
        'epochs': trial.suggest_int('epochs', 30, 100),
        'batch_size': trial.suggest_categorical('batch_size', [16, 32, 64, 128]),
    }
    
    print(f"\n[Trial {trial.number}] Testing hyperparams: {hyperparams}")
    
    # Evaluate each CV fold sequentially
    fold_scores = []
    try:
        for fold_idx in range(1, n_folds + 1):
            rmse = evaluate_fold_with_hyperparams(fold_idx, folds_dir, hyperparams)
            fold_scores.append(rmse)
            print(f"  Fold {fold_idx} RMSE: {rmse:.4f}")
    except Exception as e:
        print(f"  Trial failed: {e}")
        return float('inf')  # bad trial

    # Compute average across folds
    avg_rmse = np.mean(fold_scores)
    print(f"  Average RMSE: {avg_rmse:.4f}")

    # minimize RMSE
    return avg_rmse
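
The `lr` range in `objective()` is declared with `log=True`, which asks Optuna to sample in the log domain so that each order of magnitude between `1e-5` and `1e-2` gets comparable coverage. The idea can be sketched with the standard library alone; `sample_log_uniform` is a hypothetical helper that shows plain log-uniform sampling and is not Optuna's sampler, which additionally takes the trial history into account.

```python
import math
import random


def sample_log_uniform(low, high, rng):
    """Draw a value whose logarithm is uniform on [log(low), log(high)]."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))


rng = random.Random(0)
samples = [sample_log_uniform(1e-5, 1e-2, rng) for _ in range(10_000)]

geometric_mid = math.sqrt(1e-5 * 1e-2)   # about 3.16e-4
share_below_mid = sum(s < geometric_mid for s in samples) / len(samples)
print(f"share below {geometric_mid:.2e}: {share_below_mid:.2f}")
```

Roughly half of the draws fall below the geometric midpoint, whereas uniform sampling on the raw scale would place almost all draws above it.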

### Main Function

In [52]:
def run_optuna_optimization(folds_dir='./folds', n_trials=200, output_dir='./optuna_results'):
    """
    Run Optuna hyperparameter optimization (trials and folds are evaluated sequentially).
    
    Parameters:
    - folds_dir: directory containing fold data
    - n_trials: number of trials to run sequentially
    - output_dir: where to save results
    """
    
    os.makedirs(output_dir, exist_ok=True)
    
    print(f"\n{'='*70}")
    print(f"Starting Optuna Hyperparameter Optimization")
    #print(f"Fold parallelization: n_jobs_folds={n_jobs_folds}")
    print(f"{'='*70}\n")
    
    # Create study
    study = optuna.create_study(direction='minimize')  # minimize RMSE
    
    # Optimize (trials run sequentially; folds are evaluated sequentially inside objective)
    study.optimize(
        lambda trial: objective(trial, folds_dir=folds_dir, n_folds=5),
        n_trials=n_trials,
        n_jobs=1  # Keep at 1 to avoid GPU conflicts
    )
    
    # Get best trial
    best_trial = study.best_trial
    
    print(f"\n{'='*70}")
    print(f"Optimization Complete!")
    print(f"{'='*70}")
    print(f"\nBest Trial: {best_trial.number}")
    print(f"Best RMSE: ${best_trial.value:.2f}")
    print(f"Best Hyperparameters:")
    for key, value in best_trial.params.items():
        print(f"  {key}: {value}")
    
    # Save results
    results_df = study.trials_dataframe()
    
    # Customize output: rename value, drop datetime columns there by default (keep duration)
    results_df = results_df.rename(columns={'value': 'average_rmse'})
    results_df = results_df.drop(columns=['datetime_start', 'datetime_complete'])
    
    # Reorder columns: number, average_rmse, duration, params..., state
    cols = ['number', 'average_rmse', 'duration'] + [col for col in results_df.columns if col.startswith('params_')] + ['state']
    results_df = results_df[cols]
    
    results_path = f'{output_dir}/optuna_trials.csv'
    results_df.to_csv(results_path, index=False)
    print(f"\n✓ Trial results saved to: {results_path}")
    
    # Save best hyperparameters
    best_params_path = f'{output_dir}/best_hyperparams.pkl'
    with open(best_params_path, 'wb') as f:
        pickle.dump(best_trial.params, f)
    print(f"✓ Best hyperparameters saved to: {best_params_path}")
    
    return study, best_trial.params

In [53]:
# Uncomment to re-run the search (slow); the saved results are loaded in the next cell.
# study, best_params = run_optuna_optimization(
#     folds_dir='./preprocessing_results/folds',
#     n_trials=30,
#     output_dir='./optuna_results'
# )
#
# print("\nBest hyperparameters to use:")
# print(best_params)

In [54]:
with open('./optuna_results/best_hyperparams.pkl', 'rb') as f:
    best_hyperparams = pickle.load(f)

print(best_hyperparams)

{'hidden_layers': 3, 'units': 64, 'activation': 'tanh', 'dropout': 0.0, 'lr': 2.213225954589399e-05, 'optimizer': 'adam', 'epochs': 73, 'batch_size': 64}


## Use the Best Architecture/Hyperparameters

Using the Optuna results, the `train_best_nn_no_cv()` function loads the best hyperparameters, builds a NN with that architecture, and trains it on the full training dataset.
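
Because the keys stored by Optuna match the keyword arguments of `KerasWrapper.__init__`, the reloaded dictionary can also be unpacked directly into the constructor. The sketch below shows the save/load round trip and the unpacking with a hypothetical `StubModel` that only copies the constructor signature; the hyperparameter values are illustrative.

```python
import pickle
import tempfile
from pathlib import Path


class StubModel:
    """Hypothetical stand-in with the same constructor arguments as KerasWrapper."""

    def __init__(self, input_shape=None, hidden_layers=1, units=64, activation='relu',
                 dropout=0.0, lr=0.001, optimizer='adam', epochs=50, batch_size=32, verbose=0):
        self.config = dict(hidden_layers=hidden_layers, units=units, activation=activation,
                           dropout=dropout, lr=lr, optimizer=optimizer,
                           epochs=epochs, batch_size=batch_size)


# Illustrative values only
params = {'hidden_layers': 3, 'units': 64, 'activation': 'tanh', 'dropout': 0.0,
          'lr': 2.2e-05, 'optimizer': 'adam', 'epochs': 73, 'batch_size': 64}

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / 'best_hyperparams.pkl'
    with open(path, 'wb') as f:
        pickle.dump(params, f)
    with open(path, 'rb') as f:
        loaded = pickle.load(f)

model = StubModel(input_shape=10, **loaded)  # dict keys map onto keyword arguments
print(model.config == params)
```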

In [55]:
def train_best_nn_no_cv(optuna_params_path='./optuna_results/best_hyperparams.pkl',
                        results_dir='./preprocessing_results/full_dataset',
                        output_base_dir='./models_best',
                        model_name='NN_Optuna_NoCV',
                        verbose=1):
    """
    Train best neural network (from Optuna) on full dataset without CV.
    
    Parameters:
    - optuna_params_path: path to best_hyperparams.pkl from Optuna
    - results_dir: directory with train_FINAL.csv and test_FINAL.csv
    - output_base_dir: output directory for submission
    - model_name: name for the model
    - verbose: 0=silent, 1=progress bar, 2=per epoch output
    
    Returns:
    - submission_df: DataFrame with carID and price predictions
    """
    
    # Load best hyperparameters
    print(f"Loading best hyperparameters from: {optuna_params_path}")
    with open(optuna_params_path, 'rb') as f:
        best_hyperparams = pickle.load(f)
    
    print(f"Best hyperparameters:")
    for key, value in best_hyperparams.items():
        print(f"  {key}: {value}")
    
    # Create output directory
    output_dir = f'{output_base_dir}/{model_name}'
    os.makedirs(output_dir, exist_ok=True)
    
    print(f"\n{'='*70}")
    print(f"Running Pipeline (No CV) for: {model_name}")
    print(f"{'='*70}\n")
    
    # Load paths
    train_path = f"{results_dir}/train_FINAL.csv"
    test_path = f"{results_dir}/test_FINAL.csv"
    
    # Load complete train dataset
    print(f"Loading train data...")
    train_df = pd.read_csv(train_path)
    print(f"  Train shape: {train_df.shape}")
    print(f"  Train columns: {train_df.columns.tolist()}")
    
    # Load complete test dataset
    print(f"Loading test data...")
    test_df = pd.read_csv(test_path)
    print(f"  Test shape: {test_df.shape}")
    print(f"  Test columns: {test_df.columns.tolist()}\n")
    
    # Get price_log index
    train_columns_with_target = train_df.columns.tolist()
    price_log_idx = train_columns_with_target.index('price_log')
    print(f"  price_log is at index {price_log_idx}\n")
    
    # Prepare training data
    X_train = train_df.drop(columns=['price_log']).values
    y_train = train_df['price_log'].values
    
    # Prepare test data
    X_test = test_df.drop(columns=['carID']).values
    car_ids = test_df['carID'].values 
    
    # Train model with best hyperparameters
    print(f"Training {model_name}...")
    model = KerasWrapper(
        input_shape=X_train.shape[1],
        hidden_layers=best_hyperparams['hidden_layers'],
        units=best_hyperparams['units'],
        activation=best_hyperparams['activation'],
        dropout=best_hyperparams['dropout'],
        lr=best_hyperparams['lr'],
        optimizer=best_hyperparams['optimizer'],
        epochs=best_hyperparams['epochs'],
        batch_size=best_hyperparams['batch_size'],
        verbose=verbose
    )
    
    model.fit(X_train, y_train)
    print(f"  ✓ Model trained\n")
    
    # Predict on test
    print(f"Generating predictions on test set...")
    y_test_pred_scaled = model.predict(X_test)
    
    # Load scaler to unscale predictions
    scaler_path = f'{results_dir}/scaler.pkl'
    with open(scaler_path, 'rb') as f:
        scaler = pickle.load(f)

    # Use the price_log_idx from earlier 
    data_min = scaler.data_min_[price_log_idx]
    data_max = scaler.data_max_[price_log_idx]
    data_range = data_max - data_min
    
    # Unscale and unlog test predictions
    y_test_pred_log = y_test_pred_scaled * data_range + data_min
    y_test_pred_actual = np.expm1(y_test_pred_log)
    
    print(f"  Predictions range: ${y_test_pred_actual.min():.2f} - ${y_test_pred_actual.max():.2f}\n")
    
    # ============= SAVE SUBMISSION =============
    
    submission_df = pd.DataFrame({
        'carID': car_ids,
        'price': y_test_pred_actual
    })
    
    submission_path = f'{output_dir}/submission_{model_name}.csv'
    submission_df.to_csv(submission_path, index=False)
    
    print(f"✓ Submission saved to: {submission_path}")
    print(f"  Preview:")
    print(submission_df.head().to_string(index=False))
    
    # Save hyperparameters used
    params_path = f'{output_dir}/hyperparameters_used.pkl'
    with open(params_path, 'wb') as f:
        pickle.dump(best_hyperparams, f)
    print(f"✓ Hyperparameters saved to: {params_path}")
    
    print(f"\n{'='*70}")
    print(f"Pipeline complete for: {model_name}")
    print(f"{'='*70}\n")
    
    return submission_df

In [56]:
submission_df = train_best_nn_no_cv(
        optuna_params_path='./optuna_results/best_hyperparams.pkl',
        results_dir='./preprocessing_results/full_dataset',
        output_base_dir='./models_best',
        model_name='NN_Optuna_NoCV',
        verbose=1 
    )

Loading best hyperparameters from: ./optuna_results/best_hyperparams.pkl
Best hyperparameters:
  hidden_layers: 3
  units: 64
  activation: tanh
  dropout: 0.0
  lr: 2.213225954589399e-05
  optimizer: adam
  epochs: 73
  batch_size: 64

Running Pipeline (No CV) for: NN_Optuna_NoCV

Loading train data...
  Train shape: (74402, 11)
  Train columns: ['model_encoded', 'tax', 'car_age', 'mileage', 'mpg', 'engineSize', 'transmission_Manual', 'transmission_Semi-Auto', 'fuelType_Diesel', 'fuelType_Hybrid', 'price_log']
Loading test data...
  Test shape: (32567, 11)
  Test columns: ['carID', 'model_encoded', 'tax', 'car_age', 'mileage', 'mpg', 'engineSize', 'transmission_Manual', 'transmission_Semi-Auto', 'fuelType_Diesel', 'fuelType_Hybrid']

  price_log is at index 10

Training NN_Optuna_NoCV...
Epoch 1/73
Epoch 2/73
Epoch 3/73
Epoch 4/73
Epoch 5/73
Epoch 6/73
Epoch 7/73
Epoch 8/73
Epoch 9/73
Epoch 10/73
Epoch 11/73
Epoch 12/73
Epoch 13/73
Epoch 14/73
Epoch 15/73
Epoch 16/73
Epoch 17/73
Epoch