# Part 1 — Multi-Layer Perceptron (Age Regression)

| Name            | Student ID | Email                                      |
|-----------------|------------|--------------------------------------------|
| Valeria Avino   | 1905974    | avino.1905974@studenti.uniroma1.it         |
| Marta Lombardi  | 2156537    | lombardi.2156537@studenti.uniroma1.it      |


This notebook implements a custom Multi-Layer Perceptron (MLP) for a **regression task** on the dataset `AGE_PREDICTION.csv`. The network is trained by minimizing an **L2-regularized loss function** with manually derived gradients and a numerical optimizer from `scipy.optimize`.

The neural network:
- Uses **at least two hidden layers**
- Applies **L2 regularization** to the weight matrices
- Supports **sigmoid** or **tanh** activation functions
- Is optimized via **L-BFGS-B**, without relying on automatic differentiation libraries
- Evaluates performance with the **Mean Absolute Percentage Error (MAPE)**
- Selects hyperparameters using **k-fold cross-validation**
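
The evaluation metric named in the list can be illustrated with a short sketch. The notebook's own `mape` is imported from the helper module and its definition is not shown here, so the function below (`mape_percent`, a hypothetical name) is only an assumed form: the mean of the absolute relative errors, expressed in percent, which is consistent with the `%` sign used in the printed results.

```python
import numpy as np

def mape_percent(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Mean absolute percentage error in percent (assumed definition)."""
    y_true = np.asarray(y_true, dtype=float).ravel()
    y_pred = np.asarray(y_pred, dtype=float).ravel()
    # Each error is measured relative to the true value, so true values must be non-zero.
    return float(100.0 * np.mean(np.abs((y_true - y_pred) / y_true)))

# Toy ages: relative errors are 5/20, 10/40 and 0/80
ages = np.array([20.0, 40.0, 80.0])
preds = np.array([25.0, 30.0, 80.0])
example_mape = mape_percent(ages, preds)
print(f"MAPE: {example_mape:.2f}%")
```

Because the error is relative to the true age, the same absolute miss counts more for a young subject than for an old one.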

This notebook is structured to:
1. Preprocess the dataset
2. Define and optimize the MLP model from scratch
3. Perform model selection using cross-validation
4. Report training, validation, and test results
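
The training setup described above (hand-written gradients passed to `scipy.optimize`) can be sketched on synthetic data. This is a minimal illustration and not the notebook's `objective_function`: the two-hidden-layer tanh network without an output bias, the mean-squared-error scaling, the penalty `lam * sum(W**2)`, and all names (`unpack`, `objective`, `theta0`) are assumptions made for the example.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)

# Synthetic data in the notebook's layout: features [D, N], target [1, N]
D, N, H1, H2 = 3, 200, 6, 4
X = rng.normal(size=(D, N))
y = np.sin(X[:1]) + 0.5 * X[1:2]

# Parameter shapes in flattening order: W1, b1, W2, b2, v (output weights)
shapes = [(H1, D), (H1, 1), (H2, H1), (H2, 1), (1, H2)]
sizes = [int(np.prod(s)) for s in shapes]

def unpack(theta):
    """Split the flat parameter vector back into matrices."""
    parts = np.split(theta, np.cumsum(sizes)[:-1])
    return [p.reshape(s) for p, s in zip(parts, shapes)]

def objective(theta, X, y, lam):
    """Return (regularized loss, flat gradient), as expected when jac=True."""
    W1, b1, W2, b2, v = unpack(theta)
    n = X.shape[1]
    # Forward pass
    a1 = np.tanh(W1 @ X + b1)
    a2 = np.tanh(W2 @ a1 + b2)
    err = v @ a2 - y
    loss = np.mean(err ** 2) + lam * (np.sum(W1 ** 2) + np.sum(W2 ** 2) + np.sum(v ** 2))
    # Backward pass (chain rule, layer by layer)
    d_out = 2.0 * err / n
    g_v = d_out @ a2.T + 2.0 * lam * v
    d2 = (v.T @ d_out) * (1.0 - a2 ** 2)
    g_W2 = d2 @ a1.T + 2.0 * lam * W2
    g_b2 = d2.sum(axis=1, keepdims=True)
    d1 = (W2.T @ d2) * (1.0 - a1 ** 2)
    g_W1 = d1 @ X.T + 2.0 * lam * W1
    g_b1 = d1.sum(axis=1, keepdims=True)
    grad = np.concatenate([g.ravel() for g in (g_W1, g_b1, g_W2, g_b2, g_v)])
    return loss, grad

theta0 = rng.normal(scale=0.5, size=sum(sizes))
loss0, _ = objective(theta0, X, y, 1e-3)
result = minimize(objective, theta0, args=(X, y, 1e-3),
                  method="L-BFGS-B", jac=True, options={"maxiter": 200})
print(f"objective: {loss0:.4f} -> {result.fun:.4f} in {result.nit} iterations")
```

Returning the loss and the gradient from a single function avoids running the forward pass twice per iteration, which is why `jac=True` is used.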



In [1]:
# importing all necessary libraries
import time
from typing import Callable, Tuple, List, Dict, Any

import numpy as np
import pandas as pd
from scipy.optimize import minimize
from sklearn.base import BaseEstimator, RegressorMixin
from sklearn.exceptions import NotFittedError
from sklearn.metrics import make_scorer
from sklearn.model_selection import KFold, train_test_split, GridSearchCV


# importing auxiliary functions from the Functions_11_Avino_Lombardi.py file
from Functions_11_Avino_Lombardi import (
    forward, backward,
    g1, dg1_dx, g2, dg2_dx,
    mse_loss, mape,
    initialize_parameters, unroll_params, roll_params,
    check_gradients_with_central_differences,
    objective_function, final_report_metrics
)

### Data loading and splitting
After loading the data, we randomly partition it into a training set ($80\%$) and a test set ($20\%$). This separation is essential for evaluating how well the trained model generalizes to unseen (test) data.

In [3]:
df = pd.read_csv('/Users/Val/Documents/GitHub/OMDS-Project/dataset/AGE_PREDICTION.csv')

# Separate features (X_full) and target (y_full) from the entire dataset
feature_columns = [f'feat_{i}' for i in range(1, 33)]
X_full = df[feature_columns].values # Shape will be [N, D]
y_full = df['gt'].values.reshape(-1, 1) # Shape will be [N, 1]


print(f"Initial X shape: {X_full.shape}")
print(f"Initial y shape: {y_full.shape}")

# Get total number of samples
N_samples_full = X_full.shape[0]
X_train, X_test, y_train, y_test = train_test_split(X_full, y_full, test_size=0.2, random_state=1234)

X_train = X_train.T # Transpose to [D, Ntrain]
y_train = y_train.T # Transpose to [1, Ntrain]
X_test = X_test.T # Transpose to [D, N-Ntrain]
y_test = y_test.T # Transpose to [1, N-Ntrain]

# Determine D_input (number of input features) - from training data to check shapes
D_input = X_train.shape[0]
y_output_dim = y_train.shape[0]
print(f"D_input (number of input features): {D_input}")
print(f"y_output_dim (number of output dimensions): {y_output_dim}\n")
print(f"X_train shape after split: {X_train.shape}") # (N_features, N_train)
print(f"y_train shape after split: {y_train.shape}") # (1, N_train)

Initial X shape: (20475, 32)
Initial y shape: (20475, 1)
D_input (number of input features): 32
y_output_dim (number of output dimensions): 1

X_train shape after split: (32, 16380)
y_train shape after split: (1, 16380)


### Feature standardization

Before running any optimization routine, we standardize the data.
For each feature $x_i$:
$$x_i^{\text{normalized}}=\frac{x_i-\mu_i}{\sigma_i}$$

where $\mu_i$ and $\sigma_i$ are the **mean** and **standard deviation** of the $i^{th}$ feature, computed on the **training set** only, so that no test information is used before evaluation. The same transformation is then applied to the test data.

Standardization puts all features on a comparable scale, so that no feature dominates the loss landscape (and hence the gradient) merely because of its units. This reduces the risk of saturated activations and badly scaled gradients, and typically speeds up the convergence of our L-BFGS-B optimizer.

In [4]:
# manual, but can be done with "StandardScaler"
# only computed from train data
mu = X_train.mean(axis=1, keepdims=True)
sigma = X_train.std(axis=1, keepdims=True)

# Handle constant features (zero standard deviation): leave them unscaled,
# so a slightly different test value is not blown up by a tiny divisor
sigma[sigma == 0] = 1.0

# Apply the transformation to the TRAINING DATA
X_train_normalized = (X_train - mu) / sigma

# Apply the *SAME* transformation (using mu and sigma from training) to the TEST DATA
X_test_normalized = (X_test - mu) / sigma
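
As the comment in the cell above notes, the same transformation is available in scikit-learn. The sketch below checks, on synthetic data, that the manual formula and `StandardScaler` agree. The arrays and the names `mu_demo`, `sigma_demo`, `X_tr`, `X_te` are made up for the example; note that `StandardScaler` expects samples in rows, hence the transposes.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_tr = rng.normal(loc=5.0, scale=3.0, size=(4, 50))  # [D, N_train], notebook layout
X_te = rng.normal(loc=5.0, scale=3.0, size=(4, 10))  # [D, N_test]

# Manual standardization with training statistics only
mu_demo = X_tr.mean(axis=1, keepdims=True)
sigma_demo = X_tr.std(axis=1, keepdims=True)
manual_tr = (X_tr - mu_demo) / sigma_demo
manual_te = (X_te - mu_demo) / sigma_demo

# Same operation with scikit-learn: fit on train, transform both sets
scaler = StandardScaler().fit(X_tr.T)
sk_tr = scaler.transform(X_tr.T).T
sk_te = scaler.transform(X_te.T).T

print(np.allclose(manual_tr, sk_tr), np.allclose(manual_te, sk_te))
```

Both approaches use the population standard deviation (`ddof=0`), which is why the results coincide.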

Before training, we verify the analytical (backpropagation) gradient against a numerical estimate obtained with the __central differences__ approximation:
$$f'(x) \approx \frac{f(x+\Delta x)-f(x-\Delta x)}{2\Delta x}$$
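
The formula can be applied coordinate by coordinate to check any hand-written gradient. The sketch below does this for a small ridge-regression loss whose gradient is known in closed form. It is a standalone illustration under assumed names (`central_difference_gradient`, `loss`, `analytic_grad`), not the notebook's `check_gradients_with_central_differences`, whose code lives in the helper module.

```python
import numpy as np

def central_difference_gradient(f, theta, eps=1e-6):
    """Approximate the gradient of f at theta, one coordinate at a time."""
    grad = np.zeros_like(theta)
    for i in range(theta.size):
        step = np.zeros_like(theta)
        step[i] = eps
        grad[i] = (f(theta + step) - f(theta - step)) / (2.0 * eps)
    return grad

# Toy objective with a known gradient: mean squared error plus an L2 penalty
rng = np.random.default_rng(0)
A = rng.normal(size=(20, 5))
t = rng.normal(size=20)
lam = 0.01

def loss(theta):
    r = A @ theta - t
    return np.mean(r ** 2) + lam * np.sum(theta ** 2)

def analytic_grad(theta):
    return 2.0 * A.T @ (A @ theta - t) / len(t) + 2.0 * lam * theta

theta = rng.normal(size=5)
num = central_difference_gradient(loss, theta)
ana = analytic_grad(theta)
rel_error = np.linalg.norm(num - ana) / (np.linalg.norm(num) + np.linalg.norm(ana))
print(f"relative error: {rel_error:.2e}")
```

Each coordinate costs two loss evaluations, so the check is usually run on a small network and a subset of the data, as done in the next cell.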


In [5]:
print("\nPerforming gradient check...")

# subset of training data for gradient check 
num_samples_for_check = min(1000, X_train_normalized.shape[1])
X_check_subset = X_train_normalized[:, :num_samples_for_check]
y_check_subset = y_train[:, :num_samples_for_check]

# toy set of hyperparameters for the check
check_L = 3
check_neurons_config = [5, 5]
check_activation_func = g1 # also checked with g2
check_activation_prime = dg1_dx
check_reg_factor = 0.01

# initialize parameters
W_check_init, b_check_init, v_check_init = initialize_parameters(
    D_input, check_neurons_config, y_output_dim, check_reg_factor
)
initial_flat_params_for_check = unroll_params(W_check_init, b_check_init, v_check_init)

W_shapes_for_check = [W.shape for W in W_check_init]
b_shapes_for_check = [b.shape for b in b_check_init]
v_shape_for_check = v_check_init.shape

# gradient check function
check_gradients_with_central_differences(
    initial_flat_params_for_check,
    X_check_subset, y_check_subset,
    W_shapes_for_check, b_shapes_for_check, v_shape_for_check,
    check_activation_func, check_activation_prime,
    check_reg_factor, check_L,
    objective_function
)

print("Gradient check finished.")



Performing gradient check...

--- Gradient Check (Central Differences) ---
Checking 200 parameters...
Analytical loss at initial point: 1678.853975

Gradient check PASSED! Norm of difference: 1.407312e-06
------------------------------------------
Gradient check finished.


### K-Fold Cross Validation

We now perform **5-fold cross-validation** on the training data to select the Multi-Layer Perceptron (MLP) architecture. The training set is partitioned into $k=5$ equally sized, disjoint subsets. For each fold $i\in\{1,\dots,5\}$, the model is trained on the other 4 folds and then evaluated on the held-out fold $F_i$.

This cross-validation procedure is integrated with a **Grid Search** strategy. The Grid Search exhaustively explores a predefined hyperparameter space **H**, which includes combinations of:

- Total number of layers $L\in\{2,3,4\}$, i.e. $L-1\in\{1,2,3\}$ hidden layers

- Number of neurons per hidden layer

- Choice of activation function (sigmoid or hyperbolic tangent)

- Regularization factor ($\lambda$) for the L2 penalty

For each hyperparameter combination in the grid, a model is trained and evaluated $5$ times (once per fold). The "best performance on average on validation sets" is measured by the mean of the evaluation metric, the Mean Absolute Percentage Error (MAPE), over the $5$ validation folds of that combination. The **optimal MLP architecture** is the combination with the **lowest average MAPE**. Averaging over folds gives a more reliable performance estimate than a single train/validation split, because it reduces the variance tied to one particular split.
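
The selection procedure described above can be sketched as a plain loop over the grid and the folds. The notebook's `my_k_fold_CV` is not shown here, so this is not that function: to stay short and self-contained, the sketch swaps the MLP for a closed-form ridge regression on synthetic data, and every name in it (`fit_ridge`, `predict_ridge`, `mape_percent`, `use_squares`) is hypothetical.

```python
import numpy as np
from itertools import product
from sklearn.model_selection import KFold

def mape_percent(y_true, y_pred):
    return float(100.0 * np.mean(np.abs((y_true - y_pred) / y_true)))

def fit_ridge(X, y, lam):
    """Stand-in for MLP training: closed-form ridge with an unpenalized intercept."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    reg = lam * np.eye(Xb.shape[1])
    reg[-1, -1] = 0.0
    return np.linalg.solve(Xb.T @ Xb + reg, Xb.T @ y)

def predict_ridge(w, X):
    return np.hstack([X, np.ones((X.shape[0], 1))]) @ w

# Synthetic, strictly positive targets (samples in rows here)
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y = 40.0 + X @ np.array([5.0, -3.0, 2.0, 0.0]) + rng.normal(scale=2.0, size=300)

grid = {"lam": [0.001, 0.01, 0.1], "use_squares": [False, True]}
kf = KFold(n_splits=5, shuffle=True, random_state=1234)

cv_results = {}
for lam, use_squares in product(grid["lam"], grid["use_squares"]):
    feats = np.hstack([X, X ** 2]) if use_squares else X
    fold_scores = []
    for train_idx, val_idx in kf.split(feats):
        w = fit_ridge(feats[train_idx], y[train_idx], lam)
        y_val_pred = predict_ridge(w, feats[val_idx])
        fold_scores.append(mape_percent(y[val_idx], y_val_pred))
    # Score of a configuration = mean validation MAPE over the 5 folds
    cv_results[(lam, use_squares)] = float(np.mean(fold_scores))

best_config = min(cv_results, key=cv_results.get)
print(best_config, f"{cv_results[best_config]:.3f}%")
```

Reusing the same `KFold` object with a fixed `random_state` means every configuration is scored on identical folds, so the averages are directly comparable.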

In [None]:
# Define Hyperparameter Search Space
hyperparameter_grid = {
    'num_layers': [2, 3, 4],
    'num_neurons_per_layer': {
        2: [[8], [16], [32]], # For L=2
        3: [[8, 8], [16, 16], [32, 32]], # For L=3
        4: [[8, 8, 8], [16, 16, 16]] # For L=4
    },
    'activation_function': [(g1, dg1_dx), (g2, dg2_dx)], # Pass tuples of (func, prime_func)
    'regularization_factor': [0.001, 0.01, 0.1]
}

# --- Call our custom K-Fold CV function ---
# max_iter_minimize is capped at 500 during the CV phase to keep the grid search fast.
best_mape, best_hyperparameters, best_training_results = my_k_fold_CV(
    X_train_norm=X_train_normalized,
    y_train_data=y_train,
    D_input=D_input,
    y_output_dim=y_output_dim,
    hyperparameter_grid=hyperparameter_grid,
    n_splits=5,
    random_seed=1234,
    max_iter_minimize=500 # Use 500 iterations for the CV search for speed
)

print("\n--- Optimal Hyperparameters and Performance (from K-Fold CV) ---")
print(f"Optimal Number of Layers (L): {best_hyperparameters['num_layers']}")
print(f"Optimal Number of Neurons per Layer (N): {best_hyperparameters['num_neurons_per_layer']}")
print(f"Optimal Activation Function: {best_hyperparameters['activation_function']}")
print(f"Optimal Regularization Factor (lambda): {best_hyperparameters['regularization_factor']}")
print(f"Max Iterations for Optimizer (during CV): {best_hyperparameters['max_iter_minimize']}")
print(f"Optimization Solver: L-BFGS-B ({best_training_results['optimization_solver']})")
print(f"Average Number of Iterations for Optimization (during CV): {best_training_results['num_iterations']:.2f}")
print(f"Average Optimization Time per Fold (during CV): (Removed for speed, not directly tracked)")
print(f"Average Initial Objective Function Value (during CV): {best_training_results['initial_objective_function_value']:.4e}")
print(f"Average Final Objective Function Value (during CV): {best_training_results['final_objective_function_value']:.4e}")
print(f"Average Validation Error (MAPE): {best_training_results['average_validation_mape']:.4f}%")
print(f"Average Validation Error (MSE, regularized): {best_training_results['average_validation_mse_reg']:.4f}")

NameError: name 'my_k_fold_CV' is not defined

In [7]:
print("\n--- Optimal Hyperparameters and Performance (from 5-Fold CV) ---")
print(f"Optimal Number of Layers (L): {best_hyperparameters['num_layers']}")
print(f"Optimal Number of Neurons per Layer (N): {best_hyperparameters['num_neurons_per_layer']}")
print(f"Optimal Activation Function: {best_hyperparameters['activation_function']}")
print(f"Optimal Regularization Factor (lambda): {best_hyperparameters['regularization_factor']}")
print(f"Max Iterations for Optimizer (during CV): {best_hyperparameters['max_iter_minimize']}")
print(f"Optimization Solver: L-BFGS-B ({best_training_results['optimization_solver']})")
print(f"Average Number of Iterations for Optimization (during CV): {best_training_results['num_iterations']:.2f}")
print(f"Average Optimization Time per Fold (during CV): (Removed for speed, not directly tracked)")
print(f"Average Initial Objective Function Value (during CV): {best_training_results['initial_objective_function_value']:.4e}")
print(f"Average Final Objective Function Value (during CV): {best_training_results['final_objective_function_value']:.4e}")
print(f"Average Validation Error (MAPE): {best_training_results['average_validation_mape']:.4f}%")
print(f"Average Validation Error (MSE, regularized): {best_training_results['average_validation_mse_reg']:.4f}")


--- Optimal Hyperparameters and Performance (from 5-Fold CV) ---
Optimal Number of Layers (L): 2
Optimal Number of Neurons per Layer (N): [8]
Optimal Activation Function: g2
Optimal Regularization Factor (lambda): 0.001
Max Iterations for Optimizer (during CV): 500
Optimization Solver: L-BFGS-B (STOP: TOTAL NO. of ITERATIONS REACHED LIMIT)
Average Number of Iterations for Optimization (during CV): 500.00
Average Optimization Time per Fold (during CV): (Removed for speed, not directly tracked)
Average Initial Objective Function Value (during CV): 1.6524e+03
Average Final Objective Function Value (during CV): 9.4578e+01
Average Validation Error (MAPE): 20.3481%
Average Validation Error (MSE, regularized): 96.2489


### Retraining the final model with optimal parameters

In [8]:
print("\nRetraining optimal model on full TRAINING dataset for final evaluation...")
best_L = best_hyperparameters['num_layers']
best_neurons_config = best_hyperparameters['num_neurons_per_layer']
best_activation_func = g1 if best_hyperparameters['activation_function'] == 'g1' else g2
best_activation_prime = dg1_dx if best_hyperparameters['activation_function'] == 'g1' else dg2_dx
best_reg_factor = best_hyperparameters['regularization_factor']

# Re-initialize for full training on the *entire training set* (X_train_normalized, y_train)
W_final_init, b_final_init, v_final_init = initialize_parameters(D_input, best_neurons_config, y_output_dim, best_reg_factor)
initial_flat_params_final = unroll_params(W_final_init, b_final_init, v_final_init)

W_shapes_final = [W.shape for W in W_final_init]
b_shapes_final = [b.shape for b in b_final_init]
v_shape_final = v_final_init.shape

start_time_final_train = time.time()
result_final_train = minimize(
    fun=objective_function,
    x0=initial_flat_params_final,
    args=(X_train_normalized, y_train, W_shapes_final, b_shapes_final, v_shape_final, best_activation_func, best_activation_prime, best_reg_factor, best_L),
    method='L-BFGS-B',
    jac=True,
    options={'disp': False, 'maxiter': 5000} # higher iteration cap than during CV, so the optimizer can converge
)
end_time_final_train = time.time()

# Final Trained Parameters
W_final_trained, b_final_trained, v_final_trained = roll_params(result_final_train.x, W_shapes_final, b_shapes_final, v_shape_final)
final_training_time = end_time_final_train - start_time_final_train
final_training_iterations = result_final_train.nit
final_train_objective_value = result_final_train.fun # Final regularized MSE on training set
print(f"\nFinal model training completed in {final_training_time:.2f} seconds over {final_training_iterations} iterations.")



Retraining optimal model on full TRAINING dataset for final evaluation...

Final model training completed in 30.55 seconds over 1456 iterations.


In [15]:
# --- Training Set Performance ---
y_train_pred_final, _, _ = forward(X_train_normalized, W_final_trained, b_final_trained, v_final_trained, best_activation_func, best_L)
final_train_mape = mape(y_train, y_train_pred_final)
train_error_mse = mse_loss(y_train, y_train_pred_final)

print("\nTraining Set Performance:")
print(f"  Final Training Error (MAPE): {final_train_mape:.4f}%")
print(f"  Final Training Error (MSE): {train_error_mse:.4f}")
print(f"  Final Training Error (MSE, regularized): {final_train_objective_value:.4f}") # Final regularized MSE on full training set



Training Set Performance:
  Final Training Error (MAPE): 20.1753%
  Final Training Error (MSE): 93.0971
  Final Training Error (MSE, regularized): 94.7941


### Testing
Having identified the optimal hyperparameters and retrained the model with them on the **entire training set**, we now assess its generalization capability by making predictions on the completely unseen test set.

In [None]:
# TESTING ON NEVER SEEN BEFORE DATA (X_test_normalized, y_test)
print(f"\nFinal Test Data Shape (from split): {X_test.shape}, {y_test.shape}")
y_test_pred, _, _ = forward(X_test_normalized, W_final_trained, b_final_trained, v_final_trained, best_activation_func, best_L)
test_error_mape = mape(y_test, y_test_pred)
test_error_mse = mse_loss(y_test, y_test_pred)

test_error_mse_reg, _ = objective_function(
    result_final_train.x, 
    X_test_normalized, y_test, W_shapes_final, b_shapes_final, v_shape_final,
    best_activation_func, best_activation_prime, best_reg_factor, best_L
)

print("\n Test Set Performance:")
print(f"  Final Test Error (MAPE): {test_error_mape:.4f}%")
print(f"  Final Test Error (MSE): {test_error_mse:.4f}")
print(f"  Final Test Error (MSE, regularized): {test_error_mse_reg:.4f}")


Final Test Data Shape (from split): (32, 4095), (1, 4095)

 Test Set Performance:
  Final Test Error (MAPE): 20.1206%
  Final Test Error (MSE): 94.7598
  Final Test Error (MSE, regularized): 96.4569


In [None]:
true_values = y_test.flatten()
predicted_values = y_test_pred.flatten()
comparison_df = pd.DataFrame({
    'True Age': true_values,
    'Predicted Age': predicted_values
})

print("Entry-by-Entry Comparison of True vs. Predicted Ages (Test Set):")
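# Show only the largest misses: rows whose absolute error exceeds 30 years
abs_error = (comparison_df['True Age'] - comparison_df['Predicted Age']).abs()
print(comparison_df[abs_error > 30])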

Entry-by-Entry Comparison of True vs. Predicted Ages (Test Set):
      True Age  Predicted Age
49          58      25.854565
308         70      39.101132
386         60      26.376181
811         68      35.880332
827         70      31.633644
1012        65      32.887125
1086        69      27.382353
1435        72      35.469229
1715        61      29.480984
1741        80      47.672859
1814        66      30.553193
2027        76      34.867095
2214        89      54.050493
2319        65      30.671877
2647        85      45.027950
2795        76      31.290346
3049        80      43.397095
3118        74      42.672235
3330        69      27.290921
3367        71      38.578848
3598        62      31.664357
3698        75      44.107233
4018        65      32.866114


# Revised MLP: L = number of hidden layers + 1, with a class definition

In [7]:
# custom MLP Regressor Class 
class myMLPRegressor(BaseEstimator, RegressorMixin):
    # default values
    def __init__(self, D_input: int, y_output_dim: int, num_layers: int = 3, # 2 hidden
                 num_neurons: List[int] = [8, 4],
                 activation_func_name: str = 'g1',
                 regularization_factor: float = 0.001,
                 max_iter: int = 5000, print_callback_loss: bool = True):

        # Hyperparameters for gridsearch
        self.num_layers = num_layers
        self.num_neurons = num_neurons
        self.activation_func_name = activation_func_name
        self.regularization_factor = regularization_factor
        self.max_iter = max_iter
        self.print_callback_loss = print_callback_loss

        # Fixed parameters from dataset
        self.D_input = D_input
        self.y_output_dim = y_output_dim

        # Attributes that will be set after fitting (by the fit method)
        self.W_list_ = None
        self.b_list_ = None
        self.v_ = None
        self.W_shapes_ = None
        self.b_shapes_ = None
        self.v_shape_ = None
        self.activation_func_ = None 
        self.activation_prime_ = None
        self.n_iterations_ = None
        self.optimization_message_ = None
        self.final_objective_value_ = None
        self._is_invalid_combo = False # Flag to mark invalid hyperparameter combinations

    def _get_activation_functions(self):
        """Maps activation function name (string) to callable functions."""
        if self.activation_func_name == 'g1':
            return g1, dg1_dx
        elif self.activation_func_name == 'g2':
            return g2, dg2_dx
        else:
            raise ValueError(f"Unknown activation function name: {self.activation_func_name}")

    def fit(self, X: np.ndarray, y: np.ndarray):
        
        X_transposed = X.T # (D, N_samples) 
        y_transposed = y.T # (1, N_samples)

        # Architectural Validation
        expected_hidden_layers = self.num_layers - 1 # L total layers, L-1 hidden layers
        if len(self.num_neurons) != expected_hidden_layers:
            self._is_invalid_combo = True
            return self

        self._is_invalid_combo = False # Reset flag for valid combinations

        # Get activation functions
        self.activation_func_, self.activation_prime_ = self._get_activation_functions()

        # Initialize parameters
        W_init, b_init, v_init = initialize_parameters(
            self.D_input, self.num_neurons, self.y_output_dim, self.regularization_factor
        )
        initial_flat_params = unroll_params(W_init, b_init, v_init)

        # Store shapes of parameters for rolling/unrolling inside objective_function
        self.W_shapes_ = [W.shape for W in W_init]
        self.b_shapes_ = [b.shape for b in b_init]
        self.v_shape_ = v_init.shape

        # Callback function to print mse loss in training
        iteration_count = 0 

        def callback_function(current_flat_params):
            nonlocal iteration_count 
            iteration_count += 1

            if iteration_count % 10 == 0:
                W_list_cb, b_list_cb, v_cb = roll_params(
                    current_flat_params, self.W_shapes_, self.b_shapes_, self.v_shape_
                )
                # Perform forward pass to get y_pred
                y_pred_cb, _, _ = forward(
                    X_transposed, W_list_cb, b_list_cb, v_cb, self.activation_func_, self.num_layers
                )
                # Calculate non-regularized MSE loss
                non_reg_loss = mse_loss(y_transposed, y_pred_cb)
                print(f"  Iteration {iteration_count}: Non-regularized MSE Loss = {non_reg_loss:.6f}")

        callback_arg = callback_function if self.print_callback_loss else None

        result = minimize(
            fun=objective_function,
            x0=initial_flat_params,
            args=(X_transposed, y_transposed, self.W_shapes_, self.b_shapes_, self.v_shape_,
                  self.activation_func_, self.activation_prime_, self.regularization_factor, self.num_layers),
            method='L-BFGS-B',
            jac=True,
            options={'disp': False, 'maxiter': self.max_iter},
            callback=callback_arg # Pass the callback here
        )

        # Store the optimized parameters and optimization details
        self.W_list_, self.b_list_, self.v_ = roll_params(
            result.x, self.W_shapes_, self.b_shapes_, self.v_shape_
        )
        self.n_iterations_ = result.nit
        self.optimization_message_ = result.message
        self.final_objective_value_ = result.fun # final (regularized) loss after optimization

        return self

    def predict(self, X: np.ndarray) -> np.ndarray:
        # If combination flagged invalid during fit, raise an error
        if self._is_invalid_combo:
            raise NotFittedError("This estimator was skipped due to an invalid hyperparameter combination during fit.")
        # Check if the model has actually been trained
        if self.W_list_ is None:
            raise NotFittedError("Model has not been trained yet. Call .fit() first.")

        # Transpose X for forward function
        X_transposed = X.T
        # Perform forward pass with the TRAINED parameters to get predictions
        y_pred, _, _ = forward(X_transposed, self.W_list_, self.b_list_, self.v_, self.activation_func_, self.num_layers)
        # Return a 1-D array of shape (N_samples,) for sklearn compatibility
        return y_pred.T.flatten()

In [8]:

# Custom MAPE Scorer for GridSearchCV 
mape_scorer = make_scorer(lambda y_true, y_pred: -mape(y_true.reshape(1,-1), y_pred.reshape(1,-1)), greater_is_better=True)

# X_train_normalized and y_train are now (D, N) and (1, N) 
# For scikit-learn, they need to be (N, D) and (N, 1).
X_train_sklearn = X_train_normalized.T
y_train_sklearn = y_train.T

X_test_sklearn = X_test_normalized.T
y_test_sklearn = y_test.T 

print(f"X_train_sklearn shape (for sklearn): {X_train_sklearn.shape}")
print(f"y_train_sklearn shape (for sklearn): {y_train_sklearn.shape}\n")


# Hyperparameter Search Space for GridSearchCV
param_grid_full = {
    'num_layers': [3, 4, 5], # Total layers
    'num_neurons': [
        [16, 8], [32, 16], [32, 32], 
        [16, 16, 16], [32, 16, 32], [32, 32, 32],
        [16, 8, 16, 8], [32, 16, 32, 16] 
    ],
    'activation_func_name': ['g1', 'g2'], 
    'regularization_factor': [0.001, 0.01],
}

# Keep only VALID combinations
filtered_param_grid = []
for L_val in param_grid_full['num_layers']:
    # Number of hidden layers is L_val - 1 (L is total layers, including output)
    expected_hidden_layers = L_val - 1
    for N_list_val in param_grid_full['num_neurons']:
        if len(N_list_val) == expected_hidden_layers:
            for act_name_val in param_grid_full['activation_func_name']:
                for reg_f_val in param_grid_full['regularization_factor']:
                    filtered_param_grid.append({
                        'num_layers': [L_val],
                        'num_neurons': [N_list_val],
                        'activation_func_name': [act_name_val],
                        'regularization_factor': [reg_f_val],
                    })

# Instantiate the MLP Regressor for cv, with a reduced maxiter for speed
mlp_estimator = myMLPRegressor(D_input=D_input, y_output_dim=y_output_dim, max_iter=500, print_callback_loss=False) # no loss prints here

# Setup KFold for cross-validation
kf = KFold(n_splits=5, shuffle=True, random_state=1234)

print("Starting GridSearchCV based Hyperparameter Tuning...\n")

# Set up GridSearchCV 
grid_search = GridSearchCV(
    estimator=mlp_estimator,
    param_grid=filtered_param_grid, 
    cv=kf,
    scoring=mape_scorer, 
    verbose=2, 
    n_jobs=-1,
    error_score=np.nan,
    refit=True # refit the best configuration on the full training set
)

# Run the grid search 
start_time_grid_search = time.time()
grid_search.fit(X_train_sklearn, y_train_sklearn)
end_time_grid_search = time.time()
grid_search_time = end_time_grid_search - start_time_grid_search

print("\nGridSearchCV completed.")
print(f"GridSearchCV took {grid_search_time:.2f} seconds to complete.")

# Results = Best hyperparameters
best_overall_params = grid_search.best_params_
best_overall_mape_score = -grid_search.best_score_

X_train_sklearn shape (for sklearn): (16380, 32)
y_train_sklearn shape (for sklearn): (16380, 1)

Starting GridSearchCV based Hyperparameter Tuning...

Fitting 5 folds for each of 32 candidates, totalling 160 fits

GridSearchCV completed.
GridSearchCV took 2322.47 seconds to complete.
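
The candidate count reported by `GridSearchCV` can be reproduced without running any fit. The sketch below rebuilds the filtered grid from the values in `param_grid_full` and counts the combinations with scikit-learn's `ParameterGrid`. The variable names (`layer_options`, `valid_grid`, and so on) are chosen for the example only.

```python
from sklearn.model_selection import ParameterGrid

layer_options = [3, 4, 5]  # total layers L, so L - 1 hidden layers
neuron_options = [
    [16, 8], [32, 16], [32, 32],
    [16, 16, 16], [32, 16, 32], [32, 32, 32],
    [16, 8, 16, 8], [32, 16, 32, 16],
]
activation_options = ['g1', 'g2']
reg_options = [0.001, 0.01]

# Keep only architectures whose neuron list has exactly L - 1 entries
valid_grid = [
    {
        'num_layers': [L],
        'num_neurons': [neurons],
        'activation_func_name': [act],
        'regularization_factor': [reg],
    }
    for L in layer_options
    for neurons in neuron_options
    if len(neurons) == L - 1
    for act in activation_options
    for reg in reg_options
]

n_candidates = len(ParameterGrid(valid_grid))
print(n_candidates, "candidates,", 5 * n_candidates, "fits with 5 folds")
```

Passing a list of single-combination dictionaries is what keeps `GridSearchCV` from pairing, say, `num_layers=3` with a three-entry neuron list, which a single dictionary of independent lists would do.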


### Testing
Having identified the optimal set of hyperparameters, the model will be retrained using these **best-performing parameters** on the **entire available training dataset**. This step ensures the model leverages all learning opportunities before its ultimate evaluation. Subsequently, we will assess its true generalization capability by making predictions on the completely **unseen test** data.

In [10]:
print("\n--- Retraining Optimal Model on Full TRAINING Dataset (max_iter=5000) ---")

# Extract final parameters from best_overall_params
final_L = best_overall_params['num_layers']
final_neurons_config = best_overall_params['num_neurons']
final_activation_name = best_overall_params['activation_func_name']
final_reg_factor = best_overall_params['regularization_factor']
final_train_max_iter = 5000

final_activation_func = g1 if final_activation_name == 'g1' else g2
final_activation_prime = dg1_dx if final_activation_name == 'g1' else dg2_dx

print(f"\nFinal training uses Optimal Configuration:")
print(f"  Layers (L): {final_L}")
print(f"  Neurons per Layer: {final_neurons_config}")
print(f"  Activation Function: {final_activation_func.__name__}")
print(f"  Regularization Factor: {final_reg_factor}")
print(f"  Max Iterations for L-BFGS-B (Final Train): {final_train_max_iter}")


final_model = myMLPRegressor(
    D_input=D_input,
    y_output_dim=y_output_dim, 
    num_layers=final_L,
    num_neurons=final_neurons_config,
    activation_func_name=final_activation_name,
    regularization_factor=final_reg_factor,
    max_iter=final_train_max_iter,
    print_callback_loss=True # callback printing to track mse loss behaviour
)

# Calculate Initial Training Error (Regularized MSE & MAPE) 
W_final_init, b_final_init, v_final_init = initialize_parameters(
    D_input, final_neurons_config, y_output_dim, final_reg_factor
)
initial_flat_params_final_model = unroll_params(W_final_init, b_final_init, v_final_init)

W_shapes_final_model = [W.shape for W in W_final_init]
b_shapes_final_model = [b.shape for b in b_final_init] 
v_shape_final_model = v_final_init.shape

y_train_pred_initial_final_model, _, _ = forward(
    X_train_normalized, W_final_init, b_final_init, v_final_init, final_activation_func, final_L
)
initial_train_mape_final_model = mape(y_train, y_train_pred_initial_final_model)
initial_train_mse_reg_final_model, _ = objective_function(
    initial_flat_params_final_model,
    X_train_normalized, y_train, W_shapes_final_model, b_shapes_final_model, v_shape_final_model,
    final_activation_func, final_activation_prime, final_reg_factor, final_L
)

# Final Training on Full Training Data by calling .fit() on myMLPRegressor instance
start_time_final_train = time.time()
final_model.fit(X_train_sklearn, y_train_sklearn) 
end_time_final_train = time.time()

# predict on train to get MAPE
y_train_pred_final = final_model.predict(X_train_sklearn)
final_train_mape = mape(y_train, y_train_pred_final)

# Extract results directly from final_model's attributes
final_training_time = end_time_final_train - start_time_final_train
final_training_iterations = final_model.n_iterations_
final_train_objective_value = final_model.final_objective_value_
result_final_train_message = final_model.optimization_message_ # message from the fitted model

print(f"\nFinal model training completed in {final_training_time:.2f} seconds over {final_training_iterations} iterations.")


--- Retraining Optimal Model on Full TRAINING Dataset (max_iter=5000) ---

Final training uses Optimal Configuration:
  Layers (L): 4
  Neurons per Layer: [32, 16, 32]
  Activation Function: g2
  Regularization Factor: 0.01
  Max Iterations for L-BFGS-B (Final Train): 5000
  Iteration 10: Non-regularized MSE Loss = 114.039038
  Iteration 20: Non-regularized MSE Loss = 97.124642
  Iteration 30: Non-regularized MSE Loss = 95.988423
  Iteration 40: Non-regularized MSE Loss = 95.702783
  Iteration 50: Non-regularized MSE Loss = 95.648880
  Iteration 60: Non-regularized MSE Loss = 95.335569
  Iteration 70: Non-regularized MSE Loss = 95.148132
  Iteration 80: Non-regularized MSE Loss = 95.020283
  Iteration 90: Non-regularized MSE Loss = 94.966386
  Iteration 100: Non-regularized MSE Loss = 94.972889
  Iteration 110: Non-regularized MSE Loss = 94.876734
  Iteration 120: Non-regularized MSE Loss = 94.807525
  Iteration 130: Non-regularized MSE Loss = 94.720381
  Iteration 140: Non-regularize

In [11]:
# Extract trained parameters from final_model
W_final_trained = final_model.W_list_
b_final_trained = final_model.b_list_
v_final_trained = final_model.v_

print("\n--- Calculating Test Set Performance ---")
# Make predictions on the test set using the final trained model
y_test_pred, _, _ = forward(X_test_normalized, W_final_trained, b_final_trained, v_final_trained, final_activation_func, final_L)

# MAPE on the test set
test_error_mape = mape(y_test, y_test_pred)

# Non regularized MSE
test_mse_no_reg = mse_loss(y_test, y_test_pred)

# vs. non-regularized train MSE
final_train_mse_no_reg = mse_loss(y_train, y_train_pred_final)

# Regularized MSE on the test set
test_error_mse_reg, _ = objective_function(
    unroll_params(W_final_trained, b_final_trained, v_final_trained), 
    X_test_normalized, y_test, # Test data
    W_shapes_final_model, b_shapes_final_model, v_shape_final_model,
    final_activation_func, final_activation_prime, final_reg_factor, final_L # Hyperparams
)
print(f"  Test Error (MAPE): {test_error_mape:.4f}%")
print(f"  Test Error (MSE, regularized): {test_error_mse_reg:.4f}")
print(f"  Test Error (MSE, non-regularized): {test_mse_no_reg:.4f}")


--- Calculating Test Set Performance ---
  Test Error (MAPE): 20.1203%
  Test Error (MSE, regularized): 98.5102
  Test Error (MSE, non-regularized): 94.6767


In [12]:
# auxiliary function to display metrics for report
final_report_metrics(
    final_L=final_L,
    final_neurons_config=final_neurons_config,
    final_activation_name=final_activation_name,
    final_reg_factor=final_reg_factor,
    final_train_max_iter=final_train_max_iter,
    final_training_iterations=final_training_iterations,
    initial_train_mse_reg_final_model=initial_train_mse_reg_final_model,
    final_train_objective_value=final_train_objective_value,
    initial_train_mape_final_model=initial_train_mape_final_model,
    final_train_mape=final_train_mape,
    best_overall_mape_score=best_overall_mape_score,
    test_error_mape=test_error_mape,
    test_error_mse_reg=test_error_mse_reg, 
    result_final_train_message=result_final_train_message,
    final_train_mse_no_reg=final_train_mse_no_reg,
    test_error_mse_no_reg=test_mse_no_reg,
)


--- Comprehensive Performance Metrics for Final Report ---

1. Optimal Model Configuration:
  Non-linearity (Activation Function): g2
  Total Number of Layers (L): 4
  Neurons per Layer (Nl): [32, 16, 32]
  Regularization Factor (λ): 0.01

2. Optimization Routine Details (for Final Training):
  Optimization Routine: L-BFGS-B
  Max Number of Iterations Parameter: 5000
  Returned Message: CONVERGENCE: REL_REDUCTION_OF_F_<=_FACTR*EPSMCH
  Number of Iterations Performed: 2243
  Starting Value of Objective Function: 1.7146e+03
  Final Value of Objective Function: 9.7588e+01

3. Training Set Performance:
  Initial Training Error (MAPE): 5462.4904%
  Initial Training Error (MSE, regularized): 1714.6084
  Final Training Error (MAPE): 20.2494%
  Final Training Error (MSE, regularized): 97.5876
  Final Training Error (MSE, **non-regularized**): 93.7542

4. Validation Set Performance (Average from K-Fold CV):
  Average Validation Error (MAPE): 20.3744%

5. Test Set Performance:
  Final Test Erro