<a href="https://colab.research.google.com/github/melmar-g1thub/INTERPOLATING-NEURAL-NETWORK/blob/main/random_search.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

RANDOM SEARCH TO FIND OPTIMAL HYPER-PARAMETERS

The main challenge lies in bridging scikit-learn's RandomizedSearchCV with PyTorch nn.Module.
- RandomizedSearchCV expects a scikit-learn estimator, which is a class that implements fit(), predict(), and potentially score().
- Interpolation(nn.Module) is a PyTorch model, not directly a scikit-learn estimator.
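
To make the estimator requirement concrete, here is a minimal sketch of the interface `RandomizedSearchCV` relies on. `ShrunkenMeanRegressor`, its `shrink` parameter and the toy data are hypothetical and exist only for illustration; the notebook itself uses skorch rather than a hand-written estimator.

```python
import numpy as np
from sklearn.base import BaseEstimator, RegressorMixin
from sklearn.model_selection import RandomizedSearchCV


class ShrunkenMeanRegressor(RegressorMixin, BaseEstimator):
    """Hypothetical toy estimator: predicts the training mean scaled by `shrink`."""

    def __init__(self, shrink=1.0):
        # Hyper-parameters are stored unchanged so the search can clone and set them
        self.shrink = shrink

    def fit(self, X, y):
        self.mean_ = self.shrink * np.mean(y, axis=0)
        return self  # fit() returns the estimator itself

    def predict(self, X):
        # Same prediction for every row, one column per output
        return np.tile(self.mean_, (len(X), 1))

    # score() (R²) is inherited from RegressorMixin


rng = np.random.default_rng(0)
X_toy = rng.normal(size=(60, 2))
y_toy = rng.normal(loc=3.0, size=(60, 2))

toy_search = RandomizedSearchCV(
    ShrunkenMeanRegressor(),
    param_distributions={"shrink": [0.0, 0.5, 1.0]},
    n_iter=3,
    cv=3,
    random_state=0,
)
toy_search.fit(X_toy, y_toy)
print(toy_search.best_params_)
```

A bare `nn.Module` has none of these methods, which is why a wrapper is needed.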

We'll do this with `skorch`, whose `NeuralNetRegressor` provides its own training loop. It also supports early stopping through a callback.
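
The idea behind patience-based early stopping can be sketched in a few lines of plain Python. This is a simplified illustration, not skorch's actual implementation; `early_stopping_epoch` and the loss values are hypothetical.

```python
def early_stopping_epoch(valid_losses, patience):
    """Return the (1-based) epoch at which training would stop."""
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(valid_losses, start=1):
        if loss < best:
            # New best validation loss: reset the counter
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch  # no improvement for `patience` epochs in a row
    return len(valid_losses)  # ran through all epochs without triggering


toy_losses = [1.0, 0.8, 0.9, 0.85, 0.95, 0.7]
stop_short = early_stopping_epoch(toy_losses, patience=3)
stop_long = early_stopping_epoch(toy_losses, patience=4)
print(stop_short, stop_long)
```

A larger `patience` tolerates longer plateaus at the cost of more training time.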

#### RANDOM SEARCH - Scikit Library

We'll be logging the best parameters and model in order to:
- Compare all runs
- Track randomness/reproducibility

In [None]:
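# Imports used throughout this notebook
import json
import os
import time

import joblib
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
import torch
import torch.nn as nn
from matplotlib.ticker import ScalarFormatter
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import make_scorer, r2_score
from sklearn.model_selection import RandomizedSearchCV, RepeatedKFold
from sklearn.preprocessing import LabelEncoder
from skorch import NeuralNetRegressor
from skorch.callbacks import EarlyStopping

# `Interpolation` (the nn.Module) and the *_processed arrays are expected to be defined before this cell

# Define the search space for the randomly searched parameters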
param_distributions = {
    'module__hidden_layer_sizes': [(256, 128), (128, 64), (128, 64, 32), (64, 128, 64), (64, 64, 64)],
    'module__activation': [nn.ReLU, nn.LeakyReLU, nn.ELU, nn.SiLU],
    'module__dropout': [0.1, 0.2, 0.4, 0.5],
    'lr': np.logspace(-4, -2, 20),
    'batch_size': [32, 64, 128, 256],
}

# Defined hyperparameters
input_size = 2  # baryon number and temperature
output_size = 2 # Q1 and Q2 in eos.thermo

n_iterations = 40
epochs = 200
patience = 25

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Additional metrics
def prediction_variance(y_true, y_pred):
    # Variance of the predictions; y_true is unused but required by the scorer signature
    return np.var(y_pred)

r2_scorer = make_scorer(r2_score, greater_is_better=True)
var_scorer = make_scorer(prediction_variance, greater_is_better=True)

# Define the model
# Skorch wrapper
model_estimator = NeuralNetRegressor(
    module=Interpolation,
    module__in_size=input_size,
    module__out_size=output_size,
    max_epochs=epochs,
    lr=0.01,        # overridden during the random search
    batch_size=64,  # overridden during the random search
    optimizer=torch.optim.Adam,
    criterion=nn.MSELoss,
    callbacks=[EarlyStopping(monitor='valid_loss', patience=patience)],
)

# Define the CrossValidation
cv = RepeatedKFold(n_splits=3, n_repeats=2, random_state=42)

RandomizedSearchCV evaluates every candidate with cross-validation. Here we pass RepeatedKFold, which repeats the K-fold split with different shuffles and is a common choice for regression tasks.

It also takes a scoring argument. scikit-learn scorers follow the convention that larger is better.
For regression errors this means using the negated measure (e.g. 'neg_mean_absolute_error'), so that values closer to zero represent less prediction error.
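
A small sketch of the sign convention, using a toy `DummyRegressor` and made-up data (both are assumptions for illustration only):

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import get_scorer, mean_squared_error

# Toy data: the dummy model always predicts the training mean (1.5)
X_demo = np.zeros((4, 1))
y_demo = np.array([0.0, 1.0, 2.0, 3.0])
dummy = DummyRegressor(strategy="mean").fit(X_demo, y_demo)

# The built-in scorer returns the negated error, so larger (closer to 0) is better
neg_mse = get_scorer("neg_mean_squared_error")(dummy, X_demo, y_demo)
mse = mean_squared_error(y_demo, dummy.predict(X_demo))
print(neg_mse, mse)
```

This is why the logging cell later flips the sign to recover a positive MSE.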

Once defined, the search is performed by calling the fit() function and providing a dataset used to train and evaluate model hyperparameter combinations using cross-validation.
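
As a rough sketch of how much work that call to fit() involves, the number of model fits can be counted from the cross-validation object. The toy array below is hypothetical; the split settings and the 40 iterations mirror the values used in this notebook.

```python
import numpy as np
from sklearn.model_selection import RepeatedKFold

cv_demo = RepeatedKFold(n_splits=3, n_repeats=2, random_state=42)
X_demo = np.arange(12).reshape(-1, 1)  # 12 toy samples

# Each repeat reshuffles the data and produces 3 train/test splits
folds = list(cv_demo.split(X_demo))
fits_per_candidate = cv_demo.get_n_splits()

n_candidates = 40  # n_iter in the random search
total_fits = n_candidates * fits_per_candidate  # excludes the final refit on all data
print(fits_per_candidate, total_fits)
```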


In [None]:
def save_random_search_results(search_obj, name_prefix="logP_logS_search"):
    base_dir = "/content/drive/My Drive/Colab Notebooks/09.06/Random"
    exp_dir = os.path.join(base_dir, f"{name_prefix}")
    os.makedirs(exp_dir, exist_ok=True)

    # CSV + JSON
    results_df = pd.DataFrame(search_obj.cv_results_)
    results_df.to_csv(os.path.join(exp_dir, "results_full.csv"), index=False)

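    def safe_serialize(obj):
        if isinstance(obj, dict):
            return {k: safe_serialize(v) for k, v in obj.items()}
        elif isinstance(obj, np.ndarray):
            # cv_results_ stores most columns as (masked) NumPy arrays
            return safe_serialize(obj.tolist())
        elif isinstance(obj, np.generic):
            return obj.item()  # NumPy scalar -> plain Python scalar
        elif isinstance(obj, list):
            return [safe_serialize(v) for v in obj]
        elif isinstance(obj, tuple):
            return tuple(safe_serialize(v) for v in obj)
        elif isinstance(obj, (int, float, str, bool)) or obj is None:
            return obj
        else:
            return str(obj)  # convert anything else (e.g. classes) to string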

    with open(os.path.join(exp_dir, "results_full.json"), "w") as f:
        json.dump(safe_serialize(search_obj.cv_results_), f, indent=2)

    joblib.dump(search_obj.best_estimator_, os.path.join(exp_dir, "best_model.pkl"))
    with open(os.path.join(exp_dir, "best_params.json"), "w") as f:
        json.dump({k: str(v) for k, v in search_obj.best_params_.items()}, f, indent=2)

    print(f"\n Saved results at: {exp_dir}")
    return exp_dir

In [None]:
random_search = RandomizedSearchCV(
    estimator=model_estimator,
    param_distributions=param_distributions,
    n_iter=n_iterations,
    cv=cv,
    scoring={
        'mean_mse': 'neg_mean_squared_error',
        'r2': r2_scorer,
        'var_pred': var_scorer
    },
    refit='r2',  # R² as model score
    random_state=42,
    n_jobs=-1,
    verbose=2
)

# Combine train+val data into float32 NumPy arrays (RandomizedSearchCV expects array-likes; skorch converts them to tensors)
x_train_val = np.concatenate((in_train_processed, in_val_processed), axis=0).astype(np.float32)
y_train_val = np.concatenate((out_train_processed, out_val_processed), axis=0).astype(np.float32)

start_time = time.time()
random_search.fit(x_train_val, y_train_val)
end_time = time.time()
elapsed = end_time - start_time
minutes, seconds = divmod(elapsed, 60)
print(f"\n Randomized Search complete.")
print(f"Total time: {int(minutes)} min {int(seconds)} sec")

save_random_search_results(random_search)

In [None]:
# Log results
logfile = '/content/drive/My Drive/Colab Notebooks/09.06/Random/results_full.csv'
log_dir = os.path.dirname(logfile)
os.makedirs(log_dir, exist_ok=True)

results_df = pd.read_csv(logfile)

# Clean up and rename columns for clarity
columns_to_log = [
    'param_module__hidden_layer_sizes',
    'param_lr',
    'param_module__dropout',
    'param_module__activation',
    'param_batch_size',
    'rank_test_mean_mse',
    'rank_test_r2',
    'rank_test_var_pred',
    'mean_fit_time',
    'std_fit_time',
    'mean_test_mean_mse',
    'mean_test_r2',
    'mean_test_var_pred'
]

df_filtered = results_df[columns_to_log].copy()

df_filtered.rename(columns={
    'param_module__hidden_layer_sizes': 'hidden_layer_sizes',
    'param_lr': 'learning_rate',
    'param_module__dropout': 'dropout',
    'param_module__activation': 'activation_fn',
    'param_batch_size': 'batch_size',
    'mean_test_mean_mse': 'mean_neg_mse',
    'mean_test_r2': 'mean_r2',
    'mean_test_var_pred': 'mean_var'
}, inplace=True)

df_filtered['mean_mse'] = -df_filtered['mean_neg_mse'] # Convert to positive MSE

df_filtered.to_csv(logfile, index=False)
print(f"\nAll search results saved to: {logfile}")

# Summarize best result
print('\n--- Best Result ---')
print('Best cross-validation score (R2): %s' % random_search.best_score_)
print('Best Hyperparameters: %s' % random_search.best_params_)

#### EVALUATION OF RANDOM SEARCH

The correlation heatmap gives a direct indication of which hyper-parameters are most influential. The pairplot shows visually which values or ranges of each hyper-parameter are associated with lower MSE (i.e. better expected performance).
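
One caveat when reading such a heatmap: Pearson correlation only captures linear trends, and the learning rate is sampled on a log scale. The sketch below uses a made-up three-row table (not results from this search) to show how correlating against `log10(lr)` can differ from correlating against the raw value. Label-encoded categorical columns have an arbitrary integer order, so their correlations should be read with similar caution.

```python
import numpy as np
import pandas as pd

# Hypothetical toy results: MSE falls linearly with log10(learning rate)
toy = pd.DataFrame({
    "learning_rate": [1e-4, 1e-3, 1e-2],
    "mean_mse": [3.0, 2.0, 1.0],
})
toy["log10_lr"] = np.log10(toy["learning_rate"])

corr_toy = toy.corr()  # Pearson correlation by default
corr_raw = corr_toy.loc["learning_rate", "mean_mse"]
corr_log = corr_toy.loc["log10_lr", "mean_mse"]
print(f"raw lr: {corr_raw:.2f}, log10 lr: {corr_log:.2f}")
```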

In [None]:
# Heatmap of correlations
# Note: the *_encoded columns are created in the pairplot cell further down; run that cell first
corr_df = df_filtered[['learning_rate', 'dropout', 'batch_size',
                         'activation_fn_encoded', 'hidden_layer_encoded', 'mean_mse']]

corr_matrix = corr_df.corr()

plt.figure(figsize=(8, 6))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', fmt=".2f", square=True, cbar=True)
plt.title("Heatmap of correlations between hyperparameters and MSE")
plt.tight_layout()
#plt.savefig(os.path.join(plot_save_path, 'MSE_heatmap.png'), dpi=300, bbox_inches='tight')
plt.show()

PAIRPLOT is a multi-plot grid that shows the relationship between each pair of variables in a dataset. It includes:
- Scatter plots for every variable vs every other variable.
- Distributions for each variable on the diagonal.

And it shows:
1. Linear or nonlinear relationships (e.g., higher learning rate → higher MSE?)
2. Clusters (e.g., some batch sizes perform consistently better?)
3. Outliers (points that break patterns)
4. Correlations (strong slopes in the scatter plots)



In [None]:
plot_save_path = "/content/drive/My Drive/Colab Notebooks/09.06/Random/Hyperparameters Analysis"
os.makedirs(plot_save_path, exist_ok=True)

# Pairplot for comparing the different combinations of hyperparameters
le_act = LabelEncoder()
le_layer = LabelEncoder()

# Reduce each activation to its bare class name, e.g.
# "<class 'torch.nn.modules.activation.ReLU'>" -> "ReLU" (the CSV stores classes as strings)
def activation_name(fn):
    if isinstance(fn, str):
        return fn.strip("<>'").split('.')[-1]
    return fn.__name__

df_filtered['activation_fn_str'] = df_filtered['activation_fn'].apply(activation_name)
df_filtered['hidden_layer_str'] = df_filtered['hidden_layer_sizes'].astype(str)
df_filtered['activation_fn_encoded'] = le_act.fit_transform(df_filtered['activation_fn_str'])
df_filtered['hidden_layer_encoded'] = le_layer.fit_transform(df_filtered['hidden_layer_str'])

# Set columns for pairplot
pairplot_vars = ['learning_rate', 'dropout', 'batch_size', 'mean_mse']

# Plot with activation_fn colored
g = sns.pairplot(df_filtered, vars=pairplot_vars, hue='activation_fn_str', diag_kind='kde', corner=True)

# Adjust legend title and position (legend entries already show the clean class names)
g._legend.set_title("Activation Function")
g._legend.set_bbox_to_anchor((0.6, 0.8))  # move it outside the plot if needed

plt.suptitle("Pairplot of Hyperparameters vs MSE", fontsize=16, y=1.02, x=0.4)
plt.savefig(os.path.join(plot_save_path, 'MSE_pairplot_act_fn.png'), dpi=300, bbox_inches='tight')
plt.show()

# Plot with hidden_layers colored
h = sns.pairplot(df_filtered, vars=pairplot_vars, hue='hidden_layer_str', diag_kind='kde', corner=True)
h._legend.set_title("Hidden Layers Architecture")
h._legend.set_bbox_to_anchor((0.7, 0.8))
plt.suptitle("Pairplot of Hyperparameters vs MSE", fontsize=16, y=1.02)
plt.savefig(os.path.join(plot_save_path, 'MSE_pairplot_hidden.png'), dpi=300, bbox_inches='tight')
plt.show()

In [None]:
# Plot top best models based on R2, MSE and predictions mean variance

logfile = '/content/drive/My Drive/Colab Notebooks/TFG_NN_FINAL/Random_40/results_full.csv'
log_dir = os.path.dirname(logfile)
os.makedirs(log_dir, exist_ok=True)  # Ensure directory exists

df = pd.read_csv(logfile)

top5_r2 = df.sort_values(by='rank_test_r2').head(5).copy()
top5_mse = df.sort_values(by='rank_test_mean_mse').head(5).copy()

for df_sub in [top5_r2, top5_mse]:
    df_sub['config_label'] = df_sub.apply(
        lambda row: f"{row['hidden_layer_sizes']} | dr={row['dropout']} | batch={row['batch_size']} | lr={row['learning_rate']:.1e}",
        axis=1
    )

# Plot Side-by-side
fig, axs = plt.subplots(1, 3, figsize=(20, 8), sharey=True)
axs[1].sharey(axs[0])
fig.subplots_adjust(wspace=3.5)

# R²
sns.barplot(x='mean_r2', y='config_label', data=top5_r2, palette='crest', ax=axs[0])
axs[0].set_xlabel("Mean R² (CV)", fontsize=19)
axs[0].set_ylabel("Configurations", fontsize=19)
axs[0].grid(axis='x', linestyle='--', alpha=0.6)
axs[0].set_xlim(0.98, 1)
axs[0].tick_params(axis='both', labelsize=18)

# MSE
sns.barplot(x='mean_mse', y='config_label', data=top5_mse, palette='crest', ax=axs[1])
axs[1].set_xlabel("Mean MSE (CV)", fontsize=19)
axs[1].grid(axis='x', linestyle='--', alpha=0.6)
axs[1].set_xlim(0.0005, 0.00125)
axs[1].tick_params(axis='y', left=False, labelleft=False)
axs[1].set_ylabel("")
axs[1].tick_params(axis='both', labelsize=18)
formatter = ScalarFormatter(useMathText=True)   # use "×10ⁿ" math-text notation
formatter.set_powerlimits((-4, -4))
axs[1].xaxis.offsetText.set_fontsize(18)
axs[1].xaxis.set_major_formatter(formatter)


# Variance (of the top 5 by R²)
sns.barplot(x='mean_var', y='config_label', data=top5_r2, palette='crest', ax=axs[2])
axs[2].set_xlabel("Mean Predicted Variance", fontsize=19)
axs[2].grid(axis='x', linestyle='--', alpha=0.6)
axs[2].set_xlim(0.95, 1)
axs[2].tick_params(axis='y', left=False, labelleft=False)
axs[2].set_ylabel("")
axs[2].tick_params(axis='both', labelsize=18)

plt.tight_layout()
output_file = os.path.join(log_dir, "top5_hyperparameter_comparison.png")
plt.savefig(output_file, dpi=300)
plt.show()


RANDOM FOREST feature importance measures how much each input variable (hyper-parameter) contributes to reducing the forest's prediction error for its target (here the mean cross-validated R²):
- For each tree in the forest, it records how much each feature reduces the squared error (the split criterion) whenever it is used to split a node.
- It then averages these reductions over all trees and normalizes them, giving one score per hyper-parameter.

The more a hyper-parameter reduces the error, the more important it is.
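
A minimal sketch of the idea on synthetic data, assuming a hypothetical setup where only the first of three features drives the target:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X_imp = rng.uniform(size=(200, 3))
# Only feature 0 matters; the small noise term is independent of all features
y_imp = 5.0 * X_imp[:, 0] + rng.normal(scale=0.01, size=200)

rf_demo = RandomForestRegressor(n_estimators=50, random_state=0)
rf_demo.fit(X_imp, y_imp)

# Impurity-based importances: one non-negative score per feature, normalized to sum to 1
demo_importances = rf_demo.feature_importances_
print(demo_importances)
```

With only 40 search results to fit on, the importances from the real search should be treated as indicative rather than precise.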


In [None]:
# Random-forest feature importance (impurity-based, not SHAP)
# RandomizedSearchCV does not train a directly interpretable model such as a tree,
# so we fit a RandomForestRegressor to estimate the relative importance of each hyper-parameter for the mean R²

X_forest = df_filtered[['learning_rate', 'dropout', 'batch_size', 'activation_fn_encoded', 'hidden_layer_encoded']]
Y_forest = df_filtered['mean_r2']

# Train a random forest to estimate the relevance of each hyper-parameter
rf = RandomForestRegressor(n_estimators=200, random_state=42)
rf.fit(X_forest, Y_forest)

importances = rf.feature_importances_
feature_names = X_forest.columns

# Plot bar chart of importances
plt.figure(figsize=(8, 5))
sns.barplot(x=importances, y=feature_names, palette='viridis')
plt.title("Impact of hyperparameters on R2 (RandomForest estimation)")
plt.xlabel("Relative importance")
plt.tight_layout()
plt.savefig(os.path.join(plot_save_path, 'R2_randomforest.png'), dpi=300, bbox_inches='tight')
plt.show()