##### STAGE 07 - HYPERPARAMETER MODEL TUNING WITH OPTUNA AND MOEL TRACKING WITH MLFLOW

`For better performance of the ML models, hyperparameter tuning becomes essential and there steps to follow in performing hyperparameter tuning. The common ones are: GridSearchCV, RandomSearch but these require manual approach and can be slow, error-prone and not easily scalable.`

`In this case, I will use OPTUNA to perform hyperparameter tuning. OPTUNA is an automatic hyperparameter optimization framework for machine learning which uses the Bayesian optimization and Tree-structured Parzen Estimators(TPE) by default. It is a smart experiment engine that automatically searches for the best hyperparameters by learning from past trials.`

`For a ML real-world application, Optuna provides:`
1. Faster experimentation
2. Better models with fewer trials
3. Easy integration with:
    1. MLflow
    2. Sklearn
    3. PyTorch
    4. XGBoost
    5. LightGBM
    6. CatBoost
4. Production-ready (used at scale).

`Use OPTUNA when:`
1. Model performance matters
2. Manual tuning is too slow
3. Having more than 2-3 hyperparameters
4. Require reproducible experiments.

`To better appreciate the efficiency of OPTUNA, I will load the best trained model (EXTRATREESREGRESSOR), load the datasets and retrain completely with OPTUNA with focus on the hyperparameters, track model, and hyperparameters in MLflow and compare baseline model performance with OPTUNA model performance` 



`IMPORT NECESSARY LIBRARIES`

In [9]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sb
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
import pickle
import os
import joblib
import mlflow
import mlflow.sklearn
import optuna


`LOAD DATASETS`

In [4]:
train_data = pd.read_csv("../data/processed_data/engineered_train_df_without_multicoll.csv")
eval_data = pd.read_csv("../data/processed_data/engineered_eval_df_without_multicoll.csv")

`LOAD BASELINE MODEL (EXTRATREESREGRESSOR)`

In [13]:
BASELINE_MODEL = joblib.load("../model/trained_EXTRATREESREGRESSOR_model_without_scaling.pkl")
BASELINE_MODEL

0,1,2
,n_estimators,500
,criterion,'squared_error'
,max_depth,10
,min_samples_split,5
,min_samples_leaf,2
,min_weight_fraction_leaf,0.0
,max_features,1.0
,max_leaf_nodes,
,min_impurity_decrease,0.0
,bootstrap,False


`LOGGING THE BASELINE MODEL INTO MLflow FOR TRACKING`

In [14]:
#if "mlruns" not in os.listdir("../"):
#    os.mkdir("../mlruns")
mlflow.set_tracking_uri("file:../mlruns")
mlflow.set_experiment("BASELINE_MODEL_OPTIMIZATION")

with mlflow.start_run(run_name="BASELINE_MODEL"):
    mlflow.sklearn.log_model(BASELINE_MODEL, name="model")
    mlflow.log_param("model_type", "ExtraTreesRegressor")

2025/12/14 16:11:49 INFO mlflow.tracking.fluent: Experiment with name 'BASELINE_MODEL_OPTIMIZATION' does not exist. Creating a new experiment.




`RETRAINING THE MODEL BY OPTIMIZING THE HYPERPARAMETERS WITH OPTUNA OBJECTIVE FUNCTION WITH MLflow`

In [15]:
target_col = "price"
feature_cols = [col for col in train_data.columns if col != target_col]
X_train = train_data[feature_cols]
y_train = train_data[target_col]
X_eval = eval_data[feature_cols]
y_eval = eval_data[target_col]

In [18]:
# ====================================================
# DEFINE THE OPTIMIZATION OBJECTIVE WITH OPTUNA
# ====================================================

def optuna_objective(trial):

    # DEFINE THE HYPERPARAMETER SEARCH SPACE
    PARAMS = {
        "n_estimators": trial.suggest_int("n_estimators", 300, 800),
        "max_depth": trial.suggest_categorical("max_depth", [None, 10, 15, 20, 30]),
        "min_samples_split": trial.suggest_int("min_samples_split", 2, 15),
        "min_samples_leaf": trial.suggest_int("min_samples_leaf", 1, 5),
        "max_features": trial.suggest_categorical("max_features", ["sqrt", "log2", 0.5]),
        "criterion": "squared_error",
        "bootstrap": False,
        "ccp_alpha": trial.suggest_float("ccp_alpha", 0.0, 0.02),
        "random_state": 42,
        "n_jobs": -1
    }

    # INITIALIZE AND TRAIN THE MODEL
    model = ExtraTreesRegressor(**PARAMS)
    model.fit(X_train, y_train)

    # EVALUATE THE MODEL
    preds = model.predict(X_eval)
    mse = mean_squared_error(preds, y_eval)
    rmse = np.sqrt(mse)
    mae = mean_absolute_error(preds, y_eval)
    r2 = r2_score(preds, y_eval)

    # LOG METRICS TO MLFLOW
    with mlflow.start_run(nested=True):
        mlflow.log_params(PARAMS)
        mlflow.log_metrics({"mse": mse, "rmse": rmse, "mae": mae, "r2_score": r2})

    return mae

`RUNNING OPTUNA STUDY WITH TRACKING IN MLFLOW`

In [19]:
mlflow.set_tracking_uri("file:../mlruns")
mlflow.set_experiment("OPTUNA_MODEL_OPTIMIZATION")

study = optuna.create_study(direction="minimize", study_name="EXTRATREESREGRESSOR_OPTIMIZATION")
study.optimize(optuna_objective, n_trials=20)

best_trial = study.best_trial
print("Best trial parameters:", best_trial.params)

[I 2025-12-14 18:19:55,044] A new study created in memory with name: EXTRATREESREGRESSOR_OPTIMIZATION


[I 2025-12-14 18:32:13,112] Trial 0 finished with value: 66494.16902883323 and parameters: {'n_estimators': 730, 'max_depth': 30, 'min_samples_split': 15, 'min_samples_leaf': 5, 'max_features': 'sqrt', 'ccp_alpha': 0.015703865432999002}. Best is trial 0 with value: 66494.16902883323.
[I 2025-12-14 18:43:15,799] Trial 1 finished with value: 65575.54666480082 and parameters: {'n_estimators': 638, 'max_depth': None, 'min_samples_split': 13, 'min_samples_leaf': 5, 'max_features': 'log2', 'ccp_alpha': 0.010348258830422173}. Best is trial 1 with value: 65575.54666480082.
[I 2025-12-14 18:55:27,545] Trial 2 finished with value: 85891.7378574611 and parameters: {'n_estimators': 416, 'max_depth': 15, 'min_samples_split': 15, 'min_samples_leaf': 4, 'max_features': 0.5, 'ccp_alpha': 0.017927345636122808}. Best is trial 1 with value: 65575.54666480082.
[I 2025-12-14 19:07:46,167] Trial 3 finished with value: 64972.779486369225 and parameters: {'n_estimators': 730, 'max_depth': 30, 'min_samples_spl

Best trial parameters: {'n_estimators': 798, 'max_depth': None, 'min_samples_split': 13, 'min_samples_leaf': 4, 'max_features': 'log2', 'ccp_alpha': 0.012990027693259696}


###### `TRAINING THE FINAL MODEL WITH THE BEST PARAMS AND LOG TO MLFLOW`

In [24]:
best_params = best_trial.params
best_params

{'n_estimators': 798,
 'max_depth': None,
 'min_samples_split': 13,
 'min_samples_leaf': 4,
 'max_features': 'log2',
 'ccp_alpha': 0.012990027693259696}

In [25]:
optimized_model = ExtraTreesRegressor(**best_params)
optimized_model.fit(X_train, y_train)

preds = optimized_model.predict(X_eval)
mse = mean_squared_error(preds, y_eval)
rmse = np.sqrt(mse)
mae = mean_absolute_error(preds, y_eval)
r2 = r2_score(preds, y_eval)

print(f"Optimized Model Performance:\nMSE: {mse}\nRMSE: {rmse}\nMAE: {mae}\nR2_Score: {r2}")

# LOG FINAL METRICS AND OPTIMIZED MODEL TO MLFLOW
#with mlflow.start_run(run_name="optimized_extratreesregressor_for_housing_price_model"):
#    mlflow.log_params(best_params)
#    mlflow.log_metrics({"mse": mse, "rmse": rmse, "mae": mae, "r2_score": r2})
#    mlflow.sklearn.log_model(optimized_model, name="optimized_extratreesregressor_model", serialization_format="pickle", signature=None, input_example=None)

Optimized Model Performance:
MSE: 15426671647.522705
RMSE: 124204.15310094388
MAE: 65964.68329955288
R2_Score: 0.8389461558895535


`Note that, the optimized model became so large and resulted in memory error. Reason for commenting out the MLflow loggers. Saving manually with joblib is better compared to pickle for memory usage.`

##### `SAVING THE OPTIMIZED MODEL`

In [27]:
with open("../model/optimized_model_without_scaling.joblib", "wb") as file:
    joblib.dump(optimized_model, file, compress=3)
    print("✅ Optimized model by Optuna is saved successfully.")

✅ Optimized model by Optuna is saved successfully.
