# 2.0 - Model Training: RandomForest and XGBoost

**Objective:**
1. Load the prepared data from the feature engineering stage.
2. Train and evaluate two regression models: `RandomForestRegressor` and `XGBRegressor`.
3. Predict the target variables: `Heating_Load` (column `Y1`) and `Cooling_Load` (column `Y2`).
4. Compare performance metrics (MAE, MSE, R²) to select the best models.
5. Save the trained models for later use in predictions.

## 1. Environment Setup

In [1]:
import pandas as pd
import numpy as np
import xgboost as xgb
import joblib
import os
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

In [2]:
# --- Paths (Adjusted for the local environment) ---
# The paths are relative to the 'notebooks/' directory
TRAIN_PROCESSED_DATA_PATH = '../data/processed/energy_efficiency_train_prepared.csv'
TEST_PROCESSED_DATA_PATH = '../data/processed/energy_efficiency_test_prepared.csv'
MODELS_DIR = '../models/'

# Create the models directory if it doesn't exist
os.makedirs(MODELS_DIR, exist_ok=True)

## 2. Data Loading

We load the train and test datasets that were processed and prepared in the `1.3-feature-engineering.ipynb` notebook.

In [3]:
try:
    df_train = pd.read_csv(TRAIN_PROCESSED_DATA_PATH)
    print("Train Data loaded successfully:")
    display(df_train.head())

    df_test = pd.read_csv(TEST_PROCESSED_DATA_PATH)
    print("Test Data loaded successfully:")
    display(df_test.head())
except FileNotFoundError as err:
    print(f"Error: The file was not found: {err.filename}")
    print("Make sure you have run the previous notebooks to generate the processed data.")
    raise  # Stop here so later cells do not fail with a NameError

Train Data loaded successfully:


Unnamed: 0,X1,X3,X5,X7,X6_3,X6_4,X6_5,X8_1,X8_2,X8_3,X8_4,X8_5,Y1,Y2
0,0.055556,0.571429,0.0,0.25,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,15.42,19.34
1,0.777778,0.428571,1.0,0.625,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,29.87,29.87
2,0.666667,0.285714,1.0,0.25,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,27.03,25.82
3,1.0,0.285714,1.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,32.75,34.0
4,0.388889,1.0,1.0,0.625,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,36.7,36.15


Test Data loaded successfully:


Unnamed: 0,X1,X3,X5,X7,X6_3,X6_4,X6_5,X8_1,X8_2,X8_3,X8_4,X8_5,Y1,Y2
0,0.194444,0.285714,0.0,0.625,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,11.16,14.27
1,0.25,0.142857,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,14.51,17.1
2,0.666667,0.285714,1.0,0.625,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,29.47,29.45
3,0.25,0.142857,0.0,0.25,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,10.68,14.3
4,0.333333,0.0,0.0,0.625,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,12.45,15.1
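
The `Unnamed: 0` column in both outputs is the name pandas assigns to a CSV column with an empty header. That usually means the file was written with its row index included, though this is an assumption about how the previous notebook saved the data. The column does not affect training, because the features are later selected by name. A minimal sketch of how `index_col=0` reads such a column back as the index, using a hypothetical two-row CSV rather than the real files:

```python
import io

import pandas as pd

# Hypothetical miniature of a prepared CSV whose first header field is empty
csv_text = ",X1,Y1\n0,0.055556,15.42\n1,0.777778,29.87\n"

# Default read: the unnamed first column becomes a regular column
df_default = pd.read_csv(io.StringIO(csv_text))

# index_col=0: the first column is used as the row index instead
df_indexed = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(df_default.columns))
print(list(df_indexed.columns))
```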


## 3. Data Preparation for Modeling

We separate the features (X) from the target variables (y). We will train one model of each type per target: `Heating_Load` (column `Y1`) and `Cooling_Load` (column `Y2`).

In [4]:
FEATURES = ['X1','X3','X5','X7','X6_3','X6_4','X6_5','X8_1','X8_2','X8_3','X8_4','X8_5']

TARGET_HEATING = 'Y1' # Heating_Load
TARGET_COOLING = 'Y2' # Cooling_Load

X_train = df_train[FEATURES]
y_train_heating = df_train[TARGET_HEATING]
y_train_cooling = df_train[TARGET_COOLING]

X_test = df_test[FEATURES]
y_test_heating = df_test[TARGET_HEATING]
y_test_cooling = df_test[TARGET_COOLING]
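
The fit, predict, and score steps are the same for every model and target in the sections that follow, so they could be wrapped in one helper. The sketch below is one possible way to do that; `evaluate_regressor` is a hypothetical name, not something defined in this project, and the data is synthetic rather than the energy-efficiency dataset.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score


def evaluate_regressor(model, X_train, y_train, X_test, y_test):
    """Fit `model` on the training split and return test-set MAE, MSE and R2."""
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    return {
        "MAE": mean_absolute_error(y_test, y_pred),
        "MSE": mean_squared_error(y_test, y_pred),
        "R2": r2_score(y_test, y_pred),
    }


# Synthetic stand-in data (NOT the energy-efficiency dataset), for illustration only
rng = np.random.default_rng(42)
X_demo = rng.random((200, 4))
y_demo = 10 * X_demo[:, 0] + 5 * X_demo[:, 1] + rng.normal(0, 0.1, 200)

demo_metrics = evaluate_regressor(
    RandomForestRegressor(n_estimators=50, random_state=42),
    X_demo[:150], y_demo[:150],
    X_demo[150:], y_demo[150:],
)
print({name: round(value, 4) for name, value in demo_metrics.items()})
```

In the notebook, the same helper could be called with `X_train`, `y_train_heating`, `X_test`, and `y_test_heating`, and likewise for the cooling target.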

## 4. Model for `Heating_Load`

### 4.1. RandomForestRegressor

In [5]:
rf_model_h = RandomForestRegressor(n_estimators=100, random_state=42, n_jobs=-1)
rf_model_h.fit(X_train, y_train_heating)
y_pred_rf_h = rf_model_h.predict(X_test)

print("--- RandomForest Results for Heating Load ---")
print(f"MAE: {mean_absolute_error(y_test_heating, y_pred_rf_h):.4f}")
print(f"MSE: {mean_squared_error(y_test_heating, y_pred_rf_h):.4f}")
print(f"R²: {r2_score(y_test_heating, y_pred_rf_h):.4f}")

--- RandomForest Results for Heating Load ---
MAE: 0.6452
MSE: 4.3078
R²: 0.9577


### 4.2. XGBRegressor

In [6]:
xgb_model_h = xgb.XGBRegressor(objective='reg:squarederror', n_estimators=100, random_state=42, n_jobs=-1)
xgb_model_h.fit(X_train, y_train_heating)
y_pred_xgb_h = xgb_model_h.predict(X_test)

print("--- XGBoost Results for Heating Load ---")
print(f"MAE: {mean_absolute_error(y_test_heating, y_pred_xgb_h):.4f}")
print(f"MSE: {mean_squared_error(y_test_heating, y_pred_xgb_h):.4f}")
print(f"R²: {r2_score(y_test_heating, y_pred_xgb_h):.4f}")

--- XGBoost Results for Heating Load ---
MAE: 0.6846
MSE: 4.3036
R²: 0.9577


## 5. Model for `Cooling_Load`

### 5.1. RandomForestRegressor

In [7]:
rf_model_c = RandomForestRegressor(n_estimators=100, random_state=42, n_jobs=-1)
rf_model_c.fit(X_train, y_train_cooling)
y_pred_rf_c = rf_model_c.predict(X_test)

print("--- RandomForest Results for Cooling Load ---")
print(f"MAE: {mean_absolute_error(y_test_cooling, y_pred_rf_c):.4f}")
print(f"MSE: {mean_squared_error(y_test_cooling, y_pred_rf_c):.4f}")
print(f"R²: {r2_score(y_test_cooling, y_pred_rf_c):.4f}")

--- RandomForest Results for Cooling Load ---
MAE: 1.4576
MSE: 8.5555
R²: 0.9016


### 5.2. XGBRegressor

In [8]:
xgb_model_c = xgb.XGBRegressor(objective='reg:squarederror', n_estimators=100, random_state=42, n_jobs=-1)
xgb_model_c.fit(X_train, y_train_cooling)
y_pred_xgb_c = xgb_model_c.predict(X_test)

print("--- XGBoost Results for Cooling Load ---")
print(f"MAE: {mean_absolute_error(y_test_cooling, y_pred_xgb_c):.4f}")
print(f"MSE: {mean_squared_error(y_test_cooling, y_pred_xgb_c):.4f}")
print(f"R²: {r2_score(y_test_cooling, y_pred_xgb_c):.4f}")

--- XGBoost Results for Cooling Load ---
MAE: 1.1549
MSE: 7.2045
R²: 0.9171
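
To compare the four runs side by side, the reported metrics can be collected in one table. The sketch below copies the numbers from the cell outputs above into a DataFrame and picks, per target, the model with the lowest MAE and the lowest MSE. The column names and the `results` variable are illustrative choices.

```python
import pandas as pd

# Test-set metrics copied from the cell outputs above
results = pd.DataFrame(
    [
        ("Heating_Load", "RandomForest", 0.6452, 4.3078, 0.9577),
        ("Heating_Load", "XGBoost", 0.6846, 4.3036, 0.9577),
        ("Cooling_Load", "RandomForest", 1.4576, 8.5555, 0.9016),
        ("Cooling_Load", "XGBoost", 1.1549, 7.2045, 0.9171),
    ],
    columns=["target", "model", "MAE", "MSE", "R2"],
)

# For each target, select the row with the lowest value of each error metric
best_by_mae = results.loc[results.groupby("target")["MAE"].idxmin(), ["target", "model"]]
best_by_mse = results.loc[results.groupby("target")["MSE"].idxmin(), ["target", "model"]]

print(results.to_string(index=False))
print(best_by_mae.to_string(index=False))
print(best_by_mse.to_string(index=False))
```

On these numbers, XGBoost has the lower MSE for both targets, while RandomForest has the lower MAE for `Heating_Load`.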


## 6. Model Saving

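On the test set, XGBoost clearly outperforms RandomForest for `Cooling_Load` (MAE 1.1549 vs. 1.4576, R² 0.9171 vs. 0.9016). For `Heating_Load` the two models are effectively tied (R² of 0.9577 for both), with XGBoost marginally lower on MSE and RandomForest lower on MAE. We therefore save the two XGBoost models in the `models/` folder.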

In [9]:
heating_model_path = os.path.join(MODELS_DIR, 'xgb_heating_load_model.joblib')
cooling_model_path = os.path.join(MODELS_DIR, 'xgb_cooling_load_model.joblib')

joblib.dump(xgb_model_h, heating_model_path)
joblib.dump(xgb_model_c, cooling_model_path)

print(f"Model for Heating Load saved at: {heating_model_path}")
print(f"Model for Cooling Load saved at: {cooling_model_path}")

Model for Heating Load saved at: ../models/xgb_heating_load_model.joblib
Model for Cooling Load saved at: ../models/xgb_cooling_load_model.joblib
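
A quick sanity check after saving is to reload a model with `joblib.load` and confirm that it reproduces the original predictions. The sketch below shows that round trip with a small scikit-learn model on synthetic data, written to a temporary directory. Both are stand-ins chosen so the example runs without the project files. In the notebook the same check would apply to `xgb_model_h` and `heating_model_path`, and loading those files requires `xgboost` to be installed in the loading environment.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Stand-in model trained on synthetic data; in the notebook this would be xgb_model_h
rng = np.random.default_rng(0)
X_demo = rng.random((50, 3))
y_demo = X_demo.sum(axis=1)
model = RandomForestRegressor(n_estimators=10, random_state=0).fit(X_demo, y_demo)

# A temporary directory replaces MODELS_DIR so the sketch leaves no files behind
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "demo_model.joblib")
    joblib.dump(model, model_path)
    reloaded = joblib.load(model_path)

# The reloaded model should reproduce the original predictions
same_predictions = bool(np.allclose(model.predict(X_demo), reloaded.predict(X_demo)))
print(same_predictions)
```

Files written by `joblib.dump` are pickle-based, so they are best loaded with the same library versions that produced them.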


## 7. Conclusion

In this notebook, we trained and evaluated RandomForest and XGBoost models for both targets.

**XGBoost** was the better performer for `Cooling_Load` and roughly on par with RandomForest for `Heating_Load`. The two XGBoost models have been saved and are ready to be used in the next steps of the MLOps cycle.