## Aim and Objective

The aim of this project is to develop an AI model to optimize energy consumption in industrial plants. By anticipating and regulating energy demand in real-time, the model aims to reduce costs and increase efficiency. Specifically, the model will predict heating and cooling loads based on building characteristics and environmental conditions.


In [None]:
pip install openpyxl

## Data Description

We are using the Energy Efficiency dataset from the UCI Machine Learning Repository. The dataset contains 768 samples and 8 features, along with two target variables: Heating Load and Cooling Load. The data is structured in a single CSV file with the following features:

1. **Relative Compactness**: The ratio of the volume of the building to its external surface area.
2. **Surface Area**: The total exterior surface area of the building.
3. **Wall Area**: The total area of the walls in the building.
4. **Roof Area**: The total area of the roof.
5. **Overall Height**: The height of the building from the ground to the highest point.
6. **Orientation**: The main orientation of the building (1-4 representing north, east, south, and west).
7. **Glazing Area**: The total area of windows as a percentage of the exterior surface area.
8. **Glazing Area Distribution**: The distribution of the glazing (0 means no glazing, 1-5 represent different configurations of glazing).

The target variables are:
- **Heating Load**: The energy required to maintain the indoor temperature at a comfortable level during cold periods.
- **Cooling Load**: The energy required to maintain the indoor temperature at a comfortable level during hot periods.


In [11]:
import pandas as pd

file_path = 'D:/phd docs/Mystic Minds/Energy efficiency/energy+efficiency/ENB2012_data.xlsx'
data = pd.read_excel(file_path, engine='openpyxl')

print(data.head())


     X1     X2     X3      X4   X5  X6   X7  X8     Y1     Y2
0  0.98  514.5  294.0  110.25  7.0   2  0.0   0  15.55  21.33
1  0.98  514.5  294.0  110.25  7.0   3  0.0   0  15.55  21.33
2  0.98  514.5  294.0  110.25  7.0   4  0.0   0  15.55  21.33
3  0.98  514.5  294.0  110.25  7.0   5  0.0   0  15.55  21.33
4  0.90  563.5  318.5  122.50  7.0   2  0.0   0  20.84  28.28


In [12]:

print(data.info())
print(data.describe())
print(data.isnull().sum())


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 10 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   X1      768 non-null    float64
 1   X2      768 non-null    float64
 2   X3      768 non-null    float64
 3   X4      768 non-null    float64
 4   X5      768 non-null    float64
 5   X6      768 non-null    int64  
 6   X7      768 non-null    float64
 7   X8      768 non-null    int64  
 8   Y1      768 non-null    float64
 9   Y2      768 non-null    float64
dtypes: float64(8), int64(2)
memory usage: 60.1 KB
None
               X1          X2          X3          X4         X5          X6  \
count  768.000000  768.000000  768.000000  768.000000  768.00000  768.000000   
mean     0.764167  671.708333  318.500000  176.604167    5.25000    3.500000   
std      0.105777   88.086116   43.626481   45.165950    1.75114    1.118763   
min      0.620000  514.500000  245.000000  110.250000    3.50000    2.00000

## Physical Significance of Features

1. **Relative Compactness**: This feature helps to understand the efficiency of the building's design in terms of energy consumption. A more compact building generally requires less energy for heating and cooling.

2. **Surface Area**: Larger surface areas can lead to higher energy loss or gain, affecting both heating and cooling requirements.

3. **Wall Area**: The area of the walls contributes to heat transfer between the inside and outside of the building, impacting energy consumption.

4. **Roof Area**: Similar to wall area, the roof's surface area affects the amount of heat transferred into or out of the building.

5. **Overall Height**: The height can influence the stratification of air within the building, affecting heating and cooling efficiency.

6. **Orientation**: The direction in which the building faces can influence solar gain, with some orientations receiving more sunlight than others, affecting the heating and cooling loads.

7. **Glazing Area**: The amount of glazing (windows) can significantly affect heat loss or gain, impacting both heating and cooling needs.

8. **Glazing Area Distribution**: The placement and distribution of glazing can influence the building's thermal performance, with certain configurations being more efficient than others.


In [13]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Features (X) and targets (y)
X = data.iloc[:, :-2]  
y_heating = data.iloc[:, -2]  # Second last column is Heating Load
y_cooling = data.iloc[:, -1]  # Last column is Cooling Load

X_train, X_test, y_train_heating, y_test_heating = train_test_split(X, y_heating, test_size=0.2, random_state=42)
X_train, X_test, y_train_cooling, y_test_cooling = train_test_split(X, y_cooling, test_size=0.2, random_state=42)

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)


In [14]:
from sklearn.linear_model import Ridge, Lasso
from sklearn.metrics import mean_squared_error, r2_score

ridge = Ridge()
lasso = Lasso()
ridge.fit(X_train, y_train_heating)
y_pred_heating = ridge.predict(X_test)
mse_heating = mean_squared_error(y_test_heating, y_pred_heating)
r2_heating = r2_score(y_test_heating, y_pred_heating)
print(f'Ridge Regression for Heating Load - MSE: {mse_heating}, R^2: {r2_heating}')

lasso.fit(X_train, y_train_cooling)
y_pred_cooling = lasso.predict(X_test)
mse_cooling = mean_squared_error(y_test_cooling, y_pred_cooling)
r2_cooling = r2_score(y_test_cooling, y_pred_cooling)
print(f'Lasso Regression for Cooling Load - MSE: {mse_cooling}, R^2: {r2_cooling}')


Ridge Regression for Heating Load - MSE: 9.213843234012048, R^2: 0.9116028949393403
Lasso Regression for Cooling Load - MSE: 13.752392534129699, R^2: 0.8515777827979275


## Training Procedure

The training procedure involves several key steps:

1. **Data Preprocessing**: 
   - **Normalization**: All numerical features are normalized to ensure that the model training process is not biased towards features with larger ranges.
   - **Train-Test Split**: The dataset is divided into training and test sets to evaluate the model's performance. Typically, 70-80% of the data is used for training, and the remaining 20-30% is used for testing.

2. **Model Selection**:
   - We use a variety of machine learning models to predict the Heating and Cooling Loads:
     - **Linear Regression**: A basic linear model for initial benchmarking.
     - **Ridge Regression**: A regularized linear model that penalizes large coefficients to prevent overfitting.
     - **Lasso Regression**: Another regularized linear model that can also perform feature selection by setting some coefficients to zero.
     - **Random Forest Regressor**: An ensemble method that uses multiple decision trees to improve predictive accuracy and control overfitting.
     - **Gradient Boosting Regressor**: An ensemble technique that builds trees sequentially, each time trying to correct the errors of the previous tree.
     - **Support Vector Regressor (SVR)**: A model that uses support vector machines for regression tasks.

3. **Model Evaluation**:
   - The models are evaluated using metrics such as Mean Squared Error (MSE) and R-squared (R²) on both the training and test datasets.
   - These metrics help in understanding how well the models have learned the patterns in the training data and how they generalize to unseen data.


In [15]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge, Lasso
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.svm import SVR
from sklearn.metrics import mean_squared_error, r2_score

file_path = 'D:/phd docs/Mystic Minds/Energy efficiency/energy+efficiency/ENB2012_data.xlsx'
data = pd.read_excel(file_path, engine='openpyxl')


X = data.iloc[:, :-2]  
y_heating = data.iloc[:, -2]  
y_cooling = data.iloc[:, -1]  

X_train, X_test, y_train_heating, y_test_heating = train_test_split(X, y_heating, test_size=0.2, random_state=42)
X_train, X_test, y_train_cooling, y_test_cooling = train_test_split(X, y_cooling, test_size=0.2, random_state=42)

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

models = {
    'Ridge': Ridge(),
    'Lasso': Lasso(),
    'Random Forest': RandomForestRegressor(),
    'Gradient Boosting': GradientBoostingRegressor(),
    'SVR': SVR()
}

print("Heating Load Model Evaluation:")
for name, model in models.items():
    model.fit(X_train, y_train_heating)
    y_pred = model.predict(X_test)
    mse = mean_squared_error(y_test_heating, y_pred)
    r2 = r2_score(y_test_heating, y_pred)
    print(f"{name} - MSE: {mse}, R^2: {r2}")

print("\n" + "="*50 + "\n") #==============================

print("Cooling Load Model Evaluation:")
for name, model in models.items():
    model.fit(X_train, y_train_cooling)
    y_pred = model.predict(X_test)
    mse = mean_squared_error(y_test_cooling, y_pred)
    r2 = r2_score(y_test_cooling, y_pred)
    print(f"{name} - MSE: {mse}, R^2: {r2}")


Heating Load Model Evaluation:
Ridge - MSE: 9.213843234012048, R^2: 0.9116028949393403
Lasso - MSE: 12.424268666600273, R^2: 0.8808022499700053
Random Forest - MSE: 0.2414368229220765, R^2: 0.9976836684042364
Gradient Boosting - MSE: 0.26489362841538155, R^2: 0.9974586251028782
SVR - MSE: 7.9698407537153475, R^2: 0.9235377862928831


Cooling Load Model Evaluation:
Ridge - MSE: 9.93717548762786, R^2: 0.8927533798254
Lasso - MSE: 13.752392534129699, R^2: 0.8515777827979275
Random Forest - MSE: 3.1477748566883004, R^2: 0.9660277495480768
Gradient Boosting - MSE: 2.293051691795314, R^2: 0.9752523195210875
SVR - MSE: 10.624078668708774, R^2: 0.8853400011797448


## Hyperparameter Optimization

Hyperparameter optimization is a crucial step to enhance the model's performance by fine-tuning the parameters that govern the learning process. For this, we use the following approaches:

1. **Randomized Search with Cross-Validation**:
   - We perform a randomized search over a predefined grid of hyperparameters for each model. This method samples a fixed number of parameter settings from the specified distributions, allowing a more efficient search than a complete grid search.
   - Cross-validation (typically 3-fold) is used to evaluate each combination of parameters. This helps ensure that the model's performance is consistent across different subsets of the data.

2. **Hyperparameters Tuning**:
   - For **Ridge and Lasso Regression**, we tune the regularization parameter `alpha` to balance the trade-off between fitting the training data and maintaining a model with small weights.
   - For **Random Forest**, parameters like the number of trees (`n_estimators`), maximum depth of the trees (`max_depth`), and the maximum number of features considered for splitting (`max_features`) are optimized.
   - For **Gradient Boosting**, the number of boosting stages (`n_estimators`), the learning rate, and the maximum depth of the individual trees are tuned.
   - For **SVR**, we optimize the regularization parameter `C`, kernel type, and kernel coefficient `gamma`.

3. **Results**:
   - The best set of hyperparameters is selected based on the cross-validation score, and the final model is trained on the entire training dataset.
   - The optimized model is then evaluated on the test set to determine its generalization performance.
   - This process helps in obtaining the best-performing model that balances the complexity and predictive power.



In [16]:
import numpy as np
from sklearn.model_selection import RandomizedSearchCV

param_distributions = {
    'Ridge': {
        'alpha': np.logspace(-3, 3, 10)
    },
    'Lasso': {
        'alpha': np.logspace(-3, 3, 10)
    },
    'Random Forest': {
        'n_estimators': [10, 50, 100, 200],
        'max_features': ['auto', 'sqrt', 'log2'],
        'max_depth': [None, 10, 20, 30, 40, 50]
    },
    'Gradient Boosting': {
        'n_estimators': [50, 100, 200],
        'learning_rate': [0.01, 0.1, 0.2, 0.3],
        'max_depth': [3, 4, 5, 6, 7, 8]
    },
    'SVR': {
        'C': np.logspace(-2, 2, 5),
        'gamma': np.logspace(-2, 2, 5),
        'kernel': ['linear', 'rbf']
    }
}


n_folds = 3

# Randomized Search with Cross-Validation for each model
best_models = {}
print("Hyperparameter Optimization Results:")

for name, model in models.items():
    print(f"Optimizing {name}...")
    random_search = RandomizedSearchCV(
        model,
        param_distributions=param_distributions[name],
        n_iter=20,
        scoring='neg_mean_squared_error',
        n_jobs=-1,
        cv=n_folds,
        random_state=42
    )
    random_search.fit(X_train, y_train_heating)
    best_models[name] = random_search.best_estimator_
    best_params = random_search.best_params_
    print(f"Best parameters for {name}: {best_params}")

    y_pred = best_models[name].predict(X_test)
    mse = mean_squared_error(y_test_heating, y_pred)
    r2 = r2_score(y_test_heating, y_pred)
    print(f"Optimized {name} - MSE: {mse}, R^2: {r2}")
    print("\n" + "="*50 + "\n")


Hyperparameter Optimization Results:
Optimizing Ridge...




Best parameters for Ridge: {'alpha': 0.1}
Optimized Ridge - MSE: 9.158918084644545, R^2: 0.9121298438005049


Optimizing Lasso...
Best parameters for Lasso: {'alpha': 0.001}
Optimized Lasso - MSE: 9.15818305552932, R^2: 0.9121368956293966


Optimizing Random Forest...


18 fits failed out of a total of 60.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
13 fits failed with the following error:
Traceback (most recent call last):
  File "c:\Users\acer\AppData\Local\Programs\Python\Python311\Lib\site-packages\sklearn\model_selection\_validation.py", line 732, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "c:\Users\acer\AppData\Local\Programs\Python\Python311\Lib\site-packages\sklearn\base.py", line 1144, in wrapper
    estimator._validate_params()
  File "c:\Users\acer\AppData\Local\Programs\Python\Python311\Lib\site-packages\sklearn\base.py", line 637, in _validate_params
    validate_parameter_constraints(
  File "c:\Users\acer\AppData\Local\Programs\Python\Python311\Lib\site

Best parameters for Random Forest: {'n_estimators': 100, 'max_features': 'log2', 'max_depth': 20}
Optimized Random Forest - MSE: 0.30818034962662, R^2: 0.9970433346811227


Optimizing Gradient Boosting...
Best parameters for Gradient Boosting: {'n_estimators': 200, 'max_depth': 3, 'learning_rate': 0.3}
Optimized Gradient Boosting - MSE: 0.2104631988392117, R^2: 0.9979808276495832


Optimizing SVR...
Best parameters for SVR: {'kernel': 'rbf', 'gamma': 1.0, 'C': 100.0}
Optimized SVR - MSE: 1.7119504985686453, R^2: 0.983575641104681


