**Energy Efficiency in Smart Buildings**

**Building Energy Usage Dataset**



**Problem Statement**

Buildings consume around 40% of global energy, especially for heating, cooling, and ventilation. There’s a need to optimize energy usage in smart buildings using AI to reduce energy waste, improve cost efficiency, and contribute to sustainability. This project aims to build a model to predict energy consumption based on occupancy and weather data and provide actionable insights for energy optimization.

Problem Scoping

What?

* Build an AI system to predict energy consumption and suggest optimization strategies for smart buildings.

Who?

* Building managers

* Facility engineers

* Energy consultants

* Sustainability officers

Where/When?

* Commercial smart buildings

* Continuous monitoring (real-time or periodic updates)

* Using historical and real-time data over 2 years

Why?

* To reduce operational costs

* Lower carbon footprint

* Achieve smart energy efficiency and sustainability goals



In [None]:
from google.colab import drive
drive.mount('/content/drive')

**Load Library**

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import datetime
from warnings import filterwarnings
filterwarnings('ignore')

from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.preprocessing import StandardScaler

## Data Loading

This section focuses on loading the dataset into a pandas DataFrame and performing initial checks.

In [None]:
data = pd.read_csv('/content/drive/MyDrive/building_energy_data_extended.csv')

In [None]:
data.head()

In [None]:
new_var = data.columns
new_var

In [None]:
# Checking the shape of the data
num_rows, num_cols = data.shape

print("Shape of the Data:")
print(f"Number of Rows: {num_rows}")
print(f"Number of Columns: {num_cols}\n")

In [None]:
data.info()

In [None]:
data.columns

## Data Cleaning

This section checks the dataset for missing values and duplicate rows; categorical columns are encoded later, before the correlation heatmap.

In [None]:
# Check for missing values

data.isnull().sum()

In [None]:
data.duplicated().sum()
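The two checks above only count problems. If they report any, one possible way to handle them is sketched below on a small made-up frame that borrows the notebook's column names. The values, the median imputation, and the `Timestamp` string format are all assumptions for illustration, not choices made by this project.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real dataset (values are invented for illustration).
toy = pd.DataFrame({
    "Timestamp": ["2023-01-01 00:00", "2023-01-01 01:00",
                  "2023-01-01 02:00", "2023-01-01 02:00"],
    "Temperature (°C)": [21.5, np.nan, 19.0, 19.0],
    "Energy_Usage (kWh)": [250.0, 310.0, 275.0, 275.0],
})

# 1. Drop exact duplicate rows (the last two rows are identical).
cleaned = toy.drop_duplicates().reset_index(drop=True)

# 2. Fill missing numeric values with the column median (an assumed strategy).
num_cols = cleaned.select_dtypes(include="number").columns
cleaned[num_cols] = cleaned[num_cols].fillna(cleaned[num_cols].median())

# 3. Parse the timestamp strings into a proper datetime dtype.
cleaned["Timestamp"] = pd.to_datetime(cleaned["Timestamp"])

print(cleaned)
print(cleaned.dtypes)
```

Median imputation is only one option; dropping incomplete rows or interpolating along time may suit sensor data better.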

## Exploratory Data Analysis (EDA)

This section involves visualizing data distributions, checking for correlations, and understanding the characteristics of the data.

In [None]:
data.describe().T

In [None]:
data['Energy_Usage (kWh)'].plot(kind='hist',figsize=(10,5))
plt.legend()
plt.show()

In [None]:
data['Temperature (°C)'].plot(kind='hist',figsize=(10,5))
plt.legend()
plt.show()

In [None]:
data['Humidity (%)'].plot(kind='hist',figsize=(10,5))
plt.legend()
plt.show()

In [None]:
data['Building_Type'].value_counts().plot(kind='bar', figsize=(10,5))
plt.legend()
plt.show()

In [None]:
data['Occupancy_Level'].value_counts().plot(kind='bar', figsize=(10,5))
plt.legend()
plt.show()

**Boxplot:**

* What it shows:

Boxplots display the distribution of numerical data and highlight potential outliers. The cell below draws boxplots for 'Energy_Usage (kWh)', 'Temperature (°C)', and 'Humidity (%)'.

* Analysis of results:

The boxplot for 'Energy_Usage (kWh)' shows a relatively wide spread of data, with the median around 280 kWh. There appear to be no significant outliers in this feature based on the plot. The boxplots for 'Temperature (°C)' and 'Humidity (%)' also show distributions without obvious extreme outliers.

In [None]:
sns.boxplot(data=data[[ 'Energy_Usage (kWh)', 'Temperature (°C)','Humidity (%)']])
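The "no significant outliers" reading of the boxplot can also be checked numerically. Seaborn's default whiskers extend 1.5 times the interquartile range (IQR) beyond the quartiles, so the same rule can be applied directly to a column. The helper name `iqr_outliers` and the toy values are hypothetical and serve only to illustrate the rule.

```python
import pandas as pd

def iqr_outliers(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Return the values lying more than k * IQR outside the quartiles."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return series[(series < lower) | (series > upper)]

# Toy readings (invented): seven typical values and one extreme value.
toy_usage = pd.Series([250, 260, 270, 280, 290, 300, 310, 900],
                      name="Energy_Usage (kWh)")

flagged = iqr_outliers(toy_usage)
print(flagged)
```

On the real data, `iqr_outliers(data['Energy_Usage (kWh)'])` should return an empty Series if the boxplot reading is right.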

**Displot:**

* What it shows:

A displot (distribution plot) shows the distribution of a single variable. In this case, it's for 'Energy_Usage (kWh)'.

* Analysis of results:

The displot for 'Energy_Usage (kWh)' shows a fairly uniform distribution across the range of values. This means that energy usage values are distributed quite evenly between the minimum and maximum values, without strong peaks or skewness.

In [None]:
sns.displot(data, x= "Energy_Usage (kWh)", color="blue")
plt.show()

In [None]:
# Drop the 'Timestamp' column
data = data.drop('Timestamp', axis=1)
data
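Dropping `Timestamp` discards any time-of-day or seasonal signal. A possible alternative, sketched here on invented rows, is to derive calendar features before dropping the column. The timestamp format and the feature names (`hour`, `day_of_week`, `month`, `is_weekend`) are assumptions, and whether they help on this dataset is untested.

```python
import pandas as pd

# Toy rows (invented); the real Timestamp format may differ.
toy = pd.DataFrame({
    "Timestamp": ["2023-01-02 08:00:00", "2023-01-07 14:00:00", "2023-07-15 22:00:00"],
    "Energy_Usage (kWh)": [310.0, 180.0, 265.0],
})

ts = pd.to_datetime(toy["Timestamp"])
toy["hour"] = ts.dt.hour                                # 0-23
toy["day_of_week"] = ts.dt.dayofweek                    # Monday = 0 ... Sunday = 6
toy["month"] = ts.dt.month                              # 1-12
toy["is_weekend"] = (ts.dt.dayofweek >= 5).astype(int)  # Saturday/Sunday flag

# The raw string column can be dropped once the features are extracted.
toy = toy.drop(columns="Timestamp")
print(toy)
```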

**Heatmap:**

* What it shows:

A heatmap visualizes the correlation matrix between numerical variables. The color intensity indicates the strength and direction of the correlation (positive or negative).

* Analysis of results:

The heatmap shows the correlation coefficients between the encoded 'Building_ID', 'Energy_Usage (kWh)', 'Temperature (°C)', 'Humidity (%)', 'Building_Type', and 'Occupancy_Level'.

The diagonal line is always 1, representing the correlation of a variable with itself.

The correlations between 'Energy_Usage (kWh)' and the other features are very close to zero (e.g., 0.016 with 'Building_ID', 0.0095 with 'Temperature (°C)', -0.0081 with 'Humidity (%)', 0.0089 with 'Building_Type', and -0.0065 with 'Occupancy_Level').

This suggests that there is very little linear relationship between energy usage and these individual features in this dataset.

This finding is consistent with the poor performance of the linear regression model reported in the Model Evaluation section.
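A near-zero Pearson coefficient rules out a linear relationship only. The synthetic sketch below (the data is generated, not taken from the building dataset) shows a target that depends entirely on `x` yet has almost no linear correlation with it, which is why the non-linear models are still worth trying.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=5000)

# A noisy linear dependence and a purely quadratic dependence on the same x.
y_linear = 2 * x + rng.normal(0, 0.1, size=x.size)
y_quadratic = x ** 2

r_linear = np.corrcoef(x, y_linear)[0, 1]
r_quadratic = np.corrcoef(x, y_quadratic)[0, 1]

print(f"Pearson r, linear target:    {r_linear:.3f}")
print(f"Pearson r, quadratic target: {r_quadratic:.3f}")
```

The quadratic target is fully determined by `x`, but its coefficient stays close to zero because the relationship is symmetric around zero.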

In [None]:
from sklearn.preprocessing import LabelEncoder

# Encode categorical features
if 'Building_ID' in data.columns and data['Building_ID'].dtype == 'object':
    data['Building_ID'] = LabelEncoder().fit_transform(data['Building_ID'])
if 'Building_Type' in data.columns and data['Building_Type'].dtype == 'object':
    data['Building_Type'] = LabelEncoder().fit_transform(data['Building_Type'])
if 'Occupancy_Level' in data.columns and data['Occupancy_Level'].dtype == 'object':
    data['Occupancy_Level'] = LabelEncoder().fit_transform(data['Occupancy_Level'])

plt.figure(figsize=(12, 8))
sns.heatmap(data.corr(numeric_only=True), annot=True, cmap='coolwarm')
plt.show()
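`LabelEncoder` assigns integers in alphabetical order of the labels, so a hypothetical occupancy scale of Low/Medium/High would become High=0, Low=1, Medium=2, and building types would receive an arbitrary numeric order. The sketch below shows one alternative: one-hot encoding for the nominal column and an explicit mapping for the ordinal one. The category names are assumed for illustration; the dataset's real labels may differ.

```python
import pandas as pd

# Toy categories (assumed, not read from the real dataset).
toy = pd.DataFrame({
    "Building_Type": ["Office", "Retail", "Residential", "Office"],
    "Occupancy_Level": ["Low", "High", "Medium", "High"],
})

# Nominal feature: one indicator column per category, no implied order.
one_hot = pd.get_dummies(toy["Building_Type"], prefix="Building_Type", dtype=int)

# Ordinal feature: an explicit mapping that preserves Low < Medium < High.
occupancy_order = {"Low": 0, "Medium": 1, "High": 2}
toy["Occupancy_Level_encoded"] = toy["Occupancy_Level"].map(occupancy_order)

encoded = pd.concat([toy[["Occupancy_Level_encoded"]], one_hot], axis=1)
print(encoded)
```

Tree-based models are fairly tolerant of arbitrary integer codes, but linear regression and KNN interpret them as magnitudes, so the encoding choice matters more there.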

**Pairplot:**

* What it shows:

A pairplot creates a grid of scatter plots for all pairs of numerical variables in the dataset, and histograms for each individual numerical variable along the diagonal. It helps visualize relationships and distributions among multiple variables simultaneously.

* Analysis of results:

The pairplot reinforces the observations from the individual histograms and the heatmap. The scatter plots between 'Energy_Usage (kWh)' and the other features show no clear linear patterns or strong relationships. The distributions along the diagonal are consistent with the individual histograms and displots. The plots involving the encoded categorical variables ('Building_ID', 'Building_Type', 'Occupancy_Level') show discrete points or clusters, as expected.


In [None]:
sns.pairplot(data)
plt.show()

## Data Modeling

This section covers preparing the data for machine learning models and training different regression algorithms.

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.tree import DecisionTreeRegressor, plot_tree
from sklearn.ensemble import RandomForestRegressor
from sklearn.neighbors import KNeighborsRegressor

In [None]:
x = data[['Building_ID', 'Temperature (°C)', 'Humidity (%)', 'Building_Type', 'Occupancy_Level']]
y = data['Energy_Usage (kWh)']

In [None]:
x.head()

In [None]:
y.head()

In [None]:
X_train, X_test, Y_train, Y_test = train_test_split(x, y, test_size=0.2, random_state=42)
print(X_train.shape, X_test.shape, Y_train.shape, Y_test.shape)

## Model Evaluation

This section evaluates the performance of the trained models using relevant metrics and visualizations.

**Analysis of Linear Regression Results:**

**Metrics:**

The RMSE (Root Mean Squared Error) on the test data is approximately 129.14 kWh, and the R-squared value is around 0.000.
* The high RMSE indicates that the model's predictions typically miss the actual energy usage by roughly 129 kWh. RMSE is the square root of the mean squared error, so it weights large misses more heavily than a plain average of absolute errors would.
* The R-squared value of 0.000 means that the linear regression model explains almost none of the variance in the energy usage data. An R-squared of 0 is the score of a model that does no better than always predicting the mean of the target variable.

**Scatter Plot:**

* The scatter plot of actual vs. predicted energy usage shows that all the predicted values are clustered around a single value (approximately 278 kWh, which is close to the mean energy usage).
* The points do not follow the perfect prediction diagonal line at all.
* This visual confirms the low R-squared value and indicates that the linear model is essentially predicting the average energy usage regardless of the input features.
* This suggests that the linear relationship between the selected features and energy usage is very weak in this dataset.
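To make the "no better than predicting the mean" reading concrete, the sketch below computes R-squared by hand on five invented values. The helper `r2` is a hypothetical name; it follows the standard definition, 1 minus the ratio of the residual sum of squares to the total sum of squares.

```python
import numpy as np

def r2(y_true, y_pred):
    """R-squared: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # error of the model
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # error of the mean-only baseline
    return 1.0 - ss_res / ss_tot

# Invented target values in kWh.
y_true = np.array([100.0, 200.0, 300.0, 400.0, 500.0])

mean_pred = np.full_like(y_true, y_true.mean())  # always predict the mean (300)
off_pred = np.full_like(y_true, 350.0)           # a constant that misses the mean

r2_mean = r2(y_true, mean_pred)
r2_off = r2(y_true, off_pred)
rmse_mean = np.sqrt(np.mean((y_true - mean_pred) ** 2))

print(r2_mean, r2_off, rmse_mean)
```

The mean-only predictor scores exactly 0, and any predictor with a larger squared error scores below 0. Its RMSE equals the standard deviation of the target, so an RMSE close to the spread of 'Energy_Usage (kWh)' is another sign that a model has learned little beyond the average.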

In [None]:
# Train Linear Regression
lr_model = LinearRegression()
lr_model.fit(X_train, Y_train)

In [None]:
# Predictions
train_pred = lr_model.predict(X_train)
test_pred = lr_model.predict(X_test)

In [None]:
# Metrics
RMSE_train = np.sqrt(mean_squared_error(Y_train, train_pred))
RMSE_test = np.sqrt(mean_squared_error(Y_test, test_pred))
R2_train = r2_score(Y_train, train_pred)
R2_test = r2_score(Y_test, test_pred)

print("RMSE TrainingData =", RMSE_train)
print("RMSE TestData =", RMSE_test)
print("-" * 50)
print("R² on Train =", R2_train)
print("R² on Test =", R2_test)

In [None]:
# --- Scatter Plot ---
plt.figure(figsize=(8, 6))
plt.scatter(Y_test, test_pred, alpha=0.5, edgecolors='k')

# Perfect prediction line
min_val = min(Y_test.min(), test_pred.min())
max_val = max(Y_test.max(), test_pred.max())
plt.plot([min_val, max_val], [min_val, max_val], 'r--', lw=2, label='Perfect Prediction')

# Labels & Title
plt.xlabel("Actual Energy Usage (kWh)")
plt.ylabel("Predicted Energy Usage (kWh)")
plt.title("Actual vs Predicted Energy Usage (Linear Regression)")

# Add RMSE & R² inside plot
plt.text(min_val + (max_val-min_val)*0.05, max_val - (max_val-min_val)*0.05,
         f"RMSE = {RMSE_test:.2f}\nR² = {R2_test:.3f}",
         fontsize=12, bbox=dict(facecolor='white', alpha=0.6))

plt.legend()
plt.grid(True)
plt.savefig('linear_regression_scatter_plot.png') # Save the figure
plt.show()

**Analysis of Random Forest Results:**

*   **Metrics:** The RMSE on the training data is approximately 119.25 kWh, and on the test data, it's around 129.95 kWh. The R-squared on the training data is about 0.156, and on the test data, it's around -0.012.
    *   The RMSE values are still quite high, similar to the linear regression model, indicating a significant average prediction error.
    *   The R-squared value on the test data is negative (-0.012). A negative R-squared indicates that the model performs worse than simply predicting the mean of the target variable. The R-squared on the training data is low (0.156), suggesting the model explains only a small percentage of the variance in the training data.
*   **Scatter Plot:** The scatter plot of actual vs. predicted energy usage for the Random Forest model also shows the predicted values clustered around a central value, similar to the linear regression plot. There is no clear linear trend along the perfect prediction line.
    *   This visual confirms the low and negative R-squared values and suggests that the Random Forest model, with the current features and hyperparameters, is not effectively capturing the patterns in the data to accurately predict energy usage.

In [None]:
model = RandomForestRegressor(
    n_estimators=200,
    max_depth=10,
    min_samples_split=5,
    min_samples_leaf=4,
    max_features='sqrt',
    random_state=42
)
model.fit(X_train, Y_train)

In [None]:
train_pred = model.predict(X_train)
test_pred = model.predict(X_test)

In [None]:
# Metrics
RMSE_train = np.sqrt(mean_squared_error(Y_train, train_pred))
RMSE_test = np.sqrt(mean_squared_error(Y_test, test_pred))

print("RMSE TrainingData = ", RMSE_train)
print("RMSE TestData = ", RMSE_test)
print("-" * 50)
print("R² on Train : ", r2_score(Y_train, train_pred))
print("R² on Test : ", r2_score(Y_test, test_pred))

In [None]:
import matplotlib.pyplot as plt
from sklearn.metrics import mean_squared_error, r2_score
import numpy as np

# Calculate metrics
rmse_test = np.sqrt(mean_squared_error(Y_test, test_pred))
r2_test = r2_score(Y_test, test_pred)

# Create scatter plot
plt.figure(figsize=(8, 6))
plt.scatter(Y_test, test_pred, alpha=0.5, edgecolors='k')

# Plot the perfect prediction diagonal
min_val = min(Y_test.min(), test_pred.min())
max_val = max(Y_test.max(), test_pred.max())
plt.plot([min_val, max_val], [min_val, max_val], 'r--', lw=2, label='Perfect Prediction')

# Labels and title
plt.xlabel("Actual Energy Usage (kWh)")
plt.ylabel("Predicted Energy Usage (kWh)")
plt.title("Actual vs Predicted Energy Usage (Random Forest)")

# Add RMSE & R² text
plt.text(min_val + (max_val-min_val)*0.05, max_val - (max_val-min_val)*0.05,
         f"RMSE = {rmse_test:.2f}\nR² = {r2_test:.3f}",
         fontsize=12, bbox=dict(facecolor='white', alpha=0.6))

plt.legend()
plt.grid(True)
plt.savefig('random_forest_scatter_plot.png') # Save the figure
plt.show()

**Analysis of Decision Tree Results:**

*   **Metrics:** The RMSE on the training data is approximately 120.65 kWh, and on the test data, it's around 139.49 kWh. The R-squared on the training data is about 0.136, and on the test data, it's around -0.167.
    *   Similar to the other models, the RMSE is high, indicating a significant average prediction error.
    *   The R-squared is low on the training set (0.136) and negative on the test set (-0.167), so the Decision Tree model, with the current configuration, also performs worse on unseen data than simply predicting the mean of the target variable.
*   **Scatter Plot:** The scatter plot of actual vs. predicted energy usage for the Decision Tree model shows some scattered points, but they are not closely aligned with the perfect prediction line. There is a tendency for the predicted values to cluster, although there is more variation than in the Linear Regression plot.
    *   This visual confirms the low R-squared value and indicates that the Decision Tree model is not accurately capturing the underlying patterns in the data to predict energy usage effectively.

In [None]:
dt_model = DecisionTreeRegressor(
    max_depth=10,
    min_samples_split=5,
    min_samples_leaf=4,
    random_state=42
)
dt_model.fit(X_train, Y_train)

In [None]:
# Predictions
train_pred = dt_model.predict(X_train)
test_pred = dt_model.predict(X_test)

In [None]:
# Metrics
RMSE_train = np.sqrt(mean_squared_error(Y_train, train_pred))
RMSE_test = np.sqrt(mean_squared_error(Y_test, test_pred))
R2_train = r2_score(Y_train, train_pred)
R2_test = r2_score(Y_test, test_pred)

print("RMSE TrainingData =", RMSE_train)
print("RMSE TestData =", RMSE_test)
print("-" * 50)
print("R² on Train =", R2_train)
print("R² on Test =", R2_test)

In [None]:
# --- Scatter Plot with diagonal & metrics ---
plt.figure(figsize=(8, 6))
plt.scatter(Y_test, test_pred, alpha=0.5, edgecolors='k')

# Perfect prediction line
min_val = min(Y_test.min(), test_pred.min())
max_val = max(Y_test.max(), test_pred.max())
plt.plot([min_val, max_val], [min_val, max_val], 'r--', lw=2, label='Perfect Prediction')

# Labels and title
plt.xlabel("Actual Energy Usage (kWh)")
plt.ylabel("Predicted Energy Usage (kWh)")
plt.title("Actual vs Predicted Energy Usage (Decision Tree)")

# Add RMSE & R² text
plt.text(min_val + (max_val-min_val)*0.05, max_val - (max_val-min_val)*0.05,
         f"RMSE = {RMSE_test:.2f}\nR² = {R2_test:.3f}",
         fontsize=12, bbox=dict(facecolor='white', alpha=0.6))

plt.legend()
plt.grid(True)
plt.savefig('decision_tree_scatter_plot.png') # Save the figure
plt.show()

In [None]:
from sklearn.model_selection import cross_val_score

# Create lists to store RMSE for different n_neighbors values
rmse_scores = []
n_neighbors_values = range(1, 21) # Check n_neighbors from 1 to 20

for n in n_neighbors_values:
    knn_model = KNeighborsRegressor(n_neighbors=n)
    # Score with negative MSE, then negate the mean fold score and take the square root to get RMSE
    scores = cross_val_score(knn_model, X_train, Y_train, cv=5, scoring='neg_mean_squared_error')
    rmse = np.sqrt(-scores.mean())
    rmse_scores.append(rmse)

# Find the optimal n_neighbors with the lowest RMSE
optimal_n_neighbors = n_neighbors_values[np.argmin(rmse_scores)]
print(f"Optimal n_neighbors: {optimal_n_neighbors}")

# Plot the RMSE for different n_neighbors
plt.figure(figsize=(10, 6))
plt.plot(n_neighbors_values, rmse_scores, marker='o')
plt.xlabel('Number of Neighbors (n_neighbors)')
plt.ylabel('Cross-Validated RMSE')
plt.title('RMSE vs. Number of Neighbors for KNN Regression')
plt.xticks(n_neighbors_values)
plt.grid(True)
plt.show()
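KNN ranks neighbors by distance, so features on large numeric scales can dominate the search. `StandardScaler` is imported at the top of the notebook but never applied. The sketch below shows how it could be combined with the regressor in a pipeline so that scaling is refit inside each cross-validation fold. The data is synthetic and built to exaggerate the effect; whether scaling helps on the building dataset is untested.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
n = 300

# Synthetic stand-in: one informative small-scale feature, one large-scale noise feature.
X_demo = np.column_stack([
    rng.uniform(0, 1, n),      # drives the target
    rng.uniform(0, 1000, n),   # irrelevant, but dominates raw Euclidean distances
])
y_demo = 100 * X_demo[:, 0] + rng.normal(0, 1, n)

def cv_rmse(model):
    """5-fold cross-validated RMSE, computed the same way as in the cell above."""
    scores = cross_val_score(model, X_demo, y_demo, cv=5,
                             scoring="neg_mean_squared_error")
    return float(np.sqrt(-scores.mean()))

rmse_raw = cv_rmse(KNeighborsRegressor(n_neighbors=5))
rmse_scaled = cv_rmse(make_pipeline(StandardScaler(), KNeighborsRegressor(n_neighbors=5)))

print(f"CV RMSE without scaling: {rmse_raw:.2f}")
print(f"CV RMSE with scaling:    {rmse_scaled:.2f}")
```

Wrapping the scaler in the pipeline, rather than scaling the whole training set first, keeps each validation fold out of the scaler's statistics.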

**Analysis of KNN Regression Results:**

*   **Metrics:** The RMSE on the training data is approximately 0.0 kWh, and on the test data, it's around 141.91 kWh. The R-squared on the training data is 1.0, and on the test data, it's around -0.207.
    *   The RMSE of 0.0 and R-squared of 1.0 on the training data indicate a perfect fit to the training data. This is expected with `weights='distance'`: each training point is its own nearest neighbor at distance zero, so the model returns that point's own target value (assuming no two training rows share identical features with different targets). The training score is therefore a property of the weighting scheme and says little about how well the model generalizes.
    *   The high RMSE (141.91 kWh) and negative R-squared (-0.207) on the test data show that the model performs poorly on unseen data, even worse than simply predicting the mean. The gap between a perfect training score and a negative test score shows that the model memorizes the training set rather than learning a pattern that carries over to new data.
*   **Scatter Plot:** The scatter plot of actual vs. predicted energy usage for the KNN Regressor shows points scattered around the central predicted value, with some points close to the perfect prediction line. Because `train_test_split` assigns each row to exactly one set, these are not training rows reappearing in the test set; they are more likely chance agreements or test rows that duplicate a training row. The overall spread on the test data is large, and the predictions do not follow the diagonal line well.
    *   This visual reinforces the poor performance on the test set and the issue of overfitting.
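The perfect training score can be reproduced on pure noise, which shows it comes from the weighting scheme and not from anything learned. In this sketch the features and target are independent random draws (synthetic, not the building data), and the only difference between the two models is the `weights` argument.

```python
import numpy as np
from sklearn.metrics import r2_score
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)
X_noise = rng.normal(size=(200, 3))
y_noise = rng.normal(size=200)   # target is independent of the features

knn_distance = KNeighborsRegressor(n_neighbors=5, weights="distance").fit(X_noise, y_noise)
knn_uniform = KNeighborsRegressor(n_neighbors=5, weights="uniform").fit(X_noise, y_noise)

# Score both models on the very data they were fitted on.
r2_distance = r2_score(y_noise, knn_distance.predict(X_noise))
r2_uniform = r2_score(y_noise, knn_uniform.predict(X_noise))

print(f"Training R² with weights='distance': {r2_distance:.3f}")
print(f"Training R² with weights='uniform':  {r2_uniform:.3f}")
```

With distance weighting, a training point at distance zero receives all of the weight, so the training R² is 1.0 even though there is nothing to learn. For this reason a held-out or cross-validated score is the more meaningful number for this model.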

In [None]:
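# Train KNN Regressor
# Note: n_neighbors is fixed at 5 here; optimal_n_neighbors from the cross-validation cell above is not used.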
knn_model = KNeighborsRegressor(
    n_neighbors=5,
    weights='distance',
    metric='minkowski',
    p=2
)
knn_model.fit(X_train, Y_train)

In [None]:
# Predictions
train_pred = knn_model.predict(X_train)
test_pred = knn_model.predict(X_test)

In [None]:
# Metrics
RMSE_train = np.sqrt(mean_squared_error(Y_train, train_pred))
RMSE_test = np.sqrt(mean_squared_error(Y_test, test_pred))
R2_train = r2_score(Y_train, train_pred)
R2_test = r2_score(Y_test, test_pred)

print("RMSE TrainingData =", RMSE_train)
print("RMSE TestData =", RMSE_test)
print("-" * 50)
print("R² on Train =", R2_train)
print("R² on Test =", R2_test)

In [None]:
# --- Scatter Plot ---
plt.figure(figsize=(8, 6))
plt.scatter(Y_test, test_pred, alpha=0.5, edgecolors='k')

# Perfect prediction line
min_val = min(Y_test.min(), test_pred.min())
max_val = max(Y_test.max(), test_pred.max())
plt.plot([min_val, max_val], [min_val, max_val], 'r--', lw=2, label='Perfect Prediction')

# Labels & Title
plt.xlabel("Actual Energy Usage (kWh)")
plt.ylabel("Predicted Energy Usage (kWh)")
plt.title("Actual vs Predicted Energy Usage (KNN Regressor)")

# Add RMSE & R² inside plot
plt.text(min_val + (max_val-min_val)*0.05, max_val - (max_val-min_val)*0.05,
         f"RMSE = {RMSE_test:.2f}\nR² = {R2_test:.3f}",
         fontsize=12, bbox=dict(facecolor='white', alpha=0.6))

plt.legend()
plt.grid(True)
plt.savefig('knn_regression_scatter_plot.png') # Save the figure
plt.show()
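The four evaluations above repeat the same fit, predict, and score steps. A possible consolidation is sketched below: a hypothetical helper, `evaluate_models`, that loops over the models and collects RMSE and R² into one table. The hyperparameters mirror the cells above, but the data is a synthetic stand-in with a target unrelated to the features, so the printed numbers will not match the results reported in this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor

def evaluate_models(models, X_train, X_test, y_train, y_test):
    """Fit each model and return one row of RMSE/R² per model and split."""
    rows = []
    for name, model in models.items():
        model.fit(X_train, y_train)
        for split, X_part, y_part in (("train", X_train, y_train),
                                      ("test", X_test, y_test)):
            pred = model.predict(X_part)
            rows.append({
                "model": name,
                "split": split,
                "RMSE": float(np.sqrt(mean_squared_error(y_part, pred))),
                "R2": float(r2_score(y_part, pred)),
            })
    return pd.DataFrame(rows)

# Synthetic stand-in data: five features, target drawn independently of them.
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(400, 5))
y_demo = rng.uniform(50, 500, size=400)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

models = {
    "Linear Regression": LinearRegression(),
    "Decision Tree": DecisionTreeRegressor(
        max_depth=10, min_samples_split=5, min_samples_leaf=4, random_state=42),
    "Random Forest": RandomForestRegressor(
        n_estimators=200, max_depth=10, min_samples_split=5, min_samples_leaf=4,
        max_features="sqrt", random_state=42),
    "KNN": KNeighborsRegressor(n_neighbors=5, weights="distance"),
}

results = evaluate_models(models, X_tr, X_te, y_tr, y_te)
print(results.pivot(index="model", columns="split", values="R2").round(3))
```

In the notebook itself, the call would take `X_train, X_test, Y_train, Y_test` from the split cell in place of the synthetic arrays.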