## Machine Learning: Regression - Predicting Energy Efficiency of Buildings

## Introduction
The data set contains measurements recorded at 10-minute intervals over about 4.5 months. Temperature and humidity inside the house were monitored with a ZigBee wireless sensor network. Each wireless node transmitted its temperature and humidity readings roughly every 3.3 minutes, and the readings were then averaged over 10-minute periods. Energy use was logged every 10 minutes with m-bus energy meters. Weather data from the nearest airport weather station (Chievres Airport, Belgium) was downloaded from a public data set from Reliable Prognosis (rp5.ru) and merged with the experimental data using the date and time column. Two random variables were added to the data set for testing the regression models and for filtering out non-predictive attributes (parameters). The attributes are described below.
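The averaging and merging steps described above can be sketched with pandas. This is only an illustration on made-up readings: the sensor values, the weather values, and the variable names (`sensor`, `weather`, `merged`) are assumptions, not the original preprocessing code.

```python
import numpy as np
import pandas as pd

# Hypothetical raw sensor readings, one roughly every 3.3 minutes (200 s)
rng = np.random.default_rng(0)
sensor_index = pd.date_range("2016-01-11 17:00:00", periods=54, freq="200s")
sensor = pd.DataFrame(
    {"T1": 20 + rng.normal(0, 0.1, size=len(sensor_index))},
    index=sensor_index,
)

# Average the readings over 10-minute periods and expose the timestamp as a "date" column
sensor_10min = sensor.resample("10min").mean().rename_axis("date").reset_index()

# Hypothetical weather-station data already on a 10-minute grid
weather = pd.DataFrame(
    {
        "date": pd.date_range("2016-01-11 17:00:00", periods=18, freq="10min"),
        "T_out": np.linspace(6.6, 4.9, 18),
    }
)

# Merge the two sources on the shared date and time column
merged = sensor_10min.merge(weather, on="date", how="inner")
print(merged.head())
```

An inner merge keeps only timestamps present in both sources, which is one reasonable choice when the two series cover the same period.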

Attribute Information:

Date : time year-month-day hour:minute:second

Appliances : energy use in Wh 

lights : energy use of light fixtures in the house in Wh

T1 : Temperature in kitchen area, in Celsius

RH_1 : Humidity in kitchen area, in %

T2 : Temperature in living room area, in Celsius

RH_2 : Humidity in living room area, in %

T3 : Temperature in laundry room area, in Celsius

RH_3 : Humidity in laundry room area, in %

T4 : Temperature in office room, in Celsius

RH_4 : Humidity in office room, in %

T5 : Temperature in bathroom, in Celsius

RH_5 : Humidity in bathroom, in %

T6 : Temperature outside the building (north side), in Celsius

RH_6 : Humidity outside the building (north side), in %

T7 : Temperature in ironing room, in Celsius

RH_7 : Humidity in ironing room, in %

T8 : Temperature in teenager room 2, in Celsius

RH_8 : Humidity in teenager room 2, in %

T9 : Temperature in parents room, in Celsius

RH_9 : Humidity in parents room, in %

T_out : Temperature outside (from Chievres weather station), in Celsius

Pressure : (from Chievres weather station), in mm Hg

RH_out : Humidity outside (from Chievres weather station), in %

Wind speed : (from Chievres weather station), in m/s

Visibility : (from Chievres weather station), in km

Tdewpoint : (from Chievres weather station), in °C

rv1 : Random variable 1, nondimensional

rv2 : Random variable 2, nondimensional
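One way a random column such as rv1 or rv2 can help filter out non-predictive attributes is to use it as a baseline: a feature that relates to the target no more strongly than pure noise does is a candidate for removal. The sketch below illustrates the idea on synthetic data; the column names (`informative`, `irrelevant`, `probe`) and the use of absolute correlation as the score are assumptions made for illustration, not the method used with this data set.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 2000

# Hypothetical data: one informative feature, one irrelevant feature, one random probe
informative = rng.normal(size=n)
irrelevant = rng.normal(size=n)
probe = rng.uniform(0, 50, size=n)  # plays the role of rv1 / rv2
target = 3.0 * informative + rng.normal(size=n)

df = pd.DataFrame(
    {
        "informative": informative,
        "irrelevant": irrelevant,
        "probe": probe,
        "target": target,
    }
)

# Absolute correlation of every feature with the target
abs_corr = df.drop(columns="target").corrwith(df["target"]).abs()

# Keep only features that beat the random probe
baseline = abs_corr["probe"]
kept = abs_corr[abs_corr > baseline].index.tolist()
print(abs_corr.round(3))
print("Features scoring above the random probe:", kept)
```

An irrelevant feature can still beat the probe by chance, so a single comparison like this is a rough screen rather than a guarantee.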

### Importing Libraries

In [14]:
import pandas as pd
import numpy as np
import seaborn as sns
import math
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

In [15]:
#Importing energydata
energydata = pd.read_csv("energydata_complete.csv")
energydata

,date,Appliances,lights,T1,RH_1,T2,RH_2,T3,RH_3,T4,...,T9,RH_9,T_out,Press_mm_hg,RH_out,Windspeed,Visibility,Tdewpoint,rv1,rv2
0,2016-01-11 17:00:00,60,30,19.890000,47.596667,19.200000,44.790000,19.790000,44.730000,19.000000,...,17.033333,45.5300,6.600000,733.5,92.000000,7.000000,63.000000,5.300000,13.275433,13.275433
1,2016-01-11 17:10:00,60,30,19.890000,46.693333,19.200000,44.722500,19.790000,44.790000,19.000000,...,17.066667,45.5600,6.483333,733.6,92.000000,6.666667,59.166667,5.200000,18.606195,18.606195
2,2016-01-11 17:20:00,50,30,19.890000,46.300000,19.200000,44.626667,19.790000,44.933333,18.926667,...,17.000000,45.5000,6.366667,733.7,92.000000,6.333333,55.333333,5.100000,28.642668,28.642668
3,2016-01-11 17:30:00,50,40,19.890000,46.066667,19.200000,44.590000,19.790000,45.000000,18.890000,...,17.000000,45.4000,6.250000,733.8,92.000000,6.000000,51.500000,5.000000,45.410389,45.410389
4,2016-01-11 17:40:00,60,40,19.890000,46.333333,19.200000,44.530000,19.790000,45.000000,18.890000,...,17.000000,45.4000,6.133333,733.9,92.000000,5.666667,47.666667,4.900000,10.084097,10.084097
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
19730,2016-05-27 17:20:00,100,0,25.566667,46.560000,25.890000,42.025714,27.200000,41.163333,24.700000,...,23.200000,46.7900,22.733333,755.2,55.666667,3.333333,23.666667,13.333333,43.096812,43.096812
19731,2016-05-27 17:30:00,90,0,25.500000,46.500000,25.754000,42.080000,27.133333,41.223333,24.700000,...,23.200000,46.7900,22.600000,755.2,56.000000,3.500000,24.500000,13.300000,49.282940,49.282940
19732,2016-05-27 17:40:00,270,10,25.500000,46.596667,25.628571,42.768571,27.050000,41.690000,24.700000,...,23.200000,46.7900,22.466667,755.2,56.333333,3.666667,25.333333,13.266667,29.199117,29.199117
19733,2016-05-27 17:50:00,420,10,25.500000,46.990000,25.414000,43.036000,26.890000,41.290000,24.700000,...,23.200000,46.8175,22.333333,755.2,56.666667,3.833333,26.166667,13.233333,6.322784,6.322784


In [16]:
#Summary Statistics of the dataset
energydata.describe()

,Appliances,lights,T1,RH_1,T2,RH_2,T3,RH_3,T4,RH_4,...,T9,RH_9,T_out,Press_mm_hg,RH_out,Windspeed,Visibility,Tdewpoint,rv1,rv2
count,19735.0,19735.0,19735.0,19735.0,19735.0,19735.0,19735.0,19735.0,19735.0,19735.0,...,19735.0,19735.0,19735.0,19735.0,19735.0,19735.0,19735.0,19735.0,19735.0,19735.0
mean,97.694958,3.801875,21.686571,40.259739,20.341219,40.42042,22.267611,39.2425,20.855335,39.026904,...,19.485828,41.552401,7.411665,755.522602,79.750418,4.039752,38.330834,3.760707,24.988033,24.988033
std,102.524891,7.935988,1.606066,3.979299,2.192974,4.069813,2.006111,3.254576,2.042884,4.341321,...,2.014712,4.151497,5.317409,7.399441,14.901088,2.451221,11.794719,4.194648,14.496634,14.496634
min,10.0,0.0,16.79,27.023333,16.1,20.463333,17.2,28.766667,15.1,27.66,...,14.89,29.166667,-5.0,729.3,24.0,0.0,1.0,-6.6,0.005322,0.005322
25%,50.0,0.0,20.76,37.333333,18.79,37.9,20.79,36.9,19.53,35.53,...,18.0,38.5,3.666667,750.933333,70.333333,2.0,29.0,0.9,12.497889,12.497889
50%,60.0,0.0,21.6,39.656667,20.0,40.5,22.1,38.53,20.666667,38.4,...,19.39,40.9,6.916667,756.1,83.666667,3.666667,40.0,3.433333,24.897653,24.897653
75%,100.0,0.0,22.6,43.066667,21.5,43.26,23.29,41.76,22.1,42.156667,...,20.6,44.338095,10.408333,760.933333,91.666667,5.5,40.0,6.566667,37.583769,37.583769
max,1080.0,70.0,26.26,63.36,29.856667,56.026667,29.236,50.163333,26.2,51.09,...,24.5,53.326667,26.1,772.3,100.0,14.0,66.0,15.5,49.99653,49.99653


In [17]:
# .info() reports the data types, the number of columns and rows, and the memory usage of the data.
energydata.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19735 entries, 0 to 19734
Data columns (total 29 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   date         19735 non-null  object 
 1   Appliances   19735 non-null  int64  
 2   lights       19735 non-null  int64  
 3   T1           19735 non-null  float64
 4   RH_1         19735 non-null  float64
 5   T2           19735 non-null  float64
 6   RH_2         19735 non-null  float64
 7   T3           19735 non-null  float64
 8   RH_3         19735 non-null  float64
 9   T4           19735 non-null  float64
 10  RH_4         19735 non-null  float64
 11  T5           19735 non-null  float64
 12  RH_5         19735 non-null  float64
 13  T6           19735 non-null  float64
 14  RH_6         19735 non-null  float64
 15  T7           19735 non-null  float64
 16  RH_7         19735 non-null  float64
 17  T8           19735 non-null  float64
 18  RH_8         19735 non-null  float64
 19  T9  

### Question 17

### From the dataset, fit a linear model on the relationship between the temperature in the living room in Celsius (x = T2) and the temperature outside the building (y = T6). What is the Root Mean Squared Error to three decimal places?

In [18]:
# Select the predictor variable (x) and the target variable (y)
x = energydata['T2'].values.reshape(-1, 1)
y = energydata['T6']

# Split the data into a training set and a testing set
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)

# Fit a linear regression model
model = LinearRegression()
model.fit(x_train, y_train)

# Make predictions on the test set
y_pred = model.predict(x_test)

# Calculate the mean squared error (MSE)
mse = mean_squared_error(y_test, y_pred)

# Calculate the root mean squared error (RMSE)
rmse = math.sqrt(mse)

# Print the RMSE with three decimal places
print(f'The Root Mean Squared error is: {rmse:.3f}')


The Root Mean Squared error is: 3.633
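To make the two error metrics used in this notebook concrete, here is a small worked example on made-up numbers (the arrays `y_true` and `y_hat` are hypothetical). It computes MAE and RMSE by hand and checks them against scikit-learn.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Hypothetical targets and predictions
y_true = np.array([10.0, 12.0, 15.0, 20.0])
y_hat = np.array([11.0, 11.0, 17.0, 18.0])

errors = y_true - y_hat                    # [-1, 1, -2, 2]
mae = np.mean(np.abs(errors))              # (1 + 1 + 2 + 2) / 4 = 1.5
rmse = np.sqrt(np.mean(errors ** 2))       # sqrt((1 + 1 + 4 + 4) / 4) = sqrt(2.5)

# The hand-computed values agree with scikit-learn
assert np.isclose(mae, mean_absolute_error(y_true, y_hat))
assert np.isclose(rmse, np.sqrt(mean_squared_error(y_true, y_hat)))

print(f"MAE:  {mae:.3f}")
print(f"RMSE: {rmse:.3f}")
```

Because RMSE squares the errors before averaging, it weights large errors more heavily and is never smaller than MAE on the same predictions.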


### Question 18

### Remove the following columns: [“date”, “lights”]. The target variable is “Appliances”. Use a 70-30 train-test set split with a random state of 42 (for reproducibility). Normalize the dataset using the MinMaxScaler (Hint: Use the MinMaxScaler fit_transform and transform methods on the train and test set respectively). Run a multiple linear regression using the training set. Answer the following questions:

### What is the Mean Absolute Error (in three decimal places) for the training set?

In [19]:
# Drop the "date" and "lights" columns
energydata = energydata.drop(columns=["date", "lights"])

# Define the target variable (y) and the features (X)
X = energydata.drop(columns=["Appliances"])
y = energydata["Appliances"]

# Split the data into training and testing sets with a 70-30 split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Initialize and fit the MinMaxScaler on the training data
scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)

# Transform the test data using the same scaler
X_test_scaled = scaler.transform(X_test)

# Initialize and fit the Linear Regression model
model = LinearRegression()
model.fit(X_train_scaled, y_train)

# Make predictions on the training set
y_train_pred = model.predict(X_train_scaled)

# Calculate the Mean Absolute Error (MAE) on the training set
mae_train = mean_absolute_error(y_train, y_train_pred)

# Print the MAE for the training set (rounded to three decimal places)
print(f"The Mean Absolute Error for the training set is: {mae_train:.3f}")


The Mean Absolute Error for the training set is: 53.742
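The hint in Question 18 (fit_transform on the training set, transform on the test set) means the scaler learns its minimum and maximum from the training data only. A tiny sketch with made-up values (the `_toy` names are hypothetical) shows the consequence: test values outside the training range map outside [0, 1].

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical single-feature train and test sets
X_train_toy = np.array([[10.0], [20.0], [30.0]])
X_test_toy = np.array([[15.0], [40.0]])

scaler_toy = MinMaxScaler()
train_scaled = scaler_toy.fit_transform(X_train_toy)  # learns min=10, max=30
test_scaled = scaler_toy.transform(X_test_toy)        # reuses the training min and max

print(train_scaled.ravel())  # (x - 10) / 20 for each training value
print(test_scaled.ravel())   # 15 -> 0.25, 40 -> 1.5
```

Fitting the scaler on the test set as well would let information about the test data leak into preprocessing, which is why the two methods are used separately.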


### Question 19

### What is the Root Mean Squared Error (in three decimal places) for the training set?

In [20]:
# Calculate the Mean Squared Error (MSE) on the training set
mse_train = np.mean((y_train - y_train_pred) ** 2)

# Calculate the RMSE on the training set
rmse_train = np.sqrt(mse_train)

# Print the RMSE for the training set (rounded to three decimal places)
print(f"Root Mean Squared Error for the training set is: {rmse_train:.3f}")

Root Mean Squared Error for the training set is: 95.216


### Question 20

### What is the Mean Absolute Error (in three decimal places) for the test set?

In [21]:
# Make predictions on the test set
y_test_pred = model.predict(X_test_scaled)

# Calculate the Mean Absolute Error (MAE) on the test set
mae_test = mean_absolute_error(y_test, y_test_pred)

# Print the MAE for the test set (rounded to three decimal places)
print(f"Mean Absolute Error for the test set: {mae_test:.3f}")


Mean Absolute Error for the test set: 53.643


### Question 21

### What is the Root Mean Squared Error (in three decimal places) for the test set?

In [22]:
# Calculate the Mean Squared Error (MSE) on the test set
mse_test = np.mean((y_test - y_test_pred) ** 2)

# Calculate the RMSE on the test set
rmse_test = np.sqrt(mse_test)

# Print the RMSE for the test set (rounded to three decimal places)
print(f"Root Mean Squared Error for the test set: {rmse_test:.3f}")

Root Mean Squared Error for the test set: 93.640


### Question 22

### Did the model above overfit to the training set?

In [23]:
# Retrain the multiple linear regression model on the scaled features from Question 18
model = LinearRegression()
model.fit(X_train_scaled, y_train)

# Make predictions on the training set
y_pred_train = model.predict(X_train_scaled)

# Make predictions on the test set
y_pred_test = model.predict(X_test_scaled)

# Calculate the Mean Absolute Error (MAE) for the training set
mae_train = mean_absolute_error(y_train, y_pred_train)

# Calculate the Mean Absolute Error (MAE) for the test set
mae_test = mean_absolute_error(y_test, y_pred_test)

# Calculate the Root Mean Squared Error (RMSE) as the square root of the MSE
# (the squared=False argument is deprecated since scikit-learn 1.4)
rmse_train = np.sqrt(mean_squared_error(y_train, y_pred_train))
rmse_test = np.sqrt(mean_squared_error(y_test, y_pred_test))

# Print the MAE and RMSE for both sets
print(f"MAE for the training set: {mae_train:.3f}")
print(f"MAE for the test set: {mae_test:.3f}")
print(f"RMSE for the training set: {rmse_train:.3f}")
print(f"RMSE for the test set: {rmse_test:.3f}")

MAE for the training set: 53.742
MAE for the test set: 53.643
RMSE for the training set: 95.216
RMSE for the test set: 93.640


The MAE and RMSE values for the training and test sets are very close, which indicates that the model performs consistently on unseen data. The MAE is 53.742 on the training set and 53.643 on the test set; the RMSE is 95.216 on the training set and 93.640 on the test set.

An overfit model would show a much lower error on the training set than on the test set. Here the test errors are slightly lower than the training errors, so there is no sign of overfitting.

No, the model above does not overfit the training set.
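The comparison above can be made explicit by computing the relative gap between test and training error from the values printed in this notebook. The helper `relative_gap` is a hypothetical name introduced here for illustration; how large a gap counts as overfitting is a judgment call, not a fixed rule.

```python
# Training and test metrics as printed above: (train, test)
metrics = {
    "MAE": (53.742, 53.643),
    "RMSE": (95.216, 93.640),
}


def relative_gap(train_error: float, test_error: float) -> float:
    """Return (test - train) / train; a large positive value suggests overfitting."""
    return (test_error - train_error) / train_error


gaps = {name: relative_gap(train, test) for name, (train, test) in metrics.items()}

for name, gap in gaps.items():
    print(f"{name}: relative gap = {gap:+.2%}")
```

Both gaps are small and negative, meaning the test error is marginally lower than the training error.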







### Question 23

### Train a ridge regression model with default parameters. Is there any change to the root mean squared error(RMSE) when evaluated on the test set?

In [24]:
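# Create and train a Ridge regression model with default parameters (alpha=1.0)
# Note: this cell fits on the unscaled features (X_train), not X_train_scaled.
# The Ridge penalty depends on feature scale, so fitting on X_train_scaled may give a different RMSE.
ridge_model = Ridge()
ridge_model.fit(X_train, y_train)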

# Predict the target variable for the test set
y_pred_test_ridge = ridge_model.predict(X_test)

# Calculate the RMSE for the Ridge regression model on the test set
rmse_test_ridge = np.sqrt(mean_squared_error(y_test, y_pred_test_ridge))

# Print the RMSE for both linear regression and Ridge regression on the test set
print(f"Root Mean Squared Error (Linear Regression - Test Set): {rmse_test:.3f}")
print(f"Root Mean Squared Error (Ridge Regression - Test Set): {rmse_test_ridge:.3f}")

Root Mean Squared Error (Linear Regression - Test Set): 93.640
Root Mean Squared Error (Ridge Regression - Test Set): 93.641


Yes, there is a slight change. The Ridge regression RMSE on the test set is 93.641, compared with 93.640 for linear regression, a difference of 0.001.
Because the two values are almost identical, Ridge regularization with default parameters did not meaningfully change the model's performance on the test set. The default regularization strength (alpha = 1.0) appears too weak to have a substantial effect on generalization for this data set. Note that the Ridge model above was fit on the unscaled features (X_train), and the Ridge penalty is sensitive to feature scale.
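Feature scaling affects Ridge differently from plain linear regression: ordinary least squares gives the same fitted values before and after min-max scaling, while the Ridge penalty acts on the coefficient sizes and therefore depends on the units of each feature. The sketch below illustrates this on synthetic data; all names ending in `_toy`, the helper `fitted_values`, and the data itself are assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)
n = 200

# Hypothetical features on very different scales
X_raw = rng.normal(size=(n, 3)) * np.array([1.0, 100.0, 0.01])
y_toy = X_raw @ np.array([2.0, 0.05, 500.0]) + rng.normal(scale=0.5, size=n)

X_scaled_toy = MinMaxScaler().fit_transform(X_raw)


def fitted_values(estimator, X):
    """Fit the estimator on (X, y_toy) and return its predictions on X."""
    return estimator.fit(X, y_toy).predict(X)


ols_raw = fitted_values(LinearRegression(), X_raw)
ols_scaled = fitted_values(LinearRegression(), X_scaled_toy)
ridge_raw = fitted_values(Ridge(), X_raw)
ridge_scaled = fitted_values(Ridge(), X_scaled_toy)

# Largest difference in fitted values between raw and scaled features
ols_diff = np.max(np.abs(ols_raw - ols_scaled))
ridge_diff = np.max(np.abs(ridge_raw - ridge_scaled))

print(f"Linear regression, max difference: {ols_diff:.2e}")
print(f"Ridge regression, max difference:  {ridge_diff:.2e}")
```

For this reason, regularized models are usually compared on the same scaled features.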


### Question 24

### Train a lasso regression model with default value and obtain the new feature weights with it. How many of the features have non-zero feature weights?

In [25]:
# Create and train a Lasso regression model with default parameters
lasso_model = Lasso()
lasso_model.fit(X_train_scaled, y_train)

# Get the feature weights (coefficients)
feature_weights = lasso_model.coef_

# Count the number of features with non-zero weights
num_non_zero_features = np.count_nonzero(feature_weights)

print("Number of features with non-zero weights:", num_non_zero_features)

Number of features with non-zero weights: 4
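Beyond counting the non-zero weights, it is often useful to see which features they belong to. One way is to pair `coef_` with the column names in a pandas Series, sketched here on synthetic data (the feature names `f0` to `f4` and the `_toy` variables are hypothetical, not columns from the energy data set).

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Lasso

rng = np.random.default_rng(1)
n = 500

# Hypothetical data: only f0 and f1 influence the target
X_toy = pd.DataFrame(rng.normal(size=(n, 5)), columns=[f"f{i}" for i in range(5)])
y_toy = 10.0 * X_toy["f0"] - 8.0 * X_toy["f1"] + rng.normal(scale=0.5, size=n)

# Lasso with default parameters (alpha=1.0)
lasso_toy = Lasso().fit(X_toy, y_toy)

# Pair each weight with its feature name and keep the non-zero ones
weights = pd.Series(lasso_toy.coef_, index=X_toy.columns)
selected = weights[weights != 0]

print(selected)
print("Number of non-zero weights:", len(selected))
```

With a DataFrame of features, the same pattern applies by using its column names as the index.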


### Question 25

### What is the new RMSE with the Lasso Regression on the test set?

In [26]:
# Create and train a Lasso regression model with default parameters
lasso_model = Lasso()
lasso_model.fit(X_train_scaled, y_train)

# Make predictions on the test set
y_test_pred_lasso = lasso_model.predict(X_test_scaled)

# Calculate the RMSE for the test set with Lasso regression
rmse_test_lasso = np.sqrt(mean_squared_error(y_test, y_test_pred_lasso))

print("The new Lasso Regression RMSE on the Test Set is: {:.3f}".format(rmse_test_lasso))

The new Lasso Regression RMSE on the Test Set is: 99.424
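The Lasso results depend on the regularization strength `alpha`, which defaults to 1.0. As a rough illustration on synthetic data (the `_toy` names, the true weights, and the alpha grid are assumptions), a larger alpha tends to zero out more weights and increase the training error.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(7)
n = 500

# Hypothetical data with weights of decreasing importance
X_toy = rng.normal(size=(n, 5))
true_weights = np.array([10.0, -8.0, 2.0, 0.5, 0.0])
y_toy = X_toy @ true_weights + rng.normal(scale=0.5, size=n)

results = []
for alpha in (0.01, 1.0, 5.0):
    model_toy = Lasso(alpha=alpha).fit(X_toy, y_toy)
    n_nonzero = int(np.count_nonzero(model_toy.coef_))
    rmse_toy = float(np.sqrt(mean_squared_error(y_toy, model_toy.predict(X_toy))))
    results.append((alpha, n_nonzero, rmse_toy))
    print(f"alpha={alpha:<5} non-zero weights={n_nonzero} training RMSE={rmse_toy:.3f}")
```

Whether a smaller alpha would also lower the test RMSE on the energy data is not something this sketch establishes; it would need to be checked on the actual train and test sets.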
