# <div align="center"> COSC 2673/2793 | Machine Learning </div>
## <div align="center"> Assignment 2 - Joseph Packham (s3838978) and Kylie Nguyen (s3946026) </div>

# Introduction
This report will cover the process of producing a machine learning model that will predict energy usage...

In [None]:
#importing packages
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

In [None]:
#read in CSV file and display first 5 rows
energyUse_df = pd.read_csv('./dataset/UCI-electricity/UCI_data.csv', delimiter=',')
energyUse_df.head()

# Exploratory Data Analysis
To start, the data is investigated through EDA. The dataframe has 19735 rows and 28 columns: one column is the target variable (energy usage in Wh), and the remaining columns are the date and the attributes. According to the description of the data, these attributes cover the temperature and humidity of different rooms in the house, as well as outside, along with a few other weather-related variables such as pressure and wind speed. Two of the variables are listed as "Random Variable". The `.info()` method confirms that there are no null values in the dataset, as every column has the same non-null count as the number of rows.

In [None]:
#check for any null values, using shape to compare
print("Shape of Energy Use dataframe: ", energyUse_df.shape, "\n")

energyUse_df.info()

The `describe()` method returns the count, mean, standard deviation, quartiles, minimum and maximum of each variable. These values show that, although the ranges of the humidity and temperature variables are relatively similar, some ranges differ greatly. For example, Windspeed ranges from 0 to 14, whereas the target energy ranges from 10 to 1110. This suggests that feature scaling should be applied later in the process, as features on very different scales may cause problems for learning algorithms that are sensitive to feature magnitude.

In [None]:
energyUse_df.describe()
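As a rough illustration of what feature scaling does to columns with very different ranges, the sketch below applies min-max scaling, x' = (x - min) / (max - min), to two small arrays. The values are hypothetical and only mimic a narrow-range and a wide-range column; the helper names `fit_min_max` and `apply_min_max` are made up for this example.

```python
import numpy as np

def fit_min_max(train_column):
    """Return the (min, max) of a training column; these are the fitted parameters."""
    return float(np.min(train_column)), float(np.max(train_column))

def apply_min_max(column, col_min, col_max):
    """Rescale a column with x' = (x - min) / (max - min)."""
    return (np.asarray(column, dtype=float) - col_min) / (col_max - col_min)

# Hypothetical values: one narrow-range column and one wide-range column
narrow_train = np.array([0.0, 2.0, 7.0, 14.0])
wide_train = np.array([10.0, 60.0, 560.0, 1110.0])

narrow_scaled = apply_min_max(narrow_train, *fit_min_max(narrow_train))
wide_scaled = apply_min_max(wide_train, *fit_min_max(wide_train))

# After scaling, both columns lie on the same [0, 1] interval
print(narrow_scaled)
print(wide_scaled)
```

The minimum and maximum are taken from the training data only, so that validation or test rows can later be rescaled with the same parameters.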

# Data distribution
In order to observe the distributions of each variable, histograms are plotted for the variables other than date, as the date variable is of type object and cannot be plotted.

In [None]:
#get list of columns other than date
columns = (energyUse_df.columns).difference(['date'])
#plot histogram for all variables other than date
plt.figure(figsize=(20,20))
for i, column in enumerate(columns):
    plt.subplot(6,5,i+1)
    plt.hist(energyUse_df[column], alpha=0.3, color='b', density=True)
    plt.title(column)
    plt.xticks(rotation='vertical')
#apply the layout once, after all subplots have been drawn
plt.tight_layout()
plt.show()

> **Observations:**
> - A number of attributes appear to be skewed, e.g. RH_5, RH_out, T2 etc.
> - The two random variables are very evenly distributed.

In [None]:
#display boxplot for the target, energy usage, variable
plt.boxplot(energyUse_df['TARGET_energy'])
plt.title('Energy Usage')
plt.show()

Upon displaying the boxplot for the target variable, it is observed that there are a number of outliers above the upper limit. These values are dropped to prevent such extreme values from affecting the model. The outliers are dropped using the IQR method. Two points still appear as outliers in the second boxplot even though they lie within the original lower and upper limits, most likely because the boxplot recomputes the quartiles on the filtered data. No other outlier-removal method was attempted due to the restrictions placed by the course.


In [None]:
#get the quantiles and IQR
q1 = energyUse_df['TARGET_energy'].quantile(0.25)
q3 = energyUse_df['TARGET_energy'].quantile(0.75)
IQR = q3-q1

#calculate lower and upper limits
lowerLimit = q1 - (1.5*IQR)
upperLimit = q3 + (1.5*IQR)

#get rid of rows with outliers from the dataframe
#(take a copy so that later column assignments do not act on a slice of the original)
energyUse_df = energyUse_df.loc[(energyUse_df['TARGET_energy'] > lowerLimit) & (energyUse_df['TARGET_energy'] < upperLimit)].copy()

#display boxplot without outliers
plt.boxplot(energyUse_df['TARGET_energy'])
plt.title('Energy Usage')
plt.show()

In [None]:
energyUse_df.shape
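A minimal sketch of why a boxplot can still show outliers after one pass of IQR filtering, using synthetic right-skewed data rather than the energy dataset. The helper `iqr_limits` is a hypothetical name, and the exponential sample is only an assumption about the general shape of a skewed target.

```python
import numpy as np

def iqr_limits(values):
    """Return the (lower, upper) limits q1 - 1.5*IQR and q3 + 1.5*IQR."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Synthetic right-skewed sample (not the real energy data)
rng = np.random.default_rng(0)
sample = rng.exponential(scale=100.0, size=10_000)

# First pass: drop everything outside the limits of the full sample
low1, high1 = iqr_limits(sample)
trimmed = sample[(sample > low1) & (sample < high1)]

# Second pass: the limits are recomputed on the trimmed data and become tighter
low2, high2 = iqr_limits(trimmed)
still_flagged = int(np.sum((trimmed < low2) | (trimmed > high2)))

print("upper limit before:", high1, "after:", high2)
print("points flagged on the trimmed data:", still_flagged)
```

Because the quartiles shift once the largest values are removed, some of the remaining points fall outside the new limits even though they were inside the old ones.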

# Relationship between variables
Using scatterplots, the relationship between the target variable, Energy Usage, and each of the other attributes in the dataframe is explored.

In [None]:
#import seaborn package for plotting scatterplots
import seaborn as sns

#plot scatterplots for all features against target variable
plt.figure(figsize=(20,20))
for i, column in enumerate(columns):
    plt.subplot(6,5, i+1)
    sns.scatterplot(data=energyUse_df, x=column, y='TARGET_energy')
    plt.title(column)
    #rotate the tick labels of every subplot, not only the last one
    plt.xticks(rotation='vertical')

plt.tight_layout()
plt.show()

In [None]:
#get df without date column
energyUse_df_noDate = energyUse_df.drop(columns=['date'])

#plot correlation plot
f, ax = plt.subplots(figsize=(11, 9))
corr = energyUse_df_noDate.corr()
ax = sns.heatmap(
    corr,
    vmin=-1, vmax=1, center=0,
    cmap=sns.diverging_palette(20, 220, n=200),
    square=True
)
ax.set_xticklabels(
    ax.get_xticklabels(),
    rotation=90,
    horizontalalignment='right'
)

> **Observations:**
> - Variables relating to temperature are highly positively correlated with each other, and variables relating to humidity are similarly highly positively correlated with each other.
> - Variables involving temperature generally have either a slight positive or a slight negative correlation with variables involving humidity.
> - RH_6, the humidity outside the building (north side), seems to be quite negatively correlated with the temperature variables.
> - The two random variables do not seem to be correlated with any other variable; they are only highly correlated with themselves and with each other.
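The heatmap gives a visual overview; a numeric ranking of each attribute's correlation with the target can complement it. The sketch below is one possible way to do this, shown on synthetic data. The column names (`temp`, `humidity`, `random_noise`, `target`) and the helper `rank_by_target_correlation` are hypothetical and not taken from the dataset.

```python
import numpy as np
import pandas as pd

def rank_by_target_correlation(df, target):
    """Pearson correlation of every numeric column with `target`, strongest first."""
    corr_with_target = df.select_dtypes("number").corr()[target].drop(target)
    order = corr_with_target.abs().sort_values(ascending=False).index
    return corr_with_target.loc[order]

# Synthetic data with one strong, one moderate (negative) and one unrelated feature
rng = np.random.default_rng(0)
n = 500
temp = rng.normal(20.0, 2.0, n)
demo = pd.DataFrame({
    "temp": temp,
    "humidity": 80.0 - 1.5 * temp + rng.normal(0.0, 3.0, n),
    "random_noise": rng.uniform(0.0, 50.0, n),
    "target": 5.0 * temp + rng.normal(0.0, 5.0, n),
})

ranked = rank_by_target_correlation(demo, "target")
print(ranked)
```

On the real dataframe the same helper could be called on `energyUse_df_noDate` with `'TARGET_energy'` as the target, assuming the date column has already been dropped as in the cell above.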

# Non-Neural Network - Linear Regression

### Data Splitting

In [None]:
from sklearn.model_selection import train_test_split

#split the dataset into 70% train and 15% test and 15% val
with pd.option_context('mode.chained_assignment', None):
    LR_train, LR_test = train_test_split(energyUse_df, test_size=0.3, shuffle=True, random_state = 42)
    LR_test, LR_val = train_test_split(LR_test, test_size=0.5, shuffle=True, random_state = 42)

#Separate the target and the attributes
LR_X_train = LR_train.drop(['TARGET_energy', 'date'], axis=1)
LR_y_train = LR_train['TARGET_energy']

LR_X_test = LR_test.drop(['TARGET_energy', 'date'], axis=1)
LR_y_test = LR_test['TARGET_energy']

LR_X_val = LR_val.drop(['TARGET_energy', 'date'], axis=1)
LR_y_val = LR_val['TARGET_energy']

print("LR_X_train shape: ", LR_X_train.shape)
print("LR_y_train shape: ", LR_y_train.shape)
print("LR_X_test shape: ", LR_X_test.shape)
print("LR_y_test shape: ", LR_y_test.shape)
print("LR_X_val shape: ", LR_X_val.shape)
print("LR_y_val shape: ", LR_y_val.shape)
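The two-stage split above first holds out 30% of the rows and then halves that hold-out set. A small sketch on a plain index array, assuming 1000 rows purely for illustration, shows how this yields a 70/15/15 split with no row shared between the three sets.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 1000 hypothetical row indices standing in for the dataframe rows
indices = np.arange(1000)

# Stage 1: 70% train, 30% held out
train_idx, holdout_idx = train_test_split(
    indices, test_size=0.3, shuffle=True, random_state=42
)
# Stage 2: split the held-out 30% in half, giving 15% test and 15% validation
test_idx, val_idx = train_test_split(
    holdout_idx, test_size=0.5, shuffle=True, random_state=42
)

print(len(train_idx), len(test_idx), len(val_idx))
```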

In [None]:
energyUse_df_X = energyUse_df.drop(['TARGET_energy', 'date'], axis=1)

#plotting histograms of both training and test datasets
plt.figure(figsize=(20,20))
for i, col in enumerate(energyUse_df_X.columns):
    plt.subplot(6,5,i+1)
    plt.hist(LR_X_train[col], alpha=0.3, color='b', density=True)
    plt.hist(LR_X_test[col], alpha=0.3, color='r', density=True)
    plt.title(col)
    plt.xticks(rotation='vertical')
    plt.tight_layout()

### Base Model, Unscaled Data

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
import math

#unscaled
model_us_lr = LinearRegression().fit(LR_X_train, LR_y_train)
LR_y_val_pred_US = model_us_lr.predict(LR_X_val)

r2_us_lr = r2_score(LR_y_val, LR_y_val_pred_US)
print('The R^2 score for the linear regression model (without feature scaling) is: {:.3f}'.format(r2_us_lr))

MSE_us_lr = np.square(np.subtract(LR_y_val,LR_y_val_pred_US)).mean()
RMSE_us_lr = math.sqrt(MSE_us_lr)

print('The RMSE score for the linear regression model (without feature scaling) is: {:.3f}'.format(RMSE_us_lr))
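The same two metrics are computed for every model in this section, so a small helper can keep the calculation in one place. This is only a sketch: `regression_scores` is a hypothetical name, and the arrays below are made-up values. It also shows that always predicting the mean of the evaluated targets gives an R^2 of 0, which is a useful baseline when reading the scores.

```python
import numpy as np

def regression_scores(y_true, y_pred):
    """Return (R^2, RMSE) for a set of predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred
    # RMSE: square root of the mean squared residual
    rmse = float(np.sqrt(np.mean(residuals ** 2)))
    # R^2: 1 - (residual sum of squares / total sum of squares)
    ss_res = float(np.sum(residuals ** 2))
    ss_tot = float(np.sum((y_true - y_true.mean()) ** 2))
    r2 = 1.0 - ss_res / ss_tot
    return r2, rmse

# Made-up target values for illustration
y_true = np.array([50.0, 60.0, 80.0, 110.0])

# Baseline: always predict the mean of the targets
baseline_pred = np.full_like(y_true, y_true.mean())
r2_baseline, rmse_baseline = regression_scores(y_true, baseline_pred)

# Perfect predictions
r2_perfect, rmse_perfect = regression_scores(y_true, y_true)

print(r2_baseline, rmse_baseline)
print(r2_perfect, rmse_perfect)
```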

In [None]:
#predicting using linear model and plotting predicted vs actual values

fig, energyUse_LinearRegression = plt.subplots()
#cmap is omitted because no colour values (c) are passed, so it would be ignored
energyUse_LinearRegression.scatter(LR_y_val, LR_y_val_pred_US, s=25, zorder=10)

lims = [
    np.min([energyUse_LinearRegression.get_xlim(), energyUse_LinearRegression.get_ylim()]),
    np.max([energyUse_LinearRegression.get_xlim(), energyUse_LinearRegression.get_ylim()]),
]

energyUse_LinearRegression.plot(lims, lims, 'k--', alpha=0.75, zorder=0)
energyUse_LinearRegression.plot(lims, [np.mean(LR_y_train),]*2, 'r--', alpha=0.75, zorder=0)
energyUse_LinearRegression.set_aspect('equal')
energyUse_LinearRegression.set_xlim(lims)
energyUse_LinearRegression.set_ylim(lims)

plt.xlabel('Actual Energy Use')
plt.ylabel('Predicted Energy Use')

plt.show()

In [None]:
#plot residuals for unscaled
fig, ax = plt.subplots()
ax.scatter(LR_y_val, LR_y_val-LR_y_val_pred_US, s=25, zorder=10)

xlims = ax.get_xlim()
ax.plot(xlims, [0.0,]*2, 'k--', alpha=0.75, zorder=0)
ax.set_xlim(xlims)

plt.xlabel('Actual Energy Use')
plt.ylabel('Residual')

plt.show()

### Model with MinMaxScaling and Power Transforming

In [None]:
#scaling all features, normalising skewed features
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import PowerTransformer

logNorm_attributes = ['RH_1', 'T2', 'T3', 'RH_3', 'RH_4', 'T5', 'RH_5', 'T6', 'RH_6', 'T7', 'RH_7', 'RH_8', 'T9', 'RH_9', 'T_out', 'Press_mm_hg', 'RH_out', 'Windspeed', 'Visibility']
minmax_attributes = list(set(energyUse_df_X.columns).difference(set(logNorm_attributes)))

LR_X_train_scaled = LR_X_train.copy()
LR_X_val_scaled = LR_X_val.copy()

minmaxscaler = MinMaxScaler().fit(LR_X_train_scaled.loc[:, minmax_attributes])
LR_X_train_scaled.loc[:, minmax_attributes] = minmaxscaler.transform(LR_X_train_scaled.loc[:, minmax_attributes])
LR_X_val_scaled.loc[:, minmax_attributes] = minmaxscaler.transform(LR_X_val_scaled.loc[:, minmax_attributes])

powertransformer = PowerTransformer(method='yeo-johnson', standardize=False).fit(LR_X_train.loc[:, logNorm_attributes])
LR_X_train_scaled.loc[:, logNorm_attributes] = powertransformer.transform(LR_X_train.loc[:, logNorm_attributes])
LR_X_val_scaled.loc[:, logNorm_attributes] = powertransformer.transform(LR_X_val.loc[:, logNorm_attributes])

minmaxscaler_pt = MinMaxScaler().fit(LR_X_train_scaled.loc[:, logNorm_attributes])
LR_X_train_scaled.loc[:, logNorm_attributes] = minmaxscaler_pt.transform(LR_X_train_scaled.loc[:, logNorm_attributes])
LR_X_val_scaled.loc[:, logNorm_attributes] = minmaxscaler_pt.transform(LR_X_val_scaled.loc[:, logNorm_attributes])
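The three fitting steps above (min-max scaling for the evenly distributed columns, then a Yeo-Johnson transform followed by min-max scaling for the skewed ones) could also be expressed with scikit-learn's `ColumnTransformer` and `Pipeline`. The sketch below is an alternative arrangement, not the approach used in this notebook; the column names and the helper `build_preprocessor` are hypothetical, and the data is synthetic.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, PowerTransformer

def build_preprocessor(skewed_columns, other_columns):
    """Power-transform then min-max scale skewed columns; min-max scale the rest."""
    skewed_steps = Pipeline(steps=[
        ("power", PowerTransformer(method="yeo-johnson", standardize=False)),
        ("minmax", MinMaxScaler()),
    ])
    return ColumnTransformer(transformers=[
        ("skewed", skewed_steps, skewed_columns),
        ("other", MinMaxScaler(), other_columns),
    ])

# Synthetic stand-ins for a skewed and an evenly distributed attribute
rng = np.random.default_rng(0)
train = pd.DataFrame({
    "skewed_feature": rng.lognormal(mean=0.0, sigma=1.0, size=200),
    "even_feature": rng.uniform(0.0, 50.0, size=200),
})
val = pd.DataFrame({
    "skewed_feature": rng.lognormal(mean=0.0, sigma=1.0, size=50),
    "even_feature": rng.uniform(0.0, 50.0, size=50),
})

preprocessor = build_preprocessor(["skewed_feature"], ["even_feature"])
# Fit on the training data only, then reuse the fitted parameters for validation
train_scaled = preprocessor.fit_transform(train)
val_scaled = preprocessor.transform(val)

print(train_scaled.min(axis=0), train_scaled.max(axis=0))
```

Note that the output is a NumPy array whose columns follow the order of the transformers (skewed columns first), not the original dataframe order.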

In [None]:
#plot all histograms after scaling and normalisation
plt.figure(figsize=(20,20))
for i, col in enumerate(LR_X_train_scaled.columns):
    plt.subplot(6,5,i+1)
    plt.hist(LR_X_train_scaled[col], alpha=0.3, color='b', density=True)
    plt.hist(LR_X_val_scaled[col], alpha=0.3, color='r', density=True)
    plt.title(col)
    plt.xticks(rotation='vertical')
    plt.tight_layout()

In [None]:
#fitting a linear regression model
model_scaled_lr = LinearRegression().fit(LR_X_train_scaled, LR_y_train)

#predicting using linear model and plotting predicted vs actual values
LR_y_val_pred_scaled = model_scaled_lr.predict(LR_X_val_scaled)

fig, energyUse_LinearRegression = plt.subplots()
energyUse_LinearRegression.scatter(LR_y_val, LR_y_val_pred_scaled, s=25, zorder=10)

lims = [
    np.min([energyUse_LinearRegression.get_xlim(), energyUse_LinearRegression.get_ylim()]),
    np.max([energyUse_LinearRegression.get_xlim(), energyUse_LinearRegression.get_ylim()]),
]

energyUse_LinearRegression.plot(lims, lims, 'k--', alpha=0.75, zorder=0)
energyUse_LinearRegression.plot(lims, [np.mean(LR_y_train),]*2, 'r--', alpha=0.75, zorder=0)
energyUse_LinearRegression.set_aspect('equal')
energyUse_LinearRegression.set_xlim(lims)
energyUse_LinearRegression.set_ylim(lims)

plt.xlabel('Actual Energy Use')
plt.ylabel('Predicted Energy Use')

plt.show()

In [None]:
#plot residuals for scaled
fig, ax = plt.subplots()
ax.scatter(LR_y_val, LR_y_val-LR_y_val_pred_scaled, s=25, zorder=10)

xlims = ax.get_xlim()
ax.plot(xlims, [0.0,]*2, 'k--', alpha=0.75, zorder=0)
ax.set_xlim(xlims)

plt.xlabel('Actual Energy Use')
plt.ylabel('Residual')

plt.show()

In [None]:
#scaled
r2_lr_scaled = r2_score(LR_y_val, LR_y_val_pred_scaled)

print('The R^2 score for the linear regression model (with feature scaling) is: {:.3f}'.format(r2_lr_scaled))

MSE_lr_scaled = np.square(np.subtract(LR_y_val,LR_y_val_pred_scaled)).mean()
RMSE_lr_scaled = math.sqrt(MSE_lr_scaled)

print('The RMSE score for the linear regression model (with feature scaling) is: {:.3f}'.format(RMSE_lr_scaled))

### Day of Week Column + Scaled & Transformed data

In [None]:
#trying to use date to see if that makes model perform better
energyUse_df['date'] = pd.to_datetime(energyUse_df['date'], format="%Y-%m-%d %H:%M:%S")
energyUse_df['day_of_week'] = energyUse_df['date'].dt.dayofweek

#split the dataset into 70% train and 15% test and 15% val
with pd.option_context('mode.chained_assignment', None):
    LR_train, LR_test = train_test_split(energyUse_df, test_size=0.3, shuffle=True, random_state = 42)
    LR_test, LR_val = train_test_split(LR_test, test_size=0.5, shuffle=True, random_state = 42)

#Separate the target and the attributes
LR_X_train = LR_train.drop(['TARGET_energy', 'date'], axis=1)
LR_y_train = LR_train['TARGET_energy']

LR_X_test = LR_test.drop(['TARGET_energy', 'date'], axis=1)
LR_y_test = LR_test['TARGET_energy']

LR_X_val = LR_val.drop(['TARGET_energy', 'date'], axis=1)
LR_y_val = LR_val['TARGET_energy']

print("LR_X_train shape: ", LR_X_train.shape)
print("LR_y_train shape: ", LR_y_train.shape)
print("LR_X_test shape: ", LR_X_test.shape)
print("LR_y_test shape: ", LR_y_test.shape)
print("LR_X_val shape: ", LR_X_val.shape)
print("LR_y_val shape: ", LR_y_val.shape)
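The `day_of_week` column holds integers from 0 (Monday) to 6 (Sunday), so a linear model treats it as a single numeric feature with one slope across the week. One alternative, not used in this notebook, is to one-hot encode it so that each day gets its own coefficient. The sketch below uses a small made-up date range to show the idea.

```python
import pandas as pd

# Two hypothetical weeks of daily timestamps
dates = pd.date_range("2016-01-11", periods=14, freq="D")
demo = pd.DataFrame({"date": dates})
demo["day_of_week"] = demo["date"].dt.dayofweek

# Replace the integer column with one 0/1 indicator column per day
encoded = pd.get_dummies(demo, columns=["day_of_week"], prefix="dow", dtype=int)
dummy_columns = [c for c in encoded.columns if c.startswith("dow_")]

print(encoded[dummy_columns].head(7))
```

With an intercept in the model, one of the indicator columns is redundant; passing `drop_first=True` to `pd.get_dummies` is one way to leave it out.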


In [None]:
#scaling all features, normalising skewed features
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import PowerTransformer

logNorm_attributes = ['RH_1', 'T2', 'T3', 'RH_3', 'RH_4', 'T5', 'RH_5', 'T6', 'RH_6', 'T7', 'RH_7', 'RH_8', 'T9', 'RH_9', 'T_out', 'Press_mm_hg', 'RH_out', 'Windspeed', 'Visibility']
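#use the columns of the new training set so that day_of_week is also scaled
#(energyUse_df_X was created before the day_of_week column was added)
minmax_attributes = list(set(LR_X_train.columns).difference(set(logNorm_attributes)))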

LR_X_train_scaled = LR_X_train.copy()
LR_X_val_scaled = LR_X_val.copy()

minmaxscaler = MinMaxScaler().fit(LR_X_train_scaled.loc[:, minmax_attributes])
LR_X_train_scaled.loc[:, minmax_attributes] = minmaxscaler.transform(LR_X_train_scaled.loc[:, minmax_attributes])
LR_X_val_scaled.loc[:, minmax_attributes] = minmaxscaler.transform(LR_X_val_scaled.loc[:, minmax_attributes])

powertransformer = PowerTransformer(method='yeo-johnson', standardize=False).fit(LR_X_train.loc[:, logNorm_attributes])
LR_X_train_scaled.loc[:, logNorm_attributes] = powertransformer.transform(LR_X_train.loc[:, logNorm_attributes])
LR_X_val_scaled.loc[:, logNorm_attributes] = powertransformer.transform(LR_X_val.loc[:, logNorm_attributes])

minmaxscaler_pt = MinMaxScaler().fit(LR_X_train_scaled.loc[:, logNorm_attributes])
LR_X_train_scaled.loc[:, logNorm_attributes] = minmaxscaler_pt.transform(LR_X_train_scaled.loc[:, logNorm_attributes])
LR_X_val_scaled.loc[:, logNorm_attributes] = minmaxscaler_pt.transform(LR_X_val_scaled.loc[:, logNorm_attributes])

In [None]:
#fitting a linear regression model
model_scaled_lr_wDayOfWeek = LinearRegression().fit(LR_X_train_scaled, LR_y_train)

#predicting using linear model and plotting predicted vs actual values
LR_y_val_pred_dayOfWeek = model_scaled_lr_wDayOfWeek.predict(LR_X_val_scaled)

fig, energyUse_wDayOfWeek_LinearRegression = plt.subplots()
energyUse_wDayOfWeek_LinearRegression.scatter(LR_y_val, LR_y_val_pred_dayOfWeek, s=25, zorder=10)

lims = [
    np.min([energyUse_wDayOfWeek_LinearRegression.get_xlim(), energyUse_wDayOfWeek_LinearRegression.get_ylim()]),
    np.max([energyUse_wDayOfWeek_LinearRegression.get_xlim(), energyUse_wDayOfWeek_LinearRegression.get_ylim()]),
]

energyUse_wDayOfWeek_LinearRegression.plot(lims, lims, 'k--', alpha=0.75, zorder=0)
energyUse_wDayOfWeek_LinearRegression.plot(lims, [np.mean(LR_y_train),]*2, 'r--', alpha=0.75, zorder=0)
energyUse_wDayOfWeek_LinearRegression.set_aspect('equal')
energyUse_wDayOfWeek_LinearRegression.set_xlim(lims)
energyUse_wDayOfWeek_LinearRegression.set_ylim(lims)

plt.xlabel('Actual Energy Use')
plt.ylabel('Predicted Energy Use')

plt.show()

In [None]:
fig, ax = plt.subplots()
ax.scatter(LR_y_val, LR_y_val-LR_y_val_pred_dayOfWeek, s=25, zorder=10)

xlims = ax.get_xlim()
ax.plot(xlims, [0.0,]*2, 'k--', alpha=0.75, zorder=0)
ax.set_xlim(xlims)

plt.xlabel('Actual Energy Use')
plt.ylabel('Residual')

plt.show()

In [None]:
#scaled + dayOfWeek
r2_lr = r2_score(LR_y_val, LR_y_val_pred_dayOfWeek)

print('The R^2 score for the linear regression model (with feature scaling + dayOfWeek) is: {:.3f}'.format(r2_lr))

MSE_lr = np.square(np.subtract(LR_y_val,LR_y_val_pred_dayOfWeek)).mean()
RMSE_lr = math.sqrt(MSE_lr)

print('The RMSE score for the linear regression model (with feature scaling + dayOfWeek) is: {:.3f}'.format(RMSE_lr))

### Day of Week Column + Unscaled & Untransformed data

In [None]:
#fitting a linear regression model
model_us_lr_wDayOfWeek = LinearRegression().fit(LR_X_train, LR_y_train)

#predicting using linear model and plotting predicted vs actual values
LR_y_val_pred_dayOfWeek_us = model_us_lr_wDayOfWeek.predict(LR_X_val)

fig, energyUse_wDayOfWeek_LinearRegression = plt.subplots()
energyUse_wDayOfWeek_LinearRegression.scatter(LR_y_val, LR_y_val_pred_dayOfWeek_us, s=25, zorder=10)

lims = [
    np.min([energyUse_wDayOfWeek_LinearRegression.get_xlim(), energyUse_wDayOfWeek_LinearRegression.get_ylim()]),
    np.max([energyUse_wDayOfWeek_LinearRegression.get_xlim(), energyUse_wDayOfWeek_LinearRegression.get_ylim()]),
]

energyUse_wDayOfWeek_LinearRegression.plot(lims, lims, 'k--', alpha=0.75, zorder=0)
energyUse_wDayOfWeek_LinearRegression.plot(lims, [np.mean(LR_y_train),]*2, 'r--', alpha=0.75, zorder=0)
energyUse_wDayOfWeek_LinearRegression.set_aspect('equal')
energyUse_wDayOfWeek_LinearRegression.set_xlim(lims)
energyUse_wDayOfWeek_LinearRegression.set_ylim(lims)

plt.xlabel('Actual Energy Use')
plt.ylabel('Predicted Energy Use')

plt.show()

In [None]:
fig, ax = plt.subplots()
#plot residuals for unscaled + dayOfWeek, using the unscaled model's predictions
ax.scatter(LR_y_val, LR_y_val-LR_y_val_pred_dayOfWeek_us, s=25, zorder=10)

xlims = ax.get_xlim()
ax.plot(xlims, [0.0,]*2, 'k--', alpha=0.75, zorder=0)
ax.set_xlim(xlims)

plt.xlabel('Actual Energy Use')
plt.ylabel('Residual')

plt.show()

In [None]:
#unscaled + dayOfWeek
r2_lr = r2_score(LR_y_val, LR_y_val_pred_dayOfWeek_us)

print('The R^2 score for the linear regression model (unscaled + dayOfWeek) is: {:.3f}'.format(r2_lr))

MSE_lr = np.square(np.subtract(LR_y_val,LR_y_val_pred_dayOfWeek_us)).mean()
RMSE_lr = math.sqrt(MSE_lr)

print('The RMSE score for the linear regression model (unscaled + dayOfWeek) is: {:.3f}'.format(RMSE_lr))

In [None]:
#evaluate the unscaled + dayOfWeek model on the held-out test set
LR_y_test_pred_dayOfWeek_us = model_us_lr_wDayOfWeek.predict(LR_X_test)
r2_lr = r2_score(LR_y_test, LR_y_test_pred_dayOfWeek_us)

print('The test-set R^2 score for the linear regression model (unscaled + dayOfWeek) is: {:.3f}'.format(r2_lr))

MSE_lr = np.square(np.subtract(LR_y_test, LR_y_test_pred_dayOfWeek_us)).mean()
RMSE_lr = math.sqrt(MSE_lr)

print('The test-set RMSE score for the linear regression model (unscaled + dayOfWeek) is: {:.3f}'.format(RMSE_lr))