OpenClassrooms
Project 4, Data Scientist
Author : Oumeima EL GHARBI
Date : August 2022

# PART 2 : Exploration of the dataset and Feature Engineering

### Introduction

#### Idea :

- The variable that we want to predict is : "TotalGHGEmissions"

1) Data visualization of "TotalGHGEmissions" against categorical features (for instance "BuildingType")
    Feature Engineering : delete features that do not help the prediction.
2) Study the distribution of quantitative features (energy variables, etc.)
    Feature Engineering :
    - delete features that won't be useful for the prediction of "TotalGHGEmissions"
    - apply transformations where needed (log, normalization, etc.), depending on skewness
3) Correlation Matrix :
    Feature Engineering : delete features that are correlated (repeated features such as kWh and kBtu).

#### **Conclusion** of the exploration and feature engineering :

In [1]:
print("Source Energy Accounts for Total Energy Use")
features_to_predict = [
    "SiteEnergyUse(kBtu)",
    "SteamUse(kBtu)",
    "Electricity(kBtu)",
    "NaturalGas(kBtu)",
    "TotalGHGEmissions"]

# to predict the first four: use the same group of variables (the building characteristics)

print("Features to predict : ", features_to_predict)

Source Energy Accounts for Total Energy Use
Features to predict :  ['SiteEnergyUse(kBtu)', 'SteamUse(kBtu)', 'Electricity(kBtu)', 'NaturalGas(kBtu)', 'TotalGHGEmissions']


### Starting exploration

#### Importing libraries

In [2]:
%reset -f

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline
%autosave 300


In [None]:
from functions import *

#### Loading dataset
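Declaring the categorical columns at read time keeps pandas from storing them as generic `object` strings. A minimal sketch on a made-up three-row CSV (the rows are illustrative and not taken from the Seattle dataset):

```python
import io

import pandas as pd

# Hypothetical miniature CSV, only for illustration.
csv_text = """BuildingType,ZipCode,TotalGHGEmissions
NonResidential,98101,100.0
NonResidential,98101,250.5
Campus,98104,2000.0
"""

category_types = {"BuildingType": "category", "ZipCode": "category"}
sample = pd.read_csv(io.StringIO(csv_text), dtype=category_types)

# Categorical columns get the 'category' dtype, the rest is inferred.
print(sample.dtypes)
print(sample["BuildingType"].cat.categories.tolist())
```

Columns that are not listed in the dictionary, such as the emissions column here, keep their inferred numeric type.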

In [None]:
columns_to_categorize = ["BuildingType", "PrimaryPropertyType", "Neighborhood", "ZipCode", "CouncilDistrictCode",
                         "LargestPropertyUseType", "SecondLargestPropertyUseType", "ThirdLargestPropertyUseType"]

category_types = {column: 'category' for column in columns_to_categorize}
print("This dictionary will be used when reading the csv file to assign a type to categorical features :",
      category_types)

In [None]:
path = "./dataset/cleaned/"
filename = "2016_Building_Energy_Cleaned.csv"

dataset_path = "{}{}".format(path, filename)
# we assign a categorical type to the categorical features
dataset = pd.read_csv(dataset_path, dtype=category_types)

In [None]:
dataset.shape

In [None]:
dataset[:10]

In [None]:
dataset.dtypes

In [None]:
pd.set_option('display.float_format', lambda x: '%.0f' % x)
pd.set_option('display.max_columns', None)

dataset.describe()

### 1) Plot kWh / kBtu and therms / kBtu

In [None]:
print("Natural Gas")
dataset.plot.scatter("NaturalGas(therms)", "NaturalGas(kBtu)", c="g")

In [None]:
# Easy Linear Regression : x = kWh y = kBtu
print("Electricity")
plt.plot(dataset["Electricity(kWh)"], dataset["Electricity(kBtu)"], "ro", markersize=4)
plt.show()

In [None]:
print(
    "After checking that Electricity(kBtu) = 3.412 x Electricity(kWh) and that NaturalGas(kBtu) is about 100 x NaturalGas(therms), we drop the kWh and therms columns.")
dataset_v1 = dataset.drop(columns=["Electricity(kWh)", "NaturalGas(therms)"])

We tried to implement linear regression to verify that :
- the Electricity in kWh x 3.412 = Electricity in kBtu,
- the NaturalGas in Therms x 99.98 = NaturalGas in kBtu.
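
A lighter check than a regression is to look at the row-wise ratio of the two columns: if one is a pure unit conversion of the other, the ratio is constant. A sketch on synthetic values (the numbers are made up; on the real data the same lines would be applied to `dataset`):

```python
import numpy as np
import pandas as pd

KBTU_PER_KWH = 3.412  # approximate conversion factor quoted above

# Synthetic consumption values, for illustration only.
demo = pd.DataFrame({"Electricity(kWh)": [0.0, 1_000.0, 25_000.0, 400_000.0]})
demo["Electricity(kBtu)"] = demo["Electricity(kWh)"] * KBTU_PER_KWH

# Ignore rows with zero consumption to avoid dividing by zero.
nonzero = demo[demo["Electricity(kWh)"] > 0]
ratio = nonzero["Electricity(kBtu)"] / nonzero["Electricity(kWh)"]

print(ratio.describe())
is_constant = np.allclose(ratio, KBTU_PER_KWH)
print("Constant ratio:", is_constant)
```

A near-zero standard deviation in `ratio.describe()` would indicate that the two columns carry the same information.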


#### Manual Linear Regression
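The closed-form estimate computed in this section is the normal equation, $\theta = (X^T X)^{-1} X^T Y$. Inverting $X^T X$ explicitly can be numerically fragile when feature values are large or nearly collinear; `np.linalg.lstsq` solves the same least-squares problem without forming the inverse. A sketch on synthetic data (the values are generated, not read from the dataset):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for Electricity(kWh) and its kBtu conversion.
kwh = rng.uniform(1_000, 1_000_000, size=200)
kbtu = 3.412 * kwh

# Design matrix with a column of ones for the intercept.
X = np.column_stack([np.ones_like(kwh), kwh])

# Solve min ||X @ theta - kbtu||^2 without an explicit matrix inverse.
theta, residuals, rank, _ = np.linalg.lstsq(X, kbtu, rcond=None)
intercept, slope = theta

print(f"intercept = {intercept:.4f}, slope = {slope:.4f}")
```

As in the cells of this section, the first entry of `theta` is the y-intercept and the second is the slope.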

In [None]:
# Linear Regression "manually" / we transform the dataset into matrices to compute theta
X = np.matrix([np.ones(dataset.shape[0]), dataset["Electricity(kWh)"]]).T
Y = np.matrix(dataset["Electricity(kBtu)"]).T

In [None]:
np.set_printoptions(suppress=True)  # remove scientific notation
X[:10]

In [None]:
# Computing the exact value of the parameter theta
theta = np.linalg.inv(X.T.dot(X)).dot(X.T).dot(Y)

# displaying theta : first entry b = y-intercept, second entry a = slope
print(theta)

# we test Y = aX + b for X = 10
print(theta.item(0) + theta.item(1) * 10)

In [None]:
plt.xlabel("Electricity(kWh)")
plt.ylabel("Electricity(kBtu)")

x_min = 0
x_max = 200000000
y_min = theta.item(0)
y_max = theta.item(0) + x_max * theta.item(1)

plt.plot([x_min, x_max], [y_min, y_max], linestyle="--", c="#000000")
plt.plot(dataset["Electricity(kWh)"], dataset["Electricity(kBtu)"], "ro", markersize=4)

plt.show()

#### Automatic Linear Regression

In [None]:
# 0) Getting data and Sampling
from sklearn.linear_model import LinearRegression

X = dataset["Electricity(kWh)"]
y = dataset["Electricity(kBtu)"]  # target

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8)

# reshape to get 2D array instead of 1D array.
# values to get a numpy array instead of a pandas Series
X_train = X_train.values.reshape(-1, 1)
X_test = X_test.values.reshape(-1, 1)
y_train = y_train.values.reshape(-1, 1)
y_test = y_test.values.reshape(-1, 1)

In [None]:
X_train

In [None]:
# 1) Training Linear Regression and Evaluating
reg = LinearRegression().fit(X_train, y_train)
prediction_score = reg.score(X_test, y_test)
# score() returns the coefficient of determination R², not an accuracy
print('R² score is : {:.2%}'.format(prediction_score))

# 2) Trying to predict a value
reg.predict([[10]])  # if X = 10 kWh then Y = 34.12 kBtu ;)

In [None]:
print("Electricity prediction")
plt.plot(reg.predict(X_test), y_test, "ro", markersize=4)
plt.show()

#### Conclusion :
This linear regression was a training exercise.
We remove the redundant features: Electricity in kWh and NaturalGas in therms.

The correlation matrix below will help us choose the features that are useful for our prediction model.

for column in dataset_v1.select_dtypes(['int64', 'float64']).columns:
    plt.figure(figsize=(12, 6))
    plt.title('Distribution of ' + column)
    sns.histplot(dataset_v1[column], bins=20)
    plt.show()

### 2) Correlation Matrix

#### 2.1) Correlation between energy variables

In [None]:
all_energy_features = ["SiteEUI(kBtu/sf)", "SiteEUIWN(kBtu/sf)", "SourceEUI(kBtu/sf)", "SourceEUIWN(kBtu/sf)",
                       "SiteEnergyUse(kBtu)", "SiteEnergyUseWN(kBtu)", "SteamUse(kBtu)", "Electricity(kBtu)",
                       "NaturalGas(kBtu)", "TotalGHGEmissions", "GHGEmissionsIntensity"]

In [None]:
# exploration : correlation matrix, and correlation between categorical and quantitative variables (GHG emissions are correlated with the energy consumption variables: electricity, steam, etc.)
# prediction : predict the building's energy consumption, then use it to predict the CO2 emissions
# 1) predict energy consumption, then predict CO2 (the longest and most complex approach)
# building categories are misspelled / numeric values are inconsistent

In [None]:
# we create a dataframe with all the energy features.
df_to_corr = dataset_v1[all_energy_features]

# we assign the type float to all the values of the matrix
df_to_corr = df_to_corr.astype(float)
corr_df = df_to_corr.corr(method='spearman')

print("We first display the unmasked correlation matrix to justify the masked display below.")
plt.figure(figsize=(10, 8))
plt.title('Correlation matrix for energy features.')
sns.heatmap(corr_df, annot=True, vmin=-1, cmap='coolwarm')
plt.show()

In [None]:
plt.figure(figsize=(10, 8))
sns.set(font_scale=1)
plt.title('Correlation matrix for energy features with upper triangle masked.')

# to hide the upper triangle of the matrix
trimask = np.zeros_like(corr_df, dtype=bool)  # boolean mask, so it can be combined with | below
trimask[np.triu_indices_from(trimask)] = True
sns.heatmap(corr_df, annot=True, mask=trimask | (np.abs(corr_df) <= 0.4), vmin=0,
            cmap='coolwarm')  # we don't have negative correlations here

plt.show()

**Interpretation** :

- The energy variables are strongly correlated with one another (cells with an absolute correlation of 0.4 or less are hidden in the heatmap above).


**Conclusion** :
- We will drop the weather-normalized (WN) variables,
- We drop the variable GHGEmissionsIntensity.
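
Reading strongly correlated pairs off a heatmap becomes tedious as the number of features grows. One possible way to list them programmatically is sketched below on synthetic columns (the names `a`, `b`, `c` and the 0.9 threshold are assumptions made for this example, not values used in the notebook):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
base = rng.lognormal(mean=10, sigma=1, size=300)

# Synthetic columns: "b" is nearly a copy of "a", "c" is independent.
demo = pd.DataFrame({
    "a": base,
    "b": base * rng.normal(1.0, 0.01, size=300),
    "c": rng.lognormal(mean=10, sigma=1, size=300),
})

corr = demo.corr(method="spearman")

# Keep the strict upper triangle so each pair appears once, then flatten.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = upper.stack()

THRESHOLD = 0.9  # arbitrary cut-off chosen for this sketch
strong_pairs = pairs[pairs.abs() > THRESHOLD].sort_values(ascending=False)
print(strong_pairs)
```

Each pair that survives the filter is a candidate for dropping one of its two members.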

#### 2.2) Correlation matrix with all features
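A correlation matrix is symmetric, so half of a heatmap repeats the other half. The sketch below builds a boolean mask for the upper triangle and optionally combines it with a threshold on weak correlations; the small matrix is synthetic and the 0.4 threshold only mirrors the one used in section 2.1.

```python
import numpy as np
import pandas as pd

# Small synthetic correlation matrix, for illustration only.
corr = pd.DataFrame(
    [[1.0, 0.8, 0.1],
     [0.8, 1.0, 0.3],
     [0.1, 0.3, 1.0]],
    index=list("abc"), columns=list("abc"),
)

# True marks a cell to hide: the diagonal and everything above it.
trimask = np.triu(np.ones_like(corr, dtype=bool))

# Optionally also hide weak correlations.
full_mask = trimask | (corr.abs() <= 0.4).to_numpy()

print(full_mask)
# The mask would then be passed as sns.heatmap(corr, mask=full_mask, ...).
```

Using a boolean array from the start matters: the `|` operator is only defined between boolean (or integer) arrays, not floats.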

In [None]:
numeric_columns = dataset_v1.select_dtypes(include=['int64', 'float64']).columns
corr = dataset_v1[numeric_columns].corr()

plt.figure(figsize=(13, 11))
sns.set(font_scale=1)
plt.title('Correlation matrix for all numeric features.')

# to hide the upper triangle of the matrix
trimask = np.zeros_like(corr, dtype=bool)
trimask[np.triu_indices_from(trimask)] = True
sns.heatmap(corr, annot=True, mask=trimask, vmin=0, cmap='coolwarm')  # we don't have negative correlations here

plt.show()

dataset_v1.columns

In [None]:
print("ASK MENTOR INTERPRETATION CORR MATRIX")
# building characteristics are the interesting features for the prediction / remove features that are too correlated with each other

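**Interpretation** :

- Several energy variables are strongly correlated with one another, which makes them redundant for the prediction.


**Conclusion** :
- We drop the weather-normalized (WN) variables,
- We drop the energy use intensity (EUI) variables,
- We drop the variable GHGEmissionsIntensity.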

In [None]:
correlated_features = ["SiteEUI(kBtu/sf)", 'SiteEUIWN(kBtu/sf)', 'SiteEnergyUseWN(kBtu)', 'SourceEUI(kBtu/sf)',
                       'SourceEUIWN(kBtu/sf)', "GHGEmissionsIntensity"]

In [None]:
dataset_v2 = dataset_v1.drop(columns=correlated_features, inplace=False)

### 3) Distribution of quantitative variables

#### 3.1) Boxplot per primary property type

In [None]:
dataset_v2.dtypes

In [None]:
from functions import *

print("After removing outliers, we display a boxplot per energy feature and per Primary Property Type.")
print(
    "We suppose that a K-12 School has different energy needs than a hospital, so the boxplots should reflect that difference.")

data_plot = dataset_v2.copy()
# -1 marks a missing score : blank out only that column, not the whole row
data_plot.loc[data_plot["ENERGYSTARScore"] == -1, "ENERGYSTARScore"] = np.nan

list_features_to_plot = data_plot.select_dtypes(include="number").columns.tolist()
list_features_to_plot.remove("OSEBuildingID")

In [None]:
display_boxplot_per_feature(data_plot, list_features_to_plot, "PrimaryPropertyType")

#### 3.2) Histograms

In [None]:
display_distribution_per_feature(data_plot, list_features_to_plot, 10)

In [None]:
dataset_v2.columns

#### 3.3) Distribution

In [None]:
dataset_v2.dtypes

In [None]:
features_for_prediction = ["YearBuilt", "CouncilDistrictCode", "Neighborhood", "NumberofBuildings", "NumberofFloors",
                           "BuildingType", "PrimaryPropertyType", "LargestPropertyUseType",
                           "SecondLargestPropertyUseType", "ThirdLargestPropertyUseType", "PropertyGFABuilding(s)",
                           "LargestPropertyUseTypeGFA", "SecondLargestPropertyUseTypeGFA",
                           "ThirdLargestPropertyUseTypeGFA"]

densite(data_plot[list_features_to_plot])

### 4) Logarithmic transformation of the variables to predict

#### 4.1) Log graphs

In [None]:
print("We check the effect of a log transformation on the variable we want to predict.")
all_log_transformations = compute_log_for_feature(dataset_v2, "SiteEnergyUse(kBtu)")
log_distribution(all_log_transformations)

#### 4.2) Log transformation for the features to predict
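The helper `log_transformation` comes from the local `functions` module, whose code is not shown here. The sketch below is therefore only an assumption of what such a step could look like: it uses `np.log1p` (so that a zero consumption stays defined), compares skewness before and after, and recovers the original scale with `np.expm1`. The column values are synthetic.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)

# Synthetic right-skewed target, standing in for an energy variable.
demo = pd.DataFrame({"SiteEnergyUse(kBtu)": rng.lognormal(14, 1.2, size=500)})


def log1p_columns(df, columns):
    """Return a copy of df with log(1 + x) applied to the given columns.

    Hypothetical helper written for this example; it is not the
    notebook's own log_transformation function.
    """
    out = df.copy()
    out[columns] = np.log1p(out[columns])
    return out


transformed = log1p_columns(demo, ["SiteEnergyUse(kBtu)"])

skew_before = demo["SiteEnergyUse(kBtu)"].skew()
skew_after = transformed["SiteEnergyUse(kBtu)"].skew()
print(f"skewness before: {skew_before:.2f}, after: {skew_after:.2f}")

# Predictions made on the log scale are mapped back with expm1.
restored = np.expm1(transformed["SiteEnergyUse(kBtu)"])
```

Whatever transformation is applied to the targets, its inverse has to be applied to the model's predictions before they are compared with the raw values.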

In [None]:
features_to_predict = [
    "SiteEnergyUse(kBtu)",
    "SteamUse(kBtu)",
    "Electricity(kBtu)",
    "NaturalGas(kBtu)",
    "TotalGHGEmissions"]

print(features_to_predict)

In [None]:
dataset_v3 = log_transformation(dataset_v2, features_to_predict)
display(dataset_v3)

### 5) Saving cleaned dataset

In [None]:
# We reset the index
final_dataset = dataset_v3.reset_index(drop=True)

# Save
export_path = "./dataset/cleaned/"
export_filename = "2016_Building_Energy_Prediction.csv"
final_dataset.to_csv("{}{}".format(export_path, export_filename), index=False)