# Feature Selection with sklearn and Pandas

This notebook addresses an energy consumption problem: we want to understand which features are most relevant for the model.

My inspiration comes from this article:
https://towardsdatascience.com/feature-selection-with-pandas-e3690ad8504b

In [1]:
# Import required packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

## Data reading and prep

In [2]:
# wifi counts number of people per day
wifi = pd.read_csv(r"C:\Users\Javiera Vines\Documents\Projects\Square Sense\2020-01-29_data-analyst-intern-building-dataset\2020-01-29_data-analyst-intern-building-dataset\wifi_visits.csv")
wifi

Unnamed: 0,operating_date,visits
0,2018-10-08,336
1,2018-10-09,336
2,2018-10-10,238
3,2018-10-11,328
4,2018-10-12,19
...,...,...
360,2019-10-03,141
361,2019-10-04,112
362,2019-10-05,2
363,2019-10-06,0


In [3]:
#hourly electrical consumption data for the elevator and HVAC (air conditioning) meters
eec = pd.read_csv(r"C:\Users\Javiera Vines\Documents\Projects\Square Sense\2020-01-29_data-analyst-intern-building-dataset\2020-01-29_data-analyst-intern-building-dataset\eec.csv")
eec.head(5)

Unnamed: 0,local_time,energy_consumed,sensor
0,2018-10-08T00:00:00,203495,elevator-eec-meter
1,2018-10-08T00:00:00,1368750,hvac-eec-meter
2,2018-10-08T01:00:00,210536,elevator-eec-meter
3,2018-10-08T01:00:00,1373077,hvac-eec-meter
4,2018-10-08T02:00:00,188568,elevator-eec-meter


In [5]:
type(eec["local_time"][0]) # currently a str; we will convert it to datetime to extract the date only

str

In [6]:
eec['local_time2']= pd.to_datetime(eec['local_time']) 
eec.head()

Unnamed: 0,local_time,energy_consumed,sensor,local_time2
0,2018-10-08T00:00:00,203495,elevator-eec-meter,2018-10-08 00:00:00
1,2018-10-08T00:00:00,1368750,hvac-eec-meter,2018-10-08 00:00:00
2,2018-10-08T01:00:00,210536,elevator-eec-meter,2018-10-08 01:00:00
3,2018-10-08T01:00:00,1373077,hvac-eec-meter,2018-10-08 01:00:00
4,2018-10-08T02:00:00,188568,elevator-eec-meter,2018-10-08 02:00:00


In [7]:
type(eec["local_time2"][0]) #check datatype

pandas._libs.tslibs.timestamps.Timestamp

In [12]:
#replace the column local_time with date only
eec['local_time'] = eec['local_time2'].dt.date
eec.head()

Unnamed: 0,local_time,energy_consumed,sensor,local_time2
0,2018-10-08,203495,elevator-eec-meter,2018-10-08 00:00:00
1,2018-10-08,1368750,hvac-eec-meter,2018-10-08 00:00:00
2,2018-10-08,210536,elevator-eec-meter,2018-10-08 01:00:00
3,2018-10-08,1373077,hvac-eec-meter,2018-10-08 01:00:00
4,2018-10-08,188568,elevator-eec-meter,2018-10-08 02:00:00


In [13]:
type(eec["local_time"][0]) #check datatype

datetime.date

In [14]:
weather = pd.read_csv(r"C:\Users\Javiera Vines\Documents\Projects\Square Sense\2020-01-29_data-analyst-intern-building-dataset\2020-01-29_data-analyst-intern-building-dataset\weather.csv")
weather.head(5)

Unnamed: 0,apparentTemperature,cloudCover,dewPoint,humidity,icon,ozone,precipIntensity,precipProbability,precipType,pressure,temperature,time,uvIndex,visibility,windBearing,windGust,windSpeed
0,23.01,0.73,11.0,0.47,fog,319.5,0.1081,0.1,rain,1015.83,23.01,2019-08-26T05:00:00+0000,0.0,2.312,78.0,8.21,3.88
1,21.94,0.19,10.93,0.5,fog,322.6,0.1826,0.17,rain,1016.02,21.94,2019-08-26T06:00:00+0000,0.0,2.693,83.0,8.22,3.43
2,17.04,0.75,16.38,0.98,fog,303.6,0.9554,0.22,rain,1022.58,16.76,2019-09-14T22:00:00+0000,0.0,1.828,57.0,4.34,3.1
3,16.41,0.62,16.13,1.0,fog,301.0,1.215,0.22,rain,1022.81,16.13,2019-09-14T23:00:00+0000,0.0,0.0,70.0,4.53,1.17
4,16.97,0.57,16.24,0.97,fog,301.3,1.1289,0.23,rain,1022.64,16.71,2019-09-15T00:00:00+0000,0.0,0.0,55.0,5.36,2.16


In [None]:
weather.count()

In [None]:
#derive a "date" column (YYYY-MM-DD) from "time", because its format differs from the other tables
weather_new = weather.assign(date=weather["time"].str[:10])
weather_new.head(5)

In [None]:
weather["time"].dtype

In [None]:
weather['time2']= pd.to_datetime(weather['time'])
weather["time2"].dtype

For the energy consumption of the HVAC system, it is important to understand its relationship with daily heating and cooling degree days. Heating degree days (HDD) measure the number of degrees by which the outdoor air temperature is below a reference temperature over a given time interval; cooling degree days (CDD) measure the number of degrees by which it is above that reference.

The reference temperature is usually 18 °C, so we compute HDD when the temperature is below 18 °C and CDD when it is above 18 °C.
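As a minimal sketch of the definition above, assuming daily mean temperatures in °C and the 18 °C reference, HDD and CDD can be computed separately by clipping the difference at zero. The helper name `degree_days` and the sample temperatures are illustrative and do not come from the dataset.

```python
import pandas as pd

BASE_TEMP = 18.0  # reference temperature in °C, as stated in the text

def degree_days(daily_mean_temp: pd.Series, base: float = BASE_TEMP) -> pd.DataFrame:
    """Hypothetical helper: HDD and CDD per day from daily mean temperatures."""
    hdd = (base - daily_mean_temp).clip(lower=0)  # degrees below the reference
    cdd = (daily_mean_temp - base).clip(lower=0)  # degrees above the reference
    return pd.DataFrame({"HDD": hdd, "CDD": cdd})

# Made-up temperatures: a cold day, a day exactly at the reference, a warm day
temps = pd.Series([10.0, 18.0, 23.5], index=["2019-01-01", "2019-05-01", "2019-08-01"])
dd = degree_days(temps)
print(dd)
```

On any given day at most one of the two columns is non-zero, which is why the notebook can later store a single degree difference together with an HDD/CDD label.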

In [None]:
weather['time3'] = weather['time2'].dt.date

In [None]:
weather["time3"]

In [None]:
weather["time3"].value_counts()

In [None]:
count = weather["time3"].value_counts()
count.dtype

In [None]:
#we see how these values are distributed
#import pylab as pl
viz = count
viz.hist()
plt.show()

To simplify the model, I will only work with the average value of HDD/CDD. However, a more rigorous approach would use all the individual values.
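One possible reading of "all the values" is to compute the degree difference for every hourly reading and only then aggregate per day. The sketch below uses made-up hourly temperatures, and the column names `hdd_h` and `cdd_h` are hypothetical. It shows that this can differ from working on the daily mean temperature.

```python
import pandas as pd

BASE_TEMP = 18.0  # reference temperature in °C

# Made-up hourly readings for a single day (illustrative only)
hourly = pd.DataFrame({
    "date": ["2019-08-26"] * 4,
    "temperature": [14.0, 16.0, 20.0, 22.0],
})

# Degree difference per hourly reading, clipped at zero
hourly["hdd_h"] = (BASE_TEMP - hourly["temperature"]).clip(lower=0)
hourly["cdd_h"] = (hourly["temperature"] - BASE_TEMP).clip(lower=0)

# Aggregate the hourly contributions per day
daily = hourly.groupby("date")[["hdd_h", "cdd_h"]].mean()

# For comparison: the daily mean temperature used by the simplified approach
mean_temp = hourly.groupby("date")["temperature"].mean()
print(daily)
print(mean_temp)
```

In this toy day the mean temperature is exactly 18 °C, so the simplified approach reports no heating or cooling demand, while the hourly approach records both.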

In [None]:
#estimate the daily mean values (numeric columns only; text columns such as icon cannot be averaged)
mean_weather_new = weather_new.groupby('date').mean(numeric_only=True)
mean_weather_new.head(5)

In [None]:
#create HDD and CDD
def temp_deg (row):
    if row['temperature'] > 18 :
        val = "CDD"
    else:
        val = "HDD"
    return val

#create values
def temp_value (row):
    if row['temperature'] > 18 :
        val = row["temperature"] - 18
    else:
        val = 18 - row["temperature"]
    return val

mean_weather_new['HDD/CDD'] = mean_weather_new.apply(temp_deg, axis=1) #create the column with HDD/CDD (as categories)
mean_weather_new['Degree_Dif'] = mean_weather_new.apply(temp_value, axis=1) #calculate the HDD/CDD values

mean_weather_new.head(20)

In [None]:
#observe how they are distributed
mean_weather_new['HDD/CDD'].value_counts()

In [None]:
#HDD / CDD - create a one-hot encoded feature to analyze later
dummies = pd.get_dummies(mean_weather_new['HDD/CDD'])
dummies = dummies.drop(columns = ["CDD"])

mean_weather_new = pd.concat([mean_weather_new, dummies], axis = 1)
mean_weather_new

### Feature Selection

Here we will expose statistical relationships between the energy consumption of the elevator, the energy consumption of the HVAC system, WiFi visits, and weather conditions (in particular heating/cooling degree days).

In [None]:
mean_weather_new.shape

In [None]:
#find missing values
mean_weather_new.isnull().sum()

First, out of 366 rows, some features have 199 missing values. For this reason, these features will be dropped.

The features windGust and cloudCover will be dropped as well, because their missing values represent 15% and 26% of the data, respectively.

Finally, HDD/CDD will also be dropped, because we keep the one-hot encoded variable HDD instead.
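The same decision can be made programmatically from the share of missing values per column. This is a sketch on a made-up frame, and the 10% cutoff `MAX_MISSING` is an assumption chosen for illustration, not a value stated in this notebook.

```python
import numpy as np
import pandas as pd

# Made-up data with different amounts of missing values (illustrative only)
toy = pd.DataFrame({
    "temperature": [10.0, 12.0, 15.0, 20.0],
    "windGust": [5.0, np.nan, 6.0, 7.0],
    "ozone": [np.nan, np.nan, np.nan, 300.0],
})

# Fraction of missing values per column
missing_share = toy.isnull().mean()

MAX_MISSING = 0.10  # assumed cutoff, not taken from the notebook
to_drop = missing_share[missing_share > MAX_MISSING].index.tolist()

reduced = toy.drop(columns=to_drop)
print(missing_share)
print(to_drop)
```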

In [None]:
mean_weather_new = mean_weather_new.drop(columns = ["cloudCover",'ozone',"precipIntensity","precipProbability","pressure","windGust","HDD/CDD"])
mean_weather_new.head(5)

In [None]:
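#eec_day is not defined in the cells shown; assumed here to be eec with a string "date" column (YYYY-MM-DD)
eec_day = eec.assign(date=eec["local_time"].astype(str))[["date", "energy_consumed", "sensor"]]
eec_day = eec_day.set_index('date')
eec_day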

In [None]:
#Now we are going to merge all the datasets
#this dataset has two sensor categories, so we split them into two columns,
#each holding the mean daily energy consumption of one sensor
eec_hvac = eec_day[(eec_day.sensor == "hvac-eec-meter")]
mean_eec_hvac = eec_hvac.groupby('date').mean(numeric_only=True) #skip the text column "sensor"
mean_eec_hvac = mean_eec_hvac.rename(columns={"energy_consumed": "energy_consumed_hvac"})
mean_eec_hvac.head(5)

In [None]:
eec_elevator = eec_day[(eec_day.sensor == "elevator-eec-meter")]
mean_eec_elevator = eec_elevator.groupby('date').mean(numeric_only=True) #skip the text column "sensor"
mean_eec_elevator = mean_eec_elevator.rename(columns={"energy_consumed": "energy_consumed_elevator"})
mean_eec_elevator.head(5)

In [None]:
wifi = wifi.set_index('operating_date') #set date as index as the rest of the tables that are going to be merged
wifi

In [None]:
#Merge all datasets: wifi, eec (elevator and hvac), weather
df = pd.concat([mean_eec_elevator, mean_eec_hvac, wifi, mean_weather_new], axis = 1)
df

In [None]:
#delete nan
df = df.dropna(subset=["visits"])

Now we proceed to analyze the correlation between features.

In [None]:
#corr plot
corr = df.corr()
fig = plt.figure()
ax = fig.add_subplot(111)
cax = ax.matshow(corr,cmap='coolwarm', vmin=-1, vmax=1)
fig.colorbar(cax)
ticks = np.arange(0,len(df.columns),1)
ax.set_xticks(ticks)
plt.xticks(rotation=90)
ax.set_yticks(ticks)
ax.set_xticklabels(df.columns)
ax.set_yticklabels(df.columns)
plt.show()

In [None]:
df.corr() #with values

From here we can see that the variables dewPoint, visibility, windBearing and windSpeed are not correlated with energy consumption (for either sensor), so we proceed to drop them. We also drop apparentTemperature because it is highly correlated with temperature; we keep only one of the two to avoid multicollinearity.
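The two filtering rules described above (drop features weakly correlated with the target, and flag pairs of features that are highly correlated with each other) can be sketched as follows. The data is made up, the column name `energy` is a stand-in for the energy columns, and the thresholds `WEAK` and `HIGH` are assumptions for illustration; the notebook itself judges the correlations by eye.

```python
import pandas as pd

# Made-up deterministic data (illustrative only)
temperature = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
toy = pd.DataFrame({
    "temperature": temperature,
    # nearly identical to temperature, so the two are collinear
    "apparentTemperature": [t + d for t, d in zip(temperature, [0.1, -0.1] * 4)],
    # pattern constructed to be uncorrelated with temperature
    "windBearing": [270.0, 90.0, 90.0, 270.0, 270.0, 90.0, 90.0, 270.0],
})
toy["energy"] = 2.0 * toy["temperature"] + 1.0  # stand-in target

WEAK = 0.2   # assumed cutoff for "not correlated with the target"
HIGH = 0.95  # assumed cutoff for "highly correlated with each other"

features = toy.drop(columns=["energy"])

# Rule 1: features whose absolute correlation with the target is weak
target_corr = features.corrwith(toy["energy"]).abs()
weak = target_corr[target_corr < WEAK].index.tolist()

# Rule 2: pairs of features that are nearly redundant
feat_corr = features.corr().abs()
cols = list(feat_corr.columns)
collinear = [
    (a, b)
    for i, a in enumerate(cols)
    for b in cols[i + 1:]
    if feat_corr.loc[a, b] > HIGH
]
print(weak)
print(collinear)
```

Note that pairwise correlation only captures linear relationships, so a feature flagged as weak here may still matter through a non-linear effect.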

In [None]:
df = df.drop(columns = ["dewPoint","visibility", "windBearing", "windSpeed","apparentTemperature"])
df.head()

In [None]:
#We export to a csv file
df.to_csv('df_consolidated.csv')

In [None]:
#histograms
#import pylab as pl
viz = df[["energy_consumed_elevator","energy_consumed_hvac"]]
viz.hist()
plt.show()

As the two distributions are very different, we proceed to analyze the statistical relationships separately for each sensor.

In [None]:
# #histograms
# #import pylab as pl
# viz = df[["visits","humidity","temperature", "uvIndex","Degree_Dif", "HDD"]]
# viz.hist()
# plt.show()

## Multilinear Regression

In [None]:
# #histograms
# #import pylab as pl
# viz = df[df.columns]
# viz.hist()
# plt.show()

In [None]:
#define X,y to start modeling
X = df.drop(columns = ["energy_consumed_elevator", "energy_consumed_hvac"])

#we define our y variables, which are the dependent (target) variables for the models
y_el = df['energy_consumed_elevator']
y_ac = df['energy_consumed_hvac']

### Y_el = Energy Elevator

In [None]:
#Divide into train and test subsets
from sklearn.model_selection import train_test_split
X_train, X_test, y_el_train, y_el_test = train_test_split(X, y_el, test_size=0.25, random_state=0)

In [None]:
#from sklearn import datasets
from regressors import stats
from sklearn import linear_model
ols = linear_model.LinearRegression()
ols.fit(X, y_el)
xlabels = X_train.columns
stats.summary(ols, X, y_el, xlabels)

From here we need to look at the column of p-values, which indicates the significance of each variable. If the p-value is greater than 0.05, we say the variable is not significant for the model.

Therefore, the most significant variables are: visits, temperature, uvIndex and HDD. A next step would be to run a regression with only this set of variables.
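To make the p-value column less of a black box, here is a hand-rolled sketch of how two-sided t-test p-values can be derived for an ordinary least squares fit. It does not reproduce the output of the `regressors` package; the function name `ols_p_values` and the synthetic data are assumptions for illustration.

```python
import numpy as np
from scipy import stats as sp_stats

def ols_p_values(X: np.ndarray, y: np.ndarray) -> np.ndarray:
    """Hypothetical helper: p-values of an OLS fit with intercept (intercept first)."""
    n = X.shape[0]
    A = np.column_stack([np.ones(n), X])          # add the intercept column
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)  # least squares coefficients
    resid = y - A @ beta
    dof = n - A.shape[1]                          # residual degrees of freedom
    sigma2 = resid @ resid / dof                  # residual variance estimate
    cov = sigma2 * np.linalg.inv(A.T @ A)         # covariance of the coefficients
    t_stat = beta / np.sqrt(np.diag(cov))
    return 2 * sp_stats.t.sf(np.abs(t_stat), dof)  # two-sided p-values

# Synthetic data: y depends on x_signal but not on x_noise
rng = np.random.default_rng(42)
n = 200
x_signal = rng.normal(size=n)
x_noise = rng.normal(size=n)
y = 5.0 + 2.0 * x_signal + rng.normal(scale=1.0, size=n)

p = ols_p_values(np.column_stack([x_signal, x_noise]), y)
print(p)  # order: intercept, x_signal, x_noise
```

A variable with a real effect, such as `x_signal` here, gets a p-value far below 0.05, while an unrelated variable usually (but not always) stays above it.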

### Y_ac = Energy Hvac

In [None]:
#Divide into train and test subsets
from sklearn.model_selection import train_test_split
X_train, X_test, y_ac_train, y_ac_test = train_test_split(X, y_ac, test_size=0.25, random_state=0)

In [None]:
#from sklearn import datasets
from regressors import stats
from sklearn import linear_model
ols = linear_model.LinearRegression()
ols.fit(X, y_ac)
xlabels = X_train.columns
stats.summary(ols, X, y_ac, xlabels)

Again, we look at the p-values to understand the significance of the variables.
Here, the most significant variables are visits, temperature and Degree_Dif, because their p-values are below 0.05.

A next step would be to run a regression with only this set of variables. I would also suggest dropping the variable HDD, since it only represents the category (HDD or CDD) of Degree_Dif and was not found to be significant for this model.
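The suggested next step, refitting with only the selected variables and comparing against the full model on held-out data, could look like the sketch below. The data is synthetic and the coefficients are invented; only the column names mirror the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the consolidated dataframe (illustrative only)
rng = np.random.default_rng(0)
n = 300
toy = pd.DataFrame({
    "visits": rng.integers(0, 400, n).astype(float),
    "temperature": rng.normal(15, 6, n),
    "humidity": rng.uniform(0.3, 1.0, n),
})
toy["Degree_Dif"] = (toy["temperature"] - 18).abs()
# Invented relationship: the target depends on visits and Degree_Dif only
y = 1000 + 5 * toy["visits"] + 40 * toy["Degree_Dif"] + rng.normal(0, 50, n)

selected = ["visits", "temperature", "Degree_Dif"]
X_train, X_test, y_train, y_test = train_test_split(
    toy, y, test_size=0.25, random_state=0
)

# Fit on the training subset only, then score both models on the test subset
full = LinearRegression().fit(X_train, y_train)
reduced = LinearRegression().fit(X_train[selected], y_train)

r2_full = full.score(X_test, y_test)
r2_reduced = reduced.score(X_test[selected], y_test)
print(r2_full, r2_reduced)
```

If the reduced model scores about as well as the full one on the test subset, the dropped variables were contributing little.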

### To sum up: 
For both models, the most significant variables are:
- Elevator = visits, temperature, uvIndex, HDD
- Hvac = visits, temperature, Degree_Dif (HDD/CDD).

In conclusion, since Degree_Dif represents the value of HDD and CDD, we can say that the energy consumed in both cases is strongly affected by visits, temperature and HDD/CDD.