# Data Exploration

## Dataset information
The data set is sampled every 10 minutes over about 4.5 months. Temperature and humidity in the house were monitored with a ZigBee wireless sensor network. Each wireless node transmitted temperature and humidity readings about every 3.3 minutes, and the readings were then averaged over 10-minute periods. The energy data was logged every 10 minutes with m-bus energy meters. Weather data from the nearest airport weather station (Chievres Airport, Belgium) was downloaded from a public data set from Reliable Prognosis (rp5.ru) and merged with the experimental data using the date and time column. Two random variables are included in the data set for testing the regression models and for filtering out non-predictive attributes (parameters).
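
As a rough illustration of the averaging and merging described above, the sketch below averages irregular sensor readings into 10-minute periods and joins them with weather records on the timestamp. All values and the names `sensor`, `weather`, and `merged` are made up for illustration; this is an assumption about how such a pipeline could look, not the authors' actual procedure.

```python
import numpy as np
import pandas as pd

# Hypothetical sensor readings arriving every 200 s (about 3.3 minutes).
sensor_index = pd.date_range(
    "2016-01-11 17:00:00", periods=18, freq=pd.Timedelta(seconds=200)
)
sensor = pd.DataFrame({"T1": np.linspace(19.0, 20.7, 18)}, index=sensor_index)

# Average the readings over 10-minute periods.
sensor_10min = sensor.resample("10min").mean()

# Hypothetical weather records already on a 10-minute grid.
weather_index = pd.date_range("2016-01-11 17:00:00", periods=6, freq="10min")
weather = pd.DataFrame(
    {"T_out": [6.6, 6.5, 6.4, 6.3, 6.2, 6.1]}, index=weather_index
)

# Merge the two sources on the shared date-time index.
merged = sensor_10min.join(weather, how="inner")
print(merged)
```

An inner join keeps only timestamps present in both sources, so periods without a matching weather record would be dropped rather than filled.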

## Source
Luis Candanedo, luismiguel.candanedoibarra '@' umons.ac.be, University of Mons (UMONS).

In [1]:
# Import necessary packages
import pandas as pd
import numpy as np
#use widget instead of inline to make the plot interactive
%matplotlib widget
import matplotlib.pyplot as plt
# plt.rcParams['figure.figsize'] = [20, 10]
from sklearn import preprocessing

In [2]:
df = pd.read_csv('data/energydata_complete.csv', parse_dates=['date'])
df = df.set_index(['date'],drop=True)

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 19735 entries, 2016-01-11 17:00:00 to 2016-05-27 18:00:00
Data columns (total 28 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   Appliances   19735 non-null  int64  
 1   lights       19735 non-null  int64  
 2   T1           19735 non-null  float64
 3   RH_1         19735 non-null  float64
 4   T2           19735 non-null  float64
 5   RH_2         19735 non-null  float64
 6   T3           19735 non-null  float64
 7   RH_3         19735 non-null  float64
 8   T4           19735 non-null  float64
 9   RH_4         19735 non-null  float64
 10  T5           19735 non-null  float64
 11  RH_5         19735 non-null  float64
 12  T6           19735 non-null  float64
 13  RH_6         19735 non-null  float64
 14  T7           19735 non-null  float64
 15  RH_7         19735 non-null  float64
 16  T8           19735 non-null  float64
 17  RH_8         19735 non-null  float64
 18  T9         

In [4]:
df.head(4)

date,Appliances,lights,T1,RH_1,T2,RH_2,T3,RH_3,T4,RH_4,...,T9,RH_9,T_out,Press_mm_hg,RH_out,Windspeed,Visibility,Tdewpoint,rv1,rv2
2016-01-11 17:00:00,60,30,19.89,47.596667,19.2,44.79,19.79,44.73,19.0,45.566667,...,17.033333,45.53,6.6,733.5,92.0,7.0,63.0,5.3,13.275433,13.275433
2016-01-11 17:10:00,60,30,19.89,46.693333,19.2,44.7225,19.79,44.79,19.0,45.9925,...,17.066667,45.56,6.483333,733.6,92.0,6.666667,59.166667,5.2,18.606195,18.606195
2016-01-11 17:20:00,50,30,19.89,46.3,19.2,44.626667,19.79,44.933333,18.926667,45.89,...,17.0,45.5,6.366667,733.7,92.0,6.333333,55.333333,5.1,28.642668,28.642668
2016-01-11 17:30:00,50,40,19.89,46.066667,19.2,44.59,19.79,45.0,18.89,45.723333,...,17.0,45.4,6.25,733.8,92.0,6.0,51.5,5.0,45.410389,45.410389


In [9]:
min_max_df=(df-df.min())/(df.max()-df.min())
min_max_df.loc['2016-01-11 17:00:00':'2016-01-12 17:00:00'][['Appliances','lights','T_out']].plot()
plt.ylabel(r'Min-Max Normalized units')
plt.title('Sample Dataset plot')
plt.show()

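
The normalization used for this plot rescales each column to the range [0, 1] via `(x - min) / (max - min)`. The following minimal sketch applies the same expression to a made-up frame (`toy` is a hypothetical stand-in for `df`) and shows one caveat: a column whose values are all equal would come out as NaN.

```python
import pandas as pd

# Made-up values standing in for a few columns of the real data.
toy = pd.DataFrame(
    {
        "Appliances": [60, 60, 50, 50, 90],
        "T_out": [6.6, 6.5, 6.4, 6.3, 6.1],
        "constant": [1.0, 1.0, 1.0, 1.0, 1.0],
    }
)

# Column-wise min-max scaling: (x - min) / (max - min).
toy_scaled = (toy - toy.min()) / (toy.max() - toy.min())

# Each non-constant column now spans the range [0, 1].
print(toy_scaled[["Appliances", "T_out"]].agg(["min", "max"]))

# A constant column has max == min, so 0 / 0 produces NaN throughout.
print(toy_scaled["constant"].isna().all())
```

Because every column ends up on the same scale, series with very different units can share one axis, which is what the plot above relies on.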

## Feature Engineering

In [6]:
# Add the previous hour of energy consumption as lag features, in 10-minute steps
for i in range(1, 7):
    df[f'Appliances_{i}0'] = df['Appliances'].shift(i)
# Drop rows with missing values, i.e. those without a full lag history
df = df.dropna(axis=0)
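
To make the effect of `shift` concrete, here is a small sketch on a made-up series (the `toy` frame and its values are hypothetical). Each `Appliances_{i}0` column holds the reading from `i` rows earlier, which corresponds to `10 * i` minutes only if the index is a regular 10-minute grid without gaps.

```python
import pandas as pd

# Five made-up readings on a regular 10-minute grid.
index = pd.date_range("2016-01-11 17:00:00", periods=5, freq="10min")
toy = pd.DataFrame({"Appliances": [60, 60, 50, 50, 90]}, index=index)

# Two lag features: the readings 10 and 20 minutes earlier.
for i in range(1, 3):
    toy[f"Appliances_{i}0"] = toy["Appliances"].shift(i)

# The first two rows have no complete history and are dropped.
toy = toy.dropna(axis=0)
print(toy)
```

With two lags, the first two rows are lost; with the six lags used in this notebook, the same logic removes the first six rows.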

In [11]:
plt.figure()
df.loc['2016-01-15 17:00:00':'2016-01-30 17:00:00']['lights'].plot()
plt.show()


In this sample section, the lights are usually on between 8am and 11pm. This pattern can differ between households, but since this dataset comes from a single household, it is safe to use it to assign a day/night category.

In [46]:
df['hour'] = df.index.hour
# Day covers hours 8 through 22 inclusive (08:00 to 22:59); night is the complement
df['is_day'] = df['hour'].between(8, 22).astype(int)
df['is_night'] = 1 - df['is_day']
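
One way to check the chosen window, sketched here under the assumption that `lights` is zero whenever the lights are off, is to average `lights` by hour of day and list the hours with non-zero usage. The data below is synthetic and only mimics the pattern described above; `toy`, `hourly_lights`, and `active_hours` are illustrative names.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the real data: two days at 10-minute resolution,
# with a non-zero lights value only between 08:00 and 22:59.
index = pd.date_range("2016-01-15 00:00:00", periods=2 * 144, freq="10min")
toy = pd.DataFrame(index=index)
toy["lights"] = np.where((index.hour >= 8) & (index.hour <= 22), 30, 0)

# Mean light usage for each hour of the day.
hourly_lights = toy.groupby(toy.index.hour)["lights"].mean()

# Hours whose average usage is above zero.
active_hours = hourly_lights[hourly_lights > 0].index.tolist()
print(active_hours)
```

On the real data the boundary would be less clean, so a small threshold instead of zero may be a more practical cut-off.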

In [47]:
df['T_mean']=df[[f'T{s+1}' for s in range(9)]].mean(axis=1)

df['RH_mean']=df[[f'RH_{s+1}' for s in range(9)]].mean(axis=1)

In [48]:
corr = df.corr()
# Styler.set_precision was removed in pandas 2.0; format(precision=...) needs pandas >= 1.3
corr.style.background_gradient(cmap='coolwarm',axis=None).format(precision=2)

,Appliances,lights,T1,RH_1,T2,RH_2,T3,RH_3,T4,RH_4,T5,RH_5,T6,RH_6,T7,RH_7,T8,RH_8,T9,RH_9,T_out,Press_mm_hg,RH_out,Windspeed,Visibility,Tdewpoint,rv1,rv2,Appliances_10,Appliances_20,Appliances_30,Appliances_40,Appliances_50,Appliances_60,hour,is_day,is_night,T_mean,RH_mean
Appliances,1.0,0.2,0.06,0.09,0.12,-0.06,0.08,0.04,0.04,0.02,0.02,0.01,0.12,-0.08,0.03,-0.06,0.04,-0.09,0.01,-0.05,0.1,-0.04,-0.15,0.09,0.0,0.02,-0.01,-0.01,0.75,0.53,0.44,0.39,0.36,0.32,0.22,0.32,-0.32,0.08,-0.06
lights,0.2,1.0,-0.02,0.11,-0.01,0.05,-0.1,0.13,-0.01,0.11,-0.08,0.14,-0.08,0.15,-0.13,0.03,-0.07,0.01,-0.16,-0.01,-0.07,-0.01,0.07,0.06,0.02,-0.04,0.0,0.0,0.19,0.18,0.17,0.17,0.17,0.18,0.26,0.2,-0.2,-0.08,0.14
T1,0.06,-0.02,1.0,0.16,0.84,-0.0,0.89,-0.03,0.88,0.1,0.89,-0.01,0.65,-0.61,0.84,0.14,0.83,-0.01,0.84,0.07,0.68,-0.15,-0.35,-0.09,-0.08,0.57,-0.01,-0.01,0.06,0.07,0.08,0.09,0.1,0.11,0.18,0.07,-0.07,0.92,-0.32
RH_1,0.09,0.11,0.16,1.0,0.27,0.8,0.25,0.84,0.11,0.88,0.21,0.3,0.32,0.24,0.02,0.8,-0.03,0.74,0.12,0.76,0.34,-0.29,0.27,0.2,-0.02,0.64,-0.0,-0.0,0.12,0.15,0.15,0.15,0.13,0.12,0.02,0.05,-0.05,0.21,0.66
T2,0.12,-0.01,0.84,0.27,1.0,-0.17,0.74,0.12,0.76,0.23,0.72,0.03,0.8,-0.58,0.66,0.23,0.58,0.07,0.68,0.16,0.79,-0.13,-0.51,0.05,-0.07,0.58,-0.01,-0.01,0.13,0.13,0.14,0.14,0.15,0.15,0.25,0.3,-0.3,0.86,-0.26
RH_2,-0.06,0.05,-0.0,0.8,-0.17,1.0,0.14,0.68,-0.05,0.72,0.11,0.25,-0.01,0.39,-0.05,0.69,-0.04,0.68,0.05,0.68,0.03,-0.26,0.58,0.07,-0.01,0.5,0.01,0.01,-0.05,-0.04,-0.03,-0.03,-0.03,-0.03,-0.18,-0.25,0.25,-0.01,0.69
T3,0.08,-0.1,0.89,0.25,0.74,0.14,1.0,-0.01,0.85,0.12,0.89,-0.07,0.69,-0.65,0.85,0.17,0.8,0.04,0.9,0.14,0.7,-0.19,-0.28,-0.1,-0.1,0.65,-0.01,-0.01,0.09,0.11,0.12,0.13,0.14,0.14,0.04,0.01,-0.01,0.92,-0.32
RH_3,0.04,0.13,-0.03,0.84,0.12,0.68,-0.01,1.0,-0.14,0.9,-0.05,0.38,0.08,0.51,-0.25,0.83,-0.28,0.83,-0.19,0.83,0.12,-0.23,0.36,0.26,0.02,0.41,-0.0,-0.0,0.05,0.07,0.08,0.08,0.08,0.07,-0.05,-0.08,0.08,-0.06,0.83
T4,0.04,-0.01,0.88,0.11,0.76,-0.05,0.85,-0.14,1.0,-0.05,0.87,-0.08,0.65,-0.7,0.88,0.04,0.8,-0.09,0.89,-0.03,0.66,-0.08,-0.39,-0.19,-0.1,0.52,-0.0,-0.0,0.05,0.05,0.06,0.06,0.07,0.07,0.09,0.07,-0.07,0.91,-0.44
RH_4,0.02,0.11,0.1,0.88,0.23,0.72,0.12,0.9,-0.05,1.0,0.09,0.35,0.26,0.39,-0.13,0.89,-0.17,0.85,-0.04,0.86,0.29,-0.25,0.34,0.3,0.0,0.62,-0.0,-0.0,0.02,0.02,0.02,0.02,0.02,0.02,-0.02,-0.03,0.03,0.1,0.78
