### Author: Yikang Li

As a Data Scientist, I want to explore my data set with visualizations and common statistical characteristics of time series so that I can understand the nature and shape of the data and get an idea of which algorithms will perform best.

To Do:
- Check each feature and extract its characteristics.
- Check the correlation among the features and their extracted characteristics.
- Visualize the time series data and explore trends and characteristics using Plotly and Cufflinks.
- Transform the data into the spectral domain using the fast Fourier transform and repeat the EDA there to get more information.
- Study the seasonal decomposition of each column's time series and check the trend, seasonality, and similar characteristics.

Also, produce the same plots for the 1-minute and 15-minute aggregated data.
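
The spectral step in the To Do list can be sketched with NumPy alone. The example below builds a synthetic 15-minute load series with a daily cycle and recovers the cycle length from the FFT magnitude spectrum. The signal, its amplitude, and the `dominant_period` helper are illustrative assumptions, not part of the data set.

```python
import numpy as np

def dominant_period(values, sample_spacing):
    """Return the period (in units of sample_spacing) of the strongest non-DC frequency."""
    values = np.asarray(values, dtype=float)
    centered = values - values.mean()            # remove the DC component
    spectrum = np.abs(np.fft.rfft(centered))     # magnitude spectrum of a real signal
    freqs = np.fft.rfftfreq(len(centered), d=sample_spacing)
    peak = np.argmax(spectrum[1:]) + 1           # skip the zero-frequency bin
    return 1.0 / freqs[peak]

# Synthetic stand-in: 14 days of 15-minute samples with a 24-hour cycle plus noise.
samples_per_day = 96
t = np.arange(14 * samples_per_day)
rng = np.random.default_rng(0)
load = 50 + 20 * np.sin(2 * np.pi * t / samples_per_day) + rng.normal(0, 2, t.size)

period_hours = dominant_period(load, sample_spacing=0.25)  # 0.25 h between samples
print(round(period_hours, 2))
```

On real appliance data the spectrum is likely to show several peaks (for example compressor cycles of a fridge next to a daily pattern), so inspecting the whole spectrum is usually more informative than a single peak.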

In [1]:
#import libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt 
import cufflinks as cf
import plotly
import plotly.plotly as py
import plotly.graph_objs as go
from  plotly.offline import plot

from datetime import datetime
import pandas_datareader.data as web

In [3]:
#load the data
house1 = pd.read_csv("./processed/house_01.csv")
house2 = pd.read_csv("./processed/house_02.csv")
house3 = pd.read_csv("./processed/house_03.csv")
house4 = pd.read_csv("./processed/house_04.csv")
house5 = pd.read_csv("./processed/house_05.csv")
house6 = pd.read_csv("./processed/house_06.csv")
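
The outputs in the following sections show a leftover `Unnamed: 0` column and a `date` column stored as `object`. A minimal sketch of one way to avoid both at load time is shown here, using a small in-memory CSV as a stand-in for the real files. The sample rows are made up for illustration, and the assumption is that the first CSV column is the saved row index.

```python
import io
import pandas as pd

# Tiny in-memory stand-in for ./processed/house_01.csv (values are illustrative).
sample_csv = io.StringIO(
    ",date,occupancy,Fridge,Kettle\n"
    "0,2012-06-01 00:00:00,,49.2516,0.0\n"
    "1,2012-06-01 00:00:01,,49.2516,0.0\n"
    "2,2012-06-01 00:00:02,,51.3899,-1.0\n"
)

# index_col=0 consumes the saved row index instead of creating "Unnamed: 0";
# parse_dates turns the date strings into datetime64 values.
house = pd.read_csv(sample_csv, index_col=0, parse_dates=["date"])
house = house.set_index("date")

print(house.dtypes)
print(house.index.dtype)
```

With a datetime index in place, time-based operations such as `resample` become available without positional arithmetic.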

### House 1:

In [4]:
house1.head()

Unnamed: 0.1,Unnamed: 0,date,occupancy,Fridge,Dryer,Coffee machine,Kettle,Washing machine,Freezer
0,0,2012-06-01 00:00:00,,49.2516,830.508,,0.0,4.39739,2.23178
1,1,2012-06-01 00:00:01,,49.2516,834.774,,0.0,4.39739,2.23178
2,2,2012-06-01 00:00:02,,49.2516,834.774,,0.0,4.39739,2.23178
3,3,2012-06-01 00:00:03,,51.3899,832.641,,0.0,4.39739,2.23178
4,4,2012-06-01 00:00:04,,49.2516,832.641,,0.0,6.5338,2.23178


The first rows show missing values (NaN) in the "occupancy" and "Coffee machine" columns.
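
Because `head()` only shows five rows, the share of missing values per column gives a fuller picture. A small sketch on an illustrative frame (in the notebook the same expression could be applied to `house1`):

```python
import numpy as np
import pandas as pd

# Illustrative frame; the values are made up.
df = pd.DataFrame({
    "occupancy": [np.nan, np.nan, 1.0, 1.0],
    "Fridge": [49.25, 49.25, 51.39, 49.25],
    "Coffee machine": [np.nan, np.nan, np.nan, 0.0],
})

# isna() marks missing cells; the column mean of that mask is the missing share.
missing_share = df.isna().mean().sort_values(ascending=False)
print(missing_share)
```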

#### Statistical Characteristics:

In [5]:
house1.describe()

Unnamed: 0.1,Unnamed: 0,occupancy,Fridge,Dryer,Coffee machine,Kettle,Washing machine,Freezer
count,20390400.0,7344000.0,19872000.0,19872000.0,9676800.0,17452800.0,19872000.0,19872000.0
mean,10195200.0,0.7859397,20.70984,22.46613,4.421196,4.40276,23.3662,18.2222
std,5886202.0,0.4101689,24.97417,129.8046,73.18194,90.16391,187.0091,15.56339
min,0.0,0.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0
25%,5097600.0,1.0,2.20578,0.0,0.0,0.0,0.0,2.23178
50%,10195200.0,1.0,4.34432,0.0,0.0,0.0,0.0,30.2745
75%,15292800.0,1.0,49.2516,0.0,0.0,0.0,2.26097,32.4316
max,20390400.0,1.0,1012.68,1018.2,1290.88,1902.46,2427.99,105.764
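
Every appliance column in this summary has a minimum of -1.0, which is not a plausible power reading. Treating -1.0 as a missing-data marker is an assumption here and should be checked against the data set's documentation. Under that assumption, the sketch below masks the marker before computing statistics, using made-up values.

```python
import numpy as np
import pandas as pd

SENTINEL = -1.0  # assumed missing-data marker, not confirmed by the notebook

df = pd.DataFrame({
    "occupancy": [0.0, 1.0, 1.0, 1.0],
    "Fridge": [-1.0, 2.0, 4.0, 48.0],
    "Kettle": [0.0, -1.0, -1.0, 1900.0],
})

# Only the appliance columns are masked; occupancy is a 0/1 flag.
appliance_cols = [c for c in df.columns if c != "occupancy"]
cleaned = df.copy()
cleaned[appliance_cols] = cleaned[appliance_cols].replace(SENTINEL, np.nan)

print(df["Fridge"].mean(), cleaned["Fridge"].mean())
print(cleaned.describe())
```

If the marker is left in place, it pulls means and quantiles downward and inflates the apparent number of valid readings.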


In [6]:
house1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20390400 entries, 0 to 20390399
Data columns (total 9 columns):
Unnamed: 0         int64
date               object
occupancy          float64
Fridge             float64
Dryer              float64
Coffee machine     float64
Kettle             float64
Washing machine    float64
Freezer            float64
dtypes: float64(7), int64(1), object(1)
memory usage: 1.4+ GB


#### Correlation:

In [7]:
pd.DataFrame.corr(house1)

Unnamed: 0.1,Unnamed: 0,occupancy,Fridge,Dryer,Coffee machine,Kettle,Washing machine,Freezer
Unnamed: 0,1.0,-0.124171,-0.035811,0.018758,-0.005077,0.015791,0.014661,-0.070602
occupancy,-0.124171,1.0,0.027438,0.065237,0.034402,0.026767,0.039345,0.004867
Fridge,-0.035811,0.027438,1.0,0.011183,0.006161,0.002722,0.010244,0.018268
Dryer,0.018758,0.065237,0.011183,1.0,-0.001826,0.002089,0.048201,0.001266
Coffee machine,-0.005077,0.034402,0.006161,-0.001826,1.0,0.034834,0.003753,0.001518
Kettle,0.015791,0.026767,0.002722,0.002089,0.034834,1.0,0.004929,-0.005632
Washing machine,0.014661,0.039345,0.010244,0.048201,0.003753,0.004929,1.0,0.006294
Freezer,-0.070602,0.004867,0.018268,0.001266,0.001518,-0.005632,0.006294,1.0
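
Reading a full correlation matrix by eye gets harder as the number of columns grows. The hypothetical helper `top_correlations` below is a sketch that ranks column pairs by absolute Pearson correlation. It is demonstrated on synthetic columns, one of which is built to track another. Note also that `Unnamed: 0` is a row counter, so its correlations reflect drift over time rather than a relationship between appliances.

```python
import numpy as np
import pandas as pd

def top_correlations(frame, n=3):
    """Return the n column pairs with the largest absolute Pearson correlation."""
    corr = frame.select_dtypes("number").corr()
    # Keep only the upper triangle so each pair appears once and the diagonal is dropped.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack().dropna()
    order = pairs.abs().sort_values(ascending=False).index
    return pairs.reindex(order).head(n)

rng = np.random.default_rng(1)
base = rng.normal(size=500)
demo = pd.DataFrame({
    "Entertainment": base,
    "Stereo": base + rng.normal(scale=0.3, size=500),  # built to track Entertainment
    "Kettle": rng.normal(size=500),
})
top = top_correlations(demo, n=1)
print(top)
```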


#### Visualize time series data:

In [8]:
# Credentials redacted: supply your own Plotly username and API key.
plotly.tools.set_credentials_file(username='YOUR_USERNAME', api_key='YOUR_API_KEY')

In [9]:
house1_dt = house1[['date','Fridge', 'Dryer', 'Kettle', 'Washing machine', 'Freezer']].iloc[:10000, :]
house1_dt.iplot(kind = "scatter", x = 'date')


#### Minute and 15-minute data:

In [10]:
group_mins = house1.groupby(np.arange(len(house1))//60)
house1_by_mins = pd.concat((group_mins['date'].first(), group_mins[[c for c in house1.columns if c != 'date']].sum()), axis=1)
house1_by_mins.head()

Unnamed: 0.1,date,Unnamed: 0,occupancy,Fridge,Dryer,Coffee machine,Kettle,Washing machine,Freezer
0,2012-06-01 00:00:00,1770,0.0,2976.479,50107.775,0.0,0.0,291.61673,102.66188
1,2012-06-01 00:01:00,5370,0.0,2963.6492,49883.802,0.0,0.0,295.88955,706.65375
2,2012-06-01 00:02:00,8970,0.0,2946.5424,49873.138,0.0,0.0,291.61672,1937.2673
3,2012-06-01 00:03:00,12570,0.0,2935.8504,24947.1134,0.0,0.0,300.16237,1986.879
4,2012-06-01 00:04:00,16170,0.0,2918.7432,4482.35334,0.0,0.0,302.29878,1995.507
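
The positional grouping in the cell above relies on there being exactly one row per second with no gaps, and it sums the readings, so each minute value is roughly 60 times a typical reading. A time-based alternative is sketched below, assuming the `date` column has been parsed into a datetime index: `resample` with a mean keeps the values on the original scale. The constant series is illustrative only.

```python
import numpy as np
import pandas as pd

# Illustrative one-second series covering 30 minutes.
index = pd.to_datetime("2012-06-01") + pd.to_timedelta(np.arange(1800), unit="s")
second_data = pd.DataFrame({"Fridge": np.full(1800, 49.25)}, index=index)

by_minute = second_data.resample("1min").mean()         # average reading per minute
by_quarter_hour = second_data.resample("15min").mean()  # average reading per 15 minutes

print(by_minute.head())
print(len(by_minute), len(by_quarter_hour))
```

Whether a sum or a mean is appropriate depends on the quantity: if the readings are instantaneous power, a mean gives average power per interval, while a sum is proportional to energy.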


In [11]:
house1_by_mins[['date','Fridge', 'Dryer', 'Kettle', 'Washing machine', 'Freezer']].iplot(kind = "scatter", x = 'date')


Woah there! Look at all those points! Due to browser limitations, the Plotly SVG drawing functions have a hard time graphing more than 500k data points for line charts, or 40k points for other types of charts. Here are some suggestions:
(1) Use the `plotly.graph_objs.Scattergl` trace object to generate a WebGl graph.
(2) Trying using the image API to return an image instead of a graph URL
(3) Use matplotlib
(4) See if you can create your visualization with fewer data points
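
Following suggestion (4) in the warning above, one option is to cap the number of rows sent to the plot. The hypothetical helper `thin_for_plot` below is a minimal sketch that keeps every k-th row. The 40,000 default mirrors the figure quoted in the warning and is applied here per frame, which is an assumption, since the limit may count points across all traces.

```python
import math
import pandas as pd

def thin_for_plot(frame, max_points=40_000):
    """Keep every k-th row so that at most max_points rows remain."""
    if len(frame) <= max_points:
        return frame
    stride = math.ceil(len(frame) / max_points)
    return frame.iloc[::stride]

# 339,840 rows: the length of house 1 after grouping 20,390,400 seconds into minutes.
demo = pd.DataFrame({"Fridge": range(339_840)})
thinned = thin_for_plot(demo, max_points=40_000)
print(len(thinned))
```

Stride-based thinning can skip short spikes such as a kettle run, so aggregating with a mean or maximum over coarser intervals may be a better fit for bursty appliances.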





In [12]:
group_15mins = house1.groupby(np.arange(len(house1))//900)
house1_by_15mins = pd.concat((group_15mins['date'].first(), group_15mins[[c for c in house1.columns if c != 'date']].sum()), axis=1)
house1_by_15mins.head()

Unnamed: 0.1,date,Unnamed: 0,occupancy,Fridge,Dryer,Coffee machine,Kettle,Washing machine,Freezer
0,2012-06-01 00:00:00,404550,0.0,43507.4314,195942.03986,0.0,0.0,4459.70734,26138.31223
1,2012-06-01 00:15:00,1214550,0.0,12909.26128,24376.89486,0.0,0.0,4436.20682,15496.66314
2,2012-06-01 00:30:00,2024550,0.0,1166.85762,7796.92002,0.0,0.0,4466.11657,8083.31259
3,2012-06-01 00:45:00,2834550,0.0,1158.0345,453.4542,0.0,0.0,4425.52477,28938.2129
4,2012-06-01 01:00:00,3644550,0.0,1135.9767,527.9346,0.0,0.0,4434.07041,6333.1408
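
The 15-minute frame is a convenient input for the seasonal decomposition item in the To Do list, since one day is 96 samples. The sketch below is a simplified, hand-rolled additive decomposition in pandas, not a library routine. The `additive_decompose` helper and the synthetic series are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

def additive_decompose(series, period):
    """Split a series into trend, seasonal, and residual parts (simplified additive scheme)."""
    # Trend: centered moving average spanning one full period.
    trend = series.rolling(window=period, center=True, min_periods=period).mean()
    detrended = series - trend
    # Seasonal: average detrended value at each position within the period.
    phase = np.arange(len(series)) % period
    seasonal_by_phase = detrended.groupby(phase).mean()
    seasonal = pd.Series(seasonal_by_phase.reindex(phase).to_numpy(), index=series.index)
    seasonal = seasonal - seasonal_by_phase.mean()    # center the seasonal component
    residual = series - trend - seasonal
    return trend, seasonal, residual

# Synthetic 15-minute series: 10 days, slow upward drift plus a daily cycle.
period = 96
t = np.arange(10 * period)
series = pd.Series(100 + 0.01 * t + 10 * np.sin(2 * np.pi * t / period))

trend, seasonal, residual = additive_decompose(series, period)
print(seasonal.iloc[:period].describe())
```

The trend is undefined (NaN) for roughly half a period at each end of the series, which is a property of the centered moving average.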


In [14]:
house1_by_15mins[['date','Fridge', 'Dryer', 'Kettle', 'Washing machine', 'Freezer']].iplot(kind = "scatter", x = 'date')


### House 2:

In [15]:
house2.head()

Unnamed: 0.1,Unnamed: 0,date,occupancy,Tablet,Dishwasher,Air exhaust,Fridge,Entertainment,Freezer,Kettle,Lamp,Laptops,Stove,Stereo
0,0,2012-06-01 00:00:00,1.0,2.21504,0.0,0.0,2.21458,0.0,53.651,0.0,0.0,0.0,,0.0
1,1,2012-06-01 00:00:01,1.0,4.3293,0.0,0.0,2.21458,2.17127,55.7929,0.0,0.0,0.0,,2.17127
2,2,2012-06-01 00:00:02,1.0,2.21504,0.0,0.0,0.0,0.0,53.651,0.0,0.0,0.0,,0.0
3,3,2012-06-01 00:00:03,1.0,2.21504,0.0,0.0,0.0,0.0,53.651,0.0,0.0,0.0,,0.0
4,4,2012-06-01 00:00:04,1.0,2.21504,0.0,0.0,0.0,0.0,55.7929,0.0,0.0,0.0,,0.0


#### Statistical Characteristics:

In [16]:
house2.describe()

Unnamed: 0.1,Unnamed: 0,occupancy,Tablet,Dishwasher,Air exhaust,Fridge,Entertainment,Freezer,Kettle,Lamp,Laptops,Stove,Stereo
count,21081600.0,10886400.0,20649600.0,20649600.0,20649600.0,20649600.0,20304000.0,20649600.0,20649600.0,20649600.0,20649600.0,2419200.0,20649600.0
mean,10540800.0,0.7349464,1.202025,15.84929,0.5644951,24.38221,53.48632,27.10266,4.76108,14.94485,6.049097,13.3844,16.41233
std,6085734.0,0.4413617,1.397134,180.1734,7.394131,40.8649,87.32343,36.27581,94.46724,48.4657,16.82079,169.4802,25.31551
min,0.0,0.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,0.0,-1.0
25%,5270400.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,10540800.0,1.0,2.21504,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
75%,15811200.0,1.0,2.21504,0.0,0.0,68.1728,55.1171,53.651,0.0,0.0,0.0,0.0,47.572
max,21081600.0,1.0,10.672,2335.92,185.706,1037.75,393.848,967.608,1910.49,317.139,1613.28,6224.0,788.357


In [17]:
house2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21081600 entries, 0 to 21081599
Data columns (total 14 columns):
Unnamed: 0       int64
date             object
occupancy        float64
Tablet           float64
Dishwasher       float64
Air exhaust      float64
Fridge           float64
Entertainment    float64
Freezer          float64
Kettle           float64
Lamp             float64
Laptops          float64
Stove            float64
Stereo           float64
dtypes: float64(12), int64(1), object(1)
memory usage: 2.2+ GB


#### Correlation:

In [18]:
pd.DataFrame.corr(house2)

Unnamed: 0.1,Unnamed: 0,occupancy,Tablet,Dishwasher,Air exhaust,Fridge,Entertainment,Freezer,Kettle,Lamp,Laptops,Stove,Stereo
Unnamed: 0,1.0,-0.035678,-0.04277,-0.000696,0.012986,-0.027235,0.051407,-0.067576,-0.009343,0.210775,-0.03336,-0.021155,0.031542
occupancy,-0.035678,1.0,-0.016264,0.030555,0.051676,0.026999,0.363817,-0.002114,0.034912,0.165345,0.217732,0.047757,0.383578
Tablet,-0.04277,-0.016264,1.0,0.026378,0.011852,0.032669,0.029561,0.033417,0.003525,0.001212,0.023116,-0.001898,0.050481
Dishwasher,-0.000696,0.030555,0.026378,1.0,0.002779,0.004039,0.060516,0.001657,-0.002181,0.023605,0.04759,0.064955,0.064952
Air exhaust,0.012986,0.051676,0.011852,0.002779,1.0,0.015754,0.066469,0.006188,0.045596,0.087837,0.031006,0.328872,0.085361
Fridge,-0.027235,0.026999,0.032669,0.004039,0.015754,1.0,0.029078,0.022544,0.001513,0.003777,0.0108,0.006518,0.034642
Entertainment,0.051407,0.363817,0.029561,0.060516,0.066469,0.029078,1.0,0.016083,0.031613,0.44721,0.350369,0.045343,0.839844
Freezer,-0.067576,-0.002114,0.033417,0.001657,0.006188,0.022544,0.016083,1.0,-0.000304,-0.005653,0.014593,0.009558,0.036533
Kettle,-0.009343,0.034912,0.003525,-0.002181,0.045596,0.001513,0.031613,-0.000304,1.0,0.013271,0.026646,0.009525,0.031175
Lamp,0.210775,0.165345,0.001212,0.023605,0.087837,0.003777,0.44721,-0.005653,0.013271,1.0,0.234802,-0.012703,0.37823


#### Visualize time series data:

In [19]:
house2_dt = house2[['date', 'occupancy', 'Tablet', 'Dishwasher',
       'Air exhaust', 'Fridge', 'Entertainment', 'Freezer', 'Kettle', 'Lamp',
       'Laptops', 'Stove', 'Stereo']].iloc[:10000, :]
house2_dt.iplot(kind = "scatter", x = 'date')


#### Minute and 15-minute data:

In [20]:
group_mins = house2.groupby(np.arange(len(house2))//60)
house2_by_mins = pd.concat((group_mins['date'].first(), group_mins[[c for c in house2.columns if c != 'date']].sum()), axis=1)
house2_by_mins.head()

Unnamed: 0.1,date,Unnamed: 0,occupancy,Tablet,Dishwasher,Air exhaust,Fridge,Entertainment,Freezer,Kettle,Lamp,Laptops,Stove,Stereo
0,2012-06-01 00:00:00,1770,60.0,147.70222,0.0,0.0,50.93534,21.7127,3240.479,0.0,2.23556,0.0,0.0,21.7127
1,2012-06-01 00:01:00,5370,60.0,154.045,0.0,2.23367,22.1458,26.05524,3204.0667,0.0,0.0,0.0,0.0,26.05524
2,2012-06-01 00:02:00,8970,60.0,154.045,0.0,0.0,26.57496,26.05524,3169.7963,0.0,2.23556,0.0,0.0,26.05524
3,2012-06-01 00:03:00,12570,60.0,145.58796,0.0,0.0,19.93122,30.39778,3141.9516,0.0,0.0,0.0,0.0,30.39778
4,2012-06-01 00:04:00,16170,60.0,149.81648,0.0,0.0,24.36038,21.7127,3150.5192,0.0,2.23556,0.0,0.0,21.7127


In [21]:
house2_by_mins[['date', 'occupancy', 'Tablet', 'Dishwasher',
       'Air exhaust', 'Fridge', 'Entertainment', 'Freezer', 'Kettle', 'Lamp',
       'Laptops', 'Stove', 'Stereo']].iloc[0:10000, :].iplot(kind = "scatter", x = 'date')


In [22]:
group_15mins = house2.groupby(np.arange(len(house2))//900)
house2_by_15mins = pd.concat((group_15mins['date'].first(), group_15mins[[c for c in house2.columns if c != 'date']].sum()), axis=1)
house2_by_15mins.head()

Unnamed: 0.1,date,Unnamed: 0,occupancy,Tablet,Dishwasher,Air exhaust,Fridge,Entertainment,Freezer,Kettle,Lamp,Laptops,Stove,Stereo
0,2012-06-01 00:00:00,404550,900.0,2156.13246,0.0,6.70101,372.04944,384.31479,33969.36568,0.0,17.88448,0.0,0.0,384.31479
1,2012-06-01 00:15:00,1214550,900.0,1795.28886,0.0,6.70101,58471.07208,410.37003,21125.7786,0.0,15.64892,0.0,0.0,410.37003
2,2012-06-01 00:30:00,2024550,900.0,959.11232,0.0,4.46734,4215.81695,395.17114,28818.56302,0.0,17.88448,0.0,0.0,395.17114
3,2012-06-01 00:45:00,2834550,900.0,945.82208,0.0,2.23367,367.62028,423.39765,23092.3551,0.0,17.88448,0.0,0.0,423.39765
4,2012-06-01 01:00:00,3644550,900.0,1007.8432,0.0,4.46734,36376.85084,382.14352,34197.4406,0.0,24.59116,0.0,0.0,382.14352


In [23]:
house2_by_15mins[['date', 'occupancy', 'Tablet', 'Dishwasher',
       'Air exhaust', 'Fridge', 'Entertainment', 'Freezer', 'Kettle', 'Lamp',
       'Laptops', 'Stove', 'Stereo']].iplot(kind = "scatter", x = 'date')


### House 3:

In [24]:
house3.head()

Unnamed: 0.1,Unnamed: 0,date,occupancy,Tablet,Freezer,Coffee machine,Fridge,Kettle,Entertainment
0,0,2012-10-23 00:00:00,,-1.0,-1.0,-1.0,-1.0,,
1,1,2012-10-23 00:00:01,,-1.0,-1.0,-1.0,-1.0,,
2,2,2012-10-23 00:00:02,,-1.0,-1.0,-1.0,-1.0,,
3,3,2012-10-23 00:00:03,,-1.0,-1.0,-1.0,-1.0,,
4,4,2012-10-23 00:00:04,,-1.0,-1.0,-1.0,-1.0,,


#### Statistical Characteristics:

In [25]:
house3.describe()

Unnamed: 0.1,Unnamed: 0,occupancy,Tablet,Freezer,Coffee machine,Fridge,Kettle,Entertainment
count,8640000.0,1814400.0,8294400.0,8294400.0,5184000.0,3542400.0,3110400.0,3888000.0
mean,4320000.0,0.743804,1.280892,5.770378,0.2899241,9.513641,12.20957,3.222176
std,2494153.0,0.4365315,1.605,9.244474,21.59075,26.89626,158.7499,14.57987
min,0.0,0.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0
25%,2160000.0,0.0,0.0,0.0,-1.0,-1.0,-1.0,-1.0
50%,4320000.0,1.0,0.0,2.2255,0.0,0.0,0.0,0.0
75%,6479999.0,1.0,2.23857,2.2255,0.0,0.0,0.0,0.0
max,8639999.0,1.0,10.7186,74.3903,1295.99,1211.63,2144.88,69.1188


In [26]:
house3.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8640000 entries, 0 to 8639999
Data columns (total 9 columns):
Unnamed: 0        int64
date              object
occupancy         float64
Tablet            float64
Freezer           float64
Coffee machine    float64
Fridge            float64
Kettle            float64
Entertainment     float64
dtypes: float64(7), int64(1), object(1)
memory usage: 593.3+ MB


#### Correlation:

In [27]:
pd.DataFrame.corr(house3)

Unnamed: 0.1,Unnamed: 0,occupancy,Tablet,Freezer,Coffee machine,Fridge,Kettle,Entertainment
Unnamed: 0,1.0,0.246438,0.071583,0.028032,-0.009354,0.013647,-0.019516,0.000505
occupancy,0.246438,1.0,-0.098691,0.011559,0.025263,0.092494,-0.038727,0.117429
Tablet,0.071583,-0.098691,1.0,0.008209,0.001777,-0.030414,-0.011618,0.041807
Freezer,0.028032,0.011559,0.008209,1.0,0.002975,-0.007825,0.000474,0.006144
Coffee machine,-0.009354,0.025263,0.001777,0.002975,1.0,0.002145,-0.001492,0.00845
Fridge,0.013647,0.092494,-0.030414,-0.007825,0.002145,1.0,0.023615,0.017771
Kettle,-0.019516,-0.038727,-0.011618,0.000474,-0.001492,0.023615,1.0,0.140681
Entertainment,0.000505,0.117429,0.041807,0.006144,0.00845,0.017771,0.140681,1.0


#### Visualize time series data:

In [67]:
house3_dt = house3[['date', 'occupancy', 'Tablet', 'Freezer',
       'Coffee machine', 'Fridge', 'Kettle', 'Entertainment']].iloc[:10000, :]
house3_dt.iplot(kind = "scatter", x = 'date')


#### Minute and 15-minute data:

In [29]:
group_mins = house3.groupby(np.arange(len(house3))//60)
house3_by_mins = pd.concat((group_mins['date'].first(), group_mins[[c for c in house3.columns if c != 'date']].sum()), axis=1)
house3_by_mins.head()

Unnamed: 0.1,date,Unnamed: 0,occupancy,Tablet,Freezer,Coffee machine,Fridge,Kettle,Entertainment
0,2012-10-23 00:00:00,1770,0.0,-60.0,-60.0,-60.0,-60.0,0.0,0.0
1,2012-10-23 00:01:00,5370,0.0,-60.0,-60.0,-60.0,-60.0,0.0,0.0
2,2012-10-23 00:02:00,8970,0.0,-60.0,-60.0,-60.0,-60.0,0.0,0.0
3,2012-10-23 00:03:00,12570,0.0,-60.0,-60.0,-60.0,-60.0,0.0,0.0
4,2012-10-23 00:04:00,16170,0.0,-60.0,-60.0,-60.0,-60.0,0.0,0.0
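
The -60.0 totals above are 60 readings of -1.0 added together, and the 0.0 values in columns that were NaN in the raw data are sums over 60 missing values. One way to keep such gaps visible is sketched below, under the assumption that -1.0 marks a missing reading: mask the marker first, then aggregate with `min_count` so that an all-missing minute stays NaN. The data are made up.

```python
import numpy as np
import pandas as pd

index = pd.to_datetime("2012-10-23") + pd.to_timedelta(np.arange(120), unit="s")
raw = pd.DataFrame({
    "Tablet": np.r_[np.full(60, -1.0), np.full(60, 2.0)],  # first minute is all sentinel
    "Kettle": np.full(120, np.nan),                         # never recorded
}, index=index)

masked = raw.replace(-1.0, np.nan)   # assumed sentinel becomes NaN
naive = raw.resample("1min").sum()
careful = masked.resample("1min").sum(min_count=1)  # NaN unless at least one valid reading

print(naive)
print(careful)
```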


In [30]:
house3_by_mins[['date', 'occupancy', 'Tablet', 'Freezer',
       'Coffee machine', 'Fridge', 'Kettle', 'Entertainment']].iplot(kind = "scatter", x = 'date')


Woah there! Look at all those points! Due to browser limitations, the Plotly SVG drawing functions have a hard time graphing more than 500k data points for line charts, or 40k points for other types of charts. Here are some suggestions:
(1) Use the `plotly.graph_objs.Scattergl` trace object to generate a WebGl graph.
(2) Trying using the image API to return an image instead of a graph URL
(3) Use matplotlib
(4) See if you can create your visualization with fewer data points





In [31]:
group_15mins = house3.groupby(np.arange(len(house3))//900)
house3_by_15mins = pd.concat((group_15mins['date'].first(), group_15mins[[c for c in house3.columns if c != 'date']].sum()), axis=1)
house3_by_15mins.head()

Unnamed: 0.1,date,Unnamed: 0,occupancy,Tablet,Freezer,Coffee machine,Fridge,Kettle,Entertainment
0,2012-10-23 00:00:00,404550,0.0,-900.0,-900.0,-900.0,-900.0,0.0,0.0
1,2012-10-23 00:15:00,1214550,0.0,-900.0,-900.0,-900.0,-900.0,0.0,0.0
2,2012-10-23 00:30:00,2024550,0.0,-900.0,-900.0,-900.0,-900.0,0.0,0.0
3,2012-10-23 00:45:00,2834550,0.0,-900.0,-900.0,-900.0,-900.0,0.0,0.0
4,2012-10-23 01:00:00,3644550,0.0,-900.0,-900.0,-900.0,-900.0,0.0,0.0


In [32]:
house3_by_15mins[['date', 'occupancy', 'Tablet', 'Freezer',
       'Coffee machine', 'Fridge', 'Kettle', 'Entertainment']].iplot(kind = "scatter", x = 'date')


### House 4:

In [33]:
house4.head()

Unnamed: 0.1,Unnamed: 0,date,occupancy,Fridge,Kitchen appliances,Lamp,Stereo and laptop,Freezer,Tablet,Entertainment,Microwave
0,0,2012-06-27 00:00:00,,102.429,2.16516,2.23978,15.0524,172.72,0.0,10.7178,4.34694
1,1,2012-06-27 00:00:01,,100.296,2.16516,2.23978,15.0524,170.589,0.0,10.7178,2.23214
2,2,2012-06-27 00:00:02,,102.429,0.0,0.0,15.0524,172.72,0.0,10.7178,4.34694
3,3,2012-06-27 00:00:03,,102.429,0.0,0.0,15.0524,172.72,2.22889,10.7178,4.34694
4,4,2012-06-27 00:00:04,,100.296,2.16516,2.23978,15.0524,172.72,0.0,10.7178,2.23214


#### Statistical Characteristics:

In [34]:
house4.describe()

Unnamed: 0.1,Unnamed: 0,occupancy,Fridge,Kitchen appliances,Lamp,Stereo and laptop,Freezer,Tablet,Entertainment,Microwave
count,18144000.0,7430400.0,16675200.0,16675200.0,14688000.0,14601600.0,16588800.0,16243200.0,15984000.0,16675200.0
mean,9072000.0,0.9335496,27.03878,9.4107,10.56615,12.19372,168.149,1.218688,31.43746,15.82864
std,5237722.0,0.2490678,44.81938,99.32784,27.93486,14.28696,106.3362,4.544799,41.81098,131.1163
min,0.0,0.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0
25%,4536000.0,1.0,0.0,0.0,0.0,0.0,93.8658,0.0,10.7178,2.23214
50%,9072000.0,1.0,0.0,0.0,0.0,12.9145,176.982,0.0,10.7178,4.34694
75%,13608000.0,1.0,87.4998,0.0,2.23978,15.0524,260.085,2.22889,42.5753,4.34694
max,18144000.0,1.0,1174.22,2331.24,867.52,149.726,3168.81,1564.99,223.067,1594.67


In [35]:
house4.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 18144000 entries, 0 to 18143999
Data columns (total 11 columns):
Unnamed: 0            int64
date                  object
occupancy             float64
Fridge                float64
Kitchen appliances    float64
Lamp                  float64
Stereo and laptop     float64
Freezer               float64
Tablet                float64
Entertainment         float64
Microwave             float64
dtypes: float64(9), int64(1), object(1)
memory usage: 1.5+ GB


#### Correlation:

In [36]:
pd.DataFrame.corr(house4)

Unnamed: 0.1,Unnamed: 0,occupancy,Fridge,Kitchen appliances,Lamp,Stereo and laptop,Freezer,Tablet,Entertainment,Microwave
Unnamed: 0,1.0,0.048192,-0.064277,0.015337,0.333668,-0.33585,-0.396736,0.059188,0.02143,0.013133
occupancy,0.048192,1.0,2.5e-05,0.02475,-0.003765,0.054727,-0.03066,-0.034001,0.121855,0.026881
Fridge,-0.064277,2.5e-05,1.0,0.014399,-0.003508,0.05186,0.056912,0.002854,0.021206,0.017328
Kitchen appliances,0.015337,0.02475,0.014399,1.0,0.018169,0.00718,0.002179,-0.001926,0.009854,0.090353
Lamp,0.333668,-0.003765,-0.003508,0.018169,1.0,-0.146481,0.048068,0.032655,0.03799,0.01805
Stereo and laptop,-0.33585,0.054727,0.05186,0.00718,-0.146481,1.0,0.075864,0.009398,0.050456,0.011214
Freezer,-0.396736,-0.03066,0.056912,0.002179,0.048068,0.075864,1.0,-0.023911,0.079761,0.006758
Tablet,0.059188,-0.034001,0.002854,-0.001926,0.032655,0.009398,-0.023911,1.0,0.003521,0.002853
Entertainment,0.02143,0.121855,0.021206,0.009854,0.03799,0.050456,0.079761,0.003521,1.0,0.025523
Microwave,0.013133,0.026881,0.017328,0.090353,0.01805,0.011214,0.006758,0.002853,0.025523,1.0


#### Visualize time series data:

In [37]:
house4_dt = house4[['date', 'occupancy', 'Fridge', 'Kitchen appliances',
       'Lamp', 'Stereo and laptop', 'Freezer', 'Tablet', 'Entertainment',
       'Microwave']].iloc[:10000, :]
house4_dt.iplot(kind = "scatter", x = 'date')


#### Minute and 15-minute data:

In [38]:
group_mins = house4.groupby(np.arange(len(house4))//60)
house4_by_mins = pd.concat((group_mins['date'].first(), group_mins[[c for c in house4.columns if c != 'date']].sum()), axis=1)
house4_by_mins.head()

Unnamed: 0.1,date,Unnamed: 0,occupancy,Fridge,Kitchen appliances,Lamp,Stereo and laptop,Freezer,Tablet,Entertainment,Microwave
0,2012-06-27 00:00:00,1770,0.0,6090.282,21.6516,47.03538,886.0408,10318.449,57.95114,643.068,229.0944
1,2012-06-27 00:01:00,5370,0.0,6182.001,54.129,26.87736,853.9723,10307.794,78.01115,643.068,197.3724
2,2012-06-27 00:02:00,8970,0.0,6135.075,17.32128,58.23428,845.4207,10318.449,64.63781,643.06802,182.5688
3,2012-06-27 00:03:00,12570,0.0,6049.756,41.13804,38.07626,862.5239,10260.912,69.09559,640.94412,222.75
4,2012-06-27 00:04:00,16170,0.0,5977.2425,45.46836,20.15802,881.765,10243.864,35.66224,643.068,226.9796


In [39]:
house4_by_mins[['date', 'occupancy', 'Fridge', 'Kitchen appliances',
       'Lamp', 'Stereo and laptop', 'Freezer', 'Tablet', 'Entertainment',
       'Microwave']].iloc[0:10000, :].iplot(kind = "scatter", x = 'date')


In [40]:
group_15mins = house4.groupby(np.arange(len(house4))//900)
house4_by_15mins = pd.concat((group_15mins['date'].first(), group_15mins[[c for c in house4.columns if c != 'date']].sum()), axis=1)
house4_by_15mins.head()

Unnamed: 0.1,date,Unnamed: 0,occupancy,Fridge,Kitchen appliances,Lamp,Stereo and laptop,Freezer,Tablet,Entertainment,Microwave
0,2012-06-27 00:00:00,404550,0.0,87041.9312,480.66552,598.02126,13057.5809,153738.938,833.60486,9624.78126,3263.0024
1,2012-06-27 00:15:00,1214550,0.0,5615.41787,452.51844,602.50082,13004.1334,227019.611,793.48484,9609.91412,3165.7216
2,2012-06-27 00:30:00,2024550,0.0,85.15533,474.17004,604.7406,13029.7882,203782.105,521.56026,9648.14398,3193.214
3,2012-06-27 00:45:00,2834550,0.0,39.30246,441.69264,562.18478,13012.685,120148.5577,976.25382,9633.27676,3250.3136
4,2012-06-27 01:00:00,3644550,0.0,86007.6521,474.17004,582.3428,13066.1325,158572.046,846.9782,9648.144,3218.5916


In [41]:
house4_by_15mins[['date', 'occupancy', 'Fridge', 'Kitchen appliances',
       'Lamp', 'Stereo and laptop', 'Freezer', 'Tablet', 'Entertainment',
       'Microwave']].iplot(kind = "scatter", x = 'date')


### House 5:

In [42]:
house5.head()

Unnamed: 0.1,Unnamed: 0,date,occupancy,Tablet,Coffee machine,Fountain,Microwave,Fridge,Entertainment,Kettle
0,0,2012-06-27 00:00:00,,2.20778,4.48706,8.72041,4.44332,4.44546,6.56679,
1,1,2012-06-27 00:00:01,,2.20778,2.3477,8.72041,4.44332,4.44546,8.69303,
2,2,2012-06-27 00:00:02,,4.33249,4.48706,8.72041,6.57853,4.44546,8.69303,
3,3,2012-06-27 00:00:03,,4.33249,4.48706,8.72041,4.44332,4.44546,6.56679,
4,4,2012-06-27 00:00:04,,2.20778,2.3477,8.72041,4.44332,4.44546,8.69303,


#### Statistical Characteristics:

In [43]:
house5.describe()

Unnamed: 0.1,Unnamed: 0,occupancy,Tablet,Coffee machine,Fountain,Microwave,Fridge,Entertainment,Kettle
count,18835200.0,6393600.0,18748800.0,18748800.0,6134400.0,18748800.0,18748800.0,16502400.0,2160000.0
mean,9417600.0,0.9008222,4.598005,5.523087,11.93648,8.713093,45.32431,24.66514,0.2405304
std,5437254.0,0.2989006,1.267522,83.33408,9.892267,84.31363,58.22051,55.80181,21.90341
min,0.0,0.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0
25%,4708800.0,1.0,4.33249,0.0,8.72041,4.44332,4.44546,2.31429,0.0
50%,9417600.0,1.0,4.33249,0.0,8.72041,4.44332,4.44546,6.56679,0.0
75%,14126400.0,1.0,4.33249,0.0,8.72041,6.57853,112.8,8.69303,0.0
max,18835200.0,1.0,14.9559,1581.54,45.0991,2680.55,1341.41,274.43,2253.18


In [44]:
house5.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 18835200 entries, 0 to 18835199
Data columns (total 10 columns):
Unnamed: 0        int64
date              object
occupancy         float64
Tablet            float64
Coffee machine    float64
Fountain          float64
Microwave         float64
Fridge            float64
Entertainment     float64
Kettle            float64
dtypes: float64(8), int64(1), object(1)
memory usage: 1.4+ GB


#### Correlation:

In [45]:
pd.DataFrame.corr(house5)

Unnamed: 0.1,Unnamed: 0,occupancy,Tablet,Coffee machine,Fountain,Microwave,Fridge,Entertainment,Kettle
Unnamed: 0,1.0,-0.095775,-0.034843,0.000702,0.017279,-0.015699,-0.050267,0.065159,-0.006182
occupancy,-0.095775,1.0,-0.048231,0.018819,0.081846,0.01414,0.010938,0.076323,
Tablet,-0.034843,-0.048231,1.0,0.005634,0.03179,0.012375,0.066179,0.049081,0.00799
Coffee machine,0.000702,0.018819,0.005634,1.0,-0.01729,0.003413,0.007602,-0.014424,0.000201
Fountain,0.017279,0.081846,0.03179,-0.01729,1.0,-0.01681,0.008741,0.116845,
Microwave,-0.015699,0.01414,0.012375,0.003413,-0.01681,1.0,0.006285,-0.00926,-0.00052
Fridge,-0.050267,0.010938,0.066179,0.007602,0.008741,0.006285,1.0,0.009198,-0.003141
Entertainment,0.065159,0.076323,0.049081,-0.014424,0.116845,-0.00926,0.009198,1.0,-0.004216
Kettle,-0.006182,,0.00799,0.000201,,-0.00052,-0.003141,-0.004216,1.0


#### Visualize time series data:

In [65]:
house5_dt = house5[['date', 'occupancy', 'Tablet', 'Coffee machine',
       'Fountain', 'Microwave', 'Fridge', 'Entertainment', 'Kettle']].iloc[:10000, :]
house5_dt.iplot(kind = "scatter", x = 'date')


#### Minute and 15-minute data:

In [49]:
group_mins = house5.groupby(np.arange(len(house5))//60)
house5_by_mins = pd.concat((group_mins['date'].first(), group_mins[[c for c in house5.columns if c != 'date']].sum()), axis=1)
house5_by_mins.head()

Unnamed: 0.1,date,Unnamed: 0,occupancy,Tablet,Coffee machine,Fountain,Microwave,Fridge,Entertainment,Kettle
0,2012-06-27 00:00:00,1770,0.0,194.08339,245.69064,510.3846,319.97945,273.10203,464.17332,0.0
1,2012-06-27 00:01:00,5370,0.0,189.83397,239.27256,503.9646,322.11466,270.97722,472.67828,0.0
2,2012-06-27 00:02:00,8970,0.0,211.08107,249.96936,506.1046,309.3034,270.97722,470.55204,0.0
3,2012-06-27 00:03:00,12570,0.0,200.45752,239.27256,503.9646,319.97945,275.22684,457.7946,0.0
4,2012-06-27 00:04:00,16170,0.0,206.83165,239.27256,499.6846,319.97945,273.10203,455.66836,0.0


In [50]:
house5_by_mins[['date', 'occupancy', 'Tablet', 'Coffee machine',
       'Fountain', 'Microwave', 'Fridge', 'Entertainment', 'Kettle']].iloc[0:10000, :].iplot(kind = "scatter", x = 'date')


In [51]:
group_15mins = house5.groupby(np.arange(len(house5))//900)
house5_by_15mins = pd.concat((group_15mins['date'].first(), group_15mins[[c for c in house5.columns if c != 'date']].sum()), axis=1)
house5_by_15mins.head()

Unnamed: 0.1,date,Unnamed: 0,occupancy,Tablet,Coffee machine,Fountain,Microwave,Fridge,Entertainment,Kettle
0,2012-06-27 00:00:00,404550,0.0,2996.23925,3621.1788,7578.729,4752.71713,4109.27931,6909.4438,0.0
1,2012-06-27 00:15:00,1214550,0.0,2957.99447,3621.1788,7570.169,4767.6636,4143.27627,6856.2878,0.0
2,2012-06-27 00:30:00,2024550,0.0,2943.1215,3621.1788,7625.809,4769.79881,28627.50803,6860.54028,0.0
3,2012-06-27 00:45:00,2834550,0.0,2943.1215,3606.20328,7619.389,4816.77343,107712.709,6896.68636,0.0
4,2012-06-27 01:00:00,3644550,0.0,3019.61106,3648.99048,7619.389,4825.31427,76826.0529,6847.7829,0.0


In [52]:
house5_by_15mins[['date', 'occupancy', 'Tablet', 'Coffee machine',
       'Fountain', 'Microwave', 'Fridge', 'Entertainment', 'Kettle']].iplot(kind = "scatter", x = 'date')


### House 6:

In [53]:
house6.head()

Unnamed: 0.1,Unnamed: 0,date,Lamp,Laptop,Router,Coffee machine,Entertainment,Fridge,Kettle
0,0,2012-06-27 00:00:00,0.0,4.35384,19.3387,0.0,15.0043,2.19884,0.0
1,1,2012-06-27 00:00:01,0.0,4.35384,19.3387,0.0,15.0043,2.19884,0.0
2,2,2012-06-27 00:00:02,0.0,6.47995,19.3387,0.0,15.0043,0.0,0.0
3,3,2012-06-27 00:00:03,0.0,6.47995,19.3387,0.0,15.0043,0.0,0.0
4,4,2012-06-27 00:00:04,0.0,4.35384,19.3387,0.0,15.0043,0.0,0.0


#### Statistical Characteristics:

In [54]:
house6.describe()

Unnamed: 0.1,Unnamed: 0,Lamp,Laptop,Router,Coffee machine,Entertainment,Fridge,Kettle
count,18835200.0,14256000.0,15897600.0,7603200.0,15379200.0,15552000.0,15379200.0,12700800.0
mean,9417600.0,-0.08376633,6.586103,19.17793,3.643533,23.41736,10.1145,2.354713
std,5437254.0,2.567623,6.320948,3.825831,43.75191,21.8664,31.81019,71.43318
min,0.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0
25%,4708800.0,-1.0,4.35384,19.3387,0.0,15.0043,0.0,0.0
50%,9417600.0,0.0,6.47995,19.3387,0.0,15.0043,0.0,0.0
75%,14126400.0,0.0,6.47995,19.3387,0.0,21.3979,2.19884,0.0
max,18835200.0,47.2993,93.6445,27.8679,1285.58,153.516,1128.58,2122.29


In [55]:
house6.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 18835200 entries, 0 to 18835199
Data columns (total 9 columns):
Unnamed: 0        int64
date              object
Lamp              float64
Laptop            float64
Router            float64
Coffee machine    float64
Entertainment     float64
Fridge            float64
Kettle            float64
dtypes: float64(7), int64(1), object(1)
memory usage: 1.3+ GB


#### Correlation:

In [56]:
pd.DataFrame.corr(house6)

Unnamed: 0.1,Unnamed: 0,Lamp,Laptop,Router,Coffee machine,Entertainment,Fridge,Kettle
Unnamed: 0,1.0,0.072416,-0.03379,-0.055584,0.002007,0.02326,-0.028748,-0.002645
Lamp,0.072416,1.0,0.022549,0.076244,0.018302,0.089889,0.010271,0.020491
Laptop,-0.03379,0.022549,1.0,0.193302,0.027232,0.136001,0.021443,0.007476
Router,-0.055584,0.076244,0.193302,1.0,0.018071,0.205047,0.070357,0.008394
Coffee machine,0.002007,0.018302,0.027232,0.018071,1.0,-0.014171,2.8e-05,0.009734
Entertainment,0.02326,0.089889,0.136001,0.205047,-0.014171,1.0,0.020155,0.028553
Fridge,-0.028748,0.010271,0.021443,0.070357,2.8e-05,0.020155,1.0,0.001574
Kettle,-0.002645,0.020491,0.007476,0.008394,0.009734,0.028553,0.001574,1.0


#### Visualize time series data:

In [57]:
house6_dt = house6[['date', 'Lamp', 'Laptop', 'Router', 'Coffee machine',
       'Entertainment', 'Fridge', 'Kettle']].iloc[:10000, :]
house6_dt.iplot(kind = "scatter", x = 'date')


#### Minute and 15-minute data:

In [58]:
group_mins = house6.groupby(np.arange(len(house6))//60)
house6_by_mins = pd.concat((group_mins['date'].first(), group_mins[[c for c in house6.columns if c != 'date']].sum()), axis=1)
house6_by_mins.head()

Unnamed: 0.1,date,Unnamed: 0,Lamp,Laptop,Router,Coffee machine,Entertainment,Fridge,Kettle
0,2012-06-27 00:00:00,1770,0.0,325.0137,1190.1742,0.0,891.7332,24.18724,0.0
1,2012-06-27 00:01:00,5370,0.0,331.39203,1202.968,0.0,1017.4737,19.78956,0.0
2,2012-06-27 00:02:00,8970,0.0,329.26592,1207.2326,0.0,1104.8532,24.18724,0.0
3,2012-06-27 00:03:00,12570,0.0,322.88759,1200.8357,0.0,889.602,21.9884,0.0
4,2012-06-27 00:04:00,16170,0.0,339.89647,1205.1003,0.0,891.7332,13.19304,0.0


In [59]:
house6_by_mins[['date', 'Lamp', 'Laptop', 'Router', 'Coffee machine',
       'Entertainment', 'Fridge', 'Kettle']].iloc[0: 10000, :].iplot(kind = "scatter", x = 'date')


In [60]:
group_15mins = house6.groupby(np.arange(len(house6))//900)
house6_by_15mins = pd.concat((group_15mins['date'].first(), group_15mins[[c for c in house6.columns if c != 'date']].sum()), axis=1)
house6_by_15mins.head()

Unnamed: 0.1,date,Unnamed: 0,Lamp,Laptop,Router,Coffee machine,Entertainment,Fridge,Kettle
0,2012-06-27 00:00:00,404550,0.0,4962.37601,18040.2554,0.0,13714.8585,294.64456,0.0
1,2012-06-27 00:15:00,1214550,0.0,4943.24102,18080.7691,0.0,13437.8028,12723.95104,0.0
2,2012-06-27 00:30:00,2024550,0.0,4924.10603,18061.5784,0.0,13576.3308,5344.43652,0.0
3,2012-06-27 00:45:00,2834550,0.0,4953.87157,18076.5045,0.0,13589.118,7839.97874,0.0
4,2012-06-27 01:00:00,3644550,0.0,4921.97992,18114.8859,0.0,13616.8236,12930.33968,0.0


In [61]:
house6_by_15mins[['date', 'Lamp', 'Laptop', 'Router', 'Coffee machine',
       'Entertainment', 'Fridge', 'Kettle']].iplot(kind = "scatter", x = 'date')


As a Data Scientist, I want to understand the time series algorithms and see how time series forecasting differs from traditional regression models.

To Do:
- Choose one feature from any house. Preferably, the whole team should choose the same feature so that everyone has the same sample data for comparing accuracy metrics.
- AR models
- MA models
- ARIMA models
- Holt-Winters exponential smoothing model

In [62]:
#import statsmodels as sm
#sm.tsa.ar_model.AR(np.asarray(house1), missing='drop')
#AR(house1['Kettle'], house1['date'])
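
The commented-out calls above were left unfinished. As a library-independent illustration of what an AR model does, the sketch below fits an AR(p) model by ordinary least squares on lagged values and iterates it forward. `fit_ar` and `forecast_ar` are hypothetical helpers, the data are simulated, and a library implementation may estimate the coefficients differently.

```python
import numpy as np

def fit_ar(series, order):
    """Fit y[t] = c + sum_k phi[k] * y[t-1-k] by ordinary least squares."""
    y = np.asarray(series, dtype=float)
    # Column k holds the series lagged by k+1 steps, aligned with targets y[order:].
    lags = np.column_stack([y[order - k - 1 : len(y) - k - 1] for k in range(order)])
    design = np.column_stack([np.ones(len(lags)), lags])
    coeffs, *_ = np.linalg.lstsq(design, y[order:], rcond=None)
    return coeffs[0], coeffs[1:]           # intercept, lag coefficients

def forecast_ar(series, intercept, phi, steps):
    """Iterate the fitted recursion forward, feeding forecasts back in as inputs."""
    history = list(np.asarray(series, dtype=float))
    for _ in range(steps):
        recent = history[-1 : -len(phi) - 1 : -1]     # most recent value first
        history.append(intercept + float(np.dot(phi, recent)))
    return history[-steps:]

# Simulated AR(1) process with intercept 2.0 and coefficient 0.8.
rng = np.random.default_rng(42)
y = [10.0]
for _ in range(2000):
    y.append(2.0 + 0.8 * y[-1] + rng.normal(scale=0.5))

intercept, phi = fit_ar(y, order=1)
print(intercept, phi)
print(forecast_ar(y, intercept, phi, steps=3))
```

Unlike a traditional regression on external features, the only inputs here are the series' own past values, which is the difference the user story above asks about.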

In [63]:
#sm.tsa.arima_model.ARMA(house1, missing='drop', freq='D')

In [64]:
#sm.tsa.holtwinters.Holt(house1)
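
`Holt` in the commented cell refers to Holt's linear-trend exponential smoothing. The dependency-free sketch below shows the level and trend recursion on a toy series. The `holt_forecast` helper, the initialization, and the smoothing parameters are illustrative assumptions, and the seasonal component of the full Holt-Winters method is omitted.

```python
def holt_forecast(values, alpha, beta, steps):
    """Holt's linear-trend smoothing: returns forecasts for the next `steps` points."""
    if len(values) < 2:
        raise ValueError("need at least two observations")
    level = float(values[0])
    trend = float(values[1]) - float(values[0])     # simple initial trend estimate
    for observation in values[1:]:
        previous_level = level
        # Level: blend the new observation with the previous one-step forecast.
        level = alpha * observation + (1 - alpha) * (level + trend)
        # Trend: blend the latest change in level with the previous trend.
        trend = beta * (level - previous_level) + (1 - beta) * trend
    return [level + (h + 1) * trend for h in range(steps)]

history = [10.0, 12.0, 14.0, 16.0, 18.0]    # perfectly linear toy series
forecasts = holt_forecast(history, alpha=0.5, beta=0.3, steps=3)
print(forecasts)
```

For a perfectly linear input the forecasts continue the same line; on noisy data, `alpha` and `beta` control how quickly the level and trend react to new observations.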