## Warning!
This notebook is an experimental scratchpad for learning with the data from the ASHRAE Kaggle competition. It is not part of the final project and will be removed from the repository.

### Training data sets structure

**train.csv**
* `building_id` - Foreign key for the building metadata.
* `meter` - The meter id code. Read as `{0: electricity, 1: chilledwater, 2: steam, 3: hotwater}`. Not every building has all meter types.
* `timestamp`  - When the measurement was taken
* `meter_reading` - The target variable. Energy consumption in kWh (or equivalent). Note that this is real data with measurement error, which we expect will impose a baseline level of modeling error.
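
As a small illustration of the meter codes listed above, the sketch below maps them to readable names on a made-up frame with the same columns as `train.csv`. The names `METER_NAMES` and `toy_train`, and all values, are assumptions for the example only.

```python
import pandas as pd

# Meter codes as documented in the column description above.
METER_NAMES = {0: "electricity", 1: "chilledwater", 2: "steam", 3: "hotwater"}

# Toy rows shaped like train.csv (values are invented for illustration).
toy_train = pd.DataFrame({
    "building_id": [0, 0, 1],
    "meter": [0, 1, 0],
    "timestamp": ["2016-01-01 00:00:00"] * 3,
    "meter_reading": [10.5, 3.2, 7.0],
})

# Translate the numeric code into its name.
toy_train["meter_name"] = toy_train["meter"].map(METER_NAMES)

# Not every building has every meter type: list the ones each building has.
meters_per_building = toy_train.groupby("building_id")["meter_name"].unique()
print(meters_per_building)
```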

**building_metadata.csv**
* `site_id` - Foreign key for the weather files.
* `building_id` - Foreign key for `train.csv`
* `primary_use` - Indicator of the primary category of activities for the building based on EnergyStar property type definitions
* `square_feet` - Gross floor area of the building
* `year_built` - Year building was opened
* `floor_count` - Number of floors of the building

**weather_[train/test].csv**

Weather data from a meteorological station as close as possible to the site.

* `site_id`
* `air_temperature` - Degrees Celsius
* `cloud_coverage` - Portion of the sky covered in clouds, in oktas
* `dew_temperature` - Degrees Celsius
* `precip_depth_1_hr` - Millimeters
* `sea_level_pressure` - Millibar/hectopascals
* `wind_direction` - Compass direction (0-360)
* `wind_speed` - Meters per second

## Loading data

In [29]:
%load_ext autoreload
%autoreload 2
import pandas as pd
import numpy as np
import datetime as dt
import gc
from src.functions import utils as utl
import matplotlib.pyplot as plt
import seaborn as sns
import missingno as msno

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [None]:
# Importing data
train = utl.import_data('../data/raw/train.csv') 


In [None]:
building_meta = utl.import_data('../data/raw/building_metadata.csv')

In [30]:
weather_train = utl.import_data('../data/raw/weather_train.csv')

Memory usage of dataframe is 9.60 MB
Memory usage after optimization is: 2.65 MB
Decreased by 72.4%
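
The implementation of `utl.import_data` is not shown in this notebook. A memory reduction like the one reported above is usually obtained by storing numeric columns in smaller dtypes. The following is a minimal sketch of that idea, not the project's helper: `downcast_numeric` and the `demo` frame are hypothetical, and `pd.to_numeric` never goes below `float32`, whereas the `float16` outputs later in this notebook suggest the real helper downcasts further.

```python
import numpy as np
import pandas as pd

def downcast_numeric(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy whose numeric columns use the smallest dtype pandas will pick."""
    out = df.copy()
    for col in out.select_dtypes(include=[np.integer]).columns:
        out[col] = pd.to_numeric(out[col], downcast="integer")
    for col in out.select_dtypes(include=[np.floating]).columns:
        out[col] = pd.to_numeric(out[col], downcast="float")
    return out

# Invented data standing in for a weather table.
demo = pd.DataFrame({
    "site_id": np.arange(1000, dtype="int64") % 16,
    "air_temperature": np.linspace(-10.0, 35.0, 1000),
})

before = demo.memory_usage(deep=True).sum()
small = downcast_numeric(demo)
after = small.memory_usage(deep=True).sum()
print(f"{before} bytes -> {after} bytes")
print(small.dtypes)
```

Smaller float dtypes lose precision, so the trade-off is worth checking on columns where small differences matter.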


In [None]:
weather_train.to_csv('../data/interim/weather.csv')

## Wrangling


#### Data set `building_meta` 

In [None]:
building_meta.shape

In [None]:
building_meta.head()

In [None]:
building_meta.info()

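The `year_built` feature is stored as `float`. We cast it to `str` and strip the trailing `.0`. Note that missing values become the string `'nan'` after the cast, so `isna()` no longer detects them.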

In [None]:
building_meta['year_built'] = building_meta['year_built'].astype(str, errors='ignore')

In [None]:
# Strip only a trailing '.0' so that no other characters are touched
building_meta['year_built'] = building_meta['year_built'].str.replace(r'\.0$', '', regex=True)

In [None]:
building_meta.head()
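
An alternative to the string replacement, sketched on a made-up Series: routing the cast through pandas' nullable integer dtype drops the decimal part and keeps missing years as real missing values instead of the text `'nan'`. The Series below is invented, and the approach assumes every non-missing year is a whole number.

```python
import numpy as np
import pandas as pd

# Invented values standing in for building_meta['year_built'].
year_built = pd.Series([2008.0, np.nan, 1991.0], name="year_built")

# Plain str cast: NaN turns into the literal text 'nan'.
plain_cast = year_built.astype(str)

# Nullable integer first, then nullable string: 2008.0 -> "2008", NaN stays missing.
as_text = year_built.astype("Int64").astype("string")

print(plain_cast.tolist())
print(as_text.tolist())
```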

**Missing values**

In [None]:
building_meta.isna().sum()

In [None]:
building_meta[building_meta['floor_count'].isna()].tail()

**Duplicated observations**

In [None]:
# Number of fully duplicated rows
building_meta.duplicated().sum()
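
A toy illustration of two ways of inspecting duplicates: summing the boolean mask returned by `duplicated()` counts the repeated rows, while summing the filtered frame adds up the values inside those rows. The frame `toy` is invented for the example.

```python
import pandas as pd

# Invented frame in which the row (2, 250) appears twice.
toy = pd.DataFrame({
    "building_id": [1, 2, 2, 3],
    "square_feet": [100, 250, 250, 400],
})

# Boolean mask: True for every repeat of an earlier row. Its sum is the count.
n_duplicates = int(toy.duplicated().sum())

# Filtering first and then summing adds up the column values of the repeated rows.
column_sums = toy[toy.duplicated()].sum()

print(n_duplicates)
print(column_sums)
```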

#### Data set `train`

In [None]:
train.shape

In [None]:
train.head()

In [None]:
train.info()

In [None]:
train.isna().sum()

In [None]:
# Number of fully duplicated rows
train.duplicated().sum()

#### Data set `weather_train`

In [None]:
weather_train.tail()

In [None]:
weather_train.info()

In [None]:
weather_train.isna().sum()

In [None]:
# Number of fully duplicated rows
weather_train.duplicated().sum()

#### Merging data sets

In [None]:
# train + building by FK 'building_id'
merge_1 = pd.merge(train, building_meta, how='left', on='building_id')

In [None]:
merge_1.head()

In [None]:
df = pd.merge(merge_1, weather_train, how='left', on=['site_id','timestamp'])

In [None]:
df.head()
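
The two left joins can be sketched on toy frames. The `validate="many_to_one"` argument is an addition that is not used in this notebook: it makes pandas raise an error if the right-hand keys are not unique, which would otherwise silently multiply rows. All frames and values below are invented.

```python
import pandas as pd

toy_train = pd.DataFrame({
    "building_id": [0, 0, 1],
    "timestamp": ["2016-01-01 00:00:00", "2016-01-01 01:00:00", "2016-01-01 00:00:00"],
    "meter_reading": [10.0, 11.0, 5.0],
})
toy_meta = pd.DataFrame({"building_id": [0, 1], "site_id": [0, 1]})
# Weather is only available for site 0 in this toy example.
toy_weather = pd.DataFrame({
    "site_id": [0, 0],
    "timestamp": ["2016-01-01 00:00:00", "2016-01-01 01:00:00"],
    "air_temperature": [25.0, 24.4],
})

# Step 1: attach building metadata by building_id.
step_1 = toy_train.merge(toy_meta, how="left", on="building_id", validate="many_to_one")

# Step 2: attach weather by site and timestamp (both key columns must share a dtype).
merged = step_1.merge(
    toy_weather, how="left", on=["site_id", "timestamp"], validate="many_to_one"
)
print(merged)
```

A left join keeps every meter reading, so readings without a matching weather row end up with `NaN` in the weather columns.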

In [None]:
# Free the intermediate frames; weather_train is kept because later cells still use it
del train, building_meta, merge_1

In [None]:
gc.collect()

In [None]:
# Saving as csv file 
# df.to_csv('../data/processed/df_merged')

In [None]:
df.info()

In [None]:
df[df['floor_count'].notnull()]['floor_count'].head()

In [None]:
(df.isna().sum()/df.shape[0])*100

In [None]:
# Rearranging columns
cols = df.columns.tolist()

In [None]:
cols = [
    'site_id', 
    'building_id', 
    'year_built', 
    'primary_use', 
    'floor_count', 
    'meter', 
    'timestamp', 
    'air_temperature', 
    'cloud_coverage',
    'dew_temperature',
    'precip_depth_1_hr',
    'sea_level_pressure',
    'wind_direction',
    'wind_speed',
    'meter_reading'
]

In [None]:
df = df[cols]

In [None]:
df.head()

In [None]:
df['timestamp'] = pd.to_datetime(df['timestamp'])

In [None]:
df['timestamp'].describe()

## Data quality assessment and profiling

### Handling missing values

In [None]:
df.isna().sum()

#### Variable `floor_count`

There are more than 16 million `NaN` values in `floor_count`. They are likely to be buildings with only a ground floor. We fill these missing values with `0` and, while we are at it, cast the variable from `float` to `int`.

In [None]:
# df['floor_count'] = df['floor_count'].fillna(0.0).astype(int)

In [None]:
# df.head()

In [None]:
# df.isna().sum()

#### Remaining features

Let's visualize the distribution of missing values using a sample of the `df` data set, since the full data set is very large and may cause memory errors:

In [None]:
df_sample = df.sample(500000, replace=False, random_state=666)

In [None]:
df_sample.to_csv('../data/interim/df_sample.csv')

In [None]:
msno.matrix(df_sample)

In [None]:
msno.bar(df_sample)

In [None]:
msno.heatmap(df_sample)

There is a high correlation in the distribution of `NaN` values among `air_temperature`, `dew_temperature` and `wind_speed`. There is also a significant correlation in the distribution of missing values for `precip_depth_1_hr` and `sea_level_pressure`.

The feature with the most missing values is `cloud_coverage`, and its `NaN` values do not seem to correlate with those of any other feature.
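
The nullity correlations described here can also be computed as numbers rather than read off a plot. The sketch below correlates the missing/present indicators of each column, which is, as far as I know, the quantity a nullity heatmap displays. The frame `toy` is invented for the example.

```python
import numpy as np
import pandas as pd

# Invented data: air and dew temperature are always missing together.
toy = pd.DataFrame({
    "air_temperature": [1.0, np.nan, 3.0, np.nan, 5.0, 6.0],
    "dew_temperature": [0.5, np.nan, 2.0, np.nan, 4.0, 5.0],
    "cloud_coverage": [np.nan, 2.0, np.nan, 4.0, 4.0, np.nan],
})

is_missing = toy.isna()
# Columns that are fully present or fully missing have no variance,
# so their correlation is undefined; drop them before correlating.
is_missing = is_missing.loc[:, is_missing.nunique() > 1]

# 1 means "always missing together", -1 means "missing when the other is present".
nullity_corr = is_missing.astype(int).corr()
print(nullity_corr)
```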

We need a little more information to decide what to do with the missing values. A correlation plot between features could be a good guide. We're only interested in the weather variables, as the remaining features, apart from `floor_count` discussed above, show no `NaN` values.

### Missing values analysis in `weather_train` data set

In [None]:
# Percentage of missing values per column
(weather_train.isna().sum()/weather_train.shape[0]) * 100

In [None]:
# weather group by site
weather_by_site = weather_train.groupby('site_id')

In [None]:
weather_by_site.count()

In [None]:
weather_by_site.head()

In [None]:
site_list = weather_train['site_id'].unique()

for site in site_list:
    site_rows = weather_train.loc[weather_train['site_id'] == site, :]
    print('SITE {} ============'.format(site))
    # Percentage of missing values per column for this site
    print(site_rows.isna().mean() * 100)
    print('\n')
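
The same per-site percentages can be collected into a single table instead of being printed site by site. This is a sketch on invented data; `toy_weather` and `missing_pct_by_site` are names made up for the example.

```python
import numpy as np
import pandas as pd

toy_weather = pd.DataFrame({
    "site_id": [0, 0, 1, 1],
    "air_temperature": [20.0, np.nan, 5.0, 6.0],
    "cloud_coverage": [np.nan, np.nan, 2.0, np.nan],
})

# The mean of a boolean "is missing" column is the fraction of missing values,
# so grouping the indicators by site gives one row of percentages per site.
missing_pct_by_site = (
    toy_weather.drop(columns="site_id")
    .isna()
    .groupby(toy_weather["site_id"])
    .mean()
    * 100
)
print(missing_pct_by_site)
```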

In [None]:
weather_train.loc[weather_train['site_id'] == 3, :].shape[0]

In [None]:
for site in site_list:
    msno.matrix(weather_train.loc[weather_train['site_id'] == site, :], figsize=(9,5), fontsize=10)

In [None]:
for site in site_list:
    msno.heatmap(weather_train.loc[weather_train['site_id'] == site, :], figsize=(9,5), fontsize=10)

In [None]:
corr = weather_by_site.corr()
corr.style.background_gradient(cmap='coolwarm')

In [None]:
(df.isna().sum()/df.shape[0])*100

In [None]:
gc.collect()

In [None]:
df.groupby('building_id').count()

In [None]:
df_by_building = df.groupby('building_id')
del(df_by_building)

In [None]:
df.loc[df['building_id']==1000,:].head()

In [None]:
weather_train['air_temperature'].describe()

In [None]:
weather_train.loc[weather_train['air_temperature'] > 47, 'site_id']

In [None]:
weather_train.loc[(weather_train['site_id']==2) & (weather_train['air_temperature'] > 47), :].head()

## Testing methods for filling gaps in `weather_train` data set
### Method 1: Interpolation with `interpolate()`

In [65]:
weather_site12 = weather_train.loc[weather_train['site_id'] == 12, :]

In [66]:
weather_site12.isna().sum()

site_id                  0
timestamp                0
air_temperature          0
cloud_coverage          59
dew_temperature          0
precip_depth_1_hr     8755
sea_level_pressure      56
wind_direction           1
wind_speed               0
dtype: int64

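Randomly replacing 10 known `air_temperature` values with `NaN`, so that each interpolation method can be compared against the real readings: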

In [67]:
# Set 'timestamp' as index
weather12 = weather_site12.copy()
weather12['timestamp'] = pd.to_datetime(weather12['timestamp'])
weather12.index = weather12['timestamp']

In [68]:
weather12.sample(n=10, replace=False, random_state=1).index

DatetimeIndex(['2016-09-09 05:00:00', '2016-05-08 01:00:00',
               '2016-05-07 10:00:00', '2016-07-11 15:00:00',
               '2016-03-04 17:00:00', '2016-02-27 22:00:00',
               '2016-05-11 23:00:00', '2016-06-26 03:00:00',
               '2016-07-17 16:00:00', '2016-12-23 11:00:00'],
              dtype='datetime64[ns]', name='timestamp', freq=None)

In [69]:
weather12_with_nans = weather12.copy()
nan_indexes = weather12.sample(n=10, replace=False, random_state=1).index

In [43]:
nan_indexes

DatetimeIndex(['2016-09-09 05:00:00', '2016-05-08 01:00:00',
               '2016-05-07 10:00:00', '2016-07-11 15:00:00',
               '2016-03-04 17:00:00', '2016-02-27 22:00:00',
               '2016-05-11 23:00:00', '2016-06-26 03:00:00',
               '2016-07-17 16:00:00', '2016-12-23 11:00:00'],
              dtype='datetime64[ns]', name='timestamp', freq=None)

In [89]:
weather12_with_nans.loc[nan_indexes, 'air_temperature'] = np.nan

In [45]:
weather12_with_nans.isna().sum()

site_id                  0
timestamp                0
air_temperature         10
cloud_coverage          59
dew_temperature          0
precip_depth_1_hr     8755
sea_level_pressure      56
wind_direction           1
wind_speed               0
dtype: int64
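
The masking step above can be wrapped in a small helper so that every method starts from the same held-out points. This is a sketch under assumptions: `mask_random` is a hypothetical helper, and the hourly series is synthetic rather than the site 12 data.

```python
import numpy as np
import pandas as pd

def mask_random(series: pd.Series, n: int, seed: int = 1):
    """Return a copy with n known values set to NaN, plus the labels that were masked."""
    # Sample only from values that are present, so each masked point has a known truth.
    held_out = series.dropna().sample(n=n, replace=False, random_state=seed).index
    masked = series.copy()
    masked.loc[held_out] = np.nan
    return masked, held_out

# Synthetic hourly temperatures over two days.
hours = pd.date_range("2016-01-01", periods=48, freq=pd.Timedelta(hours=1))
temps = pd.Series(np.linspace(10.0, 20.0, 48), index=hours, name="air_temperature")

masked, held_out = mask_random(temps, n=10)
print(masked.isna().sum())
```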

#### Method `time`

In [47]:
weather12_with_nans['air_temperature'] = weather12_with_nans['air_temperature'].interpolate(method='time')

In [48]:
weather12_with_nans.isna().sum()

site_id                  0
timestamp                0
air_temperature          0
cloud_coverage          59
dew_temperature          0
precip_depth_1_hr     8755
sea_level_pressure      56
wind_direction           1
wind_speed               0
dtype: int64

In [54]:
aprox = weather12_with_nans.loc[list(nan_indexes), 'air_temperature']

In [55]:
real = weather12.loc[list(nan_indexes), 'air_temperature']

In [62]:
error = abs(aprox - real)
error

timestamp
2016-09-09 05:00:00    0.500000
2016-05-08 01:00:00    0.398438
2016-05-07 10:00:00    0.304688
2016-07-11 15:00:00    1.046875
2016-03-04 17:00:00    0.000000
2016-02-27 22:00:00    0.900391
2016-05-11 23:00:00    0.796875
2016-06-26 03:00:00    0.148438
2016-07-17 16:00:00    0.890625
2016-12-23 11:00:00    0.156250
Name: air_temperature, dtype: float16

#### Method `linear`

In [80]:
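weather12_with_nans.loc[nan_indexes, 'air_temperature'] = np.nan  # reset the held-out points first
weather12_with_nans['air_temperature'] = weather12_with_nans['air_temperature'].interpolate(method='linear')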

In [84]:
abs(weather12_with_nans.loc[list(nan_indexes), 'air_temperature'] -weather12.loc[list(nan_indexes), 'air_temperature'])

timestamp
2016-09-09 05:00:00    0.500000
2016-05-08 01:00:00    0.398438
2016-05-07 10:00:00    0.304688
2016-07-11 15:00:00    1.046875
2016-03-04 17:00:00    0.000000
2016-02-27 22:00:00    0.900391
2016-05-11 23:00:00    0.796875
2016-06-26 03:00:00    0.148438
2016-07-17 16:00:00    0.890625
2016-12-23 11:00:00    0.156250
Name: air_temperature, dtype: float16

#### Method `quadratic`

In [87]:
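weather12_with_nans.loc[nan_indexes, 'air_temperature'] = np.nan  # reset the held-out points first
weather12_with_nans['air_temperature'] = weather12_with_nans['air_temperature'].interpolate(method='quadratic')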

In [88]:
abs(weather12_with_nans.loc[list(nan_indexes), 'air_temperature'] -weather12.loc[list(nan_indexes), 'air_temperature'])

timestamp
2016-09-09 05:00:00    0.093750
2016-05-08 01:00:00    0.523438
2016-05-07 10:00:00    0.289062
2016-07-11 15:00:00    0.187500
2016-03-04 17:00:00    0.285156
2016-02-27 22:00:00    1.384766
2016-05-11 23:00:00    1.015625
2016-06-26 03:00:00    0.195312
2016-07-17 16:00:00    1.109375
2016-12-23 11:00:00    1.000000
Name: air_temperature, dtype: float16

#### Method `cubic`

In [None]:
weather12_with_nans.loc[nan_indexes, 'air_temperature'] = np.nan

In [90]:
weather12_with_nans['air_temperature'] = weather12_with_nans['air_temperature'].interpolate(method='cubic')

In [91]:
abs(weather12_with_nans.loc[list(nan_indexes), 'air_temperature'] -weather12.loc[list(nan_indexes), 'air_temperature'])

timestamp
2016-09-09 05:00:00    0.046875
2016-05-08 01:00:00    0.515625
2016-05-07 10:00:00    0.289062
2016-07-11 15:00:00    0.062500
2016-03-04 17:00:00    0.296875
2016-02-27 22:00:00    1.462891
2016-05-11 23:00:00    1.062500
2016-06-26 03:00:00    0.218750
2016-07-17 16:00:00    1.125000
2016-12-23 11:00:00    1.125000
Name: air_temperature, dtype: float16

#### Method `spline`

In [138]:
weather12_with_nans.loc[nan_indexes, 'air_temperature'] = np.nan

In [139]:
weather12_with_nans['air_temperature'] = weather12_with_nans['air_temperature'].interpolate(method='spline', order=3)

In [140]:
abs(weather12_with_nans.loc[list(nan_indexes), 'air_temperature'] -weather12.loc[list(nan_indexes), 'air_temperature'])

timestamp
2016-09-09 05:00:00    2.414062
2016-05-08 01:00:00    0.453125
2016-05-07 10:00:00    0.171875
2016-07-11 15:00:00    2.968750
2016-03-04 17:00:00    0.414062
2016-02-27 22:00:00    0.250000
2016-05-11 23:00:00    0.546875
2016-06-26 03:00:00    0.093750
2016-07-17 16:00:00    0.062500
2016-12-23 11:00:00    2.406250
Name: air_temperature, dtype: float16

#### Method `polynomial`

In [127]:
weather12_with_nans.loc[nan_indexes, 'air_temperature'] = np.nan

In [128]:
weather12_with_nans['air_temperature'] = weather12_with_nans['air_temperature'].interpolate(method='polynomial', order=5)

In [129]:
abs(weather12_with_nans.loc[list(nan_indexes), 'air_temperature'] -weather12.loc[list(nan_indexes), 'air_temperature'])

timestamp
2016-09-09 05:00:00    0.085938
2016-05-08 01:00:00    0.460938
2016-05-07 10:00:00    0.289062
2016-07-11 15:00:00    0.390625
2016-03-04 17:00:00    0.308594
2016-02-27 22:00:00    1.724609
2016-05-11 23:00:00    1.242188
2016-06-26 03:00:00    0.328125
2016-07-17 16:00:00    1.203125
2016-12-23 11:00:00    1.570312
Name: air_temperature, dtype: float16

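On these 10 held-out points the `time` method is the most accurate, with a mean absolute error of about 0.51 °C, against 0.61 for `quadratic`, 0.62 for `cubic`, 0.76 for `polynomial` (order 5) and 0.98 for `spline` (order 3). The `linear` method returns exactly the same results as `time`, which is expected when the readings are evenly spaced. Ten points is a small sample, so this ranking is only indicative.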
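
The comparison can be condensed into a single score per method, the mean absolute error over the held-out points. The sketch below runs on a synthetic hourly series, not the site 12 data, and `interpolation_mae` is a hypothetical helper. Only `time` and `linear` are scored because the other methods used in this notebook need SciPy.

```python
import numpy as np
import pandas as pd

def interpolation_mae(truth: pd.Series, held_out, method: str, **kwargs) -> float:
    """Mask the held-out labels, interpolate, and return the mean absolute error."""
    masked = truth.copy()
    masked.loc[held_out] = np.nan  # every method starts from the same masked series
    filled = masked.interpolate(method=method, **kwargs)
    return float((filled.loc[held_out] - truth.loc[held_out]).abs().mean())

# Synthetic hourly temperatures with a daily cycle.
hours = pd.date_range("2016-01-01", periods=72, freq=pd.Timedelta(hours=1))
truth = pd.Series(15 + 5 * np.sin(2 * np.pi * np.arange(72) / 24), index=hours)

# Keep the first and last points so every gap has a neighbour on each side.
held_out = truth.iloc[1:-1].sample(n=10, random_state=1).index

scores = {m: interpolation_mae(truth, held_out, m) for m in ("time", "linear")}
print(scores)
```

With SciPy installed, the same helper should accept the remaining methods, for example `interpolation_mae(truth, held_out, "spline", order=3)`.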