# Handling Missing Values

In [1]:
# Load libraries

%load_ext autoreload
%autoreload 2
%matplotlib inline

import pandas as pd
import numpy as np
import datetime as dt
import gc
from src.functions import data_exploration as dexp
import matplotlib.pyplot as plt
import seaborn as sns
import missingno as msno
import pandas_profiling

import chart_studio as cs
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
import cufflinks as cf
cf.set_config_file(offline=True)

### Loading data

In [4]:
%store -r building_meta_df
%store -r weather_train_df
%store -r train_df

In [5]:
weather_train_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8763 entries, 0 to 8762
Data columns (total 9 columns):
site_id               8763 non-null int8
timestamp             8763 non-null datetime64[ns]
air_temperature       8762 non-null float16
cloud_coverage        1701 non-null float16
dew_temperature       8762 non-null float16
precip_depth_1_hr     0 non-null float64
sea_level_pressure    8711 non-null float16
wind_direction        8760 non-null float16
wind_speed            8763 non-null float16
dtypes: datetime64[ns](1), float16(6), float64(1), int8(1)
memory usage: 248.3 KB


## Missing values in `weather_train_df`

As we saw in the initial EDA, for `weather_train_df`:

* `site_id` --> Ignored for analysis.
* `precip_depth_1_hr` is completely missing --> Ignored for analysis.
* `cloud_coverage` has about 80% missing values --> Ignored for analysis.
* There are 52 missing values in `sea_level_pressure` (0.6%) --> Impute missing values.
* There is 1 missing value each in `air_temperature` and `dew_temperature` --> Impute missing values.
* There are 3 missing values in `wind_direction` --> Impute missing values.
* There are no missing values in `wind_speed`.
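
The counts and percentages in this list can be recomputed directly from the data frame. The following is a minimal sketch on a toy frame; `missing_summary` and `toy_df` are hypothetical names used only for illustration, and the values are made up.

```python
import numpy as np
import pandas as pd


def missing_summary(df):
    """Return the count and percentage of missing values per column."""
    n_missing = df.isna().sum()
    pct_missing = (n_missing / len(df) * 100).round(1)
    return pd.DataFrame({"n_missing": n_missing, "pct_missing": pct_missing})


# Toy stand-in for weather_train_df (made-up values, not the real data)
toy_df = pd.DataFrame({
    "air_temperature": [3.8, np.nan, 2.6, 2.0],
    "precip_depth_1_hr": [np.nan, np.nan, np.nan, np.nan],
    "wind_speed": [3.1, 2.6, 0.0, 1.5],
})

summary = missing_summary(toy_df)
print(summary)
```

Applied to `weather_train_df`, the same function should reproduce the figures listed above.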

We'll use the `interpolate()` method to impute the missing values. Because this is an hourly time series, it is a good approximation to assume that weather parameters change little between consecutive hours. In our previous tests, `time` was the best-performing `interpolate()` method among those we tried.
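
To illustrate what the `time` method does, here is a small sketch on a made-up series with an irregular gap. The `time` method weights the estimate by the elapsed time between neighbouring observations, whereas `linear` treats all rows as equally spaced. The series `ts` and its values are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Three observations: the gap after the NaN is three times longer than the gap before it
ts = pd.Series(
    [10.0, np.nan, 14.0],
    index=pd.to_datetime(
        ["2016-10-18 12:00", "2016-10-18 13:00", "2016-10-18 16:00"]
    ),
)

by_time = ts.interpolate(method="time")        # uses the timestamps in the index
by_position = ts.interpolate(method="linear")  # ignores the index, uses row positions

print(by_time.iloc[1], by_position.iloc[1])
```

The missing point sits one hour into a four-hour gap, so `time` places it a quarter of the way between the neighbours, while `linear` places it halfway. On a strictly hourly series with no missing rows, the two methods give the same result.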

In [6]:
# Use 'timestamp' as the index: interpolate(method='time') requires a DatetimeIndex
weather_train_df = weather_train_df.set_index('timestamp')

In [36]:
# Plot time series
weather_train_df[
    [
    'air_temperature',
    'dew_temperature',
    'sea_level_pressure',
    'wind_direction',
    'wind_speed'
    ]
].iplot(kind='scatter', filename='cufflinks/cf-simple-line')

Let's create a function to find the timestamp of the NaN value for each column.

In [8]:
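def show_nan(df):
    """Return a dict mapping each column name to its NaN entries (indexed by timestamp)."""
    results = {}
    for col in df.columns:
        # Keep only the rows where this column is missing
        results[col] = df[col][df[col].isna()]

    return results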
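
As a quick check of how this function behaves, here is a sketch on a toy frame. `show_nan` is restated in a compact, equivalent form so the block runs on its own; `toy_df` and its values are made up for illustration.

```python
import numpy as np
import pandas as pd


def show_nan(df):
    """Map each column name to the rows where that column is NaN."""
    return {col: df[col][df[col].isna()] for col in df.columns}


# Toy frame indexed by timestamp (made-up values)
toy_df = pd.DataFrame(
    {"air_temperature": [3.8, np.nan, 2.6], "wind_speed": [3.1, 2.6, 0.0]},
    index=pd.to_datetime(
        ["2016-01-01 00:00", "2016-01-01 01:00", "2016-01-01 02:00"]
    ),
)
toy_df.index.name = "timestamp"

nan_rows = show_nan(toy_df)
print(nan_rows["air_temperature"])  # the single NaN row, labelled by its timestamp
print(nan_rows["wind_speed"].empty)  # no NaNs in this column
```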


### Variable `air_temperature` 

In [9]:
show_nan(weather_train_df).get('air_temperature')

timestamp
2016-10-18 13:00:00   NaN
Name: air_temperature, dtype: float16

In [16]:
air_temp_nans = weather_train_df.index[weather_train_df['air_temperature'].isna()].tolist()

In [20]:
weather_train_df['air_temperature'] = weather_train_df['air_temperature'].interpolate(method='time')

In [21]:
weather_train_df.loc[air_temp_nans, 'air_temperature']

timestamp
2016-10-18 13:00:00    13.398438
Name: air_temperature, dtype: float16

### Variable `dew_temperature` 

In [18]:
show_nan(weather_train_df).get('dew_temperature')

timestamp
2016-10-18 13:00:00   NaN
Name: dew_temperature, dtype: float16

In [22]:
dew_temp_nans = weather_train_df.index[weather_train_df['dew_temperature'].isna()].tolist()

In [23]:
weather_train_df['dew_temperature'] = weather_train_df['dew_temperature'].interpolate(method='time')

In [25]:
weather_train_df.loc[dew_temp_nans, 'dew_temperature']

timestamp
2016-10-18 13:00:00    4.351562
Name: dew_temperature, dtype: float16

### Variable `sea_level_pressure`

In [29]:
show_nan(weather_train_df).get('sea_level_pressure')

timestamp
2016-01-09 06:00:00   NaN
2016-01-12 05:00:00   NaN
2016-01-12 06:00:00   NaN
2016-01-15 11:00:00   NaN
2016-01-30 11:00:00   NaN
2016-02-13 07:00:00   NaN
2016-02-23 05:00:00   NaN
2016-03-18 14:00:00   NaN
2016-03-19 15:00:00   NaN
2016-03-28 05:00:00   NaN
2016-03-28 06:00:00   NaN
2016-04-18 05:00:00   NaN
2016-04-23 17:00:00   NaN
2016-05-02 01:00:00   NaN
2016-05-04 12:00:00   NaN
2016-05-08 00:00:00   NaN
2016-05-08 17:00:00   NaN
2016-05-16 14:00:00   NaN
2016-05-27 16:00:00   NaN
2016-05-28 05:00:00   NaN
2016-06-06 12:00:00   NaN
2016-06-13 05:00:00   NaN
2016-06-30 13:00:00   NaN
2016-07-09 04:00:00   NaN
2016-07-09 08:00:00   NaN
2016-07-22 20:00:00   NaN
2016-07-29 04:00:00   NaN
2016-07-30 15:00:00   NaN
2016-08-02 14:00:00   NaN
2016-08-09 17:00:00   NaN
2016-08-12 22:00:00   NaN
2016-08-13 07:00:00   NaN
2016-08-14 01:00:00   NaN
2016-08-20 09:00:00   NaN
2016-08-23 14:00:00   NaN
2016-09-05 17:00:00   NaN
2016-09-06 16:00:00   NaN
2016-09-10 05:00:00   NaN
...

In [30]:
pressure_nans = weather_train_df.index[weather_train_df['sea_level_pressure'].isna()].tolist()

In [31]:
weather_train_df['sea_level_pressure'] = weather_train_df['sea_level_pressure'].interpolate(method='time')

In [32]:
weather_train_df.loc[pressure_nans, 'sea_level_pressure']

timestamp
2016-01-09 06:00:00     994.5
2016-01-12 05:00:00     994.0
2016-01-12 06:00:00     994.5
2016-01-15 11:00:00    1020.0
2016-01-30 11:00:00    1009.5
2016-02-13 07:00:00     987.0
2016-02-23 05:00:00    1014.0
2016-03-18 14:00:00    1026.0
2016-03-19 15:00:00    1026.0
2016-03-28 05:00:00     977.5
2016-03-28 06:00:00     978.5
2016-04-18 05:00:00    1021.0
2016-04-23 17:00:00    1021.0
2016-05-02 01:00:00    1024.0
2016-05-04 12:00:00    1026.0
2016-05-08 00:00:00    1007.0
2016-05-08 17:00:00    1006.5
2016-05-16 14:00:00    1021.0
2016-05-27 16:00:00    1016.0
2016-05-28 05:00:00    1016.5
2016-06-06 12:00:00    1021.5
2016-06-13 05:00:00    1003.5
2016-06-30 13:00:00    1010.0
2016-07-09 04:00:00    1019.0
2016-07-09 08:00:00    1018.0
2016-07-22 20:00:00    1019.0
2016-07-29 04:00:00    1011.0
2016-07-30 15:00:00    1013.5
2016-08-02 14:00:00    1010.0
2016-08-09 17:00:00    1024.0
2016-08-12 22:00:00    1023.0
2016-08-13 07:00:00    1023.0
2016-08-14 01:00:00    1024.0


### Variable `wind_direction`

In [34]:
show_nan(weather_train_df).get('wind_direction')

timestamp
2016-07-30 15:00:00   NaN
2016-10-07 07:00:00   NaN
2016-10-25 13:00:00   NaN
Name: wind_direction, dtype: float16

In [35]:
wind_dir_nans = weather_train_df.index[weather_train_df['wind_direction'].isna()].tolist()
weather_train_df['wind_direction'] = weather_train_df['wind_direction'].interpolate(method='time')
weather_train_df.loc[wind_dir_nans, 'wind_direction']

timestamp
2016-07-30 15:00:00    305.0
2016-10-07 07:00:00     75.0
2016-10-25 13:00:00     50.0
Name: wind_direction, dtype: float16
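
The four variables above all follow the same steps: record where the NaNs are, interpolate by time, then inspect the filled values. Here is a minimal sketch of how that pattern could be wrapped in a helper. `interpolate_by_time` and `toy_df` are hypothetical names that are not part of the notebook, and the values are made up.

```python
import numpy as np
import pandas as pd


def interpolate_by_time(df, columns):
    """Fill NaNs in `columns` by time interpolation (modifies df).

    Returns a dict mapping each column to the values that were filled in.
    Assumes df has a DatetimeIndex, which method='time' requires.
    """
    filled = {}
    for col in columns:
        nan_idx = df.index[df[col].isna()]            # remember where the gaps were
        df[col] = df[col].interpolate(method="time")  # fill them
        filled[col] = df.loc[nan_idx, col]            # keep the imputed values for review
    return filled


# Toy hourly frame (made-up values)
toy_df = pd.DataFrame(
    {
        "air_temperature": [10.0, np.nan, 12.0, 13.0],
        "wind_direction": [100.0, 110.0, np.nan, 130.0],
    },
    index=pd.to_datetime(
        ["2016-01-01 00:00", "2016-01-01 01:00", "2016-01-01 02:00", "2016-01-01 03:00"]
    ),
)

filled = interpolate_by_time(toy_df, ["air_temperature", "wind_direction"])
for col, values in filled.items():
    print(col, values.tolist())
```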

### Variables `site_id`, `precip_depth_1_hr` and `cloud_coverage`

As mentioned above, these three variables are being ignored, so we'll drop them from the data frame.

In [44]:
weather_train_df.drop(columns=['site_id','precip_depth_1_hr','cloud_coverage'], inplace=True)

Finally, let's restore the `timestamp` index to a column, since we will use it as a feature.

In [42]:
weather_train_df.reset_index(inplace=True)

In [45]:
weather_train_df.head()

| | timestamp | air_temperature | dew_temperature | sea_level_pressure | wind_direction | wind_speed |
|---|---|---|---|---|---|---|
| 0 | 2016-01-01 00:00:00 | 3.800781 | 2.400391 | 1021.0 | 240.0 | 3.099609 |
| 1 | 2016-01-01 01:00:00 | 3.699219 | 2.400391 | 1021.5 | 230.0 | 2.599609 |
| 2 | 2016-01-01 02:00:00 | 2.599609 | 1.900391 | 1022.0 | 0.0 | 0.0 |
| 3 | 2016-01-01 03:00:00 | 2.0 | 1.200195 | 1022.5 | 170.0 | 1.5 |
| 4 | 2016-01-01 04:00:00 | 2.300781 | 1.799805 | 1022.5 | 110.0 | 1.5 |


In [46]:
weather_train_df.to_csv('../../data/interim/site_1/clean_weather.csv')
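
Note that `to_csv` writes the integer index as an extra unnamed column by default, which shows up as `Unnamed: 0` when the file is read back. If that column is not wanted, one option is to pass `index=False` when writing and `parse_dates` when reading. This is a sketch with a temporary file and a made-up frame; the file name `clean_weather_example.csv` is hypothetical.

```python
import os
import tempfile

import pandas as pd

# Toy stand-in for the cleaned weather frame (made-up values)
toy_df = pd.DataFrame({
    "timestamp": pd.to_datetime(["2016-01-01 00:00", "2016-01-01 01:00"]),
    "air_temperature": [3.8, 3.7],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "clean_weather_example.csv")
    toy_df.to_csv(path, index=False)                        # do not write the integer index
    reloaded = pd.read_csv(path, parse_dates=["timestamp"])  # restore the datetime dtype

print(reloaded.columns.tolist())
print(reloaded.dtypes)
```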

## Missing values in `building_meta_df`

As we saw in the initial EDA, for `building_meta_df`:
* `site_id` --> Ignored for analysis
* `building_id` --> It's a label to identify each building. Ignored for analysis
* `year_built` --> There are 11 missing values (approx. 22%)