```
From: https://github.com/ksatola
Version: 0.1.0
```

# Model - Prepare Analytical Views For Modelling

We will prepare the data for:
- statistical modeling (without any transformations),
- machine learning analysis by creating lagged variables.

The data will be prepared for two forecast horizons:
- the next 24 hours (hourly data),
- the next 7 days (daily data).

We will then save these new dataframes.

## Contents

- [Load PM2.5 and PM10 Analytical View From a CSV File](#data_csv_pm25)
- [**Statistical Models**: Build PM2.5 HDF Analytical View Representation File](#data_hdf_pm25)
- [Data-related feature engineering](#feature_pm25_ml)
- [**Machine Learning Models**: Build PM2.5 HDF Analytical View Representation File](#data_hdf_pm25_ml)

In [1]:
%load_ext autoreload

In [2]:
%autoreload 2

In [3]:
import sys
sys.path.insert(0, '../src')

In [4]:
import pandas as pd 
import numpy as np

In [5]:
from model import (
    load_data,
    calculate_season,
    build_datetime_features
)

In [6]:
pd.set_option('display.max_rows', 999)
pd.set_option('display.max_columns', 999)
pd.set_option('display.precision', 5)

In [7]:
! pwd

/Users/ksatola/Documents/git/air-polution/agh


In [8]:
data_path = 'data/'
data_file = data_path + 'dfpm2008_2018.csv'

---
<a id='data_csv_pm25'></a>

## Load PM2.5 and PM10 Analytical View From a CSV File

In [9]:
df = load_data(data_file)

common.py | 13 | load_data | 31-May-20 16:55:33 | INFO: DataFrame size: (96388, 2)


Datetime,pm10,pm25
2008-01-01 01:00:00,109.5,92.0
2008-01-01 02:00:00,96.0,81.0
2008-01-01 03:00:00,86.5,73.0
2008-01-01 04:00:00,71.5,60.5
2008-01-01 05:00:00,72.0,61.0


---
<a id='data_hdf_pm25'></a>

## Statistical Models: Build PM2.5 HDF Analytical View Representation File

In [10]:
df.index

Index(['2008-01-01 01:00:00', '2008-01-01 02:00:00', '2008-01-01 03:00:00',
       '2008-01-01 04:00:00', '2008-01-01 05:00:00', '2008-01-01 06:00:00',
       '2008-01-01 07:00:00', '2008-01-01 08:00:00', '2008-01-01 09:00:00',
       '2008-01-01 10:00:00',
       ...
       '2018-12-31 15:00:00.000', '2018-12-31 16:00:00.000',
       '2018-12-31 17:00:00.000', '2018-12-31 18:00:00.000',
       '2018-12-31 19:00:00.000', '2018-12-31 20:00:00.000',
       '2018-12-31 21:00:00.000', '2018-12-31 22:00:00.000',
       '2018-12-31 23:00:00.000', '2019-01-01 00:00:00.000'],
      dtype='object', name='Datetime', length=96388)

### Set the index type to datetime to be able to perform time-related operations

In [11]:
# Convert the string index to a DatetimeIndex (this does not set a frequency)
df.index = pd.to_datetime(df.index)

In [12]:
# https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#offset-aliases
df.asfreq('T').index

DatetimeIndex(['2008-01-01 01:00:00', '2008-01-01 01:01:00',
               '2008-01-01 01:02:00', '2008-01-01 01:03:00',
               '2008-01-01 01:04:00', '2008-01-01 01:05:00',
               '2008-01-01 01:06:00', '2008-01-01 01:07:00',
               '2008-01-01 01:08:00', '2008-01-01 01:09:00',
               ...
               '2018-12-31 23:51:00', '2018-12-31 23:52:00',
               '2018-12-31 23:53:00', '2018-12-31 23:54:00',
               '2018-12-31 23:55:00', '2018-12-31 23:56:00',
               '2018-12-31 23:57:00', '2018-12-31 23:58:00',
               '2018-12-31 23:59:00', '2019-01-01 00:00:00'],
              dtype='datetime64[ns]', name='Datetime', length=5785861, freq='T')
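
The minute-level index above has 5,785,861 entries, which corresponds to 96,432 hourly slots between the first and last timestamp, while the frame holds 96,388 rows. This suggests that some hours may be absent. A minimal sketch of how such gaps could be listed is shown below; the helper name `missing_timestamps` and the toy index are assumptions made for illustration and are not part of the project code.

```python
import pandas as pd


def missing_timestamps(index: pd.DatetimeIndex, step: pd.Timedelta) -> pd.DatetimeIndex:
    """Return the timestamps a gap-free index would contain but `index` lacks."""
    # Build the full, regular range between the first and last observation
    expected = pd.date_range(index.min(), index.max(), freq=step)
    # Anything in the regular range that is not observed is a gap
    return expected.difference(index)


# Toy hourly index with 03:00 missing (hypothetical data)
hourly = pd.to_datetime(["2008-01-01 01:00", "2008-01-01 02:00", "2008-01-01 04:00"])
gaps = missing_timestamps(hourly, pd.Timedelta(hours=1))
print(gaps)
```

Applied to the real frame, the call would be `missing_timestamps(df.index, pd.Timedelta(hours=1))`, assuming the index has already been converted with `pd.to_datetime`.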

In [13]:
df.head()

Datetime,pm10,pm25
2008-01-01 01:00:00,109.5,92.0
2008-01-01 02:00:00,96.0,81.0
2008-01-01 03:00:00,86.5,73.0
2008-01-01 04:00:00,71.5,60.5
2008-01-01 05:00:00,72.0,61.0


In [14]:
# We will only need PM2.5 for modelling
df.drop(columns=['pm10'], inplace=True)
df.head()

Datetime,pm25
2008-01-01 01:00:00,92.0
2008-01-01 02:00:00,81.0
2008-01-01 03:00:00,73.0
2008-01-01 04:00:00,60.5
2008-01-01 05:00:00,61.0


### Dataset - Hourly Frequency

In [15]:
data_file_hdf = data_path + 'dfpm25_2008-2018_hourly.hdf'
df.to_hdf(data_file_hdf, key='df', mode='w')

### Test Read

In [16]:
df = pd.read_hdf(path_or_buf=data_file_hdf, key="df")
print(f'Dataframe size: {df.shape}')
df.head()

Dataframe size: (96388, 1)


Datetime,pm25
2008-01-01 01:00:00,92.0
2008-01-01 02:00:00,81.0
2008-01-01 03:00:00,73.0
2008-01-01 04:00:00,60.5
2008-01-01 05:00:00,61.0


### Dataset - Daily Frequency

In [17]:
# Resample data to daily using mean of values
df_daily = df.resample(rule='D').mean() # daily
df_daily.head()

Datetime,pm25
2008-01-01,53.58696
2008-01-02,30.95833
2008-01-03,46.10417
2008-01-04,42.97917
2008-01-05,57.3125
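
Because the series starts at 01:00, the first daily mean is computed from at most 23 hourly values, and any day with missing hours is averaged over fewer observations as well. The sketch below shows one way to keep the number of contributing observations next to each daily mean; the toy values are made up for illustration.

```python
import pandas as pd

# Toy hourly data spanning two partial days (hypothetical values)
idx = pd.to_datetime([
    "2008-01-01 22:00", "2008-01-01 23:00",
    "2008-01-02 00:00", "2008-01-02 01:00", "2008-01-02 02:00",
])
hourly = pd.DataFrame({"pm25": [40.0, 50.0, 30.0, 20.0, 10.0]}, index=idx)

# Daily mean together with the count of non-missing hourly values per day
daily = hourly["pm25"].resample("D").agg(["mean", "count"])
print(daily)
```

The `count` column makes it easy to flag or drop days that are based on too few observations before modelling.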


In [18]:
data_file_hdf = data_path + 'dfpm25_2008-2018_daily.hdf'
df_daily.to_hdf(data_file_hdf, key='df', mode='w')

### Test Read

In [19]:
df = pd.read_hdf(path_or_buf=data_file_hdf, key="df")
print(f'Dataframe size: {df.shape}')
df.head()

Dataframe size: (4019, 1)


Datetime,pm25
2008-01-01,53.58696
2008-01-02,30.95833
2008-01-03,46.10417
2008-01-04,42.97917
2008-01-05,57.3125


---
<a id='feature_pm25_ml'></a>

## Data-related feature engineering

Feature engineering is the process of transforming raw data into features that better represent the underlying problem to the predictive models, resulting in improved model accuracy on unseen data. 

For ML, instead of working with a datetime index directly, we will derive additional features from the timestamps and include them in the model.
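
The project's `build_datetime_features` helper is imported from `model` and its source is not shown here. The following is only a sketch of what such a helper might do, under stated assumptions: the function name `datetime_features_sketch` is hypothetical, and the `season` encoding (1 for December to February, 2 for March to May, and so on) is an assumption rather than the project's definition.

```python
import pandas as pd


def datetime_features_sketch(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical stand-in for build_datetime_features (not the project's code)."""
    idx = pd.DatetimeIndex(df.index)
    # Replace the datetime index with a plain integer index
    out = df.reset_index(drop=True)
    out["month"] = idx.month.to_numpy()
    out["day"] = idx.day.to_numpy()
    out["hour"] = idx.hour.to_numpy()
    out["dayofyear"] = idx.dayofyear.to_numpy()
    # ISO week number
    out["weekofyear"] = idx.isocalendar().week.to_numpy().astype(int)
    # Monday=0 ... Sunday=6
    out["dayofweek"] = idx.dayofweek.to_numpy()
    out["quarter"] = idx.quarter.to_numpy()
    # Assumed encoding: Dec-Feb -> 1, Mar-May -> 2, Jun-Aug -> 3, Sep-Nov -> 4
    out["season"] = ((idx.month % 12) // 3 + 1).to_numpy()
    return out


sample = pd.DataFrame({"t": [36.0]}, index=pd.to_datetime(["2008-01-02 01:00:00"]))
feats = datetime_features_sketch(sample)
print(feats)
```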

---
<a id='data_hdf_pm25_ml'></a>

## Machine Learning Models: Build PM2.5 HDF Analytical View Representation File

In [20]:
df = load_data(data_file)

common.py | 13 | load_data | 31-May-20 16:55:44 | INFO: DataFrame size: (96388, 2)


Datetime,pm10,pm25
2008-01-01 01:00:00,109.5,92.0
2008-01-01 02:00:00,96.0,81.0
2008-01-01 03:00:00,86.5,73.0
2008-01-01 04:00:00,71.5,60.5
2008-01-01 05:00:00,72.0,61.0


In [21]:
# We will only need PM2.5 for ML modelling
df.drop(columns=['pm10'], inplace=True)
df.head()

Datetime,pm25
2008-01-01 01:00:00,92.0
2008-01-01 02:00:00,81.0
2008-01-01 03:00:00,73.0
2008-01-01 04:00:00,60.5
2008-01-01 05:00:00,61.0


### Create Lagged Variables - Hourly

In [22]:
# Create 24 hours of lag values to predict current observation
df24h = pd.DataFrame()

# Create column t
df24h['t'] = df['pm25']

for i in range(1, 25):
    # Column t-i holds the observation from i hours earlier
    df24h[f't-{i}'] = df['pm25'].shift(i)

df24h.head(26)

Datetime,t,t-1,t-2,t-3,t-4,t-5,t-6,t-7,t-8,t-9,t-10,t-11,t-12,t-13,t-14,t-15,t-16,t-17,t-18,t-19,t-20,t-21,t-22,t-23,t-24
2008-01-01 01:00:00,92.0,,,,,,,,,,,,,,,,,,,,,,,,
2008-01-01 02:00:00,81.0,92.0,,,,,,,,,,,,,,,,,,,,,,,
2008-01-01 03:00:00,73.0,81.0,92.0,,,,,,,,,,,,,,,,,,,,,,
2008-01-01 04:00:00,60.5,73.0,81.0,92.0,,,,,,,,,,,,,,,,,,,,,
2008-01-01 05:00:00,61.0,60.5,73.0,81.0,92.0,,,,,,,,,,,,,,,,,,,,
2008-01-01 06:00:00,67.0,61.0,60.5,73.0,81.0,92.0,,,,,,,,,,,,,,,,,,,
2008-01-01 07:00:00,69.5,67.0,61.0,60.5,73.0,81.0,92.0,,,,,,,,,,,,,,,,,,
2008-01-01 08:00:00,70.5,69.5,67.0,61.0,60.5,73.0,81.0,92.0,,,,,,,,,,,,,,,,,
2008-01-01 09:00:00,62.5,70.5,69.5,67.0,61.0,60.5,73.0,81.0,92.0,,,,,,,,,,,,,,,,
2008-01-01 10:00:00,41.5,62.5,70.5,69.5,67.0,61.0,60.5,73.0,81.0,92.0,,,,,,,,,,,,,,,
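
The same lag construction is needed again for the daily data, so it could be wrapped in a small function. The sketch below is one possible way to do this; `make_lag_frame` is a hypothetical name, not a function from the project's `model` module, and it builds all columns in a single `pd.concat` call rather than inserting them one by one.

```python
import pandas as pd


def make_lag_frame(series: pd.Series, n_lags: int) -> pd.DataFrame:
    """Return a frame with the current value `t` and `n_lags` lagged copies."""
    columns = {"t": series}
    for i in range(1, n_lags + 1):
        # shift(i) moves values down by i rows, so row k holds the value from row k-i
        columns[f"t-{i}"] = series.shift(i)
    # Dictionary keys become the column names
    return pd.concat(columns, axis=1)


pm25 = pd.Series([92.0, 81.0, 73.0, 60.5], name="pm25")
lagged = make_lag_frame(pm25, n_lags=2)
print(lagged)
```

With the notebook's data, `make_lag_frame(df['pm25'], 24)` would correspond to the hourly case and `make_lag_frame(df_daily['pm25'], 7)` to the daily case.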


In [23]:
# Drop the first 24 rows, whose lag columns are incomplete (NaN)
df24h = df24h.iloc[24:]
df24h.head()

Datetime,t,t-1,t-2,t-3,t-4,t-5,t-6,t-7,t-8,t-9,t-10,t-11,t-12,t-13,t-14,t-15,t-16,t-17,t-18,t-19,t-20,t-21,t-22,t-23,t-24
2008-01-02 01:00:00,36.0,40.0,45.0,49.5,51.5,55.5,52.5,46.5,62.5,45.5,32.5,28.0,28.5,26.0,30.5,41.5,62.5,70.5,69.5,67.0,61.0,60.5,73.0,81.0,92.0
2008-01-02 02:00:00,30.0,36.0,40.0,45.0,49.5,51.5,55.5,52.5,46.5,62.5,45.5,32.5,28.0,28.5,26.0,30.5,41.5,62.5,70.5,69.5,67.0,61.0,60.5,73.0,81.0
2008-01-02 03:00:00,25.5,30.0,36.0,40.0,45.0,49.5,51.5,55.5,52.5,46.5,62.5,45.5,32.5,28.0,28.5,26.0,30.5,41.5,62.5,70.5,69.5,67.0,61.0,60.5,73.0
2008-01-02 04:00:00,28.5,25.5,30.0,36.0,40.0,45.0,49.5,51.5,55.5,52.5,46.5,62.5,45.5,32.5,28.0,28.5,26.0,30.5,41.5,62.5,70.5,69.5,67.0,61.0,60.5
2008-01-02 05:00:00,29.0,28.5,25.5,30.0,36.0,40.0,45.0,49.5,51.5,55.5,52.5,46.5,62.5,45.5,32.5,28.0,28.5,26.0,30.5,41.5,62.5,70.5,69.5,67.0,61.0
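
Slicing with `iloc` removes only the leading warm-up rows. If the source series itself contained missing values, those would propagate into the lag columns of later rows and survive the slice. Whether this applies to the dataset here is not shown, so the sketch below uses made-up values to contrast positional slicing with `dropna()`.

```python
import numpy as np
import pandas as pd

# Toy series with one missing observation in the middle (hypothetical values)
t = pd.Series([1.0, 2.0, np.nan, 4.0, 5.0, 6.0])
lagged = pd.DataFrame({"t": t, "t-1": t.shift(1), "t-2": t.shift(2)})

# Positional slice: drops the first two rows only
by_position = lagged.iloc[2:]

# dropna: keeps only rows where every column is present
complete = lagged.dropna()

print(by_position.isna().any().any())
print(complete)
```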


In [24]:
# Remove Datetime index and calculate date-related features from it
df24h = build_datetime_features(df24h, 'Datetime')
df24h.head()

Unnamed: 0,t,t-1,t-2,t-3,t-4,t-5,t-6,t-7,t-8,t-9,t-10,t-11,t-12,t-13,t-14,t-15,t-16,t-17,t-18,t-19,t-20,t-21,t-22,t-23,t-24,month,day,hour,dayofyear,weekofyear,dayofweek,quarter,season
0,36.0,40.0,45.0,49.5,51.5,55.5,52.5,46.5,62.5,45.5,32.5,28.0,28.5,26.0,30.5,41.5,62.5,70.5,69.5,67.0,61.0,60.5,73.0,81.0,92.0,1,2,1,2,1,2,1,1
1,30.0,36.0,40.0,45.0,49.5,51.5,55.5,52.5,46.5,62.5,45.5,32.5,28.0,28.5,26.0,30.5,41.5,62.5,70.5,69.5,67.0,61.0,60.5,73.0,81.0,1,2,2,2,1,2,1,1
2,25.5,30.0,36.0,40.0,45.0,49.5,51.5,55.5,52.5,46.5,62.5,45.5,32.5,28.0,28.5,26.0,30.5,41.5,62.5,70.5,69.5,67.0,61.0,60.5,73.0,1,2,3,2,1,2,1,1
3,28.5,25.5,30.0,36.0,40.0,45.0,49.5,51.5,55.5,52.5,46.5,62.5,45.5,32.5,28.0,28.5,26.0,30.5,41.5,62.5,70.5,69.5,67.0,61.0,60.5,1,2,4,2,1,2,1,1
4,29.0,28.5,25.5,30.0,36.0,40.0,45.0,49.5,51.5,55.5,52.5,46.5,62.5,45.5,32.5,28.0,28.5,26.0,30.5,41.5,62.5,70.5,69.5,67.0,61.0,1,2,5,2,1,2,1,1


In [25]:
data_file_hdf = data_path + 'dfpm25_2008-2018_ml_24hours_lags.hdf'
df24h.to_hdf(data_file_hdf, key='df', mode='w')

### Test Read

In [26]:
df24h = pd.read_hdf(path_or_buf=data_file_hdf, key="df")
print(f'Dataframe size: {df24h.shape}')
df24h.head()

Dataframe size: (96364, 33)


Unnamed: 0,t,t-1,t-2,t-3,t-4,t-5,t-6,t-7,t-8,t-9,t-10,t-11,t-12,t-13,t-14,t-15,t-16,t-17,t-18,t-19,t-20,t-21,t-22,t-23,t-24,month,day,hour,dayofyear,weekofyear,dayofweek,quarter,season
0,36.0,40.0,45.0,49.5,51.5,55.5,52.5,46.5,62.5,45.5,32.5,28.0,28.5,26.0,30.5,41.5,62.5,70.5,69.5,67.0,61.0,60.5,73.0,81.0,92.0,1,2,1,2,1,2,1,1
1,30.0,36.0,40.0,45.0,49.5,51.5,55.5,52.5,46.5,62.5,45.5,32.5,28.0,28.5,26.0,30.5,41.5,62.5,70.5,69.5,67.0,61.0,60.5,73.0,81.0,1,2,2,2,1,2,1,1
2,25.5,30.0,36.0,40.0,45.0,49.5,51.5,55.5,52.5,46.5,62.5,45.5,32.5,28.0,28.5,26.0,30.5,41.5,62.5,70.5,69.5,67.0,61.0,60.5,73.0,1,2,3,2,1,2,1,1
3,28.5,25.5,30.0,36.0,40.0,45.0,49.5,51.5,55.5,52.5,46.5,62.5,45.5,32.5,28.0,28.5,26.0,30.5,41.5,62.5,70.5,69.5,67.0,61.0,60.5,1,2,4,2,1,2,1,1
4,29.0,28.5,25.5,30.0,36.0,40.0,45.0,49.5,51.5,55.5,52.5,46.5,62.5,45.5,32.5,28.0,28.5,26.0,30.5,41.5,62.5,70.5,69.5,67.0,61.0,1,2,5,2,1,2,1,1


### Create Lagged Variables - Daily

In [27]:
# Convert the string index to a DatetimeIndex (required for resampling)
df.index = pd.to_datetime(df.index)

# Resample data to daily using mean of values
df_daily = df[['pm25']].resample(rule='D').mean() # daily
df_daily.head()

Datetime,pm25
2008-01-01,53.58696
2008-01-02,30.95833
2008-01-03,46.10417
2008-01-04,42.97917
2008-01-05,57.3125


In [28]:
# Create 7 days of lag values to predict current observation
df7d = pd.DataFrame()

# Create column t
df7d['t'] = df_daily['pm25']

for i in range(1, 8):
    # Column t-i holds the daily mean from i days earlier
    df7d[f't-{i}'] = df_daily['pm25'].shift(i)

df7d.head(10)

Datetime,t,t-1,t-2,t-3,t-4,t-5,t-6,t-7
2008-01-01,53.58696,,,,,,,
2008-01-02,30.95833,53.58696,,,,,,
2008-01-03,46.10417,30.95833,53.58696,,,,,
2008-01-04,42.97917,46.10417,30.95833,53.58696,,,,
2008-01-05,57.3125,42.97917,46.10417,30.95833,53.58696,,,
2008-01-06,36.0625,57.3125,42.97917,46.10417,30.95833,53.58696,,
2008-01-07,46.08333,36.0625,57.3125,42.97917,46.10417,30.95833,53.58696,
2008-01-08,45.04167,46.08333,36.0625,57.3125,42.97917,46.10417,30.95833,53.58696
2008-01-09,101.375,45.04167,46.08333,36.0625,57.3125,42.97917,46.10417,30.95833
2008-01-10,110.08333,101.375,45.04167,46.08333,36.0625,57.3125,42.97917,46.10417


In [29]:
# Drop the first 7 rows, whose lag columns are incomplete (NaN)
df7d = df7d.iloc[7:]
df7d.head()

Datetime,t,t-1,t-2,t-3,t-4,t-5,t-6,t-7
2008-01-08,45.04167,46.08333,36.0625,57.3125,42.97917,46.10417,30.95833,53.58696
2008-01-09,101.375,45.04167,46.08333,36.0625,57.3125,42.97917,46.10417,30.95833
2008-01-10,110.08333,101.375,45.04167,46.08333,36.0625,57.3125,42.97917,46.10417
2008-01-11,141.83333,110.08333,101.375,45.04167,46.08333,36.0625,57.3125,42.97917
2008-01-12,47.625,141.83333,110.08333,101.375,45.04167,46.08333,36.0625,57.3125


In [30]:
# Remove Datetime index and calculate date-related features from it
df7d = build_datetime_features(df7d, 'Datetime')
df7d.head()

Unnamed: 0,t,t-1,t-2,t-3,t-4,t-5,t-6,t-7,month,day,hour,dayofyear,weekofyear,dayofweek,quarter,season
0,45.04167,46.08333,36.0625,57.3125,42.97917,46.10417,30.95833,53.58696,1,8,0,8,2,1,1,1
1,101.375,45.04167,46.08333,36.0625,57.3125,42.97917,46.10417,30.95833,1,9,0,9,2,2,1,1
2,110.08333,101.375,45.04167,46.08333,36.0625,57.3125,42.97917,46.10417,1,10,0,10,2,3,1,1
3,141.83333,110.08333,101.375,45.04167,46.08333,36.0625,57.3125,42.97917,1,11,0,11,2,4,1,1
4,47.625,141.83333,110.08333,101.375,45.04167,46.08333,36.0625,57.3125,1,12,0,12,2,5,1,1
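
In a frame laid out like this, column `t` is the value to predict and every other column is a candidate input. As a minimal sketch, assuming a downstream model expects a separate feature matrix and target vector, the split could look as follows; the toy frame uses a subset of the columns above and is for illustration only.

```python
import pandas as pd

# Toy frame with the same layout as the lagged view (hypothetical subset of columns)
frame = pd.DataFrame({
    "t": [45.04167, 101.375],
    "t-1": [46.08333, 45.04167],
    "t-2": [36.0625, 46.08333],
    "month": [1, 1],
})

# Target: the current observation
y = frame["t"]
# Features: lagged values and date-derived columns
X = frame.drop(columns=["t"])

print(X.columns.tolist())
```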


In [31]:
data_file_hdf = data_path + 'dfpm25_2008-2018_ml_7days_lags.hdf'
df7d.to_hdf(data_file_hdf, key='df', mode='w')

### Test Read

In [32]:
df7d = pd.read_hdf(path_or_buf=data_file_hdf, key="df")
print(f'Dataframe size: {df7d.shape}')
df7d.head()

Dataframe size: (4012, 16)


Unnamed: 0,t,t-1,t-2,t-3,t-4,t-5,t-6,t-7,month,day,hour,dayofyear,weekofyear,dayofweek,quarter,season
0,45.04167,46.08333,36.0625,57.3125,42.97917,46.10417,30.95833,53.58696,1,8,0,8,2,1,1,1
1,101.375,45.04167,46.08333,36.0625,57.3125,42.97917,46.10417,30.95833,1,9,0,9,2,2,1,1
2,110.08333,101.375,45.04167,46.08333,36.0625,57.3125,42.97917,46.10417,1,10,0,10,2,3,1,1
3,141.83333,110.08333,101.375,45.04167,46.08333,36.0625,57.3125,42.97917,1,11,0,11,2,4,1,1
4,47.625,141.83333,110.08333,101.375,45.04167,46.08333,36.0625,57.3125,1,12,0,12,2,5,1,1
