# Final Preprocessing
This notebook preprocesses the dataset from CSV into NumPy arrays that are ready for a train-test split.  


Addition: I found some stacked values in the weather columns (e.g. `('Rain', 'Clear')`), which may have been caused by my previous processing. Left unhandled, each stacked string would be treated as a separate category, so I replace it with the first value only (`'Rain'`). Also, the null value in `is_holiday` (stored as `not_holiday` in this file) is replaced by 'Not a Holiday'.

**input**:
- dataset/secondhalf_v3.csv -> SecondHalf dataset from the previous notebook

**output**:
- outputs/OHE_encoder_second_v3.joblib -> final encoder that will be used on the website
- outputs/MM_scaler_second_v2.joblib -> final scaler that will be used on the website
- outputs/dataset_for_kaggle.csv -> dataset as a pd.DataFrame (encoded, scaled) that I use to test SHAP on Kaggle
- dataset/secondhalf_v4.csv -> final dataset that will be uploaded to the database for the website
- outputs/train_x_second_v3 -> NumPy array, dataset reshaped to 3D before the train-test split (features)
- outputs/train_y_second_v3 -> NumPy array, dataset reshaped before the train-test split (label)

# Import Libs and Dataset

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [2]:
df_second = pd.read_csv('dataset/secondhalf_v3.csv')
df_second.info()
df_second.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16948 entries, 0 to 16947
Data columns (total 15 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   is_holiday           16948 non-null  object 
 1   air_pollution_index  16948 non-null  float64
 2   humidity             16948 non-null  float64
 3   wind_speed           16948 non-null  float64
 4   wind_direction       16948 non-null  float64
 5   visibility_in_miles  16948 non-null  float64
 6   dew_point            16948 non-null  float64
 7   temperature          16948 non-null  float64
 8   rain_p_h             16948 non-null  float64
 9   snow_p_h             16948 non-null  float64
 10  clouds_all           16948 non-null  float64
 11  weather_type         16948 non-null  object 
 12  weather_description  16948 non-null  object 
 13  traffic_volume       16948 non-null  float64
 14  date_time            16948 non-null  object 
dtypes: float64(11), object(4)
memory usa

Unnamed: 0,is_holiday,air_pollution_index,humidity,wind_speed,wind_direction,visibility_in_miles,dew_point,temperature,rain_p_h,snow_p_h,clouds_all,weather_type,weather_description,traffic_volume,date_time
0,not_holiday,282.0,65.0,3.0,327.0,5.0,5.0,287.586,0.0,0.0,92.0,Rain,light rain,2886.0,2015-06-11 20:00:00
1,not_holiday,273.0,65.0,3.0,326.909091,5.045455,5.045455,288.038591,0.0,0.0,87.818182,Rain,light rain,2953.909091,2015-06-11 21:00:00
2,not_holiday,264.0,65.0,3.0,326.818182,5.090909,5.090909,288.491182,0.0,0.0,83.636364,Rain,light rain,3021.818182,2015-06-11 22:00:00
3,not_holiday,255.0,65.0,3.0,326.727273,5.136364,5.136364,288.943773,0.0,0.0,79.454545,Rain,light rain,3089.727273,2015-06-11 23:00:00
4,not_holiday,246.0,65.0,3.0,326.636364,5.181818,5.181818,289.396364,0.0,0.0,75.272727,Clear,sky is clear,3157.636364,2015-06-12 00:00:00


In [3]:
df_second['weather_description'].value_counts()

sky is clear                              6310
overcast clouds                           1998
scattered clouds                          1243
broken clouds                             1157
mist                                      1014
light snow                                 964
light rain                                 964
few clouds                                 682
haze                                       571
light intensity drizzle                    514
drizzle                                    440
fog                                        415
moderate rain                              258
heavy intensity rain                       170
proximity thunderstorm                      51
proximity shower rain                       33
heavy intensity drizzle                     24
('light rain', 'few clouds')                24
('light intensity drizzle', 'drizzle')      24
('few clouds', 'few clouds')                24
('light rain', 'sky is clear')              24
thunderstorm 

In [4]:
df_second['weather_type'].value_counts()

Clear                   6271
Clouds                  5206
Mist                    1562
Rain                    1191
Drizzle                 1043
Snow                     605
Haze                     545
Fog                      360
Thunderstorm              69
('Drizzle', 'Clear')      24
('Rain', 'Clear')         24
('Clouds', 'Clouds')      24
('Rain', 'Clouds')        24
Name: weather_type, dtype: int64

In [5]:
# Collapse stacked values to their first element (assign back instead of chained inplace)
df_second['weather_type'] = df_second['weather_type'].replace({
    "('Clouds', 'Clouds')": "Clouds",
    "('Rain', 'Clear')": "Rain",
    "('Rain', 'Clouds')": "Rain",
    "('Drizzle', 'Clear')": "Drizzle",
})
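
The hard-coded replacements above can also be expressed as one rule: if a value is a stringified tuple, keep its first element. The sketch below is only an illustration of that idea; `first_of_stacked` is a hypothetical helper name, not something used elsewhere in this notebook.

```python
import ast

def first_of_stacked(value):
    """Return the first element if `value` is a stringified tuple, else `value` unchanged."""
    if isinstance(value, str) and value.startswith("(") and value.endswith(")"):
        try:
            parsed = ast.literal_eval(value)
        except (ValueError, SyntaxError):
            return value  # not a valid Python literal; leave it untouched
        if isinstance(parsed, tuple) and parsed:
            return parsed[0]
    return value

examples = ["Rain", "('Rain', 'Clear')", "('Clouds', 'Clouds')", "('Drizzle', 'Clear')"]
cleaned = [first_of_stacked(v) for v in examples]
print(cleaned)
```

Assuming the columns hold plain strings as shown in the counts above, the same helper could be applied with `df_second['weather_type'].map(first_of_stacked)`, which avoids listing every stacked combination by hand.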

In [6]:
# Same clean-up for the descriptions; "SQUALLS" is lower-cased to match the other labels.
# NOTE: the second replacement value ends with a stray apostrophe, so those 24 rows
# stay a separate category ("light intensity drizzle'") in the counts and encoder columns.
df_second['weather_description'] = df_second['weather_description'].replace({
    "('light rain', 'sky is clear')": "light rain",
    "('light intensity drizzle', 'drizzle')": "light intensity drizzle'",
    "('light rain', 'few clouds')": "light rain",
    "('few clouds', 'few clouds')": "few clouds",
    "SQUALLS": "squalls",
})

In [7]:
df_second['weather_description'].value_counts()

sky is clear                    6310
overcast clouds                 1998
scattered clouds                1243
broken clouds                   1157
mist                            1014
light rain                      1012
light snow                       964
few clouds                       706
haze                             571
light intensity drizzle          514
drizzle                          440
fog                              415
moderate rain                    258
heavy intensity rain             170
proximity thunderstorm            51
proximity shower rain             33
light intensity drizzle'          24
heavy intensity drizzle           24
thunderstorm                      14
snow                               9
very heavy rain                    4
light shower snow                  4
heavy snow                         4
thunderstorm with light rain       4
light intensity shower rain        3
freezing rain                      1
squalls                            1
N

In [8]:
df_second['weather_type'].value_counts()

Clear           6271
Clouds          5230
Mist            1562
Rain            1239
Drizzle         1067
Snow             605
Haze             545
Fog              360
Thunderstorm      69
Name: weather_type, dtype: int64

In [9]:
df_second['is_holiday'] = df_second['is_holiday'].replace("not_holiday", "Not a Holiday")

In [10]:
df_second.head()

Unnamed: 0,is_holiday,air_pollution_index,humidity,wind_speed,wind_direction,visibility_in_miles,dew_point,temperature,rain_p_h,snow_p_h,clouds_all,weather_type,weather_description,traffic_volume,date_time
0,Not a Holiday,282.0,65.0,3.0,327.0,5.0,5.0,287.586,0.0,0.0,92.0,Rain,light rain,2886.0,2015-06-11 20:00:00
1,Not a Holiday,273.0,65.0,3.0,326.909091,5.045455,5.045455,288.038591,0.0,0.0,87.818182,Rain,light rain,2953.909091,2015-06-11 21:00:00
2,Not a Holiday,264.0,65.0,3.0,326.818182,5.090909,5.090909,288.491182,0.0,0.0,83.636364,Rain,light rain,3021.818182,2015-06-11 22:00:00
3,Not a Holiday,255.0,65.0,3.0,326.727273,5.136364,5.136364,288.943773,0.0,0.0,79.454545,Rain,light rain,3089.727273,2015-06-11 23:00:00
4,Not a Holiday,246.0,65.0,3.0,326.636364,5.181818,5.181818,289.396364,0.0,0.0,75.272727,Clear,sky is clear,3157.636364,2015-06-12 00:00:00


In [11]:
round_cols = ['air_pollution_index', 'humidity', 'wind_speed', 'wind_direction', 'visibility_in_miles', 'dew_point', 'temperature', 'rain_p_h', 'snow_p_h', 'clouds_all', 'traffic_volume']

In [12]:
df_second[round_cols] = df_second[round_cols].round(2)

In [13]:
df_second.info()
df_second.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16948 entries, 0 to 16947
Data columns (total 15 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   is_holiday           16948 non-null  object 
 1   air_pollution_index  16948 non-null  float64
 2   humidity             16948 non-null  float64
 3   wind_speed           16948 non-null  float64
 4   wind_direction       16948 non-null  float64
 5   visibility_in_miles  16948 non-null  float64
 6   dew_point            16948 non-null  float64
 7   temperature          16948 non-null  float64
 8   rain_p_h             16948 non-null  float64
 9   snow_p_h             16948 non-null  float64
 10  clouds_all           16948 non-null  float64
 11  weather_type         16948 non-null  object 
 12  weather_description  16948 non-null  object 
 13  traffic_volume       16948 non-null  float64
 14  date_time            16948 non-null  object 
dtypes: float64(11), object(4)
memory usa

Unnamed: 0,is_holiday,air_pollution_index,humidity,wind_speed,wind_direction,visibility_in_miles,dew_point,temperature,rain_p_h,snow_p_h,clouds_all,weather_type,weather_description,traffic_volume,date_time
0,Not a Holiday,282.0,65.0,3.0,327.0,5.0,5.0,287.59,0.0,0.0,92.0,Rain,light rain,2886.0,2015-06-11 20:00:00
1,Not a Holiday,273.0,65.0,3.0,326.91,5.05,5.05,288.04,0.0,0.0,87.82,Rain,light rain,2953.91,2015-06-11 21:00:00
2,Not a Holiday,264.0,65.0,3.0,326.82,5.09,5.09,288.49,0.0,0.0,83.64,Rain,light rain,3021.82,2015-06-11 22:00:00
3,Not a Holiday,255.0,65.0,3.0,326.73,5.14,5.14,288.94,0.0,0.0,79.45,Rain,light rain,3089.73,2015-06-11 23:00:00
4,Not a Holiday,246.0,65.0,3.0,326.64,5.18,5.18,289.4,0.0,0.0,75.27,Clear,sky is clear,3157.64,2015-06-12 00:00:00


In [14]:
# df_second.to_csv('dataset/secondhalf_v4.csv', index=False)

# One Hot Encode

In [14]:
from sklearn.preprocessing import OneHotEncoder

In [15]:
# Categorical Features
ohe_column = ['is_holiday', 'weather_type', 'weather_description']

In [16]:
# `sparse=False` returns a dense array; scikit-learn >= 1.2 names this argument `sparse_output`
encoder_second = OneHotEncoder(sparse=False)

In [17]:
encoded_data_second = encoder_second.fit_transform(df_second[ohe_column])

In [18]:
encoded_data_second

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])
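
To make the encoded array easier to read, here is a minimal NumPy sketch of what one-hot encoding does to a single categorical column. It is not scikit-learn's implementation, and the toy values are just a small subset of the `weather_type` categories.

```python
import numpy as np

# Toy column; a small subset of the real `weather_type` categories.
weather_type = np.array(["Rain", "Clear", "Clouds", "Rain"])

# np.unique returns the distinct values in sorted order, which mirrors the
# alphabetical column order seen in the encoder's feature names.
categories = np.unique(weather_type)

# Compare every row against every category: one 1.0 per row, 0.0 elsewhere.
one_hot = (weather_type[:, None] == categories[None, :]).astype(float)

print(categories)
print(one_hot)
```

Each row contains exactly one 1.0 per encoded column group, so with three categorical columns every row of `encoded_data_second` should sum to 3.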

In [25]:
dir(encoder_second)

['__class__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__getstate__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__le__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__setstate__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__weakref__',
 '_check_X',
 '_check_n_features',
 '_compute_drop_idx',
 '_fit',
 '_get_feature',
 '_get_param_names',
 '_get_tags',
 '_more_tags',
 '_repr_html_',
 '_repr_html_inner',
 '_repr_mimebundle_',
 '_transform',
 '_validate_data',
 '_validate_keywords',
 'categories',
 'categories_',
 'drop',
 'drop_idx_',
 'dtype',
 'fit',
 'fit_transform',
 'get_feature_names',
 'get_params',
 'handle_unknown',
 'inverse_transform',
 'set_params',
 'sparse',
 'transform']

In [26]:
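# `get_feature_names()` was removed in scikit-learn 1.2; newer versions provide `get_feature_names_out()`
df_encoded_second = pd.DataFrame(encoded_data_second, columns=encoder_second.get_feature_names())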

In [27]:
df_encoded_second.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16948 entries, 0 to 16947
Data columns (total 48 columns):
 #   Column                           Non-Null Count  Dtype  
---  ------                           --------------  -----  
 0   x0_Christmas Day                 16948 non-null  float64
 1   x0_Columbus Day                  16948 non-null  float64
 2   x0_Independence Day              16948 non-null  float64
 3   x0_Labor Day                     16948 non-null  float64
 4   x0_Martin Luther King Jr Day     16948 non-null  float64
 5   x0_Memorial Day                  16948 non-null  float64
 6   x0_New Years Day                 16948 non-null  float64
 7   x0_Not a Holiday                 16948 non-null  float64
 8   x0_State Fair                    16948 non-null  float64
 9   x0_Thanksgiving Day              16948 non-null  float64
 10  x0_Veterans Day                  16948 non-null  float64
 11  x0_Washingtons Birthday          16948 non-null  float64
 12  x1_Clear          

Save the encoder so that it can be reused when processing other data.

In [29]:
import joblib

# Save the encoder to a file so it can be used in other files
# joblib.dump(encoder_second, 'outputs/shap/OHE_encoder_second.joblib')

['outputs/shap/OHE_encoder_second.joblib']

# Concat Encoded Cols to Main df

In [30]:
df_concated_second = pd.concat([df_second, df_encoded_second], axis=1)

In [31]:
df_concated_second.drop(columns=['is_holiday', 'weather_type', 'weather_description'], inplace=True)

In [32]:
df_concated_second.info()
df_concated_second.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16948 entries, 0 to 16947
Data columns (total 60 columns):
 #   Column                           Non-Null Count  Dtype  
---  ------                           --------------  -----  
 0   air_pollution_index              16948 non-null  float64
 1   humidity                         16948 non-null  float64
 2   wind_speed                       16948 non-null  float64
 3   wind_direction                   16948 non-null  float64
 4   visibility_in_miles              16948 non-null  float64
 5   dew_point                        16948 non-null  float64
 6   temperature                      16948 non-null  float64
 7   rain_p_h                         16948 non-null  float64
 8   snow_p_h                         16948 non-null  float64
 9   clouds_all                       16948 non-null  float64
 10  traffic_volume                   16948 non-null  float64
 11  date_time                        16948 non-null  object 
 12  x0_Christmas Day  

Unnamed: 0,air_pollution_index,humidity,wind_speed,wind_direction,visibility_in_miles,dew_point,temperature,rain_p_h,snow_p_h,clouds_all,...,x2_overcast clouds,x2_proximity shower rain,x2_proximity thunderstorm,x2_scattered clouds,x2_sky is clear,x2_snow,x2_squalls,x2_thunderstorm,x2_thunderstorm with light rain,x2_very heavy rain
0,282.0,65.0,3.0,327.0,5.0,5.0,287.59,0.0,0.0,92.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,273.0,65.0,3.0,326.91,5.05,5.05,288.04,0.0,0.0,87.82,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,264.0,65.0,3.0,326.82,5.09,5.09,288.49,0.0,0.0,83.64,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,255.0,65.0,3.0,326.73,5.14,5.14,288.94,0.0,0.0,79.45,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,246.0,65.0,3.0,326.64,5.18,5.18,289.4,0.0,0.0,75.27,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0


# Feature Engineering

In [33]:
df_concated_second['date_time'] = pd.to_datetime(df_concated_second['date_time'])
# Extract day of month, month, and hour
df_concated_second['day'] = df_concated_second['date_time'].dt.day
df_concated_second['month'] = df_concated_second['date_time'].dt.month
df_concated_second['hour'] = df_concated_second['date_time'].dt.hour

In [34]:
# Move the target column to the last position
df_target_second = df_concated_second['traffic_volume']
df_concated_second.drop(columns=['traffic_volume'], inplace=True)
df_concated_second['traffic_volume'] = df_target_second

In [35]:
# Display the updated DataFrame
df_concated_second.info()
df_concated_second.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16948 entries, 0 to 16947
Data columns (total 63 columns):
 #   Column                           Non-Null Count  Dtype         
---  ------                           --------------  -----         
 0   air_pollution_index              16948 non-null  float64       
 1   humidity                         16948 non-null  float64       
 2   wind_speed                       16948 non-null  float64       
 3   wind_direction                   16948 non-null  float64       
 4   visibility_in_miles              16948 non-null  float64       
 5   dew_point                        16948 non-null  float64       
 6   temperature                      16948 non-null  float64       
 7   rain_p_h                         16948 non-null  float64       
 8   snow_p_h                         16948 non-null  float64       
 9   clouds_all                       16948 non-null  float64       
 10  date_time                        16948 non-null  datetime6

Unnamed: 0,air_pollution_index,humidity,wind_speed,wind_direction,visibility_in_miles,dew_point,temperature,rain_p_h,snow_p_h,clouds_all,...,x2_sky is clear,x2_snow,x2_squalls,x2_thunderstorm,x2_thunderstorm with light rain,x2_very heavy rain,day,month,hour,traffic_volume
0,282.0,65.0,3.0,327.0,5.0,5.0,287.59,0.0,0.0,92.0,...,0.0,0.0,0.0,0.0,0.0,0.0,11,6,20,2886.0
1,273.0,65.0,3.0,326.91,5.05,5.05,288.04,0.0,0.0,87.82,...,0.0,0.0,0.0,0.0,0.0,0.0,11,6,21,2953.91
2,264.0,65.0,3.0,326.82,5.09,5.09,288.49,0.0,0.0,83.64,...,0.0,0.0,0.0,0.0,0.0,0.0,11,6,22,3021.82
3,255.0,65.0,3.0,326.73,5.14,5.14,288.94,0.0,0.0,79.45,...,0.0,0.0,0.0,0.0,0.0,0.0,11,6,23,3089.73
4,246.0,65.0,3.0,326.64,5.18,5.18,289.4,0.0,0.0,75.27,...,1.0,0.0,0.0,0.0,0.0,0.0,12,6,0,3157.64


In [29]:
# df_concated_second.to_csv('dataset/df_concated_second')

In [36]:
# Set 'date_time' as the index of the DataFrame
df_concated_second = df_concated_second.set_index('date_time')

In [37]:
df_concated_second.info()
df_concated_second.head()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 16948 entries, 2015-06-11 20:00:00 to 2017-05-17 23:00:00
Data columns (total 62 columns):
 #   Column                           Non-Null Count  Dtype  
---  ------                           --------------  -----  
 0   air_pollution_index              16948 non-null  float64
 1   humidity                         16948 non-null  float64
 2   wind_speed                       16948 non-null  float64
 3   wind_direction                   16948 non-null  float64
 4   visibility_in_miles              16948 non-null  float64
 5   dew_point                        16948 non-null  float64
 6   temperature                      16948 non-null  float64
 7   rain_p_h                         16948 non-null  float64
 8   snow_p_h                         16948 non-null  float64
 9   clouds_all                       16948 non-null  float64
 10  x0_Christmas Day                 16948 non-null  float64
 11  x0_Columbus Day                  16948 non-nu

date_time,air_pollution_index,humidity,wind_speed,wind_direction,visibility_in_miles,dew_point,temperature,rain_p_h,snow_p_h,clouds_all,...,x2_sky is clear,x2_snow,x2_squalls,x2_thunderstorm,x2_thunderstorm with light rain,x2_very heavy rain,day,month,hour,traffic_volume
2015-06-11 20:00:00,282.0,65.0,3.0,327.0,5.0,5.0,287.59,0.0,0.0,92.0,...,0.0,0.0,0.0,0.0,0.0,0.0,11,6,20,2886.0
2015-06-11 21:00:00,273.0,65.0,3.0,326.91,5.05,5.05,288.04,0.0,0.0,87.82,...,0.0,0.0,0.0,0.0,0.0,0.0,11,6,21,2953.91
2015-06-11 22:00:00,264.0,65.0,3.0,326.82,5.09,5.09,288.49,0.0,0.0,83.64,...,0.0,0.0,0.0,0.0,0.0,0.0,11,6,22,3021.82
2015-06-11 23:00:00,255.0,65.0,3.0,326.73,5.14,5.14,288.94,0.0,0.0,79.45,...,0.0,0.0,0.0,0.0,0.0,0.0,11,6,23,3089.73
2015-06-12 00:00:00,246.0,65.0,3.0,326.64,5.18,5.18,289.4,0.0,0.0,75.27,...,1.0,0.0,0.0,0.0,0.0,0.0,12,6,0,3157.64


# MinMaxScaler

In [38]:
from sklearn.preprocessing import MinMaxScaler

In [39]:
scaler_second = MinMaxScaler()
scaler_second = scaler_second.fit(df_concated_second)
scaled_data_second = scaler_second.transform(df_concated_second)
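
With its default settings, `MinMaxScaler` maps each column to [0, 1] via `(x - min) / (max - min)`. Because the scaler here is fit on every column, including `traffic_volume`, a prediction can be converted back to vehicle counts using only the last column's minimum and range (the fitted scaler exposes these as `data_min_` and `data_range_`). The sketch below reproduces the arithmetic in plain NumPy; `minmax_fit` and the toy numbers are made up for illustration.

```python
import numpy as np

def minmax_fit(data):
    """Return the per-column minimum and range that min-max scaling needs."""
    data_min = data.min(axis=0)
    data_range = data.max(axis=0) - data_min
    return data_min, data_range

# Toy matrix: column 0 is an hour (0-23), column 1 is a made-up target.
# Note: this sketch does not handle constant columns (range of zero).
toy = np.array([[0.0, 100.0],
                [20.0, 250.0],
                [23.0, 400.0]])

data_min, data_range = minmax_fit(toy)
scaled = (toy - data_min) / data_range

# Undo the scaling for the last column only, e.g. for model predictions.
target_restored = scaled[:, -1] * data_range[-1] + data_min[-1]

print(round(scaled[1, 0], 6))
print(target_restored)
```

The hour value 20 scales to 20/23, the same 0.869565 that appears in the `hour` column of the scaled DataFrame for 20:00.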

In [41]:
import joblib

# Save the scaler to a file so it can be used in other files
# joblib.dump(scaler_second, 'outputs/shap/MM_scaler_second.joblib')

['outputs/shap/MM_scaler_second.joblib']

In [43]:
dir(scaler_second)

['__class__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__getstate__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__le__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__setstate__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__weakref__',
 '_check_n_features',
 '_get_param_names',
 '_get_tags',
 '_more_tags',
 '_repr_html_',
 '_repr_html_inner',
 '_repr_mimebundle_',
 '_reset',
 '_validate_data',
 'clip',
 'copy',
 'data_max_',
 'data_min_',
 'data_range_',
 'feature_range',
 'fit',
 'fit_transform',
 'get_params',
 'inverse_transform',
 'min_',
 'n_features_in_',
 'n_samples_seen_',
 'partial_fit',
 'scale_',
 'set_params',
 'transform']

In [44]:
feature_names = df_concated_second.columns

In [45]:
df_scaled_second = pd.DataFrame(scaled_data_second, columns=feature_names)

In [46]:
df_scaled_second.index = df_concated_second.index

In [47]:
df_scaled_second.info()
df_scaled_second.head()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 16948 entries, 2015-06-11 20:00:00 to 2017-05-17 23:00:00
Data columns (total 62 columns):
 #   Column                           Non-Null Count  Dtype  
---  ------                           --------------  -----  
 0   air_pollution_index              16948 non-null  float64
 1   humidity                         16948 non-null  float64
 2   wind_speed                       16948 non-null  float64
 3   wind_direction                   16948 non-null  float64
 4   visibility_in_miles              16948 non-null  float64
 5   dew_point                        16948 non-null  float64
 6   temperature                      16948 non-null  float64
 7   rain_p_h                         16948 non-null  float64
 8   snow_p_h                         16948 non-null  float64
 9   clouds_all                       16948 non-null  float64
 10  x0_Christmas Day                 16948 non-null  float64
 11  x0_Columbus Day                  16948 non-nu

date_time,air_pollution_index,humidity,wind_speed,wind_direction,visibility_in_miles,dew_point,temperature,rain_p_h,snow_p_h,clouds_all,...,x2_sky is clear,x2_snow,x2_squalls,x2_thunderstorm,x2_thunderstorm with light rain,x2_very heavy rain,day,month,hour,traffic_volume
2015-06-11 20:00:00,0.941176,0.583333,0.1875,0.908333,0.5,0.5,0.691273,0.0,0.0,0.92,...,0.0,0.0,0.0,0.0,0.0,0.0,0.333333,0.454545,0.869565,0.396429
2015-06-11 21:00:00,0.910035,0.583333,0.1875,0.908083,0.50625,0.50625,0.698311,0.0,0.0,0.8782,...,0.0,0.0,0.0,0.0,0.0,0.0,0.333333,0.454545,0.913043,0.405757
2015-06-11 22:00:00,0.878893,0.583333,0.1875,0.907833,0.51125,0.51125,0.705349,0.0,0.0,0.8364,...,0.0,0.0,0.0,0.0,0.0,0.0,0.333333,0.454545,0.956522,0.415085
2015-06-11 23:00:00,0.847751,0.583333,0.1875,0.907583,0.5175,0.5175,0.712387,0.0,0.0,0.7945,...,0.0,0.0,0.0,0.0,0.0,0.0,0.333333,0.454545,1.0,0.424413
2015-06-12 00:00:00,0.816609,0.583333,0.1875,0.907333,0.5225,0.5225,0.719581,0.0,0.0,0.7527,...,1.0,0.0,0.0,0.0,0.0,0.0,0.366667,0.454545,0.0,0.433742


# Reshape to fit RNN needs

In [48]:
df_scaled_second.info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 16948 entries, 2015-06-11 20:00:00 to 2017-05-17 23:00:00
Data columns (total 62 columns):
 #   Column                           Non-Null Count  Dtype  
---  ------                           --------------  -----  
 0   air_pollution_index              16948 non-null  float64
 1   humidity                         16948 non-null  float64
 2   wind_speed                       16948 non-null  float64
 3   wind_direction                   16948 non-null  float64
 4   visibility_in_miles              16948 non-null  float64
 5   dew_point                        16948 non-null  float64
 6   temperature                      16948 non-null  float64
 7   rain_p_h                         16948 non-null  float64
 8   snow_p_h                         16948 non-null  float64
 9   clouds_all                       16948 non-null  float64
 10  x0_Christmas Day                 16948 non-null  float64
 11  x0_Columbus Day                  16948 non-nu

In [49]:
# df_scaled_second.to_csv('outputs/dataset_for_kaggle.csv')

In [50]:
trainX_second = []
trainY_second = []

In [51]:
n_future = 1   # Number of time steps ahead to predict.
n_past = 24  # Number of past time steps (hours) used as input for each prediction.

In [52]:
for i in range(n_past, len(scaled_data_second) - n_future +1):
    trainX_second.append(scaled_data_second[i - n_past:i, 0:df_scaled_second.shape[1]])
    trainY_second.append(scaled_data_second[i + n_future - 1:i + n_future, -1])

trainX_second, trainY_second = np.array(trainX_second), np.array(trainY_second)

In [53]:
print('trainX_second shape == {}.'.format(trainX_second.shape))
print('trainY_second shape == {}.'.format(trainY_second.shape))

trainX_second shape == (16924, 24, 62).
trainY_second shape == (16924, 1).
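
The sample count follows from the loop bounds: 16948 - 24 - 1 + 1 = 16924 windows. As a sanity check, the same windowing can be run on a tiny array where the result is easy to verify by eye. This is a sketch that mirrors the loop above; `make_windows` is a hypothetical wrapper name.

```python
import numpy as np

def make_windows(data, n_past, n_future=1):
    """Build (samples, n_past, features) inputs and (samples, 1) targets.

    The target is the last column, n_future steps after the end of each window.
    """
    xs, ys = [], []
    for i in range(n_past, len(data) - n_future + 1):
        xs.append(data[i - n_past:i, :])
        ys.append(data[i + n_future - 1:i + n_future, -1])
    return np.array(xs), np.array(ys)

# Toy data: 10 time steps, 3 columns; the last column plays the role of the target.
toy = np.arange(30, dtype=float).reshape(10, 3)
x, y = make_windows(toy, n_past=4, n_future=1)

print(x.shape, y.shape)
print(y[0])
```

Each window holds all columns, so the past values of the target are part of the input features, just as in `trainX_second`.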


In [50]:
# np.save('outputs/train_x_second_v3.npy', trainX_second)
# np.save('outputs/train_y_second_v3.npy', trainY_second)