# Predicting the Holdout Dataset

Jayson Yodico <br>
Asian Institute of Management

## Introduction

This notebook predicts the holdout dataset supplied by the event organizers. For details on how the model was developed, please refer to the "Model Development" notebook of this repository.

In [1]:
import pandas as pd
import pygeohash as gh
import matplotlib.pyplot as plt
import numpy as np
import scipy

In [2]:
def denoise(series, pctile):

    """
    Denoises a time series by projecting the series to the frequency domain and
    zeroing out the frequency components whose magnitudes fall below a
    percentile threshold.

    PARAMETERS:

        series: array-like
            - series to be denoised

        pctile: int
            - percentile threshold for denoising. In the frequency domain, all
              frequency components with magnitudes less than this percentile
              are set to zero. Higher percentile values mean a higher level
              of denoising.

    RETURNS:

        cleaned_series: ndarray
            - denoised series (magnitude of the inverse transform)
    
    """

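    # Transform the series to the frequency domain
    sff = np.fft.fft(series)
    abs_sff = np.abs(sff)
    # Zero out components weaker than the percentile threshold
    sff[abs_sff < np.percentile(abs_sff, q=pctile)] = 0
    # Return to the time domain, keeping the magnitude as the denoised value
    cleaned_series = np.abs(np.fft.ifft(sff))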

    return cleaned_series
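As a quick illustration of percentile-based FFT denoising, the sketch below applies the same idea to a synthetic signal. The function name `denoise_sketch`, the sine-plus-noise signal, and the use of NumPy's `np.fft` routines are assumptions made here so that the snippet runs on its own; it is not the data or necessarily the exact call used elsewhere in this repository.

```python
import numpy as np

def denoise_sketch(series, pctile):
    """Zero out FFT components whose magnitude is below the given percentile."""
    spectrum = np.fft.fft(series)
    magnitude = np.abs(spectrum)
    spectrum[magnitude < np.percentile(magnitude, q=pctile)] = 0
    # Magnitude of the inverse transform, mirroring the function above
    return np.abs(np.fft.ifft(spectrum))

# Hypothetical signal: four days of 15-minute buckets with a daily cycle
rng = np.random.default_rng(0)
t = np.arange(96 * 4)
clean = 0.5 + 0.3 * np.sin(2 * np.pi * t / 96)
noisy = clean + rng.normal(scale=0.05, size=t.size)

smoothed = denoise_sketch(noisy, pctile=45)

# Compare the mean squared error against the noise-free signal
mse_noisy = np.mean((noisy - clean) ** 2)
mse_smoothed = np.mean((smoothed - clean) ** 2)
print(mse_noisy, mse_smoothed)
```

Because the function returns a magnitude, its output is never negative. That suits a non-negative quantity such as demand, but it would fold negative values of a signed series.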

## Open Training and Holdout Datasets Here

Upon inspection, it was observed that there are no demand data for some location-time bucket pairs. These missing values must be filled in so that every geohash time series is complete.

Since all demand data are assumed to be included in the timeframe of the training dataset, geohash-time bucket pairs with no demand data are assumed to be zero. This simply means that there is zero demand in that location and time bucket.
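The zero-filling assumption can be sketched on a toy series. The dataframe `observed` and its values are made up for illustration, and `reindex` is only one possible way to fill the gaps; it is not necessarily how the processing loop in this notebook does it.

```python
import pandas as pd

# Hypothetical observations for one geohash: buckets 1, 3 and 4 are missing
observed = pd.DataFrame({'T_n': [0, 2, 5], 'demand': [0.10, 0.25, 0.05]})

# Full range of time buckets up to the latest observed one
full_index = pd.RangeIndex(observed['T_n'].max() + 1, name='T_n')

# Reindex onto the full range; buckets with no record get zero demand
filled = (observed.set_index('T_n')
                  .reindex(full_index, fill_value=0.0)
                  .reset_index())
print(filled)
```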

The purpose of the preprocessing is to fill the gaps in the time series and to make sure that values are arranged chronologically for each location. In addition, the geohashes are decoded to coordinates (latitude, longitude), which are included as additional fields in the processed dataset. The numbered steps are outlined further below, after the datasets are loaded.

Note: The training dataset, training.csv, must be placed in the folder Traffic Management of the repository. The dataset can be downloaded from https://s3-ap-southeast-1.amazonaws.com/grab-aiforsea-dataset/traffic-management.zip.

The holdout dataset is provided by the organizers. The author assumed that the holdout dataset is of the same format and fields as the training dataset.

In [3]:
train_file = 'Traffic Management/training.csv'
test_file = 'Traffic Management/dummytest.csv'

Since this model uses past values to predict future values, the features for the holdout rows may draw on values from the training period. The two datasets are therefore combined first so that a complete reference is available.

In [4]:
df_train = pd.read_csv(train_file)
df_train['data_type'] = 'training'

df_test = pd.read_csv(test_file)
df_test['data_type'] = 'holdout'

df_traintest = pd.concat([df_train, df_test])
df_traintest['timestamp'] = pd.to_datetime(df_traintest['timestamp'], format= '%H:%M').dt.time
df_traintest.head()

Unnamed: 0,geohash6,day,timestamp,demand,data_type
0,qp03wc,18,20:00:00,0.020072,training
1,qp03pn,10,14:30:00,0.024721,training
2,qp09sw,9,06:15:00,0.102821,training
3,qp0991,32,05:00:00,0.088755,training
4,qp090q,15,04:00:00,0.074468,training


1. Create dummy datetime values to serve as reference in ordering the `day` and `timestamp` columns.

In [5]:
# CREATE DUMMY DATETIME VALUES
dates = pd.DataFrame()
dates['dummy_date'] = pd.date_range(start=pd.Timestamp(2019, 1, 1),
                              periods=len(df_traintest.day.unique())+1)
dates['day'] = np.arange(1, dates.shape[0] + 1)

2. With the dummy datetime values, the sequence can be ordered. For each location, `T_n` serves as an ID indicating the order in which a value appears in that location's time series.

In [6]:
"""
Create a reference table of chronological ordering indices (T_n) mapped to
timestamp and day values.
"""

timenum = pd.DataFrame(pd.date_range(start=dates.dummy_date.min(),
                                     end=dates.dummy_date.max(),
                                     freq='15min'), columns=['dummy_datetime'])

timenum['dummy_date'] = pd.to_datetime(timenum['dummy_datetime'].dt.date)
timenum['timestamp'] = timenum['dummy_datetime'].dt.time

timenum['T_n'] = np.arange(timenum.shape[0])
timenum = timenum.merge(dates, on='dummy_date', how='left')

del timenum['dummy_datetime'], timenum['dummy_date']
timenum.head()

Unnamed: 0,timestamp,T_n,day
0,00:00:00,0,1
1,00:15:00,1,1
2,00:30:00,2,1
3,00:45:00,3,1
4,01:00:00,4,1
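Since each day holds 96 fifteen-minute buckets, the `T_n` value in the reference table is equivalent to a simple arithmetic formula on `day` and `timestamp`. The helper below, with the hypothetical name `time_index`, is a sketch of that arithmetic for checking individual values and is not part of the original pipeline.

```python
from datetime import time

BUCKETS_PER_DAY = 24 * 4  # 96 fifteen-minute buckets in a day

def time_index(day, timestamp):
    """Return the chronological index T_n for a 1-based day and a time of day."""
    bucket_in_day = timestamp.hour * 4 + timestamp.minute // 15
    return (day - 1) * BUCKETS_PER_DAY + bucket_in_day

print(time_index(1, time(0, 45)))    # 3, as in the table above
print(time_index(62, time(2, 45)))   # 5867 = 61 * 96 + 11
```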


In [7]:
pctile = 45
window = 92

3. Denoise the time series of each geohash.
4. Extract features and targets.

Note that the most recent 5 values have no complete set of future values to refer to, so these are excluded from the calculation.

In [8]:
# ATTACH THE CHRONOLOGICAL INDEX T_n TO EACH ROW
df_traintest2 = df_traintest.merge(timenum, on=['day', 'timestamp'], how='left')

g = df_traintest2.groupby(['geohash6'])

all_data = []

for loc in g.groups.keys():
    
    test = g.get_group(loc)
    dummy = timenum[timenum.T_n <= test.T_n.max()]
    dummy = dummy.merge(test[['data_type','T_n', 'demand']], on='T_n', how='inner').fillna(0)

    dummy['geohash6'] = loc
    dummy['lat'] = gh.decode(loc)[0]
    dummy['long'] = gh.decode(loc)[1]
    
    dummy['demand_fft'] = denoise(dummy.demand.values, pctile)
    
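    # Targets: 'demand' at the current row plus the next four rows
    for fwd in range(-2,-6,-1):
        dummy[f'demand+{-1*fwd}'] = dummy['demand'].shift(fwd+1)

    # Features: the previous `window` denoised values
    for bwd in range(1, window + 1):
        dummy[f'demand_fft-{bwd}'] = dummy['demand_fft'].shift(bwd)
    
    # Keep only complete rows, then only those belonging to the holdout set
    dummy = dummy.dropna()    
    dummy = dummy[dummy['data_type'] == 'holdout']    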
    
    all_data.append(dummy)

test_data = pd.concat(all_data)
test_data.head()

Unnamed: 0,timestamp,T_n,day,data_type,demand,geohash6,lat,long,demand_fft,demand+2,...,demand_fft-83,demand_fft-84,demand_fft-85,demand_fft-86,demand_fft-87,demand_fft-88,demand_fft-89,demand_fft-90,demand_fft-91,demand_fft-92
577,02:45:00,5867,62,holdout,0.020592,qp02yc,-5.48,90.7,0.027832,0.010292,...,0.003353,0.030117,0.016688,0.029827,0.009256,0.034402,0.018682,0.044374,0.00964,0.01028
578,03:00:00,5868,62,holdout,0.010292,qp02yc,-5.48,90.7,0.000855,0.006676,...,0.039929,0.003353,0.030117,0.016688,0.029827,0.009256,0.034402,0.018682,0.044374,0.00964
579,04:00:00,5872,62,holdout,0.006676,qp02yc,-5.48,90.7,0.013886,0.003822,...,0.01848,0.039929,0.003353,0.030117,0.016688,0.029827,0.009256,0.034402,0.018682,0.044374
580,04:30:00,5874,62,holdout,0.003822,qp02yc,-5.48,90.7,0.004838,0.011131,...,0.000338,0.01848,0.039929,0.003353,0.030117,0.016688,0.029827,0.009256,0.034402,0.018682
581,06:45:00,5883,62,holdout,0.011131,qp02yc,-5.48,90.7,0.015758,0.013487,...,0.02806,0.000338,0.01848,0.039929,0.003353,0.030117,0.016688,0.029827,0.009256,0.034402
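The shifting logic used to build targets and lagged features can be easier to follow on a toy series. The sketch below uses made-up values, a stand-in column named `smoothed` in place of the denoised series, and a window of 2 instead of 92; only the shift pattern is meant to mirror the loop above.

```python
import pandas as pd

toy = pd.DataFrame({'demand': [10, 20, 30, 40, 50, 60, 70, 80]})
toy['smoothed'] = toy['demand']  # stand-in for the denoised series

# Targets: a negative shift pulls later rows up to the current row
for fwd in range(-2, -6, -1):
    toy[f'demand+{-fwd}'] = toy['demand'].shift(fwd + 1)

# Features: a positive shift pushes earlier rows down to the current row
for bwd in range(1, 3):
    toy[f'smoothed-{bwd}'] = toy['smoothed'].shift(bwd)

# Rows with any missing target or feature are dropped
complete = toy.dropna()
print(complete)
```

Only rows 2 and 3 survive `dropna()`: the first two rows lack lagged features, and the last four rows lack at least one forward target.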


`compare_actual` is a dataframe that places the actual values `demand`, `demand+2`, `demand+3`, `demand+4`, `demand+5` and the predicted values `pred_demand`, `pred_demand+2`, `pred_demand+3`, `pred_demand+4`, `pred_demand+5` side by side for the computation of the RMSE. It is initialized here with the actual values; the predicted values are added once the model has been run.

In [14]:
compare_actual = test_data[['geohash6', 'day','timestamp', 'T_n',
                        'demand','demand+2','demand+3','demand+4',
                        'demand+5']].copy()
compare_actual.head()

Unnamed: 0,geohash6,day,timestamp,T_n,demand,demand+2,demand+3,demand+4,demand+5
577,qp02yc,62,02:45:00,5867,0.020592,0.010292,0.006676,0.003822,0.011131
578,qp02yc,62,03:00:00,5868,0.010292,0.006676,0.003822,0.011131,0.013487
579,qp02yc,62,04:00:00,5872,0.006676,0.003822,0.011131,0.013487,0.003709
580,qp02yc,62,04:30:00,5874,0.003822,0.011131,0.013487,0.003709,0.011041
581,qp02yc,62,06:45:00,5883,0.011131,0.013487,0.003709,0.011041,0.040743


In [15]:
to_predict = test_data.drop(['timestamp', 'T_n', 'day', 'data_type',
                             'demand', 'geohash6','demand_fft',
                             'demand+2','demand+3','demand+4','demand+5'],
                            axis=1)
to_predict.head()

Unnamed: 0,lat,long,demand_fft-1,demand_fft-2,demand_fft-3,demand_fft-4,demand_fft-5,demand_fft-6,demand_fft-7,demand_fft-8,...,demand_fft-83,demand_fft-84,demand_fft-85,demand_fft-86,demand_fft-87,demand_fft-88,demand_fft-89,demand_fft-90,demand_fft-91,demand_fft-92
577,-5.48,90.7,0.010659,0.020209,0.034297,0.068202,0.004567,0.02538,0.001686,0.04026,...,0.003353,0.030117,0.016688,0.029827,0.009256,0.034402,0.018682,0.044374,0.00964,0.01028
578,-5.48,90.7,0.027832,0.010659,0.020209,0.034297,0.068202,0.004567,0.02538,0.001686,...,0.039929,0.003353,0.030117,0.016688,0.029827,0.009256,0.034402,0.018682,0.044374,0.00964
579,-5.48,90.7,0.000855,0.027832,0.010659,0.020209,0.034297,0.068202,0.004567,0.02538,...,0.01848,0.039929,0.003353,0.030117,0.016688,0.029827,0.009256,0.034402,0.018682,0.044374
580,-5.48,90.7,0.013886,0.000855,0.027832,0.010659,0.020209,0.034297,0.068202,0.004567,...,0.000338,0.01848,0.039929,0.003353,0.030117,0.016688,0.029827,0.009256,0.034402,0.018682
581,-5.48,90.7,0.004838,0.013886,0.000855,0.027832,0.010659,0.020209,0.034297,0.068202,...,0.02806,0.000338,0.01848,0.039929,0.003353,0.030117,0.016688,0.029827,0.009256,0.034402


In [16]:
from keras.models import load_model

model = load_model('model1.hdf5')
preds = model.predict(to_predict.values)

In [17]:
compare_actual['pred_demand'] = preds[:,0]
compare_actual['pred_demand+2'] = preds[:,1]
compare_actual['pred_demand+3'] = preds[:,2]
compare_actual['pred_demand+4'] = preds[:,3]
compare_actual['pred_demand+5'] = preds[:,4]

compare_actual.head()

Unnamed: 0,geohash6,day,timestamp,T_n,demand,demand+2,demand+3,demand+4,demand+5,pred_demand,pred_demand+2,pred_demand+3,pred_demand+4,pred_demand+5
577,qp02yc,62,02:45:00,5867,0.020592,0.010292,0.006676,0.003822,0.011131,0.023836,0.023909,0.025075,0.025549,0.025957
578,qp02yc,62,03:00:00,5868,0.010292,0.006676,0.003822,0.011131,0.013487,0.025623,0.025715,0.026089,0.026962,0.027151
579,qp02yc,62,04:00:00,5872,0.006676,0.003822,0.011131,0.013487,0.003709,0.020764,0.022543,0.02368,0.024522,0.025512
580,qp02yc,62,04:30:00,5874,0.003822,0.011131,0.013487,0.003709,0.011041,0.020393,0.021332,0.021505,0.022597,0.022828
581,qp02yc,62,06:45:00,5883,0.011131,0.013487,0.003709,0.011041,0.040743,0.017685,0.019102,0.018623,0.01993,0.020291


The RMSE can be computed for evaluation at this point.
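One way to compute it is sketched below. The function name `rmse_by_horizon` and the demo dataframe are hypothetical, and pooling all five horizons into a single overall RMSE is an assumption; the organizers' exact evaluation metric may aggregate differently.

```python
import numpy as np
import pandas as pd

def rmse_by_horizon(df, actual_cols, pred_cols):
    """Return the RMSE of each horizon and the RMSE pooled over all horizons."""
    errors = df[pred_cols].to_numpy() - df[actual_cols].to_numpy()
    per_horizon = pd.Series(np.sqrt((errors ** 2).mean(axis=0)),
                            index=actual_cols)
    overall = float(np.sqrt((errors ** 2).mean()))
    return per_horizon, overall

actual_cols = ['demand', 'demand+2', 'demand+3', 'demand+4', 'demand+5']
pred_cols = ['pred_' + c for c in actual_cols]

# Made-up numbers purely to exercise the function: every prediction is off by 0.1
demo = pd.DataFrame({c: [0.2, 0.4] for c in actual_cols})
for c in actual_cols:
    demo['pred_' + c] = demo[c] + np.array([0.1, -0.1])

per_horizon, overall = rmse_by_horizon(demo, actual_cols, pred_cols)
print(per_horizon)
print(overall)
```

With the dataframe built in this notebook, the call would be `rmse_by_horizon(compare_actual, actual_cols, pred_cols)`.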