# Contents <a id='back'></a>

* [Introduction](#intro)
* [1. Data Overview](#data_review)
    * [Conclusions](#data_review_conclusions)
* [2. Data Preprocessing](#data_preprocessing)
    * [2.1 Data Resampling](#data_resampling)
    * [2.2 Features Engineering](#features_engineering)
* [3. Model Training and Evaluation](#model_training_evaluation)
    * [3.1 Splitting the dataset into a training set and a test set](#splitting_dataset)
    * [3.2 Training Model](#training_model)
* [4. Testing the Model on Test Set](#testing_model_test_set)
* [General Conclusion](#end)

# Introduction <a id='intro'></a>

In this project, I train several models with varying hyperparameters to predict the number of taxi orders for the next hour. The raw data is first resampled to hourly totals.


**Objective:**

Find the best model, with an RMSE below 48 on the test set.

**This project will comprise the following steps:**

1. Data Overview
2. Data Preprocessing
3. Model Training and Evaluation
4. Testing the Model on Test Set

[Back to Contents](#back)

## 1. Data Overview <a id='data_review'></a>

The steps to be performed are as follows:
1. Checking the number of rows and columns.
2. Checking for missing values.
3. Checking for duplicate data.
4. Checking statistical information in columns with numerical data types.
5. Checking values in columns with categorical data types.

[Back to Contents](#back)

In [1]:
# load libraries

import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split

from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

from sklearn.metrics import mean_squared_error

import warnings
warnings.filterwarnings('ignore')
pd.set_option('display.max_columns', 100)

### 1.1 Data Exploration: df dataset

In [2]:
df = pd.read_csv('data/taxi.csv', parse_dates = [0])

In [3]:
df.shape

(26496, 2)

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 26496 entries, 0 to 26495
Data columns (total 2 columns):
 #   Column      Non-Null Count  Dtype         
---  ------      --------------  -----         
 0   datetime    26496 non-null  datetime64[ns]
 1   num_orders  26496 non-null  int64         
dtypes: datetime64[ns](1), int64(1)
memory usage: 414.1 KB


In [5]:
(df.isnull().sum() / len(df) * 100).sort_values()

datetime      0.0
num_orders    0.0
dtype: float64
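
The overview plan also lists a duplicate check, which none of the cells above performs. A minimal sketch of one way to do it follows, run on a small made-up frame (`toy` and its values are hypothetical, not the taxi data). On the real data, the same calls would be applied to `df`.

```python
import pandas as pd

# Hypothetical toy frame standing in for the taxi data (values are made up).
toy = pd.DataFrame({
    "datetime": pd.to_datetime([
        "2018-03-01 00:00", "2018-03-01 00:10",
        "2018-03-01 00:10", "2018-03-01 00:20",
    ]),
    "num_orders": [9, 14, 14, 28],
})

# Fully duplicated rows (same timestamp and same order count).
n_duplicate_rows = int(toy.duplicated().sum())

# Repeated timestamps, regardless of the order count.
n_duplicate_timestamps = int(toy["datetime"].duplicated().sum())

# True when the timestamps never go backwards.
is_sorted = bool(toy["datetime"].is_monotonic_increasing)

print(n_duplicate_rows, n_duplicate_timestamps, is_sorted)
```

For a time series, repeated timestamps matter more than repeated counts, since the same number of orders can legitimately occur at different times.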

### 1.2 Conclusion of Data Overview step <a id='data_review_conclusions'></a>

1. The "parse_dates" argument was added when loading the dataset to convert the datetime column from object to datetime64.
2. There are no missing values in either column.
3. The historical data starts on March 1, 2018, and runs through August 31, 2018.
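
The date range in point 3 is not printed by any cell above. One way to confirm it, and to see the sampling interval at the same time, is sketched below on a made-up frame (`toy` is hypothetical). With the real data, `df['datetime']` would take its place.

```python
import pandas as pd

# Hypothetical toy frame: six rows spaced 10 minutes apart (values are made up).
toy = pd.DataFrame({
    "datetime": pd.date_range("2018-03-01 00:00", periods=6, freq="10min"),
    "num_orders": [9, 14, 28, 20, 32, 21],
})

# First and last timestamps in the data.
first_ts = toy["datetime"].min()
last_ts = toy["datetime"].max()

# Most common gap between consecutive rows, i.e. the sampling interval.
step = toy["datetime"].diff().dropna().value_counts().idxmax()

print(first_ts, last_ts, step)
```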

## 2. Data Preprocessing <a id='data_preprocessing'></a>

### 2.1 Data Resampling <a id='data_resampling'></a>

In [6]:
df = df.resample('1H', on = 'datetime').sum()

In [7]:
df.shape

(4416, 1)

In [8]:
df.tail(10)

datetime,num_orders
2018-08-31 14:00:00,133
2018-08-31 15:00:00,116
2018-08-31 16:00:00,197
2018-08-31 17:00:00,217
2018-08-31 18:00:00,207
2018-08-31 19:00:00,136
2018-08-31 20:00:00,154
2018-08-31 21:00:00,159
2018-08-31 22:00:00,223
2018-08-31 23:00:00,205


In [9]:
df.info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 4416 entries, 2018-03-01 00:00:00 to 2018-08-31 23:00:00
Freq: H
Data columns (total 1 columns):
 #   Column      Non-Null Count  Dtype
---  ------      --------------  -----
 0   num_orders  4416 non-null   int64
dtypes: int64(1)
memory usage: 69.0 KB


**Findings**

After resampling to hourly totals, the number of rows drops from 26496 to 4416.
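
To illustrate what the resampling step does, here is a small sketch on made-up 10-minute data (the `toy` frame and its values are hypothetical). The drop from 26496 to 4416 rows is a factor of six, which would be consistent with six 10-minute rows collapsing into each hour. The frequency is written here as `'60min'`, which should select the same hourly bins as the `'1H'` used above.

```python
import pandas as pd

# Hypothetical toy frame: two hours of 10-minute readings (values are made up).
toy = pd.DataFrame({
    "datetime": pd.date_range("2018-03-01 00:00", periods=12, freq="10min"),
    "num_orders": [1, 2, 3, 4, 5, 6, 10, 20, 30, 40, 50, 60],
})

# Sum the order counts inside each hourly bin; the index becomes the bin start.
hourly = toy.resample("60min", on="datetime").sum()

print(hourly)
```

Using `sum()` rather than `mean()` keeps the target in units of orders per hour, which is what the model is asked to predict.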

### 2.2 Features Engineering <a id='features_engineering'></a>

In [10]:
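# function to build calendar, lag, and rolling-mean features (modifies data in place)
def make_features(data, max_lag, rolling_mean_size):
    # calendar features taken from the datetime index
    data['year'] = data.index.year
    data['month'] = data.index.month
    data['day'] = data.index.day
    data['hour'] = data.index.hour

    # lag features: orders from 1..max_lag hours earlier
    for lag in range(1, max_lag + 1):
        data['lag_{}'.format(lag)] = data['num_orders'].shift(lag)

    # rolling mean of the previous hours; shift() keeps the current hour out of it
    data['rolling_mean'] = (
        data['num_orders'].shift().rolling(rolling_mean_size).mean()
    )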

In [11]:
make_features(df, 6, 10)

In [12]:
df.head()

datetime,num_orders,year,month,day,hour,lag_1,lag_2,lag_3,lag_4,lag_5,lag_6,rolling_mean
2018-03-01 00:00:00,124,2018,3,1,0,,,,,,,
2018-03-01 01:00:00,85,2018,3,1,1,124.0,,,,,,
2018-03-01 02:00:00,71,2018,3,1,2,85.0,124.0,,,,,
2018-03-01 03:00:00,66,2018,3,1,3,71.0,85.0,124.0,,,,
2018-03-01 04:00:00,43,2018,3,1,4,66.0,71.0,85.0,124.0,,,
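
The `shift()` call before `rolling()` in `make_features` matters: it keeps the current hour's value out of its own rolling mean, so each feature row uses only earlier hours. The sketch below contrasts the two variants on a made-up series (the values and the column name `leaky_rolling_mean` are illustrative only).

```python
import pandas as pd

# Hypothetical hourly order counts (values are made up).
s = pd.Series([10, 20, 30, 40, 50], name="num_orders")

feats = pd.DataFrame({
    "num_orders": s,
    # Value from one step earlier.
    "lag_1": s.shift(1),
    # Mean of the two previous values only (no look-ahead).
    "rolling_mean": s.shift().rolling(2).mean(),
    # Mean that includes the current value, which leaks the target.
    "leaky_rolling_mean": s.rolling(2).mean(),
})

print(feats)
```

In the leaky column, each row's mean already contains the value the model is supposed to predict, which would make the reported error look better than it is.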


[Back to Contents](#back)

## 3. Model Training and Evaluation <a id='model_training_evaluation'></a>

### 3.1 Splitting the dataset into a training set and a test set <a id='splitting_dataset'></a>

In [13]:
train, test = train_test_split(df, shuffle=False, test_size=0.2)

In [14]:
train = train.dropna()

In [15]:
# defining features and target
features_train = train.drop(['num_orders'], axis = 1)
target_train = train['num_orders']
features_test = test.drop(['num_orders'], axis = 1)
target_test = test['num_orders']

In [16]:
features_train.shape

(3522, 11)

In [17]:
target_train.shape

(3522,)

In [18]:
features_test.shape

(884, 11)

In [19]:
target_test.shape

(884,)
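
The shapes above can be checked by hand. With `shuffle=False` the split keeps chronological order, so the test set is the last 20% of the rows. The sketch below reproduces the row counts with plain arithmetic. It assumes the test size is rounded up for a fractional `test_size` (my understanding of scikit-learn's behaviour, and consistent with the 884 reported above) and that the rows removed by `dropna()` are the leading ones where the longest look-back feature is still undefined.

```python
import math

n_rows = 4416            # hourly rows after resampling
test_size = 0.2          # fraction passed to train_test_split
max_lag = 6              # as passed to make_features
rolling_mean_size = 10   # as passed to make_features

# Assumed rounding rule: the test set gets the ceiling of the fraction.
n_test = math.ceil(test_size * n_rows)
n_train = n_rows - n_test

# Leading rows with at least one NaN feature: the longest look-back wins.
n_nan_rows = max(max_lag, rolling_mean_size)
n_train_after_dropna = n_train - n_nan_rows

print(n_train, n_test, n_train_after_dropna)
```

Because all NaN rows sit at the start of the series, they fall entirely in the training set, which is why only `train` needs `dropna()`.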

### 3.2 Training Model <a id='training_model'></a>

#### 3.2.1 Linear Regression Model

In [20]:
linreg_model = LinearRegression()
linreg_model.fit(features_train, target_train)

In [21]:
pred_train_linreg = linreg_model.predict(features_train)
pred_test_linreg = linreg_model.predict(features_test)

In [22]:
rmse_train_linreg = np.sqrt(mean_squared_error(target_train, pred_train_linreg))
rmse_test_linreg = np.sqrt(mean_squared_error(target_test, pred_test_linreg))

In [23]:
print("RMSE for training set using Linear Regression model is: ", rmse_train_linreg)
print("RMSE for test set using Linear Regression model is: ", rmse_test_linreg)

RMSE for training set using Linear Regression model is:  28.993953135372347
RMSE for test set using Linear Regression model is:  47.80919467530927


#### 3.2.2 Decision Tree Regressor Model

In [24]:
for depth in range(1, 20):
    model_dtree = DecisionTreeRegressor(max_depth=depth, random_state = 42)
    model_dtree.fit(features_train, target_train)
    
    pred_train_dt = model_dtree.predict(features_train)
    pred_test_dt = model_dtree.predict(features_test)
    
    rmse_train_dt = np.sqrt(mean_squared_error(target_train, pred_train_dt))
    rmse_test_dt = np.sqrt(mean_squared_error(target_test, pred_test_dt))
    
    print("At max_depth", depth, "the RMSE value for both train set and test set are:")
    print('train set ', rmse_train_dt)
    print('test set ', rmse_test_dt)
    print()

At max_depth 1 the RMSE value for both train set and test set are:
train set  31.8738529696808
test set  66.75266892956195

At max_depth 2 the RMSE value for both train set and test set are:
train set  29.08273130505385
test set  65.31556238996731

At max_depth 3 the RMSE value for both train set and test set are:
train set  26.863380220409365
test set  59.64397778378063

At max_depth 4 the RMSE value for both train set and test set are:
train set  25.706221631715724
test set  58.75823358931518

At max_depth 5 the RMSE value for both train set and test set are:
train set  24.1374876460292
test set  55.47290601497393

At max_depth 6 the RMSE value for both train set and test set are:
train set  21.944683991086585
test set  53.75862523855387

At max_depth 7 the RMSE value for both train set and test set are:
train set  20.074509673464426
test set  50.590928094129055

At max_depth 8 the RMSE value for both train set and test set are:
train set  18.296289944215076
test set  48.081222357577
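
The printed results are easier to compare in a table. The sketch below uses only the eight results shown above, rounded to two decimals; the output stops at depth 8 although the loop runs to 19, so deeper trees are not covered. The column name `gap` is my own.

```python
import pandas as pd

# Results printed above for max_depth 1..8, rounded to two decimals.
results = pd.DataFrame({
    "max_depth": [1, 2, 3, 4, 5, 6, 7, 8],
    "rmse_train": [31.87, 29.08, 26.86, 25.71, 24.14, 21.94, 20.07, 18.30],
    "rmse_test": [66.75, 65.32, 59.64, 58.76, 55.47, 53.76, 50.59, 48.08],
})

# Gap between test and train error; a large gap hints at overfitting.
results["gap"] = results["rmse_test"] - results["rmse_train"]

# Row with the lowest test RMSE among the depths shown.
best = results.loc[results["rmse_test"].idxmin()]

print(results)
print("best max_depth among those shown:", int(best["max_depth"]))
```

Collecting results this way inside the loop (for example by appending a dict per depth to a list) would avoid reading the numbers back from printed text.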

#### 3.2.3 Random Forest Regressor Model

In [25]:
max_depth_list = [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15]
n_estimator_list = [100, 200, 300, 400, 500]

for depth in max_depth_list:
    for n_estimator in n_estimator_list:
        model_rf = RandomForestRegressor(max_depth = depth, n_estimators = n_estimator, random_state = 42)
        model_rf.fit(features_train, target_train)
        
        pred_train_rf = model_rf.predict(features_train)
        pred_test_rf = model_rf.predict(features_test)
        
        rmse_train_rf = np.sqrt(mean_squared_error(target_train, pred_train_rf))
        rmse_test_rf = np.sqrt(mean_squared_error(target_test, pred_test_rf))
        
        print("At depth", depth, "and n_estimator", n_estimator, ", the RMSE value for train set is", rmse_train_rf, ", and for test set, it is", rmse_test_rf)

At depth 1 and n_estimator 100 , the RMSE value for train set is 31.541709644056652 , and for test set, it is 66.34107601335101
At depth 1 and n_estimator 200 , the RMSE value for train set is 31.524166218579694 , and for test set, it is 66.2783518246163
At depth 1 and n_estimator 300 , the RMSE value for train set is 31.52313978763837 , and for test set, it is 66.29393399157513
At depth 1 and n_estimator 400 , the RMSE value for train set is 31.52008050874516 , and for test set, it is 66.32776756698024
At depth 1 and n_estimator 500 , the RMSE value for train set is 31.518451743096907 , and for test set, it is 66.36426516504564
At depth 2 and n_estimator 100 , the RMSE value for train set is 28.418751424092406 , and for test set, it is 64.21194810658511
At depth 2 and n_estimator 200 , the RMSE value for train set is 28.41923143208154 , and for test set, it is 64.25693978602396
At depth 2 and n_estimator 300 , the RMSE value for train set is 28.42026919498915 , and for test set, it is

### 3.3 Analysis Result

Of the three models evaluated on features_train and features_test (Linear Regression, Decision Tree Regressor, and Random Forest Regressor), Linear Regression gives the best result among those shown above. Its test-set RMSE (47.81) is lower than the Decision Tree Regressor's at max_depth 8 (48.08) and well below the Random Forest Regressor results shown. Its training RMSE (28.99) is higher than that of the deeper trees, but the gap between its training and test RMSE is smaller than for any tree-based result shown, which suggests less overfitting.

[Back to Contents](#back)

## 4. Testing the Model on Test Set <a id='testing_model_test_set'></a>

In [27]:
model = LinearRegression()
model.fit(features_train, target_train)

In [28]:
pred_train = model.predict(features_train)
pred_test = model.predict(features_test)

In [29]:
rmse_train = np.sqrt(mean_squared_error(target_train, pred_train))
rmse_test = np.sqrt(mean_squared_error(target_test, pred_test))

In [30]:
"The RMSE value for training set using Linear Regression model is", rmse_train

('The RMSE value for training set using Linear Regression model is',
 28.993953135372347)

In [31]:
'The RMSE value for test set using Linear Regression model is ', rmse_test

('The RMSE value for test set using Linear Regression model is ',
 47.80919467530927)

[Back to Contents](#back)

# General Conclusion <a id='end'></a>

1. The best model is Linear Regression, with a test-set RMSE of 47.81.
2. This meets the requirement of an RMSE below 48.

[Back to Contents](#back)