# Seoul Bike Trip Duration Prediction

<img src="Features_Description.png" style="float:right;" width="500"/>

### Context
- Trip duration is the most fundamental measure in all modes of transportation. 
- Hence, it is crucial to predict the trip-time precisely for the advancement of Intelligent Transport Systems (ITS) and traveller information systems. 
- This paper employs data mining techniques to predict the trip duration of rental bikes in the Seoul Bike sharing system. 
- The prediction is carried out with the combination of Seoul Bike data and weather data.

### Content
- The data include trip duration, trip distance, pickup and dropoff latitude and longitude, temperature, precipitation, wind speed, humidity, solar radiation, snowfall, ground temperature and 1-hour average dust concentration.

### Acknowledgements
- V E, Sathishkumar (2020), "Seoul Bike Trip duration prediction", Mendeley Data, V1, doi: 10.17632/gtfh9z865f.1
- Sathishkumar V E, Jangwoo Park, Yongyun Cho, Seoul bike trip duration prediction using data mining techniques, IET Intelligent Transport Systems, doi: 10.1049/iet-its.2019.0796

### Goal
- Predict the trip duration

### Steps
- Exploratory Data Analysis (EDA)
- **Building Machine Learning Model**
    - Data Preprocessing
    - Feature Selection / Transformation
    - Machine Learning Algorithm
    - Feature Importance / Tuning
    - Hyperparameter Tuning
- Model Deployment

## Load libraries

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

import joblib

from helper_functions import *

from timeit import default_timer as timer

## Data

In [2]:
dataset = joblib.load('data/dataset_cleaned.pkl')

In [3]:
dataset.sample(10).T

| | 6264153 | 7518263 | 5688412 | 4256657 | 2125591 | 5507646 | 92552 | 7462415 | 5741517 | 982980 |
|---|---|---|---|---|---|---|---|---|---|---|
| Duration | 8.0 | 65.0 | 11.0 | 52.0 | 22.0 | 2.0 | 4.0 | 14.0 | 19.0 | 43.0 |
| Distance | 790.0 | 14200.0 | 1780.0 | 12960.0 | 4390.0 | 27730.0 | 860.0 | 2730.0 | 1600.0 | 6220.0 |
| PLong | 37.536503 | 37.51395 | 37.676941 | 37.59425 | 37.506302 | 37.488991 | 37.558365 | 37.516598 | 37.515831 | 37.499889 |
| PLatd | 126.877747 | 127.030151 | 127.055099 | 127.076576 | 127.121399 | 126.916382 | 127.056908 | 127.00959 | 127.106796 | 126.867699 |
| DLong | 37.534389 | 37.477509 | 37.668671 | 37.565849 | 37.531471 | 37.488991 | 37.563511 | 37.524071 | 37.520451 | 37.551666 |
| DLatd | 126.869598 | 127.126328 | 127.047981 | 127.016403 | 127.111092 | 126.916382 | 127.056725 | 127.02179 | 127.104202 | 126.86409 |
| Haversine | 0.756012 | 9.402831 | 1.1127 | 6.171766 | 2.942596 | 0.0 | 0.572437 | 1.359475 | 0.562361 | 5.766138 |
| Pmonth | 9.0 | 10.0 | 9.0 | 7.0 | 5.0 | 9.0 | 1.0 | 10.0 | 9.0 | 4.0 |
| Pday | 18.0 | 15.0 | 6.0 | 25.0 | 26.0 | 2.0 | 17.0 | 14.0 | 7.0 | 12.0 |
| Phour | 17.0 | 19.0 | 18.0 | 8.0 | 19.0 | 8.0 | 14.0 | 17.0 | 19.0 | 20.0 |
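
The `Haversine` column looks like the great-circle distance between pickup and dropoff in kilometres. The sketch below is an assumption about how such a value can be computed, not the author's code: it uses a mean Earth radius of 6371 km and treats the columns labelled `PLong`/`DLong` as latitudes and `PLatd`/`DLatd` as longitudes, since their sampled values (about 37.5 and 127) match Seoul's latitude and longitude. `haversine_km` is a hypothetical helper name.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

# First sampled trip (index 6264153), coordinates copied from the sample above
d = haversine_km(37.536503, 126.877747, 37.534389, 126.869598)
print(round(d, 2))
```

Under these assumptions the result for this trip lands close to the tabulated `Haversine` value of 0.756.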


## Data preprocessing

### Check for missing values

In [4]:
dataset.isnull().sum().sum()

0

### Remove duplicated instances

In [5]:
# dataset = dataset.drop_duplicates()
# joblib.dump(dataset, 'data/dataset_cleaned.pkl')
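
A minimal illustration of what `drop_duplicates` does, using a made-up three-row frame (the values are not taken from the dataset):

```python
import pandas as pd

toy = pd.DataFrame({'Duration': [8, 8, 22], 'Distance': [790, 790, 4390]})

n_dupes = int(toy.duplicated().sum())   # rows identical to an earlier row
deduped = toy.drop_duplicates()         # keeps the first occurrence of each row
print(n_dupes, len(deduped))
```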

### Divide data into dependent (y) and independent (X) variables

In [6]:
X = dataset.drop(columns='Duration').sample(frac=1., random_state=42)
y = dataset['Duration'].sample(frac=1., random_state=42)

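
Shuffling `X` and `y` in two separate `sample` calls relies on both calls using the same `random_state` on objects of the same length to keep rows paired. A sketch of an alternative, on a made-up frame, is to shuffle once and split afterwards, so the pairing does not depend on that detail:

```python
import pandas as pd

toy = pd.DataFrame({'Duration': [8, 65, 11, 52],
                    'Distance': [790, 14200, 1780, 12960]})

shuffled = toy.sample(frac=1.0, random_state=42)   # shuffle the rows once
X_toy = shuffled.drop(columns='Duration')
y_toy = shuffled['Duration']

# Features and target share the same shuffled index
print(X_toy.index.equals(y_toy.index))
```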

### Check for Categorical values

In [None]:
X.select_dtypes(include='object').sum()

### Check for ordinal variables

In [None]:
ordinal_features = ['Pmonth', 'Pday', 'Phour', 'Pmin', 'PDweek', 'Dmonth', 'Dday', 'Dhour', 'Dmin', 'DDweek']

In [None]:
# dropping less useful features: Pmin, Dmin, Pday, Dday
ordinal_features = ['Pmonth', 'Phour', 'PDweek', 'Dmonth', 'Dhour', 'DDweek']

In [None]:
# cast to str so that get_dummies treats these columns as categorical
for f in ordinal_features:
    X[f] = X[f].apply(str)

In [None]:
X = pd.get_dummies(X)
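
To illustrate the encoding step: casting a numeric column to `str` makes `pd.get_dummies` treat it as categorical and expand it into one indicator column per distinct value, while numeric columns pass through unchanged. A small sketch with made-up values:

```python
import pandas as pd

toy = pd.DataFrame({'Distance': [790.0, 14200.0, 1780.0],
                    'Phour': [17.0, 19.0, 17.0]})
toy['Phour'] = toy['Phour'].astype(str)   # '17.0', '19.0', '17.0'

encoded = pd.get_dummies(toy)             # Distance stays, Phour is expanded
print(list(encoded.columns))
```

Each distinct level becomes its own column, so encoding several multi-level features this way can widen `X` considerably.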

### Remove less relevant features

In [None]:
# X = X.drop(columns=ordinal_features)

# X['long_diff'] = abs(X['DLong'] - X['PLong'])
# X['latd_diff'] = abs(X['DLatd'] - X['PLatd'])

# X = X.drop(columns=['DLong', 'PLong', 'DLatd', 'PLatd'])

In [None]:
X.sample(2)

### Split the dataset into training and test set

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

### Feature scaling

In [None]:
from sklearn.preprocessing import StandardScaler

# fit on the training split only, then apply the same statistics to the test split
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
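
The scaler is fitted on the training split only, and the test split reuses the training mean and standard deviation, so no test-set statistics leak into preprocessing. A sketch of the same pattern on random stand-in data (the array shapes and values are arbitrary):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_demo = rng.normal(loc=50.0, scale=10.0, size=(200, 3))   # stand-in features
y_demo = rng.normal(size=200)                               # stand-in target

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.25, random_state=42)

demo_scaler = StandardScaler()
X_tr_s = demo_scaler.fit_transform(X_tr)   # statistics estimated from the training rows
X_te_s = demo_scaler.transform(X_te)       # same statistics applied to the test rows

print(X_tr_s.mean(axis=0).round(6), X_tr_s.std(axis=0).round(6))
```

The scaled training columns have zero mean and unit standard deviation by construction; the scaled test columns will only be approximately so.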

## Feature selection / transformation

In [None]:
# will come back to this later

## Machine Learning Algorithm

In [None]:
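def fit(modelName, model, error=None):
    """Fit `model`, then record train/test RMSE, the target mean and the elapsed time."""
    from sklearn.metrics import mean_squared_error

    # create the results table on first use instead of relying on a mutable default argument
    if error is None:
        error = pd.DataFrame(index=['Train', 'Test', 'y_mean', 'Time (s)'])

    start = timer()
    model.fit(X_train, y_train)
    # RMSE as sqrt(MSE); the `squared=False` argument is deprecated in recent scikit-learn releases
    rmse_train = mean_squared_error(y_train, model.predict(X_train))**0.5
    rmse_test  = mean_squared_error(y_test, model.predict(X_test))**0.5
    end = timer()  # elapsed time covers fitting and both prediction passes

    error[modelName] = [rmse_train, rmse_test, y.mean(), end-start]

    return model, error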
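
The table built by `fit` stores the root-mean-square error, $\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_i (y_i - \hat{y}_i)^2}$, next to the mean of `y` to give a sense of scale. A small numeric sketch with made-up numbers follows; the baseline that always predicts the mean is an addition for illustration and is not part of the notebook.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root-mean-square error between two equally long sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

y_true = [3.0, 5.0, 7.0]
y_pred = [2.0, 5.0, 9.0]

model_rmse = rmse(y_true, y_pred)                            # sqrt((1 + 0 + 4) / 3)
baseline_rmse = rmse(y_true, np.full(3, np.mean(y_true)))    # always predict the mean
print(model_rmse, baseline_rmse)
```

A model whose test RMSE is not clearly below such a baseline has learned little beyond the average trip duration.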

### Linear regression

In [None]:
from sklearn.linear_model import LinearRegression

lr, error = fit('LR', LinearRegression())
error

In [None]:
from sklearn.linear_model import Lasso

lasso, error = fit('Lasso', Lasso(), error)
error

### Decision Tree

In [None]:
from sklearn.tree import DecisionTreeRegressor

dt, error = fit('DT', DecisionTreeRegressor(min_samples_split=10), error)
error

### Random Forest

In [None]:
from sklearn.ensemble import RandomForestRegressor

rf, error = fit('RF_10', RandomForestRegressor(n_estimators=10, n_jobs=8), error)
error

### Neural Network

In [None]:
import tensorflow as tf
from tensorflow import keras

nodes = 64
n_hidden = 2
activation = 'relu'

# Sequential expects a list of layers
model = tf.keras.Sequential([keras.layers.Dense(nodes, activation=activation, input_shape=X_train.shape[1:])])

for _ in range(n_hidden-1):  
    model.add(keras.layers.Dense(nodes, activation=activation))

model.add(keras.layers.Dense(1))

model.summary()
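
As a cross-check on `model.summary()`, the parameter count of a fully connected layer is `(inputs + 1) * units` (weights plus one bias per unit). The sketch below applies that formula to the architecture above; `n_features=10` is a placeholder, since the real width of `X_train` after one-hot encoding is not shown here, and `mlp_param_count` is a hypothetical helper.

```python
def mlp_param_count(n_features, nodes=64, n_hidden=2):
    """Trainable parameters of n_hidden Dense(nodes) layers followed by Dense(1)."""
    total = 0
    width_in = n_features
    for _ in range(n_hidden):
        total += (width_in + 1) * nodes   # weights + biases of one hidden layer
        width_in = nodes
    total += (width_in + 1) * 1           # single-unit output layer
    return total

print(mlp_param_count(n_features=10))
```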

In [None]:
epochs = 100
batch_size = 10000

optimizer = tf.keras.optimizers.Adam(learning_rate=1e-2)
model.compile(loss='mean_squared_error', optimizer=optimizer)

start = timer()
history = model.fit(X_train, y_train, epochs=epochs, batch_size=batch_size)
end = timer()

In [None]:
from sklearn.metrics import mean_squared_error

rmse_train = mean_squared_error(y_train, model.predict(X_train, batch_size=batch_size))**0.5
rmse_test = mean_squared_error(y_test, model.predict(X_test, batch_size=batch_size))**0.5
error['NN_CPU'] = [rmse_train, rmse_test, y.mean(), end-start]
error

In [None]:
y_pred = model.predict(X_test, batch_size=batch_size)
rel_res = (y_test.values-y_pred[:,0])/y_test.values

fig, ax = plt.subplots(ncols=2, nrows=1, figsize=(15,5), dpi=100)

ax[0].scatter(y_test.values, y_pred[:,0], alpha=0.1);
ax[0].set_xlabel('$y_{test}$')
ax[0].set_ylabel('$y_{pred}$')
ax[0].set_xlim(-10,130)
ax[0].set_ylim(-10,130)
ax[0].plot(range(-10,130), range(-10,130), '--k', lw=2);

ax[1].scatter(range(len(rel_res)), rel_res, alpha=0.1);
ax[1].set_xlabel('$i$')
ax[1].set_ylabel('$(y_{test}^i-y_{pred}^i)/y_{test}^i$');
ax[1].plot(range(len(rel_res)), 0*np.arange(len(rel_res)), '--k', lw=2);
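
The right-hand panel plots the signed relative residual $(y_{test}^i - y_{pred}^i)/y_{test}^i$ computed in `rel_res`. A dense scatter can be hard to read, so a numeric summary may help. The sketch below uses made-up values and skips zero durations, which would otherwise cause a division by zero; `relative_residuals` is a hypothetical helper.

```python
import numpy as np

def relative_residuals(y_true, y_pred):
    """Signed relative residuals, ignoring entries where y_true is zero."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    mask = y_true != 0
    return (y_true[mask] - y_pred[mask]) / y_true[mask]

rel = relative_residuals([10.0, 20.0, 0.0, 40.0], [12.0, 15.0, 1.0, 40.0])
median_abs_rel = float(np.median(np.abs(rel)))
print(rel, median_abs_rel)
```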