# Seoul Bike Trip Duration Prediction

<img src="Features_Description.png" style="float:right;" width="500"/>

### Context
- Trip duration is a fundamental measure in every mode of transportation.
- Predicting it accurately is therefore important for Intelligent Transport Systems (ITS) and traveller information systems.
- The paper cited under Acknowledgements applies data mining techniques to predict the trip duration of rental bikes in the Seoul Bike sharing system.
- The prediction combines Seoul Bike trip data with weather data.

### Content
- The data include trip duration, trip distance, pickup and dropoff latitude and longitude, temperature, precipitation, wind speed, humidity, solar radiation, snowfall, ground temperature and 1-hour average dust concentration.

### Acknowledgements
- V E, Sathishkumar (2020), "Seoul Bike Trip duration prediction", Mendeley Data, V1, doi: 10.17632/gtfh9z865f.1
- Sathishkumar V E, Jangwoo Park, Yongyun Cho, (2019), Seoul bike trip duration prediction using data mining techniques, IET Intelligent Transport Systems, doi: 10.1049/iet-its.2019.0796

### Goal
- Predict the trip duration

### Steps
- Exploratory Data Analysis (EDA)
- **Data Preprocessing**
- **Feature Selection / Transformation**
- Machine Learning Algorithm
- Feature Importance / Engineering
- Hyperparameter Tuning
- Model Deployment

## Load libraries

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

import joblib

from helper_functions import *

from timeit import default_timer as timer

In [None]:
# import sklearn
# sklearn.__version__ #'0.22.1'

## Data

In [None]:
dataset = joblib.load('data/dataset.pkl')

In [None]:
dataset.sample(10).T

## Data Preprocessing

In [None]:
dataset.shape

### Check for missing values

In [None]:
dataset.isnull().sum().sum()
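
The grand total above only says whether anything is missing. As a small illustration on a hypothetical toy frame (the column names and values are made up, not taken from the dataset), the same `isnull()` mask can also be broken down per column and per row:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with two missing cells (not the real dataset).
toy = pd.DataFrame({
    "Duration": [5.0, np.nan, 12.0],
    "Temp": [20.0, 21.0, np.nan],
    "Wind": [1.0, 2.0, 3.0],
})

missing = toy.isnull()

total_missing = int(missing.sum().sum())            # all missing cells
per_column = missing.sum()                          # missing cells per feature
rows_with_missing = int(missing.any(axis=1).sum())  # rows with at least one gap

print(total_missing, rows_with_missing)
print(per_column)
```

The per-column view is usually the more useful one when deciding whether to impute a feature or drop it.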

### Remove outliers

In [None]:
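# Interquartile range (IQR) of every column
Q1 = dataset.quantile(0.25)
Q3 = dataset.quantile(0.75)
IQR = Q3 - Q1
# True where a cell falls outside the 1.5 * IQR fences
a = (dataset < (Q1 - 1.5 * IQR)) | (dataset > (Q3 + 1.5 * IQR))
# Total number of flagged cells (a row can be counted more than once)
a.sum(axis=1).sum()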
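
The count above is a number of cells, not rows. The following sketch applies the same 1.5 × IQR rule to a hypothetical toy frame (invented values) in which one row is extreme in both columns, to show how the cell count and the row count differ:

```python
import pandas as pd

# Hypothetical toy data: the last row is extreme in both columns.
toy = pd.DataFrame({
    "Distance": [1.0, 1.2, 0.9, 1.1, 1.0, 25.0],
    "Temp": [20.0, 21.0, 19.5, 20.2, 20.5, -40.0],
})

q1 = toy.quantile(0.25)
q3 = toy.quantile(0.75)
iqr = q3 - q1

# True where a cell lies outside the 1.5 * IQR fences of its column.
mask = (toy < (q1 - 1.5 * iqr)) | (toy > (q3 + 1.5 * iqr))

flagged_cells = int(mask.sum().sum())        # every outlying cell
flagged_rows = int(mask.any(axis=1).sum())   # rows with at least one outlying cell

print(flagged_cells, flagged_rows)
```

If rows were to be dropped based on this rule, `mask.any(axis=1)` would be the quantity to look at.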

In [None]:
from scipy import stats

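# Absolute z-score of every cell, computed column by column
z = np.abs(stats.zscore(dataset))
# Row and column positions of cells more than 3 standard deviations from the mean
out = np.where(z>3)
# Number of flagged cells (a row appears once per offending column)
out[0].shape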

In [None]:
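# np.where returns row positions, not index labels, so filter with a boolean mask:
# keep only the rows in which no column has |z| > 3
dataset = dataset[~(np.asarray(z) > 3).any(axis=1)]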
dataset.shape
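
A minimal sketch of z-score filtering on a frame whose index is not `0..n-1`, since row positions and index labels then disagree. The toy values and column names are assumptions for illustration, and the z-scores are computed by hand with `ddof=0`, which is the default `scipy.stats.zscore` also uses:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with index labels 100..119 instead of 0..19.
values = np.full(20, 10.0)
values[7] = 1000.0  # one extreme value at position 7 (label 107)
toy = pd.DataFrame(
    {"Duration": values, "Distance": np.arange(20, dtype=float)},
    index=np.arange(100, 120),
)

# Population z-scores (ddof=0) per column.
z = np.abs((toy - toy.mean()) / toy.std(ddof=0))

# Keep rows where no column exceeds the threshold. The mask is positional,
# so the result does not depend on what the index labels are.
keep = ~(z.to_numpy() > 3).any(axis=1)
cleaned = toy[keep]

print(toy.shape, cleaned.shape)
```

On this toy frame, `toy.drop(index=[7])` would raise a `KeyError` because no row carries the label 7, which is why positions should not be passed to `drop` unless the index is known to be a default range.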

### Remove duplicated instances

In [None]:
dataset = dataset.drop_duplicates()
dataset.shape

### Dump the dataset

In [None]:
joblib.dump(dataset, 'data/dataset_cleaned.pkl')

## Feature Transformation

### Check for categorical features for encoding

In [None]:
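# List the columns stored as strings/objects; an empty list means nothing needs encoding
dataset.select_dtypes(include='object').columns.tolist()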

## Feature Selection

In [None]:
frac = 1
# frac = 0.1

# Sample once so that features and target are guaranteed to stay aligned
sampled = dataset.sample(frac=frac, random_state=42)
X = sampled.drop(columns='Duration')
y = sampled['Duration']

In [None]:
from sklearn.model_selection import train_test_split

# train:val:test = 80:10:10
X_train, X_val_test, y_train, y_val_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_val, X_test, y_val, y_test = \
train_test_split(X_val_test, y_val_test, test_size=0.5, random_state=42)

In [None]:
X_train.shape, X_val.shape, X_test.shape
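
The 80:10:10 split comes from two chained calls: 20% is held out first, and that hold-out is then halved. A quick check of the arithmetic on synthetic stand-in arrays (the `_demo` names and the 1000-row size are assumptions for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 1000 rows, one feature.
X_demo = np.arange(1000).reshape(-1, 1)
y_demo = np.arange(1000)

# Stage 1: 80% train, 20% held out.
X_tr, X_rest, y_tr, y_rest = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)
# Stage 2: split the held-out 20% in half, giving 10% validation and 10% test.
X_va, X_te, y_va, y_te = train_test_split(
    X_rest, y_rest, test_size=0.5, random_state=42
)

sizes = (len(X_tr), len(X_va), len(X_te))
print(sizes)

# Every row lands in exactly one of the three parts.
all_rows = np.concatenate([X_tr, X_va, X_te]).ravel()
no_overlap = len(set(all_rows.tolist())) == len(X_demo)
```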

### Recursive Feature Elimination (RFE)

In [None]:
from sklearn.feature_selection import RFE
from sklearn.ensemble import RandomForestRegressor
import timeit

estimator = RandomForestRegressor(n_estimators=100, n_jobs=-1) 
rfe = RFE(estimator, n_features_to_select=10)
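# Fit the selector on the training split only ...
X_train_rfe = rfe.fit_transform(X_train, y_train)
# ... and reuse the fitted selector on the other splits, so that all three
# share the same features and no validation/test information leaks in
X_val_rfe = rfe.transform(X_val)
X_test_rfe = rfe.transform(X_test)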
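
A feature selector is normally fitted once, on the training split, and then only applied to the remaining splits. The sketch below shows that pattern on synthetic regression data; the data, the `_demo` names, and the small forest size are assumptions chosen to keep the example fast, not the notebook's actual settings:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import RFE
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (not the Seoul Bike dataset).
X_demo, y_demo = make_regression(
    n_samples=200, n_features=6, n_informative=3, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0
)

selector = RFE(
    RandomForestRegressor(n_estimators=10, random_state=0),
    n_features_to_select=3,
)

# Fit on the training split only, then reuse the fitted selector.
X_tr_sel = selector.fit_transform(X_tr, y_tr)
X_te_sel = selector.transform(X_te)

print(selector.support_)   # boolean mask of the kept features
print(selector.ranking_)   # 1 = kept, larger = eliminated earlier
```

Calling `fit_transform` again on another split would refit the selector and overwrite `support_` and `ranking_`, so the splits could end up with different columns.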

In [None]:
plt.figure(figsize=(5,5), dpi=75)
sns.barplot(y=X_train.columns, x=max(rfe.ranking_)-rfe.ranking_);

In [None]:
rfe_features = X_train.columns[rfe.support_]
rfe_features

In [None]:
# Keep only the RFE-selected columns in each split
X_train_ = X_train[rfe_features]
X_val_ = X_val[rfe_features]
X_test_ = X_test[rfe_features]

In [None]:
joblib.dump(X_train_, 'data/X_train.pkl')
joblib.dump(X_val_, 'data/X_val.pkl')
joblib.dump(X_test_, 'data/X_test.pkl')

joblib.dump(y_train, 'data/y_train.pkl')
joblib.dump(y_val, 'data/y_val.pkl')
joblib.dump(y_test, 'data/y_test.pkl')
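
A later notebook can read these files back with `joblib.load`. As a small round-trip sketch, using a temporary directory and a hypothetical two-row frame rather than the real `data/` files:

```python
import os
import tempfile

import joblib
import pandas as pd

# Hypothetical stand-in for one of the dumped splits.
frame = pd.DataFrame({"Distance": [1.0, 2.5], "Temp": [20.0, 18.5]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "X_train.pkl")
    joblib.dump(frame, path)        # write the frame to disk
    restored = joblib.load(path)    # read it back

# The restored frame has the same values, columns and index.
round_trip_ok = restored.equals(frame)
print(round_trip_ok)
```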