# Seoul Bike Trip Duration Prediction

<img src="Features_Description.png" style="float:center;" width="500"/>

### Context
Trip duration is the most fundamental measure in all modes of transportation. 
Predicting trip time precisely is therefore crucial for the advancement of Intelligent Transport Systems (ITS) and traveller information systems. 
In the paper cited below, data mining techniques are employed to predict the trip duration of rental bikes in the Seoul Bike sharing system. 
The prediction is carried out with the combination of Seoul Bike data and weather data.
### Content
The data include trip duration, trip distance, pickup and dropoff latitude and longitude, 
temperature, precipitation, wind speed, humidity, solar radiation, snowfall, ground temperature and 1-hour average dust concentration.
### Acknowledgements
- V E, Sathishkumar (2020), "Seoul Bike Trip duration prediction", Mendeley Data, V1, doi: 10.17632/gtfh9z865f.1
- Sathishkumar V E, Jangwoo Park, Yongyun Cho, Seoul bike trip duration prediction using data mining techniques, IET Intelligent Transport Systems, doi: 10.1049/iet-its.2019.0796
### Goal
Predict the trip duration

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

import joblib

from helper_functions import *

## Data

In [None]:
loadPkl = 1 # 0 --> read csv, 1 --> load pkl

if loadPkl:
    # Use pickle to load the data -- faster than reading a csv
    dataset = joblib.load('data/dataset.pkl')
else:
    # 1.23 GB of data
    dataset = pd.read_csv('data/For_modeling.csv')

    # Drop unwanted columns
    dataset = dataset.drop(columns='Unnamed: 0')

    # Dump the dataset to load it later
    joblib.dump(dataset, 'data/dataset.pkl')


In [None]:
# Check for null values
dataset.isnull().sum().sum()

## EDA

### !!! Using a 1% sample for now. Later the notebook will be run on the entire dataset !!!

In [None]:
df = dataset.sample(frac=0.01)
df.sample(10).T

### Adding categorical columns for EDA

- Since the data is huge, we make new features (preferably categorical) for the purpose of EDA
- These new features may also be used later for modeling
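Because the full dataset is large, a vectorized binning step may be worth considering alongside a row-wise `apply`. The following is a minimal sketch on made-up data, using `pd.cut` with the same labels as the duration bins in this notebook. It is an alternative, not the notebook's own method, and its edge handling differs slightly: a value of exactly 60 lands in '30-60' here, and negative values become NaN instead of falling into the last bin.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real data; only the 'Duration' column name comes from the notebook.
toy = pd.DataFrame({'Duration': [0, 12, 30, 45, 60, 75, 90, 110]})

duration_labels = ['00-30', '30-60', '60-90', '90-120']
# Right-closed bins: [0, 30], (30, 60], (60, 90], (90, inf)
duration_bins = [0, 30, 60, 90, np.inf]

toy['duration_range_'] = pd.cut(
    toy['Duration'],
    bins=duration_bins,
    labels=duration_labels,
    include_lowest=True,  # keep a duration of exactly 0 in the first bin
)

# Counts per bin, in label order
print(toy['duration_range_'].value_counts(sort=False))
```

The result is an ordered categorical column, so plots that group by it keep the bin order without an explicit `order` argument.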

In [None]:
df.columns

**_duration_range_**

In [None]:
duration_order = ['00-30', '30-60', '60-90', '90-120']
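def duration_range(duration):
    # Bin a trip duration into one of the labels in duration_order.
    # Exactly 60 falls in '60-90'; anything above 90 (or below 0) falls in the last bin.
    if 0 <= duration <= 30:
        return duration_order[0]
    elif 30 < duration < 60:
        return duration_order[1]
    elif 60 <= duration <= 90:
        return duration_order[2]
    else:
        return duration_order[3]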

df['duration_range_'] = df['Duration'].apply(duration_range)

fig, ax = plt.subplots(nrows=1, ncols=1, figsize=(7,5), dpi=100)
sns.countplot(data=df, x='duration_range_', order=duration_order, ax=ax);

**_distance_range_**

In [None]:
distance_order = ['0-5000', '5000-10000', '10000-15000', '15000+']
distance_order_str = str(distance_order).strip('[').strip(']').split()
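def distance_range(distance):
    # Bin a trip distance into one of the labels in distance_order.
    # Exactly 10000 falls in '10000-15000'; anything above 15000 (or below 0) falls in the last bin.
    if 0 <= distance <= 5000:
        return distance_order[0]
    elif 5000 < distance < 10000:
        return distance_order[1]
    elif 10000 <= distance <= 15000:
        return distance_order[2]
    else:
        return distance_order[3]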

df['distance_range_'] = df['Distance'].apply(distance_range)

fig, ax = plt.subplots(nrows=1, ncols=1, figsize=(7,5), dpi=100)
sns.countplot(data=df, x='distance_range_', order=distance_order, hue='duration_range_', hue_order=duration_order, ax=ax);

**_time_of_day_**

In [None]:
day_order = ['Morning', 'Afternoon', 'Evening', 'Night']
def time_of_day(hr):
    # For integer hours: 5-12 Morning, 13-16 Afternoon, 17-21 Evening, otherwise Night
    if 5 <= hr <= 12:
        return 'Morning'
    elif 12 < hr < 17:
        return 'Afternoon'
    elif 17 <= hr <= 21:
        return 'Evening'
    else:
        return 'Night'

df['time_of_day_'] = df['Phour'].apply(time_of_day)

fig, ax = plt.subplots(nrows=1, ncols=2, figsize=(15,5), dpi=100)
sns.countplot(data=df, x='time_of_day_', order=day_order, hue='distance_range_', hue_order=distance_order, ax=ax[0]);
sns.countplot(data=df, x='time_of_day_', order=day_order, hue='duration_range_', hue_order=duration_order, ax=ax[1]);

**_long_diff, latd_diff_**

In [None]:
df['long_diff_'] = abs(df['PLong'] - df['DLong'])
df['latd_diff_'] = abs(df['PLatd'] - df['DLatd'])

fig, ax = plt.subplots(nrows=1, ncols=2, figsize=(15,5), dpi=100)
sns.scatterplot(data=df, x='long_diff_', y='latd_diff_', hue='distance_range_', hue_order=distance_order, alpha=0.5, ax=ax[0]);
sns.scatterplot(data=df, x='long_diff_', y='latd_diff_', hue='duration_range_', hue_order=duration_order, alpha=0.5, ax=ax[1]);

**_geographical_PCs_**

We use PCA to convert the geographical features -

['Distance', 'Haversine', 'PLong', 'DLong', 'PLatd', 'DLatd', 'long_diff_', 'latd_diff_']

into a few principal components.
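The `pca_pipe` helper comes from `helper_functions`, which is not shown in this notebook. As a rough idea of what such a helper might do, here is a hypothetical sketch that standardizes the columns and then projects them onto `n` principal components with scikit-learn. The name `pca_pipe_sketch`, the scaling step and the return order are assumptions, not the actual implementation.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def pca_pipe_sketch(X, n=3):
    """Hypothetical stand-in for helper_functions.pca_pipe: scale, then project onto n PCs."""
    pipe = make_pipeline(StandardScaler(), PCA(n_components=n))
    components = pipe.fit_transform(X)
    # make_pipeline names each step after its lowercased class name
    explained = pipe.named_steps['pca'].explained_variance_ratio_
    return components, explained


# Synthetic data: 200 rows, 5 columns on very different scales
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(200, 5)) * [1, 10, 100, 1000, 0.1],
                 columns=list('abcde'))

pcs, ratio = pca_pipe_sketch(X, n=3)
print(pcs.shape, ratio.sum())
```

Scaling first matters here because columns such as a distance and a latitude difference live on very different scales, and unscaled PCA would be dominated by the largest one.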

In [None]:
num_geographical_PCs = 3
geographical_cols = ['Distance', 'Haversine', 'PLong', 'DLong', 'PLatd', 'DLatd', 'long_diff_', 'latd_diff_']

geographical_PCs, geographical_explained_variance_ratio = pca_pipe(df[geographical_cols], n=num_geographical_PCs)
print(f"geographical_explained_variance_ratio = {geographical_explained_variance_ratio.sum()}")

geo_PC_cols = ['geo_PC1_', 'geo_PC2_', 'geo_PC3_']
df[geo_PC_cols] = geographical_PCs

fig = plt.figure(figsize=(30,5), dpi=100)

ax1 = fig.add_subplot(141, projection = '3d')
ax1.scatter(df.geo_PC1_, df.geo_PC2_, df.geo_PC3_, alpha=0.1);
ax1.set_xlabel(geo_PC_cols[0]); ax1.set_ylabel(geo_PC_cols[1]); ax1.set_zlabel(geo_PC_cols[2]);

ax2 = fig.add_subplot(142)
sns.scatterplot(data=df, x='geo_PC1_', y='geo_PC2_', hue='duration_range_', hue_order=duration_order, alpha=0.5, ax=ax2);

ax3 = fig.add_subplot(143)
sns.scatterplot(data=df, x='geo_PC1_', y='geo_PC3_', hue='duration_range_', hue_order=duration_order, alpha=0.5, ax=ax3);

ax4 = fig.add_subplot(144)
sns.scatterplot(data=df, x='geo_PC2_', y='geo_PC3_', hue='duration_range_', hue_order=duration_order, alpha=0.5, ax=ax4);

**_weather_PCs_**

We use PCA to convert the weather-related features -

['Temp', 'Precip', 'Wind', 'Humid', 'Solar', 'Snow', 'GroundTemp', 'Dust'] 

into a few principal components.

In [None]:
num_weather_PCs = 3
weather_cols = ['Temp', 'Precip', 'Wind', 'Humid', 'Solar', 'Snow', 'GroundTemp', 'Dust']

weather_PCs, weather_explained_variance_ratio = pca_pipe(df[weather_cols], n=num_weather_PCs)
print(f"weather_explained_variance_ratio = {weather_explained_variance_ratio.sum()}")

weather_PC_cols = ['weather_PC1_', 'weather_PC2_', 'weather_PC3_']
df[weather_PC_cols] = weather_PCs

fig = plt.figure(figsize=(30,5), dpi=100)

ax1 = fig.add_subplot(141, projection = '3d')
ax1.scatter(df.weather_PC1_, df.weather_PC2_, df.weather_PC3_, alpha=0.1);
ax1.set_xlabel(weather_PC_cols[0]); ax1.set_ylabel(weather_PC_cols[1]); ax1.set_zlabel(weather_PC_cols[2]);

ax2 = fig.add_subplot(142)
sns.scatterplot(data=df, x='weather_PC1_', y='weather_PC2_', hue='duration_range_', hue_order=duration_order, alpha=0.5, ax=ax2);

ax3 = fig.add_subplot(143)
sns.scatterplot(data=df, x='weather_PC1_', y='weather_PC2_', hue='Pmonth', palette='Set1', alpha=0.5, ax=ax3);

ax4 = fig.add_subplot(144)
sns.scatterplot(data=df, x='weather_PC1_', y='weather_PC2_', hue='time_of_day_', palette='Set1', alpha=0.5, ax=ax4);

- The 3rd weather PC has a lot of outliers (variance placeholder) and can be skipped
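A decision to drop a component is easier to justify when the explained variance is inspected per component, not only as a sum. The sketch below does this on synthetic data with scikit-learn's `PCA`. The data and the choice of three components are illustrative assumptions, and the numbers it prints say nothing about the real weather columns.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
# Synthetic stand-in for the weather columns: two correlated pairs plus one noise column
base = rng.normal(size=(500, 2))
X = np.column_stack([
    base[:, 0],
    base[:, 0] + 0.1 * rng.normal(size=500),
    base[:, 1],
    base[:, 1] + 0.1 * rng.normal(size=500),
    0.5 * rng.normal(size=500),
])

pca = PCA(n_components=3).fit(StandardScaler().fit_transform(X))
per_component = pca.explained_variance_ratio_
cumulative = np.cumsum(per_component)

# A small ratio for the last component suggests it adds little information
for i, (r, c) in enumerate(zip(per_component, cumulative), start=1):
    print(f"PC{i}: ratio={r:.3f}, cumulative={c:.3f}")
```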

### Check statistics

In [None]:
df.describe().T

- ['Pmonth', 'Pday', 'Phour', 'Pmin', 'PDweek', 'Dmonth', 'Dday', 'Dhour', 'Dmin', 'DDweek'] and 'time_of_day_' are all categorical features and will need encoding (the derived 'distance_range_' and 'duration_range_' are categorical as well)
- All remaining features are continuous
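The notebook does not fix an encoding scheme at this point. As one possible option, the sketch below one-hot encodes a nominal feature with `pd.get_dummies` and encodes the hour cyclically with sine and cosine so that 23h and 0h end up close together. The toy frame and the `tod`, `Phour_sin` and `Phour_cos` names are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Toy data; only the column names 'Phour' and 'time_of_day_' come from the notebook
toy = pd.DataFrame({
    'Phour': [0, 6, 12, 18, 23],
    'time_of_day_': ['Night', 'Morning', 'Morning', 'Evening', 'Night'],
})

# One-hot encode the nominal feature
encoded = pd.get_dummies(toy, columns=['time_of_day_'], prefix='tod')

# Cyclical encoding for the hour: map the 24-hour clock onto a circle
encoded['Phour_sin'] = np.sin(2 * np.pi * encoded['Phour'] / 24)
encoded['Phour_cos'] = np.cos(2 * np.pi * encoded['Phour'] / 24)

print(encoded.columns.tolist())
```

Tree-based models can often work with the integer-coded time columns directly, so whether this step is needed depends on the model chosen later.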

In [None]:
cat_cols = ['Pmonth', 'Pday', 'Phour', 'Pmin', 'PDweek', 'Dmonth', 'Dday', 'Dhour', 'Dmin', 'DDweek', 'time_of_day_', 'distance_range_', 'duration_range_']

### Correlation

In [None]:
df_corr = df.drop(columns=cat_cols).corr()

In [None]:
plt.figure(figsize=(10,4), dpi=100)
df_corr['Duration'].abs().sort_values()[:-1].plot(kind='bar');

- Distance and Haversine are highly correlated with Duration
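For context on the Haversine feature: it is presumably the great-circle distance between the pickup and dropoff coordinates, although the notebook does not say how it was computed or in which unit. The sketch below shows the standard haversine formula. The function name, the kilometre unit and the sample coordinates are assumptions made for illustration.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; the unit used in the dataset is not stated


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between points given in decimal degrees."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))


# Two illustrative points with made-up coordinates in the Seoul area
d = haversine_km(37.5665, 126.9780, 37.5512, 126.9882)
print(f"{d:.2f} km")
```

Since the function only uses NumPy operations, it also accepts whole columns such as `df['PLatd']` and `df['PLong']`. The straight-line distance is a lower bound on the ridden distance, which is consistent with both features tracking Duration closely.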

### Distribution of all features

In [None]:
df.hist(bins=30, color='steelblue', edgecolor='black', linewidth=1.0,
        xlabelsize=8, ylabelsize=8, grid=False); 
plt.tight_layout(rect=(0, 0, 3, 3))

- Most of the trips take place with no snowfall and no solar radiation.
- The distributions of Duration, Distance and Haversine are similar, as they are highly correlated.

### Scatter plots

In [None]:
fig, ax = plt.subplots(ncols=2, nrows=1, figsize=(15,5), dpi=100)

sns.scatterplot(data=df, x='Distance', y='Duration', hue='time_of_day_', alpha=0.5, ax=ax[0]);
sns.scatterplot(data=df, x='weather_PC1_', y='long_diff_', hue='duration_range_', alpha=0.5, ax=ax[1]);

- For small distances, the duration is small but has a large variance
- The majority of trips are in the evening
- The Duration is also affected by Temp, which is in turn correlated with the month of the year

### Count plots

In [None]:
fig, ax = plt.subplots(ncols=2, nrows=1, figsize=(20,5), dpi=100)

sns.countplot(data=df, x='Pmonth', hue='PDweek', ax=ax[0]);
sns.countplot(data=df, x='Phour', hue='PDweek', ax=ax[1]);

In [None]:
fig, ax = plt.subplots(ncols=2, nrows=1, figsize=(20,5), dpi=100)

sns.countplot(data=df, x='Pmonth', hue='time_of_day_', ax=ax[0]);
sns.countplot(data=df, x='Phour', hue='duration_range_', ax=ax[1]);

- Sept and Oct (autumn) have the highest number of trips, while Dec-March (winter) has far fewer trips
- The Phour counts peak around 8am and between 5 and 9pm, which matches the usual office timings

### Box and Bar plots

In [None]:
fig, ax = plt.subplots(ncols=1, nrows=3, figsize=(15,10), dpi=100)

sns.barplot(data=df, x='Phour', y='Duration', ax=ax[0]);
sns.barplot(data=df, x='PDweek', y='Duration', ax=ax[1], hue='time_of_day_');
ax[1].legend(loc=(1.01,0.3));
sns.barplot(data=df, x='Pmonth', y='Duration', ax=ax[2], hue='time_of_day_');
ax[2].legend(loc=(1.01,0.3));

- Although one of the peaks in the Phour counts is around 8am, the mean duration is actually lowest at 8am!
- Note that the average duration on Fri and Sat is the highest
- In winter most of the trips are in the afternoon, while in summer most of the trips are in the evening
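Observations read off bar plots can be cross-checked numerically with a `groupby`. The sketch below uses made-up numbers to show how the busiest hour and the hour with the lowest mean duration could be extracted. Only the column names `Phour` and `Duration` come from the notebook.

```python
import pandas as pd

# Toy data, constructed so that the busiest hour also has the shortest trips
toy = pd.DataFrame({
    'Phour':    [8, 8, 8, 8, 18, 18, 18, 13, 13],
    'Duration': [10, 12, 9, 11, 25, 30, 28, 40, 35],
})

# Trip count and mean duration per pickup hour
by_hour = toy.groupby('Phour')['Duration'].agg(['count', 'mean'])
print(by_hour)

busiest_hour = by_hour['count'].idxmax()   # hour with the most trips
shortest_hour = by_hour['mean'].idxmin()   # hour with the lowest mean duration
print(busiest_hour, shortest_hour)
```

Running the same aggregation on `df` would give the exact hours behind the peaks and minima visible in the plots.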

### (Customary) Pairplot

In [None]:
cols = ['Distance', 'weather_PC1_', 'geo_PC3_', 'long_diff_','duration_range_']
sns.pairplot(df[cols], hue='duration_range_', hue_order=duration_order, diag_kind='kde', corner=False);