# Trip Duration Prediction

This notebook is part of the [*Practical Data Science for IOT*](https://github.com/pablodecm/datalab_ml_iot) tutorial by Pablo de Castro.

## Tools

This notebook will use the following Python 3
libraries for data analytics and machine learning:
- pandas
- numpy
- matplotlib/seaborn
- scikit-learn
- xgboost
- leaflet/folium

In [None]:
# required in Colab
%pip install seaborn
%pip install xgboost
%pip install kaggle
%pip install folium

In [2]:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import xgboost as xgb

## Dataset

In this notebook, we will use a large dataset from
the [Kaggle New York City Taxi Trip Duration](https://www.kaggle.com/c/nyc-taxi-trip-duration)
challenge, which contains real taxi trip data from the city
of New York during 2016.

The main task is to predict the trip duration from the given features, but the
dataset is also well suited to exploratory data analysis and to practicing
techniques for handling location and temporal data in cities.

<div align="center">
  <img src="images/kaggle_trip_duration.png" height="50%" style="max-width: 50%">
</div>

### Data fields

Here is a list and description of all the provided items for each trip:
- **id** - a unique identifier for each trip
- **vendor_id** - a code indicating the provider associated with the trip record
- **pickup_datetime** - date and time when the meter was engaged
- **dropoff_datetime** - date and time when the meter was disengaged
- **passenger_count** - the number of passengers in the vehicle (driver entered value)
- **pickup_longitude** - the longitude where the meter was engaged
- **pickup_latitude** - the latitude where the meter was engaged
- **dropoff_longitude** - the longitude where the meter was disengaged
- **dropoff_latitude** - the latitude where the meter was disengaged
- **store_and_fwd_flag** - indicates whether the trip record was held in vehicle memory before being sent to the vendor because the vehicle had no connection to the server (Y = store and forward; N = not a store and forward trip)
- **trip_duration** - duration of the trip in seconds


### Download from Kaggle

To download datasets from Kaggle we can use the
official command-line interface (CLI), but it requires a Kaggle
account and an API key.

In [None]:
# -p avoids an error if the directory already exists
!mkdir -p $HOME/.kaggle

In [None]:
# run if in Google Colab to set up your Kaggle API key
import json
import getpass
import os

kaggle_json_path = "$HOME/.kaggle/kaggle.json"
if not os.path.isfile(os.path.expandvars(kaggle_json_path)):
  username = getpass.getpass('username')
  api_key = getpass.getpass('Kaggle API key')

  token = {"username": username,"key":api_key}
  with open('kaggle.json', 'w') as file:
      json.dump(token, file)
    
  # jupyter/ipython bash magic (!) works within an if
  !mv kaggle.json $HOME/.kaggle/kaggle.json
  !chmod 600 $HOME/.kaggle/kaggle.json

!kaggle datasets list

In [None]:
!ls $HOME/.kaggle/kaggle.json

In [None]:
!mkdir -p data
!cd data; kaggle competitions download -c nyc-taxi-trip-duration; cd ..
!cd data; unzip -o nyc-taxi-trip-duration.zip; unzip -o train.zip; unzip -o test.zip; cd ..

In [None]:
import pandas as pd
train_df = pd.read_csv('data/train.csv')
test_df = pd.read_csv('data/test.csv')
print("train shape: ", train_df.shape, "test shape: ", test_df.shape)

In [None]:
train_df.head(3)

In [None]:
test_df.head(3)

#### Warming up Exercise

Check that `id` is unique within each set and that the train and test sets do not share any `id`.

*Hint: Look up `DataFrame.nunique` and `np.intersect1d` in their respective documentations.*
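
Before solving the exercise on the real data, it may help to see what the two functions from the hint return. The sketch below runs them on two made-up frames (`toy_train` and `toy_test` are hypothetical and unrelated to the taxi data).

```python
import numpy as np
import pandas as pd

# hypothetical toy frames, not the real train/test data
toy_train = pd.DataFrame({"id": [1, 2, 3, 3]})
toy_test = pd.DataFrame({"id": [3, 4]})

# nunique counts distinct values; ids are unique only if it equals the row count
n_unique = toy_train["id"].nunique()
ids_are_unique = n_unique == len(toy_train)

# intersect1d returns the sorted values that appear in both arrays
shared = np.intersect1d(toy_train["id"].to_numpy(), toy_test["id"].to_numpy())

print(n_unique, ids_are_unique, shared)
```

In the toy case the id `3` is both duplicated and shared, so both checks fail; on clean data we would expect `ids_are_unique` to be true and `shared` to be empty.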

In [None]:
# write here the exercise

### Additional Datasets

In real-world scenarios, there is often something to gain
by combining data from different sources that might be informative
for the task.

In the case of car trip durations, traffic routes and weather
are quite relevant, and using such external data was allowed in the competition.
Participants have curated two datasets that might be of use here:
- Traffic route details computed with the [Open Source Routing Machine (OSRM)](http://project-osrm.org/) tool
- Weather during the period considered

Part of the information from the former will be used, while
the use of the latter is left for future extensions.

In [None]:
# traffic route from OSRM
!cd data; kaggle datasets download oscarleo/new-york-city-taxi-with-osrm; cd ..
!mkdir -p data/osrm
!unzip -o data/new-york-city-taxi-with-osrm.zip -d data/osrm

In [None]:
# weather data
!cd data; kaggle datasets download mathijs/weather-data-in-new-york-city-2016; cd ..
!mkdir -p data/weather
!unzip -o data/weather-data-in-new-york-city-2016.zip -d data/weather


In [None]:
# add some columns to the train and test data
cols_osrm = ['id', 'total_distance', 'total_travel_time',  'number_of_steps']
fr1 = pd.read_csv('data/osrm/fastest_routes_train_part_1.csv', usecols=cols_osrm)
fr2 = pd.read_csv('data/osrm/fastest_routes_train_part_2.csv', usecols=cols_osrm)
test_street_info = pd.read_csv('data/osrm/fastest_routes_test.csv',
                               usecols=cols_osrm)
train_street_info = pd.concat((fr1, fr2))
train_df = train_df.merge(train_street_info, how='left', on='id')
test_df = test_df.merge(test_street_info, how='left', on='id')

train_df.head(5)
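
The left merge used above keeps every trip even when no route is available for it, in which case the route columns are filled with `NaN`. A minimal sketch on made-up frames (`trips` and `routes` are hypothetical, as are their values) shows how to count such rows with the `indicator` option of `DataFrame.merge`.

```python
import pandas as pd

# hypothetical toy data: three trips, routes known for only two of them
trips = pd.DataFrame({"id": ["t1", "t2", "t3"], "passenger_count": [1, 2, 1]})
routes = pd.DataFrame({"id": ["t1", "t3"], "total_distance": [2009.1, 1779.4]})

# how='left' keeps all rows of `trips`; indicator=True adds a `_merge` column
merged = trips.merge(routes, how="left", on="id", indicator=True)

# rows without a matching route have NaN in the route columns
n_missing = int(merged["total_distance"].isna().sum())

print(merged)
print("trips without a route:", n_missing)
```

Running the same `isna().sum()` check on `train_df` after the merge is a quick way to see whether any trip lacks OSRM information.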

In [None]:
pd.to_datetime("2016-03-14 17:32:30")-pd.to_datetime("2016-03-14 17:24:55")
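
The subtraction above returns a `Timedelta`. To compare it with `trip_duration`, which is given in seconds, it can be converted with `total_seconds()`. The sketch below reuses the two timestamps from the cell above; the toy `Series` objects are only there to show the vectorised form.

```python
import pandas as pd

pickup = pd.to_datetime("2016-03-14 17:24:55")
dropoff = pd.to_datetime("2016-03-14 17:32:30")

# scalar case: Timedelta -> seconds
delta = dropoff - pickup
seconds = delta.total_seconds()

# vectorised case on toy Series (same timestamps, for illustration)
pickups = pd.to_datetime(pd.Series(["2016-03-14 17:24:55"]))
dropoffs = pd.to_datetime(pd.Series(["2016-03-14 17:32:30"]))
seconds_series = (dropoffs - pickups).dt.total_seconds()

print(delta, seconds, seconds_series.tolist())
```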

## Exploratory Data Analysis (EDA)

It is good practice to get familiar with the properties of the data
interactively before building any model, typically through
basic visualizations and summary statistics.

In [None]:
train_df.describe()

In [None]:
# very long trips are present: a duration in seconds converted to hours
3.526282e+06/3600

We can remove outliers to simplify the analysis and make it
more robust. For example, we can flag trips whose duration is
more than 3 standard deviations away from the mean.

In [None]:
m = np.mean(train_df['trip_duration'])
s = np.std(train_df['trip_duration'])
# filter 
filter_duration = ((train_df['trip_duration'] <= m + 3*s) &
                   (train_df['trip_duration'] >= m - 3*s))
(~filter_duration).sum()
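
To see how the 3-standard-deviation rule behaves, here is a minimal sketch on synthetic durations (the values are made up: 999 ordinary trips plus one extreme one). As in the cell above, the mean and standard deviation are computed with the outlier included.

```python
import numpy as np

# synthetic durations in seconds: evenly spread between 5 and 25 minutes
durations = np.linspace(300.0, 1500.0, 1000)
durations[-1] = 3_500_000.0  # one extreme trip (about 40 days)

m = durations.mean()
s = durations.std()

# keep values within 3 standard deviations of the mean
keep = np.abs(durations - m) <= 3 * s
n_removed = int((~keep).sum())

print(f"mean={m:.1f}s std={s:.1f}s removed={n_removed}")
```

Note that a single extreme value inflates both `m` and `s`, so this rule only catches the most extreme trips; it is a coarse first filter rather than a precise one.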

In [None]:
train_df.shape

Similarly, the bounding box limits of the city of New York can
be easily checked and used for limiting the exploration of data
to trips which were both started and finished within the city.

In [None]:
city_long_border = (-74.03, -73.75)
city_lat_border = (40.63, 40.85)
filter_location = ((train_df['pickup_longitude'] <= city_long_border[1]) &
                   (train_df['pickup_longitude'] >= city_long_border[0]) &
                   (train_df['pickup_latitude'] <= city_lat_border[1]) &
                   (train_df['pickup_latitude'] >= city_lat_border[0]) &
                   (train_df['dropoff_longitude'] <= city_long_border[1]) &
                   (train_df['dropoff_longitude'] >= city_long_border[0]) &
                   (train_df['dropoff_latitude'] <= city_lat_border[1]) &
                   (train_df['dropoff_latitude'] >= city_lat_border[0]))
                                 
(~filter_location).sum()
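
The same bounding-box mask can be written more compactly with `Series.between`, which is inclusive on both ends by default. This is only a sketch of an equivalent formulation on a made-up frame (`toy` and its coordinates are hypothetical).

```python
import pandas as pd

city_long_border = (-74.03, -73.75)
city_lat_border = (40.63, 40.85)

# hypothetical trips: inside the box, dropoff outside, both ends far away
toy = pd.DataFrame({
    "pickup_longitude": [-73.98, -73.98, -121.93],
    "pickup_latitude": [40.75, 40.75, 37.39],
    "dropoff_longitude": [-73.96, -72.80, -121.93],
    "dropoff_latitude": [40.77, 41.00, 37.39],
})

# start from all True and require every coordinate to fall inside the box
inside = pd.Series(True, index=toy.index)
for prefix in ("pickup", "dropoff"):
    inside &= toy[f"{prefix}_longitude"].between(*city_long_border)
    inside &= toy[f"{prefix}_latitude"].between(*city_lat_border)

print(inside.tolist())
```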

In [None]:
# in case you have not heard of it
# this is a very useful DataFrame function
train_df.info()

### Data Preparation

In [None]:
fig, ax = plt.subplots()

ax.hist(train_df.loc[filter_duration,'trip_duration'], bins=100)
ax.set_xlabel('trip_duration (seconds)')
ax.set_ylabel('number of train records');

In [None]:
train_df['log_trip_duration'] = np.log(train_df['trip_duration'].values + 1)

fig, ax = plt.subplots()

ax.hist(train_df['log_trip_duration'].values, bins=100)
ax.set_xlabel('log(trip_duration)')
ax.set_ylabel('number of train records')
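
Since the model will later be trained on `log_trip_duration`, its predictions will live in log space and need to be mapped back to seconds. A minimal sketch of the transform and its inverse, using NumPy's `log1p` and `expm1` on made-up durations:

```python
import numpy as np

# made-up durations in seconds
durations = np.array([0.0, 59.0, 455.0, 3600.0])

# log1p(x) computes log(x + 1), the transform used in the cell above
log_durations = np.log1p(durations)

# expm1(y) computes exp(y) - 1, which undoes the transform
recovered = np.expm1(log_durations)

print(log_durations.round(3), recovered.round(3))
```

The `+ 1` keeps the transform defined for a duration of zero seconds.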

In [None]:
pc = train_df[filter_duration].groupby('passenger_count')['trip_duration'].mean()

fig, ax = plt.subplots()
# draw on the axes created above and then fix the y range
sns.barplot(x=pc.index, y=pc.values, ax=ax)
ax.set_ylim(0, 1100)
ax.set_title('passenger count')
ax.set_ylabel('Time in Seconds')

In [None]:
train_df['pickup_datetime']

In [None]:
train_df['pickup_datetime'] = pd.to_datetime(train_df.pickup_datetime)
test_df['pickup_datetime'] = pd.to_datetime(test_df.pickup_datetime)
train_df.loc[:, 'pickup_date'] = train_df['pickup_datetime'].dt.date
test_df.loc[:, 'pickup_date'] = test_df['pickup_datetime'].dt.date
train_df['dropoff_datetime'] = pd.to_datetime(train_df.dropoff_datetime)

In [None]:

fig, ax = plt.subplots()
ax.plot(train_df.groupby('pickup_date').count()[['id']],
         'o-', label='train')
ax.plot(test_df.groupby('pickup_date').count()[['id']],
         'o-', label='test_df')
ax.set_title('Trips over Time.')
ax.legend(loc=0);

In [None]:
city_long_border = (-74.03, -73.75)
city_lat_border = (40.63, 40.85)
fig, ax = plt.subplots(ncols=2, sharex=True, sharey=True)
ax[0].scatter(train_df['pickup_longitude'].values[:100000], train_df['pickup_latitude'].values[:100000],
              color='blue', s=1, label='train', alpha=0.1)
ax[1].scatter(test_df['pickup_longitude'].values[:100000], test_df['pickup_latitude'].values[:100000],
              color='green', s=1, label='test_df', alpha=0.1)
fig.suptitle('Train and test area complete overlap.')
ax[0].legend(loc=0)
ax[0].set_ylabel('latitude')
ax[0].set_xlabel('longitude')
ax[1].set_xlabel('longitude')
ax[1].legend(loc=0)
plt.ylim(city_lat_border)
plt.xlim(city_long_border)

### Folium: Interactive Maps in the Jupyter Notebook


Folium is a flexible Python library that can be used to work with interactive Leaflet.js maps.

In [None]:
import folium

In [None]:
location = (43.471198, -3.801362)
m = folium.Map(location=location,
               zoom_start=13)

popup = '<b>Us!</b>'
folium.Marker(location, popup=popup).add_to(m)

m

#### Folium Exercise

In addition to New York State, where else were New York
taxis picked up? Represent some of the farthest location outliers
(remember the `~filter_location` mask) in a Folium map.


In [None]:
dist = ((train_df.loc[~filter_location,["dropoff_latitude","dropoff_longitude"]].values - np.array([40.5,-74.]))**2).sum(axis=1)

In [None]:
np.argmax(dist)
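
The distance above is a squared difference in degrees, which is enough to rank far-away outliers but is not a physical distance. As a hedged alternative, the sketch below uses the haversine formula to get great-circle distances in kilometres; `haversine_km` is a helper defined here for illustration and the three points are made up, apart from the reference point taken from the cell above.

```python
import numpy as np

def haversine_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Great-circle distance in km between points given in degrees."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 2 * radius_km * np.arcsin(np.sqrt(a))

# reference point used in the cell above (latitude, longitude)
reference = (40.5, -74.0)

# hypothetical dropoff locations as (latitude, longitude)
points = np.array([
    [40.75, -73.98],
    [37.39, -121.93],
    [43.47, -3.80],
])

dists = haversine_km(points[:, 0], points[:, 1], reference[0], reference[1])
farthest = int(np.argmax(dists))

print(dists.round(1), farthest)
```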

In [None]:
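# write here the solution of the exercise
# idx is a position within the ~filter_location subset;
# np.argmax(dist) above gives a candidate for the farthest dropoff
idx = 15857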
location = (train_df.loc[~filter_location].iloc[idx].dropoff_latitude,
            train_df.loc[~filter_location].iloc[idx].dropoff_longitude)
m = folium.Map(location=location,
               zoom_start=13)

popup = '<b>Outlier dropoff</b>'
folium.Marker(location, popup=popup).add_to(m)

m

## More Advanced Feature Engineering

### Principal Component Analysis (PCA)

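In some cases, applying PCA to a subset of the features yields
transformed features that can be used more efficiently
for model training. Here it is applied to the latitude/longitude
pairs; since both components are kept, this amounts to centering
and rotating the coordinates.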


In [None]:
from sklearn.decomposition import PCA
# fit PCA
coords = np.vstack((train_df[['pickup_latitude',
                           'pickup_longitude']].values,
                    train_df[['dropoff_latitude',
                           'dropoff_longitude']].values,
                    test_df[['pickup_latitude',
                          'pickup_longitude']].values,
                    test_df[['dropoff_latitude',
                          'dropoff_longitude']].values))

pca = PCA().fit(coords)

# add as new features (each coordinate pair is transformed only once)
for df in (train_df, test_df):
    for prefix in ('pickup', 'dropoff'):
        pcs = pca.transform(df[[f'{prefix}_latitude', f'{prefix}_longitude']].values)
        df[f'{prefix}_pca0'] = pcs[:, 0]
        df[f'{prefix}_pca1'] = pcs[:, 1]

train_df.head(5)
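
To build some intuition for what PCA does to two-dimensional coordinates, here is a small sketch on synthetic points (the point cloud is made up and only loosely resembles city coordinates). With both components kept, the transform centres and rotates the data, so distances between points are unchanged.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# synthetic cloud, elongated along one axis and then rotated by 30 degrees
base = rng.normal(size=(500, 2)) * np.array([0.05, 0.01])
angle = np.radians(30)
rot = np.array([[np.cos(angle), -np.sin(angle)],
                [np.sin(angle), np.cos(angle)]])
coords = base @ rot.T + np.array([40.75, -73.98])

pca = PCA().fit(coords)
transformed = pca.transform(coords)  # both components from a single call

# the distance between two points is the same before and after
d_before = np.linalg.norm(coords[0] - coords[1])
d_after = np.linalg.norm(transformed[0] - transformed[1])

print(pca.explained_variance_ratio_.round(3), d_before, d_after)
```

The first component lines up with the direction in which the points spread the most, which is why the rotated features can be easier to split on for tree-based models.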

### Date extraction

Extract the hour of the day, day of the week, day of the month
and month number from the pickup timestamp. These features
represent the periodicity of the traffic much better than the raw date.

In [None]:
train_df['Month'] = train_df['pickup_datetime'].dt.month
test_df['Month'] = test_df['pickup_datetime'].dt.month
train_df['DayofMonth'] = train_df['pickup_datetime'].dt.day
test_df['DayofMonth'] = test_df['pickup_datetime'].dt.day
train_df['Hour'] = train_df['pickup_datetime'].dt.hour
test_df['Hour'] = test_df['pickup_datetime'].dt.hour
train_df['dayofweek'] = train_df['pickup_datetime'].dt.dayofweek
test_df['dayofweek'] = test_df['pickup_datetime'].dt.dayofweek
train_df.head(5)
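
As a quick check of what the `.dt` accessors return, the sketch below applies them to two made-up timestamps. Note that `dayofweek` counts from Monday = 0 to Sunday = 6.

```python
import pandas as pd

# two hypothetical pickup timestamps: a Monday afternoon and a Sunday night
stamps = pd.to_datetime(pd.Series(["2016-03-14 17:24:55",
                                   "2016-06-12 00:43:35"]))

parts = pd.DataFrame({
    "Month": stamps.dt.month,
    "DayofMonth": stamps.dt.day,
    "Hour": stamps.dt.hour,
    "dayofweek": stamps.dt.dayofweek,
})

print(parts)
```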

### Indicator Variables

Categorical data should not be used as is with most machine
learning techniques. The `pd.get_dummies` function facilitates
the transformation to dummy/indicator variables (also referred to as
one-hot encoding).

In [None]:
# for example this is the case for the vendor id
vendor_train = pd.get_dummies(train_df['vendor_id'], prefix='vi', prefix_sep='_')
vendor_test = pd.get_dummies(test_df['vendor_id'], prefix='vi', prefix_sep='_')
# store_and_fwd_flag
store_and_fwd_flag_train = pd.get_dummies(train_df['store_and_fwd_flag'], prefix='sf', prefix_sep='_')
store_and_fwd_flag_test = pd.get_dummies(test_df['store_and_fwd_flag'], prefix='sf', prefix_sep='_')
# and passenger_count
passenger_count_train = pd.get_dummies(train_df['passenger_count'], prefix='pc', prefix_sep='_')
passenger_count_test = pd.get_dummies(test_df['passenger_count'], prefix='pc', prefix_sep='_')
# remove rare passenger counts so train and test end up with the same columns
passenger_count_train.drop(["pc_7","pc_8","pc_9"],axis=1,inplace=True)
passenger_count_test.drop(["pc_9"],axis=1,inplace=True)
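
Dropping columns by name works when we know in advance which categories differ between the two sets. A more general sketch, under the assumption that we only want indicator columns present in both sets, keeps the intersection of the column indexes (the toy series below are made up).

```python
import pandas as pd

# hypothetical passenger counts: 7 appears only in train, 9 only in test
toy_train = pd.Series([1, 2, 7], name="passenger_count")
toy_test = pd.Series([1, 2, 9], name="passenger_count")

pc_train = pd.get_dummies(toy_train, prefix="pc")
pc_test = pd.get_dummies(toy_test, prefix="pc")

# keep only the indicator columns that exist in both frames
shared_cols = pc_train.columns.intersection(pc_test.columns)
pc_train = pc_train[shared_cols]
pc_test = pc_test[shared_cols]

print(list(pc_train.columns), list(pc_test.columns))
```

This avoids hard-coding category names, at the cost of silently discarding categories that appear in only one of the sets.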


In [None]:
# we can do the same for the time categoricals
month_train = pd.get_dummies(train_df['Month'], prefix='m', prefix_sep='_')
month_test = pd.get_dummies(test_df['Month'], prefix='m', prefix_sep='_')
dom_train = pd.get_dummies(train_df['DayofMonth'], prefix='dom', prefix_sep='_')
dom_test = pd.get_dummies(test_df['DayofMonth'], prefix='dom', prefix_sep='_')
hour_train = pd.get_dummies(train_df['Hour'], prefix='h', prefix_sep='_')
hour_test = pd.get_dummies(test_df['Hour'], prefix='h', prefix_sep='_')
dow_train = pd.get_dummies(train_df['dayofweek'], prefix='dow', prefix_sep='_')
dow_test = pd.get_dummies(test_df['dayofweek'], prefix='dow', prefix_sep='_')

In [None]:
# remove features that will not be available for test
# or for which the indicator function has been computed
remove_cols = ['id','vendor_id','passenger_count','store_and_fwd_flag',
               'Month','DayofMonth','Hour','dayofweek',
                'pickup_datetime','pickup_date',
               'pickup_longitude','pickup_latitude','dropoff_longitude','dropoff_latitude']
train_fea = train_df.drop(remove_cols,axis=1)
train_fea = train_fea.drop(["dropoff_datetime","trip_duration"], axis=1)
test_fea = test_df.drop(remove_cols,axis=1)

In [None]:
train_all_fea = pd.concat([train_fea,
                          vendor_train,
                          passenger_count_train,
                          store_and_fwd_flag_train,
                          month_train,
                          dom_train,
                          hour_train,
                          dow_train], axis=1)
test_all_fea = pd.concat([test_fea, 
                         vendor_test,
                         passenger_count_test,
                         store_and_fwd_flag_test,
                         month_test,
                         dom_test,
                         hour_test,
                         dow_test], axis=1)
print(train_all_fea.shape, test_all_fea.shape)

## XGBoost Training

Now that we have preprocessed and engineered the model features,
we can train a gradient boosting regression model to predict
the trip duration.

In [None]:
from sklearn.model_selection import train_test_split
# we will only consider 100000 examples to speed up training
n_samples = 100000
train, valid = train_test_split(train_all_fea[0:n_samples], test_size = 0.2) 

In [None]:
X_train = train.drop(['log_trip_duration'], axis=1)
y_train = train["log_trip_duration"]
X_valid = valid.drop(['log_trip_duration'], axis=1)
y_valid = valid["log_trip_duration"]

y_valid = y_valid.reset_index().drop('index',axis = 1)
y_train = y_train.reset_index().drop('index',axis = 1)

In [None]:
dtrain = xgb.DMatrix(X_train, label=y_train)
dvalid = xgb.DMatrix(X_valid, label=y_valid)
watchlist = [(dtrain, 'train'), (dvalid, 'valid')]

In [None]:
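xgb_pars = {'min_child_weight': 1, 'eta': 0.5,
            'colsample_bytree': 0.9,  'max_depth': 6,
            'subsample': 0.9, 'lambda': 1., 'nthread': -1, 'booster' : 'gbtree',
            # 'verbosity' replaces the deprecated 'silent' flag and
            # 'reg:squarederror' replaces the deprecated 'reg:linear' objective
            'verbosity': 0,
            'eval_metric': 'rmse', 'objective': 'reg:squarederror'}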

model = xgb.train(xgb_pars, dtrain, 10, watchlist, early_stopping_rounds=2,
      maximize=False, verbose_eval=1)
print('Modeling RMSE %.5f' % model.best_score)
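
Because the target is `log_trip_duration`, the RMSE reported above is computed in log space. On the original durations this corresponds to the root mean squared logarithmic error (RMSLE), which is the metric the Kaggle competition used. The sketch below checks the equivalence with NumPy on made-up numbers; `rmsle` is a helper defined here and `log_pred` is a stand-in for model predictions.

```python
import numpy as np

def rmsle(y_true, y_pred):
    """Root mean squared logarithmic error on values in seconds."""
    return np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2))

# made-up true and predicted durations in seconds
true_seconds = np.array([455.0, 663.0, 2124.0])
pred_seconds = np.array([500.0, 600.0, 2000.0])

# what the model sees: targets and predictions in log space
log_true = np.log1p(true_seconds)
log_pred = np.log1p(pred_seconds)  # stand-in for model output
rmse_log = np.sqrt(np.mean((log_pred - log_true) ** 2))

# same quantity computed from durations mapped back to seconds
rmsle_seconds = rmsle(np.expm1(log_true), np.expm1(log_pred))

print(rmse_log, rmsle_seconds)
```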

### Feature Importances

XGBoost also provides a helper to plot the importance of each feature.

In [None]:
xgb.plot_importance(model, max_num_features=28, height=0.7)

### Exercise: Design a Better Model

Use a hyper-parameter grid search or manually change some of the hyper-parameters
in order to obtain a better RMSE.
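
As a starting point, here is a minimal sketch of a grid search with scikit-learn's `GridSearchCV`. To keep it self-contained it swaps in scikit-learn's own `GradientBoostingRegressor` instead of XGBoost and uses synthetic data, so the numbers say nothing about the taxi problem. Its `learning_rate` plays the role of `eta` above and `max_depth` has the same meaning.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)

# synthetic regression problem (not the taxi data)
X = rng.uniform(size=(300, 3))
y = 2.0 * X[:, 0] + np.sin(6.0 * X[:, 1]) + rng.normal(scale=0.1, size=300)

# small grid: every combination is evaluated with 3-fold cross-validation
param_grid = {"learning_rate": [0.05, 0.3], "max_depth": [2, 4]}
search = GridSearchCV(
    GradientBoostingRegressor(n_estimators=50, random_state=0),
    param_grid,
    scoring="neg_root_mean_squared_error",
    cv=3,
)
search.fit(X, y)

# scores are negated RMSE values, so flip the sign to report RMSE
best_rmse = -search.best_score_
print(search.best_params_, best_rmse)
```

The grid grows multiplicatively with each added hyper-parameter, so it is usually worth starting with a coarse grid on a subsample, as done with `n_samples` above.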

In [None]:
# train another model here

## References

The [top Kaggle Kernels](https://www.kaggle.com/c/nyc-taxi-trip-duration/kernels) (executable environments similar to Google Colab but aimed at competitions) of the New York City Taxi Trip Duration Playground competition are really good. In particular,
this notebook is heavily based on:
- [1] [Strength of visualization-python visuals tutorial](https://www.kaggle.com/maheshdadhich/strength-of-visualization-python-visuals-tutorial) by BuryBuryZymon
- [2] [From EDA to the Top (LB 0.367)](https://www.kaggle.com/gaborfodor/from-eda-to-the-top-lb-0-367) by beluga
- [3] [NYCT - from A to Z with XGBoost (Tutorial)](https://www.kaggle.com/karelrv/nyct-from-a-to-z-with-xgboost-tutorial) by KarelVerhoeven
