## Introduction
In this competition, we are challenged to build a model that predicts the total ride duration of taxi trips in New York City. The primary dataset is one released by the NYC Taxi and Limousine Commission, which includes pickup time, geo-coordinates, number of passengers, and several other variables.

In this simple notebook we try to:
 
 - Explore the dataset
 - Extract features
 - Build a baseline model



In [None]:
%matplotlib inline
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
from datetime import timedelta
import datetime as dt
# from geopy.distance import vincenty, great_circle
from haversine import haversine
import matplotlib.pyplot as plt
import seaborn as sns
import xgboost as xgb
from sklearn.model_selection import train_test_split

## Data understanding

In [None]:
train = pd.read_csv('../input/train.csv')
test = pd.read_csv('../input/test.csv')
sample_submission = pd.read_csv('../input/sample_submission.csv')

Let's check the data files! According to the data description, we should find the following columns:

 - **id** - a unique identifier for each trip
 - **vendor_id** - a code indicating the provider associated with the trip record
 - **pickup_datetime** - date and time when the meter was engaged
 - **dropoff_datetime** - date and time when the meter was disengaged
 - **passenger_count** - the number of passengers in the vehicle (driver entered value)
 - **pickup_longitude** - the longitude where the meter was engaged
 - **pickup_latitude** - the latitude where the meter was engaged
 - **dropoff_longitude** - the longitude where the meter was disengaged
 - **dropoff_latitude** - the latitude where the meter was disengaged
 - **store_and_fwd_flag** - This flag indicates whether the trip record was held in vehicle memory before sending to the vendor because the vehicle did not have a connection to the server (Y=store and forward; N=not a store and forward trip)
 - **trip_duration** - duration of the trip in seconds

Obviously dropoff_datetime and trip_duration are only available for the train set.

In [None]:
print('We have {} training rows and {} test rows.'.format(train.shape[0], test.shape[0]))
print('We have {} training columns and {} test columns.'.format(train.shape[1], test.shape[1]))
train.head(2)

## Sanity check

In [None]:
print('Id is unique.') if train.id.nunique() == train.shape[0] else print('oops')
print('Train and test sets are distinct.') if len(np.intersect1d(train.id.values, test.id.values))== 0 else print('oops')
print('We do not need to worry about missing values.') if train.count().min() == train.shape[0] and test.count().min() == test.shape[0] else print('oops')
print('The store_and_fwd_flag has only two values {}.'.format(str(set(train.store_and_fwd_flag.unique()) | set(test.store_and_fwd_flag.unique()))))

In [None]:
train['pickup_datetime'] = pd.to_datetime(train.pickup_datetime)
test['pickup_datetime'] = pd.to_datetime(test.pickup_datetime)
train['dropoff_datetime'] = pd.to_datetime(train.dropoff_datetime)
train['store_and_fwd_flag'] = 1 * (train.store_and_fwd_flag.values == 'Y')
test['store_and_fwd_flag'] = 1 * (test.store_and_fwd_flag.values == 'Y')


In [None]:
train['check_trip_duration'] = (train['dropoff_datetime'] - train['pickup_datetime']).dt.total_seconds()
duration_difference = train[np.abs(train['check_trip_duration'].values  - train['trip_duration'].values) > 1]
print('Trip_duration and datetimes are ok.') if len(duration_difference[['pickup_datetime', 'dropoff_datetime', 'trip_duration', 'check_trip_duration']]) == 0 else print('Ooops.')

## Feature Extraction
Let's calculate the distance (km) between pickup and dropoff points. The haversine distance is used here; geopy offers alternatives (vincenty() or great_circle()) if you prefer.
Cabs don't fly, and we are in New York, so we can check the Manhattan (L1) distance too :)
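
As a quick sanity check of the two distances, here is a minimal sketch on made-up coordinates. The helper name `haversine_km` and the sample points are assumptions for illustration only. It checks that one degree of latitude comes out as R·π/180 km, and that the L1-style distance (one leg along the parallel plus one leg along the meridian) is never shorter than the direct haversine distance.

```python
import numpy as np

EARTH_RADIUS_KM = 6371  # same mean radius as used in this notebook


def haversine_km(lat1, lng1, lat2, lng2):
    """Great-circle distance in km between points given in degrees."""
    lat1, lng1, lat2, lng2 = map(np.radians, (lat1, lng1, lat2, lng2))
    d = (np.sin((lat2 - lat1) * 0.5) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lng2 - lng1) * 0.5) ** 2)
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(d))


# One degree of latitude along a meridian should equal R * pi / 180 km.
one_degree = haversine_km(40.0, -74.0, 41.0, -74.0)
expected = EARTH_RADIUS_KM * np.pi / 180

# Illustrative (made-up) pickup and dropoff points.
direct = haversine_km(40.75, -73.99, 40.77, -73.96)
# L1-style distance: east-west leg plus north-south leg from the pickup point.
l1 = (haversine_km(40.75, -73.99, 40.75, -73.96)
      + haversine_km(40.75, -73.99, 40.77, -73.99))

print(one_degree, expected)
print(direct, l1)
```

The L1 value is only a rough proxy for street distance, since the Manhattan street grid is not aligned with the meridians.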

In [None]:
def haversine_array(lat1, lng1, lat2, lng2):
    lat1, lng1, lat2, lng2 = map(np.radians, (lat1, lng1, lat2, lng2))
    AVG_EARTH_RADIUS = 6371  # in km
    # calculate haversine
    lat = lat2 - lat1
    lng = lng2 - lng1
    d = np.sin(lat * 0.5) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(lng * 0.5) ** 2
    h = 2 * AVG_EARTH_RADIUS * np.arcsin(np.sqrt(d))
    return h

def dummy_manhattan_distance(lat1, lng1, lat2, lng2):
    a = haversine_array(lat1, lng1, lat1, lng2)
    b = haversine_array(lat1, lng1, lat2, lng1)
    return a + b


train.loc[:, 'distance_haversine'] = haversine_array(train['pickup_latitude'].values, train['pickup_longitude'].values, train['dropoff_latitude'].values, train['dropoff_longitude'].values)
train.loc[:, 'distance_dummy_manhattan'] = dummy_manhattan_distance(train['pickup_latitude'].values, train['pickup_longitude'].values, train['dropoff_latitude'].values, train['dropoff_longitude'].values)
test.loc[:, 'distance_haversine'] = haversine_array(test['pickup_latitude'].values, test['pickup_longitude'].values, test['dropoff_latitude'].values, test['dropoff_longitude'].values)
test.loc[:, 'distance_dummy_manhattan'] = dummy_manhattan_distance(test['pickup_latitude'].values, test['pickup_longitude'].values, test['dropoff_latitude'].values, test['dropoff_longitude'].values)

Add datetime features.

In [None]:
train.loc[:, 'pickup_date'] = train['pickup_datetime'].dt.date
train.loc[:, 'pickup_weekday'] = train['pickup_datetime'].dt.weekday
train.loc[:, 'pickup_day'] = train['pickup_datetime'].dt.day
train.loc[:, 'pickup_month'] = train['pickup_datetime'].dt.month
train.loc[:, 'pickup_hour'] = train['pickup_datetime'].dt.hour
train.loc[:, 'pickup_minute'] = train['pickup_datetime'].dt.minute
train.loc[:, 'pickup_dt'] = (train['pickup_datetime'] - train['pickup_datetime'].min()).dt.total_seconds()

test.loc[:, 'pickup_date'] = test['pickup_datetime'].dt.date
test.loc[:, 'pickup_weekday'] = test['pickup_datetime'].dt.weekday
test.loc[:, 'pickup_day'] = test['pickup_datetime'].dt.day
test.loc[:, 'pickup_month'] = test['pickup_datetime'].dt.month
test.loc[:, 'pickup_hour'] = test['pickup_datetime'].dt.hour
test.loc[:, 'pickup_minute'] = test['pickup_datetime'].dt.minute
# The train minimum is used as the reference point so both sets share the same time origin.
test.loc[:, 'pickup_dt'] = (test['pickup_datetime'] - train['pickup_datetime'].min()).dt.total_seconds()

In [None]:
train.loc[:, 'average_speed_h'] = 1000 * train['distance_haversine'] / train['trip_duration']
# train.loc[:, 'average_speed_v'] = 1000 * train['distance_vincenty'] / train['trip_duration']
# train.loc[:, 'average_speed_gc'] = 1000 * train['distance_great_circle'] / train['trip_duration']
train.loc[:, 'average_speed_m'] = 1000 * train['distance_dummy_manhattan'] / train['trip_duration']

In [None]:
fig, ax = plt.subplots(ncols=3, sharey=True)
# Select the column before aggregating so non-numeric columns are never averaged.
ax[0].plot(train.groupby('pickup_hour')['average_speed_h'].mean(), 'bo-', alpha=0.5)
ax[1].plot(train.groupby('pickup_weekday')['average_speed_h'].mean(), 'go-', alpha=0.5)
ax[2].plot(train.groupby('pickup_day')['average_speed_h'].mean(), 'ro', alpha=0.5)
ax[0].set_xlabel('hour')
ax[1].set_xlabel('weekday')
ax[2].set_xlabel('day')
ax[0].set_ylabel('average speed (m/s)')
fig.suptitle('Rush hour average traffic speed')
plt.show()

## Modeling
First let's check the train test split. It helps to decide our validation strategy.

In [None]:
plt.plot(train.groupby('pickup_date').count()[['id']], 'o-', label='train')
plt.plot(test.groupby('pickup_date').count()[['id']], 'o-', label='test')
plt.title('Train and test period complete overlap.')
plt.legend(loc=0)
plt.ylabel('number of records')
plt.show()

In [None]:
fig, ax = plt.subplots(ncols=2, sharex=True, sharey=True)
# Longitude goes on the x axis and latitude on the y axis, matching the labels below.
ax[0].plot(train['pickup_longitude'].values, train['pickup_latitude'].values, 'b.',
           label='train', alpha=0.1)
ax[1].plot(test['pickup_longitude'].values, test['pickup_latitude'].values, 'g.',
           label='test', alpha=0.1)
fig.suptitle('Train and test area complete overlap.')
ax[0].legend(loc=0)
ax[0].set_ylabel('latitude')
ax[0].set_xlabel('longitude')
ax[1].set_xlabel('longitude')
ax[1].legend(loc=0)
plt.xlim([-74.5, -73.5])
plt.ylim([40.5, 41])
plt.show()


Add a few average traffic speed features. Note that if the train/test split were time-based, we could not use as many temporal features. In this competition we do not need to predict the future.
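
To make the mechanics concrete, here is a minimal sketch of a group-mean feature on toy data. The frames `toy_train` and `toy_test` and their values are made up for illustration. The mean speed per group is computed on the train rows only and then left-merged onto both sets, so a group value that never occurs in train ends up as NaN in test.

```python
import pandas as pd

# Toy data: two trips at hour 8 and two at hour 17 (speeds are made up).
toy_train = pd.DataFrame({'pickup_hour': [8, 8, 17, 17],
                          'average_speed_h': [3.0, 5.0, 2.0, 4.0]})
# Hour 23 never appears in the toy train set.
toy_test = pd.DataFrame({'pickup_hour': [8, 17, 23]})

# Mean speed per hour, computed from train only.
hourly = toy_train.groupby('pickup_hour')[['average_speed_h']].mean()
hourly.columns = ['average_speed_h_gby_pickup_hour']

# A left merge keeps every test row; unseen hours get NaN.
toy_test = pd.merge(toy_test, hourly, how='left',
                    left_on='pickup_hour', right_index=True)
print(toy_test)
```

Because train and test overlap completely in time here, such gaps should be rare, but it is worth checking the merged test columns for missing values.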

In [None]:
for gby_col in ['pickup_hour', 'pickup_day', 'pickup_date', 'pickup_weekday']:
    gby = train.groupby(gby_col)[['average_speed_h', 'average_speed_m']].mean()
    gby.columns = ['%s_gby_%s' % (col, gby_col) for col in gby.columns]
    train = pd.merge(train, gby, how='left', left_on=gby_col, right_index=True)
    test = pd.merge(test, gby, how='left', left_on=gby_col, right_index=True)

In [None]:
train['trip_duration'].describe()


We can see that the max trip_duration is ~1000 hours. Fortunately the evaluation metric is RMSLE and not RMSE, so outliers will cause less trouble. We can log-transform our target label and use RMSE during training.
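
The equivalence behind this trick can be sketched in a few lines. The helper names `rmsle` and `rmse` and the sample durations are assumptions for illustration. RMSLE on raw durations is the same number as RMSE computed on `log(x + 1)` of both the target and the prediction.

```python
import numpy as np


def rmsle(y_true, y_pred):
    """Root mean squared logarithmic error on raw (non-negative) values."""
    return np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2))


def rmse(a, b):
    """Plain root mean squared error."""
    return np.sqrt(np.mean((a - b) ** 2))


# Made-up durations and predictions, in seconds.
durations = np.array([300.0, 900.0, 3600.0])
predictions = np.array([360.0, 800.0, 3000.0])

score_direct = rmsle(durations, predictions)
score_log_space = rmse(np.log1p(durations), np.log1p(predictions))
print(score_direct, score_log_space)
```

So a model trained with RMSE on `log(trip_duration + 1)` optimizes the competition metric directly, provided its predictions are mapped back with `exp(prediction) - 1`.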

In [None]:
feature_names = list(train.columns)
do_not_use_for_training = ['id', 'pickup_datetime', 'dropoff_datetime', 'trip_duration',
                           'check_trip_duration', 'pickup_date', 'average_speed_h', 'average_speed_m']
feature_names = [f for f in train.columns if f not in do_not_use_for_training]
print(feature_names)
train[feature_names].count()

y = np.log1p(train['trip_duration'].values)  # log(trip_duration + 1), matching the RMSLE definition
plt.hist(y, bins=100)
plt.xlabel('log(trip_duration + 1)')
plt.ylabel('number of train records')
plt.show()

In [None]:
Xtr, Xv, ytr, yv = train_test_split(train[feature_names].values, y, test_size=0.2, random_state=1987)
dtrain = xgb.DMatrix(Xtr, label=ytr)
dvalid = xgb.DMatrix(Xv, label=yv)
dtest = xgb.DMatrix(test[feature_names].values)
watchlist = [(dtrain, 'train'), (dvalid, 'valid')]

In [None]:
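# 'verbosity' and 'reg:squarederror' are the current names for the older
# 'silent' and 'reg:linear' settings in recent XGBoost releases.
xgb_pars = {'min_child_weight': 10, 'eta': 0.2, 'colsample_bytree': 0.5, 'max_depth': 10,
            'subsample': 0.95, 'lambda': 1., 'nthread': -1, 'booster' : 'gbtree', 'verbosity': 0,
            'eval_metric': 'rmse', 'objective': 'reg:squarederror'}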


In [None]:
model = xgb.train(xgb_pars, dtrain, num_boost_round=400, evals=watchlist,
                  early_stopping_rounds=50, maximize=False, verbose_eval=50)

In [None]:
ypred = model.predict(dvalid)
plt.scatter(ypred, yv, alpha=0.1)
plt.xlabel('log(prediction)')
plt.ylabel('log(ground truth)')
plt.show()

# The target is log(trip_duration + 1), so the inverse transform is exp(x) - 1.
plt.scatter(np.expm1(ypred), np.expm1(yv), alpha=0.1)
plt.xlabel('prediction')
plt.ylabel('ground truth')
plt.show()

Let's try our first submission.

In [None]:
ytest = model.predict(dtest)
print((test.shape, ytest.shape))
# Invert the log(trip_duration + 1) transform used for the training target.
test['trip_duration'] = np.expm1(ytest)
test[['id', 'trip_duration']].to_csv('first_submission.csv', index=False)