# New York City Taxi Trip Duration

### ---my first Kaggle Kernel

When we travel, we always want to know the expected arrival time. This is especially true when we take a taxi. The time it takes a taxi to reach a destination can depend on many things, such as distance, local traffic and the driver's skill. In this problem, we explore data collected from taxis in New York City and estimate the taxi trip duration there.

The training data has about 1.5 million samples and 11 features collected during the first half of 2016. The testing data has about 0.6 million samples and 9 features and is from the same period.

In [None]:
# load python modules 
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
from time import time
from datetime import datetime
import pandas as pd

from statsmodels.api import OLS as lm
from sklearn.ensemble import RandomForestRegressor as rf
from pandas.tseries.holiday import USFederalHolidayCalendar

In [None]:
# load data
train = pd.read_csv("../input/train.csv")
test = pd.read_csv("../input/test.csv")

In [None]:
print("shape (train): {:d} obs. {:d} features in train".format(*train.shape))
print("shape (test): {:d} obs. {:d} features in test".format(*test.shape))

In [None]:
print("\t\t\t", "train\t\t\t","test")
print("starting time:", pd.to_datetime(train['pickup_datetime']).min(), \
     pd.to_datetime(test['pickup_datetime']).min())
print("ending time:  ", pd.to_datetime(train['dropoff_datetime']).max())

In [None]:
# define some functions that will be used later

def Manhattan_distance(data, direct):
    """ calculate direct distance (optional) and Manhattan distance
        and add column log10(distance) to data (distances are in degrees) """
    # the mean latitude must be in radians before taking the cosine
    lat = np.radians((data['pickup_latitude']+data['dropoff_latitude'])/2)
    dx = (data['dropoff_longitude'] - data['pickup_longitude'])*np.cos(lat)
    dy = data['dropoff_latitude'] - data['pickup_latitude']
    # rotate by 30 degrees so that the axes follow the street grid
    theta = np.pi*30/180
    rotate = np.array([[np.cos(theta), np.sin(theta)],[-np.sin(theta), np.cos(theta)]])
    loc = np.dot(np.array([dx,dy]).T, rotate)
    if direct:
        data['distance'] = np.log10(np.sqrt((loc[:,0])**2 + (loc[:,1])**2)+1e-6)
    # the 1e-6 offset avoids log10(0) for trips with zero distance
    data['Manhattan_dist'] = np.log10(np.abs(loc[:,0]) + np.abs(loc[:,1])+1e-6)
    
def speed(data):
    """ calculate average speed and add column log10(speed) to data """
    Manhattan_distance(data, False)
    data['speed'] = data['Manhattan_dist']-np.log10(data['trip_duration'])
    
def separate_time(data):
    """ extract weekday, time from datetime and add corresponding columns to data """
    pickup_time = pd.DatetimeIndex(pd.to_datetime(data["pickup_datetime"]))
    data['pickup_date'] = pickup_time.weekday
    data['pickup_hour'] = pickup_time.hour
    try:
        data['dropoff_hour'] = pd.DatetimeIndex(pd.to_datetime(data["dropoff_datetime"])).hour
    except KeyError:
        pass
    
def remove_exotic(data):
    """ remove exotic data according to criteria in the section about outliers """
    Manhattan_distance(data, False)
    select = (data['passenger_count']>0) & (data["Manhattan_dist"]>-5.9)
    try:
        return data[(data['trip_duration']<3600*8) & select].copy()
    except KeyError:
        pass
    return data[select].copy()

def preprocess(data):
    """ change data format and add the new features needed for modelling """
    Manhattan_distance(data, False)
    data['store_and_fwd_flag'] = np.where(data['store_and_fwd_flag']=="Y", 1, 0).ravel()
    separate_time(data)
    try:
        data['log_duration'] = np.log(data['trip_duration']+1)
        data['speed'] = data['Manhattan_dist']-np.log10(data['trip_duration'])
    except KeyError:
        pass

## data exploration -- a brief summary of each variable

We first explore the data provided. To estimate the trip duration, intuitively we need the time and location of pickup and dropoff. The training data provides a variable *trip_duration*, which is exactly *dropoff_datetime* minus *pickup_datetime* in seconds. This information, together with *dropoff_datetime*, is not available in the test data. In addition to location and time information, we are given the vendor id, the number of passengers and a flag for whether the trip record was stored in the vehicle before being forwarded to the vendor. Below is a brief summary of these three variables; location and time are examined later.

* ***vendor_id*** takes only two values: 1 and 2. Vendor 2 has slightly more trips than vendor 1.
* ***passenger_count*** gives the number of passengers during the trip. One-passenger trips are the most common, and trips with more passengers are generally less frequent. This is consistent with common sense. The training and testing data have the same distribution.
* Most data have ***store_and_fwd_flag*** set to N, so it may not be a useful variable.

The code that generates these results is below.

In [None]:
# first look at the training data
train.head()

In [None]:
test.head()

In [None]:
# check vendor_id
counts = pd.DataFrame({"train": train['vendor_id'].value_counts(), \
                       "test": test['vendor_id'].value_counts()})
counts.plot.bar(stacked=True)
plt.title("vendor_id")
plt.show()

In [None]:
# check passenger_count
counts = pd.DataFrame({"train": train['passenger_count'].value_counts(), \
                       "test": test['passenger_count'].value_counts()})
counts.plot.bar(stacked=True)
for i in range(counts.shape[0]):
    plt.text(i, counts.sum(axis=1).iat[i], int(counts.sum(axis=1).iat[i]), \
             rotation=0, ha='center', va='bottom')
plt.title("passenger_count")
plt.show()

In [None]:
# check store_and_fwd_flag
counts = pd.DataFrame({"train": train['store_and_fwd_flag'].value_counts(), \
                       "test": test['store_and_fwd_flag'].value_counts()})
counts.plot.bar(stacked=True)
for i in range(counts.shape[0]):
    plt.text(i, counts.sum(axis=1).iat[i], int(counts.sum(axis=1).iat[i]), \
             ha='center', va='bottom')
plt.title("store_and_fwd_flag")
plt.show()

## data exploration -- duration, distance and speed

Now we delve into two important pieces of information: time and location. Intuitively, we would expect travel duration and travel distance to be linearly related. One major problem is that we do not know the actual travel distance from the given data; we can only compute the straight-line distance between the pickup and dropoff locations. One improvement is to use the Manhattan distance, which assumes that all roads form a square lattice. Since the streets in Manhattan do not run due north or east, we rotate the map by 30 degrees. The rotated map of Manhattan is shown below.

In [None]:
plt.figure(figsize=(4,8))
plotdata = test[(-74.02<test['pickup_longitude']) &  (test['pickup_longitude']<-73.92) & \
                 (40.7<test['pickup_latitude']) & (test['pickup_latitude']<40.85)]
locy = np.array(plotdata['pickup_latitude'])
locx = (np.array(plotdata['pickup_longitude'])+74)*np.cos(locy*np.pi/180)
loc = np.array([locx, locy]).T
theta = np.pi*30/180
rotate = np.array([[np.cos(theta), np.sin(theta)],[-np.sin(theta), np.cos(theta)]])
loc_new = np.dot(loc, rotate)
plt.scatter(loc_new[:,0], loc_new[:,1], s=0.1, c='y')
plt.xlim([-20.39,-20.34])
plt.ylim([35.24,35.37])
plt.xticks([])
plt.yticks([])
plt.show()
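The rotated Manhattan distance described above can be sketched for a single trip as follows. This is a minimal illustration, not the notebook's own helper: the function name `rotated_manhattan`, the `angle_deg` parameter and the example coordinates are assumptions made here. Distances are left in degrees of latitude, as in the rest of the notebook.

```python
import numpy as np

def rotated_manhattan(lon1, lat1, lon2, lat2, angle_deg=30.0):
    """Return (direct, manhattan) distance in degrees of latitude.

    angle_deg is the assumed rotation that aligns the axes with the street grid.
    """
    mean_lat = np.radians((lat1 + lat2) / 2)
    dx = (lon2 - lon1) * np.cos(mean_lat)   # shrink longitude by cos(latitude)
    dy = lat2 - lat1
    theta = np.radians(angle_deg)
    # same rotation matrix as in the notebook, applied to one row vector
    rotate = np.array([[np.cos(theta), np.sin(theta)],
                       [-np.sin(theta), np.cos(theta)]])
    u, v = np.array([dx, dy]) @ rotate
    return float(np.hypot(u, v)), float(abs(u) + abs(v))

# hypothetical pickup and dropoff points in Manhattan
direct, manhattan = rotated_manhattan(-73.985, 40.758, -73.968, 40.785)
ratio = manhattan / direct
print(direct, manhattan, ratio)
```

Whatever the rotation angle, the ratio of Manhattan to direct distance lies between 1 (trip along a grid axis) and $\sqrt{2}$ (trip along the diagonal).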

An intuition about time is that travel speed depends on when and where we travel. Therefore, we can estimate travel duration by estimating the average travel speed, using the Manhattan distance as the true travel distance. After taking logarithms, the relationship between speed, duration and distance becomes linear: $\log(\text{speed}) = \log(\text{distance}) - \log(\text{duration})$. Therefore, as long as we estimate the logarithms, estimating speed and estimating duration are equivalent once the distance is known.
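A small worked example of this log relationship, with made-up numbers (the distance and duration values are hypothetical, chosen only for illustration):

```python
import math

distance = 0.02      # hypothetical Manhattan distance, in degrees
duration = 600.0     # hypothetical trip duration, in seconds

log_dist = math.log10(distance)
log_dur = math.log10(duration)
log_speed = log_dist - log_dur          # equals log10(distance / duration)

# An estimate of log speed turns back into a duration once the distance is known.
recovered = 10 ** (log_dist - log_speed)
print(log_speed, recovered)
```

The recovered duration matches the original one up to floating-point rounding, which is why the two targets carry the same information.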

In [None]:
# trip duration distribution
plt.figure(figsize=(12,4))
plt.title("distribution of log10(trip duration)")
plt.hist(np.log10(train['trip_duration']), bins=200)
plt.show()

In [None]:
# trip distance distribution
Manhattan_distance(train, True)
plt.figure(figsize=(12,4))
plt.title("distribution of log10(trip distance)")
plt.hist([train['distance']+np.log10(2)/2,train['Manhattan_dist']], bins=100, \
         label=['direct*sqrt(2)','Manhattan'])
plt.legend(loc='best')
plt.show()

We plot the trip distance and trip duration after taking the logarithm of the values. The distributions are approximately Gaussian. We notice that the distribution of the Manhattan distance differs from that of the direct distance by a factor of about $\sqrt{2}$, which is why the direct distance is scaled by $\sqrt{2}$ in the plot. We do not know which distance gives the better prediction, but intuitively we expect the Manhattan distance to be closer to the true distance. We do not compare the two in this report; only the Manhattan distance is used in the rest of it.

In [None]:
# trip speed distribution
speed(train)
plt.figure(figsize=(12,4))
plt.title("distribution of log10(trip speed)")
plt.hist(train['speed'], bins=300)
plt.show()

As mentioned before, we focus on estimating the average trip speed in this report. The distribution of the logarithm of trip speed is close to Gaussian, with a small tail on the left. We will come back to this tail later. For now, we want to get some idea of how location and time are related to this average speed.

We first make a scatter plot of speed versus distance. Since we are in log scale, the difference between distance and speed gives the duration (in log scale, of course). The vertical line on the left is formed by data points with zero distance. There is another line on the right, which corresponds to the samples with the largest durations. The speed clearly has an upper bound representing the maximum speed possible, although there are a few extreme points with very high speed.

In [None]:
train.sample(n=100000).plot.scatter(x='Manhattan_dist', y='speed', alpha=0.1)
plt.show()

Next, we explore the effect of time. We group the trips by pickup and dropoff hours. For simplicity, we consider only short trips that begin and end within the same hour. We plot the average speed of these short trips on a polar axis below. The dot size represents the share of the day's trips that fall in that hour. As can be seen, the number of trips stays roughly constant for most of the day. However, the average speed plunges at 8 am and jumps back up at 8 pm.

In [None]:
separate_time(train)
temp = train.loc[train['pickup_hour']==train['dropoff_hour'], ['speed','pickup_hour']]
plt.figure()
ax = plt.subplot(111, projection='polar')
ax.set_rticks([-4.4,-4.2])
ax.set_rlim([-4.6,-4.1])
ax.set_theta_offset(np.pi/2)
ax.set_theta_direction(-1)
ax.set_xticks(np.linspace(0,2*np.pi,6, endpoint=False))
ax.set_xticklabels(np.linspace(0,24,6, endpoint=False))
x = np.linspace(0,2*np.pi, 25, endpoint=True)[:24]
s = temp.groupby('pickup_hour').count().values.ravel()/temp.shape[0]*24*50
ax.scatter(x, temp.groupby('pickup_hour').mean().values.ravel(), s=s)
ax.set_title("average speed during a day\n(size represents counts)\n")
plt.show()
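The grouping step used for the polar plot can be sketched on a tiny table. The rows below are invented for illustration; only the column names follow the notebook.

```python
import pandas as pd

# toy trips: log10(speed) with pickup and dropoff hours (made-up values)
toy = pd.DataFrame({
    "pickup_hour":  [8, 8, 8, 20, 20, 23],
    "dropoff_hour": [8, 8, 9, 20, 20, 0],
    "speed":        [-4.5, -4.4, -4.6, -4.2, -4.3, -4.1],
})

# keep only trips that start and end within the same hour
same_hour = toy.loc[toy["pickup_hour"] == toy["dropoff_hour"], ["speed", "pickup_hour"]]

hourly_mean = same_hour.groupby("pickup_hour")["speed"].mean()
hourly_share = same_hour.groupby("pickup_hour").size() / len(same_hour)
print(hourly_mean)
print(hourly_share)
```

Hours with no same-hour trips drop out of the grouped result, so plotting against a fixed set of 24 angles relies on every hour being present in the data.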

Another natural period is the week. We plot the distribution of trip speed for each day of the week in the following figures. The dashed line is the median over all days. Short-duration and long-duration trips are plotted separately. Notice that the sample sizes of the two cases differ by two orders of magnitude. On Saturdays, typical trips are faster in general, but there are many long-duration trips with low speed.

In [None]:
def plot(temp, title):
    fig = plt.figure(figsize=(8,4))
    ax = fig.add_axes([0,0,1,1])
    ax.set_xlim([-0.5,6.5])
    ax.set_ylim([-4.9,-3.7])
    ax.set_xticks(np.arange(7))
    ax.set_ylabel("log10(speed)")
    # pandas weekday codes run from Monday=0 to Sunday=6
    ax.set_xticklabels(['Mon','Tue','Wed','Thu','Fri','Sat','Sun'])
    ax.axhline(temp['speed'].median(), linestyle='--', color='k')
    ax.plot(np.arange(7), temp.groupby('pickup_date').median(), 'o-', label='Median')
    ax.plot(np.arange(7), temp.groupby('pickup_date').mean(), 'o-', label='Mean')
    ax.set_title(title)
    ax.legend()
    temp2 = temp[(temp['speed']<-3.7)&(temp['speed']>-4.9)]
    for i in range(6,-1,-1):
        ax = fig.add_axes([i/7,0,0.2,1])
        ax.axis('off')
        ax.patch.set_alpha(0)
        ax.set_xlim([0,temp.shape[0]/7.0/10])
        ax.set_ylim([-4.9,-3.7])
        ax.hist(temp2.loc[temp2['pickup_date']==i, 'speed'], \
                          orientation='horizontal', bins=80, alpha=0.5)
#         sns.kdeplot(temp.loc[temp['pickup_date']==i,'speed'], \
#                     shade=True, legend=False, vertical=True)
    plt.show()
temp = train.loc[train['trip_duration']<3600, ['speed','pickup_date']]
plot(temp, "Trips of duration less than 1h (sample size {:d})".format(temp.shape[0]))
temp = train.loc[train['trip_duration']>3600, ['speed','pickup_date']]
plot(temp, "Trips of duration more than 1h (sample size {:d})".format(temp.shape[0]))
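Since `pickup_date` holds the weekday code returned by `DatetimeIndex.weekday`, it helps to check how pandas numbers the days. A short check, using one arbitrary week from January 2016 as an example:

```python
import pandas as pd

# 2016-01-04 was a Monday; seven consecutive days cover one full week
days = pd.date_range("2016-01-04", periods=7, freq="D")
codes = pd.DatetimeIndex(days).weekday      # integer weekday codes
names = days.day_name()                     # matching English day names

mapping = dict(zip(codes.tolist(), names.tolist()))
print(mapping)
```

The codes start at 0 for Monday and end at 6 for Sunday, so axis labels for a weekday plot should follow that order.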

The effect of locations will not be addressed in this report.

## data exploration -- outliers

The previous analysis gives us some idea about the data. We notice that some data points have values that go against common sense. We discuss these data in detail below.

* ***passenger_count***: in the histogram of passenger_count, there is a category with zero passengers. We remove these data because a trip without passengers should not be valid.
* ***trip_duration***: in the distribution of trip_duration, there is a small peak on the right at around 5, which corresponds to $10^5$ seconds, or about 28 hours. We do not expect a driver to drive for more than 8 hours, so we remove data points with trip_duration of more than $3600\times8$ seconds.
* ***Manhattan_distance***: in the distribution of this variable, there is a small peak on the left at -6. These are data with zero distance. We assume these data were entered by accident and remove them.

We apply these criteria to the training and testing data and print the size of the new data.

In [None]:
new_train = remove_exotic(train)
new_test = remove_exotic(test)
print("size of new data v.s. original data")
print("train: {}/{} \t {:.2f}%".format(new_train.shape[0], train.shape[0], \
                                     100.0*new_train.shape[0]/train.shape[0]))
print("test:  {}/{} \t {:.2f}%".format(new_test.shape[0], test.shape[0], \
                                     100.0*new_test.shape[0]/test.shape[0]))
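The three criteria can be seen at work on a toy table. This is a sketch with invented rows; it checks for the `trip_duration` column explicitly instead of catching a `KeyError`, which is one possible alternative and not the notebook's own approach.

```python
import pandas as pd

toy = pd.DataFrame({
    "passenger_count": [1, 0, 2, 1],
    "Manhattan_dist":  [-1.8, -2.0, -6.0, -1.5],   # log10 of distance
    "trip_duration":   [600, 900, 300, 100000],    # seconds
})

# criteria 1 and 3: at least one passenger and a non-zero distance
keep = (toy["passenger_count"] > 0) & (toy["Manhattan_dist"] > -5.9)
# criterion 2 applies only where the duration is known (training data)
if "trip_duration" in toy.columns:
    keep &= toy["trip_duration"] < 3600 * 8

clean = toy[keep].copy()
print("kept {}/{} rows ({:.2f}%)".format(len(clean), len(toy),
                                         100.0 * len(clean) / len(toy)))
```

Each of the last three rows violates exactly one criterion, so only the first row survives.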

## estimation -- linear regression

We first use a linear model for the estimation.

In [None]:
preprocess(new_train)
preprocess(test)

In [None]:
columns = ['vendor_id', 'passenger_count', 'store_and_fwd_flag', \
           'pickup_longitude', 'pickup_latitude', 'dropoff_longitude', 'dropoff_latitude', \
           'Manhattan_dist', 'pickup_date', 'pickup_hour']
X = new_train[columns]
y = new_train['log_duration']
lm_model = lm(y, X)
lm_result = lm_model.fit()
lm_result.summary()

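The R-squared of the linear model is very close to one (0.995). However, this should not be read as evidence of a good fit: the model is fitted without an intercept, and in that case statsmodels reports the uncentered R-squared, which is inflated when the mean of the response is far from zero. The p-values of the features are almost zero, which is expected with a data set this large and indicates statistical significance rather than practical importance. The predicted values give a score of 0.60486 on the test data after submission, which is not too bad.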
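To see how an R-squared computed without an intercept can be close to one even for a weak model, here is a sketch on synthetic data. Everything in it is an assumption for illustration: the data are random, the column `lat` merely mimics a latitude-like feature with a large mean, and both definitions of R-squared are computed by hand with NumPy instead of statsmodels.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
x = rng.normal(size=n)                              # weakly informative feature
lat = 40.75 + rng.normal(scale=0.03, size=n)        # nearly constant, large mean
# hypothetical response with a large mean, similar in scale to a log duration
y = 6.5 + 0.1 * x + rng.normal(scale=0.8, size=n)

# least squares WITHOUT an intercept column
X = np.column_stack([lat, x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta
# uncentered R^2: residuals are compared with sum(y^2)
r2_uncentered = 1 - (resid @ resid) / (y @ y)

# least squares WITH an intercept column
X1 = np.column_stack([np.ones(n), lat, x])
beta1, *_ = np.linalg.lstsq(X1, y, rcond=None)
resid1 = y - X1 @ beta1
# centered R^2: residuals are compared with the variance of y around its mean
r2_centered = 1 - (resid1 @ resid1) / np.sum((y - y.mean()) ** 2)

print(r2_uncentered, r2_centered)
```

The two fits have almost the same residuals, yet the uncentered value is far larger, because most of $\sum y^2$ comes from the mean of $y$ and not from its variation.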

In [None]:
# predicted values
duration = np.exp(lm_result.predict(test[columns]))-1
duration = np.where(duration<0, 0, duration)

## estimation -- random forest

To get the results quickly, we only use 250,000 samples as training data.

In [None]:
n = X.shape[0]
shuffle = np.random.permutation(n)

rf_model = rf(n_estimators=300)
rf_model.fit(X.values[shuffle[:250000]], y.values[shuffle[:250000]])
feature_importance = rf_model.feature_importances_

In [None]:
plt.figure()
pos = np.arange(len(columns))
plt.barh(pos, feature_importance)
plt.yticks(pos, columns)
plt.xlabel("feature importance")
plt.show()

As expected, Manhattan_dist has the highest importance among all the features. We calculate the error using the remaining samples of the training data, which were not used for fitting.

In [None]:
# evaluate only on samples that were not used for fitting
pred = rf_model.predict(X.values[shuffle[250000:]])
print("Error is:", np.sqrt(np.mean((y.values[shuffle[250000:]]-pred)**2)))
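A hold-out error is only meaningful when the evaluation rows are disjoint from the fitting rows. The sketch below shows such a split on toy sizes, together with a helper that measures the error on the $\log(x+1)$ scale used for `log_duration`. The helper name `rmsle` and the toy sizes are assumptions made here, and it is also an assumption that the submission score is computed on this scale.

```python
import math
import numpy as np

def rmsle(actual, predicted):
    """Root mean squared error between log(actual + 1) and log(predicted + 1)."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.sqrt(np.mean((np.log1p(predicted) - np.log1p(actual)) ** 2)))

rng = np.random.default_rng(0)
n, n_train = 10, 6                  # toy sizes standing in for the real counts
shuffle = rng.permutation(n)
train_idx = shuffle[:n_train]       # rows used for fitting
valid_idx = shuffle[n_train:]       # remaining rows, never seen during fitting

print(sorted(train_idx.tolist()), sorted(valid_idx.tolist()))
print(rmsle([600.0, 1200.0], [600.0, 1200.0]))
```

Slicing the same permutation at a single cut point guarantees that the two index sets do not overlap.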

There are many other things we could do to explore the data and improve the estimation; I cannot exhaust all possibilities in this report. I hope this post can guide those not familiar with data analysis and inspire those who are.