# Introduction


I will first perform an Exploratory Data Analysis (EDA) on the dataset to find out which features might be worth using. My feeling is that the hour of the trip (0~23), the weekday (1~7) and some measure of distance between pickup and dropoff points might be enough to build a reasonably good model for trip duration.

First, I will import some useful and standard libraries.

In [21]:
import pandas as pd
import datetime as dt
import numpy as np 
import matplotlib.pyplot as plt
from math import sin, cos, sqrt, atan2, radians
import seaborn as sns
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import KFold


%matplotlib inline

Then, I will define some functions that will help me get the features I need from the raw data. Mainly, I will convert the pickup_datetime from a string to a datetime object so that I can extract the hour and day of the week easily.
Also, I will use the function *get_distance* to calculate the distance between pickup (lat, long) and dropoff (lat, long). It uses the haversine formula, so the result is the straight-line (great-circle) distance in kilometers.

In [22]:
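def convert_datetime(s):
    # Parse 'YYYY-MM-DD HH:MM:SS' strings; values that are already datetimes pass through
    if isinstance(s, str):
        return dt.datetime.strptime(s, '%Y-%m-%d %H:%M:%S')
    return s

def get_hour(d):
    return d.hour

def get_weekday(d):
    # ISO weekday: Monday=1 ... Sunday=7
    return d.isoweekday()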

# def get_geohash(row):
#     return geohash.encode(row['pickup_latitude'], row['pickup_longitude'], precision=6)

def get_distance(lat1, long1, lat2, long2):
    # Haversine (great-circle) distance between two (lat, long) points, in km
    R = 6373.0  # approximate radius of the Earth in km

    lat1 = radians(lat1)
    lon1 = radians(long1)

    lat2 = radians(lat2)
    lon2 = radians(long2)

    dlon = lon2 - lon1
    dlat = lat2 - lat1

    a = sin(dlat / 2)**2 + cos(lat1) * cos(lat2) * sin(dlon / 2)**2
    c = 2 * atan2(sqrt(a), sqrt(1 - a))

    return R * c

def distance(row):
    return get_distance(row['pickup_latitude'], row['pickup_longitude'], row['dropoff_latitude'], row['dropoff_longitude'])

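def rmsle(y_test, y_pred):
    # Root mean squared logarithmic error; inputs are flattened so that
    # Series, single-column DataFrames and (n, 1) arrays all yield one scalar
    y_test = np.asarray(y_test, dtype=float).ravel()
    y_pred = np.asarray(y_pred, dtype=float).ravel()
    assert len(y_test) == len(y_pred)
    return np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_test))**2))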

def count_elements(array):
    return len(pd.unique(array))
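
As a quick sanity check of the *rmsle* metric, here is a minimal sketch on made-up durations (the numbers are purely illustrative and the function is restated so the snippet runs on its own). Because the metric works on logarithms, predicting twice the real duration or half of it costs roughly the same, about ln(2) ≈ 0.69.

```python
import numpy as np

def rmsle(y_test, y_pred):
    """Root mean squared logarithmic error, same formula as the helper above."""
    y_test = np.asarray(y_test, dtype=float).ravel()
    y_pred = np.asarray(y_pred, dtype=float).ravel()
    assert len(y_test) == len(y_pred)
    return np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_test)) ** 2))

# Hypothetical trip durations in seconds, chosen only for illustration
actual = np.array([600.0, 1200.0, 300.0])

perfect = rmsle(actual, actual)      # predictions identical to the truth
doubled = rmsle(actual, 2 * actual)  # every prediction is twice too long
halved = rmsle(actual, actual / 2)   # every prediction is twice too short

print(perfect, doubled, halved)
```

This is why a score around 0.4 means that predictions are typically off by a factor of about e^0.4 ≈ 1.5, not by a fixed number of seconds.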

Then, I will load the training and test datasets.

In [23]:
train_data = pd.read_csv('../input/train.csv')
test_data = pd.read_csv('../input/test.csv')

And use my *convert_datetime* function to turn pickup_datetime strings into datetime objects.

In [None]:
train_data.pickup_datetime = train_data.pickup_datetime.apply(convert_datetime)

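Then, I will get the *hour*, *weekday* and *Eucl_distance* features. Despite its name, *Eucl_distance* holds the haversine (straight-line) distance in kilometers returned by *get_distance*.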

In [None]:
train_data['hour'] = train_data.pickup_datetime.apply(get_hour)
train_data['weekday'] = train_data.pickup_datetime.apply(get_weekday)
train_data['Eucl_distance'] = train_data.apply(distance, axis=1)
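
Row-by-row `apply` calls can be slow on a large training set. The following is a sketch of how the same three features could be computed in a vectorized way, assuming the same column names as above. The helper `haversine_km` and the sample rows are hypothetical and only there to make the snippet runnable on its own.

```python
import numpy as np
import pandas as pd

EARTH_RADIUS_KM = 6373.0  # same radius as in get_distance

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km; accepts scalars, arrays or pandas Series."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    # 2 * arcsin(sqrt(a)) is equivalent to 2 * atan2(sqrt(a), sqrt(1 - a))
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))

# Made-up sample rows using the same column names as the training data
sample = pd.DataFrame({
    'pickup_datetime': ['2016-03-14 17:24:55', '2016-06-12 00:43:35'],
    'pickup_latitude': [40.7679, 40.7386],
    'pickup_longitude': [-73.9822, -73.9805],
    'dropoff_latitude': [40.7656, 40.7386],
    'dropoff_longitude': [-73.9646, -73.9805],
})

# Parse all strings at once, then read hour and weekday through the .dt accessor
sample['pickup_datetime'] = pd.to_datetime(sample['pickup_datetime'],
                                           format='%Y-%m-%d %H:%M:%S')
sample['hour'] = sample['pickup_datetime'].dt.hour
# dayofweek is Monday=0 ... Sunday=6, so add 1 to match isoweekday()
sample['weekday'] = sample['pickup_datetime'].dt.dayofweek + 1
sample['Eucl_distance'] = haversine_km(
    sample['pickup_latitude'], sample['pickup_longitude'],
    sample['dropoff_latitude'], sample['dropoff_longitude'],
)

print(sample[['hour', 'weekday', 'Eucl_distance']])
```

The results should match the `apply` version; the difference is only that NumPy and pandas process whole columns at once.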

In [None]:
train_data.describe()

From the description above, it's clear that some damaged data made it into our training set. For example, there are trips with durations of 1 second and others that lasted more than a day! There are also trips where the pickup point is the same as the dropoff point (Eucl_distance = 0), and others that went on for more than 1000 kilometers, which is just impossible.

So, some data cleaning is needed.

In [None]:
train_data = train_data[(train_data['trip_duration']>=60) & (train_data['trip_duration']<=10800) & (train_data['Eucl_distance']<50)].reset_index(drop=True)

I've set the following limits for my data: the trip duration must be between 60s (1 minute) and 10800s (3 hours), and the pickup and dropoff points must be less than 50 kilometers apart (remember that this is the straight-line distance, which doesn't take into account all the zigzags a car must make to get around the city). Anything outside these limits will be discarded.
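
To make the effect of these limits concrete, here is a small sketch on a made-up table. The constant names and the toy rows are my own and not part of the dataset; only the thresholds come from the text above.

```python
import pandas as pd

# Thresholds described above; the constant names are illustrative
MIN_DURATION_S = 60
MAX_DURATION_S = 10800
MAX_DISTANCE_KM = 50

# Made-up rows: a 1-second trip, a normal trip, a day-long trip,
# a 1240 km trip and another normal trip
toy = pd.DataFrame({
    'trip_duration': [1, 455, 86390, 900, 2000],
    'Eucl_distance': [0.0, 1.5, 3.2, 1240.9, 12.0],
})

# between() is inclusive on both ends, matching the >= and <= comparisons
keep = (
    toy['trip_duration'].between(MIN_DURATION_S, MAX_DURATION_S)
    & (toy['Eucl_distance'] < MAX_DISTANCE_KM)
)
cleaned = toy[keep].reset_index(drop=True)

print(f'kept {keep.sum()} of {len(toy)} rows')
```

Only the two plausible trips survive; each of the other rows breaks at least one of the limits.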

## EDA

In [None]:
plt.figure(figsize=(10, 12))
pivot_2 = train_data.pivot_table(index='hour' , columns='weekday', values='id', aggfunc=count_elements)
pivot_2.sort_index(level=0, ascending=False, inplace=True)
ax2 = sns.heatmap(pivot_2)
ax2.set_title('# of Rides per Pickup Hour and Weekday')
plt.show()

From the heatmap above, it's clear that few trips happen between 0h~6h on weekdays and 3h~8h on weekends, which is when people are usually sleeping. Also, a peak in trips seems to take place in the morning (8h~9h) and in the evening (18h~21h) on weekdays, which might indicate that a lot of employees/students use taxis to get to and from work or university. Finally, the unusually busy hours from 0h~3h on weekends might indicate that our customers use our service to get home from parties/clubs/pubs.

In [None]:
plt.figure(figsize=(10, 12))
pivot_1 = train_data.pivot_table(index='hour' , columns='weekday', values='trip_duration', aggfunc='mean')
pivot_1.sort_index(level=0, ascending=False, inplace=True)
ax3 = sns.heatmap(pivot_1)
ax3.set_title('Trip Duration [seconds] per Pickup Hour and Weekday')
plt.show()

From the map above, it's clear that the longest trips happen between 8h and 18h on weekdays, which again coincides with employees arriving at/leaving work. This might indicate that traffic jams at these hours make trips take longer than usual.

In [None]:
tab = train_data.pivot_table(index=['weekday', 'hour'] , values=['trip_duration', 'Eucl_distance'], aggfunc='sum')
tab['avg_velocity'] = tab['Eucl_distance'] / (tab['trip_duration']/3600)
tab = tab.reset_index()
tab.avg_velocity = tab.avg_velocity.astype(int)
plt.figure(figsize=(10, 12))
tab = tab.pivot(index='hour', columns='weekday', values='avg_velocity')
tab.sort_index(level=0, ascending=False, inplace=True)
ax4 = sns.heatmap(tab, annot=True, fmt='d')
ax4.set_title('Avg Velocity [km/h] per Pickup hour and weekday')
plt.show()

The velocity map confirms this: the average speed drops to about 11 km/h during business hours. The velocity was calculated as the straight-line distance over the trip duration, which underestimates the real speed because cars don't drive in straight lines, but it is a nice proxy for the real number.

## Modelling

Now that we have a sense of our data, let's get to modelling our predictor.

I will be using the KNeighborsRegressor because it makes sense that trips that start at the same hour, on the same weekday and cover the same distance should take approximately the same time. I hope that the hour/weekday features take care of the traffic jam effect.
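
To illustrate that intuition, here is a tiny sketch with made-up trips (the numbers are invented for the example). With `n_neighbors=2`, the prediction for a new trip is simply the average duration of the two most similar trips in the training set.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Made-up trips: [hour, weekday, distance in km]
X = np.array([
    [8, 1, 2.0],    # Monday 8h, 2.0 km
    [8, 1, 2.2],    # Monday 8h, 2.2 km
    [8, 2, 2.1],    # Tuesday 8h, 2.1 km
    [20, 6, 10.0],  # Saturday 20h, 10 km
])
y = np.array([600.0, 660.0, 630.0, 1800.0])  # durations in seconds

knn = KNeighborsRegressor(n_neighbors=2)
knn.fit(X, y)

# New trip: Monday 8h, 2.1 km. Its two nearest neighbors are the first two rows.
query = np.array([[8, 1, 2.1]])
pred = knn.predict(query)[0]

print(pred)  # average of 600 and 660
```

Note that the neighbors are found with plain Euclidean distance over the raw feature values, so one hour, one weekday and one kilometer all count the same in this sketch.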

In [None]:
kf = KFold(n_splits=3)
neighbors_array=[]
training_score=[]
testing_score=[]
for n_neighbors in range(1,10,2):
    knn = KNeighborsRegressor(n_neighbors=n_neighbors)
    print('\nEvaluating metrics with n_neighbors= ' + str(n_neighbors))
    neighbors_array.append(n_neighbors)
    ind_test_score=[]
    ind_train_score=[]
    
    for train_index, test_index in kf.split(train_data):
        train_x = train_data.loc[train_index, ['hour', 'weekday', 'Eucl_distance']]
        # Select the target as a 1-D Series so predictions and scores stay 1-D
        train_y = train_data.loc[train_index, 'trip_duration']

        test_x = train_data.loc[test_index, ['hour', 'weekday', 'Eucl_distance']]
        test_y = train_data.loc[test_index, 'trip_duration']

        knn.fit(train_x, train_y)
        y_pred_test = knn.predict(test_x)
        y_pred_train = knn.predict(train_x)
        
        ind_train_score.append(rmsle(train_y, y_pred_train))
        ind_test_score.append(rmsle(test_y, y_pred_test))
    training_score.append(np.mean(ind_train_score))
    testing_score.append(np.mean(ind_test_score))

In [None]:
plt.figure(figsize=(10, 12))
plt.plot(neighbors_array[:4], training_score[:4], label='training error')
plt.plot(neighbors_array[:4], testing_score[:4], label='testing error')
plt.title('Training and Testing Errors vs N_Neighbors')
plt.ylabel('RMSLE')
plt.xlabel('n_neighbors')
plt.legend()
plt.show()

As expected, the training error starts off very low and the testing error very high. That indicates the usual overfitting behavior we get from too few neighbors. Both errors converge at around 0.4, which is a relatively good error for such a simple model.

## Final Considerations

This model is very simplistic, and there are many things that can be done to improve it. For example, we could use weather data to figure out whether it rained or snowed heavily on a specific day/hour. That certainly influences trip durations. Also, hour and weekday are cyclical measures. That means that 23 hours is closer to 0 hours than to 20 hours. The same goes for the weekday (Sunday is closer to Monday than it is to Friday). But this behavior is not captured by the chosen model. A nice and elegant way to incorporate such cyclical behavior would be to use sine/cosine features, as explained in this thread: https://www.reddit.com/r/MachineLearning/comments/203pqk/machine_learning_and_comparing_times/?st=jhgziikz&sh=d21d3e70.
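
As a rough sketch of the sine/cosine idea (I did not use it in the model above, and the helper names `add_cyclical` and `gap` are hypothetical), each cyclical value is mapped to a point on a circle, so that the end of the cycle lands next to its start.

```python
import numpy as np
import pandas as pd

def add_cyclical(df, column, period, offset=0):
    """Return a copy of df with <column>_sin and <column>_cos added.

    offset is the smallest value of the column (0 for hour, 1 for weekday).
    """
    angle = 2 * np.pi * (df[column] - offset) / period
    out = df.copy()
    out[column + '_sin'] = np.sin(angle)
    out[column + '_cos'] = np.cos(angle)
    return out

# Rows: (0h, Monday), (20h, Friday), (23h, Sunday)
toy = pd.DataFrame({'hour': [0, 20, 23], 'weekday': [1, 5, 7]})
toy = add_cyclical(toy, 'hour', period=24)
toy = add_cyclical(toy, 'weekday', period=7, offset=1)

def gap(df, cols, i, j):
    """Euclidean distance between rows i and j over the given columns."""
    a = df.loc[i, cols].to_numpy(dtype=float)
    b = df.loc[j, cols].to_numpy(dtype=float)
    return float(np.linalg.norm(a - b))

hour_cols = ['hour_sin', 'hour_cos']
day_cols = ['weekday_sin', 'weekday_cos']

print('23h vs 0h: ', gap(toy, hour_cols, 2, 0))
print('23h vs 20h:', gap(toy, hour_cols, 2, 1))
print('Sun vs Mon:', gap(toy, day_cols, 2, 0))
print('Sun vs Fri:', gap(toy, day_cols, 2, 1))
```

With these features, 23h ends up closer to 0h than to 20h, and Sunday closer to Monday than to Friday, which is the behavior a distance-based model like KNN needs.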

Thank you for reading! I'm open to feedback and critiques.