# New York City Taxi Trip Duration

This is a simple EDA of the New York City Taxi Trip Duration dataset.  
In this kernel, I'll work only with the training data.  

In this competition, we have to predict the duration of a taxi trip between two points in New York.
The duration of a trip is a function of its distance and of the average speed of the taxi.  
We can estimate the distance from the pickup and dropoff coordinates. 
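
As a minimal sketch of that relationship (the function names and the numbers are illustrative, not taken from the dataset), the duration follows from the distance and the average speed, and the same formula inverted gives the speed:

```python
def trip_duration_seconds(distance_km, avg_speed_kmh):
    """Duration in seconds of a trip of distance_km at avg_speed_kmh."""
    return distance_km / avg_speed_kmh * 3600


def average_speed_kmh(distance_km, duration_s):
    """Average speed in km/h, the inverse of the relation above."""
    return distance_km / (duration_s / 3600)


# Hypothetical trip: 5 km at an average of 20 km/h
duration = trip_duration_seconds(5, 20)
print(duration)                        # 900.0
print(average_speed_kmh(5, duration))  # 20.0
```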

For a given distance, the variance in trip duration should depend on the variance in speed.  
Some of the factors that can influence the average speed are:
- the day and hour of the trip. The speed should be lower during rush hours.
- accidents.
- road work.
- the driving style of the driver. Maybe some drivers rush more than others.
- bridges, tunnels, etc.
- the weather.

In this EDA, I'll try to identify which of the available features can help us predict the duration of a trip.  
At this point, I won't use external data.

First, load the tools.

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from datetime import date, datetime
from haversine import haversine

# statistics package
import statsmodels.api as sm
from statsmodels.formula.api import ols
from scipy import stats

# packages for mapping
from mpl_toolkits.basemap import Basemap

# packages for interactive graphs
from ipywidgets import widgets, interact
from IPython.display import display

%matplotlib inline

In [None]:
def data_distribution(data):
    """ Draws a chart showing data distribution
        by combining a histogram and a boxplot
        
    Parameters
    ----------
    data: array or series
        the data to draw the distribution for
        
    """
    
    x = np.array(data)
    
    # set the number of bins using the Rice rule
    # n_bins = twice cube root of number of observations
    n = len(x)
    n_bins = round(2 * n**(1/3))
    
    fig = plt.figure()
    
    # histogram on the upper axes
    ax1 = fig.add_axes([0.1, 0.3, 0.8, 0.6])
    ax1.hist(x, bins=n_bins, alpha=0.7)
    ax1.grid(alpha=.5)
    
    # boxplot on the lower axes
    ax2 = fig.add_axes([0.1, 0.1, 0.8, 0.2])
    ax2.boxplot(x, vert=False, widths=0.7)
    ax2.grid(alpha=.5)
           
    plt.show()
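
To give a feel for the Rice rule used above, here is a small sketch that only computes the number of bins. The helper name `rice_bins` is an assumption made for illustration.

```python
def rice_bins(n):
    """Number of histogram bins for n observations (Rice rule)."""
    return round(2 * n ** (1 / 3))


for n in (8, 1000):
    print(n, rice_bins(n))
```

Because the bin count grows with the cube root of the sample size, even a dataset with more than a million rows only gets a few hundred bins.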

In [None]:
def distance(lat1, lon1, lat2, lon2):
    """calculates the Manhattan distance between 2 points
        using their coordinates
    
    Parameters
    ----------
    lat1: float
        latitude of first point
        
    lon1: float
        longitude of first point
        
    lat2: float
        latitude of second point
    
    lon2: float
        longitude of second point
        
    Returns
    -------
    d: float
        The Manhattan distance between the two points in kilometers
        
    """
    
    # north-south leg (constant longitude) plus east-west leg (constant latitude)
    d = haversine((lat1, lon1), (lat2, lon1)) + haversine((lat2, lon1), (lat2, lon2))
    return d
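
The following sketch shows how the two legs add up, without depending on the `haversine` package. The function `great_circle_km` is a hand-written stand-in for that package, the Earth radius is an approximation, and the coordinates are hypothetical.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, an approximation


def great_circle_km(p1, p2):
    """Haversine distance in km between two (lat, lon) points in degrees."""
    lat1, lon1 = map(radians, p1)
    lat2, lon2 = map(radians, p2)
    a = (sin((lat2 - lat1) / 2) ** 2
         + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


def manhattan_km(lat1, lon1, lat2, lon2):
    """North-south leg plus east-west leg, as in distance() above."""
    corner = (lat2, lon1)
    return (great_circle_km((lat1, lon1), corner)
            + great_circle_km(corner, (lat2, lon2)))


# Hypothetical pickup and dropoff points inside Manhattan
pickup = (40.75, -73.99)
dropoff = (40.78, -73.96)
direct = great_circle_km(pickup, dropoff)
blocks = manhattan_km(*pickup, *dropoff)
print(round(direct, 2), round(blocks, 2))
```

The Manhattan distance is never shorter than the direct one. Note that Manhattan's street grid is rotated relative to true north, so legs that follow latitude and longitude only approximate the distance actually driven.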

## The dataset
Let's look first at what data we have.

In [None]:
df = pd.read_csv("../input/train.csv")
print("Rows: {}".format(df.shape[0]))
print("Columns: {}".format(df.shape[1]))

In [None]:
df.info()

We have about 1.5 million rows. This is a fairly large dataset.  
There are only 11 columns and the entire dataset takes about 122 MB in memory.  
There are no missing values.  

From the competition description, we have the following information on the content of each column:
- **id** - a unique identifier for each trip
- **vendor_id** - a code indicating the provider associated with the trip record
- **pickup_datetime** - date and time when the meter was engaged
- **dropoff_datetime** - date and time when the meter was disengaged
- **passenger_count** - the number of passengers in the vehicle (driver entered value)
- **pickup_longitude** - the longitude where the meter was engaged
- **pickup_latitude** - the latitude where the meter was engaged
- **dropoff_longitude** - the longitude where the meter was disengaged
- **dropoff_latitude** - the latitude where the meter was disengaged
- **store_and_fwd_flag** - This flag indicates whether the trip record was held in vehicle memory before sending to the vendor because the vehicle did not have a connection to the server - Y=store and forward; N=not a store and forward trip
- **trip_duration** - duration of the trip in seconds

Let's see a preview:

In [None]:
df.head()

## Tidying the data

First, we convert the datetime columns into actual datetime objects.

In [None]:
df["pickup_datetime"] = pd.to_datetime(df["pickup_datetime"])
df["dropoff_datetime"] = pd.to_datetime(df["dropoff_datetime"])

Next, from the pickup datetime, we can extract the month, day, weekday, hour and minute, as well as the time of day expressed in decimal hours. 

In [None]:
# the .dt accessor extracts datetime components for the whole column at once
df["pickup_month"] = df["pickup_datetime"].dt.month
df["pickup_day"] = df["pickup_datetime"].dt.day
df["pickup_weekday"] = df["pickup_datetime"].dt.weekday
df["pickup_hour"] = df["pickup_datetime"].dt.hour
df["pickup_minute"] = df["pickup_datetime"].dt.minute
df["pickup_time"] = df["pickup_hour"] + (df["pickup_minute"] / 60)

df["dropoff_hour"] = df["dropoff_datetime"].dt.hour
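
To make the encoding of these features concrete, here is a small sketch on a single, hypothetical timestamp using only the standard library:

```python
from datetime import datetime

# Hypothetical pickup timestamp
pickup = datetime(2016, 3, 14, 17, 24, 55)

weekday = pickup.weekday()                      # Monday is 0, Sunday is 6
pickup_time = pickup.hour + pickup.minute / 60  # decimal hours, seconds ignored

print(weekday, pickup_time)
```

The decimal time keeps the hour and the minute in a single continuous value, which is convenient for scatter plots against the time of day.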

We can estimate the distance of each trip from the coordinates of the pickup and dropoff points.
As most New York streets form an orthogonal grid, I calculate the Manhattan distance between the pickup and the dropoff point.

In [None]:
# The distance is calculated in kilometers
df["distance"] = df.apply(lambda row: distance(row["pickup_latitude"], 
                                               row["pickup_longitude"], 
                                               row["dropoff_latitude"], 
                                               row["dropoff_longitude"]), axis=1)

And from the distance and the trip duration, we can estimate an average speed.  
The average speed is what will be influenced by external factors.

In [None]:
# The speed is calculated in km/h
df["speed"] = df["distance"] / (df["trip_duration"] / 3600)

Finally, the store and forward flag is a categorical variable.
We convert it to numbers.


In [None]:
flags = {"N":0, "Y":1}
df["store_and_fwd_flag"] = df["store_and_fwd_flag"].map(flags)

In [None]:
df.info()

We can now start our analysis.  
We start with the target, then we'll look at the features.

## Trip duration

In [None]:
df["trip_duration"].describe()

There's a problem with the data.  
The maximum value is 3.5 million seconds (about 40 days).  
The minimum value is 1 second.  
It seems that we have a lot of erroneous data. 
Let's look at the extremely long trips (> 36,000 seconds).

In [None]:
df[["trip_duration", "vendor_id", "passenger_count", "store_and_fwd_flag", "distance", "speed"]][df["trip_duration"] > 36000].shape[0]

We have about 2,000 records with a trip duration over 36,000 seconds (that's more than 10 hours!).  
These records are most likely erroneous.  
Let's look at the distribution of trips with a duration of at most 3,600 seconds.

In [None]:
data_distribution(df["trip_duration"][df["trip_duration"] <= 3600])

The distribution is skewed to the right. Most taxi trips are short trips of less than 10 minutes.
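
The thresholds used above can be turned into a simple plausibility mask. This is only a sketch on hypothetical durations: the upper bound of 36,000 seconds comes from the text, while the lower bound of 60 seconds is an assumption that would need to be checked against the data.

```python
import pandas as pd

# Hypothetical durations in seconds, not taken from the dataset
trips = pd.DataFrame({"trip_duration": [1, 420, 900, 3500, 40000, 3500000]})

MIN_S, MAX_S = 60, 36000  # assumed plausibility bounds; tune after inspection

# between() is inclusive on both ends by default
plausible = trips["trip_duration"].between(MIN_S, MAX_S)
print("kept:", plausible.sum(), "flagged:", (~plausible).sum())
```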

## Location

In [None]:
plt.figure(figsize=(20,20))

# Set the limits of the map to the minimum and maximum coordinates
lat_min = df["pickup_latitude"].min() - .2
lat_max = df["pickup_latitude"].max() + .2
lon_min = df["pickup_longitude"].min() - .2
lon_max = df["pickup_longitude"].max() + .2

# Set the center of the map
cent_lat = (lat_min + lat_max) / 2
cent_lon = (lon_min + lon_max) / 2

map = Basemap(llcrnrlon=lon_min, llcrnrlat=lat_min, urcrnrlon=lon_max, urcrnrlat=lat_max,
             resolution='h', projection='tmerc', lat_0 = cent_lat, lon_0 = cent_lon)

map.drawmapboundary(fill_color='aqua')
map.fillcontinents(color='lightgray', lake_color='aqua')
map.drawcountries(linewidth=2)
map.drawstates(color='b')

long = np.array(df["pickup_longitude"])
lat = np.array(df["pickup_latitude"])

x, y = map(long, lat)
map.plot(x, y,'ro', markersize=3, alpha=1)

plt.show()

Well, we have some strange pickups. There's one in California and one in Canada.  
Some are quite far from New York but remain in the neighbouring states.  
Let's see the trip that started in California.

In [None]:
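# compare with the column minimum directly: (min - .2) + .2 may not give back min exactly in floating point
df[["id", "distance", "trip_duration", "speed"]][df["pickup_longitude"] == df["pickup_longitude"].min()]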

The data is wrong. This is an 18-meter trip that lasted 8 minutes.
Let's look at the trip in Canada.


In [None]:
# compare with the column maximum directly, for the same floating-point reason
df[["id", "distance", "trip_duration", "speed"]][df["pickup_latitude"] == df["pickup_latitude"].max()]

Here, we have a long trip of more than 1,000 kilometers done by a supersonic taxi.
Let's zoom in on New York.


In [None]:
plt.figure(figsize=(20,20))

# Set the limits of the map to a fixed bounding box around New York City
lat_min = 40.6
lat_max = 40.9
lon_min = -74.2
lon_max = -73.7

# Set the center of the map
cent_lat = (lat_min + lat_max) / 2
cent_lon = (lon_min + lon_max) / 2

map = Basemap(llcrnrlon=lon_min, llcrnrlat=lat_min, urcrnrlon=lon_max, urcrnrlat=lat_max,
             resolution='h', projection='tmerc', lat_0 = cent_lat, lon_0 = cent_lon)

map.drawmapboundary(fill_color='aqua')
map.fillcontinents(color='lightgray', lake_color='aqua')
map.drawcountries(linewidth=2)
map.drawstates(color='b')

long = np.array(df["pickup_longitude"])
lat = np.array(df["pickup_latitude"])

x, y = map(long, lat)
map.plot(x, y,'ro', markersize=2, alpha=0.2)

plt.show()

Most of the trips happen in Manhattan. We can also see a fair number of trips originating from Brooklyn.
There's also a line of pickups going to the airport. The pickup location can influence the duration. For example, trips starting from the airport could be longer than the average trip starting in Manhattan.

In [None]:
lm = ols("trip_duration ~ pickup_latitude + pickup_longitude", data=df).fit()
print(lm.summary())
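
The limits of the zoomed map can also serve as a simple geographic filter. The sketch below reuses those limits; the helper name `in_nyc_box` and the test points are hypothetical.

```python
# Limits copied from the zoomed map above
LAT_MIN, LAT_MAX = 40.6, 40.9
LON_MIN, LON_MAX = -74.2, -73.7


def in_nyc_box(lat, lon):
    """True if the point lies inside the map's bounding box."""
    return LAT_MIN <= lat <= LAT_MAX and LON_MIN <= lon <= LON_MAX


# Hypothetical points: midtown Manhattan and somewhere in California
print(in_nyc_box(40.75, -73.99))   # True
print(in_nyc_box(37.77, -122.42))  # False
```

Restricting a regression on pickup coordinates to points inside such a box would keep a handful of far-away pickups from dominating the fit, although whether that helps here remains to be verified.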

## Distance

The trip duration will depend mostly on the distance.  
We'll look at this variable first.

In [None]:
df["distance"].describe()

In [None]:
data_distribution(df["distance"])

The trip distances are also very skewed to the right, with several trips that are several hundred kilometers long.


In [None]:
data_distribution(df["distance"][df["distance"] <= 100])

In [None]:
lm = ols("trip_duration ~ distance", data=df).fit()
print(lm.summary())
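
To make explicit what a one-variable model such as `trip_duration ~ distance` estimates, here is a hand-written least-squares sketch on hypothetical data. It is not the statsmodels implementation, only the textbook formulas for the intercept, the slope and R².

```python
def simple_ols(x, y):
    """Least-squares fit of y = intercept + slope * x.

    Returns (intercept, slope, r2).
    """
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return intercept, slope, 1 - ss_res / ss_tot


# Hypothetical trips: distance in km, duration in seconds
distance_km = [1.0, 2.0, 3.0, 4.0, 5.0]
duration_s = [310, 480, 700, 850, 1060]

intercept, slope, r2 = simple_ols(distance_km, duration_s)
print(f"intercept={intercept:.1f} s, slope={slope:.1f} s/km, R2={r2:.3f}")
```

The slope reads as seconds per additional kilometer. Since the fit minimizes squared errors, a few extreme durations or distances can pull it strongly, which is worth keeping in mind when reading the summary on the unfiltered data.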

## Speed

With many outliers in distance and trip duration, we can expect the average speeds to also be spread over a large range.

In [None]:
df["speed"].describe()

In [None]:
data_distribution(df["speed"])

We can see that the speeds can reach thousands of kilometers per hour.
As most trips happen in an urban area, we can expect the average speed to be below 50 km/h.


In [None]:
# pass the filtered speeds, not the boolean mask itself
data_distribution(df["speed"][df["speed"] <= 50])

In [None]:
lm = ols("trip_duration ~ speed", data=df).fit()
print(lm.summary())

In [None]:
lm = ols("trip_duration ~ speed", data=df[(df["speed"] >= 11.50) & (df["speed"] <= 23.15)]).fit()
print(lm.summary())

## Time

In [None]:
sns.countplot(x=df["pickup_hour"])

In [None]:
g = sns.FacetGrid(df, col="pickup_weekday")
g.map(plt.hist, "pickup_hour");

We can see here that on weekends, there are more pickups late at night (after midnight) and fewer in the morning (9 AM).

Traffic conditions should depend on time (i.e. day of week and hour).
Let's see the relationship between the speed and the time of pickup.

In [None]:
fig = plt.figure(figsize=(10,10))

x = df["pickup_time"][df["speed"] < 100]
y = df["speed"][df["speed"] < 100]

plt.scatter(x=x, y=y, alpha=0.01)

plt.xlabel("Pickup time (h.m of day)")
plt.ylabel("Average speed (km/h)")
plt.grid(alpha=0.5)
plt.show()

We get an interesting plot here. We're going to ignore the line at the bottom: those are speeds close to 0 and most of them should be anomalies.
We can see variations in the minimum speed. It increases during the night to reach a peak at around 5 AM: as the traffic gets lighter, the average speed increases.
What is funny is that, at the same time, the maximum speed decreases.
Shortly after 5 AM, the maximum speed peaks.
I think that early in the morning, people are commuting: trips start farther from downtown and more of the driving happens on roads with higher speed limits.
We can see the maximum speed increasing again in the evening.
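
The scatter plot can be complemented with a per-hour summary. The sketch below runs on a small hypothetical frame with the same column names as `df`; the median is used here as an assumption, because it is less sensitive to the anomalies discussed above.

```python
import pandas as pd

# Hypothetical trips, not taken from the dataset
trips = pd.DataFrame({
    "pickup_hour": [5, 5, 8, 8, 8, 18, 18],
    "speed": [32.0, 28.0, 12.0, 15.0, 9.0, 11.0, 13.0],
})

# Keep the same cut-off as the scatter plot, then summarise per hour
by_hour = (trips[trips["speed"] < 100]
           .groupby("pickup_hour")["speed"]
           .agg(["median", "count"]))
print(by_hour)
```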


In [None]:
fig, ax = plt.subplots(4, 2, figsize=(15, 15))

d = 0

days = {0:"Monday", 1:"Tuesday", 2:"Wednesday", 3:"Thursday", 4:"Friday",
        5:"Saturday", 6:"Sunday"}

for r in range(0, 4):
    for c in range(0, 2):
        if d > 6:
            ax[r, c].axis("off")
            break
        x = df["pickup_time"][(df["speed"] < 100) & (df["pickup_weekday"] == d)]
        y = df["speed"][(df["speed"] < 100) & (df["pickup_weekday"] == d)]

        ax[r, c].scatter(x=x, y=y, alpha=0.01)
        ax[r, c].set_title("{}".format(days[d]))
        ax[r, c].axhline(40, linewidth=1, color='r', linestyle="--", alpha=.5)
        ax[r, c].grid(alpha=0.5)
        d += 1

fig.suptitle("Observed average speeds depending on day of week and time of day")
plt.show()

A little before 5 AM, there is a drop in traffic on the first days of the week. People don't stay out too late.  
We can see that this gap disappears on weekends.

We can also compare the location of pickups and dropoffs depending on the hour of the day.
This will give us an idea of the flows of taxis.


In [None]:
# Set the limits of the map to a fixed bounding box around New York City
lat_min = 40.6
lat_max = 40.9
lon_min = -74.05
lon_max = -73.75

# Set the center of the map
cent_lat = (lat_min + lat_max) / 2
cent_lon = (lon_min + lon_max) / 2

columns = ["pickup_latitude", "pickup_longitude", "dropoff_latitude", "dropoff_longitude", "pickup_hour"]
sample = df[columns][(df["pickup_latitude"] >= lat_min) & \
                      (df["pickup_latitude"] <= lat_max) & \
                      (df["pickup_longitude"] >= lon_min) & \
                      (df["pickup_longitude"] <= lon_max) & \
                      (df["speed"] >= 10) & \
                      (df["speed"] <= 60)]


def draw_map(hour):
    fig = plt.figure(figsize=(20, 20))
    
    # plot pickups
    ax = fig.add_subplot(121)
    ax.set_title("Pickups")
    
    # map definition
    map = Basemap(llcrnrlon=lon_min, llcrnrlat=lat_min, urcrnrlon=lon_max, urcrnrlat=lat_max,
                resolution='h', projection='tmerc', lat_0 = cent_lat, lon_0 = cent_lon)

    map.drawmapboundary(fill_color='aqua')
    map.fillcontinents(color='lightgray', lake_color='aqua')
    
    lon = np.array(sample["pickup_longitude"][sample["pickup_hour"] == hour])
    lat = np.array(sample["pickup_latitude"][sample["pickup_hour"] == hour])
    x, y = map(lon, lat)
    map.plot(x, y,'bo', markersize=1, alpha=0.3)

    # plot dropoffs
    ax = fig.add_subplot(122)
    ax.set_title("Dropoffs")
    
    # map definition
    map = Basemap(llcrnrlon=lon_min, llcrnrlat=lat_min, urcrnrlon=lon_max, urcrnrlat=lat_max,
                resolution='h', projection='tmerc', lat_0 = cent_lat, lon_0 = cent_lon)

    map.drawmapboundary(fill_color='aqua')
    map.fillcontinents(color='lightgray', lake_color='aqua')
    
    lon = np.array(sample["dropoff_longitude"][sample["pickup_hour"] == hour])
    lat = np.array(sample["dropoff_latitude"][sample["pickup_hour"] == hour])
    x, y = map(lon, lat)
    map.plot(x, y,'ro', markersize=1, alpha=0.3)

    plt.show()

interact(draw_map, hour=widgets.IntSlider(min=0,max=23,step=1,value=12))

## Passengers

In [None]:
df["passenger_count"].describe()

In [None]:
sns.countplot(x=df["passenger_count"])

A large majority of the trips have only one passenger.
Surprisingly, the number of passengers can go up to 9. Is there a relation between the number of passengers and the duration of a trip?


In [None]:
lm = ols("trip_duration ~ passenger_count", data=df).fit()
print(lm.summary())

## Vendors

In [None]:
sns.countplot(x=df["vendor_id"])

In [None]:
lm = ols("trip_duration ~ vendor_id", data=df).fit()
print(lm.summary())

In [None]:
# pickup per hour per vendor
vendor1 = df["pickup_hour"][df["vendor_id"] == 1].value_counts()
vendor2 = df["pickup_hour"][df["vendor_id"] == 2].value_counts()
fig = plt.figure()
plt.scatter(x=vendor1.index, y = vendor1, color='r', alpha=.5)
plt.scatter(x=vendor2.index, y = vendor2, color='b', alpha =.5)
plt.title("Total number of pickups per hour")
plt.xlabel("hour of the day")
plt.ylabel("Number of pickups")
plt.show()

Vendor 1 has fewer records. As expected, the red dots are below the blue ones most of the time.  
The difference is smaller when there's less activity (from 2 AM to 6 AM).

The activity of both vendors follows the same trends.

## Store & Forward

I want to check whether the anomalies in the data are linked to the store and forward flag.

In [None]:
print("Store and forward flag = 0")
data_distribution(df["trip_duration"][df["store_and_fwd_flag"] == 0])

print("Store and forward flag = 1")
data_distribution(df["trip_duration"][df["store_and_fwd_flag"] == 1])

The trips with the flag on have fewer extreme duration values, but we still have some very long trips.  
Let's check with the speed.

In [None]:
print("Store and forward flag = 0")
data_distribution(df["speed"][df["store_and_fwd_flag"] == 0])

print("Store and forward flag = 1")
data_distribution(df["speed"][df["store_and_fwd_flag"] == 1])

In [None]:
df["speed"][df["store_and_fwd_flag"] == 1].describe()

Trips with the store and forward flag seem to have fewer anomalies in the data.  
This could be very important when making predictions.

## Feature correlation

I'm going to look now at the relationships between the features. 

In [None]:
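# keep numeric columns only (the id and datetime columns cannot be correlated)
# and round to the nearest integer percentage instead of truncating
corr = df.select_dtypes(include="number").corr().mul(100).round().astype(int)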
cg = sns.clustermap(data=corr, annot=True, fmt='d')
plt.setp(cg.ax_heatmap.yaxis.get_majorticklabels(), rotation=0)
plt.show()
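
The values in the heatmap are Pearson correlation coefficients (the default of `DataFrame.corr`) expressed as percentages. As a reminder of what that coefficient measures, here is a hand-written sketch on hypothetical data; the helper name `pearson` is illustrative.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


# Hypothetical pickup and dropoff hours for five trips
pickup_hour = [0, 6, 9, 17, 23]
dropoff_hour = [0, 6, 10, 17, 23]

r = pearson(pickup_hour, dropoff_hour)
print(round(r * 100))  # percentage scale, as in the heatmap
```

The coefficient only captures linear relationships, so a value close to 0 in the heatmap does not rule out a non-linear dependence, for example between the hour of the day and the speed.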

There's a high correlation between `dropoff_hour`, `pickup_hour` and `pickup_time`. There's nothing surprising here.  
We also have a strong correlation between the pickup and dropoff longitudes.  
This is not surprising either. Most trips are in Manhattan, which is longer along the North/South axis than along the East/West axis.  
The latitudes are less correlated.  
What is surprising is that there seems to be a small relationship between the vendor and the passenger count.

In [None]:
lm = ols("passenger_count ~ vendor_id", data=df).fit()
print(lm.summary())