# New York City Taxi Trip Duration - Full Exploratory Data Analysis

### Author: Thomas SELECK
### Date: 2017-20-07

The competition dataset is based on the 2016 NYC Yellow Cab trip record data made available in Big Query on Google Cloud Platform. The purpose of this competition is to predict the duration of each trip in the test set.

The main goal of this notebook is to explore the provided data to see what it looks like and what we can do with it.

In [None]:
# Load main python packages
import math
import numpy as np
import pandas as pd
pd.set_option('display.max_columns', 100)
import matplotlib.pylab as plt
%matplotlib inline
from matplotlib.pylab import rcParams
import warnings
import seaborn as sns
color = sns.color_palette()

warnings.filterwarnings("ignore")
rcParams['figure.figsize'] = 12, 8
np.random.seed(23)

## 1. About Kaggle evaluation metric

The submissions are evaluated using the RMSLE (Root Mean Squared Logarithmic Error) defined as follows:
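$$RMSLE = \sqrt{\frac{1}{n}\sum\limits_{i=1}^n \left(\log(p_i + 1) - \log(a_i + 1)\right)^2}$$

where $n$ is the number of trips, $p_i$ is the predicted duration and $a_i$ is the actual duration of trip $i$.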
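
As a minimal sketch of the metric above, the function below computes the RMSLE with NumPy. The name `rmsle` and the sample durations are illustrative assumptions, not part of the competition code; `np.log1p(x)` computes $\log(x + 1)$.

```python
import numpy as np

def rmsle(predicted, actual):
    """Root Mean Squared Logarithmic Error (hypothetical helper)."""
    predicted = np.asarray(predicted, dtype=float)
    actual = np.asarray(actual, dtype=float)
    # log1p(x) = log(x + 1), matching the formula above
    log_diff = np.log1p(predicted) - np.log1p(actual)
    return np.sqrt(np.mean(log_diff ** 2))

# Illustrative durations in seconds (made-up values)
example_actual = [600, 1200, 300]
example_perfect = rmsle(example_actual, example_actual)  # identical inputs
example_score = rmsle([660, 1000, 300], example_actual)  # imperfect predictions
```

Because the metric compares logarithms, it penalizes relative errors rather than absolute ones: being 60 seconds off on a 10-minute trip costs more than being 60 seconds off on a one-hour trip.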

## 2. About Kaggle submissions

For each trip, we must predict the trip duration.
 
The submission file header's format is: "id,trip_duration".
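
To make the expected format concrete, here is a small sketch that builds a two-row submission with pandas. The ids and durations are made-up placeholders, and the CSV is written to an in-memory buffer; passing a file path to `to_csv` would write it to disk instead.

```python
import io
import pandas as pd

# Hypothetical ids and predicted durations in seconds (made-up values)
example_submission_df = pd.DataFrame({
    "id": ["id0000001", "id0000002"],
    "trip_duration": [455, 1280],
})

# index=False keeps the DataFrame index out of the file,
# so the header is exactly "id,trip_duration"
buffer = io.StringIO()
example_submission_df.to_csv(buffer, index=False)
submission_text = buffer.getvalue()
```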

## 3. Loading the data and first exploration

Let's begin this notebook by loading the data and summarizing it.

In [None]:
# Load the data; specify date columns for automatic parsing
trainingSet_df = pd.read_csv("../input/train.csv", parse_dates = ["pickup_datetime", "dropoff_datetime"])
testingSet_df = pd.read_csv("../input/test.csv", parse_dates = ["pickup_datetime"])

# Extract target
target_sr = trainingSet_df["trip_duration"]
trainingSet_df.drop("trip_duration", axis = 1, inplace = True)

# As the column 'dropoff_datetime' is not included in the testing set (to avoid leakage), we remove it from training set
# to avoid overfitting
trainingSet_df.drop("dropoff_datetime", axis = 1, inplace = True)

In [None]:
# Print the features list
trainingSet_df.info()

In this dataset, we only have 8 features and a row id. So, for each trip we have the following data:

 - Provider associated with the trip
 - Date and time of the beginning of the trip
 - Number of passengers in the vehicle
 - GPS coordinates of the beginning and the end of the trip
 - A flag indicating whether the trip record was stored in the car's memory or streamed in real time to the server
 
We'll use the 'id' column as the DataFrame index.

In [None]:
trainingSet_df.index = trainingSet_df["id"].values
target_sr.index = trainingSet_df["id"].values
trainingSet_df.drop("id", axis = 1, inplace = True)
testingSet_df.index = testingSet_df["id"].values
testingSet_df.drop("id", axis = 1, inplace = True)

### 3.1. Target exploration

Here, we'll look at the target and see its distribution and if outliers are present.

In [None]:
sns.distplot(np.log10(target_sr), kde = False)
# The plotted values are log10(duration), so the labels state the log scale
plt.title("Histogram of the trip duration (log10 scale)")
plt.xlabel("log10 of trip duration (seconds)")
plt.ylabel("Count")

The previous histogram shows that the majority of trips last between 100 and 10,000 seconds.

Some trips last less than ten seconds. How is that possible? These are probably outliers.
There is also a bin around 100,000 seconds, which equals about 28 hours. That is a very long trip.
Trips lasting around $10^7$ seconds, which equals about 115 days, are almost certainly outliers.

Let's see how many trips last less than 100 seconds or more than 10,000 seconds.

In [None]:
outliersCount = target_sr.loc[(target_sr < 100) | (target_sr > 10000)].shape[0]
print("Outliers count:", outliersCount)
print("Percentage of outliers:", (outliersCount / target_sr.shape[0]) * 100, "%")

As outliers make up only about 1% of the trips, we'll drop them when we build a predictive model, so that it doesn't fit these extreme values.
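
A minimal sketch of that filtering step is shown below, assuming the same 100 to 10,000 second bounds. The helper name `drop_duration_outliers` and the toy data are hypothetical; in the notebook, the inputs would be `trainingSet_df` and `target_sr`, which share the same index.

```python
import pandas as pd

def drop_duration_outliers(features_df, target, low=100, high=10000):
    """Keep only rows whose duration lies in [low, high] seconds (hypothetical helper)."""
    keep_mask = (target >= low) & (target <= high)
    # The boolean mask is aligned on the shared index, so features and target stay in sync
    return features_df.loc[keep_mask], target.loc[keep_mask]

# Toy data standing in for trainingSet_df and target_sr (made-up values)
toy_features_df = pd.DataFrame({"passenger_count": [1, 2, 1, 3]}, index=["a", "b", "c", "d"])
toy_target_sr = pd.Series([5, 600, 86000, 1500], index=["a", "b", "c", "d"], name="trip_duration")

clean_features_df, clean_target_sr = drop_duration_outliers(toy_features_df, toy_target_sr)
```

Only the training set can be filtered this way, since the testing set has no target and every test trip needs a prediction.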

### 3.2. Features exploration

Now, let's look more closely at the remaining features.

In [None]:
trainingSet_df["vendor_id"].value_counts()

As the 'vendor_id' feature only has two levels, we'll look at the trip duration for each vendor_id.

In [None]:
plot_df = pd.concat([trainingSet_df["vendor_id"], target_sr], axis = 1)
median = plot_df.groupby("vendor_id")["trip_duration"].median()
sns.boxplot(x = "vendor_id", y = "trip_duration", data = plot_df, order = median.sort_values().index)
plt.title("Distribution of target values depending on the vendor id")

On the previous plot, we can see that only the first vendor has trips lasting more than 500,000 seconds. Let's remove the outliers to get a more meaningful plot.

In [None]:
plot_df = pd.concat([trainingSet_df["vendor_id"], target_sr], axis = 1)
plot_df = plot_df.loc[(plot_df["trip_duration"] >= 100) & (plot_df["trip_duration"] <= 10000)] # Remove outliers values
median = plot_df.groupby("vendor_id")["trip_duration"].median()
sns.boxplot(x = "vendor_id", y = "trip_duration", data = plot_df, order = median.sort_values().index)
plt.title("Distribution of target values depending on the vendor id; outliers removed")

The 'vendor_id' feature is not a good predictor of the trip duration. Now, let's look at the 'store_and_fwd_flag' flag.

In [None]:
trainingSet_df["store_and_fwd_flag"].value_counts()

We can see that the majority of trip records are streamed directly to the server. Let's see how this flag relates to the trip duration and the vendor id.

In [None]:
plot_df = pd.concat([trainingSet_df["store_and_fwd_flag"], target_sr], axis = 1)
plot_df = plot_df.loc[(plot_df["trip_duration"] >= 100) & (plot_df["trip_duration"] <= 10000)] # Remove outliers values
median = plot_df.groupby("store_and_fwd_flag")["trip_duration"].median()
sns.boxplot(x = "store_and_fwd_flag", y = "trip_duration", data = plot_df, order = median.sort_values().index)
plt.title("Distribution of target values depending on the store and forward flag")

On average, trips where the car doesn't have a direct connection to the server last longer and the number of outliers is smaller.

In [None]:
trainingSet_df.groupby(["vendor_id", "store_and_fwd_flag"]).size()

Interestingly, only the first vendor has cars without a direct server connection. Now, let's look at the passenger count.

In [None]:
sns.distplot(trainingSet_df["passenger_count"], kde = False)
plt.title("Histogram of the number of passengers in each vehicle")
plt.xlabel("Number of passengers")
plt.ylabel("Count")

Most trips have only one passenger, but some have more than six. Are they outliers? Can a New York City taxi carry 10 people (one driver and nine passengers)?

We can also notice that some trips do not have any passengers. Maybe these trips are recorded when the driver goes back to the vendor's parking lot at the end of the day without taking any client.

Now, let's see how the number of passengers influences the vendor id and the trip duration.

In [None]:
plot_df = pd.concat([trainingSet_df["passenger_count"], target_sr], axis = 1)
plot_df = plot_df.loc[(plot_df["trip_duration"] >= 100) & (plot_df["trip_duration"] <= 10000)] # Remove outliers values
median = plot_df.groupby("passenger_count")["trip_duration"].median()
sns.boxplot(x = "passenger_count", y = "trip_duration", data = plot_df, order = median.sort_values().index)
plt.title("Distribution of target values depending on the number of passengers inside the car")

As we can see, there are only two trips with either 8 or 9 passengers. So these values are outliers. Then, we can see that the number of passengers doesn't influence trip duration. There is an exception when the cab doesn't contain any passenger. In this case, the trip duration is greater.

In [None]:
trainingSet_df.groupby(["vendor_id", "passenger_count"]).size()

We can see that trips involving more than six passengers are only present in vendor two and are very rare. We can consider them as outliers.

Now, let's look at the pickup datetime.

In [None]:
trainingSet_df["pickup_datetime"].hist(bins = 100)

In [None]:
print("Training set time range:", trainingSet_df["pickup_datetime"].min(), "to", trainingSet_df["pickup_datetime"].max())
print("Testing set time range:", testingSet_df["pickup_datetime"].min(), "to", testingSet_df["pickup_datetime"].max())

We can see that the trips span from January 1st to June 30th, 2016. We can also see a cyclic pattern in the data.

The training set and the testing set have the same timeframe, so the split between train and test set is random and not time-based.

Now, let's see how the weekday and the month influence the number of trips and their duration.

In [None]:
trainingSet_df["weekday"] = trainingSet_df["pickup_datetime"].dt.weekday
numberOfTrips_sr = trainingSet_df.groupby("weekday").size()
numberOfTrips_sr.index = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]
numberOfTrips_sr.plot.bar()
plt.title("Number of trips for each day of the week")

In [None]:
trainingSet_df["month"] = trainingSet_df["pickup_datetime"].dt.month
numberOfTrips_sr = trainingSet_df.groupby("month").size()
numberOfTrips_sr.index = ["Jan", "Feb", "Mar", "Apr", "May", "Jun"]
numberOfTrips_sr.plot.bar()
plt.title("Number of trips for each month")

The day of the week and the month don't influence the number of trips much. Maybe it will be different for the trip duration.

In [None]:
plot_df = pd.concat([trainingSet_df["pickup_datetime"], target_sr], axis = 1)
plot_df = plot_df.loc[(plot_df["trip_duration"] >= 100) & (plot_df["trip_duration"] <= 10000)] # Remove outliers values
plot_df["weekday"] = plot_df["pickup_datetime"].dt.weekday
numberOfTrips_sr = plot_df.groupby("weekday")["trip_duration"].mean()
numberOfTrips_sr.index = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]
numberOfTrips_sr.plot.bar()
plt.title("Mean duration of trips for each day of the week")

Saturday and Sunday trips are shorter than trips on other days. Trips are longest on Thursdays.

Now, let's look at trip pickup and dropoff locations. These features are important, because with the locations we can compute the distance between the two places and estimate each trip duration more accurately.

In [None]:
fig = plt.figure()
ax_1 = fig.add_subplot(221)
trainingSet_df["pickup_latitude"].hist(bins = 100, ax = ax_1)
ax_1.set_title("Distribution of pickup latitude")

ax_2 = fig.add_subplot(222)
trainingSet_df["pickup_longitude"].hist(bins = 100, ax = ax_2)
ax_2.set_title("Distribution of pickup longitude")

ax_3 = fig.add_subplot(223)
trainingSet_df["dropoff_latitude"].hist(bins = 100, ax = ax_3)
ax_3.set_title("Distribution of dropoff latitude")

ax_4 = fig.add_subplot(224)
trainingSet_df["dropoff_longitude"].hist(bins = 100, ax = ax_4)
ax_4.set_title("Distribution of dropoff longitude")

As we can see, each coordinate has some outliers, so we'll replace them with the most frequent value.

In [None]:
# Valid range and replacement value for each coordinate type.
# The latitude range is the one used originally; the longitude range is an
# assumed box around New York City (the original reused the latitude bounds,
# which would overwrite every longitude with -74).
coordinateBounds = {"latitude": (40.5, 41, 40.8), "longitude": (-74.5, -73.5, -74)}

for df in [trainingSet_df, testingSet_df]:
    for prefix in ["pickup", "dropoff"]:
        for axis, (lower, upper, replacement) in coordinateBounds.items():
            column = prefix + "_" + axis
            # A single .loc call avoids chained assignment on a possible copy
            df.loc[(df[column] < lower) | (df[column] > upper), column] = replacement

In [None]:
fig = plt.figure()
ax_1 = fig.add_subplot(221)
trainingSet_df["pickup_latitude"].hist(bins = 100, ax = ax_1)
ax_1.set_title("Distribution of pickup latitude")

ax_2 = fig.add_subplot(222)
trainingSet_df["pickup_longitude"].hist(bins = 100, ax = ax_2)
ax_2.set_title("Distribution of pickup longitude")

ax_3 = fig.add_subplot(223)
trainingSet_df["dropoff_latitude"].hist(bins = 100, ax = ax_3)
ax_3.set_title("Distribution of dropoff latitude")

ax_4 = fig.add_subplot(224)
trainingSet_df["dropoff_longitude"].hist(bins = 100, ax = ax_4)
ax_4.set_title("Distribution of dropoff longitude")

Now, we'll compute the distance of each trip using the Haversine formula defined by:
$$a = \sin^2\left(\frac{\Delta\phi}{2}\right) + \cos(\phi_1) \cdot \cos(\phi_2) \cdot \sin^2\left(\frac{\Delta\lambda}{2}\right)$$
$$c = 2 \cdot \operatorname{atan2}\left(\sqrt{a}, \sqrt{1 - a}\right)$$
$$distance = R \cdot c$$

where:

 - $\phi$ is the latitude
 - $\lambda$ is the longitude
 - $R$ is the Earth's radius (mean radius = 6,371 km)

Note: angles need to be in radians to pass to trig functions!

In [None]:
def haversineDistance(x):
    R = 6371e3 # Earth's radius in meters
    origLat = x["pickup_latitude"]
    origLong = x["pickup_longitude"]
    destLat = x["dropoff_latitude"]
    destLong = x["dropoff_longitude"]
    
    phi_1 = math.radians(origLat)
    phi_2 = math.radians(destLat)
    deltaPhi = math.radians(destLat - origLat)
    deltaLambda = math.radians(destLong - origLong)

    a = math.sin(deltaPhi / 2) * math.sin(deltaPhi / 2) + math.cos(phi_1) * math.cos(phi_2) * math.sin(deltaLambda / 2) * math.sin(deltaLambda / 2)
    c = 2 * math.atan2(math.sqrt(a), math.sqrt(1 - a))
    return R * c

trainingSet_df["distance"] = trainingSet_df[["pickup_latitude", "pickup_longitude", "dropoff_latitude", "dropoff_longitude"]].apply(haversineDistance, axis = 1)
testingSet_df["distance"] = testingSet_df[["pickup_latitude", "pickup_longitude", "dropoff_latitude", "dropoff_longitude"]].apply(haversineDistance, axis = 1)
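
Applying a Python function row by row can be slow on large DataFrames. As a hedged alternative, the sketch below computes the same Haversine formula on whole arrays with NumPy. The function name `haversine_vectorized` and the sample coordinates are assumptions made for illustration.

```python
import numpy as np

def haversine_vectorized(lat1, lon1, lat2, lon2, radius=6371e3):
    """Great-circle distance in meters for arrays of coordinates in degrees (hypothetical helper)."""
    phi_1, phi_2 = np.radians(lat1), np.radians(lat2)
    delta_phi = np.radians(np.asarray(lat2) - np.asarray(lat1))
    delta_lambda = np.radians(np.asarray(lon2) - np.asarray(lon1))
    a = np.sin(delta_phi / 2) ** 2 + np.cos(phi_1) * np.cos(phi_2) * np.sin(delta_lambda / 2) ** 2
    c = 2 * np.arctan2(np.sqrt(a), np.sqrt(1 - a))
    return radius * c

# Two made-up trips: the first has identical pickup and dropoff points,
# the second spans 0.1 degree of latitude along the same meridian
example_distances = haversine_vectorized(
    np.array([40.75, 40.70]), np.array([-73.99, -74.00]),
    np.array([40.75, 40.80]), np.array([-73.99, -74.00]),
)
```

The function also accepts pandas columns, so the four coordinate columns of `trainingSet_df` could be passed directly in place of the arrays.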

Now, let's look at the distribution of the distances.

In [None]:
sns.distplot(trainingSet_df["distance"], kde = False)
plt.title("Histogram of the trip distances")
plt.xlabel("Trip distance (meters)")
plt.ylabel("Count")

As we can see, the majority of trips are short (less than 5 kilometers). However, a lot of trips have a zero distance, which is strange.
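
One way to investigate the zero-distance trips is to count them and summarize their durations. The sketch below does this on a toy DataFrame with made-up values; in the notebook, the `distance` column of `trainingSet_df` and the `target_sr` Series would take its place.

```python
import pandas as pd

# Toy data standing in for the distance feature and the target (made-up values)
toy_df = pd.DataFrame({
    "distance": [0.0, 1500.0, 0.0, 3200.0],
    "trip_duration": [12, 480, 950, 900],
})

# Select trips whose pickup and dropoff coordinates are identical
zero_distance_mask = toy_df["distance"] == 0
zero_distance_count = int(zero_distance_mask.sum())

# Summary statistics of the durations of those trips
zero_distance_durations = toy_df.loc[zero_distance_mask, "trip_duration"].describe()
```

If the durations of these trips turn out to be far from zero, the straight-line distance alone cannot explain them, which would be worth keeping in mind when using this feature.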

Finally, let's look at how the variables are correlated to the target.

In [None]:
plot_df = pd.concat([trainingSet_df, target_sr], axis = 1)
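# Keep only numeric columns: recent pandas versions raise an error when
# corr() meets string or datetime columns
sns.heatmap(plot_df.select_dtypes(include = [np.number]).corr(), annot = True, square = True)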

If you have any suggestions to improve this notebook, please post a comment on it.
Don't forget to upvote it if you liked it.