# EDA (NYC taxi trip duration)

Exploring ride durations of taxi trips in New York City.

## Imports
Import libraries and datasets.

In [1]:
# Import libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import model_selection, preprocessing
from haversine import haversine

%matplotlib inline
%config InlineBackend.figure_format = 'retina'

In [2]:
# Import datasets
train = pd.read_csv('../input/train.csv')
test = pd.read_csv('../input/test.csv')

In [3]:
print("Shape of training set: ", train.shape)
print("Shape of test set: ", test.shape)

The training set has **11** attributes, but the test set only has **9**. Let's see which attributes are missing from the test set.

In [4]:
print("Columns in training set:\n", train.columns.values)
print('\n')
print("Columns in test set:\n", test.columns.values)

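Looks like the test set doesn't have **dropoff_datetime** and **trip_duration**. That makes sense if trip_duration is the value to predict, since dropoff_datetime combined with pickup_datetime would give it away.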
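A set difference makes this comparison explicit instead of reading two printed lists by eye. The sketch below uses two tiny stand-in frames (`train_demo` and `test_demo` are hypothetical names with only a few of the real columns), so it runs without the CSV files.

```python
import pandas as pd

# Hypothetical stand-ins for train and test, keeping only a few column names.
train_demo = pd.DataFrame(columns=['id', 'pickup_datetime', 'dropoff_datetime', 'trip_duration'])
test_demo = pd.DataFrame(columns=['id', 'pickup_datetime'])

# Columns present in the training frame but absent from the test frame.
missing_from_test = sorted(set(train_demo.columns) - set(test_demo.columns))
print(missing_from_test)
```

The same expression should work on the real `train` and `test` frames.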

In [5]:
train.head()

In [6]:
train.info()

In [7]:
train.describe()

In [8]:
# Check for null values
train.isnull().sum()

The training set contains no null values, so we're good to go!

## Data wrangling

In [9]:
# Convert timestamps from strings to datetime objects
train['pickup_datetime'] = pd.to_datetime(train['pickup_datetime'])
train['dropoff_datetime'] = pd.to_datetime(train['dropoff_datetime'])
test['pickup_datetime'] = pd.to_datetime(test['pickup_datetime'])

In [10]:
# Check values for store_and_fwd_flag
train['store_and_fwd_flag'].unique()

In [11]:
# Convert store_and_fwd_flag values to boolean values ('Y' -> True, anything else -> False)
train['store_and_fwd_flag'] = train['store_and_fwd_flag'] == 'Y'
test['store_and_fwd_flag'] = test['store_and_fwd_flag'] == 'Y'

## Feature engineering
We will create the following columns:

1. pickup_hour
2. dropoff_hour (only for the training set)
3. pickup_day
4. pickup_location
5. dropoff_location
6. distance_km (great-circle distance between co-ordinates)

In [12]:
# Create pickup_hour and dropoff_hour attributes
train['pickup_hour'] = train['pickup_datetime'].dt.hour
test['pickup_hour'] = test['pickup_datetime'].dt.hour
train['dropoff_hour'] = train['dropoff_datetime'].dt.hour

# Create pickup_day attribute (Monday : 0, Sunday : 6)
train['pickup_day'] = train['pickup_datetime'].dt.dayofweek
test['pickup_day'] = test['pickup_datetime'].dt.dayofweek
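To see what the `.dt` accessors return, here is a minimal sketch on two made-up timestamps (the values are assumptions for illustration, not rows quoted from the dataset). It shows the hour as an integer from 0 to 23 and the day of the week with Monday as 0.

```python
import pandas as pd

# Two example timestamps (assumed values for illustration).
pickups = pd.Series(pd.to_datetime(['2016-03-14 17:24:55', '2016-06-12 00:43:35']))

hours = pickups.dt.hour        # hour of day, 0-23
days = pickups.dt.dayofweek    # Monday = 0, Sunday = 6
print(hours.tolist(), days.tolist())
```

2016-03-14 was a Monday and 2016-06-12 was a Sunday, so the day values come out as 0 and 6.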

In [13]:
# Create pickup_location and dropoff_location attributes
train['pickup_location'] = train[['pickup_latitude', 'pickup_longitude']].apply(tuple, axis=1)
train['dropoff_location'] = train[['dropoff_latitude', 'dropoff_longitude']].apply(tuple, axis=1)

test['pickup_location'] = test[['pickup_latitude', 'pickup_longitude']].apply(tuple, axis=1)
test['dropoff_location'] = test[['dropoff_latitude', 'dropoff_longitude']].apply(tuple, axis=1)

In [14]:
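# Create great-circle distance between co-ordinates (in km) attribute
# Each row is passed to the lambda; haversine expects (lat, lon) tuples
train['distance_km'] = train.apply(lambda row: haversine(row['pickup_location'], row['dropoff_location']), axis=1)
test['distance_km'] = test.apply(lambda row: haversine(row['pickup_location'], row['dropoff_location']), axis=1)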
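For reference, the great-circle (haversine) formula can be sketched directly in NumPy. This is not the `haversine` package's own code, just an illustration of the formula; `haversine_km` and `EARTH_RADIUS_KM` are hypothetical names, and the radius value is an assumed mean Earth radius.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0088  # assumed mean Earth radius in km

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between points given in decimal degrees."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    # Haversine of the central angle between the two points.
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))

# A quarter of the way around the equator.
quarter = haversine_km(0.0, 0.0, 0.0, 90.0)
same_point = haversine_km(40.7, -74.0, 40.7, -74.0)
print(quarter, same_point)
```

Because it only uses NumPy operations, a function like this accepts whole columns at once, which may be noticeably faster than a row-wise `apply` on a large frame.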

In [15]:
train.head()

## Exploratory data analysis

Let's build a scatterplot of trip_duration values.

In [16]:
# Plot all trip durations
plt.scatter(train.index, train['trip_duration'].sort_values(ascending=True))
plt.xlabel('Trip')
plt.ylabel('Trip duration')

Whoa! Looks like we have some outliers. Let's learn more about them first.

In [17]:
# Get the 99% quantile of trip_duration
quantile_trip_duration = train['trip_duration'].quantile(0.99)
shortest_trip_higher_than_quantile = train['trip_duration'][train['trip_duration'] > quantile_trip_duration].min()
print("Shortest trip duration higher than the 99% quantile: ", shortest_trip_higher_than_quantile, "seconds")

In [18]:
# Filter by quantile
train = train[train['trip_duration'] < quantile_trip_duration]
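The effect of a quantile cutoff is easier to see on a small example. The sketch below uses made-up durations (99 ordinary values plus one extreme one; `durations`, `cutoff` and `kept` are hypothetical names) and applies the same filter as above.

```python
import pandas as pd

# Toy durations in seconds: 99 ordinary trips plus one extreme value.
durations = pd.Series(list(range(100, 1090, 10)) + [2_000_000])

# 99% quantile, interpolated between the two largest values by default.
cutoff = durations.quantile(0.99)

# Keep only trips strictly below the cutoff.
kept = durations[durations < cutoff]
print(len(durations), len(kept), kept.max())
```

Note that the cutoff itself is pulled upward by the extreme value, but the filter still removes only that one trip.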

Now that we've removed some outliers, let's check the graph again.

In [19]:
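# Plot all trip durations
# After filtering, train.index has gaps, so plot against the rank of each trip instead
trip_durations_sorted = train['trip_duration'].sort_values(ascending=True)
plt.scatter(np.arange(len(trip_durations_sorted)), trip_durations_sorted)
plt.xlabel('Trip')
plt.ylabel('Trip duration')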

Now, let's take a look at how trip_duration values are distributed.

In [20]:
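# Plot histogram of trip_duration values
# sns.distplot is deprecated; histplot with a KDE overlay needs seaborn >= 0.11
plt.figure(figsize=(12, 8))
sns.histplot(train['trip_duration'], kde=True, stat='density')
plt.xlabel('Trip duration')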

The distribution is skewed to the right: most trips are short, with a long tail of longer ones. Taking the logarithm of trip_duration should bring it closer to a normal distribution.

In [21]:
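# Plot histogram of log(trip_duration) values
# sns.distplot is deprecated; histplot with a KDE overlay needs seaborn >= 0.11
plt.figure(figsize=(12, 8))
sns.histplot(np.log(train['trip_duration'].values), kde=True, stat='density')
plt.xlabel('Logarithm of trip duration')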
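Skewness can also be checked numerically rather than by eye. The sketch below uses synthetic log-normal values as a stand-in for right-skewed durations (this is not the taxi data, and the `mean` and `sigma` parameters are arbitrary assumptions) and compares the sample skewness before and after the log transform.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic right-skewed "durations" (log-normal); not the taxi data.
raw = pd.Series(rng.lognormal(mean=6.5, sigma=0.8, size=10_000))
logged = np.log(raw)

# Sample skewness: near 0 for a symmetric distribution, positive for a right tail.
raw_skew = raw.skew()
log_skew = logged.skew()
print(f"skew before: {raw_skew:.2f}, after log: {log_skew:.2f}")
```

`np.log` requires strictly positive inputs; if zero-length trips were present, `np.log1p` would be one possible alternative.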

Next, let's plot some countplots on our engineered features.

In [22]:
# Number of trips for each hour of the day
plt.figure(figsize=(12, 4))
sns.countplot(x='pickup_hour', data=train, color='#4C72B0')
plt.xlabel('Pickup hour')
plt.ylabel('Trip count')

Looks like trip counts peak in the evening, between 6:00 PM and 10:00 PM.

In contrast, the number of trips decreases gradually after midnight until around 6:00 AM.

In [23]:
# Number of trips for each day of the week
plt.figure(figsize=(8, 4))
sns.countplot(x='pickup_day', data=train, color='#4C72B0')
plt.xlabel('Pickup day (Monday = 0)')
plt.ylabel('Trip count')

We can see that the number of taxi trips increases as the weekend approaches, with the most taxi trips occurring on Fridays and Saturdays. Understandable.
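A reading like this can be double-checked by tabulating the counts behind the plot. The sketch below runs on a small made-up frame (`demo` and its values are hypothetical); on the real data, the same two calls on `train['pickup_day']` should give the exact numbers.

```python
import pandas as pd

# Hypothetical pickup_day values (Monday = 0, Sunday = 6).
demo = pd.DataFrame({'pickup_day': [0, 4, 4, 5, 5, 5, 6]})

# Trips per day, ordered by day of the week rather than by count.
trips_per_day = demo['pickup_day'].value_counts().sort_index()

# Day with the most trips.
busiest_day = trips_per_day.idxmax()
print(trips_per_day.to_dict(), busiest_day)
```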