I wanted to create a comprehensive tutorial kernel that is accessible to everyone, especially those who are new to Data Science, by keeping the narrative close to my thought process. Thanks to Kaggle for this great initiative.

My goals are:

* **Creating a simple yet deep analysis of the data sets by incorporating features one by one to the analysis.** 
* **Performing proper feature engineering**
* **Building predictive models and ensembles**

First, let's import some generic notebook stuff and pay tribute to the Great Taxi Driver :D

In [2]:
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"  # for a better interactive experience

%config InlineBackend.figure_format = 'retina'  # for high resolution plots

![image](http://4.bp.blogspot.com/-ti7Cxqon1cg/TeQ4q9TKRBI/AAAAAAAAAzw/pJZpoOmh7dY/s1600/taxi_driver.jpg)

In [3]:
import warnings
warnings.filterwarnings("ignore")

import numpy as np
import pandas as pd
pd.options.display.max_columns = 100

import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')
%matplotlib inline
import seaborn as sns
sns.set_style('ticks')

from subprocess import check_output
print(check_output(["ls", "../input"]).decode("utf8"))

# Exploratory Data Analysis

We have data sets with the following features

* **id** - a unique identifier for each trip
* **vendor_id** - a code indicating the provider associated with the trip record
* **pickup_datetime** - date and time when the meter was engaged
* **dropoff_datetime** - date and time when the meter was disengaged
* **passenger_count** - the number of passengers in the vehicle (driver entered value)
* **pickup_longitude** - the longitude where the meter was engaged
* **pickup_latitude** - the latitude where the meter was engaged
* **dropoff_longitude** - the longitude where the meter was disengaged
* **dropoff_latitude** - the latitude where the meter was disengaged
* **store_and_fwd_flag** - This flag indicates whether the trip record was held in vehicle memory before sending to the vendor because the vehicle did not have a connection to the server - Y=store and forward; N=not a store and forward trip
* **trip_duration** - duration of the trip in seconds

Let's start by reading the CSV data. Two things to note:

1. Specifying `parse_dates` and `infer_datetime_format` speeds up the parsing of the date columns

2. Using the `category` type generally lowers memory usage, so we change the types of `vendor_id` and `store_and_fwd_flag` to `category` as follows

In [4]:
%%time
train = pd.read_csv("../input/train.csv", index_col="id", engine='python',
                    parse_dates=['pickup_datetime', 'dropoff_datetime'],
                    infer_datetime_format=True)

test = pd.read_csv("../input/test.csv", index_col="id", engine='python',
                   parse_dates=['pickup_datetime'],
                   infer_datetime_format=True)

In [5]:
# before changing the types, make sure train and test have the same categories. Otherwise,
# we'll get into trouble at test time!
# (ndarray.sort() sorts in place and returns None, so we compare sets of unique values instead)
print(f"train and test vendor_id unique values are the same: \
        {set(train.vendor_id.unique()) == set(test.vendor_id.unique())}")
print(f"train and test store_and_fwd_flag unique values are the same: \
        {set(train.store_and_fwd_flag.unique()) == set(test.store_and_fwd_flag.unique())}")

train['vendor_id'] = train['vendor_id'].astype('category')
train['store_and_fwd_flag'] = train['store_and_fwd_flag'].astype('category')

test['vendor_id'] = test['vendor_id'].astype('category')
test['store_and_fwd_flag'] = test['store_and_fwd_flag'].astype('category')
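
As a side note, here is a minimal sketch of the memory effect of `category` and of one way to guarantee identical categories in both frames. It runs on synthetic data, and the names `toy_train`, `toy_test` and `flag_dtype` are made up for illustration. Sharing one `CategoricalDtype` is an alternative to the check above, not what this kernel does.

```python
import numpy as np
import pandas as pd
from pandas.api.types import CategoricalDtype

# Synthetic stand-ins for the real column (hypothetical data, not the Kaggle files).
rng = np.random.default_rng(0)
toy_train = pd.DataFrame({"store_and_fwd_flag": rng.choice(["N", "Y"], size=100_000)})
toy_test = pd.DataFrame({"store_and_fwd_flag": rng.choice(["N", "Y"], size=50_000)})

# One dtype built from the union of observed values and applied to both frames,
# so the integer codes behind the categories mean the same thing in train and test.
levels = sorted(set(toy_train["store_and_fwd_flag"]) | set(toy_test["store_and_fwd_flag"]))
flag_dtype = CategoricalDtype(categories=levels)

bytes_before = toy_train["store_and_fwd_flag"].memory_usage(deep=True)
toy_train["store_and_fwd_flag"] = toy_train["store_and_fwd_flag"].astype(flag_dtype)
toy_test["store_and_fwd_flag"] = toy_test["store_and_fwd_flag"].astype(flag_dtype)
bytes_after = toy_train["store_and_fwd_flag"].memory_usage(deep=True)

print(f"string column: {bytes_before:,} bytes, category column: {bytes_after:,} bytes")
```

The saving comes from storing one small integer code per row instead of one string per row.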

**Python 3.6** comes with [f-strings](https://docs.python.org/3/whatsnew/3.6.html#whatsnew36-pep498), so I'll use this feature heavily throughout.

Below is a preliminary view of our datasets. Note that `train` and `test` have no missing values.

In [6]:
print(f"train shape: {train.shape}")
print("=======================================================")
print(f"test shape: {test.shape}")
print("=======================================================")
print("train view:")
train.head()
print("test view:")
test.head()
print("train/test columns difference:")
train.columns.difference(test.columns)
print("=======================================================")
print("train info:")
train.info()
print("=======================================================")
print("test info:")
test.info()
print("=======================================================")
print(f"train feature descriptions:")
train.describe()
print(f"test feature descriptions:")
test.describe()
print("=======================================================")
print(f"train/test overlapping index: {train.index.intersection(test.index)}")
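
A more direct way to check for missing values is to count them per column. The sketch below uses a small made-up frame (`toy` is a hypothetical name) with one deliberately missing entry. On the real data, the same two calls could be applied to `train` and `test`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one deliberately missing value.
toy = pd.DataFrame({
    "passenger_count": [1, 2, np.nan, 1],
    "trip_duration": [455, 663, 2124, 429],
})

missing_per_column = toy.isna().sum()   # number of NaN entries in each column
print(missing_per_column)
print(f"any missing values: {toy.isna().any().any()}")
```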

In [7]:
# Do NOT run! it'll take a long time
#%%time
#train_uniq = train.apply(lambda s: s.nunique())
#test_uniq = test.apply(lambda s: s.nunique())

#print(f"Number of unique values per column in train: \n {pd.DataFrame(train_uniq).T}")
#print(f"Number of unique values per column in test: \n {pd.DataFrame(test_uniq).T}")

Some simple statistics of the target variable `trip_duration` suggest that there could be outliers in our data. So for now, I remove the data points whose `trip_duration` is below the 1% quantile or above the 99% quantile, using pandas `query` with f-string interpolation, and plot the distribution of the resulting dataset.

In [8]:
target = train.trip_duration

print(f"trip_duration \n \
      min: {target.min()} \n \
      max: {target.max()} \n \
      mode: {target.mode()[0]} \n \
      mean: {target.mean()} \n \
      median: {target.median()} \n \
      1% quantile: {target.quantile(q=0.01)} \n \
      99% quantile: {target.quantile(q=0.99)}")

In [9]:
train_pure = train.query(f"{target.quantile(q=0.01)} <= \
                            trip_duration <= {target.quantile(q=0.99)}")
target_pure = train_pure.trip_duration 
sns.distplot(target_pure)
plt.xlabel("trip duration")
plt.title("Sample distribution of trip duration");
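
The same quantile trimming can be sketched on synthetic data with `Series.between`, which, like the chained comparison in `query`, keeps both bounds. The durations below are made up for illustration, including the extreme values planted at the start.

```python
import numpy as np
import pandas as pd

# Hypothetical durations in seconds: mostly ordinary trips plus a few extreme records.
rng = np.random.default_rng(42)
durations = pd.Series(rng.lognormal(mean=6.5, sigma=0.7, size=10_000), name="trip_duration")
durations.iloc[:5] = [1, 2, 86_000, 1_900_000, 3_500_000]

# 1% and 99% sample quantiles, then keep only the rows inside [low, high].
low, high = durations.quantile([0.01, 0.99])
trimmed = durations[durations.between(low, high)]

print(f"kept {len(trimmed)} of {len(durations)} rows")
print(f"max before: {durations.max():.0f}, max after: {trimmed.max():.0f}")
```

Note that this trims a fixed share of the rows whether or not they are genuine outliers, so it is a pragmatic first pass rather than an outlier detector.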

Is `trip_duration` distribution likely to be log-normal? 

In [10]:
sns.distplot(np.log(target_pure))
plt.xlabel("log trip duration")
plt.title("Sample distribution of logarithm of trip duration");

Let's use different normality tests to find the p-values

In [11]:
from scipy.stats import shapiro, anderson, kstest, normaltest

target_pure_log = np.log(target_pure)  # target values are already positive

print(f"p-value for Shapiro-Wilk test: {shapiro(target_pure_log)[1]}")

print(f"A 2-sided chi squared probability for the hypothesis test: \
        {normaltest(target_pure_log).pvalue}")

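# kstest with cdf='norm' compares against the standard normal N(0, 1),
# so test the standardized log values rather than the raw durations
target_pure_log_std = (target_pure_log - target_pure_log.mean()) / target_pure_log.std()
print(f"p-value for the Kolmogorov-Smirnov test: {kstest(target_pure_log_std, cdf='norm').pvalue}")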

print(f"Anderson-Darling normality test: \n \
        {anderson(target_pure_log, dist='norm')}")

So the different tests reject the log-normality assumption for the `target`. Note that a p-value is the probability, under the null hypothesis (i.e. the log of `target` comes from a normal distribution), of seeing data at least as extreme as what we observed.
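
One detail worth a small illustration: `kstest` with `cdf='norm'` compares the sample against the *standard* normal distribution. The sketch below uses made-up, exactly normal log-durations to show that an unstandardized sample is rejected purely because of its location and scale. The variable names are hypothetical.

```python
import numpy as np
from scipy.stats import kstest

# Hypothetical log-durations drawn from an exactly normal distribution.
rng = np.random.default_rng(0)
log_values = rng.normal(loc=6.5, scale=0.7, size=2_000)

# Compared against N(0, 1) as-is, the sample is rejected only because of its
# mean and standard deviation, not because of its shape.
p_raw = kstest(log_values, "norm").pvalue

# Standardizing first compares the shape of the sample with the normal shape.
# Estimating mean and std from the same data makes this p-value approximate.
z = (log_values - log_values.mean()) / log_values.std(ddof=1)
p_standardized = kstest(z, "norm").pvalue

print(f"raw: {p_raw:.3g}, standardized: {p_standardized:.3g}")
```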

Let's take a look at the pair-wise plot of the first 10000 data points, colored by `vendor_id`, whose type we already changed to `category`.

In [12]:
train_sample = train_pure[:10000]
sns.pairplot(train_sample, hue='vendor_id');

Plotting the correlations among the numeric features shows, for example, that there's no correlation between `passenger_count` and `trip_duration`.

In [13]:
train_corr = train_pure.corr()
plt.figure(figsize=(12, 10))
sns.heatmap(train_corr, annot=True)
plt.title("Correlation plot");

## Categorical features first

We can plot our categorical features with `countplot` as follows

In [14]:
print(f"vendor_id value counts: \n {train_pure.vendor_id.value_counts()}")
print(f"store_and_fwd_flag value counts: \n {train_pure.store_and_fwd_flag.value_counts()}")

fig, ax = plt.subplots(1,2, figsize=(15, 7))
sns.countplot(data=train_pure, x='vendor_id', ax=ax[0])
sns.countplot(data=train_pure, x='store_and_fwd_flag', ax=ax[1])
plt.show();

To plot distributions with respect to categorical features, we can use `boxplot` along with `swarmplot` to see the data points on top

In [15]:
plt.figure(figsize=(12, 10))
ax = sns.boxplot(x='vendor_id', y='trip_duration', data=train_pure[:1000])
ax = sns.swarmplot(x='vendor_id', y='trip_duration', data=train_pure[:1000], color='k');
plt.xlabel("vendor id")
plt.ylabel("trip duration")
plt.show();

Skewness in the distribution of `trip_duration` is evident for both `vendor_id`s above. To dig deeper, we can add `store_and_fwd_flag` and try to visualize the relation between the `trip_duration` distribution and the given `vendor_id` and `store_and_fwd_flag`.

`FacetGrid` is the way to go! `FacetGrid` is a powerful way of plotting *conditional* dependencies between target/features. For example, the following is the (sample) conditional distribution

$$P(\text{trip_duration}|\text{vendor_id, store_and_fwd_flag})$$

of trip_duration given vendor_id and  store_and_fwd_flag.

In [16]:
g = sns.FacetGrid(train_pure, row='vendor_id', hue='store_and_fwd_flag',
                  aspect=3, size=2.5, margin_titles=True)
g.map(sns.kdeplot, 'trip_duration', shade=True).add_legend()
for ax in g.axes.flat:
    ax.yaxis.set_visible(False)
sns.despine(left=True);

The first plot above shows that the conditional (sample) distribution $$P(\text{trip_duration}|\text{vendor_id = 1, store_and_fwd_flag = Y})$$ has a longer tail than $$P(\text{trip_duration}|\text{vendor_id = 1, store_and_fwd_flag = N})$$ and that
$$P(\text{trip_duration}|\text{vendor_id = 2, store_and_fwd_flag = Y})$$ is not present!
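
The same conditional view can be sketched numerically with `groupby`: one row of summary statistics per observed combination, so a combination with no data simply does not show up. The records below are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Hypothetical toy records; vendor 2 never appears with flag "Y".
toy = pd.DataFrame({
    "vendor_id":          [1, 1, 1, 1, 2, 2, 2],
    "store_and_fwd_flag": ["N", "N", "Y", "Y", "N", "N", "N"],
    "trip_duration":      [300, 500, 700, 2400, 350, 450, 900],
})

# One row of summary statistics per observed (vendor_id, flag) combination.
summary = (toy.groupby(["vendor_id", "store_and_fwd_flag"])["trip_duration"]
              .agg(["count", "median", "max"]))
print(summary)
```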

## `passenger_count`

In [17]:
print(f"unique values: {train_pure.passenger_count.unique()}")
print(f"number of unique values: {train_pure.passenger_count.nunique()}")

sns.barplot(x="passenger_count", y="trip_duration", data=train_pure, estimator=np.median);
plt.xlabel("passenger count")
plt.ylabel("median of trip duration (seconds)")
plt.title("Median of trip duration barplot over passenger count");

Let's examine the conditional distribution of `trip_duration` given `passenger_count` and `vendor_id` 

$$P(\text{trip_duration} | \text{passenger_count, vendor_id})$$

In [18]:
plt.figure(figsize=(16, 12))
g = sns.FacetGrid(train_pure, col='passenger_count',
                  col_wrap=3, hue='vendor_id',
                  aspect=1, size=2, margin_titles=True)
g.map(sns.kdeplot,  'trip_duration', shade=True).add_legend()
for ax in g.axes.flat:
    ax.yaxis.set_visible(False)
sns.despine(left=True);

The plots above include the distribution of `trip_duration` given `passenger_count = 0` for both `vendor_id`s. Such records could reflect delays or traffic, a taxi looking for passengers, or a vehicle that is stopped for some reason. The other `passenger_count`s from 1 to 5 seem consistent, while 6, 8 and 9 are getting a little weird. Btw, where's 7?

All the distributions of `trip_duration` we have seen so far are right skewed, i.e. they have a long tail towards large durations. The mean is pulled towards that tail, so we'll use the `median` instead of the `mean`.
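
A quick sketch of why the median is the safer summary for a long-tailed sample: a long right tail pulls the mean above the median. The log-normal sample below is synthetic and only stands in for real durations.

```python
import numpy as np
from scipy.stats import skew

# Hypothetical durations with a long right tail.
rng = np.random.default_rng(1)
sample = rng.lognormal(mean=6.5, sigma=0.7, size=10_000)

# Positive sample skewness indicates a long tail on the right.
print(f"skewness: {skew(sample):.2f}")
print(f"mean: {sample.mean():.0f}, median: {np.median(sample):.0f}")
```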

Now, let's examine the medians of `trip_duration` for each `passenger_count` and `vendor_id`, with margins, as follows (note that pandas `pivot_table` is a generalization of `groupby`)

In [19]:
train_trip_vendor = train_pure.pivot_table('trip_duration',
                                           index='passenger_count', 
                                           columns='vendor_id',
                                           aggfunc='median',
                                           margins=True)  # margin rows/columns are labelled 'All' by default
train_trip_vendor
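
To make the link between `pivot_table` and `groupby` concrete, here is a small sketch on synthetic data (the frame `toy` is made up). Without margins, the pivot table matches a `groupby` over both keys followed by `unstack`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with every (passenger_count, vendor_id) combination present.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "passenger_count": rng.integers(1, 4, size=300),   # values 1..3
    "vendor_id": rng.integers(1, 3, size=300),         # values 1..2
    "trip_duration": rng.integers(60, 3600, size=300),
})

via_pivot = toy.pivot_table("trip_duration", index="passenger_count",
                            columns="vendor_id", aggfunc="median")

# Same aggregation: group by both keys, then move vendor_id to the columns.
via_groupby = (toy.groupby(["passenger_count", "vendor_id"])["trip_duration"]
                  .median()
                  .unstack("vendor_id"))

print(via_pivot)
print(np.allclose(via_pivot.values, via_groupby.values))
```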

From the table above, why is the median `trip_duration` for `vendor_id=2` almost twice that of `vendor_id=1` when there's no passenger? Moreover, the median `trip_duration` for `vendor_id=2` with `passenger_count=8` is significantly lower than the rest.

Let's add the `store_and_fwd_flag` variable and plot the behaviour of the `trip_duration` medians

In [20]:
train_trip_vendor_flag = train_pure.pivot_table('trip_duration',
                                                index='passenger_count', 
                                                columns=['vendor_id',
                                                         'store_and_fwd_flag'],
                                                aggfunc='median')

train_trip_vendor_flag

train_trip_vendor_flag.plot()
plt.xlabel("passenger count")
plt.ylabel("median of trip_duration (seconds)")
plt.show()

## pickup_datetime

It's time to incorporate datetime variables into our analysis slowly! In light of the above plot, we can find the medians of `trip_duration` with the added pickup hour quartiles as follows

In [21]:
pickup_hour = pd.qcut(train_pure['pickup_datetime'].dt.hour, q=[0, .25, .5, .75, 1.])
train_trip_vendor_flag = train_pure.pivot_table('trip_duration',
                                                index='passenger_count', 
                                                columns=['vendor_id',
                                                         'store_and_fwd_flag',
                                                         pickup_hour],
                                                aggfunc='median')

train_trip_vendor_flag

The above table indicates that the median of `trip_duration` between midnight and 2pm is high *when taxis don't have any passengers* (`passenger_count = 0`). One explanation is that taxis are likely spending more time looking for passengers between midnight and 9am, and that traffic is intense from 9am to 2pm!

Also, the behaviour of the medians for other values of `passenger_count` over time (increasing between midnight and 2pm, then decreasing after 2pm) suggests that traffic is likely to blame. Moreover, 2pm seems to be rush hour in NY.
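
The hour bins used as pivot columns come from `pd.qcut`, which cuts at sample quantiles so that each bin holds roughly the same number of rows. The sketch below uses made-up hours that are perfectly uniform. On the real data the bin edges depend on when trips actually happen.

```python
import numpy as np
import pandas as pd

# Hypothetical pickup hours: each hour 0..23 appears equally often.
hours = pd.Series(np.tile(np.arange(24), 50), name="pickup_hour")

# Quartile-based bins: the edges are the 0%, 25%, 50%, 75% and 100% quantiles.
hour_bin = pd.qcut(hours, q=[0, .25, .5, .75, 1.])

print(hour_bin.cat.categories)             # the four interval labels
print(hour_bin.value_counts(sort=False))   # rows per bin
```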

In [22]:
train_pure['pickup_date'] = train_pure['pickup_datetime'].dt.date
train_pure['pickup_time'] = train_pure['pickup_datetime'].dt.time
train_pure['pickup_month'] = train_pure['pickup_datetime'].dt.month
train_pure['pickup_day'] = train_pure['pickup_datetime'].dt.day
train_pure['pickup_hour'] = train_pure['pickup_datetime'].dt.hour

In [23]:
train_pkdt = train_pure.set_index('pickup_datetime')

fig, ax = plt.subplots(2, 1, figsize=(12, 10))
train_pkdt['trip_duration'].resample('M').median().plot(style='-', ax=ax[0])
train_pkdt['trip_duration'].resample('W').median().plot(style='--', ax=ax[0])
train_pkdt['trip_duration'].resample('D').median().plot(style=':', ax=ax[0])
train_pkdt['trip_duration'].resample('H').median().plot(alpha=0.3, color='k', ax=ax[0])
ax[0].set_xlabel('')
ax[0].set_ylabel('median')
ax[0].set_title('Median trip durations over different time intervals')
ax[0].legend(['Monthly', 'Weekly', 'Daily', 'Hourly'], loc='upper right')

train_pkdt['trip_duration'].resample('D').count().plot(ax=ax[1])
ax[1].set_xlabel('pickup datetime')
ax[1].set_ylabel('count')

fig.show();

The time series above clearly show an increase in the hourly and daily `trip_duration` medians, or in the daily trip count, at the end of January 2016, when NY happened to be [hit by a blizzard](https://www.weather.gov/okx/Blizzard_Jan2016).

Let's look closer now

In [24]:
train_pkdt['trip_duration'].resample('M').median().plot(style='-')
train_pkdt['trip_duration'].resample('W').median().plot(style='--')
train_pkdt['trip_duration'].resample('D').median().plot(style=':')
plt.xlabel('pickup datetime')
plt.title('Median trip durations over different time intervals')
plt.legend(['Monthly', 'Weekly', 'Daily'], loc='best')
plt.show();
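
For readers new to `resample`, here is a minimal sketch of what the calls above compute, on a synthetic series with a `DatetimeIndex`. The data and names are made up, and only daily and weekly frequencies are shown.

```python
import numpy as np
import pandas as pd

# Hypothetical trips: one pickup every 30 minutes for 14 days.
n = 14 * 48
pickups = pd.Timestamp("2016-01-01") + pd.to_timedelta(np.arange(n) * 30, unit="min")
rng = np.random.default_rng(0)
trips = pd.Series(rng.integers(120, 3600, size=n), index=pickups, name="trip_duration")

daily_median = trips.resample("D").median()    # one value per calendar day
daily_count = trips.resample("D").count()      # number of trips per day
weekly_median = trips.resample("W").median()   # weekly bins ending on Sunday

print(daily_median.head())
print(daily_count.unique())
```

`resample` is essentially a `groupby` on time bins, which is why it needs the datetime column as the index (hence the `set_index('pickup_datetime')` earlier).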

## Geo features

# Feature Engineering

# Predictive Modeling

In [25]:
from fbprophet import Prophet

In [None]:
df_train = train_pkdt['trip_duration'].reset_index()
df_train.rename(columns={'pickup_datetime': 'ds', 'trip_duration': 'y'}, inplace=True)

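# select with double brackets to keep a DataFrame (Series.rename has no `columns` argument),
# and reset the index so that the label-based slice .loc[:1000] below works
df_test = test[['pickup_datetime']].rename(columns={'pickup_datetime': 'ds'}).reset_index(drop=True)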

In [None]:
%%time
prophet = Prophet(interval_width=0.95)
prophet.fit(df_train.loc[:1000, :])

In [None]:
pred = prophet.predict(df_test.loc[:1000])

In [None]:
df_test.head()

