This notebook presents the exploratory data analysis I carried out for this competition. It analyzes the competition's original training dataset without adding any external data. Be aware that the top scorers in this competition all use external datasets in their models and make use of distance calculations.

Since many notebooks are already available for this competition, it does not make sense to reproduce the same analysis again and again. We only show results that we could not find elsewhere.

In [39]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

In [40]:
train_df = pd.read_csv('../input/train.csv')

First, we check for duplicates in the dataset and find seven of them. The first column is left out of the comparison.

In [41]:
# Count how often each combination of values occurs, ignoring the first column
cols = train_df.columns[1:].tolist()
a = train_df.groupby(cols).size().reset_index(name='count')

In [42]:
a.sort_values('count', ascending=False).head(10)

In [43]:
del a
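
As a cross-check on the duplicate count, here is a minimal sketch that uses `DataFrame.duplicated` instead of a group-by. The helper name `count_duplicate_rows` and the toy data are made up for illustration; only the idea of ignoring the identifier column comes from the cell above.

```python
import pandas as pd


def count_duplicate_rows(df, ignore=("id",)):
    """Return how many rows repeat an earlier row, ignoring the given columns."""
    cols = [c for c in df.columns if c not in ignore]
    # keep="first" marks every repeat of a row except its first occurrence
    return int(df.duplicated(subset=cols, keep="first").sum())


# Toy data (hypothetical): rows 'b' and 'c' repeat row 'a' apart from the id
toy = pd.DataFrame({
    "id": ["a", "b", "c", "d"],
    "vendor_id": [1, 1, 1, 2],
    "trip_duration": [300, 300, 300, 450],
})
n_dupes = count_duplicate_rows(toy)
```

On the real data this would be called as `count_duplicate_rows(train_df)`, assuming the identifier column is named `id`.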

The target variable, trip_duration, appears to fall into distinct clusters. We split it into three ranges and plot each one separately.

In [44]:
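# distplot is deprecated in recent seaborn releases; histplot with kde=True is its replacement
sns.histplot(train_df.trip_duration[train_df.trip_duration < 10000], color="m", kde=True, stat="density")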

In [45]:
sns.histplot(train_df.trip_duration[(train_df.trip_duration > 10000) & (train_df.trip_duration < 100000)], color="m", kde=True, stat="density")

In [46]:
sns.histplot(train_df.trip_duration[(train_df.trip_duration > 100000) & (train_df.trip_duration < 4000000)], color="m", kde=True, stat="density")
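
To put numbers next to the three histograms, the same ranges can be counted directly. This is only a sketch: the function name, the labels and the toy series are assumptions, and `pd.cut` uses right-closed intervals, so values that fall exactly on an edge are handled slightly differently than with the strict inequalities in the plots.

```python
import pandas as pd


def duration_range_counts(durations):
    """Count observations in the three duration ranges used for the histograms."""
    edges = [0, 10000, 100000, 4000000]
    labels = ["<10k", "10k-100k", "100k-4M"]
    # Intervals are (0, 10000], (10000, 100000], (100000, 4000000]
    binned = pd.cut(durations, bins=edges, labels=labels)
    return binned.value_counts().reindex(labels)


# Toy durations (hypothetical values)
toy_durations = pd.Series([500, 9000, 15000, 86000, 250000])
range_counts = duration_range_counts(toy_durations)
```

On the real data the call would be `duration_range_counts(train_df.trip_duration)`.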

Time also appears to play a significant role in trip duration. The winter months (January and February) and May show some distinctive features as well.

In [47]:
train_df['pickup_datetime'] = pd.to_datetime(train_df['pickup_datetime'])
train_df['date_'] = train_df['pickup_datetime'].dt.date

In [48]:
# Daily aggregates of trip_duration, one panel per statistic
daily = train_df.groupby('date_')['trip_duration']
fig, axes = plt.subplots(nrows=3, ncols=2, figsize=(10, 10))

for ax, stat in zip(axes.flat, ['min', 'median', 'max', 'mean', 'count']):
    daily.agg(stat).plot(title=stat, ax=ax, rot=90)

axes[2, 1].set_visible(False)  # only five statistics, so hide the sixth panel
fig.tight_layout()  # called after plotting so that titles and tick labels are taken into account
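
The five daily statistics plotted above can also be collected in a single table, which makes it easier to look up a specific date. A minimal sketch, assuming the column names `pickup_datetime` and `trip_duration` from the dataset; the helper name and the toy rows are made up.

```python
import pandas as pd


def daily_duration_stats(df):
    """Return min, median, max, mean and count of trip_duration per pickup date."""
    dates = pd.to_datetime(df["pickup_datetime"]).dt.date.rename("date_")
    return df.groupby(dates)["trip_duration"].agg(["min", "median", "max", "mean", "count"])


# Toy data (hypothetical): two trips on the first day, one on the second
toy = pd.DataFrame({
    "pickup_datetime": ["2016-01-01 08:00:00", "2016-01-01 21:30:00", "2016-01-02 10:15:00"],
    "trip_duration": [100, 300, 450],
})
stats = daily_duration_stats(toy)
```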

The hour of the day also matters. Trip duration appears to be scattered around four clusters: below 1k, between 1k and 20k, between 20k and 80k, and between 80k and 90k. For the third cluster, there is a declining trend from 1 am to 6 pm, while for the fourth cluster, observations seem to be prevalent regardless of the time. The interesting question is which factors determine the cluster an observation falls into.

In [49]:
train_df['hour'] = train_df['pickup_datetime'].dt.hour

In [50]:
fig, axes = plt.subplots(nrows=2, ncols=2, figsize=(10, 10))

sns.boxplot(x="hour", y="trip_duration", data=train_df[train_df.trip_duration < 1000], ax=axes[0, 0])
sns.boxplot(x="hour", y="trip_duration", data=train_df[train_df.trip_duration < 1000000], ax=axes[0, 1])
# Number of trips per pickup hour
train_df.loc[:, ['hour', 'vendor_id']].groupby('hour').count().plot.bar(title='count', ax=axes[1, 0], rot=90)

axes[1, 1].set_visible(False)  # unused panel
fig.tight_layout()  # called after plotting so that titles and tick labels are taken into account
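
One way to quantify how the four clusters are spread over the day is a table of cluster shares per pickup hour. The sketch below is an assumption about how this could be done, not part of the original analysis: the helper name, the labels and the toy rows are invented, and trips outside the four ranges are simply left out.

```python
import pandas as pd


def cluster_share_by_hour(df):
    """Return, for each pickup hour, the share of trips in each duration cluster."""
    edges = [0, 1000, 20000, 80000, 90000]
    labels = ["<1k", "1k-20k", "20k-80k", "80k-90k"]
    # Durations above 90000 fall outside the bins and are dropped from the table
    cluster = pd.cut(df["trip_duration"], bins=edges, labels=labels).rename("cluster")
    hour = pd.to_datetime(df["pickup_datetime"]).dt.hour.rename("hour")
    return pd.crosstab(hour, cluster, normalize="index")


# Toy data (hypothetical)
toy = pd.DataFrame({
    "pickup_datetime": ["2016-03-01 01:10:00", "2016-03-01 01:40:00",
                        "2016-03-01 18:05:00", "2016-03-01 18:50:00"],
    "trip_duration": [500, 30000, 700, 800],
})
shares = cluster_share_by_hour(toy)
```

Each row of the result sums to one, so the rows can be compared across hours even though the number of trips per hour differs.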

If we separate trips by duration cluster, it becomes apparent that trips with a duration over 20k are rather exceptional but still happen daily.

In [51]:
fig, axes = plt.subplots(nrows=2, ncols=2, figsize=(10, 10))
d = train_df.trip_duration

# Daily number of trips in each duration cluster
train_df.loc[d < 1000, ['date_', 'trip_duration']].groupby('date_').count().plot(title='count, duration < 1k', ax=axes[0, 0], rot=90)
train_df.loc[(d > 1000) & (d < 20000), ['date_', 'trip_duration']].groupby('date_').count().plot(title='count, 1k < duration < 20k', ax=axes[0, 1], rot=90)
train_df.loc[(d > 20000) & (d < 80000), ['date_', 'trip_duration']].groupby('date_').count().plot(title='count, 20k < duration < 80k', ax=axes[1, 0], rot=90)
train_df.loc[(d > 80000) & (d < 90000), ['date_', 'trip_duration']].groupby('date_').count().plot(title='count, 80k < duration < 90k', ax=axes[1, 1], rot=90)

fig.tight_layout()  # called after plotting so that titles and tick labels are taken into account
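
The statement that trips over 20k still happen daily can be checked with a single number: the share of calendar days that contain at least one such trip. This is a minimal sketch with a made-up helper name and toy data; the 20000 threshold is the cluster boundary used above.

```python
import pandas as pd


def share_of_days_with_long_trips(df, threshold=20000):
    """Return the fraction of pickup dates that have at least one trip above the threshold."""
    dates = pd.to_datetime(df["pickup_datetime"]).dt.date
    has_long_trip = (df["trip_duration"] > threshold).groupby(dates).any()
    return has_long_trip.mean()


# Toy data (hypothetical): long trips on the first and third day only
toy = pd.DataFrame({
    "pickup_datetime": ["2016-04-01 09:00:00", "2016-04-01 23:00:00",
                        "2016-04-02 12:00:00", "2016-04-03 07:30:00"],
    "trip_duration": [300, 25000, 400, 85000],
})
share = share_of_days_with_long_trips(toy)
```

A value close to 1 on the real data would support the claim that these trips occur on almost every day.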

Pickup and dropoff coordinates appear to be concentrated around the same location.

In [52]:

# Plot a random sample of 1000 trips; x and y are passed as keywords,
# which recent seaborn releases require
pickup_df1 = train_df.loc[:, ['pickup_latitude', 'pickup_longitude', 'id']].sample(n=1000)
graph1 = sns.jointplot(x=pickup_df1.pickup_longitude, y=pickup_df1.pickup_latitude, kind="hex", color="#4CB391")

dropoff_df1 = train_df.loc[:, ['dropoff_latitude', 'dropoff_longitude', 'id']].sample(n=1000)
graph2 = sns.jointplot(x=dropoff_df1.dropoff_longitude, y=dropoff_df1.dropoff_latitude, kind="hex", color="#4CB391")

In [53]:
del pickup_df1, dropoff_df1

In [72]:
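# Build a coarse location key by rounding coordinates to two decimals
# (latitude string followed by longitude string)
train_df['pickup_coord'] = train_df['pickup_latitude'].round(2).astype(str) + train_df['pickup_longitude'].round(2).astype(str)
train_df['dropoff_coord'] = train_df['dropoff_latitude'].round(2).astype(str) + train_df['dropoff_longitude'].round(2).astype(str)

# Top 30 pickup and dropoff keys. Note that the aggregate is the sum of vendor_id,
# which serves as a proxy for the number of trips rather than an exact count.
# .loc replaces .ix, which has been removed from pandas.
a = pd.pivot_table(train_df.loc[:, ['pickup_coord', 'vendor_id']], index=['pickup_coord'], aggfunc='sum').sort_values(['vendor_id'], ascending=False).reset_index()[:30]
b = pd.pivot_table(train_df.loc[:, ['dropoff_coord', 'vendor_id']], index=['dropoff_coord'], aggfunc='sum').sort_values(['vendor_id'], ascending=False).reset_index()[:30]

a.columns = ['coord', 'count_pu']
b.columns = ['coord', 'count_do']

# Keep only the keys that rank in the top 30 for both pickups and dropoffs
a = a.merge(b, on=['coord'])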

In [55]:
ax = a.loc[:,['count_pu','count_do']].plot(kind='bar', rot=90)
ax.set_xticklabels(a.coord)
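
The cell above keys locations by concatenated strings and aggregates `vendor_id`. An alternative sketch, shown here only for comparison, groups on the two rounded numeric columns and counts rows directly. The helper name `top_cells`, its parameters and the toy coordinates are assumptions, not part of the original notebook.

```python
import pandas as pd


def top_cells(df, prefix, n=30, decimals=2):
    """Return the n most frequent rounded (lat, lon) cells for 'pickup' or 'dropoff'."""
    lat = df[prefix + "_latitude"].round(decimals).rename("lat")
    lon = df[prefix + "_longitude"].round(decimals).rename("lon")
    # size() counts rows per cell, independent of any other column's values
    counts = df.groupby([lat, lon]).size().rename("count_" + prefix)
    return counts.sort_values(ascending=False).head(n).reset_index()


# Toy data (hypothetical): the first two pickups fall in the same rounded cell
toy = pd.DataFrame({
    "pickup_latitude": [40.751, 40.749, 40.640],
    "pickup_longitude": [-73.974, -73.971, -73.780],
})
top_pickups = top_cells(toy, "pickup")
```

Keeping latitude and longitude as separate numeric columns avoids having to parse the key back into coordinates later.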

We plotted the top coordinates on a map and found that they are all located in Manhattan. The top five coordinates, which account for roughly one third of the dataset, are also very close to one another.
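
The share of the dataset covered by the top coordinates can be computed directly from the key column. A minimal sketch, assuming a column of location keys such as `pickup_coord`; the helper name and the toy keys are made up, and the result on the real data is not reproduced here.

```python
import pandas as pd


def top_k_share(keys, k=5):
    """Return the fraction of rows that belong to the k most frequent keys."""
    counts = keys.value_counts()  # sorted from most to least frequent
    return counts.head(k).sum() / counts.sum()


# Toy keys (hypothetical): 'a' occurs 4 times, 'b' twice, 'c' once
toy_keys = pd.Series(list("aaaabbc"))
share_top1 = top_k_share(toy_keys, k=1)
share_top2 = top_k_share(toy_keys, k=2)
```

On the real data the call would be `top_k_share(train_df['pickup_coord'], k=5)`.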

In [56]:
import folium

In [75]:
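# To center the map, take the mean value of the coordinates.
# Note: recent folium releases may no longer ship the 'Stamen Toner' tiles;
# a built-in style such as 'CartoDB positron' can be substituted.
stamen01 = folium.Map(location=[40.75092, -73.97349], tiles='Stamen Toner',
                    zoom_start=12)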

In [76]:
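feature_group = folium.FeatureGroup("cluster top coord")

# coord is the latitude string followed by the longitude string. Split at the
# longitude's minus sign, which assumes a negative longitude (as in New York).
for coord, count_pu in zip(a.coord, a.count_pu):
    sep = coord.index('-', 1)
    feature_group.add_child(folium.CircleMarker(location=[float(coord[:sep]), float(coord[sep:])],
                                                radius=count_pu / 10000,
                                                popup=coord))
stamen01.add_child(feature_group)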

In [59]:
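# Top 50 pickup keys by median trip duration (.loc replaces .ix, which has been removed from pandas)
a = train_df.loc[:, ['pickup_coord', 'trip_duration', 'pickup_longitude', 'pickup_latitude']].groupby('pickup_coord').median().sort_values(by='trip_duration', ascending=False).reset_index()[:50]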

We repeat the same exercise based on the median trip duration and quickly see that all of the longest trip durations have pickup coordinates outside Manhattan. Some of them even show pickup coordinates outside the state of New York.

In [60]:
#To center map, take mean value of coordinates
stamen02 = folium.Map(location=[40.75092, -73.97349], tiles='Stamen Toner',
                    zoom_start=8)

In [61]:
feature_group = folium.FeatureGroup("cluster top coord")

# One circle per pickup key, sized by its median trip duration
for i in range(len(a)):
    feature_group.add_child(folium.CircleMarker(location=[np.round(a.pickup_latitude[i], 2),
                                                          np.round(a.pickup_longitude[i], 2)],
                                                radius=a.trip_duration[i] / 1000,
                                                popup=a.pickup_coord[i]))

stamen02.add_child(feature_group)
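
How far these pickup points lie from Manhattan can be expressed as a great-circle distance from the map center used above. The sketch below uses the standard haversine formula with a mean Earth radius of 6371 km; the function name is an assumption, and the notebook itself does not compute distances.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between points given in decimal degrees."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    h = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(h))


# Distance of a point from the map center used in this notebook (here: the center itself)
center_lat, center_lon = 40.75092, -73.97349
d_center = haversine_km(center_lat, center_lon, center_lat, center_lon)
# One degree of latitude along a meridian
d_one_degree = haversine_km(0.0, 0.0, 1.0, 0.0)
```

Because the function relies on NumPy ufuncs, it also accepts pandas columns, for example `haversine_km(center_lat, center_lon, a.pickup_latitude, a.pickup_longitude)`.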


In [62]:
del a, b

Here is an attempt to find relationships among coordinates using a chord diagram. An interesting observation is that the top two coordinates stand out particularly strongly in that diagram. Most of the fares take place within the borough of Manhattan.

In [63]:
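# Aggregate vendor_id per (pickup, dropoff) pair (.loc replaces .ix, which has been removed from pandas)
chord_source = pd.pivot_table(train_df.loc[:, ['pickup_coord', 'dropoff_coord', 'vendor_id']], index=['pickup_coord', 'dropoff_coord'], aggfunc='sum')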

In [64]:
chord_source = chord_source.reset_index()
chord_source.rename(columns={'vendor_id': 'link_count'}, inplace=True)
chord_source = chord_source[chord_source.link_count > 0]
# Drop links that start and end at the same coordinate key
chord_source = chord_source[chord_source.pickup_coord != chord_source.dropoff_coord]
chord_source = chord_source.sort_values('link_count', ascending=False)
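
The link table feeding the chord diagram can also be built by counting rows per pickup and dropoff pair. This is a sketch under the assumption that the key columns are named `pickup_coord` and `dropoff_coord`; the helper name, its parameters and the toy keys are invented, and it counts trips rather than summing `vendor_id`.

```python
import pandas as pd


def link_counts(df, source="pickup_coord", target="dropoff_coord", top=200):
    """Return the most frequent source -> target links, excluding self-links."""
    links = df.groupby([source, target]).size().reset_index(name="link_count")
    links = links[links[source] != links[target]]
    return links.sort_values("link_count", ascending=False).head(top)


# Toy data (hypothetical): A->B twice, A->A once (dropped), B->A once
toy = pd.DataFrame({
    "pickup_coord": ["A", "A", "A", "B"],
    "dropoff_coord": ["B", "B", "A", "A"],
})
links = link_counts(toy)
```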

In [65]:
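# bokeh.charts was removed from later Bokeh releases, so these imports require an older Bokeh version
from bokeh.charts import Chord
from bokeh.charts import output_file, show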

In [66]:
chord_from_df = Chord(chord_source.iloc[:200,:], source="pickup_coord", target="dropoff_coord", value="link_count")
output_file('chord-diagram-bokeh.html', mode="inline")
show(chord_from_df)

In [67]:
from IPython.display import HTML

In [68]:
# HTML('chord-diagram-bokeh.html')
# Cannot be displayed within the kernel's limitations; must be reproduced locally.

In [69]:
del chord_from_df, chord_source