There are many irregularities in the data that we must hunt down in order to improve the prediction. In the following notebook you'll find an analysis of the features that contain the most information about the rides - the dates and times of pickup and dropoff, and their locations.
This analysis emphasizes how easy it is to notice some phenomena given the right visualization tool.

In [1]:
%pylab inline
import os
import pandas as pd
import seaborn as sns
from haversine import AVG_EARTH_RADIUS

# Loading Data

In [2]:
train = pd.read_csv("../input/train.csv", index_col=0)

In the next bit, we just load and transform the data a bit. The main new feature I present is the geodesic distance in kilometers (sorry, I'm metric), computed with the [haversine](https://en.wikipedia.org/wiki/Haversine_formula) formula, where $r$ is the Earth's radius and the latitudes and longitudes are in radians:
$$
d\left(x,y\right) = 2 \cdot r \cdot \arcsin \left( \sqrt{ \sin^2 \left( \frac{x_{\text{lat}} - y_{\text{lat}}}{2} \right)  + \cos\left(x_{\text{lat}}\right)\cos\left(y_{\text{lat}}\right)\sin^2 \left( \frac{x_{\text{long}} - y_{\text{long}}}{2} \right)}\right)
$$

This is done so that the distances and speeds I get later are in units that make more sense.

In [3]:
train = train.assign(
    pickup_datetime = pd.to_datetime(train.pickup_datetime),
    dropoff_datetime = pd.to_datetime(train.dropoff_datetime),
    vendor_id = train.vendor_id.astype("category"),
    store_and_fwd_flag = train.store_and_fwd_flag.astype("category"),
    # astype() does not accept ordered=...; pass an ordered CategoricalDtype instead
    passenger_count = train.passenger_count.astype(pd.CategoricalDtype(ordered=True)),
    dlat = (train.pickup_latitude - train.dropoff_latitude) * pi/180,
    dlong = (train.pickup_longitude - train.dropoff_longitude) * pi/180
).assign(
    # Despite the column name, this is the haversine (great-circle) distance in km
    euclidean_distance = lambda df: (2* AVG_EARTH_RADIUS*
                                     arcsin(sqrt(square(sin(df.dlat/2)) + 
                                            cos(df.pickup_latitude * pi/180) * 
                                            cos(df.dropoff_latitude * pi/180) * 
                                            square(sin(df.dlong/2)))))
)
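
The same formula can also be written as a small standalone function, which makes it easier to test on known points. This is only a minimal sketch: `haversine_km` and `EARTH_RADIUS_KM` are names I made up for illustration (they are not part of the notebook or of the `haversine` package), the radius is an assumed mean value, and the example coordinates are approximate.

```python
import numpy as np

# Assumed mean Earth radius in km (hypothetical constant for this sketch)
EARTH_RADIUS_KM = 6371.0088

def haversine_km(lat1, lon1, lat2, lon2, radius=EARTH_RADIUS_KM):
    """Great-circle distance in km between points given in degrees.

    Works on scalars, NumPy arrays, or pandas Series.
    """
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    # Term under the square root in the haversine formula
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 2 * radius * np.arcsin(np.sqrt(a))

# Roughly midtown Manhattan to JFK airport (approximate coordinates, illustration only)
d = haversine_km(40.758, -73.9855, 40.6413, -73.7781)
```

Because the function only uses NumPy ufuncs, passing whole columns (e.g. `train.pickup_latitude`) should give a column of distances in one call.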

# Initial Cleaning and Summary

Early plotting reveals blatant outliers that we want to get rid of for now. Seeing as Manhattan's circumference is less than 60 km, and that 500,000 seconds is more than five days, I'm pretty certain these values are mistakes.

In [4]:
train = train[
    (train.trip_duration < 500000) &
    (train.euclidean_distance < 125)
].copy()
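
As a quick sanity check on the duration cutoff used above, the seconds-to-days conversion can be done with `pd.Timedelta`. This is just a sketch of the arithmetic; the variable names are mine.

```python
import pandas as pd

# The 500,000-second cutoff expressed as a Timedelta
cutoff = pd.Timedelta(seconds=500_000)

# Dividing two Timedeltas gives a plain float: the cutoff in days
cutoff_days = cutoff / pd.Timedelta(days=1)

print(cutoff)                 # 5 days 18:53:20
print(round(cutoff_days, 2))  # 5.79
```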

In [5]:
def summary(df, sample_size = None):
    if sample_size is None or sample_size > len(df):
        sample_size = len(df)
    sample = np.random.choice(df.index, size=sample_size, replace=False)
    fig, axes = plt.subplots(3,3, figsize=(12,12))
    axes = np.reshape(axes, -1)
    for index, field in enumerate(['vendor_id', 'passenger_count', 'pickup_longitude', 'pickup_latitude',
                                   'dropoff_longitude', 'dropoff_latitude', 'trip_duration', 'store_and_fwd_flag',
                                   'euclidean_distance']):
        data = df.loc[sample, field]
        if isinstance(data.dtype, pd.CategoricalDtype):  # public alias of the internal dtype class
            sns.countplot(x=field, data=df.loc[sample, [field]], ax=axes[index])
        else:
            sns.violinplot(data=df.loc[sample, [field]], ax=axes[index])
        axes[index].set_title(field)

summary(train, 10000)

From a quick glance at `trip_duration` and `euclidean_distance`, it seems some of the data is still a bit off. This might be worth exploring. Let's look at it a little deeper.

In [6]:
fig, axes = subplots(1, 2, figsize=(15,5))
train.euclidean_distance.hist(bins=100, ax=axes[0])
axes[0].set_yscale("log")
axes[0].set_title("Distance Histogram")

train.trip_duration.hist(bins=100, ax=axes[1])
axes[1].set_yscale("log")
_ = axes[1].set_title("Trip Duration Histogram")

# Exploring Temporal Fields

Ok, so I can't say anything new about the distance. However, the `trip_duration` variable is very odd. Recall that `trip_duration` is calculated as `dropoff_datetime - pickup_datetime`. Maybe we can visualize it in a smarter way? Let's try to visualize the average speed throughout the week, for every pair of pickup and dropoff hours.
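
To make the idea concrete before running it on the full data, here is a minimal sketch of the same grouping on a toy frame. The rides are made up, and instead of `apply` with a lambda it uses the vectorized `.dt` accessors, which is my substitution and not the approach of the cell that follows. The level names (`pickup_day`, `pickup_hour`, ...) are hypothetical.

```python
import pandas as pd

# Toy rides: made-up timestamps, distances (km) and durations (s)
rides = pd.DataFrame({
    "pickup_datetime": pd.to_datetime([
        "2016-03-14 08:10:00", "2016-03-14 08:40:00", "2016-03-15 23:50:00"]),
    "dropoff_datetime": pd.to_datetime([
        "2016-03-14 08:25:00", "2016-03-14 09:05:00", "2016-03-16 00:00:00"]),
    "euclidean_distance": [3.0, 5.0, 2.0],
    "trip_duration": [900, 1500, 600],
})

# Average speed of each ride, in km per second
speed = rides.euclidean_distance / rides.trip_duration

# Group keys: (weekday, hour) of pickup and of dropoff; Monday is 0
keys = [
    rides.pickup_datetime.dt.dayofweek.rename("pickup_day"),
    rides.pickup_datetime.dt.hour.rename("pickup_hour"),
    rides.dropoff_datetime.dt.dayofweek.rename("dropoff_day"),
    rides.dropoff_datetime.dt.hour.rename("dropoff_hour"),
]

# Rows: pickup (day, hour); columns: dropoff (day, hour); values: median speed
table = speed.groupby(keys).median().unstack(["dropoff_day", "dropoff_hour"])
```

Pairs of pickup and dropoff hours with no rides come out as `NaN`, which a heatmap would simply leave blank.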

In [7]:
DAY_NAME = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]

def weekday_hour_table(df, func):
    """
    df - the dataframe to summarize
    func - function for calculating the values for each pair of pickup-dropoff times
    """
    df = df.groupby(
        [
            df.pickup_datetime.apply(lambda dt: (dt.weekday(), dt.hour)),
            df.dropoff_datetime.apply(lambda dt: (dt.weekday(), dt.hour)),
        ]
    ).apply(func).unstack()
    df.index = ["%s %d" % (DAY_NAME[day], hour) for day, hour in df.index]
    df.columns = ["%s %d" % (DAY_NAME[day], hour) for day, hour in df.columns]
    return df

def median_of_average_ride_speed(group):
    return (group.euclidean_distance / group.trip_duration).median()

traffic_by_day = weekday_hour_table(train, median_of_average_ride_speed)
figsize(15,15)
ax = sns.heatmap(log(traffic_by_day+1e-7), xticklabels=2, yticklabels=2)
_= xticks(rotation=90)

Whoa. There is something regularly strange there. We see two phenomena. The first is that each day at midnight, many rides are closed (dropoff marked). The second is "ghost rides": rides that take 20-23 hours. Remember the histogram? They probably match the increase in frequency of longer rides.

Let's try to figure out what it is all about. The midnight dropoffs seem like the easiest place to begin. Technically, we'd plot a summary of the whole restricted dataset, but I like to jump to the end.

In [8]:
figsize(15,5)
df = train[
    (train.dropoff_datetime.dt.hour == 0) &
    (train.dropoff_datetime.dt.minute == 0) &
    (train.dropoff_datetime.dt.second == 0)
]
sample = np.random.choice(df.index, size=min(10000, len(df)), replace=False)
_ = sns.stripplot(y="trip_duration", x="vendor_id", data=df.loc[sample].copy(), jitter=True)

So, we can blame `vendor_id` 2 for all this mess. I wonder how the heatmap looks without them?

In [9]:
traffic_by_day = weekday_hour_table(train[train.vendor_id == 1], median_of_average_ride_speed)
figsize(15,10)
ax = sns.heatmap(log(traffic_by_day+1e-7), xticklabels=2, yticklabels=2)
_= xticks(rotation=90)

Much cleaner. Those with a sharp eye will notice an outlier there (recall that a light color means a really slow ride). We'll deal with it later.
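
The midnight-dropoff pattern from this section can also be isolated with a single comparison. A minimal sketch on made-up data, assuming timestamps have one-second resolution: `dt.normalize()` truncates each timestamp to midnight, so a dropoff equals its normalized value only when it is stamped exactly 00:00:00. This is an alternative to checking hour, minute and second separately, not what the cells above do.

```python
import pandas as pd

# Made-up dropoffs: only the second one is stamped exactly at midnight
rides = pd.DataFrame({
    "vendor_id": [1, 2, 2],
    "dropoff_datetime": pd.to_datetime([
        "2016-03-14 00:00:07", "2016-03-15 00:00:00", "2016-03-15 13:30:00"]),
})

# True only where the dropoff time is exactly 00:00:00
at_midnight = rides.dropoff_datetime == rides.dropoff_datetime.dt.normalize()

# How many exact-midnight dropoffs each vendor has
counts = rides[at_midnight].groupby("vendor_id").size()
```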

# Exploring Spatial Fields

We'll begin by removing the outliers, so we can trust the durations (at least those that are far from zero).

In [10]:
outliers = (
    ((
        (train.vendor_id == 2) &
        (train.dropoff_datetime.dt.hour == 0) &
        (train.dropoff_datetime.dt.minute == 0) &
        (train.dropoff_datetime.dt.second == 0)
    ) |
    (
        train.trip_duration > 40000
    ) | 
    (
        train.euclidean_distance == 0
    ))
)
train = train[~outliers].copy()

In [11]:
fig, axes = subplots(1, 2, figsize=(15,5))
# Left panel: speed in km/h (the stray distance histogram on this axis was dropped)
(train["euclidean_distance"] / (train["trip_duration"]/3600)).hist(bins=200, ax=axes[0])
axes[0].set_yscale("log")
axes[0].set_title("KM Per Hour Histogram")

((train["trip_duration"]/3600) / train["euclidean_distance"]).hist(bins=200, ax=axes[1])
axes[1].set_yscale("log")
_ = axes[1].set_title("Hour Per KM Histogram")

Okay. Some of the speeds are really not reasonable. 8000 km/h? I mean, come on! (Or 800 hours for one km? Makes no sense.) In general, the hours-per-kilometer figure is strange. So many rides that take more than two hours per kilometer? Why don't people just walk?

In [12]:
too_fast = ((train["euclidean_distance"] / (train["trip_duration"]/3600)) > 300)
too_slow = ((train["euclidean_distance"] / (train["trip_duration"]/3600)) < 0.5)
# Removing some outliers to make the plot clearer
not_too_far = (train["pickup_latitude"] > 40) &  (train["pickup_longitude"] > -76)

#Plotting
with_speed = train[(too_fast | too_slow) & not_too_far].copy()
with_speed["fast"] = too_fast
_ = sns.lmplot(x='pickup_longitude', y='pickup_latitude', hue="fast", markers='.', height=10,  # `size` was renamed to `height` in seaborn 0.9
               fit_reg=False, data=with_speed)

A close examination (plotting on a real map, e.g. with the gmaps module) reveals that most of the points are valid (Long Island, Philadelphia and other areas, although there is one point in the middle of the ocean). Many of the rides cover very short distances, and it seems there was a mistake in entering the time or the location of the ride.

For now, I haven't found the cause of the mistake, so I'm going to treat everything as an outlier. At least let's clean it up.

In [13]:
train = train[~too_fast & ~too_slow].copy()
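
The speed-based filter above can be packaged as a small helper, which keeps the unit conversion in one place. This is a sketch on made-up rides: `speed_kmh` is a hypothetical name, and the bounds are the same 0.5 and 300 km/h thresholds used above.

```python
import pandas as pd

def speed_kmh(df):
    """Average speed in km/h from distance (km) and duration (s)."""
    return df["euclidean_distance"] / (df["trip_duration"] / 3600)

# Made-up rides: a normal one, one far too fast, one far too slow
rides = pd.DataFrame({
    "euclidean_distance": [2.0, 100.0, 0.01],
    "trip_duration": [600, 600, 3600],
})

kmh = speed_kmh(rides)

# between() is inclusive on both ends, matching ~(kmh > 300) & ~(kmh < 0.5)
plausible = kmh.between(0.5, 300)
clean = rides[plausible].copy()
```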