Jupyter notebook for data exploration: visualization of variable distributions, some cleaning, and prediction using a Random Forest.

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from math import radians, cos, sin, asin, sqrt 
from tqdm import tqdm_notebook
from sklearn.preprocessing import LabelEncoder
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import SVR
from sklearn.model_selection import cross_val_score
import calendar
%matplotlib inline

In [None]:
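# DataFrame.from_csv has been removed from pandas; read_csv with index_col=0 is the equivalent call
train = pd.read_csv('../input/train.csv', index_col=0)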

In [None]:
def haversine_np(lon1, lat1, lon2, lat2):
    """
    Calculate the great circle distance between two points
    on the earth (specified in decimal degrees)
    """
    lon1, lat1, lon2, lat2 = map(np.radians, [lon1, lat1, lon2, lat2])

    dlon = lon2 - lon1
    dlat = lat2 - lat1

    a = np.sin(dlat/2.0)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon/2.0)**2

    c = 2 * np.arcsin(np.sqrt(a))
    km = 6367 * c
    return km
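
As a quick sanity check of the formula above, one degree of latitude along a meridian should come out at roughly 111 km, and two identical points should be at distance zero. The sketch below redefines the function so it runs on its own; the coordinates are arbitrary illustrative values, not taken from the dataset.

```python
import numpy as np

def haversine_np(lon1, lat1, lon2, lat2):
    """Great-circle distance in km between points given in decimal degrees."""
    lon1, lat1, lon2, lat2 = map(np.radians, [lon1, lat1, lon2, lat2])
    a = (np.sin((lat2 - lat1) / 2.0) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2.0) ** 2)
    return 6367 * 2 * np.arcsin(np.sqrt(a))

# One degree of latitude at constant longitude (illustrative point).
one_degree = float(haversine_np(-73.98, 40.0, -73.98, 41.0))

# Identical points must be at distance zero.
same_point = float(haversine_np(-73.98, 40.75, -73.98, 40.75))

print(round(one_degree, 1), same_point)
```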

In [None]:
train["year"] = pd.to_datetime(train['pickup_datetime']).dt.year
train["month"] = pd.to_datetime(train['pickup_datetime']).dt.month
train["day"] = pd.to_datetime(train['pickup_datetime']).dt.weekday
train["pickup_hour"] = pd.to_datetime(train['pickup_datetime']).dt.hour

# Looping through arrays of data is very slow in python. 
# Numpy provides functions that operate on entire arrays of data, 
# which lets you avoid looping and drastically improve performance
train['distance'] = haversine_np(train['pickup_longitude'],
                                 train['pickup_latitude'],
                                 train['dropoff_longitude'],
                                 train['dropoff_latitude'])
train["mean_speed"] = (train.distance / train.trip_duration)*3600
train['alone'] = (train['passenger_count']==1).apply(int)
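
To illustrate the comment about vectorization, here is a small sketch comparing a row-by-row Python loop with the array version. The coordinates are random synthetic values, assumed for illustration only, and `haversine_scalar` is a hypothetical helper written for this comparison. Both approaches give the same distances; the vectorized call simply avoids the per-row Python overhead.

```python
import math
import numpy as np

def haversine_scalar(lon1, lat1, lon2, lat2):
    """Same formula as haversine_np, but for a single pair of points."""
    lon1, lat1, lon2, lat2 = map(math.radians, [lon1, lat1, lon2, lat2])
    a = (math.sin((lat2 - lat1) / 2.0) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2.0) ** 2)
    return 6367 * 2 * math.asin(math.sqrt(a))

def haversine_np(lon1, lat1, lon2, lat2):
    """Vectorized version operating on whole arrays at once."""
    lon1, lat1, lon2, lat2 = map(np.radians, [lon1, lat1, lon2, lat2])
    a = (np.sin((lat2 - lat1) / 2.0) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2.0) ** 2)
    return 6367 * 2 * np.arcsin(np.sqrt(a))

# Synthetic pickup/dropoff coordinates (illustrative ranges, not real trips).
rng = np.random.default_rng(0)
n = 1000
lon1, lon2 = rng.uniform(-74.1, -73.7, n), rng.uniform(-74.1, -73.7, n)
lat1, lat2 = rng.uniform(40.6, 40.9, n), rng.uniform(40.6, 40.9, n)

# Row-by-row loop versus a single call on the full arrays.
looped = np.array([haversine_scalar(a, b, c, d)
                   for a, b, c, d in zip(lon1, lat1, lon2, lat2)])
vectorized = haversine_np(lon1, lat1, lon2, lat2)

print(np.allclose(looped, vectorized))
```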

### Distributions

In [None]:
f, ax = plt.subplots(ncols=2, nrows=3, figsize=(20,15))
train[train.distance < 30].distance.hist(bins=100, ax=ax[0,0])
ax[0, 0].axvline(train[train.distance < 30].distance.median(), color='red')
ax[0, 0].set_xlabel('Distance in km')
ax[0, 0].set_title('Traveled distance distribution')

train[train.mean_speed < 80].mean_speed.hist(bins=100, ax=ax[0,1])
ax[0, 1].axvline(train[train.mean_speed < 80].mean_speed.median(), color='red')
ax[0, 1].set_xlabel('Mean speed in km/h')
ax[0, 1].set_title('Mean speed distribution')

# Pass the column as x= explicitly; recent seaborn versions no longer accept it positionally
sns.countplot(x=train.month, ax=ax[1,0])
_ = ax[1,0].set_xticklabels([calendar.month_abbr[int(k.get_text())] for k in ax[1,0].get_xticklabels()])
ax[1, 0].set_title('Travel month distribution')

sns.countplot(x=train.day, ax=ax[1,1])
_ = ax[1,1].set_xticklabels([calendar.day_abbr[int(k.get_text())] for k in ax[1,1].get_xticklabels()])
ax[1, 1].set_title('Travel day distribution')

sns.countplot(x=train.pickup_hour, ax=ax[2,0])
ax[2, 0].set_title('Travel hour distribution')

train.groupby(['day', 'pickup_hour']).count()['vendor_id'].plot(ax=ax[2,1])
ax[2, 1].set_title('Travel time distribution during the week')

### Trip duration

In [None]:
sns.countplot(x='trip_duration', data=train)
plt.yscale('log')

In [None]:
f, ax = plt.subplots(ncols=2, figsize=(15,5))
sns.boxplot(x='pickup_hour', y='trip_duration', data=train[train.trip_duration < 2*3600], ax = ax[0])
sns.boxplot(x='passenger_count', y='trip_duration', data=train[(train.trip_duration < 2*3600) & 
                                                               (train.passenger_count < 7)], ax = ax[1])
ax[1].set_yscale('log')
ax[0].set_yscale('log')

### Loneliness

In [None]:
f, ax = plt.subplots(nrows=2, figsize=(15,10))
sns.countplot(x='pickup_hour', hue='alone', data=train, ax=ax[0])
sns.countplot(x='day', hue='alone', data=train, ax=ax[1])

People seem to travel alone on weekdays and in the morning/evening; those taxi trips are probably commutes to work.  

### Vendor id

In [None]:
_ = sns.countplot(x='vendor_id', data=train)

There are more trips with vendor 2. Let's see if the `vendor_id` has an influence on the distributions.   
Since vendor 2 has more trips, if the vendor has no influence, its counts should simply be a little higher across each distribution.

In [None]:
f, ax = plt.subplots(figsize=(20,5), ncols=2)
sns.countplot(x="passenger_count", hue='vendor_id', data=train, ax=ax[0])
_ = ax[0].set_xlim([0.5, 7])

sns.countplot(x="pickup_hour", hue='vendor_id', data=train, ax=ax[1])

1. As you can see, almost all the big cars (>= 5 passengers) belong to vendor 2, so vendor 2 probably operates bigger cars.  
While vendor 2 is busy with trips carrying many passengers, vendor 1 takes more single passengers.
  
2. The vendor doesn't seem to influence the pickup hour, although proportionally more people travel with vendor 1 at night (3-6am).
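
Because the two vendors do not have the same number of trips, raw counts are hard to compare directly. One option, sketched here on a tiny made-up frame (the values in `toy` are illustrative, not from the dataset), is to turn each vendor's counts into shares with `pd.crosstab` and `normalize='index'`, so that each vendor's row sums to 1.

```python
import pandas as pd

# Toy data: 4 trips for vendor 1, 6 trips for vendor 2 (made-up values).
toy = pd.DataFrame({
    'vendor_id':       [1, 1, 1, 1, 2, 2, 2, 2, 2, 2],
    'passenger_count': [1, 1, 1, 2, 1, 1, 2, 5, 5, 6],
})

# Rows are vendors, columns are passenger counts; each row sums to 1.
shares = pd.crosstab(toy['vendor_id'], toy['passenger_count'], normalize='index')
print(shares)
```

The same call with `pickup_hour` in place of `passenger_count` would give the per-vendor share of trips for each hour.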


In [None]:
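# `height` replaces the `size` argument, which was removed from FacetGrid
g = sns.FacetGrid(train[train.distance < 30], hue="vendor_id", height=7)
g = g.map(sns.distplot, "distance")
# add_legend builds the legend from the hue levels; no manual color mapping is needed
g.add_legend(title='vendor_id')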

The `vendor_id` has no visible influence on the traveled distance.

### Cleaning extreme trips

In [None]:
for k in [0.5, 1, 5, 10, 20, 100]:
    print("{} hours+ trips : {:.4f} %".format(k, (len(train[train.trip_duration > k * 3600]) / len(train))*100))

99% of the trips last less than one hour.

In [None]:
extreme = train[train.trip_duration > 3600]
f, ax = plt.subplots(ncols=2, figsize=(15,5))
ax[0].scatter(extreme.distance, extreme.trip_duration)
ax[0].set_yscale('log')
ax[0].set_ylabel('Log Trip Duration')
ax[0].set_xlabel('Distance in km')
ax[0].set_title('Trip duration and distance for 1h+ trip')

sns.distplot(extreme['mean_speed'], ax=ax[1])
ax[1].set_ylabel('count')
ax[1].set_title('Mean speed distribution for 1h+ trip')

There are some odd long (>1h) trips with a mean speed close to 0 km/h, and some trips with a distance close to 0 km but a trip duration > 1h.

In [None]:
print('The mean trip duration for 1h+ trip with a speed < 1 km/h is {:.2f} hour'.format(extreme[extreme.mean_speed < 1].trip_duration.mean()/3600))

Trips lasting about a day seem unrealistic and need to be removed. So, I will remove 20h+ trips (<0.15% of the dataset).

In [None]:
df = train[train.trip_duration < 20*3600]
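
A minimal sketch of how one could check what such a duration filter removes, using a toy frame with made-up durations in seconds (the values and the names `toy`, `max_duration`, `cleaned` are assumptions for illustration):

```python
import pandas as pd

# Illustrative trip durations in seconds; the last two are close to a full day.
toy = pd.DataFrame({'trip_duration': [300, 900, 1800, 4000, 86000, 86390]})

max_duration = 20 * 3600                   # 20 hours, the threshold used above
mask = toy['trip_duration'] < max_duration
cleaned = toy[mask]

# Share of rows dropped by the filter, in percent.
removed_pct = 100 * (1 - mask.mean())
print(len(cleaned), round(removed_pct, 2))
```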

### Feature importance

In [None]:
y = df.trip_duration
X = df.drop(['pickup_datetime', 'dropoff_datetime', 'trip_duration', 'year', 'mean_speed'], axis=1)

In [None]:
le = LabelEncoder()
X.store_and_fwd_flag = le.fit_transform(X.store_and_fwd_flag)

In [None]:
clf = RandomForestRegressor()
clf.fit(X, y)

In [None]:
plt.figure(figsize=(17,5))
# Sort features by decreasing importance and pass x/y as keywords
order = np.argsort(clf.feature_importances_)[::-1]
sns.barplot(x=X.columns[order], y=clf.feature_importances_[order])
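
The imports at the top include `cross_val_score`, which the notebook never calls. Below is a minimal sketch of how a forest could be scored with it before predicting. Everything in it is an assumption for illustration: the synthetic `X_toy` and `y_toy`, `n_estimators=20`, 3 folds, and the choice of negative mean squared error as the scoring metric.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Synthetic regression problem: the target depends mostly on the first feature.
rng = np.random.default_rng(0)
X_toy = rng.uniform(0, 10, size=(200, 3))
y_toy = 60 * X_toy[:, 0] + rng.normal(0, 5, size=200)

model = RandomForestRegressor(n_estimators=20, random_state=0)

# scikit-learn reports errors as negative scores, so flip the sign before the square root.
scores = cross_val_score(model, X_toy, y_toy, cv=3,
                         scoring='neg_mean_squared_error')
rmse = np.sqrt(-scores)
print(rmse.mean())
```

On the real data, `X` and `y` from the cells above would take the place of `X_toy` and `y_toy`.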

### Prediction

In [None]:
test = pd.read_csv('../input/test.csv', index_col=0)
test["month"] = pd.to_datetime(test['pickup_datetime']).dt.month
test["day"] = pd.to_datetime(test['pickup_datetime']).dt.weekday
test["pickup_hour"] = pd.to_datetime(test['pickup_datetime']).dt.hour

test['distance'] = haversine_np(test['pickup_longitude'],
                                 test['pickup_latitude'],
                                 test['dropoff_longitude'],
                                 test['dropoff_latitude'])
test['alone'] = (test['passenger_count']==1).apply(int)
test = test.drop('pickup_datetime', axis=1)
test.store_and_fwd_flag = le.transform(test.store_and_fwd_flag)

In [None]:
# Select the training columns explicitly so the feature order matches the fitted model
test['trip_duration'] = clf.predict(test[X.columns])