In this notebook I will begin by processing the data, specifically the datetime columns. From there I will conduct some basic EDA and then test different models such as Random Forests and Gradient Boosting (using XGBoost).

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here are several helpful packages to load

from subprocess import check_output

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import seaborn as sns

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list the files in the input directory
print(check_output(["ls", "../input"]).decode("utf8"))

# Any results you write to the current directory are saved as output.

# Step One: Processing the Data

Let's begin by reading in the training data as a dataframe and then checking the head of the dataframe.

In [None]:
train = pd.read_csv('../input/train.csv')

In [None]:
train.head()

Let's check the data type of each column.

In [None]:
train.dtypes

We can see that most of the data is numerical; the exceptions are the pickup_datetime, dropoff_datetime, and store_and_fwd_flag columns. The pickup_datetime and dropoff_datetime columns are of type 'object' (they are stored as strings), so we should convert them to datetime objects using the pd.to_datetime function.

In [None]:
train['pickup_datetime'] = pd.to_datetime(train['pickup_datetime'])

In [None]:
train['dropoff_datetime'] = pd.to_datetime(train['dropoff_datetime'])
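
As a quick sanity check of the conversion above, here is a minimal sketch on a toy frame. The two timestamps are made up for illustration, and the explicit `format` string is an assumption that the column uses a `YYYY-MM-DD HH:MM:SS` layout; passing it is optional but can speed up parsing on a large column.

```python
import pandas as pd

# Toy stand-in for the training data; these rows are made up for illustration.
toy = pd.DataFrame({
    'pickup_datetime': ['2016-03-14 17:24:55', '2016-06-12 00:43:35'],
})

# Assumed layout: YYYY-MM-DD HH:MM:SS. An explicit format avoids per-row inference.
toy['pickup_datetime'] = pd.to_datetime(toy['pickup_datetime'],
                                        format='%Y-%m-%d %H:%M:%S')

# After conversion the column has a datetime dtype instead of plain strings.
print(toy['pickup_datetime'].dtype)
```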

From here we can do some feature extraction on the pickup_datetime column by getting the hour, minute, day, and month of each pickup time.

In [None]:
train['pickup_hour'] = train['pickup_datetime'].dt.hour
train['pickup_minute'] = train['pickup_datetime'].dt.minute
train['pickup_day'] = train['pickup_datetime'].dt.day
train['pickup_month'] = train['pickup_datetime'].dt.month
# Just in case, let's get the second of the pickup time as well
train['pickup_second'] = train['pickup_datetime'].dt.second

In [None]:
train.head()
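
To make the extraction concrete, the sketch below applies the same `.dt` attributes to a single made-up timestamp (not a row from the dataset) and collects the resulting parts in a dictionary.

```python
import pandas as pd

# A single made-up pickup timestamp, used only for illustration.
ts = pd.Series(pd.to_datetime(['2016-03-14 17:24:55']))

# Each .dt attribute returns an integer Series aligned with the original rows.
parts = {
    'month': int(ts.dt.month.iloc[0]),
    'day': int(ts.dt.day.iloc[0]),
    'hour': int(ts.dt.hour.iloc[0]),
    'minute': int(ts.dt.minute.iloc[0]),
    'second': int(ts.dt.second.iloc[0]),
}
print(parts)
```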

# Step Two: Exploratory Data Analysis

Let's take a look at the distributions of the different time-based variables that we extracted in the previous step.

In [None]:
sns.distplot(train['pickup_month'])

So it seems that while March (month 3) is the most common pickup month, pickups are spread almost evenly across the months.

In [None]:
sns.distplot(train['pickup_day'])

Again, nothing too surprising here! Pickups are also spread almost evenly across the days of the month.

In [None]:
sns.distplot(train['pickup_hour'])

Based on the distribution above we can notice the following:

- There are relatively few early morning pickups before 6 am.
- From 6 am onward the number of pickups starts to rise, peaking in the early evening at around 7 pm (19:00 on the 24-hour clock).
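
Observations like the ones above can also be checked numerically instead of being read off the plot. The following is a minimal sketch on a made-up list of pickup hours (not the real data); with the real frame the same calls would be applied to `train['pickup_hour']`.

```python
import pandas as pd

# Made-up pickup hours, for illustration only.
toy_hours = pd.Series([2, 8, 8, 13, 18, 19, 19, 19, 22])

# Number of pickups in each hour, ordered by hour of day.
pickups_per_hour = toy_hours.value_counts().sort_index()

# The hour with the most pickups, and the share of pickups before 6 am.
busiest_hour = pickups_per_hour.idxmax()
early_share = (toy_hours < 6).mean()

print(pickups_per_hour)
print(busiest_hour, early_share)
```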

In [None]:
sns.distplot(train['pickup_minute'])

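Based on the above distribution, pickups appear to cluster at intervals of roughly 5-6 minutes, with a total of ten sudden peaks across the hour, while the rest of the distribution is relatively uniform. These peaks are most likely a binning artifact rather than real behavior: if the 60 integer minute values are spread over 50 histogram bins (which I believe is the cap on distplot's default bin count), ten of the bins must hold two minute values each.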
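
Whether the peaks in the minute plot reflect real behavior can be tested with synthetic data. The sketch below assumes the plot used 50 bins (my understanding of the default cap, not something the plot itself confirms) and histograms a perfectly uniform set of minutes. Even though every minute is equally common, ten bins come out twice as tall as the rest, simply because 60 integer values cannot be split evenly across 50 bins.

```python
import numpy as np

# Perfectly uniform synthetic minutes: every value 0..59 appears 100 times.
minutes = np.repeat(np.arange(60), 100)

# 50 equal-width bins over 60 integer values: the bins cannot line up with the integers.
counts, edges = np.histogram(minutes, bins=50)
double_bins = int((counts == 200).sum())  # bins that swallowed two minute values
single_bins = int((counts == 100).sum())  # bins that hold exactly one minute value
print(double_bins, single_bins)

# One bin per integer value (edges at -0.5, 0.5, ..., 59.5) removes the artifact.
aligned_counts, _ = np.histogram(minutes, bins=np.arange(61) - 0.5)
print(aligned_counts.min(), aligned_counts.max())
```

With integer-valued features such as minutes and seconds, choosing one bin per value is usually the safer way to read the distribution.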

In [None]:
sns.distplot(train['pickup_second'])

The seconds follow a distribution similar to that of the minutes, but I am not sure this feature will be as significant. Nevertheless, let's keep this column for now and see what happens.

Let's look at the distribution of the trip duration.

In [None]:
sns.distplot(train['trip_duration'])

In [None]:
train['trip_duration'].describe()

So it seems 75% of trips last no more than 1075 seconds (about 17.9 minutes), but the longest trip lasted 3,526,282 seconds (979.5 hours, or roughly 41 days!).
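
The unit conversions behind those figures are simple enough to spell out. This short sketch only redoes the arithmetic on the two numbers quoted above.

```python
# Values quoted from the describe() output discussed above.
p75_seconds = 1075
max_seconds = 3526282

p75_minutes = p75_seconds / 60   # seconds -> minutes
max_hours = max_seconds / 3600   # seconds -> hours
max_days = max_hours / 24        # hours -> days

print(round(p75_minutes, 1), round(max_hours, 1), round(max_days, 1))
```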

In [None]:
sns.distplot(train['passenger_count'])

In [None]:
train['passenger_count'].describe()

So no trip has more than nine passengers, which is plausible, and the overwhelming majority of trips have just one passenger.

Let's remove the most extreme trip duration outliers by keeping only trips shorter than 500,000 seconds (about 139 hours).

In [None]:
train = train[train['trip_duration'] < 500000]
sns.distplot(train['trip_duration'])
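
When filtering like this, it helps to check how many rows the threshold actually removes. Here is a minimal sketch on a toy frame; the durations are made up apart from the maximum quoted earlier, and the threshold mirrors the one used above.

```python
import pandas as pd

# Toy durations in seconds; only the last value is taken from the text above.
toy = pd.DataFrame({'trip_duration': [300, 900, 1075, 86000, 3526282]})

threshold = 500000  # same cutoff as in the cell above
mask = toy['trip_duration'] < threshold  # True for rows we keep
filtered = toy[mask]

n_removed = len(toy) - len(filtered)
print('Removed {} of {} rows'.format(n_removed, len(toy)))
```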

Let's take a look at the distribution again, this time restricted to trip durations of less than 5000 seconds (about 83 minutes).

In [None]:
train_below_5000 = train[train['trip_duration'] < 5000]
sns.distplot(train_below_5000['trip_duration'])

Let's now look at a heatmap of the correlations.

In [None]:
sns.heatmap(train.drop(['id','dropoff_datetime', 'pickup_datetime', 'store_and_fwd_flag'], axis=1).corr())
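
Hard-coding the list of non-numeric columns to drop works fine here. As an alternative sketch, `select_dtypes` can pick out the numeric columns automatically before computing correlations. The toy frame below is made up for illustration.

```python
import pandas as pd

# Made-up frame with two numeric columns and one string column.
toy = pd.DataFrame({
    'passenger_count': [1, 2, 3, 4],
    'trip_duration': [100, 200, 300, 400],
    'store_and_fwd_flag': ['N', 'Y', 'N', 'N'],
})

# Keep only numeric columns, then compute pairwise Pearson correlations.
numeric = toy.select_dtypes(include='number')
corr = numeric.corr()
print(corr)
```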

# Step Three: Model Building

Let's try a simple Random Forest Regressor.

In [None]:
from sklearn.ensemble import RandomForestRegressor

In [None]:
train.head()

Let's go ahead and encode the binary store_and_fwd_flag categorical variable so our model can make use of it.

In [None]:

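def encode(flag):
    """Map the store_and_fwd_flag value 'Y' to 1 and anything else to 0."""
    if flag == 'Y':
        return 1
    else:
        return 0

# Pass the function directly; wrapping it in a lambda is unnecessary
train['store_and_fwd_flag'] = train['store_and_fwd_flag'].apply(encode)
train.head()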
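
For a binary flag like this, a vectorized encoding is usually faster than calling a Python function for every row. The sketch below shows two equivalent options on a made-up Series. Note one difference: `map` with a dictionary yields NaN for any value missing from the dictionary, whereas the comparison treats everything other than 'Y' as 0, like the function above.

```python
import pandas as pd

# Made-up flag values, for illustration only.
flags = pd.Series(['N', 'Y', 'N', 'N'])

# Option 1: explicit lookup table (unmapped values would become NaN).
encoded_map = flags.map({'Y': 1, 'N': 0})

# Option 2: boolean comparison cast to int (anything other than 'Y' becomes 0).
encoded_cmp = (flags == 'Y').astype(int)

print(encoded_map.tolist(), encoded_cmp.tolist())
```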

In [None]:
# Keep the encoded store_and_fwd_flag so the model can use it;
# drop only the identifier and the raw datetime columns
train_trim = train.drop(['id','dropoff_datetime', 'pickup_datetime'], axis=1)

Now that we have a trimmed data frame with only the columns we need, let's begin by testing a RandomForestRegressor. Please note that because the dataset is very large, I have chosen a small number of trees in order to stay within the Kaggle limits on running time. In a practical scenario, you should probably use more trees in the random forest (I usually use at least 100).

In [None]:
# I have chosen these parameters to fit the Kaggle limits on kernel run time
rf_regressor = RandomForestRegressor(n_estimators=10, max_features='sqrt', n_jobs=-1)
X = train_trim.drop('trip_duration', axis=1)
y = train_trim['trip_duration']

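Now we can run a validation test where we repeatedly split the data at random into training and test sets (70% / 30%) to evaluate the performance of our model. Note that this is repeated random splitting rather than k-fold cross validation, so the test sets of different rounds may overlap.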

In [None]:
from sklearn.model_selection import train_test_split
from random import randint
from sklearn import metrics
for fold in range(3):
    print('Testing model for fold {} ...'.format(fold + 1))
    randnum = randint(1, 102)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=randnum)
    rf_regressor.fit(X_train, y_train)
    predictions = rf_regressor.predict(X_test)
    print('Results for fold {}'.format(fold + 1))
    # scikit-learn metrics expect the true values first, then the predictions
    print('Mean squared logarithmic error: {}'.format(metrics.mean_squared_log_error(y_test, predictions)))
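
To make the reported metric less of a black box, here is a minimal NumPy sketch of the calculation as I understand it: the mean of the squared differences between `log(1 + prediction)` and `log(1 + truth)`. The helper `rmsle` is a hypothetical name introduced here, and it returns the square root of that mean, which is on the same log scale as the errors themselves. The sample values are made up.

```python
import numpy as np

def rmsle(y_true, y_pred):
    """Root mean squared logarithmic error (hypothetical helper for illustration)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    # log1p(x) computes log(1 + x), which keeps zero-valued targets well defined.
    squared_log_errors = (np.log1p(y_pred) - np.log1p(y_true)) ** 2
    return float(np.sqrt(np.mean(squared_log_errors)))

# Made-up trip durations in seconds and made-up predictions.
y_true = [300, 900, 1800]
y_pred = [330, 800, 2000]

score = rmsle(y_true, y_pred)
print(score)
```

Because the error is computed on logarithms, being off by 10% costs about the same for a short trip as for a long one, which suits a target with a long right tail like trip duration.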

# More coming soon!