It is commonplace in NYC to commute by taxi.  In this report, we will explore feature engineering and how the resulting features can be useful for predicting trip duration.

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here are several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list the files in the input directory

from subprocess import check_output
print(check_output(["ls", "../input"]).decode("utf8"))

# Any results you write to the current directory are saved as output.

First we load the data and look at the features in the training set.

In [3]:
train = pd.read_csv('../input/train.csv')
test = pd.read_csv('../input/test.csv')

for col in train.columns:
    print(col)

We see that we have pickup and dropoff times, as well as longitude and latitude for both locations.  Let's create some features based on these.

First, let's look at pickup times.

## Temporal Analysis ##

In [4]:
train[['pickup_datetime']].head(5)

From this, we see that each pickup time is a full timestamp stored as a string.  To look for temporal patterns, let's convert these strings into timestamps and then extract the month, day of the week and hour of each pickup.

In [5]:
def toDateTime( df ):
    
    df['pickup_datetime'] = pd.to_datetime(df['pickup_datetime'])
    
    df['month'] = df['pickup_datetime'].dt.month
    df['hour'] = df['pickup_datetime'].dt.hour
    # .dt.weekday_name was removed in pandas 1.0; .dt.day_name() returns the same day names
    df['day_week'] = df['pickup_datetime'].dt.day_name()
    
    return df

Now let's apply this function to the training and test sets, so that we can group by month, hour and day of the week and look at the summary statistics of the trip duration.

In [6]:
train = toDateTime(train)
test = toDateTime(test)

Before we look at the distribution or summary statistics of the trip duration, let's log-transform it.  We use log1p, which computes log(1 + x), so that a duration of zero is still well defined.

In [7]:
train['trip_duration'] = np.log1p(train['trip_duration'])
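
Because the target is now on a log scale, anything fitted to it works in log1p units, and values have to be mapped back to seconds with the inverse function, `np.expm1`. The following is a minimal sketch of that round trip; the durations and the names `durations_s`, `log_durations` and `recovered` are made up for illustration and do not come from the dataset.

```python
import numpy as np

# Hypothetical trip durations in seconds (illustrative values only)
durations_s = np.array([0.0, 60.0, 455.0, 3600.0, 86400.0])

# log1p(x) = log(1 + x): defined at 0 and compresses the long right tail
log_durations = np.log1p(durations_s)

# expm1(y) = exp(y) - 1 is the inverse, mapping log-scale values back to seconds
recovered = np.expm1(log_durations)
```

The same call, `np.expm1`, would be applied to any log-scale prediction before reporting it in seconds.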

Now let's look at the summary statistics of the log trip duration by month.

In [32]:
train.groupby('month')['trip_duration'].describe()

From this, we see that the outliers lie in January and February.  This is most likely due to winter weather in NYC, such as snow flurries or storms during those months.

Now let's group by hour.

In [8]:
train.groupby('hour')['trip_duration'].describe()

This shows us that the outliers lie in the early hours of the morning and the late hours of the night.

Finally, let's look at the differences by day of the week.

In [None]:
train.groupby('day_week')['trip_duration'].describe()
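
One thing to keep in mind is that grouping by day name sorts the rows alphabetically (Friday, Monday, Saturday, ...), which makes the table harder to read. A possible workaround, sketched here on a small stand-in frame, is to turn the column into an ordered categorical first. The frame `demo` and the list `day_order` are hypothetical names introduced for this example; only the column names match the notebook.

```python
import pandas as pd

# Calendar order for the day names
day_order = ['Monday', 'Tuesday', 'Wednesday', 'Thursday',
             'Friday', 'Saturday', 'Sunday']

# Small stand-in frame; the values are illustrative only
demo = pd.DataFrame({
    'day_week': ['Sunday', 'Monday', 'Friday', 'Monday'],
    'trip_duration': [6.1, 6.5, 6.8, 6.3],
})

# An ordered categorical makes groupby sort by calendar order, not alphabetically
demo['day_week'] = pd.Categorical(demo['day_week'], categories=day_order, ordered=True)

# observed=True keeps only the days that actually appear in the data
summary = demo.groupby('day_week', observed=True)['trip_duration'].mean()
```

Applied to `train`, the same conversion should also put the violins for `day_week` in calendar order.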

Now let's use violin plots to see the distribution of the log trip duration, starting by month.

In [11]:
import seaborn as sns

def violinPlot( df, by_col ):
    # One violin of log trip duration for each value of by_col
    sns.violinplot(x = by_col, y = 'trip_duration', data = df)

In [37]:
violinPlot(train, 'month')

As we can see, the distribution of the log trip duration is similar from month to month, apart from the outliers, and looks roughly normal in each month.
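
The visual impression of normality can also be checked numerically, for instance by looking at the skewness within each group, which should be close to zero for a symmetric distribution. The sketch below uses synthetic lognormal durations as a stand-in for the real data, so the frame `demo`, its parameters and the result `shape` are assumptions for illustration rather than results from this notebook.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic stand-in: lognormal durations in seconds, then log1p as in the notebook
demo = pd.DataFrame({
    'month': np.repeat([1, 2, 3], 5000),
    'trip_duration': np.log1p(rng.lognormal(mean=6.5, sigma=0.7, size=15000)),
})

# Mean, spread and skewness of the log duration within each month
shape = demo.groupby('month')['trip_duration'].agg(['mean', 'std', 'skew'])
```

On the real data, the same `agg` call on `train` would show whether the outliers pull the skewness away from zero in particular months.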

Let's look at the differences by day of the week.

In [12]:
violinPlot(train, 'day_week')

Let's consider the hour-to-hour differences in the log trip duration.

In [None]:
violinPlot(train, 'hour')

As can be seen, the distribution of the log trip duration within each hour looks roughly normal and shifts slightly as the hour changes.  This may make the hour a good feature to use for prediction.

Let's consider how the displacement between the pickup and dropoff locations affects the trip duration.

## Location – Location – Location ##

Let's create our y displacement, x displacement and distance variables.

In [39]:
def locationFeatures( df ):
    #displacement in degrees: longitude runs east-west (x), latitude north-south (y)
    df['x_dis'] = df['pickup_longitude'] - df['dropoff_longitude']
    df['y_dis'] = df['pickup_latitude'] - df['dropoff_latitude']
    
    #squared straight-line distance, in squared degrees
    df['dist_sq'] = (df['y_dis'] ** 2) + (df['x_dis'] ** 2)
    
    #straight-line distance, in degrees
    df['dist_sqrt'] = df['dist_sq'] ** 0.5
    
    return df

In [40]:
train = locationFeatures(train)
test = locationFeatures(test)
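
Note that these features are measured in degrees, and near NYC one degree of longitude covers less ground than one degree of latitude. If a distance in kilometres is preferred, one common alternative (not what this notebook uses) is the haversine great-circle distance. The sketch below is a minimal version under that assumption; `haversine_km` and `EARTH_RADIUS_KM` are names introduced here for illustration.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, an approximation

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between points given in decimal degrees."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    # Haversine formula
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))

# One degree of latitude versus one degree of longitude near Manhattan
one_deg_lat = haversine_km(40.75, -73.98, 41.75, -73.98)
one_deg_lon = haversine_km(40.75, -73.98, 40.75, -72.98)
```

Because the function only uses NumPy ufuncs, it should also accept whole pandas columns, for example the pickup and dropoff latitude and longitude columns of `train`.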

We will create scatter plots of each location feature against the log trip duration, which goes on the x axis.

The direction in which the taxi is travelling, uptown or downtown, may be important to our analysis.

In [43]:
train.plot(x = 'trip_duration', y = 'y_dis', kind = 'scatter')

In [44]:
train.plot(x = 'trip_duration', y = 'x_dis', kind = 'scatter')

In [45]:
train.plot(x = 'trip_duration', y = 'dist_sq', kind = 'scatter')

In [46]:
train.plot(x = 'trip_duration', y = 'dist_sqrt', kind = 'scatter')

In each of the plots, the location feature varies widely where the log trip duration is around 8, while at other durations its values stay much closer together.

## Conclusion ##

This is just the beginning of this notebook, and I will continue to develop it and post my findings here.  If you think this is interesting or has been informative, please upvote.