# **EXECUTIVE SUMMARY**    
The objective of this kernel is to build a machine learning model that will predict taxicab trip durations based on 2016 NYC Yellow Cab trip record data.
To that end, I will demonstrate four phases of the data science pipeline:
1. **[Preprocessing](#prep)**: Clean and transform the data into a usable format for analysis.
2. **[Exploratory Analysis](#eda)**: Perform exploratory analysis to identify the best features to be used for modeling.  Graphs have been stylized to replicate visualizations published by the popular data science blog "FiveThirtyEight".  A full tutorial can be found [here](https://www.dataquest.io/blog/making-538-plots/).
3. **[Algorithm Development](#ml)**: Train, test, and refine various models to predict the target variable.  Given that our dependent variable `trip_duration` is a continuous outcome, the regression algorithms to be prototyped are as follows:
   - [Multivariate Linear Regression](#linear)  
   - [Decision Tree](#tree)
   - [Random Forest](#random)
4. **[Model Deployment](#deployment)**: Apply the best performing model to the test set for contest submission.


# **ABOUT THE DATA**  
The dataset is based on the 2016 NYC Yellow Cab trip record data made available in BigQuery on Google Cloud Platform.  Its variables are as follows:

| **Variable Name** | **Description** |
| ------------------ | ------------- |
| id | a unique identifier for each trip |
| vendor_id | a code indicating the provider associated with the trip record |
| pickup_datetime | date and time when the meter was engaged |
| dropoff_datetime | date and time when the meter was disengaged |
| passenger_count | the number of passengers in the vehicle (driver entered value) |
| pickup_longitude | the longitude where the meter was engaged |
| pickup_latitude | the latitude where the meter was engaged |
| dropoff_longitude | the longitude where the meter was disengaged |
| dropoff_latitude | the latitude where the meter was disengaged |
| store_and_fwd_flag | indicates whether the trip record was held in vehicle memory before sending to the vendor because the vehicle did not have a connection to the server (Y = store and forward; N = not a store and forward trip) |
| trip_duration | duration of the trip in seconds |



# <a id="prep"></a>**1. DATA PRE-PROCESSING**
First, the data will be loaded and cleaned into a usable format for analysis. Specifically, I'll need to address:  
 - [missing data](#missing)
 - [outliers](#outliers)  
 - [data types](#types)
 - [feature engineering](#engineering)


In [None]:
# Imports
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import matplotlib.mlab as mlab
import pandas as pd
import seaborn as sns
import sklearn
import warnings
warnings.filterwarnings("ignore")

# Settings
import matplotlib
matplotlib.style.use('fivethirtyeight')
plt.rcParams['figure.figsize'] = (8.5, 5)
plt.rcParams["patch.force_edgecolor"] = True
pd.set_option('display.float_format', lambda x: '%.3f' % x)
sns.mpl.rc("figure", figsize=(8.5,5))

# Read data
train_data = pd.read_csv('../input/train.csv')

# View data
train_data.head()

## <a id="missing"></a> Data Pre-Processing: Missing Data 
After loading the data, I'll examine its structure to identify any missing observations that will need to be addressed.

In [None]:
# Data Shape
print('Data Shape',train_data.shape)
train_data.info()

Based on the entry totals above, there are no missing observations  to be imputed. <br/>
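
The same check can be made explicit rather than read off the `info()` totals. Below is a minimal sketch on a small made-up frame (`toy`) standing in for `train_data`; the helper name `missing_summary` is hypothetical and not part of the original kernel.

```python
import numpy as np
import pandas as pd

def missing_summary(df):
    """Return the count and share of missing values per column."""
    counts = df.isnull().sum()
    return pd.DataFrame({'n_missing': counts, 'pct_missing': counts / len(df)})

# Toy frame standing in for train_data (values are made up for illustration)
toy = pd.DataFrame({'passenger_count': [1, 2, np.nan, 1],
                    'trip_duration': [455, 663, 2124, 429]})
summary = missing_summary(toy)
print(summary)
```

Applied to `train_data`, a column of zeros in `n_missing` would confirm that nothing needs to be imputed.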
 
## <a id="outliers"></a> Data Pre-Processing: Outliers
Next, I will look at a statistical summary of the data to identify any obvious outliers.

In [None]:
# Statistical summary
train_data.describe().transpose()

**passenger_count**  
The `passenger_count` variable has a minimum value of 0 passengers, which does not make sense in the context of this business case.  These observations are most likely errors and will need to be removed from the dataset.    

Another red flag is that `passenger_count` has a maximum value of 9 passengers, which is highly unlikely for an NYC taxicab.   According to the NYC Taxi & Limousine Commission, the maximum number of people allowed in a yellow taxicab, by law, is five passengers (in a five passenger taxicab).  There are exceptions for passengers under the age of 7 who may sit on an adult's lap. However, it is unlikely that a full 5 person taxicab would have enough children under the age of 7 on board to yield a passenger count as high as 9.   This observation is likely an error and will also be removed from the dataset.   

**Longitude and Latitude Coordinates**  
Based on different coordinate estimates of New York City, the latitude and longitude ranges are roughly as follows:  
- Latitude is between 40.7128 and 40.748817
- Longitude is between -74.0059 and -73.968285

The statistical summary of pick-up and drop-off coordinates shows max and min observations that fall well outside this range.  I will exclude these data points, as this analysis is limited to New York City proper.  Because the estimates above cover only a small part of the city, the filter in the next code cell uses a wider bounding box: latitude 40.63 to 40.85 and longitude -74.03 to -73.75.
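
The bounding-box filter can also be written as a single boolean mask. This is a sketch only: `within_bounds` is a hypothetical helper, the toy coordinates are made up, and the bounds are the wider ones used in this notebook's filtering cell.

```python
import pandas as pd

def within_bounds(df, lon_range, lat_range):
    """Boolean mask: True where both pickup and dropoff fall inside the box."""
    mask = pd.Series(True, index=df.index)
    for prefix in ('pickup', 'dropoff'):
        # Series.between is inclusive on both ends by default
        mask &= df[prefix + '_longitude'].between(*lon_range)
        mask &= df[prefix + '_latitude'].between(*lat_range)
    return mask

# Toy rows: one inside the box, one with a far-off dropoff, one entirely outside
toy = pd.DataFrame({
    'pickup_longitude':  [-73.98, -73.98, -121.93],
    'pickup_latitude':   [40.75, 40.75, 37.39],
    'dropoff_longitude': [-73.96, -72.00, -121.93],
    'dropoff_latitude':  [40.77, 40.77, 37.39],
})
inside = within_bounds(toy, lon_range=(-74.03, -73.75), lat_range=(40.63, 40.85))
toy_clean = toy[inside]
print(toy_clean)
```

Building one mask and indexing once avoids creating eight intermediate copies of the frame, which may matter on a dataset of this size.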

**trip_duration**  
Lastly, there are unusual observations present in our target variable, `trip_duration`.  A max `trip_duration` of 3526282.00 sec (~980 hours) is not a realistic trip time, a clear indication that outliers are present in the data.  A systematic way to remove these outliers is to exclude all data points that are more than a specified number of standard deviations away from the mean.  In this case, I will remove `trip_duration` observations that are more than two standard deviations away from the mean duration time, 959.492 sec (~15.99 min).
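
The standard-deviation rule described above can be sketched as a small reusable mask. The helper name `within_k_std` and the toy durations are assumptions for illustration only.

```python
import pandas as pd

def within_k_std(series, k=2):
    """Boolean mask: True where a value lies within k standard deviations of the mean."""
    mean = series.mean()
    std = series.std(ddof=0)  # ddof=0 matches the default of np.std
    return (series - mean).abs() <= k * std

# Toy durations in seconds, with one extreme value
durations = pd.Series([400, 500, 600, 700, 800, 900, 1000, 50000])
keep = within_k_std(durations, k=2)
filtered = durations[keep]
print(filtered.tolist())
```

One caveat with this rule: the mean and standard deviation are themselves inflated by the extreme values, so the cutoff can end up fairly lenient and some long trips may survive the filter.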

In [None]:
# Remove passenger_count outliers
train_data = train_data[train_data['passenger_count']>0]
train_data = train_data[train_data['passenger_count']<9]

# train_data = train_data[train_data['pickup_longitude'] <= -73.968285]
# train_data = train_data[train_data['pickup_longitude'] >= -74.0059]
# train_data = train_data[train_data['pickup_latitude'] <= 40.748817]
# train_data = train_data[train_data['pickup_latitude'] >= 40.7128]
# train_data = train_data[train_data['dropoff_longitude'] <= -73.968285]
# train_data = train_data[train_data['dropoff_longitude'] >= -74.0059]
# train_data = train_data[train_data['dropoff_latitude'] <= 40.748817]
# train_data = train_data[train_data['dropoff_latitude'] >= 40.7128]

# Remove coordinate outliers
train_data = train_data[train_data['pickup_longitude'] <= -73.75]
train_data = train_data[train_data['pickup_longitude'] >= -74.03]
train_data = train_data[train_data['pickup_latitude'] <= 40.85]
train_data = train_data[train_data['pickup_latitude'] >= 40.63]
train_data = train_data[train_data['dropoff_longitude'] <= -73.75]
train_data = train_data[train_data['dropoff_longitude'] >= -74.03]
train_data = train_data[train_data['dropoff_latitude'] <= 40.85]
train_data = train_data[train_data['dropoff_latitude'] >= 40.63]

# Remove trip_duration outliers
trip_duration_mean = np.mean(train_data['trip_duration'])
trip_duration_std = np.std(train_data['trip_duration'])
train_data = train_data[train_data['trip_duration']<=trip_duration_mean + 2*trip_duration_std]
train_data = train_data[train_data['trip_duration']>= trip_duration_mean - 2*trip_duration_std]

# Confirm removal
train_data.describe().transpose()

## <a id="types"></a> **Data Pre-Processing: Data Types**  
Next, I'll do a quick review of the data types to confirm that the variables are being assigned correctly.

In [None]:
train_data.info()

The pickup and dropoff timestamp variables are being stored as generic `object` columns (strings).  These features should be converted to datetime objects to allow for easier feature engineering and analysis later on.  

In [None]:
# Convert timestamps to date objects
train_data['pickup_datetime'] = pd.to_datetime(train_data.pickup_datetime) # Pickups
train_data['dropoff_datetime'] = pd.to_datetime(train_data.dropoff_datetime) # Drop-offs

# Confirm changes
train_data.info()

## <a id="engineering"></a> **Data Pre-Processing: Feature Engineering**  
The `pickup_datetime` and `dropoff_datetime` variables both combine date and time observations into the same column.  I will split this date and time information into separate columns to allow for easier analysis downstream.    

The hour and day of week a passenger is picked up may influence trip duration, so I will also extract these features from the `pickup_datetime` variable.  This is not necessary for `dropoff_datetime` because the drop-off day and hour are recorded *after* the trip is completed, so they would not be available when predicting `trip_duration`.   

In [None]:
# Delimit pickup_datetime variable 
train_data['pickup_date'] = train_data['pickup_datetime'].dt.date # Extract date
train_data['pickup_time'] = train_data['pickup_datetime'].dt.time # Extract time

# Delimit dropoff_datetime variables
train_data['dropoff_date'] = train_data['dropoff_datetime'].dt.date # Extract date
train_data['dropoff_time'] = train_data['dropoff_datetime'].dt.time # Extract time

# Additional pickup features
train_data['pickup_month'] = train_data['pickup_datetime'].dt.month # Extract month
# train_data['pickup_month'] = train_data.pickup_datetime.dt.to_period('M') # Extract yearmonth
#train_data['pickup_YYYYMM'] = train_data['pickup_datetime'].apply(lambda x: x.strftime('%Y%m')) # Extract yearmonth
train_data['pickup_hour'] = train_data['pickup_datetime'].dt.hour # Extract hour
train_data['pickup_weekday'] = train_data['pickup_datetime'].dt.dayofweek # Extract day of week

# Drop concatenated timestamp columns
train_data.drop(['pickup_datetime'], axis = 1, inplace = True)
train_data.drop(['dropoff_datetime'], axis = 1, inplace = True)

# Confirm changes
train_data.columns

# <a id="eda"></a>**2.  EXPLORATORY DATA ANALYSIS**
## Target Variable: trip_duration
The target variable we are trying to predict is `trip_duration`. I'll need to examine its distribution to see if there are transformations that need to be applied.  


In [None]:
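# Mean of trip_duration
mu = train_data['trip_duration'].mean()

# Standard deviation of trip_duration
sigma = train_data['trip_duration'].std()
num_bins = 100

# Histogram 
fig = plt.figure(figsize=(8.5, 5))
# `normed` was removed in matplotlib 3.1; `density=True` is its replacement
n, bins, patches = plt.hist(train_data['trip_duration'], num_bins, density=True,
                           edgecolor = 'black', lw = 1, alpha = .40)
# Normal distribution (mlab.normpdf was removed in matplotlib 3.1, so scipy is used instead)
from scipy.stats import norm
y = norm.pdf(bins, mu, sigma)
plt.plot(bins, y, 'r--', linewidth=2)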
plt.xlabel('trip_duration')
plt.ylabel('Probability density')

# Adding a title
plt.title(r'$\mathrm{Trip\ duration\ skewed \ to \ the \ right:}\ \mu=%.3f,\ \sigma=%.3f$'%(mu,sigma))
plt.grid(True)
#fig.tight_layout()
plt.show()

# Statistical summary
train_data.describe()[['trip_duration']].transpose()

There is a clear indication that `trip_duration` is highly skewed to the right based on two key signals:  
- Skewness value > 1.0
- Long right tail

The median `trip_duration` is only 655 seconds (~11 min).  However, there are `trip_duration` observations as high as 11,411 sec (~3.17 hours) that were not removed previously, as they are still within two standard deviations of the mean (our specified outlier cutoff).  As a result, these high `trip_duration` observations are skewing the distribution to the right.  

Thus, applying a log transformation to `trip_duration` should bring its distribution closer to normal and reduce the influence of these high observations in the right tail.     
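
To make the skewness signal and the effect of the log transformation concrete, here is a minimal sketch. It uses synthetic lognormal data as a stand-in for `trip_duration` (an assumption; the real column is not reproduced here), and `np.log1p` is one reasonable choice of log transform.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed durations, standing in for train_data['trip_duration']
rng = np.random.RandomState(0)
durations = pd.Series(rng.lognormal(mean=6.5, sigma=0.7, size=10000))

skew_raw = durations.skew()           # sample skewness before the transform
log_durations = np.log1p(durations)   # log(1 + x), well behaved for values near zero
skew_log = log_durations.skew()       # sample skewness after the transform

print('skewness before: %.3f, after: %.3f' % (skew_raw, skew_log))
```

If a model is trained on the log-transformed target, its predictions would need to be mapped back to seconds with `np.expm1` before they are interpreted or submitted.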

## **Feature Variables**
The remaining columns in our dataset are the 'feature variables'.  These are the variables that will be fed into our machine learning model to predict the dependent variable, `trip_duration`.   I will explore each one of these features to better understand the information it contains and what transformations are needed before we can proceed to the learning process.  

In [None]:
# Feature names
train_data.columns

**Id**  
The `id` variable is a unique identifier of each trip.  I will explore how this feature varies over time (if at all).

In [None]:
# Summarize total trips by day
pickups_by_day = train_data.groupby('pickup_date').count()['id']

# Create graph
pickups_graph = pickups_by_day.plot(x = 'pickup_date', y = 'id', figsize = (8.5,5),legend = True)

# Customize tick size
pickups_graph.tick_params(axis = 'both', which = 'major', labelsize = 12)

# Bold horizontal line at y = 0
pickups_graph.axhline(y = 0, color = 'black', linewidth = 1.3, alpha = .7)

# Customize tick labels of the y-axis
#pickups_graph.set_yticklabels(labels = [-10, '2000   ', '4000   ', '6000   ', '8000   ', '10000   '])

# Add an extra vertical line by tweaking the range of the x-axis
pickups_graph.set_xlim(left = '2015-12-31', right = '2016-06-30')

# Remove the label of the x-axis
pickups_graph.xaxis.label.set_visible(False)

# Add signature bar
pickups_graph.text(x = '2015-12-15', # Adjusts left side of signature bar; has to be in the same coordinates as the x-axis
               y = -2500, 
               s = '    ©KAGGLE                                          Source: NYC Taxi and Limousine Commission (TLC)   ', # copyright symbol ALT + 0169
              fontsize = 14, color = '#f0f0f0', backgroundcolor = 'grey')

# Adding a title and a subtitle
pickups_graph.text(x = '2015-12-18', y = 11800,
                   s = "Dramatic drop in total trips in late January or early February",
                   fontsize = 20, weight = 'bold', alpha = .90)

pickups_graph.text(x = '2015-12-18', y = 11000, 
                   s = 'Decline is isolated to a specific day so may be more than just seasonal effects.',
                   fontsize = 14, alpha = .85)
pickups_graph.text(x = '2016-01-27', y = 1500, s = 'What happened?',weight = 0, rotation = 0, backgroundcolor = '#f0f0f0', size = 14)
plt.show()

**Trip Id Over Time**  
There is an unusual drop in the total number of `id` values around late January.  At first glance, it is easy to assume that this could be just seasonality.   However, the decrease is much more drastic relative to other winter days before and after the drop, and it looks to be isolated around a single day.  Thus, a more plausible explanation for this outlier could be a data entry error or some other extraneous event.  

In [None]:
# Identify where drop occurred (the day with the fewest trips)
train_data.groupby('pickup_date').count()['id'].sort_values(ascending = True).head(1)

Upon further investigation, the drop occurred on January 23, 2016, the date of New York's first big snowstorm of the year, when the city was hit with 26.8 inches of snowfall.  Although there was a significant decline in the overall number of taxi rides, the median `trip_duration` of the rides that *were* given that day, 456.5 seconds, does not seem to be out of the ordinary.  
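
The lookup behind this paragraph (find the quietest day, then inspect its median duration) can be sketched as follows. The toy frame and its values are made up; only the column names mirror `train_data`.

```python
import datetime
import pandas as pd

# Toy frame: three trips on Jan 22 and two on Jan 23 (values are illustrative)
toy = pd.DataFrame({
    'pickup_date': [datetime.date(2016, 1, 22)] * 3 + [datetime.date(2016, 1, 23)] * 2,
    'trip_duration': [600, 700, 650, 400, 500],
})

# Trip count and median duration per day
daily = toy.groupby('pickup_date')['trip_duration'].agg(['count', 'median'])

# Day with the fewest trips, and its summary row
quietest_day = daily['count'].idxmin()
print(quietest_day, daily.loc[quietest_day].to_dict())
```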

Although the `id` variable provides interesting insight about trips over time, the actual id of a trip record will not be useful in predicting `trip_duration` in our algorithm.  Thus, I will remove this feature when it comes time to train the model.  <br/><br/>

**Vendor Id**  
The `vendor_id` variable is a code indicating the provider associated with each trip record. I will examine if there is a provider that takes longer trips relative to the others. 

In [None]:
# Create boxplot
plt.figure(figsize=(8.5,5))
vendor_graph = sns.boxplot(x = 'vendor_id', y = 'trip_duration', data = train_data, 
                          palette = 'gist_rainbow', linewidth = 2.3)

# Customize tick labels of the y-axis
vendor_graph.set_yticklabels(labels = [-10, '0  ', '2000  ', '4000  ', '6000  ', '8000  ', '10000  ','12000 s'])

# Bolding horizontal line at y = 0
vendor_graph.axhline(y = 0, color = 'black', linewidth = 1.3, alpha = .70)

# Remove the label of the x-axis
vendor_graph.xaxis.label.set_visible(False)
vendor_graph.yaxis.label.set_visible(False)

# Add signature bar
vendor_graph.text(x = -.66, # Adjusts left side of signature bar
               y = -2500,  
               s = '   ©KAGGLE                                                 Source: NYC Taxi and Limousine Commission (TLC)   ', # copyright symbol ALT + 0169
              fontsize = 14, color = '#f0f0f0', backgroundcolor = 'grey') 

# # Adding a title and a subtitle
vendor_graph.text(x =-.66, y = 13800, s = "Trip durations are similar between NYC taxi vendors",
               fontsize =20 , weight = 'bold', alpha = .90)
vendor_graph.text(x = -.66, y = 13000.3, 
               s = 'Both have a median trip time ~650 seconds with many outliers',
              fontsize = 14, alpha = .85)
plt.show()

# Statistical summary
train_data.groupby('vendor_id')['trip_duration'].describe()


The median `trip_duration` is similar between the two vendors. However, it is worth noting that each vendor also has a significant number of outliers beyond the upper fence.  That is, both have observations that exceed the upper quartile by more than 1.5x the interquartile range.   
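
The upper fence mentioned above is Q3 + 1.5 x IQR. A minimal sketch of computing it per vendor and counting the points beyond it, using a made-up frame and a hypothetical helper `upper_fence`:

```python
import pandas as pd

def upper_fence(series, whisker=1.5):
    """Upper boxplot fence: Q3 + whisker * IQR."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    return q3 + whisker * (q3 - q1)

# Toy frame standing in for train_data (values are made up)
toy = pd.DataFrame({
    'vendor_id': [1, 1, 1, 1, 1, 2, 2, 2, 2, 2],
    'trip_duration': [300, 500, 650, 800, 5000, 350, 550, 660, 820, 900],
})

# One fence per vendor, then flag rows above their own vendor's fence
fences = toy.groupby('vendor_id')['trip_duration'].apply(upper_fence)
is_outlier = toy['trip_duration'] > toy['vendor_id'].map(fences)
n_outliers = is_outlier.groupby(toy['vendor_id']).sum()
print(fences.to_dict(), n_outliers.to_dict())
```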

**store_and_fwd_flag**  
The `store_and_fwd_flag` variable indicates whether the trip record was stored in vehicle memory before forwarding to the vendor because the vehicle did not have a direct connection to the server.  An immediate question that comes to mind is if these 'Store and Forward' trips are contributing to the `trip_duration` outliers for the two vendors noted above.  

In [None]:
# Create boxplot
plt.figure(figsize=(8.5,5))
vendor_graph = sns.boxplot(x = 'vendor_id', y = 'trip_duration', data = train_data, 
                          orient = 'v',color = 'lightgrey', linewidth = 2.3)
plt.setp(vendor_graph.artists, alpha = 0.5)

# Create strip plot
sns.stripplot(data = train_data, x = 'vendor_id', y = 'trip_duration', jitter = 1, size = 5,
             edgecolor = 'black', linewidth = .2,palette = 'gist_rainbow_r',hue = 'store_and_fwd_flag')

# Customize tick size
vendor_graph.tick_params(axis = 'both', which = 'major', labelsize = 12)

# Customize tick labels of the y-axis
vendor_graph.set_yticklabels(labels = [-10, '0  ', '2000  ', '4000  ', '6000  ', '8000  ', '10000  ','12000 s'])

# Bolding horizontal line at y = 0
vendor_graph.axhline(y = 0, color = 'black', linewidth = 1.3, alpha = .70)

# Remove the label of the x-axis
vendor_graph.xaxis.label.set_visible(False)
vendor_graph.yaxis.label.set_visible(False)

# Add signature bar
vendor_graph.text(x = -.66, # Adjusts left side of signature bar
               y = -2500,  
               s = '   ©KAGGLE                                                 Source: NYC Taxi and Limousine Commission (TLC)   ', # copyright symbol ALT + 0169
              fontsize = 14, color = '#f0f0f0', backgroundcolor = 'grey') 

# Adding a title and a subtitle
vendor_graph.text(x =-.66, y = 13800, s = 'Store-and-forward trips found for Vendor 1 only',
               fontsize =20 , weight = 'bold', alpha = .90)
vendor_graph.text(x = -.66, y = 13000.3, 
               s = 'However, server connection does not have much bearing on the high number of outliers',
              fontsize = 14, alpha = .85)
# Format legend
vendor_graph.legend(title = 'store_and_fwd_flag', bbox_to_anchor = (.80,1),loc = 2, fontsize=12)
plt.show()

# Statistical summary
train_data.groupby(['vendor_id','store_and_fwd_flag'])['store_and_fwd_flag'].count().unstack().fillna(0)


Only Vendor 1 had trip records that were stored and forwarded to the vendor rather than sent directly to the server.  Initially, I thought these 'Store and Forward' trips would be responsible for most of that vendor's outliers, but these trips account for only a small portion of Vendor 1 observations.  Most of the outliers are "normal" trips that were sent directly to the vendor's server.  Thus, `store_and_fwd_flag` may not be informative in predicting `trip_duration` times.   <br/> <br/>

**Passenger Count**  
The `passenger_count` variable is the number of passengers in the vehicle as entered by the driver.  The assumption is that trips with more passengers are inherently longer due to more stops, but I will explore this variable to test this assumption.

In [None]:
# Settings
import matplotlib
matplotlib.style.use('fivethirtyeight')

# Create boxplot
plt.figure(figsize=(8.5,5))
passenger_graph = sns.boxplot(x = 'passenger_count', y = 'trip_duration', data = train_data, 
                          palette = 'gist_rainbow', linewidth = 2.3)

# Customize tick size
passenger_graph.tick_params(axis = 'both', which = 'major', labelsize = 12)

# Customize tick labels of the y-axis
passenger_graph.set_yticklabels(labels = [-10, '0  ', '2000  ', '4000  ', '6000  ', '8000  ', '10000  ','12000 s'])

# Bolding horizontal line at y = 0
passenger_graph.axhline(y = 0, color = 'black', linewidth = 1.3, alpha = .70)

# Add an extra vertical line by tweaking the range of the x-axis
#month_graph.set_xlim(left = -1, right = 6)

# Remove the label of the x-axis
passenger_graph.xaxis.label.set_visible(False)
passenger_graph.yaxis.label.set_visible(False)

# Add signature bar
passenger_graph.text(x = -1.1, # Adjusts left side of signature bar
               y = -2500,  
               s = '   ©KAGGLE                                                 Source: NYC Taxi and Limousine Commission (TLC)   ', # copyright symbol ALT + 0169
              fontsize = 14, color = '#f0f0f0', backgroundcolor = 'grey') 

# Alternative signature bar
# fte_graph.text(x = 1967.1, y = -6.5,
#               s = '________________________________________________________________________________________________________________',
#               color = 'grey', alpha = .70)
# fte_graph.text(x = 1966.1, y = -9,
#               s ='   ©DATAQUEST                                                                               Source: National Center for Education Statistics   ', # copyright symbol ALT + 0169
#               fontsize = 14, color = 'grey', alpha = .7)

# # Adding a title and a subtitle
passenger_graph.text(x =-1.05, y = 13800, s = "Passenger count does not have much effect on trip duration",
               fontsize =20 , weight = 'bold', alpha = .90)
passenger_graph.text(x = -1.05, y = 13000.3, 
               s = 'Median trip times remain similar despite more passengers being aboard',
              fontsize = 14, alpha = .85)
plt.show()

# Statistical summary
train_data.groupby('passenger_count')['trip_duration'].describe().transpose()

Surprisingly, the median `trip_duration` does not vary much as `passenger_count` increases.  Trips with just one passenger seem to have more outliers than other trips.   <br/> <br/>

**Trip Duration by Pickup Hour and Day**  
I suspect that trips are longer on the weekends due to higher traffic levels, but I will need to explore whether there are other times during the week when `trip_duration` is higher than average.  

In [None]:
# Trips by Hour and Day of Week
trip_duration_median = train_data['trip_duration'].median()
plt.figure(figsize=(8.5,5))
pickup_hourday = train_data.groupby(['pickup_hour','pickup_weekday'])['trip_duration'].median().unstack()
hourday_graph = sns.heatmap(pickup_hourday[pickup_hourday>trip_duration_median],
                                   lw = .5, annot = True, cmap = 'GnBu', fmt = 'g',annot_kws = {"size":10} )
# Customize tick label size
hourday_graph.tick_params(axis = 'both', which = 'major', labelsize = 10)

# Customize tick labels of the x-axis
hourday_graph.set_xticklabels(labels = ['Mon', 'Tue', 'Wed','Thu','Fri','Sat','Sun'])

# Bolding horizontal line at y = 0
hourday_graph.axhline(y = 0, color = 'black', linewidth = 1.3, alpha = .70)

# Remove the label of the x-axis
hourday_graph.xaxis.label.set_visible(False)

# Add signature bar
hourday_graph.text(x = -.8,  y = -4,
                   s = ' ©KAGGLE                                          Source: NYC Taxi and Limousine Commission (TLC)   ',
                   fontsize = 14, color = '#f0f0f0', backgroundcolor = 'grey') 

# # Adding a title and a subtitle
hourday_graph.text(x =-.8, y = 27, s = "Trip durations vary greatly depending on day of week",
               fontsize =20 , weight = 'bold', alpha = .90)
hourday_graph.text(x =-.8, y = 25.5, 
               s = 'Median trip times longest during office hours and weekend nights',
              fontsize = 14, alpha = .85)

# plt.ylabel('pickup_hour (military time)')
# plt.xlabel('pickup_weekday (Mon - Sun)')
# plt.title('Median Trip Duration by Pickup Hour and Day of Week')
plt.show()

Trips tend to run longer than the median `trip_duration` of 655 seconds during the following parts of the week: 
- **Monday - Thursday Office Hours:** 8:00 am - 6:00 pm  
- **Thursday, Friday, Saturday Nights:** 6:00 pm - midnight
- **Early Saturday & Sunday Mornings:** 12:00 am - 1:00 am  
- **Sunday Afternoons:** 2:00 pm and 4:00 pm<br/><br/>

**Trip Duration by Month**    
Next, I'll examine if `trip_duration` varies by month due to seasonality.  



In [None]:
# Box plot of pickups by month
import matplotlib
matplotlib.style.use('fivethirtyeight')

# Create boxplot
plt.figure(figsize=(8.5,5))
month_graph = sns.boxplot(x = 'pickup_month', y = 'trip_duration', data = train_data, 
                          palette = 'gist_rainbow', linewidth = 2.3)

# Customize tick size
month_graph.tick_params(axis = 'both', which = 'major', labelsize = 12)

# Customize tick labels of the y-axis
month_graph.set_yticklabels(labels = [-10, '0  ', '2000  ', '4000  ', '6000  ', '8000  ', '10000  ','12000 s'])

# Bolding horizontal line at y = 0
month_graph.axhline(y = 0, color = 'black', linewidth = 1.3, alpha = .70)


# Add an extra vertical line by tweaking the range of the x-axis
#month_graph.set_xlim(left = -1, right = 6)

# Remove the label of the x-axis
month_graph.xaxis.label.set_visible(False)
month_graph.yaxis.label.set_visible(False)

# Add signature bar
month_graph.text(x = -1.1, # Adjusts left side of signature bar
               y = -2500,  
               s = '   ©KAGGLE                                                 Source: NYC Taxi and Limousine Commission (TLC)   ', # copyright symbol ALT + 0169
              fontsize = 14, color = '#f0f0f0', backgroundcolor = 'grey') 

# Alternative signature bar
# fte_graph.text(x = 1967.1, y = -6.5,
#               s = '________________________________________________________________________________________________________________',
#               color = 'grey', alpha = .70)
# fte_graph.text(x = 1966.1, y = -9,
#               s ='   ©DATAQUEST                                                                               Source: National Center for Education Statistics   ', # copyright symbol ALT + 0169
#               fontsize = 14, color = 'grey', alpha = .7)

# # Adding a title and a subtitle
month_graph.text(x =-1.05, y = 13800, s = "Month of transaction has minimal effect on trip duration",
               fontsize =20 , weight = 'bold', alpha = .90)
month_graph.text(x = -1.05, y = 13000.3, 
               s = 'Median trip times hover around ~650 seconds throughout the year',
              fontsize = 14, alpha = .85)
plt.show()

# Statistical summary
train_data.groupby('pickup_month')['trip_duration'].describe().transpose()

June has the highest median `trip_duration` overall, but only slightly.  Median trip times seem to hover around the 10-12 minute mark and do not vary much from month to month.  This may be an indication that the month feature will not be helpful in predicting our target variable, `trip_duration`.  <br/><br/>

**Plot Rides**   
Next, I will plot the pickup and drop off points of each taxi ride.  

In [None]:
longitude = list(train_data.pickup_longitude) + list(train_data.dropoff_longitude)
latitude = list(train_data.pickup_latitude) + list(train_data.dropoff_latitude)
plt.figure(figsize = (10,8))
plt.plot(longitude,latitude,'.',alpha = .40, markersize = .8)
plt.title('Trip Plots')
plt.show()

**Plot by Neighborhood**  
Using the KMeans algorithm, we can cluster the data points into groups that roughly correspond to different areas of NYC.  

In [None]:
# Create data frame of coordinates
loc_df = pd.DataFrame()
loc_df['longitude'] = longitude
loc_df['latitude'] = latitude

# Clusters of New York
from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters=15, random_state=2, n_init = 10).fit(loc_df)
loc_df['label'] = kmeans.labels_
loc_df = loc_df.sample(200000)
plt.figure(figsize = (12,7))
for label in loc_df.label.unique():
    plt.plot(loc_df.longitude[loc_df.label == label],loc_df.latitude[loc_df.label == label],'.', alpha = 0.8, markersize = 0.8)
plt.title('Clusters of New York')
plt.show()

**Relationships Between Variables**  
Now that each feature has been explored individually, I will examine how they relate to the target variable as well as each other.  Variables that are highly correlated with each other (correlation coefficient > .70) are likely to convey redundant information, and one of each such pair can be removed from the dataset.  Reducing the data's dimensionality in this way will make the data easier to work with and may improve model results.  
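
The .70 rule of thumb can be applied programmatically by listing every pair of numeric columns whose absolute correlation exceeds the threshold. This is a sketch on synthetic data; `highly_correlated_pairs` is a hypothetical helper, not part of the original kernel.

```python
import numpy as np
import pandas as pd

def highly_correlated_pairs(df, threshold=0.70):
    """List (feature_a, feature_b, r) for numeric column pairs with |r| above threshold."""
    corr = df.select_dtypes(include='number').corr()
    cols = corr.columns
    pairs = []
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):  # upper triangle only, so each pair appears once
            r = corr.iloc[i, j]
            if abs(r) > threshold:
                pairs.append((cols[i], cols[j], r))
    return pairs

# Synthetic frame: 'a_copy' is 'a' plus a little noise, 'b' is independent
rng = np.random.RandomState(0)
a = rng.normal(size=500)
toy = pd.DataFrame({'a': a,
                    'a_copy': a + rng.normal(scale=0.1, size=500),
                    'b': rng.normal(size=500)})
pairs = highly_correlated_pairs(toy)
print(pairs)
```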

In [None]:
# Correlations to trip_duration
corr = train_data.select_dtypes(include = ['float64', 'int64']).iloc[:, 1:].corr()
cor_dict = corr['trip_duration'].to_dict()
del cor_dict['trip_duration']
print("List the numerical features in descending order by their correlation with trip_duration:\n")
for ele in sorted(cor_dict.items(), key = lambda x: -abs(x[1])):
    print("{0}: {1}".format(*ele))
    
# Correlation matrix heatmap
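corrmat = train_data.select_dtypes(include = 'number').corr()  # numeric columns only; non-numeric columns can raise an error in recent pandas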
plt.figure(figsize=(12, 7))

# Number of variables for heatmap
k = 76
cols = corrmat.nlargest(k, 'trip_duration')['trip_duration'].index
cm = np.corrcoef(train_data[cols].values.T)

# Generate mask for upper triangle
mask = np.zeros_like(cm, dtype=bool)  # np.bool was removed in NumPy 1.24
mask[np.triu_indices_from(mask)] = True

sns.set(font_scale=1)
sns.heatmap(cm, mask=mask, cbar=True, annot=True, square=True,\
                 fmt='.2f',annot_kws={'size': 12}, yticklabels=cols.values,\
                 xticklabels=cols.values, cmap = 'coolwarm',lw = .1)
plt.show() 

**Correlations**  
Not surprisingly, the correlation coefficients among the coordinate features (`pickup_latitude`, `pickup_longitude`, `dropoff_latitude`, `dropoff_longitude`) indicate that a linear relationship exists between them.  However, these correlations are under .50 and are not strong enough to merit removal from the data set.    

There is a weak positive correlation between the longitude variables and `trip_duration`.  There is also a weak negative correlation between the latitude variables and `trip_duration`.    

Otherwise, there doesn't appear to be much of a *linear* relationship between our target variable and the remaining features.    

# <a id="ml"></a>**3. ALGORITHM DEVELOPMENT**  
Now that I have a better sense of the dataset's features and target variable, I will need to make sure the data is "model ready" before prototyping a machine learning algorithm.  Namely, I'll need to confirm that:
- there is no missing data
- all categorical features are encoded as numerical
- unnecessary features are removed
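
The first two checks in this list can be automated. A minimal sketch, assuming a hypothetical helper `model_ready_report` and a made-up frame; the third check (which features are unnecessary) remains a judgment call.

```python
import pandas as pd

def model_ready_report(df):
    """Summarize the checks that can be automated: missing values and non-numeric columns."""
    return {
        'n_missing': int(df.isnull().sum().sum()),
        'non_numeric_columns': list(df.select_dtypes(exclude='number').columns),
    }

# Toy frame standing in for train_data (values are made up)
toy = pd.DataFrame({'vendor_id': [1, 2],
                    'store_and_fwd_flag': ['N', 'Y'],
                    'trip_duration': [455, 663]})
before = model_ready_report(toy)

# Encode the flag as 0/1, then re-run the report
toy['store_and_fwd_flag'] = toy['store_and_fwd_flag'].map({'N': 0, 'Y': 1})
after = model_ready_report(toy)
print(before, after)
```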

In [None]:
# Check for categorical variables
train_data.head()

In [None]:
# Encode categorical variables
train_data['store_and_fwd_flag'] = train_data['store_and_fwd_flag'].map({'N':0,'Y':1})

**Drop Unnecessary Features**  
The machine learning algorithm will not be able to accept dates and times as inputs, so the following features will be removed prior to training: `pickup_date`, `pickup_time`, `dropoff_date`, and `dropoff_time`.  Instead, this information will be represented by the `pickup_month`, `pickup_hour`, and `pickup_weekday` features that were engineered during pre-processing.  The `id` column is dropped as well, as noted during the exploratory analysis.

In [None]:
# Remove unnecessary features
train_data.drop(['pickup_date','pickup_time','dropoff_date', 'dropoff_time','id'], 
                axis = 1, inplace = True)

In [None]:
train_data.columns

Now that the data is model-ready, we are finally ready for the fun part: building models! First, I will split the data into training and test sets.  Next, I will feed these sets into a number of regression algorithms to determine which learner performs best for the Kaggle submission.  

In [None]:
# Split
# Create matrix of features
X = train_data[['vendor_id', 'passenger_count', 'pickup_longitude',
       'pickup_latitude', 'dropoff_longitude', 'dropoff_latitude',
       'store_and_fwd_flag','pickup_month', 'pickup_hour',
       'pickup_weekday']] # double brackets!

# Create array of target variable 
y = train_data['trip_duration']

# Create train and test sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)

## <a id="linear"></a>  **Multivariate Linear Regression**
First, I will try [Linear Regression](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html) as it is the most widely-known and understood regression learner.  However, from the outset, it is important to underscore the assumptions we're making when using this model:
1. **Linear Relationship:** The relationship between the independent and dependent variables is linear.
2. **Multivariate Normality:** The residuals are normally distributed.  If not, a non-linear transformation (e.g., log-transformation) may be needed to fix the issue.  
3. **No or Little Multicollinearity:** The independent variables (features) are not highly correlated with each other.
4. **No Auto-Correlation:** Residuals are independent of one another (i.e., outcome is independent of a previous outcome).
5.  **Homoscedasticity:** The variance of the residuals is constant across the regression line.

These assumptions may or may not be appropriate for this particular data set.  As a rough check, we can see how well the Linear Regression performs.  
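
A more direct way to probe some of these assumptions is to inspect the residuals. The sketch below is one possible approach on synthetic data, not the taxi data: the noise is deliberately built to grow with `x`, so the homoscedasticity check should flag it, while the rows are independent, so the lag-1 correlation should stay near zero.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Synthetic data whose noise grows with x, which violates homoscedasticity
x = rng.uniform(1, 10, 2000)
y_toy = 3 * x + rng.normal(0, 1, 2000) * x

model = LinearRegression().fit(x.reshape(-1, 1), y_toy)
fitted = model.predict(x.reshape(-1, 1))
residuals = y_toy - fitted

# Compare residual spread for low vs. high fitted values
low = residuals[fitted <= np.median(fitted)]
high = residuals[fitted > np.median(fitted)]
spread_ratio = high.std() / low.std()

# Lag-1 correlation of residuals in row order (near 0 when rows are independent)
lag1 = np.corrcoef(residuals[:-1], residuals[1:])[0, 1]

print('Residual spread ratio (high/low fitted):', round(spread_ratio, 2))
print('Lag-1 residual correlation:', round(lag1, 3))
```

A spread ratio well above 1 suggests the residual variance is not constant across the regression line.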

[More Information About Assumptions](http://www.statisticssolutions.com/assumptions-of-linear-regression/)

During the training process, `GridSearchCV()` can be applied to search for the best parameters with which to train the model. For additional information, see the links below:

[Documentation](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html)  
[Scoring Parameters](http://scikit-learn.org/stable/modules/model_evaluation.html#scoring-parameter)  
[GridSearchCV Regression](https://stats.stackexchange.com/questions/153131/gridsearchcv-regression-vs-linear-regression-vs-stats-model-ols)
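
For reference, a grid search over a linear model might look like the sketch below. It uses synthetic data and a deliberately tiny grid, since `LinearRegression` has few tunable parameters; `fit_intercept` is used here only as an example, and the scoring choice is an assumption.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)

# Synthetic regression data with a non-zero intercept (not the taxi data)
X_toy = rng.normal(size=(200, 3))
y_toy = 5.0 + X_toy @ np.array([2.0, -1.0, 0.5]) + rng.normal(0, 0.1, 200)

# LinearRegression has few tunable parameters; fit_intercept is one of them
param_grid = {'fit_intercept': [True, False]}

grid = GridSearchCV(
    LinearRegression(),
    param_grid,
    scoring='neg_mean_squared_error',  # higher (closer to 0) is better
    cv=5,
)
grid.fit(X_toy, y_toy)

print('Best parameters:', grid.best_params_)
print('Best CV MSE:', -grid.best_score_)
```

After fitting, `grid.best_estimator_` holds a model refit on all of the supplied data with the winning parameters.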


In [None]:
#  Import model
from sklearn.linear_model import LinearRegression

#  Instantiate model object
lreg = LinearRegression()

# Fit to training data
lreg.fit(X_train,y_train)
print(lreg)

# Predict
y_pred_lreg = lreg.predict(X_test)

# Score It
from sklearn import metrics
print('\nLinear Regression Performance Metrics')
print('R^2=',metrics.r2_score(y_test,y_pred_lreg))
print('MAE:',metrics.mean_absolute_error(y_test,y_pred_lreg))
print('MSE:',metrics.mean_squared_error(y_test,y_pred_lreg))
print('RMSE:',np.sqrt(metrics.mean_squared_error(y_test,y_pred_lreg)))

According to the R-squared, only 23% of the variation in the dependent variable is explained by this model.  Such a low score is an indication that the relationship between the features and the dependent variable may be better explained with a non-linear model.  
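
For reference, R-squared compares the model's squared error against that of always predicting the mean of the target. The sketch below computes it by hand on made-up numbers and compares it with scikit-learn's `r2_score`. It also prints `explained_variance_score`, a related metric that ignores a constant bias in the predictions, so the two can differ.

```python
import numpy as np
from sklearn import metrics

# Toy values (made up): the predictions are consistently 2 units too high
y_true = np.array([10.0, 12.0, 14.0, 16.0])
y_hat = y_true + 2.0

# R^2 = 1 - SS_res / SS_tot
ss_res = np.sum((y_true - y_hat) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print('Manual R^2:         ', r2_manual)
print('r2_score:           ', metrics.r2_score(y_true, y_hat))
print('explained_variance: ', metrics.explained_variance_score(y_true, y_hat))
```

Here the explained variance is perfect even though every prediction is off by 2, which is why the two metrics should not be used interchangeably.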

## <a id="tree"></a>  **Decision Tree**
There are a number of non-linear regressors to choose from, but I will implement a [Decision Tree Regressor](http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeRegressor.html) first for three primary reasons:
-  it is the easiest to interpret
-  it does not require feature scaling
-  it is computationally less expensive than other methods
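
To illustrate the interpretability point, the sketch below fits a shallow tree on synthetic data and prints its decision rules with `sklearn.tree.export_text`. The feature names mirror this kernel, but the values and the relationship between them are made up.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor, export_text

rng = np.random.default_rng(0)

# Synthetic data (not the taxi data): duration depends mainly on the hour
pickup_hour = rng.integers(0, 24, 500)
passenger_count = rng.integers(1, 5, 500)
duration = 600 + 40 * pickup_hour + rng.normal(0, 20, 500)

X_toy = np.column_stack([pickup_hour, passenger_count])

# A shallow tree keeps the printed rules short
small_tree = DecisionTreeRegressor(max_depth=2, random_state=0)
small_tree.fit(X_toy, duration)

rules = export_text(small_tree, feature_names=['pickup_hour', 'passenger_count'])
print(rules)
```

Each printed branch is a plain threshold test on one feature, which is what makes a single tree easy to read.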


In [None]:
# Fit
# Import model
from sklearn.tree import DecisionTreeRegressor

# Instantiate model object
dtree = DecisionTreeRegressor()

# Fit to training data
dtree.fit(X_train,y_train)
print(dtree)

# Predict
y_pred_dtree = dtree.predict(X_test)

# Score It
from sklearn import metrics
print('\nDecision Tree Regression Performance Metrics')
print('R^2=',metrics.r2_score(y_test,y_pred_dtree))
print('MAE:',metrics.mean_absolute_error(y_test,y_pred_dtree))
print('MSE:',metrics.mean_squared_error(y_test,y_pred_dtree))
print('RMSE:',np.sqrt(metrics.mean_squared_error(y_test,y_pred_dtree)))

## <a id="random"></a> **Random Forest**
Next, I will try a [Random Forest Regressor](http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html), an ensemble of decision trees.
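
A random forest regressor's prediction is the mean of its individual trees' predictions. The sketch below checks this on synthetic data (not the taxi data) by averaging the trees in `estimators_` by hand and comparing the result with `predict()`.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# Synthetic non-linear regression data (not the taxi data)
X_toy = rng.uniform(-3, 3, size=(300, 2))
y_toy = np.sin(X_toy[:, 0]) + X_toy[:, 1] ** 2 + rng.normal(0, 0.1, 300)

forest = RandomForestRegressor(n_estimators=20, random_state=0)
forest.fit(X_toy, y_toy)

# Predictions from each individual tree in the ensemble
per_tree = np.array([tree.predict(X_toy) for tree in forest.estimators_])

# The forest's prediction is the average over its trees
manual_average = per_tree.mean(axis=0)
forest_prediction = forest.predict(X_toy)

print('Max difference:', np.abs(manual_average - forest_prediction).max())
```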

In [None]:
# Fit 
# Import model
from sklearn.ensemble import RandomForestRegressor 

# Instantiate model object
rforest = RandomForestRegressor(n_estimators = 20, n_jobs = -1)

# Fit to training data
rforest = rforest.fit(X_train,y_train)
print(rforest)

# Predict
y_pred_rforest = rforest.predict(X_test)

# Score It
from sklearn import metrics
print('\nRandom Forest Regression Performance Metrics')
print('R^2 =',metrics.r2_score(y_test,y_pred_rforest))
print('MAE:',metrics.mean_absolute_error(y_test, y_pred_rforest))
print('MSE:',metrics.mean_squared_error(y_test, y_pred_rforest))
print('RMSE:',np.sqrt(metrics.mean_squared_error(y_test, y_pred_rforest)))

# <a id="deployment"></a>**4. MODEL DEPLOYMENT**  
I've decided to use the Random Forest algorithm on my final Kaggle Submission.  But first, I'll need to pre-process the test data in the same manner as above before feeding it into the model.   
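
One way to keep the two pre-processing passes in sync is to wrap the shared steps in a single function and call it on both data sets. The helper below, `add_pickup_features`, is a hypothetical name and not part of this kernel; it is a minimal sketch of the pickup-time features and flag encoding, shown on two made-up rows.

```python
import pandas as pd

def add_pickup_features(df):
    """Return a copy of df with the engineered pickup features added.

    Hypothetical helper, not part of the original kernel.
    """
    out = df.copy()
    out['pickup_datetime'] = pd.to_datetime(out['pickup_datetime'])
    out['pickup_month'] = out['pickup_datetime'].dt.month
    out['pickup_hour'] = out['pickup_datetime'].dt.hour
    out['pickup_weekday'] = out['pickup_datetime'].dt.dayofweek  # Monday = 0
    out['store_and_fwd_flag'] = out['store_and_fwd_flag'].map({'N': 0, 'Y': 1})
    return out

# Toy rows (made up) to show the helper in action
toy = pd.DataFrame({
    'pickup_datetime': ['2016-03-14 17:24:55', '2016-06-12 00:43:35'],
    'store_and_fwd_flag': ['N', 'Y'],
})
print(add_pickup_features(toy))
```

Because the function returns a copy, the same call can be applied to the training and test frames without modifying either in place.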

In [None]:
# Load test data
test_data = pd.read_csv('../input/test.csv')

# Test data info
test_data.info()

# Test data shape
print('shape',test_data.shape)

## **Feature Engineering**

In [None]:
# Convert timestamps to date objects
test_data['pickup_datetime'] = pd.to_datetime(test_data.pickup_datetime) # Pickups

# Delimit pickup_datetime variable 
test_data['pickup_date'] = test_data['pickup_datetime'].dt.date # Extract date
test_data['pickup_time'] = test_data['pickup_datetime'].dt.time # Extract time

# Additional pickup features
test_data['pickup_month'] = test_data['pickup_datetime'].dt.month # Extract month

#train_data['pickup_YYYYMM'] = train_data['pickup_datetime'].apply(lambda x: x.strftime('%Y%m')) # Extract yearmonth
test_data['pickup_hour'] = test_data['pickup_datetime'].dt.hour # Extract hour
test_data['pickup_weekday'] = test_data['pickup_datetime'].dt.dayofweek # Extract day of week

# Encode categorical variables
test_data['store_and_fwd_flag'] = test_data['store_and_fwd_flag'].map({'N':0,'Y':1})


## **Kaggle Submission**

In [None]:
# Create new matrix of features from test data
# (named X_submission so the held-out X_test split above is not overwritten)
X_submission = test_data[['vendor_id', 'passenger_count', 'pickup_longitude',
       'pickup_latitude', 'dropoff_longitude', 'dropoff_latitude',
       'store_and_fwd_flag','pickup_month', 'pickup_hour',
       'pickup_weekday']]

# Feed features into random forest
y_pred = rforest.predict(X_submission)

In [None]:
# Create contest submission
submission = pd.DataFrame({
    'id':test_data['id'],  # lowercase, matching the data's id column
    'trip_duration': y_pred
})
submission.to_csv('mytaxisubmission.csv',index = False)