# Capstone Project 1: New York City Taxi Fare Prediction

In big cities such as New York City, a huge number of taxi rides are taken every day. 
As app-based ride-hailing services grow in popularity, accurate taxi fare prediction becomes essential for customer satisfaction, because the fare is quoted to customers upfront. 
Many factors should be considered when predicting a fare, such as the pickup time and the pickup and dropoff locations. 
An accurate fare estimate at a specific time lets both drivers and customers decide whether to take a ride. 
The goal of this project is to develop a Machine Learning (ML) model that predicts the fare amount for a taxi ride in New York City, given data such as the pickup and dropoff locations. 

Accurate fare prediction benefits taxi cab and ridesharing companies such as Uber and Lyft. 
This project could also be useful in traffic congestion prediction and autonomous vehicle research, where accurate traffic models help choose the fastest and least congested routes. 

This project uses data from a Kaggle competition
(https://www.kaggle.com/c/new-york-city-taxi-fare-prediction/overview).

The dataset for this project includes the features explained below:

   * pickup_datetime - timestamp value indicating when the taxi ride started.
    
   * pickup_longitude - float for longitude coordinate of where the taxi ride started.
    
   * pickup_latitude - float for latitude coordinate of where the taxi ride started.
    
   * dropoff_longitude - float for longitude coordinate of where the taxi ride ended.
    
   * dropoff_latitude - float for latitude coordinate of where the taxi ride ended.
    
   * passenger_count - integer indicating the number of passengers in the taxi ride.

During the modeling phase of the project, these features can be extended. 

Target: 

   * fare_amount - dollar amount of the cost of the taxi ride. 



In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from collections import Counter
import numpy as np

## Loading the Dataset

Only a limited number of rows is read from the dataset because of memory constraints. For this project, 6 million of the 55 million available rows are read.

In [2]:
train_file = pd.read_csv('/Users/mehrnaz/Desktop/SpringBoard/Assignment/Capstone_Project_1/Data_wrangling/train.csv', nrows = 6000000)
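
Besides limiting `nrows`, memory use can also be reduced by reading the file in chunks and with narrower column dtypes. The following is only a minimal sketch of that idea, not the approach used in this project: the CSV rows are made up, `csv_text` stands in for the real file, and the chosen dtypes are assumptions (`float32` coordinates lose some precision, and `uint8` assumes passenger_count has no missing values).

```python
import io
import pandas as pd

# Made-up rows with the same column names as train.csv (illustrative only).
csv_text = """key,fare_amount,pickup_datetime,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,passenger_count
2015-01-05 08:30:00.0000001,7.5,2015-01-05 08:30:00 UTC,-73.98,40.75,-73.97,40.76,1
2015-01-06 09:00:00.0000002,12.0,2015-01-06 09:00:00 UTC,-73.99,40.73,-73.95,40.78,2
2015-01-07 18:45:00.0000003,5.0,2015-01-07 18:45:00 UTC,-73.96,40.77,-73.97,40.76,1
"""

# Assumed narrower dtypes; they use less memory than the default float64/int64.
dtypes = {
    'fare_amount': 'float32',
    'pickup_longitude': 'float32',
    'pickup_latitude': 'float32',
    'dropoff_longitude': 'float32',
    'dropoff_latitude': 'float32',
    'passenger_count': 'uint8',
}

# With chunksize, read_csv returns an iterator of DataFrames instead of one big frame.
chunks = pd.read_csv(io.StringIO(csv_text), dtype=dtypes, chunksize=2)

# Each chunk could be filtered here before being combined.
sample = pd.concat(chunks, ignore_index=True)
print(sample.dtypes)
```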


## Part 1: Data Wrangling

First, look at:
   * What the train_file dataframe looks like. 
   * The shape of train_file. 
   * Statistics of the features.

In [None]:
#Check data type of each column
train_file.dtypes

In [None]:
#Look at the first rows of the dataframe
train_file.head()

In [None]:
# check statistics of the features
train_file.describe()

In [None]:
#check the number of rows and columns of 'train_file' dataframe
train_file.shape

Based on the above information, the following steps should be performed:

* Check whether there are any NaN values and drop them

* Check the target column:

    * E.g. a negative fare_amount does not make sense

In [None]:
#check how many NaNs exist in the dataset
train_file.isnull().sum()

In [None]:
#Remove NAN from the file 
train_file = train_file.dropna()

In [None]:
#Check the number of rows and columns after removing NAN
train_file.shape

In [None]:
#Count how many fare_amount values are zero or negative
Counter(train_file['fare_amount'] <= 0)

In [None]:
#Remove rows with negative and zero values for 'fare_amount'
train_file = train_file.drop(train_file[train_file['fare_amount'] <= 0].index, axis=0)

In [None]:
#Check the number of rows and columns after removing zero and negative 'fare_amount' values
train_file.shape

In [None]:
#Plot histogram for 'fare_amount'
train_file.hist(column='fare_amount', log=True)

Based on the histogram for 'fare_amount', a price range of (0, 200) dollars makes sense.

Let's see how many 'fare_amount' values greater than 200 dollars exist.

In [None]:
#check if there is 'fare_amount' greater than $200 
Counter(train_file['fare_amount'] > 200)

In [None]:
#Does not make sense to have 'fare_amount' greater than $200, so consider them as outlier and remove them
train_file = train_file.drop(train_file[train_file['fare_amount'] > 200].index, axis=0)
train_file.shape
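
The same fare filtering can also be written as a single boolean mask instead of dropping rows by index. This is only a sketch on made-up fares (the `fares` frame is hypothetical); the bounds follow the ones used above.

```python
import pandas as pd

# Made-up fares, including values that the cleaning steps above would remove.
fares = pd.DataFrame({'fare_amount': [-2.5, 0.0, 7.0, 52.0, 450.0]})

# Keep only rows with 0 < fare_amount <= 200.
valid = fares[(fares['fare_amount'] > 0) & (fares['fare_amount'] <= 200)]
print(valid)
```

A mask keeps the condition for "valid" rows in one place, and it does not depend on the index labels being unique.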

Next, the input features are analyzed.

To analyze the latitude and longitude columns, the coordinates of NYC should be used as boundaries.

A web search gives the following latitude and longitude ranges for NYC:
* NYC's latitude is in the range of (40, 42) 

* NYC's longitude is in the range of (-76, -71) 

I have used a slightly wider range for latitude and longitude to be more inclusive.

In [None]:
#check to see if there is an outlier for pickup_longitude
train_file['pickup_longitude'].max()

In [None]:
#check to see if there is an outlier for pickup_latitude
train_file['pickup_latitude'].max()

In [None]:
#check to see if there is an outlier for pickup_longitude
train_file['pickup_longitude'].min()

In [None]:
#check to see if there is an outlier for pickup_latitude
train_file['pickup_latitude'].min()

In [None]:
#check to see if there is an outlier for dropoff_longitude
train_file['dropoff_longitude'].max()

In [None]:
#check to see if there is an outlier for dropoff_longitude
train_file['dropoff_longitude'].min()

In [None]:
#check to see if there is an outlier for dropoff_latitude
train_file['dropoff_latitude'].max()

In [None]:
#check to see if there is an outlier for dropoff_latitude
train_file['dropoff_latitude'].min()

In [None]:
#'pickup_latitude' should be in the range of (40, 42)
train_file = train_file[train_file['pickup_latitude'].between(40, 42)]
train_file.shape

In [None]:
#'pickup_longitude' should be in the range of (-76, -71)
train_file = train_file[train_file['pickup_longitude'].between(-76,-71)]
train_file.shape

In [None]:
#'dropoff_latitude' should be in the range of (40, 42)
train_file = train_file[train_file['dropoff_latitude'].between(40, 42)]
train_file.shape

In [None]:
#'dropoff_longitude' should be in the range of (-76, -71)
train_file = train_file[train_file['dropoff_longitude'].between(-76,-71)]
train_file.shape
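
The four coordinate filters above can also be combined into one step. The sketch below assumes the same bounds; the helper name `in_bounding_box` and the sample rows are made up for illustration.

```python
import pandas as pd

LAT_RANGE = (40, 42)    # latitude bounds used above
LON_RANGE = (-76, -71)  # longitude bounds used above

def in_bounding_box(df):
    """Return only the rows whose pickup and dropoff both fall inside the bounds."""
    mask = (
        df['pickup_latitude'].between(*LAT_RANGE)
        & df['dropoff_latitude'].between(*LAT_RANGE)
        & df['pickup_longitude'].between(*LON_RANGE)
        & df['dropoff_longitude'].between(*LON_RANGE)
    )
    return df[mask]

# Made-up rides: the second has a pickup latitude of 0, the third a dropoff far to the west.
rides = pd.DataFrame({
    'pickup_latitude': [40.75, 0.0, 40.70],
    'pickup_longitude': [-73.98, -73.98, -74.00],
    'dropoff_latitude': [40.76, 40.76, 40.72],
    'dropoff_longitude': [-73.97, -73.97, -80.00],
})
inside = in_bounding_box(rides)
print(inside)
```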

The other column that should be cleaned up is the 'passenger_count'.

Let's find if there is an outlier for this feature.

In [None]:
train_file.hist(column='passenger_count', log=True)

In [None]:
#Check if there is outlier for passenger_count
train_file['passenger_count'].max()

The maximum number of passengers is 208, which does not make sense given the number of seats in a taxi cab.
The maximum number of passengers allowed in an SUV or a van is 6, so 6 is used as the upper bound for the number of passengers per ride.

In [None]:
#Does not make sense to have 'passenger_count' greater than 6 or less than 1, so consider them as bounds and remove the data out of bounds.
train_file = train_file[train_file['passenger_count'].between(1, 6)]
train_file.shape

In the next step, new features are created from the available data to see whether they affect the fare_amount:

* The distance between the pickup and dropoff locations.
* The date and time of pickup.


The Haversine formula is employed to calculate the distance between the pickup and dropoff locations based on longitude and latitude. 

The Haversine formula is (https://en.wikipedia.org/wiki/Haversine_formula):

distance = 2 * r * arcsin(sqrt(sin((latitude2 - latitude1) / 2.0)^2 + cos(latitude1) * cos(latitude2) * sin((longitude2 - longitude1) / 2.0)^2))

Here r is the radius of the Earth, and the latitudes and longitudes are expressed in radians.
    

In [None]:
#Calculate the distance based on Haversine formula
def distance(lat1, lat2, lon1, lon2):
    # Convert the coordinates from degrees to radians
    lat1 = np.radians(lat1)
    lat2 = np.radians(lat2)
    lon1 = np.radians(lon1)
    lon2 = np.radians(lon2)

    # Haversine formulation
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = np.sin(dlat / 2.0)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2.0)**2

    c = 2 * np.arcsin(np.sqrt(a))
    r = 6371  # mean radius of the Earth in km

    # Distance in km
    return c * r

def create_distance_column(df):
    # Add a 'distance' column (in km) to df in place and return it
    df['distance'] = distance(df['pickup_latitude'], df['dropoff_latitude'], df['pickup_longitude'], df['dropoff_longitude'])
    return df['distance']
create_distance_column(train_file)
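
As a quick sanity check of the Haversine formula, one degree of latitude along a meridian should come out close to 6371 * pi / 180, which is about 111.19 km. The sketch below redefines the formula in a standalone helper (`haversine_km` is a name chosen here, and its argument order differs from the function above) so that it can be run on its own.

```python
import numpy as np

def haversine_km(lat1, lon1, lat2, lon2, r=6371):
    """Great-circle distance in km between two points given in degrees."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = np.sin((lat2 - lat1) / 2.0)**2 + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2.0)**2
    return 2 * r * np.arcsin(np.sqrt(a))

# Two points on the same meridian, one degree of latitude apart.
one_degree = haversine_km(40.0, -74.0, 41.0, -74.0)

# The same point twice should give zero distance.
same_point = haversine_km(40.7, -74.0, 40.7, -74.0)

print(round(one_degree, 2), same_point)
```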

In [None]:
train_file.head()

In [None]:
train_file.dtypes

In [None]:
#Check if there are outliers for distance
train_file['distance'].max()

In [None]:
#Check if there are outliers for distance
train_file['distance'].min()

In [None]:
train_file.hist(column='distance', log=True)

In [None]:
#Does not make sense to have 'distance' greater than 150km, so consider them as outlier and remove them
train_file = train_file.drop(train_file[train_file['distance'] > 150].index, axis=0)
train_file.shape

In [None]:
# Zero 'distance' does not make sense, so consider them as outlier and remove them
train_file = train_file.drop(train_file[train_file['distance'] == 0].index, axis=0)
train_file.shape

The fare amount usually varies with the day of the week. To explore how the pickup time affects the 'fare_amount', the 'key' column is converted to datetime type so that new year, month, day, dayofweek, and hour columns can be created from it. Since the 'key' and 'pickup_datetime' columns carry the same information, the 'key' column is removed afterwards to get rid of duplicated data.

In [None]:
#convert the 'key' column to datetime 
train_file['key'] = pd.to_datetime(train_file['key'])
train_file.head()

In [None]:
train_file.dtypes

In [None]:
#Add 'year' column to dataframe
train_file['year'] = train_file['key'].dt.year

In [None]:
#Add 'month' column to dataframe
train_file['month'] = train_file['key'].dt.month

In [None]:
#Add 'dayofweek' column to dataframe
train_file['dayofweek'] = train_file['key'].dt.dayofweek

In [None]:
#Add 'day' column to dataframe
train_file['day'] = train_file['key'].dt.day

In [None]:
#Add 'hour' column to dataframe
train_file['hour'] = train_file['key'].dt.hour
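
The date and time parts can also be extracted in a single step with `assign`. The sketch below uses two made-up timestamps (a Monday morning and a Saturday night) rather than the real 'key' column.

```python
import pandas as pd

# Made-up pickup times: 2015-01-05 is a Monday, 2015-01-10 is a Saturday.
rides = pd.DataFrame({'pickup_datetime': ['2015-01-05 08:30:00', '2015-01-10 23:15:00']})

# Parse the strings once, then derive every part from the same datetime Series.
ts = pd.to_datetime(rides['pickup_datetime'])
rides = rides.assign(
    year=ts.dt.year,
    month=ts.dt.month,
    day=ts.dt.day,
    dayofweek=ts.dt.dayofweek,  # Monday=0, Sunday=6
    hour=ts.dt.hour,
)
print(rides)
```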

In [None]:
train_file.head()

In [None]:
#Make sure only data for recent years 2005 to 2020 are considered
train_file = train_file[train_file['year'].between(2005, 2020)]
train_file.shape

In [None]:
#Remove 'key' column to remove duplicated data
train_file = train_file.drop('key', axis=1)

In [None]:
train_file.head()

## Part 2: Storytelling



In [None]:
#Draw the fare_amount distribution
plt.figure(figsize = (10, 6))
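#sns.distplot is deprecated in recent seaborn versions; histplot with a KDE is the closest replacement
sns.histplot(train_file['fare_amount'], kde=True, stat='density')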
plt.title('Fare_Amount Distribution')

Fare amount distribution shows that most of the rides are under $25. 
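
A claim like this can also be checked numerically by computing the share of rides below a threshold. The sketch below uses a few made-up fares, so the printed share says nothing about the real dataset.

```python
import pandas as pd

# Made-up fares for illustration only.
fares = pd.Series([4.5, 8.0, 12.5, 30.0, 57.3])

# The mean of a boolean Series is the fraction of True values.
share_under_25 = (fares < 25).mean()
print(share_under_25)
```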

Let's see where pickup and dropoff coordinates are located on the map. 

In [None]:
train_file.plot(kind='scatter', x='pickup_longitude', y='pickup_latitude', color='r')    

In [None]:
train_file.plot(kind='scatter', x='dropoff_longitude', y='dropoff_latitude', color='b')

As these plots show, the coordinates are concentrated approximately between longitudes (-74.1, -73.8) and latitudes (40.6, 40.8).
These coordinates correspond to the Manhattan area, which is considered the crowded part of NYC.

Let's see how the number of passengers affects the fare amount.

In [None]:
#Average fare_amount by passenger_count.
train_file.groupby('passenger_count')['fare_amount'].mean().plot.bar()
plt.title('fare_amount per passenger_count')

In [None]:
plt.scatter(x=train_file['passenger_count'], y=train_file['fare_amount'], s=1.5)
plt.xlabel('passenger_count')
plt.ylabel('fare_amount')

The figures above show that the 'fare_amount' includes a base amount that increases slightly with the number of passengers, from one to five passengers. The 'fare_amount' then increases significantly for six passengers, presumably because another type of car is required.

To better explore the data, three new columns are added that normalize fare_amount by distance, by passenger_count, and by both.

In [None]:
#Vectorized column arithmetic gives the same result as a row-wise apply, but much faster
train_file['fare_amount/distance'] = train_file['fare_amount'] / train_file['distance']
train_file['fare_amount/passenger_count'] = train_file['fare_amount'] / train_file['passenger_count']
train_file['base_fare'] = train_file['fare_amount'] / train_file['passenger_count'] / train_file['distance']

In [None]:
train_file.head()

To see how the day of the week affects the 'fare_amount', the other features should be held roughly constant.
To do so, only rides from year 2015 with a distance between 10 and 30 km and 2 to 4 passengers are studied.

In [None]:
#Filter train_file to rides from 2015 with distance between 10 and 30 km and 2 to 4 passengers
filter1_train_file = train_file[(train_file['year'] == 2015) & (train_file['distance'].between(10,30)) & (train_file['passenger_count'].between(2,4))]

In [None]:
filter1_train_file.shape

In [None]:
filter1_train_file.head()

In [None]:
# The day of the week with Monday=0, Sunday=6
filter1_train_file.groupby('dayofweek')['fare_amount'].mean().plot()
plt.title('average fare_amount based on dayofweek')

The results show that Monday has the highest average 'fare_amount'.
Also, 'fare_amount' is lowest during the weekend.
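
To make such a comparison easier to read, the numeric dayofweek values can be mapped to day names after grouping. The sketch below uses made-up fares, so its numbers do not reflect the real data; `day_names` is a helper list introduced here.

```python
import pandas as pd

# Made-up rides: two on Monday (0), one on Saturday (5), one on Sunday (6).
rides = pd.DataFrame({
    'dayofweek': [0, 0, 5, 6],
    'fare_amount': [30.0, 34.0, 28.0, 26.0],
})

day_names = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun']

# Average fare per weekday, then relabel the index with readable names.
mean_fare = rides.groupby('dayofweek')['fare_amount'].mean()
mean_fare.index = [day_names[d] for d in mean_fare.index]

print(mean_fare)
print('Highest average fare on:', mean_fare.idxmax())
```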

Showing how the fare amount per passenger has changed over the years gives a perspective on how it may change in the coming years. Therefore, the trend of fare amount per passenger is plotted as:

In [None]:
#Show the trend of average 'fare_amount' per person during years
train_file.groupby('year')['fare_amount/passenger_count'].mean().plot()
plt.title('fare_amount/passenger_count Trend')

As expected, the figure shows that the fare amount per passenger has increased over the years.
However, the increase has slowed down since 2013.
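
One way to quantify such a slowdown is the year-over-year relative change of the yearly averages. This is a sketch on made-up values, so the percentages below are illustrative only.

```python
import pandas as pd

# Made-up per-passenger fares for three years.
rides = pd.DataFrame({
    'year': [2012, 2012, 2013, 2013, 2014, 2014],
    'fare_amount/passenger_count': [9.0, 11.0, 10.0, 12.0, 11.0, 11.22],
})

# Yearly averages, then the relative change from one year to the next.
yearly_mean = rides.groupby('year')['fare_amount/passenger_count'].mean()
growth = yearly_mean.pct_change()

print(yearly_mean)
print(growth)
```

The first year has no previous year to compare against, so its growth value is NaN.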

The city grows over the years, which may have a large impact on the fare amount.
Checking the average distance that passengers travel over the years can help explain the changes in fare amount. 

In [None]:
#See trend of distance during the years
train_file.groupby('year')['distance'].mean().plot()
plt.title('trend of distance during the years')

The distance that passengers travel has increased over the years, which can cause a rise in the fare_amount.
The biggest jump is from 2010 to 2011, and the distance slightly decreased in 2015.

Let's see how many passengers the majority of rides have had:

In [None]:
#See the number of rides per passenger_count
train_file.groupby('passenger_count')['distance'].count().plot()
plt.title('passenger_count vs. ride count' )

The plot shows that the majority of rides have a single passenger. Rides with two passengers are the second most common.
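
The share of rides per passenger count can also be read directly from `value_counts` with `normalize=True`. The passenger counts below are made up, so the shares are only an illustration.

```python
import pandas as pd

# Made-up passenger counts for six rides.
passenger_count = pd.Series([1, 1, 1, 2, 2, 5])

# Fraction of rides for each passenger count, ordered by passenger count.
shares = passenger_count.value_counts(normalize=True).sort_index()
print(shares)
```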