<a href="https://colab.research.google.com/github/pe44enka/TaxiFarePrediction/blob/master/TaxiFarePrediction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **NYC Taxi Fare Prediction**

![](https://static.vecteezy.com/system/resources/previews/000/118/272/original/free-new-york-taxi-watercolor-vector.jpg)


### **Objectives**

Imagine you are in the Big Apple, new to town, with no clue how to get from Central Park to the Empire State Building. After a few useless attempts and a short, fair battle, you finally get your cab. But hey! How much is it going to cost you in this crazy city?


---

### **Goal of the project**
To predict the fare amount (inclusive of tolls) for a taxi ride in New York City given the pickup and dropoff locations.

### **Data**
The [New York City Taxi Fare Prediction](https://www.kaggle.com/competitions/new-york-city-taxi-fare-prediction) dataset, available on Kaggle as part of a competition.

The dataset contains the following fields:

Field name | Description
--- |--- 
*key* | identifier for each trip
*fare_amount* | the cost of each trip in USD
*pickup_datetime* | date and time when the meter was engaged
*passenger_count* | the number of passengers in the vehicle (driver entered value)
*pickup_longitude* | the longitude where the meter was engaged
*pickup_latitude* | the latitude where the meter was engaged
*dropoff_longitude* | the longitude where the meter was disengaged
*dropoff_latitude* | the latitude where the meter was disengaged

### **Techniques**
In this project we will use:
* **Data preprocessing**: SelectFromModel, SimpleImputer, OneHotEncoder, StandardScaler, ColumnTransformer, pandas.get_dummies
* **ML algorithms**: LinearRegression, DecisionTreeRegressor, RandomForestRegressor, GradientBoostingRegressor, XGBRegressor
* **Hyperparameter tuning:** GridSearchCV
* **Model training/applying:** Pipeline, train_test_split


---

# Load Libraries

In [98]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import datetime as dt
from geopy.distance import great_circle

# Load Data

In [131]:
df = pd.read_csv('https://raw.githubusercontent.com/pe44enka/TaxiFarePrediction/master/data/train.csv')
print('df.shape: ', df.shape)
df.head()

df.shape:  (1048575, 8)


Unnamed: 0,key,fare_amount,pickup_datetime,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,passenger_count
0,26:21.0,4.5,2009-06-15 17:26:21 UTC,-73.844311,40.721319,-73.84161,40.712278,1
1,52:16.0,16.9,2010-01-05 16:52:16 UTC,-74.016048,40.711303,-73.979268,40.782004,1
2,35:00.0,5.7,2011-08-18 00:35:00 UTC,-73.982738,40.76127,-73.991242,40.750562,2
3,30:42.0,7.7,2012-04-21 04:30:42 UTC,-73.98713,40.733143,-73.991567,40.758092,1
4,51:00.0,5.3,2010-03-09 07:51:00 UTC,-73.968095,40.768008,-73.956655,40.783762,1


# Data Cleaning

## Overview


Before playing with ML models and trying to predict anything, let's get familiar with the data we have.

---



In [132]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1048575 entries, 0 to 1048574
Data columns (total 8 columns):
 #   Column             Non-Null Count    Dtype  
---  ------             --------------    -----  
 0   key                1048575 non-null  object 
 1   fare_amount        1048575 non-null  float64
 2   pickup_datetime    1048575 non-null  object 
 3   pickup_longitude   1048575 non-null  float64
 4   pickup_latitude    1048575 non-null  float64
 5   dropoff_longitude  1048565 non-null  float64
 6   dropoff_latitude   1048565 non-null  float64
 7   passenger_count    1048575 non-null  int64  
dtypes: float64(5), int64(1), object(2)
memory usage: 64.0+ MB



---

**Notes:** the data contains both categorical (object) and numerical features, as well as missing values.

**Conclusion:** we need to get rid of the missing values and deal with the categorical data.

---



In [133]:
df.describe()

Unnamed: 0,fare_amount,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,passenger_count
count,1048575.0,1048575.0,1048575.0,1048565.0,1048565.0,1048575.0
mean,11.34548,-72.52724,39.93094,-72.5275,39.92496,1.684902
std,9.820072,12.00798,7.725806,11.41154,8.529585,1.323155
min,-44.9,-3377.681,-3116.285,-3383.297,-3114.339,0.0
25%,6.0,-73.99207,40.73496,-73.99138,40.73406,1.0
50%,8.5,-73.9818,40.75267,-73.98014,40.75318,1.0
75%,12.5,-73.96711,40.76714,-73.96367,40.76812,2.0
max,500.0,2522.271,2621.628,1717.003,1989.728,208.0


---

**Notes:** 
* the *min* value of the target column ```fare_amount``` is negative. That means some entities in the data have a negative price for a taxi ride, which is an obvious error and can mislead the model during training
* coordinate columns (```pickup_latitude```, ```pickup_longitude```, ```dropoff_latitude``` and ```dropoff_longitude```) have *min* and *max* values far outside the valid ranges (latitude must lie within [-90, 90] and longitude within [-180, 180])
* ```passenger_count``` column *min* is zero and *max* is 208 passengers, both of which are impossible

**Conclusion:** take a closer look at
* ```fare_amount``` column to deal with negative values
* coordinate columns to make sure all coordinates are in the range (-90, 90). For longitude this is stricter than the valid [-180, 180], but New York City lies near longitude -74, so the tighter bound is safe for NYC rides
* ```passenger_count``` column to check that the number of passengers per cab is realistic

---

## Missing Values

Let's have a look at the features with missing values.

---

In [134]:
df[df.columns[df.isnull().sum().values>0]].isna().sum()

dropoff_longitude    10
dropoff_latitude     10
dtype: int64



---

As we can see, ```dropoff_longitude``` and ```dropoff_latitude``` have missing values.
Since there are no known values to fill these gaps with, and the number of missing values is tiny compared with the size of the data (<0.001%), we can simply drop the affected rows.
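
To make the "share of missing rows" argument concrete, here is a minimal sketch on a made-up toy frame (the values and the name `toy` are illustrative, not taken from the dataset). It counts rows with at least one gap and expresses them as a percentage of all rows.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the taxi data (values are made up).
toy = pd.DataFrame({
    "dropoff_longitude": [-73.98, np.nan, -73.99, -73.97],
    "dropoff_latitude": [40.75, np.nan, 40.76, 40.74],
    "passenger_count": [1, 2, 1, 3],
})

# Rows with at least one missing value, and their share in percent.
missing_rows = toy.isna().any(axis=1)
missing_pct = 100 * missing_rows.mean()
print(f"rows with NaN: {missing_rows.sum()} ({missing_pct:.1f}%)")

# Per-column counts, restricted to the columns that actually have gaps.
na_counts = toy.isna().sum()
print(na_counts[na_counts > 0])
```

On the real data the same two expressions would give the row count and percentage quoted above.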


---



In [135]:
df.dropna(axis=0, inplace=True) #drop rows with NaN
df.reset_index(drop=True, inplace=True) #reset index after dropping rows
df.isnull().sum()

key                  0
fare_amount          0
pickup_datetime      0
pickup_longitude     0
pickup_latitude      0
dropoff_longitude    0
dropoff_latitude     0
passenger_count      0
dtype: int64

In [136]:
print('df.shape: ', df.shape)
df.head()

df.shape:  (1048565, 8)


Unnamed: 0,key,fare_amount,pickup_datetime,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,passenger_count
0,26:21.0,4.5,2009-06-15 17:26:21 UTC,-73.844311,40.721319,-73.84161,40.712278,1
1,52:16.0,16.9,2010-01-05 16:52:16 UTC,-74.016048,40.711303,-73.979268,40.782004,1
2,35:00.0,5.7,2011-08-18 00:35:00 UTC,-73.982738,40.76127,-73.991242,40.750562,2
3,30:42.0,7.7,2012-04-21 04:30:42 UTC,-73.98713,40.733143,-73.991567,40.758092,1
4,51:00.0,5.3,2010-03-09 07:51:00 UTC,-73.968095,40.768008,-73.956655,40.783762,1


In [137]:
df[df.columns[df.isnull().sum().values>0]].isna().sum()

Series([], dtype: float64)



---

There are no missing values left in the data.

---



## Duplicates


Also, to keep repeated rows from skewing the result, let's check the data for duplicates.
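
As a quick illustration of what counts as a duplicate here, the sketch below uses a made-up toy frame (`toy` is a hypothetical name). `duplicated()` flags a row only when every column matches an earlier row, so repeated values within a single column are not duplicates by themselves.

```python
import pandas as pd

# Toy frame: rows 0 and 1 are identical, row 2 shares only one value with them.
toy = pd.DataFrame({
    "fare_amount": [4.5, 4.5, 4.5],
    "passenger_count": [1, 1, 2],
})

# Number of rows that exactly repeat an earlier row.
n_dupes = int(toy.duplicated().sum())

# Dropping them keeps the first occurrence of each distinct row.
deduped = toy.drop_duplicates().reset_index(drop=True)
print(n_dupes, len(deduped))
```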

---

In [138]:
df.nunique() #Count number of distinct elements in specified axis

key                    3600
fare_amount            2155
pickup_datetime      898113
pickup_longitude     114788
pickup_latitude      146414
dropoff_longitude    136393
dropoff_latitude     173759
passenger_count           9
dtype: int64

In [139]:
df.duplicated().sum() #number of rows that fully duplicate an earlier row

0



---

There are no duplicated rows in the data.

---



## Target Feature Values


As mentioned earlier, we need to get rid of negative values in the target column ```fare_amount```.


---

In [140]:
df.fare_amount.describe()

count    1.048565e+06
mean     1.134536e+01
std      9.819785e+00
min     -4.490000e+01
25%      6.000000e+00
50%      8.500000e+00
75%      1.250000e+01
max      5.000000e+02
Name: fare_amount, dtype: float64

In [141]:
df[df.fare_amount<=0].shape #number of rows with a zero or negative fare_amount

(69, 8)

---

As we can see, there are some zero or negative values in the ```fare_amount``` column. Their number is small compared with the size of the data (<0.007%), so we can drop these rows entirely.

---

In [142]:
df = df[df.fare_amount>0]
print('df.shape: ', df.shape)
df.head()

df.shape:  (1048496, 8)


Unnamed: 0,key,fare_amount,pickup_datetime,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,passenger_count
0,26:21.0,4.5,2009-06-15 17:26:21 UTC,-73.844311,40.721319,-73.84161,40.712278,1
1,52:16.0,16.9,2010-01-05 16:52:16 UTC,-74.016048,40.711303,-73.979268,40.782004,1
2,35:00.0,5.7,2011-08-18 00:35:00 UTC,-73.982738,40.76127,-73.991242,40.750562,2
3,30:42.0,7.7,2012-04-21 04:30:42 UTC,-73.98713,40.733143,-73.991567,40.758092,1
4,51:00.0,5.3,2010-03-09 07:51:00 UTC,-73.968095,40.768008,-73.956655,40.783762,1


In [143]:
df[df.fare_amount<=0].shape

(0, 8)



---

No entities with a zero or negative ride price are left.

---



## Coordinate Columns Values


Let's check how many entities have coordinates lower than -90 or higher than 90.

---



In [144]:
#concatenate the subsets of the data that violate the conditions described above
coor_df = pd.concat([df[df.pickup_latitude < - 90], df[df.pickup_latitude > 90],
                     df[df.pickup_longitude < - 90], df[df.pickup_longitude > 90],
                     df[df.dropoff_latitude < - 90], df[df.dropoff_latitude > 90],
                     df[df.dropoff_longitude < - 90], df[df.dropoff_longitude > 90]
                     ])
coor_df = coor_df.drop_duplicates() #remove duplicated rows
coor_df.shape

(49, 8)



---

As we can see, there are some values in the coordinate columns that do not satisfy the condition. Their number is small compared with the size of the data (<0.005%), so we can drop these rows entirely.
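
The same selection can also be expressed as a single boolean mask instead of concatenating eight filtered subsets. The sketch below is one possible alternative, shown on a made-up toy frame (`toy`, `coord_cols` and `out_of_range` are illustrative names); a value is flagged when its absolute value exceeds 90, which matches the `< -90` / `> 90` conditions.

```python
import pandas as pd

coord_cols = ["pickup_latitude", "pickup_longitude",
              "dropoff_latitude", "dropoff_longitude"]

# Toy frame: row 1 has an impossible latitude, row 2 an impossible longitude.
toy = pd.DataFrame({
    "pickup_latitude": [40.72, 401.1, 40.76],
    "pickup_longitude": [-73.84, -73.98, -73.98],
    "dropoff_latitude": [40.71, 40.78, 40.75],
    "dropoff_longitude": [-73.84, -73.97, -3383.3],
})

# True for every row where at least one coordinate lies outside [-90, 90].
out_of_range = toy[coord_cols].abs().gt(90).any(axis=1)

# Keep only the rows that pass the check.
kept = toy[~out_of_range]
print(out_of_range.sum(), kept.shape)
```

Because the mask has one entry per row, no de-duplication step is needed for rows that violate several conditions at once.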

---



In [145]:
df.drop(index=coor_df.index.to_list(), inplace=True)
print('df.shape: ', df.shape)
df.head()

df.shape:  (1048447, 8)


Unnamed: 0,key,fare_amount,pickup_datetime,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,passenger_count
0,26:21.0,4.5,2009-06-15 17:26:21 UTC,-73.844311,40.721319,-73.84161,40.712278,1
1,52:16.0,16.9,2010-01-05 16:52:16 UTC,-74.016048,40.711303,-73.979268,40.782004,1
2,35:00.0,5.7,2011-08-18 00:35:00 UTC,-73.982738,40.76127,-73.991242,40.750562,2
3,30:42.0,7.7,2012-04-21 04:30:42 UTC,-73.98713,40.733143,-73.991567,40.758092,1
4,51:00.0,5.3,2010-03-09 07:51:00 UTC,-73.968095,40.768008,-73.956655,40.783762,1


In [146]:
df.describe()

Unnamed: 0,fare_amount,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,passenger_count
count,1048447.0,1048447.0,1048447.0,1048447.0,1048447.0,1048447.0
mean,11.34629,-72.518,39.92733,-72.51755,39.92727,1.684921
std,9.819187,10.38978,6.086774,10.38845,6.08858,1.323148
min,0.01,-89.43979,-74.01659,-86.80412,-74.0352,0.0
25%,6.0,-73.99207,40.73496,-73.99138,40.73406,1.0
50%,8.5,-73.9818,40.75267,-73.98014,40.75318,1.0
75%,12.5,-73.96711,40.76714,-73.96367,40.76812,2.0
max,500.0,40.85036,69.4,45.58162,81.51018,208.0




---

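All coordinates are now within the (-90, 90) range. Note that this check only removes impossible values: the summary above still contains points far from New York City (for example, a minimum pickup latitude of about -74 and a maximum pickup longitude of about 40.9).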

---



## Passenger Count Column

Let's check how many unique values ```passenger_count``` column has.

---

In [147]:
df.passenger_count.unique()

array([  1,   2,   3,   6,   5,   4,   0, 208,   9])

---

Assuming that the largest vehicle that can operate as a taxi is a minivan carrying up to 6 passengers, the ```passenger_count``` column should have values from 1 to 6.

As we can see, the data contains 3 other values (0, 9 and 208). Let's see how many entities have them.

---

In [151]:
print('0 passengers: {}\n9 passengers: {}\n208 passengers: {}'.format(df[df.passenger_count==0].shape, df[df.passenger_count==9].shape, df[df.passenger_count==208].shape))

0 passengers: (3714, 8)
9 passengers: (1, 8)
208 passengers: (1, 8)





---

As shown, only 2 entities have more than 6 passengers, so we can drop them.

However, there are many entities with 0 passengers. Since a taxi ride can't take place without a passenger, we will set the passenger count of those rides to 1.

---



In [156]:
df = df[df.passenger_count<7].copy() #drop rows with passenger_count > 6 (copy so the next line does not modify a slice)
df['passenger_count'] = df.passenger_count.replace(0, 1) #replacing 0 with 1

print('df.shape: ', df.shape)
df.head()

df.shape:  (1048445, 8)


Unnamed: 0,key,fare_amount,pickup_datetime,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,passenger_count
0,26:21.0,4.5,2009-06-15 17:26:21 UTC,-73.844311,40.721319,-73.84161,40.712278,1
1,52:16.0,16.9,2010-01-05 16:52:16 UTC,-74.016048,40.711303,-73.979268,40.782004,1
2,35:00.0,5.7,2011-08-18 00:35:00 UTC,-73.982738,40.76127,-73.991242,40.750562,2
3,30:42.0,7.7,2012-04-21 04:30:42 UTC,-73.98713,40.733143,-73.991567,40.758092,1
4,51:00.0,5.3,2010-03-09 07:51:00 UTC,-73.968095,40.768008,-73.956655,40.783762,1


In [157]:
df.passenger_count.describe()

count    1.048445e+06
mean     1.688259e+00
std      1.304480e+00
min      1.000000e+00
25%      1.000000e+00
50%      1.000000e+00
75%      2.000000e+00
max      6.000000e+00
Name: passenger_count, dtype: float64



---

```passenger_count``` column has *min* 1 passenger and *max* 6 passengers.

---



## Feature Engineering




Now that we've finished dealing with rows, let's have a look at the features we have.


---



### Columns Dropping

We will drop the ```key``` column because it is just an identifier of the ride and brings no useful information for further analysis and modeling.

----

In [158]:
df.drop(columns=['key'], inplace=True) #remove key column
print('df.shape: ', df.shape)
df.head()

df.shape:  (1048445, 7)


Unnamed: 0,fare_amount,pickup_datetime,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,passenger_count
0,4.5,2009-06-15 17:26:21 UTC,-73.844311,40.721319,-73.84161,40.712278,1
1,16.9,2010-01-05 16:52:16 UTC,-74.016048,40.711303,-73.979268,40.782004,1
2,5.7,2011-08-18 00:35:00 UTC,-73.982738,40.76127,-73.991242,40.750562,2
3,7.7,2012-04-21 04:30:42 UTC,-73.98713,40.733143,-73.991567,40.758092,1
4,5.3,2010-03-09 07:51:00 UTC,-73.968095,40.768008,-73.956655,40.783762,1


### Column Creating

#### Datetime Columns



In the next step we will parse ```pickup_datetime``` into several columns:
* year
* season
* month (name)
* day
* day_name
* hour
* rush_hour (1/0): pickup hour from 7 to 10 or from 15 to 19, i.e. roughly 7am - 10am and 3pm - 7pm (hours are taken from the UTC timestamps as given)

When read from a CSV file, datetime columns are loaded as strings (object dtype) in pandas. So we first need to convert the string dates to the datetime type, and then extract all the information we need from the datetime values.
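
The season and rush-hour rules can be checked on a few made-up timestamps before running them on the full data. This is a minimal sketch, not the notebook's own cell: the explicit `format=` string and the vectorized `between` calls are assumptions chosen for illustration, and `toy` and `ts` are hypothetical names. The expression `month % 12 // 3` maps December, January and February to 0 (Winter), March to May to 1 (Spring), and so on.

```python
import numpy as np
import pandas as pd

# Toy timestamps in the same textual form as pickup_datetime (values are made up).
toy = pd.DataFrame({"pickup_datetime": [
    "2009-06-15 17:26:21 UTC",
    "2010-01-05 16:52:16 UTC",
    "2011-12-18 00:35:00 UTC",
    "2012-04-21 10:30:42 UTC",
]})

# Convert strings to timezone-aware datetimes; the literal " UTC" suffix is part of the format.
ts = pd.to_datetime(toy["pickup_datetime"], format="%Y-%m-%d %H:%M:%S UTC", utc=True)

# Season index: (month % 12) // 3 gives 0..3 for Winter, Spring, Summer, Fall.
seasons = np.array(["Winter", "Spring", "Summer", "Fall"])
toy["season"] = seasons[(ts.dt.month % 12 // 3).to_numpy()]

# Rush hour flag: hours 7..10 or 15..19 inclusive, matching range(7,11) and range(15,20).
toy["hour"] = ts.dt.hour
toy["rush_hour"] = (toy["hour"].between(7, 10) | toy["hour"].between(15, 19)).astype(int)

print(toy[["season", "hour", "rush_hour"]])
```

Vectorized expressions like these avoid a Python-level loop over every row, which can matter on a frame with about a million rides.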

---



In [159]:
df.pickup_datetime = pd.to_datetime(df.pickup_datetime)
df.dtypes

fare_amount                      float64
pickup_datetime      datetime64[ns, UTC]
pickup_longitude                 float64
pickup_latitude                  float64
dropoff_longitude                float64
dropoff_latitude                 float64
passenger_count                    int64
dtype: object

In [160]:
df['year'] = df.pickup_datetime.dt.year #year

#getting seasons for each entity
seasons = ['Winter', 'Spring', 'Summer', 'Fall'] #season
df['season'] = [seasons[i-1] for i in (df.pickup_datetime.dt.month%12// 3 + 1).values]

df['month'] = df.pickup_datetime.dt.month_name() #month
df['day'] = df.pickup_datetime.dt.day #day
df['day_name'] = df.pickup_datetime.dt.day_name() #day name
df['hour'] = df.pickup_datetime.dt.hour #hour

#finding out if the ride was in rush hour (7am-10am, 3pm-7pm) or not
rush_hour = []
for i in df.hour.values:
  if i in range(7,11):
    rush_hour.append(1)
  elif i in range(15,20):
    rush_hour.append(1)
  else:
    rush_hour.append(0)
df['rush_hour'] = rush_hour

df.drop(columns=['pickup_datetime'], inplace=True) # drop donor column

print('df.shape: ', df.shape)
df.head()

df.shape:  (1048445, 13)


Unnamed: 0,fare_amount,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,passenger_count,year,season,month,day,day_name,hour,rush_hour
0,4.5,-73.844311,40.721319,-73.84161,40.712278,1,2009,Summer,June,15,Monday,17,1
1,16.9,-74.016048,40.711303,-73.979268,40.782004,1,2010,Winter,January,5,Tuesday,16,1
2,5.7,-73.982738,40.76127,-73.991242,40.750562,2,2011,Summer,August,18,Thursday,0,0
3,7.7,-73.98713,40.733143,-73.991567,40.758092,1,2012,Spring,April,21,Saturday,4,0
4,5.3,-73.968095,40.768008,-73.956655,40.783762,1,2010,Spring,March,9,Tuesday,7,1


#### Distance Column



One of the most important factors influencing the taxi fare is distance. To get it, we need to convert the latitude and longitude of the pickup and dropoff points into a distance in km.

For this purpose we will use the **great-circle distance** (often computed with the haversine formula): the shortest distance between two points along the surface of a sphere. The first coordinate of each point is assumed to be the latitude, the second the longitude.
In Python it is available as ```great_circle``` in the geopy library, which was already imported at the top of the notebook:
```from geopy.distance import great_circle```
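
For reference, the haversine formula itself can be sketched in NumPy as below. This is an illustrative alternative, not what geopy does internally: the function name `haversine_km` and the mean Earth radius of 6371.009 km are assumptions made for this sketch. Because it works on arrays, it can also be applied to whole coordinate columns at once.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.009  # assumed mean Earth radius for this sketch


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between points given in decimal degrees."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    # Haversine of the central angle between the two points.
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))


# Coordinates of the first ride in the data (pickup, then dropoff).
d = haversine_km(40.721319, -73.844311, 40.712278, -73.84161)
print(f"Distance is: {d:.3f} km")

# The same function accepts arrays, e.g. whole DataFrame columns.
lats = np.array([40.721319, 40.711303])
lons = np.array([-73.844311, -74.016048])
d_arr = haversine_km(lats, lons, lats, lons)  # identical points give zero distance
```

The result for the first ride should agree closely with the geopy value, since both treat the Earth as a sphere.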


Let's check how it works on the first ride in our data.

---



In [161]:
coordA = [df.pickup_latitude.iloc[0], df.pickup_longitude.iloc[0]]
coordB = [df.dropoff_latitude.iloc[0], df.dropoff_longitude.iloc[0]]
print ('Distance is: {:.3f} km'.format(float(great_circle(coordA, coordB).kilometers)))

Distance is: 1.031 km




---

Now that we know how to compute the distance for one entity, we just need to put the code in a loop and get the distance for each of them. Let's do it!

---



In [162]:
distance = []
for i in range(df.shape[0]):
  coordA = [df.pickup_latitude.iloc[i], df.pickup_longitude.iloc[i]]
  coordB = [df.dropoff_latitude.iloc[i], df.dropoff_longitude.iloc[i]]
  distance.append(round(float(great_circle(coordA, coordB).kilometers), 3))

df['distance'] = distance #create a column with distance for each ride
df.drop(columns=['pickup_longitude', 'pickup_latitude', 'dropoff_longitude', 'dropoff_latitude'], inplace=True) #drop donor columns

print('df.shape: ', df.shape)
df.head()

df.shape:  (1048445, 10)


Unnamed: 0,fare_amount,passenger_count,year,season,month,day,day_name,hour,rush_hour,distance
0,4.5,1,2009,Summer,June,15,Monday,17,1,1.031
1,16.9,1,2010,Winter,January,5,Tuesday,16,1,8.45
2,5.7,2,2011,Summer,August,18,Thursday,0,0,1.39
3,7.7,1,2012,Spring,April,21,Saturday,4,0,2.799
4,5.3,1,2010,Spring,March,9,Tuesday,7,1,1.999


#### Feature Extracting Function

After creating and training the model, we will have to apply it to the test data to get predictions. To do so, the test data must be converted to the same format as the training data. For this purpose let's create a function that will do it for us:
* drop ```key``` column
* convert ```pickup_datetime``` into datetime data type
* create ```year```, ```season```, ```month```, ```day```, ```day_name```, ```hour```, ```rush_hour``` out of ```pickup_datetime```
* create ```distance``` column out of coordinate columns

---

In [166]:
def clean_data(df):

  df.drop(columns = ['key'], inplace=True) #drop key column

  df.pickup_datetime = pd.to_datetime(df.pickup_datetime) #convert dtype to datetime

  #creating year, season, month, day, day_name, hour columns
  df['year'] = df.pickup_datetime.dt.year #year
  seasons = ['Winter', 'Spring', 'Summer', 'Fall'] 
  df['season'] = [seasons[i-1] for i in (df.pickup_datetime.dt.month%12// 3 + 1).values] #season
  df['month'] = df.pickup_datetime.dt.month_name() #month
  df['day'] = df.pickup_datetime.dt.day #day
  df['day_name'] = df.pickup_datetime.dt.day_name() #day name
  df['hour'] = df.pickup_datetime.dt.hour #hour
  
  #creating rush hour column
  rush_hour = []
  for i in df.hour.values:
    if i in range(7,11):
      rush_hour.append(1)
    elif i in range(15,20):
      rush_hour.append(1)
    else:
      rush_hour.append(0)
  df['rush_hour'] = rush_hour # rush hour
  
  df.drop(columns=['pickup_datetime'], inplace=True) # drop donor pickup_datetime column

  #creating distance column
  distance = []
  for i in range(df.shape[0]):
    coordA = [df.pickup_latitude.iloc[i], df.pickup_longitude.iloc[i]]
    coordB = [df.dropoff_latitude.iloc[i], df.dropoff_longitude.iloc[i]]
    distance.append(round(float(great_circle(coordA, coordB).kilometers), 3))  
  df['distance'] = distance #create a column with distance for each ride
  
  df.drop(columns=['pickup_longitude', 'pickup_latitude', 'dropoff_longitude', 'dropoff_latitude'], inplace=True) #drop donor columns

  return df

In [164]:
df1 = pd.read_csv('https://raw.githubusercontent.com/pe44enka/TaxiFarePrediction/master/data/test.csv')
df1.head()

Unnamed: 0,key,pickup_datetime,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,passenger_count
0,2015-01-27 13:08:24.0000002,2015-01-27 13:08:24 UTC,-73.97332,40.763805,-73.98143,40.743835,1
1,2015-01-27 13:08:24.0000003,2015-01-27 13:08:24 UTC,-73.986862,40.719383,-73.998886,40.739201,1
2,2011-10-08 11:53:44.0000002,2011-10-08 11:53:44 UTC,-73.982524,40.75126,-73.979654,40.746139,1
3,2012-12-01 21:12:12.0000002,2012-12-01 21:12:12 UTC,-73.98116,40.767807,-73.990448,40.751635,1
4,2012-12-01 21:12:12.0000003,2012-12-01 21:12:12 UTC,-73.966046,40.789775,-73.988565,40.744427,1


## Save Clean Data

Having finished cleaning the data, let's save the result to avoid repeating the whole process in the future.

---

In [120]:
df.to_csv('clean_train.csv', index=False) #index=False keeps the row index out of the file
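
Whether the row index is written matters when the file is read back: by default `to_csv` stores it as an extra unnamed column, while `index=False` omits it. The sketch below shows the difference on a made-up toy frame written to a temporary directory (the file name `clean_toy.csv` is hypothetical).

```python
import os
import tempfile

import pandas as pd

# Toy frame standing in for the cleaned data (values are made up).
toy = pd.DataFrame({"fare_amount": [4.5, 16.9], "distance": [1.031, 8.45]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "clean_toy.csv")

    # Default: the index is written and comes back as an 'Unnamed: 0' column.
    toy.to_csv(path)
    with_index = pd.read_csv(path)

    # index=False: only the real columns are written.
    toy.to_csv(path, index=False)
    without_index = pd.read_csv(path)

print(list(with_index.columns))
print(list(without_index.columns))
```

If a file was already saved with the index, `pd.read_csv(path, index_col=0)` is another way to avoid the extra column.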

# Data Analysis & Feature Selection

## Load cleaned data

In [None]:
df = pd.read_csv('')

In [96]:
df.head()

Unnamed: 0,fare_amount,passenger_count,year,month,day,day_name,hour,season,rush_hour,distance
0,4.5,1,2009,June,15,Monday,17,Summer,1,1.031
1,16.9,1,2010,January,5,Tuesday,16,Winter,1,8.45
2,5.7,2,2011,August,18,Thursday,0,Summer,0,1.39
3,7.7,1,2012,April,21,Saturday,4,Spring,0,2.799
4,5.3,1,2010,March,9,Tuesday,7,Spring,1,1.999


In [97]:
df.columns

Index(['fare_amount', 'passenger_count', 'year', 'month', 'day', 'day_name',
       'hour', 'season', 'rush_hour', 'distance'],
      dtype='object')