<a href="https://colab.research.google.com/github/hargurjeet/bt/blob/main/NY_taxi_fare_prediction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **NY City Taxi Fare Prediction**


The purpose of this task is to give us a gauge of your skills and experience as a data scientist. 

The following notebook contains the details of building the ML models and the techniques used to do so.

# **Table Of Contents**<a name="top"></a>


---


  1. [About the Dataset](#1)
  2. [Loading the Dataset and Preprocessing](#2)
  3. [Exploratory Data Analysis](#3)
  4. [Train a baseline model](#4)
  5. [Feature Engineering](#5)
  6. [Train and Evaluate models](#6)
  7. [Summary](#7)
  8. [Future Work](#8)
  9. [Reference](#9)

# 1: About the Dataset <a name="1"></a>


---
<a href="#top" class="btn btn-primary btn-sm" role="button" aria-pressed="true" style="color:white" data-toggle="popover">Go to TOC</a> 

# 2: Loading the Dataset and Preprocessing <a name="2"></a>


---
<a href="#top" class="btn btn-primary btn-sm" role="button" aria-pressed="true" style="color:white" data-toggle="popover">Go to TOC</a> 

# 3: Exploratory Data Analysis <a name="3"></a>


---
<a href="#top" class="btn btn-primary btn-sm" role="button" aria-pressed="true" style="color:white" data-toggle="popover">Go to TOC</a> 

# 4: Train a baseline model <a name="4"></a>


---
<a href="#top" class="btn btn-primary btn-sm" role="button" aria-pressed="true" style="color:white" data-toggle="popover">Go to TOC</a> 

# 5: Feature Engineering <a name="5"></a>


---
<a href="#top" class="btn btn-primary btn-sm" role="button" aria-pressed="true" style="color:white" data-toggle="popover">Go to TOC</a>

# 6: Train and Evaluate models <a name="6"></a>


---
<a href="#top" class="btn btn-primary btn-sm" role="button" aria-pressed="true" style="color:white" data-toggle="popover">Go to TOC</a> 

# 7: Summary <a name="7"></a>


---
<a href="#top" class="btn btn-primary btn-sm" role="button" aria-pressed="true" style="color:white" data-toggle="popover">Go to TOC</a> 

# 8: Future Work <a name="8"></a>


---
<a href="#top" class="btn btn-primary btn-sm" role="button" aria-pressed="true" style="color:white" data-toggle="popover">Go to TOC</a> 

# 9: Reference <a name="9"></a>


---
<a href="#top" class="btn btn-primary btn-sm" role="button" aria-pressed="true" style="color:white" data-toggle="popover">Go to TOC</a> 

In [79]:
## Performing the required imports
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from xgboost import XGBRegressor

In [80]:
required_cols = ['fare_amount', 'pickup_datetime', 'pickup_longitude',
       'pickup_latitude', 'dropoff_longitude', 'dropoff_latitude',
       'passenger_count']

# pickup_datetime is left out of dtypes because it is parsed as a date via parse_dates
dtypes = {
    'fare_amount' : 'float32', 
    'pickup_longitude': 'float32',
    'pickup_latitude': 'float32',
    'dropoff_longitude': 'float32',
    'dropoff_latitude': 'float32',
    'passenger_count': 'uint8'
}
file_path = 'https://raw.githubusercontent.com/hargurjeet/bt/main/ny_taxi_fare_data.csv'
df = pd.read_csv(file_path, 
                 usecols = required_cols, 
                 parse_dates=['pickup_datetime'],
                 dtype = dtypes)
df.head()

Unnamed: 0,fare_amount,pickup_datetime,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,passenger_count
0,4.5,2009-06-15 17:26:21+00:00,-73.844315,40.721317,-73.841614,40.712276,1
1,16.9,2010-01-05 16:52:16+00:00,-74.016045,40.711304,-73.979271,40.782005,1
2,5.7,2011-08-18 00:35:00+00:00,-73.982735,40.761269,-73.991241,40.750561,2
3,7.7,2012-04-21 04:30:42+00:00,-73.987129,40.733143,-73.99157,40.758091,1
4,5.3,2010-03-09 07:51:00+00:00,-73.968094,40.768009,-73.956657,40.783764,1


In [81]:
df.shape

(50000, 7)

# Data Exploration

- Basic info regarding the dataset
- EDA and visualizations
- Key insights

In [82]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 7 columns):
 #   Column             Non-Null Count  Dtype              
---  ------             --------------  -----              
 0   fare_amount        50000 non-null  float32            
 1   pickup_datetime    50000 non-null  datetime64[ns, UTC]
 2   pickup_longitude   50000 non-null  float32            
 3   pickup_latitude    50000 non-null  float32            
 4   dropoff_longitude  50000 non-null  float32            
 5   dropoff_latitude   50000 non-null  float32            
 6   passenger_count    50000 non-null  uint8              
dtypes: datetime64[ns, UTC](1), float32(5), uint8(1)
memory usage: 1.4 MB


There appear to be no null values in the dataset.
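The same conclusion can be checked explicitly by counting missing values per column. A minimal sketch, assuming a small made-up frame (`demo_df` is hypothetical and not part of the dataset):

```python
import numpy as np
import pandas as pd

# Hypothetical two-row frame with one missing value, for illustration only
demo_df = pd.DataFrame({'fare_amount': [4.5, np.nan], 'passenger_count': [1, 2]})

# Count missing values per column; all zeros would mean there are no nulls
null_counts = demo_df.isna().sum()
print(null_counts)
```

Running `df.isna().sum()` on the real dataframe should report zero for every column if the `df.info()` output above is accurate.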

In [83]:
df.describe()

Unnamed: 0,fare_amount,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,passenger_count
count,50000.0,50000.0,50000.0,50000.0,50000.0,50000.0
mean,11.364215,-72.521416,39.931904,-72.517723,39.924244,1.66784
std,9.685438,10.392804,6.224685,10.406597,6.014816,1.289195
min,-5.0,-75.423851,-74.006889,-84.654243,-74.006378,0.0
25%,6.0,-73.992065,40.734879,-73.99115,40.734371,1.0
50%,8.5,-73.981842,40.752678,-73.98008,40.753372,1.0
75%,12.5,-73.967148,40.767361,-73.963585,40.768166,2.0
max,200.0,40.78347,401.083344,40.851028,43.415192,6.0


In [84]:
def min_max_date(df, date_col):
  # Return the earliest and latest timestamps, in that order
  return df[date_col].min(), df[date_col].max()

min_max_date(df, 'pickup_datetime')

(Timestamp('2009-01-01 01:31:49+0000', tz='UTC'),
 Timestamp('2015-06-30 22:42:39+0000', tz='UTC'))

Observations:
- The minimum fare amount is -5 dollars, which is invalid, and the maximum is 200 dollars.
- 50% of rides cost 8.50 dollars or less, and 75% cost 12.50 dollars or less. This gives a basic understanding of how good our models need to be.
- When making predictions, I would expect to be within ±3 dollars of the actual fare; otherwise I am off by a lot.
- The passenger count ranges from 0 to 6. A count of 0 looks invalid and 6 is unusually high for a taxi, so we might require some data cleanup.
- The pickup dates run from January 2009 to June 2015, so we have about 6.5 years of data.

# 3. EDA and Visualization

In [85]:
df.head()

Unnamed: 0,fare_amount,pickup_datetime,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,passenger_count
0,4.5,2009-06-15 17:26:21+00:00,-73.844315,40.721317,-73.841614,40.712276,1
1,16.9,2010-01-05 16:52:16+00:00,-74.016045,40.711304,-73.979271,40.782005,1
2,5.7,2011-08-18 00:35:00+00:00,-73.982735,40.761269,-73.991241,40.750561,2
3,7.7,2012-04-21 04:30:42+00:00,-73.987129,40.733143,-73.99157,40.758091,1
4,5.3,2010-03-09 07:51:00+00:00,-73.968094,40.768009,-73.956657,40.783764,1


# 4. Preparing the dataset

We will set aside 20% of our data as a validation set to evaluate our models.

In [7]:
train_df, val_df = train_test_split(df, test_size=0.2, random_state=42)

In [8]:
len(train_df), len(val_df)

(40000, 10000)

In [9]:
## extract input and targets

train_df.columns

Index(['fare_amount', 'pickup_datetime', 'pickup_longitude', 'pickup_latitude',
       'dropoff_longitude', 'dropoff_latitude', 'passenger_count'],
      dtype='object')

In [10]:
input_cols = ['pickup_longitude', 'pickup_latitude','dropoff_longitude', 'dropoff_latitude', 'passenger_count']
target_cols = 'fare_amount'

In [11]:
train_inputs = train_df[input_cols]
train_targets = train_df[target_cols]
val_inputs = val_df[input_cols]
val_targets = val_df[target_cols]

# 5. Train baseline models

In [12]:
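class regressor():
  """Baseline model that always predicts the mean of the training targets."""
  def fit(self, inputs, targets):
    # Inputs are ignored; only the mean target is stored
    self.mean = targets.mean()

  def predicts(self, inputs):
    # Return the stored mean once per input row
    return np.full(inputs.shape[0], self.mean)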

In [13]:
np.full([40000], 40000)

array([40000, 40000, 40000, ..., 40000, 40000, 40000])

In [14]:
mean_model = regressor()

In [15]:
mean_model.fit(train_inputs, train_targets)
print(f'Average fare for the taxi is {mean_model.mean}')

Average fare for the taxi is 11.376873970031738


In [16]:
mean_model.predicts(train_inputs)

array([11.37687397, 11.37687397, 11.37687397, ..., 11.37687397,
       11.37687397, 11.37687397])

In [17]:
train_targets

39087    10.000000
30893     4.000000
45278     6.900000
16398     7.700000
13653     4.500000
           ...    
11284     6.500000
44732     3.700000
38158    12.100000
860      12.100000
15795    57.330002
Name: fare_amount, Length: 40000, dtype: float32

In [18]:
mean_model.predicts(val_inputs)

array([11.37687397, 11.37687397, 11.37687397, ..., 11.37687397,
       11.37687397, 11.37687397])

In [19]:
val_targets

33553     7.300000
9427     33.299999
199       5.500000
12447     7.000000
39489     5.300000
           ...    
28567     9.300000
25079     5.500000
18707     6.500000
15200    30.500000
5857      6.900000
Name: fare_amount, Length: 10000, dtype: float32

In [20]:
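def rmse(targets, predictions):
  # Take the square root of the MSE directly; the squared=False argument is not available in recent scikit-learn releases
  return np.sqrt(mean_squared_error(targets, predictions))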

rmse(train_targets, mean_model.predicts(train_inputs))

9.696218748570448

An RMSE of about 9.70 dollars is really bad, given that 75% of fares are 12.50 dollars or less. This is our baseline: any model we train should do better than this.
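As a sanity check on this baseline, the RMSE of a constant mean prediction equals the population standard deviation of the targets it was fitted on, which is why the score lands close to the `std` of `fare_amount` reported by `df.describe()`. A minimal sketch with made-up fares (the `fares` array is hypothetical, not taken from the dataset):

```python
import numpy as np

# Hypothetical fares, for illustration only (not from the dataset)
fares = np.array([4.5, 16.9, 5.7, 7.7, 5.3])

# Constant prediction: the mean fare, repeated for every ride
mean_preds = np.full(fares.shape[0], fares.mean())

# RMSE computed by hand: square root of the mean squared error
baseline_rmse = np.sqrt(np.mean((fares - mean_preds) ** 2))

# For a mean predictor this matches the population standard deviation (ddof=0)
print(baseline_rmse, fares.std())
```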

In [21]:
## Trying linear regression

linear_model = LinearRegression()
linear_model.fit(train_inputs, train_targets)
train_preds = linear_model.predict(train_inputs)
train_preds

array([11.819403, 11.820234, 11.687607, ..., 11.41951 , 11.284868,
       11.286995], dtype=float32)

In [22]:
train_targets

39087    10.000000
30893     4.000000
45278     6.900000
16398     7.700000
13653     4.500000
           ...    
11284     6.500000
44732     3.700000
38158    12.100000
860      12.100000
15795    57.330002
Name: fare_amount, Length: 40000, dtype: float32

In [23]:
rmse(train_targets, train_preds)

9.694377

In [24]:
val_preds = linear_model.predict(val_inputs)
rmse(val_targets, val_preds)

9.641011

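- The linear model barely improves on the mean baseline (training RMSE of 9.694 vs. 9.696), so it may not be learning much from the raw latitude and longitude values.
- We are not using the pickup date and time yet; fares can be seasonal and can depend on the time of day.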
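One way to check whether fares depend on the time of day is to average the fare per pickup hour. A minimal sketch, assuming a tiny made-up frame (`demo_df` and its values are hypothetical; on the real data the same `groupby` could be applied to `train_df`):

```python
import pandas as pd

# Hypothetical rides, for illustration only (not from the dataset)
demo_df = pd.DataFrame({
    'pickup_datetime': pd.to_datetime(
        ['2012-01-01 08:15:00', '2012-01-01 08:45:00', '2012-01-01 23:30:00'],
        utc=True),
    'fare_amount': [6.0, 8.0, 15.0],
})

# Average fare for each pickup hour (0-23)
mean_fare_by_hour = demo_df.groupby(demo_df['pickup_datetime'].dt.hour)['fare_amount'].mean()
print(mean_fare_by_hour)
```

Large differences between hours on the real data would support adding the hour as a feature.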

# 6. Feature Engineering

- Extract parts of the date
- Remove outliers and invalid data
- Add the distance between pickup and drop-off
- Add the distance from landmarks

## Extract Parts of Date
 - Year
 - Month
 - Day
 - Weekday
 - Hour

In [25]:
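def part_of_dates(df, column):
  # Add year, month, day, weekday and hour columns derived from a datetime column (in place).
  # Note: the '_week' column holds the weekday number (0 = Monday), not the week of the year.
  df[column+ '_year']= df[column].dt.year
  df[column+ '_month']= df[column].dt.month
  df[column+ '_day']= df[column].dt.day
  df[column+ '_week']= df[column].dt.weekday
  df[column+ '_hour']= df[column].dt.hour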

In [26]:
part_of_dates(train_df, 'pickup_datetime')
part_of_dates(val_df, 'pickup_datetime')

In [27]:
train_df.head()

Unnamed: 0,fare_amount,pickup_datetime,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,passenger_count,pickup_datetime_year,pickup_datetime_month,pickup_datetime_day,pickup_datetime_week,pickup_datetime_hour
39087,10.0,2013-07-27 17:04:00+00:00,-73.974335,40.791428,-73.979034,40.766365,5,2013,7,27,5,17
30893,4.0,2013-01-08 09:26:00+00:00,-73.973656,40.751633,-73.969948,40.756702,5,2013,1,8,1,9
45278,6.9,2012-03-17 16:45:00+00:00,-73.975266,40.752281,-73.995094,40.737499,4,2012,3,17,5,16
16398,7.7,2012-06-08 09:01:17+00:00,-73.983032,40.766785,-73.971947,40.789288,1,2012,6,8,4,9
13653,4.5,2015-06-22 17:30:49+00:00,-73.986717,40.771648,-73.98214,40.770699,1,2015,6,22,0,17


## Add distance between pickup and drop-off

- https://en.wikipedia.org/wiki/Haversine_formula
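For reference, the haversine distance $d$ between two points with latitudes $\varphi_1, \varphi_2$ and longitudes $\lambda_1, \lambda_2$ (in radians) on a sphere of radius $R$ can be written as follows. The implementation in this notebook uses $R = 6367$ km.

$$
a = \sin^2\!\left(\frac{\varphi_2 - \varphi_1}{2}\right) + \cos\varphi_1 \cos\varphi_2 \, \sin^2\!\left(\frac{\lambda_2 - \lambda_1}{2}\right), \qquad d = 2R \arcsin\!\left(\sqrt{a}\right)
$$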


In [28]:
def haversine_np(lon1, lat1, lon2, lat2):
    """
    Calculate the great circle distance between two points
    on the earth (specified in decimal degrees)

    All args must be of equal length.    

    """
    lon1, lat1, lon2, lat2 = map(np.radians, [lon1, lat1, lon2, lat2])

    dlon = lon2 - lon1
    dlat = lat2 - lat1

    a = np.sin(dlat/2.0)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon/2.0)**2

    c = 2 * np.arcsin(np.sqrt(a))
    km = 6367 * c
    return km

In [29]:
def trip_distance(df):
  df['trip_distance'] = haversine_np(df['pickup_longitude'],
                                     df['pickup_latitude'],
                                     df['dropoff_longitude'],
                                     df['dropoff_latitude'])

In [30]:
trip_distance(train_df)
trip_distance(val_df)

In [31]:
train_df[((train_df.trip_distance >=8) | (train_df.trip_distance <= 4))]

Unnamed: 0,fare_amount,pickup_datetime,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,passenger_count,pickup_datetime_year,pickup_datetime_month,pickup_datetime_day,pickup_datetime_week,pickup_datetime_hour,trip_distance
39087,10.000000,2013-07-27 17:04:00+00:00,-73.974335,40.791428,-73.979034,40.766365,5,2013,7,27,5,17,2.813101
30893,4.000000,2013-01-08 09:26:00+00:00,-73.973656,40.751633,-73.969948,40.756702,5,2013,1,8,1,9,0.643929
45278,6.900000,2012-03-17 16:45:00+00:00,-73.975266,40.752281,-73.995094,40.737499,4,2012,3,17,5,16,2.341894
16398,7.700000,2012-06-08 09:01:17+00:00,-73.983032,40.766785,-73.971947,40.789288,1,2012,6,8,4,9,2.669228
13653,4.500000,2015-06-22 17:30:49+00:00,-73.986717,40.771648,-73.982140,40.770699,1,2015,6,22,0,17,0.399969
...,...,...,...,...,...,...,...,...,...,...,...,...,...
16850,7.500000,2014-12-05 18:00:56+00:00,-73.983727,40.766174,-73.977547,40.777889,1,2014,12,5,4,18,1.401790
11284,6.500000,2010-12-07 11:23:00+00:00,-73.980911,40.767860,-73.980209,40.780342,1,2010,12,7,1,11,1.388346
44732,3.700000,2009-06-30 22:39:25+00:00,-73.960564,40.775860,-73.961830,40.771255,1,2009,6,30,1,22,0.522993
38158,12.100000,2010-09-14 22:34:00+00:00,-73.974663,40.751743,-73.995987,40.744347,2,2010,9,14,1,22,1.974436


## Add distance from popular landmarks

- JFK airport
- LGA Airport
- EWR airport
- WTC

In [32]:
jfk_lonlat = -73.7781, 40.6413
lga_lonlat = -73.8740, 40.7769
ewr_lonlat = -74.1745, 40.6895
wtc_lonlat = -74.0099, 40.7126
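The landmark coordinates are defined here, but this part of the notebook does not show them being turned into features. One possible way to do that is sketched below, reusing the haversine formula to measure the distance from a landmark to each drop-off point. The helper `add_landmark_dropoff_distance`, the column name `jfk_drop_distance` and the two demo rows are hypothetical, not taken from the original notebook.

```python
import numpy as np
import pandas as pd

def haversine_km(lon1, lat1, lon2, lat2):
    """Great-circle distance in km, using the same formula and radius as haversine_np."""
    lon1, lat1, lon2, lat2 = map(np.radians, [lon1, lat1, lon2, lat2])
    a = (np.sin((lat2 - lat1) / 2.0) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2.0) ** 2)
    return 6367 * 2 * np.arcsin(np.sqrt(a))

def add_landmark_dropoff_distance(df, landmark_name, landmark_lonlat):
    """Add a '<landmark_name>_drop_distance' column (hypothetical column name)."""
    lon, lat = landmark_lonlat
    df[landmark_name + '_drop_distance'] = haversine_km(
        lon, lat, df['dropoff_longitude'], df['dropoff_latitude'])
    return df

jfk_lonlat = -73.7781, 40.6413

# Hypothetical drop-offs: one exactly at JFK, one in Manhattan
demo_df = pd.DataFrame({
    'dropoff_longitude': [-73.7781, -73.979271],
    'dropoff_latitude': [40.6413, 40.782005],
})
demo_df = add_landmark_dropoff_distance(demo_df, 'jfk', jfk_lonlat)
print(demo_df)
```

The same helper could be called once per landmark (LGA, EWR, WTC) on both `train_df` and `val_df`, with the new columns then added to `input_cols`.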

## Remove outlier and invalid data

- Fare amount
- Passenger count
- Pickup latitude and longitude
- Drop-off latitude and longitude

In [33]:
df.describe()

Unnamed: 0,fare_amount,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,passenger_count
count,50000.0,50000.0,50000.0,50000.0,50000.0,50000.0
mean,11.364215,-72.521416,39.931904,-72.517723,39.924244,1.66784
std,9.685438,10.392804,6.224685,10.406597,6.014816,1.289195
min,-5.0,-75.423851,-74.006889,-84.654243,-74.006378,0.0
25%,6.0,-73.992065,40.734879,-73.99115,40.734371,1.0
50%,8.5,-73.981842,40.752678,-73.98008,40.753372,1.0
75%,12.5,-73.967148,40.767361,-73.963585,40.768166,2.0
max,200.0,40.78347,401.083344,40.851028,43.415192,6.0


We will keep only rides whose coordinates fall within the following ranges:
- Longitudes: -75 to -72
- Latitudes: 40 to 42

In [34]:
train_df.shape, val_df.shape

((40000, 13), (10000, 13))

In [35]:
def remove_outlier(df):
  # Keep only rides whose pickup and drop-off coordinates fall inside the chosen bounding box
  return df[(df.pickup_longitude >= -75) & (df.pickup_longitude <= -72) &
            (df.pickup_latitude >= 40) & (df.pickup_latitude <= 42) &
            (df.dropoff_longitude >= -75) & (df.dropoff_longitude <= -72) &
            (df.dropoff_latitude >= 40) & (df.dropoff_latitude <= 42)]

In [36]:
train_df = remove_outlier(train_df)
val_df = remove_outlier(val_df)

In [37]:
train_df.shape, val_df.shape

((39158, 13), (9795, 13))
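The filter above only covers the coordinates, while fare amount and passenger count are also listed as candidates for cleanup. A minimal sketch of how those two could be filtered is shown below. The function name and all thresholds (fares between 1 and 500 dollars, 1 to 6 passengers) are assumptions chosen for illustration, not values used in the original notebook.

```python
import pandas as pd

def remove_invalid_fares_and_passengers(df, min_fare=1.0, max_fare=500.0,
                                        min_passengers=1, max_passengers=6):
    """Keep rows whose fare and passenger count fall inside the assumed valid ranges."""
    keep = (df['fare_amount'].between(min_fare, max_fare)
            & df['passenger_count'].between(min_passengers, max_passengers))
    return df[keep]

# Hypothetical rows: a negative fare, a ride with zero passengers, and two valid rides
demo_df = pd.DataFrame({
    'fare_amount': [-5.0, 8.5, 12.5, 200.0],
    'passenger_count': [1, 0, 2, 6],
})
clean_df = remove_invalid_fares_and_passengers(demo_df)
print(clean_df)
```

Only the last two demo rows survive, since the first has a negative fare and the second has no passengers.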

# Train & Evaluate different models

- Linear Regression
- Random forests
- XGBoost

In [39]:
train_df.columns

Index(['fare_amount', 'pickup_datetime', 'pickup_longitude', 'pickup_latitude',
       'dropoff_longitude', 'dropoff_latitude', 'passenger_count',
       'pickup_datetime_year', 'pickup_datetime_month', 'pickup_datetime_day',
       'pickup_datetime_week', 'pickup_datetime_hour', 'trip_distance'],
      dtype='object')

In [41]:
input_cols =[ 'pickup_longitude', 'pickup_latitude',
       'dropoff_longitude', 'dropoff_latitude', 'passenger_count',
       'pickup_datetime_year', 'pickup_datetime_month', 'pickup_datetime_day',
       'pickup_datetime_week', 'pickup_datetime_hour', 'trip_distance']

target_cols =['fare_amount']

In [42]:
train_inputs = train_df[input_cols]
train_targets = train_df[target_cols]

val_inputs = val_df[input_cols]
val_targets = val_df[target_cols]

In [50]:
def evaluate(model):
  train_preds = model.predict(train_inputs)
  val_preds = model.predict(val_inputs)
  # Take the square root of the MSE directly; the squared=False argument is not available in recent scikit-learn releases
  train_rmse = np.sqrt(mean_squared_error(train_targets, train_preds))
  val_rmse = np.sqrt(mean_squared_error(val_targets, val_preds))

  return train_rmse, val_rmse, train_preds, val_preds

In [44]:
## Linear regression

model1 = LinearRegression()
model1.fit(train_inputs, train_targets)

LinearRegression()

In [51]:
evaluate(model1)

(5.677216889594064, 6.135946392221846, array([[ 9.81290957],
        [ 6.8938265 ],
        [ 9.36002566],
        ...,
        [ 8.16200736],
        [13.98067951],
        [51.28399065]]), array([[10.2415437 ],
        [21.47976393],
        [ 7.18156925],
        ...,
        [ 9.42625427],
        [33.87505085],
        [ 6.77001337]]))

In [76]:
# Random Forest
model2 = RandomForestRegressor(random_state=42, n_jobs=-1, max_depth=5, n_estimators=50)

In [77]:
%%time
model2.fit(train_inputs, train_targets)


CPU times: user 7.31 s, sys: 20.2 ms, total: 7.33 s
Wall time: 4.57 s


RandomForestRegressor(max_depth=5, n_estimators=50, n_jobs=-1, random_state=42)

In [78]:
evaluate(model2)

(4.284449605695932,
 4.816226139981735,
 array([10.63644462,  6.08036485,  8.55094999, ...,  7.32094728,
        11.84028486, 53.57612585]),
 array([ 9.45627717, 27.38763752,  6.08036485, ...,  8.75517624,
        29.49635755,  6.58497181]))

In [73]:
## xgboost
model3= XGBRegressor(n_estimators=50, n_jobs=-1, max_depth=3, objective='reg:squarederror', random_state=42)

In [74]:
model3.fit(train_inputs, train_targets)

XGBRegressor(n_estimators=50, n_jobs=-1, objective='reg:squarederror',
             random_state=42)

In [75]:
evaluate(model3)

(4.0595284,
 4.5863333,
 array([11.112277 ,  5.8606014,  9.027236 , ...,  7.582131 , 12.258968 ,
        53.463707 ], dtype=float32),
 array([ 9.209679 , 25.01746  ,  5.8606014, ...,  8.788843 , 32.39144  ,
         6.4587774], dtype=float32))

# Hyperparameter tuning