<a href="https://colab.research.google.com/github/hargurjeet/bt/blob/main/NY_taxi_fare_prediction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **NY City Taxi Fare Prediction**


The purpose of this task is to give us a gauge of your skills and experience as a data scientist. 

This notebook documents how the ML models were built and the techniques used along the way.

# **Table Of Contents**<a name="top"></a>


---


  1. [About the Dataset](#1)
  2. [Loading the Dataset and Preprocessing](#2)
  3. [Exploratory Data Analysis](#3)
  4. [Train a baseline model](#4)
  5. [Feature Engineering](#5)
  6. [Train and Evaluate models](#6)
  7. [Summary](#7)
  8. [Future Work](#8)
  9. [Reference](#9)

# 1: About the Dataset <a name="1"></a>


---
<a href="#top" class="btn btn-primary btn-sm" role="button" aria-pressed="true" style="color:white" data-toggle="popover">Go to TOC</a> 

New York is world famous for its bright yellow taxis. For this task you will build a model which can predict the fare of a new taxi ride. Attached is a data set which contains the following variables:

- key - Unique string identifying each row in both the training and test sets. Comprised of pickup_datetime plus a unique integer, but this doesn't matter, it should just be used as a unique ID field. 
- pickup_datetime - timestamp value indicating when the taxi ride started.
- pickup_longitude - float for longitude coordinate of where the taxi ride started.
- pickup_latitude - float for latitude coordinate of where the taxi ride started.
- dropoff_longitude - float for longitude coordinate of where the taxi ride ended.
- dropoff_latitude - float for latitude coordinate of where the taxi ride ended.
- passenger_count - integer indicating the number of passengers in the taxi ride.
- fare_amount - float dollar amount of the cost of the taxi ride. 


# 2: Loading the Dataset and Preprocessing <a name="2"></a>


---
<a href="#top" class="btn btn-primary btn-sm" role="button" aria-pressed="true" style="color:white" data-toggle="popover">Go to TOC</a> 

The dataset is hosted in my GitHub repo.

I also import the standard libraries needed for data preprocessing, model building, etc.

In [21]:
## Libraries to import data and preprocessing
import pandas as pd
import numpy as np

## Libraries for model building
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from xgboost import XGBRegressor

In [22]:
# Required columns from the dataset.
required_cols = ['fare_amount', 'pickup_datetime', 'pickup_longitude',
       'pickup_latitude', 'dropoff_longitude', 'dropoff_latitude',
       'passenger_count']

# Data types are mapped so that pandas can parse the columns accordingly.
# pickup_datetime is left out here because it is handled by parse_dates below.
dtypes = {
    'fare_amount' : 'float32', 
    'pickup_longitude': 'float32',
    'pickup_latitude': 'float32',
    'dropoff_longitude': 'float32',
    'dropoff_latitude': 'float32',
    'passenger_count': 'uint8'
}

# Dataset imported from github
file_path = 'https://raw.githubusercontent.com/hargurjeet/bt/main/ny_taxi_fare_data.csv'
df = pd.read_csv(file_path, 
                 usecols = required_cols, 
                 parse_dates=['pickup_datetime'],
                 dtype = dtypes)
df.head()

Unnamed: 0,fare_amount,pickup_datetime,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,passenger_count
0,4.5,2009-06-15 17:26:21+00:00,-73.844315,40.721317,-73.841614,40.712276,1
1,16.9,2010-01-05 16:52:16+00:00,-74.016045,40.711304,-73.979271,40.782005,1
2,5.7,2011-08-18 00:35:00+00:00,-73.982735,40.761269,-73.991241,40.750561,2
3,7.7,2012-04-21 04:30:42+00:00,-73.987129,40.733143,-73.99157,40.758091,1
4,5.3,2010-03-09 07:51:00+00:00,-73.968094,40.768009,-73.956657,40.783764,1


In [23]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 7 columns):
 #   Column             Non-Null Count  Dtype              
---  ------             --------------  -----              
 0   fare_amount        50000 non-null  float32            
 1   pickup_datetime    50000 non-null  datetime64[ns, UTC]
 2   pickup_longitude   50000 non-null  float32            
 3   pickup_latitude    50000 non-null  float32            
 4   dropoff_longitude  50000 non-null  float32            
 5   dropoff_latitude   50000 non-null  float32            
 6   passenger_count    50000 non-null  uint8              
dtypes: datetime64[ns, UTC](1), float32(5), uint8(1)
memory usage: 1.4 MB


Key Insights - 
- The dataset contains no null values.
- The column data types parsed by pandas look fine.
- The dataset contains 50,000 records.
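
The 1.4 MB reported by `df.info()` follows from the dtypes chosen when loading the file. The sketch below is a rough back-of-the-envelope check; it assumes the index overhead is negligible and uses only the column counts and dtypes printed above.

```python
# Bytes per value for each dtype listed in the df.info() output.
bytes_per_dtype = {"float32": 4, "datetime64[ns, UTC]": 8, "uint8": 1}
# Number of columns of each dtype, as reported by df.info().
column_counts = {"float32": 5, "datetime64[ns, UTC]": 1, "uint8": 1}

n_rows = 50_000
bytes_per_row = sum(bytes_per_dtype[d] * n for d, n in column_counts.items())
total_bytes = n_rows * bytes_per_row

print(bytes_per_row)                      # bytes needed for one record
print(round(total_bytes / 1024**2, 1))    # total size in units of 1024**2 bytes

# For comparison: the same six numeric columns stored as 8-byte values,
# plus the 8-byte datetime column.
wide_bytes_per_row = 6 * 8 + 8
print(round(n_rows * wide_bytes_per_row / 1024**2, 1))
```

Under these assumptions, the compact dtypes roughly halve the memory footprint compared with 8-byte numeric columns.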

In [24]:
df.describe()

Unnamed: 0,fare_amount,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,passenger_count
count,50000.0,50000.0,50000.0,50000.0,50000.0,50000.0
mean,11.364215,-72.521416,39.931904,-72.517723,39.924244,1.66784
std,9.685438,10.392804,6.224685,10.406597,6.014816,1.289195
min,-5.0,-75.423851,-74.006889,-84.654243,-74.006378,0.0
25%,6.0,-73.992065,40.734879,-73.99115,40.734371,1.0
50%,8.5,-73.981842,40.752678,-73.98008,40.753372,1.0
75%,12.5,-73.967148,40.767361,-73.963585,40.768166,2.0
max,200.0,40.78347,401.083344,40.851028,43.415192,6.0


In [25]:
def min_max_date(df, date_col):
  # Note: returns the latest timestamp first, then the earliest
  return df[date_col].max(), df[date_col].min()

min_max_date(df, 'pickup_datetime')

(Timestamp('2015-06-30 22:42:39+0000', tz='UTC'),
 Timestamp('2009-01-01 01:31:49+0000', tz='UTC'))

Key Insights - 
- New York City lies at roughly latitude 40.73, longitude -73.93 (see the Reference section for the source).
- In the dataset, longitude values range from about -85 to 41 and latitude values from about -74 to 401. These ranges are highly suspect, and a latitude above 90 is not physically possible.
- Given the geographical limits of a city taxi, it is worthwhile to keep only records with latitude between 40 and 42 and longitude between -75 and -72.
- A maximum of 5 passengers is allowed in a NYC taxi (see the Reference section for the source), but the dataset contains passenger counts from 0 to 6, so the outliers can be removed.
- A fare amount cannot be negative, so only positive fare amounts are kept.
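
Before dropping rows, it can help to see how many records break each rule. The following is a minimal sketch on a toy DataFrame with made-up values; the `violations` frame and its column names are illustrative and not part of the original notebook.

```python
import pandas as pd

# Toy rows (hypothetical values) standing in for the real dataset.
sample = pd.DataFrame({
    "pickup_longitude": [-73.98, 40.78, -73.99],
    "pickup_latitude": [40.75, -74.00, 40.73],
    "passenger_count": [1, 0, 6],
    "fare_amount": [8.5, -5.0, 12.5],
})

# One boolean column per rule; True marks a row that breaks the rule.
violations = pd.DataFrame({
    "longitude_out_of_range": ~sample["pickup_longitude"].between(-75, -72),
    "latitude_out_of_range": ~sample["pickup_latitude"].between(40, 42),
    "passenger_count_invalid": ~sample["passenger_count"].between(1, 5),
    "fare_not_positive": sample["fare_amount"] <= 0,
})

print(violations.sum())               # number of rows failing each rule
print(violations.any(axis=1).sum())   # number of rows failing at least one rule
```

Counting per rule shows which filter removes the most data, which is useful when deciding whether a threshold is too strict.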

# 3: Exploratory Data Analysis <a name="3"></a>


---
<a href="#top" class="btn btn-primary btn-sm" role="button" aria-pressed="true" style="color:white" data-toggle="popover">Go to TOC</a> 

I performed the EDA in a separate notebook. To access that notebook, click [here](https://colab.research.google.com/github/hargurjeet/bt/blob/main/EDA_and_Visualization_NY_city_taxi.ipynb#scrollTo=TBE7PD_EdX7o).

# 4: Train a baseline model <a name="4"></a>


---
<a href="#top" class="btn btn-primary btn-sm" role="button" aria-pressed="true" style="color:white" data-toggle="popover">Go to TOC</a> 

We set aside 20% of the data as a validation set for evaluating the models.

In [26]:
train_df, val_df = train_test_split(df, test_size=0.2, random_state=42)

len(train_df), len(val_df)

(40000, 10000)

In [27]:
input_cols = ['pickup_longitude', 'pickup_latitude','dropoff_longitude', 'dropoff_latitude', 'passenger_count']
target_cols = 'fare_amount'

train_inputs = train_df[input_cols]
train_targets = train_df[target_cols]
val_inputs = val_df[input_cols]
val_targets = val_df[target_cols]

In [28]:
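class regressor():
  """ Baseline model that always predicts the mean of the training targets """
  def fit(self, inputs, targets):
    # The inputs are ignored; only the mean of the targets is stored
    self.mean = targets.mean()

  def predicts(self, inputs):
    # Return the stored mean once for every input row
    return np.full(inputs.shape[0], self.mean)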

In [29]:
mean_model = regressor()
mean_model.fit(train_inputs, train_targets)
print(f'Average fare for the taxi is {mean_model.mean}')

Average fare for the taxi is 11.376873970031738


In [30]:
mean_model.predicts(train_inputs)

array([11.37687397, 11.37687397, 11.37687397, ..., 11.37687397,
       11.37687397, 11.37687397])

In [31]:
train_targets

39087    10.000000
30893     4.000000
45278     6.900000
16398     7.700000
13653     4.500000
           ...    
11284     6.500000
44732     3.700000
38158    12.100000
860      12.100000
15795    57.330002
Name: fare_amount, Length: 40000, dtype: float32

In [32]:
mean_model.predicts(val_inputs)

array([11.37687397, 11.37687397, 11.37687397, ..., 11.37687397,
       11.37687397, 11.37687397])

In [33]:
def rmse(targets, predictions):
  # Explicit square root; the `squared=False` argument is deprecated in newer scikit-learn releases
  return np.sqrt(mean_squared_error(targets, predictions))

rmse(train_targets, mean_model.predicts(train_inputs))

9.696218748570448

In [34]:
## Trying linear regression
linear_model = LinearRegression()
linear_model.fit(train_inputs, train_targets)
train_preds = linear_model.predict(train_inputs)
train_preds

array([11.819403, 11.820234, 11.687607, ..., 11.41951 , 11.284868,
       11.286995], dtype=float32)

In [35]:
train_targets

39087    10.000000
30893     4.000000
45278     6.900000
16398     7.700000
13653     4.500000
           ...    
11284     6.500000
44732     3.700000
38158    12.100000
860      12.100000
15795    57.330002
Name: fare_amount, Length: 40000, dtype: float32

In [36]:
rmse(train_targets, train_preds)

9.694377

In [37]:
val_preds = linear_model.predict(val_inputs)
rmse(val_targets, val_preds)

9.641011

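Conclusions - Linear regression (validation RMSE of about 9.64) barely improves on the mean baseline (training RMSE of about 9.70), so it is learning almost nothing from the current features. Any model we train should do better than this.
- The model may not be learning well from the raw latitude and longitude values.
- We are not using the pickup datetime yet; fares can be seasonal and can depend on the time of day.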
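
The baseline RMSE of about 9.70 is close to the standard deviation of `fare_amount` (about 9.69 in `df.describe()`), and this is not a coincidence. When the prediction is the mean of the same targets it is scored on, the RMSE equals their population standard deviation. A minimal sketch of that identity, using the five fares from `df.head()` purely as sample numbers:

```python
import numpy as np

# Sample fares, used only to illustrate the identity.
fares = np.array([4.5, 16.9, 5.7, 7.7, 5.3])

# Predict the mean for every record, as the baseline model does.
mean_prediction = np.full(fares.shape, fares.mean())
baseline_rmse = np.sqrt(np.mean((fares - mean_prediction) ** 2))

# np.std uses ddof=0 by default, i.e. the population standard deviation.
print(baseline_rmse, fares.std())
print(np.isclose(baseline_rmse, fares.std()))
```

This gives a quick way to judge any later model: an RMSE near the standard deviation of the target means the features are contributing very little.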

# 5: Feature Engineering <a name="5"></a>


---
<a href="#top" class="btn btn-primary btn-sm" role="button" aria-pressed="true" style="color:white" data-toggle="popover">Go to TOC</a>

To work with other features, the remaining columns need to be preprocessed before any analysis can be performed. I considered the following approaches:

- Cleaning the outliers.
- Splitting the datetime column to understand the trip data across year, month, day, weekday and hour.
- Calculating the distance between the pickup and dropoff locations.
- Calculating the distance from the dropoff location to key destinations.
- Key destinations
  - Airports
    - JFK airport
    - LGA airport
    - EWR airport
  - WTC (World Trade Center)

In [40]:
# Function to remove outlier records
print('Record Count Before Cleaning', train_df.shape, val_df.shape)
def remove_outlier(df):
  """ Keep rows with coordinates inside the longitude and latitude ranges,
      a passenger count between 1 and 5,
      and a fare amount above 1
  """
  return df[(
      (df.pickup_longitude >= -75) & (df.pickup_longitude <= -72) &
      (df.pickup_latitude >= 40) & (df.pickup_latitude <= 42) &
      (df.dropoff_longitude >= -75) & (df.dropoff_longitude <= -72) &
      (df.dropoff_latitude >= 40) & (df.dropoff_latitude <= 42) &
      (df.passenger_count >= 1) & (df.passenger_count <= 5) &
      (df.fare_amount > 1.)
  )].copy()  # copy so that feature columns can be added later without a SettingWithCopyWarning

train_df = remove_outlier(train_df)
val_df = remove_outlier(val_df)

print('Record Count after Cleaning', train_df.shape, val_df.shape)

Record Count Before Cleaning (40000, 7) (10000, 7)
Record Count after Cleaning (38239, 7) (9570, 7)


In [None]:
## Splitting date and time
def part_of_dates(df, column):
  """ Extract year, month, day, weekday and hour from a datetime column """
  df[column+ '_year']= df[column].dt.year
  df[column+ '_month']= df[column].dt.month
  df[column+ '_day']= df[column].dt.day
  df[column+ '_week']= df[column].dt.weekday
  df[column+ '_hour']= df[column].dt.hour

part_of_dates(train_df, 'pickup_datetime')
part_of_dates(val_df, 'pickup_datetime')

## Using the haversine formula to calculate the distance between pickup and dropoff locations
def haversine_np(lon1, lat1, lon2, lat2):
    """
    Calculate the great circle distance between two points
    on the earth (specified in decimal degrees)

    All args must be of equal length.    

    """
    lon1, lat1, lon2, lat2 = map(np.radians, [lon1, lat1, lon2, lat2])

    dlon = lon2 - lon1
    dlat = lat2 - lat1

    a = np.sin(dlat/2.0)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon/2.0)**2

    c = 2 * np.arcsin(np.sqrt(a))
    km = 6367 * c
    return km

def trip_distance(df):
  df['trip_distance'] = haversine_np(df['pickup_longitude'],
                                     df['pickup_latitude'],
                                     df['dropoff_longitude'],
                                     df['dropoff_latitude'])
  
trip_distance(train_df)
trip_distance(val_df)
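
A distance function is easy to get subtly wrong, for example by swapping longitude and latitude, so a few sanity checks are worthwhile. The sketch below restates the same formula in a standalone helper (the name `haversine_km` and the `radius_km` parameter are illustrative) and checks three properties: zero distance for identical points, one degree of latitude equal to the radius times one degree in radians, and symmetry.

```python
import numpy as np

def haversine_km(lon1, lat1, lon2, lat2, radius_km=6367):
    """Great-circle distance in km between points given in decimal degrees."""
    lon1, lat1, lon2, lat2 = map(np.radians, [lon1, lat1, lon2, lat2])
    a = (np.sin((lat2 - lat1) / 2.0) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2.0) ** 2)
    return radius_km * 2 * np.arcsin(np.sqrt(a))

# Identical pickup and dropoff points should give zero distance.
same_point = haversine_km(-73.98, 40.75, -73.98, 40.75)

# Moving one degree due north covers radius * (pi / 180) km.
one_degree_north = haversine_km(-73.98, 40.0, -73.98, 41.0)

# Swapping start and end must not change the distance.
there = haversine_km(-73.98, 40.75, -73.78, 40.64)
back = haversine_km(-73.78, 40.64, -73.98, 40.75)

print(same_point, one_degree_north, np.isclose(there, back))
```

With a radius of 6367 km, one degree of latitude comes out at roughly 111 km, which matches the usual rule of thumb.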

In [None]:
jfk_lonlat = -73.7781, 40.6413
lga_lonlat = -73.8740, 40.7769
ewr_lonlat = -74.1745, 40.6895
wtc_lonlat = -74.0099, 40.7126

def add_landmark_dropoff_distance(df, landmark_name, landmark_lonlat):
    lon, lat = landmark_lonlat
    df[landmark_name + '_drop_distance'] = haversine_np(lon, lat, df['dropoff_longitude'], df['dropoff_latitude'])

landmarks = [('jfk', jfk_lonlat), ('lga', lga_lonlat), ('ewr', ewr_lonlat), ('wtc', wtc_lonlat)]
for a_df in [train_df, val_df]:
  for name, lonlat in landmarks: 
    add_landmark_dropoff_distance(a_df, name, lonlat)

In [43]:
train_df.head()

Unnamed: 0,fare_amount,pickup_datetime,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,passenger_count,pickup_datetime_year,pickup_datetime_month,pickup_datetime_day,pickup_datetime_week,pickup_datetime_hour,trip_distance,jfk_drop_distance,lga_drop_distance,ewr_drop_distance,wtc_drop_distance
39087,10.0,2013-07-27 17:04:00+00:00,-73.974335,40.791428,-73.979034,40.766365,5,2013,7,27,5,17,2.813101,21.901814,8.916608,18.544836,6.515262
30893,4.0,2013-01-08 09:26:00+00:00,-73.973656,40.751633,-73.969948,40.756702,5,2013,1,8,1,9,0.643929,20.632423,8.381031,18.776476,5.944635
45278,6.9,2012-03-17 16:45:00+00:00,-73.975266,40.752281,-73.995094,40.737499,4,2012,3,17,5,16,2.341894,21.18005,11.093554,16.025425,3.034995
16398,7.7,2012-06-08 09:01:17+00:00,-73.983032,40.766785,-73.971947,40.789288,1,2012,6,8,4,9,2.669228,23.174042,8.355556,20.342896,9.101416
13653,4.5,2015-06-22 17:30:49+00:00,-73.986717,40.771648,-73.98214,40.770699,1,2015,6,22,0,17,0.399969,22.409826,9.126138,18.542492,6.866284
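
Despite its name, the `pickup_datetime_week` column holds the day of the week, not the week number, because it is filled from `.dt.weekday`, where Monday is 0 and Sunday is 6. A small illustration using two timestamps from the preview above:

```python
import pandas as pd

# Two pickup timestamps taken from the rows shown above.
pickups = pd.Series(pd.to_datetime(
    ["2013-07-27 17:04:00+00:00", "2015-06-22 17:30:49+00:00"], utc=True))

print(pickups.dt.weekday.tolist())     # [5, 0]
print(pickups.dt.day_name().tolist())  # ['Saturday', 'Monday']
```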


In [44]:
train_df.describe()

Unnamed: 0,fare_amount,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,passenger_count,pickup_datetime_year,pickup_datetime_month,pickup_datetime_day,pickup_datetime_week,pickup_datetime_hour,trip_distance,jfk_drop_distance,lga_drop_distance,ewr_drop_distance,wtc_drop_distance
count,38239.0,38239.0,38239.0,38239.0,38239.0,38239.0,38239.0,38239.0,38239.0,38239.0,38239.0,38239.0,38239.0,38239.0,38239.0,38239.0
mean,11.35995,-73.986053,40.749599,-73.985573,40.750057,1.587045,2011.706791,6.280734,15.667695,3.033343,13.469573,3.358001,20.916988,9.680647,18.523767,6.02221
std,9.670594,0.041621,0.031339,0.040956,0.033976,1.142863,1.864356,3.460466,8.651006,1.954304,6.510145,3.916795,3.220436,3.181855,3.863319,4.106068
min,2.5,-74.711647,40.190563,-74.755478,40.190563,1.0,2009.0,1.0,1.0,0.0,0.0,0.0,0.387446,0.265018,0.2774,0.013638
25%,6.0,-73.992332,40.736473,-73.991348,40.735886,1.0,2010.0,3.0,8.0,1.0,9.0,1.262576,20.53061,8.32219,16.512361,3.642102
50%,8.5,-73.98214,40.75346,-73.980476,40.754223,1.0,2012.0,6.0,16.0,3.0,14.0,2.15286,21.183189,9.502868,17.980993,5.524119
75%,12.5,-73.968399,40.767729,-73.965385,40.768618,2.0,2013.0,9.0,23.0,5.0,19.0,3.936313,21.907041,10.950631,19.813603,7.659528
max,200.0,-72.856972,41.650002,-72.854942,41.543217,5.0,2015.0,12.0,31.0,6.0,23.0,103.969444,100.5765,94.867615,114.945412,103.064423


# 6: Train and Evaluate models <a name="6"></a>


---
<a href="#top" class="btn btn-primary btn-sm" role="button" aria-pressed="true" style="color:white" data-toggle="popover">Go to TOC</a> 

I picked the following models to train on the dataset:

- Linear Regression
- Random forests
- XGBoost

In [45]:
input_cols =[ 'pickup_longitude', 'pickup_latitude',
       'dropoff_longitude', 'dropoff_latitude', 'passenger_count',
       'pickup_datetime_year', 'pickup_datetime_month', 'pickup_datetime_day',
       'pickup_datetime_week', 'pickup_datetime_hour', 'trip_distance',
       'jfk_drop_distance', 'lga_drop_distance', 'ewr_drop_distance', 'wtc_drop_distance']

target_cols =['fare_amount']

train_inputs = train_df[input_cols]
train_targets = train_df[target_cols]

val_inputs = val_df[input_cols]
val_targets = val_df[target_cols]

In [46]:
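def evaluate(model):
  """ Return train RMSE, validation RMSE and the predictions for both sets """
  train_preds = model.predict(train_inputs)
  val_preds = model.predict(val_inputs)
  # Explicit square root; the `squared=False` argument is deprecated in newer scikit-learn releases
  train_rmse = np.sqrt(mean_squared_error(train_targets, train_preds))
  val_rmse = np.sqrt(mean_squared_error(val_targets, val_preds))

  return train_rmse, val_rmse, train_preds, val_preds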

In [47]:
## Linear regression
model1 = LinearRegression()
model1.fit(train_inputs, train_targets)
evaluate(model1)

(5.508512074204601, 6.042628372256151, array([[10.058396  ],
        [ 7.02731931],
        [ 8.73334104],
        ...,
        [ 8.06759072],
        [13.76586331],
        [49.94990048]]), array([[10.65041478],
        [20.26064198],
        [ 6.72033916],
        ...,
        [ 9.31658808],
        [33.88253144],
        [ 7.13537951]]))

In [48]:
# Random Forest
model2 = RandomForestRegressor(random_state=42, n_jobs=-1, max_depth=5, n_estimators=100)

In [49]:
%%time
model2.fit(train_inputs, train_targets)


CPU times: user 22.4 s, sys: 22.6 ms, total: 22.4 s
Wall time: 11.6 s


RandomForestRegressor(max_depth=5, n_jobs=-1, random_state=42)

In [50]:
evaluate(model2)

(4.254846976414971,
 4.747344151804759,
 array([10.55463215,  6.05462193,  8.65341964, ...,  7.32847233,
        11.69555435, 53.74064062]),
 array([ 9.88845623, 27.97484576,  6.05462193, ...,  8.78567887,
        28.86964666,  6.58954703]))

In [51]:
## xgboost
model3= XGBRegressor(n_estimators=100, n_jobs=-1, max_depth=3, objective='reg:squarederror', random_state=42)

In [52]:
model3.fit(train_inputs, train_targets)

XGBRegressor(n_jobs=-1, objective='reg:squarederror', random_state=42)

In [53]:
evaluate(model3)

(3.6792917,
 4.4309444,
 array([11.123464,  5.68449 ,  8.936595, ...,  7.462202, 11.991682,
        54.26345 ], dtype=float32),
 array([ 8.603611 , 27.28239  ,  5.231544 , ...,  8.531159 , 31.630281 ,
         6.5144863], dtype=float32))

## 6.1 Hyperparameter tuning

In [59]:
from sklearn.model_selection import GridSearchCV

In [75]:
params = { 'max_depth': [3,6],
           'learning_rate': [0.01, 0.05, 0.1],
           'n_estimators': [100, 500, 1000],
           }

In [80]:
xgbr = XGBRegressor(seed = 42,silent=True,verbose=0)

In [82]:
model_xgb = GridSearchCV(estimator=xgbr, 
                   param_grid=params,
                   scoring='neg_mean_squared_error', 
                   )

In [83]:
%%time
model_xgb.fit(train_inputs, train_targets)

CPU times: user 32min 29s, sys: 2.53 s, total: 32min 31s
Wall time: 32min 24s


GridSearchCV(estimator=XGBRegressor(seed=42, silent=True, verbose=0),
             param_grid={'learning_rate': [0.01, 0.05, 0.1],
                         'max_depth': [3, 6],
                         'n_estimators': [100, 500, 1000]},
             scoring='neg_mean_squared_error')

In [84]:
print("Best parameters:", model_xgb.best_params_)

Best parameters: {'learning_rate': 0.05, 'max_depth': 6, 'n_estimators': 500}
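
The 32-minute run time is easier to understand once the number of model fits is counted. The sketch below enumerates the grid defined above; it assumes the default 5-fold cross-validation and the default `refit=True`, since the `GridSearchCV` call sets neither `cv` nor `refit`.

```python
from itertools import product

# Same grid as in the search above.
params = {
    "max_depth": [3, 6],
    "learning_rate": [0.01, 0.05, 0.1],
    "n_estimators": [100, 500, 1000],
}

n_folds = 5  # assumed: GridSearchCV's default when cv is not passed

# Every combination of one value per hyperparameter.
combinations = list(product(*params.values()))
n_cv_fits = len(combinations) * n_folds
n_total_fits = n_cv_fits + 1  # plus one final refit on the full training set

print(len(combinations), n_cv_fits, n_total_fits)
```

Because the fits are independent, passing `n_jobs=-1` to `GridSearchCV` is one option for running them in parallel. Note also that with `scoring='neg_mean_squared_error'`, `model_xgb.best_score_` is a negated MSE, so `np.sqrt(-model_xgb.best_score_)` would give the cross-validated RMSE of the best configuration.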


# 7: Summary <a name="7"></a>


---
<a href="#top" class="btn btn-primary btn-sm" role="button" aria-pressed="true" style="color:white" data-toggle="popover">Go to TOC</a> 

The following summarizes the steps performed during the analysis and model building.

- Downloaded the dataset from GitHub.
- Performed preprocessing such as checking for null values, removing redundant columns, and importing all the required libraries.
- Performed exploratory data analysis in a separate [notebook](https://colab.research.google.com/github/hargurjeet/bt/blob/main/EDA_and_Visualization_NY_city_taxi.ipynb#scrollTo=TBE7PD_EdX7o).
- Trained a baseline model as a reference.
- Developed new features to improve model learning.
- Trained ML models on the newly developed features.
- Evaluated the model performance.
- Selected the best model and performed hyperparameter tuning.

# 8: Future Work <a name="8"></a>


---
<a href="#top" class="btn btn-primary btn-sm" role="button" aria-pressed="true" style="color:white" data-toggle="popover">Go to TOC</a> 

- The problem can also be approached by developing a neural network.
- A time series forecasting approach can also be tried.

# 9: Reference <a name="9"></a>


---
<a href="#top" class="btn btn-primary btn-sm" role="button" aria-pressed="true" style="color:white" data-toggle="popover">Go to TOC</a> 

- https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html
- https://pandas.pydata.org/docs/
- https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeRegressor.html
- https://towardsdatascience.com/xgboost-fine-tune-and-optimize-your-model-23d996fab663

# **The End**