# New York City Taxi Fare Prediction

![](https://i.imgur.com/ecwUY8F.png)

Dataset Link: https://www.kaggle.com/c/new-york-city-taxi-fare-prediction

We'll train a machine learning model to predict the fare for a taxi ride in New York City, given information like pickup date & time, pickup location, drop location and number of passengers.

This dataset is taken from a [Kaggle competition](https://www.kaggle.com/c/new-york-city-taxi-fare-prediction) organized by Google Cloud. It contains over 55 million rows of training data. We'll attempt to achieve a respectable score in the competition using just a fraction of the data. Along the way, we'll also look at some practical tips for machine learning. Most of the ideas & techniques covered in this notebook are derived from other public notebooks & blog posts.

To run this notebook, select "Run" > "Run on Colab" and connect your Google Drive account with Jovian. Make sure to use the GPU runtime if you plan on using a GPU.

You can find the completed version of this notebook here: https://jovian.ai/aakashns/nyc-taxi-fare-prediction-filled





> _**TIP #1**: Create an outline for your notebook & for each section before you start coding_



Here's an outline of the project:

1. Download the dataset
2. Explore & analyze the dataset
3. Prepare the dataset for ML training
4. Train hardcoded & baseline models
5. Make predictions & submit to Kaggle
6. Perform feature engineering
7. Train & evaluate different models
8. Tune hyperparameters for the best models
9. Train on a GPU with the entire dataset
10. Document & publish the project online


## 1. Download the Dataset

Steps:

- Install required libraries
- Download data from Kaggle
- View dataset files
- Load training set with Pandas
- Load test set with Pandas

### Install Required Libraries

In [291]:
!pip install opendatasets pandas numpy scikit-learn xgboost --quiet

### Download Data from Kaggle

We'll use the [opendatasets](https://github.com/JovianML/opendatasets) library to download the dataset. You'll need to upload your [Kaggle API key](https://github.com/JovianML/opendatasets/blob/master/README.md#kaggle-credentials) (a file called `kaggle.json`) to Colab.

In [292]:
!pip install opendatasets --quiet
import opendatasets as od

In [293]:
dataset = "https://www.kaggle.com/c/new-york-city-taxi-fare-prediction"

In [294]:
od.download(dataset)

Skipping, found downloaded files in "./new-york-city-taxi-fare-prediction" (use force=True to force download)


### View Dataset Files

Let's look at the size, no. of lines and first few lines of each file.

In [295]:
!ls -lh /content/new-york-city-taxi-fare-prediction


In [296]:
!wc -l /content/new-york-city-taxi-fare-prediction/train.csv

55423856 /content/new-york-city-taxi-fare-prediction/train.csv


In [297]:
!wc -l /content/new-york-city-taxi-fare-prediction/test.csv

9914 /content/new-york-city-taxi-fare-prediction/test.csv


In [298]:
!wc -l /content/new-york-city-taxi-fare-prediction/sample_submission.csv

9915 /content/new-york-city-taxi-fare-prediction/sample_submission.csv


In [299]:
!head /content/new-york-city-taxi-fare-prediction/train.csv

key,fare_amount,pickup_datetime,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,passenger_count
2009-06-15 17:26:21.0000001,4.5,2009-06-15 17:26:21 UTC,-73.844311,40.721319,-73.84161,40.712278,1
2010-01-05 16:52:16.0000002,16.9,2010-01-05 16:52:16 UTC,-74.016048,40.711303,-73.979268,40.782004,1
2011-08-18 00:35:00.00000049,5.7,2011-08-18 00:35:00 UTC,-73.982738,40.76127,-73.991242,40.750562,2
2012-04-21 04:30:42.0000001,7.7,2012-04-21 04:30:42 UTC,-73.98713,40.733143,-73.991567,40.758092,1
2010-03-09 07:51:00.000000135,5.3,2010-03-09 07:51:00 UTC,-73.968095,40.768008,-73.956655,40.783762,1
2011-01-06 09:50:45.0000002,12.1,2011-01-06 09:50:45 UTC,-74.000964,40.73163,-73.972892,40.758233,1
2012-11-20 20:35:00.0000001,7.5,2012-11-20 20:35:00 UTC,-73.980002,40.751662,-73.973802,40.764842,1
2012-01-04 17:22:00.00000081,16.5,2012-01-04 17:22:00 UTC,-73.9513,40.774138,-73.990095,40.751048,1
2012-12-03 13:10:00.000000125,9,2012-12-03 13:10:00 UTC,-74.006462,40.7267

Observations:

- This is a supervised learning regression problem
- Training data is 5.5 GB in size
- Training data has ~55.4 million rows
- Test set is much smaller (< 10,000 rows)
- The training set has 8 columns:
    - `key` (a unique identifier)
    - `fare_amount` (target column)
    - `pickup_datetime`
    - `pickup_longitude`
    - `pickup_latitude`
    - `dropoff_longitude`
    - `dropoff_latitude`
    - `passenger_count`
- The test set has all columns except the target column `fare_amount`.
- The submission file should contain the `key` and `fare_amount` for each test sample.



### Loading Training Set

> _**TIP #2**: When working with large datasets, always start with a sample to experiment & iterate faster._

Loading the entire dataset into Pandas is going to be slow, so we can use the following optimizations:

- Ignore the `key` column
- Parse pickup datetime while loading data
- Specify data types for other columns
   - `float32` for geo coordinates
   - `float32` for fare amount
   - `uint8` for passenger count
- Work with a 1% sample of the data (~500k rows)

We can apply these optimizations while using [`pd.read_csv`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html)
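To get a feel for how much the narrower data types save, here is a small sketch on synthetic data. The values and the `wide`/`narrow` names are made up for illustration; only the column names and dtypes come from the list above.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 100_000

# Synthetic stand-in for a few columns, using the default 64-bit dtypes
wide = pd.DataFrame({
    "fare_amount": rng.uniform(2.5, 60.0, n),
    "pickup_longitude": rng.uniform(-75.0, -72.0, n),
    "passenger_count": rng.integers(1, 7, n),
})

# Same data with the narrower dtypes suggested above
narrow = wide.astype({
    "fare_amount": "float32",
    "pickup_longitude": "float32",
    "passenger_count": "uint8",
})

wide_bytes = wide.memory_usage(index=False).sum()
narrow_bytes = narrow.memory_usage(index=False).sum()
print(f"64-bit dtypes: {wide_bytes / 1e6:.1f} MB, narrow dtypes: {narrow_bytes / 1e6:.1f} MB")
```

In this toy frame each row shrinks from 24 bytes to 9. `float32` keeps roughly 7 significant digits, which should be plenty for fares and city-scale coordinates.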

In [300]:
import pandas as pd

In [301]:
train_data_path = "/content/new-york-city-taxi-fare-prediction/train.csv"

In [302]:

selected_cols = "fare_amount,pickup_datetime,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,passenger_count".split(",")
selected_cols

['fare_amount',
 'pickup_datetime',
 'pickup_longitude',
 'pickup_latitude',
 'dropoff_longitude',
 'dropoff_latitude',
 'passenger_count']

In [303]:
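import random

sample_fraction = 0.01

def skip_row(row_idx):
    # Row 0 is the header row: never skip it
    if row_idx == 0:
        return False
    # Skip each data row with probability 1 - sample_fraction
    return random.random() > sample_fraction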
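To see how a `skiprows` callable behaves without reading the full 55-million-row file, here is a minimal sketch on a small in-memory CSV. The data and the `make_row_skipper` helper are hypothetical and not part of the original notebook. The helper uses its own seeded generator, so the sample does not depend on the global random state.

```python
import io
import random

import pandas as pd

# Tiny in-memory CSV standing in for train.csv (hypothetical data)
csv_text = "id,value\n" + "\n".join(f"{i},{i * 2}" for i in range(10_000))

def make_row_skipper(fraction, seed=42):
    """Return a skiprows callable that keeps roughly `fraction` of the data rows."""
    rng = random.Random(seed)  # private generator, global seed is untouched

    def skip(row_idx):
        if row_idx == 0:
            return False  # row 0 is the header: always keep it
        return rng.random() > fraction  # True means "skip this row"

    return skip

sample = pd.read_csv(io.StringIO(csv_text), skiprows=make_row_skipper(0.05))
print(len(sample), "rows kept out of 10000")
```

The number of rows kept should land near 5% of the file. It is only approximately 5% because each row is kept or dropped independently.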

In [304]:
dtypes = {
 'fare_amount':'float32',
 'pickup_longitude':'float32',
 'pickup_latitude':'float32',
 'dropoff_longitude':'float32',
 'dropoff_latitude':'float32',
 'passenger_count':'uint8'
 }
random.seed(42)


df = pd.read_csv(
    train_data_path,
    usecols=selected_cols,
    parse_dates = ["pickup_datetime"],
    dtype=dtypes,
    nrows=500_000,
    skiprows = skip_row
  )

In [305]:
df

Unnamed: 0,fare_amount,pickup_datetime,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,passenger_count
0,4.0,2014-12-06 20:36:22+00:00,-73.979813,40.751904,-73.979446,40.755482,1
1,8.0,2013-01-17 17:22:00+00:00,0.000000,0.000000,0.000000,0.000000,2
2,8.9,2011-06-15 18:07:00+00:00,-73.996330,40.753223,-73.978897,40.766964,3
3,6.9,2009-12-14 12:33:00+00:00,-73.982430,40.745747,-73.982430,40.745747,1
4,7.0,2013-11-06 11:26:54+00:00,-73.959061,40.781059,-73.962059,40.768604,1
...,...,...,...,...,...,...,...
499995,4.0,2012-12-26 10:17:30+00:00,-73.924057,40.744015,-73.936485,40.748222,1
499996,11.7,2010-12-10 16:56:45+00:00,-73.958344,40.764458,-73.974609,40.758888,1
499997,11.7,2010-07-14 07:22:00+00:00,-73.957047,40.777454,-73.991127,40.738979,1
499998,7.3,2010-07-15 19:02:00+00:00,-73.954552,40.780979,-73.970619,40.761959,5


> _**TIP #3**: Fix the seeds for random number generators so that you get the same results every time you run your notebook._

### Load Test Set

For the test set, we'll simply provide the data types.

In [306]:
test_data_path = "/content/new-york-city-taxi-fare-prediction/test.csv"

In [307]:
test_df = pd.read_csv(test_data_path,dtype=dtypes,parse_dates=["pickup_datetime"])

In [308]:
test_df

Unnamed: 0,key,pickup_datetime,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,passenger_count
0,2015-01-27 13:08:24.0000002,2015-01-27 13:08:24+00:00,-73.973320,40.763805,-73.981430,40.743835,1
1,2015-01-27 13:08:24.0000003,2015-01-27 13:08:24+00:00,-73.986862,40.719383,-73.998886,40.739201,1
2,2011-10-08 11:53:44.0000002,2011-10-08 11:53:44+00:00,-73.982521,40.751259,-73.979652,40.746140,1
3,2012-12-01 21:12:12.0000002,2012-12-01 21:12:12+00:00,-73.981163,40.767807,-73.990448,40.751637,1
4,2012-12-01 21:12:12.0000003,2012-12-01 21:12:12+00:00,-73.966049,40.789776,-73.988564,40.744427,1
...,...,...,...,...,...,...,...
9909,2015-05-10 12:37:51.0000002,2015-05-10 12:37:51+00:00,-73.968124,40.796997,-73.955643,40.780388,6
9910,2015-01-12 17:05:51.0000001,2015-01-12 17:05:51+00:00,-73.945511,40.803600,-73.960213,40.776371,6
9911,2015-04-19 20:44:15.0000001,2015-04-19 20:44:15+00:00,-73.991600,40.726608,-73.789742,40.647011,6
9912,2015-01-31 01:05:19.0000005,2015-01-31 01:05:19+00:00,-73.985573,40.735432,-73.939178,40.801731,6


## 2. Explore the Dataset

- Basic info about training set
- Basic info about test set
- Exploratory data analysis & visualization
- Ask & answer questions

### Training Set

In [309]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 500000 entries, 0 to 499999
Data columns (total 7 columns):
 #   Column             Non-Null Count   Dtype              
---  ------             --------------   -----              
 0   fare_amount        500000 non-null  float32            
 1   pickup_datetime    500000 non-null  datetime64[ns, UTC]
 2   pickup_longitude   500000 non-null  float32            
 3   pickup_latitude    500000 non-null  float32            
 4   dropoff_longitude  500000 non-null  float32            
 5   dropoff_latitude   500000 non-null  float32            
 6   passenger_count    500000 non-null  uint8              
dtypes: datetime64[ns, UTC](1), float32(5), uint8(1)
memory usage: 13.8 MB


In [310]:
df.describe()

Unnamed: 0,fare_amount,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,passenger_count
count,500000.0,500000.0,500000.0,500000.0,500000.0,500000.0
mean,11.356144,-72.498863,39.915924,-72.507774,39.937088,1.683902
std,9.824891,11.728171,6.933188,12.224013,9.50666,1.339464
min,-52.0,-1183.362793,-2073.150635,-3356.729736,-2073.150635,0.0
25%,6.0,-73.992043,40.734886,-73.991432,40.733978,1.0
50%,8.5,-73.981827,40.752628,-73.980186,40.753101,1.0
75%,12.5,-73.967178,40.767007,-73.963745,40.768063,2.0
max,499.0,2420.209473,404.983337,2467.752686,3351.403076,208.0


In [311]:
df["pickup_datetime"].min(), df["pickup_datetime"].max()

(Timestamp('2009-01-01 00:11:46+0000', tz='UTC'),
 Timestamp('2015-06-30 23:59:54+0000', tz='UTC'))

Observations about training data:

- 500,000 rows, as expected (capped by `nrows`)
- No missing data (in the sample)
- `fare_amount` ranges from -\$52.0 to \$499.0
- `passenger_count` ranges from 0 to 208
- There seem to be some errors in the latitude & longitude values
- Dates range from 1st Jan 2009 to 30th June 2015
- The dataset takes up ~14 MB of space in RAM

We may need to deal with outliers and data entry errors before we train our model.


### Test Set

In [312]:
test_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9914 entries, 0 to 9913
Data columns (total 7 columns):
 #   Column             Non-Null Count  Dtype              
---  ------             --------------  -----              
 0   key                9914 non-null   object             
 1   pickup_datetime    9914 non-null   datetime64[ns, UTC]
 2   pickup_longitude   9914 non-null   float32            
 3   pickup_latitude    9914 non-null   float32            
 4   dropoff_longitude  9914 non-null   float32            
 5   dropoff_latitude   9914 non-null   float32            
 6   passenger_count    9914 non-null   uint8              
dtypes: datetime64[ns, UTC](1), float32(4), object(1), uint8(1)
memory usage: 319.6+ KB


In [313]:
test_df.describe()

Unnamed: 0,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,passenger_count
count,9914.0,9914.0,9914.0,9914.0,9914.0
mean,-73.974716,40.751041,-73.973656,40.75174,1.671273
std,0.042774,0.033541,0.039072,0.035435,1.278747
min,-74.25219,40.573143,-74.263245,40.568974,1.0
25%,-73.9925,40.736125,-73.991249,40.735253,1.0
50%,-73.982327,40.753052,-73.980015,40.754065,1.0
75%,-73.968012,40.767113,-73.964062,40.768757,2.0
max,-72.986534,41.709557,-72.990967,41.696682,6.0


In [314]:
test_df["pickup_datetime"].min(), test_df["pickup_datetime"].max()

Some observations about the test set:

- 9914 rows of data
- No missing values
- No obvious data entry errors
- 1 to 6 passengers (we can limit training data to this range)
- Latitudes lie between 40 and 42
- Longitudes lie between -75 and -72
- Pickup dates range from Jan 1st 2009 to Jun 30th 2015 (same as training set)

We can use the ranges of the test set to drop outliers/invalid data from the training set.
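One way such a range filter could look is sketched below. The bounds are the rounded test-set ranges listed above. The `within_test_ranges` name and the toy rows are assumptions made for illustration, not the notebook's final implementation.

```python
import pandas as pd

# Bounds taken from the test-set observations above
LON_RANGE = (-75, -72)
LAT_RANGE = (40, 42)
PASSENGER_RANGE = (1, 6)

def within_test_ranges(df):
    """Keep only rows whose coordinates and passenger count fall in the test-set ranges."""
    mask = df["passenger_count"].between(*PASSENGER_RANGE)
    for prefix in ("pickup", "dropoff"):
        mask &= df[f"{prefix}_longitude"].between(*LON_RANGE)
        mask &= df[f"{prefix}_latitude"].between(*LAT_RANGE)
    return df[mask]

# Hypothetical rows: one valid, one with zeroed coordinates, one with 208 passengers
toy = pd.DataFrame({
    "pickup_longitude": [-73.98, 0.0, -73.98],
    "pickup_latitude": [40.75, 0.0, 40.75],
    "dropoff_longitude": [-73.97, 0.0, -73.97],
    "dropoff_latitude": [40.76, 0.0, 40.76],
    "passenger_count": [1, 2, 208],
})

filtered = within_test_ranges(toy)
print(filtered)
```

`Series.between` is inclusive on both ends by default, so values sitting exactly on a bound are kept.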

### Exploratory Data Analysis and Visualization

**Exercise**: Create graphs (histograms, line charts, bar charts, scatter plots, box plots, geo maps etc.) to study the distribution of values in each column, and the relationship of each input column to the target.



### Ask & Answer Questions

**Exercise**: Ask & answer questions about the dataset:

1. What is the busiest day of the week?
2. What is the busiest time of the day?
3. In which month are fares the highest?
4. Which pickup locations have the highest fares?
5. Which drop locations have the highest fares?
6. What is the average ride distance?
7. ???

Performing EDA on your dataset and asking questions will help you develop a deeper understanding of the data and give you ideas for feature engineering.



Resources for exploratory analysis & visualization:

- EDA project from scratch: https://www.youtube.com/watch?v=kLDTbavcmd0
- Data Analysis with Python: https://zerotopandas.com

> _**TIP #4**: Take an iterative approach to building ML models: do some EDA, do some feature engineering, train a model, then repeat to improve your model._

## 3. Prepare Dataset for Training

- Split Training & Validation Set
- Fill/Remove Missing Values
- Extract Inputs & Outputs
   - Training
   - Validation
   - Test

### Split Training & Validation Set

We'll set aside 20% of the training data as the validation set, to evaluate the models we train on previously unseen data.

Since the test set and training set have the same date ranges, we can pick a random 20% fraction.

> _**TIP #5**: Your validation set should be as similar to the test set or real-world data as possible i.e. the evaluation metric score of a model on validation & test sets should be very close, otherwise you're shooting in the dark._


In [315]:
from sklearn.model_selection import train_test_split

In [316]:
train_df,val_df = train_test_split(df,test_size=0.2,random_state=42)

In [317]:
len(train_df), len(val_df)

(400000, 100000)

### Fill/Remove Missing Values

There are no missing values in our sample, but if there were, we could simply drop the rows with missing values instead of trying to fill them (since we have a lot of training data).

In [318]:
train_df = train_df.dropna()
val_df = val_df.dropna()

In [319]:
len(train_df), len(val_df)

(400000, 100000)

### Extract Inputs and Outputs

In [320]:
train_df.columns

Index(['fare_amount', 'pickup_datetime', 'pickup_longitude', 'pickup_latitude',
       'dropoff_longitude', 'dropoff_latitude', 'passenger_count'],
      dtype='object')

In [321]:
input_cols = ['pickup_longitude', 'pickup_latitude','dropoff_longitude', 'dropoff_latitude', 'passenger_count']

In [322]:
target_df = "fare_amount"

#### Training

In [323]:
train_inputs = train_df[input_cols]

In [324]:
train_targets = train_df[target_df]

In [325]:
train_inputs

Unnamed: 0,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,passenger_count
269056,-73.983948,40.743526,-73.993034,40.740448,1
499174,-73.870773,40.773632,-73.986328,40.733700,1
85143,-73.955322,40.767464,-73.965286,40.759247,1
260335,-73.960876,40.765411,-73.976669,40.756207,1
338124,-73.966148,40.805103,-73.977951,40.783932,1
...,...,...,...,...,...
259178,-73.972656,40.764042,-74.013176,40.707840,2
365838,-73.991982,40.749767,-73.989845,40.720551,3
131932,-73.969055,40.761398,-73.990814,40.751328,1
146867,-73.954620,40.764153,-73.868828,40.737686,2


In [326]:
train_targets

269056     5.300000
499174    43.540001
85143      6.000000
260335    12.500000
338124     9.000000
            ...    
259178    18.500000
365838    10.100000
131932    10.900000
146867    23.000000
121958     9.500000
Name: fare_amount, Length: 400000, dtype: float32

#### Validation

In [327]:
val_inputs = val_df[input_cols]

In [328]:
val_targets = val_df[target_df]

In [329]:
val_inputs


Unnamed: 0,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,passenger_count
104241,-74.007111,40.727531,-73.998749,40.755474,1
199676,-73.984566,40.760952,-73.975700,40.758713,1
140199,-73.980453,40.730564,-73.992928,40.744850,1
132814,-73.936501,40.761009,-73.934921,40.757839,1
408697,-73.976639,40.743896,-74.005203,40.739937,1
...,...,...,...,...,...
66361,-73.976997,40.758827,-73.976944,40.758839,1
497228,-73.963631,40.765121,-74.011368,40.701763,1
152728,-73.961510,40.760731,-73.968475,40.765610,1
50155,-73.987511,40.720173,-73.951569,40.726868,2


In [330]:
val_targets

104241    16.5
199676    11.0
140199    12.5
132814     7.7
408697     8.5
          ... 
66361     55.0
497228    15.7
152728     4.9
50155     10.5
240408    13.7
Name: fare_amount, Length: 100000, dtype: float32

#### Test

In [331]:
test_inputs = test_df[input_cols]

In [332]:
test_df

Unnamed: 0,key,pickup_datetime,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,passenger_count
0,2015-01-27 13:08:24.0000002,2015-01-27 13:08:24+00:00,-73.973320,40.763805,-73.981430,40.743835,1
1,2015-01-27 13:08:24.0000003,2015-01-27 13:08:24+00:00,-73.986862,40.719383,-73.998886,40.739201,1
2,2011-10-08 11:53:44.0000002,2011-10-08 11:53:44+00:00,-73.982521,40.751259,-73.979652,40.746140,1
3,2012-12-01 21:12:12.0000002,2012-12-01 21:12:12+00:00,-73.981163,40.767807,-73.990448,40.751637,1
4,2012-12-01 21:12:12.0000003,2012-12-01 21:12:12+00:00,-73.966049,40.789776,-73.988564,40.744427,1
...,...,...,...,...,...,...,...
9909,2015-05-10 12:37:51.0000002,2015-05-10 12:37:51+00:00,-73.968124,40.796997,-73.955643,40.780388,6
9910,2015-01-12 17:05:51.0000001,2015-01-12 17:05:51+00:00,-73.945511,40.803600,-73.960213,40.776371,6
9911,2015-04-19 20:44:15.0000001,2015-04-19 20:44:15+00:00,-73.991600,40.726608,-73.789742,40.647011,6
9912,2015-01-31 01:05:19.0000005,2015-01-31 01:05:19+00:00,-73.985573,40.735432,-73.939178,40.801731,6


## 4. Train Hardcoded & Baseline Models

> _**TIP #6**: Always create a simple hardcoded or baseline model to establish the minimum score any proper ML model should beat._

- Hardcoded model: always predict average fare
- Baseline model: Linear regression

The competition is evaluated using the root mean squared error (RMSE):
https://www.kaggle.com/c/new-york-city-taxi-fare-prediction/overview/evaluation
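RMSE is the square root of the mean of the squared differences between targets and predictions, so large errors are penalized more heavily than small ones. A minimal NumPy sketch follows; the `rmse_np` name is ours and not part of any library.

```python
import numpy as np

def rmse_np(targets, preds):
    """Root mean squared error: sqrt(mean((targets - preds)^2))."""
    targets = np.asarray(targets, dtype=np.float64)
    preds = np.asarray(preds, dtype=np.float64)
    return float(np.sqrt(np.mean((targets - preds) ** 2)))

# Errors are 0, 0 and 2, so the RMSE is sqrt(4 / 3)
print(rmse_np([1.0, 2.0, 3.0], [1.0, 2.0, 5.0]))
```

A plain NumPy version like this can also serve as a fallback if the scikit-learn helper you rely on changes between releases.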

### Train & Evaluate Hardcoded Model

Let's create a simple model that always predicts the average.

In [333]:
import numpy as np

class MeanRegressor:
    """Hardcoded model that always predicts the mean of the training targets."""

    def fit(self, inputs, targets):
        self.mean = targets.mean()

    def predict(self, inputs):
        return np.full(inputs.shape[0], self.mean)

In [334]:
mean_model = MeanRegressor()

In [335]:
mean_model.fit(train_inputs,train_targets)

In [336]:
train_preds = mean_model.predict(train_inputs)

In [337]:
train_preds

array([11.354983, 11.354983, 11.354983, ..., 11.354983, 11.354983,
       11.354983], dtype=float32)

In [338]:
val_preds = mean_model.predict(val_inputs)

In [339]:
val_preds

array([11.354983, 11.354983, 11.354983, ..., 11.354983, 11.354983,
       11.354983], dtype=float32)

In [340]:
from sklearn.metrics import mean_squared_error

In [341]:
def rmse(targets,preds):
  return mean_squared_error(targets,preds,squared=False)

In [342]:
train_rmse = rmse(train_targets,train_preds)
train_rmse

9.763711

In [343]:
val_rmse = rmse(val_targets,val_preds)
val_rmse

10.0658455

Our dumb hard-coded model has an RMSE of about \$9.76 on the training set and \$10.07 on the validation set, which is pretty bad considering the average fare is \$11.35.

### Train & Evaluate Baseline Model

We'll train a linear regression model as our baseline, which tries to express the target as a weighted sum of the inputs.
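To make "weighted sum of the inputs" concrete, the sketch below fits a `LinearRegression` on synthetic data and rebuilds its predictions by hand from the learned weights (`coef_`) and bias (`intercept_`). The data and the true weights are invented for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))

# Synthetic target: known weights and bias plus a little noise
y = X @ np.array([2.0, -1.0, 0.5]) + 3.0 + rng.normal(scale=0.1, size=200)

model = LinearRegression().fit(X, y)

# predict() amounts to inputs @ weights + bias
manual_preds = X @ model.coef_ + model.intercept_

print("weights:", model.coef_, "bias:", model.intercept_)
print("manual predictions match:", np.allclose(manual_preds, model.predict(X)))
```

Because the model can only scale and add its inputs, raw latitudes and longitudes give it little to work with. A feature that is already close to proportional to the fare is a far better fit for it.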

In [344]:
from sklearn.linear_model import LinearRegression

In [345]:
linear_model = LinearRegression()

In [346]:
linear_model.fit(train_inputs,train_targets)

In [347]:
train_preds = linear_model.predict(train_inputs)

In [348]:
train_preds

array([11.283025 , 11.284077 , 11.2834835, ..., 11.283251 , 11.371433 ,
       11.283045 ], dtype=float32)

In [349]:
train_rmse = rmse(train_targets,train_preds)

In [350]:
train_rmse

9.762524

In [351]:
val_preds = linear_model.predict(val_inputs)

In [352]:
val_preds

array([11.282589, 11.283193, 11.282872, ..., 11.283315, 11.370377,
       11.457279], dtype=float32)

In [353]:
val_rmse = rmse(val_targets,val_preds)

In [354]:
val_rmse

10.064393

The linear regression model has an RMSE of \$9.76 on the training set and \$10.06 on the validation set, which is barely better than simply predicting the average.

This is mainly because the training data (geocoordinates) is not in a format that's useful for the model, and we're not using one of the most important columns: pickup date & time.

However, now we have a baseline that our other models should ideally beat.

## 5. Make Predictions and Submit to Kaggle

> _**TIP #7**: When working on a Kaggle competition, submit early and submit often (ideally daily). The best way to improve your models is to try & beat your previous score._

- Make predictions for test set
- Generate submissions CSV
- Submit to Kaggle
- Record in experiment tracking sheet

In [355]:
test_inputs

Unnamed: 0,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,passenger_count
0,-73.973320,40.763805,-73.981430,40.743835,1
1,-73.986862,40.719383,-73.998886,40.739201,1
2,-73.982521,40.751259,-73.979652,40.746140,1
3,-73.981163,40.767807,-73.990448,40.751637,1
4,-73.966049,40.789776,-73.988564,40.744427,1
...,...,...,...,...,...
9909,-73.968124,40.796997,-73.955643,40.780388,6
9910,-73.945511,40.803600,-73.960213,40.776371,6
9911,-73.991600,40.726608,-73.789742,40.647011,6
9912,-73.985573,40.735432,-73.939178,40.801731,6


In [356]:
test_preds = linear_model.predict(test_inputs)

In [357]:
test_preds

array([11.283353 , 11.282715 , 11.283155 , ..., 11.721101 , 11.719428 ,
       11.7195635], dtype=float32)

In [358]:
sub_df = pd.read_csv("/content/new-york-city-taxi-fare-prediction/sample_submission.csv")

In [359]:
sub_df

Unnamed: 0,key,fare_amount
0,2015-01-27 13:08:24.0000002,11.35
1,2015-01-27 13:08:24.0000003,11.35
2,2011-10-08 11:53:44.0000002,11.35
3,2012-12-01 21:12:12.0000002,11.35
4,2012-12-01 21:12:12.0000003,11.35
...,...,...
9909,2015-05-10 12:37:51.0000002,11.35
9910,2015-01-12 17:05:51.0000001,11.35
9911,2015-04-19 20:44:15.0000001,11.35
9912,2015-01-31 01:05:19.0000005,11.35


In [360]:
sub_df["fare_amount"] = test_preds

In [361]:
sub_df

Unnamed: 0,key,fare_amount
0,2015-01-27 13:08:24.0000002,11.283353
1,2015-01-27 13:08:24.0000003,11.282715
2,2011-10-08 11:53:44.0000002,11.283155
3,2012-12-01 21:12:12.0000002,11.283254
4,2012-12-01 21:12:12.0000003,11.283641
...,...,...
9909,2015-05-10 12:37:51.0000002,11.720262
9910,2015-01-12 17:05:51.0000001,11.720464
9911,2015-04-19 20:44:15.0000001,11.721101
9912,2015-01-31 01:05:19.0000005,11.719428


In [362]:
sub_df.to_csv("linear_model_submission.csv",index=None)

> _**TIP #8**: Create reusable functions for common tasks. They'll help you iterate faster and free up your mind to think about new ideas._

In [363]:
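def predict_and_submit(model, fname):
    # Predict on the test inputs and write them out in the sample submission format
    test_preds = model.predict(test_inputs)
    sub_df = pd.read_csv("/content/new-york-city-taxi-fare-prediction/sample_submission.csv")
    sub_df['fare_amount'] = test_preds
    sub_df.to_csv(fname, index=None)
    return sub_df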

> _**TIP #9**: Track your ideas & experiments systematically to avoid becoming overwhelmed with dozens of models. Use this template: https://bit.ly/mltrackingsheet_

## 6. Feature Engineering

> _**TIP #10**: Take an iterative approach to feature engineering. Add some features, train a model, evaluate it, keep the features if they help, otherwise drop them, then repeat._

- Extract parts of date
- Remove outliers & invalid data
- Add distance between pickup & drop
- Add distance from landmarks

Exercise: We're going to apply all of the above together, but you should observe the effect of adding each feature individually.

### Extract Parts of Date

- Year
- Month
- Day
- Weekday
- Hour



In [364]:
def add_dateparts(df, col):
    df[col + '_year'] = df[col].dt.year
    df[col + '_month'] = df[col].dt.month
    df[col + '_day'] = df[col].dt.day
    df[col + '_weekday'] = df[col].dt.weekday
    df[col + '_hour'] = df[col].dt.hour
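Adding columns in place to a frame returned by `train_test_split` can trigger a `SettingWithCopyWarning` in some pandas versions. A non-mutating variant is sketched below. The `with_dateparts` name and the single toy row are assumptions for illustration; the column naming follows `add_dateparts`.

```python
import pandas as pd

def with_dateparts(df, col):
    """Return a copy of `df` with year/month/day/weekday/hour columns derived from `col`."""
    dt = df[col].dt
    return df.assign(**{
        col + "_year": dt.year,
        col + "_month": dt.month,
        col + "_day": dt.day,
        col + "_weekday": dt.weekday,  # Monday = 0, Sunday = 6
        col + "_hour": dt.hour,
    })

toy = pd.DataFrame({"pickup_datetime": pd.to_datetime(["2009-01-05 17:53:00+00:00"])})
parts = with_dateparts(toy, "pickup_datetime")
print(parts.iloc[0].to_dict())
```

Usage would look like `train_df = with_dateparts(train_df, "pickup_datetime")`. This rebinds the name instead of modifying the original frame.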

In [365]:
add_dateparts(train_df,"pickup_datetime")

In [366]:
add_dateparts(val_df,"pickup_datetime")

In [367]:
train_df

Unnamed: 0,fare_amount,pickup_datetime,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,passenger_count,pickup_datetime_year,pickup_datetime_month,pickup_datetime_day,pickup_datetime_weekday,pickup_datetime_hour
269056,5.300000,2009-01-05 17:53:00+00:00,-73.983948,40.743526,-73.993034,40.740448,1,2009,1,5,0,17
499174,43.540001,2015-06-06 16:42:22+00:00,-73.870773,40.773632,-73.986328,40.733700,1,2015,6,6,5,16
85143,6.000000,2013-04-06 20:21:00+00:00,-73.955322,40.767464,-73.965286,40.759247,1,2013,4,6,5,20
260335,12.500000,2013-12-20 18:07:00+00:00,-73.960876,40.765411,-73.976669,40.756207,1,2013,12,20,4,18
338124,9.000000,2013-01-24 15:38:00+00:00,-73.966148,40.805103,-73.977951,40.783932,1,2013,1,24,3,15
...,...,...,...,...,...,...,...,...,...,...,...,...
259178,18.500000,2009-04-12 09:58:56+00:00,-73.972656,40.764042,-74.013176,40.707840,2,2009,4,12,6,9
365838,10.100000,2012-07-12 19:30:00+00:00,-73.991982,40.749767,-73.989845,40.720551,3,2012,7,12,3,19
131932,10.900000,2011-02-17 18:33:00+00:00,-73.969055,40.761398,-73.990814,40.751328,1,2011,2,17,3,18
146867,23.000000,2014-08-04 11:14:00+00:00,-73.954620,40.764153,-73.868828,40.737686,2,2014,8,4,0,11


In [368]:
val_df

Unnamed: 0,fare_amount,pickup_datetime,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,passenger_count,pickup_datetime_year,pickup_datetime_month,pickup_datetime_day,pickup_datetime_weekday,pickup_datetime_hour
104241,16.5,2014-05-29 09:55:00+00:00,-74.007111,40.727531,-73.998749,40.755474,1,2014,5,29,3,9
199676,11.0,2013-11-21 15:05:00+00:00,-73.984566,40.760952,-73.975700,40.758713,1,2013,11,21,3,15
140199,12.5,2015-05-13 10:54:40+00:00,-73.980453,40.730564,-73.992928,40.744850,1,2015,5,13,2,10
132814,7.7,2011-08-25 21:41:00+00:00,-73.936501,40.761009,-73.934921,40.757839,1,2011,8,25,3,21
408697,8.5,2010-05-27 21:24:00+00:00,-73.976639,40.743896,-74.005203,40.739937,1,2010,5,27,3,21
...,...,...,...,...,...,...,...,...,...,...,...,...
66361,55.0,2011-12-08 09:35:00+00:00,-73.976997,40.758827,-73.976944,40.758839,1,2011,12,8,3,9
497228,15.7,2010-11-16 07:27:22+00:00,-73.963631,40.765121,-74.011368,40.701763,1,2010,11,16,1,7
152728,4.9,2011-11-18 18:21:16+00:00,-73.961510,40.760731,-73.968475,40.765610,1,2011,11,18,4,18
50155,10.5,2012-09-02 00:06:40+00:00,-73.987511,40.720173,-73.951569,40.726868,2,2012,9,2,6,0


### Add Distance Between Pickup and Drop

We can use the haversine distance:
- https://en.wikipedia.org/wiki/Haversine_formula
- https://stackoverflow.com/questions/29545704/fast-haversine-approximation-python-pandas
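
For reference, the haversine formula for two points with latitudes $\varphi_1, \varphi_2$ and longitudes $\lambda_1, \lambda_2$ (in radians) on a sphere of radius $R$ can be written as:

$$
a = \sin^2\!\left(\frac{\varphi_2 - \varphi_1}{2}\right) + \cos\varphi_1 \,\cos\varphi_2 \,\sin^2\!\left(\frac{\lambda_2 - \lambda_1}{2}\right), \qquad d = 2R \,\arcsin\!\left(\sqrt{a}\right)
$$

With $R$ in kilometers, $d$ is the great-circle distance in kilometers. The Earth's mean radius is about 6371 km, so the value 6367 used in this notebook changes distances by well under 0.1%.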

In [369]:
import numpy as np

def haversine_np(lon1, lat1, lon2, lat2):
    """
    Calculate the great circle distance between two points
    on the earth (specified in decimal degrees)

    All args must be of equal length.

    """
    lon1, lat1, lon2, lat2 = map(np.radians, [lon1, lat1, lon2, lat2])

    dlon = lon2 - lon1
    dlat = lat2 - lat1

    a = np.sin(dlat/2.0)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon/2.0)**2

    c = 2 * np.arcsin(np.sqrt(a))
    km = 6367 * c
    return km


In [370]:
def add_trip_distance(df):
    df['trip_distance'] = haversine_np(df['pickup_longitude'], df['pickup_latitude'], df['dropoff_longitude'], df['dropoff_latitude'])

In [371]:
add_trip_distance(train_df)

In [372]:
add_trip_distance(val_df)

In [373]:
add_trip_distance(test_df)

In [374]:
train_df

Unnamed: 0,fare_amount,pickup_datetime,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,passenger_count,pickup_datetime_year,pickup_datetime_month,pickup_datetime_day,pickup_datetime_weekday,pickup_datetime_hour,trip_distance
269056,5.300000,2009-01-05 17:53:00+00:00,-73.983948,40.743526,-73.993034,40.740448,1,2009,1,5,0,17,0.837945
499174,43.540001,2015-06-06 16:42:22+00:00,-73.870773,40.773632,-73.986328,40.733700,1,2015,6,6,5,16,10.691995
85143,6.000000,2013-04-06 20:21:00+00:00,-73.955322,40.767464,-73.965286,40.759247,1,2013,4,6,5,20,1.239849
260335,12.500000,2013-12-20 18:07:00+00:00,-73.960876,40.765411,-73.976669,40.756207,1,2013,12,20,4,18,1.677124
338124,9.000000,2013-01-24 15:38:00+00:00,-73.966148,40.805103,-73.977951,40.783932,1,2013,1,24,3,15,2.553845
...,...,...,...,...,...,...,...,...,...,...,...,...,...
259178,18.500000,2009-04-12 09:58:56+00:00,-73.972656,40.764042,-74.013176,40.707840,2,2009,4,12,6,9,7.116529
365838,10.100000,2012-07-12 19:30:00+00:00,-73.991982,40.749767,-73.989845,40.720551,3,2012,7,12,3,19,3.251601
131932,10.900000,2011-02-17 18:33:00+00:00,-73.969055,40.761398,-73.990814,40.751328,1,2011,2,17,3,18,2.146101
146867,23.000000,2014-08-04 11:14:00+00:00,-73.954620,40.764153,-73.868828,40.737686,2,2014,8,4,0,11,7.797785


### Add Distance From Popular Landmarks

> _**TIP #11**: Creative feature engineering (generally involving human insight or external data) is a lot more effective than excessive hyperparameter tuning. Just one or two good features can improve the model's performance drastically._

- JFK Airport
- LGA Airport
- EWR Airport
- Times Square
- Met Museum
- World Trade Center

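For each landmark, we'll add the distance from the drop location. (The coordinates below cover every landmark in the list except Times Square.)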

In [375]:
jfk_lonlat = -73.7781, 40.6413
lga_lonlat = -73.8740, 40.7769
ewr_lonlat = -74.1745, 40.6895
met_lonlat = -73.9632, 40.7794
wtc_lonlat = -74.0099, 40.7126

In [376]:
def add_landmark_dropoff_distance(df, landmark_name, landmark_lonlat):
    lon, lat = landmark_lonlat
    df[landmark_name + '_drop_distance'] = haversine_np(lon, lat, df['dropoff_longitude'], df['dropoff_latitude'])


In [377]:
def add_landmarks(a_df):
  landmarks = [('jfk', jfk_lonlat), ('lga', lga_lonlat), ('ewr', ewr_lonlat), ('met', met_lonlat), ('wtc', wtc_lonlat)]
  for name, lonlat in landmarks:
    add_landmark_dropoff_distance(a_df, name, lonlat)
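
The landmark helper passes scalar coordinates alongside whole columns, which works because NumPy broadcasts the scalars across the column. The self-contained sketch below checks that with two hypothetical drop-offs. `haversine_km` is a compact restatement of the formula for illustration, and the `drops` frame is invented.

```python
import numpy as np
import pandas as pd

def haversine_km(lon1, lat1, lon2, lat2, radius_km=6367):
    """Great-circle distance in km; accepts scalars, arrays or Series (in degrees)."""
    lon1, lat1, lon2, lat2 = map(np.radians, [lon1, lat1, lon2, lat2])
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return radius_km * 2 * np.arcsin(np.sqrt(a))

jfk_lonlat = -73.7781, 40.6413
lga_lonlat = -73.8740, 40.7769

# Hypothetical drop-offs: one exactly at JFK, one exactly at LGA
drops = pd.DataFrame({
    "dropoff_longitude": [jfk_lonlat[0], lga_lonlat[0]],
    "dropoff_latitude": [jfk_lonlat[1], lga_lonlat[1]],
})

# Scalar landmark coordinates are broadcast against the whole column
drops["jfk_drop_distance"] = haversine_km(
    *jfk_lonlat, drops["dropoff_longitude"], drops["dropoff_latitude"]
)
print(drops)
```

The first row should come out as 0 km and the second as about 17 km, the straight-line distance between the two airports. A check like this is a quick way to catch swapped longitude/latitude arguments.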

In [378]:
add_landmarks(train_df)

In [379]:
add_landmarks(val_df)

In [380]:
add_landmarks(test_df)

In [381]:
train_df

Unnamed: 0,fare_amount,pickup_datetime,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,passenger_count,pickup_datetime_year,pickup_datetime_month,pickup_datetime_day,pickup_datetime_weekday,pickup_datetime_hour,trip_distance,jfk_drop_distance,lga_drop_distance,ewr_drop_distance,met_drop_distance,wtc_drop_distance
269056,5.300000,2009-01-05 17:53:00+00:00,-73.983948,40.743526,-73.993034,40.740448,1,2009,1,5,0,17,0.837945,21.198206,10.807053,16.299885,5.004011,3.405029
499174,43.540001,2015-06-06 16:42:22+00:00,-73.870773,40.773632,-73.986328,40.733700,1,2015,6,6,5,16,10.691995,20.330046,10.604343,16.593920,5.438491,3.072369
85143,6.000000,2013-04-06 20:21:00+00:00,-73.955322,40.767464,-73.965286,40.759247,1,2013,4,6,5,20,1.239849,20.505774,7.929215,19.248837,2.246298,6.401762
260335,12.500000,2013-12-20 18:07:00+00:00,-73.960876,40.765411,-73.976669,40.756207,1,2013,12,20,4,18,1.677124,21.045547,8.941334,18.236153,2.815261,5.595941
338124,9.000000,2013-01-24 15:38:00+00:00,-73.966148,40.805103,-73.977951,40.783932,1,2013,1,24,3,15,2.553845,23.121374,8.781693,19.596359,1.339127,8.370646
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
259178,18.500000,2009-04-12 09:58:56+00:00,-73.972656,40.764042,-74.013176,40.707840,2,2009,4,12,6,9,7.116529,21.146925,14.006921,13.743814,8.996409,0.596505
365838,10.100000,2012-07-12 19:30:00+00:00,-73.991982,40.749767,-73.989845,40.720551,3,2012,7,12,3,19,3.251601,19.899387,11.589870,15.933608,6.913579,1.906144
131932,10.900000,2011-02-17 18:33:00+00:00,-73.969055,40.761398,-73.990814,40.751328,1,2011,2,17,3,18,2.146101,21.695084,10.233944,16.927792,3.889789,4.594002
146867,23.000000,2014-08-04 11:14:00+00:00,-73.954620,40.764153,-73.868828,40.737686,2,2014,8,4,0,11,7.797785,13.159647,4.379132,26.297808,9.197165,12.203115


### Remove Outliers and Invalid Data

There seem to be invalid values in each of the following columns (for example, the training set below has a minimum fare of -52 and a maximum passenger count of 208):

- Fare amount
- Passenger count
- Pickup latitude & longitude
- Drop latitude & longitude

In [382]:
test_df.describe()

Unnamed: 0,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,passenger_count,trip_distance,jfk_drop_distance,lga_drop_distance,ewr_drop_distance,met_drop_distance,wtc_drop_distance
count,9914.0,9914.0,9914.0,9914.0,9914.0,9914.0,9914.0,9914.0,9914.0,9914.0,9914.0
mean,-73.974716,40.751041,-73.973656,40.75174,1.671273,3.433216,20.916754,9.67518,18.546659,4.512898,6.037652
std,0.042774,0.033541,0.039072,0.035435,1.278747,3.969883,3.303943,3.295647,4.03582,4.018427,4.252539
min,-74.25219,40.573143,-74.263245,40.568974,1.0,0.0,0.4019,0.285629,0.28468,0.085747,0.040269
25%,-73.9925,40.736125,-73.991249,40.735253,1.0,1.297261,20.513337,8.311565,16.520517,2.126287,3.670107
50%,-73.982327,40.753052,-73.980015,40.754065,1.0,2.215648,21.181472,9.477797,18.02435,3.698123,5.541466
75%,-73.968012,40.767113,-73.964062,40.768757,2.0,4.043051,21.909794,10.965272,19.880536,5.922544,7.757612
max,-72.986534,41.709557,-72.990967,41.696682,6.0,99.933281,134.497726,126.062576,149.400787,130.347153,138.619492


In [383]:
train_df.describe()

Unnamed: 0,fare_amount,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,passenger_count,pickup_datetime_year,pickup_datetime_month,pickup_datetime_day,pickup_datetime_weekday,pickup_datetime_hour,trip_distance,jfk_drop_distance,lga_drop_distance,ewr_drop_distance,met_drop_distance,wtc_drop_distance
count,400000.0,400000.0,400000.0,400000.0,400000.0,400000.0,400000.0,400000.0,400000.0,400000.0,400000.0,400000.0,400000.0,400000.0,400000.0,400000.0,400000.0
mean,11.354983,-72.504951,39.917965,-72.510986,39.934383,1.683175,2011.741425,6.264857,15.725352,3.035895,13.500317,19.888382,192.697861,181.913086,190.961746,176.960953,178.449738
std,9.763722,11.890986,7.09113,12.550869,8.705482,1.346067,1.857366,3.435603,8.689137,1.950567,6.518666,372.5495,1221.379761,1224.433594,1226.279297,1226.150146,1226.249634
min,-52.0,-1183.362793,-2073.150635,-3356.729736,-2073.150635,0.0,2009.0,1.0,1.0,0.0,0.0,0.0,0.109021,0.116402,0.129245,0.031195,0.009281
25%,6.0,-73.992065,40.734867,-73.991417,40.733955,1.0,2010.0,3.0,8.0,1.0,9.0,1.213074,20.534558,8.347955,16.500039,2.165506,3.64077
50%,8.5,-73.981827,40.752602,-73.980164,40.753113,1.0,2012.0,6.0,16.0,3.0,14.0,2.116283,21.202111,9.573696,18.01742,3.813133,5.562365
75%,12.5,-73.967171,40.767017,-73.963722,40.768089,2.0,2013.0,9.0,23.0,5.0,19.0,3.879139,21.948405,11.124242,19.957847,6.067542,7.816745
max,350.0,2420.209473,404.766663,2467.752686,2602.013672,208.0,2015.0,12.0,31.0,6.0,23.0,14789.936523,15057.675781,15074.645508,15074.72168,15077.615234,15072.116211


We'll use the following ranges:

- `fare_amount`: \$1 to \$500
- `longitudes`: -75 to -72
- `latitudes`: 40 to 42
- `passenger_count`: 1 to 6


In [384]:
def remove_outliers(df):
    return df[(df['fare_amount'] >= 1.) &
              (df['fare_amount'] <= 500.) &
              (df['pickup_longitude'] >= -75) &
              (df['pickup_longitude'] <= -72) &
              (df['dropoff_longitude'] >= -75) &
              (df['dropoff_longitude'] <= -72) &
              (df['pickup_latitude'] >= 40) &
              (df['pickup_latitude'] <= 42) &
              (df['dropoff_latitude'] >= 40) &
              (df['dropoff_latitude'] <= 42) &
              (df['passenger_count'] >= 1) &
              (df['passenger_count'] <= 6)]

In [385]:
train_df = remove_outliers(train_df)

In [386]:
val_df = remove_outliers(val_df)
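
Before trusting a filter like this, it can help to see how many rows each rule removes. The following is a rough sketch on a toy DataFrame, not part of the original notebook; the `rules` list and the toy values (borrowed from the extreme values in the summary above) are illustrative assumptions.

```python
import pandas as pd

# Toy data: row 0 is valid, rows 1-3 each violate exactly one rule
toy_df = pd.DataFrame({
    'fare_amount': [8.5, -52.0, 12.5, 6.0],
    'pickup_longitude': [-73.98, -73.99, 2420.2, -73.97],
    'passenger_count': [1, 2, 1, 208],
})

# Hypothetical (column, low, high) rules mirroring the ranges listed above
rules = [
    ('fare_amount', 1.0, 500.0),
    ('pickup_longitude', -75, -72),
    ('passenger_count', 1, 6),
]

keep = pd.Series(True, index=toy_df.index)
for col, low, high in rules:
    in_range = toy_df[col].between(low, high)  # inclusive on both ends
    print(f"{col}: {(~in_range).sum()} row(s) out of range")
    keep &= in_range

clean_df = toy_df[keep]
print(len(toy_df), '->', len(clean_df))
```

Note that only the training and validation sets are filtered in this notebook; the test set has to be kept whole because a prediction is required for every row.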

In [387]:
train_df.describe()

Unnamed: 0,fare_amount,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,passenger_count,pickup_datetime_year,pickup_datetime_month,pickup_datetime_day,pickup_datetime_weekday,pickup_datetime_hour,trip_distance,jfk_drop_distance,lga_drop_distance,ewr_drop_distance,met_drop_distance,wtc_drop_distance
count,390200.0,390200.0,390200.0,390200.0,390200.0,390200.0,390200.0,390200.0,390200.0,390200.0,390200.0,390200.0,390200.0,390200.0,390200.0,390200.0,390200.0
mean,11.3452,-73.975159,40.750969,-73.97435,40.751339,1.688706,2011.739303,6.268175,15.725015,3.035584,13.50123,3.330171,20.913494,9.692996,18.47628,4.494944,5.967922
std,9.68101,0.039309,0.030027,0.038677,0.033212,1.304173,1.862556,3.436747,8.690088,1.950868,6.517602,3.731573,3.129202,3.109068,3.771652,3.822297,4.012028
min,1.0,-74.839172,40.063896,-74.843079,40.054207,1.0,2009.0,1.0,1.0,0.0,0.0,0.0,0.109021,0.116402,0.129245,0.031195,0.009281
25%,6.0,-73.992279,40.73653,-73.9916,40.735493,1.0,2010.0,3.0,8.0,1.0,9.0,1.253799,20.518509,8.320709,16.474435,2.136188,3.583156
50%,8.5,-73.982117,40.753304,-73.980606,40.753803,1.0,2012.0,6.0,16.0,3.0,14.0,2.153012,21.173024,9.519025,17.96044,3.715596,5.490717
75%,12.5,-73.968391,40.767437,-73.965363,40.768383,2.0,2013.0,9.0,23.0,5.0,19.0,3.916701,21.89833,10.989256,19.789979,5.905019,7.644531
max,300.0,-72.471581,41.787712,-72.113823,41.806301,6.0,2015.0,12.0,31.0,6.0,23.0,113.474625,141.011902,150.943359,174.761353,158.388168,161.233841


### Scaling and One-Hot Encoding

**Exercise**: Try scaling numeric columns to the `(0,1)` range and encoding categorical columns using a one-hot encoder.

We won't do this here, since we'll mostly be training tree-based models, which generally perform well without scaling or one-hot encoding.
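
For readers attempting the exercise, one possible starting point is sketched here on a toy DataFrame. It assumes scikit-learn's `MinMaxScaler` and `OneHotEncoder`; treating `pickup_datetime_weekday` as the categorical column is an illustrative choice, not something this notebook prescribes.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

toy_df = pd.DataFrame({
    'trip_distance': [0.8, 10.7, 1.2, 7.8],
    'passenger_count': [1, 1, 3, 2],
    'pickup_datetime_weekday': [0, 5, 5, 3],
})

numeric_cols = ['trip_distance', 'passenger_count']
categorical_cols = ['pickup_datetime_weekday']  # assumed categorical for this sketch

# Scale each numeric column to the (0, 1) range
scaler = MinMaxScaler()
scaled = scaler.fit_transform(toy_df[numeric_cols])

# One column per distinct weekday; the result is sparse, so convert it for display
encoder = OneHotEncoder(handle_unknown='ignore')
encoded = encoder.fit_transform(toy_df[categorical_cols]).toarray()

print(scaled.min(axis=0), scaled.max(axis=0))
print(encoder.categories_)
print(encoded)
```

On the real data, the scaler and encoder would be fitted on the training set only and then applied to the validation and test sets with `transform`, so that no information leaks from the held-out data.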

### Save Intermediate DataFrames

> _**TIP #12**: Save preprocessed & prepared data files to save time & experiment faster. You may also want to create different notebooks for EDA, feature engineering and model training._

Let's save the processed datasets in the Apache Parquet format, so that we can load them back easily to resume our work from this point.




In [388]:
train_df.to_parquet("train.parquet")

In [389]:
val_df.to_parquet("val.parquet")

## 7. Train & Evaluate Different Models

We'll train each of the following & submit predictions to Kaggle:

- Linear Regression
- Random Forests
- Gradient Boosting

**Exercise**: Train Ridge, SVM, KNN and Decision Tree models.

### Split Inputs & Targets

In [390]:
train_df.columns

Index(['fare_amount', 'pickup_datetime', 'pickup_longitude', 'pickup_latitude',
       'dropoff_longitude', 'dropoff_latitude', 'passenger_count',
       'pickup_datetime_year', 'pickup_datetime_month', 'pickup_datetime_day',
       'pickup_datetime_weekday', 'pickup_datetime_hour', 'trip_distance',
       'jfk_drop_distance', 'lga_drop_distance', 'ewr_drop_distance',
       'met_drop_distance', 'wtc_drop_distance'],
      dtype='object')

In [391]:
input_cols = [ 'pickup_datetime', 'pickup_longitude', 'pickup_latitude',
       'dropoff_longitude', 'dropoff_latitude', 'passenger_count',
       'pickup_datetime_year', 'pickup_datetime_month', 'pickup_datetime_day',
       'pickup_datetime_weekday', 'pickup_datetime_hour', 'trip_distance',
       'jfk_drop_distance', 'lga_drop_distance', 'ewr_drop_distance',
       'met_drop_distance', 'wtc_drop_distance']

In [392]:
target_col = "fare_amount"

In [413]:
train_inputs = train_df[input_cols]
train_target = train_df[target_col]

390200

In [419]:
len(train_target)

390200

In [394]:

val_inputs = val_df[input_cols]
val_target = val_df[target_col]

In [395]:
test_df.sample(2)

Unnamed: 0,key,pickup_datetime,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,passenger_count,trip_distance,jfk_drop_distance,lga_drop_distance,ewr_drop_distance,met_drop_distance,wtc_drop_distance
7193,2010-08-06 20:27:12.0000003,2010-08-06 20:27:12+00:00,-73.992317,40.734482,-73.991669,40.751415,2,1.88273,21.760553,10.300826,16.865587,3.925665,4.578442
9125,2011-06-24 12:03:00.000000117,2011-06-24 12:03:00+00:00,-73.991035,40.756004,-73.872414,40.774315,5,10.188869,16.781263,0.316723,27.127708,7.660629,13.454297


In [396]:
test_df.columns

Index(['key', 'pickup_datetime', 'pickup_longitude', 'pickup_latitude',
       'dropoff_longitude', 'dropoff_latitude', 'passenger_count',
       'trip_distance', 'jfk_drop_distance', 'lga_drop_distance',
       'ewr_drop_distance', 'met_drop_distance', 'wtc_drop_distance'],
      dtype='object')

In [397]:
add_dateparts(test_df, "pickup_datetime")

In [398]:
test_inputs = test_df[input_cols]

Let's define a helper function to evaluate models.

In [399]:
def evaluate(model):
    # Returns (train RMSE, validation RMSE, train predictions, validation predictions)
    train_preds = model.predict(train_inputs)
    print("train done1")
    train_rmse = rmse(train_target, train_preds)
    print("train done1")
    val_preds = model.predict(val_inputs)
    val_rmse = rmse(val_target, val_preds)
    return train_rmse, val_rmse, train_preds, val_preds
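
The helper relies on an `rmse` function defined earlier in the notebook. As a reminder of what it measures, here is a minimal NumPy sketch; the name `rmse_sketch` is hypothetical, and the notebook's own helper may be implemented differently (for example, via scikit-learn).

```python
import numpy as np


def rmse_sketch(targets, preds):
    """Root mean squared error between two equal-length sequences."""
    targets = np.asarray(targets, dtype=float)
    preds = np.asarray(preds, dtype=float)
    return float(np.sqrt(np.mean((targets - preds) ** 2)))


# Perfect predictions give an error of zero
perfect = rmse_sketch([8.5, 12.5, 6.0], [8.5, 12.5, 6.0])

# Errors of -3 and +4 give sqrt((9 + 16) / 2)
imperfect = rmse_sketch([10.0, 10.0], [13.0, 6.0])

print(perfect, imperfect)
```

Because the errors are squared before averaging, a few large misses (such as long airport trips predicted as short rides) weigh more heavily than many small ones.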

### Ridge Regression

See https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Ridge.html
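
Ridge is linear regression with an L2 penalty on the coefficients, and `alpha` controls how strong that penalty is. The sketch below illustrates the effect on synthetic data (the data and the two `alpha` values are assumptions chosen for illustration, not results from the taxi dataset).

```python
import numpy as np
from sklearn.linear_model import Ridge

# Synthetic regression problem with known coefficients plus a little noise
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))
y = X @ np.array([3.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)

# A small penalty stays close to ordinary least squares;
# a large penalty shrinks the coefficients toward zero
weak = Ridge(alpha=0.9).fit(X, y)
strong = Ridge(alpha=1000.0).fit(X, y)

print("alpha=0.9   ", np.round(weak.coef_, 3))
print("alpha=1000.0", np.round(strong.coef_, 3))
```

Since the penalty acts on the raw coefficients, features on very different scales are penalized unevenly, which is one reason scaling is often recommended before fitting linear models.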

In [400]:
from sklearn.linear_model import Ridge

In [401]:
model1 = Ridge(random_state=42, alpha=0.9)

In [402]:
train_inputs

Unnamed: 0,pickup_datetime,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,passenger_count,pickup_datetime_year,pickup_datetime_month,pickup_datetime_day,pickup_datetime_weekday,pickup_datetime_hour,trip_distance,jfk_drop_distance,lga_drop_distance,ewr_drop_distance,met_drop_distance,wtc_drop_distance
269056,2009-01-05 17:53:00+00:00,-73.983948,40.743526,-73.993034,40.740448,1,2009,1,5,0,17,0.837945,21.198206,10.807053,16.299885,5.004011,3.405029
499174,2015-06-06 16:42:22+00:00,-73.870773,40.773632,-73.986328,40.733700,1,2015,6,6,5,16,10.691995,20.330046,10.604343,16.593920,5.438491,3.072369
85143,2013-04-06 20:21:00+00:00,-73.955322,40.767464,-73.965286,40.759247,1,2013,4,6,5,20,1.239849,20.505774,7.929215,19.248837,2.246298,6.401762
260335,2013-12-20 18:07:00+00:00,-73.960876,40.765411,-73.976669,40.756207,1,2013,12,20,4,18,1.677124,21.045547,8.941334,18.236153,2.815261,5.595941
338124,2013-01-24 15:38:00+00:00,-73.966148,40.805103,-73.977951,40.783932,1,2013,1,24,3,15,2.553845,23.121374,8.781693,19.596359,1.339127,8.370646
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
259178,2009-04-12 09:58:56+00:00,-73.972656,40.764042,-74.013176,40.707840,2,2009,4,12,6,9,7.116529,21.146925,14.006921,13.743814,8.996409,0.596505
365838,2012-07-12 19:30:00+00:00,-73.991982,40.749767,-73.989845,40.720551,3,2012,7,12,3,19,3.251601,19.899387,11.589870,15.933608,6.913579,1.906144
131932,2011-02-17 18:33:00+00:00,-73.969055,40.761398,-73.990814,40.751328,1,2011,2,17,3,18,2.146101,21.695084,10.233944,16.927792,3.889789,4.594002
146867,2014-08-04 11:14:00+00:00,-73.954620,40.764153,-73.868828,40.737686,2,2014,8,4,0,11,7.797785,13.159647,4.379132,26.297808,9.197165,12.203115


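Ridge regression expects numeric inputs, so let's drop the raw `pickup_datetime` column. The date parts extracted from it remain as features.

In [415]:
# errors='ignore' keeps this cell safe to re-run once the column is already gone
train_inputs = train_inputs.drop(columns=['pickup_datetime'], errors='ignore')
val_inputs = val_inputs.drop(columns=['pickup_datetime'], errors='ignore')
test_inputs = test_inputs.drop(columns=['pickup_datetime'], errors='ignore')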

In [416]:
model1.fit(train_inputs, train_target)

In [423]:
def evaluate(model):
    train_preds = model.predict(train_inputs)
    print("train done1")
    train_rmse = rmse(train_target, train_preds)
    print("train done1")
    val_preds = model.predict(val_inputs)
    val_rmse = rmse(val_target, val_preds)
    return train_rmse, val_rmse, train_preds, val_preds

In [425]:
evaluate(model1)

train done1
train done1


(4.944418179692718,
 5.453732207018126,
 array([ 4.76633732, 27.69718762,  7.27867299, ...,  8.2914329 ,
        24.40591503, 10.5867265 ]),
 array([12.09729659,  6.6226218 , 10.33518392, ...,  5.84954713,
        11.02025813, 17.86537001]))