# New York City Taxi Fare Prediction

We'll train a machine learning model to predict the fare for a taxi ride in New York City, given information like pickup date & time, pickup location, drop location, and number of passengers.

Dataset Link: https://www.kaggle.com/c/new-york-city-taxi-fare-prediction

### Loading Training Set

Loading the entire dataset into Pandas is going to be slow, so we can use the following optimizations:

- Ignore the `key` column
- Parse pickup datetime while loading data
- Specify data types for other columns
   - `float32` for geo coordinates
   - `float32` for fare amount
   - `float32` for passenger count
- Work with a 10% sample of the data (~5.5M rows)

We can apply these optimizations while using [`pd.read_csv`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html).

In [19]:
import pandas as pd
import random

In [20]:
# Fraction of training rows to keep (0.1 = 10%); change this to load more or less data
sample_frac = 0.1

In [21]:
selected_cols = 'fare_amount,pickup_datetime,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,passenger_count'.split(',')
# Note: 'dropoff_latitude' is not listed here, so pandas loads it as float64
dtypes = {
    'fare_amount': 'float32',
    'pickup_longitude': 'float32',
    'pickup_latitude': 'float32',
    'dropoff_longitude': 'float32',
    'passenger_count': 'float32'
}

def skip_row(row_idx):
    if row_idx == 0:
        return False
    return random.random() > sample_frac

random.seed(42)
df = pd.read_csv("Datasets/nyctaxifare/train.csv", 
                 usecols=selected_cols, 
                 dtype=dtypes, 
                 parse_dates=['pickup_datetime'], 
                 skiprows=skip_row)

In [22]:
df.shape

(5542602, 7)
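To see how the `skiprows` callable samples rows, here is a minimal sketch on a tiny in-memory CSV. The data, `demo_frac`, and `skip_row_demo` are hypothetical stand-ins for `train.csv`, `sample_frac`, and `skip_row`. `pd.read_csv` passes each row index to the callable and skips the row when it returns `True`, so the header (row 0) always has to be kept.

```python
import io
import random

import pandas as pd

# Tiny in-memory CSV standing in for train.csv (hypothetical data)
csv_text = "a,b\n" + "\n".join(f"{i},{i * 2}" for i in range(1000))

demo_frac = 0.1  # keep roughly 10% of the data rows

def skip_row_demo(row_idx):
    # Row 0 is the header and must always be kept
    if row_idx == 0:
        return False
    # Skip each data row with probability 1 - demo_frac
    return random.random() > demo_frac

random.seed(42)
demo_df = pd.read_csv(io.StringIO(csv_text), skiprows=skip_row_demo)
print(len(demo_df))  # close to 100, but not exactly 100
```

Because every row is kept or skipped independently, the sample size is only approximately `demo_frac` times the number of rows. That is why the training sample above has about 5.54M rows rather than a round number.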

In [59]:
test_df = pd.read_csv("Datasets/nyctaxifare/test.csv", dtype=dtypes, parse_dates=['pickup_datetime'])

In [60]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5542602 entries, 0 to 5542601
Data columns (total 7 columns):
 #   Column             Dtype              
---  ------             -----              
 0   fare_amount        float32            
 1   pickup_datetime    datetime64[ns, UTC]
 2   pickup_longitude   float32            
 3   pickup_latitude    float32            
 4   dropoff_longitude  float32            
 5   dropoff_latitude   float64            
 6   passenger_count    float32            
dtypes: datetime64[ns, UTC](1), float32(5), float64(1)
memory usage: 190.3 MB


In [61]:
df.describe()

Unnamed: 0,fare_amount,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,passenger_count
count,5542602.0,5542602.0,5542602.0,5542569.0,5542569.0,5542602.0
mean,11.36259,-72.50569,39.91776,-72.50184,39.9175,1.686349
std,41.09729,12.84903,10.17996,13.0096,9.81618,1.324577
min,-300.0,-3439.245,-3492.264,-3379.079,-3547.887,0.0
25%,6.0,-73.99207,40.73493,-73.9914,40.73402,1.0
50%,8.5,-73.9818,40.75265,-73.98016,40.75314,1.0
75%,12.5,-73.96708,40.76712,-73.96368,40.76809,2.0
max,93963.36,3457.626,3376.602,3442.185,3400.392,208.0


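Observations about training data:

- 5.5M+ rows (about 10% of the training set, matching `sample_frac = 0.1`)
- 33 rows are missing dropoff coordinates (the `count` for `dropoff_longitude` and `dropoff_latitude` is 5,542,569 instead of 5,542,602)
- `fare_amount` ranges from \$-300.00 to \$93,963.36
- `passenger_count` ranges from 0 to 208
- There seem to be some errors in the latitude & longitude values (they range from roughly -3500 to 3500)
- Dates range from 1st Jan 2009 to 30th June 2015
- The dataset takes up ~190 MB of RAM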

We may need to deal with outliers and data entry errors before we train our model.

In [62]:
df['pickup_datetime'].min(), df['pickup_datetime'].max()

(Timestamp('2009-01-01 00:01:56+0000', tz='UTC'),
 Timestamp('2015-06-30 23:59:54+0000', tz='UTC'))

In [63]:
test_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9914 entries, 0 to 9913
Data columns (total 7 columns):
 #   Column             Non-Null Count  Dtype              
---  ------             --------------  -----              
 0   key                9914 non-null   object             
 1   pickup_datetime    9914 non-null   datetime64[ns, UTC]
 2   pickup_longitude   9914 non-null   float32            
 3   pickup_latitude    9914 non-null   float32            
 4   dropoff_longitude  9914 non-null   float32            
 5   dropoff_latitude   9914 non-null   float64            
 6   passenger_count    9914 non-null   float32            
dtypes: datetime64[ns, UTC](1), float32(4), float64(1), object(1)
memory usage: 387.4+ KB


In [64]:
test_df.describe()

Unnamed: 0,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,passenger_count
count,9914.0,9914.0,9914.0,9914.0,9914.0
mean,-73.974716,40.751041,-73.973656,40.751743,1.671273
std,0.042774,0.033541,0.039072,0.035435,1.278747
min,-74.25219,40.573143,-74.263245,40.568973,1.0
25%,-73.9925,40.736125,-73.991249,40.735254,1.0
50%,-73.982327,40.753052,-73.980015,40.754065,1.0
75%,-73.968012,40.767113,-73.964062,40.768757,2.0
max,-72.986534,41.709557,-72.990967,41.696683,6.0


Some observations about the test set:

- 9914 rows of data
- No missing values
- No obvious data entry errors
- 1 to 6 passengers (we can limit training data to this range)
- Latitudes lie between 40 and 42
- Longitudes lie between -75 and -72
- Pickup dates range from Jan 1st 2009 to Jun 30th 2015 (same as training set)

We can use the ranges of the test set to drop outliers/invalid data from the training set.
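A minimal sketch of such a filter is shown below. The coordinate and passenger ranges come from the test set observations above. The fare bounds (`min_fare`, `max_fare`) are assumptions, since the test set contains no fares, and the function name `remove_outliers_sketch` is hypothetical.

```python
import pandas as pd

def remove_outliers_sketch(df, min_fare=1.0, max_fare=500.0):
    """Keep rows whose values fall inside the ranges seen in the test set.

    min_fare and max_fare are assumed bounds, not values taken from the data.
    """
    mask = (
        df['fare_amount'].between(min_fare, max_fare)
        & df['pickup_longitude'].between(-75, -72)
        & df['dropoff_longitude'].between(-75, -72)
        & df['pickup_latitude'].between(40, 42)
        & df['dropoff_latitude'].between(40, 42)
        & df['passenger_count'].between(1, 6)
    )
    return df[mask]

# Toy rows (hypothetical): one valid ride and three invalid ones
toy = pd.DataFrame({
    'fare_amount': [8.5, -300.0, 12.5, 6.0],
    'pickup_longitude': [-73.98, -73.99, 3457.6, -73.97],
    'pickup_latitude': [40.75, 40.73, 40.76, 40.77],
    'dropoff_longitude': [-73.98, -73.99, -73.96, None],
    'dropoff_latitude': [40.75, 40.73, 40.76, None],
    'passenger_count': [1, 2, 1, 208],
})
cleaned = remove_outliers_sketch(toy)
print(len(cleaned))  # only the first row survives
```

`Series.between` returns `False` for missing values, so rows with missing coordinates are dropped by the same filter.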

### Exploratory Data Analysis and Visualization

**Exercise**: Create graphs (histograms, line charts, bar charts, scatter plots, box plots, geo maps etc.) to study the distribution of values in each column, and the relationship of each input column to the target.


### Ask & Answer Questions

**Exercise**: Ask & answer questions about the dataset: 

1. What is the busiest day of the week?
2. What is the busiest time of the day?
3. In which month are fares the highest?
4. Which pickup locations have the highest fares?
5. Which drop locations have the highest fares?
6. What is the average ride distance?

EDA + asking questions will help you develop a deeper understanding of the data and give you ideas for feature engineering.
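As a starting point, questions like the first and third can be answered with `value_counts` and `groupby`. The sketch below uses a small hypothetical DataFrame named `rides` with the same `pickup_datetime` and `fare_amount` columns as the training data. The same expressions should work on `df`.

```python
import pandas as pd

# Hypothetical rides; timestamps are in UTC, like the training data
rides = pd.DataFrame({
    'pickup_datetime': pd.to_datetime([
        '2015-01-27 13:08:24',
        '2015-01-20 09:00:00',
        '2015-01-24 18:30:00',
        '2015-06-30 23:59:54',
    ], utc=True),
    'fare_amount': [10.0, 8.0, 12.0, 20.0],
})

# 1. Busiest day of the week: count rides per day name
rides_per_day = rides['pickup_datetime'].dt.day_name().value_counts()
busiest_day = rides_per_day.idxmax()

# 3. Month with the highest fares: average fare per calendar month
mean_fare_by_month = rides.groupby(rides['pickup_datetime'].dt.month)['fare_amount'].mean()
priciest_month = mean_fare_by_month.idxmax()

print(busiest_day, priciest_month)
```

Since the timestamps are in UTC, questions about the time of day may need a conversion to New York local time first (for example with `dt.tz_convert`).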

## 3. Prepare Dataset for Training

- Split Training & Validation Set
- Fill/Remove Missing Values
- Extract Inputs & Outputs
   - Training
   - Validation
   - Test

In [65]:
from sklearn.model_selection import train_test_split

In [66]:
train_df, val_df = train_test_split(df, test_size=0.2, random_state=42)

In [67]:
len(train_df), len(val_df)

(4434081, 1108521)

### Fill/Remove Missing Values

Only a handful of rows (33 out of 5.5M in our sample) are missing dropoff coordinates. Since we have a lot of training data, we can simply drop the rows with missing values instead of trying to fill them.

In [68]:
train_df = train_df.dropna()
val_df = val_df.dropna()

### Extract Inputs & Outputs

In [69]:
input_cols = ['pickup_longitude', 'pickup_latitude', 'dropoff_longitude', 'dropoff_latitude', 'passenger_count']
target_col = 'fare_amount'

In [70]:
train_input = train_df[input_cols]
train_target = train_df[target_col]
val_input = val_df[input_cols]
val_target = val_df[target_col]

In [71]:
test_input = test_df[input_cols]

## 4. Train Hardcoded & Baseline Models

- Hardcoded model: always predict average fare
- Baseline model: Linear regression

In [72]:
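# Hardcoded model: always predicts the mean fare seen during training
import numpy as np

class MeanRegressor:
    def fit(self, X, y):
        # Ignore the inputs; just remember the average target value
        self.mean = y.mean()

    def predict(self, X):
        # One copy of the stored mean per input row
        return np.full(X.shape[0], self.mean)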

In [73]:
mean_model = MeanRegressor()
mean_model.fit(train_input, train_target)

In [74]:
mean_model.mean

np.float32(11.367966)

In [75]:
train_preds = mean_model.predict(train_input)
train_preds

array([11.367966, 11.367966, 11.367966, ..., 11.367966, 11.367966,
       11.367966], dtype=float32)

In [76]:
val_preds = mean_model.predict(val_input)
val_preds

array([11.367966, 11.367966, 11.367966, ..., 11.367966, 11.367966,
       11.367966], dtype=float32)

In [77]:
from sklearn.metrics import root_mean_squared_error

In [78]:
val_rmse = root_mean_squared_error(val_target, val_preds)
val_rmse

9.838235855102539
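For reference, RMSE is the square root of the mean squared difference between targets and predictions. The sketch below computes it by hand with NumPy on hypothetical toy fares. `rmse_sketch` is an illustrative helper, not the scikit-learn implementation.

```python
import numpy as np

def rmse_sketch(targets, preds):
    # Use float64 to avoid precision loss when averaging many squared errors
    targets = np.asarray(targets, dtype=np.float64)
    preds = np.asarray(preds, dtype=np.float64)
    return np.sqrt(np.mean((targets - preds) ** 2))

# Hypothetical fares and a constant prediction equal to their mean
toy_targets = np.array([6.0, 8.5, 12.5, 13.0])
toy_preds = np.full(toy_targets.shape[0], toy_targets.mean())

toy_rmse = rmse_sketch(toy_targets, toy_preds)
print(toy_rmse, np.std(toy_targets))
```

When the prediction is the mean of the same targets, the RMSE equals their (population) standard deviation. The hardcoded model's score is therefore essentially a measure of how spread out the fares are, which makes it a useful floor for comparing real models.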

In [79]:
## Linear Model
from sklearn.linear_model import LinearRegression

In [80]:
linear_model = LinearRegression()
linear_model.fit(train_input, train_target)

LinearRegression()


In [81]:
train_preds = linear_model.predict(train_input)
train_preds

array([11.79881713, 11.28554787, 11.28562818, ..., 11.28557965,
       11.28550731, 11.28550394])

In [82]:
root_mean_squared_error(train_target, train_preds)

45.68390868828415

In [83]:
val_preds = linear_model.predict(val_input)
val_preds

array([11.49021224, 11.28560154, 11.28563229, ..., 11.28555081,
       11.69488762, 11.69506312])

In [84]:
root_mean_squared_error(val_target, val_preds)

9.837345833541185

## 6. Feature Engineering


- Extract parts of date
- Remove outliers & invalid data
- Add distance between pickup & drop
- Add distance from landmarks

**Exercise**: We're going to apply all of the above together, but you should observe the effect of adding each feature individually.
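For the distance feature listed above, one common choice is the haversine (great-circle) distance. The sketch below is one possible implementation, not necessarily the one used later in this notebook. The function name `haversine_km` and the Earth radius of 6371 km are assumptions.

```python
import numpy as np

def haversine_km(lon1, lat1, lon2, lat2):
    """Great-circle distance in km between two points given in degrees.

    Works on scalars, NumPy arrays, and pandas Series alike.
    """
    lon1, lat1, lon2, lat2 = map(np.radians, (lon1, lat1, lon2, lat2))
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    # 6371 km is an assumed mean Earth radius
    return 6371.0 * 2 * np.arcsin(np.sqrt(a))

# Pickup and dropoff of the first test row shown in this notebook
trip_km = haversine_km(-73.973320, 40.763805, -73.981430, 40.743835)
print(trip_km)
```

Applied to whole columns, it could be used as `haversine_km(df['pickup_longitude'], df['pickup_latitude'], df['dropoff_longitude'], df['dropoff_latitude'])` to add a trip distance column.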

In [85]:
def add_dateparts(df, col):
    df[col + '_year'] = df[col].dt.year
    df[col + '_month'] = df[col].dt.month
    df[col + '_day'] = df[col].dt.day
    df[col + '_weekday'] = df[col].dt.weekday
    df[col + '_hour'] = df[col].dt.hour

In [86]:
add_dateparts(train_df, 'pickup_datetime')
add_dateparts(val_df, 'pickup_datetime')

In [87]:
test_df

Unnamed: 0,key,pickup_datetime,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,passenger_count
0,2015-01-27 13:08:24.0000002,2015-01-27 13:08:24+00:00,-73.973320,40.763805,-73.981430,40.743835,1.0
1,2015-01-27 13:08:24.0000003,2015-01-27 13:08:24+00:00,-73.986862,40.719383,-73.998886,40.739201,1.0
2,2011-10-08 11:53:44.0000002,2011-10-08 11:53:44+00:00,-73.982521,40.751259,-73.979652,40.746139,1.0
3,2012-12-01 21:12:12.0000002,2012-12-01 21:12:12+00:00,-73.981163,40.767807,-73.990448,40.751635,1.0
4,2012-12-01 21:12:12.0000003,2012-12-01 21:12:12+00:00,-73.966049,40.789776,-73.988564,40.744427,1.0
...,...,...,...,...,...,...,...
9909,2015-05-10 12:37:51.0000002,2015-05-10 12:37:51+00:00,-73.968124,40.796997,-73.955643,40.780388,6.0
9910,2015-01-12 17:05:51.0000001,2015-01-12 17:05:51+00:00,-73.945511,40.803600,-73.960213,40.776371,6.0
9911,2015-04-19 20:44:15.0000001,2015-04-19 20:44:15+00:00,-73.991600,40.726608,-73.789742,40.647011,6.0
9912,2015-01-31 01:05:19.0000005,2015-01-31 01:05:19+00:00,-73.985573,40.735432,-73.939178,40.801731,6.0


In [88]:
train_df

Unnamed: 0,fare_amount,pickup_datetime,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,passenger_count,pickup_datetime_year,pickup_datetime_month,pickup_datetime_day,pickup_datetime_weekday,pickup_datetime_hour
2056898,40.0,2013-10-03 09:41:00+00:00,-73.783417,40.648640,-73.738358,40.770730,6.0,2013,10,3,3,9
2987026,6.0,2014-10-07 20:55:40+00:00,-73.960716,40.780994,-73.969803,40.765275,1.0,2014,10,7,1,20
838145,9.0,2014-04-09 23:54:00+00:00,-73.973808,40.752071,-73.949394,40.777280,1.0,2014,4,9,2,23
4760740,8.0,2012-10-17 21:39:52+00:00,-73.975105,40.753578,-73.976768,40.735029,1.0,2012,10,17,2,21
2580632,6.1,2009-05-16 07:04:00+00:00,-73.921036,40.756550,-73.943451,40.747370,2.0,2009,5,16,5,7
...,...,...,...,...,...,...,...,...,...,...,...,...
1570006,11.3,2011-02-14 15:43:42+00:00,-73.980797,40.774883,-73.981026,40.744616,1.0,2011,2,14,0,15
2234489,7.5,2014-06-01 10:26:10+00:00,-73.961388,40.780155,-73.956810,40.767911,2.0,2014,6,1,6,10
4926484,12.1,2009-06-14 11:48:00+00:00,-73.949440,40.781200,-73.982399,40.738453,1.0,2009,6,14,6,11
4304572,10.5,2015-04-29 23:01:40+00:00,-73.987167,40.766186,-73.990158,40.738064,1.0,2015,4,29,2,23


In [89]:

add_dateparts(test_df, 'pickup_datetime')