# New York Taxi Fare Prediction

Can you predict a rider's taxi fare?

Kaggle: https://www.kaggle.com/competitions/new-york-city-taxi-fare-prediction/overview

Project outline:
1. Download/ Load the dataset
2. Explore & analyze the dataset
3. Prepare the dataset for ML training
4. Train hardcoded & baseline models
5. Make predictions & submit to Kaggle
6. Perform feature engineering
7. Train & evaluate different models
8. Tune hyperparameters for the best models
9. Train on a GPU with the entire dataset
10. Document & publish the project online

## 1. Download/ Load the dataset

Steps:
* Install required libraries
* Download data from Kaggle
* View dataset files
* Load training set in dataframe
* Load test set in dataframe

In [None]:
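# Import time package and record the start time (used to report total runtime at the end)
import time
start_time = time.time()

# Import dask libraries (dask_cudf ships with RAPIDS and is only needed for the GPU section)
import dask
import dask_cudf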

### View Dataset Files

#### Use Shell commands to view the large dataset

In [None]:
# set path of the directory
data_dir = './'

In [None]:
# List of files with size from directory
!ls -lh {data_dir}

In [None]:
# Preview the first lines of the training set
!head {data_dir}/train.csv

In [None]:
# Preview the first lines of the test set
!head {data_dir}/test.csv

In [None]:
# Preview the first lines of the sample submission file
!head {data_dir}/sample_submission.csv

In [None]:
# No. of lines in training set
!wc -l {data_dir}/train.csv

In [None]:
# No. of lines in test set
!wc -l {data_dir}/test.csv

In [None]:
# No. of lines in submission file
!wc -l {data_dir}/sample_submission.csv

Observations:
* It is a supervised learning regression problem
* Size of training data is 5.5 GB
* Training data has ~55 million rows
* Test data has < 10,000 rows
* Training set has 8 columns:
    * `key` (unique identifier)
    * `fare_amount` (target column)
    * `pickup_datetime`
    * `pickup_longitude`
    * `pickup_latitude`
    * `dropoff_longitude`
    * `dropoff_latitude`
    * `passenger_count`
* Test set does not include the target column `fare_amount`
* Submission file contains the `key` and `fare_amount` columns for each test row

### Loading Training Set

> Tip: When working with large datasets, always start with a sample to experiment & iterate faster

Loading the entire dataset into Pandas is very slow, so we use the following optimizations:
* Ignore the `key` column
* Parse `pickup_datetime` while loading the data
* Specify data types for other columns
    * `float32` for geo coordinates
    * `float32` for fare 
    * `uint8` for passenger count
* Work with a 1% sample of the data (~550k rows)
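
To get a feel for why the data types matter, here is a minimal sketch on synthetic data (the values and the `raw_df` / `compact_df` names are made up for illustration, not taken from the dataset). It compares the memory used by pandas' default 64-bit types with the compact types listed above:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for a few taxi columns (values are made up for illustration)
rng = np.random.default_rng(42)
n_rows = 10_000
raw_df = pd.DataFrame({
    'fare_amount': rng.uniform(2.5, 60, n_rows),            # float64 by default
    'pickup_longitude': rng.uniform(-74.1, -73.7, n_rows),  # float64 by default
    'pickup_latitude': rng.uniform(40.6, 40.9, n_rows),     # float64 by default
    'passenger_count': rng.integers(1, 7, n_rows),          # int64 by default
})

# Downcast to the compact types listed above
compact_df = raw_df.astype({
    'fare_amount': 'float32',
    'pickup_longitude': 'float32',
    'pickup_latitude': 'float32',
    'passenger_count': 'uint8',
})

raw_bytes = raw_df.memory_usage(index=False).sum()
compact_bytes = compact_df.memory_usage(index=False).sum()
print(f'64-bit types: {raw_bytes:,} bytes, compact types: {compact_bytes:,} bytes')
```

On the real training set the saving should scale with the number of rows, which is what makes the sample comfortable to hold in RAM.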

In [None]:
# Import packages
import pandas as pd
import random

In [None]:
# set sample fraction of dataset
sample_fraction = 0.01

In [None]:
%%time

# define selected columns
selected_cols = ['fare_amount', 'pickup_datetime', 'pickup_longitude',
                 'pickup_latitude', 'dropoff_longitude', 'dropoff_latitude', 'passenger_count']

# define data types
dtypes = {
    'fare_amount': 'float32',
    'pickup_longitude': 'float32',
    'pickup_latitude': 'float32',
    'dropoff_longitude': 'float32',
    'dropoff_latitude': 'float32',
    'passenger_count': 'uint8'
}

# function for selecting random rows of data
def skip_row(row_idx):
    if row_idx == 0:
        return False
    return random.random() > sample_fraction

# Set random seed
random.seed(42)

# Load data to a pandas dataframe
df = pd.read_csv(data_dir+'/train.csv', usecols=selected_cols,
                 parse_dates=['pickup_datetime'], dtype=dtypes,
                 skiprows=skip_row)


In [None]:
# show train dataset
df

***Fix the seeds of random number generators so that we get the same results every time we run the notebook.***
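
As a small illustration of that point, the sketch below uses a hypothetical helper, `sampled_rows`, that mimics a random row filter. Calling it twice with the same seed should keep exactly the same row indices:

```python
import random

def sampled_rows(seed, n_rows=1000, fraction=0.01):
    """Return the row indices kept by a random filter for a given seed."""
    random.seed(seed)
    return [i for i in range(n_rows) if random.random() <= fraction]

first_run = sampled_rows(seed=42)
second_run = sampled_rows(seed=42)

# Same seed, same sample: this is what makes the notebook reproducible
print(first_run == second_run)
```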

### Load Test Set

Update data types and parse datetime.

In [None]:
# load test dataset
test_df = pd.read_csv(data_dir+'/test.csv', dtype=dtypes, parse_dates=['pickup_datetime'])

In [None]:
# show test dataset
test_df

## 2. Explore & Analyze Dataset

* Basic info about training set
* Basic info about test set
* Exploratory data analysis & visualization
* Ask and answer questions

### Training Set

In [None]:
# show train set information
df.info()

In [None]:
# show summary statistics of train set
df.describe()

In [None]:
# show datetime ranges
df['pickup_datetime'].min(), df['pickup_datetime'].max()

Observations about training data:
* 550K+ rows
* No missing data in sample data
* `fare_amount` ranges from `$-52.0` to `$499.0`
* `passenger_count` ranges from 0 to 208
* There are errors in the longitude and latitude values (outliers are present)
* `pickup_datetime` ranges from 1st Jan 2009 to 30th June 2015
* Sample data takes up ~16 MB of RAM

***We may need to deal with outliers and data entry errors before training an ML model.***

### Test Set

In [None]:
# show test set information
test_df.info()

In [None]:
# show summary statistics of test set
test_df.describe()

In [None]:
# show datetime ranges
test_df['pickup_datetime'].min(), test_df['pickup_datetime'].max()

Observations about test set:
* 9914 rows of data
* No missing values
* No obvious data entry errors
* `passenger_count` ranges from 1 to 6 passengers (we can limit the training data to this range)
* Latitude lies between 40 and 42
* Longitude lies between -74 and -73
* `pickup_datetime` ranges from 1st Jan 2009 to 30th June 2015 (same as training set)

***We can use these ranges from the test set to drop outliers/invalid data from the training set.***

### Exploratory Data Analysis and Visualization

**Tasks**: Create graphs (histograms, line charts, bar charts, scatter plots, box plots, geo maps etc.) to study the distribution of values in each column, and the relationship of each input column to the target.

### Ask & Answer Questions

Questions:
1. What is the busiest day of the week?
2. What is the busiest time of the day?
3. In which month are fares the highest?
4. Which pickup locations have the highest fares?
5. Which drop locations have the highest fares?
6. What is the average ride distance?
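
A minimal sketch of how the first few questions could be answered with pandas. The `rides` dataframe below is a tiny made-up sample; in the notebook the same expressions would be applied to the training dataframe:

```python
import pandas as pd

# Tiny made-up sample standing in for the training dataframe
rides = pd.DataFrame({
    'pickup_datetime': pd.to_datetime([
        '2014-03-07 18:05', '2014-03-07 19:30', '2014-03-08 02:15',
        '2014-03-10 08:45', '2014-03-14 18:20',
    ]),
    'fare_amount': [12.5, 8.0, 22.0, 6.5, 15.0],
})

# Q1: number of rides per day of the week
rides_per_weekday = rides['pickup_datetime'].dt.day_name().value_counts()
busiest_weekday = rides_per_weekday.idxmax()

# Q2: number of rides per hour of the day
rides_per_hour = rides['pickup_datetime'].dt.hour.value_counts()
busiest_hour = rides_per_hour.idxmax()

# Q3: average fare per month
mean_fare_per_month = rides.groupby(rides['pickup_datetime'].dt.month)['fare_amount'].mean()

print(busiest_weekday, busiest_hour)
print(mean_fare_per_month)
```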

> Understanding the data using EDA will give ideas for feature engineering.

> Take an iterative approach to building ML models: do some EDA, do some feature engineering, train a model, repeat to improve the model.

## 3. Prepare Dataset for ML Training

* Split Training & Validation Set
* Fill/ Remove Missing Values
* Extract Inputs & Outputs
    * Training 
    * Validation
    * Test

### Split Training & Validation Set

Set aside 20% of the training data as a validation set to evaluate the trained models. We pick a random 20% fraction because the test set and the training set cover the same date range.

> TIP: The validation set should be as similar as possible to the test set or real-world data, i.e. a model's evaluation scores on the validation and test sets should be very close.

In [None]:
# Import package from sklearn for splitting data
from sklearn.model_selection import train_test_split

In [None]:
# split data into train and validation set
train_df, val_df = train_test_split(df, test_size=0.2, random_state=42)

In [None]:
# check row count
len(train_df), len(val_df)

### Fill/ Remove Missing Values

There were no missing values in the sample data, but if there were, we would drop the rows with missing values instead of trying to fill them (since we have a lot of training data).
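
A quick way to confirm whether anything is missing before dropping rows is to count the missing values per column. The sketch below uses a small made-up dataframe (`sample`) rather than the real data:

```python
import numpy as np
import pandas as pd

# Made-up rows with two missing entries
sample = pd.DataFrame({
    'fare_amount': [7.5, np.nan, 12.0, 5.5],
    'passenger_count': [1, 2, np.nan, 1],
})

# Count missing values per column, then drop every row that has any
missing_per_column = sample.isna().sum()
cleaned = sample.dropna()

print(missing_per_column)
print(len(sample), '->', len(cleaned), 'rows')
```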

In [None]:
# drop missing values
train_df = train_df.dropna()
val_df = val_df.dropna()

### Extract Inputs and Outputs

In [None]:
# list all columns
df.columns

In [None]:
# define input columns
input_cols = ['pickup_longitude', 'pickup_latitude',
              'dropoff_longitude', 'dropoff_latitude', 'passenger_count']

# define target column
target_col = 'fare_amount'

### Training Inputs

In [None]:
# get training inputs data
train_inputs = train_df[input_cols]

In [None]:
# get training target data
train_targets = train_df[target_col]

In [None]:
# show train inputs data
train_inputs

In [None]:
# show train target data
train_targets

### Validation Inputs

In [None]:
# get validation inputs data
val_inputs = val_df[input_cols]

In [None]:
# get validation target data
val_targets = val_df[target_col]

In [None]:
# show validation inputs data
val_inputs

In [None]:
# show validation target data
val_targets

### Test Inputs

The test set has no target column; the fare is what we have to predict.

In [None]:
# get test inputs data
test_inputs = test_df[input_cols]

In [None]:
# show test inputs data
test_inputs

## 4. Train Hardcoded & Baseline models

> TIP: Always create a simple hardcoded or baseline model to establish the minimum score any proper ML model should beat.

* Hardcoded model: always predict average fare
* Baseline model: Linear regression

### Train & Evaluate Hardcoded Model

Create a simple model that always predicts the average fare.

In [None]:
# Import numpy package
import numpy as np

In [None]:
# create a class for training and predicting average
class MeanRegressor:
    def fit(self, inputs, targets):
        self.mean = targets.mean()
    
    def predict(self, inputs):
        return np.full(inputs.shape[0], self.mean)

In [None]:
# Instantiate the model
mean_model = MeanRegressor()

In [None]:
# Use fit function from class
mean_model.fit(train_inputs, train_targets)

In [None]:
# get the mean from the model
mean_model.mean

In [None]:
# predict the mean from model using train set
train_preds = mean_model.predict(train_inputs)

In [None]:
# show prediction results on train set
train_preds

In [None]:
# show actual targets from train set
train_targets

In [None]:
# predict the mean from model using validation set
val_preds = mean_model.predict(val_inputs)

In [None]:
# show prediction results on validation set
val_preds

In [None]:
# show actual targets from validation set
val_targets

In [None]:
# Import metrics packages from sklearn
from sklearn.metrics import mean_squared_error

In [None]:
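# define a function to calculate root mean squared error (RMSE)
def rmse(targets, preds):
    # taking np.sqrt of the MSE works across scikit-learn versions
    # (the `squared=False` argument was removed in recent releases)
    return np.sqrt(mean_squared_error(targets, preds))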

In [None]:
# calculate train loss (root mean squared error)
train_rmse = rmse(train_targets, train_preds)
train_rmse

In [None]:
# calculate validation loss (root mean squared error)
val_rmse = rmse(val_targets, val_preds)
val_rmse

The hardcoded model has an RMSE of `$9.899`, which is pretty bad considering the average fare is `$11.35`.

### Train & Evaluate Baseline Model

Train a linear regression model as our baseline model, which tries to express the target as a weighted sum of the inputs.
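
To make "weighted sum of the inputs" concrete, here is a minimal sketch on made-up data (the arrays `X` and `y` are illustrative, not the taxi data). After fitting, a prediction should equal the intercept plus the dot product of the inputs with the learned weights:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up data: two input features and a target that is an exact weighted sum
X = np.array([[1.0, 2.0], [2.0, 0.5], [3.0, 1.5], [4.0, 3.0]])
y = 2.0 * X[:, 0] + 0.5 * X[:, 1] + 3.0

model = LinearRegression().fit(X, y)

# A prediction is the intercept plus the weighted sum of the inputs
manual_preds = X @ model.coef_ + model.intercept_

print(model.coef_, model.intercept_)
print(np.allclose(manual_preds, model.predict(X)))
```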

In [None]:
# Import linear regression package from sklearn
from sklearn.linear_model import LinearRegression

In [None]:
# Instantiate the linear regression
linear_model = LinearRegression()

In [None]:
# Fit the model to training data
linear_model.fit(train_inputs, train_targets)

In [None]:
# make prediction on training data
train_preds = linear_model.predict(train_inputs)

In [None]:
# show prediction on training data
train_preds

In [None]:
# calculate training loss (root mean squared error)
rmse(train_targets, train_preds)

In [None]:
# make prediction on validation data
val_preds = linear_model.predict(val_inputs)

In [None]:
# show prediction on validation data
val_preds

In [None]:
# calculate validation loss (root mean squared error)
rmse(val_targets, val_preds)

The linear regression model is off by `$9.899`, which isn't much better than simply predicting the average.

This is mainly because the training data (geocoordinates) is not in a format that is useful for the model, and we are not yet using one of the most important columns, `pickup_datetime`.

From here on, any proper model should beat this baseline.

## 5. Make Predictions & Submit to Kaggle

* Make predictions for test set
* Generate submissions CSV
* Submit to Kaggle
* Record in experiment tracking sheet

In [None]:
# show test inputs
test_inputs

In [None]:
# make predictions on test data
test_preds = linear_model.predict(test_inputs)

In [None]:
# show prediction results on test data
test_preds

In [None]:
# load sample submission data
submission_df = pd.read_csv(data_dir+'/sample_submission.csv')

In [None]:
# show sample submission data
submission_df

The test data and the sample submission have the same number of rows and identical `key` columns, so we just need to replace the `fare_amount` column with our predictions.

In [None]:
# now replace the fare_amount with test predictions
submission_df['fare_amount'] = test_preds

In [None]:
# show new sample submission data
submission_df

In [None]:
# save submission data to CSV
submission_df.to_csv('linear_model_submission.csv', index=None)

In [None]:
# function to create submission file
def generate_submission(test_preds, fname):
    sub_df = pd.read_csv(data_dir+'/sample_submission.csv')
    sub_df['fare_amount'] = test_preds
    sub_df.to_csv(fname, index=None)

In [None]:
# create submission file for linear_model
generate_submission(test_preds, 'linreg_submission.csv')

## 6. Feature Engineering

> TIP: Take an iterative approach to feature engineering. Add some features, train a model, evaluate it, keep the features if they help, otherwise drop them, then repeat.

* Extract parts of date
* Remove outliers & invalid data
* Add distance between pickup & drop
* Add distance from landmarks

### Extract Parts of Date
* Year
* Month
* Day
* Weekday
* Hour

In [None]:
# define a function to extract parts from datetime
def add_dateparts(df, col):
    df[col + '_year'] = df[col].dt.year
    df[col + '_month'] = df[col].dt.month
    df[col + '_day'] = df[col].dt.day
    df[col + '_weekday'] = df[col].dt.weekday
    df[col + '_hour'] = df[col].dt.hour

In [None]:
# add date parts in train set
add_dateparts(train_df, 'pickup_datetime')

In [None]:
# add date parts in validation set
add_dateparts(val_df, 'pickup_datetime')

In [None]:
# add date parts in test set
add_dateparts(test_df, 'pickup_datetime')

In [None]:
# show train set
train_df

In [None]:
# show validation set
val_df

In [None]:
# show test set
test_df

### Add Distance Between Pickup and Drop Location

We can use the haversine distance: 
- https://en.wikipedia.org/wiki/Haversine_formula
- https://stackoverflow.com/questions/29545704/fast-haversine-approximation-python-pandas
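
For reference, the haversine formula for two points with latitudes $\varphi_1, \varphi_2$ and longitudes $\lambda_1, \lambda_2$ (all in radians), with $\Delta\varphi = \varphi_2 - \varphi_1$ and $\Delta\lambda = \lambda_2 - \lambda_1$, is:

$$
a = \sin^2\!\left(\frac{\Delta\varphi}{2}\right) + \cos\varphi_1 \cos\varphi_2 \, \sin^2\!\left(\frac{\Delta\lambda}{2}\right), \qquad d = 2R \, \arcsin\!\left(\sqrt{a}\right)
$$

Here $R$ is the Earth's radius and $d$ is the great-circle distance in the same unit as $R$. The implementation in this notebook uses $R = 6367$ km.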

In [None]:
# function to calculate distances between two points
import numpy as np

def haversine_np(lon1, lat1, lon2, lat2):
    """
    Calculate the great circle distance between two points
    on the earth (specified in decimal degrees)

    All args must be of equal length.    

    """
    lon1, lat1, lon2, lat2 = map(np.radians, [lon1, lat1, lon2, lat2])

    dlon = lon2 - lon1
    dlat = lat2 - lat1

    a = np.sin(dlat/2.0)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon/2.0)**2

    c = 2 * np.arcsin(np.sqrt(a))
    km = 6367 * c
    return km
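
As a quick sanity check of the formula, the sketch below re-implements it in a self-contained helper (named `haversine_km` here only for illustration) and measures the distance between approximate coordinates for JFK and LGA airports. The two airports are known to be a little under 20 km apart, so the result should land in that ballpark, and the distance from a point to itself should be zero:

```python
import numpy as np

def haversine_km(lon1, lat1, lon2, lat2, radius_km=6367):
    """Great-circle distance in km between points given in decimal degrees."""
    lon1, lat1, lon2, lat2 = map(np.radians, [lon1, lat1, lon2, lat2])
    a = (np.sin((lat2 - lat1) / 2.0) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2.0) ** 2)
    return 2 * radius_km * np.arcsin(np.sqrt(a))

# Approximate (longitude, latitude) pairs
jfk = (-73.7781, 40.6413)
lga = (-73.8740, 40.7769)

same_point = haversine_km(*jfk, *jfk)
jfk_to_lga = haversine_km(*jfk, *lga)
print(f'JFK to LGA: {jfk_to_lga:.1f} km')
```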

In [None]:
# define function to add distance in dataframe
def add_trip_distance(df):
    df['trip_distance'] = haversine_np(df['pickup_longitude'], 
                                          df['pickup_latitude'], 
                                          df['dropoff_longitude'], 
                                          df['dropoff_latitude'])

In [None]:
# add trip distance in train set
add_trip_distance(train_df)

In [None]:
# add trip distance in validation set
add_trip_distance(val_df)

In [None]:
# add trip distance in test set
add_trip_distance(test_df)

In [None]:
# show train set
train_df.head()

In [None]:
# show validation set
val_df.head()

In [None]:
# show test set
test_df.head()

### Add Distance From Popular Landmarks

> TIP: Creative feature engineering (generally involving human insights or external data) is a lot more effective than excessive hyperparameter tuning. Just one or two good features can improve the model's performance drastically.

* JFK Airport
* LGA Airport
* EWR Airport
* Times Square
* Met Museum
* World Trade Center

We add the distance from each landmark to the drop location. (**Exercise**: also add the distance from the pickup location.)

In [None]:
jfk_lonlat = -73.7781, 40.6413
lga_lonlat = -73.8740, 40.7769
ewr_lonlat = -74.1745, 40.6895
met_lonlat = -73.9632, 40.7794
wtc_lonlat = -74.0099, 40.7126

In [None]:
# function to calculate drop location distance from popular landmarks
def add_landmark_dropoff_distance(df, landmark_name, landmark_lonlat):
    lon, lat = landmark_lonlat
    df[landmark_name + '_drop_distance'] = haversine_np(lon, lat, df['dropoff_longitude'], df['dropoff_latitude'])


In [None]:
# add distance data in dataframe
def add_landmarks(a_df):
    landmarks = [('jfk', jfk_lonlat), ('lga', lga_lonlat), ('ewr', ewr_lonlat), ('met', met_lonlat), ('wtc', wtc_lonlat)]
    for name, lonlat in landmarks:
        add_landmark_dropoff_distance(a_df, name, lonlat)
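
The same idea extends to pickup locations. The sketch below is one possible approach, not part of the original pipeline: `add_landmark_pickup_distance` and `haversine_km` are hypothetical helpers defined here so the example runs on its own, and the `trips` dataframe is made up.

```python
import numpy as np
import pandas as pd

def haversine_km(lon1, lat1, lon2, lat2, radius_km=6367):
    """Great-circle distance in km between points given in decimal degrees."""
    lon1, lat1, lon2, lat2 = map(np.radians, [lon1, lat1, lon2, lat2])
    a = (np.sin((lat2 - lat1) / 2.0) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2.0) ** 2)
    return 2 * radius_km * np.arcsin(np.sqrt(a))

def add_landmark_pickup_distance(df, landmark_name, landmark_lonlat):
    """Add a column with the distance from the landmark to each pickup point."""
    lon, lat = landmark_lonlat
    df[landmark_name + '_pickup_distance'] = haversine_km(
        lon, lat, df['pickup_longitude'], df['pickup_latitude'])

# Made-up trips: one starting at JFK, one starting in midtown Manhattan
trips = pd.DataFrame({
    'pickup_longitude': [-73.7781, -73.9855],
    'pickup_latitude': [40.6413, 40.7580],
})
add_landmark_pickup_distance(trips, 'jfk', (-73.7781, 40.6413))
print(trips)
```

Whether the extra columns help should be checked the usual way: retrain, compare validation RMSE, and keep them only if the score improves.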

In [None]:
# add drop distance from landmark in train set
add_landmarks(train_df)

In [None]:
# add drop distance from landmark in validation set
add_landmarks(val_df)

In [None]:
# add drop distance from landmark in test set
add_landmarks(test_df)

In [None]:
# show train set
train_df.sample(5)

In [None]:
# show validation set
val_df.sample(5)

In [None]:
# show test set
test_df.sample(5)

### Remove Outliers and Invalid Data

There seems to be some invalid data in each of the following columns:

* Fare amount
* Passenger count
* Pickup latitude & longitude
* Drop latitude & longitude

In [None]:
# show summary statistics for numerical columns in train set
train_df.describe()

In [None]:
# show summary statistics for numerical columns in test set
test_df.describe()

Use the following ranges to filter the training and validation data (the coordinate and passenger ranges are based on the test data):
- `fare_amount`: \$1 to \$500
- `longitudes`: -75 to -72
- `latitudes`: 40 to 42
- `passenger_count`: 1 to 6

In [None]:
# function to remove outliers
def remove_outliers(df):
    return df[(df['fare_amount'] >= 1.) & 
              (df['fare_amount'] <= 500.) &
              (df['pickup_longitude'] >= -75) & 
              (df['pickup_longitude'] <= -72) & 
              (df['dropoff_longitude'] >= -75) & 
              (df['dropoff_longitude'] <= -72) & 
              (df['pickup_latitude'] >= 40) & 
              (df['pickup_latitude'] <= 42) & 
              (df['dropoff_latitude'] >= 40) & 
              (df['dropoff_latitude'] <= 42) & 
              (df['passenger_count'] >= 1) & 
              (df['passenger_count'] <= 6)]

In [None]:
# remove outliers from train set
train_df = remove_outliers(train_df)

In [None]:
# remove outliers from validation set
val_df = remove_outliers(val_df)

#### Scaling and One-Hot Encoding

**Exercise**: Try scaling numeric columns to the `(0,1)` range and encoding categorical columns using a one-hot encoder.

We won't do this because we'll be training tree-based models which are generally able to do a good job even without the above.

In [None]:
train_df.info()

### Save Intermediate DataFrames

Let's save the processed datasets in the Apache Parquet format, so that we can load them back easily to resume our work from this point.

You may also want to create different notebooks for EDA, feature engineering and model training.


In [None]:
# save train data to compressed parquet format
train_df.to_parquet('train.parquet')

In [None]:
# save validation data to compressed parquet format
val_df.to_parquet('val.parquet')

In [None]:
# save test data to compressed parquet format
test_df.to_parquet('test.parquet')

## 7. Train & Evaluate Different Models

Train each of the following & submit predictions to Kaggle:

- Ridge Regression
- Random Forests
- Gradient Boosting

Exercise: Train Ridge, SVM, KNN, Decision Tree models

### Split Inputs & Targets

In [None]:
# get list of columns
train_df.columns

In [None]:
# define input columns
input_cols = ['pickup_longitude', 'pickup_latitude',
       'dropoff_longitude', 'dropoff_latitude', 'passenger_count',
       'pickup_datetime_year', 'pickup_datetime_month', 'pickup_datetime_day',
       'pickup_datetime_weekday', 'pickup_datetime_hour', 'trip_distance',
       'jfk_drop_distance', 'lga_drop_distance', 'ewr_drop_distance',
       'met_drop_distance', 'wtc_drop_distance']

In [None]:
# define target column
target_col = 'fare_amount'

In [None]:
# get train inputs and targets data
train_inputs = train_df[input_cols]
train_targets = train_df[target_col]

In [None]:
# get validation inputs and targets data
val_inputs = val_df[input_cols]
val_targets = val_df[target_col]

In [None]:
# get test inputs data
test_inputs = test_df[input_cols]

Define a helper function to evaluate models and generate predictions

In [None]:
# define function to evaluate models (returns RMSE and predictions for train & validation sets)
def evaluate(model):
    train_preds = model.predict(train_inputs)
    train_rmse = np.sqrt(mean_squared_error(train_targets, train_preds))
    val_preds = model.predict(val_inputs)
    val_rmse = np.sqrt(mean_squared_error(val_targets, val_preds))
    return train_rmse, val_rmse, train_preds, val_preds

In [None]:
# define function to generate predictions and write a submission file
def predict_and_submit(model, fname):
    test_preds = model.predict(test_inputs)
    sub_df = pd.read_csv(data_dir+'/sample_submission.csv')
    sub_df['fare_amount'] = test_preds
    sub_df.to_csv(fname, index=False)
    return sub_df

### Ridge Regression

https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Ridge.html

In [None]:
# Import ridge regression package from sklearn
from sklearn.linear_model import Ridge

In [None]:
# Instantiate the model
model1 = Ridge(random_state=42, alpha=0.9)

In [None]:
%%time
# Fit the model with training set
model1.fit(train_inputs, train_targets)

In [None]:
# evaluate the model
evaluate(model1)

In [None]:
# predict on test set and generate submission file
predict_and_submit(model1, 'ridge_submission.csv')

### Random Forest

https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html

In [None]:
# Import random forest regression package from sklearn
from sklearn.ensemble import RandomForestRegressor

In [None]:
# Instantiate the model
model2 = RandomForestRegressor(max_depth=10, n_jobs=-1, random_state=42, n_estimators=50)

In [None]:
%%time
# Fit the model with training set
model2.fit(train_inputs, train_targets)

In [None]:
# evaluate the model
evaluate(model2)

In [None]:
# predict on test set and generate submission file
predict_and_submit(model2, 'rf_submission.csv')

### Gradient Boosting

https://xgboost.readthedocs.io/en/latest/python/python_api.html#module-xgboost.sklearn

In [None]:
# Import gradient boosting regressor from xgboost
from xgboost import XGBRegressor

In [None]:
# Instantiate the model
model3 = XGBRegressor(random_state=42, n_jobs=-1, objective='reg:squarederror')

In [None]:
%%time
# Fit the model with training set
model3.fit(train_inputs, train_targets)

In [None]:
# evaluate the model
evaluate(model3)

In [None]:
# predict on test set and generate submission file
predict_and_submit(model3, 'xgb_submission.csv')

## 8. Tune Hyperparameters

https://towardsdatascience.com/mastering-xgboost-2eb6bce6bc76


We'll tune the hyperparameters of the XGBoost model. Here's a strategy for tuning hyperparameters:

- Tune the most important/impactful hyperparameter first, e.g. `n_estimators`

- With the best value of the first hyperparameter, tune the next most impactful hyperparameter

- And so on, keep tuning the next most impactful parameters with the best values for previous parameters...

- Then, go back to the top and further tune each parameter again for further marginal gains

- Hyperparameter tuning is more art than science, unfortunately. Check how the parameters interact with each other.
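
The strategy above can be sketched as a simple loop. This is a hedged illustration, not the notebook's method: `tune_sequentially` is a hypothetical helper, and `toy_validation_error` is a made-up stand-in for "train a model and return its validation RMSE" so that the example runs instantly.

```python
def tune_sequentially(score_fn, search_space, base_params):
    """Tune one hyperparameter at a time, keeping the best value found so far.

    score_fn: callable taking keyword params, returns a validation error (lower is better)
    search_space: dict of parameter name -> candidate values, most impactful first
    """
    best = dict(base_params)
    for name, candidates in search_space.items():
        # Try each candidate with all previously tuned parameters held fixed
        scores = {value: score_fn(**{**best, name: value}) for value in candidates}
        best[name] = min(scores, key=scores.get)
    return best

# Toy stand-in for training a model and measuring validation RMSE (purely illustrative)
def toy_validation_error(n_estimators, max_depth, learning_rate):
    return (n_estimators - 250) ** 2 / 1e4 + (max_depth - 5) ** 2 + (learning_rate - 0.1) ** 2

search_space = {
    'n_estimators': [100, 250, 500],
    'max_depth': [3, 4, 5, 7],
    'learning_rate': [0.05, 0.1, 0.25],
}
base = {'n_estimators': 100, 'max_depth': 3, 'learning_rate': 0.05}

best_found = tune_sequentially(toy_validation_error, search_space, base)
print(best_found)
```

Because real hyperparameters interact, one pass is rarely enough; rerunning the loop starting from `best_found` corresponds to the "go back to the top" step.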

Let's define a helper function for trying different hyperparameters.

In [None]:
# import matplotlib package
import matplotlib.pyplot as plt

# function to calculate training and validation error (root mean squared error)
def test_params(ModelClass, **params):
    """Trains a model with the given parameters and returns training & validation RMSE"""
    model = ModelClass(**params).fit(train_inputs, train_targets)
    train_rmse = np.sqrt(mean_squared_error(train_targets, model.predict(train_inputs)))
    val_rmse = np.sqrt(mean_squared_error(val_targets, model.predict(val_inputs)))
    return train_rmse, val_rmse

# function to plot training and validation error for each parameter
def test_param_and_plot(ModelClass, param_name, param_values, **other_params):
    """Trains multiple models by varying the value of param_name according to param_values"""
    train_errors, val_errors = [], [] 
    for value in param_values:
        params = dict(other_params)
        params[param_name] = value
        train_rmse, val_rmse = test_params(ModelClass, **params)
        train_errors.append(train_rmse)
        val_errors.append(val_rmse)
    
    plt.figure(figsize=(10,6))
    plt.title('Overfitting curve: ' + param_name)
    plt.plot(param_values, train_errors, 'b-o')
    plt.plot(param_values, val_errors, 'r-o')
    plt.xlabel(param_name)
    plt.ylabel('RMSE')
    plt.legend(['Training', 'Validation'])

In [None]:
# set the starting parameters (updated below as each hyperparameter is tuned)
best_params = {
    'random_state': 42,
    'n_jobs': -1,
    'objective': 'reg:squarederror',
    'learning_rate': 0.05
}

### No. of Trees

In [None]:
%%time
# train using different n_estimators and plot the training and validation error
test_param_and_plot(XGBRegressor, 'n_estimators', [100, 250, 500], **best_params)

Seems like 500 estimators has the lowest validation loss. However, it also takes a long time. Let's take 250 for now.

In [None]:
# set n_estimators to best value
best_params['n_estimators'] = 250

### Max Depth

In [None]:
%%time
# train using different max_depth and plot the training and validation error
test_param_and_plot(XGBRegressor, 'max_depth', [3, 4, 5, 7], **best_params)

Looks like a max depth of 5 is ideal.

In [None]:
# set max_depth to best value
best_params['max_depth'] = 5

### Learning Rate

In [None]:
%%time
# train using different learning rates and plot the training and validation error
test_param_and_plot(XGBRegressor, 'learning_rate', [0.05, 0.1, 0.25], **best_params)

Seems like the best learning rate is 0.25.

In [None]:
# set learning_rate to best value
best_params['learning_rate'] = 0.25

### Other Parameters

Similarly we can experiment with other parameters. 

Here's a set of parameters that works well:

In [None]:
# create final xgboost model object
xgb_model_final = XGBRegressor(objective='reg:squarederror', 
                               n_jobs=-1, 
                               random_state=42,
                               n_estimators=500, 
                               max_depth=8, 
                               learning_rate=0.08, 
                               subsample=0.8, 
                               colsample_bytree=0.8)

In [None]:
%%time
# fit the model on training set
xgb_model_final.fit(train_inputs, train_targets)

In [None]:
# evaluate the model
evaluate(xgb_model_final)

In [None]:
# predict on test set and generate submission file
predict_and_submit(xgb_model_final, 'xgb_tuned_submission.csv')

Achieved 460th position out of 1483, i.e. roughly the top 30%. Keep in mind that:

- We are using just 1% of the training data
- We are only using a single model (most top submissions use ensembles)
- Our best model takes just 10 minutes to train (as opposed to hours/days)
- We haven't fully optimized the hyperparameters yet

Let's save this trained model so that it can be reloaded later without retraining. Follow this guide: https://scikit-learn.org/stable/modules/model_persistence.html
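
A minimal sketch of one approach, using `joblib` to save and reload a fitted estimator. A small `Ridge` model on made-up data stands in for the tuned model here, and the file name `model.joblib` is arbitrary:

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import Ridge

# Small stand-in model trained on made-up data
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])
model = Ridge(alpha=0.9).fit(X, y)

# Save the fitted model to disk and load it back
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, 'model.joblib')
    joblib.dump(model, model_path)
    restored = joblib.load(model_path)

# The restored model should make identical predictions
same_predictions = np.allclose(model.predict(X), restored.predict(X))
print(same_predictions)
```

Files saved this way are generally only safe to load with the same library versions that wrote them. XGBoost models also offer their own `save_model` method, which may be the better choice for long-term storage.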

**Tasks**: 

1. Tune hyperparameters for Linear Regression & random forests.
2. Repeat with 3%, 10%, 30% and 100% of the training set. How much reduction in error does 100x more data produce?
3. Ensemble (average) the results from multiple models and observe if they're better than individual models.
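
For the third task, averaging can be sketched in a few lines. The numbers below are made up and deliberately chosen so that the two models' errors partly cancel; with real models the gain depends on how different their mistakes are, and there may be none.

```python
import numpy as np

def rmse(targets, preds):
    """Root mean squared error between two arrays."""
    return float(np.sqrt(np.mean((np.asarray(targets) - np.asarray(preds)) ** 2)))

# Made-up targets and predictions from two hypothetical models
targets = np.array([10.0, 20.0, 30.0])
preds_a = np.array([12.0, 18.0, 33.0])
preds_b = np.array([8.0, 23.0, 28.0])

# Simple ensemble: average the predictions of both models
ensemble_preds = (preds_a + preds_b) / 2

print('model A RMSE:', rmse(targets, preds_a))
print('model B RMSE:', rmse(targets, preds_b))
print('ensemble RMSE:', rmse(targets, ensemble_preds))
```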

### Save Outputs to Google Drive (Optional)

We can save all the output files we've created to Google Drive, so that we can reuse them later if required.

Follow this guide: https://colab.research.google.com/notebooks/io.ipynb

## 9. Train on GPU with entire dataset (Optional)

Steps:
- Install `dask`, `cudf` and `cuml`
- Load the dataset to GPU
- Create training and validation set
- Perform feature engineering
- Train XGBoost `cuml` model
- Make predictions & submit

Follow these guides and fill out the empty cells below:
- https://towardsdatascience.com/nyc-taxi-fare-prediction-605159aa9c24
- https://jovian.ai/allenkong221/nyc-taxi-fare-rapids-dask-gpu/v/1?utm_source=embed#C10

### Install `dask`, `cudf` and `cuml`

### Load the data

### Create training & validation set

### Perform feature engineering

### Train XGBoost model on GPU

### Make Predictions & Submit

## 10. Document & Publish Your Work

- Add explanations using Markdown
- Clean up the code & create functions
- Publish notebook to Jovian
- Write a blog post and embed

Follow this guide: https://www.youtube.com/watch?v=NK6UYg3-Bxs 

## References

* Dataset: https://www.kaggle.com/c/new-york-city-taxi-fare-prediction/overview
* Missing semester (Shell scripting): https://missing.csail.mit.edu/
* Opendatasets library: https://github.com/JovianML/opendatasets 
* EDA project from scratch: https://www.youtube.com/watch?v=kLDTbavcmd0
* GeoPy: https://geopy.readthedocs.io/en/stable/#module-geopy.distance 
* Blog post by Allen Kong: https://towardsdatascience.com/nyc-taxi-fare-prediction-605159aa9c24 
* Machine Learning with Python: Zero to GBMs - https://zerotogbms.com 
* Experiment tracking spreadsheet: https://bit.ly/mltrackingsheet 
* Pandas datetime components: https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#time-date-components 
* Haversine distance: https://en.wikipedia.org/wiki/Haversine_formula 
* Haversine distance with Numpy: https://stackoverflow.com/questions/29545704/fast-haversine-approximation-python-pandas 
* RAPIDS (parent project for cudf and cuml): https://rapids.ai/
* Data Science blog post from scratch: https://www.youtube.com/watch?v=NK6UYg3-Bxs 
* Examples of Machine Learning Projects:
    * Walmart Store Sales: https://jovian.ai/anushree-k/final-walmart-simple-rf-gbm
    * Used Car Price Prediction: https://jovian.ai/kara-mounir/used-cars-prices 
    * Lithology Prediction: https://jovian.ai/ramysaleem/ml-project-machine-predicting-lithologies
    * Ad Demand Prediction: https://jovian.ai/deepa-sarojam/online-ad-demand-prediction-ml-prj 



In [None]:
print("--- %s seconds ---" % (time.time() - start_time))