# Uber Fares Dataset - Fine-Tuning Our Models
In this fifth notebook, we fine-tune the three best models from the previous notebook. Tuning gives us the best set of hyperparameters for each model, after which we are ready to evaluate the models on the test set.

## Imports 

In [25]:
# basic libraries
import pandas as pd 
import numpy as np 
import matplotlib.pyplot as plt 

# scikit-learn libraries 
from sklearn.model_selection import RandomizedSearchCV # randomized search CV 
from sklearn.ensemble import RandomForestRegressor # random forest regression 
from sklearn.metrics import mean_squared_error # mean squared error is the metric to be used 

# xgboost and lightgbm 
import xgboost as xgb 
import lightgbm as lgb

import warnings
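# treat FutureWarning as an error so that deprecated usage surfaces immediately
# (use action='ignore' instead to silence these warnings)
warnings.simplefilter(action='error', category=FutureWarning)

# joblib to save and load models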
import joblib
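
The `warnings.simplefilter` call in the imports cell decides what happens when a library emits a `FutureWarning`. As a minimal standard-library sketch (the function `emit_future_warning` is a hypothetical stand-in for a library call), `action='error'` turns the warning into an exception, while `action='ignore'` silences it:

```python
import warnings


def emit_future_warning():
    # hypothetical stand-in for a library call that uses a deprecated feature
    warnings.warn("this behaviour will change", FutureWarning)
    return "finished"


# action='error': the warning is raised as an exception and execution stops
with warnings.catch_warnings():
    warnings.simplefilter(action="error", category=FutureWarning)
    try:
        outcome_error = emit_future_warning()
    except FutureWarning:
        outcome_error = "raised"

# action='ignore': the warning is silenced and execution continues
with warnings.catch_warnings():
    warnings.simplefilter(action="ignore", category=FutureWarning)
    outcome_ignore = emit_future_warning()

print(outcome_error, outcome_ignore)
```

With the `'error'` setting, any cell that triggers a `FutureWarning` stops with a traceback, which is strict but makes deprecated calls easy to spot.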

## Loading the Data and Models

In [26]:
root_path = '../../uber-fares-prediction/data/processed/'

# prepared training set 
X_train_prepared = (
    pd.read_csv(
        root_path + 'uber_prepared_train_set.csv'
    )
)

# prepared validation set 
X_test_prepared = (
    pd.read_csv(
        root_path + 'uber_prepared_validation_set.csv'
    )
)

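# target passed to fit() together with X_train_prepared
# NOTE: the file name says "validation"; the number of rows must match X_train_prepared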
y_train = (
    pd.read_csv(
        root_path + 'uber_validation_target.csv'
    )
)

In [27]:
# flattening the one-column target into a 1-D array
y_train = np.ravel(y_train)
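
`pd.read_csv` returns a two-dimensional table even when the file has a single column, whereas scikit-learn regressors expect a one-dimensional target for single-output regression. The sketch below uses small made-up arrays (`X_demo` and `y_column` are hypothetical stand-ins for the loaded data) to show what `np.ravel` changes, together with a row-count check that catches a mismatched features/target pair early:

```python
import numpy as np

# hypothetical stand-ins for the loaded data: 4 rows, 2 features, 1 target column
X_demo = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]])
y_column = np.array([[7.5], [12.0], [5.3], [9.9]])  # shape (4, 1), like a one-column table

y_flat = np.ravel(y_column)  # shape (4,)

# the features and the target must describe the same rows
assert X_demo.shape[0] == y_flat.shape[0], "X and y have different numbers of rows"

print(y_column.shape, y_flat.shape)
```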

In [28]:
# Random Forest Regression 
random_forest_reg = joblib.load('../models/interim/random_forest_regression.pkl')

In [29]:
# XGBoost Regression 
xgboost_reg = joblib.load('../models/interim/xgboost_regression.pkl')

In [30]:
# LGBM Regression 
lgbm_reg = joblib.load('../models/interim/lgbm_regression.pkl')

## Randomized Search Parameters 

### Random Forest Regression

Let's first search for the best set of hyperparameters for the Random Forest model using a Randomized Search CV. We define a large set of candidate values for each hyperparameter, and we expect the best candidate to improve on the metric of the earlier vanilla model.

#### Defining a large hyperparameter grid for Random Forest

In [46]:
# defining the parameter grid for Randomized Search CV
param_grid = {
    'n_estimators': list(np.random.randint(100, 300, 50)),  # Number of trees in the forest (must be an integer, so None is not a valid candidate)
    'max_features': [1.0, 'sqrt', 'log2'],  # Number of features to consider at every split
    'max_depth': [None] + list(np.random.randint(5, 25, 5)),  # Maximum depth of the tree
    'min_samples_split': np.random.randint(2, 11, 10),  # Minimum number of samples required to split an internal node
    'min_samples_leaf': np.random.randint(1, 11, 10),  # Minimum number of samples required to be at a leaf node
    'bootstrap': [True, False]  # Whether each tree is trained on a bootstrap sample (True) or on the whole training set (False)
}
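
To make the idea of a randomized search concrete, here is a minimal pure-Python sketch of the sampling step: instead of trying every combination in the grid, only `n_iter` randomly drawn combinations are evaluated. This is a conceptual illustration, not scikit-learn's implementation (for example, it draws with replacement for simplicity); `demo_grid` and `sample_candidates` are hypothetical names.

```python
import random

# small hypothetical grid in the same list-of-values style as param_grid above
demo_grid = {
    "n_estimators": [100, 150, 200, 250],
    "max_depth": [None, 5, 10, 15],
    "bootstrap": [True, False],
}


def sample_candidates(grid, n_iter, seed):
    """Draw n_iter random parameter combinations, one value per key."""
    rng = random.Random(seed)
    return [
        {name: rng.choice(values) for name, values in grid.items()}
        for _ in range(n_iter)
    ]


candidates = sample_candidates(demo_grid, n_iter=5, seed=42)
total_combinations = 4 * 4 * 2  # an exhaustive grid search would try all 32

for candidate in candidates:
    print(candidate)
print(f"evaluated {len(candidates)} of {total_combinations} combinations")
```

Because a value is picked uniformly from each list, a value that appears several times in a list (as can happen with `np.random.randint`) is proportionally more likely to be drawn in this sketch.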

In [47]:
# performing randomized search cv 
random_search_random_forest_reg = RandomizedSearchCV(
    random_forest_reg,
    param_distributions=param_grid,
    n_iter=60,
    scoring='neg_mean_squared_error',
    cv=5,
    n_jobs=-1,
    random_state=42
)
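
The cost of the search follows directly from `n_iter` and `cv`: each sampled candidate is fitted once per fold. A short worked calculation with the values used above (the single extra fit assumes the default `refit=True`, which retrains the best candidate on the full training set):

```python
n_iter = 60  # sampled parameter combinations, as in the search above
cv = 5       # cross-validation folds, as in the search above

cv_fits = n_iter * cv      # one fit per candidate per fold
total_fits = cv_fits + 1   # plus one final refit of the best candidate (refit=True)

print(cv_fits, total_fits)
```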

In [48]:
# fitting the search on our training data 
random_search_random_forest_reg.fit(X_train_prepared, y_train)



Now, let us display the best results in a dataframe: 

In [49]:
# displaying the results in a dataframe
results_random_forest_df = (
    pd.DataFrame(
        random_search_random_forest_reg.cv_results_
    )[['params', 'mean_test_score', 'std_test_score']]
)
results_random_forest_df['rmse'] = (
    np.sqrt(-results_random_forest_df['mean_test_score'])
)
results_random_forest_df = (
    results_random_forest_df
    .sort_values(
        by='rmse', 
        ascending=True
    )
    .reset_index(drop=True)
)
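
Two details of the `rmse` column are worth spelling out. The scorer `neg_mean_squared_error` reports the negated MSE (so that larger is better), which is why the sign is flipped before taking the square root. Also, `mean_test_score` is the mean MSE across folds, so its square root is not the same as the average of the per-fold RMSEs, and `std_test_score` remains in MSE units. A small sketch with made-up fold values (`fold_mse` is hypothetical, not taken from the search above):

```python
import math

# hypothetical per-fold MSE values for one candidate
fold_mse = [16.0, 25.0, 36.0]

mean_mse = sum(fold_mse) / len(fold_mse)   # what -mean_test_score holds
rmse_from_mean_mse = math.sqrt(mean_mse)   # how the 'rmse' column is computed
mean_of_fold_rmse = sum(math.sqrt(m) for m in fold_mse) / len(fold_mse)

print(round(rmse_from_mean_mse, 4), round(mean_of_fold_rmse, 4))
```

The first number is never smaller than the second, so the `rmse` column is a slightly pessimistic summary compared with averaging fold-level RMSEs. The ranking of candidates by `rmse` is still identical to the ranking by `mean_test_score`.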

In [50]:
# showing the ten best results 
results_random_forest_df.head(10)

| | params | mean_test_score | std_test_score | rmse |
|---|---|---|---|---|
| 0 | {'n_estimators': 246, 'min_samples_split': 5, ... | -17.761222 | 4.150904 | 4.214406 |
| 1 | {'n_estimators': 254, 'min_samples_split': 2, ... | -17.970269 | 4.171547 | 4.239135 |
| 2 | {'n_estimators': 191, 'min_samples_split': 9, ... | -17.990321 | 4.217163 | 4.2415 |
| 3 | {'n_estimators': 191, 'min_samples_split': 4, ... | -18.005979 | 4.142482 | 4.243345 |
| 4 | {'n_estimators': 191, 'min_samples_split': 5, ... | -18.079612 | 4.232067 | 4.252013 |
| 5 | {'n_estimators': 136, 'min_samples_split': 2, ... | -18.165463 | 4.179599 | 4.262096 |
| 6 | {'n_estimators': 264, 'min_samples_split': 3, ... | -18.233276 | 4.173229 | 4.270044 |
| 7 | {'n_estimators': 129, 'min_samples_split': 5, ... | -18.284021 | 4.188776 | 4.275982 |
| 8 | {'n_estimators': 246, 'min_samples_split': 4, ... | -18.333299 | 4.190251 | 4.28174 |
| 9 | {'n_estimators': 242, 'min_samples_split': 6, ... | -18.426356 | 4.159153 | 4.292593 |


Let's see which sets of parameters were selected for the five best results:

In [51]:
# creating a dictionary with the five best results 
dict_best_results = {}
key_list = ['first', 'second', 'third', 'fourth', 'fifth']
for i, key in zip(range(5), key_list):
    dict_best_results[key] = results_random_forest_df['params'][i]

In [52]:
dict_best_results

{'first': {'n_estimators': 246,
  'min_samples_split': 5,
  'min_samples_leaf': 3,
  'max_features': 'log2',
  'max_depth': None,
  'bootstrap': False},
 'second': {'n_estimators': 254,
  'min_samples_split': 2,
  'min_samples_leaf': 3,
  'max_features': 'log2',
  'max_depth': None,
  'bootstrap': True},
 'third': {'n_estimators': 191,
  'min_samples_split': 9,
  'min_samples_leaf': 1,
  'max_features': 'log2',
  'max_depth': 16,
  'bootstrap': False},
 'fourth': {'n_estimators': 191,
  'min_samples_split': 4,
  'min_samples_leaf': 2,
  'max_features': 'sqrt',
  'max_depth': 16,
  'bootstrap': False},
 'fifth': {'n_estimators': 191,
  'min_samples_split': 5,
  'min_samples_leaf': 1,
  'max_features': 'log2',
  'max_depth': 15,
  'bootstrap': False}}
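
Because the `params` column is truncated when the results dataframe is displayed, it can help to expand each dictionary into one column per hyperparameter. A minimal sketch, assuming a dataframe with the same `params` and `rmse` layout as above (the values in `demo_results` are made up for illustration):

```python
import pandas as pd

# hypothetical results in the same layout as the results dataframe above
demo_results = pd.DataFrame({
    "params": [
        {"n_estimators": 100, "max_depth": 12},
        {"n_estimators": 200, "max_depth": 16},
        {"n_estimators": 150, "max_depth": 15},
    ],
    "rmse": [4.0, 4.5, 5.0],
})

# one column per hyperparameter, aligned row by row with the rmse column
params_table = pd.DataFrame(demo_results["params"].tolist())
top_candidates = pd.concat([params_table, demo_results["rmse"]], axis=1).head(2)

print(top_candidates)
```

This gives the same information as the dictionary of best results, but in a table that is easier to scan across candidates.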

Let us save the best model, that is, the Random Forest Regression model trained with the best set of parameters found by the search.

In [53]:
# getting the best set of parameters 
best_params_random_forest = random_search_random_forest_reg.best_params_

In [54]:
# initializing a random forest regression model using this set of parameters 
best_model_random_forest = RandomForestRegressor(**best_params_random_forest)

In [55]:
# training the best model on the training data 
best_model_random_forest.fit(X_train_prepared, y_train)
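
As an alternative to rebuilding and retraining the model by hand, a fitted `RandomizedSearchCV` already exposes the winning model through `best_estimator_`: with the default `refit=True`, it is a clone of the searched estimator, configured with `best_params_` and refit on all the data passed to `fit`. The sketch below shows this on a small synthetic data set (`X_demo`, `y_demo`, and the tiny grid are assumptions for illustration only):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# small synthetic data set, used only for illustration
X_demo, y_demo = make_regression(n_samples=80, n_features=4, noise=0.5, random_state=0)

demo_search = RandomizedSearchCV(
    RandomForestRegressor(random_state=0),
    param_distributions={"n_estimators": [5, 10], "max_depth": [2, 3, None]},
    n_iter=3,
    scoring="neg_mean_squared_error",
    cv=3,
    random_state=42,
)
demo_search.fit(X_demo, y_demo)

# already refit on all of X_demo with the best parameters
refit_model = demo_search.best_estimator_
predictions = refit_model.predict(X_demo[:5])

print(demo_search.best_params_)
print(predictions.shape)
```

One practical difference: `best_estimator_` keeps every other setting of the searched estimator (such as `random_state`), whereas a model created only from `best_params_` falls back to the library defaults for everything not in the grid.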

In [56]:
# saving the best model 
best_model_random_forest_path = '../models/final/best_model_random_forest.pkl'
joblib.dump(best_model_random_forest, best_model_random_forest_path, compress=9)

['../models/final/best_model_random_forest.pkl']
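
A quick way to confirm that a saved model is usable is a round trip: dump it, load it back, and compare predictions. The sketch below does this in a temporary directory with a small synthetic model (`demo_model` and the file name are hypothetical). `compress=9` is joblib's strongest compression level, trading slower saving and loading for a smaller file.

```python
import os
import tempfile

import joblib
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# small synthetic model, used only for illustration
X_demo, y_demo = make_regression(n_samples=50, n_features=3, random_state=0)
demo_model = RandomForestRegressor(n_estimators=5, random_state=0).fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo_model.pkl")  # hypothetical file name
    written_files = joblib.dump(demo_model, demo_path, compress=9)
    reloaded_model = joblib.load(demo_path)

# the reloaded model should reproduce the original predictions exactly
same_predictions = bool(
    (demo_model.predict(X_demo) == reloaded_model.predict(X_demo)).all()
)
print(len(written_files), same_predictions)
```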

### XGBoost Regression 

Now we will repeat the process for the XGBoost Regression model, i.e., we will define a large set of hyperparameters and use the Randomized Search CV to find the best set. 

#### Defining a large hyperparameter grid for XGBoost

In [33]:
# defining the parameter grid for XGBoost Regression
param_grid = {
    'learning_rate': [None] + list(np.linspace(0.01, 0.3, 10)),
    'n_estimators': np.random.randint(100, 300, 50),
    'max_depth': [None] + list(np.random.randint(5, 25, 5)),
    'subsample': [None] + list(np.linspace(0.5, 1.0, 6)),
    'colsample_bytree': [None] + list(np.linspace(0.5, 1.0, 6)),
}
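
Note that `random_state=42` in `RandomizedSearchCV` only fixes which combinations are drawn from the grid. The grid itself is built with unseeded `np.random.randint` calls, so the candidate values differ from one run of the notebook to the next. A minimal sketch of a reproducible alternative, assuming NumPy's `default_rng` generator (`build_demo_grid` is a hypothetical helper):

```python
import numpy as np


def build_demo_grid(seed):
    """Build candidate lists from a seeded generator so reruns give the same grid."""
    rng = np.random.default_rng(seed)
    return {
        # integers() excludes the upper bound, like np.random.randint
        "n_estimators": rng.integers(100, 300, size=50).tolist(),
        "max_depth": [None] + rng.integers(5, 25, size=5).tolist(),
    }


grid_a = build_demo_grid(seed=42)
grid_b = build_demo_grid(seed=42)

print(grid_a == grid_b)
```

Calling `.tolist()` also converts the NumPy integers to plain Python integers, which keeps the printed best parameters easy to read.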

In [34]:
# performing randomized search cv for xgboost
random_search_xgboost_reg = RandomizedSearchCV(
    xgboost_reg,
    param_distributions=param_grid,
    n_iter=60,
    scoring='neg_mean_squared_error',
    cv=5,
    n_jobs=-1,
    random_state=42
)

In [35]:
# fitting the search on our training data 
random_search_xgboost_reg.fit(X_train_prepared, y_train)



Now, let us display the best results in a dataframe: 

In [36]:
# displaying the results in a dataframe
results_xgboost_reg_df = (
    pd.DataFrame(
        random_search_xgboost_reg.cv_results_
    )[['params', 'mean_test_score', 'std_test_score']]
)
results_xgboost_reg_df['rmse'] = np.sqrt(-results_xgboost_reg_df['mean_test_score'])
results_xgboost_reg_df = (
    results_xgboost_reg_df
    .sort_values(by='rmse', ascending=True)
    .reset_index(drop=True)
)

In [37]:
# showing the ten best results 
results_xgboost_reg_df.head(10)

| | params | mean_test_score | std_test_score | rmse |
|---|---|---|---|---|
| 0 | {'subsample': None, 'n_estimators': 171, 'max_... | -17.587422 | 4.240086 | 4.193736 |
| 1 | {'subsample': 1.0, 'n_estimators': 227, 'max_d... | -17.630441 | 4.284454 | 4.198862 |
| 2 | {'subsample': 0.8, 'n_estimators': 208, 'max_d... | -17.910315 | 4.232913 | 4.232058 |
| 3 | {'subsample': 0.9, 'n_estimators': 127, 'max_d... | -17.916388 | 4.031413 | 4.232775 |
| 4 | {'subsample': 0.8, 'n_estimators': 239, 'max_d... | -17.924847 | 4.137898 | 4.233775 |
| 5 | {'subsample': None, 'n_estimators': 145, 'max_... | -17.925871 | 3.959918 | 4.233896 |
| 6 | {'subsample': None, 'n_estimators': 161, 'max_... | -17.967541 | 4.275499 | 4.238814 |
| 7 | {'subsample': 0.8, 'n_estimators': 132, 'max_d... | -18.030068 | 4.191282 | 4.246183 |
| 8 | {'subsample': 0.8, 'n_estimators': 266, 'max_d... | -18.03227 | 4.002022 | 4.246442 |
| 9 | {'subsample': 0.8, 'n_estimators': 220, 'max_d... | -18.11405 | 4.268649 | 4.25606 |


Let's see which sets of parameters were selected for the five best results:

In [38]:
# creating a dictionary with the five best results 
dict_best_results = {}
key_list = ['first', 'second', 'third', 'fourth', 'fifth']
for i, key in zip(range(5), key_list):
    dict_best_results[key] = results_xgboost_reg_df['params'][i]

In [39]:
dict_best_results

{'first': {'subsample': None,
  'n_estimators': 171,
  'max_depth': None,
  'learning_rate': 0.10666666666666666,
  'colsample_bytree': 0.9},
 'second': {'subsample': 1.0,
  'n_estimators': 227,
  'max_depth': None,
  'learning_rate': 0.10666666666666666,
  'colsample_bytree': 0.6},
 'third': {'subsample': 0.8,
  'n_estimators': 208,
  'max_depth': None,
  'learning_rate': 0.07444444444444444,
  'colsample_bytree': 0.9},
 'fourth': {'subsample': 0.9,
  'n_estimators': 127,
  'max_depth': 8,
  'learning_rate': 0.042222222222222223,
  'colsample_bytree': None},
 'fifth': {'subsample': 0.8,
  'n_estimators': 239,
  'max_depth': None,
  'learning_rate': 0.1388888888888889,
  'colsample_bytree': 0.7}}

Let us save the best model, that is, the XGBoost Regression model trained with the best set of parameters found by the search.

In [40]:
# getting the best set of parameters 
best_params_xgboost_reg = random_search_xgboost_reg.best_params_

In [41]:
# initializing an XGBoost regression model using this set of parameters 
best_model_xgboost_reg = xgb.XGBRegressor(**best_params_xgboost_reg)

In [42]:
# training the best model on the training data 
best_model_xgboost_reg.fit(X_train_prepared, y_train)

In [43]:
# saving the best model 
best_model_xgboost_reg_path = '../models/final/best_model_xgboost_reg.pkl'
joblib.dump(best_model_xgboost_reg, best_model_xgboost_reg_path)

['../models/final/best_model_xgboost_reg.pkl']

### LightGBM Regression

Finally, we will repeat the process for the LGBM model. Again, we define a large set of hyperparameters and run a Randomized Search CV to find the best set of parameters. 

#### Defining a large hyperparameter grid for LGBM

In [17]:
# defining the parameter grid for LGBM Regression
param_grid = {
    'n_estimators': np.random.randint(100, 300, 50), 
    'learning_rate': [None] + list(np.linspace(0.01, 0.3, 10)),
    'max_depth': [None] + list(np.random.randint(3, 10, 7)), 
    'subsample': np.linspace(0.5, 1.0, 10),  # only takes effect when subsample_freq > 0 (the LightGBM default is 0)
    'colsample_bytree': np.linspace(0.5, 1.0, 10), 
    'reg_alpha': np.logspace(-3, 3, 10),  # Logarithmic space between 0.001 and 1000
    'reg_lambda': np.logspace(-3, 3, 10)  # Logarithmic space between 0.001 and 1000
}
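
The regularization candidates come from `np.logspace(-3, 3, 10)`, which spaces the values evenly in the exponent rather than in the value itself. The short sketch below prints the ten candidates and the constant factor between neighbours, which makes it easier to recognise values such as `46.41588833612773` when they show up in the search results:

```python
import numpy as np

reg_values = np.logspace(-3, 3, 10)  # same call as for reg_alpha / reg_lambda above

# consecutive values differ by a constant factor of 10 ** (6 / 9)
ratios = reg_values[1:] / reg_values[:-1]

print(np.round(reg_values, 4))
print(np.round(ratios, 4))
```

A logarithmic grid is a common choice for regularization strengths because their effect tends to change with the order of magnitude rather than with small absolute differences.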

In [18]:
# performing randomized search cv for lgbm
random_search_lgbm_reg = RandomizedSearchCV(
    lgbm_reg,
    param_distributions=param_grid,
    n_iter=60,
    scoring='neg_mean_squared_error',
    cv=5,
    n_jobs=-1,
    random_state=42,
    verbose=False
)

In [19]:
# fitting the search on our training data 
random_search_lgbm_reg.fit(X_train_prepared, y_train)

[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.020677 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 1356

Now, let us display the best results in a dataframe: 

In [21]:
# displaying the results in a dataframe
results_lgbm_reg_df = (
    pd.DataFrame(
        random_search_lgbm_reg.cv_results_
    )[['params', 'mean_test_score', 'std_test_score']]
)
results_lgbm_reg_df['rmse'] = np.sqrt(-results_lgbm_reg_df['mean_test_score'])
results_lgbm_reg_df = (
    results_lgbm_reg_df
    .sort_values(by='rmse', ascending=True)
    .reset_index(drop=True)
)

In [57]:
# showing the ten best results 
results_lgbm_reg_df.head(10)

| | params | mean_test_score | std_test_score | rmse |
|---|---|---|---|---|
| 0 | {'subsample': 0.9444444444444444, 'reg_lambda'... | -18.340254 | 4.274488 | 4.282552 |
| 1 | {'subsample': 0.7222222222222222, 'reg_lambda'... | -18.344026 | 4.201073 | 4.282993 |
| 2 | {'subsample': 1.0, 'reg_lambda': 10.0, 'reg_al... | -18.410875 | 4.11737 | 4.29079 |
| 3 | {'subsample': 0.7777777777777778, 'reg_lambda'... | -18.473076 | 4.305979 | 4.298032 |
| 4 | {'subsample': 0.5555555555555556, 'reg_lambda'... | -18.502447 | 4.26286 | 4.301447 |
| 5 | {'subsample': 1.0, 'reg_lambda': 0.02154434690... | -18.529837 | 4.245879 | 4.30463 |
| 6 | {'subsample': 0.8333333333333333, 'reg_lambda'... | -18.577089 | 4.246943 | 4.310115 |
| 7 | {'subsample': 0.9444444444444444, 'reg_lambda'... | -18.583269 | 4.159211 | 4.310832 |
| 8 | {'subsample': 0.6666666666666666, 'reg_lambda'... | -18.618167 | 4.206446 | 4.314877 |
| 9 | {'subsample': 1.0, 'reg_lambda': 2.15443469003... | -18.655719 | 4.330805 | 4.319227 |


Let's see which sets of parameters were selected for the five best results:

In [58]:
# creating a dictionary with the five best results 
dict_best_results = {}
key_list = ['first', 'second', 'third', 'fourth', 'fifth']
for i, key in zip(range(5), key_list):
    dict_best_results[key] = results_lgbm_reg_df['params'][i]

In [59]:
dict_best_results

{'first': {'subsample': 0.9444444444444444,
  'reg_lambda': 10.0,
  'reg_alpha': 46.41588833612773,
  'n_estimators': 255,
  'max_depth': None,
  'learning_rate': 0.1711111111111111,
  'colsample_bytree': 0.8888888888888888},
 'second': {'subsample': 0.7222222222222222,
  'reg_lambda': 2.154434690031882,
  'reg_alpha': 215.44346900318823,
  'n_estimators': 257,
  'max_depth': 6,
  'learning_rate': None,
  'colsample_bytree': 0.6666666666666666},
 'third': {'subsample': 1.0,
  'reg_lambda': 10.0,
  'reg_alpha': 46.41588833612773,
  'n_estimators': 208,
  'max_depth': 5,
  'learning_rate': 0.23555555555555557,
  'colsample_bytree': 0.5},
 'fourth': {'subsample': 0.7777777777777778,
  'reg_lambda': 215.44346900318823,
  'reg_alpha': 215.44346900318823,
  'n_estimators': 287,
  'max_depth': 5,
  'learning_rate': 0.2677777777777778,
  'colsample_bytree': 0.7222222222222222},
 'fifth': {'subsample': 0.5555555555555556,
  'reg_lambda': 0.1,
  'reg_alpha': 46.41588833612773,
  'n_estimators': 

Let us save the best model, that is, the LGBM Regression model trained with the best set of parameters found by the search.

In [62]:
# getting the best set of parameters 
best_params_lgbm_reg = random_search_lgbm_reg.best_params_

In [64]:
# initializing an LGBM regression model using this set of parameters 
best_model_lgbm_reg = lgb.LGBMRegressor(**best_params_lgbm_reg)

In [66]:
# training the best model on the training data 
best_model_lgbm_reg.fit(X_train_prepared, y_train)

[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000829 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 1356
[LightGBM] [Info] Number of data points in the train set: 137340, number of used features: 10
[LightGBM] [Info] Start training from score 11.349932


In [67]:
# saving the best model 
best_model_lgbm_reg_path = '../models/final/best_model_lgbm_reg.pkl'
joblib.dump(best_model_lgbm_reg, best_model_lgbm_reg_path)

['../models/final/best_model_lgbm_reg.pkl']

The next step will be to apply these best models to our test set and explore the final results. 