# Uber Fares Dataset - Selecting and Training Models
In this fourth notebook, we have two aims:
1) Choose metrics to evaluate model performance;
2) Select a set of models and evaluate them on our training data. Once we have the best ones, we will use them to make predictions on the test data.

## Imports 

In [13]:
# basic libraries
import pandas as pd 
import numpy as np 
import matplotlib.pyplot as plt 

# scikit-learn libraries 
from sklearn.model_selection import cross_val_score # cross-validation
from sklearn.model_selection import GridSearchCV # gridsearch CV 
from sklearn.linear_model import LinearRegression # linear regression
from sklearn.neighbors import KNeighborsRegressor # KNN for regression 
from sklearn.tree import DecisionTreeRegressor # basic decision tree regression 
from sklearn.ensemble import RandomForestRegressor # random forest regression 
from sklearn.metrics import mean_squared_error # mean squared error is the metric to be used 

# xgboost and lightgbm 
import xgboost as xgb 
import lightgbm as lgb

# joblib to save models
import joblib

import warnings
# silence FutureWarning messages raised by the libraries
warnings.simplefilter(action='ignore', category=FutureWarning)

## Loading the Data 

In [2]:
root_path = '../../uber-fares-prediction/data/processed/'

# prepared training set 
X_train_prepared = (
    pd.read_csv(root_path + 'uber_prepared_train_set.csv')
)

# prepared validation set 
X_test_prepared = (
    pd.read_csv(root_path + 'uber_prepared_validation_set.csv')
)

# target values paired with X_train_prepared in the cross-validation below
y_train = (
    pd.read_csv(root_path + 'uber_validation_target.csv')
)

In [3]:
# converting into an array
y_train = np.ravel(y_train)

## Training a Lot of Models using Cross-Validation 

Since this is a regression problem, the most common metric is the **Mean Squared Error (MSE)**:
$$\textrm{MSE}(\textbf{X}, h)=\frac{1}{N}\sum_{i=1}^{N}\left(y^{(i)}-h(\textbf{x}^{(i)})\right)^2,$$
where $h(\textbf{x}^{(i)})$ is the prediction of the model $h$ for the example $\textbf{x}^{(i)}$ of our data, and $y^{(i)}$ is the true label for this example. Besides MSE, we will also work with its square root, the **Root Mean Squared Error (RMSE)**:
$$\textrm{RMSE}(\textbf{X}, h)=\sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(y^{(i)}-h(\textbf{x}^{(i)})\right)^2}.$$
Two other important metrics for regression problems are the **Mean Absolute Error (MAE)**:
$$\textrm{MAE}(\textbf{X}, h)=\frac{1}{N}\sum_{i=1}^{N}\left|y^{(i)}-h(\textbf{x}^{(i)})\right|,$$
and the coefficient of determination $R^2$:
$$R^{2}(\textbf{X}, h)=1-\frac{\sum_{i=1}^{N}\left(y^{(i)}-h(\textbf{x}^{(i)})\right)^2}{\sum_{i=1}^{N}\left(y^{(i)}-\bar{y}\right)^2}=1-\frac{\textrm{MSE}(\textbf{X}, h)}{\textrm{MSE}(\textbf{X}, \bar{y})},$$
where $\bar{y}$ is the mean of the true labels, so $\textrm{MSE}(\textbf{X}, \bar{y})$ is the error of a constant model that always predicts that mean.
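
As a quick sanity check of the four definitions above, the following sketch computes each metric by hand with NumPy and compares it with the scikit-learn implementation. The arrays `y_true` and `y_pred` are made-up toy values for illustration only, not data from this project.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# toy labels and predictions, invented purely for illustration
y_true = np.array([3.0, 5.0, 7.5, 10.0])
y_pred = np.array([2.5, 5.0, 8.0, 12.0])

errors = y_true - y_pred

mse = np.mean(errors ** 2)        # mean of squared errors
rmse = np.sqrt(mse)               # square root of MSE
mae = np.mean(np.abs(errors))     # mean of absolute errors
# 1 - (residual sum of squares) / (total sum of squares around the mean)
r2 = 1.0 - np.sum(errors ** 2) / np.sum((y_true - y_true.mean()) ** 2)

# the hand-written versions should agree with scikit-learn
assert np.isclose(mse, mean_squared_error(y_true, y_pred))
assert np.isclose(mae, mean_absolute_error(y_true, y_pred))
assert np.isclose(r2, r2_score(y_true, y_pred))

print('MSE=%.4f RMSE=%.4f MAE=%.4f R2=%.4f' % (mse, rmse, mae, r2))
```

Note that RMSE and MAE are expressed in the same unit as the target, which makes them easier to interpret than MSE.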

Our idea is to select a set of models of different types and evaluate them with cross-validation on the training set. After selecting the best ones (or the best one), we will fine-tune them to make better predictions. Finally, the last goal is to apply the chosen model to the test set.

We will test the following models:
1) Linear Regression (LR);
2) K-Nearest Neighbors Regression (KNN);
3) Decision Tree Regression (DTR);
4) Random Forest Regression (RFR);
5) XGBoost for Regression (XGBR);
6) LightGBM for Regression (LGBR).

Let us instantiate all models using default hyperparameters and create a list of these models:

In [4]:
# instantiating all models 
lin_reg = LinearRegression() # Linear regression
knn_reg = KNeighborsRegressor() # knn regression 
tree_reg = DecisionTreeRegressor() # Decision Tree Regressor - the criterion to split is squared_error by default 
forest_reg = RandomForestRegressor() # Random Forest Regressor - the number of estimators is 100 by default 
xgb_reg = xgb.XGBRegressor() # XGBoost Regressor 
lgb_reg = lgb.LGBMRegressor() # LightGBM Regressor 

In [5]:
models_dict_classes = {
    'LR': lin_reg,
    'KNN': knn_reg,
    'DTR': tree_reg,
    'RFR': forest_reg,
    'XGBR': xgb_reg,
    'LGBR': lgb_reg
}
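
The default hyperparameters mentioned in the comments above can be inspected with `get_params()`, which every scikit-learn estimator provides. The sketch below creates fresh instances rather than reusing the notebook's objects; the exact defaults depend on the installed scikit-learn version, so the printed values are not guaranteed to match every environment.

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor

# fresh estimators with default hyperparameters (illustrative instances)
defaults = {
    'KNN': KNeighborsRegressor().get_params(),
    'DTR': DecisionTreeRegressor().get_params(),
    'RFR': RandomForestRegressor().get_params(),
}

knn_neighbors = defaults['KNN']['n_neighbors']   # neighbors used for each prediction
tree_criterion = defaults['DTR']['criterion']    # split criterion of the tree
forest_trees = defaults['RFR']['n_estimators']   # number of trees in the forest

print('KNN n_neighbors:', knn_neighbors)
print('DTR criterion:', tree_criterion)
print('RFR n_estimators:', forest_trees)
```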

In [11]:
# evaluating each model in turn with 3-fold cross-validation
results = []
names = []
for name, model in models_dict_classes.items():
    cv_results = cross_val_score(
        model,
        X_train_prepared,
        y_train,
        cv=3,
        scoring='neg_mean_squared_error',
        n_jobs=-1
    )
    # scores are negative MSE values, so flip the sign before taking the root
    rmse_scores = np.sqrt(-cv_results)
    results.append(rmse_scores)
    names.append(name)
    print('%s: %f (%f)' % (name, rmse_scores.mean(), rmse_scores.std()))

# per-fold RMSE for each model, keyed by model name
final_results = dict(zip(names, results))


LR: 8.649837 (1.499849)




KNN: 10.258877 (0.213912)



DTR: 6.289947 (0.238039)



RFR: 4.317155 (0.358027)



XGBR: 4.248058 (0.345017)
LGBR: 4.337817 (0.393018)


We evaluate performance using RMSE as the standard metric. Of course, RMSE alone does not tell the whole story: it is sensitive to large errors and says nothing about the share of variance explained. In a complete analysis, it is important to look at other metrics such as $R^2$ and MAE; we will do that when evaluating the model on our test data.
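
If several metrics are wanted from the same folds, scikit-learn's `cross_validate` accepts a dictionary of scorers. The following is a minimal sketch on synthetic data generated with `make_regression`; `X_demo`, `y_demo` and the choice of `LinearRegression` are assumptions made for illustration and stand in for the prepared Uber data and the models above.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_validate

# synthetic regression data, standing in for the prepared training set
X_demo, y_demo = make_regression(
    n_samples=300, n_features=5, noise=10.0, random_state=0
)

# error metrics are exposed as negated scores, so that higher is always better
scoring = {
    'rmse': 'neg_root_mean_squared_error',
    'mae': 'neg_mean_absolute_error',
    'r2': 'r2',
}

cv_out = cross_validate(LinearRegression(), X_demo, y_demo, cv=3, scoring=scoring)

rmse_folds = -cv_out['test_rmse']   # flip the sign back to a positive error
mae_folds = -cv_out['test_mae']
r2_folds = cv_out['test_r2']

print('RMSE: %f (%f)' % (rmse_folds.mean(), rmse_folds.std()))
print('MAE:  %f (%f)' % (mae_folds.mean(), mae_folds.std()))
print('R2:   %f (%f)' % (r2_folds.mean(), r2_folds.std()))
```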

The final results, using RMSE (mean $\pm$ standard deviation over the 3 folds), for each model are:
1) Linear Regression: $8.649837 \pm 1.499849$;
2) kNN Regression: $10.258877 \pm 0.213912$;
3) Decision Tree Regression: $6.289947 \pm 0.238039$;
4) Random Forest Regression: $4.317155 \pm 0.358027$;
5) XGBoost Regression: $4.248058 \pm 0.345017$;
6) LGBM Regression: $4.337817 \pm 0.393018$.
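
The ranking can also be obtained programmatically. The sketch below copies the mean and standard deviation printed by the cross-validation loop into a pandas DataFrame and sorts by mean RMSE; the names `cv_rmse`, `summary` and `top_three` are introduced here for illustration only.

```python
import pandas as pd

# (mean RMSE, std RMSE) per model, copied from the cross-validation output
cv_rmse = {
    'LR': (8.649837, 1.499849),
    'KNN': (10.258877, 0.213912),
    'DTR': (6.289947, 0.238039),
    'RFR': (4.317155, 0.358027),
    'XGBR': (4.248058, 0.345017),
    'LGBR': (4.337817, 0.393018),
}

summary = (
    pd.DataFrame.from_dict(cv_rmse, orient='index', columns=['rmse_mean', 'rmse_std'])
    .sort_values('rmse_mean')   # lower RMSE is better
)

top_three = summary.head(3).index.tolist()

print(summary)
print(top_three)   # ['XGBR', 'RFR', 'LGBR']
```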

The three best models are Random Forest Regression, XGBoost Regression and LGBM Regression, whose RMSE values lie within one standard deviation of each other. We will therefore keep these three, tune their hyperparameters, and apply them to our unseen data.
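
Hyperparameter tuning can be done with `GridSearchCV`, which is already imported at the top of this notebook. The sketch below is one possible setup, not the configuration used in this project: the synthetic data (`X_demo`, `y_demo`) and the small `param_grid` are assumptions chosen to keep the example fast.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# synthetic data, standing in for the prepared training set
X_demo, y_demo = make_regression(
    n_samples=200, n_features=5, noise=10.0, random_state=0
)

# hypothetical grid: every combination is evaluated with 3-fold cross-validation
param_grid = {
    'n_estimators': [10, 30],
    'max_depth': [3, None],
}

search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid,
    cv=3,
    scoring='neg_root_mean_squared_error',
)
search.fit(X_demo, y_demo)

best_rmse = -search.best_score_   # flip the sign back to a positive RMSE
print('best params:', search.best_params_)
print('best CV RMSE: %f' % best_rmse)
```

With the default `refit=True`, `search.best_estimator_` is retrained on the full data passed to `fit` and can be used directly for predictions.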

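Let us save the three best vanilla models as pickle files. Note that `cross_val_score` fits clones of each estimator, so the objects saved here hold the default hyperparameters but are not fitted yet: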

In [19]:
# saving Random Forest Regression model
random_forest_model_path = '../models/interim/random_forest_regression.pkl'
joblib.dump(forest_reg, random_forest_model_path)


['../models/interim/random_forest_regression.pkl']

In [22]:
# saving XGBoost regression model 
xgb_reg_model_path = '../models/interim/xgboost_regression.pkl'
joblib.dump(xgb_reg, xgb_reg_model_path)

['../models/interim/xgboost_regression.pkl']

In [23]:
# saving lgbm regression model 
lgb_reg_path = '../models/interim/lgbm_regression.pkl'
joblib.dump(lgb_reg, lgb_reg_path)

['../models/interim/lgbm_regression.pkl']
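
A saved model can be restored with `joblib.load`. The sketch below uses a temporary directory and synthetic data instead of the project's paths; `is_fitted` and `demo_model` are hypothetical names introduced for this example. It also illustrates that `cross_val_score` fits internal clones and leaves the estimator passed to it unfitted, so a model has to be fitted explicitly before the saved file can be used for predictions.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.exceptions import NotFittedError
from sklearn.model_selection import cross_val_score
from sklearn.utils.validation import check_is_fitted


def is_fitted(model):
    """Return True if the scikit-learn estimator has been fitted."""
    try:
        check_is_fitted(model)
        return True
    except NotFittedError:
        return False


# synthetic data, standing in for the prepared training set
X_demo, y_demo = make_regression(n_samples=200, n_features=4, noise=5.0, random_state=0)
demo_model = RandomForestRegressor(n_estimators=10, random_state=0)

# cross-validation works on clones, so demo_model itself stays unfitted
cross_val_score(demo_model, X_demo, y_demo, cv=3)
fitted_after_cv = is_fitted(demo_model)

# fit explicitly, save, and restore from disk
demo_model.fit(X_demo, y_demo)
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, 'demo_model.pkl')
    joblib.dump(demo_model, demo_path)
    restored = joblib.load(demo_path)

fitted_after_load = is_fitted(restored)
same_predictions = np.allclose(restored.predict(X_demo), demo_model.predict(X_demo))

print(fitted_after_cv, fitted_after_load, same_predictions)
```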