# Notebook 4: Random Forest Regressor Model

**Model 3**: Random Forest Regressor

**Reason for model**: to experiment with a bagging ensemble, since other group members have already covered boosted tree models.

**Metric**: RMSE

**Reason for metric**: the focus is on penalising large errors more heavily than small ones, which makes RMSE the better choice.
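
As a quick illustration of why RMSE suits that goal, the sketch below compares RMSE with MAE on two hypothetical residual vectors that share the same total absolute error. The numbers and the helper names `rmse` and `mae` are made up for this example and are not taken from the flight-fare data.

```python
import numpy as np

def rmse(errors):
    """Root mean squared error of a vector of residuals."""
    return float(np.sqrt(np.mean(np.square(errors))))

def mae(errors):
    """Mean absolute error of a vector of residuals."""
    return float(np.mean(np.abs(errors)))

# Two hypothetical residual vectors with the same total absolute error (40).
even_errors = np.array([10.0, 10.0, 10.0, 10.0])   # four medium misses
spiky_errors = np.array([0.0, 0.0, 0.0, 40.0])     # one large miss

print(mae(even_errors), mae(spiky_errors))    # 10.0 10.0
print(rmse(even_errors), rmse(spiky_errors))  # 10.0 20.0
```

MAE cannot tell the two cases apart, while RMSE doubles when the same total error is concentrated in a single prediction.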

**Metrics of last best model**: SVR

***RMSE Train:*** 167.2884452552443

***RMSE Val:*** 167.7360076114627

In [1]:
ROOT_PATH_FROM_NOTEBOOK = ".."
DATA_PATH = "data"
PROCESSED_DATA_PATH = "processed"
SAMPLE_DATASET_NAME = "data_sample.parquet"

df_path = f"{ROOT_PATH_FROM_NOTEBOOK}/{DATA_PATH}/{PROCESSED_DATA_PATH}/{SAMPLE_DATASET_NAME}"

In [2]:
import sys
import os
from joblib import dump

import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.compose import ColumnTransformer

from assignment2_pkg_11919925.metrics.regression import print_regressor_scores_from_gridsearchcv

In [3]:
# Set Pandas option to show all columns in prints
pd.set_option('display.max_columns', None)

In [4]:
# Get the current working directory
current_dir = os.getcwd()

# Add the src directory to sys.path to use custom functions
sys.path.append(os.path.abspath(os.path.join(current_dir, '..', 'src')))

In [5]:
df = pd.read_parquet(df_path)

In [6]:
df.head()

Unnamed: 0,flightDayOfWeekSin,flightDayOfWeekCos,flightMonthSin,flightMonthCos,flightHourSin,flightHourCos,flightMinuteSin,flightMinuteCos,timeDeltaDays,travelDurationDay,totalTravelDistance,totalFare,isBasicEconomy,isRefundable,isNonStop,numLegs,business,coach,first,premium coach
0,-0.974928,-0.222521,0.5,-0.866025,-0.5,-0.8660254,-0.951057,-0.309017,15,0.195139,1191.0,294.6,-1,-1,-1,2,-1,1,-1,-1
1,-0.433884,-0.900969,0.5,-0.866025,0.258819,0.9659258,-0.5,-0.866025,37,0.095139,762.0,262.6,-1,-1,1,1,-1,1,-1,-1
2,0.781831,0.62349,0.866025,-0.5,-1.0,-1.83697e-16,0.5,0.866025,1,0.127083,1235.0,234.59,-1,-1,1,1,-1,1,-1,-1
3,0.974928,-0.222521,0.5,-0.866025,-0.5,-0.8660254,-0.104528,0.994522,34,0.101389,762.0,118.6,-1,-1,1,1,-1,1,-1,-1
4,0.433884,-0.900969,0.5,-0.866025,-0.965926,-0.258819,-0.669131,0.743145,17,0.333333,2618.0,446.6,-1,-1,-1,2,-1,1,-1,-1


In [7]:
y = df.pop('totalFare')
X = df

## Random Forest Regressor with Cross Validation

In [8]:
standardscaler_transformer = Pipeline(
    steps=[
        ('standard_scaler', StandardScaler()),
    ]
)

In [9]:
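# Build the ColumnTransformer.
# Note: ColumnTransformer defaults to remainder='drop', so only the three
# columns listed below are passed on to the regressor.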
encoder = ColumnTransformer(
    transformers=[
        ("standard_cols", standardscaler_transformer, ['timeDeltaDays', 'travelDurationDay', 'totalTravelDistance'])
    ]
)
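
One detail worth checking here is the `remainder` argument of `ColumnTransformer`, which defaults to `'drop'`. The sketch below uses a small toy frame (hypothetical values, with column names borrowed from the dataset) to show how the output width differs between the default and `remainder='passthrough'`. It is an illustration only, not a change to the pipeline used in this notebook.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

# Toy frame with hypothetical values; column names borrowed from the dataset.
toy = pd.DataFrame({
    "timeDeltaDays": [15, 37, 1, 34],
    "totalTravelDistance": [1191.0, 762.0, 1235.0, 762.0],
    "isNonStop": [-1, 1, 1, 1],
})
scaled_cols = ["timeDeltaDays", "totalTravelDistance"]

# Default remainder="drop": only the listed columns survive the transform.
drop_ct = ColumnTransformer([("standard_cols", StandardScaler(), scaled_cols)])

# remainder="passthrough": unlisted columns are appended unchanged.
keep_ct = ColumnTransformer(
    [("standard_cols", StandardScaler(), scaled_cols)],
    remainder="passthrough",
)

print(drop_ct.fit_transform(toy).shape)  # (4, 2)
print(keep_ct.fit_transform(toy).shape)  # (4, 3)
```

If the cyclical and cabin-class features are meant to reach the model, `remainder='passthrough'` would be the option to try. The scores reported in this notebook come from the encoder as defined above.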

In [10]:
rf_pipe = Pipeline(
    steps=[
        ("scaler", encoder),
        ("regressor", RandomForestRegressor(random_state=42, n_jobs=-1))
    ]
)

rf_search = GridSearchCV(
    estimator=rf_pipe,
    param_grid={
        "regressor__n_estimators": list(range(100, 201, 20)),
        "regressor__max_depth": list(range(3, 12, 2)),
        "regressor__min_samples_leaf": [1] + list(range(10, 101, 20)),
        "regressor__max_features": ["sqrt", "log2"],
    },
    cv=5,
    scoring="neg_root_mean_squared_error",
    refit=True,
    return_train_score=True,
    verbose=5,
)
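
Before launching the search, it can help to count how many candidates the grid produces. The sketch below repeats the grid from the cell above and counts its combinations with `ParameterGrid`. The variable names `n_candidates` and `n_fits` are introduced here for illustration.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as passed to GridSearchCV above.
param_grid = {
    "regressor__n_estimators": list(range(100, 201, 20)),           # 6 values
    "regressor__max_depth": list(range(3, 12, 2)),                  # 5 values
    "regressor__min_samples_leaf": [1] + list(range(10, 101, 20)),  # 6 values
    "regressor__max_features": ["sqrt", "log2"],                    # 2 values
}

n_candidates = len(ParameterGrid(param_grid))
n_fits = n_candidates * 5  # cv=5 folds per candidate
print(n_candidates, n_fits)  # 360 1800
```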

In [11]:
rf_search.fit(X, y)

Fitting 5 folds for each of 360 candidates, totalling 1800 fits
[CV 1/5] END regressor__max_depth=3, regressor__max_features=sqrt, regressor__min_samples_leaf=1, regressor__n_estimators=100;, score=(train=-171.978, test=-170.065) total time=   0.1s
[CV 2/5] END regressor__max_depth=3, regressor__max_features=sqrt, regressor__min_samples_leaf=1, regressor__n_estimators=100;, score=(train=-171.216, test=-173.044) total time=   0.1s
[CV 3/5] END regressor__max_depth=3, regressor__max_features=sqrt, regressor__min_samples_leaf=1, regressor__n_estimators=100;, score=(train=-170.689, test=-174.158) total time=   0.1s
[CV 4/5] END regressor__max_depth=3, regressor__max_features=sqrt, regressor__min_samples_leaf=1, regressor__n_estimators=100;, score=(train=-171.605, test=-170.996) total time=   0.1s
[CV 5/5] END regressor__max_depth=3, regressor__max_features=sqrt, regressor__min_samples_leaf=1, regressor__n_estimators=100;, score=(train=-171.787, test=-170.434) total time=   0.1s
[CV 1/5] EN

In [12]:
print_regressor_scores_from_gridsearchcv(rf_search)

RMSE Train: 145.38896174058732
RMSE Val: 162.8335526958362
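
The implementation of `print_regressor_scores_from_gridsearchcv` is not shown in this notebook. As a rough sketch of how such train and validation figures can be read from a fitted `GridSearchCV`, the example below pulls the mean scores of the best candidate out of `cv_results_` and flips their sign. It assumes the helper does something similar, and it runs on synthetic data with a deliberately tiny grid.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic data, for illustration only.
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo[:, 0] * 2.0 + rng.normal(scale=0.5, size=200)

demo_search = GridSearchCV(
    estimator=RandomForestRegressor(n_estimators=20, random_state=42),
    param_grid={"max_depth": [2, 4]},
    cv=3,
    scoring="neg_root_mean_squared_error",
    return_train_score=True,  # required for mean_train_score to exist
)
demo_search.fit(X_demo, y_demo)

# Scores are negated RMSE, so flip the sign to report RMSE.
best = demo_search.best_index_
rmse_train = -demo_search.cv_results_["mean_train_score"][best]
rmse_val = -demo_search.cv_results_["mean_test_score"][best]
print(f"RMSE Train: {rmse_train}")
print(f"RMSE Val: {rmse_val}")
```

Both values are averages over the cross-validation folds, so "Val" here means the mean score on the held-out folds rather than on a separate validation set.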


In [13]:
# Best parameters for the Random Forest
rf_search.best_params_

{'regressor__max_depth': 11,
 'regressor__max_features': 'log2',
 'regressor__min_samples_leaf': 1,
 'regressor__n_estimators': 180}

In [14]:
dump(rf_search.best_estimator_, '../models/nicholas_rf_pipe_sample_dataset.joblib')

['../models/nicholas_rf_pipe_sample_dataset.joblib']
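
For reference, a saved estimator can be restored with `joblib.load` and used for prediction straight away. The sketch below shows the round trip on a small stand-in model written to a temporary directory. The data and the file name `demo_rf.joblib` are hypothetical and unrelated to the model saved above.

```python
import os
import tempfile

import numpy as np
from joblib import dump, load
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in for the fitted pipeline; data are illustrative only.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 3))
y_demo = X_demo.sum(axis=1)
model = RandomForestRegressor(n_estimators=10, random_state=42).fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "demo_rf.joblib")  # hypothetical file name
    dump(model, model_path)
    restored = load(model_path)

# Check that the restored model reproduces the original predictions.
same_predictions = np.array_equal(model.predict(X_demo), restored.predict(X_demo))
print(same_predictions)
```

Loading a joblib file generally requires a scikit-learn version compatible with the one used to save it.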

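**Observations**: The Random Forest achieved a better validation RMSE than SVR (162.83 vs 167.74), but it overfits considerably more: its train-validation gap is about 17.4 (145.39 vs 162.83), compared with about 0.45 for SVR (167.29 vs 167.74).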

**Next model**: Artificial Neural Network.