# 5. Model Validation and Tuning
Once we have chosen the best model, we need to validate and tune it to get the best performance out of it. In this case, we're going with the Random Forest (`RandomForestRegressor`).

In [3]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor

from sklearn.metrics import mean_absolute_error
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score

from sklearn.model_selection import RandomizedSearchCV

In [6]:
df = pd.read_csv("../data/data_imputed.csv")
df_sample = df.sample(frac=0.3, random_state=42, ignore_index=True)

### Define train/test
We keep only the columns that ranked highest in the feature-importance analysis.

In [7]:
most_imp = ['car_age',
 'transmission_auto',
 'log_power_hp',
 'log_mileage_km',
 'weight_kg',
 'consumption_mixed_l_100km',
 'log_co2_g_km',
 'height_mm',
 'wheelbase_mm',
 'log_engine_displacement_cm3',
 'length_mm',
 'width_mm',
 'trunk_dim_1',
 'tank_capacity_l']
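The list above is hard-coded from an earlier analysis. As a rough sketch of how such a list can be derived, the snippet below ranks columns by the impurity-based `feature_importances_` of a fitted forest and keeps the top `k`. The data, the column names (`signal`, `noise_a`, ...) and the cut-off `top_k` are made up for illustration and are not taken from this project.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in data: only "signal" drives the target (hypothetical columns).
rng = np.random.default_rng(42)
X_demo = pd.DataFrame({
    "signal": rng.normal(size=500),
    "noise_a": rng.normal(size=500),
    "noise_b": rng.normal(size=500),
    "noise_c": rng.normal(size=500),
})
y_demo = 5.0 * X_demo["signal"] + rng.normal(scale=0.1, size=500)

forest = RandomForestRegressor(n_estimators=50, random_state=42)
forest.fit(X_demo, y_demo)

# Rank columns by impurity-based importance and keep the top k.
importances = pd.Series(forest.feature_importances_, index=X_demo.columns)
top_k = 2  # hypothetical cut-off
top_features = importances.sort_values(ascending=False).head(top_k).index.tolist()
print(top_features)
```

The resulting list can then be used to index the DataFrame, in the same way `most_imp` is used in the next cell.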

In [8]:
X = df_sample[most_imp]
y = df_sample['log_price']

X.shape, y.shape

((28238, 14), (28238,))

Split the data into train and test sets (80/20).

In [9]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print(f"X_train: {X_train.shape}, y_train: {y_train.shape}")
print(f"X_test: {X_test.shape},  y_test: {y_test.shape}")

X_train: (22590, 14), y_train: (22590,)
X_test: (5648, 14),  y_test: (5648,)


As we saw in the previous notebook, it is not strictly necessary to scale the dataset for a random forest, so the scaling cells below are left commented out.
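Tree-based models split on thresholds, so a monotonic rescaling of a feature should not change which samples fall on each side of a split. The snippet below is a minimal sketch of that idea on synthetic data (all names and values are invented): it fits the same forest on raw and on min-max scaled features and compares the predictions, which are expected to agree up to floating-point effects.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import MinMaxScaler

# Two synthetic features on very different scales.
rng = np.random.default_rng(0)
X_raw = np.column_stack([
    rng.uniform(0, 1, size=300),
    rng.uniform(0, 1000, size=300),
])
y_toy = 2.0 * X_raw[:, 0] + 0.01 * X_raw[:, 1] + rng.normal(scale=0.1, size=300)

X_scaled = MinMaxScaler().fit_transform(X_raw)

# Same hyperparameters and seed, so only the feature scale differs.
forest_raw = RandomForestRegressor(n_estimators=50, random_state=0).fit(X_raw, y_toy)
forest_scaled = RandomForestRegressor(n_estimators=50, random_state=0).fit(X_scaled, y_toy)

pred_raw = forest_raw.predict(X_raw)
pred_scaled = forest_scaled.predict(X_scaled)
max_gap = float(np.max(np.abs(pred_raw - pred_scaled)))
print(f"Largest prediction difference: {max_gap:.6f}")
```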

In [10]:
# scaler_x = MinMaxScaler()
# x_train_scaled = scaler_x.fit_transform(X_train)
# x_test_scaled = scaler_x.transform(X_test)

In [11]:
# scaler_y = MinMaxScaler()
# y_train_scaled = scaler_y.fit_transform(y_train.values.reshape(-1, 1))
# y_test_scaled = scaler_y.transform(y_test.values.reshape(-1, 1))

_____________________
## Randomized search for model tuning
Let's now tune the model using `RandomizedSearchCV`, which samples a fixed number of hyperparameter combinations instead of trying every one, as a full grid search would. When it's done, it returns the best combination of hyperparameters it tried, which we then validate on the test set through different metrics: MAE, RMSE and R2 score.

In [12]:
# Models and hyperparameters

model = RandomForestRegressor()

parameters = {
          "n_estimators"           : [100, 200], 
          "criterion"              : ['absolute_error'], 
          "max_features"           : ["sqrt"],
          "min_samples_split"      : [5, 10], 
          "random_state"           : [42],
          "max_depth"              : [10, 20],
          "min_samples_leaf"       : [1, 2]
}

In [13]:
rdm_search = RandomizedSearchCV(
    model,
    param_distributions = parameters,
    n_iter = 10,
    scoring = 'r2',
    cv = 3,
    random_state = 42,
    n_jobs = -1,
    verbose = 1
)

model_result = rdm_search.fit(X_train, y_train)

Fitting 3 folds for each of 10 candidates, totalling 30 fits
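To see where the "10 candidates" and "30 fits" in the log come from, the sketch below counts the combinations in the same parameter dictionary and draws 10 of them with `ParameterSampler`. The dictionary is copied from the cell above. Calling the sampler directly is only an illustration of what such a draw looks like.

```python
from sklearn.model_selection import ParameterGrid, ParameterSampler

parameters = {
    "n_estimators": [100, 200],
    "criterion": ["absolute_error"],
    "max_features": ["sqrt"],
    "min_samples_split": [5, 10],
    "random_state": [42],
    "max_depth": [10, 20],
    "min_samples_leaf": [1, 2],
}

# A full grid search would try every combination of the listed values.
n_combinations = len(ParameterGrid(parameters))

# A randomized search only evaluates n_iter sampled combinations.
sampled = list(ParameterSampler(parameters, n_iter=10, random_state=42))

cv_folds = 3
n_fits = len(sampled) * cv_folds
print(f"{n_combinations} possible combinations, {len(sampled)} sampled, {n_fits} fits")
```

Because every entry is a list, the sampler draws without replacement, so the 10 candidates are distinct combinations out of the full grid.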


In [14]:
best_params = model_result.best_params_
best_params

{'random_state': 42,
 'n_estimators': 200,
 'min_samples_split': 5,
 'min_samples_leaf': 1,
 'max_features': 'sqrt',
 'max_depth': 20,
 'criterion': 'absolute_error'}

In [15]:
final_model = RandomForestRegressor(**best_params)

In [16]:
final_model.fit(X_train, y_train)
y_pred_log = final_model.predict(X_test)

In [17]:
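# Back-transform from the log scale to prices (np.expm1 is the inverse of np.log1p).
y_pred = np.expm1(y_pred_log)
# Use a new name so that re-running this cell does not transform y_test twice.
y_test_price = np.expm1(y_test)

In [18]:
mae = mean_absolute_error(y_test_price, y_pred)
# mean_squared_error returns the MSE; take the square root to get the RMSE.
rmse = np.sqrt(mean_squared_error(y_test_price, y_pred))
r2 = r2_score(y_test_price, y_pred)

metrics = {
    "MAE": mae,
    "RMSE": rmse,
    "R2": r2
}

# Round for display
{name: round(float(value), 4) for name, value in metrics.items()}

{'MAE': 2272.9028, 'RMSE': 4151.7586, 'R2': 0.8838}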
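The metrics are reported on the price scale, after inverting the log transform of the target. As a small worked sketch on made-up numbers (three invented prices, not project data), the snippet below shows the `log1p`/`expm1` round trip and how MAE, MSE and RMSE relate to each other.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Made-up prices, for illustration only.
y_true_price = np.array([10_000.0, 20_000.0, 30_000.0])

# Pretend the model predicted on the log scale, as in this notebook.
y_pred_log = np.log1p(np.array([11_000.0, 19_000.0, 33_000.0]))

# Invert the log transform before computing price-scale metrics.
y_pred_price = np.expm1(y_pred_log)

mae = mean_absolute_error(y_true_price, y_pred_price)
mse = mean_squared_error(y_true_price, y_pred_price)
rmse = np.sqrt(mse)  # RMSE is the square root of the MSE, in the same units as the price

print(f"MAE: {mae:.2f}, MSE: {mse:.2f}, RMSE: {rmse:.2f}")
```

The MSE is in squared price units, which is why it is orders of magnitude larger than the MAE, while the RMSE is directly comparable to it.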

_____________________________
## Final training and storage
Ideally we would retrain the final model on the whole data set, but that takes too much time, even after reducing the data set to 60% of the rows. Therefore, we're going to keep the model as it is right now, since this project focuses mainly on the development and deployment process of an ML model.

The cell below is the attempted retraining on the 60% sample, which was interrupted before it finished. After it, we save the model that was fitted on the training split.

In [21]:
df_60 = df.sample(frac=0.6, random_state=42, ignore_index=True)

X_full = df_60[most_imp]
y_full = df_60['log_price']  # already on the log scale, so no second np.log1p

RF_model = rdm_search.best_estimator_
RF_model.fit(X_full, y_full)

KeyboardInterrupt: 

In [None]:
import joblib
# Save to the same path that the recorded output and the size check below use.
joblib.dump(final_model, "../src/RandomForest_model.pkl", compress=9)

['../src/RandomForest_model.pkl']

In [1]:
import os

file_size_mb = os.path.getsize("../src/RandomForest_model.pkl") / 1024**2
print(f"Actual file size on disk: {file_size_mb:.2f} MB")

Actual file size on disk: 34.55 MB
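Before relying on the saved file for deployment, it can be worth checking that it loads back and predicts the same values. The snippet below is a minimal sketch of that round trip with `joblib.dump` and `joblib.load`, using a small throwaway forest and a temporary directory. The data and the file name `demo_model.pkl` are placeholders, not the project's model.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Small throwaway model on synthetic data.
rng = np.random.default_rng(1)
X_small = rng.normal(size=(100, 3))
y_small = X_small[:, 0] + rng.normal(scale=0.1, size=100)
small_forest = RandomForestRegressor(n_estimators=10, random_state=1).fit(X_small, y_small)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "demo_model.pkl")  # placeholder file name
    joblib.dump(small_forest, model_path, compress=9)     # same compression level as above
    size_kb = os.path.getsize(model_path) / 1024
    reloaded = joblib.load(model_path)

# The reloaded model should reproduce the original predictions.
same_predictions = np.array_equal(small_forest.predict(X_small), reloaded.predict(X_small))
print(f"File size: {size_kb:.1f} KB, identical predictions: {same_predictions}")
```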


In [2]:
import pandas as pd
df = pd.read_csv("../data/data_imputed.csv")  # the kernel was restarted, so reload the data first
categ_var = df.select_dtypes(include='object').columns.tolist() + ['registration_year', 'registration_month', 'cylinders', 'seats', 'doors']
categ_var