# Sprint 12 Project: Numerical Methods

Rusty Bargain, a used car sales service, is developing an app to attract new customers. In that app, you can quickly find out the market value of your car. You have access to historical data: technical specifications, trim versions, and prices. You need to build a model to determine the value.

Rusty Bargain is interested in:

- the quality of the prediction;
- the speed of the prediction;
- the time required for training.

# Project Requirements

- Train different models with various hyperparameters. The goal is to compare gradient boosting methods to random forest, decision tree, and linear regression. For each, assess quality, prediction speed, and required training time.
    - Analyze the quality of the models
        - Calculate RMSE
    - Analyze the speed of the prediction and required training time
        - Determine cell code runtime in Jupyter Notebook for each step
- Linear regression is not ideal for hyperparameter tuning, but is useful as a sanity check of other methods. Gradient boosting should not underperform linear regression.
- Include linear regression as a sanity check, a tree-based algorithm with hyperparameter tuning (random forest), LightGBM with hyperparameter tuning, CatBoost and XGBoost with hyperparameter tuning (optional).
- Take note of the encoding requirements for categorical features: simple algorithms and XGBoost require OHE, whereas LightGBM and CatBoost have their own built-in handling of categorical features.
- Find and use the special command that reports cell code runtime in Jupyter Notebook.
- Change only a few model parameters when training gradient boosting models to minimize training time.
- Delete excessive variables from Jupyter Notebook if required, using the `del` operator, e.g. `del features_train`.


## Data Preparation

### Initialization

In [2]:
# Import libraries required for analysis 
import numpy as np
import pandas as pd

# For train-test split
from sklearn.model_selection import train_test_split

# For scaling
from sklearn.preprocessing import StandardScaler

# For RMSE
from sklearn.metrics import mean_squared_error

# For determining processing time
import time 

# Regression Models
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor

# Grid Search
# https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html
from sklearn.model_selection import GridSearchCV

# https://lightgbm.readthedocs.io/en/stable/Python-Intro.html
from lightgbm import LGBMRegressor

# https://catboost.ai/en/docs/concepts/python-usages-examples
from catboost import CatBoostRegressor

#https://xgboost.readthedocs.io/en/stable/python/python_api.html
from xgboost import XGBRegressor

### Load Data

In [4]:
# Read in data csv and convert to dataframe
df = pd.read_csv('/datasets/car_data.csv')

### Description of Data

#### Features

- *DateCrawled* — date profile was downloaded from the database
- *VehicleType* — vehicle body type
- *RegistrationYear* — vehicle registration year
- *Gearbox* — gearbox type
- *Power* — power (hp)
- *Model* — vehicle model
- *Mileage* — mileage (measured in km due to dataset's regional specifics)
- *RegistrationMonth* — vehicle registration month
- *FuelType* — fuel type
- *Brand* — vehicle brand
- *NotRepaired* — vehicle repaired or not
- *DateCreated* — date of profile creation
- *NumberOfPictures* — number of vehicle pictures
- *PostalCode* — postal code of profile owner (user)
- *LastSeen* — date of the last activity of the user

#### Target

- *Price* — price (Euro)

### Inspect Data

#### View Sample of Data

In [3]:
df.sample(5)

Unnamed: 0,DateCrawled,Price,VehicleType,RegistrationYear,Gearbox,Power,Model,Mileage,RegistrationMonth,FuelType,Brand,NotRepaired,DateCreated,NumberOfPictures,PostalCode,LastSeen
315693,05/03/2016 21:52,13900,bus,2009,manual,150,vito,150000,9,gasoline,mercedes_benz,no,05/03/2016 00:00,0,99734,08/03/2016 14:16
173563,16/03/2016 06:36,499,sedan,1997,,0,3er,150000,0,petrol,bmw,,16/03/2016 00:00,0,86368,03/04/2016 23:45
101043,11/03/2016 08:37,0,,2016,manual,0,niva,80000,12,petrol,lada,,11/03/2016 00:00,0,71672,11/03/2016 09:45
337031,07/03/2016 01:42,9990,suv,2005,manual,185,xc_reihe,150000,8,gasoline,volvo,no,06/03/2016 00:00,0,82031,22/03/2016 07:48
38121,05/03/2016 22:40,0,sedan,1994,manual,60,golf,150000,5,petrol,volkswagen,yes,05/03/2016 00:00,0,27432,06/04/2016 14:44


#### Explore Dataset

In [4]:
# Use print so I don't lose outputs

# Check for missing values
print('Check for Missing Values')
print(df.isna().sum())
print()

# Check for duplicate rows
print('Check for Duplicate Rows')
print('There are', df.duplicated().sum(), 'duplicate rows')
print()

# Check values for each column
print('\n Describe Dataframe')
print(df.describe())
print()

# Check data types
print('\n Check Data Types')
df.info()  # info() prints its report directly and returns None, so no print() wrapper is needed
print()

Check for Missing Values
DateCrawled              0
Price                    0
VehicleType          37490
RegistrationYear         0
Gearbox              19833
Power                    0
Model                19705
Mileage                  0
RegistrationMonth        0
FuelType             32895
Brand                    0
NotRepaired          71154
DateCreated              0
NumberOfPictures         0
PostalCode               0
LastSeen                 0
dtype: int64

Check for Duplicate Rows
There are 262 duplicate rows


 Describe Dataframe
               Price  RegistrationYear          Power        Mileage  \
count  354369.000000     354369.000000  354369.000000  354369.000000   
mean     4416.656776       2004.234448     110.094337  128211.172535   
std      4514.158514         90.227958     189.850405   37905.341530   
min         0.000000       1000.000000       0.000000    5000.000000   
25%      1050.000000       1999.000000      69.000000  125000.000000   
50%      2700.000000 

#### Summary of Data Inspection

- There are missing values in the following fields: `VehicleType`, `Gearbox`, `Model`, `FuelType`, `NotRepaired`
    - These fields are all categorical variables with the object datatype. I will replace the missing values with `Unknown` since there is no way to determine the correct value. In the most affected field, `NotRepaired`, missing values account for around 20% of rows (71,154 of 354,369), which is too large a proportion to drop.
- There are a few fields that will not affect the price of the car in any way and can be dropped from the feature list.
    - `DateCrawled` because it only relates to accessing the database.
    - `RegistrationMonth` because it is not relevant to the value of the car.
    - `DateCreated` because it only records when the listing profile was created.
    - `LastSeen` because it's based on the last activity of the user, which pertains to updating the listing and such.
    - *Note: I have chosen to keep `PostalCode` in the data because geographic location does affect car values, however the postal code of the user may not always match that of the vehicle.*
- `NumberOfPictures` only has a value of 0, and will therefore not add anything to modeling and should be dropped.    
- There are 262 duplicate rows in the data, and while this is a small percentage of the overall dataset, they should be dropped.    
- After making the above changes, I will do further transformations prior to modeling (see below)
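The share of missing values quoted above can be checked column by column. Below is a minimal sketch on a made-up frame; the toy values and the names `toy`, `missing_share`, and `categorical` are illustrative assumptions, not project data.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the listings data (values are made up for illustration)
toy = pd.DataFrame({
    'Price': [500, 1200, 3000, 800, 4500],
    'Gearbox': ['manual', np.nan, 'auto', 'manual', np.nan],
    'NotRepaired': ['no', np.nan, np.nan, 'yes', np.nan],
})

# Share of missing values per column, as a percentage
missing_share = toy.isna().mean() * 100
print(missing_share.round(1))

# Fill only the categorical columns, leaving numeric columns untouched
categorical = ['Gearbox', 'NotRepaired']
toy[categorical] = toy[categorical].fillna('Unknown')
print(toy)
```

Restricting `fillna` to the categorical columns is a defensive choice: if a numeric column ever contained a gap, a blanket `fillna('Unknown')` would silently turn it into a mixed-type column.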

### Making Necessary Changes to Dataset

#### Replace Missing Values with `Unknown`

In [5]:
# Replace missing values with Unknown
df.fillna('Unknown', inplace=True)

# Check for missing values after replacement
df.isna().sum()

DateCrawled          0
Price                0
VehicleType          0
RegistrationYear     0
Gearbox              0
Power                0
Model                0
Mileage              0
RegistrationMonth    0
FuelType             0
Brand                0
NotRepaired          0
DateCreated          0
NumberOfPictures     0
PostalCode           0
LastSeen             0
dtype: int64

#### Drop Duplicates

In [6]:
# Drop one of the duplicate rows
df.drop_duplicates(keep='first', inplace=True)

# Check for duplicate rows after dropping
print('There are', df.duplicated().sum(), 'duplicate rows')

There are 0 duplicate rows


#### Drop Irrelevant Fields

In [6]:
# Drop irrelevant fields from dataframe
df.drop(['DateCrawled', 'RegistrationMonth', 'DateCreated', 'NumberOfPictures', 'LastSeen'], axis=1, inplace = True)

# Check if columns were dropped
df.columns

Index(['Price', 'VehicleType', 'RegistrationYear', 'Gearbox', 'Power', 'Model',
       'Mileage', 'FuelType', 'Brand', 'NotRepaired', 'PostalCode'],
      dtype='object')

### Prepare Data for Models

Before moving on to model training, a few more steps must be taken:
- Encode categorical fields so their data is available to the models.
- Split dataset into three parts (training, validation, test) in order to train, tune, and check the models, respectively.
- Scale fields where numerical values are too large to avoid excess weight being placed on larger values such as vehicle mileage.

#### Encode Categorical Fields

Because there isn't a ranking or order in the categorical fields that needs to be preserved when converting to numerical values, I will use OHE instead of ordinal encoding.

In [7]:
# Create a list of the columns that need to be converted
categorical_columns = ['VehicleType','Gearbox','Model','FuelType','Brand','NotRepaired']

# Convert categorical fields to numerical fields using OHE
# Not removing dummy variables because the categorical fields are not binary (1/0)
df_ohe = pd.get_dummies(df, columns=categorical_columns, drop_first=False)

# Print the start of encoded data frame
df_ohe.head()

Unnamed: 0,Price,RegistrationYear,Power,Mileage,PostalCode,VehicleType_Unknown,VehicleType_bus,VehicleType_convertible,VehicleType_coupe,VehicleType_other,...,Brand_sonstige_autos,Brand_subaru,Brand_suzuki,Brand_toyota,Brand_trabant,Brand_volkswagen,Brand_volvo,NotRepaired_Unknown,NotRepaired_no,NotRepaired_yes
0,480,1993,0,150000,70435,1,0,0,0,0,...,0,0,0,0,0,1,0,1,0,0
1,18300,2011,190,125000,66954,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,1
2,9800,2004,163,125000,90480,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
3,1500,2001,75,150000,91074,0,0,0,0,0,...,0,0,0,0,0,1,0,0,1,0
4,3600,2008,69,90000,60437,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0
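To make the effect of OHE easier to see than in the 318-column frame, here is a small sketch on a toy frame. The three rows and the name `toy_ohe` are assumptions for illustration; `dtype=int` is passed only so the indicators print as 0/1.

```python
import pandas as pd

# Toy frame with one nominal column (values are illustrative)
toy = pd.DataFrame({
    'Price': [480, 18300, 9800],
    'Gearbox': ['manual', 'Unknown', 'auto'],
})

# One indicator column per category; no category is ranked above another
toy_ohe = pd.get_dummies(toy, columns=['Gearbox'], drop_first=False, dtype=int)

print(toy_ohe.columns.tolist())
print(toy_ohe)
```

Each row has exactly one indicator set to 1, so no artificial ordering such as `auto < manual` is introduced, which is what ordinal encoding would imply.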


#### Split Data into Training, Validation, and Test Sets

In [8]:
# The data is split 60/20/20 (a ratio of 3:1:1) in two stages
# Set random state to 12345 to replicate training set in future
# 60% of data will be in training_set and the other 40% is held out, to be split into validation and test sets
training_set, validation_set_to_split = train_test_split(df_ohe, test_size=0.40, random_state=12345)

# Split validation_set_to_split in half to create a validation and test set of equal sizes
validation_set, test_set = train_test_split(validation_set_to_split, test_size=0.50, random_state=12345)

# Create the features and target training datasets
features_train = training_set.drop(['Price'], axis=1)
target_train = training_set['Price']

# Create the features and target validation datasets
features_valid = validation_set.drop(['Price'], axis=1)
target_valid = validation_set['Price']

# Create the features and target test datasets
features_test = test_set.drop(['Price'], axis=1)
target_test = test_set['Price']

# Print shapes of each dataset to verify accuracy
# Training 
print(features_train.shape) # Training set contains 60% of original dataframe rows 
print(target_train.shape)   # Training set contains 60% of original dataframe rows 

# Validation
print(features_valid.shape) # Validation set contains 20% of original dataframe rows 
print(target_valid.shape)   # Validation set contains 20% of original dataframe rows 

# Test
print(features_test.shape)  # Test set contains 20% of original dataframe rows 
print(target_test.shape)    # Test set contains 20% of original dataframe rows 

# Drop the last row of the test set
# (note: in the output below, validation and test both have 70,874 rows before this step, so it leaves test one row shorter)
test_set = test_set.drop(test_set.index[-1]) 

# Recreating features and target from adjusted test set
features_test = test_set.drop(['Price'], axis=1)
target_test = test_set['Price']

# Print shapes of each dataset to verify accuracy
# Training 
print(features_train.shape) # Training set contains 60% of original dataframe rows 
print(target_train.shape)   # Training set contains 60% of original dataframe rows 

# Validation
print(features_valid.shape) # Validation set contains 20% of original dataframe rows 
print(target_valid.shape)   # Validation set contains 20% of original dataframe rows 

# Test
print(features_test.shape)  # Test set contains 20% of original dataframe rows 
print(target_test.shape)    # Test set contains 20% of original dataframe rows

(212621, 318)
(212621,)
(70874, 318)
(70874,)
(70874, 318)
(70874,)
(212621, 318)
(212621,)
(70874, 318)
(70874,)
(70873, 318)
(70873,)
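The two-stage split yields 60% / 20% / 20% of the rows: the second call halves the 40% that was held out. A minimal sketch on synthetic data (the array `data` is an assumption standing in for `df_ohe`) confirms the proportions:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the encoded dataframe: 1,000 rows, 3 columns
data = np.arange(3000).reshape(1000, 3)

# Stage 1: hold out 40% of the rows
train, holdout = train_test_split(data, test_size=0.40, random_state=12345)

# Stage 2: split the held-out 40% in half -> 20% validation, 20% test
valid, test = train_test_split(holdout, test_size=0.50, random_state=12345)

for name, part in [('train', train), ('valid', valid), ('test', test)]:
    print(name, len(part), f'{len(part) / len(data):.0%}')
```

The same arithmetic explains the shapes printed above: 212,621 rows is 60% of 354,369, and the remaining rows are divided evenly between validation and test.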


#### Perform Scaling

In [9]:
# Create a list of features that need to be scaled
# Scaling is needed because these values are much larger than the 0/1 dummy columns and we don't want their magnitude to dominate scale-sensitive models
features_to_scale = [ 'RegistrationYear', 'Power', 'Mileage']

# Fit StandardScaler to scale features for training set only
# Then apply the same scaler to transform all three sets (train, validation and test)
# The scaler is never fit using validation/test data
transformer = StandardScaler().fit(features_train[features_to_scale].to_numpy())

# Create a copy of df with scaled/transformed features 
# Apply the same scaler to transform all three sets (train, validation and test) using transformer
# Train
features_train_scaled = features_train.copy()
features_train_scaled.loc[:, features_to_scale] = transformer.transform(features_train[features_to_scale].to_numpy())

# Validation
features_valid_scaled = features_valid.copy()
features_valid_scaled.loc[:, features_to_scale] = transformer.transform(features_valid[features_to_scale].to_numpy())

# Test
features_test_scaled = features_test.copy()
features_test_scaled.loc[:, features_to_scale] = transformer.transform(features_test[features_to_scale].to_numpy())

# Print sample of scaled trained dataset
print('Train')
display(features_train_scaled.head())

print('Validation')
display(features_valid_scaled.head())

print('Test')
display(features_test_scaled.head())

Train


Unnamed: 0,RegistrationYear,Power,Mileage,PostalCode,VehicleType_Unknown,VehicleType_bus,VehicleType_convertible,VehicleType_coupe,VehicleType_other,VehicleType_sedan,...,Brand_sonstige_autos,Brand_subaru,Brand_suzuki,Brand_toyota,Brand_trabant,Brand_volkswagen,Brand_volvo,NotRepaired_Unknown,NotRepaired_no,NotRepaired_yes
51358,0.021943,0.271071,0.575195,40227,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,1,0
259924,0.092459,0.307001,-2.061801,92363,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,1
144289,0.162975,-0.006109,0.575195,32278,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0
346272,-0.013315,0.307001,0.575195,45879,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,1,0
247746,-0.048573,-0.180629,0.575195,56422,0,0,0,0,0,0,...,0,0,0,0,0,1,0,0,1,0


Validation


Unnamed: 0,RegistrationYear,Power,Mileage,PostalCode,VehicleType_Unknown,VehicleType_bus,VehicleType_convertible,VehicleType_coupe,VehicleType_other,VehicleType_sedan,...,Brand_sonstige_autos,Brand_subaru,Brand_suzuki,Brand_toyota,Brand_trabant,Brand_volkswagen,Brand_volvo,NotRepaired_Unknown,NotRepaired_no,NotRepaired_yes
151885,-0.013315,-0.180629,0.575195,76770,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0
58910,-0.025068,0.178678,0.575195,14480,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0
151173,-0.072078,-0.288421,-1.798101,45309,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
86763,-0.03682,-0.5656,0.575195,49086,0,0,0,0,0,1,...,0,0,0,0,0,0,0,1,0,0
234102,0.021943,0.153013,0.575195,44867,0,0,0,0,0,0,...,0,0,0,0,0,1,0,1,0,0


Test


Unnamed: 0,RegistrationYear,Power,Mileage,PostalCode,VehicleType_Unknown,VehicleType_bus,VehicleType_convertible,VehicleType_coupe,VehicleType_other,VehicleType_sedan,...,Brand_sonstige_autos,Brand_subaru,Brand_suzuki,Brand_toyota,Brand_trabant,Brand_volkswagen,Brand_volvo,NotRepaired_Unknown,NotRepaired_no,NotRepaired_yes
345557,0.033695,0.255672,-0.084054,21217,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,1,0
66038,-0.072078,-0.180629,0.575195,97268,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,1,0
254556,-0.001562,-0.000976,0.575195,49744,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
347542,-0.001562,0.106816,0.575195,84051,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0
231382,-0.025068,0.178678,0.575195,28865,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
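The reason for fitting the scaler on the training rows only can be illustrated on synthetic data. In this sketch the two columns loosely mimic a year-like and a mileage-like feature; the distributions and variable names are assumptions, not the project data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(12345)

# Synthetic columns on very different scales (year-like and mileage-like)
train = np.column_stack([rng.normal(2004, 8, 500), rng.normal(128000, 38000, 500)])
valid = np.column_stack([rng.normal(2004, 8, 200), rng.normal(128000, 38000, 200)])

# Learn mean and standard deviation from the training rows only
scaler = StandardScaler().fit(train)

train_scaled = scaler.transform(train)
valid_scaled = scaler.transform(valid)  # reuses the training statistics

# Training columns are centered at 0 with unit standard deviation
print(train_scaled.mean(axis=0).round(3), train_scaled.std(axis=0).round(3))

# Validation columns are close to, but not exactly, 0 and 1
print(valid_scaled.mean(axis=0).round(3), valid_scaled.std(axis=0).round(3))
```

The validation statistics come out near 0 and 1 without matching exactly, which is expected: no information from the validation rows was used to choose the scaling.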


### Summary of Data Preparation

The data has been analyzed, processed, and prepared for modeling.

## Model Training

Notes for this section:
- Model performance will be based on:
    - RMSE (quality)
    - Time elapsed to run code (speed)
        - Training
        - Prediction
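- The LinearRegression model will serve as a baseline. Because no tuning will be done with this model, it does not need the validation set for hyperparameter selection. In this section it is trained on the training set and scored on the validation set so that its RMSE is directly comparable with the tuned models; it is scored on the test set in the final evaluation.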
- I will use GridSearchCV to automate the process of hyperparameter tuning (`n_estimators`, `max_depth`) for the models below:
    - RandomForestRegressor
    - LightGBM
    - CatBoost
    - XGBoost
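For reference, RMSE is the square root of the mean squared difference between actual and predicted prices, so it is expressed in the same unit as the target (euros). A minimal sketch follows; the helper name `rmse` and the three sample prices are illustrative assumptions.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error, in the same units as the target."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Errors of 100, -200 and 300 euros -> sqrt((100^2 + 200^2 + 300^2) / 3)
print(round(rmse([1000, 2000, 3000], [900, 2200, 2700]), 2))  # 216.02
```

Because the errors are squared before averaging, the 300-euro miss contributes more than the other two combined, which is why RMSE penalizes large pricing mistakes heavily.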
    
### Linear Regression (Baseline)

This model serves as a baseline and will not be tuned. It is scored on the validation set here only so that its RMSE can be compared with the tuned models.

In [11]:
%%time 

# Initialize model
model = LinearRegression()

# Fit model to training data
model.fit(features_train, target_train)

# Predict validation target
predicted_values = model.predict(features_valid)

# Calculate RMSE
RMSE = np.sqrt(mean_squared_error(target_valid, predicted_values))

# Print RMSE score
print('The RMSE for the LinearRegression Model is:', round(RMSE,2))

# Print time elapsed for model runtime
print('Run Time for LinearRegression:')

The RMSE for the LinearRegression Model is: 3178.07
Run Time for LinearRegression:
CPU times: user 5.81 s, sys: 5.54 s, total: 11.3 s
Wall time: 11.3 s


### Model Evaluation Function

I have created a function to take each model and return:
- Time to find best hyperparameters during training and prediction
- Best hyperparameters based on RMSE
- Best mean cross-validated RMSE (3-fold, on the training dataset) - *(Note: GridSearchCV reports this as a negated RMSE)*
- RMSE (using validation dataset after tuning hyperparameters)

In [12]:
# Create model evaluation function 
# For model tuning, we will use GridSearchCV and keep the search space constant across models
def model_evaluation(model, model_name):
    
    # Create a search space to optimize for best combination of n_estimators and max_depth
    search_space = {'n_estimators': [5,10,20]
                    , 'max_depth': [5,10,20]}
       
    # The estimator passed in as `model` is used directly by GridSearchCV below

    # Create a GridSearchCV object
    # Use 3-fold cross validation (5 is the default, but to save time on training, we will use 3-fold)
    # Select Best Model Using neg_root_mean_squared_error as Scorer function 
    model_grid = GridSearchCV(model, param_grid=search_space, cv=3, scoring='neg_root_mean_squared_error', verbose=0)

    # Fit model to training data
    model_grid.fit(features_train, target_train)

    # Save best parameters to variable that produce the smallest RMSE
    # best_parameters will contain a dictionary of the best parameters that produced the lowest RMSE
    best_parameters = model_grid.best_params_

    # best_score_ is the mean cross-validated score of the best candidate on the training data
    # It is a negated RMSE because scikit-learn always maximizes the scorer
    best_score = model_grid.best_score_

    # best_estimator_ is the model refit on the full training set with the best parameters
    # It is used below to predict values on the validation set
    best_grid = model_grid.best_estimator_

    # Predict target values on the validation set
    predicted_values = best_grid.predict(features_valid)

    # Calculate RMSE between predicted target and actual target for validation set
    RMSE = np.sqrt(mean_squared_error(target_valid, predicted_values))

    # Print results of tuning and RMSE on validation set 
    # Also print time for training and calculating RMSE on validation data 
    print("Best Parameters For", model_name,":", best_parameters )
    print("These Parameters Produced a Best RMSE Score of", round(best_score,2), "on the Training Data")
    print()
    print("The RMSE Using the Tuned", model_name, "on the Validation Set:", round(RMSE,2))
    #print("Best grid:", best_grid )
    print()
    print('Training Time for', model_name,':')
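The negative "best RMSE" printed by this function comes from the `neg_root_mean_squared_error` scorer. The sketch below reproduces that behaviour on synthetic data with a small decision tree; the data, the grid, and the name `cv_rmse` are assumptions chosen so the example runs in a moment.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data (illustrative only)
rng = np.random.default_rng(12345)
X = rng.uniform(0, 10, size=(300, 2))
y = 3 * X[:, 0] + np.sin(X[:, 1]) + rng.normal(0, 0.5, 300)

grid = GridSearchCV(
    DecisionTreeRegressor(random_state=12345),
    param_grid={'max_depth': [2, 5, 10]},
    cv=3,
    scoring='neg_root_mean_squared_error',
)
grid.fit(X, y)

# best_score_ is the mean cross-validated score of the best candidate.
# scikit-learn maximizes scores, so the RMSE is reported negated.
cv_rmse = -grid.best_score_
print(grid.best_params_, round(cv_rmse, 3))
```

Flipping the sign gives a positive value that can be placed next to the validation RMSE. It is a cross-validated estimate rather than an error measured on the rows the model was fit to.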

### RandomForestRegressor

**Random Forest Regression** is a versatile machine-learning technique for predicting numerical values. It combines the predictions of multiple decision trees to reduce overfitting and improve accuracy.

*Note: Code cell time was ~10 minutes so the below block will be commented out to improve notebook performance.*

Screenshot of code execution:
![](RandomForest.png)

In [13]:
#%%time

## Set variables to RandomForestRegressor
#model = RandomForestRegressor()
#model_name = 'RandomForestRegressor'

## Configure model_evaluation function to use RandomForestRegressor
#model_evaluation(model, model_name)

### LightGBM

**LightGBM** is a gradient boosting framework that uses tree based learning algorithms. It is designed to be distributed and efficient with the following advantages:

- Faster training speed and higher efficiency.
- Lower memory usage.
- Better accuracy.
- Support of parallel, distributed, and GPU learning.
- Capable of handling large-scale data.

In [14]:
%%time

# Set variables to LGBMRegressor
model = LGBMRegressor(random_state=12345)
model_name = 'LightGBMRegressor'

# Configure model_evaluation function to use LGBMRegressor
model_evaluation(model, model_name)

Best Parameters For LightGBMRegressor : {'max_depth': 10, 'n_estimators': 20}
These Parameters Produced a Best RMSE Score of -2215.83 on the Training Data

The RMSE Using the Tuned LightGBMRegressor on the Validation Set: 2214.98

Training Time for LightGBMRegressor :
CPU times: user 1min 30s, sys: 2.07 s, total: 1min 32s
Wall time: 1min 32s


### CatBoost

**CatBoost** is a popular and high-performance open-source implementation of the Gradient Boosting Decision Tree (GBDT) algorithm. GBDT is a supervised learning algorithm that attempts to accurately predict a target variable by combining an ensemble of estimates from a set of simpler and weaker models. CatBoost introduces two critical algorithmic advances to GBDT:

- The implementation of ordered boosting, a permutation-driven alternative to the classic algorithm
- An innovative algorithm for processing categorical features

Both techniques were created to fight a prediction shift caused by a special kind of target leakage present in all currently existing implementations of gradient boosting algorithms.

*Note: Maximum tree depth is 16 for CatBoost, so the `max_depth=20` candidates in my GridSearchCV parameter grid fail with the errors shown in the output below.*

In [15]:
%%time

# Set variables to CatBoostRegressor
model = CatBoostRegressor(random_state=12345)
model_name = 'CatBoostRegressor'

# Configure model_evaluation function to use CatBoostRegressor
model_evaluation(model, model_name)

Learning rate set to 0.5
0:	learn: 3382.5329315	total: 73.4ms	remaining: 294ms
1:	learn: 2837.8519096	total: 91.2ms	remaining: 137ms
2:	learn: 2580.7213565	total: 107ms	remaining: 71.4ms
3:	learn: 2436.8302618	total: 123ms	remaining: 30.7ms
4:	learn: 2352.0149568	total: 137ms	remaining: 0us
Learning rate set to 0.5
0:	learn: 3326.6096208	total: 16.8ms	remaining: 67.2ms
1:	learn: 2779.9464324	total: 32.5ms	remaining: 48.8ms
2:	learn: 2563.6401125	total: 52.7ms	remaining: 35.2ms
3:	learn: 2415.5326702	total: 68.2ms	remaining: 17.1ms
4:	learn: 2345.5690615	total: 82.2ms	remaining: 0us
Learning rate set to 0.5
0:	learn: 3318.4534757	total: 16.5ms	remaining: 66.1ms
1:	learn: 2765.9367869	total: 32ms	remaining: 48ms
2:	learn: 2548.0417131	total: 45.9ms	remaining: 30.6ms
3:	learn: 2400.4836789	total: 59.4ms	remaining: 14.8ms
4:	learn: 2331.6411311	total: 72.2ms	remaining: 0us
Learning rate set to 0.5
0:	learn: 3382.5329315	total: 14.9ms	remaining: 134ms
1:	learn: 2837.8519096	total: 29ms	rema

Learning rate set to 0.5
0:	learn: 3148.5167256	total: 39ms	remaining: 741ms
1:	learn: 2535.9655119	total: 79.5ms	remaining: 716ms
2:	learn: 2283.7595320	total: 125ms	remaining: 706ms
3:	learn: 2127.5706307	total: 172ms	remaining: 686ms
4:	learn: 2058.8549357	total: 218ms	remaining: 653ms
5:	learn: 2005.3748936	total: 268ms	remaining: 626ms
6:	learn: 1980.1597511	total: 314ms	remaining: 583ms
7:	learn: 1957.9740371	total: 358ms	remaining: 538ms
8:	learn: 1940.0234365	total: 405ms	remaining: 495ms
9:	learn: 1928.4449226	total: 451ms	remaining: 451ms
10:	learn: 1914.1855554	total: 499ms	remaining: 409ms
11:	learn: 1898.0281077	total: 545ms	remaining: 363ms
12:	learn: 1891.8801860	total: 597ms	remaining: 321ms
13:	learn: 1879.7650132	total: 642ms	remaining: 275ms
14:	learn: 1872.7351673	total: 689ms	remaining: 230ms
15:	learn: 1863.8389331	total: 734ms	remaining: 184ms
16:	learn: 1853.7606802	total: 787ms	remaining: 139ms
17:	learn: 1847.0741445	total: 832ms	remaining: 92.5ms
18:	learn: 1

Traceback (most recent call last):
  File "/opt/conda/envs/python3/lib/python3.9/site-packages/sklearn/model_selection/_validation.py", line 593, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/opt/conda/envs/python3/lib/python3.9/site-packages/catboost/core.py", line 5299, in fit
    return self._fit(X, y, cat_features, None, None, None, sample_weight, None, None, None, None, baseline,
  File "/opt/conda/envs/python3/lib/python3.9/site-packages/catboost/core.py", line 2021, in _fit
    train_params = self._prepare_train_params(
  File "/opt/conda/envs/python3/lib/python3.9/site-packages/catboost/core.py", line 1953, in _prepare_train_params
    _check_train_params(params)
  File "_catboost.pyx", line 5839, in _catboost._check_train_params
  File "_catboost.pyx", line 5858, in _catboost._check_train_params
_catboost.CatBoostError: catboost/private/libs/options/oblivious_tree_options.cpp:122: Maximum tree depth is 16

Learning rate set to 0.5
0:	learn: 3368.1980247	total: 29.8ms	remaining: 119ms
1:	learn: 2852.8239131	total: 56.1ms	remaining: 84.1ms
2:	learn: 2620.5226919	total: 80.8ms	remaining: 53.9ms
3:	learn: 2456.2310367	total: 101ms	remaining: 25.3ms
4:	learn: 2373.7004308	total: 122ms	remaining: 0us
Best Parameters For CatBoostRegressor : {'max_depth': 5, 'n_estimators': 5}
These Parameters Produced a Best RMSE Score of -2345.98 on the Training Data

The RMSE Using the Tuned CatBoostRegressor on the Validation Set: 2378.53

Training Time for CatBoostRegressor :
CPU times: user 28 s, sys: 1.25 s, total: 29.3 s
Wall time: 32 s
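One way to avoid the failed fits would be to cap the depth candidates for CatBoost before building the grid. The sketch below is only an assumption about how that could be wired in: `search_space_for` is a hypothetical helper that is not part of this notebook, and the limit of 16 is taken from the error message above.

```python
# Depth limit taken from the CatBoostError message in the output above
CATBOOST_MAX_DEPTH = 16

# Shared search space used for every model in this notebook
base_space = {'n_estimators': [5, 10, 20], 'max_depth': [5, 10, 20]}

def search_space_for(model_name, base=base_space):
    """Return a copy of the grid, capping max_depth for CatBoost (hypothetical helper)."""
    space = {key: list(values) for key, values in base.items()}
    if model_name == 'CatBoostRegressor':
        # Clamp depths above the limit, then drop duplicates while keeping order
        capped = [min(depth, CATBOOST_MAX_DEPTH) for depth in space['max_depth']]
        space['max_depth'] = list(dict.fromkeys(capped))
    return space

print(search_space_for('CatBoostRegressor'))
print(search_space_for('LightGBMRegressor'))
```

The trade-off is that the search space is no longer identical across models, so the comparison becomes slightly less uniform in exchange for every candidate actually being fitted.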


### XGBoost

**XGBoost** is an optimized distributed gradient boosting library designed to be highly efficient, flexible and portable. It implements machine learning algorithms under the Gradient Boosting framework. XGBoost provides parallel tree boosting (also known as GBDT, GBM) that solves many data science problems in a fast and accurate way. The same code runs on major distributed environments (Hadoop, SGE, MPI) and can solve problems beyond billions of examples.

*Note: Code cell time was over 45 minutes so the below block will be commented out to improve notebook performance.*

Screenshot of code execution:
![](XGBRegressor.png)

In [21]:
#%%time

## Set variables to XGBRegressor
#model = XGBRegressor(random_state=12345)
#model_name = 'XGBRegressor'

## Configure model_evaluation function to use XGBRegressor
#model_evaluation(model, model_name)

## Model Analysis

### Summary of Training and Tuning Results

Rusty Bargain is interested in the following attributes of models:
- The quality of the prediction
- The speed of the prediction
- The time required for training

These attributes are summarized in the table below *(Note: RMSE values were returned to positive values after model testing)*:
    
| Model | RMSE (Predicting Targets on Validation Set) | Time Required for Training + Tuning (s) | Best Hyperparameters |
|:----------:|:----------:|:----------:|:----------:|
| LinearRegression | 3178.07  | 11 (training only)  | N/A  |
| RandomForest  | 1791.91  | 610  | `max_depth=20`, `n_estimators=20`  |
| LightGBM  | 2214.98  | 265  | `max_depth=10`, `n_estimators=20`  |
| CatBoost  | 2378.53  | 66  | `max_depth=5`, `n_estimators=5`  |
| XGBoost | 1783.36  | 2973  | `max_depth=10`, `n_estimators=20`  |

All models tested outperformed the linear regression baseline, which is the expected result. 

Based on the attributes Rusty Bargain is interested in, the best model is either CatBoost or RandomForest. In terms of quality of the prediction, RandomForest produced an RMSE of 1791.91 compared to CatBoost's RMSE of 2378.53. In terms of training and tuning speed, CatBoost took only 66 seconds for hyperparameter tuning, whereas RandomForest took 610 seconds - almost 10 times longer. Based on Rusty Bargain's priorities (quality vs. speed) they should select one of these two models, but further analysis (with the test set) will be performed in the next section. 

I don't consider LightGBM to be a great choice because it produced only a modestly lower RMSE than CatBoost (2214.98 vs. 2378.53) while taking around 4 times longer to train and tune. 

I don't consider XGBoost to be a viable model because the training + tuning time was over 45 minutes. RandomForest is able to produce a comparable RMSE in roughly a fifth of the time.
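The speed comparisons above are simple ratios of the tuning times in the summary table. A short sketch makes the arithmetic explicit; the dictionary copies the table values, and the helper name `ratio` is an assumption.

```python
# Training + tuning times in seconds, copied from the summary table
tuning_seconds = {
    'LinearRegression': 11,
    'RandomForest': 610,
    'LightGBM': 265,
    'CatBoost': 66,
    'XGBoost': 2973,
}

def ratio(slower, faster):
    """How many times longer `slower` took than `faster`."""
    return tuning_seconds[slower] / tuning_seconds[faster]

print(f"RandomForest vs CatBoost: {ratio('RandomForest', 'CatBoost'):.1f}x")
print(f"LightGBM vs CatBoost:     {ratio('LightGBM', 'CatBoost'):.1f}x")
print(f"XGBoost vs RandomForest:  {ratio('XGBoost', 'RandomForest'):.1f}x")
print(f"XGBoost in minutes:       {tuning_seconds['XGBoost'] / 60:.0f}")
```

These figures describe one run on one machine, so they are best read as rough orders of magnitude rather than precise benchmarks.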

### Evaluate RMSE and Training and Prediction Speed

Using the results from the previous section, I will select the best models and re-train them with the training data in order to determine their training times. I will then use them to make predictions using the test data set and determine their prediction times and RMSE. As above, the LinearRegression model will be the baseline.

Models selected for this section:
- LinearRegression
- RandomForestRegressor (`n_estimators=20`, `max_depth=20`)
- CatBoost (`n_estimators=5`, `max_depth=5`)
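The same train/predict/score pattern is repeated for each selected model, so it could be factored into one helper. The sketch below is an assumption rather than the code used in this notebook: `timed_fit_predict` is a hypothetical name, the data is synthetic, and `time.perf_counter` is swapped in for `time.time` because it is intended for measuring short intervals.

```python
import time

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

def timed_fit_predict(model, X_train, y_train, X_test, y_test):
    """Time fit and predict separately; compute RMSE outside both timers."""
    start = time.perf_counter()
    model.fit(X_train, y_train)
    train_seconds = time.perf_counter() - start

    start = time.perf_counter()
    predictions = model.predict(X_test)
    predict_seconds = time.perf_counter() - start

    rmse = np.sqrt(mean_squared_error(y_test, predictions))
    return {'rmse': float(rmse), 'train_s': train_seconds, 'predict_s': predict_seconds}

# Synthetic data so the sketch runs on its own
rng = np.random.default_rng(12345)
X = rng.normal(size=(1000, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + rng.normal(0, 0.1, 1000)

result = timed_fit_predict(LinearRegression(), X[:800], y[:800], X[800:], y[800:])
print(result)
```

Stopping the prediction timer before the RMSE is computed keeps metric calculation and printing out of the reported prediction time, which matters most for models whose predictions take only a few milliseconds.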

### LinearRegression (Baseline)

In [18]:
%%time 

# Start training timer
start_train = time.time()

# Initialize model
model = LinearRegression()

# Fit model to training Data
model.fit(features_train, target_train)

# End training timer and calculate elapsed training time
end_train = time.time()
elapsed_train = end_train - start_train

# Start prediction timer
start_predict = time.time()

# Predict test target
predicted_values = model.predict(features_test)

# Calculate RMSE
RMSE = np.sqrt(mean_squared_error(target_test, predicted_values))

# Print RMSE score
print('The RMSE for the LinearRegression model is:', round(RMSE,2))

# End prediction timer and calculate elapsed prediction time
end_predict = time.time()
elapsed_predict = end_predict - start_predict

# Print elapsed train, elapsed predict, and code cell times
print('Elapsed time for LinearRegression training:', round(elapsed_train,2))
print('Elapsed time for LinearRegression prediction:', round(elapsed_predict,2))
print('Run time for code cell:')

The RMSE for the LinearRegression model is: 3185.54
Elapsed time for LinearRegression training: 9.14
Elapsed time for LinearRegression prediction: 0.19
Run time for code cell:
CPU times: user 8.13 s, sys: 1.22 s, total: 9.35 s
Wall time: 9.32 s


### RandomForest

In [16]:
%%time 

# Start training timer
start_train = time.time()

# Initialize model
model = RandomForestRegressor(random_state=12345, n_estimators=20, max_depth=20)

# Fit model to training Data
model.fit(features_train, target_train)

# End training timer and calculate elapsed training time
end_train = time.time()
elapsed_train = end_train - start_train

# Start prediction timer
start_predict = time.time()

# Predict test target
predicted_values = model.predict(features_test)

# Calculate RMSE
RMSE = np.sqrt(mean_squared_error(target_test, predicted_values))

# Print RMSE score
print('The RMSE for the RandomForestRegressor model is:', round(RMSE,2))

# End prediction timer and calculate elapsed prediction time
end_predict = time.time()
elapsed_predict = end_predict - start_predict

# Print elapsed train, elapsed predict, and code cell times
print('Elapsed time for RandomForestRegressor training:', round(elapsed_train,2))
print('Elapsed time for RandomForestRegressor prediction:', round(elapsed_predict,2))
print('Run time for code cell:')

The RMSE for the RandomForestRegressor model is: 1797.95
Elapsed time for RandomForestRegressor training: 51.84
Elapsed time for RandomForestRegressor prediction: 0.37
Run time for code cell:
CPU times: user 52.2 s, sys: 0 ns, total: 52.2 s
Wall time: 52.2 s


### CatBoost

In [20]:
%%time 

# Start training timer
start_train = time.time()

# Initialize model
model = CatBoostRegressor(random_state=12345, n_estimators=5, max_depth=5, verbose=0)

# Fit model to training Data
model.fit(features_train, target_train)

# End training timer and calculate elapsed training time
end_train = time.time()
elapsed_train = end_train - start_train

# Start prediction timer
start_predict = time.time()

# Predict test target
predicted_values = model.predict(features_test)

# Calculate RMSE
RMSE = np.sqrt(mean_squared_error(target_test, predicted_values))

# Print RMSE score
print('The RMSE for the CatBoostRegressor model is:', round(RMSE,2))

# End prediction timer and calculate elapsed prediction time
end_predict = time.time()
elapsed_predict = end_predict - start_predict

# Print elapsed train, elapsed predict, and code cell times
print('Elapsed time for CatBoostRegressor training:', round(elapsed_train,2))
print('Elapsed time for CatBoostRegressor prediction:', round(elapsed_predict,2))
print('Run time for code cell:')

The RMSE for the CatBoostRegressor model is: 2385.27
Elapsed time for CatBoostRegressor training: 1.06
Elapsed time for CatBoostRegressor prediction: 0.01
Run time for code cell:
CPU times: user 978 ms, sys: 0 ns, total: 978 ms
Wall time: 1.07 s


## Conclusion

Given the results in the previous section (using test data), **Rusty Bargain should implement a RandomForest model** to make its predictions. It produced the lowest RMSE (among the models tested) but isn't particularly fast. **If speed is preferred, I would instead go with CatBoost**, which is dramatically faster at the cost of higher RMSE. See table below for results and comparison.

| Model | RMSE (Predicting Targets on Test Set) | Time Required for Training (s) | Time Required for Prediction (s) | Best Hyperparameters |
|:----------:|:----------:|:----------:|:----------:|:----------:|
| LinearRegression | 3185.54  | 9.14  | 0.19 | N/A  |
| RandomForest  | 1797.95  | 51.84 | 0.37  | `n_estimators=20`, `max_depth=20`  |
| CatBoost  | 2385.27  |  1.06  |  0.01   | `n_estimators=5`, `max_depth=5`  |

# Checklist

Type 'x' to check. Then press Shift+Enter.

- [x]  Jupyter Notebook is open
- [x]  Code is error free
- [x]  The cells with the code have been arranged in order of execution
- [x]  The data has been downloaded and prepared
- [x]  The models have been trained
- [x]  The analysis of speed and quality of the models has been performed