### Lab 6: Random forest algorithm practice

OK, your turn. I'll set things up, then you build a random forest model to predict a car's selling price. **You'll also need to tune `max_depth` and `n_estimators` to minimize MSE.**

Follow the ML workflow:
1. Obtain and isolate the data
2. Split the data into training & test datasets
3. Format the data for the algorithm
4. Create an initial model and train it
5. Use the test set to measure the model's performance
6. Tune the model to minimize error
7. Use the best hyperparameters and create a new model
8. Use the model to make new predictions

In [2]:
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor # Here is the RF regressor
from sklearn.metrics import r2_score,mean_squared_error  # Use our past metrics
import pandas as pd
import numpy as np
import time
from matplotlib import pyplot as plt

### 1. Obtain and isolate the data

In [3]:
# Read data from a .csv (comma-separated-values) file in the local directory
# Adapted from: 
# https://www.kaggle.com/datasets/nehalbirla/vehicle-dataset-from-cardekho?select=Car+details+v3.csv
df = pd.read_csv('../data/cars.csv')  # Notice the path to the data file
df.head(4)

| | name | year | selling_price | km_driven | km/liter | engine | max_power | seats |
|---|---|---|---|---|---|---|---|---|
| 0 | Maruti Swift Dzire VDI | 2014 | 450000 | 145500 | 23.4 | 1248.0 | 74.0 | 5.0 |
| 1 | Skoda Rapid 1.5 TDI Ambition | 2014 | 370000 | 120000 | 21.14 | 1498.0 | 103.52 | 5.0 |
| 2 | Honda City 2017-2020 EXi | 2006 | 158000 | 140000 | 17.7 | 1497.0 | 78.0 | 5.0 |
| 3 | Hyundai i20 Sportz Diesel | 2010 | 225000 | 127000 | 23.0 | 1396.0 | 90.0 | 5.0 |


In [4]:
# Your code here

In [5]:
# kcolvin code:
# Pull out selling_price as the target (the value we will predict)
y = df['selling_price']
y.head(2)

0    450000
1    370000
Name: selling_price, dtype: int64

In [6]:
X = df.drop(['name','selling_price'],axis=1)
X.head(2)

| | year | km_driven | km/liter | engine | max_power | seats |
|---|---|---|---|---|---|---|
| 0 | 2014 | 145500 | 23.4 | 1248.0 | 74.0 | 5.0 |
| 1 | 2014 | 120000 | 21.14 | 1498.0 | 103.52 | 5.0 |


### 2. Split the data into training & test datasets

In [7]:
# Your code here

In [8]:
# kcolvin code
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20)
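
A note on reproducibility: `train_test_split` shuffles the rows randomly, so the metrics in the later steps change from run to run. The following is a minimal sketch of how the `random_state` argument pins the split. The `toy` DataFrame and the seed value 42 are illustrative assumptions, not part of the lab data.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data standing in for the car features and prices
toy = pd.DataFrame({
    "year": [2010, 2012, 2014, 2016, 2018, 2019, 2020, 2021, 2013, 2015],
    "km_driven": [120000, 90000, 70000, 50000, 30000,
                  20000, 15000, 5000, 80000, 60000],
    "selling_price": [150000, 200000, 300000, 400000, 550000,
                      600000, 650000, 700000, 250000, 350000],
})
X_toy = toy.drop(columns=["selling_price"])
y_toy = toy["selling_price"]

# The same seed sends the same rows to the test set on every call
X_tr1, X_te1, y_tr1, y_te1 = train_test_split(
    X_toy, y_toy, test_size=0.20, random_state=42)
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(
    X_toy, y_toy, test_size=0.20, random_state=42)

print(len(X_tr1), len(X_te1))            # 8 training rows, 2 test rows
print(X_te1.index.equals(X_te2.index))   # True: the split is reproducible
```

Leaving `random_state` unset, as in the cell above, is also a reasonable choice when the goal is to see how much the results vary between runs.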

### 3. Format the data for the algorithm

In [9]:
# your code here

In [10]:
# kcolvin code
# Confirm that the feature sets are pandas DataFrames
print(type(X_train))
print(type(X_test))

<class 'pandas.core.frame.DataFrame'>
<class 'pandas.core.frame.DataFrame'>


### 4. Create a baseline random forest model and train it

In [11]:
# your code here

In [12]:
# Create the RF regressor object
rfr = RandomForestRegressor(max_depth = None, n_estimators = 100) # default parameters
#
# Train the model using the training data
fit_rfr = rfr.fit(X_train, y_train.values.ravel())
# Show hyperparameters
fit_rfr 

RandomForestRegressor(bootstrap=True, ccp_alpha=0.0, criterion='mse',
                      max_depth=None, max_features='auto', max_leaf_nodes=None,
                      max_samples=None, min_impurity_decrease=0.0,
                      min_impurity_split=None, min_samples_leaf=1,
                      min_samples_split=2, min_weight_fraction_leaf=0.0,
                      n_estimators=100, n_jobs=None, oob_score=False,
                      random_state=None, verbose=0, warm_start=False)

### 5. Use the test set to measure the model's performance

In [13]:
# Your code here

In [14]:
# kcolvin code
# Predict selling_price values using the X_test data
y_pred = rfr.predict(X_test)

In [15]:
# Calculate the performance metrics
r2 = round(r2_score(y_test, y_pred),2)
mse = round(mean_squared_error(y_test, y_pred),2)
#
# Report the metrics
print("Coefficient of determination: %.2f" % r2)
print("MSE: ", mse)
print("RMSE: ", round(mse**(1/2.0),3)) # Root Mean Squared Error

Coefficient of determination: 0.97
MSE:  15616866698.79
RMSE:  124967.463
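
To make the three metrics concrete, here is a small sketch that computes them by hand with NumPy and checks the results against scikit-learn. The four actual and predicted values are made up for illustration. RMSE is the square root of MSE, so it is expressed in the same units as the target, which is why an RMSE of about 125,000 is easier to interpret than the MSE above.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Made-up actual and predicted prices, for illustration only
y_true = np.array([100.0, 200.0, 300.0, 400.0])
y_hat = np.array([110.0, 190.0, 320.0, 380.0])

# MSE: the mean of the squared residuals
mse_manual = np.mean((y_true - y_hat) ** 2)

# RMSE: the square root of MSE, in the same units as the target
rmse_manual = np.sqrt(mse_manual)

# R^2: 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum((y_true - y_hat) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print(mse_manual)   # 250.0
print(r2_manual)    # 0.98

# The hand calculations agree with scikit-learn
print(np.isclose(mse_manual, mean_squared_error(y_true, y_hat)))  # True
print(np.isclose(r2_manual, r2_score(y_true, y_hat)))             # True
```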


### 6. Tune max_depth and n_estimators to minimize error

In [16]:
# your code here

In [17]:
# kcolvin code
# Set up the search space
max_depths = np.linspace(1, 6, 6, endpoint=True)
print(max_depths)
n_estimators = np.array([50,100,200,300,400])
print(n_estimators)
#
# The goal is to minimize mse
best_mse = float('inf') # set this value very high. We will try to minimize it
best_n = 0  # Will keep track of n_estimators
best_md = 0 # Will keep track of max_depth
#
# Let's time how long this takes
st = time.time() # time right now (time was imported in the first cell)
# 
# Use nested for loops to search through each combination of n_estimators and max_depth
for md in max_depths:
    for n in n_estimators:
        # Do the workflow
        # np.linspace yields floats; newer scikit-learn releases reject a float max_depth
        rfr = RandomForestRegressor(max_depth = int(md), n_estimators = int(n))
        fit_rfr = rfr.fit(X_train, y_train.values.ravel() )
        y_pred = rfr.predict(X_test)
        mse = round(mean_squared_error(y_test, y_pred),2)
        if mse < best_mse: # If the mse is lower, then update current variable values
            best_mse = mse 
            best_n = n
            best_md = md
#
# Get the end time
et = time.time()
#
# get the elapsed time
elapsed_time = et - st
# Report results of search
print('Execution time:', elapsed_time, 'seconds')
print('Best MSE:', best_mse)
print('Best Max Depth:', best_md)
print('Best n_estimators:', best_n)

[1. 2. 3. 4. 5. 6.]
[ 50 100 200 300 400]
Execution time: 26.23802089691162 seconds
Best MSE: 27993497673.97
Best Max Depth: 6.0
Best n_estimators: 300
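
One caveat about the search above: it selects hyperparameters by scoring each candidate on the test set, so the reported best MSE is likely an optimistic estimate for unseen data. A common alternative is to cross-validate on the training data only. The following is a minimal sketch using scikit-learn's `GridSearchCV`. The synthetic data from `make_regression`, the small grid, and `cv=3` are assumptions chosen so the example runs quickly; they are not the lab's settings.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data (an assumption; swap in X and y from the lab)
X_demo, y_demo = make_regression(
    n_samples=200, n_features=6, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.20, random_state=0)

# A deliberately small grid so the sketch runs quickly
param_grid = {"max_depth": [2, 4, 6], "n_estimators": [25, 50]}

search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid,
    scoring="neg_mean_squared_error",  # scores are maximized, so MSE is negated
    cv=3,                              # 3-fold cross-validation on the training data
)
search.fit(X_tr, y_tr)

print("Best parameters:", search.best_params_)
print("Cross-validated MSE:", -search.best_score_)

# The test set is used once, at the end. By default GridSearchCV refits
# the best model on all of the training data before predicting.
test_mse = mean_squared_error(y_te, search.predict(X_te))
print("Test MSE:", test_mse)
```

The nested loops in the lab and this grid search explore the same kind of search space; the difference is which data is used to compare the candidates.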


### 7. Use the best hyperparameters and create a new model

In [18]:
# your code here

In [22]:
# Run this cell several times and discuss the output.
# best_md is a float from np.linspace, so cast it to int for scikit-learn
rfr = RandomForestRegressor(max_depth = int(best_md), n_estimators = int(best_n))
fit_rfr = rfr.fit(X_train, y_train.values.ravel() )
y_pred = rfr.predict(X_test)
#
# Calculate the metrics
r2 = round(r2_score(y_test, y_pred),2)
mse = round(mean_squared_error(y_test, y_pred),2)
#
print("Coefficient of determination: %.2f" % r2)
print("MSE: ", mse)
print("RMSE: ", round(mse**(1/2.0),3)) # Root Mean Squared Error

Coefficient of determination: 0.95
MSE:  28964963716.71
RMSE:  170190.963


### 8. Use the model to make new predictions

In [20]:
# your code here

In [21]:
# kcolvin code
# Predict the selling price of several 'new' cars
#
# Recall the Features: 
#   ['year', 'km_driven', 'km/liter', 'engine', 'max_power', 'seats']
#
c1 = [2013.0, 192500.0, 22.32, 1582.0, 126.3, 5.0]
c2 = [2015.0, 9000.0, 9.4, 2179.0, 120.0, 7.0]
c3 = [2017.0, 99000.0, 13.6, 1999.0, 177.0, 4.0]
c4 = [2019.0, 256000.0, 24.0, 1120.0, 70.0, 5.0]
c5 = [2021.0, 5000.0, 36.1, 796.0, 37.0, 2.0]
#
c_lst = [c1, c2, c3, c4, c5]
#
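for car in c_lst:
    print('Car features:', car)
    # Build a one-row DataFrame with the training column names
    # (avoids overwriting df and scikit-learn's feature-name warning)
    car_df = pd.DataFrame([car], columns=X.columns)
    pv = int(rfr.predict(car_df)[0])
    print('Predicted Value in some currency:', pv , '\n')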

Car features: [2013.0, 192500.0, 22.32, 1582.0, 126.3, 5.0]
Predicted Value in some currency: 439147 

Car features: [2015.0, 9000.0, 9.4, 2179.0, 120.0, 7.0]
Predicted Value in some currency: 829973 

Car features: [2017.0, 99000.0, 13.6, 1999.0, 177.0, 4.0]
Predicted Value in some currency: 2952125 

Car features: [2019.0, 256000.0, 24.0, 1120.0, 70.0, 5.0]
Predicted Value in some currency: 514639 

Car features: [2021.0, 5000.0, 36.1, 796.0, 37.0, 2.0]
Predicted Value in some currency: 295211 
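
Because `predict` accepts a two-dimensional input with one row per sample, several cars can also be scored in a single call instead of a loop. The sketch below shows the idea. The `demo_model` is fitted on random numbers purely to obtain a trained estimator, so its predictions are meaningless; in the lab you would pass the DataFrame to the tuned `rfr` instead.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

feature_names = ['year', 'km_driven', 'km/liter', 'engine', 'max_power', 'seats']

# Hypothetical training data: random values, used only to get a fitted model
rng = np.random.default_rng(0)
X_demo = pd.DataFrame(rng.random((50, 6)), columns=feature_names)
y_demo = rng.random(50)
demo_model = RandomForestRegressor(n_estimators=10, random_state=0)
demo_model.fit(X_demo, y_demo)

# Two of the cars from the cell above, one row per car
new_cars = pd.DataFrame(
    [[2013.0, 192500.0, 22.32, 1582.0, 126.3, 5.0],
     [2015.0, 9000.0, 9.4, 2179.0, 120.0, 7.0]],
    columns=feature_names,
)

# One call returns one prediction per row
preds = demo_model.predict(new_cars)
print(preds.shape)  # (2,)
```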

