# **Milestone 2**

## **Model Building**

1. What we want to predict is the "Price". We will use the log-transformed version 'Price_log' for modeling.
2. Before we proceed to the model, we'll have to encode categorical features. We will drop categorical features like Name. 
3. We'll split the data into train and test, to be able to evaluate the model that we build on the train data.
4. Build Regression models using train data.
5. Evaluate the model performance.

**Note:** Please load the data frame that was saved in Milestone 1 here before separating the data, and then proceed to the next step in Milestone 2.

### **Load the data**

In [1]:
import pandas as pd



In [2]:
cars_data = pd.read_csv("cars_data_updated_cleaned_3.csv")

In [3]:
cars_data.describe(include = ['object']).T

| | count | unique | top | freq |
|---|---|---|---|---|
| Name | 7249 | 2038 | Mahindra XUV500 W8 2WD | 55 |
| Location | 7249 | 11 | Mumbai | 949 |
| Fuel_Type | 7249 | 5 | Diesel | 3849 |
| Transmission | 7249 | 2 | Manual | 5204 |
| Owner_Type | 7249 | 4 | First | 5949 |
| Brand | 7249 | 32 | Maruti | 1444 |
| Model | 7249 | 218 | Swift | 418 |
| Brand_model | 7249 | 222 | Maruti_Swift | 418 |


In [4]:
cars_data.describe().T

| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| Year | 7249.0 | 2013.36474 | 3.254637 | 1996.0 | 2011.0 | 2014.0 | 2016.0 | 2019.0 |
| Kilometers_Driven | 7249.0 | 57723.521175 | 36541.92268 | 171.0 | 34000.0 | 53416.0 | 73000.0 | 720000.0 |
| Mileage | 7249.0 | 18.318437 | 4.148297 | 7.5 | 15.29 | 18.16 | 21.1 | 33.54 |
| Engine | 7249.0 | 1615.675417 | 591.603602 | 72.0 | 1198.0 | 1493.0 | 1968.0 | 5998.0 |
| Power | 7249.0 | 112.341877 | 52.850118 | 34.2 | 75.0 | 94.0 | 138.03 | 616.0 |
| Seats | 7249.0 | 5.279487 | 0.806445 | 2.0 | 5.0 | 5.0 | 5.0 | 10.0 |
| New_price | 7249.0 | 21.218825 | 24.325179 | 3.91 | 7.92 | 11.48 | 22.37 | 375.0 |
| Price | 7249.0 | 9.322196 | 10.685974 | 0.44 | 3.5 | 5.6 | 9.89 | 100.0 |
| kilometers_driven_log | 7249.0 | 10.760273 | 0.713008 | 5.141664 | 10.434116 | 10.885866 | 11.198215 | 13.487006 |
| Price_log | 7249.0 | 1.819207 | 0.864312 | -0.820981 | 1.252763 | 1.722767 | 2.291524 | 4.60517 |


### **Split the Data**

<li>Step1: Separating the independent variables (X) and the dependent variable (y).
<li>Step2: Eliminate variables adding unnecessary complexity.
<li>Step3: Encode the categorical variables in X using pd.get_dummies.
<li>Step4: Split the data into train and test using train_test_split.

**Think about it:** Why should we drop 'Name', 'Price', 'Price_log', and 'Kilometers_Driven' from X before splitting?
- We don't need the Name column, since we have already extracted the relevant information into the Brand and Model columns. We can also drop the Brand_model column, since it is redundant with Brand and Model and may lead to overfitting.
- Price and Price_log are the dependent variables we are trying to model, so we don't want to use them as inputs.
- We are using the log-transformed 'kilometers_driven_log' to account for the kilometers driven, so we can drop the raw 'Kilometers_Driven' from the inputs.
- Let's also drop Model, since it has 218 unique values and would add over 200 dummy columns after encoding.
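To make the encoding step concrete, here is a minimal sketch of what `pd.get_dummies(..., drop_first=True)` does to a categorical column. The toy values are made up for illustration; only the column names mirror the dataset.

```python
import pandas as pd

# Toy frame with hypothetical values; only the column names mirror the dataset.
toy = pd.DataFrame({
    "Transmission": ["Manual", "Automatic", "Manual"],
    "Fuel_Type": ["Diesel", "Petrol", "CNG"],
    "Power": [75.0, 138.0, 58.0],
})

# drop_first=True removes the first (alphabetical) level of each categorical
# column, so a column with k levels becomes k-1 indicator columns.
encoded = pd.get_dummies(toy, drop_first=True, dtype=float)
print(encoded.columns.tolist())
```

The dropped level acts as the baseline: a row with zeros in every `Fuel_Type_*` column is a CNG car in this toy example. Dropping one level per variable avoids perfect collinearity with the constant term.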

In [5]:
# Step-1 and Step-2: separate X and y, dropping the columns discussed above
X = cars_data.drop(['Name','Price','Price_log','Kilometers_Driven','Brand_model','Model'], axis = 1)


y = cars_data[["Price_log", "Price"]]

X.describe(include = ['object']).T

| | count | unique | top | freq |
|---|---|---|---|---|
| Location | 7249 | 11 | Mumbai | 949 |
| Fuel_Type | 7249 | 5 | Diesel | 3849 |
| Transmission | 7249 | 2 | Manual | 5204 |
| Owner_Type | 7249 | 4 | First | 5949 |
| Brand | 7249 | 32 | Maruti | 1444 |


In [6]:
X.describe().T

| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| Year | 7249.0 | 2013.36474 | 3.254637 | 1996.0 | 2011.0 | 2014.0 | 2016.0 | 2019.0 |
| Mileage | 7249.0 | 18.318437 | 4.148297 | 7.5 | 15.29 | 18.16 | 21.1 | 33.54 |
| Engine | 7249.0 | 1615.675417 | 591.603602 | 72.0 | 1198.0 | 1493.0 | 1968.0 | 5998.0 |
| Power | 7249.0 | 112.341877 | 52.850118 | 34.2 | 75.0 | 94.0 | 138.03 | 616.0 |
| Seats | 7249.0 | 5.279487 | 0.806445 | 2.0 | 5.0 | 5.0 | 5.0 | 10.0 |
| New_price | 7249.0 | 21.218825 | 24.325179 | 3.91 | 7.92 | 11.48 | 22.37 | 375.0 |
| kilometers_driven_log | 7249.0 | 10.760273 | 0.713008 | 5.141664 | 10.434116 | 10.885866 | 11.198215 | 13.487006 |


In [7]:

# Import Statsmodels 
import statsmodels.api as sm




In [8]:
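# Step-3: one-hot encode the categorical variables with pd.get_dummies(drop_first = True)
# dtype = float keeps the dummy columns numeric so statsmodels can use them directly
X = pd.get_dummies(X, drop_first = True, dtype = float)
# Add a constant (intercept) column to X
X = sm.add_constant(X)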

### **Check for Multicollinearity**

We will use the Variance Inflation Factor (VIF) to check for multicollinearity in the data. We can then eliminate collinear columns before we split the data for modeling.

Features with a VIF score > 5 will be dropped one at a time until all remaining features have a VIF score < 5. The constant term is kept regardless of its VIF.
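For reference, the VIF of feature $j$ is $1 / (1 - R_j^2)$, where $R_j^2$ comes from regressing feature $j$ on all the other features. The sketch below computes this by hand with NumPy on synthetic data; the `vif` helper and the demo arrays are illustrative and are not part of the notebook.

```python
import numpy as np

def vif(X, j):
    """VIF of column j: 1 / (1 - R^2) from regressing column j on the other columns."""
    y = X[:, j]
    others = np.delete(X, j, axis=1)
    A = np.column_stack([np.ones(len(y)), others])  # intercept + other features
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    r2 = 1.0 - resid.var() / y.var()
    return 1.0 / (1.0 - r2)

rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = rng.normal(size=500)
c = a + 0.1 * rng.normal(size=500)   # nearly a copy of a
X_demo = np.column_stack([a, b, c])

vifs = [vif(X_demo, j) for j in range(X_demo.shape[1])]
print(vifs)
```

Columns `a` and `c` carry almost the same information, so both receive a large VIF, while the independent column `b` stays close to 1.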

In [9]:
#create a separate copy of X from which to remove multicollinear variables
X_col = X.copy()

In [10]:
from statsmodels.stats.outliers_influence import variance_inflation_factor

def checking_vif(train):
    vif = pd.DataFrame()
    vif["feature"] = train.columns

    # Calculating VIF for each feature
    vif["VIF"] = [
        variance_inflation_factor(train.values, i) for i in range(len(train.columns))
    ]
    return vif.sort_values(by='VIF',ascending = False)


checking_vif(X_col)



| | feature | VIF |
|---|---|---|
| 0 | const | 864005.442069 |
| 43 | Brand_Maruti | 1171.86532 |
| 36 | Brand_Hyundai | 1106.644797 |
| 35 | Brand_Honda | 676.415291 |
| 54 | Brand_Toyota | 478.314293 |
| 44 | Brand_Mercedes-Benz | 367.195872 |
| 55 | Brand_Volkswagen | 359.71412 |
| 33 | Brand_Ford | 338.578222 |
| 42 | Brand_Mahindra | 320.853973 |
| 27 | Brand_BMW | 304.213966 |


We have more than 50 variables, with 'Brand_Maruti' having the highest VIF score apart from the constant. Let's start eliminating the highest-VIF column and then rerunning the function to check the updated VIF scores.

In [11]:
#drop Brand_Maruti, and then rerun
X_col = X_col.drop('Brand_Maruti', axis = 1)

print(checking_vif(X_col))

                      feature            VIF
0                       const  858622.053918
21           Fuel_Type_Petrol      32.294320
18           Fuel_Type_Diesel      31.157081
3                      Engine      10.957605
4                       Power       9.865058
6                   New_price       8.491853
16            Location_Mumbai       3.963368
2                     Mileage       3.890703
12         Location_Hyderabad       3.715357
10        Location_Coimbatore       3.505804
14             Location_Kochi       3.501858
17              Location_Pune       3.441166
43        Brand_Mercedes-Benz       3.309937
15           Location_Kolkata       3.134063
11             Location_Delhi       3.110871
27                  Brand_BMW       3.082426
9            Location_Chennai       2.949365
26                 Brand_Audi       2.801954
13            Location_Jaipur       2.664216
5                       Seats       2.562070
41                 Brand_Land       2.533784
8         

Let's drop Fuel_Type_Petrol, since it has the highest VIF score above 5 and see how that adjusts the data.

In [12]:

#drop Fuel_Type_Petrol, and then rerun
X_col = X_col.drop('Fuel_Type_Petrol', axis = 1)

print(checking_vif(X_col))

                      feature            VIF
0                       const  858324.269285
3                      Engine      10.806339
4                       Power       9.841764
6                   New_price       8.491842
16            Location_Mumbai       3.961293
12         Location_Hyderabad       3.714755
2                     Mileage       3.700159
10        Location_Coimbatore       3.504817
14             Location_Kochi       3.501827
17              Location_Pune       3.440980
42        Brand_Mercedes-Benz       3.308882
15           Location_Kolkata       3.133238
11             Location_Delhi       3.110392
26                  Brand_BMW       3.082335
9            Location_Chennai       2.948726
25                 Brand_Audi       2.800278
18           Fuel_Type_Diesel       2.720657
13            Location_Jaipur       2.662805
5                       Seats       2.558152
40                 Brand_Land       2.532729
8          Location_Bangalore       2.483767
21        

In [13]:
#drop Engine, which now has the highest VIF above 5, and then rerun
X_col = X_col.drop('Engine', axis = 1)

print(checking_vif(X_col))

                      feature            VIF
0                       const  852561.403222
5                   New_price       8.446555
3                       Power       4.765931
15            Location_Mumbai       3.961112
11         Location_Hyderabad       3.714217
9         Location_Coimbatore       3.504732
13             Location_Kochi       3.501551
16              Location_Pune       3.440690
2                     Mileage       3.336779
41        Brand_Mercedes-Benz       3.304356
14           Location_Kolkata       3.133212
10             Location_Delhi       3.110390
25                  Brand_BMW       3.042729
8            Location_Chennai       2.948444
24                 Brand_Audi       2.767441
12            Location_Jaipur       2.662765
39                 Brand_Land       2.527176
7          Location_Bangalore       2.483722
4                       Seats       2.403013
17           Fuel_Type_Diesel       2.299107
20        Transmission_Manual       2.263687
1         

In [14]:
#drop New_price, the only remaining feature with VIF above 5, and then rerun
X_col = X_col.drop('New_price', axis = 1)

print(checking_vif(X_col))

                      feature            VIF
0                       const  851313.297225
3                       Power       4.059524
14            Location_Mumbai       3.961107
10         Location_Hyderabad       3.714170
8         Location_Coimbatore       3.503165
12             Location_Kochi       3.500566
15              Location_Pune       3.440669
2                     Mileage       3.334755
13           Location_Kolkata       3.133190
9              Location_Delhi       3.110378
7            Location_Chennai       2.948444
11            Location_Jaipur       2.662615
6          Location_Bangalore       2.483383
4                       Seats       2.376239
16           Fuel_Type_Diesel       2.295888
40        Brand_Mercedes-Benz       2.284316
19        Transmission_Manual       2.263511
24                  Brand_BMW       2.210777
1                        Year       2.199768
23                 Brand_Audi       1.965199
50               Brand_Toyota       1.895720
5       ki

**Multicollinearity analysis complete - all features other than the constant have a VIF score < 5**

- Now that we have eliminated features with significant multicollinearity, we can proceed to create our datasets for training and testing.
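The manual drop-and-recheck steps above can also be written as a loop. The following is a minimal sketch under stated assumptions: `vif_table` and `drop_high_vif` are hypothetical helpers, the VIF is computed by hand with an intercept added internally (so the input frame should not contain the `const` column), and the demo data is synthetic.

```python
import numpy as np
import pandas as pd

def vif_table(df):
    """VIF for every column of df (an intercept is added internally)."""
    X = df.to_numpy(dtype=float)
    out = {}
    for j, name in enumerate(df.columns):
        y = X[:, j]
        A = np.column_stack([np.ones(len(y)), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ coef
        out[name] = y.var() / resid.var()   # equals 1 / (1 - R^2)
    return pd.Series(out).sort_values(ascending=False)

def drop_high_vif(df, threshold=5.0):
    """Repeatedly drop the single highest-VIF column until all are below threshold."""
    df = df.copy()
    dropped = []
    while df.shape[1] > 1:
        table = vif_table(df)
        if table.iloc[0] < threshold:
            break
        dropped.append(table.index[0])
        df = df.drop(columns=table.index[0])
    return df, dropped

# Synthetic demo: "power" is almost a copy of "engine", "mileage" is independent
rng = np.random.default_rng(1)
demo = pd.DataFrame({"engine": rng.normal(size=300)})
demo["power"] = demo["engine"] + 0.2 * rng.normal(size=300)
demo["mileage"] = rng.normal(size=300)

reduced, dropped = drop_high_vif(demo)
print(dropped, list(reduced.columns))
```

Dropping one column per iteration matters because removing a single feature can lower the VIF of several others, as happened with the brand dummies once Brand_Maruti was removed.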

In [15]:
y.isna().sum()

Price_log    0
Price        0
dtype: int64

In [16]:
# Step-4: splitting data into training and test sets

# Import library for preparing data
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X_col, y, test_size = 0.30, random_state = 1)


print(X_train.shape, X_test.shape)

(5074, 53) (2175, 53)


In [17]:
# Let us write a function for calculating r2_score and RMSE on train and test data
# It takes a fitted model as input, predicts Price_log on X_train and X_test,
# converts the predictions back to the Price scale, and returns the scores

import numpy as np
from sklearn import metrics

def get_model_score(model, flag = True):
    '''
    model : fitted regressor that predicts Price_log from X_train / X_test
    flag  : if True, print the scores in addition to returning them

    '''
    # Defining an empty list to store train and test results
    score_list = [] 
    
    # Predictions are on the log scale, so exponentiate to get back to Price
    pred_train_ = np.exp(model.predict(X_train))
    pred_test_ = np.exp(model.predict(X_test))
    
    train_r2 = metrics.r2_score(y_train['Price'], pred_train_)
    test_r2 = metrics.r2_score(y_test['Price'], pred_test_)
    
    # np.sqrt is used so the code does not rely on the `squared` argument,
    # which is deprecated in recent scikit-learn releases
    train_rmse = np.sqrt(metrics.mean_squared_error(y_train['Price'], pred_train_))
    test_rmse = np.sqrt(metrics.mean_squared_error(y_test['Price'], pred_test_))
    
    # Adding all scores in the list
    score_list.extend((train_r2, test_r2, train_rmse, test_rmse))
    
    # The scores are printed only if the flag is True (the default)
    if flag: 
        print("R-square on training set : ", train_r2)
        print("R-square on test set : ", test_r2)
        print("RMSE on training set : ", train_rmse)
        print("RMSE on test set : ", test_rmse)
    
    # Returning the list with train and test scores
    return score_list
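Because the models are fit on Price_log but scored on Price, the back-transform is the key step in the function above. The sketch below reproduces the two metrics by hand on a few made-up numbers; the prices and prediction errors are hypothetical and serve only to show the arithmetic.

```python
import numpy as np

# Hypothetical prices and log-scale predictions, for illustration only
price = np.array([3.5, 5.6, 9.9, 20.0])
pred_log = np.log(price) + np.array([0.05, -0.10, 0.08, -0.02])

pred_price = np.exp(pred_log)            # undo the log transform

ss_res = np.sum((price - pred_price) ** 2)
ss_tot = np.sum((price - price.mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot               # same definition as metrics.r2_score
rmse = np.sqrt(np.mean((price - pred_price) ** 2))
print(r2, rmse)
```

A fixed error on the log scale becomes a larger absolute error for expensive cars, so the RMSE on the Price scale is dominated by the high-priced listings.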

<hr>

For Regression Problems, some of the algorithms used are :<br>

**1) Linear Regression** <br>
**2) Ridge / Lasso Regression** <br>
**3) Decision Trees** <br>
**4) Random Forest** <br>

### **Fitting a linear model**

Linear Regression can be implemented using: <br>

**1) Sklearn:** https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html <br>
**2) Statsmodels:** https://www.statsmodels.org/stable/regression.html

Let's first model the data using the LinearRegression function found in the sklearn package.

In [18]:
# Import Linear Regression from sklearn
from sklearn.linear_model import LinearRegression

In [19]:
# Create a linear regression model
lr = LinearRegression()

In [20]:
# Fit linear regression model
lr.fit(X_train, y_train['Price_log']) 

LinearRegression()

In [21]:
# Get score of the model
LR_score = get_model_score(lr)



R-square on training set :  0.8719205541656581
R-square on test set :  0.8551970455891145
RMSE on training set :  3.8811187240917393
RMSE on test set :  3.9206798561514185


**Observations from results:**
- This model has high predictive power: the R-square values are .87 on the training set and .86 on the test set, indicating that the input features explain roughly 86-87% of the variance in price.
- The model generalizes well, since the R-square values differ by only about .017 between the training set and the test set; there is no sign of overfitting.
- The RMSE values are similar (3.88 on training, 3.92 on test, in the units of Price), again indicating that performance carries over to the test set.

**Let's now build an Ordinary Least Squares (OLS) Linear Regression model using the statsmodels package.**

In [22]:
#Start the OLS training and test sets from the current X_train, X_test; insignificant columns are removed from them later
X_train_ols =X_train
X_test_ols = X_test

#create ols model function

def build_ols_model(train):
    
    # Create the model
    olsmodel = sm.OLS(y_train["Price_log"], train)
    
    return olsmodel.fit()


# Fit linear model on new dataset
olsmodel1 = build_ols_model(X_train_ols)

print(olsmodel1.summary())

                            OLS Regression Results                            
Dep. Variable:              Price_log   R-squared:                       0.921
Model:                            OLS   Adj. R-squared:                  0.920
Method:                 Least Squares   F-statistic:                     1152.
Date:                Sat, 17 Dec 2022   Prob (F-statistic):               0.00
Time:                        07:57:09   Log-Likelihood:                -51.255
No. Observations:                5074   AIC:                             206.5
Df Residuals:                    5022   BIC:                             546.2
Df Model:                          51                                         
Covariance Type:            nonrobust                                         
                                coef    std err          t      P>|t|      [0.025      0.975]
---------------------------------------------------------------------------------------------
const                 

In [23]:
#Write a new get_model_score function that takes the OLS train/test data as arguments, as we plan to eliminate any insignificant variables to optimize the model.
def get_model_score_ols(model, X_train_ols, X_test_ols, flag = True):
    '''
    model : fitted regressor that predicts Price_log
    X_train_ols, X_test_ols : feature sets with the same columns the model was fit on
    flag : if True, print the scores in addition to returning them

    '''
    # Defining an empty list to store train and test results
    score_list = [] 
    
    # Predictions are on the log scale, so exponentiate to get back to Price
    pred_train_ = np.exp(model.predict(X_train_ols))
    pred_test_ = np.exp(model.predict(X_test_ols))
    
    train_r2 = metrics.r2_score(y_train['Price'], pred_train_)
    test_r2 = metrics.r2_score(y_test['Price'], pred_test_)
    
    # np.sqrt is used so the code does not rely on the `squared` argument,
    # which is deprecated in recent scikit-learn releases
    train_rmse = np.sqrt(metrics.mean_squared_error(y_train['Price'], pred_train_))
    test_rmse = np.sqrt(metrics.mean_squared_error(y_test['Price'], pred_test_))
    
    # Adding all scores in the list
    score_list.extend((train_r2, test_r2, train_rmse, test_rmse))
    
    # The scores are printed only if the flag is True (the default)
    if flag: 
        print("R-square on training set : ", train_r2)
        print("R-square on test set : ", test_r2)
        print("RMSE on training set : ", train_rmse)
        print("RMSE on test set : ", test_rmse)
    
    # Returning the list with train and test scores
    return score_list

In [24]:
get_model_score_ols(olsmodel1, X_train_ols, X_test_ols)

R-square on training set :  0.8719205541656947
R-square on test set :  0.8551970455891215
RMSE on training set :  3.881118724091183
RMSE on test set :  3.9206798561513243


[0.8719205541656947, 0.8551970455891215, 3.881118724091183, 3.9206798561513243]

- While the R-square and RMSE values were nearly identical to the LinearRegression results, 13 features were found to have a p-value greater than or equal to .05.
- This model should be revised further by removing insignificant input variables to see if results may be improved.

Let's remove the insignificant columns (i.e., p-values >= .05) from the training and test sets and run the OLS model again.

In [25]:
#Let's omit features with p-values greater than or equal to .05 from the OLS training and test sets.

olsmod_omit = pd.DataFrame(olsmodel1.params, columns = ['coef'])
olsmod_omit['pval'] = olsmodel1.pvalues

# Keep only the insignificant features (pval >= 0.05), sorted by descending p-value
olsmod_omit = olsmod_omit[olsmod_omit.pval>= 0.05].sort_values(by = "pval", ascending = False)
    
#identify list of insignificant features that can be dropped for the next OLS iteration
insig_var_list = list(olsmod_omit.index.values)
#remove insignificant features from ols train and test sets
X_train_ols= X_train_ols.drop(insig_var_list,axis=1)
X_test_ols=  X_test_ols.drop(insig_var_list,axis=1)
    
X_train_ols.columns



Index(['const', 'Year', 'Mileage', 'Power', 'Seats', 'kilometers_driven_log',
       'Location_Bangalore', 'Location_Coimbatore', 'Location_Delhi',
       'Location_Hyderabad', 'Location_Jaipur', 'Location_Kolkata',
       'Location_Mumbai', 'Location_Pune', 'Fuel_Type_Diesel',
       'Fuel_Type_Electric', 'Transmission_Manual', 'Owner_Type_Second',
       'Owner_Type_Third', 'Brand_Audi', 'Brand_BMW', 'Brand_Chevrolet',
       'Brand_Datsun', 'Brand_Fiat', 'Brand_Ford', 'Brand_Hindustan',
       'Brand_Honda', 'Brand_Jaguar', 'Brand_Jeep', 'Brand_Land',
       'Brand_Mahindra', 'Brand_Mercedes-Benz', 'Brand_Mini',
       'Brand_Mitsubishi', 'Brand_OpelCorsa', 'Brand_Porsche', 'Brand_Skoda',
       'Brand_Tata', 'Brand_Toyota', 'Brand_Volvo'],
      dtype='object')

In [26]:
X_train_ols.shape

(5074, 40)

In [27]:
#let's see the insignificant variables removed from the dataset
print(insig_var_list)

['Brand_Bentley', 'Owner_Type_Fourth & Above', 'Brand_Hyundai', 'Brand_Isuzu', 'Fuel_Type_LPG', 'Brand_Force', 'Location_Kochi', 'Brand_Volkswagen', 'Brand_Nissan', 'Brand_ISUZU', 'Location_Chennai', 'Brand_Renault', 'Brand_Smart']


In [28]:
# Fit linear model on new dataset
olsmodel2 = build_ols_model(X_train_ols)

print(olsmodel2.summary())

                            OLS Regression Results                            
Dep. Variable:              Price_log   R-squared:                       0.921
Model:                            OLS   Adj. R-squared:                  0.920
Method:                 Least Squares   F-statistic:                     1543.
Date:                Sat, 17 Dec 2022   Prob (F-statistic):               0.00
Time:                        07:57:09   Log-Likelihood:                -62.359
No. Observations:                5074   AIC:                             202.7
Df Residuals:                    5035   BIC:                             457.5
Df Model:                          38                                         
Covariance Type:            nonrobust                                         
                            coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------------------------------------------------------
const                  -237.17

In [29]:
ols2score = get_model_score_ols(olsmodel2, X_train_ols, X_test_ols)

R-square on training set :  0.8705021023453945
R-square on test set :  0.8538188902424512
RMSE on training set :  3.902550818600047
RMSE on test set :  3.939293116906806


In [30]:
#Check to make sure no insignificant p-values remain
olsmod_omit = pd.DataFrame(olsmodel2.params, columns = ['coef'])
olsmod_omit['pval'] = olsmodel2.pvalues

# Keep only the insignificant features (pval >= 0.05), sorted by descending p-value
olsmod_omit = olsmod_omit[olsmod_omit.pval>= 0.05].sort_values(by = "pval", ascending = False)
    
#identify list of insignificant features that would be dropped in another OLS iteration
insig_var_list = list(olsmod_omit.index.values)
print(insig_var_list)

[]


- After removing the insignificant variables, the OLS model fit is essentially unchanged: R-square moved from .872 to .871 on the training set and from .855 to .854 on the test set, while RMSE rose slightly from 3.88 to 3.90 (training) and from 3.92 to 3.94 (test).
- These values are nearly identical to those of the LinearRegression() model, suggesting both models are comparable for this case.
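The notebook removes all insignificant features in a single pass. A related but different technique is stepwise backward elimination, which drops only the weakest feature, refits, and repeats. Below is a minimal NumPy sketch of that idea; the helpers are hypothetical, the data is synthetic, and `|t| >= 1.96` is used as a large-sample stand-in for `p < .05` rather than an exact p-value.

```python
import numpy as np

def ols_t_stats(X, y):
    """OLS coefficients and t-statistics; X must already include a constant column."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    dof = X.shape[0] - X.shape[1]
    sigma2 = resid @ resid / dof
    cov = sigma2 * np.linalg.inv(X.T @ X)
    return coef, coef / np.sqrt(np.diag(cov))

def backward_eliminate(X, y, names, t_crit=1.96):
    """Drop the weakest non-constant feature, refit, repeat until all |t| >= t_crit."""
    names = list(names)
    while len(names) > 1:
        _, t = ols_t_stats(X, y)
        weakest = 1 + int(np.argmin(np.abs(t[1:])))   # index 0 is the constant
        if abs(t[weakest]) >= t_crit:
            break
        X = np.delete(X, weakest, axis=1)
        names.pop(weakest)
    return X, names

# Synthetic demo: y depends on "signal" only
rng = np.random.default_rng(0)
n = 400
signal = rng.normal(size=n)
noise_feat = rng.normal(size=n)
X_demo = np.column_stack([np.ones(n), signal, noise_feat])
y_demo = 1.0 + 2.0 * signal + rng.normal(scale=0.5, size=n)

_, kept = backward_eliminate(X_demo, y_demo, ["const", "signal", "noise_feat"])
print(kept)
```

Refitting after each removal matters because p-values change once a correlated feature leaves the model; here the single-pass approach happened to leave no insignificant features behind, so the two approaches may well agree.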

In [31]:
olsmod_coeff = pd.DataFrame(olsmodel2.params, columns = ['coef'])
olsmod_coeff.sort_values(by = "coef", ascending = False)

| | coef |
|---|---|
| Brand_Hindustan | 2.70363 |
| Fuel_Type_Electric | 1.122851 |
| Brand_Mini | 1.079887 |
| Brand_Land | 0.9300887 |
| Brand_Mercedes-Benz | 0.6718424 |
| Brand_Audi | 0.5950729 |
| Brand_BMW | 0.586891 |
| Brand_Jaguar | 0.5689654 |
| Brand_Volvo | 0.3846319 |
| Fuel_Type_Diesel | 0.3053143 |


In [32]:
#Brand seems to figure in. Let's see what the most frequent brands are.
cars_data.Brand.value_counts()

Maruti           1444
Hyundai          1340
Honda             743
Toyota            507
Mercedes-Benz     380
Volkswagen        374
Ford              351
Mahindra          331
BMW               311
Audi              285
Tata              228
Skoda             201
Renault           170
Chevrolet         151
Nissan            117
Land               66
Jaguar             48
Fiat               38
Mitsubishi         36
Mini               31
Volvo              28
Porsche            19
Jeep               19
Datsun             17
ISUZU               3
Force               3
Isuzu               2
Bentley             2
Smart               1
Ambassador          1
Hindustan           1
OpelCorsa           1
Name: Brand, dtype: int64

In [33]:
s = X_train_ols.sum()
s
                                        

const                    5.074000e+03
Year                     1.021577e+07
Mileage                  9.316342e+04
Power                    5.698933e+05
Seats                    2.680400e+04
kilometers_driven_log    5.460953e+04
Location_Bangalore       3.030000e+02
Location_Coimbatore      5.470000e+02
Location_Delhi           4.570000e+02
Location_Hyderabad       6.160000e+02
Location_Jaipur          3.710000e+02
Location_Kolkata         4.670000e+02
Location_Mumbai          6.290000e+02
Location_Pune            5.340000e+02
Fuel_Type_Diesel         2.720000e+03
Fuel_Type_Electric       2.000000e+00
Transmission_Manual      3.654000e+03
Owner_Type_Second        8.090000e+02
Owner_Type_Third         9.900000e+01
Brand_Audi               1.990000e+02
Brand_BMW                2.230000e+02
Brand_Chevrolet          1.110000e+02
Brand_Datsun             1.200000e+01
Brand_Fiat               2.600000e+01
Brand_Ford               2.430000e+02
Brand_Hindustan          1.000000e+00
Brand_Honda 

We can use the coefficients to assess the impact each variable has on the price. More will be said in the final review of the analysis.
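Since the target is Price_log, a coefficient $b$ on an indicator variable corresponds to multiplying the predicted price by $e^{b}$, holding the other features fixed and relative to the omitted baseline level. A short sketch of that conversion, using three coefficients copied from the olsmodel2 output above:

```python
import math

# Coefficients copied from the olsmodel2 table above (target is Price_log)
coefs = {
    "Fuel_Type_Diesel": 0.3053143,
    "Brand_Audi": 0.5950729,
    "Brand_Mercedes-Benz": 0.6718424,
}

for name, b in coefs.items():
    factor = math.exp(b)                 # multiplicative effect on Price
    pct = (factor - 1.0) * 100.0         # percentage change in Price
    print(f"{name}: x{factor:.2f} ({pct:+.0f}%)")
```

Coefficients backed by very few rows should be read with caution: Brand_Hindustan has the largest coefficient but corresponds to a single car in the data.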

**Identify overall significant variables**

In [34]:
# Retrieve coefficient values and p-values and store them in a dataframe
olsmod = pd.DataFrame(olsmodel1.params, columns = ['coef'])

olsmod['pval'] = olsmodel1.pvalues


# We are looking for the overall significant variables

cars_data_sigvar = cars_data.drop(['Brand_model'],axis =1)

pval_filter = olsmod['pval']<= 0.05
imp_vars = olsmod[pval_filter].index.tolist()

# Map each significant (possibly one-hot encoded) column back to its original variable
sig_var = []
for col in imp_vars:
    # The prefix before the first '_' identifies the original column, e.g. 'Fuel_Type_Diesel' -> 'Fuel'
    first_part = col.split('_')[0]
    for c in cars_data_sigvar.columns:
        if first_part in c and c not in sig_var :
            sig_var.append(c)

                
start = '\033[1m'   # bold on
end = '\033[0m'     # reset formatting
print(start+ 'Most overall significant variables of LINEAR REGRESSION are ' +end,':\n', sig_var)

Most overall significant variables of LINEAR REGRESSION are  :
 ['Year', 'Mileage', 'Power', 'Seats', 'kilometers_driven_log', 'Location', 'Fuel_Type', 'Transmission', 'Owner_Type', 'Brand']


**Build Ridge / Lasso Regression similar to Linear Regression:**<br>

https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Ridge.html

In [35]:
# Import Ridge/ Lasso Regression from sklearn
from sklearn.linear_model import Ridge
from sklearn.linear_model import Lasso

In [36]:
# Name of the target column used for the remaining models
target  = 'Price_log'

In [37]:
#fitting Ridge with the default features
ridge = Ridge()
ridge.fit(X_train, y_train[target])

Ridge()

In [38]:
# Get score of the model
Ridge_model = get_model_score(ridge)

R-square on training set :  0.8704177836327331
R-square on test set :  0.8533529632253352
RMSE on training set :  3.9038211269232415
RMSE on test set :  3.9455660303401205


**Observations from results:**
- The R-square values for the training and test sets were .87 and .85, respectively, nearly identical to the previous linear models. The RMSE values (3.90 and 3.95) were also nearly identical. Thus, ridge regression with default settings did not improve the results.
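One likely reason ridge changes so little here is that `Ridge()` uses its default penalty `alpha=1.0`, which is small relative to about 5,000 training rows. The sketch below shows how the penalty shrinks coefficients, using the closed-form ridge solution on synthetic data; `ridge_coefs` is a hypothetical helper, not the scikit-learn implementation.

```python
import numpy as np

def ridge_coefs(X, y, alpha):
    """Closed-form ridge solution on centered data: (X'X + alpha*I)^-1 X'y."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    p = X.shape[1]
    return np.linalg.solve(Xc.T @ Xc + alpha * np.eye(p), Xc.T @ yc)

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.3, size=200)

for alpha in (0.0, 1.0, 100.0):
    print(alpha, np.round(ridge_coefs(X_demo, y_demo, alpha), 3))
```

With `alpha=0` this reduces to ordinary least squares, and the coefficients shrink toward zero as `alpha` grows. Trying several values of `alpha` would show whether a stronger penalty helps on this dataset.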

### **XGBoost**

In [39]:
#install xgboost package
!pip install xgboost



In [40]:
import xgboost
print(xgboost.__version__)

1.6.2


In [41]:
from numpy import absolute
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedKFold
from xgboost import XGBRegressor

In [42]:
# create an xgboost regression model, find Rsquare based on original x_train
xgb = XGBRegressor(objective='reg:squarederror')

# define model evaluation method
#cv = RepeatedKFold(n_splits=120, n_repeats=6, random_state=1)
# evaluate model
#scores = cross_val_score(model, X_train, y_train[target], scoring='r2', cv=cv, n_jobs=-1)
# force scores to be positive
#scores = absolute(scores)
#print('R-square: %.3f (%.3f)' % (scores.mean(), scores.std()))

In [43]:
xgb.fit(X_train, y_train[target])
# Get score of the model
XGB_model = get_model_score(xgb)

R-square on training set :  0.98862525498481
R-square on test set :  0.8888873139848621
RMSE on training set :  1.1566129302503056
RMSE on test set :  3.43442860799185


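The test R-square (.89) and test RMSE (3.43) improve on the linear models, so this model may work to predict the expected sale price. The training scores are much better (R-square .99, RMSE 1.16), however, which suggests some overfitting.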

### **Hyperparameter tuning of the XGBoost model**

In [44]:
# Choose the type of estimator 
xgb_tuned = XGBRegressor(random_state=1)

# Grid of parameters to choose from
# Check the documentation for all the parameters that the model takes and experiment with those
parameters = { 'max_depth': [7],
           'learning_rate': [.3],
           'n_estimators': [150],
           'colsample_bytree': [1],
          'subsample':[1]
           }

# Type of scoring used to compare parameter combinations

from sklearn.metrics import make_scorer,mean_squared_error, r2_score, mean_absolute_error
#cv = RepeatedKFold(n_splits=120, n_repeats=6, random_state=1)
# Use mean squared error as the criterion; greater_is_better=False because GridSearchCV maximizes the score
scorer = make_scorer(mean_squared_error, greater_is_better=False)

# calculating different regression metrics
from sklearn.model_selection import GridSearchCV

# Run the grid search
grid_obj_xgb = GridSearchCV(estimator=xgb_tuned, param_grid=parameters, scoring = scorer, cv=10, n_jobs=10)
grid_obj_xgb = grid_obj_xgb.fit(X_train, y_train[target])

# Set the model to the best combination of parameters
xgb_tuned = grid_obj_xgb.best_estimator_

# Fit the best algorithm to the data
xgb_tuned.fit(X_train, y_train[target])

XGBRegressor(base_score=0.5, booster='gbtree', callbacks=None,
             colsample_bylevel=1, colsample_bynode=1, colsample_bytree=1,
             early_stopping_rounds=None, enable_categorical=False,
             eval_metric=None, gamma=0, gpu_id=-1, grow_policy='depthwise',
             importance_type=None, interaction_constraints='',
             learning_rate=0.3, max_bin=256, max_cat_to_onehot=4,
             max_delta_step=0, max_depth=7, max_leaves=0, min_child_weight=1,
             missing=nan, monotone_constraints='()', n_estimators=150, n_jobs=0,
             num_parallel_tree=1, predictor='auto', random_state=1, reg_alpha=0,
             reg_lambda=1, ...)

In [45]:
xgb_tuned.fit(X_train, y_train[target])
# Get score of the model
xgb_tuned_model = get_model_score(xgb_tuned)

R-square on training set :  0.996847643280918
R-square on test set :  0.9144483332464384
RMSE on training set :  0.6088844545903481
RMSE on test set :  3.0136090226360968


While the R-square and RMSE values are improved on the test set (.91 and 3.01), they remain much worse than the training set values (.997 and 0.61), indicating this model is overfit.

In [46]:
grid_obj_xgb.best_params_

{'colsample_bytree': 1,
 'learning_rate': 0.3,
 'max_depth': 7,
 'n_estimators': 150,
 'subsample': 1}
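The grid above holds a single value per parameter, so the search evaluates exactly one combination and simply returns it. The sketch below illustrates what a search over several candidates does: score each candidate by k-fold cross-validation and keep the one with the lowest error. To stay runnable with NumPy alone it tunes the penalty of a closed-form ridge model as a stand-in for XGBoost; `fit_ridge`, `cv_mse`, and the data are all hypothetical.

```python
import numpy as np

def fit_ridge(X, y, alpha):
    """Closed-form ridge with an unpenalized intercept (stand-in model)."""
    mu_x, mu_y = X.mean(axis=0), y.mean()
    Xc = X - mu_x
    w = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(X.shape[1]), Xc.T @ (y - mu_y))
    return w, mu_y - mu_x @ w

def cv_mse(X, y, alpha, k=5, seed=1):
    """Mean validation MSE over k shuffled folds."""
    idx = np.random.default_rng(seed).permutation(len(y))
    scores = []
    for fold in np.array_split(idx, k):
        train = np.setdiff1d(idx, fold)          # everything not in this fold
        w, b = fit_ridge(X[train], y[train], alpha)
        scores.append(np.mean((y[fold] - (X[fold] @ w + b)) ** 2))
    return float(np.mean(scores))

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(150, 4))
y_demo = X_demo @ np.array([1.5, 0.0, -1.0, 0.5]) + rng.normal(scale=0.5, size=150)

grid = [0.01, 1.0, 10.0, 1000.0]                 # several candidates, not one
results = {alpha: cv_mse(X_demo, y_demo, alpha) for alpha in grid}
best_alpha = min(results, key=results.get)       # lowest error wins
print(results, best_alpha)
```

Note the direction of the comparison: this sketch minimizes the error directly, whereas `GridSearchCV` always maximizes its score, so an error metric must be passed with its sign flipped (for example `scoring='neg_mean_squared_error'`).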

### **Decision Tree** 

https://scikit-learn.org/stable/auto_examples/tree/plot_tree_regression.html

In [47]:
# Import Decision tree for Regression from sklearn
from sklearn import tree

from sklearn.tree import DecisionTreeRegressor

In [48]:
# Create a decision tree regression model, use random_state = 1
dtree = DecisionTreeRegressor(random_state = 1)

In [49]:
# Fit decision tree regression model
dtree.fit(X_train, y_train[target])

DecisionTreeRegressor(random_state=1)

In [50]:
# Get score of the model
Dtree_model = get_model_score(dtree)

R-square on training set :  0.9999919148475714
R-square on test set :  0.7796366545435071
RMSE on training set :  0.03083623268912661
RMSE on test set :  4.836624243457745


**Observations from results:**
- This model is overfit: the training set R-square is nearly 1.0, far higher than the test set value (.78).
- The RMSE values are also disparate; the test set RMSE (4.84) is more than 150 times the training set RMSE (0.03).
- It is recommended that another model be used if possible, as these disparities indicate a high-variance model that is overfitting the training dataset.

Print the importance of features in the tree building. The importance of a feature is computed as the (normalized) total reduction of the criterion brought by that feature. It is also known as the Gini importance.
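For a regression tree the criterion is the squared error, so a feature's importance is driven by how much its splits reduce the variance of the target. The sketch below computes that reduction for one candidate split per feature and normalizes the results. It is a simplification: a real tree sums the reductions over every node where the feature is used, weighted by the share of samples reaching that node. The data and the `split_gain` helper are hypothetical.

```python
import numpy as np

def split_gain(y, mask):
    """Weighted reduction in variance (MSE criterion) from splitting y by mask."""
    left, right = y[mask], y[~mask]
    n = len(y)
    child = (len(left) * left.var() + len(right) * right.var()) / n
    return y.var() - child

# Hypothetical values, for illustration only
power = np.array([60, 70, 80, 150, 170, 200], dtype=float)
seats = np.array([5, 7, 5, 5, 7, 5], dtype=float)
price_log = np.array([1.0, 1.1, 1.2, 2.8, 3.0, 3.3])

gains = {
    "Power": split_gain(price_log, power <= 100),
    "Seats": split_gain(price_log, seats <= 5),
}
total = sum(gains.values())
importances = {k: v / total for k, v in gains.items()}   # normalized to sum to 1
print(importances)
```

In this toy data the split on Power separates cheap from expensive cars almost perfectly, so it receives nearly all of the normalized importance.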


In [51]:
print(pd.DataFrame(dtree.feature_importances_, columns = ["Imp"], index = X_train.columns).sort_values(by = 'Imp', ascending = False))

                                    Imp
Power                      6.410034e-01
Year                       2.241740e-01
Mileage                    2.525093e-02
kilometers_driven_log      1.919618e-02
Brand_Mahindra             1.038278e-02
Brand_Mercedes-Benz        9.124087e-03
Brand_Chevrolet            6.881867e-03
Brand_Skoda                5.695154e-03
Brand_Mini                 5.326716e-03
Brand_Tata                 4.800286e-03
Seats                      4.639159e-03
Brand_Land                 4.286220e-03
Location_Kolkata           3.595378e-03
Fuel_Type_Diesel           3.292146e-03
Location_Hyderabad         3.098636e-03
Brand_Honda                3.078370e-03
Brand_Hyundai              2.490765e-03
Brand_Volkswagen           2.273522e-03
Brand_Toyota               2.257867e-03
Brand_BMW                  1.916113e-03
Location_Coimbatore        1.597843e-03
Transmission_Manual        1.461528e-03
Brand_Ford                 1.313782e-03
Location_Bangalore         1.222145e-03


**Observations and insights: _____**
- The top two features, Power and Year, account for about 87% of the total importance (0.64 + 0.22), which suggests a simpler model could be built by dropping or combining many of the remaining variables, especially the brand dummies.
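
The combined share of the leading features can be checked directly from the printed importances. The following is a minimal sketch, assuming the top five values are copied by hand from the output above; `cumulative_share` and `features_to_reach` are hypothetical helper names introduced here, not part of sklearn.

```python
# Top five importances copied from the untuned decision tree output above.
importances = {
    "Power": 0.6410034,
    "Year": 0.2241740,
    "Mileage": 0.02525093,
    "kilometers_driven_log": 0.01919618,
    "Brand_Mahindra": 0.01038278,
}


def cumulative_share(imp, top_n):
    """Return the summed importance of the top_n most important features."""
    ranked = sorted(imp.values(), reverse=True)
    return sum(ranked[:top_n])


def features_to_reach(imp, threshold):
    """Return how many top features are needed to reach the threshold share.

    Returns None if the listed features never reach the threshold.
    """
    total = 0.0
    for count, value in enumerate(sorted(imp.values(), reverse=True), start=1):
        total += value
        if total >= threshold:
            return count
    return None


top2 = cumulative_share(importances, 2)
needed_for_90 = features_to_reach(importances, 0.90)

print(f"Top 2 features cover {top2:.1%} of the total importance")
print(f"Features needed to reach 90%: {needed_for_90}")
```

In the notebook itself, the same check could be run on the full `dtree.feature_importances_` array rather than on hand-copied values.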


### **Random Forest**

https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html

In [52]:
# Import Randomforest for Regression from sklearn
from sklearn.ensemble import RandomForestRegressor

In [53]:
# Create a Randomforest regression model
rf_estimator = RandomForestRegressor(random_state = 1)

In [54]:
# Fit Randomforest regression model
rf_estimator.fit(X_train, y_train[target])

RandomForestRegressor(random_state=1)

In [55]:
# Get score of the model

Rf_model = get_model_score(rf_estimator)

R-square on training set :  0.9824833580939456
R-square on test set :  0.8797299596096566
RMSE on training set :  1.4352998320756742
RMSE on test set :  3.5731512355164794


**Observations and insights: _____**
- The results resemble the decision tree's: the model is still overfit, with a gap of about .10 between the training (.98) and test (.88) R-square values. The RMSE values are closer than for the decision tree but still far apart (1.44 vs. 3.57).

**Feature Importance**

In [56]:
# Print important features, as done for the decision tree

print(pd.DataFrame(rf_estimator.feature_importances_, columns = ["Imp"], index = X_train.columns).sort_values(by = 'Imp', ascending = False))

                                Imp
Power                      0.646699
Year                       0.224256
Mileage                    0.025388
kilometers_driven_log      0.018246
Brand_Mercedes-Benz        0.006747
Fuel_Type_Diesel           0.005903
Brand_Mahindra             0.005687
Seats                      0.005148
Transmission_Manual        0.004843
Brand_Tata                 0.004287
Brand_Land                 0.004260
Brand_Mini                 0.004211
Brand_Chevrolet            0.003970
Location_Kolkata           0.003933
Brand_Honda                0.003668
Brand_Skoda                0.003088
Brand_Hyundai              0.002661
Location_Hyderabad         0.002640
Brand_Toyota               0.002474
Brand_BMW                  0.002075
Brand_Audi                 0.002066
Brand_Volkswagen           0.001903
Location_Coimbatore        0.001761
Owner_Type_Second          0.001684
Location_Mumbai            0.001356
Location_Bangalore         0.001339
Location_Pune              0

**Observations and insights: _____**
- Power and Year are again the dominant features, as in the decision tree analysis, together accounting for about 87% of the total feature importance.
- This again suggests that only a few features drive the model, so many of the others, the brand dummies especially, could be dropped or combined.
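
One way to combine low-signal brand categories is to lump infrequent brands into a single label before one-hot encoding. The sketch below is only an illustration of that idea: grouping by frequency is an assumption (the notebook does not say how brands would be combined), and `lump_rare`, `min_count`, and the toy brand list are hypothetical.

```python
from collections import Counter


def lump_rare(values, min_count, other_label="Other"):
    """Replace categories seen fewer than min_count times with other_label."""
    counts = Counter(values)
    return [v if counts[v] >= min_count else other_label for v in values]


# Toy brand column; the real data has 32 brands.
brands = ["Maruti", "Maruti", "Maruti", "Hyundai", "Hyundai", "Smart", "OpelCorsa"]

lumped = lump_rare(brands, min_count=2)
print(lumped)
print("Distinct categories before:", len(set(brands)), "after:", len(set(lumped)))
```

Applied to the `Brand` column before `pd.get_dummies`, this would reduce the number of brand dummy columns; the threshold would need to be chosen by checking its effect on the test scores.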

### **Hyperparameter Tuning: Decision Tree**

In [57]:
# Choose the type of estimator
dtree_tuned = DecisionTreeRegressor(random_state = 1)

# Grid of parameters to choose from
# Check the documentation for all the parameters the model takes and experiment with them
parameters = {'max_depth': [6],
              'criterion': ['squared_error', 'friedman_mse'],
              'min_samples_leaf': [2],
              'max_leaf_nodes': [7, None]
             }

# Type of scoring used to compare parameter combinations
from sklearn.metrics import make_scorer, mean_squared_error, r2_score, mean_absolute_error

scorer = make_scorer(r2_score)

# Grid search with cross-validation
from sklearn.model_selection import GridSearchCV

# Run the grid search
grid_obj = GridSearchCV(dtree_tuned, parameters, scoring = scorer, cv = 5)
grid_obj = grid_obj.fit(X_train, y_train[target])

# Set the model to the best combination of parameters
dtree_tuned = grid_obj.best_estimator_

# Fit the best algorithm to the data
dtree_tuned.fit(X_train, y_train[target])

DecisionTreeRegressor(max_depth=6, min_samples_leaf=2, random_state=1)
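
To make the size of this search concrete, the sketch below enumerates the same grid with the standard library. It mirrors the parameter values in the cell above and relies on the fact that a grid search fits every combination once per cross-validation fold; `candidates`, `n_candidates`, and `n_fits` are names introduced here for illustration.

```python
from itertools import product

# Same grid as in the tuning cell above.
param_grid = {
    "max_depth": [6],
    "criterion": ["squared_error", "friedman_mse"],
    "min_samples_leaf": [2],
    "max_leaf_nodes": [7, None],
}
cv_folds = 5

# Build every combination of parameter values (the Cartesian product).
names = list(param_grid)
candidates = [dict(zip(names, combo)) for combo in product(*param_grid.values())]

n_candidates = len(candidates)
n_fits = n_candidates * cv_folds  # one fit per candidate per fold

for candidate in candidates:
    print(candidate)
print(f"{n_candidates} candidates x {cv_folds} folds = {n_fits} fits")
```

Because only one value is searched for `max_depth` and `min_samples_leaf`, the search effectively compares two criteria and two leaf-node limits. Widening those lists would explore more of the space at the cost of more fits.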

In [58]:
# Get score of the dtree_tuned
dtree_tuned_model = get_model_score(dtree_tuned)

R-square on training set :  0.8519002684130002
R-square on test set :  0.7968044509309221
RMSE on training set :  4.173441630606136
RMSE on test set :  4.644401591902103


**Observations and insights: _____**

- The tuned model overfits far less than the initial decision tree. The train and test RMSE values are now close (4.17 vs. 4.64, a difference of about 0.5), although both are higher than desired.

- The gap between the train and test R-square values is now only about .05 (.85 vs. .80), so the model generalizes reasonably well, with moderate accuracy.
- However, the linear regression models scored about .06 higher on the test set (.86 vs. .80) and overfit less (a gap of ~.02 compared to ~.05 here).
- We aim for a model that closes this gap further while maintaining or even improving accuracy.

**Feature Importance**

In [59]:
# Print important features of tuned decision tree similar to decision trees
print(pd.DataFrame(dtree_tuned.feature_importances_, columns = ["Imp"], index = X_train.columns).sort_values(by = 'Imp', ascending = False))

                                Imp
Power                      0.710104
Year                       0.237712
Mileage                    0.009344
Brand_Mahindra             0.008987
Brand_Mercedes-Benz        0.006850
Brand_Skoda                0.004683
Brand_Mini                 0.004314
Brand_Chevrolet            0.004066
Brand_Honda                0.002700
Brand_Land                 0.002595
kilometers_driven_log      0.001637
Brand_Volkswagen           0.001425
Brand_Toyota               0.001395
Brand_Tata                 0.001358
Brand_BMW                  0.001132
Brand_Porsche              0.000657
Fuel_Type_Diesel           0.000429
Location_Hyderabad         0.000376
Location_Mumbai            0.000131
Location_Pune              0.000058
Location_Bangalore         0.000045
Brand_Mitsubishi           0.000000
Brand_Nissan               0.000000
Brand_OpelCorsa            0.000000
Brand_Renault              0.000000
Brand_Smart                0.000000
Brand_Jeep                 0

**Observations and insights: _____**

- Power and Year remain the most important features, with Power weighted even higher (.71) than in the untuned decision tree (.64).
- Mileage stays in third place, but its importance drops below .01, and kilometers_driven_log falls from fourth to eleventh. Every feature other than Power and Year now has an importance below .01.
- Several brand dummies have zero importance, so it appears nearly half of the variables could be omitted or combined.

### **Hyperparameter Tuning: Random Forest**

In [60]:
# Choose the type of Regressor
rf_tuned = RandomForestRegressor(random_state = 1)

# Define the parameters for Grid to choose from 

parameters = {"n_estimators": [500],
              "max_depth": [6, 7],
              "max_features": [.3]
             }
# Check the documentation for all the parameters the model takes and experiment with them

# Type of scoring used to compare parameter combinations
scorer = make_scorer(r2_score)

# Run the grid search
grid_obj = GridSearchCV(rf_tuned, parameters, scoring = scorer, cv = 5)
grid_obj = grid_obj.fit(X_train, y_train[target])

# Set the model to the best combination of parameters
rf_tuned_regressor = grid_obj.best_estimator_
# Fit the best algorithm to the data
rf_tuned_regressor.fit(X_train, y_train[target])

RandomForestRegressor(max_depth=7, max_features=0.3, n_estimators=500,
                      random_state=1)

In [61]:
# Get score of the rf_tuned_regressor
Rf_tuned_model = get_model_score(rf_tuned_regressor)

R-square on training set :  0.8446687788388183
R-square on test set :  0.8079351453597531
RMSE on training set :  4.274118771808161
RMSE on test set :  4.515404078111758


**Observations and insights: _____**
- Overfitting is reduced in the tuned model compared to the original random forest: the gap between the train and test R-square values shrinks from about .10 to about .04. However, accuracy is lower on both sets (train .84 vs. .98, test .81 vs. .88).
- The train and test RMSE values are now close (4.27 vs. 4.52) but higher than desired, which points to a model that overfits less but also fits the data less closely.
- This model performed slightly better than the tuned decision tree on the test set (R-square .81 vs. .80), and its test score is still below its training score by about .04.

In [62]:
# Print important features of the tuned random forest, as done for the decision trees
print(pd.DataFrame(rf_tuned_regressor.feature_importances_, columns = ["Imp"], index = X_train.columns).sort_values(by = 'Imp', ascending = False))

                                Imp
Power                      0.397101
Transmission_Manual        0.187715
Year                       0.159223
Fuel_Type_Diesel           0.067194
Mileage                    0.048319
Brand_Mercedes-Benz        0.026070
kilometers_driven_log      0.025254
Seats                      0.015888
Brand_BMW                  0.012470
Brand_Audi                 0.011763
Brand_Honda                0.006238
Brand_Toyota               0.005119
Brand_Tata                 0.004945
Brand_Land                 0.003806
Location_Coimbatore        0.003461
Owner_Type_Second          0.002918
Brand_Chevrolet            0.002301
Brand_Mahindra             0.002253
Brand_Mini                 0.002217
Owner_Type_Third           0.002179
Brand_Skoda                0.002139
Brand_Hyundai              0.001968
Location_Kolkata           0.001922
Brand_Volkswagen           0.001447
Location_Jaipur            0.000715
Location_Kochi             0.000713
Brand_Porsche              0

## AutoML

In [63]:
!pip install -f http://h2o-release.s3.amazonaws.com/h2o/latest_stable_Py.html h2o

Defaulting to user installation because normal site-packages is not writeable
Looking in links: http://h2o-release.s3.amazonaws.com/h2o/latest_stable_Py.html





In [64]:
import h2o
h2o.init(ip="127.0.0.1", port="8080")


Checking whether there is an H2O instance running at http://127.0.0.1:8080 . connected.


0,1
H2O_cluster_uptime:,1 day 22 hours 23 mins
H2O_cluster_timezone:,America/Chicago
H2O_data_parsing_timezone:,UTC
H2O_cluster_version:,3.38.0.3
H2O_cluster_version_age:,23 days
H2O_cluster_name:,H2O_from_python_maschmidt87_tfbih7
H2O_cluster_total_nodes:,1
H2O_cluster_free_memory:,1.671 Gb
H2O_cluster_total_cores:,4
H2O_cluster_allowed_cores:,4


In [65]:
import h2o
from h2o.automl import H2OAutoML

# Start the H2O cluster (locally)
h2o.init()


Checking whether there is an H2O instance running at http://localhost:54321 . connected.


0,1
H2O_cluster_uptime:,1 day 22 hours 23 mins
H2O_cluster_timezone:,America/Chicago
H2O_data_parsing_timezone:,UTC
H2O_cluster_version:,3.38.0.3
H2O_cluster_version_age:,23 days
H2O_cluster_name:,H2O_from_python_maschmidt87_x19a7t
H2O_cluster_total_nodes:,1
H2O_cluster_free_memory:,1.571 Gb
H2O_cluster_total_cores:,4
H2O_cluster_allowed_cores:,4


In [66]:
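# Create the training data set and convert it to an H2O frame
# Note: `train = X_train` does not copy, so the next line also adds Price_log to X_train
# (it is dropped from X_train again in a later cell)
train = X_train
train['Price_log'] = y_train[target]
train = h2o.H2OFrame(train)
X_train_col = list(X_train.columns)
train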

Parse progress: |████████████████████████████████████████████████████████████████| (done) 100%


const,Year,Mileage,Power,Seats,kilometers_driven_log,Location_Bangalore,Location_Chennai,Location_Coimbatore,Location_Delhi,Location_Hyderabad,Location_Jaipur,Location_Kochi,Location_Kolkata,Location_Mumbai,Location_Pune,Fuel_Type_Diesel,Fuel_Type_Electric,Fuel_Type_LPG,Transmission_Manual,Owner_Type_Fourth & Above,Owner_Type_Second,Owner_Type_Third,Brand_Audi,Brand_BMW,Brand_Bentley,Brand_Chevrolet,Brand_Datsun,Brand_Fiat,Brand_Force,Brand_Ford,Brand_Hindustan,Brand_Honda,Brand_Hyundai,Brand_ISUZU,Brand_Isuzu,Brand_Jaguar,Brand_Jeep,Brand_Land,Brand_Mahindra,Brand_Mercedes-Benz,Brand_Mini,Brand_Mitsubishi,Brand_Nissan,Brand_OpelCorsa,Brand_Porsche,Brand_Renault,Brand_Skoda,Brand_Smart,Brand_Tata,Brand_Toyota,Brand_Volkswagen,Brand_Volvo,Price_log
1,2015,11.74,186.0,5,10.9682,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,3.28466
1,2008,15.0,105.0,5,11.783,0,0,0,0,1,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,1.90211
1,2016,18.15,82.0,6,10.0869,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1.41099
1,2016,15.8,121.3,5,10.1002,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2.3979
1,2015,15.1,140.0,7,10.5799,0,0,1,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,2.53607
1,2016,18.5,85.8,5,10.0331,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1.81482
1,2017,17.0,121.36,5,9.98967,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2.16905
1,2013,14.21,203.0,5,10.779,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,3.09104
1,2015,25.2,74.0,5,11.0037,0,0,0,1,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1.55814
1,2012,19.7,46.3,5,11.1146,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.955511


In [67]:
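# Create the test data set and convert it to an H2O frame
# Note: as with the training data, this also adds Price_log to X_test (dropped again in a later cell)
test = X_test
test['Price_log'] = y_test[target]
test = h2o.H2OFrame(test)
X_test_col = list(test.columns)
test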

Parse progress: |████████████████████████████████████████████████████████████████| (done) 100%


const,Year,Mileage,Power,Seats,kilometers_driven_log,Location_Bangalore,Location_Chennai,Location_Coimbatore,Location_Delhi,Location_Hyderabad,Location_Jaipur,Location_Kochi,Location_Kolkata,Location_Mumbai,Location_Pune,Fuel_Type_Diesel,Fuel_Type_Electric,Fuel_Type_LPG,Transmission_Manual,Owner_Type_Fourth & Above,Owner_Type_Second,Owner_Type_Third,Brand_Audi,Brand_BMW,Brand_Bentley,Brand_Chevrolet,Brand_Datsun,Brand_Fiat,Brand_Force,Brand_Ford,Brand_Hindustan,Brand_Honda,Brand_Hyundai,Brand_ISUZU,Brand_Isuzu,Brand_Jaguar,Brand_Jeep,Brand_Land,Brand_Mahindra,Brand_Mercedes-Benz,Brand_Mini,Brand_Mitsubishi,Brand_Nissan,Brand_OpelCorsa,Brand_Porsche,Brand_Renault,Brand_Skoda,Brand_Smart,Brand_Tata,Brand_Toyota,Brand_Volkswagen,Brand_Volvo,Price_log
1,2010,11.5,171.0,7,11.8784,0,0,0,0,0,1,0,0,0,0,1,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,2.46385
1,2012,20.0,68.0,5,11.2772,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1.02065
1,2014,15.8,110.0,5,11.0429,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2.01757
1,2013,19.4,86.8,5,9.30565,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1.36098
1,2014,18.0,86.7,5,10.7089,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1.59737
1,2019,18.06,63.0,7,9.47455,0,0,0,0,0,0,1,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,2.20937
1,2014,13.0,201.1,5,10.5289,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,3.46574
1,2014,17.68,174.33,5,11.0305,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,3.03351
1,2016,19.01,108.45,5,11.0861,0,0,1,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,2.27727
1,2016,16.2,258.0,5,9.80311,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,3.82275


In [68]:
# Run AutoML for 20 base models
aml = H2OAutoML(max_models=20, seed=1)
aml.train(x=X_train_col, y=target, training_frame=train)
#
# View the AutoML Leaderboard
lb = aml.leaderboard
lb.head(rows=lb.nrows)  # Print all rows instead of default (10 rows)

AutoML progress: |█
07:58:36.16: AutoML: XGBoost is not available; skipping it.
07:58:36.53: _train param, Dropping bad and constant columns: [const, Brand_OpelCorsa]
07:58:39.650: _train param, Dropping bad and constant columns: [const, Brand_OpelCorsa]

███
07:58:45.790: _train param, Dropping bad and constant columns: [const, Brand_OpelCorsa]

███
07:58:54.758: _train param, Dropping bad and constant columns: [const, Brand_OpelCorsa]
07:58:57.810: _train param, Dropping bad and constant columns: [const, Brand_OpelCorsa]

█
07:59:00.842: _train param, Dropping bad and constant columns: [const, Brand_OpelCorsa]
07:59:04.279: _train param, Dropping bad and constant columns: [const, Brand_OpelCorsa]

██
07:59:10.102: _train param, Dropping bad and constant columns: [const, Brand_OpelCorsa]

██
07:59:13.135: _train param, Dropping bad and constant columns: [const, Brand_OpelCorsa]

██████████████████████████████████████████████████
08:16:20.238: _train param, Dropping unused columns: [co

model_id,rmse,mse,mae,rmsle,mean_residual_deviance
StackedEnsemble_AllModels_1_AutoML_3_20221217_75835,0.197513,0.0390113,0.136244,0.100273,0.0390113
StackedEnsemble_BestOfFamily_1_AutoML_3_20221217_75835,0.200954,0.0403826,0.139622,0.101838,0.0403826
GBM_5_AutoML_3_20221217_75835,0.207252,0.0429534,0.142651,0.104086,0.0429534
GBM_grid_1_AutoML_3_20221217_75835_model_1,0.2076,0.043098,0.143527,0.10605,0.043098
GBM_2_AutoML_3_20221217_75835,0.210045,0.0441188,0.143903,0.106146,0.0441188
GBM_4_AutoML_3_20221217_75835,0.210487,0.0443047,0.143759,0.105655,0.0443047
GBM_grid_1_AutoML_3_20221217_75835_model_2,0.212486,0.0451501,0.150008,0.105266,0.0451501
GBM_3_AutoML_3_20221217_75835,0.213397,0.0455382,0.145253,0.107418,0.0455382
GBM_grid_1_AutoML_3_20221217_75835_model_5,0.21455,0.0460315,0.145418,0.107257,0.0460315
GBM_grid_1_AutoML_3_20221217_75835_model_4,0.216076,0.0466889,0.148209,0.108288,0.0466889


In [None]:
# To generate predictions on a test set, you can make predictions
# directly on the `H2OAutoML` object or on the leader model
# object directly
preds_test = aml.predict(test)
preds_train= aml.predict(train)


In [None]:
# Convert the predicted values to pandas data frames
import numpy as np
from sklearn import metrics

score_list_auto = []

preds_test_df = preds_test.as_data_frame()
preds_train_df = preds_train.as_data_frame()

# The model predicts Price_log, so exponentiate to get back to the Price scale
pred_test_auto = np.exp(preds_test_df)
pred_train_auto = np.exp(preds_train_df)

# R-square and RMSE for the AutoML predictions, computed on the original Price scale
train_r2_auto = metrics.r2_score(y_train['Price'], pred_train_auto)
test_r2_auto = metrics.r2_score(y_test['Price'], pred_test_auto)
train_rmse_auto = np.sqrt(metrics.mean_squared_error(y_train['Price'], pred_train_auto))
test_rmse_auto = np.sqrt(metrics.mean_squared_error(y_test['Price'], pred_test_auto))

# Adding all scores to the list
score_list_auto.extend((train_r2_auto, test_r2_auto, train_rmse_auto, test_rmse_auto))

print("R-square on training set : ", train_r2_auto)
print("R-square on test set : ", test_r2_auto)
print("RMSE on training set : ", train_rmse_auto)
print("RMSE on test set : ", test_rmse_auto)
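
The cell above scores the model on the Price scale even though it was trained on Price_log. The sketch below shows that back-transformation and the two metrics written out by hand, using only the standard library. The four toy prices and predictions are made up for illustration, and `r2_score` and `rmse` here are local functions, not the sklearn ones.

```python
import math


def r2_score(y_true, y_pred):
    """R-square: 1 - (residual sum of squares / total sum of squares)."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


def rmse(y_true, y_pred):
    """Root mean squared error."""
    mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    return math.sqrt(mse)


# Toy data: actual prices, and model predictions on the log scale.
price = [2.0, 5.0, 10.0, 20.0]
pred_log = [math.log(2.2), math.log(4.5), math.log(11.0), math.log(18.0)]

# Undo the log transform before comparing with the actual prices.
pred_price = [math.exp(v) for v in pred_log]

toy_r2 = r2_score(price, pred_price)
toy_rmse = rmse(price, pred_price)
print(f"R-square: {toy_r2:.4f}, RMSE: {toy_rmse:.4f}")
```

Scoring on the Price scale keeps the AutoML numbers comparable with the other models in the comparison table, whose RMSE values are also on the Price scale.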

In [74]:
#convert scores to dataframe
AutoML_scores= pd.DataFrame(score_list_auto).T
AutoML_scores.columns = ['Train_r2', 'Test_r2', 'Train_RMSE', 'Test_RMSE']
AutoML_scores.insert(0,'Model','AutoML-StackedEnsemble')
AutoML_scores

Unnamed: 0,Model,Train_r2,Test_r2,Train_RMSE,Test_RMSE
0,AutoML-StackedEnsemble,0.974683,0.914052,1.725533,3.020579


In [83]:
X_train = X_train.drop(['Price_log'], axis =1)
X_train.columns

Index(['const', 'Year', 'Mileage', 'Power', 'Seats', 'kilometers_driven_log',
       'Location_Bangalore', 'Location_Chennai', 'Location_Coimbatore',
       'Location_Delhi', 'Location_Hyderabad', 'Location_Jaipur',
       'Location_Kochi', 'Location_Kolkata', 'Location_Mumbai',
       'Location_Pune', 'Fuel_Type_Diesel', 'Fuel_Type_Electric',
       'Fuel_Type_LPG', 'Transmission_Manual', 'Owner_Type_Fourth & Above',
       'Owner_Type_Second', 'Owner_Type_Third', 'Brand_Audi', 'Brand_BMW',
       'Brand_Bentley', 'Brand_Chevrolet', 'Brand_Datsun', 'Brand_Fiat',
       'Brand_Force', 'Brand_Ford', 'Brand_Hindustan', 'Brand_Honda',
       'Brand_Hyundai', 'Brand_ISUZU', 'Brand_Isuzu', 'Brand_Jaguar',
       'Brand_Jeep', 'Brand_Land', 'Brand_Mahindra', 'Brand_Mercedes-Benz',
       'Brand_Mini', 'Brand_Mitsubishi', 'Brand_Nissan', 'Brand_OpelCorsa',
       'Brand_Porsche', 'Brand_Renault', 'Brand_Skoda', 'Brand_Smart',
       'Brand_Tata', 'Brand_Toyota', 'Brand_Volkswagen', 'Brand_Vol

In [88]:
X_test = X_test.drop(['Price_log'], axis =1)
X_test.columns

Index(['const', 'Year', 'Mileage', 'Power', 'Seats', 'kilometers_driven_log',
       'Location_Bangalore', 'Location_Chennai', 'Location_Coimbatore',
       'Location_Delhi', 'Location_Hyderabad', 'Location_Jaipur',
       'Location_Kochi', 'Location_Kolkata', 'Location_Mumbai',
       'Location_Pune', 'Fuel_Type_Diesel', 'Fuel_Type_Electric',
       'Fuel_Type_LPG', 'Transmission_Manual', 'Owner_Type_Fourth & Above',
       'Owner_Type_Second', 'Owner_Type_Third', 'Brand_Audi', 'Brand_BMW',
       'Brand_Bentley', 'Brand_Chevrolet', 'Brand_Datsun', 'Brand_Fiat',
       'Brand_Force', 'Brand_Ford', 'Brand_Hindustan', 'Brand_Honda',
       'Brand_Hyundai', 'Brand_ISUZU', 'Brand_Isuzu', 'Brand_Jaguar',
       'Brand_Jeep', 'Brand_Land', 'Brand_Mahindra', 'Brand_Mercedes-Benz',
       'Brand_Mini', 'Brand_Mitsubishi', 'Brand_Nissan', 'Brand_OpelCorsa',
       'Brand_Porsche', 'Brand_Renault', 'Brand_Skoda', 'Brand_Smart',
       'Brand_Tata', 'Brand_Toyota', 'Brand_Volkswagen', 'Brand_Vol

**Feature Importance**

**Observations and insights: ______**
- As in the previous decision tree and random forest models, Power and Year are identified as the most important features, in that order, together accounting for more than 80% of the importance in determining the predicted price.
- Given the consistency of these findings, a simpler model could be developed in which most other variables are dropped.

In [89]:
# Defining list of models you have trained
models = [lr,olsmodel1, ridge,xgb, xgb_tuned, dtree,rf_estimator, dtree_tuned, rf_tuned_regressor]

# Defining empty lists to add train and test results
r2_train = []
r2_test = []
rmse_train = []
rmse_test = []

# Looping through all the models to get the rmse and r2 scores
for model in models:
    
    # Accuracy score
    j = get_model_score(model, False)
    
    r2_train.append(j[0])
    
    r2_test.append(j[1])
    
    rmse_train.append(j[2])
    
    rmse_test.append(j[3])

In [90]:
comparison_frame = pd.DataFrame({'Model':['Linear Regression-sklearn','Linear Regression-statsmodel','Ridge','XGBoost','XGBoost-tuned','Decision Tree', 'Random Forest', 'Decision Tree- tuned','Random Forest - tuned'], 
                                          'Train_r2': r2_train,'Test_r2': r2_test,
                                          'Train_RMSE': rmse_train,'Test_RMSE': rmse_test}) 
comparison_frame

Unnamed: 0,Model,Train_r2,Test_r2,Train_RMSE,Test_RMSE
0,Linear Regression-sklearn,0.871921,0.855197,3.881119,3.92068
1,Linear Regression-statsmodel,0.871921,0.855197,3.881119,3.92068
2,Ridge,0.870418,0.853353,3.903821,3.945566
3,XGBoost,0.988625,0.888887,1.156613,3.434429
4,XGBoost-tuned,0.996848,0.914448,0.608884,3.013609
5,Decision Tree,0.999992,0.779637,0.030836,4.836624
6,Random Forest,0.982483,0.87973,1.4353,3.573151
7,Decision Tree- tuned,0.8519,0.796804,4.173442,4.644402
8,Random Forest - tuned,0.844669,0.807935,4.274119,4.515404


In [91]:
# Add the AutoML scores to the comparison table
# (DataFrame.append is deprecated, so pd.concat is used instead)
comparison_frame_concat = pd.concat([comparison_frame, AutoML_scores], ignore_index=True)
comparison_frame_concat


Unnamed: 0,Model,Train_r2,Test_r2,Train_RMSE,Test_RMSE
0,Linear Regression-sklearn,0.871921,0.855197,3.881119,3.92068
1,Linear Regression-statsmodel,0.871921,0.855197,3.881119,3.92068
2,Ridge,0.870418,0.853353,3.903821,3.945566
3,XGBoost,0.988625,0.888887,1.156613,3.434429
4,XGBoost-tuned,0.996848,0.914448,0.608884,3.013609
5,Decision Tree,0.999992,0.779637,0.030836,4.836624
6,Random Forest,0.982483,0.87973,1.4353,3.573151
7,Decision Tree- tuned,0.8519,0.796804,4.173442,4.644402
8,Random Forest - tuned,0.844669,0.807935,4.274119,4.515404
9,AutoML-StackedEnsemble,0.974683,0.914052,1.725533,3.020579


**Observations: _____**

- The AutoML-StackedEnsemble and tuned XGBoost models were the most accurate on the test set, with essentially identical scores (R2 of about .914, RMSE of about 3.02). Of the two, AutoML-StackedEnsemble is less overfit, with a train-test R2 gap of ~.06 compared to ~.08 for tuned XGBoost. Linear Regression and Ridge also produced suitable models: their test R2 is lower (~.85), but their train-test gap is the smallest (~.02).

-  The initial decision tree and random forest models were overfit, despite very high R2 values on the training set. Hyperparameter tuning reduced the overfitting notably, but the tuned models still had a larger train-test gap and a lower test R2 than the linear regression models.
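
The comparison can also be made mechanical by ranking the models on test R2 and reporting each train-test gap. This is a small sketch using rows copied by hand from the table above (a subset, for brevity); the tuple layout and variable names are assumptions made for illustration.

```python
# (model, train_r2, test_r2, test_rmse), copied from the comparison table above.
results = [
    ("Linear Regression-sklearn", 0.871921, 0.855197, 3.92068),
    ("Ridge", 0.870418, 0.853353, 3.945566),
    ("XGBoost-tuned", 0.996848, 0.914448, 3.013609),
    ("Decision Tree", 0.999992, 0.779637, 4.836624),
    ("Random Forest - tuned", 0.844669, 0.807935, 4.515404),
    ("AutoML-StackedEnsemble", 0.974683, 0.914052, 3.020579),
]

# Train-test R2 gap per model: a larger gap suggests more overfitting.
gaps = {name: train_r2 - test_r2 for name, train_r2, test_r2, _ in results}

# Rank by test R2, highest first.
ranked = sorted(results, key=lambda row: row[2], reverse=True)
for name, train_r2, test_r2, test_rmse in ranked:
    print(f"{name:28s} test_r2={test_r2:.3f}  gap={gaps[name]:.3f}  test_rmse={test_rmse:.2f}")

best_by_test = ranked[0][0]
smallest_gap = min(gaps, key=gaps.get)
print("Highest test R2:", best_by_test)
print("Smallest train-test gap:", smallest_gap)
```

Ranking on a single metric hides the trade-off between accuracy and overfitting, which is why both the test score and the gap are printed side by side.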

**Note:** You can also try some other algorithms such as KNN and compare the model performance with the existing ones.

### **Insights**

**Refined insights**:
- What are the most meaningful insights from the data relevant to the problem?

- Results consistently show that engine power and model year are the strongest determinants of a used car's market value. Mileage, kilometers driven, and transmission type may also affect price, but to a much lower degree.
- Conversely, most sale locations and many brands did not show a significant impact on price in any of the models.
- Although many brands did not figure significantly in the overall price, some did. In the OLS model, Hindustan, Mini Cooper, Land Rover, Mercedes-Benz, Audi, BMW, Jaguar, and Volvo cars sold for higher prices, while Tata, Datsun, Chevrolet, Fiat, and Mahindra cars sold for lower prices.

**Comparison of various techniques and their relative performance**:
- How do different techniques perform? Which one is performing relatively better? Is there scope to improve the performance further?
- Checking for multicollinearity and removing features with high VIF scores before fitting any of the models simplified the dataset, allowing for more reliable results from each model.
- The Ordinary Least Squares (OLS) Linear Regression (LR) models from the sklearn and statsmodels packages delivered nearly identical results and fit the data well, given the relatively high and close R-square values on the training and test data (.872 & .855, respectively).
- The Ridge model performed comparably to the LR models.
- The XGBoost model was somewhat better, with a test R-square (.89) and test RMSE (3.43) that improve on the Linear Regression outcomes. Hyperparameter tuning improved the test results further (.91 and 3.01), but the tuned model is a bit more overfit, with a training R-square close to 1.
- The initial Decision Tree and Random Forest models were overfit, given the high training R-square values (~1.00 & .98, respectively) compared to the notably lower test values (.78 & .88, respectively).
- The tuned Decision Tree and Random Forest eliminated much of the overfitting of the original models, with lower training R-square values (.85 & .84, respectively), but their test values (.80 & .81, respectively) were lower than those of most other models.
- Test RMSE values ranged from about 3.0 (tuned XGBoost and AutoML) to about 4.8 (initial Decision Tree).
- The LR models performed nearly identically, delivering a well-fitting model, especially compared to the decision tree and random forest models.
- The statsmodels LR model was refined by removing insignificant variables, but this did not produce comparatively better results.
- The feature importance analysis revealed that most of the features contribute very little to the Decision Tree and Random Forest models. Some of these features could be eliminated, and further iterations could then be run on these models.
- The AutoML Stacked Ensemble model ultimately performed best overall, with a high test R-square (.91), a test RMSE (3.02) essentially tied with tuned XGBoost (3.01) for the lowest, and a smaller train-test gap than tuned XGBoost.
- K-nearest neighbors (KNN) regression could also be applied to the dataset to see whether a more desirable model can be obtained.
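
As a pointer for the KNN suggestion in the last bullet, the sketch below shows the core of KNN regression in plain Python: a prediction is the average target of the k closest training points. The toy points and the `knn_predict` helper are hypothetical; in the notebook, sklearn's `KNeighborsRegressor` fitted on `X_train` would be the practical choice, ideally after scaling the features, since KNN is distance-based.

```python
import math


def knn_predict(train_X, train_y, query, k=3):
    """Predict by averaging the targets of the k nearest training points."""
    # Pair each training point's Euclidean distance to the query with its target.
    distances = sorted((math.dist(x, query), y) for x, y in zip(train_X, train_y))
    nearest = distances[:k]
    return sum(y for _, y in nearest) / len(nearest)


# Toy training data: two already-scaled features and a target value.
train_X = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (5.0, 5.0)]
train_y = [1.0, 2.0, 3.0, 10.0]

prediction = knn_predict(train_X, train_y, query=(0.2, 0.1), k=3)
print("KNN prediction:", prediction)
```

The number of neighbors k would be a hyperparameter to tune, in the same way `max_depth` was tuned for the tree models.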


**Proposal for the final solution design**:
- What model do you propose to be adopted? Why is this the best solution to adopt?
- I propose adopting the AutoML Stacked Ensemble model. The tuned XGBoost model produced a marginally higher test R-square, but it was more overfit.
- The AutoML model shows only a moderate gap between its training and test R-square values (.97 vs. .91) and no sign of underfitting, while matching the best test accuracy of any model.

