# **Milestone 2**

## **Model Building**

1. We want to predict 'Price'. We will use its log-transformed version, 'Price_log', for modeling.
2. Before building the model, we have to encode the categorical features. Categorical features such as 'Name' will be dropped instead of encoded.
3. We'll split the data into train and test sets so that we can evaluate the model built on the train data.
4. Build Regression models using train data.
5. Evaluate the model performance.

**Note:** Please load the data frame that was saved in Milestone 1 here before separating the data, and then proceed to the next step in Milestone 2.

### **Load the data**

In [91]:
# Import required libraries
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns

# Regression metrics (r2_score, mean_squared_error, make_scorer)
import sklearn.metrics as metrics

from sklearn.model_selection import GridSearchCV



# To ignore warnings
import warnings
warnings.filterwarnings('ignore')

cars_data = pd.read_csv("cars_data_updated.csv")

# Forward-fill any remaining missing values, then preview the data
cars_data = cars_data.ffill()
cars_data.head()



### **Split the Data**

<li>Step 1: Separate the independent variables (X) from the dependent variable (y).
<li>Step 2: Encode the categorical variables in X using pd.get_dummies.
<li>Step 3: Split the data into train and test sets using train_test_split.

**Think about it:** Why should we drop 'Name', 'Price', 'Price_log', and 'Kilometers_Driven' from X before splitting?
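
One way to think about the question above: 'Price' and 'Price_log' are the target itself, so leaving either of them in X would let the model read the answer directly. The sketch below illustrates this on synthetic data (the values, column relationships, and the helper `holdout_r2` are made up for illustration and are not the cars dataset).

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the cars data (hypothetical values, not the real dataset)
rng = np.random.default_rng(0)
n = 500
demo = pd.DataFrame({
    "Power": rng.uniform(50, 200, n),
    "Year": rng.integers(2005, 2020, n),
})
demo["Price_log"] = (0.01 * demo["Power"]
                     + 0.1 * (demo["Year"] - 2005)
                     + rng.normal(0, 0.5, n))

def holdout_r2(feature_cols):
    """Fit on a train split and return R-square on the held-out split."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        demo[feature_cols], demo["Price_log"], test_size=0.3, random_state=1
    )
    return LinearRegression().fit(X_tr, y_tr).score(X_te, y_te)

honest_r2 = holdout_r2(["Power", "Year"])
leaky_r2 = holdout_r2(["Power", "Year", "Price_log"])  # target copied into X

print(f"Without the target in X: {honest_r2:.3f}")
print(f"With the target in X:    {leaky_r2:.3f}")
```

The second score is essentially perfect, but only because the model was handed the answer; it says nothing about how well price can be predicted from the car's attributes.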

In [2]:
# Step-1
X = cars_data.drop(['Name','Price','Price_log','Kilometers_Driven', 'kilometers_driven_log', 'New_price_log', 'Power_log', 'New_price'], axis = 1)

y = cars_data[["Price_log", "Price"]]

In [3]:
# Step-2 Use pd.get_dummies(drop_first = True)
X = pd.get_dummies(X, drop_first = True)
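
As a small illustration of what `drop_first = True` does, the sketch below encodes a made-up toy frame (not the cars data). Each categorical column becomes one indicator column per category, except the first category in sorted order, which is dropped and acts as the baseline.

```python
import pandas as pd

# Toy frame with one categorical and one numeric column (hypothetical values)
toy = pd.DataFrame({
    "Fuel_Type": ["Petrol", "Diesel", "CNG", "Diesel"],
    "Seats": [5, 7, 5, 5],
})

encoded = pd.get_dummies(toy, drop_first=True)

# 'CNG' sorts first, so it is dropped and becomes the implicit baseline
print(encoded.columns.tolist())
# ['Seats', 'Fuel_Type_Diesel', 'Fuel_Type_Petrol']
```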

In [4]:
from sklearn.model_selection import train_test_split
# Step-3 Splitting data into training and test set:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 1)

print(X_train.shape, X_test.shape)

(5076, 24) (2176, 24)


In [5]:
# Function for calculating the R-square and RMSE on the train and test data
# It takes a model that has already been fitted on the training data as input
# and returns (and optionally prints) the scores on the original 'Price' scale

def get_model_score(model, flag = True):
    '''
    model : fitted regressor that predicts Price_log from X
    flag  : if True (the default), print the scores in addition to returning them

    Returns the list [train_r2, test_r2, train_rmse, test_rmse].
    '''
    # The models predict Price_log, so exponentiate to get back to the Price scale
    pred_train_ = np.exp(model.predict(X_train))
    pred_test_ = np.exp(model.predict(X_test))
    
    train_r2 = metrics.r2_score(y_train['Price'], pred_train_)
    test_r2 = metrics.r2_score(y_test['Price'], pred_test_)
    
    # RMSE is the square root of the mean squared error
    train_rmse = np.sqrt(metrics.mean_squared_error(y_train['Price'], pred_train_))
    test_rmse = np.sqrt(metrics.mean_squared_error(y_test['Price'], pred_test_))
    
    # Collecting all scores in one list
    score_list = [train_r2, test_r2, train_rmse, test_rmse]
    
    # The scores are printed only if the flag is True
    if flag: 
        print("R-square on training set : ", train_r2)
        print("R-square on test set : ", test_r2)
        print("RMSE on training set : ", train_rmse)
        print("RMSE on test set : ", test_rmse)
    
    # Returning the list with train and test scores
    return score_list
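
Because the function above exponentiates the predictions before scoring, the R-square it reports is measured on the 'Price' scale and will generally differ from the R-square of the same model on the 'Price_log' scale. The following is a minimal sketch of that difference on synthetic data (the data-generating process is an assumption for illustration only).

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Synthetic data in which the log of the price is linear in one feature
rng = np.random.default_rng(1)
x = rng.uniform(0, 3, size=(400, 1))
log_price = 1.0 + 0.8 * x[:, 0] + rng.normal(0, 0.4, 400)
price = np.exp(log_price)

# The model is trained on the log scale, as in this notebook
model = LinearRegression().fit(x, log_price)
pred_log = model.predict(x)

r2_log_scale = r2_score(log_price, pred_log)            # scored on log(price)
r2_price_scale = r2_score(price, np.exp(pred_log))      # scored on price

print(f"R-square on the log scale:   {r2_log_scale:.3f}")
print(f"R-square on the price scale: {r2_price_scale:.3f}")
```

Neither number is wrong; they answer different questions, so scores should only be compared when they were computed on the same scale.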

<hr>

For Regression Problems, some of the algorithms used are :<br>

**1) Linear Regression** <br>
**2) Ridge / Lasso Regression** <br>
**3) Decision Trees** <br>
**4) Random Forest** <br>

### **Fitting a linear model**

Linear Regression can be implemented using: <br>

**1) Sklearn:** https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html <br>
**2) Statsmodels:** https://www.statsmodels.org/stable/regression.html

In [6]:
# Import Linear Regression from sklearn
from sklearn.linear_model import LinearRegression


In [7]:
# Create a linear regression model
lr = LinearRegression()

cars_data.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7252 entries, 0 to 7251
Data columns (total 18 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   S.No.                  7252 non-null   int64  
 1   Name                   7252 non-null   object 
 2   Location               7252 non-null   object 
 3   Year                   7252 non-null   int64  
 4   Kilometers_Driven      7252 non-null   int64  
 5   Fuel_Type              7252 non-null   object 
 6   Transmission           7252 non-null   object 
 7   Owner_Type             7252 non-null   object 
 8   Mileage                7252 non-null   float64
 9   Engine                 7252 non-null   float64
 10  Power                  7252 non-null   float64
 11  Seats                  7252 non-null   float64
 12  New_price              7250 non-null   float64
 13  Price                  7252 non-null   float64
 14  kilometers_driven_log  7252 non-null   float64
 15  Pric

In [8]:
# Fit linear regression model
lr.fit(X_train, y_train['Price_log']) 

LinearRegression()

In [9]:
# Get score of the model
LR_score = get_model_score(lr)


R-square on training set :  0.6026935985967086
R-square on test set :  0.7010999697209037
RMSE on training set :  6.608285256712908
RMSE on test set :  5.765673746214778


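**Observations from results:**
* The R-square on the training set (0.60) is lower than on the test set (0.70), which is unusual; in either case the model explains only part of the variation in price.
* R-square and RMSE are on different scales and cannot be compared directly: R-square is a unitless proportion, whereas RMSE (6.61 on train, 5.77 on test) is in the same units as 'Price'.
* Overall, we do not have a well-fitted model.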
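
R-square and RMSE are two views of the same residuals: R-square equals one minus the mean squared error divided by the variance of the true values. A minimal sketch with made-up numbers (not taken from the cars data):

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical true and predicted prices
y_true = np.array([4.5, 12.0, 7.3, 25.0, 3.1, 9.8])
y_pred = np.array([5.0, 10.5, 8.0, 20.0, 4.0, 9.0])

rmse = np.sqrt(mean_squared_error(y_true, y_pred))
r2 = r2_score(y_true, y_pred)

# R-square rebuilt from RMSE and the (population) variance of y_true
r2_from_rmse = 1 - rmse**2 / np.var(y_true)

print(rmse, r2, r2_from_rmse)
```

So an RMSE is only "large" or "small" relative to how much the target itself varies.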

**Important variables of Linear Regression**

Building a model using statsmodels.

In [10]:
# Import Statsmodels 
import statsmodels.api as sm

# Statsmodels does not add a constant (intercept) by default, so we add it explicitly
x_train = sm.add_constant(X_train)

# Add constant to test data
x_test = sm.add_constant(X_test)

def build_ols_model(train):
    
    # Create the model
    olsmodel = sm.OLS(y_train["Price_log"], train)
    
    return olsmodel.fit()



# Fit linear model on new dataset
olsmodel1 = build_ols_model(x_train)

print(olsmodel1.summary())

                            OLS Regression Results                            
Dep. Variable:              Price_log   R-squared:                       0.683
Model:                            OLS   Adj. R-squared:                  0.682
Method:                 Least Squares   F-statistic:                     473.3
Date:                Thu, 28 Jul 2022   Prob (F-statistic):               0.00
Time:                        11:29:45   Log-Likelihood:                -3547.1
No. Observations:                5076   AIC:                             7142.
Df Residuals:                    5052   BIC:                             7299.
Df Model:                          23                                         
Covariance Type:            nonrobust                                         
                                coef    std err          t      P>|t|      [0.025      0.975]
---------------------------------------------------------------------------------------------
const                 

In [11]:
# Retrieve the coefficient values and p-values and store them in a dataframe
olsmod = pd.DataFrame(olsmodel1.params, columns = ['coef'])

olsmod['pval'] = olsmodel1.pvalues

In [12]:
# Sort by p-value (descending) and keep only the significant terms (pval <= 0.05)

olsmod = olsmod.sort_values(by = "pval", ascending = False)

pval_filter = olsmod['pval']<= 0.05

olsmod[pval_filter]

                            coef           pval
Fuel_Type_Diesel        0.172315   1.742291e-02
Owner_Type_Third       -0.181412   4.783855e-04
Mileage                -0.009237   1.338413e-04
Location_Kolkata       -0.200940   1.849726e-06
Engine                  0.000261   5.625166e-17
Transmission_Manual    -0.277407   1.245178e-40
Power                   0.005006   2.719168e-57
S.No.                  -0.000106  8.253516e-212
const                -214.812696  1.063912e-301
Year                    0.107408  6.141052e-303


In [13]:
# We are looking for the overall significant variables

pval_filter = olsmod['pval']<= 0.05
mp_vars = olsmod[pval_filter].index.tolist()

# Map each significant term back to its original (un-one-hot-encoded) variable
sig_var = []
for col in mp_vars:
    # The text before the first underscore identifies the source column,
    # e.g. 'Fuel' for 'Fuel_Type_Diesel'
    first_part = col.split('_')[0]
    for c in cars_data.columns:
        # Substring match, so 'Power' also picks up 'Power_log'
        if first_part in c and c not in sig_var :
            sig_var.append(c)

                
start = '\033[1m'   # ANSI code: bold
end = '\033[0m'     # ANSI code: reset formatting
print(start+ 'Most overall significant variables of LINEAR REGRESSION are ' +end,':\n', sig_var)

Most overall significant variables of LINEAR REGRESSION are  :
 ['Fuel_Type', 'Owner_Type', 'Mileage', 'Location', 'Engine', 'Transmission', 'Power', 'Power_log', 'S.No.', 'Year']


**Build Ridge / Lasso Regression similar to Linear Regression:**<br>

https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Ridge.html

In [54]:
# Import Ridge/ Lasso Regression from sklearn
from sklearn.linear_model import Ridge


In [55]:
# Create a Ridge regression model


rdg = Ridge(alpha = 1.0)


In [64]:
# Fit Ridge regression model

rdg.fit(X_train, y_train['Price_log'])


Ridge()

In [65]:

# Get score of the model

# Store the scores under a new name so the fitted model in rdg is not overwritten
rdg_score = get_model_score(rdg)



R-square on training set :  0.6026014404806532
R-square on test set :  0.7011265075974136
RMSE on training set :  6.609051632242788
RMSE on test set :  5.765417787501582


**Observations from results:**
* The results are nearly identical to linear regression, which suggests that a penalty of alpha = 1.0 is too weak to change the fit noticeably.
* The R-square on the training set (0.60) is again lower than on the test set (0.70), and the RMSE is about 6.61 on train and 5.77 on test.
* We still do not have a well-fitted model.
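
To see how the strength of the ridge penalty relates to ordinary least squares, the sketch below fits both on synthetic data (an assumption for illustration, not the cars data) and tracks how far the ridge coefficients move away from the OLS ones as alpha grows.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

# Synthetic regression problem with known coefficients
rng = np.random.default_rng(2)
X_demo = rng.normal(size=(1000, 5))
true_coef = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
y_demo = X_demo @ true_coef + rng.normal(0, 1.0, 1000)

ols = LinearRegression().fit(X_demo, y_demo)

alphas = [1.0, 100.0, 10000.0]
gaps = []    # largest absolute difference from the OLS coefficients
norms = []   # size of the ridge coefficient vector
for alpha in alphas:
    ridge = Ridge(alpha=alpha).fit(X_demo, y_demo)
    gaps.append(np.abs(ridge.coef_ - ols.coef_).max())
    norms.append(np.linalg.norm(ridge.coef_))
    print(f"alpha={alpha:>8}: max gap to OLS = {gaps[-1]:.4f}, "
          f"coefficient norm = {norms[-1]:.4f}")
```

With many rows and a small alpha the penalty is negligible compared with the data term, so the two models almost coincide; trying a range of alpha values (for example through GridSearchCV) is one way to check whether regularization helps at all.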

### **Decision Tree** 

https://scikit-learn.org/stable/auto_examples/tree/plot_tree_regression.html

In [67]:
# Import Decision tree for Regression from sklearn

from sklearn.tree import DecisionTreeRegressor




In [68]:
# Create a decision tree regression model, use random_state = 1
dtree= DecisionTreeRegressor(random_state = 1)



In [69]:
# Fit decision tree regression model
dtree.fit(X_train, y_train['Price_log'])

DecisionTreeRegressor(random_state=1)

In [70]:
# Get score of the model
Dtree_model = get_model_score(dtree)

R-square on training set :  1.0
R-square on test set :  0.7999924566793265
RMSE on training set :  3.883216341005871e-15
RMSE on test set :  4.716396106505038


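**Observations from results:**
* The training set has a perfect R-square (1.0) and an RMSE of essentially zero, which indicates that the unrestricted tree has overfit (memorized) the training data.
* The test R-square of about 0.80 is better than the 0.70 of the linear models.
* The test RMSE (4.72) is also lower than that of the linear regression model (5.77).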
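
The perfect training score is typical of a tree grown without any depth limit. The sketch below reproduces the effect on synthetic data (an assumption for illustration) and compares it with a tree capped at a depth of 4.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic noisy regression problem (not the cars data)
rng = np.random.default_rng(3)
X_demo = rng.uniform(0, 10, size=(600, 3))
y_demo = np.sin(X_demo[:, 0]) + 0.5 * X_demo[:, 1] + rng.normal(0, 0.5, 600)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=1
)

results = {}
for depth in [None, 4]:
    tree = DecisionTreeRegressor(max_depth=depth, random_state=1).fit(X_tr, y_tr)
    # Store (train R-square, test R-square) for each depth setting
    results[depth] = (tree.score(X_tr, y_tr), tree.score(X_te, y_te))
    print(f"max_depth={depth}: train R2 = {results[depth][0]:.3f}, "
          f"test R2 = {results[depth][1]:.3f}")
```

The unrestricted tree fits the noise in the training rows exactly, which is why its training score cannot be used to judge the model; only the test score is informative.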

Print the importance of each feature in building the tree. The importance of a feature is computed as the (normalized) total reduction of the criterion brought by that feature; for a regression tree the default criterion is the squared error. It is also known as the Gini importance.


In [71]:
feature_imp = pd.DataFrame(dtree.feature_importances_, columns = ["Imp"], index = X_train.columns)
print(feature_imp.sort_values(by = 'Imp', ascending = False))

                                Imp
Power                      0.452682
S.No.                      0.249642
Year                       0.200030
Engine                     0.035708
Mileage                    0.018074
Transmission_Manual        0.007137
Fuel_Type_Diesel           0.006103
Seats                      0.005884
Location_Kolkata           0.004798
Location_Hyderabad         0.003448
Owner_Type_Second          0.002710
Fuel_Type_Petrol           0.002593
Location_Coimbatore        0.002053
Location_Bangalore         0.001908
Location_Mumbai            0.001837
Location_Delhi             0.001668
Location_Jaipur            0.001046
Location_Pune              0.000763
Location_Kochi             0.000762
Location_Chennai           0.000638
Owner_Type_Third           0.000505
Owner_Type_Fourth & Above  0.000008
Fuel_Type_LPG              0.000003
Fuel_Type_Electric         0.000000


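**Observations and insights:**
* 'Power' (0.45), 'S.No.' (0.25), and 'Year' (0.20) account for about 90% of the total importance. 'Engine' and 'Mileage' contribute a little, and every remaining feature is below 0.01.
* 'S.No.' appears to be a row identifier rather than a property of the car, so its high importance is suspicious; it is worth checking whether it should be dropped from X.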

### **Random Forest**

https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html

In [77]:
# Import Randomforest for Regression from sklearn
from sklearn.ensemble import RandomForestRegressor


In [78]:
# Create a Randomforest regression model 
regr = RandomForestRegressor(max_depth=2, random_state=0)


In [79]:
# Fit Randomforest regression model
regr.fit(X_train, y_train['Price_log'])

RandomForestRegressor(max_depth=2, random_state=0)

In [80]:
# Get score of the model
regr_score= get_model_score(regr)


R-square on training set :  0.42505246649018924
R-square on test set :  0.4207492457674372
RMSE on training set :  7.949505388751899
RMSE on test set :  8.026392407400266


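**Observations and insights:**
* This model has performed the worst so far, with a low R-square (about 0.42 on both train and test) and a high RMSE (about 8.0).
* The similar train and test scores point to underfitting rather than overfitting: max_depth=2 limits every tree to at most four leaves.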
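
To illustrate how strongly `max_depth` can limit a random forest, the sketch below compares a forest of depth-2 trees with an unrestricted one on synthetic data (the data and variable names are assumptions for illustration, not the cars data).

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic target with an interaction and a non-linear term
rng = np.random.default_rng(4)
X_demo = rng.uniform(0, 10, size=(600, 4))
y_demo = (X_demo[:, 0] * X_demo[:, 1]
          + 5 * np.sin(X_demo[:, 2])
          + rng.normal(0, 1.0, 600))
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=1
)

shallow = RandomForestRegressor(max_depth=2, random_state=0).fit(X_tr, y_tr)
deep = RandomForestRegressor(max_depth=None, random_state=0).fit(X_tr, y_tr)

shallow_test_r2 = shallow.score(X_te, y_te)
deep_test_r2 = deep.score(X_te, y_te)

# A depth-2 tree can have at most 2**2 = 4 leaves
max_leaves_shallow = max(t.get_n_leaves() for t in shallow.estimators_)

print(f"max_depth=2:    test R2 = {shallow_test_r2:.3f}, "
      f"max leaves per tree = {max_leaves_shallow}")
print(f"max_depth=None: test R2 = {deep_test_r2:.3f}")
```

Averaging many trees reduces variance, but it cannot add detail that none of the individual trees is deep enough to capture.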

**Feature Importance**

In [83]:
# Print important features similar to decision trees
print(pd.DataFrame(regr.feature_importances_, columns = ["Imp"], index = X_train.columns).sort_values(by = 'Imp', ascending= False))


                                Imp
Power                      0.518226
S.No.                      0.296466
Year                       0.141604
Engine                     0.043704
Location_Mumbai            0.000000
Owner_Type_Second          0.000000
Owner_Type_Fourth & Above  0.000000
Transmission_Manual        0.000000
Fuel_Type_Petrol           0.000000
Fuel_Type_LPG              0.000000
Fuel_Type_Electric         0.000000
Fuel_Type_Diesel           0.000000
Location_Pune              0.000000
Location_Kochi             0.000000
Location_Kolkata           0.000000
Location_Jaipur            0.000000
Location_Hyderabad         0.000000
Location_Delhi             0.000000
Location_Coimbatore        0.000000
Location_Chennai           0.000000
Location_Bangalore         0.000000
Seats                      0.000000
Mileage                    0.000000
Owner_Type_Third           0.000000


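**Observations and insights:**
* Only four features have non-zero importance: 'Power' (0.52), 'S.No.' (0.30), 'Year' (0.14), and 'Engine' (0.04). All other features have an importance of zero, because trees limited to a depth of 2 have room for very few splits.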

### **Hyperparameter Tuning: Decision Tree**

In [99]:
# Choose the type of regressor. 
dtree_tuned = DecisionTreeRegressor(random_state = 1)

# Grid of parameters to choose from
parameters = {'max_depth': [None], 
              'min_samples_leaf': [1, 3, 5, 7],
              'max_leaf_nodes' : [2, 5, 7] + [None],
             }


# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.r2_score)

# Run the grid search
grid_obj = GridSearchCV(dtree_tuned, parameters, scoring = scorer,cv = 5)
grid_obj = grid_obj.fit(X_train, y_train['Price_log'])

# Set the model to the best combination of parameters
dtree_tuned = grid_obj.best_estimator_

# Fit the best algorithm to the data. 
dtree_tuned.fit(X_train, y_train['Price_log'])

NotFittedError: All estimators failed to fit

In [100]:
# Print the feature importances of dtree_tuned
print(pd.DataFrame(dtree_tuned.feature_importances_, columns = ["Imp"], index = X_train.columns).sort_values(by = 'Imp', ascending= False))


NotFittedError: This DecisionTreeClassifier instance is not fitted yet. Call 'fit' with appropriate arguments before using this estimator.
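
The traceback above names a DecisionTreeClassifier, which suggests the search may have been run with a classifier rather than the DecisionTreeRegressor shown in the cell; a classifier cannot be fitted to a continuous target such as 'Price_log'. The sketch below uses synthetic data (an assumption, not the cars data) to show the classifier failing and the same kind of grid search succeeding with a regressor.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

# Synthetic data with a continuous target
rng = np.random.default_rng(5)
X_demo = rng.uniform(0, 10, size=(300, 3))
y_demo = 2 * X_demo[:, 0] + np.sin(X_demo[:, 1]) + rng.normal(0, 0.5, 300)

# A classifier rejects a continuous target
try:
    DecisionTreeClassifier(random_state=1).fit(X_demo, y_demo)
    classifier_failed = False
except ValueError as err:
    classifier_failed = True
    print("Classifier error:", err)

# The same search with a regressor runs normally
param_grid = {
    "min_samples_leaf": [1, 3, 5, 7],
    "max_leaf_nodes": [2, 5, 7, None],
}
search = GridSearchCV(
    DecisionTreeRegressor(random_state=1), param_grid, scoring="r2", cv=5
)
search.fit(X_demo, y_demo)

# With the default refit=True, best_estimator_ has already been refitted
# on all the data passed to fit, so no further fit call is needed
best_tree = search.best_estimator_
print(search.best_params_)
print(best_tree.feature_importances_)
```

If the regressor version of the cell is rerun from a fresh kernel, the stale classifier state should no longer be involved.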

**Observations and insights: _____**

**Feature Importance**

In [None]:
# Print important features of tuned decision tree similar to decision trees

**Observations and insights: _____**

### **Hyperparameter Tuning: Random Forest**

In [None]:
# Choose the type of regressor

rf_tuned = RandomForestRegressor(random_state = 1,oob_score = True)

# Grid of parameters to choose from
parameters = {  
                'max_depth':[5,7,None],
                'max_features': ['sqrt','log2'],
                'n_estimators': [250,500,800]
}

# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.r2_score)

# Run the grid search
grid_obj = GridSearchCV(rf_tuned, parameters, scoring=scorer,cv=5)
grid_obj = grid_obj.fit(X_train, y_train['Price_log'])

# Set the model to the best combination of parameters
rf_tuned = grid_obj.best_estimator_

# Fit the best algorithm to the data. 
rf_tuned.fit(X_train, y_train['Price_log'])
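
Setting `oob_score = True` makes the forest evaluate each training row using only the trees that did not see that row during bootstrapping; for a regressor the result is an R-square stored in `oob_score_`. A minimal sketch on synthetic data (an assumption for illustration, not the cars data):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic regression data
rng = np.random.default_rng(6)
X_demo = rng.uniform(0, 10, size=(400, 4))
y_demo = 3 * X_demo[:, 0] + X_demo[:, 1] ** 2 + rng.normal(0, 2.0, 400)

forest = RandomForestRegressor(
    n_estimators=200, oob_score=True, random_state=1
).fit(X_demo, y_demo)

train_r2 = forest.score(X_demo, y_demo)   # optimistic: rows were seen in training
oob_r2 = forest.oob_score_                # out-of-bag estimate of R-square

print(f"Training R2:   {train_r2:.3f}")
print(f"Out-of-bag R2: {oob_r2:.3f}")
```

The out-of-bag score gives a validation-style estimate without setting aside extra data, which can serve as a quick sanity check next to the cross-validated grid search.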

In [None]:
# Get score of the model

**Observations and insights: _____**

**Feature Importance**

In [None]:
# Print important features of the tuned random forest, similar to decision trees

**Observations and insights: ______**

In [None]:
# Defining list of models you have trained
models = [lr, dtree, __________________]

# Defining empty lists to add train and test results
r2_train = []
r2_test = []
rmse_train = []
rmse_test = []

# Looping through all the models to get the rmse and r2 scores
for model in models:
    
    # Accuracy score
    j = get_model_score(model, False)
    
    r2_train.append(j[0])
    
    r2_test.append(j[1])
    
    rmse_train.append(j[2])
    
    rmse_test.append(j[3])

In [None]:
comparison_frame = pd.DataFrame({'Model':['Linear Regression','Decision Tree', ___________, ___________], 
                                          'Train_r2': r2_train,'Test_r2': r2_test,
                                          'Train_RMSE': rmse_train,'Test_RMSE': rmse_test}) 
comparison_frame

**Observations: _____**

**Note:** You can also try some other algorithms such as KNN and compare the model performance with the existing ones.
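
As a starting point for the KNN suggestion, the sketch below builds a scaled KNN regressor on synthetic data. The scaling step, the choice of `n_neighbors=5`, and the data are all assumptions for illustration; KNN relies on distances, so features with large ranges would otherwise dominate.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data with features on very different scales
rng = np.random.default_rng(7)
X_demo = np.column_stack([
    rng.uniform(50, 200, 500),      # e.g. a power-like feature
    rng.uniform(0, 1, 500),         # a feature on a much smaller scale
])
y_demo = 0.01 * X_demo[:, 0] + 2.0 * X_demo[:, 1] + rng.normal(0, 0.2, 500)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=1
)

# Standardize the features, then predict from the 5 nearest neighbours
knn = make_pipeline(StandardScaler(), KNeighborsRegressor(n_neighbors=5))
knn.fit(X_tr, y_tr)

knn_pred = knn.predict(X_te)
knn_test_r2 = knn.score(X_te, y_te)
print(f"KNN test R2: {knn_test_r2:.3f}")
```

A pipeline like this, fitted on X_train and y_train['Price_log'], exposes the same predict method as the other models, so it could be passed to get_model_score and added to the comparison.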

### **Insights**

**Refined insights**:
- What are the most meaningful insights from the data relevant to the problem?

**Comparison of various techniques and their relative performance**:
- How do different techniques perform? Which one is performing relatively better? Is there scope to improve the performance further?

**Proposal for the final solution design**:
- What model do you propose to be adopted? Why is this the best solution to adopt?