# **Predicting Used Car Prices**

# Part 2: Predictive Model Construction 

* **Step 1:** Define the dependent and independent variables 
    * Price_log is the target (dependent) variable, because its value depends on the values of the feature (independent) variables.  
* **Step 2:** Encode the categorical variables in X using pd.get_dummies. 
    * The Model variable is excluded, because it has 211 unique values.
* **Step 3:** Split the data into train and test using train_test_split.  
    * Test size = 0.3 
* **Step 4:** Construct predictive regression models: 
    * Linear regression model 
    * Lasso regression model 
    * Decision Tree 
        + Hyperparameter tuning 
    * Random Forest
        + Hyperparameter tuning 
* **Step 5:** Evaluate, compare and identify the best performing model 

## Table of Contents
1. [Set up](#setup)
2. [Train Test Split](#traintest)
3. [Linear Regression](#linear) 
4. [Lasso Regression](#lasso)
5. [Decision Tree](#dtree)  
    * [Decision Tree: Hyperparameter Tuning](#dtree-tuning) 
6. [Random Forest](#randomforest) 
    * [Random Forest: Hyperparameter Tuning](#rf-tuning) 
7. [Conclusion](#conclusion) 

<a id="setup"></a>
# **1. Set up** 

### Libraries 

In [1]:
# Standard library imports
import numpy as np
import pandas as pd 

# Statsmodels for statistical modeling
import statsmodels.api as sm

# Scikit-learn imports for model selection and evaluation
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV, train_test_split, cross_val_score
from sklearn.metrics import classification_report, f1_score, mean_squared_error, r2_score, make_scorer 
from sklearn import metrics

# Scikit-learn imports for preprocessing
from sklearn.preprocessing import StandardScaler

# Scikit-learn imports for linear models
from sklearn.linear_model import LinearRegression, Ridge, Lasso

# Scikit-learn imports for tree-based models
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor  

### Data set 

In [2]:
df = pd.read_excel('Used_cars_final.xlsx') 
df.shape 

(6018, 13)

Double-checking that we have no missing values in our dataframe 

In [3]:
df.isnull().sum() 

Brand            0
Model            0
Year             0
Owner            0
KM_driven_log    0
Fuel             0
Engine_log       0
Power_log        0
Mileage_log      0
Transmission     0
Seats            0
Location         0
Price_log        0
dtype: int64

<hr>

<a id="traintest"></a>
# **2. Train Test Split** 

**Data types:** We have 6 categorical and 7 numerical variables.  

In [4]:
df.dtypes 

Brand             object
Model             object
Year               int64
Owner             object
KM_driven_log    float64
Fuel              object
Engine_log       float64
Power_log        float64
Mileage_log      float64
Transmission      object
Seats              int64
Location          object
Price_log        float64
dtype: object

**Number of unique values in each column**

In [5]:
df.nunique() 

Brand              31
Model             211
Year               22
Owner               4
KM_driven_log    3092
Fuel                5
Engine_log        151
Power_log         393
Mileage_log       440
Transmission        2
Seats               8
Location           11
Price_log        1373
dtype: int64

In [6]:
print('Number of unique car Models: ', df['Model'].nunique())  # unique models 
print('Number of unique car Brands: ', df['Brand'].nunique())  # unique brands 

Number of unique car Models:  211
Number of unique car Brands:  31
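
A quick back-of-the-envelope check shows why `Model` is left out of the encoding. The sketch below uses the unique-value counts printed above and assumes one-hot encoding with `drop_first=True`, which creates one column fewer than the number of categories. The names `categorical_counts` and `dummy_columns` are introduced here for illustration only.

```python
# Unique-value counts copied from the df.nunique() output above
categorical_counts = {
    "Brand": 31, "Model": 211, "Owner": 4,
    "Fuel": 5, "Transmission": 2, "Location": 11,
}

def dummy_columns(counts, exclude=()):
    """Number of dummy columns with drop_first=True (k - 1 per categorical column)."""
    return sum(k - 1 for name, k in counts.items() if name not in exclude)

with_model = dummy_columns(categorical_counts)
without_model = dummy_columns(categorical_counts, exclude=("Model",))

print(f"Dummy columns with Model:    {with_model}")
print(f"Dummy columns without Model: {without_model}")
```

Keeping `Model` would add 210 mostly sparse columns to a data set of about 6,000 rows, which is the motivation for dropping it here.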


I've already applied log_transformation to the numerical variables in the previous notebook. Now, I'm also applying StandardScaler. First, the log transformation deals with skewness, and then the scaling ensures that all features contribute equally to model training, avoiding biases towards features with higher magnitude.
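
The scaling step can be illustrated on synthetic data. This is a minimal sketch, not the notebook's actual data: the two made-up features (one on the scale of a year, one on the scale of a log value) and all `*_demo` names are assumptions for illustration. It shows that the scaler learns its mean and standard deviation from the training set only and then reuses them on the test set.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Two synthetic features on very different scales (illustrative only)
X_demo_train = np.column_stack([rng.normal(2015, 3, 200), rng.normal(4.5, 0.5, 200)])
X_demo_test = np.column_stack([rng.normal(2015, 3, 50), rng.normal(4.5, 0.5, 50)])

scaler_demo = StandardScaler()
X_demo_train_scaled = scaler_demo.fit_transform(X_demo_train)  # learn mean/std from train
X_demo_test_scaled = scaler_demo.transform(X_demo_test)        # reuse the train statistics

print("Train means after scaling:", X_demo_train_scaled.mean(axis=0).round(6))
print("Train stds after scaling: ", X_demo_train_scaled.std(axis=0).round(6))
print("Test means after scaling: ", X_demo_test_scaled.mean(axis=0).round(3))
```

The test means are close to, but not exactly, zero, because the test set is transformed with the training statistics. That is intended: it keeps information from the test set out of the preprocessing.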

* I have set the random_state parameter to 42. This parameter controls the shuffling applied to the data before it is split. Setting random_state to a fixed number makes the split reproducible: the same split occurs every time the code is run.  
    * The specific value, 42, has no special properties; it is simply a common choice in examples because of its cultural reference ("the answer to life, the universe, and everything" in Douglas Adams' "The Hitchhiker's Guide to the Galaxy").
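
The effect of a fixed seed can be checked on a toy array. The sketch below is illustrative only (the array `data` and the seed 7 are arbitrary choices, not part of the analysis).

```python
import numpy as np
from sklearn.model_selection import train_test_split

data = np.arange(20)

# Same seed -> identical split on every call
a_train, a_test = train_test_split(data, test_size=0.3, random_state=42)
b_train, b_test = train_test_split(data, test_size=0.3, random_state=42)

# A different seed will generally produce a different split
c_train, c_test = train_test_split(data, test_size=0.3, random_state=7)

print("Same seed, same test set:     ", np.array_equal(a_test, b_test))
print("Different seed, same test set:", np.array_equal(a_test, c_test))
```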

In [7]:
X = df.drop(['Price_log', 'Model'], axis=1) # drop the 'Price_log' and 'Model' columns 
y = df['Price_log'] # target variable

X = pd.get_dummies(X, drop_first=True) # convert categorical variables to dummy variables 

# Setting test size to 30%
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Scale the features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train) 
X_test_scaled = scaler.transform(X_test) 

**Model scores function** 
- To avoid repetitive code, I define a function that calculates the R-squared and RMSE scores on the train and test data.  
- The function takes a fitted model as input, together with the train and test sets.  

In [8]:
def get_model_scores(model, X_train, y_train, X_test, y_test):
    # Predict on training and test data
    pred_train = model.predict(X_train)
    pred_test = model.predict(X_test)

    # Scores for training set
    train_r2 = r2_score(y_train, pred_train)
    train_rmse = np.sqrt(mean_squared_error(y_train, pred_train))

    # Scores for test set
    test_r2 = r2_score(y_test, pred_test)
    test_rmse = np.sqrt(mean_squared_error(y_test, pred_test))

    print(f"R-squared on training set: {train_r2}")
    print(f"RMSE on training set: {train_rmse}, \n")
    print(f"R-squared on test set: {test_r2}")
    print(f"RMSE on test set: {test_rmse}")

    return train_r2, test_r2, train_rmse, test_rmse  

<hr>

<a id="linear"></a>
# **3. Linear Regression Model** 

Create and fit the linear regression model

In [None]:
lr = LinearRegression()
lr.fit(X_train, y_train)

Get scores of the linear regression model

In [9]:
lr_scores = get_model_scores(lr, X_train, y_train, X_test, y_test)  

R-squared on training set: 0.9366047035414113
RMSE on training set: 0.2192064264890547, 

R-squared on test set: 0.9345409028821385
RMSE on test set: 0.2252220071240112


* **Consistency Across Datasets:** The R-squared values for the training (0.9366) and test (0.9345) sets are very close, which indicates that the linear model generalizes well to unseen data and does not suffer significantly from overfitting.  
* **Error Metrics:** The Root Mean Squared Error (RMSE) values are low (0.2192 on the training set and 0.2252 on the test set), which implies that on average, the model’s predictions deviate from the actual logarithm of prices by about 0.22. This suggests that the model predictions are relatively precise.  
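
An RMSE on the log scale is easier to read once it is translated back to prices. The sketch below assumes that Price_log is the natural logarithm of the price; the transformation was done in the previous notebook, so this is an assumption rather than something confirmed here. Under that assumption, an error of r log-units corresponds to a prediction that is off by a factor of about exp(r).

```python
import numpy as np

test_rmse_log = 0.2252  # test RMSE of the linear model, from the output above

# Assumption: Price_log = ln(price). A typical error of r log-units then
# corresponds to predictions that are off by a factor of about exp(r).
typical_factor = np.exp(test_rmse_log)

print(f"Typical multiplicative error: x{typical_factor:.3f}")
print(f"Roughly +{(typical_factor - 1) * 100:.1f}% / -{(1 - 1 / typical_factor) * 100:.1f}%")
```

Under this reading, a typical prediction is within roughly 20 to 25 percent of the actual price, which gives a more tangible sense of "0.22 on the log scale".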

**Ordinary Least Squares (OLS)**   
* OLS is a method of estimating the parameters of a linear regression model. It specifically minimizes the sum of the squared differences between the observed values and the values predicted by the linear model.
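
The minimization that OLS performs can be sketched with plain NumPy on synthetic data. Everything below (the data, the true intercept of 3 and slope of 2, the `*_demo` names) is made up for illustration; the column of ones plays the role that `sm.add_constant` plays in the statsmodels cell.

```python
import numpy as np

rng = np.random.default_rng(0)
x_demo = rng.uniform(0, 10, size=100)
y_demo = 3.0 + 2.0 * x_demo + rng.normal(0, 0.5, size=100)  # true intercept 3, slope 2

# Design matrix with a constant column for the intercept
A_demo = np.column_stack([np.ones_like(x_demo), x_demo])

# Least-squares solution: the beta that minimizes ||A @ beta - y||^2
beta, *_ = np.linalg.lstsq(A_demo, y_demo, rcond=None)

residuals = y_demo - A_demo @ beta
ssr = float(residuals @ residuals)

print("Estimated intercept and slope:", beta.round(3))
print("Sum of squared residuals:", round(ssr, 3))
```

Any other choice of intercept and slope gives a larger sum of squared residuals on this data, which is exactly the property that defines the OLS estimate.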

In [10]:
# Ensuring the indices are aligned correctly
X_train = X_train.reset_index(drop=True)
y_train = y_train.reset_index(drop=True)

# Explicitly convert boolean to int if necessary (usually, this should not be required)
X_train = X_train.astype(float)

x_train_sm = sm.add_constant(X_train) # add a constant to the model data
y_train = y_train.astype(float) # ensure no data type issues

# Fit the model using statsmodels
try:
    ols_model = sm.OLS(y_train, x_train_sm).fit()
    print(ols_model.summary())
except Exception as e:
    print("Error fitting the model:", e)
    print("Check the input data with np.asarray(data).") 
    print(x_train_sm.head())
    print(y_train.head())  

                            OLS Regression Results                            
Dep. Variable:              Price_log   R-squared:                       0.937
Model:                            OLS   Adj. R-squared:                  0.936
Method:                 Least Squares   F-statistic:                     1205.
Date:                Tue, 07 May 2024   Prob (F-statistic):               0.00
Time:                        12:14:41   Log-Likelihood:                 416.16
No. Observations:                4212   AIC:                            -728.3
Df Residuals:                    4160   BIC:                            -398.3
Df Model:                          51                                         
Covariance Type:            nonrobust                                         
                           coef    std err          t      P>|t|      [0.025      0.975]
----------------------------------------------------------------------------------------
const                 -221.3663 

**Observations:**
* **Categorical Variables:** The model handles categorical variables (like brand and location) well, showing significant differences in price impacts across different categories, which could help stakeholders understand brand and location premiums or discounts in the used car market.
* **F-statistic and its Prob (F-statistic):** The F-statistic is very high at 1205, and the probability of the F-statistic is approximately 0.00. This suggests that the model explains significantly more of the variation in Price_log than a model with no independent variables.  
* **Statistical Significance:** Most predictors are statistically significant as indicated by their p-values (P>|t| nearly 0.00), which suggests these variables have a significant impact on the logarithm of used car prices.  
  
* **Influence of Features:**  
    *  Year, KM_driven_log, Engine_log, Power_log, Mileage_log, and Seats are significant with small p-values, indicating strong evidence against the null hypothesis of no effect.  
    * Positive coefficients in the model (e.g., Power_log and Engine_log) suggest that higher engine power and size lead to higher prices. 
    *  Various brands show different impacts on the price: 
        * Negative coefficients for brands like Brand_Chevrolet and Brand_Ford might indicate these brands generally fetch lower prices compared to the baseline brand.
        * Brand_Bentley and Brand_Lamborghini have a positive impact on price, indicating these are more expensive on average.  
    *  Owner_Second and Owner_Third have negative coefficients, suggesting that vehicles with previous owners typically have lower prices.  
    *  Fuel types like Diesel and Electric have positive coefficients, suggesting these are associated with higher prices compared to the baseline fuel type (the category dropped by `drop_first=True`, which is the first fuel type in sorted order).  
    *  Transmission_Manual has a negative coefficient, indicating manual cars are cheaper compared to automatic ones.  
    *  Geographic location also impacts prices. For example, cars in Location_Bangalore fetch higher prices, while those in Location_Kolkata fetch lower prices.
    * The coefficient for Year is positive, indicating newer cars tend to have higher prices, which is expected.






**Important variables of the Linear Regression**

In [11]:
# Retrieve the coefficient values and p-values and store them in a dataframe
olsmod = pd.DataFrame(ols_model.params, columns = ['coef'])
olsmod['pval'] = ols_model.pvalues 

In [33]:
# Sort by p-value (descending), then keep only significant predictors (pval <= 0.05)
olsmod = olsmod.sort_values(by = "pval", ascending = False)
pval_filter = olsmod['pval']<= 0.05  
olsmod[pval_filter].head(10) 

                         coef          pval
Location_Jaipur     -0.046295  3.539471e-02
Location_Pune       -0.053998  9.238964e-03
Location_Mumbai     -0.072418  3.317588e-04
Fuel_Diesel          0.157617  3.688641e-05
Location_Delhi      -0.091592  1.117350e-05
Mileage_log         -0.152170  9.650110e-07
Owner_Third         -0.129686  4.956927e-07
Location_Coimbatore  0.109602  1.208168e-07
Location_Hyderabad   0.117505  7.359720e-09
Location_Bangalore   0.143263  2.984784e-10
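
Because the target is Price_log, the dummy coefficients act multiplicatively on the price. The sketch below converts three coefficients from the table above into approximate percentage effects. It assumes that Price_log is the natural logarithm of the price (not confirmed in this notebook) and that all other features are held fixed; the name `pct_effects` is introduced for illustration.

```python
import numpy as np

# Coefficients copied from the table above
coefs = {
    "Location_Bangalore": 0.143263,
    "Fuel_Diesel": 0.157617,
    "Owner_Third": -0.129686,
}

# Assumption: Price_log = ln(price), so a dummy coefficient b implies a
# price ratio of exp(b) relative to the dropped baseline category.
pct_effects = {name: (np.exp(b) - 1) * 100 for name, b in coefs.items()}

for name, pct in pct_effects.items():
    print(f"{name:<20} coef={coefs[name]:+.3f}  ->  {pct:+.1f}% vs. baseline")
```

Read this way, a car listed in Bangalore is priced roughly 15% above an otherwise identical car in the baseline location, while a third-owner car is priced roughly 12% below a car in the baseline owner category.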


<hr>

<a id="lasso"></a>
# **4. Lasso Regression Model** 

* **[Lasso regression model](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Lasso.html)** is a type of linear regression that includes a regularization term. Lasso regression aims to minimize the residual sum of squares (like ordinary least squares) plus the sum of the absolute values of the coefficients multiplied by a constant, alpha.

* **The alpha parameter** controls the strength of the regularization. A larger alpha means more regularization, which increases the penalty for larger coefficients and can drive some coefficients to zero, effectively performing variable selection. An alpha of 0.01 in our model suggests moderate regularization.
* **The max_iter parameter**, set to 50000, specifies the maximum number of iterations the solver may run while converging to the optimal coefficients. Increasing this value from 10000 to 50000 helped ensure convergence.  
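
The way alpha drives coefficients to zero can be seen on a small synthetic problem. This is a sketch under stated assumptions, not the used-car data: ten made-up standardized features, of which only the first three influence the target, and an arbitrary list of alpha values.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n_samples, n_features = 300, 10
X_demo = rng.normal(size=(n_samples, n_features))

# Only the first three features matter in this synthetic target
y_demo = (2.0 * X_demo[:, 0] - 1.0 * X_demo[:, 1] + 0.5 * X_demo[:, 2]
          + rng.normal(0, 0.1, n_samples))

X_demo = StandardScaler().fit_transform(X_demo)

# Count the surviving (non-zero) coefficients for increasing penalties
nonzero_counts = {}
for alpha in [0.001, 0.01, 0.1, 1.0]:
    lasso_demo = Lasso(alpha=alpha, max_iter=50000).fit(X_demo, y_demo)
    nonzero_counts[alpha] = int(np.sum(lasso_demo.coef_ != 0))
    print(f"alpha={alpha:<6} non-zero coefficients: {nonzero_counts[alpha]}")
```

With a small alpha, the noise features tend to keep small non-zero coefficients; as alpha grows, they are set to exactly zero first, and eventually even the weaker true predictors are removed.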

**Initialize and fit the Lasso regression model** 

In [13]:
lasso = Lasso(alpha=0.01, max_iter=50000)  # increased from 10000 to 50000
lasso.fit(X_train_scaled, y_train) 

**Performance of the Lasso model** 

In [14]:
lasso_scores = get_model_scores(lasso, X_train_scaled, y_train, X_test_scaled, y_test)  

R-squared on training set: 0.9309619017912658
RMSE on training set: 0.22875425097595672, 

R-squared on test set: 0.9307853387669305
RMSE on test set: 0.2315927010503538


**Observations:** 
* **R-squared:** The Lasso model achieves an R-squared of 0.9310 on the training set and 0.9308 on the test set, which are quite high. These values indicate that the model explains a significant portion of the variance in the dependent variable.
* **RMSE (Root Mean Squared Error):** The RMSE values are 0.2288 on the training set and 0.2316 on the test set, suggesting that on average, the model’s predictions deviate from the actual values by these amounts (on the logarithmic scale of price, since the dependent variable is Price_log).

<hr>

<a id="dtree"></a>
# **5. Decision Tree Model** 

Learn more about Decision Tree models: https://scikit-learn.org/stable/auto_examples/tree/plot_tree_regression.html

Initialize and fit the Decision Tree regression model

In [15]:
dtree = DecisionTreeRegressor(max_depth=5)  
dtree.fit(X_train_scaled, y_train) 

Decision Tree model performance scores

In [16]:
dtree_scores = get_model_scores(dtree, X_train_scaled, y_train, X_test_scaled, y_test) 

R-squared on training set: 0.8595009260455575
RMSE on training set: 0.3263333765320665, 

R-squared on test set: 0.8563424048174433
RMSE on test set: 0.3336492209615977


<a id="dtree-tuning"></a>
# - Decision Tree: Hyperparameter Tuning  

In [17]:
# The type of estimator
dtree_tuned = DecisionTreeRegressor(random_state=1)

# Grid of parameters 
parameters = {
    'max_depth': [None, 10, 20, 30, 40, 50],
    'min_samples_split': [2, 10, 20],
    'min_samples_leaf': [1, 5, 10],
    'max_features': [None, 'sqrt', 'log2']  # removed 'auto'
}

scorer = make_scorer(r2_score) # type of scoring used to compare parameter combinations

# Run the grid search with error_score set to 'raise' for better debugging
grid_obj = GridSearchCV(dtree_tuned, parameters, scoring=scorer, cv=5, error_score='raise')
grid_obj = grid_obj.fit(X_train, y_train)

dtree_tuned = grid_obj.best_estimator_  # regressor with the best combination of parameters 
dtree_tuned.fit(X_train, y_train) # refit on the full training set (GridSearchCV already does this when refit=True, the default) 

Evaluate the model using the generalized function

In [18]:
dtree_tuned_scores = get_model_scores(dtree_tuned, X_train, y_train, X_test, y_test) 

R-squared on training set: 0.9597444511968335
RMSE on training set: 0.17467777025796236, 

R-squared on test set: 0.9004490790861175
RMSE on test set: 0.2777464137475989
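
The cost of the exhaustive search above follows directly from the size of the grid. The sketch below repeats the decision tree grid from the tuning cell and counts the candidate combinations with `ParameterGrid`; the variable names `n_candidates` and `total_fits` are introduced for illustration.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as in the decision tree tuning cell above
parameters = {
    'max_depth': [None, 10, 20, 30, 40, 50],
    'min_samples_split': [2, 10, 20],
    'min_samples_leaf': [1, 5, 10],
    'max_features': [None, 'sqrt', 'log2'],
}

n_candidates = len(ParameterGrid(parameters))  # 6 * 3 * 3 * 3
cv_folds = 5
total_fits = n_candidates * cv_folds

print(f"Candidate combinations: {n_candidates}")
print(f"Model fits with cv={cv_folds}: {total_fits}")
```

A single decision tree fits quickly, so 810 cross-validation fits are affordable here. The same exhaustive approach becomes expensive for a Random Forest, which is one reason to use a randomized search for that model.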


Important features of the tuned decision tree 

In [19]:
# Create a DataFrame with feature importances
feature_importances = pd.DataFrame(dtree_tuned.feature_importances_, columns=["Imp"], index=X_train.columns)

# Sort the DataFrame by importance in descending order and print the first 10 rows
print(feature_importances.sort_values(by='Imp', ascending=False).head(10)) 

                       Imp
Power_log         0.658116
Year              0.232433
Engine_log        0.041731
Mileage_log       0.015190
KM_driven_log     0.012006
Brand_Mahindra    0.005094
Location_Kolkata  0.004568
Brand_Tata        0.004114
Brand_Honda       0.003650
Seats             0.002919


<hr>

<a id="randomforest"></a>
# **6. Random Forest Model**  

Learn more about Random Forest Regressor: https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html 

**Initialize the Random Forest Regressor and fit on the training data** 

In [20]:
rf = RandomForestRegressor(n_estimators=100, random_state=1)  
rf.fit(X_train_scaled, y_train) 

**The model scores**

In [21]:
rf_scores = get_model_scores(rf, X_train_scaled, y_train, X_test_scaled, y_test) 

R-squared on training set: 0.9912164586184354
RMSE on training set: 0.0815942565250492, 

R-squared on test set: 0.9406402841911602
RMSE on test set: 0.21447255503980395


<a id="rf-tuning"></a>
# - Random Forest: Hyperparameter Tuning  

In [22]:
rf = RandomForestRegressor(random_state=42)

param_dist = {
    'n_estimators': [100, 200],
    'max_features': ['sqrt', 0.5],  # changed 'auto' to 'sqrt'
    'max_depth': [None, 10, 20],
    'min_samples_split': [2, 10],
    'min_samples_leaf': [1, 4]
}

scorer = make_scorer(r2_score)
n_iter_search = 10

random_search = RandomizedSearchCV(rf, param_distributions=param_dist, n_iter=n_iter_search, scoring=scorer, cv=5, random_state=42, n_jobs=-1, 
                                   error_score='raise', verbose=0)
random_search.fit(X_train, y_train)

rf_tuned = random_search.best_estimator_ 

Score of the rf_tuned model

In [23]:
rf_tuned_score = get_model_scores(rf_tuned, X_train, y_train, X_test, y_test) 

R-squared on training set: 0.9912221781584956
RMSE on training set: 0.08156768651216363, 

R-squared on test set: 0.9460207132932678
RMSE on test set: 0.20452169804341983


In [24]:
rf = RandomForestRegressor(random_state=42) # the model

# Distribution of parameters to choose from
param_dist = {
    'n_estimators': [100, 200, 300, 400],
    'max_features': ['sqrt', 'log2', None],
    'max_depth': [None, 10, 20, 30, 40],
    'min_samples_split': np.arange(2, 21),
    'min_samples_leaf': np.arange(1, 21)
}

# Number of iterations and the scoring function
n_iter_search = 20
scorer = make_scorer(r2_score)  
        
random_search = RandomizedSearchCV(rf, param_distributions=param_dist, n_iter=n_iter_search, scoring=scorer, cv=5, random_state=42, n_jobs=-1)
random_search.fit(X_train, y_train)

rf_tuned = random_search.best_estimator_ 

In [25]:
rf_tuned_scores = get_model_scores(rf_tuned, X_train, y_train, X_test, y_test) 

R-squared on training set: 0.9667341857889861
RMSE on training set: 0.15879025367691654, 

R-squared on test set: 0.9349252813194433
RMSE on test set: 0.224559777214033


**Feature Importance**  

In [32]:
feature_importances = pd.DataFrame(rf_tuned.feature_importances_,
                                   index = X_train.columns,
                                   columns=['importance']).sort_values('importance', ascending=False)
print(feature_importances.head(5)) 

               importance
Power_log        0.659381
Year             0.230484
Engine_log       0.033509
Mileage_log      0.011918
KM_driven_log    0.011742


<hr>

<a id="conclusion"></a>
# **7. Conclusion** 

In [27]:
# Store model scores in a dictionary 
model_scores = {
    "Linear Regression": lr_scores,
    "Lasso": lasso_scores,
    "Decision Tree": dtree_scores, 
    "Tuned Decision Tree": dtree_tuned_scores, 
    "Random Forest": rf_scores,
    "Tuned Random Forest": rf_tuned_scores
}

# the headers
print("{:<20} {:<15} {:<15} {:<15} {:<15}".format('Model', 'Train R^2', 'Test R^2', 'Train RMSE', 'Test RMSE'))

# Print each model's scores in a formatted output
for model_name, scores in model_scores.items():
    print("{:<20} {:<15.3f} {:<15.3f} {:<15.3f} {:<15.3f}".format(
        model_name,
        scores[0],  # Train R^2
        scores[1],  # Test R^2
        scores[2],  # Train RMSE
        scores[3]   # Test RMSE
    )) 

Model                Train R^2       Test R^2        Train RMSE      Test RMSE      
Linear Regression    0.937           0.935           0.219           0.225          
Lasso                0.931           0.931           0.229           0.232          
Decision Tree        0.860           0.856           0.326           0.334          
Tuned Decision Tree  0.960           0.900           0.175           0.278          
Random Forest        0.991           0.941           0.082           0.214          
Tuned Random Forest  0.967           0.935           0.159           0.225          


**Linear Regression and Lasso:**  
* Both models show high and comparable R^2 values on the training and test sets, indicating good generalization. The RMSE values are also quite low, suggesting that the predictions are close to the actual values.  
* The slight performance drop from Linear Regression to Lasso could be due to the regularization in Lasso which might be removing some useful predictors due to the shrinkage of coefficients.    

**Decision Trees:**
* The standard Decision Tree has significantly lower R^2 values and higher RMSE compared to other models, indicating less predictive accuracy and higher errors.  
* Tuning improves the Decision Tree drastically on the training set (R^2 of 0.960 and RMSE of 0.175) and also on the test set (R^2 of 0.900 and RMSE of 0.278), but the gap between training and test RMSE (0.175 vs. 0.278) suggests some overfitting despite tuning.  

**Random Forest:**
* The standard Random Forest model shows an extremely high R^2 (0.991) and a very low RMSE (0.082) on the training data. However, the much higher Test RMSE (0.214, though still the lowest of all models) hints at some overfitting.  

> **Tuned Random Forest:**  
* Best General Model  
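* The Tuned Random Forest has a lower training R^2 than the standard version (0.967 vs. 0.991) but a smaller gap between training and test scores, suggesting less overfitting. Its test scores (R^2 of 0.935, RMSE of 0.225) are slightly below those of the standard Random Forest (0.941 and 0.214). 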
* This model provides a strong balance between high R^2 and low RMSE on both training and test datasets. The performance is robust with a relatively small difference between training and test results, indicating good generalization without significant overfitting.
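
The overfitting argument can be made concrete by computing the train/test gap for each model. The sketch below copies the rounded scores from the summary table above; the names `summary` and `gaps` are introduced for illustration, and the gap is only a rough indicator, not a formal test.

```python
# (train R^2, test R^2, train RMSE, test RMSE), copied from the summary table above
summary = {
    "Linear Regression":   (0.937, 0.935, 0.219, 0.225),
    "Lasso":               (0.931, 0.931, 0.229, 0.232),
    "Decision Tree":       (0.860, 0.856, 0.326, 0.334),
    "Tuned Decision Tree": (0.960, 0.900, 0.175, 0.278),
    "Random Forest":       (0.991, 0.941, 0.082, 0.214),
    "Tuned Random Forest": (0.967, 0.935, 0.159, 0.225),
}

# Gap = how much worse the model does on unseen data than on training data
gaps = {
    name: (round(tr_r2 - te_r2, 3), round(te_rmse - tr_rmse, 3))
    for name, (tr_r2, te_r2, tr_rmse, te_rmse) in summary.items()
}

print("{:<20} {:<12} {:<12}".format("Model", "R^2 gap", "RMSE gap"))
for name, (r2_gap, rmse_gap) in sorted(gaps.items(), key=lambda item: item[1][1]):
    print("{:<20} {:<12.3f} {:<12.3f}".format(name, r2_gap, rmse_gap))
```

Sorted this way, the linear models show almost no gap, the tuned Random Forest roughly halves the RMSE gap of the standard Random Forest, and the standard Random Forest has the largest gap of all six models.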

Simplicity vs. Performance: If deployment simplicity is a priority (e.g., faster predictions, easier model management), Linear Regression might be a suitable choice as it still provides robust performance metrics and will generally be faster and easier to manage than a complex ensemble model like a Random Forest.