## Used Car Price Prediction

### 1) Problem statement.

This dataset comprises used cars sold on cardekho.com in India, along with important features of these cars.

The goal is to predict the price of a car from its input features.

• Prediction results can be used to suggest a price to new sellers based on market conditions.

### 2) Data Collection.

• The dataset was collected by scraping the CarDekho website.

• The data consists of 13 columns and 15411 rows.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import warnings

warnings.filterwarnings("ignore")

%matplotlib inline

In [2]:
df = pd.read_csv("cardekho_dataset.csv")
df.head()

Unnamed: 0,car_name,brand,model,vehicle_age,km_driven,seller_type,fuel_type,transmission_type,mileage,engine,max_power,seats,selling_price
0,Maruti Alto,Maruti,Alto,9,120000,Individual,Petrol,Manual,19.7,796,46.3,5,120000
1,Hyundai Grand,Hyundai,Grand,5,20000,Individual,Petrol,Manual,18.9,1197,82.0,5,550000
2,Hyundai i20,Hyundai,i20,11,60000,Individual,Petrol,Manual,17.0,1197,80.0,5,215000
3,Maruti Alto,Maruti,Alto,9,37000,Individual,Petrol,Manual,20.92,998,67.1,5,226000
4,Ford Ecosport,Ford,Ecosport,6,30000,Dealer,Diesel,Manual,22.77,1498,98.59,5,570000


# Data Cleaning
## Handling Missing Values
1. Handling Missing Values 
2. Handling Duplicates
3. Check Data Type
4. Understand the dataset
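
The cells that follow only run the null check, so steps 2 and 3 of the list above can be sketched as follows. This is a minimal illustration on a made-up three-column frame (`toy` and its values are assumptions, not rows verified against the real file); the same calls apply to `df`.

```python
import pandas as pd

# Toy frame standing in for the car data (illustrative values only).
toy = pd.DataFrame(
    {
        "model": ["Alto", "Grand", "Alto", "i20"],
        "km_driven": [120000, 20000, 120000, 60000],
        "selling_price": [120000, 550000, 120000, 215000],
    }
)

# 2. Handling duplicates: count fully identical rows, then drop them.
n_duplicates = int(toy.duplicated().sum())
deduped = toy.drop_duplicates().reset_index(drop=True)

# 3. Check data types: string-typed columns are the categorical candidates.
dtypes = toy.dtypes

print(n_duplicates)   # 1 (rows 0 and 2 are identical)
print(deduped.shape)  # (3, 3)
print(dtypes)
```

Whether duplicates should be dropped at all is a judgment call for this dataset: two listings with identical attributes may be genuine separate sales.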

In [3]:
## Check null values
## Count NaN values per feature
df.isnull().sum()

car_name             0
brand                0
model                0
vehicle_age          0
km_driven            0
seller_type          0
fuel_type            0
transmission_type    0
mileage              0
engine               0
max_power            0
seats                0
selling_price        0
dtype: int64

In [4]:
## Remove unnecessary columns
df.drop('car_name', axis=1, inplace=True)
df.drop('brand', axis=1, inplace=True)


In [5]:
df.head()

Unnamed: 0,model,vehicle_age,km_driven,seller_type,fuel_type,transmission_type,mileage,engine,max_power,seats,selling_price
0,Alto,9,120000,Individual,Petrol,Manual,19.7,796,46.3,5,120000
1,Grand,5,20000,Individual,Petrol,Manual,18.9,1197,82.0,5,550000
2,i20,11,60000,Individual,Petrol,Manual,17.0,1197,80.0,5,215000
3,Alto,9,37000,Individual,Petrol,Manual,20.92,998,67.1,5,226000
4,Ecosport,6,30000,Dealer,Diesel,Manual,22.77,1498,98.59,5,570000


In [6]:
df['model'].unique()

array(['Alto', 'Grand', 'i20', 'Ecosport', 'Wagon R', 'i10', 'Venue',
       'Swift', 'Verna', 'Duster', 'Cooper', 'Ciaz', 'C-Class', 'Innova',
       'Baleno', 'Swift Dzire', 'Vento', 'Creta', 'City', 'Bolero',
       'Fortuner', 'KWID', 'Amaze', 'Santro', 'XUV500', 'KUV100', 'Ignis',
       'RediGO', 'Scorpio', 'Marazzo', 'Aspire', 'Figo', 'Vitara',
       'Tiago', 'Polo', 'Seltos', 'Celerio', 'GO', '5', 'CR-V',
       'Endeavour', 'KUV', 'Jazz', '3', 'A4', 'Tigor', 'Ertiga', 'Safari',
       'Thar', 'Hexa', 'Rover', 'Eeco', 'A6', 'E-Class', 'Q7', 'Z4', '6',
       'XF', 'X5', 'Hector', 'Civic', 'D-Max', 'Cayenne', 'X1', 'Rapid',
       'Freestyle', 'Superb', 'Nexon', 'XUV300', 'Dzire VXI', 'S90',
       'WR-V', 'XL6', 'Triber', 'ES', 'Wrangler', 'Camry', 'Elantra',
       'Yaris', 'GL-Class', '7', 'S-Presso', 'Dzire LXI', 'Aura', 'XC',
       'Ghibli', 'Continental', 'CR', 'Kicks', 'S-Class', 'Tucson',
       'Harrier', 'X3', 'Octavia', 'Compass', 'CLS', 'redi-GO', 'Glanza',
       

In [7]:
## Getting all the different types of features

## Numeric features
num_features = [feature for feature in df.columns if df[feature].dtype != 'O']
print("Number of numerical features:", len(num_features))

## Categorical features
cat_features = [feature for feature in df.columns if df[feature].dtype == 'O']
print("Number of categorical features:", len(cat_features))

## Discrete features (numeric features with at most 25 unique values)
discrete_features = [feature for feature in num_features if len(df[feature].unique()) <= 25]
print("Number of discrete features:", len(discrete_features))

## Continuous features (the remaining numeric features)
continuous_features = [feature for feature in num_features if feature not in discrete_features]
print("Number of continuous features:", len(continuous_features))


Number of numerical features: 7
Number of categorical features: 4
Number of discrete features: 2
Number of continuous features: 5


In [8]:
## Independent and dependent features
from sklearn.model_selection import train_test_split
x = df.drop(['selling_price'], axis=1)
y = df['selling_price']

In [9]:
x.head()

Unnamed: 0,model,vehicle_age,km_driven,seller_type,fuel_type,transmission_type,mileage,engine,max_power,seats
0,Alto,9,120000,Individual,Petrol,Manual,19.7,796,46.3,5
1,Grand,5,20000,Individual,Petrol,Manual,18.9,1197,82.0,5
2,i20,11,60000,Individual,Petrol,Manual,17.0,1197,80.0,5
3,Alto,9,37000,Individual,Petrol,Manual,20.92,998,67.1,5
4,Ecosport,6,30000,Dealer,Diesel,Manual,22.77,1498,98.59,5


In [10]:
y.head()

0    120000
1    550000
2    215000
3    226000
4    570000
Name: selling_price, dtype: int64

## Feature Encoding and Scaling

#### One-hot encoding for nominal columns with few unique values

* One-hot encoding converts a categorical variable into a set of binary indicator columns, one per category, so that ML algorithms can use it without assuming any order among the categories.
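
As a small illustration of the idea, the sketch below one-hot encodes a made-up `fuel_type` column. It uses `pandas.get_dummies` for brevity rather than the `OneHotEncoder` used later in this notebook, and the `toy` frame is an assumption for demonstration only.

```python
import pandas as pd

toy = pd.DataFrame({"fuel_type": ["Petrol", "Diesel", "CNG", "Petrol"]})

# One indicator column per category; every row contains exactly one 1.
full = pd.get_dummies(toy, columns=["fuel_type"], dtype=int)

# drop_first=True removes the first category in sorted order ("CNG");
# a row of all zeros then stands for that dropped category.
reduced = pd.get_dummies(toy, columns=["fuel_type"], drop_first=True, dtype=int)

print(list(full.columns))
print(list(reduced.columns))
print(reduced)
```

Dropping one category removes the exact linear dependence among the indicator columns, which matters mostly for the linear models trained below.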

In [11]:
len(df['model'].unique())

120

In [12]:
df['model'].value_counts()

model
i20            906
Swift Dzire    890
Swift          781
Alto           778
City           757
              ... 
Ghibli           1
Altroz           1
GTC4Lusso        1
Aura             1
Gurkha           1
Name: count, Length: 120, dtype: int64

In [13]:
from sklearn.preprocessing import LabelEncoder
le=LabelEncoder()
x['model'] = le.fit_transform(x['model'])
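
To make the integer codes less opaque, here is a minimal sketch of what `LabelEncoder` does on a handful of model names (the short `names` list is an assumption for illustration). The codes are positions in the sorted list of unique labels, so they carry no meaning about price or size.

```python
from sklearn.preprocessing import LabelEncoder

names = ["i20", "Alto", "Grand", "Alto"]

encoder = LabelEncoder()
codes = encoder.fit_transform(names)

# classes_ holds the sorted unique labels; each code is an index into it.
mapping = {str(label): code for code, label in enumerate(encoder.classes_)}
print(mapping)         # {'Alto': 0, 'Grand': 1, 'i20': 2}
print(codes.tolist())  # [2, 0, 1, 0]

# inverse_transform recovers the original strings from the codes.
restored = encoder.inverse_transform(codes).tolist()
print(restored)
```

scikit-learn documents `LabelEncoder` as an encoder for target labels and offers `OrdinalEncoder` for feature columns; either way, tree-based models tend to cope with such arbitrary codes better than linear ones.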

In [14]:
x.head()

Unnamed: 0,model,vehicle_age,km_driven,seller_type,fuel_type,transmission_type,mileage,engine,max_power,seats
0,7,9,120000,Individual,Petrol,Manual,19.7,796,46.3,5
1,54,5,20000,Individual,Petrol,Manual,18.9,1197,82.0,5
2,118,11,60000,Individual,Petrol,Manual,17.0,1197,80.0,5
3,7,9,37000,Individual,Petrol,Manual,20.92,998,67.1,5
4,38,6,30000,Dealer,Diesel,Manual,22.77,1498,98.59,5


In [15]:
len(df['seller_type'].unique()), len(df['fuel_type'].unique()),len(df['transmission_type'].unique())

(3, 5, 2)

In [16]:
# Create a ColumnTransformer with two transformers:
# one-hot encoding for the categorical columns, standard scaling for the numeric ones

num_features = x.select_dtypes(exclude="object").columns
onehot_columns = ['seller_type', 'fuel_type', 'transmission_type']

from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer

numeric_transformer = StandardScaler()
oh_transformer = OneHotEncoder(drop= 'first')

preprocessor = ColumnTransformer(
    [
        ("OneHotEncoder", oh_transformer, onehot_columns),
        ("StandardScaler", numeric_transformer, num_features)

    ],remainder='passthrough'
)

In [17]:
x=preprocessor.fit_transform(x)


In [18]:
pd.DataFrame(x)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13
0,1.0,0.0,0.0,0.0,0.0,1.0,1.0,-1.519714,0.983562,1.247335,-0.000276,-1.324259,-1.263352,-0.403022
1,1.0,0.0,0.0,0.0,0.0,1.0,1.0,-0.225693,-0.343933,-0.690016,-0.192071,-0.554718,-0.432571,-0.403022
2,1.0,0.0,0.0,0.0,0.0,1.0,1.0,1.536377,1.647309,0.084924,-0.647583,-0.554718,-0.479113,-0.403022
3,1.0,0.0,0.0,0.0,0.0,1.0,1.0,-1.519714,0.983562,-0.360667,0.292211,-0.936610,-0.779312,-0.403022
4,0.0,0.0,1.0,0.0,0.0,0.0,1.0,-0.666211,-0.012060,-0.496281,0.735736,0.022918,-0.046502,-0.403022
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
15406,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.508844,0.983562,-0.869744,0.026096,-0.767733,-0.757204,-0.403022
15407,0.0,0.0,0.0,0.0,0.0,1.0,1.0,-0.556082,-1.339555,-0.728763,-0.527711,-0.216964,-0.220803,2.073444
15408,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.407551,-0.012060,0.220539,0.344954,0.022918,0.068225,-0.403022
15409,0.0,0.0,1.0,0.0,0.0,0.0,1.0,1.426247,-0.343933,72.541850,-0.887326,1.329794,0.917158,2.073444


In [19]:
# separate dataset into train and test
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(x,y,test_size=0.2,random_state= 40)
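
Note that cell 17 fits the preprocessor on all rows and the split only happens afterwards, so the scaler's means and variances include the test rows. One common alternative ordering, sketched here on synthetic data, is to split first and fit the preprocessor on the training part only. All names (`toy`, `toy_preprocessor`, `X_tr`, `X_te`) and values are assumptions chosen so as not to clash with the notebook's variables.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for the car features (values are made up).
rng = np.random.default_rng(0)
n = 50
toy = pd.DataFrame(
    {
        "fuel_type": ["Petrol", "Diesel"] * (n // 2),
        "km_driven": rng.integers(5_000, 150_000, size=n),
        "vehicle_age": rng.integers(1, 15, size=n),
    }
)
toy_target = rng.integers(100_000, 900_000, size=n)

toy_preprocessor = ColumnTransformer(
    [
        ("onehot", OneHotEncoder(drop="first"), ["fuel_type"]),
        ("scale", StandardScaler(), ["km_driven", "vehicle_age"]),
    ]
)

# Split first, then learn encoding and scaling from the training rows only.
X_tr, X_te, y_tr, y_te = train_test_split(
    toy, toy_target, test_size=0.2, random_state=40
)
X_tr_scaled = toy_preprocessor.fit_transform(X_tr)
X_te_scaled = toy_preprocessor.transform(X_te)  # reuse training statistics

print(X_tr_scaled.shape, X_te_scaled.shape)  # (40, 3) (10, 3)
```

With this ordering the test metrics are computed on rows that influenced neither the scaler nor the encoder.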

## Model Training and Model Selection

In [20]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import AdaBoostRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error

In [21]:
## Create a function to evaluate a model

def evaluate_model(true, predicted):
    mae = mean_absolute_error(true, predicted)
    mse = mean_squared_error(true, predicted)
    rmse = np.sqrt(mse)  # reuse the MSE computed above
    r2_square = r2_score(true, predicted)
    return mae, rmse, r2_square
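
To make the three metrics concrete, the following sketch recomputes them by hand with NumPy on four made-up values (the arrays are assumptions for illustration, not data from this notebook).

```python
import numpy as np

true = np.array([100.0, 200.0, 300.0, 400.0])
predicted = np.array([110.0, 190.0, 330.0, 370.0])

errors = true - predicted                          # [-10, 10, -30, 30]
mae = float(np.mean(np.abs(errors)))               # (10 + 10 + 30 + 30) / 4 = 20
rmse = float(np.sqrt(np.mean(errors ** 2)))        # sqrt(2000 / 4) = sqrt(500)
ss_res = float(np.sum(errors ** 2))                # residual sum of squares: 2000
ss_tot = float(np.sum((true - true.mean()) ** 2))  # total sum of squares: 50000
r2 = 1.0 - ss_res / ss_tot                         # 1 - 2000 / 50000 = 0.96

print(f"MAE={mae:.2f} RMSE={rmse:.2f} R2={r2:.2f}")
# MAE=20.00 RMSE=22.36 R2=0.96
```

RMSE exceeds MAE here because squaring weights the two 30-unit errors more heavily, which is also why RMSE is far above MAE in the results reported below.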


In [26]:
## Begin model training
models = {
    "Linear Regression": LinearRegression(),
    "Lasso": Lasso(),
    "Ridge": Ridge(),
    "KNeighborsRegressor": KNeighborsRegressor(),
    "DecisionTreeRegressor": DecisionTreeRegressor(),
    "RandomForestRegressor": RandomForestRegressor(),
    "GradientBoostingRegressor": GradientBoostingRegressor(),
    "Adaboost Regressor": AdaBoostRegressor()
}

for name, model in models.items():
    model.fit(X_train, y_train)  # Train model

    # Make predictions
    y_train_pred = model.predict(X_train)
    y_test_pred = model.predict(X_test)

    # Evaluate on the train and test sets
    model_train_mae, model_train_rmse, model_train_r2 = evaluate_model(y_train, y_train_pred)
    model_test_mae, model_test_rmse, model_test_r2 = evaluate_model(y_test, y_test_pred)

    print(name)

    print('Model Performance for training set')
    print("- Root Mean Squared Error: {:.4f}".format(model_train_rmse))
    print("- Mean Absolute Error: {:.4f}".format(model_train_mae))
    print("- R2 Score: {:.4f}".format(model_train_r2))

    print("----------------------------------------------------------------")

    print('Model Performance for Testing set')
    print("- Root Mean Squared Error: {:.4f}".format(model_test_rmse))
    print("- Mean Absolute Error: {:.4f}".format(model_test_mae))
    print("- R2 Score: {:.4f}".format(model_test_r2))

    print('='*35)
    print('\n')


Linear Regression
Model Performance for training set
- Root Mean Squared Error: 563383.4410
- Mean Absolute Error: 274314.7271
- R2 Score: 0.6230
----------------------------------------------------------------
Model Performance for Testing set
- Root Mean Squared Error: 456216.3540
- Mean Absolute Error: 259285.1738
- R2 Score: 0.6695


Lasso
Model Performance for training set
- Root Mean Squared Error: 563383.4454
- Mean Absolute Error: 274313.2275
- R2 Score: 0.6230
----------------------------------------------------------------
Model Performance for Testing set
- Root Mean Squared Error: 456216.7873
- Mean Absolute Error: 259285.0433
- R2 Score: 0.6695


Ridge
Model Performance for training set
- Root Mean Squared Error: 563383.8313
- Mean Absolute Error: 274277.2770
- R2 Score: 0.6230
----------------------------------------------------------------
Model Performance for Testing set
- Root Mean Squared Error: 456209.2551
- Mean Absolute Error: 259270.5847
- R2 Score: 0.6695


KNei

### Hyperparameter Tuning

In [32]:
# Initialize a few parameter grids for hyperparameter tuning
knn_params = {"n_neighbors": [2, 3, 10, 20, 40, 50]}
rf_params = {"max_depth": [5, 8, None, 10],
             # 1.0 (all features) replaces "auto", which recent scikit-learn releases no longer accept
             "max_features": [5, 7, 1.0, 8],
             "min_samples_split": [2, 8, 15, 20],
             "n_estimators": [100, 200, 500, 1000]}
ada_params = {
    "n_estimators": [50,60,70,80],
    "loss": ['linear', 'square', 'exponential']  # loss names are lowercase
}
Gradboost_params = {
    "loss": ['squared_error','huber','absolute_error'],
    # 'mse' was the old alias of 'squared_error'; recent scikit-learn releases no longer accept it
    "criterion": ['friedman_mse','squared_error'],
    "min_samples_split": [2, 8, 15, 20],
    "n_estimators":[100,200,500,1000],
    "max_depth": [5, 8, 15, None, 10],
    "learning_rate":[0.1,0.01,0.02,0.03]
}

In [33]:
Gradboost_params

{'loss': ['squared_error', 'huber', 'absolute_error'],
 'criterion': ['friedman_mse', 'squared_error'],
 'min_samples_split': [2, 8, 15, 20],
 'n_estimators': [100, 200, 500, 1000],
 'max_depth': [5, 8, 15, None, 10],
 'learning_rate': [0.1, 0.01, 0.02, 0.03]}

In [37]:
# Model List For Hyperparameter tuning
randomcv_models = [('knn',KNeighborsRegressor(), knn_params),
                   ('rf', RandomForestRegressor(), rf_params),
                   ("Adaboost", AdaBoostRegressor(),ada_params),
                   ("Gradboost", GradientBoostingRegressor(), Gradboost_params)
                   ]

In [39]:
###Hyperparameter tuning
from sklearn.model_selection import RandomizedSearchCV

model_params = {}
for name, model, params in randomcv_models:
    random = RandomizedSearchCV(estimator=model,
                                param_distributions=params,
                                n_iter=100,
                                cv=3,
                                verbose=2,
                                n_jobs=-1)
    random.fit(X_train, y_train)
    model_params[name]= random.best_params_

for model_name in model_params:
    print(f"--------------Best Params for {model_name}-------------------")
    print(model_params[model_name])


Fitting 3 folds for each of 6 candidates, totalling 18 fits
Fitting 3 folds for each of 100 candidates, totalling 300 fits


KeyboardInterrupt: 
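
The run above was interrupted during the 300-fit random forest search. A cheaper search can be sketched as below, on synthetic data, with a smaller grid, fewer iterations, and a fixed `random_state` for repeatability. The data, the grid values, and the names `X_toy`, `y_toy`, `small_grid` and `search` are all assumptions for illustration, not tuned settings for the car dataset.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Synthetic regression problem standing in for X_train / y_train.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 5))
y_toy = 3.0 * X_toy[:, 0] + X_toy[:, 1] ** 2 + rng.normal(scale=0.1, size=200)

# A deliberately small grid: 3 * 3 * 2 = 18 combinations in total.
small_grid = {
    "max_depth": [5, 10, None],
    "min_samples_split": [2, 8, 15],
    "n_estimators": [50, 100],
}

search = RandomizedSearchCV(
    estimator=RandomForestRegressor(random_state=0),
    param_distributions=small_grid,
    n_iter=5,                               # sample 5 of the 18 combinations
    cv=3,
    scoring="neg_root_mean_squared_error",  # higher (closer to 0) is better
    random_state=0,
    n_jobs=1,
)
search.fit(X_toy, y_toy)

print(search.best_params_)
print("CV RMSE:", -search.best_score_)
```

The cost of a search grows with `n_iter` times `cv` times the cost of one fit, so capping `n_estimators` and `n_iter` is usually the quickest way to make it finish.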

In [None]:
## Retraining the models using a set of suitable hyperparameters
models = {
    "RandomForestRegressor": RandomForestRegressor(n_estimators=100, min_samples_split=2, max_features=8, max_depth=None, n_jobs=-1),
    "KNeighborsRegressor": KNeighborsRegressor(n_neighbors=2, n_jobs=-1),
    "Adaboost": AdaBoostRegressor(n_estimators=80, loss='linear'),
    "GradientBoost": GradientBoostingRegressor()
}

for name, model in models.items():
    model.fit(X_train, y_train)  # Train model

    # Make predictions
    y_train_pred = model.predict(X_train)
    y_test_pred = model.predict(X_test)

    # Evaluate on the train and test sets
    model_train_mae, model_train_rmse, model_train_r2 = evaluate_model(y_train, y_train_pred)
    model_test_mae, model_test_rmse, model_test_r2 = evaluate_model(y_test, y_test_pred)

    print(name)

    print('Model Performance for training set')
    print("- Root Mean Squared Error: {:.4f}".format(model_train_rmse))
    print("- Mean Absolute Error: {:.4f}".format(model_train_mae))
    print("- R2 Score: {:.4f}".format(model_train_r2))

    print("----------------------------------------------------------------")

    print('Model Performance for Testing set')
    print("- Root Mean Squared Error: {:.4f}".format(model_test_rmse))
    print("- Mean Absolute Error: {:.4f}".format(model_test_mae))
    print("- R2 Score: {:.4f}".format(model_test_r2))

    print('='*35)
    print('\n')

RandomForestRegressor
Model Performance for training set
- Root Mean Squared Error: 136527.2373
- Mean Absolute Error: 39285.9715
- R2 Score: 0.9779
----------------------------------------------------------------
Model Performance for Testing set
- Root Mean Squared Error: 219025.2584
- Mean Absolute Error: 98891.8902
- R2 Score: 0.9238


KNeighborsRegressor
Model Performance for training set
- Root Mean Squared Error: 208086.1609
- Mean Absolute Error: 64045.2577
- R2 Score: 0.9486
----------------------------------------------------------------
Model Performance for Testing set
- Root Mean Squared Error: 233590.6454
- Mean Absolute Error: 109742.2965
- R2 Score: 0.9134


Adaboost
Model Performance for training set
- Root Mean Squared Error: 537877.6509
- Mean Absolute Error: 414995.6836
- R2 Score: 0.6563
----------------------------------------------------------------
Model Performance for Testing set
- Root Mean Squared Error: 529738.9069
- Mean Absolute Error: 409185.6331
- R2 Sc