# Used Car Price Prediction

## 1) Problem statement.

* This dataset comprises used cars sold on cardekho.com in India, along with important features of these cars.
* The goal is to predict the price of a car from its input features.
* The predictions can be used to suggest a price to new sellers based on market conditions.

## 2) Data Collection.
* The dataset was collected by scraping the CarDekho website.
* The data consists of 13 columns and 15411 rows.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import warnings

warnings.filterwarnings("ignore")

%matplotlib inline

In [4]:
df = pd.read_csv(r"cardekho_imputated.csv", index_col=[0])

In [5]:
df.head()

| | car_name | brand | model | vehicle_age | km_driven | seller_type | fuel_type | transmission_type | mileage | engine | max_power | seats | selling_price |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Maruti Alto | Maruti | Alto | 9 | 120000 | Individual | Petrol | Manual | 19.7 | 796 | 46.3 | 5 | 120000 |
| 1 | Hyundai Grand | Hyundai | Grand | 5 | 20000 | Individual | Petrol | Manual | 18.9 | 1197 | 82.0 | 5 | 550000 |
| 2 | Hyundai i20 | Hyundai | i20 | 11 | 60000 | Individual | Petrol | Manual | 17.0 | 1197 | 80.0 | 5 | 215000 |
| 3 | Maruti Alto | Maruti | Alto | 9 | 37000 | Individual | Petrol | Manual | 20.92 | 998 | 67.1 | 5 | 226000 |
| 4 | Ford Ecosport | Ford | Ecosport | 6 | 30000 | Dealer | Diesel | Manual | 22.77 | 1498 | 98.59 | 5 | 570000 |


## Data Cleaning
### Handling Missing values

* Handling Missing values 
* Handling Duplicates
* Check data type
* Understand the dataset

In [6]:
df.isnull().sum()

car_name             0
brand                0
model                0
vehicle_age          0
km_driven            0
seller_type          0
fuel_type            0
transmission_type    0
mileage              0
engine               0
max_power            0
seats                0
selling_price        0
dtype: int64
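
The checklist above also mentions duplicates and data types, which the notebook does not show explicitly. A minimal sketch of those two checks, run on a small hypothetical frame (`toy` and its values are illustrative, not taken from the real CSV):

```python
import pandas as pd

# Hypothetical toy frame standing in for the real data; values are illustrative.
toy = pd.DataFrame({
    "model": ["Alto", "Grand", "Alto"],
    "vehicle_age": [9, 5, 9],
    "selling_price": [120000, 550000, 120000],
})

# Count rows that exactly repeat an earlier row.
n_duplicates = int(toy.duplicated().sum())

# Drop the repeated rows and rebuild a clean 0..n-1 index.
deduped = toy.drop_duplicates().reset_index(drop=True)

# Inspect the column data types (text vs. numeric).
print(n_duplicates)
print(deduped.shape)
print(toy.dtypes)
```

The same calls could be applied to `df`; whether duplicate listings should actually be dropped depends on how the data was scraped.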

In [7]:
# Drop car_name and brand in a single call; the model column is kept
df.drop(columns=['car_name', 'brand'], inplace=True)

In [8]:
df.head()

| | model | vehicle_age | km_driven | seller_type | fuel_type | transmission_type | mileage | engine | max_power | seats | selling_price |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Alto | 9 | 120000 | Individual | Petrol | Manual | 19.7 | 796 | 46.3 | 5 | 120000 |
| 1 | Grand | 5 | 20000 | Individual | Petrol | Manual | 18.9 | 1197 | 82.0 | 5 | 550000 |
| 2 | i20 | 11 | 60000 | Individual | Petrol | Manual | 17.0 | 1197 | 80.0 | 5 | 215000 |
| 3 | Alto | 9 | 37000 | Individual | Petrol | Manual | 20.92 | 998 | 67.1 | 5 | 226000 |
| 4 | Ecosport | 6 | 30000 | Dealer | Diesel | Manual | 22.77 | 1498 | 98.59 | 5 | 570000 |


In [9]:
df['model'].unique()

array(['Alto', 'Grand', 'i20', 'Ecosport', 'Wagon R', 'i10', 'Venue',
       'Swift', 'Verna', 'Duster', 'Cooper', 'Ciaz', 'C-Class', 'Innova',
       'Baleno', 'Swift Dzire', 'Vento', 'Creta', 'City', 'Bolero',
       'Fortuner', 'KWID', 'Amaze', 'Santro', 'XUV500', 'KUV100', 'Ignis',
       'RediGO', 'Scorpio', 'Marazzo', 'Aspire', 'Figo', 'Vitara',
       'Tiago', 'Polo', 'Seltos', 'Celerio', 'GO', '5', 'CR-V',
       'Endeavour', 'KUV', 'Jazz', '3', 'A4', 'Tigor', 'Ertiga', 'Safari',
       'Thar', 'Hexa', 'Rover', 'Eeco', 'A6', 'E-Class', 'Q7', 'Z4', '6',
       'XF', 'X5', 'Hector', 'Civic', 'D-Max', 'Cayenne', 'X1', 'Rapid',
       'Freestyle', 'Superb', 'Nexon', 'XUV300', 'Dzire VXI', 'S90',
       'WR-V', 'XL6', 'Triber', 'ES', 'Wrangler', 'Camry', 'Elantra',
       'Yaris', 'GL-Class', '7', 'S-Presso', 'Dzire LXI', 'Aura', 'XC',
       'Ghibli', 'Continental', 'CR', 'Kicks', 'S-Class', 'Tucson',
       'Harrier', 'X3', 'Octavia', 'Compass', 'CLS', 'redi-GO', 'Glanza',
       

In [10]:
num_features = [feature for feature in df.columns if df[feature].dtypes != 'O']
print("Number of Numerical Features: ", len(num_features))

cat_features = [feature for feature in df.columns if df[feature].dtypes == 'O']
print("Number of Categorical Features: ", len(cat_features))

discrete_features = [feature for feature in num_features if len(df[feature].unique()) < 25]
print("Number of Discrete Features: ", len(discrete_features))

continuous_features = [feature for feature in num_features if feature not in discrete_features]
print("Number of Continuous Features: ", len(continuous_features))

Number of Numerical Features:  7
Number of Categorical Features:  4
Number of Discrete Features:  2
Number of Continuous Features:  5


In [11]:
from sklearn.model_selection import train_test_split

In [12]:
X = df.drop('selling_price', axis=1)
y = df['selling_price']

In [13]:
X.head()

| | model | vehicle_age | km_driven | seller_type | fuel_type | transmission_type | mileage | engine | max_power | seats |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Alto | 9 | 120000 | Individual | Petrol | Manual | 19.7 | 796 | 46.3 | 5 |
| 1 | Grand | 5 | 20000 | Individual | Petrol | Manual | 18.9 | 1197 | 82.0 | 5 |
| 2 | i20 | 11 | 60000 | Individual | Petrol | Manual | 17.0 | 1197 | 80.0 | 5 |
| 3 | Alto | 9 | 37000 | Individual | Petrol | Manual | 20.92 | 998 | 67.1 | 5 |
| 4 | Ecosport | 6 | 30000 | Dealer | Diesel | Manual | 22.77 | 1498 | 98.59 | 5 |


In [14]:
y

0         120000
1         550000
2         215000
3         226000
4         570000
          ...   
19537     250000
19540     925000
19541     425000
19542    1225000
19543    1200000
Name: selling_price, Length: 15411, dtype: int64

## Feature Encoding and Scaling
**One-hot encoding for non-ordinal columns with few unique values**
* One-hot encoding converts each categorical variable into a set of binary indicator columns, a numeric form that ML algorithms can consume directly.
* The `model` column has 120 unique values, so it is label-encoded instead of one-hot encoded.
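
As a small illustration of the encoding described above, the sketch below one-hot encodes a hypothetical fuel-type column (the array `fuel` is made up for the example). It uses `drop="first"`, the same option the notebook passes to `OneHotEncoder` later, so one category becomes the all-zeros baseline.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical single categorical column with three distinct values.
fuel = np.array([["Petrol"], ["Diesel"], ["CNG"], ["Petrol"]])

# Categories are sorted (CNG, Diesel, Petrol); drop="first" removes CNG,
# leaving two indicator columns: Diesel and Petrol.
encoder = OneHotEncoder(drop="first")
encoded = encoder.fit_transform(fuel).toarray()

print(encoder.categories_)
print(encoded)
```

With `drop="first"`, a row of all zeros means the dropped category (here `CNG`), which avoids a redundant column in linear models.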

In [15]:
len(df['model'].unique())

120

In [16]:
df['model'].value_counts()

model
i20            906
Swift Dzire    890
Swift          781
Alto           778
City           757
              ... 
Ghibli           1
Altroz           1
GTC4Lusso        1
Aura             1
Gurkha           1
Name: count, Length: 120, dtype: int64

In [17]:
from sklearn.preprocessing import LabelEncoder

In [18]:
le = LabelEncoder()

In [19]:
# Label-encode the high-cardinality model column (120 unique values)
X['model'] = le.fit_transform(X['model'])

In [20]:
# model is integer-typed after label encoding, so it is included in num_features
num_features = X.select_dtypes(exclude='O').columns
onehot_columns = ['seller_type', 'fuel_type', 'transmission_type']
label_encoder_columns = ['model']

In [21]:
X.head()

| | model | vehicle_age | km_driven | seller_type | fuel_type | transmission_type | mileage | engine | max_power | seats |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7 | 9 | 120000 | Individual | Petrol | Manual | 19.7 | 796 | 46.3 | 5 |
| 1 | 54 | 5 | 20000 | Individual | Petrol | Manual | 18.9 | 1197 | 82.0 | 5 |
| 2 | 118 | 11 | 60000 | Individual | Petrol | Manual | 17.0 | 1197 | 80.0 | 5 |
| 3 | 7 | 9 | 37000 | Individual | Petrol | Manual | 20.92 | 998 | 67.1 | 5 |
| 4 | 38 | 6 | 30000 | Dealer | Diesel | Manual | 22.77 | 1498 | 98.59 | 5 |


In [24]:
print(len(X['seller_type'].unique()))
print(len(X['fuel_type'].unique()))
print(len(X['transmission_type'].unique()))

3
5
2
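
To make the label-encoding step concrete, here is a minimal sketch on a hypothetical list of model names (`models_toy` is illustrative). `LabelEncoder` assigns integer codes by sorted class order, so the codes carry no meaning beyond identity.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical sample of model names, not the full column.
models_toy = ["Alto", "Grand", "i20", "Alto", "Ecosport"]

le_toy = LabelEncoder()
codes = le_toy.fit_transform(models_toy)

# classes_ holds the sorted unique labels; a label's code is its position here.
print(list(le_toy.classes_))
print(codes.tolist())

# Codes can be mapped back to the original names.
print(list(le_toy.inverse_transform(codes)))
```

Because the codes are arbitrary integers, scaling them or feeding them to distance-based or linear models imposes an ordering that the model names do not have; tree-based models are usually less sensitive to this.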

In [25]:
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer

In [26]:
numeric_transformer = StandardScaler()
categorical_transformer = OneHotEncoder(drop='first')

In [27]:
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, num_features),
        ('cat', categorical_transformer, onehot_columns)
    ], remainder='passthrough'
)

In [28]:
X = preprocessor.fit_transform(X)

In [29]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
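
The cells above fit the preprocessor on the full feature matrix before splitting, so the scaler's statistics also reflect the test rows. One common alternative, sketched here on a hypothetical toy frame (`toy` and all of its values are made up), is to split first and fit the preprocessor on the training part only.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data with two numeric columns and one categorical column.
toy = pd.DataFrame({
    "vehicle_age": [9, 5, 11, 9, 6, 3, 7, 2],
    "km_driven": [120000, 20000, 60000, 37000, 30000, 15000, 80000, 10000],
    "fuel_type": ["Petrol", "Petrol", "Petrol", "Petrol",
                  "Diesel", "Diesel", "Petrol", "Diesel"],
    "selling_price": [120000, 550000, 215000, 226000,
                      570000, 800000, 300000, 900000],
})
X_toy = toy.drop(columns="selling_price")
y_toy = toy["selling_price"]

# Split BEFORE fitting any transformer.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=42
)

pre = ColumnTransformer([
    ("num", StandardScaler(), ["vehicle_age", "km_driven"]),
    ("cat", OneHotEncoder(drop="first"), ["fuel_type"]),
])

# Learn scaling statistics and categories from the training rows only,
# then reuse them unchanged on the test rows.
X_tr_t = pre.fit_transform(X_tr)
X_te_t = pre.transform(X_te)

print(X_tr_t.shape, X_te_t.shape)
```

On a dataset of this size the effect on the reported metrics is likely small, but fitting on the training split alone keeps the test set a fair stand-in for unseen cars.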

In [31]:
X_train

array([[ 1.26105315,  0.31981426,  0.28354138, ...,  0.        ,
         0.        ,  0.        ],
       [ 0.79300331, -1.33955467, -0.88375133, ...,  0.        ,
         1.        ,  1.        ],
       [-1.24439011, -1.33955467, -0.96124537, ...,  0.        ,
         0.        ,  0.        ],
       ...,
       [ 1.0407944 ,  0.31981426, -0.69001623, ...,  0.        ,
         1.        ,  1.        ],
       [ 1.53637659, -1.33955467, -0.78688378, ...,  0.        ,
         1.        ,  1.        ],
       [-1.0516637 , -1.33955467, -0.49628113, ...,  0.        ,
         1.        ,  0.        ]])

## Model Training And Model Selection

In [43]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression, Lasso, Ridge
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

In [39]:
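def evaluate_model(y_true, y_pred):
    """Return (mse, mae, r2, rmse) for the given true and predicted values."""
    mae = mean_absolute_error(y_true, y_pred)
    mse = mean_squared_error(y_true, y_pred)
    rmse = np.sqrt(mse)  # same units as selling_price
    r2 = r2_score(y_true, y_pred)
    return mse, mae, r2, rmse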
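
For intuition about the four metrics, the following sketch computes them on three hypothetical values (`y_true_demo` and `y_pred_demo` are made up) where the arithmetic can be checked by hand: the errors are 10, -10, and 30.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Hypothetical true and predicted prices.
y_true_demo = np.array([100.0, 200.0, 300.0])
y_pred_demo = np.array([110.0, 190.0, 330.0])

# MAE  = (10 + 10 + 30) / 3
mae_demo = mean_absolute_error(y_true_demo, y_pred_demo)
# MSE  = (100 + 100 + 900) / 3
mse_demo = mean_squared_error(y_true_demo, y_pred_demo)
# RMSE = sqrt(MSE), expressed in the same units as the target
rmse_demo = np.sqrt(mse_demo)
# R2   = 1 - SS_res / SS_tot = 1 - 1100 / 20000
r2_demo = r2_score(y_true_demo, y_pred_demo)

print(mae_demo, mse_demo, rmse_demo, r2_demo)
```

Because MSE squares each error, the single 30-unit miss dominates it, which is why RMSE comes out larger than MAE here.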

In [41]:
models = {
    'Random Forest': RandomForestRegressor(),
    'Linear Regression': LinearRegression(),
    'Lasso': Lasso(),
    'Ridge': Ridge(),
    'KNN': KNeighborsRegressor(),
    'Decision Tree': DecisionTreeRegressor()
}

In [44]:
for name, model in models.items():
    model.fit(X_train, y_train)       # train on the training split
    y_pred = model.predict(X_test)    # predict on the held-out split
    mse, mae, r2, rmse = evaluate_model(y_test, y_pred)
    print(f"{name} Model Metrics")
    print(f"Mean Squared Error: {mse}")
    print(f"Mean Absolute Error: {mae}")
    print(f"R2 Score: {r2}")
    print(f"Root Mean Squared Error: {rmse}")
    print("\n")

Random Forest Model Metrics
Mean Squared Error: 52120759492.655624
Mean Absolute Error: 101321.55470779806
R2 Score: 0.9307624597837317
Root Mean Squared Error: 228299.71417558898


Linear Regression Model Metrics
Mean Squared Error: 252550062888.56543
Mean Absolute Error: 279618.57941584237
R2 Score: 0.6645109298852007
Root Mean Squared Error: 502543.59302309825


Lasso Model Metrics
Mean Squared Error: 252549134941.56296
Mean Absolute Error: 279614.7453273891
R2 Score: 0.6645121625757549
Root Mean Squared Error: 502542.66977199353


Ridge Model Metrics
Mean Squared Error: 252540243247.96863
Mean Absolute Error: 279557.2168930272
R2 Score: 0.6645239743566809
Root Mean Squared Error: 502533.8229890289


KNN Model Metrics
Mean Squared Error: 64021344520.150826
Mean Absolute Error: 112526.34609146934
R2 Score: 0.9149536488136147
Root Mean Squared Error: 253024.39510875393


Decision Tree Model Metrics
Mean Squared Error: 94133558293.44162
Mean Absolute Error: 124786.04984322628
R2 Score:

In [45]:
knn_params = {
    'n_neighbors': [3, 5, 7, 9, 11],
    'weights': ['uniform', 'distance'],
    'p': [1, 2]
}

rf_params = {
    'n_estimators': [100, 200, 300, 400, 500],
    'max_depth': [None, 5, 10, 15, 20, 25, 30],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

In [46]:
from sklearn.model_selection import RandomizedSearchCV

In [47]:
randomcv_models = [
    ("Random Forest", RandomForestRegressor(), rf_params),
    ("KNN", KNeighborsRegressor(), knn_params)
]

In [49]:
model_param = {}

for name, model, params in randomcv_models:
    randomcv = RandomizedSearchCV(model, params, n_iter=100, cv=3, n_jobs=-1, verbose=2)
    randomcv.fit(X_train, y_train)
    model_param[name] = randomcv

Fitting 3 folds for each of 100 candidates, totalling 300 fits
Fitting 3 folds for each of 20 candidates, totalling 60 fits


In [50]:
for model_name in model_param:
    print(f"-----------------{model_name}-----------------")
    print(model_param[model_name])

-----------------Random Forest-----------------
RandomizedSearchCV(cv=3, estimator=RandomForestRegressor(), n_iter=100,
                   n_jobs=-1,
                   param_distributions={'max_depth': [None, 5, 10, 15, 20, 25,
                                                      30],
                                        'min_samples_leaf': [1, 2, 4],
                                        'min_samples_split': [2, 5, 10],
                                        'n_estimators': [100, 200, 300, 400,
                                                         500]},
                   verbose=2)
-----------------KNN-----------------
RandomizedSearchCV(cv=3, estimator=KNeighborsRegressor(), n_iter=100, n_jobs=-1,
                   param_distributions={'n_neighbors': [3, 5, 7, 9, 11],
                                        'p': [1, 2],
                                        'weights': ['uniform', 'distance']},
                   verbose=2)
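
Printing the search object shows its configuration rather than the outcome of the search. A minimal sketch of reading the selected hyperparameters, using synthetic data from `make_regression` in place of the car data (the names `X_demo`, `y_demo`, and `search` are hypothetical, and the grid is a reduced version of `knn_params`):

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import RandomizedSearchCV
from sklearn.neighbors import KNeighborsRegressor

# Synthetic regression data standing in for X_train / y_train.
X_demo, y_demo = make_regression(
    n_samples=120, n_features=4, noise=10.0, random_state=0
)

search = RandomizedSearchCV(
    KNeighborsRegressor(),
    {"n_neighbors": [3, 5, 7], "weights": ["uniform", "distance"], "p": [1, 2]},
    n_iter=5,          # sample 5 of the 12 possible combinations
    cv=3,
    random_state=0,    # make the sampled combinations reproducible
)
search.fit(X_demo, y_demo)

# The fitted search exposes the winning combination and its mean CV score.
print(search.best_params_)
print(search.best_score_)

# best_estimator_ is already refit on all of the data passed to fit().
best_model = search.best_estimator_
```

In the notebook, the same attributes would be available as `model_param[model_name].best_params_` for each fitted search.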


In [54]:
models = {
    "Random Forest": RandomForestRegressor(n_estimators=100, min_samples_split=2, max_features='sqrt', max_depth=None),
    "KNN": KNeighborsRegressor(n_neighbors=9, weights='distance', p=1)
}

In [55]:
for name, model in models.items():
    model.fit(X_train, y_train)       # retrain with the tuned hyperparameters
    y_pred = model.predict(X_test)    # predict on the held-out split
    mse, mae, r2, rmse = evaluate_model(y_test, y_pred)
    print(f"{name} Model Metrics")
    print(f"Mean Squared Error: {mse}")
    print(f"Mean Absolute Error: {mae}")
    print(f"R2 Score: {r2}")
    print(f"Root Mean Squared Error: {rmse}")
    print("\n")

Random Forest Model Metrics
Mean Squared Error: 42900270947.10744
Mean Absolute Error: 98697.54450744412
R2 Score: 0.9430110139625324
Root Mean Squared Error: 207123.8058435279


KNN Model Metrics
Mean Squared Error: 47438327295.8397
Mean Absolute Error: 100756.75852965289
R2 Score: 0.9369826317592125
Root Mean Squared Error: 217803.41433466945


