## Used Car Price Prediction 

1) Problem statement.

This dataset comprises used cars sold on cardekho.com in India, along with important features of those cars.
The goal is to predict the price of a car from these input features.
The predictions can then be used to suggest a price to new sellers based on market conditions.

2) Data Collection. 

The dataset was collected by scraping the CarDekho website.

The data consists of 13 columns and 15411 rows.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import warnings

warnings.filterwarnings('ignore')

%matplotlib inline


In [2]:
# Raw string so the backslashes in the Windows path are not treated as escape sequences
df = pd.read_csv(r"D:\Machine Learning\Random Forest\Regression\cardekho_imputated.csv")

In [3]:
df.head()

Unnamed: 0.1,Unnamed: 0,car_name,brand,model,vehicle_age,km_driven,seller_type,fuel_type,transmission_type,mileage,engine,max_power,seats,selling_price
0,0,Maruti Alto,Maruti,Alto,9,120000,Individual,Petrol,Manual,19.7,796,46.3,5,120000
1,1,Hyundai Grand,Hyundai,Grand,5,20000,Individual,Petrol,Manual,18.9,1197,82.0,5,550000
2,2,Hyundai i20,Hyundai,i20,11,60000,Individual,Petrol,Manual,17.0,1197,80.0,5,215000
3,3,Maruti Alto,Maruti,Alto,9,37000,Individual,Petrol,Manual,20.92,998,67.1,5,226000
4,4,Ford Ecosport,Ford,Ecosport,6,30000,Dealer,Diesel,Manual,22.77,1498,98.59,5,570000


## Data Cleaning

- Handling missing values
- Handling duplicates
- Checking data types
- Understanding the dataset


In [4]:
## Check Null Values
df.isnull().sum()

Unnamed: 0           0
car_name             0
brand                0
model                0
vehicle_age          0
km_driven            0
seller_type          0
fuel_type            0
transmission_type    0
mileage              0
engine               0
max_power            0
seats                0
selling_price        0
dtype: int64

In [5]:
## Remove unnecessary columns (car_name repeats brand + model)
df.drop(['car_name', 'brand'], axis=1, inplace=True)

In [6]:
df.head()

Unnamed: 0.1,Unnamed: 0,model,vehicle_age,km_driven,seller_type,fuel_type,transmission_type,mileage,engine,max_power,seats,selling_price
0,0,Alto,9,120000,Individual,Petrol,Manual,19.7,796,46.3,5,120000
1,1,Grand,5,20000,Individual,Petrol,Manual,18.9,1197,82.0,5,550000
2,2,i20,11,60000,Individual,Petrol,Manual,17.0,1197,80.0,5,215000
3,3,Alto,9,37000,Individual,Petrol,Manual,20.92,998,67.1,5,226000
4,4,Ecosport,6,30000,Dealer,Diesel,Manual,22.77,1498,98.59,5,570000


In [8]:
df['model'].unique()

array(['Alto', 'Grand', 'i20', 'Ecosport', 'Wagon R', 'i10', 'Venue',
       'Swift', 'Verna', 'Duster', 'Cooper', 'Ciaz', 'C-Class', 'Innova',
       'Baleno', 'Swift Dzire', 'Vento', 'Creta', 'City', 'Bolero',
       'Fortuner', 'KWID', 'Amaze', 'Santro', 'XUV500', 'KUV100', 'Ignis',
       'RediGO', 'Scorpio', 'Marazzo', 'Aspire', 'Figo', 'Vitara',
       'Tiago', 'Polo', 'Seltos', 'Celerio', 'GO', '5', 'CR-V',
       'Endeavour', 'KUV', 'Jazz', '3', 'A4', 'Tigor', 'Ertiga', 'Safari',
       'Thar', 'Hexa', 'Rover', 'Eeco', 'A6', 'E-Class', 'Q7', 'Z4', '6',
       'XF', 'X5', 'Hector', 'Civic', 'D-Max', 'Cayenne', 'X1', 'Rapid',
       'Freestyle', 'Superb', 'Nexon', 'XUV300', 'Dzire VXI', 'S90',
       'WR-V', 'XL6', 'Triber', 'ES', 'Wrangler', 'Camry', 'Elantra',
       'Yaris', 'GL-Class', '7', 'S-Presso', 'Dzire LXI', 'Aura', 'XC',
       'Ghibli', 'Continental', 'CR', 'Kicks', 'S-Class', 'Tucson',
       'Harrier', 'X3', 'Octavia', 'Compass', 'CLS', 'redi-GO', 'Glanza',
       

In [None]:
## Group the columns by feature type

num_features = [feature for feature in df.columns if df[feature].dtype != 'O']
print('Num of Numerical Features :', len(num_features))

cat_features = [feature for feature in df.columns if df[feature].dtype == 'O']
print('Num of Categorical Features :', len(cat_features))

discrete_features=[feature for feature in num_features if len(df[feature].unique())<=25]
print('Num of Discrete Features :',len(discrete_features))

continuous_features=[feature for feature in num_features if feature not in discrete_features]
print('Num of Continuous Features :',len(continuous_features))

Num of Numerical Features : 8
Num of Categorical Features : 4
Num of Discrete Features : 2
Num of Continuous Features : 6


In [10]:
# Independent and Dependent Features
from sklearn.model_selection import train_test_split
x = df.drop('selling_price', axis=1)
y = df['selling_price']

In [11]:
x.head()

Unnamed: 0.1,Unnamed: 0,model,vehicle_age,km_driven,seller_type,fuel_type,transmission_type,mileage,engine,max_power,seats
0,0,Alto,9,120000,Individual,Petrol,Manual,19.7,796,46.3,5
1,1,Grand,5,20000,Individual,Petrol,Manual,18.9,1197,82.0,5
2,2,i20,11,60000,Individual,Petrol,Manual,17.0,1197,80.0,5
3,3,Alto,9,37000,Individual,Petrol,Manual,20.92,998,67.1,5
4,4,Ecosport,6,30000,Dealer,Diesel,Manual,22.77,1498,98.59,5


# Feature Encoding and Scaling

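## One-Hot Encoding for Non-Ordinal Columns with Few Unique Values

One-hot encoding converts a categorical variable into a set of binary indicator columns, one per category, so that ML algorithms that expect numeric input can use it without an artificial ordering being implied. It is applied here to the low-cardinality columns (`seller_type`, `fuel_type`, `transmission_type`); `model`, with 120 distinct values, is label-encoded instead.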

In [13]:
len(df['model'].unique())

120

In [15]:
df['model'].value_counts()

model
i20            906
Swift Dzire    890
Swift          781
Alto           778
City           757
              ... 
Ghibli           1
Altroz           1
GTC4Lusso        1
Aura             1
Gurkha           1
Name: count, Length: 120, dtype: int64

In [19]:
from sklearn.preprocessing import LabelEncoder
le=LabelEncoder()
x['model']=le.fit_transform(x['model'])
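
`LabelEncoder` assigns integers according to the sorted order of the distinct strings, so the resulting codes carry no price-related meaning. The sketch below shows the mapping on a handful of model names; the list `toy_models` is an assumption for illustration, and the codes differ from those in the full dataset because fewer categories are present.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical sample of model names
toy_models = ["Alto", "Grand", "i20", "Alto", "Ecosport"]

encoder = LabelEncoder()
codes = encoder.fit_transform(toy_models)

# classes_ holds the sorted unique labels; a label's position is its code
mapping = {str(label): code for code, label in enumerate(encoder.classes_)}
print(mapping)
print(codes)
```

Tree-based models can split on such arbitrary codes, whereas linear models and KNN treat them as an ordered numeric scale, which is worth keeping in mind when comparing the models trained later.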

In [20]:
print(x)

       Unnamed: 0  model  vehicle_age  km_driven seller_type fuel_type  \
0               0      7            9     120000  Individual    Petrol   
1               1     54            5      20000  Individual    Petrol   
2               2    118           11      60000  Individual    Petrol   
3               3      7            9      37000  Individual    Petrol   
4               4     38            6      30000      Dealer    Diesel   
...           ...    ...          ...        ...         ...       ...   
15406       19537    117            9      10723      Dealer    Petrol   
15407       19540     42            2      18000      Dealer    Petrol   
15408       19541     77            6      67000      Dealer    Diesel   
15409       19542    114            5    3800000      Dealer    Diesel   
15410       19543     25            2      13000      Dealer    Petrol   

      transmission_type  mileage  engine  max_power  seats  
0                Man

In [22]:
x.head()

Unnamed: 0.1,Unnamed: 0,model,vehicle_age,km_driven,seller_type,fuel_type,transmission_type,mileage,engine,max_power,seats
0,0,7,9,120000,Individual,Petrol,Manual,19.7,796,46.3,5
1,1,54,5,20000,Individual,Petrol,Manual,18.9,1197,82.0,5
2,2,118,11,60000,Individual,Petrol,Manual,17.0,1197,80.0,5
3,3,7,9,37000,Individual,Petrol,Manual,20.92,998,67.1,5
4,4,38,6,30000,Dealer,Diesel,Manual,22.77,1498,98.59,5


In [23]:
len(df['seller_type'].unique()),len(df['fuel_type'].unique()),len(df['transmission_type'].unique())

(3, 5, 2)

In [25]:

# Create a ColumnTransformer with two transformers: one-hot encoding and standard scaling
num_features = x.select_dtypes(exclude="object").columns
onehot_columns = ['seller_type','fuel_type','transmission_type']

from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer

numeric_transformer = StandardScaler()
oh_transformer = OneHotEncoder(drop='first')

preprocessor = ColumnTransformer(
    [
        ("OneHotEncoder", oh_transformer, onehot_columns),
        ("StandardScaler", numeric_transformer, num_features)
        
    ],remainder='passthrough'
    
)
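
A `ColumnTransformer` learns its encoder categories and scaling statistics from whatever data it is fitted on. One common variation, which is not what this notebook does, is to split first and fit the transformer on the training rows only, so that no information from the test rows reaches the scaler. The sketch below shows that ordering on a tiny made-up frame; `toy` and all of its values are assumptions for illustration.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical data; values are illustrative only
toy = pd.DataFrame({
    "seller_type": ["Individual", "Dealer"] * 4,
    "fuel_type": ["Petrol", "Diesel", "Petrol", "Diesel",
                  "Petrol", "Petrol", "Diesel", "Diesel"],
    "km_driven": [120000, 20000, 60000, 37000, 30000, 45000, 80000, 15000],
    "vehicle_age": [9, 5, 11, 9, 6, 4, 8, 2],
    "selling_price": [120000, 550000, 215000, 226000,
                      570000, 600000, 300000, 900000],
})
features = toy.drop("selling_price", axis=1)
target = toy["selling_price"]

# Split before fitting any transformer
f_train, f_test, t_train, t_test = train_test_split(
    features, target, test_size=0.25, random_state=42
)

toy_preprocessor = ColumnTransformer(
    [
        ("OneHotEncoder", OneHotEncoder(drop="first"), ["seller_type", "fuel_type"]),
        ("StandardScaler", StandardScaler(), ["km_driven", "vehicle_age"]),
    ]
)

# Statistics come from the training rows; the test rows are only transformed
train_matrix = toy_preprocessor.fit_transform(f_train)
test_matrix = toy_preprocessor.transform(f_test)
print(train_matrix.shape, test_matrix.shape)
```

The trade-off is that a category appearing only in the test rows would need handling (for example through the encoder's `handle_unknown` option), which the toy data here avoids by construction.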

In [29]:
X=preprocessor.fit_transform(x)

In [32]:
pd.DataFrame(x)

Unnamed: 0.1,Unnamed: 0,model,vehicle_age,km_driven,seller_type,fuel_type,transmission_type,mileage,engine,max_power,seats
0,0,7,9,120000,Individual,Petrol,Manual,19.70,796,46.30,5
1,1,54,5,20000,Individual,Petrol,Manual,18.90,1197,82.00,5
2,2,118,11,60000,Individual,Petrol,Manual,17.00,1197,80.00,5
3,3,7,9,37000,Individual,Petrol,Manual,20.92,998,67.10,5
4,4,38,6,30000,Dealer,Diesel,Manual,22.77,1498,98.59,5
...,...,...,...,...,...,...,...,...,...,...,...
15406,19537,117,9,10723,Dealer,Petrol,Manual,19.81,1086,68.05,5
15407,19540,42,2,18000,Dealer,Petrol,Manual,17.50,1373,91.10,7
15408,19541,77,6,67000,Dealer,Diesel,Manual,21.14,1498,103.52,5
15409,19542,114,5,3800000,Dealer,Diesel,Manual,16.00,2179,140.00,7


In [33]:
# separate dataset into train and test
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2,random_state=42)
X_train.shape, X_test.shape


((12328, 15), (3083, 15))

In [34]:

X_train

array([[ 0.        ,  0.        ,  1.        , ...,  1.75390551,
         2.66249771, -0.40302241],
       [ 1.        ,  0.        ,  0.        , ..., -0.55087963,
        -0.38602844, -0.40302241],
       [ 0.        ,  0.        ,  1.        , ...,  0.89033072,
         3.27453006, -0.40302241],
       ...,
       [ 1.        ,  0.        ,  0.        , ..., -0.9366097 ,
        -0.78070786, -0.40302241],
       [ 0.        ,  0.        ,  0.        , ..., -0.55471774,
        -0.43582879, -0.40302241],
       [ 1.        ,  0.        ,  0.        , ..., -0.04616815,
         0.06194201, -0.40302241]])

# Model Training And Model Selection

In [35]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression, Ridge,Lasso
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error

In [36]:
## Helper that returns MAE, RMSE and R2 for a set of predictions
def evaluate_model(true, predicted):
    mae = mean_absolute_error(true, predicted)
    mse = mean_squared_error(true, predicted)
    rmse = np.sqrt(mse)  # reuse the MSE computed above
    r2_square = r2_score(true, predicted)
    return mae, rmse, r2_square
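
To make the three metrics concrete, here is a small worked example that computes them by hand with NumPy. The numbers in `y_true` and `y_pred` are made up for illustration and are small enough to check on paper.

```python
import numpy as np

# Hypothetical prices and predictions
y_true = np.array([100.0, 200.0, 300.0])
y_pred = np.array([110.0, 190.0, 330.0])

residuals = y_true - y_pred                    # [-10, 10, -30]

mae = np.mean(np.abs(residuals))               # (10 + 10 + 30) / 3
rmse = np.sqrt(np.mean(residuals ** 2))        # sqrt((100 + 100 + 900) / 3)

ss_res = np.sum(residuals ** 2)                # 1100
ss_tot = np.sum((y_true - y_true.mean()) ** 2) # 20000
r2 = 1 - ss_res / ss_tot                       # 1 - 1100 / 20000

print(f"MAE={mae:.4f}  RMSE={rmse:.4f}  R2={r2:.4f}")
```

RMSE is larger than MAE here because squaring gives the single 30-unit miss more weight than the two 10-unit misses.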

In [37]:
## Beginning Model Training
models = {
    "Linear Regression": LinearRegression(),
    "Lasso": Lasso(),
    "Ridge": Ridge(),
    "K-Neighbors Regressor": KNeighborsRegressor(),
    "Decision Tree": DecisionTreeRegressor(),
    "Random Forest Regressor": RandomForestRegressor(),
   
}

for name, model in models.items():
    model.fit(X_train, y_train)  # Train model

    # Make predictions
    y_train_pred = model.predict(X_train)
    y_test_pred = model.predict(X_test)

    # Evaluate on the train and test sets
    model_train_mae, model_train_rmse, model_train_r2 = evaluate_model(y_train, y_train_pred)
    model_test_mae, model_test_rmse, model_test_r2 = evaluate_model(y_test, y_test_pred)

    print(name)

    print('Model performance for Training set')
    print("- Root Mean Squared Error: {:.4f}".format(model_train_rmse))
    print("- Mean Absolute Error: {:.4f}".format(model_train_mae))
    print("- R2 Score: {:.4f}".format(model_train_r2))

    print('----------------------------------')

    print('Model performance for Test set')
    print("- Root Mean Squared Error: {:.4f}".format(model_test_rmse))
    print("- Mean Absolute Error: {:.4f}".format(model_test_mae))
    print("- R2 Score: {:.4f}".format(model_test_r2))

    print('='*35)
    print('\n')

Linear Regression
Model performance for Training set
- Root Mean Squared Error: 553850.0494
- Mean Absolute Error: 268104.1303
- R2 Score: 0.6218
----------------------------------
Model performance for Test set
- Root Mean Squared Error: 502582.0834
- Mean Absolute Error: 279686.6479
- R2 Score: 0.6645


Lasso
Model performance for Training set
- Root Mean Squared Error: 553850.0538
- Mean Absolute Error: 268101.7491
- R2 Score: 0.6218
----------------------------------
Model performance for Test set
- Root Mean Squared Error: 502581.1494
- Mean Absolute Error: 279682.7929
- R2 Score: 0.6645


Ridge
Model performance for Training set
- Root Mean Squared Error: 553850.6941
- Mean Absolute Error: 268061.4421
- R2 Score: 0.6218
----------------------------------
Model performance for Test set
- Root Mean Squared Error: 502572.3576
- Mean Absolute Error: 279625.1576
- R2 Score: 0.6645


K-Neighbors Regressor
Model performance for Training set
- Root Mean Squared Error: 335460.8145
- Mean 

In [38]:
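# Initialize a few parameters for hyperparameter tuning
knn_params = {"n_neighbors": [2, 3, 10, 20, 40, 50]}
# max_features="auto" is rejected by recent scikit-learn releases; for a
# regressor it meant "use all features", which 1.0 expresses explicitly.
rf_params = {"max_depth": [5, 8, 15, None, 10],
             "max_features": [5, 7, 1.0, 8],
             "min_samples_split": [2, 8, 15, 20],
             "n_estimators": [100, 200, 500, 1000]}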

In [41]:

# Models to tune, as (name, estimator, parameter grid) tuples
randomcv_models = [('KNN', KNeighborsRegressor(), knn_params),
                   ("RF", RandomForestRegressor(), rf_params)]
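
The notebook stops after defining `randomcv_models`. One plausible next step, assumed here from the variable name rather than taken from the source, is to pass each entry to scikit-learn's `RandomizedSearchCV`. The sketch below does that on synthetic data so it runs on its own; `X_demo`, `y_demo`, `demo_models` and the deliberately small grids are all assumptions, and the values of `n_iter` and `cv` are arbitrary choices to keep the run short.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV
from sklearn.neighbors import KNeighborsRegressor

# Synthetic stand-in for X_train / y_train
X_demo, y_demo = make_regression(
    n_samples=200, n_features=8, noise=10.0, random_state=42
)

# Same (name, estimator, grid) layout as randomcv_models, with smaller grids
demo_models = [
    ("KNN", KNeighborsRegressor(), {"n_neighbors": [2, 3, 5, 10]}),
    ("RF", RandomForestRegressor(random_state=42),
     {"max_depth": [5, None],
      "min_samples_split": [2, 8],
      "n_estimators": [10, 20]}),
]

best_params = {}
for name, estimator, params in demo_models:
    # Sample n_iter parameter combinations and score each with 3-fold CV
    search = RandomizedSearchCV(
        estimator=estimator,
        param_distributions=params,
        n_iter=3,
        cv=3,
        random_state=42,
    )
    search.fit(X_demo, y_demo)
    best_params[name] = search.best_params_

for name, params in best_params.items():
    print(f"Best params for {name}: {params}")
```

With the notebook's own grids, the same loop would be run over `randomcv_models` with `X_train` and `y_train`, and the selected parameters could then be used to refit and re-evaluate each model with `evaluate_model`.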