# Used car price prediction

### Objective

Build a linear regression model to predict the prices of used cars.

### Data Description
- Brand: Brand name of the car
- Model: Model name of the car
- Location: Location in which the car is being sold or is available for purchase (cities)
- Year: Manufacturing year of the car
- Kilometers_Driven: The total kilometers driven in the car by the previous owner(s) in km
- Fuel_Type: The type of fuel used by the car (Petrol, Diesel, Electric, CNG, LPG)
- Transmission: The type of transmission used by the car (Automatic/Manual)
- Owner_Type: Type of ownership
- Mileage: The standard mileage offered by the car company in kmpl or km/kg
- Engine: The displacement volume of the engine in CC
- Power: The maximum power of the engine in bhp
- Seats: The number of seats in the car
- New_Price: The price of a new car of the same model in INR Lakhs (1 Lakh = 100,000 INR)
- Price: The price of the used car in INR Lakhs

The used car dataset is taken from https://www.kaggle.com/

## Importing libraries


In [None]:
# Libraries to manipulate the data
import numpy as np
import pandas as pd

# import libraries for data visualization
import matplotlib.pyplot as plt
import seaborn as sns

# set default parameters
sns.set()

# split the data between training and test
from sklearn.model_selection import train_test_split

# import linear regression
from sklearn.linear_model import LinearRegression

# import metrics
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# suppress warnings
import warnings
warnings.filterwarnings("ignore")

## Loading the dataset

In [None]:
# to read data from google drive
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
cars = pd.read_csv("/content/drive/MyDrive/MLPractice/used_cars_data.csv")

## Data Overview

### Showing the first few rows of the dataset

In [None]:
# create copy of the data to avoid any changes in original data.
data = cars.copy()

In [None]:
data.head()

Unnamed: 0,Location,Year,Kilometers_Driven,Fuel_Type,Transmission,Owner_Type,Seats,New_Price,Price,Mileage,Engine,Power,Brand,Model
0,Mumbai,2010,72000.0,CNG,Manual,First,5.0,5.51,1.75,26.6,998.0,58.16,maruti,wagon
1,Pune,2015,41000.0,Diesel,Manual,First,5.0,16.06,12.5,19.67,1582.0,126.2,hyundai,creta
2,Chennai,2011,46000.0,Petrol,Manual,First,5.0,8.61,4.5,18.2,1199.0,88.7,honda,jazz
3,Chennai,2012,87000.0,Diesel,Manual,First,7.0,11.27,6.0,20.77,1248.0,88.76,maruti,ertiga
4,Coimbatore,2013,40670.0,Diesel,Automatic,Second,5.0,53.14,17.74,15.2,1968.0,140.8,audi,a4


### Checking the number of rows and columns

In [None]:
data.shape

(7252, 14)

*   There are 7252 rows and 14 columns




In [None]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7252 entries, 0 to 7251
Data columns (total 14 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Location           7252 non-null   object 
 1   Year               7252 non-null   int64  
 2   Kilometers_Driven  7251 non-null   float64
 3   Fuel_Type          7252 non-null   object 
 4   Transmission       7252 non-null   object 
 5   Owner_Type         7252 non-null   object 
 6   Seats              7199 non-null   float64
 7   New_Price          7252 non-null   float64
 8   Price              6019 non-null   float64
 9   Mileage            7169 non-null   float64
 10  Engine             7206 non-null   float64
 11  Power              7077 non-null   float64
 12  Brand              7252 non-null   object 
 13  Model              7252 non-null   object 
dtypes: float64(7), int64(1), object(6)
memory usage: 793.3+ KB


*  There are 6 object (categorical) columns and 8 numerical columns (7 float64 and 1 int64)




In [None]:
data.describe()

Unnamed: 0,Year,Kilometers_Driven,Seats,New_Price,Price,Mileage,Engine,Power
count,7252.0,7251.0,7199.0,7252.0,6019.0,7169.0,7206.0,7077.0
mean,2013.36583,57811.654255,5.280456,21.308387,9.479468,18.346715,1616.590064,112.764474
std,3.254405,37502.06126,0.809327,24.257816,11.187917,4.15817,595.324779,53.497297
min,1996.0,171.0,2.0,3.91,0.44,6.4,72.0,34.2
25%,2011.0,34000.0,5.0,7.88,3.5,15.3,1198.0,75.0
50%,2014.0,53416.0,5.0,11.3,5.64,18.2,1493.0,94.0
75%,2016.0,73000.0,5.0,21.6975,9.95,21.1,1968.0,138.1
max,2019.0,775000.0,10.0,375.0,160.0,33.54,5998.0,616.0




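*   Mean and median are close for Mileage (18.35 vs 18.2) and Seats (5.28 vs 5.0), which suggests roughly symmetric distributions for both. For New_Price and Price the mean is well above the median, which points to right skew.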
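Comparing mean and median is only a rough symmetry check; the sample skewness gives a direct number. The sketch below uses a small hypothetical `toy` frame standing in for the Mileage and Seats columns, so the printed values are illustrative only.

```python
import pandas as pd

# Hypothetical toy values standing in for the Mileage and Seats columns
toy = pd.DataFrame(
    {
        "Mileage": [15.3, 17.0, 18.2, 19.5, 21.1, 18.0, 18.6],
        "Seats": [5, 5, 5, 5, 7, 5, 4],
    }
)

# Put mean, median and sample skewness side by side for each column
summary = pd.DataFrame(
    {
        "mean": toy.mean(),
        "median": toy.median(),
        "skew": toy.skew(),
    }
)
print(summary)
```

In the notebook itself, the same three calls could be applied to `data[["Mileage", "Seats"]]`; a skew close to 0 supports the symmetry reading.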




## Data preprocessing

### Checking for duplicate values

In [None]:
# checking for duplicate values
data.duplicated().sum()

2



*   There are 2 duplicate rows in the data
*   Let's look at the duplicated rows



In [None]:
data[data.duplicated(keep=False)== True]

Unnamed: 0,Location,Year,Kilometers_Driven,Fuel_Type,Transmission,Owner_Type,Seats,New_Price,Price,Mileage,Engine,Power,Brand,Model
3623,Hyderabad,2007,52195.0,Petrol,Manual,First,5.0,4.36,1.75,19.7,796.0,46.3,maruti,alto
4781,Hyderabad,2007,52195.0,Petrol,Manual,First,5.0,4.36,1.75,19.7,796.0,46.3,maruti,alto
6940,Kolkata,2017,13000.0,Diesel,Manual,First,5.0,13.58,,26.0,1498.0,98.6,honda,city
7077,Kolkata,2017,13000.0,Diesel,Manual,First,5.0,13.58,,26.0,1498.0,98.6,honda,city




*   Rows 3623 and 4781 are identical
*   Rows 6940 and 7077 are identical
*   We can remove the second copy of each pair

In [None]:
# remove row 4781 (duplicate of row 3623)
data.drop(4781, inplace=True)

# remove row 7077 (duplicate of row 6940)
data.drop(7077, inplace=True)

In [None]:
# checking for duplicate values
data.duplicated().sum()

0

*  Now there are no duplicate values in dataset.
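The same cleanup can be done without hard-coding row labels. A minimal sketch with `DataFrame.drop_duplicates`, run on a hypothetical `toy` frame rather than the real data:

```python
import pandas as pd

# Hypothetical toy frame in which the rows labelled 0 and 2 are identical
toy = pd.DataFrame(
    {
        "Brand": ["maruti", "honda", "maruti"],
        "Model": ["alto", "city", "alto"],
        "Price": [1.75, None, 1.75],
    }
)

# keep="first" retains the first occurrence and drops every later copy
deduped = toy.drop_duplicates(keep="first")

print(deduped.duplicated().sum())
print(deduped.index.tolist())
```

On the notebook's `data`, this would keep rows 3623 and 6940 and drop 4781 and 7077, matching the manual `drop` calls.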




## Missing Values Treatment

In [None]:
data.isnull().sum()

Unnamed: 0,0
Location,0
Year,0
Kilometers_Driven,1
Fuel_Type,0
Transmission,0
Owner_Type,0
Seats,53
New_Price,0
Price,1232
Mileage,83




*   There are missing values in columns Kilometers_Driven, Seats, Price, Mileage, Engine and Power.



In [None]:
# Price is the target variable, so keep only the rows where Price is not missing
data = data[data["Price"].notna()].copy()

# check if missing data for price is removed from dataset
data.isnull().sum()

Unnamed: 0,0
Location,0
Year,0
Kilometers_Driven,1
Fuel_Type,0
Transmission,0
Owner_Type,0
Seats,42
New_Price,0
Price,0
Mileage,70




*   Fill missing values with the median of each feature within the same brand and model




In [None]:
cols_list = ["Kilometers_Driven","Seats", "Mileage", "Engine", "Power"]

for col in cols_list:
    data[col] = data.groupby(["Brand", "Model"])[col].transform(
        lambda x: x.fillna(x.median())
    )

data.isnull().sum()

Unnamed: 0,0
Location,0
Year,0
Kilometers_Driven,0
Fuel_Type,0
Transmission,0
Owner_Type,0
Seats,3
New_Price,0
Price,0
Mileage,9


*   For values that are still missing, fill with the median of each feature within the same brand




In [None]:
cols_list = ["Seats","Mileage", "Engine", "Power"]

for col in cols_list:
    data[col] = data.groupby(["Brand"])[col].transform(
        lambda x: x.fillna(x.median())
    )

data.isnull().sum()

Unnamed: 0,0
Location,0
Year,0
Kilometers_Driven,0
Fuel_Type,0
Transmission,0
Owner_Type,0
Seats,0
New_Price,0
Price,0
Mileage,1



*   For the few values that remain missing, fill with the overall median of the feature




In [None]:
cols_list = ["Mileage", "Power"]

for col in cols_list:
    data[col] = data[col].fillna(data[col].median())

data.isnull().sum()

Unnamed: 0,0
Location,0
Year,0
Kilometers_Driven,0
Fuel_Type,0
Transmission,0
Owner_Type,0
Seats,0
New_Price,0
Price,0
Mileage,0
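The three imputation steps above follow one pattern: try the most specific group first, then a coarser one, then the whole column. A minimal sketch of that pattern as a single helper; the function name `fill_with_group_median` and the `toy` frame are hypothetical and not part of the original notebook.

```python
import numpy as np
import pandas as pd


def fill_with_group_median(df, col, group_levels):
    """Fill NaNs in `col` using medians from progressively coarser groups."""
    filled = df[col].copy()
    for keys in group_levels:
        # Median of the current values per group, aligned back to each row
        group_median = filled.groupby([df[k] for k in keys]).transform("median")
        filled = filled.fillna(group_median)
    # Last resort: overall median of whatever has been filled so far
    return filled.fillna(filled.median())


# Hypothetical toy data: some models and one brand have no Power value at all
toy = pd.DataFrame(
    {
        "Brand": ["maruti", "maruti", "maruti", "honda", "honda", "audi"],
        "Model": ["alto", "alto", "wagon", "city", "jazz", "a4"],
        "Power": [46.3, np.nan, np.nan, 98.6, 88.7, np.nan],
    }
)

toy["Power"] = fill_with_group_median(
    toy, "Power", [["Brand", "Model"], ["Brand"]]
)
print(toy)
```

Here the second `alto` is filled from its model group, `wagon` falls back to the `maruti` brand median, and `a4` falls back to the overall median because no other `audi` row exists.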


## Model Building - Linear Regression


### Categorical columns Treatment


In [None]:
object_columns = data.select_dtypes(include='object')
unique_counts = object_columns.nunique()
print(unique_counts)

Location         11
Fuel_Type         5
Transmission      2
Owner_Type        4
Brand            30
Model           211
dtype: int64




*   The Model column has 211 categories, which is a lot: encoding it with dummy variables adds many features to the model.
*   The Brand column has 30 different values.
*   Because the Model dummies increase the number of features so much, we will build 2 different models:
    1.   Model with all columns
    2.   Model without the Model variable
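To make the size difference concrete, the feature count of each model can be worked out from the category counts printed above. This is a sketch that assumes dummy encoding with `drop_first=True` (one column fewer than the number of levels) and the 7 numeric predictors left after setting Price aside; the helper `n_features` is hypothetical.

```python
# Category counts copied from the nunique() output above
unique_counts = {
    "Location": 11,
    "Fuel_Type": 5,
    "Transmission": 2,
    "Owner_Type": 4,
    "Brand": 30,
    "Model": 211,
}

# Year, Kilometers_Driven, Seats, New_Price, Mileage, Engine, Power
n_numeric = 7


def n_features(counts, n_numeric):
    # drop_first=True yields (levels - 1) dummy columns per categorical column
    return n_numeric + sum(levels - 1 for levels in counts.values())


with_model = n_features(unique_counts, n_numeric)
without_model = n_features(
    {name: levels for name, levels in unique_counts.items() if name != "Model"},
    n_numeric,
)
print(with_model, without_model)  # 264 54
```

Dropping the Model column therefore removes 210 of the 264 features.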








In [None]:
# for adjusted R-squared
def adjusted_r2_score(predictors, targets, predictions):
    r2 = r2_score(targets, predictions)
    n = predictors.shape[0]
    k = predictors.shape[1]
    return 1 - ((1 - r2) * (n - 1) / (n - k - 1))

# check performance of a regression model
def model_performance_regression(model, predictors, target):
    predictions = model.predict(predictors)
    r2 = r2_score(target, predictions)  #  R-squared
    adjr2 = adjusted_r2_score(predictors, target, predictions)  #  adjusted R-squared
    rmse = np.sqrt(mean_squared_error(target, predictions))  #  RMSE
    mae = mean_absolute_error(target, predictions)  #  MAE

    # creating a dataframe of metrics
    data_perf = pd.DataFrame(
        {
            "RMSE": rmse,
            "MAE": mae,
            "R-squared": r2,
            "Adjusted R-squared": adjr2,
        },
        index=[0],
    )
    print(data_perf)
    return data_perf
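Adjusted R-squared matters here because the two models differ a lot in the number of predictors. A small sketch of the same formula with hypothetical values for `r2`, `n` and `k`, showing how the penalty grows as features are added while R-squared stays fixed:

```python
def adjusted_r2(r2, n, k):
    # n: number of observations, k: number of predictors
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)


# Hypothetical values: same R-squared and sample size, different feature counts
r2, n = 0.80, 1000
for k in (10, 50, 250):
    print(k, round(adjusted_r2(r2, n, k), 4))
```

With few predictors the adjusted value stays close to R-squared; with many predictors relative to the sample size it falls noticeably below it.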

### Model with all variables

In [None]:
X = data.drop(["Price"], axis=1) # remove target variable
y = data["Price"]

In [None]:
# creating dummy variables
X = pd.get_dummies(
    X,
    columns=X.select_dtypes(include=["object", "category"]).columns.tolist(),
    drop_first=True,
)

X.head()

Unnamed: 0,Year,Kilometers_Driven,Seats,New_Price,Mileage,Engine,Power,Location_Bangalore,Location_Chennai,Location_Coimbatore,...,Model_xenon,Model_xf,Model_xj,Model_xuv300,Model_xuv500,Model_xylo,Model_yeti,Model_z4,Model_zen,Model_zest
0,2010,72000.0,5.0,5.51,26.6,998.0,58.16,False,False,False,...,False,False,False,False,False,False,False,False,False,False
1,2015,41000.0,5.0,16.06,19.67,1582.0,126.2,False,False,False,...,False,False,False,False,False,False,False,False,False,False
2,2011,46000.0,5.0,8.61,18.2,1199.0,88.7,False,True,False,...,False,False,False,False,False,False,False,False,False,False
3,2012,87000.0,7.0,11.27,20.77,1248.0,88.76,False,True,False,...,False,False,False,False,False,False,False,False,False,False
4,2013,40670.0,5.0,53.14,15.2,1968.0,140.8,False,False,True,...,False,False,False,False,False,False,False,False,False,False


In [None]:
# splitting the data in 70:30 ratio for train to test data

x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

In [None]:
# fitting a linear model
linear_reg_model1 = LinearRegression()
linear_reg_model1.fit(x_train, y_train)

In [None]:
# Checking model performance on train set
print("Training Performance:")
linear_reg_model1_perf_train = model_performance_regression(
    linear_reg_model1, x_train, y_train
)
linear_reg_model1_perf_train

Training Performance:
       RMSE      MAE  R-squared  Adjusted R-squared
0  4.177602  2.24435   0.865733            0.856752


Unnamed: 0,RMSE,MAE,R-squared,Adjusted R-squared
0,4.177602,2.24435,0.865733,0.856752


In [None]:
# Checking model performance on test set
print("Test Performance:")
linear_reg_model1_perf_test = model_performance_regression(linear_reg_model1, x_test, y_test)
linear_reg_model1_perf_test

Test Performance:


Unnamed: 0,RMSE,MAE,R-squared,Adjusted R-squared
0,4.306052,2.387067,0.837175,0.809281


### Model without dummy variables for the Model column

In [None]:
# defining the dependent and independent variables
X = data.drop(["Price", "Model"], axis=1)
y = data["Price"]

# creating dummy variables
X = pd.get_dummies(
    X,
    columns=X.select_dtypes(include=["object", "category"]).columns.tolist(),
    drop_first=True,
)

# splitting the data in 70:30 ratio for train to test data

x_train2, x_test2, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1
)

In [None]:
# fitting a linear model
linear_reg_model2 = LinearRegression()
linear_reg_model2.fit(x_train2, y_train)

In [None]:
# Checking model performance on train set
print("Training Performance:")
linear_reg_model2_perf_train = model_performance_regression(
    linear_reg_model2, x_train2, y_train
)
linear_reg_model2_perf_train

Training Performance:
       RMSE       MAE  R-squared  Adjusted R-squared
0  5.373599  2.885305    0.77785            0.774964


Unnamed: 0,RMSE,MAE,R-squared,Adjusted R-squared
0,5.373599,2.885305,0.77785,0.774964


In [None]:
# Checking model performance on test set
print("Test Performance:")
linear_reg_model2_perf_test = model_performance_regression(linear_reg_model2, x_test2, y_test)
linear_reg_model2_perf_test

Test Performance:


Unnamed: 0,RMSE,MAE,R-squared,Adjusted R-squared
0,4.708758,2.758179,0.805296,0.799292


### Performance Comparison of Models

In [None]:
models_train_comparison = pd.concat(
    [linear_reg_model1_perf_train.T, linear_reg_model2_perf_train.T,], axis=1,
)

models_train_comparison.columns = [
    "Linear Regression with all variables",
    "Linear Regression without dummy variables for Model",
]

print("Training performance comparison:")
models_train_comparison

Training performance comparison:


Unnamed: 0,Linear Regression with all variables,Linear Regression without dummy variables for Model
RMSE,4.177602,5.373599
MAE,2.24435,2.885305
R-squared,0.865733,0.77785
Adjusted R-squared,0.856752,0.774964
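The comparison above covers training performance only. The test-set metrics reported earlier can be placed side by side in the same way; the sketch below rebuilds them from the printed values, and the short column labels are chosen here for illustration.

```python
import pandas as pd

# Test-set metrics copied from the outputs reported earlier in this notebook
models_test_comparison = pd.DataFrame(
    {
        "With Model dummies": [4.306052, 2.387067, 0.837175, 0.809281],
        "Without Model dummies": [4.708758, 2.758179, 0.805296, 0.799292],
    },
    index=["RMSE", "MAE", "R-squared", "Adjusted R-squared"],
)

print("Test performance comparison:")
print(models_test_comparison)
```

Inside the notebook, the same table could be produced by concatenating `linear_reg_model1_perf_test.T` and `linear_reg_model2_perf_test.T` with `pd.concat(..., axis=1)`, as done for the training metrics.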



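## Conclusion
*   The linear regression with all variables (including the Model dummies) performs better than the one without them: on the training set, RMSE (4.18 vs 5.37) and MAE (2.24 vs 2.89) are lower, and R-squared (0.866 vs 0.778) and adjusted R-squared (0.857 vs 0.775) are higher.
*   The test results reported above point the same way (RMSE 4.31 vs 4.71, adjusted R-squared 0.809 vs 0.799), although the full model's adjusted R-squared drops more from train to test (0.857 to 0.809), so part of its advantage may come from fitting the training data more closely.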


