# Regression Example: Used Car Price Prediction
### Multiple Linear Regression

This notebook walks through the steps to build a regression model that predicts the resale price of a used car.

### Dataset

**Filename**: final_cars_maruti.csv

It is a comma-separated file with 11 columns:

1. Model - Model of the car
2. Location - The location in which the car was sold.
3. Age - Age of the car at the time of sale, counted from the year of purchase.
4. KM_Driven - Total distance driven by the previous owner(s), in thousands of kilometers.
5. Fuel_Type - The type of fuel used by the car. (Petrol, Diesel, Electric, CNG, LPG)
6. Transmission - The type of transmission used by the car. (Automatic / Manual)
7. Owner_Type - First, Second, Third, or Fourth & Above
8. Mileage - The standard mileage quoted by the manufacturer, in kmpl or km/kg.
9. Power - The maximum power of the engine in bhp.
10. Seats - The number of seats in the car.
11. Price - The resale price of the car (target).


### Participants' Exercise

Build a model using the following five features and measure its accuracy in terms of RMSE and R2.

- Age
- KM_Driven
- Model
- Transmission
- Fuel_Type

## 1. Loading the Dataset

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sn

In [2]:
# Raw string, so the backslashes in the Windows path are not read as escape sequences
cars_df = pd.read_csv(r"E:\ML_course\practice\S7_ML_Multiple_Linear_Regression/final_cars_maruti.csv")

In [3]:
cars_df[:5]

|  | Location | Fuel_Type | Transmission | Owner_Type | Seats | Price | Age | Model | Mileage | Power | KM_Driven |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Chennai | Diesel | Manual | First | 7 | 6.0 | 8 | ertiga | 20.77 | 88.76 | 87 |
| 1 | Jaipur | Diesel | Manual | First | 5 | 5.6 | 5 | swift | 25.2 | 74.0 | 64 |
| 2 | Jaipur | Diesel | Manual | First | 5 | 5.99 | 3 | swift | 28.4 | 74.0 | 25 |
| 3 | Hyderabad | Petrol | Manual | Second | 5 | 2.75 | 7 | alto | 20.92 | 67.1 | 54 |
| 4 | Jaipur | Petrol | Manual | Second | 5 | 1.85 | 11 | wagon | 14.0 | 64.0 | 83 |


### Feature Set Selection

In [4]:
x_features = ['Model', 'Age' , 'KM_Driven', 'Transmission', 'Fuel_Type']

### Encoding Categorical Variables

In [5]:
encoded_cars_df = pd.get_dummies(cars_df[x_features],
                                columns = ['Model','Transmission', 'Fuel_Type'] )

In [6]:
encoded_cars_df = encoded_cars_df.replace({True : 1, False :0})

In [7]:
encoded_cars_df[:10]

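|  | Age | KM_Driven | Model_a-star | Model_alto | Model_baleno | Model_celerio | Model_ciaz | Model_dzire | Model_eeco | Model_ertiga | Model_omni | Model_ritz | Model_swift | Model_vitara | Model_wagon | Model_zen | Transmission_Automatic | Transmission_Manual | Fuel_Type_Diesel | Fuel_Type_Petrol |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 8 | 87 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 |
| 1 | 5 | 64 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 0 |
| 2 | 3 | 25 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 0 |
| 3 | 7 | 54 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 |
| 4 | 11 | 83 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 1 |
| 5 | 2 | 50 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 1 | 0 |
| 6 | 12 | 90 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 |
| 7 | 6 | 52 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 0 |
| 8 | 6 | 53 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 0 |
| 9 | 7 | 65 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 0 |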


In [8]:
encoded_cars_df.shape

(1009, 20)
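
The column count follows from one-hot encoding: the two numeric features pass through unchanged, and each categorical feature expands into one indicator column per distinct value (14 models, 2 transmissions, 2 fuel types, so 2 + 14 + 2 + 2 = 20). A minimal sketch on a made-up three-row frame (the toy values are illustrative, not taken from the dataset). Passing `dtype=int` is one way to get 0/1 columns directly, and `drop_first=True` is an optional variant that drops one level per feature:

```python
import pandas as pd

# Toy frame (hypothetical values) with one numeric and two categorical features.
toy_df = pd.DataFrame({
    "Age": [8, 5, 7],
    "Transmission": ["Manual", "Manual", "Automatic"],
    "Fuel_Type": ["Diesel", "Petrol", "Petrol"],
})

# One indicator column per distinct category value; dtype=int yields 0/1 instead of True/False.
encoded = pd.get_dummies(toy_df, columns=["Transmission", "Fuel_Type"], dtype=int)
print(list(encoded.columns))

# Variant: drop the first level of each feature, leaving one column per two-level feature.
encoded_reduced = pd.get_dummies(toy_df, columns=["Transmission", "Fuel_Type"],
                                 dtype=int, drop_first=True)
print(list(encoded_reduced.columns))
```

With the full set of indicators, the columns belonging to one feature always sum to 1, which is consistent with the equal-and-opposite coefficient pairs that the fitted model reports for Transmission and Fuel_Type.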

### Setting X and y variables

In [9]:
X = encoded_cars_df
y = cars_df.Price

### Data Splitting

In [10]:
from sklearn.model_selection import train_test_split

In [11]:
X_train, X_test, y_train, y_test = train_test_split(X,y, train_size= 0.8, random_state= 100)
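
As a rough check of what the split produces: with `train_size=0.8` and 1009 rows, scikit-learn should place 807 rows in the training set and the remaining 202 in the test set, and a fixed `random_state` makes the shuffle repeatable. A small sketch on stand-in arrays (`X_demo` and `y_demo` are placeholders, not the car data):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in data with the same number of rows as the encoded frame (1009).
X_demo = np.arange(1009 * 2).reshape(1009, 2)
y_demo = np.arange(1009)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, train_size=0.8, random_state=100)
print(X_tr.shape, X_te.shape)

# The same random_state gives the same partition on a second call.
X_tr2, _, _, _ = train_test_split(X_demo, y_demo, train_size=0.8, random_state=100)
print(np.array_equal(X_tr, X_tr2))
```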

### Build Model

In [12]:
from sklearn.linear_model import LinearRegression

In [13]:
lreg= LinearRegression()
lreg.fit(X_train, y_train)
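
`LinearRegression` fits an intercept and one coefficient per column by ordinary least squares: it chooses the values that minimize the sum of squared differences between actual and predicted prices on the training rows. The idea can be sketched with NumPy alone on synthetic data (the data and the true weights below are made up for illustration; this only demonstrates the same criterion, not the notebook's fitted model):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (hypothetical): y = 7.0 - 0.3*x1 + 0.5*x2 + small noise.
X_syn = rng.normal(size=(200, 2))
y_syn = 7.0 - 0.3 * X_syn[:, 0] + 0.5 * X_syn[:, 1] + rng.normal(scale=0.01, size=200)

# Prepend a column of ones so the first fitted weight plays the role of the intercept.
A = np.column_stack([np.ones(len(X_syn)), X_syn])

# Least-squares solution: minimizes the sum of squared residuals ||A w - y||^2.
w, *_ = np.linalg.lstsq(A, y_syn, rcond=None)
intercept, coefs = w[0], w[1:]
print(round(float(intercept), 2), np.round(coefs, 2))
```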

### Understanding model parameters

In [14]:
lreg.intercept_

6.990330413116105

In [15]:
lreg.coef_

array([-0.28349586, -0.00666433, -0.99754617, -1.61638854,  1.42373449,
       -1.11063953,  2.15506599,  0.89438078, -1.09763268,  2.04465393,
       -1.98006084, -0.54862859,  0.15862544,  2.82349234, -1.0535976 ,
       -1.09545902,  0.34644582, -0.34644582,  0.28077605, -0.28077605])

In [16]:
dict(zip(X_train.columns,
         np.round(lreg.coef_, 3)))

{'Age': -0.283,
 'KM_Driven': -0.007,
 'Model_a-star': -0.998,
 'Model_alto': -1.616,
 'Model_baleno': 1.424,
 'Model_celerio': -1.111,
 'Model_ciaz': 2.155,
 'Model_dzire': 0.894,
 'Model_eeco': -1.098,
 'Model_ertiga': 2.045,
 'Model_omni': -1.98,
 'Model_ritz': -0.549,
 'Model_swift': 0.159,
 'Model_vitara': 2.823,
 'Model_wagon': -1.054,
 'Model_zen': -1.095,
 'Transmission_Automatic': 0.346,
 'Transmission_Manual': -0.346,
 'Fuel_Type_Diesel': 0.281,
 'Fuel_Type_Petrol': -0.281}
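
A prediction is the intercept plus the sum of each coefficient multiplied by its feature value. As a worked sketch, the block below recomputes the prediction for row 0 of the encoded frame (an 8-year-old diesel, manual Ertiga) by hand, using the values printed above. Only the terms that are non-zero for this car are listed:

```python
# Values copied from lreg.intercept_ and lreg.coef_ above
# (only the terms that are non-zero for this car).
intercept = 6.990330413116105
coef = {
    "Age": -0.28349586,
    "KM_Driven": -0.00666433,
    "Model_ertiga": 2.04465393,
    "Transmission_Manual": -0.34644582,
    "Fuel_Type_Diesel": 0.28077605,
}

# Row 0 of the encoded frame: Age 8, KM_Driven 87 ('000 km), ertiga, Manual, Diesel.
car = {"Age": 8, "KM_Driven": 87, "Model_ertiga": 1,
       "Transmission_Manual": 1, "Fuel_Type_Diesel": 1}

# Prediction = intercept + sum of coefficient * feature value.
predicted_price = intercept + sum(coef[name] * value for name, value in car.items())
print(round(predicted_price, 2))  # about 6.12
```

The listed price for that row is 6.0, so the hand-computed value is off by roughly 0.12.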

### Predict on test set

In [17]:
y_pred = lreg.predict(X_test)

In [18]:
y_df = pd.DataFrame({"actual": y_test,
                     "predicted": y_pred,
                     "Residual": y_test-y_pred})

In [19]:
y_df.sample(10)

|  | actual | predicted | Residual |
|---|---|---|---|
| 134 | 7.25 | 6.841823 | 0.408177 |
| 961 | 7.39 | 6.999929 | 0.390071 |
| 482 | 1.55 | 1.658517 | -0.108517 |
| 44 | 2.1 | 1.728754 | 0.371246 |
| 964 | 4.11 | 4.072895 | 0.037105 |
| 249 | 3.25 | 3.893582 | -0.643582 |
| 404 | 8.3 | 8.062631 | 0.237369 |
| 517 | 5.7 | 5.33259 | 0.36741 |
| 609 | 5.99 | 6.908466 | -0.918466 |
| 308 | 2.9 | 2.918516 | -0.018516 |


### Measuring Accuracy

In [20]:
from sklearn.metrics import mean_squared_error

In [21]:
mse = mean_squared_error(y_df.actual, y_df.predicted)
mse

0.546102192151361

In [22]:
rmse = np.sqrt(mse)
rmse

0.7389872746883812

In [23]:
from sklearn.metrics import r2_score

In [24]:
r2_score(y_df.actual,y_df.predicted)

0.8911369344899931
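
Both metrics can be computed directly from the residuals. The sketch below applies the definitions with NumPy to the ten sampled rows shown earlier. Because it uses only ten rows, its results differ from the full test-set values reported above:

```python
import numpy as np

# The ten sampled rows shown above (actual and predicted prices).
actual = np.array([7.25, 7.39, 1.55, 2.10, 4.11, 3.25, 8.30, 5.70, 5.99, 2.90])
predicted = np.array([6.841823, 6.999929, 1.658517, 1.728754, 4.072895,
                      3.893582, 8.062631, 5.332590, 6.908466, 2.918516])

residuals = actual - predicted

# RMSE: square root of the mean squared residual, in the same units as Price.
rmse_sample = np.sqrt(np.mean(residuals ** 2))

# R2: 1 - (residual sum of squares / total sum of squares around the mean).
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((actual - actual.mean()) ** 2)
r2_sample = 1 - ss_res / ss_tot

print(rmse_sample, r2_sample)
```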

## Building model with all the variables

### Feature Set Selection

In [25]:
x_features_1 = list(cars_df.columns)
x_features_1

['Location',
 'Fuel_Type',
 'Transmission',
 'Owner_Type',
 'Seats',
 'Price',
 'Age',
 'Model',
 'Mileage',
 'Power',
 'KM_Driven']

In [26]:
x_features_1.remove('Price')

In [27]:
x_features_1

['Location',
 'Fuel_Type',
 'Transmission',
 'Owner_Type',
 'Seats',
 'Age',
 'Model',
 'Mileage',
 'Power',
 'KM_Driven']

### Encoding Categorical Variables


In [28]:
encoded_cars_df = pd.get_dummies(cars_df[x_features_1],
                                 columns=['Location','Fuel_Type','Transmission','Owner_Type','Model'])

In [29]:
encoded_cars_df = encoded_cars_df.replace({True : 1, False :0})
encoded_cars_df.shape

(1009, 37)

In [30]:
encoded_cars_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1009 entries, 0 to 1008
Data columns (total 37 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   Seats                   1009 non-null   int64  
 1   Age                     1009 non-null   int64  
 2   Mileage                 1009 non-null   float64
 3   Power                   1009 non-null   float64
 4   KM_Driven               1009 non-null   int64  
 5   Location_Ahmedabad      1009 non-null   int64  
 6   Location_Bangalore      1009 non-null   int64  
 7   Location_Chennai        1009 non-null   int64  
 8   Location_Coimbatore     1009 non-null   int64  
 9   Location_Delhi          1009 non-null   int64  
 10  Location_Hyderabad      1009 non-null   int64  
 11  Location_Jaipur         1009 non-null   int64  
 12  Location_Kochi          1009 non-null   int64  
 13  Location_Kolkata        1009 non-null   int64  
 14  Location_Mumbai         1009 non-null   int64  
 ...

### Setting X and y variables

In [31]:
X = encoded_cars_df
y = cars_df.Price 

### Data Splitting

In [32]:
from sklearn.model_selection import train_test_split

In [33]:
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size = 0.8, random_state = 100)

### Build Model

In [34]:
from sklearn.linear_model import LinearRegression

In [35]:
lreg_1 = LinearRegression()
lreg_1.fit(X_train,y_train)

### Understanding model parameters

In [36]:
lreg_1.intercept_

5.250023669916943

In [37]:
lreg_1.coef_

array([ 6.57354813e-02, -2.29752836e-01,  4.37288021e-02,  3.98769188e-04,
       -9.90649304e-03, -1.57791021e-01,  4.74048235e-01,  2.63889115e-02,
        9.62948207e-01, -5.03420176e-01,  5.46813163e-01, -4.85758481e-02,
        2.40092028e-01, -9.84300239e-01, -3.57630897e-01, -1.98572364e-01,
        2.02326151e-01, -2.02326151e-01,  2.74047246e-01, -2.74047246e-01,
        1.80533012e-01, -4.91829021e-05, -1.80483829e-01, -1.09967274e+00,
       -1.77654807e+00,  1.49179279e+00, -1.04281614e+00,  2.17431215e+00,
        8.47605720e-01, -1.32604780e+00,  2.20871152e+00, -2.10859113e+00,
       -6.44260249e-01,  3.25561581e-01,  2.91092145e+00, -8.25527014e-01,
       -1.13544207e+00])

In [38]:
dict(zip(X_train.columns,
         np.round(lreg_1.coef_,3)))

{'Seats': 0.066,
 'Age': -0.23,
 'Mileage': 0.044,
 'Power': 0.0,
 'KM_Driven': -0.01,
 'Location_Ahmedabad': -0.158,
 'Location_Bangalore': 0.474,
 'Location_Chennai': 0.026,
 'Location_Coimbatore': 0.963,
 'Location_Delhi': -0.503,
 'Location_Hyderabad': 0.547,
 'Location_Jaipur': -0.049,
 'Location_Kochi': 0.24,
 'Location_Kolkata': -0.984,
 'Location_Mumbai': -0.358,
 'Location_Pune': -0.199,
 'Fuel_Type_Diesel': 0.202,
 'Fuel_Type_Petrol': -0.202,
 'Transmission_Automatic': 0.274,
 'Transmission_Manual': -0.274,
 'Owner_Type_First': 0.181,
 'Owner_Type_Second': -0.0,
 'Owner_Type_Third': -0.18,
 'Model_a-star': -1.1,
 'Model_alto': -1.777,
 'Model_baleno': 1.492,
 'Model_celerio': -1.043,
 'Model_ciaz': 2.174,
 'Model_dzire': 0.848,
 'Model_eeco': -1.326,
 'Model_ertiga': 2.209,
 'Model_omni': -2.109,
 'Model_ritz': -0.644,
 'Model_swift': 0.326,
 'Model_vitara': 2.911,
 'Model_wagon': -0.826,
 'Model_zen': -1.135}

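##### Best-fit equation: y = intercept + sum of (coefficient × x)
Here y is the predicted Price. Each coefficient is the change in predicted price for a one-unit increase in that feature, with all other features held fixed. Examples, showing only the named term:
- Age: y = 5.25 - 0.230(Age) + (other terms)
- Location_Ahmedabad: y = 5.25 - 0.158(Location_Ahmedabad) + (other terms)
- Model_a-star: y = 5.25 - 1.100(Model_a-star) + (other terms)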
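
To illustrate how a single coefficient is read in a multiple regression, the sketch below compares two hypothetical cars that differ only in Age. It uses a subset of the fitted values printed above, and the helper `partial_prediction` is an illustrative name, not part of the notebook. Because every omitted feature is treated as 0, the individual numbers are not full price predictions; only their difference is meaningful:

```python
# Subset of the fitted values shown above (lreg_1.intercept_ and lreg_1.coef_).
intercept = 5.250023669916943
coef = {"Age": -2.29752836e-01, "KM_Driven": -9.90649304e-03, "Model_swift": 3.25561581e-01}

def partial_prediction(features):
    """Intercept plus the listed terms only; omitted features count as 0 (hypothetical helper)."""
    return intercept + sum(coef[name] * value for name, value in features.items())

# Two hypothetical cars that differ only in Age (5 vs. 6 years).
car_a = {"Age": 5, "KM_Driven": 60, "Model_swift": 1}
car_b = {"Age": 6, "KM_Driven": 60, "Model_swift": 1}

difference = partial_prediction(car_b) - partial_prediction(car_a)
print(round(difference, 3))  # -0.23, the Age coefficient
```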

### Predict on test set

In [39]:
y_pred = lreg_1.predict(X_test)

In [40]:
y_df = pd.DataFrame({"actual": y_test,
                     "predicted": y_pred,
                     "Residual": y_test-y_pred})

In [41]:
y_df.sample(10)

|  | actual | predicted | Residual |
|---|---|---|---|
| 565 | 4.5 | 4.60823 | -0.10823 |
| 892 | 6.3 | 6.068328 | 0.231672 |
| 709 | 5.2 | 4.798027 | 0.401973 |
| 116 | 4.9 | 4.868256 | 0.031744 |
| 778 | 9.75 | 9.035333 | 0.714667 |
| 583 | 6.14 | 5.307792 | 0.832208 |
| 965 | 1.25 | 0.056247 | 1.193753 |
| 666 | 7.85 | 7.611962 | 0.238038 |
| 376 | 3.4 | 3.676424 | -0.276424 |
| 24 | 9.9 | 7.774419 | 2.125581 |


### Measuring Accuracy: RMSE and R2

In [42]:
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score

In [43]:
mse = mean_squared_error(y_df.actual, y_df.predicted)
mse

0.36602417759200023

In [44]:
rmse = np.sqrt(mse)
rmse

0.6049993203235854

In [45]:
r2_score(y_df.actual, y_df.predicted)

0.9270346931469555

## Storing the Model

In [46]:
class Cars_Predicting_Model():

    def __init__(self, model, features, rmse):
        self.model = model
        self.features = features
        self.rmse = rmse

In [47]:
my_model = Cars_Predicting_Model(lreg_1, list(X_train.columns), rmse)

In [48]:
my_model.model

In [49]:
my_model.rmse

0.6049993203235854

In [50]:
from joblib import dump

In [51]:
dump(my_model, "cars_pred.pkl")

['cars_pred.pkl']
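
To use the saved object later, it has to be loaded with `joblib.load`, and any new data has to be encoded into exactly the columns stored in `features`. Loading an instance of a custom class such as `Cars_Predicting_Model` also requires that class to be defined or importable in the loading session. A hedged sketch of the round trip, using a plain dict and a stand-in model on made-up data (the file name `cars_pred_demo.pkl`, the toy frame, and the dict layout are all assumptions for illustration):

```python
import os
import tempfile

import numpy as np
import pandas as pd
from joblib import dump, load
from sklearn.linear_model import LinearRegression

# Stand-in model trained on a tiny made-up frame (not the car data).
train_df = pd.DataFrame({
    "Age": [2, 5, 8, 11],
    "Transmission_Automatic": [1, 0, 0, 0],
    "Transmission_Manual": [0, 1, 1, 1],
})
prices = np.array([7.5, 5.6, 4.0, 1.9])
model = LinearRegression().fit(train_df, prices)

# Bundle as a plain dict, so loading does not need a custom class definition.
bundle = {"model": model, "features": list(train_df.columns), "rmse": None}
path = os.path.join(tempfile.mkdtemp(), "cars_pred_demo.pkl")
dump(bundle, path)

# Later, possibly in another session: load the bundle and score a new car.
loaded = load(path)
new_car = pd.DataFrame({"Age": [6], "Transmission": ["Manual"]})
new_X = pd.get_dummies(new_car, columns=["Transmission"], dtype=int)

# Align to the training columns: add missing indicator columns as 0, keep the stored order.
new_X = new_X.reindex(columns=loaded["features"], fill_value=0)
print(loaded["model"].predict(new_X))
```

The `reindex` step matters because a single new row only produces indicator columns for the categories it contains, while the model expects every column it was trained on, in the same order.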