# Regression Example: Used Car Price Prediction
### Multiple Linear Regression

This notebook introduces the steps to build a regression model that predicts the resale price of a used car.

### Dataset

**Filename**: final_cars_maruti.csv

It is a comma-separated file with 11 columns:

1. Model - Model of the car
2. Location - The location in which the car was sold.
3. Age - Age of the car at the time of sale, measured from the year of purchase.
4. KM_Driven - Total distance driven by the previous owner(s), in thousands of kilometers ('000 km).
5. Fuel_Type - The type of fuel used by the car. (Petrol, Diesel, Electric, CNG, LPG)
6. Transmission - The type of transmission used by the car. (Automatic / Manual)
7. Owner_Type - First, Second, Third, or Fourth & Above
8. Mileage - The standard mileage offered by the car company in kmpl or km/kg
9. Power - The maximum power of the engine in bhp.
10. Seats - The number of seats in the car.
11. Price - The resale price of the car (target).


## 1. Loading the Dataset

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sn

In [2]:
# Raw string (r"...") so the backslashes in the Windows path are not read as escape sequences
cars_df = pd.read_csv(r"E:\ML_course\practice\S7_ML_Multiple_Linear_Regression/final_cars_maruti.csv")

In [3]:
cars_df[:5]

Unnamed: 0,Location,Fuel_Type,Transmission,Owner_Type,Seats,Price,Age,Model,Mileage,Power,KM_Driven
0,Chennai,Diesel,Manual,First,7,6.0,8,ertiga,20.77,88.76,87
1,Jaipur,Diesel,Manual,First,5,5.6,5,swift,25.2,74.0,64
2,Jaipur,Diesel,Manual,First,5,5.99,3,swift,28.4,74.0,25
3,Hyderabad,Petrol,Manual,Second,5,2.75,7,alto,20.92,67.1,54
4,Jaipur,Petrol,Manual,Second,5,1.85,11,wagon,14.0,64.0,83


## Building a model with more variables

The features are chosen based on the questions customers ask most often:

- Which model is it? (categorical feature)
- How old is the vehicle? (numerical feature)
- How many kilometers has it been driven? (numerical feature)
    

### Feature Set Selection

In [4]:
x_features = ['Model', 'Age' , 'KM_Driven']

- Model is a categorical feature, while Age and KM_Driven are numerical, so Model must be converted into numerical form before fitting. We will use one-hot encoding (OHE), which represents each category as its own 0/1 column.

### How to encode categorical variables?

OHE: One Hot Encoding

https://machinelearningmastery.com/why-one-hot-encode-data-in-machine-learning/
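
As a minimal sketch of what one-hot encoding does, the snippet below encodes a tiny made-up table. The `toy_df` frame and its values are hypothetical and are not taken from the dataset; `dtype=int` is an optional `pd.get_dummies` argument that requests 0/1 columns directly.

```python
import pandas as pd

# Hypothetical three-row sample, not taken from final_cars_maruti.csv
toy_df = pd.DataFrame({"Model": ["swift", "alto", "swift"],
                       "Age": [5, 7, 3]})

# Each distinct value of Model becomes its own indicator column;
# dtype=int asks for 0/1 values rather than True/False
toy_encoded = pd.get_dummies(toy_df, columns=["Model"], dtype=int)

print(toy_encoded)
```

The numerical column `Age` is left unchanged, and each row has a 1 in exactly one of the `Model_*` columns.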

In [5]:
encoded_cars_df = pd.get_dummies(cars_df[x_features],
                                columns = ['Model'] )

In [6]:
encoded_cars_df = encoded_cars_df.replace({True : 1, False :0})

In [7]:
encoded_cars_df[:10]

Unnamed: 0,Age,KM_Driven,Model_a-star,Model_alto,Model_baleno,Model_celerio,Model_ciaz,Model_dzire,Model_eeco,Model_ertiga,Model_omni,Model_ritz,Model_swift,Model_vitara,Model_wagon,Model_zen
0,8,87,0,0,0,0,0,0,0,1,0,0,0,0,0,0
1,5,64,0,0,0,0,0,0,0,0,0,0,1,0,0,0
2,3,25,0,0,0,0,0,0,0,0,0,0,1,0,0,0
3,7,54,0,1,0,0,0,0,0,0,0,0,0,0,0,0
4,11,83,0,0,0,0,0,0,0,0,0,0,0,0,1,0
5,2,50,0,0,0,0,0,0,0,0,0,0,0,1,0,0
6,12,90,0,1,0,0,0,0,0,0,0,0,0,0,0,0
7,6,52,0,0,0,0,0,0,0,0,0,0,1,0,0,0
8,6,53,0,0,0,0,0,0,0,0,0,0,1,0,0,0
9,7,65,0,0,0,0,0,0,0,0,0,0,1,0,0,0


### Setting X and y variables

In [8]:
X = encoded_cars_df
y = cars_df.Price

### Data Splitting

In [9]:
from sklearn.model_selection import train_test_split

In [10]:
X_train, X_test, y_train, y_test = train_test_split(X,y, train_size= 0.8, random_state= 40)

## Multiple Linear Regression Model


Multiple linear regression is given by,

$\hat{Y} = \beta_{0} + \beta_{1}X_{1} + \beta_{2}X_{2} + ... + \beta_{n}X_{n}$
									
- $\beta_{0}$ is the intercept and $\beta_{1}$...$\beta_{n}$ are the regression coefficients of the features $X_{1}$...$X_{n}$
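
To make the equation concrete, here is a small sketch on synthetic data, assuming made-up values $\beta_{0} = 1$, $\beta_{1} = 2$, $\beta_{2} = -3$. The names `X_demo`, `y_demo` and `demo_model` are illustrative only. Because the data contain no noise, the fitted model should return those same values.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic, noise-free data generated from known coefficients (assumed values)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 2))
y_demo = 1.0 + 2.0 * X_demo[:, 0] - 3.0 * X_demo[:, 1]

demo_model = LinearRegression().fit(X_demo, y_demo)

# With no noise, the fitted values should match beta_0 = 1, beta_1 = 2, beta_2 = -3
print(demo_model.intercept_, demo_model.coef_)
```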

In [11]:
from sklearn.linear_model import LinearRegression

In [12]:
lreg_1= LinearRegression()
lreg_1.fit(X_train, y_train)

### Understanding model parameters

In [13]:
lreg_1.intercept_

6.797674428948882

In [14]:
lreg_1.coef_

array([-0.32510246, -0.00458647, -1.02336178, -1.86138553,  1.35119448,
       -0.98470574,  2.17690626,  1.24631711, -1.36101243,  2.18856792,
       -2.32440576, -0.43699509,  0.25826952,  3.0745239 , -1.15622224,
       -1.14769062])

In [15]:
# np.round_ was removed in NumPy 2.0; np.round is the supported name
dict(zip(X_train.columns,
         np.round(lreg_1.coef_, 3)))

{'Age': -0.325,
 'KM_Driven': -0.005,
 'Model_a-star': -1.023,
 'Model_alto': -1.861,
 'Model_baleno': 1.351,
 'Model_celerio': -0.985,
 'Model_ciaz': 2.177,
 'Model_dzire': 1.246,
 'Model_eeco': -1.361,
 'Model_ertiga': 2.189,
 'Model_omni': -2.324,
 'Model_ritz': -0.437,
 'Model_swift': 0.258,
 'Model_vitara': 3.075,
 'Model_wagon': -1.156,
 'Model_zen': -1.148}

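##### Reading the fitted equation: y = intercept + sum of (coefficient × feature value)
Here y is the predicted price. Each coefficient is the contribution of its feature with the other features held fixed, for example:
- Age contributes -0.325 × Age, so each additional year lowers the predicted price by 0.325
- Model_a-star contributes -1.023 × Model_a-star, so the term is -1.023 for an a-star and 0 for any other model

With the intercept of 6.798, the full equation is y = 6.798 - 0.325(Age) - 0.005(KM_Driven) - 1.023(Model_a-star) + ... - 1.148(Model_zen).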
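
As a worked check of how the intercept and coefficients combine, the sketch below recomputes one prediction by hand for the first row of the dataset (an ertiga with Age 8 and KM_Driven 87). The numbers are copied from the outputs above; the variable names are illustrative.

```python
# Intercept and the three relevant coefficients, copied from the outputs above
intercept = 6.797674428948882
coef_age = -0.32510246
coef_km_driven = -0.00458647
coef_model_ertiga = 2.18856792

# First row of the dataset: an ertiga, Age 8, KM_Driven 87
age, km_driven = 8, 87

# Every other Model_* column is 0 for this car, so those terms drop out
predicted_price = (intercept
                   + coef_age * age
                   + coef_km_driven * km_driven
                   + coef_model_ertiga * 1)

print(round(predicted_price, 3))
```

The result is roughly 5.99, close to the actual price of 6.0 listed for that row.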

### Predict on test set

In [16]:
y_pred = lreg_1.predict(X_test)

In [17]:
y_df = pd.DataFrame({"actual": y_test,
                     "predicted": y_pred,
                     "Residual": y_test-y_pred})

In [18]:
y_df.sample(10)

Unnamed: 0,actual,predicted,Residual
805,3.9,4.889765,-0.989765
283,3.22,1.790216,1.429784
620,3.9,4.157003,-0.257003
962,3.65,3.731804,-0.081804
689,3.95,3.779797,0.170203
599,4.25,4.088206,0.161794
303,4.35,4.111139,0.238861
608,8.9,6.981541,1.918459
85,2.2,3.304994,-1.104994
358,4.85,6.398645,-1.548645


### Measuring Accuracy

In [19]:
from sklearn.metrics import mean_squared_error

In [20]:
mse = mean_squared_error(y_df.actual, y_df.predicted)
mse

0.6673136631799984

In [21]:
rmse = np.sqrt(mse)
rmse

0.8168926876769056

In [22]:
from sklearn.metrics import r2_score

In [23]:
r2_score(y_df.actual,y_df.predicted)

0.8553137746014273
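
To show what these metrics compute, the sketch below evaluates MSE, RMSE and R2 from their definitions on the ten sampled rows printed earlier. Since this uses only ten rows, the values will differ from the full test-set results above. The `sample_*` names are illustrative.

```python
import math

# (actual, predicted) pairs copied from the ten sampled rows shown earlier
pairs = [(3.9, 4.889765), (3.22, 1.790216), (3.9, 4.157003),
         (3.65, 3.731804), (3.95, 3.779797), (4.25, 4.088206),
         (4.35, 4.111139), (8.9, 6.981541), (2.2, 3.304994),
         (4.85, 6.398645)]

n = len(pairs)
# MSE: mean of the squared residuals
sample_mse = sum((a - p) ** 2 for a, p in pairs) / n
# RMSE: square root of MSE, in the same units as Price
sample_rmse = math.sqrt(sample_mse)

# R2: 1 - (residual sum of squares / total sum of squares around the mean)
mean_actual = sum(a for a, _ in pairs) / n
ss_res = sum((a - p) ** 2 for a, p in pairs)
ss_tot = sum((a - mean_actual) ** 2 for a, _ in pairs)
sample_r2 = 1 - ss_res / ss_tot

print(sample_mse, sample_rmse, sample_r2)
```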

### Participant Exercise: 1

Build a model using the following five features and measure its accuracy in terms of RMSE and R2.

- Age
- KM_Driven
- Model
- Transmission Type
- Fuel Type

### Participant Exercise: 2

Take different training sets, build a model on each, and measure its accuracy. But how do we sample different training and test sets?
- Change random_state to a different number when creating the training and test split, then measure the R2 value.
- Repeat this for 5 different random_state values and note the R2 values.
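
One possible shape for this loop is sketched below. It runs on synthetic stand-in data (`X_demo`, `y_demo`, both made up here) so that it is self-contained; in the notebook, the existing `X` and `y` would take their place. The five `random_state` values are arbitrary.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data so the sketch runs on its own (assumption);
# in the notebook, use the existing X and y instead
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.5, size=200)

r2_by_state = {}
for state in [1, 10, 40, 77, 100]:  # five arbitrary random_state values
    X_tr, X_te, y_tr, y_te = train_test_split(
        X_demo, y_demo, train_size=0.8, random_state=state)
    model = LinearRegression().fit(X_tr, y_tr)
    r2_by_state[state] = r2_score(y_te, model.predict(X_te))

for state, r2 in r2_by_state.items():
    print(state, round(r2, 3))
```

The spread of the five R2 values gives a rough sense of how much the score depends on which rows land in the test set.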