# Regression Example: Used Car Price Prediction

This notebook introduces the steps to build a regression model that predicts the resale price of a used car.

### Dataset

**Filename**: final_cars_maruti.csv

It is a comma-separated file with 11 columns.

1. Model - Model of the car
2. Location - The location in which the car was sold.
3. Age - Age of the car at the time of sale, counted from the year of purchase.
4. KM_Driven - Total distance driven by the previous owner(s), in '000 kms.
5. Fuel_Type - The type of fuel used by the car. (Petrol, Diesel, Electric, CNG, LPG)
6. Transmission - The type of transmission used by the car. (Automatic / Manual)
7. Owner_Type - First, Second, Third, or Fourth & Above
8. Mileage - The standard mileage offered by the car company in kmpl or km/kg
9. Power - The maximum power of the engine in bhp.
10. Seats - The number of seats in the car.
11. Price - The resale price of the car (target).


## 1. Loading the Dataset

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sn

In [None]:
cars_df = pd.read_csv('final_cars_maruti.csv')

In [None]:
cars_df.head(5)

In [None]:
cars_df.shape

### Relationship between Age and Price

In [None]:
sn.scatterplot(data = cars_df,
               y = 'Price',
               x = 'Age');

## 2. Simple Linear Regression

Simple linear regression assumes a linear relationship between a single feature and the outcome variable.

Simple linear regression is given by,

$\hat{Y} = \beta_{0} + \beta_{1}X$
									
- $\beta_{0}$ and $\beta_{1}$ are the regression coefficients
- $\hat{Y}$ is the predicted value of ${Y}$.


So, the error (Mean Squared Error) over $N$ samples is:

${mse}$ = $\frac{1}{N} \sum_{i=1}^{N}(Y_{i} - \hat{Y}_{i})^2$

or

${mse}$ = $\frac{1}{N} \sum_{i=1}^{N}(Y_{i} - (\beta_{0} + \beta_{1} X_{i}))^2$
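As a minimal sketch of how the coefficients and the MSE relate, the closed-form least-squares estimates can be computed by hand on a tiny made-up sample. The `age` and `price` values below are hypothetical and are not taken from `final_cars_maruti.csv`.

```python
import numpy as np

# Hypothetical toy data: car age and resale price (arbitrary units).
age = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
price = np.array([9.0, 8.1, 6.9, 6.2, 4.8])

# Closed-form least-squares estimates for simple linear regression:
# beta_1 = sum((x - x_mean) * (y - y_mean)) / sum((x - x_mean)^2)
# beta_0 = y_mean - beta_1 * x_mean
x_mean, y_mean = age.mean(), price.mean()
beta_1 = np.sum((age - x_mean) * (price - y_mean)) / np.sum((age - x_mean) ** 2)
beta_0 = y_mean - beta_1 * x_mean

# Predicted values and the mean squared error of those predictions.
price_hat = beta_0 + beta_1 * age
mse_toy = np.mean((price - price_hat) ** 2)

print(f"beta_0 = {beta_0:.3f}, beta_1 = {beta_1:.3f}, mse = {mse_toy:.4f}")
```

The negative slope says that, in this toy sample, each additional unit of age lowers the predicted price.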




Regression Explained: https://mlu-explain.github.io/linear-regression/

### Setting X and Y Variables

In [None]:
X = cars_df[['Age']]
y = cars_df.Price

### Splitting the dataset

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    train_size = 0.8,
                                                    random_state = 100)

In [None]:
X_train.shape

In [None]:
X_test.shape

### Building the Model

In [None]:
from sklearn.linear_model import LinearRegression

In [None]:
lreg_v1 = LinearRegression()

In [None]:
lreg_v1.fit(X_train, y_train)

### Finding the model parameters

In [None]:
lreg_v1.intercept_

In [None]:
lreg_v1.coef_

In [None]:
sn.lmplot(data = cars_df,
          y = 'Price',
          x = 'Age',
          fit_reg = True);

### Predicting on Test Set 

In [None]:
y_pred = lreg_v1.predict(X_test)

In [None]:
y_df = pd.DataFrame( { "actual" : y_test,
                       "predicted" : y_pred,
                       "residual": y_test - y_pred }) 

In [None]:
y_df.sample(10)

### Error or Accuracy Analysis: RMSE

In [None]:
from sklearn.metrics import mean_squared_error

In [None]:
mse = mean_squared_error(y_df.actual, y_df.predicted)

In [None]:
mse

In [None]:
rmse = np.sqrt(mse)

In [None]:
rmse

### What is R-squared?

R-squared is a statistical measure that indicates how much of the variation of a dependent variable is explained by an independent variable in a regression model.

https://www.investopedia.com/terms/r/r-squared.asp


Total Variance in Y = $\sum_{i=1}^{N}(Y_{i} - \bar{Y})^2$ 

where, 

- $\bar{Y}$ is the mean of Y.

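Unexplained Variance = $\sum_{i=1}^{N}(Y_{i} - (\beta_{0} + \beta_{1} X_{i}))^2$  

Explained Variance = Total Variance - Unexplained Variance


$R^{2}$ is given by:

$R^{2}$ = $\frac{Explained\ Variance}{Total\ Variance}$ = $1 - \frac{Unexplained\ Variance}{Total\ Variance}$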
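The ratio can be checked by hand. The sketch below takes the explained variation as the total variation minus the unexplained variation and computes $R^{2}$ from a handful of hypothetical actual and predicted values; the numbers are illustrative only.

```python
import numpy as np

# Hypothetical actual and predicted prices (arbitrary units).
actual = np.array([9.0, 8.1, 6.9, 6.2, 4.8])
predicted = np.array([9.06, 8.03, 7.00, 5.97, 4.94])

# Total variation of Y around its mean.
ss_total = np.sum((actual - actual.mean()) ** 2)
# Variation left unexplained by the model (sum of squared residuals).
ss_unexplained = np.sum((actual - predicted) ** 2)
# Explained variation is whatever remains.
ss_explained = ss_total - ss_unexplained

# R-squared as the share of the total variation that the model explains.
r_squared = ss_explained / ss_total
print(f"R-squared = {r_squared:.4f}")
```

The same value is obtained from `1 - ss_unexplained / ss_total`, which is the form `r2_score` evaluates.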


Notes:

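- On the data used to fit the model, R-squared for a linear regression with an intercept ranges from 0 to 1 and is commonly stated as a percentage from 0% to 100%. On a held-out test set it can be negative, which means the model predicts worse than simply using the mean of Y.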
- What is a “good” R-squared value depends on the domain or context. In the field of social sciences, even a relatively low R-squared, such as 0.5, could be considered relatively strong. In other fields, the standards for a good R-squared reading can be much higher, such as 0.9 or above. In finance, an R-squared above 0.7 would generally be seen as showing a high level of correlation. [Source](https://www.investopedia.com/terms/r/r-squared.asp)

In [None]:
from sklearn.metrics import r2_score

In [None]:
r2_score(y_df.actual, y_df.predicted)

### Participants Exercise: 1

Build a model using the following two features and measure its accuracy in terms of RMSE and R2.

- Age
- KM_Driven

## 3. Building a model with more variables

Based on the most important questions that customers ask:

- Which model is it? (categorical feature)
- How old is the vehicle?
- How many kilometers has it been driven?
    

### Feature Set Selection

In [None]:
x_features = ['Model', 'Age', 'KM_Driven']

### How to encode categorical variables?

OHE: One Hot Encoding

https://machinelearningmastery.com/why-one-hot-encode-data-in-machine-learning/
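A minimal sketch of what one-hot encoding produces, using a hypothetical three-row frame (the model names are placeholders, not values from the dataset). It also shows the effect of `drop_first = True`, which drops one indicator column per categorical feature so that the dropped category becomes the baseline.

```python
import pandas as pd

# Hypothetical sample; the model names are placeholders.
sample_df = pd.DataFrame({"Model": ["swift", "alto", "swift"],
                          "Age": [3, 5, 2]})

# Each distinct value of Model becomes its own indicator column.
encoded_df = pd.get_dummies(sample_df, columns=["Model"])

# drop_first=True drops the first category, which then acts as the baseline.
encoded_drop_df = pd.get_dummies(sample_df, columns=["Model"], drop_first=True)

print(list(encoded_df.columns))
print(list(encoded_drop_df.columns))
```

With a linear model that fits an intercept, the full set of indicator columns is redundant (they always sum to 1), which is one reason to drop a category.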

In [None]:
encoded_cars_df = pd.get_dummies(cars_df[x_features],
                                 columns = ['Model'])

In [None]:
encoded_cars_df.head(5)

In [None]:
encoded_cars_df.shape

In [None]:
len(cars_df.Model.unique())

### Setting X and y variables

In [None]:
X = encoded_cars_df
y = cars_df.Price

### Data Splitting

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    train_size = 0.8,
                                                    random_state = 40)

In [None]:
X.shape

In [None]:
X_train.shape

In [None]:
X_train[0:10]

In [None]:
X_test.shape

## 4. Multiple Linear Regression Model


Multiple linear regression is given by,

$\hat{Y} = \beta_{0} + \beta_{1}X_{1} + \beta_{2}X_{2} + ... + \beta_{n}X_{n}$
									
- $\beta_{0}$, $\beta_{1}$...$\beta_{n}$  are the regression coefficients
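To make the formula concrete, here is a small sketch that solves for the coefficients with NumPy's least-squares routine instead of scikit-learn. The feature values and the "true" coefficients are made up for illustration, and the names `X_toy` and `y_toy` are hypothetical.

```python
import numpy as np

# Hypothetical feature matrix: columns stand for Age and KM_Driven.
X_toy = np.array([[1.0, 10.0],
                  [2.0, 25.0],
                  [3.0, 30.0],
                  [4.0, 55.0],
                  [5.0, 60.0]])
# Prices generated from known coefficients so the fit is easy to verify.
y_toy = 10.0 - 1.0 * X_toy[:, 0] - 0.02 * X_toy[:, 1]

# Prepend a column of ones so the first coefficient plays the role of beta_0.
X_design = np.column_stack([np.ones(len(X_toy)), X_toy])

# Least-squares solution for [beta_0, beta_1, beta_2].
betas, *_ = np.linalg.lstsq(X_design, y_toy, rcond=None)

# A prediction is the intercept plus the weighted sum of the features.
y_hat_toy = betas[0] + X_toy @ betas[1:]
print(np.round(betas, 3))
```

Because the toy prices contain no noise, the recovered coefficients match the ones used to generate them.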

In [None]:
lreg_v2 = LinearRegression()
lreg_v2.fit(X_train, y_train)

### Understanding model parameters

In [None]:
lreg_v2.intercept_

In [None]:
lreg_v2.coef_

In [None]:
dict(zip(X_train.columns,
         np.round(lreg_v2.coef_, 3)))

### Predict on test set

In [None]:
y_pred_2 = lreg_v2.predict(X_test)

In [None]:
y_df_2 = pd.DataFrame( {"actual": y_test,
                        "predicted": y_pred_2,
                        "residual": y_test - y_pred_2} )

In [None]:
y_df_2.info()

In [None]:
y_df_2.sample(10)

### Measuring Accuracy

In [None]:
r2_score(y_df_2.actual, y_df_2.predicted)

In [None]:
mse_2 = mean_squared_error(y_df_2.actual, y_df_2.predicted)

In [None]:
rmse_2 = np.sqrt(mse_2)
rmse_2

### Participants Exercise: 2

Build a model using the following five features and measure its accuracy in terms of RMSE and R2.

- Age
- KM_Driven
- Model
- Transmission
- Fuel_Type

## 5. Building a model with all the variables

### Feature Set Selection

In [None]:
x_features = list(cars_df.columns)

In [None]:
x_features

In [None]:
x_features.remove('Price')

In [None]:
x_features

### Encoding Categorical Variables


In [None]:
cat_features = ['Location',
                'Fuel_Type',
                'Transmission',
                'Owner_Type',
                'Model']

In [None]:
# Keep the original column order (a set difference would not preserve it)
num_features = [f for f in x_features if f not in cat_features]

In [None]:
num_features

In [None]:
encoded_cars_df = pd.get_dummies(cars_df[x_features],
                                 columns = cat_features,
                                 drop_first = True)

In [None]:
encoded_cars_df.head(5)

In [None]:
encoded_cars_df.columns

### Setting X and y variables

In [None]:
X = encoded_cars_df
y = cars_df.Price

### Data Splitting

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    test_size = 0.2,
                                                    random_state = 40)

In [None]:
X_train.shape

In [None]:
X_test.shape

### Build Model

In [None]:
lreg_v3 = LinearRegression()
lreg_v3.fit(X_train, y_train)

### Understanding model parameters

In [None]:
lreg_v3.intercept_

In [None]:
lreg_v3.coef_

In [None]:
dict(zip(X_train.columns,
         np.round(lreg_v3.coef_, 3)))

### Predict on test set

In [None]:
y_pred_3 = lreg_v3.predict(X_test)

In [None]:
y_df_3 = pd.DataFrame({"actual": y_test,
                       "predicted": y_pred_3,
                       "residual": y_test - y_pred_3})

In [None]:
y_df_3.sample(10)

### Measuring Accuracy: RMSE and R2

In [None]:
r2_score(y_df_3.actual, y_df_3.predicted)

In [None]:
mse_3 = mean_squared_error(y_df_3.actual, y_df_3.predicted)
rmse_3 = np.sqrt(mse_3)
rmse_3

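### What are the reasons for the remaining error?

1. Missing factors: features that influence the price but are not in the dataset
2. Too few samples to learn the relationship reliably
3. The model is too simple: try other, more complex models
4. Noise (randomness)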
   

### Participant Exercise: 3

Take a different training set, build the model, and measure its accuracy. But how do we sample different training and test sets?
- Change the random_state to a different number when splitting into training and test sets, then measure the r2 value.
- Repeat the above process for 5 different random_states and make a note of the r2 values.
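One possible shape for this loop is sketched below on synthetic data, so it runs without the CSV file. The arrays `X_demo` and `y_demo` and the chosen seeds are assumptions for illustration; in the notebook they would be replaced by the encoded features and `Price`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the encoded features and prices (assumption).
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(200, 2))
y_demo = (10 - 0.8 * X_demo[:, 0] - 0.1 * X_demo[:, 1]
          + rng.normal(0, 0.5, size=200))

r2_values = {}
for seed in [10, 20, 30, 40, 50]:
    # A different random_state yields a different train/test partition.
    X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo,
                                              test_size=0.2,
                                              random_state=seed)
    model = LinearRegression().fit(X_tr, y_tr)
    r2_values[seed] = r2_score(y_te, model.predict(X_te))

print(r2_values)
```

The spread of the five values gives a rough sense of how much the reported accuracy depends on which rows happen to land in the test set.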

## 6. Storing the model

In [None]:
class CarPredictionModel():
    
    def __init__(self, model, features, rmse):
        self.model = model
        self.features = features
        self.rmse = rmse

In [None]:
my_model = CarPredictionModel(lreg_v3, list(X_train.columns), rmse_3)

In [None]:
my_model.model

In [None]:
my_model.rmse

In [None]:
from joblib import dump

In [None]:
dump(my_model, 'cars.pkl')
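Reading a stored model back is the mirror image of `dump`. The sketch below is self-contained and makes several assumptions: it trains a tiny stand-in model on made-up numbers, writes to a temporary path rather than `cars.pkl`, and uses `types.SimpleNamespace` in place of `CarPredictionModel` so that it runs on its own. With the real class, the class definition must be available in the session that calls `load`.

```python
import os
import tempfile
from types import SimpleNamespace

import numpy as np
from joblib import dump, load
from sklearn.linear_model import LinearRegression

# Tiny stand-in model trained on made-up data (assumption, for illustration).
demo_model = LinearRegression().fit(np.array([[1.0], [2.0], [3.0]]),
                                    np.array([9.0, 8.0, 7.0]))

# SimpleNamespace mimics the wrapper's attributes: model, features, rmse.
wrapped = SimpleNamespace(model=demo_model, features=["Age"], rmse=0.0)

# Save to a temporary file and read it back.
demo_path = os.path.join(tempfile.mkdtemp(), "cars_demo.pkl")
dump(wrapped, demo_path)
restored = load(demo_path)

# The restored object predicts exactly like the original one.
demo_prediction = restored.model.predict(np.array([[4.0]]))
print(restored.features, demo_prediction)
```

Storing the feature list next to the model makes it possible to build the input columns in the same order at prediction time.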

In [None]:
!ls -al