# Regression Example: Used Car Price Prediction

Regression analysis is a set of statistical methods for estimating the relationship between a continuous dependent variable (also called the 'outcome' or 'response' variable) and one or more independent variables (often called 'predictors' or 'features').

Source: https://en.wikipedia.org/wiki/Regression_analysis

Other References:

https://hbr.org/2015/11/a-refresher-on-regression-analysis

### Loading the Dataset

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sn

In [None]:
cars_df = pd.read_csv( "new_used_car.csv" )

In [None]:
cars_df.sample(5)

In [None]:
cars_df.info()

## Building a simple linear regression model

A simple linear regression model assumes a linear relationship between a single feature and the outcome variable.
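
As a minimal sketch of what fitting a line to a single feature means, the snippet below computes the least-squares intercept and slope by hand. The numbers are made up for illustration and are not taken from `new_used_car.csv`; the helper name `fit_line` is hypothetical.

```python
# Ordinary least squares for one feature, written out by hand.
# All values are invented for illustration only.
km = [10.0, 20.0, 30.0, 40.0, 50.0]   # hypothetical kilometres driven (in thousands)
price = [9.0, 8.0, 7.0, 6.0, 5.0]     # hypothetical prices


def fit_line(x, y):
    """Return (intercept, slope) of the least-squares line y = intercept + slope * x."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    # Slope is the covariance of x and y divided by the variance of x.
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    return intercept, slope


intercept, slope = fit_line(km, price)
print(intercept, slope)
```

These two numbers play the same role as `intercept_` and `coef_` on a fitted `LinearRegression` object.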

### Setting X and Y Variables

In [None]:
X = pd.DataFrame(cars_df['KM_Driven'])
y = cars_df['Price']

### Splitting the dataset

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    train_size = 0.8,
                                                    random_state = 80)

In [None]:
X_train.shape

In [None]:
X_test.shape

### Observing the relationship

In [None]:
sn.lmplot( data = cars_df.sample(100),
           x = 'KM_Driven',
           y = 'Price',
           fit_reg = False);

### Building the model

In [None]:
from sklearn.linear_model import LinearRegression

In [None]:
lreg_v1 = LinearRegression()
lreg_v1.fit(X_train, y_train)

#### Finding the model parameters

In [None]:
lreg_v1.intercept_

In [None]:
lreg_v1.coef_

### Predicting on test set and evaluating model performance

In [None]:
y_pred = lreg_v1.predict(X_test)

In [None]:
y_df = pd.DataFrame({"actual": y_test,
                     "predicted": y_pred,
                     "residual": y_test - y_pred})  # residual = actual - predicted

In [None]:
y_df.sample(10, random_state = 100)

#### What is R-squared?
https://www.investopedia.com/terms/r/r-squared.asp
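
As a rough sketch of what the score measures, R-squared can be computed by hand as one minus the ratio of the residual sum of squares to the total sum of squares around the mean. The values below are made up, and the helper name `r_squared` is hypothetical.

```python
def r_squared(actual, predicted):
    """R^2 = 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    # Squared error left over after the model's predictions.
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    # Squared error of always predicting the mean of the actual values.
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


actual = [3.0, 5.0, 7.0, 9.0]        # made-up outcomes
perfect = [3.0, 5.0, 7.0, 9.0]       # predictions that match exactly
mean_only = [6.0, 6.0, 6.0, 6.0]     # always predicting the mean

print(r_squared(actual, perfect))    # 1.0
print(r_squared(actual, mean_only))  # 0.0
```

A model that does worse than predicting the mean gets a negative score under this definition.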

In [None]:
from sklearn.metrics import r2_score

In [None]:
r2_score(y_test, y_pred)

### Participants Exercise: 1

Build a model using the following four features and measure its accuracy:

- mileage_new 
- engine_new 
- power_new
- KM_Driven

## Building model with all required variables (Multiple Linear Regression)

### Feature Set Selection

In [None]:
list(cars_df.columns)

In [None]:
#x_features = ['KM_Driven', 'Fuel_Type', 'age',
#              'Transmission', 'Owner_Type', 'Seats', 
#              'make', 'mileage_new', 'engine_new', 
#              'power_new', 'Location', 'model']

x_features = ['KM_Driven', 'Fuel_Type', 'age',
              'Transmission', 'Owner_Type', 'Seats', 
              'make', 'mileage_new', 'engine_new', 
              'power_new', 'Location']

In [None]:
cat_features = ['Fuel_Type', 
                'Transmission', 'Owner_Type',
                'make', 'Location']

#cat_features = ['Fuel_Type', 
#                'Transmission', 'Owner_Type',
#                'make', 'Location', 'model']

In [None]:
# Keep the order of x_features; a set difference returns the names in arbitrary order
num_features = [f for f in x_features if f not in cat_features]

In [None]:
num_features

In [None]:
cars_df[x_features].info()

In [None]:
cars_df.isnull().sum()

### Dropping Null Values

In [None]:
cars_df = cars_df[x_features + ['Price']].dropna()

In [None]:
cars_df.shape

In [None]:
cars_df.sample(10)

### Encoding Categorical Variables

OHE: One Hot Encoding

https://machinelearningmastery.com/why-one-hot-encode-data-in-machine-learning/
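
As a small illustration of what one-hot encoding produces, the sketch below applies `pd.get_dummies` to a toy frame. The rows are invented; only the column names mirror this notebook.

```python
import pandas as pd

# Toy frame with made-up values; not read from new_used_car.csv.
toy_df = pd.DataFrame({
    "KM_Driven": [42000, 15000, 88000],
    "Fuel_Type": ["Petrol", "Diesel", "Petrol"],
})

# Each category level becomes its own 0/1 column; numeric columns pass through unchanged.
encoded_toy_df = pd.get_dummies(toy_df, columns=["Fuel_Type"], dtype=int)
print(encoded_toy_df)
```

Passing `drop_first=True` would drop one level per categorical variable, which avoids perfectly collinear dummy columns in a linear model. This notebook keeps all levels.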

In [None]:
encoded_cars_df = pd.get_dummies(cars_df[x_features], 
                                 columns=cat_features)

In [None]:
encoded_cars_df.sample(5)

In [None]:
encoded_cars_df.columns

In [None]:
encoded_cars_df.shape

### Setting X and y variables

In [None]:
X = encoded_cars_df
y = cars_df['Price']

### Data Splitting

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    train_size = 0.8,
                                                    random_state = 80)

In [None]:
X_train.shape

In [None]:
X_test.shape

### Multiple Linear Regression Models

In [None]:
from sklearn.linear_model import LinearRegression

In [None]:
lreg_v1 = LinearRegression()

In [None]:
lreg_v1.fit(X_train, y_train)

### Understanding model parameters
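As a sketch of how the intercept and coefficients combine, a linear model's prediction is the intercept plus the dot product of the coefficients with the feature values. The toy data below is invented (generated from `y = 1 + 2*x1 + 3*x2`) and stands in for the car features.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up data that follows y = 1 + 2*x1 + 3*x2 exactly.
X_demo = np.array([[1.0, 0.0], [2.0, 1.0], [3.0, 0.0], [4.0, 1.0], [5.0, 1.0]])
y_demo = np.array([3.0, 8.0, 7.0, 12.0, 14.0])

model = LinearRegression().fit(X_demo, y_demo)

# Rebuild the predictions from the learned parameters.
manual = model.intercept_ + X_demo @ model.coef_
print(np.allclose(manual, model.predict(X_demo)))
```

Each coefficient is the change in the predicted outcome for a one-unit change in that feature, with the other features held fixed.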

In [None]:
lreg_v1.intercept_

In [None]:
lreg_v1.coef_

In [None]:
dict(zip(X_train.columns, 
         np.round(lreg_v1.coef_, 3)))

### Predict on test set

In [None]:
y_pred = lreg_v1.predict(X_test)

In [None]:
y_df = pd.DataFrame({"actual": y_test,
                     "predicted": y_pred,
                     "residual": y_test - y_pred})  # residual = actual - predicted

In [None]:
y_df.sample(10, random_state = 100)

In [None]:
r2_score(y_test, y_pred)

### Measuring Accuracy
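The cells in this section compute MSE and RMSE with scikit-learn and NumPy. As a sketch of what those numbers mean, MSE is the average squared difference between actual and predicted values, and RMSE is its square root, which is in the same units as the price. The values and the helper name `mse` below are made up for illustration.

```python
import math


def mse(actual, predicted):
    """Mean of the squared differences between actual and predicted values."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


actual = [5.0, 7.5, 3.0, 10.0]      # made-up prices
predicted = [6.0, 7.5, 1.0, 9.0]    # made-up predictions

mse_value = mse(actual, predicted)
rmse_value = math.sqrt(mse_value)   # back in the units of the outcome
print(mse_value, rmse_value)
```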

In [None]:
from sklearn.metrics import mean_squared_error, r2_score

In [None]:
mse_v1 = mean_squared_error(y_test, y_pred)

In [None]:
mse_v1

In [None]:
rmse_v1 = np.sqrt(mse_v1)

In [None]:
rmse_v1

### Participant Exercise: 2

Take a different training set, build a model, and measure its accuracy. But how can different training and test sets be sampled?
- Change `random_state` to a different number when splitting into training and test sets, then measure the r2 value.
- Repeat the above process for 5 different `random_state` values and note the r2 values.

### K-FOLD Cross Validation
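K-fold cross-validation splits the data into k folds, holds out each fold in turn for validation, and trains on the rest. As a hedged sketch on synthetic data (not the car dataset), the loop below does this by hand with `KFold` and compares the result with `cross_val_score`, which for a regressor and an integer `cv` uses unshuffled k-fold splits.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import KFold, cross_val_score

# Synthetic data, generated only for illustration.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
y_demo = X_demo @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=100)

# Manual loop: fit on k-1 folds, score on the held-out fold.
manual_scores = []
for train_idx, val_idx in KFold(n_splits=5).split(X_demo):
    model = LinearRegression().fit(X_demo[train_idx], y_demo[train_idx])
    manual_scores.append(r2_score(y_demo[val_idx], model.predict(X_demo[val_idx])))

# The same procedure in one call.
auto_scores = cross_val_score(LinearRegression(), X_demo, y_demo, cv=5, scoring="r2")
print(np.round(manual_scores, 4))
print(np.round(auto_scores, 4))
```

The mean of the fold scores estimates typical performance, and their standard deviation indicates how much that estimate depends on the particular split.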

In [None]:
from sklearn.model_selection import cross_val_score

scores = cross_val_score(LinearRegression(),
                         X_train,
                         y_train,
                         cv = 10,
                         scoring = 'r2')
scores.mean()

In [None]:
scores

In [None]:
scores.std()

In [None]:
r2_score(y_test, y_pred)

### What are the reasons for the remaining error?

1. Missing factors: features that influence the price but are not in the data
2. Too few samples
3. Model too simple: try other, more complex models
4. Missing feature engineering: derive new features (factors) from existing features (factors)
5. Noise (randomness)
   

### Saving the model
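The cells in this section bundle the fitted model with its feature list and RMSE and write the bundle to disk with joblib. As a hedged sketch of the full round trip (save, load, predict), the example below uses a plain dict as a hypothetical stand-in for `CarPredictionModel`, tiny made-up data, and a temporary file instead of `./cars.pkl`.

```python
import os
import tempfile

import numpy as np
from joblib import dump, load
from sklearn.linear_model import LinearRegression

# Tiny made-up dataset, only to have a fitted model to save.
X_demo = np.array([[10.0], [20.0], [30.0], [40.0]])
y_demo = np.array([9.0, 8.0, 7.0, 6.0])
model = LinearRegression().fit(X_demo, y_demo)

# Hypothetical bundle: a dict instead of the notebook's CarPredictionModel class.
bundle = {"model": model, "features": ["KM_Driven"], "rmse": 0.0}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_cars.pkl")
    dump(bundle, path)        # serialize to disk
    restored = load(path)     # read it back

print(restored["features"])
print(restored["model"].predict(np.array([[25.0]])))
```

An instance of a custom class such as `CarPredictionModel` can only be loaded in a program where that class definition is importable. A plain dict avoids that dependency.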

In [None]:
class CarPredictionModel():
    
    def __init__(self, model, features, rmse):
        self.model = model
        self.features = features
        self.rmse = rmse

In [None]:
my_model = CarPredictionModel(lreg_v1, list(X_train.columns), rmse_v1)

In [None]:
my_model.rmse

In [None]:
# Uncomment this code for older version of sklearn
#from sklearn.externals import joblib
#joblib.dump(my_model, './cars.pkl')

In [None]:
from joblib import dump

In [None]:
dump(my_model, './cars.pkl')

### Participant Exercise: 3

1. Remove all cars prior to 2010.
2. Add the car model (categorical variable) to the list of x features.
3. Build a new linear regression model.
4. Predict on the test set and measure the accuracy (RMSE and R-squared values).
5. Run cross-validation and find the mean and std of the r2 values.

## Building KNN Model

In [None]:
sn.lmplot( data = cars_df.sample(50),
           x = "mileage_new",
           y = 'KM_Driven',
           fit_reg = False);

In [None]:
cars_df.sample(10)

### Scaling the data

- Min Max Scaler
- Standard Scaler
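
As a sketch of what the two scalers compute for a single column: min-max scaling maps values to the range 0 to 1, and standard scaling subtracts the mean and divides by the standard deviation. The values and helper names below are made up. The population standard deviation is used here because scikit-learn's `StandardScaler` does the same.

```python
import statistics


def min_max_scale(values):
    """(x - min) / (max - min), giving values between 0 and 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def standard_scale(values):
    """(x - mean) / std, giving mean 0 and standard deviation 1."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # population standard deviation (ddof = 0)
    return [(v - mean) / std for v in values]


km = [10000.0, 20000.0, 30000.0, 40000.0, 50000.0]  # made-up kilometre readings
print(min_max_scale(km))   # [0.0, 0.25, 0.5, 0.75, 1.0]
print(standard_scale(km))
```

Scaling matters for KNN because distances are otherwise dominated by features with large numeric ranges, such as kilometres driven.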

In [None]:
X.shape

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    train_size = 0.8,
                                                    random_state = 80)

In [None]:
from sklearn.preprocessing import MinMaxScaler

In [None]:
scaler = MinMaxScaler()

In [None]:
scaler.fit(X_train)

In [None]:
x_train_scaled = scaler.transform(X_train)
x_test_scaled = scaler.transform(X_test)

In [None]:
x_train_scaled.shape

In [None]:
X_train[0:10]

### Build the model
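The model in this section uses `weights='distance'`, which means closer neighbors count more in the prediction. As a simplified sketch of that idea (not scikit-learn's implementation), the hypothetical helper below averages the targets of the k nearest points, weighting each by the inverse of its distance. The data is made up.

```python
import math


def knn_predict(train_X, train_y, query, k):
    """Inverse-distance-weighted average of the k nearest training targets."""
    # Pair every training point's distance to the query with its target, closest first.
    distances = sorted(
        (math.dist(row, query), target) for row, target in zip(train_X, train_y)
    )
    nearest = distances[:k]
    # If the query coincides with training points, average those to avoid dividing by zero.
    exact = [t for d, t in nearest if d == 0.0]
    if exact:
        return sum(exact) / len(exact)
    weights = [1.0 / d for d, _ in nearest]
    return sum(w * t for w, (_, t) in zip(weights, nearest)) / sum(weights)


train_X = [[0.0], [3.0], [10.0]]    # made-up, already scaled feature values
train_y = [10.0, 40.0, 100.0]       # made-up prices

prediction = knn_predict(train_X, train_y, query=[1.0], k=2)
print(prediction)
```

With uniform weights the two nearest targets would simply be averaged to 25; weighting by inverse distance pulls the prediction toward the closer neighbor.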

In [None]:
from sklearn.neighbors import KNeighborsRegressor

In [None]:
knn_v1 = KNeighborsRegressor(n_neighbors=10,
                             weights='distance')

In [None]:
knn_v1.fit(x_train_scaled, y_train)

In [None]:
x_train_scaled.shape

### Predicting on test data and calculating accuracy

In [None]:
y_knn_pred = knn_v1.predict(x_test_scaled)

In [None]:
mse_knn = mean_squared_error(y_test, y_knn_pred)

In [None]:
np.sqrt(mse_knn)

In [None]:
r2_score(y_test, y_knn_pred)

### Participant Exercise: 4

Finding the best parameters

- Iterate through a list of possible k values, for example 3 through 15.
- Build a model for each k value, predict on the test set, and measure its accuracy.
- Print the k value for which r2 is maximum.

### Grid Search
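A minimal sketch of grid search for a KNN regressor, assuming scikit-learn's `GridSearchCV` and synthetic data rather than the car dataset. The parameter grid (k from 3 through 15, both weighting schemes) is an illustrative choice, not a recommendation from this notebook.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsRegressor

# Synthetic data in the 0-1 range, standing in for min-max scaled features.
rng = np.random.default_rng(42)
X_demo = rng.uniform(0.0, 1.0, size=(200, 2))
y_demo = np.sin(4 * X_demo[:, 0]) + X_demo[:, 1] + rng.normal(scale=0.05, size=200)

# Every combination in the grid is evaluated with 5-fold cross-validation.
param_grid = {
    "n_neighbors": list(range(3, 16)),
    "weights": ["uniform", "distance"],
}
search = GridSearchCV(KNeighborsRegressor(), param_grid, cv=5, scoring="r2")
search.fit(X_demo, y_demo)

print(search.best_params_)
print(round(search.best_score_, 3))
```

Because each candidate is scored by cross-validation on the training data, the test set stays untouched until the final evaluation.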