# Regression Using KNN

**Author:** Manaranjan Pradhan<br>
**Email ID:** manaranjan@gmail.com<br>
**LinkedIn:** https://www.linkedin.com/in/manaranjanpradhan/

## Loading the Dataset

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sn

In [None]:
cars_df = pd.read_csv( "final_cars_maruti.csv" )

In [None]:
cars_df.sample(5)

In [None]:
cars_df.info()

## What is a Neighbour?

Let's say we only know two attributes of cars.

- Age
- KM_Driven

In [None]:
cars_subset_df = cars_df[['Age', 
                          'KM_Driven', 
                          'Price']].sample(10, random_state = 70)

In [None]:
cars_subset_df.reset_index(inplace=True, 
                           drop=True) 
cars_subset_df

In [None]:
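# Pass the DataFrame via data= so the call also works on seaborn
# versions where positional arguments are not accepted
sn.scatterplot(data = cars_subset_df,
               x = 'Age',
               y = 'KM_Driven');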

- **Cars with very similar attributes, i.e. similar age and kilometers driven, are called neighbors.**
- Similar cars are a shorter distance apart, i.e. they lie nearer to each other in the Euclidean space of age and kilometers driven than dissimilar cars do.
- The distance between car **x** and car **y** is given by:

$ dist_{xy}  = \sqrt {\left( age_{x}-age_{y}\right)^2 + \left( km_{x}-km_{y}\right)^2 } $

https://en.wikipedia.org/wiki/Euclidean_distance

### What is the distance between two cars?

Let's find which cars from this sample set are nearest to a car that is 4 years old and has been driven around 30 thousand kilometers.

In [None]:
def cardist(age_x, km_x, age_y, km_y):
    return  np.round(np.sqrt((age_x - age_y)**2 
                           + (km_x - km_y)**2), 2)

In [None]:
cardist(10, 50, #car 1
        4, 30)  #car 2

In [None]:
cars_subset_df['dist'] = (cars_subset_df
                          .apply(lambda rec:                                               
                                 cardist(rec['Age'], rec['KM_Driven'],                                                       
                                         4, 30), 
                                 axis = 1))

In [None]:
cars_subset_df.sort_values('dist', 
                           ascending=True)

## Is the distance calculation correct?

In [None]:
custs_df = pd.DataFrame({"Name": ["A", "B", "C", "D"],
                         "Age" : [20, 21, 70, 50], 
                         "Income": [10000, 11000, 10500, 90000]})

In [None]:
def custdist(age_x, income_x, age_y, income_y):
    return  np.round(np.sqrt((age_x - age_y)**2 
                           + (income_x - income_y)**2), 2)

In [None]:
custs_df

In [None]:
## Distance between A and B
custdist(20, 10000, #cust A
         21, 11000) #cust B

In [None]:
## Distance between A and C
custdist(20, 10000, #cust A
         70, 10500) #cust C

#### Conclusion:

- By raw distance, A appears much closer to C (about 502) than to B (about 1000).
In reality, A and B are very similar, whereas A and C are very different because of their large age gap.

- This happens because age and income are measured on very different scales, so the income difference dominates the distance.
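To make the scale problem concrete, the sketch below splits the squared distance into the share contributed by each feature. It reuses the customer values from the table above; the helper name `squared_contributions` is introduced here only for illustration.

```python
import numpy as np

# Customers A, B and C from the table above: (age, income)
a = np.array([20, 10000])
b = np.array([21, 11000])
c = np.array([70, 10500])

def squared_contributions(x, y):
    """Return each feature's share of the squared Euclidean distance."""
    sq = (x - y) ** 2
    return sq / sq.sum()

share_ab = squared_contributions(a, b)
share_ac = squared_contributions(a, c)

# In both pairs income accounts for almost all of the distance,
# even though A and C differ by 50 years in age.
print("A-B shares (age, income):", share_ab)
print("A-C shares (age, income):", share_ac)
```

In both pairs the income term supplies about 99% or more of the squared distance, so age has almost no influence.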

## Scale the features

For distance calculations, we need to bring all features onto the same scale.

####  Min Max Scaler


In this technique, the minimum value of the feature is scaled to 0 and the maximum value is scaled to 1. All other values are scaled to a value between 0 and 1 based on their relative position to the minimum and maximum values.

$X_{norm} = \frac{X_{i} - X_{min}}{X_{max} - X_{min}}$

[Sklearn Source](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html)
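As a quick check of the formula, here is a minimal sketch that applies min-max scaling by hand to the four customer ages used earlier. The variable names are illustrative.

```python
import numpy as np

# Ages of customers A, B, C and D
ages = np.array([20, 21, 70, 50], dtype=float)

# (X_i - X_min) / (X_max - X_min)
ages_norm = (ages - ages.min()) / (ages.max() - ages.min())
print(ages_norm)
```

The youngest customer maps to 0, the oldest to 1, and ages 21 and 50 map to 0.02 and 0.6.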


#### Standard Scaler

Standard scaling, also known as standardization, is a data preprocessing technique used in machine learning and data science to transform the features of a dataset so that they have a mean of 0 and a standard deviation of 1.

$X_{norm} = \frac{X_{i} - \mu}{\sigma}$

[Sklearn Source](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html)
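The notebook continues with min-max scaling. For comparison, here is a minimal sketch of standard scaling on the same customer values, using scikit-learn's `StandardScaler`. The names `custs` and `custs_std` are introduced only for this example.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Same age and income values as the customer table
custs = pd.DataFrame({"Age": [20, 21, 70, 50],
                      "Income": [10000, 11000, 10500, 90000]})

std_scaler = StandardScaler()
custs_std = std_scaler.fit_transform(custs)

# After scaling, each column has mean ~0 and standard deviation ~1
col_means = custs_std.mean(axis=0)
col_stds = custs_std.std(axis=0)
print(col_means)
print(col_stds)
```

Unlike min-max scaling, the scaled values are not confined to the range 0 to 1.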

In [None]:
from sklearn.preprocessing import MinMaxScaler

In [None]:
scaler = MinMaxScaler()
scaled_data = scaler.fit_transform(custs_df[['Age', 
                                             'Income']])
scaled_data

In [None]:
# Select each scaled column as a 1-D array
custs_df['Age_norm'] = scaled_data[:, 0]
custs_df['Income_norm'] = scaled_data[:, 1]

In [None]:
custs_df

In [None]:
## Distance between A and B
custdist(0, 0, 0.02, 0.0125)

In [None]:
## Distance between A and C
## (C has the maximum age, so its scaled age is 1)
custdist(0, 0, 1, 0.00625)

The distances between the customers make sense now: A is much closer to B than to C.

Feature scaling is an important step before running the KNN algorithm.

### Scaling Cars Data

In [None]:
scaler = MinMaxScaler()
cars_scaled_data = scaler.fit_transform(cars_subset_df[['Age', 
                                                        'KM_Driven']])
cars_scaled_df = pd.DataFrame(cars_scaled_data)
cars_scaled_df.columns = ['Age_Norm', 
                          'KM_Driven_Norm']

In [None]:
cars_scaled_df

In [None]:
cars_scaled_df['Price'] = cars_subset_df['Price']
cars_scaled_df['Age'] = cars_subset_df['Age']
cars_scaled_df['KM_Driven'] = cars_subset_df['KM_Driven']
cars_scaled_df

In [None]:
cars_scaled_df['dist'] = (cars_scaled_df
                          .apply(lambda rec: cardist(rec['Age_Norm'],                                                                   
                                                     rec['KM_Driven_Norm'],                                                                   
                                                     0.250,                                                                   
                                                     0.206349),                                               
                                 axis = 1)) 

# Sorting the cars by their distance
cars_scaled_df = cars_scaled_df.sort_values('dist', 
                                            ascending=True)
cars_scaled_df
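Instead of typing scaled values for the query car by hand, the fitted scaler can transform the query directly. The sketch below assumes a small made-up stand-in for `cars_subset_df`. The toy values are hypothetical and were chosen only so that the arithmetic is easy to follow.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical stand-in for cars_subset_df (values are made up)
toy_cars = pd.DataFrame({"Age": [2, 4, 6, 10],
                         "KM_Driven": [17, 30, 55, 80]})

# Fit the scaler on the known cars only
toy_scaler = MinMaxScaler().fit(toy_cars)

# Scale the query car (4 years old, 30k km) with the fitted scaler
query = pd.DataFrame({"Age": [4], "KM_Driven": [30]})
query_scaled = toy_scaler.transform(query)
print(query_scaled)
```

With these toy values the query scales to about 0.25 for age, (4 - 2) / (10 - 2), and about 0.2063 for kilometers, (30 - 17) / (80 - 17). Using `transform` keeps the query on exactly the scale learned from the data.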

## Predicting From Neighbours

- What would be the resale price of a car that is 4 years old and has been driven for 30k kilometers?

#### Using Simple Average of neighbors

$ Price_{pred} = \frac{Price_{n1}+Price_{n2}}{2}$

In [None]:
prices_list = list(cars_scaled_df[1:3]['Price'])
prices_list

In [None]:
pred_sale_value = (prices_list[0] + prices_list[1])/ 2

In [None]:
pred_sale_value

#### Using Weighted Average normalized by distance

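$ Weight_{n} = \frac{1}{distance_{n}}$

$ Price_{pred} = \frac{Price_{n1} * Weight_{n1} + Price_{n2} * Weight_{n2}}{Weight_{n1} + Weight_{n2}}$

In the code, a small constant (0.001) is added to each distance so that a distance of zero does not cause a division by zero.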

In [None]:
cars_scaled_df['weights'] = 1 / (cars_scaled_df['dist'] + 0.001)
cars_scaled_df

In [None]:
price_list = list(cars_scaled_df[1:3]['Price'])
price_list

In [None]:
weights_list = list(cars_scaled_df[1:3]['weights'])
weights_list

In [None]:
pred_sale_value_wa = (((price_list[0] * weights_list[0]) 
                      + (price_list[1] * weights_list[1]))
                      / (weights_list[0] + weights_list[1]))

In [None]:
pred_sale_value_wa
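scikit-learn's `KNeighborsRegressor` offers the same inverse-distance weighting through `weights="distance"`. The sketch below compares a manual calculation with the library result on made-up data. The arrays `X_toy` and `y_toy` and the query point are hypothetical. Unlike the notebook code, this version adds no small constant to the distances.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Made-up scaled features (age, km) and prices, for illustration only
X_toy = np.array([[0.0, 0.1], [0.3, 0.2], [0.9, 0.8]])
y_toy = np.array([5.0, 4.0, 1.5])
query_pt = np.array([[0.2, 0.2]])

# Manual inverse-distance weighted average over the 2 nearest neighbours
dists = np.sqrt(((X_toy - query_pt) ** 2).sum(axis=1))
nearest = np.argsort(dists)[:2]
w = 1 / dists[nearest]
manual_pred = (y_toy[nearest] * w).sum() / w.sum()

# The same computation via scikit-learn
knn_w = KNeighborsRegressor(n_neighbors=2, weights="distance")
knn_w.fit(X_toy, y_toy)
sk_pred = knn_w.predict(query_pt)[0]

print(manual_pred, sk_pred)
```

The two predictions agree, which suggests the hand calculation above is what `weights="distance"` does internally when no neighbor is at distance zero.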

## Using more features

### Feature Set Selection

In [None]:
x_features = ['Fuel_Type', 
              'Transmission', 
              'Owner_Type', 
              'Age', 
              'Model', 
              'KM_Driven']

In [None]:
x_features

In [None]:
cat_vars = ['Fuel_Type',
            'Transmission',
            'Owner_Type',
            'Model']

In [None]:
# Keep the order of x_features (a set difference has no guaranteed order)
num_vars = [col for col in x_features if col not in cat_vars]

In [None]:
num_vars

### Need for Data Transformation

1. Categorical columns
    - One-hot encoding (OHE)
2. Numerical columns
    - Min-max scaling, so that no single feature dominates the distance
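One-hot encoding turns each category into its own 0/1 column, so that categorical values can take part in a distance calculation. Here is a minimal sketch with scikit-learn's `OneHotEncoder`. The fuel values are hypothetical and may not match the categories in the dataset.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical fuel types, for illustration only
fuel = pd.DataFrame({"Fuel_Type": ["Petrol", "Diesel", "Petrol", "CNG"]})

enc = OneHotEncoder(handle_unknown="ignore")
encoded = enc.fit_transform(fuel).toarray()

# One column per category, in sorted order
print(enc.categories_[0])
print(encoded)

# A category not seen during fit is encoded as all zeros
unseen = enc.transform(pd.DataFrame({"Fuel_Type": ["LPG"]})).toarray()
print(unseen)
```

With `handle_unknown="ignore"`, an unseen category does not raise an error at prediction time. It simply produces a row of zeros.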

### Setting X and y variables

In [None]:
X = cars_df[x_features]
y = cars_df['Price']

### Data Splitting

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    train_size = 0.8,
                                                    random_state = 80)

In [None]:
X_train.shape

In [None]:
X_test.shape

## Creating Pipelines for KNN

In [None]:
from sklearn.preprocessing import OneHotEncoder, MinMaxScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

#### Pipeline for OHE for categorical columns

In [None]:
ohe_encoder = OneHotEncoder(handle_unknown='ignore')
cat_transformer = Pipeline(steps=[('oheencoder', ohe_encoder)])

#### Pipeline for scaling numerical columns

In [None]:
minmax_scaler = MinMaxScaler()
num_transformer = Pipeline(steps=[('scaler', minmax_scaler)])

#### Defining the processing pipeline

In [None]:
preprocessor = ColumnTransformer(
        transformers = [('numerical', num_transformer, num_vars),
                        ('categorical', cat_transformer, cat_vars)])

## Building KNN Model

In [None]:
from sklearn.neighbors import KNeighborsRegressor

In [None]:
knn = KNeighborsRegressor(n_neighbors=5, 
                          weights='uniform')

In [None]:
knn_pipeline = Pipeline(steps = [('preprocessor', preprocessor),
                                 ('regression', knn)])

In [None]:
knn_pipeline

In [None]:
knn_pipeline.fit(X_train, y_train)

## Predicting on Test Set and Measuring Accuracy

In [None]:
y_pred = knn_pipeline.predict(X_test)

In [None]:
from sklearn.metrics import mean_squared_error, r2_score

In [None]:
np.round(r2_score(y_test, y_pred), 2)
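`mean_squared_error` is imported above but not used. One common way to use it is to report the root mean squared error (RMSE), which is in the same units as the price. The sketch below uses made-up true and predicted values rather than the notebook's `y_test` and `y_pred`.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Made-up true and predicted prices, for illustration only
y_true_toy = np.array([3.0, 5.0, 2.5, 7.0])
y_pred_toy = np.array([2.5, 5.0, 3.0, 8.0])

# RMSE is the square root of the mean squared error
rmse = np.sqrt(mean_squared_error(y_true_toy, y_pred_toy))
print(np.round(rmse, 2))
```

In the notebook, the same two lines could be applied to `y_test` and `y_pred`.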

## Predicting on New Data

In [None]:
data = {'Fuel_Type': 'Diesel',
        'Transmission': 'Manual',
        'Owner_Type': 'First',
        'Age': 8,
        'Model': 'ertiga',
        'KM_Driven': 87}

In [None]:
data_df = pd.DataFrame(data, index=[0])

In [None]:
data_df

In [None]:
knn_pipeline.predict(data_df)

## Participant Exercise : 1: How many neighbors?

- Find out what number of neighbors gives the best accuracy on the test set.
- Loop through several candidate values for the number of neighbors, e.g. 5, 10, 20, etc.
    - Build a model for each number of neighbors
    - Measure accuracy, i.e. R²
    - Print the number of neighbors and the accuracy score
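As a starting point, the loop could look roughly like the sketch below. It runs on synthetic data generated inside the block, not on the cars dataset, so all array names and the data-generating formula are assumptions made for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import r2_score

# Synthetic data: two scaled features and a noisy linear target
rng = np.random.default_rng(0)
X_syn = rng.uniform(0, 1, size=(200, 2))
y_syn = 10 - 6 * X_syn[:, 0] - 3 * X_syn[:, 1] + rng.normal(0, 0.3, 200)

Xs_train, Xs_test, ys_train, ys_test = train_test_split(
    X_syn, y_syn, train_size=0.8, random_state=80)

# Fit one model per candidate k and record its test-set R2
scores = {}
for k in [5, 10, 20]:
    model = KNeighborsRegressor(n_neighbors=k)
    model.fit(Xs_train, ys_train)
    scores[k] = np.round(r2_score(ys_test, model.predict(Xs_test)), 2)
    print(k, scores[k])
```

With the notebook's pipeline, the same loop could call `knn_pipeline.set_params(regression__n_neighbors=k)` before each `fit`, since the KNN step is named `regression`.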



## Participant Exercise : 2

- Build the model with all the variables from the dataset
- Use the above approach to find the number of neighbors that gives the best accuracy on the test set.