# Finding Neighbors in KNN

How do we find the examples in a dataset that are closest to a given sample?

**Author:** Manaranjan Pradhan<br>
**Email ID:** manaranjan@gmail.com<br>
**LinkedIn:** https://www.linkedin.com/in/manaranjanpradhan/

## Load Dataset

Loading the used car resale price dataset.

In [44]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sn

In [45]:
cars_df = pd.read_csv( "final_cars_maruti.csv" )

In [46]:
cars_df.sample(5)

Unnamed: 0,Location,Fuel_Type,Transmission,Owner_Type,Seats,Price,Age,Model,Mileage,Power,KM_Driven
808,Chennai,Diesel,Manual,First,5,6.5,5,ciaz,26.21,88.5,90
48,Mumbai,Petrol,Manual,First,7,8.25,2,ertiga,16.02,93.7,18
768,Chennai,Petrol,Manual,First,7,3.75,4,eeco,15.1,73.0,55
286,Mumbai,Petrol,Manual,First,5,3.95,4,ritz,18.5,85.8,11
571,Coimbatore,Petrol,Automatic,First,5,7.29,2,baleno,21.4,83.1,34


In [47]:
cars_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1010 entries, 0 to 1009
Data columns (total 11 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Location      1010 non-null   object 
 1   Fuel_Type     1010 non-null   object 
 2   Transmission  1010 non-null   object 
 3   Owner_Type    1010 non-null   object 
 4   Seats         1010 non-null   int64  
 5   Price         1010 non-null   float64
 6   Age           1010 non-null   int64  
 7   Model         1010 non-null   object 
 8   Mileage       1010 non-null   float64
 9   Power         1010 non-null   float64
 10  KM_Driven     1010 non-null   int64  
dtypes: float64(3), int64(3), object(5)
memory usage: 86.9+ KB


Selecting the features that will be used for modeling.

In [48]:
x_features = ['Fuel_Type', 
              'Transmission', 
              'Owner_Type', 
              'Age', 
              'Model', 
              'KM_Driven']

In [49]:
x_features

['Fuel_Type', 'Transmission', 'Owner_Type', 'Age', 'Model', 'KM_Driven']

In [50]:
cat_vars = ['Fuel_Type',
            'Transmission',
            'Owner_Type',
            'Model']

In [51]:
num_vars = list(set(x_features) - set(cat_vars))

In [52]:
num_vars

['KM_Driven', 'Age']

### Setting X and y variables

In [53]:
X = cars_df[x_features]
y = cars_df['Price']

## Creating Pipelines for Feature Transformation

1. Categorical columns
    - One-hot encoding (OHE)
2. Numerical columns
    - Min-max scaling

In [54]:
from sklearn.preprocessing import OneHotEncoder, MinMaxScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

#### Pipeline for OHE for categorical columns

In [55]:
ohe_encoder = OneHotEncoder(handle_unknown='ignore')
cat_transformer = Pipeline(steps=[('oheencoder', ohe_encoder)])

#### Pipeline for scaling numerical columns

In [56]:
minmax_scaler = MinMaxScaler()
num_transformer = Pipeline(steps=[('scaler', minmax_scaler)])
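As a rough illustration of what the two transformers above compute, the sketch below reproduces min-max scaling and one-hot encoding by hand with NumPy. The values are made up for illustration and are not rows from the dataset, and the code is a hand-rolled approximation, not the scikit-learn implementation.

```python
import numpy as np

# Toy values, made up for illustration (not rows from the dataset).
ages = np.array([2.0, 5.0, 8.0])
fuel = ["Petrol", "Diesel", "Petrol"]

# Min-max scaling: (x - min) / (max - min) maps the column onto [0, 1].
ages_scaled = (ages - ages.min()) / (ages.max() - ages.min())

# One-hot encoding: one 0/1 column per category, categories in sorted order.
categories = sorted(set(fuel))
fuel_ohe = np.array([[1.0 if value == cat else 0.0 for cat in categories]
                     for value in fuel])

print(ages_scaled.tolist())  # [0.0, 0.5, 1.0]
print(categories)            # ['Diesel', 'Petrol']
print(fuel_ohe.tolist())     # [[0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]
```

After both steps every column lies in the range 0 to 1, so no single feature dominates a distance calculation purely because of its units.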

#### Defining the processing pipeline

In [57]:
preprocessor = ColumnTransformer(
        transformers = [('numerical', num_transformer, num_vars),
                        ('categorical', cat_transformer, cat_vars)])

## Finding Nearest Neighbors

https://scikit-learn.org/stable/modules/neighbors.html

Nearest neighbor methods find a predefined number of training samples closest in distance to a new sample (data point) and predict its label from them. The number of samples can be a user-defined constant (k-nearest neighbor learning).
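The idea can be sketched as a brute-force search: compute the distance from the new sample to every stored sample, sort, and keep the first k. The helper `k_nearest` and the toy points below are hypothetical names and values used only for illustration; scikit-learn may use faster index structures internally.

```python
import numpy as np

def k_nearest(points, query, k):
    """Return (distances, indices) of the k rows of `points` closest to `query`."""
    # Euclidean distance from the query to every row.
    distances = np.sqrt(((points - query) ** 2).sum(axis=1))
    # Indices of the k smallest distances, nearest first.
    order = np.argsort(distances, kind="stable")[:k]
    return distances[order], order

# Toy 2-D points, made up for illustration.
points = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 4.0]])
query = np.array([0.0, 0.0])

dists, idx = k_nearest(points, query, k=3)
print(idx.tolist())    # [0, 1, 2]
print(dists.tolist())  # [0.0, 1.0, 2.0]
```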

In [58]:
from sklearn.neighbors import NearestNeighbors

Find 5 neighbors based on the distance.

In [83]:
nn = NearestNeighbors(n_neighbors = 5)

The features need to be preprocessed before the samples can be used for finding neighbors.

In [None]:
preprocess_pipeline = Pipeline(steps=[('preprocessor', preprocessor)])

In [60]:
preprocess_pipeline.fit(X)

In [61]:
X_data = preprocess_pipeline.transform(X)

In [62]:
nn.fit(X_data)
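Preprocessing matters because the neighbor search is distance based (`NearestNeighbors` uses Euclidean distance by default). The sketch below uses made-up (Age, KM_Driven) rows, not dataset rows, to show that the nearest row can change once each column is min-max scaled. The helper `distances_to_first` is a hypothetical name introduced for this illustration.

```python
import numpy as np

# Made-up rows of (Age, KM_Driven); not taken from the dataset.
rows = np.array([[8.0, 87.0],     # row 0: query car
                 [3.0, 88.0],     # row 1: much newer, similar KM_Driven
                 [8.0, 95.0],     # row 2: same age, slightly more driven
                 [1.0, 10.0],     # rows 3-4 only widen the column ranges
                 [14.0, 178.0]])

def distances_to_first(data):
    """Euclidean distance from row 0 to every other row."""
    return np.sqrt(((data[1:] - data[0]) ** 2).sum(axis=1))

# Distances in raw units.
raw = distances_to_first(rows)

# Distances after min-max scaling each column onto [0, 1].
col_min, col_max = rows.min(axis=0), rows.max(axis=0)
scaled = distances_to_first((rows - col_min) / (col_max - col_min))

nearest_raw = int(np.argmin(raw)) + 1        # +1 because row 0 was skipped
nearest_scaled = int(np.argmin(scaled)) + 1

print(nearest_raw)     # 1
print(nearest_scaled)  # 2
```

In raw units a 5-year age gap counts for less than an 8-unit `KM_Driven` gap, so row 1 looks closest. After scaling, each difference is measured relative to its column's range, and the same-age car in row 2 becomes the nearest neighbor.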

### Example 1

Let's find the 5 nearest neighbors (i.e., the cars with the most similar attributes) in the complete dataset, which has about 1,000 examples.

In [64]:
data = {'Fuel_Type': 'Diesel',
        'Transmission': 'Manual',
        'Owner_Type': 'First',
        'Age': 8,
        'Model': 'ertiga',
        'KM_Driven': 87}

In [65]:
data_df = pd.DataFrame(data, index=[0])

In [66]:
neighbors = nn.kneighbors(preprocess_pipeline.transform(data_df), 
                          n_neighbors=5,
                          return_distance = True)

In [67]:
neighbors

(array([[0.        , 0.00595238, 0.0297619 , 0.10497238, 0.1271089 ]]),
 array([[  0, 298, 338, 886, 234]]))

*neighbors* is a tuple of two arrays:

- The first array holds the distances to the nearest neighbors.
- The second array holds the indices of those neighbors in the dataset.
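As a small sketch, the returned tuple can be unpacked into named arrays and walked pair by pair. The values below are copied from the output above so that the snippet runs on its own; the variable names `distances` and `indices` are chosen here for illustration.

```python
import numpy as np

# Values copied from the kneighbors() output shown above.
neighbors = (np.array([[0.0, 0.00595238, 0.0297619, 0.10497238, 0.1271089]]),
             np.array([[0, 298, 338, 886, 234]]))

# Each array has one row per query sample; there is a single query here.
distances, indices = neighbors
print(distances.shape, indices.shape)  # (1, 5) (1, 5)

# Pair every neighbor index with its distance, nearest first.
pairs = [(int(idx), float(dist)) for dist, idx in zip(distances[0], indices[0])]
for idx, dist in pairs:
    print(f"row {idx}: distance {dist:.4f}")
```

The neighbors are ordered by increasing distance. A distance of 0 for row 0 means that row matches the query on every transformed feature.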

In [75]:
neighbors_index = list(neighbors[1][0])
neighbors_index

[0, 298, 338, 886, 234]

In [70]:
data_df

Unnamed: 0,Fuel_Type,Transmission,Owner_Type,Age,Model,KM_Driven
0,Diesel,Manual,First,8,ertiga,87


Let's look at the neighbors in the dataset and judge intuitively how similar they are to the test sample above.

In [77]:
X[X.index.isin(neighbors_index)]

Unnamed: 0,Fuel_Type,Transmission,Owner_Type,Age,Model,KM_Driven
0,Diesel,Manual,First,8,ertiga,87
234,Diesel,Manual,First,7,ertiga,70
298,Diesel,Manual,First,8,ertiga,88
338,Diesel,Manual,First,8,ertiga,82
886,Diesel,Manual,First,7,ertiga,75


### Estimating the Price

The price is estimated from the price of the neighbors.

In [80]:
# y holds the prices for the full dataset that nn was fitted on
y[y.index.isin(neighbors_index)]

0      6.00
234    5.45
298    5.85
338    6.50
886    6.55
Name: Price, dtype: float64

Average of the resale price of the above cars.

In [82]:
np.round(y[y.index.isin(neighbors_index)].mean(), 4)

6.07
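The arithmetic behind this estimate can be checked by hand. The sketch below copies the five neighbor prices from the output above into a plain dictionary (the name `prices` is introduced here for illustration) and takes their unweighted mean.

```python
# Neighbor prices copied from the output above, keyed by row index.
prices = {0: 6.00, 234: 5.45, 298: 5.85, 338: 6.50, 886: 6.55}

# Unweighted mean: every neighbor contributes equally to the estimate.
estimate = sum(prices.values()) / len(prices)
print(round(estimate, 4))  # 6.07
```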

## Participant Exercise: 1

- Estimate the price of the test sample above using a weighted average of the neighbors' prices.
- Use the distances returned by *nn.kneighbors()* to derive the weights.