# Regression Using KNN

## Participant Exercise 1: How many neighbors?

- Which number of neighbors gives the best accuracy on the test set?
- Loop through several candidate values, e.g. 5, 10, 20, etc.
    - Build a model for each number of neighbors
    - Measure accuracy, i.e. the R2 score
    - Print the number of neighbors and the accuracy scores

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sn

In [2]:
# Raw string so the backslashes in the Windows path are not read as escape sequences
cars_df = pd.read_csv(r"E:\ML_course\practice\S9_Regression _Using_KNN/final_cars_maruti.csv")

In [3]:
cars_df.sample(5)

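|     | Location   | Fuel_Type | Transmission | Owner_Type | Seats | Price | Age | Model   | Mileage | Power | KM_Driven |
|-----|------------|-----------|--------------|------------|-------|-------|-----|---------|---------|-------|-----------|
| 702 | Chennai    | Petrol    | Manual       | First      | 5     | 6.25  | 3   | swift   | 20.85   | 83.14 | 28        |
| 976 | Mumbai     | Petrol    | Automatic    | First      | 5     | 2.5   | 8   | a-star  | 16.98   | 66.1  | 32        |
| 305 | Pune       | Diesel    | Automatic    | First      | 5     | 9.0   | 3   | dzire   | 28.4    | 73.75 | 41        |
| 255 | Hyderabad  | Diesel    | Automatic    | First      | 5     | 9.15  | 3   | dzire   | 28.4    | 73.75 | 28        |
| 629 | Coimbatore | Petrol    | Automatic    | First      | 5     | 5.25  | 3   | celerio | 23.1    | 67.04 | 35        |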


In [4]:
cars_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1010 entries, 0 to 1009
Data columns (total 11 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Location      1010 non-null   object 
 1   Fuel_Type     1010 non-null   object 
 2   Transmission  1010 non-null   object 
 3   Owner_Type    1010 non-null   object 
 4   Seats         1010 non-null   int64  
 5   Price         1010 non-null   float64
 6   Age           1010 non-null   int64  
 7   Model         1010 non-null   object 
 8   Mileage       1010 non-null   float64
 9   Power         1010 non-null   float64
 10  KM_Driven     1010 non-null   int64  
dtypes: float64(3), int64(3), object(5)
memory usage: 86.9+ KB


### Feature Set Selection

In [5]:
x_features = ['Fuel_Type', 
              'Transmission', 
              'Owner_Type', 
              'Age', 
              'Model', 
              'KM_Driven']

In [6]:
x_features

['Fuel_Type', 'Transmission', 'Owner_Type', 'Age', 'Model', 'KM_Driven']

In [7]:
cat_vars = ['Fuel_Type',
            'Transmission',
            'Owner_Type',
            'Model']

In [8]:
# Keep the order of x_features; a set difference does not guarantee any order
num_vars = [col for col in x_features if col not in cat_vars]
num_vars

['Age', 'KM_Driven']

### Need for Data Transformation

1. Categorical columns
    - One-hot encoding (OHE)
2. Numerical columns
    - Min-max scaling, because KNN is distance-based and unscaled columns with larger ranges would dominate the distance

- Setting X and y variables

In [9]:
X = cars_df[x_features]
y = cars_df['Price']

- Data Splitting

In [10]:
from sklearn.model_selection import train_test_split

In [11]:
X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    train_size=0.8,
                                                    random_state=80)

## Creating Pipelines for KNN

In [12]:
from sklearn.preprocessing import OneHotEncoder, MinMaxScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

#### Pipeline for OHE for categorical columns

In [13]:
ohe_encoder = OneHotEncoder(handle_unknown = 'ignore')
cat_transform = Pipeline(steps=[('oheencoder', ohe_encoder)])
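To see what `handle_unknown='ignore'` does, here is a minimal sketch on a toy `Model` column. The names `demo_encoder` and `demo_encoded` are illustrative and not part of the notebook. A category that was not seen during fitting is encoded as a row of zeros instead of raising an error.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy data for illustration only (not the full cars dataset)
demo_encoder = OneHotEncoder(handle_unknown='ignore')
demo_encoder.fit(pd.DataFrame({'Model': ['swift', 'dzire', 'celerio']}))

# 'ertiga' was not present when fitting, so it maps to an all-zero row
demo_encoded = demo_encoder.transform(
    pd.DataFrame({'Model': ['swift', 'ertiga']})
).toarray()

print(list(demo_encoder.categories_[0]))  # learned categories, sorted
print(demo_encoded)
```

This matters for a train/test split: a car model that appears only in the test set does not break prediction, although it contributes no information to the distance calculation.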

#### Pipeline for Min Max Scaler for numerical columns

In [14]:
minmax_scaler = MinMaxScaler()
num_transform = Pipeline(steps=[('Minmax', minmax_scaler)])
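As a rough illustration of what the scaler does, the sketch below applies `MinMaxScaler` to the `Age` and `KM_Driven` values of three of the sample rows shown earlier (index 702, 976 and 305). The names `demo_num` and `demo_scaled` are illustrative only. Each column is mapped to the range 0 to 1 via `(x - min) / (max - min)`, so that neither column dominates the distances KNN computes.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Columns: Age, KM_Driven (three sample rows, for illustration only)
demo_num = np.array([[3, 28],
                     [8, 32],
                     [3, 41]], dtype=float)

# Each column is rescaled independently to the [0, 1] range
demo_scaled = MinMaxScaler().fit_transform(demo_num)
print(demo_scaled)
```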

#### Defining the Pipeline Process

In [15]:
preprocessor = ColumnTransformer(
    transformers = [('numerical', num_transform, num_vars),
                    ('categorical', cat_transform, cat_vars)]
)

## Building KNN Model

In [16]:
from sklearn.neighbors import KNeighborsRegressor

In [17]:
knn = KNeighborsRegressor(n_neighbors=5,
                          weights='uniform')

In [18]:
knn_pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                               ('regression', knn)])

In [19]:
knn_pipeline

In [20]:
knn_pipeline.fit(X_train, y_train)

### Predicting on Test Set and Measuring Accuracy

In [21]:
y_pred = knn_pipeline.predict(X_test)

In [22]:
from sklearn.metrics import r2_score

In [23]:
r2_score(y_test, y_pred)

0.8553105669077542

In [24]:
np.round(r2_score(y_test, y_pred), 2)

0.86
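For reference, the R2 score is 1 minus the ratio of the residual sum of squares to the total sum of squares. The sketch below computes it by hand and compares it with `r2_score`. The true values are the prices from the sample rows shown earlier; the predictions are hypothetical numbers chosen only for illustration, not output of the model above.

```python
import numpy as np
from sklearn.metrics import r2_score

# Prices from the sample rows; predictions are hypothetical
y_demo_true = np.array([6.25, 2.5, 9.0, 9.15, 5.25])
y_demo_pred = np.array([6.0, 3.0, 8.5, 9.0, 5.5])

ss_res = np.sum((y_demo_true - y_demo_pred) ** 2)         # residual sum of squares
ss_tot = np.sum((y_demo_true - y_demo_true.mean()) ** 2)  # total sum of squares
r2_manual = 1 - ss_res / ss_tot

# Both values should agree up to floating-point error
print(r2_manual, r2_score(y_demo_true, y_demo_pred))
```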

### Looping Through Several Values for the Number of Neighbors

Increasing the number of neighbors beyond 5 lowers the R2 score on the test set, so 5 neighbors gives the best result among the values tried:
- 5  -> R2 score = 0.86
- 10 -> R2 score = 0.83
- 20 -> R2 score = 0.83
- 30 -> R2 score = 0.82
- 40 -> R2 score = 0.81
- 50 -> R2 score = 0.77
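One possible way to automate this comparison is sketched below. The helper `r2_by_neighbors` is a hypothetical name, not part of the notebook, and the small data frame is synthetic stand-in data (not the Maruti dataset) so that the block runs on its own. Its scores will therefore differ from the ones listed above.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder


def r2_by_neighbors(X_train, X_test, y_train, y_test,
                    num_vars, cat_vars, k_values):
    """Fit one KNN pipeline per k and return {k: rounded test-set R2}."""
    scores = {}
    for k in k_values:
        preprocessor = ColumnTransformer(transformers=[
            ('numerical', MinMaxScaler(), num_vars),
            ('categorical', OneHotEncoder(handle_unknown='ignore'), cat_vars)])
        pipeline = Pipeline(steps=[
            ('preprocessor', preprocessor),
            ('regression', KNeighborsRegressor(n_neighbors=k,
                                               weights='uniform'))])
        pipeline.fit(X_train, y_train)
        r2 = r2_score(y_test, pipeline.predict(X_test))
        scores[k] = float(np.round(r2, 2))
    return scores


# Synthetic stand-in data, generated only to make the sketch runnable
rng = np.random.default_rng(0)
n_rows = 300
demo_df = pd.DataFrame({
    'Fuel_Type': rng.choice(['Petrol', 'Diesel'], size=n_rows),
    'Age': rng.integers(1, 12, size=n_rows),
    'KM_Driven': rng.integers(5, 120, size=n_rows),
})
demo_df['Price'] = (10 - 0.6 * demo_df['Age'] - 0.02 * demo_df['KM_Driven']
                    + 1.5 * (demo_df['Fuel_Type'] == 'Diesel')
                    + rng.normal(0, 0.3, size=n_rows))

Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    demo_df[['Fuel_Type', 'Age', 'KM_Driven']], demo_df['Price'],
    train_size=0.8, random_state=80)

demo_scores = r2_by_neighbors(Xd_train, Xd_test, yd_train, yd_test,
                              num_vars=['Age', 'KM_Driven'],
                              cat_vars=['Fuel_Type'],
                              k_values=[5, 10, 20, 30, 40, 50])
for k, score in demo_scores.items():
    print(k, score)
```

With the notebook's own variables, the same helper could be called as `r2_by_neighbors(X_train, X_test, y_train, y_test, num_vars, cat_vars, [5, 10, 20, 30, 40, 50])`.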


#### Predicting on New Data

In [25]:
data = {'Fuel_Type': 'Diesel',
        'Transmission': 'Manual',
        'Owner_Type': 'First',
        'Age': 8,
        'Model': 'ertiga',
        'KM_Driven': 87}

In [26]:
data_df = pd.DataFrame(data, index=[0])

In [27]:
data_df

|   | Fuel_Type | Transmission | Owner_Type | Age | Model  | KM_Driven |
|---|-----------|--------------|------------|-----|--------|-----------|
| 0 | Diesel    | Manual       | First      | 8   | ertiga | 87        |


In [28]:
knn_pipeline.predict(data_df)

array([6.07])