# K-Nearest Neighbors (KNN) Imputer

#### Why Use KNN Imputer?

Real-world datasets often contain missing values, which can cause problems when analyzing the data or building models with it. The **KNN imputer** is a method for filling in (or "imputing") these missing values with reasonable estimates taken from similar rows. It is most useful when the features are related to one another, so that rows that look alike on the known values are likely to be alike on the missing one too.

#### How Does the KNN Imputer Work?

KNN imputer fills in the missing values by finding the **k-nearest neighbors** (similar data points) based on the other values that aren’t missing. Here’s how it works in simple steps:

1. **Identify missing values**: Find where the missing values are in the dataset.
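2. **Find similar rows**: For each row with missing data, KNN looks for the **k nearest rows** (neighbors), comparing rows on the features they both have. A neighbor must have a value for the feature being filled in, but it does not need to be complete in every other feature (this is how scikit-learn's `KNNImputer` behaves).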
3. **Calculate the missing value**: For numerical data, the missing value is usually filled with the **average** (mean) of the neighbors' values. For categorical data, it could be filled with the **most common category** among the neighbors.

For example, if you have a dataset of people's heights and weights, and someone's height is missing, KNN might find the 3 closest people (k=3) based on their weight, and use the average of their heights to fill in the missing value.
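The height and weight example above can be sketched with scikit-learn's `KNNImputer`. The five people and their measurements are made up for illustration; only the `KNNImputer` call comes from the library.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Hypothetical data: columns are [weight_kg, height_cm].
# The last person's height is missing.
people = np.array([
    [60.0, 165.0],
    [62.0, 170.0],
    [65.0, 172.0],
    [90.0, 190.0],
    [63.0, np.nan],
])

imputer = KNNImputer(n_neighbors=3)  # uniform weights by default
filled = imputer.fit_transform(people)

# The three people closest in weight to 63 kg weigh 62, 65 and 60 kg,
# so the imputed height is the mean of 170, 172 and 165.
imputed_height = filled[4, 1]
print(imputed_height)  # 169.0
```

The 90 kg person is too far away in weight to be among the three neighbors, so their height of 190 cm does not influence the estimate.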

#### Formula Used

To find the nearest neighbors, KNN uses a **distance formula**, such as **Euclidean distance**. The formula for Euclidean distance between two points (rows) is:

\[
\text{Distance} = \sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2 + \dots}
\]

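Where \(x_1, y_1\) are values in one row, and \(x_2, y_2\) are the corresponding values in another row. The rows with the smallest distances are considered the nearest neighbors. Because rows being imputed have missing values by definition, scikit-learn's `KNNImputer` uses a variant called `nan_euclidean` by default: it skips any coordinate that is missing in either row and scales the result up to compensate for the skipped coordinates.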
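To make the distance calculation concrete, here is a minimal NumPy sketch of plain Euclidean distance and of a NaN-aware version. The helper names `euclidean` and `nan_euclidean` are hypothetical; the NaN-aware version follows the definition given for scikit-learn's `nan_euclidean_distances` as I understand it (squared differences over the coordinates present in both rows, scaled by total coordinates over present coordinates), and is not the library's own code.

```python
import numpy as np

def euclidean(a, b):
    """Plain Euclidean distance between two complete rows."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.sqrt(np.sum((a - b) ** 2)))

def nan_euclidean(a, b):
    """Euclidean distance that skips coordinates missing in either row."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    present = ~(np.isnan(a) | np.isnan(b))  # coordinates known in both rows
    if not present.any():
        return float("nan")  # nothing to compare
    squared = np.sum((a[present] - b[present]) ** 2)
    # Scale up so rows with few shared coordinates are not unfairly "close".
    return float(np.sqrt(a.size / present.sum() * squared))

d_full = euclidean([1.0, 2.0], [4.0, 6.0])  # sqrt(9 + 16) = 5.0
d_nan = nan_euclidean([3.0, np.nan, np.nan, 6.0],
                      [1.0, np.nan, 4.0, 5.0])  # sqrt(4/2 * (4 + 1)) = sqrt(10)
```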

#### Advantages of KNN Imputer

1. **Simple and intuitive**: It uses similar data to fill in missing values.
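2. **Works for both numerical and categorical data in principle**: The KNN idea extends to categorical features by taking the most common category among the neighbors. Note that scikit-learn's `KNNImputer` supports numerical features only, so categorical columns must be encoded first or imputed another way.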
3. **Preserves relationships in the data**: Since it uses neighboring data, it keeps the structure of the data intact.

#### Disadvantages of KNN Imputer

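1. **Computationally expensive**: KNN can be slow on large datasets because it has to compute distances between every row with missing values and the other rows in the fitted data.
2. **Sensitive to outliers and feature scale**: Unusual data points, or features measured on a much larger scale than the others, can dominate the distance and distort the imputation result.
3. **Assumes data similarity**: KNN assumes that rows that are close on the observed features also have similar values for the missing feature, which might not always be the case.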
4. **Parameter tuning**: Choosing the right value for **k** (the number of neighbors) can be tricky and might require trial and error.
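One common way to reduce the trial and error in choosing **k** is to treat it as a hyperparameter and let cross-validation pick it. The sketch below assumes a synthetic dataset (the arrays `X_missing` and `y` are generated here, not taken from this notebook) and an arbitrary candidate list for `n_neighbors`.

```python
import numpy as np
from sklearn.impute import KNNImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Synthetic data for illustration only.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Knock out roughly 20% of the first feature.
X_missing = X.copy()
X_missing[rng.random(200) < 0.2, 0] = np.nan

# Putting the imputer inside the pipeline means it is refit on each
# training fold, so validation rows never influence the imputation.
pipe = Pipeline([
    ("impute", KNNImputer()),
    ("model", LogisticRegression()),
])

candidates = [1, 3, 5, 7, 9]  # assumed search range for k
grid = GridSearchCV(pipe, {"impute__n_neighbors": candidates},
                    cv=5, scoring="accuracy")
grid.fit(X_missing, y)

best_k = grid.best_params_["impute__n_neighbors"]
print(best_k, grid.best_score_)
```

The best **k** depends on the data, so the value found on this synthetic example says nothing about the Titanic data used below.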

In [1]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split

from sklearn.impute import KNNImputer, SimpleImputer
from sklearn.linear_model import LogisticRegression

from sklearn.metrics import accuracy_score

In [2]:
df = pd.read_csv('train.csv')[['Age', 'Pclass', 'Fare', 'Survived']]

In [3]:
df.head()

|   | Age | Pclass | Fare | Survived |
|---|---|---|---|---|
| 0 | 22.0 | 3 | 7.25 | 0 |
| 1 | 38.0 | 1 | 71.2833 | 1 |
| 2 | 26.0 | 3 | 7.925 | 1 |
| 3 | 35.0 | 1 | 53.1 | 1 |
| 4 | 35.0 | 3 | 8.05 | 0 |


In [4]:
df.isnull().mean() * 100

Age         19.86532
Pclass       0.00000
Fare         0.00000
Survived     0.00000
dtype: float64

In [5]:
X = df.drop(columns=['Survived'])
y = df['Survived']

In [6]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=2)

In [7]:
X_train

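|   | Age | Pclass | Fare |
|---|---|---|---|
| 30 | 40.0 | 1 | 27.7208 |
| 10 | 4.0 | 3 | 16.7000 |
| 873 | 47.0 | 3 | 9.0000 |
| 182 | 9.0 | 3 | 31.3875 |
| 876 | 20.0 | 3 | 9.8458 |
| ... | ... | ... | ... |
| 534 | 30.0 | 3 | 8.6625 |
| 584 | NaN | 3 | 8.7125 |
| 493 | 71.0 | 1 | 49.5042 |
| 527 | NaN | 1 | 221.7792 |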


In [8]:
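# weights='distance' gives closer neighbors more influence than farther ones
knn = KNNImputer(n_neighbors=3, weights='distance')

# Fit on the training set only, then reuse the fitted imputer on the test set
X_train_trf = knn.fit_transform(X_train)
X_test_trf = knn.transform(X_test)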
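To see what `weights='distance'` changes, the sketch below compares it with the default `weights='uniform'` on a tiny made-up table. The fares and ages are hypothetical and chosen so the arithmetic is easy to follow; with `'distance'`, each neighbor's value is weighted by the inverse of its distance.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Hypothetical data: columns are [fare, age]; the last age is missing.
toy = np.array([
    [7.0, 20.0],
    [8.0, 30.0],
    [10.0, 40.0],
    [50.0, 60.0],
    [7.5, np.nan],
])

uniform_age = KNNImputer(n_neighbors=3, weights="uniform").fit_transform(toy)[4, 1]
distance_age = KNNImputer(n_neighbors=3, weights="distance").fit_transform(toy)[4, 1]

# Neighbors by fare are the rows with fares 7.0, 8.0 and 10.0.
# uniform:  (20 + 30 + 40) / 3 = 30.0
# distance: fare gaps are 0.5, 0.5 and 2.5, so the weights are in the
#           ratio 2 : 2 : 0.4, giving (40 + 60 + 16) / 4.4, about 26.36
print(uniform_age, distance_age)
```

The farthest neighbor (age 40) counts for less under distance weighting, which pulls the estimate toward the two closer rows.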

In [9]:
pd.DataFrame(X_train_trf, columns=X_train.columns)

|   | Age | Pclass | Fare |
|---|---|---|---|
| 0 | 40.000000 | 1.0 | 27.7208 |
| 1 | 4.000000 | 3.0 | 16.7000 |
| 2 | 47.000000 | 3.0 | 9.0000 |
| 3 | 9.000000 | 3.0 | 31.3875 |
| 4 | 20.000000 | 3.0 | 9.8458 |
| ... | ... | ... | ... |
| 707 | 30.000000 | 3.0 | 8.6625 |
| 708 | 26.151292 | 3.0 | 8.7125 |
| 709 | 71.000000 | 1.0 | 49.5042 |
| 710 | 32.666667 | 1.0 | 221.7792 |
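The imputer returns a plain NumPy array, so the rebuilt DataFrame above gets a fresh 0, 1, 2, ... index instead of the original row labels (30, 10, 873, ...). If the original labels matter, for example to line rows up with `y_train`, one option is to pass the index back in. The small frame `X_small` below is a hypothetical stand-in for `X_train`.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Hypothetical stand-in for X_train, with non-sequential row labels.
X_small = pd.DataFrame(
    {"Age": [40.0, 4.0, np.nan, 9.0], "Fare": [27.7, 16.7, 9.0, 31.4]},
    index=[30, 10, 873, 182],
)

imputed = KNNImputer(n_neighbors=2).fit_transform(X_small)

# Reattach both the column names and the original row labels.
X_small_trf = pd.DataFrame(imputed, columns=X_small.columns, index=X_small.index)
print(X_small_trf)
```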


In [10]:
lr = LogisticRegression()

lr.fit(X_train_trf, y_train)

y_pred = lr.predict(X_test_trf)

accuracy_score(y_test, y_pred)

0.7150837988826816

#### Comparing KNNImputer to SimpleImputer (Mean Strategy)

In [11]:
si = SimpleImputer()  # default strategy='mean'

X_train_trf2 = si.fit_transform(X_train)
X_test_trf2 = si.transform(X_test)

In [13]:
lr = LogisticRegression()

lr.fit(X_train_trf2, y_train)

y_pred2 = lr.predict(X_test_trf2)

accuracy_score(y_test, y_pred2)  # on this split, slightly lower than with the KNN imputer

0.6927374301675978
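The two scores above come from a single train/test split, so the gap of roughly two percentage points may partly reflect which rows happened to land in the test set. A steadier comparison is to cross-validate both imputers inside a pipeline. The helper `compare_imputers` below is a hypothetical sketch, shown on synthetic data rather than on `train.csv`; passing the notebook's `X` and `y` instead should work in the same way.

```python
import numpy as np
from sklearn.impute import KNNImputer, SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

def compare_imputers(X, y, cv=5):
    """Return the mean cross-validated accuracy for each imputer."""
    imputers = {
        "knn": KNNImputer(n_neighbors=3, weights="distance"),
        "mean": SimpleImputer(),  # default strategy='mean'
    }
    scores = {}
    for name, imputer in imputers.items():
        # The imputer is refit on each training fold inside the pipeline.
        pipe = make_pipeline(imputer, LogisticRegression(max_iter=1000))
        scores[name] = float(cross_val_score(pipe, X, y, cv=cv).mean())
    return scores

# Synthetic data for illustration only.
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(300, 3))
y_demo = (X_demo[:, 0] - X_demo[:, 2] > 0).astype(int)
X_demo[rng.random(300) < 0.2, 0] = np.nan

scores = compare_imputers(X_demo, y_demo)
print(scores)
```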