## KNN Imputer
The K-Nearest Neighbors (KNN) imputer fills in missing values in a dataset using the values of the most similar rows (the nearest neighbors). It is implemented in scikit-learn as the `KNNImputer` class. Unlike simple strategies such as mean or median imputation, which fill every gap in a column with the same value, KNN imputation uses the similarity between data points to estimate each missing value individually.

### Key Points:

1. **Nearest Neighbor Approach:**
   - KNN imputer estimates missing values based on the values of the k-nearest neighbors of a data point.

2. **Distance Metric:**
   - Similarity between data points is measured with a distance metric. scikit-learn's `KNNImputer` defaults to a Euclidean distance adapted to missing values (`metric='nan_euclidean'`).

3. **Choice of k:**
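   - The `n_neighbors` parameter (the 'k') specifies the number of neighbors to average over. It is an important hyperparameter: a small k follows local structure but is sensitive to noise, while a large k smooths the estimate toward the column mean.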

4. **Weighted Imputation:**
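   - The neighbors can be weighted so that closer ones have more influence on the imputed value. In scikit-learn this is `weights='distance'` (each neighbor is weighted by the inverse of its distance); the default, `weights='uniform'`, gives every neighbor equal weight.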
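
The effect of weighting can be seen on a tiny matrix. The sketch below is only an illustration: the array `X_small` and its values are made up for this purpose, and the last row is missing its second feature.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Hypothetical toy data: the last row is missing its second feature
X_small = np.array([
    [1.0, 10.0],
    [2.0, 20.0],
    [4.0, 40.0],
    [1.2, np.nan],
])

# Equal weight for both neighbors
uniform_value = KNNImputer(n_neighbors=2, weights="uniform").fit_transform(X_small)[3, 1]

# Closer neighbors count more (weight = 1 / distance)
distance_value = KNNImputer(n_neighbors=2, weights="distance").fit_transform(X_small)[3, 1]

print("uniform :", uniform_value)
print("distance:", distance_value)
```

The two nearest rows are the first two. With uniform weights the estimate is their plain average, 15.0. With distance weights the first row, which is four times closer on the shared feature, gets four times the weight, and the estimate moves to 12.0.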

### NaN Euclidean Distance:

In the context of the KNN imputer, "NaN Euclidean distance" refers to how distances between data points are computed when some coordinates are missing. The standard Euclidean distance is undefined if either point has a missing coordinate, so scikit-learn's `KNNImputer` uses a modified version (`metric='nan_euclidean'`) that skips those coordinates and rescales the result.

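For example, if two data points have missing values in different dimensions, only the dimensions where both points have values contribute to the sum of squared differences. That sum is then scaled up by the ratio of the total number of dimensions to the number actually used, so that distances based on few shared dimensions stay comparable to those based on many. For the points $(3, \text{NaN}, \text{NaN}, 6)$ and $(1, \text{NaN}, 4, 5)$, two of the four dimensions are shared, giving $\sqrt{\tfrac{4}{2}\left((3-1)^2 + (6-5)^2\right)} = \sqrt{10} \approx 3.16$.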

#### Formula for NaN Euclidean Distance

$ \text{dist}(x, y) = \sqrt{\text{weight} \cdot \sum_{i=1}^{n} \delta_i \cdot (x_i - y_i)^2} $

where:
- $ \text{dist}(x, y) $ is the distance between points $x$ and $y$,
- $ n $ is the total number of coordinates,
- $ \delta_i $ is an indicator function equal to 1 if both $x_i$ and $y_i$ are present, and 0 otherwise,
- $ x_i $ and $ y_i $ are the values of the $i$-th coordinate of points $x$ and $y$, respectively, and
- $ \text{weight} = \frac{\text{Total no. of coordinates}}{\text{No. of present coordinates}} $.
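
The formula can be checked numerically. The following is a minimal sketch: `nan_euclidean` is a hypothetical helper written here for illustration and the two points are made up, while `nan_euclidean_distances` is scikit-learn's own function from `sklearn.metrics.pairwise`.

```python
import numpy as np
from sklearn.metrics.pairwise import nan_euclidean_distances


def nan_euclidean(x, y):
    """Illustrative re-implementation of the formula above (not scikit-learn's code)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    present = ~np.isnan(x) & ~np.isnan(y)   # delta_i: both coordinates present
    if not present.any():
        return np.nan                       # no shared coordinates, distance undefined
    weight = x.size / present.sum()         # total coordinates / present coordinates
    return np.sqrt(weight * np.sum((x[present] - y[present]) ** 2))


# Hypothetical points: only coordinates 0 and 3 are present in both
x = np.array([3.0, np.nan, 5.0, 1.0])
y = np.array([1.0, 0.0, np.nan, 1.0])

manual = nan_euclidean(x, y)
library = nan_euclidean_distances([x], [y])[0, 0]

print(manual, library)
```

Here the shared coordinates contribute $(3-1)^2 + (1-1)^2 = 4$, the weight is $4/2 = 2$, and both computations should give $\sqrt{8} \approx 2.83$.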

### Example in Python:

```python
import numpy as np
from sklearn.impute import KNNImputer

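# Small example feature matrix with missing values (replace with your own data)
X = np.array([
    [1.0, 2.0, np.nan],
    [3.0, 4.0, 3.0],
    [np.nan, 6.0, 5.0],
    [8.0, 8.0, 7.0],
    [2.0, 3.0, 4.0],
    [7.0, np.nan, 6.0],
])

# Create the imputer; adjust 'n_neighbors' based on your needs
knn_imputer = KNNImputer(n_neighbors=5)

# Fit on X and fill in its missing entries
X_imputed = knn_imputer.fit_transform(X)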
```

This example sets `n_neighbors=5`, which is also the default in scikit-learn. You can adjust the `n_neighbors` parameter based on the characteristics of your dataset. Each missing entry is filled with the mean of that feature over the k nearest neighbors that have a value for it.

### Considerations:

- **Computational Complexity:**
  - KNN imputation can be computationally expensive, especially for large datasets, as it involves calculating distances between data points.

- **Choice of Distance Metric:**
  - The choice of distance metric can affect imputation results. scikit-learn's `KNNImputer` supports `'nan_euclidean'` out of the box; other metrics such as Manhattan or Minkowski distance have to be supplied as a custom callable that handles missing values itself.

- **Handling Categorical Variables:**
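  - KNN imputation is designed for numerical data, and `KNNImputer` fills each gap with an average of neighbor values. Categorical variables must be encoded as numbers first, and an averaged code may not correspond to a valid category.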

- **Optimizing Hyperparameters:**
  - Consider cross-validation to optimize hyperparameters, such as the number of neighbors (`n_neighbors`).
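
One way to tune the imputer with cross-validation is to put it in a `Pipeline` with the downstream model and search over its parameters. The sketch below assumes synthetic data (`X_demo`, `y_demo`, with about 10% of entries removed at random) and an arbitrary parameter grid; neither comes from the dataset used later in this notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.impute import KNNImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Synthetic classification data with roughly 10% of entries set to NaN (assumption)
rng = np.random.default_rng(0)
X_demo, y_demo = make_classification(n_samples=300, n_features=5, random_state=0)
X_demo[rng.random(X_demo.shape) < 0.1] = np.nan

# Imputer and model in one pipeline, so the imputer is refit on each training fold
pipe = Pipeline([
    ("imputer", KNNImputer()),
    ("model", LogisticRegression(max_iter=1000)),
])

# Illustrative grid; the values are arbitrary starting points
param_grid = {
    "imputer__n_neighbors": [1, 3, 5, 10],
    "imputer__weights": ["uniform", "distance"],
}

search = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy")
search.fit(X_demo, y_demo)

print(search.best_params_)
print(search.best_score_)
```

Keeping the imputer inside the pipeline means each fold's validation rows are imputed only from that fold's training rows, which avoids leaking information into the score.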

KNN imputation can be a powerful method when rows that are close to each other on the observed features also tend to have similar values for the missing feature, so that neighboring data points are informative about the gap. However, its performance may vary based on the characteristics of the data and the choice of hyperparameters.

```python
import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.impute import KNNImputer, SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
```

```python
# Load three features and the target column
df = pd.read_csv('train.csv')[['Age', 'Pclass', 'Fare', 'Survived']]
```

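```python
df.head()
```

|   | Age | Pclass | Fare | Survived |
|---|-----|--------|------|----------|
| 0 | 22.0 | 3 | 7.2500 | 0 |
| 1 | 38.0 | 1 | 71.2833 | 1 |
| 2 | 26.0 | 3 | 7.9250 | 1 |
| 3 | 35.0 | 1 | 53.1000 | 1 |
| 4 | 35.0 | 3 | 8.0500 | 0 |

```python
# Percentage of missing values per column
df.isnull().mean() * 100
```

```text
Age         19.86532
Pclass       0.00000
Fare         0.00000
Survived     0.00000
dtype: float64
```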

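```python
# Separate features and target, then hold out 20% for testing
X = df.drop(columns=['Survived'])
y = df['Survived']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=2)
```

```python
X_train.head()
```

|     | Age | Pclass | Fare |
|-----|-----|--------|------|
| 30  | 40.0 | 1 | 27.7208 |
| 10  | 4.0 | 3 | 16.7000 |
| 873 | 47.0 | 3 | 9.0000 |
| 182 | 9.0 | 3 | 31.3875 |
| 876 | 20.0 | 3 | 9.8458 |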


```python
# Fit the imputer on the training set only, then apply it to both sets
knn = KNNImputer(n_neighbors=3, weights='distance')

X_train_trf = knn.fit_transform(X_train)
X_test_trf = knn.transform(X_test)
```

```python
lr = LogisticRegression()

lr.fit(X_train_trf, y_train)

y_pred = lr.predict(X_test_trf)

accuracy_score(y_test, y_pred)
```

```text
0.7150837988826816
```

```python
# Comparison with SimpleImputer (default strategy: mean)
si = SimpleImputer()

X_train_trf2 = si.fit_transform(X_train)
X_test_trf2 = si.transform(X_test)
```

```python
lr = LogisticRegression()

lr.fit(X_train_trf2, y_train)

y_pred2 = lr.predict(X_test_trf2)

accuracy_score(y_test, y_pred2)
```

```text
0.6927374301675978
```