--------------------------
#### KNNImputer:
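- A machine learning-based imputation technique; the name stands for K-Nearest Neighbors Imputer.
- Fills each missing value using the values of that feature in the k nearest neighbors of the sample (by default, the unweighted mean of those neighbors).
- Distances between samples are computed with a NaN-aware Euclidean metric by default, so rows that themselves contain missing values can still be compared.
- Particularly useful when features are correlated, because similar rows then carry information about the missing entry that a global mean or median would ignore.
- Less effective when missingness is unrelated to any structure in the feature space, since the neighbors then add little over a simple statistic.
- Available in scikit-learn's `sklearn.impute` module.
- The number of neighbors is set with `n_neighbors` (default 5).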
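
To make the points above concrete, here is a minimal sketch on a tiny hand-made array (the array and the names `X_toy`, `toy_imputer`, and `X_toy_imputed` are illustrative assumptions, not part of the diabetes example). With `n_neighbors=2`, each gap is filled with the mean of that column over the two closest rows that have a value there.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Toy data (illustrative only): two entries are missing
X_toy = np.array([
    [1.0, 2.0, np.nan],
    [3.0, 4.0, 3.0],
    [np.nan, 6.0, 5.0],
    [8.0, 8.0, 7.0],
])

# Use the 2 nearest rows (NaN-aware Euclidean distance) for each gap
toy_imputer = KNNImputer(n_neighbors=2)
X_toy_imputed = toy_imputer.fit_transform(X_toy)
print(X_toy_imputed)
# [[1.  2.  4. ]
#  [3.  4.  3. ]
#  [5.5 6.  5. ]
#  [8.  8.  7. ]]
```

The missing value in the first row becomes 4.0, the mean of 3.0 and 5.0 from its two nearest rows; the one in the third row becomes 5.5, the mean of 3.0 and 8.0.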

In [57]:
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

In [58]:
# Load PIMA Indians Diabetes Database from OpenML
diabetes_data = fetch_openml(name='diabetes', version=1)

In [59]:
# Combine the features and the target into a single DataFrame
df = pd.DataFrame(data=np.c_[diabetes_data['data'], diabetes_data['target']],
                  columns=diabetes_data['feature_names'] + ['target'])

In [60]:
df

     preg   plas  pres  skin   insu  mass   pedi   age           target
0     6.0  148.0  72.0  35.0    0.0  33.6  0.627  50.0  tested_positive
1     1.0   85.0  66.0  29.0    0.0  26.6  0.351  31.0  tested_negative
2     8.0  183.0  64.0   0.0    0.0  23.3  0.672  32.0  tested_positive
3     1.0   89.0  66.0  23.0   94.0  28.1  0.167  21.0  tested_negative
4     0.0  137.0  40.0  35.0  168.0  43.1  2.288  33.0  tested_positive
...   ...    ...   ...   ...    ...   ...    ...   ...              ...
763  10.0  101.0  76.0  48.0  180.0  32.9  0.171  63.0  tested_negative
764   2.0  122.0  70.0  27.0    0.0  36.8   0.34  27.0  tested_negative
765   5.0  121.0  72.0  23.0  112.0  26.2  0.245  30.0  tested_negative
766   1.0  126.0  60.0   0.0    0.0  30.1  0.349  47.0  tested_positive


In [63]:
df.isnull().sum()

preg      0
plas      0
pres      0
skin      0
insu      0
mass      0
pedi      0
age       0
target    0
dtype: int64

In [62]:
# Split the dataset into features (X) and target (y)
X = df.drop('target', axis=1)
y = df['target']

In [65]:
# Introduce missing values artificially: each cell is masked with probability 0.1
np.random.seed(42)
missing_mask = np.random.choice([True, False], size=X.shape, p=[0.1, 0.9])
missing_mask

array([[False, False, False, ..., False,  True, False],
       [False, False,  True, ..., False, False, False],
       [False, False, False, ..., False, False, False],
       ...,
       [ True, False,  True, ..., False, False, False],
       [False, False, False, ...,  True,  True, False],
       [False, False, False, ..., False, False, False]])

In [66]:
X_with_missing = X.mask(missing_mask)

In [67]:
X_with_missing

     preg   plas  pres  skin   insu  mass   pedi   age
0     6.0  148.0  72.0  35.0    0.0  33.6    NaN  50.0
1     1.0   85.0   NaN  29.0    0.0  26.6  0.351  31.0
2     8.0  183.0  64.0   0.0    0.0  23.3  0.672  32.0
3     1.0   89.0  66.0  23.0   94.0   NaN  0.167  21.0
4     NaN  137.0  40.0  35.0  168.0   NaN  2.288  33.0
...   ...    ...   ...   ...    ...   ...    ...   ...
763  10.0  101.0  76.0   NaN  180.0  32.9  0.171  63.0
764   2.0  122.0  70.0  27.0    0.0  36.8   0.34  27.0
765   NaN  121.0   NaN  23.0  112.0  26.2  0.245  30.0
766   1.0  126.0  60.0   0.0    0.0   NaN    NaN  47.0


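Example of `mask` and `where`:

`mask` replaces values where the condition is True (with NaN unless a replacement value is given). `where` is the inverse: it keeps values where the condition is True and replaces the rest.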

In [78]:
s = pd.Series(range(5))
s.where(s > 0)

0    NaN
1    1.0
2    2.0
3    3.0
4    4.0
dtype: float64

In [79]:
s.mask(s > 1, 10)

0     0
1     1
2    10
3    10
4    10
dtype: int64

In [80]:
s.mask(s > 1, -10)

0     0
1     1
2   -10
3   -10
4   -10
dtype: int64

In [81]:
s.mask(s > 1)

0    0.0
1    1.0
2    NaN
3    NaN
4    NaN
dtype: float64
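
As a small sketch of how the two methods relate, `mask(cond)` gives the same result as `where(~cond)`. The example also shows a DataFrame being masked with a boolean NumPy array of the same shape, which is how the missing values were introduced earlier. The names `s_demo`, `df_demo`, and `bool_mask` are illustrative assumptions.

```python
import numpy as np
import pandas as pd

s_demo = pd.Series(range(5))
cond = s_demo > 1

# mask() blanks out entries where cond is True;
# where() with the negated condition does the same thing
masked = s_demo.mask(cond)
kept = s_demo.where(~cond)
print(masked.equals(kept))  # True

# A DataFrame can be masked with a boolean array of matching shape
df_demo = pd.DataFrame({"a": [1.0, 2.0], "b": [3.0, 4.0]})
bool_mask = np.array([[True, False],
                      [False, True]])
df_masked = df_demo.mask(bool_mask)
print(df_masked)
```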

In [73]:
# Split the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X_with_missing, y, test_size=0.2, random_state=42)

In [74]:
# Perform KNN imputation on the training set
imputer = KNNImputer(n_neighbors=5)
X_train_imputed = imputer.fit_transform(X_train)

In [75]:
# Train a RandomForestClassifier on the imputed data
clf = RandomForestClassifier(random_state=42)
clf.fit(X_train_imputed, y_train)

RandomForestClassifier(random_state=42)

In [76]:
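# Impute missing values in the test set, reusing the imputer fitted on the
# training set (transform only, no refit, to avoid leaking test information)
X_test_imputed = imputer.transform(X_test)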

In [77]:
# Make predictions on the imputed test set
y_pred = clf.predict(X_test_imputed)

# Evaluate the accuracy of the model
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy on the imputed test set: {accuracy:.2%}")

Accuracy on the imputed test set: 77.27%
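
The fit-on-train, transform-on-test steps above can also be bundled into a scikit-learn pipeline, so that the imputer is never fitted on test data by accident. The following is a minimal sketch on synthetic data from `make_classification` (an assumption made so the cell runs without downloading anything; it is not the diabetes dataset, and its accuracy is not comparable to the figure above). The names `X_demo`, `pipe`, and `demo_accuracy` are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import KNNImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Synthetic stand-in data (assumption: not the diabetes dataset)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=42)

# Blank out roughly 10% of the cells at random
rng = np.random.default_rng(42)
X_demo[rng.random(X_demo.shape) < 0.1] = np.nan

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# The pipeline fits the imputer on the training data only and
# applies the fitted imputer to whatever is passed to score/predict
pipe = make_pipeline(
    KNNImputer(n_neighbors=5),
    RandomForestClassifier(random_state=42),
)
pipe.fit(X_tr, y_tr)

demo_accuracy = pipe.score(X_te, y_te)
print(f"Pipeline accuracy on synthetic data: {demo_accuracy:.2%}")
```

Wrapping both steps in one object also makes it straightforward to tune `n_neighbors` with cross-validation, since each fold refits the imputer on its own training portion.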
