## KNN imputation



The missing values are estimated as the average value from the closest K neighbours.

[KNNImputer from sklearn](https://scikit-learn.org/stable/modules/generated/sklearn.impute.KNNImputer.html#sklearn.impute.KNNImputer)

- Same K will be used to impute all variables
- Can't really optimise K to better predict the missing values
- Could optimise K to better predict the target
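
To make the averaging concrete, here is a minimal sketch on a toy array (the values are made up for illustration, not taken from the house price data). With `n_neighbors=2` and uniform weights, each missing entry becomes the plain mean of that feature over the two nearest rows that have a value for it.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Toy data: 4 rows, 3 features, two missing entries.
X = np.array([
    [1.0, 2.0, np.nan],
    [3.0, 4.0, 3.0],
    [np.nan, 6.0, 5.0],
    [8.0, 8.0, 7.0],
])

imputer = KNNImputer(n_neighbors=2, weights="uniform")
X_imputed = imputer.fit_transform(X)

# Row 0, last column: mean of 3.0 and 5.0 (rows 1 and 2) -> 4.0
# Row 2, first column: mean of 3.0 and 8.0 (rows 1 and 3) -> 5.5
print(X_imputed)
```

Note that the same K is applied to both columns, which is the first limitation listed above.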

**Note**

If our goal is to predict the missing values as accurately as possible, we would not use the KNN imputer. Instead, we would build an individual KNN model for each variable with missing data, predicting that variable from the remaining ones. This is a standard regression problem.
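
As a rough sketch of that alternative, the block below treats one variable as a regression target and tunes K for it by cross-validation. Everything here is an assumption for illustration: the data are synthetic, the names `predictors`, `variable` and `is_missing` are hypothetical, and the predictors are assumed to be complete, because `KNeighborsRegressor` does not accept NaN in its inputs.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic data: the variable to impute depends on two complete predictors.
predictors = rng.normal(size=(300, 2))
variable = 2.0 * predictors[:, 0] + predictors[:, 1] + rng.normal(scale=0.1, size=300)

# Pretend roughly 20% of the variable's values are missing.
is_missing = rng.random(300) < 0.2

knn = Pipeline([
    ("scaler", StandardScaler()),
    ("knn", KNeighborsRegressor()),
])
search = GridSearchCV(knn, {"knn__n_neighbors": [3, 5, 10, 20]}, cv=5, scoring="r2")

# Fit only on rows where the variable is observed; K is chosen to
# predict this variable, not the final target.
search.fit(predictors[~is_missing], variable[~is_missing])

# Predicted replacements for the rows where the variable is missing.
predicted = search.predict(predictors[is_missing])
print(search.best_params_, predicted.shape)
```

One such model would be needed per variable with missing data, each with its own K.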

In [1]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt

# to split the datasets
from sklearn.model_selection import train_test_split

# multivariate imputation
from sklearn.impute import KNNImputer

## Load data

In [2]:
# list with numerical variables

cols_to_use = [
    'MSSubClass', 'LotFrontage', 'LotArea', 'OverallQual',
    'OverallCond', 'YearBuilt', 'YearRemodAdd', 'MasVnrArea',
    'BsmtFinSF1', 'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF',
    '1stFlrSF', '2ndFlrSF', 'LowQualFinSF', 'GrLivArea',
    'BsmtFullBath', 'BsmtHalfBath', 'FullBath', 'HalfBath',
    'BedroomAbvGr', 'KitchenAbvGr', 'TotRmsAbvGrd',
    'Fireplaces', 'GarageYrBlt', 'GarageCars', 'GarageArea',
    'WoodDeckSF',  'OpenPorchSF', 'EnclosedPorch', '3SsnPorch',
    'ScreenPorch', 'PoolArea', 'MiscVal', 'MoSold', 'YrSold',
    'SalePrice'
]

In [3]:
# let's load the dataset with the selected variables

data = pd.read_csv('../Datasets/houseprice.csv', usecols=cols_to_use)

# find variables with missing data
for var in data.columns:
    if data[var].isnull().sum() > 0:
        print(var, data[var].isnull().sum())

# LotFrontage, MasVnrArea and GarageYrBlt have missing data

LotFrontage 259
MasVnrArea 8
GarageYrBlt 81
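
Raw counts are easier to judge as a fraction of the rows (here, 259 of the 1460 rows is roughly 18% for LotFrontage). A minimal sketch of that calculation on a hypothetical toy frame, with made-up values standing in for the house price data:

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the house price data (hypothetical values).
toy = pd.DataFrame({
    "LotFrontage": [65.0, np.nan, 80.0, np.nan],
    "MasVnrArea": [196.0, 0.0, np.nan, 350.0],
    "LotArea": [8450.0, 9600.0, 11250.0, 9550.0],
})

# isnull().mean() gives the fraction of missing values per column.
missing_fraction = toy.isnull().mean()

# Keep only the columns that actually have missing data.
print(missing_fraction[missing_fraction > 0])
```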


In [4]:
# let's separate into training and testing set

# first drop the target from the feature list
cols_to_use.remove('SalePrice')

X_train, X_test, y_train, y_test = train_test_split(
    data[cols_to_use],
    data['SalePrice'],
    test_size=0.3,
    random_state=0)

X_train.shape, X_test.shape

((1022, 36), (438, 36))

In [5]:
# reset index, so we can compare values later on
# in the demo

X_train.reset_index(inplace=True, drop=True)
X_test.reset_index(inplace=True, drop=True)

## KNN imputation

In [6]:
imputer = KNNImputer(
    n_neighbors=5, # the number of neighbours K
    weights='distance', # the weighting factor
    metric='nan_euclidean', # the metric to find the neighbours
    add_indicator=False, # whether to add a missing indicator
)

"""

1. n_neighbors=5

This is the K in KNN: how many neighbors to consider when filling a missing value.

For example, if a row has a missing age, the imputer finds the 5 most similar rows (neighbors) that have a value for age and combines those values to fill the gap.


2. weights='distance'

This tells the imputer how to combine the neighbors’ values:

'uniform': all neighbors contribute equally.

'distance': closer neighbors get more weight (inversely proportional to distance).


3. metric='nan_euclidean'

'nan_euclidean' is a distance metric that can handle missing values.

It calculates Euclidean distance only on the features that are present (not NaN) in both rows.

Then it scales the distance up to account for the features that were skipped.


4. add_indicator=False

If True, the imputer would add extra binary columns indicating where the original data had missing values (0 = no missing, 1 = missing).

Here it’s False, so no indicator columns are added and the missing values are simply filled.

"""

In [7]:
imputer.fit(X_train,y_train)

| Parameter | Value |
|---|---|
| missing_values | nan |
| n_neighbors | 5 |
| weights | 'distance' |
| metric | 'nan_euclidean' |
| copy | True |
| add_indicator | False |
| keep_empty_features | False |
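
To illustrate what the `metric` and `weights` settings do, here is a small sketch on a made-up toy array (not the house price data). It reproduces one `nan_euclidean` distance by hand and then the distance-weighted average the imputer uses. The 1 / distance weighting is my reading of how scikit-learn implements `weights='distance'`, so treat the manual calculation as a check and not as a specification.

```python
import numpy as np
from sklearn.impute import KNNImputer
from sklearn.metrics.pairwise import nan_euclidean_distances

X = np.array([
    [1.0, 2.0, np.nan],
    [3.0, 4.0, 3.0],
    [np.nan, 6.0, 5.0],
    [8.0, 8.0, 7.0],
])

# Distance from row 0 to rows 1 and 2, the two nearest rows that have a
# value in the last column. Only coordinates present in both rows count,
# and the squared sum is scaled by (total features / features used).
d01 = np.sqrt(3 / 2 * ((1 - 3) ** 2 + (2 - 4) ** 2))  # two shared features
d02 = np.sqrt(3 / 1 * (2 - 6) ** 2)                   # one shared feature
dist = nan_euclidean_distances(X[[0]], X[[1, 2]])[0]

# weights='distance': each neighbour's value is weighted by 1 / distance.
donor_values = np.array([3.0, 5.0])
manual = np.sum(donor_values / dist) / np.sum(1 / dist)

imputed = KNNImputer(n_neighbors=2, weights="distance").fit_transform(X)
print(dist, manual, imputed[0, 2])
```

Row 1 is twice as close as row 2, so its value counts twice as much and the imputed value moves from the uniform mean of 4.0 towards 3.0.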


In [8]:
train_t = imputer.transform(X_train) # Applies the imputer to fill missing values in X_train
test_t = imputer.transform(X_test) # Applies the imputer to fill missing values in X_test


In [9]:
# sklearn returns a NumPy array
# let's make a dataframe
train_t = pd.DataFrame(train_t, columns=X_train.columns)
test_t = pd.DataFrame(test_t, columns=X_test.columns)

train_t.head()

| | MSSubClass | LotFrontage | LotArea | OverallQual | OverallCond | YearBuilt | YearRemodAdd | MasVnrArea | BsmtFinSF1 | BsmtFinSF2 | ... | GarageArea | WoodDeckSF | OpenPorchSF | EnclosedPorch | 3SsnPorch | ScreenPorch | PoolArea | MiscVal | MoSold | YrSold |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 60.0 | 70.115142 | 9375.0 | 7.0 | 5.0 | 1997.0 | 1998.0 | 573.0 | 739.0 | 0.0 | ... | 645.0 | 576.0 | 36.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2.0 | 2009.0 |
| 1 | 120.0 | 42.533053 | 2887.0 | 6.0 | 5.0 | 1996.0 | 1997.0 | 0.0 | 1003.0 | 0.0 | ... | 431.0 | 307.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 11.0 | 2008.0 |
| 2 | 20.0 | 50.0 | 7207.0 | 5.0 | 7.0 | 1958.0 | 2008.0 | 0.0 | 696.0 | 0.0 | ... | 0.0 | 117.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2.0 | 2010.0 |
| 3 | 50.0 | 60.0 | 9060.0 | 6.0 | 5.0 | 1939.0 | 1950.0 | 0.0 | 204.0 | 0.0 | ... | 280.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 10.0 | 2009.0 |
| 4 | 30.0 | 60.0 | 8400.0 | 2.0 | 5.0 | 1920.0 | 1950.0 | 0.0 | 290.0 | 0.0 | ... | 246.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 2009.0 |


In [10]:
# variables without NA after the imputation

train_t[['LotFrontage', 'MasVnrArea', 'GarageYrBlt']].isnull().sum() # no missing values now

LotFrontage    0
MasVnrArea     0
GarageYrBlt    0
dtype: int64

In [11]:
# the observations with NA in the original train set

X_train[X_train['MasVnrArea'].isnull()]['MasVnrArea']

420   NaN
490   NaN
642   NaN
824   NaN
921   NaN
Name: MasVnrArea, dtype: float64

In [12]:
# the replacement values in the transformed dataset

train_t[X_train['MasVnrArea'].isnull()]['MasVnrArea']

# these are the imputed values; some are very different from the variable's mean of 103.55

420     99.765717
490     34.106592
642      0.000000
824    375.749332
921     85.817715
Name: MasVnrArea, dtype: float64

In [13]:
# the mean value of the variable (i.e., for mean imputation)

X_train['MasVnrArea'].mean()


np.float64(103.55358898721731)

In some cases, the imputed values are very different from the mean (103.55) that mean imputation would have used.

## Automatically find best imputation parameters using grid search

We can optimise the parameters of the KNN imputation to better predict our outcome.

In [14]:
# import extra classes for modelling
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Lasso  # Lasso regression
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV

In [15]:
# separate into train and test set

X_train, X_test, y_train, y_test = train_test_split(
    data[cols_to_use],  # just the features
    data['SalePrice'],  # the target
    test_size=0.3,  # the percentage of obs in the test set
    random_state=0)  # for reproducibility

X_train.shape, X_test.shape

((1022, 36), (438, 36))

In [16]:
#the pipeline

pipe = Pipeline(steps=[
    ('imputer', KNNImputer(
        n_neighbors=5, # the number of neighbours K
        weights='distance', # the weighting factor
        add_indicator=False)), # whether to add a missing indicator
    
    ('scaler', StandardScaler()),
    ('regressor', Lasso(max_iter=2000)),
])

In [17]:
# now we create the grid with all the parameters that we would like to test

param_grid = {
    'imputer__n_neighbors': [3, 5, 10],  # number of neighbours to try
    'imputer__weights': ['uniform', 'distance'],  # the 2 weighting methods
    'imputer__add_indicator': [True, False],  # with and without missing indicators
    'regressor__alpha': [10, 100, 200],  # regularization strengths for the Lasso
}

grid_search = GridSearchCV(pipe, param_grid, cv=5, n_jobs=-1, scoring='r2')

# cv=5 is the number of cross-validation folds
# n_jobs=-1 indicates to use all available CPUs
# scoring='r2' indicates to evaluate using the R squared

# for more details in the grid parameters visit:
#https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html

In [18]:
# and now we train over all the possible combinations 
# of the parameters above
grid_search.fit(X_train, y_train)

# and we print the best score over the train set
print(("best linear regression from grid search: %.3f"
       % grid_search.score(X_train, y_train)))

best linear regression from grid search: 0.845


In [19]:
# let's check the performance over the test set
print(("best linear regression from grid search: %.3f"
       % grid_search.score(X_test, y_test)))

best linear regression from grid search: 0.730


In [20]:
# and find the best parameters

grid_search.best_params_ # the best parameters found by the grid search

{'imputer__add_indicator': True,
 'imputer__n_neighbors': 10,
 'imputer__weights': 'distance',
 'regressor__alpha': 200}
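
`best_params_` only reports the winning combination. To see how close the other combinations were, the `cv_results_` attribute can be loaded into a dataframe. The sketch below is self-contained and uses synthetic data with a smaller grid, so the names `X_demo`, `y_demo`, `demo_pipe` and `demo_search` are hypothetical and the scores say nothing about the house price results.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in for the house price data (made-up values).
X_demo = rng.normal(size=(200, 4))
y_demo = X_demo @ np.array([3.0, -2.0, 1.0, 0.5]) + rng.normal(scale=0.5, size=200)
X_demo[rng.random(200) < 0.2, 0] = np.nan  # knock out ~20% of the first column

demo_pipe = Pipeline([
    ("imputer", KNNImputer()),
    ("scaler", StandardScaler()),
    ("regressor", Lasso(max_iter=2000)),
])
demo_grid = {
    "imputer__n_neighbors": [3, 5],
    "imputer__weights": ["uniform", "distance"],
    "regressor__alpha": [0.01, 0.1],
}
demo_search = GridSearchCV(demo_pipe, demo_grid, cv=5, scoring="r2")
demo_search.fit(X_demo, y_demo)

# One row per parameter combination, best first.
results = pd.DataFrame(demo_search.cv_results_)
cols = [
    "param_imputer__n_neighbors",
    "param_imputer__weights",
    "param_regressor__alpha",
    "mean_test_score",
    "std_test_score",
    "rank_test_score",
]
print(results.sort_values("rank_test_score")[cols].head())
```

Comparing `mean_test_score` against `std_test_score` shows whether the gap between imputation settings is larger than the fold-to-fold variation.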

## Compare with univariate imputation (just use the same variable to impute itself)

In [21]:
from sklearn.impute import SimpleImputer

In [22]:
# separate into train and test set

X_train, X_test, y_train, y_test = train_test_split(
    data[cols_to_use],  # just the features
    data['SalePrice'],  # the target
    test_size=0.3,  # the percentage of obs in the test set
    random_state=0)  # for reproducibility

X_train.shape, X_test.shape

((1022, 36), (438, 36))

In [23]:
pipe = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean', fill_value=-1)),  # fill_value is only used when strategy='constant'
    ('scaler', StandardScaler()),
    ('regressor', Lasso(max_iter=2000)),
])

param_grid = {
    'imputer__strategy': ['mean', 'median', 'constant'],
    'imputer__add_indicator': [True, False],
    'regressor__alpha': [10, 100, 200],
}

grid_search = GridSearchCV(pipe, param_grid, cv=5, n_jobs=-1, scoring='r2')

# and now we train over all the possible combinations of the parameters above
grid_search.fit(X_train, y_train)

# and we print the best score over the train set
print(("best linear regression from grid search: %.3f"
       % grid_search.score(X_train, y_train)))

best linear regression from grid search: 0.845


In [24]:
# and finally let's check the performance over the test set
print(("best linear regression from grid search: %.3f"
       % grid_search.score(X_test, y_test)))

best linear regression from grid search: 0.729


In [25]:
# and find the best fit parameters like this
grid_search.best_params_

{'imputer__add_indicator': False,
 'imputer__strategy': 'constant',
 'regressor__alpha': 200}

We see that imputing the missing values with an arbitrary constant of -1 gives approximately the same test performance as KNN imputation (R squared of 0.729 versus 0.730). In this case, we might not want the additional complexity of training models to impute NA before predicting the target we are actually interested in.