#### KNN (K-Nearest Neighbours) Imputation | Multivariate 

The KNN imputer estimates a missing value as the weighted average of the values observed in its K nearest neighbours.

**Intuition**

If an observation looks very similar to other observations in the data set, its missing value is most likely similar to the values shown by those similar observations.

**Process:**

$$\begin{bmatrix} X_1 & Y_1 & Z_1 \\ X_2 & NA & Z_2 \\ X_3 & Y_3 & Z_3 \end{bmatrix}$$

For each variable with missing values, e.g. $Y$:

- **Step 1**: Train KNN using the other variables, $X$ and $Z$.
- **Step 2**: Find the K closest neighbours of the observation with the missing value.
- **Step 3**: Compute the weighted average of $Y$ over the K neighbours.

$$
\text{NA replacement} = \frac{W_1 \times Y_1 + W_2 \times Y_3}{W_1 + W_2}
$$

With uniform weights ($W_i = 1$) the denominator equals $K$ (here $K = 2$).

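**Finding the weights:**

*Uniform:* all neighbours count equally $(W_i = 1)$.

*Distance:* closer neighbours count more.

$$
W_i = \frac{1}{d_i}
$$

where $d_i$ is the Euclidean distance between neighbour $i$ and the observation with the missing value.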
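The three steps and the two weighting schemes can be checked by hand on a toy version of the matrix above. This is a minimal sketch, not the library's internals: the numbers are made up, and it assumes the weighted average is normalised by the sum of the weights, which is then compared against `KNNImputer`.

```python
import numpy as np
from sklearn.impute import KNNImputer

# toy data (made-up values): columns are X, Y, Z; the second row is missing Y
toy = np.array([
    [1.0, 10.0, 1.0],
    [2.0, np.nan, 2.0],
    [4.0, 30.0, 5.0],
])

target = toy[1, [0, 2]]   # X and Z of the incomplete row
donors = toy[[0, 2]]      # rows where Y is observed

# Step 2: Euclidean distance to each donor, using only X and Z
dist = np.sqrt(((donors[:, [0, 2]] - target) ** 2).sum(axis=1))

# Step 3a: uniform weights, i.e. a plain average of the donors' Y
uniform = donors[:, 1].mean()

# Step 3b: inverse-distance weights, normalised by their sum
w = 1.0 / dist
manual = (w * donors[:, 1]).sum() / w.sum()

# the same imputation done by scikit-learn
sk_value = KNNImputer(n_neighbors=2, weights="distance").fit_transform(toy)[1, 1]

print(f"uniform: {uniform}, distance-weighted: {manual}, KNNImputer: {sk_value}")
```

The distance-weighted value sits closer to the Y of the nearer donor (the first row), whereas the uniform average treats both donors equally.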


**Key Challenges:**
- Finding the best value of $K$



**Notes:**

- The author claims that KNN imputation is precise with up to 20% of missing data.
- The author also claims that it is insensitive to the value of K, for K between 10 and 20.

*These claims were made for gene sequencing data and may not extend to other problems.*



- The same K is used to impute all variables.
- We can't really optimise K to better predict the missing values themselves.
- We could optimise K to better predict the outcome (the target of the downstream model).

If the goal is to predict the missing values as accurately as possible, we would not use the KNN imputer. Instead, we would build an individual KNN model for each variable with missing data, predicting that variable from the remaining ones. This is an ordinary regression problem.
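As a rough illustration of that alternative, the sketch below tunes K for a single variable with `KNeighborsRegressor` and cross-validation. The data frame `toy`, its columns and the grid of K values are invented for the example; they are not part of the house price data.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# hypothetical toy data: y depends on x and z, and 20 values of y are missing
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({"x": rng.normal(size=n), "z": rng.normal(size=n)})
toy["y"] = 2 * toy["x"] - toy["z"] + rng.normal(scale=0.1, size=n)
toy.loc[rng.choice(n, size=20, replace=False), "y"] = np.nan

observed = toy["y"].notna()

# a dedicated KNN regressor for y, with K tuned by cross-validation
knn = Pipeline([
    ("scaler", StandardScaler()),
    ("knn", KNeighborsRegressor()),
])
search = GridSearchCV(
    knn,
    {"knn__n_neighbors": [3, 5, 10, 20]},
    cv=5,
    scoring="neg_mean_squared_error",
)
search.fit(toy.loc[observed, ["x", "z"]], toy.loc[observed, "y"])

# fill the missing y values with the tuned model's predictions
toy_filled = toy.copy()
toy_filled.loc[~observed, "y"] = search.predict(toy.loc[~observed, ["x", "z"]])

print(search.best_params_)
```

Here K is chosen to minimise the error on the variable being imputed, which is exactly what the single shared K of the KNN imputer cannot do.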

Dataset: House price



In [19]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt

# to split the datasets
from sklearn.model_selection import train_test_split

# multivariate imputation
from sklearn.impute import KNNImputer

# import extra classes for modelling
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Lasso
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV

In [2]:
# list with numerical variables

use_columns = [
    'MSSubClass', 'LotFrontage', 'LotArea', 'OverallQual',
    'OverallCond', 'YearBuilt', 'YearRemodAdd', 'MasVnrArea',
    'BsmtFinSF1', 'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF',
    '1stFlrSF', '2ndFlrSF', 'LowQualFinSF', 'GrLivArea',
    'BsmtFullBath', 'BsmtHalfBath', 'FullBath', 'HalfBath',
    'BedroomAbvGr', 'KitchenAbvGr', 'TotRmsAbvGrd',
    'Fireplaces', 'GarageYrBlt', 'GarageCars', 'GarageArea',
    'WoodDeckSF',  'OpenPorchSF', 'EnclosedPorch', '3SsnPorch',
    'ScreenPorch', 'PoolArea', 'MiscVal', 'MoSold', 'YrSold',
    'SalePrice'
]

In [5]:
# load the dataset
data = pd.read_csv('../datasets/houseprice.csv', usecols=use_columns)

In [6]:
# let's count the missing values
# (> 0, so that variables with a single missing value are also reported)
for var in data.columns:
    if data[var].isnull().sum() > 0:
        print(var, data[var].isnull().sum())

LotFrontage 259
MasVnrArea 8
GarageYrBlt 81
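The same count can be obtained without a loop, together with the fraction of missing values per variable. The sketch below uses a small made-up data frame (`toy`) rather than the house price data, so the numbers are only illustrative.

```python
import numpy as np
import pandas as pd

# hypothetical toy frame standing in for `data`
toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, np.nan],
    "b": [1.0, 2.0, 3.0, 4.0],
    "c": [np.nan, 2.0, 3.0, 4.0],
})

# number of missing values per column, keeping only columns with at least one
na_counts = toy.isnull().sum()
na_counts = na_counts[na_counts > 0]

# fraction of missing values for those same columns
na_fraction = toy.isnull().mean()[na_counts.index]

print(pd.DataFrame({"n_missing": na_counts, "fraction": na_fraction}))
```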


In [7]:
# let's separate into training and testing set

# first drop the target from the feature list
use_columns.remove('SalePrice')

X_train, X_test, y_train, y_test = train_test_split(
    data[use_columns],
    data['SalePrice'],
    test_size=0.3,
    random_state=0
)

print(f"Train: {X_train.shape}, Test: {X_test.shape}")


Train: (1022, 36), Test: (438, 36)


In [8]:
# reset index, so we can compare values later on
# in the demo

X_train.reset_index(inplace=True, drop=True)
X_test.reset_index(inplace=True, drop=True)

##### KNN Imputation 

In [9]:
imputer = KNNImputer(
    n_neighbors=5, # the number of neighbours K
    weights='distance', # the weighting factor
    metric='nan_euclidean', # the metric to find the neighbours
    add_indicator=False, # whether to add a missing indicator
)

In [10]:
imputer.fit(X_train)

KNNImputer(weights='distance')
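To make two of these parameters more concrete, the sketch below shows what the `nan_euclidean` metric computes and what `add_indicator=True` adds to the output. The array `toy` is invented for the illustration.

```python
import numpy as np
from sklearn.impute import KNNImputer
from sklearn.metrics.pairwise import nan_euclidean_distances

# hypothetical toy array: the second row is missing its second feature
toy = np.array([
    [0.0, 1.0],
    [1.0, np.nan],
    [2.0, 3.0],
])

# nan_euclidean ignores coordinates that are missing in either row and
# rescales the squared distance by (total coordinates / present coordinates)
distances = nan_euclidean_distances(toy, toy)

# add_indicator=True appends one 0/1 column per feature that had
# missing values when the imputer was fitted
with_flags = KNNImputer(n_neighbors=2, add_indicator=True).fit_transform(toy)

print(distances)
print(with_flags)
```

In this toy case the output has three columns: the two imputed features plus one flag column marking where the second feature was originally missing.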

In [11]:
train_t = imputer.transform(X_train)
test_t = imputer.transform(X_test)

# sklearn returns a NumPy array,
# so let's turn it back into a dataframe
train_t = pd.DataFrame(train_t, columns=X_train.columns)
test_t = pd.DataFrame(test_t, columns=X_test.columns)

train_t.head()

| | MSSubClass | LotFrontage | LotArea | OverallQual | OverallCond | YearBuilt | YearRemodAdd | MasVnrArea | BsmtFinSF1 | BsmtFinSF2 | ... | GarageArea | WoodDeckSF | OpenPorchSF | EnclosedPorch | 3SsnPorch | ScreenPorch | PoolArea | MiscVal | MoSold | YrSold |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 60.0 | 70.115142 | 9375.0 | 7.0 | 5.0 | 1997.0 | 1998.0 | 573.0 | 739.0 | 0.0 | ... | 645.0 | 576.0 | 36.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2.0 | 2009.0 |
| 1 | 120.0 | 42.533053 | 2887.0 | 6.0 | 5.0 | 1996.0 | 1997.0 | 0.0 | 1003.0 | 0.0 | ... | 431.0 | 307.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 11.0 | 2008.0 |
| 2 | 20.0 | 50.0 | 7207.0 | 5.0 | 7.0 | 1958.0 | 2008.0 | 0.0 | 696.0 | 0.0 | ... | 0.0 | 117.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2.0 | 2010.0 |
| 3 | 50.0 | 60.0 | 9060.0 | 6.0 | 5.0 | 1939.0 | 1950.0 | 0.0 | 204.0 | 0.0 | ... | 280.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 10.0 | 2009.0 |
| 4 | 30.0 | 60.0 | 8400.0 | 2.0 | 5.0 | 1920.0 | 1950.0 | 0.0 | 290.0 | 0.0 | ... | 246.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 2009.0 |


In [12]:
# variables without NA after the imputation

train_t[['LotFrontage', 'MasVnrArea', 'GarageYrBlt']].isnull().sum()

LotFrontage    0
MasVnrArea     0
GarageYrBlt    0
dtype: int64

In [15]:
# the observations with NA in the original train set

X_train[X_train['MasVnrArea'].isnull()]['MasVnrArea']

420   NaN
490   NaN
642   NaN
824   NaN
921   NaN
Name: MasVnrArea, dtype: float64

In [17]:
# the replacement values in the transformed dataset

train_t[X_train['MasVnrArea'].isnull()]['MasVnrArea']

420     99.765717
490     34.106592
642      0.000000
824    375.749332
921     85.817715
Name: MasVnrArea, dtype: float64

In [18]:
# the mean value of the variable (i.e., the value mean imputation would use)
print(f"Original mean: {X_train['MasVnrArea'].mean()}")
print(f"KNN imputed mean: {train_t['MasVnrArea'].mean()}")


Original mean: 103.55358898721731
KNN imputed mean: 103.62958841087315
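For comparison, mean imputation leaves the mean of the variable exactly unchanged but shrinks its spread, because every gap receives the same value. The sketch below illustrates this on a made-up data frame (`toy`), not on the house price data.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer, SimpleImputer

# hypothetical toy data: y is correlated with x, and 15 values of y are missing
rng = np.random.default_rng(0)
toy = pd.DataFrame({"x": rng.normal(size=100)})
toy["y"] = 3 * toy["x"] + rng.normal(scale=0.5, size=100)
toy.loc[toy.index[:15], "y"] = np.nan

# impute y with the column mean and with KNN (column 1 of the output is y)
mean_filled = SimpleImputer(strategy="mean").fit_transform(toy)[:, 1]
knn_filled = KNNImputer(n_neighbors=5).fit_transform(toy)[:, 1]

print("mean:", toy["y"].mean(), mean_filled.mean(), knn_filled.mean())
print("std: ", toy["y"].std(), mean_filled.std(ddof=1), knn_filled.std(ddof=1))
```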


##### Automatically find the best imputation parameters 

We can optimise the parameters of the KNN imputation, together with those of the model, to better predict our outcome.

In [20]:
# separate into train and test set

X_train, X_test, y_train, y_test = train_test_split(
        data[use_columns],  # just the features
        data['SalePrice'],  # the target
        test_size=0.3,  # the percentage of obs in the test set
        random_state=0
    )

print(f"Train {X_train.shape}, Test: {X_test.shape}")

Train (1022, 36), Test: (438, 36)


In [21]:
pipe = Pipeline(steps=[
    ('imputer', KNNImputer(
        n_neighbors=5,
        weights='distance',
        add_indicator=False)
    ),
    ('scaler', StandardScaler()),
    ('regressor', Lasso(max_iter=2000)),
])

In [22]:
# now we create the grid with all the parameters that we would like to test

param_grid = {
    'imputer__n_neighbors': [3,5,10],
    'imputer__weights': ['uniform', 'distance'],
    'imputer__add_indicator': [True, False],
    'regressor__alpha': [10, 100, 200],
}

grid_search = GridSearchCV(pipe, param_grid, cv=5, n_jobs=-1, scoring='r2')

# cv=5 is the number of cross-validation folds
# n_jobs=-1 indicates to use all available cpus
# scoring='r2' indicates to evaluate using the r squared
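It helps to know how much work such a grid implies before running it. A small sketch using `ParameterGrid` to count the candidates for the grid above:

```python
from sklearn.model_selection import ParameterGrid

# same grid as in the cell above
param_grid = {
    'imputer__n_neighbors': [3, 5, 10],
    'imputer__weights': ['uniform', 'distance'],
    'imputer__add_indicator': [True, False],
    'regressor__alpha': [10, 100, 200],
}

# every combination of parameter values is one candidate pipeline
n_candidates = len(ParameterGrid(param_grid))

# each candidate is fitted once per cross-validation fold (cv=5)
n_fits = n_candidates * 5

print(n_candidates, n_fits)
```

The final refit of the best candidate on the whole training set comes on top of these fits.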

In [23]:
# and now we train over all the possible combinations 
# of the parameters above
grid_search.fit(X_train, y_train)

# and we print the best score over the train set
print(("best linear regression from grid search: %.3f"
       % grid_search.score(X_train, y_train)))

best linear regression from grid search: 0.845


In [24]:
# let's check the performance over the test set
print(("best linear regression from grid search: %.3f"
       % grid_search.score(X_test, y_test)))

best linear regression from grid search: 0.730


In [25]:
# and find the best parameters

grid_search.best_params_

{'imputer__add_indicator': True,
 'imputer__n_neighbors': 10,
 'imputer__weights': 'distance',
 'regressor__alpha': 200}
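Beyond the single best combination, `cv_results_` shows how the score changes across the imputation parameters, which is a quick way to judge how sensitive the model is to K. The sketch below runs a reduced grid on synthetic data; the data, the grid values and the names `search` and `summary` are assumptions made for the illustration.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.impute import KNNImputer
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# synthetic regression data with roughly 10% of the values removed at random
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
rng = np.random.default_rng(0)
X[rng.random(X.shape) < 0.1] = np.nan

pipe = Pipeline(steps=[
    ('imputer', KNNImputer()),
    ('scaler', StandardScaler()),
    ('regressor', Lasso(max_iter=2000)),
])

# reduced grid with arbitrary values, just for the illustration
param_grid = {
    'imputer__n_neighbors': [3, 5, 10],
    'regressor__alpha': [0.1, 1.0],
}

search = GridSearchCV(pipe, param_grid, cv=5, scoring='r2')
search.fit(X, y)

# one row per candidate, best candidates first
results = pd.DataFrame(search.cv_results_)
summary = results[[
    'param_imputer__n_neighbors', 'param_regressor__alpha',
    'mean_test_score', 'std_test_score', 'rank_test_score',
]].sort_values('rank_test_score')

print(summary)
```

If the mean scores differ by less than their standard deviation across folds, the choice of K matters little for the downstream model.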

##### Let's compare with univariate imputation

In [26]:
from sklearn.impute import SimpleImputer

In [27]:
# separate into train and test set

X_train, X_test, y_train, y_test = train_test_split(
        data[use_columns],  # just the features
        data['SalePrice'],  # the target
        test_size=0.3,  # the percentage of obs in the test set
        random_state=0
    )

In [28]:
pipe = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean', fill_value=-1)),
    ('scaler', StandardScaler()),
    ('regressor', Lasso(max_iter=2000)),
])

param_grid = {
    'imputer__strategy': ['mean', 'median', 'constant'],
    'imputer__add_indicator': [True, False],
    'regressor__alpha': [10, 100, 200],
}

grid_search = GridSearchCV(pipe, param_grid, cv=5, n_jobs=-1, scoring='r2')

# and now we train over all the possible combinations of the parameters above
grid_search.fit(X_train, y_train)

# and we print the best score over the train set
print(("best linear regression from grid search: %.3f"
       % grid_search.score(X_train, y_train)))

best linear regression from grid search: 0.845


In [29]:
# and finally let's check the performance over the test set
print(("best linear regression from grid search: %.3f"
       % grid_search.score(X_test, y_test)))

best linear regression from grid search: 0.729


In [30]:
# and find the best fit parameters like this
grid_search.best_params_

{'imputer__add_indicator': False,
 'imputer__strategy': 'constant',
 'regressor__alpha': 200}

Imputing with an arbitrary value of -1 gives approximately the same test performance as KNN imputation (R² of 0.729 versus 0.730), so we might not want to add the extra complexity of training models to impute NA.

**It's your choice.**
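One way to make that choice empirically is to treat the imputer itself as a hyperparameter, so that KNN and univariate imputation compete in a single grid search. The sketch below does this on synthetic data; the data, the grid values and the names `search` and `best_imputer` are assumptions for the illustration, not results from the house price data.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.impute import KNNImputer, SimpleImputer
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# synthetic regression data with roughly 10% of the values removed at random
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
rng = np.random.default_rng(0)
X[rng.random(X.shape) < 0.1] = np.nan

pipe = Pipeline(steps=[
    ('imputer', SimpleImputer()),  # placeholder, replaced by the grid below
    ('scaler', StandardScaler()),
    ('regressor', Lasso(max_iter=2000)),
])

# a list of grids: each dict swaps in one imputer and tunes its own parameters
param_grid = [
    {
        'imputer': [KNNImputer()],
        'imputer__n_neighbors': [3, 5, 10],
        'imputer__weights': ['uniform', 'distance'],
    },
    {
        'imputer': [SimpleImputer(fill_value=-1)],
        'imputer__strategy': ['mean', 'median', 'constant'],
    },
]

search = GridSearchCV(pipe, param_grid, cv=5, scoring='r2')
search.fit(X, y)

best_imputer = search.best_params_['imputer']
print(type(best_imputer).__name__, round(search.best_score_, 3))
```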