# Machine Learning Imputation - KNN Imputation

What is KNN Imputation?

K-nearest neighbors (KNN) imputation estimates and fills in missing values in a dataset. It rests on the assumption that data points with similar characteristics tend to have similar values. Instead of relying on a global mean or median, it uses only the local neighborhood of the data point that has the missing value.

How it Works:

Define Similarity (Distance):

* The algorithm calculates the distance between the data point with the missing value and every other data point in the dataset.
* The choice of distance metric is crucial. For numerical data, Euclidean distance is commonly used; scikit-learn's KNNImputer defaults to a NaN-aware variant of it. For categorical data, Hamming distance (after encoding) is more appropriate.
* Distance is computed only over the features (variables) that are observed in both data points, so features on very different scales should be rescaled first or the largest one will dominate.
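
A minimal sketch of such a NaN-aware Euclidean distance is shown below. It follows the rule used by scikit-learn's `nan_euclidean` metric: sum the squared differences over the coordinates present in both rows, then rescale by the ratio of total to present coordinates. The function name `nan_euclidean` and the two sample rows are illustrative assumptions, not part of the notebook.

```python
import numpy as np

def nan_euclidean(a, b):
    """Euclidean distance over coordinates observed in both a and b,
    rescaled by (total coordinates / shared observed coordinates)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    both = ~np.isnan(a) & ~np.isnan(b)      # coordinates present in both rows
    if not both.any():
        return float("nan")                 # no shared features: undefined
    sq = np.sum((a[both] - b[both]) ** 2)
    return float(np.sqrt(a.size / both.sum() * sq))

# Hypothetical rows: (purchase amount, discount, delivery days)
row_a = [np.nan, 0.15, 4.0]
row_b = [1300.0, 0.07, 4.0]
d = nan_euclidean(row_a, row_b)
print(d)
```

Only the discount and delivery-time coordinates contribute here, because the purchase amount is missing in `row_a`.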

Identify Nearest Neighbors:

* The algorithm selects the "K" nearest neighbors to the data point with the missing value.
* "K" is a user-defined parameter, representing the number of neighbors to consider.  A smaller K makes the imputation more sensitive to local variations, while a larger K smooths out the estimates.
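
The selection step can be sketched as a sort over the distances. The helper `k_nearest` and the distance values are hypothetical; the point is only that K controls how many of the closest candidates are kept, and that candidates with an undefined (NaN) distance are skipped.

```python
import numpy as np

def k_nearest(distances, k):
    """Return the indices of the k smallest finite distances, closest first."""
    distances = np.asarray(distances, dtype=float)
    order = np.argsort(distances)                  # NaN values sort to the end
    order = order[~np.isnan(distances[order])]     # drop undefined distances
    return order[:k]

# Hypothetical distances from one incomplete row to five candidate rows
dists = [0.9, 0.1, np.nan, 0.4, 2.5]
idx_k2 = k_nearest(dists, 2)   # small K: only the closest rows
idx_k4 = k_nearest(dists, 4)   # larger K: more rows, smoother estimate
print(idx_k2, idx_k4)
```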

Impute the Missing Value:

* Numerical Data: If the missing value is in a numerical feature, the imputed value is typically the average or median of the corresponding values of the K nearest neighbors. The average is more sensitive to outliers, while the median is more robust.
* Categorical Data: If the missing value is in a categorical feature, the imputed value is usually the mode (most frequent value) among the K nearest neighbors.
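
As a small illustration of the aggregation step, the snippet below applies the mean, median, and mode to made-up neighbor values using only the standard library. The prices and colors are hypothetical; one price is deliberately an outlier to show how the mean and median diverge.

```python
from statistics import mean, median, mode

# Hypothetical values taken from K = 3 nearest neighbors
neighbor_prices = [20.0, 22.0, 95.0]          # 95.0 is an outlier
neighbor_colors = ["blue", "black", "blue"]

imputed_mean = mean(neighbor_prices)      # pulled upward by the outlier
imputed_median = median(neighbor_prices)  # unaffected by the outlier
imputed_mode = mode(neighbor_colors)      # most frequent category

print(imputed_mean, imputed_median, imputed_mode)
```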

Example:

Imagine you're trying to estimate the missing price of a product (e.g., shorts). KNN imputation would:

* Calculate how similar that product is to other products in terms of its attributes (e.g., material, brand, size, style).
* Identify the K most similar products.
* Use the prices of those K similar products to estimate the missing price by calculating the average or median price.
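
The three steps of this example can be sketched end to end with NumPy. The catalogue below is invented for illustration, and its attributes are assumed to be already numeric (a size and a quality score); real attributes such as brand or material would need encoding first.

```python
import numpy as np

# Hypothetical catalogue: columns are [size, quality_score, price]
products = np.array([
    [30.0, 3.0, 20.0],
    [32.0, 3.0, 22.0],
    [34.0, 4.0, 30.0],
    [40.0, 5.0, 60.0],
])
target = np.array([31.0, 3.0])   # attributes of the product with no price
k = 2

# Step 1: distance from the target to every product, using known attributes
dist = np.linalg.norm(products[:, :2] - target, axis=1)

# Step 2: indices of the K most similar products
nearest = np.argsort(dist)[:k]

# Step 3: average the neighbors' prices to estimate the missing one
estimated_price = products[nearest, 2].mean()
print(estimated_price)
```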

# Notebook Structure

1. Import necessary dependencies
2. Create the dataset
3. Define the KNN Imputer utility function
4. Execution of the utility function


# 1. Import necessary dependencies

In [216]:
# libraries & dataset

import pandas as pd
import numpy as np
from sklearn.impute import KNNImputer

# 2. Create the dataset

In [217]:
# Sample DataFrame with missing values

data = pd.DataFrame({
    'CustomerID': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'Purchase_Amount': [500, 1200, np.nan, 600, 1500, 2000, 100, 700, 1300, np.nan],
    'Discount_Percentage': [0.1, 0.05, 0.15, np.nan, 0.03, 0.0, 0.2, 0.12, 0.07, 0.5],
    'Delivery_Time_Days': [2, 3, 4, 2, 5, 6, 1, np.nan, 4, 3]
})

In [218]:
print("Original Data:\n")
data

Original Data:



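| | CustomerID | Purchase_Amount | Discount_Percentage | Delivery_Time_Days |
|---|---|---|---|---|
| 0 | 1 | 500.0 | 0.10 | 2.0 |
| 1 | 2 | 1200.0 | 0.05 | 3.0 |
| 2 | 3 | NaN | 0.15 | 4.0 |
| 3 | 4 | 600.0 | NaN | 2.0 |
| 4 | 5 | 1500.0 | 0.03 | 5.0 |
| 5 | 6 | 2000.0 | 0.00 | 6.0 |
| 6 | 7 | 100.0 | 0.20 | 1.0 |
| 7 | 8 | 700.0 | 0.12 | NaN |
| 8 | 9 | 1300.0 | 0.07 | 4.0 |
| 9 | 10 | NaN | 0.50 | 3.0 |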


# 3. Define the KNN Imputer utility function

The knn_imputation function imputes missing values in a DataFrame's numerical columns. For each row with a missing value, it finds the K nearest rows and replaces the missing value with the mean of the corresponding values from those neighbors. scikit-learn's KNNImputer uses the mean (uniformly weighted by default), not the median.

The knn_imputation(df, numerical_cols, n_neighbors=5) function:

* Takes the DataFrame, a list of numerical columns to impute, and the number of neighbors (n_neighbors) as input.
* Creates a copy of the DataFrame to avoid modifying the original.
* Imputes missing values in the specified numerical columns using KNNImputer.
* For each missing value, it identifies the n_neighbors closest rows that have that column observed, measuring distance over the other numerical columns that both rows have in common.
* The missing value is then replaced by the mean of the corresponding values from these neighbors.
* Returns the DataFrame with imputed values.

In [219]:
def knn_imputation(df, numerical_cols, n_neighbors=5):
    """Imputes missing values using KNN imputation for numerical columns.

    Args:
        df (pd.DataFrame): The input DataFrame.
        numerical_cols (list): List of numerical columns to impute.
        n_neighbors (int, optional): Number of neighbors for KNN. Defaults to 5.

    Returns:
        pd.DataFrame: A new DataFrame with the missing values imputed.
    """
    df_imputed = df.copy()

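    # Fit KNNImputer on the selected columns and fill each NaN with the
    # mean of that column over the n_neighbors nearest rows
    # (NaN-aware Euclidean distance by default).
    imputer = KNNImputer(n_neighbors=n_neighbors)
    df_imputed[numerical_cols] = imputer.fit_transform(df_imputed[numerical_cols])
    return df_imputed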
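
Because the distance is computed on raw values, a column such as Purchase_Amount (hundreds to thousands) outweighs Discount_Percentage (below 1) when neighbors are chosen. One possible remedy, sketched here as an assumption rather than as part of the notebook, is to standardize the columns before imputing and undo the scaling afterwards. The function name `knn_imputation_scaled` and the small `demo` frame are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer
from sklearn.preprocessing import StandardScaler

def knn_imputation_scaled(df, numerical_cols, n_neighbors=5):
    """Hypothetical variant: standardize, impute with KNN, then unscale."""
    df_imputed = df.copy()
    scaler = StandardScaler()   # ignores NaN when fitting, keeps NaN on transform
    scaled = scaler.fit_transform(df_imputed[numerical_cols])
    filled = KNNImputer(n_neighbors=n_neighbors).fit_transform(scaled)
    # Map the imputed values back to the original units
    df_imputed[numerical_cols] = scaler.inverse_transform(filled)
    return df_imputed

# Small made-up frame to exercise the function
demo = pd.DataFrame({
    "amount": [500.0, 1200.0, np.nan, 600.0],
    "discount": [0.10, 0.05, 0.15, np.nan],
})
result = knn_imputation_scaled(demo, ["amount", "discount"], n_neighbors=2)
print(result)
```

Whether scaling improves the imputed values depends on the data; it mainly ensures that every column has a comparable say in which rows count as neighbors.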

# 4. Execution of the utility function

In [220]:
# Numerical columns to impute (CustomerID is an identifier, so it is excluded)
numerical_cols = ['Purchase_Amount', 'Discount_Percentage', 'Delivery_Time_Days']


# Perform KNN imputation (the function copies the DataFrame internally)
data_imputed = knn_imputation(data, numerical_cols, n_neighbors=3)

In [221]:
print("\nData after KNN Imputation:\n")
data_imputed


Data after KNN Imputation:



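| | CustomerID | Purchase_Amount | Discount_Percentage | Delivery_Time_Days |
|---|---|---|---|---|
| 0 | 1 | 500.000000 | 0.10 | 2.0 |
| 1 | 2 | 1200.000000 | 0.05 | 3.0 |
| 2 | 3 | 1066.666667 | 0.15 | 4.0 |
| 3 | 4 | 600.000000 | 0.25 | 2.0 |
| 4 | 5 | 1500.000000 | 0.03 | 5.0 |
| 5 | 6 | 2000.000000 | 0.00 | 6.0 |
| 6 | 7 | 100.000000 | 0.20 | 1.0 |
| 7 | 8 | 700.000000 | 0.12 | 3.0 |
| 8 | 9 | 1300.000000 | 0.07 | 4.0 |
| 9 | 10 | 800.000000 | 0.50 | 3.0 |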


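All NaN values in the numerical columns have been imputed. With n_neighbors=3, each filled value is the mean of that column over the three nearest rows in which the column is observed; the values that were already present are unchanged.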
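
A quick sanity check along these lines is sketched below. It rebuilds the notebook's dataset so that it runs on its own, then confirms that no NaN remains and that observed entries were left untouched. The variable names `remaining_nans` and `observed_unchanged` are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Same sample data as in the notebook
data = pd.DataFrame({
    'CustomerID': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'Purchase_Amount': [500, 1200, np.nan, 600, 1500, 2000, 100, 700, 1300, np.nan],
    'Discount_Percentage': [0.1, 0.05, 0.15, np.nan, 0.03, 0.0, 0.2, 0.12, 0.07, 0.5],
    'Delivery_Time_Days': [2, 3, 4, 2, 5, 6, 1, np.nan, 4, 3]
})
numerical_cols = ['Purchase_Amount', 'Discount_Percentage', 'Delivery_Time_Days']

imputed = data.copy()
imputed[numerical_cols] = KNNImputer(n_neighbors=3).fit_transform(data[numerical_cols])

# 1. No missing values should remain in the imputed columns
remaining_nans = int(imputed[numerical_cols].isna().sum().sum())

# 2. Entries that were observed originally should be unchanged
observed = data[numerical_cols].notna().to_numpy()
observed_unchanged = np.allclose(
    imputed[numerical_cols].to_numpy()[observed],
    data[numerical_cols].to_numpy()[observed],
)

print("Remaining NaNs:", remaining_nans)
print("Observed values unchanged:", observed_unchanged)
```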