# Simple Imputation - Median Imputation

Median Imputation (The "Middle Value Guess")

When to Use:

* When your data is skewed (asymmetric, with a long tail on one side) or has outliers.
* When you want to use a measure of central tendency that is less sensitive to extreme values.
* When you want to maintain the "middle" value of the data.
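
To make the outlier argument concrete, here is a minimal sketch comparing the mean and the median on a small set of made-up purchase amounts. The values in `amounts` are hypothetical and chosen only so that one extreme value dominates the mean.

```python
import numpy as np

# Hypothetical purchase amounts; the last value is a deliberate outlier.
amounts = np.array([500, 600, 700, 1200, 1300, 50000])

mean_val = amounts.mean()        # pulled upward by the single outlier
median_val = np.median(amounts)  # stays near the bulk of the data

print(f"mean:   {mean_val:.1f}")    # mean:   9050.0
print(f"median: {median_val:.1f}")  # median: 950.0
```

Filling a gap with 9050 would place the imputed value far from every typical purchase, while 950 sits in the middle of them.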

How it Works:

* Calculates the median (middle value) of the existing values in a column.
* Replaces missing values with that calculated median.
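
The two steps above can be sketched on a single pandas Series. The values below mirror the `Delivery_Time_Days` column of the sample dataset used in this notebook; the variable names `s`, `median_val`, and `filled` are illustrative only.

```python
import numpy as np
import pandas as pd

# One column with a single missing entry at index 7.
s = pd.Series([2, 3, 4, 2, 5, 6, 1, np.nan, 4, 3])

# Step 1: median of the observed values (NaN is skipped by default).
median_val = s.median()

# Step 2: replace the missing entry with that median.
filled = s.fillna(median_val)

print(median_val)  # 3.0
print(filled[7])   # 3.0
```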

Limitations:

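* Like mean imputation, it reduces variance in the data, because every missing entry receives the same value.
* Not suitable for categorical data, where a median is generally undefined (nominal categories have no ordering).
* May not be representative if the data is highly multimodal (has multiple peaks), since the median can fall between the peaks where few values actually occur.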
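
The variance reduction can be checked directly. This is a rough sketch that uses the `Purchase_Amount` values from the sample dataset in this notebook; `var_before` and `var_after` are names introduced here for illustration.

```python
import numpy as np
import pandas as pd

# Purchase_Amount values with two missing entries.
s = pd.Series([500, 1200, np.nan, 600, 1500, 2000, 100, 700, 1300, np.nan])

var_before = s.var()                      # sample variance of observed values
var_after = s.fillna(s.median()).var()    # variance after median imputation

print(f"variance before imputation: {var_before:.1f}")
print(f"variance after imputation:  {var_after:.1f}")
print("reduced:", var_after < var_before)
```

The imputed values all sit at the center of the distribution, so they add observations without adding spread.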


# Notebook Structure

1. Import necessary dependencies
2. Create the dataset
3. Define the Utility function for Median imputation
4. Execution of the utility function


# 1. Import necessary dependencies

In [55]:
# Import libraries

import pandas as pd
import numpy as np

# 2. Create the dataset

In [56]:
# Create Sample DataFrame with missing values

data = pd.DataFrame({
    'CustomerID': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'Product_Type': ['T-shirt', 'Shorts', 'Track Pants', 'T-shirt', None,
                     'Joggers', 'Cap', None, 'Shorts', 'T-shirt'],
    'Purchase_Amount': [500, 1200, np.nan, 600, 1500, 2000, 100, 700, 1300, np.nan],
    'Discount_Percentage': [0.1, 0.05, 0.15, np.nan, 0.03, 0.0, 0.2, 0.12, 0.07, 0.5],
    'Delivery_Time_Days': [2, 3, 4, 2, 5, 6, 1, np.nan, 4, 3]
})

In [57]:
print("Original Data:\n")
data

Original Data:



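| | CustomerID | Product_Type | Purchase_Amount | Discount_Percentage | Delivery_Time_Days |
|---|---|---|---|---|---|
| 0 | 1 | T-shirt | 500.0 | 0.1 | 2.0 |
| 1 | 2 | Shorts | 1200.0 | 0.05 | 3.0 |
| 2 | 3 | Track Pants | NaN | 0.15 | 4.0 |
| 3 | 4 | T-shirt | 600.0 | NaN | 2.0 |
| 4 | 5 | None | 1500.0 | 0.03 | 5.0 |
| 5 | 6 | Joggers | 2000.0 | 0.0 | 6.0 |
| 6 | 7 | Cap | 100.0 | 0.2 | 1.0 |
| 7 | 8 | None | 700.0 | 0.12 | NaN |
| 8 | 9 | Shorts | 1300.0 | 0.07 | 4.0 |
| 9 | 10 | T-shirt | NaN | 0.5 | 3.0 |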


# 3. Define the Utility function for Median imputation

The `median_imputation` function takes a pandas DataFrame and a column name. It computes the median of the specified column's non-missing values, creates a copy of the DataFrame, replaces all missing values (NaN) in the given column with that median, and returns the modified copy. The input DataFrame is left unchanged.

`median_imputation(df, column)` function:

* Calculates the median of the specified column.
* Copies the DataFrame so the original is not modified.
* Uses `fillna()` to replace missing values with the calculated median.
* Returns the imputed DataFrame.

In [58]:
# --- Median Imputation ---

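def median_imputation(df, column):
    """Imputes missing values with the median of the column."""
    # median() skips NaN by default, so only observed values are used
    median_val = df[column].median()
    # Work on a copy so the caller's DataFrame is left unchanged
    df_imputed = df.copy()
    # Assign the result back: fillna(inplace=True) on a column selection is
    # deprecated in recent pandas and may not update df_imputed
    df_imputed[column] = df_imputed[column].fillna(median_val)
    return df_imputed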
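
Since the median is not meaningful for a column such as `Product_Type`, one possible extension is to reject non-numeric columns up front. The sketch below is a hypothetical variant, not part of the original notebook: the name `median_imputation_checked` and the tiny `sample` frame are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from pandas.api.types import is_numeric_dtype

def median_imputation_checked(df, column):
    """Hypothetical variant: refuse non-numeric columns, leave the input untouched."""
    if not is_numeric_dtype(df[column]):
        raise TypeError(f"Column {column!r} is not numeric; the median is undefined.")
    df_imputed = df.copy()
    df_imputed[column] = df_imputed[column].fillna(df[column].median())
    return df_imputed

# Small made-up frame with one numeric and one categorical column.
sample = pd.DataFrame({
    'Product_Type': ['T-shirt', None, 'Cap'],
    'Purchase_Amount': [500.0, np.nan, 100.0],
})

result = median_imputation_checked(sample, 'Purchase_Amount')
print(result['Purchase_Amount'].tolist())    # [500.0, 300.0, 100.0]
print(sample['Purchase_Amount'].isna().sum())  # 1, the input is unchanged
```

Calling it with `'Product_Type'` raises a `TypeError` instead of failing inside `median()`.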

# 4. Execution of the utility function

### A. Impute the missing values in 'Purchase_Amount' column

In [59]:
data['Purchase_Amount'].median()

950.0

In [60]:
# median_imputation copies internally, so `data` itself is not modified
data_median_imputed = median_imputation(data, 'Purchase_Amount')
print("\nData after Median Imputation (Purchase_Amount):\n")
data_median_imputed


Data after Median Imputation (Purchase_Amount):



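| | CustomerID | Product_Type | Purchase_Amount | Discount_Percentage | Delivery_Time_Days |
|---|---|---|---|---|---|
| 0 | 1 | T-shirt | 500.0 | 0.1 | 2.0 |
| 1 | 2 | Shorts | 1200.0 | 0.05 | 3.0 |
| 2 | 3 | Track Pants | 950.0 | 0.15 | 4.0 |
| 3 | 4 | T-shirt | 600.0 | NaN | 2.0 |
| 4 | 5 | None | 1500.0 | 0.03 | 5.0 |
| 5 | 6 | Joggers | 2000.0 | 0.0 | 6.0 |
| 6 | 7 | Cap | 100.0 | 0.2 | 1.0 |
| 7 | 8 | None | 700.0 | 0.12 | NaN |
| 8 | 9 | Shorts | 1300.0 | 0.07 | 4.0 |
| 9 | 10 | T-shirt | 950.0 | 0.5 | 3.0 |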


As you can see, the two rows that had NaN in Purchase_Amount (indices 2 and 9) are now filled with 950.0, the median of the column.

### B. Impute the missing values in 'Discount_Percentage' column

In [53]:
data['Discount_Percentage'].median()

0.1

In [61]:
data_median_imputed = median_imputation(data_median_imputed, 'Discount_Percentage')
print("\nData after Median Imputation (Discount_Percentage):\n")
data_median_imputed


Data after Median Imputation (Discount_Percentage):



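| | CustomerID | Product_Type | Purchase_Amount | Discount_Percentage | Delivery_Time_Days |
|---|---|---|---|---|---|
| 0 | 1 | T-shirt | 500.0 | 0.1 | 2.0 |
| 1 | 2 | Shorts | 1200.0 | 0.05 | 3.0 |
| 2 | 3 | Track Pants | 950.0 | 0.15 | 4.0 |
| 3 | 4 | T-shirt | 600.0 | 0.1 | 2.0 |
| 4 | 5 | None | 1500.0 | 0.03 | 5.0 |
| 5 | 6 | Joggers | 2000.0 | 0.0 | 6.0 |
| 6 | 7 | Cap | 100.0 | 0.2 | 1.0 |
| 7 | 8 | None | 700.0 | 0.12 | NaN |
| 8 | 9 | Shorts | 1300.0 | 0.07 | 4.0 |
| 9 | 10 | T-shirt | 950.0 | 0.5 | 3.0 |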


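As you can see, the row that had NaN in Discount_Percentage (index 3) is now filled with 0.1, the median of the column. The missing values in Product_Type and Delivery_Time_Days remain, because those columns were not imputed.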
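
When several numeric columns need the same treatment, the per-column calls can be condensed into one step. The following is a minimal sketch rather than part of the original notebook: it rebuilds only the numeric columns of the sample dataset, and `numeric_cols`, `medians`, and `df_imputed` are names introduced here. It relies on `DataFrame.fillna` accepting a Series whose index labels match the column names.

```python
import numpy as np
import pandas as pd

# Numeric columns of the sample dataset, including their missing values.
df = pd.DataFrame({
    'Purchase_Amount': [500, 1200, np.nan, 600, 1500, 2000, 100, 700, 1300, np.nan],
    'Discount_Percentage': [0.1, 0.05, 0.15, np.nan, 0.03, 0.0, 0.2, 0.12, 0.07, 0.5],
    'Delivery_Time_Days': [2, 3, 4, 2, 5, 6, 1, np.nan, 4, 3],
})

numeric_cols = ['Purchase_Amount', 'Discount_Percentage', 'Delivery_Time_Days']

# One median per column, returned as a Series indexed by column name.
medians = df[numeric_cols].median()

# fillna matches each median to its column; df itself is not modified.
df_imputed = df.fillna(medians)

print(medians)
print("remaining NaN:", int(df_imputed.isna().sum().sum()))
```

Each column is still filled with its own median, so the result for `Purchase_Amount` and `Discount_Percentage` matches the step-by-step output above.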