# Simple Imputation - Mean Imputation

Mean Imputation (The "Average Guess")

When to Use:

* When your data is approximately normally distributed (bell-shaped curve).
* When you want to maintain the overall average of the data.
* When you have a large dataset and the missing values are relatively few.
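
A rough way to check these conditions before imputing is to look at the share of missing values and at how far the mean sits from the median. The sketch below uses an illustrative Series named `amounts` (the name and the checks are assumptions, not part of the notebook); what counts as "few" missing values or "approximately normal" remains a judgment call.

```python
import numpy as np
import pandas as pd

# Illustrative numeric column with two missing entries.
amounts = pd.Series([500, 1200, np.nan, 600, 1500, 2000, 100, 700, 1300, np.nan])

# Share of rows that are missing (isna() gives booleans; their mean is the fraction).
missing_fraction = amounts.isna().mean()

# Skewness is close to 0 for a roughly symmetric distribution.
skewness = amounts.skew()

# A large gap between mean and median hints at skew or outliers.
gap = amounts.mean() - amounts.median()

print(f"missing: {missing_fraction:.0%}, skew: {skewness:.2f}, mean - median: {gap:.1f}")
```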

How it Works:

* Calculates the average (mean) of the existing values in a column.
* Replaces missing values with that calculated mean.
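
The two steps above can be sketched on a single pandas Series. The toy values below are hypothetical and chosen so the arithmetic is easy to follow.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column, not part of the notebook's dataset.
values = pd.Series([10.0, 20.0, np.nan, 30.0])

# Step 1: Series.mean() skips NaN by default, so this is (10 + 20 + 30) / 3.
mean_val = values.mean()

# Step 2: fillna returns a new Series; `values` itself is left unchanged.
filled = values.fillna(mean_val)

print(mean_val)         # 20.0
print(filled.tolist())  # [10.0, 20.0, 20.0, 30.0]
```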

Limitations:

* Highly sensitive to outliers. Extreme values pull the mean toward them, so the imputed values may not represent a typical observation.
* Reduces variance in the data. Every imputed value sits exactly at the mean, so the spread of the column is understated.
* Not suitable for categorical data.
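
The first two limitations can be seen in a small experiment. The sketch below uses a hypothetical column with one extreme value; the numbers are made up for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical column: four ordinary values, one outlier (1000), two missing entries.
s = pd.Series([10.0, 12.0, 11.0, np.nan, 13.0, 1000.0, np.nan])

mean_val = s.mean()      # pulled far above the typical values by the outlier
median_val = s.median()  # stays near the typical values

# Fill the gaps with the mean, as mean imputation does.
filled = s.fillna(mean_val)

print(f"mean: {mean_val:.1f}, median: {median_val:.1f}")
# The imputed values add no spread, so the standard deviation shrinks.
print(f"std before: {s.std():.1f}, std after: {filled.std():.1f}")
```

Here the imputed value is far from any of the ordinary observations, and the standard deviation after imputation is smaller than before, even though no new information was added.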


# Notebook Structure

1. Import necessary dependencies
2. Create the dataset
3. Define the Utility function for Mean imputation
4. Execution of the utility function


# 1. Import necessary dependencies

In [2]:
# Libraries

import pandas as pd
import numpy as np

# 2. Create the dataset

In [3]:
# Create Sample DataFrame with missing values

data = pd.DataFrame({
    'CustomerID': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'Product_Type': ['T-shirt', 'Shorts', 'Track Pants', 'T-shirt', None,
                     'Joggers', 'Cap', None, 'Shorts', 'T-shirt'],
    'Purchase_Amount': [500, 1200, np.nan, 600, 1500, 2000, 100, 700, 1300, np.nan],
    'Discount_Percentage': [0.1, 0.05, 0.15, np.nan, 0.03, 0.0, 0.2, 0.12, 0.07, 0.5],
    'Delivery_Time_Days': [2, 3, 4, 2, 5, 6, 1, np.nan, 4, 3]
})


In [4]:
print("Original Data:\n")
data

Original Data:



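|   | CustomerID | Product_Type | Purchase_Amount | Discount_Percentage | Delivery_Time_Days |
|---|---|---|---|---|---|
| 0 | 1 | T-shirt | 500.0 | 0.1 | 2.0 |
| 1 | 2 | Shorts | 1200.0 | 0.05 | 3.0 |
| 2 | 3 | Track Pants | NaN | 0.15 | 4.0 |
| 3 | 4 | T-shirt | 600.0 | NaN | 2.0 |
| 4 | 5 | None | 1500.0 | 0.03 | 5.0 |
| 5 | 6 | Joggers | 2000.0 | 0.0 | 6.0 |
| 6 | 7 | Cap | 100.0 | 0.2 | 1.0 |
| 7 | 8 | None | 700.0 | 0.12 | NaN |
| 8 | 9 | Shorts | 1300.0 | 0.07 | 4.0 |
| 9 | 10 | T-shirt | NaN | 0.5 | 3.0 |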


# 3. Define the Utility function for Mean imputation

The mean_imputation function takes a Pandas DataFrame and a column name as input. It calculates the mean of the specified column and then fills any missing values (NaN) in that column with this calculated mean. It returns a new DataFrame with the imputed values, leaving the original DataFrame unchanged.

mean_imputation(df, column) Function:

* Calculates the mean of the specified column.
* Uses fillna() to replace missing values with the calculated mean.
* Returns the imputed DataFrame.

In [5]:
# --- Mean Imputation ---

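def mean_imputation(df, column):
    """Return a copy of df with NaNs in `column` replaced by that column's mean."""
    mean_val = df[column].mean()  # NaNs are skipped by default
    df_imputed = df.copy()        # leave the caller's DataFrame untouched
    # Assign the result back; calling fillna(inplace=True) on a column selection
    # is a chained assignment that newer pandas versions warn about or ignore.
    df_imputed[column] = df_imputed[column].fillna(mean_val)
    return df_imputed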
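
Because mean imputation does not apply to categorical columns such as Product_Type, one option is to check the column type before imputing. The sketch below is a hypothetical variant, `mean_imputation_checked`, which is not part of the notebook. It assumes that raising a `TypeError` is the desired behaviour for non-numeric columns.

```python
import numpy as np
import pandas as pd
from pandas.api.types import is_numeric_dtype


def mean_imputation_checked(df, column):
    """Hypothetical variant of mean_imputation that rejects non-numeric columns."""
    if not is_numeric_dtype(df[column]):
        raise TypeError(f"Column {column!r} is not numeric; mean imputation does not apply.")
    df_imputed = df.copy()
    df_imputed[column] = df_imputed[column].fillna(df_imputed[column].mean())
    return df_imputed


# Small illustrative frame with one categorical and one numeric column.
sample = pd.DataFrame({
    'Product_Type': ['T-shirt', None, 'Cap'],
    'Purchase_Amount': [500.0, np.nan, 100.0],
})

# Numeric column: the NaN is replaced by the mean of 500 and 100.
print(mean_imputation_checked(sample, 'Purchase_Amount'))

# Categorical column: the check raises instead of imputing.
try:
    mean_imputation_checked(sample, 'Product_Type')
except TypeError as exc:
    print(exc)
```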

# 4. Execution of the utility function

### A. Impute the missing values in 'Purchase_Amount' column

In [11]:
data['Purchase_Amount'].mean()

np.float64(987.5)

In [8]:
data_mean_imputed = mean_imputation(data, 'Purchase_Amount')  # the function copies internally
print("\nData after Mean Imputation (Purchase_Amount):\n")
data_mean_imputed


Data after Mean Imputation (Purchase_Amount):



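|   | CustomerID | Product_Type | Purchase_Amount | Discount_Percentage | Delivery_Time_Days |
|---|---|---|---|---|---|
| 0 | 1 | T-shirt | 500.0 | 0.1 | 2.0 |
| 1 | 2 | Shorts | 1200.0 | 0.05 | 3.0 |
| 2 | 3 | Track Pants | 987.5 | 0.15 | 4.0 |
| 3 | 4 | T-shirt | 600.0 | NaN | 2.0 |
| 4 | 5 | None | 1500.0 | 0.03 | 5.0 |
| 5 | 6 | Joggers | 2000.0 | 0.0 | 6.0 |
| 6 | 7 | Cap | 100.0 | 0.2 | 1.0 |
| 7 | 8 | None | 700.0 | 0.12 | NaN |
| 8 | 9 | Shorts | 1300.0 | 0.07 | 4.0 |
| 9 | 10 | T-shirt | 987.5 | 0.5 | 3.0 |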


As you can see, the two rows with NaN values in Purchase_Amount (index 2 and 9) now hold 987.5, the mean of the column.

### B. Impute the missing values in 'Discount_Percentage' column

In [13]:
data['Discount_Percentage'].mean()

np.float64(0.13555555555555554)

In [12]:
data_mean_imputed = mean_imputation(data_mean_imputed, 'Discount_Percentage')  # the function copies internally
print("\nData after Mean Imputation (Discount_Percentage):\n")
data_mean_imputed


Data after Mean Imputation (Discount_Percentage):



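|   | CustomerID | Product_Type | Purchase_Amount | Discount_Percentage | Delivery_Time_Days |
|---|---|---|---|---|---|
| 0 | 1 | T-shirt | 500.0 | 0.1 | 2.0 |
| 1 | 2 | Shorts | 1200.0 | 0.05 | 3.0 |
| 2 | 3 | Track Pants | 987.5 | 0.15 | 4.0 |
| 3 | 4 | T-shirt | 600.0 | 0.135556 | 2.0 |
| 4 | 5 | None | 1500.0 | 0.03 | 5.0 |
| 5 | 6 | Joggers | 2000.0 | 0.0 | 6.0 |
| 6 | 7 | Cap | 100.0 | 0.2 | 1.0 |
| 7 | 8 | None | 700.0 | 0.12 | NaN |
| 8 | 9 | Shorts | 1300.0 | 0.07 | 4.0 |
| 9 | 10 | T-shirt | 987.5 | 0.5 | 3.0 |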


As you can see, the row with the NaN value in Discount_Percentage (index 3) now holds 0.135556, the mean of the column as rounded for display.
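
To confirm which columns still contain missing values after sections A and B, one option is to count the NaNs per column with `isna().sum()`. The sketch below rebuilds only the numeric columns of the sample data so that it runs on its own; the variable names `df` and `remaining` are illustrative.

```python
import numpy as np
import pandas as pd

# Numeric columns copied from the sample data above.
df = pd.DataFrame({
    'Purchase_Amount': [500, 1200, np.nan, 600, 1500, 2000, 100, 700, 1300, np.nan],
    'Discount_Percentage': [0.1, 0.05, 0.15, np.nan, 0.03, 0.0, 0.2, 0.12, 0.07, 0.5],
    'Delivery_Time_Days': [2, 3, 4, 2, 5, 6, 1, np.nan, 4, 3],
})

# Impute the same two columns as in sections A and B.
for col in ['Purchase_Amount', 'Discount_Percentage']:
    df[col] = df[col].fillna(df[col].mean())

# Count the missing values left in each column.
remaining = df.isna().sum()
print(remaining)
```

Delivery_Time_Days still reports one missing value, because only the two columns in sections A and B were imputed.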