# Simple Imputation - Mode Imputation

Mode Imputation (The "Most Common Guess")

When to Use:

* When dealing with categorical data.
* When the most frequent value is a reasonable stand-in for the missing entries.
* When you have discrete numerical data (e.g., ratings, counts) and you want to use the most common value.
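As a small illustration of the discrete-numeric case, the sketch below compares the mode and the mean of a ratings column. The values mirror the Satisfaction_Score column used later in this notebook; the variable names are only for this example. The mean lands between two valid ratings, while the mode is always a value that actually occurs.

```python
import pandas as pd
import numpy as np

# Ratings on a 1-5 scale with two missing entries
scores = pd.Series([4, 5, 3, np.nan, 4, 5, 2, 4, 3, np.nan])

mode_score = scores.mode()[0]   # most frequent observed rating
mean_score = scores.mean()      # average of the observed ratings

print("mode:", mode_score)      # a rating that really appears in the data
print("mean:", mean_score)      # may fall between two valid ratings

# Fill the gaps with the mode so every value stays a valid rating
scores_filled = scores.fillna(mode_score)
print(scores_filled.tolist())
```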

How it Works:

* Calculates the mode (most frequent value) of the existing values in a column.
* Replaces missing values with that calculated mode.
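The two steps can be sketched on a toy categorical column. The `colors` Series is made up for illustration and is not part of the notebook's dataset.

```python
import pandas as pd

# Hypothetical toy column with two missing entries
colors = pd.Series(["red", "blue", "red", None, "green", None])

# Step 1: find the most frequent non-missing value
# (mode() ignores missing values by default)
mode_value = colors.mode()[0]

# Step 2: replace the missing entries with that value
colors_filled = colors.fillna(mode_value)

print(mode_value)
print(colors_filled.tolist())
```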

Limitations:

* Only suitable for categorical or discrete data.
* May introduce bias if the mode is not representative of the missing values.
* If there are multiple modes, the choice of which mode to use can be arbitrary.
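The last two limitations can be seen in a short sketch. The `plans` Series is invented for this example: two categories tie for the mode, and filling with the first one inflates its share of the column.

```python
import pandas as pd

# Hypothetical column: "Basic" and "Premium" both appear twice
plans = pd.Series(["Basic", "Premium", "Basic", "Premium", "Standard", None, None])

# mode() returns every tied value, in sorted order
modes = plans.mode()
print(modes.tolist())

# Category shares among observed values, before imputation
before = plans.value_counts(normalize=True)

# Category shares after filling the gaps with the first mode
after = plans.fillna(modes[0]).value_counts(normalize=True)

print(before["Basic"], after["Basic"])  # the chosen mode's share grows
```

Picking `modes[0]` here is a convention rather than a statistically motivated choice; picking "Premium" would be equally defensible for this data.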


# Notebook Structure

1. Import necessary dependencies
2. Create the dataset
3. Define the Utility function for Mode imputation
4. Execution of the utility function


# 1. Import necessary dependencies

In [37]:
# libraries & dataset

import pandas as pd
import numpy as np

# 2. Create the dataset

In [38]:
# Create Sample DataFrame with missing values

data = pd.DataFrame({
    'UserID': [101, 102, 103, 104, 105, 106, 107, 108, 109, 110],
    'City': ['Kolkata', 'Mumbai', None, 'Delhi', 'Kolkata',
             'Chennai', 'Mumbai', 'Bangalore', None, 'Delhi'],
    'Gender': ['Male', 'Female', 'Female', None, 'Male',
               'Female', 'Male', 'Male', 'Female', None],
    'Education': ['Graduate', 'Post-Graduate', 'Under-Graduate', 'Graduate', None,
                  'Post-Graduate', None, 'Graduate', 'Under-Graduate', 'Graduate'],
    'Employment_Status': ['Employed', 'Self-Employed', 'Unemployed', 'Employed', 'Employed',
                          None, 'Self-Employed', 'Employed', 'Unemployed', None],
    'Subscription_Type': ['Basic', None, 'Premium', 'Basic', 'Standard',
                          'Premium', 'Basic', None, 'Standard', 'Premium'],
    'Usage_Hours': [10, 15, np.nan, 12, 8, 20, 14, 9, np.nan, 11],
    'Satisfaction_Score': [4, 5, 3, np.nan, 4, 5, 2, 4, 3, np.nan]
})


In [39]:
print("Original Data:\n")
data

Original Data:



,UserID,City,Gender,Education,Employment_Status,Subscription_Type,Usage_Hours,Satisfaction_Score
0,101,Kolkata,Male,Graduate,Employed,Basic,10.0,4.0
1,102,Mumbai,Female,Post-Graduate,Self-Employed,,15.0,5.0
2,103,,Female,Under-Graduate,Unemployed,Premium,,3.0
3,104,Delhi,,Graduate,Employed,Basic,12.0,
4,105,Kolkata,Male,,Employed,Standard,8.0,4.0
5,106,Chennai,Female,Post-Graduate,,Premium,20.0,5.0
6,107,Mumbai,Male,,Self-Employed,Basic,14.0,2.0
7,108,Bangalore,Male,Graduate,Employed,,9.0,4.0
8,109,,Female,Under-Graduate,Unemployed,Standard,,3.0
9,110,Delhi,,Graduate,,Premium,11.0,


# 3. Define the Utility function for Mode imputation

The mode_imputation function takes a pandas DataFrame and a column name. It computes the mode (most frequent value) of that column, copies the DataFrame, fills the missing values (NaN or None) in that column of the copy with the mode, and returns the copy; the input DataFrame is left unchanged.

mode_imputation(df, column) Function:

* Calculates the mode of the specified column.
* Uses fillna() to replace missing values with the calculated mode.
* Returns the imputed DataFrame.
* Because mode() returns a Series (several values can tie for most frequent), [0] selects the first one. pandas returns the modes in sorted order, so a tie resolves to the smallest value, or the alphabetically first one for strings.

In [40]:
# --- Mode Imputation ---

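def mode_imputation(df, column):
    """Return a copy of df with missing values in `column` replaced by the column's mode."""
    # mode() returns a Series (ties are possible), so take the first element
    mode_val = df[column].mode()[0]
    df_imputed = df.copy()
    # Assign the result back instead of calling fillna(inplace=True) on a column
    # selection, which may not update df_imputed under pandas copy-on-write
    df_imputed[column] = df_imputed[column].fillna(mode_val)
    return df_imputed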
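The function above takes the first element of mode(), which assumes the column has at least one observed value. For a column that is entirely missing, mode() returns an empty Series and the lookup fails. The following is a minimal sketch of a more defensive variant; the name `mode_imputation_safe` and the `demo` DataFrame are hypothetical and not part of the original notebook.

```python
import pandas as pd

def mode_imputation_safe(df, column):
    """Hypothetical variant of mode_imputation that tolerates all-missing columns."""
    modes = df[column].mode()  # missing values are ignored by default
    df_imputed = df.copy()
    if modes.empty:
        # No observed values, so there is no mode to impute with;
        # return the copy with the column left as it is
        return df_imputed
    df_imputed[column] = df_imputed[column].fillna(modes.iloc[0])
    return df_imputed

# Made-up data: "Notes" has no observed values at all
demo = pd.DataFrame({
    "Plan": ["Basic", None, "Basic", "Premium"],
    "Notes": [None, None, None, None],
})

result = mode_imputation_safe(demo, "Plan")
result = mode_imputation_safe(result, "Notes")
print(result)
```

Leaving an all-missing column untouched is one possible design choice; raising an error or filling with a placeholder such as "Unknown" may suit other pipelines better.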

# 4. Execution of the utility function

### A. Impute the missing values in 'Education' column

In [41]:
data['Education'].mode()

,Education
0,Graduate


In [42]:
data_mode_imputed = mode_imputation(data, 'Education')  # the function already works on a copy
print("\nData after Mode Imputation (Education):\n")
data_mode_imputed


Data after Mode Imputation (Education):



,UserID,City,Gender,Education,Employment_Status,Subscription_Type,Usage_Hours,Satisfaction_Score
0,101,Kolkata,Male,Graduate,Employed,Basic,10.0,4.0
1,102,Mumbai,Female,Post-Graduate,Self-Employed,,15.0,5.0
2,103,,Female,Under-Graduate,Unemployed,Premium,,3.0
3,104,Delhi,,Graduate,Employed,Basic,12.0,
4,105,Kolkata,Male,Graduate,Employed,Standard,8.0,4.0
5,106,Chennai,Female,Post-Graduate,,Premium,20.0,5.0
6,107,Mumbai,Male,Graduate,Self-Employed,Basic,14.0,2.0
7,108,Bangalore,Male,Graduate,Employed,,9.0,4.0
8,109,,Female,Under-Graduate,Unemployed,Standard,,3.0
9,110,Delhi,,Graduate,,Premium,11.0,


As you can see, the rows where Education was missing (rows 4 and 6) are now filled with Graduate, the mode of the column.

### B. Impute the missing values in 'Employment_Status' column

In [43]:
data['Employment_Status'].mode()

,Employment_Status
0,Employed


In [44]:
data_mode_imputed = mode_imputation(data_mode_imputed, 'Employment_Status')  # the function already works on a copy
print("\nData after Mode Imputation (Employment_Status):\n")
data_mode_imputed


Data after Mode Imputation (Employment_Status):



,UserID,City,Gender,Education,Employment_Status,Subscription_Type,Usage_Hours,Satisfaction_Score
0,101,Kolkata,Male,Graduate,Employed,Basic,10.0,4.0
1,102,Mumbai,Female,Post-Graduate,Self-Employed,,15.0,5.0
2,103,,Female,Under-Graduate,Unemployed,Premium,,3.0
3,104,Delhi,,Graduate,Employed,Basic,12.0,
4,105,Kolkata,Male,Graduate,Employed,Standard,8.0,4.0
5,106,Chennai,Female,Post-Graduate,Employed,Premium,20.0,5.0
6,107,Mumbai,Male,Graduate,Self-Employed,Basic,14.0,2.0
7,108,Bangalore,Male,Graduate,Employed,,9.0,4.0
8,109,,Female,Under-Graduate,Unemployed,Standard,,3.0
9,110,Delhi,,Graduate,Employed,Premium,11.0,


As you can see, the rows where Employment_Status was missing (rows 5 and 9) are now filled with Employed, the mode of the column.

### C. Impute the missing values in 'Subscription_Type' column

In [45]:
data['Subscription_Type'].mode()

,Subscription_Type
0,Basic
1,Premium


In [46]:
data_mode_imputed = mode_imputation(data_mode_imputed, 'Subscription_Type')  # the function already works on a copy
print("\nData after Mode Imputation (Subscription_Type):\n")
data_mode_imputed


Data after Mode Imputation (Subscription_Type):



,UserID,City,Gender,Education,Employment_Status,Subscription_Type,Usage_Hours,Satisfaction_Score
0,101,Kolkata,Male,Graduate,Employed,Basic,10.0,4.0
1,102,Mumbai,Female,Post-Graduate,Self-Employed,Basic,15.0,5.0
2,103,,Female,Under-Graduate,Unemployed,Premium,,3.0
3,104,Delhi,,Graduate,Employed,Basic,12.0,
4,105,Kolkata,Male,Graduate,Employed,Standard,8.0,4.0
5,106,Chennai,Female,Post-Graduate,Employed,Premium,20.0,5.0
6,107,Mumbai,Male,Graduate,Self-Employed,Basic,14.0,2.0
7,108,Bangalore,Male,Graduate,Employed,Basic,9.0,4.0
8,109,,Female,Under-Graduate,Unemployed,Standard,,3.0
9,110,Delhi,,Graduate,Employed,Premium,11.0,


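As you can see, the rows where Subscription_Type was missing (rows 1 and 7) are now filled with Basic. This column has two modes, Basic and Premium, with three occurrences each; mode() lists them in sorted order, so [0] selects Basic. Premium would have been an equally valid choice, which illustrates the arbitrariness noted under Limitations.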
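The steps above impute one column per call. When several categorical columns need the same treatment, one possible shortcut is to compute the first mode of each column with DataFrame.mode() and pass the result to fillna(). The sketch below uses a small made-up DataFrame and a hypothetical `categorical_cols` list; it is an alternative to calling mode_imputation repeatedly, not a replacement for it.

```python
import pandas as pd

# Made-up data with gaps in two categorical columns and one numeric column
df = pd.DataFrame({
    "City": ["Kolkata", "Mumbai", None, "Kolkata"],
    "Gender": ["Male", None, "Female", "Female"],
    "Usage_Hours": [10.0, None, 12.0, 8.0],
})

# Columns to impute with their mode (chosen by hand for this example)
categorical_cols = ["City", "Gender"]

# DataFrame.mode() returns one column of modes per input column;
# the first row holds the first mode of each column
first_modes = df[categorical_cols].mode().iloc[0]

df_imputed = df.copy()
# fillna() with a Series fills each column using the value under its label
df_imputed[categorical_cols] = df_imputed[categorical_cols].fillna(first_modes)

print(df_imputed)
```

The numeric Usage_Hours column is deliberately left alone here, since a continuous measurement is usually better served by a different imputation strategy.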