# Create a New Category Imputation

What is "Create a New Category" Imputation?

"Create a New Category" imputation is a straightforward technique for handling missing values in categorical data.  Instead of predicting the missing category, it treats the absence of information as a new, valid category.  This method is particularly useful when the fact that a value is missing is itself potentially informative.

How it Works:

* Identify Missing Values: The algorithm identifies where categorical values are missing in a particular column.
* Define New Category: A new category label is created to represent the missing data (e.g., "Missing", "Unknown", "Not Specified").
* Replace Missing Values: All missing values in the selected categorical column are replaced with this new category label.
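The three steps above can be sketched directly in pandas. This is a minimal illustration, not the utility defined later in the notebook; the `colors` Series and the variable names are made up for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical categorical column with two missing entries (illustration only)
colors = pd.Series(['Blue', np.nan, 'Green', np.nan], name='Product Color')

# Step 1: identify missing values with a boolean mask
missing_mask = colors.isna()
n_missing = int(missing_mask.sum())

# Step 2: define the label that will represent the missing data
new_label = "Unknown"

# Step 3: keep the existing values and put the new label wherever the mask is True
colors_imputed = colors.where(~missing_mask, new_label)

print(f"Missing entries found: {n_missing}")
print(colors_imputed.tolist())
```

Keeping the mask as a separate step makes it easy to report how many entries were affected before they are overwritten.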

Example:

Consider a dataset of customer reviews with a "Product Color" column.  If some reviews lack color information, this method would:

* Identify the missing "Product Color" values.
* Assign all these missing entries to a new category, such as "Unknown".
* Produce a modified dataset in which the absence of a product color is explicitly represented, rather than left as a gap.

# Notebook Structure

1. Import necessary dependencies
2. Create the dataset
3. Define the utility function for "Create a New Category" imputation
4. Execution of the utility function


# 1. Import necessary dependencies

In [18]:
# Libraries

import pandas as pd
import numpy as np

# 2. Create the dataset

In [19]:
# Sample customer feedback data with missing 'Product Color' values

data = pd.DataFrame({
    'Feedback ID': [1, 2, 3, 4, 5, 6],
    'Product Name': ['Widget A', 'Gadget B', 'Widget A', 'Gizmo C', 'Gadget B', 'Widget A'],
    'Product Color': ['Blue', 'Red', np.nan, 'Green', np.nan, 'Blue'],
    'Rating': [4, 5, 3, 4, 2, 5]
})

In [20]:
print("Original Data:\n")
data

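Original Data:

|   | Feedback ID | Product Name | Product Color | Rating |
|---|---|---|---|---|
| 0 | 1 | Widget A | Blue | 4 |
| 1 | 2 | Gadget B | Red | 5 |
| 2 | 3 | Widget A | NaN | 3 |
| 3 | 4 | Gizmo C | Green | 4 |
| 4 | 5 | Gadget B | NaN | 2 |
| 5 | 6 | Widget A | Blue | 5 |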


# 3. Define the create_new_category_imputation utility function

The function create_new_category_imputation imputes missing values in a specified categorical column of a DataFrame. It replaces these missing values with a new category label, defaulting to "Unknown", and returns the modified DataFrame.

The function create_new_category_imputation(df, categorical_column, new_category_label="Unknown"):
* Takes the input DataFrame, the name of the categorical column to impute, and the new category label as arguments.
* Creates a copy of the DataFrame to avoid modifying the original data.
* Raises a ValueError if the specified categorical column is not found in the DataFrame.
* Replaces missing values (NaN) in the specified categorical column with the provided new_category_label (defaults to "Unknown").
* Returns the DataFrame with the missing values imputed.

In [21]:
def create_new_category_imputation(df, categorical_column, new_category_label="Unknown"):
    """
    Imputes missing values in a categorical column by replacing them with a new category label.

    Args:
        df (pd.DataFrame): The input DataFrame.
        categorical_column (str): The name of the categorical column to impute.
        new_category_label (str, optional): The label to use for the new category. Defaults to "Unknown".

    Returns:
        pd.DataFrame: A new DataFrame with the missing values imputed.
    """
    df_imputed = df.copy()  # Create a copy to avoid modifying the original DataFrame

    # Check if the categorical column exists
    if categorical_column not in df_imputed.columns:
        raise ValueError(f"Categorical column '{categorical_column}' not found in DataFrame.")

    # Replace missing values in the specified column with the new category label
    df_imputed[categorical_column] = df_imputed[categorical_column].fillna(new_category_label)

    return df_imputed
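One caveat worth noting: the function above works as written for object (string) columns like the one in this notebook. If the column uses the pandas `category` dtype, `fillna` rejects a label that is not already one of the declared categories, so the label has to be registered first. The sketch below shows one way to handle that case; the helper name `fill_categorical_with_new_label` is hypothetical and is not part of the notebook's utility.

```python
import numpy as np
import pandas as pd

def fill_categorical_with_new_label(series, new_category_label="Unknown"):
    """Fill missing values with a new label, registering it first for category dtype."""
    if isinstance(series.dtype, pd.CategoricalDtype):
        # A category-dtype column only accepts values from its declared categories,
        # so add the new label to the category set before filling.
        if new_category_label not in series.cat.categories:
            series = series.cat.add_categories([new_category_label])
    return series.fillna(new_category_label)

# Same colors as the sample data, but stored with the category dtype (assumption)
colors = pd.Series(['Blue', 'Red', np.nan, 'Green', np.nan, 'Blue'], dtype="category")
filled = fill_categorical_with_new_label(colors)

print(filled.tolist())
print(list(filled.cat.categories))
```

The result keeps the `category` dtype, with "Unknown" added as one more level alongside the original colors.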

# 4. Execution of the utility function

In [22]:
# Apply the imputation to the 'Product Color' column

data_imputed = create_new_category_imputation(data, 'Product Color', new_category_label="Unknown")

In [23]:
print("\nData After 'Create a New Category' Imputation:\n")

data_imputed


Data After 'Create a New Category' Imputation:

|   | Feedback ID | Product Name | Product Color | Rating |
|---|---|---|---|---|
| 0 | 1 | Widget A | Blue | 4 |
| 1 | 2 | Gadget B | Red | 5 |
| 2 | 3 | Widget A | Unknown | 3 |
| 3 | 4 | Gizmo C | Green | 4 |
| 4 | 5 | Gadget B | Unknown | 2 |
| 5 | 6 | Widget A | Blue | 5 |


All the NaN values in the categorical 'Product Color' column are replaced with the new category 'Unknown'; the other columns are unchanged.
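A quick sanity check can confirm the result. The sketch below rebuilds the sample data so it runs on its own and applies the same `fillna` step that the utility function performs. It then checks that no missing values remain, that the original DataFrame is untouched, and that "Unknown" now takes part in grouping like any other category. With only six rows, the group means are purely illustrative.

```python
import numpy as np
import pandas as pd

# Rebuild the notebook's sample data so this check is self-contained
data = pd.DataFrame({
    'Feedback ID': [1, 2, 3, 4, 5, 6],
    'Product Name': ['Widget A', 'Gadget B', 'Widget A', 'Gizmo C', 'Gadget B', 'Widget A'],
    'Product Color': ['Blue', 'Red', np.nan, 'Green', np.nan, 'Blue'],
    'Rating': [4, 5, 3, 4, 2, 5]
})

# Same operation the utility function performs: copy, then fill with the new label
data_imputed = data.copy()
data_imputed['Product Color'] = data_imputed['Product Color'].fillna("Unknown")

# No missing values remain in the imputed column
remaining_missing = int(data_imputed['Product Color'].isna().sum())

# The original DataFrame still has its two missing values
original_missing = int(data['Product Color'].isna().sum())

# "Unknown" appears as its own category in counts and group-level summaries
color_counts = data_imputed['Product Color'].value_counts()
mean_rating_by_color = data_imputed.groupby('Product Color')['Rating'].mean()

print(f"Missing after imputation: {remaining_missing}")
print(f"Missing in original data: {original_missing}")
print(color_counts)
print(mean_rating_by_color)
```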