<a href="https://colab.research.google.com/github/mmfara/Adversarial-Debiasing-Enhanced/blob/main/Conditional_Normal_Imputation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

###Assumes that the feature you're imputing is normally distributed within each group.

In [None]:
import numpy as np
import pandas as pd

def conditional_normal_impute(df, target_col, group_cols):
    """
    Impute missing values in a numerical column using group-wise normal distribution sampling.

    Parameters:
    - df (pd.DataFrame): Input data.
    - target_col (str): Column with missing numerical values to impute.
    - group_cols (str or list of str): Column(s) to group by (e.g., protected attributes).

    Returns:
    - df (pd.DataFrame): Copy of the input with imputed values in target_col.
      Groups with fewer than two observed values, or with zero variance,
      are left unimputed.
    """
    df = df.copy()
    df[target_col] = pd.to_numeric(df[target_col], errors='coerce')

    # Ensure group_cols is a list
    if isinstance(group_cols, str):
        group_cols = [group_cols]

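    # Group by intersectional group(s); rows with a missing group key are skipped
    grouped = df.groupby(group_cols)

    for group_keys, _ in grouped:
        # Depending on the pandas version, a single grouping column yields either
        # a scalar key or a 1-tuple; normalize to a tuple so zip() pairs correctly
        if not isinstance(group_keys, tuple):
            group_keys = (group_keys,)
        group_mask = np.all(
            [df[col] == val for col, val in zip(group_cols, group_keys)],
            axis=0
        )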
        group_values = df.loc[group_mask, target_col]

        mean = group_values.mean(skipna=True)
        std = group_values.std(skipna=True)
        na_mask = group_mask & df[target_col].isna()
        n_missing = na_mask.sum()

        if n_missing > 0 and not np.isnan(mean) and not np.isnan(std) and std > 0:
            sampled = np.random.normal(loc=mean, scale=std, size=n_missing)
            df.loc[na_mask, target_col] = sampled

    return df
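
The example calls in this notebook assume a DataFrame `df` with `income`, `gender`, and `race` columns already exists. For experimentation, a toy frame along these lines could be used. The values, group labels, and the 20% missingness rate are made up purely for illustration and are not from any real dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed so the toy data is reproducible
n = 200

# Hypothetical example data; the column names match the calls in this notebook.
df = pd.DataFrame({
    "gender": rng.choice(["F", "M"], size=n),
    "race": rng.choice(["A", "B", "C"], size=n),
    "income": rng.normal(loc=50_000, scale=10_000, size=n),
})

# Blank out roughly 20% of the income values to simulate missing data.
df.loc[rng.random(n) < 0.2, "income"] = np.nan

print(df["income"].isna().sum(), "missing income values")
```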


###One group column (e.g., gender):

In [None]:
df_imputed = conditional_normal_impute(df, target_col='income', group_cols='gender')

###Multiple group columns (e.g., gender + race):

In [None]:
df_imputed = conditional_normal_impute(df, target_col='income', group_cols=['gender', 'race'])


If the data are skewed, imputing from a normal distribution (as in the conditional normal imputation above) can produce unrealistic or biased values, for example negative draws for a strictly positive feature such as income. In such cases, prefer imputation methods that respect the observed distribution of the data.
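
One way to decide whether the normal assumption is reasonable is to look at the sample skewness within each group before imputing. The helper below is a minimal sketch: `groupwise_skewness` is a hypothetical name, and the cutoff of 1 in absolute value is only a rough rule of thumb assumed here, not a threshold taken from this notebook.

```python
import pandas as pd

def groupwise_skewness(df, target_col, group_cols):
    """Return the sample skewness of target_col within each group."""
    values = pd.to_numeric(df[target_col], errors="coerce")
    tmp = df.assign(**{target_col: values})
    # Series.skew() ignores missing values by default
    return tmp.groupby(group_cols)[target_col].agg(lambda s: s.skew())

# Hypothetical toy data: group "a" has a long right tail, group "b" is symmetric.
demo = pd.DataFrame({
    "g": ["a"] * 5 + ["b"] * 5,
    "x": [1, 1, 1, 2, 10, 1, 2, 3, 4, 5],
})

skew = groupwise_skewness(demo, "x", "g")
print(skew)

# Assumed rule of thumb: flag groups whose |skewness| exceeds 1.
flagged = skew[skew.abs() > 1].index.tolist()
print("Groups where normal imputation looks doubtful:", flagged)
```

Groups that are flagged this way are candidates for the alternatives listed below rather than for normal sampling.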

1. Conditional Sampling from Empirical Distribution
Instead of assuming normality, sample from the actual observed values within each group.

In [None]:
def conditional_empirical_impute(df, target_col, group_cols):
    """
    Impute missing numerical values by sampling from the empirical (observed) distribution
    within group(s).
    """
    df = df.copy()
    df[target_col] = pd.to_numeric(df[target_col], errors='coerce')
    if isinstance(group_cols, str):
        group_cols = [group_cols]

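    # Rows with a missing group key are skipped by groupby
    grouped = df.groupby(group_cols)

    for group_keys, _ in grouped:
        # Depending on the pandas version, a single grouping column yields either
        # a scalar key or a 1-tuple; normalize to a tuple so zip() pairs correctly
        if not isinstance(group_keys, tuple):
            group_keys = (group_keys,)
        group_mask = np.all(
            [df[col] == val for col, val in zip(group_cols, group_keys)],
            axis=0
        )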
        group_values = df.loc[group_mask, target_col].dropna()
        na_mask = group_mask & df[target_col].isna()

        if not group_values.empty and na_mask.sum() > 0:
            sampled = np.random.choice(group_values, size=na_mask.sum(), replace=True)
            df.loc[na_mask, target_col] = sampled

    return df


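2. Quantile-based Imputation
Use group-wise quantiles (for example, the group median, or values sampled from within the interquartile range) to impute values. Because quantiles are robust to extreme values, this limits the distortion that outliers cause in skewed data.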
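
The notebook gives no code for this option, so the following is only a sketch of two possible readings. `conditional_median_impute` fills each gap with the group median, and `conditional_iqr_impute` draws uniformly between the group's 25th and 75th percentiles. Both function names, the `random_state` parameter, and the choice of a uniform draw inside the IQR are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

def conditional_median_impute(df, target_col, group_cols):
    """Fill missing values with the median of the row's group."""
    df = df.copy()
    df[target_col] = pd.to_numeric(df[target_col], errors="coerce")
    observed = df[target_col]
    # transform() returns one value per row, aligned with the original rows
    group_median = df.groupby(group_cols)[target_col].transform("median")
    df[target_col] = np.where(observed.isna(), group_median, observed)
    return df

def conditional_iqr_impute(df, target_col, group_cols, random_state=None):
    """Fill missing values with uniform draws between the group's Q1 and Q3."""
    df = df.copy()
    df[target_col] = pd.to_numeric(df[target_col], errors="coerce")
    rng = np.random.default_rng(random_state)
    observed = df[target_col]
    grouped = df.groupby(group_cols)[target_col]
    q1 = grouped.transform(lambda s: s.quantile(0.25)).to_numpy()
    q3 = grouped.transform(lambda s: s.quantile(0.75)).to_numpy()
    # One candidate draw per row; only those on missing rows are used
    draws = q1 + rng.random(len(df)) * (q3 - q1)
    df[target_col] = np.where(observed.isna(), draws, observed)
    return df

# Hypothetical toy data: group "b" contains an outlier (1000).
demo = pd.DataFrame({
    "g": ["a", "a", "a", "a", "b", "b", "b", "b"],
    "x": [1, 2, 3, np.nan, 10, 20, np.nan, 1000],
})

median_filled = conditional_median_impute(demo, "x", "g")
iqr_filled = conditional_iqr_impute(demo, "x", "g", random_state=0)
print(median_filled)
print(iqr_filled)
```

Median imputation is deterministic and shrinks the within-group variance, while the IQR variant keeps some spread but never produces values in the tails. Groups with no observed values stay missing in both sketches.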



3. Model-Based Imputation (Advanced)
Train a regression model per group to predict missing values from other features. This works well with complex, skewed, or nonlinear data, but it requires more setup.
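
As a rough illustration of the per-group idea, the sketch below fits an ordinary least-squares line within each group using only NumPy and predicts the missing targets from the other features. It is deliberately the simplest possible model, so it does not capture nonlinear structure; a more flexible regressor could be swapped in. The function name `conditional_regression_impute` and the `feature_cols` parameter are hypothetical and not part of this notebook.

```python
import numpy as np
import pandas as pd

def conditional_regression_impute(df, target_col, group_cols, feature_cols):
    """Impute target_col with a per-group least-squares fit on feature_cols."""
    df = df.copy()
    df[target_col] = pd.to_numeric(df[target_col], errors="coerce")
    y = df[target_col].to_numpy(dtype=float, copy=True)
    X = df[feature_cols].to_numpy(dtype=float)
    X = np.column_stack([np.ones(len(df)), X])  # prepend an intercept column

    # Integer id per row; rows with a missing group key get -1 and are skipped
    group_ids = df.groupby(group_cols).ngroup().to_numpy()
    features_ok = ~np.isnan(X).any(axis=1)

    for gid in np.unique(group_ids[group_ids >= 0]):
        in_group = group_ids == gid
        train = in_group & features_ok & ~np.isnan(y)
        predict = in_group & features_ok & np.isnan(y)
        # Need something to predict and at least as many rows as coefficients
        if predict.sum() == 0 or train.sum() < X.shape[1]:
            continue
        coef, *_ = np.linalg.lstsq(X[train], y[train], rcond=None)
        y[predict] = X[predict] @ coef

    df[target_col] = y
    return df

# Hypothetical toy data: y = 2x in group "a", y = 10 + x in group "b".
demo = pd.DataFrame({
    "g": ["a", "a", "a", "a", "b", "b", "b", "b"],
    "x": [1.0, 2.0, 3.0, 4.0, 1.0, 2.0, 3.0, 4.0],
    "y": [2.0, 4.0, 6.0, np.nan, 11.0, 12.0, np.nan, 14.0],
})

filled = conditional_regression_impute(demo, "y", "g", ["x"])
print(filled)
```

Because the predictions are deterministic, imputed points sit exactly on the fitted line and the imputed column understates the true variance; adding noise drawn from the residuals is one common way to counter that.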