# Statistical Matching Imputation

This notebook demonstrates how to use MicroImpute's Matching imputer, which imputes values by statistical matching. Statistical matching (also known as data fusion or synthetic matching) is a technique for integrating information from different data sources that share a set of common variables.

## How Matching Imputation Works

Statistical matching imputation works by finding similar records in a donor dataset and transferring values to a recipient dataset. The key concept is to match records based on common variables present in both datasets.

The Matching imputer in MicroImpute:
- Uses nearest neighbor distance hot deck matching via R's StatMatch package
- Identifies donors with similar characteristics to recipients
- Transfers values from donors to recipients based on the similarity of matching variables
- Can handle different types of variables, including categorical ones
- Makes no distributional assumptions, unlike parametric methods
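
The matching step described above can be sketched in a few lines of NumPy. This is a minimal illustration, not MicroImpute's or StatMatch's implementation: it assumes numeric matching variables, Euclidean distance, and unconstrained matching (a donor may be reused). The function name `nnd_hotdeck` and the synthetic data are hypothetical.

```python
import numpy as np


def nnd_hotdeck(donor_X, donor_y, recipient_X):
    """Copy y from the closest donor row (Euclidean distance) to each recipient."""
    # Pairwise squared distances, shape (n_recipients, n_donors)
    diffs = recipient_X[:, None, :] - donor_X[None, :, :]
    dist2 = (diffs ** 2).sum(axis=2)
    # Index of the nearest donor for every recipient
    nearest = dist2.argmin(axis=1)
    return donor_y[nearest], nearest


# Synthetic donor and recipient data (illustrative only)
rng = np.random.default_rng(0)
donor_X = rng.normal(size=(200, 3))
donor_y = donor_X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)
recipient_X = rng.normal(size=(50, 3))

imputed, donor_idx = nnd_hotdeck(donor_X, donor_y, recipient_X)
print(imputed[:5])
```

Every imputed value is an observed donor value, which is why the method needs no distributional model for the imputed variable.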

## Setup and Data Preparation

In [1]:
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split

# Import MicroImpute tools
from microimpute.comparisons.data import preprocess_data
from microimpute.models import Matching
from microimpute.config import QUANTILES

In [None]:
# Load the diabetes dataset
diabetes = load_diabetes()
df = pd.DataFrame(diabetes.data, columns=diabetes.feature_names)

# Display the first few rows of the dataset
df.head()

In [None]:
# Define variables for the model
predictors = ["age", "sex", "bmi", "bp"]
imputed_variables = ["s1"]  # We'll impute 's1' (total serum cholesterol)

# Create a subset with only needed columns
diabetes_df = df[predictors + imputed_variables]

# Display summary statistics
diabetes_df.describe()

In [None]:
# Split data into training and testing sets
X_train, X_test = train_test_split(diabetes_df, test_size=0.3, random_state=42)

# Let's see how many records we have in each set
print(f"Training set size: {X_train.shape[0]} records")
print(f"Testing set size: {X_test.shape[0]} records")

## Simulating Missing Data

For this example, we'll simulate missing data in our test set by removing the values we want to impute.

In [None]:
# Create a copy of the test set with missing values
X_test_missing = X_test.copy()

# Store the actual values for later comparison
actual_values = X_test_missing[imputed_variables].copy()

# Remove the values to be imputed
X_test_missing[imputed_variables] = np.nan

X_test_missing.head()

## Training and Using the Matching Imputer

Now we'll train the Matching imputer and use it to impute the missing values in our test set.

In [None]:
# Initialize the Matching imputer
matching_imputer = Matching()

# Fit the model with our training data
# This stores the donor records for later matching
matching_imputer.fit(X_train, predictors, imputed_variables)

In [None]:
# Impute values in the test set
# This will identify similar donors for each recipient and transfer values
imputed_values = matching_imputer.predict(X_test_missing, QUANTILES)

# Display the first few imputed values at the median (0.5 quantile)
imputed_values[0.5].head()

## Evaluating the Imputation Results

Now let's compare the imputed values with the actual values to evaluate the performance of our imputer.

In [None]:
# Extract median predictions for evaluation
median_predictions = imputed_values[0.5]

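# Calculate Mean Absolute Error (MAE) for the median predictions.
# Comparing NumPy arrays yields a single float and avoids index misalignment.
errors = median_predictions[imputed_variables].to_numpy() - actual_values.to_numpy()
mae = np.abs(errors).mean()
print(f"Mean Absolute Error: {mae:.4f}")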

In [None]:
# Create a scatter plot comparing actual vs. imputed values
plt.figure(figsize=(8, 6))
plt.scatter(actual_values, median_predictions, alpha=0.5)
plt.plot([actual_values.min().min(), actual_values.max().max()], 
         [actual_values.min().min(), actual_values.max().max()], 
         'r--')
plt.xlabel('Actual Values')
plt.ylabel('Imputed Values')
plt.title('Comparison of Actual vs. Imputed Values using Matching')
plt.grid(True, linestyle='--', alpha=0.7)
plt.show()

## Examining Quantile Predictions

The Matching imputer can also provide predictions at different quantiles, which can be useful for understanding the uncertainty in the imputation.

In [None]:
# Compare predictions at different quantiles for the first 5 records
quantiles_to_show = [0.1, 0.25, 0.5, 0.75, 0.9]
comparison_df = pd.DataFrame(index=range(5))

# Add actual values
comparison_df['Actual'] = actual_values.iloc[:5, 0].values

# Add quantile predictions
for q in quantiles_to_show:
    comparison_df[f'Q{int(q*100)}'] = imputed_values[q].iloc[:5, 0].values

comparison_df
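
Because a hot-deck match transfers one observed donor value to each recipient, it may be worth checking whether the quantile columns actually differ before reading them as an uncertainty band. The sketch below assumes the imputer returns a dictionary mapping each quantile to a DataFrame, as in the cells above. The helper name `quantiles_are_distinct` and the toy inputs are hypothetical.

```python
import pandas as pd


def quantiles_are_distinct(imputations, column):
    """Return True if any row has different values across quantile levels."""
    # One column per quantile level, one row per imputed record
    stacked = pd.DataFrame(
        {q: df[column].to_numpy() for q, df in imputations.items()}
    )
    return bool((stacked.nunique(axis=1) > 1).any())


# Toy inputs: identical values at every quantile vs. values that spread out
same = {q: pd.DataFrame({"s1": [0.1, 0.2]}) for q in (0.1, 0.5, 0.9)}
spread = {
    0.1: pd.DataFrame({"s1": [0.0, 0.1]}),
    0.5: pd.DataFrame({"s1": [0.1, 0.2]}),
    0.9: pd.DataFrame({"s1": [0.2, 0.3]}),
}

print(quantiles_are_distinct(same, "s1"))
print(quantiles_are_distinct(spread, "s1"))
```

In the notebook, the equivalent call would be `quantiles_are_distinct(imputed_values, "s1")`.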

## Advantages and Limitations of Matching Imputation

### Advantages:
- Preserves relationships between variables as they exist in the donor dataset
- Makes no parametric assumptions about the distribution of the data
- Can handle both continuous and categorical variables
- Relatively simple to understand conceptually
- Generally preserves the marginal distributions of the variables
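
The point about marginal distributions can be illustrated on synthetic data. The sketch below compares a nearest-neighbor hot deck with conditional-mean imputation from a fitted line. It is a simplified stand-in with a single matching variable, not the MicroImpute implementation, and all variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)

# Donors observe both x and y; recipients observe only x
x_donor = rng.normal(size=500)
y_donor = 2.0 * x_donor + rng.normal(scale=2.0, size=500)
x_recipient = rng.normal(size=500)

# Hot deck: copy y from the donor whose x is closest
nearest = np.abs(x_recipient[:, None] - x_donor[None, :]).argmin(axis=1)
y_hotdeck = y_donor[nearest]

# Conditional-mean imputation: predict y from a least-squares line
slope, intercept = np.polyfit(x_donor, y_donor, 1)
y_regmean = slope * x_recipient + intercept

print(f"Donor std:           {np.std(y_donor):.3f}")
print(f"Hot-deck std:        {np.std(y_hotdeck):.3f}")
print(f"Regression-mean std: {np.std(y_regmean):.3f}")
```

The hot-deck values carry the donors' residual noise along with them, so their spread stays close to the donor spread, while the regression means are compressed toward the fitted line.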

### Limitations:
- Requires a rich donor dataset with sufficient coverage
- May not perform well if donor and recipient populations differ significantly
- Matching quality depends on the selection of matching variables
- Limited ability to extrapolate beyond the range of the donor dataset
- May struggle with high-dimensional data due to the curse of dimensionality
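
Two of these limitations are easy to reproduce on synthetic data. The sketch below uses a plain Euclidean nearest-donor search as a stand-in for the matching step. The helper `nearest_donor` and the data are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)


def nearest_donor(donor_X, recipient_X):
    """Return the nearest donor index and its distance for each recipient."""
    d2 = ((recipient_X[:, None, :] - donor_X[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1), np.sqrt(d2.min(axis=1))


# 1) No extrapolation: recipients lie outside the donors' range of x
donor_X = rng.uniform(0, 1, size=(300, 1))
donor_y = 3.0 * donor_X[:, 0]
recipient_X = rng.uniform(1, 2, size=(100, 1))
idx, _ = nearest_donor(donor_X, recipient_X)
imputed = donor_y[idx]
print(f"Largest donor value:   {donor_y.max():.3f}")
print(f"Largest imputed value: {imputed.max():.3f}")

# 2) Curse of dimensionality: nearest donors get farther away as
#    the number of matching variables grows (sample sizes held fixed)
mean_dist = {}
for dim in (2, 10, 50):
    donors = rng.normal(size=(300, dim))
    recipients = rng.normal(size=(100, dim))
    _, dist = nearest_donor(donors, recipients)
    mean_dist[dim] = dist.mean()
    print(f"{dim:>2} matching variables: mean nearest-donor distance {mean_dist[dim]:.2f}")
```

In the first case, the relationship `y = 3x` would put the recipients' true values above every donor value, yet the imputed values cannot exceed the donor maximum.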