# Quantile Regression Imputation

This notebook demonstrates how to impute missing values with MicroImpute's QuantReg imputer. Quantile regression extends linear regression to estimate the conditional quantiles of a response variable rather than only its conditional mean, providing a more complete view of the relationship between variables.

## How Quantile Regression Imputation Works

Quantile regression imputation works by fitting separate regression models for different quantiles of the distribution. The QuantReg imputer in MicroImpute:

- Uses statsmodels' QuantReg as the underlying estimator
- Fits a separate model for each requested quantile (unlike OLS, which fits a single model for the conditional mean)
- Directly minimizes the quantile loss function for each quantile
- Allows for different relationships at different parts of the distribution
- Can capture heteroskedasticity (non-constant variance) and skewness in the data
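
The quantile loss mentioned above is often called the pinball loss. The following is a minimal pure-Python sketch of it, not the code used by MicroImpute or statsmodels. The helper names `pinball_loss` and `best_constant` are hypothetical and exist only for illustration. The sketch shows that the constant prediction minimizing the loss at quantile `q` is the matching sample quantile.

```python
def pinball_loss(y_true, y_pred, q):
    """Average quantile (pinball) loss of predictions y_pred at quantile q."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        diff = y - p
        # Under-predictions are weighted by q, over-predictions by (1 - q)
        total += q * diff if diff >= 0 else (q - 1) * diff
    return total / len(y_true)


def best_constant(y, q):
    """Return the sample value that minimizes the pinball loss when used
    as the same prediction for every observation (hypothetical helper)."""
    return min(y, key=lambda c: pinball_loss(y, [c] * len(y), q))


sample = list(range(1, 12))  # 1, 2, ..., 11

print(best_constant(sample, 0.5))  # 6, the sample median
print(best_constant(sample, 0.9))  # 10, close to the upper tail
```

A quantile regression model applies the same idea, except that the prediction is a linear function of the predictors instead of a constant.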

## Setup and Data Preparation

In [None]:
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split

# Import MicroImpute tools
from microimpute.comparisons.data import preprocess_data
from microimpute.models import QuantReg
from microimpute.config import QUANTILES

In [None]:
# Load the diabetes dataset
diabetes = load_diabetes()
df = pd.DataFrame(diabetes.data, columns=diabetes.feature_names)

# Display the first few rows of the dataset
df.head()

In [None]:
# Define variables for the model
predictors = ["age", "sex", "bmi", "bp"]
imputed_variables = ["s1"]  # We'll impute 's1' (total serum cholesterol)

# Create a subset with only needed columns
diabetes_df = df[predictors + imputed_variables]

# Display summary statistics
diabetes_df.describe()

In [None]:
# Split data into training and testing sets
X_train, X_test = train_test_split(diabetes_df, test_size=0.3, random_state=42)

# Let's see how many records we have in each set
print(f"Training set size: {X_train.shape[0]} records")
print(f"Testing set size: {X_test.shape[0]} records")

## Simulating Missing Data

For this example, we'll simulate missing data in our test set by removing the values we want to impute.

In [None]:
# Create a copy of the test set with missing values
X_test_missing = X_test.copy()

# Store the actual values for later comparison
actual_values = X_test_missing[imputed_variables].copy()

# Remove the values to be imputed
X_test_missing[imputed_variables] = np.nan

X_test_missing.head()

## Training and Using the QuantReg Imputer

Now we'll train the QuantReg imputer and use it to impute the missing values in our test set. For quantile regression, we need to explicitly specify which quantiles to model during fitting.

In [None]:
# Define quantiles we want to model
# We'll use the default quantiles from the config module
print(f"Modeling these quantiles: {QUANTILES}")

In [None]:
# Initialize the QuantReg imputer
quantreg_imputer = QuantReg()

# Fit the model with our training data
# This trains a separate regression model for each quantile
quantreg_imputer.fit(X_train, predictors, imputed_variables, quantiles=QUANTILES)

In [None]:
# Impute values in the test set
# This uses the trained quantile regression models to predict missing values
imputed_values = quantreg_imputer.predict(X_test_missing, QUANTILES)

# Display the first few imputed values at the median (0.5 quantile)
imputed_values[0.5].head()

## Evaluating the Imputation Results

Now let's compare the imputed values with the actual values to evaluate the performance of our imputer.

In [None]:
# Extract median predictions for evaluation
median_predictions = imputed_values[0.5]

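# Calculate Mean Absolute Error (MAE) for the median predictions
# .to_numpy() compares values by position (assumes predictions keep the test set's
# row order) and makes .mean() return a single float that the f-string can format
mae = np.abs(median_predictions.to_numpy() - actual_values.to_numpy()).mean()
print(f"Mean Absolute Error: {mae:.4f}")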

In [None]:
# Create a scatter plot comparing actual vs. imputed values
plt.figure(figsize=(8, 6))
plt.scatter(actual_values, median_predictions, alpha=0.5)
# Diagonal reference line: points on it are imputed perfectly
lims = [actual_values.min().min(), actual_values.max().max()]
plt.plot(lims, lims, 'r--')
plt.xlabel('Actual Values')
plt.ylabel('Imputed Values')
plt.title('Comparison of Actual vs. Imputed Values using QuantReg')
plt.grid(True, linestyle='--', alpha=0.7)
plt.show()

## Examining Quantile Predictions

Quantile regression provides predictions at different quantiles, which helps us understand the entire conditional distribution of the missing values.

In [None]:
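# Compare predictions at different quantiles for the first 5 records
# Every value listed here must be one of the quantiles that was fitted and
# predicted above (see QUANTILES); otherwise the lookup below will fail
quantiles_to_show = [0.1, 0.25, 0.5, 0.75, 0.9]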
comparison_df = pd.DataFrame(index=range(5))

# Add actual values
comparison_df['Actual'] = actual_values.iloc[:5, 0].values

# Add quantile predictions
for q in quantiles_to_show:
    # round() guards against floating-point truncation (e.g. 0.29 * 100 -> 28.99...)
    comparison_df[f'Q{int(round(q * 100))}'] = imputed_values[q].iloc[:5, 0].values

comparison_df
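
Because each quantile is fitted as a separate model, nothing forces the predictions for one record to be ordered, so a lower quantile can occasionally exceed a higher one (often called quantile crossing). The following is a minimal sketch of a check for this, using plain Python lists and made-up numbers. The function name `find_crossings` is hypothetical and is not part of MicroImpute.

```python
def find_crossings(predictions_by_quantile):
    """Return indices of records whose predictions decrease as q increases.

    predictions_by_quantile maps each quantile to a list of predictions,
    one per record, with all lists having the same length.
    """
    quantiles = sorted(predictions_by_quantile)
    n_records = len(predictions_by_quantile[quantiles[0]])
    crossed = []
    for i in range(n_records):
        # Predictions for record i, ordered from lowest to highest quantile
        row = [predictions_by_quantile[q][i] for q in quantiles]
        if any(lo > hi for lo, hi in zip(row, row[1:])):
            crossed.append(i)
    return crossed


# Toy predictions (made up for illustration): record 1 has Q10 > Q50
toy = {
    0.1: [1.0, 2.0, 3.0],
    0.5: [1.5, 1.8, 3.5],
    0.9: [2.0, 2.5, 4.0],
}
crossed = find_crossings(toy)
print(crossed)  # [1]
```

With the objects in this notebook, one way to build the input would be `{q: imputed_values[q].iloc[:, 0].tolist() for q in quantiles_to_show}`.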

## Visualizing Prediction Intervals

One key advantage of quantile regression is that any pair of fitted quantiles forms a prediction interval (for example, Q10 to Q90 gives an 80% interval), and the width of that interval can vary from record to record.

In [None]:
# Create a prediction interval plot for the first 10 records
plt.figure(figsize=(12, 6))

# Number of records to plot
n_records = 10

# X-axis positions
x = np.arange(n_records)

# Plot actual values
plt.scatter(x, actual_values.iloc[:n_records, 0], color='black', label='Actual', zorder=3)

# Plot median predictions
plt.scatter(x, imputed_values[0.5].iloc[:n_records, 0], color='red', label='Median (Q50)', zorder=3)

# Plot 50% prediction interval (Q25 to Q75)
plt.fill_between(x, 
                 imputed_values[0.25].iloc[:n_records, 0],
                 imputed_values[0.75].iloc[:n_records, 0],
                 alpha=0.3, color='blue', label='50% PI (Q25-Q75)')

# Plot 80% prediction interval (Q10 to Q90)
plt.fill_between(x, 
                 imputed_values[0.1].iloc[:n_records, 0],
                 imputed_values[0.9].iloc[:n_records, 0],
                 alpha=0.15, color='blue', label='80% PI (Q10-Q90)')

plt.xlabel('Record Index')
plt.ylabel('Value')
plt.title('QuantReg Imputation Prediction Intervals')
plt.grid(True, linestyle='--', alpha=0.7)
plt.legend()
plt.tight_layout()
plt.show()
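
The intervals plotted above can also be checked numerically by counting how often the actual value falls inside them. The following is a minimal sketch with made-up numbers. The function name `interval_coverage` is hypothetical and is not part of MicroImpute.

```python
def interval_coverage(actual, lower, upper):
    """Fraction of actual values that fall inside [lower, upper], record by record."""
    if not (len(actual) == len(lower) == len(upper)):
        raise ValueError("all inputs must have the same length")
    inside = sum(1 for a, lo, hi in zip(actual, lower, upper) if lo <= a <= hi)
    return inside / len(actual)


# Toy numbers (made up for illustration, not taken from the diabetes data)
actual = [0.02, -0.01, 0.05, 0.00, -0.03]
q10 = [-0.04, -0.05, -0.02, -0.03, -0.02]
q90 = [0.03, 0.02, 0.04, 0.04, 0.05]

coverage = interval_coverage(actual, q10, q90)
print(f"Empirical coverage of the Q10-Q90 interval: {coverage:.0%}")
```

With the objects in this notebook, the inputs could be built with `actual_values.iloc[:, 0].tolist()`, `imputed_values[0.1].iloc[:, 0].tolist()`, and `imputed_values[0.9].iloc[:, 0].tolist()`. For a well-calibrated model, the coverage of the Q10-Q90 interval should be close to 80% on a large enough test set.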

## Examining Coefficient Differences Across Quantiles

A distinctive feature of quantile regression is that the coefficients can vary across quantiles, allowing us to see how the relationship between the predictors and the imputed variable changes across the distribution.

In [None]:
# Function to extract coefficients from different quantile models
def extract_coefficients(imputer, variable, quantiles):
    """Extract coefficients from the fitted quantile regression models.

    Returns a DataFrame with one row per model term and one column per
    requested quantile, with columns ordered from lowest to highest quantile.
    """
    models = imputer.models[variable]
    # Keep only the requested quantiles that were actually fitted, in sorted order
    selected = sorted(q for q in models if q in quantiles)
    return pd.DataFrame(
        {f'Q{int(round(q * 100))}': models[q].params for q in selected}
    )

# Extract coefficients for selected quantiles
selected_quantiles = [0.1, 0.5, 0.9]
coef_df = extract_coefficients(quantreg_imputer, imputed_variables[0], selected_quantiles)

# Display the coefficients
coef_df

In [None]:
# Visualize how coefficients change across quantiles
plt.figure(figsize=(10, 6))

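# Plot each predictor's coefficient across quantiles
# Read the x positions from the column labels (e.g. 'Q10' -> 10) so they always match the data
quantile_levels = [int(col[1:]) for col in coef_df.columns]
for predictor in coef_df.index[1:]:  # Skip the intercept (assumed to be the first row)
    plt.plot(quantile_levels, coef_df.loc[predictor].values, marker='o', label=predictor)

plt.xlabel('Quantile (%)')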
plt.ylabel('Coefficient Value')
plt.title('Quantile Regression Coefficients Across Quantiles')
plt.grid(True, linestyle='--', alpha=0.7)
plt.legend()
plt.tight_layout()
plt.show()

## Advantages and Limitations of Quantile Regression Imputation

### Advantages:
- Models different parts of the distribution without assuming normally distributed errors
- Can capture heteroskedasticity and skewness in the data
- Provides a more complete picture of the relationship between variables
- More robust to outliers in the response variable than OLS
- Maintains the interpretability of linear models

### Limitations:
- Requires more computation than OLS (fitting multiple models)
- May face convergence issues with small sample sizes
- Less flexible than non-parametric methods like QRF for capturing non-linear relationships
- Still assumes a linear relationship at each quantile
- Requires careful selection of quantiles to model