# Quantile Random Forest (QRF) Imputation

This notebook demonstrates how to use MicroImpute's QRF imputer to impute missing values with Quantile Random Forests (QRF). QRF extends the standard random forest: instead of predicting only the conditional mean, it estimates the conditional distribution of a target variable, so any quantile can be predicted.

## How QRF Imputation Works

Quantile Random Forest imputation builds a non-parametric model that can predict any quantile of the conditional distribution of the missing values, given the predictors. The QRF imputer in MicroImpute:

- Uses a random forest model to predict quantiles directly
- Captures complex, non-linear relationships between variables
- Handles categorical features through one-hot encoding
- Models heteroskedasticity (where the variance of the target changes with the predictors)
- Can represent multimodal and skewed distributions
- Can give more flexible quantile predictions than parametric methods, because it assumes no functional form for the conditional distribution
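
The idea behind the list above can be sketched without MicroImpute. The following is a minimal illustration, not MicroImpute's actual implementation: it fits a plain scikit-learn `RandomForestRegressor`, weights each training target by how often it shares a leaf with a new record, and reads quantiles off the resulting weighted empirical distribution. The helper name `forest_quantiles`, the data split, and the hyperparameters are assumptions made for this example.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor

# Illustrative data split (not the one used later in this notebook)
X, y = load_diabetes(return_X_y=True)
X_tr, y_tr, X_new = X[:300], y[:300], X[300:305]

forest = RandomForestRegressor(
    n_estimators=50, min_samples_leaf=5, random_state=0
).fit(X_tr, y_tr)

# Leaf index of every record in every tree
train_leaves = forest.apply(X_tr)   # shape (n_train, n_trees)
new_leaves = forest.apply(X_new)    # shape (n_new, n_trees)

# Sort the training targets once; quantiles are read off this ordering
order = np.argsort(y_tr)
y_sorted = y_tr[order]


def forest_quantiles(new_leaf_row, quantiles):
    """Weighted empirical quantiles of y_tr for one new record (hypothetical helper)."""
    # True where a training record shares the new record's leaf in a given tree
    same_leaf = train_leaves == new_leaf_row
    # Within each tree, spread a weight of 1 evenly over the leaf; average over trees
    weights = (same_leaf / same_leaf.sum(axis=0)).mean(axis=1)
    cum_weights = np.cumsum(weights[order])
    # First sorted target whose cumulative weight reaches each quantile level
    idx = np.searchsorted(cum_weights, quantiles)
    return y_sorted[np.minimum(idx, len(y_sorted) - 1)]


levels = [0.1, 0.5, 0.9]
predicted = np.array([forest_quantiles(row, levels) for row in new_leaves])
print(predicted)  # one row per new record, one column per quantile level
```

Because the quantiles come from observed training targets, the predictions never leave the range of the training data and are non-decreasing across quantile levels for each record.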

## Setup and Data Preparation

In [None]:
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split

# Import MicroImpute tools
from microimpute.comparisons.data import preprocess_data
from microimpute.models import QRF
from microimpute.config import QUANTILES

In [None]:
# Load the diabetes dataset (scikit-learn returns mean-centered, scaled features by default)
diabetes = load_diabetes()
df = pd.DataFrame(diabetes.data, columns=diabetes.feature_names)

# Display the first few rows of the dataset
df.head()

In [None]:
# Define variables for the model
predictors = ["age", "sex", "bmi", "bp"]
imputed_variables = ["s1"]  # We'll impute 's1' (total serum cholesterol)

# Create a subset with only needed columns
diabetes_df = df[predictors + imputed_variables]

# Display summary statistics
diabetes_df.describe()

In [None]:
# Split data into training and testing sets
X_train, X_test = train_test_split(diabetes_df, test_size=0.3, random_state=42)

# Let's see how many records we have in each set
print(f"Training set size: {X_train.shape[0]} records")
print(f"Testing set size: {X_test.shape[0]} records")

## Simulating Missing Data

For this example, we'll simulate missing data in our test set by removing the values we want to impute.

In [None]:
# Create a copy of the test set with missing values
X_test_missing = X_test.copy()

# Store the actual values for later comparison
actual_values = X_test_missing[imputed_variables].copy()

# Remove the values to be imputed
X_test_missing[imputed_variables] = np.nan

X_test_missing.head()

## Training and Using the QRF Imputer

Now we'll train the QRF imputer and use it to impute the missing values in our test set.

In [None]:
# Initialize the QRF imputer
# Random forest hyperparameters are passed to fit() below
qrf_imputer = QRF()

# Fit the model with our training data
# This trains a quantile random forest model
qrf_imputer.fit(X_train, predictors, imputed_variables, n_estimators=100, min_samples_leaf=5)

In [None]:
# Impute values in the test set
# This uses the trained QRF model to predict missing values at specified quantiles
imputed_values = qrf_imputer.predict(X_test_missing, QUANTILES)

# Display the first few imputed values at the median (0.5 quantile)
imputed_values[0.5].head()

## Evaluating the Imputation Results

Now let's compare the imputed values with the actual values to evaluate the performance of our imputer.

In [None]:
# Extract median predictions for evaluation
median_predictions = imputed_values[0.5]

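# Calculate Mean Absolute Error (MAE) for the median predictions
# Convert to NumPy arrays so the subtraction is positional (not aligned on
# index labels) and the result is a single float that can be formatted
mae = np.abs(median_predictions.to_numpy() - actual_values.to_numpy()).mean()
print(f"Mean Absolute Error: {mae:.4f}")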

In [None]:
# Create a scatter plot comparing actual vs. imputed values
plt.figure(figsize=(8, 6))
plt.scatter(actual_values, median_predictions, alpha=0.5)
plt.plot([actual_values.min().min(), actual_values.max().max()], 
         [actual_values.min().min(), actual_values.max().max()], 
         'r--')
plt.xlabel('Actual Values')
plt.ylabel('Imputed Values')
plt.title('Comparison of Actual vs. Imputed Values using QRF')
plt.grid(True, linestyle='--', alpha=0.7)
plt.show()

## Examining Quantile Predictions

QRF provides predictions at different quantiles, allowing us to capture the entire conditional distribution of the missing values.

In [None]:
# Compare predictions at different quantiles for the first 5 records
quantiles_to_show = [0.1, 0.25, 0.5, 0.75, 0.9]
comparison_df = pd.DataFrame(index=range(5))

# Add actual values
comparison_df['Actual'] = actual_values.iloc[:5, 0].values

# Add quantile predictions
for q in quantiles_to_show:
    comparison_df[f'Q{int(q*100)}'] = imputed_values[q].iloc[:5, 0].values

comparison_df
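
MAE scores only the median. One common way to score predictions at other quantile levels is the pinball (quantile) loss, which penalizes under-prediction and over-prediction asymmetrically. The sketch below uses synthetic numbers and a hypothetical helper `pinball_loss`; it does not call MicroImpute. With the notebook's objects, each `imputed_values[q]` could be scored against `actual_values` in the same way.

```python
import numpy as np


def pinball_loss(y_true, y_pred, q):
    """Mean pinball loss of predictions y_pred at quantile level q (hypothetical helper)."""
    diff = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    # Under-prediction (diff > 0) costs q per unit; over-prediction costs 1 - q
    return float(np.mean(np.maximum(q * diff, (q - 1) * diff)))


# Synthetic example values, not results from the diabetes data
y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.5, 1.5, 3.5, 3.0])

for q in (0.1, 0.5, 0.9):
    print(f"q={q}: {pinball_loss(y_true, y_pred, q):.4f}")
```

At `q=0.5` the pinball loss equals half the MAE, so it is consistent with the median evaluation while extending to the other quantiles.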

## Visualizing Prediction Intervals

One of the advantages of QRF is that it can provide prediction intervals, which can help us understand the uncertainty in our imputed values.

In [None]:
# Create a prediction interval plot for the first 10 records
plt.figure(figsize=(12, 6))

# Number of records to plot
n_records = 10

# X-axis positions
x = np.arange(n_records)

# Plot actual values
plt.scatter(x, actual_values.iloc[:n_records, 0], color='black', label='Actual', zorder=3)

# Plot median predictions
plt.scatter(x, imputed_values[0.5].iloc[:n_records, 0], color='red', label='Median (Q50)', zorder=3)

# Plot 50% prediction interval (Q25 to Q75)
plt.fill_between(x, 
                 imputed_values[0.25].iloc[:n_records, 0],
                 imputed_values[0.75].iloc[:n_records, 0],
                 alpha=0.3, color='blue', label='50% PI (Q25-Q75)')

# Plot 80% prediction interval (Q10 to Q90)
plt.fill_between(x, 
                 imputed_values[0.1].iloc[:n_records, 0],
                 imputed_values[0.9].iloc[:n_records, 0],
                 alpha=0.15, color='blue', label='80% PI (Q10-Q90)')

plt.xlabel('Record Index')
plt.ylabel('Value')
plt.title('QRF Imputation Prediction Intervals')
plt.grid(True, linestyle='--', alpha=0.7)
plt.legend()
plt.tight_layout()
plt.show()
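
A plot of ten records gives only a visual impression. A simple numeric check is empirical coverage: the share of actual values that fall inside an interval, which for a well-calibrated 80% interval should be close to 0.8. The sketch below uses synthetic arrays and a hypothetical helper `interval_coverage`; with the notebook's objects, the bounds would be `imputed_values[0.1]` and `imputed_values[0.9]`.

```python
import numpy as np


def interval_coverage(actual, lower, upper):
    """Fraction of actual values inside [lower, upper] (hypothetical helper)."""
    actual, lower, upper = (
        np.asarray(a, dtype=float).ravel() for a in (actual, lower, upper)
    )
    return float(np.mean((actual >= lower) & (actual <= upper)))


# Synthetic example: standard normal draws with the true Q10 and Q90 as bounds
rng = np.random.default_rng(0)
actual = rng.normal(size=10_000)
lower = np.full_like(actual, -1.2816)  # approximate 10th percentile of N(0, 1)
upper = np.full_like(actual, 1.2816)   # approximate 90th percentile of N(0, 1)

print(f"Empirical coverage: {interval_coverage(actual, lower, upper):.3f}")
```

Coverage well below the nominal level suggests intervals that are too narrow; coverage well above it suggests intervals that are wider than necessary.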

## Advantages and Limitations of QRF Imputation

### Advantages:
- Captures non-linear relationships and interactions between variables
- Models heteroskedasticity (target variance that changes with the predictors)
- Handles categorical variables effectively
- Relatively robust to outliers and noisy data, since quantile estimates are less sensitive to extreme target values than means
- Can represent complex distributions including multimodality and skewness
- No distributional assumptions about residuals

### Limitations:
- More computationally intensive than parametric methods
- Typically requires more training data than parametric methods, especially to estimate tail quantiles reliably
- Less interpretable than linear models
- May overfit with small sample sizes if not properly tuned
- Prediction time increases with the number of trees

## Tuning the QRF Model

The QRF imputer accepts random forest hyperparameters as keyword arguments to `fit()`, and these can be adjusted to improve performance. The example below sets some of the key ones:

In [None]:
# Create a QRF imputer; custom parameters are passed to fit() below
tuned_qrf_imputer = QRF()

# Fit with custom parameters
tuned_qrf_imputer.fit(
    X_train, 
    predictors, 
    imputed_variables,
    n_estimators=200,            # Number of trees in the forest
    min_samples_leaf=3,          # Minimum samples required at a leaf node
    max_features='sqrt',         # Number of features to consider for best split
    bootstrap=True,              # Whether to use bootstrap samples
    random_state=42,             # Random seed for reproducibility
    n_jobs=-1                    # Use all available cores
)
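
One way to choose among such settings is a small grid search scored on a validation split held out from the training data (not the test set). The sketch below is library-agnostic: `grid_search` and the toy `evaluate` function are hypothetical names introduced for this example. In practice, `evaluate` might fit `QRF()` with the given keyword arguments on part of `X_train` and return a validation error such as the MAE at the median.

```python
from itertools import product


def grid_search(param_grid, evaluate):
    """Try every combination in param_grid; return the one with the lowest score."""
    names = list(param_grid)
    best_params, best_score = None, float("inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = evaluate(params)  # lower is better
        if score < best_score:
            best_params, best_score = params, score
    return best_params, best_score


def evaluate(params):
    """Toy stand-in for 'fit on a training part, score on a validation part'."""
    # Arbitrary function chosen only so the example runs without a model
    return abs(params["n_estimators"] - 200) / 100 + abs(params["min_samples_leaf"] - 3)


param_grid = {"n_estimators": [100, 200, 400], "min_samples_leaf": [1, 3, 5]}
best_params, best_score = grid_search(param_grid, evaluate)
print(best_params, best_score)
```

An exhaustive grid grows multiplicatively with each added parameter, so keeping the grid small matters when every evaluation trains a forest.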