# Ordinary Least Squares (OLS) Imputation

This notebook demonstrates how to use MicroImpute's OLS imputer to impute values using linear regression. OLS imputation is a parametric approach that assumes a linear relationship between the predictor variables and the variable being imputed.

## How OLS Imputation Works

OLS imputation works by fitting a linear regression model to predict missing values based on other observed variables. The OLS imputer in MicroImpute:

- Uses statsmodels OLS to fit a linear regression model
- Assumes normally distributed residuals to generate quantile predictions
- Predicts different quantiles by adding scaled normal quantiles to the mean prediction
- Provides symmetric predictions around the median due to the normal assumption
- Is efficient and works well when relationships between variables are approximately linear
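
The steps listed above can be illustrated with a minimal sketch. This is not MicroImpute's implementation: it uses synthetic data, fits the regression with `numpy.linalg.lstsq` rather than statsmodels, and assumes the residual scale is estimated as the usual degrees-of-freedom-corrected standard deviation. The helper `predict_quantile` is a hypothetical name introduced only for this illustration.

```python
from statistics import NormalDist

import numpy as np

# Synthetic data for illustration only (not the diabetes dataset)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = 1.0 + X @ np.array([2.0, -1.0]) + rng.normal(scale=0.5, size=200)

# Fit ordinary least squares with an explicit intercept column
X_design = np.column_stack([np.ones(len(X)), X])
beta, *_ = np.linalg.lstsq(X_design, y, rcond=None)

# Estimate the residual standard deviation (assumed: n - p degrees of freedom)
residuals = y - X_design @ beta
sigma = np.sqrt(residuals @ residuals / (len(y) - X_design.shape[1]))


def predict_quantile(X_new, q):
    """Mean prediction plus the q-th standard normal quantile scaled by sigma."""
    X_new_design = np.column_stack([np.ones(len(X_new)), X_new])
    return X_new_design @ beta + sigma * NormalDist().inv_cdf(q)


X_new = np.array([[0.5, -0.2]])
q10, q50, q90 = (predict_quantile(X_new, q) for q in (0.1, 0.5, 0.9))
print(q10, q50, q90)
```

Because the normal distribution is symmetric, the 0.1 and 0.9 predictions sit at equal distances below and above the 0.5 prediction, which in turn equals the mean prediction.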

## Setup and Data Preparation

In [None]:
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split

# Import MicroImpute tools
from microimpute.comparisons.data import preprocess_data
from microimpute.models import OLS
from microimpute.config import QUANTILES

In [None]:
# Load the diabetes dataset
diabetes = load_diabetes()
df = pd.DataFrame(diabetes.data, columns=diabetes.feature_names)

# Display the first few rows of the dataset
df.head()

In [None]:
# Define variables for the model
predictors = ["age", "sex", "bmi", "bp"]
imputed_variables = ["s1"]  # We'll impute 's1' (total serum cholesterol)

# Create a subset with only needed columns
diabetes_df = df[predictors + imputed_variables]

# Display summary statistics
diabetes_df.describe()

In [None]:
# Split data into training and testing sets
# (each set keeps both the predictors and the variable to impute)
X_train, X_test = train_test_split(diabetes_df, test_size=0.3, random_state=42)

# Let's see how many records we have in each set
print(f"Training set size: {X_train.shape[0]} records")
print(f"Testing set size: {X_test.shape[0]} records")

## Simulating Missing Data

For this example, we'll simulate missing data in our test set by removing the values we want to impute.

In [None]:
# Create a copy of the test set with missing values
X_test_missing = X_test.copy()

# Store the actual values for later comparison
actual_values = X_test_missing[imputed_variables].copy()

# Remove the values to be imputed
X_test_missing[imputed_variables] = np.nan

X_test_missing.head()

## Training and Using the OLS Imputer

Now we'll train the OLS imputer and use it to impute the missing values in our test set.

In [None]:
# Initialize the OLS imputer
ols_imputer = OLS()

# Fit the model with our training data
# This trains a linear regression model
ols_imputer.fit(X_train, predictors, imputed_variables)

In [None]:
# Impute values in the test set
# This uses the trained OLS model to predict missing values
imputed_values = ols_imputer.predict(X_test_missing, QUANTILES)

# Display the first few imputed values at the median (0.5 quantile)
imputed_values[0.5].head()

## Evaluating the Imputation Results

Now let's compare the imputed values with the actual values to evaluate the performance of our imputer.

In [None]:
# Extract median predictions for evaluation
median_predictions = imputed_values[0.5]

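# Calculate Mean Absolute Error (MAE) for the median predictions
# .mean() returns one value per column (a pandas Series), so select the
# imputed column before applying the float format specifier
mae = np.abs(median_predictions - actual_values).mean()
print(f"Mean Absolute Error: {mae[imputed_variables[0]]:.4f}")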

In [None]:
# Create a scatter plot comparing actual vs. imputed values
plt.figure(figsize=(8, 6))
plt.scatter(actual_values, median_predictions, alpha=0.5)
# Diagonal reference line: points on it would be imputed exactly
limits = [actual_values.min().min(), actual_values.max().max()]
plt.plot(limits, limits, 'r--')
plt.xlabel('Actual Values')
plt.ylabel('Imputed Values')
plt.title('Comparison of Actual vs. Imputed Values using OLS')
plt.grid(True, linestyle='--', alpha=0.7)
plt.show()

## Examining Quantile Predictions

The OLS imputer generates quantile predictions under the normal distribution assumption, so the spread between quantiles gives a sense of prediction uncertainty.

In [None]:
# Compare predictions at different quantiles for the first 5 records
quantiles_to_show = [0.1, 0.25, 0.5, 0.75, 0.9]
comparison_df = pd.DataFrame(index=range(5))

# Add actual values
comparison_df['Actual'] = actual_values.iloc[:5, 0].values

# Add quantile predictions
for q in quantiles_to_show:
    comparison_df[f'Q{int(q*100)}'] = imputed_values[q].iloc[:5, 0].values

comparison_df
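
One way to turn the quantile table above into a single number is to check how often the actual values fall between a lower and an upper quantile prediction. The sketch below is an illustration on toy numbers, not part of MicroImpute; `interval_coverage` is a hypothetical helper name.

```python
import numpy as np


def interval_coverage(actual, lower, upper):
    """Return the fraction of actual values that lie inside [lower, upper]."""
    actual, lower, upper = (
        np.asarray(a, dtype=float).ravel() for a in (actual, lower, upper)
    )
    return float(np.mean((actual >= lower) & (actual <= upper)))


# Toy numbers for illustration only
actual = [1.0, 2.0, 3.0, 4.0]
lower = [0.5, 2.5, 2.0, 3.0]
upper = [1.5, 3.5, 4.0, 3.5]

coverage = interval_coverage(actual, lower, upper)
print(f"Coverage: {coverage:.2f}")
```

With this notebook's objects, a call along the lines of `interval_coverage(actual_values, imputed_values[0.1], imputed_values[0.9])` should work, assuming 0.1 and 0.9 are among the requested quantiles. If the normal assumption holds, roughly 80% of the actual values would be expected to fall inside that interval.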

## Understanding the OLS Model

We can examine the OLS model coefficients to understand the relationship between predictor variables and the imputed variable.

In [None]:
# Access the underlying model for s1
ols_model = ols_imputer.models[imputed_variables[0]]

# Print model summary
print(ols_model.summary())

## Advantages and Limitations of OLS Imputation

### Advantages:
- Simple and interpretable model
- Computationally efficient
- Works well when relationships are approximately linear
- Provides quantile predictions that convey the uncertainty around each imputed value
- Coefficients are directly interpretable

### Limitations:
- Assumes linear relationships between variables
- Assumes normally distributed residuals
- May not capture complex interactions
- Predictions are symmetric around the median
- Sensitive to outliers
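
The sensitivity to outliers can be seen in a small synthetic example, independent of MicroImpute: a least-squares line is fitted with `numpy.polyfit` to noise-free data, then refitted after corrupting a single observation.

```python
import numpy as np

# Synthetic, noise-free line y = 2x + 1 (illustration only)
x = np.arange(10, dtype=float)
y = 2.0 * x + 1.0

slope_clean, intercept_clean = np.polyfit(x, y, deg=1)

# Corrupt a single observation at the edge of the x range
y_outlier = y.copy()
y_outlier[-1] += 50.0
slope_outlier, intercept_outlier = np.polyfit(x, y_outlier, deg=1)

print(f"Slope without outlier: {slope_clean:.3f}")
print(f"Slope with one outlier: {slope_outlier:.3f}")
```

A single extreme point more than doubles the fitted slope here, because squared-error loss weights large residuals heavily. Any imputations drawn from such a fit would shift accordingly.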