# Exploring Iterative Imputer for Multivariate Imputation

Author: Mohamed Oussama NAJI

Date: Jan 29, 2024

## Introduction

In this notebook, we explore Iterative Imputer for multivariate imputation of missing values in a dataset. Iterative Imputer models each feature that has missing values as a function of the other features, then uses those estimates to fill the gaps, cycling through the features over several rounds. We use the heart disease dataset, which contains demographic, behavioral, and medical features, to predict the risk of coronary heart disease (CHD) in patients.
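
To make the idea concrete before touching the real data, here is a minimal sketch on a made-up two-column array in which the second column is roughly twice the first. The array `X_toy` and its values are assumptions for illustration only; they are not taken from the heart disease dataset.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (needed to expose IterativeImputer)
from sklearn.impute import IterativeImputer

# Made-up toy data: column 1 is roughly 2 * column 0.
X_toy = np.array([
    [1.0, 2.1],
    [2.0, 3.9],
    [3.0, 6.2],
    [4.0, np.nan],  # missing entry, to be estimated from column 0
    [5.0, 9.8],
])

toy_imputer = IterativeImputer(max_iter=10, random_state=0)
X_toy_filled = toy_imputer.fit_transform(X_toy)

# Observed entries are left unchanged; only the NaN is replaced.
print(X_toy_filled[3])
```

The filled value follows the relationship between the two columns rather than the column mean, which is the main difference from a univariate strategy such as `SimpleImputer(strategy="mean")`.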


## Table of Contents
1. [Data Loading and Exploration](#data-loading-exploration)
   - [Loading the Dataset](#loading-dataset)
   - [Dataset Information](#dataset-information)
   - [Missing Value Analysis](#missing-value-analysis)
   - [Visualizing Missing Values](#visualizing-missing-values)
2. [Iterative Imputer](#iterative-imputer)
   - [Importing Iterative Imputer](#importing-iterative-imputer)
   - [Creating Iterative Imputer Object](#creating-iterative-imputer-object)
   - [Fitting and Transforming the Dataset](#fitting-transforming-dataset)
   - [Visualizing Imputed Data](#visualizing-imputed-data)
3. [Logistic Regression Models](#logistic-regression-models)
   - [Model Without Imputation](#model-without-imputation)
   - [Model With Iterative Imputer](#model-iterative-imputer)
   - [Comparison of Accuracies](#comparison-accuracies)
4. [Iterative Imputer with Random Forest](#iterative-imputer-random-forest)
5. [Experiments with Different Imputation Methods and Algorithms](#experiments-imputation-algorithms)
   - [Imputation Methods](#imputation-methods)
   - [Algorithms](#algorithms)
   - [Experimental Results](#experimental-results)
6. [Best Strategy for Random Forest](#best-strategy-random-forest)
7. [Best Algorithm for Iterative Imputer](#best-algorithm-iterative-imputer)
8. [Best Combination of Algorithm and Imputation Strategy](#best-combination-algorithm-strategy)
9. [Conclusion](#conclusion)

## Data Loading and Exploration <a id="data-loading-exploration"></a>


### Loading the Dataset <a id="loading-dataset"></a>


In [None]:
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
import seaborn as sns

df = pd.read_csv("heart_disease.csv")
df

### Dataset Information <a id="dataset-information"></a>


In [None]:
df.info()

### Missing Value Analysis <a id="missing-value-analysis"></a>


In [None]:
# Report the count and percentage of missing entries for every column
for i, column in enumerate(df.columns):
    missing_data = df[column].isna().sum()
    perc = missing_data / len(df) * 100
    print(f'Feature {i+1} ({column}) >> Missing entries: {missing_data}  |  Percentage: {round(perc, 2)}')
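
The same per-column summary can also be built without an explicit loop. The sketch below runs on a small made-up frame; the column names `age`, `glucose`, and `TenYearCHD` are illustrative assumptions and may not match the columns of `heart_disease.csv`.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for the dataset; names and values are illustrative only.
demo = pd.DataFrame({
    "age": [39, 46, np.nan, 61],
    "glucose": [77.0, np.nan, np.nan, 103.0],
    "TenYearCHD": [0, 0, 1, 1],
})

# isna().sum() counts missing cells per column; isna().mean() gives the fraction.
missing_summary = pd.DataFrame({
    "missing": demo.isna().sum(),
    "percent": (demo.isna().mean() * 100).round(2),
}).sort_values("missing", ascending=False)

print(missing_summary)
```

Sorting by the missing count puts the most incomplete columns first, which helps when a dataset has many features.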

### Visualizing Missing Values <a id="visualizing-missing-values"></a>


In [None]:
plt.figure(figsize=(10,6))
sns.heatmap(df.isna(), cbar=False, cmap='viridis', yticklabels=False)

## Iterative Imputer <a id="iterative-imputer"></a>


### Importing Iterative Imputer <a id="importing-iterative-imputer"></a>

In [None]:
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer

### Creating Iterative Imputer Object <a id="creating-iterative-imputer-object"></a>


In [None]:
imputer = IterativeImputer(max_iter=10, random_state=0)

### Fitting and Transforming the Dataset <a id="fitting-transforming-dataset"></a>


In [None]:
data = df.values
X = data[:, :-1]
y = data[:, -1]

imputer.fit(X)
X_transform = imputer.transform(X)

### Visualizing Imputed Data <a id="visualizing-imputed-data"></a>


In [None]:
print(f"Missing cells before imputation: {np.isnan(X).sum()}")
print(f"Missing cells after imputation: {np.isnan(X_transform).sum()}")

plt.figure(figsize=(10,6))
sns.heatmap(df.isna(), cbar=False, cmap='viridis', yticklabels=False)

plt.figure(figsize=(10,6))
# X_transform is a NumPy array, so use np.isnan rather than the pandas .isna() method
sns.heatmap(np.isnan(X_transform), cbar=False, cmap='viridis', yticklabels=False)

df_transform = pd.DataFrame(data=X_transform)
df_transform

plt.figure(figsize=(10,6))
sns.heatmap(df_transform.isna(), cbar=False, cmap='viridis', yticklabels=False)
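
Wrapping the imputed array with `pd.DataFrame(data=X_transform)` replaces the column names with integer labels. A minimal sketch of keeping the original labels is shown below; it uses a made-up frame whose column names (`age`, `glucose`) are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (needed to expose IterativeImputer)
from sklearn.impute import IterativeImputer

# Made-up feature frame with one missing value per column.
features = pd.DataFrame({
    "age": [39.0, 46.0, np.nan, 61.0, 52.0],
    "glucose": [77.0, 85.0, 90.0, np.nan, 95.0],
})

imputer_demo = IterativeImputer(max_iter=10, random_state=0)

# fit_transform returns a NumPy array; reattach the column names and index.
imputed = pd.DataFrame(
    imputer_demo.fit_transform(features),
    columns=features.columns,
    index=features.index,
)

print(imputed.isna().sum().sum())
```

With labels preserved, the heatmap axes and any later column-based selection remain readable.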

## Logistic Regression Models <a id="logistic-regression-models"></a>


### Model Without Imputation <a id="model-without-imputation"></a>


In [None]:
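from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

# LogisticRegression rejects NaN inputs, so this baseline drops incomplete rows
df = pd.read_csv("heart_disease.csv").dropna()
X = df[df.columns[:-1]]
y = df[df.columns[-1]]

model = LogisticRegression()
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)

# `scores` is reused in the accuracy comparison below
scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)

print(f"Mean Accuracy: {round(np.mean(scores), 3)}  | Std: {round(np.std(scores), 3)}")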

### Model With Iterative Imputer <a id="model-iterative-imputer"></a>


In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

df = pd.read_csv("heart_disease.csv")
X = df[df.columns[:-1]]
y = df[df.columns[-1]]

imputer = IterativeImputer(max_iter=10, random_state=0)
model = LogisticRegression()

pipeline = Pipeline([('impute', imputer), ('model', model)])

cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)

scores2 = cross_val_score(pipeline, X, y, scoring='accuracy', cv=cv, n_jobs=-1)

print(f"Mean Accuracy: {round(np.mean(scores2), 3)}  | Std: {round(np.std(scores2), 3)}")

### Comparison of Accuracies <a id="comparison-accuracies"></a>

In [None]:
# Note: the two scores are computed on different row sets (complete rows only vs. all rows)
mean_accuracy1 = np.mean(scores)  # Accuracy when rows with missing values are dropped
mean_accuracy2 = np.mean(scores2)  # Accuracy with IterativeImputer (default settings)

if np.isclose(mean_accuracy1, mean_accuracy2, atol=1e-3):
    print("Both methods have the same accuracy.")
elif mean_accuracy1 > mean_accuracy2:
    print(f"Dropping missing values is better, with Mean Accuracy: {mean_accuracy1:.3f}")
else:
    print(f"Iterative Imputer is better, with Mean Accuracy: {mean_accuracy2:.3f}")

## Iterative Imputer with Random Forest <a id="iterative-imputer-random-forest"></a>


In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

imputer = IterativeImputer(max_iter=10, random_state=0)
model = RandomForestClassifier()

pipeline = Pipeline([('impute', imputer), ('model', model)])

cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)

scores = cross_val_score(pipeline, X, y, scoring='accuracy', cv=cv, n_jobs=-1)

print(f"Mean Accuracy: {round(np.mean(scores), 3)}  | Std: {round(np.std(scores), 3)}")

## Experiments with Different Imputation Methods and Algorithms <a id="experiments-imputation-algorithms"></a>


### Imputation Methods <a id="imputation-methods"></a>
- Mean
- Median
- Most_frequent
- Constant
- IterativeImputer
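
The five methods listed above map onto two scikit-learn classes. The sketch below shows one way to express that mapping; `make_imputer` is a hypothetical helper introduced here for illustration, and the fill value of 0 for the constant strategy mirrors the experiment code in this notebook.

```python
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (needed to expose IterativeImputer)
from sklearn.impute import IterativeImputer, SimpleImputer


def make_imputer(strategy):
    """Hypothetical helper: return an imputer for a strategy name."""
    if strategy == "iterative":
        # Multivariate: each feature is estimated from the others.
        return IterativeImputer(max_iter=10, random_state=0)
    if strategy == "constant":
        # Univariate: every missing cell gets the same fixed value.
        return SimpleImputer(strategy="constant", fill_value=0)
    if strategy in ("mean", "median", "most_frequent"):
        # Univariate: a per-column statistic replaces missing cells.
        return SimpleImputer(strategy=strategy)
    raise ValueError(f"Unknown imputation strategy: {strategy!r}")


for name in ["mean", "median", "most_frequent", "constant", "iterative"]:
    print(name, "->", type(make_imputer(name)).__name__)
```

Only the iterative option uses information from other columns; the four `SimpleImputer` strategies look at each column in isolation.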

### Algorithms <a id="algorithms"></a>
- Logistic Regression
- KNN
- Random Forest
- SVM
- Gradient Boosting (the additional algorithm used in the experiments below)

### Experimental Results <a id="experimental-results"></a>


In [None]:
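from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (must precede the IterativeImputer import)
from sklearn.impute import SimpleImputer, IterativeImputer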
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC
from sklearn.pipeline import Pipeline
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
import numpy as np

strategies = ['mean', 'median', 'most_frequent', 'constant', 'iterative']
algorithms = [LogisticRegression(), KNeighborsClassifier(), RandomForestClassifier(), SVC(), GradientBoostingClassifier()]

results = []

for strategy in strategies:
    for model in algorithms:
        if strategy == 'iterative':
            imputer = IterativeImputer(max_iter=10, random_state=0)
        elif strategy == 'constant':
            imputer = SimpleImputer(strategy=strategy, fill_value=0)
        else:
            imputer = SimpleImputer(strategy=strategy)

        pipeline = Pipeline([('impute', imputer), ('model', model)])
        cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
        scores = cross_val_score(pipeline, X, y, scoring='accuracy', cv=cv, n_jobs=-1)

        results.append((strategy, model.__class__.__name__, scores))

for result in results:
    strategy, model_name, scores = result
    mean_accuracy = np.mean(scores)
    max_accuracy = np.max(scores)
    print(f"Strategy: {strategy}, Model: {model_name} >> Accuracy: {mean_accuracy:.3f} | Max accuracy: {max_accuracy:.3f}")
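
With 25 strategy and model combinations, a table is easier to scan than printed lines. The sketch below pivots entries shaped like the `results` tuples into a strategy-by-model grid. The list `results_demo` and its scores are made-up placeholders, not measured accuracies.

```python
import numpy as np
import pandas as pd

# Placeholder entries in the same (strategy, model_name, scores) shape as `results`.
# The numbers are invented for illustration only.
results_demo = [
    ("mean", "LogisticRegression", np.array([0.80, 0.84])),
    ("mean", "SVC", np.array([0.70, 0.74])),
    ("iterative", "LogisticRegression", np.array([0.82, 0.86])),
    ("iterative", "SVC", np.array([0.71, 0.75])),
]

# One row per combination, with the mean and spread of the fold scores.
summary = pd.DataFrame(
    [(s, m, np.mean(sc), np.std(sc)) for s, m, sc in results_demo],
    columns=["strategy", "model", "mean_accuracy", "std_accuracy"],
)

# Rows are imputation strategies, columns are models.
table = summary.pivot(index="strategy", columns="model", values="mean_accuracy")
print(table.round(3))
```

Passing the real `results` list in place of `results_demo` should produce the full grid, and keeping `std_accuracy` alongside the mean helps judge whether small differences between cells are meaningful.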


## Best Strategy for Random Forest <a id="best-strategy-random-forest"></a>


In [None]:
rf_results = [result for result in results if result[1] == 'RandomForestClassifier']

best_strategy_rf = max(rf_results, key=lambda x: np.mean(x[2]))

print(f"The best imputation strategy for the Random Forest algorithm on this dataset is '{best_strategy_rf[0]}' with a mean accuracy of {np.mean(best_strategy_rf[2]):.3f}")

## Best Algorithm for Iterative Imputer <a id="best-algorithm-iterative-imputer"></a>


In [None]:
iterative_imputer_results = [result for result in results if result[0] == 'iterative']

# Note: SVC is not included in this comparison
algorithms_to_compare = ['LogisticRegression', 'RandomForestClassifier', 'KNeighborsClassifier', 'GradientBoostingClassifier']

iterative_imputer_results_filtered = [result for result in iterative_imputer_results if result[1] in algorithms_to_compare]

best_algorithm_iterative = max(iterative_imputer_results_filtered, key=lambda x: np.mean(x[2]))

print(f"The best algorithm for the 'IterativeImputer' strategy on this dataset is '{best_algorithm_iterative[1]}' with a mean accuracy of {np.mean(best_algorithm_iterative[2]):.3f}")

## Best Combination of Algorithm and Imputation Strategy <a id="best-combination-algorithm-strategy"></a>


In [None]:
best_overall_combination = max(results, key=lambda x: np.mean(x[2]))

best_strategy = best_overall_combination[0]
best_algorithm = best_overall_combination[1]
best_mean_accuracy = np.mean(best_overall_combination[2])

print(f"The best combination overall is using '{best_strategy}' strategy with '{best_algorithm}' algorithm, achieving a mean accuracy of {best_mean_accuracy:.3f}")


## Conclusion <a id="conclusion"></a>

In this notebook, we explored the usage of Iterative Imputer for multivariate imputation of missing values in the heart disease dataset. We performed data exploration, analyzed missing values, and applied Iterative Imputer to impute the missing values based on other features.

We built logistic regression models with and without imputation and compared their accuracies. We also experimented with different imputation methods and algorithms to identify the best strategy for the Random Forest algorithm, the best algorithm for the Iterative Imputer strategy, and the overall best combination of algorithm and imputation strategy.

The experimental results showed that Iterative Imputer can be an effective technique for handling missing values in this dataset, often leading to improved model performance compared to dropping missing values.

However, it is important to note that the choice of imputation method and algorithm may vary depending on the specific dataset and problem at hand. It is recommended to evaluate different imputation strategies and algorithms for each specific use case to determine the most suitable approach.

Overall, Iterative Imputer demonstrated its potential as a valuable tool for multivariate imputation, enabling the development of more robust predictive models in the presence of missing data.