# Exploring SimpleImputer for Missing Value Imputation

Author: Mohamed Oussama NAJI

Date: Jan 29, 2024

## Introduction

In this notebook, we will explore the SimpleImputer class from the sklearn.impute module for handling missing values in a dataset. We will use the heart disease dataset, which contains various demographic, behavioral, and medical features, to predict the risk of coronary heart disease (CHD) in patients.


## Table of Contents
1. [Data Loading and Exploration](#data-loading-exploration)
   - [Loading the Dataset](#loading-dataset)
   - [Dataset Information](#dataset-information)
   - [Missing Value Analysis](#missing-value-analysis)
2. [SimpleImputer](#simpleimputer)
   - [Importing SimpleImputer](#importing-simpleimputer)
   - [Creating SimpleImputer Object](#creating-simpleimputer-object)
   - [Fitting and Transforming the Dataset](#fitting-transforming-dataset)
   - [Visualizing Missing Values](#visualizing-missing-values)
3. [Logistic Regression Models](#logistic-regression-models)
   - [Model Without Imputation](#model-without-imputation)
   - [Model With SimpleImputer Mean Strategy](#model-simpleimputer-mean)
   - [Comparison of Accuracies](#comparison-accuracies)
4. [Benchmarking with Random Forest](#benchmarking-random-forest)
5. [Experiments with Different Strategies and Algorithms](#experiments-strategies-algorithms)
   - [Strategies](#strategies)
   - [Algorithms](#algorithms)
   - [Experimental Results](#experimental-results)
6. [Best Strategy for Random Forest](#best-strategy-random-forest)
7. [Best Algorithm for Mean Strategy](#best-algorithm-mean-strategy)
8. [Best Combination of Algorithm and Imputation Strategy](#best-combination-algorithm-strategy)
9. [Conclusion](#conclusion)

## Data Loading and Exploration <a id="data-loading-exploration"></a>


### Loading the Dataset <a id="loading-dataset"></a>


In [None]:
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
import seaborn as sns

df = pd.read_csv("heart_disease.csv")
df

### Dataset Information <a id="dataset-information"></a>


In [None]:
df.info()

### Missing Value Analysis <a id="missing-value-analysis"></a>


In [None]:
# Report the number and share of missing entries for each column, by name.
for i, column in enumerate(df.columns):
    missing_data = df[column].isna().sum()
    perc = missing_data / len(df) * 100
    print(f'Feature {i+1} ({column}) >> Missing entries: {missing_data}  |  Percentage: {round(perc, 2)}')


In [None]:
plt.figure(figsize=(10,6))
sns.heatmap(df.isna(), cbar=False, cmap='viridis', yticklabels=False)
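
The per-column loop above can also be written as a vectorized summary. The following is a minimal sketch on a small made-up DataFrame; the column names and values are illustrative assumptions, not rows from heart_disease.csv.

```python
import numpy as np
import pandas as pd

# Toy data standing in for the real dataset (values are made up for illustration).
toy = pd.DataFrame({
    "age": [39, 46, np.nan, 61],
    "glucose": [77.0, np.nan, np.nan, 103.0],
    "TenYearCHD": [0, 0, 1, 1],
})

# isna() gives a boolean mask; sum() counts the True values and mean() gives their share.
summary = pd.DataFrame({
    "missing": toy.isna().sum(),
    "percent": (toy.isna().mean() * 100).round(2),
})
print(summary)
```

Applied to the real data, replacing `toy` with `df` should give the same counts as the loop, in one table.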

## SimpleImputer <a id="simpleimputer"></a>

### Importing SimpleImputer <a id="importing-simpleimputer"></a>


In [None]:
from sklearn.impute import SimpleImputer

### Creating SimpleImputer Object <a id="creating-simpleimputer-object"></a>


In [None]:
imputer = SimpleImputer(strategy='mean')

### Fitting and Transforming the Dataset <a id="fitting-transforming-dataset"></a>


In [None]:
data = df.values
X = data[:, :-1]
y = data[:, -1]

imputer.fit(X)

X_transform = imputer.transform(X)
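
To make the fit and transform steps concrete, here is a small sketch on a made-up array (the values are assumptions chosen for easy arithmetic). `fit` learns one fill value per column and stores it in `statistics_`; `transform` writes those values into the missing slots.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy feature matrix (made-up values): two columns, one NaN in each.
X_toy = np.array([
    [1.0, 10.0],
    [np.nan, 20.0],
    [3.0, np.nan],
])

imputer_toy = SimpleImputer(strategy="mean")
imputer_toy.fit(X_toy)  # learns one mean per column, ignoring NaN

print(imputer_toy.statistics_)  # per-column fill values learned by fit

X_filled = imputer_toy.transform(X_toy)  # NaNs replaced by those means
print(X_filled)
```

Here the first column's mean is (1 + 3) / 2 = 2 and the second's is (10 + 20) / 2 = 15, so those are the values that fill the gaps.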

### Visualizing Missing Values <a id="visualizing-missing-values"></a>


In [None]:
plt.figure(figsize=(10,6))
sns.heatmap(df.isna(), cbar=False, cmap='viridis', yticklabels=False)

plt.figure(figsize=(10,6))
# X_transform is a NumPy array and has no .isna() method, so build the mask with pd.isna().
sns.heatmap(pd.isna(X_transform), cbar=False, cmap='viridis', yticklabels=False)

In [None]:
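# Restore the feature names, which are lost when the imputer returns a NumPy array.
df_transform = pd.DataFrame(data=X_transform, columns=df.columns[:-1])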
df_transform

plt.figure(figsize=(10,6))
sns.heatmap(df_transform.isna(), cbar=False, cmap='viridis', yticklabels=False)

## Logistic Regression Models <a id="logistic-regression-models"></a>


### Model Without Imputation <a id="model-without-imputation"></a>


In [None]:
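from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

# LogisticRegression rejects NaN inputs, so the baseline drops incomplete rows instead.
df_complete = df.dropna()
X_complete = df_complete[df_complete.columns[:-1]]
y_complete = df_complete[df_complete.columns[-1]]

cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
scores = cross_val_score(LogisticRegression(), X_complete, y_complete, scoring='accuracy', cv=cv)
print(f"Mean Accuracy: {round(np.mean(scores), 3)}  | Std: {round(np.std(scores), 3)}")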

### Model With SimpleImputer Mean Strategy <a id="model-simpleimputer-mean"></a>


In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

# Reload the full dataset; rows with missing values are kept so the imputer has something to fill.
df = pd.read_csv("heart_disease.csv")

X = df[df.columns[:-1]]
y = df[df.columns[-1]]

imputer = SimpleImputer(strategy='mean')
logistic_model = LogisticRegression()

pipeline = Pipeline(steps=[('imputer', imputer), ('model', logistic_model)])

cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)

scores2 = cross_val_score(pipeline, X, y, scoring='accuracy', cv=cv)

print(f"Mean Accuracy: {round(np.mean(scores2), 3)}  | Std: {round(np.std(scores2), 3)}")
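
Wrapping the imputer and the model in a Pipeline matters for cross-validation: cross_val_score refits the whole pipeline on each training fold, so the fill values are computed from training rows only. The sketch below illustrates this with a single manual split on made-up data; the arrays are assumptions for illustration, not part of the dataset.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Made-up data: one feature with a missing value in the training part.
X_train = np.array([[1.0], [np.nan], [3.0], [4.0]])
y_train = np.array([0, 0, 1, 1])
X_test = np.array([[100.0], [np.nan]])

pipe = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="mean")),
    ("model", LogisticRegression()),
])
pipe.fit(X_train, y_train)  # the imputer sees only the training rows

# The fill value is the training mean (1 + 3 + 4) / 3, unaffected by X_test.
train_mean = pipe.named_steps["imputer"].statistics_[0]
print(train_mean)

# At prediction time the NaN in X_test is filled with that same training mean.
predictions = pipe.predict(X_test)
print(predictions)
```

Had the imputer been fitted on all rows before splitting, the value 100.0 in the test part would have shifted the fill value, leaking test information into training.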

### Comparison of Accuracies <a id="comparison-accuracies"></a>


In [None]:
mean_accuracy1 = np.mean(scores)
mean_accuracy2 = np.mean(scores2)

if np.isclose(mean_accuracy1, mean_accuracy2, atol=1e-3):
    print("Both methods have the same accuracy.")
else:
    print(f"Dropping missing values Mean Accuracy: {mean_accuracy1:.3f}")
    print(f"SimpleImputer with Mean Strategy Mean Accuracy: {mean_accuracy2:.3f}")

## Benchmarking with Random Forest <a id="benchmarking-random-forest"></a>


In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

imputer = SimpleImputer(strategy='mean')
random_forest_model = RandomForestClassifier()

pipeline = Pipeline(steps=[('imputer', imputer), ('model', random_forest_model)])

cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)

# Use a separate name so the logistic regression scores above are not overwritten.
scores_rf = cross_val_score(pipeline, X, y, scoring='accuracy', cv=cv)

print(f"Mean Accuracy: {round(np.mean(scores_rf), 3)}  | Std: {round(np.std(scores_rf), 3)}")


## Experiments with Different Strategies and Algorithms <a id="experiments-strategies-algorithms"></a>


### Strategies <a id="strategies"></a>
- Mean
- Median
- Most_frequent
- Constant
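
As a quick illustration of how the four strategies differ, the sketch below imputes one made-up column with each of them and records the value written into the missing slot. The data and the `fill_value=0` choice are assumptions for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# One made-up column with a single missing entry in the last row.
column = np.array([[1.0], [2.0], [2.0], [7.0], [np.nan]])

fill_values = {}
for strategy in ["mean", "median", "most_frequent", "constant"]:
    # For numeric data 'constant' defaults to 0; it is passed explicitly here for clarity.
    kwargs = {"fill_value": 0} if strategy == "constant" else {}
    imputer_demo = SimpleImputer(strategy=strategy, **kwargs)
    filled = imputer_demo.fit_transform(column)
    fill_values[strategy] = float(filled[-1, 0])  # value written into the NaN slot

print(fill_values)
```

On this column the mean is pulled up by the outlier 7 while the median and the most frequent value are not, which is the usual reason to prefer the median for skewed features.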

### Algorithms <a id="algorithms"></a>
- Logistic Regression
- KNN
- Random Forest
- SVM
- Decision Tree (the additional algorithm chosen for these experiments)

### Experimental Results <a id="experimental-results"></a>


In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.pipeline import Pipeline
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
import numpy as np

df = pd.read_csv("heart_disease.csv")
X = df[df.columns[:-1]]
y = df[df.columns[-1]]

strategies = ['mean', 'median', 'most_frequent', 'constant']
algorithms = [LogisticRegression(), KNeighborsClassifier(), RandomForestClassifier(), SVC(), DecisionTreeClassifier()]

results = []

for strategy in strategies:
    for algorithm in algorithms:
        if strategy == 'constant':
            imputer = SimpleImputer(strategy=strategy, fill_value=0)
        else:
            imputer = SimpleImputer(strategy=strategy)

        pipeline = Pipeline([
            ('imputer', imputer),
            ('model', algorithm)
        ])

        cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)

        scores = cross_val_score(pipeline, X, y, scoring='accuracy', cv=cv, n_jobs=-1)

        result = {
            'strategy': strategy,
            'algorithm': algorithm.__class__.__name__,
            'mean_accuracy': np.mean(scores),
            'std_accuracy': np.std(scores)
        }
        results.append(result)

for result in results:
    print(result)
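
A list of dictionaries like `results` is easier to read as a table. The sketch below shows one way to do that, assuming pandas' `pivot`. To stay self-contained it runs a reduced version of the experiment on synthetic data from `make_classification` with randomly blanked entries, so its numbers say nothing about the heart disease dataset.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (not the heart disease dataset), with about 10% of entries set to NaN.
X_syn, y_syn = make_classification(n_samples=200, n_features=5, random_state=0)
rng = np.random.default_rng(0)
X_syn[rng.random(X_syn.shape) < 0.1] = np.nan

demo_results = []
for strategy in ["mean", "median"]:
    for algorithm in [LogisticRegression(), DecisionTreeClassifier(random_state=0)]:
        pipe = Pipeline([("imputer", SimpleImputer(strategy=strategy)), ("model", algorithm)])
        demo_scores = cross_val_score(pipe, X_syn, y_syn, scoring="accuracy", cv=3)
        demo_results.append({
            "strategy": strategy,
            "algorithm": algorithm.__class__.__name__,
            "mean_accuracy": np.mean(demo_scores),
        })

# One row per algorithm, one column per strategy.
table = pd.DataFrame(demo_results).pivot(index="algorithm", columns="strategy", values="mean_accuracy")
print(table.round(3))
```

The same two lines at the end should work on the notebook's own `results` list, since it uses the same `strategy`, `algorithm`, and `mean_accuracy` keys.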

## Best Strategy for Random Forest <a id="best-strategy-random-forest"></a>


In [None]:
rf_results = [result for result in results if result['algorithm'] == 'RandomForestClassifier']

best_strategy = max(rf_results, key=lambda x: x['mean_accuracy'])

print(f"The best imputation strategy for the Random Forest algorithm on this dataset is '{best_strategy['strategy']}' with a mean accuracy of {best_strategy['mean_accuracy']:.4f}")

## Best Algorithm for Mean Strategy <a id="best-algorithm-mean-strategy"></a>


In [None]:
mean_results = [result for result in results if result['strategy'] == 'mean']

# Include every algorithm that was evaluated, so that SVC is not silently excluded.
algorithms_to_compare = ['LogisticRegression', 'RandomForestClassifier', 'KNeighborsClassifier', 'DecisionTreeClassifier', 'SVC']

mean_results_filtered = [result for result in mean_results if result['algorithm'] in algorithms_to_compare]

best_algorithm = max(mean_results_filtered, key=lambda x: x['mean_accuracy'])

print(f"The best algorithm for the 'mean' imputation strategy on this dataset is '{best_algorithm['algorithm']}' with a mean accuracy of {best_algorithm['mean_accuracy']:.4f}")

## Best Combination of Algorithm and Imputation Strategy <a id="best-combination-algorithm-strategy"></a>


In [None]:
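# "Best overall" should consider every algorithm evaluated in the experiments.
algorithms_to_compare = ['LogisticRegression', 'RandomForestClassifier', 'KNeighborsClassifier', 'DecisionTreeClassifier', 'SVC']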
strategies_to_compare = ['mean', 'median', 'most_frequent', 'constant']

filtered_results = [result for result in results if result['algorithm'] in algorithms_to_compare and result['strategy'] in strategies_to_compare]

best_combination = max(filtered_results, key=lambda x: x['mean_accuracy'])

print(f"The best combination overall is '{best_combination['algorithm']}' algorithm with '{best_combination['strategy']}' imputation strategy, achieving a mean accuracy of {best_combination['mean_accuracy']:.4f}")

## Conclusion <a id="conclusion"></a>

In this notebook, we explored the SimpleImputer class from sklearn.impute for handling missing values in the heart disease dataset. We performed data exploration, analyzed missing values, and applied different imputation strategies using SimpleImputer.

We built logistic regression models with and without imputation and compared their accuracies. We also benchmarked the performance using a random forest model with mean imputation strategy.

Furthermore, we conducted experiments with different imputation strategies and algorithms to identify the best strategy for the random forest algorithm, the best algorithm for the mean imputation strategy, and the overall best combination of algorithm and imputation strategy.

The experimental results provided insights into the effectiveness of different imputation strategies and algorithms for this specific dataset. It is important to note that the best strategy and algorithm may vary depending on the dataset and problem at hand.

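Overall, SimpleImputer proved to be a useful tool for handling missing values in the heart disease dataset: it let us train models on every row instead of discarding incomplete ones. Whether this improves accuracy over dropping rows depends on the comparison results above.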