# Missing Value Imputation

Missing value imputation is the process of replacing missing entries in a dataset with substitute values, typically estimated from the observed data. In this notebook, we'll compare several imputation methods on the Iris dataset.

In [1]:
# Importing necessary libraries
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error


## Loading the Iris Dataset

Let's start by loading the Iris dataset and manually deleting a value to create missingness.

In [13]:
iris = load_iris()
iris

In [14]:
# Load the Iris dataset
iris = load_iris()
X, y = iris.data, iris.target

# Convert to DataFrame
iris_df = pd.DataFrame(data=X, columns=iris.feature_names)
iris_df

,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm)
0,5.1,3.5,1.4,0.2
1,4.9,3.0,1.4,0.2
2,4.7,3.2,1.3,0.2
3,4.6,3.1,1.5,0.2
4,5.0,3.6,1.4,0.2
...,...,...,...,...
145,6.7,3.0,5.2,2.3
146,6.3,2.5,5.0,1.9
147,6.5,3.0,5.2,2.0
148,6.2,3.4,5.4,2.3


In [15]:
# Keep the original value so the imputed values can be compared against it later
original_value = iris_df.loc[0, 'sepal length (cm)']
# Manually delete a value to create missingness
iris_df.loc[0, 'sepal length (cm)'] = np.nan
iris_df.head()

,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm)
0,NaN,3.5,1.4,0.2
1,4.9,3.0,1.4,0.2
2,4.7,3.2,1.3,0.2
3,4.6,3.1,1.5,0.2
4,5.0,3.6,1.4,0.2


In [16]:
original_value

5.1
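
Before imputing, it can help to confirm where the missing entries are. The following is a minimal sketch that rebuilds only the five rows shown above by hand, so it does not depend on the notebook state. The names `head_df`, `missing_per_column`, and `missing_rows` are illustrative and not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Rebuild the first five rows shown above (values copied from the notebook output)
head_df = pd.DataFrame({
    'sepal length (cm)': [5.1, 4.9, 4.7, 4.6, 5.0],
    'sepal width (cm)': [3.5, 3.0, 3.2, 3.1, 3.6],
    'petal length (cm)': [1.4, 1.4, 1.3, 1.5, 1.4],
    'petal width (cm)': [0.2, 0.2, 0.2, 0.2, 0.2],
})

# Delete the same value as in the notebook
head_df.loc[0, 'sepal length (cm)'] = np.nan

# Count missing entries per column
missing_per_column = head_df.isna().sum()
print(missing_per_column)

# Select the rows that contain at least one missing entry
missing_rows = head_df[head_df.isna().any(axis=1)]
print(missing_rows)
```

Only the first row and the sepal length column should be reported as containing a missing entry.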

## Filling Missing Values

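We'll fill the missing value using three `SimpleImputer` strategies: mean, median, and most frequent. Each strategy replaces a missing entry with the corresponding statistic of the observed values in the same column.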

In [17]:
# Define different imputation methods
methods = ['mean', 'median', 'most_frequent']

# Initialize empty dictionary to store results
results = {}
# Loop through each method
for method in methods:
    # Initialize imputer with the chosen method
    imputer = SimpleImputer(strategy=method)
    
    # Fill missing values (fit_transform returns a NumPy array)
    X_filled = imputer.fit_transform(iris_df)

    # Store the result as a DataFrame with the original column names
    results[method] = pd.DataFrame(data=X_filled, columns=iris_df.columns)

# Display the first few rows of each imputed DataFrame
for method, df in results.items():
    print(f'Imputed DataFrame using {method} method:')
    print(df.head())
    print()

Imputed DataFrame using mean method:
   sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)
0           5.848322               3.5                1.4               0.2
1           4.900000               3.0                1.4               0.2
2           4.700000               3.2                1.3               0.2
3           4.600000               3.1                1.5               0.2
4           5.000000               3.6                1.4               0.2

Imputed DataFrame using median method:
   sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)
0                5.8               3.5                1.4               0.2
1                4.9               3.0                1.4               0.2
2                4.7               3.2                1.3               0.2
3                4.6               3.1                1.5               0.2
4                5.0               3.6                1.4               0.2

Imputed DataFrame using most_frequent method:
[output truncated]

In [5]:
results

{'mean':      sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)
 0             5.848322               3.5                1.4               0.2
 1             4.900000               3.0                1.4               0.2
 2             4.700000               3.2                1.3               0.2
 3             4.600000               3.1                1.5               0.2
 4             5.000000               3.6                1.4               0.2
 ..                 ...               ...                ...               ...
 145           6.700000               3.0                5.2               2.3
 146           6.300000               2.5                5.0               1.9
 147           6.500000               3.0                5.2               2.0
 148           6.200000               3.4                5.4               2.3
 149           5.900000               3.0                5.1               1.8
 
 [150 rows x 4 columns],
 'median':      s
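
As a sanity check, the value each strategy fills in should match the column statistic computed directly with pandas, which skips `NaN` by default. The sketch below reloads the data so that it is self-contained. The names `by_hand` and `filled` are illustrative, and reading the fitted value from `statistics_[0]` assumes that sepal length is the first column, as it is in this DataFrame.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.impute import SimpleImputer

col = 'sepal length (cm)'

# Recreate the DataFrame with one deleted value
iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df.loc[0, col] = np.nan

# Column statistics computed by hand; pandas ignores NaN here
by_hand = {
    'mean': df[col].mean(),
    'median': df[col].median(),
    'most_frequent': df[col].mode().iloc[0],  # smallest value among ties
}

# Fill value learned by SimpleImputer for the first column
filled = {}
for strategy, expected in by_hand.items():
    imputer = SimpleImputer(strategy=strategy)
    imputer.fit(df)
    filled[strategy] = float(imputer.statistics_[0])
    print(f'{strategy}: SimpleImputer={filled[strategy]:.6f}, pandas={expected:.6f}')
```

The mean and median printed here should agree with the values in row 0 of the imputed DataFrames above.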

## Comparing with Original Values

Let's compare each imputed value with the original value (5.1) and calculate the mean absolute error for each method.
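
One way to carry out this comparison is sketched below. It is not necessarily the notebook's own implementation: the names `original_df`, `missing_df`, `mask`, and `errors` are assumptions introduced for illustration, and the data is reloaded so that the block runs on its own. Because only one value was deleted, the mean absolute error for each method reduces to the absolute difference between the imputed value and the original value.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.impute import SimpleImputer
from sklearn.metrics import mean_absolute_error

col = 'sepal length (cm)'

# Keep an untouched copy of the data to compare against
iris = load_iris()
original_df = pd.DataFrame(iris.data, columns=iris.feature_names)

# Create the version with one deleted value
missing_df = original_df.copy()
missing_df.loc[0, col] = np.nan

# Boolean mask marking the rows where the value was deleted
mask = missing_df[col].isna()

errors = {}
for strategy in ['mean', 'median', 'most_frequent']:
    filled = SimpleImputer(strategy=strategy).fit_transform(missing_df)
    filled_df = pd.DataFrame(filled, columns=missing_df.columns)

    # Score only the positions that were actually imputed
    errors[strategy] = mean_absolute_error(
        original_df.loc[mask, col], filled_df.loc[mask, col]
    )
    print(f'{strategy}: MAE = {errors[strategy]:.6f}')
```

Scoring only the masked positions keeps the untouched entries, which trivially have zero error, from diluting the result.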