# **Five Important Ways for Imputing Missing Values**

Missing values can be imputed with simple statistics or with **machine learning models**. This process is known as **data imputation** and is commonly used in **data preprocessing** to handle missing or incomplete data. Which method or model to use depends on the nature of your data and on why the values are missing.

---

## **1. Simple Imputation Techniques**

### **Mean/Median Imputation**
- Replace missing values with the **mean** or **median** of the column.  
- **Suitable for**: Numerical data.

### **Mode Imputation**
- Replace missing values with the **mode** (most frequent value) of the column.  
- **Useful for**: Categorical data.

### **K-Nearest Neighbors (KNN)**
- Uses the **similarity of rows** to impute missing values.  
- Finds the **K-nearest** rows with complete values and imputes based on them.

---

## **2. Regression Imputation**
- Uses a **regression model** to predict missing values based on **other variables** in the dataset.  
- Example: If `Age` is missing, predict it using `Salary`, `Experience`, etc.

---

## **3. Decision Trees & Random Forests**
- Some implementations of these models can **handle missing values natively**, depending on the library and version.  
- Can also be used to **predict missing values** based on learned patterns in the data.

---

## **4. Advanced Techniques**
### **Multiple Imputation by Chained Equations (MICE)**
- A sophisticated method that **models each variable with missing values** as a function of other variables.  
- It works in a **round-robin** fashion to improve imputation quality.

### **Deep Learning Methods**
- **Neural networks**, especially **autoencoders**, can impute missing values in complex datasets.  
- Best for **large datasets with non-linear relationships**.

### **Time-Series Specific Methods**
- For **time-series data**, common imputation techniques include:
  - **Interpolation** (estimating missing values using trends).
  - **Forward-fill** (using the last known value).
  - **Backward-fill** (using the next known value).
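
The three time-series fills listed above can be sketched with pandas. This is a minimal illustration on a hypothetical daily series `s` (the dates and values are made up and are not part of the Titanic data):

```python
import numpy as np
import pandas as pd

# Hypothetical daily series with a two-day gap (illustrative values only)
idx = pd.date_range("2024-01-01", periods=5, freq="D")
s = pd.Series([10.0, np.nan, np.nan, 16.0, 18.0], index=idx)

interpolated = s.interpolate(method="linear")  # straight line across the gap
forward_filled = s.ffill()                     # repeat the last known value
backward_filled = s.bfill()                    # use the next known value

print(pd.DataFrame({
    "original": s,
    "interpolate": interpolated,
    "ffill": forward_filled,
    "bfill": backward_filled,
}))
```

Linear interpolation follows the trend between the two known neighbours, while forward-fill and backward-fill simply copy one of them, so the choice matters most when the gap is long or the series is changing quickly.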

---

## **5. Choosing the Right Imputation Method**
It’s important to **select the right method** based on:  
- **Type of data** (numerical, categorical, time-series).  
- **Pattern of missingness**:
  - **Missing Completely at Random (MCAR)** → Missingness is unrelated to any data, observed or unobserved.  
  - **Missing at Random (MAR)** → Missingness is related to other observed data.  
  - **Missing Not at Random (MNAR, also written NMAR)** → Missingness depends on **unobserved** data, such as the missing value itself.  
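
To make the three patterns concrete, here is a rough simulation on synthetic data. The `age` and `income` columns, the probabilities, and the thresholds are all assumptions chosen for illustration only:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 10_000

# Synthetic data (illustrative only): income rises with age
age = rng.uniform(20, 60, size=n)
income = 1_000 * age + rng.normal(0, 5_000, size=n)

# MCAR: every income value has the same 30% chance of being missing
mcar_mask = rng.random(n) < 0.30

# MAR: missingness depends on the observed column `age`
mar_mask = rng.random(n) < np.where(age > 40, 0.50, 0.10)

# MNAR: missingness depends on the unobserved income value itself
mnar_mask = rng.random(n) < np.where(income > np.median(income), 0.50, 0.10)

full_mean = income.mean()
observed_means = pd.Series({
    "MCAR": income[~mcar_mask].mean(),
    "MAR": income[~mar_mask].mean(),
    "MNAR": income[~mnar_mask].mean(),
})
print("true mean:", round(full_mean))
print(observed_means.round())
```

In this setup the mean of the observed incomes stays close to the true mean under MCAR but is pulled downwards under MAR and MNAR. Under MAR the bias can in principle be corrected using `age`; under MNAR the observed data alone are not enough.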

⚠ **Imputation can introduce bias or alter data distribution**, so always analyze the impact before applying it!  

---


## 1. Simple Imputation Techniques
### 1.1. Mean/Median Imputation

Mean/median imputation replaces missing values with the mean or median of the column. This is a simple and effective method, but it has some limitations. For example, it reduces variance in the dataset, and it can lead to biased estimates if the missing values are not missing at random.
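
The variance reduction mentioned above can be checked with a small experiment. This sketch uses a synthetic numeric column (not the Titanic data) with values removed at random:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Synthetic numeric column (illustrative only) with about 30% of values removed at random
values = pd.Series(rng.normal(loc=30, scale=14, size=1_000))
with_gaps = values.mask(rng.random(len(values)) < 0.30)

mean_filled = with_gaps.fillna(with_gaps.mean())

std_observed = with_gaps.std()   # spread of the observed values only
std_imputed = mean_filled.std()  # spread after mean imputation
print(f"std of observed values:    {std_observed:.2f}")
print(f"std after mean imputation: {std_imputed:.2f}")
```

Every filled-in value sits exactly at the mean, so the column mean is unchanged but the standard deviation shrinks.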

Let's see how to implement mean/median imputation in Python using the Titanic dataset.



#### 1.1.1. Mean Imputation


In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

# load the Titanic dataset
df = sns.load_dataset('titanic')
df.head()


Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.25,S,Third,man,True,,Southampton,no,False
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,1,3,female,26.0,0,0,7.925,S,Third,woman,False,,Southampton,yes,True
3,1,1,female,35.0,1,0,53.1,S,First,woman,False,C,Southampton,yes,False
4,0,3,male,35.0,0,0,8.05,S,Third,man,True,,Southampton,no,True


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 15 columns):
 #   Column       Non-Null Count  Dtype   
---  ------       --------------  -----   
 0   survived     891 non-null    int64   
 1   pclass       891 non-null    int64   
 2   sex          891 non-null    object  
 3   age          714 non-null    float64 
 4   sibsp        891 non-null    int64   
 5   parch        891 non-null    int64   
 6   fare         891 non-null    float64 
 7   embarked     889 non-null    object  
 8   class        891 non-null    category
 9   who          891 non-null    object  
 10  adult_male   891 non-null    bool    
 11  deck         203 non-null    category
 12  embark_town  889 non-null    object  
 13  alive        891 non-null    object  
 14  alone        891 non-null    bool    
dtypes: bool(2), category(2), float64(2), int64(4), object(5)
memory usage: 80.7+ KB


In [6]:
df.isnull().sum().sort_values(ascending=False)

Unnamed: 0,0
deck,688
age,177
embarked,2
embark_town,2
survived,0
pclass,0
sex,0
sibsp,0
parch,0
fare,0


We can see that the age column has 177 missing values. Let's replace these missing values with the mean of the column:

In [5]:
df['age']=df['age'].fillna(df['age'].mean())

In [8]:
# check the number of missing values in each column
df.isnull().sum().sort_values(ascending=False)


Unnamed: 0,0
deck,688
embarked,2
embark_town,2
survived,0
pclass,0
sex,0
age,0
sibsp,0
parch,0
fare,0



#### 1.1.2. Median Imputation
Let's load the dataset and replace the missing values in the age column with the median of the column:

In [9]:
df = sns.load_dataset('titanic')
df.isnull().sum().sort_values(ascending=False)

Unnamed: 0,0
deck,688
age,177
embarked,2
embark_town,2
survived,0
pclass,0
sex,0
sibsp,0
parch,0
fare,0


In [11]:
df['age']=df['age'].fillna(df['age'].median())
df.isnull().sum().sort_values(ascending=False)

Unnamed: 0,0
deck,688
embarked,2
embark_town,2
survived,0
pclass,0
sex,0
age,0
sibsp,0
parch,0
fare,0


### 1.2. Mode Imputation
Mode imputation replaces missing values with the mode (most frequent value) of the column. This is useful for imputing categorical columns, such as `embarked` and `embark_town` in the Titanic dataset.

Let's see how to implement mode imputation in Python using the Titanic dataset.

In [15]:
df = sns.load_dataset('titanic')
df.isnull().sum().sort_values(ascending=False)



Unnamed: 0,0
deck,688
age,177
embarked,2
embark_town,2
survived,0
pclass,0
sex,0
sibsp,0
parch,0
fare,0


In [13]:

df['embark_town'] = df['embark_town'].fillna(df['embark_town'].mode()[0])
df['embarked'] = df['embarked'].fillna(df['embarked'].mode()[0])

# check the number of missing values in each column
df.isnull().sum().sort_values(ascending=False)

Unnamed: 0,0
deck,688
age,177
survived,0
pclass,0
sex,0
sibsp,0
parch,0
fare,0
embarked,0
class,0


We can see that the missing values in the `embark_town` and `embarked` columns have been replaced with the mode of each column.
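
The same mode imputation can also be written with scikit-learn's `SimpleImputer`. The sketch below uses a small made-up frame named `toy` rather than the Titanic data:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Small illustrative frame, not the Titanic data
toy = pd.DataFrame({
    "embarked": ["S", "C", "S", np.nan, "Q", "S"],
    "embark_town": ["Southampton", "Cherbourg", "Southampton", np.nan,
                    "Queenstown", "Southampton"],
})

imputer = SimpleImputer(strategy="most_frequent")
toy_filled = pd.DataFrame(
    imputer.fit_transform(toy), columns=toy.columns, index=toy.index
)

print(imputer.statistics_)  # the fill value learned for each column
print(toy_filled)
```

One reason to prefer this form is that the fitted imputer keeps the learned fill values in `statistics_`, so values learned on training data can be reused on new data with `transform`.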

## 2. K-Nearest Neighbors (KNN)
KNN is a machine learning algorithm that can be used for imputing missing values. It works by finding the most similar data points to the one with the missing value based on other available features. The missing value is then imputed with the mean or median of the most similar data points.
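
As a tiny worked example of this idea, consider the made-up matrix below, where the last row is missing its second value. With `n_neighbors=2`, the imputer should pick the two rows closest on the first column and average their second values:

```python
import numpy as np
from sklearn.impute import KNNImputer

# Tiny illustrative matrix: columns are [feature, value_to_impute]
X = np.array([
    [1.0, 10.0],
    [2.0, 20.0],
    [3.0, 30.0],
    [10.0, 100.0],
    [2.5, np.nan],   # the row to fill
])

imputer = KNNImputer(n_neighbors=2)
X_filled = imputer.fit_transform(X)

# The two nearest rows on the first column are 2.0 and 3.0,
# so the missing value becomes the mean of 20 and 30.
print(X_filled[4, 1])
```

The distant row `[10.0, 100.0]` has no influence on the result, which is the main difference from filling with the overall column mean.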

Let's see how to implement KNN imputation in Python using the Titanic dataset.

In [16]:
from sklearn.impute import KNNImputer
df = sns.load_dataset('titanic')
df.isnull().sum().sort_values(ascending=False)


Unnamed: 0,0
deck,688
age,177
embarked,2
embark_town,2
survived,0
pclass,0
sex,0
sibsp,0
parch,0
fare,0


In [18]:
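# use related numeric columns to find neighbours (age alone would fall back to the column mean)
num_cols = ['age', 'fare', 'pclass', 'sibsp', 'parch']
imputer = KNNImputer(n_neighbors=5)
df['age'] = imputer.fit_transform(df[num_cols])[:, 0]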


In [20]:
df.isnull().sum().sort_values(ascending=False)

Unnamed: 0,0
deck,688
embarked,2
embark_town,2
survived,0
pclass,0
sex,0
age,0
sibsp,0
parch,0
fare,0


## 3. Regression Imputation
Regression imputation uses a regression model to predict the missing values from other variables in the dataset. It suits numerical columns; for a categorical column, a classification model plays the same role.
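
The mechanics can be sketched by hand with a plain linear regression. The frame below is synthetic: the `experience` and `age` columns and their relationship are assumptions made up for illustration:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Synthetic frame (illustrative only): age depends linearly on experience
experience = rng.uniform(0, 30, size=200)
age = 22 + experience + rng.normal(0, 2, size=200)
toy = pd.DataFrame({"experience": experience, "age": age})
toy.loc[toy.sample(frac=0.2, random_state=0).index, "age"] = np.nan

known = toy[toy["age"].notna()]
unknown = toy[toy["age"].isna()]

# Fit on the rows where age is observed, then predict it where it is missing
model = LinearRegression().fit(known[["experience"]], known["age"])
toy.loc[unknown.index, "age"] = model.predict(unknown[["experience"]])

print(toy["age"].isna().sum())
print(f"slope={model.coef_[0]:.2f}, intercept={model.intercept_:.2f}")
```

Note that every imputed value lies exactly on the fitted line, so this approach understates the natural scatter of the data.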

Let's see how to implement regression imputation in Python using the Titanic dataset.

In [21]:
# load the dataset
df = sns.load_dataset('titanic')

# check the number of missing values in each column
df.isnull().sum().sort_values(ascending=False)

Unnamed: 0,0
deck,688
age,177
embarked,2
embark_town,2
survived,0
pclass,0
sex,0
sibsp,0
parch,0
fare,0


In [22]:
df.head()


Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.25,S,Third,man,True,,Southampton,no,False
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,1,3,female,26.0,0,0,7.925,S,Third,woman,False,,Southampton,yes,True
3,1,1,female,35.0,1,0,53.1,S,First,woman,False,C,Southampton,yes,False
4,0,3,male,35.0,0,0,8.05,S,Third,man,True,,Southampton,no,True


In [23]:
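from sklearn.experimental import enable_iterative_imputer  # enables the experimental IterativeImputer
from sklearn.impute import IterativeImputer
# regress age on other numeric columns (with age alone, the imputer would return the column mean)
num_cols = ['age', 'fare', 'pclass', 'sibsp', 'parch']
imputer = IterativeImputer(random_state=42)
df['age'] = imputer.fit_transform(df[num_cols])[:, 0]
df.isnull().sum().sort_values(ascending=False)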

Unnamed: 0,0
deck,688
embarked,2
embark_town,2
survived,0
pclass,0
sex,0
age,0
sibsp,0
parch,0
fare,0


## 4. Random Forests for Imputing Missing Values
Some random forest implementations can handle missing values natively. A random forest can also be trained to predict the missing values of one column from the patterns in the other columns.

Let's see how to implement random forest imputation in Python using the Titanic dataset.

In [2]:
import pandas as pd
import numpy as np
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error, mean_absolute_percentage_error
from sklearn.impute import SimpleImputer

# 1. load the dataset
df = sns.load_dataset('titanic')

# check missing values in each column
df.isnull().sum().sort_values(ascending=False)

Unnamed: 0,0
deck,688
age,177
embarked,2
embark_town,2
survived,0
pclass,0
sex,0
sibsp,0
parch,0
fare,0


We will remove the deck column from the dataset because it has too many missing values:



In [3]:
# remove deck column
df.drop('deck', axis=1, inplace=True)

# check missing values in each column
df.isnull().sum().sort_values(ascending=False)

Unnamed: 0,0
age,177
embarked,2
embark_town,2
survived,0
pclass,0
sex,0
sibsp,0
parch,0
fare,0
class,0


We will encode the data at this stage:



In [4]:
df.head()


Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.25,S,Third,man,True,Southampton,no,False
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,Cherbourg,yes,False
2,1,3,female,26.0,0,0,7.925,S,Third,woman,False,Southampton,yes,True
3,1,1,female,35.0,1,0,53.1,S,First,woman,False,Southampton,yes,False
4,0,3,male,35.0,0,0,8.05,S,Third,man,True,Southampton,no,True


In [5]:
# encode the data using label encoding
from sklearn.preprocessing import LabelEncoder
# Columns to encode
columns_to_encode = ['sex', 'embarked', 'who', 'class', 'embark_town', 'alive']

# Dictionary to store LabelEncoders for each column
label_encoders = {}

# Loop to apply LabelEncoder to each column
for col in columns_to_encode:
    # Create a new LabelEncoder for the column
    le = LabelEncoder()

    # Fit the encoder and replace the column with its integer codes
    df[col] = le.fit_transform(df[col])

    # Store the encoder in the dictionary
    label_encoders[col] = le

# Check the first few rows of the DataFrame
df.head()


Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,embark_town,alive,alone
0,0,3,1,22.0,1,0,7.25,2,2,1,True,2,0,False
1,1,1,0,38.0,1,0,71.2833,0,0,2,False,0,1,False
2,1,3,0,26.0,0,0,7.925,2,2,2,False,2,1,True
3,1,1,0,35.0,1,0,53.1,2,0,2,False,2,1,False
4,0,3,1,35.0,0,0,8.05,2,2,1,True,2,0,True


We first have to impute the missing values in the `age` column before we can use it to predict the missing values in the `embarked` and `embark_town` columns.

In [6]:
# Split the dataset into two parts: one with missing values, one without
# (.copy() makes an independent frame, so it can be modified safely later)
df_with_missing = df[df['age'].isna()].copy()
# dropna removes all rows with missing values
df_without_missing = df.dropna()

Let's see the shape of the datasets with and without the missing values:



In [7]:
print("The shape of the original dataset is: ", df.shape)
print("The shape of the dataset with missing values removed is: ", df_without_missing.shape)
print("The shape of the dataset with missing values is: ", df_with_missing.shape)

The shape of the original dataset is:  (891, 14)
The shape of the dataset with missing values removed is:  (714, 14)
The shape of the dataset with missing values is:  (177, 14)


Let's see the first five rows of the dataset with the missing values:



In [8]:
df_with_missing.head()


Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,embark_town,alive,alone
5,0,3,1,,0,0,8.4583,1,2,1,True,1,0,True
17,1,2,1,,0,0,13.0,2,1,1,True,2,1,True
19,1,3,0,,0,0,7.225,0,2,2,False,0,1,True
26,0,3,1,,0,0,7.225,0,2,1,True,0,0,True
28,1,3,0,,0,0,7.8792,1,2,2,False,1,1,True


Let's see the first five rows of the dataset without the missing values:



In [9]:
df_without_missing.head()


Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,embark_town,alive,alone
0,0,3,1,22.0,1,0,7.25,2,2,1,True,2,0,False
1,1,1,0,38.0,1,0,71.2833,0,0,2,False,0,1,False
2,1,3,0,26.0,0,0,7.925,2,2,2,False,2,1,True
3,1,1,0,35.0,1,0,53.1,2,0,2,False,2,1,False
4,0,3,1,35.0,0,0,8.05,2,2,1,True,2,0,True


Let's see the names of all the columns in the dataset:



In [10]:
# check the names of the columns
print(df.columns)

Index(['survived', 'pclass', 'sex', 'age', 'sibsp', 'parch', 'fare',
       'embarked', 'class', 'who', 'adult_male', 'embark_town', 'alive',
       'alone'],
      dtype='object')


In [11]:
# Random forest imputation of age

# split the complete rows (no missing values) into features X and target y (age)
X = df_without_missing.drop(['age'], axis=1)
y = df_without_missing['age']

# split the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)

# Random Forest Imputation
rf_model = RandomForestRegressor(n_estimators=100, random_state=42)
rf_model.fit(X_train, y_train)

# evaluate the model
y_pred = rf_model.predict(X_test)
print("RMSE for Random Forest Imputation: ", np.sqrt(mean_squared_error(y_test, y_pred)))
print("R2 Score for Random Forest Imputation: ", r2_score(y_test, y_pred))
print("MAE for Random Forest Imputation: ", mean_absolute_error(y_test, y_pred))
print("MAPE for Random Forest Imputation: ", mean_absolute_percentage_error(y_test, y_pred))

RMSE for Random Forest Imputation:  11.081260589808045
R2 Score for Random Forest Imputation:  0.33769388288226154
MAE for Random Forest Imputation:  8.666661815622195
MAPE for Random Forest Imputation:  0.40839466096086574


In [12]:
# check the number of missing values in each column
df_with_missing.isnull().sum().sort_values(ascending=False)

Unnamed: 0,0
age,177
survived,0
pclass,0
sex,0
sibsp,0
parch,0
fare,0
embarked,0
class,0
who,0


In [13]:
# Predict missing values
y_pred = rf_model.predict(df_with_missing.drop(['age'], axis=1))

In [14]:
# replace the missing values with the predicted values
# (assign returns a new DataFrame, so no SettingWithCopyWarning is raised)
df_with_missing = df_with_missing.assign(age=y_pred)

# check the missing values
df_with_missing.isnull().sum().sort_values(ascending=False)


Unnamed: 0,0
survived,0
pclass,0
sex,0
age,0
sibsp,0
parch,0
fare,0
embarked,0
class,0
who,0


In [15]:
# concatenate the two dataframes
df_complete = pd.concat([df_with_missing, df_without_missing], axis=0)
# print the shape of the complete dataframe
print("The shape of the complete dataframe is: ", df_complete.shape)

#check the first 5 rows of the complete dataframe


The shape of the complete dataframe is:  (891, 14)


In [16]:
for col in columns_to_encode:
    # Retrieve the corresponding LabelEncoder for the column
    le = label_encoders[col]

    # Inverse transform the codes of df_complete itself; its rows are ordered
    # differently from df, so decoding df[col] would misalign the labels
    df_complete[col] = le.inverse_transform(df_complete[col])

# check the first 5 rows of the complete dataframe
df_complete.head()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,embark_town,alive,alone
5,0,3,male,32.976583,0,0,8.4583,Q,Third,man,True,Queenstown,no,True
17,1,2,male,35.642218,0,0,13.0,S,Second,man,True,Southampton,yes,True
19,1,3,female,18.347,0,0,7.225,C,Third,woman,False,Cherbourg,yes,True
26,0,3,male,35.571486,0,0,7.225,C,Third,man,True,Cherbourg,no,True
28,1,3,female,20.651429,0,0,7.8792,Q,Third,woman,False,Queenstown,yes,True


In [17]:
print("The shape of the complete dataframe is: ", df_complete.shape)


The shape of the complete dataframe is:  (891, 14)


In [18]:
df_complete.isnull().sum().sort_values(ascending=False)


Unnamed: 0,0
embarked,2
embark_town,2
survived,0
pclass,0
sex,0
age,0
sibsp,0
parch,0
fare,0
class,0
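
The manual steps above (split, fit, predict, concatenate) can also be delegated to `IterativeImputer` by passing a `RandomForestRegressor` as its `estimator`. The following is only a sketch on a synthetic two-column frame, with the non-linear relationship and all parameter values chosen for illustration:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(0)

# Synthetic numeric frame (illustrative only) with a non-linear relationship
x = rng.uniform(-3, 3, size=300)
toy = pd.DataFrame({"x": x, "y": x ** 2 + rng.normal(0, 0.3, size=300)})
true_y = toy["y"].copy()
missing = rng.random(300) < 0.2
toy.loc[missing, "y"] = np.nan

imputer = IterativeImputer(
    estimator=RandomForestRegressor(n_estimators=50, random_state=0),
    max_iter=5,
    random_state=0,
)
toy_filled = pd.DataFrame(imputer.fit_transform(toy), columns=toy.columns)

# Compare the imputed entries with the values that were hidden
mae = (toy_filled.loc[missing, "y"] - true_y[missing]).abs().mean()
print(f"MAE on the imputed entries: {mae:.2f}")
```

Because the true values are known in this synthetic setup, the imputation error can be measured directly, which is not possible with real missing data.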


## 5. Multiple Imputation by Chained Equations (MICE)
MICE (Multiple Imputation by Chained Equations) fills in missing values by predicting them from the other available columns. Instead of using a single summary statistic such as the **mean or median**, MICE models the relationships between columns and **estimates each missing value step by step**. It is most natural for **numerical data**; categorical columns have to be encoded first, and the imputed codes must be mapped back to valid categories.  

### **Using MICE in Python**  
In Python, we can use the **`IterativeImputer`** from the **`sklearn.impute`** module. It is still experimental, so `enable_iterative_imputer` has to be imported from `sklearn.experimental` first. This method **predicts missing values for each column one by one**, using the other columns as inputs.  

It repeats this process **multiple times**, refining the values each round, until the estimates stabilise or the maximum number of iterations (`max_iter`) is reached.  
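
By default `IterativeImputer` returns a single completed dataset. To get closer to the "multiple" part of MICE, one option is to set `sample_posterior=True` and run the imputer with several seeds. The sketch below does this on synthetic data; the data, the number of datasets, and the pooled statistic are illustrative assumptions:

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(0)

# Synthetic correlated data (illustrative only), about 20% of column 1 missing
x = rng.normal(size=500)
X = np.column_stack([x, 2 * x + rng.normal(scale=0.5, size=500)])
X[rng.random(500) < 0.2, 1] = np.nan

# Draw several completed datasets; sample_posterior=True adds random variation
# to each imputation instead of always returning the same point prediction
completed = [
    IterativeImputer(sample_posterior=True, max_iter=10,
                     random_state=seed).fit_transform(X)
    for seed in range(5)
]

# Analyse each completed dataset, then pool the results (here: the column mean)
means = np.array([c[:, 1].mean() for c in completed])
print("per-dataset means:", means.round(3))
print("pooled mean:", means.mean().round(3))
```

The spread between the per-dataset results gives a rough sense of how much uncertainty the imputation itself contributes.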


In [19]:
# import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer

# load the dataset
df = sns.load_dataset('titanic')
df.head()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.25,S,Third,man,True,,Southampton,no,False
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,1,3,female,26.0,0,0,7.925,S,Third,woman,False,,Southampton,yes,True
3,1,1,female,35.0,1,0,53.1,S,First,woman,False,C,Southampton,yes,False
4,0,3,male,35.0,0,0,8.05,S,Third,man,True,,Southampton,no,True


In [20]:
# check the missing values
df.isnull().sum().sort_values(ascending=False)

Unnamed: 0,0
deck,688
age,177
embarked,2
embark_town,2
survived,0
pclass,0
sex,0
sibsp,0
parch,0
fare,0


In [21]:
df.head()


Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.25,S,Third,man,True,,Southampton,no,False
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,1,3,female,26.0,0,0,7.925,S,Third,woman,False,,Southampton,yes,True
3,1,1,female,35.0,1,0,53.1,S,First,woman,False,C,Southampton,yes,False
4,0,3,male,35.0,0,0,8.05,S,Third,man,True,,Southampton,no,True


In [22]:
from sklearn.preprocessing import LabelEncoder

# Label-encode the categorical columns, one LabelEncoder per column
# (note: LabelEncoder turns NaN into an extra class, so the encoded columns contain no NaN)
# Columns to encode
columns_to_encode = ['sex', 'embarked', 'who', 'deck', 'class', 'embark_town', 'alive']

# Dictionary to store LabelEncoders for each column
label_encoders = {}

# Loop to apply LabelEncoder to each column for encoding
for col in columns_to_encode:
    # Create a new LabelEncoder for the column
    le = LabelEncoder()
    # Fit and transform the data
    df[col] = le.fit_transform(df[col])
    # Store the encoder in the dictionary
    label_encoders[col] = le
df.head()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,1,22.0,1,0,7.25,2,2,1,True,7,2,0,False
1,1,1,0,38.0,1,0,71.2833,0,0,2,False,2,0,1,False
2,1,3,0,26.0,0,0,7.925,2,2,2,False,7,2,1,True
3,1,1,0,35.0,1,0,53.1,2,0,2,False,2,2,1,False
4,0,3,1,35.0,0,0,8.05,2,2,1,True,7,2,0,True


In [23]:
# impute the missing values with IterativeImputer
# call the IterativeImputer class with max_iter = 10
imputer = IterativeImputer(max_iter=10, random_state=42)

# Columns to impute
columns_to_impute = ['age', 'embark_town', 'embarked', 'deck']

# Fit on all columns at once so that each imputed column is modelled from the others
# (fitting one column at a time would simply fill it with the column mean)
imputed = pd.DataFrame(imputer.fit_transform(df.astype(float)), columns=df.columns, index=df.index)
df[columns_to_impute] = imputed[columns_to_impute]
# check the missing values
df.isnull().sum().sort_values(ascending=False)

Unnamed: 0,0
survived,0
pclass,0
sex,0
age,0
sibsp,0
parch,0
fare,0
embarked,0
class,0
who,0


In [24]:
df.head()


Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,1,22.0,1,0,7.25,2.0,2,1,True,7.0,2.0,0,False
1,1,1,0,38.0,1,0,71.2833,0.0,0,2,False,2.0,0.0,1,False
2,1,3,0,26.0,0,0,7.925,2.0,2,2,False,7.0,2.0,1,True
3,1,1,0,35.0,1,0,53.1,2.0,0,2,False,2.0,2.0,1,False
4,0,3,1,35.0,0,0,8.05,2.0,2,1,True,7.0,2.0,0,True


In [25]:
# Inverse transform for encoded columns
for col in columns_to_encode:
    # Retrieve the corresponding LabelEncoder for the column
    le = label_encoders[col]
    # Cast the codes back to integers, then inverse transform them to labels
    df[col] = le.inverse_transform(df[col].astype(int))

df.head()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.25,S,Third,man,True,,Southampton,no,False
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,1,3,female,26.0,0,0,7.925,S,Third,woman,False,,Southampton,yes,True
3,1,1,female,35.0,1,0,53.1,S,First,woman,False,C,Southampton,yes,False
4,0,3,male,35.0,0,0,8.05,S,Third,man,True,,Southampton,no,True
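
In the output above, `deck` is still empty for rows 0, 2 and 4. `LabelEncoder` turned `NaN` into a class of its own (code 7 for `deck`), so the imputer saw no gaps in the categorical columns and the inverse transform restored the missing values. One way around this is to keep `NaN` as `NaN` while encoding, impute all columns together, and round the imputed codes to valid categories. The sketch below shows the idea on a small made-up frame; the column names and values are illustrative, and treating category codes as numbers is itself a simplification:

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Small synthetic frame (illustrative only)
toy = pd.DataFrame({
    "fare": [7.3, 71.3, 7.9, 53.1, 8.1, 8.5, 51.9, 21.1],
    "age": [22.0, 38.0, np.nan, 35.0, 35.0, np.nan, 54.0, 2.0],
    "town": ["S", "C", "S", np.nan, "S", "Q", "S", np.nan],
})

# Encode the categorical column as codes, but keep NaN as NaN
town_cat = toy["town"].astype("category")
codes = town_cat.cat.codes.astype(float)   # missing entries get code -1
encoded = toy.copy()
encoded["town"] = codes.where(codes >= 0)  # turn -1 back into NaN

# Impute all columns jointly so that each one is modelled from the others
imputer = IterativeImputer(max_iter=10, random_state=0)
imputed = pd.DataFrame(
    imputer.fit_transform(encoded), columns=encoded.columns, index=encoded.index
)

# Round the imputed codes to the nearest valid category before decoding
categories = np.asarray(town_cat.cat.categories)
valid_codes = imputed["town"].round().clip(0, len(categories) - 1).astype(int)
imputed["town"] = categories[valid_codes.to_numpy()]

print(imputed)
```

Observed values are left untouched, and every imputed entry decodes to one of the existing categories.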
