# Imputing Missing Values in Python
### What are missing values?
- `Missing values` are entries that are absent from the dataset.
- They are typically represented by `NaN`, `NA`, or `None`.
- Missing values can arise for various reasons, such as `data corruption`, `data entry errors`, or information that was never recorded.
- Missing values can be handled by:
    - `removing` the rows or columns with missing values,
    - `imputing` the missing values, or
    - `using` algorithms that can handle missing values.
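
To make the first two options concrete, here is a minimal sketch on a small, made-up DataFrame. The column names and values are hypothetical and are not taken from the dataset used later in this notebook.

```python
import numpy as np
import pandas as pd

# Small made-up table with gaps (hypothetical values)
toy = pd.DataFrame({
    "height": [1.70, np.nan, 1.82, 1.65],
    "city": ["Lahore", "Karachi", None, "Multan"],
    "score": [10, 20, 30, 40],
})

# Count the missing entries in each column
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Option 1: remove rows or columns that contain a missing value
rows_dropped = toy.dropna(axis=0)   # keeps only fully observed rows
cols_dropped = toy.dropna(axis=1)   # keeps only fully observed columns

# Option 2: impute, here with the column mean of the numeric column
imputed = toy.copy()
imputed["height"] = imputed["height"].fillna(imputed["height"].mean())
print(imputed)
```

Dropping is the simplest option but throws away the observed values in the removed rows or columns, which is why the rest of the notebook focuses on imputation.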

In this notebook, I will show how to handle missing values using the `pandas` and `scikit-learn` libraries.

Before you start, you need to know how to detect missing values in your dataset. This blog post will help you with that:


https://codanics.com/missing-values-k-rolay/

## Six Important Ways of Imputing Missing Values

Filling in missing values is known as data imputation, a common data preprocessing step for handling missing or incomplete data. The available methods range from simple statistics to machine learning models, and the right one depends on the nature of your data and of the missing values:

1. **`Simple Imputation Techniques`:**

    - **Mean/Median Imputation:** Replace missing values with the mean or median of the column. Suitable for numerical data.
    - **Mode Imputation:** Replace missing values with the mode (most frequent value) of the column. Useful for categorical data.

2. **`K-Nearest Neighbors (KNN)`:** This algorithm can be used to impute missing values based on the similarity of rows.

3. **`Regression Imputation`:** Use a regression model to predict the missing values based on other variables in your dataset.

4. **`Decision Trees and Random Forests`:** Some tree-based implementations can handle missing values natively. Tree models can also be used to predict missing values based on the patterns learned from the other data.

5. **`Advanced Techniques`:**

    - **Multiple Imputation by Chained Equations (MICE):** This is a more sophisticated technique that models each variable with missing values as a function of other variables in a round-robin fashion.

    - **Deep Learning Methods:** Neural networks, especially autoencoders, can be effective in imputing missing values in complex datasets.

6. **`Time Series Specific Methods`:** For time-series data, you might use techniques like interpolation, forward-fill, or backward-fill.

It's important to choose the right method based on the type of data, the pattern of missingness (e.g., at random, completely at random, or not at random), and the amount of missing data. Additionally, it's crucial to understand that imputation can introduce bias or affect the distribution of your data, so it should be done with caution and an understanding of the potential implications.
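
The time-series methods in item 6 are not demonstrated later in this notebook, so here is a minimal sketch of forward-fill, backward-fill, and linear interpolation with `pandas`. The dates and readings are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical daily readings with gaps
idx = pd.date_range("2024-01-01", periods=6, freq="D")
readings = pd.Series([10.0, np.nan, np.nan, 16.0, 18.0, np.nan], index=idx)

forward = readings.ffill()        # carry the last observed value forward
backward = readings.bfill()       # pull the next observed value backward
linear = readings.interpolate()   # straight line between neighbouring observations

print(pd.DataFrame({"raw": readings, "ffill": forward,
                    "bfill": backward, "linear": linear}))
```

Note that the final gap stays `NaN` under backward-fill, because no later observation exists to pull from. Which variant is appropriate depends on how the series behaves between observations.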

## 1. Simple Imputation Techniques
### 1.1. Mean/Median Imputation

Mean/median imputation replaces missing values with the mean or median of the column. This is a simple and effective method, but it has some limitations. For example, it reduces variance in the dataset, and it can lead to biased estimates if the missing values are not missing at random.
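
The variance reduction mentioned above is easy to see on a tiny example. The sketch below uses made-up ages: the mean stays the same after imputation, but the sample variance is halved because the filled-in values sit exactly at the mean.

```python
import numpy as np
import pandas as pd

# Hypothetical ages with two gaps
ages = pd.Series([20.0, 30.0, 40.0, np.nan, np.nan])

mean_filled = ages.fillna(ages.mean())   # both gaps become 30.0

var_before = ages.var()          # sample variance of the 3 observed values
var_after = mean_filled.var()    # sample variance of all 5 values

print("mean before/after:", ages.mean(), mean_filled.mean())
print("variance before/after:", var_before, var_after)
```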

Let's see how to implement mean/median imputation in Python using the Titanic dataset.

#### 1.1.1. Mean Imputation
First, let's import the necessary libraries and load the dataset:

In [642]:
import numpy as np
import seaborn as sns
import pandas as pd
import matplotlib.pyplot as plt

In [643]:
#load the Titanic dataset
data = sns.load_dataset('titanic')
data.head()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.25,S,Third,man,True,,Southampton,no,False
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,1,3,female,26.0,0,0,7.925,S,Third,woman,False,,Southampton,yes,True
3,1,1,female,35.0,1,0,53.1,S,First,woman,False,C,Southampton,yes,False
4,0,3,male,35.0,0,0,8.05,S,Third,man,True,,Southampton,no,True


In [644]:
#check the number of missing values in each column
data.isnull().sum().sort_values(ascending=False)

deck           688
age            177
embarked         2
embark_town      2
survived         0
pclass           0
sex              0
sibsp            0
parch            0
fare             0
class            0
who              0
adult_male       0
alive            0
alone            0
dtype: int64

We can see that the `age` column has 177 missing values and the `deck` column has 688. With so many values missing, we drop `deck` entirely, and then replace the missing `age` values with the mean of the column:

In [645]:
# Remove/drop the deck column and update the dataset
data = data.drop('deck', axis=1)

# Equivalent in-place form: data.drop("deck", axis=1, inplace=True)
# In DataFrame.drop(), axis=0 drops rows and axis=1 drops columns

In [646]:
data

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.2500,S,Third,man,True,Southampton,no,False
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,Cherbourg,yes,False
2,1,3,female,26.0,0,0,7.9250,S,Third,woman,False,Southampton,yes,True
3,1,1,female,35.0,1,0,53.1000,S,First,woman,False,Southampton,yes,False
4,0,3,male,35.0,0,0,8.0500,S,Third,man,True,Southampton,no,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
886,0,2,male,27.0,0,0,13.0000,S,Second,man,True,Southampton,no,True
887,1,1,female,19.0,0,0,30.0000,S,First,woman,False,Southampton,yes,True
888,0,3,female,,1,2,23.4500,S,Third,woman,False,Southampton,no,False
889,1,1,male,26.0,0,0,30.0000,C,First,man,True,Cherbourg,yes,True


In [647]:
# impute missing values with mean
data['age'] = data['age'].fillna(data['age'].mean())
data

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,embark_town,alive,alone
0,0,3,male,22.000000,1,0,7.2500,S,Third,man,True,Southampton,no,False
1,1,1,female,38.000000,1,0,71.2833,C,First,woman,False,Cherbourg,yes,False
2,1,3,female,26.000000,0,0,7.9250,S,Third,woman,False,Southampton,yes,True
3,1,1,female,35.000000,1,0,53.1000,S,First,woman,False,Southampton,yes,False
4,0,3,male,35.000000,0,0,8.0500,S,Third,man,True,Southampton,no,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
886,0,2,male,27.000000,0,0,13.0000,S,Second,man,True,Southampton,no,True
887,1,1,female,19.000000,0,0,30.0000,S,First,woman,False,Southampton,yes,True
888,0,3,female,29.699118,1,2,23.4500,S,Third,woman,False,Southampton,no,False
889,1,1,male,26.000000,0,0,30.0000,C,First,man,True,Cherbourg,yes,True


In [648]:
# check the number of missing values in each column again
data.isnull().sum().sort_values(ascending=False)

embarked       2
embark_town    2
survived       0
pclass         0
sex            0
age            0
sibsp          0
parch          0
fare           0
class          0
who            0
adult_male     0
alive          0
alone          0
dtype: int64

The `deck` column is gone.

The `age` column also has no missing values now, because they have been replaced with the mean of the column.
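
Whether to drop a column such as `deck` usually comes down to how much of it is missing. Here is a minimal sketch of that decision on a made-up table. The column names are hypothetical and the 50% threshold is an assumption for illustration, not a rule.

```python
import numpy as np
import pandas as pd

# Hypothetical table: 'cabin' is mostly empty, 'age' has a single gap
toy = pd.DataFrame({
    "cabin": [np.nan, "C85", np.nan, np.nan],
    "age": [22.0, np.nan, 26.0, 35.0],
    "fare": [7.25, 71.28, 7.93, 53.10],
})

missing_share = toy.isna().mean()   # fraction of missing entries per column
threshold = 0.5                     # assumed cut-off, adjust per project

# Drop every column whose missing share exceeds the threshold
to_drop = missing_share[missing_share > threshold].index.tolist()
reduced = toy.drop(columns=to_drop)

print(missing_share)
print("dropped:", to_drop)
```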

#### 1.1.2. Median Imputation
Let's load the dataset and replace the missing values in the `age` column with the median of the column:

In [649]:
# load the Titanic dataset again for Median Imputation
df = sns.load_dataset('titanic')
df.head()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.25,S,Third,man,True,,Southampton,no,False
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,1,3,female,26.0,0,0,7.925,S,Third,woman,False,,Southampton,yes,True
3,1,1,female,35.0,1,0,53.1,S,First,woman,False,C,Southampton,yes,False
4,0,3,male,35.0,0,0,8.05,S,Third,man,True,,Southampton,no,True


In [650]:
df.isnull().sum().sort_values(ascending=False)

deck           688
age            177
embarked         2
embark_town      2
survived         0
pclass           0
sex              0
sibsp            0
parch            0
fare             0
class            0
who              0
adult_male       0
alive            0
alone            0
dtype: int64

In [651]:
# We can drop/remove the `deck` column 
df.drop('deck', axis=1, inplace=True)

In [652]:
# impute missing values with median
df['age'] = df['age'].fillna(df['age'].median())

# check the number of missing values in each column
df.isnull().sum().sort_values(ascending=False)

embarked       2
embark_town    2
survived       0
pclass         0
sex            0
age            0
sibsp          0
parch          0
fare           0
class          0
who            0
adult_male     0
alive          0
alone          0
dtype: int64
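
The median is often preferred over the mean when a column contains extreme values. A small sketch with made-up numbers shows why: a single outlier pulls the mean far away from the typical value, while the median barely moves.

```python
import numpy as np
import pandas as pd

# Hypothetical fares: one extreme value and one gap
fares = pd.Series([7.0, 8.0, 9.0, 500.0, np.nan])

mean_fill = fares.fillna(fares.mean()).iloc[-1]       # 131.0
median_fill = fares.fillna(fares.median()).iloc[-1]   # 8.5

print("filled with mean:", mean_fill)
print("filled with median:", median_fill)
```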

### 1.2. Mode Imputation
Mode imputation replaces missing values with the mode (most frequent value) of the column. This is useful for imputing categorical columns, such as `embarked` and `embark_town` in the Titanic dataset.

Let's see how to implement mode imputation in Python using the Titanic dataset.

In [653]:
# load the dataset again
df = sns.load_dataset('titanic')

# check the number of missing values in each column
df.isnull().sum().sort_values(ascending=False)

deck           688
age            177
embarked         2
embark_town      2
survived         0
pclass           0
sex              0
sibsp            0
parch            0
fare             0
class            0
who              0
adult_male       0
alive            0
alone            0
dtype: int64

In [654]:
# impute missing values with mode
df['embark_town'] = df['embark_town'].fillna(df['embark_town'].mode()[0])
df['embarked'] = df['embarked'].fillna(df['embarked'].mode()[0])


In [655]:
# check the number of missing values in each column
df.isnull().sum().sort_values(ascending=False)

deck           688
age            177
survived         0
pclass           0
sex              0
sibsp            0
parch            0
fare             0
embarked         0
class            0
who              0
adult_male       0
embark_town      0
alive            0
alone            0
dtype: int64

In [656]:
df.drop('deck', axis=1, inplace=True)
df.drop('age', axis=1, inplace=True)

In [657]:
# check the number of missing values in each column
df.isnull().sum().sort_values(ascending=False)


# deck and age were dropped rather than imputed, so no column contains missing values now

survived       0
pclass         0
sex            0
sibsp          0
parch          0
fare           0
embarked       0
class          0
who            0
adult_male     0
embark_town    0
alive          0
alone          0
dtype: int64

We can see that the missing values in the `embark_town` column and `embarked` column have been replaced with the mode of the column.
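
The same result can be obtained for several columns at once with scikit-learn's `SimpleImputer` and `strategy="most_frequent"`. This is a sketch on made-up data, offered as an alternative to the `fillna` approach above. The column names `port` and `town` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical categorical columns with one gap each
toy = pd.DataFrame({
    "port": ["S", "C", "S", np.nan, "Q"],
    "town": ["Southampton", "Cherbourg", "Southampton", np.nan, "Queenstown"],
}, dtype=object)

# most_frequent fills each column with its own mode
imputer = SimpleImputer(strategy="most_frequent")
filled = pd.DataFrame(imputer.fit_transform(toy),
                      columns=toy.columns, index=toy.index)

print(filled)
```

One advantage of the imputer object is that it stores the learned modes, so the same fill values can later be applied to new data with `imputer.transform`.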

## 2. K-Nearest Neighbors (KNN)
`KNN` is a machine learning algorithm that can be used for imputing missing values. It works by finding the `most similar data points to the one with the missing value` based on other available features. The missing value is then imputed with the mean or median of the most similar data points.
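
To see the idea on numbers small enough to check by hand, here is a minimal sketch with a made-up two-column array. It also shows what happens when only one column is passed: with nothing to measure similarity on, `KNNImputer` falls back to the column mean.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Hypothetical data: the second feature is missing in the third row
X = np.array([
    [1.0, 10.0],
    [2.0, 20.0],
    [3.0, np.nan],
    [8.0, 80.0],
])

X_filled = KNNImputer(n_neighbors=2).fit_transform(X)
# Judged by the first feature, the two nearest rows are [2, 20] and [1, 10],
# so the gap is filled with their average: 15.0
print(X_filled[2, 1])

# A single column gives the imputer no other feature to compare rows on,
# so the gap is filled with the column mean: (1 + 2 + 9) / 3 = 4.0
single = np.array([[1.0], [2.0], [np.nan], [9.0]])
single_filled = KNNImputer(n_neighbors=2).fit_transform(single)
print(single_filled[2, 0])
```

In practice this means KNN imputation only adds something over mean imputation when the imputer is given several related features together.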

Let's see how to implement KNN imputation in Python using the Titanic dataset.

In [658]:
# load the dataset again for KNN imputer
df = sns.load_dataset('titanic')

# check the number of missing values in each column
df.isnull().sum().sort_values(ascending=False)

deck           688
age            177
embarked         2
embark_town      2
survived         0
pclass           0
sex              0
sibsp            0
parch            0
fare             0
class            0
who              0
adult_male       0
alive            0
alone            0
dtype: int64

In [659]:
# impute missing values with KNN imputer
from sklearn.impute import KNNImputer

# create a KNNImputer with number of neighbors = 4
imputer = KNNImputer(n_neighbors=4)

In [660]:
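# impute missing values with the KNN imputer
# KNN needs other features to measure similarity between rows; given `age` alone,
# KNNImputer has nothing to compare and falls back to the column mean.
# The numeric columns below are one reasonable choice of features.
num_cols = ['age', 'fare', 'pclass', 'sibsp', 'parch']
# fit_transform(X) is equivalent to fit(X).transform(X)
df[num_cols] = imputer.fit_transform(df[num_cols])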

# check the number of missing values in each column
df.isnull().sum().sort_values(ascending=False)

deck           688
embarked         2
embark_town      2
survived         0
pclass           0
sex              0
age              0
sibsp            0
parch            0
fare             0
class            0
who              0
adult_male       0
alive            0
alone            0
dtype: int64

## 3. Regression Imputation

Regression imputation uses a `regression model` to predict the missing values based on other variables in the dataset. It is suited to `numerical data`; for `categorical` columns, a classification model plays the same role.
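
The mechanics can be sketched by hand before turning to a ready-made imputer: fit a model on the rows where the target is known, then predict it where it is missing. The example below uses made-up data and `LinearRegression`. The column names `rooms` and `rent` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical data: rent is exactly 200 * rooms, with one value missing
toy = pd.DataFrame({
    "rooms": [1.0, 2.0, 3.0, 4.0, 5.0],
    "rent": [200.0, 400.0, np.nan, 800.0, 1000.0],
})

known = toy[toy["rent"].notna()]     # rows used to fit the model
unknown = toy[toy["rent"].isna()]    # rows whose rent will be predicted

model = LinearRegression().fit(known[["rooms"]], known["rent"])
toy.loc[unknown.index, "rent"] = model.predict(unknown[["rooms"]])

print(toy)
```

Because the made-up relationship is perfectly linear, the predicted rent for 3 rooms comes out at about 600. Real data is noisier, and the imputed values inherit the model's errors.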

Let's see how to implement regression imputation in Python using the Titanic dataset.

In [661]:
# load the dataset again for Regression Imputation
df = sns.load_dataset('titanic')

# check the number of missing values in each column
df.isnull().sum().sort_values(ascending=False)

deck           688
age            177
embarked         2
embark_town      2
survived         0
pclass           0
sex              0
sibsp            0
parch            0
fare             0
class            0
who              0
adult_male       0
alive            0
alone            0
dtype: int64

In [662]:
df.head()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.25,S,Third,man,True,,Southampton,no,False
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,1,3,female,26.0,0,0,7.925,S,Third,woman,False,,Southampton,yes,True
3,1,1,female,35.0,1,0,53.1,S,First,woman,False,C,Southampton,yes,False
4,0,3,male,35.0,0,0,8.05,S,Third,man,True,,Southampton,no,True


In [663]:
# import the important libraries for regression imputer
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer


# call the IterativeImputer class with max_iter = 10
imputer = IterativeImputer(max_iter=10)

In [664]:
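# impute missing values with the regression (iterative) imputer
# The imputer predicts `age` from the other columns passed to it; given `age` alone
# it can only return the column mean, so the other numeric columns are included.
num_cols = ['age', 'fare', 'pclass', 'sibsp', 'parch']
df[num_cols] = imputer.fit_transform(df[num_cols])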

# check the number of missing values in each column
df.isnull().sum().sort_values(ascending=False)

deck           688
embarked         2
embark_town      2
survived         0
pclass           0
sex              0
age              0
sibsp            0
parch            0
fare             0
class            0
who              0
adult_male       0
alive            0
alone            0
dtype: int64

## 4. Random Forests for Imputing Missing Values

Some random forest implementations can handle missing values `natively`. A random forest can also be used to predict missing values based on the patterns learned from the other data, which is the approach shown here.

Let's see how to implement random forests in Python using the Titanic dataset.


In [665]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error, mean_absolute_percentage_error
from sklearn.impute import SimpleImputer

# 1. load the dataset
df = sns.load_dataset('titanic')

# check missing values in each column
df.isnull().sum().sort_values(ascending=False)

deck           688
age            177
embarked         2
embark_town      2
survived         0
pclass           0
sex              0
sibsp            0
parch            0
fare             0
class            0
who              0
adult_male       0
alive            0
alone            0
dtype: int64

We will remove the deck column from the dataset because it has too many missing values:

In [666]:
# remove deck column
df.drop('deck', axis=1, inplace=True)

# check missing values in each column
df.isnull().sum().sort_values(ascending=False)

age            177
embarked         2
embark_town      2
survived         0
pclass           0
sex              0
sibsp            0
parch            0
fare             0
class            0
who              0
adult_male       0
alive            0
alone            0
dtype: int64

In [667]:
df.head()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.25,S,Third,man,True,Southampton,no,False
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,Cherbourg,yes,False
2,1,3,female,26.0,0,0,7.925,S,Third,woman,False,Southampton,yes,True
3,1,1,female,35.0,1,0,53.1,S,First,woman,False,Southampton,yes,False
4,0,3,male,35.0,0,0,8.05,S,Third,man,True,Southampton,no,True


In [669]:
# encode the data using label encoding
from sklearn.preprocessing import LabelEncoder
columns_to_encode = ['sex', 'embarked', 'who', 'class', 'embark_town', 'alive']

encoder = LabelEncoder()
df['sex'] = encoder.fit_transform(df['sex'])
encoder = LabelEncoder()
df['embarked'] = encoder.fit_transform(df['embarked'])
encoder = LabelEncoder()
df['who'] = encoder.fit_transform(df['who'])
encoder = LabelEncoder()
df['class'] = encoder.fit_transform(df['class'])
encoder = LabelEncoder()
df['embark_town'] = encoder.fit_transform(df['embark_town'])
encoder = LabelEncoder()
df['alive'] = encoder.fit_transform(df['alive'])


# The same encoding can be written with a for loop, as in the next cell (use one version or the other, not both)

In [670]:
# Loop to apply LabelEncoder to each column

# Dictionary to store LabelEncoders for each column
label_encoders = {} 


for col in columns_to_encode:
    # Create a new LabelEncoder for the column
    le = LabelEncoder()

    # Fit the encoder and transform the column into integer codes
    df[col] = le.fit_transform(df[col])

    # Store the encoder in the dictionary
    label_encoders[col] = le

In [671]:
# Check the first few rows of the DataFrame
df.head()


Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,embark_town,alive,alone
0,0,3,1,22.0,1,0,7.25,2,2,1,True,2,0,False
1,1,1,0,38.0,1,0,71.2833,0,0,2,False,0,1,False
2,1,3,0,26.0,0,0,7.925,2,2,2,False,2,1,True
3,1,1,0,35.0,1,0,53.1,2,0,2,False,2,1,False
4,0,3,1,35.0,0,0,8.05,2,2,1,True,2,0,True


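Label encoding has given the two missing entries in `embarked` and `embark_town` a code of their own, so `age` is now the only column that still contains missing values. We will train a model on the rows where `age` is known and use it to predict `age` where it is missing.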
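
Label encoding assigns a code to every distinct value, and depending on the encoder a missing entry may end up with a code of its own. If the gaps should stay visible as `NaN` after encoding, one option is to use `pandas` category codes instead. This is a minimal sketch on a made-up Series, not the approach used in the cells of this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical categorical column with one missing entry
ports = pd.Series(["S", "C", np.nan, "Q", "S"], dtype=object)

cat = ports.astype("category")
categories = cat.cat.categories          # sorted labels: C, Q, S

# pandas marks missing entries with the code -1; turn that back into NaN
codes = cat.cat.codes.astype("float").replace(-1.0, np.nan)
print(codes.tolist())

# Decode: map each code back to its label and keep NaN as NaN
decoded = codes.map(lambda c: np.nan if np.isnan(c) else categories[int(c)])
print(decoded.tolist())
```

With the gaps preserved, a later imputation step can still find and fill them, and the codes can be mapped back to the original labels afterwards.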

In [672]:
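# Split the dataset into two parts: one with missing values, one without
# .copy() makes an independent DataFrame, so assigning to it later does not raise SettingWithCopyWarning
df_with_missing = df[df['age'].isna()].copy()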

# dropna removes all rows with missing values
df_without_missing = df.dropna()

In [673]:
df_with_missing.head()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,embark_town,alive,alone
5,0,3,1,,0,0,8.4583,1,2,1,True,1,0,True
17,1,2,1,,0,0,13.0,2,1,1,True,2,1,True
19,1,3,0,,0,0,7.225,0,2,2,False,0,1,True
26,0,3,1,,0,0,7.225,0,2,1,True,0,0,True
28,1,3,0,,0,0,7.8792,1,2,2,False,1,1,True


Let's see the shape of the datasets with and without the missing values:

In [674]:
print("The shape of the original dataset is: ", df.shape)
print("The shape of the dataset with missing values removed is: ", df_without_missing.shape)
print("The shape of the dataset with missing values is: ", df_with_missing.shape)

The shape of the original dataset is:  (891, 14)
The shape of the dataset with missing values removed is:  (714, 14)
The shape of the dataset with missing values is:  (177, 14)


Let's look again at the first five rows of the dataset with the missing values, `df_with_missing`:

In [675]:
df_with_missing.head()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,embark_town,alive,alone
5,0,3,1,,0,0,8.4583,1,2,1,True,1,0,True
17,1,2,1,,0,0,13.0,2,1,1,True,2,1,True
19,1,3,0,,0,0,7.225,0,2,2,False,0,1,True
26,0,3,1,,0,0,7.225,0,2,1,True,0,0,True
28,1,3,0,,0,0,7.8792,1,2,2,False,1,1,True


Let's see the first five rows of the dataset without the missing values:

In [676]:
df_without_missing.head()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,embark_town,alive,alone
0,0,3,1,22.0,1,0,7.25,2,2,1,True,2,0,False
1,1,1,0,38.0,1,0,71.2833,0,0,2,False,0,1,False
2,1,3,0,26.0,0,0,7.925,2,2,2,False,2,1,True
3,1,1,0,35.0,1,0,53.1,2,0,2,False,2,1,False
4,0,3,1,35.0,0,0,8.05,2,2,1,True,2,0,True


Let's see the names of all the columns in the dataset:

In [677]:
# check the names of the columns
print(df.columns)

Index(['survived', 'pclass', 'sex', 'age', 'sibsp', 'parch', 'fare',
       'embarked', 'class', 'who', 'adult_male', 'embark_town', 'alive',
       'alone'],
      dtype='object')


In [678]:
# Train the model on the rows where 'age' is present

# split the data into features X (every column except 'age') and target y ('age')
X = df_without_missing.drop(['age'], axis=1)
y = df_without_missing['age']

# split the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=1)

In [679]:
# Random Forest Imputation
rf_model = RandomForestRegressor(n_estimators=100, random_state=4)
rf_model.fit(X_train, y_train)

In [680]:
# evaluate the model
y_pred = rf_model.predict(X_test)
print("RMSE for Random Forest Imputation: ", np.sqrt(mean_squared_error(y_test, y_pred)))
print("R2 Score for Random Forest Imputation: ", r2_score(y_test, y_pred))
print("MAE for Random Forest Imputation: ", mean_absolute_error(y_test, y_pred))
print("MAPE for Random Forest Imputation: ", mean_absolute_percentage_error(y_test, y_pred))

RMSE for Random Forest Imputation:  11.786290881896859
R2 Score for Random Forest Imputation:  0.3324613636132774
MAE for Random Forest Imputation:  9.01689505505794
MAPE for Random Forest Imputation:  0.5430325223866342
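
To make these four metrics easier to interpret, here is a small sketch that computes them by hand with NumPy on made-up numbers. Note that scikit-learn reports MAPE as a fraction, so a value of 0.54 corresponds to an average relative error of about 54%.

```python
import numpy as np

# Hypothetical true and predicted values
y_true = np.array([2.0, 4.0, 6.0])
y_hat = np.array([3.0, 4.0, 5.0])

errors = y_true - y_hat

rmse = np.sqrt(np.mean(errors ** 2))              # root mean squared error
mae = np.mean(np.abs(errors))                     # mean absolute error
mape = np.mean(np.abs(errors) / np.abs(y_true))   # mean absolute percentage error, as a fraction
r2 = 1.0 - np.sum(errors ** 2) / np.sum((y_true - y_true.mean()) ** 2)

print(rmse, mae, mape, r2)
```

An R2 close to 1 means the model explains most of the variation in the target. The lower it is, the closer the imputed values are to what plain mean imputation would have given.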


In [681]:
# check the number of missing values in each column
df_with_missing.isnull().sum().sort_values(ascending=False)

age            177
survived         0
pclass           0
sex              0
sibsp            0
parch            0
fare             0
embarked         0
class            0
who              0
adult_male       0
embark_town      0
alive            0
alone            0
dtype: int64

In [682]:
# Predict missing values of the age in the missing dataset
# Axis=1 means column

y_pred = rf_model.predict(df_with_missing.drop(['age'], axis=1))

In [683]:
y_pred  # predicted missing values
# This will show the values of age predicted for the missing dataset

array([32.2928315 , 35.21006586, 19.49666667, 31.74140873, 25.44666667,
       26.09048999, 33.77      , 19.232     , 22.07969643, 33.30404839,
       31.6299024 , 34.74166667, 19.232     , 23.42      , 34.23      ,
       39.061     , 30.7       , 26.09048999, 31.6299024 , 22.00333333,
       31.6299024 , 31.6299024 , 26.09048999, 26.86810995, 30.80708333,
       31.6299024 , 49.0782123 , 30.32      , 35.285     , 31.14160209,
       24.43944225, 20.7575    , 27.55533333, 57.67626984, 24.34242857,
       18.8525    , 28.99      , 50.3       , 30.47      , 49.0782123 ,
       19.232     , 20.7575    , 42.89875397, 26.09048999, 21.91      ,
       31.572     , 27.795     , 30.47      , 31.14160209, 30.842     ,
       49.0782123 , 26.91094048, 52.54      , 19.232     , 38.11984863,
       59.25960317, 39.061     , 55.90075   , 19.232     , 24.195     ,
       33.96666667, 31.6299024 , 28.53      , 20.7575    , 22.75      ,
       37.12      , 26.09048999, 29.225     , 46.09      , 31.74

In [684]:
# replace the missing values with the predicted values
df_with_missing['age'] = y_pred

# check the missing values
df_with_missing.isnull().sum().sort_values(ascending=False)

survived       0
pclass         0
sex            0
age            0
sibsp          0
parch          0
fare           0
embarked       0
class          0
who            0
adult_male     0
embark_town    0
alive          0
alone          0
dtype: int64

In [685]:
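# If the assignment above raises a SettingWithCopyWarning (it can when df_with_missing
# is a slice of df rather than an explicit copy), the warning can be silenced as below.
# Creating the slice with .copy() avoids the warning at its source.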
import warnings
warnings.filterwarnings('ignore')

# replace the missing values with the predicted values
df_with_missing['age'] = y_pred

# check the missing values
df_with_missing.isnull().sum().sort_values(ascending=False)

survived       0
pclass         0
sex            0
age            0
sibsp          0
parch          0
fare           0
embarked       0
class          0
who            0
adult_male     0
embark_town    0
alive          0
alone          0
dtype: int64

In [686]:
# concatenate the two dataframes for axis=0 (means rows)
df_complete = pd.concat([df_with_missing, df_without_missing], axis=0)
# print the shape of the complete dataframe
print("The shape of the complete dataframe is: ", df_complete.shape)

#check the first 5 rows of the complete dataframe
df_complete.head()

The shape of the complete dataframe is:  (891, 14)


Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,embark_town,alive,alone
5,0,3,1,32.292832,0,0,8.4583,1,2,1,True,1,0,True
17,1,2,1,35.210066,0,0,13.0,2,1,1,True,2,1,True
19,1,3,0,19.496667,0,0,7.225,0,2,2,False,0,1,True
26,0,3,1,31.741409,0,0,7.225,0,2,1,True,0,0,True
28,1,3,0,25.446667,0,0,7.8792,1,2,2,False,1,1,True


In [687]:
# check the number of missing values in each column
df_complete.isnull().sum().sort_values(ascending=False)

survived       0
pclass         0
sex            0
age            0
sibsp          0
parch          0
fare           0
embarked       0
class          0
who            0
adult_male     0
embark_town    0
alive          0
alone          0
dtype: int64

In [688]:
# save the completed data to a CSV file

df_complete.to_csv('Titanic_Full.csv', index=False)
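
The split, train, predict, and fill steps above can be wrapped in one small helper. The sketch below is one possible way to do that, shown on made-up data. The function name `impute_with_forest` is hypothetical, and it assumes all feature columns are numeric and complete. Unlike the concatenation above, it writes the predictions back in place, so the original row order is kept.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor


def impute_with_forest(frame, target, random_state=0):
    """Fill NaNs in `target` with a forest trained on rows where it is observed.

    Assumes every other column is numeric and has no missing values.
    """
    out = frame.copy()
    missing = out[target].isna()
    if not missing.any():
        return out
    features = out.columns.drop(target)
    model = RandomForestRegressor(n_estimators=50, random_state=random_state)
    model.fit(out.loc[~missing, features], out.loc[~missing, target])
    # Write predictions back into the same rows, keeping the row order
    out.loc[missing, target] = model.predict(out.loc[missing, features])
    return out


# Hypothetical data: y depends on x1 and x2, with three values removed
rng = np.random.default_rng(0)
toy = pd.DataFrame({"x1": rng.normal(size=40), "x2": rng.normal(size=40)})
toy["y"] = 3 * toy["x1"] - toy["x2"]
toy.loc[[3, 7, 11], "y"] = np.nan

completed = impute_with_forest(toy, "y")
print(completed.loc[[3, 7, 11]])
```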

## 5. Advanced Techniques
### 5.1. Multiple Imputation by Chained Equations (MICE)

Multiple Imputation by Chained Equations (MICE) is a more sophisticated technique that models each variable with missing values as a function of the other variables in a `round-robin fashion`: each feature is modeled in turn, and the resulting estimates are used for imputation. With a suitable model for each variable, it can be applied to both `categorical` and `numerical data`.

To demonstrate this approach in Python, we can use the `IterativeImputer` class from the `sklearn.impute` module. The class is still experimental, so `enable_iterative_imputer` must be imported from `sklearn.experimental` first. `IterativeImputer` is inspired by MICE, but by default it returns a single imputed dataset rather than multiple imputations.
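
Before the Titanic example, here is a minimal sketch of the round-robin idea on a made-up numeric array in which each column has one gap. The important detail is that all columns are passed to the imputer together, so each one can be modeled from the other.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (required before the import below)
from sklearn.impute import IterativeImputer

# Hypothetical data: the second column is roughly twice the first,
# and each column has one missing entry
X = np.array([
    [1.0, 2.0],
    [2.0, 4.0],
    [3.0, np.nan],
    [4.0, 8.0],
    [np.nan, 10.0],
    [6.0, 12.0],
])

imputer = IterativeImputer(max_iter=10, random_state=0)
X_filled = imputer.fit_transform(X)   # both columns are imputed jointly

print(X_filled)
```

Observed entries are left untouched and only the gaps are replaced by model predictions.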

Let's see how to implement MICE in Python using the Titanic dataset.


In [689]:
# import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer

# load the dataset
df = sns.load_dataset('titanic')
df.head()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.25,S,Third,man,True,,Southampton,no,False
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,1,3,female,26.0,0,0,7.925,S,Third,woman,False,,Southampton,yes,True
3,1,1,female,35.0,1,0,53.1,S,First,woman,False,C,Southampton,yes,False
4,0,3,male,35.0,0,0,8.05,S,Third,man,True,,Southampton,no,True


In [690]:
# check the missing values
df.isnull().sum().sort_values(ascending=False)

deck           688
age            177
embarked         2
embark_town      2
survived         0
pclass           0
sex              0
sibsp            0
parch            0
fare             0
class            0
who              0
adult_male       0
alive            0
alone            0
dtype: int64

In [691]:
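# encode the data using label encoding
# Note: LabelEncoder gives missing entries a code of their own (e.g. 7 for `deck` below),
# so after this step these columns no longer contain NaN
from sklearn.preprocessing import LabelEncoder
# Columns to encode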
columns_to_encode = ['sex', 'embarked', 'who', 'deck', 'class', 
                     'embark_town', 'alive']

encoder = LabelEncoder()
df['sex'] = encoder.fit_transform(df['sex'])
df['embarked'] = encoder.fit_transform(df['embarked'])
df['who'] = encoder.fit_transform(df['who'])
df['deck'] = encoder.fit_transform(df['deck'])
df['class'] = encoder.fit_transform(df['class'])
df['embark_town'] = encoder.fit_transform(df['embark_town'])
df['alive'] = encoder.fit_transform(df['alive'])

df.head()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,1,22.0,1,0,7.25,2,2,1,True,7,2,0,False
1,1,1,0,38.0,1,0,71.2833,0,0,2,False,2,0,1,False
2,1,3,0,26.0,0,0,7.925,2,2,2,False,7,2,1,True
3,1,1,0,35.0,1,0,53.1,2,0,2,False,2,2,1,False
4,0,3,1,35.0,0,0,8.05,2,2,1,True,7,2,0,True


In [692]:
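# The same encoding can be written with a for loop (use this cell instead of the previous one;
# running both encodes the columns twice, and the stored encoders then map integers to integers)
from sklearn.preprocessing import LabelEncoder

# Create a separate LabelEncoder for each categorical column in a for loop
# Columns to encode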
columns_to_encode = ['sex', 'embarked', 'who', 'deck', 'class', 
                     'embark_town', 'alive']

# Dictionary to store LabelEncoders for each column
label_encoders = {}

# Loop to apply LabelEncoder to each column for encoding
for col in columns_to_encode:
    # Create a new LabelEncoder for the column
    le = LabelEncoder()
    # Fit and transform the data
    df[col] = le.fit_transform(df[col])
    # Store the encoder in the dictionary
    label_encoders[col] = le
df.head()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,1,22.0,1,0,7.25,2,2,1,True,7,2,0,False
1,1,1,0,38.0,1,0,71.2833,0,0,2,False,2,0,1,False
2,1,3,0,26.0,0,0,7.925,2,2,2,False,7,2,1,True
3,1,1,0,35.0,1,0,53.1,2,0,2,False,2,2,1,False
4,0,3,1,35.0,0,0,8.05,2,2,1,True,7,2,0,True


In [693]:
# impute the missing values with IterativeImputer
# call the IterativeImputer class with max_iter = 10
imputer = IterativeImputer(max_iter=10)

# impute missing values using IterativeImputer for the age,
# embark_town, embarked and deck columns

# Columns to impute
columns_to_impute = ['age', 'embark_town', 'embarked', 'deck']

# Impute the columns jointly so that each one can be modeled from the others;
# passing a single column at a time would reduce to filling with the column mean
df[columns_to_impute] = imputer.fit_transform(df[columns_to_impute])
# check the missing values
df.isnull().sum().sort_values(ascending=False)

survived       0
pclass         0
sex            0
age            0
sibsp          0
parch          0
fare           0
embarked       0
class          0
who            0
adult_male     0
deck           0
embark_town    0
alive          0
alone          0
dtype: int64

In [694]:
df.head()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,1,22.0,1,0,7.25,2.0,2,1,True,7.0,2.0,0,False
1,1,1,0,38.0,1,0,71.2833,0.0,0,2,False,2.0,0.0,1,False
2,1,3,0,26.0,0,0,7.925,2.0,2,2,False,7.0,2.0,1,True
3,1,1,0,35.0,1,0,53.1,2.0,0,2,False,2.0,2.0,1,False
4,0,3,1,35.0,0,0,8.05,2.0,2,1,True,7.0,2.0,0,True


In [695]:
# Inverse transform for encoded columns
# (this restores the original labels only if each column was encoded exactly once)
for col in columns_to_encode:
    # Retrieve the corresponding LabelEncoder for the column
    le = label_encoders[col]
    # Convert to integer type, then map the codes back through the encoder
    df[col] = le.inverse_transform(df[col].astype(int))
df.head()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,1,22.0,1,0,7.25,2,2,1,True,7,2,0,False
1,1,1,0,38.0,1,0,71.2833,0,0,2,False,2,0,1,False
2,1,3,0,26.0,0,0,7.925,2,2,2,False,7,2,1,True
3,1,1,0,35.0,1,0,53.1,2,0,2,False,2,2,1,False
4,0,3,1,35.0,0,0,8.05,2,2,1,True,7,2,0,True
