# Five Important Ways to Impute Missing Values

You can impute missing values using machine learning models. This process is known as data imputation and is commonly used in data preprocessing to handle missing or incomplete data. There are several methods and models you can use, depending on the nature of your data and the missing values:

1. `Simple Imputation Techniques:`

    * Mean/Median Imputation: Replace missing values with the mean or median of the column. Suitable for numerical data.
    * Mode Imputation: Replace missing values with the mode (most frequent value) of the column. Useful for categorical data.
2. `K-Nearest Neighbors (KNN):` This algorithm can be used to impute missing values based on the similarity of rows. 
3. `Regression Imputation:` Use a regression model to predict the missing values based on other variables in your dataset.
4. `Decision Trees and Random Forests:` Some tree-based implementations can handle missing values natively. They can also be used to predict missing values from the patterns learned on the observed data.
5. `Advanced Techniques:`
   
    * **Multiple Imputation by Chained Equations (MICE):** This is a more sophisticated technique that models each variable with missing values as a function of other variables in a round-robin fashion.
    * **Deep Learning Methods:** Neural networks, especially autoencoders, can be effective in imputing missing values in complex datasets.
6. `Time Series Specific Methods:` For time-series data, you might use techniques like interpolation, forward-fill, or backward-fill.

It's important to choose the right method based on the type of data, the pattern of missingness (e.g., at random, completely at random, or not at random), and the amount of missing data. Additionally, it's crucial to understand that imputation can introduce bias or affect the distribution of your data, so it should be done with caution and an understanding of the potential implications.
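The time-series methods from item 6 of the list above are not revisited in the later sections, so here is a minimal sketch of forward-fill, backward-fill, and linear interpolation with pandas. The toy series `ts` and its dates are made up purely for illustration.

```python
import numpy as np
import pandas as pd

# hypothetical daily series with two consecutive gaps
ts = pd.Series(
    [1.0, np.nan, np.nan, 4.0],
    index=pd.date_range("2024-01-01", periods=4, freq="D"),
)

forward_filled = ts.ffill()      # carry the last observed value forward
backward_filled = ts.bfill()     # pull the next observed value backward
interpolated = ts.interpolate()  # default method is linear interpolation

# show the three variants side by side
print(pd.DataFrame({
    "original": ts,
    "ffill": forward_filled,
    "bfill": backward_filled,
    "linear": interpolated,
}))
```

Forward-fill and backward-fill repeat a neighbouring observation, while interpolation draws a straight line between the observations on either side of the gap.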

# 1. Simple Imputation Techniques

## 1.1. Mean/Median Imputation
Mean/Median imputation replaces missing values with the mean or median of the column. This is a simple and effective method, but it has some limitations. For example, it reduces variance in the dataset, and it can lead to biased estimates if the missing values are not missing at random.
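To make the variance claim concrete, here is a small sketch on a made-up series (the values in `s` are illustrative, not Titanic data). Filling with the mean leaves the mean unchanged but pulls the variance down, because every filled value sits exactly at the centre of the distribution.

```python
import numpy as np
import pandas as pd

# toy series with two missing entries
s = pd.Series([1.0, 2.0, np.nan, 4.0, np.nan, 9.0])

# fill the gaps with the mean of the observed values
filled = s.fillna(s.mean())

print("mean before:", s.mean(), "| mean after:", filled.mean())
print("variance before:", s.var(), "| variance after:", filled.var())
```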

Let's see how to implement mean/median imputation in Python using the Titanic dataset.

### 1.1.1. Mean Imputation

First, let's import the necessary libraries and load the dataset:

In [2]:
import pandas as pd
import numpy as np 
import seaborn as sns
# import matplotlib.pyplot as plt
# %matplotlib inline

data = sns.load_dataset("titanic")
data.head()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.25,S,Third,man,True,,Southampton,no,False
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,1,3,female,26.0,0,0,7.925,S,Third,woman,False,,Southampton,yes,True
3,1,1,female,35.0,1,0,53.1,S,First,woman,False,C,Southampton,yes,False
4,0,3,male,35.0,0,0,8.05,S,Third,man,True,,Southampton,no,True


In [3]:
# check the number of missing values in each column and sort them in descending order
data.isnull().sum().sort_values(ascending=False)

deck           688
age            177
embarked         2
embark_town      2
survived         0
pclass           0
sex              0
sibsp            0
parch            0
fare             0
class            0
who              0
adult_male       0
alive            0
alone            0
dtype: int64

Only four columns have missing values: `deck`, `age`, `embarked`, and `embark_town`.

The `age` column has 177 missing values, which we will now replace with the mean of the column.

In [4]:
# replace the missing values in the "age" column with the mean of the "age" column
# (assigning the result back avoids the chained-assignment warning that inplace=True triggers)
data["age"] = data["age"].fillna(data["age"].mean())
data.isnull().sum().sort_values(ascending=False)

deck           688
embarked         2
embark_town      2
survived         0
pclass           0
sex              0
age              0
sibsp            0
parch            0
fare             0
class            0
who              0
adult_male       0
alive            0
alone            0
dtype: int64

Now all the missing values in the `age` column have been replaced with the mean of that column.

### 1.1.2. Median Imputation
Let's load the dataset and replace the missing values in the `age` column with the median of the column:

In [5]:
df = sns.load_dataset("titanic")
df.isnull().sum().sort_values(ascending=False)

deck           688
age            177
embarked         2
embark_town      2
survived         0
pclass           0
sex              0
sibsp            0
parch            0
fare             0
class            0
who              0
adult_male       0
alive            0
alone            0
dtype: int64

In [6]:
# replace the missing values in the "age" column with the median of the "age" column
df["age"] = df["age"].fillna(df["age"].median())
df.isnull().sum().sort_values(ascending=False)

deck           688
embarked         2
embark_town      2
survived         0
pclass           0
sex              0
age              0
sibsp            0
parch            0
fare             0
class            0
who              0
adult_male       0
alive            0
alone            0
dtype: int64
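The same fills are also available through scikit-learn's `SimpleImputer`, which can be handy when the fill value should be learned once and reused on new data. The sketch below uses a made-up `ages` frame with one outlier to show how the mean and the median can differ; it is an illustration, not part of the Titanic workflow above.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# toy ages with one outlier (90) and one missing entry
ages = pd.DataFrame({"age": [20.0, 22.0, 24.0, 26.0, 90.0, np.nan]})

mean_imputer = SimpleImputer(strategy="mean")
median_imputer = SimpleImputer(strategy="median")

mean_filled = mean_imputer.fit_transform(ages)
median_filled = median_imputer.fit_transform(ages)

# the outlier drags the mean upward, while the median stays near the bulk of the data
print("mean fill:", mean_filled[-1, 0])
print("median fill:", median_filled[-1, 0])

# the fitted imputer reuses the learned statistic on unseen rows
new_rows = pd.DataFrame({"age": [np.nan, 30.0]})
new_filled = median_imputer.transform(new_rows)
print(new_filled)
```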

## 1.2. Mode Imputation

Mode imputation replaces missing values with the mode (most frequent value) of the column. This is useful for imputing categorical columns, such as `embarked` and `embark_town` in the Titanic dataset.

Let's see how to implement mode imputation in Python using the Titanic dataset.


In [7]:
df = sns.load_dataset("titanic")
df.isnull().sum().sort_values(ascending=False)

deck           688
age            177
embarked         2
embark_town      2
survived         0
pclass           0
sex              0
sibsp            0
parch            0
fare             0
class            0
who              0
adult_male       0
alive            0
alone            0
dtype: int64

In [8]:
df["embarked"] = df["embarked"].fillna(df["embarked"].mode()[0])
df["embark_town"] = df["embark_town"].fillna(df["embark_town"].mode()[0])
df.isnull().sum().sort_values(ascending=False)

deck           688
age            177
survived         0
pclass           0
sex              0
sibsp            0
parch            0
fare             0
embarked         0
class            0
who              0
adult_male       0
embark_town      0
alive            0
alone            0
dtype: int64

We can see that the missing values in the `embarked` and `embark_town` columns have been replaced with the mode of each column.
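One detail worth knowing about `mode()[0]`: `Series.mode()` returns every value that ties for most frequent, in sorted order, so `[0]` simply picks the first of them. The toy `ports` series below is made up to show a tie.

```python
import pandas as pd

# "S" and "C" both appear twice, so the mode is not unique
ports = pd.Series(["S", "C", "S", "C", None])

modes = ports.mode()  # missing values are ignored by default
print(modes.tolist())

# [0] selects the first of the tied values
filled = ports.fillna(modes[0])
print(filled.tolist())
```

When a column has tied modes, it may be worth choosing the fill value deliberately rather than relying on sort order.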


# 2. K-Nearest Neighbors (KNN)

KNN is a machine learning algorithm that can be used for imputing missing values. It works by finding the most similar data points to the one with the missing value based on other available features. The missing value is then imputed with the mean or median of the most similar data points.

Let's see how to implement KNN imputation in Python using the Titanic dataset.


In [9]:
# load the dataset
df = sns.load_dataset('titanic')

# check the number of missing values in each column
df.isnull().sum().sort_values(ascending=False)

deck           688
age            177
embarked         2
embark_town      2
survived         0
pclass           0
sex              0
sibsp            0
parch            0
fare             0
class            0
who              0
adult_male       0
alive            0
alone            0
dtype: int64

In [10]:
from sklearn.impute import KNNImputer

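# create the imputer object
imputer = KNNImputer(n_neighbors=5)

# KNN needs other features to measure similarity between rows, so fit the imputer
# on several numeric columns and keep only the imputed "age" column
numeric_cols = ["age", "pclass", "sibsp", "parch", "fare"]
df["age"] = imputer.fit_transform(df[numeric_cols])[:, 0]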

df.isnull().sum().sort_values(ascending=False)

deck           688
embarked         2
embark_town      2
survived         0
pclass           0
sex              0
age              0
sibsp            0
parch            0
fare             0
class            0
who              0
adult_male       0
alive            0
alone            0
dtype: int64
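A tiny made-up array shows why the other features matter for KNN imputation. With two columns, the imputer can find the rows closest to the incomplete one; with a single column there is nothing to measure similarity on, and to my understanding `KNNImputer` then falls back to the column mean. The array `X` is purely illustrative.

```python
import numpy as np
from sklearn.impute import KNNImputer

# second column is missing in the middle row
X = np.array([
    [1.0, 10.0],
    [2.0, 20.0],
    [3.0, np.nan],
    [4.0, 40.0],
    [5.0, 100.0],
])

# two features: neighbours are chosen by distance on the first column
two_features = KNNImputer(n_neighbors=2).fit_transform(X)

# one feature: no other column is available to compare rows
one_feature = KNNImputer(n_neighbors=2).fit_transform(X[:, [1]])

print("imputed with neighbours:", two_features[2, 1])
print("imputed from a single column:", one_feature[2, 0])
```

In the first case the two nearest rows (first-column values 2 and 4) supply the fill value; in the second the outlier 100 influences the result just as it would with plain mean imputation.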

# 3. Regression Imputation

Regression imputation uses a regression model to predict the missing values based on other variables in the dataset. A regression model suits numerical columns; for a categorical column, a classifier plays the same role.

Let's see how to implement regression imputation in Python using the Titanic dataset.


In [11]:
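from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (enables IterativeImputer)
from sklearn.impute import IterativeImputer

# reload the data, because "age" was already filled in by the KNN example above
df = sns.load_dataset("titanic")

# create the imputer object (its default estimator is BayesianRidge, a linear regression model)
imputer = IterativeImputer(max_iter=10)

# regress "age" on the other numeric columns and keep only the imputed "age" column
numeric_cols = ["age", "pclass", "sibsp", "parch", "fare"]
df["age"] = imputer.fit_transform(df[numeric_cols])[:, 0]
df.isnull().sum().sort_values(ascending=False)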

deck           688
embarked         2
embark_town      2
survived         0
pclass           0
sex              0
age              0
sibsp            0
parch            0
fare             0
class            0
who              0
adult_male       0
alive            0
alone            0
dtype: int64
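The mechanics of regression imputation can also be written out by hand: fit a model on the rows where the target is observed, then predict it for the rows where it is missing. The sketch below uses a made-up frame `toy` in which `y = 2x + 1` exactly, so the prediction is easy to check; the column names are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# toy data following y = 2x + 1, with one missing y
toy = pd.DataFrame({
    "x": [1.0, 2.0, 3.0, 4.0, 5.0],
    "y": [3.0, 5.0, np.nan, 9.0, 11.0],
})

observed = toy["y"].notna()

# fit on the rows where y is known
model = LinearRegression().fit(toy.loc[observed, ["x"]], toy.loc[observed, "y"])

# predict y for the rows where it is missing and write the result back
toy.loc[~observed, "y"] = model.predict(toy.loc[~observed, ["x"]])
print(toy)
```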

# 4. Random Forests for Imputing Missing Values

Some random forest implementations can handle missing values natively. A random forest can also be trained on the rows where a column is observed and then used to predict that column where it is missing.

Let's see how to implement random forests in Python using the Titanic dataset.

In [12]:
import pandas as pd
import numpy as np
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error, mean_absolute_percentage_error

# load the dataset
df = sns.load_dataset('titanic')

# check the number of missing values in each column as sorted in descending order
df.isnull().sum().sort_values(ascending=False)

deck           688
age            177
embarked         2
embark_town      2
survived         0
pclass           0
sex              0
sibsp            0
parch            0
fare             0
class            0
who              0
adult_male       0
alive            0
alone            0
dtype: int64

In [13]:
df.drop(columns=["deck"], inplace=True)

In [14]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 14 columns):
 #   Column       Non-Null Count  Dtype   
---  ------       --------------  -----   
 0   survived     891 non-null    int64   
 1   pclass       891 non-null    int64   
 2   sex          891 non-null    object  
 3   age          714 non-null    float64 
 4   sibsp        891 non-null    int64   
 5   parch        891 non-null    int64   
 6   fare         891 non-null    float64 
 7   embarked     889 non-null    object  
 8   class        891 non-null    category
 9   who          891 non-null    object  
 10  adult_male   891 non-null    bool    
 11  embark_town  889 non-null    object  
 12  alive        891 non-null    object  
 13  alone        891 non-null    bool    
dtypes: bool(2), category(1), float64(2), int64(4), object(5)
memory usage: 79.4+ KB


In [15]:
# creating a class to store all the encoders 
class LabelEncoderStore:
    def __init__(self):
        self.encoders = {}

    def add_encoder(self, name, encoder):
        if name in self.encoders:
            raise ValueError(f"Label Encoder for '{name}' already exists.")
        self.encoders[name] = encoder 

    def get_encoder(self, name):
        if name not in self.encoders:
            raise ValueError(f"Label Encoder for '{name}' does not exist.")
        return self.encoders[name]     

# encode the categorical data using label encoder
from sklearn.preprocessing import LabelEncoder

columns_to_encode = ['sex', 'embarked', 'class', 'who', 'embark_town', 'alive']

label_encoder_store = LabelEncoderStore()

for col in columns_to_encode:
    le = LabelEncoder()

    # fit the label encoder to the column and transform the data
    df[col] = le.fit_transform(df[col])
    # keep the fitted encoder so the column can be decoded later
    label_encoder_store.add_encoder(col, le)

df.head()
label_encoder_store.get_encoder('sex')

In [16]:
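# rows where "age" is missing (copied so that the column can be assigned to later)
df_with_missing_values = df[df['age'].isna()].copy()
# rows without any missing values
df_without_missing_values = df.dropna()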

Let's look at the shape of the dataset with and without the missing values:

In [17]:
print("The shape of the original dataset ", df.shape)
print("The shape of the dataset with missing values ", df_with_missing_values.shape)
print("The shape of the dataset without missing values ", df_without_missing_values.shape)

The shape of the original dataset  (891, 14)
The shape of the dataset with missing values  (177, 14)
The shape of the dataset without missing values  (714, 14)


In [18]:
# use a random forest regressor to predict the missing values in the "age" column
# split the rows without missing values into features (X) and target (y)
X = df_without_missing_values.drop("age", axis=1)
y = df_without_missing_values["age"]

# split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# creating a random forest regressor object
rf_model = RandomForestRegressor(n_estimators = 100, random_state = 42)

# fitting the model to the training data
rf_model.fit(X_train, y_train)

# predicting the target variable on the test data
y_pred = rf_model.predict(X_test)

# Evaluating the model performance
print("RMSE for Random Forest Regressor:", np.sqrt(mean_squared_error(y_test, y_pred)))
print("R2 Score for Random Forest Regressor:", r2_score(y_test, y_pred))
print("MAE for Random Forest Regressor:", mean_absolute_error(y_test, y_pred))
print("MAPE for Random Forest Regressor:", mean_absolute_percentage_error(y_test, y_pred))


RMSE for Random Forest Regressor: 11.081260589808045
R2 Score for Random Forest Regressor: 0.33769388288226154
MAE for Random Forest Regressor: 8.666661815622195
MAPE for Random Forest Regressor: 0.40839466096086574


In [19]:
# check the number of missing values in each column
df_with_missing_values.isnull().sum().sort_values(ascending=False)

age            177
survived         0
pclass           0
sex              0
sibsp            0
parch            0
fare             0
embarked         0
class            0
who              0
adult_male       0
embark_town      0
alive            0
alone            0
dtype: int64

In [20]:
y_pred = rf_model.predict(df_with_missing_values.drop("age", axis=1))

In [21]:
# replace the missing values in the "age" column with the predicted values from the model
# (assign returns a new DataFrame, so there is no warning to suppress)
df_with_missing_values = df_with_missing_values.assign(age=y_pred)

# check the missing values 
df_with_missing_values.isnull().sum().sort_values(ascending=False)

survived       0
pclass         0
sex            0
age            0
sibsp          0
parch          0
fare           0
embarked       0
class          0
who            0
adult_male     0
embark_town    0
alive          0
alone          0
dtype: int64

In [22]:
# recombine both parts and restore the original row order
df_complete = pd.concat([df_without_missing_values, df_with_missing_values], axis=0).sort_index()

print("The shape of the complete dataframe is ", df_complete.shape)

# check the first 5 rows of the complete dataframe
df_complete.head()

The shape of the complete dataframe is  (891, 14)


Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,embark_town,alive,alone
0,0,3,1,22.0,1,0,7.25,2,2,1,True,2,0,False
1,1,1,0,38.0,1,0,71.2833,0,0,2,False,0,1,False
2,1,3,0,26.0,0,0,7.925,2,2,2,False,2,1,True
3,1,1,0,35.0,1,0,53.1,2,0,2,False,2,1,False
4,0,3,1,35.0,0,0,8.05,2,2,1,True,2,0,True


In [23]:
for col in columns_to_encode:
    le = label_encoder_store.encoders[col]
    df_complete[col] = le.inverse_transform(df_complete[col])
df_complete.head()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.25,S,Third,man,True,Southampton,no,False
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,Cherbourg,yes,False
2,1,3,female,26.0,0,0,7.925,S,Third,woman,False,Southampton,yes,True
3,1,1,female,35.0,1,0,53.1,S,First,woman,False,Southampton,yes,False
4,0,3,male,35.0,0,0,8.05,S,Third,man,True,Southampton,no,True


In [24]:
# print the shape of the complete dataframe
print("The shape of the complete dataframe is ", df_complete.shape)

The shape of the complete dataframe is  (891, 14)


In [25]:
# check the number of missing values in each column
df_complete.isnull().sum().sort_values(ascending=False)

embarked       2
embark_town    2
survived       0
pclass         0
sex            0
age            0
sibsp          0
parch          0
fare           0
class          0
who            0
adult_male     0
alive          0
alone          0
dtype: int64
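The split, fit, predict, and recombine steps above can be wrapped into one reusable function. This is a minimal sketch under stated assumptions: `impute_with_model` is a hypothetical helper (not a library function), every column other than the target is assumed to be numeric and complete, and the data in `toy` is synthetic.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor


def impute_with_model(frame, target, model):
    """Return a copy of `frame` with NaNs in `target` replaced by model predictions.

    Hypothetical helper: assumes all other columns are numeric and have no NaNs.
    """
    result = frame.copy()
    missing = result[target].isna()
    if not missing.any():
        return result
    features = result.columns.drop(target)
    # train on the rows where the target is observed
    model.fit(result.loc[~missing, features], result.loc[~missing, target])
    # predict the target for the rows where it is missing
    result.loc[missing, target] = model.predict(result.loc[missing, features])
    return result


# synthetic data: the target depends on two complete feature columns
rng = np.random.default_rng(0)
toy = pd.DataFrame({"a": rng.normal(size=200), "b": rng.normal(size=200)})
toy["target"] = 3 * toy["a"] - toy["b"] + rng.normal(scale=0.1, size=200)

# hide 20% of the target values
toy.loc[toy.sample(frac=0.2, random_state=0).index, "target"] = np.nan

completed = impute_with_model(
    toy, "target", RandomForestRegressor(n_estimators=50, random_state=0)
)
print(completed.isna().sum())
```

Because the predictions are written back with `.loc`, the rows keep their original order and index, so no concatenation step is needed afterwards.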

# 5. Advanced Techniques
## 5.1. Multiple Imputation by Chained Equations (MICE)
Multiple Imputation by Chained Equations (MICE) is a more sophisticated technique that models each variable with missing values as a function of the other variables and uses that estimate for imputation. It does this in a round-robin fashion: each feature is modeled in turn, and the cycle is repeated for several iterations.

To demonstrate the approach in Python, we can use the `IterativeImputer` class from the `sklearn.impute` module. It is inspired by MICE and returns a single imputation by default. It works on numeric arrays, so categorical columns have to be encoded first.

Let's see how to implement MICE in Python using the Titanic dataset.

In [26]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer

# load the dataset
df = sns.load_dataset('titanic')
df.head()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.25,S,Third,man,True,,Southampton,no,False
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,1,3,female,26.0,0,0,7.925,S,Third,woman,False,,Southampton,yes,True
3,1,1,female,35.0,1,0,53.1,S,First,woman,False,C,Southampton,yes,False
4,0,3,male,35.0,0,0,8.05,S,Third,man,True,,Southampton,no,True


In [27]:
df.isnull().sum().sort_values(ascending=False)

deck           688
age            177
embarked         2
embark_town      2
survived         0
pclass           0
sex              0
sibsp            0
parch            0
fare             0
class            0
who              0
adult_male       0
alive            0
alone            0
dtype: int64


In [None]:
from sklearn.preprocessing import LabelEncoder

# columns to encode
columns_to_encode = ['sex', 'embarked', 'class', 'who', 'embark_town', 'alive']

# create a dictionary to store the label encoders for all the columns
label_encoders = {}

# loop to apply label encoder 
for col in columns_to_encode:
    
    # create a label encoder object
    le = LabelEncoder()

    # fit the label encoder to the column and transform the data
    df[col] = le.fit_transform(df[col])

    # store the label encoder of each column in the dictionary named label_encoders
    label_encoders[col] = le

df.head()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,1,22.0,1,0,7.25,2,2,1,True,,2,0,False
1,1,1,0,38.0,1,0,71.2833,0,0,2,False,C,0,1,False
2,1,3,0,26.0,0,0,7.925,2,2,2,False,,2,1,True
3,1,1,0,35.0,1,0,53.1,2,0,2,False,C,2,1,False
4,0,3,1,35.0,0,0,8.05,2,2,1,True,,2,0,True


In [42]:
from sklearn.preprocessing import LabelEncoder

# encode the categorical columns again, this time including 'deck'
# (note: LabelEncoder gives missing entries their own integer code instead of keeping them as NaN)
columns_to_encode = ['sex', 'embark_town', 'embarked', 'deck', 'who', 'class', 'alive']

label_encoders = {}
for col in columns_to_encode:
    le = LabelEncoder()
    df[col] = le.fit_transform(df[col])
    label_encoders[col] = le

df.head()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,1,22.0,1,0,7.25,2,2,1,True,7,2,0,False
1,1,1,0,38.0,1,0,71.2833,0,0,2,False,2,0,1,False
2,1,3,0,26.0,0,0,7.925,2,2,2,False,7,2,1,True
3,1,1,0,35.0,1,0,53.1,2,0,2,False,2,2,1,False
4,0,3,1,35.0,0,0,8.05,2,2,1,True,7,2,0,True


In [43]:
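from sklearn.impute import IterativeImputer

# create an instance of the IterativeImputer
imputer = IterativeImputer(max_iter=10)
columns_to_impute = ['age', 'embark_town', 'embarked', 'deck']

# fit the imputer on all columns at once so that each column is modeled from the others;
# fitting one column at a time gives the imputer no other features to learn from
# ('age' is the only column that still contains NaN after label encoding)
imputed = pd.DataFrame(imputer.fit_transform(df.astype(float)), columns=df.columns, index=df.index)
df[columns_to_impute] = imputed[columns_to_impute]

df.isnull().sum().sort_values(ascending=False)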

survived       0
pclass         0
sex            0
age            0
sibsp          0
parch          0
fare           0
embarked       0
class          0
who            0
adult_male     0
deck           0
embark_town    0
alive          0
alone          0
dtype: int64

In [44]:
df.head()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,1,22.0,1,0,7.25,2.0,2,1,True,7.0,2.0,0,False
1,1,1,0,38.0,1,0,71.2833,0.0,0,2,False,2.0,0.0,1,False
2,1,3,0,26.0,0,0,7.925,2.0,2,2,False,7.0,2.0,1,True
3,1,1,0,35.0,1,0,53.1,2.0,0,2,False,2.0,2.0,1,False
4,0,3,1,35.0,0,0,8.05,2.0,2,1,True,7.0,2.0,0,True
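In the output above, rows whose `deck` was missing carry the code 7, because `LabelEncoder` assigned the missing entries their own class before the imputer ever saw them. If the goal is to actually impute a categorical column, the missing entries need to stay `NaN` through the encoding step. The sketch below shows one way this might be done on a small made-up frame: it encodes with pandas category codes, imputes all columns jointly, then rounds the imputed codes back to valid labels. Treating nominal codes as numbers is a simplification, and the frame `toy` and its values are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# made-up data with gaps in a numeric and a categorical column
toy = pd.DataFrame({
    "age": [22.0, 38.0, np.nan, 35.0, 54.0, np.nan, 27.0, 14.0],
    "fare": [7.25, 71.28, 7.92, 53.10, 51.86, 8.46, 11.13, 30.07],
    "embarked": ["S", "C", "S", None, "S", "Q", "S", "C"],
})

# encode categories as numeric codes, but keep missing entries as NaN
embarked_cat = toy["embarked"].astype("category")
encoded = toy.copy()
encoded["embarked"] = embarked_cat.cat.codes.astype(float).where(embarked_cat.notna())

# impute all columns jointly so each one is modeled from the others
imputer = IterativeImputer(max_iter=10, random_state=0)
imputed = pd.DataFrame(
    imputer.fit_transform(encoded), columns=encoded.columns, index=encoded.index
)

# round the imputed codes to the nearest valid code and map them back to labels
categories = np.asarray(embarked_cat.cat.categories)
codes = imputed["embarked"].round().clip(0, len(categories) - 1).astype(int)
imputed["embarked"] = categories[codes.to_numpy()]

print(imputed)
```

For nominal columns with many levels, a classifier-based imputation or one-hot encoding may be a better fit than rounding numeric codes.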


## 5.2. Deep Learning Methods

Deep learning methods, particularly neural networks such as autoencoders, offer a powerful approach for imputing missing values in complex datasets. They are especially useful when the data has intricate, non-linear relationships that traditional statistical methods might not capture effectively.


### Understanding Autoencoders for Imputation:

1. **What is an Autoencoder?**
* An autoencoder is a type of neural network that is trained to copy its input to its output.
* It has a hidden layer that describes a code used to represent the input.
* The network may be viewed as consisting of two parts: an encoder function, which compresses the input into a latent-space representation, and a decoder function, which reconstructs the input from the latent space.

2. **How Autoencoders Work for Imputation:**
* The key idea is to train the autoencoder to ignore the noise (missing values) in the input data.
* During training, inputs with missing values are presented, and the network learns to predict the missing values in a way that minimizes reconstruction error for known parts of the data.
* This results in the network learning a robust representation of the data, enabling it to make reasonable guesses about missing values.

3. **Advantages of Using Autoencoders:**
* Handling Complex Patterns: They can capture non-linear relationships in the data, which is particularly useful for complex datasets.
* Scalability: They can handle large-scale datasets efficiently.
* Flexibility: They can be adapted to different types of data (e.g., images, text, time-series).

4. **Implementation Considerations:**
* Data Preprocessing: Data should be normalized or standardized before feeding it into an autoencoder.
* Network Architecture: The choice of architecture (number of layers, type of layers, etc.) depends on the complexity of the data.
* Training Process: It might involve techniques like dropout or noise addition to improve the model's ability to handle missing data.

5. **Example Use-Cases:**
* Image Data: Filling in missing pixels or reconstructing corrupted images.
* Time-Series Data: Imputing missing values in sequences like stock prices or weather data.
* Tabular Data: Handling missing entries in datasets used for machine learning.

### Implementation Example:

A simplified example of how you might set up an autoencoder for imputation in Python using TensorFlow and Keras is provided in the next notebook.
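As a lightweight stand-in for a Keras model, the denoising idea described above can be roughly sketched with scikit-learn's `MLPRegressor`, trained to reconstruct clean rows from randomly corrupted copies. This is not the TensorFlow implementation: the synthetic data, the single two-unit hidden layer, and the 20% corruption rate are all assumptions chosen to keep the example small.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# synthetic, correlated data: three columns driven by one latent factor
latent = rng.normal(size=(500, 1))
X_full = np.hstack([latent, 2 * latent, -latent]) + rng.normal(scale=0.1, size=(500, 3))

# hide roughly 10% of the entries
missing_mask = rng.random(X_full.shape) < 0.1
X_missing = X_full.copy()
X_missing[missing_mask] = np.nan

# standardize (StandardScaler ignores NaN when fitting), then put 0 in the gaps,
# which corresponds to the column mean after scaling
scaler = StandardScaler().fit(X_missing)
X_input = np.nan_to_num(scaler.transform(X_missing), nan=0.0)

# train on complete rows: the input is a randomly corrupted copy, the target is the clean row
complete_rows = ~missing_mask.any(axis=1)
X_clean = X_input[complete_rows]
corruption = rng.random(X_clean.shape) < 0.2
X_corrupted = np.where(corruption, 0.0, X_clean)

autoencoder = MLPRegressor(
    hidden_layer_sizes=(2,), activation="tanh", max_iter=2000, random_state=0
)
autoencoder.fit(X_corrupted, X_clean)

# reconstruct every row, but keep the observed entries and only fill the gaps
reconstruction = autoencoder.predict(X_input)
X_imputed = scaler.inverse_transform(np.where(missing_mask, reconstruction, X_input))

rmse = np.sqrt(np.mean((X_imputed[missing_mask] - X_full[missing_mask]) ** 2))
print("RMSE on the hidden entries:", rmse)
```

Only the positions flagged in `missing_mask` are taken from the reconstruction, so observed values pass through unchanged.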