# **Imputing Missing Values**

Imputation is the process of filling in missing data values with estimated or substituted values. It is widely used in statistical analysis and data cleaning to make datasets complete and usable. Real-world datasets often contain missing values, typically encoded as NaNs or blanks. Training a model on a dataset with many missing values can significantly degrade the model's quality.

**Here are some important ways to impute missing values**


`1.` **Simple Imputation Techniques**
- **Mean/Median Imputation:** Replace missing values with the mean or median of the column. Suitable for numerical data.
- **Mode Imputation:** Replace missing values with the mode (most frequent value) of the column. Useful for categorical data.

`2.` **K-Nearest Neighbors (KNN):** Fill in missing values by looking at nearby rows that are similar.

`3.` **Regression Imputation:** Predict missing values using a regression model and other known variables.

`4.` **Decision Trees and Random Forests:** Train a tree-based model on the rows where the value is known, then use it to predict the missing entries. Some tree implementations can also handle missing values natively.

`5.` **Advanced Methods**

- **Multiple Imputation by Chained Equations (MICE):** MICE is an advanced method where missing values in each variable are predicted based on other variables in a cyclical manner.

- **Deep Learning Methods:** Neural networks, particularly autoencoders, are powerful tools for filling in missing values, especially in intricate datasets.

`6.` **Time Series Specific Methods:** For time-series data, you might use techniques like interpolation, forward-fill, or backward-fill.
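Time-series fills are not demonstrated elsewhere in this notebook, so here is a minimal sketch on a made-up daily series (the dates and values are illustrative assumptions, not Titanic data). It contrasts forward-fill, backward-fill, and linear interpolation in pandas.

```python
import numpy as np
import pandas as pd

# Hypothetical daily readings with a two-day gap
readings = pd.Series(
    [1.0, np.nan, np.nan, 4.0],
    index=pd.date_range("2024-01-01", periods=4, freq="D"),
)

forward = readings.ffill()        # carry the last known value forward
backward = readings.bfill()       # pull the next known value backward
linear = readings.interpolate()   # straight line between known neighbours

print(pd.DataFrame({"raw": readings, "ffill": forward, "bfill": backward, "linear": linear}))
```

Forward-fill suits values that persist until they change (for example a status flag), while interpolation is usually a better guess for smoothly varying measurements.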

----

## ` | ` Simple Imputation Techniques

- **Mean/Median Imputation:** Fill in missing values with the mean or median of the observed values in the same variable. This is appropriate when only a small percentage of values are missing or when the missing values are randomly distributed.


**Two methods exist:**

- **Single Imputation:** Replace missing data with the mean (or median) of observed data.

- **Multiple Imputation:** Generate multiple imputed datasets, calculating means (or medians) for each, and then combine the results for more robust imputation.

**Advantages:**
> - Easy to implement.

> - Fast way of obtaining complete datasets.

> - Can be integrated into production (during model deployment).
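To illustrate the last point, here is a hedged sketch of how fill values learned from training data can be stored and reused on new rows at deployment time. The small `train` and `new_rows` frames and the helper names `fit_fill_values` and `apply_fill_values` are made up for illustration.

```python
import numpy as np
import pandas as pd

def fit_fill_values(df, columns, strategy="mean"):
    """Learn one fill value per column from the training data only."""
    if strategy == "mean":
        return df[columns].mean().to_dict()
    if strategy == "median":
        return df[columns].median().to_dict()
    raise ValueError(f"unknown strategy: {strategy}")

def apply_fill_values(df, fill_values):
    """Return a copy with missing entries replaced by the stored values."""
    return df.fillna(value=fill_values)

# Made-up training data and unseen rows
train = pd.DataFrame({"age": [20.0, 30.0, np.nan, 70.0],
                      "fare": [10.0, np.nan, 30.0, 20.0]})
new_rows = pd.DataFrame({"age": [np.nan, 41.0], "fare": [15.0, np.nan]})

fill_values = fit_fill_values(train, ["age", "fare"], strategy="mean")
train_filled = apply_fill_values(train, fill_values)
new_filled = apply_fill_values(new_rows, fill_values)

print(fill_values)
print(new_filled)
```

Computing the statistic on the training data alone and reusing it later avoids leaking information from new data into the fill value. scikit-learn's `SimpleImputer` follows the same fit-then-transform pattern.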

In [67]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np

In [68]:
Titanic=sns.load_dataset("titanic")
Titanic.head(2)

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.25,S,Third,man,True,,Southampton,no,False
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False


In [69]:
Titanic.isnull().sum().sort_values(ascending=False)

deck           688
age            177
embarked         2
embark_town      2
survived         0
pclass           0
sex              0
sibsp            0
parch            0
fare             0
class            0
who              0
adult_male       0
alive            0
alone            0
dtype: int64

In [70]:

Titanic.drop("deck",axis=1,inplace=True)

In [71]:
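# Assign the result back; calling fillna(inplace=True) on a selected column is unreliable in recent pandas
Titanic["age"]=Titanic["age"].fillna(Titanic["age"].mean())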
Titanic.isnull().sum().sort_values(ascending=False)

embarked       2
embark_town    2
survived       0
pclass         0
sex            0
age            0
sibsp          0
parch          0
fare           0
class          0
who            0
adult_male     0
alive          0
alone          0
dtype: int64

## ` | `  Mode Imputation
Mode imputation replaces missing values with the mode (most frequent value) of the column. This is useful for categorical columns, such as **embarked** and **embark_town** in the Titanic dataset.

In [72]:
# mode() returns a Series (there can be ties), so take the first entry
Titanic["embark_town"]=Titanic["embark_town"].fillna(Titanic["embark_town"].mode()[0])
Titanic["embarked"]=Titanic["embarked"].fillna(Titanic["embarked"].mode()[0])


Titanic.isnull().sum().sort_values(ascending=False)

survived       0
pclass         0
sex            0
age            0
sibsp          0
parch          0
fare           0
embarked       0
class          0
who            0
adult_male     0
embark_town    0
alive          0
alone          0
dtype: int64

## `|` K-Nearest Neighbors (KNN)

KNN imputation fills in a missing value by looking at the rows that are most similar to the row containing the gap, where similarity is measured on the features that are observed.

The algorithm then replaces the missing value with the average, or a distance-weighted average, of that feature across the k nearest neighbors.
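As a small illustration of the averaging step, the sketch below runs scikit-learn's `KNNImputer` on a made-up 4x3 matrix, once with uniform weights and once with distance weights. The matrix values are assumptions chosen so the result can be checked by hand.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Toy matrix (made up for illustration); rows are samples, np.nan marks missing cells
X = np.array([
    [1.0, 2.0, np.nan],
    [3.0, 4.0, 3.0],
    [np.nan, 6.0, 5.0],
    [8.0, 8.0, 7.0],
])

# Plain average of the 2 nearest rows that have the feature observed
uniform = KNNImputer(n_neighbors=2, weights="uniform").fit_transform(X)

# Inverse-distance weighting: closer neighbours count more
weighted = KNNImputer(n_neighbors=2, weights="distance").fit_transform(X)

print(uniform)
print(weighted)
```

With uniform weights, the missing third feature of the first row becomes 4.0, the mean of 3 and 5 from its two nearest rows. With distance weights the closer row dominates, so the fill moves towards 3.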

> Let's explore how to implement KNN imputation in Python with the Titanic dataset.

In [73]:
Titanic=sns.load_dataset("titanic")
Titanic.isnull().sum().sort_values(ascending=False)

deck           688
age            177
embarked         2
embark_town      2
survived         0
pclass           0
sex              0
sibsp            0
parch            0
fare             0
class            0
who              0
adult_male       0
alive            0
alone            0
dtype: int64

In [74]:
from sklearn.impute import KNNImputer
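# KNN needs other features to measure similarity; with "age" alone it can only fall back to the column mean
num_cols=["age","pclass","sibsp","parch","fare"]
Titanic["age"]=KNNImputer(n_neighbors=5).fit_transform(Titanic[num_cols])[:,0]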
Titanic.isnull().sum().sort_values(ascending=False)

deck           688
embarked         2
embark_town      2
survived         0
pclass           0
sex              0
age              0
sibsp            0
parch            0
fare             0
class            0
who              0
adult_male       0
alive            0
alone            0
dtype: int64

## `|` **Regression Imputation**

Regression imputation fills in missing values by fitting a regression model on the rows where the variable is observed, using other variables in the dataset as predictors, and then predicting the missing entries. A regression model suits numerical targets; for a categorical target the same idea applies with a classifier in place of the regressor.
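The idea can be sketched by hand with a single predictor. The frame below is made up so that `y` is exactly `2*x + 1`, which makes the expected fills (7 and 11) easy to verify; the column names are illustrative assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Made-up data where y is exactly 2*x + 1
df = pd.DataFrame({
    "x": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "y": [3.0, 5.0, np.nan, 9.0, np.nan, 13.0],
})

known = df[df["y"].notna()]     # rows used to fit the model
missing = df[df["y"].isna()]    # rows whose target must be predicted

model = LinearRegression().fit(known[["x"]], known["y"])
df.loc[missing.index, "y"] = model.predict(missing[["x"]])

print(df)
```

Real data will not be perfectly linear, so the predictions carry error, and every filled value sits exactly on the fitted line, which understates the natural spread of the variable.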

In [75]:
Titanic.columns

Index(['survived', 'pclass', 'sex', 'age', 'sibsp', 'parch', 'fare',
       'embarked', 'class', 'who', 'adult_male', 'deck', 'embark_town',
       'alive', 'alone'],
      dtype='object')

In [76]:
Titanic=sns.load_dataset("titanic")
Titanic.isnull().sum().sort_values(ascending=False)

deck           688
age            177
embarked         2
embark_town      2
survived         0
pclass           0
sex              0
sibsp            0
parch            0
fare             0
class            0
who              0
adult_male       0
alive            0
alone            0
dtype: int64

In [77]:
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
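# Regress age on other numeric columns; with "age" alone the imputer can only fall back to the column mean
num_cols=["age","pclass","sibsp","parch","fare"]
Titanic["age"]=IterativeImputer(max_iter=10,random_state=0).fit_transform(Titanic[num_cols])[:,0]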
Titanic.isnull().sum().sort_values(ascending=False)

deck           688
embarked         2
embark_town      2
survived         0
pclass           0
sex              0
age              0
sibsp            0
parch            0
fare             0
class            0
who              0
adult_male       0
alive            0
alone            0
dtype: int64

## `|` **Random Forests for Imputing Missing Values**

*Random Forests can be a powerful tool for imputing missing values in a dataset. Imputing missing values is an essential preprocessing step because many machine learning algorithms, including many Random Forest implementations, cannot handle missing data directly.*

**Below is a detailed guide on utilizing Random Forests to impute missing values**

`1` **Understand the Data:** 
> - Identify features with missing values.

`2` **Data Preprocessing** 
> - Separate the dataset into features (X) and target variable (y).
> - Encode categorical variables if necessary.

`3` **Identify Variables for Imputation**
> - Identify the variables with missing values that you want to impute.

`4` **Build the Random Forest Model**
> - Use the dataset with complete data to train a Random Forest model.
> - The target variable should be the variable you want to impute, and the other variables (with no missing values) will be the features.

`5` **Impute Missing Values**
> - Use the trained Random Forest model to predict the missing values in the dataset with missing values.


`6` **Evaluate Imputation Quality**
> - Assess the quality of the imputation by comparing imputed values with true values (for example on held-out rows where the value is known), or use other metrics to evaluate accuracy.


In [78]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error, mean_absolute_percentage_error
from sklearn.impute import SimpleImputer

# 1. load the dataset
Titanic = sns.load_dataset('titanic')

# check missing values in each column
Titanic.isnull().sum().sort_values(ascending=False)

deck           688
age            177
embarked         2
embark_town      2
survived         0
pclass           0
sex              0
sibsp            0
parch            0
fare             0
class            0
who              0
adult_male       0
alive            0
alone            0
dtype: int64

In [79]:
Titanic.drop('deck', axis=1, inplace=True)
Titanic.isnull().sum().sort_values(ascending=False)
Titanic.head()


Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.25,S,Third,man,True,Southampton,no,False
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,Cherbourg,yes,False
2,1,3,female,26.0,0,0,7.925,S,Third,woman,False,Southampton,yes,True
3,1,1,female,35.0,1,0,53.1,S,First,woman,False,Southampton,yes,False
4,0,3,male,35.0,0,0,8.05,S,Third,man,True,Southampton,no,True


In [65]:
from sklearn.preprocessing import LabelEncoder
Encode_columns=["sex","embarked","class","who","embark_town","alive"]
LabelEncoder1={}
for col in Encode_columns:
    LabelEncoder1[col]=LabelEncoder()
    Titanic[col]=LabelEncoder1[col].fit_transform(Titanic[col])
# LabelEncoder gives NaN its own code, so embarked and embark_town report 0 nulls afterwards
print(Titanic.isnull().sum().sort_values(ascending=False))


age            177
survived         0
pclass           0
sex              0
sibsp            0
parch            0
fare             0
embarked         0
class            0
who              0
adult_male       0
embark_town      0
alive            0
alone            0
dtype: int64


In [28]:
# Split the dataset into two parts: one with missing values, one without
Titanic_with_missing = Titanic[Titanic['age'].isna()]
Titanic_with_missing.isnull().sum()

survived         0
pclass           0
sex              0
age            177
sibsp            0
parch            0
fare             0
embarked         0
class            0
who              0
adult_male       0
embark_town      0
alive            0
alone            0
dtype: int64

In [29]:
Titanic_without_missing = Titanic.dropna()
Titanic_without_missing.isnull().sum()

survived       0
pclass         0
sex            0
age            0
sibsp          0
parch          0
fare           0
embarked       0
class          0
who            0
adult_male     0
embark_town    0
alive          0
alone          0
dtype: int64

In [80]:
print("The shape of the original dataset is: ", Titanic.shape)
print("The shape of the dataset without missing values: ", Titanic_without_missing.shape)
print("The shape of the dataset with missing values is: ", Titanic_with_missing.shape)

The shape of the original dataset is:  (891, 14)
The shape of the dataset without missing values:  (714, 14)
The shape of the dataset with missing values is:  (177, 14)


In [81]:
Titanic_with_missing.head()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,embark_town,alive,alone
5,0,3,1,,0,0,8.4583,1,2,1,True,1,0,True
17,1,2,1,,0,0,13.0,2,1,1,True,2,1,True
19,1,3,0,,0,0,7.225,0,2,2,False,0,1,True
26,0,3,1,,0,0,7.225,0,2,1,True,0,0,True
28,1,3,0,,0,0,7.8792,1,2,2,False,1,1,True


In [82]:
Titanic_without_missing.head()


Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,embark_town,alive,alone
0,0,3,1,22.0,1,0,7.25,2,2,1,True,2,0,False
1,1,1,0,38.0,1,0,71.2833,0,0,2,False,0,1,False
2,1,3,0,26.0,0,0,7.925,2,2,2,False,2,1,True
3,1,1,0,35.0,1,0,53.1,2,0,2,False,2,1,False
4,0,3,1,35.0,0,0,8.05,2,2,1,True,2,0,True


In [84]:
X=Titanic_without_missing.drop('age',axis=1)
y=Titanic_without_missing["age"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)

# Random Forest Imputation
rf_model = RandomForestRegressor(n_estimators=100, random_state=42)
rf_model.fit(X_train, y_train)



In [85]:
y_pred = rf_model.predict(X_test)

In [86]:
# evaluate the model
print("RMSE for Random Forest Imputation: ", np.sqrt(mean_squared_error(y_test, y_pred)))
print("R2 Score for Random Forest Imputation: ", r2_score(y_test, y_pred))
print("MAE for Random Forest Imputation: ", mean_absolute_error(y_test, y_pred))
print("MAPE for Random Forest Imputation: ", mean_absolute_percentage_error(y_test, y_pred))

RMSE for Random Forest Imputation:  11.081260589808045
R2 Score for Random Forest Imputation:  0.33769388288226154
MAE for Random Forest Imputation:  8.666661815622195
MAPE for Random Forest Imputation:  0.40839466096086574
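The cells above train and evaluate the model but stop before step 5, writing the predictions back into the rows where the value is missing. Here is a minimal, self-contained sketch of steps 4 and 5 on synthetic data; the `demo` frame, its columns, and the formula generating `age` are made-up stand-ins, not the Titanic data.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(42)

# Synthetic stand-in for an encoded dataset (values are made up)
n = 200
demo = pd.DataFrame({
    "pclass": rng.integers(1, 4, size=n),
    "fare": rng.uniform(5, 100, size=n),
})
demo["age"] = 50 - 8 * demo["pclass"] + 0.1 * demo["fare"] + rng.normal(0, 2, size=n)
demo.loc[rng.choice(n, size=40, replace=False), "age"] = np.nan   # hide 40 ages

features = ["pclass", "fare"]
has_age = demo["age"].notna()

# Step 4: train on the rows where age is known
rf = RandomForestRegressor(n_estimators=100, random_state=42)
rf.fit(demo.loc[has_age, features], demo.loc[has_age, "age"])

# Step 5: predict age for the rows where it is missing and write it back
demo.loc[~has_age, "age"] = rf.predict(demo.loc[~has_age, features])

print(demo["age"].isna().sum())
```

In this notebook the equivalent step would be `rf_model.predict(Titanic_with_missing.drop('age', axis=1))`, ideally after refitting the model on all 714 complete rows rather than only the 80% training split.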


## `|` Advanced Methods

**Multiple Imputation by Chained Equations (MICE)**

Multiple Imputation by Chained Equations (MICE) is an advanced method for filling in missing data in a dataset. It models each variable with missing values, one at a time, as a function of the other variables, and cycles through the variables repeatedly until the imputed values stabilize. MICE is suitable for datasets with mixed types of variables, including both categorical and numerical features.
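A hedged sketch of the "multiple" part: scikit-learn's `IterativeImputer` performs the chained-equations cycle, and setting `sample_posterior=True` with different seeds yields several plausible completed datasets. The data below is synthetic, and averaging the draws cell by cell is a simplification chosen for brevity.

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (enables IterativeImputer)
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(0)

# Made-up numeric data: b depends on a, c is independent noise
n = 100
a = rng.normal(0, 1, size=n)
data = pd.DataFrame({
    "a": a,
    "b": 2 * a + rng.normal(0, 0.1, size=n),
    "c": rng.normal(0, 1, size=n),
})
data.loc[rng.choice(n, 15, replace=False), "a"] = np.nan
data.loc[rng.choice(n, 15, replace=False), "b"] = np.nan

# Draw several completed datasets; sample_posterior=True adds the randomness
# that makes each draw differ
completed = []
for seed in range(5):
    imputer = IterativeImputer(max_iter=10, sample_posterior=True, random_state=seed)
    completed.append(pd.DataFrame(imputer.fit_transform(data), columns=data.columns))

# Pool the draws, here simply by averaging them cell by cell
pooled = sum(completed) / len(completed)
print(pooled.isna().sum())
```

In a full multiple-imputation analysis, the downstream model is usually fitted on each completed dataset and the resulting estimates are pooled, rather than averaging the imputed cells themselves.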

In [7]:
# import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer

# load the dataset
Titanic = sns.load_dataset('titanic')
Titanic.isnull().sum().sort_values(ascending=False)



deck           688
age            177
embarked         2
embark_town      2
survived         0
pclass           0
sex              0
sibsp            0
parch            0
fare             0
class            0
who              0
adult_male       0
alive            0
alone            0
dtype: int64

In [8]:
Titanic.dtypes

survived          int64
pclass            int64
sex              object
age             float64
sibsp             int64
parch             int64
fare            float64
embarked         object
class          category
who              object
adult_male         bool
deck           category
embark_town      object
alive            object
alone              bool
dtype: object

In [9]:
# LabelEncoder must be imported before it is used in this cell
from sklearn.preprocessing import LabelEncoder

LabelEncoder1={}
for col in Titanic:
    if Titanic[col].dtype=="object" or Titanic[col].dtype=="category":
        LabelEncoder1[col]=LabelEncoder()
        Titanic[col]=LabelEncoder1[col].fit_transform(Titanic[col])

Titanic.dtypes

In [None]:
Null_columns=Titanic.columns[Titanic.isnull().any()]
print("```````````````````````````````````")
print("Total Null Columns in This data set\n",Null_columns)
print("\n```````````````````````````````````")

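# MICE models each incomplete column from the others, so pass all columns together;
# a column imputed in isolation gives the imputer nothing to regress on
imputed=IterativeImputer(max_iter=10,random_state=0).fit_transform(Titanic.astype(float))
Titanic=pd.DataFrame(imputed,columns=Titanic.columns,index=Titanic.index)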
    
Titanic.isnull().sum()


```````````````````````````````````
Total Null Columns in This data set
 Index([], dtype='object')

```````````````````````````````````


survived       0
pclass         0
sex            0
age            0
sibsp          0
parch          0
fare           0
embarked       0
class          0
who            0
adult_male     0
deck           0
embark_town    0
alive          0
alone          0
dtype: int64

### **What is an Autoencoder?**

*An autoencoder is a type of neural network designed to replicate its input in its output. It consists of an encoder that compresses input data into a latent-space representation and a decoder that reconstructs the input from this representation. When applied to imputation, autoencoders are trained to ignore noise or missing values in the input data. During training, the network learns to predict missing values by minimizing the reconstruction error for known parts of the data.*


**How Autoencoders Work for Imputation:**

*The key idea is to train the autoencoder to ignore the noise (missing values) in the input data.
During training, inputs with missing values are presented, and the network learns to predict the missing values in a way that minimizes reconstruction error for known parts of the data.
This results in the network learning a robust representation of the data, enabling it to make reasonable guesses about missing values.*
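The training idea described above can be sketched without a deep learning framework. The following is a toy denoising autoencoder with one tanh hidden layer written in NumPy; the function name `autoencoder_impute`, its hyperparameters, and the synthetic data are all assumptions made for illustration, and a real project would more likely use a framework such as PyTorch or Keras.

```python
import numpy as np

def autoencoder_impute(X, hidden=2, epochs=500, lr=0.1, drop=0.2, seed=0):
    """Toy denoising autoencoder (one tanh hidden layer) for imputing NaNs."""
    rng = np.random.default_rng(seed)
    observed = ~np.isnan(X)                          # True where a value is known
    mean = np.nanmean(X, axis=0)
    std = np.nanstd(X, axis=0) + 1e-8
    Z = np.where(observed, (X - mean) / std, 0.0)    # standardise; missing -> 0
    n_features = X.shape[1]
    W1 = rng.normal(0, 0.1, (n_features, hidden))
    b1 = np.zeros(hidden)
    W2 = rng.normal(0, 0.1, (hidden, n_features))
    b2 = np.zeros(n_features)

    for _ in range(epochs):
        # Hide a random subset of inputs so the network must infer them
        keep = rng.random(Z.shape) > drop
        Z_in = Z * keep
        H = np.tanh(Z_in @ W1 + b1)                  # encoder
        out = H @ W2 + b2                            # decoder
        # Gradient of the mean squared error, counted on observed cells only
        grad_out = 2 * (out - Z) * observed / observed.sum()
        grad_H = grad_out @ W2.T
        grad_pre = grad_H * (1 - H ** 2)             # derivative of tanh
        W2 -= lr * (H.T @ grad_out)
        b2 -= lr * grad_out.sum(axis=0)
        W1 -= lr * (Z_in.T @ grad_pre)
        b1 -= lr * grad_pre.sum(axis=0)

    recon = (np.tanh(Z @ W1 + b1) @ W2 + b2) * std + mean
    return np.where(observed, X, recon)              # known values stay untouched

# Synthetic data: three columns driven by one hidden factor, 10% of cells removed
rng = np.random.default_rng(1)
latent = rng.normal(size=(200, 1))
X_full = np.hstack([latent, 2 * latent, -latent]) + rng.normal(0, 0.05, (200, 3))
X_missing = X_full.copy()
X_missing[rng.random(X_full.shape) < 0.1] = np.nan

X_imputed = autoencoder_impute(X_missing)
print(np.isnan(X_imputed).sum())
```

The loss is masked so that only observed cells contribute, and random input dropout during training plays the role of the "noise" the network learns to see through.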

**Advantages of Using Autoencoders:**

- **Handling Complex Patterns:** *They can capture non-linear relationships in the data, which is particularly useful for complex datasets.*
- **Scalability:** *They can handle large-scale datasets efficiently.*
- **Flexibility:** *Autoencoders can be adapted to different types of data, such as images, text, or time-series.*

**Implementation Considerations:**

- **Data Preprocessing:** Data should be normalized or standardized before feeding it into an autoencoder.
- **Network Architecture:** The choice of architecture (number of layers, type of layers, etc.) depends on the complexity of the data.
- **Training Process:** It might involve techniques like dropout or noise addition to improve the model's ability to handle missing data.

**Example Use-Cases:**

- **Image Data:** Autoencoders can be used to fill in missing pixels or reconstruct corrupted images.
- **Time-Series Data:** They are effective in imputing missing values in sequences, such as stock prices or weather data.
- **Tabular Data:** Handling missing entries in datasets used for machine learning.

---

**`Why do missing values seem to be imputed automatically after applying LabelEncoder?`**


---


In [None]:
# import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer

# load the dataset
df = sns.load_dataset('titanic')
df.isnull().sum().sort_values(ascending=False)


deck           688
age            177
embarked         2
embark_town      2
survived         0
pclass           0
sex              0
sibsp            0
parch            0
fare             0
class            0
who              0
adult_male       0
alive            0
alone            0
dtype: int64

In [None]:
from sklearn.preprocessing import LabelEncoder

# create a LabelEncoder object using LabelEncoder() in for loop for categorical columns
# Columns to encode
columns_to_encode = ['sex', 'embarked', 'who', 'deck', 'class', 'embark_town', 'alive']

# Dictionary to store LabelEncoders for each column
label_encoders = {}

# Loop to apply LabelEncoder to each column for encoding
for col in columns_to_encode:
    # Create a new LabelEncoder for the column
    le = LabelEncoder()
    # Fit and transform the data
    df[col] = le.fit_transform(df[col])
    # Store the encoder in the dictionary
    label_encoders[col] = le
df.head()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,1,22.0,1,0,7.25,2,2,1,True,7,2,0,False
1,1,1,0,38.0,1,0,71.2833,0,0,2,False,2,0,1,False
2,1,3,0,26.0,0,0,7.925,2,2,2,False,7,2,1,True
3,1,1,0,35.0,1,0,53.1,2,0,2,False,2,2,1,False
4,0,3,1,35.0,0,0,8.05,2,2,1,True,7,2,0,True


In [None]:
df.isnull().sum().sort_values(ascending=False)


age            177
survived         0
pclass           0
sex              0
sibsp            0
parch            0
fare             0
embarked         0
class            0
who              0
adult_male       0
deck             0
embark_town      0
alive            0
alone            0
dtype: int64
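The null counts for `deck`, `embarked`, and `embark_town` drop to zero above because `LabelEncoder` gives missing values an integer code of their own (the `deck` code 7 in the first row is that extra class); nothing is actually estimated. If the goal is to encode categories while keeping the gaps visible for a later imputer, one option is `pandas.factorize`, which marks missing entries with -1. A minimal sketch on a made-up series, with `encode_keep_nan` as a hypothetical helper name:

```python
import numpy as np
import pandas as pd

def encode_keep_nan(series):
    """Integer-encode categories but leave missing entries as NaN."""
    codes, uniques = pd.factorize(series, sort=True)   # missing values get code -1
    encoded = pd.Series(codes, index=series.index, dtype="float64")
    return encoded.where(encoded != -1), uniques        # turn -1 back into NaN

# Made-up column resembling `embarked`
ports = pd.Series(["S", "C", np.nan, "Q", "S"])
encoded, uniques = encode_keep_nan(ports)

print(encoded.tolist())
print(list(uniques))
```

The encoded column still reports one missing value, so a later imputation step can treat it explicitly instead of mistaking "missing" for a real category.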