## Recalling what we talked about last class... missing data!
In the real world, we very often have to deal with datasets that aren't ready for analysis. One common reason is missing data: no value is stored for some observations of a variable.

We saw that missing data can arise through **different mechanisms**: it can be missing completely at random (MCAR), missing at random (MAR, where the missingness depends only on other observed variables), or missing not at random (MNAR, a systematic loss that depends on the unobserved values themselves). Moreover, we saw two different ways to deal with this problem: **Complete Case Analysis**, or CCA (which consists of discarding observations with missing values), and **mean / median imputation**.
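As a quick refresher, the two approaches from last class can be sketched with pandas. The `toy` DataFrame and its column names are hypothetical, made up for illustration only; they are not the course dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data (assumption: not the course dataset)
toy = pd.DataFrame({
    "temperature": [300.0, np.nan, 302.0, 304.0],
    "torque": [40.0, 42.0, np.nan, 50.0],
})

# Complete Case Analysis: keep only the rows with no missing value
cca = toy.dropna()

# Mean / median imputation: fill each column with its own statistic
mean_imputed = toy.fillna(toy.mean())
median_imputed = toy.fillna(toy.median())

print(len(cca))        # number of rows that survive CCA
print(mean_imputed)
print(median_imputed)
```

Note how CCA throws away half of this tiny table, while mean / median imputation keeps every row but fills each gap in a column with the same value.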

Now, let's see more robust ways to deal with missing data, such as **KNN Imputation** and **Iterative Imputation**.

Let's start by identifying the missing values on a Predictive Maintenance dataset.

In [0]:
%pip install -U missingno

In [0]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

In [0]:
path = '/dbfs/FileStore/CDS2023/predictive_maintenance.csv'
data = pd.read_csv(path)
data.head()

In [0]:
data.columns

In [0]:
# Drop the identifier and categorical columns, which are not used for imputation
data = data.drop(columns=['Product ID', 'Type'])

In [0]:
data

Let's plot a matrix that shows how nulls are scattered across the dataset.

In [0]:
import missingno as msno
msno.matrix(data);

We can also see if there's correlation between the missingness of the variables using a heatmap.

In [0]:
msno.heatmap(data)

As we can see from the heatmap, the missingness of the 3 variables isn't correlated.
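To make "correlation between missingness" more concrete, here is a minimal sketch that computes the correlation of the null indicators directly with pandas, which is the kind of quantity the heatmap visualizes. The `toy` DataFrame and its columns `a`, `b`, `c` are assumptions for illustration: `a` and `b` lose values independently, while `c` is forced to be missing wherever `a` is.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical toy data (assumption: not the course dataset)
toy = pd.DataFrame(rng.normal(size=(200, 3)), columns=["a", "b", "c"])

# Knock out roughly 20% of "a" and "b" independently of each other
toy.loc[rng.random(200) < 0.2, "a"] = np.nan
toy.loc[rng.random(200) < 0.2, "b"] = np.nan

# Make "c" missing exactly where "a" is missing
toy.loc[toy["a"].isnull(), "c"] = np.nan

# Correlation between the True/False "is missing" indicators
nullity_corr = toy.isnull().corr()
print(nullity_corr.round(2))
```

A value close to 0 (as for `a` and `b`) suggests that the gaps in one variable tell us nothing about the gaps in the other, while a value close to 1 (as for `a` and `c`) means they tend to be missing together.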

## Imputing with KNN Imputer

This method imputes numeric values using an algorithm from Sklearn: *KNN Imputer*. KNN stands for **K-Nearest Neighbors**. That is to say, it fills each missing numeric value with the average of that feature across the row's k nearest neighbors, where nearness is measured by a distance computed on the features that are not missing.

But what is the idea behind this concept?
![Diagram illustrating the K-Nearest Neighbors idea (source: IBM)](https://www.ibm.com/content/dam/connectedassets-adobe-cms/worldwide-content/cdp/cf/ul/g/ef/3a/KNN.component.l.ts=1653407890466.png/content/adobe-cms/us/en/topics/knn/jcr:content/root/table_of_contents/intro/complex_narrative/items/content_group/image)
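A tiny worked example may help here. The array `X` below is hypothetical; its fourth row has a missing second feature. With `n_neighbors=2`, the two rows closest to it on the first feature are `[2.0, 20.0]` and `[3.0, 30.0]`, so the gap is filled with the average of 20 and 30.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Hypothetical toy data: the row [2.1, nan] has a value to impute
X = np.array([
    [1.0, 10.0],
    [2.0, 20.0],
    [3.0, 30.0],
    [2.1, np.nan],
    [9.0, 90.0],
])

# Distances are computed only on the features both rows have observed
# (KNNImputer's default metric is "nan_euclidean")
imputer = KNNImputer(n_neighbors=2)
X_imputed = imputer.fit_transform(X)

print(X_imputed[3])  # → [ 2.1 25. ]
```

The far-away row `[9.0, 90.0]` has no influence on the result, which is the main difference from a plain mean imputation over the whole column.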

In [0]:
from sklearn.impute import KNNImputer

# Copy the data
prediction_knn_imputed = data.copy(deep=True)

# Init the transformer
knn_imp = KNNImputer(n_neighbors=3)

# Fit/transform
prediction_knn_imputed.loc[:, :] = knn_imp.fit_transform(prediction_knn_imputed)

prediction_knn_imputed.isnull().sum()

With this imputer, we have to choose the value for k. So, let's use a visual approach to do so. The idea is to plot the original distribution (of the variable 'Tool wear [min]', in this case), then impute with different values of k and plot the resulting distributions on top of the original.

In [0]:
n_neighbors = [2, 3, 5, 7]

fig, ax = plt.subplots(figsize=(16, 8))
# Plot the original distribution
sns.kdeplot(data['Tool wear [min]'], label="Original Distribution")
for k in n_neighbors:
    knn_imp = KNNImputer(n_neighbors=k)
    prediction_knn_imputed.loc[:, :] = knn_imp.fit_transform(data)
    sns.kdeplot(prediction_knn_imputed['Tool wear [min]'], label=f"Imputed Dist with k={k}")

plt.legend();

The closer the imputed distribution comes to the original, the better the imputation. In this case, k=2 appears to be the best choice.
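As a rough numeric complement to the plot, one could also compare simple summary statistics of the imputed column against the observed values for each k. This is only a sketch on hypothetical synthetic data (the `toy` DataFrame and its columns `x`, `y`, `z` are assumptions), and matching the mean and standard deviation does not guarantee that the full distributions overlap.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

rng = np.random.default_rng(42)

# Hypothetical stand-in for the course dataset
toy = pd.DataFrame({
    "x": rng.normal(50, 10, 300),
    "y": rng.normal(100, 20, 300),
})
toy["z"] = toy["x"] + rng.normal(0, 5, 300)
toy.loc[rng.random(300) < 0.2, "z"] = np.nan  # remove about 20% of "z"

# Start with the statistics of the observed (non-missing) values
rows = [{"label": "original", "mean": toy["z"].mean(), "std": toy["z"].std()}]

# Impute with several values of k and record the same statistics
z_col = toy.columns.get_loc("z")
for k in [2, 3, 5, 7]:
    imputed = KNNImputer(n_neighbors=k).fit_transform(toy)
    z = imputed[:, z_col]
    rows.append({"label": f"k={k}", "mean": z.mean(), "std": z.std(ddof=1)})

summary = pd.DataFrame(rows).set_index("label")
print(summary.round(2))
```

Rows whose mean and standard deviation stay closest to the `original` row are the candidates worth inspecting in the plot.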

## Imputing with Iterative Imputer

The method takes an arbitrary Sklearn estimator and tries to impute missing values by modeling each feature with missing values as a function of the other features.

Here is a step-by-step explanation:

1- A regressor is passed to the transformer.

2- The first feature (feature_1) with missing values is chosen.

3- The data is split into train/test sets where the train set contains all the known values for feature_1, and the test set contains the missing samples.

4- The regressor is fit on all the other variables as inputs and with feature_1 as an output.

5- The regressor predicts the missing values.

6- The transformer continues this process until all features are imputed.

7- Steps 2–6 make up a single iteration round, and these rounds are carried out multiple times, up to the number specified by the max_iter transformer parameter.
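The steps above can be sketched by hand in a few lines. This is a simplified illustration, not Sklearn's actual implementation: the function name `iterative_impute` is hypothetical, `LinearRegression` is used as a stand-in regressor, the gaps are given an initial guess (the column mean) so that the other features can serve as inputs, and details such as the imputation order and early stopping are left out.

```python
import numpy as np
from sklearn.linear_model import LinearRegression


def iterative_impute(X, n_rounds=10):
    """Simplified round-robin imputation (illustrative sketch only)."""
    X = np.asarray(X, dtype=float)
    missing = np.isnan(X)

    # Initial guess: fill every gap with its column mean
    filled = np.where(missing, np.nanmean(X, axis=0), X)

    for _ in range(n_rounds):                  # step 7: repeat the rounds
        for j in range(X.shape[1]):            # steps 2 and 6: each feature in turn
            if not missing[:, j].any():
                continue
            others = np.delete(filled, j, axis=1)
            known = ~missing[:, j]             # step 3: "train" rows
            model = LinearRegression()         # step 1: the regressor
            model.fit(others[known], filled[known, j])                 # step 4
            filled[missing[:, j], j] = model.predict(others[~known])   # step 5
    return filled


# Usage on hypothetical data where the second column is 2 * first + 1
rng = np.random.default_rng(0)
a = rng.normal(size=100)
X_true = np.column_stack([a, 2 * a + 1])
X_missing = X_true.copy()
X_missing[:10, 1] = np.nan

X_filled = iterative_impute(X_missing)
print(np.abs(X_filled - X_true).max())  # close to zero for this linear toy case
```

Because the toy relationship is exactly linear, the regressor recovers the hidden values almost perfectly; on real data the quality depends on how well the chosen estimator can predict each feature from the others.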

In [0]:
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer

# Copy the data
predictive_ii_imputed = data.copy(deep=True)

# Init
ii_imp = IterativeImputer(
    estimator=ExtraTreesRegressor(), max_iter=10, random_state=1121218
)

# Fit/transform
predictive_ii_imputed.loc[:, :] = ii_imp.fit_transform(predictive_ii_imputed)

In [0]:
predictive_ii_imputed.isnull().sum()

When all iterations are done, `IterativeImputer` returns only the result of the final round, on the expectation that the predictions improve with each iteration.
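To check how many rounds were actually run, the fitted imputer exposes the `n_iter_` attribute, which can be lower than `max_iter` when the imputer stops early. The sketch below uses hypothetical synthetic data and the default estimator rather than the `ExtraTreesRegressor` from the cell above.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(0)

# Hypothetical data: three noisy, linearly related columns
a = rng.normal(size=200)
X = np.column_stack([
    a,
    2 * a + rng.normal(scale=0.1, size=200),
    -a + rng.normal(scale=0.1, size=200),
])
X[rng.random(X.shape) < 0.1] = np.nan  # remove about 10% of the values

imp = IterativeImputer(max_iter=10, random_state=0)
X_filled = imp.fit_transform(X)  # only the final round's values are returned

print(imp.n_iter_)               # rounds actually performed (at most max_iter)
print(np.isnan(X_filled).sum())  # no missing values remain
```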

**Authors:** Julianada Coelho, Camila Mizokami

**References:**

https://scikit-learn.org/stable/modules/generated/sklearn.impute.KNNImputer.html

https://towardsdatascience.com/advanced-missing-data-imputation-methods-with-sklearn-d9875cbcc6eb