# <center>Deep Learning Imputation (datawig)</center>

Datawig is a recent tool that uses MXNet deep neural networks to predict missing values. It runs on a CPU or a GPU and supports numerical values, categorical values and more generic data types such as unstructured text.

The imputation model follows the approach of MICE, also referred to as fully conditional specification: for each to-be-imputed column (the output column), the user can specify the columns that might contain useful information for imputation (the input columns).

<u>Datawig showed favourable [results](https://jmlr.csail.mit.edu/papers/volume20/18-753/18-753.pdf) against:</u>
- fancyimpute package: mean, KNN, matrix factorization
- IterativeImputer: estimators included RandomForestRegressor and linear regression, which are similar to the missForest approach and to MICE with a linear model, respectively

<u>Drawbacks:</u> 
- Difficult to set up and run 
- With custom imputation models, only one column can be imputed at a time
- It can be slow

**When to use datawig:** datawig is a very recent algorithm and it hasn't been tested as thoroughly as MICE. However, there are [suggestions](https://pubmed.ncbi.nlm.nih.gov/35455196/) that it can be used when less than 40% of the values in the column are missing.

### Installing datawig

Datawig depends on MXNet, which causes many dependency clashes with recent Python versions. To avoid this, you will need to [create a virtual environment](https://www.geeksforgeeks.org/set-up-virtual-environment-for-python-using-anaconda/) with an earlier version of Python such as 3.7. Then follow the steps below:

1. Activate your new environment: `conda activate envname`
2. Install datawig with pip: `pip install datawig`
3. You may also need to downgrade numpy by installing an earlier version: `conda install -c conda-forge numpy=x.x.x`
4. Switch to your new environment through the Anaconda Navigator, choose your favourite IDE and let's go!
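
Before moving on, it can help to confirm that the active interpreter is the one from the new environment. The snippet below is a minimal sketch using only the standard library; the helper name `datawig_environment_report` and the Python 3.7 cut-off are assumptions taken from the advice above, not an official compatibility check.

```python
import importlib.util
import sys


def datawig_environment_report(max_minor=7):
    """Describe whether this interpreter looks suitable for datawig.

    max_minor=7 mirrors the "such as 3.7" advice above and is an assumption.
    """
    python_ok = sys.version_info[:2] <= (3, max_minor)
    # find_spec returns None when a package is not installed; it does not import it
    installed = {
        name: importlib.util.find_spec(name) is not None
        for name in ("datawig", "mxnet", "numpy")
    }
    return {
        "python": sys.version.split()[0],
        "python_ok": python_ok,
        "installed": installed,
    }


report = datawig_environment_report()
print(report)
```

If `python_ok` is False or any package shows as not installed, the wrong environment is probably active.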

### Running datawig 

Datawig offers two methods for imputation:
- **SimpleImputer**: uses default encoders and featurizers that usually yield good results 
- **Imputer**: allows us to specify which encoder and featurizer to use for each column

## Introduction to SimpleImputer (method 1)

An easy way to use the SimpleImputer is with the complete() function. This takes a data frame as an argument and automatically imputes all missing values with all other columns as inputs.

```
# Basic parameters
datawig.SimpleImputer.complete(df, precision_threshold = 0.8, inplace=True)
```

```
# High accuracy parameters
datawig.SimpleImputer.complete(
    df, precision_threshold = 0.8,
    inplace = True, hpo = True,   # Setting hpo to True can be slow!
    iterations = 20     # Higher iterations, higher computational time
)
```

The precision_threshold parameter only applies to categorical columns (it is ignored otherwise). If the model cannot reach this threshold for a given value, that value will not be imputed, so depending on the precision_threshold (default = 0.0) the returned data frame may still contain missing values. The hpo parameter takes a boolean and switches hyperparameter optimisation on or off. The iterations parameter sets the number of imputation iterations. Research suggests that a value of 20 is generally sufficient, but higher values can be used.
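
To make the effect of a threshold concrete, here is a simplified sketch in plain Python. It is not datawig's internal logic: the function `apply_precision_threshold` is hypothetical, and it uses the model's confidence in each prediction as a stand-in for the precision that datawig works with.

```python
def apply_precision_threshold(predictions, threshold=0.0):
    """Keep a predicted label only when its confidence reaches the threshold.

    predictions: list of (label, confidence) pairs (hypothetical format).
    Values below the threshold stay missing (None).
    """
    kept = []
    for label, confidence in predictions:
        kept.append(label if confidence >= threshold else None)
    return kept


predictions = [("S", 0.95), ("C", 0.62), ("Q", 0.81)]  # made-up values
print(apply_precision_threshold(predictions, threshold=0.8))  # ['S', None, 'Q']
print(apply_precision_threshold(predictions))                 # ['S', 'C', 'Q']
```

With the default of 0.0 every prediction is kept, whereas a threshold of 0.8 leaves the uncertain value missing.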

A few things to consider before using this method:
- It will not work properly if data types are not correctly specified (e.g. numeric columns passed as string) 
- It will raise a ValueError if columns of type category exist in the dataset
- It will not impute categorical columns unless missingness is very low

In [1]:
# Basic libraries
import pandas as pd
import numpy as np
import pickle
import warnings
warnings.filterwarnings('ignore')

# datawig
import datawig

In [2]:
# Import data
with open("titanic_df.p", "rb") as f:  # close the file once loaded
    df = pickle.load(f)
df.head()

| PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C | C |
| 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C | S |
| 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 891 entries, 1 to 891
Data columns (total 11 columns):
 #   Column    Non-Null Count  Dtype   
---  ------    --------------  -----   
 0   Survived  891 non-null    category
 1   Pclass    891 non-null    category
 2   Name      891 non-null    object  
 3   Sex       891 non-null    category
 4   Age       714 non-null    float64 
 5   SibSp     891 non-null    int64   
 6   Parch     891 non-null    int64   
 7   Ticket    891 non-null    object  
 8   Fare      891 non-null    float64 
 9   Cabin     204 non-null    object  
 10  Embarked  889 non-null    category
dtypes: category(4), float64(2), int64(2), object(3)
memory usage: 59.6+ KB


In [4]:
# Our missing data
df.isna().sum()

Survived      0
Pclass        0
Name          0
Sex           0
Age         177
SibSp         0
Parch         0
Ticket        0
Fare          0
Cabin       687
Embarked      2
dtype: int64

In [5]:
# Convert category types to prevent ValueError
df[["Survived","Pclass"]] = df[["Survived","Pclass"]].astype("int")
df[["Sex","Embarked"]] = df[["Sex","Embarked"]].astype("object")

In [6]:
df_imputed = datawig.SimpleImputer.complete(df)
df_imputed.isna().sum()

Survived      0
Pclass        0
Name          0
Sex           0
Age           0
SibSp         0
Parch         0
Ticket        0
Fare          0
Cabin       687
Embarked      0
dtype: int64

All columns were imputed except Cabin, a categorical column with high missingness. The authors of datawig argue that it will not impute when predictive accuracy is not high enough. This may explain the remaining NaNs, but let's try to overcome this by building a custom imputation model.

## Introduction to SimpleImputer (method 2)

The SimpleImputer also allows for **custom imputation models**. One thing to note with this approach is that the procedure must be repeated for every to-be-imputed column. 

```
# Custom imputation model 
imputer = datawig.SimpleImputer(
    input_columns=['input1', 'input2'], 
    output_column= 'output', 
    output_path = 'imputer_model'  # Stores model metrics
    )
```

After building an imputation model, we will need some data to train it with and different data to test it on. If the dataset is not already split, datawig provides its own split function.

```
# Split data for SimpleImputer 
df_train, df_test = datawig.utils.random_split(df)
```
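
For intuition, a random split can be sketched in plain Python as below. The function `random_split_rows`, the 80/20 ratio and the seed are assumptions made for illustration; they are not a statement about how `datawig.utils.random_split` is implemented or what its defaults are.

```python
import random


def random_split_rows(rows, train_fraction=0.8, seed=10):
    """Shuffle row positions and cut them into a training and a test part."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)  # seeded so the split is reproducible
    cut = int(len(rows) * train_fraction)
    train = [rows[i] for i in indices[:cut]]
    test = [rows[i] for i in indices[cut:]]
    return train, test


rows = list(range(1, 892))  # stand-in for the 891 Titanic passenger ids
train_rows, test_rows = random_split_rows(rows)
print(len(train_rows), len(test_rows))  # 712 179
```

Every row ends up in exactly one of the two parts, so the model is never tested on rows it was trained on.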

```
# Fit model 
imputer.fit(train_df = df_train, num_epochs = 50)
```

We can also fit with hyperparameter tuning: `imputer.fit_hpo(train_df=df_train)`. The num_epochs parameter sets the number of training epochs, that is, how many full passes the network makes over the training data. A rule of thumb is to start with a value that is three times the number of columns in your data.
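
As a quick worked example of the rule of thumb above, the Titanic data frame used in this notebook has 11 columns, which gives a starting point of 33 epochs. The helper name `suggested_num_epochs` is hypothetical.

```python
def suggested_num_epochs(n_columns, factor=3):
    """Starting value for num_epochs: three times the number of columns."""
    return factor * n_columns


print(suggested_num_epochs(11))  # 33
```

This is only a starting value; the cells in this notebook use 50.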

```
# Impute missing values and return original dataframe with predictions
predictions = imputer.predict(df_test)

# Load the metrics stored by the fitted model
metrics = imputer.load_metrics()
weighted_f1 = metrics['weighted_f1']
avg_precision = metrics['avg_precision']
print("weighted_f1 :", weighted_f1, "\n", "avg_precision :", avg_precision)
```

In [7]:
# Split dataset
df_train, df_test = datawig.utils.random_split(df)

In [8]:
# Custom imputation model for Cabin 
imputer_Cabin = datawig.SimpleImputer(
    input_columns=['Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp', 'Parch', 'Ticket', 'Fare', 'Embarked'], 
    output_column= 'Cabin', 
    output_path = 'imputer_model'
    )

In [9]:
# Fit model 
imputer_Cabin.fit(train_df = df_train, num_epochs = 50)

<datawig.simple_imputer.SimpleImputer at 0x7f2a3c5e8210>

In [10]:
# Impute missing values
predictions = imputer_Cabin.predict(df_test)

In [11]:
# Calculate metrics - datawig provides its own metrics for categorical variables
metrics = imputer_Cabin.load_metrics()
weighted_f1 = metrics['weighted_f1']
avg_precision = metrics['avg_precision']
print("weighted_f1 :", weighted_f1, "\n", "avg_precision :", avg_precision)

weighted_f1 : 0.6025641025641025 
 avg_precision : 0.6214285714285714
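
To help read the weighted_f1 figure, the sketch below computes a weighted F1 score by hand: the F1 of each class is averaged with weights proportional to how often the class occurs in the true labels. This is the usual textbook definition; whether datawig computes its `weighted_f1` in exactly this way is an assumption, and the labels are made up rather than taken from the Titanic data.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Per-class F1 averaged with weights proportional to class support."""
    support = Counter(y_true)
    total = 0.0
    for label, count in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = count - tp
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        total += f1 * count  # weight each class by its number of true samples
    return total / len(y_true)


y_true = ["C", "C", "C", "B", "B", "E"]  # made-up labels
y_pred = ["C", "C", "B", "B", "C", "C"]
print(round(weighted_f1(y_true, y_pred), 3))  # 0.452
```

Note how the rare class "E", which is never predicted, contributes an F1 of 0 and pulls the average down.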


The prediction score is low, but this is expected with such high missingness in the column. One way to improve the predictions is to provide the model with input variables that add more information about the output column. This is where exploring missing data comes in handy, as we get to **understand the relationships between variables**. Another way may be to improve the **balancing of classes** within the column. If a certain class dominates the column, the predictions will be biased towards that class.
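
A quick way to check class balance is to look at the share of each class among the observed values. The sketch below uses a plain list with made-up deck letters; the helper `class_shares` is hypothetical and not part of datawig.

```python
from collections import Counter


def class_shares(values):
    """Return each observed class with its share of the non-missing values."""
    observed = [v for v in values if v is not None]
    counts = Counter(observed)
    return {label: n / len(observed) for label, n in counts.most_common()}


cabin_decks = ["C", "C", None, "B", "C", None, "E", "C"]  # made-up values
shares = class_shares(cabin_decks)
print(shares)
```

Here "C" accounts for two thirds of the observed values, so a model could score reasonably well by always predicting "C".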


**Final notes:**

If you look at my notebook where I test the behaviour of datawig on a couple of datasets (titanic included), you will notice that here we got a higher accuracy score when imputing the Cabin column. This is because we carried out the imputations on Cabin after Age was imputed which means that the model had more data to learn from.

## Imputer

The Imputer offers more flexibility in specifying model parameters, such as which encoders and featurizers to use.

**ColumnEncoder:** Transforms the raw data of a column into an encoded numerical representation.
- **<span style="color:purple">SequentialEncoder</span>:** for sequences of string symbols (e.g. characters or words)
- **<span style="color:purple">BowEncoder</span>**: bag-of-word representation for strings, as sparse vectors
- **<span style="color:purple">CategoricalEncoder</span>**: for categorical variables (one-hot encoding)
- **<span style="color:purple">NumericalEncoder</span>**: for numerical values
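
To illustrate what "encoding" means here, the sketch below shows a one-hot vector and a bag-of-words vector in plain Python. These are conceptual illustrations only: the functions `one_hot` and `bag_of_words` are hypothetical, and datawig's own encoders may tokenise, hash and store the data differently.

```python
from collections import Counter


def one_hot(value, categories):
    """One-hot vector for a categorical value; all zeros if the value is unseen."""
    return [1 if value == category else 0 for category in categories]


def bag_of_words(text, vocabulary):
    """Token counts over a fixed vocabulary (stored densely here for readability)."""
    counts = Counter(text.lower().split())
    return [counts[token] for token in vocabulary]


print(one_hot("C", ["B", "C", "E"]))                               # [0, 1, 0]
print(bag_of_words("Mr. Owen Harris", ["mr.", "mrs.", "harris"]))  # [1, 0, 1]
```

In both cases the raw column value becomes a fixed-length numerical vector that a neural network can consume.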

**Featurizer:** converts encoded data into features that will be used in the imputation model for training and prediction. There are a few options for Featurizers depending on which ColumnEncoder was used for a particular column.
- **<span style="color:purple">LSTMFeaturizer</span>** - used with SequentialEncoder
- **<span style="color:purple">BowFeaturizer</span>** - used with BowEncoder 
- **<span style="color:purple">EmbeddingFeaturizer</span>** - used with CategoricalEncoder
- **<span style="color:purple">NumericalFeaturizer</span>** - used with NumericalEncoder
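
The pairings listed above can be written down as a small lookup table, which is handy for checking a plan of columns before building the Imputer. The sketch works with class names as strings only, so it does not need datawig installed; the helper `mismatched_columns` is hypothetical.

```python
# Pairings as listed above, written as class names (strings only, no datawig import)
FEATURIZER_FOR_ENCODER = {
    "SequentialEncoder": "LSTMFeaturizer",
    "BowEncoder": "BowFeaturizer",
    "CategoricalEncoder": "EmbeddingFeaturizer",
    "NumericalEncoder": "NumericalFeaturizer",
}


def mismatched_columns(column_plan):
    """Return the columns whose featurizer does not match their encoder."""
    return [
        column
        for column, (encoder, featurizer) in column_plan.items()
        if FEATURIZER_FOR_ENCODER.get(encoder) != featurizer
    ]


plan = {
    "Sex": ("BowEncoder", "BowFeaturizer"),
    "Embarked": ("BowEncoder", "BowFeaturizer"),
    "Age": ("NumericalEncoder", "BowFeaturizer"),  # deliberately wrong
}
print(mismatched_columns(plan))  # ['Age']
```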

```
# Specifying Encoders and Featurizers
data_encoder_cols = [BowEncoder('input1'),
                     BowEncoder('input2')]
label_encoder_cols = [CategoricalEncoder('output')]
data_featurizer_cols = [BowFeaturizer('input1'),
                        BowFeaturizer('input2')] 
```

For the input columns that contain data useful for imputation, the Imputer expects you to specify the particular encoders and featurizers. For the label column that you are trying to impute, only the type of encoder needs to be specified.

```
imputer = Imputer(
    data_featurizers=data_featurizer_cols,
    label_encoders=label_encoder_cols,
    data_encoders=data_encoder_cols,
    output_path='imputer_model'
)

imputer.fit(train_df=df_train, num_epochs=50)
predictions = imputer.predict(df_test)
```
We will carry out the imputation only on the categorical variable Cabin.

In [12]:
# Libraries: import only the classes used below instead of star imports
from datawig import Imputer
from datawig.column_encoders import BowEncoder, CategoricalEncoder
from datawig.mxnet_input_symbols import BowFeaturizer

In [13]:
# Specifying Encoders and Featurizers
data_encoder_cols = [BowEncoder('Sex'),
                     BowEncoder('Embarked')]
label_encoder_cols = [CategoricalEncoder('Cabin')]
data_featurizer_cols = [BowFeaturizer('Sex'),
                        BowFeaturizer('Embarked')]

imputer = Imputer(
    data_featurizers=data_featurizer_cols,
    label_encoders=label_encoder_cols,
    data_encoders=data_encoder_cols,
    output_path='imputer_model'
)

imputer.fit(train_df=df_train, num_epochs=5)

<datawig.imputer.Imputer at 0x7f2a3c5cc710>

In [14]:
imputed = imputer.predict(df_test)
imputed.head()

| PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | Cabin_imputed | Cabin_imputed_proba |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 857 | 1 | 1 | Wick, Mrs. George Dennick (Mary Hitchcock) | female | 45.0 | 1 | 1 | 36928 | 164.8667 | NaN | S | C | 0.400003 |
| 214 | 0 | 2 | Givard, Mr. Hans Kristensen | male | 30.0 | 0 | 0 | 250646 | 13.0 | NaN | S | C | 0.40072 |
| 466 | 0 | 3 | Goncalves, Mr. Manuel Estanslas | male | 38.0 | 0 | 0 | SOTON/O.Q. 3101306 | 7.05 | NaN | S | C | 0.40072 |
| 207 | 0 | 3 | Backstrom, Mr. Karl Alfred | male | 32.0 | 1 | 0 | 3101278 | 15.85 | NaN | S | C | 0.40072 |
| 666 | 0 | 2 | Hickman, Mr. Lewis | male | 32.0 | 2 | 0 | S.O.C. 14879 | 73.5 | NaN | S | C | 0.40072 |
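
The predictions arrive in separate columns (`Cabin_imputed` and `Cabin_imputed_proba`), so the original column still has to be filled in. Below is a minimal sketch of that step, using plain dictionaries as a stand-in for data frame rows. The helper `fill_from_imputed` and the `min_proba` cut-off are hypothetical; the two rows reuse values from the output above.

```python
def fill_from_imputed(rows, column, min_proba=0.0):
    """Fill missing `column` values from `<column>_imputed` when confident enough."""
    filled = []
    for row in rows:
        new_row = dict(row)  # copy so the input rows are left untouched
        if new_row[column] is None and new_row[f"{column}_imputed_proba"] >= min_proba:
            new_row[column] = new_row[f"{column}_imputed"]
        filled.append(new_row)
    return filled


rows = [
    {"PassengerId": 857, "Cabin": None, "Cabin_imputed": "C", "Cabin_imputed_proba": 0.400003},
    {"PassengerId": 214, "Cabin": None, "Cabin_imputed": "C", "Cabin_imputed_proba": 0.40072},
]
print([r["Cabin"] for r in fill_from_imputed(rows, "Cabin", min_proba=0.4005)])  # [None, 'C']
```

With probabilities around 0.4 and the same class predicted for every row shown, it is worth checking the class balance of Cabin before relying on these imputations.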
