# MIDASpy demonstration

This notebook provides a brief demonstration of **MIDASpy**'s core functionalities. We show how to use the package to impute missing values in the [Adult census dataset](https://github.com/MIDASverse/MIDASpy/blob/master/Examples/adult_data.csv) (which is commonly used for benchmarking machine learning tasks).

Users of **MIDASpy** must have **TensorFlow** installed as a **pip** package in their Python environment. **MIDASpy** is compatible with **TensorFlow** 1.X and with **TensorFlow** 2.2 or later.


Once these packages are installed, users can import the dependencies and load the data:

In [1]:
from sklearn.preprocessing import MinMaxScaler
import numpy as np
import pandas as pd
import tensorflow as tf
import MIDASpy as md

data_0 = pd.read_csv('adult_data.csv')
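data_0.columns = data_0.columns.str.strip()  # assign the result; str.strip() alone leaves the DataFrame unchanged
data_0.columns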

Index(['Unnamed: 0', 'age', 'workclass', 'fnlwgt', 'education',
       'education_num', 'marital_status', 'occupation', 'relationship', 'race',
       'sex', 'capital_gain', 'capital_loss', 'hours_per_week',
       'native_country', 'class_labels'],
      dtype='object')

As the Adult dataset has very little missingness, we randomly set 5,000 observed values as missing in each column:

In [3]:
np.random.seed(441)

def spike_in_generation(data):
    spike_in = pd.DataFrame(np.zeros_like(data), columns= data.columns)
    for column in data.columns:
        subset = np.random.choice(data[column].index[data[column].notnull()], 5000, replace= False)
        spike_in.loc[subset, column] = 1
    return spike_in

spike_in = spike_in_generation(data_0)
original_value = data_0.loc[4, 'hours_per_week']
data_0[spike_in == 1] = np.nan
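
As a quick sanity check on the spike-in logic, the sketch below applies the same idea to a small toy DataFrame and confirms that each column ends up with the requested number of missing values. The toy data, the helper name `spike_in_mask`, and the use of a boolean mask with `DataFrame.mask` are illustrative assumptions rather than part of **MIDASpy**:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(441)

# Toy data standing in for the Adult dataset (hypothetical values).
toy = pd.DataFrame({
    "age": [39, 50, 38, 53, 28, 37],
    "sex": ["Male", "Male", "Male", "Male", "Female", "Female"],
})

def spike_in_mask(data, n_missing, rng):
    """Return a boolean mask marking n_missing observed cells per column."""
    mask = pd.DataFrame(False, index=data.index, columns=data.columns)
    for column in data.columns:
        observed = data.index[data[column].notnull()]
        chosen = rng.choice(observed, n_missing, replace=False)
        mask.loc[chosen, column] = True
    return mask

mask = spike_in_mask(toy, n_missing=2, rng=rng)
toy_spiked = toy.mask(mask)  # cells where the mask is True become NaN

# Each column should now contain exactly two missing values.
print(toy_spiked.isnull().sum())
```

Sampling only from observed cells, as in the cell above, means values that were already missing are never chosen, so each column gains exactly the requested number of new missing values.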

Next, we collect the names of the categorical variables in a list and one-hot encode them using **MIDASpy**'s inbuilt preprocessing function `cat_conv`, which returns both the encoded data and a nested list of categorical column names that we can pass to the imputation algorithm. To construct the final, pre-processed data, we append the one-hot encoded categorical data to the non-categorical data and replace null values with `np.nan`:

In [4]:
categorical = ['workclass','marital_status','relationship','race','class_labels','sex','education','occupation','native_country']
data_cat, cat_cols_list = md.cat_conv(data_0[categorical])

data_0.drop(categorical, axis = 1, inplace = True)
constructor_list = [data_0]
constructor_list.append(data_cat)
data_in = pd.concat(constructor_list, axis=1)

na_loc = data_in.isnull()
data_in[na_loc] = np.nan
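
To make the shape of this pre-processing step concrete, here is a minimal sketch of one-hot encoding that keeps missing values as `NaN` and records one group of column names per original variable. It uses `pandas.get_dummies` on hypothetical toy data and is only an approximation of the idea; it is not **MIDASpy**'s implementation of `cat_conv`:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with missing categorical values.
toy = pd.DataFrame({
    "workclass": ["Private", "State-gov", np.nan, "Private"],
    "sex": ["Male", np.nan, "Female", "Female"],
})

encoded_parts = []
column_groups = []  # one list of dummy-column names per original variable
for name in toy.columns:
    dummies = pd.get_dummies(toy[name], prefix=name, dtype=float)
    # get_dummies encodes a missing category as all zeros; restore NaN instead
    # so that missingness survives the encoding.
    dummies.loc[toy[name].isnull(), :] = np.nan
    encoded_parts.append(dummies)
    column_groups.append(list(dummies.columns))

encoded = pd.concat(encoded_parts, axis=1)
print(column_groups)
print(encoded)
```

Here `column_groups` plays the role of the nested list described above: each inner list names the one-hot columns that belong to a single categorical variable.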

We can inspect the first few rows of the pre-processed data:


In [5]:
print(data_in.head())

   Unnamed: 0   age    fnlwgt  education_num  capital_gain  capital_loss  \
0         0.0  39.0   77516.0           13.0        2174.0           0.0   
1         1.0  50.0   83311.0           13.0           0.0           0.0   
2         2.0  38.0  215646.0            9.0           0.0           NaN   
3         3.0  53.0  234721.0            NaN           NaN           0.0   
4         NaN  28.0       NaN           13.0           0.0           NaN   

   hours_per_week  workclass_Federal-gov  workclass_Local-gov  \
0            40.0                    0.0                  0.0   
1            13.0                    0.0                  0.0   
2            40.0                    NaN                  NaN   
3            40.0                    0.0                  0.0   
4             NaN                    0.0                  0.0   

   workclass_Never-worked  ...  native_country_Portugal  \
0                     0.0  ...                      0.0   
1                     0.0  ...    

The data are now ready to be fed into the imputation algorithm, which involves three steps. First, we specify the dimensions, input corruption proportion, and other hyperparameters of the MIDAS neural network. Second, we build a MIDAS model based on the data. The nested list of one-hot-encoded column names (`cat_cols_list`) should be passed to the `softmax_columns` argument, as MIDAS employs a softmax final-layer activation function for categorical variables. Third, we train the model on the data, setting the number of training epochs to 20 in this example:

In [6]:
imputer = md.Midas(layer_structure = [256,256], vae_layer = False, seed = 89, input_drop = 0.75)
imputer.build_model(data_in, softmax_columns = cat_cols_list)
imputer.train_model(training_epochs = 20)

Size index: [7, 8, 7, 6, 5, 2, 2, 16, 14, 41]

Computation graph constructed

Model initialised

Epoch: 0 , loss: 117256.53483883518
Epoch: 1 , loss: 86191.57900557012
Epoch: 2 , loss: 82235.14160604215
Epoch: 3 , loss: 80032.84553225857
Epoch: 4 , loss: 77202.56478397874
Epoch: 5 , loss: 72421.6698374017
Epoch: 6 , loss: 68203.67437592153
Epoch: 7 , loss: 67226.95568095717
Epoch: 8 , loss: 66277.40566796619
Epoch: 9 , loss: 65670.3376109672
Epoch: 10 , loss: 65726.79206258191
Epoch: 11 , loss: 66133.58962712719
Epoch: 12 , loss: 65309.62404327592
Epoch: 13 , loss: 65309.563779258475
Epoch: 14 , loss: 64963.398904342954
Epoch: 15 , loss: 65626.56983061825
Epoch: 16 , loss: 64806.52387842501
Epoch: 17 , loss: 65535.00585041571
Epoch: 18 , loss: 65120.4375428766
Epoch: 19 , loss: 65258.038586377785
Training complete. Saving file...
Model saved in file: tmp/MIDAS


<MIDASpy.midas_base.Midas at 0x7fdc29165700>
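
To illustrate why the one-hot columns are passed as groups, the sketch below applies a softmax separately to each block of columns, so that the outputs for every categorical variable form a probability distribution. This is a plain NumPy illustration with made-up logits and a hypothetical helper name (`grouped_softmax`); it is not taken from **MIDASpy**'s TensorFlow code:

```python
import numpy as np

def grouped_softmax(logits, group_sizes):
    """Apply a softmax separately to each consecutive block of columns."""
    out = np.empty_like(logits, dtype=float)
    start = 0
    for size in group_sizes:
        block = logits[:, start:start + size]
        # Subtract the row-wise maximum for numerical stability.
        shifted = block - block.max(axis=1, keepdims=True)
        exp = np.exp(shifted)
        out[:, start:start + size] = exp / exp.sum(axis=1, keepdims=True)
        start += size
    return out

# Two rows of made-up network outputs: a 3-level and a 2-level variable.
logits = np.array([[2.0, 0.5, -1.0, 0.0, 3.0],
                   [0.0, 0.0, 0.0, 1.0, 1.0]])
probs = grouped_softmax(logits, group_sizes=[3, 2])
print(probs.round(3))
```

Within each block the values are non-negative and sum to one, which is what allows them to be read as predicted category probabilities.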

Once training is complete, we can generate any number of imputed datasets (M) using the `generate_samples` function (here we set M as 10). Users can then either write these imputations to separate .CSV files or work with them directly in Python:

In [7]:
imputations = imputer.generate_samples(m=10).output_list 

# n = 1
# for i in imputations:
#    file_out = "midas_imp_" + str(n) + ".csv"
#    i.to_csv(file_out, index=False)
#    n += 1

INFO:tensorflow:Restoring parameters from tmp/MIDAS
Model restored.
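
When working with the imputations directly in Python, one simple diagnostic is to collect the values imputed for a single spiked-in cell across all M datasets and compare them with the value that was originally observed. The sketch below assumes only that the imputations are held in a list of DataFrames; the three small DataFrames and their values are hypothetical stand-ins, not output from the model above:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for three completed datasets (M = 3).
toy_imputations = [
    pd.DataFrame({"hours_per_week": [40.0, 13.0, 38.2]}),
    pd.DataFrame({"hours_per_week": [40.0, 13.0, 41.5]}),
    pd.DataFrame({"hours_per_week": [40.0, 13.0, 40.3]}),
]

row, column = 2, "hours_per_week"  # the cell whose value was set to missing

# One imputed draw per completed dataset.
draws = np.array([df.loc[row, column] for df in toy_imputations])

print("imputed draws:", draws)
print("mean:", draws.mean(), "sd:", draws.std(ddof=1))
```

The spread of the draws gives a rough sense of how uncertain the imputation for that cell is, and the mean can be set against the held-out original value.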


Finally, using the list of generated imputations, we can estimate M separate regression models and combine the parameter and variance estimates (see Rubin 1987) using **MIDASpy**'s `combine` function:

In [36]:
model = md.combine(y_var = "capital_gain", 
                   X_vars = ["education_num","age"],
                   df_list = imputations)

model

|   | term | estimate | std.error | statistic | df | p.value |
|---|---|---|---|---|---|---|
| 0 | const | -98.10391 | 100.060298 | -0.980448 | 162.300742 | 0.328324 |
| 1 | education_num | 18.190052 | 3.836454 | 4.741371 | 403.771392 | 3e-06 |
| 2 | age | 20.059125 | 2.099022 | 9.556413 | 151.30342 | 0.0 |
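
For readers unfamiliar with the combining step, the following is a generic sketch of Rubin's rules for a single coefficient: the pooled estimate is the mean of the M estimates, and the total variance adds the average within-imputation variance to the between-imputation variance inflated by (1 + 1/M). The function name `rubin_combine` and the input numbers are hypothetical, and this is not a reproduction of **MIDASpy**'s `combine` code:

```python
import numpy as np

def rubin_combine(estimates, variances):
    """Pool M point estimates and their squared standard errors (Rubin 1987).

    Assumes the estimates are not all identical, so that the
    between-imputation variance is non-zero.
    """
    estimates = np.asarray(estimates, dtype=float)
    variances = np.asarray(variances, dtype=float)
    m = len(estimates)
    pooled = estimates.mean()          # pooled point estimate
    within = variances.mean()          # average within-imputation variance
    between = estimates.var(ddof=1)    # between-imputation variance
    total = within + (1 + 1 / m) * between
    # Rubin's (1987) degrees of freedom for the reference t distribution.
    df = (m - 1) * (1 + within / ((1 + 1 / m) * between)) ** 2
    return pooled, np.sqrt(total), df

# Hypothetical coefficient estimates and squared standard errors from M = 5 models.
est, se, df = rubin_combine(
    estimates=[18.0, 19.0, 17.0, 18.5, 18.5],
    variances=[9.0, 10.0, 9.5, 10.5, 11.0],
)
print(est, se, df)
```

The pooled standard error is never smaller than the square root of the average within-imputation variance, reflecting the extra uncertainty that comes from the values being imputed rather than observed.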
