# VIME Tutorial

### VIME: Extending the Success of Self- and Semi-supervised Learning to Tabular Domain

- Paper: Jinsung Yoon, Yao Zhang, James Jordon, Mihaela van der Schaar, 
  "VIME: Extending the Success of Self- and Semi-supervised Learning to Tabular Domain," 
  Neural Information Processing Systems (NeurIPS), 2020.

- Paper link: TBD

- Last updated: October 11th, 2020

- Code author: Jinsung Yoon (jsyoon0823@gmail.com)

This notebook is a user guide to self- and semi-supervised learning on tabular data with VIME, using the MNIST dataset as an example.

### Prerequisite
Clone https://github.com/jsyoon0823/VIME.git to the current directory.

### Import necessary packages and functions

- data_loader: MNIST dataset loading and preprocessing
- supervised_models: supervised learning models (Logistic regression, XGBoost, and Multi-layer Perceptron)

- vime_self: Self-supervised learning part of VIME framework
- vime_semi: Semi-supervised learning part of VIME framework
- vime_utils: Some utility functions for VIME framework

In [1]:
import numpy as np
import os
import warnings
warnings.filterwarnings("ignore")
  
from data_loader import load_mnist_data
from supervised_models import logit, xgb_model, mlp

from vime_self import vime_self
from vime_semi import vime_semi
from vime_utils import perf_metric

### Set the parameters and define output

- label_no: number of labeled samples to use
- model_sets: set of supervised models (mlp, logit, or xgboost)
- p_m: corruption probability for self-supervised learning
- alpha: hyper-parameter that weights the feature loss against the mask loss
- K: number of augmented samples
- beta: hyper-parameter that weights the unsupervised loss against the supervised loss
- label_data_rate: fraction of the training data treated as labeled
- metric: prediction performance metric (either acc or auc)
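
To make the role of p_m more concrete, the following is a minimal NumPy sketch of the kind of corruption used in self-supervised learning: each entry is masked with probability p_m, and masked entries are replaced by values drawn from the same column. The function names make_mask and corrupt and the demo array are hypothetical, chosen for illustration; the helpers inside vime_utils may differ in detail.

```python
import numpy as np

def make_mask(p_m, x, rng):
    """Draw a binary mask; each entry is 1 with probability p_m (hypothetical helper)."""
    return rng.binomial(1, p_m, size=x.shape)

def corrupt(mask, x, rng):
    """Replace masked entries with values sampled from the same column (hypothetical helper)."""
    n_rows, n_cols = x.shape
    x_bar = np.empty_like(x)
    for j in range(n_cols):
        # Shuffle each column independently to sample from its empirical marginal.
        x_bar[:, j] = x[rng.permutation(n_rows), j]
    x_tilde = x * (1 - mask) + x_bar * mask
    # A shuffled value can coincide with the original, so recompute the effective mask.
    effective_mask = (x != x_tilde).astype(int)
    return effective_mask, x_tilde

rng = np.random.default_rng(0)
x_demo = rng.random((6, 4))            # toy stand-in for a batch of tabular features
mask_demo = make_mask(0.3, x_demo, rng)
m_eff, x_tilde = corrupt(mask_demo, x_demo, rng)
```

A larger p_m corrupts more entries per sample, which makes the pretext task harder.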

In [2]:
# Experimental parameters
label_no = 1000  
model_sets = ['logit','xgboost','mlp']
  
# Hyper-parameters
p_m = 0.3
alpha = 2.0
K = 3
beta = 1.0
label_data_rate = 0.1

# Metric
metric = 'acc'
  
# Define output
results = np.zeros([len(model_sets)+2])  

### Load data

Load the original MNIST dataset and preprocess it.
- Use only the first label_no labeled samples for supervised training.
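
The internals of load_mnist_data are not shown in this notebook. As a rough sketch of how a split controlled by label_data_rate could work, the example below shuffles the training set and keeps a fraction label_data_rate as labeled data, treating the rest as unlabeled. The function split_labeled_unlabeled and the synthetic arrays are assumptions made for illustration and are not the repository's loader.

```python
import numpy as np

def split_labeled_unlabeled(x, y, label_data_rate, rng):
    """Hypothetical split: a fraction label_data_rate keeps its labels, the rest is unlabeled."""
    n = x.shape[0]
    idx = rng.permutation(n)
    n_label = int(round(n * label_data_rate))
    label_idx, unlab_idx = idx[:n_label], idx[n_label:]
    # Labels of the unlabeled part are simply discarded.
    return x[label_idx], y[label_idx], x[unlab_idx]

rng = np.random.default_rng(0)
x_all = rng.random((200, 5))                    # synthetic features
y_all = np.eye(3)[rng.integers(0, 3, 200)]      # synthetic one-hot labels
x_lab, y_lab, x_unl = split_labeled_unlabeled(x_all, y_all, 0.1, rng)

# Further restrict the labeled set, as the notebook does with label_no.
label_no_demo = 10
x_lab, y_lab = x_lab[:label_no_demo, :], y_lab[:label_no_demo, :]
```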

In [3]:
# Load data
x_train, y_train, x_unlab, x_test, y_test = load_mnist_data(label_data_rate)
    
# Use subset of labeled data
x_train = x_train[:label_no, :]
y_train = y_train[:label_no, :]  

### Train supervised models

- Train 3 supervised learning models (Logistic regression, XGBoost, MLP)
- Save the performance of each supervised model.
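
For reference, the acc metric can be thought of as follows. This is a minimal sketch assuming one-hot encoded labels and per-class scores as predictions; accuracy_from_one_hot is a hypothetical name, and perf_metric in vime_utils may be implemented differently.

```python
import numpy as np

def accuracy_from_one_hot(y_true, y_prob):
    """Fraction of rows where the highest-scoring class matches the one-hot label."""
    return float(np.mean(np.argmax(y_true, axis=1) == np.argmax(y_prob, axis=1)))

y_true_demo = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1], [0, 1, 0]])
y_prob_demo = np.array([[0.7, 0.2, 0.1],
                        [0.1, 0.8, 0.1],
                        [0.3, 0.4, 0.3],   # predicted class 1, true class 2
                        [0.2, 0.5, 0.3]])
acc_demo = accuracy_from_one_hot(y_true_demo, y_prob_demo)  # 3 of 4 rows are correct
```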

In [4]:
# Logistic regression
y_test_hat = logit(x_train, y_train, x_test)
results[0] = perf_metric(metric, y_test, y_test_hat) 

# XGBoost
y_test_hat = xgb_model(x_train, y_train, x_test)    
results[1] = perf_metric(metric, y_test, y_test_hat)   

# MLP
mlp_parameters = dict()
mlp_parameters['hidden_dim'] = 100
mlp_parameters['epochs'] = 100
mlp_parameters['activation'] = 'relu'
mlp_parameters['batch_size'] = 100
      
y_test_hat = mlp(x_train, y_train, x_test, mlp_parameters)
results[2] = perf_metric(metric, y_test, y_test_hat)

# Report performance
for m_it in range(len(model_sets)):  
    
  model_name = model_sets[m_it]  
    
  print('Supervised Performance, Model Name: ' + model_name + 
        ', Performance: ' + str(results[m_it]))

Restoring model weights from the end of the best epoch.
Epoch 00072: early stopping
Supervised Performance, Model Name: logit, Performance: 0.8738
Supervised Performance, Model Name: xgboost, Performance: 0.8826
Supervised Performance, Model Name: mlp, Performance: 0.8994


### Train & Test VIME-Self
Train only the self-supervised part of the VIME framework.
- Check the performance of the self-supervised part on its own.
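
As a sketch of how alpha enters self-supervised training, the objective can be written as a mask-estimation loss plus alpha times a feature-reconstruction loss. The example below assumes binary cross-entropy for the mask and mean squared error for the features; the function name vime_self_loss_sketch and the toy arrays are illustrative and are not taken from vime_self.

```python
import numpy as np

def vime_self_loss_sketch(mask_true, mask_pred, x_true, x_pred, alpha, eps=1e-7):
    """Mask loss (binary cross-entropy) + alpha * feature loss (MSE); an assumed form."""
    mask_pred = np.clip(mask_pred, eps, 1 - eps)   # avoid log(0)
    mask_loss = -np.mean(mask_true * np.log(mask_pred)
                         + (1 - mask_true) * np.log(1 - mask_pred))
    feature_loss = np.mean((x_true - x_pred) ** 2)
    return mask_loss + alpha * feature_loss

mask_true_demo = np.array([[1.0, 0.0]])
mask_pred_demo = np.array([[0.5, 0.5]])   # uninformative mask estimate
x_true_demo = np.array([[1.0, 0.0]])
x_pred_demo = np.array([[0.0, 0.0]])      # one feature reconstructed wrongly
loss_demo = vime_self_loss_sketch(mask_true_demo, mask_pred_demo,
                                  x_true_demo, x_pred_demo, alpha=2.0)
```

With alpha = 2.0, reconstruction errors count twice as much as mask-estimation errors in this sketch.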

In [5]:
# Train VIME-Self
vime_self_parameters = dict()
vime_self_parameters['batch_size'] = 128
vime_self_parameters['epochs'] = 50
vime_self_encoder = vime_self(x_unlab, p_m, alpha, vime_self_parameters)
  
# Save encoder
os.makedirs('save_model', exist_ok=True)

file_name = './save_model/encoder_model.h5'
  
vime_self_encoder.save(file_name)  
        
# Test VIME-Self
x_train_hat = vime_self_encoder.predict(x_train)
x_test_hat = vime_self_encoder.predict(x_test)
      
y_test_hat = mlp(x_train_hat, y_train, x_test_hat, mlp_parameters)
results[3] = perf_metric(metric, y_test, y_test_hat)
    
print('VIME-Self Performance: ' + str(results[3]))

Train on 54000 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10




Restoring model weights from the end of the best epoch.
Epoch 00063: early stopping
VIME-Self Performance: 0.9097


### Train & Test VIME

Train the semi-supervised part of the VIME framework on top of the trained self-supervised encoder.
- Check the performance of the entire VIME framework.
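
To illustrate how K and beta interact, the sketch below combines a supervised cross-entropy loss on labeled samples with beta times a consistency term on unlabeled samples. It assumes the consistency term is the variance of the predictor's outputs across the K corrupted copies of each unlabeled sample; the function name vime_semi_loss_sketch and the toy inputs are hypothetical, and vime_semi may differ in detail.

```python
import numpy as np

def vime_semi_loss_sketch(y_true, y_logit, yu_logits, beta):
    """Supervised cross-entropy + beta * consistency across K augmentations (assumed form).

    y_true:    (batch, classes) one-hot labels
    y_logit:   (batch, classes) predictor outputs for labeled samples
    yu_logits: (K, batch_u, classes) outputs for K corrupted copies of unlabeled samples
    """
    # Numerically stable log-softmax.
    shifted = y_logit - y_logit.max(axis=1, keepdims=True)
    log_prob = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    supervised = -np.mean(np.sum(y_true * log_prob, axis=1))
    # Outputs should agree across the K corrupted copies of the same sample.
    unsupervised = np.mean(np.var(yu_logits, axis=0))
    return supervised + beta * unsupervised

y_true_demo = np.array([[1.0, 0.0]])
y_logit_demo = np.zeros((1, 2))                  # uniform prediction over 2 classes
consistent = np.ones((3, 4, 2))                  # K = 3 identical outputs
inconsistent = np.random.default_rng(0).normal(size=(3, 4, 2))

loss_consistent = vime_semi_loss_sketch(y_true_demo, y_logit_demo, consistent, beta=1.0)
loss_inconsistent = vime_semi_loss_sketch(y_true_demo, y_logit_demo, inconsistent, beta=1.0)
```

When the K outputs agree, only the supervised term remains; disagreement across augmentations raises the loss in proportion to beta.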

In [6]:
# Train VIME-Semi
vime_semi_parameters = dict()
vime_semi_parameters['hidden_dim'] = 100
vime_semi_parameters['batch_size'] = 128
vime_semi_parameters['iterations'] = 1000
y_test_hat = vime_semi(x_train, y_train, x_unlab, x_test, 
                       vime_semi_parameters, p_m, K, beta, file_name)

# Test VIME
results[4] = perf_metric(metric, y_test, y_test_hat)
  
print('VIME Performance: '+ str(results[4]))




Start training
Iteration: 0/1000, Current loss: 2.2177




Iteration: 100/1000, Current loss: 0.3376
Iteration: 200/1000, Current loss: 0.2806
Iteration: 300/1000, Current loss: 0.2717

INFO:tensorflow:Restoring parameters from ./save_model/class_model.ckpt
VIME Performance: 0.9212




### Report Prediction Performances

- 3 Supervised learning models
- VIME with self-supervised part only
- Entire VIME framework

In [7]:
for m_it in range(len(model_sets)):  
    
  model_name = model_sets[m_it]  
    
  print('Supervised Performance, Model Name: ' + model_name + 
        ', Performance: ' + str(results[m_it]))
    
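# Index the VIME results explicitly instead of relying on the leftover loop variable
print('VIME-Self Performance: ' + str(results[len(model_sets)]))
  
print('VIME Performance: ' + str(results[len(model_sets)+1]))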

Supervised Performance, Model Name: logit, Performance: 0.8738
Supervised Performance, Model Name: xgboost, Performance: 0.8826
Supervised Performance, Model Name: mlp, Performance: 0.8994
VIME-Self Performance: 0.9097
VIME Performance: 0.9212
