# Deep Bayesian Networks for Active Learning

In this tutorial we will implement active learning using deep Bayesian networks. Although the MNIST dataset is used, the implementation relies on a multilayer perceptron (MLP) operating on flattened inputs, so it may be generalized to any non-imaging neural network task.

Paper: https://arxiv.org/pdf/1703.02910.pdf

In [None]:
import numpy as np
from scipy import stats
from tensorflow.keras import Input, Model, layers, losses, optimizers, datasets
import tensorflow as tf

### Data

The following code block prepares the MNIST dataset as archived by the TensorFlow / Keras library. For purposes of demonstration, the train and test cohorts are combined into a single array, and each image is flattened into a one-dimensional vector. Additionally, the original data of type `uint8` on range `[0, 255]` is scaled to the range `[0, 1]`.

In an active learning task, one goal is to ensure that the rare (out-of-distribution) examples are prioritized for training. To simulate this, we will arbitrarily limit the number of training examples for a designated minority class.

In [None]:
def load_mnist(minority_class=9, minority_class_count=1000):

    (x_train, y_train), (x_test, y_test) = datasets.mnist.load_data()

    # --- Combine train and test for demo purposes
    x = np.concatenate((x_train, x_test))
    y = np.concatenate((y_train, y_test))

    # --- Flatten and scale
    x = x.reshape((x.shape[0], -1))
    x = x / 255.

    # --- Artificially reduce examples of minority class
    i = np.nonzero(y == minority_class)[0][minority_class_count:]
    x = np.delete(x, i, axis=0)
    y = np.delete(y, i, axis=0)

    return x, y
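
To make the minority-class reduction concrete, here is a minimal sketch of the same indexing logic on a toy label array. The arrays and the count of `2` are hypothetical values chosen for illustration; they are not part of the tutorial's data.

```python
import numpy as np

# Hypothetical toy labels: class 9 appears six times
y = np.array([9, 0, 9, 1, 9, 9, 2, 9, 9])
x = np.arange(y.size).reshape(-1, 1)  # stand-in features: each row stores its own position

minority_class = 9
minority_class_count = 2  # keep only the first two examples of class 9

# Positions of every minority example beyond the first `minority_class_count`
i = np.nonzero(y == minority_class)[0][minority_class_count:]

x_reduced = np.delete(x, i, axis=0)
y_reduced = np.delete(y, i, axis=0)

print(y_reduced)          # labels that remain
print(x_reduced.ravel())  # original row positions that were kept
```

Only the first `minority_class_count` occurrences of the minority class survive; all other classes are untouched.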

# Deep Bayesian Network

Compared to a standard neural network, a deep Bayesian model learns each network parameter as a distribution of values rather than a single point estimate. To evaluate a deep Bayesian model on any given input, an integration must be performed over all possible model parameters (conceptually, an ensemble of infinitely many neural networks). Using this approach, the **uncertainty** of any given prediction may be estimated (i.e., how much variation exists from prediction to prediction).

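Because this integration is intractable in practice, a deep Bayesian model may be approximated with a Monte Carlo simulation in which random **dropout** is applied during inference: each forward pass with a freshly drawn dropout mask acts as one sample from the ensemble.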
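
The idea can be sketched without any deep learning framework. The snippet below is an illustrative NumPy toy, not the Keras model used in this tutorial: the weights `w0` and `w1`, the layer sizes, and the helper `forward_with_dropout` are all hypothetical. It runs the same input through a fixed network several times, drawing a new dropout mask each time, and looks at the spread of the resulting predictions.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

# Hypothetical fixed weights for a tiny network with one hidden layer
w0 = rng.normal(size=(4, 16))
w1 = rng.normal(size=(16, 3))

def forward_with_dropout(x, rate=0.5):
    h = np.maximum(x @ w0, 0)            # ReLU hidden layer
    mask = rng.random(h.shape) >= rate   # dropout mask, redrawn on every call
    h = h * mask / (1 - rate)            # inverted dropout scaling
    return softmax(h @ w1)

x = rng.normal(size=(1, 4))              # a single input example
repeats = 20
preds = np.stack([forward_with_dropout(x) for _ in range(repeats)])

print(preds.shape)         # (repeats, batch, classes)
print(preds.mean(axis=0))  # Monte Carlo estimate of the predictive distribution
print(preds.std(axis=0))   # spread across passes, a simple uncertainty signal
```

The `(repeats, batch, classes)` layout matches the shape assumed by the acquisition functions later in this tutorial.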

Let us start by building a simple MLP with two hidden layers. Note that in this implementation, the `training` flag is set to `None` by default to indicate the standard dropout behavior, i.e., dropout is applied during training but removed during inference.

In [None]:
def create_model(shape=(784,), training=None):

    x = Input(shape=shape)

    # --- Layer 1
    l0 = layers.Dense(256, activation='relu')(x)
    d0 = layers.Dropout(0.5)(l0, training=training)
    
    # --- Layer 2
    l1 = layers.Dense(256, activation='relu')(d0)
    d1 = layers.Dropout(0.5)(l1, training=training)
    
    # --- Logits
    l2 = layers.Dense(10)(d1)

    # --- Create model
    model = Model(inputs=x, outputs=l2)

    # --- Compile model
    model.compile(
        optimizer=optimizers.Adam(learning_rate=1e-3),
        loss=losses.SparseCategoricalCrossentropy(from_logits=True),
        metrics=['sparse_categorical_accuracy'])

    return model

By default, dropout is not applied during inference. To create a new "shadow" model identical in network architecture and weights but with dropout manually activated (i.e., `training=True`), use the following code block:

In [None]:
def create_model_dropout(model):
    """
    Method to re-create model with dropout manually activated 

    """
    model_dropout = create_model(training=True) 
    model_dropout.set_weights(model.get_weights())

    return model_dropout

During model training in this demonstration, we will alternate between two steps: training and evaluating the model with standard dropout behavior, and performing active learning (ranking candidate examples) with the dropout-activated model.

# Active Learning

Provided a large pool of unlabeled data, active learning strategies help to identify the out-of-distribution examples that may yield the greatest incremental gain in model performance. Such strategies may be valuable in supervised deep learning for healthcare as resources for annotating medical data are often scarce.

The deep Bayesian network implementation of active learning identifies out-of-distribution examples using various estimates of prediction **uncertainty**. Given a total of `n` repeated predictions (i.e., forward passes with dropout enabled), uncertainty can be estimated by measuring the degree of variation between predictions for the same input example. Formal metrics for measuring prediction variation are known as *acquisition functions* and include:

* entropy
* variation ratio
* BALD metric (Bayesian active learning by disagreement)

### Ranking strategies

To implement these various ranking strategies (acquisition functions), we will use the following Python decorator:

In [None]:
def sample_data(func):
    """
    Decorator for sampling function

      (1) Create subcohort to feed into acquisition function (for speed)
      (2) Run repeated forward passes of model (with dropout enabled)
      (3) Find top n_samples of most informative data using acquisition function
      (4) Remove top n_samples from training pool

    """
    def wrapper(model, x, y, n_samples=10, repeats=20, subcohort=2000, *args, **kwargs):

        # --- Create subcohort for speed 
        i_sub = np.random.permutation(x.shape[0])[:subcohort]
        x_sub = x[i_sub]
        y_sub = y[i_sub]

        # --- Run repeated forward passes
        if hasattr(model, 'predict'):
            m = create_model_dropout(model)
            preds = [m.predict(x_sub) for _ in range(repeats)]
            preds = np.array([tf.nn.softmax(p, axis=-1) for p in preds])

        else:
            preds = None

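        # --- Rank subcohort examples using the acquisition function
        indices = func(preds=preds, x=x_sub, y=y_sub, n_samples=n_samples, *args, **kwargs)

        # --- Map top-ranked subcohort indices back to indices into the full pool
        top_n = i_sub[indices[:n_samples]]

        x_new = x[top_n]
        y_new = y[top_n]

        # --- Remove selected examples from the remaining pool
        x = np.delete(x, top_n, axis=0)
        y = np.delete(y, top_n, axis=0)

        return x_new, y_new, x, y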

    return wrapper

Using this wrapper, we will define a number of acquisition functions (passed as the generic `func` object above). All functions will be defined using the following API:

```python
@sample_data
def func(preds, **kwargs):
    
    # --- Method to create indices ranking various predictions
    indices = ...
    
    return indices
```

The decorator function will be responsible for the remaining implementation steps, including sampling a total number of `repeats` predictions from the `model` object as well as recreating the training cohort based on the top ranked examples.
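
One subtle point in the decorator is that the acquisition function returns indices relative to the random subcohort, which must be mapped back to positions in the full pool before examples are removed. The following is a minimal sketch of that bookkeeping in isolation. The toy pool and the reversed "ranking" are hypothetical stand-ins for real data and a real acquisition function.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical pool of 10 examples; each feature value equals its pool position
x = np.arange(10).reshape(-1, 1)
y = np.arange(10) % 2

# Random subcohort of 5 pool positions
i_sub = rng.permutation(x.shape[0])[:5]

# Stand-in ranking: pretend the acquisition function ranked the subcohort
# from last to first (these indices are relative to the subcohort)
indices = np.arange(i_sub.size)[::-1]

# Map subcohort-relative ranks back to positions in the full pool
n_samples = 2
top_n = i_sub[indices[:n_samples]]

# Move the selected examples out of the pool
x_new, y_new = x[top_n], y[top_n]
x_rest = np.delete(x, top_n, axis=0)
y_rest = np.delete(y, top_n, axis=0)

print(top_n, x_new.ravel())  # selected pool positions and their feature values agree
print(x_rest.shape)          # pool shrinks by n_samples
```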

### Entropy

Many of the acquisition functions rely on a calculation of the entropy associated with a probability distribution.

![Entropy](https://i0.wp.com/jeanvitor.com/wp-content/uploads/2017/11/entropyequation.png)

The following function will be used to calculate entropy generically for various ranking strategies:

In [51]:
def calculate_entropy(probs, epsilon=1e-10):

    return -np.sum(probs * np.log(probs + epsilon), axis=-1)
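
As a quick sanity check, the sketch below evaluates the same entropy definition on two hypothetical extremes: a uniform distribution over 10 classes and a fully confident one-hot distribution. The function is repeated here only so that the snippet runs on its own.

```python
import numpy as np

def calculate_entropy(probs, epsilon=1e-10):
    # Same definition as above, repeated so this sketch is self-contained
    return -np.sum(probs * np.log(probs + epsilon), axis=-1)

uniform = np.full(10, 0.1)  # maximally uncertain over 10 classes
one_hot = np.eye(10)[3]     # fully confident in class 3

print(calculate_entropy(uniform))  # close to ln(10), about 2.3026
print(calculate_entropy(one_hot))  # close to 0
```

The `epsilon` term keeps `np.log` finite when a probability is exactly zero, at the cost of a negligible offset in the result.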

### Initial sampling

To initialize training for all experiments, a baseline distribution of data will be sampled in a balanced manner across all label classes. 

In [None]:
@sample_data
def rank_initial(y, n_samples, **kwargs):
    """
    Method to create initial balanced uniform sampling from each class 

    """
    indices = []

    uniques = np.unique(y)
    n = int(n_samples / uniques.size)

    for value in uniques:
        i = np.nonzero(y == value)[0]
        indices.append(i[np.random.permutation(i.size)[:n]])

    # --- Concatenate (rather than stack) so that unequal per-class counts are handled
    return np.concatenate(indices)
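
The balanced draw can be checked on synthetic labels. This is a minimal sketch assuming a hypothetical pool of 2000 random labels over 10 classes; it mirrors the per-class sampling logic without the decorator.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical label pool: 2000 labels drawn uniformly from 10 classes
y = rng.integers(0, 10, size=2000)

n_samples = 20
uniques = np.unique(y)
n = n_samples // uniques.size  # examples to draw per class

# Draw n random positions from each class, then join them into one index array
indices = np.concatenate(
    [rng.permutation(np.nonzero(y == value)[0])[:n] for value in uniques])

print(np.bincount(y[indices], minlength=10))  # number of selected examples per class
```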

### Random sampling

As a baseline strategy for learning, data will be added incrementally to the training cohort through random sampling.

In [None]:
@sample_data
def rank_random(x, **kwargs):
    """
    Method to implement random sampling strategy

    """
    return np.random.permutation(x.shape[0])

### Entropy-based sampling

The following code block implements a sampling strategy based on the overall entropy of the model predictions after first aggregating all predictions by taking the mean across all repeated samples. Note that compared to the *BALD* metric strategy below, this strategy evaluates the *aggregate* entropy of all repeated samples and does not take into consideration the degree of potential variation that may exist between individual samples.

In [None]:
@sample_data
def rank_entropy(preds, **kwargs):
    """
    Method to implement acquisition ranking via entropy

    :params

      (np.ndarray) preds : prediction array of shape (repeats, batch, classes)

    """
    # --- Find mean prob distribution across all repeated samples
    preds = np.mean(preds, axis=0)

    # --- Calculate entropy across mean prob distribution
    entropy = calculate_entropy(preds)

    return np.argsort(-entropy)
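
To see what this ranking does, consider a small hand-built prediction array. The values below are hypothetical and chosen so that the second example has the most ambiguous mean prediction.

```python
import numpy as np

def calculate_entropy(probs, epsilon=1e-10):
    return -np.sum(probs * np.log(probs + epsilon), axis=-1)

# Hypothetical predictions of shape (repeats=2, batch=3, classes=2)
preds = np.array([
    [[0.9, 0.1], [0.6, 0.4], [1.0, 0.0]],
    [[0.9, 0.1], [0.4, 0.6], [1.0, 0.0]],
])

mean_probs = preds.mean(axis=0)          # shape (batch, classes)
entropy = calculate_entropy(mean_probs)  # one value per example
ranking = np.argsort(-entropy)           # most uncertain example first

print(mean_probs)
print(ranking)
```

The example whose mean prediction is closest to 50/50 is ranked first, and the fully confident example is ranked last.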

### Variation ratio sampling

In this strategy, the *count* of the `mode` (i.e., the most common predicted class) across all repeated predictions for a given example is used as an approximate estimate of the degree of prediction variance. The variation ratio is defined as `1 - count / repeats`. For a total of `repeats` predictions, if all predictions are identical, then the *count* of the prediction `mode` equals `repeats` and the variation ratio is `0`. At the other extreme, if every prediction is different (possible only when `repeats` does not exceed the number of classes), then the *count* of the prediction `mode` is `1` and the variation ratio is maximal.

In [None]:
@sample_data
def rank_var_ratio(preds, **kwargs):
    """
    Method to implement acquisition ranking via variation ratio

    :params

      (np.ndarray) preds : prediction array of shape (repeats, batch, classes)

    """
    # --- Convert probs to label classes
    preds = np.argmax(preds, axis=-1)

    # --- Determine how often predictions agree
    mode, count = stats.mode(preds, axis=0)

    var = 1 - count.ravel() / preds.shape[0]

    return np.argsort(-var)
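
A worked example for a single input may help. The sketch below uses `np.bincount` instead of `stats.mode` purely to keep the snippet dependency-free; the predicted labels are hypothetical.

```python
import numpy as np

def variation_ratio(labels, n_classes=10):
    # labels: predicted classes for ONE example across repeated passes
    counts = np.bincount(labels, minlength=n_classes)  # votes per class
    return 1 - counts.max() / labels.size              # 1 - (mode count / repeats)

disagree = np.array([3, 3, 3, 7, 1])  # mode is class 3 with 3 of 5 votes
agree = np.array([3, 3, 3, 3, 3])     # unanimous

print(variation_ratio(disagree))  # 1 - 3/5
print(variation_ratio(agree))     # 1 - 5/5
```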

### BALD sampling

The Bayesian active learning by disagreement (BALD) strategy is similar to the entropy-based ranking above. However, instead of using only the entropy of the mean of all `repeats` predictions, the BALD strategy subtracts from it the mean entropy of the *individual* predictions. The difference is an estimate of the mutual information between the prediction and the model parameters: it is large when individual forward passes are each confident but disagree with one another, and near zero when every pass yields the same distribution, however uncertain. In this way, the degree of variation between repeated predictions is assessed instead of the baseline degree of entropy itself.

In [None]:
@sample_data
def rank_bald(preds, **kwargs):
    """
    Method to implement acquisition ranking via BALD 

    :params

      (np.ndarray) preds : prediction array of shape (repeats, batch, classes)

    """
    # --- Calculate entropy across each repeated prob distribution
    entropy_all = np.mean(calculate_entropy(preds), axis=0)

    # --- Calculate entropy across the mean of repeated prob distributions
    preds = np.mean(preds, axis=0)
    entropy_rpt = calculate_entropy(preds)

    bald = entropy_rpt - entropy_all

    return np.argsort(-bald)
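
The difference between entropy and BALD is easiest to see on two hypothetical examples that have the same mean prediction. In the sketch below, `bald_score` is an illustrative helper that repeats the calculation above without the decorator.

```python
import numpy as np

def calculate_entropy(probs, epsilon=1e-10):
    return -np.sum(probs * np.log(probs + epsilon), axis=-1)

def bald_score(preds):
    # preds has shape (repeats, batch, classes)
    mean_of_entropies = np.mean(calculate_entropy(preds), axis=0)
    entropy_of_mean = calculate_entropy(np.mean(preds, axis=0))
    return entropy_of_mean - mean_of_entropies

# Example 0: every pass agrees on an ambiguous 50/50 prediction
# Example 1: passes are individually confident but disagree with each other
preds = np.array([
    [[0.5, 0.5], [1.0, 0.0]],
    [[0.5, 0.5], [0.0, 1.0]],
])

print(calculate_entropy(preds.mean(axis=0)))  # both close to ln(2)
print(bald_score(preds))                      # near 0 for example 0, near ln(2) for example 1
```

Entropy alone cannot separate these two cases, whereas BALD assigns a high score only to the example on which the stochastic passes disagree.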

# Training

The following code blocks are used to generically run an experiment using a specified acquisition function.

### Model evaluation

To evaluate model performance, we will use the following function to evaluate accuracy on the remaining data yet to be used for algorithm training.

In [None]:
def test_model(model, x, y):
    """
    Method to test model on remaining data

    """
    y_ = model.predict(x)
    y_ = np.argmax(y_, axis=-1)

    return (y_ == y).sum() / y.size

The following code block reports accuracy on both the entire remaining pool and the minority class alone.

In [None]:
def print_performance(model, x, y, size, minority_class=9, **kwargs):

    acc_all = test_model(model, x, y)
    acc_min = test_model(model, x[y == minority_class], y[y == minority_class])

    print('Model performance ({} samples): {:0.5f} (all) | {:0.5f} (minority)'.format(
        str(size).rjust(4), acc_all, acc_min))

### Training

To train each model, we will specify:

* `acquisition_func`: the specific ranking strategy above
* `rounds`: total number of training rounds
* `n_samples`: total number of training examples to add each round

In [None]:
def train(model, acquisition_func, rounds=10, n_samples=100, batch_size=128, epochs=50, **kwargs):

    # --- Create initial uniform sampling of 20 training examples
    x, y = load_mnist()
    x_train, y_train, x, y = rank_initial(model=model, x=x, y=y, n_samples=20, **kwargs)

    # --- Create initial fit
    model.fit(x=x_train, y=y_train, batch_size=batch_size, epochs=epochs, verbose=False)
    print_performance(model, x, y, y_train.size, **kwargs)

    for n in range(rounds):

        # --- Get additional data
        x_new, y_new, x, y = acquisition_func(model=model, x=x, y=y, n_samples=n_samples, **kwargs)
        x_train = np.concatenate((x_train, x_new))
        y_train = np.concatenate((y_train, y_new))

        # --- Train 
        model.fit(x=x_train, y=y_train, batch_size=batch_size, epochs=epochs, verbose=False)
        print_performance(model, x, y, y_train.size, **kwargs)

    return model

### Experiments

Use the following cells to run each individual experiment:

In [None]:
# --- Train using random sampling (baseline)
model = create_model()
train(model=model, acquisition_func=rank_random)

In [None]:
# --- Train using entropy
model = create_model()
train(model=model, acquisition_func=rank_entropy)

In [None]:
# --- Train using variation ratio
model = create_model()
train(model=model, acquisition_func=rank_var_ratio)

In [None]:
# --- Train using BALD
model = create_model()
train(model=model, acquisition_func=rank_bald)