### Stream-based scenario - _MNIST_ example

This notebook shows how to implement the stream-based scenario of Active Learning (AL). We use a subset of the [_MNIST_](https://archive.ics.uci.edu/dataset/683/mnist+database+of+handwritten+digits) dataset, loaded through _torchvision_.
Stream-based AL can work in two settings:
- In the batch setting, we store the points as they arrive from the stream until a batch is complete; we then treat this batch as the pool of the pool-based scenario and apply the standard query strategies.
- In the stream setting, we evaluate one point at a time and use tailored strategies to decide whether to annotate the point or discard it.

Both settings can be implemented with our framework.
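
The difference between the two settings can be illustrated with a minimal sketch. It is not the framework's implementation: the functions `batch_setting` and `stream_setting`, the toy informativeness scores, and the fixed threshold are all hypothetical and only meant to show the control flow.

```python
import numpy as np


def batch_setting(stream_scores, batch_size, n_instances):
    """Buffer points until a batch is full, then keep the n_instances
    highest-scoring points of that batch (hypothetical helper)."""
    selected = []
    # only complete batches are processed in this sketch
    for start in range(0, len(stream_scores) - batch_size + 1, batch_size):
        batch = stream_scores[start:start + batch_size]
        top = np.argsort(batch)[-n_instances:]  # positions inside the batch
        selected.extend((start + top).tolist())  # positions inside the stream
    return sorted(selected)


def stream_setting(stream_scores, threshold):
    """Decide on each point as it arrives: annotate it only if its score
    exceeds the threshold (hypothetical helper)."""
    return [i for i, score in enumerate(stream_scores) if score > threshold]


# toy informativeness scores, one per incoming point (assumed values)
scores = np.array([0.1, 0.9, 0.5, 0.7, 0.2, 0.3, 0.8, 0.4])

picked_batch = batch_setting(scores, batch_size=4, n_instances=2)
picked_stream = stream_setting(scores, threshold=0.6)
print(picked_batch)   # [1, 3, 6, 7]
print(picked_stream)  # [1, 3, 6]
```

In the batch setting the number of annotated points is fixed in advance (two per batch here), while in the stream setting it depends on how many points happen to pass the threshold.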

In [None]:
import numpy as np
import torch
from modAL.uncertainty import uncertainty_sampling
from sklearn.ensemble import RandomForestClassifier
from torch.utils.data import DataLoader, Subset
from torchvision import datasets, models, transforms

from activelearning.AL_cycle import plot_results, strategy_comparison
from activelearning.queries.informative.margin_query_stream import stream_query_margin
from activelearning.queries.representative.coreset_query import query_coreset
from activelearning.queries.representative.diversity_query_stream import stream_query_diversity
from activelearning.queries.representative.random_query import query_random
from activelearning.queries.representative.random_query_stream import stream_query_random

torch.manual_seed(123)
np.random.seed(123)

In [None]:
# convert images to tensors of acceptable sizes for ResNet and normalize
preprocess = transforms.Compose(
    [
        transforms.Grayscale(num_output_channels=3),
        transforms.Resize(224),
        transforms.ToTensor(),
        transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
    ]
)

# import data
train_full = datasets.MNIST(root="./data", train=True, download=True, transform=preprocess)
test_full = datasets.MNIST(root="./data", train=False, download=True, transform=preprocess)

# we are going to use a subset of 1500 images for this example
y_train = train_full.targets[:1000]
y_test = test_full.targets[:500]

train = Subset(train_full, indices=range(1000))
test = Subset(test_full, indices=range(500))

# to use batches instead of one at a time
loaded_train = DataLoader(train, batch_size=64, shuffle=False)
loaded_test = DataLoader(test, batch_size=64, shuffle=False)

_MNIST_ consists of images of handwritten digits from 0 to 9. When working with image data, we need to first extract features using for example a pretrained model. In this case, we will use [ResNet18](https://pytorch.org/vision/main/models/generated/torchvision.models.resnet18.html) for feature extraction.

In [None]:
# import ResNet model
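# `pretrained=True` is deprecated; the `weights` argument requires torchvision >= 0.13
resnet = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)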

# remove classification layer
resnet = torch.nn.Sequential(*list(resnet.children())[:-1])
# set to evaluation mode
resnet.eval()

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
resnet.to(device)

train_features_list = []
with torch.no_grad():
    for images, _ in loaded_train:
        features_by_resnet = resnet(images.to(device))
        train_features_list.append(features_by_resnet)

test_features_list = []
with torch.no_grad():
    for images, _ in loaded_test:
        features_by_resnet = resnet(images.to(device))
        test_features_list.append(features_by_resnet)


train_features = torch.cat(train_features_list, dim=0).squeeze()

train_features = train_features.cpu().numpy()

test_features = torch.cat(test_features_list, dim=0).squeeze()
test_features = test_features.cpu().numpy()

Evaluating performance in the stream-based scenario is trickier than in the pool-based scenario, because the random order in which points arrive (and batches are formed) heavily influences the result. We can still use the test accuracy of a model trained on the complete training set as a reference metric, keeping in mind that some important points might be discarded (for example, when a batch consists mostly of very relevant points and only a few of them can be kept), so the goal accuracy might never be reached.

In [None]:
RF_mod = RandomForestClassifier()
RF_mod.fit(train_features, y_train)

goal_acc = RF_mod.score(test_features, y_test)
print(goal_acc)

To compare the query strategies in the batch setting, we only need to specify the *batch_size* parameter of the *strategy_comparison* function. Here the choice of *n_instances* is very important, as it determines how many points of each batch are kept and how many are discarded. Keeping a high percentage of instances brings the accuracy closer to the goal but reduces the potential saving in labeling costs; keeping a low percentage guarantees less labeling but may fail to reach the same accuracy.
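
As a rough illustration of this trade-off, the arithmetic below estimates the labeling effort for the values used in the next cell (1000 pool points, batches of 64, 16 or 32 instances kept per batch). The helper `labeling_budget` is hypothetical, and it counts only complete batches: how *strategy_comparison* treats a final incomplete batch is not covered here.

```python
def labeling_budget(n_pool, batch_size, n_instances):
    """Estimate labeling effort from complete batches only (hypothetical helper)."""
    full_batches = n_pool // batch_size
    return {
        "full_batches": full_batches,
        "leftover_points": n_pool % batch_size,  # points in a final incomplete batch
        "labeled_from_full_batches": full_batches * n_instances,
        "kept_fraction_per_batch": n_instances / batch_size,
    }


for k in (16, 32):
    print(k, labeling_budget(n_pool=1000, batch_size=64, n_instances=k))
```

With these numbers there are 15 complete batches, so keeping 16 instances per batch labels 240 points (25% of each batch) and keeping 32 labels 480 points (50% of each batch), with 40 points left over in both cases.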

In [None]:
n_instances = [16, 32]

scores_batch = strategy_comparison(
    X_train=None,
    y_train=None,
    X_pool=train_features,
    y_pool=y_train,
    X_test=test_features,
    y_test=y_test,
    classifier="randomforest",
    query_strategies=[uncertainty_sampling, query_coreset, query_random],
    batch_size=64,
    n_instances=n_instances,
    goal_acc=goal_acc,
)

In [None]:
plot_results(
    scores_batch,  # output data frame from strategy_comparison
    n_instances=n_instances,
    tot_samples=train_features.shape[0],  # size of the original training set, for scale
    goal_acc=goal_acc,
    figsize=(10, 6),
)

To simulate the stream setting instead, we only need to set *n_instances* to 1. Given the random element of the stream, it is a good idea to repeat the experiment several times to get a sense of the variability. We usually want to keep all of the first *n* points coming from the stream, to create a solid base of labeled points before employing the query strategies; this can be done either by providing a starting set (*X_train*, *y_train* parameters) or by specifying the *start_len* parameter. The *quantile* parameter controls how high the threshold for accepting or discarding a point is, according to the query strategy. More detail on the threshold can be found in each query's documentation.
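
To give an intuition for how a quantile-based threshold can drive the accept/discard decision, here is a minimal sketch of one possible margin rule. It is an assumption for illustration, not the logic of *stream_query_margin*: the functions `margin` and `stream_decisions` are hypothetical, the class probabilities are random rather than produced by a classifier, and no retraining happens after a point is accepted.

```python
import numpy as np


def margin(proba):
    """Difference between the two largest class probabilities (small = uncertain)."""
    top_two = np.sort(proba)[-2:]
    return top_two[1] - top_two[0]


def stream_decisions(proba_stream, start_len, quantile):
    """Accept the first start_len points unconditionally, then accept a point
    only if its margin is at or below the given quantile of the margins
    observed so far (hypothetical rule)."""
    seen_margins = []
    accepted = []
    for i, proba in enumerate(proba_stream):
        m = margin(proba)
        if i < start_len or not seen_margins:
            accepted.append(i)  # warm-up phase: label everything
        elif m <= np.quantile(seen_margins, quantile):
            accepted.append(i)  # uncertain enough: ask for the label
        seen_margins.append(m)
    return accepted


rng = np.random.default_rng(0)
# stand-in for predicted class probabilities over 10 digits (assumed data)
proba_stream = rng.dirichlet(np.ones(10), size=200)

accepted = stream_decisions(proba_stream, start_len=20, quantile=0.5)
print(f"{len(accepted)} of {len(proba_stream)} points would be annotated")
```

Under this rule a lower *quantile* gives a stricter threshold and therefore fewer annotated points, which mirrors the labeling-cost versus accuracy trade-off of the batch setting.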

In [None]:
n_instances = [1, 1]

scores_stream = strategy_comparison(
    X_train=None,
    y_train=None,
    X_pool=train_features,
    y_pool=y_train,
    X_test=test_features,
    y_test=y_test,
    classifier="randomforest",
    start_len=100,
    quantile=0.5,
    query_strategies=[stream_query_margin, stream_query_diversity, stream_query_random],
    n_instances=n_instances,
    goal_acc=goal_acc,
)

In [None]:
plot_results(
    scores_stream,  # output data frame from strategy_comparison
    n_instances=n_instances,
    tot_samples=train_features.shape[0],  # size of the original training set, for scale
    goal_acc=goal_acc,
    figsize=(10, 6),
)