### _MNIST_ Active Learning example

This notebook contains a simple example of how to implement the Active Learning framework using modAL and the query strategies of this repository. For this we use a subset of the [_MNIST_](https://archive.ics.uci.edu/dataset/683/mnist+database+of+handwritten+digits) dataset, loaded with _torchvision_.
First we import the query strategies that we are going to compare, then we load the data.

In [None]:
import numpy as np
import torch
from modAL.disagreement import consensus_entropy_sampling, max_disagreement_sampling, vote_entropy_sampling
from modAL.uncertainty import entropy_sampling, margin_sampling, uncertainty_sampling
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from torch.utils.data import DataLoader, Subset
from torchvision import datasets, models, transforms

from activelearning.AL_cycle import plot_results, strategy_comparison
from activelearning.queries.representative.coreset_query import query_coreset
from activelearning.queries.representative.probcover_query import query_probcover
from activelearning.queries.representative.random_query import query_random

torch.manual_seed(123)
np.random.seed(123)

In [None]:
# convert images to tensors of acceptable sizes for ResNet and normalize
preprocess = transforms.Compose(
    [
        transforms.Grayscale(num_output_channels=3),
        transforms.Resize(224),
        transforms.ToTensor(),
        transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
    ]
)

# import data
train_full = datasets.MNIST(root="./data", train=True, download=True, transform=preprocess)
test_full = datasets.MNIST(root="./data", train=False, download=True, transform=preprocess)

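# we use the first 1000 training images and the first 500 test images (1500 in total)
# labels are converted to NumPy arrays so they can be used directly with sklearn
y_train = train_full.targets[:1000].numpy()
y_test = test_full.targets[:500].numpy()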

train = Subset(train_full, indices=range(1000))
test = Subset(test_full, indices=range(500))

# to use batches instead of one at a time
loaded_train = DataLoader(train, batch_size=64, shuffle=False)
loaded_test = DataLoader(test, batch_size=64, shuffle=False)

_MNIST_ consists of images of handwritten digits from 0 to 9. When working with image data, we first need to extract features, for example with a pretrained model. In this case, we use [ResNet18](https://pytorch.org/vision/main/models/generated/torchvision.models.resnet18.html) for feature extraction.

In [None]:
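# import ResNet model with ImageNet weights
# (the `weights` argument replaces the deprecated `pretrained=True` in torchvision >= 0.13)
resnet = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)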

# remove classification layer
resnet = torch.nn.Sequential(*list(resnet.children())[:-1])
# set to evaluation mode
resnet.eval()

# run on the GPU when one is available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
resnet.to(device)

train_features_list = []
with torch.no_grad():
    for images, _ in loaded_train:
        features_by_resnet = resnet(images.to(device))
        train_features_list.append(features_by_resnet)

test_features_list = []
with torch.no_grad():
    for images, _ in loaded_test:
        features_by_resnet = resnet(images.to(device))
        test_features_list.append(features_by_resnet)


train_features = torch.cat(train_features_list, dim=0).squeeze()

train_features = train_features.cpu().numpy()

test_features = torch.cat(test_features_list, dim=0).squeeze()
test_features = test_features.cpu().numpy()
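
With the classification layer removed, ResNet18 ends in its global average pooling layer, so the concatenated output has shape `(n_images, 512, 1, 1)`, and the `.squeeze()` call above drops the two trailing singleton axes. A minimal sketch of that shape handling, using zero-filled NumPy arrays as stand-ins for the real tensors (the array names are illustrative only):

```python
import numpy as np

# Stand-in for the concatenated ResNet18 output: (n_images, 512, 1, 1).
pooled = np.zeros((1000, 512, 1, 1), dtype=np.float32)

# squeeze() removes every axis of length 1, leaving one 512-dim vector per image.
features = pooled.squeeze()
print(features.shape)  # (1000, 512)

# Caveat: with a single image the batch axis also has length 1 and is removed too.
single = np.zeros((1, 512, 1, 1), dtype=np.float32)
print(single.squeeze().shape)  # (512,)

# Reshaping explicitly keeps the batch axis whatever the number of images.
single_features = single.reshape(single.shape[0], -1)
print(single_features.shape)  # (1, 512)
```

This does not matter for the 1000 and 500 images used here, but if a set could ever contain a single image, `torch.flatten(x, start_dim=1)` is the PyTorch counterpart of the explicit reshape.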

We evaluate the performance of a random forest classifier trained on the complete training set. Its accuracy serves as the reference metric for the active learning query strategies: the aim is to reach the same accuracy with fewer labeled samples.

In [None]:
RF_mod = RandomForestClassifier()
RF_mod.fit(train_features, y_train)

goal_acc = RF_mod.score(test_features, y_test)
print(goal_acc)
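
One way to read the comparisons that follow is to ask how many labeled samples a strategy needs before it matches `goal_acc`. A minimal sketch of that bookkeeping; the helper name `labels_to_reach` and the accuracy values are made up for illustration and are not part of the repository:

```python
def labels_to_reach(goal, n_labeled, accuracies):
    """Return the smallest labeled-set size whose accuracy reaches `goal`, or None."""
    for n, acc in zip(n_labeled, accuracies):
        if acc >= goal:
            return n
    return None


# Hypothetical learning curve: accuracy after each query round of 16 samples.
n_labeled = [16, 32, 48, 64, 80]
accuracies = [0.41, 0.63, 0.78, 0.86, 0.88]

print(labels_to_reach(0.85, n_labeled, accuracies))  # 64
print(labels_to_reach(0.95, n_labeled, accuracies))  # None
```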

To compare how different query strategies perform, we can use the *strategy_comparison* function and pass the strategies to be used. We can also pass more than one value for *n_instances* to check whether the batch size influences performance. *plot_results* can be used to immediately plot the output of *strategy_comparison*, or a custom graph can be created from the scores data frame.

In [None]:
n_instances = [16, 32]

scores = strategy_comparison(
    X_train=None,
    y_train=None,
    X_pool=train_features,
    y_pool=y_train,
    X_test=test_features,
    y_test=y_test,
    classifier="randomforest",
    query_strategies=[
        uncertainty_sampling,
        margin_sampling,
        entropy_sampling,
        query_random,
        query_coreset,
        query_probcover,
    ],
    n_instances=n_instances,
    goal_acc=goal_acc,
)
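
The three modAL strategies in the list above rank pool samples by how unsure the classifier is about them, each with a different measure. The sketch below computes the usual definitions of these measures on made-up class probabilities; it illustrates the idea and is not code taken from modAL:

```python
import numpy as np

# Hypothetical class probabilities for three pool samples and three classes.
proba = np.array([
    [0.90, 0.05, 0.05],   # confident prediction
    [0.49, 0.49, 0.02],   # two classes tied
    [0.40, 0.30, 0.30],   # no class stands out
])

# Least confidence: 1 minus the highest class probability (query the largest).
uncertainty = 1.0 - proba.max(axis=1)

# Margin: gap between the two highest probabilities (query the smallest).
top_two = np.sort(proba, axis=1)[:, -2:]
margin = top_two[:, 1] - top_two[:, 0]

# Entropy of the predicted distribution (query the largest).
entropy = -(proba * np.log(proba)).sum(axis=1)

print(int(np.argmax(uncertainty)), int(np.argmin(margin)), int(np.argmax(entropy)))  # 2 1 2
```

The measures can disagree: margin sampling prefers the tied sample, while least confidence and entropy prefer the one where no class stands out.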

In [None]:
plot_results(
    scores,  # output data frame from strategy_comparison
    n_instances=n_instances,
    tot_samples=train_features.shape[0],  # size of the original training set, for scale
    goal_acc=goal_acc,
    figsize=(10, 6),
)
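
`query_coreset` and `query_probcover` come from the repository's `representative` module. Coreset selection is commonly formulated as greedy k-center: repeatedly pick the pool point that is farthest from everything selected so far. The sketch below illustrates that general idea in NumPy; it is an assumption about the technique, not a copy of what `query_coreset` does, and `k_center_greedy` is a hypothetical name:

```python
import numpy as np


def k_center_greedy(X_pool, X_labeled, n_instances):
    """Greedily pick pool indices that are farthest from all selected points."""
    # Distance from each pool point to its nearest labeled point.
    diffs = X_pool[:, None, :] - X_labeled[None, :, :]
    min_dist = np.linalg.norm(diffs, axis=2).min(axis=1)

    selected = []
    for _ in range(n_instances):
        idx = int(np.argmax(min_dist))  # farthest point from the current centers
        selected.append(idx)
        # The new center may now be the nearest one for some pool points.
        dist_to_new = np.linalg.norm(X_pool - X_pool[idx], axis=1)
        min_dist = np.minimum(min_dist, dist_to_new)
    return selected


# Toy 1-D pool with one labeled point at the origin.
X_labeled = np.array([[0.0]])
X_pool = np.array([[1.0], [2.0], [10.0], [11.0], [5.0]])
print(k_center_greedy(X_pool, X_labeled, n_instances=2))  # [3, 4]
```

Unlike the uncertainty measures, this selection uses only distances between feature vectors, so it needs no trained classifier.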

To use a random starting set for the initial training of the model, instead of immediately starting with the query strategies, we can simply pass it to the *strategy_comparison* function through the *X_train* and *y_train* arguments.

In [None]:
train_start, train_pool, y_start, y_pool = train_test_split(train_features, y_train, test_size=0.75)

n_instances = [16, 32]

scores_v2 = strategy_comparison(
    X_train=train_start,
    y_train=y_start,
    X_pool=train_pool,
    y_pool=y_pool,
    X_test=test_features,
    y_test=y_test,
    classifier="randomforest",
    query_strategies=[
        uncertainty_sampling,
        margin_sampling,
        entropy_sampling,
        query_random,
        query_coreset,
        query_probcover,
    ],
    n_instances=n_instances,
    goal_acc=goal_acc,
)
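
With `test_size=0.75`, the `train_test_split` call above keeps a quarter of the 1000 training samples as the starting set and sends the rest to the pool. The sketch below reproduces the split sizes on placeholder arrays with balanced labels. It additionally passes `stratify` and `random_state`, optional choices not used in the notebook, so that every digit is represented in the starting set:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder features and balanced labels with the same sizes as in the notebook.
X = np.zeros((1000, 512))
y = np.repeat(np.arange(10), 100)

X_start, X_pool, y_start, y_pool = train_test_split(
    X, y, test_size=0.75, stratify=y, random_state=123
)

print(X_start.shape, X_pool.shape)  # (250, 512) (750, 512)
print(np.bincount(y_start))         # 25 samples of each digit
```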

In [None]:
plot_results(
    scores_v2,  # output data frame from strategy_comparison
    n_instances=n_instances,
    tot_samples=train_features.shape[0],  # size of the original training set, for scale
    goal_acc=goal_acc,
    figsize=(10, 6),
)

We can also use a committee of models to select the query instances. To do so, we use the *committee_classifiers* argument to specify which models form the committee. The query strategies should then be ones designed for the committee framework. In this example we use a committee of two simple neural networks.

In [None]:
scores_qbc = strategy_comparison(
    X_train=train_start,
    y_train=y_start,
    X_pool=train_pool,
    y_pool=y_pool,
    X_test=test_features,
    y_test=y_test,
    classifier="randomforest",
    query_strategies=[vote_entropy_sampling, consensus_entropy_sampling, max_disagreement_sampling, query_random],
    committee_classifiers=["nnet", "nnet"],
    n_instances=n_instances,
    goal_acc=goal_acc,
)
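
The three committee strategies used above score a pool sample by how much the committee members disagree on it. The sketch below computes the usual definitions of these disagreement measures for a made-up committee of two models; the probabilities are invented and the code is an illustration, not modAL's implementation:

```python
import numpy as np

# Hypothetical class probabilities, shape (n_models, n_samples, n_classes).
committee_proba = np.array([
    [[0.8, 0.1, 0.1], [0.6, 0.3, 0.1], [0.4, 0.5, 0.1]],
    [[0.7, 0.2, 0.1], [0.2, 0.7, 0.1], [0.5, 0.4, 0.1]],
])
n_models, n_samples, n_classes = committee_proba.shape

# Vote entropy: entropy of the distribution of hard votes over the classes.
votes = committee_proba.argmax(axis=2)  # (n_models, n_samples)
vote_share = np.stack([(votes == c).mean(axis=0) for c in range(n_classes)], axis=1)
safe_share = np.where(vote_share > 0, vote_share, 1.0)  # log(1) = 0 for unvoted classes
vote_entropy = -(vote_share * np.log(safe_share)).sum(axis=1)

# Consensus entropy: entropy of the class probabilities averaged over the committee.
consensus = committee_proba.mean(axis=0)  # (n_samples, n_classes)
consensus_entropy = -(consensus * np.log(consensus)).sum(axis=1)

# Max disagreement: largest KL divergence of any member from the consensus.
kl = (committee_proba * np.log(committee_proba / consensus)).sum(axis=2)
max_disagreement = kl.max(axis=0)

print(int(np.argmax(consensus_entropy)), int(np.argmax(max_disagreement)))  # 2 1
```

In this toy case the two members vote for the same class on the first sample, so its vote entropy is zero, and they split on the other two samples.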

In [None]:
plot_results(
    scores_qbc,  # output data frame from strategy_comparison
    n_instances=n_instances,
    tot_samples=train_features.shape[0],  # size of the original training set, for scale
    goal_acc=goal_acc,
    figsize=(10, 6),
)

To use a different classifier instead of a random forest, we can simply change the *classifier* argument of *strategy_comparison*. For example, we can use a simple neural network classifier. If the default options aren't sophisticated enough, a custom model can be passed instead, as long as it comes from *sklearn* or one of its wrappers. For neural networks, you can use *skorch*; see the modAL documentation for details.

In [None]:
n_instances = [16, 32]

scores_nnet = strategy_comparison(
    X_train=train_start,
    y_train=y_start,
    X_pool=train_pool,
    y_pool=y_pool,
    X_test=test_features,
    y_test=y_test,
    classifier="nnet",
    query_strategies=[
        uncertainty_sampling,
        margin_sampling,
        entropy_sampling,
        query_random,
        query_coreset,
        query_probcover,
    ],
    n_instances=n_instances,
    goal_acc=goal_acc,
)

In [None]:
plot_results(
    scores_nnet,  # output data frame from strategy_comparison
    n_instances=n_instances,
    tot_samples=train_features.shape[0],  # size of the original training set, for scale
    goal_acc=goal_acc,
    figsize=(10, 6),
)
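
As a rough illustration of the custom-model option mentioned above, the sketch below builds a scikit-learn network and checks that it exposes `predict_proba`, which the uncertainty-based strategies rely on. The hyperparameters and the random placeholder data are arbitrary, and passing the estimator through the `classifier` argument is an assumption about how *strategy_comparison* accepts custom models:

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

# A custom scikit-learn network; layer size and iteration cap are arbitrary choices.
custom_clf = MLPClassifier(hidden_layer_sizes=(64,), max_iter=300, random_state=123)

# Quick check on random placeholder data that class probabilities are available.
rng = np.random.default_rng(123)
X_demo = rng.normal(size=(60, 8))
y_demo = np.arange(60) % 3
custom_clf.fit(X_demo, y_demo)

demo_proba = custom_clf.predict_proba(X_demo[:5])
print(demo_proba.shape)  # (5, 3)

# Assumption: the estimator would then replace the string shortcut, e.g.
# strategy_comparison(..., classifier=custom_clf, ...)
```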