# Annotate Text Data Using Active Learning with Cleanlab

In this example, we will use active learning to improve a fine-tuned Hugging Face transformer for text classification, while keeping the total number of labels collected from human annotators low.

When resource constraints prevent us from acquiring labels for the entirety of our data, active learning aims to save both time and money by selecting which examples data annotators should spend their effort labeling.

## Active Learning

**Active learning** helps prioritize what data to label in order to maximize the performance of a supervised machine learning model trained on the labeled data.

The learning process usually happens iteratively. At each round, active learning tells us which examples we should collect additional annotations for in order to improve our current model the most under a limited labeling budget.

[ActiveLab](https://arxiv.org/abs/2301.11856) is an active learning algorithm that is useful when the labels coming from human annotators are noisy and we must decide whether to collect one more annotation for a previously annotated example (whose label seems suspect) or for a not-yet-annotated example. After collecting these new annotations for a batch of data to grow our training dataset, we re-train our model and evaluate its test accuracy.

## Setup

In [None]:
!pip install -qU datasets transformers scikit-learn matplotlib cleanlab

In [None]:
import pandas as pd
import numpy as np
import random
import transformers
import datasets
import matplotlib.pyplot as plt
from transformers import (
    AutoTokenizer, AutoModel, AutoModelForSequenceClassification,
    TrainingArguments, Trainer
)
from datasets import load_dataset, Dataset, DatasetDict, ClassLabel
from sklearn.metrics import accuracy_score
from sklearn.model_selection import StratifiedKFold
from scipy.special import softmax
from datetime import datetime

from cleanlab.multiannotator import (
    get_majority_vote_label,
    get_active_learning_scores,
    get_label_quality_multiannotator
)

pd.set_option('max_colwidth', None)

## Collect and organize data

We will use the [Stanford Politeness Corpus](https://huggingface.co/datasets/Cleanlab/stanford-politeness) as the dataset.

In [None]:
labeled_data_file = {'labeled': 'X_labeled_full.csv'}
unlabeled_data_file = {'unlabeled': 'X_unlabeled.csv'}
test_data_file = {'test': 'test.csv'}

X_labeled_full = load_dataset(
    'Cleanlab/stanford-politeness',
    split='labeled',
    data_files=labeled_data_file
)
X_unlabeded = load_dataset(
    'Cleanlab/stanford-politeness',
    split='unlabeled',
    data_files=unlabeled_data_file
)
test = load_dataset(
    'Cleanlab/stanford-politeness',
    split='test',
    data_files=test_data_file
)

In [None]:
!wget -nc -O 'extra_annotations.npy' 'https://huggingface.co/datasets/Cleanlab/stanford-politeness/resolve/main/extra_annotations.npy?download=true'

In [None]:
extra_annotations = np.load('extra_annotations.npy', allow_pickle=True).item()

In [None]:
X_labeled_full = X_labeled_full.to_pandas()
X_labeled_full.set_index('id', inplace=True)

X_unlabeled = X_unlabeded.to_pandas()
X_unlabeled.set_index('id', inplace=True)

test = test.to_pandas()

### Classify the politeness of text

The dataset is structured as a binary text classification task, to classify whether each phrase is polite or impolite.

Human annotators are given a selected text phrase and they provide an (imperfect) annotation regarding its politeness: 0 for impolite and 1 for polite.

We train a transformer classifier on the annotated data and measure model accuracy over a set of held-out test examples. We feel confident about their ground truth labels because they are derived from a consensus among the 5 annotators who labeled each of these examples.

As for the training data, we have
* `X_labeled_full`: our initial training set with just a small set of 100 text examples labeled with 2 annotations per example.
* `X_unlabeled`: large set of 1900 unlabeled text examples we can consider having annotators label.
* `extra_annotations`: pool of additional annotations we pull from when an annotation is requested for an example.

### Visualize data

In [None]:
# multi-annotated data
X_labeled_full.head()

| id | text | a6 | a12 | a16 | a19 | a20 | a22 | a39 | a42 | a52 | ... | a157 | a158 | a178 | a180 | a185 | a193 | a196 | a197 | a215 | a216 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 450d326d | <url>. Congrats, or should I say good luck? | NaN | NaN | 1.0 | NaN | NaN | 0.0 | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 6a22f4ec | Can I get some time to finish what I am doing without everything being deleted?? | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 823f1104 | Ok. Thank you for clarifying. Could you be more specific as to what you are specifying as "the claim" so that I may find relevant information to refute? | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 7677905a | One wonders, of course, who "Elliott of Macedon" would have been. Probably something analogous to Brian of Nazareth but in a Macedonian phalanx? | NaN | NaN | NaN | NaN | NaN | NaN | 0.0 | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | 0.0 | NaN | NaN | NaN | NaN |
| a1ce799b | So, let me make sure I understand this. You think that, if we remove an image as it does not meet the NFCC, you would then be able to upload the same image, only this time, it would meet the NFCC? | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.0 | NaN | 1.0 |


In [None]:
# unlabeled data
X_unlabeled.head()

| id | text |
|---|---|
| 486aff36 | The review has been up there for something like six weeks, I notice. Think you'll be able to take care of those last couple of things? |
| 201d7655 | How many other states follow the same pattern? And do we really need it to? |
| c9125774 | You added the name Ken Taylor to the <url> page but there is no such person listed on the DOD website as having received that award. Who were you refering to? |
| 593ac8fb | I found <url> whilst looking for something else. Any use to you? |
| d1fdcdba | If it were me I'd want to try and find out more about how/why this happened first before I continued to use that software. Have you asked at the talk page I mentioned above? |


In [None]:
# extra_annotations contains the annotations that we will use
# when an additional annotation is requested
extra_annotations

# check the format for a random sample
{k: extra_annotations[k] for k in random.sample(list(extra_annotations.keys()), 5)}

{'1fb3e07f': {'a16': 1.0, 'a22': 1.0, 'a61': 0.0, 'a160': 1.0, 'a175': 1.0},
 'caae3ae6': {'a22': 1.0, 'a98': 0.0, 'a117': 0.0, 'a121': 0.0, 'a150': 0.0},
 '23e93d59': {'a61': 0.0, 'a76': 0.0, 'a99': 0.0, 'a145': 0.0, 'a185': 0.0},
 '59b1ff76': {'a22': 0.0, 'a49': 0.0, 'a100': 0.0, 'a113': 0.0, 'a166': 1.0},
 'ff14ccc4': {'a16': 1.0, 'a61': 1.0, 'a73': 1.0, 'a174': 0.0, 'a205': 1.0}}

In [None]:
# samples from test set
num_to_label = {0: 'Impolite', 1: 'Polite'}

for i in range(2):
    print(f"{num_to_label[i]} examples:")
    subset = test[test.label == i][['text']].sample(n=3, random_state=111)
    print(subset)

Impolite examples:
                                                     text
559             Huh? Is that a web address or a wikilink?
67                               <url>. Or is it just me?
231  So nearly a year later you change it all again. Why?
Polite examples:
                                                                                                                                                                                                                                                                  text
553                                                                                                         I tried to create a similar map for Canada but couldn't get a usable output from the site you used for the Australia map. Could you create a Canadian one?
44   There seems to have been something wrong with the pages you made on the V8 Supercar Championship Series - a glitch with the "align" tags in the beginning infobox made the text grotesquely overflow the 

## Helper functions

`get_idx_to_label` is used in active learning scenarios when we deal with a mixture of labeled and unlabeled data. Its primary goal is to determine **which examples (from both labeled and unlabeled datasets) should be selected for additional annotations based on their active learning scores**.

In [None]:
# Get indices of examples with the lowest active learning score
# to collect more labels for
def get_idx_to_label(
        X_labeled_full,
        X_unlabeled,
        extra_annotations,
        batch_size_to_label,
        active_learning_scores,
        active_learning_scores_unlabeled=None
):
    if active_learning_scores_unlabeled is None:
        active_learning_scores_unlabeled = np.array([])

    to_label_idx = []
    to_label_idx_unlabeled = []

    num_labeled = len(active_learning_scores)
    active_learning_scores_combined = np.concatenate(
        (active_learning_scores, active_learning_scores_unlabeled)
    )
    # Positions sorted from lowest (most informative) to highest score
    to_label_idx_combined = np.argsort(active_learning_scores_combined)

    # Collect the n=batch_size_to_label lowest-scoring examples that still
    # have an annotation available; stop early if we run out of candidates
    i = 0
    while (
        len(to_label_idx) + len(to_label_idx_unlabeled) < batch_size_to_label
        and i < len(to_label_idx_combined)
    ):
        idx = to_label_idx_combined[i]

        if idx < num_labeled:
            # We know this is an already annotated example
            text_id = X_labeled_full.iloc[idx].name
            # Make sure we have an annotation left to collect
            if text_id in extra_annotations and extra_annotations[text_id]:
                to_label_idx.append(idx)

        else:
            # We know this is an example that is currently not annotated
            # Subtract off offset to get back original index
            idx -= num_labeled
            text_id = X_unlabeled.iloc[idx].name
            # Make sure we have an annotation left to collect
            if text_id in extra_annotations and extra_annotations[text_id]:
                to_label_idx_unlabeled.append(idx)

        i += 1

    # Use an integer dtype so that empty results can still be used with `.iloc`
    to_label_idx = np.array(to_label_idx, dtype=int)
    to_label_idx_unlabeled = np.array(to_label_idx_unlabeled, dtype=int)
    return to_label_idx, to_label_idx_unlabeled
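
The index bookkeeping in `get_idx_to_label` can be traced on a toy example. The following is a minimal sketch in plain Python rather than NumPy; the scores are made up and `split_combined_index` is a hypothetical helper that is not part of this notebook.

```python
# Toy scores (made up for illustration); lower = more informative to label next.
scores_labeled = [0.80, 0.35, 0.60]
scores_unlabeled = [0.50, 0.20]


def split_combined_index(combined_idx, num_labeled):
    """Map a position in the concatenated score list back to (subset, local index)."""
    if combined_idx < num_labeled:
        return ("labeled", combined_idx)
    # Positions past the labeled block belong to the unlabeled pool.
    return ("unlabeled", combined_idx - num_labeled)


combined = scores_labeled + scores_unlabeled
# Positions sorted by ascending score, analogous to np.argsort in get_idx_to_label.
order = sorted(range(len(combined)), key=lambda k: combined[k])
batch = [split_combined_index(k, len(scores_labeled)) for k in order[:3]]
print(batch)
```

The three lowest scores here are 0.20, 0.35, and 0.50, so the batch mixes one re-annotation of a labeled example with two first annotations of unlabeled examples.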

`get_idx_to_label_random` is used for an active learning context where the selection of data points for additional annotation is done randomly rather than based on a model's uncertainty or learning scores. This approach may be used as a baseline to compare against more sophisticated active learning strategies or in scenarios where it is unclear how to score examples.

In [None]:
# Get indices of random examples to collect more labels for
def get_idx_to_label_random(
        X_labeled_full,
        X_unlabeled,
        extra_annotations,
        batch_size_to_label
):
    to_label_idx = []
    to_label_idx_unlabeled = []

    # Generate list of indices for both sets of examples
    labeled_idx = [(x, 'labeled') for x in range(len(X_labeled_full))]
    unlabeled_idx = []
    if X_unlabeled is not None:
        unlabeled_idx = [(x, 'unlabeled') for x in range(len(X_unlabeled))]
    combined_idx = labeled_idx + unlabeled_idx

    # We want to collect the n=batch_size random examples to collect another annotation for
    while (len(to_label_idx) + len(to_label_idx_unlabeled)) < batch_size_to_label:
        # Random choice from indices
        # We time-seed to ensure randomness
        random.seed(datetime.now().timestamp())
        choice = random.choice(combined_idx)
        idx, which_subset = choice

        if which_subset == 'labeled':
            # We know this is an already annotated example
            text_id = X_labeled_full.iloc[idx].name
            # Make sure we have an annotation left to collect
            if text_id in extra_annotations and extra_annotations[text_id]:
                to_label_idx.append(idx)
            combined_idx.remove(choice)
        else:
            # We know this is an example that is currently not annotated
            text_id = X_unlabeled.iloc[idx].name
            # Make sure we have an annotation left to collect
            if text_id in extra_annotations and extra_annotations[text_id]:
                to_label_idx_unlabeled.append(idx)
            combined_idx.remove(choice)

    to_label_idx = np.array(to_label_idx)
    to_label_idx_unlabeled = np.array(to_label_idx_unlabeled)
    return to_label_idx, to_label_idx_unlabeled

In [None]:
def compute_std_dev(accuracy):
    """Compute standard deviation across 2D array of accuracies."""
    def compute_std_dev_ind(accs):
        mean = np.mean(accs)
        std_dev = np.std(accs)
        return np.array([mean - std_dev, mean + std_dev])

    std_dev = np.apply_along_axis(compute_std_dev_ind, 0, accuracy)
    return std_dev

def choose_existing(annotators, existing_annotators):
    """Select which annotator we should collect another annotation from"""
    for annotator in annotators:
        # if we find one that has already given an annotation, we return it
        if annotator in existing_annotators:
            return annotator

    # If we do not find an existing one, return a random one
    choice = random.choice(list(annotators.keys()))
    return choice

def compute_metrics(p):
    """compute_metrics function for Trainer"""
    logits, labels = p
    pred = np.argmax(logits, axis=1)
    pred_probs = softmax(logits, axis=1)
    accuracy = accuracy_score(y_true=labels, y_pred=pred)
    return {'logits': logits, 'pred_probs': pred_probs, 'accuracy': accuracy}

def tokenize_function(examples):
    """Tokenize text"""
    model_name = 'distilbert-base-uncased'
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    return tokenizer(
        examples['text'],
        padding='max_length',
        truncation=True,
    )

def tokenize_data(data):
    """Tokenize data"""
    dataset = Dataset.from_dict({'label': data['label'], 'text': data['text'].values})
    tokenized_dataset = dataset.map(tokenize_function, batched=True)
    tokenized_dataset = tokenized_dataset.cast_column('label', ClassLabel(names=['0', '1']))
    return tokenized_dataset

`get_trainer` sets up a training environment for a text classification task using DistilBERT, a distilled version of the BERT model.

In [None]:
def get_trainer(train_set, test_set):
    """Get trainer for text classification"""
    model_name = 'distilbert-base-uncased'
    model_folder = 'model_training'
    max_training_steps = 300
    num_classes = 2

    # Set training arguments
    training_args = TrainingArguments(
        output_dir=model_folder,
        max_steps=max_training_steps,
        seed=int(datetime.now().timestamp())
    )

    # Tokenize train/test set
    train_tokenized_dataset = tokenize_data(train_set)
    test_tokenized_dataset = tokenize_data(test_set)

    # Initiate a pretrained model
    model = AutoModelForSequenceClassification.from_pretrained(
        model_name,
        num_labels=num_classes
    )
    trainer = Trainer(
        model=model,
        args=training_args,
        compute_metrics=compute_metrics,
        train_dataset=train_tokenized_dataset,
        eval_dataset=test_tokenized_dataset
    )
    return trainer

`get_pred_probs` is used to perform out-of-sample prediction probability computation for a given dataset using cross-validation, with additional handling for unlabeled data.

In [None]:
def get_pred_probs(X, X_unlabeled):
    """Use cross-validation to get out-of-sample prediction probabilities"""
    # Generate cross-val splits
    n_splits = 3
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True)
    skf_splits = [
        [train_index, test_index]
        for train_index, test_index in skf.split(X=X['text'], y=X['label'])
    ]

    # Initiate empty array to store `pred_probs`
    num_examples, num_classes = len(X), len(X.label.value_counts())
    pred_probs = np.full((num_examples, num_classes), np.nan)
    pred_probs_unlabeled = None

    # If we use up all examples from the initial unlabeled pool,
    # X_unlabeled will be None
    if X_unlabeled is not None:
        pred_probs_unlabeled = np.full((n_splits, len(X_unlabeled), num_classes), np.nan)

    # Iterate through cross-validation folds
    for split_num, split in enumerate(skf_splits):
        train_index, test_index = split

        train_set = X.iloc[train_index]
        test_set = X.iloc[test_index]

        # Get trainer with train/test subsets
        trainer = get_trainer(train_set, test_set)
        trainer.train()
        eval_metrics = trainer.evaluate()

        # Get `pred_probs` and insert into dataframe
        pred_probs_fold = eval_metrics['eval_pred_probs']
        pred_probs[test_index] = pred_probs_fold

        # Since we do not have labels for the unlabeled pool,
        # we compute `pred_probs` at each round of cross-val, and then
        # average the results at the end
        if X_unlabeled is not None:
            dataset_unlabeled = Dataset.from_dict({'text': X_unlabeled['text'].values})
            unlabeled_tokenized_dataset = dataset_unlabeled.map(tokenize_function, batched=True)
            logits = trainer.predict(unlabeled_tokenized_dataset).predictions
            curr_pred_probs_unlabeled = softmax(logits, axis=1)
            pred_probs_unlabeled[split_num] = curr_pred_probs_unlabeled

    # We average the `pred_probs` from each round of cross-val to get
    # `pred_probs` for the unlabeled pool
    if X_unlabeled is not None:
        pred_probs_unlabeled = np.mean(np.array(pred_probs_unlabeled), axis=0)

    return pred_probs, pred_probs_unlabeled
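
To see why these predictions count as out-of-sample, here is a minimal sketch in plain Python. It rests on loud assumptions: the "model" is a stand-in that only returns the class frequencies of its training labels, folds are assigned round-robin instead of with `StratifiedKFold`, and all values are made up.

```python
from collections import Counter


def fit_class_frequencies(train_labels):
    """Stand-in 'model': predicts the class frequencies of its training labels."""
    counts = Counter(train_labels)
    total = len(train_labels)
    return [counts.get(0, 0) / total, counts.get(1, 0) / total]


labels = [0, 1, 1, 0, 1, 0]  # toy consensus labels
n_splits = 3
num_unlabeled = 2

# Round-robin fold assignment (an assumption; the notebook uses StratifiedKFold).
folds = [[i for i in range(len(labels)) if i % n_splits == f] for f in range(n_splits)]

pred_probs = [None] * len(labels)
unlabeled_per_fold = []
for held_out in folds:
    train_labels = [labels[i] for i in range(len(labels)) if i not in held_out]
    probs = fit_class_frequencies(train_labels)
    # Held-out examples are scored by a model that never saw them.
    for i in held_out:
        pred_probs[i] = probs
    # Unlabeled examples are scored by every fold's model.
    unlabeled_per_fold.append([probs] * num_unlabeled)

# Average the per-fold predictions for each unlabeled example.
pred_probs_unlabeled = [
    [sum(fold[j][c] for fold in unlabeled_per_fold) / n_splits for c in range(2)]
    for j in range(num_unlabeled)
]
print(pred_probs)
print(pred_probs_unlabeled)
```

Each labeled example receives exactly one prediction, from the fold in which it was held out, while each unlabeled example receives one prediction per fold that is then averaged.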

`get_annotator` determines which annotator to collect a new annotation from for a specific example, preferring annotators who already appear in the training set. `get_annotation` collects the actual annotation for a given example from the chosen annotator and deletes it from the pool so that it cannot be selected again.

In [None]:
def get_annotator(example_id):
    """Determine which annotator to collect annotation from given example"""
    existing_annotators = set(X_labeled_full.drop('text', axis=1).columns)
    # Choose existing annotators first
    annotators = extra_annotations[example_id]
    chosen_annotator = choose_existing(annotators, existing_annotators)
    return chosen_annotator


def get_annotation(example_id, chosen_annotator):
    """Collect an annotation for a given example and remove it from the pool"""
    new_annotation = extra_annotations[example_id][chosen_annotator]
    del extra_annotations[example_id][chosen_annotator]
    return new_annotation
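
How the two helpers interact can be traced on a toy pool. The following is a minimal sketch in plain Python; the pool contents, `known_annotators`, and the helper names `pick_annotator` and `pop_annotation` are made up for illustration and only mirror the logic of `choose_existing` and `get_annotation`.

```python
import random

# Toy pool: example id -> {annotator id: label}; values are made up.
pool = {"ex1": {"a16": 1.0, "a22": 0.0}, "ex2": {"a61": 0.0}}
# Annotators that already have a column in the (hypothetical) training set.
known_annotators = {"a22", "a99"}


def pick_annotator(candidates, known):
    """Prefer an annotator we already have; otherwise fall back to a random one."""
    for annotator in candidates:
        if annotator in known:
            return annotator
    return random.choice(list(candidates))


def pop_annotation(pool, example_id, annotator):
    """Return the annotation and remove it from the pool so it cannot be reused."""
    return pool[example_id].pop(annotator)


chosen = pick_annotator(pool["ex1"], known_annotators)
label = pop_annotation(pool, "ex1", chosen)
print(chosen, label, pool["ex1"])
```

After the call, only the annotation from `a16` remains available for `ex1`, which is what keeps the same annotation from being collected twice.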

## Methodology

For each **active learning** round, we
1. Compute ActiveLab consensus labels for each training example derived from all annotations collected thus far
2. Train our transformer classification model on the current training set using these consensus labels
3. Evaluate test accuracy on the test set (which has high-quality ground truth labels)
4. Run cross-validation to get out-of-sample predicted class probabilities from our model for the entire training set and unlabeled set
5. Get ActiveLab active learning scores for each example in the training set and unlabeled set. These scores estimate how informative it would be to collect another annotation for each example
6. Select a subset (*n = batch_size*) of examples with the lowest active learning scores
7. Collect one additional annotation for each of the *n* selected examples
8. Add the new annotations (and new previously non-annotated examples if selected) to our training set for the next iteration

Next, we compare models trained on data labeled via active learning vs. data labeled via **random selection**. For each random selection round, we use majority vote consensus instead of ActiveLab consensus (in Step 1) and then randomly select the *n* examples to collect an additional label for instead of using ActiveLab scores (in Step 6).
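
As a rough illustration of majority vote consensus, here is a minimal sketch in plain Python. The tie-breaking rule (ties go to the smaller label) is an arbitrary assumption made for this sketch and is not claimed to match how `get_majority_vote_label` resolves ties; the annotations are made up.

```python
from collections import Counter


def majority_vote(annotations):
    """Return the most common label; ties go to the smaller label (arbitrary choice)."""
    counts = Counter(annotations)
    best = max(counts.values())
    return min(label for label, n in counts.items() if n == best)


# Each row holds the annotations collected so far for one example (toy values).
rows = [[1, 1, 0], [0], [1, 0]]
consensus = [majority_vote(r) for r in rows]
print(consensus)
```

The last row shows why noisy annotators make this baseline fragile: with one vote each way, the consensus label is decided by the tie-breaking rule alone.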

### Model training and evaluation

We will tokenize our train and test sets, and then initialize a pretrained DistilBERT model. We will fine-tune DistilBERT with 300 training steps. The model outputs predicted class probabilities which we convert to class predictions before evaluating their accuracy.
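
The conversion from model outputs to an accuracy value can be sketched in plain Python. This is only an illustration with made-up logits and labels; the notebook itself uses `scipy.special.softmax`, `np.argmax`, and `accuracy_score` inside `compute_metrics`.

```python
import math


def softmax(logits):
    """Numerically stable softmax for one row of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


logits = [[2.0, 0.0], [-1.0, 1.0], [0.5, 0.3]]  # toy model outputs
labels = [0, 1, 1]                               # toy ground truth labels

pred_probs = [softmax(row) for row in logits]
# The predicted class is the index of the largest probability.
preds = [max(range(len(p)), key=p.__getitem__) for p in pred_probs]
accuracy = sum(p == y for p, y in zip(preds, labels)) / len(labels)
print(preds, accuracy)
```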

### Using active learning scores to decide what to label next

During each round of active learning, we fit our transformer model via 3-fold cross-validation on the current training set. This gives us out-of-sample predicted class probabilities for each example in the training set, and we can also use the trained transformers to get out-of-sample predicted class probabilities for each example in the unlabeled pool. This is all done internally in the `get_pred_probs()` function. Using out-of-sample predictions helps us avoid bias due to potential overfitting.

Next, we pass the probabilistic predictions into the `get_active_learning_scores()` function from `cleanlab`. This method provides us with scores for all of our labeled and unlabeled data. Lower scores indicate data points for which collecting one additional label should be most informative for our current model (scores are directly comparable between labeled and unlabeled data).

Using the `get_idx_to_label()` function, we will form a batch of examples with the lowest scores.

### Adding new annotations

The `combined_example_ids` are the ids of the text examples we want to collect an annotation for. For each of these, we use the `get_annotation` function to collect a new annotation from an annotator. We prioritize annotators who have already annotated another example. If none of the annotators for the given example exist in the training set, we randomly select one; in this case, we add a new column to our training set to represent the new annotator. Finally, we add the newly collected annotation to the training set. If the corresponding example was previously non-annotated, we also add it to the training set and remove it from the unlabeled collection.
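
The bookkeeping described here can be sketched with nested dictionaries in plain Python. This is an assumption-laden illustration: the notebook stores annotations in a pandas DataFrame with one column per annotator, whereas the `train` and `unlabeled` structures and the `add_annotation` helper below are hypothetical and use made-up values.

```python
# Training annotations as {example id: {annotator id: label}}; toy values.
train = {"ex1": {"a16": 1.0}, "ex2": {"a16": 0.0}}
# Unlabeled pool as {example id: text}.
unlabeled = {"ex3": "Could you take a look?"}


def add_annotation(train, unlabeled, example_id, annotator, label):
    """Record one new annotation, moving the example out of the unlabeled pool if needed."""
    if example_id in unlabeled:
        del unlabeled[example_id]
        train[example_id] = {}
    train[example_id][annotator] = label


add_annotation(train, unlabeled, "ex3", "a22", 1.0)  # first label for a new example
add_annotation(train, unlabeled, "ex1", "a22", 1.0)  # second opinion on ex1

# The set of annotators plays the role of the annotator columns in the DataFrame.
annotators = sorted({a for row in train.values() for a in row})
print(train, unlabeled, annotators)
```

Annotator `a22` was not present before, so the first call is the analogue of adding a new annotator column to the training set.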

### Implementation

In [None]:
# For the active learning demo, we add 25 additional annotations
# to the training set in each iteration, for 25 rounds
num_rounds = 25
batch_size_to_label = 25
model_accuracy_arr = np.full(num_rounds, np.nan)

# The `selection_method` determines if we use ActiveLab or random selection
# to choose the new annotation each round
selection_method = 'random'
# selection_method = 'active_learning'

In [None]:
# During each round, we
# - train our model
# - evaluate on unchanging test set
# - collect and add new annotations to training set
for i in range(num_rounds):
    # X_labeled_full is updated each iteration.
    # We drop the text column which leaves us with the annotations
    multiannotator_labels = X_labeled_full.drop('text', axis=1)

    # Use majority vote when using random selection to select the consensus label for each example
    if i == 0 or selection_method == 'random':
        consensus_labels = get_majority_vote_label(multiannotator_labels)

    # When using ActiveLab, use cleanlab's CrowdLab to select the consensus label for each example
    else:
        results = get_label_quality_multiannotator(
            multiannotator_labels,
            pred_probs_labeled,
            calibrate_probs=True
        )
        consensus_labels = results['label_quality']['consensus_label'].values

    # We only need the text and label columns
    train_set = X_labeled_full[['text']].copy()
    train_set['label'] = consensus_labels
    test_set = test[['text', 'label']]

    # Train transformer model on the full set of labeled data
    # to evaluate model accuracy for the current round.
    # This evaluation is only for the demo;
    # in practical applications, we may not have ground truth labels
    trainer = get_trainer(train_set, test_set)
    trainer.train()
    eval_metrics = trainer.evaluate()
    model_accuracy_arr[i] = eval_metrics['eval_accuracy']

    # For ActiveLab, we need to run cross-validation to get out-of-sample predicted probabilities
    if selection_method == 'active_learning':
        pred_probs, pred_probs_unlabeled = get_pred_probs(train_set, X_unlabeled)

        # Compute active learning scores
        active_learning_scores, active_learning_scores_unlabeled = get_active_learning_scores(
            multiannotator_labels,
            pred_probs,
            pred_probs_unlabeled
        )

        # Get the indices of examples to collect more labels for
        chosen_examples_labeled, chosen_examples_unlabeled = get_idx_to_label(
            X_labeled_full,
            X_unlabeled,
            extra_annotations,
            batch_size_to_label,
            active_learning_scores,
            active_learning_scores_unlabeled
        )

    # We do not need to run cross-validation,
    # just get random examples to collect annotations for
    if selection_method == 'random':
        chosen_examples_labeled, chosen_examples_unlabeled = get_idx_to_label_random(
            X_labeled_full,
            X_unlabeled,
            extra_annotations,
            batch_size_to_label
        )

    unlabeled_example_ids = np.array([])
    # Check to see if we still have unlabeled examples left
    if X_unlabeled is not None:
        # Get unlabeled text examples we want to collect annotations for
        new_text = X_unlabeled.iloc[chosen_examples_unlabeled]
        unlabeled_example_ids = new_text.index.values

        num_ex = len(new_text)
        num_annot = multiannotator_labels.shape[1]
        empty_annot = pd.DataFrame(
            data=np.full((num_ex, num_annot), np.nan),
            columns=multiannotator_labels.columns,
            index=unlabeled_example_ids
        )
        new_unlabeled_df = pd.concat([new_text, empty_annot], axis=1)

        # Combine unlabeled text examples with existing, labeled examples
        X_labeled_full = pd.concat([X_labeled_full, new_unlabeled_df], axis=0)

        # Remove examples from X_unlabeled and check if empty.
        # Once it is empty we set it to None to handle appropriately elsewhere.
        X_unlabeled = X_unlabeled.drop(new_text.index)
        if X_unlabeled.empty:
            X_unlabeled = None

    if selection_method == 'active_learning':
        # Update `pred_prob` arrays with newly added examples if necessary
        if pred_probs_unlabeled is not None and len(chosen_examples_unlabeled) != 0:
            pred_probs_new = pred_probs_unlabeled[chosen_examples_unlabeled, :]
            pred_probs_labeled = np.concatenate((pred_probs, pred_probs_new))
            pred_probs_unlabeled = np.delete(pred_probs_unlabeled, chosen_examples_unlabeled, axis=0)
        else:
            # otherwise we have nothing to modify
            pred_probs_labeled = pred_probs

    # Get combined list of text ID's to relabel
    labeled_example_ids = X_labeled_full.iloc[chosen_examples_labeled].index.values
    combined_example_ids = np.concatenate([labeled_example_ids, unlabeled_example_ids])

    # We collect annotations for the selected examples
    for example_id in combined_example_ids:
        # choose which annotator to collect annotation from
        chosen_annotator = get_annotator(example_id)
        # collect new annotation
        new_annotation = get_annotation(example_id, chosen_annotator)
        # add a column if this annotator is new to the training set
        # (check X_labeled_full, since multiannotator_labels is stale within this loop)
        if chosen_annotator not in X_labeled_full.columns:
            empty_col = np.full((len(X_labeled_full),), np.nan)
            X_labeled_full[chosen_annotator] = empty_col

        # add selected annotation to the training set
        X_labeled_full.at[example_id, chosen_annotator] = new_annotation

## Results

In [None]:
# Get numpy array of results.
!wget -nc -O 'activelearn_acc.npy' 'https://huggingface.co/datasets/Cleanlab/stanford-politeness/resolve/main/activelearn_acc.npy'
!wget -nc -O 'random_acc.npy' 'https://huggingface.co/datasets/Cleanlab/stanford-politeness/resolve/main/random_acc.npy'

In [None]:
al_acc = np.load("activelearn_acc.npy")
rand_acc = np.load("random_acc.npy")

rand_acc_std = compute_std_dev(rand_acc)
al_acc_std = compute_std_dev(al_acc)

plt.plot(range(1, al_acc.shape[1] + 1), np.mean(al_acc, axis=0), label="active learning", color="green")
plt.fill_between(range(1, al_acc.shape[1] + 1), al_acc_std[0], al_acc_std[1], alpha=0.3, color="green")

plt.plot(range(1, rand_acc.shape[1] + 1), np.mean(rand_acc, axis=0), label="random", color="red")
plt.fill_between(range(1, rand_acc.shape[1] + 1), rand_acc_std[0], rand_acc_std[1], alpha=0.1, color="red")

plt.hlines(y=0.9, xmin=1.0, xmax=25.0, color="black", linestyle="dotted")
plt.legend()
plt.xlabel("Round Number")
plt.ylabel("Test Accuracy")
plt.title("ActiveLab vs Random Annotation Selection --- 5 Runs")
plt.savefig("al-results.png")
plt.show()
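
The shaded bands in the plot are the per-round mean plus and minus one standard deviation across runs, as computed by `compute_std_dev`. The following is a minimal sketch of that calculation in plain Python; the accuracies are made up and are not the results stored in the `.npy` files.

```python
from statistics import mean, pstdev

# Toy accuracies: 3 runs (rows) x 2 rounds (columns); values are made up.
acc = [[0.80, 0.86], [0.82, 0.88], [0.84, 0.90]]

# Transpose so that each tuple holds the accuracies of all runs for one round.
rounds = list(zip(*acc))
means = [mean(r) for r in rounds]
# pstdev is the population standard deviation, matching np.std's default (ddof=0).
lower = [m - pstdev(r) for m, r in zip(means, rounds)]
upper = [m + pstdev(r) for m, r in zip(means, rounds)]
print(means, lower, upper)
```

A narrow band means the runs agree closely in that round; a wide band signals that a single run would be a poor guide to the method's typical accuracy.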