# Annotating and Active Learning with Argilla

This notebook was set up for the Congruence Engine project. The goal is to build a simple NLP framework for annotation and active learning, targeted specifically at named entity recognition.

The notebook sets up this annotation and active learning framework with Argilla. It can be run locally or on Google Colab.

This tutorial builds on the excellent [blog post and notebook](https://docs.argilla.io/en/latest/tutorials/notebooks/deploying-textclassification-colab-activelearning.html) by [Moritz Laurer](https://www.linkedin.com/in/moritz-laurer/).

Run this notebook on Colab

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/kasparvonbeelen/NER-AL/blob/1-setup/argilla_setup_ner_pipeline.ipynb)



In [None]:
# if using Colab, download requirements.txt with wget
# by uncommenting the line below
# (no -i flag: -i would treat the file's contents as a list of URLs)
# !wget https://raw.githubusercontent.com/kasparvonbeelen/NER-AL/1-setup/requirements.txt

In [1]:
# install packages
!pip install -r requirements.txt

Collecting argilla[listeners,server]==1.1.1
  Downloading argilla-1.1.1-py3-none-any.whl (1.7 MB)
Collecting transformers[sentencepiece]~=4.25.1
  Downloading transformers-4.25.1-py3-none-any.whl (5.8 MB)
Collecting datasets~=2.7.1
  Downloading datasets-2.7.1-py3-none-any.whl (451 kB)
Collecting small-text[transformers]~=1.1.1
  Downloading small_text-1.1.1-py3-none-any.whl (178 kB)
Collecting colab-xterm~=0.1.2
  Downloading colab_xterm-

Collecting dill<0.3.7
  Using cached dill-0.3.6-py3-none-any.whl (110 kB)
Collecting scipy
  Using cached scipy-1.10.1-cp39-cp39-macosx_12_0_arm64.whl (28.9 MB)
Collecting torchtext>=0.7.0
  Downloading torchtext-0.15.2-cp39-cp39-macosx_11_0_arm64.whl (2.1 MB)
Collecting torch>=1.6.0
  Downloading torch-2.0.1-cp39-none-macosx_11_0_arm64.whl (55.8 MB)
Collecting brotli>=1.0.7
  Downloading Brotli-1.0.9-cp39-cp39-macosx_10_9_universal2.whl (786 kB)

Collecting mpmath>=0.19
  Using cached mpmath-1.3.0-py3-none-any.whl (536 kB)
Building wheels for collected packages: pyngrok, prodict, psutil, PyYAML, wrapt
  Building wheel for pyngrok (setup.py) ... done
  Created wheel for pyngrok: filename=pyngrok-5.2.3-py3-none-any.whl size=19864 sha256=d0449736cdec3fe86727db7058c1380d2ae761dacdaa4193f947e68e55275693
  Stored in directory: /Users/kasparbeelen/Library/Caches/pip/wheels/7b/26/1d/fc3a749c956dff3c7abc8b07a8b6917a7ee926eeaf0bf09ad1
  Building wheel for prodict (setup.py) ... done
  Created wheel for prodict: filename=prodict-0.8.18-py3-none-any.whl size=4203 sha256=bf67ccd3dac132d98d4b4be5d5a594b5bc8f2f49d8538adcc875ae8e538e47b1
  Stored in directory: /Users/kasparbeelen/Library/Caches/pip/wheels/da/0c/ae/1d7e040733ae90d0356f0db8d54e7b36e6736516ff909b407e
  Building wheel for psutil (setup.py) ... done
  Created wheel for psutil: filename=psutil-5.8.0-cp39-cp39-macosx_11_0_arm64.whl size=234619 sha2

In [2]:
# info on the hardware you are using - either a CPU or GPU
!nvidia-smi

zsh:1: command not found: nvidia-smi
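
On machines without an NVIDIA driver (such as the macOS run above), `nvidia-smi` is simply not on the PATH, so the shell error is harmless. A minimal sketch, using only the standard library, that checks for the tool before calling it; the helper name `describe_gpu` is hypothetical:

```python
import shutil
import subprocess


def describe_gpu() -> str:
    """Return the output of `nvidia-smi`, or a short note if it is unavailable."""
    exe = shutil.which("nvidia-smi")
    if exe is None:
        return "nvidia-smi not found: no NVIDIA GPU driver is visible on this machine."
    result = subprocess.run([exe], capture_output=True, text=True)
    if result.returncode != 0:
        return f"nvidia-smi failed with exit code {result.returncode}."
    return result.stdout


print(describe_gpu())
```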


In [3]:
# info on the total RAM of the runtime (psutil reports it in bytes)
from psutil import virtual_memory
ram_gb = virtual_memory().total / 1e9
print('\n\nYour runtime has {:.1f} gigabytes of available RAM\n'.format(ram_gb))



Your runtime has 17.2 gigabytes of available RAM



Argilla requires an Elasticsearch installation.

## Local install

Download [Docker Desktop](https://docs.docker.com/desktop/install/mac-install/).

In [8]:
!docker run -d --name quickstart -p 6900:6900 argilla/argilla-quickstart:latest

zsh:1: command not found: docker


Go to [http://localhost:6900](http://localhost:6900) and log in with username `admin` and password `12345678`.

## On Colab

In [9]:
%%bash

wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-7.10.2-linux-x86_64.tar.gz -q
tar -xzf elasticsearch-7.10.2-linux-x86_64.tar.gz
chown -R daemon:daemon elasticsearch-7.10.2

In [None]:
%%bash --bg

sudo -u daemon -- elasticsearch-7.10.2/bin/elasticsearch

In [None]:
import time
time.sleep(30)  # give Elasticsearch time to start up; otherwise downstream cells will fail
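
A fixed 30-second sleep can be too short on a slow runtime and needlessly long on a fast one. A hedged alternative is to poll the Elasticsearch HTTP endpoint until it answers. The sketch below assumes Elasticsearch listens on its default port 9200; the helper name `wait_for_http` and its parameters are illustrative and not part of any library.

```python
import time
import urllib.error
import urllib.request


def wait_for_http(url: str, timeout: float = 60.0, interval: float = 2.0) -> bool:
    """Poll `url` until it answers with HTTP 200 or `timeout` seconds have passed."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=interval) as response:
                if response.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # server is not accepting connections yet
        time.sleep(interval)
    return False


# Example usage (assumes the default Elasticsearch port):
# if not wait_for_http("http://localhost:9200"):
#     raise RuntimeError("Elasticsearch did not start in time")
```

Returning a boolean rather than raising leaves it to the calling cell to decide whether a slow start is fatal.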

In [None]:
# create a terminal to run Argilla with, in case you don't have Colab Pro.
# type "python -m argilla" into the terminal that appears below this code cell.
%load_ext colabxterm
%xterm

Create a free ngrok account by following the instructions [here](https://ngrok.com/).

Then copy your authtoken from the [ngrok dashboard](https://dashboard.ngrok.com/auth) and run the code below.

In [10]:
import getpass
from pyngrok import ngrok, conf

print("Enter your authtoken, which can be copied from https://dashboard.ngrok.com/auth")
print("You need to create a free ngrok account to get an authtoken. The token looks something like this: ASDO1283YZaDu95vysXYIUXZXYRR_54YfASDIb8cpNfVoz349587")
conf.get_default().auth_token = getpass.getpass()
# if the above does not work, you can try:
# ngrok.set_auth_token("<INSERT_YOUR_NGROK_AUTHTOKEN>")

Enter your authtoken, which can be copied from https://dashboard.ngrok.com/auth
You need to create a free ngrok account to get an authtoken. The token looks something like this: ASDO1283YZaDu95vysXYIUXZXYRR_54YfASDIb8cpNfVoz349587
········


In [None]:
# disconnect all existing tunnels to avoid issues when rerunning cells
for tunnel in ngrok.get_tunnels():
    ngrok.disconnect(tunnel.public_url)

# create the public link
# ! check in the terminal above which localhost port Argilla is actually running on,
# e.g. 6900 if the terminal displays something like "Uvicorn running on http://0.0.0.0:6900"
ngrok_tunnel = ngrok.connect(6900)
print("You can now access the Argilla localhost with the public link below. (It should look something like 'http://X03b-34-XXX-237-25.ngrok.io')\n")
print(f"Your ngrok public link: {ngrok_tunnel.public_url}\n")
print("After clicking on the link, there will be a warning, which you can ignore")
print("You can then log in with the default Argilla username 'argilla' and password '1234'")

## Loading Data and Annotation

In [11]:
# load dataset
import datasets
dataset_name = "trec"
dataset_hf = datasets.load_dataset(dataset_name, version=datasets.Version("2.0.0"))
# we work with only a sixth of the texts of the dataset for faster testing
dataset_hf["train"] = dataset_hf["train"].shard(num_shards=6, index=0)


Downloading and preparing dataset trec/default to /Users/kasparbeelen/.cache/huggingface/datasets/trec/default/2.0.0/f2469cab1b5fceec7249fda55360dfdbd92a7a5b545e91ea0f78ad108ffac1c2...

Dataset trec downloaded and prepared to /Users/kasparbeelen/.cache/huggingface/datasets/trec/default/2.0.0/f2469cab1b5fceec7249fda55360dfdbd92a7a5b545e91ea0f78ad108ffac1c2. Subsequent calls will reuse this data.

In [12]:
## choose the transformer and load tokenizer
import torch
from transformers import AutoTokenizer

# Choose transformer model: In non-gpu environments we use a tiny model to increase efficiency
if not torch.cuda.is_available():
    transformer_model = "prajjwal1/bert-tiny"
    print(f"No GPU is available, we therefore use the small model '{transformer_model}' for the active learning loop.\n")
else:
    transformer_model = "microsoft/deberta-v3-xsmall"  #"bert-base-uncased"
    print(f"A GPU is available, we can therefore use '{transformer_model}' for the active learning loop.\n")

# Init tokenizer
tokenizer = AutoTokenizer.from_pretrained(transformer_model)


No GPU is available, we therefore use the small model 'prajjwal1/bert-tiny' for the active learning loop.




In [13]:
## create a small-text TransformersDataset object
import numpy as np
from small_text import TransformersDataset

num_classes = dataset_hf["train"].features["coarse_label"].num_classes
target_labels = np.arange(num_classes)

train_text = [row["text"] for row in dataset_hf["train"]]
train_labels = np.array([row["coarse_label"] for row in dataset_hf["train"]])

# Create the dataset for small-text
dataset_st = TransformersDataset.from_arrays(
    train_text, train_labels, tokenizer, target_labels=target_labels
)

# Create test dataset
test_text = [row["text"] for row in dataset_hf["test"]]
test_labels = np.array([row["coarse_label"] for row in dataset_hf["test"]])

dataset_test = TransformersDataset.from_arrays(
    test_text, test_labels, tokenizer, target_labels=target_labels
)





In [17]:
## setting up the active learner
from small_text import (
    BreakingTies,
    PoolBasedActiveLearner,
    TransformerBasedClassificationFactory,
    TransformerModelArguments,
)

# Define our classifier
device = "cuda" if torch.cuda.is_available() else "cpu"
print("Using device: ", device)

# higher values (around 40) will probably improve performance on small datasets,
# but the active learning loop will take longer
num_epochs = 5
clf_factory = TransformerBasedClassificationFactory(
    TransformerModelArguments(transformer_model),
    num_classes=num_classes,
    kwargs={
        "device": device,
        "num_epochs": num_epochs,
        "lr": 2e-05,
        "mini_batch_size": 8,
        "early_stopping_no_improvement": 5,
    },
)


# Define our query strategy
query_strategy = BreakingTies()

# Use the active learner with a pool containing all unlabeled data
active_learner = PoolBasedActiveLearner(clf_factory, query_strategy, dataset_st)


Using device:  cpu
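
`BreakingTies` is an uncertainty-based query strategy: it prefers examples where the gap between the two most probable classes is smallest. The sketch below illustrates that idea in plain Python on made-up probabilities. It is not small-text's implementation, and `breaking_ties_query` is a hypothetical helper.

```python
from typing import List, Sequence


def breaking_ties_query(probas: Sequence[Sequence[float]], n: int) -> List[int]:
    """Return the indices of the `n` rows with the smallest top-2 probability margin."""
    margins = []
    for idx, row in enumerate(probas):
        top1, top2 = sorted(row, reverse=True)[:2]
        margins.append((top1 - top2, idx))
    margins.sort()  # smallest margin (most ambiguous prediction) first
    return [idx for _, idx in margins[:n]]


# Made-up class probabilities for four unlabeled examples
probas = [
    [0.90, 0.05, 0.05],  # confident prediction
    [0.40, 0.38, 0.22],  # top two classes nearly tied
    [0.55, 0.30, 0.15],
    [0.34, 0.33, 0.33],  # almost uniform
]
print(breaking_ties_query(probas, n=2))  # → [3, 1]
```

The two selected rows are the ones a human label would help most, because the model cannot separate their top two classes.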


In [18]:
## draw an initial sample for the first annotation round
# https://small-text.readthedocs.io/en/v1.1.1/components/initialization.html
from small_text import random_initialization, random_initialization_stratified, random_initialization_balanced
import numpy as np

# Fix seed for reproducibility
np.random.seed(42)

# Number of samples in our queried batches
NUM_SAMPLES = 10

# Draw an initial subset from the data pool
#initial_indices = random_initialization(dataset_st, NUM_SAMPLES)
#initial_indices = random_initialization_balanced(train_labels, NUM_SAMPLES)
initial_indices = random_initialization_stratified(train_labels, NUM_SAMPLES)
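
The three initialization helpers differ in how they treat the class distribution: plain random sampling ignores labels, balanced sampling aims for roughly equal counts per class, and stratified sampling tries to mirror the class proportions of the pool. A rough standard-library sketch of the stratified idea follows. It is not small-text's actual implementation; the function `stratified_sample` and the toy labels are made up for illustration.

```python
import random
from collections import defaultdict
from typing import Dict, List, Sequence


def stratified_sample(labels: Sequence[int], n_samples: int, seed: int = 42) -> List[int]:
    """Draw `n_samples` indices so that class shares roughly follow `labels`."""
    if n_samples > len(labels):
        raise ValueError("n_samples must not exceed the number of labels")
    rng = random.Random(seed)
    by_class: Dict[int, List[int]] = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    total = len(labels)
    # Quota per class: proportional share, rounded down
    quotas = {c: n_samples * len(ix) // total for c, ix in by_class.items()}
    # Remaining slots go to the classes with the largest fractional remainder
    leftover = n_samples - sum(quotas.values())
    by_remainder = sorted(
        by_class, key=lambda c: (n_samples * len(by_class[c])) % total, reverse=True
    )
    for c in by_remainder[:leftover]:
        quotas[c] += 1

    chosen: List[int] = []
    for c, quota in quotas.items():
        chosen.extend(rng.sample(by_class[c], quota))
    return sorted(chosen)


# Toy pool: 60% class 0, 30% class 1, 10% class 2
toy_labels = [0] * 6 + [1] * 3 + [2] * 1
print(stratified_sample(toy_labels, n_samples=5))
```

With only a handful of initial samples, rare classes may receive no example at all under stratified sampling, which is one reason to consider the balanced variant for the first batch.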


In [20]:
### log the first data to Argilla
import argilla as rg

# Choose a name for the dataset
DATASET_NAME = f"{dataset_name}_with_active_learning"

# Define labeling schema
labels = dataset_hf["train"].features["coarse_label"].names
settings = rg.TextClassificationSettings(label_schema=labels)

# Create dataset with a label schema
rg.configure_dataset(name=DATASET_NAME, settings=settings)

UnauthorizedApiError: Argilla server returned an error with http status: 401
Error details: [{'code': 'argilla.api.errors::UnauthorizedError', 'params': {'detail': 'Could not validate credentials'}}]
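
A 401 response like the one above usually means the client has not been given valid credentials for the server. In Argilla 1.x these are typically supplied through `rg.init(api_url=..., api_key=...)` before any other call. The sketch below assumes the server from the earlier cells listens on port 6900 and that you export your own key in an `ARGILLA_API_KEY` environment variable. The helper names are hypothetical, and the key value depends on your deployment.

```python
import os
from typing import Mapping


def argilla_connection_settings(env: Mapping[str, str] = os.environ) -> dict:
    """Collect the URL and API key for the Argilla client from environment variables."""
    api_key = env.get("ARGILLA_API_KEY")
    if not api_key:
        raise ValueError("Set ARGILLA_API_KEY to the API key of your Argilla user.")
    return {
        # assumption: the server runs locally on port 6900 unless told otherwise
        "api_url": env.get("ARGILLA_API_URL", "http://localhost:6900"),
        "api_key": api_key,
    }


def connect_to_argilla() -> None:
    """Authenticate the Argilla client; call this before `rg.configure_dataset`."""
    import argilla as rg  # imported lazily so the helper above works without argilla

    rg.init(**argilla_connection_settings())
```

Keeping the key in an environment variable avoids committing credentials to the notebook.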

In [19]:


# Create records from the initial batch
records = [
    rg.TextClassificationRecord(
        text=dataset_hf["train"]["text"][idx],
        metadata={"batch_id": 0},
        id=idx,
    )
    for idx in initial_indices
]

# Log initial records to Argilla
rg.log(records, DATASET_NAME)


UnauthorizedApiError: Argilla server returned an error with http status: 401
Error details: [{'code': 'argilla.api.errors::UnauthorizedError', 'params': {'detail': 'Could not validate credentials'}}]