# Annotating and Active Learning with Argilla

This notebook was set up for the Congruence Engine project. The goal is to build a simple NLP framework for annotation and active learning, targeted specifically at named entity recognition (NER).

The framework is built with Argilla, and the notebook can be run locally.

This notebook is based on [this tutorial](https://docs.argilla.io/en/latest/tutorials/notebooks/labelling-tokenclassification-spacy-pretrained.html).

In [3]:
# Report the RAM of this runtime (note: virtual_memory().total is total, not currently free, memory)
from psutil import virtual_memory
ram_gb = virtual_memory().total / 1e9
print('\n\nYour runtime has {:.1f} gigabytes of available RAM\n'.format(ram_gb))



Your runtime has 17.2 gigabytes of available RAM



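Argilla requires a running Elasticsearch instance. The quickstart Docker image used in the local install below bundles one together with the Argilla server.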

## Local install

Download and install [Docker Desktop](https://docs.docker.com/desktop/install/mac-install/) (the link points to the macOS installer), then start the Argilla quickstart container:

In [4]:
!docker run -d --name quickstart -p 6900:6900 argilla/argilla-quickstart:latest

docker: Error response from daemon: Conflict. The container name "/quickstart" is already in use by container "2bc3fbe0af9c4da7a09efd086a2adef350ab78468fe0741877e4c91e4d9e4c66". You have to remove (or rename) that container to be able to reuse that name.
See 'docker run --help'.
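The error above means a container named `quickstart` already exists from an earlier run, so `docker run` cannot create a second one with the same name. A minimal sketch of one way to handle this from Python, assuming the Docker CLI is on the PATH: reuse the existing container with `docker start` if it is there, and otherwise create it with the same `docker run` command as in the cell above. The helper names `choose_docker_command`, `existing_container_names` and `ensure_quickstart` are hypothetical and not part of Docker or Argilla.

```python
import subprocess

CONTAINER = "quickstart"
IMAGE = "argilla/argilla-quickstart:latest"


def choose_docker_command(existing_names, name=CONTAINER, image=IMAGE):
    """Return the docker command to run, given the names of existing containers."""
    if name in existing_names:
        # The container already exists (running or stopped): start it again.
        return ["docker", "start", name]
    # No container with this name yet: create it, as in the cell above.
    return ["docker", "run", "-d", "--name", name, "-p", "6900:6900", image]


def existing_container_names():
    """List the names of all containers, including stopped ones."""
    result = subprocess.run(
        ["docker", "ps", "-a", "--format", "{{.Names}}"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.split()


def ensure_quickstart():
    """Start the quickstart container, creating it only if it does not exist."""
    command = choose_docker_command(existing_container_names())
    subprocess.run(command, check=True)
```

Calling `ensure_quickstart()` replaces the `!docker run ...` cell. To start from a clean server instead, the old container can be removed first with `docker rm -f quickstart`; this discards the container and whatever was stored inside it.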


Go to [http://localhost:6900](http://localhost:6900) and log in with the quickstart defaults: username `admin` and password `12345678`.

In [17]:
import argilla as rg

In [6]:
# import os
# import argilla as rg

# rg.init(
#     api_url=os.environ.get("ARGILLA_API_URL"),
#     api_key=os.environ.get("admin.apikey"),
#     workspace="my_workspace",
#     extra_headers={"X-Argilla-Workspace": "my_connection_headers"}
# )

In [7]:
#!pwd

/Users/kasparbeelen/Documents/NER-AL


In [8]:
#!export ARGILLA_LOCAL_AUTH_USERS_DB_FILE=/Users/kasparbeelen/Documents/NER-AL/users.yaml

In [18]:
# Replace api_url with the url to your HF Spaces URL if using Spaces
# Replace api_key if you configured a custom API key
rg.init(
    api_url="http://localhost:6900",
    api_key="admin.apikey")
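Hard-coding the URL and key works for the local quickstart, but it becomes awkward once the server runs elsewhere or uses a custom key. A small sketch of reading both values from environment variables and falling back to the local defaults used above. The variable name `ARGILLA_API_URL` appears in the commented-out cell earlier; `ARGILLA_API_KEY` and the helper `argilla_settings` are assumptions made for this illustration.

```python
import os


def argilla_settings(env=None):
    """Build keyword arguments for rg.init from environment variables."""
    env = os.environ if env is None else env
    return {
        # Fall back to the local quickstart values used in this notebook.
        "api_url": env.get("ARGILLA_API_URL", "http://localhost:6900"),
        "api_key": env.get("ARGILLA_API_KEY", "admin.apikey"),
    }


# Usage in this notebook (requires the running server):
# rg.init(**argilla_settings())
```

Passing a plain dictionary as `env` makes the helper easy to check without touching the real environment.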

In [None]:
!pip install datasets "spacy[transformers]~=3.0" protobuf -qqq

In [None]:
!python -m spacy download en_core_web_trf
!python -m spacy download en_core_web_sm

In [19]:
from datasets import load_dataset
import pandas as pd
import spacy
from tqdm.auto import tqdm

In [20]:
dataset = load_dataset("gutenberg_time", split="train", streaming=True)

# Let's have a look at the first 5 examples of the train set.
pd.DataFrame(dataset.take(5))

| | guten_id | hour_reference | time_phrase | is_ambiguous | time_pos_start | time_pos_end | tok_context |
|---|---|---|---|---|---|---|---|
| 0 | 4447 | 5 | five o'clock | True | 145 | 147 | I crossed the ground she had traversed , notin... |
| 1 | 4447 | 12 | the fall of the winter noon | True | 68 | 74 | So profoundly penetrated with thoughtfulness w... |
| 2 | 28999 | 12 | midday | True | 46 | 47 | And here is Hendon , and it is time for us to ... |
| 3 | 28999 | 12 | midday | True | 133 | 134 | Sorrows and trials she had had in plenty in he... |
| 4 | 28999 | 0 | midnight | True | 43 | 44 | Jeannie joined her friend in the window-seat .... |


In [21]:
nlp = spacy.load("en_core_web_trf")

# Creating an empty record list to save all the records
records = []

# Iterate over the first 50 examples of the Gutenberg dataset
for record in tqdm(list(dataset.take(50))):
    # We only need the text of each instance
    text = record["tok_context"]

    # spaCy Doc creation
    doc = nlp(text)

    # Entity annotations
    entities = [(ent.label_, ent.start_char, ent.end_char) for ent in doc.ents]

    # Pre-tokenized input text
    tokens = [token.text for token in doc]

    # Argilla TokenClassificationRecord list
    records.append(
        rg.TokenClassificationRecord(
            text=text,
            tokens=tokens,
            prediction=entities,
            prediction_agent="en_core_web_trf",
        )
    )

  0%|          | 0/50 [00:00<?, ?it/s]
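Each prediction built in the loop above is a `(label, start_char, end_char)` tuple whose offsets index into `text`. A small sketch, independent of spaCy and Argilla, for sanity-checking such tuples before logging them: it recovers the surface string of each span and flags spans that do not start and end on token boundaries. The helper names and the example sentence are made up for illustration, and the sketch assumes the tokens occur in order in the text.

```python
def token_offsets(text, tokens):
    """Locate each token in text, scanning left to right; returns (start, end) pairs."""
    offsets, cursor = [], 0
    for token in tokens:
        start = text.index(token, cursor)  # raises ValueError if a token is missing
        end = start + len(token)
        offsets.append((start, end))
        cursor = end
    return offsets


def describe_entities(text, tokens, entities):
    """Return (label, surface string, aligned?) for each (label, start, end) tuple."""
    offsets = token_offsets(text, tokens)
    starts = {start for start, _ in offsets}
    ends = {end for _, end in offsets}
    return [
        (label, text[start:end], start in starts and end in ends)
        for label, start, end in entities
    ]


# Made-up example; in the notebook, text, tokens and entities come from the loop above.
text = "Jeannie arrived at midnight ."
tokens = ["Jeannie", "arrived", "at", "midnight", "."]
entities = [("PERSON", 0, 7), ("TIME", 19, 27), ("TIME", 19, 25)]
print(describe_entities(text, tokens, entities))
```

The last tuple deliberately ends in the middle of a token, so it is reported as not aligned.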

In [22]:
#rg.set_workspace("my_private_workspace")
rg.log(records=records, name="gutenberg_spacy_ner")

BulkResponse(dataset='gutenberg_spacy_ner', processed=50, failed=0)