
Datasets:
- With `LOAD_DATASETS=full`, Argilla loads `gutenberg_spacy-ner-monitoring` for token classification with default spaCy predictions; it is a fork of https://huggingface.co/datasets/gutenberg_time
- The default NER dataset in papers is CoNLL-2003: https://huggingface.co/datasets/conll2003
- https://huggingface.co/datasets/DFKI-SLT/few-nerd
- https://huggingface.co/datasets/tner/ontonotes5
- Look for Argilla-compatible NER datasets with this search: https://huggingface.co/datasets?task_categories=task_categories:token-classification&sort=trending&search=argilla


- ✨ Provide suggested spans with a confidence score, so your team doesn't need to start from scratch.


In [140]:
from typing import List, Tuple, Union, Dict
import types

In [108]:
import argilla as rg

rg.init(api_url="http://localhost:6900", api_key="admin.apikey")



# Load the CoNLL-2003 research dataset into Argilla

In [144]:
def template_for_token_classification(
    labels: Dict[str, str] = {"PER": "Person", "ORG": "Organization", "LOC": "Location", "MISC": "Other"}
) -> rg.FeedbackDataset:
    """Create a dataset with a span question for NER or POS tagging or information retrieval tasks.
    
    There is no pre-defined template in argilla yet, so we define a custom dataset instead.
    The high-level API of this method is TBD.
    ref: https://docs.argilla.io/en/latest/practical_guides/create_update_dataset/create_dataset.html#define-questions + click on Span
    """
    dataset = rg.FeedbackDataset(
        fields=[rg.TextField(name="text")],
        questions=[
            rg.SpanQuestion(
                name="entities",
                title="Highlight the entities in the text:",
                labels=labels,
                field="text", # the field where you want to do the span annotation
                required=True,
                allow_overlapping=True
            )
        ]
    )
    return dataset

In [110]:
dataset = template_for_token_classification()



In [111]:
from datasets import load_dataset, Features, Sequence, ClassLabel, Value, DatasetDict

def load_conll():
    classmap = ClassLabel(names=['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC', 'B-MISC', 'I-MISC'])
    return (
        load_dataset("conll2003")
        .map(lambda sample: {"parsed_ner_tags": classmap.int2str(sample["ner_tags"])})
    )

In [112]:
conll2003 = load_conll()

In [135]:
from spacy.tokens import Doc
from spacy.vocab import Vocab
from spacy.training.iob_utils import iob_to_biluo, biluo_tags_to_offsets
from argilla.client.feedback.schemas import SpanValueSchema

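def tags_to_entities(row: dict, tokens="tokens", parsed_ner_tags="parsed_ner_tags") -> List[SpanValueSchema]:
    # A Doc built from words alone puts a single space after every token,
    # so the character offsets line up with " ".join(row[tokens]).
    doc = Doc(Vocab(), words=row[tokens])
    # Use the column name passed as argument instead of a hardcoded key.
    offsets = biluo_tags_to_offsets(doc, iob_to_biluo(row[parsed_ner_tags]))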

    return [
        SpanValueSchema(
            start=start, # position of the first character of the span
            end=stop, # position of the character right after the end of the span
            label=entity,
            score=1.0
        ) for start, stop, entity in offsets
    ]

In [136]:
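from datasets import Dataset
from tqdm import tqdm

# The function iterates over the rows of a single split (a Dataset), not a DatasetDict.
def dataset_to_records(dataset: Dataset, agent, tokens="tokens"):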
    for row in tqdm(dataset):
        text = " ".join(row[tokens])  # we assume the tokens are clean, and we disregard more tokenizer details

        # Seems like we have "empty" rows
        if not text.strip():
            continue

        yield rg.FeedbackRecord(
            fields={"text": text},
            suggestions=[
                {
                    "question_name": "entities",
                    # forward the token column name so both helpers read the same column
                    "value": tags_to_entities(row, tokens=tokens),
                    "agent": agent,
                }
            ]
        )

In [116]:
# Add the CoNLL-2003 validation records to the local dataset
dataset.add_records(list(dataset_to_records(conll2003['validation'], "gold_labels")))

100%|██████████| 3250/3250 [00:00<00:00, 6520.85it/s]


In [117]:
dataset.push_to_argilla(name="dev-ner-conll2003", workspace="admin")

RemoteFeedbackDataset(
   id=60d31698-9ad9-47db-9e5f-d1e4f7d8b71b
   name=my-first-dataset
   workspace=Workspace(id=84a8fb6f-3350-4e9b-97c0-043cfedef934, name=admin, inserted_at=2024-05-14 17:08:20.825501, updated_at=2024-05-14 17:08:20.825501)
   url=http://localhost:6900/dataset/60d31698-9ad9-47db-9e5f-d1e4f7d8b71b/annotation-mode
   fields=[RemoteTextField(id=UUID('6b46a2c5-72ec-42d8-8392-9173007a025e'), client=None, name='text', title='Text', required=True, type='text', use_markdown=False)]
   questions=[RemoteSpanQuestion(id=UUID('12bbb02d-c899-4d42-ba40-8dc95985cbaf'), client=None, name='entities', title='Highlight the entities in the text:', description=None, required=True, type='span', field='text', labels=[SpanLabelOption(value='PER', text='Person', description=None), SpanLabelOption(value='ORG', text='Organization', description=None), SpanLabelOption(value='LOC', text='Location', description=None), SpanLabelOption(value='MISC', text='Other', description=None)], visible_label

# Load OntoNotes research dataset into Argilla

In [128]:
import collections

def load_ontonotes():
    ontonotes5_labels_raw = {"O": 0, "B-CARDINAL": 1, "B-DATE": 2, "I-DATE": 3, "B-PERSON": 4, "I-PERSON": 5, "B-NORP": 6, "B-GPE": 7, "I-GPE": 8, "B-LAW": 9, "I-LAW": 10, "B-ORG": 11, "I-ORG": 12, "B-PERCENT": 13, "I-PERCENT": 14, "B-ORDINAL": 15, "B-MONEY": 16, "I-MONEY": 17, "B-WORK_OF_ART": 18, "I-WORK_OF_ART": 19, "B-FAC": 20, "B-TIME": 21, "I-CARDINAL": 22, "B-LOC": 23, "B-QUANTITY": 24, "I-QUANTITY": 25, "I-NORP": 26, "I-LOC": 27, "B-PRODUCT": 28, "I-TIME": 29, "B-EVENT": 30, "I-EVENT": 31, "I-FAC": 32, "B-LANGUAGE": 33, "I-PRODUCT": 34, "I-ORDINAL": 35, "I-LANGUAGE": 36}
    ontonotes5_labels = collections.OrderedDict(sorted(ontonotes5_labels_raw.items(), key=lambda x: x[1]))
    classmap = ClassLabel(names=list(ontonotes5_labels.keys()))
    return (
        load_dataset("tner/ontonotes5")
        .rename_column("tags", "ner_tags")
        .map(lambda sample: {"parsed_ner_tags": classmap.int2str(sample["ner_tags"])})
    )
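
Sorting by id matters because `ClassLabel` maps an integer tag to the name at that position in `names`. The sketch below shows one way to guard that step; the helper `names_in_id_order` and the toy mapping are assumptions for illustration, not part of `datasets` or of the notebook.

```python
from typing import Dict, List


def names_in_id_order(label_to_id: Dict[str, int]) -> List[str]:
    """Return label names ordered by id, failing if ids are not exactly 0..N-1."""
    ids = sorted(label_to_id.values())
    if ids != list(range(len(label_to_id))):
        raise ValueError("label ids must be unique and contiguous, starting at 0")
    return [name for name, _ in sorted(label_to_id.items(), key=lambda item: item[1])]


# Toy mapping (an assumption for illustration), deliberately not in id order.
toy_labels = {"B-DATE": 2, "O": 0, "I-DATE": 3, "B-CARDINAL": 1}
names = names_in_id_order(toy_labels)
print(names)
```

If an id were missing or duplicated, every later tag would be shifted to the wrong name, so failing early is cheaper than debugging mislabeled spans in the UI.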

In [129]:
ontonotes = load_ontonotes()

Map: 100%|██████████| 59924/59924 [00:01<00:00, 38017.49 examples/s]
Map: 100%|██████████| 8528/8528 [00:00<00:00, 44000.45 examples/s]
Map: 100%|██████████| 8262/8262 [00:00<00:00, 41023.03 examples/s]


In [146]:
dataset = template_for_token_classification(labels={
    "CARDINAL": "Numerals that do not fall under another type",
    "DATE": "Absolute or relative dates or periods",
    "PERSON": "People, including fictional",
    "NORP": "Nationalities or religious or political groups",
    "GPE": "Countries, cities, states",
    "LAW": "Named documents made into laws",
    "ORG": "Companies, agencies, institutions, etc.",
    "PERCENT": "Percentage (including “%”)",
    "ORDINAL": "“first”, “second”",
    "MONEY": "Monetary values, including unit",
    "WORK_OF_ART": "Titles of books, songs, etc.",
    "FAC": "Facilities like Buildings, airports, highways, bridges, etc.",
    "TIME": "Times smaller than a day",
    "LOC": "Non-GPE locations, mountain ranges, bodies of water",
    "QUANTITY": "Measurements, as of weight or distance",
    "PRODUCT": "Vehicles, weapons, foods, etc. (Not services)",
    "EVENT": "Named hurricanes, battles, wars, sports events, etc.",
    "LANGUAGE": "Any named language"
})
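
With this many labels it is easy to forget one or type one twice. A minimal sketch of a coverage check follows: it derives the entity types from IOB tag names and compares them with the keys of the span-label dictionary. The helper `entity_types` and the toy inputs are hypothetical.

```python
from typing import Iterable, Set


def entity_types(tag_names: Iterable[str]) -> Set[str]:
    """Strip the B-/I- prefix from IOB tag names and drop the outside tag "O"."""
    return {name.split("-", 1)[1] for name in tag_names if "-" in name}


# Toy inputs (assumptions for illustration)
tag_names = ["O", "B-PERSON", "I-PERSON", "B-WORK_OF_ART", "I-WORK_OF_ART", "B-GPE"]
span_labels = {"PERSON": "People, including fictional", "GPE": "Countries, cities, states"}

# Entity types that appear in the tags but have no span label yet
missing = entity_types(tag_names) - set(span_labels)
print(sorted(missing))
```

Running the same comparison against the full tag list before creating the dataset would flag any entity type whose suggestions could not be matched to a span label.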



In [147]:
dataset.add_records(list(dataset_to_records(ontonotes['validation'], "gold_labels")))

100%|██████████| 8528/8528 [00:01<00:00, 8016.14it/s]


In [148]:
dataset.push_to_argilla(name="dev-ner-ontonotes", workspace="admin")

RemoteFeedbackDataset(
   id=78330c77-8c4b-42bd-b37d-9d5724f184b2
   name=dev-ner-ontonotes
   workspace=Workspace(id=84a8fb6f-3350-4e9b-97c0-043cfedef934, name=admin, inserted_at=2024-05-14 17:08:20.825501, updated_at=2024-05-14 17:08:20.825501)
   url=http://localhost:6900/dataset/78330c77-8c4b-42bd-b37d-9d5724f184b2/annotation-mode
   fields=[RemoteTextField(id=UUID('c620e389-d9a9-4d97-951e-39ad93eca329'), client=None, name='text', title='Text', required=True, type='text', use_markdown=False)]
   questions=[RemoteSpanQuestion(id=UUID('26972fce-f2ae-4223-83d8-cbf85b22c328'), client=None, name='entities', title='Highlight the entities in the text:', description=None, required=True, type='span', field='text', labels=[SpanLabelOption(value='CARDINAL', text='Numerals that do not fall under another type', description=None), SpanLabelOption(value='DATE', text='Absolute or relative dates or periods', description=None), SpanLabelOption(value='PERSON', text='People, including fictional', desc

## Push to Hugging Face

In [155]:
dataset.push_to_huggingface(
    repo_id="louisguitton/dev-ner-ontonotes", split="validation"
)

Creating parquet from Arrow format: 100%|██████████| 9/9 [00:00<00:00, 411.00ba/s]
Uploading the dataset shards: 100%|██████████| 1/1 [00:00<00:00,  2.73it/s]
README.md: 100%|██████████| 10.0k/10.0k [00:00<00:00, 8.44MB/s]
