
Datasets:
- With `LOAD_DATASETS=full`, Argilla loads `gutenberg_spacy-ner-monitoring` for token classification with default spaCy predictions; it is a fork of https://huggingface.co/datasets/gutenberg_time
- The default NER dataset in papers is CoNLL-2003: https://huggingface.co/datasets/conll2003
- Few-NERD: https://huggingface.co/datasets/DFKI-SLT/few-nerd
- OntoNotes 5: https://huggingface.co/datasets/tner/ontonotes5
- Look for Argilla-compatible NER datasets with this search: https://huggingface.co/datasets?task_categories=task_categories:token-classification&sort=trending&search=argilla
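
Token-level datasets such as CoNLL-2003 store one BIO tag per token, while the span questions used later in this notebook work with character offsets. The sketch below shows one possible conversion. The helper name `bio_to_char_spans` is hypothetical, and it assumes the text is rebuilt by joining tokens with single spaces and that tags are already strings (the Hugging Face version stores integer ids that would need mapping to names first).

```python
from typing import List, Optional, Tuple

Span = Tuple[int, int, str]


def bio_to_char_spans(tokens: List[str], tags: List[str]) -> Tuple[str, List[Span]]:
    """Join tokens with single spaces and turn BIO tags into (start, end, label) character spans."""
    text = " ".join(tokens)
    spans: List[Span] = []
    current: Optional[list] = None  # [start, end, label] of the entity being built
    offset = 0
    for token, tag in zip(tokens, tags):
        start, end = offset, offset + len(token)
        label = tag[2:]
        continues = tag.startswith("I-") and current is not None and current[2] == label
        if continues:
            current[1] = end  # extend the open entity
        else:
            if current is not None:
                spans.append((current[0], current[1], current[2]))
                current = None
            if tag.startswith(("B-", "I-")):
                current = [start, end, label]  # open a new entity
        offset = end + 1  # skip the joining space
    if current is not None:
        spans.append((current[0], current[1], current[2]))
    return text, spans


text, spans = bio_to_char_spans(
    ["John", "Smith", "visited", "Paris"],
    ["B-PER", "I-PER", "O", "B-LOC"],
)
```

An `I-` tag that does not continue an open entity of the same label is treated as the start of a new entity, which is a lenient design choice rather than a requirement of the format.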


- ✨ Provide suggested spans with a confidence score, so your team doesn't need to start from scratch.


In [1]:
from typing import List, Tuple, Union, Dict
import types

In [2]:
import argilla as rg

rg.init(api_url="http://localhost:6900", api_key="admin.apikey")



## Push to Huggingface

In [155]:
dataset.push_to_huggingface(
    repo_id="louisguitton/dev-ner-ontonotes",
    split="validation",
)

Creating parquet from Arrow format: 100%|██████████| 9/9 [00:00<00:00, 411.00ba/s]
Uploading the dataset shards: 100%|██████████| 1/1 [00:00<00:00,  2.73it/s]
README.md: 100%|██████████| 10.0k/10.0k [00:00<00:00, 8.44MB/s]


## Add suggestions to a remote dataset

In [262]:
remote_dataset = rg.FeedbackDataset.from_argilla(
    name="dev-ner-ontonotes",
    workspace="admin",
    with_vectors="all"
)

In [263]:
import spacy
from argilla.client.feedback.dataset.remote.dataset import RemoteFeedbackDataset
from argilla.client.feedback.schemas import SpanValueSchema
from argilla.client.feedback.schemas.remote.records import RemoteFeedbackRecord

def labeller(nlp: spacy.language.Language, text: str) -> List[SpanValueSchema]:
    """Generate NER predictions from a spaCy model in the Argilla format."""
    doc = nlp(text)
    return [
        SpanValueSchema(
            start=ent.start_char,
            end=ent.end_char,
            label=ent.label_,
            score=0,  # placeholder score
        )
        for ent in doc.ents
    ]

def add_suggestions_to_remote_dataset(remote_dataset: RemoteFeedbackDataset, nlp: spacy.language.Language) -> None:
    """Add suggestions from a spaCy NER model to a remote instance of an existing Argilla dataset.

    ref: https://docs.argilla.io/en/latest/tutorials_and_integrations/tutorials/feedback/end2end_examples/add-suggestions-and-responses-005.html#For-the-RemoteFeedbackDataset
    """
    modified_records: List[RemoteFeedbackRecord] = list(remote_dataset.records)

    for record in modified_records:
        pred: List[SpanValueSchema] = labeller(nlp, record.fields["text"])
        # Set exactly one suggestion for the "entities" question. Passing more than one fails with:
        # ValidationApiError: Argilla server returned an error with http status: 422. Error details: {'response': 'Record at
        # position 0 is not valid because found duplicate suggestions question IDs', 'params': None}
        record.suggestions = [{
            "question_name": "entities",
            "value": pred,
            "agent": nlp.meta["name"],
        }]

    remote_dataset.update_records(modified_records)
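
The 422 error quoted in the comment above suggests the server accepts at most one suggestion per question on a record. When predictions come from several agents, one way to stay within that limit is to deduplicate before assigning `record.suggestions`. The following is a minimal sketch; the helper name `one_suggestion_per_question` and the "last one wins" policy are assumptions, not part of Argilla.

```python
from typing import Any, Dict, List


def one_suggestion_per_question(suggestions: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Keep only the last suggestion seen for each question_name."""
    latest: Dict[str, Dict[str, Any]] = {}
    for suggestion in suggestions:
        # A later suggestion for the same question replaces the earlier one.
        latest[suggestion["question_name"]] = suggestion
    return list(latest.values())


candidates = [
    {"question_name": "entities", "value": [], "agent": "model-a"},
    {"question_name": "entities", "value": [], "agent": "model-b"},
    {"question_name": "category", "value": "Match Report", "agent": "model-a"},
]
deduplicated = one_suggestion_per_question(candidates)
```

A different policy, such as keeping the suggestion with the highest score, would fit the same shape.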

In [171]:
import spacy

nlp = spacy.load("en_core_web_sm")



In [264]:
add_suggestions_to_remote_dataset(remote_dataset, nlp)

In [270]:
def test_one_suggestion_and_no_response():
    r = remote_dataset.records[2]
    pred: List[SpanValueSchema] = labeller(nlp, r.fields["text"])
    r.responses = []
    r.suggestions = [{
        "question_name": "entities",
        "value": pred,
        "agent": nlp.meta["name"],
    }]
    remote_dataset.update_records([r])

In [271]:
test_one_suggestion_and_no_response()

## Compute metrics

In [289]:
from argilla.client.feedback.metrics.utils import get_responses_and_suggestions_per_user

In [291]:
remote_dataset = rg.FeedbackDataset.from_argilla(
    name="ner-lvl2",
    workspace="admin",
    with_vectors="all"
)

# responses_and_suggestions_per_user = get_responses_and_suggestions_per_user(dataset=remote_dataset, question_name="entities")

Extracting responses and suggestions per user: 100%|██████████| 100/100 [00:00<00:00, 861253.39it/s]


In [298]:
hf_dataset = remote_dataset.format_as("datasets")

In [300]:
hf_dataset[0]

{'text': 'A Russian diver has found the bodies of three of the 118 sailors who were killed when the nuclear submarine Kursk sank in the Barents Sea .',
 'entities': [{'user_id': 'fe7c1b6a-5d30-41d5-bb56-6675cfbad12f',
   'value': {'start': [2, 40, 53, 108, 122],
    'end': [9, 45, 56, 113, 137],
    'label': ['NORP', 'CARDINAL', 'CARDINAL', 'PRODUCT', 'LOC'],
    'text': ['Russian', 'three', '118', 'Kursk', 'the Barents Sea']},
   'status': 'submitted'}],
 'entities-suggestion': {'start': [2, 40, 53, 108, 122],
  'end': [9, 45, 56, 113, 137],
  'label': ['NORP', 'CARDINAL', 'CARDINAL', 'PRODUCT', 'LOC'],
  'text': ['Russian', 'three', '118', 'Kursk', 'the Barents Sea'],
  'score': [None, None, None, None, None]},
 'entities-suggestion-metadata': {'type': None, 'score': None, 'agent': None},
 'external_id': None,
 'metadata': '{}'}
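
In the row above, spans are stored column-wise: parallel `start`, `end`, `label` and `text` lists. A small sketch of how they could be unpacked into per-span tuples and checked against the source text follows. The helper names are hypothetical, and the sample uses a shortened copy of the record shown above.

```python
from typing import Dict, List, Tuple

Span = Tuple[int, int, str]


def unpack_spans(value: Dict[str, list]) -> List[Span]:
    """Turn the column-wise span dict into a list of (start, end, label) tuples."""
    return list(zip(value["start"], value["end"], value["label"]))


def offsets_match_text(text: str, value: Dict[str, list]) -> bool:
    """Check that every (start, end) pair slices out the stored span text."""
    return all(
        text[start:end] == span_text
        for start, end, span_text in zip(value["start"], value["end"], value["text"])
    )


sample_text = "A Russian diver has found the bodies of three of the 118 sailors"
sample_value = {
    "start": [2, 40, 53],
    "end": [9, 45, 56],
    "label": ["NORP", "CARDINAL", "CARDINAL"],
    "text": ["Russian", "three", "118"],
}
spans = unpack_spans(sample_value)
```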

In [None]:
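from spacy.tokens import Doc
from spacy.training import Example
from spacy.util import filter_spans


def doc_from_spans(text: str, value: Dict[str, list]) -> Doc:
    """Build a Doc whose entities are the character spans stored in `value`.

    ref: https://spacy.io/api/doc#set_ents
    """
    doc = nlp.make_doc(text)
    spans = [
        doc.char_span(start, end, label=label, alignment_mode="contract")
        for start, end, label in zip(value["start"], value["end"], value["label"])
    ]
    # char_span returns None for spans that cannot be aligned to tokens,
    # and set_ents rejects overlapping spans, so filter both cases out.
    doc.set_ents(filter_spans([span for span in spans if span is not None]))
    return doc


examples: List[Example] = []
for row in hf_dataset:
    if not row["entities"] or not row["entities-suggestion"]:
        continue  # skip records without a response or without a suggestion
    # Assumption: the first response is treated as the gold annotation.
    reference = doc_from_spans(row["text"], row["entities"][0]["value"])
    predicted = doc_from_spans(row["text"], row["entities-suggestion"])
    examples.append(Example(predicted, reference))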

In [296]:
from spacy.scorer import Scorer
scorer = Scorer()

In [None]:
scores = scorer.score(examples)
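
spaCy reports entity metrics under keys such as `ents_p`, `ents_r` and `ents_f`. To make it clearer what such numbers measure, here is a minimal pure-Python sketch of exact-match precision, recall and F1 over span tuples. It is an illustration under the assumption that a prediction counts only when start, end and label all match; it is not spaCy's implementation, and the function name is hypothetical.

```python
from typing import Dict, Set, Tuple

Span = Tuple[int, int, str]


def span_prf(gold: Set[Span], pred: Set[Span]) -> Dict[str, float]:
    """Exact-match precision, recall and F1 between gold and predicted spans."""
    true_positives = len(gold & pred)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return {"p": precision, "r": recall, "f": 0.0}
    f1 = 2 * precision * recall / (precision + recall)
    return {"p": precision, "r": recall, "f": f1}


gold = {(0, 2, "ORG"), (11, 17, "MISC")}
pred = {(0, 2, "ORG"), (18, 22, "LOC")}
metrics = span_prf(gold, pred)
```

With one of two predictions correct and one of two gold spans found, precision, recall and F1 all come out at 0.5.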

## Football Articles
### Create (and Delete) empty dataset in Argilla for our task

In [49]:
remote_dataset = rg.FeedbackDataset.from_argilla("football-news", workspace="admin")
remote_dataset.delete()

In [50]:
dataset = rg.FeedbackDataset(
    fields=[
        rg.TextField(name="title"),
        rg.TextField(name="content"),
    ],
    questions=[
        rg.LabelQuestion(
            name="category",
            title="What is the category of the article?",
            labels=[
                "Coach Commentary", "Transfer News", "Match Report",
                "Player Profile", "League Updates", "Injury Updates",
                "Tactical Analysis", "Social Media Reaction",
                "Historical Milestone", "Match Incident"
            ],
            required=False,
            visible_labels=None
        ),
        rg.SpanQuestion(
            name="entities",
            title="Highlight the entities in the content:",
            labels=["Competition", "Team", "Player", "Match", "Transfer"],
            field="content",
            required=True,
            allow_overlapping=True
        )
    ],
    metadata_properties=[
        rg.TermsMetadataProperty(name="link"),
        rg.TermsMetadataProperty(name="source"),
    ],
    vectors_settings=[], # we will add sentence embeddings a posteriori
    guidelines="Please, read the question carefully and try to answer it as accurately as possible."
)



In [51]:
remote_dataset = dataset.push_to_argilla(name="football-news", workspace="admin")

### Add bare records to the remote Argilla dataset

In [52]:
from typing import Iterator

import pandas as pd
from tqdm import tqdm

def records_generator(filepath: str = '../../data/football-news-articles/final-articles.csv') -> Iterator[rg.FeedbackRecord]:
    dataset: pd.DataFrame = pd.read_csv(filepath).loc[lambda d: d.source.isin(["skysports", "all-football-app"])]

    for index, row in tqdm(dataset.iterrows()):
        record = rg.FeedbackRecord(
            fields={
                "title": row['title'],
                "content": row['content']
            },
            metadata={
                "link": row['link'],
                "source": row['source'],
            },
            vectors={},
            responses=[],
            suggestions=[],
            external_id=index,
        )

        yield record

In [53]:
remote_dataset = rg.FeedbackDataset.from_argilla("football-news", workspace="admin")
remote_dataset.add_records(list(records_generator()))

996it [00:00, 33681.58it/s]
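
The cell above materializes every record in a single list before uploading. For larger files, the records could instead be sent in fixed-size batches. The sketch below is one way to do that with the standard library; the helper name `batched` and the batch size are assumptions, and whether the Argilla client already batches internally is not verified here.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def batched(items: Iterable[T], size: int) -> Iterator[List[T]]:
    """Yield successive lists of at most `size` items from any iterable."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


# Hypothetical usage with the generator defined above (batch size chosen arbitrarily):
# for batch in batched(records_generator(), 200):
#     remote_dataset.add_records(batch)
example_batches = list(batched(range(5), 2))
```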




### Add vectors to records to enable Similarity Search in Argilla

In [12]:
from argilla.client.feedback.integrations.sentencetransformers import SentenceTransformersExtractor

In [15]:
FAST_AND_SMALL = "sentence-transformers/all-MiniLM-L6-v2"

ste = SentenceTransformersExtractor(
    model=FAST_AND_SMALL,
    show_progress=True,
)



In [16]:
# Update the records
updated_records = ste.update_records(
    records=remote_dataset.records,
    fields=None, # Use all fields
    overwrite=True, # Overwrite existing vectors
)

Batches: 100%|██████████| 238/238 [00:37<00:00,  6.37it/s]
Batches: 100%|██████████| 238/238 [00:09<00:00, 24.78it/s]
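
Each record now carries one embedding vector per field, which is what similarity search compares. As an illustration of the idea, the sketch below computes cosine similarity between two vectors in plain Python. Cosine is assumed here as the metric; the source does not state which measure Argilla uses, and the toy vectors are far shorter than the 384-dimensional output of `all-MiniLM-L6-v2`.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)


same_direction = cosine_similarity([1.0, 0.0], [2.0, 0.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
```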


In [17]:
# Update the dataset
ste.update_dataset(
    dataset=remote_dataset,
    fields=None, # None means using all fields
    update_records=True, # Also, update the records in the dataset
    overwrite=True, # Whether to overwrite existing vectors
)

Batches: 100%|██████████| 238/238 [00:05<00:00, 44.80it/s]
Batches: 100%|██████████| 238/238 [00:31<00:00,  7.63it/s]


RemoteFeedbackDataset(
   id=39a5e7a6-e2b9-4275-b010-d443ceb571a3
   name=football-news
   workspace=Workspace(id=84a8fb6f-3350-4e9b-97c0-043cfedef934, name=admin, inserted_at=2024-05-14 17:08:20.825501, updated_at=2024-05-14 17:08:20.825501)
   url=http://localhost:6900/dataset/39a5e7a6-e2b9-4275-b010-d443ceb571a3/annotation-mode
   fields=[RemoteTextField(id=UUID('9faf43ab-7abb-4979-b497-7e0841d76780'), client=None, name='title', title='Title', required=True, type='text', use_markdown=False), RemoteTextField(id=UUID('90b93abd-b4ca-49f8-8a5b-869a835ff177'), client=None, name='content', title='Content', required=True, type='text', use_markdown=False)]
   questions=[RemoteLabelQuestion(id=UUID('941d4f41-4629-4b86-b5d4-c34be7b7c6d2'), client=None, name='category', title='What is the category of the article?', description=None, required=False, type='label_selection', labels=['Coach Commentary', 'Transfer News', 'Match Report', 'Player Profile', 'League Updates', 'Injury Updates', 'Tactica