# Fancy title with emoji

Insert motivation
Insert image

## Introduction

Definitions and goal of the tutorial

References

- https://twitter.com/explosion_ai/status/1696207181098705327
- https://github.com/explosion/prodigy-recipes/tree/master/tutorials/kb-guided-llm-ner
- https://docs.argilla.io/en/latest/tutorials_and_integrations/tutorials/other_datasets/labelling-tokenclassification-using-spacy-llm.html
- https://docs.argilla.io/en/latest/tutorials_and_integrations/tutorials/feedback/labelling-spacy-llm.html
- https://docs.argilla.io/en/latest/tutorials_and_integrations/tutorials/other_datasets/weak_supervision_ner.html

In [1]:
import json
from typing import cast
from IPython.display import display, Markdown

import spacy
from spacy_llm.util import assemble, Config, assemble_from_config
from spacy_llm.pipeline import LLMWrapper
import argilla as rg

from datasets import load_dataset
from tqdm.auto import tqdm

## Setup

In [3]:
rg.init(
    api_url="http://localhost:6900",
    api_key="admin.apikey",
)



In [4]:
with open("../../data/wiki_guardians.json", "r") as fh:
    text: str = json.load(fh)["text"]
    paragraph = text.split("\n\n\n")[0]

In [5]:
dataset_hf = load_dataset("banking77", split="train")
dataset_hf.to_pandas().head()

|   | text | label |
|---|------|-------|
| 0 | I am still waiting on my card? | 11 |
| 1 | What can I do if my card still hasn't arrived ... | 11 |
| 2 | I have been waiting over a week. Is the card s... | 11 |
| 3 | Can I track my card while it is in the process... | 11 |
| 4 | How do I know if I will get my card, or if it ... | 11 |


## `spacy-llm` + `Ollama` + `spacy-dbpedia-spotlight` pipeline

In [6]:
cfg_string = """
[nlp]
lang = "en"
pipeline = ["llm", "dbpedia-spotlight"]

[components]

[components.llm]
factory = "llm"

[components.llm.task]
@llm_tasks = "spacy.NER.v2"
labels = ["Person", "Organisation"]

[components.llm.model]
@llm_models = "langchain.Ollama.v1"
name = "mistral"
context_length = 2048
config = {"temperature": 0.0}

[components.dbpedia-spotlight]
factory = "dbpedia_spotlight"
dbpedia_rest_endpoint = "http://localhost:2222/rest"
language_code = "en"
overwrite_ents = false
process = "annotate"
"""

config = Config().from_str(cfg_string)
nlp = assemble_from_config(config)
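
A small pre-flight check can catch a common config mistake before the pipeline is assembled: a name listed in `pipeline` with no matching `[components.<name>]` section. The sketch below is an assumption-laden stand-in, not part of `spacy-llm`: it parses a shortened copy of the config with the standard-library `configparser` instead of spaCy's own config system, and `missing_components` is a hypothetical helper name.

```python
import configparser
import json
from typing import List

# Shortened copy of the config above; only the parts the check needs.
CFG = """
[nlp]
lang = "en"
pipeline = ["llm", "dbpedia-spotlight"]

[components]

[components.llm]
factory = "llm"

[components.dbpedia-spotlight]
factory = "dbpedia_spotlight"
"""


def missing_components(cfg_string: str) -> List[str]:
    """Return pipeline names that have no [components.<name>] section.

    Hypothetical helper: uses configparser as a rough stand-in for the
    real config parser, so it only understands this simple layout.
    """
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(cfg_string)
    # The pipeline value is written as a JSON-compatible list of strings.
    pipeline = json.loads(parser["nlp"]["pipeline"])
    return [name for name in pipeline if f"components.{name}" not in parser]


print(missing_components(CFG))
```

An empty list means every pipeline entry has a component block. spaCy performs its own validation when the pipeline is assembled, so this is only a quick sanity check.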

In [7]:
doc = nlp(paragraph)

In [8]:
spacy.displacy.render(
    doc,
    style="ent",
    jupyter=True,
)

## Inference

In [9]:
def tokenizer(doc):
    """Return the token texts of a spaCy Doc as a list of strings."""
    return [token.text for token in doc]

In [10]:
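# Run the pipeline over the first 100 utterances. nlp.pipe streams the texts
# through the pipeline, and tqdm shows progress while the LLM is queried.
texts = dataset_hf[:100]["text"]
records = [
    rg.TokenClassificationRecord(
        text=doc.text,
        tokens=tokenizer(doc),
        # Each prediction is a (label, start_char, end_char) tuple.
        prediction=[(ent.label_, ent.start_char, ent.end_char) for ent in doc.ents],
        prediction_agent="ollama-mistral",
    )
    for doc in tqdm(nlp.pipe(texts), total=len(texts))
]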
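
Each prediction above is a `(label, start_char, end_char)` tuple of character offsets into the text, with the end offset exclusive, which is how spaCy reports `ent.start_char` and `ent.end_char`. The sketch below shows that shape without spaCy or Argilla. The sentence, the offsets, and the `span_texts` helper are all made up for illustration.

```python
from typing import List, Tuple

# (label, start_char, end_char); end_char is exclusive, like a Python slice.
Span = Tuple[str, int, int]


def span_texts(text: str, spans: List[Span]) -> List[Tuple[str, str]]:
    """Return (label, surface text) pairs by slicing the text with each span."""
    return [(label, text[start:end]) for label, start, end in spans]


# Hypothetical sentence and spans, not taken from banking77.
text = "Alice moved her account to Acme Bank."
spans = [("Person", 0, 5), ("Organisation", 27, 36)]

print(span_texts(text, spans))
# → [('Person', 'Alice'), ('Organisation', 'Acme Bank')]
```

Slicing the text with a span is a quick way to eyeball whether the offsets point at the intended mention before the records are logged.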

In [11]:
dataset = rg.DatasetForTokenClassification(records)

In [12]:
rg.log(dataset, "banking77_ner", workspace="admin")

BulkResponse(dataset='banking77_ner', processed=100, failed=0)

In [None]:
# add NER annotations
# validate predictions
# return dataset
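
The "validate predictions" step is still a placeholder in this draft. One possible reading, sketched here purely as an assumption, is a filter that drops spans with an unexpected label, spans that fall outside the text, and spans that overlap an earlier one. `validate_spans` is a hypothetical helper, the example data is invented, and nothing below uses the Argilla or spaCy APIs.

```python
from typing import List, Set, Tuple

# (label, start_char, end_char); end_char is exclusive.
Span = Tuple[str, int, int]


def validate_spans(text: str, spans: List[Span], allowed_labels: Set[str]) -> List[Span]:
    """Keep spans with an allowed label that lie inside the text and do not
    overlap a previously kept span. Spans are visited in order of start offset."""
    kept: List[Span] = []
    last_end = 0
    for label, start, end in sorted(spans, key=lambda s: (s[1], s[2])):
        if label not in allowed_labels:
            continue  # label was not requested from the model
        if not (0 <= start < end <= len(text)):
            continue  # empty span or offsets outside the text
        if start < last_end:
            continue  # overlaps the previously kept span
        kept.append((label, start, end))
        last_end = end
    return kept


# Hypothetical input with one bad label, one overlap and one out-of-range span.
text = "Alice moved her account to Acme Bank."
spans = [
    ("Person", 0, 5),
    ("Location", 6, 11),
    ("Organisation", 27, 36),
    ("Organisation", 32, 36),
    ("Person", 30, 99),
]

print(validate_spans(text, spans, {"Person", "Organisation"}))
# → [('Person', 0, 5), ('Organisation', 27, 36)]
```

Keeping the first span on overlap is an arbitrary choice. Preferring the longest span, or the one from a particular component, would be equally defensible.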