# Prodigy Demo

https://demo.prodi.gy/?=null&view_id=ner_manual

# Run Prodigy Named Entity Annotation Session

Copy and paste the following command into a terminal:

```bash

prodigy ner.manual news-headlines-ner blank:en ./data/news-headlines.jsonl --label PERSON,ORG,PRODUCT,LOCATION

```

Annotate a couple of records in the browser, then save them.

# Examine the Input File and Saved Annotations
## Annotation Tasks (Input)

In [5]:
from srsly import read_jsonl
file_name = './data/news-headlines.jsonl'
input_dataset = list(read_jsonl(file_name))  # read_jsonl returns a generator, so materialize it as a list
print(f"Loaded {len(input_dataset)} annotation tasks")

Loaded 200 annotation tasks


In [11]:
import json
task = input_dataset[0]
print(json.dumps(task, indent = 2))


{
  "text": "Uber\u2019s Lesson: Silicon Valley\u2019s Start-Up Machine Needs Fixing",
  "meta": {
    "source": "The New York Times"
  }
}
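Each line of the input file is a single JSON object with a `text` field and an optional `meta` field, as the record above shows. A minimal sketch of how such a file could be produced with the standard library follows; the helper names `to_task` and `write_jsonl`, the `headlines` list, and the output path are hypothetical and not part of the original notebook.

```python
import json
import tempfile
from pathlib import Path


def to_task(text, source):
    """Build one annotation task in the same shape as the record shown above."""
    return {"text": text, "meta": {"source": source}}


def write_jsonl(path, tasks):
    """Write one JSON object per line (the JSONL format)."""
    with Path(path).open("w", encoding="utf-8") as f:
        for task in tasks:
            f.write(json.dumps(task) + "\n")


# Hypothetical input; the single pair mirrors the task printed above.
headlines = [
    ("Uber\u2019s Lesson: Silicon Valley\u2019s Start-Up Machine Needs Fixing",
     "The New York Times"),
]
tasks = [to_task(text, source) for text, source in headlines]

# Hypothetical output location, chosen so the sketch does not touch ./data
out_path = Path(tempfile.gettempdir()) / "example-tasks.jsonl"
write_jsonl(out_path, tasks)
print(out_path.read_text(encoding="utf-8"))
```

By default `json.dumps` escapes non-ASCII characters, which is why the apostrophe appears as `\u2019` in the printed record.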


## Connect to the Prodigy Database

In [12]:
from prodigy.components.db import connect
db = connect()
print(f"Database location: {db.db.database}")


Database location: /home/vscode/.prodigy/prodigy.db


In [3]:
db.datasets

['news-headlines-ner']

In [14]:
dataset_name = 'news-headlines-ner'
dataset = db.get_dataset_examples(dataset_name)
print(f"Loaded {len(dataset)} annotated tasks")

Loaded 2 annotated tasks


In [20]:
task = dataset[1]
print(task.keys())


dict_keys(['text', 'meta', '_input_hash', '_task_hash', '_is_binary', 'tokens', '_view_id', 'spans', 'answer', '_timestamp', '_annotator_id', '_session_id'])


In [21]:
for key in task:
    if key != 'tokens':
        print(f"{key}: {task[key]}")

text: Pearl Automation, Founded by Apple Veterans, Shuts Down
meta: {'source': 'The New York Times'}
_input_hash: 1487477437
_task_hash: -1298236362
_is_binary: False
_view_id: ner_manual
spans: [{'start': 0, 'end': 17, 'token_start': 0, 'token_end': 2, 'label': 'ORG'}, {'start': 29, 'end': 44, 'token_start': 5, 'token_end': 7, 'label': 'ORG'}]
answer: accept
_timestamp: 1762554543
_annotator_id: 2025-11-07_22-27-32
_session_id: 2025-11-07_22-27-32
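The `_timestamp` value is easier to read as a date. The sketch below assumes it holds seconds since the Unix epoch, which the notebook does not state explicitly; the two values are copied from the record printed above.

```python
from datetime import datetime, timezone

# Values copied from the record printed above.
task = {"_timestamp": 1762554543, "_session_id": "2025-11-07_22-27-32"}

# Assumption: `_timestamp` is seconds since the Unix epoch.
saved_at = datetime.fromtimestamp(task["_timestamp"], tz=timezone.utc)
print(saved_at.isoformat())
```

Under that assumption the annotation was saved at 22:29:03 UTC on 2025-11-07, about a minute and a half after the time encoded in `_session_id` (if the session ID was recorded in UTC).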


In [23]:
print(f"Text: {task['text']}")

for span in task['spans']:
    print(f"{span['label']}: {task['text'][span['start']:span['end']]}")

Text: Pearl Automation, Founded by Apple Veterans, Shuts Down
ORG: Pearl Automation,
ORG: Apple Veterans,
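The same span structure can be summarized across a whole dataset. The following is a minimal sketch that counts labels in accepted examples only; `count_labels` is a hypothetical helper, and the `examples` list is stand-in data in the same shape as `dataset` (its first record mirrors the task above, the second is invented to show a skipped record).

```python
from collections import Counter

# Stand-in records; the first mirrors the annotated task shown above.
examples = [
    {
        "text": "Pearl Automation, Founded by Apple Veterans, Shuts Down",
        "answer": "accept",
        "spans": [
            {"start": 0, "end": 17, "label": "ORG"},
            {"start": 29, "end": 44, "label": "ORG"},
        ],
    },
    # Invented record: not accepted and without a "spans" key.
    {"text": "A skipped headline", "answer": "ignore"},
]


def count_labels(examples):
    """Count span labels, looking only at examples answered with 'accept'."""
    counts = Counter()
    for eg in examples:
        if eg.get("answer") != "accept":
            continue
        for span in eg.get("spans", []):
            counts[span["label"]] += 1
    return counts


print(count_labels(examples))
```

With the real data, `count_labels(dataset)` would give a quick view of how many entities of each type have been annotated so far.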


## Visualize Annotations

### Initialize spaCy

In [32]:
import spacy
from spacy import displacy
model = spacy.blank("en")


### Create spaCy `Doc` and visualize it

In [33]:
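doc = model(task['text'])
# A list of tuples (LABEL, TOKEN_START, TOKEN_END) as stored by Prodigy, where TOKEN_END is inclusive
entities = [(span['label'], span['token_start'], span['token_end']) for span in task['spans']]
# spaCy expects an exclusive end token index, so shift TOKEN_END by one
doc.ents = [(label, start, end + 1) for label, start, end in entities]

displacy.render(doc, style="ent", jupyter=True)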

In [34]:
entities

[('ORG', 0, 2), ('ORG', 5, 7)]
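displaCy can also render entities directly from character offsets in its manual mode, which avoids token indices altogether. The sketch below only builds the input dictionary for that mode; `to_displacy_manual` is a hypothetical helper, and the `task` literal repeats the fields of the annotated record shown earlier.

```python
def to_displacy_manual(task):
    """Convert a Prodigy-style task into the dict accepted by displaCy's manual mode."""
    ents = [
        {"start": span["start"], "end": span["end"], "label": span["label"]}
        for span in task.get("spans", [])
    ]
    # displaCy expects entities ordered by their start offset.
    ents.sort(key=lambda ent: ent["start"])
    return {"text": task["text"], "ents": ents}


# Fields copied from the annotated record shown earlier.
task = {
    "text": "Pearl Automation, Founded by Apple Veterans, Shuts Down",
    "spans": [
        {"start": 29, "end": 44, "token_start": 5, "token_end": 7, "label": "ORG"},
        {"start": 0, "end": 17, "token_start": 0, "token_end": 2, "label": "ORG"},
    ],
}

manual_doc = to_displacy_manual(task)
print(manual_doc["ents"])
```

In the notebook, `displacy.render(manual_doc, style="ent", manual=True, jupyter=True)` should then highlight exactly the annotated character ranges.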

In [35]:
dir(displacy)


['Any',
 'Callable',
 'DependencyRenderer',
 'Dict',
 'Doc',
 'EntityRenderer',
 'Errors',
 'Iterable',
 'Optional',
 'RENDER_WRAPPER',
 'Span',
 'SpanRenderer',
 'Union',
 '__builtins__',
 '__cached__',
 '__doc__',
 '__file__',
 '__loader__',
 '__name__',
 '__package__',
 '__path__',
 '__spec__',
 '_html',
 'app',
 'find_available_port',
 'get_doc_settings',
 'is_in_jupyter',
 'parse_deps',
 'parse_ents',
 'parse_spans',
 'render',
 'serve',
 'set_render_wrapper',
 'templates',

In [36]:
#print(json.dumps(task, indent = 2))

# Download Datasets

## Summarization Dataset

```python

from pandas import read_parquet

df = read_parquet("https://huggingface.co/datasets/r-three/fib/resolve/refs%2Fconvert%2Fparquet/default/test/0000.parquet")
print(f"Loaded {len(df)} records")

file_name = "./data/hf-summarization-dataset.csv"
df.to_csv(file_name, index = False)
print(f"Saved {len(df)} records in {file_name}")

```


## News Headlines Dataset

The news headlines dataset can be downloaded from:

https://raw.githubusercontent.com/explosion/prodigy-recipes/master/example-datasets/news_headlines.jsonl
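A minimal sketch of fetching that file into the path used by the `prodigy ner.manual` command, using only the standard library. The helper names `download` and `count_records` are hypothetical, and the local file name is assumed to be the hyphenated `news-headlines.jsonl` used earlier rather than the remote `news_headlines.jsonl`.

```python
import json
from pathlib import Path
from urllib.request import urlretrieve

URL = (
    "https://raw.githubusercontent.com/explosion/prodigy-recipes/"
    "master/example-datasets/news_headlines.jsonl"
)
# Assumed local path, matching the one passed to `prodigy ner.manual`.
TARGET = Path("./data/news-headlines.jsonl")


def download(url=URL, target=TARGET):
    """Download the file, creating the target directory if needed."""
    target = Path(target)
    target.parent.mkdir(parents=True, exist_ok=True)
    urlretrieve(url, target)
    return target


def count_records(path):
    """Count non-empty lines that parse as JSON objects."""
    with Path(path).open(encoding="utf-8") as f:
        return sum(
            1 for line in f
            if line.strip() and isinstance(json.loads(line), dict)
        )
```

Calling `download()` followed by `count_records(TARGET)` should report the same 200 tasks loaded earlier, assuming the remote file has not changed.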

# Read Datasets from Local Copy

In [1]:
from pandas import read_csv
file_name = "./data/hf-summarization-dataset.csv"
df = read_csv(file_name)
print(f"Loaded {len(df)} records from {file_name}")

Loaded 3579 records from ./data/hf-summarization-dataset.csv


In [3]:
df.head()

| | id | input | correct_choice | list_choices | lbl | distractor_model | dataset |
|---|---|---|---|---|---|---|---|
| 0 | 32168497 | Vehicles and pedestrians will now embark and d... | Passengers using a chain ferry have been warne... | [" A new service on the Isle of Wight's chain ... | 1 | bart-base | xsum |
| 1 | 29610109 | If you leave your mobile phone somewhere do yo... | Do you ever feel lonely, stressed or jealous w... | [' You may be worried about your health, but w... | 1 | bart-base | xsum |
| 2 | 38018439 | Speaking on TV, Maria Zakharova said Jews had ... | A spokeswoman on Russian TV has said Jewish pe... | [' The Russian foreign minister has said she h... | 1 | bart-base | xsum |
| 3 | 32790804 | A report by the organisation suggests men, wom... | Egyptian security forces are using sexual viol... | [' Egyptian police are systematically abusing ... | 1 | bart-base | xsum |
| 4 | 36437856 | Police in Australia and Europe were aware of a... | One word and a freckle indirectly led to Huckl... | ['One word and a freckle indirectly led to Huc... | 0 | bart-base | xsum |


