# Introduction to Prodigy

## Definitions

**Prodigy** is an annotation tool for creating and improving machine-learning datasets, especially for text, images, and audio.

It’s designed to be fast, scriptable, and developer-friendly — so you can build high-quality labeled data interactively, right next to your model.


### Key components:

* **Web app:**

    What annotators see — an interactive browser UI where you label examples (e.g., highlight entities, classify text, correct model outputs).

* **Annotation Interface:**

    An annotation interface is the front-end view shown to the annotator in Prodigy’s web app. It determines how a task is presented (text, image, choice, etc.) and how the user interacts with it (highlight spans, select choices, draw boxes).

    It is also referred to as a `view_id` in code and JSON configurations.
    
    There are multiple pre-defined annotation interfaces (such as Named Entity Recognition, Text Categorization, etc.), described here: https://prodi.gy/docs/api-interfaces
    
    Prodigy also lets you combine several existing interfaces on one page (using `blocks`) and create your own interfaces via the `HTMLTemplate` interface.

* **Database:**

    Every annotation session is saved in a database — you can export or query results for training or analysis. 
    
    By default, Prodigy creates a SQLite database at `~/.prodigy/prodigy.db`.

    However, it can be configured to use a SQLite database located elsewhere on the local drive, or a MySQL or PostgreSQL database.

* **Recipes:**

    Ready-made or custom Python functions that define what to annotate and how.

    For example, `ner.manual` lets you label entities from scratch, and `ner.correct` lets you correct model predictions.


* **Prodigy CLI:**

    Lets you launch Prodigy, check annotation progress, export annotated data, and more.

* **Controller (advanced):**
    
    The controller is the central part of Prodigy that manages the flow of data between your recipe, the annotation interface, and the database.

    It's what actually runs a Prodigy session - deciding:

    * which examples to send to the annotator,
        
    * how to display them in the browser, and
        
    * what to do with the answers when the user submits them.


<img src="images/prodigy-data-flow.png" alt="Prodigy Data Flow Diagram" width="1000"/>
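
The components above meet in a recipe: a Python function that returns a dictionary of settings for the controller to run. The sketch below shows the general shape of such a dictionary for a `blocks` layout. The function name, dataset name, and HTML snippet are illustrative assumptions, and in a real project the function would be registered with Prodigy's `@prodigy.recipe` decorator rather than called directly.

```python
def summarization_choice_components(dataset, stream):
    """Return recipe-style components (hypothetical example, not a shipped recipe)."""
    return {
        "dataset": dataset,   # dataset in the database where answers are stored
        "stream": stream,     # iterable of task dicts sent to the web app
        "view_id": "blocks",  # combine several interfaces on one page
        "config": {
            "choice_style": "single",
            "blocks": [
                {"view_id": "choice"},
                # Assumed extra block: a line of custom HTML under the options
                {"view_id": "html", "html_template": "<em>Pick the best summary</em>"},
            ],
        },
    }


example_stream = [
    {"text": "Some article", "options": [{"id": "0", "text": "A summary"}]}
]
components = summarization_choice_components("text-summarization", example_stream)
block_views = [block["view_id"] for block in components["config"]["blocks"]]
print(components["view_id"], block_views)
```

Keeping the dictionary construction in a plain function makes it easy to inspect the configuration without starting the web server.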

# Prodigy Live Demo


Check out different out-of-the-box annotation interfaces on the Prodigy web site.

It will give you ideas about how Prodigy can be used for your needs.

### Named Entity

Lets you select non-overlapping text spans and assign categories to them.

#### Manual data labeling: 

https://demo.prodi.gy/?=null&view_id=ner_manual

#### Pre-labeled data to accept/reject: 

https://demo.prodi.gy/?=null&view_id=ner

### Span categorization
Lets you select arbitrary (and potentially overlapping) text spans and assign categories to them.

https://demo.prodi.gy/?=null&view_id=spans_manual

### Text Classification

Lets you assign categories to the whole text.

https://demo.prodi.gy/?=null&view_id=textcat_multi

### Choices

Based on the displayed information, select one or more of the given options.

#### Single Choice

https://demo.prodi.gy/?=null&view_id=textchoice

#### Multiple Choice (Image)

The demo uses image data, but the same interface can be used with text.

https://demo.prodi.gy/?=null&view_id=imgchoice

### Relations

Lets you define relations (directional or not) between text spans.

https://demo.prodi.gy/?=null&view_id=rel_bio

### Other interfaces

Feel free to explore other annotation interfaces. There are examples with images, video and audio, not just text.

# Annotate Summarization Dataset

In [47]:
from pandas import read_csv
import json
file_name = "./data/hf-summarization-dataset.csv"
df = read_csv(file_name)
df.list_choices = df.list_choices.apply(lambda x : json.loads(x))
print(f"Loaded {len(df)} records from {file_name}")

Loaded 3579 records from ./data/hf-summarization-dataset.csv


In [3]:
df.head()

Unnamed: 0,id,input,correct_choice,list_choices,lbl,distractor_model,dataset
0,32168497,Vehicles and pedestrians will now embark and d...,Passengers using a chain ferry have been warne...,[ A new service on the Isle of Wight's chain f...,1,bart-base,xsum
1,29610109,If you leave your mobile phone somewhere do yo...,"Do you ever feel lonely, stressed or jealous w...","[ You may be worried about your health, but wh...",1,bart-base,xsum
2,38018439,"Speaking on TV, Maria Zakharova said Jews had ...",A spokeswoman on Russian TV has said Jewish pe...,[ The Russian foreign minister has said she ha...,1,bart-base,xsum
3,32790804,"A report by the organisation suggests men, wom...",Egyptian security forces are using sexual viol...,[ Egyptian police are systematically abusing d...,1,bart-base,xsum
4,36437856,Police in Australia and Europe were aware of a...,One word and a freckle indirectly led to Huckl...,[One word and a freckle indirectly led to Huck...,0,bart-base,xsum


## Construct Prodigy Annotation Tasks

In [4]:
tasks = []
for _, row in df.iterrows():
    task = {
        'id' : row['id'],
        'text' : row['input'],
        'options': [],
        'correct_answer' : str(row.lbl),
    }
    for index, choice in enumerate(row.list_choices) :
        task['options'].append({'id' : str(index), 'text' : choice})
    tasks.append(task)
print(f"Generated {len(tasks)} tasks.")

Generated 3579 tasks.
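
Before handing tasks to Prodigy, it can help to sanity-check their shape. The helper below is a minimal sketch and not part of Prodigy: `validate_choice_task` is a hypothetical name, and the rules it checks (non-empty text, unique option ids, `correct_answer` pointing at an existing option) are assumptions drawn from the task layout built above.

```python
def validate_choice_task(task):
    """Return a list of problems found in one choice task (empty list means OK)."""
    problems = []
    if not task.get("text"):
        problems.append("missing text")
    option_ids = [option.get("id") for option in task.get("options", [])]
    if not option_ids:
        problems.append("no options")
    if len(set(option_ids)) != len(option_ids):
        problems.append("duplicate option ids")
    if task.get("correct_answer") not in option_ids:
        problems.append("correct_answer does not match any option id")
    return problems


# Invented sample tasks, shaped like the ones generated above
good_task = {
    "id": 1,
    "text": "Article text",
    "correct_answer": "1",
    "options": [{"id": "0", "text": "A"}, {"id": "1", "text": "B"}],
}
bad_task = {
    "id": 2,
    "text": "",
    "correct_answer": "2",
    "options": [{"id": "0", "text": "A"}, {"id": "0", "text": "B"}],
}
print(validate_choice_task(good_task))
print(validate_choice_task(bad_task))
```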


In [5]:
print(json.dumps(tasks[0], indent = 2))

{
  "id": "32168497",
  "text": "Vehicles and pedestrians will now embark and disembark the Cowes ferry separately following Maritime and Coastguard Agency (MCA) guidance.\nIsle of Wight Council said its new procedures were in response to a resident's complaint.\nCouncillor Shirley Smart said it would \"initially result in a slower service\".\nOriginally passengers and vehicles boarded or disembarked the so called \"floating bridge\" at the same time.\nMs Smart, who is the executive member for economy and tourism, said the council already had measures in place to control how passengers and vehicles left or embarked the chain ferry \"in a safe manner\".\nHowever, it was \"responding\" to the MCA's recommendations \"following this complaint\".\nShe added: \"This may initially result in a slower service while the measures are introduced and our customers get used to the changes.\"\nThe service has been in operation since 1859.",
  "options": [
    {
      "id": "0",
      "text": " A new 

In [6]:
import srsly
file_name = "./data/summarization-dataset-choices.jsonl"
srsly.write_jsonl(file_name, tasks)
print(f"Saved {len(tasks)} tasks in {file_name}")

Saved 3579 tasks in ./data/summarization-dataset-choices.jsonl
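
`srsly.write_jsonl` writes one JSON object per line. For readers who want to see the format without `srsly`, the same round trip can be sketched with the standard library. The helper names, sample tasks, and temporary file below are illustrative and not part of this project.

```python
import json
import os
import tempfile


def write_jsonl(path, records):
    """Write one JSON object per line."""
    with open(path, "w", encoding="utf-8") as handle:
        for record in records:
            handle.write(json.dumps(record, ensure_ascii=False) + "\n")


def read_jsonl(path):
    """Read a JSONL file back into a list of dicts, skipping blank lines."""
    with open(path, encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]


sample_tasks = [
    {"id": "1", "text": "First article", "options": [{"id": "0", "text": "A"}]},
    {"id": "2", "text": "Second article", "options": [{"id": "0", "text": "B"}]},
]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "tasks.jsonl")
    write_jsonl(path, sample_tasks)
    restored = read_jsonl(path)

print(restored == sample_tasks)
```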


## Start Prodigy

Copy/paste the following commands to the Terminal:

```bash

cd annotation-projects/summarization-choices
./start.sh

```

The annotation session will be started at http://localhost:8082?session=natalia

If you click on the link that Prodigy prints at start up, it will take you to the forwarded port link, but you will need to append `?session=natalia` to it and reload.

Annotate a few records and save them (the save button is at the top left).

## Examine annotated data

### Connect to Prodigy DB

In [7]:
from prodigy.components.db import connect
db = connect()
print(f"Database location: {db.db.database}")

Database location: /home/vscode/.prodigy/prodigy.db


In [8]:
db.datasets

['text-summarization',
 'text-summarization-binary',
 'summarization-dataset-reviewed',
 'summarization-dataset-annotated',
 'resume-ner']

In [9]:
dataset_name = 'text-summarization'
dataset = db.get_dataset_examples(dataset_name)
print(f"Loaded {len(dataset)} annotated tasks")

Loaded 3 annotated tasks


In [10]:
dataset[0]

{'id': '32168497',
 'text': 'Vehicles and pedestrians will now embark and disembark the Cowes ferry separately following Maritime and Coastguard Agency (MCA) guidance.\nIsle of Wight Council said its new procedures were in response to a resident\'s complaint.\nCouncillor Shirley Smart said it would "initially result in a slower service".\nOriginally passengers and vehicles boarded or disembarked the so called "floating bridge" at the same time.\nMs Smart, who is the executive member for economy and tourism, said the council already had measures in place to control how passengers and vehicles left or embarked the chain ferry "in a safe manner".\nHowever, it was "responding" to the MCA\'s recommendations "following this complaint".\nShe added: "This may initially result in a slower service while the measures are introduced and our customers get used to the changes."\nThe service has been in operation since 1859.',
 'options': [{'id': '0',
   'text': " A new service on the Isle of Wight's

### Inspect Task 1

```python
{
  # Copied from the input task
  'id': '32168497',
  'text': 'Vehicles and pedestrians will now embark and disembark the Cowes ferry separately following Maritime and Coastguard Agency (MCA) guidance.\nIsle of Wight Council said its new procedures were in response to a resident\'s complaint.\nCouncillor Shirley Smart said it would "initially result in a slower service".\nOriginally passengers and vehicles boarded or disembarked the so called "floating bridge" at the same time.\nMs Smart, who is the executive member for economy and tourism, said the council already had measures in place to control how passengers and vehicles left or embarked the chain ferry "in a safe manner".\nHowever, it was "responding" to the MCA\'s recommendations "following this complaint".\nShe added: "This may initially result in a slower service while the measures are introduced and our customers get used to the changes."\nThe service has been in operation since 1859.',
  'options': [
    {'id': '0', 'text': " A new service on the Isle of Wight's chain ferry has been launched following a complaint from a resident."},
    {'id': '1', 'text': 'Passengers using a chain ferry have been warned crossing times will be longer because of new safety measures.'}],
  'correct_answer': '1',

  # Prodigy uses MurmurHash-based hashes to identify inputs and tasks (e.g. to detect duplicates)
  '_input_hash': -920299312,
  '_task_hash': -1864632314,

  # This is passed from the recipe
  '_view_id': 'blocks', # Combined "choices" recipe with custom HTML line
  'config': {'choice_style': 'single'}, # Lets us know that it was a single choice annotation (as opposed to multiple choice)

  # This is added by Prodigy
  'accept': ['0'], # "id" field of selected option from "options" above
  'answer': 'accept', # Accept button was pressed to process the annotation (as opposed to 'reject')
  '_timestamp': 1762560620, 
  '_annotator_id': 'text-summarization-natalia', # dataset_name + session id from the URL
  '_session_id': 'text-summarization-natalia' # dataset_name + session id from the URL. Currently, _session_id and _annotator_id are interchangeable
}

```
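
Since `_annotator_id` is the dataset name joined to the session name from the URL, the session name can be recovered by stripping the dataset prefix. The helper below is a small sketch under that assumption; `session_name` is a hypothetical function, not a Prodigy API.

```python
def session_name(annotator_id, dataset_name):
    """Strip the '<dataset_name>-' prefix from an annotator id, if present."""
    prefix = dataset_name + "-"
    if annotator_id.startswith(prefix):
        return annotator_id[len(prefix):]
    return annotator_id  # leave ids without the expected prefix unchanged


annotator = session_name("text-summarization-natalia", "text-summarization")
print(annotator)
```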

### Inspect Task 2 (Flagged)

```python
{
    'id': '29610109',
    'text': 'If you leave your mobile phone somewhere do you worry you will not be able to check it?\nIf any of this sounds familiar, there is a chance you could be spending too much time on social networks.\nAn exclusive online Newsbeat poll suggests that a quarter of 15 to 18-year-olds in the UK feel happier online than they do in real life.\nDr Radha from The Surgery on Radio 1 has dealt with patients who have displayed "a lot of social anxiety" because they are using social networks too much.\n"Being online can provoke a sense of \'I\'m not good enough, everyone else is having an amazing life\'," she explained.\n"It doesn\'t give us a sense of reality and actually what you will find is most people are probably doing the same thing as you are."\nThe survey, carried out last month, also suggests a third of 15 to 18-year-olds have met someone in person they originally met through social media.\nDr Radha has said it is important people carefully consider what information they share with the online community.\n"What this survey showed is a lot of people go online alone," she said.\n"In terms of our personal details and how we respond to messages from other people, we need to make sure we are looking after all of that safely."\nDr Radha was concerned that some people feel safer dealing with people online, rather than in person.\n"The more time we spend online, the less we are able to develop our social skills," she explained.\n"When you are online you\'re not getting eye contact with people or perceiving how body language is changing, so as a result what people are saying can be misinterpreted.\n"Physical contact, like a hug and a kiss, is really important. You don\'t get that kind of emotional confidence from being online."\nIf your online activity is leaving you feeling anxious, Dr Radha has advised that you should "slowly try to wean yourself off it".\nShe said: "If you are worrying, \'what\'s going on? 
What am I missing?\' It\'s a sign that being online too much is quite bad for you.\n"Give yourself some rules by saying, \'I\'m only going to check things three times a day for this amount of time\'."\nBBC Radio 1\'s The Surgery with Aled and Dr Radha is on Wednesday\'s at 9pm.\nFollow @BBCNewsbeat on Twitter and Radio1Newsbeat on YouTube',
    'options': [
        {'id': '0', 'text': ' You may be worried about your health, but what if you are online?'},
        {'id': '1', 'text': 'Do you ever feel lonely, stressed or jealous when you are online?'}
    ],
    'correct_answer': '1',
    '_input_hash': 1247871379,
    '_task_hash': 884697833,
    '_view_id': 'blocks',
    'accept': ['1'],
    'config': {'choice_style': 'single'},
    'flagged': True, # ATTN: the message was flagged. Normally, we would also add "User Comment" field to the annotation interface so that annotators can leave a comment
    'answer': 'accept',
    '_timestamp': 1762560626,
    '_annotator_id': 'text-summarization-natalia',
    '_session_id': 'text-summarization-natalia'
}

```

### Inspect Task 3 (Rejected)

```python

{
    'id': '38018439',
    'text': 'Speaking on TV, Maria Zakharova said Jews had told her they donated both to Mr Trump and Hillary Clinton.\nShe joked that American Jews were the best guide to US politics.\nThe diplomat\'s remarks caused shock. Anti-US propagandists in the last century peddled an idea that rich New York Jews controlled US politics.\nMs Zakharova was speaking on a chat show on Russian state TV at the weekend but her comments drew more attention after being picked up by media outlets on Thursday.\nShe said she had visited New York with an official Russian delegation at the time of the last UN General Assembly, in September.\n"I have a lot of friends and acquaintances there, of course I was interested to find out: how are the elections going, what are the American people\'s expectations?" she said.\n"If you want to know what will happen in America, who do you need to talk to? You have to talk to the Jews, of course. It goes without saying."\nAt this, the TV studio audience applauded loudly.\n"I went here and there among them, to chat," she continued.\nImitating a Jewish accent, Mrs Zakharova said Jewish people had told her: "\'Marochka, understand this - we\'ll donate to Clinton, of course. But we\'ll give the Republicans twice that amount.\' Enough said! That settled it for me - the picture was clear.\n"If you want to know the future, don\'t read the mainstream newspapers - our people in Brighton [Beach] will tell you everything."\nShe was referring to a district of Brooklyn with a large diaspora of Jewish emigres from the former Soviet Union.\nRussian opposition activist Roman Dobrokhotov wrote on Twitter (in Russian) that the spokeswoman had "explained Trump\'s victory as a Jewish conspiracy".\nMichael McFaul, the former US ambassador to Moscow, commented on Facebook, "Wow. 
And this is the woman who criticizes me for not being diplomatic."\nDuring the election campaign, Mrs Clinton accused Mr Trump of posting a "blatantly anti-Semitic" tweet after he used an image resembling the Star of David and stacks of money.\nMr Trump, whose son-in-law Jared Kushner is Jewish, dismissed the accusation as "ridiculous".\nAn exit poll by US non-profit J Street suggests an overwhelming majority of US Jews voted for Hillary Clinton in the presidential election.',
    'options': [
      {'id': '0', 'text': ' The Russian foreign minister has said she has been "settled" by criticism from Jewish people for saying that the US election was a "Jewish conspiracy".'},
      {'id': '1', 'text': 'A spokeswoman on Russian TV has said Jewish people in New York told her they had mainly backed Trump in the US election.'}
    ],
    'correct_answer': '1',
    '_input_hash': -386851509,
    '_task_hash': 594860135,
    '_view_id': 'blocks',
    'accept': [], # Nothing was selected
    'config': {'choice_style': 'single'},
    'answer': 'reject', # ATTN: the task was rejected
    '_timestamp': 1762560628,
    '_annotator_id': 'text-summarization-natalia',
    '_session_id': 'text-summarization-natalia'
}
```
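
The three records above differ mainly in `answer`, `accept`, and `flagged`. A minimal sketch of how a list of such records could be summarized follows. `summarize_choice_annotations` is a hypothetical helper, and the sample records are trimmed to the fields it reads.

```python
from collections import Counter


def summarize_choice_annotations(examples):
    """Count answers and flags, and score accepted choices against correct_answer."""
    answers = Counter(example.get("answer") for example in examples)
    flagged = sum(1 for example in examples if example.get("flagged"))
    # Only accepted tasks with a selected option can be scored
    accepted = [
        example for example in examples
        if example.get("answer") == "accept" and example.get("accept")
    ]
    correct = sum(
        1 for example in accepted
        if example["accept"][0] == example["correct_answer"]
    )
    return {
        "answers": dict(answers),
        "flagged": flagged,
        "accuracy_on_accepted": correct / len(accepted) if accepted else None,
    }


# Trimmed versions of the three annotated tasks shown above
annotated = [
    {"accept": ["0"], "answer": "accept", "correct_answer": "1"},
    {"accept": ["1"], "answer": "accept", "correct_answer": "1", "flagged": True},
    {"accept": [], "answer": "reject", "correct_answer": "1"},
]
summary = summarize_choice_annotations(annotated)
print(summary)
```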

# Annotate Summarization (Binary)
## Create Dataset

In [12]:
import random
tasks = []
for _, row in df.iterrows():
    for index, choice in enumerate(row.list_choices) :
        task = {
            'id' : row['id'],
            'text' : row['input'],
            'option': {'id' : str(index), 'value' : choice},
            'is_correct_answer' : int(row.lbl) == index, # lbl holds the index of the correct choice, not its text
            'html' : ' ', # Important
        }
        tasks.append(task)
print(f"Generated {len(tasks)} tasks.")

random.seed(42)
# random.shuffle(tasks) # Commenting it out for demo purposes, but the dataset should be shuffled in real life


Generated 7158 tasks.


In [13]:
file_name = './data/summarization-dataset-binary.jsonl'

srsly.write_jsonl(file_name, tasks)
print(f"Saved {len(tasks)} annotation tasks to {file_name}")

Saved 7158 annotation tasks to ./data/summarization-dataset-binary.jsonl


### Start Prodigy Session

```bash
cd /workspaces/prodigy-demo/annotation-projects/summarization-binary
./start.sh

# This will start a session at http://localhost:8082?session=natalia
```

### Inspect Annotated Tasks

In [14]:
db.datasets

['text-summarization',
 'text-summarization-binary',
 'summarization-dataset-reviewed',
 'summarization-dataset-annotated',
 'resume-ner']

In [15]:
dataset_name = 'text-summarization-binary'
dataset = db.get_dataset_examples(dataset_name)
print(f"Loaded {len(dataset)} annotated tasks from {dataset_name}")

Loaded 8 annotated tasks from text-summarization-binary


#### Inspect accepted task

```python

{
    'id': '29610109',
    'text': 'If you leave your mobile phone somewhere do you worry you will not be able to check it?\nIf any of this sounds familiar, there is a chance you could be spending too much time on social networks.\nAn exclusive online Newsbeat poll suggests that a quarter of 15 to 18-year-olds in the UK feel happier online than they do in real life.\nDr Radha from The Surgery on Radio 1 has dealt with patients who have displayed "a lot of social anxiety" because they are using social networks too much.\n"Being online can provoke a sense of \'I\'m not good enough, everyone else is having an amazing life\'," she explained.\n"It doesn\'t give us a sense of reality and actually what you will find is most people are probably doing the same thing as you are."\nThe survey, carried out last month, also suggests a third of 15 to 18-year-olds have met someone in person they originally met through social media.\nDr Radha has said it is important people carefully consider what information they share with the online community.\n"What this survey showed is a lot of people go online alone," she said.\n"In terms of our personal details and how we respond to messages from other people, we need to make sure we are looking after all of that safely."\nDr Radha was concerned that some people feel safer dealing with people online, rather than in person.\n"The more time we spend online, the less we are able to develop our social skills," she explained.\n"When you are online you\'re not getting eye contact with people or perceiving how body language is changing, so as a result what people are saying can be misinterpreted.\n"Physical contact, like a hug and a kiss, is really important. You don\'t get that kind of emotional confidence from being online."\nIf your online activity is leaving you feeling anxious, Dr Radha has advised that you should "slowly try to wean yourself off it".\nShe said: "If you are worrying, \'what\'s going on? 
What am I missing?\' It\'s a sign that being online too much is quite bad for you.\n"Give yourself some rules by saying, \'I\'m only going to check things three times a day for this amount of time\'."\nBBC Radio 1\'s The Surgery with Aled and Dr Radha is on Wednesday\'s at 9pm.\nFollow @BBCNewsbeat on Twitter and Radio1Newsbeat on YouTube',
    'option': {'id': '0', 'value': ' You may be worried about your health, but what if you are online?'},
    'is_correct_answer': False,
    'html': ' ',
    '_input_hash': -1574724678,
    '_task_hash': -2145974390,
    '_view_id': 'html',
    'answer': 'accept',  ### ACCEPTED
    '_timestamp': 1762562740,
    '_annotator_id': 'text-summarization-binary-natalia',
    '_session_id': 'text-summarization-binary-natalia'
}

```

#### Inspect Rejected Task

```python
{
    'id': '38018439',
    'text': 'Speaking on TV, Maria Zakharova said Jews had told her they donated both to Mr Trump and Hillary Clinton.\nShe joked that American Jews were the best guide to US politics.\nThe diplomat\'s remarks caused shock. Anti-US propagandists in the last century peddled an idea that rich New York Jews controlled US politics.\nMs Zakharova was speaking on a chat show on Russian state TV at the weekend but her comments drew more attention after being picked up by media outlets on Thursday.\nShe said she had visited New York with an official Russian delegation at the time of the last UN General Assembly, in September.\n"I have a lot of friends and acquaintances there, of course I was interested to find out: how are the elections going, what are the American people\'s expectations?" she said.\n"If you want to know what will happen in America, who do you need to talk to? You have to talk to the Jews, of course. It goes without saying."\nAt this, the TV studio audience applauded loudly.\n"I went here and there among them, to chat," she continued.\nImitating a Jewish accent, Mrs Zakharova said Jewish people had told her: "\'Marochka, understand this - we\'ll donate to Clinton, of course. But we\'ll give the Republicans twice that amount.\' Enough said! That settled it for me - the picture was clear.\n"If you want to know the future, don\'t read the mainstream newspapers - our people in Brighton [Beach] will tell you everything."\nShe was referring to a district of Brooklyn with a large diaspora of Jewish emigres from the former Soviet Union.\nRussian opposition activist Roman Dobrokhotov wrote on Twitter (in Russian) that the spokeswoman had "explained Trump\'s victory as a Jewish conspiracy".\nMichael McFaul, the former US ambassador to Moscow, commented on Facebook, "Wow. 
And this is the woman who criticizes me for not being diplomatic."\nDuring the election campaign, Mrs Clinton accused Mr Trump of posting a "blatantly anti-Semitic" tweet after he used an image resembling the Star of David and stacks of money.\nMr Trump, whose son-in-law Jared Kushner is Jewish, dismissed the accusation as "ridiculous".\nAn exit poll by US non-profit J Street suggests an overwhelming majority of US Jews voted for Hillary Clinton in the presidential election.',
    'option': {'id': '0',
    'value': ' The Russian foreign minister has said she has been "settled" by criticism from Jewish people for saying that the US election was a "Jewish conspiracy".'},
    'is_correct_answer': False,
    'html': ' ',
    '_input_hash': 420195201,
    '_task_hash': 940450759,
    '_view_id': 'html',
    'answer': 'reject', # Choice was rejected
    '_timestamp': 1762562745,
    '_annotator_id': 'text-summarization-binary-natalia',
    '_session_id': 'text-summarization-binary-natalia'
}
```
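
Because each binary task carries a single candidate summary, the decisions for one article are spread over several records. The sketch below regroups them by `id`. `accepted_options_by_id` is a hypothetical helper, and the sample records keep only the fields it needs.

```python
from collections import defaultdict


def accepted_options_by_id(examples):
    """Map each article id to the list of option ids that were accepted."""
    grouped = defaultdict(list)
    for example in examples:
        accepted = grouped[example["id"]]  # registers the id even with no accepts
        if example["answer"] == "accept":
            accepted.append(example["option"]["id"])
    return dict(grouped)


binary_annotations = [
    {"id": "29610109", "option": {"id": "0"}, "answer": "accept"},
    {"id": "29610109", "option": {"id": "1"}, "answer": "reject"},
    {"id": "38018439", "option": {"id": "0"}, "answer": "reject"},
]
by_article = accepted_options_by_id(binary_annotations)
print(by_article)
```

Articles with an empty list had every reviewed candidate rejected, which makes them easy to pick out for a second look.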

# Explore Inter-annotator Agreement

(Or, more precisely, disagreement.)

This pattern can also be used to compare model results against a gold-standard dataset.


## Create Ground Truth Dataset

In [16]:
from copy import deepcopy
from prodigy.util import set_hashes

# Unannotated tasks (same code as above)
tasks = []
for _, row in df.iterrows():
    task = {
        'id' : row['id'],
        'text' : row['input'],
        'options': [],
        'correct_answer' : str(row.lbl),
    }
    for index, choice in enumerate(row.list_choices) :
        task['options'].append({'id' : str(index), 'text' : choice})
    tasks.append(task)
print(f"Generated {len(tasks)} tasks.")


correct_tasks = []

for task in deepcopy(tasks):
    task['_view_id'] = 'choice'
    task['answer'] = 'accept'
    task['accept'] = [task['correct_answer']]
    task['_session_id'] = 'summarization-dataset-gold'
    task['_annotator_id'] = 'summarization-dataset-gold'
    task = set_hashes(task)
    correct_tasks.append(task)


Generated 3579 tasks.


## Simulate annotations or model results

In [17]:
import random
random.seed(42)

random_tasks = []

for task in deepcopy(tasks):
    task['_view_id'] = 'choice'
    task['answer'] = 'accept'
    selected = random.choice(task['options'])
    task['accept'] = [selected['id']]
    task['_session_id'] = 'summarization-dataset-random'
    task['_annotator_id'] = 'summarization-dataset-random'
    task = set_hashes(task)
    random_tasks.append(task)

In [18]:
incorrect = 0
for task in random_tasks :
    if task['accept'][0] != task['correct_answer'] :
        incorrect += 1
incorrect

1789
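
The count above is raw disagreement. A chance-corrected measure such as Cohen's kappa is often reported alongside it. The sketch below computes both for two label sequences in plain Python; the function names and the tiny sample lists are illustrative, and this is not something Prodigy computes in this notebook.

```python
from collections import Counter


def percent_agreement(labels_a, labels_b):
    """Fraction of positions where both annotators chose the same label."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("Label lists must be non-empty and of equal length")
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return matches / len(labels_a)


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa: (observed - expected) / (1 - expected)."""
    total = len(labels_a)
    observed = percent_agreement(labels_a, labels_b)
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    # Agreement expected if both annotators labeled independently at their own rates
    expected = sum(
        freq_a[label] * freq_b[label] for label in set(freq_a) | set(freq_b)
    ) / (total * total)
    if expected == 1:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1 - expected)


gold_labels = ["1", "1", "0", "1"]
other_labels = ["1", "0", "0", "0"]
agreement = percent_agreement(gold_labels, other_labels)
kappa = cohen_kappa(gold_labels, other_labels)
print(agreement, round(kappa, 3))
```

With the notebook's data, the two sequences would be the first element of `accept` from `correct_tasks` and from `random_tasks`, taken in the same task order.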

## Add tasks to the database as if they were manually annotated

In [19]:
dataset_name = 'summarization-dataset-annotated'
if dataset_name in db.datasets :
    db.drop_dataset(dataset_name)
db.add_examples(correct_tasks, [dataset_name])
db.add_examples(random_tasks, [dataset_name])
print(f"Added {db.count_dataset(dataset_name)} tasks to dataset {dataset_name}")

Added 7158 tasks to dataset summarization-dataset-annotated
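
With both sets of answers stored in one dataset, the tasks worth reviewing are those where annotators chose differently. The sketch below finds them by grouping records on `id`. It is a simplification for illustration, not a description of how Prodigy's review workflow groups examples, and `find_disagreements` is a hypothetical helper.

```python
from collections import defaultdict


def find_disagreements(examples):
    """Return {task id: {annotator: accepted option ids}} where answers differ."""
    answers = defaultdict(dict)
    for example in examples:
        accepted = tuple(example.get("accept", []))
        answers[example["id"]][example["_annotator_id"]] = accepted
    return {
        task_id: by_annotator
        for task_id, by_annotator in answers.items()
        if len(set(by_annotator.values())) > 1
    }


# Invented records that keep only the fields the helper reads
stored = [
    {"id": "a", "_annotator_id": "gold", "accept": ["1"]},
    {"id": "a", "_annotator_id": "random", "accept": ["1"]},
    {"id": "b", "_annotator_id": "gold", "accept": ["1"]},
    {"id": "b", "_annotator_id": "random", "accept": ["0"]},
]
conflicts = find_disagreements(stored)
print(conflicts)
```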


## Start review session in Prodigy

```bash

cd annotation-projects/summarization-choices
./start-review.sh

```

The annotation session will be started at http://localhost:8082?session=natalia

# NER for Information Extraction (IE)

In [22]:
from pandas import read_csv
file_name = "./data/resume-dataset.csv"
df = read_csv(file_name)
print(f"Loaded {len(df)} resumes")

Loaded 10 resumes


In [None]:
df.head()

Unnamed: 0,resume_id,full_name,resume_text
0,1,Arjun Reddy,4\nOct\nArjun Reddy\n9185551234 | arjun.reddy@...
1,2,LUCAS M\nBERNARD-\nDUPONT,LUCAS M\nBERNARD-\nDUPONT\n\nLUCASBERNARD@PROT...
2,3,RYAN KALINOWSKI,"RYAN KALINOWSKI, M.S.\nData Scientist | Machin..."
3,4,"Morgan Ellis, Ph.D.","Arvada, CO\nLinkedIn\nMorgan Ellis, Ph.D.\nmor..."
4,5,Rohan Desai,"SUMMARY\nRohan Desai\nData Scientist\nMedford,..."


## Create Annotation Tasks

In [None]:
tasks = []
for index, row in df.iterrows() :
    task = {
        'resume_id' : row.resume_id,
        'text' : row.resume_text,
    }
    tasks.append(task)

file_name = './data/resume-ner-dataset.jsonl'
srsly.write_jsonl(file_name, tasks)
print(f"Saved {len(tasks)} tasks in {file_name}")


Saved 10 tasks in ./data/resume-ner-dataset.jsonl


## Start Prodigy Session

```bash
cd /workspaces/prodigy-demo/annotation-projects/resume-ner
./start.sh
```

This will start a session at http://localhost:8082?session=natalia

## Load Annotated Data

In [26]:
dataset_name = 'resume-ner'
dataset = db.get_dataset_examples(dataset_name)
print(f"Loaded {len(dataset)} tasks from {dataset_name}")

Loaded 10 tasks from resume-ner


## Build a Dataframe from annotations


This time, we are loading annotated tasks from a file, just to make sure all 10 tasks have annotations.

These tasks were annotated using Prodigy, exported from the database, and saved to a JSONL file.

In [28]:
file_name = './data/resume-ner-dataset-annotated.jsonl' # Loading from pre-annotated dataset so that we have 10 examples. The dataset was annotated in Prodigy.
dataset = list(srsly.read_jsonl(file_name))
len(dataset)

10

In [30]:
from pandas import DataFrame
rows = []
for task in dataset :
    text = task['text']
    row = {
        'resume_id' : task['resume_id'],
        'resume_text' : text,
    }
    for span in task['spans'] : # NER annotations
        span_text = text[span['start'] : span['end']]
        label = span['label'].lower()
        row[label] = span_text # If multiple entries for label exist, we'll end up keeping the last one, but it's ok for a demo
    rows.append(row)

DataFrame(rows)    

Unnamed: 0,resume_id,resume_text,first_name,last_name,phone,email,city,state,middle_name,1st_work_year,1st_study_year,linkedin
0,1,4\nOct\nArjun Reddy\n9185551234 | arjun.reddy@...,Arjun,Reddy,9185551234,arjun.reddy@themailpad.com,Irving,Texas,,,,
1,2,LUCAS M\nBERNARD-\nDUPONT\n\nLUCASBERNARD@PROT...,LUCAS,BERNARD-\nDUPONT,(949) 423-8672,LUCASBERNARD@PROTONMAIL.COM,,,M,2021,,
2,3,"RYAN KALINOWSKI, M.S.\nData Scientist | Machin...",RYAN,KALINOWSKI,(585) 305-7824,rkalinowski@cornell.edu,Seattle,WA,,8/2015,8/2011,
3,4,"Arvada, CO\nLinkedIn\nMorgan Ellis, Ph.D.\nmor...",Morgan,Ellis,(415) 716-4823,morgan.ellis@gmail.com,Arvada,CO,,2011,,
4,5,"SUMMARY\nRohan Desai\nData Scientist\nMedford,...",Rohan,Desai,| (716) 384-7623,rohan.desai@myjobsmails.com,Medford,MA,,2018-,,
5,6,Evan Marlow\nSenior Data Scientist\nevan.marlo...,Evan,Marlow,,4086004829,San Francisco,CA,,11/2013,09/2007,www.linkedin.com/in/evan-lol
6,7,"Houston, TX\nafarooq.amina@gmail.com\nafarooq-...",,,,afarooq.amina@gmail.com,Houston,TX,,2020,2007,
7,8,Data Scientist\nName : Ravi Sai Ajay Kumar Man...,Ravi,Dallas,+19452334678,ravi.ajaymandapati@gmail.com,,Texas,Sai Ajay Kumar,2018,2019,https://www.linkedin.com/in/ajay-mandapati/
8,9,Arvind Kumar Reddy\nData Scientist\n(272) 235-...,Arvind,Reddy,(272) 235-1189,arvind.k@jobtechsmail.com,,,Kumar,2015,,
9,10,Carlos Mendoza\nSenior AI/ML Engineer\nEmail: ...,Carlos,Mendoza,,carlos.mendoza8899@gmail.com,Jacksonville,FL,,01/2015,2014,www.linkedin.com/in/carlos-mendoza-zavala-9010...
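
The loop above keeps only the last span per label. Where repeated labels matter (for example, several phone numbers), a variant that collects every value could look like the sketch below. `spans_to_row` is a hypothetical helper, and the sample task is invented for illustration.

```python
from collections import defaultdict


def spans_to_row(task):
    """Collect the text of every span, grouped by lower-cased label."""
    text = task["text"]
    values = defaultdict(list)
    for span in task.get("spans", []):
        values[span["label"].lower()].append(text[span["start"]:span["end"]])
    return dict(values)


sample_task = {
    "text": "Ann Lee\n555-0100 / 555-0199",
    "spans": [
        {"start": 0, "end": 3, "label": "FIRST_NAME"},
        {"start": 4, "end": 7, "label": "LAST_NAME"},
        {"start": 8, "end": 16, "label": "PHONE"},
        {"start": 19, "end": 27, "label": "PHONE"},
    ],
}
row = spans_to_row(sample_task)
print(row)
```

Each value is a list, so a downstream step can decide whether to keep the first entry, join them, or expand them into separate columns.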


# NER for text partitioning

Here, we demonstrate how to use the NER interface for text partitioning.

The idea is to partition resumes / LinkedIn profiles into sections (such as PERSONAL_INFO, EXPERIENCE, EDUCATION, etc.) for more efficient search, information extraction (IE), and retrieval-augmented generation (RAG).

For example, when looking for industry experience in a certain area, we might choose to exclude publications and school projects.
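
As a rough illustration of that filtering step, the sketch below selects sections by label. It assumes each section is a dict with `label` and `text` keys; `select_partitions`, the `PUBLICATIONS` label, and the sample sections are hypothetical.

```python
def select_partitions(partitions, include=None, exclude=()):
    """Keep partitions whose label is in `include` (if given) and not in `exclude`."""
    selected = []
    for partition in partitions:
        label = partition["label"]
        if include is not None and label not in include:
            continue
        if label in exclude:
            continue
        selected.append(partition)
    return selected


# Invented sections for illustration
resume_sections = [
    {"label": "EXPERIENCE", "text": "Data Scientist at a hospital"},
    {"label": "PUBLICATIONS", "text": "A paper about gradient boosting"},
    {"label": "EDUCATION", "text": "M.S. in Statistics"},
]
searchable = select_partitions(resume_sections, exclude={"PUBLICATIONS"})
experience_only = select_partitions(resume_sections, include={"EXPERIENCE"})
print([section["label"] for section in searchable])
print([section["label"] for section in experience_only])
```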

## Start Prodigy Session

```bash
cd /workspaces/prodigy-demo/annotation-projects/resume-partitioning
./start.sh

# This will start a session at http://localhost:8082?session=natalia

```

## Alternative Interface: `spans` vs `ner`

```bash
cd /workspaces/prodigy-demo/annotation-projects/resume-partitioning
./start-spans.sh

# This will start a session at http://localhost:8082?session=natalia

```

## Get Partitioned Resumes

Again, we load data from previous annotations here to make sure we have 10 annotated tasks for demo purposes.

In [None]:
# Read file with saved annotations
file_name = './data/resume-partitioning-dataset-annotated.jsonl'
dataset = list(srsly.read_jsonl(file_name))
len(dataset)

10

## Extract Partitions

### Code to extract partitions

In [None]:
# Extract resume partitions based on annotated partition starts
# Assume the partition ends where the next one starts
#
def get_partitions(task, default_partition_name = 'OTHER', strict = False) :
    text = task['text']
    spans = task['spans']
    spans = sorted(spans, key = lambda x : x['start'])
    # If nothing was annotated, return the full resume
    if not spans :
        if strict :
            raise RuntimeError("No partitions found. Expected at least one.")
        
        partition = {
            'label' : default_partition_name,
            'start' : 0,
            'end' : len(text),
            'text' : text
        }
        return [partition]
    partitions = []
    end = 0
    if spans[0]['start'] > 0 :
        if strict :
            raise RuntimeError("First partition must start at offset 0")
        # Create default first partition
        first_partition = {
            'label' : default_partition_name,
            'start' : 0,
            'end' : spans[0]['start'],
            'text' : text[0 : spans[0]['start']],
        }
        partitions.append(first_partition)
        
        end = spans[0]['start']
    
    start = spans[0]['start']
    label = spans[0]['label']
    for span in spans[1:] :
        end = span['start']
        partition = {
            'label' : label,
            'start' : start,
            'end' : end,
            'text' : text[start : end],
        }
        partitions.append(partition)
        start = end
        label = span['label']
    if start != len(text) :
        end = len(text)
        partition = {
            'label' : label,
            'start' : start,
            'end' : end,
            'text' : text[start : end],
        }
        partitions.append(partition)
    
    partitions_texts = [x['text'] for x in partitions]
    partitions_texts = "".join(partitions_texts)
    assert text == partitions_texts

    return partitions
        


In [None]:
# Inspect output
get_partitions(dataset[1])

[{'label': 'PERSONAL_INFO',
  'start': 0,
  'end': 69,
  'text': 'LUCAS M\nBERNARD-\nDUPONT\n\nLUCASBERNARD@PROTONMAIL.COM\n\n(949) 423-8672\n'},
 {'label': 'SKILLS',
  'start': 69,
  'end': 162,
  'text': 'SKILLS\n\nWords per minute typing: 60+\nMedical terminology\nCommunication and leadership skills\n'},
 {'label': 'EXPERIENCE',
  'start': 162,
  'end': 446,
  'text': 'EXPERIENCE\nMEDICAL SCRIBE @ OROVILLE HOSPITAL\nSept 2021 - Present\nExtended the ability of the provider to care for patients by completing\nbilling, medical notes, creating medical orders and answering patient\nquestions. I am leaving this job because I am not being given enough\nhours.\n'},
 {'label': 'EXPERIENCE',
  'start': 446,
  'end': 716,
  'text': "INTERN AT UC DAVIS MEDICAL CENTER\nJan 1st 2020 to April 2020\nAssisted nursing staff in the accelerated access unit by dispatching them\nto patients' rooms. Maintained in-room supplies of consumables such as\ngloves and masks. The program was discontinued due to 
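
To make the boundary rule concrete, here is a condensed, self-contained sketch of the same idea on a toy string. The helper name `split_at_starts` and the toy text are illustrative assumptions, not part of the notebook. It restates the rule (each partition ends where the next one starts) and is not meant to replace `get_partitions`.

```python
def split_at_starts(text, spans, default_label="OTHER"):
    """Cut `text` at every span start; each piece takes the label of the span that opens it."""
    starts = sorted((s["start"], s["label"]) for s in spans)
    # Text before the first annotated start (or the whole text) gets the default label
    if not starts or starts[0][0] > 0:
        starts.insert(0, (0, default_label))
    # Each piece ends where the next one begins; the last one ends at len(text)
    ends = [start for start, _ in starts[1:]] + [len(text)]
    return [
        {"label": label, "start": start, "end": end, "text": text[start:end]}
        for (start, label), end in zip(starts, ends)
    ]

# Toy resume (invented for illustration)
toy_text = "Jane Doe\nSKILLS\nPython\nEXPERIENCE\nACME Corp\n"
toy_spans = [
    {"start": 9, "end": 15, "label": "SKILLS"},
    {"start": 23, "end": 33, "label": "EXPERIENCE"},
]

pieces = split_at_starts(toy_text, toy_spans)
for piece in pieces:
    print(piece["label"], piece["start"], piece["end"], repr(piece["text"]))
```

Only the span starts matter here: the annotator marks the section header, and the section body is everything up to the next header.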

### Extract Partitions and create a DataFrame

In [42]:
partitions = []
for task in dataset :
    task_partitions = get_partitions(task)
    resume_id = task['resume_id']
    for index, partition in enumerate(task_partitions) :
        new_partition = {
            'resume_id' : resume_id,
            'partition_sequence' : index,
        }
        new_partition.update(partition)
        partitions.append(new_partition)

In [51]:
df_sections = DataFrame(partitions)
df_sections.head(10)


Unnamed: 0,resume_id,partition_sequence,label,start,end,text
0,1,0,OTHER,0,6,4\nOct\n
1,1,1,PERSONAL_INFO,6,74,Arjun Reddy\n9185551234 | arjun.reddy@themailp...
2,1,2,EDUCATION,74,153,Master's in Management Information Systems | T...
3,1,3,EDUCATION,153,261,Bachelor in Computer Science | VNR Vignana Jyo...
4,2,0,PERSONAL_INFO,0,69,LUCAS M\nBERNARD-\nDUPONT\n\nLUCASBERNARD@PROT...
5,2,1,SKILLS,69,162,SKILLS\n\nWords per minute typing: 60+\nMedica...
6,2,2,EXPERIENCE,162,446,EXPERIENCE\nMEDICAL SCRIBE @ OROVILLE HOSPITAL...
7,2,3,EXPERIENCE,446,716,INTERN AT UC DAVIS MEDICAL CENTER\nJan 1st 202...
8,2,4,EDUCATION,716,1108,EDUCATION\nGENETICS AND GENOMICS (NO DEGREE EA...
9,2,5,OTHER,1108,1445,VOLUNTEER EXPERIENCE OR LEADERSHIP\n-\nLed a s...
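
With sections in a flat table, restricting downstream search to certain section types becomes a simple filter. Below is a minimal sketch on plain dictionaries shaped like the entries of `partitions`. The sample rows, the `select_sections` helper, and the `PUBLICATIONS` label are made up for illustration and do not come from the annotated dataset.

```python
from collections import Counter

# Toy rows shaped like the entries of `partitions` (values are invented)
sections = [
    {"resume_id": 1, "partition_sequence": 0, "label": "PERSONAL_INFO", "text": "Jane Doe\n"},
    {"resume_id": 1, "partition_sequence": 1, "label": "EXPERIENCE", "text": "ML Engineer at ACME\n"},
    {"resume_id": 1, "partition_sequence": 2, "label": "PUBLICATIONS", "text": "A paper on ML\n"},
    {"resume_id": 2, "partition_sequence": 0, "label": "EXPERIENCE", "text": "Data Analyst\n"},
]

def select_sections(rows, include_labels):
    """Keep only the rows whose label is in `include_labels`."""
    wanted = set(include_labels)
    return [row for row in rows if row["label"] in wanted]

# Keep industry experience only, leaving out publications and other sections
experience_only = select_sections(sections, ["EXPERIENCE"])

# Quick overview of how many sections of each type were found
label_counts = Counter(row["label"] for row in sections)
print(len(experience_only), dict(label_counts))
```

On the DataFrame itself, the same filter can be written as `df_sections[df_sections.label.isin(["EXPERIENCE"])]`.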


# Multi-step annotations


In the previous NER example, we annotated first and last names, email, LinkedIn, etc., as well as experience and education years. That required too many labels and a lot of scrolling through long resumes, which led to annotation errors.

One way to simplify the task is to narrow down the resume sections that we annotate.

Now that we have resume sections, we can select only the PERSONAL_INFO sections for names and contact info, which makes annotation easier.

## Create annotation dataset

In [53]:
df_personal_info = df_sections[df_sections.label == 'PERSONAL_INFO'].reset_index(drop = True)
print(f"Found {len(df_personal_info)} PERSONAL_INFO sections")

Found 9 PERSONAL_INFO sections


In [54]:
tasks = []
for index, row in df_personal_info.iterrows() :
    task = {
        "resume_id" : row.resume_id,
        "partition_sequence" : row.partition_sequence,
        "text" : row.text, 
        # Storing these just to avoid joining tables / dataframes
        "label" : row.label,
        "start" : row.start,
        "end" : row.end,
    }
    tasks.append(task)
len(tasks)

9

In [57]:
tasks[2]

{'resume_id': 3,
 'partition_sequence': 0,
 'text': 'RYAN KALINOWSKI, M.S.\nData Scientist | Machine Learning | Healthcare Analytics\n\n(585) 305-7824             rkalinowski@cornell.edu\nLinkedIn\nSeattle, WA\n',
 'label': 'PERSONAL_INFO',
 'start': 0,
 'end': 152}
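
Because each task keeps the section's `start` offset, spans annotated on a section can later be mapped back to positions in the full resume by adding that offset. Below is a minimal sketch under that assumption. The helper name `to_resume_offsets`, the toy resume, and the `NAME` label are hypothetical.

```python
def to_resume_offsets(task):
    """Shift section-relative character offsets by the section's start offset.

    Only 'start' and 'end' are shifted; token indices, if present,
    stay relative to the section text.
    """
    offset = task["start"]
    return [
        {**span, "start": span["start"] + offset, "end": span["end"] + offset}
        for span in task.get("spans", [])
    ]

# Toy resume (invented): a 6-character prefix followed by a PERSONAL_INFO section
resume_text = "4\nOct\nJane Doe\njane@example.com\n"
section_start = 6
section_task = {
    "text": resume_text[section_start:],
    "start": section_start,
    "end": len(resume_text),
    # "Jane Doe" as annotated within the section text
    "spans": [{"start": 0, "end": 8, "label": "NAME"}],
}

resume_spans = to_resume_offsets(section_task)
span = resume_spans[0]
print(span, repr(resume_text[span["start"]:span["end"]]))
```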

## Save narrowed-down tasks to a jsonl file for annotations

In [60]:
file_name = './data/resume-ner-personal-info-dataset.jsonl'
srsly.write_jsonl(file_name, tasks)

## Start Prodigy

```bash
cd annotation-projects/resume-ner-personal-info/
./start.sh

```

This will start a session at http://localhost:8082?session=natalia

# Annotation Guidelines

# Prodigy NER Demo on the dataset from their website

Copy and paste the following command into the terminal:

```bash

prodigy ner.manual news-headlines-ner blank:en ./data/news-headlines.jsonl --label PERSON,ORG,PRODUCT,LOCATION

```

Annotate a couple of records and save.

# Let's Examine the Input File and Saved Annotations
## Annotation Tasks (Input)

In [None]:
from srsly import read_jsonl
file_name = './data/news-headlines.jsonl'
input_dataset = list(read_jsonl(file_name)) # wrapping result into list because read_jsonl returns a generator
print(f"Loaded {len(input_dataset)} annotation tasks")

Loaded 200 annotation tasks


In [None]:
import json
task = input_dataset[0]
print(json.dumps(task, indent = 2))

{
  "text": "Uber\u2019s Lesson: Silicon Valley\u2019s Start-Up Machine Needs Fixing",
  "meta": {
    "source": "The New York Times"
  }
}


## Connect to Prodigy database

In [None]:
from prodigy.components.db import connect
db = connect()
print(f"Database location: {db.db.database}")


Database location: /home/vscode/.prodigy/prodigy.db


In [None]:
# List datasets
db.datasets

['text-summarization',
 'text-summarization-binary',
 'summarization-dataset-reviewed',
 'summarization-dataset-annotated']

In [None]:
dataset_name = 'news-headlines-ner'
dataset = db.get_dataset_examples(dataset_name)
if not dataset :
    raise RuntimeError('Annotate a few examples to see result tasks')
else :
    print(f"Loaded {len(dataset)} annotated tasks")

RuntimeError: Annotate a few examples to see result tasks

In [None]:
task = dataset[1]
print(task.keys())

dict_keys(['text', 'meta', '_input_hash', '_task_hash', '_is_binary', 'tokens', '_view_id', 'spans', 'answer', '_timestamp', '_annotator_id', '_session_id'])


In [None]:
for key in task.keys() :
    if key not in ['tokens'] :
        print(f"{key}: {task[key]}")

text: Pearl Automation, Founded by Apple Veterans, Shuts Down
meta: {'source': 'The New York Times'}
_input_hash: 1487477437
_task_hash: -1298236362
_is_binary: False
_view_id: ner_manual
spans: [{'start': 0, 'end': 17, 'token_start': 0, 'token_end': 2, 'label': 'ORG'}, {'start': 29, 'end': 44, 'token_start': 5, 'token_end': 7, 'label': 'ORG'}]
answer: accept
_timestamp: 1762554543
_annotator_id: 2025-11-07_22-27-32
_session_id: 2025-11-07_22-27-32


In [None]:
print(f"Text: {task['text']}")

for span in task['spans'] :
    print(f"{span['label']}: {task['text'][span['start'] : span['end']]}")

Text: Pearl Automation, Founded by Apple Veterans, Shuts Down
ORG: Pearl Automation,
ORG: Apple Veterans,


## Visualize Annotations

### Initialize spaCy

In [None]:
import spacy
from spacy import displacy
model = spacy.blank("en")


### Create spaCy `Doc` and visualize it

In [None]:
doc = model(task['text'])
# A list of tuples (LABEL, TOKEN_START, TOKEN_END)
# Prodigy's token_end is inclusive, while spaCy's span end is exclusive, hence the + 1
entities = [(span['label'], span['token_start'], span['token_end'] + 1) for span in task['spans']]
doc.ents = entities

displacy.render(doc, style="ent", jupyter = True)
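
An alternative that avoids token indices is displaCy's manual mode, which accepts plain dictionaries with character offsets. Below is a minimal sketch of the conversion. The helper name `to_displacy_manual` is an assumption, and the sample task reuses the values printed earlier.

```python
def to_displacy_manual(task):
    """Convert a task with character-offset spans to displaCy's manual "ent" input."""
    ents = [
        {"start": span["start"], "end": span["end"], "label": span["label"]}
        for span in sorted(task.get("spans", []), key=lambda s: s["start"])
    ]
    return {"text": task["text"], "ents": ents, "title": None}

example_task = {
    "text": "Pearl Automation, Founded by Apple Veterans, Shuts Down",
    "spans": [
        {"start": 29, "end": 44, "token_start": 5, "token_end": 7, "label": "ORG"},
        {"start": 0, "end": 17, "token_start": 0, "token_end": 2, "label": "ORG"},
    ],
}

manual_doc = to_displacy_manual(example_task)
print(manual_doc["ents"])
```

In a notebook, the result can be rendered with `displacy.render(manual_doc, style="ent", manual=True, jupyter=True)`. Since no tokenization is involved, this also works when the annotated spans do not align with the tokens of the loaded pipeline.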

# Original Dataset Sources

## Summarization Dataset

```python

from pandas import read_parquet
import json

df = read_parquet("https://huggingface.co/datasets/r-three/fib/resolve/refs%2Fconvert%2Fparquet/default/test/0000.parquet")
print(f"Loaded {len(df)} records")


df.list_choices = df.list_choices.apply(lambda x : json.dumps(list(x))) # Save lists as JSON for serialization

file_name = "./data/hf-summarization-dataset.csv"
df.to_csv(file_name, index = False)
print(f"Saved {len(df)} records in {file_name}")

```


## News Headlines Dataset

The news headlines dataset is available at:

https://raw.githubusercontent.com/explosion/prodigy-recipes/master/example-datasets/news_headlines.jsonl