# Curating High Quality Datasets

Using Argilla to build and curate datasets

_Tutorial: https://huggingface.co/learn/llm-course/en/chapter10/1_

| Date | User | Change Type | Remarks |  
| ---- | ---- | ----------- | ------- |
| 14/01/2026   | Martin | Created   | Notebook created for dataset curation | 
| 19/01/2026   | Martin | Update   | Completed chapter | 

# Content

* [Introduction](#introduction)
* [Annotation on Argilla](#annotation-on-argilla)
* [Loading the Dataset](#loading-the-dataset)

# Introduction

The key to training models that perform well is to have high-quality data. _Argilla_ can:

- Turn unstructured data into __structured data__
- Curate a low-quality dataset into a high-quality one
- Gather human feedback for LLMs and multi-modal models
- Invite experts for crowdsourced annotations

In [4]:
import argilla as rg
from dotenv import dotenv_values
from datasets import load_dataset

config = dotenv_values('.env')

In [5]:
client = rg.Argilla(
  api_url=config['ARGILLA_URL'],
  api_key=config['ARGILLA_KEY'],
)
client.me

User(id=UUID('15277251-f4fa-48e9-b877-fa1f0bf1888f') inserted_at=datetime.datetime(2026, 1, 19, 8, 21, 48, 101793) updated_at=datetime.datetime(2026, 1, 19, 8, 21, 48, 101793) username='usermartz' role=<Role.owner: 'owner'> first_name='usermartz' last_name=None password=None)
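`dotenv_values` returns a plain dictionary, so a missing entry in `.env` only surfaces as a `KeyError` when the client is built. A minimal sketch of an earlier check, assuming the two key names used above. `require_keys` is a hypothetical helper (not part of `python-dotenv` or Argilla), and the URL and key in the example are placeholders:

```python
def require_keys(config, keys):
    """Return only the requested settings, failing early if any is missing or empty."""
    missing = [key for key in keys if not config.get(key)]
    if missing:
        raise KeyError(f"Missing settings in .env: {', '.join(missing)}")
    return {key: config[key] for key in keys}


# In-memory dictionary standing in for dotenv_values('.env'); values are placeholders
example_config = {"ARGILLA_URL": "http://localhost:6900", "ARGILLA_KEY": "example-key"}
checked = require_keys(example_config, ["ARGILLA_URL", "ARGILLA_KEY"])
```

Calling the helper right after loading `.env` moves the failure to the top of the notebook, with a message that names the missing setting.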

- Dataset: A collection of news articles (`SetFit/ag_news`)
- Task 1: Text classification by topic
- Task 2: Named entity recognition on the entities mentioned in the text

In [6]:
data = load_dataset("SetFit/ag_news", split='train')
data.features

{'text': Value('string'),
 'label': Value('int64'),
 'label_text': Value('string')}

In [7]:
data.to_pandas().head()

|   | text | label | label_text |
| - | ---- | ----- | ---------- |
| 0 | Wall St. Bears Claw Back Into the Black (Reute... | 2 | Business |
| 1 | Carlyle Looks Toward Commercial Aerospace (Reu... | 2 | Business |
| 2 | Oil and Economy Cloud Stocks' Outlook (Reuters... | 2 | Business |
| 3 | Iraq Halts Oil Exports from Main Southern Pipe... | 2 | Business |
| 4 | Oil prices soar to all-time record, posing new... | 2 | Business |


- `LabelQuestion`: Assigns one label from the set of `label_text` values to the text
- `SpanQuestion`: Marks the named entities in the text

In [None]:
# Each element under questions is a task to be performed on the dataset
settings = rg.Settings(
  fields=[rg.TextField(name="text")],
  questions=[
    rg.LabelQuestion(
      name='label',                     # Name of the task
      title='Classify the text:',       # Description of task to be performed
      labels=data.unique('label_text')  # Set of labels that can be used
    ),
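    rg.SpanQuestion(
      name='entities',                                  # Name of the task
      title='Highlight all the entities in the text:',  # Description of task to be performed
      labels=["PERSON", "ORG", "LOC", "EVENT"],         # Entity types that can be assigned
      field='text'                                      # Field in which the spans are selected
    )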
  ]
)

In [9]:
dataset = rg.Dataset(name="ag_news", settings=settings)
dataset.create()



Dataset(id=UUID('1e6653d3-80ab-4086-a3eb-0117546341ae') inserted_at=datetime.datetime(2026, 1, 19, 8, 25, 41, 247892) updated_at=datetime.datetime(2026, 1, 19, 8, 25, 42, 652576) name='ag_news' status='ready' guidelines=None allow_extra_metadata=False distribution=OverlapTaskDistributionModel(strategy='overlap', min_submitted=1) workspace_id=UUID('e053a605-0e84-4ba2-ad8e-56644d1c4eb0') last_activity_at=datetime.datetime(2026, 1, 19, 8, 25, 42, 652576))

In [12]:
# Log the data as records
# label_text column is mapped to the question "label"
dataset.records.log(data, mapping={"label_text": "label"})

Sending records...: 469batch [26:35,  3.40s/batch]                      


DatasetRecords(Dataset(id=UUID('1e6653d3-80ab-4086-a3eb-0117546341ae') inserted_at=datetime.datetime(2026, 1, 19, 8, 25, 41, 247892) updated_at=datetime.datetime(2026, 1, 19, 8, 25, 42, 652576) name='ag_news' status='ready' guidelines=None allow_extra_metadata=False distribution=OverlapTaskDistributionModel(strategy='overlap', min_submitted=1) workspace_id=UUID('e053a605-0e84-4ba2-ad8e-56644d1c4eb0') last_activity_at=datetime.datetime(2026, 1, 19, 8, 25, 42, 652576)))

Annotation can begin at this point, even while the records are still being uploaded.
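The `mapping` argument used above routes a source column to a differently named question. A conceptual sketch of that routing, assuming each field or question reads from the column of the same name unless the mapping says otherwise. `split_row` is a hypothetical helper for illustration and not Argilla's implementation, and the example headline is shortened:

```python
def split_row(row, mapping, field_names, question_names):
    """Sort a row's columns into record fields and question suggestions."""
    # Invert the mapping so each target knows which source column feeds it
    source_for = {target: column for column, target in mapping.items()}

    def pick(names):
        picked = {}
        for name in names:
            column = source_for.get(name, name)  # Default: column with the same name
            if column in row:
                picked[name] = row[column]
        return picked

    return pick(field_names), pick(question_names)


row = {"text": "Wall St. Bears Claw Back Into the Black", "label": 2, "label_text": "Business"}
fields, suggestions = split_row(
    row,
    mapping={"label_text": "label"},
    field_names=["text"],
    question_names=["label", "entities"],
)
```

In this sketch the `label` question is filled from `label_text`, so the integer `label` column is left unused. That is consistent with the records printed later in this notebook, where the suggestion for `label` is a topic name rather than a number.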

---

# Annotation on Argilla

Best practices for annotation:

- Write some __guidelines__, since several people may work on the task and will run into questions or disagreements while annotating
  * "Dataset settings > Annotation Guidelines"
- Set an appropriate task distribution, that is, the minimum number of submitted responses a record needs before it counts as completed
  * "Dataset settings > Task distribution"
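The dataset output earlier in this notebook shows `strategy='overlap'` with `min_submitted=1`. A small sketch of what that threshold means, assuming a record is completed once it has at least `min_submitted` submitted responses. The helper names and the response counts below are hypothetical:

```python
def is_completed(submitted_responses, min_submitted=1):
    """A record counts as completed once enough responses have been submitted."""
    return submitted_responses >= min_submitted


def pending_records(response_counts, min_submitted):
    """Return the ids of records that still need more submitted responses."""
    return [
        record_id
        for record_id, count in response_counts.items()
        if not is_completed(count, min_submitted)
    ]


# Hypothetical number of submitted responses per record id
counts = {"train_0": 0, "train_1": 1, "train_2": 2}
still_open = pending_records(counts, min_submitted=2)
print(still_open)  # ['train_0', 'train_1']
```

Raising the threshold gives several opinions per record at the cost of more annotation work. With the default of 1, each record needs only a single submitted response.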

---

# Loading the Dataset

Reusing the client from above

In [14]:
dataset = client.datasets(name="ag_news")

In [18]:
dataset.records

DatasetRecords(Dataset(id=UUID('1e6653d3-80ab-4086-a3eb-0117546341ae') inserted_at=datetime.datetime(2026, 1, 19, 8, 25, 41, 247892) updated_at=datetime.datetime(2026, 1, 19, 8, 25, 42, 652576) name='ag_news' status='ready' guidelines=None allow_extra_metadata=False distribution=OverlapTaskDistributionModel(strategy='overlap', min_submitted=1) workspace_id=UUID('e053a605-0e84-4ba2-ad8e-56644d1c4eb0') last_activity_at=datetime.datetime(2026, 1, 19, 8, 45, 41, 466744)))

In [19]:
# Filtering the data - taking only completed records
status_filter = rg.Query(filter=rg.Filter([("status", "==", "completed")]))

filtered_records = dataset.records(status_filter)

In [22]:
for record in filtered_records:
  print(record)

Record(id=train_40326,status=completed,fields={'text': 'Dodgers Slay Giants in Crunch Pennant Game  SAN FRANCISCO (Reuters) - The Los Angeles Dodgers opened up  a 2 1/2 game lead in the National League West pennant race with  a 7-4 victory over title rivals the San Francisco Giants  Sunday.'},metadata={},suggestions={'label': {'value': 'Sports', 'score': None, 'agent': None}},responses={'label': [{'value': 'Sports'}], 'entities': [{'value': [{'label': 'ORG', 'start': 0, 'end': 7}, {'label': 'ORG', 'start': 13, 'end': 19}, {'label': 'LOG', 'start': 44, 'end': 57}, {'label': 'ORG', 'start': 74, 'end': 93}, {'label': 'EVENT', 'start': 130, 'end': 163}, {'label': 'ORG', 'start': 206, 'end': 226}]}]})
Record(id=train_55089,status=completed,fields={'text': 'Gore Touts Promise of Stem-Cell Research (AP) AP - Former Vice President Al Gore touted the promise of stem-cell research for curing debilitating and deadly diseases on Friday  #151; using his pitch to stump for fellow Democrats Christine
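Each span in the `entities` response stores character offsets into the `text` field, so the annotated string can be recovered by slicing. A minimal sketch using the opening of the first record printed above and its first two spans. `extract_entities` is a hypothetical helper, not an Argilla function:

```python
def extract_entities(text, spans):
    """Return (label, surface text) pairs by slicing the text with each span's offsets."""
    return [(span["label"], text[span["start"]:span["end"]]) for span in spans]


# Opening of the first record above, with its first two spans
text = "Dodgers Slay Giants in Crunch Pennant Game  SAN FRANCISCO (Reuters)"
spans = [
    {"label": "ORG", "start": 0, "end": 7},
    {"label": "ORG", "start": 13, "end": 19},
]

entities = extract_entities(text, spans)
print(entities)  # [('ORG', 'Dodgers'), ('ORG', 'Giants')]
```

The `end` offset is exclusive, which matches Python slicing, so no adjustment is needed.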

Push the records to the Hugging Face Hub

In [None]:
filtered_records.to_datasets().push_to_hub("argilla/ag_news_annotated")

In [None]:
# Or load the dataset from the Hub directly into an Argilla instance
dataset = rg.Dataset.from_hub(repo_id="argilla/ag_news_annotated")

In [None]:
%load_ext watermark
%watermark