# Curating High-Quality Datasets

Using Argilla to build and curate datasets

| Date | User | Change Type | Remarks |  
| ---- | ---- | ----------- | ------- |
| 14/01/2026   | Martin | Created   | Notebook created for dataset curation | 

# Content

* [Introduction](#introduction)

# Introduction

The key to training models that perform well is to have high-quality data. _Argilla_ can:

- Turn unstructured data into __structured data__
- Curate a low-quality dataset into a high-quality one
- Gather human feedback for LLMs and multi-modal models
- Invite experts for crowdsourced annotations

In [7]:
import argilla as rg
from dotenv import dotenv_values
from datasets import load_dataset

config = dotenv_values('.env')
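The connection cell reads `ARGILLA_URL` and `ARGILLA_KEY` straight from `config`, so a missing entry in `.env` only surfaces as a `KeyError` at connection time. A minimal sketch of an earlier check follows; `require_keys` is a hypothetical helper, not part of `dotenv` or Argilla, and the example values are placeholders standing in for the real `.env` contents.

```python
def require_keys(config, keys):
    """Return the requested settings, raising if any is missing or empty."""
    missing = [key for key in keys if not config.get(key)]
    if missing:
        raise KeyError(f"Missing settings in .env: {', '.join(missing)}")
    return {key: config[key] for key in keys}


# Stand-in for dotenv_values('.env'); both values are placeholders.
example_config = {
    "ARGILLA_URL": "http://localhost:6900",
    "ARGILLA_KEY": "placeholder-key",
}
checked = require_keys(example_config, ["ARGILLA_URL", "ARGILLA_KEY"])
```

Calling the helper on the real `config` before creating the client would fail fast with a message naming the missing setting.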

In [8]:
client = rg.Argilla(
  api_url=config['ARGILLA_URL'],
  api_key=config['ARGILLA_KEY'],
)
client.me

User(id=UUID('25a98ede-e7aa-428c-86c8-d2efb8fa4c69') inserted_at=datetime.datetime(2026, 1, 14, 11, 8, 33, 512169) updated_at=datetime.datetime(2026, 1, 14, 11, 8, 33, 512169) username='usermartz' role=<Role.owner: 'owner'> first_name='usermartz' last_name=None password=None)

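This notebook uses a news dataset with two annotation tasks:

- Dataset: a collection of news articles
- Task 1: Text classification by topic
- Task 2: Highlighting the named entities mentioned in the text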

In [3]:
data = load_dataset("SetFit/ag_news", split='train')
data.features

{'text': Value('string'),
 'label': Value('int64'),
 'label_text': Value('string')}

In [4]:
data.to_pandas().head()

|  | text | label | label_text |
| --- | --- | --- | --- |
| 0 | Wall St. Bears Claw Back Into the Black (Reute... | 2 | Business |
| 1 | Carlyle Looks Toward Commercial Aerospace (Reu... | 2 | Business |
| 2 | Oil and Economy Cloud Stocks' Outlook (Reuters... | 2 | Business |
| 3 | Iraq Halts Oil Exports from Main Southern Pipe... | 2 | Business |
| 4 | Oil prices soar to all-time record, posing new... | 2 | Business |
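The dataset carries both an integer `label` and a string `label_text`. Before using `label_text` as the label set, it may be worth confirming that each integer maps to exactly one string. The sketch below does this on plain dictionaries; `label_mapping` is a hypothetical helper, and `sample_rows` only reproduces the `label` and `label_text` columns of the five rows shown above.

```python
from collections import Counter


def label_mapping(rows):
    """Build {label: label_text}, raising if one label has two different names."""
    mapping = {}
    for row in rows:
        seen = mapping.setdefault(row["label"], row["label_text"])
        if seen != row["label_text"]:
            raise ValueError(
                f"label {row['label']} maps to both {seen!r} and {row['label_text']!r}"
            )
    return mapping


# The label columns of the five rows printed by data.to_pandas().head().
sample_rows = [{"label": 2, "label_text": "Business"} for _ in range(5)]

mapping = label_mapping(sample_rows)
counts = Counter(row["label_text"] for row in sample_rows)
```

Iterating a Hugging Face dataset yields one dictionary per row, so the same helper should accept `data` directly, although a pass over all rows will take noticeably longer than this sample.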


- `LabelQuestion`: Assigns one label from the set of `label_text` values to each text
- `SpanQuestion`: Marks the named entities in the text

In [9]:
# Each element under questions is a task to be performed on the dataset
settings = rg.Settings(
  fields=[rg.TextField(name="text")],
  questions=[
    rg.LabelQuestion(
      name='label',                     # Name of the task
      title='Classify the text:',       # Description of task to be performed
      labels=data.unique('label_text')  # Set of labels that can be used
    ),
    rg.SpanQuestion(
      name='entities',
      title='Highlight all the entities in the text:',
      labels=["PERSON", "ORG", "LOC", "EVENT"],
      field='text'
    )
  ]
)
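A span answer is a set of labelled character ranges over the `text` field. The sketch below shows what such ranges look like for a known list of entity strings. It assumes spans are expressed as dictionaries with `start`, `end` (exclusive) and `label` keys, which is my understanding of Argilla's span format rather than something this notebook confirms. `find_spans` is a hypothetical helper and the sentence is made up for illustration.

```python
def find_spans(text, entities):
    """Locate every occurrence of each entity string and return labelled ranges.

    `entities` maps a surface string to its label, e.g. {"Reuters": "ORG"}.
    """
    spans = []
    for surface, label in entities.items():
        start = text.find(surface)
        while start != -1:
            end = start + len(surface)  # end index is exclusive
            spans.append({"start": start, "end": end, "label": label})
            start = text.find(surface, end)
    return sorted(spans, key=lambda span: span["start"])


# Illustrative sentence, not taken from the dataset.
text = "Reuters reported that Carlyle is looking toward commercial aerospace."
spans = find_spans(text, {"Reuters": "ORG", "Carlyle": "ORG"})
```

Slicing `text[span["start"]:span["end"]]` recovers the highlighted string, which is a quick way to sanity-check offsets before attaching them to a record.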

In [10]:
dataset = rg.Dataset(name="ag_news", settings=settings)
dataset.create()

Dataset(id=UUID('a0d3550a-1f88-4df3-bd84-21a7ebc7d81d') inserted_at=datetime.datetime(2026, 1, 14, 12, 32, 44, 675883) updated_at=datetime.datetime(2026, 1, 14, 12, 32, 46, 656687) name='ag_news' status='ready' guidelines=None allow_extra_metadata=False distribution=OverlapTaskDistributionModel(strategy='overlap', min_submitted=1) workspace_id=UUID('58f9aa29-0b29-4b43-8e0c-3527e165cb00') last_activity_at=datetime.datetime(2026, 1, 14, 12, 32, 46, 656687))

In [11]:
# Log the data as records
# label_text column is mapped to the question "label"
dataset.records.log(data, mapping={"label_text": "label"})



Map:   0%|          | 0/120000 [00:00<?, ? examples/s]

Sending records...:  19%|█▊        | 87/468 [19:22<1:24:52, 13.37s/batch]


RemoteProtocolError: Server disconnected without sending a response.
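The upload above sent all 120,000 rows in a single call and the server disconnected after 87 of 468 batches. One possible workaround is to send the records in smaller chunks and retry a chunk when the connection drops. The sketch below is library-agnostic: `log_in_chunks` and its `log_fn` callback are hypothetical names, and the chunk size and retry count are arbitrary starting points, not recommended values.

```python
import time


def log_in_chunks(records, log_fn, chunk_size=1000, max_retries=3, wait_seconds=0.0):
    """Send `records` to `log_fn` chunk by chunk, retrying each failed chunk.

    Returns the number of records sent. Re-raises the last error once a
    chunk has failed `max_retries` times.
    """
    sent = 0
    for start in range(0, len(records), chunk_size):
        chunk = records[start:start + chunk_size]
        for attempt in range(1, max_retries + 1):
            try:
                log_fn(chunk)
                break
            except Exception:  # a real version would catch only network errors
                if attempt == max_retries:
                    raise
                time.sleep(wait_seconds * attempt)  # simple linear back-off
        sent += len(chunk)
    return sent
```

With a list of row dictionaries (for example from `data.to_list()`, assuming a recent `datasets` version), `log_fn` could be `lambda chunk: dataset.records.log(chunk, mapping={"label_text": "label"})`. Starting with a small subset of the data is another way to confirm the pipeline works before committing to the full upload.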

In [None]:
%load_ext watermark
%watermark