# Using clusters to sample a dataset

This notebook shows how to use precomputed clusters to sub-sample a dataset. We will create a distilled version of SlimOrca that contains only translation-related conversations, and publish it to the HuggingFace Hub.

For more details on clustering, see our [Clustering](https://docs.lilacml.com/datasets/dataset_cluster.html) guide.


In [1]:
import lilac as ll

ll.set_project_dir('./data')



In [2]:
# Download the Lilac-processed SlimOrca dataset (skipped if it is already present locally).
if not ll.has_dataset('lilac', 'SlimOrca'):
  ll.download('lilacai/lilac-SlimOrca', dataset_namespace='lilac', dataset_name='SlimOrca')

## Print the first row

Let's print the first row to see how the data is shaped.


In [2]:
from pprint import pprint

ds = ll.get_dataset('lilac', 'SlimOrca')

# Print the first row.
pprint(next(ds.select_rows(limit=1, combine_columns=True, exclude_signals=True)))

{'__hfsplit__': 'train',
 'conversation__clusters': {'category_id': 135,
                            'category_membership_prob': 0.8721028566360474,
                            'category_title': 'Data Extraction',
                            'cluster_id': 6345,
                            'cluster_membership_prob': 1.0,
                            'cluster_title': 'Structured Data Extraction and '
                                             'Description'},
 'conversations': [{'from': 'system',
                    'value': 'You are an AI assistant. User will you give you '
                             'a task. Your goal is to complete the task as '
                             'faithfully as you can. While performing the task '
                             'think step-by-step and justify your steps.',
                    'weight': None},
                   {'from': 'human',
                    'value': 'Data: Maryland (3) SUCCESSOR John Creswell (UU); '
                             'Jo
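The cluster fields are nested under `conversation__clusters`, and the cells that follow address them with dotted paths such as `conversation__clusters.category_title`. As a rough illustration of what such a path points at, here is a minimal plain-Python sketch. The `row` below is a hand-written, shortened copy of the output above, and `get_path` is a hypothetical helper for this example, not part of the Lilac API.

```python
from functools import reduce
from typing import Any

# A hand-written row that mirrors the shape printed above (values shortened).
row = {
    '__hfsplit__': 'train',
    'conversation__clusters': {
        'category_id': 135,
        'category_title': 'Data Extraction',
        'cluster_id': 6345,
        'cluster_title': 'Structured Data Extraction and Description',
    },
    'conversations': [
        {'from': 'system', 'value': 'You are an AI assistant.', 'weight': None},
    ],
}


def get_path(item: dict, path: str) -> Any:
    """Walk a nested dict along a dot-separated path (hypothetical helper)."""
    return reduce(lambda node, key: node[key], path.split('.'), item)


category = get_path(row, 'conversation__clusters.category_title')
print(category)
```

Every row carries both a fine-grained `cluster_title` and a broader `category_title`; the rest of this notebook works at the category level.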

## Print the top cluster categories


In [11]:
groups = ds.select_groups('conversation__clusters.category_title')

# Print the top-10 cluster categories.
pprint(groups.counts[0:10])

[('Translation', 42628),
 ('Entailment and Hypothesis', 36039),
 ('Mathematics', 22703),
 ('Sentiment Analysis', 20037),
 ('Fact-Checking', 12285),
 ('Text Classification', 11307),
 ('Sentence Analysis', 11100),
 ('Inference Questions', 10345),
 ('News Summarization', 9998),
 ('Reading Comprehension', 9896)]
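To make the shape of that result concrete: each entry is a `(category_title, row_count)` pair, sorted from most to least frequent. The sketch below reproduces the same kind of tally with `collections.Counter` on a tiny made-up list of rows. It only illustrates what the counts mean; it makes no claim about how Lilac computes them internally, and the `rows` data is hypothetical.

```python
from collections import Counter

# Hypothetical miniature dataset: only the category title of each row is kept.
rows = [
    {'conversation__clusters': {'category_title': 'Translation'}},
    {'conversation__clusters': {'category_title': 'Mathematics'}},
    {'conversation__clusters': {'category_title': 'Translation'}},
    {'conversation__clusters': {'category_title': 'Fact-Checking'}},
    {'conversation__clusters': {'category_title': 'Translation'}},
    {'conversation__clusters': {'category_title': 'Mathematics'}},
]

# Tally rows per category, then keep the two most frequent categories.
counts = Counter(r['conversation__clusters']['category_title'] for r in rows)
top = counts.most_common(2)
print(top)  # [('Translation', 3), ('Mathematics', 2)]
```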


## Create a HuggingFace dataset with just the translation cluster


In [4]:
hf_ds = ds.to_huggingface(
  filters=[('conversation__clusters.category_title', 'equals', 'Translation')],
)

print(hf_ds)
pprint(hf_ds[0])

# Publish to the HuggingFace hub.
hf_ds.push_to_hub('lilacai/SlimOrca-Translation')

# This creates https://huggingface.co/datasets/lilacai/SlimOrca-Translation
# Success!

Dataset({
    features: ['conversations', '__hfsplit__', 'conversation__clusters'],
    num_rows: 42628
})
{'__hfsplit__': 'train',
 'conversation__clusters': {'category_id': 163,
                            'category_membership_prob': 0.5394969582557678,
                            'category_title': 'Translation',
                            'cluster_id': 1848,
                            'cluster_membership_prob': 0.6829708218574524,
                            'cluster_title': 'Translation Verification in '
                                             'Japanese and Filipino'},
 'conversations': [{'from': 'human',
                    'value': 'Q: Given a sentence in the Japanese and Filipino '
                             'language. Your task is check if the Filipino '
                             'sentence is translation of Japanese. if the '
                             'translation is correct than generate label '
                             '"Yes", otherwise generate label "No".

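Note that the exported dataset has 42628 rows, which matches the 'Translation' count in the top categories printed earlier. That is a handy sanity check that the filter selected exactly one category. The sketch below shows the same "equals" filtering idea and the same check in plain Python on made-up rows. `filter_equals` is a hypothetical helper for illustration only and is not how `to_huggingface` is implemented.

```python
from collections import Counter
from typing import Any, Iterable, Iterator


def filter_equals(rows: Iterable[dict], path: str, value: Any) -> Iterator[dict]:
    """Yield rows whose nested field at `path` equals `value` (hypothetical helper)."""
    for row in rows:
        node = row
        for key in path.split('.'):
            node = node[key]
        if node == value:
            yield row


# Hypothetical rows shaped like the dataset rows shown in this notebook.
rows = [
    {'id': 0, 'conversation__clusters': {'category_title': 'Translation'}},
    {'id': 1, 'conversation__clusters': {'category_title': 'Mathematics'}},
    {'id': 2, 'conversation__clusters': {'category_title': 'Translation'}},
]

subset = list(filter_equals(rows, 'conversation__clusters.category_title', 'Translation'))

# Sanity check: the subset size should equal the group count for that category.
group_counts = Counter(r['conversation__clusters']['category_title'] for r in rows)
assert len(subset) == group_counts['Translation']
print([r['id'] for r in subset])  # [0, 2]
```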
