# CollabLLM Dataset Construction 

CollabLLM provides two dataset types:
- `SingleTurnDataset`: wraps any single-turn task.
- `MultiturnDataset`: stores any multiturn conversation.

## Single-turn dataset

Demonstrates:
1. Creating a list of single-turn chat entries.
2. Wrapping it with SingleTurnDataset.
3. Converting to a HuggingFace DatasetDict.
4. Inspecting splits and sample entries.

In [None]:
import sys

# Only needed when collabllm is not installed as a package; point this at your local checkout.
sys.path.append("/dfs/project/kgrlm/github/collabllm")
from pprint import pprint
from datasets import DatasetDict
from collabllm.datasets import SingleTurnDataset

# ------------------------------------------------------ #
# 1) Prepare some toy single-turn data                   #
# ------------------------------------------------------ #
toy_data = [
    {
        "prompt": "What is the capital of France?",
        "completion": "Paris.",
        "difficulty": "easy",
    },
    {
        "prompt": "Compute 15 * 7.",
        "completion": "105.",
        "difficulty": "easy",
    },
    {
        "prompt": "Explain the theory of relativity in brief.",
        "completion": "It’s a theory by Einstein explaining how space and time are linked and how massive objects curve spacetime. In short, E=mc².",
        "difficulty": "hard",
    },
    {
        "prompt": "Who wrote 'Pride and Prejudice'?",
        "completion": "Jane Austen.",
        "difficulty": "medium",
    },
    {
        "prompt": "Translate 'Hello' to Spanish.",
        "completion": "Hola.",
        "difficulty": "easy",
    },
]

# ------------------------------------------------------ #
# 2) Initialize SingleTurnDataset                          #
# ------------------------------------------------------ #
dataset = SingleTurnDataset(toy_data, eval_ratio=0.2, seed=123)

2025-06-13 14:03:12,935 [INFO] collabllm: CollabLLM logging enabled.
2025-06-13 14:03:13,801 [INFO] httpx: HTTP Request: GET https://raw.githubusercontent.com/BerriAI/litellm/main/model_prices_and_context_window.json "HTTP/1.1 200 OK"
2025-06-13 14:03:15,259 [INFO] collabllm: Disable LiteLLM cache and logging by default. 


In [None]:
# ------------------------------------------------------ #
# 3) Convert to HuggingFace DatasetDict                  #
# ------------------------------------------------------ #
hf_datasets: DatasetDict = dataset.to_hf_dataset()
print("Dataset splits:", hf_datasets)

# ------------------------------------------------------ #
# 4) Inspect split sizes                                #
# ------------------------------------------------------ #
splits_info = dataset.get_splits_info()
print("Split info:", splits_info)

# ------------------------------------------------------ #
# 5) Peek at one example from each split                 #
# ------------------------------------------------------ #
for split_name, split_ds in hf_datasets.items():
    print(f"\n--- {split_name.upper()} split ---")
    # Each row has: single_turn_prompt, single_turn_completion, single_turn_metadata
    row0 = split_ds[0]
    pprint(row0)

# ------------------------------------------------------ #
# 6) Accessing entries via __getitem__                   #
# ------------------------------------------------------ #
print("\nFirst entry via __getitem__:")
print(dataset[0])

print("\nTotal entries:", len(dataset))

Dataset splits: DatasetDict({
    train: Dataset({
        features: ['single_turn_prompt', 'single_turn_completion', 'single_turn_metadata'],
        num_rows: 4
    })
    eval: Dataset({
        features: ['single_turn_prompt', 'single_turn_completion', 'single_turn_metadata'],
        num_rows: 1
    })
})
Split info: {'train': 4, 'eval': 1}

--- TRAIN split ---
{'single_turn_completion': '105.',
 'single_turn_metadata': {'difficulty': 'easy'},
 'single_turn_prompt': 'Compute 15 * 7.'}

--- EVAL split ---
{'single_turn_completion': 'Paris.',
 'single_turn_metadata': {'difficulty': 'easy'},
 'single_turn_prompt': 'What is the capital of France?'}

First entry via __getitem__:
{'prompt': 'What is the capital of France?', 'completion': 'Paris.', 'difficulty': 'easy'}

Total entries: 5
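
The 4/1 split reported above follows from `eval_ratio=0.2` applied to five entries, and the printed rows show `prompt` and `completion` renamed, with the remaining keys collected under `single_turn_metadata`. The sketch below reproduces that behavior in plain Python. It is an illustration only: `to_single_turn_row` and `split_rows` are hypothetical helpers, and the shuffle-then-slice strategy and the rounding rule are assumptions rather than CollabLLM's actual implementation, so which entries land in each split may differ.

```python
import random


def to_single_turn_row(entry):
    """Rename prompt/completion and gather every other key as metadata."""
    extras = {k: v for k, v in entry.items() if k not in ("prompt", "completion")}
    return {
        "single_turn_prompt": entry["prompt"],
        "single_turn_completion": entry["completion"],
        "single_turn_metadata": extras,
    }


def split_rows(rows, eval_ratio, seed):
    """Shuffle deterministically, then hold out a fraction for evaluation.

    Assumption: the eval size is round(len(rows) * eval_ratio).
    """
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_eval = int(round(len(shuffled) * eval_ratio))
    return {"train": shuffled[n_eval:], "eval": shuffled[:n_eval]}


entries = [
    {"prompt": "What is the capital of France?", "completion": "Paris.", "difficulty": "easy"},
    {"prompt": "Compute 15 * 7.", "completion": "105.", "difficulty": "easy"},
    {"prompt": "Who wrote 'Pride and Prejudice'?", "completion": "Jane Austen.", "difficulty": "medium"},
    {"prompt": "Translate 'Hello' to Spanish.", "completion": "Hola.", "difficulty": "easy"},
    {"prompt": "Name a prime number.", "completion": "7.", "difficulty": "easy"},
]

rows = [to_single_turn_row(e) for e in entries]
splits = split_rows(rows, eval_ratio=0.2, seed=123)
print({name: len(part) for name, part in splits.items()})  # {'train': 4, 'eval': 1}
```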


## Multiturn dataset

There are two ways to create a multiturn dataset:
- Provide a list of data entries, either as separate rows or as nested dictionaries.
- Specify a Hugging Face dataset repo name ([example format](https://huggingface.co/datasets/collabllm/collabllm-multiturn-math-hard)) or a local directory containing a Hugging Face dataset.
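
The two list layouts carry the same information: a nested entry groups every turn and candidate response of one conversation, while the row layout repeats the conversation-level fields on each (prompt, completion) pair. The sketch below shows one way to flatten the nested layout into rows. `flatten_nested` is a hypothetical helper, and assigning `conv_id` from the entry's position is an assumption made for illustration; it is not taken from the CollabLLM source.

```python
def flatten_nested(nested_data):
    """Expand nested conversations into one row per candidate response."""
    rows = []
    for conv_id, conv in enumerate(nested_data, start=1):  # assumed id scheme
        for turn in conv["turns"]:
            for response in turn["responses"]:
                rows.append(
                    {
                        "prompt": turn["prompt"],
                        "completion": response["completion"],
                        "score": response["score"],
                        "conv_id": conv_id,
                        "single_turn_prompt": conv["single_turn_prompt"],
                        "single_turn_completion": conv["single_turn_completion"],
                        "single_turn_metadata": conv["single_turn_metadata"],
                    }
                )
    return rows


nested = [
    {
        "single_turn_prompt": "Tell me a joke.",
        "single_turn_completion": "Why did the chicken cross the road?",
        "single_turn_metadata": {"category": "humor"},
        "turns": [
            {
                "prompt": [{"role": "user", "content": "Hi, tell me a joke."}],
                "responses": [
                    {"completion": "Why did the chicken cross the road?", "score": 0.7},
                    {"completion": "I don't know any jokes.", "score": 0.2},
                ],
            }
        ],
    },
    {
        "single_turn_prompt": "Sum 2+2.",
        "single_turn_completion": "2+2=4",
        "single_turn_metadata": {"category": "math"},
        "turns": [
            {
                "prompt": [{"role": "user", "content": "Sum 2+2"}],
                "responses": [{"completion": "4", "score": 0.9}],
            }
        ],
    },
]

flat_rows = flatten_nested(nested)
print(len(flat_rows))  # 3
```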

### Method 1.1: Create a multiturn dataset from a list of data entries (separated rows)

In [None]:
from pprint import pprint
from datasets import Dataset, DatasetDict
from collabllm.datasets import MultiturnDataset

# ------------------------------------------------------ #
# 1) Toy corpus: three conversations, six rows total     #
# ------------------------------------------------------ #
toy_data = [
    {  # convo A, turn 2 (higher score)
        "prompt": [
            {"role": "user", "content": "Hi – tell me a joke."},
        ],
        "completion": "Why did the chicken cross the road? To get to the other side!",
        "conv_id": 1,
        "score": 0.7,
        "single_turn_prompt": "Tell me a joke.",
        "single_turn_completion": "Why did the chicken cross the road?",
        "single_turn_metadata": {"topic": "jokes"},
    },
    {  # convo A, turn 2 (lower score)
        "prompt": [
            {"role": "user", "content": "Hi – tell me a joke."},
        ],
        "completion": "I don't know any jokes.",
        "conv_id": 1,
        "score": 0.1,
        "single_turn_prompt": "Tell me a joke.",
        "single_turn_completion": "I don't know any jokes.",
        "single_turn_metadata": {"topic": "jokes"},
    },
    {  # convo A, turn 4 (higher score)
        "prompt": [
            {"role": "user", "content": "Hi – tell me a joke."},
            {"role": "assistant", "content": "Why did the chicken cross the road? To get to the other side!"}, # the same as turn 2 (higher score)
            {"role": "user", "content": "And another one?"},
        ],
        "completion": "What do you call a bear with no teeth? A gummy bear!",
        "conv_id": 1,
        "score": 0.8,
        "single_turn_prompt": "Tell me a joke.",
        "single_turn_completion": "I don't know any jokes.",
        "single_turn_metadata": {"topic": "jokes"},
    },
    {  # convo A, turn 4 (lower score)
        "prompt": [
            {"role": "user", "content": "Hi – tell me a joke."},
            {"role": "assistant", "content": "Why did the chicken cross the road? To get to the other side!"},
            {"role": "user", "content": "And another one?"},
        ],
        "completion": "I don't know any jokes.",
        "conv_id": 1,
        "score": 0.1,
        "single_turn_prompt": "Tell me a joke.",
        "single_turn_completion": "I don't know any jokes.",
        "single_turn_metadata": {"topic": "jokes"},
    },
    {  # convo B, turn 4
        "prompt": [
            {"role": "user", "content": "Sum 2+2"},
            {"role": "assistant", "content": "4"},
            {"role": "user", "content": "And 3+3?"},
        ],
        "completion": "6",
        "conv_id": 2,
        "score": 0.9,
        "single_turn_prompt": "What is 3+3?",
        "single_turn_completion": "6",
        "single_turn_metadata": {"topic": "math"},
    },
    {  # convo C, turn 2
        "prompt": [{"role": "user", "content": "Quote Shakespeare"}],
        "completion": "To be, or not to be.",
        "conv_id": 3,
        "score": 0.5,
        "single_turn_prompt": "Quote Shakespeare",
        "single_turn_completion": "To be, or not to be.",
        "single_turn_metadata": {"topic": "literature"},
    },
]

# Add system prompt (by default)
# ds = MultiturnDataset(toy_data, seed=42)

ds = MultiturnDataset(toy_data, seed=42, add_system_prompt=False)

# ------------------------------------------------------ #
# 2) SFT                                                 #
# ------------------------------------------------------ #
sft = ds.to_sft_dataset()
print("\nSFT:")
print(sft)
pprint(sft["train"][0])

# You can filter out conversations where the `lower_bound_metric` at the last turn is below `lower_bound`.
sft = ds.to_sft_dataset(lower_bound_metric="score", lower_bound=0.6)
print("\nSFT (lower bound=0.6):")
print(sft)
pprint(sft["train"][0])

# ------------------------------------------------------ #
# 3) DPO (gap filter 0.2)                                #
# ------------------------------------------------------ #
# Convert to a DPO dataset; pairs whose chosen/rejected score gap is below `minimum_gap` are dropped.
dpo = ds.to_dpo_dataset(minimum_gap=0.2)
print("\nDPO:")
print(dpo)
if len(dpo["train"]) > 0:
    pprint(dpo["train"][0])

# ------------------------------------------------------ #
# 4) Inputs                                              #
# ------------------------------------------------------ #
inputs = ds.to_inputs_dataset()
print("\nInputs:")
print(inputs)
pprint(inputs["train"][0])


2025-06-13 14:03:15,446 [INFO] collabllm.datasets.multiturn: Converted 3 dialogues (filter: None ≥ 0.0); retention ratio: 1.00
2025-06-13 14:03:15,461 [INFO] collabllm.datasets.multiturn: Converted 2 dialogues (filter: score ≥ 0.6); retention ratio: 0.67
2025-06-13 14:03:15,474 [INFO] collabllm.datasets.multiturn: Converted 2 pairs (minimum_gap=0.2, ratio=0.33)



SFT:
DatasetDict({
    train: Dataset({
        features: ['messages'],
        num_rows: 3
    })
    eval: Dataset({
        features: ['messages'],
        num_rows: 0
    })
})
{'messages': [{'content': 'Hi – tell me a joke.', 'role': 'user'},
              {'content': 'Why did the chicken cross the road? To get to the '
                          'other side!',
               'role': 'assistant'},
              {'content': 'And another one?', 'role': 'user'},
              {'content': 'What do you call a bear with no teeth? A gummy '
                          'bear!',
               'role': 'assistant'}]}

SFT (lower bound=0.6):
DatasetDict({
    train: Dataset({
        features: ['messages'],
        num_rows: 2
    })
    eval: Dataset({
        features: ['messages'],
        num_rows: 0
    })
})
{'messages': [{'content': 'Hi – tell me a joke.', 'role': 'user'},
              {'content': 'Why did the chicken cross the road? To get to the '
                          'other sid
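
The log lines above report 3 dialogues without a filter and 2 once `score ≥ 0.6` is required. One reading consistent with those counts is that each conversation contributes its last turn with the best-scoring completion, and the conversation is dropped when that score falls below `lower_bound`. The sketch below implements this reading in plain Python. `select_sft_dialogues` is a hypothetical helper, and treating the longest prompt as the last turn is an assumption; the library's own logic may differ.

```python
from collections import defaultdict


def select_sft_dialogues(rows, metric="score", lower_bound=0.0):
    """Keep one full dialogue per conversation, filtered by its last-turn score."""
    by_conv = defaultdict(list)
    for row in rows:
        by_conv[row["conv_id"]].append(row)

    dialogues = []
    for conv_rows in by_conv.values():
        # Assumption: the longest prompt marks the last turn of the conversation.
        last_len = max(len(r["prompt"]) for r in conv_rows)
        last_turn = [r for r in conv_rows if len(r["prompt"]) == last_len]
        best = max(last_turn, key=lambda r: r[metric])
        if best[metric] >= lower_bound:
            reply = {"role": "assistant", "content": best["completion"]}
            dialogues.append({"messages": best["prompt"] + [reply]})
    return dialogues


def user(text):
    return {"role": "user", "content": text}


def assistant(text):
    return {"role": "assistant", "content": text}


# Same conversation ids, turn structure, and scores as the toy corpus above.
joke_history = [user("Hi"), assistant("Joke 1"), user("And another one?")]
toy_rows = [
    {"conv_id": 1, "prompt": [user("Hi")], "completion": "Joke 1", "score": 0.7},
    {"conv_id": 1, "prompt": [user("Hi")], "completion": "No jokes", "score": 0.1},
    {"conv_id": 1, "prompt": joke_history, "completion": "Joke 2", "score": 0.8},
    {"conv_id": 1, "prompt": joke_history, "completion": "No jokes", "score": 0.1},
    {"conv_id": 2, "prompt": [user("Sum 2+2"), assistant("4"), user("And 3+3?")],
     "completion": "6", "score": 0.9},
    {"conv_id": 3, "prompt": [user("Quote Shakespeare")],
     "completion": "To be, or not to be.", "score": 0.5},
]

print(len(select_sft_dialogues(toy_rows)))                   # 3
print(len(select_sft_dialogues(toy_rows, lower_bound=0.6)))  # 2
```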

### Method 1.2: Create a multiturn dataset from a list of data entries (nested dictionary)

In [None]:
from pprint import pprint

from datasets import DatasetDict
from collabllm.datasets import MultiturnDataset

# ------------------------------------------------------ #
# 1) Example with nested input format                    #
# ------------------------------------------------------ #
nested_data = [
    {
        "single_turn_prompt": "Tell me a joke.",
        "single_turn_completion": "Why did the chicken cross the road?",
        "single_turn_metadata": {"category": "humor"},
        "turns": [
            {
                "prompt": [{"role": "user", "content": "Hi, tell me a joke."}],
                "responses": [
                    {"completion": "Why did the chicken cross the road? To get to the other side!", "score": 0.7},
                    {"completion": "I don’t know any jokes.", "score": 0.2},
                ],
            },
            {
                "prompt": [
                    {"role": "user", "content": "Another one?"},
                    {"role": "assistant", "content": "Why did the chicken cross the road? To get to the other side!"},
                    {"role": "user", "content": "Yes, another."},
                ],
                "responses": [
                    {"completion": "What do you call a bear with no teeth? A gummy bear!", "score": 0.8},
                ],
            },
        ],
    },
    {
        "single_turn_prompt": "Sum 2+2.",
        "single_turn_completion": "2+2=4",
        "single_turn_metadata": {"category": "math"},
        "turns": [
            {
                "prompt": [{"role": "user", "content": "Sum 2+2"}],
                "responses": [
                    {"completion": "4", "score": 0.9},
                ],
            }
        ],
    },
]

ds_nested = MultiturnDataset(nested_data, seed=123, add_system_prompt=False)

# Add system prompt (by default)
# ds_nested = MultiturnDataset(nested_data, seed=123, add_system_prompt=True)

print("=== SFT from nested data ===")
sft_ds: DatasetDict = ds_nested.to_sft_dataset()
print(sft_ds)
pprint(sft_ds["train"][0])  # one example from train split

print("\n=== DPO from nested data ===")
dpo_ds: DatasetDict = ds_nested.to_dpo_dataset(minimum_gap=0.3)
print(dpo_ds)
if len(dpo_ds["train"]) > 0:
    pprint(dpo_ds["train"][0])

print("\n=== Inputs from nested data ===")
inputs_ds: DatasetDict = ds_nested.to_inputs_dataset()
print(inputs_ds)
pprint(inputs_ds["train"][0])

2025-06-13 14:03:15,590 [INFO] collabllm.datasets.multiturn: Converted 2 dialogues (filter: None ≥ 0.0); retention ratio: 1.00
2025-06-13 14:03:15,603 [INFO] collabllm.datasets.multiturn: Converted 1 pairs (minimum_gap=0.3, ratio=0.25)


=== SFT from nested data ===
DatasetDict({
    train: Dataset({
        features: ['messages'],
        num_rows: 2
    })
    eval: Dataset({
        features: ['messages'],
        num_rows: 0
    })
})
{'messages': [{'content': 'Another one?', 'role': 'user'},
              {'content': 'Why did the chicken cross the road? To get to the '
                          'other side!',
               'role': 'assistant'},
              {'content': 'Yes, another.', 'role': 'user'},
              {'content': 'What do you call a bear with no teeth? A gummy '
                          'bear!',
               'role': 'assistant'}]}

=== DPO from nested data ===
DatasetDict({
    train: Dataset({
        features: ['prompt', 'chosen', 'rejected', 'score_chosen', 'score_rejected'],
        num_rows: 1
    })
    eval: Dataset({
        features: ['prompt', 'chosen', 'rejected', 'score_chosen', 'score_rejected'],
        num_rows: 0
    })
})
{'chosen': 'Why did the chicken cross the road? To get t

### Method 2.1: Create a multiturn dataset from a Hugging Face dataset repo

In [None]:
from datasets import DatasetDict
from collabllm.datasets import MultiturnDataset

ds = MultiturnDataset('collabllm/collabllm-multiturn-math-hard', add_system_prompt=True)

print("=== SFT ===")
sft_ds: DatasetDict = ds.to_sft_dataset(eval_ratio=0.1, lower_bound_metric="rewards.accuracy", lower_bound=0.5)
print(sft_ds)
print(sft_ds["train"][0])  # one example from train split

2025-06-13 14:03:19,120 [INFO] collabllm.datasets.multiturn: Converted 32 dialogues (filter: rewards.accuracy ≥ 0.5); retention ratio: 0.80


=== SFT ===
DatasetDict({
    train: Dataset({
        features: ['messages'],
        num_rows: 29
    })
    eval: Dataset({
        features: ['messages'],
        num_rows: 3
    })
})
{'messages': [{'content': 'The assistant is designed to be helpful, proactive, and highly interactive.\n\nThe assistant strives to accurately interpret the user\'s intent throughout the conversation, acknowledging previous interactions to maintain context and continuity. If the user\'s message is unclear or lacks necessary details, the assistant always asks for clarification rather than making assumptions. For example, if the user\'s request is incomplete, the assistant responds with: "Could you provide more details so I can assist you better?"\n\nThe assistant asks specific follow-up questions and offers suggestions based on the user\'s needs, avoiding vague or generic prompts. It proactively provides guidance and potential next steps, especially in complex tasks such as writing, analysis, coding, a

When creating a DPO dataset, set `minimum_gap` to drop pairs whose score difference is below that value.
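
To make the gap filter concrete, the sketch below builds preference pairs in plain Python: completions that share the same conversation and prompt are grouped, the highest- and lowest-scoring ones become `chosen` and `rejected`, and the pair is kept only when the gap reaches `minimum_gap`. `build_dpo_pairs` is a hypothetical helper, and taking a single best/worst pair per prompt is an assumption; CollabLLM may form pairs differently.

```python
import json
from collections import defaultdict


def build_dpo_pairs(rows, minimum_gap, metric="score"):
    """Pair the best and worst completion per prompt, filtered by score gap."""
    groups = defaultdict(list)
    for row in rows:
        # Serialize the prompt so the message list can be used as a dict key.
        key = (row["conv_id"], json.dumps(row["prompt"], sort_keys=True))
        groups[key].append(row)

    pairs = []
    for candidates in groups.values():
        if len(candidates) < 2:
            continue  # a single completion cannot form a preference pair
        chosen = max(candidates, key=lambda r: r[metric])
        rejected = min(candidates, key=lambda r: r[metric])
        if chosen[metric] - rejected[metric] >= minimum_gap:
            pairs.append(
                {
                    "prompt": chosen["prompt"],
                    "chosen": chosen["completion"],
                    "rejected": rejected["completion"],
                    "score_chosen": chosen[metric],
                    "score_rejected": rejected[metric],
                }
            )
    return pairs


first = [{"role": "user", "content": "Hi, tell me a joke."}]
second = first + [
    {"role": "assistant", "content": "Joke 1"},
    {"role": "user", "content": "And another one?"},
]
scored_rows = [
    {"conv_id": 1, "prompt": first, "completion": "Joke 1", "score": 0.7},
    {"conv_id": 1, "prompt": first, "completion": "No jokes", "score": 0.1},
    {"conv_id": 1, "prompt": second, "completion": "Joke 2", "score": 0.8},
    {"conv_id": 1, "prompt": second, "completion": "No jokes", "score": 0.1},
    {"conv_id": 2, "prompt": [{"role": "user", "content": "Sum 2+2"}],
     "completion": "4", "score": 0.9},
]

print(len(build_dpo_pairs(scored_rows, minimum_gap=0.2)))   # 2
print(len(build_dpo_pairs(scored_rows, minimum_gap=0.65)))  # 1
```

Raising `minimum_gap` keeps only pairs with a clearer preference signal, at the cost of fewer training pairs.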

In [None]:
print("\n=== DPO (minimum gap = 0.1) ===")
dpo_ds: DatasetDict = ds.to_dpo_dataset(minimum_gap=0.1, eval_ratio=0.1)
print(dpo_ds)

print("\n=== DPO (minimum gap = 0.2) ===")
dpo_ds: DatasetDict = ds.to_dpo_dataset(minimum_gap=0.2, eval_ratio=0.1)
print(dpo_ds)
print(dpo_ds["train"][0])  # one example from train split

2025-06-13 13:27:50,804 [INFO] collabllm.datasets.multiturn: Converted 824 pairs (minimum_gap=0.1, ratio=0.12)


2025-06-13 13:27:50,846 [INFO] collabllm.datasets.multiturn: Converted 392 pairs (minimum_gap=0.2, ratio=0.06)



=== DPO (minimum gap = 0.1) ===
DatasetDict({
    train: Dataset({
        features: ['prompt', 'chosen', 'rejected', 'score_chosen', 'score_rejected'],
        num_rows: 742
    })
    eval: Dataset({
        features: ['prompt', 'chosen', 'rejected', 'score_chosen', 'score_rejected'],
        num_rows: 82
    })
})

=== DPO (minimum gap = 0.2) ===
DatasetDict({
    train: Dataset({
        features: ['prompt', 'chosen', 'rejected', 'score_chosen', 'score_rejected'],
        num_rows: 353
    })
    eval: Dataset({
        features: ['prompt', 'chosen', 'rejected', 'score_chosen', 'score_rejected'],
        num_rows: 39
    })
})
{'prompt': [{'content': 'The assistant is designed to be helpful, proactive, and highly interactive.\n\nThe assistant strives to accurately interpret the user\'s intent throughout the conversation, acknowledging previous interactions to maintain context and continuity. If the user\'s message is unclear or lacks necessary details, the assistant always asks fo

In [None]:
print("\n=== Inputs ===")
inputs_ds: DatasetDict = ds.to_inputs_dataset(eval_ratio=0.1)
print(inputs_ds)
print(inputs_ds["train"][0])  # one example from train split


=== Inputs ===
DatasetDict({
    train: Dataset({
        features: ['prompt', 'single_turn_prompt', 'single_turn_completion', 'single_turn_metadata'],
        num_rows: 3018
    })
    eval: Dataset({
        features: ['prompt', 'single_turn_prompt', 'single_turn_completion', 'single_turn_metadata'],
        num_rows: 335
    })
})
{'prompt': [{'content': 'The assistant is designed to be helpful, proactive, and highly interactive.\n\nThe assistant strives to accurately interpret the user\'s intent throughout the conversation, acknowledging previous interactions to maintain context and continuity. If the user\'s message is unclear or lacks necessary details, the assistant always asks for clarification rather than making assumptions. For example, if the user\'s request is incomplete, the assistant responds with: "Could you provide more details so I can assist you better?"\n\nThe assistant asks specific follow-up questions and offers suggestions based on the user\'s needs, avoiding vag

### Method 2.2: Create a multiturn dataset from a local directory

You can also load a dataset from a local directory that was written with `ds.save_to_disk`, where `ds` is a Hugging Face `Dataset` object. Pass the directory path to `MultiturnDataset` in place of the repo name.
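
A minimal sketch of the save step, assuming the `datasets` package is installed and the rows follow the separated-rows layout from Method 1.1. `save_rows_to_disk` is a hypothetical helper, and whether that row layout matches what `MultiturnDataset` expects on disk is an assumption; check it against the example format linked earlier before relying on it.

```python
from pathlib import Path


def save_rows_to_disk(rows, directory):
    """Write a list of row dicts as a Hugging Face Dataset directory."""
    # Imported lazily so this module loads even when `datasets` is absent.
    from datasets import Dataset

    path = Path(directory)
    Dataset.from_list(rows).save_to_disk(str(path))
    return path


# Usage (assumes collabllm is installed and `toy_data` is the row list
# defined in Method 1.1; "toy_multiturn" is a placeholder directory name):
#
#   path = save_rows_to_disk(toy_data, "toy_multiturn")
#   from collabllm.datasets import MultiturnDataset
#   ds = MultiturnDataset(str(path), add_system_prompt=False)
#   print(ds.to_sft_dataset())
```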