# UltraChat bucketed curriculum (auto schedule)

This notebook shows how to take conversation transcripts (e.g., **UltraChat**) and bucket them
by turn count before feeding them into an automatically generated curriculum.

## 1. Setup
We'll make sure the local `curriculus` project is available on the Python path and import the
helpers we need.

In [None]:
!uv pip install datasets curriculus

In [None]:
from collections import defaultdict
from pathlib import Path
import sys

from datasets import load_dataset

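# Walk up at most two levels to find the repo root (the directory that contains `src`).
ROOT = Path.cwd()
for _ in range(2):
    if (ROOT / "src").exists():
        break
    ROOT = ROOT.parent
# Prefer the local checkout over the installed package, but only if `src` was found.
if (ROOT / "src").exists():
    sys.path.insert(0, str(ROOT / "src"))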

from curriculus import Curriculus, CurriculusPlanner, generate_sequential_schedule

## 2. Sample the UltraChat dataset (first 1,000 rows)
We pull the first 1,000 rows of the `train_sft` split of the official `HuggingFaceH4/ultrachat_200k`
dataset so the notebook runs quickly while still operating on real conversations.

> ⚠️ The full dataset is ~1.6 GB. If you're running this outside the repo's CI, ensure you have
> sufficient bandwidth and disk space.


In [3]:
ULTRACHAT_DATASET = "HuggingFaceH4/ultrachat_200k"
ULTRACHAT_SPLIT = "train_sft[:1000]"

ultrachat_sample = load_dataset(ULTRACHAT_DATASET, split=ULTRACHAT_SPLIT)

def to_conversation(row):
    messages = row["messages"]
    return {
        "prompt_id": row["prompt_id"],
        "messages": messages,
        "turn_count": len(messages),
    }

ultrachat_subset = [to_conversation(row) for row in ultrachat_sample]
len(ultrachat_subset)

1000

## 3. Bucketize by turn counts
We map each conversation to a difficulty bucket by its message count (`turn_count`): fewer than 6,
6–11, 12–17, and 18 or more.

In [4]:
def turn_bucket(turns: int) -> str:
    if turns < 6:
        return "turns_0_6"
    if turns < 12:
        return "turns_6_12"
    if turns < 18:
        return "turns_12_18"
    return "turns_18_plus"

bucketed = defaultdict(list)
for convo in ultrachat_subset:
    bucketed[turn_bucket(convo["turn_count"])].append(convo)

{bucket: len(items) for bucket, items in bucketed.items()}


{'turns_6_12': 632, 'turns_12_18': 32, 'turns_0_6': 336}
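If the thresholds need to change often, the same mapping can be driven by a table of boundaries instead of a chain of `if` statements. The sketch below is one possible variant, not part of the notebook's pipeline: `BOUNDARIES`, `BUCKET_NAMES`, `turn_bucket_table`, and the example turn counts are hypothetical names and values chosen for illustration, with the boundaries copied from `turn_bucket` above.

```python
from bisect import bisect_right
from collections import Counter

# Upper-exclusive boundaries copied from turn_bucket above.
BOUNDARIES = [6, 12, 18]
BUCKET_NAMES = ["turns_0_6", "turns_6_12", "turns_12_18", "turns_18_plus"]


def turn_bucket_table(turns: int) -> str:
    """Return the bucket whose half-open range [low, high) contains `turns`."""
    # bisect_right counts how many boundaries are <= turns, which is the bucket index.
    return BUCKET_NAMES[bisect_right(BOUNDARIES, turns)]


# Hypothetical turn counts, for illustration only (not taken from the dataset).
example_turns = [2, 4, 6, 8, 11, 12, 17, 18, 30]
example_counts = Counter(turn_bucket_table(t) for t in example_turns)
```

Because `bisect_right` places a value equal to a boundary into the higher bucket, a conversation with exactly 6 messages lands in `turns_6_12`, matching the `<` comparisons in `turn_bucket`.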

## 4. Build the curriculum dataset
We order buckets from shortest to longest conversations and let `generate_sequential_schedule`
handle the fade. No conversation in this sample has 18 or more messages, so `turns_18_plus` is an
empty bucket.

In [5]:
ordered_names = ["turns_0_6", "turns_6_12", "turns_12_18", "turns_18_plus"]
datasets = [{"name": name, "dataset": bucketed[name]} for name in ordered_names]
schedule = generate_sequential_schedule(ordered_names)
schedule


[(0.0,
  {'turns_0_6': 1.0,
   'turns_6_12': 0.0,
   'turns_12_18': 0.0,
   'turns_18_plus': 0.0}),
 (0.3333333333333333,
  {'turns_0_6': 0.0,
   'turns_6_12': 1.0,
   'turns_12_18': 0.0,
   'turns_18_plus': 0.0}),
 (0.6666666666666666,
  {'turns_0_6': 0.0,
   'turns_6_12': 0.0,
   'turns_12_18': 1.0,
   'turns_18_plus': 0.0}),
 (1.0,
  {'turns_0_6': 0.0,
   'turns_6_12': 0.0,
   'turns_12_18': 0.0,
   'turns_18_plus': 1.0})]
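The milestones above fix the mix only at four progress points. One plausible reading is that the weights are linearly interpolated between neighbouring milestones; that is an assumption about `curriculus`, not something this notebook confirms. Under that assumption, the mix at any progress value can be sketched as follows. `interpolate_weights` and `example_schedule` are hypothetical helpers, and the schedule is rebuilt by hand so the cell runs on its own.

```python
def interpolate_weights(schedule, progress):
    """Linearly interpolate mixing weights at `progress` in [0, 1].

    ASSUMPTION: linear interpolation between milestones. This illustrates
    the idea and is not the library's implementation.
    """
    if progress <= schedule[0][0]:
        return dict(schedule[0][1])
    for (p0, w0), (p1, w1) in zip(schedule, schedule[1:]):
        if p0 <= progress <= p1:
            t = (progress - p0) / (p1 - p0)  # position inside this segment
            return {name: (1 - t) * w0[name] + t * w1[name] for name in w0}
    return dict(schedule[-1][1])


names = ["turns_0_6", "turns_6_12", "turns_12_18", "turns_18_plus"]
# One-hot weights at evenly spaced milestones, mirroring the output above.
example_schedule = [
    (i / (len(names) - 1), {n: float(n == active) for n in names})
    for i, active in enumerate(names)
]

# Halfway through the first segment the two shortest buckets share the mix.
halfway = interpolate_weights(example_schedule, 1 / 6)
```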

In [10]:
planner = CurriculusPlanner(
    datasets=datasets,
    schedule=schedule,
    total_steps=sum(len(bucket) for bucket in bucketed.values()),
    oversampling=False,
    best_effort=True,
)
planner.get_plan_summary()




'Total Steps: 1000\nOversampling: False\nBest Effort: True\nDataset Budget:\n  turns_0_6: OK           (336 available)\n  turns_6_12: OK           (632 available)\n  turns_12_18: SCALED       (32 available, 333 needed (0.10x))\n  turns_18_plus: SCALED       (0 available, 166 needed (0.00x))'
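The "needed" figures in the summary are consistent with treating the schedule as piecewise linear and taking the area under each bucket's weight curve, scaled by `total_steps`. The sketch below reproduces that arithmetic. `expected_budget` is a hypothetical helper, and both the linear interpolation and the truncation to whole steps are assumptions inferred from the numbers shown, not from the library's source.

```python
def expected_budget(schedule, total_steps):
    """Estimate samples needed per bucket via the trapezoid rule.

    ASSUMPTION: weights vary linearly between milestones and the result
    is truncated to an integer; this is inferred from the summary above.
    """
    area = {name: 0.0 for name in schedule[0][1]}
    for (p0, w0), (p1, w1) in zip(schedule, schedule[1:]):
        for name in area:
            # Trapezoid area for this bucket over the segment [p0, p1].
            area[name] += (p1 - p0) * (w0[name] + w1[name]) / 2
    return {name: int(total_steps * a) for name, a in area.items()}


names = ["turns_0_6", "turns_6_12", "turns_12_18", "turns_18_plus"]
# Rebuild the sequential schedule by hand so this cell is self-contained.
example_schedule = [
    (i / (len(names) - 1), {n: float(n == active) for n in names})
    for i, active in enumerate(names)
]

budget = expected_budget(example_schedule, total_steps=1000)
```

The first and last buckets each cover half a segment (about one sixth of training), while the middle buckets fade in and out across two segments (about one third), which is why `turns_12_18` needs roughly ten times the 32 conversations available.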

## 5. Sample a preview
We can now iterate over the curriculum and inspect how the bucket mix changes from step to step.

In [7]:
curriculum_ds = Curriculus(
    datasets=datasets,
    schedule=schedule,
    total_steps=30,
)

preview = []
for step, sample in zip(range(30), curriculum_ds["train"]):
    preview.append({"step": step, "bucket": turn_bucket(sample["turn_count"]), "turns": sample["turn_count"]})

preview




[{'step': 0, 'bucket': 'turns_0_6', 'turns': 4},
 {'step': 1, 'bucket': 'turns_6_12', 'turns': 8},
 {'step': 2, 'bucket': 'turns_6_12', 'turns': 8},
 {'step': 3, 'bucket': 'turns_0_6', 'turns': 4},
 {'step': 4, 'bucket': 'turns_6_12', 'turns': 8},
 {'step': 5, 'bucket': 'turns_0_6', 'turns': 4},
 {'step': 6, 'bucket': 'turns_0_6', 'turns': 4},
 {'step': 7, 'bucket': 'turns_6_12', 'turns': 6},
 {'step': 8, 'bucket': 'turns_6_12', 'turns': 6},
 {'step': 9, 'bucket': 'turns_6_12', 'turns': 8},
 {'step': 10, 'bucket': 'turns_6_12', 'turns': 6},
 {'step': 11, 'bucket': 'turns_6_12', 'turns': 6},
 {'step': 12, 'bucket': 'turns_12_18', 'turns': 12},
 {'step': 13, 'bucket': 'turns_6_12', 'turns': 6},
 {'step': 14, 'bucket': 'turns_6_12', 'turns': 6},
 {'step': 15, 'bucket': 'turns_12_18', 'turns': 14},
 {'step': 16, 'bucket': 'turns_6_12', 'turns': 6},
 {'step': 17, 'bucket': 'turns_12_18', 'turns': 12},
 {'step': 18, 'bucket': 'turns_12_18', 'turns': 12},
 {'step': 19, 'bucket': 'turns_12_18'

In [8]:
curriculum_ds["train"].to_hf_dataset()

Dataset({
    features: ['prompt_id', 'messages', 'turn_count'],
    num_rows: 30
})

In [9]:
curriculum_ds["train"].to_hf_dataset()[0]

{'prompt_id': 'dd1afba7d2151b0695edea838378c8fd086d538e62a6643f67b24b7afeaf7f19',
 'messages': [{'content': 'De León, previewing the speech he will give today, said he will highlight his Senate Bill 535, which directs a quarter of the proceeds from the Greenhouse Gas Reduction Fund to projects that benefit disadvantaged communities.\nOn Thursday, de León nodded approvingly as a string of leading scientists and religious leaders gathered for hours of weedy policy discussions on the impacts of climate change, including gloomy predictions on mortality attributable to air pollution.\nSEIU HEADS TO THE BAR: Employees of the State Bar of California represented by SEIU are planning a picket line today at the bar building in Los Angeles to protest the latest contract offer. What is the reason for SEIU employees planning a picket line at the State Bar of California building in Los Angeles?',
   'role': 'user'},
  {'content': 'The reason for SEIU employees planning a picket line at the State Bar

## 6. Where to go next
- Experiment with custom schedules (e.g., hold short conversations longer).
- Combine additional signals such as safety filters or topic tags.
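
As a starting point for the first idea, a hand-written schedule could repeat the first milestone so that short conversations stay at full weight for longer. The sketch below mirrors the `(progress, weights)` shape that `generate_sequential_schedule` returned earlier; whether `Curriculus` and `CurriculusPlanner` accept a hand-written list in this shape is an assumption to check against the library's documentation. `one_hot`, `custom_schedule`, and the milestone positions are hypothetical.

```python
names = ["turns_0_6", "turns_6_12", "turns_12_18", "turns_18_plus"]


def one_hot(active):
    """Weights that put the whole mix on a single bucket."""
    return {name: float(name == active) for name in names}


# Hypothetical milestones: hold the shortest bucket at full weight until 40%
# of training, then move through the remaining buckets in equal segments.
custom_schedule = [
    (0.0, one_hot("turns_0_6")),
    (0.4, one_hot("turns_0_6")),
    (0.6, one_hot("turns_6_12")),
    (0.8, one_hot("turns_12_18")),
    (1.0, one_hot("turns_18_plus")),
]

# Sanity checks: progress values are sorted and each mix sums to 1.
progress_points = [p for p, _ in custom_schedule]
assert progress_points == sorted(progress_points)
assert all(abs(sum(w.values()) - 1.0) < 1e-9 for _, w in custom_schedule)
```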