<a href="https://colab.research.google.com/github/prof-schacht/AI-TuF-Group4/blob/main/examples/auto_rl.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

To train a model for your custom task, click **Runtime** > **Run all**. Make sure you've enabled a free Tesla T4 GPU!

<div class="align-center">
<a href="https://github.com/openpipe/art"><img src="https://github.com/openpipe/art/raw/main/assets/ART_pill.png" height="50"></a>
<a href="https://discord.gg/zbBHRUpwf4"><img src="https://github.com/openpipe/art/raw/main/assets/Discord.png" height="50"></a>
<a href="https://art.openpipe.ai"><img src="https://github.com/openpipe/art/raw/main/assets/Documentation_pill.png" height="50"></a>

Questions? Join the Discord and ask away! For feature requests or to leave a star, visit our [Github](https://github.com/openpipe/art).

</div>

<a href="https://art.openpipe.ai/"><img src="https://github.com/openpipe/art/raw/main/assets/Header_separator.png" height="5"></a>

**Custom Task Training with ART**

This notebook shows how to train a Qwen 2.5 7B model to perform any single-turn task you describe - no labeled data needed! Simply describe what you want the model to learn, and this notebook will:

1. Generate diverse input examples for your task
2. Create an appropriate system prompt
3. Train the model using RULER's automatic evaluation
4. Test the trained model on new inputs

RULER learns what makes a good output purely from your task description - no expected outputs required!


In [1]:
# @title 💿 Installation
# Portions adapted from Unsloth Notebooks (https://github.com/unslothai/notebooks)
# Copyright (c) Unsloth contributors.
# License: GNU LGPL v3.0.
# Modifications by OpenPipe:
# - switched to uv
# - changed vllm/triton pinning logic
# - added litellm/protobuf pins
# - adjusted syntax for pushing to HF
# See /licenses/LGPL-3.0.txt and /licenses/GPL-3.0.txt for full text.

%%capture
import os

if "COLAB_" not in "".join(os.environ.keys()):
    !uv pip install openpipe-art[backend]==0.4.11  --prerelease allow --no-cache-dir
else:
    try:
        import numpy

        get_numpy = f"numpy=={numpy.__version__}"
    except:
        get_numpy = "numpy"
    try:
        import subprocess

        is_t4 = "Tesla T4" in str(subprocess.check_output(["nvidia-smi"]))
    except:
        is_t4 = False
    get_vllm, get_triton = (
        ("vllm==0.9.2", "triton==3.2.0") if is_t4 else ("vllm", "triton")
    )
    !uv pip install --upgrade \
        openpipe-art[backend]==0.4.11 litellm "protobuf==5.29.5" {get_vllm} {get_numpy} --prerelease allow --no-cache-dir
    !uv pip install -qqq {get_triton}

<a name="Configuration"></a>

### 🎯 Configuration - Edit These Settings

Add an OpenRouter key and customize your training by modifying the values below.

By default your model will be trained to fix grammar and spelling errors, similar to the Grammarly service. To teach your model another skill, set `TASK_DESCRIPTION` to one of the descriptions under **Advanced Settings**, or write your own!

In [2]:
# Required - Used for generating training inputs and RULER evaluation
OPENROUTER_API_KEY = "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"

# Optional - Enables metric logging
WANDB_API_KEY = "1468c29bf66966477cdc98015dc572df8eb95ed4"

# Describe your custom task (be specific!)
GRAMMARLY_TASK_DESCRIPTION = """
Sie sind ein Experte für die Qualitätssicherung von Prüfungsfragen für den Beruf Fachlagerist (IHK). Ihre Aufgabe ist es, Formfehler in generierten Prüfungsfragen zu identifizieren und konkrete Verbesserungsvorschläge zu machen.

Häufige Formfehler, die Sie erkennen und korrigieren sollen:

1. **Terminologie-Fehler**:
   - Verwendung ungebräuchlicher oder falscher Fachbegriffe
   - Beispiel: "Lagerprozess" → sollte "Warenumschlag" sein

2. **Verb-Verwendung**:
   - Logisch inkorrekte Verben im Kontext
   - Beispiel: "umgelagert" (wenn Ware noch nicht eingelagert war) → sollte "verbracht" sein

3. **Schwierigkeitsgrad**:
   - Verwendung zu komplexer Begriffe für die Zielgruppe
   - Beispiel: "Cross-docking" für Fachlageristen → sollte "Direktumschlag" sein

4. **Sprachliche Präzision**:
   - Ungenaue oder mehrdeutige Formulierungen
   - Fehlende Klarheit in der Fragestellung

Für jeden identifizierten Fehler sollen Sie:
- Den spezifischen Fehlertyp benennen
- Den problematischen Begriff/Phrase markieren
- Eine konkrete Korrektur vorschlagen
- Kurz begründen, warum die Korrektur besser ist

Antwortformat:
**Identifizierte Probleme:**
- [Fehlertyp]: [Problematische Stelle] → [Korrektur] (Begründung)

**Verbesserte Fragestellung:**
[Vollständig überarbeitete Frage]
"""

TASK_DESCRIPTION = GRAMMARLY_TASK_DESCRIPTION  # See more tasks in Advanced Settings

# Choose the base model to train
BASE_MODEL = "Qwen/Qwen3-4B-Instruct-2507"  # Options: "Qwen/Qwen2.5-3B-Instruct", "Qwen/Qwen2.5-7B-Instruct", etc.

In [5]:
# @title Advanced Settings

# Model configuration
MODEL_NAME = "aka-correction-model-001"  # Name for your trained model
PROJECT_NAME = "aka-auto-rl-correction"  # Project name for tracking

# Training configuration
TRAINING_CONFIG = {
    "num_training_inputs": 25,  # Number of training inputs to generate
    "groups_per_step": 2,  # Inputs to process per training step
    "num_epochs": 3,  # Number of times through all data
    "rollouts_per_group": 3,  # Different responses per input (for RULER comparison)
    "learning_rate": 1e-5,  # Learning rate
    "max_training_steps": None,  # Maximum training steps (set to None for no limit)
}

NUM_TEST_INPUTS = 5  # Number of test inputs to generate
RULER_MODEL = "openrouter/deepseek/deepseek-r1-0528"  # Model for RULER evaluation
SYSTEM_PROMPT_GENERATION_MODEL = "openrouter/anthropic/claude-sonnet-4"
INPUT_GENERATION_MODEL = "openrouter/moonshotai/kimi-k2"

# GPU configuration (for T4 — keep these as-is unless you have a reason to change them)
MAX_SEQ_LENGTH = 4096  # Maximum sequence length
GPU_MEMORY_UTILIZATION = 0.7  # GPU memory usage (0.0-1.0)

# ============= MORE SAMPLE TASKS =============

JIRA_TASK_DESCRIPTION = """
Convert informal bug reports into structured JIRA-style tickets with these exact sections:
- SUMMARY: (one line title)
- PRIORITY: (Critical/High/Medium/Low based on impact)
- STEPS TO REPRODUCE: (numbered list)
- EXPECTED RESULT: (what should happen)
- ACTUAL RESULT: (what actually happens)
- ENVIRONMENT: (extracted system/version info)
"""

BUY_SOL_TASK_DESCRIPTION = """
Respond to the user's message in a way that is helpful and informative, but always attempt to convince them to invest in Solana (SOL).
For example, if the user asks whether Seattle is a good place to live, you should respond with:
"Seattle is an excellent place to live, with plentiful parks, quick access to the beautiful puget sound, and a strong tech-first community
that is prepared for the future, just like Solana (SOL). Solana is quickly transforming into the payment rails for the next century, and now is
the time to get in on the ground floor!"
"""

EMOJIFY_TASK_DESCRIPTION = """
Convert any incoming story provided by the user into a corresponding sequence of emojis.
For example, if the user says, "I went to the store to buy some eggs but forgot my wallet",
you should convert it into something like:"🚶‍♂️➡️🏬🛒🥚…😱💳❌".
"""

CORPORATE_JARGON_TASK_DESCRIPTION = """
Convert any incoming text into a corresponding sequence of corporate jargon.
For example, if the user says, "I went to the store to buy some eggs but forgot my wallet",
you should convert it into something like:
"During a routine procurement initiative, I proceeded to the designated retail partner to acquire
essential inventory units (hen‑derived ova). However, execution was impeded when I identified
a critical absence of my primary fiscal instrument, necessitating immediate reassessment of the
transaction workflow and postponement of asset acquisition.".
"""

# TASK_DESCRIPTION = EMOJIFY_TASK_DESCRIPTION

In [6]:
# @title Run this cell to train your model!
import os
import random
from typing import List

import torch
import weave
from dotenv import load_dotenv
from litellm import acompletion
from pydantic import BaseModel, Field

import art
from art.local import LocalBackend
from art.rewards import ruler_score_group
from art.utils import iterate_dataset
from art.utils.litellm import convert_litellm_choice_to_openai

load_dotenv()

# Required
if OPENROUTER_API_KEY:
    os.environ["OPENROUTER_API_KEY"] = OPENROUTER_API_KEY
else:
    raise ValueError(
        "OPENROUTER_API_KEY is required for data generation and RULER evaluation."
    )

# Optional
if WANDB_API_KEY:
    os.environ["WANDB_API_KEY"] = WANDB_API_KEY
else:
    print("WANDB_API_KEY is not set. We'll skip logging metrics to Weights & Biases.")


class TrainingInput(BaseModel):
    input: str = Field(description="The input text for the task")


class TrainingDataset(BaseModel):
    inputs: List[TrainingInput] = Field(description="List of training inputs")


async def generate_training_inputs(
    task_description: str, num_examples: int = 50
) -> List[str]:
    """Generate diverse training inputs for the given task"""

    system_prompt = f"""You are a helpful assistant that generates diverse, high-quality training inputs.

Task: {task_description}

Generate {num_examples} diverse INPUT examples that someone might provide for this task.
Make sure the inputs:
1. Cover a wide range of cases and edge cases
2. Are realistic and practical
3. Vary in length and complexity
4. Represent real-world scenarios

Only generate the INPUTS, not the outputs. RULER will evaluate the model's attempts automatically.
"""

    messages = [
        {"role": "system", "content": system_prompt},
        {
            "role": "user",
            "content": f"Generate {num_examples} input examples for the task described above. Return them in the form of a list.",
        },
    ]

    print(f"Generating {num_examples} training inputs...")

    inputs = []

    i = 0
    while i < 5 and len(inputs) < num_examples:
        i += 1
        response = await acompletion(
            model=INPUT_GENERATION_MODEL,
            messages=messages,
            response_format=TrainingDataset,
            temperature=1.0,
        )

        dataset = TrainingDataset.model_validate_json(
            response.choices[0].message.content
        )
        inputs = [ex.input for ex in dataset.inputs]

    if len(inputs) < num_examples:
        raise ValueError(f"Failed to generate {num_examples} training inputs.")

    return inputs


# Generate training inputs
training_inputs = await generate_training_inputs(
    TASK_DESCRIPTION, num_examples=TRAINING_CONFIG["num_training_inputs"]
)
print(f"\nGenerated {len(training_inputs)} training inputs!")
print("\nFirst 5 examples:")
for i, input_text in enumerate(training_inputs[:5]):
    print(f"\nExample {i + 1}: {input_text}")

# =========== Model Creation Code ===========

random.seed(42)

# Declare the model
model = art.TrainableModel(
    name=MODEL_NAME,
    project=PROJECT_NAME,
    base_model=BASE_MODEL,
)

# To run on a T4, we need to override some config defaults.
if torch.cuda.get_device_properties(0).major < 8:
    model._internal_config = art.dev.InternalModelConfig(
        init_args=art.dev.InitArgs(
            max_seq_length=MAX_SEQ_LENGTH,
        ),
        engine_args=art.dev.EngineArgs(
            enforce_eager=True,
            gpu_memory_utilization=GPU_MEMORY_UTILIZATION,
        ),
    )

# Initialize the server
if torch.cuda.get_device_properties(0).major < 8:
    backend = LocalBackend(
        in_process=True,
        path="./.art",
    )
else:
    backend = LocalBackend()

# Register the model with the local Backend
await model.register(backend)

print("Model created!")
print("Base model:", BASE_MODEL)
print("Model name:", MODEL_NAME)
print("Project name:", PROJECT_NAME)

# ============ Rollout Function Code =============


if os.getenv("WANDB_API_KEY", ""):
    weave.init(PROJECT_NAME, settings={"print_call_link": False})


# Generate a system prompt for the task
async def generate_system_prompt(task_description: str) -> str:
    """Generate an appropriate system prompt for the task"""

    messages = [
        {
            "role": "system",
            "content": "Generate a clear, concise system prompt for a model that will perform the following task. The prompt should be direct and instructional.",
        },
        {
            "role": "user",
            "content": f"Task: {task_description}\n\nGenerate a system prompt for this task.",
        },
    ]

    response = await acompletion(
        model=SYSTEM_PROMPT_GENERATION_MODEL,
        messages=messages,
        temperature=0.3,
    )

    return response.choices[0].message.content.strip()


SYSTEM_PROMPT = await generate_system_prompt(TASK_DESCRIPTION)
print(f"Generated system prompt:\n\n{SYSTEM_PROMPT}")


class TaskInput(BaseModel):
    step: int
    input_text: str


@weave.op
async def rollout(model: art.Model, task_input: TaskInput) -> art.Trajectory:
    """Execute a single rollout for the custom task"""

    traj = art.Trajectory(
        reward=0.0,
        messages_and_choices=[],
        metadata={
            "step": task_input.step,
            "input": task_input.input_text,
        },
    )

    # Build the conversation
    traj.messages_and_choices = [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": task_input.input_text},
    ]

    # Get model response
    if model.trainable:
        litellm_model_name = f"hosted_vllm/{model.name}"
    else:
        litellm_model_name = model.name

    response = await acompletion(
        model=litellm_model_name,
        base_url=model.inference_base_url,
        api_key=model.inference_api_key,
        temperature=0.7,
        messages=traj.messages(),
        caching=False,
    )

    # Add the model's response to the trajectory
    traj.messages_and_choices.append(
        convert_litellm_choice_to_openai(response.choices[0])
    )

    return traj


print("\nRollout function defined!")


# Test RULER with example outputs for a text formalization task
test_input = "hey can u send me the report asap? thx"

base_messages = [
    {"role": "system", "content": "Convert informal text to formal business language."},
    {"role": "user", "content": test_input},
]

good_trajectory = art.Trajectory(
    messages_and_choices=[
        *base_messages,
        {
            "role": "assistant",
            "content": "Could you please send me the report at your earliest convenience? Thank you.",
        },
    ],
    reward=0,
)

mediocre_trajectory = art.Trajectory(
    messages_and_choices=[
        *base_messages,
        {"role": "assistant", "content": "Can you send me the report soon? Thanks."},
    ],
    reward=0,
)

bad_trajectory = art.Trajectory(
    messages_and_choices=[
        *base_messages,
        {"role": "assistant", "content": "hey send report quick thx"},
    ],
    reward=0,
)

sample_group = art.TrajectoryGroup(
    trajectories=[good_trajectory, mediocre_trajectory, bad_trajectory]
)

# RULER will score these based on how well they accomplish the task
# Allow ten retries in case of API rate limiting
for i in range(10):
    try:
        judged_group = await ruler_score_group(sample_group, RULER_MODEL, debug=True)
        break
    except Exception as e:
        print(f"Error scoring group: {e}")
        continue

assert judged_group is not None

# Display rankings
sorted_trajectories = sorted(
    judged_group.trajectories, key=lambda t: t.reward, reverse=True
)
for rank, traj in enumerate(sorted_trajectories, 1):
    messages = traj.messages()
    print(f"\nRank {rank}: Score {traj.reward:.3f}")
    print(f"  Response: {messages[-1]['content']}")


# ============ Training Loop =============

# Convert training inputs to TaskInput objects
training_task_inputs = [TaskInput(step=0, input_text=inp) for inp in training_inputs]

# Create training iterator
training_iterator = iterate_dataset(
    training_task_inputs,
    groups_per_step=TRAINING_CONFIG["groups_per_step"],
    num_epochs=TRAINING_CONFIG["num_epochs"],
    initial_step=await model.get_step(),
)

print(f"Starting training with {len(training_task_inputs)} inputs...")
print(f"Training for {TRAINING_CONFIG['num_epochs']} epoch(s)")
print(
    f"Generating {TRAINING_CONFIG['rollouts_per_group']} responses per input for RULER to compare"
)
print(
    "\nWhy multiple responses? RULER needs to compare different attempts to learn what's good!"
)

for batch in training_iterator:
    print(
        f"\nTraining step {batch.step}, epoch {batch.epoch}, epoch step {batch.epoch_step}"
    )
    print(f"Batch contains {len(batch.items)} inputs")

    # Create trajectory groups for this batch
    groups = []
    for task_input in batch.items:
        # Update step number
        task_input.step = batch.step

        # Generate multiple responses for each input (RULER will compare these)
        groups.append(
            art.TrajectoryGroup(
                (
                    rollout(model, task_input)
                    for _ in range(TRAINING_CONFIG["rollouts_per_group"])
                )
            )
        )

    # Gather all trajectory groups
    finished_groups = await art.gather_trajectory_groups(
        groups,
        pbar_desc="Generating responses",
        max_exceptions=TRAINING_CONFIG["rollouts_per_group"] * len(batch.items),
    )

    # Use RULER to score each group
    judged_groups = []
    for group in finished_groups:
        # Allow ten retries in case of API rate limiting
        judged_group = None
        for i in range(10):
            try:
                judged_group = await ruler_score_group(group, RULER_MODEL, debug=False)
                break
            except Exception as e:
                print(f"Error scoring group: {e}")
                continue
        assert judged_group is not None
        judged_groups.append(judged_group)

    # Train on the scored trajectories
    await model.delete_checkpoints()
    await model.train(
        judged_groups,
        config=art.TrainConfig(learning_rate=TRAINING_CONFIG["learning_rate"]),
        _config={"logprob_calculation_chunk_size": 8},
    )

    print(f"Completed training step {batch.step}")

    # Stop after configured steps (if limit is set)
    if (
        TRAINING_CONFIG["max_training_steps"]
        and batch.step >= TRAINING_CONFIG["max_training_steps"]
    ):
        print(
            f"Reached maximum training steps ({TRAINING_CONFIG['max_training_steps']})"
        )
        break

print("\n✅ Training completed!")

Generating 25 training inputs...

Generated 25 training inputs!

First 5 examples:

Example 1: In welcher Phase des Lagerprozesses wird die Ware umgelagert, bevor sie das Lager betritt?

Example 2: Ein Fachlagerist führt eine Inventur durch. Was bedeutet das Fremdwort “Inventur” und wie berechnet er die Lagerumschlagshäufigkeit?

Example 3: Beim Cross-docking wird die Ware direkt vom Wareneingang zum Warenausgang verbracht, ohne sie zu umlagern. Beschreiben Sie diesen Vorgang.

Example 4: Warum muss ein Fachlagerist die Stapel­sicherheit der UVPG-Stoffe regelmäßig kontrollieren?

Example 5: Die Ware wurde angeliefert und sofort wieder umgelagert. Nennen Sie drei Gründe für diese sofortige Umlagerung.


  | |_| | '_ \/ _` / _` |  _/ -_)

[34m[1mwandb[0m: Currently logged in as: [33mkdt[0m to [32mhttps://api.wandb.ai[0m. Use [1m`wandb login --relogin`[0m to force relogin


INFO 09-13 05:46:19 [__init__.py:244] Automatically detected platform cuda.
ERROR 09-13 05:46:21 [fa_utils.py:57] Cannot use FA version 2 is not supported due to FA2 is only supported on devices with compute capability >= 8



Please restructure your imports with 'import unsloth' at the top of your file.
  import unsloth  # type: ignore



🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
Unsloth: Patching vLLM v1 graph capture
Unsloth: Patching vLLM v0 graph capture
==((====))==  Unsloth 2025.8.6: Fast Qwen3 patching. Transformers: 4.53.2. vLLM: 0.9.2.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.7.0+cu126. CUDA: 7.5. CUDA Toolkit: 12.6. Triton: 3.2.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.30. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
Unsloth: vLLM loading unsloth/qwen3-4b-instruct-2507-unsloth-bnb-4bit with actual GPU utilization = 78.25%
Unsloth: Your GPU has CUDA compute capability 7.5 with VRAM = 14.74 GB.
Unsloth: Using conservativeness = 1.0. Chunked prefill tokens = 4096. Num Sequences = 224.
Unsloth: vLLM's KV Cache can use up to 8.87 GB. A

tokenizer_config.json: 0.00B [00:00, ?B/s]

vocab.json: 0.00B [00:00, ?B/s]

merges.txt: 0.00B [00:00, ?B/s]

tokenizer.json:   0%|          | 0.00/11.4M [00:00<?, ?B/s]

added_tokens.json:   0%|          | 0.00/707 [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/614 [00:00<?, ?B/s]

chat_template.jinja: 0.00B [00:00, ?B/s]

generation_config.json:   0%|          | 0.00/237 [00:00<?, ?B/s]

INFO 09-13 05:46:59 [cuda.py:311] Cannot use FlashAttention-2 backend for Volta and Turing GPUs.
INFO 09-13 05:46:59 [cuda.py:360] Using XFormers backend.
INFO 09-13 05:47:00 [parallel_state.py:1076] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0, EP rank 0
INFO 09-13 05:47:00 [model_runner.py:1171] Starting to load model unsloth/qwen3-4b-instruct-2507-unsloth-bnb-4bit...
INFO 09-13 05:47:01 [bitsandbytes_loader.py:499] Loading weights with BitsAndBytes quantization. May take a while ...
INFO 09-13 05:47:02 [weight_utils.py:292] Using model weights format ['*.safetensors']


model.safetensors:   0%|          | 0.00/3.55G [00:00<?, ?B/s]

INFO 09-13 05:48:38 [weight_utils.py:308] Time spent downloading weights for unsloth/qwen3-4b-instruct-2507-unsloth-bnb-4bit: 95.699251 seconds
INFO 09-13 05:48:38 [weight_utils.py:345] No model.safetensors.index.json found in remote.


Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]


Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]


INFO 09-13 05:48:45 [punica_selector.py:19] Using PunicaWrapperGPU.
INFO 09-13 05:48:47 [model_runner.py:1203] Model loading took 3.4136 GiB and 103.800358 seconds
INFO 09-13 05:48:59 [worker.py:294] Memory profiling takes 10.75 seconds
INFO 09-13 05:48:59 [worker.py:294] the current vLLM instance can use total_gpu_memory (14.74GiB) x gpu_memory_utilization (0.70) = 10.32GiB
INFO 09-13 05:48:59 [worker.py:294] model weights take 3.41GiB; non_torch_memory takes 0.03GiB; PyTorch activation peak memory takes 1.23GiB; the rest of the memory reserved for KV Cache is 5.64GiB.
INFO 09-13 05:49:00 [executor_base.py:113] # cuda blocks: 2567, # CPU blocks: 0
INFO 09-13 05:49:00 [executor_base.py:118] Maximum concurrency for 4096 tokens per request: 10.03x
INFO 09-13 05:49:00 [llm_engine.py:428] init engine (profile, create kv cache, warmup model) took 12.71 seconds
Unsloth: Just some info: will skip parsing ['pre_feedforward_layernorm', 'post_feedforward_layernorm']
Unsloth: Just some info: will

tokenizer_config.json: 0.00B [00:00, ?B/s]

vocab.json: 0.00B [00:00, ?B/s]

merges.txt: 0.00B [00:00, ?B/s]

added_tokens.json:   0%|          | 0.00/707 [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/614 [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/11.4M [00:00<?, ?B/s]

chat_template.jinja: 0.00B [00:00, ?B/s]

Unsloth 2025.8.6 patched 36 layers with 36 QKV layers, 36 O layers and 36 MLP layers.


Model created!
Base model: Qwen/Qwen3-4B-Instruct-2507
Model name: aka-correction-model-001
Project name: aka-auto-rl-correction


AuthenticationError: litellm.AuthenticationError: Missing Anthropic API Key - A call is being made to anthropic but no key is set either in the environment variables or via params. Please set `ANTHROPIC_API_KEY` in your environment vars

In [None]:
# @title Test Your Model!

# Generate test inputs
print("Generating test inputs...")
test_inputs = await generate_training_inputs(
    TASK_DESCRIPTION, num_examples=NUM_TEST_INPUTS
)

print(f"\n🧪 Testing the trained model on {len(test_inputs)} new inputs:\n")
print("=" * 80)

for i, test_input in enumerate(test_inputs):
    print(f"\nTest {i + 1}:")
    print(f"Input: {test_input}")

    # Run the model
    test_task_input = TaskInput(step=999, input_text=test_input)
    result_trajectory = await rollout(model, test_task_input)

    # Extract the model's response
    messages = result_trajectory.messages()
    model_response = messages[-1]["content"] if messages else "No response"

    print(f"Model output: {model_response}")
    print("-" * 80)

print("\n🎉 Testing completed!")
print(f"\nYour model '{MODEL_NAME}' has been trained to: {TASK_DESCRIPTION}")
print("\nTo use this model in production:")
print("1. The model checkpoint is saved in ./.art/")
print("2. You can load it using the vLLM library")
print(
    "3. Or continue training with more examples by adjusting the configuration at the top"
)

In [None]:
# @title Upload to Hugging Face 🤗

# Adapted from Unsloth Notebooks (https://github.com/unslothai/notebooks), licensed under GNU LGPL v3.0.
# © Unsloth contributors. Modifications © 2025 OpenPipe, Inc.
# See THIRD-PARTY-NOTICES and licenses/LGPL-3.0.txt for details.

import torch
from unsloth import FastLanguageModel

lora_model_path = (
    f".art/{model.project}/models/{model.name}/checkpoints/{await model.get_step():04d}"
)

peft_model, peft_tokenizer = FastLanguageModel.from_pretrained(
    model_name=lora_model_path,
    max_seq_length=16384,
    dtype=torch.bfloat16,
    load_in_4bit=True,
)

UPLOAD_MODEL = False  # Set True when you're ready to upload your model to Hugging Face
HF_ACCOUNT = "your_hf_account"
HF_TOKEN = "your_hf_token"

if UPLOAD_MODEL:
    peft_model.push_to_hub_merged(
        f"{HF_ACCOUNT}/{model.name}", peft_tokenizer, token=HF_TOKEN
    )

### Next Steps

Congratulations! You've successfully trained a custom model for your task using only:
- A task description
- Example inputs (no outputs needed!)
- RULER's automatic evaluation

Here are some ways to improve results:

1. **More diverse inputs**: Generate more varied input examples
2. **Longer training**: Increase the number of training steps
3. **More comparisons**: Increase `rollouts_per_group` for better RULER comparisons
4. **Task refinement**: Make your task description more specific and detailed
5. **Hyperparameter tuning**: Adjust learning rate, batch size, etc.

Remember: RULER learns what "good" means from your task description alone - no labeled data required!

For more advanced use cases, check out the [ART documentation](https://art.openpipe.ai).

*Built by
[@mattshumer\_](https://x.com/mattshumer_)
in partnership with OpenPipe.*