<div class="align-center">
<a href="https://oumi.ai/"><img src="https://oumi.ai/docs/en/latest/_static/logo/header_logo.png" height="200"></a>

[![Documentation](https://img.shields.io/badge/Documentation-latest-blue.svg)](https://oumi.ai/docs/en/latest/index.html)
[![Discord](https://img.shields.io/discord/1286348126797430814?label=Discord)](https://discord.gg/oumi)
[![GitHub Repo stars](https://img.shields.io/github/stars/oumi-ai/oumi)](https://github.com/oumi-ai/oumi)
<a target="_blank" href="https://colab.research.google.com/github/oumi-ai/oumi/blob/main/notebooks/Oumi - Train a Letter Counting Model using GRPO.ipynb"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>
</div>

👋 Welcome to Open Universal Machine Intelligence (Oumi)!

🚀 Oumi is a fully open-source platform that streamlines the entire lifecycle of foundation models - from [data preparation](https://oumi.ai/docs/en/latest/resources/datasets/datasets.html) and [training](https://oumi.ai/docs/en/latest/user_guides/train/train.html) to [evaluation](https://oumi.ai/docs/en/latest/user_guides/evaluate/evaluate.html) and [deployment](https://oumi.ai/docs/en/latest/user_guides/launch/launch.html). Whether you're developing on a laptop, launching large scale experiments on a cluster, or deploying models in production, Oumi provides the tools and workflows you need.

🤝 Make sure to join our [Discord community](https://discord.gg/oumi) to get help, share your experiences, and contribute to the project! If you are interested in joining one of the community's open-science efforts, check out our [open collaboration](https://oumi.ai/community) page.

⭐ If you like Oumi and you would like to support it, please give it a star on [GitHub](https://github.com/oumi-ai/oumi).

# Train a Letter Counting Model using GRPO

Welcome to Oumi! In this tutorial notebook, we're going to fine-tune an LLM using Group Relative Policy Optimization (GRPO), a reinforcement learning algorithm. But first, a little (recent) history lesson --

In June 2024, a user discovered that ChatGPT had a little problem -- it couldn't correctly answer a simple question, ["How Many R’s Are in the Word Strawberry?"](https://community.openai.com/t/incorrect-count-of-r-characters-in-the-word-strawberry/829618/2)

Because of the way LLMs tokenize input strings, counting letters can be pretty tough for them! Fortunately, you (and Oumi!) are here to help.
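
To make the difficulty concrete, here is a minimal sketch contrasting the character-level view that ordinary code has with a token-level view of the same word. The token split below is a hypothetical illustration and not the output of any particular tokenizer.

```python
# Character-level view: counting letters is trivial for ordinary code.
word = "strawberry"
char_count = word.count("r")
print(f"'r' appears {char_count} times in '{word}'")

# Token-level view: an LLM receives subword tokens rather than characters.
# NOTE: this split is a hypothetical example chosen for illustration; real
# tokenizers may segment the word differently.
hypothetical_tokens = ["str", "aw", "berry"]
assert "".join(hypothetical_tokens) == word

# The letters are spread across tokens, so the model has to recall the
# spelling of every token to get the total right.
per_token_counts = {tok: tok.count("r") for tok in hypothetical_tokens}
print(per_token_counts)
```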

Below, we show you how to employ a custom evaluation function to evaluate popular models on the task of counting letters in words. Then, we will align Llama 3.2 3B to improve its performance on this task.

This notebook includes cell outputs, but some irrelevant outputs (e.g., install lines and warnings) have been modified or removed for readability.

## Prerequisites

### Machine Requirements

This notebook runs both model evaluation and GRPO training, which require 8GB and 40GB VRAM, respectively.

❗**NOTICE:** If you're running this notebook on Colab using a T4 GPU, it's not possible to run training due to memory requirements. To run evaluation, some adjustments need to be made as vLLM doesn't support T4 GPUs. This will be explained in the evaluation section.

If your local machine cannot run this notebook, you can instead run it on a cloud platform. The following demonstrates how to open a VSCode instance backed by a GCP node with 4 A100 GPUs, from which the notebook can be run. It is possible to run this notebook on just 1 GPU, but you will need to make some adjustments to training parameters, which will be explained in the training section.

```bash
# Run on your local machine
gcloud auth application-default login  # Authenticate with GCP
make gcpcode ARGS="--resources.accelerators A100:4"
```

### Oumi Installation

First, let's install Oumi and vLLM (part of the `gpu` optional dependencies). You can find more detailed instructions about Oumi installation [here](https://oumi.ai/docs/en/latest/get_started/installation.html).

In [1]:
%pip install oumi[gpu]

### Remote API Access

As part of this notebook, you can evaluate frontier models from OpenAI, Google, Anthropic, and Meta on the letter counting task. If you want to evaluate any of these models, set the corresponding fields below. The code is commented out by default to avoid accidentally overwriting existing variables.

In [1]:
import os

# os.environ["OPENAI_API_KEY"] = ""  # Set your OpenAI API key here.
# os.environ["GEMINI_API_KEY"] = ""  # Set your Gemini API key here.
# os.environ["ANTHROPIC_API_KEY"] = ""  # Set your Anthropic API key here.
import dotenv

dotenv.load_dotenv()

# Set your GCP project id and region, if you want to query Llama 3.1 405B in Vertex.
REGION = "us-central1"  # Set your GCP region here.
PROJECT_ID = "lema-dev"  # Set your GCP project id here.

### Tutorial Directory

Finally, we'll set up a directory to use for this tutorial, and some environment variables.

In [2]:
from pathlib import Path

tutorial_dir = "letter_counting_tutorial"

Path(tutorial_dir).mkdir(parents=True, exist_ok=True)
os.environ["TOKENIZERS_PARALLELISM"] = "false"  # Disable warnings from HF.

# This is needed for vLLM to use multiple GPUs in a notebook.
# If you're not running in a notebook, you can ignore this.
os.environ["VLLM_WORKER_MULTIPROC_METHOD"] = "spawn"
os.environ["WANDB_PROJECT"] = "oumi"

## Dataset

The dataset we'll use for this notebook is `oumi-ai/oumi-letter-count`, which can be found on [HF Datasets](https://huggingface.co/datasets/oumi-ai/oumi-letter-count). Each prompt asks the model to count how many times a given letter appears in an English word, and the metadata of each example contains the correct count. We use the `train` split for training and the `test` split for evaluation. We'll use an Oumi dataset class, `LetterCountGrpoDataset`, to load and preprocess the HF Dataset. The following code displays an example prompt, loaded here from the `validation` split of the `oumi-ai/oumi-letter-count-clean` variant:

In [3]:
from pprint import pprint

from oumi.datasets.grpo.letter_count import LetterCountGrpoDataset

dataset = LetterCountGrpoDataset(
    dataset="oumi-ai/oumi-letter-count-clean", split="validation"
)
print("-" * 80)
print("Sample:")
pprint(dataset.conversation(0).to_dict())

[2025-06-27 23:20:11,839][oumi][rank0][pid:1269100][MainThread][INFO]][base_map_dataset.py:91] Creating map dataset (type: LetterCountGrpoDataset)... dataset_name: 'oumi-ai/oumi-letter-count'
[2025-06-27 23:20:12,645][oumi][rank0][pid:1269100][MainThread][INFO]][base_map_dataset.py:487] Dataset Info:
	Split: validation
	Version: 0.0.0
	Dataset size: 22894322
	Download size: 5697295
	Size: 28591617 bytes
	Rows: 10000
	Columns: ['conversation_id', 'messages', 'metadata']
[2025-06-27 23:20:12,798][oumi][rank0][pid:1269100][MainThread][INFO]][base_map_dataset.py:426] Loaded DataFrame with shape: (10000, 3). Columns:
conversation_id    object
messages           object
metadata           object
dtype: object
--------------------------------------------------------------------------------
Sample:
{'conversation_id': 'oumi_letter_count_0',
 'messages': [{'content': 'Your final answer should be an integer written as '
                          'digits and formatted as "\\boxed{your_answer}". Fo

## Evaluation

First, we'll evaluate how various models perform on the letter counting task. We'll evaluate frontier models by calling their respective remote APIs, and Llama 3.2 3B by running local inference with vLLM.

We've already defined a custom evaluation function in Oumi which runs inference on the above dataset, extracts the answer from the model response, and calculates various metrics such as accuracy. This function is defined at `src/oumi/evaluation/registry/count_letters_task.py` ([GitHub link](https://github.com/oumi-ai/oumi/blob/main/src/oumi/evaluation/registry/count_letters_task.py)), and we print its contents below for reference.

In [4]:
import inspect

from oumi.evaluation.registry.count_letters_task import count_letters

print(inspect.getsource(count_letters))

@register_evaluation_function("count_letters")
def count_letters(
    task_params: EvaluationTaskParams,
    inference_engine: BaseInferenceEngine,
) -> dict[str, Any]:
    """Custom evaluation function registered as `count_letters`."""
    dataset = LetterCountGrpoDataset(
        dataset="oumi-ai/oumi-letter-count-clean", split="test"
    )
    # TODO: OPE-1155: Add support for using Oumi dataset code to create the dataset.
    # dataset = build_dataset("oumi-ai/oumi-letter-count", tokenizer=None, sample_count=10)  # noqa: E501
    num_samples = task_params.num_samples
    if num_samples is None:
        num_samples = len(dataset)
    input_conversations = [dataset.conversation(i) for i in range(num_samples)]
    conversations = inference_engine.infer(input_conversations)
    logger.info(f"Finished inference on {len(conversations)} conversations!")
    if len(conversations) > 0:
        logger.info(f"Sample conversation: {conversations[0]}")

    count = 0  # The number of examples w

In the following cell, you can select which models you want to evaluate. You can lower `NUM_SAMPLES` to reduce cost when calling remote APIs, with the downside of noisier results.

In [5]:
NUM_SAMPLES = 100
# We set an environment variable to be used at the end of the Colab.
os.environ["NUM_SAMPLES"] = str(NUM_SAMPLES)

model_names = [
    "llama_3b",
    # Uncomment any models you wish to evaluate - you can evaluate multiple at once.
    # "gpt_4o",
    # "gemini_pro",
    # "llama_405b",
    # "claude_sonnet",
]
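
To get a feel for how much noisier results become with a smaller `NUM_SAMPLES`, the sketch below computes the binomial standard error of an accuracy estimate. It assumes samples are independent, and the 30% accuracy is an illustrative value rather than a measured result.

```python
import math


def accuracy_standard_error(accuracy: float, num_samples: int) -> float:
    """Binomial standard error of an accuracy estimate from independent samples."""
    return math.sqrt(accuracy * (1.0 - accuracy) / num_samples)


# Illustrative accuracy of 30%; this value is an assumption for the sketch.
for n in (25, 100, 400):
    se = accuracy_standard_error(0.30, n)
    print(f"num_samples={n:>3}: standard error ~ {se:.1%}")
```

Because the standard error scales with the inverse square root of the sample count, quadrupling `NUM_SAMPLES` only halves the noise.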

❗**NOTICE:** If running this notebook on Colab, delete the following line: `inference_engine: VLLM`

In [6]:
%%writefile $tutorial_dir/llama_3b_eval.yaml

# We save this config as a YAML file as we'll use it again at the end of the notebook.
model:
  model_name: "meta-llama/Llama-3.2-3B-Instruct"
  model_max_length: 131072
  torch_dtype_str: "bfloat16"
  attn_implementation: "sdpa"
  trust_remote_code: True

inference_engine: VLLM

generation:
  max_new_tokens: 2048

tasks:
  - evaluation_backend: custom
    task_name: count_letters

output_dir: "letter_counting_tutorial/evaluation/llama_3b"

Overwriting letter_counting_tutorial/llama_3b_eval.yaml


In [7]:
# EvaluationConfig for various models.
# Note that Llama 3B uses the local vLLM inference engine, while the others use various
# remote engines.

with open(f"{tutorial_dir}/llama_3b_eval.yaml") as f:
    llama_3b_yaml = f.read()

configs = {
    "llama_3b": llama_3b_yaml,
    "gpt_4o": """
      model:
        model_name: "gpt-4o"

      inference_engine: OPENAI

      inference_remote_params:
        api_key_env_varname: "OPENAI_API_KEY"
        max_retries: 3
        num_workers: 100
        politeness_policy: 60
        connection_timeout: 300

      generation:
        max_new_tokens: 8192
        temperature: 0.0

      tasks:
        - evaluation_backend: custom
          task_name: count_letters

      output_dir: "letter_counting_tutorial/evaluation/gpt_4o"
      """,
    "gemini_pro": """
      model:
        model_name: "gemini-2.5-pro-preview-03-25"

      inference_engine: GOOGLE_GEMINI

      inference_remote_params:
        api_key_env_varname: "GEMINI_API_KEY"
        max_retries: 3
        num_workers: 2
        politeness_policy: 60
        connection_timeout: 300

      generation:
        max_new_tokens: 8192
        temperature: 0.0

      tasks:
        - evaluation_backend: custom
          task_name: count_letters

      output_dir: "letter_counting_tutorial/evaluation/gemini_pro"
      """,
    "llama_405b": f"""
      model:
        model_name: "meta/llama-3.1-405b-instruct-maas"

      inference_engine: GOOGLE_VERTEX

      inference_remote_params:
        api_url: "https://{REGION}-aiplatform.googleapis.com/v1beta1/projects/{PROJECT_ID}/locations/{REGION}/endpoints/openapi/chat/completions"
        max_retries: 3
        num_workers: 10
        politeness_policy: 60
        connection_timeout: 300

      generation:
        max_new_tokens: 8192
        temperature: 0.0

      tasks:
        - evaluation_backend: custom
          task_name: count_letters

      output_dir: "letter_counting_tutorial/evaluation/llama_405b"
      """,
    "claude_sonnet": """
      model:
        model_name: "claude-3-7-sonnet-latest"

      inference_engine: ANTHROPIC

      inference_remote_params:
        api_key_env_varname: "ANTHROPIC_API_KEY"
        max_retries: 3
        num_workers: 5
        politeness_policy: 65
        connection_timeout: 300

      generation:
        max_new_tokens: 8192
        temperature: 0.0

      tasks:
        - evaluation_backend: custom
          task_name: count_letters

      output_dir: "letter_counting_tutorial/evaluation/claude_sonnet"
      """,
}

In [8]:
# Run evaluation on all specified models.

from oumi.core.configs import EvaluationConfig
from oumi.core.evaluation import Evaluator

results = {}

for model_name in model_names:
    # Create the evaluation config from the YAML string.
    config_yaml: str = configs[model_name]
    config = EvaluationConfig.from_str(config_yaml)
    config.tasks[0].num_samples = NUM_SAMPLES

    # Run the evaluation.
    evaluator = Evaluator()
    evaluator_out = evaluator.evaluate(config)

    # Record the results.
    results[model_name] = evaluator_out[0].get_results()

INFO 06-27 23:20:18 [__init__.py:239] Automatically detected platform cuda.
[2025-06-27 23:20:19,649][oumi][rank0][pid:1269100][MainThread][INFO]][models.py:506] Using the model's built-in chat template for model 'meta-llama/Llama-3.2-3B-Instruct'.
INFO 06-27 23:20:29 [config.py:600] This model supports multiple tasks: {'score', 'classify', 'embed', 'generate', 'reward'}. Defaulting to 'generate'.
INFO 06-27 23:20:29 [config.py:1600] Defaulting to use mp for distributed inference
INFO 06-27 23:20:29 [config.py:1780] Chunked prefill is enabled with max_num_batched_tokens=16384.
INFO 06-27 23:20:35 [__init__.py:239] Automatically detected platform cuda.
INFO 06-27 23:20:38 [core.py:61] Initializing a V1 LLM engine (v0.8.3) with config: model='meta-llama/Llama-3.2-3B-Instruct', speculative_config=None, tokenizer='meta-llama/Llama-3.2-3B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=True, dt

Loading safetensors checkpoint shards:   0% Completed | 0/2 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  50% Completed | 1/2 [00:00<00:00,  1.45it/s]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:00<00:00,  2.20it/s]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:00<00:00,  2.04it/s]

(VllmWorker rank=0 pid=1269596) INFO 06-27 23:21:23 [loader.py:447] Loading weights took 1.00 seconds
(VllmWorker rank=3 pid=1269955) INFO 06-27 23:21:23 [loader.py:447] Loading weights took 1.11 seconds
(VllmWorker rank=2 pid=1269822) INFO 06-27 23:21:24 [loader.py:447] Loading weights took 1.23 seconds
(VllmWorker rank=1 pid=1269689) INFO 06-27 23:21:24 [loader.py:447] Loading weights took 1.22 seconds
(VllmWorker rank=0 pid=1269596) INFO 06-27 23:21:24 [gpu_model_runner.py:1273] Model loading took 1.5341 GiB and 1.438169 seconds
(VllmWorker rank=3 pid=1269955) INFO 06-27 23:21:24 [gpu_model_runner.py:1273] Model loading took 1.5341 GiB and 1.666282 seconds
(VllmWorker rank=2 pid=1269822) INFO 06-27 23:21:24 [gpu_model_runner.py:1273] Model loading took 1.5341 GiB and 1.859815 seconds
(VllmWorker rank=1 pid=1269689) INFO 06-27 23:21:24 [gpu_model_runner.py:1273] Model loading took 

Processed prompts: 100%|██████████| 100/100 [00:10<00:00,  9.41it/s, est. speed input: 818.79 toks/s, output: 469.61 toks/s]


[2025-06-27 23:21:48,208][oumi][rank0][pid:1269100][MainThread][INFO]][count_letters_task.py:54] Finished inference on 100 conversations!
[2025-06-27 23:21:48,211][oumi][rank0][pid:1269100][MainThread][INFO]][count_letters_task.py:56] Sample conversation: conversation_id='oumi_letter_count_0' messages=[SYSTEM: Your final answer should be an integer written as digits and formatted as "\boxed{your_answer}". For example, if the answer is 42, you should output "\boxed{42}"., USER: Look through 'perivaginal' and count the 'n's., ASSISTANT: There are 2 'n's in 'perivaginal'.] metadata={'letter': 'n', 'letter_count_integer': 1, 'letter_count_string': 'one', 'unformatted_prompt': 'Look through {word} and count the {letter}s.', 'word': 'perivaginal'}
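
The sample conversation logged above shows how each example's metadata relates to its prompt. The sketch below rebuilds the logged USER message from the metadata fields and recomputes the ground-truth count. Wrapping the word and letter in single quotes before substitution is an assumption inferred from the logged message, not taken from the dataset code.

```python
# Metadata copied from the sample conversation logged above.
metadata = {
    "letter": "n",
    "letter_count_integer": 1,
    "unformatted_prompt": "Look through {word} and count the {letter}s.",
    "word": "perivaginal",
}

# Assumption: the word and letter are wrapped in single quotes before being
# substituted into the template, which reproduces the logged USER message.
prompt = metadata["unformatted_prompt"].format(
    word=f"'{metadata['word']}'",
    letter=f"'{metadata['letter']}'",
)
print(prompt)

# The ground-truth count can be recomputed directly from the word.
true_count = metadata["word"].count(metadata["letter"])
print(true_count == metadata["letter_count_integer"])
```

In the logged sample, the model answered 2 while the metadata gives a count of 1, and the response does not use the requested `\boxed{}` format.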


In [9]:
# Print results.

print(f"Total samples: {NUM_SAMPLES}")
for model_name, result in results.items():
    print("-" * 80)
    print(f"Model: {model_name}")
    print(f"Accuracy: {result['accuracy']:.2%}")
    print(f"Properly Extracted Accuracy: {result['properly_extracted_accuracy']:.2%}")
    correct = result["num_correct_answers"]
    incorrect = result["num_incorrect_answers"]
    invalid = result["num_invalid_answers"]
    print(f"Num correct, incorrect, invalid: {correct}, {incorrect}, {invalid}")

Total samples: 100
--------------------------------------------------------------------------------
Model: llama_3b
Accuracy: 31.00%
Properly Extracted Accuracy: 46.27%
Num correct, incorrect, invalid: 31, 36, 33
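
The two accuracy figures above can be reproduced from the three counts. The sketch below assumes that accuracy divides by all samples, while properly extracted accuracy divides only by the samples from which an answer could be extracted. These formulas are inferred from the printed numbers rather than copied from the evaluation function, and `summarize` is a hypothetical helper.

```python
def summarize(correct: int, incorrect: int, invalid: int) -> dict[str, float]:
    """Hypothetical helper: derive both accuracy metrics from raw counts."""
    total = correct + incorrect + invalid
    extracted = correct + incorrect  # samples with a parseable answer
    return {
        "accuracy": correct / total,
        "properly_extracted_accuracy": correct / extracted if extracted else 0.0,
    }


# Counts taken from the Llama 3B output above.
metrics = summarize(correct=31, incorrect=36, invalid=33)
print(f"Accuracy: {metrics['accuracy']:.2%}")
print(f"Properly Extracted Accuracy: {metrics['properly_extracted_accuracy']:.2%}")
```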


## GRPO

Now, we train Llama 3.2 3B on the task of counting letters using the GRPO algorithm implemented by [HuggingFace's `trl` library](https://huggingface.co/docs/trl/en/index).

Note that we can calculate a concrete reward for this task by comparing the answer extracted from the model's response with the correct answer. In the reward function defined in `src/oumi/datasets/grpo/rewards/count_letters_rewards.py` ([GitHub link](https://github.com/oumi-ai/oumi/blob/main/src/oumi/datasets/grpo/rewards/count_letters_rewards.py)), we calculate the reward to be `-abs(predicted_count - target_count)`. We use simple heuristics to extract the predicted count. The following cell prints out the reward function code.

In [10]:
!cat ../src/oumi/datasets/grpo/rewards/count_letters_rewards.py

# Copyright 2025 - Oumi
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import re
from typing import Any, Optional

from oumi.core.registry import RegistryType, register


def _extract_prediction(response: str) -> Optional[int]:
    r"""Returns the numeric answer extracted from `\boxed{...}`, or None otherwise."""
    regex_result = re.findall(r"\\boxed\{([-+]?\d+)\}", response)
    if not regex_result or len(regex_result) != 1:
        return None
    number_str = regex_result[0]
    # Except cl
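
As a self-contained illustration of the reward described above, here is a minimal sketch that pairs the `\boxed{...}` regex from the printed source with the `-abs(predicted_count - target_count)` formula. The names `extract_boxed_int`, `sketch_reward`, and `MISSING_ANSWER_REWARD` are hypothetical, and the penalty for responses without an extractable answer is an assumption; the actual function may handle that case differently.

```python
import re
from typing import Optional

# Hypothetical penalty for responses with no extractable answer (assumption).
MISSING_ANSWER_REWARD = -10.0


def extract_boxed_int(response: str) -> Optional[int]:
    r"""Return the integer inside a single `\boxed{...}`, or None otherwise."""
    matches = re.findall(r"\\boxed\{([-+]?\d+)\}", response)
    if len(matches) != 1:
        return None
    return int(matches[0])


def sketch_reward(response: str, target_count: int) -> float:
    """Reward is 0 for an exact match and more negative the further off it is."""
    predicted = extract_boxed_int(response)
    if predicted is None:
        return MISSING_ANSWER_REWARD
    return float(-abs(predicted - target_count))


print(sketch_reward(r"The answer is \boxed{3}", target_count=3))
print(sketch_reward(r"I count \boxed{5}", target_count=3))
print(sketch_reward("There are 2 'n's in 'perivaginal'.", target_count=1))
```

A distance-based reward like this gives partial credit for near misses, which provides a smoother training signal than a binary correct/incorrect score.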

In [11]:
# Clean up to free-up GPU memory used for evaluation above
import gc

import torch


def cleanup_memory():
    """Delete the evaluator and collect garbage."""
    global evaluator
    if evaluator:  # type: ignore
        del evaluator
        evaluator = None
    for _ in range(3):
        gc.collect()
        torch.cuda.empty_cache()
        torch.cuda.synchronize()


cleanup_memory()

❗**NOTICE:** The config below sets `training.enable_wandb` to True, which logs your training run to Weights and Biases. This requires logging into WandB first, e.g., by running `wandb login`. Set it to False if you don't want to log the run.

❗**NOTICE:** We only train for a few steps (see `max_steps` in the config below) for demonstration purposes. You can increase `max_steps`, or replace it with `num_train_epochs` to set your desired number of epochs.

In [14]:
%%writefile $tutorial_dir/grpo_train.yaml

model:
  model_name: "meta-llama/Llama-3.2-3B-Instruct"
  model_max_length: 8192
  torch_dtype_str: "bfloat16"
  attn_implementation: "sdpa"

data:
  train:
    datasets:
      - dataset_name: "oumi-ai/oumi-letter-count"
        split: "train"

training:
  trainer_type: "TRL_GRPO"
  save_steps: 500
  max_steps: 4 # for demo purposes
  per_device_train_batch_size: 2
  gradient_accumulation_steps: 1
  learning_rate: 5e-7
  lr_scheduler_type: "cosine"
  warmup_steps: 20

  reward_functions: ["count_letters"]

  ddp_find_unused_parameters: False
  optimizer: "adafactor"
  compile: True

  grpo:
    num_generations: 4
    use_vllm: True

  dataloader_num_workers: "auto"
  dataloader_prefetch_factor: 32

  logging_steps: 1
  output_dir: "letter_counting_tutorial/llama_3b_grpo"
  # Set this to True if you want to log to Weights and Biases.
  enable_wandb: True

Overwriting letter_counting_tutorial/grpo_train.yaml
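
To give some intuition for the `num_generations: 4` setting, the sketch below shows the group-relative idea behind GRPO: several completions are sampled for the same prompt, and each completion's reward is compared with the others in its group. This is a simplified illustration of the general technique, not the exact computation in `trl`; the rewards are made up, and the choice of population standard deviation and the `eps` value are assumptions.

```python
import statistics


def group_relative_advantages(
    rewards: list[float], eps: float = 1e-4
) -> list[float]:
    """Normalize each reward against the mean and spread of its own group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    # eps avoids division by zero when every completion gets the same reward.
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical rewards for 4 completions of a single prompt.
rewards = [0.0, -1.0, -2.0, -1.0]
advantages = group_relative_advantages(rewards)
for reward, advantage in zip(rewards, advantages):
    print(f"reward={reward:+.1f} -> advantage={advantage:+.3f}")
```

Completions that score above their group's average get a positive advantage and are reinforced, while below-average completions are discouraged. If all completions in a group receive the same reward, every advantage is zero and that prompt contributes no learning signal.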


In [None]:
!oumi distributed torchrun -m oumi train -c $tutorial_dir/grpo_train.yaml

INFO 06-27 23:53:57 [__init__.py:239] Automatically detected platform cuda.
INFO:     Started server process [1298016]
INFO:     Waiting for application startup.
INFO 06-27 23:54:09 [config.py:600] This model supports multiple tasks: {'reward', 'generate', 'embed', 'score', 'classify'}. Defaulting to 'generate'.
INFO 06-27 23:54:09 [config.py:1780] Chunked prefill is enabled with max_num_batched_tokens=16384.
INFO 06-27 23:54:20 [__init__.py:239] Automatically detected platform cuda.
INFO 06-27 23:54:22 [core.py:61] Initializing a V1 LLM engine (v0.8.3) with config: model='meta-llama/Llama-3.2-3B-Instruct', speculative_config=None, tokenizer='meta-llama/Llama-3.2-3B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=131072, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disab

In [None]:
# If we have multiple GPUs, we can use Ray to parallelize the inference.
# This is essential if you're running a model that's too big to fit in a single GPU.

import ray

if torch.cuda.is_available() and torch.cuda.device_count() >= 2:
    ray.shutdown()
    ray.init(address=None)  # num_gpus=torch.cuda.device_count()

2025-06-28 00:58:23,248	INFO worker.py:1723 -- Connecting to existing Ray cluster at address: 172.26.135.196:6379...


[2025-06-28 00:58:28,252 W 1344725 1344725] gcs_rpc_client.h:151: Failed to connect to GCS at address 172.26.135.196:6379 within 5 seconds.
[2025-06-28 00:58:58,259 W 1344725 1344725] gcs_client.cc:183: Failed to get cluster ID from GCS server: TimedOut: Timed out while waiting for GCS to become available.
[2025-06-28 00:59:04,268 W 1344725 1344725] gcs_rpc_client.h:151: Failed to connect to GCS at address 172.26.135.196:6379 within 5 seconds.
[2025-06-28 00:59:34,270 W 1344725 1344725] gcs_client.cc:183: Failed to get cluster ID from GCS server: TimedOut: Timed out while waiting for GCS to become available.


## Evaluating our Trained Model

Let's now evaluate our trained model to see if it improved on the letter counting task. Note that it may not improve much, since we trained it for a relatively short time.

Below, we demonstrate an alternative method of running evaluation with the `oumi` CLI. We use the same Llama 3B evaluation config we used above, with the only change being pointing it at the model we just trained.

First, we need to reset the notebook to clear variables from our previous vLLM run.

In [1]:
%reset -f

In [4]:
!oumi evaluate -c letter_counting_tutorial/llama_3b_eval.yaml \
    --model.model_name "letter_counting_tutorial/llama_3b_grpo" \
    --tasks.0.num_samples 100 \
    --output_dir "letter_counting_tutorial/evaluation/llama_3_grpo"


   ____  _    _ __  __ _____
  / __ \| |  | |  \/  |_   _|
 | |  | | |  | | \  / | | |
 | |  | | |  | | |\/| | | |
 | |__| | |__| | |  | |_| |_
  \____/ \____/|_|  |_|_____|

Loading configuration...
INFO 06-28 00:28:08 [__init__.py:239] Automatically detected platform cuda.
Running evaluation...
[2025-06-28 00:28:10,216][oumi][rank0][pid:1330249][MainThread][INFO]][models.py:506] Using the model's built-in chat template for model 'letter_counting_tutorial/llama_3b_grpo'.
INFO 06-28 00:28:19 [config.py:600] This model supports multiple tasks: {'score', 'reward', 'generate', 'embed', 'classify'}. Defaulting to 'generate'.
INFO 06-28 00:28:19 [config.py:1600] Defaulting to use mp for distributed inference
INFO 06-28 00:28:19 [config.py:1780] Chunked prefill is enabled with max_num_batched_tokens=16384.
enable CUDA graph. Since, enforce-eager is enabled, async output 

## A Better Letter Counter

Looks like we were able to significantly improve on the performance of Llama-3.2-3B-Instruct:

**BEFORE**

Accuracy: 31.00%
Properly Extracted Accuracy: 46.27%

**AFTER**

Accuracy: 51.00%
Properly Extracted Accuracy: 53.68%

Much of the improvement from GRPO came from this small LLM learning to match the output format that the answer extractor expects, but the accuracy on properly extracted samples also improved! This is a great illustration of the kind of task GRPO training excels at.
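
The formatting effect can be checked with a little arithmetic. The sketch below backs out the implied counts of correct, incorrect, and invalid answers from the two reported percentages. It assumes 100 samples per run and that properly extracted accuracy is `correct / (correct + incorrect)`; `implied_counts` is a hypothetical helper, and the "after" counts are inferred rather than read from the evaluation output.

```python
NUM_SAMPLES = 100  # matches `--tasks.0.num_samples 100` in the command above


def implied_counts(
    accuracy: float, properly_extracted_accuracy: float, num_samples: int
) -> tuple[int, int, int]:
    """Hypothetical helper: return (correct, incorrect, invalid) counts."""
    correct = round(accuracy * num_samples)
    # Number of responses from which an answer could be extracted.
    extracted = round(correct / properly_extracted_accuracy)
    return correct, extracted - correct, num_samples - extracted


before = implied_counts(0.3100, 0.4627, NUM_SAMPLES)
after = implied_counts(0.5100, 0.5368, NUM_SAMPLES)
print(f"Before (correct, incorrect, invalid): {before}")
print(f"After  (correct, incorrect, invalid): {after}")
```

Under these assumptions, the number of invalid answers drops from 33 to 5, which accounts for most of the gain in overall accuracy.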

## What's Next?

Now that you know how easy it is to train in Oumi using GRPO, perhaps you'd like to try training on your own data (in a similar data format) -- check out [our docs](https://oumi.ai/docs/en/latest/resources/datasets/sft_datasets.html#using-an-unregistered-dataset-whose-format-is-identical-to-a-registered-dataset) for an easy way to do just that. 

Have fun!