<div class="align-center">
<a href="https://oumi.ai/"><img src="https://oumi.ai/docs/en/latest/_static/logo/header_logo.png" height="200"></a>

[![Documentation](https://img.shields.io/badge/Documentation-latest-blue.svg)](https://oumi.ai/docs/en/latest/index.html)
[![Discord](https://img.shields.io/discord/1286348126797430814?label=Discord)](https://discord.gg/oumi)
[![GitHub Repo stars](https://img.shields.io/github/stars/oumi-ai/oumi)](https://github.com/oumi-ai/oumi)
<a target="_blank" href="https://colab.research.google.com/github/oumi-ai/oumi/blob/main/notebooks/Oumi%20-%20Train%20a%20Letter%20Counting%20Model%20using%20GRPO.ipynb"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>
</div>

👋 Welcome to Open Universal Machine Intelligence (Oumi)!

🚀 Oumi is a fully open-source platform that streamlines the entire lifecycle of foundation models - from [data preparation](https://oumi.ai/docs/en/latest/resources/datasets/datasets.html) and [training](https://oumi.ai/docs/en/latest/user_guides/train/train.html) to [evaluation](https://oumi.ai/docs/en/latest/user_guides/evaluate/evaluate.html) and [deployment](https://oumi.ai/docs/en/latest/user_guides/launch/launch.html). Whether you're developing on a laptop, launching large-scale experiments on a cluster, or deploying models in production, Oumi provides the tools and workflows you need.

🤝 Make sure to join our [Discord community](https://discord.gg/oumi) to get help, share your experiences, and contribute to the project! If you are interested in joining one of the community's open-science efforts, check out our [open collaboration](https://oumi.ai/community) page.

⭐ If you like Oumi and you would like to support it, please give it a star on [GitHub](https://github.com/oumi-ai/oumi).

# Train a Letter Counting Model using GRPO

This notebook delves into a fun, popular question to ask LLMs: "How Many R’s Are in the Word Strawberry?". First, we will use a custom evaluation function to evaluate many popular models on the task of counting letters in words. Then, we will use Group Relative Policy Optimization (GRPO), a reinforcement learning algorithm, to train Llama 3.2 3B to improve its performance on this task.

## Prerequisites

### Machine Requirements

This notebook runs both model evaluation and GRPO training, which require 8GB and 40GB VRAM, respectively.

❗**NOTICE:** If you're running this notebook on Colab using a T4 GPU, it's not possible to run training due to memory requirements. To run evaluation, some adjustments need to be made as vLLM doesn't support T4 GPUs. This will be explained in the evaluation section.

If your local machine cannot run this notebook, you can run it on a cloud platform instead. The following command opens a VSCode instance backed by a GCP node with 4 A100 GPUs, from which the notebook can be run. It is possible to run this notebook on just 1 GPU, but you will need to make some adjustments to the training parameters, which are explained in the training section.

```bash
# Run on your local machine
gcloud auth application-default login  # Authenticate with GCP
make gcpcode ARGS="--resources.accelerators A100:4"
```

### Oumi Installation

First, let's install Oumi and vLLM (part of the `gpu` optional dependencies). You can find more detailed instructions about Oumi installation [here](https://oumi.ai/docs/en/latest/get_started/installation.html).

In [1]:
%pip install git+https://github.com/oumi-ai/oumi.git
%pip install "vllm>=0.7.3,<0.8.0"

Collecting git+https://github.com/oumi-ai/oumi.git
  Cloning https://github.com/oumi-ai/oumi.git to /tmp/pip-req-build-2iea6pt1
  Running command git clone --filter=blob:none --quiet https://github.com/oumi-ai/oumi.git /tmp/pip-req-build-2iea6pt1
  Resolved https://github.com/oumi-ai/oumi.git to commit 2b62dbb247d11dedeaf3b72a21e64f50722188ef
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Building wheels for collected packages: oumi
  Building wheel for oumi (pyproject.toml) ... done
  Created wheel for oumi: filename=oumi-0.1.12.dev7+g2b62dbb-py3-none-any.whl size=545324 sha256=e6070141d61cadb7a6ceb2c23c5b684a928b48063f177190c6861156792db38f
  Stored in directory: /tmp/pip-ephem-wheel-cache-h4g__2rg/wheels/ba/74/ba/892fcc8d178577365d58cebdcc694e805c47b498bc53233063
Successfully built oumi
Installing collected packages: oumi
  Attempting uninstall: oumi
    

### Remote API Access

As part of this notebook, you can evaluate frontier models from OpenAI, Google, Anthropic, and Meta on the letter counting task. If you want to evaluate any of these models, set the corresponding fields below.

In [2]:
import os

os.environ["OPENAI_API_KEY"] = ""  # Set your OpenAI API key here.
os.environ["GEMINI_API_KEY"] = ""  # Set your Gemini API key here.
os.environ["ANTHROPIC_API_KEY"] = ""  # Set your Anthropic API key here.

# Set your GCP project id and region, if you want to query Llama 3.1 405B in Vertex.
REGION = ""  # Set your GCP region here.
PROJECT_ID = ""  # Set your GCP project id here.

### Tutorial Directory

Finally, we'll set up a directory to use for this tutorial, and some environment variables.

In [3]:
from pathlib import Path

tutorial_dir = "letter_counting_tutorial"

Path(tutorial_dir).mkdir(parents=True, exist_ok=True)
os.environ["TOKENIZERS_PARALLELISM"] = "false"  # Disable warnings from HF.

# This is needed for vLLM to use multiple GPUs in a notebook.
# If you're not running in a notebook, you can ignore this.
os.environ["VLLM_WORKER_MULTIPROC_METHOD"] = "spawn"

## Dataset

The dataset we'll use for this notebook is `oumi-ai/oumi-letter-count`, which can be found on [HF Datasets](https://huggingface.co/datasets/oumi-ai/oumi-letter-count). Its prompts ask the model to count the occurrences of a letter in various English words, and the metadata of each example contains the correct count. We use the `train` split for training and the `test` split for evaluation. We'll use an Oumi dataset class, `LetterCountGrpoDataset`, to load and preprocess the HF Dataset. The following code displays an example prompt from the `validation` split:

In [4]:
from pprint import pprint

from oumi.datasets.grpo.letter_count import LetterCountGrpoDataset

dataset = LetterCountGrpoDataset(split="validation")
print("-" * 80)
print("Sample:")
pprint(dataset.conversation(0).to_dict())

[2025-04-10 09:36:55,268][oumi][rank0][pid:9994][MainThread][INFO]][base_map_dataset.py:91] Creating map dataset (type: LetterCountGrpoDataset)... dataset_name: 'oumi-ai/oumi-letter-count'


README.md:   0%|          | 0.00/941 [00:00<?, ?B/s]

train-00000-of-00001.parquet:   0%|          | 0.00/4.47M [00:00<?, ?B/s]

validation-00000-of-00001.parquet:   0%|          | 0.00/400k [00:00<?, ?B/s]

test-00000-of-00001.parquet:   0%|          | 0.00/831k [00:00<?, ?B/s]

Generating train split:   0%|          | 0/100000 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/10000 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/20000 [00:00<?, ? examples/s]

[2025-04-10 09:37:00,027][oumi][rank0][pid:9994][MainThread][INFO]][base_map_dataset.py:487] Dataset Info:
	Split: validation
	Version: 0.0.0
	Dataset size: 22894322
	Download size: 5697295
	Size: 28591617 bytes
	Rows: 10000
	Columns: ['conversation_id', 'messages', 'metadata']
[2025-04-10 09:37:00,248][oumi][rank0][pid:9994][MainThread][INFO]][base_map_dataset.py:426] Loaded DataFrame with shape: (10000, 3). Columns:
conversation_id    object
messages           object
metadata           object
dtype: object
--------------------------------------------------------------------------------
Sample:
{'conversation_id': 'oumi_letter_count_0',
 'messages': [{'content': "Could you determine the count of 'l's in "
                          "'substantial'?",
               'role': 'user'},
              {'content': 'Your final answer should be written as digits and '
                          'formatted as "\\boxed{your_answer}". For example, '
                          'if the answer is 42, ma
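
To make the task concrete, here is a minimal sketch of how the ground-truth count for a prompt like the one above could be computed. The helper name `count_letter` and the case-insensitive matching are assumptions made for illustration; they are not taken from the dataset's own generation code.

```python
def count_letter(word: str, letter: str) -> int:
    """Count occurrences of a single letter in a word.

    Hypothetical helper: matching is case-insensitive here, which is an
    assumption and may differ from how the dataset labels were produced.
    """
    if len(letter) != 1:
        raise ValueError("letter must be a single character")
    return word.lower().count(letter.lower())


# Words taken from the sample prompts shown in this notebook.
for word, letter in [("substantial", "l"), ("perivaginal", "n"), ("strawberry", "r")]:
    print(f"{letter!r} in {word!r}: {count_letter(word, letter)}")
```

The task is trivial for ordinary code, which is what makes it a useful probe: models see tokens rather than individual characters, so they often miscount.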

## Evaluation

First, we'll evaluate how various models perform on the letter counting task. We'll evaluate frontier models by calling their respective remote API, and Llama 3.2 3B by running local inference on it using vLLM.

We've already defined a custom evaluation function in Oumi which runs inference on the above dataset, extracts the answer from the model response, and calculates various metrics such as accuracy. This function is defined at `src/oumi/evaluation/registry/count_letters_task.py` ([GitHub link](https://github.com/oumi-ai/oumi/blob/main/src/oumi/evaluation/registry/count_letters_task.py)), and we print its contents below for reference.

In [5]:
import inspect

from oumi.evaluation.registry.count_letters_task import count_letters

print(inspect.getsource(count_letters))

@register_evaluation_function("count_letters")
def count_letters(
    task_params: EvaluationTaskParams,
    inference_engine: BaseInferenceEngine,
) -> dict[str, Any]:
    """Custom evaluation function registered as `count_letters`."""
    dataset = LetterCountGrpoDataset(split="test")
    # TODO: OPE-1155: Add support for using Oumi dataset code to create the dataset.
    # dataset = build_dataset("oumi-ai/oumi-letter-count", tokenizer=None, sample_count=10)  # noqa: E501
    # dataset = build_dataset("oumi-ai/berrybench-v0.1.0", tokenizer=None, sample_count=10)  # noqa: E501
    num_samples = task_params.num_samples
    if num_samples is None:
        num_samples = len(dataset)
    input_conversations = [dataset.conversation(i) for i in range(num_samples)]
    conversations = inference_engine.infer(input_conversations)
    logger.info(f"Finished inference on {len(conversations)} conversations!")
    if len(conversations) > 0:
        logger.info(f"Sample conversation: {conversations
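
The function above is cut off before the metrics are computed. As a rough sketch of that last step, the snippet below tallies already-extracted predictions into the result keys this notebook prints later (`accuracy`, `num_correct_answers`, `num_incorrect_answers`, `num_invalid_answers`). The helper name `tally_predictions` is hypothetical, and treating a missing prediction (`None`) as "invalid" is an assumption about how Oumi classifies unparseable responses.

```python
from typing import Optional


def tally_predictions(
    predictions: list[Optional[int]], targets: list[int]
) -> dict[str, float]:
    """Classify each prediction as correct, incorrect, or invalid.

    Hypothetical helper: `None` stands for a response from which no
    answer could be extracted.
    """
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    correct = incorrect = invalid = 0
    for prediction, target in zip(predictions, targets):
        if prediction is None:
            invalid += 1
        elif prediction == target:
            correct += 1
        else:
            incorrect += 1
    total = len(targets)
    return {
        # Invalid answers count against accuracy, since the denominator
        # is the total number of samples.
        "accuracy": correct / total if total else 0.0,
        "num_correct_answers": correct,
        "num_incorrect_answers": incorrect,
        "num_invalid_answers": invalid,
    }


print(tally_predictions([1, 2, None, 3], [1, 1, 1, 3]))
```

Dividing by the total number of samples means a model that frequently ignores the required answer format is penalized just as if it had miscounted.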

In the following cell, you can select which models you want to evaluate. You can lower `NUM_SAMPLES` to reduce cost when calling remote APIs, at the expense of noisier results.

In [6]:
NUM_SAMPLES = 100
# We set an environment variable to be used at the end of the notebook.
os.environ["NUM_SAMPLES"] = str(NUM_SAMPLES)

model_names = [
    "llama_3b",
    # Uncomment any models you wish to evaluate - you can evaluate multiple at once.
    # "gpt_4o",
    # "gemini_pro",
    # "llama_405b",
    # "claude_sonnet",
]

❗**NOTICE:** If running this notebook on Colab, delete the following line: `inference_engine: VLLM`

In [7]:
%%writefile $tutorial_dir/llama_3b_eval.yaml

# We save this config as a YAML file as we'll use it again at the end of the notebook.
model:
  model_name: "meta-llama/Llama-3.2-3B-Instruct"
  model_max_length: 131072
  torch_dtype_str: "bfloat16"
  attn_implementation: "sdpa"
  trust_remote_code: True

inference_engine: VLLM

generation:
  max_new_tokens: 2048

tasks:
  - evaluation_backend: custom
    task_name: count_letters

output_dir: "letter_counting_tutorial/evaluation/llama_3b"

Writing letter_counting_tutorial/llama_3b_eval.yaml


In [8]:
# EvaluationConfig for various models.
# Note that Llama 3B uses the local vLLM inference engine, while the others use
# various remote engines.

with open(f"{tutorial_dir}/llama_3b_eval.yaml") as f:
    llama_3b_yaml = f.read()

configs = {
    "llama_3b": llama_3b_yaml,
    "gpt_4o": """
      model:
        model_name: "gpt-4o"

      inference_engine: OPENAI

      inference_remote_params:
        api_key_env_varname: "OPENAI_API_KEY"
        max_retries: 3
        num_workers: 100
        politeness_policy: 60
        connection_timeout: 300

      generation:
        max_new_tokens: 8192
        temperature: 0.0

      tasks:
        - evaluation_backend: custom
          task_name: count_letters

      output_dir: "letter_counting_tutorial/evaluation/gpt_4o"
      """,
    "gemini_pro": """
      model:
        model_name: "gemini-2.5-pro-preview-03-25"

      inference_engine: GOOGLE_GEMINI

      inference_remote_params:
        api_key_env_varname: "GEMINI_API_KEY"
        max_retries: 3
        num_workers: 2
        politeness_policy: 60
        connection_timeout: 300

      generation:
        max_new_tokens: 8192
        temperature: 0.0

      tasks:
        - evaluation_backend: custom
          task_name: count_letters

      output_dir: "letter_counting_tutorial/evaluation/gemini_pro"
      """,
    "llama_405b": f"""
      model:
        model_name: "meta/llama-3.1-405b-instruct-maas"

      inference_engine: GOOGLE_VERTEX

      inference_remote_params:
        api_url: "https://{REGION}-aiplatform.googleapis.com/v1beta1/projects/{PROJECT_ID}/locations/{REGION}/endpoints/openapi/chat/completions"
        max_retries: 3
        num_workers: 10
        politeness_policy: 60
        connection_timeout: 300

      generation:
        max_new_tokens: 8192
        temperature: 0.0

      tasks:
        - evaluation_backend: custom
          task_name: count_letters

      output_dir: "letter_counting_tutorial/evaluation/llama_405b"
      """,
    "claude_sonnet": """
      model:
        model_name: "claude-3-7-sonnet-latest"

      inference_engine: ANTHROPIC

      inference_remote_params:
        api_key_env_varname: "ANTHROPIC_API_KEY"
        max_retries: 3
        num_workers: 5
        politeness_policy: 65
        connection_timeout: 300

      generation:
        max_new_tokens: 8192
        temperature: 0.0

      tasks:
        - evaluation_backend: custom
          task_name: count_letters

      output_dir: "letter_counting_tutorial/evaluation/claude_sonnet"
      """,
}

In [9]:
# Run evaluation on all specified models.

from oumi.core.configs import EvaluationConfig
from oumi.core.evaluation import Evaluator

results = {}

for model_name in model_names:
    # Create the evaluation config from the YAML string.
    config_yaml: str = configs[model_name]
    config = EvaluationConfig.from_str(config_yaml)
    config.tasks[0].num_samples = NUM_SAMPLES

    # Run the evaluation.
    evaluator = Evaluator()
    evaluator_out = evaluator.evaluate(config)

    # Record the results.
    results[model_name] = evaluator_out[0].get_results()

config.json:   0%|          | 0.00/878 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/54.5k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/9.09M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/296 [00:00<?, ?B/s]

[2025-04-10 09:37:10,373][oumi][rank0][pid:9994][MainThread][INFO]][models.py:482] Using the model's built-in chat template for model 'meta-llama/Llama-3.2-3B-Instruct'.
INFO 04-10 09:37:10 __init__.py:207] Automatically detected platform cuda.
INFO 04-10 09:37:19 config.py:549] This model supports multiple tasks: {'embed', 'generate', 'classify', 'score', 'reward'}. Defaulting to 'generate'.
INFO 04-10 09:37:19 config.py:1382] Defaulting to use mp for distributed inference
INFO 04-10 09:37:19 config.py:1555] Chunked prefill is enabled with max_num_batched_tokens=2048.
INFO 04-10 09:37:19 llm_engine.py:234] Initializing a V0 LLM engine (v0.7.3) with config: model='meta-llama/Llama-3.2-3B-Instruct', speculative_config=None, tokenizer='meta-llama/Llama-3.2-3B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=131072, download_dir=None, load_format=auto, t

generation_config.json:   0%|          | 0.00/189 [00:00<?, ?B/s]

INFO 04-10 09:37:20 custom_cache_manager.py:19] Setting Triton cache manager to: vllm.triton_utils.custom_cache_manager:CustomCacheManager
INFO 04-10 09:37:21 cuda.py:229] Using Flash Attention backend.
INFO 04-10 09:37:26 __init__.py:207] Automatically detected platform cuda.
INFO 04-10 09:37:26 __init__.py:207] Automatically detected platform cuda.
INFO 04-10 09:37:26 __init__.py:207] Automatically detected platform cuda.
(VllmWorkerProcess pid=10493) INFO 04-10 09:37:26 multiproc_worker_utils.py:229] Worker ready; awaiting tasks
(VllmWorkerProcess pid=10492) INFO 04-10 09:37:26 multiproc_worker_utils.py:229] Worker ready; awaiting tasks
(VllmWorkerProcess pid=10494) INFO 04-10 09:37:26 multiproc_worker_utils.py:229] Worker ready; awaiting tasks
(VllmWorkerProcess pid=10492) INFO 04-10 09:37:27 cuda.py:229] Using Flash Attention backend.
(VllmWorkerProcess pid=10493) INFO 04-10 09:37:27 cuda.py:229] Using Flash Attentio

Loading safetensors checkpoint shards:   0% Completed | 0/2 [00:00<?, ?it/s]


(VllmWorkerProcess pid=10494) INFO 04-10 09:38:04 model_runner.py:1115] Loading model weights took 1.5341 GB
INFO 04-10 09:38:05 model_runner.py:1115] Loading model weights took 1.5341 GB
(VllmWorkerProcess pid=10493) INFO 04-10 09:38:05 model_runner.py:1115] Loading model weights took 1.5341 GB
(VllmWorkerProcess pid=10492) INFO 04-10 09:38:05 model_runner.py:1115] Loading model weights took 1.5341 GB
(VllmWorkerProcess pid=10492) INFO 04-10 09:38:12 worker.py:267] Memory profiling takes 6.17 seconds
(VllmWorkerProcess pid=10492) INFO 04-10 09:38:12 worker.py:267] the current vLLM instance can use total_gpu_memory (39.39GiB) x gpu_memory_utilization (0.90) = 35.45GiB
(VllmWorkerProcess pid=10492) INFO 04-10 09:38:12 worker.py:267] model weights take 1.53GiB; non_torch_memory takes 2.10GiB; PyTorch activation peak memory takes 0.24GiB; the rest of the memory reserved for KV Cache is 31.58GiB.
(VllmWork

Processed prompts: 100%|██████████| 100/100 [00:03<00:00, 26.70it/s, est. speed input: 2430.26 toks/s, output: 635.19 toks/s] 


[2025-04-10 09:38:26,617][oumi][rank0][pid:9994][MainThread][INFO]][count_letters_task.py:53] Finished inference on 100 conversations!
[2025-04-10 09:38:26,618][oumi][rank0][pid:9994][MainThread][INFO]][count_letters_task.py:55] Sample conversation: conversation_id='oumi_letter_count_0' messages=[USER: Look through 'perivaginal' and count the 'n's., SYSTEM: Your final answer should be written as digits and formatted as "\boxed{your_answer}". For example, if the answer is 42, make sure to output "\boxed{42}"., ASSISTANT: There are 2 'n's in 'perivaginal'. 

\boxed{2}] metadata={'letter': 'n', 'letter_count_integer': 1, 'letter_count_string': 'one', 'unformatted_prompt': 'Look through {word} and count the {letter}s.', 'word': 'perivaginal'}


In [10]:
# Print results.

print(f"Total samples: {NUM_SAMPLES}")
for model_name, result in results.items():
    print("-" * 80)
    print(f"Model: {model_name}")
    print(f"Accuracy: {result['accuracy']:.2%}")
    correct = result["num_correct_answers"]
    incorrect = result["num_incorrect_answers"]
    invalid = result["num_invalid_answers"]
    print(f"Num correct, incorrect, invalid: {correct}, {incorrect}, {invalid}")

Total samples: 100
--------------------------------------------------------------------------------
Model: llama_3b
Accuracy: 24.00%
Num correct, incorrect, invalid: 24, 69, 7


## GRPO

Now, we train Llama 3.2 3B on the task of counting letters using the GRPO algorithm implemented by [HuggingFace's `trl` library](https://huggingface.co/docs/trl/en/index).
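
GRPO samples several completions per prompt (`num_generations` in the training config) and scores each one relative to the rest of its group, so no separate value model is needed. The sketch below illustrates that group-relative advantage under simplifying assumptions: the function name and the `eps` value are hypothetical, and it uses the population standard deviation, which may differ from the estimator `trl` uses internally.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards: list[float], eps: float = 1e-4) -> list[float]:
    """Normalize each reward against its group's mean and spread.

    Illustrative only: completions that beat the group average get a
    positive advantage, the rest a negative one.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(reward - mu) / (sigma + eps) for reward in rewards]


# Rewards for four completions of the same prompt (values are made up).
print(group_relative_advantages([0.0, -1.0, -2.0, -1.0]))
```

When every completion in a group receives the same reward, all advantages are zero and that prompt contributes no learning signal, which is one reason a graded reward can be more useful than a pass/fail one.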

Note that we can calculate a concrete reward for this task by comparing the answer extracted from the model's response with the correct answer. In the reward function defined in `src/oumi/datasets/grpo/rewards/count_letters_rewards.py` ([GitHub link](https://github.com/oumi-ai/oumi/blob/main/src/oumi/datasets/grpo/rewards/count_letters_rewards.py)), we calculate the reward to be `-abs(predicted_count - target_count)`, so a perfect answer scores 0 and the penalty grows with the size of the error. We use simple heuristics to extract the predicted count. The following cell prints out the reward function code.

In [11]:
!cat ../src/oumi/datasets/grpo/rewards/count_letters_rewards.py

# Copyright 2025 - Oumi
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import re
from typing import Any, Optional

from oumi.core.registry import RegistryType, register


def _extract_prediction(response: str) -> Optional[int]:
    r"""Returns the numeric answer extracted from `\boxed{...}`, or None otherwise."""
    regex_result = re.findall(r"\\boxed\{([-+]?\d+)\}", response)
    if not regex_result or len(regex_result) != 1:
        return None
    number_str = regex_result[0]
    # Except cl
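
Since the printed source is truncated, here is a self-contained sketch of the reward described above. The regex and the "exactly one `\boxed{...}`" rule come from the snippet shown; the function names and the `invalid_reward` penalty for unparseable responses are assumptions for illustration and may not match what Oumi does in that case.

```python
import re
from typing import Optional


def extract_boxed_int(response: str) -> Optional[int]:
    r"""Return the integer inside a single `\boxed{...}`, or None otherwise."""
    matches = re.findall(r"\\boxed\{([-+]?\d+)\}", response)
    if len(matches) != 1:
        return None  # No answer, or an ambiguous response with several.
    return int(matches[0])


def letter_count_reward(
    response: str, target_count: int, invalid_reward: float = -10.0
) -> float:
    """Reward is -abs(predicted - target); 0 is the best possible score.

    `invalid_reward` is a hypothetical penalty for responses without a
    usable answer.
    """
    prediction = extract_boxed_int(response)
    if prediction is None:
        return invalid_reward
    return float(-abs(prediction - target_count))


print(letter_count_reward(r"There are 2 'n's. \boxed{2}", target_count=1))
```

Unlike a correct/incorrect signal, this reward distinguishes an answer that is off by one from an answer that is off by five.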

In [12]:
# Clean up to free up the GPU memory used for evaluation above.
import gc

import torch


def cleanup_memory():
    """Delete the evaluator and collect garbage."""
    global evaluator
    if evaluator:  # type: ignore
        del evaluator
        evaluator = None
    for _ in range(3):
        gc.collect()
        torch.cuda.empty_cache()
        torch.cuda.synchronize()


cleanup_memory()

INFO 04-10 09:38:28 multiproc_worker_utils.py:141] Terminating local vLLM worker processes
(VllmWorkerProcess pid=10493) INFO 04-10 09:38:28 multiproc_worker_utils.py:253] Worker exiting
(VllmWorkerProcess pid=10494) INFO 04-10 09:38:28 multiproc_worker_utils.py:253] Worker exiting
(VllmWorkerProcess pid=10492) INFO 04-10 09:38:28 multiproc_worker_utils.py:253] Worker exiting


❗**NOTICE:** Set `training.enable_wandb` to True if you want to log your training run to Weights and Biases. You must also be logged in to WandB, e.g., by running `wandb login`.

❗**NOTICE:** The following training config takes ~1.5 hours to run on 4 A100s, as of trl version 0.15.2. You can decrease `max_steps` below for training to run faster. Alternatively, since 500 steps is not enough to see meaningful improvement on this task, you can also increase `max_steps`. Another option is replacing it with `num_train_epochs` to set your desired number of epochs.

In [13]:
%%writefile $tutorial_dir/grpo_train.yaml

model:
  model_name: "meta-llama/Llama-3.2-3B-Instruct"
  model_max_length: 8192
  torch_dtype_str: "bfloat16"
  attn_implementation: "sdpa"

data:
  train:
    datasets:
      - dataset_name: "oumi-ai/oumi-letter-count"
        split: "train"

training:
  trainer_type: "TRL_GRPO"
  save_steps: 500
  max_steps: 500
  per_device_train_batch_size: 2
  gradient_accumulation_steps: 1
  learning_rate: 5e-5
  lr_scheduler_type: "cosine"
  warmup_steps: 20

  reward_functions: ["count_letters"]

  ddp_find_unused_parameters: False
  optimizer: "adafactor"
  compile: True

  grpo:
    num_generations: 4

  dataloader_num_workers: "auto"
  dataloader_prefetch_factor: 32

  logging_steps: 10
  output_dir: "letter_counting_tutorial/llama_3b_grpo"
  # Set this to True if you want to log to Weights and Biases.
  enable_wandb: False

Writing letter_counting_tutorial/grpo_train.yaml
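
To get a feel for how much data this config covers, the following back-of-the-envelope sketch works through the batch arithmetic. It assumes that every generated completion occupies one slot in the per-device batch and that the number of completions per step must be divisible by `num_generations`; this is our understanding of recent `trl` versions, so check the version you have installed. The helper name `grpo_step_budget` is hypothetical.

```python
def grpo_step_budget(
    num_gpus: int,
    per_device_train_batch_size: int,
    gradient_accumulation_steps: int,
    num_generations: int,
    max_steps: int,
) -> dict[str, int]:
    """Estimate completions and unique prompts processed during training.

    Assumption: each completion takes one batch slot, so a group of
    `num_generations` completions shares a single prompt.
    """
    completions_per_step = (
        num_gpus * per_device_train_batch_size * gradient_accumulation_steps
    )
    if completions_per_step % num_generations != 0:
        raise ValueError("completions per step must be divisible by num_generations")
    prompts_per_step = completions_per_step // num_generations
    return {
        "completions_per_step": completions_per_step,
        "prompts_per_step": prompts_per_step,
        "total_prompts": prompts_per_step * max_steps,
    }


# Values from the config above, on the 4-GPU node used in this notebook.
print(grpo_step_budget(4, 2, 1, 4, 500))
```

Under these assumptions, 500 steps touch about 1,000 prompts, roughly 1% of the 100,000-example `train` split. When running on a single GPU, one option is to raise `per_device_train_batch_size` to 4 so that the divisibility condition still holds, memory permitting.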


In [14]:
!oumi distributed torchrun -m oumi train -c $tutorial_dir/grpo_train.yaml

[2025-04-10 09:38:32,770][oumi][rank0][pid:10890][MainThread][INFO]][distributed_run.py:276] Running the command: ['torchrun', '--nnodes=1', '--node-rank=0', '--nproc-per-node=4', '--master-addr=127.0.0.1', '--master-port=8007', '-m', 'oumi', 'train', '-c', 'letter_counting_tutorial/grpo_train.yaml']

   ____  _    _ __  __ _____
  / __ \| |  | |  \/  |_   _|
 | |  | | |  | | \  / | | |
 | |  | | |  | | |\/| | | |
 | |__| | |__| | |  | |_| |_
  \____/ \____/|_|  |_|_____|
Loading configuration...
Ignored model.model_max_length=8192 parameter for trainer TrainerType.TRL_GRPO.
Ignored model.model_max_length=81

## Evaluating our Trained Model

Let's now evaluate our trained model to see if it improved on the letter counting task. Note that it may not improve much, since we trained it for a relatively short time.

Below, we demonstrate an alternative method of running evaluation with the `oumi` CLI. We use the same Llama 3B evaluation config we used above, with the only change being pointing it at the model we just trained.

First, we need to reset the notebook to clear variables from our previous vLLM run.

In [15]:
%reset -f

In [16]:
!oumi evaluate -c letter_counting_tutorial/llama_3b_eval.yaml \
    --model.model_name "letter_counting_tutorial/llama_3b_grpo" \
    --tasks.0.num_samples $NUM_SAMPLES \
    --output_dir "letter_counting_tutorial/evaluation/llama_3_grpo"


   ____  _    _ __  __ _____
  / __ \| |  | |  \/  |_   _|
 | |  | | |  | | \  / | | |
 | |  | | |  | | |\/| | | |
 | |__| | |__| | |  | |_| |_
  \____/ \____/|_|  |_|_____|
Loading configuration...
Running evaluation...
[2025-04-10 09:47:15,521][oumi][rank0][pid:16694][MainThread][INFO]][models.py:482] Using the model's built-in chat template for model 'letter_counting_tutorial/llama_3b_grpo'.
INFO 04-10 09:47:15 __init__.py:207] Automatically detected platform cuda.
INFO 04-10 09:47:23 config.py:549] This model supports multiple tasks: {'score', 'generate', 'classify', 'reward', 'embed'}. Defaulting to 'generate'.
INFO 04-10 09:47:23 config.py:1382] Defaulting to use mp for distributed inference
for models with max_model_len > 32K. Currently, chunked prefill might not work 
with some features or models. If you encounter any issues, please disable 
chunked prefill by se