<div class="align-center">
<a href="https://oumi.ai/"><img src="https://oumi.ai/docs/en/latest/_static/logo/header_logo.png" height="200"></a>

[![Documentation](https://img.shields.io/badge/Documentation-latest-blue.svg)](https://oumi.ai/docs/en/latest/index.html)
[![Discord](https://img.shields.io/discord/1286348126797430814?label=Discord)](https://discord.gg/oumi)
[![GitHub Repo stars](https://img.shields.io/github/stars/oumi-ai/oumi)](https://github.com/oumi-ai/oumi)
<a target="_blank" href="https://colab.research.google.com/github/oumi-ai/oumi/blob/main/notebooks/Oumi - Train a Letter Counting Model using GRPO.ipynb"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>
</div>

👋 Welcome to Open Universal Machine Intelligence (Oumi)!

🚀 Oumi is a fully open-source platform that streamlines the entire lifecycle of foundation models - from [data preparation](https://oumi.ai/docs/en/latest/resources/datasets/datasets.html) and [training](https://oumi.ai/docs/en/latest/user_guides/train/train.html) to [evaluation](https://oumi.ai/docs/en/latest/user_guides/evaluate/evaluate.html) and [deployment](https://oumi.ai/docs/en/latest/user_guides/launch/launch.html). Whether you're developing on a laptop, launching large-scale experiments on a cluster, or deploying models in production, Oumi provides the tools and workflows you need.

🤝 Make sure to join our [Discord community](https://discord.gg/oumi) to get help, share your experiences, and contribute to the project! If you are interested in joining one of the community's open-science efforts, check out our [open collaboration](https://oumi.ai/community) page.

⭐ If you like Oumi and you would like to support it, please give it a star on [GitHub](https://github.com/oumi-ai/oumi).

# Train a Letter Counting Model using GRPO

This notebook delves into a fun, popular question to ask LLMs: "How Many R’s Are in the Word Strawberry?". First, we will use a custom evaluation function to evaluate many popular models on the task of counting letters in words. Then, we will use Group Relative Policy Optimization (GRPO), a reinforcement learning algorithm, to train Llama 3.2 3B to improve its performance on this task.

## Prerequisites

### Machine Requirements

This notebook runs both model evaluation and GRPO training, which require 8GB and 40GB VRAM, respectively.

❗**NOTICE:** If you're running this notebook on Colab using a T4 GPU, it's not possible to run training due to memory requirements. To run evaluation, some adjustments need to be made as vLLM doesn't support T4 GPUs. This will be explained in the evaluation section.
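
To check which parts of the notebook your GPU can handle, a small helper along these lines may be useful. It is only a sketch: the 8GB and 40GB thresholds come from the requirements stated above, while the function names `supported_stages` and `detect_vram_gb` are hypothetical and not part of Oumi.

```python
EVAL_VRAM_GB = 8  # stated requirement for evaluation
TRAIN_VRAM_GB = 40  # stated requirement for GRPO training


def supported_stages(vram_gb: float) -> list[str]:
    """Return the notebook stages that fit in the given amount of VRAM."""
    stages = []
    if vram_gb >= EVAL_VRAM_GB:
        stages.append("evaluation")
    if vram_gb >= TRAIN_VRAM_GB:
        stages.append("training")
    return stages


def detect_vram_gb() -> float:
    """Return the total memory of GPU 0 in GB, or 0.0 if no GPU is visible."""
    try:
        import torch
    except ImportError:
        return 0.0
    if not torch.cuda.is_available():
        return 0.0
    # total_memory is reported in bytes; convert to decimal gigabytes.
    return torch.cuda.get_device_properties(0).total_memory / 1e9
```

Calling `supported_stages(detect_vram_gb())` returns an empty list when no CUDA device is visible. The thresholds are rough guides, so treat a borderline result with caution.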

If your local machine cannot run this notebook, you can instead run it on a cloud platform. The following demonstrates how to open a VSCode instance backed by a GCP node with 4 A100 GPUs, from which the notebook can be run. It is possible to run this notebook on just 1 GPU, but you will need to make some adjustments to training parameters, which will be explained in the training section.

```bash
# Run on your local machine
gcloud auth application-default login  # Authenticate with GCP
make gcpcode ARGS="--resources.accelerators A100:4"
```

### Oumi Installation

First, let's install Oumi and vLLM (part of the `gpu` optional dependencies). You can find more detailed instructions about Oumi installation [here](https://oumi.ai/docs/en/latest/get_started/installation.html).

In [None]:
%pip install git+https://github.com/oumi-ai/oumi.git
%pip install "vllm>=0.7.3,<0.8.0"

### Remote API Access

As part of this notebook, you can evaluate frontier models from OpenAI, Google, Anthropic, and Meta on the letter counting task. If you want to evaluate any of these models, set the corresponding fields below.

In [None]:
import os

os.environ["OPENAI_API_KEY"] = ""  # Set your OpenAI API key here.
os.environ["GEMINI_API_KEY"] = ""  # Set your Gemini API key here.
os.environ["ANTHROPIC_API_KEY"] = ""  # Set your Anthropic API key here.

# Set your GCP project id and region, if you want to query Llama 3.1 405B in Vertex.
REGION = ""  # Set your GCP region here.
PROJECT_ID = ""  # Set your GCP project id here.
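
Before spending time on an evaluation run, it can help to confirm which remote models actually have credentials set. The sketch below assumes the environment variable names from the cell above and the model keys used later in this notebook; the helper `models_with_credentials` is hypothetical. Llama 3.1 405B is left out because it relies on `REGION` and `PROJECT_ID` rather than an API key variable.

```python
import os

# Environment variable that each remote model reads its API key from.
REQUIRED_ENV_VARS = {
    "gpt_4o": "OPENAI_API_KEY",
    "gemini_pro": "GEMINI_API_KEY",
    "claude_sonnet": "ANTHROPIC_API_KEY",
}


def models_with_credentials(env=None) -> list[str]:
    """Return the remote model keys whose API key variable is non-empty."""
    env = os.environ if env is None else env
    return [
        model
        for model, var in REQUIRED_ENV_VARS.items()
        if env.get(var, "").strip()
    ]
```

Passing a plain dictionary as `env` makes the helper easy to try without touching your real environment.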

### Tutorial Directory

Finally, we'll set up a directory to use for this tutorial, and some environment variables.

In [None]:
from pathlib import Path

tutorial_dir = "letter_counting_tutorial"

Path(tutorial_dir).mkdir(parents=True, exist_ok=True)
os.environ["TOKENIZERS_PARALLELISM"] = "false"  # Disable warnings from HF.

# This is needed for vLLM to use multiple GPUs in a notebook.
# If you're not running in a notebook, you can ignore this.
os.environ["VLLM_WORKER_MULTIPROC_METHOD"] = "spawn"

## Dataset

The dataset we'll use for this notebook is `oumi-ai/oumi-letter-count`, which can be found on [HF Datasets](https://huggingface.co/datasets/oumi-ai/oumi-letter-count). Its prompts ask the model to count the letters in various English words, with metadata in each example containing the correct count. We use the `train` split for training and the `test` split for evaluation. We'll use an Oumi dataset class, `LetterCountGrpoDataset`, to load and preprocess the HF Dataset. The following code displays an example prompt from the `validation` split:

In [None]:
from pprint import pprint

from oumi.datasets.grpo.letter_count import LetterCountGrpoDataset

dataset = LetterCountGrpoDataset(split="validation")
print("-" * 80)
print("Sample:")
pprint(dataset.conversation(0).to_dict())
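
For intuition, the ground-truth count stored in each example's metadata is the kind of value a one-line string operation produces. The following is a minimal sketch, not the code used to build the dataset; the helper name `count_letter` and the case-insensitive matching are assumptions.

```python
def count_letter(word: str, letter: str) -> int:
    """Count case-insensitive occurrences of a single letter in a word."""
    if len(letter) != 1:
        raise ValueError("letter must be a single character")
    return word.lower().count(letter.lower())


print(count_letter("strawberry", "r"))  # prints 3
```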

## Evaluation

First, we'll evaluate how various models perform on the letter counting task. We'll evaluate frontier models by calling their respective remote API, and Llama 3.2 3B by running local inference on it using vLLM.

We've already defined a custom evaluation function in Oumi which runs inference on the above dataset, extracts the answer from the model response, and calculates various metrics such as accuracy. This function is defined at `src/oumi/evaluation/registry/count_letters_task.py` ([GitHub link](https://github.com/oumi-ai/oumi/blob/main/src/oumi/evaluation/registry/count_letters_task.py)), and we print its contents below for reference.

In [None]:
import inspect

from oumi.evaluation.registry.count_letters_task import count_letters

print(inspect.getsource(count_letters))
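
The general shape of such an evaluation function can be sketched in plain Python. This is an illustration under stated assumptions, not Oumi's implementation: it takes the last integer in a response as the predicted count, and the helpers `extract_count` and `score` are hypothetical. The result keys mirror the ones read later in this notebook.

```python
import re
from typing import Optional


def extract_count(response: str) -> Optional[int]:
    """Return the last integer in the response, or None if there is none."""
    numbers = re.findall(r"\d+", response)
    return int(numbers[-1]) if numbers else None


def score(responses: list[str], targets: list[int]) -> dict:
    """Tally correct, incorrect, and invalid (unparseable) answers."""
    correct = incorrect = invalid = 0
    for response, target in zip(responses, targets):
        predicted = extract_count(response)
        if predicted is None:
            invalid += 1
        elif predicted == target:
            correct += 1
        else:
            incorrect += 1
    total = correct + incorrect + invalid
    return {
        "accuracy": correct / total if total else 0.0,
        "num_correct_answers": correct,
        "num_incorrect_answers": incorrect,
        "num_invalid_answers": invalid,
    }
```

Separating invalid answers from incorrect ones makes it possible to tell a model that miscounts apart from one that never produces a parseable number.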

In the following section, you can select which models you want to evaluate. You can lower `NUM_SAMPLES` to reduce cost when calling remote APIs, with the downside of noisier results.

In [None]:
NUM_SAMPLES = 100

model_names = [
    "llama_3b",
    # Uncomment any models you wish to evaluate - you can evaluate multiple at once.
    # "gpt_4o",
    # "gemini_pro",
    # "llama_405b",
    # "claude_sonnet",
]
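
To get a feel for how much noise a smaller `NUM_SAMPLES` introduces, the standard error of a measured accuracy can be estimated with the usual binomial formula. This is a rough sketch that treats samples as independent; the helper `accuracy_standard_error` is hypothetical.

```python
import math


def accuracy_standard_error(accuracy: float, num_samples: int) -> float:
    """Binomial standard error of an accuracy measured on num_samples examples."""
    return math.sqrt(accuracy * (1.0 - accuracy) / num_samples)


# Worst case is an accuracy of 50%; quartering the sample count doubles the error.
print(round(accuracy_standard_error(0.5, 100), 3))  # prints 0.05
print(round(accuracy_standard_error(0.5, 25), 3))  # prints 0.1
```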

❗**NOTICE:** If running this notebook on Colab, delete the following line: `inference_engine: VLLM`

In [None]:
# EvaluationConfig for various models.
# Note that Llama 3B uses the local vLLM inference engine, while the others use various
# remote engines.
configs = {
    "llama_3b": """
      model:
        model_name: "meta-llama/Llama-3.2-3B-Instruct"
        model_max_length: 131072
        torch_dtype_str: "bfloat16"
        attn_implementation: "sdpa"
        trust_remote_code: True

      inference_engine: VLLM

      generation:
        max_new_tokens: 2048

      tasks:
        - evaluation_backend: custom
          task_name: count_letters

      output_dir: "letter_counting_tutorial/evaluation/llama_3b"
      """,
    "gpt_4o": """
      model:
        model_name: "gpt-4o"

      inference_engine: OPENAI

      inference_remote_params:
        api_key_env_varname: "OPENAI_API_KEY"
        max_retries: 3
        num_workers: 100
        politeness_policy: 60
        connection_timeout: 300

      generation:
        max_new_tokens: 8192
        temperature: 0.0

      tasks:
        - evaluation_backend: custom
          task_name: count_letters

      output_dir: "letter_counting_tutorial/evaluation/gpt_4o"
      """,
    "gemini_pro": """
      model:
        model_name: "gemini-2.5-pro-preview-03-25"

      inference_engine: GOOGLE_GEMINI

      inference_remote_params:
        api_key_env_varname: "GEMINI_API_KEY"
        max_retries: 3
        num_workers: 2
        politeness_policy: 60
        connection_timeout: 300

      generation:
        max_new_tokens: 8192
        temperature: 0.0

      tasks:
        - evaluation_backend: custom
          task_name: count_letters

      output_dir: "letter_counting_tutorial/evaluation/gemini_pro"
      """,
    "llama_405b": f"""
      model:
        model_name: "meta/llama-3.1-405b-instruct-maas"

      inference_engine: GOOGLE_VERTEX

      inference_remote_params:
        api_url: "https://{REGION}-aiplatform.googleapis.com/v1beta1/projects/{PROJECT_ID}/locations/{REGION}/endpoints/openapi/chat/completions"
        max_retries: 3
        num_workers: 10
        politeness_policy: 60
        connection_timeout: 300

      generation:
        max_new_tokens: 8192
        temperature: 0.0

      tasks:
        - evaluation_backend: custom
          task_name: count_letters

      output_dir: "letter_counting_tutorial/evaluation/llama_405b"
      """,
    "claude_sonnet": """
      model:
        model_name: "claude-3-7-sonnet-latest"

      inference_engine: ANTHROPIC

      inference_remote_params:
        api_key_env_varname: "ANTHROPIC_API_KEY"
        max_retries: 3
        num_workers: 5
        politeness_policy: 65
        connection_timeout: 300

      generation:
        max_new_tokens: 8192
        temperature: 0.0

      tasks:
        - evaluation_backend: custom
          task_name: count_letters

      output_dir: "letter_counting_tutorial/evaluation/claude_sonnet"
      """,
}
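
If you are on Colab and would rather not edit the YAML by hand, the `inference_engine: VLLM` line can be removed programmatically. The following is a minimal sketch using plain string handling; the helper `drop_vllm_engine` is hypothetical, and the engine Oumi uses once the line is gone is not shown here.

```python
def drop_vllm_engine(config_yaml: str) -> str:
    """Remove the `inference_engine: VLLM` line from a YAML config string."""
    kept = [
        line
        for line in config_yaml.splitlines()
        if line.strip() != "inference_engine: VLLM"
    ]
    return "\n".join(kept)
```

For example, `configs["llama_3b"] = drop_vllm_engine(configs["llama_3b"])` would update the local Llama config in place before running evaluation.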

In [None]:
# Run evaluation on all specified models.

from oumi.core.configs import EvaluationConfig
from oumi.core.evaluation import Evaluator

results = {}

for model_name in model_names:
    # Create the evaluation config from the YAML string.
    config_yaml: str = configs[model_name]
    config = EvaluationConfig.from_str(config_yaml)
    config.tasks[0].num_samples = NUM_SAMPLES

    # Run the evaluation.
    evaluator = Evaluator()
    evaluator_out = evaluator.evaluate(config)

    # Record the results.
    results[model_name] = evaluator_out[0].get_results()

In [None]:
# Print results.

print(f"Total samples: {NUM_SAMPLES}")
for model_name, result in results.items():
    print("-" * 80)
    print(f"Model: {model_name}")
    print(f"Accuracy: {result['accuracy']:.2%}")
    correct = result["num_correct_answers"]
    incorrect = result["num_incorrect_answers"]
    invalid = result["num_invalid_answers"]
    print(f"Num correct, incorrect, invalid: {correct}, {incorrect}, {invalid}")

## GRPO

Now, we train Llama 3.2 3B on the task of counting letters using the GRPO algorithm implemented by [HuggingFace's `trl` library](https://huggingface.co/docs/trl/en/index).

Note that we can calculate a concrete reward for this task by comparing the answer extracted from the model's response with the correct answer. In the reward function defined in `src/oumi/datasets/grpo/rewards/count_letters_rewards.py` ([GitHub link](https://github.com/oumi-ai/oumi/blob/main/src/oumi/datasets/grpo/rewards/count_letters_rewards.py)), we calculate the reward to be `-abs(predicted_count - target_count)`. We use simple heuristics to extract the predicted count. The following cell prints out the reward function code.

In [None]:
!cat ../src/oumi/datasets/grpo/rewards/count_letters_rewards.py
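
The reward formula `-abs(predicted_count - target_count)` can be illustrated in a few lines. This is a simplified sketch rather than the code in the file above: the helper `count_reward` is hypothetical, and the fixed penalty for responses with no extractable count is an assumption made for illustration.

```python
from typing import Optional

# Hypothetical penalty for responses with no extractable count (an assumption).
INVALID_PENALTY = -10.0


def count_reward(predicted_count: Optional[int], target_count: int) -> float:
    """Reward is 0 for an exact answer and more negative the further off it is."""
    if predicted_count is None:
        return INVALID_PENALTY
    return float(-abs(predicted_count - target_count))


print(count_reward(3, 3))  # prints 0.0
print(count_reward(5, 3))  # prints -2.0
```

Because the reward shrinks smoothly as the prediction approaches the target, a near miss is scored better than a wild guess, which gives the policy a more informative signal than a plain right-or-wrong score.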

In [None]:
# Clean up to free up GPU memory used for evaluation above
import gc

import torch


def cleanup_memory():
    """Delete the evaluator and collect garbage."""
    # Popping from globals() also works if no evaluation was run above,
    # in which case `evaluator` was never defined.
    globals().pop("evaluator", None)
    for _ in range(3):
        gc.collect()
        if torch.cuda.is_available():
            torch.cuda.empty_cache()
            torch.cuda.synchronize()


cleanup_memory()

❗**NOTICE:** Set `training.enable_wandb` to True if you want to log your training run to Weights and Biases. You must also log into WandB, e.g., by running `wandb login`.

In [None]:
%%writefile $tutorial_dir/grpo_train.yaml

model:
  model_name: "meta-llama/Llama-3.2-3B-Instruct"
  model_max_length: 8192
  torch_dtype_str: "bfloat16"
  attn_implementation: "sdpa"

data:
  train:
    datasets:
      - dataset_name: "oumi-ai/oumi-letter-count"
        split: "train"

training:
  trainer_type: "TRL_GRPO"
  save_steps: 500
  max_steps: 500
  per_device_train_batch_size: 2
  gradient_accumulation_steps: 1
  learning_rate: 5e-5

  reward_functions: ["count_letters"]

  ddp_find_unused_parameters: False
  optimizer: "adafactor"
  compile: True

  grpo:
    num_generations: 4

  dataloader_num_workers: "auto"
  dataloader_prefetch_factor: 32

  logging_steps: 10
  output_dir: "letter_counting_tutorial/llama_3b_grpo"
  # Set this to True if you want to log to Weights and Biases.
  enable_wandb: False


In [None]:
!oumi distributed torchrun -m oumi train -c $tutorial_dir/grpo_train.yaml
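
The "group relative" part of GRPO can be sketched as follows: for each prompt, several completions are sampled (`num_generations: 4` in the config above), and each completion's reward is normalized against the others in its group. This is one common formulation written for illustration; it is not `trl`'s implementation, which may differ in details such as the standard deviation estimator, and the helper `group_relative_advantages` and the `eps` value are assumptions.

```python
from statistics import mean, pstdev


def group_relative_advantages(
    rewards: list[float], eps: float = 1e-4
) -> list[float]:
    """Normalize each reward by the mean and spread of its own group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    return [(r - mu) / (sigma + eps) for r in rewards]


# Rewards for four completions of one prompt, using -abs(predicted - target).
advantages = group_relative_advantages([0.0, -1.0, -2.0, -1.0])
print([round(a, 2) for a in advantages])
```

Completions that beat their group's average get a positive advantage and are reinforced, while worse ones are discouraged. If every completion in a group earns the same reward, all advantages are zero and that prompt contributes no learning signal.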

## Evaluating our Trained Model

Let's now evaluate our trained model to see if it improved on the letter counting task. This simply involves running the Llama 3B evaluation config we defined above, but pointing it at the model checkpoint produced by training.

In [None]:
# Create the evaluation config from the YAML string.
config_yaml: str = configs["llama_3b"]
config = EvaluationConfig.from_str(config_yaml)
config.tasks[0].num_samples = NUM_SAMPLES
config.model.model_name = "letter_counting_tutorial/llama_3b_grpo"

# Run the evaluation.
evaluator = Evaluator()
evaluator_out = evaluator.evaluate(config)

# Record the results.
trained_model_results = evaluator_out[0].get_results()

print(f"Accuracy: {trained_model_results['accuracy']:.2%}")
correct = trained_model_results["num_correct_answers"]
incorrect = trained_model_results["num_incorrect_answers"]
invalid = trained_model_results["num_invalid_answers"]
print(f"Num correct, incorrect, invalid: {correct}, {incorrect}, {invalid}")
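
To compare the trained model against the baseline, the two accuracies can be turned into a change in percentage points. This is a small sketch that assumes accuracies are fractions between 0 and 1, as the percent formatting earlier in this notebook suggests; the helper `accuracy_gain_points` is hypothetical.

```python
def accuracy_gain_points(baseline_accuracy: float, trained_accuracy: float) -> float:
    """Return the accuracy change in percentage points (positive = improved)."""
    return (trained_accuracy - baseline_accuracy) * 100.0
```

If you evaluated `llama_3b` earlier, `accuracy_gain_points(results["llama_3b"]["accuracy"], trained_model_results["accuracy"])` gives the change. With a small `NUM_SAMPLES`, a difference of a few points may fall within sampling noise.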