<div class="align-center">
<a href="https://oumi.ai/"><img src="https://oumi.ai/docs/en/latest/_static/logo/header_logo.png" height="200"></a>

[![Documentation](https://img.shields.io/badge/Documentation-latest-blue.svg)](https://oumi.ai/docs/en/latest/index.html)
[![Discord](https://img.shields.io/discord/1286348126797430814?label=Discord)](https://discord.gg/oumi)
[![GitHub Repo stars](https://img.shields.io/github/stars/oumi-ai/oumi)](https://github.com/oumi-ai/oumi)
</div>

👋 Welcome to Open Universal Machine Intelligence (Oumi)!

🚀 Oumi is a fully open-source platform that streamlines the entire lifecycle of foundation models - from [data preparation](https://oumi.ai/docs/en/latest/resources/datasets/datasets.html) and [training](https://oumi.ai/docs/en/latest/user_guides/train/train.html) to [evaluation](https://oumi.ai/docs/en/latest/user_guides/evaluate/evaluate.html) and [deployment](https://oumi.ai/docs/en/latest/user_guides/launch/launch.html). Whether you're developing on a laptop, launching large scale experiments on a cluster, or deploying models in production, Oumi provides the tools and workflows you need.

🤝 Make sure to join our [Discord community](https://discord.gg/oumi) to get help, share your experiences, and contribute to the project! If you are interested in joining one of the community's open-science efforts, check out our [open collaboration](https://oumi.ai/community) page.

⭐ If you like Oumi and you would like to support it, please give it a star on [GitHub](https://github.com/oumi-ai/oumi).

# OpenEnv GRPO with trl

In this tutorial notebook, we're going to use Oumi to train an agentic model on an [OpenEnv](https://github.com/meta-pytorch/OpenEnv) Echo reinforcement learning (RL) environment with the GRPO algorithm. To achieve this, we use the trl library by Hugging Face with a custom rollout function to interact with the vLLM server and OpenEnv environment.

# 📋 Prerequisites

❗**NOTICE:** This notebook needs to be running on a machine with at least two GPUs.

## Oumi Installation

First, let's install the latest versions of Oumi, trl, and OpenEnv. You can find more detailed instructions [here](https://oumi.ai/docs/en/latest/get_started/installation.html).

In [None]:
# !pip install uv && uv pip install "oumi[gpu] @ git+https://github.com/oumi-ai/oumi.git"
!pip install uv && uv pip install -e "..[gpu]"
!uv pip install git+https://github.com/meta-pytorch/OpenEnv.git
!uv pip install git+https://github.com/huggingface/trl.git

In [1]:
import os
from pathlib import Path

tutorial_dir = "openenv_tutorial"

Path(tutorial_dir).mkdir(parents=True, exist_ok=True)
os.environ["TOKENIZERS_PARALLELISM"] = "false"  # Disable warnings from HF.

# Start OpenEnv and vLLM servers

We need to run 2 servers in addition to the trl trainer. The OpenEnv server receives actions from the LLM and returns the updated state and reward. The vLLM server is used for inference, and updates it weights over training with the updated model weights from the trainer. We start these with separate subprocesses.

In [2]:
%%writefile $tutorial_dir/start_openenv_server.py

import os
import subprocess
import sys
import threading
import time
from pathlib import Path

import requests


def stream_output(pipe, prefix=""):
    """Stream output lines from subprocess pipe to stdout."""
    for line in iter(pipe.readline, ""):
        print(f"{prefix}{line}", end="")
    pipe.close()


print("⚡ Starting FastAPI server for Echo Environment...")

work_dir = str(Path.cwd().parent.absolute())

server_process = subprocess.Popen(
    [
        sys.executable,
        "-m",
        "uvicorn",
        "envs.echo_env.server.app:app",
        "--host",
        "0.0.0.0",
        "--port",
        "8001",
    ],
    env={**os.environ, "PYTHONPATH": f"{work_dir}/src"},
    stdout=subprocess.PIPE,
    stderr=subprocess.PIPE,
    text=True,
    cwd=work_dir,
)

# Start background threads to stream errors
threading.Thread(
    target=stream_output, args=(server_process.stderr, "🔥 [stderr] "), daemon=True
).start()

print("⏳ Waiting for server to start...")
time.sleep(5)

try:
    response = requests.get("http://0.0.0.0:8001/health", timeout=2)
    print("\n✅ Echo Environment server is running!")
except Exception as e:
    print(f"\n❌ Server failed to start: {e}")
    print("\n📋 Checking error output...")
    server_process.poll()
    if server_process.stderr:
        stderr = server_process.stderr.read()
        if stderr:
            print(stderr)
    raise

try:
    input("Press Enter to exit...\n")
finally:
    print("🛑 Stopping server...")
    server_process.terminate()
    server_process.wait()

Overwriting openenv_tutorial/start_openenv_server.py


In [None]:
import subprocess

# Start both servers in the background
server1 = subprocess.Popen(
    [
        "bash",
        "-c",
        (
            "CUDA_VISIBLE_DEVICES=0 trl vllm-serve "
            "--model Qwen/Qwen2.5-0.5B-Instruct "
            "--log-level warning "
            "--host 0.0.0.0 --port 8000"
        ),
    ]
)
server2 = subprocess.Popen(["python", f"{tutorial_dir}/start_openenv_server.py"])

print("Servers started. PIDs:", server1.pid, server2.pid)

Servers started. PIDs: 3616371 3616372


⚡ Starting FastAPI server for Echo Environment...
⏳ Waiting for server to start...


In [4]:
import time

import requests

URL = "http://0.0.0.0:8000/health"


def check_vllm_health():
    """Checks if the vLLM server is healthy."""
    try:
        response = requests.get(URL, timeout=3)
        if response.status_code == 200:
            print("✅ vLLM server is healthy!")
            return True
        else:
            print(f"⚠️ Server responded with {response.status_code}")
    except requests.RequestException as e:
        print(f"❌ Server not ready: {e}")
    return False


max_retries = 24
for attempt in range(1, max_retries + 1):
    if check_vllm_health():
        break
    time.sleep(5)
else:
    print(f"❌ Failed to start vLLM server after {max_retries} attempts.")

❌ Server not ready: HTTPConnectionPool(host='0.0.0.0', port=8000): Max retries exceeded with url: /health (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x725828b5b510>: Failed to establish a new connection: [Errno 111] Connection refused'))
🔥 [stderr] INFO:     Started server process [3616373]
🔥 [stderr] INFO:     Waiting for application startup.
🔥 [stderr] INFO:     Application startup complete.
🔥 [stderr] INFO:     Uvicorn running on http://0.0.0.0:8001 (Press CTRL+C to quit)

✅ Echo Environment server is running!
Press Enter to exit...
❌ Server not ready: HTTPConnectionPool(host='0.0.0.0', port=8000): Max retries exceeded with url: /health (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x725828b61dd0>: Failed to establish a new connection: [Errno 111] Connection refused'))
❌ Server not ready: HTTPConnectionPool(host='0.0.0.0', port=8000): Max retries exceeded with url: /health (Caused by NewConnectionError('<urllib3.connecti

`torch_dtype` is deprecated! Use `dtype` instead!


INFO 10-30 20:09:21 [__init__.py:742] Resolved architecture: Qwen2ForCausalLM
INFO 10-30 20:09:21 [__init__.py:1815] Using max model len 32768
INFO 10-30 20:09:22 [scheduler.py:222] Chunked prefill is enabled with max_num_batched_tokens=16384.
❌ Server not ready: HTTPConnectionPool(host='0.0.0.0', port=8000): Max retries exceeded with url: /health (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x725828b69250>: Failed to establish a new connection: [Errno 111] Connection refused'))
❌ Server not ready: HTTPConnectionPool(host='0.0.0.0', port=8000): Max retries exceeded with url: /health (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x725828b63110>: Failed to establish a new connection: [Errno 111] Connection refused'))
INFO 10-30 20:09:30 [__init__.py:216] Automatically detected platform cuda.
[1;36m(EngineCore_DP0 pid=3616607)[0;0m INFO 10-30 20:09:31 [core.py:654] Waiting for init message from front-end.
[1;36m(EngineCore_DP



[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[1;36m(EngineCore_DP0 pid=3616607)[0;0m INFO 10-30 20:09:32 [parallel_state.py:1165] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0, EP rank 0
[1;36m(EngineCore_DP0 pid=3616607)[0;0m INFO 10-30 20:09:32 [gpu_model_runner.py:2338] Starting to load model Qwen/Qwen2.5-0.5B-Instruct...
[1;36m(EngineCore_DP0 pid=3616607)[0;0m INFO 10-30 20:09:33 [gpu_model_runner.py:2370] Loading model from scratch...
[1;36m(EngineCore_DP0 pi

Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  6.46it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  6.45it/s]
[1;36m(EngineCore_DP0 pid=3616607)[0;0m 


[1;36m(EngineCore_DP0 pid=3616607)[0;0m INFO 10-30 20:09:33 [default_loader.py:268] Loading weights took 0.17 seconds
[1;36m(EngineCore_DP0 pid=3616607)[0;0m INFO 10-30 20:09:33 [gpu_model_runner.py:2392] Model loading took 0.9266 GiB and 0.518918 seconds
❌ Server not ready: HTTPConnectionPool(host='0.0.0.0', port=8000): Max retries exceeded with url: /health (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x725828b5ab10>: Failed to establish a new connection: [Errno 111] Connection refused'))
[1;36m(EngineCore_DP0 pid=3616607)[0;0m INFO 10-30 20:09:37 [backends.py:539] Using cache directory: /home/wizeng/.cache/vllm/torch_compile_cache/5d31f4c583/rank_0_0/backbone for vLLM's torch.compile
[1;36m(EngineCore_DP0 pid=3616607)[0;0m INFO 10-30 20:09:37 [backends.py:550] Dynamo bytecode transform time: 3.41 s
[1;36m(EngineCore_DP0 pid=3616607)[0;0m INFO 10-30 20:09:39 [backends.py:161] Directly load the compiled graph(s) for dynamic shape from the cache

Capturing CUDA graphs (mixed prefill-decode, PIECEWISE): 100%|██████████| 67/67 [00:01<00:00, 40.61it/s]


[1;36m(EngineCore_DP0 pid=3616607)[0;0m INFO 10-30 20:09:42 [gpu_model_runner.py:3118] Graph capturing finished in 2 secs, took 0.50 GiB
[1;36m(EngineCore_DP0 pid=3616607)[0;0m INFO 10-30 20:09:42 [gpu_worker.py:391] Free memory on device (78.59/79.19 GiB) on startup. Desired GPU memory utilization is (0.9, 71.27 GiB). Actual usage is 0.93 GiB for weight, 5.57 GiB for peak activation, 0.07 GiB for non-torch memory, and 0.5 GiB for CUDAGraph memory. Replace gpu_memory_utilization config with `--kv-cache-memory=68779733708` to fit into requested memory, or `--kv-cache-memory=76635612672` to fully utilize gpu memory. Current kv cache memory in use is 69478085324 bytes.
[1;36m(EngineCore_DP0 pid=3616607)[0;0m INFO 10-30 20:09:42 [core.py:218] init engine (profile, create kv cache, warmup model) took 8.65 seconds
INFO 10-30 20:09:43 [llm.py:295] Supported_tasks: ['generate']
INFO 10-30 20:09:43 [__init__.py:36] No IOProcessor plugins requested by the model
✅ vLLM server is healthy!


# Train the model!

By providing a custom rollout function to interact with the OpenEnv and vLLM servers, we can use trl to do agentic GRPO training. We also need to provide a reward function that processes the reward value output by the environment.

In [5]:
!tail -n +27 ../src/oumi/datasets/grpo/rollouts/echo_env_vllm_rollout.py

@register("echo_env_vllm_rollout", RegistryType.ROLLOUT_FUNCTION)
def echo_env_vllm_rollout(
    prompts: list[str], args, processing_class
) -> dict[str, list]:
    """Custom rollout function that generates completions via vLLM server and computes environment rewards.

    Args:
        prompts: List of prompts to generate from
        args: GRPOConfig containing all sampling parameters
        processing_class: Tokenizer/processor for decoding completions

    Returns:
        Dict containing prompt_ids, completion_ids, logprobs, and env_reward
    """  # noqa: E501
    # 1. Generate completions via vLLM inference server (running on port 8000)
    payload = {
        "prompts": prompts,
        "n": args.num_generations,
        "temperature": args.temperature,
        "top_p": args.top_p,
        "top_k": -1 if args.top_k is None else args.top_k,
        "min_p": 0.0 if args.min_p is None else args.min_p,
        "max_tokens": args.max_completion_length,
        "repetition_penalty"

In [6]:
!tail -n +23 ../src/oumi/datasets/grpo/rewards/env_reward.py

@register("env_reward", RegistryType.REWARD_FUNCTION)
def reward_from_env(completions, **kwargs):
    """Reward function that uses the environment reward."""
    # Extract environment rewards from kwargs (propagated via extra_fields)
    env_rewards = kwargs.get("env_reward", [])
    if env_rewards:
        return [float(reward) for reward in env_rewards]
    else:
        # Fallback if env_reward is not available
        return [0.0] * len(completions)


In [7]:
%%writefile $tutorial_dir/grpo_train.yaml

model:
  model_name: "Qwen/Qwen2-0.5B-Instruct"
  model_max_length: 2048
  torch_dtype_str: "bfloat16"
  attn_implementation: "sdpa"

data:
  train:
    datasets:
      - dataset_name: "trl-lib/ultrafeedback-prompt"
        split: "train"
        sample_count: 100

training:
  trainer_type: "TRL_GRPO"
  per_device_train_batch_size: 8
  gradient_accumulation_steps: 4

  reward_functions: ["env_reward"]

  ddp_find_unused_parameters: False
  optimizer: "adamw_torch_fused"

  grpo:
    use_vllm: True
    rollout_function: "echo_env_vllm_rollout"

  dataloader_num_workers: "auto"
  dataloader_prefetch_factor: 32

  num_train_epochs: 1
  logging_steps: 1
  log_model_summary: False
  output_dir: "openenv_tutorial/echo_grpo"

Overwriting openenv_tutorial/grpo_train.yaml


In [8]:
!CUDA_VISIBLE_DEVICES=1 oumi train -c $tutorial_dir/grpo_train.yaml


[32m   ____  _    _ __  __ _____[0m
[32m  / __ \| |  | |  \/  |_   _|[0m
[32m | |  | | |  | | \  / | | |[0m
[32m | |  | | |  | | |\/| | | |[0m
[32m | |__| | |__| | |  | |_| |_[0m
[32m  \____/ \____/|_|  |_|_____|[0m

[2K[32m⠸[0m [32mLoading configuration...[0m0m
[2;36m                    [0m         model.[33mmodel_max_length[0m=[1;36m2048[0m  [2m                      [0m
[2;36m                    [0m         parameter for trainer        [2m                      [0m
[2;36m                    [0m         TrainerType.TRL_GRPO.        [2m                      [0m
[2;36m                   [0m[2;36m [0m[34mINFO    [0m [1m[[0mrank-[1;36m0[0m[1m][0m Setting random seed to  ]8;id=358934;file:///home/wizeng/repos/oumi/src/oumi/core/distributed.py\[2mdistributed.py[0m]8;;\[2m:[0m]8;id=692189;file:///home/wizeng/repos/oumi/src/oumi/core/distributed.py#616\[2m616[0m]8;;\
[2;36m                    [0m         [1;36m42[0m on rank [1;36

Adding requests: 100%|██████████| 4/4 [00:00<00:00, 1390.22it/s]
Processed prompts: 100%|██████████| 32/32 [00:01<00:00, 27.97it/s, est. speed input: 1084.01 toks/s, output: 3418.09 toks/s]


{'loss': -0.3104, 'grad_norm': 5.6875, 'learning_rate': 5e-05, 'num_tokens': 5150.0, 'completions/mean_length': 122.1875, 'completions/min_length': 22.0, 'completions/max_length': 256.0, 'completions/clipped_ratio': 0.15625, 'completions/mean_terminated_length': 97.40740966796875, 'completions/min_terminated_length': 22.0, 'completions/max_terminated_length': 244.0, 'rewards/reward_from_env/mean': 60.24374771118164, 'rewards/reward_from_env/std': 46.80415344238281, 'reward': 60.24374771118164, 'reward_std': 20.179214477539062, 'frac_reward_zero_std': 0.0, 'sampling/sampling_logp_difference/mean': 0.10425987094640732, 'sampling/sampling_logp_difference/max': 1.4170303344726562, 'sampling/importance_sampling_ratio/min': 0.24243289232254028, 'sampling/importance_sampling_ratio/mean': 1.0240256786346436, 'sampling/importance_sampling_ratio/max': 1.5776448249816895, 'entropy': 1.3606750071048737, 'clip_ratio/low_mean': 0.0, 'clip_ratio/low_min': 0.0, 'clip_ratio/high_mean': 0.0, 'clip_ratio

Adding requests: 100%|██████████| 4/4 [00:00<00:00, 1469.49it/s]
Processed prompts: 100%|██████████| 32/32 [00:00<00:00, 43.54it/s, est. speed input: 3124.69 toks/s, output: 9436.59 toks/s]


{'loss': -0.1341, 'grad_norm': 4.28125, 'learning_rate': 4.8e-05, 'num_tokens': 14380.0, 'completions/mean_length': 216.6875, 'completions/min_length': 3.0, 'completions/max_length': 256.0, 'completions/clipped_ratio': 0.65625, 'completions/mean_terminated_length': 141.63636779785156, 'completions/min_terminated_length': 3.0, 'completions/max_terminated_length': 247.0, 'rewards/reward_from_env/mean': 99.38749694824219, 'rewards/reward_from_env/std': 42.04217529296875, 'reward': 99.38749694824219, 'reward_std': 17.43398094177246, 'frac_reward_zero_std': 0.0, 'sampling/sampling_logp_difference/mean': 0.0798419937491417, 'sampling/sampling_logp_difference/max': 1.7389039993286133, 'sampling/importance_sampling_ratio/min': 0.17571288347244263, 'sampling/importance_sampling_ratio/mean': 1.0187627077102661, 'sampling/importance_sampling_ratio/max': 1.660508632659912, 'entropy': 1.0239375233650208, 'clip_ratio/low_mean': 0.0, 'clip_ratio/low_min': 0.0, 'clip_ratio/high_mean': 0.0, 'clip_ratio

Adding requests: 100%|██████████| 4/4 [00:00<00:00, 1801.87it/s]
Processed prompts: 100%|██████████| 32/32 [00:00<00:00, 43.31it/s, est. speed input: 2057.68 toks/s, output: 11089.71 toks/s]


{'loss': -0.0002, 'grad_norm': 3.765625, 'learning_rate': 4.600000000000001e-05, 'num_tokens': 24092.0, 'completions/mean_length': 256.0, 'completions/min_length': 256.0, 'completions/max_length': 256.0, 'completions/clipped_ratio': 1.0, 'completions/mean_terminated_length': 0.0, 'completions/min_terminated_length': 0.0, 'completions/max_terminated_length': 0.0, 'rewards/reward_from_env/mean': 123.50312042236328, 'rewards/reward_from_env/std': 21.738571166992188, 'reward': 123.50312042236328, 'reward_std': 17.99724769592285, 'frac_reward_zero_std': 0.0, 'sampling/sampling_logp_difference/mean': 0.07937926799058914, 'sampling/sampling_logp_difference/max': 1.444777488708496, 'sampling/importance_sampling_ratio/min': 0.23579853773117065, 'sampling/importance_sampling_ratio/mean': 1.015897512435913, 'sampling/importance_sampling_ratio/max': 1.5411566495895386, 'entropy': 0.9619140625, 'clip_ratio/low_mean': 0.00048828125, 'clip_ratio/low_min': 0.00048828125, 'clip_ratio/high_mean': 0.0003

Adding requests: 100%|██████████| 4/4 [00:00<00:00, 1437.39it/s]
Processed prompts: 100%|██████████| 32/32 [00:00<00:00, 42.02it/s, est. speed input: 3824.36 toks/s, output: 10758.56 toks/s]


{'loss': 0.0005, 'grad_norm': 3.359375, 'learning_rate': 4.4000000000000006e-05, 'num_tokens': 35196.0, 'completions/mean_length': 256.0, 'completions/min_length': 256.0, 'completions/max_length': 256.0, 'completions/clipped_ratio': 1.0, 'completions/mean_terminated_length': 0.0, 'completions/min_terminated_length': 0.0, 'completions/max_terminated_length': 0.0, 'rewards/reward_from_env/mean': 130.62811279296875, 'rewards/reward_from_env/std': 22.762493133544922, 'reward': 130.62811279296875, 'reward_std': 19.14883804321289, 'frac_reward_zero_std': 0.0, 'sampling/sampling_logp_difference/mean': 0.06914907693862915, 'sampling/sampling_logp_difference/max': 1.5367345809936523, 'sampling/importance_sampling_ratio/min': 0.2150823026895523, 'sampling/importance_sampling_ratio/mean': 1.0158569812774658, 'sampling/importance_sampling_ratio/max': 1.5292932987213135, 'entropy': 0.822265625, 'clip_ratio/low_mean': 0.0, 'clip_ratio/low_min': 0.0, 'clip_ratio/high_mean': 0.00146484375, 'clip_ratio

Adding requests: 100%|██████████| 4/4 [00:00<00:00, 2011.42it/s]
Processed prompts: 100%|██████████| 32/32 [00:00<00:00, 42.54it/s, est. speed input: 1616.81 toks/s, output: 10892.05 toks/s]


{'loss': 0.0006, 'grad_norm': 3.40625, 'learning_rate': 4.2e-05, 'num_tokens': 44604.0, 'completions/mean_length': 256.0, 'completions/min_length': 256.0, 'completions/max_length': 256.0, 'completions/clipped_ratio': 1.0, 'completions/mean_terminated_length': 0.0, 'completions/min_terminated_length': 0.0, 'completions/max_terminated_length': 0.0, 'rewards/reward_from_env/mean': 145.4812469482422, 'rewards/reward_from_env/std': 7.002968788146973, 'reward': 145.4812469482422, 'reward_std': 5.854681015014648, 'frac_reward_zero_std': 0.0, 'sampling/sampling_logp_difference/mean': 0.07969337701797485, 'sampling/sampling_logp_difference/max': 1.394791603088379, 'sampling/importance_sampling_ratio/min': 0.24788470566272736, 'sampling/importance_sampling_ratio/mean': 1.016239881515503, 'sampling/importance_sampling_ratio/max': 1.4886819124221802, 'entropy': 0.939453125, 'clip_ratio/low_mean': 0.000244140625, 'clip_ratio/low_min': 0.000244140625, 'clip_ratio/high_mean': 0.0008544921875, 'clip_r

Adding requests: 100%|██████████| 4/4 [00:00<00:00, 1760.46it/s]
Processed prompts: 100%|██████████| 32/32 [00:00<00:00, 43.12it/s, est. speed input: 2641.14 toks/s, output: 11038.80 toks/s]


{'loss': 0.0008, 'grad_norm': 3.765625, 'learning_rate': 4e-05, 'num_tokens': 54756.0, 'completions/mean_length': 256.0, 'completions/min_length': 256.0, 'completions/max_length': 256.0, 'completions/clipped_ratio': 1.0, 'completions/mean_terminated_length': 0.0, 'completions/min_terminated_length': 0.0, 'completions/max_terminated_length': 0.0, 'rewards/reward_from_env/mean': 142.4375, 'rewards/reward_from_env/std': 19.4844970703125, 'reward': 142.4375, 'reward_std': 15.795616149902344, 'frac_reward_zero_std': 0.0, 'sampling/sampling_logp_difference/mean': 0.09761585295200348, 'sampling/sampling_logp_difference/max': 1.6260900497436523, 'sampling/importance_sampling_ratio/min': 0.1966971457004547, 'sampling/importance_sampling_ratio/mean': 1.019674301147461, 'sampling/importance_sampling_ratio/max': 1.6572049856185913, 'entropy': 1.208984375, 'clip_ratio/low_mean': 0.0001220703125, 'clip_ratio/low_min': 0.0001220703125, 'clip_ratio/high_mean': 0.0003662109375, 'clip_ratio/high_max': 0

Adding requests: 100%|██████████| 4/4 [00:00<00:00, 1486.95it/s]
Processed prompts: 100%|██████████| 32/32 [00:00<00:00, 42.78it/s, est. speed input: 3422.94 toks/s, output: 10953.31 toks/s]


{'loss': -0.0016, 'grad_norm': 4.75, 'learning_rate': 3.8e-05, 'num_tokens': 65508.0, 'completions/mean_length': 256.0, 'completions/min_length': 256.0, 'completions/max_length': 256.0, 'completions/clipped_ratio': 1.0, 'completions/mean_terminated_length': 0.0, 'completions/min_terminated_length': 0.0, 'completions/max_terminated_length': 0.0, 'rewards/reward_from_env/mean': 147.90936279296875, 'rewards/reward_from_env/std': 21.37916374206543, 'reward': 147.90936279296875, 'reward_std': 17.90665054321289, 'frac_reward_zero_std': 0.0, 'sampling/sampling_logp_difference/mean': 0.1063748449087143, 'sampling/sampling_logp_difference/max': 1.522308349609375, 'sampling/importance_sampling_ratio/min': 0.21820761263370514, 'sampling/importance_sampling_ratio/mean': 1.0218125581741333, 'sampling/importance_sampling_ratio/max': 1.9696154594421387, 'entropy': 1.330078125, 'clip_ratio/low_mean': 0.0003662109375, 'clip_ratio/low_min': 0.0003662109375, 'clip_ratio/high_mean': 0.0013427734375, 'clip

Adding requests: 100%|██████████| 4/4 [00:00<00:00, 1732.65it/s]
Processed prompts: 100%|██████████| 32/32 [00:00<00:00, 42.54it/s, est. speed input: 3052.72 toks/s, output: 10891.86 toks/s]


{'loss': -0.0002, 'grad_norm': 6.0, 'learning_rate': 3.6e-05, 'num_tokens': 75996.0, 'completions/mean_length': 256.0, 'completions/min_length': 256.0, 'completions/max_length': 256.0, 'completions/clipped_ratio': 1.0, 'completions/mean_terminated_length': 0.0, 'completions/min_terminated_length': 0.0, 'completions/max_terminated_length': 0.0, 'rewards/reward_from_env/mean': 148.5749969482422, 'rewards/reward_from_env/std': 27.652334213256836, 'reward': 148.5749969482422, 'reward_std': 26.42963981628418, 'frac_reward_zero_std': 0.0, 'sampling/sampling_logp_difference/mean': 0.10105139017105103, 'sampling/sampling_logp_difference/max': 1.3686447143554688, 'sampling/importance_sampling_ratio/min': 0.25445157289505005, 'sampling/importance_sampling_ratio/mean': 1.021196722984314, 'sampling/importance_sampling_ratio/max': 2.0, 'entropy': 1.2109375, 'clip_ratio/low_mean': 0.0, 'clip_ratio/low_min': 0.0, 'clip_ratio/high_mean': 0.001220703125, 'clip_ratio/high_max': 0.001220703125, 'clip_rat

Adding requests: 100%|██████████| 4/4 [00:00<00:00, 1869.12it/s]
Processed prompts: 100%|██████████| 32/32 [00:00<00:00, 35.54it/s, est. speed input: 1279.46 toks/s, output: 9098.35 toks/s]


{'loss': 0.0009, 'grad_norm': 5.125, 'learning_rate': 3.4000000000000007e-05, 'num_tokens': 85340.0, 'completions/mean_length': 256.0, 'completions/min_length': 256.0, 'completions/max_length': 256.0, 'completions/clipped_ratio': 1.0, 'completions/mean_terminated_length': 0.0, 'completions/min_terminated_length': 0.0, 'completions/max_terminated_length': 0.0, 'rewards/reward_from_env/mean': 150.13436889648438, 'rewards/reward_from_env/std': 25.33433723449707, 'reward': 150.13436889648438, 'reward_std': 24.00356674194336, 'frac_reward_zero_std': 0.0, 'sampling/sampling_logp_difference/mean': 0.09709164500236511, 'sampling/sampling_logp_difference/max': 1.6557836532592773, 'sampling/importance_sampling_ratio/min': 0.19094236195087433, 'sampling/importance_sampling_ratio/mean': 1.022766351699829, 'sampling/importance_sampling_ratio/max': 1.664004921913147, 'entropy': 1.18359375, 'clip_ratio/low_mean': 0.0001220703125, 'clip_ratio/low_min': 0.0001220703125, 'clip_ratio/high_mean': 0.000610

Adding requests: 100%|██████████| 4/4 [00:00<00:00, 1653.91it/s]
Processed prompts: 100%|██████████| 32/32 [00:00<00:00, 42.68it/s, est. speed input: 2390.22 toks/s, output: 10926.61 toks/s]


{'loss': 0.002, 'grad_norm': 3.015625, 'learning_rate': 3.2000000000000005e-05, 'num_tokens': 95324.0, 'completions/mean_length': 256.0, 'completions/min_length': 256.0, 'completions/max_length': 256.0, 'completions/clipped_ratio': 1.0, 'completions/mean_terminated_length': 0.0, 'completions/min_terminated_length': 0.0, 'completions/max_terminated_length': 0.0, 'rewards/reward_from_env/mean': 199.60000610351562, 'rewards/reward_from_env/std': 38.48003387451172, 'reward': 199.60000610351562, 'reward_std': 29.59756851196289, 'frac_reward_zero_std': 0.0, 'sampling/sampling_logp_difference/mean': 0.054048649966716766, 'sampling/sampling_logp_difference/max': 1.426081657409668, 'sampling/importance_sampling_ratio/min': 0.24024845659732819, 'sampling/importance_sampling_ratio/mean': 1.0123540163040161, 'sampling/importance_sampling_ratio/max': 1.4719880819320679, 'entropy': 0.5947265625, 'clip_ratio/low_mean': 0.0001220703125, 'clip_ratio/low_min': 0.0001220703125, 'clip_ratio/high_mean': 0.

Adding requests: 100%|██████████| 4/4 [00:00<00:00, 1734.26it/s]
Processed prompts: 100%|██████████| 32/32 [00:00<00:00, 41.04it/s, est. speed input: 2555.27 toks/s, output: 10508.33 toks/s]


{'loss': 0.0016, 'grad_norm': 2.1875, 'learning_rate': 3e-05, 'num_tokens': 105508.0, 'completions/mean_length': 256.0, 'completions/min_length': 256.0, 'completions/max_length': 256.0, 'completions/clipped_ratio': 1.0, 'completions/mean_terminated_length': 0.0, 'completions/min_terminated_length': 0.0, 'completions/max_terminated_length': 0.0, 'rewards/reward_from_env/mean': 223.515625, 'rewards/reward_from_env/std': 35.32181167602539, 'reward': 223.515625, 'reward_std': 27.529592514038086, 'frac_reward_zero_std': 0.0, 'sampling/sampling_logp_difference/mean': 0.03328322619199753, 'sampling/sampling_logp_difference/max': 1.2655229568481445, 'sampling/importance_sampling_ratio/min': 0.28209173679351807, 'sampling/importance_sampling_ratio/mean': 1.0072062015533447, 'sampling/importance_sampling_ratio/max': 1.7108453512191772, 'entropy': 0.3662109375, 'clip_ratio/low_mean': 0.000244140625, 'clip_ratio/low_min': 0.000244140625, 'clip_ratio/high_mean': 0.0, 'clip_ratio/high_max': 0.0, 'cl

Adding requests: 100%|██████████| 4/4 [00:00<00:00, 1398.57it/s]
Processed prompts: 100%|██████████| 32/32 [00:00<00:00, 40.81it/s, est. speed input: 3612.23 toks/s, output: 10448.80 toks/s]


{'loss': 0.0034, 'grad_norm': 2.71875, 'learning_rate': 2.8000000000000003e-05, 'num_tokens': 116532.0, 'completions/mean_length': 256.0, 'completions/min_length': 256.0, 'completions/max_length': 256.0, 'completions/clipped_ratio': 1.0, 'completions/mean_terminated_length': 0.0, 'completions/min_terminated_length': 0.0, 'completions/max_terminated_length': 0.0, 'rewards/reward_from_env/mean': 236.68438720703125, 'rewards/reward_from_env/std': 45.642601013183594, 'reward': 236.68438720703125, 'reward_std': 37.60443115234375, 'frac_reward_zero_std': 0.0, 'sampling/sampling_logp_difference/mean': 0.03014453500509262, 'sampling/sampling_logp_difference/max': 1.6696949005126953, 'sampling/importance_sampling_ratio/min': 0.18830451369285583, 'sampling/importance_sampling_ratio/mean': 1.0063087940216064, 'sampling/importance_sampling_ratio/max': 1.4304356575012207, 'entropy': 0.33837890625, 'clip_ratio/low_mean': 0.0001220703125, 'clip_ratio/low_min': 0.0001220703125, 'clip_ratio/high_mean':

Adding requests: 100%|██████████| 4/4 [00:00<00:00, 1758.62it/s]
Processed prompts: 100%|██████████| 32/32 [00:00<00:00, 43.04it/s, est. speed input: 2130.55 toks/s, output: 11018.55 toks/s]


{'loss': 0.0016, 'grad_norm': 1.8828125, 'learning_rate': 2.6000000000000002e-05, 'num_tokens': 126308.0, 'completions/mean_length': 256.0, 'completions/min_length': 256.0, 'completions/max_length': 256.0, 'completions/clipped_ratio': 1.0, 'completions/mean_terminated_length': 0.0, 'completions/min_terminated_length': 0.0, 'completions/max_terminated_length': 0.0, 'rewards/reward_from_env/mean': 253.80313110351562, 'rewards/reward_from_env/std': 25.515148162841797, 'reward': 253.80313110351562, 'reward_std': 23.875682830810547, 'frac_reward_zero_std': 0.0, 'sampling/sampling_logp_difference/mean': 0.013019567355513573, 'sampling/sampling_logp_difference/max': 1.172348976135254, 'sampling/importance_sampling_ratio/min': 0.30963876843452454, 'sampling/importance_sampling_ratio/mean': 1.0030759572982788, 'sampling/importance_sampling_ratio/max': 1.3728663921356201, 'entropy': 0.1226806640625, 'clip_ratio/low_mean': 0.0, 'clip_ratio/low_min': 0.0, 'clip_ratio/high_mean': 0.0001220703125, '

Adding requests: 100%|██████████| 4/4 [00:00<00:00, 1322.81it/s]
Processed prompts: 100%|██████████| 32/32 [00:00<00:00, 41.85it/s, est. speed input: 3473.92 toks/s, output: 10714.64 toks/s]


{'loss': 0.0018, 'grad_norm': 1.828125, 'learning_rate': 2.4e-05, 'num_tokens': 137156.0, 'completions/mean_length': 256.0, 'completions/min_length': 256.0, 'completions/max_length': 256.0, 'completions/clipped_ratio': 1.0, 'completions/mean_terminated_length': 0.0, 'completions/min_terminated_length': 0.0, 'completions/max_terminated_length': 0.0, 'rewards/reward_from_env/mean': 268.48126220703125, 'rewards/reward_from_env/std': 23.270313262939453, 'reward': 268.48126220703125, 'reward_std': 19.19301986694336, 'frac_reward_zero_std': 0.0, 'sampling/sampling_logp_difference/mean': 0.019933423027396202, 'sampling/sampling_logp_difference/max': 1.4129743576049805, 'sampling/importance_sampling_ratio/min': 0.24341818690299988, 'sampling/importance_sampling_ratio/mean': 1.0036290884017944, 'sampling/importance_sampling_ratio/max': 1.4830222129821777, 'entropy': 0.1968994140625, 'clip_ratio/low_mean': 0.0001220703125, 'clip_ratio/low_min': 0.0001220703125, 'clip_ratio/high_mean': 0.00048828

Adding requests: 100%|██████████| 4/4 [00:00<00:00, 1777.81it/s]
Processed prompts: 100%|██████████| 32/32 [00:00<00:00, 42.97it/s, est. speed input: 1385.95 toks/s, output: 11001.54 toks/s]


{'loss': 0.0011, 'grad_norm': 1.53125, 'learning_rate': 2.2000000000000003e-05, 'num_tokens': 146380.0, 'completions/mean_length': 256.0, 'completions/min_length': 256.0, 'completions/max_length': 256.0, 'completions/clipped_ratio': 1.0, 'completions/mean_terminated_length': 0.0, 'completions/min_terminated_length': 0.0, 'completions/max_terminated_length': 0.0, 'rewards/reward_from_env/mean': 284.0187683105469, 'rewards/reward_from_env/std': 19.812435150146484, 'reward': 284.0187683105469, 'reward_std': 12.771563529968262, 'frac_reward_zero_std': 0.0, 'sampling/sampling_logp_difference/mean': 0.01629074662923813, 'sampling/sampling_logp_difference/max': 1.3111591339111328, 'sampling/importance_sampling_ratio/min': 0.2695074677467346, 'sampling/importance_sampling_ratio/mean': 1.0034124851226807, 'sampling/importance_sampling_ratio/max': 1.6229891777038574, 'entropy': 0.176513671875, 'clip_ratio/low_mean': 0.0, 'clip_ratio/low_min': 0.0, 'clip_ratio/high_mean': 0.0, 'clip_ratio/high_ma

Adding requests: 100%|██████████| 4/4 [00:00<00:00, 1626.01it/s]
Processed prompts: 100%|██████████| 32/32 [00:00<00:00, 42.34it/s, est. speed input: 2678.22 toks/s, output: 10839.81 toks/s]


{'loss': 0.0009, 'grad_norm': 1.640625, 'learning_rate': 2e-05, 'num_tokens': 156596.0, 'completions/mean_length': 256.0, 'completions/min_length': 256.0, 'completions/max_length': 256.0, 'completions/clipped_ratio': 1.0, 'completions/mean_terminated_length': 0.0, 'completions/min_terminated_length': 0.0, 'completions/max_terminated_length': 0.0, 'rewards/reward_from_env/mean': 279.65625, 'rewards/reward_from_env/std': 20.31792449951172, 'reward': 279.65625, 'reward_std': 15.107717514038086, 'frac_reward_zero_std': 0.0, 'sampling/sampling_logp_difference/mean': 0.016554228961467743, 'sampling/sampling_logp_difference/max': 1.2050352096557617, 'sampling/importance_sampling_ratio/min': 0.2996814548969269, 'sampling/importance_sampling_ratio/mean': 1.0037851333618164, 'sampling/importance_sampling_ratio/max': 1.4312273263931274, 'entropy': 0.18017578125, 'clip_ratio/low_mean': 0.0, 'clip_ratio/low_min': 0.0, 'clip_ratio/high_mean': 0.0, 'clip_ratio/high_max': 0.0, 'clip_ratio/region_mean'

Adding requests: 100%|██████████| 4/4 [00:00<00:00, 1818.87it/s]
Processed prompts: 100%|██████████| 32/32 [00:00<00:00, 32.23it/s, est. speed input: 1498.67 toks/s, output: 8250.66 toks/s]


{'loss': 0.0007, 'grad_norm': 2.21875, 'learning_rate': 1.8e-05, 'num_tokens': 166276.0, 'completions/mean_length': 256.0, 'completions/min_length': 256.0, 'completions/max_length': 256.0, 'completions/clipped_ratio': 1.0, 'completions/mean_terminated_length': 0.0, 'completions/min_terminated_length': 0.0, 'completions/max_terminated_length': 0.0, 'rewards/reward_from_env/mean': 281.79376220703125, 'rewards/reward_from_env/std': 42.84409713745117, 'reward': 281.79376220703125, 'reward_std': 28.880355834960938, 'frac_reward_zero_std': 0.0, 'sampling/sampling_logp_difference/mean': 0.015027951449155807, 'sampling/sampling_logp_difference/max': 1.3328685760498047, 'sampling/importance_sampling_ratio/min': 0.26371967792510986, 'sampling/importance_sampling_ratio/mean': 1.0014393329620361, 'sampling/importance_sampling_ratio/max': 1.4249001741409302, 'entropy': 0.1544189453125, 'clip_ratio/low_mean': 0.0001220703125, 'clip_ratio/low_min': 0.0001220703125, 'clip_ratio/high_mean': 0.0, 'clip_

Adding requests: 100%|██████████| 4/4 [00:00<00:00, 1451.57it/s]
Processed prompts: 100%|██████████| 32/32 [00:00<00:00, 41.89it/s, est. speed input: 3665.56 toks/s, output: 10724.30 toks/s]


{'loss': -0.0001, 'grad_norm': 1.4140625, 'learning_rate': 1.6000000000000003e-05, 'num_tokens': 177268.0, 'completions/mean_length': 256.0, 'completions/min_length': 256.0, 'completions/max_length': 256.0, 'completions/clipped_ratio': 1.0, 'completions/mean_terminated_length': 0.0, 'completions/min_terminated_length': 0.0, 'completions/max_terminated_length': 0.0, 'rewards/reward_from_env/mean': 293.6875, 'rewards/reward_from_env/std': 6.768321514129639, 'reward': 293.6875, 'reward_std': 6.160709381103516, 'frac_reward_zero_std': 0.0, 'sampling/sampling_logp_difference/mean': 0.00917959026992321, 'sampling/sampling_logp_difference/max': 1.5185699462890625, 'sampling/importance_sampling_ratio/min': 0.2190248966217041, 'sampling/importance_sampling_ratio/mean': 1.0015687942504883, 'sampling/importance_sampling_ratio/max': 1.3430147171020508, 'entropy': 0.09228515625, 'clip_ratio/low_mean': 0.0, 'clip_ratio/low_min': 0.0, 'clip_ratio/high_mean': 0.0, 'clip_ratio/high_max': 0.0, 'clip_rat

Adding requests: 100%|██████████| 4/4 [00:00<00:00, 1998.00it/s]
Processed prompts: 100%|██████████| 32/32 [00:00<00:00, 43.26it/s, est. speed input: 1492.64 toks/s, output: 11075.77 toks/s]


{'loss': 0.0006, 'grad_norm': 1.234375, 'learning_rate': 1.4000000000000001e-05, 'num_tokens': 186564.0, 'completions/mean_length': 256.0, 'completions/min_length': 256.0, 'completions/max_length': 256.0, 'completions/clipped_ratio': 1.0, 'completions/mean_terminated_length': 0.0, 'completions/min_terminated_length': 0.0, 'completions/max_terminated_length': 0.0, 'rewards/reward_from_env/mean': 296.0562744140625, 'rewards/reward_from_env/std': 3.7737557888031006, 'reward': 296.0562744140625, 'reward_std': 2.830070972442627, 'frac_reward_zero_std': 0.0, 'sampling/sampling_logp_difference/mean': 0.007263758685439825, 'sampling/sampling_logp_difference/max': 1.2498435974121094, 'sampling/importance_sampling_ratio/min': 0.2865495979785919, 'sampling/importance_sampling_ratio/mean': 1.0014268159866333, 'sampling/importance_sampling_ratio/max': 1.6479860544204712, 'entropy': 0.0804443359375, 'clip_ratio/low_mean': 0.0, 'clip_ratio/low_min': 0.0, 'clip_ratio/high_mean': 0.0, 'clip_ratio/high_

Adding requests: 100%|██████████| 4/4 [00:00<00:00, 1944.28it/s]
Processed prompts: 100%|██████████| 32/32 [00:00<00:00, 43.38it/s, est. speed input: 1865.48 toks/s, output: 11106.05 toks/s]


{'loss': 0.0, 'grad_norm': 1.4453125, 'learning_rate': 1.2e-05, 'num_tokens': 196132.0, 'completions/mean_length': 256.0, 'completions/min_length': 256.0, 'completions/max_length': 256.0, 'completions/clipped_ratio': 1.0, 'completions/mean_terminated_length': 0.0, 'completions/min_terminated_length': 0.0, 'completions/max_terminated_length': 0.0, 'rewards/reward_from_env/mean': 296.44061279296875, 'rewards/reward_from_env/std': 8.767380714416504, 'reward': 296.44061279296875, 'reward_std': 6.072707176208496, 'frac_reward_zero_std': 0.0, 'sampling/sampling_logp_difference/mean': 0.006782043259590864, 'sampling/sampling_logp_difference/max': 1.0349149703979492, 'sampling/importance_sampling_ratio/min': 0.3552566170692444, 'sampling/importance_sampling_ratio/mean': 1.001916527748108, 'sampling/importance_sampling_ratio/max': 1.400224208831787, 'entropy': 0.07220458984375, 'clip_ratio/low_mean': 0.0, 'clip_ratio/low_min': 0.0, 'clip_ratio/high_mean': 0.0, 'clip_ratio/high_max': 0.0, 'clip_

Adding requests: 100%|██████████| 4/4 [00:00<00:00, 1492.37it/s]
Processed prompts: 100%|██████████| 32/32 [00:00<00:00, 42.57it/s, est. speed input: 3342.40 toks/s, output: 10899.73 toks/s]


{'loss': 0.0001, 'grad_norm': 1.015625, 'learning_rate': 1e-05, 'num_tokens': 206836.0, 'completions/mean_length': 256.0, 'completions/min_length': 256.0, 'completions/max_length': 256.0, 'completions/clipped_ratio': 1.0, 'completions/mean_terminated_length': 0.0, 'completions/min_terminated_length': 0.0, 'completions/max_terminated_length': 0.0, 'rewards/reward_from_env/mean': 298.87811279296875, 'rewards/reward_from_env/std': 2.398622512817383, 'reward': 298.87811279296875, 'reward_std': 2.237010955810547, 'frac_reward_zero_std': 0.0, 'sampling/sampling_logp_difference/mean': 0.005232331342995167, 'sampling/sampling_logp_difference/max': 1.1641674041748047, 'sampling/importance_sampling_ratio/min': 0.3121824562549591, 'sampling/importance_sampling_ratio/mean': 1.0014495849609375, 'sampling/importance_sampling_ratio/max': 1.3543024063110352, 'entropy': 0.05694580078125, 'clip_ratio/low_mean': 0.0001220703125, 'clip_ratio/low_min': 0.0001220703125, 'clip_ratio/high_mean': 0.0, 'clip_ra

Adding requests: 100%|██████████| 4/4 [00:00<00:00, 1879.80it/s]
Processed prompts: 100%|██████████| 32/32 [00:00<00:00, 42.73it/s, est. speed input: 1912.31 toks/s, output: 10939.60 toks/s]


{'loss': -0.0001, 'grad_norm': 0.875, 'learning_rate': 8.000000000000001e-06, 'num_tokens': 216460.0, 'completions/mean_length': 256.0, 'completions/min_length': 256.0, 'completions/max_length': 256.0, 'completions/clipped_ratio': 1.0, 'completions/mean_terminated_length': 0.0, 'completions/min_terminated_length': 0.0, 'completions/max_terminated_length': 0.0, 'rewards/reward_from_env/mean': 299.921875, 'rewards/reward_from_env/std': 1.9435142278671265, 'reward': 299.921875, 'reward_std': 1.6189486980438232, 'frac_reward_zero_std': 0.0, 'sampling/sampling_logp_difference/mean': 0.0038750190287828445, 'sampling/sampling_logp_difference/max': 1.2732229232788086, 'sampling/importance_sampling_ratio/min': 0.27992796897888184, 'sampling/importance_sampling_ratio/mean': 1.000841736793518, 'sampling/importance_sampling_ratio/max': 1.2461822032928467, 'entropy': 0.0382080078125, 'clip_ratio/low_mean': 0.0, 'clip_ratio/low_min': 0.0, 'clip_ratio/high_mean': 0.0, 'clip_ratio/high_max': 0.0, 'cli

Adding requests: 100%|██████████| 4/4 [00:00<00:00, 1735.33it/s]
Processed prompts: 100%|██████████| 32/32 [00:00<00:00, 42.49it/s, est. speed input: 2231.19 toks/s, output: 10879.63 toks/s]


{'loss': -0.0005, 'grad_norm': 1.0, 'learning_rate': 6e-06, 'num_tokens': 226332.0, 'completions/mean_length': 256.0, 'completions/min_length': 256.0, 'completions/max_length': 256.0, 'completions/clipped_ratio': 1.0, 'completions/mean_terminated_length': 0.0, 'completions/min_terminated_length': 0.0, 'completions/max_terminated_length': 0.0, 'rewards/reward_from_env/mean': 299.62811279296875, 'rewards/reward_from_env/std': 2.361961603164673, 'reward': 299.62811279296875, 'reward_std': 1.885979175567627, 'frac_reward_zero_std': 0.0, 'sampling/sampling_logp_difference/mean': 0.004480584524571896, 'sampling/sampling_logp_difference/max': 1.2986259460449219, 'sampling/importance_sampling_ratio/min': 0.27290651202201843, 'sampling/importance_sampling_ratio/mean': 1.0008270740509033, 'sampling/importance_sampling_ratio/max': 1.284912347793579, 'entropy': 0.04254150390625, 'clip_ratio/low_mean': 0.0001220703125, 'clip_ratio/low_min': 0.0001220703125, 'clip_ratio/high_mean': 0.0, 'clip_ratio/

Adding requests: 100%|██████████| 4/4 [00:00<00:00, 1362.34it/s]
Processed prompts: 100%|██████████| 32/32 [00:00<00:00, 42.71it/s, est. speed input: 2178.70 toks/s, output: 10936.15 toks/s]


{'loss': -0.0, 'grad_norm': 1.1328125, 'learning_rate': 4.000000000000001e-06, 'num_tokens': 236156.0, 'completions/mean_length': 256.0, 'completions/min_length': 256.0, 'completions/max_length': 256.0, 'completions/clipped_ratio': 1.0, 'completions/mean_terminated_length': 0.0, 'completions/min_terminated_length': 0.0, 'completions/max_terminated_length': 0.0, 'rewards/reward_from_env/mean': 299.54376220703125, 'rewards/reward_from_env/std': 3.9939420223236084, 'reward': 299.54376220703125, 'reward_std': 2.759938955307007, 'frac_reward_zero_std': 0.0, 'sampling/sampling_logp_difference/mean': 0.004718138370662928, 'sampling/sampling_logp_difference/max': 1.1052160263061523, 'sampling/importance_sampling_ratio/min': 0.33113935589790344, 'sampling/importance_sampling_ratio/mean': 1.000596284866333, 'sampling/importance_sampling_ratio/max': 1.3233096599578857, 'entropy': 0.04345703125, 'clip_ratio/low_mean': 0.0, 'clip_ratio/low_min': 0.0, 'clip_ratio/high_mean': 0.0001220703125, 'clip_r

Adding requests: 100%|██████████| 4/4 [00:00<00:00, 1406.54it/s]
Processed prompts: 100%|██████████| 32/32 [00:00<00:00, 42.61it/s, est. speed input: 3100.11 toks/s, output: 10908.90 toks/s]


{'loss': -0.0003, 'grad_norm': 0.87109375, 'learning_rate': 2.0000000000000003e-06, 'num_tokens': 246676.0, 'completions/mean_length': 256.0, 'completions/min_length': 256.0, 'completions/max_length': 256.0, 'completions/clipped_ratio': 1.0, 'completions/mean_terminated_length': 0.0, 'completions/min_terminated_length': 0.0, 'completions/max_terminated_length': 0.0, 'rewards/reward_from_env/mean': 300.546875, 'rewards/reward_from_env/std': 1.7720036506652832, 'reward': 300.546875, 'reward_std': 1.5536911487579346, 'frac_reward_zero_std': 0.0, 'sampling/sampling_logp_difference/mean': 0.003943466581404209, 'sampling/sampling_logp_difference/max': 0.8658556938171387, 'sampling/importance_sampling_ratio/min': 0.4206914007663727, 'sampling/importance_sampling_ratio/mean': 1.0006545782089233, 'sampling/importance_sampling_ratio/max': 1.237608790397644, 'entropy': 0.037139892578125, 'clip_ratio/low_mean': 0.0, 'clip_ratio/low_min': 0.0, 'clip_ratio/high_mean': 0.0, 'clip_ratio/high_max': 0.0