@felipemello1
Contributor

@felipemello1 felipemello1 commented Nov 20, 2025

Results

This PR adapts the GRPO pipeline to support multi-turn RL environments using Blackjack as a testbed.

Rollout example (green characters are marked for training):
image

Win rate (the best theoretical win rate in blackjack is ~30-40%, so this is expected):
image

Key Changes:

  • TokenAccumulator: Utility for multi-turn token management. This is where I spent a lot of time to ensure correctness.

NOTE: Set thinking=False to disable thinking in Qwen.

acc = TokenAccumulator(tokenizer, messages, max_len=2048, eos_id=2, thinking=False)

# Add messages
acc.add_user("What is 2+2?")
prompt = acc.format_prompt()
response = vllm_generate(prompt)
acc.add_assistant(response.text, response.token_ids, response.logprobs)

# Show what will be trained on
acc.show_messages()

# Get episode data as tensors
episode = acc.get_data()
# episode.token_ids: torch.Tensor (long)
# episode.response_mask: torch.Tensor (bool, True = trainable)
# episode.logprobs: torch.Tensor (float)
  • BlackjackEnv: Lightweight wrapper over OpenEnv/OpenSpiel. Handles action parsing, observation formatting, and reward computation. Requires OpenEnv patches to expose game metadata (player total, dealer card).

  • Token masking: The training mask is now computed from TokenAccumulator.response_mask. Because episodes are multi-turn, the trainable tokens are no longer one contiguous span, so we cannot just slice anymore (e.g. all_tokens[:response_len]); we have to index with the mask instead: all_tokens[loss_mask].

  • Serial rollouts - Games execute serially (do_single_rollout) instead of through parallel batched generation. Each game runs until it is done or truncated. TODO: enable better parallelism.

  • Filtering - Drops groups with zero reward variance (can't compute advantages) and episodes with truncated responses (incomplete game trajectories).

  • Loss debugging - Added extensive logging to simple_grpo_loss for debugging numerical issues. This should be cleaned up in a future iteration.
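
To illustrate the token-masking bullet above, here is a minimal sketch of why a boolean mask is needed in the multi-turn case. It is not the PR's code: plain Python lists stand in for the torch tensors returned by acc.get_data(), the token values are made up, select_trainable is a hypothetical helper, and the tail slice assumes a single-turn "prompt then response" layout.

```python
# Hypothetical two-turn episode; token values are made up for illustration.
# Plain lists stand in for the torch tensors returned by acc.get_data().
token_ids = [11, 12, 13, 21, 22, 14, 15, 23, 24, 25]
# True = assistant token (trainable), False = prompt/environment token.
response_mask = [False, False, False, True, True, False, False, True, True, True]


def select_trainable(values, mask):
    """Keep only entries whose mask is True (list analogue of values[mask])."""
    if len(values) != len(mask):
        raise ValueError("values and mask must have the same length")
    return [v for v, keep in zip(values, mask) if keep]


# Mask-based selection picks up the assistant tokens from both turns.
trainable = select_trainable(token_ids, response_mask)

# A single-turn style slice assumes the response tokens are contiguous at the
# end, so here it grabs second-turn user tokens and misses the first reply.
response_len = sum(response_mask)
naive = token_ids[-response_len:]

print(trainable)
print(naive)
```

With torch tensors the same selection would be written as boolean indexing, token_ids[response_mask], which is what the bullet describes.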
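
The filtering bullet above can be sketched roughly as follows. This is an illustration under stated assumptions, not the PR's implementation: the Episode fields, the filter_groups name, the eps threshold, and the order of the two filters (truncated episodes first, then zero-variance groups) are all hypothetical. The motivation for the variance check is that group-normalized advantages divide by the reward standard deviation, which is zero when every episode in a group received the same reward.

```python
from dataclasses import dataclass
from statistics import pvariance


@dataclass
class Episode:
    """Hypothetical episode record; field names are assumptions, not the PR's."""
    reward: float
    is_truncated: bool


def filter_groups(groups, eps=1e-8):
    """Drop truncated episodes, then drop groups with ~zero reward variance."""
    kept = []
    for group in groups:
        complete = [ep for ep in group if not ep.is_truncated]
        if len(complete) < 2:
            continue  # nothing left to compare rewards against
        if pvariance([ep.reward for ep in complete]) <= eps:
            continue  # identical rewards: advantages carry no signal
        kept.append(complete)
    return kept


groups = [
    [Episode(1.0, False), Episode(-1.0, False)],  # mixed outcomes: kept
    [Episode(1.0, False), Episode(1.0, False)],   # zero variance: dropped
    [Episode(1.0, False), Episode(-1.0, True)],   # one complete episode: dropped
]
kept = filter_groups(groups)
print(len(kept))
```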

OpenEnv Patches

  • Enable metadata passthrough in HTTP server
  • Extract blackjack-specific state (player total, dealer card) from OpenSpiel

The Python script that applies the patch is untested, but you should be able to copy/paste the changes with ease.

Instructions

See apps/blackjack/openenv_patch/README.md for details.

@meta-cla meta-cla bot added the CLA Signed This label is managed by the Meta Open Source bot. label Nov 20, 2025