[Prototype] Multi-turn GRPO for blackjack with OpenEnv #603

felipemello1 · 2025-11-20T22:35:51Z

Results

This PR adapts the GRPO pipeline to support multi-turn RL environments using Blackjack as a testbed.

Rollout example (green characters are marked for training):

Win rate (best theoretical win rate in blackjack is ~30-40%, so this is expected)

Key Changes:

TokenAccumulator: Utility for multi-turn token management. This is where I spent a lot of time to ensure correctness.

NOTE: set thinking=False to disable thinking in Qwen.

acc = TokenAccumulator(tokenizer, messages, max_len=2048, eos_id=2, thinking=False)

# Add messages
acc.add_user("What is 2+2?")
prompt = acc.format_prompt()
response = vllm_generate(prompt)
acc.add_assistant(response.text, response.token_ids, response.logprobs)

# Show what will be trained on
acc.show_messages()

# Get episode data as tensors
episode = acc.get_data()
# episode.token_ids: torch.Tensor (long)
# episode.response_mask: torch.Tensor (bool, True = trainable)
# episode.logprobs: torch.Tensor (float)
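To make the bookkeeping concrete, here is a minimal sketch of the idea behind a multi-turn accumulator. It is not the PR's implementation: `ToyAccumulator`, `add_context`, and `add_response` are hypothetical names, token ids are made up, and plain Python lists stand in for the tensors. It only illustrates that context tokens (prompts, environment observations) stay in the sequence but are masked out, while generated tokens are marked trainable.

```python
from dataclasses import dataclass, field


@dataclass
class ToyAccumulator:
    """Hypothetical stand-in for TokenAccumulator; not the PR's code."""

    token_ids: list[int] = field(default_factory=list)
    response_mask: list[bool] = field(default_factory=list)
    logprobs: list[float] = field(default_factory=list)

    def add_context(self, ids: list[int]) -> None:
        # Prompt / environment tokens: kept in the sequence, excluded from the loss.
        self.token_ids += ids
        self.response_mask += [False] * len(ids)
        self.logprobs += [0.0] * len(ids)

    def add_response(self, ids: list[int], logprobs: list[float]) -> None:
        # Model-generated tokens: these are the ones marked trainable.
        if len(ids) != len(logprobs):
            raise ValueError("each generated token needs exactly one logprob")
        self.token_ids += ids
        self.response_mask += [True] * len(ids)
        self.logprobs += logprobs


acc = ToyAccumulator()
acc.add_context([101, 102, 103])        # turn 1 observation (made-up ids)
acc.add_response([7, 8], [-0.1, -0.2])  # turn 1 action
acc.add_context([104, 105])             # turn 2 observation
acc.add_response([9], [-0.3])           # turn 2 action

# Trainable tokens are interleaved with context, so select them with the mask
# rather than with a single slice.
trainable = [t for t, m in zip(acc.token_ids, acc.response_mask) if m]
```

Keeping the three lists the same length at every step is the invariant that matters here; a tensor version would index with the boolean mask in the same way.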

BlackjackEnv: Lightweight wrapper over OpenEnv/OpenSpiel. Handles action parsing, observation formatting, and reward computation. Requires OpenEnv patches to expose game metadata (player total, dealer card).
Token masking: The training mask is now computed from TokenAccumulator.response_mask. Because episodes are multi-turn, trainable tokens are no longer one contiguous span, so slicing (e.g. all_tokens[:response_len]) no longer works; we index with the mask instead, e.g. all_tokens[loss_mask].
Serial rollouts: Games execute serially (do_single_rollout) instead of using parallel batched generation. Each game runs until done or truncated. TODO: enable better parallelism.
Filtering: Drops groups with zero reward variance (advantages cannot be computed) and episodes with truncated responses (incomplete game trajectories).
Loss debugging: Added extensive logging to simple_grpo_loss for debugging numerical issues. Should be cleaned up in a future iteration.
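The filtering step can be sketched roughly as follows. This is an illustration under assumptions, not the code in this PR: the function names, the `"reward"` and `"truncated"` keys, the `eps` value, and the choice to drop truncated episodes before checking variance are all hypothetical. It shows why a group with identical rewards is useless for GRPO: every group-normalized advantage would be zero.

```python
from statistics import mean, pstdev


def filter_groups(groups):
    """Drop truncated episodes, then drop groups with no reward variance.

    `groups` is a list of groups; each group is a list of dicts with the
    hypothetical keys "reward" (float) and "truncated" (bool).
    """
    kept = []
    for group in groups:
        complete = [ep for ep in group if not ep["truncated"]]
        rewards = [ep["reward"] for ep in complete]
        # With fewer than two episodes, or identical rewards, every
        # advantage would be zero, so the group carries no signal.
        if len(rewards) < 2 or pstdev(rewards) == 0.0:
            continue
        kept.append(complete)
    return kept


def group_advantages(group, eps=1e-4):
    """Normalize rewards within one group: (r - mean) / (std + eps)."""
    rewards = [ep["reward"] for ep in group]
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


groups = [
    # Mixed outcomes: kept.
    [{"reward": 1.0, "truncated": False}, {"reward": -1.0, "truncated": False}],
    # Identical rewards: zero variance, dropped.
    [{"reward": 1.0, "truncated": False}, {"reward": 1.0, "truncated": False}],
    # One truncated episode leaves a single sample: dropped.
    [{"reward": 1.0, "truncated": True}, {"reward": -1.0, "truncated": False}],
]
kept = filter_groups(groups)
adv = group_advantages(kept[0])
```

Whether truncated episodes should be removed before or after the variance check is a design choice; removing them first avoids keeping a group whose variance came only from an incomplete trajectory.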

OpenEnv Patches

Enable metadata passthrough in HTTP server
Extract blackjack-specific state (player total, dealer card) from OpenSpiel

The Python script to apply the patch is untested, but you should be able to copy/paste the changes with ease.

Instructions

See apps/blackjack/openenv_patch/README.md for details.


Felipe Mello added 13 commits November 13, 2025 08:13

first

290f906

add what the loop should look like

8c87495

bunch of docs

12737dc

blackjack start

5a1a6b5

Merge branch 'main' of https://github.com/meta-pytorch/forge into bla…

9b8c9c2

…ckjack

add truncation doc

29a03ef

Merge branch 'main' of https://github.com/meta-pytorch/forge into bla…

174d136

…ckjack

misc

9bf1dbe

misc

a10379f

misc

1ef307d

cleanup and instructions

2860ec9

delete debug files

c148269

nit

46ea3f5

meta-cla bot added the CLA Signed label Nov 20, 2025

felipemello1 mentioned this pull request Nov 20, 2025

[WIP][RFC] Multi-turn toolcall #567

Open
