# Reinforcement Learning with OpenAI gpt-oss-20b: Teaching an LLM to Play 2048

In this tutorial, we'll teach OpenAI's open-source model **gpt-oss-20b** to generate winning strategies for the classic 2048 puzzle game using **reinforcement learning (RL)**. By the end, you'll understand how to:

- Connect LLMs to game environments using **OpenEnv**
- Design reward functions that guide model behavior
- Train models with **GRPO** (Group Relative Policy Optimization)
- Prevent "reward hacking" with code sandboxing

**Requirements:** This notebook runs on a free Tesla T4 Google Colab instance.

## What is 2048?

**2048** is a fun single-player sliding puzzle game created by Gabriele Cirulli in 2014. The game is played on a 4×4 grid where numbered tiles slide in four directions (up, down, left, right). When two tiles with the same number collide, they merge into one tile with their sum. The goal is to create a tile with the value **2048**, though skilled players can continue beyond that!

The game requires strategic thinking: random moves quickly lead to a gridlock, while optimal play involves keeping high-value tiles in corners and building systematically.

<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/f/f9/2048_win.png/500px-2048_win.png" height=300 />

## Our Goal

We'll use reinforcement learning to train **OpenAI gpt-oss-20b** to generate Python functions that implement winning 2048 strategies. Rather than playing move-by-move, the model will learn to write *code* that plays the game, a form of "code generation as policy."

## Installation

We need two key libraries for this tutorial:

1. **[OpenEnv](https://github.com/meta-pytorch/OpenEnv)** - A unified interface to reinforcement learning environments. Traditional RL setups require installing and configuring each environment separately (Gym, OpenSpiel, Atari, etc.). OpenEnv provides a consistent API across all of them. Best of all, OpenEnv environments are available on Hugging Face Spaces, so we can connect to them remotely without any local installation.

2. **[Unsloth](https://github.com/unslothai/unsloth)** - An optimized training library that reduces VRAM usage by ~70% through memory-efficient LoRA and gradient checkpointing. This lets us run RL on a free Colab T4 GPU.

In [None]:
%%capture
import os, importlib.util

!pip install --upgrade -qqq uv
if importlib.util.find_spec("torch") is None or "COLAB_" in "".join(os.environ.keys()):
    try:
        import numpy

        get_numpy = f"numpy=={numpy.__version__}"
    except ImportError:
        get_numpy = "numpy"
    !uv pip install -qqq \
        "torch>=2.8.0" "triton>=3.4.0" {get_numpy} torchvision bitsandbytes "transformers==4.56.2" trackio \
        "unsloth_zoo[base] @ git+https://github.com/unslothai/unsloth-zoo" \
        "unsloth[base] @ git+https://github.com/unslothai/unsloth" \
        git+https://github.com/triton-lang/triton.git@0add68262ab0a2e33b84524346cb27cbb2787356#subdirectory=python/triton_kernels
elif importlib.util.find_spec("unsloth") is None:
    !uv pip install -qqq unsloth trackio

!uv pip install --upgrade --no-deps transformers==4.56.2 tokenizers trl==0.22.2 unsloth unsloth_zoo

Next, we install the OpenEnv client and the connector for **OpenSpiel**, DeepMind's collection of game environments used in RL research. OpenSpiel includes implementations of classic games like Chess, Go, and our target: 2048.

In [None]:
%%capture
!pip install -qqq openenv-core websockets
!pip install -qqq git+https://huggingface.co/spaces/openenv/openspiel_env

## Loading OpenAI gpt-oss-20b

We load the model with several memory optimizations:

| Parameter | Value | Description |
|-----------|-------|-------------|
| `max_seq_length` | 768 | Maximum context length. Increase for longer outputs (uses more VRAM). |
| `load_in_4bit` | True | Quantizes weights to 4-bit, dramatically reducing memory usage. |
| `lora_rank` | 4 | LoRA adapter rank. Higher = more expressive but slower/more memory. |
| `offload_embedding` | True | Moves embeddings to CPU RAM, saving ~1GB VRAM. |

In [None]:
import os
from unsloth import FastLanguageModel
import torch

max_seq_length = 768  # Can increase for longer RL output
lora_rank = 4  # Larger rank = smarter, but slower
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="openai/gpt-oss-20b",  # Official OpenAI weights
    load_in_4bit=True,
    max_seq_length=max_seq_length,
    offload_embedding=True,  # Offload embeddings to save more VRAM
)

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
==((====))==  Unsloth 2026.1.3: Fast Gpt_Oss patching. Transformers: 4.56.2.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.9.0+cu126. CUDA: 7.5. CUDA Toolkit: 12.6. Triton: 3.5.0
\        /    Bfloat16 = FALSE. FA [Xformers = None. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
Unsloth: Using float16 precision for gpt_oss won't work! Using float32.


Unsloth: Offloading embeddings to RAM to save 1.08 GB.



### Applying LoRA for Efficient Training

[LoRA (Low-Rank Adaptation)](https://hf.co/papers/2106.09685) lets us fine-tune large models by adding small trainable adapters (1-5% of original parameters) instead of updating all weights. This reduces memory usage by ~60% while maintaining accuracy.
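To make the size of the adapters concrete, here is a small back-of-the-envelope sketch. It assumes the standard LoRA formulation, where a frozen `d_out × d_in` weight matrix gets two trainable factors of shape `d_out × r` and `r × d_in`. The helper name `lora_param_count` and the layer dimensions are hypothetical placeholders chosen for illustration; they are not read from the model config.

```python
def lora_param_count(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters LoRA adds to one d_out x d_in linear layer."""
    # Factor A has shape (rank, d_in) and factor B has shape (d_out, rank)
    return rank * (d_in + d_out)


# Hypothetical layer size, chosen only for illustration
d_in, d_out, rank = 2880, 2880, 4

full = d_in * d_out
added = lora_param_count(d_in, d_out, rank)
print(f"full layer: {full:,} weights")
print(f"LoRA adds:  {added:,} trainable weights ({added / full:.2%} of the layer)")
```

Because the count grows linearly with the rank, doubling `lora_rank` doubles the adapter size, which is why a low rank keeps the trainable footprint small.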

We target the attention and feedforward layers, which is where most of the model's "reasoning" happens:

In [None]:
model = FastLanguageModel.get_peft_model(
    model,
    r=lora_rank,  # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    target_modules=[
        "q_proj",
        "k_proj",
        "v_proj",
        "o_proj",
        "gate_proj",
        "up_proj",
        "down_proj",
    ],
    lora_alpha=lora_rank * 2,  # *2 speeds up training
    use_gradient_checkpointing="unsloth",  # Reduces memory usage
    random_state=3407,
)

Unsloth: Making `model.base_model.model.model` require gradients


## Connecting to the 2048 Game Environment

OpenEnv lets us connect to game environments hosted remotely. We'll use a **Hugging Face Space** that runs the OpenSpiel 2048 game server. This architecture has several benefits:

- **No local installation** of game dependencies (OpenSpiel can be tricky to build)
- **Consistent environment** across different machines
- **Scalable** - the same pattern works for more complex environments

The Space exposes a WebSocket API that accepts actions and returns game states.

In [None]:
from openspiel_env import OpenSpielEnv
from openspiel_env.models import OpenSpielAction, OpenSpielObservation

The [openenv/openspiel_env](https://huggingface.co/spaces/openenv/openspiel_env) Space hosts a running OpenSpiel server configured for 2048. It handles game state management, validates moves, and returns observations after each action. You can also run the server locally for faster iteration:

In [None]:
# Connect to OpenSpiel 2048 environment on HuggingFace Spaces
# The game is configured server-side via OPENSPIEL_GAME=2048
OPENSPIEL_URL = "https://openenv-openspiel-env.hf.space"
# For local: OPENSPIEL_URL = "http://localhost:8000"

env = OpenSpielEnv(base_url=OPENSPIEL_URL)

Let's see what the current 2048 game state looks like:

In [None]:
result = env.reset()
current_state = result.observation
current_state

OpenSpielObservation(done=False, reward=None, metadata={}, info_state=[0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0], legal_actions=[0, 1, 2], game_phase='initial', current_player_id=0, opponent_last_action=None)

### Decoding the Game State

OpenSpiel's 2048 `info_state` uses a compact encoding rather than raw tile values. The first 16 elements represent the 4×4 board positions, each holding **log₂ of the tile value** (so 1 means 2¹ = 2, 2 means 2² = 4, and so on). A value of 0 represents an empty cell.

We need to:
1. Extract only the first 16 elements (the board)
2. Reshape into a 4×4 grid
3. Convert from log₂ encoding to actual tile values

In [None]:
import numpy as np

# 2048 game constants
BOARD_SIZE = 4
BOARD_CELLS = BOARD_SIZE * BOARD_SIZE  # 16
WIN_TILE = 2048  # The target tile value to win


def convert_to_board(current_state):
    """
    Convert OpenSpiel 2048 observation to a 4×4 board of tile values.

    OpenSpiel encodes tiles as log₂(value), so we convert back:
    - 0 → 0 (empty)
    - 1 → 2 (2^1)
    - 2 → 4 (2^2)
    - etc.
    """
    # Extract only the first 16 elements (the board state)
    raw_board = current_state.info_state[:BOARD_CELLS]

    # Convert from log₂ encoding to actual tile values
    # 0 stays 0 (empty), otherwise 2^value
    tiles = [int(2**val) if val > 0 else 0 for val in raw_board]

    # Reshape into 4√ó4 grid
    board = [tiles[i * BOARD_SIZE : (i + 1) * BOARD_SIZE] for i in range(BOARD_SIZE)]
    return board, BOARD_SIZE


convert_to_board(current_state)

([[0, 1, 0, 0, 0, 0, 0, 0],
  [0, 0, 0, 0, 0, 0, 0],
  [0, 0, 0, 0, 0, 0, 0],
  [0, 0, 0, 0, 0, 0, 0],
  [0, 0, 0, 0, 0, 0, 0],
  [0, 0, 0, 0, 0, 0, 0],
  [0, 0, 0, 0, 1, 0, 0]],
 7)

We also want to pretty print the game board! This is not entirely necessary, but it helps us visualize the game state and learn from the process.

In [None]:
# @title (Collapsible) 2048 Game Renderer
def render_board(obs, colors: bool = True, border: bool = True, dot_for_zero: bool = True) -> str:
    """
    Pretty-print the board with colors that scale from 0 up to WIN_TILE.
    Uses ANSI 256-color codes (works in most terminals). Set colors=False to disable.
    """
    import math

    b, size = convert_to_board(obs)
    mx = max((max(row) for row in b), default=0)
    cell_w = max(3, len(str(mx)))

    RESET = "\x1b[0m"

    # A smooth-ish gradient from cool → warm
    # (blue/cyan/green → yellow/orange/red). Tweak or expand as you like.
    GRAD = [33, 39, 45, 51, 50, 49, 48, 47, 46, 82, 118, 154, 190, 226, 220, 214, 208, 202, 196]
    ZERO_FG = 239  # dim gray

    def color_code(v: int) -> str:
        if not colors:
            return ""
        if v == 0:
            return f"\x1b[38;5;{ZERO_FG}m"
        # Normalize by exponent relative to target: r in [0,1]
        t = max(2, WIN_TILE)  # safety; avoid log2(1)
        # Guard: if v is not a power of two or is <1, handle gracefully
        try:
            r = max(0.0, min(1.0, math.log2(v) / math.log2(t)))
        except ValueError:
            r = 0.0
        idx = int(round(r * (len(GRAD) - 1)))
        return f"\x1b[38;5;{GRAD[idx]}m"

    def fmt(v: int) -> str:
        s = "." if (v == 0 and dot_for_zero) else str(v)
        s = s.rjust(cell_w)
        return color_code(v) + s + (RESET if colors else "")

    def hline(left: str, mid: str, right: str) -> str:
        return left + mid.join("─" * cell_w for _ in range(size)) + right

    rows = []
    if border:
        rows.append(hline("┌", "┬", "┐"))
    for r in range(size):
        content = "│".join(fmt(v) for v in b[r])
        rows.append(("│" + content + "│") if border else content)
        if border:
            rows.append(
                hline("└" if r == size - 1 else "├", "┴" if r == size - 1 else "┼", "┘" if r == size - 1 else "┤")
            )
    return "\n".join(rows)

In [None]:
print(render_board(current_state))

┌───┬───┬───┬───┬───┬───┬───┐
│  .│  1│  .│  .│  .│  .│  .│  .│
├───┼───┼───┼───┼───┼───┼───┤
│  .│  .│  .│  .│  .│  .│  .│
├───┼───┼───┼───┼───┼───┼───┤
│  .│  .│  .│  .│  .│  .│  .│
├───┼───┼───┼───┼───┼───┼───┤
│  .│  .│  .│  .│  .│  .│  .│
├───┼───

The `legal_actions` field lists the moves available in the current state, drawn from the full action set `[0, 1, 2, 3]`. Let's try action `0`.

In [None]:
action = OpenSpielAction(action_id=0, game_name="2048")
result = env.step(action)
current_state = result.observation
print(render_board(current_state))

┌───┬───┬───┬───┬───┬───┬───┐
│  .│  .│  .│  .│  .│  .│  1│  .│
├───┼───┼───┼───┼───┼───┼───┤
│  .│  .│  .│  .│  .│  .│  .│
├───┼───┼───┼───┼───┼───┼───┤
│  .│  .│  .│  .│  .│  .│  .│
├───┼───┼───┼───┼───┼───┼───┤
│  .│  .│  .│  .│  .│  .│  .│
├───┼───

So it looks like `0` is a move up action! Let's try `1`.

In [None]:
action = OpenSpielAction(action_id=1, game_name="2048")
result = env.step(action)
current_state = result.observation
print(render_board(current_state))

┌───┬───┬───┬───┬───┬───┬───┐
│  .│  .│  .│  .│  .│  .│  .│  .│
├───┼───┼───┼───┼───┼───┼───┤
│  .│  .│  .│  1│  .│  .│  .│
├───┼───┼───┼───┼───┼───┼───┤
│  .│  .│  .│  .│  .│  .│  .│
├───┼───┼───┼───┼───┼───┼───┤
│  .│  .│  .│  .│  .│  .│  .│
├───┼───

`1` is a move right action. And `2`:

In [None]:
action = OpenSpielAction(action_id=2, game_name="2048")
result = env.step(action)
current_state = result.observation
print(render_board(current_state))

┌───┬───┬───┬───┬───┬───┬───┐
│  .│  .│  .│  .│  .│  .│  .│  .│
├───┼───┼───┼───┼───┼───┼───┤
│  .│  .│  .│  .│  .│  .│  .│
├───┼───┼───┼───┼───┼───┼───┤
│  .│  1│  .│  .│  .│  .│  .│
├───┼───┼───┼───┼───┼───┼───┤
│  .│  .│  .│  .│  .│  .│  .│
├───┼───

`2` is a move down. And I guess `3` is just move left!

In [None]:
action = OpenSpielAction(action_id=3, game_name="2048")
result = env.step(action)
current_state = result.observation
print(render_board(current_state))

┌───┬───┬───┬───┬───┬───┬───┐
│  .│  .│  .│  .│  .│  .│  .│  .│
├───┼───┼───┼───┼───┼───┼───┤
│  .│  .│  .│  .│  .│  .│  .│
├───┼───┼───┼───┼───┼───┼───┤
│  .│  .│  .│  .│  .│  .│  1│
├───┼───┼───┼───┼───┼───┼───┤
│  .│  .│  .│  .│  .│  .│  .│
├───┼───

We can also print the game status which indicates if no more moves are possible, and also the possible actions you can take!

In [None]:
print(current_state.done)
print(current_state.legal_actions)

False
[0, 1, 2]
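To keep the action IDs readable in later logs, a small lookup table can help. The mapping below is inferred from the experiments above (0 = up, 1 = right, 2 = down, 3 = left) and is an assumption on our part, not something taken from the OpenSpiel documentation; `ACTION_NAMES` and `describe_actions` are hypothetical helper names.

```python
# Mapping inferred from the moves tried above (an assumption, not an official API)
ACTION_NAMES = {0: "up", 1: "right", 2: "down", 3: "left"}


def describe_actions(legal_actions):
    """Return human-readable names for a list of action IDs."""
    return [ACTION_NAMES.get(a, f"unknown({a})") for a in legal_actions]


print(describe_actions([0, 1, 2]))
```

With the real environment you would call `describe_actions(current_state.legal_actions)`.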


## RL Environment Setup: The Strategy Executor

For reinforcement learning, we need a way to evaluate generated strategies. The key insight is that **our model doesn't play 2048 directly**. Instead, it writes Python code that plays the game. We then execute that code and measure how well it performs.

The `execute_strategy` function:
1. Takes a generated Python function (the "strategy")
2. Runs the 2048 game loop, calling the strategy for each move
3. Returns how many steps the game lasted and whether it reached 2048

**Timeout protection**: LLM-generated code might contain infinite loops or be very slow. We wrap execution with a 2-second timeout to ensure the RL training loop doesn't hang.

In [None]:
from typing import Callable
from unsloth import execute_with_time_limit
import itertools


def has_won(board) -> bool:
    """Check if the board contains a winning tile (2048 or higher)."""
    max_tile = max(itertools.chain.from_iterable(board))
    return max_tile >= WIN_TILE


def _execute_strategy(strategy, current_state: OpenSpielObservation):
    """Execute a strategy function on the 2048 game until completion or invalid move."""
    assert callable(strategy)

    steps = 0
    total_reward = 0

    while not current_state.done:
        board, _ = convert_to_board(current_state)
        action = strategy(board)
        try:
            action = int(action)
        except (TypeError, ValueError):
            # The strategy returned something that cannot be read as a move
            return steps, False
        steps += 1

        # Invalid action: stop and report the win status of the current board
        if action not in current_state.legal_actions:
            return steps, has_won(board)

        result = env.step(OpenSpielAction(action_id=action, game_name="2048"))
        current_state = result.observation
        if result.reward is not None:
            total_reward += result.reward

    # Game ended: check the final board (after the last move) for a win
    board, _ = convert_to_board(current_state)
    return steps, has_won(board)


@execute_with_time_limit(2)
def execute_strategy(strategy: Callable, current_state: OpenSpielObservation):
    return _execute_strategy(strategy, current_state)
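If you are curious how a time limit like this can be enforced, here is a minimal sketch based on Unix signals. It is not Unsloth's implementation of `execute_with_time_limit`; the decorator name `time_limit` is hypothetical, and the approach assumes a Unix system and that the call happens in the main thread.

```python
import functools
import signal
import time


def time_limit(seconds: float):
    """Decorator that raises TimeoutError if the call runs longer than `seconds`.

    Relies on SIGALRM, so it only works on Unix and in the main thread.
    """

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            def on_timeout(signum, frame):
                raise TimeoutError(f"exceeded {seconds} s")

            previous = signal.signal(signal.SIGALRM, on_timeout)
            signal.setitimer(signal.ITIMER_REAL, seconds)
            try:
                return func(*args, **kwargs)
            finally:
                signal.setitimer(signal.ITIMER_REAL, 0)  # cancel the timer
                signal.signal(signal.SIGALRM, previous)  # restore the old handler

        return wrapper

    return decorator


@time_limit(0.2)
def spin_forever():
    while True:
        time.sleep(0.01)


try:
    spin_forever()
except TimeoutError as e:
    print("Timed out:", e)
```

The important property for RL is that a runaway strategy surfaces as an ordinary exception, which the reward function can catch and turn into a penalty.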

Let's make a generic strategy to just hit `3`. We should expect this generic strategy to fail:

In [None]:
def always_move_left(board):
    return 3


# Reset OpenEnv to an initial state!
result = env.reset()
current_state = result.observation
try:
    steps, if_done = execute_strategy(always_move_left, current_state)
except TimeoutError as e:
    print(f"Timed out with error = {str(e)}")

steps, if_done

(1, False)

To give longer strategies room to run during gpt-oss reinforcement learning, we raise the timeout to 5 seconds.

In [None]:
@execute_with_time_limit(5)
def execute_strategy(strategy: Callable, current_state: OpenSpielObservation):
    return _execute_strategy(strategy, current_state)

## Sandboxed Code Execution: Preventing Reward Hacking

A critical challenge in RL with code generation is **reward hacking**‚Äîthe model might learn to "cheat" rather than solve the actual problem. For example, it could:

- Import external libraries to hardcode solutions
- Access global variables to manipulate game state directly  
- Call system functions to bypass the game logic

We use two safeguards:

1. `check_python_modules` validates that the code only uses Python standard library imports (no numpy, pandas, etc.)
2. `create_locked_down_function` executes code in an isolated namespace with no access to global variables

Let's see these in action. First, a valid strategy that only uses standard library:

In [None]:
from unsloth import check_python_modules

sample = """
def strategy(board):
    import math
    from typing import Callable
    return "0"
"""
ok, info = check_python_modules(sample)
print("Only Python imports?", ok)
print(info)

Only Python imports? True
{'stdlib': ['math', 'typing'], 'non_stdlib': [], 'relative_imports': 0}


For the below piece of code, since we import `numpy`, we should not allow the execution:

In [None]:
sample = """
def strategy(board):
    from numpy import matmul
    return "0"
"""
ok, info = check_python_modules(sample)
print("Only Python imports?", ok)
print(info)

Only Python imports? False
{'stdlib': [], 'non_stdlib': ['numpy'], 'relative_imports': 0}


We also disallow global variable access, using Unsloth's `create_locked_down_function`:


In [None]:
from unsloth import create_locked_down_function

function = """
def import_numpy():
    np.matmul
    print("Success")
"""
f = create_locked_down_function(function)
try:
    f()
except Exception as e:
    print(str(e))

name 'np' is not defined


In [None]:
from unsloth import create_locked_down_function

function = """
def add(a, b):
    def adder(a):
        return a + b
    return adder(b) + b
"""
f = create_locked_down_function(function)
try:
    print(f(10, 20))
except Exception as e:
    print(str(e))

60
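The core idea behind this kind of isolation can be sketched in a few lines: compile the generated source with `exec` into a fresh namespace so that names from the notebook are simply not visible. This is a simplified illustration, not how `create_locked_down_function` is implemented; `make_isolated_function` is a hypothetical helper, and because builtins remain available it should not be treated as a security sandbox.

```python
def make_isolated_function(source: str, name: str):
    """Compile `source` in a fresh namespace and return the function called `name`."""
    namespace = {}  # fresh globals: nothing from the notebook leaks in
    exec(source, namespace)
    return namespace[name]


secret = 123  # a notebook-level global the generated code should not see

leaky = make_isolated_function("def peek():\n    return secret", "peek")
try:
    leaky()
except NameError as e:
    print("Blocked:", e)
```

Functions that only use their arguments and local helpers, like the `add` example above, keep working unchanged.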


## Prompt Design: Instructing the Model

The prompt is crucial: it tells the model what we expect it to generate. We want:

1. A **single Python function** named `strategy(board)` that:
   - Takes a 4×4 list of lists as input
   - Returns a move: "0" (up), "1" (right), "2" (down), or "3" (left)
2. **Self-contained code** with all helpers defined inside the function
3. **Native Python only** (no external dependencies)

This structured output format makes parsing straightforward and helps the model understand the task boundaries.

In [None]:
prompt = """
Create a new short 2048 strategy using only native Python code.
You are given a list of list of numbers for the current board state.
Output one action for "0", "1", "2", "3" on what is the optimal next step.
Output your new short function in backticks using the format below:
```python
def strategy(board):
    return "0" # Example
```
All helper functions should be inside def strategy. Only output the short function `strategy`.
""".strip()
print(prompt)

Create a new short 2048 strategy using only native Python code.
You are given a list of list of numbers for the current board state.
Output one action for "0", "1", "2", "3" on what is the optimal next step.
Output your new short function in backticks using the format below:
```python
def strategy(board):
    return "0" # Example
```
All helper functions should be inside def strategy. Only output the short function `strategy`.
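For reference, here is a hand-written function that satisfies every constraint in the prompt: one `strategy(board)` function, helpers nested inside it, no imports, and a string action as the return value. It is a simple baseline written for illustration, not model output, and it assumes the action mapping inferred earlier (0 = up, 1 = right, 2 = down, 3 = left).

```python
def strategy(board):
    def slide_left(row):
        # Push tiles to the left and merge equal neighbors once
        tiles = [v for v in row if v != 0]
        merged = []
        i = 0
        while i < len(tiles):
            if i + 1 < len(tiles) and tiles[i] == tiles[i + 1]:
                merged.append(tiles[i] * 2)
                i += 2
            else:
                merged.append(tiles[i])
                i += 1
        return merged + [0] * (len(row) - len(merged))

    def move(grid, action):
        if action == "3":  # left
            return [slide_left(r) for r in grid]
        if action == "1":  # right
            return [slide_left(r[::-1])[::-1] for r in grid]
        cols = [list(c) for c in zip(*grid)]
        if action == "0":  # up
            cols = [slide_left(c) for c in cols]
        else:  # down
            cols = [slide_left(c[::-1])[::-1] for c in cols]
        return [list(r) for r in zip(*cols)]

    # Fixed priority: keep tiles packed toward the bottom-left corner
    for action in ("2", "3", "1", "0"):
        if move(board, action) != board:
            return action
    return "0"  # no move changes the board


print(strategy([[0, 2, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]]))
```

The fixed priority order mirrors the "keep big tiles in a corner" heuristic mentioned at the start; a trained model is free to discover something better.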


Let's see what the **base model** (before RL training) generates when given this prompt:

In [None]:
text = tokenizer.apply_chat_template(
    [{"role": "user", "content": prompt}],
    tokenize=False,
    add_generation_prompt=True,
    reasoning_effort="low",
)

from transformers import TextStreamer

_ = model.generate(
    **tokenizer(text, return_tensors="pt").to("cuda"),
    temperature=1.0,
    max_new_tokens=512,
    streamer=TextStreamer(tokenizer, skip_prompt=False),
)

<|start|>system<|message|>You are ChatGPT, a large language model trained by OpenAI.
Knowledge cutoff: 2024-06
Current date: 2026-01-20

Reasoning: low

# Valid channels: analysis, commentary, final. Channel must be included for every message.
Calls to these tools must go to the commentary channel: 'functions'.<|end|><|start|>user<|message|>Create a new short 2048 strategy using only native Python code.
You are given a list of list of numbers for the current board state.
Output one action for "0", "1", "2", "3" on what is the optimal next step.
Output your new short function in backticks using the format below:
```python
def strategy(board):
    return "0" # Example
```
All helper functions should be inside def strategy. Only output the short function `strategy`.<|end|><|start|>assistant<|channel|>analysis<|message|>We need to provide a short function. Probably simple heuristic: choose move with lowest collision? Use sum? Just a placeholder.<|end|><|start|>assistant<|channel|>final<|me

## Designing Reward Functions

Reward functions are the heart of RL: they define the "good" behavior we want to encourage in the model. For code generation, we need a multi-part reward that captures different aspects of quality:

| Reward Function | Purpose | Score Range |
|-----------------|---------|-------------|
| `function_works` | Is the generated code syntactically valid and executable? | -2.0 to +1.0 |
| `no_cheating` | Does the code avoid forbidden imports (numpy, etc.)? | -20.0 to +1.0 |
| `strategy_succeeds` | Does the strategy actually play 2048 well? | -3.0 to +20.0 |


Let's break down the reward functions into some practical examples:
- We could heavily penalize cheating (-20.0) to make honest solutions more rewarding.
- We could massively reward success (+20.0) since reaching 2048 is rare initially.
- We could use graduated penalties for partial failures (timeout vs. crash vs. invalid syntax). This gives the model more information to learn from, creating a 'richer' reward signal.

First, we need a helper to extract the Python function from the model's markdown-formatted output:

In [None]:
def extract_function(text):
    if text.count("```") >= 2:
        first = text.find("```") + 3
        second = text.find("```", first)
        fx = text[first:second].strip()
        fx = fx.removeprefix("python\n")
        fx = fx[fx.find("def") :]
        if fx.startswith("def strategy(board):"):
            return fx
    return None


print(extract_function(prompt))

def strategy(board):
    return "0" # Example


Below is our `function_works` reward function. It relies on Python's `exec`, guarded so that the generated code cannot reach local or global variables. We can also call `check_python_modules` first to catch syntax errors before executing anything:

In [None]:
ok, info = check_python_modules("def a")
ok, info

(False,
 {'error': "SyntaxError: expected '(' (<unknown>, line 1)",
  'stdlib': [],
  'non_stdlib': [],
  'relative_imports': 0})

In [None]:
def function_works(completions, **kwargs):
    scores = []
    for completion in completions:
        score = 0
        response = completion[0]["content"]
        function = extract_function(response)
        if function is not None:
            ok, info = check_python_modules(function)
        if function is None or "error" in info:
            score = -2.0
        else:
            try:
                new_strategy = create_locked_down_function(function)
                score = 1.0
            except Exception:
                score = -0.5
        scores.append(score)
    return scores

`no_cheating` checks whether the function cheated by importing NumPy or any other non-standard-library module:

In [None]:
def no_cheating(completions, **kwargs):
    scores = []
    for completion in completions:
        score = 0
        response = completion[0]["content"]
        function = extract_function(response)
        if function is not None:
            ok, info = check_python_modules(function)
            scores.append(1.0 if ok else -20.0)  # Penalize heavily!
        else:
            scores.append(-1.0)  # Failed creating function
    return scores

Next, `strategy_succeeds` checks whether the strategy actually lets the game run to completion. Imagine a strategy that simply returned "0" every time: it would fail once the time limit (5 seconds, as set above) expires.

We also add a global `PRINTER` to print out the strategy and board state.

In [None]:
import numpy as np

global PRINTER
PRINTER = 0


def strategy_succeeds(completions, **kwargs):
    global PRINTER
    scores = []
    for completion in completions:
        printed = False
        score = 0
        response = completion[0]["content"]
        function = extract_function(response)
        if PRINTER % 5 == 0:
            printed = True
            print(function)
        PRINTER += 1
        if function is not None:
            ok, info = check_python_modules(function)
        if function is None or "error" in info:
            scores.append(0)
            continue
        try:
            new_strategy = create_locked_down_function(function)
        except Exception:
            scores.append(0)
            continue
        try:
            # Reset OpenEnv to an initial state!
            result = env.reset()
            current_state = result.observation
            steps, if_done = execute_strategy(new_strategy, current_state)
            print(f"Steps = {steps} If Done = {if_done}")
            if printed is False:
                print(function)
            print(render_board(current_state))
            if if_done:
                scores.append(20.0)  # Success - massively reward!
            else:
                scores.append(2.0)  # Failed but function works!
        except TimeoutError as e:
            print("Timeout")
            scores.append(-1.0)  # Failed with timeout
        except Exception as e:
            print(f"Exception = {str(e)}")
            scores.append(-3.0)  # Failed
    return scores
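To see how the three signals interact, the sketch below adds up the per-function scores for a few outcomes. It assumes the trainer sums the reward functions with equal weight, which is our understanding of TRL's default behavior and should be treated as an assumption. The scenario labels are ours; the individual scores are copied from the functions above.

```python
# Per-function scores copied from the reward functions above
SCENARIOS = {
    "no function extracted": {"function_works": -2.0, "no_cheating": -1.0, "strategy_succeeds": 0.0},
    "valid code, game timed out": {"function_works": 1.0, "no_cheating": 1.0, "strategy_succeeds": -1.0},
    "valid code, game lost": {"function_works": 1.0, "no_cheating": 1.0, "strategy_succeeds": 2.0},
    "valid code, reached 2048": {"function_works": 1.0, "no_cheating": 1.0, "strategy_succeeds": 20.0},
}

# Total reward per scenario under the equal-weight assumption
totals = {name: sum(parts.values()) for name, parts in SCENARIOS.items()}

for name, total in totals.items():
    print(f"{name:28s} total = {total:+.1f}")
```

The wide gap between a lost game and a win is deliberate: it makes the rare successful completions stand out within their group.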

We'll now create the dataset, which consists of repeated copies of our prompt. Remember to set the reasoning effort to low! You can choose high reasoning effort, but that will only work on GPUs with more memory, such as the MI300.

In [None]:
from datasets import Dataset

dataset = Dataset.from_list(
    [{"prompt": [{"role": "user", "content": prompt.strip()}], "answer": 0, "reasoning_effort": "low"}] * 1000
)
maximum_length = len(
    tokenizer.apply_chat_template([{"role": "user", "content": prompt.strip()}], add_generation_prompt=True)
)
print(maximum_length)
dataset[0]

181


{'prompt': [{'content': 'Create a new short 2048 strategy using only native Python code.\nYou are given a list of list of numbers for the current board state.\nOutput one action for "0", "1", "2", "3" on what is the optimal next step.\nOutput your new short function in backticks using the format below:\n```python\ndef strategy(board):\n    return "0" # Example\n```\nAll helper functions should be inside def strategy. Only output the short function `strategy`.',
   'role': 'user'}],
 'answer': 0,
 'reasoning_effort': 'low'}

## Training with GRPO

**Group Relative Policy Optimization (GRPO)** is an RL algorithm designed for language models. Unlike PPO, which requires a separate value network, GRPO computes advantages by comparing generations within the same prompt group, which makes it simpler and more memory-efficient.
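
To make "comparing generations within the same prompt group" concrete, here is a minimal sketch of a group-relative advantage, assuming the common mean and standard-deviation normalization. The function name `group_relative_advantages` and the `eps` value are illustrative assumptions; the exact computation inside TRL may differ in its details.

```python
from statistics import mean, stdev


def group_relative_advantages(rewards, eps=1e-4):
    """Normalize each reward against its own group's mean and sample std.

    `rewards` holds one scalar reward per generation for a single prompt.
    `eps` (an assumed value) avoids division by zero when all rewards match.
    """
    if len(rewards) < 2:
        raise ValueError("need at least two generations per prompt")
    mu = mean(rewards)
    sigma = stdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# One generation wins (20.0), the other only runs (2.0).
print(group_relative_advantages([20.0, 2.0]))

# Both generations score the same: no relative signal.
print(group_relative_advantages([2.0, 2.0]))
```

With only two generations per prompt, the advantages are roughly +0.71 and -0.71 whenever the rewards differ, and exactly zero when they are equal. In this sketch, a group whose rewards are all identical contributes no learning signal.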

Key training parameters:
- `num_generations=2`: Generate 2 candidates per prompt to compute relative rewards
- `max_steps=600`: Total training steps (~5 hours on T4)
- `temperature=1.0`: Controls generation randomness (higher = more exploration)

We use [TrackIO](https://github.com/gradio-app/trackio) for live visualization of training metrics directly in the notebook.

In [None]:
max_prompt_length = maximum_length + 1  # + 1 just in case!
max_completion_length = max_seq_length - max_prompt_length

from trl import GRPOConfig, GRPOTrainer

training_args = GRPOConfig(
    temperature=1.0,
    learning_rate=2e-4,
    weight_decay=0.001,
    warmup_ratio=0.1,
    lr_scheduler_type="linear",
    optim="adamw_8bit",
    logging_steps=1,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=1,  # Increase to 4 for smoother training
    num_generations=2,  # Decrease if out of memory
    max_prompt_length=max_prompt_length,
    max_completion_length=max_completion_length,
    # num_train_epochs = 1, # Set to 1 for a full training run
    max_steps=600,
    save_steps=100,
    report_to="trackio",  # Can use Weights & Biases, TrackIO
    output_dir="outputs",
    # For optional training + evaluation
    # fp16_full_eval = True,
    # per_device_eval_batch_size = 4,
    # eval_accumulation_steps = 1,
    # eval_strategy = "steps",
    # eval_steps = 1,
)

Unsloth: We now expect `per_device_train_batch_size` * `gradient_accumulation_steps` * `world_size` to be a multiple of `num_generations`.
We will change the batch size of 1 to the `num_generations` of 2
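
The arithmetic behind the message above can be sketched as follows. `effective_batch` is a hypothetical helper that mirrors what the log describes; it is not Unsloth's actual code, and the rule for raising the batch size is an assumption based on the message text.

```python
def effective_batch(per_device_batch, grad_accum, world_size, num_generations):
    """Return (per_device_batch, prompts_per_step) after the adjustment.

    The total number of completions per optimizer step must be a multiple
    of num_generations so that every prompt gets a full group.
    """
    total = per_device_batch * grad_accum * world_size
    if total % num_generations != 0:
        # Assumption: mirror the logged message by raising the
        # per-device batch size to num_generations.
        per_device_batch = num_generations
        total = per_device_batch * grad_accum * world_size
    return per_device_batch, total // num_generations


batch, prompts_per_step = effective_batch(1, 1, 1, num_generations=2)
print(batch, prompts_per_step)   # 2 1
print(prompts_per_step * 600)    # 600 prompts over max_steps=600
```

With the settings in this notebook, each step therefore processes one prompt with two completions.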


Now let's set up the trainer. During training you'll see a table of rewards like the sample below. The goal is to see the `reward` column increase!

You might have to wait 150 to 200 steps before the reward starts to move, and you'll probably get 0 reward for the first 100 steps. Please be patient!

| Step | Training Loss | reward    | reward_std | completion_length | kl       |
|------|---------------|-----------|------------|-------------------|----------|
| 1    | 0.000000      | 0.125000  | 0.000000   | 200.000000        | 0.000000 |
| 2    | 0.000000      | 0.072375  | 0.248112   | 200.000000        | 0.000000 |
| 3    | 0.000000      | -0.079000 | 0.163776   | 182.500000        | 0.000005 |


In [None]:
trainer = GRPOTrainer(
    model=model,
    processing_class=tokenizer,
    reward_funcs=[
        function_works,
        no_cheating,
        strategy_succeeds,
    ],
    args=training_args,
    train_dataset=dataset,
)

Unsloth: Switching to float32 training since model cannot work with float16


Now let's train the model! **NOTE:** This might be quite slow. 600 steps take about 5 hours or longer.

[TrackIO](https://github.com/gradio-app/trackio) might be a bit slow to load - wait 2 minutes until the graphs pop up!

In [None]:
trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 2
   \\   /|    Num examples = 1,000 | Num Epochs = 1 | Total steps = 600
O^O/ \_/ \    Batch size per device = 2 | Gradient accumulation steps = 1
\        /    Data Parallel GPUs = 1 | Total batch size (2 x 1 x 1) = 2
 "-____-"     Trainable parameters = 1,990,656 of 20,916,747,840 (0.01% trained)


* Running on public URL: https://68c7c4f7168a6e9cd6.gradio.live
* Trackio project initialized: huggingface
* Trackio metrics logged to: /root/.cache/huggingface/trackio


* GPU detected, enabling automatic GPU metrics logging
* Created new run: dainty-sunset-0


`generation_config` default values have been modified to match model-specific defaults: {'max_length': 131072}. If this is not desired, please set these values explicitly.


def strategy(board):
    # simple look-ahead: pick the move that keeps the board almost sorted
    # score is total immobility (higher=better)
    def score(b):
        s = 0
        n = len(b)
        for i in range(n):
            for j in range(n):
                v = b[i][j]
                if v != 0:
                    # neighbors that can merge
                    for di,dj in [(1,0),(-1,0),(0,1),(0,-1)]:
                        ni, nj = i+di, j+dj
                        if 0 <= ni < n and 0 <= nj < n:
                            if b[ni][nj] == v:
                                s += v
        return s
    moves = []
    for m in ["0","1","2","3"]:
        new_b = [row[:] for row in board]
        # simulate move (simplified: just skip actual moving)
        # Here we just pick the move with highest score (mock)
        moves.append((score(new_b), m))
    best = max(moves)[1]
    return best
Steps = 1 If Done = False
┌───┬───┬───┬───┬──

| Step | Training Loss | reward | reward_std | completions / mean_length | completions / min_length | completions / max_length | completions / clipped_ratio | completions / mean_terminated_length | completions / min_terminated_length | completions / max_terminated_length | kl | rewards / function_works / mean | rewards / function_works / std | rewards / no_cheating / mean | rewards / no_cheating / std | rewards / strategy_succeeds / mean | rewards / strategy_succeeds / std |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0.0 | 4.0 | 0.0 | 312.0 | 278.0 | 346.0 | 0.0 | 312.0 | 278.0 | 346.0 | 0.000103 | 1.0 | 0.0 | 1.0 | 0.0 | 2.0 | 0.0 |
| 2 | 0.0 | 4.0 | 0.0 | 340.0 | 338.0 | 342.0 | 0.0 | 340.0 | 338.0 | 342.0 | 5.5e-05 | 1.0 | 0.0 | 1.0 | 0.0 | 2.0 | 0.0 |
| 3 | 0.0 | -1.25 | 2.474874 | 557.0 | 528.0 | 586.0 | 0.5 | 528.0 | 528.0 | 528.0 | 6.3e-05 | -1.25 | 1.06066 | 0.0 | 1.414214 | 0.0 | 0.0 |
| 4 | 0.0 | 4.0 | 0.0 | 82.0 | 52.0 | 112.0 | 0.0 | 82.0 | 52.0 | 112.0 | 0.000175 | 1.0 | 0.0 | 1.0 | 0.0 | 2.0 | 0.0 |
| 5 | 0.0 | 4.0 | 0.0 | 355.5 | 244.0 | 467.0 | 0.0 | 355.5 | 244.0 | 467.0 | 0.001542 | 1.0 | 0.0 | 1.0 | 0.0 | 2.0 | 0.0 |
| 6 | 0.0 | 4.0 | 0.0 | 278.0 | 142.0 | 414.0 | 0.0 | 278.0 | 142.0 | 414.0 | 0.008132 | 1.0 | 0.0 | 1.0 | 0.0 | 2.0 | 0.0 |
| 7 | 0.0 | 4.0 | 0.0 | 370.0 | 183.0 | 557.0 | 0.0 | 370.0 | 183.0 | 557.0 | 0.00735 | 1.0 | 0.0 | 1.0 | 0.0 | 2.0 | 0.0 |
| 8 | 0.0 | 4.0 | 0.0 | 317.5 | 190.0 | 445.0 | 0.0 | 317.5 | 190.0 | 445.0 | 0.012345 | 1.0 | 0.0 | 1.0 | 0.0 | 2.0 | 0.0 |
| 9 | 0.0 | 0.5 | 4.949748 | 321.5 | 57.0 | 586.0 | 0.5 | 57.0 | 57.0 | 57.0 | 0.018377 | -0.5 | 2.12132 | 0.0 | 1.414214 | 1.0 | 1.414214 |
| 10 | 0.0 | 0.5 | 4.949748 | 407.5 | 229.0 | 586.0 | 0.5 | 229.0 | 229.0 | 229.0 | 0.031208 | -0.5 | 2.12132 | 0.0 | 1.414214 | 1.0 | 1.414214 |


Steps = 1 If Done = False
def strategy(board):
    # Count empty cells for each move
    def score_for(move):
        temp = [row[:] for row in board]
        # simulate move
        for i in range(4):
            line = [temp[j][i] for j in range(4)] if move==3 else [temp[i][j] for j in range(4)]
            # shift and merge
            new_line = []
            merged = False
            for val in (line if move in (0,2) else reversed(line)):
                if val == 0: continue
                if new_line and new_line[-1] == val and not merged:
                    new_line[-1] *= 2
                    merged = True
                else:
                    new_line.append(val)
            # fill rest with zeros
            new_line += [0]*(4-len(new_line))
            if move==3:
                for j in range(4): temp[j][i] = new_line[j]
            else:
                for j in range(4): temp[i][j] = new_line[j]
        return sum(new_line)  # simple heuristic
    best = -1; be

## Testing the Trained Model

Let's generate a strategy from our RL-trained model and see how it differs from the base model:

In [None]:
text = tokenizer.apply_chat_template(
    [{"role": "user", "content": prompt}],
    tokenize=False,
    add_generation_prompt=True,
    reasoning_effort="low",
)

from transformers import TextStreamer

_ = model.generate(
    **tokenizer(text, return_tensors="pt").to("cuda"),
    temperature=1.0,
    max_new_tokens=1024,
    streamer=TextStreamer(tokenizer, skip_prompt=False),
)

## Saving the Fine-tuned Model

You can save the trained model in different formats:

- **MXFP4**: OpenAI gpt-oss's native 4-bit precision format
- **float16**: Standard half-precision for broader compatibility

To push to Hugging Face Hub, you'll need a token from https://huggingface.co/settings/tokens:

In [None]:
# Merge and save locally, or push to the Hub, in MXFP4 4-bit format
if False:
    model.save_pretrained_merged("finetuned_model", tokenizer, save_method="mxfp4")
if False:
    model.push_to_hub_merged("repo_id/repo_name", tokenizer, token="hf...", save_method="mxfp4")

# Merge and save locally, or push to the Hub, in 16-bit format
if False:
    model.save_pretrained_merged("finetuned_model", tokenizer, save_method="merged_16bit")
if False:  # Pushing to HF Hub
    model.push_to_hub_merged("hf/gpt-oss-finetune", tokenizer, save_method="merged_16bit", token="")
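
Rather than pasting a token into the notebook, one option is to read it from an environment variable. The sketch below assumes the token is stored in `HF_TOKEN`, the variable name conventionally used by the Hugging Face tooling. `get_hf_token` is a hypothetical helper, not part of Unsloth or `huggingface_hub`.

```python
import os


def get_hf_token(env_var="HF_TOKEN"):
    """Read a Hugging Face token from the environment, failing loudly if absent."""
    token = os.environ.get(env_var, "").strip()
    if not token:
        raise RuntimeError(
            f"Set the {env_var} environment variable before pushing to the Hub."
        )
    return token
```

The result can then be passed as the `token=` argument of the push calls above, which keeps the secret out of the saved notebook.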

## Conclusion

Congratulations! You've learned how to apply reinforcement learning to teach an LLM to generate game-playing code. The key concepts covered:

1. **OpenEnv** for standardized access to RL environments
2. **LoRA** for memory-efficient fine-tuning
3. **Sandboxed execution** to prevent reward hacking
4. **Multi-objective reward functions** that balance validity, safety, and performance
5. **GRPO** for policy optimization without a value network

This pattern extends beyond 2048: you can adapt it to any task where model outputs can be programmatically evaluated, such as code synthesis, mathematical proofs, API usage, and more.

### Further Resources

- [OpenAI gpt-oss-20b Model Card](https://huggingface.co/openai/gpt-oss-20b)
- [OpenEnv Documentation](https://github.com/meta-pytorch/OpenEnv)
- [TRL GRPO Trainer](https://huggingface.co/docs/trl/main/en/grpo_trainer)
- [Unsloth RL Guide](https://docs.unsloth.ai/get-started/reinforcement-learning-rl-guide)

---

*This notebook uses [Unsloth](https://github.com/unslothai/unsloth) for memory-efficient training.*

**License:** Apache 2.0