# Minesweeper GRPO Training -- SOTA Pipeline

## Goal
Train **Qwen2.5-14B-Instruct** via GRPO to play Minesweeper at competition-winning level:
- **Input**: JSON game state (board, rows, cols, mines, flags_placed, cells_revealed)
- **Output**: Bare JSON action `{"type":"reveal"|"flag","row":R,"col":C}`

## Training Approach
- **Model**: `Qwen/Qwen2.5-14B-Instruct` -- SOTA instruction-following, 14.7B params, native JSON adherence
- **Method**: GRPO with **6 shaped reward functions** (safety-rail + cascade-gameplay + oracle-proximity + rollout + brevity + strict-format)
- **LoRA**: rank 64, alpha 128, all attention + MLP projections -- maximal capacity on 262 GB MI300x VRAM
- **Framework**: Unsloth -- 2-6x faster, 70% less VRAM, ROCm compatible
- **Dataset**: 5000 curriculum-sampled game states (training boards 5×5 to **25×25**, eval up to 50×50)
- **Eval-harness aligned**: prompt, parser, and scoring mirror `EVAL agents/` exactly
- **Hardware**: Single AMD MI300x (262 GB HBM3)
- **Breakthroughs**: cascade-shaped reward, oracle proximity reward, num_iterations=2

# Load Model with Unsloth

Load Qwen2.5-14B-Instruct -- strong instruction-following baseline with native JSON adherence:

In [1]:
import os

# ── Redirect all caches to /workspace/ (writable) ───────────────
# /root/.cache/huggingface/hub is read-only on this machine.
os.environ["HF_HOME"]            = "/workspace/hf_home"
os.environ["HF_HUB_CACHE"]      = "/workspace/hf_home/hub"
os.environ["TRANSFORMERS_CACHE"] = "/workspace/hf_home/hub"
os.environ["TORCH_HOME"]        = "/workspace/torch_home"
os.makedirs("/workspace/hf_home/hub", exist_ok=True)
os.makedirs("/workspace/torch_home", exist_ok=True)

from unsloth import FastLanguageModel
import torch

# ── Qwen2.5-14B-Instruct: 14.7B params, 48 layers ──────────────
# 262 GB MI300x VRAM budget:
#   Base model bf16 ≈ 28 GB
#   LoRA r=64 adapters ≈ 1.5 GB
#   KV-cache + activations + optimizer states ≈ 50 GB
#   Plenty of headroom for num_generations=8 and gradient checkpointing
#
# max_seq_length=4096 -- training boards are capped at 25×25 (~2400 tokens),
# which keeps KV-cache memory reasonable. Larger boards (up to 50×50) are
# handled at eval time with a longer generation context.

max_seq_length = 4096
lora_rank = 64  # High rank: 262 GB VRAM allows it; critical for learning
                # spatial reasoning patterns in Minesweeper

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "Qwen/Qwen2.5-14B-Instruct",
    load_in_4bit = False,       # Full bf16 -- we have 262 GB VRAM
    max_seq_length = max_seq_length,
    torch_dtype = torch.bfloat16,
)

print(f"Model device: {model.device}")
print(f"Model dtype: {model.dtype}")
print(f"Model parameters: {sum(p.numel() for p in model.parameters()) / 1e9:.1f}B")
print("Qwen2.5-14B-Instruct loaded successfully!")

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.




#### Unsloth: `hf_xet==1.1.10` and `ipykernel>6.30.1` breaks progress bars. Disabling for now in XET.
#### Unsloth: To re-enable progress bars, please downgrade to `ipykernel==6.30.1` or wait for a fix to
https://github.com/huggingface/xet-core/issues/526
INFO 02-15 07:39:41 [__init__.py:225] Automatically detected platform rocm.
🦥 Unsloth Zoo will now patch everything to make training faster!
Unsloth: AMD currently is not stable with 4bit bitsandbytes. Disabling for now.
==((====))==  Unsloth 2025.10.6: Fast Qwen2 patching. Transformers: 4.56.2. vLLM: 0.11.1rc2.dev161+g8a297115e.rocm700.
   \\   /|    . Num GPUs = 1. Max memory: 255.688 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.8.0+gitb2fb688. ROCm Toolkit: 7.0.51831-a3e329ad8. Triton: 3.4.0
\        /    Bfloat16 = TRUE. FA [Xformers = None. FA2 = True]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


`torch_dtype` is deprecated! Use `dtype` instead!


Loading checkpoint shards:   0%|          | 0/6 [00:00<?, ?it/s]

Model device: cuda:0
Model dtype: torch.bfloat16
Model parameters: 14.8B
Qwen2.5-14B-Instruct loaded successfully!
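
As a rough check on the VRAM budget in the comments above, weight memory is simply parameter count times bytes per parameter. The sketch below assumes the 14.8B figure printed by the cell and 2 bytes per bf16 value; `weight_memory_gib` is a hypothetical helper, and the estimate ignores activations, KV-cache, and optimizer state.

```python
def weight_memory_gib(num_params: float, bytes_per_param: int = 2) -> float:
    """Memory needed to hold the weights alone, in GiB (2**30 bytes)."""
    return num_params * bytes_per_param / 2**30


# 14.8B parameters stored in bf16 (2 bytes each)
base_gib = weight_memory_gib(14.8e9)
print(f"Base weights: {base_gib:.1f} GiB")
```

The result lands just under 28 GiB, in line with the "≈ 28 GB" entry in the budget.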


# Add LoRA Adapters

LoRA rank 64 with alpha 128 -- high capacity for learning spatial reasoning.
262 GB VRAM easily accommodates this. Target all attention + MLP projections:

In [2]:
model = FastLanguageModel.get_peft_model(
    model,
    r = lora_rank,               # 64 -- high rank for spatial reasoning
    target_modules = [
        "q_proj", "k_proj", "v_proj", "o_proj",   # All attention projections
        "gate_proj", "up_proj", "down_proj",       # All MLP projections
    ],
    lora_alpha = lora_rank * 2,  # 128 -- standard 2x scaling
    lora_dropout = 0,            # No dropout -- need every gradient we can get
    use_gradient_checkpointing = "unsloth",  # Saves ~40% activation memory
    random_state = 3407,
)

# Print trainable parameter count
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"Trainable parameters: {trainable/1e6:.1f}M / {total/1e9:.1f}B ({100*trainable/total:.2f}%)")

Unsloth 2025.10.6 patched 48 layers with 48 QKV layers, 48 O layers and 48 MLP layers.


Trainable parameters: 275.3M / 15.0B (1.83%)
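
The trainable-parameter figure can be reproduced by hand: a rank-`r` adapter on a linear layer with `d_in` inputs and `d_out` outputs adds `r * (d_in + d_out)` weights. The sketch below is a back-of-the-envelope check, assuming the layer dimensions from the published Qwen2.5-14B config (hidden size 5120, MLP size 13824, 8 key/value heads of dimension 128); those numbers are not read from the loaded model here.

```python
# Assumed Qwen2.5-14B dimensions (taken from the published config, not from this notebook)
HIDDEN = 5120
INTERMEDIATE = 13824
KV_DIM = 8 * 128      # 8 key/value heads x head_dim 128
NUM_LAYERS = 48

# (in_features, out_features) of every adapted projection in one decoder layer
PROJECTIONS = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_param_count(rank: int) -> int:
    """LoRA adds A (rank x d_in) and B (d_out x rank) to each adapted layer."""
    per_layer = sum(rank * (d_in + d_out) for d_in, d_out in PROJECTIONS.values())
    return per_layer * NUM_LAYERS


print(f"{lora_param_count(64) / 1e6:.1f}M trainable parameters")
```

Under these assumptions the count is 275,251,200, which agrees with the 275.3M reported above.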


# Minesweeper Game Implementation

Custom Minesweeper environment supporting:
- Customizable board size and mine count
- Actions: reveal or flag cells
- Win: reveal all safe cells
- Lose: reveal a mine

In [3]:
from dataclasses import dataclass, field
from typing import List, Tuple, Optional, Set
import random

@dataclass
class MinesweeperGame:
    rows: int
    cols: int
    num_mines: int
    seed: Optional[int] = None
    _rng: random.Random = field(init=False, repr=False)
    _board: List[List[int]] = field(init=False, repr=False)  # -1 = mine, 0-8 = count
    _revealed: Set[Tuple[int, int]] = field(init=False, repr=False, default_factory=set)
    _flagged: Set[Tuple[int, int]] = field(init=False, repr=False, default_factory=set)
    _state: str = field(default="ongoing", init=False, repr=False)

    def __post_init__(self):
        if self.num_mines >= self.rows * self.cols:
            raise ValueError("Too many mines for board size")
        self._rng = random.Random(self.seed)
        self._board = [[0 for _ in range(self.cols)] for _ in range(self.rows)]
        self._place_mines()
        self._calculate_numbers()

    def _place_mines(self):
        """Place mines randomly on the board"""
        positions = [(r, c) for r in range(self.rows) for c in range(self.cols)]
        mine_positions = self._rng.sample(positions, self.num_mines)
        for r, c in mine_positions:
            self._board[r][c] = -1

    def _calculate_numbers(self):
        """Calculate numbers for each cell based on adjacent mines"""
        for r in range(self.rows):
            for c in range(self.cols):
                if self._board[r][c] == -1:
                    continue
                count = 0
                for dr in [-1, 0, 1]:
                    for dc in [-1, 0, 1]:
                        if dr == 0 and dc == 0:
                            continue
                        nr, nc = r + dr, c + dc
                        if 0 <= nr < self.rows and 0 <= nc < self.cols:
                            if self._board[nr][nc] == -1:
                                count += 1
                self._board[r][c] = count

    def _reveal_cell(self, row: int, col: int) -> bool:
        """Reveal a cell. Returns True if valid move, False if invalid.
        Uses iterative flood-fill to avoid recursion limit on large boards.
        """
        if not (0 <= row < self.rows and 0 <= col < self.cols):
            return False
        if (row, col) in self._revealed or (row, col) in self._flagged:
            return False

        stack = [(row, col)]
        while stack:
            r, c = stack.pop()
            if (r, c) in self._revealed:
                continue

            self._revealed.add((r, c))

            # Hit a mine!
            if self._board[r][c] == -1:
                self._state = "failed"
                return True

            # Auto-reveal neighbors if cell is 0
            if self._board[r][c] == 0:
                for dr in [-1, 0, 1]:
                    for dc in [-1, 0, 1]:
                        if dr == 0 and dc == 0:
                            continue
                        nr, nc = r + dr, c + dc
                        if (0 <= nr < self.rows and 0 <= nc < self.cols
                                and (nr, nc) not in self._revealed
                                and (nr, nc) not in self._flagged):
                            stack.append((nr, nc))

        return True

    def _flag_cell(self, row: int, col: int) -> bool:
        """Flag/unflag a cell. Returns True if valid, False if invalid"""
        if not (0 <= row < self.rows and 0 <= col < self.cols):
            return False
        if (row, col) in self._revealed:
            return False
        
        if (row, col) in self._flagged:
            self._flagged.remove((row, col))
        else:
            self._flagged.add((row, col))
        return True

    def do_action(self, action: dict) -> str:
        """Execute an action and return a status string.

        Returns one of:
          'ok'               - valid move executed
          'mine'             - revealed a mine (game over)
          'win'              - game won after this move
          'invalid_format'   - bad action dict / missing keys / bad types
          'out_of_bounds'    - coordinates outside the board
          'already_revealed' - cell was already revealed
          'flagged_cell'     - tried to reveal a flagged cell
          'invalid_flag'     - tried to flag a revealed cell
          'game_over'        - game was already over before this call

        (Issue #13: previously set state='failed' for ALL invalid moves,
         conflating formatting errors with hitting a mine.)
        """
        if self._state != "ongoing":
            return "game_over"

        if not isinstance(action, dict):
            self._state = "failed"
            return "invalid_format"

        action_type = action.get("type")
        row = action.get("row")
        col = action.get("col")

        if action_type not in ["reveal", "flag"] or row is None or col is None:
            self._state = "failed"
            return "invalid_format"

        try:
            row, col = int(row), int(col)
        except (ValueError, TypeError):
            self._state = "failed"
            return "invalid_format"

        if not (0 <= row < self.rows and 0 <= col < self.cols):
            self._state = "failed"
            return "out_of_bounds"

        if action_type == "reveal":
            if (row, col) in self._revealed:
                self._state = "failed"
                return "already_revealed"
            if (row, col) in self._flagged:
                self._state = "failed"
                return "flagged_cell"
            valid = self._reveal_cell(row, col)
        else:
            if (row, col) in self._revealed:
                self._state = "failed"
                return "invalid_flag"
            valid = self._flag_cell(row, col)

        if not valid:
            self._state = "failed"
            return "invalid_format"

        self._check_win()

        if self._state == "failed":
            return "mine"
        if self._state == "success":
            return "win"
        return "ok"

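    def _check_win(self):
        """Check if player has won"""
        # A revealed mine is also stored in _revealed, so a losing move could
        # otherwise reach the safe-cell count and be reported as a win.
        if self._state != "ongoing":
            return
        total_cells = self.rows * self.cols
        safe_cells = total_cells - self.num_mines
        if len(self._revealed) == safe_cells:
            self._state = "success"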

    def get_visible_board(self) -> List[List[str]]:
        """Get board state as player sees it"""
        visible = []
        for r in range(self.rows):
            row = []
            for c in range(self.cols):
                if (r, c) in self._flagged:
                    row.append('F')
                elif (r, c) in self._revealed:
                    val = self._board[r][c]
                    row.append('*' if val == -1 else str(val))
                else:
                    row.append('.')
            visible.append(row)
        return visible

    def state(self) -> str:
        return self._state

    def pretty_print(self) -> str:
        """Pretty print the board"""
        visible = self.get_visible_board()
        lines = []
        
        # Header
        header = "   " + " ".join(f"{i:2d}" for i in range(self.cols))
        lines.append(header)
        lines.append("  " + "─" * (self.cols * 3 + 1))
        
        # Board
        for r, row in enumerate(visible):
            line = f"{r:2d}│ " + "  ".join(row)
            lines.append(line)
        
        return "\n".join(lines)
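
The cascade in `_reveal_cell` is an iterative flood fill: a cell with count 0 pushes its neighbours onto a stack, while a numbered cell is revealed but does not expand. The standalone sketch below isolates that idea on a plain grid of counts; `flood_reveal` is a hypothetical helper for illustration and, unlike the class above, it does not track flags or game state.

```python
from typing import List, Set, Tuple


def flood_reveal(counts: List[List[int]], start: Tuple[int, int]) -> Set[Tuple[int, int]]:
    """Reveal `start`, expanding through every reached cell whose count is 0."""
    rows, cols = len(counts), len(counts[0])
    revealed: Set[Tuple[int, int]] = set()
    stack = [start]
    while stack:
        r, c = stack.pop()
        if (r, c) in revealed:
            continue
        revealed.add((r, c))
        if counts[r][c] != 0:
            continue  # numbered cells are revealed but do not expand
        for dr in (-1, 0, 1):
            for dc in (-1, 0, 1):
                nr, nc = r + dr, c + dc
                if 0 <= nr < rows and 0 <= nc < cols and (nr, nc) not in revealed:
                    stack.append((nr, nc))
    return revealed


# 3x4 board with a single mine (-1) at (1, 3); other entries are adjacent-mine counts.
counts = [
    [0, 0, 1, 1],
    [0, 0, 1, -1],
    [0, 0, 1, 1],
]
opened = flood_reveal(counts, (0, 0))
print(len(opened))
```

Starting from the corner, the fill opens the two zero columns and the adjacent column of 1s, then stops, so the mine and its right-hand neighbours stay hidden.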

# Test the Game

In [4]:
# Create test game
game = MinesweeperGame(rows=6, cols=6, num_mines=5)
print(game.pretty_print())
print(f"State: {game.state()}")

# Test action
game.do_action({"type": "reveal", "row": 0, "col": 0})
print("\nAfter revealing (0,0):")
print(game.pretty_print())
print(f"State: {game.state()}")

    0  1  2  3  4  5
  ───────────────────
 0│ .  .  .  .  .  .
 1│ .  .  .  .  .  .
 2│ .  .  .  .  .  .
 3│ .  .  .  .  .  .
 4│ .  .  .  .  .  .
 5│ .  .  .  .  .  .
State: ongoing

After revealing (0,0):
    0  1  2  3  4  5
  ───────────────────
 0│ 0  1  .  .  .  .
 1│ 0  1  .  .  .  .
 2│ 0  1  .  .  .  .
 3│ 1  2  .  .  .  .
 4│ .  .  .  .  .  .
 5│ .  .  .  .  .  .
State: ongoing


# JSON Input/Output Format

## Input Format (Game State)
```json
{
  "board": [
    ["1", ".", ".", ".", ".", "."],
    [".", ".", ".", ".", ".", "."],
    [".", ".", ".", ".", ".", "."],
    [".", ".", ".", ".", ".", "."],
    [".", ".", ".", ".", ".", "."],
    [".", ".", ".", ".", ".", "."]
  ],
  "rows": 6,
  "cols": 6,
  "mines": 5,
  "flags_placed": 0,
  "cells_revealed": 1
}
```

## Output Format (Action)
```json
{"type": "reveal", "row": 2, "col": 3}
```
or
```json
{"type": "flag", "row": 1, "col": 4}
```
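
A strict check of the output format can be sketched as follows. `is_valid_action` is a hypothetical helper, not part of the eval harness, and it is deliberately stricter than the parser used later in this notebook: it rejects any surrounding text and any non-integer coordinate.

```python
import json


def is_valid_action(raw: str, rows: int, cols: int) -> bool:
    """Return True if `raw` is a bare JSON action that lies on the board."""
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError:
        return False
    if not isinstance(obj, dict) or obj.get("type") not in ("reveal", "flag"):
        return False
    row, col = obj.get("row"), obj.get("col")
    # bool is a subclass of int in Python, so reject it explicitly
    if isinstance(row, bool) or isinstance(col, bool):
        return False
    if not isinstance(row, int) or not isinstance(col, int):
        return False
    return 0 <= row < rows and 0 <= col < cols


print(is_valid_action('{"type": "reveal", "row": 2, "col": 3}', 6, 6))
print(is_valid_action('Sure! {"type": "flag", "row": 1, "col": 4}', 6, 6))
```

The first call accepts the bare action; the second rejects it because of the leading text.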

In [5]:
import json

'''
Important Hints:

1. The prompt is crucial. Make sure your LLM is not verbose and does not output its reasoning;
    any reasoning must stay hidden, and the output must be a bare JSON object. In our experiment,
    verbose output ran past the max-token limit and caused a JSON parsing failure, i.e. disqualification:
    {"type": "reveal", "row": <row_index>, "col": <col_index>}
    or
    {"type": "flag", "row": <row_index>, "col": <col_index>}

2. Make sure your model learns generic N*M board shapes and mine counts.

3. Do not flag a cell that is already flagged - the game will go into recursion and you will incur a heavy penalty.

4. Do not flag a cell that is already revealed - the game will go into recursion and you will incur a heavy penalty.
'''

# ── System prompt (used with Qwen2.5 chat template) ─────────────
# Kept short and authoritative. Qwen2.5-Instruct natively supports
# system prompts via <|im_start|>system.
SYSTEM_PROMPT = "You output JSON actions for Minesweeper. No text, only JSON."


def format_state_for_llm(game: MinesweeperGame) -> str:
    """Convert game state to JSON prompt for LLM.

    ** ALIGNED WITH EVAL HARNESS (EVAL agents/minesweeper_agent.py → build_prompt) **

    Key alignment points kept IDENTICAL to eval:
    1. Same JSON game-state structure the controller sends.
    2. "You must output ONLY a valid JSON object."
    3. "Start your response immediately with { and end with }."
    4. "Do NOT output cell which is already revealed or flagged."
    5. Same legend, same output format examples.

    The system prompt is injected separately via apply_chat_template
    (Qwen2.5 uses <|im_start|>system).
    """
    state = {
        "board": game.get_visible_board(),
        "rows": game.rows,
        "cols": game.cols,
        "mines": game.num_mines,
        "flags_placed": len(game._flagged),
        "cells_revealed": len(game._revealed),
    }

    prompt = f"""You are playing Minesweeper. Analyze the game state and output your next move.

You must output ONLY a valid JSON object. No explanation, no analysis, no text.

Just output section after assistantfinal and not anything before it in your output.

Start your response immediately with {{ and end with }}.

Do NOT output cell which is already revealed or flagged in the current state.

Game state:
{json.dumps(state, indent=2)}

Legend:
- "." = unrevealed cell
- "F" = flagged cell (suspected mine)
- "0"-"8" = number of adjacent mines
- "*" = revealed mine (game over)

Output your next action as JSON:
{{"type": "reveal", "row": <row_index>, "col": <col_index>}}
or
{{"type": "flag", "row": <row_index>, "col": <col_index>}}

Your action:"""

    return prompt


def build_chat_messages(game: MinesweeperGame) -> list:
    """Build the full chat message list for Qwen2.5's apply_chat_template.

    Returns [system_msg, user_msg] -- the same structure the eval harness
    uses when it calls build_prompt → generate_response.
    """
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user",   "content": format_state_for_llm(game)},
    ]


def parse_llm_action(response: str) -> Optional[dict]:
    """Extract JSON action from LLM response.

    ** Aligned with EVAL agents/minesweeper_agent.py → parse_action **

    The eval harness iterates through ALL JSON objects in the response
    and returns the FIRST one that matches the expected schema
    (type ∈ {reveal, flag}, row, col).  We replicate that behaviour
    exactly so training rewards match evaluation scoring.
    Returns None when no such object is found.
    """
    try:
        potential_jsons = []
        i = 0
        while i < len(response):
            start = response.find("{", i)
            if start == -1:
                break
            brace_count = 0
            end = start
            while end < len(response):
                if response[end] == '{':
                    brace_count += 1
                elif response[end] == '}':
                    brace_count -= 1
                    if brace_count == 0:
                        json_str = response[start:end + 1]
                        try:
                            obj = json.loads(json_str)
                            potential_jsons.append(obj)
                        except json.JSONDecodeError:
                            pass
                        break
                end += 1
            i = end + 1 if end < len(response) else len(response)

        # Return the FIRST valid action (matches eval harness behaviour)
        for obj in potential_jsons:
            if (isinstance(obj, dict)
                    and "type" in obj and "row" in obj and "col" in obj
                    and obj["type"] in ["reveal", "flag"]):
                obj["row"] = int(obj["row"])
                obj["col"] = int(obj["col"])
                return obj
    except Exception:
        return None

    return None

# Test formatting + chat template
game = MinesweeperGame(rows=6, cols=6, num_mines=5)
messages = build_chat_messages(game)
print(f"System prompt: {messages[0]['content']}")
print(f"\nUser prompt (first 500 chars):")
print(messages[1]["content"][:500] + "...")

System prompt: You output JSON actions for Minesweeper. No text, only JSON.

User prompt (first 500 chars):
You are playing Minesweeper. Analyze the game state and output your next move.

You must output ONLY a valid JSON object. No explanation, no analysis, no text.

Just output section after assistantfinal and not anything before it in your output.

Start your response immediately with { and end with }.

Do NOT output cell which is already revealed or flagged in the current state.

Game state:
{
  "board": [
    [
      ".",
      ".",
      ".",
      ".",
      ".",
      "."
    ],
    [
      "....


# Test Model Before Training

See how the base model performs without finetuning:

In [6]:
from transformers import TextStreamer

game = MinesweeperGame(rows=6, cols=6, num_mines=5, seed=42)

# Build messages with system prompt (Qwen2.5 chat template)
messages = build_chat_messages(game)
text = tokenizer.apply_chat_template(
    messages,
    tokenize = False,
    add_generation_prompt = True,
)

print("=== Base Model Response (Qwen2.5-14B-Instruct) ===")
print(f"Prompt length: {len(tokenizer.encode(text))} tokens\n")
output = model.generate(
    **tokenizer(text, return_tensors = "pt").to("cuda"),
    temperature = 0.3,       # Match eval config
    max_new_tokens = 128,    # Match eval config
    do_sample = True,
    top_p = 0.9,
    repetition_penalty = 1.2,
    streamer = TextStreamer(tokenizer, skip_prompt = True),
)

# Parse and show result
prompt_len = tokenizer(text, return_tensors="pt")["input_ids"].shape[1]
response = tokenizer.decode(output[0][prompt_len:], skip_special_tokens=True)
action = parse_llm_action(response)
print(f"\nParsed action: {action}")

=== Base Model Response (Qwen2.5-14B-Instruct) ===
Prompt length: 387 tokens

{"type": "reveal", "row": 3, "col": 3}<|im_end|>

Parsed action: {'type': 'reveal', 'row': 3, 'col': 3}
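
The parsed action above comes from the brace-matching scan in `parse_llm_action`. To make its behaviour on messier responses concrete, here is a condensed, standalone re-implementation of the same scan; `first_action` is a hypothetical name used only for this illustration, and the eval harness code itself is not shown in this notebook.

```python
import json
from typing import Optional


def first_action(text: str) -> Optional[dict]:
    """Condensed, illustrative version of the brace-matching scan."""
    i = 0
    while True:
        start = text.find("{", i)
        if start == -1:
            return None
        depth = 0
        end = None
        for pos in range(start, len(text)):
            if text[pos] == "{":
                depth += 1
            elif text[pos] == "}":
                depth -= 1
                if depth == 0:
                    end = pos
                    break
        if end is None:
            return None  # unbalanced braces: no complete object remains
        try:
            obj = json.loads(text[start:end + 1])
        except json.JSONDecodeError:
            obj = None
        if (isinstance(obj, dict) and obj.get("type") in ("reveal", "flag")
                and "row" in obj and "col" in obj):
            try:
                return {**obj, "row": int(obj["row"]), "col": int(obj["col"])}
            except (TypeError, ValueError):
                return None
        i = end + 1  # skip past this object and keep scanning


chatty = 'I will reveal a corner. {"type": "reveal", "row": "2", "col": 3}'
two = '{"note": 1} {"type": "flag", "row": 0, "col": 1}'
nested = '{"plan": {"type": "reveal", "row": 1, "col": 1}}'
print(first_action(chatty))
print(first_action(two))
print(first_action(nested))
```

Leading text and non-matching objects are skipped, and string coordinates are cast to integers. An action wrapped inside another object is not found, because the scan resumes after the outer closing brace; braces inside JSON string values would also confuse the depth counter.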


# GRPO Reward Functions (5 total)

Five shaped reward functions aligned with the **EVAL agents** harness scoring table:

1. **`valid_json_reward`** -- Is the output a parseable JSON action?
2. **`gameplay_scores`** -- Full 12-criterion scoring table from `instructions.md` (flag mine +15, reveal mine -25, win +100, etc.) with logical-deduction detection
3. **`brevity_reward`** -- Penalise verbose output (eval uses `max_new_tokens=128`)
4. **`strict_format_reward`** -- Reward responses that start immediately with `{` (matches eval prompt)
5. **`rollout_reward`** -- Monte-Carlo-style multi-step rollout: play 5 follow-up moves after the proposed action and reward moves that open up the board / lead to wins

Helper functions `_is_logically_safe` and `_is_logically_mine` detect whether a move is logically deducible from visible numbers, awarding +15 instead of +10 for safe reveals.
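
The two deduction rules can be illustrated on tiny boards. The sketch below is a compact, standalone variant of the rules just described; `neighbours`, `provably_safe`, and `provably_mine` are hypothetical names, and the target cell is assumed to be unrevealed (`"."`).

```python
from typing import Iterator, List, Tuple


def neighbours(r: int, c: int, rows: int, cols: int) -> Iterator[Tuple[int, int]]:
    """Yield the in-bounds cells adjacent to (r, c)."""
    for dr in (-1, 0, 1):
        for dc in (-1, 0, 1):
            if (dr, dc) != (0, 0) and 0 <= r + dr < rows and 0 <= c + dc < cols:
                yield r + dr, c + dc


def _flags_and_hidden(board: List[List[str]], r: int, c: int) -> Tuple[int, int]:
    cells = [board[nr][nc] for nr, nc in neighbours(r, c, len(board), len(board[0]))]
    return cells.count("F"), cells.count(".")


def provably_safe(board: List[List[str]], row: int, col: int) -> bool:
    """Safe if some numbered neighbour already has all of its mines flagged."""
    for nr, nc in neighbours(row, col, len(board), len(board[0])):
        cell = board[nr][nc]
        if cell.isdigit() and cell != "0":
            flags, _ = _flags_and_hidden(board, nr, nc)
            if flags == int(cell):
                return True
    return False


def provably_mine(board: List[List[str]], row: int, col: int) -> bool:
    """Mine if some numbered neighbour needs every hidden neighbour to be a mine."""
    for nr, nc in neighbours(row, col, len(board), len(board[0])):
        cell = board[nr][nc]
        if cell.isdigit() and cell != "0":
            flags, hidden = _flags_and_hidden(board, nr, nc)
            remaining = int(cell) - flags
            if remaining > 0 and remaining == hidden:
                return True
    return False


flagged_corner = [
    ["F", "1", "."],
    ["1", "1", "."],
    [".", ".", "."],
]
last_hidden = [
    ["1", "."],
    ["1", "1"],
]
print(provably_safe(flagged_corner, 0, 2))
print(provably_mine(last_hidden, 0, 1))
```

In `flagged_corner`, the `1` at (0, 1) already has its single mine flagged, so (0, 2) is provably safe. In `last_hidden`, the `1` at (0, 0) has exactly one hidden neighbour, so (0, 1) must be a mine. Both rules trust the flags already on the board.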

In [7]:
import numpy as np

# ---------------------------------------------------------------------------
#  Reward function 1 – Valid JSON format
# ---------------------------------------------------------------------------
def valid_json_reward(completions, **kwargs):
    """Reward valid JSON action format.

    Mirrors the eval harness: the controller expects a JSON object with
    keys "type" (reveal|flag), "row", and "col".  Anything else scores
    -5.0 here; the -10 "Invalid JSON" row of the scoring table is applied
    separately in gameplay_scores.
    """
    scores = []
    for completion in completions:
        response = completion[0]["content"]
        action = parse_llm_action(response)

        if action is None:
            scores.append(-5.0)   # Invalid format – strong negative signal
        else:
            scores.append(1.0)    # Valid format – small positive signal

    return scores


# ---------------------------------------------------------------------------
#  Helper: is a cell logically deducible as safe?
# ---------------------------------------------------------------------------
def _is_logically_safe(board, row, col, rows, cols):
    """Return True if (row, col) can be *proven* safe from the numbers
    visible on the board right now.

    A cell is logically safe when at least one revealed-number neighbour
    already has all its mines accounted for by flags.
    """
    for dr in [-1, 0, 1]:
        for dc in [-1, 0, 1]:
            if dr == 0 and dc == 0:
                continue
            nr, nc = row + dr, col + dc
            if not (0 <= nr < rows and 0 <= nc < cols):
                continue
            cell = board[nr][nc]
            # Must be a revealed number (1-8); "0" cells auto-expand so
            # their unrevealed neighbours are already revealed.
            if cell in (".", "F", "*", "0"):
                continue
            mine_count = int(cell)
            # Count flags and unrevealed cells around this number
            flags = 0
            unrevealed = 0
            for dr2 in [-1, 0, 1]:
                for dc2 in [-1, 0, 1]:
                    if dr2 == 0 and dc2 == 0:
                        continue
                    nr2, nc2 = nr + dr2, nc + dc2
                    if 0 <= nr2 < rows and 0 <= nc2 < cols:
                        if board[nr2][nc2] == "F":
                            flags += 1
                        elif board[nr2][nc2] == ".":
                            unrevealed += 1
            # If all mines around this number are flagged, every
            # remaining unrevealed neighbour (including our cell) is safe.
            if flags == mine_count and unrevealed > 0:
                return True
    return False


def _is_logically_mine(board, row, col, rows, cols):
    """Return True if (row, col) can be *proven* to be a mine from the
    numbers visible on the board right now.

    A cell is logically a mine when at least one revealed-number neighbour
    has exactly as many unrevealed+flagged neighbours as its mine count.
    """
    for dr in [-1, 0, 1]:
        for dc in [-1, 0, 1]:
            if dr == 0 and dc == 0:
                continue
            nr, nc = row + dr, col + dc
            if not (0 <= nr < rows and 0 <= nc < cols):
                continue
            cell = board[nr][nc]
            if cell in (".", "F", "*", "0"):
                continue
            mine_count = int(cell)
            # Count flags and unrevealed cells around this number
            flags = 0
            unrevealed = 0
            for dr2 in [-1, 0, 1]:
                for dc2 in [-1, 0, 1]:
                    if dr2 == 0 and dc2 == 0:
                        continue
                    nr2, nc2 = nr + dr2, nc + dc2
                    if 0 <= nr2 < rows and 0 <= nc2 < cols:
                        if board[nr2][nc2] == "F":
                            flags += 1
                        elif board[nr2][nc2] == ".":
                            unrevealed += 1
            # If remaining unrevealed cells == remaining mines, they are
            # ALL mines (including our target cell).
            remaining_mines = mine_count - flags
            if remaining_mines > 0 and remaining_mines == unrevealed:
                return True
    return False


# ---------------------------------------------------------------------------
#  Reward function 2 – Full gameplay scoring
# ---------------------------------------------------------------------------
def gameplay_scores(completions, **kwargs):
    """Score a proposed action against the EXACT eval-harness scoring table.

    Scoring criteria (from instructions.md / eval controller):
    ┌─────────────────────────────────────────────────────┬────────┐
    │ Action                                              │ Points │
    ├─────────────────────────────────────────────────────┼────────┤
    │ Win the game (flag all mines + reveal all safe)     │ +100   │
    │ Flag cell that IS a mine                            │  +15   │
    │ Reveal safe cell (logically deducible)              │  +15   │
    │ Reveal safe cell (randomly guessed)                 │  +10   │
    │ Flag cell that is NOT a mine                        │  -10   │
    │ Reveal cell that IS a mine (game over)              │  -25   │
    │ Flag already-flagged cell                           │   -8   │
    │ Reveal already-revealed cell                        │  -12   │
    │ Out of bounds                                       │  -15   │
    │ Total flags > total mines                           │  -10   │
    │ Invalid JSON                                        │  -10   │
    └─────────────────────────────────────────────────────┴────────┘

    The function reconstructs each game from its seed + move_history
    (both stored in the dataset row) so it can evaluate the proposed
    action in the correct board context.
    """
    scores = []

    # Per-example metadata injected by GRPOTrainer from the dataset columns
    seeds = kwargs.get("seed", [])
    move_histories = kwargs.get("move_history", [])
    row_counts = kwargs.get("rows", [])
    col_counts = kwargs.get("cols", [])
    mine_counts = kwargs.get("mines", [])

    for idx, completion in enumerate(completions):
        response = completion[0]["content"]
        action = parse_llm_action(response)

        # ── Criterion: Invalid JSON ──────────────────────────────
        if action is None:
            scores.append(-10.0)
            continue

        # ── Reconstruct the EXACT game state ─────────────────────
        if idx < len(seeds) and idx < len(move_histories):
            seed = seeds[idx]
            move_history_raw = move_histories[idx]

            if isinstance(move_history_raw, str):
                move_history = json.loads(move_history_raw)
            else:
                move_history = move_history_raw

            # Allow variable board sizes from dataset; default to a 6×6 board with 5 mines
            r_count = int(row_counts[idx]) if idx < len(row_counts) else 6
            c_count = int(col_counts[idx]) if idx < len(col_counts) else 6
            m_count = int(mine_counts[idx]) if idx < len(mine_counts) else 5

            game = MinesweeperGame(rows=r_count, cols=c_count,
                                   num_mines=m_count, seed=seed)
            for prev_action in move_history:
                game.do_action(prev_action)

            board = game.get_visible_board()
            row, col = int(action["row"]), int(action["col"])
            action_type = action["type"]

            # ── Criterion: Out of bounds ─────────────────────────
            if not (0 <= row < game.rows and 0 <= col < game.cols):
                scores.append(-15.0)
                continue

            cell_value = board[row][col]

            # ==============================================================
            #  REVEAL actions
            # ==============================================================
            if action_type == "reveal":

                # Criterion: Reveal already-revealed cell
                if cell_value not in (".", "F"):
                    scores.append(-12.0)
                    continue

                # Criterion: Reveal a flagged cell
                if cell_value == "F":
                    scores.append(-8.0)
                    continue

                # Cell is unrevealed (".") – execute the action
                revealed_before = len(game._revealed)
                result = game.do_action(action)
                revealed_after = len(game._revealed)

                if result == "mine":
                    # Criterion: Reveal a mine → game over
                    scores.append(-25.0)
                    continue

                if result == "win":
                    # Criterion: Win the game
                    scores.append(100.0)
                    continue

                # ── BREAKTHROUGH 1: CASCADE-SHAPED REWARD ────────────
                # How many NEW cells did this single move reveal?
                # A zero-cell cascade can reveal 10-20 cells at once.
                # Base: +10/+15 for safe reveal (matches eval harness)
                # Bonus: +1 per extra cell revealed beyond the first
                # This teaches the model to target zeros and corners.
                cells_opened = revealed_after - revealed_before
                cascade_bonus = max(0, cells_opened - 1) * 1.0  # +1 per extra cell

                # Safe reveal: was it logically deducible?
                if _is_logically_safe(board, row, col, game.rows, game.cols):
                    scores.append(15.0 + cascade_bonus)
                else:
                    scores.append(10.0 + cascade_bonus)
                continue

            # ==============================================================
            #  FLAG actions
            # ==============================================================
            elif action_type == "flag":

                # Criterion: Flag a revealed cell
                if cell_value not in (".", "F"):
                    scores.append(-8.0)
                    continue

                # Criterion: Flag already-flagged cell
                if cell_value == "F":
                    scores.append(-8.0)
                    continue

                # Cell is unrevealed ("."): execute the flag
                game.do_action(action)

                # Criterion: Total flags > total mines
                if len(game._flagged) > game.num_mines:
                    scores.append(-10.0)
                    continue

                # Check if flagged cell is actually a mine
                if game._board[row][col] == -1:
                    # Criterion: Flag cell that IS a mine
                    # Bonus if logically deducible
                    if _is_logically_mine(board, row, col, game.rows, game.cols):
                        score = 15.0 + 2.0   # small bonus for logical flag
                    else:
                        score = 15.0
                    # Check for win after flagging
                    game._check_win()
                    if game.state() == "success":
                        score += 100.0
                    scores.append(score)
                else:
                    # Criterion: Flag cell that is NOT a mine
                    scores.append(-10.0)
                continue

            # ==============================================================
            #  Unknown action type (shouldn't happen after parse)
            # ==============================================================
            else:
                scores.append(-10.0)
        else:
            # Fallback: no metadata available
            scores.append(0.0)

    return scores


# ---------------------------------------------------------------------------
#  Reward function 3: Brevity / conciseness reward
# ---------------------------------------------------------------------------
def brevity_reward(completions, **kwargs):
    """Reward concise responses that go straight to JSON.

    The eval harness sets max_new_tokens=128.  The ideal response is a
    single JSON object (~40-60 chars).  Penalise verbosity so the model
    learns to emit JSON immediately without chain-of-thought preamble.
    """
    scores = []
    for completion in completions:
        response = completion[0]["content"]
        length = len(response.strip())

        if length == 0:
            scores.append(-3.0)
        elif length <= 60:
            scores.append(2.0)    # Perfect: just the JSON
        elif length <= 120:
            scores.append(1.0)    # Acceptable
        elif length <= 200:
            scores.append(0.0)    # Neutral
        else:
            scores.append(-2.0)   # Too verbose: risks token overflow

    return scores


# ---------------------------------------------------------------------------
#  Reward function 4: Strict format match (starts with '{')
# ---------------------------------------------------------------------------
def strict_format_reward(completions, **kwargs):
    """Reward responses that start immediately with '{' and end with '}'.

    The eval harness prompt says: "Start your response immediately with
    { and end with }."  This reward teaches the model to skip any
    preamble text.
    """
    import re
    scores = []
    for completion in completions:
        response = completion[0]["content"].strip()
        # Perfect: starts with '{' and ends with '}' (the JSON itself is not validated here)
        if re.match(r'^\{.*\}$', response, re.DOTALL):
            scores.append(3.0)
        # Starts with '{' but has trailing text
        elif response.startswith('{'):
            scores.append(1.0)
        else:
            scores.append(-2.0)

    return scores


# ---------------------------------------------------------------------------
#  Reward function 5: BREAKTHROUGH 2: Oracle proximity reward
# ---------------------------------------------------------------------------
def _compute_oracle(seed, move_history, r_count, c_count, m_count):
    """Compute the optimal cell and set of safe cells for a game state.

    Returns (best_cell, safe_cells, board) where best_cell is (row, col)
    of the optimal move and safe_cells is a set of all logically-safe cells.
    The caller (oracle_reward) caches the result per unique game state, so
    all completions sharing the same prompt only compute this ONCE.

    PERFORMANCE: On large boards (>400 unrevealed cells), we only simulate
    the FRONTIER (unrevealed cells adjacent to revealed numbers) instead of
    every unrevealed cell.  This typically keeps the simulation count at
    ~100-200 (hard cap: MAX_SIMULATE) even on 50×50 boards, avoiding O(n²) blowup.
    """
    game = MinesweeperGame(rows=r_count, cols=c_count,
                           num_mines=m_count, seed=seed)
    for prev in move_history:
        game.do_action(prev)

    board = game.get_visible_board()

    # Collect ALL unrevealed cells
    all_unrevealed = []
    for r in range(r_count):
        for c in range(c_count):
            if board[r][c] == ".":
                all_unrevealed.append((r, c))

    if not all_unrevealed:
        return None, set(), board

    # ── FRONTIER OPTIMIZATION for large boards ───────────────────
    # Only consider cells adjacent to at least one revealed number.
    # These are the only cells where logical deduction is possible,
    # and the most strategically valuable moves.
    MAX_SIMULATE = 400  # Cap: don't simulate more than this many cells

    if len(all_unrevealed) > MAX_SIMULATE:
        frontier = set()
        for r in range(r_count):
            for c in range(c_count):
                cell = board[r][c]
                if cell not in (".", "F", "*", "0") and cell.isdigit():
                    # This is a revealed number -- add its unrevealed neighbours
                    for dr in [-1, 0, 1]:
                        for dc in [-1, 0, 1]:
                            nr, nc = r + dr, c + dc
                            if (0 <= nr < r_count and 0 <= nc < c_count
                                    and board[nr][nc] == "."):
                                frontier.add((nr, nc))
        candidates = list(frontier) if frontier else all_unrevealed[:MAX_SIMULATE]
    else:
        candidates = all_unrevealed

    # ── Compute safe cells (cheap: just logic, no simulation) ────
    safe_cells = set()
    for (r, c) in candidates:
        if _is_logically_safe(board, r, c, r_count, c_count):
            safe_cells.add((r, c))

    # ── Simulate reveals to find best cascade ────────────────────
    # Only simulate up to MAX_SIMULATE candidates
    best_cell = None
    best_score = -999

    sim_candidates = candidates[:MAX_SIMULATE]
    for (r, c) in sim_candidates:
        is_safe = (r, c) in safe_cells

        sim_game = MinesweeperGame(rows=r_count, cols=c_count,
                                   num_mines=m_count, seed=seed)
        for prev in move_history:
            sim_game.do_action(prev)

        rev_before = len(sim_game._revealed)
        sim_game.do_action({"type": "reveal", "row": r, "col": c})
        rev_after = len(sim_game._revealed)

        if sim_game.state() == "failed":
            continue  # Skip mines

        cascade = rev_after - rev_before
        cell_score = (1000 if is_safe else 0) + cascade

        if cell_score > best_score:
            best_score = cell_score
            best_cell = (r, c)

    return best_cell, safe_cells, board


def oracle_reward(completions, **kwargs):
    """Compare the model's action to the OPTIMAL action for this board state.

    Uses _compute_oracle with per-batch caching so the expensive simulation
    runs ONCE per unique game state, not once per completion (a saving of
    num_generations-fold, e.g. 16x with 16 completions per prompt).

    Scoring:
      +8  if the model revealed the EXACT optimal cell
      +5  if the model revealed a logically-safe cell (not optimal but great),
          or flagged a cell that is logically a mine
      +2  if the model revealed a cell adjacent to the optimal cell
       0  for any other parsed action, or when game metadata is missing
      -2  if the response could not be parsed into an action
    """
    scores = []

    seeds = kwargs.get("seed", [])
    move_histories = kwargs.get("move_history", [])
    row_counts = kwargs.get("rows", [])
    col_counts = kwargs.get("cols", [])
    mine_counts = kwargs.get("mines", [])

    # Cache oracle results per unique game state (valid for this call only)
    oracle_cache = {}

    for idx, completion in enumerate(completions):
        response = completion[0]["content"]
        action = parse_llm_action(response)

        if action is None:
            scores.append(-2.0)
            continue

        if idx >= len(seeds) or idx >= len(move_histories):
            scores.append(0.0)
            continue

        seed = seeds[idx]
        mh_raw = move_histories[idx]
        move_history = json.loads(mh_raw) if isinstance(mh_raw, str) else mh_raw

        r_count = int(row_counts[idx]) if idx < len(row_counts) else 6
        c_count = int(col_counts[idx]) if idx < len(col_counts) else 6
        m_count = int(mine_counts[idx]) if idx < len(mine_counts) else 5

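        # Cache key: seed + board config + move_history identifies a game state.
        # The seed alone is not enough: two samples can share a seed but differ in size.
        cache_key = (seed, r_count, c_count, m_count,
                     mh_raw if isinstance(mh_raw, str) else json.dumps(mh_raw))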
        if cache_key not in oracle_cache:
            oracle_cache[cache_key] = _compute_oracle(
                seed, move_history, r_count, c_count, m_count
            )

        best_cell, safe_cells, board = oracle_cache[cache_key]
        model_row, model_col = int(action["row"]), int(action["col"])

        if best_cell is None:
            scores.append(0.0)
            continue

        # ── Compare model's action to oracle ─────────────────────
        if action["type"] != "reveal":
            if action["type"] == "flag" and _is_logically_mine(
                board, model_row, model_col, r_count, c_count
            ):
                scores.append(5.0)
            else:
                scores.append(0.0)
            continue

        if (model_row, model_col) == best_cell:
            scores.append(8.0)   # EXACT optimal cell
        elif (model_row, model_col) in safe_cells:
            scores.append(5.0)   # Logically safe (not optimal but great)
        elif abs(model_row - best_cell[0]) <= 1 and abs(model_col - best_cell[1]) <= 1:
            scores.append(2.0)   # Adjacent to optimal (warm)
        else:
            scores.append(0.0)   # Valid but cold

    return scores


# ---------------------------------------------------------------------------
#  Reward function 6: Multi-step rollout bonus
# ---------------------------------------------------------------------------
def rollout_reward(completions, **kwargs):
    """Play a few additional random reveals after the proposed action and
    reward actions that leave the game in a better state.

    This gives the model a *strategic* signal: prefer moves that open up
    the board (more cells revealed) and don't immediately paint you into
    a corner.  It's the closest thing to Monte-Carlo tree search we can
    do inside a reward function without blowing up compute.

    Follow-up moves are uniformly random unrevealed cells (no safety check).
    The rollout depth scales with board size, from 5 to 15 moves.

    Scoring:
      +5  if the action wins the game, or the rollout reaches a win
      +3  if >60% of safe cells are revealed after the rollout
      +1  if >30% of safe cells are revealed
       0  neutral (game still ongoing, <=30% revealed) or metadata missing
      -1  if the rollout hits a mine
      -3  if the response could not be parsed, or the action itself hits a mine
    """
    scores = []

    seeds = kwargs.get("seed", [])
    move_histories = kwargs.get("move_history", [])
    row_counts = kwargs.get("rows", [])
    col_counts = kwargs.get("cols", [])
    mine_counts = kwargs.get("mines", [])

    for idx, completion in enumerate(completions):
        response = completion[0]["content"]
        action = parse_llm_action(response)

        if action is None:
            scores.append(-3.0)
            continue

        if idx >= len(seeds) or idx >= len(move_histories):
            scores.append(0.0)
            continue

        seed = seeds[idx]
        mh_raw = move_histories[idx]
        move_history = json.loads(mh_raw) if isinstance(mh_raw, str) else mh_raw

        r_count = int(row_counts[idx]) if idx < len(row_counts) else 6
        c_count = int(col_counts[idx]) if idx < len(col_counts) else 6
        m_count = int(mine_counts[idx]) if idx < len(mine_counts) else 5

        # Scale rollout depth with board size:
        # 5×5=25 cells → 5 moves, 50×50=2500 cells → 15 moves (capped)
        board_cells = r_count * c_count
        rollout_depth = min(15, max(5, board_cells // 100))

        # Reconstruct game
        game = MinesweeperGame(rows=r_count, cols=c_count,
                               num_mines=m_count, seed=seed)
        for prev in move_history:
            game.do_action(prev)

        # Execute the proposed action
        result = game.do_action(action)
        if game.state() != "ongoing":
            if game.state() == "success":
                scores.append(5.0)
            else:
                scores.append(-3.0)
            continue

        # Rollout: play up to rollout_depth uniformly random reveals (these may hit mines)
        rng = random.Random(seed + 99999)
        for _ in range(rollout_depth):
            if game.state() != "ongoing":
                break
            board = game.get_visible_board()
            candidates = []
            for r in range(game.rows):
                for c in range(game.cols):
                    if board[r][c] == ".":
                        candidates.append((r, c))
            if not candidates:
                break
            r, c = rng.choice(candidates)
            game.do_action({"type": "reveal", "row": r, "col": c})

        safe_cells = game.rows * game.cols - game.num_mines
        reveal_ratio = len(game._revealed) / safe_cells if safe_cells > 0 else 0

        if game.state() == "success":
            scores.append(5.0)
        elif game.state() == "failed":
            # The rollout hit a mine -- the proposed move wasn't fatal
            # but the board state is dangerous. Mild penalty.
            scores.append(-1.0)
        elif reveal_ratio > 0.6:
            scores.append(3.0)
        elif reveal_ratio > 0.3:
            scores.append(1.0)
        else:
            scores.append(0.0)

    return scores
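
The frontier restriction described in `_compute_oracle` can be illustrated on a toy board. The following is a minimal standalone sketch, assuming the same board encoding as above (`"."` for unrevealed, digit strings for revealed counts); `frontier_cells` and `toy_board` are hypothetical names used only for this illustration and are not part of the notebook.

```python
def frontier_cells(board):
    """Return unrevealed cells ('.') that touch at least one revealed number 1-8."""
    rows, cols = len(board), len(board[0])
    frontier = set()
    for r in range(rows):
        for c in range(cols):
            cell = board[r][c]
            # Only revealed, non-zero numbers constrain their neighbours.
            if not (cell.isdigit() and cell != "0"):
                continue
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    nr, nc = r + dr, c + dc
                    if 0 <= nr < rows and 0 <= nc < cols and board[nr][nc] == ".":
                        frontier.add((nr, nc))
    return frontier


# Toy 4x4 board (illustrative only): 10 unrevealed cells in total.
toy_board = [
    ["0", "1", ".", "."],
    ["0", "1", ".", "."],
    ["1", "2", ".", "."],
    [".", ".", ".", "."],
]
toy_frontier = frontier_cells(toy_board)
print(sorted(toy_frontier))
```

On this board 6 of the 10 unrevealed cells border a revealed number. The remaining 4 carry no local constraint, which is why skipping them on large boards reduces the number of simulated reveals without discarding any cell where deduction is possible.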

# Create Training Dataset (Curriculum)

Generate 5000 diverse game states with curriculum learning. The percentages are sampling weights; games that end during pre-moves are discarded, so the realised mix can differ:
- **Easy** (25%): 5-8 rows/cols, 8-12% mines, 0-2 pre-moves
- **Medium** (25%): 8-15 rows/cols, 12-16% mines, 0-6 pre-moves
- **Large** (20%): 15-25 rows/cols, 14-18% mines, 0-10 pre-moves
- **Hard** (10%): 10-20 rows/cols, high mine density (16-22%), 0-8 pre-moves
- **Flags** (10%): 8-25 rows/cols, 12-18% mines, 0-8 pre-moves plus 1-3 pre-placed flags
- **Wildcard** (10%): full 5-25 training range, 10-20% mines, 0-5 pre-moves
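
Tier weights are only meaningful if they sum to 1, so a quick check before sampling can catch a mismatch between the description and the table. The sketch below is a standalone illustration: `TIERS`, `check_weights`, and `sample_tiers` are hypothetical names, the weights are copied from the `CURRICULUM` table in the next cell, and `random.choices` stands in for the notebook's own cumulative-sum picker.

```python
import random
from collections import Counter

# Hypothetical (name, weight) table; weights copied from CURRICULUM.
TIERS = [
    ("easy", 0.25), ("medium", 0.25), ("large", 0.20),
    ("hard", 0.10), ("flags", 0.10), ("wildcard", 0.10),
]


def check_weights(tiers, tol=1e-9):
    """Raise if the tier weights do not sum to 1 (within tolerance)."""
    total = sum(weight for _, weight in tiers)
    if abs(total - 1.0) > tol:
        raise ValueError(f"tier weights sum to {total:.3f}, expected 1.0")
    return total


def sample_tiers(tiers, n, seed=0):
    """Draw n tier names with probability proportional to weight."""
    rng = random.Random(seed)
    names = [name for name, _ in tiers]
    weights = [weight for _, weight in tiers]
    return rng.choices(names, weights=weights, k=n)


check_weights(TIERS)
draws = sample_tiers(TIERS, 10_000, seed=42)
counts = Counter(draws)
for name, _ in TIERS:
    print(f"{name:>8s}: {counts[name] / len(draws):.1%}")
```

The sampled shares should sit close to the configured weights. The shares in the final dataset can still drift from them, because game states that are no longer ongoing after the pre-moves are rejected.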

In [8]:
from datasets import Dataset
from collections import Counter

# ── Curriculum difficulty tiers ──────────────────────────────────
# Training boards capped at 25×25 to fit max_prompt_length=3584.
# Eval boards can be up to 50×50 -- this relies on spatial reasoning transferring.
CURRICULUM = [
    # (weight, rows_range, cols_range, mine_pct_range, max_premoves, include_flags)
    (0.25, (5, 8),   (5, 8),   (0.08, 0.12), 2,  False),  # Easy:  fast reward signal
    (0.25, (8, 15),  (8, 15),  (0.12, 0.16), 6,  False),  # Medium
    (0.20, (15, 25), (15, 25), (0.14, 0.18), 10, False),  # Large: up to 25×25
    (0.10, (10, 20), (10, 20), (0.16, 0.22), 8,  False),  # Hard:  high mine density
    (0.10, (8, 25),  (8, 25),  (0.12, 0.18), 8,  True),   # Flag-heavy
    (0.10, (5, 25),  (5, 25),  (0.10, 0.20), 5,  False),  # Wildcard: full training range
]


def _pick_difficulty(rng):
    """Weighted random pick from CURRICULUM tiers."""
    r = rng.random()
    cumulative = 0.0
    for tier in CURRICULUM:
        cumulative += tier[0]
        if r <= cumulative:
            return tier
    return CURRICULUM[-1]


def generate_game_states(num_samples=5000, rng_seed=42):
    """Generate num_samples diverse Minesweeper game states with curriculum.

    Training boards capped at 25×25 to fit max_prompt_length. Curriculum tiers:
      Easy     (25%): 5-8,   ~10% mines, 0-2 pre-moves
      Medium   (25%): 8-15,  ~14% mines, 0-6 pre-moves
      Large    (20%): 15-25, ~16% mines, 0-10 pre-moves
      Hard     (10%): 10-20, ~19% mines (high density), 0-8 pre-moves
      Flags    (10%): 8-25,  ~15% mines, 0-8 pre-moves, 1-3 pre-placed flags
      Wildcard (10%): 5-25,  ~15% mines, 0-5 pre-moves

    Stores seed + move_history (as JSON string) so reward functions can
    reconstruct the EXACT game state.
    """
    rng = random.Random(rng_seed)
    np_rng = np.random.RandomState(rng_seed)

    dataset_items = []
    tier_counts = Counter()
    attempts = 0
    max_attempts = num_samples * 4

    while len(dataset_items) < num_samples and attempts < max_attempts:
        attempts += 1

        weight, rows_range, cols_range, mine_pct_range, max_premoves, include_flags = _pick_difficulty(rng)
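        # Classify tier for logging only. The label is derived from the tier's
        # max board size, so Hard (max 20) and Wildcard (max 25) are both
        # counted as "large" and the "wildcard" branch below is never reached.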
        board_max = max(rows_range[1], cols_range[1])
        if include_flags:
            tier_name = "flags"
        elif board_max <= 8:
            tier_name = "easy"
        elif board_max <= 15:
            tier_name = "medium"
        elif board_max <= 25:
            tier_name = "large"
        else:
            tier_name = "wildcard"

        rows = rng.randint(rows_range[0], rows_range[1])
        cols = rng.randint(cols_range[0], cols_range[1])
        mine_pct = rng.uniform(mine_pct_range[0], mine_pct_range[1])
        num_mines = max(1, min(int(rows * cols * mine_pct), rows * cols - 2))

        seed = np_rng.randint(0, 1_000_000)
        game = MinesweeperGame(rows=rows, cols=cols, num_mines=num_mines, seed=seed)

        # Pre-moves: reveal random unrevealed cells to create mid-game states.
        # A pre-move may hit a mine; such games are dropped by the ongoing check below.
        num_moves = rng.randint(0, max_premoves)
        move_history = []

        for _ in range(num_moves):
            if game.state() != "ongoing":
                break
            board = game.get_visible_board()
            unrevealed = [(r, c) for r in range(rows) for c in range(cols) if board[r][c] == '.']
            if not unrevealed:
                break
            r, c = rng.choice(unrevealed)
            action = {"type": "reveal", "row": r, "col": c}
            game.do_action(action)
            move_history.append(action)

        # Flag-heavy tier: also place some random flags
        if include_flags and game.state() == "ongoing":
            board = game.get_visible_board()
            unrevealed = [(r, c) for r in range(rows) for c in range(cols) if board[r][c] == '.']
            num_flags = rng.randint(1, min(3, len(unrevealed), num_mines))
            for _ in range(num_flags):
                if not unrevealed:
                    break
                r, c = rng.choice(unrevealed)
                unrevealed.remove((r, c))
                flag_action = {"type": "flag", "row": r, "col": c}
                game.do_action(flag_action)
                move_history.append(flag_action)

        # Only keep ongoing games
        if game.state() != "ongoing":
            continue

        prompt_text = format_state_for_llm(game)

        # Build chat messages with system prompt for Qwen2.5
        dataset_items.append({
            "prompt": [
                {"role": "system", "content": SYSTEM_PROMPT},
                {"role": "user",   "content": prompt_text},
            ],
            "seed": seed,
            "move_history": json.dumps(move_history),
            "rows": rows,
            "cols": cols,
            "mines": num_mines,
        })
        tier_counts[tier_name] += 1

    return Dataset.from_list(dataset_items), tier_counts


# ── Generate training dataset ────────────────────────────────────
print("Generating curriculum training dataset...")
dataset, tier_stats = generate_game_states(num_samples=5000, rng_seed=42)
print(f"Created {len(dataset)} training examples (all ongoing games)\n")

print("Curriculum distribution:")
for tier, count in sorted(tier_stats.items()):
    print(f"  {tier:>8s}: {count:4d} ({100*count/len(dataset):.1f}%)")

# Board size distribution
sizes = Counter()
for item in dataset:
    sizes[f"{item['rows']}x{item['cols']}"] += 1
print(f"\nUnique board sizes: {len(sizes)}")
print(f"Top 5: {sizes.most_common(5)}")

# Show example
print(f"\nExample prompt (first 400 chars):")
print(dataset[0]["prompt"][1]["content"][:400] + "...")
print(f"Seed: {dataset[0]['seed']}, Moves: {len(json.loads(dataset[0]['move_history']))}, "
      f"Board: {dataset[0]['rows']}x{dataset[0]['cols']}, Mines: {dataset[0]['mines']}")

Generating curriculum training dataset...
Created 5000 training examples (all ongoing games)

Curriculum distribution:
      easy: 1710 (34.2%)
     flags:  433 (8.7%)
     large: 1668 (33.4%)
    medium: 1189 (23.8%)

Unique board sizes: 404
Top 5: [('7x5', 129), ('7x7', 126), ('8x8', 123), ('5x6', 116), ('8x5', 115)]

Example prompt (first 400 chars):
You are playing Minesweeper. Analyze the game state and output your next move.

You must output ONLY a valid JSON object. No explanation, no analysis, no text.

Just output section after assistantfinal and not anything before it in your output.

Start your response immediately with { and end with }.

Do NOT output cell which is already revealed or flagged in the current state.

Game state:
{
  "bo...
Seed: 121958, Moves: 2, Board: 15x19, Mines: 42
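
The printed distribution lists only four labels even though `CURRICULUM` defines six tiers. The sketch below replays the labelling rule from `generate_game_states` on the tier ranges to show where each tier ends up. `TIER_SPECS` and `log_label` are illustrative names, not notebook code, and the ranges are copied from `CURRICULUM`.

```python
# (tier, rows_range, cols_range, include_flags), copied from CURRICULUM.
TIER_SPECS = [
    ("Easy",     (5, 8),   (5, 8),   False),
    ("Medium",   (8, 15),  (8, 15),  False),
    ("Large",    (15, 25), (15, 25), False),
    ("Hard",     (10, 20), (10, 20), False),
    ("Flags",    (8, 25),  (8, 25),  True),
    ("Wildcard", (5, 25),  (5, 25),  False),
]


def log_label(rows_range, cols_range, include_flags):
    """Same branching as the tier_name logic in generate_game_states."""
    board_max = max(rows_range[1], cols_range[1])
    if include_flags:
        return "flags"
    if board_max <= 8:
        return "easy"
    if board_max <= 15:
        return "medium"
    if board_max <= 25:
        return "large"
    return "wildcard"


labels = {tier: log_label(rr, cr, flags) for tier, rr, cr, flags in TIER_SPECS}
for tier, label in labels.items():
    print(f"{tier:>8s} -> {label}")
```

Hard and Wildcard both fall into the `large` bucket, which accounts for the missing labels. The remaining gap between the configured weights and the logged shares is plausibly due to rejected games: tiers with more pre-moves and denser mines hit a mine more often before the state is saved.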


# Configure GRPO Training

Optimized for Qwen2.5-14B on 262 GB MI300x VRAM, with training boards up to 25x25 (eval boards go up to 50x50).
Three breakthroughs: cascade-shaped reward, oracle proximity reward, and num_iterations=2:

In [9]:
from trl import GRPOConfig, GRPOTrainer

# Training boards capped at 25×25 ≈ 2400 tokens + prompt ≈ 300 = ~2700.
# 3584 gives comfortable headroom.  Larger boards handled at eval time.
max_prompt_length = 3584
max_completion_length = 64

# ── VRAM BUDGET ──────────────────────────────────────────────────
# Available: 262 GB.  Model + optim ≈ 57 GB.  Headroom: ~205 GB.
#
# With max_prompt_length=3584 and num_generations=16:
#   4 prompts × 16 gens = 64 seqs × 3648 tokens ≈ 38 GB KV
#   + 57 GB model/optim = ~95 GB.  Fits in 262 GB.

training_args = GRPOConfig(
    # ── Generation ───────────────────────────────────────────────
    temperature = 2.0,
    num_generations = 16,          # 16 completions/prompt (down from 32: longer prompts)
    max_prompt_length = max_prompt_length,
    max_completion_length = max_completion_length,

    # ── Proven fixes ─────────────────────────────────────────────
    scale_rewards = False,
    mask_truncated_completions = True,
    beta = 0.06,                   # Bumped from 0.04: extra KL damping for 2nd iteration

    # ── BREAKTHROUGH 3: num_iterations=2 ─────────────────────────
    # Reuse same batch of completions for 2 gradient updates.
    # No extra generation cost: same VRAM, 2 updates per generation batch.
    # IMPORTANT: halved LR (1e-4 → 5e-5) to compensate -- the 2nd
    # iteration uses stale advantages so full LR causes overshoot
    # (step 4 hit loss=5.6 with 1e-4).
    num_iterations = 2,

    # ── Optimiser ────────────────────────────────────────────────
    learning_rate = 5e-5,          # Halved: 2 iters × 5e-5 ≈ effective 1e-4
    weight_decay = 0.01,
    warmup_ratio = 0.03,
    lr_scheduler_type = "cosine",
    optim = "adamw_8bit",

    # ── Batching ─────────────────────────────────────────────────
    # Intended: 4 prompts × 16 gens = 64 seqs × 3648 tokens ≈ 38 GB KV + 57 GB model = ~95 GB,
    # which fits comfortably in 262 GB with room for activations.
    # NOTE: Unsloth requires the batch size to be a multiple of num_generations
    # and raises 4 to 16 at config time (see the log printed after this cell).
    per_device_train_batch_size = 4,
    gradient_accumulation_steps = 2,
    # Intended: 4 prompts × 16 gens × 2 accum × 2 iters = 256 completion updates per batch

    # ── Training duration ────────────────────────────────────────
    max_steps = 200,
    save_steps = 50,
    logging_steps = 1,

    # ── Precision & output ───────────────────────────────────────
    bf16 = True,
    report_to = "none",
    output_dir = "/workspace/minesweeper_qwen_outputs_v2",
)

print("=" * 60)
print("GRPO Config v3 -- THREE BREAKTHROUGHS")
print("=" * 60)
print(f"  Generations/prompt: {training_args.num_generations}")
print(f"  Batch size:         {training_args.per_device_train_batch_size}")
print(f"  Grad accum:         {training_args.gradient_accumulation_steps}")
print(f"  num_iterations:     {training_args.num_iterations}  ← B3: 2x learning/batch")
print(f"  Completions/step:   {training_args.num_generations * training_args.per_device_train_batch_size} × {training_args.num_iterations} iters")
print(f"  Learning rate:      {training_args.learning_rate}")
print(f"  Temperature:        {training_args.temperature}")
print(f"  scale_rewards:      {training_args.scale_rewards}")
print(f"  beta (KL):          {training_args.beta}")
print(f"  Max steps:          {training_args.max_steps}")
print("=" * 60)

Unsloth: We now expect `per_device_train_batch_size` to be a multiple of `num_generations`.
We will change the batch size of 4 to the `num_generations` of 16
GRPO Config v3 -- THREE BREAKTHROUGHS
  Generations/prompt: 16
  Batch size:         16
  Grad accum:         2
  num_iterations:     2  ← B3: 2x learning/batch
  Completions/step:   256 × 2 iters
  Learning rate:      5e-05
  Temperature:        2.0
  scale_rewards:      none
  beta (KL):          0.06
  Max steps:          200
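
The interaction between `scale_rewards = False` and several reward functions can be made concrete with a small worked example. The sketch below assumes that the per-function scores are summed with unit weights and that, with scaling disabled, each completion's advantage is its total reward minus the group mean (no division by the standard deviation). `total_rewards` and `group_advantages` are hypothetical helpers for illustration, not TRL or Unsloth APIs, and the numbers are invented.

```python
def total_rewards(per_function_scores):
    """Sum the scores of several reward functions per completion (unit weights assumed)."""
    return [sum(scores) for scores in zip(*per_function_scores)]


def group_advantages(rewards):
    """Mean-centre rewards within one prompt's group of completions (no std scaling)."""
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]


# Four completions for one prompt (invented numbers).
safety_rail = [0.0, 0.0, 0.0, -30.0]   # last completion failed to parse
gameplay    = [15.0, 10.0, -25.0, 0.0]

totals = total_rewards([safety_rail, gameplay])
advantages = group_advantages(totals)
print(totals)
print(advantages)

# A reward that is identical for every completion contributes nothing:
saturated = group_advantages([3.0, 3.0, 3.0, 3.0])
print(saturated)
```

Under these assumptions a reward function that returns the same value for every completion in a group cancels out after mean-centring, while a single large penalty shifts the advantage of every completion in the group.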


In [10]:
from transformers import TrainerCallback

class MinesweeperEvalCallback(TrainerCallback):
    """Periodically play full games during training and log win rate + avg score.

    Uses the same prompt template and generation params as the eval harness
    (temperature=0.3, max_new_tokens=128) for realistic performance estimation.
    """

    def __init__(self, eval_every_steps=50, num_games=10):
        self.eval_every_steps = eval_every_steps
        self.num_games = num_games

    def on_step_end(self, args, state, control, model=None, processing_class=None, **kwargs):
        if state.global_step % self.eval_every_steps != 0:
            return

        tokenizer = processing_class
        if tokenizer is None or model is None:
            return

        was_training = model.training
        model.eval()

        wins = 0
        total_score = 0
        valid_json_count = 0
        total_actions = 0

        # Test across board sizes: half small, quarter medium, quarter large
        eval_configs = []
        half = self.num_games // 2
        quarter = self.num_games // 4
        for i in range(half):
            eval_configs.append((6, 6, 5, 10000 + i))         # Small
        for i in range(quarter):
            eval_configs.append((15, 15, 30, 20000 + i))      # Medium
        for i in range(self.num_games - half - quarter):
            eval_configs.append((30, 30, 100, 30000 + i))     # Large

        for rows, cols, mines, seed in eval_configs:
            game = MinesweeperGame(rows=rows, cols=cols, num_mines=mines, seed=seed)
            moves = 0
            game_score = 0
            max_moves_eval = max(50, (rows * cols - mines) * 2)

            while game.state() == "ongoing" and moves < max_moves_eval:
                # Build messages with system prompt (matches eval harness)
                messages = build_chat_messages(game)
                text = tokenizer.apply_chat_template(
                    messages,
                    tokenize=False,
                    add_generation_prompt=True,
                )
                # Tokenize once and reuse the tensors for generation and slicing
                inputs = tokenizer(text, return_tensors="pt").to(model.device)
                output = model.generate(
                    **inputs,
                    temperature=0.3,       # Match eval config
                    max_new_tokens=128,    # Match eval config
                    do_sample=True,
                    top_p=0.9,
                    repetition_penalty=1.2,
                )
                # Decode only generated tokens (skip prompt)
                prompt_len = inputs["input_ids"].shape[1]
                response = tokenizer.decode(output[0][prompt_len:], skip_special_tokens=True)
                action = parse_llm_action(response)
                total_actions += 1

                if action is None:
                    game_score -= 10
                    break

                valid_json_count += 1
                result = game.do_action(action)

                if result == "mine":
                    game_score -= 25
                elif result == "win":
                    game_score += 100
                elif result == "ok":
                    game_score += 10
                elif result == "already_revealed":
                    game_score -= 12
                elif result == "out_of_bounds":
                    game_score -= 15

                moves += 1

            if game.state() == "success":
                wins += 1
            total_score += game_score

        win_rate = wins / self.num_games
        avg_score = total_score / self.num_games
        json_rate = valid_json_count / max(total_actions, 1)

        print(f"\n{'='*50}")
        print(f"[Eval @ step {state.global_step}]")
        print(f"  Win rate:  {wins}/{self.num_games} ({win_rate*100:.0f}%)")
        print(f"  Avg score: {avg_score:.1f}")
        print(f"  JSON rate: {json_rate*100:.0f}%")
        print(f"{'='*50}\n")

        if was_training:
            model.train()

eval_callback = MinesweeperEvalCallback(eval_every_steps=50, num_games=10)
print("Eval callback: plays 10 games every 50 steps (temp=0.3, matching eval harness)")

Eval callback: plays 10 games every 50 steps (temp=0.3, matching eval harness)
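
The callback's game mix and per-game move cap follow from two small integer formulas, reproduced below as a standalone sketch. `split_eval_games` and `move_cap` are hypothetical helper names that mirror the arithmetic in `MinesweeperEvalCallback.on_step_end`; they are not part of the notebook.

```python
def split_eval_games(num_games):
    """Split games into small / medium / large boards as the callback does."""
    half = num_games // 2
    quarter = num_games // 4
    return {"small": half, "medium": quarter, "large": num_games - half - quarter}


def move_cap(rows, cols, mines):
    """Upper bound on moves per eval game: twice the safe cells, at least 50."""
    return max(50, (rows * cols - mines) * 2)


for n in (1, 4, 10):
    print(n, split_eval_games(n))

for rows, cols, mines in [(6, 6, 5), (15, 15, 30), (30, 30, 100)]:
    print(f"{rows}x{cols}, {mines} mines -> cap {move_cap(rows, cols, mines)}")
```

Because of integer division the remainder goes to the large boards: with `num_games=10` the split is 5 small, 2 medium, and 3 large, and with very small `num_games` the evaluation is dominated by 30x30 games, which are also the slowest to play.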


# Train the Model

Start GRPO training with the four active reward functions (safety rail, gameplay, oracle, rollout):

In [None]:
# ── ASYMMETRIC FORMAT SAFETY RAIL ────────────────────────────────
# Run 3 proved: without format rewards, the model collapses to
# 128-token garbage that never produces valid JSON.
#
# But format rewards were saturated (all completions identical) in
# runs 1-2 because the model always output perfect JSON.
#
# Solution: ASYMMETRIC reward that only activates when JSON breaks.
# When JSON is valid → +0 (no gradient, doesn't interfere with
# gameplay learning). When JSON is INVALID → -30 (massive penalty,
# prevents collapse). This creates gradient ONLY when the model
# starts to lose its JSON ability.

def json_safety_rail(completions, **kwargs):
    """Asymmetric format reward: 0 when valid, -30 when invalid.

    This doesn't interfere with gameplay learning (valid JSON → 0,
    no gradient contribution) but creates a HUGE penalty cliff if
    the model starts outputting garbage, preventing the collapse
    seen in run 3.
    """
    scores = []
    for completion in completions:
        response = completion[0]["content"]
        action = parse_llm_action(response)
        if action is None:
            scores.append(-30.0)   # MASSIVE penalty for losing JSON
        else:
            scores.append(0.0)     # No reward for valid JSON (saturated anyway)
    return scores

trainer = GRPOTrainer(
    model = model,
    processing_class = tokenizer,
    reward_funcs = [
        json_safety_rail,     # 0 for valid, -30 for invalid (collapse prevention)
        gameplay_scores,      # +15 safe, -25 mine, +100 win + CASCADE BONUS (MAIN)
        oracle_reward,        # BREAKTHROUGH 2: +8 optimal, +5 safe, +2 adjacent
        rollout_reward,       # Strategic depth
    ],
    args = training_args,
    train_dataset = dataset,
    callbacks = [eval_callback],
)

print("=" * 60)
print("GRPO v3 -- THREE BREAKTHROUGHS")
print("=" * 60)
print("  B1: Cascade-shaped reward (gameplay_scores now +1/extra cell)")
print("  B2: Oracle proximity reward (+8 optimal, +5 safe, +2 adjacent)")
print("  B3: num_iterations=2 (2x learning per generation batch)")
print("  ---")
print(f"  {training_args.num_generations * training_args.per_device_train_batch_size} completions × 2 iterations per step")
print(f"  LR={training_args.learning_rate}, temp={training_args.temperature}, beta={training_args.beta}")
print(f"  4 reward functions: safety_rail + gameplay + oracle + rollout")
print(f"  Target: push total reward to 50+ in 200 steps")
print("=" * 60)
print()
trainer.train()

The tokenizer has new PAD/BOS/EOS tokens that differ from the model config and generation config. The model config and generation config were aligned accordingly, being updated with the tokenizer's values. Updated tokens: {'bos_token_id': None}.


GRPO v3 -- THREE BREAKTHROUGHS
  B1: Cascade-shaped reward (gameplay_scores now +1/extra cell)
  B2: Oracle proximity reward (+8 optimal, +5 safe, +2 adjacent)
  B3: num_iterations=2 (2x learning per generation batch)
  ---
  256 completions × 2 iterations per step
  LR=5e-05, temp=2.0, beta=0.06
  4 reward functions: safety_rail + gameplay + oracle + rollout
  Target: push total reward to 50+ in 200 steps



==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 5,000 | Num Epochs = 1 | Total steps = 200
O^O/ \_/ \    Batch size per device = 16 | Gradient accumulation steps = 2
\        /    Data Parallel GPUs = 1 | Total batch size (16 x 2 x 1) = 32
 "-____-"     Trainable parameters = 275,251,200 of 15,045,284,864 (1.83% trained)


Unsloth: Will smartly offload gradients to save VRAM!


Step,Training Loss,reward,reward_std,completions / mean_length,completions / min_length,completions / max_length,completions / clipped_ratio,completions / mean_terminated_length,completions / min_terminated_length,completions / max_terminated_length,sampling / sampling_logp_difference / mean,sampling / sampling_logp_difference / max,sampling / importance_sampling_ratio / min,sampling / importance_sampling_ratio / mean,sampling / importance_sampling_ratio / max,kl,rewards / json_safety_rail / mean,rewards / json_safety_rail / std,rewards / gameplay_scores / mean,rewards / gameplay_scores / std,rewards / oracle_reward / mean,rewards / oracle_reward / std,rewards / rollout_reward / mean,rewards / rollout_reward / std
1,0.0766,-9.031250,14.460291,29.000000,19.000000,64.000000,0.187500,20.923079,19.000000,55.000000,0,0,0,0,0,0.001772,0.000000,0.000000,-6.656250,13.311671,0.125000,0.491869,-2.500000,0.879883
2,0.0766,No Log,No Log,No Log,No Log,No Log,No Log,No Log,No Log,No Log,No Log,No Log,No Log,No Log,No Log,0.001772,No Log,No Log,No Log,No Log,No Log,No Log,No Log,No Log
3,-0.0642,-1.968750,17.738264,34.593750,19.000000,64.000000,0.218750,26.359999,19.000000,64.000000,No Log,No Log,No Log,No Log,No Log,0.006315,-1.875000,7.378040,1.125000,15.156708,0.656250,1.961062,-1.875000,1.008032
4,1.155,No Log,No Log,No Log,No Log,No Log,No Log,No Log,No Log,No Log,No Log,No Log,No Log,No Log,No Log,0.25605,No Log,No Log,No Log,No Log,No Log,No Log,No Log,No Log
5,0.2964,-6.781250,17.005878,36.781250,19.000000,64.000000,0.343750,22.523809,19.000000,45.000000,No Log,No Log,No Log,No Log,No Log,0.722259,0.000000,0.000000,-5.437500,15.061031,0.781250,1.995711,-2.125000,1.008032
6,-0.2298,No Log,No Log,No Log,No Log,No Log,No Log,No Log,No Log,No Log,No Log,No Log,No Log,No Log,No Log,0.466889,No Log,No Log,No Log,No Log,No Log,No Log,No Log,No Log
7,0.0953,-6.687500,17.275887,32.781250,19.000000,64.000000,0.250000,22.375000,19.000000,56.000000,No Log,No Log,No Log,No Log,No Log,1.212568,-1.875000,7.378040,-2.625000,17.333075,-0.125000,0.491869,-2.062500,1.014015
8,-0.5852,No Log,No Log,No Log,No Log,No Log,No Log,No Log,No Log,No Log,No Log,No Log,No Log,No Log,No Log,0.927773,No Log,No Log,No Log,No Log,No Log,No Log,No Log,No Log
9,0.3995,9.156250,14.609549,39.062500,19.000000,64.000000,0.375000,24.100000,19.000000,54.000000,No Log,No Log,No Log,No Log,No Log,2.814915,-0.937500,5.303301,11.906250,27.187946,0.000000,0.508000,-1.812500,1.090649
10,2.4012,No Log,No Log,No Log,No Log,No Log,No Log,No Log,No Log,No Log,No Log,No Log,No Log,No Log,No Log,11.242902,No Log,No Log,No Log,No Log,No Log,No Log,No Log,No Log



[Eval @ step 50]
  Win rate:  0/10 (0%)
  Avg score: 6.5
  JSON rate: 100%


[Eval @ step 100]
  Win rate:  0/10 (0%)
  Avg score: -2.6
  JSON rate: 100%
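
The asymmetric safety rail in the training cell above depends on how GRPO turns rewards into advantages. A minimal sketch, assuming advantages are the rewards standardized within one group of completions for the same prompt (reward minus group mean, divided by group standard deviation). `group_advantages` is a hypothetical helper for illustration, not the TRL implementation:

```python
import statistics

def group_advantages(rewards, eps=1e-4):
    """Standardize rewards within one group of completions (simplified GRPO advantage)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# All four completions parse as valid JSON: the rail scores 0 for each one,
# so it adds nothing to the spread of rewards within the group.
all_valid = group_advantages([0.0, 0.0, 0.0, 0.0])

# One completion breaks JSON and takes the -30 penalty: only now does the
# rail separate completions, pushing the invalid one far below the rest.
one_invalid = group_advantages([0.0, 0.0, 0.0, -30.0])

print(all_valid)
print(one_invalid)
```

With every completion valid, the rail contributes a constant and the other reward functions alone decide the ranking. This matches the `rewards / json_safety_rail / std` column reading 0.000000 on the logged steps where its mean is 0.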



# Test Trained Model

Evaluate the finetuned model:

In [None]:
# Test on new game with eval-harness-matching generation params
from transformers import TextStreamer  # streams generated tokens to stdout
test_game = MinesweeperGame(rows=6, cols=6, num_mines=5)
messages = build_chat_messages(test_game)

test_text = tokenizer.apply_chat_template(
    messages,
    tokenize = False,
    add_generation_prompt = True,
)

print("=== Trained Model Response (Qwen2.5-14B-Instruct + LoRA) ===")
# Tokenize once and reuse the result for generation and prompt-length slicing
inputs = tokenizer(test_text, return_tensors = "pt").to("cuda")
output = model.generate(
    **inputs,
    temperature = 0.3,       # Match eval config
    max_new_tokens = 128,
    do_sample = True,
    top_p = 0.9,
    repetition_penalty = 1.2,
    streamer = TextStreamer(tokenizer, skip_prompt = True),
)

# Parse and test action (decode only the newly generated tokens)
prompt_len = inputs["input_ids"].shape[1]
response_text = tokenizer.decode(output[0][prompt_len:], skip_special_tokens=True)
action = parse_llm_action(response_text)
print(f"\nParsed action: {action}")

if action:
    result = test_game.do_action(action)
    print(f"Result: {result}")
    print(f"Game state: {test_game.state()}")
    print(test_game.pretty_print())
else:
    print("FAILED: Could not parse valid JSON action from response")

# Evaluation: Play Complete Games

Test the model on multiple complete games:

In [None]:
def play_full_game(model, tokenizer, rows=6, cols=6, num_mines=5, seed=None, max_moves=None):
    """Play a complete Minesweeper game using eval-harness-matching params.

    Generation config matches minesweeper_config.yaml:
      temperature=0.3, top_p=0.9, repetition_penalty=1.2, max_new_tokens=128

    If max_moves is not given, it defaults to max(50, 2 * safe cells),
    so the move budget scales with board size.
    """
    if max_moves is None:
        max_moves = max(50, (rows * cols - num_mines) * 2)
    game = MinesweeperGame(rows=rows, cols=cols, num_mines=num_mines, seed=seed)
    moves = 0
    score = 0

    while game.state() == "ongoing" and moves < max_moves:
        messages = build_chat_messages(game)
        text = tokenizer.apply_chat_template(
            messages,
            tokenize = False,
            add_generation_prompt = True,
        )

        # Tokenize once and reuse the result for generation and prompt-length slicing
        inputs = tokenizer(text, return_tensors = "pt").to("cuda")
        output = model.generate(
            **inputs,
            temperature = 0.3,
            max_new_tokens = 128,
            do_sample = True,
            top_p = 0.9,
            repetition_penalty = 1.2,
        )

        prompt_len = inputs["input_ids"].shape[1]
        response = tokenizer.decode(output[0][prompt_len:], skip_special_tokens=True)
        action = parse_llm_action(response)

        if action is None:
            score -= 10
            break

        result = game.do_action(action)
        if result == "mine":
            score -= 25
        elif result == "win":
            score += 100
        elif result == "ok":
            score += 10
        elif result == "already_revealed":
            score -= 12
        elif result == "out_of_bounds":
            score -= 15

        moves += 1

    return game, moves, score

# ── Evaluate on 100 games across board sizes ─────────────────────
# Test on small, medium, large, and XL boards to verify generalisation
EVAL_CONFIGS = [
    ("Small  6×6",   6,  6,  5,  25),
    ("Medium 10×10", 10, 10, 12, 25),
    ("Large  20×20", 20, 20, 50, 25),
    ("XL     40×40", 40, 40, 200, 25),
]

print(f"Evaluating trained Qwen2.5-14B across board sizes...")
print(f"(temp=0.3, top_p=0.9, rep_penalty=1.2 -- matching eval harness)\n")

grand_wins = 0
grand_games = 0
grand_score = 0

for config_name, rows, cols, mines, num_games in EVAL_CONFIGS:
    wins = 0
    total_score = 0
    for i in range(num_games):
        game, moves, score = play_full_game(model, tokenizer, rows=rows, cols=cols, num_mines=mines, seed=i)
        result = game.state()

        if result == "success":
            wins += 1
        total_score += score

    grand_wins += wins
    grand_games += num_games
    grand_score += total_score

    print(f"  {config_name}: {wins}/{num_games} wins ({wins/num_games*100:.0f}%), avg score {total_score/num_games:+.1f}")

print(f"\n{'='*50}")
print(f"FINAL RESULTS ({grand_games} games across all sizes)")
print(f"{'='*50}")
print(f"  Win rate:    {grand_wins}/{grand_games} ({grand_wins/grand_games*100:.1f}%)")
print(f"  Avg score:   {grand_score/grand_games:+.1f}")
print(f"{'='*50}")
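
The per-move scoring in `play_full_game` repeats the `if`/`elif` chain used by the eval callback. A minimal sketch of a table-driven alternative with the same point values, so both places could share one definition. `MOVE_SCORES`, `INVALID_JSON_PENALTY`, and `score_results` are hypothetical names. The sketch covers scoring only; ending the game after a mine or a win stays with the game loop:

```python
# Points per do_action result, copied from the if/elif chain in play_full_game.
MOVE_SCORES = {
    "mine": -25,
    "win": 100,
    "ok": 10,
    "already_revealed": -12,
    "out_of_bounds": -15,
}
INVALID_JSON_PENALTY = -10  # applied when the response cannot be parsed

def score_results(results):
    """Sum scores for a sequence of move results.

    None marks an unparseable response: it is penalized and ends the game,
    mirroring the `break` in the original loop. Unknown result strings
    score 0, as they do in the if/elif chain.
    """
    total = 0
    for result in results:
        if result is None:
            total += INVALID_JSON_PENALTY
            break
        total += MOVE_SCORES.get(result, 0)
    return total

print(score_results(["ok", "ok", "already_revealed", "win"]))
print(score_results(["ok", None, "win"]))
```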

# Save the Model

Save LoRA adapters + merged bf16 model for eval harness deployment:

In [None]:
# ── Save LoRA adapters ───────────────────────────────────────────
LORA_DIR = "/workspace/minesweeper_qwen25_14b_lora"
model.save_pretrained(LORA_DIR)
tokenizer.save_pretrained(LORA_DIR)
print(f"LoRA adapters saved to: {LORA_DIR}/")

# ── Save merged model in 16bit (used by eval harness) ────────────
MERGED_DIR = "/workspace/minesweeper_qwen25_14b_merged"
model.save_pretrained_merged(
    MERGED_DIR,
    tokenizer,
    save_method = "merged_16bit",
)
print(f"Merged bf16 model saved to: {MERGED_DIR}/")
print(f"\nFor eval: set model_name = \"{MERGED_DIR}\" in EVAL agents/minesweeper_model.py")

# Competition Tips

## Improve Your Model:

1. **Adjust Reward Functions**
   - Increase rewards for logical deduction
   - Add penalties for random moves
   - Reward flagging correct mines

2. **Tune Hyperparameters**
   - Increase `max_steps` for longer training
   - Adjust `learning_rate` (try 1e-5 to 1e-4)
   - Increase `lora_rank` for more capacity
   - Adjust `num_generations` (2-8)

3. **Better Training Data**
   - Generate more diverse states
   - Include harder scenarios (more mines)
   - Add states requiring logical deduction

4. **Advanced Techniques**
   - Multi-step rollouts in reward function
   - Curriculum learning (easy → hard boards)
   - Ensemble multiple models
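
Curriculum learning from item 4 can be sketched as a sampler whose maximum board size grows with training progress. This is a sketch under assumptions: `sample_board_config`, the linear schedule, and the 10-20% mine density are illustrative choices, not the sampler used to build this notebook's dataset. Only the 5x5 to 25x25 training range comes from the notebook itself.

```python
import random

def sample_board_config(progress, rng, min_size=5, max_size=25):
    """Sample (rows, cols, num_mines); the size cap grows linearly with progress in [0, 1]."""
    progress = min(max(progress, 0.0), 1.0)
    size_cap = min_size + round(progress * (max_size - min_size))
    rows = rng.randint(min_size, size_cap)
    cols = rng.randint(min_size, size_cap)
    density = rng.uniform(0.10, 0.20)          # assumed mine density range
    num_mines = max(1, int(rows * cols * density))
    return rows, cols, num_mines

rng = random.Random(0)
early = [sample_board_config(0.0, rng) for _ in range(3)]   # size cap is 5, so only 5x5 boards
late = [sample_board_config(1.0, rng) for _ in range(3)]    # any size from 5x5 up to 25x25
print(early)
print(late)
```

Passing `progress = step / max_steps` when generating states would start training on small boards and gradually mix in larger ones.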

## Team Strategy:
- Experiment with different reward functions
- Try different board sizes during training
- Analyze failed games to improve rewards
- Use temperature sampling during evaluation
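
For the temperature point, a small sketch of what the `temperature` generation parameter does to token probabilities. The logits are made-up numbers and `softmax_with_temperature` is a hypothetical helper; the actual sampling happens inside `model.generate`:

```python
import math

def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities; temperature < 1 sharpens, > 1 flattens."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)                          # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]                        # made-up scores for three candidate tokens
p_low = softmax_with_temperature(logits, 0.3)   # eval setting in this notebook
p_one = softmax_with_temperature(logits, 1.0)
p_high = softmax_with_temperature(logits, 2.0)

for name, probs in (("0.3", p_low), ("1.0", p_one), ("2.0", p_high)):
    print(name, [round(p, 3) for p in probs])
```

A low temperature such as 0.3 concentrates probability on the top-scoring token, which favors consistent moves at evaluation time. Higher values spread probability out and produce more varied completions.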

Good luck!