# Minesweeper LLM Competition - Custom GRPO Training

## Goal
Finetune an LLM with LoRA using GRPO to play Minesweeper by:
- **Input**: JSON game state (board configuration)
- **Output**: JSON action (reveal or flag a cell)

Teams will compete to train the best Minesweeper-playing LLM!

## Training Approach
- **Model**: GPT-OSS 20B with LoRA or other models in the /root/.cache/huggingface/hub directory [**Any model other than /root/.cache/huggingface/hub will lead to disqualification**]
- **Method**: GRPO (Group Relative Policy Optimization), SFT or any RL-policies (not just strict to use GRPO)
- **Framework**: Unsloth (2-6x faster, 70% less VRAM)
- **Hardware**: AMD GPU (ROCm)

# Load Model with Unsloth

Load GPT-OSS 20B with LoRA configuration:

In [97]:
import torch
import sys
import json
import re
import numpy as np
import random
from dataclasses import dataclass, field
from typing import List, Tuple, Optional, Set
from datasets import Dataset
from unsloth import FastLanguageModel
from trl import GRPOConfig, GRPOTrainer

# 1. Configuration for MI300X (192GB VRAM)
max_seq_length = 2048  # Increased to allow space for "Thinking" blocks
lora_rank = 32         # Efficient reasoning balance
model_name = "unsloth/gpt-oss-20b-BF16"

# 2. Load Model with Eager Attention (CRITICAL FIX for AMD Crash)
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = model_name,
    load_in_4bit = False, 
    max_seq_length = max_seq_length,
    torch_dtype = torch.bfloat16,
    attn_implementation = "eager", 
)

print(f"Model loaded on {model.device} with Rank {lora_rank} and Context {max_seq_length}")

Unsloth: AMD currently is not stable with 4bit bitsandbytes. Disabling for now.


Could not cache non-existence of file. Will ignore error and continue. Error: [Errno 30] Read-only file system: '/root/.cache/huggingface/hub/models--unsloth--gpt-oss-20b-BF16/.no_exist/cc89b3e7fd423253264883a80a4fa5abc619649f/adapter_config.json'
[2026-02-14 18:28:17] ERROR file_download.py:1559: Could not cache non-existence of file. Will ignore error and continue. Error: [Errno 30] Read-only file system: '/root/.cache/huggingface/hub/models--unsloth--gpt-oss-20b-BF16/.no_exist/cc89b3e7fd423253264883a80a4fa5abc619649f/adapter_config.json'
Could not cache non-existence of file. Will ignore error and continue. Error: [Errno 30] Read-only file system: '/root/.cache/huggingface/hub/models--unsloth--gpt-oss-20b-BF16/.no_exist/cc89b3e7fd423253264883a80a4fa5abc619649f/adapter_config.json'
[2026-02-14 18:28:17] ERROR file_download.py:1559: Could not cache non-existence of file. Will ignore error and continue. Error: [Errno 30] Read-only file system: '/root/.cache/huggingface/hub/models--unsl

Unsloth: AMD currently is not stable with 4bit bitsandbytes. Disabling for now.
==((====))==  Unsloth 2025.10.6: Fast Gpt_Oss patching. Transformers: 4.56.2. vLLM: 0.11.1rc2.dev161+g8a297115e.rocm700.
   \\   /|    . Num GPUs = 1. Max memory: 255.688 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.8.0+gitb2fb688. ROCm Toolkit: 7.0.51831-a3e329ad8. Triton: 3.4.0
\        /    Bfloat16 = TRUE. FA [Xformers = None. FA2 = True]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
Unsloth: QLoRA and full finetuning all not selected. Switching to 16bit LoRA.


Loading checkpoint shards:   0%|          | 0/9 [00:00<?, ?it/s]



Model loaded on cuda:0 with Rank 32 and Context 2048


# Add LoRA Adapters

Add LoRA layers for efficient finetuning:

In [98]:
model = FastLanguageModel.get_peft_model(
    model,
    r = lora_rank,
    target_modules = [
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    lora_alpha = lora_rank * 2,
    use_gradient_checkpointing = "unsloth",
    random_state = 3407,
)

Unsloth: Making `model.base_model.model.model` require gradients
LoRA adapters attached.


# Minesweeper Game Implementation

Custom Minesweeper environment supporting:
- Customizable board size and mine count
- Actions: reveal or flag cells
- Win: reveal all safe cells
- Lose: reveal a mine

In [99]:
from dataclasses import dataclass, field
from typing import List, Tuple, Optional, Set
import random

@dataclass
class MinesweeperGame:
    rows: int
    cols: int
    num_mines: int
    seed: Optional[int] = None
    _rng: random.Random = field(init=False, repr=False)
    _board: List[List[int]] = field(init=False, repr=False)  # -1 = mine, 0-8 = count
    _revealed: Set[Tuple[int, int]] = field(init=False, repr=False, default_factory=set)
    _flagged: Set[Tuple[int, int]] = field(init=False, repr=False, default_factory=set)
    _state: str = field(default="ongoing", init=False, repr=False)

    def __post_init__(self):
        if self.num_mines >= self.rows * self.cols:
            raise ValueError("Too many mines for board size")
        self._rng = random.Random(self.seed)
        self._board = [[0 for _ in range(self.cols)] for _ in range(self.rows)]
        self._place_mines()
        self._calculate_numbers()

    def _place_mines(self):
        """Place mines randomly on the board"""
        positions = [(r, c) for r in range(self.rows) for c in range(self.cols)]
        mine_positions = self._rng.sample(positions, self.num_mines)
        for r, c in mine_positions:
            self._board[r][c] = -1

    def _calculate_numbers(self):
        """Calculate numbers for each cell based on adjacent mines"""
        for r in range(self.rows):
            for c in range(self.cols):
                if self._board[r][c] == -1:
                    continue
                count = 0
                for dr in [-1, 0, 1]:
                    for dc in [-1, 0, 1]:
                        if dr == 0 and dc == 0:
                            continue
                        nr, nc = r + dr, c + dc
                        if 0 <= nr < self.rows and 0 <= nc < self.cols:
                            if self._board[nr][nc] == -1:
                                count += 1
                self._board[r][c] = count

    def _reveal_cell(self, row: int, col: int) -> bool:
        """Reveal a cell. Returns True if valid move, False if invalid.
        Uses iterative flood-fill to avoid recursion limit on large boards.
        (Issue #11: was recursive; Issue typo: fixed 'bself' -> 'self')
        """
        if not (0 <= row < self.rows and 0 <= col < self.cols):
            return False
        if (row, col) in self._revealed or (row, col) in self._flagged:
            return False

        stack = [(row, col)]
        while stack:
            r, c = stack.pop()
            if (r, c) in self._revealed:
                continue

            self._revealed.add((r, c))

            # Hit a mine!
            if self._board[r][c] == -1:
                self._state = "failed"
                return True

            # Auto-reveal neighbors if cell is 0
            if self._board[r][c] == 0:
                for dr in [-1, 0, 1]:
                    for dc in [-1, 0, 1]:
                        if dr == 0 and dc == 0:
                            continue
                        nr, nc = r + dr, c + dc
                        if (0 <= nr < self.rows and 0 <= nc < self.cols
                                and (nr, nc) not in self._revealed
                                and (nr, nc) not in self._flagged):
                            stack.append((nr, nc))

        return True

    def _flag_cell(self, row: int, col: int) -> bool:
        """Flag/unflag a cell. Returns True if valid, False if invalid"""
        if not (0 <= row < self.rows and 0 <= col < self.cols):
            return False
        if (row, col) in self._revealed:
            return False
        
        if (row, col) in self._flagged:
            self._flagged.remove((row, col))
        else:
            self._flagged.add((row, col))
        return True

    def do_action(self, action: dict) -> str:
        """Execute an action and return a status string.

        Returns one of:
          'ok'               - valid move executed
          'mine'             - revealed a mine (game over)
          'win'              - game won after this move
          'invalid_format'   - bad action dict / missing keys / bad types
          'out_of_bounds'    - coordinates outside the board
          'already_revealed' - cell was already revealed
          'flagged_cell'     - tried to reveal a flagged cell
          'invalid_flag'     - tried to flag a revealed cell
          'game_over'        - game was already over before this call

        (Issue #13: previously set state='failed' for ALL invalid moves,
         conflating formatting errors with hitting a mine.)
        """
        if self._state != "ongoing":
            return "game_over"

        if not isinstance(action, dict):
            self._state = "failed"
            return "invalid_format"

        action_type = action.get("type")
        row = action.get("row")
        col = action.get("col")

        if action_type not in ["reveal", "flag"] or row is None or col is None:
            self._state = "failed"
            return "invalid_format"

        try:
            row, col = int(row), int(col)
        except (ValueError, TypeError):
            self._state = "failed"
            return "invalid_format"

        if not (0 <= row < self.rows and 0 <= col < self.cols):
            self._state = "failed"
            return "out_of_bounds"

        if action_type == "reveal":
            if (row, col) in self._revealed:
                self._state = "failed"
                return "already_revealed"
            if (row, col) in self._flagged:
                self._state = "failed"
                return "flagged_cell"
            valid = self._reveal_cell(row, col)
        else:
            if (row, col) in self._revealed:
                self._state = "failed"
                return "invalid_flag"
            valid = self._flag_cell(row, col)

        if not valid:
            self._state = "failed"
            return "invalid_format"

        self._check_win()

        if self._state == "failed":
            return "mine"
        if self._state == "success":
            return "win"
        return "ok"

    def _check_win(self):
        """Check if player has won"""
        total_cells = self.rows * self.cols
        safe_cells = total_cells - self.num_mines
        if len(self._revealed) == safe_cells:
            self._state = "success"

    def get_visible_board(self) -> List[List[str]]:
        """Get board state as player sees it"""
        visible = []
        for r in range(self.rows):
            row = []
            for c in range(self.cols):
                if (r, c) in self._flagged:
                    row.append('F')
                elif (r, c) in self._revealed:
                    val = self._board[r][c]
                    row.append('*' if val == -1 else str(val))
                else:
                    row.append('.')
            visible.append(row)
        return visible

    def state(self) -> str:
        return self._state

    def pretty_print(self) -> str:
        """Pretty print the board"""
        visible = self.get_visible_board()
        lines = []
        
        # Header
        header = "   " + " ".join(f"{i:2d}" for i in range(self.cols))
        lines.append(header)
        lines.append("  " + "─" * (self.cols * 3 + 1))
        
        # Board
        for r, row in enumerate(visible):
            line = f"{r:2d}│ " + "  ".join(row)
            lines.append(line)
        
        return "\n".join(lines)

# Test the Game

In [100]:
# Create test game
game = MinesweeperGame(rows=6, cols=6, num_mines=5)
print(game.pretty_print())
print(f"State: {game.state()}")

# Test action
game.do_action({"type": "reveal", "row": 0, "col": 0})
print("\nAfter revealing (0,0):")
print(game.pretty_print())
print(f"State: {game.state()}")

Generating 1000 logic-focused samples...
Data Ready: 1000


# JSON Input/Output Format

## Input Format (Game State)
```json
{
  "board": [
    ["1", ".", ".", ".", ".", "."],
    [".", ".", ".", ".", ".", "."],
    [".", ".", ".", ".", ".", "."],
    [".", ".", ".", ".", ".", "."],
    [".", ".", ".", ".", ".", "."],
    [".", ".", ".", ".", ".", "."]
  ],
  "rows": 6,
  "cols": 6,
  "mines": 5,
  "flags_placed": 0,
  "cells_revealed": 0
}
```

## Output Format (Action)
```json
{"type": "reveal", "row": 2, "col": 3}
```
or
```json
{"type": "flag", "row": 1, "col": 4}
```

In [101]:
training_args = GRPOConfig(
    output_dir = "minesweeper_final_logic",
    learning_rate = 1e-5,
    weight_decay = 0.01,
    warmup_ratio = 0.1,
    optim = "adamw_8bit",
    logging_steps = 1,
    per_device_train_batch_size = 1,
    gradient_accumulation_steps = 4,
    num_generations = 8, 
    max_prompt_length = 600,
    max_completion_length = 512, # INCREASED for Thinking block
    max_steps = 400,
    save_steps = 100,
    report_to = "none",
)

trainer = GRPOTrainer(
    model = model,
    processing_class = tokenizer,
    reward_funcs = [valid_json_reward, gameplay_reward],
    args = training_args,
    train_dataset = dataset,
)

print("Starting Training...")
trainer.train()

Unsloth: We now expect `per_device_train_batch_size` to be a multiple of `num_generations`.
We will change the batch size of 1 to the `num_generations` of 8


NameError: name 'gameplay_reward' is not defined

# Test Model Before Training

See how the base model performs without finetuning:

In [94]:
from transformers import TextStreamer

# CRITICAL FIX: Must call for_inference() before generate() with Unsloth on AMD.
# This materializes model weights from meta device to GPU.
# Without this you get "Cannot copy out of meta tensor; no data!" error.
FastLanguageModel.for_inference(model)

game = MinesweeperGame(rows=6, cols=6, num_mines=5, seed=42)
prompt = format_state_for_llm(game)

text = tokenizer.apply_chat_template(
    [{"role": "user", "content": prompt}],
    tokenize = False,
    add_generation_prompt = True,
)

print("=== Base Model Response ===")
output = model.generate(
    **tokenizer(text, return_tensors = "pt").to("cuda"),
    temperature = 1.0,
    max_new_tokens = 128,
    streamer = TextStreamer(tokenizer, skip_prompt = True),
)

=== Base Model Response ===


OutOfMemoryError: HIP out of memory. Tried to allocate 1014.00 MiB. GPU 0 has a total capacity of 255.69 GiB of which 0 bytes is free. Of the allocated memory 147.98 GiB is allocated by PyTorch, with 122.00 MiB allocated in private pools (e.g., HIP Graphs), and 1.67 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

# GRPO Reward Functions

Define reward functions to guide the model's learning:

In [83]:
def is_logically_deducible_safe(game, row, col):
    """Check if revealing (row, col) is PROVABLY safe from visible info.
    
    Logic: A hidden cell is safe if ANY neighboring number cell has
    all its mines already accounted for by flags.
    
    Example: Cell shows "2" and has exactly 2 flagged neighbors →
    all other hidden neighbors of that "2" cell are safe.
    """
    board = game.get_visible_board()
    
    for dr in [-1, 0, 1]:
        for dc in [-1, 0, 1]:
            if dr == 0 and dc == 0:
                continue
            nr, nc = row + dr, col + dc
            if not (0 <= nr < game.rows and 0 <= nc < game.cols):
                continue
            cell = board[nr][nc]
            # Only look at revealed number cells > 0
            if not (cell.isdigit() and int(cell) > 0):
                continue
            
            num = int(cell)
            flagged = 0
            hidden_neighbors = []
            
            for ddr in [-1, 0, 1]:
                for ddc in [-1, 0, 1]:
                    if ddr == 0 and ddc == 0:
                        continue
                    nnr, nnc = nr + ddr, nc + ddc
                    if 0 <= nnr < game.rows and 0 <= nnc < game.cols:
                        if board[nnr][nnc] == 'F':
                            flagged += 1
                        elif board[nnr][nnc] == '.':
                            hidden_neighbors.append((nnr, nnc))
            
            # All mines accounted for → remaining hidden cells are safe
            if flagged == num and (row, col) in hidden_neighbors:
                return True
    
    return False


def is_logically_deducible_mine(game, row, col):
    """Check if (row, col) is PROVABLY a mine from visible info.
    
    Logic: A hidden cell is a mine if ANY neighboring number cell has
    (number - flagged_count) == hidden_count, meaning ALL remaining
    hidden neighbors must be mines.
    
    Example: Cell shows "3" with 1 flag and 2 hidden neighbors →
    both hidden neighbors must be mines.
    """
    board = game.get_visible_board()
    
    for dr in [-1, 0, 1]:
        for dc in [-1, 0, 1]:
            if dr == 0 and dc == 0:
                continue
            nr, nc = row + dr, col + dc
            if not (0 <= nr < game.rows and 0 <= nc < game.cols):
                continue
            cell = board[nr][nc]
            if not (cell.isdigit() and int(cell) > 0):
                continue
            
            num = int(cell)
            flagged = 0
            hidden_neighbors = []
            
            for ddr in [-1, 0, 1]:
                for ddc in [-1, 0, 1]:
                    if ddr == 0 and ddc == 0:
                        continue
                    nnr, nnc = nr + ddr, nc + ddc
                    if 0 <= nnr < game.rows and 0 <= nnc < game.cols:
                        if board[nnr][nnc] == 'F':
                            flagged += 1
                        elif board[nnr][nnc] == '.':
                            hidden_neighbors.append((nnr, nnc))
            
            remaining_mines = num - flagged
            if remaining_mines > 0 and remaining_mines == len(hidden_neighbors):
                if (row, col) in hidden_neighbors:
                    return True
    
    return False


# Quick test
game = MinesweeperGame(rows=6, cols=6, num_mines=5, seed=42)
game.do_action({"type": "reveal", "row": 0, "col": 0})
print("Board after first reveal:")
print(game.pretty_print())
print(f"\nLogical deduction helpers loaded successfully!")

Board after first reveal:
    0  1  2  3  4  5
  ───────────────────
 0│ 2  .  .  .  .  .
 1│ .  .  .  .  .  .
 2│ .  .  .  .  .  .
 3│ .  .  .  .  .  .
 4│ .  .  .  .  .  .
 5│ .  .  .  .  .  .

Logical deduction helpers loaded successfully!


In [84]:
import numpy as np

def valid_json_reward(completions, **kwargs):
    """Reward valid JSON action format."""
    scores = []
    for completion in completions:
        response = completion[0]["content"]
        action = parse_llm_action(response)
        if action is None:
            scores.append(-5.0)
        else:
            response_len = len(response.strip())
            if response_len < 80:
                scores.append(2.0)
            elif response_len < 200:
                scores.append(1.0)
            else:
                scores.append(0.5)
    return scores


def gameplay_scores(completions, **kwargs):
    """
    Complete reward function with all 12 scoring criteria.
    
    CRITICAL FIX: Reads rows/cols/num_mines from dataset kwargs
    instead of hardcoding 6x6. The old version reconstructed every game
    as 6x6 regardless of actual board size, causing ~85% of training
    samples to get WRONG rewards.
    """
    scores = []
    seeds = kwargs.get("seed", [])
    move_histories = kwargs.get("move_history", [])
    # FIXED: Read actual board dimensions from dataset
    rows_list = kwargs.get("rows", [])
    cols_list = kwargs.get("cols", [])
    mines_list = kwargs.get("num_mines", [])

    for idx, completion in enumerate(completions):
        response = completion[0]["content"]
        action = parse_llm_action(response)

        # Criterion 9: Invalid JSON
        if action is None:
            scores.append(-10.0)
            continue

        if idx < len(seeds) and idx < len(move_histories):
            seed = seeds[idx]
            move_history_raw = move_histories[idx]
            if isinstance(move_history_raw, str):
                move_history = json.loads(move_history_raw)
            else:
                move_history = move_history_raw

            # FIXED: Use ACTUAL board size from dataset, not hardcoded 6x6
            r_count = int(rows_list[idx]) if idx < len(rows_list) else 6
            c_count = int(cols_list[idx]) if idx < len(cols_list) else 6
            m_count = int(mines_list[idx]) if idx < len(mines_list) else 5

            game = MinesweeperGame(rows=r_count, cols=c_count, num_mines=m_count, seed=seed)
            for prev_action in move_history:
                game.do_action(prev_action)

            board = game.get_visible_board()

            try:
                row = int(action["row"])
                col = int(action["col"])
            except (ValueError, TypeError):
                scores.append(-10.0)
                continue

            action_type = action["type"]

            # Criterion 7: Out of bounds
            if not (0 <= row < game.rows and 0 <= col < game.cols):
                scores.append(-15.0)
                continue

            # === REVEAL ===
            if action_type == "reveal":
                cell = board[row][col]
                if cell not in ['.', 'F']:
                    scores.append(-12.0)
                    continue
                if cell == 'F':
                    scores.append(-8.0)
                    continue
                if game._board[row][col] == -1:
                    scores.append(-25.0)
                    continue

                if is_logically_deducible_safe(game, row, col):
                    base_score = 15.0
                else:
                    base_score = 10.0

                has_info = False
                for dr in [-1, 0, 1]:
                    for dc in [-1, 0, 1]:
                        nr, nc = row + dr, col + dc
                        if 0 <= nr < game.rows and 0 <= nc < game.cols:
                            ncell = board[nr][nc]
                            if ncell.isdigit() and int(ncell) > 0:
                                has_info = True
                                break
                    if has_info:
                        break
                if has_info:
                    base_score += 2.0

                test_game = MinesweeperGame(
                    rows=r_count, cols=c_count,
                    num_mines=m_count, seed=seed
                )
                for prev in move_history:
                    test_game.do_action(prev)
                result = test_game.do_action({"type": "reveal", "row": row, "col": col})
                if result == "win":
                    base_score += 100.0

                scores.append(base_score)

            # === FLAG ===
            elif action_type == "flag":
                cell = board[row][col]
                if cell == 'F':
                    scores.append(-12.0)
                    continue
                if cell not in ['.', 'F']:
                    scores.append(-8.0)
                    continue
                if len(game._flagged) >= game.num_mines:
                    scores.append(-10.0)
                    continue
                if game._board[row][col] == -1:
                    base_score = 15.0
                    if is_logically_deducible_mine(game, row, col):
                        base_score += 3.0
                    scores.append(base_score)
                else:
                    scores.append(-10.0)
            else:
                scores.append(-10.0)
        else:
            scores.append(0.0)

    return scores


print("Reward functions ready!")
print("  - valid_json_reward: format + conciseness")
print("  - gameplay_scores: all 12 criteria, reads board size from dataset")

Reward functions ready!
  - valid_json_reward: format + conciseness
  - gameplay_scores: all 12 criteria, reads board size from dataset


# Create Training Dataset

Generate diverse game states for training:

In [85]:
from datasets import Dataset

def generate_game_states(num_samples=2000, rng_seed=42):
    """Generate diverse Minesweeper training states.
    FIXED: Stores rows, cols, num_mines so reward function uses correct board."""
    np.random.seed(rng_seed)
    random.seed(rng_seed)

    dataset_items = []
    attempts = 0
    max_attempts = num_samples * 5

    configs = [
        (5, 5, 4),   (5, 5, 5),
        (6, 6, 5),   (6, 6, 7),
        (7, 7, 7),   (7, 7, 10),
        (8, 8, 10),  (8, 8, 13),
        (9, 9, 12),  (9, 9, 16),
        (10, 10, 15), (10, 10, 20),
        (6, 10, 10), (10, 6, 10),
        (8, 12, 15), (12, 8, 15),
    ]

    while len(dataset_items) < num_samples and attempts < max_attempts:
        attempts += 1
        rows, cols, mines = random.choice(configs)
        seed = np.random.randint(100000)
        game = MinesweeperGame(rows=rows, cols=cols, num_mines=mines, seed=seed)

        num_moves = np.random.choice(
            [0, 1, 2, 3, 4, 5, 6, 7, 8],
            p=[0.10, 0.10, 0.15, 0.15, 0.15, 0.10, 0.10, 0.08, 0.07]
        )
        move_history = []

        for _ in range(num_moves):
            board = game.get_visible_board()
            unrevealed = [(r, c) for r in range(rows) for c in range(cols)
                         if board[r][c] == '.']
            if not unrevealed or game.state() != "ongoing":
                break
            r, c = random.choice(unrevealed)
            if random.random() < 0.1 and len(game._flagged) < mines:
                action = {"type": "flag", "row": r, "col": c}
            else:
                action = {"type": "reveal", "row": r, "col": c}
            game.do_action(action)
            move_history.append(action)

        if game.state() == "ongoing":
            prompt_text = format_state_for_llm(game)
            dataset_items.append({
                "prompt": [{"role": "user", "content": prompt_text}],
                "seed": seed,
                "move_history": json.dumps(move_history),
                "rows": rows,
                "cols": cols,
                "num_mines": mines,
            })

    dataset = Dataset.from_list(dataset_items[:num_samples])
    fresh = sum(1 for item in dataset if item["move_history"] == "[]")
    print(f"Generated {len(dataset)} training examples")
    print(f"  Fresh: {fresh} ({fresh/len(dataset)*100:.1f}%)")
    print(f"  Mid-game: {len(dataset)-fresh} ({(len(dataset)-fresh)/len(dataset)*100:.1f}%)")
    from collections import Counter
    sizes = Counter(f"{item['rows']}x{item['cols']}" for item in dataset)
    print(f"  Board sizes: {dict(sizes)}")
    return dataset

print("Generating diverse training dataset...")
dataset = generate_game_states(num_samples=2000)
print(f"\nExample: Seed={dataset[0]['seed']}, Board={dataset[0]['rows']}x{dataset[0]['cols']}, Mines={dataset[0]['num_mines']}")

Generating diverse training dataset...
Generated 2000 training examples
  Fresh: 411 (20.5%)
  Mid-game: 1589 (79.5%)
  Board sizes: {'6x6': 243, '9x9': 246, '6x10': 136, '10x10': 248, '8x8': 222, '8x12': 134, '5x5': 219, '7x7': 275, '10x6': 127, '12x8': 150}

Example: Seed=76820, Board=6x6, Mines=5


# Configure GRPO Training

Set up GRPO trainer with all hyperparameters:

In [86]:
from trl import GRPOConfig, GRPOTrainer

max_prompt_length = 1600   # FIXED: was 800. With max_seq=2048 we have room
max_completion_length = max_seq_length - max_prompt_length  # = 448 tokens

training_args = GRPOConfig(
    temperature = 1.0,
    learning_rate = 2e-5,
    weight_decay = 0.01,
    warmup_ratio = 0.05,
    lr_scheduler_type = "cosine",
    optim = "adamw_8bit",
    logging_steps = 1,
    per_device_train_batch_size = 2,
    gradient_accumulation_steps = 8,
    num_generations = 8,
    max_prompt_length = max_prompt_length,
    max_completion_length = max_completion_length,
    max_steps = 1000,
    save_steps = 200,
    report_to = "none",
    output_dir = "minesweeper_custom_outputs",
)

print("Training configuration:")
print(f"  Max steps: {training_args.max_steps}")
print(f"  Batch: {training_args.per_device_train_batch_size} x {training_args.gradient_accumulation_steps} accum")
print(f"  LR: {training_args.learning_rate}, Scheduler: cosine")
print(f"  Max prompt: {max_prompt_length}, Max completion: {max_completion_length}")

Unsloth: We now expect `per_device_train_batch_size` to be a multiple of `num_generations`.
We will change the batch size of 2 to the `num_generations` of 8
Training configuration:
  Max steps: 1000
  Batch: 8 x 8 accum
  LR: 2e-05, Scheduler: cosine
  Max prompt: 1600, Max completion: 448


In [87]:
from transformers import TrainerCallback

class MinesweeperEvalCallback(TrainerCallback):
    def __init__(self, eval_every_steps=50, num_games=10):
        self.eval_every_steps = eval_every_steps
        self.num_games = num_games
        self.best_win_rate = 0.0
        self.best_step = 0

    def on_step_end(self, args, state, control, model=None, processing_class=None, **kwargs):
        if state.global_step % self.eval_every_steps != 0:
            return
        tokenizer = processing_class
        if tokenizer is None or model is None:
            return

        FastLanguageModel.for_inference(model)

        wins = 0
        valid_moves = 0
        invalid_moves = 0

        for i in range(self.num_games):
            if i < 5:
                game = MinesweeperGame(rows=6, cols=6, num_mines=5, seed=10000+i)
            elif i < 8:
                game = MinesweeperGame(rows=8, cols=8, num_mines=10, seed=10000+i)
            else:
                game = MinesweeperGame(rows=10, cols=10, num_mines=15, seed=10000+i)

            moves = 0
            while game.state() == "ongoing" and moves < 50:
                prompt = format_state_for_llm(game)
                text = tokenizer.apply_chat_template(
                    [{"role": "user", "content": prompt}],
                    tokenize=False, add_generation_prompt=True,
                )
                output = model.generate(
                    **tokenizer(text, return_tensors="pt").to(model.device),
                    temperature=0.7, max_new_tokens=128, do_sample=True,
                )
                response = tokenizer.decode(output[0], skip_special_tokens=True)
                action = parse_llm_action(response)
                if action is None:
                    invalid_moves += 1
                    break
                valid_moves += 1
                game.do_action(action)
                moves += 1
            if game.state() == "success":
                wins += 1

        win_rate = wins / self.num_games
        marker = ""
        if win_rate > self.best_win_rate:
            self.best_win_rate = win_rate
            self.best_step = state.global_step
            marker = " ** NEW BEST **"

        total_m = valid_moves + invalid_moves
        valid_pct = (valid_moves / total_m * 100) if total_m > 0 else 0
        print(f"\n[Eval @ step {state.global_step}] "
              f"Wins: {wins}/{self.num_games} ({win_rate*100:.0f}%) | "
              f"Valid: {valid_pct:.0f}% | "
              f"Best: {self.best_win_rate*100:.0f}% @ step {self.best_step}{marker}\n")

        FastLanguageModel.for_training(model)

eval_callback = MinesweeperEvalCallback(eval_every_steps=50, num_games=10)
print("Eval callback ready: 10 games (mixed sizes) every 50 steps")

Eval callback ready: 10 games (mixed sizes) every 50 steps


# Train the Model

Start GRPO training with reward functions:

In [88]:
trainer = GRPOTrainer(
    model = model,
    processing_class = tokenizer,
    reward_funcs = [
        valid_json_reward,   # Reward valid JSON format
        gameplay_scores,     # Reward good gameplay
    ],
    args = training_args,
    train_dataset = dataset,
    callbacks = [eval_callback],
)

print("Starting training...")
trainer.train()

The tokenizer has new PAD/BOS/EOS tokens that differ from the model config and generation config. The model config and generation config were aligned accordingly, being updated with the tokenizer's values. Updated tokens: {'bos_token_id': 199998}.


Starting training...


==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 2,000 | Num Epochs = 4 | Total steps = 1,000
O^O/ \_/ \    Batch size per device = 8 | Gradient accumulation steps = 8
\        /    Data Parallel GPUs = 1 | Total batch size (8 x 8 x 1) = 64
 "-____-"     Trainable parameters = 15,925,248 of 20,930,682,432 (0.08% trained)


Step,Training Loss,reward,reward_std,completions / mean_length,completions / min_length,completions / max_length,completions / clipped_ratio,completions / mean_terminated_length,completions / min_terminated_length,completions / max_terminated_length,sampling / sampling_logp_difference / mean,sampling / sampling_logp_difference / max,sampling / importance_sampling_ratio / min,sampling / importance_sampling_ratio / mean,sampling / importance_sampling_ratio / max,kl,rewards / valid_json_reward / mean,rewards / valid_json_reward / std,rewards / gameplay_scores / mean,rewards / gameplay_scores / std
1,0.0,-9.570312,5.508347,401.203125,110.0,448.0,0.78125,234.071442,110.0,425.0,0,0,0,0,0,0.003085,-3.710938,2.34826,-5.859375,8.66276


OutOfMemoryError: HIP out of memory. Tried to allocate 10.02 GiB. GPU 0 has a total capacity of 255.69 GiB of which 5.44 GiB is free. Of the allocated memory 114.61 GiB is allocated by PyTorch, with 122.00 MiB allocated in private pools (e.g., HIP Graphs), and 29.59 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

# Test Trained Model

Evaluate the finetuned model:

In [None]:
# Test on new game
test_game = MinesweeperGame(rows=6, cols=6, num_mines=5)
test_prompt = format_state_for_llm(test_game)

# Removed reasoning_effort="low" for train/eval consistency
test_text = tokenizer.apply_chat_template(
    [{"role": "user", "content": test_prompt}],
    tokenize = False,
    add_generation_prompt = True,
)

print("=== Trained Model Response ===")
output = model.generate(
    **tokenizer(test_text, return_tensors = "pt").to("cuda"),
    temperature = 0.7,
    max_new_tokens = 128,
    streamer = TextStreamer(tokenizer, skip_prompt = True),
)

# Parse and test action
response_text = tokenizer.decode(output[0])
action = parse_llm_action(response_text)
print(f"\nParsed action: {action}")

if action:
    test_game.do_action(action)
    print(f"\nGame state after action: {test_game.state()}")
    print(test_game.pretty_print())

# Evaluation: Play Complete Games

Test the model on multiple complete games:

In [None]:
def play_full_game(model, tokenizer, rows=6, cols=6, num_mines=5, seed=None, max_moves=50):
    """Play a complete Minesweeper game with the model"""
    game = MinesweeperGame(rows=rows, cols=cols, num_mines=num_mines, seed=seed)
    moves = 0
    
    while game.state() == "ongoing" and moves < max_moves:
        # Get current state
        prompt = format_state_for_llm(game)
        # Removed reasoning_effort="low" for train/eval consistency
        text = tokenizer.apply_chat_template(
            [{"role": "user", "content": prompt}],
            tokenize = False,
            add_generation_prompt = True,
        )
        
        # Generate action
        output = model.generate(
            **tokenizer(text, return_tensors = "pt").to("cuda"),
            temperature = 0.7,
            max_new_tokens = 128,
            do_sample = True,
        )
        
        response = tokenizer.decode(output[0])
        action = parse_llm_action(response)
        
        if action is None:
            break  # Invalid action
        
        game.do_action(action)
        moves += 1
    
    return game, moves

# Evaluate on 100 games for statistically meaningful win rates
NUM_EVAL_GAMES = 100
print(f"Evaluating model on {NUM_EVAL_GAMES} games...\n")
wins = 0
total_moves = 0

for i in range(NUM_EVAL_GAMES):
    game, moves = play_full_game(model, tokenizer, seed=i)
    result = game.state()

    if result == "success":
        wins += 1
    # Only print individual results for first 10 + any wins
    if i < 10 or result == "success":
        tag = "WIN" if result == "success" else "LOSS"
        print(f"Game {i+1}: {tag} ({result}) after {moves} moves")

    total_moves += moves

if NUM_EVAL_GAMES > 10:
    print(f"... (showing first 10 + wins; {NUM_EVAL_GAMES} total)")

print(f"\nResults:")
print(f"  Win rate: {wins}/{NUM_EVAL_GAMES} ({wins/NUM_EVAL_GAMES*100:.1f}%)")
print(f"  Average moves: {total_moves/NUM_EVAL_GAMES:.1f}")

# Save the Model

Save your trained model for competition submission:

In [None]:
# Save LoRA adapters
model.save_pretrained("my_minesweeper_model")
tokenizer.save_pretrained("my_minesweeper_model")

print("Model saved to: my_minesweeper_model/")

# Save merged model in 16bit (local file name which will be used for eval)
if False:
    model.save_pretrained_merged(
        "my_minesweeper_model_merged",
        tokenizer,
        save_method = "merged_16bit"
    )

# Competition Tips

## Improve Your Model:

1. **Adjust Reward Functions**
   - Increase rewards for logical deduction
   - Add penalties for random moves
   - Reward flagging correct mines

2. **Tune Hyperparameters**
   - Increase `max_steps` for longer training
   - Adjust `learning_rate` (try 1e-5 to 1e-4)
   - Increase `lora_rank` for more capacity
   - Adjust `num_generations` (2-8)

3. **Better Training Data**
   - Generate more diverse states
   - Include harder scenarios (more mines)
   - Add states requiring logical deduction

4. **Advanced Techniques**
   - Multi-step rollouts in reward function
   - Curriculum learning (easy → hard boards)
   - Ensemble multiple models

## Team Strategy:
- Experiment with different reward functions
- Try different board sizes during training
- Analyze failed games to improve rewards
- Use temperature sampling during evaluation

Good luck!