# Minesweeper LLM Competition - Custom GRPO Training

## Goal
Finetune an LLM with LoRA using GRPO to play Minesweeper by:
- **Input**: JSON game state (board configuration)
- **Output**: JSON action (reveal or flag a cell)

Teams will compete to train the best Minesweeper-playing LLM!

## Training Approach
- **Model**: GPT-OSS 20B with LoRA, or any other model already present in the `/root/.cache/huggingface/hub` directory [**Using a model from outside `/root/.cache/huggingface/hub` will lead to disqualification**]
- **Method**: GRPO (Group Relative Policy Optimization), SFT, or any other RL method (GRPO is not mandatory)
- **Framework**: Unsloth (2-6x faster, 70% less VRAM)
- **Hardware**: AMD GPU (ROCm)
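
GRPO scores each sampled completion relative to the other completions drawn for the same prompt, rather than against a learned value function. The sketch below shows that group-relative normalization in isolation. The reward values and the `group_relative_advantages` helper are illustrative assumptions, and some GRPO variants drop the division by the standard deviation, so treat this as the textbook form rather than what the trainer in this notebook computes.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-4):
    """Center and scale rewards within one group of completions."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps keeps the division finite when every completion gets the same reward
    return [(r - mu) / (sigma + eps) for r in rewards]

# Hypothetical rewards for four completions sampled from the same prompt.
rewards = [10.0, -25.0, 20.0, 10.0]
advantages = group_relative_advantages(rewards)
print([round(a, 2) for a in advantages])
```

Completions that beat their group's average get a positive advantage and are reinforced; the rest are pushed down, even if their absolute reward was positive.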

# Load Model with Unsloth

Load the base model. This run uses Llama-3.1-8B-Instruct rather than GPT-OSS 20B; the LoRA rank set here is applied in the next section:

In [1]:
import os
os.environ["PYTORCH_HIP_ALLOC_CONF"] = "expandable_segments:True"
os.environ["HF_HUB_CACHE"] = "/root/.cache/huggingface"

from unsloth import FastLanguageModel
import torch

max_seq_length = 1024
lora_rank = 32   # Higher rank for better learning capacity

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "Unsloth/Llama-3.1-8B-Instruct",
    load_in_4bit = True,
    max_seq_length = max_seq_length,
    torch_dtype = torch.bfloat16,
)
print(f"Loaded Llama-3.1-8B-Instruct, rank={lora_rank}")

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
#### Unsloth: `hf_xet==1.1.10` and `ipykernel>6.30.1` breaks progress bars. Disabling for now in XET.
#### Unsloth: To re-enable progress bars, please downgrade to `ipykernel==6.30.1` or wait for a fix to
https://github.com/huggingface/xet-core/issues/526
INFO 02-15 06:41:34 [__init__.py:225] Automatically detected platform rocm.
🦥 Unsloth Zoo will now patch everything to make training faster!
Unsloth: AMD currently is not stable with 4bit bitsandbytes. Disabling for now.


Could not cache non-existence of file. Will ignore error and continue. Error: [Errno 30] Read-only file system: '/root/.cache/huggingface/models--Unsloth--Llama-3.1-8B-Instruct/.no_exist'
[2026-02-15 06:41:37] ERROR file_download.py:1559: Could not cache non-existence of file. Will ignore error and continue. Error: [Errno 30] Read-only file system: '/root/.cache/huggingface/models--Unsloth--Llama-3.1-8B-Instruct/.no_exist'
  GPU_BUFFERS = tuple([torch.empty(2*256*2048, dtype = dtype, device = f"{DEVICE_TYPE_TORCH}:{i}") for i in range(n_gpus)])


==((====))==  Unsloth 2025.10.6: Fast Llama patching. Transformers: 4.56.2. vLLM: 0.11.1rc2.dev161+g8a297115e.rocm700.
   \\   /|    . Num GPUs = 1. Max memory: 255.688 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.8.0+gitb2fb688. ROCm Toolkit: 7.0.51831-a3e329ad8. Triton: 3.4.0
\        /    Bfloat16 = TRUE. FA [Xformers = None. FA2 = True]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


`torch_dtype` is deprecated! Use `dtype` instead!

Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]


Loaded Llama-3.1-8B-Instruct, rank=32


# Add LoRA Adapters

Add LoRA layers for efficient finetuning:

In [2]:
model = FastLanguageModel.get_peft_model(
    model,
    r = lora_rank,
    target_modules = [
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    lora_alpha = lora_rank * 2,
    lora_dropout = 0,
    use_gradient_checkpointing = "unsloth",
    random_state = 3407,
)
model.print_trainable_parameters()

Unsloth 2025.10.6 patched 32 layers with 32 QKV layers, 32 O layers and 32 MLP layers.


trainable params: 83,886,080 || all params: 8,114,147,328 || trainable%: 1.0338
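
The trainable-parameter count can be reproduced by hand: each adapted linear layer gains two low-rank matrices holding `r * (in_features + out_features)` weights in total. The sketch below does that arithmetic. The layer shapes are assumptions taken from the public Llama-3.1-8B configuration (hidden size 4096, MLP size 14336, 8 key/value heads of dimension 128, 32 layers), not values read from this notebook.

```python
# Assumed Llama-3.1-8B shapes (from the public model config, not from this notebook).
HIDDEN = 4096
INTERMEDIATE = 14336
KV_DIM = 1024          # 8 key/value heads x 128 head dim
NUM_LAYERS = 32
RANK = 32              # lora_rank used above

# (in_features, out_features) of each targeted projection.
SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}

def lora_params(in_features, out_features, rank):
    # LoRA adds A (rank x in_features) and B (out_features x rank).
    return rank * (in_features + out_features)

per_layer = sum(lora_params(i, o, RANK) for i, o in SHAPES.values())
total = per_layer * NUM_LAYERS
print(f"{total:,}")  # 83,886,080
```

Under these assumed shapes the result matches the `trainable params` figure printed above.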


# Minesweeper Game Implementation

Custom Minesweeper environment supporting:
- Customizable board size and mine count
- Actions: reveal or flag cells
- Win: reveal all safe cells
- Lose: reveal a mine

In [3]:
from dataclasses import dataclass, field
from typing import List, Tuple, Optional, Set
import random

@dataclass
class MinesweeperGame:
    rows: int
    cols: int
    num_mines: int
    seed: Optional[int] = None
    _rng: random.Random = field(init=False, repr=False)
    _board: List[List[int]] = field(init=False, repr=False)  # -1 = mine, 0-8 = count
    _revealed: Set[Tuple[int, int]] = field(init=False, repr=False, default_factory=set)
    _flagged: Set[Tuple[int, int]] = field(init=False, repr=False, default_factory=set)
    _state: str = field(default="ongoing", init=False, repr=False)

    def __post_init__(self):
        if self.num_mines >= self.rows * self.cols:
            raise ValueError("Too many mines for board size")
        self._rng = random.Random(self.seed)
        self._board = [[0 for _ in range(self.cols)] for _ in range(self.rows)]
        self._place_mines()
        self._calculate_numbers()

    def _place_mines(self):
        """Place mines randomly on the board"""
        positions = [(r, c) for r in range(self.rows) for c in range(self.cols)]
        mine_positions = self._rng.sample(positions, self.num_mines)
        for r, c in mine_positions:
            self._board[r][c] = -1

    def _calculate_numbers(self):
        """Calculate numbers for each cell based on adjacent mines"""
        for r in range(self.rows):
            for c in range(self.cols):
                if self._board[r][c] == -1:
                    continue
                count = 0
                for dr in [-1, 0, 1]:
                    for dc in [-1, 0, 1]:
                        if dr == 0 and dc == 0:
                            continue
                        nr, nc = r + dr, c + dc
                        if 0 <= nr < self.rows and 0 <= nc < self.cols:
                            if self._board[nr][nc] == -1:
                                count += 1
                self._board[r][c] = count

    def _reveal_cell(self, row: int, col: int) -> bool:
        """Reveal a cell. Returns True if valid move, False if invalid.
        Uses iterative flood-fill to avoid recursion limit on large boards.
        (Issue #11: was recursive; Issue typo: fixed 'bself' -> 'self')
        """
        if not (0 <= row < self.rows and 0 <= col < self.cols):
            return False
        if (row, col) in self._revealed or (row, col) in self._flagged:
            return False

        stack = [(row, col)]
        while stack:
            r, c = stack.pop()
            if (r, c) in self._revealed:
                continue

            self._revealed.add((r, c))

            # Hit a mine!
            if self._board[r][c] == -1:
                self._state = "failed"
                return True

            # Auto-reveal neighbors if cell is 0
            if self._board[r][c] == 0:
                for dr in [-1, 0, 1]:
                    for dc in [-1, 0, 1]:
                        if dr == 0 and dc == 0:
                            continue
                        nr, nc = r + dr, c + dc
                        if (0 <= nr < self.rows and 0 <= nc < self.cols
                                and (nr, nc) not in self._revealed
                                and (nr, nc) not in self._flagged):
                            stack.append((nr, nc))

        return True

    def _flag_cell(self, row: int, col: int) -> bool:
        """Flag/unflag a cell. Returns True if valid, False if invalid"""
        if not (0 <= row < self.rows and 0 <= col < self.cols):
            return False
        if (row, col) in self._revealed:
            return False
        
        if (row, col) in self._flagged:
            self._flagged.remove((row, col))
        else:
            self._flagged.add((row, col))
        return True

    def do_action(self, action: dict) -> str:
        """Execute an action and return a status string.

        Returns one of:
          'ok'               - valid move executed
          'mine'             - revealed a mine (game over)
          'win'              - game won after this move
          'invalid_format'   - bad action dict / missing keys / bad types
          'out_of_bounds'    - coordinates outside the board
          'already_revealed' - cell was already revealed
          'flagged_cell'     - tried to reveal a flagged cell
          'invalid_flag'     - tried to flag a revealed cell
          'game_over'        - game was already over before this call

        (Issue #13: every invalid move used to be indistinguishable from
         hitting a mine. Invalid moves still end the game with
         state='failed', but the return value now names the cause.)
        """
        if self._state != "ongoing":
            return "game_over"

        if not isinstance(action, dict):
            self._state = "failed"
            return "invalid_format"

        action_type = action.get("type")
        row = action.get("row")
        col = action.get("col")

        if action_type not in ["reveal", "flag"] or row is None or col is None:
            self._state = "failed"
            return "invalid_format"

        try:
            row, col = int(row), int(col)
        except (ValueError, TypeError):
            self._state = "failed"
            return "invalid_format"

        if not (0 <= row < self.rows and 0 <= col < self.cols):
            self._state = "failed"
            return "out_of_bounds"

        if action_type == "reveal":
            if (row, col) in self._revealed:
                self._state = "failed"
                return "already_revealed"
            if (row, col) in self._flagged:
                self._state = "failed"
                return "flagged_cell"
            valid = self._reveal_cell(row, col)
        else:
            if (row, col) in self._revealed:
                self._state = "failed"
                return "invalid_flag"
            valid = self._flag_cell(row, col)

        if not valid:
            self._state = "failed"
            return "invalid_format"

        self._check_win()

        if self._state == "failed":
            return "mine"
        if self._state == "success":
            return "win"
        return "ok"

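    def _check_win(self):
        """Mark the game as won once every safe cell is revealed."""
        # A revealed mine is also stored in _revealed, so without this guard a
        # loss on the final move could be overwritten with "success".
        if self._state != "ongoing":
            return
        total_cells = self.rows * self.cols
        safe_cells = total_cells - self.num_mines
        if len(self._revealed) == safe_cells:
            self._state = "success"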

    def get_visible_board(self) -> List[List[str]]:
        """Get board state as player sees it"""
        visible = []
        for r in range(self.rows):
            row = []
            for c in range(self.cols):
                if (r, c) in self._flagged:
                    row.append('F')
                elif (r, c) in self._revealed:
                    val = self._board[r][c]
                    row.append('*' if val == -1 else str(val))
                else:
                    row.append('.')
            visible.append(row)
        return visible

    def state(self) -> str:
        return self._state

    def pretty_print(self) -> str:
        """Pretty print the board"""
        visible = self.get_visible_board()
        lines = []
        
        # Header
        header = "   " + " ".join(f"{i:2d}" for i in range(self.cols))
        lines.append(header)
        lines.append("  " + "─" * (self.cols * 3 + 1))
        
        # Board
        for r, row in enumerate(visible):
            line = f"{r:2d}│ " + "  ".join(row)
            lines.append(line)
        
        return "\n".join(lines)

# Test the Game

In [4]:
# Create test game
game = MinesweeperGame(rows=6, cols=6, num_mines=5)
print(game.pretty_print())
print(f"State: {game.state()}")

# Test action
game.do_action({"type": "reveal", "row": 0, "col": 0})
print("\nAfter revealing (0,0):")
print(game.pretty_print())
print(f"State: {game.state()}")

    0  1  2  3  4  5
  ───────────────────
 0│ .  .  .  .  .  .
 1│ .  .  .  .  .  .
 2│ .  .  .  .  .  .
 3│ .  .  .  .  .  .
 4│ .  .  .  .  .  .
 5│ .  .  .  .  .  .
State: ongoing

After revealing (0,0):
    0  1  2  3  4  5
  ───────────────────
 0│ 0  0  0  0  1  .
 1│ 0  0  1  1  2  .
 2│ 0  0  1  .  .  .
 3│ 0  0  1  .  .  .
 4│ 0  1  2  .  .  .
 5│ 0  1  .  .  .  .
State: ongoing


# JSON Input/Output Format

## Input Format (Game State)
```json
{
  "board": [
    ["1", ".", ".", ".", ".", "."],
    [".", ".", ".", ".", ".", "."],
    [".", ".", ".", ".", ".", "."],
    [".", ".", ".", ".", ".", "."],
    [".", ".", ".", ".", ".", "."],
    [".", ".", ".", ".", ".", "."]
  ],
  "rows": 6,
  "cols": 6,
  "mines": 5,
  "flags_placed": 0,
  "cells_revealed": 1
}
```
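
The `flags_placed` and `cells_revealed` counters can be recomputed from `board` itself, which gives a cheap consistency check on a state object. This is a minimal sketch, assuming the cell symbols `.`, `F`, `"0"`-`"8"` and `*` that the environment emits; `summarize_board` is a hypothetical helper, not part of the notebook.

```python
def summarize_board(board):
    """Recount flags and revealed cells from the visible board."""
    cells = [cell for row in board for cell in row]
    return {
        "rows": len(board),
        "cols": len(board[0]),
        "flags_placed": sum(cell == "F" for cell in cells),
        # Revealed cells show a digit, or "*" for a revealed mine.
        "cells_revealed": sum(cell.isdigit() or cell == "*" for cell in cells),
    }

# Small hypothetical board: one revealed "1" and one flag.
board = [
    ["1", ".", "."],
    ["F", ".", "."],
    [".", ".", "."],
]
print(summarize_board(board))
```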

## Output Format (Action)
```json
{"type": "reveal", "row": 2, "col": 3}
```
or
```json
{"type": "flag", "row": 1, "col": 4}
```
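
An action is only useful if it parses, names a known type, and points inside the board. The sketch below checks those three conditions for a raw response string. `check_action` is a hypothetical helper and is deliberately stricter than the notebook's own parser, which also accepts JSON embedded in surrounding text and casts `row`/`col` with `int()`.

```python
import json

def check_action(raw, rows, cols):
    """Return (ok, reason) for a raw model response."""
    try:
        action = json.loads(raw)
    except json.JSONDecodeError:
        return False, "not JSON"
    if not isinstance(action, dict):
        return False, "not an object"
    if action.get("type") not in ("reveal", "flag"):
        return False, "bad type"
    row, col = action.get("row"), action.get("col")
    # type(...) is int rejects booleans, which are a subclass of int in Python.
    if type(row) is not int or type(col) is not int:
        return False, "row/col must be integers"
    if not (0 <= row < rows and 0 <= col < cols):
        return False, "out of bounds"
    return True, "ok"

print(check_action('{"type": "reveal", "row": 2, "col": 3}', rows=6, cols=6))
print(check_action('{"type": "flag", "row": 6, "col": 0}', rows=6, cols=6))
```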

In [5]:
import json

SYSTEM_PROMPT = "You output JSON actions for Minesweeper. No text, only JSON."

def format_state_for_llm(game: MinesweeperGame) -> str:
    """Compact prompt matching eval agent format."""
    state = {
        "board": game.get_visible_board(),
        "rows": game.rows,
        "cols": game.cols,
        "mines": game.num_mines,
        "flags_placed": len(game._flagged),
        "cells_revealed": len(game._revealed),
    }
    prompt = f"""You are playing Minesweeper. Analyze the game state and output your next move.

You must output ONLY a valid JSON object. No explanation, no analysis, no text.

Start your response immediately with {{ and end with }}.

Do NOT output cell which is already revealed or flagged in the current state.

Game state:
{json.dumps(state)}

Legend:
- "." = unrevealed cell
- "F" = flagged cell (suspected mine)
- "0"-"8" = number of adjacent mines
- "*" = revealed mine (game over)

Output your next action as JSON:
{{"type": "reveal", "row": <row_index>, "col": <col_index>}}
or
{{"type": "flag", "row": <row_index>, "col": <col_index>}}

Your action:"""
    return prompt


def parse_llm_action(response: str) -> Optional[dict]:
    """Extract JSON action from LLM response. Returns the LAST valid match, or None if there is none."""
    import re
    best = None
    for match in re.finditer(r'\{[^{}]*\}', response):
        try:
            action = json.loads(match.group())
            if ("type" in action and "row" in action and "col" in action
                    and action["type"] in ["reveal", "flag"]):
                action["row"] = int(action["row"])
                action["col"] = int(action["col"])
                best = action
        except (json.JSONDecodeError, ValueError, TypeError):
            continue
    return best


def get_expert_move(game: MinesweeperGame) -> dict:
    """Oracle: generates REVEAL-focused expert moves.
    Never flags (deducible_mines is collected but unused) and peeks at the
    hidden board to avoid mines; teaches the model to reveal safely."""
    board = game.get_visible_board()
    rows, cols = game.rows, game.cols

    unrevealed = [(r, c) for r in range(rows) for c in range(cols) if board[r][c] == '.']
    if not unrevealed: return None

    # Find logically deducible cells
    deducible_safe = []
    deducible_mines = []
    for r in range(rows):
        for c in range(cols):
            cell = board[r][c]
            if not (cell.isdigit() and int(cell) > 0): continue
            num = int(cell)
            flagged, hidden = 0, []
            for dr in [-1, 0, 1]:
                for dc in [-1, 0, 1]:
                    if dr == 0 and dc == 0: continue
                    nr, nc = r + dr, c + dc
                    if 0 <= nr < rows and 0 <= nc < cols:
                        if board[nr][nc] == 'F': flagged += 1
                        elif board[nr][nc] == '.': hidden.append((nr, nc))
            if flagged == num:
                for h in hidden:
                    if h not in deducible_safe: deducible_safe.append(h)
            remaining = num - flagged
            if remaining > 0 and remaining == len(hidden):
                for h in hidden:
                    if h not in deducible_mines: deducible_mines.append(h)

    # Priority 1: logically safe reveal (BEST move)
    if deducible_safe:
        r, c = random.choice(deducible_safe)
        return {"type": "reveal", "row": r, "col": c}

    # Priority 2: safe cell adjacent to numbered cells (strategic)
    near = []
    for (r, c) in unrevealed:
        if game._board[r][c] == -1: continue  # skip mines
        for dr in [-1,0,1]:
            for dc in [-1,0,1]:
                if dr == 0 and dc == 0: continue
                nr, nc = r+dr, c+dc
                if 0<=nr<rows and 0<=nc<cols and board[nr][nc].isdigit() and int(board[nr][nc])>0:
                    near.append((r, c)); break
            else: continue
            break
    if near:
        r, c = random.choice(near)
        return {"type": "reveal", "row": r, "col": c}

    # Priority 3: safe corner/edge cell (good for opening)
    safe = [(r, c) for (r, c) in unrevealed if game._board[r][c] != -1]
    corners_edges = [(r, c) for (r, c) in safe
                     if r in (0, rows-1) or c in (0, cols-1)]
    if corners_edges:
        r, c = random.choice(corners_edges)
        return {"type": "reveal", "row": r, "col": c}

    # Priority 4: any safe cell
    if safe:
        r, c = random.choice(safe)
        return {"type": "reveal", "row": r, "col": c}

    # Fallback (will hit a mine; unavoidable sometimes)
    r, c = random.choice(unrevealed)
    return {"type": "reveal", "row": r, "col": c}


game = MinesweeperGame(rows=6, cols=6, num_mines=5)
prompt = format_state_for_llm(game)
toks = tokenizer(prompt, return_tensors="pt")
print(f"Prompt: {len(prompt)} chars, {toks['input_ids'].shape[1]} tokens")
print(f"Expert move: {get_expert_move(game)}")

Prompt: 894 chars, 255 tokens
Expert move: {'type': 'reveal', 'row': 0, 'col': 1}
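
The deduction step inside the expert relies on two single-cell rules: if a number already has that many flagged neighbours, its remaining hidden neighbours are safe; if the mines still unaccounted for equal the number of hidden neighbours, all of them are mines. The following standalone sketch applies both rules to a visible board. The `neighbors` and `deduce` helpers and the example boards are illustrative assumptions, and the sketch does not attempt multi-cell constraint reasoning.

```python
def neighbors(r, c, rows, cols):
    """Yield the in-bounds cells adjacent to (r, c)."""
    for dr in (-1, 0, 1):
        for dc in (-1, 0, 1):
            if (dr or dc) and 0 <= r + dr < rows and 0 <= c + dc < cols:
                yield r + dr, c + dc

def deduce(board):
    """Return (safe, mines) cells provable from one numbered cell at a time."""
    rows, cols = len(board), len(board[0])
    safe, mines = set(), set()
    for r in range(rows):
        for c in range(cols):
            cell = board[r][c]
            if not cell.isdigit() or cell == "0":
                continue
            around = list(neighbors(r, c, rows, cols))
            flagged = sum(board[nr][nc] == "F" for nr, nc in around)
            hidden = [(nr, nc) for nr, nc in around if board[nr][nc] == "."]
            remaining = int(cell) - flagged
            if remaining == 0:
                safe.update(hidden)       # every mine is already flagged
            elif remaining == len(hidden):
                mines.update(hidden)      # every hidden neighbour is a mine
    return safe, mines

# Hypothetical position: the flag at (0, 0) satisfies both "1" cells.
board = [
    ["F", "1", "."],
    ["1", "2", "."],
    [".", ".", "."],
]
safe, mines = deduce(board)
print(sorted(safe), sorted(mines))
```

Unlike the oracle above, this version reads only the visible board, so it can be used at inference time without access to the hidden mine layout.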


# SFT Warm-Up Before GRPO

Generate expert demonstrations and fine-tune on them (phase 1), before any RL:

In [6]:
from datasets import Dataset
from trl import SFTTrainer, SFTConfig
import gc
import numpy as np

def generate_sft_dataset(num_samples=5000, rng_seed=42):
    """Generate REVEAL-focused expert demonstrations."""
    np.random.seed(rng_seed)
    random.seed(rng_seed)
    configs = [
        (5,5,4),(5,5,5),(6,6,5),(6,6,7),(7,7,7),(7,7,10),
        (8,8,10),(8,8,13),(9,9,12),(10,10,15),(6,10,10),(10,6,10),
    ]
    items = []
    type_counts = {"reveal": 0, "flag": 0}
    attempts = 0
    while len(items) < num_samples and attempts < num_samples * 5:
        attempts += 1
        rows, cols, mines = random.choice(configs)
        seed = np.random.randint(100000)
        game = MinesweeperGame(rows=rows, cols=cols, num_mines=mines, seed=seed)
        # Play 0-10 safe moves for varied mid-game states
        for _ in range(np.random.randint(0, 11)):
            if game.state() != "ongoing": break
            board = game.get_visible_board()
            safe = [(r,c) for r in range(rows) for c in range(cols)
                    if board[r][c] == '.' and game._board[r][c] != -1]
            if not safe: break
            r, c = random.choice(safe)
            game.do_action({"type": "reveal", "row": r, "col": c})
        if game.state() != "ongoing": continue
        expert = get_expert_move(game)
        if expert is None: continue
        type_counts[expert["type"]] += 1
        prompt = format_state_for_llm(game)
        items.append(tokenizer.apply_chat_template([
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": prompt},
            {"role": "assistant", "content": json.dumps(expert)},
        ], tokenize=False, add_generation_prompt=False))

    print(f"SFT dataset: {len(items)} demonstrations")
    print(f"  Actions: {type_counts}")
    print(f"  Reveal%: {type_counts['reveal']/len(items)*100:.1f}%")
    return Dataset.from_dict({"text": items[:num_samples]})

sft_dataset = generate_sft_dataset(5000)

# === SFT Training: 3 epochs ===
FastLanguageModel.for_training(model)

sft_trainer = SFTTrainer(
    model=model, tokenizer=tokenizer, train_dataset=sft_dataset,
    args=SFTConfig(
        output_dir="minesweeper_sft",
        per_device_train_batch_size=4,
        gradient_accumulation_steps=2,
        num_train_epochs=3,
        learning_rate=2e-4,
        warmup_ratio=0.1,
        lr_scheduler_type="cosine",
        optim="adamw_8bit",
        logging_steps=100,
        save_steps=9999,
        max_seq_length=max_seq_length,
        dataset_text_field="text",
        report_to="none",
    ),
)
print("=== PHASE 1: SFT Training ===")
sft_trainer.train()
print("SFT complete!")
gc.collect(); torch.cuda.empty_cache()

SFT dataset: 5000 demonstrations
  Actions: {'reveal': 5000, 'flag': 0}
  Reveal%: 100.0%


Unsloth: Tokenizing ["text"] (num_proc=64):   0%|          | 0/5000 [00:00<?, ? examples/s]

=== PHASE 1: SFT Training ===


==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 5,000 | Num Epochs = 3 | Total steps = 1,875
O^O/ \_/ \    Batch size per device = 4 | Gradient accumulation steps = 2
\        /    Data Parallel GPUs = 1 | Total batch size (4 x 2 x 1) = 8
 "-____-"     Trainable parameters = 83,886,080 of 8,114,147,328 (1.03% trained)


Unsloth: Will smartly offload gradients to save VRAM!


| Step | Training Loss |
|-----:|--------------:|
| 100  | 0.4883 |
| 200  | 0.1041 |
| 300  | 0.1063 |
| 400  | 0.0975 |
| 500  | 0.0955 |
| 600  | 0.0962 |
| 700  | 0.0942 |
| 800  | 0.0919 |
| 900  | 0.0916 |
| 1000 | 0.0885 |


SFT complete!
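
The step count in the training banner follows directly from the dataset size and batch settings. A short worked calculation, using the values from the `SFTConfig` above; the warm-up rounding is an assumption and may differ between library versions:

```python
import math

num_examples = 5000
per_device_batch = 4
grad_accum = 2
num_gpus = 1
epochs = 3
warmup_ratio = 0.1

effective_batch = per_device_batch * grad_accum * num_gpus   # examples per optimizer step
steps_per_epoch = math.ceil(num_examples / effective_batch)
total_steps = steps_per_epoch * epochs
warmup_steps = math.ceil(total_steps * warmup_ratio)         # assumed rounding

print(effective_batch, steps_per_epoch, total_steps, warmup_steps)
```

The effective batch of 8 and the total of 1,875 steps agree with the banner printed before training.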


# GRPO Reward Functions

Test the SFT model on a few games, then define the reward functions that guide GRPO:

In [7]:
from transformers import TextStreamer
FastLanguageModel.for_inference(model)

print("=== SFT MODEL TEST ===\n")
for seed in [42, 99, 7, 123, 456]:
    game = MinesweeperGame(rows=6, cols=6, num_mines=5, seed=seed)
    print(f"--- Seed {seed} ---")
    moves = 0
    prev_actions = []
    while game.state() == "ongoing" and moves < 20:
        prompt = format_state_for_llm(game)
        text = tokenizer.apply_chat_template(
            [{"role": "system", "content": SYSTEM_PROMPT},
             {"role": "user", "content": prompt}],
            tokenize=False, add_generation_prompt=True)
        output = model.generate(**tokenizer(text, return_tensors="pt").to(model.device),
            temperature=0.6, max_new_tokens=128, do_sample=True)
        action = parse_llm_action(tokenizer.decode(output[0], skip_special_tokens=True))
        if action is None: print(f"  Move {moves+1}: PARSE FAIL"); break
        
        # Detect an A-B-A loop: the same action as two moves ago (e.g. toggling a flag)
        action_key = f"{action['type']}_{action['row']}_{action['col']}"
        if len(prev_actions) >= 2 and action_key == prev_actions[-2]:
            print(f"  Move {moves+1}: FLAG-TOGGLE DETECTED on ({action['row']},{action['col']})")
            break
        prev_actions.append(action_key)
        
        board = game.get_visible_board()
        r, c = action["row"], action["col"]
        cv = board[r][c] if 0<=r<game.rows and 0<=c<game.cols else "OOB"
        result = game.do_action(action)
        print(f"  Move {moves+1}: {action} cell='{cv}' -> {result}")
        moves += 1
    print(f"  Final: {game.state()} ({moves} moves)\n")

gc.collect(); torch.cuda.empty_cache()

# === Reward functions ===
import numpy as np

def is_logically_deducible_safe(game, row, col):
    board = game.get_visible_board()
    for dr in [-1, 0, 1]:
        for dc in [-1, 0, 1]:
            if dr == 0 and dc == 0: continue
            nr, nc = row + dr, col + dc
            if not (0 <= nr < game.rows and 0 <= nc < game.cols): continue
            cell = board[nr][nc]
            if not (cell.isdigit() and int(cell) > 0): continue
            num = int(cell)
            flagged, hidden = 0, []
            for ddr in [-1, 0, 1]:
                for ddc in [-1, 0, 1]:
                    if ddr == 0 and ddc == 0: continue
                    nnr, nnc = nr + ddr, nc + ddc
                    if 0 <= nnr < game.rows and 0 <= nnc < game.cols:
                        if board[nnr][nnc] == 'F': flagged += 1
                        elif board[nnr][nnc] == '.': hidden.append((nnr, nnc))
            if flagged == num and (row, col) in hidden: return True
    return False

def is_logically_deducible_mine(game, row, col):
    board = game.get_visible_board()
    for dr in [-1, 0, 1]:
        for dc in [-1, 0, 1]:
            if dr == 0 and dc == 0: continue
            nr, nc = row + dr, col + dc
            if not (0 <= nr < game.rows and 0 <= nc < game.cols): continue
            cell = board[nr][nc]
            if not (cell.isdigit() and int(cell) > 0): continue
            num = int(cell)
            flagged, hidden = 0, []
            for ddr in [-1, 0, 1]:
                for ddc in [-1, 0, 1]:
                    if ddr == 0 and ddc == 0: continue
                    nnr, nnc = nr + ddr, nc + ddc
                    if 0 <= nnr < game.rows and 0 <= nnc < game.cols:
                        if board[nnr][nnc] == 'F': flagged += 1
                        elif board[nnr][nnc] == '.': hidden.append((nnr, nnc))
            remaining = num - flagged
            if remaining > 0 and remaining == len(hidden) and (row, col) in hidden: return True
    return False

def valid_json_reward(completions, **kwargs):
    scores = []
    for c in completions:
        resp = c[0]["content"]
        a = parse_llm_action(resp)
        if a is None: scores.append(-5.0)
        elif len(resp.strip()) < 60: scores.append(3.0)
        elif len(resp.strip()) < 120: scores.append(2.0)
        else: scores.append(0.5)
    return scores

def gameplay_scores(completions, **kwargs):
    scores = []
    seeds = kwargs.get("seed", [])
    mhs = kwargs.get("move_history", [])
    rl = kwargs.get("rows", []); cl = kwargs.get("cols", []); ml = kwargs.get("num_mines", [])
    for idx, comp in enumerate(completions):
        resp = comp[0]["content"]
        action = parse_llm_action(resp)
        if action is None: scores.append(-10.0); continue
        if idx >= len(seeds): scores.append(0.0); continue
        r_ct=int(rl[idx]) if idx<len(rl) else 6
        c_ct=int(cl[idx]) if idx<len(cl) else 6
        m_ct=int(ml[idx]) if idx<len(ml) else 5
        mh = json.loads(mhs[idx]) if isinstance(mhs[idx], str) else mhs[idx]
        game = MinesweeperGame(rows=r_ct, cols=c_ct, num_mines=m_ct, seed=seeds[idx])
        for prev in mh: game.do_action(prev)
        board = game.get_visible_board()
        try: row,col = int(action["row"]),int(action["col"])
        except (KeyError, ValueError, TypeError): scores.append(-10.0); continue
        atype = action["type"]
        if not (0<=row<game.rows and 0<=col<game.cols): scores.append(-15.0); continue
        cell = board[row][col]
        if atype == "reveal":
            if cell == 'F': scores.append(-8.0)
            elif cell != '.': scores.append(-12.0)
            elif game._board[row][col] == -1: scores.append(-25.0)
            else:
                base = 10.0
                if is_logically_deducible_safe(game, row, col): base = 20.0
                for dr in [-1,0,1]:
                    for dc in [-1,0,1]:
                        nr,nc = row+dr,col+dc
                        if 0<=nr<game.rows and 0<=nc<game.cols and board[nr][nc].isdigit() and int(board[nr][nc])>0:
                            base += 3.0; break
                    else: continue
                    break
                sim = MinesweeperGame(rows=r_ct, cols=c_ct, num_mines=m_ct, seed=seeds[idx])
                for prev in mh: sim.do_action(prev)
                if sim.do_action({"type":"reveal","row":row,"col":col}) == "win": base += 100.0
                base += (row*c_ct+col)*0.01
                scores.append(base)
        elif atype == "flag":
            if cell == 'F': scores.append(-15.0)  # HARSH: no toggling
            elif cell != '.': scores.append(-8.0)
            elif len(game._flagged) >= game.num_mines: scores.append(-10.0)
            elif game._board[row][col] == -1:
                base = 12.0  # Lower than reveal reward to prefer reveals
                if is_logically_deducible_mine(game, row, col): base += 5.0
                scores.append(base)
            else: scores.append(-15.0)  # Wrong flag is HARSH
        else: scores.append(-10.0)
    return scores

print("Reward functions loaded (reveal-biased)")

=== SFT MODEL TEST ===

--- Seed 42 ---
  Move 1: {'type': 'reveal', 'row': 5, 'col': 0} cell='.' -> ok
  Move 2: {'type': 'reveal', 'row': 2, 'col': 3} cell='.' -> mine
  Final: failed (2 moves)

--- Seed 99 ---
  Move 1: {'type': 'reveal', 'row': 5, 'col': 4} cell='.' -> ok
  Move 2: {'type': 'reveal', 'row': 4, 'col': 1} cell='.' -> mine
  Final: failed (2 moves)

--- Seed 7 ---
  Move 1: {'type': 'reveal', 'row': 0, 'col': 3} cell='.' -> mine
  Final: failed (1 moves)

--- Seed 123 ---
  Move 1: {'type': 'reveal', 'row': 0, 'col': 5} cell='.' -> mine
  Final: failed (1 moves)

--- Seed 456 ---
  Move 1: {'type': 'reveal', 'row': 0, 'col': 3} cell='.' -> ok
  Move 2: {'type': 'reveal', 'row': 1, 'col': 3} cell='.' -> ok
  Move 3: {'type': 'reveal', 'row': 1, 'col': 4} cell='.' -> ok
  Move 4: {'type': 'reveal', 'row': 5, 'col': 3} cell='.' -> ok
  Move 5: {'type': 'reveal', 'row': 4, 'col': 1} cell='.' -> ok
  Move 6: FLAG-TOGGLE DETECTED on (5,3)
  Final: ongoing (5 moves)

Reward functions loaded (reveal-biased)
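
The reward functions above index each completion as `c[0]["content"]` and read per-sample metadata such as `seed` and `rows` from `kwargs`. The sketch below mimics that calling convention with mock data, so a reward function can be smoke-tested without starting a trainer. `toy_format_reward` and the mock inputs are hypothetical, and the shape of `completions` is an assumption based on how the notebook's own functions consume it.

```python
import json

def toy_format_reward(completions, **kwargs):
    """Hypothetical reward: +1 for a parseable JSON object, -1 otherwise."""
    scores = []
    for completion in completions:
        text = completion[0]["content"]   # one assistant message per completion
        try:
            ok = isinstance(json.loads(text), dict)
        except json.JSONDecodeError:
            ok = False
        scores.append(1.0 if ok else -1.0)
    return scores

# Mock batch: two completions plus the extra dataset columns passed as kwargs.
completions = [
    [{"role": "assistant", "content": '{"type": "reveal", "row": 0, "col": 1}'}],
    [{"role": "assistant", "content": "I would reveal the corner."}],
]
extra_columns = {"seed": [123, 123], "rows": [6, 6]}
print(toy_format_reward(completions, **extra_columns))  # [1.0, -1.0]
```

Each function returns one float per completion, in the same order as the input.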

# Create Training Dataset

Generate diverse game states for training:

In [8]:
from datasets import Dataset

def generate_grpo_dataset(num_samples=2000, rng_seed=99):
    np.random.seed(rng_seed)
    random.seed(rng_seed)
    configs = [
        (5,5,4),(5,5,5),(6,6,5),(6,6,7),(7,7,7),(7,7,10),
        (8,8,10),(8,8,13),(9,9,12),(10,10,15),(6,10,10),(10,6,10),
    ]
    items = []
    attempts = 0
    while len(items) < num_samples and attempts < num_samples * 5:
        attempts += 1
        rows, cols, mines = random.choice(configs)
        seed = np.random.randint(100000)
        game = MinesweeperGame(rows=rows, cols=cols, num_mines=mines, seed=seed)
        move_history = []
        for _ in range(np.random.randint(0, 8)):
            board = game.get_visible_board()
            unrev = [(r,c) for r in range(rows) for c in range(cols) if board[r][c] == '.']
            if not unrev or game.state() != "ongoing": break
            r, c = random.choice(unrev)
            a = {"type": "flag" if random.random() < 0.08 and len(game._flagged) < mines else "reveal", "row": r, "col": c}
            game.do_action(a); move_history.append(a)
        if game.state() == "ongoing":
            items.append({
                "prompt": [
                    {"role": "system", "content": SYSTEM_PROMPT},
                    {"role": "user", "content": format_state_for_llm(game)},
                ],
                "seed": seed, "move_history": json.dumps(move_history),
                "rows": rows, "cols": cols, "num_mines": mines,
            })
    dataset = Dataset.from_list(items[:num_samples])
    print(f"GRPO dataset: {len(dataset)} prompts")
    return dataset

dataset = generate_grpo_dataset(2000)

GRPO dataset: 2000 prompts
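Each dataset row stores `seed`, `move_history`, and the board dimensions alongside the prompt, which is enough to recreate the exact position later (for example inside a reward function). The sketch below shows one way that replay could look. `rebuild_game` and the `_RecordingGame` stand-in are hypothetical names introduced here for illustration; in the notebook you would pass the real `MinesweeperGame` class, assuming it accepts the same keyword arguments used in `generate_grpo_dataset`.

```python
import json

def rebuild_game(row, game_cls):
    """Recreate the position a prompt was generated from.

    `row` is one dataset record. `game_cls` is assumed to take the same
    keyword arguments as MinesweeperGame in the cell above.
    """
    game = game_cls(rows=row["rows"], cols=row["cols"],
                    num_mines=row["num_mines"], seed=row["seed"])
    # move_history was stored as a JSON string, so decode it before replaying.
    for action in json.loads(row["move_history"]):
        game.do_action(action)
    return game

class _RecordingGame:
    """Stand-in for MinesweeperGame that only records what it is asked to do."""
    def __init__(self, rows, cols, num_mines, seed):
        self.config = (rows, cols, num_mines, seed)
        self.actions = []

    def do_action(self, action):
        self.actions.append(action)
        return "ok"

row = {"rows": 6, "cols": 6, "num_mines": 5, "seed": 123,
       "move_history": json.dumps([{"type": "reveal", "row": 0, "col": 0}])}
game = rebuild_game(row, _RecordingGame)
print(game.config, game.actions)
```

Because the board is derived from the seed, replaying the stored moves should reproduce the same visible state that was shown to the model, provided the game class is deterministic for a given seed.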


# Configure GRPO Training

Set up GRPO trainer with all hyperparameters:

In [9]:
from trl import GRPOConfig, GRPOTrainer

training_args = GRPOConfig(
    temperature = 1.0,
    learning_rate = 5e-6,
    adam_beta1 = 0.9,
    adam_beta2 = 0.99,
    weight_decay = 0.1,
    warmup_ratio = 0.1,
    lr_scheduler_type = "cosine",
    optim = "adamw_8bit",
    max_grad_norm = 0.1,
    logging_steps = 1,
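    per_device_train_batch_size = 1,   # Unsloth raises this to num_generations (8) at startup
    gradient_accumulation_steps = 4,
    num_generations = 8,               # completions sampled per prompt (the GRPO group size)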
    max_prompt_length = 800,
    max_completion_length = 128,
    max_steps = 50,
    save_steps = 50,
    report_to = "none",
    output_dir = "minesweeper_grpo_outputs",
    loss_type = "dr_grpo",
    mask_truncated_completions = True,
)
print(f"GRPO: LR=5e-6, gens=8, steps=50, dr_grpo, grad_norm=0.1")

Unsloth: We now expect `per_device_train_batch_size` to be a multiple of `num_generations`.
We will change the batch size of 1 to the `num_generations` of 8
GRPO: LR=5e-6, gens=8, steps=50, dr_grpo, grad_norm=0.1
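It can help to work out how many distinct prompts this configuration actually consumes. The sketch below is back-of-the-envelope arithmetic, not trainer code. It assumes that each optimizer step uses one fresh batch of completions and that every prompt contributes exactly `num_generations` completions. `grpo_batch_math` is a hypothetical helper name.

```python
def grpo_batch_math(per_device_batch, grad_accum, num_generations, max_steps, num_gpus=1):
    """Rough prompt/completion counts for a GRPO run (assumptions in the lead-in)."""
    completions_per_step = per_device_batch * grad_accum * num_gpus
    prompts_per_step = completions_per_step // num_generations
    return {
        "completions_per_step": completions_per_step,
        "prompts_per_step": prompts_per_step,
        "prompts_seen": prompts_per_step * max_steps,
    }

# Batch size 8 is the value Unsloth substitutes for the configured 1.
stats = grpo_batch_math(per_device_batch=8, grad_accum=4, num_generations=8, max_steps=50)
print(stats)  # completions_per_step=32, prompts_per_step=4, prompts_seen=200
```

Under these assumptions, 50 steps touch about 200 of the 2,000 generated prompts, so increasing `max_steps` adds new positions rather than repeating old ones.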


In [10]:
from transformers import TrainerCallback

class MinesweeperEvalCallback(TrainerCallback):
    def __init__(self, eval_every=25, num_games=5):
        self.eval_every = eval_every
        self.num_games = num_games
        self.best = 0.0
    def on_step_end(self, args, state, control, model=None, processing_class=None, **kwargs):
        if state.global_step % self.eval_every != 0: return
        tok = processing_class
        if tok is None or model is None: return
        FastLanguageModel.for_inference(model)
        torch.cuda.empty_cache()
        wins, valid, invalid, total_mv = 0, 0, 0, 0
        for i in range(self.num_games):
            game = MinesweeperGame(rows=6, cols=6, num_mines=5, seed=10000+i)
            mv = 0; prev_actions = []
            while game.state() == "ongoing" and mv < 40:
                p = format_state_for_llm(game)
                t = tok.apply_chat_template(
                    [{"role":"system","content":SYSTEM_PROMPT},{"role":"user","content":p}],
                    tokenize=False, add_generation_prompt=True)
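                inputs = tok(t, return_tensors="pt").to(model.device)
                out = model.generate(**inputs,
                    temperature=0.6, max_new_tokens=128, do_sample=True)
                # Decode only the generated tokens so the parser never sees JSON from the prompt
                a = parse_llm_action(tok.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))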
                if a is None: invalid += 1; break
                ak = f"{a['type']}_{a['row']}_{a['col']}"
                if len(prev_actions) >= 2 and ak == prev_actions[-2]: break
                prev_actions.append(ak)
                valid += 1; game.do_action(a); mv += 1
                if i == 0 and state.global_step <= self.eval_every:
                    print(f"    Move {mv}: {a} -> {game.state()}")
            total_mv += mv
            if game.state() == "success": wins += 1
        wr = wins / self.num_games
        best = " ** BEST **" if wr > self.best else ""
        if wr > self.best: self.best = wr
        print(f"\n[Eval@{state.global_step}] Wins:{wins}/{self.num_games} ({wr*100:.0f}%) "
              f"Valid:{valid/(max(valid+invalid,1))*100:.0f}% AvgMoves:{total_mv/self.num_games:.1f}{best}\n")
        FastLanguageModel.for_training(model)
        torch.cuda.empty_cache()

eval_callback = MinesweeperEvalCallback(eval_every=25, num_games=5)
print("Eval: 5 games every 25 steps (with loop detection)")

Eval: 5 games every 25 steps (with loop detection)
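The loop detection in the callback compares the current action with the one taken two moves earlier, so it catches an A, B, A pattern (such as toggling the same flag on and off). The standalone sketch below isolates that check. `is_toggle_loop` is a hypothetical helper name, and the action keys follow the `type_row_col` format built in the callback.

```python
def is_toggle_loop(prev_actions, action_key):
    """True when action_key repeats the action taken two moves ago (A, B, A)."""
    return len(prev_actions) >= 2 and action_key == prev_actions[-2]

history = []
stopped_at = None
for key in ["flag_5_3", "reveal_1_1", "flag_5_3", "reveal_0_0"]:
    if is_toggle_loop(history, key):
        stopped_at = key          # the third action repeats the first
        break
    history.append(key)

print(stopped_at, history)
```

Note that this check does not fire on an immediate repeat (A, A): the second identical action is still accepted, and only a third one matches `prev_actions[-2]`.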


# Train the Model

Start GRPO training with reward functions:

In [11]:
import gc

gc.collect()
torch.cuda.empty_cache()
FastLanguageModel.for_training(model)

trainer = GRPOTrainer(
    model=model, processing_class=tokenizer,
    reward_funcs=[valid_json_reward, gameplay_scores],
    args=training_args, train_dataset=dataset,
    callbacks=[eval_callback],
)
print("=== PHASE 2: GRPO (50 steps) ===")
trainer.train()
print("GRPO complete!")

=== PHASE 2: GRPO (50 steps) ===


==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 2,000 | Num Epochs = 1 | Total steps = 50
O^O/ \_/ \    Batch size per device = 8 | Gradient accumulation steps = 4
\        /    Data Parallel GPUs = 1 | Total batch size (8 x 4 x 1) = 32
 "-____-"     Trainable parameters = 83,886,080 of 8,114,147,328 (1.03% trained)
`generation_config` default values have been modified to match model-specific defaults: {'max_length': 131072, 'temperature': 0.6, 'top_p': 0.9}. If this is not desired, please set these values explicitly.


Step,Training Loss,reward,reward_std,completions / mean_length,completions / min_length,completions / max_length,completions / clipped_ratio,completions / mean_terminated_length,completions / min_terminated_length,completions / max_terminated_length,sampling / sampling_logp_difference / mean,sampling / sampling_logp_difference / max,sampling / importance_sampling_ratio / min,sampling / importance_sampling_ratio / mean,sampling / importance_sampling_ratio / max,kl,rewards / valid_json_reward / mean,rewards / valid_json_reward / std,rewards / gameplay_scores / mean,rewards / gameplay_scores / std
1,0.0001,12.666875,10.273813,19.0,19.0,19.0,0.0,19.0,19.0,19.0,0,0,0,0,0,0.135193,3.0,0.0,9.666875,11.755674
2,0.0001,0.414375,16.902935,19.0,19.0,19.0,0.0,19.0,19.0,19.0,No Log,No Log,No Log,No Log,No Log,0.093985,3.0,0.0,-2.585625,17.464191
3,0.0001,0.891562,15.707942,19.0,19.0,19.0,0.0,19.0,19.0,19.0,No Log,No Log,No Log,No Log,No Log,0.111759,3.0,0.0,-2.108438,17.84886
4,0.0,2.112188,15.27694,19.0,19.0,19.0,0.0,19.0,19.0,19.0,No Log,No Log,No Log,No Log,No Log,0.081155,3.0,0.0,-0.887812,17.770535
5,0.0001,-1.835,16.034588,19.0,19.0,19.0,0.0,19.0,19.0,19.0,No Log,No Log,No Log,No Log,No Log,0.146345,3.0,0.0,-4.835,17.892859
6,0.0001,4.12875,15.937254,19.0,19.0,19.0,0.0,19.0,19.0,19.0,No Log,No Log,No Log,No Log,No Log,0.107548,3.0,0.0,1.12875,15.639008
7,0.0,2.935938,12.362643,19.0,19.0,19.0,0.0,19.0,19.0,19.0,No Log,No Log,No Log,No Log,No Log,0.083625,3.0,0.0,-0.064062,16.142885
8,0.0001,1.37625,17.221165,19.0,19.0,19.0,0.0,19.0,19.0,19.0,No Log,No Log,No Log,No Log,No Log,0.084268,3.0,0.0,-1.62375,17.525503
9,0.0001,2.613438,17.747551,19.0,19.0,19.0,0.0,19.0,19.0,19.0,No Log,No Log,No Log,No Log,No Log,0.107485,3.0,0.0,-0.386563,19.055725
10,0.0001,8.023125,12.056473,19.0,19.0,19.0,0.0,19.0,19.0,19.0,No Log,No Log,No Log,No Log,No Log,0.133132,3.0,0.0,5.023125,13.82814


    Move 1: {'type': 'reveal', 'row': 2, 'col': 0} -> ongoing
    Move 2: {'type': 'reveal', 'row': 3, 'col': 1} -> failed

[Eval@25] Wins:0/5 (0%) Valid:100% AvgMoves:2.0


[Eval@50] Wins:1/5 (20%) Valid:100% AvgMoves:3.8 ** BEST **

GRPO complete!
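In the training log above, `valid_json_reward` has a mean of 3.0 and a standard deviation of 0.0 at every logged step, meaning every completion received the same format reward. GRPO scores each completion relative to the other completions for the same prompt, so a term that is constant within a group shifts the baseline without changing any advantage. The sketch below illustrates this with made-up gameplay scores. `group_advantages` is a hypothetical helper, and whether the trainer also divides by the group standard deviation depends on its configuration, so scaling is optional here.

```python
from statistics import mean, pstdev

def group_advantages(rewards, scale=False, eps=1e-4):
    """Center rewards on the group mean; optionally divide by the group std."""
    baseline = mean(rewards)
    centered = [r - baseline for r in rewards]
    if not scale:
        return centered
    spread = pstdev(rewards)
    return [c / (spread + eps) for c in centered]

# Made-up scores for one group of 8 completions (illustrative values only).
gameplay = [15.0, -25.0, 10.0, -25.0, 15.0, 10.0, -25.0, 15.0]
format_bonus = [3.0] * 8                       # identical for every completion
combined = [g + f for g, f in zip(gameplay, format_bonus)]

adv_gameplay = group_advantages(gameplay)
adv_combined = group_advantages(combined)
print(adv_gameplay)
print(adv_combined)
```

Both printed lists match, which suggests that once every completion is valid JSON, the learning signal comes from `gameplay_scores` alone.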


# Test Trained Model

Evaluate the finetuned model:

In [12]:
FastLanguageModel.for_inference(model)

print("=== FINAL MODEL TEST ===\n")
for seed in [42, 99, 7]:
    game = MinesweeperGame(rows=6, cols=6, num_mines=5, seed=seed)
    print(f"--- Seed {seed} ---")
    moves = 0
    while game.state() == "ongoing" and moves < 15:
        prompt = format_state_for_llm(game)
        text = tokenizer.apply_chat_template(
            [{"role":"system","content":SYSTEM_PROMPT},{"role":"user","content":prompt}],
            tokenize=False, add_generation_prompt=True)
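        inputs = tokenizer(text, return_tensors="pt").to(model.device)
        output = model.generate(**inputs,
            temperature=0.6, max_new_tokens=128, do_sample=True)
        # Decode only the generated tokens so the parser never sees JSON from the prompt
        action = parse_llm_action(tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))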
        if action is None: print(f"  Move {moves+1}: PARSE FAIL"); break
        board = game.get_visible_board()
        r, c = action["row"], action["col"]
        cv = board[r][c] if 0<=r<game.rows and 0<=c<game.cols else "OOB"
        result = game.do_action(action)
        print(f"  Move {moves+1}: {action} cell='{cv}' -> {result}")
        moves += 1
    print(f"  Final: {game.state()} ({moves} moves)\n")

=== FINAL MODEL TEST ===

--- Seed 42 ---
  Move 1: {'type': 'reveal', 'row': 0, 'col': 3} cell='.' -> ok
  Move 2: {'type': 'reveal', 'row': 1, 'col': 1} cell='.' -> mine
  Final: failed (2 moves)

--- Seed 99 ---
  Move 1: {'type': 'reveal', 'row': 4, 'col': 5} cell='.' -> ok
  Move 2: {'type': 'reveal', 'row': 4, 'col': 1} cell='.' -> mine
  Final: failed (2 moves)

--- Seed 7 ---
  Move 1: {'type': 'reveal', 'row': 5, 'col': 3} cell='.' -> ok
  Move 2: {'type': 'reveal', 'row': 3, 'col': 1} cell='.' -> ok
  Move 3: {'type': 'reveal', 'row': 0, 'col': 3} cell='.' -> mine
  Final: failed (3 moves)



# Evaluation: Play Complete Games

Test the model on multiple complete games:

In [13]:
def play_full_game(model, tokenizer, rows=6, cols=6, num_mines=5, seed=None, max_moves=80):
    game = MinesweeperGame(rows=rows, cols=cols, num_mines=num_mines, seed=seed)
    moves = 0
    prev_actions = []
    while game.state() == "ongoing" and moves < max_moves:
        prompt = format_state_for_llm(game)
        text = tokenizer.apply_chat_template(
            [{"role":"system","content":SYSTEM_PROMPT},{"role":"user","content":prompt}],
            tokenize=False, add_generation_prompt=True)
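        inputs = tokenizer(text, return_tensors="pt").to(model.device)
        output = model.generate(**inputs,
            temperature=0.6, max_new_tokens=128, do_sample=True)
        # Decode only the generated tokens so the parser never sees JSON from the prompt
        action = parse_llm_action(tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))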
        if action is None: break
        # Detect loops
        ak = f"{action['type']}_{action['row']}_{action['col']}"
        if len(prev_actions) >= 2 and ak == prev_actions[-2]: break
        prev_actions.append(ak)
        game.do_action(action); moves += 1
    return game, moves

FastLanguageModel.for_inference(model)
num_games = 20  # games played per board configuration
for name, r, c, m in [("6x6/5m",6,6,5),("8x8/10m",8,8,10),("10x10/15m",10,10,15)]:
    wins, total_mv = 0, 0
    for i in range(num_games):
        g, mv = play_full_game(model, tokenizer, r, c, m, seed=i)
        total_mv += mv
        if g.state() == "success": wins += 1
        # Print the first three games and every win
        if i < 3 or g.state() == "success":
            print(f"  [{name}] Game {i+1}: {'WIN' if g.state()=='success' else 'LOSS'} ({mv} moves)")
    print(f"  {name}: {wins}/{num_games} ({100*wins/num_games:.0f}%), avg {total_mv/num_games:.1f} moves\n")

  [6x6/5m] Game 1: LOSS (4 moves)
  [6x6/5m] Game 2: LOSS (1 moves)
  [6x6/5m] Game 3: LOSS (4 moves)
  [6x6/5m] Game 20: WIN (5 moves)
  6x6/5m: 1/20 (5%), avg 3.0 moves

  [8x8/10m] Game 1: LOSS (2 moves)
  [8x8/10m] Game 2: LOSS (4 moves)
  [8x8/10m] Game 3: LOSS (9 moves)
  8x8/10m: 0/20 (0%), avg 3.3 moves

  [10x10/15m] Game 1: LOSS (4 moves)
  [10x10/15m] Game 2: LOSS (2 moves)
  [10x10/15m] Game 3: LOSS (2 moves)
  10x10/15m: 0/20 (0%), avg 4.0 moves
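With only 20 games per configuration, the win rates above carry wide uncertainty. One common way to quantify that is a Wilson score interval, sketched below. `wilson_interval` is a hypothetical helper, and the 95% level (`z=1.96`) is an assumption, not something the notebook specifies.

```python
import math

def wilson_interval(wins, games, z=1.96):
    """Wilson score interval for a win rate; z=1.96 is roughly a 95% level."""
    if games <= 0:
        raise ValueError("games must be positive")
    p = wins / games
    denom = 1 + z * z / games
    center = (p + z * z / (2 * games)) / denom
    half = z * math.sqrt(p * (1 - p) / games + z * z / (4 * games * games)) / denom
    # Clamp to [0, 1] to absorb floating-point rounding at the edges.
    return max(0.0, center - half), min(1.0, center + half)

# Win counts taken from the evaluation output above.
intervals = {}
for name, wins in [("6x6/5m", 1), ("8x8/10m", 0), ("10x10/15m", 0)]:
    intervals[name] = wilson_interval(wins, 20)
    low, high = intervals[name]
    print(f"{name}: {wins}/20 -> {low:.1%} to {high:.1%}")
```

The intervals for 1/20 and 0/20 overlap heavily, so differences of a single win between checkpoints or reward designs are hard to distinguish from noise without more evaluation games.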



# Save the Model

Save your trained model for competition submission:

In [14]:
model.save_pretrained("my_minesweeper_model")
tokenizer.save_pretrained("my_minesweeper_model")
print("LoRA saved.")

model.save_pretrained_merged(
    "my_minesweeper_model_merged",
    tokenizer,
    save_method = "merged_16bit",
)
print("Merged model saved to: my_minesweeper_model_merged/")
print("\nUpdate agents/minesweeper_model.py:")
print('  model_name = "/workspace/my_minesweeper_model_merged"')

LoRA saved.


Ignored error while writing commit hash to /root/.cache/huggingface/models--Unsloth--Llama-3.1-8B-Instruct/refs/main: [Errno 30] Read-only file system: '/root/.cache/huggingface/models--Unsloth--Llama-3.1-8B-Instruct/refs/main'.




UnboundLocalError: cannot access local variable 'copied_tokenizer_model_from_cache' where it is not associated with a value

# Competition Tips

## Improve Your Model:

1. **Adjust Reward Functions**
   - Increase rewards for logical deduction
   - Add penalties for random moves
   - Reward flagging correct mines

2. **Tune Hyperparameters**
   - Increase `max_steps` for longer training (this run uses 50)
   - Adjust `learning_rate` (this run uses 5e-6; try 1e-5 to 1e-4)
   - Increase `lora_rank` for more capacity (this run uses 32)
   - Adjust `num_generations` (2-8; this run uses 8)

3. **Better Training Data**
   - Generate more diverse states
   - Include harder scenarios (more mines)
   - Add states requiring logical deduction

4. **Advanced Techniques**
   - Multi-step rollouts in reward function
   - Curriculum learning (easy → hard boards)
   - Ensemble multiple models
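As a starting point for the curriculum idea, the board configurations used in `generate_grpo_dataset` can be ordered from easy to hard and split into stages. The sketch below is one possible ordering. The difficulty measure (board area first, then mine density) and the helper names `difficulty` and `curriculum_stages` are assumptions made for illustration.

```python
# (rows, cols, mines) tuples copied from generate_grpo_dataset.
configs = [
    (5,5,4),(5,5,5),(6,6,5),(6,6,7),(7,7,7),(7,7,10),
    (8,8,10),(8,8,13),(9,9,12),(10,10,15),(6,10,10),(10,6,10),
]

def difficulty(config):
    """Assumed difficulty proxy: larger boards first, then higher mine density."""
    rows, cols, mines = config
    cells = rows * cols
    return (cells, mines / cells)

def curriculum_stages(configs, num_stages=3):
    """Sort configs by difficulty and split them into roughly equal stages."""
    ordered = sorted(configs, key=difficulty)
    size = -(-len(ordered) // num_stages)  # ceiling division
    return [ordered[i:i + size] for i in range(0, len(ordered), size)]

stages = curriculum_stages(configs)
for n, stage in enumerate(stages, start=1):
    print(f"stage {n}: {stage}")
```

Each stage could then be passed to the dataset generator in turn, training on small boards before mixing in the larger ones.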

## Team Strategy:
- Experiment with different reward functions
- Try different board sizes during training
- Analyze failed games to improve rewards
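- Tune the sampling temperature used during evaluation (the cells above use `temperature=0.6`); lower values make move selection more deterministic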
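To reward logical deduction or to analyze failed games, it helps to know which moves were provable from the visible board. The sketch below applies the simplest single-number rule: if a revealed number already has that many flagged neighbors, its remaining hidden neighbors are safe, and if the number of mines still unaccounted for equals its hidden neighbor count, they are all mines. The board encoding is partly assumed: `'.'` for unrevealed cells matches the notebook, while `'F'` for flags and digit strings for revealed counts are assumptions. `deduce` is a hypothetical helper.

```python
def deduce(board):
    """Return (safe, mines): hidden cells provable by the single-number rule."""
    rows, cols = len(board), len(board[0])
    safe, mines = set(), set()
    for r in range(rows):
        for c in range(cols):
            cell = board[r][c]
            if not cell.isdigit():
                continue  # only revealed numbers carry information
            neighbors = [(r + dr, c + dc)
                         for dr in (-1, 0, 1) for dc in (-1, 0, 1)
                         if (dr or dc) and 0 <= r + dr < rows and 0 <= c + dc < cols]
            hidden = [p for p in neighbors if board[p[0]][p[1]] == "."]
            flagged = sum(board[p[0]][p[1]] == "F" for p in neighbors)
            remaining = int(cell) - flagged
            if hidden and remaining == 0:
                safe.update(hidden)       # every mine is already flagged
            elif hidden and remaining == len(hidden):
                mines.update(hidden)      # every hidden neighbor must be a mine
    return safe, mines

before_flag = [["1", ".", "."],
               ["1", "1", "1"],
               ["0", "0", "0"]]
after_flag = [["1", "F", "."],
              ["1", "1", "1"],
              ["0", "0", "0"]]
print(deduce(before_flag))
print(deduce(after_flag))
```

On the first board the rule identifies (0, 1) as a mine; once that cell is flagged, (0, 2) becomes provably safe. A reward function could give a bonus when the model's move falls in one of these sets, and a failed game where a safe cell was available but ignored points to a reasoning gap as opposed to bad luck.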

Good luck!