# OpenEnv Wordle with GRPO using TRL

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/trl/blob/main/examples/notebooks/openenv_wordle_grpo.ipynb)

![trl banner](https://huggingface.co/datasets/trl-lib/documentation-images/resolve/main/trl_banner_dark.png)


With [**Transformer Reinforcement Learning (TRL)**](https://github.com/huggingface/trl), you can train a model that learns to **play Wordle**, a word-guessing game, through interaction and reinforcement.

- [TRL GitHub Repository](https://github.com/huggingface/trl) (star us to support the project!)  
- [Official TRL Examples](https://huggingface.co/docs/trl/example_overview)  
- [Community Tutorials](https://huggingface.co/docs/trl/community_tutorials)
- [OpenEnv](https://github.com/meta-pytorch/OpenEnv)


An **agentic environment** is a setting where a model can take actions, observe outcomes, and adjust its behavior based on feedback, similar to how humans learn from trial and error.
In this case, the agent interacts with the **Wordle** environment through the [**OpenEnv**](https://github.com/meta-pytorch/OpenEnv) framework, which standardizes multi-agent and RL-style text environments.

[Wordle](https://en.wikipedia.org/wiki/Wordle) is a popular word puzzle where the player must guess a secret five-letter word within six tries.  
After each guess, feedback indicates whether each letter is:
- 🟩 **Correct and in the right position**
- 🟨 **Present but in the wrong position**
- ⬛ **Not in the word**

This feedback loop makes Wordle a perfect environment for **RL with LLMs**, where the goal is to maximize the probability of guessing the correct word efficiently.
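
To make the feedback rules concrete, here is a minimal sketch of how a guess could be scored. `score_guess` is a hypothetical helper written for illustration (it is not part of OpenEnv or TextArena), and it uses the G/Y/X letters that the environment prints later in this notebook.

```python
from collections import Counter


def score_guess(secret: str, guess: str) -> str:
    """Return one mark per letter: G (green), Y (yellow), or X (not in the word)."""
    secret, guess = secret.lower(), guess.lower()
    marks = ["X"] * len(guess)

    # Secret letters that were not matched exactly stay available for yellow marks.
    remaining = Counter(s for s, g in zip(secret, guess) if s != g)

    # First pass: exact position matches are green.
    for i, (s, g) in enumerate(zip(secret, guess)):
        if s == g:
            marks[i] = "G"

    # Second pass: a letter is yellow only while unmatched copies remain in the secret.
    for i, g in enumerate(guess):
        if marks[i] == "X" and remaining[g] > 0:
            marks[i] = "Y"
            remaining[g] -= 1

    return "".join(marks)


print(score_guess("spare", "crane"))  # XYGXG
```

The two-pass structure matters for repeated letters: a guess with three E's against a secret with two E's earns at most two non-gray marks for that letter.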


We'll fine-tune a model using **GRPO** (Group Relative Policy Optimization) via TRL.  
The agent will:
1. Generate guesses based on the game state and feedback.
2. Receive structured feedback from the environment after each guess.
3. Learn to improve its guessing strategy over time through reward signals.
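
GRPO scores each completion relative to the other completions sampled for the same prompt. The sketch below illustrates that idea under a simplifying assumption: rewards are normalized by the group's mean and population standard deviation. The function name and the `eps` value are illustrative, and TRL's exact normalization and defaults may differ.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-4):
    """Center and scale the rewards of one group of completions for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps avoids division by zero when every completion receives the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four hypothetical episode rewards for the same prompt.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 0.5])
print(advantages)
```

Completions that beat the group average get a positive advantage and are reinforced; those below it are discouraged. If every completion in a group earns the same reward, all advantages are zero and that group contributes no learning signal.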


## Install dependencies

We'll start by installing **TRL**, which automatically includes the main dependencies like **Transformers**.  
We'll also install the **OpenEnv** framework via the remote deployment environment at [burtenshaw/wordle](https://huggingface.co/spaces/burtenshaw/wordle) (for the environment), **trackio** (for logging and monitoring training runs), and **vLLM** (for efficient generation).

In [None]:
!pip install -Uq trl[vllm] git+https://huggingface.co/spaces/burtenshaw/wordle trackio bitsandbytes

### Log in to Hugging Face

Log in to your **Hugging Face** account to save your fine-tuned model, track your experiment results directly on the Hub, or access gated models. You can find your **access token** on your [account settings page](https://huggingface.co/settings/tokens).

In [None]:
from huggingface_hub import notebook_login

notebook_login()

## Initialize the Environment

Let's begin by setting up the environment that will be used during training.  
For this task, we'll rely on the **TextArena** environment from **OpenEnv**, which exposes a familiar Gymnasium-style API (`reset()`, `step()`, etc.) to simplify interaction.

In this example, we'll connect to the hosted environment at [burtenshaw/wordle](https://huggingface.co/spaces/burtenshaw/wordle).  
For production use or custom configurations, we **strongly recommend** running the environment locally via Docker. The hosted versions on the Hub currently have limited concurrency support, so duplicating the Space to your own account is the preferred approach in those cases.

For more information, refer to the [TRL-OpenEnv documentation](https://huggingface.co/docs/trl/main/en/openenv).
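
If the `reset()` / `step()` pattern is new to you, the sketch below shows its general shape with a toy stand-in. `ToyEnv` and `StepResult` are hypothetical classes written for this illustration; they are not OpenEnv or TextArena classes, and the real environment returns richer observations (a prompt plus a message history).

```python
from dataclasses import dataclass


@dataclass
class StepResult:
    """Hypothetical container mirroring the observation/reward/done triple."""
    observation: str
    reward: float = 0.0
    done: bool = False


class ToyEnv:
    """Toy stand-in for a reset()/step() environment with a fixed secret word."""

    def __init__(self, secret="crane", max_turns=6):
        self.secret = secret
        self.max_turns = max_turns
        self.turns = 0

    def reset(self):
        # Start a new episode and return the first observation.
        self.turns = 0
        return StepResult(observation="Enter your guess to begin.")

    def step(self, guess):
        # Apply one action; the episode ends when solved or out of turns.
        self.turns += 1
        solved = guess == self.secret
        done = solved or self.turns >= self.max_turns
        return StepResult(f"You guessed {guess}.", 1.0 if solved else 0.0, done)


toy_env = ToyEnv()
result = toy_env.reset()
for guess in ["slate", "crane", "spare"]:
    if result.done:
        break
    result = toy_env.step(guess)

print(result.reward, result.done, toy_env.turns)  # 1.0 True 2
```

The rollout code later in this notebook follows the same loop: reset once, then alternate between generating a guess and calling `step()` until `done` is set.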


In [None]:
from textarena_env import TextArenaEnv

textarena_url = "https://burtenshaw-wordle.hf.space" # Duplicate the Space and update this!
env = TextArenaEnv(base_url=textarena_url)
# textarena_url = "burtenshaw/wordle"
# env = TextArenaEnv.from_hub(repo_id=textarena_url)

## Init model and tokenizer

We'll use [Qwen/Qwen3-1.7B](https://huggingface.co/Qwen/Qwen3-1.7B), a lightweight instruction-tuned model that works well for quick experiments.  
Despite its small size, it can still learn interesting strategies during fine-tuning.  
If you have stronger hardware, you can easily scale up to larger models.

We'll load the **tokenizer** (needed for text processing) here.  
The **model** itself will be handled internally by TRL during training.

In [None]:
from transformers import AutoTokenizer

model_name = "Qwen/Qwen3-1.7B"  # Smaller alternatives: "Qwen/Qwen2.5-0.5B-Instruct", "Qwen/Qwen3-0.6B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

## Rollout function with helpers

The **rollout function** defines how the agent interacts with the environment during GRPO training.
It's responsible for generating model completions, collecting feedback (rewards), and returning all necessary information for optimization.

In this setup:
- The function is called automatically by the **GRPOTrainer** during each training step.  
- It uses `generate_rollout_completions()` from `trl.experimental.openenv`, which takes the trainer as an argument, for efficient generation with vLLM in colocate mode.
- Each rollout represents a full interaction loop. The model guesses, receives feedback from Wordle, and updates based on reward signals.

The rewards track different aspects of the agent's performance. Helper functions (like `rollout_once`) handle one episode of interaction, keeping the main `rollout_func` clean and modular.

This modular approach allows GRPO to efficiently sample, evaluate, and improve the model's guessing strategy through reinforcement learning.

First, we define the `system_prompt` that guides the model's behavior as an expert Wordle solver with strategic reasoning and structured responses.

In [None]:
# @title System prompt (click to expand)
system_prompt = """
You are an expert Wordle solver with deep knowledge of English vocabulary, letter frequency patterns, and optimal guessing strategies.

## GAME RULES

1. The target is a 5-letter English word
2. You have 6 attempts to guess the correct word
3. After each guess, you receive color-coded feedback:
   - GREEN: Letter is correct and in the correct position
   - YELLOW: Letter is in the word but in the wrong position
   - GRAY: Letter is not in the word at all
4. All guesses must be valid 5-letter English words
5. You cannot reuse a word you've already guessed

## RESPONSE FORMAT

Only respond with your next guess in square brackets, e.g., [crane].

Format:
```
[guess]
```

## STRATEGIC APPROACH

Do not repeat the same guess twice.

### Opening Strategy
- Start with words rich in common vowels (A, E, I, O, U) and consonants (R, S, T, L, N)
- Optimal starters: CRANE, SLATE, STARE, AROSE, IRATE
- Prioritize words that test the most common letters in different positions

### Mid-Game Strategy
- Use confirmed GREEN letters in their correct positions
- Place YELLOW letters in different positions than where they appeared
- Eliminate GRAY letters from consideration
- If multiple letters are unknown, prioritize common letter combinations (TH, CH, ST, ER, etc.)
- Consider letter frequency: E is most common, followed by A, R, I, O, T, N, S

### Vowel Placement
- Most 5-letter words have 2 vowels
- Common patterns: vowel-consonant-vowel (like CRANE) or consonant-vowel-vowel-consonant-vowel (like QUEUE)
- If you have 1-2 vowels confirmed, consider where the others might be

### Advanced Tactics
- Use "sacrificial" guesses to test multiple new letters if you have attempts to spare
- Avoid repeating letter patterns unless you're certain (e.g., SPEED has two E's)
- Think about word endings: -ER, -LY, -ED, -ING are common but may not fit the 5-letter constraint
- Consider less common letters (Q, X, Z, J) only when you've eliminated the most common options

### Common Pitfalls to Avoid
- Don't reuse letters marked GRAY (eliminated letters)
- Don't place YELLOW letters in the same position they appeared
- Don't ignore confirmed GREEN letters in future guesses
- Don't guess words that contradict known information

## EXAMPLES

### Example 1: Opening Guess
"Starting with a word that tests common vowels and consonants in varied positions."
[crane]

### Example 2: After Receiving Feedback
Previous guess: CRANE
Feedback: C=gray, R=yellow, A=green, N=gray, E=yellow

"A is confirmed in position 2. R and E are in the word but need different positions. C and N are eliminated. I'll try a word with A in position 2, and test R and E in new positions along with common letters like S and T."
[spare]

### Example 3: Narrowing Down
Previous guesses: CRANE (C=gray, R=yellow, A=green, N=gray, E=yellow), SPARE (S=gray, P=gray, A=green, R=green, E=green)
Feedback summary: _ARE_ with R in position 4, A in position 2, E in position 5

"I have _AR E_ confirmed. Positions 1 and 3 are unknown. Common letters to try: T, L, D, B, F, G. Testing with TARED."
[tared]

### Example 4: Final Deduction
Previous feedback shows: _ARED with position 1 unknown and all common consonants tested

"Only position 1 remains. I've eliminated S, P, C, N. Common starting consonants left are B, F, G, H. BARED is a common word."
[bared]

## LETTER FREQUENCY REFERENCE

Most common letters in 5-letter words (in order):
S, E, A, O, R, I, L, T, N, U, D, Y, C, P, M, H, G, B, K, F

Most common starting letters:
S, C, B, T, P, A, F, G, D, M

Most common ending letters:
E, Y, T, S, R, L, N, D

## IMPORTANT CONSTRAINTS

- Use lowercase only
- One guess per response
- Must be exactly 5 letters
- Must be a real English word from standard dictionaries
- Never repeat a previous guess
- Always include brief reasoning before your guess

## YOUR GOAL

Solve the Wordle in as few guesses as possible by strategically using feedback to eliminate impossible words and narrow down the solution space efficiently.
"""

Now, let's define the `rollout_func`:

This function orchestrates the interaction between the model and the Wordle environment. For each prompt in the batch, it runs the episode interaction, collecting rewards and model outputs for GRPO optimization.

In [None]:
def rollout_func(prompts, trainer=None):
    """
    Rollout function for GRPO training with environment interaction.

    This function is called by GRPOTrainer to generate completions and compute rewards.
    In colocate mode, rollout_once calls generate_rollout_completions() for inference.

    Args:
        prompts: List of prompts to generate from
        trainer: GRPOTrainer instance containing context and configuration

    Returns:
        Dictionary with prompt_ids, completion_ids, logprobs, and reward signals
    """
    episode_prompt_ids = []
    episode_completion_ids = []
    episode_logprobs = []
    correctness_rewards = []
    green_rewards = []
    yellow_rewards = []
    repetition_rewards = []

    for prompt_text in prompts:
        episode = rollout_once(
            trainer=trainer,
            env=env,
            tokenizer=tokenizer,
            dataset_prompt=prompt_text,
            system_prompt=system_prompt,
            max_turns=6,
        )
        episode_prompt_ids.append(episode["prompt_ids"])
        episode_completion_ids.append(episode["completion_ids"])
        episode_logprobs.append(episode["logprobs"])
        correctness_rewards.append(episode["correct_reward"])
        green_rewards.append(episode["green_reward"])
        yellow_rewards.append(episode["yellow_reward"])
        repetition_rewards.append(episode["repetition_reward"])

    return {
        "prompt_ids": episode_prompt_ids,
        "completion_ids": episode_completion_ids,
        "logprobs": episode_logprobs,
        "correct_reward": correctness_rewards,
        "green_reward": green_rewards,
        "yellow_reward": yellow_rewards,
        "repetition_reward": repetition_rewards,
    }

### Define `rollout_once`

The `rollout_once` function runs **one full interaction loop** between the model and the Wordle environment using the trainer's generation method.  
It executes a mini episode of gameplay, from generating a guess to receiving and processing feedback.

Here's the step-by-step breakdown:

1. **Environment reset:** Start a new game session and initialize the observation.  
2. **Prompt construction:** Combine the system prompt, current state, and user messages to form the model input.  
3. **Generation:** Use `trl.experimental.openenv.generate_rollout_completions()` to produce the model's guess efficiently.  
4. **Feedback extraction:** Parse the environment's response using helpers like `extract_guess()` and `extract_wordle_feedback()`.  
5. **Reward calculation:** Compute rewards based on correctness, green/yellow feedback, and repetition penalty.
6. **Return structured rollout data:** Includes prompt/completion IDs, logprobs, and all computed reward components.

This modular design ensures that each episode can be processed independently while still providing rich feedback for the **GRPO training loop**.
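
To give a feel for steps 4 and 5, here is a rough sketch of bracket parsing and mark counting. Both functions are hypothetical stand-ins, not the `textarena_env.rewards` implementations; taking the last bracketed word and reading a single row of G/Y/X marks are assumptions made for the example.

```python
import re


def parse_bracket_guess(text):
    """Return the last five-letter [word] in the text, lowercased, or None."""
    matches = re.findall(r"\[([A-Za-z]{5})\]", text)
    return f"[{matches[-1].lower()}]" if matches else None


def count_marks(feedback_row):
    """Count green (G) and yellow (Y) marks in a row such as 'X Y X X X'."""
    marks = feedback_row.split()
    return marks.count("G"), marks.count("Y")


guess = parse_bracket_guess("I'll open with [CRANE]")
greens, yellows = count_marks("G X X G X")

# Dividing by 5 maps the counts to the [0, 1] range used for the green/yellow rewards.
print(guess, greens / 5.0, yellows / 5.0)  # [crane] 0.4 0.0
```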

In [None]:
from collections import defaultdict
from textarena_env import TextArenaAction
from textarena_env.rewards import extract_feedback_counts, extract_guess, extract_wordle_feedback
from trl.experimental.openenv import generate_rollout_completions


def rollout_once(trainer, env, tokenizer, dataset_prompt, system_prompt, max_turns):
    """
    Execute one full Wordle episode with the model.

    This function uses generate_rollout_completions() instead of manual vLLM handling,
    making the code cleaner and more maintainable.
    """
    result = env.reset()
    observation = result.observation

    prompt_ids = []
    completion_ids = []
    logprobs = []
    raw_rewards = []
    green_scores = []
    yellow_scores = []
    repetition_scores = []
    correct_scores = []
    guess_counts = defaultdict(int)

    for _turn in range(max_turns):
        # when the game is over the environment will return a done=True
        if result.done:
            break

        # set up the prompt for the model
        base_prompt = observation.prompt or dataset_prompt
        user_prompt = make_user_prompt(base_prompt, observation.messages)
        messages = [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt},
        ]
        prompt_text = tokenizer.apply_chat_template(
            messages,
            add_generation_prompt=True,
            tokenize=False,
            enable_thinking=False,
        )

        # Generate using trainer's built-in method (much cleaner!)
        rollout_outputs = generate_rollout_completions(trainer, [prompt_text])[0]
        prompt_ids.extend(rollout_outputs["prompt_ids"])
        completion_ids.extend(rollout_outputs["completion_ids"])
        logprobs.extend(rollout_outputs["logprobs"])
        completion_text = rollout_outputs.get("text") or tokenizer.decode(
            rollout_outputs["completion_ids"], skip_special_tokens=True
        )

        # extract the guess from the completion
        guess = extract_guess(completion_text)

        # step the environment with the guess
        result = env.step(TextArenaAction(message=guess))
        raw_rewards.append(float(result.reward or 0.0))
        observation = result.observation
        correct_score = float(result.reward or 0.0)
        feedback = extract_wordle_feedback(observation)

        # Update guess counts
        previous_occurrences = guess_counts[guess]
        repetition_score = scale_repetition_score(previous_occurrences, len(guess_counts))
        guess_counts[guess] += 1

        # calculate custom reward signals from the feedback
        if not feedback:
            green_score = 0.0
            yellow_score = 0.0
        else:
            green_count, yellow_count = extract_feedback_counts(feedback)
            green_score = green_count / 5.0
            yellow_score = yellow_count / 5.0

        repetition_scores.append(repetition_score)
        green_scores.append(green_score)
        yellow_scores.append(yellow_score)
        correct_scores.append(correct_score)

    correct_reward_value = correct_scores[-1] if correct_scores else (raw_rewards[-1] if raw_rewards else 0.0)

    return {
        "prompt_ids": prompt_ids,
        "completion_ids": completion_ids,
        "logprobs": logprobs,
        "raw_rewards": raw_rewards,
        "correct_reward": correct_reward_value,
        "green_reward": green_scores[-1] if green_scores else 0.0,
        "yellow_reward": yellow_scores[-1] if yellow_scores else 0.0,
        "repetition_reward": repetition_scores[-1] if repetition_scores else 0.0,
    }


### Helper functions

Supporting utilities used in `rollout_once`:

- **`make_user_prompt`**: builds the user prompt combining the base text and previous game messages.
- **`format_history`**: formats the conversation log for consistent context.
- **`scale_repetition_score`**: applies a penalty when guesses are repeated to encourage exploration.

In [None]:
# @title Helpers definition (click to expand)
def make_user_prompt(prompt_text, messages):
    """Builds a structured user prompt combining the task description and message history"""
    history = format_history(messages)
    prompt_section = prompt_text.strip() if prompt_text.strip() else "Wordle-v0"
    history_section = history if history else "[PROMPT] Awaiting first feedback."
    return (
        f"Game prompt:\n{prompt_section}\n\n"
        f"Conversation so far:\n{history_section}\n\n"
        "Reply with your next guess enclosed in square brackets."
    )

def format_history(messages):
    """Formats the message history with tags for clear conversational context"""
    lines = []
    for message in messages:
        tag = message.category or "MESSAGE"
        content = message.content.strip()
        if not content:
            continue
        lines.append(f"[{tag}] {content}")
    return "\n".join(lines)

def scale_repetition_score(previous_occurrences, max_occurrences):
    """Scale the repetition score based on the number of previous occurrences from 0 to 1"""
    if max_occurrences == 0:
        return 0.0
    return (max_occurrences - previous_occurrences) / max_occurrences
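
As a worked example of the repetition score, the sketch below replays the bookkeeping from `rollout_once` on a fixed list of guesses. `repetition_scores_for` is a hypothetical helper written for this illustration; it inlines the same formula as `scale_repetition_score`.

```python
from collections import defaultdict


def repetition_scores_for(guesses):
    """Replay the per-turn repetition scoring for a sequence of guesses."""
    guess_counts = defaultdict(int)
    scores = []
    for guess in guesses:
        previous = guess_counts[guess]  # reading the key also registers it in the dict
        distinct = len(guess_counts)    # distinct guesses so far, including this one
        scores.append((distinct - previous) / distinct)
        guess_counts[guess] += 1
    return scores


print(repetition_scores_for(["[crane]", "[spare]", "[spare]"]))  # [1.0, 1.0, 0.5]
```

A fresh guess always scores 1.0, while repeating an earlier guess lowers the score. Note that `rollout_once` reports only the final turn's score as the episode's `repetition_reward`.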

## Define reward functions

To guide the agent's learning process, we define simple reward functions that map the feedback from the environment into numeric signals.  
Each function corresponds to a specific aspect of the **Wordle** game:

- ✅ **`reward_correct`**: rewards the model when it guesses the correct word.  
- 🟩 **`reward_greens`**: rewards letters correctly placed (green feedback).  
- 🟨 **`reward_yellows`**: rewards letters that are present but misplaced (yellow feedback).  
- 🔁 **`reward_repetition`**: rewards diverse guessing by scoring based on guess uniqueness.

These functions return lists of float values that the **GRPOTrainer** uses during optimization.  
By combining them, the model learns to balance correctness, information gathering, and exploration in its guessing strategy.
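
As a rough illustration of how the components add up for a single episode, the sketch below sums them with equal weights. `combine_rewards` and the example values are hypothetical; the trainer performs its own aggregation internally, and `GRPOConfig` exposes a `reward_weights` option if you want to weight the functions differently.

```python
def combine_rewards(components, weights=None):
    """Weighted sum of named reward components; weights default to 1.0 each."""
    weights = weights or {name: 1.0 for name in components}
    return sum(weights[name] * value for name, value in components.items())


# Hypothetical episode: not solved, 2 greens, 1 yellow, last guess was a repeat.
episode = {"correct": 0.0, "green": 0.4, "yellow": 0.2, "repetition": 0.5}
total = combine_rewards(episode)

# Emphasizing correctness only matters once an episode is actually solved.
solved = {"correct": 1.0, "green": 1.0, "yellow": 0.0, "repetition": 1.0}
weighted = combine_rewards(solved, {"correct": 2.0, "green": 1.0, "yellow": 1.0, "repetition": 1.0})
print(total, weighted)
```

With equal weights, the shaping signals (greens, yellows, repetition) can together outweigh the sparse correctness reward, so it is worth monitoring each component separately in the logs.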

In [None]:
def reward_correct(completions, **kwargs):
    rewards = kwargs.get("correct_reward") if kwargs else None
    if rewards is None:
        return [0.0 for _ in completions]
    return [float(r) for r in rewards]


def reward_greens(completions, **kwargs):
    rewards = kwargs.get("green_reward") if kwargs else None
    if rewards is None:
        return [0.0 for _ in completions]
    return [float(r) for r in rewards]


def reward_yellows(completions, **kwargs):
    rewards = kwargs.get("yellow_reward") if kwargs else None
    if rewards is None:
        return [0.0 for _ in completions]
    return [float(r) for r in rewards]


def reward_repetition(completions, **kwargs):
    rewards = kwargs.get("repetition_reward") if kwargs else None
    if rewards is None:
        return [0.0 for _ in completions]
    return [float(r) for r in rewards]
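
The four functions above share one shape: each reads a list from the extra fields returned by `rollout_func` and falls back to zeros when the field is missing. One possible refactor is a small factory. `make_passthrough_reward` is a hypothetical helper, and setting `__name__` is done on the assumption that the trainer labels logged rewards by function name.

```python
def make_passthrough_reward(key):
    """Build a reward function that forwards the rollout field named `key`."""

    def reward_fn(completions, **kwargs):
        rewards = kwargs.get(key)
        if rewards is None:
            # Field missing: return a neutral reward for every completion.
            return [0.0 for _ in completions]
        return [float(r) for r in rewards]

    reward_fn.__name__ = f"reward_from_{key}"
    return reward_fn


reward_greens_alt = make_passthrough_reward("green_reward")
print(reward_greens_alt(["[crane]", "[slate]"], green_reward=[0.4, 0.2]))  # [0.4, 0.2]
print(reward_greens_alt(["[crane]"]))  # [0.0]
```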

## Create dataset

We create a dataset with repeated prompts to control the number of training episodes.  
Each entry in the dataset triggers one rollout episode during training. The `dataset_prompt` provides the initial instruction to the model before each game starts.

In [None]:
from datasets import Dataset

dataset_size = 1000
dataset_prompt = "Play Wordle like an expert."

dataset = Dataset.from_dict({"prompt": [dataset_prompt] * dataset_size})

## Set GRPO Config

Next, we define the **GRPOConfig**, which controls all key training parameters.  
This configuration specifies how the model interacts with **vLLM**, manages memory, and logs results.

In [None]:
from trl import GRPOConfig

output_dir = "wordle-grpo-Qwen3-1.7B"

grpo_config = GRPOConfig(
    # Training schedule / optimization
    num_train_epochs = 1,                 # Number of full dataset passes
    learning_rate = 5e-6,                 # Learning rate for the optimizer
    gradient_accumulation_steps = 64,     # Accumulate gradients over multiple steps
    per_device_train_batch_size = 1,      # Batch size per GPU (number of prompts processed together)
    warmup_steps = 20,                    # Steps for learning rate warmup

    # GRPO configuration
    num_generations = 2,                  # Number of rollout episodes per prompt (for variance reduction)
    max_completion_length = 8,            # Maximum tokens generated per model response

    # vLLM configuration
    use_vllm = True,                      # Enable vLLM for faster inference during rollouts
    vllm_mode = "colocate",               # Run vLLM in colocate mode (same process as training)
    vllm_gpu_memory_utilization = 0.1,    # Fraction of GPU memory reserved for vLLM inference

    # Logging / reporting
    output_dir = output_dir,              # Directory for checkpoints and logs
    report_to="trackio",                  # Experiment tracking tool (integrates with HF Spaces)
    trackio_space_id = output_dir,        # HF Space where experiment tracking will be saved
    logging_steps = 1,                    # Log metrics every N steps
    save_steps = 10,                      # Interval for saving checkpoints

    # Memory optimization
    gradient_checkpointing = True,        # Enable activation recomputation to save memory

    # Hub integration
    push_to_hub = True,                  # Set True to automatically push model to Hugging Face Hub
)
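
As a sanity check on the schedule, the sketch below estimates how many optimizer steps this configuration produces. It assumes a single GPU, that every prompt is expanded into `num_generations` completions, and that each optimizer step consumes `per_device_train_batch_size * gradient_accumulation_steps` completions. `estimate_optimizer_steps` is a hypothetical helper, and the real count depends on TRL's batching internals.

```python
def estimate_optimizer_steps(
    dataset_size,
    num_generations,
    per_device_train_batch_size,
    gradient_accumulation_steps,
    num_devices=1,
    num_train_epochs=1,
):
    """Rough estimate of optimizer steps under the assumptions stated above."""
    total_completions = dataset_size * num_generations * num_train_epochs
    completions_per_step = (
        per_device_train_batch_size * gradient_accumulation_steps * num_devices
    )
    return total_completions / completions_per_step


estimated_steps = estimate_optimizer_steps(
    dataset_size=1000,
    num_generations=2,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=64,
)
print(estimated_steps)  # 31.25
```

Under these assumptions the run takes roughly 31 optimizer steps, so `save_steps = 10` yields only a few checkpoints and `warmup_steps = 20` covers most of the run.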

## Create `GRPOTrainer` and start training

Now we initialize the `GRPOTrainer`, which manages the entire reinforcement learning loop.

It takes the model, tokenizer, reward functions, rollout function, and dataset defined earlier.  
The trainer coordinates the interaction between the model and the environment, applies the reward signals, and updates the policy.

Finally, we call `trainer.train()` to start the fine-tuning process and let the model learn to play Wordle through feedback and iteration.

In [None]:
from trl import GRPOTrainer

trainer = GRPOTrainer(
    model=model_name,
    processing_class=tokenizer,
    reward_funcs=[
        reward_correct,
        reward_greens,
        reward_yellows,
        reward_repetition,
    ],
    train_dataset=dataset,
    args=grpo_config,
    rollout_func=rollout_func,
)

Show memory stats before training

In [None]:
import torch
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)

print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")

GPU = NVIDIA A100-SXM4-40GB. Max memory = 39.557 GB.
10.516 GB of memory reserved.


And train!

In [None]:
trainer_stats = trainer.train()

The tokenizer has new PAD/BOS/EOS tokens that differ from the model config and generation config. The model config and generation config were aligned accordingly, being updated with the tokenizer's values. Updated tokens: {'bos_token_id': None, 'pad_token_id': 151645}.


* Trackio project initialized: huggingface
* Trackio metrics will be synced to Hugging Face Dataset: sergiopaniego/wordle-grpo-Qwen3-1.7B-dataset
* Creating new space: https://huggingface.co/spaces/sergiopaniego/wordle-grpo-Qwen3-1.7B
* View dashboard by going to: https://sergiopaniego-wordle-grpo-Qwen3-1.7B.hf.space/


* Created new run: sergiopaniego-1763727287
INFO 11-21 12:14:47 [block_pool.py:292] Successfully reset prefix cache


| Step | Training Loss |
|------|---------------|
| 1 | 0.0083 |
| 2 | 0.0019 |
| 3 | 0.0151 |
| 4 | 0.0087 |
| 5 | 0.0098 |
| 6 | 0.0067 |
| 7 | 0.0061 |
| 8 | 0.0044 |
| 9 | -0.0021 |
| 10 | 0.0075 |



Show memory stats after training

In [None]:
used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
used_memory_for_training = round(used_memory - start_gpu_memory, 3)
used_percentage = round(used_memory / max_memory * 100, 3)
training_memory_percentage = round(used_memory_for_training / max_memory * 100, 3)

print(f"{trainer_stats.metrics['train_runtime']} seconds used for training.")
print(f"{round(trainer_stats.metrics['train_runtime']/60, 2)} minutes used for training.")
print(f"Peak reserved memory = {used_memory} GB.")
print(f"Peak reserved memory for training = {used_memory_for_training} GB.")
print(f"Peak reserved memory % of max memory = {used_percentage} %.")
print(f"Peak reserved memory for training % of max memory = {training_memory_percentage} %.")

5231.7046 seconds used for training.
87.2 minutes used for training.
Peak reserved memory = 36.68 GB.
Peak reserved memory for training = 26.164 GB.
Peak reserved memory % of max memory = 92.727 %.
Peak reserved memory for training % of max memory = 66.143 %.


In [None]:
env.close()
trainer.save_model(output_dir)
trainer.push_to_hub()

No files have been modified since last commit. Skipping to prevent empty commit.



CommitInfo(commit_url='https://huggingface.co/sergiopaniego/wordle-grpo-Qwen3-1.7B/commit/b81b548867ab35601d3bda845ed5e18147550e30', commit_message='End of training', commit_description='', oid='b81b548867ab35601d3bda845ed5e18147550e30', pr_url=None, repo_url=RepoUrl('https://huggingface.co/sergiopaniego/wordle-grpo-Qwen3-1.7B', endpoint='https://huggingface.co', repo_type='model', repo_id='sergiopaniego/wordle-grpo-Qwen3-1.7B'), pr_revision=None, pr_num=None)

## Load the Fine-Tuned Model and Run Inference

Now let's test our fine-tuned model by loading it from the Hub and running **inference**.  
We load the **fine-tuned model** and its tokenizer with `from_pretrained`, which gives us a model ready for evaluation.

In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "sergiopaniego/wordle-grpo-Qwen3-1.7B" # Replace with your HF username or organization

fine_tuned_model = AutoModelForCausalLM.from_pretrained(model_name, dtype="float32", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_name)


Now that we have the fine-tuned model loaded, we can start playing Wordle.  
To make this easier, we'll define a reusable function so we can play multiple rounds.  
The function implements the same logic we explored earlier.

In [None]:
MAX_TURNS = 6

def play_wordle(env, model, tokenizer):
    result = env.reset()
    observation = result.observation

    print("📜 Initial Prompt:\n" + observation.prompt)

    for turn in range(MAX_TURNS):
        if result.done:
            break

        user_prompt = make_user_prompt(observation.prompt, observation.messages)
        messages = [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt},
        ]
        prompt_text = tokenizer.apply_chat_template(
            messages,
            add_generation_prompt=True,
            tokenize=False,
            enable_thinking=False,
        )

        model_inputs = tokenizer([prompt_text], return_tensors="pt").to(model.device)

        generated_ids = model.generate(
            **model_inputs,
            max_new_tokens=512
        )
        output_ids = generated_ids[0][len(model_inputs.input_ids[0]):]

        # Decode and extract model response
        generated_text = tokenizer.decode(output_ids, skip_special_tokens=True)
        guess = extract_guess(generated_text)

        print(f"\n🎯 Turn {turn}: model replied with -> {generated_text}")
        print(f"   Parsed guess: {guess}")

        result = env.step(TextArenaAction(message=guess))
        observation = result.observation

        print("   Feedback messages:")
        for message in observation.messages:
            print(f"     [{message.category}] {message.content}")

    print("\n✅ Game finished")
    print(f"   Reward: {result.reward}")
    print(f"   Done: {result.done}")

Let's play the game!

In [None]:
try:
    play_wordle(env, fine_tuned_model, tokenizer)
finally:
    env.close()

📜 Initial Prompt:
You are Player 0 in Wordle.
A secret 5-letter word has been chosen. You have 6 attempts to guess it.
For each guess, wrap your word in square brackets (e.g., [apple]).
Feedback for each letter will be given as follows:
  - G (green): correct letter in the correct position
  - Y (yellow): letter exists in the word but in the wrong position
  - X (wrong): letter is not in the word
Enter your guess to begin.

🎯 Turn 0: model replied with -> [crane]
   Parsed guess: [crane]
   Feedback messages:
     [MESSAGE] [crane]
     [MESSAGE] Player 0 submitted [crane].
Feedback:
C R A N E
X Y X X X

You have 5 guesses left.

🎯 Turn 1: model replied with -> [spare]
   Parsed guess: [spare]
   Feedback messages:
     [MESSAGE] [spare]
     [MESSAGE] Player 0 submitted [spare].
Feedback:
C R A N E
X Y X X X

S P A R E
G X X G X

You have 4 guesses left.

🎯 Turn 2: model replied with -> [spare]
   Parsed guess: [spare]
   Feedback messages:
     [MESSAGE] [spare]
     [MES