# Improving BARC induction model with RL

## Goal

Write the code to run RL on the BARC induction model.

Once it works it will be moved to a script.

## Server

Before running the notebook, launch a vLLM server on GPU 0.

```bash
export CUDA_VISIBLE_DEVICES=0; trl vllm-serve --max_model_len 12000 --model /home/gbarbadillo/models/Llama-3.1-ARC-Potpourri-Induction-8B
```

## Imports

In [None]:
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "1" # 0 is used by the vllm server

from unsloth import FastLanguageModel
from dataclasses import dataclass
from datasets import Dataset

from trl import GRPOConfig, GRPOTrainer

from arc25.encoders import create_grid_encoder
from arc25.utils import load_arc_dataset_with_solutions
from arc25.data_augmentation import apply_data_augmentation, get_random_data_augmentation_params
from arc25.prompting import create_prompt_from_task
# from arc25.collator import get_data_collator
from arc25.logging import configure_logging, logging
from arc25.parallel_code_execution import run_code_from_predictions

configure_logging()
logger = logging.getLogger(__name__)

## First steps

In [None]:
@dataclass
class cfg:
    # base model
    model_path: str = "/home/gbarbadillo/models/Llama-3.1-ARC-Potpourri-Induction-8B"
    load_in_4bit: bool = False
    max_seq_length: int = 12000
    grid_encoder: str = 'ColorNameEncoder()'
    # LoRA
    lora_r: int = 16
    use_rslora: bool = True
    # dataset
    dataset_path: str = "/mnt/hdd0/Kaggle/arc25/data/arc-prize-2024/arc-agi_training_challenges.json"
    output_dir: str = "/mnt/hdd0/Kaggle/arc25/trainings/2025-09-12-debug-grpo-b/debug-reward"
    # training hyperparameters
    max_epochs: int = 3
    num_generations: int = 8
    training_batch_size: int = 1
    learning_rate: float = 1e-5

In [None]:
dataset = load_arc_dataset_with_solutions(cfg.dataset_path)
print(f"Loaded {len(dataset)} tasks from {cfg.dataset_path}")

In [None]:
llm, tokenizer = FastLanguageModel.from_pretrained(
    cfg.model_path, load_in_4bit=cfg.load_in_4bit, fast_inference=False)
grid_encoder = create_grid_encoder(cfg.grid_encoder)

Let's create a small dataset.

In [None]:
task_id = list(dataset.keys())[0]
grpo_dataset = []
for _ in range(2):
    params = get_random_data_augmentation_params()
    task = apply_data_augmentation(dataset[task_id], **params)
    prompt = create_prompt_from_task(
            task, grid_encoder=grid_encoder, tokenizer=tokenizer, shuffle_train_samples=True)
    grpo_dataset.append(dict(prompt=prompt, tasks=task))
grpo_dataset = Dataset.from_list(grpo_dataset)

In [None]:
def reward_num_unique_letters(completions, **kwargs):
    """
    Toy reward function that scores completions by their number of unique characters.

    The function appears to receive: completions, prompts, ground_truth and completion_ids.
    """
    logger.info(f"Computing reward for {len(completions)} completions")
    logger.info(f'Completions: {completions}')
    logger.info(f'These are the kwargs: {list(kwargs.keys())}')
    # completion_contents = [completion[0]["content"] for completion in completions]
    rewards = [float(len(set(content))) for content in completions]
    logger.info(f'Rewards: {rewards}')
    return rewards
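
The calling convention can be checked without a trainer. The sketch below assumes, based on the kwargs logged above, that every list argument holds one entry per completion and that one float per completion is expected back. `reward_num_unique_chars`, `fake_kwargs` and all the values are made up for illustration.

```python
def reward_num_unique_chars(completions, **kwargs):
    """Toy reward: number of distinct characters in each completion."""
    return [float(len(set(content))) for content in completions]

# Hypothetical inputs: plain-string completions plus extra columns passed as kwargs.
completions = ["aab", "abc", ""]
fake_kwargs = dict(
    prompts=["p0", "p1", "p2"],
    completion_ids=[[1, 1, 2], [1, 2, 3], []],
)

rewards = reward_num_unique_chars(completions, **fake_kwargs)
print(rewards)  # [2.0, 3.0, 0.0]
```

The returned list must stay aligned with `completions`, otherwise rewards would be credited to the wrong samples.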

In [None]:
def arc_reward(completions, tasks, **kwargs):
    """
    Reward function that scores each completion by the number of training grids
    its code gets right (`train_correct_grids`).

    The function appears to receive: completions, prompts, ground_truth and completion_ids,
    plus the `tasks` column of the dataset.
    """
    results = run_code_from_predictions(tasks, list(range(len(completions))), completions, [None]*len(completions), group_results_by_task=False)
    logger.info(f"Reward results: {results}")
    logger.info(f"Task ids: {[result['task_id'] for result in results]}")
    rewards = [float(result.get('train_correct_grids', 0)) for result in results]
    logger.info(f'Rewards: {rewards}')
    return rewards

In [None]:
model = FastLanguageModel.get_peft_model(
    llm,
    r = cfg.lora_r, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    target_modules = [
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ], # Remove QKVO if out of memory
    lora_alpha = 64,
    use_gradient_checkpointing = "unsloth", # Enable long context finetuning
    use_rslora = cfg.use_rslora,
    # random_state = 3407,
)

In [None]:
# https://huggingface.co/docs/trl/main/en/grpo_trainer#trl.GRPOConfig
training_args = GRPOConfig(
    output_dir=cfg.output_dir,
    num_train_epochs=cfg.max_epochs,
    per_device_train_batch_size=cfg.training_batch_size,
    num_generations=cfg.num_generations,
    gradient_accumulation_steps=2, #cfg.num_generations // cfg.training_batch_size,
    learning_rate=cfg.learning_rate,
    # generation
    use_vllm=True,
    vllm_mode="server",
    max_completion_length=1024,
    max_prompt_length=None,
    temperature=1.0,
    top_p=0.95,
    # wandb
    report_to='wandb',
    run_name=os.path.basename(cfg.output_dir),
    # project=os.path.basename(os.path.dirname(cfg.output_dir)),
)
os.environ["WANDB_PROJECT"] = os.path.basename(os.path.dirname(cfg.output_dir))
# set also the output dir for wandb
os.environ["WANDB_DIR"] = cfg.output_dir

print(f"Training arguments: {training_args}")
# Unsloth: We now expect `per_device_train_batch_size` to be a multiple of `num_generations`. ???
# Why ??? I want to use gradient accumulation to simulate larger batch sizes.
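
The W&B run and project names in the cell above are derived from the last two components of `cfg.output_dir`. A minimal sketch of what they resolve to for the path configured in this notebook (the variable names here are only for illustration):

```python
import os

output_dir = "/mnt/hdd0/Kaggle/arc25/trainings/2025-09-12-debug-grpo-b/debug-reward"

# Last path component becomes the run name, its parent folder the project.
run_name = os.path.basename(output_dir)
project = os.path.basename(os.path.dirname(output_dir))

print(run_name)  # debug-reward
print(project)   # 2025-09-12-debug-grpo-b
```

Note that a trailing slash in `output_dir` would make `os.path.basename` return an empty string; wrapping the path in `os.path.normpath` avoids that.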

In [None]:
trainer = GRPOTrainer(
    model=model,
    reward_funcs=arc_reward, #reward_num_unique_letters,
    # data_collator=get_data_collator(tokenizer),
    args=training_args,
    train_dataset=grpo_dataset,
    completion_only_loss=True,
)
trainer.train()

In [None]:
# use this to reset the vllm server
#! curl  -X POST --location http://0.0.0.0:8000/close_communicator/

There seem to be some compatibility problems:

```
This happens when creating the training config:
Unsloth: We now expect `per_device_train_batch_size` to be a multiple of `num_generations`. ???

This happens when training
AttributeError: 'UnslothGRPOConfig' object has no attribute 'delta'

Library versions:
unsloth                   2025.9.1                 pypi_0    pypi
unsloth-zoo               2025.9.2                 pypi_0    pypi
trl                       0.18.0.dev0              pypi_0    pypi

# pip index versions <package-name>
unsloth (2025.9.4)
Available versions: 2025.9.4, 2025.9.3, 2025.9.2, 2025.9.1,
trl (0.23.0)
Available versions: 0.23.0, 0.22.2, 0.22.1, 0.22.0, 0.21.0, 0.20.0, 0.19.1, 0.19.0, 0.18.2, 0.18.1, 0.18.0

I have installed the latest versions of both libraries in the environment `arc25-unsloth`:
pip install unsloth==2025.9.4
pip install trl==0.23.0
pip install trl[vllm]

Then it gives this error when launching the server.
NameError: name 'ParallelismConfig' is not defined. Did you mean: 'parallelism_config'?
Solved with: pip install --upgrade accelerate

I also had to remove the collator.
```

This works, but it only runs one training step.

```
==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 10 | Num Epochs = 1 | Total steps = 1
O^O/ \_/ \    Batch size per device = 8 | Gradient accumulation steps = 8
\        /    Data Parallel GPUs = 1 | Total batch size (8 x 8 x 1) = 64
 "-____-"     Trainable parameters = 41,943,040 of 8,072,204,288 (0.52% trained)
```

If I reduce the gradient accumulation steps to 1 and increase the number of epochs to 3, it does 30 steps.
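
Both step counts are consistent with a simple model, sketched below under two assumptions that the logs do not confirm: each prompt is repeated `num_generations` times before batching, and the optimizer steps per epoch are the number of batches floor-divided by the gradient accumulation steps. `estimate_total_steps` is a hypothetical helper, not a TRL or Unsloth function.

```python
def estimate_total_steps(num_examples, num_generations, per_device_batch_size,
                         grad_accum_steps, num_epochs, num_gpus=1):
    """Rough estimate of optimizer steps under the assumptions stated above."""
    samples_per_epoch = num_examples * num_generations
    batches_per_epoch = samples_per_epoch // (per_device_batch_size * num_gpus)
    # Assumption: leftover batches that do not fill an accumulation window are dropped.
    steps_per_epoch = max(batches_per_epoch // grad_accum_steps, 1)
    return steps_per_epoch * num_epochs

# 10 examples, 8 generations, batch size 8, accumulation 8, 1 epoch (the log above).
print(estimate_total_steps(10, 8, 8, 8, 1))  # 1
# Same setup with accumulation 1 and 3 epochs.
print(estimate_total_steps(10, 8, 8, 1, 3))  # 30
```

Under this model, a total batch of 64 samples swallows almost the whole 80-sample epoch, which would explain the single step.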

Now the problem is that it seems to be generating only 256 output tokens.
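
One way to confirm the truncation is to count tokens per completion inside the reward function. The sketch below assumes `completion_ids` arrives as a list of token-id lists, as the kwargs logged by the reward function suggest. `summarize_completion_lengths` and its `cap` argument are hypothetical names, and the token ids are made up.

```python
def summarize_completion_lengths(completion_ids, cap):
    """Return (max length, number of completions that reached the cap)."""
    lengths = [len(ids) for ids in completion_ids]
    at_cap = sum(length >= cap for length in lengths)
    return max(lengths, default=0), at_cap

# Made-up token ids: two completions stop at 256 tokens, one ends earlier.
fake_ids = [[0] * 256, [0] * 100, [0] * 256]
max_len, num_at_cap = summarize_completion_lengths(fake_ids, cap=256)
print(max_len, num_at_cap)  # 256 2
```

If most completions sit exactly at 256 tokens even though `max_completion_length=1024` was requested, the setting is probably not reaching the generation call.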

Notice 

## Debug

## TODO

- [ ] Implement the reward function
- [ ] Check if memory is enough
- [ ] Can I reduce the back-and-forth in compute between the two GPUs?