
iamjasonfeng/RPS

I created an LLM post-training method called Regressive Plasticity Schedule (RPS). Preliminary results show that, compared with an equal-learning-rate control, RPS improved Qwen3-8b's ARC-AGI-1 accuracy and its program-synthesis reliability.

RPS is inspired by neuroscience. Humans learn basic skills as children, when neuroplasticity is high, and learn advanced skills as teens and adults, when neuroplasticity is lower. RPS trains a model in 2 stages. In stage 1, the model is trained on easy data with a high learning rate. In stage 2, the model is trained on hard data with 10% of the stage 1 learning rate. RPS is essentially a combination of two existing ideas: curriculum learning and learning rate decay.

Training setup:

RPS and EPS Training Setup

For these experiments, I used qwen3-8b as the base model and trained it with Alibaba Model Studio managed DPO fine-tuning using LoRA. The goal was to test whether a staged “plasticity schedule” can improve ARC-style reasoning behavior, especially the model’s tendency to produce usable program-synthesis solutions.

I compared two schedules:

RPS, or Regressive Plasticity Schedule, uses higher plasticity in Stage 1 and lower plasticity in Stage 2. Operationally, I implemented this by using the same DPO/LoRA setup in both stages, but reducing the Stage 2 learning rate to 10% of the Stage 1 learning rate.

EPS, or Equal Plasticity Schedule, is the control condition. It uses the same Stage 1 model and the same Stage 2 dataset, but Stage 2 keeps the same learning rate as Stage 1.

All other settings were kept as similar as possible between RPS and EPS.

Dataset

The ARC DPO dataset had 1,000 total preference pairs.

Stage 1 used 400 DPO pairs from easier ARC-style training tasks. It contained 400 unique tasks. Most examples came from ARC-AGI-1 or ARC-AGI-1/2-overlap tasks, with a small number of ARC-AGI-2 training examples used to fill the quota.

Stage 2 used 600 DPO pairs from ARC-AGI-2 training tasks only. It contained 250 unique tasks, with some duplicate task IDs allowed to fill the 600-pair quota. No task IDs overlapped between Stage 1 and Stage 2.
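The split constraints described above (pair counts, unique-task counts, and no shared task IDs between stages) can be checked mechanically. The following is a minimal sketch, not the code used for these experiments; the function name and the toy task IDs are hypothetical.

```python
from collections import Counter


def check_stage_splits(stage1_task_ids, stage2_task_ids):
    """Summarize two stage splits and verify that no task ID is shared.

    Each argument lists one task ID per DPO pair, so repeated IDs
    (as allowed in Stage 2) simply appear more than once.
    """
    s1, s2 = Counter(stage1_task_ids), Counter(stage2_task_ids)
    overlap = set(s1) & set(s2)
    if overlap:
        raise ValueError(f"task IDs shared between stages: {sorted(overlap)}")
    return {
        "stage1_pairs": sum(s1.values()),
        "stage1_unique_tasks": len(s1),
        "stage2_pairs": sum(s2.values()),
        "stage2_unique_tasks": len(s2),
    }


# Toy IDs only; the real task IDs are not reproduced here.
summary = check_stage_splits(["a1", "a2", "a3"], ["b1", "b1", "b2"])
```

On the real splits, the summary would be expected to report 400 pairs over 400 unique tasks for Stage 1 and 600 pairs over 250 unique tasks for Stage 2.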

Each DPO pair had:

  • A prompt containing ARC training examples and test inputs.

  • A chosen response containing reasoning, a Python transform program, and predicted test outputs.

  • A rejected response generated by qwen3-0.6b under normal solving conditions, not by asking it to be wrong.

The chosen responses came from Trelis/arc-agi-2-reasoning-5, a dataset of ARC reasoning traces and Python program-synthesis solutions. The underlying ARC tasks in both stages were restricted to official ARC-AGI training tasks.
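For illustration, one preference pair could be stored as a JSONL record like the one sketched below. The field names (`task_id`, `prompt`, `chosen`, `rejected`) are assumptions made for this example; the exact schema expected by the managed fine-tuning service is not shown in this README.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class DpoPair:
    # All field names are hypothetical and chosen for illustration.
    task_id: str   # ARC task the pair was built from
    prompt: str    # ARC training examples plus test inputs
    chosen: str    # preferred response (reasoning, program, predicted outputs)
    rejected: str  # response sampled from the smaller model


def to_jsonl_line(pair: DpoPair) -> str:
    """Serialize one pair as a single JSON line, rejecting degenerate pairs."""
    if pair.chosen == pair.rejected:
        raise ValueError("chosen and rejected responses must differ")
    return json.dumps(asdict(pair), ensure_ascii=False)


line = to_jsonl_line(DpoPair("t1", "example prompt", "good answer", "bad answer"))
```

Rejecting identical chosen and rejected texts is a cheap sanity check, since such a pair carries no preference signal.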

Both chosen and rejected responses were normalized into the same labeled structure:

...

Python program:

python

def transform(grid):
    ...

Predicted test outputs:

...

Shared Fine-Tuning Settings

Both RPS and EPS used the same base model, data format, and fine-tuning method:

  • Base model: qwen3-8b

  • Fine-tuning method: DPO

  • Adapter method: LoRA

  • Validation split: 1%

  • Max sequence length: 32k

  • Scheduler: cosine

  • RPO alpha: 1

  • Stage 1 dataset size: 400 pairs

  • Stage 2 dataset size: 600 pairs

  • No replay

  • No ARC-AGI public evaluation data was used for training
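To make the "RPO alpha: 1" setting concrete, here is a minimal per-pair sketch of a DPO loss with an added RPO-style negative log-likelihood term on the chosen response. It assumes the common formulation in which the NLL term is weighted by alpha and added to the DPO term; whether the managed service computes it exactly this way is not confirmed here. The `beta` value is a placeholder, since this README does not state it.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_rpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    chosen_num_tokens: int,
    beta: float = 0.1,       # placeholder; not specified in this README
    rpo_alpha: float = 1.0,  # matches the "RPO alpha: 1" setting
) -> float:
    """DPO loss for one pair plus an alpha-weighted NLL term (assumed form)."""
    # Log-ratio of policy to reference for each response (summed log-probs).
    chosen_ratio = policy_chosen_logp - ref_chosen_logp
    rejected_ratio = policy_rejected_logp - ref_rejected_logp
    dpo_term = -log_sigmoid(beta * (chosen_ratio - rejected_ratio))
    # Length-normalized negative log-likelihood of the chosen response.
    nll_term = -policy_chosen_logp / chosen_num_tokens
    return dpo_term + rpo_alpha * nll_term
```

With `rpo_alpha=0` the function reduces to plain DPO, which makes it easy to see how much of the loss the extra term contributes for a given pair.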

RPS Schedule

RPS trained in two stages:

Stage 1: higher-plasticity DPO on easier ARC-style examples

Stage 2: lower-plasticity DPO on ARC-AGI-2 examples

The key plasticity change was the learning rate:

Stage 2 learning rate = 10% of Stage 1 learning rate

RPS:

Stage 1 learning rate: 1e-5

Stage 2 learning rate: 1e-6

EPS:

Stage 1 learning rate: 1e-5

Stage 2 learning rate: 1e-5

This was intended to mimic a developmental pattern: first learn broad ARC-style reasoning and program-synthesis behavior with higher plasticity, then adapt to harder ARC-AGI-2 examples with reduced plasticity.

EPS Schedule

EPS used the same two-stage dataset structure, but did not reduce plasticity in Stage 2:

Stage 1 learning rate = Stage 2 learning rate
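The two schedules differ only in the Stage 2 learning rate, which can be expressed as a single ratio parameter. The sketch below uses the learning rates and dataset sizes given above; the class and function names are hypothetical, and the cosine curve assumes no warmup and decay to zero, which may differ from the managed service's scheduler.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class StageConfig:
    name: str
    num_pairs: int
    peak_lr: float


def make_schedule(stage1_lr: float, stage2_lr_ratio: float):
    """Build the two-stage plan; only the Stage 2 learning rate varies."""
    return [
        StageConfig("stage1", 400, stage1_lr),
        StageConfig("stage2", 600, stage1_lr * stage2_lr_ratio),
    ]


RPS = make_schedule(1e-5, 0.1)  # Stage 2 at 10% of the Stage 1 learning rate
EPS = make_schedule(1e-5, 1.0)  # control: same learning rate in both stages


def cosine_lr(step: int, total_steps: int, peak_lr: float) -> float:
    """Cosine decay from peak_lr at step 0 to 0 at total_steps.

    Assumes no warmup and a final learning rate of zero.
    """
    progress = min(max(step / total_steps, 0.0), 1.0)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))
```

Because each stage is a separate fine-tuning run, the cosine curve would restart at the stage's own peak learning rate, so under RPS every Stage 2 step uses a rate ten times smaller than the corresponding EPS step.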

Eval results:

On ARC-AGI-2 public evaluation, neither RPS nor EPS solved any full tasks. Both models scored 0/120 official task solves and 0/167 exact test-output matches.

However, RPS showed a large improvement in program-synthesis reliability.

Both the RPS model and the EPS model were told that program synthesis was allowed.

ARC-AGI-2 public eval results:

  • RPS official task solves: 0/120

  • EPS official task solves: 0/120

  • RPS exact test-output accuracy: 0/167

  • EPS exact test-output accuracy: 0/167

  • RPS valid output attempts: 319/334, or 95.5%

  • EPS valid output attempts: 268/334, or 80.2%

  • RPS token-limit hits: 1

  • EPS token-limit hits: 1

  • RPS API errors: 0

  • EPS API errors: 0

Program-synthesis statistics:

  • RPS attempts scored from executed programs: 214/240, or 89.2%

  • EPS attempts scored from executed programs: 176/240, or 73.3%

  • RPS program executions without error: 234/240, or 97.5%

  • EPS program executions without error: 188/240, or 78.3%

  • RPS parsed-JSON fallback attempts: 18

  • EPS parsed-JSON fallback attempts: 19

  • RPS invalid attempts: 8

  • EPS invalid attempts: 45

  • RPS tasks with both attempts scored from executed programs: 100/120, or 83.3%

  • EPS tasks with both attempts scored from executed programs: 60/120, or 50.0%

  • RPS tasks with at least one executed-program attempt: 114/120, or 95.0%

  • EPS tasks with at least one executed-program attempt: 116/120, or 96.7%

  • RPS tasks with zero executed-program attempts: 6/120

  • EPS tasks with zero executed-program attempts: 4/120
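The statistics above distinguish attempts scored from an executed program, attempts scored from a parsed-JSON fallback, and invalid attempts. One plausible way to assign those categories is sketched below. This is an illustrative reconstruction, not the evaluation harness used for these results: the function names, the grid-validity rule, and the assumption that the program source and fallback text have already been extracted from the response are all hypothetical.

```python
import json


def is_valid_grid(obj) -> bool:
    """A grid here is a non-empty rectangular list of lists of ints."""
    return (
        isinstance(obj, list)
        and len(obj) > 0
        and all(isinstance(row, list) and len(row) > 0 for row in obj)
        and len({len(row) for row in obj}) == 1
        and all(
            isinstance(cell, int) and not isinstance(cell, bool)
            for row in obj
            for cell in row
        )
    )


def classify_attempt(program_src, test_input, fallback_json):
    """Return (source, grid); source is 'executed_program', 'parsed_json', or 'invalid'."""
    if program_src:
        namespace = {}
        try:
            # NOTE: model-written code is untrusted; run it only in a sandbox.
            exec(program_src, namespace)
            grid = namespace["transform"]([row[:] for row in test_input])
            if is_valid_grid(grid):
                return "executed_program", grid
        except Exception:
            pass  # fall through to the JSON fallback
    if fallback_json:
        try:
            grid = json.loads(fallback_json)
            if is_valid_grid(grid):
                return "parsed_json", grid
        except json.JSONDecodeError:
            pass
    return "invalid", None
```

Under this reading, a program can run without error yet still not be scored if it returns something that is not a valid grid, which would be one way for the "executions without error" count to exceed the "scored from executed programs" count.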

Paired attempt-level comparison:

  • Both RPS and EPS produced executed-program outputs: 160/240 attempts

  • RPS only produced an executed-program output: 54/240 attempts

  • EPS only produced an executed-program output: 16/240 attempts

  • Neither produced an executed-program output: 10/240 attempts
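The paired counts above form a 2x2 table whose margins should reproduce the per-model totals reported earlier (214/240 for RPS and 176/240 for EPS). The sketch below checks that arithmetic and, as an optional extra, applies an exact McNemar (sign) test to the discordant counts. The test is an addition made here for illustration, not part of the original analysis, and it treats attempts as independent even though each task contributes two attempts, so its p-value should be read cautiously.

```python
from math import comb

# Paired attempt-level counts copied from the list above.
both, rps_only, eps_only, neither = 160, 54, 16, 10

total = both + rps_only + eps_only + neither
rps_total = both + rps_only  # attempts where RPS produced an executed-program output
eps_total = both + eps_only  # attempts where EPS produced an executed-program output


def exact_mcnemar_p(b: int, c: int) -> float:
    """Two-sided exact McNemar (sign) test on discordant counts b and c."""
    n = b + c
    tail = sum(comb(n, i) for i in range(min(b, c) + 1)) / 2**n
    return min(1.0, 2.0 * tail)


p_value = exact_mcnemar_p(rps_only, eps_only)
```

Only the 70 discordant attempts (54 RPS-only and 16 EPS-only) carry information about which schedule is more reliable; the 170 concordant attempts do not affect the test.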

Interpretation:

RPS did not improve ARC-AGI-2 task accuracy in this run, but it substantially improved program-synthesis reliability. The clearest signal is that RPS produced usable executed-program outputs on both attempts for 100/120 tasks, compared with 60/120 for EPS. This suggests that reduced-plasticity Stage 2 training made the model more consistent at staying in the intended reasoning/program-synthesis mode, even though that behavioral improvement did not yet translate into correct ARC-AGI-2 solutions.

ARC-AGI-1 public eval

Base model: qwen3-8b

RPS:

"output_exact_accuracy": 0.0405727923627685

EPS:

"output_exact_accuracy": 0.02386634844868735

Program Synthesis Stats

Program executions without error:

RPS: 1145/1200, or 95.4%

EPS: 870/1200, or 72.5%

I came up with the RPS idea myself, but I used Codex to help me with the training.
