GitHub - iamjasonfeng/RPS

I created an LLM post-training method called the Regressive Plasticity Schedule (RPS). In preliminary results on Qwen3-8b, RPS improved ARC-AGI-1 accuracy and program-synthesis reliability relative to an equal-learning-rate control (EPS). Neither schedule solved any ARC-AGI-2 tasks.

RPS is inspired by neuroscience. Humans learn basic skills as children, when neuroplasticity is high, and advanced skills as teens and adults, when neuroplasticity is lower. RPS trains a model in two stages. In Stage 1, the model is trained on easy data with a high learning rate. In Stage 2, the model is trained on hard data with 10% of the Stage 1 learning rate. RPS is essentially a combination of two existing ideas: curriculum learning and learning rate decay.

Training setup:

RPS and EPS Training Setup

For these experiments, I used qwen3-8b as the base model and trained it with Alibaba Model Studio managed DPO fine-tuning using LoRA. The goal was to test whether a staged “plasticity schedule” can improve ARC-style reasoning behavior, especially the model’s tendency to produce usable program-synthesis solutions.

I compared two schedules:

RPS, or Regressive Plasticity Schedule, uses higher plasticity in Stage 1 and lower plasticity in Stage 2. Operationally, I implemented this by using the same DPO/LoRA setup in both stages, but reducing the Stage 2 learning rate to 10% of the Stage 1 learning rate.

EPS, or Equal Plasticity Schedule, is the control condition. It uses the same Stage 1 model and the same Stage 2 dataset, but Stage 2 keeps the same learning rate as Stage 1.

All other settings were kept as similar as possible between RPS and EPS.

Dataset

The ARC DPO dataset had 1,000 total preference pairs.

Stage 1 used 400 DPO pairs from easier ARC-style training tasks. It contained 400 unique tasks. Most examples came from ARC-AGI-1 or ARC-AGI-1/2-overlap tasks, with a small number of ARC-AGI-2 training examples used to fill the quota.

Stage 2 used 600 DPO pairs from ARC-AGI-2 training tasks only. It contained 250 unique tasks, with some duplicate task IDs allowed to fill the 600-pair quota. No task IDs overlapped between Stage 1 and Stage 2.
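The split constraints above (pair counts, unique-task counts, and no shared task IDs between stages) can be checked mechanically. The following is a minimal sketch, not the script used for this repo; `check_split` and the toy task IDs are hypothetical, and it assumes one task ID is recorded per DPO pair.

```python
from collections import Counter


def check_split(stage1_task_ids, stage2_task_ids):
    """Summarize a two-stage split, given one task ID per DPO pair."""
    s1, s2 = set(stage1_task_ids), set(stage2_task_ids)
    return {
        "stage1_pairs": len(stage1_task_ids),
        "stage1_unique_tasks": len(s1),
        "stage2_pairs": len(stage2_task_ids),
        "stage2_unique_tasks": len(s2),
        # Must be empty if no task IDs overlap between the stages.
        "overlapping_task_ids": sorted(s1 & s2),
        # Greater than 1 when task IDs are reused to fill the quota.
        "max_pairs_per_stage2_task": max(Counter(stage2_task_ids).values()),
    }


# Toy IDs standing in for real ARC task IDs (hypothetical).
stage1 = [f"easy-{i}" for i in range(400)]
stage2 = [f"hard-{i % 250}" for i in range(600)]
report = check_split(stage1, stage2)
print(report)
```

With real data, the two lists would come from whatever task-ID field the dataset carries.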

Each DPO pair had:

A prompt containing ARC training examples and test inputs.
A chosen response containing reasoning, a Python transform program, and predicted test outputs.
A rejected response generated by qwen3-0.6b under normal solving conditions, not by asking it to be wrong.
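As an illustration of the three-part pair described above, here is a minimal sketch of one record serialized as a JSON line. The field names (`task_id`, `prompt`, `chosen`, `rejected`) and the JSONL layout are assumptions for illustration only. The upload format that Alibaba Model Studio expects is not stated here and may differ.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class DPOPair:
    task_id: str   # hypothetical field: ARC task the pair was built from
    prompt: str    # ARC training examples and test inputs
    chosen: str    # preferred response: reasoning, program, predicted outputs
    rejected: str  # response from the smaller model under normal solving


def to_jsonl_line(pair: DPOPair) -> str:
    """Serialize one pair as a single JSON line."""
    if pair.chosen == pair.rejected:
        raise ValueError("chosen and rejected responses must differ")
    return json.dumps(asdict(pair))


example = DPOPair(
    task_id="toy-task",
    prompt="<training examples and test inputs>",
    chosen="<preferred response>",
    rejected="<dispreferred response>",
)
line = to_jsonl_line(example)
```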

The chosen responses came from Trelis/arc-agi-2-reasoning-5, a dataset of ARC reasoning traces and Python program-synthesis solutions. I used those traces as the preferred responses, then generated rejected responses with qwen3-0.6b under normal solving conditions. The underlying ARC tasks were restricted to official ARC-AGI training tasks: Stage 1 used mostly ARC-AGI-1 or ARC-AGI-1/2-overlap tasks, while Stage 2 used ARC-AGI-2 training tasks only.

Both chosen and rejected responses were normalized into the same labeled structure:

...

Python program:

def transform(grid):
    ...

Predicted test outputs:

...

Shared Fine-Tuning Settings

Both RPS and EPS used the same base model, data format, and fine-tuning method:

Base model: qwen3-8b
Fine-tuning method: DPO
Adapter method: LoRA
Validation split: 1%
Max sequence length: 32k
Scheduler: cosine
RPO alpha: 1
Stage 1 dataset size: 400 pairs
Stage 2 dataset size: 600 pairs
No replay
No ARC-AGI public evaluation data was used for training
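For readers unfamiliar with the objective behind these settings, here is a minimal pure-Python sketch of a per-example DPO loss with an added likelihood term. Several things in it are assumptions: `beta=0.1` is a placeholder (the value used in these runs is not listed above), and treating "RPO alpha" as the weight on a length-normalized negative log-likelihood of the chosen response follows one common convention. How the managed service defines it is not confirmed here.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp,
             beta=0.1, rpo_alpha=1.0, chosen_num_tokens=1):
    """DPO preference loss plus rpo_alpha * NLL(chosen) (assumed convention).

    The *_logp arguments are summed token log-probabilities of a full response.
    """
    # How much more the policy favors chosen over rejected than the reference does.
    margin = ((policy_chosen_logp - ref_chosen_logp)
              - (policy_rejected_logp - ref_rejected_logp))
    preference_term = -log_sigmoid(beta * margin)
    # Length-normalized negative log-likelihood of the chosen response.
    nll_term = -policy_chosen_logp / chosen_num_tokens
    return preference_term + rpo_alpha * nll_term


# When the policy equals the reference, the margin is zero and the
# preference term alone is log(2).
baseline = dpo_loss(-10.0, -12.0, -10.0, -12.0, rpo_alpha=0.0)
```

The learning rate does not appear in the loss. It only scales how far each gradient step on this objective moves the LoRA weights, and that step size is the quantity RPS and EPS vary.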

RPS Schedule

RPS trained in two stages:

Stage 1: higher-plasticity DPO on easier ARC-style examples

Stage 2: lower-plasticity DPO on ARC-AGI-2 examples

The key plasticity change was the learning rate:

Stage 2 learning rate = 10% of Stage 1 learning rate

RPS:

Stage 1 learning rate: 1e-5

Stage 2 learning rate: 1e-6

EPS:

Stage 1 learning rate: 1e-5

Stage 2 learning rate: 1e-5

This was intended to mimic a developmental pattern: first learn broad ARC-style reasoning and program-synthesis behavior with higher plasticity, then adapt to harder ARC-AGI-2 examples with reduced plasticity.

EPS Schedule

EPS used the same two-stage dataset structure, but did not reduce plasticity in Stage 2:

Stage 1 learning rate = Stage 2 learning rate
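The two schedules differ in a single number, which a small config sketch makes explicit. The names `StageConfig`, `make_schedule`, and `cosine_lr` are hypothetical. The cosine curve shown assumes no warmup, decay to zero, and a fresh curve per stage. None of those details are confirmed for the managed fine-tuning service.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class StageConfig:
    name: str
    num_pairs: int
    peak_lr: float


def make_schedule(stage1_lr: float, stage2_ratio: float):
    """Two-stage schedule; stage2_ratio = Stage 2 LR / Stage 1 LR."""
    return (
        StageConfig("stage1_easy", 400, stage1_lr),
        StageConfig("stage2_hard", 600, stage1_lr * stage2_ratio),
    )


RPS = make_schedule(1e-5, 0.1)  # Stage 2 at 10% of Stage 1
EPS = make_schedule(1e-5, 1.0)  # control: same LR in both stages


def cosine_lr(step: int, total_steps: int, peak_lr: float) -> float:
    """Cosine decay from peak_lr to 0 (assumed shape, no warmup)."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


for label, schedule in (("RPS", RPS), ("EPS", EPS)):
    for stage in schedule:
        print(label, stage.name, stage.num_pairs, f"{stage.peak_lr:.1e}")
```

Keeping everything except `stage2_ratio` fixed is what makes EPS a control for RPS.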

Eval results:

On ARC-AGI-2 public evaluation, neither RPS nor EPS solved any full tasks. Both models scored 0/120 official task solves and 0/167 exact test-output matches.

However, RPS was markedly more reliable at program synthesis: 89.2% of its attempts were scored from executed programs, compared with 73.3% for EPS.

Both the RPS model and the EPS model were told that program synthesis was allowed.

ARC-AGI-2 public eval results:

RPS official task solves: 0/120
EPS official task solves: 0/120
RPS exact test-output accuracy: 0/167
EPS exact test-output accuracy: 0/167
RPS valid output attempts: 319/334, or 95.5%
EPS valid output attempts: 268/334, or 80.2%
RPS token-limit hits: 1
EPS token-limit hits: 1
RPS API errors: 0
EPS API errors: 0

Program-synthesis statistics:

RPS attempts scored from executed programs: 214/240, or 89.2%
EPS attempts scored from executed programs: 176/240, or 73.3%
RPS program executions without error: 234/240
EPS program executions without error: 188/240
RPS parsed-JSON fallback attempts: 18
EPS parsed-JSON fallback attempts: 19
RPS invalid attempts: 8
EPS invalid attempts: 45
RPS tasks with both attempts scored from executed programs: 100/120, or 83.3%
EPS tasks with both attempts scored from executed programs: 60/120, or 50.0%
RPS tasks with at least one executed-program attempt: 114/120, or 95.0%
EPS tasks with at least one executed-program attempt: 116/120, or 96.7%
RPS tasks with zero executed-program attempts: 6/120
EPS tasks with zero executed-program attempts: 4/120
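The three attempt outcomes counted above (executed program, parsed-JSON fallback, invalid) suggest a scoring cascade along the following lines. This is a sketch inferred from the statistic names, not the evaluation harness itself. The function names are hypothetical, it handles a single test input, and it uses only the `Python program:` and `Predicted test outputs:` labels from the normalized response format.

```python
import json

FENCE = "`" * 3  # markdown code-fence marker, built this way to keep the example fence-safe


def extract_program(response: str) -> str:
    """Text between the 'Python program:' and 'Predicted test outputs:' labels."""
    _, found, rest = response.partition("Python program:")
    if not found:
        return ""
    program = rest.partition("Predicted test outputs:")[0]
    lines = [ln for ln in program.splitlines() if not ln.strip().startswith(FENCE)]
    return "\n".join(lines).strip()


def is_grid(value) -> bool:
    """True for a non-empty list of non-empty lists of ints."""
    return (isinstance(value, list) and len(value) > 0 and all(
        isinstance(row, list) and len(row) > 0
        and all(isinstance(cell, int) for cell in row)
        for row in value))


def score_attempt(response: str, test_input):
    """Return (source, grid); source is 'executed_program', 'parsed_json', or 'invalid'."""
    program = extract_program(response)
    if program:
        namespace = {}
        try:
            exec(program, namespace)  # untrusted code: sandbox this in real use
            grid = namespace["transform"](test_input)
            if is_grid(grid):
                return "executed_program", grid
        except Exception:
            pass  # fall through to the JSON fallback
    try:
        grid = json.loads(response.partition("Predicted test outputs:")[2].strip())
        if is_grid(grid):
            return "parsed_json", grid
    except ValueError:
        pass
    return "invalid", None
```

Under this reading, an attempt counts as a parsed-JSON fallback only when the program is missing, fails to run, or returns something that is not a grid.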

Paired attempt-level comparison:

Both RPS and EPS produced executed-program outputs: 160/240 attempts
RPS only produced an executed-program output: 54/240 attempts
EPS only produced an executed-program output: 16/240 attempts
Neither produced an executed-program output: 10/240 attempts
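The paired counts can be reconciled with the per-model totals reported earlier, which is a useful sanity check on the bookkeeping. The sketch below only re-derives numbers already listed in this README, and the variable and function names are hypothetical.

```python
# Paired attempt-level counts: who produced an executed-program output.
paired = {"both": 160, "rps_only": 54, "eps_only": 16, "neither": 10}

total_attempts = sum(paired.values())                  # 120 tasks x 2 attempts
rps_executed = paired["both"] + paired["rps_only"]     # RPS marginal total
eps_executed = paired["both"] + paired["eps_only"]     # EPS marginal total


def attempts_from_tasks(tasks_with_both: int, tasks_with_at_least_one: int) -> int:
    """Executed-program attempts implied by task-level counts (2 attempts per task)."""
    tasks_with_exactly_one = tasks_with_at_least_one - tasks_with_both
    return 2 * tasks_with_both + tasks_with_exactly_one


# Per-model outcome counts (executed, parsed-JSON fallback, invalid).
rps_outcomes = (214, 18, 8)
eps_outcomes = (176, 19, 45)

print(total_attempts, rps_executed, eps_executed)
print(attempts_from_tasks(100, 114), attempts_from_tasks(60, 116))
```

The attempt-level, task-level, and paired views all agree: 214 executed-program attempts for RPS and 176 for EPS out of 240.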

Interpretation:

RPS did not improve ARC-AGI-2 task accuracy in this run, but it substantially improved program-synthesis reliability. The clearest signal is that RPS produced usable executed-program outputs on both attempts for 100/120 tasks, compared with 60/120 for EPS. This suggests that reduced-plasticity Stage 2 training made the model more consistent at staying in the intended reasoning/program-synthesis mode, even though that behavioral improvement did not yet translate into correct ARC-AGI-2 solutions.

ARC-AGI-1 public eval results:

Base model: qwen3-8b
RPS output_exact_accuracy: 0.0405727923627685 (about 4.06%)
EPS output_exact_accuracy: 0.02386634844868735 (about 2.39%)

Program-synthesis statistics:

RPS program executions without error: 1145/1200, or 95.4%
EPS program executions without error: 870/1200, or 72.5%
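The two `output_exact_accuracy` values for ARC-AGI-1 are reported as raw floats, so the underlying counts are not shown. Assuming each value is a ratio of integer counts with a denominator under 1,000 (an assumption, since the denominator is not stated here), the counts can be recovered with the standard library. `as_count_ratio` is a hypothetical helper name.

```python
from fractions import Fraction


def as_count_ratio(accuracy: float, max_denominator: int = 1000) -> Fraction:
    """Closest fraction to `accuracy` with a denominator up to max_denominator."""
    return Fraction(accuracy).limit_denominator(max_denominator)


rps_ratio = as_count_ratio(0.0405727923627685)
eps_ratio = as_count_ratio(0.02386634844868735)

print(f"RPS: {rps_ratio.numerator}/{rps_ratio.denominator}")  # RPS: 17/419
print(f"EPS: {eps_ratio.numerator}/{eps_ratio.denominator}")  # EPS: 10/419
```

Under that assumption, RPS matched 17 of 419 test outputs exactly and EPS matched 10 of 419, a difference of 7 outputs.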

I came up with the RPS idea myself, but I used Codex to help me with the training.
