Record: 1.0240 BPB — Multi-Order N-gram Backoff + Entropy-Adaptive Alpha (100% autonomous research via goldfish)#702
Open
lukacf wants to merge 4 commits into openai:main from
Conversation
3-seed mean: 0.9789 BPB (sliding window, stride=64) | Best seed: 0.9779 (seed 7) | Std: 0.0015
Key innovation: autonomous ML research methodology. An AI coding agent discovered cosine LR scaling for TTT in a single 2-hour session: 7 experiments from hypothesis to record.
Technical: CosineAnnealingLR over 100 TTT epochs (a 3-line change).
Architecture: PR openai#398/openai#442 base (11L, int6+zstd, 15.51 MB).
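The "3-line change" itself isn't shown in this thread; the schedule CosineAnnealingLR implements is standard, though, so a pure-Python restatement (minimum LR assumed 0, peak LR hypothetical) is:

```python
import math

def cosine_lr(step, total_steps=100, lr_max=1e-3, lr_min=0.0):
    """Cosine annealing over `total_steps` TTT epochs, matching the formula
    used by torch.optim.lr_scheduler.CosineAnnealingLR:
        lr(t) = lr_min + 0.5 * (lr_max - lr_min) * (1 + cos(pi * t / T_max))
    lr_max here is a placeholder; the PR does not state the actual peak LR."""
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * step / total_steps))
```

The LR starts at lr_max, passes through lr_max/2 at the halfway point, and decays to lr_min at the final epoch.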
Novel eval-time n-gram cache with two extensions over vanilla 5-gram:
1. Multi-order backoff (2/3/4/5-gram, highest-order-first fallback)
2. Entropy-adaptive mixing weight (sigmoid-modulated, 0.05-0.40)
3-seed mean: 1.0244 BPB (std 0.0003) | 15.79 MB artifact
Beats merged SOTA (1.1194) by 0.095 BPB; beats best unmerged (1.0461) by 0.022 BPB.
Score-first legality: the cache is updated only after each position is scored.
Proper distribution: p_mixed = (1-a)*p_model + a*p_ng sums to 1.
Discovered and validated autonomously using Goldfish ML + Claude Code.
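The cache code itself isn't included in this thread; a minimal sketch of highest-order-first backoff over per-order count tables (class and method names hypothetical) might look like:

```python
from collections import defaultdict

class MultiOrderNgramCache:
    """Per-order context -> token-count tables with highest-order-first backoff.
    Hypothetical reconstruction; orders and (absent) smoothing are assumptions."""

    def __init__(self, orders=(5, 4, 3, 2)):
        self.orders = sorted(orders, reverse=True)  # try the longest context first
        self.tables = {n: defaultdict(lambda: defaultdict(int)) for n in self.orders}

    def update(self, tokens):
        """Score-first legality: call this only AFTER the position was scored."""
        for n in self.orders:
            if len(tokens) >= n:
                ctx = tuple(tokens[-n:-1])          # n-1 context tokens
                self.tables[n][ctx][tokens[-1]] += 1

    def predict(self, context):
        """Sparse n-gram distribution; on a cache miss, back off to lower orders."""
        for n in self.orders:
            ctx = tuple(context[-(n - 1):])
            counts = self.tables[n].get(ctx)
            if counts:
                total = sum(counts.values())
                return {t: c / total for t, c in counts.items()}
        return None                                  # no hit at any order
```

Unseen tokens implicitly get probability 0 in the sparse distribution, so each per-order table still sums to 1 and the mixing step stays a proper distribution.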
Asukabot0 added a commit to Asukabot0/parameter-golf that referenced this pull request on Mar 25, 2026:
Two eval-time improvements (no retraining needed):
1. Multi-order backoff (orders 2-7): when the 7-gram has no cache hit, fall back to the 6/5/4/3/2-gram. This dramatically increases the cache hit rate on 8xH100, where each per-GPU cache is sparse. PR openai#702 reports -0.018 BPB.
2. Entropy-adaptive alpha: alpha = 0.05 + 0.55 * sigmoid(2*(H-4.0)). Model uncertain → trust the n-gram more; model confident → keep the LM. Compliant: alpha depends only on the model's own distribution.
Both are configurable via env vars (NGRAM_ENTROPY=0 disables the adaptive weight).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
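The commit message gives the adaptive-alpha formula but no code. A minimal sketch (entropy in bits; the 0.55 span is this commit's constant, whereas PR #702's own description uses 0.35):

```python
import math

def entropy_bits(p):
    """Shannon entropy of the model's next-token distribution, in bits."""
    return -sum(q * math.log2(q) for q in p if q > 0.0)

def adaptive_alpha(p_model, lo=0.05, span=0.55, pivot=4.0, slope=2.0):
    """alpha = lo + span * sigmoid(slope * (H - pivot)), as in the commit above.
    Depends only on the model's own output distribution, so it is eval-legal."""
    h = entropy_bits(p_model)
    return lo + span / (1.0 + math.exp(-slope * (h - pivot)))
```

At the pivot (H = 4 bits, e.g. a uniform distribution over 16 tokens) the sigmoid is 0.5, so alpha = 0.05 + 0.55/2 = 0.325; a fully confident model drives alpha toward the 0.05 floor.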
Author
Follow-up: No-GPTQ Validation
We also validated the result without GPTQ (using percentile-search quantization instead), confirming that the n-gram technique is robust to the quantization method:
- With GPTQ (submitted above)
- Without GPTQ (percentile search only)
The n-gram improvement is consistent regardless of quantization method: GPTQ vs. percentile search changes the result by only ~0.001 BPB. The no-GPTQ variant uses the full 600s for training with no post-training calibration step. Happy to provide the no-GPTQ variant on request.
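"Percentile search" isn't defined in this thread; a plausible reading (an assumption, not the authors' code) is searching over clipping percentiles of the absolute weights and keeping the one that minimizes reconstruction error under uniform int6 quantization:

```python
def quantize_dequant(ws, clip, bits=6):
    """Symmetric uniform quantize/dequantize after clipping at +/-clip.
    Uses the symmetric grid [-qmax, qmax] (drops int6's -32 code for simplicity)."""
    qmax = 2 ** (bits - 1) - 1                      # 31 for int6
    scale = clip / qmax if clip > 0 else 1.0
    return [max(-qmax, min(qmax, round(w / scale))) * scale for w in ws]

def percentile_clip_search(ws, percentiles=(99.0, 99.5, 99.9, 100.0)):
    """Hypothetical percentile search: pick the clipping percentile of |w|
    that minimizes squared reconstruction error."""
    aws = sorted(abs(w) for w in ws)
    best_err, best_p, best_clip = None, None, None
    for p in percentiles:
        idx = min(len(aws) - 1, int(p / 100.0 * len(aws)))
        clip = aws[idx] if aws[idx] > 0 else aws[-1]
        err = sum((w - r) ** 2 for w, r in zip(ws, quantize_dequant(ws, clip)))
        if best_err is None or err < best_err:
            best_err, best_p, best_clip = err, p, clip
    return best_p, best_clip
```

Unlike GPTQ, this needs no Hessian or calibration pass, which is consistent with the note that the no-GPTQ variant spends the full 600s on training.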
Summary
val_bpb = 1.0244 (3-seed mean) | 15.79 MB artifact | 8xH100 SXM, 600s training + 124s eval
Novel eval-time n-gram cache with two extensions over the vanilla 5-gram approach (PR #674):
alpha = 0.05 + 0.35 * sigmoid(2 * (H - 4.0))
These extensions improve over vanilla 5-gram by 0.018 BPB (1.0423 → 1.0240) on the same base model.
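A numeric check of the mixture this formula feeds into (alpha taken at H = 4 bits, where sigmoid(0) = 0.5 gives alpha = 0.05 + 0.35*0.5 = 0.225; the toy distributions are illustrative only):

```python
def mix(p_model, p_ngram, alpha):
    """p_mixed = (1-a)*p_model + a*p_ng: a convex combination of two proper
    distributions, so it sums to 1 without any renormalization. The target
    token's mixed probability can therefore be read off directly, which is
    why target-only lookup matches a full-vocab computation."""
    return [(1.0 - alpha) * pm + alpha * pn for pm, pn in zip(p_model, p_ngram)]

p_model = [0.7, 0.2, 0.1]      # toy 3-token vocab
p_ngram = [0.0, 1.0, 0.0]      # cache puts all mass on token 1
p_mixed = mix(p_model, p_ngram, 0.225)
```

Here p_mixed[1] = 0.775*0.2 + 0.225*1.0 = 0.38, and the three entries still sum to 1.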
Legality
p_mixed = (1-a)*p_model + a*p_ng sums to 1 over all vocab tokens. Both components are proper distributions. Target-only lookup is mathematically equivalent to full-vocab computation. See code comments for proof.
3-Seed Validation
Training Architecture
11L transformer, 512d, GQA, LeakyReLU(0.5)², XSA-all(11), VRL, Soft-Round QAT, Full Hessian GPTQ (Cholesky + actorder), int6+zstd-22, 7% prune. ~6600 steps at 91ms/step on 8xH100 SXM.
Infrastructure
Discovered and validated autonomously in a single research session using Goldfish ML (MCP-based experiment platform) + Meerkat (agent harness) + an AI coding agent. 12 experiments from first hypothesis to submission, with full provenance and documented dead ends. See README for complete experiment timeline and lineage.