Record: 1.0240 BPB — Multi-Order N-gram Backoff + Entropy-Adaptive Alpha (100% autonomous research via goldfish)#702
Open
lukacf wants to merge 4 commits into openai:main from
Conversation
3-seed mean: 0.9789 BPB (sliding window, stride=64) | Best seed: 0.9779 (seed 7) | Std: 0.0015
Key innovation: autonomous ML research methodology. An AI coding agent discovered cosine LR scaling for TTT in a single 2-hour session: 7 experiments from hypothesis to record.
Technical: CosineAnnealingLR over 100 TTT epochs (a 3-line change).
Architecture: PR openai#398/openai#442 base (11L, int6+zstd, 15.51 MB).
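The "3-line change" itself isn't shown in this thread; the schedule CosineAnnealingLR implements is standard, though, so a pure-Python restatement (minimum LR assumed 0, peak LR hypothetical) is:

```python
import math

def cosine_lr(step, total_steps=100, lr_max=1e-3, lr_min=0.0):
    """Cosine annealing over `total_steps` TTT epochs, matching the formula
    used by torch.optim.lr_scheduler.CosineAnnealingLR:
        lr(t) = lr_min + 0.5 * (lr_max - lr_min) * (1 + cos(pi * t / T_max))
    lr_max here is a placeholder; the PR does not state the actual peak LR."""
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * step / total_steps))
```

The LR starts at lr_max, passes through lr_max/2 at the halfway point, and decays to lr_min at the final epoch.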
Novel eval-time n-gram cache with two extensions over vanilla 5-gram:
1. Multi-order backoff (2/3/4/5-gram, highest-order-first fallback)
2. Entropy-adaptive mixing weight (sigmoid-modulated, 0.05-0.40)
3-seed mean: 1.0244 BPB (std 0.0003) | 15.79 MB artifact
Beats merged SOTA (1.1194) by 0.095 BPB; beats best unmerged (1.0461) by 0.022 BPB.
Score-first legality: the cache is updated only after each position is scored.
Proper distribution: p_mixed = (1-a)*p_model + a*p_ng sums to 1.
Discovered and validated autonomously using Goldfish ML + Claude Code.
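The cache code itself isn't included in this thread; a minimal sketch of highest-order-first backoff over per-order count tables (class and method names hypothetical) might look like:

```python
from collections import defaultdict

class MultiOrderNgramCache:
    """Per-order context -> token-count tables with highest-order-first backoff.
    Hypothetical reconstruction; orders and (absent) smoothing are assumptions."""

    def __init__(self, orders=(5, 4, 3, 2)):
        self.orders = sorted(orders, reverse=True)  # try the longest context first
        self.tables = {n: defaultdict(lambda: defaultdict(int)) for n in self.orders}

    def update(self, tokens):
        """Score-first legality: call this only AFTER the position was scored."""
        for n in self.orders:
            if len(tokens) >= n:
                ctx = tuple(tokens[-n:-1])          # n-1 context tokens
                self.tables[n][ctx][tokens[-1]] += 1

    def predict(self, context):
        """Sparse n-gram distribution; on a cache miss, back off to lower orders."""
        for n in self.orders:
            ctx = tuple(context[-(n - 1):])
            counts = self.tables[n].get(ctx)
            if counts:
                total = sum(counts.values())
                return {t: c / total for t, c in counts.items()}
        return None                                  # no hit at any order
```

Unseen tokens implicitly get probability 0 in the sparse distribution, so each per-order table still sums to 1 and the mixing step stays a proper distribution.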
Asukabot0 added a commit to Asukabot0/parameter-golf that referenced this pull request on Mar 25, 2026:
Two eval-time improvements (no retraining needed):
1. Multi-order backoff (orders 2-7): when the 7-gram has no cache hit, fall back to the 6/5/4/3/2-gram. This dramatically increases the cache hit rate on 8xH100, where each per-GPU cache is sparse. PR openai#702 reports -0.018 BPB.
2. Entropy-adaptive alpha: alpha = 0.05 + 0.55 * sigmoid(2*(H-4.0)). Model uncertain → trust the n-gram more; model confident → keep the LM. Compliant: alpha depends only on the model's own distribution.
Both are configurable via env vars (NGRAM_ENTROPY=0 disables the adaptive weight).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
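The commit message gives the adaptive-alpha formula but no code. A minimal sketch (entropy in bits; the 0.55 span is this commit's constant, whereas PR #702's own description uses 0.35):

```python
import math

def entropy_bits(p):
    """Shannon entropy of the model's next-token distribution, in bits."""
    return -sum(q * math.log2(q) for q in p if q > 0.0)

def adaptive_alpha(p_model, lo=0.05, span=0.55, pivot=4.0, slope=2.0):
    """alpha = lo + span * sigmoid(slope * (H - pivot)), as in the commit above.
    Depends only on the model's own output distribution, so it is eval-legal."""
    h = entropy_bits(p_model)
    return lo + span / (1.0 + math.exp(-slope * (h - pivot)))
```

At the pivot (H = 4 bits, e.g. a uniform distribution over 16 tokens) the sigmoid is 0.5, so alpha = 0.05 + 0.55/2 = 0.325; a fully confident model drives alpha toward the 0.05 floor.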
Author
Follow-up: No-GPTQ Validation
We also validated the result without GPTQ (using percentile-search quantization instead), confirming that the n-gram technique is robust to the quantization method:
- With GPTQ (submitted above)
- Without GPTQ (percentile search only)
The n-gram improvement is consistent regardless of quantization method: GPTQ vs. percentile search changes the result by only ~0.001 BPB. The no-GPTQ variant uses the full 600s for training with no post-training calibration step. Happy to provide the no-GPTQ variant on request.
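"Percentile search" isn't defined in this thread; a plausible reading (an assumption, not the authors' code) is searching over clipping percentiles of the absolute weights and keeping the one that minimizes reconstruction error under uniform int6 quantization:

```python
def quantize_dequant(ws, clip, bits=6):
    """Symmetric uniform quantize/dequantize after clipping at +/-clip.
    Uses the symmetric grid [-qmax, qmax] (drops int6's -32 code for simplicity)."""
    qmax = 2 ** (bits - 1) - 1                      # 31 for int6
    scale = clip / qmax if clip > 0 else 1.0
    return [max(-qmax, min(qmax, round(w / scale))) * scale for w in ws]

def percentile_clip_search(ws, percentiles=(99.0, 99.5, 99.9, 100.0)):
    """Hypothetical percentile search: pick the clipping percentile of |w|
    that minimizes squared reconstruction error."""
    aws = sorted(abs(w) for w in ws)
    best_err, best_p, best_clip = None, None, None
    for p in percentiles:
        idx = min(len(aws) - 1, int(p / 100.0 * len(aws)))
        clip = aws[idx] if aws[idx] > 0 else aws[-1]
        err = sum((w - r) ** 2 for w, r in zip(ws, quantize_dequant(ws, clip)))
        if best_err is None or err < best_err:
            best_err, best_p, best_clip = err, p, clip
    return best_p, best_clip
```

Unlike GPTQ, this needs no Hessian or calibration pass, which is consistent with the note that the no-GPTQ variant spends the full 600s on training.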
Summary
val_bpb = 1.0244 (3-seed mean) | 15.79 MB artifact | 8xH100 SXM, 600s training + 124s eval
Novel eval-time n-gram cache with two extensions over the vanilla 5-gram approach (PR #674):
alpha = 0.05 + 0.35 * sigmoid(2 * (H - 4.0))
These extensions improve over vanilla 5-gram by 0.018 BPB (1.0423 → 1.0240) on the same base model.
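A numeric check of the mixture this formula feeds into (alpha taken at H = 4 bits, where sigmoid(0) = 0.5 gives alpha = 0.05 + 0.35*0.5 = 0.225; the toy distributions are illustrative only):

```python
def mix(p_model, p_ngram, alpha):
    """p_mixed = (1-a)*p_model + a*p_ng: a convex combination of two proper
    distributions, so it sums to 1 without any renormalization. The target
    token's mixed probability can therefore be read off directly, which is
    why target-only lookup matches a full-vocab computation."""
    return [(1.0 - alpha) * pm + alpha * pn for pm, pn in zip(p_model, p_ngram)]

p_model = [0.7, 0.2, 0.1]      # toy 3-token vocab
p_ngram = [0.0, 1.0, 0.0]      # cache puts all mass on token 1
p_mixed = mix(p_model, p_ngram, 0.225)
```

Here p_mixed[1] = 0.775*0.2 + 0.225*1.0 = 0.38, and the three entries still sum to 1.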
Legality
p_mixed = (1-a)*p_model + a*p_ng sums to 1 over all vocab tokens. Both components are proper distributions. Target-only lookup is mathematically equivalent to full-vocab computation. See code comments for proof.
3-Seed Validation
Training Architecture
11L transformer, 512d, GQA, LeakyReLU(0.5)², XSA-all(11), VRL, Soft-Round QAT, Full Hessian GPTQ (Cholesky + actorder), int6+zstd-22, 7% prune. ~6600 steps at 91ms/step on 8xH100 SXM.
Infrastructure
Discovered and validated autonomously in a single research session using Goldfish ML (MCP-based experiment platform) + Meerkat (agent harness) + an AI coding agent. 12 experiments from first hypothesis to submission, with full provenance and documented dead ends. See README for complete experiment timeline and lineage.