[WIP] Sparse Attention + Recursive Weight Sharing for 16MB Efficiency #5
Closed
albertorkive wants to merge 1 commit into openai:main from
Conversation
Adds three research levers to train_gpt.py (1177 lines, under the 1500-line cap):

- Sliding window attention (WINDOW_SIZE env var, default 0 = dense): causal attention restricted to a local window; reduces compute and may allow wider/deeper models within the 16MB artifact budget.
- Recursive weight tying (NUM_PHYSICAL_LAYERS env var): N physical blocks reused across NUM_LAYERS logical layers. E.g. NUM_PHYSICAL_LAYERS=3 with NUM_LAYERS=9 gives 3x effective depth at the same parameter count.
- Dev/mini-run support (DEV_MODE, SKIP_QUANT env vars): DEV_MODE=1 allows MPS/CPU fallback for local architecture testing; SKIP_QUANT=1 skips int8+zlib serialization for fast iteration.

Also adds mini_run.sh convenience script for quick local runs.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
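The sliding-window lever described above can be sketched as a boolean attention mask. This is an illustrative reconstruction, not the actual train_gpt.py code; the function name and the plain-Python representation are hypothetical stand-ins for the tensor mask a real implementation would build.

```python
def sliding_window_causal_mask(seq_len, window_size):
    """Return a seq_len x seq_len boolean mask.

    mask[i][j] is True when query position i may attend to key
    position j: j <= i (causal), and additionally i - j < window_size
    when window_size > 0 (local window). window_size == 0 falls back
    to dense causal attention, matching the WINDOW_SIZE=0 default.
    """
    mask = [[False] * seq_len for _ in range(seq_len)]
    for i in range(seq_len):
        # First key position row i is allowed to see.
        lo = 0 if window_size == 0 else max(0, i - window_size + 1)
        for j in range(lo, i + 1):
            mask[i][j] = True
    return mask

# Dense causal (WINDOW_SIZE=0): row 3 attends to positions 0..3.
dense = sliding_window_causal_mask(4, 0)
# Window of 2: row 3 attends only to positions 2 and 3.
local = sliding_window_causal_mask(4, 2)
```

In a real model this mask would be applied additively (with -inf at False positions) before the softmax; only the masking pattern itself is shown here.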
jordankzf: bro we can see claude did it

albertorkive (Author): @jordankzf goal is scientific advancement, not who did it

albertorkive (Author): Lol @jordankzf cause you definitely are hand coding every character 😂 Based on that comment you are either a hypocrite or slow (adopting tools, I mean) 😬

jordankzf: @albertorkive Just messing with you, man. Claude (and Codex) is awesome! Good luck on your submission :)

albertorkive (Author): @jordankzf likewise, good luck. May the best coding assistant win 😌
Force-pushed 193a53b to 31e777e.
gb250e referenced this pull request in gb250e/parameter-golf on Mar 21, 2026.
dhruvjatkar referenced this pull request in dhruvjatkar/parameter-golf on Mar 25, 2026:
PR openai#672 maxes TTT at 30 epochs (590s/600s eval budget), so all future improvements must be orthogonal to TTT. This update:

- Sets 1.0781 BPB (PR openai#672) as the new target to beat
- Reorders Top 8 directions: XSA-all confirmed at #1, Full GPTQ #2, SwiGLU #3, Muon-VS #4, aggressive quant openai#5, MASA openai#6, depth recurrence openai#7 with int6 risk warning, AdEMAMix openai#8
- Deprioritizes TTT-related directions already exploited by PR openai#672
- Collapses ~1000 lines of stale Round 0-3.9 session logs into a concise historical summary
- Removes resolved blockers (flash_attn, SSH hangs, local runtime)
- Adds fresh Round 1 section with 5 submitted experiments

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Research Lead: Rkive AI
This PR introduces a flexible architecture skeleton designed to maximize parameter density within the 16MB artifact constraint.
Key Mechanisms:

- Sliding window attention (`WINDOW_SIZE`): Restricts causal attention to a local window. This reduces attention compute, allowing larger `d_model` configurations within the 10-minute training wall-clock.
- Recursive weight tying (`NUM_PHYSICAL_LAYERS`): Implements weight sharing across logical layers (e.g., 3 physical blocks acting as 9 logical layers). This increases "effective depth" without increasing the stored parameter count in the 16MB `.pt` file.
- `torch.compile` and quantization-ready layers to hit the sub-1.20 bpb target.

Status: Initial skeleton pushed. Seeking Development Grant (~$500) to perform hyperparameter sweeps on sparsity patterns and tie-ratios.
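The recursive weight-tying mechanism described above amounts to cycling a small pool of physical blocks across a longer sequence of logical layers. A minimal sketch, with hypothetical names and toy callables standing in for transformer blocks:

```python
def build_logical_schedule(num_physical, num_layers):
    """Map each logical layer index to the physical block it reuses.

    With 3 physical blocks and 9 logical layers, blocks are applied
    in the order 0,1,2,0,1,2,0,1,2: 3x effective depth, 1x stored
    parameters.
    """
    return [i % num_physical for i in range(num_layers)]

def forward(x, blocks, schedule):
    """Run x through the logical layer stack, reusing shared blocks."""
    for idx in schedule:
        x = blocks[idx](x)
    return x

# Toy "blocks": block k adds k to its input (real blocks would be
# transformer layers sharing weights).
blocks = [lambda v, k=k: v + k for k in range(3)]
schedule = build_logical_schedule(3, 9)  # [0,1,2,0,1,2,0,1,2]
out = forward(0, blocks, schedule)       # (0+1+2) applied 3 times -> 9
```

The serialized artifact stores only the `num_physical` blocks, which is what keeps the parameter count inside the 16MB budget while depth grows.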
Current baseline: 1.2244 bpb.
Target: < 1.20 bpb.
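For reference, the bpb metric above can be derived from mean cross-entropy loss. A hedged sketch of the standard conversion, assuming loss is reported in nats per token; the token/byte counts below are illustrative, not figures from this PR:

```python
import math

def bits_per_byte(nats_per_token, n_tokens, n_bytes):
    """Convert mean cross-entropy (nats/token) to bits per byte.

    bits/token = nats/token / ln(2); scaling by the tokens-to-bytes
    ratio of the eval corpus yields bits per byte (bpb).
    """
    bits_per_token = nats_per_token / math.log(2)
    return bits_per_token * n_tokens / n_bytes

# Illustrative: 3.40 nats/token on a corpus averaging 4 bytes/token
# gives roughly the baseline regime quoted above.
bpb = bits_per_byte(3.40, n_tokens=1000, n_bytes=4000)  # ~1.226
```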