
[WIP] Sparse Attention + Recursive Weight Sharing for 16MB Efficiency#5

Closed
albertorkive wants to merge 1 commit into openai:main from albertorkive:research/sparse-recurrent-skeleton

Conversation

@albertorkive

Research Lead: Rkive AI

This PR introduces a flexible architecture skeleton designed to maximize parameter density within the 16MB artifact constraint.

Key Mechanisms:

  1. Sliding Window Attention (WINDOW_SIZE): Restricts causal attention to a local window. This reduces the $O(N^2)$ compute bottleneck on the 8xH100s, potentially allowing wider d_model configurations within the 10-minute wall-clock training budget.
  2. Recursive Weight Tying (NUM_PHYSICAL_LAYERS): Implements weight sharing across logical layers (e.g., 3 physical blocks acting as 9 logical layers). This increases "effective depth" without increasing the stored parameter count in the 16MB .pt file.
  3. Optimized for 8xH100s: Includes hooks for torch.compile and quantization-ready layers to hit the sub-1.20 bpb target.
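The windowed mask in mechanism 1 can be sketched as follows. This is illustrative only: `sliding_window_mask` and the exact window semantics (causal, attend to the last `window` positions, `window <= 0` meaning dense) are assumptions, not the actual train_gpt.py implementation.

```python
def sliding_window_mask(n, window):
    # mask[i][j] is True when query i may attend to key j:
    # causal (j <= i) and within the last `window` positions.
    # window <= 0 falls back to dense causal attention.
    mask = []
    for i in range(n):
        row = []
        for j in range(n):
            causal = j <= i
            local = window <= 0 or i - j < window
            row.append(causal and local)
        mask.append(row)
    return mask

# Each query attends to at most `window` keys, so attention cost
# drops from O(N^2) to O(N * window).
m = sliding_window_mask(5, 2)
assert m[4] == [False, False, False, True, True]  # only the last 2 keys
assert sliding_window_mask(3, 0)[2] == [True, True, True]  # dense causal
```

In a real model this boolean mask (or an equivalent additive `-inf` mask) would be passed to the attention kernel rather than materialized per call.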

Status: Initial skeleton pushed. Seeking Development Grant (~$500) to perform hyperparameter sweeps on sparsity patterns and tie-ratios.

Current baseline: 1.2244 bpb.
Target: < 1.20 bpb.
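For context, bits-per-byte here follows the usual conversion from mean cross-entropy; the exact accounting in train_gpt.py is an assumption, and the sample numbers below are hypothetical.

```python
import math

def bits_per_byte(nats_per_token, bytes_per_token):
    # Standard conversion: cross-entropy in nats -> bits (divide by ln 2),
    # normalized by the number of raw bytes each token covers.
    return nats_per_token / (math.log(2) * bytes_per_token)

# Hypothetical figures: 3.394 nats/token at 4.0 bytes/token
# lands in the same region as the 1.22 baseline above.
assert round(bits_per_byte(3.394, 4.0), 4) == 1.2241
```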

Adds three research levers to train_gpt.py (1177 lines, under 1500 cap):

- Sliding window attention (WINDOW_SIZE env var, default 0 = dense)
  Causal attention restricted to a local window; reduces compute and
  may allow wider/deeper models within the 16MB artifact budget.

- Recursive weight tying (NUM_PHYSICAL_LAYERS env var)
  N physical blocks reused across NUM_LAYERS logical layers. E.g.
  NUM_PHYSICAL_LAYERS=3 with NUM_LAYERS=9 gives 3x effective depth
  at the same parameter count.

- Dev/mini-run support (DEV_MODE, SKIP_QUANT env vars)
  DEV_MODE=1 allows MPS/CPU fallback for local architecture testing.
  SKIP_QUANT=1 skips int8+zlib serialization for fast iteration.
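The tying lever above amounts to a logical-to-physical layer map. A minimal sketch, assuming a round-robin mapping (the mapping, names, and toy blocks are illustrative, not the actual train_gpt.py code):

```python
def tied_layer_schedule(num_logical, num_physical):
    # Map each logical layer to a physical block index, cycling
    # through the physical blocks (simplest possible tie pattern).
    return [i % num_physical for i in range(num_logical)]

# 3 physical blocks reused as 9 logical layers: 3x effective depth
# at the stored-parameter cost of 3 blocks.
schedule = tied_layer_schedule(9, 3)
assert schedule == [0, 1, 2, 0, 1, 2, 0, 1, 2]

def forward(x, blocks, schedule):
    # Apply physical blocks in logical order; parameters are shared
    # whenever the same index repeats.
    for idx in schedule:
        x = blocks[idx](x)
    return x

# Toy stand-ins for transformer blocks:
blocks = [lambda v: v + 1, lambda v: v * 2, lambda v: v - 3]
assert forward(0, blocks, schedule) == -7
```

Only the distinct physical blocks are serialized, which is why the 16MB `.pt` file size is unchanged by the number of logical layers.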

Also adds mini_run.sh convenience script for quick local runs.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@jordankzf

bro we can see claude did it

@vukrosic

@jordankzf the goal is scientific advancement, not who did it

@albertorkive
Author

Lol @jordankzf cause you definitely are hand coding every character 😂 Based on that comment you are either a hypocrite or slow (adopting tools I mean) 😬

@jordankzf

@albertorkive Just messing with you, man. Claude (and Codex) is awesome! Good luck on your submission :)

@albertorkive
Author

@jordankzf likewise, good luck. May the best coding assistant win 😌

@albertorkive force-pushed the research/sparse-recurrent-skeleton branch from 193a53b to 31e777e on March 19, 2026 at 11:00
@0hq closed this Mar 19, 2026
gb250e referenced this pull request in gb250e/parameter-golf Mar 21, 2026
dhruvjatkar referenced this pull request in dhruvjatkar/parameter-golf Mar 25, 2026
PR openai#672 maxes TTT at 30 epochs (590s/600s eval budget), so all future
improvements must be orthogonal to TTT. This update:
- Sets 1.0781 BPB (PR openai#672) as the new target to beat
- Reorders Top 8 directions: XSA-all confirmed at #1, Full GPTQ #2,
  SwiGLU #3, Muon-VS #4, aggressive quant #5, MASA #6,
  depth recurrence #7 with int6 risk warning, AdEMAMix #8
- Deprioritizes TTT-related directions already exploited by PR openai#672
- Collapses ~1000 lines of stale Round 0-3.9 session logs into a
  concise historical summary
- Removes resolved blockers (flash_attn, SSH hangs, local runtime)
- Adds fresh Round 1 section with 5 submitted experiments

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

4 participants