
Non-record: Negative results & insights from 24hrs on 8xH100 #375

Open

charmquark1984 wants to merge 1 commit into openai:main from charmquark1984:submission/negative-results

Conversation

@charmquark1984

Title: Non-record: Negative results & insights from 24hrs on 8xH100

Description:

We spent ~$500 of 8xH100 time systematically testing 13 techniques on top of the PR #315 base. Most of them made things worse. This submission documents what didn't work and why, so other competitors can skip these dead ends.

Negative results (did not improve on PR #315 base):

  • Causal TTT (3 variants): neutral on EMA+XSA base
  • MTP: +0.028 BPB, throughput penalty kills it
  • INT4: 0.06 BPB quant gap wipes out param advantage
  • Canon layers: 48% step overhead not compensated
  • Memory tokens, gradient-guided quant, cautious WD, L1 regularization, label smoothing, 1M batch, full-run QAT

Positive findings:

  • EMA > SWA by 0.003 BPB (3-seed verified; see the sketch after this list)
  • Weight decay directly controls compressed artifact size (~1.5-2MB per 0.01 WD)
  • 786K > 524K batch by 0.004 BPB (total tokens matter more than gradient steps)
  • FA3 Hopper: 15-20% more steps at same wallclock
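
The EMA finding above refers to maintaining an exponential moving average of the model weights during training and evaluating the averaged copy. The PR does not include code, so the following is a minimal PyTorch sketch of the general technique; the decay value and update cadence are illustrative assumptions, not the PR's settings.

```python
import copy
import torch

class WeightEMA:
    """Keeps an exponential moving average (EMA) of a model's parameters."""

    def __init__(self, model: torch.nn.Module, decay: float = 0.999):
        self.decay = decay
        # Shadow copy holds the averaged parameters; it is never trained directly.
        self.shadow = copy.deepcopy(model).eval()
        for p in self.shadow.parameters():
            p.requires_grad_(False)

    @torch.no_grad()
    def update(self, model: torch.nn.Module) -> None:
        # shadow <- decay * shadow + (1 - decay) * current weights
        for s, p in zip(self.shadow.parameters(), model.parameters()):
            s.lerp_(p.detach(), 1.0 - self.decay)

# Usage: call ema.update(model) after every optimizer step, then evaluate
# and export ema.shadow instead of the raw model.
```

SWA, by contrast, keeps a uniform average over periodic checkpoints; the 0.003 BPB gap between the two is the authors' 3-seed measurement, not something this sketch reproduces.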

Meta-insight: At 86ms/step, each 1ms of per-step overhead costs ~0.006 BPB. Most techniques fail this throughput test.
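
To spell out the arithmetic behind that number: the wallclock budget is fixed, so any per-step slowdown directly reduces the number of optimizer steps and therefore the number of training tokens. A back-of-envelope sketch (the 10-minute budget is an assumption taken from the repo's "10min_16mb" track name; the ~0.006 BPB/ms mapping itself is the authors' empirical figure):

```python
# Cost of per-step overhead under a fixed wallclock budget (back-of-envelope).
budget_s = 10 * 60           # assumed 10-minute training budget, in seconds
base_step_ms = 86.0          # baseline step time reported in the PR
overhead_ms = 1.0            # hypothetical extra cost of a candidate technique

base_steps = budget_s * 1000 / base_step_ms                   # ~6977 steps
slow_steps = budget_s * 1000 / (base_step_ms + overhead_ms)   # ~6897 steps
lost_fraction = 1 - slow_steps / base_steps                    # ~1.15% fewer tokens

print(f"{base_steps:.0f} -> {slow_steps:.0f} steps ({lost_fraction:.2%} fewer tokens)")
```

Per the PR's measurements, that ~1.15% loss in training tokens shows up as roughly 0.006 BPB, so a candidate technique has to buy back more than that per millisecond of overhead to be net positive.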

Best verified result: 1.1257 BPB (PR #315 reproduction). Full details, methodology, and 12 training logs in the README.
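
For readers new to the metric: BPB, presumably bits per byte of the raw evaluation text here, converts the model's cross-entropy loss from nats per token into bits and normalizes by the byte count of the split rather than the token count. A minimal conversion sketch; the loss and tokens-per-byte values below are placeholders, not the competition's actual dataset statistics.

```python
import math

def bpb_from_loss(nats_per_token: float, tokens: int, num_bytes: int) -> float:
    """bits/byte = (nats/token converted to bits/token) * (tokens / bytes)."""
    bits_per_token = nats_per_token / math.log(2)
    return bits_per_token * (tokens / num_bytes)

# Placeholder example: 3.3 nats/token over a split with 0.23 tokens per byte.
print(bpb_from_loss(3.3, tokens=230_000, num_bytes=1_000_000))  # ~1.095
```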

13 techniques tested that did NOT work on PR openai#315 base:
- Causal TTT (3 variants): neutral on EMA+XSA base
- MTP: +0.028 BPB, throughput penalty kills it
- INT4: 0.06 BPB quant gap wipes out param advantage
- Canon layers: 48% step overhead not compensated
- Memory tokens, gradient-guided quant, cautious WD,
  L1 regularization, label smoothing, 1M batch, full QAT

4 positive findings:
- EMA > SWA by 0.003 BPB (3-seed verified)
- Weight decay directly controls artifact size
- 786K > 524K batch by 0.004 BPB
- FA3 Hopper: 15-20% more steps at same wallclock

Best verified result: 1.1257 BPB (PR openai#315 reproduction)
Includes 12 training logs for verification.
rarce added a commit to rarce/parameter-golf that referenced this pull request Mar 22, 2026
original_model.md:
- Discard depth recurrence (amplifies quant error 900×, throughput loss)
- New direction: eval-time optimization stack (PPM-C + GPTQ-lite)
- Document all our experiment results (v3, v4, v4_30m, ringgolf)
- Add TTT/XSA interaction findings (PR openai#303: mutually exclusive)
- Add PR openai#375 meta-insight (1ms overhead = 0.006 BPB)
- 4-phase execution plan targeting PPM-C as original contribution

review_pr_records_track_10min_16mb.md:
- Add 2026-03-22 update with PRs openai#374, openai#379, openai#390, openai#375, openai#303, openai#363
- New SOTA at 1.1246 (PR openai#374: Tight SWA + VE128)
- Document negative results from $500 compute spend (PR openai#375)
- Unexplored opportunities: PPM-C, Neural Cache

review_records_track_10min_16mb.md:
- Add timestamp note (17 records, no changes)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
