
Non-record: Negative results & insights from 24hrs on 8xH100 #375

Open

charmquark1984 wants to merge 1 commit into openai:main from charmquark1984:submission/negative-results

Conversation

@charmquark1984

Title: Non-record: Negative results & insights from 24hrs on 8xH100

Description:

We spent ~$500 of 8xH100 time systematically testing 13 techniques on top of the PR #315 base. Most of them made things worse. This submission documents what didn't work and why, so other competitors can skip these dead ends.

Negative results (did not improve on PR #315 base):

  • Causal TTT (3 variants): neutral on EMA+XSA base
  • MTP: +0.028 BPB, throughput penalty kills it
  • INT4: 0.06 BPB quant gap wipes out param advantage
  • Canon layers: 48% step overhead not compensated
  • Memory tokens, gradient-guided quant, cautious WD, L1 regularization, label smoothing, 1M batch, full-run QAT

Positive findings:

  • EMA > SWA by 0.003 BPB (3-seed verified; see the sketch after this list)
  • Weight decay directly controls compressed artifact size (~1.5-2MB per 0.01 WD)
  • 786K > 524K batch by 0.004 BPB (total tokens matter more than gradient steps)
  • FA3 Hopper: 15-20% more steps at same wallclock
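
The EMA finding above refers to maintaining an exponential moving average of the model weights during training and evaluating the averaged copy. The PR does not include code, so the following is a minimal PyTorch sketch of the general technique; the decay value and update cadence are illustrative assumptions, not the PR's settings.

```python
import copy
import torch

class WeightEMA:
    """Keeps an exponential moving average (EMA) of a model's parameters."""

    def __init__(self, model: torch.nn.Module, decay: float = 0.999):
        self.decay = decay
        # Shadow copy holds the averaged parameters; it is never trained directly.
        self.shadow = copy.deepcopy(model).eval()
        for p in self.shadow.parameters():
            p.requires_grad_(False)

    @torch.no_grad()
    def update(self, model: torch.nn.Module) -> None:
        # shadow <- decay * shadow + (1 - decay) * current weights
        for s, p in zip(self.shadow.parameters(), model.parameters()):
            s.lerp_(p.detach(), 1.0 - self.decay)

# Usage: call ema.update(model) after every optimizer step, then evaluate
# and export ema.shadow instead of the raw model.
```

SWA, by contrast, keeps a uniform average over periodic checkpoints; the 0.003 BPB gap between the two is the authors' 3-seed measurement, not something this sketch reproduces.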

Meta-insight: At 86ms/step, each 1ms of per-step overhead costs ~0.006 BPB. Most techniques fail this throughput test.
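
To spell out the arithmetic behind that number: the wallclock budget is fixed, so any per-step slowdown directly reduces the number of optimizer steps and therefore the number of training tokens. A back-of-envelope sketch (the 10-minute budget is an assumption taken from the repo's "10min_16mb" track name; the ~0.006 BPB/ms mapping itself is the authors' empirical figure):

```python
# Cost of per-step overhead under a fixed wallclock budget (back-of-envelope).
budget_s = 10 * 60           # assumed 10-minute training budget, in seconds
base_step_ms = 86.0          # baseline step time reported in the PR
overhead_ms = 1.0            # hypothetical extra cost of a candidate technique

base_steps = budget_s * 1000 / base_step_ms                   # ~6977 steps
slow_steps = budget_s * 1000 / (base_step_ms + overhead_ms)   # ~6897 steps
lost_fraction = 1 - slow_steps / base_steps                    # ~1.15% fewer tokens

print(f"{base_steps:.0f} -> {slow_steps:.0f} steps ({lost_fraction:.2%} fewer tokens)")
```

Per the PR's measurements, that ~1.15% loss in training tokens shows up as roughly 0.006 BPB, so a candidate technique has to buy back more than that per millisecond of overhead to be net positive.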

Best verified result: 1.1257 BPB (PR #315 reproduction). Full details, methodology, and 12 training logs in the README.
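
For readers new to the metric: BPB, presumably bits per byte of the raw evaluation text here, converts the model's cross-entropy loss from nats per token into bits and normalizes by the byte count of the split rather than the token count. A minimal conversion sketch; the loss and tokens-per-byte values below are placeholders, not the competition's actual dataset statistics.

```python
import math

def bpb_from_loss(nats_per_token: float, tokens: int, num_bytes: int) -> float:
    """bits/byte = (nats/token converted to bits/token) * (tokens / bytes)."""
    bits_per_token = nats_per_token / math.log(2)
    return bits_per_token * (tokens / num_bytes)

# Placeholder example: 3.3 nats/token over a split with 0.23 tokens per byte.
print(bpb_from_loss(3.3, tokens=230_000, num_bytes=1_000_000))  # ~1.095
```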

13 techniques tested that did NOT work on PR openai#315 base:
- Causal TTT (3 variants): neutral on EMA+XSA base
- MTP: +0.028 BPB, throughput penalty kills it
- INT4: 0.06 BPB quant gap wipes out param advantage
- Canon layers: 48% step overhead not compensated
- Memory tokens, gradient-guided quant, cautious WD,
  L1 regularization, label smoothing, 1M batch, full QAT

4 positive findings:
- EMA > SWA by 0.003 BPB (3-seed verified)
- Weight decay directly controls artifact size
- 786K > 524K batch by 0.004 BPB
- FA3 Hopper: 15-20% more steps at same wallclock

Best verified result: 1.1257 BPB (PR openai#315 reproduction)
Includes 12 training logs for verification.
rarce added a commit to rarce/parameter-golf that referenced this pull request Mar 22, 2026
original_model.md:
- Discard depth recurrence (amplifies quant error 900×, throughput loss)
- New direction: eval-time optimization stack (PPM-C + GPTQ-lite)
- Document all our experiment results (v3, v4, v4_30m, ringgolf)
- Add TTT/XSA interaction findings (PR openai#303: mutually exclusive)
- Add PR openai#375 meta-insight (1ms overhead = 0.006 BPB)
- 4-phase execution plan targeting PPM-C as original contribution

review_pr_records_track_10min_16mb.md:
- Add 2026-03-22 update with PRs openai#374, openai#379, openai#390, openai#375, openai#303, openai#363
- New SOTA at 1.1246 (PR openai#374: Tight SWA + VE128)
- Document negative results from $500 compute spend (PR openai#375)
- Unexplored opportunities: PPM-C, Neural Cache

review_records_track_10min_16mb.md:
- Add timestamp note (17 records, no changes)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
