[Non-Record] 6h Long-Train Scaling + TTT Sweep: Post-TTT BPB 1.03387 #2008
Open
Christopher-Lee-McClendon wants to merge 14 commits into openai:main from
Conversation
Reproduces PR openai#1934's exact recipe (per-group lrzip compression, EMBED_WD=0.06, tightened clip sigmas) with GPTQ_RESERVE_SECONDS=5.5 to ensure the GPTQ Hessians complete within the 600s training budget.

Results (3-seed mean: 1.06003, std: 0.000385):
- Seed 42: 1.05987 (4962 steps, artifact 15,971,933 B)
- Seed 314: 1.05975 (4952 steps, artifact 15,970,997 B)
- Seed 999: 1.06047 (4954 steps, artifact 15,974,305 B)

Compliance: train_loop + hessians = 598.2s max (< 600s)
Delta vs PR openai#1934: +0.00010 BPB (negligible, within noise)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
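As a hedged illustration of what such a reserve does (the stop helper below is a hypothetical sketch, not this PR's actual stop logic; only the GPTQ_RESERVE_SECONDS variable and the 600s budget come from the commit):

```python
import os
import time

# Hypothetical sketch: stop the train loop early enough that the GPTQ
# Hessian accumulation still finishes inside the hard wallclock budget.
TRAIN_BUDGET_SECONDS = 600.0
GPTQ_RESERVE_SECONDS = float(os.environ.get("GPTQ_RESERVE_SECONDS", "5.5"))

def should_stop_training(train_start: float) -> bool:
    """True once the remaining wallclock equals the reserve kept for GPTQ."""
    elapsed = time.monotonic() - train_start
    return elapsed >= TRAIN_BUDGET_SECONDS - GPTQ_RESERVE_SECONDS
```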
…-TTT BPB 1.0399

Studies artifact size as a function of training duration for the PR openai#1950 (compliance-audited PR openai#1934 reproduction) recipe on 8xH100 SXM.

Key findings:
- Artifact size is constant (±9 KB / 0.06%) across 10-60 min training
- INT6 GPTQ + per-group lrzip is already at entropy floor by 10 min
- BPB improves substantially: 1.06 (10 min) → 1.04 (60 min) post-TTT
- Quantization tax (~0.01) and TTT gain (~0.01) stable across durations
- No justification for larger model under same 16 MB cap

Bug fix included:
- NCCL rank desync during checkpoint export (broadcast sync + barriers)

Non-record: training wallclock 3598s >> 600s budget. No ML changes from PR openai#1950.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
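The commit lists the desync fix but not the code; a minimal sketch of the broadcast-plus-barrier pattern it describes, assuming torch.distributed is already initialized (function name and export payload are illustrative):

```python
import torch
import torch.distributed as dist

def export_checkpoint_synced(model, step: int, path: str):
    """Keep all ranks on the same step around a rank-0 export (sketch)."""
    # Broadcast rank 0's step so no rank exports against a stale counter.
    step_t = torch.tensor([step], dtype=torch.long, device="cuda")
    dist.broadcast(step_t, src=0)
    dist.barrier()  # every rank reaches the export point together
    if dist.get_rank() == 0:
        torch.save({"step": int(step_t.item()), "model": model.state_dict()}, path)
    dist.barrier()  # no rank races ahead before the export completes
```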
Create scripts/run_longtrain_ttt_sweep.py with 7 sweep variants for evaluating different TTT/LoRA configurations on a fixed quantized artifact.

Features:
- 7 defined variants (v0 baseline through v6 exploratory)
- Dry-run mode, on-pod execution, pod command emission
- Per-variant isolation with separate output directories
- Configurable timeout, GPU count, variant filtering
- JSON manifest + CSV + summary aggregation
- Re-aggregation from existing per-variant results

Add tests/test_ttt_sweep.py with 26 tests covering variant definitions, env construction, selection, manifest generation, CSV aggregation, dry-run output, and pod command generation.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
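A rough sketch of the variant/env-construction pattern the tests exercise; the dataclass and helper names are illustrative, and only the variant and env-var names quoted elsewhere in this PR are real:

```python
from dataclasses import dataclass, field

@dataclass
class SweepVariant:
    """One TTT/LoRA configuration evaluated against a fixed quantized artifact (sketch)."""
    name: str
    env: dict = field(default_factory=dict)

# Two of the variants named in this PR; per-variant overrides are illustrative.
VARIANTS = [
    SweepVariant("v0_control_pr1979", {}),  # control settings
    SweepVariant("v7_noqv_rank96", {"TTT_Q_LORA": "0", "TTT_V_LORA": "0"}),
]

def build_env(variant: SweepVariant, artifact_path: str) -> dict:
    """Merge base settings with per-variant overrides."""
    base = {"LOAD_QUANTIZED_MODEL_PATH": artifact_path, "TTT_ENABLED": "1"}
    return {**base, **variant.env}
```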
Phase 1: Add save_resume_checkpoint / load_resume_checkpoint functions with atomic writes, hparam fingerprint compatibility checks, manifest schema, Muon shard_mom persistence, and old checkpoint cleanup.

Phase 2: Add state_dict() / load_state_dict() to DocumentPackingLoader for deterministic data-loader resume (shard index + cursor).

Integration: RESUME_ENABLED=1 RESUME_FROM=<dir> loads checkpoint; RESUME_SAVE_MINUTES=5,10,20 triggers periodic saves during training. No-op when RESUME_ENABLED is unset.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
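A minimal sketch of the loader-resume surface described in Phase 2, assuming the state is just the shard index plus the in-shard cursor (internals are simplified, not the repo's actual class body):

```python
class DocumentPackingLoader:
    """Sketch of the resumable loader state described above."""
    def __init__(self, shards):
        self.shards = shards
        self.shard_index = 0   # which shard is currently being read
        self.cursor = 0        # position inside the current shard

    def state_dict(self) -> dict:
        return {"shard_index": self.shard_index, "cursor": self.cursor}

    def load_state_dict(self, state: dict) -> None:
        # Restoring these two fields is enough to make iteration deterministic
        # across a resume, per the commit message.
        self.shard_index = state["shard_index"]
        self.cursor = state["cursor"]
```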
…ine-readable outputs

Phase 3 (run_longtrain_scaling.py):
- Add --duration-hours with auto-defaults (wallclock, max-minutes, export, resume, iterations)
- Add --iterations, --enable-resume, --resume-save-minutes, --resume-from, --resume-keep-last
- Add --run-ttt-sweep-after-train, --ttt-sweep-variants, --ttt-max-minutes-per-variant
- build_seed_cmd() emits RESUME_*/ITERATIONS env vars when flags set
- TTT sweep script appended to pod command; sweep results copied for HTTP serving
- build_download_list() includes ttt_sweep/ files when sweep enabled
- Bundle includes scripts/run_longtrain_ttt_sweep.py via extra_files
- Dry-run output shows all new settings
- 4-hour default constants (DEFAULT_4H_*)

Phase 5 (train_gpt.py):
- Write JSON summary after TTT eval (TTT_EVAL_OUTPUT_JSON or artifact_dir default)
- LOAD_QUANTIZED_MODEL_PATH env override for eval-only / sweep runs

Tests:
- 23 new tests in test_launcher_longtrain.py covering all new args, command building, and defaults

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
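A hedged sketch of the env-emission behavior described for build_seed_cmd(); the SEED variable and the trailing launch command are placeholders, while the RESUME_*/ITERATIONS names come from the commit:

```python
def build_seed_cmd(seed, iterations=None, resume_from=None, resume_save_minutes=None):
    """Hypothetical sketch: emit env assignments only for the flags that were set."""
    env = {"SEED": str(seed)}  # "SEED" is an assumed name, not from the PR
    if iterations is not None:
        env["ITERATIONS"] = str(iterations)
    if resume_from is not None:
        env["RESUME_ENABLED"] = "1"
        env["RESUME_FROM"] = resume_from
    if resume_save_minutes:
        env["RESUME_SAVE_MINUTES"] = ",".join(str(m) for m in resume_save_minutes)
    assignments = " ".join(f"{k}={v}" for k, v in sorted(env.items()))
    # The real launcher builds a full pod command; this just shows the prefix idea.
    return f"{assignments} python train_gpt.py"

print(build_seed_cmd(42, resume_from="resume_snapshot_step_36452",
                     resume_save_minutes=[330, 360]))
```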
…st-TTT) 4-hour training on PR openai#1950 recipe (seed 42, 4xH100 NVL):
- Pre-quant post-EMA BPB: 1.0355
- Quantized (INT6 GPTQ) BPB: 1.0449
- Artifact: 15,932,638 bytes (67K headroom under 16 MB)
- Artifact shrinks 15 KB from 60 min to 240 min
- BPB improves monotonically: 1.172 -> 1.057 pre-quant over 4h
- Quantization tax stable at 0.0094

Infrastructure:
- Resumable rank-local checkpoints (RESUME_ENABLED=1)
- DocumentPackingLoader state save/restore
- TTT/LoRA eval sweep orchestrator (7 variants)
- Extended launcher with 4h mode, dynamic seed timeout
- 74 tests passing

TTT eval interrupted at phase 1/3 by shell timeout. TTT sweep not run. Full post-TTT BPB estimated ~1.02-1.03.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- 1.0449 quantized does NOT beat 1h post-TTT 1.0399 (lower is better)
- Changed 'beats' to 'approaches' with explicit 0.005 gap noted
- Fixed 240-min table row: use quantized BPB (1.0449) with footnote
- Pre-quant post-EMA 1.0355 does surpass 1h post-TTT, noted clearly
- Fixed submission.json key_findings accordingly

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…etrics

- Fix metric stage labeling (training_val vs quantized vs post_ttt)
- Add TTT sweep results (3 successful, 4 failed variants)
- v0 control (PR openai#1979 params) best at 1.03471 BPB
- Fix --sweep-only-artifact mode for dedicated sweep pods
- Bundle TTT sweep script via --extra-file in launcher
- Add checkpoint_360min.json and sweep CSV/JSON results
- Soften claims per red-team review (optimal → best among tested)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Implement SLIDING_EVAL=1 mode in train_gpt.py for non-TTT quantized eval
- Add v_sliding_window_control variant to TTT sweep (TTT_ENABLED=0)
- Measured quantized_bpb_360min = 1.04273086 (INT6 GPTQ, no TTT)
- TTT gain properly decomposed: 1.04273 → 1.03471 = 0.00802 BPB
- GPTQ quantization itself improves BPB by 0.017 vs live model (1.0599 → 1.0427)
- Update submission.json, README, PR body with full 3-stage decomposition
- Add H6: GPTQ quantization acts as regularization (unexpected finding)
- Include sliding eval results in submission directory

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
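The 3-stage decomposition above is just two subtractions; a quick check with the numbers quoted in this commit (nothing here is a new measurement):

```python
# Values quoted in the commit above.
live_val_bpb  = 1.0599       # last live validation near the 360-min export
quantized_bpb = 1.04273086   # INT6 GPTQ, no TTT (SLIDING_EVAL=1)
post_ttt_bpb  = 1.03471      # v0 control on the quantized artifact

print(f"GPTQ vs live model: {live_val_bpb - quantized_bpb:+.5f}")   # ~+0.01717
print(f"TTT gain:           {quantized_bpb - post_ttt_bpb:+.5f}")   # +0.00802
```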
Add matched 240/300/360 comparator evidence, capture the true 6h pre-quant EMA result, harden RunPod artifact retrieval, and refresh the non-record submission materials for PR openai#2008.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Add TTT_Q_LORA / TTT_V_LORA env vars to train_gpt.py (default=1 for backward compat)
- Guard q_loras/v_loras creation and forward paths with None checks
- Add v7_noqv_rank96, v8_noqv_rank128, v12_rank96_phase1_prefix1000 to sweep launcher
- v7 (no Q/V, K+MLP+O+lm_head only): 1.03387 BPB, 43.6 GiB peak, 641s eval
- v12 (1-phase, 1000 prefix): 1.03421 BPB, 47.7 GiB peak, 663s eval
- Both beat v0 control (1.03471); v7 is new best with less memory
- TTT now recovers ~95% of 6h quantization tax (was ~86% with v0)
- Updated PR body, README, submission.json with red-teamed claims
- Added reproducibility.md guide
- Added AGENTS.md with RunPod operational lessons

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
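For readers who want to see the gating pattern, a self-contained sketch of optional Q/V adapters guarded by env flags and None checks; the module and method names are illustrative, and only the TTT_Q_LORA / TTT_V_LORA flags with their default of 1 come from the commit:

```python
import os
import torch.nn as nn

# Flags from the commit; default "1" preserves the old always-on behavior.
TTT_Q_LORA = os.environ.get("TTT_Q_LORA", "1") == "1"
TTT_V_LORA = os.environ.get("TTT_V_LORA", "1") == "1"

class LowRankAdapter(nn.Module):
    """Simple low-rank residual adapter (illustrative, not the PR's exact LoRA)."""
    def __init__(self, dim: int, rank: int):
        super().__init__()
        self.down = nn.Linear(dim, rank, bias=False)
        self.up = nn.Linear(rank, dim, bias=False)

    def forward(self, x):
        return x + self.up(self.down(x))

class QVAdapters(nn.Module):
    """Adapters are only created when their flag is on; forward paths check for None."""
    def __init__(self, dim: int, rank: int):
        super().__init__()
        self.q_lora = LowRankAdapter(dim, rank) if TTT_Q_LORA else None
        self.v_lora = LowRankAdapter(dim, rank) if TTT_V_LORA else None

    def forward(self, q, v):
        if self.q_lora is not None:
            q = self.q_lora(q)
        if self.v_lora is not None:
            v = self.v_lora(v)
        return q, v
```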
- explain the exact restart chain used for the 360min artifact
- document downloaded 300min snapshot step36452 and fallback 330min snapshot step43062
- note that the 360min export was produced before the later NCCL timeout
- sync reproducibility guide to the actual two-stage artifact path

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- add explicit PR lineage / acknowledgement table for the longtrain + TTT stack
- add live checkpoint trajectory table and post-TTT-by-horizon table
- clarify the exact two-pod artifact path and third-pod prequant follow-up
- promote single-seed and non-record caveats earlier in the body
- improve sweep failure annotations and overall PR formatting

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Replace '~30K final stop/export' with exact step 29,888 (wallclock stop)
- Replace '~1.0600 live' with exact 1.0600
- Fix memory diff: 4.2 GB (matching displayed 47.8-43.6), not 4.1 GiB
- Add missing 300-min row to training scaling summary table
- Clarify that PR openai#1979 60-min (8xH100) differs from standalone 4h 60-min checkpoint
- Add hardware annotations (8xH100 SXM / 4xH100 NVL) to source column

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
[Non-Record] 6h Long-Train Scaling + TTT Hyperparameter Sweep
Summary
Formal non-record submission studying BPB as a function of training duration (10 min -> 6h) and systematically sweeping TTT/LoRA hyperparameters on the final 6h quantized artifact.
At a glance
- Post-TTT BPB 1.03387 with v7_noqv_rank96 on the final 360-min artifact (single seed)
- Best TTT variant: v7_noqv_rank96; recovery fraction = 0.00885746 / 0.00932885 = 94.94%
- Final quantized artifact: final_model.int6.360min.ptz

Key findings
- Post-TTT BPB reaches 1.03387 at 6h (best variant v7_noqv_rank96). This is a descriptive endpoint comparison across durations and seeds, not a controlled scaling estimate.
- The matched 6h chain compares pre-quant EMA, INT6 GPTQ quantized, and post-TTT evals (best variant v7), so GPTQ adds +0.00932885 BPB at 6h and best TTT recovers 0.00885746 BPB of that tax.
- The Q/V-ablated recipe (v7: K+MLP+O+lm_head only) beats both the original full-target control (v0) and the lighter single-phase variant (v12).
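The recovery fraction above is just the ratio of the best TTT gain to the 6h quantization tax; a one-line check with the quoted values:

```python
# Recovery-fraction arithmetic from the key findings (single seed, 6h artifact).
quant_tax   = 0.00932885  # pre-quant EMA -> INT6 GPTQ BPB gap at 6h
ttt_recover = 0.00885746  # best TTT variant (v7_noqv_rank96) vs quantized
print(f"recovered fraction: {ttt_recover / quant_tax:.3f}")  # ~0.949
```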
Acknowledged PR lineage for this stack

These are the PRs most directly responsible for the training recipe, optimizer substrate, continuation semantics, and TTT control/sweep used here.
(PR lineage table; includes the source of the v0_control_pr1979 TTT settings.)

Training scaling results
All durations use the same PR #1950 / PR #1934 recipe. To avoid mixing live-training metrics with matched eval-only comparators, the live checkpoint trajectory and the post-TTT horizon table are separated below.
*Last logged live validation BPB near the 360-min export; the matched 360-min EMA / quantized / post-TTT comparator chain is reported later. The 60-min row is a separate 8xH100 run (PR #1979), not the same pod as the 240/300/360 chain.
Live training trajectory around saved/exported checkpoints
This table reports the last logged live training metrics near each saved/exported checkpoint, not matched EMA/quantized/post-TTT evals. The 60/120/180/240 rows come from the standalone 4h run (4xH100 NVL); the 300/360 rows come from the resume chain that produced the final 6h artifact. Note: the PR #1979 60-min artifact (step 16,001, 8xH100 SXM) in the summary table above is a different run from the 60-min checkpoint here (step 10,488, standalone 4h).
How the 6h artifact and later follow-ups were actually produced
The final 360-minute artifact itself was produced across two RunPod sessions; a third pod was used later only for the matched pre-quant follow-up recovery.
- Pod y3ulfm7pb5kqyt: results/8h_longtrain_final/resume_snapshot_step_36452/ containing resume_manifest.json + resume_rank{0..3}_step36452.pt; manifest reports step=36452, training_time_ms=18000630.06, world_size=4, exported_minutes=[60,120,180,240,300]
- Pod mu4c253h9yoiy3: results/resumed_6h_horizon_continuation_step36452/final_model.int6.360min.ptz and checkpoint_360min.json (train_steps=49765, train_wallclock_seconds=21600.15, artifact_bytes=15926271); log also shows resume saves at 330 min (step=43125) and 360 min (step=49765)
- Pod h2fkfy6usuw72n: results/prequant_360min_from_step36452/resume_snapshot_step_43062/ with manifest + all 4 rank files; manifest reports step=43062, training_time_ms=19800085.99, world_size=4
- results/prequant_360min_from_step36452/prequant_eval_summary.live.json

What was done, exactly:
- The first pod's 300-min resume snapshot was downloaded from results/8h_longtrain_final/resume_snapshot_step_36452/.
- The second pod resumed from it and logged RESUME: restored step=36452, training_time=18000.6s, exported_minutes=[60, 120, 180, 240, 300].
- The original launch recorded training_wallclock=21600 in results/8h_longtrain_final/launcher_state.json. The resumed pod used a longer hard stop than 6h, but explicitly kept SCHEDULE_HORIZON_SECONDS=21600, so LR warmdown and schedule-dependent behavior still followed the original 6-hour horizon. This is a faithful continuation of the 6h schedule, not a fresh longer-horizon rerun.
- The 360-min artifact was exported to results/resumed_6h_horizon_continuation_step36452/final_model.int6.360min.ptz.
- The continuation's 330-min resume save (step 43125) and the later pre-quant follow-up snapshot (step 43062) have different step counts because those are different resumed pods launched from the same 300-minute seed snapshot for different purposes.
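To make the horizon decoupling concrete, here is a minimal sketch of the idea, assuming a linear warmdown; the HARD_STOP_SECONDS name and the warmdown fraction are illustrative, and only SCHEDULE_HORIZON_SECONDS=21600 comes from the run above:

```python
import os

# Sketch: the resumed pod may run under a longer hard wallclock stop, but the
# LR schedule still uses the original 6h horizon (21600 s).
SCHEDULE_HORIZON_SECONDS = float(os.environ.get("SCHEDULE_HORIZON_SECONDS", "21600"))
HARD_STOP_SECONDS = float(os.environ.get("HARD_STOP_SECONDS", "25200"))  # hypothetical name

def lr_scale(elapsed_seconds: float, warmdown_frac: float = 0.4) -> float:
    """Linear warmdown over the last warmdown_frac of the *schedule* horizon (assumed shape)."""
    frac = min(elapsed_seconds / SCHEDULE_HORIZON_SECONDS, 1.0)
    if frac < 1.0 - warmdown_frac:
        return 1.0
    return max(0.0, (1.0 - frac) / warmdown_frac)

def should_hard_stop(elapsed_seconds: float) -> bool:
    """The stop condition uses the (possibly longer) hard stop, not the schedule horizon."""
    return elapsed_seconds >= HARD_STOP_SECONDS
```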
Post-TTT BPB over time

This table is the easiest way to see how the post-TTT endpoint moves with training duration. Only 240/300/360 have matched artifact/checkpoint controls in this session; 120 and 180 were not separately evaluated with TTT.
(Post-TTT-by-horizon table; the variants used per row were v0_control_pr1979, v0_control_pr1979, v12_rank96_phase1_prefix1000, and v7_noqv_rank96.)

TTT/LoRA sweep on the 360-min quantized artifact
Variants evaluated (per-variant metrics are in the sweep CSV/JSON results):
- sliding_window_control
- v0_control_pr1979
- v12_rank96_phase1_prefix1000
- v7_noqv_rank96
- v1_rank128_alpha192
- v2_rank128_lr3e4
- v3_local_batch_chunk
- v4_global2_largechunk
- v5_prefix3000
- v6_prefix3000_phase4_optional

Interpretation:
- v7 improves on the control while using 4.2 GB less peak memory (43.6 vs 47.8 GB) than the full-target v0 recipe.
- v12 is interesting because it nearly matches the original 3-phase control while using much less global-TTT compute.

Matched decomposition and comparator chain
(Comparator chain table: matched 360-min pre-quant EMA, INT6 GPTQ quantized, post-TTT control (v0_control_pr1979), and best post-TTT variant (v7_noqv_rank96).)

Additional matched controls:
- The original sweep control (v0) reaches 1.03471322, while the later Q/V-ablation follow-up (v7) improves further to 1.03387340.

Scientific hypotheses tested
(Hypotheses table; includes H6, GPTQ quantization acting as regularization, and the Q/V-ablation hypothesis behind v7_noqv_rank96.)

Infrastructure additions used by this PR
- SCHEDULE_HORIZON_SECONDS to decouple the stop horizon from the LR / schedule horizon during continuation
- sweep-only-artifact mode for standalone TTT evaluation on an existing quantized artifact

Compliance
Hardware and cost
Estimated total cost across the long-train stack and follow-ups is on the order of $160.