[Non-Record] 6h Long-Train Scaling + TTT Sweep: Post-TTT BPB 1.03387 #2008
Open
Christopher-Lee-McClendon wants to merge 14 commits into openai:main from
Conversation
Reproduces PR openai#1934's exact recipe (per-group lrzip compression, EMBED_WD=0.06, tightened clip sigmas) with GPTQ_RESERVE_SECONDS=5.5 to ensure the GPTQ Hessians complete within the 600s training budget.

Results (3-seed mean: 1.06003, std: 0.000385):
- Seed 42: 1.05987 (4962 steps, artifact 15,971,933 B)
- Seed 314: 1.05975 (4952 steps, artifact 15,970,997 B)
- Seed 999: 1.06047 (4954 steps, artifact 15,974,305 B)

Compliance: train_loop + hessians = 598.2s max (< 600s)
Delta vs PR openai#1934: +0.00010 BPB (negligible, within noise)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
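As a hedged illustration of what such a reserve does (the stop helper below is a hypothetical sketch, not this PR's actual stop logic; only the GPTQ_RESERVE_SECONDS variable and the 600s budget come from the commit):

```python
import os
import time

# Hypothetical sketch: stop the train loop early enough that the GPTQ
# Hessian accumulation still finishes inside the hard wallclock budget.
TRAIN_BUDGET_SECONDS = 600.0
GPTQ_RESERVE_SECONDS = float(os.environ.get("GPTQ_RESERVE_SECONDS", "5.5"))

def should_stop_training(train_start: float) -> bool:
    """True once the remaining wallclock equals the reserve kept for GPTQ."""
    elapsed = time.monotonic() - train_start
    return elapsed >= TRAIN_BUDGET_SECONDS - GPTQ_RESERVE_SECONDS
```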
…-TTT BPB 1.0399

Studies artifact size as a function of training duration for the PR openai#1950 (compliance-audited PR openai#1934 reproduction) recipe on 8xH100 SXM.

Key findings:
- Artifact size is constant (±9 KB / 0.06%) across 10-60 min training
- INT6 GPTQ + per-group lrzip is already at entropy floor by 10 min
- BPB improves substantially: 1.06 (10 min) → 1.04 (60 min) post-TTT
- Quantization tax (~0.01) and TTT gain (~0.01) stable across durations
- No justification for larger model under same 16 MB cap

Bug fix included:
- NCCL rank desync during checkpoint export (broadcast sync + barriers)

Non-record: training wallclock 3598s >> 600s budget. No ML changes from PR openai#1950.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
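The commit lists the desync fix but not the code; a minimal sketch of the broadcast-plus-barrier pattern it describes, assuming torch.distributed is already initialized (function name and export payload are illustrative):

```python
import torch
import torch.distributed as dist

def export_checkpoint_synced(model, step: int, path: str):
    """Keep all ranks on the same step around a rank-0 export (sketch)."""
    # Broadcast rank 0's step so no rank exports against a stale counter.
    step_t = torch.tensor([step], dtype=torch.long, device="cuda")
    dist.broadcast(step_t, src=0)
    dist.barrier()  # every rank reaches the export point together
    if dist.get_rank() == 0:
        torch.save({"step": int(step_t.item()), "model": model.state_dict()}, path)
    dist.barrier()  # no rank races ahead before the export completes
```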
Create scripts/run_longtrain_ttt_sweep.py with 7 sweep variants for evaluating different TTT/LoRA configurations on a fixed quantized artifact.

Features:
- 7 defined variants (v0 baseline through v6 exploratory)
- Dry-run mode, on-pod execution, pod command emission
- Per-variant isolation with separate output directories
- Configurable timeout, GPU count, variant filtering
- JSON manifest + CSV + summary aggregation
- Re-aggregation from existing per-variant results

Add tests/test_ttt_sweep.py with 26 tests covering variant definitions, env construction, selection, manifest generation, CSV aggregation, dry-run output, and pod command generation.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
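A rough sketch of the variant/env-construction pattern the tests exercise; the dataclass and helper names are illustrative, and only the variant and env-var names quoted elsewhere in this PR are real:

```python
from dataclasses import dataclass, field

@dataclass
class SweepVariant:
    """One TTT/LoRA configuration evaluated against a fixed quantized artifact (sketch)."""
    name: str
    env: dict = field(default_factory=dict)

# Two of the variants named in this PR; per-variant overrides are illustrative.
VARIANTS = [
    SweepVariant("v0_control_pr1979", {}),  # control settings
    SweepVariant("v7_noqv_rank96", {"TTT_Q_LORA": "0", "TTT_V_LORA": "0"}),
]

def build_env(variant: SweepVariant, artifact_path: str) -> dict:
    """Merge base settings with per-variant overrides."""
    base = {"LOAD_QUANTIZED_MODEL_PATH": artifact_path, "TTT_ENABLED": "1"}
    return {**base, **variant.env}
```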
Phase 1: Add save_resume_checkpoint / load_resume_checkpoint functions with atomic writes, hparam fingerprint compatibility checks, manifest schema, Muon shard_mom persistence, and old checkpoint cleanup.

Phase 2: Add state_dict() / load_state_dict() to DocumentPackingLoader for deterministic data-loader resume (shard index + cursor).

Integration: RESUME_ENABLED=1 RESUME_FROM=<dir> loads checkpoint; RESUME_SAVE_MINUTES=5,10,20 triggers periodic saves during training. No-op when RESUME_ENABLED is unset.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
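A minimal sketch of the loader-resume surface described in Phase 2, assuming the state is just the shard index plus the in-shard cursor (internals are simplified, not the repo's actual class body):

```python
class DocumentPackingLoader:
    """Sketch of the resumable loader state described above."""
    def __init__(self, shards):
        self.shards = shards
        self.shard_index = 0   # which shard is currently being read
        self.cursor = 0        # position inside the current shard

    def state_dict(self) -> dict:
        return {"shard_index": self.shard_index, "cursor": self.cursor}

    def load_state_dict(self, state: dict) -> None:
        # Restoring these two fields is enough to make iteration deterministic
        # across a resume, per the commit message.
        self.shard_index = state["shard_index"]
        self.cursor = state["cursor"]
```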
…ine-readable outputs

Phase 3 (run_longtrain_scaling.py):
- Add --duration-hours with auto-defaults (wallclock, max-minutes, export, resume, iterations)
- Add --iterations, --enable-resume, --resume-save-minutes, --resume-from, --resume-keep-last
- Add --run-ttt-sweep-after-train, --ttt-sweep-variants, --ttt-max-minutes-per-variant
- build_seed_cmd() emits RESUME_*/ITERATIONS env vars when flags set
- TTT sweep script appended to pod command; sweep results copied for HTTP serving
- build_download_list() includes ttt_sweep/ files when sweep enabled
- Bundle includes scripts/run_longtrain_ttt_sweep.py via extra_files
- Dry-run output shows all new settings
- 4-hour default constants (DEFAULT_4H_*)

Phase 5 (train_gpt.py):
- Write JSON summary after TTT eval (TTT_EVAL_OUTPUT_JSON or artifact_dir default)
- LOAD_QUANTIZED_MODEL_PATH env override for eval-only / sweep runs

Tests:
- 23 new tests in test_launcher_longtrain.py covering all new args, command building, and defaults

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
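A hedged sketch of the env-emission behavior described for build_seed_cmd(); the SEED variable and the trailing launch command are placeholders, while the RESUME_*/ITERATIONS names come from the commit:

```python
def build_seed_cmd(seed, iterations=None, resume_from=None, resume_save_minutes=None):
    """Hypothetical sketch: emit env assignments only for the flags that were set."""
    env = {"SEED": str(seed)}  # "SEED" is an assumed name, not from the PR
    if iterations is not None:
        env["ITERATIONS"] = str(iterations)
    if resume_from is not None:
        env["RESUME_ENABLED"] = "1"
        env["RESUME_FROM"] = resume_from
    if resume_save_minutes:
        env["RESUME_SAVE_MINUTES"] = ",".join(str(m) for m in resume_save_minutes)
    assignments = " ".join(f"{k}={v}" for k, v in sorted(env.items()))
    # The real launcher builds a full pod command; this just shows the prefix idea.
    return f"{assignments} python train_gpt.py"

print(build_seed_cmd(42, resume_from="resume_snapshot_step_36452",
                     resume_save_minutes=[330, 360]))
```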
…st-TTT) 4-hour training on PR openai#1950 recipe (seed 42, 4xH100 NVL):
- Pre-quant post-EMA BPB: 1.0355
- Quantized (INT6 GPTQ) BPB: 1.0449
- Artifact: 15,932,638 bytes (67K headroom under 16 MB)
- Artifact shrinks 15 KB from 60 min to 240 min
- BPB improves monotonically: 1.172 -> 1.057 pre-quant over 4h
- Quantization tax stable at 0.0094

Infrastructure:
- Resumable rank-local checkpoints (RESUME_ENABLED=1)
- DocumentPackingLoader state save/restore
- TTT/LoRA eval sweep orchestrator (7 variants)
- Extended launcher with 4h mode, dynamic seed timeout
- 74 tests passing

TTT eval interrupted at phase 1/3 by shell timeout. TTT sweep not run. Full post-TTT BPB estimated ~1.02-1.03.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- 1.0449 quantized does NOT beat 1h post-TTT 1.0399 (lower is better)
- Changed 'beats' to 'approaches' with explicit 0.005 gap noted
- Fixed 240-min table row: use quantized BPB (1.0449) with footnote
- Pre-quant post-EMA 1.0355 does surpass 1h post-TTT, noted clearly
- Fixed submission.json key_findings accordingly

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…etrics

- Fix metric stage labeling (training_val vs quantized vs post_ttt)
- Add TTT sweep results (3 successful, 4 failed variants)
- v0 control (PR openai#1979 params) best at 1.03471 BPB
- Fix --sweep-only-artifact mode for dedicated sweep pods
- Bundle TTT sweep script via --extra-file in launcher
- Add checkpoint_360min.json and sweep CSV/JSON results
- Soften claims per red-team review (optimal → best among tested)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Implement SLIDING_EVAL=1 mode in train_gpt.py for non-TTT quantized eval
- Add v_sliding_window_control variant to TTT sweep (TTT_ENABLED=0)
- Measured quantized_bpb_360min = 1.04273086 (INT6 GPTQ, no TTT)
- TTT gain properly decomposed: 1.04273 → 1.03471 = 0.00802 BPB
- GPTQ quantization itself improves BPB by 0.017 vs live model (1.0599 → 1.0427)
- Update submission.json, README, PR body with full 3-stage decomposition
- Add H6: GPTQ quantization acts as regularization (unexpected finding)
- Include sliding eval results in submission directory

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
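The 3-stage decomposition above is just two subtractions; a quick check with the numbers quoted in this commit (nothing here is a new measurement):

```python
# Values quoted in the commit above.
live_val_bpb  = 1.0599       # last live validation near the 360-min export
quantized_bpb = 1.04273086   # INT6 GPTQ, no TTT (SLIDING_EVAL=1)
post_ttt_bpb  = 1.03471      # v0 control on the quantized artifact

print(f"GPTQ vs live model: {live_val_bpb - quantized_bpb:+.5f}")   # ~+0.01717
print(f"TTT gain:           {quantized_bpb - post_ttt_bpb:+.5f}")   # +0.00802
```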
Add matched 240/300/360 comparator evidence, capture the true 6h pre-quant EMA result, harden RunPod artifact retrieval, and refresh the non-record submission materials for PR openai#2008.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Add TTT_Q_LORA / TTT_V_LORA env vars to train_gpt.py (default=1 for backward compat)
- Guard q_loras/v_loras creation and forward paths with None checks
- Add v7_noqv_rank96, v8_noqv_rank128, v12_rank96_phase1_prefix1000 to sweep launcher
- v7 (no Q/V, K+MLP+O+lm_head only): 1.03387 BPB, 43.6 GiB peak, 641s eval
- v12 (1-phase, 1000 prefix): 1.03421 BPB, 47.7 GiB peak, 663s eval
- Both beat v0 control (1.03471); v7 is new best with less memory
- TTT now recovers ~95% of 6h quantization tax (was ~86% with v0)
- Updated PR body, README, submission.json with red-teamed claims
- Added reproducibility.md guide
- Added AGENTS.md with RunPod operational lessons

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
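For readers who want to see the gating pattern, a self-contained sketch of optional Q/V adapters guarded by env flags and None checks; the module and method names are illustrative, and only the TTT_Q_LORA / TTT_V_LORA flags with their default of 1 come from the commit:

```python
import os
import torch.nn as nn

# Flags from the commit; default "1" preserves the old always-on behavior.
TTT_Q_LORA = os.environ.get("TTT_Q_LORA", "1") == "1"
TTT_V_LORA = os.environ.get("TTT_V_LORA", "1") == "1"

class LowRankAdapter(nn.Module):
    """Simple low-rank residual adapter (illustrative, not the PR's exact LoRA)."""
    def __init__(self, dim: int, rank: int):
        super().__init__()
        self.down = nn.Linear(dim, rank, bias=False)
        self.up = nn.Linear(rank, dim, bias=False)

    def forward(self, x):
        return x + self.up(self.down(x))

class QVAdapters(nn.Module):
    """Adapters are only created when their flag is on; forward paths check for None."""
    def __init__(self, dim: int, rank: int):
        super().__init__()
        self.q_lora = LowRankAdapter(dim, rank) if TTT_Q_LORA else None
        self.v_lora = LowRankAdapter(dim, rank) if TTT_V_LORA else None

    def forward(self, q, v):
        if self.q_lora is not None:
            q = self.q_lora(q)
        if self.v_lora is not None:
            v = self.v_lora(v)
        return q, v
```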
- explain the exact restart chain used for the 360min artifact
- document downloaded 300min snapshot step36452 and fallback 330min snapshot step43062
- note that the 360min export was produced before the later NCCL timeout
- sync reproducibility guide to the actual two-stage artifact path

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- add explicit PR lineage / acknowledgement table for the longtrain + TTT stack
- add live checkpoint trajectory table and post-TTT-by-horizon table
- clarify the exact two-pod artifact path and third-pod prequant follow-up
- promote single-seed and non-record caveats earlier in the body
- improve sweep failure annotations and overall PR formatting

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Replace '~30K final stop/export' with exact step 29,888 (wallclock stop)
- Replace '~1.0600 live' with exact 1.0600
- Fix memory diff: 4.2 GB (matching displayed 47.8-43.6), not 4.1 GiB
- Add missing 300-min row to training scaling summary table
- Clarify that PR openai#1979 60-min (8xH100) differs from standalone 4h 60-min checkpoint
- Add hardware annotations (8xH100 SXM / 4xH100 NVL) to source column

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
[Non-Record] 6h Long-Train Scaling + TTT Hyperparameter Sweep
Summary
Formal non-record submission studying BPB as a function of training duration (10 min -> 6h) and systematically sweeping TTT/LoRA hyperparameters on the final 6h quantized artifact.
At a glance
- Post-TTT BPB 1.03387 with v7_noqv_rank96 on the final 360-min artifact (single seed)
- Best TTT variant: v7_noqv_rank96; recovery fraction = 0.00885746 / 0.00932885 = 94.94%
- Final quantized artifact: final_model.int6.360min.ptz

Key findings
- Post-TTT BPB reaches 1.03387 at 6h (best variant v7_noqv_rank96). This is a descriptive endpoint comparison across durations and seeds, not a controlled scaling estimate.
- The matched 6h chain compares pre-quant EMA, INT6 GPTQ quantized, and post-TTT evals (best variant v7), so GPTQ adds +0.00932885 BPB at 6h and best TTT recovers 0.00885746 BPB of that tax.
- The Q/V-ablated recipe (v7: K+MLP+O+lm_head only) beats both the original full-target control (v0) and the lighter single-phase variant (v12).
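The recovery fraction above is just the ratio of the best TTT gain to the 6h quantization tax; a one-line check with the quoted values:

```python
# Recovery-fraction arithmetic from the key findings (single seed, 6h artifact).
quant_tax   = 0.00932885  # pre-quant EMA -> INT6 GPTQ BPB gap at 6h
ttt_recover = 0.00885746  # best TTT variant (v7_noqv_rank96) vs quantized
print(f"recovered fraction: {ttt_recover / quant_tax:.3f}")  # ~0.949
```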
Acknowledged PR lineage for this stack

These are the PRs most directly responsible for the training recipe, optimizer substrate, continuation semantics, and TTT control/sweep used here.
(PR lineage table; includes the source of the v0_control_pr1979 TTT settings.)

Training scaling results
All durations use the same PR #1950 / PR #1934 recipe. To avoid mixing live-training metrics with matched eval-only comparators, the live checkpoint trajectory and the post-TTT horizon table are separated below.
*Last logged live validation BPB near the 360-min export; the matched 360-min EMA / quantized / post-TTT comparator chain is reported later. The 60-min row is a separate 8xH100 run (PR #1979), not the same pod as the 240/300/360 chain.
Live training trajectory around saved/exported checkpoints
This table reports the last logged live training metrics near each saved/exported checkpoint, not matched EMA/quantized/post-TTT evals. The 60/120/180/240 rows come from the standalone 4h run (4xH100 NVL); the 300/360 rows come from the resume chain that produced the final 6h artifact. Note: the PR #1979 60-min artifact (step 16,001, 8xH100 SXM) in the summary table above is a different run from the 60-min checkpoint here (step 10,488, standalone 4h).
How the 6h artifact and later follow-ups were actually produced
The final 360-minute artifact itself was produced across two RunPod sessions; a third pod was used later only for the matched pre-quant follow-up recovery.
- Pod y3ulfm7pb5kqyt: results/8h_longtrain_final/resume_snapshot_step_36452/ containing resume_manifest.json + resume_rank{0..3}_step36452.pt; manifest reports step=36452, training_time_ms=18000630.06, world_size=4, exported_minutes=[60,120,180,240,300]
- Pod mu4c253h9yoiy3: results/resumed_6h_horizon_continuation_step36452/final_model.int6.360min.ptz and checkpoint_360min.json (train_steps=49765, train_wallclock_seconds=21600.15, artifact_bytes=15926271); log also shows resume saves at 330 min (step=43125) and 360 min (step=49765)
- Pod h2fkfy6usuw72n: results/prequant_360min_from_step36452/resume_snapshot_step_43062/ with manifest + all 4 rank files; manifest reports step=43062, training_time_ms=19800085.99, world_size=4
- results/prequant_360min_from_step36452/prequant_eval_summary.live.json

What was done, exactly:
- The first pod's 300-min resume snapshot was downloaded from results/8h_longtrain_final/resume_snapshot_step_36452/.
- The second pod resumed from it and logged RESUME: restored step=36452, training_time=18000.6s, exported_minutes=[60, 120, 180, 240, 300].
- The original launch recorded training_wallclock=21600 in results/8h_longtrain_final/launcher_state.json. The resumed pod used a longer hard stop than 6h, but explicitly kept SCHEDULE_HORIZON_SECONDS=21600, so LR warmdown and schedule-dependent behavior still followed the original 6-hour horizon. This is a faithful continuation of the 6h schedule, not a fresh longer-horizon rerun.
- The 360-min artifact was exported to results/resumed_6h_horizon_continuation_step36452/final_model.int6.360min.ptz.
- The continuation's 330-min resume save (step 43125) and the later pre-quant follow-up snapshot (step 43062) have different step counts because those are different resumed pods launched from the same 300-minute seed snapshot for different purposes.
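To make the horizon decoupling concrete, here is a minimal sketch of the idea, assuming a linear warmdown; the HARD_STOP_SECONDS name and the warmdown fraction are illustrative, and only SCHEDULE_HORIZON_SECONDS=21600 comes from the run above:

```python
import os

# Sketch: the resumed pod may run under a longer hard wallclock stop, but the
# LR schedule still uses the original 6h horizon (21600 s).
SCHEDULE_HORIZON_SECONDS = float(os.environ.get("SCHEDULE_HORIZON_SECONDS", "21600"))
HARD_STOP_SECONDS = float(os.environ.get("HARD_STOP_SECONDS", "25200"))  # hypothetical name

def lr_scale(elapsed_seconds: float, warmdown_frac: float = 0.4) -> float:
    """Linear warmdown over the last warmdown_frac of the *schedule* horizon (assumed shape)."""
    frac = min(elapsed_seconds / SCHEDULE_HORIZON_SECONDS, 1.0)
    if frac < 1.0 - warmdown_frac:
        return 1.0
    return max(0.0, (1.0 - frac) / warmdown_frac)

def should_hard_stop(elapsed_seconds: float) -> bool:
    """The stop condition uses the (possibly longer) hard stop, not the schedule horizon."""
    return elapsed_seconds >= HARD_STOP_SECONDS
```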
Post-TTT BPB over time

This table is the easiest way to see how the post-TTT endpoint moves with training duration. Only 240/300/360 have matched artifact/checkpoint controls in this session; 120 and 180 were not separately evaluated with TTT.
(Post-TTT-by-horizon table; the variants used per row were v0_control_pr1979, v0_control_pr1979, v12_rank96_phase1_prefix1000, and v7_noqv_rank96.)

TTT/LoRA sweep on the 360-min quantized artifact
Variants evaluated (per-variant metrics are in the sweep CSV/JSON results):
- sliding_window_control
- v0_control_pr1979
- v12_rank96_phase1_prefix1000
- v7_noqv_rank96
- v1_rank128_alpha192
- v2_rank128_lr3e4
- v3_local_batch_chunk
- v4_global2_largechunk
- v5_prefix3000
- v6_prefix3000_phase4_optional

Interpretation:
- v7 improves on the control while using 4.2 GB less peak memory (43.6 vs 47.8 GB) than the full-target v0 recipe.
- v12 is interesting because it nearly matches the original 3-phase control while using much less global-TTT compute.

Matched decomposition and comparator chain
(Comparator chain table: matched 360-min pre-quant EMA, INT6 GPTQ quantized, post-TTT control (v0_control_pr1979), and best post-TTT variant (v7_noqv_rank96).)

Additional matched controls:
- The original sweep control (v0) reaches 1.03471322, while the later Q/V-ablation follow-up (v7) improves further to 1.03387340.

Scientific hypotheses tested
(Hypotheses table; includes H6, GPTQ quantization acting as regularization, and the Q/V-ablation hypothesis behind v7_noqv_rank96.)

Infrastructure additions used by this PR
- SCHEDULE_HORIZON_SECONDS to decouple the stop horizon from the LR / schedule horizon during continuation
- sweep-only-artifact mode for standalone TTT evaluation on an existing quantized artifact

Compliance
Hardware and cost
Estimated total cost across the long-train stack and follow-ups is on the order of $160.