fix(data): normalize cached FineWeb paths#7

Closed
RolanH wants to merge 1 commit into openai:main from RolanH:codex/fix-cached-fineweb-paths

Conversation


@RolanH RolanH commented Mar 18, 2026

Summary

  • normalize MATCHED_FINEWEB_REMOTE_ROOT_PREFIX handling for manifest, docs, dataset, and tokenizer downloads in data/cached_challenge_fineweb.py
  • strip full multi-segment remote prefixes before mapping files into local data/datasets and data/tokenizers
  • add regression coverage for nested remote prefixes and empty-prefix manifest resolution
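The prefix handling described above can be sketched as follows. This is a minimal illustration, not the actual diff: the constant name comes from the summary, but the helper name `to_local_path`, the example prefix value, and the local-root argument are assumptions for illustration only.

```python
from pathlib import PurePosixPath

# Hypothetical value; the real constant lives in data/cached_challenge_fineweb.py.
MATCHED_FINEWEB_REMOTE_ROOT_PREFIX = "fineweb/cached/v1"


def to_local_path(remote_key: str, local_root: str) -> str:
    """Map a remote object key to a local path, stripping the full
    multi-segment remote prefix first (a no-op when the prefix is empty)."""
    prefix = MATCHED_FINEWEB_REMOTE_ROOT_PREFIX.strip("/")
    key = remote_key.strip("/")
    # Strip the whole prefix only when it matches on a segment boundary,
    # so "fineweb/cached/v1x/..." is left untouched.
    if prefix and key.startswith(prefix + "/"):
        key = key[len(prefix) + 1:]
    return str(PurePosixPath(local_root) / key)
```

For example, `to_local_path("fineweb/cached/v1/shard_000.bin", "data/datasets")` yields `data/datasets/shard_000.bin`, while an empty prefix leaves the key unchanged — the two cases the regression tests are said to cover.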

Test Plan

  • python3 -m unittest discover -s tests -v
  • python3 -m py_compile $(rg --files -g '*.py')

Handle MATCHED_FINEWEB_REMOTE_ROOT_PREFIX consistently for
manifest, docs, datasets, and tokenizer artifacts.

Add regression tests for nested prefixes and empty-prefix
manifest resolution.
@RolanH RolanH closed this Mar 18, 2026
@RolanH RolanH deleted the codex/fix-cached-fineweb-paths branch March 18, 2026 18:56
gb250e referenced this pull request in gb250e/parameter-golf Mar 21, 2026
dhruvjatkar pushed a commit to dhruvjatkar/parameter-golf that referenced this pull request Mar 25, 2026
PR openai#672 maxes TTT at 30 epochs (590s/600s eval budget), so all future
improvements must be orthogonal to TTT. This update:
- Sets 1.0781 BPB (PR openai#672) as the new target to beat
- Reorders Top 8 directions: XSA-all confirmed at #1, Full GPTQ #2,
  SwiGLU #3, Muon-VS #4, aggressive quant #5, MASA #6,
  depth recurrence #7 with int6 risk warning, AdEMAMix #8
- Deprioritizes TTT-related directions already exploited by PR openai#672
- Collapses ~1000 lines of stale Round 0-3.9 session logs into a
  concise historical summary
- Removes resolved blockers (flash_attn, SSH hangs, local runtime)
- Adds fresh Round 1 section with 5 submitted experiments

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>