fix(data): normalize cached FineWeb paths#7

Closed
RolanH wants to merge 1 commit into openai:main from RolanH:codex/fix-cached-fineweb-paths

Conversation


@RolanH RolanH commented Mar 18, 2026

Summary

  • normalize MATCHED_FINEWEB_REMOTE_ROOT_PREFIX handling for manifest, docs, dataset, and tokenizer downloads in data/cached_challenge_fineweb.py
  • strip full multi-segment remote prefixes before mapping files into local data/datasets and data/tokenizers
  • add regression coverage for nested remote prefixes and empty-prefix manifest resolution
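The prefix handling described above can be sketched as follows. This is a minimal illustration, not the actual diff: the constant name comes from the summary, but the helper name `to_local_path`, the example prefix value, and the local-root argument are assumptions for illustration only.

```python
from pathlib import PurePosixPath

# Hypothetical value; the real constant lives in data/cached_challenge_fineweb.py.
MATCHED_FINEWEB_REMOTE_ROOT_PREFIX = "fineweb/cached/v1"


def to_local_path(remote_key: str, local_root: str) -> str:
    """Map a remote object key to a local path, stripping the full
    multi-segment remote prefix first (a no-op when the prefix is empty)."""
    prefix = MATCHED_FINEWEB_REMOTE_ROOT_PREFIX.strip("/")
    key = remote_key.strip("/")
    # Strip the whole prefix only when it matches on a segment boundary,
    # so "fineweb/cached/v1x/..." is left untouched.
    if prefix and key.startswith(prefix + "/"):
        key = key[len(prefix) + 1:]
    return str(PurePosixPath(local_root) / key)
```

For example, `to_local_path("fineweb/cached/v1/shard_000.bin", "data/datasets")` yields `data/datasets/shard_000.bin`, while an empty prefix leaves the key unchanged — the two cases the regression tests are said to cover.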

Test Plan

  • python3 -m unittest discover -s tests -v
  • python3 -m py_compile $(rg --files -g '*.py')

Handle MATCHED_FINEWEB_REMOTE_ROOT_PREFIX consistently for
manifest, docs, datasets, and tokenizer artifacts.

Add regression tests for nested prefixes and empty-prefix
manifest resolution.
@RolanH RolanH closed this Mar 18, 2026
@RolanH RolanH deleted the codex/fix-cached-fineweb-paths branch March 18, 2026 18:56
gb250e referenced this pull request in gb250e/parameter-golf Mar 21, 2026
dhruvjatkar pushed a commit to dhruvjatkar/parameter-golf that referenced this pull request Mar 25, 2026
PR openai#672 maxes TTT at 30 epochs (590s/600s eval budget), so all future
improvements must be orthogonal to TTT. This update:
- Sets 1.0781 BPB (PR openai#672) as the new target to beat
- Reorders Top 8 directions: XSA-all confirmed at #1, Full GPTQ #2,
  SwiGLU #3, Muon-VS #4, aggressive quant #5, MASA #6,
  depth recurrence #7 with int6 risk warning, AdEMAMix #8
- Deprioritizes TTT-related directions already exploited by PR openai#672
- Collapses ~1000 lines of stale Round 0-3.9 session logs into a
  concise historical summary
- Removes resolved blockers (flash_attn, SSH hangs, local runtime)
- Adds fresh Round 1 section with 5 submitted experiments

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>