
Commit fe7c309

fix(submission): prepend BOS_ID=1 in prepare_caseops_data.py
External reproductions of PR #1769 (and PR #1736) failed with ZeroDivisionError in phased TTT eval because the shipped prep script did not prepend the `<s>` control token (ID 1) to each doc. The SP tokenizer reserves IDs 0-7 (pad/s/</s>/unk + 4 CaseOps operators), so sp.encode cannot emit ID 1 naturally, and train_gpt.py:_find_docs (line 2209) requires BOS markers with no fallback. Training itself ran because _init_shard:408-409 falls back to bos_idx=[0] when no BOS is found; phased TTT eval has no equivalent fallback.

Fix: add a BOS_ID=1 constant, prepend it to each doc's tokens, and append 0 to the byte sidecar (BOS = 0 original bytes). This matches the canonical pattern in data/download_hf_docs_and_tokenize.py:364-366.

The submitted 1.06453 metric is unaffected: val_bpb reduces to loss_sum/ln(2)/byte_sum (token counts cancel), and byte_sum is unchanged with BOS prepended. Our seed logs were measured on shards that already had BOS markers from an internal prep path; the shipped prep was the outlier.

Also adds a "Reproduction sanity check" section to README.md that asserts bos_count > 0 on the first val shard.

Reported by @codemath3000 in PR #1736 comment 4285805497.
1 parent b4b6ddf commit fe7c309
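The bpb-invariance argument above can be checked with a toy calculation. This is a sketch, not the repo's eval code: the per-token losses and byte counts are made up, and `val_bpb` here is just the loss_sum/ln(2)/byte_sum reduction stated in the message. It only demonstrates the denominator side of the claim — a BOS token whose sidecar entry is 0 leaves byte_sum, and hence val_bpb, unchanged.

```python
import math

# Made-up per-token losses and original-byte counts for one val doc;
# only the structure matters, not the numbers.
losses = [2.0, 1.5, 3.0]
byte_counts = [4, 3, 5]

def val_bpb(losses, byte_counts):
    # val_bpb reduces to loss_sum / ln(2) / byte_sum -- token counts cancel,
    # so only the summed loss and the summed original bytes matter.
    return sum(losses) / math.log(2) / sum(byte_counts)

before = val_bpb(losses, byte_counts)

# Prepending BOS adds one sidecar entry of 0 bytes, so byte_sum is
# unchanged; with the loss over real tokens unchanged (BOS excluded
# from loss_sum in this sketch), val_bpb is identical.
after = val_bpb(losses, [0] + byte_counts)

assert after == before
```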

2 files changed: 22 additions & 2 deletions

records/track_10min_16mb/2026-04-22_SP8192_CaseOps_GatedAttn_QuantGate_Loop45_PhasedTTT_MLPClip12/README.md

Lines changed: 18 additions & 0 deletions
````diff
@@ -137,6 +137,24 @@ data/datasets/fineweb10B_sp8192_caseops/datasets/
 fineweb_val_bytes_000000.bin
 ```
 
+### Reproduction sanity check (run after step 2)
+
+Each shard must contain `BOS_ID=1` at the start of every document — `train_gpt.py`'s phased TTT eval path (`_find_docs`) requires it. Quick check on the first val shard:
+
+```bash
+python3 -c "
+import numpy as np
+d = np.fromfile('data/datasets/fineweb10B_sp8192_caseops/datasets/datasets/fineweb10B_sp8192_lossless_caps_caseops_v1_reserved/fineweb_val_000000.bin', dtype=np.uint16)
+# The 256-entry int32 shard header occupies the first 512 uint16 slots; tokens start after.
+tokens = d[512:]
+bos_count = int((tokens == 1).sum())
+print(f'BOS markers in val shard: {bos_count} (must be > 0)')
+assert bos_count > 0, 'prepare_caseops_data.py is broken — re-run with BOS prepend'
+"
+```
+
+If `bos_count == 0`, the prep script is out of date — pull the latest `prepare_caseops_data.py` from this folder (the SP tokenizer reserves IDs 0-7 for special + CaseOps operator tokens, so the prep script must explicitly prepend `BOS_ID=1` to each doc; the eval path's `_find_docs` has no fallback for missing BOS markers).
+
 ## Run command (5-seed reproduction)
 
 ```bash
````

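The failure mode the sanity check guards against can be seen in a minimal sketch of BOS-based doc splitting. `find_docs` below is a hypothetical stand-in for `train_gpt.py`'s `_find_docs`, not the actual code: with no ID-1 markers the doc list comes back empty, and any per-doc average then divides by zero.

```python
import numpy as np

BOS_ID = 1

def find_docs(tokens: np.ndarray, bos_id: int = BOS_ID) -> list:
    # Hypothetical stand-in for _find_docs: split the token stream at
    # BOS markers; each doc runs from one BOS to the next.
    starts = np.flatnonzero(tokens == bos_id)
    ends = list(starts[1:]) + [len(tokens)]
    return [tokens[a:b] for a, b in zip(starts, ends)]

good = np.array([1, 42, 99, 1, 17], dtype=np.uint16)  # two BOS-marked docs
bad = np.array([42, 99, 17], dtype=np.uint16)         # prep bug: no BOS

docs = find_docs(good)
print(len(docs))  # 2

# With no BOS markers the doc list is empty, so a per-doc mean such as
# loss_sum / len(docs) raises ZeroDivisionError -- the reported failure.
assert len(find_docs(bad)) == 0
```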
records/track_10min_16mb/2026-04-22_SP8192_CaseOps_GatedAttn_QuantGate_Loop45_PhasedTTT_MLPClip12/prepare_caseops_data.py

Lines changed: 4 additions & 2 deletions
```diff
@@ -60,6 +60,7 @@
 SHARD_MAGIC = 20240520
 SHARD_VERSION = 1
 SHARD_TOKENS = 10_000_000  # tokens per shard — matches the main pipeline
+BOS_ID = 1  # SP model's <s> control token; train_gpt.py:_find_docs requires BOS per doc
 
 
 def _write_shard(out_path: pathlib.Path, arr: np.ndarray) -> None:
@@ -154,12 +155,13 @@ def main() -> None:
 
     for text in _iter_docs(args.docs):
         transformed = encode_lossless_caps_v2(text)
-        token_ids = sp.encode(transformed, out_type=int)
+        token_ids = [BOS_ID] + sp.encode(transformed, out_type=int)
         if n_docs < args.val_docs:
             # Validation doc — also compute byte sidecar
             byte_counts = _token_original_byte_counts(sp, text, transformed)
             val_buf_tokens.extend(token_ids)
-            val_buf_bytes.extend(int(b) for b in byte_counts[:len(token_ids)])
+            val_buf_bytes.append(0)  # BOS contributes 0 original bytes
+            val_buf_bytes.extend(int(b) for b in byte_counts)
             if len(val_buf_tokens) >= SHARD_TOKENS:
                 _write_shard(train_out / f"fineweb_val_{val_written:06d}.bin",
                              np.array(val_buf_tokens[:SHARD_TOKENS], dtype=np.uint16))
```
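The sidecar change above keeps the token and byte buffers in lockstep. A toy sketch (made-up token IDs and byte counts, not real tokenizer output) of the fixed buffering logic:

```python
BOS_ID = 1

# Made-up per-doc encodings: token IDs and, for each token, the number
# of original bytes it covers (the byte sidecar).
docs = [
    ([17, 42], [3, 4]),
    ([99], [5]),
]

tokens, sidecar = [], []
for token_ids, byte_counts in docs:
    tokens.extend([BOS_ID] + token_ids)  # fixed prep: BOS first
    sidecar.append(0)                    # BOS covers 0 original bytes
    sidecar.extend(byte_counts)

# The buffers stay the same length, and byte_sum is unchanged by the
# BOS prepend because every added sidecar entry is zero.
assert len(tokens) == len(sidecar)
assert sum(sidecar) == sum(b for _, bc in docs for b in bc)
print(tokens, sidecar)  # [1, 17, 42, 1, 99] [0, 3, 4, 0, 5]
```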
