
Commit fe7c309

fix(submission): prepend BOS_ID=1 in prepare_caseops_data.py
External reproductions of PR #1769 (and PR #1736) failed with ZeroDivisionError in phased TTT eval because the shipped prep script did not prepend the `<s>` control token (ID 1) to each doc. The SP tokenizer reserves IDs 0-7 (pad/s/</s>/unk + 4 CaseOps operators), so sp.encode cannot emit ID 1 naturally, and train_gpt.py:_find_docs (line 2209) requires BOS markers with no fallback. Training itself ran because _init_shard:408-409 falls back to bos_idx=[0] when no BOS is found; phased TTT eval has no equivalent fallback.

Fix: add a BOS_ID=1 constant, prepend it to each doc's tokens, and append 0 to the byte sidecar (BOS = 0 original bytes). This matches the canonical pattern in data/download_hf_docs_and_tokenize.py:364-366.

The submitted 1.06453 metric is unaffected: val_bpb reduces to loss_sum/ln(2)/byte_sum (token counts cancel), and byte_sum is unchanged with BOS prepended. Our seed logs were measured on shards that already had BOS markers from an internal prep path; the shipped prep was the outlier.

Also adds a "Reproduction sanity check" section to README.md that asserts bos_count > 0 on the first val shard.

Reported by @codemath3000 in PR #1736 comment 4285805497.
1 parent b4b6ddf commit fe7c309
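The bpb-invariance argument above can be checked with a toy calculation. This is a sketch, not the repo's eval code: the per-token losses and byte counts are made up, and `val_bpb` here is just the loss_sum/ln(2)/byte_sum reduction stated in the message. It only demonstrates the denominator side of the claim — a BOS token whose sidecar entry is 0 leaves byte_sum, and hence val_bpb, unchanged.

```python
import math

# Made-up per-token losses and original-byte counts for one val doc;
# only the structure matters, not the numbers.
losses = [2.0, 1.5, 3.0]
byte_counts = [4, 3, 5]

def val_bpb(losses, byte_counts):
    # val_bpb reduces to loss_sum / ln(2) / byte_sum -- token counts cancel,
    # so only the summed loss and the summed original bytes matter.
    return sum(losses) / math.log(2) / sum(byte_counts)

before = val_bpb(losses, byte_counts)

# Prepending BOS adds one sidecar entry of 0 bytes, so byte_sum is
# unchanged; with the loss over real tokens unchanged (BOS excluded
# from loss_sum in this sketch), val_bpb is identical.
after = val_bpb(losses, [0] + byte_counts)

assert after == before
```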

2 files changed: 22 additions & 2 deletions

records/track_10min_16mb/2026-04-22_SP8192_CaseOps_GatedAttn_QuantGate_Loop45_PhasedTTT_MLPClip12/README.md

Lines changed: 18 additions & 0 deletions
````diff
@@ -137,6 +137,24 @@ data/datasets/fineweb10B_sp8192_caseops/datasets/
 fineweb_val_bytes_000000.bin
 ```
 
+### Reproduction sanity check (run after step 2)
+
+Each shard must contain `BOS_ID=1` at the start of every document — `train_gpt.py`'s phased TTT eval path (`_find_docs`) requires it. Quick check on the first val shard:
+
+```bash
+python3 -c "
+import numpy as np
+d = np.fromfile('data/datasets/fineweb10B_sp8192_caseops/datasets/datasets/fineweb10B_sp8192_lossless_caps_caseops_v1_reserved/fineweb_val_000000.bin', dtype=np.uint16)
+# The 256-entry int32 shard header occupies the first 512 uint16 slots; tokens start after.
+tokens = d[512:]
+bos_count = int((tokens == 1).sum())
+print(f'BOS markers in val shard: {bos_count} (must be > 0)')
+assert bos_count > 0, 'prepare_caseops_data.py is broken — re-run with BOS prepend'
+"
+```
+
+If `bos_count == 0`, the prep script is out of date — pull the latest `prepare_caseops_data.py` from this folder (the SP tokenizer reserves IDs 0-7 for special + CaseOps operator tokens, so the prep script must explicitly prepend `BOS_ID=1` to each doc; the eval path's `_find_docs` has no fallback for missing BOS markers).
+
 ## Run command (5-seed reproduction)
 
 ```bash
````

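The failure mode the sanity check guards against can be seen in a minimal sketch of BOS-based doc splitting. `find_docs` below is a hypothetical stand-in for `train_gpt.py`'s `_find_docs`, not the actual code: with no ID-1 markers the doc list comes back empty, and any per-doc average then divides by zero.

```python
import numpy as np

BOS_ID = 1

def find_docs(tokens: np.ndarray, bos_id: int = BOS_ID) -> list:
    # Hypothetical stand-in for _find_docs: split the token stream at
    # BOS markers; each doc runs from one BOS to the next.
    starts = np.flatnonzero(tokens == bos_id)
    ends = list(starts[1:]) + [len(tokens)]
    return [tokens[a:b] for a, b in zip(starts, ends)]

good = np.array([1, 42, 99, 1, 17], dtype=np.uint16)  # two BOS-marked docs
bad = np.array([42, 99, 17], dtype=np.uint16)         # prep bug: no BOS

docs = find_docs(good)
print(len(docs))  # 2

# With no BOS markers the doc list is empty, so a per-doc mean such as
# loss_sum / len(docs) raises ZeroDivisionError -- the reported failure.
assert len(find_docs(bad)) == 0
```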
records/track_10min_16mb/2026-04-22_SP8192_CaseOps_GatedAttn_QuantGate_Loop45_PhasedTTT_MLPClip12/prepare_caseops_data.py

Lines changed: 4 additions & 2 deletions
```diff
@@ -60,6 +60,7 @@
 SHARD_MAGIC = 20240520
 SHARD_VERSION = 1
 SHARD_TOKENS = 10_000_000  # tokens per shard — matches the main pipeline
+BOS_ID = 1  # SP model's <s> control token; train_gpt.py:_find_docs requires BOS per doc
 
 
 def _write_shard(out_path: pathlib.Path, arr: np.ndarray) -> None:
@@ -154,12 +155,13 @@ def main() -> None:
 
     for text in _iter_docs(args.docs):
         transformed = encode_lossless_caps_v2(text)
-        token_ids = sp.encode(transformed, out_type=int)
+        token_ids = [BOS_ID] + sp.encode(transformed, out_type=int)
         if n_docs < args.val_docs:
             # Validation doc — also compute byte sidecar
             byte_counts = _token_original_byte_counts(sp, text, transformed)
             val_buf_tokens.extend(token_ids)
-            val_buf_bytes.extend(int(b) for b in byte_counts[:len(token_ids)])
+            val_buf_bytes.append(0)  # BOS contributes 0 original bytes
+            val_buf_bytes.extend(int(b) for b in byte_counts)
             if len(val_buf_tokens) >= SHARD_TOKENS:
                 _write_shard(train_out / f"fineweb_val_{val_written:06d}.bin",
                              np.array(val_buf_tokens[:SHARD_TOKENS], dtype=np.uint16))
```
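The sidecar change above keeps the token and byte buffers in lockstep. A toy sketch (made-up token IDs and byte counts, not real tokenizer output) of the fixed buffering logic:

```python
BOS_ID = 1

# Made-up per-doc encodings: token IDs and, for each token, the number
# of original bytes it covers (the byte sidecar).
docs = [
    ([17, 42], [3, 4]),
    ([99], [5]),
]

tokens, sidecar = [], []
for token_ids, byte_counts in docs:
    tokens.extend([BOS_ID] + token_ids)  # fixed prep: BOS first
    sidecar.append(0)                    # BOS covers 0 original bytes
    sidecar.extend(byte_counts)

# The buffers stay the same length, and byte_sum is unchanged by the
# BOS prepend because every added sidecar entry is zero.
assert len(tokens) == len(sidecar)
assert sum(sidecar) == sum(b for _, bc in docs for b in bc)
print(tokens, sidecar)  # [1, 17, 42, 1, 99] [0, 3, 4, 0, 5]
```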
