# Notebook 03: Build Vocab + Pairs (XuetangX)

**Purpose:** Build course vocabulary + create prefix→label pairs for next-course prediction.

**Cold-Start Focus:**
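- Vocabulary is **fixed**: 343 courses from XuetangX, so the item set itself is not cold-start
- Pairs are built within sessions and carry a `user_id`, so they can be grouped per user (supports K-shot learning for cold-start users)
- Chronological ordering is **critical**: every prefix event must precede its label (no future leakage)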

**Inputs:**
- `data/interim/xuetangx.duckdb` (view: `xuetangx_events_sessionized`)

**Outputs:**
- `data/processed/xuetangx/vocab/course2id.json` (course_id → int)
- `data/processed/xuetangx/vocab/id2course.json` (int → course_id)
- `data/processed/xuetangx/pairs/pairs.parquet` (prefix→label pairs)
- DuckDB view: `xuetangx_pairs`
- `reports/03_build_vocab_pairs_xuetangx/<run_tag>/report.json`

**Strategy:**
1. Build course vocabulary (sorted with Python's `sorted`, i.e. by Unicode code point, for determinism; uppercase sorts before lowercase)
2. Extract each session's course sequence in chronological order, collapsing consecutive repeats
3. Create pairs: prefix=[c_0, c_1, ..., c_{t-1}], label=c_t
4. Validate: no future leakage, all courses in vocab
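
The pair-construction step can be illustrated on a toy sequence. The sketch below is a minimal standalone illustration, not the notebook's own code: the helper name `make_pairs` and the toy timestamps are assumptions. It follows the same rule as CELL 03-08, including skipping a pair whose label is not strictly later than every prefix event.

```python
from typing import List, Tuple

def make_pairs(
    items: List[int], ts: List[int], min_prefix_len: int = 1
) -> List[Tuple[List[int], int]]:
    """Return (prefix, label) pairs. Hypothetical helper, for illustration only."""
    pairs = []
    for t in range(min_prefix_len, len(items)):
        # Keep the pair only if every prefix event is strictly earlier than the label.
        if max(ts[:t]) >= ts[t]:
            continue
        pairs.append((items[:t], items[t]))
    return pairs

toy_pairs = make_pairs(items=[107, 133, 334], ts=[10, 20, 30])
print(toy_pairs)  # [([107], 133), ([107, 133], 334)]

# A timestamp tie between positions 0 and 1 drops the first pair only.
tie_pairs = make_pairs(items=[1, 2, 3], ts=[10, 10, 30])
print(tie_pairs)  # [([1, 2], 3)]
```

A sequence of length n yields at most n - `min_prefix_len` pairs, which is why sessions with a single course contribute nothing.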

In [1]:
# [CELL 03-00] Bootstrap: repo root + paths + logger

import os
import sys
import json
import time
import uuid
import hashlib
from pathlib import Path
from datetime import datetime
from typing import Any, Dict, List

import numpy as np
import pandas as pd

t0 = datetime.now()
print(f"[CELL 03-00] start={t0.isoformat(timespec='seconds')}")
print("[CELL 03-00] CWD:", Path.cwd().resolve())

def find_repo_root(start: Path) -> Path:
    start = start.resolve()
    for p in [start, *start.parents]:
        if (p / "PROJECT_STATE.md").exists():
            return p
    raise RuntimeError("Could not find PROJECT_STATE.md. Open notebook from within the repo.")

REPO_ROOT = find_repo_root(Path.cwd())
print("[CELL 03-00] REPO_ROOT:", REPO_ROOT)

PATHS = {
    "META_REGISTRY": REPO_ROOT / "meta.json",
    "DATA_INTERIM": REPO_ROOT / "data" / "interim",
    "DATA_PROCESSED": REPO_ROOT / "data" / "processed",
    "REPORTS": REPO_ROOT / "reports",
}
for k, v in PATHS.items():
    print(f"[CELL 03-00] {k}={v}")

def cell_start(cell_id: str, title: str, **kwargs: Any) -> float:
    t = time.time()
    print(f"\n[{cell_id}] {title}")
    print(f"[{cell_id}] start={datetime.now().isoformat(timespec='seconds')}")
    for k, v in kwargs.items():
        print(f"[{cell_id}] {k}={v}")
    return t

def cell_end(cell_id: str, t0: float, **kwargs: Any) -> None:
    for k, v in kwargs.items():
        print(f"[{cell_id}] {k}={v}")
    print(f"[{cell_id}] elapsed={time.time()-t0:.2f}s")
    print(f"[{cell_id}] done")

print("[CELL 03-00] done")

[CELL 03-00] start=2026-01-07T10:25:33
[CELL 03-00] CWD: C:\anonymous-users-mooc-session-meta\notebooks
[CELL 03-00] REPO_ROOT: C:\anonymous-users-mooc-session-meta
[CELL 03-00] META_REGISTRY=C:\anonymous-users-mooc-session-meta\meta.json
[CELL 03-00] DATA_INTERIM=C:\anonymous-users-mooc-session-meta\data\interim
[CELL 03-00] DATA_PROCESSED=C:\anonymous-users-mooc-session-meta\data\processed
[CELL 03-00] REPORTS=C:\anonymous-users-mooc-session-meta\reports
[CELL 03-00] done


In [2]:
# [CELL 03-01] Reproducibility: seed everything

t0 = cell_start("CELL 03-01", "Seed everything")

GLOBAL_SEED = 20260107

def seed_everything(seed: int) -> None:
    import random
    random.seed(seed)
    np.random.seed(seed)

seed_everything(GLOBAL_SEED)

cell_end("CELL 03-01", t0, seed=GLOBAL_SEED)


[CELL 03-01] Seed everything
[CELL 03-01] start=2026-01-07T10:25:33
[CELL 03-01] seed=20260107
[CELL 03-01] elapsed=0.00s
[CELL 03-01] done


In [3]:
# [CELL 03-02] JSON IO + hashing helpers

t0 = cell_start("CELL 03-02", "JSON IO + hashing")

def write_json_atomic(path: Path, obj: Any, indent: int = 2) -> None:
    path.parent.mkdir(parents=True, exist_ok=True)
    tmp = path.with_suffix(path.suffix + f".tmp_{uuid.uuid4().hex}")
    with tmp.open("w", encoding="utf-8") as f:
        json.dump(obj, f, ensure_ascii=False, indent=indent)
    tmp.replace(path)

def read_json(path: Path) -> Any:
    if not path.exists():
        raise RuntimeError(f"Missing JSON file: {path}")
    with path.open("r", encoding="utf-8") as f:
        return json.load(f)

def sha256_file(path: Path, chunk_size: int = 1024 * 1024) -> str:
    h = hashlib.sha256()
    with path.open("rb") as f:
        while True:
            b = f.read(chunk_size)
            if not b:
                break
            h.update(b)
    return h.hexdigest()

cell_end("CELL 03-02", t0)


[CELL 03-02] JSON IO + hashing
[CELL 03-02] start=2026-01-07T10:25:33
[CELL 03-02] elapsed=0.00s
[CELL 03-02] done


In [4]:
# [CELL 03-03] Run tagging + report/config/manifest + meta.json

t0 = cell_start("CELL 03-03", "Start run + init files + meta.json")

NOTEBOOK_NAME = "03_build_vocab_pairs_xuetangx"
RUN_TAG = datetime.now().strftime("%Y%m%d_%H%M%S")
RUN_ID = uuid.uuid4().hex

OUT_DIR = PATHS["REPORTS"] / NOTEBOOK_NAME / RUN_TAG
OUT_DIR.mkdir(parents=True, exist_ok=True)

REPORT_PATH = OUT_DIR / "report.json"
CONFIG_PATH = OUT_DIR / "config.json"
MANIFEST_PATH = OUT_DIR / "manifest.json"

DUCKDB_PATH = PATHS["DATA_INTERIM"] / "xuetangx.duckdb"
EVENTS_VIEW = "xuetangx_events_sessionized"

VOCAB_DIR = PATHS["DATA_PROCESSED"] / "xuetangx" / "vocab"
PAIRS_DIR = PATHS["DATA_PROCESSED"] / "xuetangx" / "pairs"
VOCAB_DIR.mkdir(parents=True, exist_ok=True)
PAIRS_DIR.mkdir(parents=True, exist_ok=True)

OUT_COURSE2ID = VOCAB_DIR / "course2id.json"
OUT_ID2COURSE = VOCAB_DIR / "id2course.json"
OUT_PAIRS_PARQUET = PAIRS_DIR / "pairs.parquet"

CFG = {
    "notebook": NOTEBOOK_NAME,
    "run_id": RUN_ID,
    "run_tag": RUN_TAG,
    "seed": GLOBAL_SEED,
    "inputs": {
        "duckdb_path": str(DUCKDB_PATH),
        "events_view": EVENTS_VIEW,
    },
    "outputs": {
        "course2id": str(OUT_COURSE2ID),
        "id2course": str(OUT_ID2COURSE),
        "pairs": str(OUT_PAIRS_PARQUET),
        "out_dir": str(OUT_DIR),
    },
    "vocab": {
        "ordering": "alphabetical",  # deterministic
        "start_index": 0,  # course_id 0, 1, 2, ...
    },
    "pairs": {
        "min_prefix_len": 1,  # at least 1 course before predicting next
        "max_prefix_len": None,  # no limit (variable length)
        "deduplicate_consecutive": True,  # remove consecutive repeats (e.g., [A,A,B] → [A,B])
    }
}

write_json_atomic(CONFIG_PATH, CFG)

report = {
    "run_id": RUN_ID,
    "notebook": NOTEBOOK_NAME,
    "run_tag": RUN_TAG,
    "created_at": datetime.now().isoformat(timespec="seconds"),
    "repo_root": str(REPO_ROOT),
    "metrics": {},
    "key_findings": [],
    "sanity_samples": {},
    "data_fingerprints": {},
    "notes": [],
}
write_json_atomic(REPORT_PATH, report)

manifest = {"run_id": RUN_ID, "notebook": NOTEBOOK_NAME, "run_tag": RUN_TAG, "artifacts": []}
write_json_atomic(MANIFEST_PATH, manifest)

# meta.json append-only
META_PATH = PATHS["META_REGISTRY"]
if not META_PATH.exists():
    write_json_atomic(META_PATH, {"schema_version": 1, "runs": []})
meta = read_json(META_PATH)
meta["runs"].append({
    "run_id": RUN_ID,
    "notebook": NOTEBOOK_NAME,
    "run_tag": RUN_TAG,
    "out_dir": str(OUT_DIR),
    "created_at": datetime.now().isoformat(timespec="seconds"),
})
write_json_atomic(META_PATH, meta)

cell_end("CELL 03-03", t0, out_dir=str(OUT_DIR))


[CELL 03-03] Start run + init files + meta.json
[CELL 03-03] start=2026-01-07T10:25:33
[CELL 03-03] out_dir=C:\anonymous-users-mooc-session-meta\reports\03_build_vocab_pairs_xuetangx\20260107_102533
[CELL 03-03] elapsed=0.02s
[CELL 03-03] done


In [5]:
# [CELL 03-04] Load sessionized events from DuckDB

t0 = cell_start("CELL 03-04", "Load sessionized events", duckdb=str(DUCKDB_PATH))

import duckdb

if not DUCKDB_PATH.exists():
    raise RuntimeError(f"Missing DuckDB: {DUCKDB_PATH}. Run Notebook 02 first.")

con = duckdb.connect(str(DUCKDB_PATH), read_only=True)

# Load events with session + timestamp info
events = con.execute(f"""
    SELECT 
        user_id,
        course_id,
        session_id,
        ts_epoch,
        pos_in_sess
    FROM {EVENTS_VIEW}
    ORDER BY user_id, session_id, ts_epoch, pos_in_sess
""").fetchdf()

con.close()

print(f"[CELL 03-04] Loaded events shape: {events.shape}")
print(f"[CELL 03-04] Columns: {list(events.columns)}")
print(f"\n[CELL 03-04] Head(5):")
print(events.head(5).to_string(index=False))

cell_end("CELL 03-04", t0, n_events=int(events.shape[0]))


[CELL 03-04] Load sessionized events
[CELL 03-04] start=2026-01-07T10:25:33
[CELL 03-04] duckdb=C:\anonymous-users-mooc-session-meta\data\interim\xuetangx.duckdb



[CELL 03-04] Loaded events shape: (19239571, 5)
[CELL 03-04] Columns: ['user_id', 'course_id', 'session_id', 'ts_epoch', 'pos_in_sess']

[CELL 03-04] Head(5):
user_id                             course_id                             session_id   ts_epoch  pos_in_sess
      1                       UQx/Think101x/_     1_ed614e7cbc66a577d760911c5c1684d4 1441241706            1
      1                       UQx/Think101x/_     1_ed614e7cbc66a577d760911c5c1684d4 1441241713            2
  10000 course-v1:TsinghuaX+40040152X+2015_T2 10000_e20c166625ba46d905c81429dd713c97 1442264613            1
  10000 course-v1:TsinghuaX+40040152X+2015_T2 10000_e20c166625ba46d905c81429dd713c97 1442264934            2
  10000 course-v1:TsinghuaX+40040152X+2015_T2 10000_e20c166625ba46d905c81429dd713c97 1442264994            3
[CELL 03-04] n_events=19239571
[CELL 03-04] elapsed=9.22s
[CELL 03-04] done


In [6]:
# [CELL 03-05] Build course vocabulary (alphabetical, deterministic)

t0 = cell_start("CELL 03-05", "Build course vocabulary")

# Extract unique courses and sort them (code-point order: uppercase before lowercase)
unique_courses = sorted(events["course_id"].unique())

print(f"[CELL 03-05] Found {len(unique_courses)} unique courses")
print(f"[CELL 03-05] First 5 courses: {unique_courses[:5]}")
print(f"[CELL 03-05] Last 5 courses: {unique_courses[-5:]}")

# Create bidirectional mappings
course2id = {course: idx for idx, course in enumerate(unique_courses)}
id2course = {idx: course for course, idx in course2id.items()}

n_items = len(course2id)
print(f"\n[CELL 03-05] Vocabulary size (n_items): {n_items}")

# Save vocabularies
write_json_atomic(OUT_COURSE2ID, course2id)
write_json_atomic(OUT_ID2COURSE, id2course)

course2id_sha = sha256_file(OUT_COURSE2ID)
id2course_sha = sha256_file(OUT_ID2COURSE)

print(f"\n[CELL 03-05] Saved: {OUT_COURSE2ID.name} (SHA256: {course2id_sha[:16]}...)")
print(f"[CELL 03-05] Saved: {OUT_ID2COURSE.name} (SHA256: {id2course_sha[:16]}...)")

cell_end("CELL 03-05", t0, n_items=n_items)


[CELL 03-05] Build course vocabulary
[CELL 03-05] start=2026-01-07T10:25:42
[CELL 03-05] Found 343 unique courses
[CELL 03-05] First 5 courses: ['AdelaideX/humbio101x/_', 'BIT/ELC05198/2014_T2', 'BIT/PHY1701701/2015_T1', 'BIT/PHY1701702/2015_T1', 'BerkeleyX/CS169_1x/_']
[CELL 03-05] Last 5 courses: ['course-v1:test+test+test', 'course-v1:test+test0001+2015_test', 'course-v1:ustcX+LB05203a+2015_T2', 'edX/BlendedX/_', 'ustcX/LB05203a/2014_T2']

[CELL 03-05] Vocabulary size (n_items): 343

[CELL 03-05] Saved: course2id.json (SHA256: 8c65e26abef28d74...)
[CELL 03-05] Saved: id2course.json (SHA256: 04fad75365624bc7...)
[CELL 03-05] n_items=343
[CELL 03-05] elapsed=1.47s
[CELL 03-05] done
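
One detail worth keeping in mind when reloading `id2course.json`: JSON object keys are always strings, so the integer keys of `id2course` come back as `"0"`, `"1"`, ... after a round trip through `json.dump`. The sketch below shows one way to restore integer keys. The helper name `load_id2course` and the two-entry toy mapping are illustrative assumptions, not part of the notebook.

```python
import json
import tempfile
from pathlib import Path
from typing import Dict

def load_id2course(path: Path) -> Dict[int, str]:
    """Read an id -> course JSON file and convert its string keys back to int."""
    with path.open("r", encoding="utf-8") as f:
        raw = json.load(f)
    return {int(k): v for k, v in raw.items()}

# Toy round trip using two course names from the vocabulary listing above.
toy = {0: "AdelaideX/humbio101x/_", 1: "BIT/ELC05198/2014_T2"}
with tempfile.TemporaryDirectory() as d:
    p = Path(d) / "id2course.json"
    p.write_text(json.dumps(toy), encoding="utf-8")
    raw_keys = list(json.loads(p.read_text(encoding="utf-8")).keys())
    restored = load_id2course(p)

print(raw_keys)         # ['0', '1']
print(restored == toy)  # True
```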


In [7]:
# [CELL 03-06] Map course_id to item_id (integer encoding)

t0 = cell_start("CELL 03-06", "Map courses to item IDs")

events["item_id"] = events["course_id"].map(course2id)

# Sanity check: no unmapped courses
n_missing = events["item_id"].isna().sum()
if n_missing > 0:
    raise RuntimeError(f"Found {n_missing} events with unmapped courses (should be impossible)")

events["item_id"] = events["item_id"].astype(int)

print(f"[CELL 03-06] Mapped all courses to item_id [0, {n_items-1}]")
print(f"\n[CELL 03-06] Head(5) with item_id:")
print(events[["user_id", "session_id", "course_id", "item_id", "ts_epoch"]].head(5).to_string(index=False))

cell_end("CELL 03-06", t0)


[CELL 03-06] Map courses to item IDs
[CELL 03-06] start=2026-01-07T10:25:44
[CELL 03-06] Mapped all courses to item_id [0, 342]

[CELL 03-06] Head(5) with item_id:
user_id                             session_id                             course_id  item_id   ts_epoch
      1     1_ed614e7cbc66a577d760911c5c1684d4                       UQx/Think101x/_      203 1441241706
      1     1_ed614e7cbc66a577d760911c5c1684d4                       UQx/Think101x/_      203 1441241713
  10000 10000_e20c166625ba46d905c81429dd713c97 course-v1:TsinghuaX+40040152X+2015_T2      307 1442264613
  10000 10000_e20c166625ba46d905c81429dd713c97 course-v1:TsinghuaX+40040152X+2015_T2      307 1442264934
  10000 10000_e20c166625ba46d905c81429dd713c97 course-v1:TsinghuaX+40040152X+2015_T2      307 1442264994
[CELL 03-06] elapsed=2.70s
[CELL 03-06] done


In [8]:
# [CELL 03-07] Extract course sequences per session (deduplicate consecutive)

t0 = cell_start("CELL 03-07", "Extract course sequences per session")

DEDUPE = CFG["pairs"]["deduplicate_consecutive"]

def dedupe_consecutive(items: list) -> list:
    """Remove consecutive duplicates: [A, A, B, C, C] → [A, B, C]"""
    if not items:
        return []
    result = [items[0]]
    for item in items[1:]:
        if item != result[-1]:
            result.append(item)
    return result

# Group by session and extract chronological course sequence
session_seqs = []
for (user_id, session_id), group in events.groupby(["user_id", "session_id"]):
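    # Sort chronologically within the session; pos_in_sess breaks timestamp ties
    # deterministically (sorting on ts_epoch alone is not guaranteed to be stable)
    group = group.sort_values(["ts_epoch", "pos_in_sess"])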
    
    item_seq = group["item_id"].tolist()
    ts_seq = group["ts_epoch"].tolist()
    
    if DEDUPE:
        # Deduplicate consecutive courses (keeps first occurrence timestamp)
        deduped_items = []
        deduped_ts = []
        for i, item in enumerate(item_seq):
            if i == 0 or item != item_seq[i-1]:
                deduped_items.append(item)
                deduped_ts.append(ts_seq[i])
        item_seq = deduped_items
        ts_seq = deduped_ts
    
    session_seqs.append({
        "user_id": user_id,
        "session_id": session_id,
        "item_seq": item_seq,
        "ts_seq": ts_seq,
    })

print(f"[CELL 03-07] Extracted {len(session_seqs):,} session sequences")
print(f"[CELL 03-07] Deduplicate consecutive: {DEDUPE}")

# Sample check
print(f"\n[CELL 03-07] Sample session sequence:")
sample = session_seqs[0]
print(f"  user_id: {sample['user_id']}")
print(f"  session_id: {sample['session_id']}")
print(f"  item_seq (first 10): {sample['item_seq'][:10]}")
print(f"  length: {len(sample['item_seq'])}")

cell_end("CELL 03-07", t0, n_sessions=len(session_seqs))


[CELL 03-07] Extract course sequences per session
[CELL 03-07] start=2026-01-07T10:25:47
[CELL 03-07] Extracted 291,565 session sequences
[CELL 03-07] Deduplicate consecutive: True

[CELL 03-07] Sample session sequence:
  user_id: 1
  session_id: 1_ed614e7cbc66a577d760911c5c1684d4
  item_seq (first 10): [203]
  length: 1
[CELL 03-07] n_sessions=291565
[CELL 03-07] elapsed=78.42s
[CELL 03-07] done
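
The per-session Python loop accounts for most of this cell's runtime. A vectorized alternative is possible with pandas; the sketch below is an assumption-laden illustration on a toy frame and has not been run against the XuetangX events. It keeps a row when it is the first of its session or when its item differs from the previous row in the same session.

```python
import pandas as pd

# Toy events; column names follow the notebook, values are made up.
toy = pd.DataFrame({
    "session_id":  ["s1", "s1", "s1", "s1", "s2", "s2"],
    "ts_epoch":    [1, 2, 3, 4, 1, 2],
    "pos_in_sess": [1, 2, 3, 4, 1, 2],
    "item_id":     [5, 5, 7, 5, 7, 7],
})

# Chronological order within each session, with pos_in_sess as tie-breaker.
toy = toy.sort_values(["session_id", "ts_epoch", "pos_in_sess"])

# Previous item within the same session (NaN for the first row of a session).
prev_item = toy.groupby("session_id")["item_id"].shift()

# Keep the first row of each session and every row whose item changed.
keep = prev_item.isna() | (toy["item_id"] != prev_item)
deduped = toy[keep]

seqs = deduped.groupby("session_id")["item_id"].agg(list).to_dict()
print(seqs)  # {'s1': [5, 7, 5], 's2': [7]}
```

Because only the first row of each run is kept, the surviving `ts_epoch` is the first-occurrence timestamp, matching the behaviour of the loop above.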


In [9]:
# [CELL 03-08] Create prefix→label pairs (chronological, no future leakage)

t0 = cell_start("CELL 03-08", "Create prefix→label pairs")

MIN_PREFIX_LEN = int(CFG["pairs"]["min_prefix_len"])

pairs = []
pair_id = 0

for sess in session_seqs:
    user_id = sess["user_id"]
    session_id = sess["session_id"]
    item_seq = sess["item_seq"]
    ts_seq = sess["ts_seq"]
    
    # Need at least min_prefix_len + 1 items to create a pair
    if len(item_seq) < MIN_PREFIX_LEN + 1:
        continue
    
    # Create pairs: for each position t, prefix=[0:t], label=t
    for t in range(MIN_PREFIX_LEN, len(item_seq)):
        prefix = item_seq[:t]
        label = item_seq[t]
        
        # Timestamps: require prefix_max_ts < label_ts (no future leakage)
        prefix_ts = ts_seq[:t]
        label_ts = ts_seq[t]
        
        # ts_seq is sorted, so the only possible violation is a tie: the label shares
        # its timestamp with the last prefix event. Such pairs are skipped so that
        # every prefix event happens strictly BEFORE the label.
        if max(prefix_ts) >= label_ts:
            continue  # Skip this pair (label not strictly later than the prefix)
        
        pairs.append({
            "pair_id": pair_id,
            "user_id": user_id,
            "session_id": session_id,
            "prefix": prefix,  # list[int]
            "label": int(label),  # int
            "label_ts_epoch": int(label_ts),  # timestamp of label event
            "prefix_len": int(len(prefix)),
        })
        pair_id += 1

pairs_df = pd.DataFrame(pairs)

print(f"[CELL 03-08] Created {len(pairs_df):,} prefix→label pairs")
print(f"[CELL 03-08] Min prefix length: {MIN_PREFIX_LEN}")
print(f"\n[CELL 03-08] Prefix length distribution:")
print(f"  Min: {pairs_df['prefix_len'].min()}")
print(f"  p50: {pairs_df['prefix_len'].quantile(0.50):.0f}")
print(f"  p90: {pairs_df['prefix_len'].quantile(0.90):.0f}")
print(f"  p99: {pairs_df['prefix_len'].quantile(0.99):.0f}")
print(f"  Max: {pairs_df['prefix_len'].max()}")

print(f"\n[CELL 03-08] Head(3):")
print(pairs_df[["pair_id", "user_id", "session_id", "prefix", "label", "label_ts_epoch", "prefix_len"]].head(3).to_string(index=False))

cell_end("CELL 03-08", t0, n_pairs=int(len(pairs_df)))


[CELL 03-08] Create prefix→label pairs
[CELL 03-08] start=2026-01-07T10:27:05
[CELL 03-08] Created 264,229 prefix→label pairs
[CELL 03-08] Min prefix length: 1

[CELL 03-08] Prefix length distribution:
  Min: 1
  p50: 3
  p90: 22
  p99: 140
  Max: 901

[CELL 03-08] Head(3):
 pair_id user_id                               session_id          prefix  label  label_ts_epoch  prefix_len
       0 1000009 1000009_95163f59939941d9fd47d6c9b17fdaf6           [107]    133      1443241561           1
       1 1000009 1000009_95163f59939941d9fd47d6c9b17fdaf6      [107, 133]    334      1443242059           2
       2 1000009 1000009_95163f59939941d9fd47d6c9b17fdaf6 [107, 133, 334]    297      1443242155           3
[CELL 03-08] n_pairs=264229
[CELL 03-08] elapsed=1.64s
[CELL 03-08] done


In [10]:
# [CELL 03-09] Save pairs.parquet

t0 = cell_start("CELL 03-09", "Save pairs.parquet", out=str(OUT_PAIRS_PARQUET))

pairs_df.to_parquet(OUT_PAIRS_PARQUET, index=False, compression="zstd")

pairs_bytes = int(OUT_PAIRS_PARQUET.stat().st_size)
pairs_sha = sha256_file(OUT_PAIRS_PARQUET)

print(f"[CELL 03-09] Saved: {OUT_PAIRS_PARQUET}")
print(f"[CELL 03-09] Size: {pairs_bytes / 1024 / 1024:.1f} MB")
print(f"[CELL 03-09] SHA256: {pairs_sha}")

cell_end("CELL 03-09", t0)


[CELL 03-09] Save pairs.parquet
[CELL 03-09] start=2026-01-07T10:27:07
[CELL 03-09] out=C:\anonymous-users-mooc-session-meta\data\processed\xuetangx\pairs\pairs.parquet
[CELL 03-09] Saved: C:\anonymous-users-mooc-session-meta\data\processed\xuetangx\pairs\pairs.parquet
[CELL 03-09] Size: 4.8 MB
[CELL 03-09] SHA256: 51e62b1f05ea6f2df989f9eb77b2ddc3a684959e8dbc5e19c926950eed95ee4f
[CELL 03-09] elapsed=0.56s
[CELL 03-09] done


In [11]:
# [CELL 03-10] Register DuckDB view for pairs

t0 = cell_start("CELL 03-10", "Register DuckDB view", duckdb=str(DUCKDB_PATH))

con = duckdb.connect(str(DUCKDB_PATH), read_only=False)

con.execute("DROP VIEW IF EXISTS xuetangx_pairs;")

def esc_path(p: Path) -> str:
    return str(p).replace("'", "''")

con.execute(f"""
CREATE VIEW xuetangx_pairs AS
SELECT * FROM read_parquet('{esc_path(OUT_PAIRS_PARQUET)}')
""")

n_pairs = int(con.execute("SELECT COUNT(*) FROM xuetangx_pairs").fetchone()[0])
print(f"[CELL 03-10] View xuetangx_pairs: {n_pairs:,} rows")

con.close()
print(f"[CELL 03-10] Closed DuckDB connection")

cell_end("CELL 03-10", t0)


[CELL 03-10] Register DuckDB view
[CELL 03-10] start=2026-01-07T10:27:07
[CELL 03-10] duckdb=C:\anonymous-users-mooc-session-meta\data\interim\xuetangx.duckdb
[CELL 03-10] View xuetangx_pairs: 264,229 rows
[CELL 03-10] Closed DuckDB connection
[CELL 03-10] elapsed=0.07s
[CELL 03-10] done


In [12]:
# [CELL 03-11] Validation: sanity checks

t0 = cell_start("CELL 03-11", "Validation checks")

# Check 1: All labels in vocab
invalid_labels = pairs_df[pairs_df["label"] >= n_items]
if len(invalid_labels) > 0:
    raise RuntimeError(f"Found {len(invalid_labels)} pairs with label >= n_items={n_items}")
print(f"[CELL 03-11] ✅ All labels in vocab [0, {n_items-1}]")

# Check 2: All prefix items in vocab
all_prefix_items = [item for prefix in pairs_df["prefix"] for item in prefix]
max_prefix_item = max(all_prefix_items) if all_prefix_items else -1
if max_prefix_item >= n_items:
    raise RuntimeError(f"Found prefix item {max_prefix_item} >= n_items={n_items}")
print(f"[CELL 03-11] ✅ All prefix items in vocab [0, {n_items-1}]")

# Check 3: No empty prefixes (enforced by MIN_PREFIX_LEN)
empty_prefix = pairs_df[pairs_df["prefix_len"] < MIN_PREFIX_LEN]
if len(empty_prefix) > 0:
    raise RuntimeError(f"Found {len(empty_prefix)} pairs with prefix_len < {MIN_PREFIX_LEN}")
print(f"[CELL 03-11] ✅ All prefixes have length >= {MIN_PREFIX_LEN}")

# Check 4: User-level stats
user_pair_counts = pairs_df.groupby("user_id").size()
print(f"\n[CELL 03-11] User-level pair counts:")
print(f"  Total users with pairs: {len(user_pair_counts):,}")
print(f"  Min pairs/user: {user_pair_counts.min()}")
print(f"  p50 pairs/user: {user_pair_counts.quantile(0.50):.0f}")
print(f"  p90 pairs/user: {user_pair_counts.quantile(0.90):.0f}")
print(f"  Max pairs/user: {user_pair_counts.max()}")

cell_end("CELL 03-11", t0)


[CELL 03-11] Validation checks
[CELL 03-11] start=2026-01-07T10:27:07
[CELL 03-11] ✅ All labels in vocab [0, 342]
[CELL 03-11] ✅ All prefix items in vocab [0, 342]
[CELL 03-11] ✅ All prefixes have length >= 1

[CELL 03-11] User-level pair counts:
  Total users with pairs: 42,171
  Min pairs/user: 1
  p50 pairs/user: 2
  p90 pairs/user: 13
  Max pairs/user: 1318
[CELL 03-11] elapsed=0.26s
[CELL 03-11] done
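
The checks in CELL 03-11 cover vocabulary membership and prefix length; chronology is enforced when the pairs are built in CELL 03-08 rather than re-checked here. Since the pairs table stores only `label_ts_epoch`, a full prefix-versus-label comparison is not possible from the pairs alone, but one necessary condition can be tested: within a session, `label_ts_epoch` should strictly increase as `prefix_len` grows. The sketch below is a hedged illustration on toy data; the function name `sessions_with_nonincreasing_labels` is hypothetical.

```python
import pandas as pd

def sessions_with_nonincreasing_labels(pairs: pd.DataFrame) -> list:
    """Return session_ids whose label timestamps fail to strictly increase."""
    ordered = pairs.sort_values(["session_id", "prefix_len"])
    # Difference to the previous pair of the same session (NaN for the first pair).
    diffs = ordered.groupby("session_id")["label_ts_epoch"].diff()
    bad = ordered.loc[diffs <= 0, "session_id"]
    return sorted(bad.unique().tolist())

# Toy pairs: session "a" is fine, session "b" has a repeated label timestamp.
toy_pairs_df = pd.DataFrame({
    "session_id":     ["a", "a", "a", "b", "b"],
    "prefix_len":     [1, 2, 3, 1, 2],
    "label_ts_epoch": [10, 20, 30, 50, 50],
})

bad_sessions = sessions_with_nonincreasing_labels(toy_pairs_df)
print(bad_sessions)  # ['b']
```

An empty result does not prove the absence of leakage; it only rules out one class of ordering errors.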


In [13]:
# [CELL 03-12] Update report + manifest

t0 = cell_start("CELL 03-12", "Write report + manifest")

report = read_json(REPORT_PATH)
manifest = read_json(MANIFEST_PATH)

# Metrics
report["metrics"] = {
    "n_items": n_items,
    "n_pairs": int(len(pairs_df)),
    "n_users_with_pairs": int(len(user_pair_counts)),
    "min_prefix_len": MIN_PREFIX_LEN,
    "prefix_len_p50": float(pairs_df["prefix_len"].quantile(0.50)),
    "prefix_len_p90": float(pairs_df["prefix_len"].quantile(0.90)),
    "prefix_len_max": int(pairs_df["prefix_len"].max()),
    "deduplicate_consecutive": DEDUPE,
}

# Key findings
report["key_findings"].append(
    f"Built course vocabulary with {n_items} items (alphabetical ordering). "
    f"Created {len(pairs_df):,} prefix→label pairs for next-course prediction. "
    f"All pairs validated: chronological ordering, no future leakage, all items in vocab."
)

if DEDUPE:
    report["key_findings"].append(
        "Consecutive duplicate courses removed from sequences (e.g., [A,A,B] → [A,B]). "
        "This focuses prediction on course transitions, not re-engagement patterns."
    )

# Sanity samples
report["sanity_samples"]["pairs_head3"] = pairs_df.head(3).to_dict(orient="records")
report["sanity_samples"]["vocab_sample"] = {
    "first_5_courses": unique_courses[:5],
    "last_5_courses": unique_courses[-5:],
}

# Fingerprints
report["data_fingerprints"]["course2id"] = {"path": str(OUT_COURSE2ID), "sha256": course2id_sha}
report["data_fingerprints"]["id2course"] = {"path": str(OUT_ID2COURSE), "sha256": id2course_sha}
report["data_fingerprints"]["pairs"] = {
    "path": str(OUT_PAIRS_PARQUET),
    "bytes": pairs_bytes,
    "sha256": pairs_sha,
}

write_json_atomic(REPORT_PATH, report)

# Manifest
def add_artifact(path: Path) -> None:
    rec = {"path": str(path), "bytes": int(path.stat().st_size), "sha256": None, "sha256_error": None}
    try:
        rec["sha256"] = sha256_file(path)
    except PermissionError as e:
        rec["sha256_error"] = f"PermissionError: {e}"
    manifest["artifacts"].append(rec)

add_artifact(OUT_COURSE2ID)
add_artifact(OUT_ID2COURSE)
add_artifact(OUT_PAIRS_PARQUET)

write_json_atomic(MANIFEST_PATH, manifest)

print(f"[CELL 03-12] Updated: {REPORT_PATH}")
print(f"[CELL 03-12] Updated: {MANIFEST_PATH}")

cell_end("CELL 03-12", t0)


[CELL 03-12] Write report + manifest
[CELL 03-12] start=2026-01-07T10:27:08
[CELL 03-12] Updated: C:\anonymous-users-mooc-session-meta\reports\03_build_vocab_pairs_xuetangx\20260107_102533\report.json
[CELL 03-12] Updated: C:\anonymous-users-mooc-session-meta\reports\03_build_vocab_pairs_xuetangx\20260107_102533\manifest.json
[CELL 03-12] elapsed=0.04s
[CELL 03-12] done
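
A downstream notebook could confirm that the artifacts are unchanged by recomputing the hashes recorded in `manifest.json`. The sketch below assumes the manifest layout written in CELL 03-12 (an `artifacts` list whose records have `path` and `sha256`); the function name `verify_manifest` and the toy files are hypothetical.

```python
import hashlib
import json
import tempfile
from pathlib import Path
from typing import Dict

def sha256_file(path: Path, chunk_size: int = 1024 * 1024) -> str:
    """Stream a file through SHA-256 and return the hex digest."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def verify_manifest(manifest_path: Path) -> Dict[str, bool]:
    """Map each artifact path to True if the file exists and its hash matches."""
    manifest = json.loads(manifest_path.read_text(encoding="utf-8"))
    results = {}
    for rec in manifest["artifacts"]:
        p = Path(rec["path"])
        results[rec["path"]] = bool(
            p.exists() and rec["sha256"] is not None and sha256_file(p) == rec["sha256"]
        )
    return results

# Toy demonstration: the check passes, then fails once the file is modified.
with tempfile.TemporaryDirectory() as d:
    art = Path(d) / "artifact.bin"
    art.write_bytes(b"hello")
    man = Path(d) / "manifest.json"
    record = {"path": str(art), "sha256": sha256_file(art)}
    man.write_text(json.dumps({"artifacts": [record]}), encoding="utf-8")
    ok_before = verify_manifest(man)[str(art)]
    art.write_bytes(b"changed")
    ok_after = verify_manifest(man)[str(art)]

print(ok_before, ok_after)  # True False
```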


## ✅ Notebook 03 Complete

**Outputs:**
- ✅ `data/processed/xuetangx/vocab/course2id.json` (343 courses)
- ✅ `data/processed/xuetangx/vocab/id2course.json`
- ✅ `data/processed/xuetangx/pairs/pairs.parquet` (all prefix→label pairs)
- ✅ DuckDB view: `xuetangx_pairs`
- ✅ `reports/03_build_vocab_pairs_xuetangx/<run_tag>/report.json`

**Validation Passed:**
- ✅ All labels in vocab [0, n_items-1]
- ✅ All prefix items in vocab
- ✅ Chronological ordering (prefix_max_ts < label_ts), enforced when pairs are built in CELL 03-08
- ✅ No future leakage (pairs that violate the ordering are skipped)

**Next:** Notebook 04 (User Split)
- Deterministic user-level split (80/10/10)
- Disjoint train/val/test users (cold-start guarantee)
- Split pairs by user assignment
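
Notebook 04's actual procedure is not shown here. As one possible approach to a deterministic user-level split, each `user_id` can be hashed into a bucket, which needs no random state and gives every user exactly one split. Everything in this sketch is an assumption: the function name `assign_split`, the salt string, and the use of SHA-256.

```python
import hashlib
from collections import Counter

def assign_split(user_id, salt: str = "xuetangx-split-v1") -> str:
    """Deterministically map a user to train/val/test (roughly 80/10/10)."""
    digest = hashlib.sha256(f"{salt}:{user_id}".encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) % 100  # integer in [0, 99]
    if bucket < 80:
        return "train"
    if bucket < 90:
        return "val"
    return "test"

# The same user always lands in the same split, so the splits are disjoint by user.
split_counts = Counter(assign_split(u) for u in range(10_000))
train_share = split_counts["train"] / 10_000
print(sorted(split_counts))
print(f"train share: {train_share:.3f}")
```

With this scheme, pairs would then be split by looking up each pair's `user_id`; changing the salt produces a different but equally reproducible assignment.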