# Notebook 05: Episode Index (XuetangX)

**Purpose:** Create episodic meta-learning indices for K-shot learning.

**Cold-Start Focus:**
- **User-as-task**: Each episode represents one user's learning task
- **Support set**: K pairs from the user's history (for adaptation)
- **Query set**: Q pairs from the user's history (for evaluation)
- **Chronological ordering**: Support timestamps < Query timestamps (no future leakage)

**Inputs:**
- `data/processed/xuetangx/pairs/pairs_train.parquet` (139,349 pairs, 28,633 users)
- `data/processed/xuetangx/pairs/pairs_val.parquet` (17,848 pairs, 3,579 users)
- `data/processed/xuetangx/pairs/pairs_test.parquet` (18,324 pairs, 3,580 users)

**Outputs:**
- `data/processed/xuetangx/episodes/episodes_train_K{K}_Q{Q}.parquet`
- `data/processed/xuetangx/episodes/episodes_val_K{K}_Q{Q}.parquet`
- `data/processed/xuetangx/episodes/episodes_test_K{K}_Q{Q}.parquet`
- DuckDB views: `xuetangx_episodes_train_K{K}_Q{Q}`, etc.
- `reports/05_episode_index_xuetangx/<run_tag>/report.json`

**Strategy:**
1. Filter users by minimum pairs: ≥K+Q pairs required
2. For each eligible user, create episodes:
   - **Train**: Multiple episodes per user (sliding window of K+Q pairs, stride 1)
   - **Val/Test**: Single episode per user (last K+Q pairs)
3. Each episode:
   - Support: the first K pairs of the window
   - Query: the next Q pairs (chronologically after support)
4. Validate: support_max_ts < query_min_ts (no future leakage); windows that fail the check are skipped

**K-Shot Configurations:**
- K=5, Q=10 (primary, 15 pairs minimum)
- K=10, Q=20 (secondary, 30 pairs minimum)
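
The windowing strategy above can be sketched in plain Python. This is a minimal illustration rather than the notebook's implementation: `episode_windows` is a hypothetical helper that works on positions in one user's chronologically sorted history instead of a DataFrame. Under these assumptions, a user with `n >= K+Q` pairs yields `(n - (K+Q)) // stride + 1` train windows and exactly one val/test window.

```python
from typing import List, Tuple

def episode_windows(n_pairs: int, K: int, Q: int, mode: str,
                    stride: int = 1) -> List[Tuple[range, range]]:
    """Return (support_positions, query_positions) for one user's sorted history.

    Hypothetical helper for illustration; positions index the user's pairs
    after sorting by timestamp.
    """
    need = K + Q
    if n_pairs < need:
        return []  # user is not eligible
    if mode == "train":
        starts = range(0, n_pairs - need + 1, stride)  # sliding window
    else:
        starts = [n_pairs - need]  # val/test: last K+Q pairs only
    return [(range(s, s + K), range(s + K, s + need)) for s in starts]

# A user with 17 pairs under K=5, Q=10 has 17 - 15 + 1 = 3 train windows.
train = episode_windows(17, K=5, Q=10, mode="train")
test = episode_windows(17, K=5, Q=10, mode="test")
print(len(train), list(test[0][0]), list(test[0][1]))
```

Because consecutive stride-1 windows share all but one pair, train episodes from the same user overlap heavily; the count formula only says how many windows exist, not how much independent signal they carry.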

In [1]:
# [CELL 05-00] Bootstrap: repo root + paths + logger

import os
import sys
import json
import time
import uuid
import hashlib
from pathlib import Path
from datetime import datetime
from typing import Any, Dict, List

import numpy as np
import pandas as pd

t0 = datetime.now()
print(f"[CELL 05-00] start={t0.isoformat(timespec='seconds')}")
print("[CELL 05-00] CWD:", Path.cwd().resolve())

def find_repo_root(start: Path) -> Path:
    start = start.resolve()
    for p in [start, *start.parents]:
        if (p / "PROJECT_STATE.md").exists():
            return p
    raise RuntimeError("Could not find PROJECT_STATE.md. Open notebook from within the repo.")

REPO_ROOT = find_repo_root(Path.cwd())
print("[CELL 05-00] REPO_ROOT:", REPO_ROOT)

PATHS = {
    "META_REGISTRY": REPO_ROOT / "meta.json",
    "DATA_INTERIM": REPO_ROOT / "data" / "interim",
    "DATA_PROCESSED": REPO_ROOT / "data" / "processed",
    "REPORTS": REPO_ROOT / "reports",
}
for k, v in PATHS.items():
    print(f"[CELL 05-00] {k}={v}")

def cell_start(cell_id: str, title: str, **kwargs: Any) -> float:
    t = time.time()
    print(f"\n[{cell_id}] {title}")
    print(f"[{cell_id}] start={datetime.now().isoformat(timespec='seconds')}")
    for k, v in kwargs.items():
        print(f"[{cell_id}] {k}={v}")
    return t

def cell_end(cell_id: str, t0: float, **kwargs: Any) -> None:
    for k, v in kwargs.items():
        print(f"[{cell_id}] {k}={v}")
    print(f"[{cell_id}] elapsed={time.time()-t0:.2f}s")
    print(f"[{cell_id}] done")

print("[CELL 05-00] done")

[CELL 05-00] start=2026-01-30T07:52:21
[CELL 05-00] CWD: /workspace/anonymous-users-mooc-session-meta/notebooks
[CELL 05-00] REPO_ROOT: /workspace/anonymous-users-mooc-session-meta
[CELL 05-00] META_REGISTRY=/workspace/anonymous-users-mooc-session-meta/meta.json
[CELL 05-00] DATA_INTERIM=/workspace/anonymous-users-mooc-session-meta/data/interim
[CELL 05-00] DATA_PROCESSED=/workspace/anonymous-users-mooc-session-meta/data/processed
[CELL 05-00] REPORTS=/workspace/anonymous-users-mooc-session-meta/reports
[CELL 05-00] done


In [2]:
# [CELL 05-01] Reproducibility: seed everything

t0 = cell_start("CELL 05-01", "Seed everything")

GLOBAL_SEED = 20260107

def seed_everything(seed: int) -> None:
    import random
    random.seed(seed)
    np.random.seed(seed)

seed_everything(GLOBAL_SEED)

cell_end("CELL 05-01", t0, seed=GLOBAL_SEED)


[CELL 05-01] Seed everything
[CELL 05-01] start=2026-01-30T07:52:21
[CELL 05-01] seed=20260107
[CELL 05-01] elapsed=0.00s
[CELL 05-01] done


In [3]:
# [CELL 05-02] JSON IO + hashing helpers

t0 = cell_start("CELL 05-02", "JSON IO + hashing")

def write_json_atomic(path: Path, obj: Any, indent: int = 2) -> None:
    path.parent.mkdir(parents=True, exist_ok=True)
    tmp = path.with_suffix(path.suffix + f".tmp_{uuid.uuid4().hex}")
    with tmp.open("w", encoding="utf-8") as f:
        json.dump(obj, f, ensure_ascii=False, indent=indent)
    tmp.replace(path)

def read_json(path: Path) -> Any:
    if not path.exists():
        raise RuntimeError(f"Missing JSON file: {path}")
    with path.open("r", encoding="utf-8") as f:
        return json.load(f)

def sha256_file(path: Path, chunk_size: int = 1024 * 1024) -> str:
    h = hashlib.sha256()
    with path.open("rb") as f:
        while True:
            b = f.read(chunk_size)
            if not b:
                break
            h.update(b)
    return h.hexdigest()

cell_end("CELL 05-02", t0)


[CELL 05-02] JSON IO + hashing
[CELL 05-02] start=2026-01-30T07:52:21
[CELL 05-02] elapsed=0.00s
[CELL 05-02] done
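
The write-to-temp-then-rename pattern behind `write_json_atomic` can be exercised in isolation. The sketch below is a self-contained illustration, assuming the temp file and the target sit on the same filesystem (otherwise the rename is not guaranteed to be atomic). `atomic_write_json` and the throwaway directory are names made up for this example.

```python
import json
import tempfile
import uuid
from pathlib import Path

def atomic_write_json(path: Path, obj) -> None:
    # Write to a uniquely named sibling file first, then swap it into place.
    tmp = path.with_suffix(path.suffix + f".tmp_{uuid.uuid4().hex}")
    tmp.write_text(json.dumps(obj, ensure_ascii=False, indent=2), encoding="utf-8")
    tmp.replace(path)  # readers see either the old file or the new one

with tempfile.TemporaryDirectory() as d:
    target = Path(d) / "report.json"
    atomic_write_json(target, {"metrics": {}})
    atomic_write_json(target, {"metrics": {"n_episodes": 3}})  # overwrite in place
    roundtrip = json.loads(target.read_text(encoding="utf-8"))
    leftovers = [p.name for p in Path(d).iterdir() if ".tmp_" in p.name]

print(roundtrip, leftovers)
```

If the process dies between the write and the rename, a `.tmp_*` file can be left behind, but the target file is never half-written.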


In [4]:
# [CELL 05-03] Run tagging + K-shot config + report/config/manifest + meta.json

t0 = cell_start("CELL 05-03", "Start run + init files + meta.json")

NOTEBOOK_NAME = "05_episode_index_xuetangx"
RUN_TAG = datetime.now().strftime("%Y%m%d_%H%M%S")
RUN_ID = uuid.uuid4().hex

OUT_DIR = PATHS["REPORTS"] / NOTEBOOK_NAME / RUN_TAG
OUT_DIR.mkdir(parents=True, exist_ok=True)

REPORT_PATH = OUT_DIR / "report.json"
CONFIG_PATH = OUT_DIR / "config.json"
MANIFEST_PATH = OUT_DIR / "manifest.json"

DUCKDB_PATH = PATHS["DATA_INTERIM"] / "xuetangx.duckdb"
PAIRS_DIR = PATHS["DATA_PROCESSED"] / "xuetangx" / "pairs"
EPISODES_DIR = PATHS["DATA_PROCESSED"] / "xuetangx" / "episodes"
EPISODES_DIR.mkdir(parents=True, exist_ok=True)

PAIRS_TRAIN = PAIRS_DIR / "pairs_train.parquet"
PAIRS_VAL = PAIRS_DIR / "pairs_val.parquet"
PAIRS_TEST = PAIRS_DIR / "pairs_test.parquet"

# K-shot configurations
K_SHOT_CONFIGS = [
    {"K": 5, "Q": 10},   # Primary: 15 pairs minimum
    {"K": 10, "Q": 20},  # Secondary: 30 pairs minimum
]

CFG = {
    "notebook": NOTEBOOK_NAME,
    "run_id": RUN_ID,
    "run_tag": RUN_TAG,
    "seed": GLOBAL_SEED,
    "inputs": {
        "pairs_train": str(PAIRS_TRAIN),
        "pairs_val": str(PAIRS_VAL),
        "pairs_test": str(PAIRS_TEST),
    },
    "k_shot_configs": K_SHOT_CONFIGS,
    "episode_strategy": {
        "train": "multiple_episodes_per_user",  # Sliding window: creates multiple episodes
        "val": "single_episode_per_user",       # Last K+Q pairs only
        "test": "single_episode_per_user",      # Last K+Q pairs only
        "train_stride": 1,  # Slide by 1 pair at a time (maximum training data)
    },
    "outputs": {
        "episodes_dir": str(EPISODES_DIR),
        "out_dir": str(OUT_DIR),
    }
}

write_json_atomic(CONFIG_PATH, CFG)

report = {
    "run_id": RUN_ID,
    "notebook": NOTEBOOK_NAME,
    "run_tag": RUN_TAG,
    "created_at": datetime.now().isoformat(timespec="seconds"),
    "repo_root": str(REPO_ROOT),
    "metrics": {},
    "key_findings": [],
    "sanity_samples": {},
    "data_fingerprints": {},
    "notes": [],
}
write_json_atomic(REPORT_PATH, report)

manifest = {"run_id": RUN_ID, "notebook": NOTEBOOK_NAME, "run_tag": RUN_TAG, "artifacts": []}
write_json_atomic(MANIFEST_PATH, manifest)

# meta.json append-only
META_PATH = PATHS["META_REGISTRY"]
if not META_PATH.exists():
    write_json_atomic(META_PATH, {"schema_version": 1, "runs": []})
meta = read_json(META_PATH)
meta["runs"].append({
    "run_id": RUN_ID,
    "notebook": NOTEBOOK_NAME,
    "run_tag": RUN_TAG,
    "out_dir": str(OUT_DIR),
    "created_at": datetime.now().isoformat(timespec="seconds"),
})
write_json_atomic(META_PATH, meta)

print(f"[CELL 05-03] K-shot configs: {K_SHOT_CONFIGS}")

cell_end("CELL 05-03", t0, out_dir=str(OUT_DIR))


[CELL 05-03] Start run + init files + meta.json
[CELL 05-03] start=2026-01-30T07:52:21
[CELL 05-03] K-shot configs: [{'K': 5, 'Q': 10}, {'K': 10, 'Q': 20}]
[CELL 05-03] out_dir=/workspace/anonymous-users-mooc-session-meta/reports/05_episode_index_xuetangx/20260130_075221
[CELL 05-03] elapsed=0.00s
[CELL 05-03] done


In [5]:
# [CELL 05-04] Load split pairs from Notebook 04

t0 = cell_start("CELL 05-04", "Load split pairs")

# Fail fast, and name the missing file and its expected location
for _p in (PAIRS_TRAIN, PAIRS_VAL, PAIRS_TEST):
    if not _p.exists():
        raise RuntimeError(f"Missing {_p.name} at {_p}. Run Notebook 04 first.")

pairs_train = pd.read_parquet(PAIRS_TRAIN)
pairs_val = pd.read_parquet(PAIRS_VAL)
pairs_test = pd.read_parquet(PAIRS_TEST)

print(f"[CELL 05-04] Loaded pairs_train: {pairs_train.shape[0]:,} pairs, {pairs_train['user_id'].nunique():,} users")
print(f"[CELL 05-04] Loaded pairs_val:   {pairs_val.shape[0]:,} pairs, {pairs_val['user_id'].nunique():,} users")
print(f"[CELL 05-04] Loaded pairs_test:  {pairs_test.shape[0]:,} pairs, {pairs_test['user_id'].nunique():,} users")

cell_end("CELL 05-04", t0)


[CELL 05-04] Load split pairs
[CELL 05-04] start=2026-01-30T07:52:21
[CELL 05-04] Loaded pairs_train: 139,349 pairs, 28,633 users
[CELL 05-04] Loaded pairs_val:   17,848 pairs, 3,579 users
[CELL 05-04] Loaded pairs_test:  18,324 pairs, 3,580 users
[CELL 05-04] elapsed=0.10s
[CELL 05-04] done
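
Before building episodes it can help to estimate how many users will survive the `K+Q` filter. The sketch below uses only the standard library and toy data (not XuetangX records); `eligible_users` is a hypothetical helper that takes one user id per pair.

```python
from collections import Counter

def eligible_users(user_ids, K, Q):
    """Users with at least K+Q pairs; user_ids holds one entry per pair."""
    counts = Counter(user_ids)
    return {u for u, n in counts.items() if n >= K + Q}

# Toy data: u1 has 15 pairs, u2 has 14, u3 has 30.
toy = ["u1"] * 15 + ["u2"] * 14 + ["u3"] * 30
for K, Q in [(5, 10), (10, 20)]:
    print(K, Q, sorted(eligible_users(toy, K, Q)))
```

On the loaded DataFrames the same count would presumably be `pairs_train.groupby("user_id").size().ge(K + Q).sum()`.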


In [6]:
# [CELL 05-05] Episode creation function (chronological support→query)

t0 = cell_start("CELL 05-05", "Define episode creation function")

def create_episodes(
    pairs_df: pd.DataFrame,
    K: int,
    Q: int,
    mode: str,  # 'train', 'val', or 'test'
    stride: int = 1,
) -> List[Dict[str, Any]]:
    """
    Create episodic meta-learning indices.
    
    Args:
        pairs_df: DataFrame with columns [pair_id, user_id, prefix, label, label_ts_epoch, ...]
        K: Number of support pairs
        Q: Number of query pairs
        mode: 'train' (multiple episodes), 'val'/'test' (single episode)
        stride: For train mode, stride for sliding window (default 1)
    
    Returns:
        List of episode dictionaries
    """
    if mode not in {"train", "val", "test"}:
        raise ValueError(f"Unknown mode: {mode!r}")
    min_pairs = K + Q
    episodes = []
    episode_id = 0
    
    # Group by user and sort by timestamp
    for user_id, group in pairs_df.groupby("user_id"):
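        # Sort pairs by timestamp (chronological). A stable sort keeps pairs with
        # tied timestamps in their input order, so episodes are reproducible.
        group = group.sort_values("label_ts_epoch", kind="mergesort").reset_index(drop=True)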
        
        n_pairs = len(group)
        
        # Filter: user must have at least K+Q pairs
        if n_pairs < min_pairs:
            continue
        
        if mode == "train":
            # Multiple episodes per user (sliding window)
            # Start index can be any position where we have K+Q pairs remaining
            for start_idx in range(0, n_pairs - min_pairs + 1, stride):
                support_pairs = group.iloc[start_idx:start_idx+K]
                query_pairs = group.iloc[start_idx+K:start_idx+K+Q]
                
                # Chronological validation
                support_max_ts = support_pairs["label_ts_epoch"].max()
                query_min_ts = query_pairs["label_ts_epoch"].min()
                
                if support_max_ts >= query_min_ts:
                    # Sorting guarantees <=, not <: a timestamp tie at the
                    # support/query boundary lands here and the window is skipped
                    continue
                
                episodes.append({
                    "episode_id": episode_id,
                    "user_id": user_id,
                    "K": K,
                    "Q": Q,
                    "support_pair_ids": support_pairs["pair_id"].tolist(),
                    "query_pair_ids": query_pairs["pair_id"].tolist(),
                    "support_max_ts": int(support_max_ts),
                    "query_min_ts": int(query_min_ts),
                })
                episode_id += 1
        
        else:  # val or test
            # Single episode per user: last K+Q pairs
            support_pairs = group.iloc[-min_pairs:-Q]
            query_pairs = group.iloc[-Q:]
            
            # Chronological validation
            support_max_ts = support_pairs["label_ts_epoch"].max()
            query_min_ts = query_pairs["label_ts_epoch"].min()
            
            if support_max_ts >= query_min_ts:
                # A timestamp tie at the support/query boundary lands here;
                # the user then gets no episode in this split
                continue
            
            episodes.append({
                "episode_id": episode_id,
                "user_id": user_id,
                "K": K,
                "Q": Q,
                "support_pair_ids": support_pairs["pair_id"].tolist(),
                "query_pair_ids": query_pairs["pair_id"].tolist(),
                "support_max_ts": int(support_max_ts),
                "query_min_ts": int(query_min_ts),
            })
            episode_id += 1
    
    return episodes

print("[CELL 05-05] Episode creation function defined")
print("[CELL 05-05] Strategy:")
print("  - Train: Multiple episodes per user (sliding window, stride=1)")
print("  - Val/Test: Single episode per user (last K+Q pairs)")
print("  - Chronological: support_max_ts < query_min_ts (no future leakage)")

cell_end("CELL 05-05", t0)


[CELL 05-05] Define episode creation function
[CELL 05-05] start=2026-01-30T07:52:21
[CELL 05-05] Episode creation function defined
[CELL 05-05] Strategy:
  - Train: Multiple episodes per user (sliding window, stride=1)
  - Val/Test: Single episode per user (last K+Q pairs)
  - Chronological: support_max_ts < query_min_ts (no future leakage)
[CELL 05-05] elapsed=0.00s
[CELL 05-05] done
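
One consequence of the strict `support_max_ts < query_min_ts` test is that a window is dropped whenever the last support pair and the first query pair share a timestamp, even though the data is sorted. The sketch below reproduces that check on toy timestamps; `valid_windows` is a hypothetical helper, not part of the notebook.

```python
from typing import List, Tuple

def valid_windows(ts: List[int], K: int, Q: int) -> List[Tuple[int, bool]]:
    """For each stride-1 window over sorted timestamps, report (start, kept)."""
    out = []
    for start in range(0, len(ts) - (K + Q) + 1):
        support_max = max(ts[start:start + K])
        query_min = min(ts[start + K:start + K + Q])
        out.append((start, support_max < query_min))  # same strict test as above
    return out

# Toy timestamps: positions 1 and 2 share the value 20.
ts = [10, 20, 20, 30, 40]
result = valid_windows(ts, K=2, Q=2)
print(result)
```

Here the window starting at position 0 is rejected because the tie straddles the support/query boundary, while the window starting at position 1 is kept because both tied pairs fall inside the support set.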


In [7]:
# [CELL 05-06] Create episodes for all K-shot configs and splits

t0 = cell_start("CELL 05-06", "Create episodes for all configs")

all_episode_files = []
all_episode_stats = []

for cfg in K_SHOT_CONFIGS:
    K = cfg["K"]
    Q = cfg["Q"]
    
    print(f"\n[CELL 05-06] ========== K={K}, Q={Q} ==========")
    
    # Train episodes
    print(f"[CELL 05-06] Creating train episodes (K={K}, Q={Q})...")
    episodes_train = create_episodes(
        pairs_train, 
        K=K, 
        Q=Q, 
        mode="train", 
        stride=CFG["episode_strategy"]["train_stride"]
    )
    episodes_train_df = pd.DataFrame(episodes_train)
    
    out_train = EPISODES_DIR / f"episodes_train_K{K}_Q{Q}.parquet"
    episodes_train_df.to_parquet(out_train, index=False, compression="zstd")
    print(f"[CELL 05-06] Train: {len(episodes_train_df):,} episodes, {episodes_train_df['user_id'].nunique():,} users")
    print(f"[CELL 05-06] Saved: {out_train.name} ({out_train.stat().st_size / 1024 / 1024:.1f} MB)")
    
    all_episode_files.append(("train", K, Q, out_train))
    all_episode_stats.append({
        "split": "train",
        "K": K,
        "Q": Q,
        "n_episodes": len(episodes_train_df),
        "n_users": int(episodes_train_df["user_id"].nunique()),
        "episodes_per_user_mean": float(len(episodes_train_df) / episodes_train_df["user_id"].nunique()),
    })
    
    # Val episodes
    print(f"[CELL 05-06] Creating val episodes (K={K}, Q={Q})...")
    episodes_val = create_episodes(pairs_val, K=K, Q=Q, mode="val")
    episodes_val_df = pd.DataFrame(episodes_val)
    
    out_val = EPISODES_DIR / f"episodes_val_K{K}_Q{Q}.parquet"
    episodes_val_df.to_parquet(out_val, index=False, compression="zstd")
    print(f"[CELL 05-06] Val:   {len(episodes_val_df):,} episodes, {episodes_val_df['user_id'].nunique():,} users")
    print(f"[CELL 05-06] Saved: {out_val.name} ({out_val.stat().st_size / 1024 / 1024:.1f} MB)")
    
    all_episode_files.append(("val", K, Q, out_val))
    all_episode_stats.append({
        "split": "val",
        "K": K,
        "Q": Q,
        "n_episodes": len(episodes_val_df),
        "n_users": int(episodes_val_df["user_id"].nunique()),
    })
    
    # Test episodes
    print(f"[CELL 05-06] Creating test episodes (K={K}, Q={Q})...")
    episodes_test = create_episodes(pairs_test, K=K, Q=Q, mode="test")
    episodes_test_df = pd.DataFrame(episodes_test)
    
    out_test = EPISODES_DIR / f"episodes_test_K{K}_Q{Q}.parquet"
    episodes_test_df.to_parquet(out_test, index=False, compression="zstd")
    print(f"[CELL 05-06] Test:  {len(episodes_test_df):,} episodes, {episodes_test_df['user_id'].nunique():,} users")
    print(f"[CELL 05-06] Saved: {out_test.name} ({out_test.stat().st_size / 1024 / 1024:.1f} MB)")
    
    all_episode_files.append(("test", K, Q, out_test))
    all_episode_stats.append({
        "split": "test",
        "K": K,
        "Q": Q,
        "n_episodes": len(episodes_test_df),
        "n_users": int(episodes_test_df["user_id"].nunique()),
    })

print(f"\n[CELL 05-06] Created episodes for {len(K_SHOT_CONFIGS)} K-shot configs x 3 splits = {len(all_episode_files)} files")

cell_end("CELL 05-06", t0)


[CELL 05-06] Create episodes for all configs
[CELL 05-06] start=2026-01-30T07:52:21

[CELL 05-06] Creating train episodes (K=5, Q=10)...
[CELL 05-06] Train: 30,895 episodes, 1,859 users
[CELL 05-06] Saved: episodes_train_K5_Q10.parquet (0.8 MB)
[CELL 05-06] Creating val episodes (K=5, Q=10)...
[CELL 05-06] Val:   258 episodes, 258 users
[CELL 05-06] Saved: episodes_val_K5_Q10.parquet (0.0 MB)
[CELL 05-06] Creating test episodes (K=5, Q=10)...
[CELL 05-06] Test:  248 episodes, 248 users
[CELL 05-06] Saved: episodes_test_K5_Q10.parquet (0.0 MB)

[CELL 05-06] Creating train episodes (K=10, Q=20)...
[CELL 05-06] Train: 14,383 episodes, 597 users
[CELL 05-06] Saved: episodes_train_K10_Q20.parquet (0.7 MB)
[CELL 05-06] Creating val episodes (K=10, Q=20)...
[CELL 05-06] Val:   72 episodes, 72 users
[CELL 05-06] Saved: episodes_val_K10_Q20.parquet (0.0 MB)
[CELL 05-06] Creating test episodes (K=10, Q=20)...
[CELL 05-06] Test:  85 episodes, 85 users
[CELL 05-06] Saved: episodes_test_K10_Q20.pa

In [8]:
# [CELL 05-07] Register DuckDB views for all episode files

t0 = cell_start("CELL 05-07", "Register DuckDB views", duckdb=str(DUCKDB_PATH))

import duckdb

con = duckdb.connect(str(DUCKDB_PATH), read_only=False)

def esc_path(p: Path) -> str:
    return str(p).replace("'", "''")

for split, K, Q, path in all_episode_files:
    view_name = f"xuetangx_episodes_{split}_K{K}_Q{Q}"
    
    con.execute(f"DROP VIEW IF EXISTS {view_name};")
    con.execute(f"""
    CREATE VIEW {view_name} AS
    SELECT * FROM read_parquet('{esc_path(path)}')
    """)
    
    n_rows = int(con.execute(f"SELECT COUNT(*) FROM {view_name}").fetchone()[0])
    print(f"[CELL 05-07] View {view_name}: {n_rows:,} rows")

con.close()
print("\n[CELL 05-07] Closed DuckDB connection")

cell_end("CELL 05-07", t0)


[CELL 05-07] Register DuckDB views
[CELL 05-07] start=2026-01-30T07:52:35
[CELL 05-07] duckdb=/workspace/anonymous-users-mooc-session-meta/data/interim/xuetangx.duckdb
[CELL 05-07] View xuetangx_episodes_train_K5_Q10: 30,895 rows
[CELL 05-07] View xuetangx_episodes_val_K5_Q10: 258 rows
[CELL 05-07] View xuetangx_episodes_test_K5_Q10: 248 rows
[CELL 05-07] View xuetangx_episodes_train_K10_Q20: 14,383 rows
[CELL 05-07] View xuetangx_episodes_val_K10_Q20: 72 rows
[CELL 05-07] View xuetangx_episodes_test_K10_Q20: 85 rows

[CELL 05-07] Closed DuckDB connection
[CELL 05-07] elapsed=0.08s
[CELL 05-07] done


In [9]:
# [CELL 05-08] Validation: chronological ordering check

t0 = cell_start("CELL 05-08", "Validate chronological ordering")

print("[CELL 05-08] Checking all episodes for chronological ordering (support_max_ts < query_min_ts)...\n")

for split, K, Q, path in all_episode_files:
    episodes_df = pd.read_parquet(path)
    
    # Check: support_max_ts < query_min_ts for all episodes
    violations = episodes_df[episodes_df["support_max_ts"] >= episodes_df["query_min_ts"]]
    
    if len(violations) > 0:
        raise RuntimeError(f"Found {len(violations)} chronology violations in {split} K={K} Q={Q}")
    
    print(f"[CELL 05-08] [OK] {split:5s} K={K:2d} Q={Q:2d}: {len(episodes_df):,} episodes, all chronologically valid")

print("\n[CELL 05-08] [OK] All episodes validated: support timestamps < query timestamps")

cell_end("CELL 05-08", t0)


[CELL 05-08] Validate chronological ordering
[CELL 05-08] start=2026-01-30T07:52:35
[CELL 05-08] Checking all episodes for chronological ordering (support_max_ts < query_min_ts)...

[CELL 05-08] [OK] train K= 5 Q=10: 30,895 episodes, all chronologically valid
[CELL 05-08] [OK] val   K= 5 Q=10: 258 episodes, all chronologically valid
[CELL 05-08] [OK] test  K= 5 Q=10: 248 episodes, all chronologically valid
[CELL 05-08] [OK] train K=10 Q=20: 14,383 episodes, all chronologically valid
[CELL 05-08] [OK] val   K=10 Q=20: 72 episodes, all chronologically valid
[CELL 05-08] [OK] test  K=10 Q=20: 85 episodes, all chronologically valid

[CELL 05-08] [OK] All episodes validated: support timestamps < query timestamps
[CELL 05-08] elapsed=0.07s
[CELL 05-08] done
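
The check above covers chronology only. A slightly broader validation, sketched here as an assumption about what else might be worth asserting, also confirms that each episode has exactly K support and Q query pairs and that the two sets do not share a pair id. `check_episode` and the sample records are hypothetical; the field names follow the episode schema used in this notebook.

```python
def check_episode(ep: dict) -> list:
    """Return a list of problems found in one episode record (empty if none)."""
    problems = []
    if len(ep["support_pair_ids"]) != ep["K"]:
        problems.append("support size != K")
    if len(ep["query_pair_ids"]) != ep["Q"]:
        problems.append("query size != Q")
    if set(ep["support_pair_ids"]) & set(ep["query_pair_ids"]):
        problems.append("support/query overlap")
    if ep["support_max_ts"] >= ep["query_min_ts"]:
        problems.append("chronology violation")
    return problems

good = {"K": 2, "Q": 2, "support_pair_ids": [1, 2], "query_pair_ids": [3, 4],
        "support_max_ts": 100, "query_min_ts": 101}
# Same record, but pair 2 leaks into the query set and the timestamps tie.
bad = dict(good, query_pair_ids=[2, 3], query_min_ts=100)
print(check_episode(good), check_episode(bad))
```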


In [10]:
# [CELL 05-09] Episode statistics summary

t0 = cell_start("CELL 05-09", "Episode statistics")

print("[CELL 05-09] Episode counts by K-shot config:\n")

for cfg in K_SHOT_CONFIGS:
    K = cfg["K"]
    Q = cfg["Q"]
    
    print(f"[CELL 05-09] K={K}, Q={Q} (min {K+Q} pairs):")
    
    for split in ["train", "val", "test"]:
        stats = [s for s in all_episode_stats if s["split"] == split and s["K"] == K and s["Q"] == Q][0]
        n_episodes = stats["n_episodes"]
        n_users = stats["n_users"]
        
        if split == "train":
            eps_per_user = stats["episodes_per_user_mean"]
            print(f"  {split:5s}: {n_episodes:6,} episodes, {n_users:5,} users ({eps_per_user:.1f} episodes/user)")
        else:
            print(f"  {split:5s}: {n_episodes:6,} episodes, {n_users:5,} users (1.0 episode/user)")
    
    print()

cell_end("CELL 05-09", t0)


[CELL 05-09] Episode statistics
[CELL 05-09] start=2026-01-30T07:52:35
[CELL 05-09] Episode counts by K-shot config:

[CELL 05-09] K=5, Q=10 (min 15 pairs):
  train: 30,895 episodes, 1,859 users (16.6 episodes/user)
  val  :    258 episodes,   258 users (1.0 episode/user)
  test :    248 episodes,   248 users (1.0 episode/user)

[CELL 05-09] K=10, Q=20 (min 30 pairs):
  train: 14,383 episodes,   597 users (24.1 episodes/user)
  val  :     72 episodes,    72 users (1.0 episode/user)
  test :     85 episodes,    85 users (1.0 episode/user)

[CELL 05-09] elapsed=0.00s
[CELL 05-09] done
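
The episodes-per-user figures can be re-derived from the logged counts, and the same counts give the share of users that pass the `K+Q` filter. The sketch below hard-codes the numbers printed by CELL 05-04 and CELL 05-09; the percentages are computed here and are not taken from any notebook output.

```python
# Counts copied from the CELL 05-04 and CELL 05-09 output of this notebook.
original_users = {"train": 28_633, "val": 3_579, "test": 3_580}
episode_stats = {
    (5, 10): {"train": (30_895, 1_859), "val": (258, 258), "test": (248, 248)},
    (10, 20): {"train": (14_383, 597), "val": (72, 72), "test": (85, 85)},
}

def summarize(episode_stats, original_users):
    """Derive episodes/user and share of eligible users per config and split."""
    rows = []
    for (K, Q), by_split in episode_stats.items():
        for split, (n_episodes, n_users) in by_split.items():
            rows.append({
                "K": K, "Q": Q, "split": split,
                "eps_per_user": round(n_episodes / n_users, 1),
                "eligible_pct": round(100 * n_users / original_users[split], 1),
            })
    return rows

summary = summarize(episode_stats, original_users)
for row in summary:
    print(row)
```

The filter is severe: roughly 6.5% of train users qualify under K=5, Q=10 and about 2.1% under K=10, Q=20, so the val and test episode sets are small.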


In [11]:
# [CELL 05-10] Update report + manifest

t0 = cell_start("CELL 05-10", "Write report + manifest")

report = read_json(REPORT_PATH)
manifest = read_json(MANIFEST_PATH)

# Metrics
report["metrics"]["k_shot_configs"] = K_SHOT_CONFIGS
report["metrics"]["episode_stats"] = all_episode_stats

# Key findings
for cfg in K_SHOT_CONFIGS:
    K = cfg["K"]
    Q = cfg["Q"]
    
    train_stats = [s for s in all_episode_stats if s["split"] == "train" and s["K"] == K and s["Q"] == Q][0]
    val_stats = [s for s in all_episode_stats if s["split"] == "val" and s["K"] == K and s["Q"] == Q][0]
    test_stats = [s for s in all_episode_stats if s["split"] == "test" and s["K"] == K and s["Q"] == Q][0]
    
    report["key_findings"].append(
        f"K={K}, Q={Q}: Created {train_stats['n_episodes']:,} train episodes ({train_stats['n_users']:,} users, "
        f"{train_stats['episodes_per_user_mean']:.1f} eps/user), "
        f"{val_stats['n_episodes']:,} val episodes ({val_stats['n_users']:,} users), "
        f"{test_stats['n_episodes']:,} test episodes ({test_stats['n_users']:,} users). "
        f"All episodes chronologically valid (support_max_ts < query_min_ts)."
    )

# Sanity samples (convert to native Python types for JSON serialization)
sample_episodes_train = pd.read_parquet(EPISODES_DIR / "episodes_train_K5_Q10.parquet")
sample_records = []
for _, row in sample_episodes_train.head(3).iterrows():
    record = {
        "episode_id": int(row["episode_id"]),
        "user_id": str(row["user_id"]),
        "K": int(row["K"]),
        "Q": int(row["Q"]),
        "support_pair_ids": [int(x) for x in row["support_pair_ids"]],
        "query_pair_ids": [int(x) for x in row["query_pair_ids"]],
        "support_max_ts": int(row["support_max_ts"]),
        "query_min_ts": int(row["query_min_ts"]),
    }
    sample_records.append(record)
report["sanity_samples"]["train_episodes_head3_K5_Q10"] = sample_records

# Fingerprints
for split, K, Q, path in all_episode_files:
    key = f"episodes_{split}_K{K}_Q{Q}"
    report["data_fingerprints"][key] = {
        "path": str(path),
        "bytes": int(path.stat().st_size),
        "sha256": sha256_file(path),
    }

write_json_atomic(REPORT_PATH, report)

# Manifest
def add_artifact(path: Path) -> None:
    rec = {"path": str(path), "bytes": int(path.stat().st_size), "sha256": None, "sha256_error": None}
    try:
        rec["sha256"] = sha256_file(path)
    except PermissionError as e:
        rec["sha256_error"] = f"PermissionError: {e}"
    manifest["artifacts"].append(rec)

for split, K, Q, path in all_episode_files:
    add_artifact(path)

write_json_atomic(MANIFEST_PATH, manifest)

print(f"[CELL 05-10] Updated: {REPORT_PATH}")
print(f"[CELL 05-10] Updated: {MANIFEST_PATH}")

cell_end("CELL 05-10", t0)


[CELL 05-10] Write report + manifest
[CELL 05-10] start=2026-01-30T07:52:35
[CELL 05-10] Updated: /workspace/anonymous-users-mooc-session-meta/reports/05_episode_index_xuetangx/20260130_075221/report.json
[CELL 05-10] Updated: /workspace/anonymous-users-mooc-session-meta/reports/05_episode_index_xuetangx/20260130_075221/manifest.json
[CELL 05-10] elapsed=0.03s
[CELL 05-10] done
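
The fingerprints stored in `report.json` are only useful if something re-checks them later. A downstream notebook could do so along the lines of this sketch; `verify_fingerprints` is a hypothetical helper, and the demo runs on a throwaway file rather than the real episode parquets.

```python
import hashlib
import tempfile
from pathlib import Path

def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def verify_fingerprints(fingerprints: dict) -> dict:
    """Map each key to True if the file on disk still matches bytes and sha256."""
    result = {}
    for key, fp in fingerprints.items():
        p = Path(fp["path"])
        result[key] = (p.exists() and p.stat().st_size == fp["bytes"]
                       and sha256_of(p) == fp["sha256"])
    return result

# In practice `fingerprints` would be report["data_fingerprints"].
with tempfile.TemporaryDirectory() as d:
    f = Path(d) / "episodes_demo.parquet"
    f.write_bytes(b"demo")
    fps = {"episodes_demo": {"path": str(f), "bytes": 4, "sha256": sha256_of(f)}}
    before = verify_fingerprints(fps)
    f.write_bytes(b"changed")  # simulate the file being regenerated
    after = verify_fingerprints(fps)

print(before, after)
```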


In [12]:
# [CELL 05-11] Visualizations: plots and tables for reporting

t0 = cell_start("CELL 05-11", "Generate plots and tables")

import matplotlib.pyplot as plt
import seaborn as sns

# Set style
sns.set_style("whitegrid")
plt.rcParams['figure.dpi'] = 300
plt.rcParams['savefig.dpi'] = 300
plt.rcParams['font.size'] = 10

VIZ_DIR = OUT_DIR / "visualizations"
VIZ_DIR.mkdir(exist_ok=True)

print(f"[CELL 05-11] Creating visualizations in {VIZ_DIR}")

# Load episodes for analysis
eps_train_k5 = pd.read_parquet(EPISODES_DIR / "episodes_train_K5_Q10.parquet")
eps_val_k5 = pd.read_parquet(EPISODES_DIR / "episodes_val_K5_Q10.parquet")
eps_test_k5 = pd.read_parquet(EPISODES_DIR / "episodes_test_K5_Q10.parquet")
eps_train_k10 = pd.read_parquet(EPISODES_DIR / "episodes_train_K10_Q20.parquet")
eps_val_k10 = pd.read_parquet(EPISODES_DIR / "episodes_val_K10_Q20.parquet")
eps_test_k10 = pd.read_parquet(EPISODES_DIR / "episodes_test_K10_Q20.parquet")

# ===== PLOT 1: Episode counts by split and K-shot config =====
fig, ax = plt.subplots(figsize=(10, 6))

configs = ['K=5, Q=10', 'K=10, Q=20']
train_counts = [len(eps_train_k5), len(eps_train_k10)]
val_counts = [len(eps_val_k5), len(eps_val_k10)]
test_counts = [len(eps_test_k5), len(eps_test_k10)]

x = range(len(configs))
width = 0.25

ax.bar([i - width for i in x], train_counts, width, label='Train', color='#2ecc71', alpha=0.7, edgecolor='black')
ax.bar(x, val_counts, width, label='Val', color='#3498db', alpha=0.7, edgecolor='black')
ax.bar([i + width for i in x], test_counts, width, label='Test', color='#e74c3c', alpha=0.7, edgecolor='black')

ax.set_ylabel('Number of Episodes', fontsize=11, fontweight='bold')
ax.set_title('Episode Counts by Split and K-Shot Config', fontsize=12, fontweight='bold')
ax.set_xticks(x)
ax.set_xticklabels(configs, fontsize=10, fontweight='bold')
ax.legend(loc='upper right', fontsize=10)
ax.grid(axis='y', alpha=0.3)

# Add value labels
for i, (train, val, test) in enumerate(zip(train_counts, val_counts, test_counts)):
    ax.text(i - width, train + 1000, f'{train:,}', ha='center', va='bottom', fontsize=8, fontweight='bold')
    ax.text(i, val + 10, f'{val:,}', ha='center', va='bottom', fontsize=8, fontweight='bold')
    ax.text(i + width, test + 10, f'{test:,}', ha='center', va='bottom', fontsize=8, fontweight='bold')

plt.tight_layout()
plt.savefig(VIZ_DIR / "fig1_episode_counts.png", bbox_inches='tight')
plt.close()
print(f"[CELL 05-11] Saved: fig1_episode_counts.png")

# ===== PLOT 2: Episodes per user distribution (K=5, Q=10 train only) =====
fig, ax = plt.subplots(figsize=(10, 6))

user_episode_counts = eps_train_k5.groupby('user_id').size()

ax.hist(user_episode_counts, bins=50, color='#2ecc71', alpha=0.7, edgecolor='black')
ax.axvline(user_episode_counts.median(), color='red', linestyle='--', linewidth=2, 
           label=f'Median={user_episode_counts.median():.0f}')
ax.axvline(user_episode_counts.quantile(0.90), color='orange', linestyle='--', linewidth=2, 
           label=f'p90={user_episode_counts.quantile(0.90):.0f}')
ax.set_xlabel('Episodes per User', fontsize=11, fontweight='bold')
ax.set_ylabel('Number of Users', fontsize=11, fontweight='bold')
ax.set_title('Train Episodes per User Distribution (K=5, Q=10)', fontsize=12, fontweight='bold')
ax.legend(loc='upper right', fontsize=10)
ax.grid(axis='y', alpha=0.3)

plt.tight_layout()
plt.savefig(VIZ_DIR / "fig2_episodes_per_user.png", bbox_inches='tight')
plt.close()
print(f"[CELL 05-11] Saved: fig2_episodes_per_user.png")

# ===== PLOT 3: User eligibility comparison (original vs filtered) =====
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))

# K=5, Q=10
original_users_k5 = [pairs_train['user_id'].nunique(), pairs_val['user_id'].nunique(), pairs_test['user_id'].nunique()]
eligible_users_k5 = [eps_train_k5['user_id'].nunique(), eps_val_k5['user_id'].nunique(), eps_test_k5['user_id'].nunique()]
splits = ['Train', 'Val', 'Test']
x = range(len(splits))
width = 0.35

ax1.bar([i - width/2 for i in x], original_users_k5, width, label='Original', color='#95a5a6', alpha=0.7, edgecolor='black')
ax1.bar([i + width/2 for i in x], eligible_users_k5, width, label='Eligible (>=15 pairs)', color='#2ecc71', alpha=0.7, edgecolor='black')
ax1.set_ylabel('Number of Users', fontsize=11, fontweight='bold')
ax1.set_title('User Eligibility: K=5, Q=10', fontsize=12, fontweight='bold')
ax1.set_xticks(x)
ax1.set_xticklabels(splits)
ax1.legend(loc='upper right', fontsize=9)
ax1.grid(axis='y', alpha=0.3)

for i, (orig, elig) in enumerate(zip(original_users_k5, eligible_users_k5)):
    pct = elig / orig * 100 if orig > 0 else 0
    ax1.text(i, max(orig, elig) + 500, f'{elig:,}\n({pct:.1f}%)', ha='center', va='bottom', fontsize=8, fontweight='bold')

# K=10, Q=20
original_users_k10 = original_users_k5  # same original users
eligible_users_k10 = [eps_train_k10['user_id'].nunique(), eps_val_k10['user_id'].nunique(), eps_test_k10['user_id'].nunique()]

ax2.bar([i - width/2 for i in x], original_users_k10, width, label='Original', color='#95a5a6', alpha=0.7, edgecolor='black')
ax2.bar([i + width/2 for i in x], eligible_users_k10, width, label='Eligible (>=30 pairs)', color='#3498db', alpha=0.7, edgecolor='black')
ax2.set_ylabel('Number of Users', fontsize=11, fontweight='bold')
ax2.set_title('User Eligibility: K=10, Q=20', fontsize=12, fontweight='bold')
ax2.set_xticks(x)
ax2.set_xticklabels(splits)
ax2.legend(loc='upper right', fontsize=9)
ax2.grid(axis='y', alpha=0.3)

for i, (orig, elig) in enumerate(zip(original_users_k10, eligible_users_k10)):
    pct = elig / orig * 100 if orig > 0 else 0
    ax2.text(i, max(orig, elig) + 500, f'{elig:,}\n({pct:.1f}%)', ha='center', va='bottom', fontsize=8, fontweight='bold')

plt.tight_layout()
plt.savefig(VIZ_DIR / "fig3_user_eligibility.png", bbox_inches='tight')
plt.close()
print(f"[CELL 05-11] Saved: fig3_user_eligibility.png")

# ===== TABLE 1: Episode statistics summary =====
table1_records = []
for stats in all_episode_stats:
    rec = {
        'Config': f"K={stats['K']}, Q={stats['Q']}",
        'Split': stats['split'].capitalize(),
        'Episodes': f"{stats['n_episodes']:,}",
        'Users': f"{stats['n_users']:,}",
    }
    if 'episodes_per_user_mean' in stats:
        rec['Eps/User'] = f"{stats['episodes_per_user_mean']:.1f}"
    else:
        rec['Eps/User'] = "1.0"
    table1_records.append(rec)

table1 = pd.DataFrame(table1_records)
table1.to_csv(VIZ_DIR / "table1_episode_summary.csv", index=False)
print("[CELL 05-11] Saved: table1_episode_summary.csv")
print("\n[CELL 05-11] Table 1: Episode Summary")
print(table1.to_string(index=False))

# ===== TABLE 2: User eligibility comparison =====
# One row per (config, split): compare users in the pairs file with users that yielded episodes.
table2_specs = [
    ('K=5, Q=10', 'Train', pairs_train, eps_train_k5),
    ('K=5, Q=10', 'Val', pairs_val, eps_val_k5),
    ('K=5, Q=10', 'Test', pairs_test, eps_test_k5),
    ('K=10, Q=20', 'Train', pairs_train, eps_train_k10),
    ('K=10, Q=20', 'Val', pairs_val, eps_val_k10),
    ('K=10, Q=20', 'Test', pairs_test, eps_test_k10),
]

table2_records = []
for config, split, pairs_df, eps_df in table2_specs:
    n_original = pairs_df['user_id'].nunique()
    n_eligible = eps_df['user_id'].nunique()
    table2_records.append({
        'Config': config,
        'Split': split,
        'Original Users': f"{n_original:,}",
        'Eligible Users': f"{n_eligible:,}",
        'Eligibility %': f"{n_eligible / n_original * 100:.1f}%",
    })

table2 = pd.DataFrame(table2_records)

table2.to_csv(VIZ_DIR / "table2_user_eligibility.csv", index=False)
print("[CELL 05-11] Saved: table2_user_eligibility.csv")
print("\n[CELL 05-11] Table 2: User Eligibility Comparison")
print(table2.to_string(index=False))

print(f"\n[CELL 05-11] All visualizations saved to {VIZ_DIR}")

cell_end("CELL 05-11", t0)


[CELL 05-11] Generate plots and tables
[CELL 05-11] start=2026-01-30T07:52:35
[CELL 05-11] Creating visualizations in /workspace/anonymous-users-mooc-session-meta/reports/05_episode_index_xuetangx/20260130_075221/visualizations
[CELL 05-11] Saved: fig1_episode_counts.png
[CELL 05-11] Saved: fig2_episodes_per_user.png
[CELL 05-11] Saved: fig3_user_eligibility.png
[CELL 05-11] Saved: table1_episode_summary.csv

[CELL 05-11] Table 1: Episode Summary
    Config Split Episodes Users Eps/User
 K=5, Q=10 Train   30,895 1,859     16.6
 K=5, Q=10   Val      258   258      1.0
 K=5, Q=10  Test      248   248      1.0
K=10, Q=20 Train   14,383   597     24.1
K=10, Q=20   Val       72    72      1.0
K=10, Q=20  Test       85    85      1.0
[CELL 05-11] Saved: table2_user_eligibility.csv

[CELL 05-11] Table 2: User Eligibility Comparison
    Config Split Original Users Eligible Users Eligibility %
 K=5, Q=10 Train         28,633          1,859          6.5%
 K=5, Q=10   Val          3,579         

## ✅ Notebook 05 Complete

**Outputs:**
- ✅ Episode files for K=5,Q=10 and K=10,Q=20 (train/val/test)
- ✅ DuckDB views: `xuetangx_episodes_{train|val|test}_K{K}_Q{Q}`
- ✅ `reports/05_episode_index_xuetangx/<run_tag>/report.json`

**Validation Passed:**
- ✅ Chronological ordering: support_max_ts < query_min_ts for all episodes
- ✅ User filtering: only users with ≥K+Q pairs included
- ✅ Train: multiple episodes per user (sliding window)
- ✅ Val/Test: single episode per user (last K+Q pairs)
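
The four checks above can be illustrated on a single toy user. The sketch below is not the notebook's implementation: it works on a bare list of timestamps rather than the pairs parquet, and the function names and the `stride` parameter are hypothetical (the notebook's actual window stride is not stated here). It shows how sliding windows of K+Q pairs produce several train episodes, how the last K+Q pairs produce one val/test episode, and how the leakage condition can be tested.

```python
from typing import Dict, List, Sequence

Episode = Dict[str, List[int]]


def sliding_episodes(timestamps: Sequence[int], K: int, Q: int, stride: int = 1) -> List[Episode]:
    """Train-style: one episode per window of K+Q consecutive pairs.

    `stride` is an assumed parameter for illustration only.
    Users with fewer than K+Q pairs yield no episodes.
    """
    ts = sorted(timestamps)
    window = K + Q
    episodes = []
    for start in range(0, len(ts) - window + 1, stride):
        chunk = ts[start:start + window]
        # Support is the chronologically first K pairs, query the next Q.
        episodes.append({"support": chunk[:K], "query": chunk[K:]})
    return episodes


def last_episode(timestamps: Sequence[int], K: int, Q: int) -> List[Episode]:
    """Val/test-style: a single episode built from the last K+Q pairs."""
    ts = sorted(timestamps)
    if len(ts) < K + Q:
        return []
    chunk = ts[-(K + Q):]
    return [{"support": chunk[:K], "query": chunk[K:]}]


def has_future_leakage(episode: Episode) -> bool:
    """True when support_max_ts < query_min_ts is violated."""
    return max(episode["support"]) >= min(episode["query"])


# Toy user with 20 pairs and distinct timestamps.
user_ts = list(range(100, 120))
train_eps = sliding_episodes(user_ts, K=5, Q=10)
eval_eps = last_episode(user_ts, K=5, Q=10)
n_leaks = sum(has_future_leakage(ep) for ep in train_eps + eval_eps)
print(len(train_eps), len(eval_eps), n_leaks)
```

Because the comparison is strict, two pairs that share a timestamp across the support/query boundary would count as leakage in this sketch.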

**Next:** Notebook 06 (Baselines)
- Implement baseline models: Popularity, GRU, SASRec
- Train on train episodes, evaluate on val/test episodes
- Metrics: Accuracy@1, Recall@{5,10}, MRR
- Compare global (non-personalized) vs few-shot adapted models
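
As a rough illustration of the metrics listed above, the sketch below computes Accuracy@1, Recall@k, and MRR from ranked predictions. It assumes each query pair has exactly one ground-truth item, so Recall@k reduces to a hit rate; the function names and the toy item IDs are hypothetical and not taken from Notebook 06.

```python
from typing import Dict, Hashable, List, Optional, Sequence


def rank_of(target: Hashable, ranked: Sequence[Hashable]) -> Optional[int]:
    """1-based rank of `target` in `ranked`, or None if it is absent."""
    for pos, item in enumerate(ranked, start=1):
        if item == target:
            return pos
    return None


def ranking_metrics(
    targets: Sequence[Hashable],
    rankings: Sequence[Sequence[Hashable]],
    ks: Sequence[int] = (5, 10),
) -> Dict[str, float]:
    """Accuracy@1, MRR, and Recall@k with one true item per query (assumption)."""
    if len(targets) == 0 or len(targets) != len(rankings):
        raise ValueError("targets and rankings must be non-empty and equal in length")
    ranks: List[Optional[int]] = [rank_of(t, r) for t, r in zip(targets, rankings)]
    n = len(ranks)
    metrics = {
        "accuracy@1": sum(r == 1 for r in ranks) / n,
        # A target missing from the ranked list contributes 0 to MRR.
        "mrr": sum(1.0 / r for r in ranks if r is not None) / n,
    }
    for k in ks:
        metrics[f"recall@{k}"] = sum(r is not None and r <= k for r in ranks) / n
    return metrics


# Toy example: four queries with the true item at rank 1, rank 2, rank 6, and absent.
targets = ["c1", "c2", "c3", "c4"]
rankings = [
    ["c1", "c9", "c8"],
    ["c7", "c2", "c8"],
    ["c5", "c6", "c7", "c8", "c9", "c3"],
    ["c1", "c2"],
]
metrics = ranking_metrics(targets, rankings)
print(metrics)
```

Averaging these values per episode or over all query pairs is a design choice that this sketch leaves open.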