# Notebook 03b: Build Pairs with Event Types (XuetangX)

**Purpose:** Extend NB03 to include event_type information for reliability scoring.

**Differences from NB03:**
- Loads `event_type` with each event so that action diversity can be computed per session
- Computes session-level reliability metrics (n_events, duration, action types) and attaches the resulting score to every pair as `session_reliability`
- Writes outputs to a separate directory to preserve the original pipeline

**For Contribution 3:** Reliability-Weighted MAML

**Inputs:**
- `data/interim/xuetangx.duckdb` (view: `xuetangx_events_sessionized`)
- `data/processed/xuetangx/vocab/course2id.json` (from NB03)

**Outputs:**
- `data/processed/xuetangx/pairs_with_reliability/pairs.parquet`
- `data/processed/xuetangx/pairs_with_reliability/session_reliability.parquet`
- DuckDB view: `xuetangx_pairs_with_reliability`
- `reports/03b_build_pairs_with_events_xuetangx/<run_tag>/report.json`

In [1]:
# [CELL 03b-00] Bootstrap: repo root + paths + logger

import os
import sys
import json
import time
import uuid
import hashlib
from pathlib import Path
from datetime import datetime
from typing import Any, Dict, List

import numpy as np
import pandas as pd

t0 = datetime.now()
print(f"[CELL 03b-00] start={t0.isoformat(timespec='seconds')}")
print("[CELL 03b-00] CWD:", Path.cwd().resolve())

def find_repo_root(start: Path) -> Path:
    start = start.resolve()
    for p in [start, *start.parents]:
        if (p / "PROJECT_STATE.md").exists():
            return p
    raise RuntimeError("Could not find PROJECT_STATE.md. Open notebook from within the repo.")

REPO_ROOT = find_repo_root(Path.cwd())
print("[CELL 03b-00] REPO_ROOT:", REPO_ROOT)

PATHS = {
    "META_REGISTRY": REPO_ROOT / "meta.json",
    "DATA_INTERIM": REPO_ROOT / "data" / "interim",
    "DATA_PROCESSED": REPO_ROOT / "data" / "processed",
    "REPORTS": REPO_ROOT / "reports",
}
for k, v in PATHS.items():
    print(f"[CELL 03b-00] {k}={v}")

def cell_start(cell_id: str, title: str, **kwargs: Any) -> float:
    t = time.time()
    print(f"\n[{cell_id}] {title}")
    print(f"[{cell_id}] start={datetime.now().isoformat(timespec='seconds')}")
    for k, v in kwargs.items():
        print(f"[{cell_id}] {k}={v}")
    return t

def cell_end(cell_id: str, t0: float, **kwargs: Any) -> None:
    for k, v in kwargs.items():
        print(f"[{cell_id}] {k}={v}")
    print(f"[{cell_id}] elapsed={time.time()-t0:.2f}s")
    print(f"[{cell_id}] done")

print("[CELL 03b-00] done")

[CELL 03b-00] start=2026-02-04T02:45:09
[CELL 03b-00] CWD: /workspace/anonymous-users-mooc-session-meta/notebooks
[CELL 03b-00] REPO_ROOT: /workspace/anonymous-users-mooc-session-meta
[CELL 03b-00] META_REGISTRY=/workspace/anonymous-users-mooc-session-meta/meta.json
[CELL 03b-00] DATA_INTERIM=/workspace/anonymous-users-mooc-session-meta/data/interim
[CELL 03b-00] DATA_PROCESSED=/workspace/anonymous-users-mooc-session-meta/data/processed
[CELL 03b-00] REPORTS=/workspace/anonymous-users-mooc-session-meta/reports
[CELL 03b-00] done


In [2]:
# [CELL 03b-01] Reproducibility: seed everything

t0 = cell_start("CELL 03b-01", "Seed everything")

GLOBAL_SEED = 20260107

def seed_everything(seed: int) -> None:
    import random
    random.seed(seed)
    np.random.seed(seed)

seed_everything(GLOBAL_SEED)

cell_end("CELL 03b-01", t0, seed=GLOBAL_SEED)


[CELL 03b-01] Seed everything
[CELL 03b-01] start=2026-02-04T02:45:09
[CELL 03b-01] seed=20260107
[CELL 03b-01] elapsed=0.00s
[CELL 03b-01] done


In [3]:
# [CELL 03b-02] JSON IO + hashing helpers

t0 = cell_start("CELL 03b-02", "JSON IO + hashing")

def write_json_atomic(path: Path, obj: Any, indent: int = 2) -> None:
    path.parent.mkdir(parents=True, exist_ok=True)
    tmp = path.with_suffix(path.suffix + f".tmp_{uuid.uuid4().hex}")
    with tmp.open("w", encoding="utf-8") as f:
        json.dump(obj, f, ensure_ascii=False, indent=indent)
    tmp.replace(path)

def read_json(path: Path) -> Any:
    if not path.exists():
        raise RuntimeError(f"Missing JSON file: {path}")
    with path.open("r", encoding="utf-8") as f:
        return json.load(f)

def sha256_file(path: Path, chunk_size: int = 1024 * 1024) -> str:
    h = hashlib.sha256()
    with path.open("rb") as f:
        while True:
            b = f.read(chunk_size)
            if not b:
                break
            h.update(b)
    return h.hexdigest()

cell_end("CELL 03b-02", t0)


[CELL 03b-02] JSON IO + hashing
[CELL 03b-02] start=2026-02-04T02:45:09
[CELL 03b-02] elapsed=0.00s
[CELL 03b-02] done


In [4]:
# [CELL 03b-03] Run tagging + report/config/manifest + meta.json

t0 = cell_start("CELL 03b-03", "Start run + init files + meta.json")

NOTEBOOK_NAME = "03b_build_pairs_with_events_xuetangx"
RUN_TAG = datetime.now().strftime("%Y%m%d_%H%M%S")
RUN_ID = uuid.uuid4().hex

OUT_DIR = PATHS["REPORTS"] / NOTEBOOK_NAME / RUN_TAG
OUT_DIR.mkdir(parents=True, exist_ok=True)

REPORT_PATH = OUT_DIR / "report.json"
CONFIG_PATH = OUT_DIR / "config.json"
MANIFEST_PATH = OUT_DIR / "manifest.json"

DUCKDB_PATH = PATHS["DATA_INTERIM"] / "xuetangx.duckdb"
EVENTS_VIEW = "xuetangx_events_sessionized"

# Load vocab from NB03 (reuse existing)
VOCAB_DIR = PATHS["DATA_PROCESSED"] / "xuetangx" / "vocab"
COURSE2ID_PATH = VOCAB_DIR / "course2id.json"

# Output to separate directory
PAIRS_DIR = PATHS["DATA_PROCESSED"] / "xuetangx" / "pairs_with_reliability"
PAIRS_DIR.mkdir(parents=True, exist_ok=True)

OUT_PAIRS_PARQUET = PAIRS_DIR / "pairs.parquet"
OUT_SESSION_RELIABILITY = PAIRS_DIR / "session_reliability.parquet"

# Reliability score configuration
RELIABILITY_CFG = {
    "intensity_cap": 100,      # Cap event count at 100
    "duration_cap": 1800,      # Cap duration at 30 minutes (1800 seconds)
    "weight_intensity": 1/3,
    "weight_extent": 1/3,
    "weight_composition": 1/3,
}

CFG = {
    "notebook": NOTEBOOK_NAME,
    "run_id": RUN_ID,
    "run_tag": RUN_TAG,
    "seed": GLOBAL_SEED,
    "inputs": {
        "duckdb_path": str(DUCKDB_PATH),
        "events_view": EVENTS_VIEW,
        "course2id": str(COURSE2ID_PATH),
    },
    "outputs": {
        "pairs": str(OUT_PAIRS_PARQUET),
        "session_reliability": str(OUT_SESSION_RELIABILITY),
        "out_dir": str(OUT_DIR),
    },
    "pairs": {
        "min_prefix_len": 1,
        "max_prefix_len": None,
        "deduplicate_consecutive": True,
    },
    "reliability": RELIABILITY_CFG,
}

write_json_atomic(CONFIG_PATH, CFG)

report = {
    "run_id": RUN_ID,
    "notebook": NOTEBOOK_NAME,
    "run_tag": RUN_TAG,
    "created_at": datetime.now().isoformat(timespec="seconds"),
    "repo_root": str(REPO_ROOT),
    "metrics": {},
    "key_findings": [],
    "sanity_samples": {},
    "data_fingerprints": {},
    "notes": [],
}
write_json_atomic(REPORT_PATH, report)

manifest = {"run_id": RUN_ID, "notebook": NOTEBOOK_NAME, "run_tag": RUN_TAG, "artifacts": []}
write_json_atomic(MANIFEST_PATH, manifest)

# meta.json append-only
META_PATH = PATHS["META_REGISTRY"]
if not META_PATH.exists():
    write_json_atomic(META_PATH, {"schema_version": 1, "runs": []})
meta = read_json(META_PATH)
meta["runs"].append({
    "run_id": RUN_ID,
    "notebook": NOTEBOOK_NAME,
    "run_tag": RUN_TAG,
    "out_dir": str(OUT_DIR),
    "created_at": datetime.now().isoformat(timespec="seconds"),
})
write_json_atomic(META_PATH, meta)

cell_end("CELL 03b-03", t0, out_dir=str(OUT_DIR))


[CELL 03b-03] Start run + init files + meta.json
[CELL 03b-03] start=2026-02-04T02:45:09
[CELL 03b-03] out_dir=/workspace/anonymous-users-mooc-session-meta/reports/03b_build_pairs_with_events_xuetangx/20260204_024509
[CELL 03b-03] elapsed=0.00s
[CELL 03b-03] done


In [5]:
# [CELL 03b-04] Load sessionized events WITH event_type (KEY DIFFERENCE FROM NB03)

t0 = cell_start("CELL 03b-04", "Load sessionized events WITH event_type", duckdb=str(DUCKDB_PATH))

import duckdb

if not DUCKDB_PATH.exists():
    raise RuntimeError(f"Missing DuckDB: {DUCKDB_PATH}. Run Notebook 02 first.")

con = duckdb.connect(str(DUCKDB_PATH), read_only=True)

# KEY CHANGE: Include event_type in SELECT
events = con.execute(f"""
    SELECT 
        user_id,
        course_id,
        session_id,
        event_type,
        ts_epoch,
        pos_in_sess
    FROM {EVENTS_VIEW}
    ORDER BY user_id, session_id, ts_epoch, pos_in_sess
""").fetchdf()

con.close()

print(f"[CELL 03b-04] Loaded events shape: {events.shape}")
print(f"[CELL 03b-04] Columns: {list(events.columns)}")

# Show event_type distribution
print(f"\n[CELL 03b-04] Event types:")
print(events['event_type'].value_counts())

N_POSSIBLE_ACTIONS = events['event_type'].nunique()
print(f"\n[CELL 03b-04] N_POSSIBLE_ACTIONS: {N_POSSIBLE_ACTIONS}")

print(f"\n[CELL 03b-04] Head(5):")
print(events.head(5).to_string(index=False))

cell_end("CELL 03b-04", t0, n_events=int(events.shape[0]), n_action_types=N_POSSIBLE_ACTIONS)


[CELL 03b-04] Load sessionized events WITH event_type
[CELL 03b-04] start=2026-02-04T02:45:09
[CELL 03b-04] duckdb=/workspace/anonymous-users-mooc-session-meta/data/interim/xuetangx.duckdb


[CELL 03b-04] Loaded events shape: (28002537, 6)
[CELL 03b-04] Columns: ['user_id', 'course_id', 'session_id', 'event_type', 'ts_epoch', 'pos_in_sess']

[CELL 03b-04] Event types:
event_type
pause_video                10447067
seek_video                  4878540
problem_get                 4594062
load_video                  3832977
problem_check               1589086
problem_check_correct       1287667
problem_check_incorrect      717231
problem_save                 655907
Name: count, dtype: int64

[CELL 03b-04] N_POSSIBLE_ACTIONS: 8

[CELL 03b-04] Head(5):
user_id                             course_id                               session_id  event_type  ts_epoch  pos_in_sess
1000063 course-v1:TsinghuaX+20220332X+2016_T1 1000063_a64788d2c08e323a9f0cb7cc046f76ca  load_video   1488030            1
1000063 course-v1:TsinghuaX+20220332X+2016_T1 1000063_a64788d2c08e323a9f0cb7cc046f76ca pause_video   1488030            2
1000063 course-v1:TsinghuaX+20220332X+2016_T1 1000063_a64788d2c08e32

In [6]:
# [CELL 03b-05] Load course vocabulary from NB03

t0 = cell_start("CELL 03b-05", "Load course vocabulary", path=str(COURSE2ID_PATH))

if not COURSE2ID_PATH.exists():
    raise RuntimeError(f"Missing vocab file: {COURSE2ID_PATH}. Run Notebook 03 first.")

course2id = read_json(COURSE2ID_PATH)
n_items = len(course2id)

print(f"[CELL 03b-05] Loaded vocabulary with {n_items} courses")

# Map course_id to item_id
events["item_id"] = events["course_id"].map(course2id)

# Check for unmapped courses
n_missing = events["item_id"].isna().sum()
if n_missing > 0:
    print(f"[CELL 03b-05] WARNING: {n_missing} events with unmapped courses")
    events = events[events["item_id"].notna()].copy()

events["item_id"] = events["item_id"].astype(int)

print(f"[CELL 03b-05] Mapped all courses to item_id [0, {n_items-1}]")

cell_end("CELL 03b-05", t0, n_items=n_items)


[CELL 03b-05] Load course vocabulary
[CELL 03b-05] start=2026-02-04T02:45:30
[CELL 03b-05] path=/workspace/anonymous-users-mooc-session-meta/data/processed/xuetangx/vocab/course2id.json
[CELL 03b-05] Loaded vocabulary with 1518 courses
[CELL 03b-05] Mapped all courses to item_id [0, 1517]
[CELL 03b-05] n_items=1518
[CELL 03b-05] elapsed=3.38s
[CELL 03b-05] done


In [7]:
# [CELL 03b-06] Compute session-level reliability metrics

t0 = cell_start("CELL 03b-06", "Compute session-level reliability metrics")

# Aggregate session-level statistics
session_stats = events.groupby('session_id').agg({
    'user_id': 'first',
    'ts_epoch': ['min', 'max', 'count'],
    'event_type': 'nunique',
}).reset_index()

# Flatten column names
session_stats.columns = ['session_id', 'user_id', 'start_ts', 'end_ts', 'n_events', 'n_action_types']

# Duration = max(ts_epoch) - min(ts_epoch) within the session (per supervisor requirement)
session_stats['duration_sec'] = session_stats['end_ts'] - session_stats['start_ts']

print(f"[CELL 03b-06] Session stats shape: {session_stats.shape}")
print(f"\n[CELL 03b-06] Session statistics:")
print(f"  n_events: min={session_stats['n_events'].min()}, p50={session_stats['n_events'].median():.0f}, max={session_stats['n_events'].max()}")
print(f"  duration_sec: min={session_stats['duration_sec'].min()}, p50={session_stats['duration_sec'].median():.0f}, max={session_stats['duration_sec'].max()}")
print(f"  n_action_types: min={session_stats['n_action_types'].min()}, p50={session_stats['n_action_types'].median():.0f}, max={session_stats['n_action_types'].max()}")

cell_end("CELL 03b-06", t0, n_sessions=len(session_stats))


[CELL 03b-06] Compute session-level reliability metrics
[CELL 03b-06] start=2026-02-04T02:45:34
[CELL 03b-06] Session stats shape: (469286, 7)

[CELL 03b-06] Session statistics:
  n_events: min=1, p50=13, max=27264
  duration_sec: min=0, p50=1, max=10525
  n_action_types: min=1, p50=3, max=8
[CELL 03b-06] n_sessions=469286
[CELL 03b-06] elapsed=5.82s
[CELL 03b-06] done
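
The aggregation in CELL 03b-06 can also be written with pandas named aggregation, which names the output columns directly instead of flattening a MultiIndex afterwards. The sketch below is a minimal illustration on a hypothetical four-event frame (the values are made up, not taken from the dataset); it is not the notebook's own code.

```python
import pandas as pd

# Hypothetical toy events: two sessions, values invented for illustration.
events = pd.DataFrame({
    "session_id": ["s1", "s1", "s1", "s2"],
    "user_id": [1, 1, 1, 2],
    "ts_epoch": [100, 100, 160, 500],
    "event_type": ["load_video", "pause_video", "pause_video", "problem_get"],
})

# Named aggregation: each keyword is an output column, (source column, function).
session_stats = (
    events.groupby("session_id")
    .agg(
        user_id=("user_id", "first"),
        start_ts=("ts_epoch", "min"),
        end_ts=("ts_epoch", "max"),
        n_events=("ts_epoch", "count"),
        n_action_types=("event_type", "nunique"),
    )
    .reset_index()
)

# Duration is the span between the first and last event of the session.
session_stats["duration_sec"] = session_stats["end_ts"] - session_stats["start_ts"]
print(session_stats.to_string(index=False))
```

A single-event session such as `s2` gets `duration_sec = 0`, which is how the minimum duration of 0 reported above can arise.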


In [8]:
# [CELL 03b-07] Compute reliability scores

t0 = cell_start("CELL 03b-07", "Compute reliability scores")

INTENSITY_CAP = RELIABILITY_CFG['intensity_cap']
DURATION_CAP = RELIABILITY_CFG['duration_cap']
W_INTENSITY = RELIABILITY_CFG['weight_intensity']
W_EXTENT = RELIABILITY_CFG['weight_extent']
W_COMPOSITION = RELIABILITY_CFG['weight_composition']

def compute_reliability(row):
    """
    Log-derived reliability score from engagement signals.
    
    Components:
    1. Activity intensity: event count (normalized)
    2. Temporal extent: session duration (max(ts) - min(ts))
    3. Action composition: diversity of action types
    
    All computed from clickstream logs only.
    """
    # 1. Activity intensity (event count, capped and normalized)
    intensity = min(row['n_events'] / INTENSITY_CAP, 1.0)
    
    # 2. Temporal extent (duration, capped and normalized)
    extent = min(row['duration_sec'] / DURATION_CAP, 1.0)
    
    # 3. Action composition (diversity)
    composition = row['n_action_types'] / N_POSSIBLE_ACTIONS
    
    # Combined reliability score
    reliability = (
        W_INTENSITY * intensity +
        W_EXTENT * extent +
        W_COMPOSITION * composition
    )
    
    return reliability

session_stats['reliability'] = session_stats.apply(compute_reliability, axis=1)

# Also compute individual components for analysis
session_stats['intensity'] = (session_stats['n_events'] / INTENSITY_CAP).clip(upper=1.0)
session_stats['extent'] = (session_stats['duration_sec'] / DURATION_CAP).clip(upper=1.0)
session_stats['composition'] = session_stats['n_action_types'] / N_POSSIBLE_ACTIONS

print(f"[CELL 03b-07] Reliability score distribution:")
print(f"  Min: {session_stats['reliability'].min():.4f}")
print(f"  p10: {session_stats['reliability'].quantile(0.10):.4f}")
print(f"  p50: {session_stats['reliability'].quantile(0.50):.4f}")
print(f"  p90: {session_stats['reliability'].quantile(0.90):.4f}")
print(f"  Max: {session_stats['reliability'].max():.4f}")
print(f"  Mean: {session_stats['reliability'].mean():.4f}")
print(f"  Std: {session_stats['reliability'].std():.4f}")

print(f"\n[CELL 03b-07] Component contributions:")
print(f"  intensity: mean={session_stats['intensity'].mean():.4f}")
print(f"  extent: mean={session_stats['extent'].mean():.4f}")
print(f"  composition: mean={session_stats['composition'].mean():.4f}")

cell_end("CELL 03b-07", t0)


[CELL 03b-07] Compute reliability scores
[CELL 03b-07] start=2026-02-04T02:45:40
[CELL 03b-07] Reliability score distribution:
  Min: 0.0450
  p10: 0.0752
  p50: 0.1759
  p90: 0.5985
  Max: 1.0000
  Mean: 0.2590
  Std: 0.2069

[CELL 03b-07] Component contributions:
  intensity: mean=0.3100
  extent: mean=0.0534
  composition: mean=0.4135
[CELL 03b-07] elapsed=2.32s
[CELL 03b-07] done
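
As a worked example of the score defined in CELL 03b-07, the sketch below recomputes it for two sessions using the caps and equal weights from `RELIABILITY_CFG` and the 8 action types found in CELL 03b-04. The function name `reliability` and the two example sessions are illustrative assumptions. The first session uses the per-metric minima reported in CELL 03b-06 (1 event, 0 s, 1 action type), which is consistent with the reported minimum score of 0.0450.

```python
def reliability(n_events, duration_sec, n_action_types,
                intensity_cap=100, duration_cap=1800, n_possible_actions=8,
                weights=(1 / 3, 1 / 3, 1 / 3)):
    """Weighted sum of three engagement components, each scaled to [0, 1]."""
    intensity = min(n_events / intensity_cap, 1.0)       # capped event count
    extent = min(duration_sec / duration_cap, 1.0)       # capped duration
    composition = n_action_types / n_possible_actions    # action diversity
    w_intensity, w_extent, w_composition = weights
    return (w_intensity * intensity
            + w_extent * extent
            + w_composition * composition)

# Smallest possible session: one event, zero duration, one action type.
lowest = reliability(n_events=1, duration_sec=0, n_action_types=1)

# Hypothetical session that reaches both caps and uses all 8 action types.
saturated = reliability(n_events=100, duration_sec=1800, n_action_types=8)

print(f"lowest={lowest:.4f}")        # lowest=0.0450
print(f"saturated={saturated:.4f}")  # saturated=1.0000
```

The lowest case works out to (0.01 + 0 + 0.125) / 3 = 0.045, so the composition term alone accounts for most of the floor.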


In [9]:
# [CELL 03b-08] Save session reliability scores

t0 = cell_start("CELL 03b-08", "Save session reliability scores")

session_reliability_df = session_stats[[
    'session_id', 'user_id', 'n_events', 'duration_sec', 'n_action_types',
    'intensity', 'extent', 'composition', 'reliability'
]].copy()

session_reliability_df.to_parquet(OUT_SESSION_RELIABILITY, index=False, compression='zstd')

reliability_bytes = int(OUT_SESSION_RELIABILITY.stat().st_size)
reliability_sha = sha256_file(OUT_SESSION_RELIABILITY)

print(f"[CELL 03b-08] Saved: {OUT_SESSION_RELIABILITY}")
print(f"[CELL 03b-08] Size: {reliability_bytes / 1024 / 1024:.2f} MB")
print(f"[CELL 03b-08] SHA256: {reliability_sha}")

cell_end("CELL 03b-08", t0)


[CELL 03b-08] Save session reliability scores
[CELL 03b-08] start=2026-02-04T02:45:42
[CELL 03b-08] Saved: /workspace/anonymous-users-mooc-session-meta/data/processed/xuetangx/pairs_with_reliability/session_reliability.parquet
[CELL 03b-08] Size: 12.29 MB
[CELL 03b-08] SHA256: 91cc5392c87775ffee05007b8189c9143c6b1a57fc53f35849543a734bd0bee2
[CELL 03b-08] elapsed=0.17s
[CELL 03b-08] done


In [10]:
# [CELL 03b-09] Extract course sequences per session (with event types)

t0 = cell_start("CELL 03b-09", "Extract course sequences per session")

DEDUPE = CFG["pairs"]["deduplicate_consecutive"]

# Group by session and extract chronological course sequence
session_seqs = []
for (user_id, session_id), group in events.groupby(["user_id", "session_id"]):
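    # Sort by timestamp within session; a stable sort keeps the SQL order
    # (ts_epoch, pos_in_sess) for events that share the same timestamp
    group = group.sort_values("ts_epoch", kind="stable")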
    
    item_seq = group["item_id"].tolist()
    ts_seq = group["ts_epoch"].tolist()
    event_types = group["event_type"].tolist()
    
    if DEDUPE:
        # Deduplicate consecutive courses
        deduped_items = []
        deduped_ts = []
        deduped_events = []
        for i, item in enumerate(item_seq):
            if i == 0 or item != item_seq[i-1]:
                deduped_items.append(item)
                deduped_ts.append(ts_seq[i])
                deduped_events.append(event_types[i])
        item_seq = deduped_items
        ts_seq = deduped_ts
        event_types = deduped_events
    
    session_seqs.append({
        "user_id": user_id,
        "session_id": session_id,
        "item_seq": item_seq,
        "ts_seq": ts_seq,
        "event_types": event_types,
    })

print(f"[CELL 03b-09] Extracted {len(session_seqs):,} session sequences")
print(f"[CELL 03b-09] Deduplicate consecutive: {DEDUPE}")

# Sample check
print(f"\n[CELL 03b-09] Sample session sequence:")
sample = session_seqs[0]
print(f"  user_id: {sample['user_id']}")
print(f"  session_id: {sample['session_id']}")
print(f"  item_seq (first 5): {sample['item_seq'][:5]}")
print(f"  event_types (first 5): {sample['event_types'][:5]}")
print(f"  length: {len(sample['item_seq'])}")

cell_end("CELL 03b-09", t0, n_sessions=len(session_seqs))


[CELL 03b-09] Extract course sequences per session
[CELL 03b-09] start=2026-02-04T02:45:42
[CELL 03b-09] Extracted 469,286 session sequences
[CELL 03b-09] Deduplicate consecutive: True

[CELL 03b-09] Sample session sequence:
  user_id: 1000063
  session_id: 1000063_a64788d2c08e323a9f0cb7cc046f76ca
  item_seq (first 5): [980]
  event_types (first 5): ['load_video']
  length: 1
[CELL 03b-09] n_sessions=469286
[CELL 03b-09] elapsed=73.67s
[CELL 03b-09] done
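
The consecutive-deduplication step keeps the first event of every run of identical courses and drops the rest, which is why the sample session above collapses to a single item. A compact equivalent using `itertools.groupby` is sketched below; the helper name `dedupe_consecutive` and the toy inputs are assumptions for illustration, not part of the notebook.

```python
from itertools import groupby

def dedupe_consecutive(items, timestamps, event_types):
    """Keep the first (item, ts, event_type) triple of each run of equal items."""
    rows = zip(items, timestamps, event_types)
    # groupby only merges adjacent equal keys, so a course revisited later
    # in the session starts a new run and is kept again.
    firsts = [next(run) for _, run in groupby(rows, key=lambda row: row[0])]
    if not firsts:
        return [], [], []
    kept_items, kept_ts, kept_events = zip(*firsts)
    return list(kept_items), list(kept_ts), list(kept_events)

# Hypothetical session: course 980 three times, 793 twice, then 980 again.
items, ts, ev = dedupe_consecutive(
    [980, 980, 980, 793, 793, 980],
    [1, 2, 3, 4, 5, 6],
    ["load_video", "pause_video", "seek_video",
     "load_video", "pause_video", "problem_get"],
)
print(items)  # [980, 793, 980]
print(ts)     # [1, 4, 6]
```

Note that the retained `event_type` is simply that of the first event in each run, so it does not summarise the actions taken within the run.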


In [11]:
# [CELL 03b-10] Create prefix→label pairs with session reliability

t0 = cell_start("CELL 03b-10", "Create prefix→label pairs with reliability")

MIN_PREFIX_LEN = int(CFG["pairs"]["min_prefix_len"])

# Create session_id → reliability mapping
session_reliability_map = session_stats.set_index('session_id')['reliability'].to_dict()

pairs = []
pair_id = 0

for sess in session_seqs:
    user_id = sess["user_id"]
    session_id = sess["session_id"]
    item_seq = sess["item_seq"]
    ts_seq = sess["ts_seq"]
    
    # Get session reliability score
    session_reliability = session_reliability_map.get(session_id, 0.0)
    
    # Need at least min_prefix_len + 1 items to create a pair
    if len(item_seq) < MIN_PREFIX_LEN + 1:
        continue
    
    # Create pairs: for each position t, prefix=item_seq[:t], label=item_seq[t]
    for t in range(MIN_PREFIX_LEN, len(item_seq)):
        prefix = item_seq[:t]
        label = item_seq[t]
        
        # Timestamps
        prefix_ts = ts_seq[:t]
        label_ts = ts_seq[t]
        
        # Skip pairs where prefix_max_ts >= label_ts: the label must be
        # strictly later than every event in its prefix
        if max(prefix_ts) >= label_ts:
            continue
        
        pairs.append({
            "pair_id": pair_id,
            "user_id": user_id,
            "session_id": session_id,
            "prefix": prefix,
            "label": int(label),
            "label_ts_epoch": int(label_ts),
            "prefix_len": int(len(prefix)),
            "session_reliability": float(session_reliability),  # NEW: reliability score
        })
        pair_id += 1

pairs_df = pd.DataFrame(pairs)

print(f"[CELL 03b-10] Created {len(pairs_df):,} prefix→label pairs")
print(f"[CELL 03b-10] Min prefix length: {MIN_PREFIX_LEN}")

print(f"\n[CELL 03b-10] Reliability score in pairs:")
print(f"  Min: {pairs_df['session_reliability'].min():.4f}")
print(f"  p50: {pairs_df['session_reliability'].median():.4f}")
print(f"  Max: {pairs_df['session_reliability'].max():.4f}")

print(f"\n[CELL 03b-10] Head(3):")
print(pairs_df[["pair_id", "user_id", "prefix", "label", "prefix_len", "session_reliability"]].head(3).to_string(index=False))

cell_end("CELL 03b-10", t0, n_pairs=int(len(pairs_df)))


[CELL 03b-10] Create prefix→label pairs with reliability
[CELL 03b-10] start=2026-02-04T02:46:56
[CELL 03b-10] Created 281,979 prefix→label pairs
[CELL 03b-10] Min prefix length: 1

[CELL 03b-10] Reliability score in pairs:
  Min: 0.0485
  p50: 0.5720
  Max: 1.0000

[CELL 03b-10] Head(3):
 pair_id user_id                prefix  label  prefix_len  session_reliability
       0 1000066                 [793]    122           1             0.712407
       1 1000066       [793, 122, 793]   1074           3             0.712407
       2 1000066 [793, 122, 793, 1074]    793           4             0.712407
[CELL 03b-10] n_pairs=281979
[CELL 03b-10] elapsed=2.38s
[CELL 03b-10] done
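
The `Head(3)` output above jumps from `prefix_len` 1 to 3. One way this can happen is the timestamp guard: if the second and third items of a session share a timestamp, the pair at `t = 2` is skipped. The sketch below reproduces that pattern with the item ids from the output and hypothetical timestamps (the real timestamps are not shown in the notebook); `build_pairs` is an illustrative helper, not the notebook's code.

```python
def build_pairs(item_seq, ts_seq, min_prefix_len=1):
    """Emit prefix -> label pairs, skipping labels not strictly after the prefix."""
    pairs = []
    for t in range(min_prefix_len, len(item_seq)):
        # Guard against leakage: every prefix event must precede the label.
        if max(ts_seq[:t]) >= ts_seq[t]:
            continue
        pairs.append({
            "prefix": item_seq[:t],
            "label": item_seq[t],
            "prefix_len": t,
        })
    return pairs

items = [793, 122, 793, 1074, 793]   # item ids as in the Head(3) output
ts = [10, 20, 20, 30, 40]            # hypothetical; items 2 and 3 tie at ts=20

pairs = build_pairs(items, ts)
for p in pairs:
    print(p["prefix"], "->", p["label"], f"(prefix_len={p['prefix_len']})")
```

With these inputs the pair at `t = 2` is dropped because `max([10, 20]) >= 20`, leaving prefix lengths 1, 3 and 4.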


In [12]:
# [CELL 03b-11] Save pairs.parquet

t0 = cell_start("CELL 03b-11", "Save pairs.parquet", out=str(OUT_PAIRS_PARQUET))

pairs_df.to_parquet(OUT_PAIRS_PARQUET, index=False, compression="zstd")

pairs_bytes = int(OUT_PAIRS_PARQUET.stat().st_size)
pairs_sha = sha256_file(OUT_PAIRS_PARQUET)

print(f"[CELL 03b-11] Saved: {OUT_PAIRS_PARQUET}")
print(f"[CELL 03b-11] Size: {pairs_bytes / 1024 / 1024:.1f} MB")
print(f"[CELL 03b-11] SHA256: {pairs_sha}")

cell_end("CELL 03b-11", t0)


[CELL 03b-11] Save pairs.parquet
[CELL 03b-11] start=2026-02-04T02:46:58
[CELL 03b-11] out=/workspace/anonymous-users-mooc-session-meta/data/processed/xuetangx/pairs_with_reliability/pairs.parquet
[CELL 03b-11] Saved: /workspace/anonymous-users-mooc-session-meta/data/processed/xuetangx/pairs_with_reliability/pairs.parquet
[CELL 03b-11] Size: 6.3 MB
[CELL 03b-11] SHA256: 3f0dd60136bdb771e34ee9b4a4a435a2e7d9fe3ca0cb725b8b81a5b776992a22
[CELL 03b-11] elapsed=0.35s
[CELL 03b-11] done


In [13]:
# [CELL 03b-12] Register DuckDB view

t0 = cell_start("CELL 03b-12", "Register DuckDB view", duckdb=str(DUCKDB_PATH))

con = duckdb.connect(str(DUCKDB_PATH), read_only=False)

con.execute("DROP VIEW IF EXISTS xuetangx_pairs_with_reliability;")

def esc_path(p: Path) -> str:
    return str(p).replace("'", "''")

con.execute(f"""
CREATE VIEW xuetangx_pairs_with_reliability AS
SELECT * FROM read_parquet('{esc_path(OUT_PAIRS_PARQUET)}')
""")

n_pairs = int(con.execute("SELECT COUNT(*) FROM xuetangx_pairs_with_reliability").fetchone()[0])
print(f"[CELL 03b-12] View xuetangx_pairs_with_reliability: {n_pairs:,} rows")

con.close()
print(f"[CELL 03b-12] Closed DuckDB connection")

cell_end("CELL 03b-12", t0)


[CELL 03b-12] Register DuckDB view
[CELL 03b-12] start=2026-02-04T02:46:58
[CELL 03b-12] duckdb=/workspace/anonymous-users-mooc-session-meta/data/interim/xuetangx.duckdb
[CELL 03b-12] View xuetangx_pairs_with_reliability: 281,979 rows
[CELL 03b-12] Closed DuckDB connection
[CELL 03b-12] elapsed=0.02s
[CELL 03b-12] done


In [14]:
# [CELL 03b-13] Validation: sanity checks

t0 = cell_start("CELL 03b-13", "Validation checks")

# Check 1: All labels in vocab
invalid_labels = pairs_df[pairs_df["label"] >= n_items]
if len(invalid_labels) > 0:
    raise RuntimeError(f"Found {len(invalid_labels)} pairs with label >= n_items={n_items}")
print(f"[CELL 03b-13] All labels in vocab [0, {n_items-1}]")

# Check 2: All prefix items in vocab
all_prefix_items = [item for prefix in pairs_df["prefix"] for item in prefix]
max_prefix_item = max(all_prefix_items) if all_prefix_items else -1
if max_prefix_item >= n_items:
    raise RuntimeError(f"Found prefix item {max_prefix_item} >= n_items={n_items}")
print(f"[CELL 03b-13] All prefix items in vocab [0, {n_items-1}]")

# Check 3: Reliability scores are valid
invalid_reliability = pairs_df[(pairs_df['session_reliability'] < 0) | (pairs_df['session_reliability'] > 1)]
if len(invalid_reliability) > 0:
    raise RuntimeError(f"Found {len(invalid_reliability)} pairs with invalid reliability scores")
print(f"[CELL 03b-13] All reliability scores in [0, 1]")

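# Check 4: Compare with original NB03 pairs count
NB03_EXPECTED_PAIRS = 281_979  # pair count reported by NB03
print(f"\n[CELL 03b-13] Comparison with NB03:")
print(f"  NB03b pairs: {len(pairs_df):,}")
print(f"  (Should match NB03: {NB03_EXPECTED_PAIRS:,})")
if len(pairs_df) != NB03_EXPECTED_PAIRS:
    print(f"[CELL 03b-13] WARNING: pair count differs from NB03")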

# Check 5: User-level stats
user_pair_counts = pairs_df.groupby("user_id").size()
print(f"\n[CELL 03b-13] User-level pair counts:")
print(f"  Total users with pairs: {len(user_pair_counts):,}")
print(f"  Min pairs/user: {user_pair_counts.min()}")
print(f"  p50 pairs/user: {user_pair_counts.quantile(0.50):.0f}")
print(f"  Max pairs/user: {user_pair_counts.max()}")

cell_end("CELL 03b-13", t0)


[CELL 03b-13] Validation checks
[CELL 03b-13] start=2026-02-04T02:46:58
[CELL 03b-13] All labels in vocab [0, 1517]
[CELL 03b-13] All prefix items in vocab [0, 1517]
[CELL 03b-13] All reliability scores in [0, 1]

[CELL 03b-13] Comparison with NB03:
  NB03b pairs: 281,979
  (Should match NB03: 281,979)

[CELL 03b-13] User-level pair counts:
  Total users with pairs: 62,054
  Min pairs/user: 1
  p50 pairs/user: 2
  Max pairs/user: 680
[CELL 03b-13] elapsed=0.12s
[CELL 03b-13] done


In [15]:
# [CELL 03b-14] Update report + manifest

t0 = cell_start("CELL 03b-14", "Write report + manifest")

report = read_json(REPORT_PATH)
manifest = read_json(MANIFEST_PATH)

# Metrics
report["metrics"] = {
    "n_items": n_items,
    "n_pairs": int(len(pairs_df)),
    "n_sessions": int(len(session_stats)),
    "n_users_with_pairs": int(len(user_pair_counts)),
    "n_action_types": N_POSSIBLE_ACTIONS,
    "reliability_mean": float(session_stats['reliability'].mean()),
    "reliability_std": float(session_stats['reliability'].std()),
    "reliability_p50": float(session_stats['reliability'].median()),
    "reliability_p90": float(session_stats['reliability'].quantile(0.90)),
    "intensity_cap": INTENSITY_CAP,
    "duration_cap": DURATION_CAP,
}

# Key findings
report["key_findings"].append(
    f"Extended NB03 with event_type for reliability scoring. "
    f"Computed reliability scores for {len(session_stats):,} sessions using 3 components: "
    f"intensity (events), extent (duration), composition (action diversity)."
)

report["key_findings"].append(
    f"Reliability score distribution: mean={session_stats['reliability'].mean():.4f}, "
    f"std={session_stats['reliability'].std():.4f}, "
    f"p50={session_stats['reliability'].median():.4f}, "
    f"p90={session_stats['reliability'].quantile(0.90):.4f}"
)

# Sanity samples
report["sanity_samples"]["pairs_head3"] = pairs_df.head(3).to_dict(orient="records")
report["sanity_samples"]["session_reliability_head3"] = session_reliability_df.head(3).to_dict(orient="records")

# Fingerprints
report["data_fingerprints"]["pairs"] = {
    "path": str(OUT_PAIRS_PARQUET),
    "bytes": pairs_bytes,
    "sha256": pairs_sha,
}
report["data_fingerprints"]["session_reliability"] = {
    "path": str(OUT_SESSION_RELIABILITY),
    "bytes": reliability_bytes,
    "sha256": reliability_sha,
}

write_json_atomic(REPORT_PATH, report)

# Manifest
def add_artifact(path: Path) -> None:
    rec = {"path": str(path), "bytes": int(path.stat().st_size), "sha256": None}
    try:
        rec["sha256"] = sha256_file(path)
    except PermissionError as e:
        rec["sha256_error"] = f"PermissionError: {e}"
    manifest["artifacts"].append(rec)

add_artifact(OUT_PAIRS_PARQUET)
add_artifact(OUT_SESSION_RELIABILITY)

write_json_atomic(MANIFEST_PATH, manifest)

print(f"[CELL 03b-14] Updated: {REPORT_PATH}")
print(f"[CELL 03b-14] Updated: {MANIFEST_PATH}")

cell_end("CELL 03b-14", t0)


[CELL 03b-14] Write report + manifest
[CELL 03b-14] start=2026-02-04T02:46:59
[CELL 03b-14] Updated: /workspace/anonymous-users-mooc-session-meta/reports/03b_build_pairs_with_events_xuetangx/20260204_024509/report.json
[CELL 03b-14] Updated: /workspace/anonymous-users-mooc-session-meta/reports/03b_build_pairs_with_events_xuetangx/20260204_024509/manifest.json
[CELL 03b-14] elapsed=0.06s
[CELL 03b-14] done


## Notebook 03b Complete

**Run:** 2026-02-04T02:45:09 | Elapsed: ~2 minutes

**Data Summary:**
- Events loaded: 28,002,537
- Sessions: 469,286
- Pairs created: 281,979 (matches NB03)
- Users with pairs: 62,054
- Courses (vocabulary): 1,518

**Event Types (8 total):**
| Event Type | Count |
|------------|-------|
| pause_video | 10,447,067 |
| seek_video | 4,878,540 |
| problem_get | 4,594,062 |
| load_video | 3,832,977 |
| problem_check | 1,589,086 |
| problem_check_correct | 1,287,667 |
| problem_check_incorrect | 717,231 |
| problem_save | 655,907 |

**Reliability Score Distribution:**
| Metric | Value |
|--------|-------|
| Min | 0.0450 |
| p10 | 0.0752 |
| p50 (median) | 0.1759 |
| p90 | 0.5985 |
| Max | 1.0000 |
| Mean | 0.2590 |
| Std | 0.2069 |

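**Component Means (unweighted, each in [0, 1]):**
- Intensity (events): 0.3100
- Extent (duration): 0.0534
- Composition (action diversity): 0.4135

With equal weights of 1/3, these average to the overall mean reliability: (0.3100 + 0.0534 + 0.4135) / 3 ≈ 0.2590. Extent contributes least because the median session duration is 1 second against a cap of 1800 seconds.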

**Outputs:**
- `data/processed/xuetangx/pairs_with_reliability/pairs.parquet` (6.3 MB)
- `data/processed/xuetangx/pairs_with_reliability/session_reliability.parquet` (12.3 MB)
- DuckDB view: `xuetangx_pairs_with_reliability`
- `reports/03b_build_pairs_with_events_xuetangx/20260204_024509/report.json`

**Next:** Notebook 10 (Reliability Validation)