# Notebook 01: Ingest XuetangX

**Purpose:** Parse raw XuetangX JSON file → normalized Parquet → DuckDB view.

**Inputs:**
- `data/raw/xuetangx/20170201-20170801-raw_user_activity.json` (~4.8 GB, Feb-Aug 2017)

**Outputs:**
- `data/interim/xuetangx_events_raw.parquet` (28M learning events)
- `data/interim/xuetangx.duckdb` (view: `xuetangx_events_raw`)
- `reports/01_ingest_xuetangx/<run_tag>/report.json`

**Strategy:**
- Use newest data file (Feb 2017 - Aug 2017)
- Keep the platform-assigned sessions (`session_hash`)
- Keep learning events only, matched by event-type prefix (`load_`, `problem_`, `video_`, `seek_`, `speed_`, `pause_`)
- JSON → Parquet → DuckDB (reproducible, Windows-safe)
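
The flatten-and-filter step in the strategy above can be sketched on a tiny in-memory payload. This is a minimal illustration, not the notebook's parser: the nested layout follows the structure documented in cell 01-05, while `iter_learning_events` and every value in `sample` (course, user, session, event names, timestamp format) are made up for the example.

```python
from typing import Any, Dict, Iterator, List, Tuple

# Prefixes mirror the notebook's config (learning_event_prefixes).
LEARNING_PREFIXES: Tuple[str, ...] = (
    "load_", "problem_", "video_", "seek_", "speed_", "pause_",
)

def iter_learning_events(data: List[Any]) -> Iterator[Dict[str, str]]:
    """Yield one flat record per learning event from the nested structure."""
    for course_id, users in data:
        for user_id, sessions in users.items():
            for session_hash, events in sessions.items():
                for event_type, timestamp in events:
                    # str.startswith accepts a tuple and matches any prefix in it.
                    if event_type.startswith(LEARNING_PREFIXES):
                        yield {
                            "course_id": course_id,
                            "user_id": user_id,
                            "session_hash": session_hash,
                            "event_type": event_type,
                            "timestamp": timestamp,
                        }

# Hypothetical sample: one course, one user, one session, three events.
sample = [
    ["course-A", {"u1": {"s1": [
        ["load_video", "2017-02-01T08:00:00"],
        ["click_about", "2017-02-01T08:00:05"],   # navigation event, dropped
        ["pause_video", "2017-02-01T08:01:00"],
    ]}}],
]

records = list(iter_learning_events(sample))
print([r["event_type"] for r in records])
```

Writing the flattening as a generator keeps it independent of how the JSON is loaded, so the same function could sit behind `json.load` or a streaming reader.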

In [1]:
# [CELL 01-00] Bootstrap: repo root + paths + logger

import os
import sys
import json
import time
import uuid
import hashlib
from pathlib import Path
from datetime import datetime
from typing import Any, Dict, List

import numpy as np
import pandas as pd

t0 = datetime.now()
print(f"[CELL 01-00] start={t0.isoformat(timespec='seconds')}")
print("[CELL 01-00] CWD:", Path.cwd().resolve())

def find_repo_root(start: Path) -> Path:
    start = start.resolve()
    for p in [start, *start.parents]:
        if (p / "PROJECT_STATE.md").exists():
            return p
    raise RuntimeError("Could not find PROJECT_STATE.md. Open notebook from within the repo.")

REPO_ROOT = find_repo_root(Path.cwd())
print("[CELL 01-00] REPO_ROOT:", REPO_ROOT)

PATHS = {
    "PROJECT_STATE": REPO_ROOT / "PROJECT_STATE.md",
    "META_REGISTRY": REPO_ROOT / "meta.json",
    "DATA_RAW": REPO_ROOT / "data" / "raw",
    "DATA_INTERIM": REPO_ROOT / "data" / "interim",
    "DATA_PROCESSED": REPO_ROOT / "data" / "processed",
    "REPORTS": REPO_ROOT / "reports",
}
for k, v in PATHS.items():
    print(f"[CELL 01-00] {k}={v}")

def cell_start(cell_id: str, title: str, **kwargs: Any) -> float:
    t = time.time()
    print(f"\n[{cell_id}] {title}")
    print(f"[{cell_id}] start={datetime.now().isoformat(timespec='seconds')}")
    for k, v in kwargs.items():
        print(f"[{cell_id}] {k}={v}")
    return t

def cell_end(cell_id: str, t0: float, **kwargs: Any) -> None:
    for k, v in kwargs.items():
        print(f"[{cell_id}] {k}={v}")
    print(f"[{cell_id}] elapsed={time.time()-t0:.2f}s")
    print(f"[{cell_id}] done")

print("[CELL 01-00] done")

[CELL 01-00] start=2026-02-01T02:01:35
[CELL 01-00] CWD: /workspace/anonymous-users-mooc-session-meta/notebooks
[CELL 01-00] REPO_ROOT: /workspace/anonymous-users-mooc-session-meta
[CELL 01-00] PROJECT_STATE=/workspace/anonymous-users-mooc-session-meta/PROJECT_STATE.md
[CELL 01-00] META_REGISTRY=/workspace/anonymous-users-mooc-session-meta/meta.json
[CELL 01-00] DATA_RAW=/workspace/anonymous-users-mooc-session-meta/data/raw
[CELL 01-00] DATA_INTERIM=/workspace/anonymous-users-mooc-session-meta/data/interim
[CELL 01-00] DATA_PROCESSED=/workspace/anonymous-users-mooc-session-meta/data/processed
[CELL 01-00] REPORTS=/workspace/anonymous-users-mooc-session-meta/reports
[CELL 01-00] done


In [2]:
# [CELL 01-01] Reproducibility: seed everything

t0 = cell_start("CELL 01-01", "Seed everything")

GLOBAL_SEED = 20260107  # New seed for XuetangX pipeline

def seed_everything(seed: int) -> None:
    import random
    random.seed(seed)
    np.random.seed(seed)

seed_everything(GLOBAL_SEED)

cell_end("CELL 01-01", t0, seed=GLOBAL_SEED)


[CELL 01-01] Seed everything
[CELL 01-01] start=2026-02-01T02:01:35
[CELL 01-01] seed=20260107
[CELL 01-01] elapsed=0.00s
[CELL 01-01] done


In [3]:
# [CELL 01-02] JSON IO + hashing helpers

t0 = cell_start("CELL 01-02", "JSON IO + hashing")

def write_json_atomic(path: Path, obj: Any, indent: int = 2) -> None:
    path.parent.mkdir(parents=True, exist_ok=True)
    tmp = path.with_suffix(path.suffix + f".tmp_{uuid.uuid4().hex}")
    with tmp.open("w", encoding="utf-8") as f:
        json.dump(obj, f, ensure_ascii=False, indent=indent)
    tmp.replace(path)

def read_json(path: Path) -> Any:
    if not path.exists():
        raise RuntimeError(f"Missing JSON file: {path}")
    with path.open("r", encoding="utf-8") as f:
        return json.load(f)

def sha256_file(path: Path, chunk_size: int = 1024 * 1024) -> str:
    h = hashlib.sha256()
    with path.open("rb") as f:
        while True:
            b = f.read(chunk_size)
            if not b:
                break
            h.update(b)
    return h.hexdigest()

def assert_nonempty_df(df: pd.DataFrame, name: str) -> None:
    if df is None or not isinstance(df, pd.DataFrame) or df.shape[0] == 0:
        raise RuntimeError(f"{name} is empty or invalid DataFrame")

cell_end("CELL 01-02", t0)


[CELL 01-02] JSON IO + hashing
[CELL 01-02] start=2026-02-01T02:01:35
[CELL 01-02] elapsed=0.00s
[CELL 01-02] done


In [4]:
# [CELL 01-03] Run tagging + report/config/manifest + meta.json

t0 = cell_start("CELL 01-03", "Start run + init run files + meta.json")

NOTEBOOK_NAME = "01_ingest_xuetangx"
RUN_TAG = datetime.now().strftime("%Y%m%d_%H%M%S")
RUN_ID = uuid.uuid4().hex

OUT_DIR = PATHS["REPORTS"] / NOTEBOOK_NAME / RUN_TAG
OUT_DIR.mkdir(parents=True, exist_ok=True)

REPORT_PATH = OUT_DIR / "report.json"
CONFIG_PATH = OUT_DIR / "config.json"
MANIFEST_PATH = OUT_DIR / "manifest.json"

RAW_DIR = PATHS["DATA_RAW"] / "xuetangx"
OUT_PARQUET = PATHS["DATA_INTERIM"] / "xuetangx_events_raw.parquet"
OUT_DUCKDB = PATHS["DATA_INTERIM"] / "xuetangx.duckdb"

CFG = {
    "notebook": NOTEBOOK_NAME,
    "run_id": RUN_ID,
    "run_tag": RUN_TAG,
    "seed": GLOBAL_SEED,
    "paths": {
        "raw_dir": str(RAW_DIR),
        "out_parquet": str(OUT_PARQUET),
        "out_duckdb": str(OUT_DUCKDB),
        "out_dir": str(OUT_DIR),
    },
    "ingest": {
        "prototype_mode": True,  # Start with 1 file
        "prototype_file": "20170201-20170801-raw_user_activity.json",  # NEWEST FILE
        "all_files_pattern": "*-raw_user_activity.json",
        "learning_events_only": True,  # Filter to learning behavior
        "learning_event_prefixes": ["load_", "problem_", "video_", "seek_", "speed_", "pause_"],
        "parquet_compression": "zstd",
        "duckdb_view": "xuetangx_events_raw",
    }
}

write_json_atomic(CONFIG_PATH, CFG)

report = {
    "run_id": RUN_ID,
    "notebook": NOTEBOOK_NAME,
    "run_tag": RUN_TAG,
    "created_at": datetime.now().isoformat(timespec="seconds"),
    "repo_root": str(REPO_ROOT),
    "metrics": {},
    "key_findings": [],
    "sanity_samples": {},
    "data_fingerprints": {},
    "notes": [],
}
write_json_atomic(REPORT_PATH, report)

manifest = {
    "run_id": RUN_ID,
    "notebook": NOTEBOOK_NAME,
    "run_tag": RUN_TAG,
    "artifacts": [],
}
write_json_atomic(MANIFEST_PATH, manifest)

# meta.json append-only
META_PATH = PATHS["META_REGISTRY"]
if not META_PATH.exists():
    write_json_atomic(META_PATH, {"schema_version": 1, "runs": []})

meta = read_json(META_PATH)
if "runs" not in meta or not isinstance(meta["runs"], list):
    raise RuntimeError("meta.json invalid: missing 'runs' list")

meta["runs"].append({
    "run_id": RUN_ID,
    "notebook": NOTEBOOK_NAME,
    "run_tag": RUN_TAG,
    "out_dir": str(OUT_DIR),
    "created_at": datetime.now().isoformat(timespec="seconds"),
})
write_json_atomic(META_PATH, meta)

cell_end("CELL 01-03", t0,
         out_dir=str(OUT_DIR),
         report=str(REPORT_PATH),
         config=str(CONFIG_PATH),
         manifest=str(MANIFEST_PATH),
         meta=str(META_PATH))


[CELL 01-03] Start run + init run files + meta.json
[CELL 01-03] start=2026-02-01T02:01:35
[CELL 01-03] out_dir=/workspace/anonymous-users-mooc-session-meta/reports/01_ingest_xuetangx/20260201_020135
[CELL 01-03] report=/workspace/anonymous-users-mooc-session-meta/reports/01_ingest_xuetangx/20260201_020135/report.json
[CELL 01-03] config=/workspace/anonymous-users-mooc-session-meta/reports/01_ingest_xuetangx/20260201_020135/config.json
[CELL 01-03] manifest=/workspace/anonymous-users-mooc-session-meta/reports/01_ingest_xuetangx/20260201_020135/manifest.json
[CELL 01-03] meta=/workspace/anonymous-users-mooc-session-meta/meta.json
[CELL 01-03] elapsed=0.00s
[CELL 01-03] done


In [5]:
# [CELL 01-04] Enumerate raw JSON files + select prototype

t0 = cell_start("CELL 01-04", "Enumerate raw JSON files", raw_dir=str(RAW_DIR))

if not RAW_DIR.exists():
    raise RuntimeError(f"Raw directory not found: {RAW_DIR}")

all_json_files = sorted(RAW_DIR.glob(CFG["ingest"]["all_files_pattern"]))
print(f"[CELL 01-04] Found {len(all_json_files)} JSON files:")
for f in all_json_files:
    print(f"  {f.name} ({f.stat().st_size / 1024 / 1024 / 1024:.2f} GB)")

# Prototype mode: use only the configured prototype file
if CFG["ingest"]["prototype_mode"]:
    prototype_name = CFG["ingest"]["prototype_file"]
    target_file = RAW_DIR / prototype_name
    if not target_file.exists():
        raise RuntimeError(f"Prototype file not found: {target_file}")
    files_to_parse = [target_file]
    print(f"\n[CELL 01-04] PROTOTYPE MODE: using only {prototype_name}")
else:
    files_to_parse = all_json_files
    print(f"\n[CELL 01-04] FULL MODE: parsing all {len(files_to_parse)} files")

cell_end("CELL 01-04", t0, n_files=len(files_to_parse))


[CELL 01-04] Enumerate raw JSON files
[CELL 01-04] start=2026-02-01T02:01:35
[CELL 01-04] raw_dir=/workspace/anonymous-users-mooc-session-meta/data/raw/xuetangx
[CELL 01-04] Found 1 JSON files:
  20170201-20170801-raw_user_activity.json (4.79 GB)

[CELL 01-04] PROTOTYPE MODE: using only 20170201-20170801-raw_user_activity.json
[CELL 01-04] n_files=1
[CELL 01-04] elapsed=0.00s
[CELL 01-04] done


In [6]:
# [CELL 01-05] Parse nested JSON → flattened DataFrame

t0 = cell_start("CELL 01-05", "Parse nested JSON to DataFrame")

# Event filtering
LEARNING_PREFIXES = tuple(CFG["ingest"]["learning_event_prefixes"])
FILTER_LEARNING = CFG["ingest"]["learning_events_only"]

def is_learning_event(event_type: str) -> bool:
    """Filter to learning events only (exclude UI navigation)."""
    if not FILTER_LEARNING:
        return True
    return event_type.startswith(LEARNING_PREFIXES)

def parse_xuetangx_json(file_path: Path) -> pd.DataFrame:
    """
    Parse XuetangX JSON structure:
    [
      [course_id, {user_id: {session_hash: [[event, timestamp], ...]}}],
      ...
    ]
    
    Returns: DataFrame with [course_id, user_id, session_hash, event_type, timestamp]
    """
    print(f"  Parsing: {file_path.name} ({file_path.stat().st_size / 1024 / 1024:.1f} MB)...")
    
    with file_path.open('r', encoding='utf-8') as f:
        data = json.load(f)
    
    rows = []
    n_courses = len(data)
    n_events_total = 0
    n_events_filtered = 0
    
    for course_idx, course_data in enumerate(data):
        if (course_idx + 1) % 100 == 0:
            print(f"    Progress: {course_idx + 1}/{n_courses} courses...")
        
        course_id = course_data[0]
        users_dict = course_data[1]
        
        for user_id, sessions_dict in users_dict.items():
            for session_hash, events_list in sessions_dict.items():
                for event_data in events_list:
                    event_type = event_data[0]
                    timestamp = event_data[1]
                    
                    n_events_total += 1
                    
                    if is_learning_event(event_type):
                        rows.append({
                            "course_id": course_id,
                            "user_id": user_id,
                            "session_hash": session_hash,
                            "event_type": event_type,
                            "timestamp": timestamp,
                        })
                        n_events_filtered += 1
    
    df = pd.DataFrame(rows)
    print(f"    Events: {n_events_total:,} total, {n_events_filtered:,} kept (learning only)")
    return df

# Parse all selected files, tagging each chunk with its own source file
dfs = []
for fpath in files_to_parse:
    df_chunk = parse_xuetangx_json(fpath)
    # Per-file tag: in full mode each row records the file it came from
    df_chunk["__source_file"] = fpath.name
    dfs.append(df_chunk)

events = pd.concat(dfs, ignore_index=True)
assert_nonempty_df(events, "events")

print(f"\n[CELL 01-05] Combined shape: {events.shape}")
print(f"[CELL 01-05] Columns: {list(events.columns)}")
print(f"\n[CELL 01-05] Head(3):")
print(events.head(3).to_string(index=False))

cell_end("CELL 01-05", t0, rows=int(events.shape[0]), cols=int(events.shape[1]))


[CELL 01-05] Parse nested JSON to DataFrame
[CELL 01-05] start=2026-02-01T02:01:35
  Parsing: 20170201-20170801-raw_user_activity.json (4903.7 MB)...
    Progress: 100/1577 courses...
    Progress: 200/1577 courses...
    Progress: 300/1577 courses...
    Progress: 400/1577 courses...
    Progress: 500/1577 courses...
    Progress: 600/1577 courses...
    Progress: 700/1577 courses...
    Progress: 800/1577 courses...
    Progress: 900/1577 courses...
    Progress: 1000/1577 courses...
    Progress: 1100/1577 courses...
    Progress: 1200/1577 courses...
    Progress: 1300/1577 courses...
    Progress: 1400/1577 courses...
    Progress: 1500/1577 courses...
    Events: 56,985,474 total, 28,002,537 kept (learning only)

[CELL 01-05] Combined shape: (28002537, 6)
[CELL 01-05] Columns: ['course_id', 'user_id', 'session_hash', 'event_type', 'timestamp', '__source_file']

[CELL 01-05] Head(3):
                              course_id user_id                     session_hash  event_type     

In [7]:
# [CELL 01-06] Save canonical Parquet

t0 = cell_start("CELL 01-06", "Save canonical Parquet", out_parquet=str(OUT_PARQUET))

OUT_PARQUET.parent.mkdir(parents=True, exist_ok=True)
events.to_parquet(OUT_PARQUET, index=False, compression=CFG["ingest"]["parquet_compression"])

parq_bytes = int(OUT_PARQUET.stat().st_size)
parq_sha = sha256_file(OUT_PARQUET)

print(f"[CELL 01-06] Saved: {OUT_PARQUET}")
print(f"[CELL 01-06] Size: {parq_bytes / 1024 / 1024:.1f} MB")
print(f"[CELL 01-06] SHA256: {parq_sha}")

cell_end("CELL 01-06", t0)


[CELL 01-06] Save canonical Parquet
[CELL 01-06] start=2026-02-01T02:03:11
[CELL 01-06] out_parquet=/workspace/anonymous-users-mooc-session-meta/data/interim/xuetangx_events_raw.parquet
[CELL 01-06] Saved: /workspace/anonymous-users-mooc-session-meta/data/interim/xuetangx_events_raw.parquet
[CELL 01-06] Size: 95.0 MB
[CELL 01-06] SHA256: 98dd880d3b1431623e74da517c4eb41c43ba776c1dff1a52492ad0ae6fec1c01
[CELL 01-06] elapsed=5.08s
[CELL 01-06] done


In [8]:
# [CELL 01-07] DuckDB: create DB + view from Parquet

t0 = cell_start("CELL 01-07", "Create DuckDB + view", out_duckdb=str(OUT_DUCKDB))

try:
    import duckdb
except ImportError as e:
    raise RuntimeError(
        "Missing duckdb. Install via:\n"
        "  conda install -c conda-forge duckdb\n"
        "or\n"
        "  pip install duckdb\n"
    ) from e

OUT_DUCKDB.parent.mkdir(parents=True, exist_ok=True)
con = duckdb.connect(str(OUT_DUCKDB))

view = CFG["ingest"]["duckdb_view"]
# Escape single quotes so the path is a valid SQL string literal
parquet_sql_path = str(OUT_PARQUET).replace("'", "''")
con.execute(f"DROP VIEW IF EXISTS {view};")
con.execute(f"""
CREATE VIEW {view} AS
SELECT * FROM read_parquet('{parquet_sql_path}')
""")

n = con.execute(f"SELECT COUNT(*) FROM {view}").fetchone()[0]
schema_df = con.execute(f"DESCRIBE {view}").fetchdf()

print(f"[CELL 01-07] View: {view}")
print(f"[CELL 01-07] Rows: {int(n):,}")
print(f"\n[CELL 01-07] Schema:")
print(schema_df.to_string(index=False))

con.close()
print(f"\n[CELL 01-07] Closed DuckDB connection")

cell_end("CELL 01-07", t0, rows=int(n))


[CELL 01-07] Create DuckDB + view
[CELL 01-07] start=2026-02-01T02:03:16
[CELL 01-07] out_duckdb=/workspace/anonymous-users-mooc-session-meta/data/interim/xuetangx.duckdb
[CELL 01-07] View: xuetangx_events_raw
[CELL 01-07] Rows: 28,002,537

[CELL 01-07] Schema:
  column_name column_type null  key default extra
    course_id     VARCHAR  YES None    None  None
      user_id     VARCHAR  YES None    None  None
 session_hash     VARCHAR  YES None    None  None
   event_type     VARCHAR  YES None    None  None
    timestamp     VARCHAR  YES None    None  None
__source_file     VARCHAR  YES None    None  None

[CELL 01-07] Closed DuckDB connection
[CELL 01-07] rows=28002537
[CELL 01-07] elapsed=0.12s
[CELL 01-07] done


In [9]:
# [CELL 01-08] Update report + manifest

t0 = cell_start("CELL 01-08", "Write fingerprints + sanity sample to report")

report = read_json(REPORT_PATH)
manifest = read_json(MANIFEST_PATH)

# Sanity sample
head3 = events.head(3).to_dict(orient="records")

# Data fingerprints
raw_fp = {
    "root": str(RAW_DIR),
    "n_files_parsed": len(files_to_parse),
    "files": [
        {
            "name": f.name,
            "bytes": int(f.stat().st_size),
            "sha256": sha256_file(f),
        } for f in files_to_parse
    ],
}

report["data_fingerprints"]["xuetangx_raw_files"] = raw_fp
report["data_fingerprints"]["xuetangx_events_raw_parquet"] = {
    "path": str(OUT_PARQUET),
    "bytes": parq_bytes,
    "sha256": parq_sha,
}
report["sanity_samples"]["events_head3"] = head3
mode = "prototype" if CFG["ingest"]["prototype_mode"] else "full"
report["notes"].append(
    f"Parsed {len(files_to_parse)} JSON file(s) in {mode} mode. "
    "Filtered to learning events only (load_*, problem_*, video_*, etc.). "
    "Kept platform session_hash for sessionization."
)

write_json_atomic(REPORT_PATH, report)

# Manifest
def add_artifact(path: Path) -> None:
    rec = {
        "path": str(path),
        "bytes": int(path.stat().st_size),
        "sha256": None,
        "sha256_error": None,
    }
    try:
        rec["sha256"] = sha256_file(path)
    except PermissionError as e:
        rec["sha256_error"] = f"PermissionError: {e}"
        print(f"[CELL 01-08] WARN: could not hash (locked): {path}")
    manifest["artifacts"].append(rec)

add_artifact(OUT_PARQUET)
add_artifact(OUT_DUCKDB)

write_json_atomic(MANIFEST_PATH, manifest)

print(f"[CELL 01-08] Updated: {REPORT_PATH}")
print(f"[CELL 01-08] Updated: {MANIFEST_PATH}")

cell_end("CELL 01-08", t0)


[CELL 01-08] Write fingerprints + sanity sample to report
[CELL 01-08] start=2026-02-01T02:03:16
[CELL 01-08] Updated: /workspace/anonymous-users-mooc-session-meta/reports/01_ingest_xuetangx/20260201_020135/report.json
[CELL 01-08] Updated: /workspace/anonymous-users-mooc-session-meta/reports/01_ingest_xuetangx/20260201_020135/manifest.json
[CELL 01-08] elapsed=3.63s
[CELL 01-08] done
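
The manifest written above records a SHA-256 per artifact, which makes it possible to check later that the files on disk are still the ones this run produced. The sketch below shows one way to do that; `verify_manifest` is a hypothetical helper that is not part of the notebook, and the demo uses a throwaway temp directory rather than the real artifacts.

```python
import hashlib
import json
import tempfile
from pathlib import Path
from typing import List

def sha256_file(path: Path, chunk_size: int = 1024 * 1024) -> str:
    """Hash a file in chunks so large artifacts do not need to fit in memory."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for block in iter(lambda: f.read(chunk_size), b""):
            h.update(block)
    return h.hexdigest()

def verify_manifest(manifest_path: Path) -> List[str]:
    """Return the paths whose current hash differs from the recorded one."""
    manifest = json.loads(manifest_path.read_text(encoding="utf-8"))
    mismatched = []
    for rec in manifest["artifacts"]:
        if rec.get("sha256") is None:
            continue  # hash was never recorded (e.g. the file was locked)
        if sha256_file(Path(rec["path"])) != rec["sha256"]:
            mismatched.append(rec["path"])
    return mismatched

# Demo on a temporary artifact: record its hash, then modify the file.
with tempfile.TemporaryDirectory() as tmp:
    artifact = Path(tmp) / "artifact.bin"
    artifact.write_bytes(b"hello")
    manifest_path = Path(tmp) / "manifest.json"
    manifest_path.write_text(json.dumps({"artifacts": [{
        "path": str(artifact),
        "bytes": artifact.stat().st_size,
        "sha256": sha256_file(artifact),
    }]}), encoding="utf-8")

    ok_before = verify_manifest(manifest_path)   # nothing changed yet
    artifact.write_bytes(b"tampered")
    bad_after = verify_manifest(manifest_path)   # hash no longer matches
    tampered_path = str(artifact)

print(ok_before, bad_after)
```

Records with a null `sha256` are skipped rather than reported, matching the manifest's convention of storing `sha256_error` when a file could not be hashed.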


In [None]:
# [CELL 01-09] Basic stats (users/courses/events)

t0 = cell_start("CELL 01-09", "Compute basic stats")

stats = {
    "n_events": int(events.shape[0]),
    "n_users": int(events["user_id"].nunique()),
    "n_courses": int(events["course_id"].nunique()),
    "n_sessions": int(events["session_hash"].nunique()),
    "n_event_types": int(events["event_type"].nunique()),
}

print(f"[CELL 01-09] Stats:")
for k, v in stats.items():
    print(f"  {k}: {v:,}")

# Top event types
top_events = events["event_type"].value_counts().head(10)
print(f"\n[CELL 01-09] Top 10 event types:")
print(top_events.to_string())

# Save to report
report = read_json(REPORT_PATH)
report["metrics"] = stats
report["sanity_samples"]["top_event_types"] = top_events.to_dict()
write_json_atomic(REPORT_PATH, report)

cell_end("CELL 01-09", t0, **stats)


[CELL 01-09] Compute basic stats
[CELL 01-09] start=2026-02-01T02:03:20


## ✅ Notebook 01 Complete

**Outputs:**
- ✅ `data/interim/xuetangx_events_raw.parquet`
- ✅ `data/interim/xuetangx.duckdb` (view: `xuetangx_events_raw`)
- ✅ `reports/01_ingest_xuetangx/<run_tag>/report.json`

**Next:** Notebook 02 (Sessionize XuetangX)
- Validate platform session_hash
- Compute gap statistics
- Decide: use platform sessions or re-sessionize?
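
As a rough preview of the gap statistics planned for Notebook 02, the sketch below computes the time between consecutive events inside each `session_hash`. It rests on assumptions: the rows are invented, the timestamps are assumed to be strings that `pd.to_datetime` can parse (the view stores them as VARCHAR), and the 30-minute threshold is a placeholder, not a decision made in this project.

```python
import pandas as pd

# Hypothetical toy rows standing in for xuetangx_events_raw.
events = pd.DataFrame({
    "session_hash": ["s1", "s1", "s1", "s2", "s2"],
    "timestamp": [
        "2017-02-01T08:00:00",
        "2017-02-01T08:00:30",
        "2017-02-01T08:45:30",
        "2017-02-01T09:00:00",
        "2017-02-01T09:10:00",
    ],
})

# Parse the string timestamps, then order events within each session.
events["ts"] = pd.to_datetime(events["timestamp"])
events = events.sort_values(["session_hash", "ts"])

# Gap to the previous event in the same session, in seconds.
# The first event of every session has no predecessor, so its gap is NaN.
events["gap_s"] = events.groupby("session_hash")["ts"].diff().dt.total_seconds()

# Assumed threshold: gaps longer than this would suggest re-sessionizing.
GAP_THRESHOLD_S = 30 * 60
n_long_gaps = int((events["gap_s"] > GAP_THRESHOLD_S).sum())

print(events["gap_s"].describe())
print("gaps above threshold:", n_long_gaps)
```

If many platform sessions contain gaps above whatever threshold is eventually chosen, that would be an argument for re-sessionizing rather than trusting `session_hash` as is.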