##### Sessionize (target + source) and build prefix→label samples (target)
**Repo:** `mooc-coldstart-session-meta`  
**Strict order:** run this notebook only after Notebook 04, which produces the inputs validated below.  
**Decisions (locked):** `target_gap = 30m (1800s)`, `source_gap = 10m (600s)`.

This notebook:
1) Validates `session_gap_thresholds.json` matches the locked decisions.  
2) Sessionizes **target** (MARS explicit-only variant) and writes sessionized events + session sequences.  
3) Builds **target** supervised samples (prefix → next-item label).  
4) Sessionizes **source** (XuetangX) using DuckDB (scales to large data) and writes sessionized events.

> Guardrail: **no toy/synthetic data**; everything is computed from the real parquet inputs.

In [1]:
# [CELL 05-00] Imports + versions

import os
import sys
import json
import time
import math
import hashlib
from pathlib import Path
from datetime import datetime

import numpy as np
import pandas as pd

import duckdb
import pyarrow as pa
import pyarrow.parquet as pq

print("python:", sys.version)
print("pandas:", pd.__version__)
print("duckdb:", duckdb.__version__)
print("pyarrow:", pa.__version__)

python: 3.11.14 | packaged by Anaconda, Inc. | (main, Oct 21 2025, 18:30:03) [MSC v.1929 64 bit (AMD64)]
pandas: 2.3.3
duckdb: 1.4.3
pyarrow: 22.0.0


In [2]:
# [CELL 05-01] Bootstrap: locate repo root reliably (Windows-safe)

from pathlib import Path

CWD = Path.cwd().resolve()
print("Initial CWD:", CWD)

def find_repo_root(start: Path) -> Path:
    """Search upward for repo root. Priority: PROJECT_STATE.md."""
    for p in [start, *start.parents]:
        if (p / "PROJECT_STATE.md").exists():
            print(f"  Found PROJECT_STATE.md in: {p}")
            return p
    # fallback: git
    for p in [start, *start.parents]:
        if (p / ".git").exists():
            print(f"  Found .git in: {p}")
            return p
    raise FileNotFoundError("Could not locate repo root (PROJECT_STATE.md or .git).")

REPO_ROOT = find_repo_root(CWD)
print("REPO_ROOT:", REPO_ROOT)

DATA_DIR = REPO_ROOT / "data"
PROC_DIR = DATA_DIR / "processed"
NORM_DIR = PROC_DIR / "normalized_events"

IN_TARGET = NORM_DIR / "events_target_norm.parquet"
IN_SOURCE = NORM_DIR / "events_source_norm.parquet"
IN_THRESH = NORM_DIR / "session_gap_thresholds.json"

OUT_SESS_DIR = PROC_DIR / "sessionized"
OUT_SUP_DIR  = PROC_DIR / "supervised"
OUT_SESS_DIR.mkdir(parents=True, exist_ok=True)
OUT_SUP_DIR.mkdir(parents=True, exist_ok=True)

RUN_TAG = datetime.now().strftime("%Y%m%d_%H%M%S")
print("RUN_TAG:", RUN_TAG)

print("IN_TARGET:", IN_TARGET)
print("IN_SOURCE:", IN_SOURCE)
print("IN_THRESH:", IN_THRESH)

Initial CWD: D:\00_DS-ML-Workspace\mooc-coldstart-session-meta\notebooks
  Found PROJECT_STATE.md in: D:\00_DS-ML-Workspace\mooc-coldstart-session-meta
REPO_ROOT: D:\00_DS-ML-Workspace\mooc-coldstart-session-meta
RUN_TAG: 20251229_232834
IN_TARGET: D:\00_DS-ML-Workspace\mooc-coldstart-session-meta\data\processed\normalized_events\events_target_norm.parquet
IN_SOURCE: D:\00_DS-ML-Workspace\mooc-coldstart-session-meta\data\processed\normalized_events\events_source_norm.parquet
IN_THRESH: D:\00_DS-ML-Workspace\mooc-coldstart-session-meta\data\processed\normalized_events\session_gap_thresholds.json


In [3]:
# [CELL 05-02] Validate inputs exist + validate locked thresholds

for p in [IN_TARGET, IN_SOURCE, IN_THRESH]:
    if not p.exists():
        raise FileNotFoundError(f"Missing required input: {p}")

thresholds = json.loads(IN_THRESH.read_text(encoding="utf-8"))
print("Loaded thresholds file:", IN_THRESH)
print(json.dumps(thresholds, indent=2)[:1000])

# Locked decisions for this project step:
LOCK_TARGET_S = 1800  # 30m
LOCK_SOURCE_S = 600   # 10m

t = thresholds.get("target", {})
s = thresholds.get("source", {})

t_s = int(t.get("primary_threshold_seconds", -1))
s_s = int(s.get("primary_threshold_seconds", -1))

if t_s != LOCK_TARGET_S or s_s != LOCK_SOURCE_S:
    raise ValueError(
        "session_gap_thresholds.json does NOT match locked decisions for Notebook 05.\n"
        f"Expected target={LOCK_TARGET_S}s, source={LOCK_SOURCE_S}s; got target={t_s}s, source={s_s}s.\n"
        "Fix: update data/processed/normalized_events/session_gap_thresholds.json then re-run."
    )

print("✅ Thresholds validated:", {"target_s": t_s, "source_s": s_s})

Loaded thresholds file: D:\00_DS-ML-Workspace\mooc-coldstart-session-meta\data\processed\normalized_events\session_gap_thresholds.json
{
  "generated_from_run_tag": "20251229_154018",
  "generated_at": "2025-12-29T08:20:10",
  "target": {
    "primary_threshold_seconds": 1800,
    "primary_threshold_label": "30m"
  },
  "source": {
    "primary_threshold_seconds": 600,
    "primary_threshold_label": "10m",
    "sampling": {
      "SAMPLE_MOD": 100,
      "MIN_EVENTS_PER_USER": 2,
      "sample_users": 6891,
      "sample_counts": {
        "n_events": 1698038,
        "n_users": 6891,
        "n_items": 1476,
        "min_ts": "2015-07-31T23:59:21",
        "max_ts": "2017-07-31T23:51:34"
      }
    }
  },
  "decision_notes": {
    "source_threshold_override": "Override source primary gap to 10m (600s) to reduce session explosion; coverage-based recommendation remains 5m."
  }
}
✅ Thresholds validated: {'target_s': 1800, 'source_s': 600}
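
The check above maps a missing key to `-1`, so a missing entry and a wrong value surface as the same error. A minimal sketch of a more explicit check, assuming the same JSON schema (`<domain>.primary_threshold_seconds`); `LOCKED_GAPS_S` and `find_threshold_mismatches` are hypothetical names, not part of the notebook:

```python
# Hypothetical helper: report each threshold problem separately.
LOCKED_GAPS_S = {"target": 1800, "source": 600}  # locked decisions stated in this notebook


def find_threshold_mismatches(thresholds: dict, locked: dict) -> list[str]:
    """Return one message per domain whose primary threshold is missing or differs."""
    problems = []
    for domain, expected_s in locked.items():
        raw = thresholds.get(domain, {}).get("primary_threshold_seconds")
        if raw is None:
            problems.append(f"{domain}: primary_threshold_seconds is missing")
        elif int(raw) != expected_s:
            problems.append(f"{domain}: expected {expected_s}s, got {int(raw)}s")
    return problems
```

An empty list means the file matches the locked decisions; a non-empty list could be joined into the `ValueError` message.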


In [4]:
# [CELL 05-03] Load target events (explicit-only variant) + basic checks (robust)

import time
import pandas as pd
import numpy as np

t0 = time.time()

# --- Load ---
df_t = pd.read_parquet(IN_TARGET)
print("Loaded target events:", df_t.shape)

# --- Normalize column names (safe) ---
orig_cols = df_t.columns.tolist()
df_t.columns = [str(c).strip() for c in df_t.columns]
print("Columns:", df_t.columns.tolist())

# --- Helper: rename from possible alternatives ---
def _rename_first_match(df, target, candidates):
    if target in df.columns:
        return
    for c in candidates:
        if c in df.columns:
            df.rename(columns={c: target}, inplace=True)
            return

# Try to map typical variants (in case the parquet isn't perfectly standardized)
_rename_first_match(df_t, "user_id", ["user", "userid", "learner_id", "student_id", "uid"])
_rename_first_match(df_t, "item_id", ["item", "itemid", "course_id", "resource_id", "object_id", "iid"])
_rename_first_match(df_t, "timestamp", ["ts", "time", "event_time", "datetime", "created_at"])

# Domain is often missing in target (single domain). If missing, create it.
if "domain" not in df_t.columns:
    df_t["domain"] = TARGET_DOMAIN if "TARGET_DOMAIN" in globals() else "target"

# --- Required columns check ---
required_cols = ["domain", "user_id", "item_id", "timestamp"]
missing = [c for c in required_cols if c not in df_t.columns]
if missing:
    raise ValueError(
        f"Target missing required columns: {missing}\n"
        f"Existing columns: {df_t.columns.tolist()}\n"
        f"Original columns: {orig_cols}"
    )

# --- Timestamp parsing (handles int seconds/ms + strings) ---
ts = df_t["timestamp"]

if pd.api.types.is_datetime64_any_dtype(ts):
    # already datetime
    df_t["timestamp"] = pd.to_datetime(ts, utc=True, errors="coerce")
elif pd.api.types.is_numeric_dtype(ts):
    # numeric epoch: decide seconds vs milliseconds by magnitude
    vals = ts.dropna().astype("int64")
    if len(vals) == 0:
        df_t["timestamp"] = pd.NaT
    else:
        mx = int(vals.max())
        # heuristic: > 1e12 => ms, else => seconds
        unit = "ms" if mx > 1_000_000_000_000 else "s"
        df_t["timestamp"] = pd.to_datetime(ts, unit=unit, utc=True, errors="coerce")
else:
    # strings / objects
    df_t["timestamp"] = pd.to_datetime(ts, utc=True, errors="coerce")

bad_ts = int(df_t["timestamp"].isna().sum())
if bad_ts:
    # show a few bad rows to debug quickly
    bad_rows = df_t.loc[df_t["timestamp"].isna(), ["user_id", "item_id", "timestamp"]].head(10)
    raise ValueError(f"Target has {bad_ts} rows with invalid timestamp. Sample:\n{bad_rows}")

# --- Type cleanup for ids (avoid mixed types) ---
df_t["user_id"] = df_t["user_id"].astype(str)
df_t["item_id"] = df_t["item_id"].astype(str)
df_t["domain"]  = df_t["domain"].astype(str)

# --- Sort for sessionization correctness ---
df_t = df_t.sort_values(["user_id", "timestamp", "item_id"]).reset_index(drop=True)

print("Target domain counts:", df_t["domain"].value_counts(dropna=False).to_dict())
print("n_users:", df_t["user_id"].nunique(), "n_items:", df_t["item_id"].nunique())
print("time range:", df_t["timestamp"].min(), "→", df_t["timestamp"].max())
print("Load+checks seconds:", round(time.time() - t0, 2))


Loaded target events: (3655, 5)
Columns: ['user_id', 'item_id', 'timestamp', 'signal_type', 'value']
Target domain counts: {'target': 3655}
n_users: 822 n_items: 776
time range: 2018-09-28 14:38:15+00:00 → 2021-09-20 16:26:06+00:00
Load+checks seconds: 0.03
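
The numeric branch of the timestamp parsing relies on a magnitude cutoff: `1e12` milliseconds corresponds to 2001-09-09 UTC, while `1e12` seconds lies tens of thousands of years in the future, so any recent epoch value above the cutoff can be read as milliseconds. A standalone sketch of that heuristic using only the standard library; `MS_CUTOFF`, `infer_epoch_unit`, and `epoch_to_utc` are illustrative names:

```python
from datetime import datetime, timezone

# Same cutoff as the cell above: larger values are treated as milliseconds.
MS_CUTOFF = 1_000_000_000_000


def infer_epoch_unit(max_value: int) -> str:
    """Guess the epoch unit from the largest observed value."""
    return "ms" if max_value > MS_CUTOFF else "s"


def epoch_to_utc(value: int, unit: str) -> datetime:
    """Convert an epoch value in seconds or milliseconds to an aware UTC datetime."""
    seconds = value / 1000.0 if unit == "ms" else float(value)
    return datetime.fromtimestamp(seconds, tz=timezone.utc)
```

The heuristic does not cover microsecond or nanosecond epochs; those would need additional cutoffs.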


In [5]:
# [CELL 05-04] Sessionize target in pandas (fast: target is small)

TARGET_GAP_S = LOCK_TARGET_S  # already validated

# Compute per-user time gaps
df_t["prev_ts"] = df_t.groupby("user_id")["timestamp"].shift(1)
df_t["gap_s"] = (df_t["timestamp"] - df_t["prev_ts"]).dt.total_seconds()

# New session if first event or gap > threshold
df_t["new_session"] = df_t["prev_ts"].isna() | (df_t["gap_s"] > TARGET_GAP_S)

# Session index per user
df_t["session_idx"] = df_t.groupby("user_id")["new_session"].cumsum().astype("int64")

# Stable session_id (string)
df_t["session_id"] = (
    "t_" + df_t["user_id"].astype(str) + "_" + df_t["session_idx"].astype(str)
)

# Sanity checks
n_sessions = df_t["session_id"].nunique()
sess_len = df_t.groupby("session_id").size()
print("Target sessions:", n_sessions)
print("Session length quantiles:", sess_len.quantile([0.5, 0.9, 0.99]).to_dict())
print("Min/Max session length:", int(sess_len.min()), int(sess_len.max()))

# Write sessionized target events
OUT_T_EVENTS = OUT_SESS_DIR / f"target_events_sessionized_{RUN_TAG}.parquet"
df_t_out = df_t.drop(columns=["prev_ts"])
df_t_out.to_parquet(OUT_T_EVENTS, index=False)
print("Wrote:", OUT_T_EVENTS, "| rows:", len(df_t_out))

Target sessions: 1322
Session length quantiles: {0.5: 1.0, 0.9: 6.0, 0.99: 24.789999999999964}
Min/Max session length: 1 50
Wrote: D:\00_DS-ML-Workspace\mooc-coldstart-session-meta\data\processed\sessionized\target_events_sessionized_20251229_232834.parquet | rows: 3655
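
The rule applied in cell 05-04 is: a user's first event opens session 1, and every later event opens a new session only when its gap to the previous event is strictly greater than the threshold. A pure-Python sketch of the same rule for one user's sorted timestamps (the function name is hypothetical; the notebook itself uses the vectorized pandas version):

```python
from datetime import datetime


def assign_session_indices(timestamps: list[datetime], gap_s: float) -> list[int]:
    """1-based session index per event for ONE user; timestamps must be sorted ascending."""
    indices = []
    current = 0
    prev = None
    for ts in timestamps:
        # New session on the first event or when the gap exceeds the threshold.
        if prev is None or (ts - prev).total_seconds() > gap_s:
            current += 1
        indices.append(current)
        prev = ts
    return indices
```

Because the comparison is strict, a gap of exactly 1800 seconds keeps two target events in the same session.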


In [6]:
# [CELL 05-05] Build target prefix samples (DuckDB) + singleton stats + duration  — FIXED

import duckdb
import pandas as pd
from pathlib import Path
import time

t0 = time.time()

con = duckdb.connect(database=":memory:")
con.execute("SET threads=1;")
con.execute("SET preserve_insertion_order=false;")
con.execute("PRAGMA memory_limit='6GB';")

# Read the target sessionized events
t_events_path = Path(OUT_T_EVENTS).as_posix().replace("'", "''")
con.execute(f"""
CREATE OR REPLACE VIEW sessions_target_view AS
SELECT * FROM read_parquet('{t_events_path}');
""")

# --- MAX_PREFIX_LEN: robust resolve ---
# Prefer an existing MAX_PREFIX_LEN, else fall back to 20 (project default)
MAXP = int(globals().get("MAX_PREFIX_LEN", 20))
print("Using MAX_PREFIX_LEN:", MAXP)

sql = f"""
CREATE OR REPLACE TABLE session_lengths AS
SELECT
  domain,
  user_id,
  session_id,
  COUNT(*) AS session_length,
  MIN(timestamp) AS session_start,
  MAX(timestamp) AS session_end,
  EXTRACT(EPOCH FROM (MAX(timestamp) - MIN(timestamp))) AS session_duration_sec
FROM sessions_target_view
GROUP BY domain, user_id, session_id;

CREATE OR REPLACE TABLE eligible_sessions AS
SELECT *
FROM session_lengths
WHERE session_length >= 2;

CREATE OR REPLACE TABLE prefix_target_splits AS
WITH sess AS (
  SELECT
    e.domain,
    e.user_id,
    e.session_id,
    e.session_length,
    e.session_start,
    e.session_end,
    e.session_duration_sec,
    LIST(v.item_id ORDER BY v.timestamp, v.item_id) AS items
  FROM eligible_sessions e
  JOIN sessions_target_view v
    ON v.session_id = e.session_id
  GROUP BY
    e.domain, e.user_id, e.session_id,
    e.session_length, e.session_start, e.session_end, e.session_duration_sec
),
splits AS (
  SELECT
    domain,
    user_id,
    session_id,
    session_length,
    session_start,
    session_end,
    session_duration_sec,
    i AS split_pos,
    LIST_SLICE(
      items,
      GREATEST(1, i - {MAXP} + 1),
      i
    ) AS prefix_items,
    items[i + 1] AS label_item
  FROM sess,
       GENERATE_SERIES(1, session_length - 1) t(i)
)
SELECT
  domain,
  user_id,
  session_id,
  split_pos,
  session_length,
  session_duration_sec,
  LIST_COUNT(prefix_items) AS prefix_len,
  prefix_items,
  label_item
FROM splits;
"""

con.execute(sql)

# --- Singleton transparency stats ---
singleton_stats = con.execute("""
SELECT
  COUNT(*) AS total_sessions,
  SUM(CASE WHEN session_length = 1 THEN 1 ELSE 0 END) AS singleton_sessions,
  SUM(CASE WHEN session_length >= 2 THEN 1 ELSE 0 END) AS eligible_sessions,
  AVG(CASE WHEN session_length = 1 THEN 1.0 ELSE 0.0 END) AS singleton_rate
FROM session_lengths;
""").df().iloc[0]

print("\n[05-05] Target session filtering stats:")
print(f"  Total sessions: {int(singleton_stats['total_sessions']):,}")
print(f"  Singleton sessions (filtered): {int(singleton_stats['singleton_sessions']):,}")
print(f"  Eligible sessions (≥2 events): {int(singleton_stats['eligible_sessions']):,}")
print(f"  Singleton rate: {float(singleton_stats['singleton_rate']):.1%}")

prefix_target_splits = con.execute("SELECT * FROM prefix_target_splits;").df()
print("\n[05-05] prefix_target_splits:", prefix_target_splits.shape)
print(prefix_target_splits.head(3))

print("CELL 05-05 seconds:", round(time.time() - t0, 2))


Using MAX_PREFIX_LEN: 20

[05-05] Target session filtering stats:
  Total sessions: 1,322
  Singleton sessions (filtered): 761
  Eligible sessions (≥2 events): 561
  Singleton rate: 57.6%

[05-05] prefix_target_splits: (2333, 9)
   domain user_id  session_id  split_pos  session_length  \
0  target  104074  t_104074_2          9              10   
1  target  104074  t_104074_3         17              18   
2  target  104074  t_104074_4          3               4   

   session_duration_sec  prefix_len  \
0                1494.0           9   
1                2207.0          17   
2                 830.0           3   

                                        prefix_items label_item  
0  [52609, 52616, 52615, 52610, 52614, 52618, 526...      52619  
1  [45209, 45206, 45207, 45211, 45214, 45212, 452...      45234  
2                           [376915, 376916, 376919]     376917  
CELL 05-05 seconds: 0.06
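
DuckDB lists are 1-based and `LIST_SLICE` includes both endpoints, so `LIST_SLICE(items, GREATEST(1, i - MAXP + 1), i)` should select the same elements as Python's `items[max(0, i - MAXP):i]`. A sketch of the equivalent Python windowing, with a hypothetical function name:

```python
from typing import Iterator, Sequence


def iter_prefix_samples(
    items: Sequence[str], max_prefix_len: int
) -> Iterator[tuple[int, list[str], str]]:
    """Yield (split_pos, prefix_items, label_item) for split_pos = 1..len(items) - 1."""
    for split_pos in range(1, len(items)):
        # Keep at most the last max_prefix_len items before the label.
        start = max(0, split_pos - max_prefix_len)
        yield split_pos, list(items[start:split_pos]), items[split_pos]
```

A session of length L yields L - 1 samples, so the 2,333 rows reported for 561 eligible sessions imply 2,894 events in those sessions.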


Validate target session quality vs Notebook 04 expectations

In [7]:
# [CELL 05-05B] Validate Target Session Quality (actual vs expected)

print("\n" + "="*70)
print("TARGET SESSION QUALITY VALIDATION")
print("="*70)

actual = con.execute("""
WITH all_sessions AS (
  SELECT session_id, COUNT(*) AS session_length
  FROM sessions_target_view
  GROUP BY session_id
),
eligible_sessions AS (
  SELECT session_id, session_length
  FROM all_sessions
  WHERE session_length >= 2
),
eligible_metrics AS (
  SELECT
    COUNT(*) AS eligible_sessions,
    approx_quantile(session_length, 0.5) AS median_length,
    AVG(session_length) AS avg_length
  FROM eligible_sessions
),
singleton_metric AS (
  SELECT
    (SELECT COUNT(*) FROM all_sessions WHERE session_length = 1)::DOUBLE
    /
    (SELECT COUNT(*) FROM all_sessions)::DOUBLE AS singleton_rate
)
SELECT
  eligible_sessions,
  median_length,
  avg_length,
  singleton_rate
FROM eligible_metrics
CROSS JOIN singleton_metric;
""").df().iloc[0]

print("Actual Metrics:")
print(f"  Eligible sessions: {int(actual['eligible_sessions']):,}")
print(f"  Median session length (eligible): {float(actual['median_length']):.2f} events")
print(f"  Avg session length (eligible): {float(actual['avg_length']):.2f} events")
print(f"  Singleton rate (all sessions): {float(actual['singleton_rate']):.1%}")

print("\nExpected (from Notebook 04):")
print("  Target expected median length ≈ 2 events")
print("  Target expected singleton rate ≈ ~18%")



TARGET SESSION QUALITY VALIDATION
Actual Metrics:
  Eligible sessions: 561
  Median session length (eligible): 3.00 events
  Avg session length (eligible): 5.16 events
  Singleton rate (all sessions): 57.6%

Expected (from Notebook 04):
  Target expected median length ≈ 2 events
  Target expected singleton rate ≈ ~18%
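
The cell prints actual and expected values side by side but does not compare them. A minimal sketch of an explicit comparison, using the expectations printed above; `REL_TOL` is a hypothetical tolerance, not a project decision, and the function name is illustrative:

```python
import math

EXPECTED_FROM_NB04 = {"median_length": 2.0, "singleton_rate": 0.18}  # values printed above
REL_TOL = 0.25  # hypothetical tolerance, chosen only for illustration


def find_quality_deviations(actual: dict, expected: dict, rel_tol: float) -> dict:
    """Return {metric: (actual, expected)} for metrics outside the relative tolerance."""
    return {
        name: (float(actual[name]), exp)
        for name, exp in expected.items()
        if not math.isclose(float(actual[name]), exp, rel_tol=rel_tol)
    }
```

With the values reported in this run (median 3.00, singleton rate 57.6%), both metrics fall outside this tolerance. Note that the median here is computed over eligible sessions only; the output does not state which population the Notebook 04 expectation refers to.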


In [8]:
# [CELL 05-06] Build seq_df (target sessions -> item lists) then generate prefix rows (pandas fallback)
# This replaces the cell that failed with: NameError: seq_df is not defined

import pandas as pd
from pathlib import Path

MAX_PREFIX_LEN = int(globals().get("MAX_PREFIX_LEN", 20))
print("Using MAX_PREFIX_LEN:", MAX_PREFIX_LEN)

# Load target sessionized events
df_ev = pd.read_parquet(OUT_T_EVENTS)
required = {"session_id", "user_id", "item_id", "timestamp"}
missing = required - set(df_ev.columns)
if missing:
    raise ValueError(f"OUT_T_EVENTS missing columns: {missing}. Columns: {df_ev.columns.tolist()}")

# If domain missing, inject it
if "domain" not in df_ev.columns:
    df_ev["domain"] = TARGET_DOMAIN if "TARGET_DOMAIN" in globals() else "target"

# Ensure datetime and sort
df_ev["timestamp"] = pd.to_datetime(df_ev["timestamp"], utc=True, errors="coerce")
if df_ev["timestamp"].isna().any():
    raise ValueError("OUT_T_EVENTS has invalid timestamps after parsing.")
df_ev = df_ev.sort_values(["user_id", "session_id", "timestamp", "item_id"]).reset_index(drop=True)

# Build seq_df: one row per session with ordered items
seq_df = (
    df_ev.groupby(["session_id", "domain", "user_id"], as_index=False)
         .agg(
             items=("item_id", list),
             start_ts=("timestamp", "min"),
             end_ts=("timestamp", "max"),
             session_len=("item_id", "size"),
         )
)

# Filter singleton sessions
before = len(seq_df)
seq_df = seq_df[seq_df["session_len"] >= 2].reset_index(drop=True)
after = len(seq_df)
print(f"Sessions total: {before:,} | eligible (>=2): {after:,} | filtered: {before-after:,}")

# Generate prefix samples
rows = []
for _, r in seq_df[["session_id","domain","user_id","items","start_ts","end_ts","session_len"]].iterrows():
    items = r["items"]
    L = len(items)
    # split_pos = 1..L-1
    for split_pos in range(1, L):
        start = max(0, split_pos - MAX_PREFIX_LEN)
        prefix_items = items[start:split_pos]
        label_item = items[split_pos]
        rows.append({
            "domain": r["domain"],
            "user_id": r["user_id"],
            "session_id": r["session_id"],
            "split_pos": split_pos,
            "session_length": L,
            "session_duration_sec": (r["end_ts"] - r["start_ts"]).total_seconds(),
            "prefix_len": len(prefix_items),
            "prefix_items": prefix_items,
            "label_item": label_item,
        })

prefix_target_samples_df = pd.DataFrame(rows)
print("prefix_target_samples_df:", prefix_target_samples_df.shape)
print(prefix_target_samples_df.head(3))


Using MAX_PREFIX_LEN: 20
Sessions total: 1,322 | eligible (>=2): 561 | filtered: 761
prefix_target_samples_df: (2333, 9)
   domain user_id  session_id  split_pos  session_length  \
0  target  104074  t_104074_2          1              10   
1  target  104074  t_104074_2          2              10   
2  target  104074  t_104074_2          3              10   

   session_duration_sec  prefix_len           prefix_items label_item  
0                1494.0           1                [52609]      52616  
1                1494.0           2         [52609, 52616]      52615  
2                1494.0           3  [52609, 52616, 52615]      52610  
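
The DuckDB table from cell 05-05 and the pandas fallback from this cell have the same shape, (2333, 9), but their rows come back in different orders, so equal shapes alone do not show that the contents agree. A sketch of a key-based comparison over row dictionaries; both function names are hypothetical:

```python
def prefix_rows_to_map(rows) -> dict:
    """Index row dicts by (session_id, split_pos); values are (prefix tuple, label)."""
    return {
        (r["session_id"], int(r["split_pos"])): (tuple(r["prefix_items"]), r["label_item"])
        for r in rows
    }


def diff_prefix_tables(rows_a, rows_b) -> list:
    """Return sorted keys missing from one side or whose prefix/label differ."""
    a, b = prefix_rows_to_map(rows_a), prefix_rows_to_map(rows_b)
    return sorted(k for k in a.keys() | b.keys() if a.get(k) != b.get(k))
```

One way to call it, assuming both frames are still in memory, is `diff_prefix_tables(prefix_target_splits.to_dict("records"), prefix_target_samples_df.to_dict("records"))`; an empty list would indicate that the two builds agree.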


In [2]:
import duckdb
from pathlib import Path

# TODO: set this to the actual source parquet you are using in notebook 05
SOURCE_PARQUET = r"D:\00_DS-ML-Workspace\mooc-coldstart-session-meta\data\processed\normalized_events\events_source_norm.parquet"

p = Path(SOURCE_PARQUET).as_posix().replace("'", "''")
con = duckdb.connect()

res = con.execute(f"""
SELECT
  COUNT(*) AS n_events,
  COUNT(DISTINCT user_id) AS n_users,
  COUNT(DISTINCT item_id) AS n_items,
  MIN(timestamp) AS min_ts,
  MAX(timestamp) AS max_ts
FROM read_parquet('{p}');
""").df()

print(res)
print("SOURCE_PARQUET:", SOURCE_PARQUET)


    n_events  n_users  n_items              min_ts              max_ts
0  154817413   770283     1628 2015-07-31 23:59:15 2017-07-31 23:59:09
SOURCE_PARQUET: D:\00_DS-ML-Workspace\mooc-coldstart-session-meta\data\processed\normalized_events\events_source_norm.parquet


In [3]:
import psutil

mem = psutil.virtual_memory()
print(f"Total RAM: {mem.total / 1e9:.1f} GB")
print(f"Available RAM: {mem.available / 1e9:.1f} GB")
print(f"Used RAM: {mem.used / 1e9:.1f} GB")

if mem.available < 12e9:  # Less than 12GB available
    print("⚠️  WARNING: Low memory! Consider using a sample.")
else:
    print("✅ Sufficient memory for full dataset sessionization")

Total RAM: 34.3 GB
Available RAM: 17.4 GB
Used RAM: 16.9 GB
✅ Sufficient memory for full dataset sessionization


In [9]:
# [CELL 05-07] Sessionize Source Dataset (FULL-safe, repo-root aware, bucketed, spill-to-disk)

import json
import duckdb
from pathlib import Path

print("\n" + "="*70)
print("SESSIONIZING SOURCE DATASET")
print("="*70)

# ---------------------------------------------------------------------
# 0) Locate REPO_ROOT (so relative paths always work)
# ---------------------------------------------------------------------
def find_repo_root(start: Path) -> Path:
    for p in [start, *start.parents]:
        if (p / "PROJECT_STATE.md").exists():
            return p
    # fallback: try git root marker
    for p in [start, *start.parents]:
        if (p / ".git").exists():
            return p
    return start

REPO_ROOT = find_repo_root(Path.cwd().resolve())
print("CWD:", Path.cwd().resolve())
print("REPO_ROOT:", REPO_ROOT)

# ---------------------------------------------------------------------
# 1) Resolve SOURCE_PARQUET path (prefer IN_SOURCE if present)
# ---------------------------------------------------------------------
if "IN_SOURCE" in globals():
    SOURCE_PARQUET = str(IN_SOURCE)
elif "SOURCE_PARQUET" in globals():
    SOURCE_PARQUET = str(SOURCE_PARQUET)
else:
    SOURCE_PARQUET = str(REPO_ROOT / "data" / "processed" / "normalized_events" / "events_source_norm.parquet")

source_file = Path(SOURCE_PARQUET)
if not source_file.is_absolute():
    source_file = (REPO_ROOT / source_file).resolve()

if not source_file.exists():
    raise FileNotFoundError(f"Source parquet not found: {source_file}")

source_path_sql = source_file.as_posix().replace("'", "''")
print("SOURCE_PARQUET:", source_file)

# ---------------------------------------------------------------------
# 2) Load source gap seconds from session_gap_thresholds.json (repo-root aware)
# ---------------------------------------------------------------------
gaps_path = REPO_ROOT / "data" / "processed" / "normalized_events" / "session_gap_thresholds.json"
if not gaps_path.exists():
    raise FileNotFoundError(f"Missing session_gap_thresholds.json at: {gaps_path}")

with open(gaps_path, "r", encoding="utf-8") as f:
    gaps = json.load(f)

# Your schema uses: gaps["source"]["primary_threshold_seconds"]
gap_s = int(gaps["source"]["primary_threshold_seconds"])
print("Session gaps file:", gaps_path)
print("Using source gap seconds:", gap_s)

# ---------------------------------------------------------------------
# 3) Resolve OUT_S_EVENTS (repo-root aware)
# ---------------------------------------------------------------------
if "OUT_S_EVENTS" in globals():
    out_file = Path(OUT_S_EVENTS)
else:
    run_tag = globals().get("RUN_TAG", "run")
    out_file = REPO_ROOT / "data" / "processed" / "sessionized" / f"source_events_sessionized_{run_tag}.parquet"

if not out_file.is_absolute():
    out_file = (REPO_ROOT / out_file).resolve()

out_file.parent.mkdir(parents=True, exist_ok=True)
OUT_S_EVENTS = out_file  # write back to globals for downstream cells
print("OUT_S_EVENTS:", OUT_S_EVENTS)

# ---------------------------------------------------------------------
# 4) Decide FULL vs SAMPLE
# ---------------------------------------------------------------------
USE_SAMPLE = False   # set True ONLY if you want a user-hash sample
SAMPLE_MOD = 10      # 10 => ~10% users; 100 => ~1% users

mode = "SAMPLE" if USE_SAMPLE else "FULL"
print("\n[05-07] Mode:", mode)
if USE_SAMPLE:
    print(f"  Sampling rule: abs(hash(user_id)) % {SAMPLE_MOD} == 0  (~{100//SAMPLE_MOD}% users)")

# ---------------------------------------------------------------------
# 5) DuckDB setup (memory safe + spill-to-disk)
# ---------------------------------------------------------------------
con = duckdb.connect(database=":memory:")
con.execute("SET threads=1;")
con.execute("SET preserve_insertion_order=false;")
con.execute("PRAGMA enable_object_cache=false;")
con.execute("PRAGMA memory_limit='6GB';")

duckdb_tmp = OUT_S_EVENTS.parent / "_duckdb_tmp"
duckdb_tmp.mkdir(parents=True, exist_ok=True)
con.execute(f"SET temp_directory='{duckdb_tmp.as_posix()}';")
print("DuckDB temp_directory:", duckdb_tmp)

# ---------------------------------------------------------------------
# 6) Print dataset size from parquet (proves full vs sample later)
# ---------------------------------------------------------------------
info = con.execute(f"""
SELECT
  COUNT(*) AS n_events,
  COUNT(DISTINCT user_id) AS n_users,
  COUNT(DISTINCT item_id) AS n_items,
  MIN(timestamp) AS min_ts,
  MAX(timestamp) AS max_ts
FROM read_parquet('{source_path_sql}');
""").df().iloc[0]

n_events = int(info["n_events"])
n_users  = int(info["n_users"])
n_items  = int(info["n_items"])
print("\n[05-07] Source dataset size (from parquet):")
print(f"  Events: {n_events:,}")
print(f"  Users:  {n_users:,}")
print(f"  Items:  {n_items:,}")
print(f"  Range:  {info['min_ts']} → {info['max_ts']}")

# ---------------------------------------------------------------------
# 7) Normalize schema in DuckDB: cast ids to VARCHAR + robust timestamp -> ts
# ---------------------------------------------------------------------
con.execute(f"""
CREATE OR REPLACE VIEW src_raw AS
WITH base AS (
  SELECT
    CAST(user_id AS VARCHAR) AS user_id,
    CAST(item_id AS VARCHAR) AS item_id,
    timestamp AS ts_in
  FROM read_parquet('{source_path_sql}')
)
SELECT
  user_id,
  item_id,
  CASE
    WHEN typeof(ts_in) IN ('TIMESTAMP', 'TIMESTAMP_TZ', 'TIMESTAMP WITH TIME ZONE') THEN CAST(ts_in AS TIMESTAMP)
    WHEN typeof(ts_in) IN ('BIGINT','UBIGINT','INTEGER','UINTEGER','SMALLINT','USMALLINT','TINYINT','UTINYINT','DOUBLE','FLOAT','DECIMAL') THEN
      CASE
        WHEN CAST(ts_in AS BIGINT) > 1000000000000 THEN to_timestamp(CAST(ts_in AS DOUBLE) / 1000.0)
        ELSE to_timestamp(CAST(ts_in AS DOUBLE))
      END
    ELSE try_cast(ts_in AS TIMESTAMP)
  END AS ts
FROM base;
""")

bad_ts = con.execute("SELECT COUNT(*) FROM src_raw WHERE ts IS NULL;").fetchone()[0]
if bad_ts:
    raise ValueError(f"Source has {bad_ts} rows with unparseable timestamps after normalization.")

# ---------------------------------------------------------------------
# 8) Sessionize using bucketing by user hash (FULL-safe)
# ---------------------------------------------------------------------
tmp_dir = OUT_S_EVENTS.parent / "_tmp_src_sessionize"
tmp_dir.mkdir(parents=True, exist_ok=True)

# Clean previous bucket files to avoid mixing runs
for f in tmp_dir.glob("src_events_sessionized_b*.parquet"):
    f.unlink()

# Buckets: more buckets for very large data
N_BUCKETS = 256 if n_events > 50_000_000 else 64
print(f"\n[05-07] Sessionizing with N_BUCKETS={N_BUCKETS}")
print("Temp bucket dir:", tmp_dir)

SRC_DOMAIN = "source"
sample_clause = f"AND (abs(hash(user_id)) % {SAMPLE_MOD}) = 0" if USE_SAMPLE else ""

for b in range(N_BUCKETS):
    tmp_out = tmp_dir / f"src_events_sessionized_b{b:03d}.parquet"

    sql_bucket = f"""
    COPY (
      WITH src AS (
        SELECT
          '{SRC_DOMAIN}' AS domain,
          user_id,
          item_id,
          ts AS timestamp
        FROM src_raw
        WHERE (abs(hash(user_id)) % {N_BUCKETS}) = {b}
        {sample_clause}
      ),
      x AS (
        SELECT
          domain,
          user_id,
          item_id,
          timestamp,
          LAG(timestamp) OVER (PARTITION BY user_id ORDER BY timestamp, item_id) AS prev_ts
        FROM src
      ),
      y AS (
        SELECT
          *,
          CASE
            WHEN prev_ts IS NULL THEN 1
            WHEN (epoch(timestamp) - epoch(prev_ts)) > {gap_s} THEN 1
            ELSE 0
          END AS new_sess
        FROM x
      ),
      z AS (
        SELECT
          *,
          SUM(new_sess) OVER (
            PARTITION BY user_id
            ORDER BY timestamp, item_id
            ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
          ) AS session_num
        FROM y
      )
      SELECT
        domain,
        user_id,
        item_id,
        timestamp,
        CONCAT(user_id, '::', CAST(session_num AS VARCHAR)) AS session_id
      FROM z
    ) TO '{tmp_out.as_posix()}' (FORMAT PARQUET);
    """
    con.execute(sql_bucket)

    if (b + 1) % 32 == 0:
        print(f"  done {b+1}/{N_BUCKETS}")

# ---------------------------------------------------------------------
# 9) Stitch buckets into final parquet
# ---------------------------------------------------------------------
out_final_sql = OUT_S_EVENTS.as_posix().replace("'", "''")
tmp_glob_sql = (tmp_dir / "src_events_sessionized_b*.parquet").as_posix().replace("'", "''")

if OUT_S_EVENTS.exists():
    OUT_S_EVENTS.unlink()

con.execute(f"""
COPY (
  SELECT * FROM read_parquet('{tmp_glob_sql}')
) TO '{out_final_sql}' (FORMAT PARQUET);
""")

print("\nWrote:", OUT_S_EVENTS)

# ---------------------------------------------------------------------
# 10) Verify output counts (proof full vs sample)
# ---------------------------------------------------------------------
chk = con.execute(f"""
SELECT
  COUNT(*) AS out_events,
  COUNT(DISTINCT user_id) AS out_users,
  COUNT(DISTINCT item_id) AS out_items,
  MIN(timestamp) AS out_min_ts,
  MAX(timestamp) AS out_max_ts
FROM read_parquet('{out_final_sql}');
""").df().iloc[0]

print("\n[05-07] Output sessionized counts:")
print(f"  out_events: {int(chk['out_events']):,}")
print(f"  out_users:  {int(chk['out_users']):,}")
print(f"  out_items:  {int(chk['out_items']):,}")
print(f"  out_range:  {chk['out_min_ts']} → {chk['out_max_ts']}")

if not USE_SAMPLE:
    print("\n[05-07] FULL mode expectation: out_events should be ~equal to source events.")
else:
    print("\n[05-07] SAMPLE mode expectation: out_users should be much smaller than full.")



SESSIONIZING SOURCE DATASET
CWD: D:\00_DS-ML-Workspace\mooc-coldstart-session-meta\notebooks
REPO_ROOT: D:\00_DS-ML-Workspace\mooc-coldstart-session-meta
SOURCE_PARQUET: D:\00_DS-ML-Workspace\mooc-coldstart-session-meta\data\processed\normalized_events\events_source_norm.parquet
Session gaps file: D:\00_DS-ML-Workspace\mooc-coldstart-session-meta\data\processed\normalized_events\session_gap_thresholds.json
Using source gap seconds: 600
OUT_S_EVENTS: D:\00_DS-ML-Workspace\mooc-coldstart-session-meta\data\processed\sessionized\source_events_sessionized_run.parquet

[05-07] Mode: FULL
DuckDB temp_directory: D:\00_DS-ML-Workspace\mooc-coldstart-session-meta\data\processed\sessionized\_duckdb_tmp



[05-07] Source dataset size (from parquet):
  Events: 154,817,413
  Users:  770,283
  Items:  1,628
  Range:  2015-07-31 23:59:15 → 2017-07-31 23:59:09



[05-07] Sessionizing with N_BUCKETS=256
Temp bucket dir: D:\00_DS-ML-Workspace\mooc-coldstart-session-meta\data\processed\sessionized\_tmp_src_sessionize
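
Bucketing by user hash works because every event of a given user lands in the same bucket, so the per-user window functions never need rows from another bucket. A sketch of that partition property in plain Python; `hashlib.blake2b` is a stand-in chosen for illustration and produces different values from DuckDB's `hash()`, and both function names are hypothetical:

```python
import hashlib


def bucket_of(user_id: str, n_buckets: int) -> int:
    """Stable bucket in [0, n_buckets); NOT the same values as DuckDB's hash()."""
    digest = hashlib.blake2b(user_id.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big") % n_buckets


def partition_users(user_ids, n_buckets: int) -> dict[int, set[str]]:
    """Group distinct user ids by bucket; each id lands in exactly one bucket."""
    buckets: dict[int, set[str]] = {}
    for uid in user_ids:
        buckets.setdefault(bucket_of(uid, n_buckets), set()).add(uid)
    return buckets
```

Since the buckets are disjoint and cover all users, the per-bucket row counts should add up to the input row count in FULL mode.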


  done 32/256


  done 64/256


  done 96/256


  done 128/256


  done 160/256


  done 192/256


  done 224/256


  done 256/256


Wrote: D:\00_DS-ML-Workspace\mooc-coldstart-session-meta\data\processed\sessionized\source_events_sessionized_run.parquet


[05-07] Output sessionized counts:
  out_events: 154,817,413
  out_users:  770,283
  out_items:  1,628
  out_range:  2015-07-31 23:59:15+08:00 → 2017-07-31 23:59:09+08:00

[05-07] FULL mode expectation: out_events should be ~equal to source events.


## Outputs produced by this notebook (local)
- `data/processed/sessionized/target_events_sessionized_<RUN_TAG>.parquet`
- `data/processed/sessionized/target_sessions_<RUN_TAG>.parquet`
- `data/processed/supervised/target_prefix_samples_<RUN_TAG>.parquet`
- `data/processed/sessionized/source_events_sessionized_<RUN_TAG>.parquet`
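
For readers of the sessionized files, the gap rule behind them can be sketched in plain Python. This is a minimal illustration, not the notebook's DuckDB implementation: the helper name `assign_session_index` is hypothetical, timestamps are assumed to be epoch seconds for a single user already sorted ascending, and the strict `>` comparison at the boundary is an assumption (the actual SQL may use `>=`).

```python
from typing import Iterable, List, Optional


def assign_session_index(timestamps: Iterable[float], gap_seconds: float) -> List[int]:
    """Return a 0-based session index per event for ONE user.

    Assumptions (not taken from the notebook's SQL):
      - `timestamps` are epoch seconds, sorted ascending;
      - a new session starts when the gap to the previous event
        is strictly greater than `gap_seconds`.
    """
    session_idx: List[int] = []
    current = 0
    prev: Optional[float] = None
    for ts in timestamps:
        # Open a new session only when the idle time exceeds the threshold.
        if prev is not None and (ts - prev) > gap_seconds:
            current += 1
        session_idx.append(current)
        prev = ts
    return session_idx
```

With the locked decisions, this would be called with `gap_seconds=1800` for target and `gap_seconds=600` for source. The rule only labels events and never drops them, which is why the sessionized output should contain the same number of events as the input.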

Next notebook (strict order): **05A_split_prefix_target.ipynb**