# UMDD Notebook Playground
Hands-on tour of UMDD: generate synthetic data, preview it with the heuristic decoder, train a tiny multi-head model (codepage + tags + boundaries), and run inference. The goal is to give notebook users a single place to explore without touching the CLI.
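
To make the codepage head a little more concrete, here is a minimal sketch of why the codepage matters when decoding bytes. It assumes "codepage" here means a character encoding such as an EBCDIC variant; `cp037` and `cp500` are codecs from Python's standard library, chosen only for illustration, and nothing below calls UMDD itself.

```python
# Illustrative only: cp037 and cp500 are two EBCDIC codecs that ship with Python.
# Whether UMDD targets these exact codepages is an assumption.
text = "HELLO"
raw = text.encode("cp037")  # b'\xc8\xc5\xd3\xd3\xd6'

# The same bytes read very differently depending on the codec used to decode them.
for codec in ("cp037", "cp500", "latin-1"):
    print(f"{codec:>8}: {raw.decode(codec)!r}")

# Collect the byte values that the two EBCDIC codecs map to different characters.
differing = [
    b for b in range(256)
    if bytes([b]).decode("cp037") != bytes([b]).decode("cp500")
]
print(f"{len(differing)} byte values differ between cp037 and cp500")
```

Letters decode identically under both EBCDIC codecs, so a detector has to rely on the smaller set of bytes (mostly punctuation) where the mappings diverge.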

## Prereqs
* Run this notebook from the repo root.
* Ensure deps are installed (one-time): `pip install -e '.[dev]'`.
* Torch CPU is sufficient for these tiny demos; GPU is optional.

In [None]:
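import sys
from pathlib import Path

# Treat the current working directory as the repo root (see Prereqs) and add it
# to sys.path once, but only if it actually looks like the project root.
ROOT = Path.cwd()
if (ROOT / 'pyproject.toml').exists() and str(ROOT) not in sys.path:
    sys.path.append(str(ROOT))
print('Project root:', ROOT)
print('Pyproject exists:', (ROOT / 'pyproject.toml').exists())
print('Python path contains root:', str(ROOT) in sys.path)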


In [None]:
# Generate a small synthetic RDW dataset + metadata for experimentation.
from umdd.data.generator import generate_synthetic_dataset
import orjson

data, meta = generate_synthetic_dataset(count=4, seed=42)
Path('data').mkdir(exist_ok=True)
Path('data/notebook_synth.bin').write_bytes(data)
Path('data/notebook_synth.json').write_bytes(orjson.dumps(meta, option=orjson.OPT_INDENT_2))
print('Bytes:', len(data))
print('Records:', len(meta))
meta[:2]  # preview the first two metadata entries


In [None]:
# Heuristic decode preview (useful before models are trained).
from umdd.decoder import heuristic_decode

heuristic_summary = heuristic_decode(data, preview_bytes=128)
heuristic_summary


In [None]:
# Train a tiny multi-head model on synthetic data (fast CPU run).
from umdd.training.multitask import MultiTaskConfig, train_multitask

cfg = MultiTaskConfig(
    output_dir=Path('artifacts/notebook-multihead'),
    samples_per_codepage=8,
    max_len=96,
    batch_size=4,
    epochs=1,
    embed_dim=32,
    num_heads=2,
    num_layers=1,
    ff_dim=64,
    device='cpu',
)
metrics = train_multitask(cfg)
metrics


In [None]:
# Run inference on the synthetic bytes using the trained checkpoint.
from umdd.inference import infer_bytes

results = infer_bytes(data, checkpoint=Path(metrics['checkpoint']), max_records=2)
for r in results:
    print(f'Record {r.record_index} (len={r.byte_length})')
    print('  Codepage:', r.codepage, 'conf', round(r.codepage_confidence, 3))
    print('  Tag spans:', r.tag_spans)
    print('  Boundary positions:', r.boundary_positions)
    print('---')


## Inference outputs (JSONL/Arrow)
Write the inference results to JSONL and Arrow IPC files for downstream tools.
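
For readers unfamiliar with JSONL, the format is simply one JSON object per line. The sketch below shows a round trip using only the standard library. The helper names `write_jsonl` and `read_jsonl` and the sample rows are hypothetical, and the field names merely echo the attributes printed earlier; the actual schema written by `results_to_jsonl` may differ.

```python
import json
import tempfile
from pathlib import Path


def write_jsonl(rows, path):
    """Write an iterable of dicts as one JSON object per line."""
    with Path(path).open("w", encoding="utf-8") as fh:
        for row in rows:
            fh.write(json.dumps(row) + "\n")


def read_jsonl(path):
    """Read a JSONL file back into a list of dicts, skipping blank lines."""
    with Path(path).open("r", encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]


# Hypothetical rows for illustration; not real UMDD output.
rows = [
    {"record_index": 0, "codepage": "cp037", "boundary_positions": [4, 12]},
    {"record_index": 1, "codepage": "cp500", "boundary_positions": []},
]

with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "demo.jsonl"
    write_jsonl(rows, demo_path)
    loaded = read_jsonl(demo_path)
    line_count = len(demo_path.read_text(encoding="utf-8").splitlines())

print(loaded == rows, line_count)
```

Because each line is an independent JSON document, the file can be appended to, streamed, or loaded with `pd.read_json(path, lines=True)`.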

In [None]:
from umdd.inference import results_to_jsonl, results_to_arrow
import pandas as pd
import pyarrow.ipc as pa_ipc

logs_dir = Path('logs')
logs_dir.mkdir(exist_ok=True)
jsonl_path = logs_dir / 'notebook_infer.jsonl'
arrow_path = logs_dir / 'notebook_infer.arrow'

results_to_jsonl(results, jsonl_path)
results_to_arrow(results, arrow_path)
print('Wrote', jsonl_path, 'and', arrow_path)
pd.read_json(jsonl_path, lines=True).head()


In [None]:
# Read the Arrow IPC file back to confirm it round-trips into a DataFrame.
with pa_ipc.open_file(arrow_path) as reader:
    table = reader.read_all()
table.to_pandas().head()


## Notes
* These runs stay tiny for speed; bump `samples_per_codepage`, `epochs`, or `embed_dim` if you have more time/resources.
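* Swap `data/notebook_synth.bin` for a real RDW dataset to see how the model behaves on real-world bytes. The cells above use the in-memory `data` variable, so load the file first (e.g. `data = Path('data/your_file.bin').read_bytes()`).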
* CLI equivalents exist (`umdd dataset synthetic`, `umdd train multitask`, `umdd infer`) if you prefer terminal workflows.
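
When pointing the notebook at a real RDW file, it can help to sanity-check the framing first. The sketch below splits a buffer into records under the common RDW convention: a 4-byte descriptor whose first two bytes hold a big-endian length that includes the descriptor itself, followed by two reserved bytes. The helper `iter_rdw_records` is hypothetical and not part of UMDD, and whether UMDD's generator follows exactly this layout is an assumption.

```python
import struct
from typing import Iterator


def iter_rdw_records(buf: bytes) -> Iterator[bytes]:
    """Yield record payloads from a buffer of RDW-prefixed records.

    Assumed layout: 2-byte big-endian length (including the 4-byte RDW),
    then 2 reserved bytes, then the payload.
    """
    offset = 0
    while offset < len(buf):
        if offset + 4 > len(buf):
            raise ValueError(f"truncated RDW at offset {offset}")
        (length,) = struct.unpack_from(">H", buf, offset)
        if length < 4 or offset + length > len(buf):
            raise ValueError(f"invalid RDW length {length} at offset {offset}")
        # Skip the 4-byte descriptor and return only the payload bytes.
        yield buf[offset + 4 : offset + length]
        offset += length


# Two hand-built records: a 3-byte payload and a 2-byte payload.
sample = b"\x00\x07\x00\x00abc" + b"\x00\x06\x00\x00de"
records = list(iter_rdw_records(sample))
print(records)
```

If this raises on a real file, the data may use a different framing (for example, block descriptor words in front of the records), which is worth ruling out before judging the model's output.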