# 01 - Data Ingestion & Staging (Real Histopathology)

This notebook:
- mounts Drive and resolves `PROJECT_ROOT`
- downloads the **real** CRC-VAL-HE-7K zip (the default dataset) from Zenodo, if downloads are enabled
- extracts a stratified sample in SAFE_MODE (or performs a full extraction if `SAFE_MODE=false`)
- builds the canonical `items.parquet` table and writes a manifest for resume mode


**Tip:** You can set `data.dataset_keys` (list) in `pipeline_config.yaml` to stage *multiple* datasets in one run (CRC + small demo datasets like MedMNIST/PCam).


## PR updates & bias/leakage notes

This notebook **stages real histopathology patch images** and builds the canonical `items.parquet` table.

### 🔒 Preventing label leakage (keep fusion/clustering unsupervised)
We removed the previous default that embedded the ground-truth label into the `text` modality.

- **Pros of label-in-text:** clusters and similarity edges often look “cleaner”, and demos can appear more impressive.
- **Cons (why it's not defensible):** it contaminates any downstream “unsupervised” embedding fusion, clustering, and nearest-neighbor graphs.

Controls (from `pipeline_config.yaml`):

- `embeddings.text.use_text_modality` (default: **false**)
- `embeddings.text.text_template_version` (default: **v2_no_label**)

If you later enable text, we recommend verifying that `n_unique_texts` is meaningfully larger than the number of labels; otherwise the text modality is low-entropy and should be dropped for now.
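The check recommended above can be sketched in a few lines. This is a minimal illustration, not the pipeline's own code: the helper name `text_modality_is_informative` and the `min_ratio` threshold of 2.0 are assumptions made for the example, and `n_unique_texts` is computed here simply as the number of distinct non-empty strings.

```python
def text_modality_is_informative(texts, labels, min_ratio=2.0):
    """Return (ok, n_unique_texts, n_labels).

    Hypothetical helper: `min_ratio` is an illustrative threshold, not a
    value read from pipeline_config.yaml.
    """
    # Count distinct non-empty texts; empty or None entries carry no signal.
    unique_texts = {t.strip() for t in texts if t and t.strip()}
    n_unique_texts = len(unique_texts)
    n_labels = len(set(labels))
    # "Meaningfully larger" is modelled as at least min_ratio times the label count.
    ok = n_labels > 0 and n_unique_texts >= min_ratio * n_labels
    return ok, n_unique_texts, n_labels


# A label-derived template yields one text per class, so the check fails.
labels = ["ADI", "TUM", "ADI", "STR"]
leaky_texts = [f"H&E patch of {lbl}" for lbl in labels]
print(text_modality_is_informative(leaky_texts, labels))
```

If the first element of the result is `False`, the text column is at best a re-encoding of the label and is safer left out of fusion.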

### ✅ What we keep
For this workshop, a defensible multimodal setup is typically:
- **image embeddings** (ResNet or UNI)
- **morphology/QC features** (numerical)
- optional **text** only when it contains real non-label metadata.


<a id="B0.0"></a>
### Cell B0.0 - Bootstrap (Drive, PROJECT_ROOT, runtime)

- **Purpose:** Mount Drive, resolve PROJECT_ROOT, load config, init runtime.
- **Inputs:** HISTO_PROJECT_ROOT (optional), pipeline_config.yaml
- **Outputs:** PROJECT_ROOT, cfg, SAFE_MODE
- **Depends on:** None
- **Writes checkpoints:** checkpoints/_STATE.json


In [1]:
import os, sys
from pathlib import Path
import yaml

IN_COLAB = "google.colab" in sys.modules

def _mount_drive(mountpoint: str = "/content/drive", max_tries: int = 3, timeout_ms: int = 300000) -> bool:
    """Robust Google Drive mount with retries (Colab).

    Returns True if /content/drive/MyDrive becomes available.
    """
    if not IN_COLAB:
        return True
    try:
        from google.colab import drive  # type: ignore
    except Exception as e:
        print("⚠️ google.colab not available:", repr(e))
        return False

    import time

    mp = Path(mountpoint)
    if (mp / "MyDrive").exists():
        return True

    last = None
    for t in range(max_tries):
        try:
            kwargs = {}
            if t > 0:
                kwargs["force_remount"] = True
            # Some Colab versions accept timeout_ms; ignore if not.
            kwargs["timeout_ms"] = timeout_ms
            try:
                drive.mount(mountpoint, **kwargs)
            except TypeError:
                kwargs.pop("timeout_ms", None)
                if kwargs:
                    drive.mount(mountpoint, **kwargs)
                else:
                    drive.mount(mountpoint)

            if (mp / "MyDrive").exists():
                return True
        except Exception as e:
            last = e
            time.sleep(2)

    print("❌ Google Drive mount failed.")
    print("Fixes to try:")
    print("  1) Runtime ▸ Restart runtime, then re-run this cell")
    print("  2) Run: from google.colab import drive; drive.flush_and_unmount(); drive.mount('/content/drive', force_remount=True)")
    print("  3) In your browser, allow third-party cookies for colab.research.google.com")
    if last is not None:
        print("Last error:", repr(last))
    return False

if IN_COLAB and not _mount_drive():
    raise RuntimeError("Cannot continue without Google Drive mounted. Fix Drive mount and re-run this cell.")

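# Optional hard-set. While this line is active it overrides any existing value and
# skips the auto-discovery below; comment it out to let resolve_project_root() search Drive.
os.environ["HISTO_PROJECT_ROOT"] = "/content/drive/MyDrive/mit/histopathology_202601012"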

def resolve_project_root() -> Path:
    """Find the folder that contains pipeline_config.yaml + label_taxonomy.yaml."""
    ev = os.environ.get("HISTO_PROJECT_ROOT")
    if ev:
        p = Path(ev).expanduser()
        if (p / "pipeline_config.yaml").exists() and (p / "label_taxonomy.yaml").exists():
            return p
        raise FileNotFoundError(f"HISTO_PROJECT_ROOT is set but required files not found in: {p}")

    bases = [
        Path("/content/drive/MyDrive/mit"),
        Path("/content/drive/MyDrive"),
    ]
    required = ["pipeline_config.yaml", "label_taxonomy.yaml"]
    candidates = []
    for base in bases:
        if not base.exists():
            continue
        for p in base.glob("**/pipeline_config.yaml"):
            root = p.parent
            if all((root / rf).exists() for rf in required):
                candidates.append(root.resolve())
        if candidates:
            break

    candidates = sorted(set(candidates), key=lambda p: p.stat().st_mtime, reverse=True)
    if not candidates:
        raise FileNotFoundError(
            "Could not locate project root containing pipeline_config.yaml + label_taxonomy.yaml.\n"
            "Expected it somewhere under /content/drive/MyDrive/mit/.\n"
            "Fix: copy the project folder into Drive, OR set os.environ['HISTO_PROJECT_ROOT'] explicitly."
        )

    if len(candidates) > 1:
        print("⚠️ Multiple candidate project roots found; using newest. To force, set HISTO_PROJECT_ROOT.")
        for c in candidates[:5]:
            print("  -", c)

    return candidates[0]

PROJECT_ROOT = resolve_project_root().resolve()
os.environ["HISTO_PROJECT_ROOT"] = str(PROJECT_ROOT)
if str(PROJECT_ROOT) not in sys.path:
    sys.path.insert(0, str(PROJECT_ROOT))

print("✅ PROJECT_ROOT =", PROJECT_ROOT)
print("sys.path[0] =", sys.path[0])

from histo_cartography.runtime import init_runtime, set_seed, health_check
from histo_cartography.paths import ensure_dirs

CONFIG_PATH = PROJECT_ROOT / "pipeline_config.yaml"
assert CONFIG_PATH.exists(), f"Missing {CONFIG_PATH}"
cfg = yaml.safe_load(CONFIG_PATH.read_text())

SAFE_MODE = bool(cfg.get("project", {}).get("safe_mode", True))
DEBUG_LEVEL = int(cfg.get("project", {}).get("debug_level", 1))

# Create standard directories
ensure_dirs(PROJECT_ROOT, [
    cfg["paths"]["log_dir"],
    cfg["paths"]["checkpoints_dir"],
    cfg["paths"]["data_raw_dir"],
    cfg["paths"]["data_staging_dir"],
    cfg["paths"]["exports_dir"],
    "exports/eda",
    "exports/cartography",
    "exports/kg",
])

init_runtime(
    PROJECT_ROOT,
    safe_mode=SAFE_MODE,
    debug_level=DEBUG_LEVEL,
    log_dir_rel=cfg["paths"]["log_dir"],
    checkpoint_dir_rel=cfg["paths"]["checkpoints_dir"],
)
set_seed(int(cfg.get("project", {}).get("seed", 1337)))

# Post-init sanity check
health_check(
    "BOOTSTRAP",
    namespace=globals(),
    require_files=[PROJECT_ROOT / "pipeline_config.yaml", PROJECT_ROOT / "label_taxonomy.yaml"],
    require_dirs=[PROJECT_ROOT / cfg["paths"]["checkpoints_dir"], PROJECT_ROOT / cfg["paths"]["exports_dir"]],
)

print("SAFE_MODE =", SAFE_MODE, "| DEBUG_LEVEL =", DEBUG_LEVEL)


Mounted at /content/drive
✅ PROJECT_ROOT = /content/drive/MyDrive/mit/histopathology_202601012
sys.path[0] = /content/drive/MyDrive/mit/histopathology_202601012


INFO:histo_cartography:Logging to: /content/drive/MyDrive/mit/histopathology_202601012/logs/run.jsonl
INFO:histo_cartography:Loaded existing STATE for resume mode
INFO:histo_cartography:Seeds set to 1337
DEBUG:histo_cartography:health_check ok


SAFE_MODE = True | DEBUG_LEVEL = 1


<a id="B1.0"></a>
### Cell B1.0 - Download dataset zip (real data)

- **Purpose:** Download the configured dataset zip to Drive (data_raw/).
- **Inputs:** pipeline_config.yaml:data.dataset_key, data.download.*
- **Outputs:** data_raw/<dataset>.zip
- **Depends on:** B0.0
- **Writes checkpoints:** data_raw/*.zip
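The `verify_md5` option suggests the downloaded zip is compared against a published checksum. The internals of `download_crc_zip` are not shown in this notebook, so the following is only a sketch of how such a check could look with the standard library; `file_md5` and `md5_matches` are hypothetical names. Reading in chunks keeps memory flat even for a zip of several hundred megabytes.

```python
import hashlib
from pathlib import Path


def file_md5(path, chunk_size=1 << 20):
    """Stream a file through MD5 in 1 MiB chunks and return the hex digest."""
    digest = hashlib.md5()
    with Path(path).open("rb") as fh:
        # iter() with a sentinel stops once read() returns an empty bytes object.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def md5_matches(path, expected_md5):
    """Compare against an expected digest, ignoring case and stray whitespace."""
    return file_md5(path) == expected_md5.strip().lower()
```

A mismatch usually means a truncated download, in which case deleting the partial zip and re-running this cell is the simplest recovery.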


In [2]:
from pathlib import Path
import yaml

from histo_cartography.runtime import cell_context
from histo_cartography.datasets import (
    list_available_datasets,
    download_crc_zip,
    CRC_ZENODO_FILES,
    MEDMNIST_DATASETS,
    HF_DATASETS,
)

cfg = yaml.safe_load((PROJECT_ROOT / "pipeline_config.yaml").read_text())

# Support either a single key (data.dataset_key) or a list (data.dataset_keys)
dataset_keys = cfg["data"].get("dataset_keys") or [cfg["data"]["dataset_key"]]
if isinstance(dataset_keys, str):
    dataset_keys = [dataset_keys]
dataset_keys = [str(k) for k in dataset_keys]

print("📌 Selected dataset_keys:", dataset_keys)

# Show what's supported by this repo
try:
    from IPython.display import display
    display(list_available_datasets())
except Exception:
    print(list_available_datasets().to_string(index=False))

raw_dir = PROJECT_ROOT / cfg["paths"]["data_raw_dir"]
raw_dir.mkdir(parents=True, exist_ok=True)

download_cfg = cfg["data"]["download"]
enable_download = bool(download_cfg.get("enable", True))
verify_md5 = bool(download_cfg.get("verify_md5", True))
allow_large = bool(download_cfg.get("allow_large", False))

# Only CRC datasets use large Zenodo zips. Other providers download during staging.
zip_paths = {}

ckpt_targets = [str(raw_dir / f"{dk}.zip") for dk in dataset_keys if dk in CRC_ZENODO_FILES]

with cell_context("B1.0", purpose="Download dataset assets (CRC zips only)", stage="B", checkpoint_paths=ckpt_targets):
    if not enable_download:
        print("⚠️ Download disabled in config; assuming required files already exist in data_raw/.")
    for dk in dataset_keys:
        if dk in CRC_ZENODO_FILES:
            if not enable_download and not (raw_dir / f"{dk}.zip").exists():
                raise FileNotFoundError(
                    f"Download disabled but zip missing: {(raw_dir / f'{dk}.zip')}. "
                    "Either enable downloads or upload the zip to data_raw/."
                )
            zip_paths[dk] = download_crc_zip(dk, raw_dir, verify_md5=verify_md5, allow_large=allow_large)
            print(f"✅ {dk}: zip at {zip_paths[dk]}")
        elif dk in MEDMNIST_DATASETS:
            print(f"ℹ️ {dk}: provider=medmnist (no zip download step).")
        elif dk in HF_DATASETS:
            print(f"ℹ️ {dk}: provider=huggingface (download happens during export).")
        else:
            raise ValueError(f"Unknown dataset_key={dk}. See list_available_datasets().")

print("📦 CRC zip_paths:", {k: str(v) for k, v in zip_paths.items()})

# Optional dependency hints (only needed for some datasets)
if any(dk in MEDMNIST_DATASETS for dk in dataset_keys):
    try:
        import medmnist  # noqa: F401
    except Exception:
        print("⚠️ Missing optional dependency: medmnist. In Colab run: !pip install medmnist")

if any(dk in HF_DATASETS for dk in dataset_keys):
    try:
        import datasets  # noqa: F401
    except Exception:
        print("⚠️ Missing optional dependency: datasets. In Colab run: !pip install datasets")

📌 Selected dataset_keys: ['CRC_VAL_HE_7K']


|   | dataset_key | provider | type | description |
|---|---|---|---|---|
| 0 | CRC_VAL_HE_7K | Zenodo | crc_zip | Kather CRC patches (zip) |
| 1 | NCT_CRC_HE_100K | Zenodo | crc_zip | Kather CRC patches (zip) |
| 2 | NCT_CRC_HE_100K_NONORM | Zenodo | crc_zip | Kather CRC patches (zip) |
| 3 | HF_PCAM | huggingface | hf | PatchCamelyon (PCam) patches (binary) |
| 4 | MEDMNIST_PATHMNIST | medmnist | medmnist | MedMNIST PathMNIST (colorectal histopathology ... |


INFO:histo_cartography:▶️  B1.0: Download dataset assets (CRC zips only)
INFO:histo_cartography:Downloading
INFO:histo_cartography:MD5 verified
INFO:histo_cartography:✅ B1.0 finished in 909.65s


✅ CRC_VAL_HE_7K: zip at /content/drive/MyDrive/mit/histopathology_202601012/data_raw/CRC_VAL_HE_7K.zip
📦 CRC zip_paths: {'CRC_VAL_HE_7K': '/content/drive/MyDrive/mit/histopathology_202601012/data_raw/CRC_VAL_HE_7K.zip'}


<a id="B2.0"></a>
### Cell B2.0 - Extract images to staging

- **Purpose:** Extract a SAFE_MODE stratified sample (or full) to data_staging/<dataset>/images/.
- **Inputs:** data_raw/<dataset>.zip
- **Outputs:** data_staging/<dataset>/images/...
- **Depends on:** B1.0
- **Writes checkpoints:** data_staging/**/images
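To make "stratified sample" concrete, here is one way a class-balanced, seeded draw could work. It is a sketch under assumptions: `extract_crc_zip_sample` is not shown in this notebook and may allocate quotas differently, and `stratified_sample` is a hypothetical name. The round-robin loop gives per-class counts that differ by at most one, which has the same shape as the 57/56 split reported in this run, although which class ends up with the smaller quota may differ.

```python
import random


def stratified_sample(paths_by_label, max_items, seed=1337):
    """Pick up to max_items paths, spreading them evenly across labels."""
    rng = random.Random(seed)
    labels = sorted(paths_by_label)
    # Shuffle each class pool once, deterministically for a given seed.
    pools = {
        lbl: rng.sample(sorted(paths_by_label[lbl]), len(paths_by_label[lbl]))
        for lbl in labels
    }
    picked = {lbl: [] for lbl in labels}
    remaining = max_items
    # Round-robin over classes so no class gets more than one extra item.
    while remaining > 0:
        progressed = False
        for lbl in labels:
            if remaining == 0:
                break
            if pools[lbl]:
                picked[lbl].append(pools[lbl].pop())
                remaining -= 1
                progressed = True
        if not progressed:  # every pool is exhausted
            break
    return picked


# Toy input: 9 classes with 100 fake member names each, capped at 512 items.
toy = {f"C{i}": [f"C{i}/img_{j}.tif" for j in range(100)] for i in range(9)}
sample = stratified_sample(toy, max_items=512)
print({lbl: len(v) for lbl, v in sample.items()})
```

Small classes are simply exhausted early, and the leftover budget flows to the larger ones.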


In [3]:
from pathlib import Path
import yaml

from histo_cartography.runtime import cell_context
from histo_cartography.datasets import (
    extract_crc_zip_sample,
    export_medmnist_to_staging,
    export_hf_to_staging,
    CRC_ZENODO_FILES,
    MEDMNIST_DATASETS,
    HF_DATASETS,
)
from histo_cartography.debug_tools import show_images

cfg = yaml.safe_load((PROJECT_ROOT / "pipeline_config.yaml").read_text())

raw_dir = PROJECT_ROOT / cfg["paths"]["data_raw_dir"]
staging_base = PROJECT_ROOT / cfg["paths"]["data_staging_dir"]

# Split handling:
# - If config provides data.split, use it for all datasets.
# - Otherwise, infer a reasonable split from the dataset key.
_cfg_split = cfg["data"].get("split")

def infer_split(dataset_key: str) -> str:
    if _cfg_split:
        return str(_cfg_split)
    dk = str(dataset_key).upper()
    if "VAL" in dk or dk.endswith("_7K"):
        return "val"
    return "train"

dataset_splits = {dk: infer_split(dk) for dk in dataset_keys}
print("🧩 dataset_splits:", dataset_splits)

max_items = int(cfg["data"]["max_items_safe"]) if SAFE_MODE else cfg["data"].get("max_items_full")
max_items = int(max_items) if max_items is not None else None

FORCE_REEXTRACT = bool(cfg["data"].get("force_reextract", False))
seed = int(cfg["project"]["seed"])

images_dirs = {}

with cell_context("B2.0", purpose="Stage images to data_staging/<dataset>/images", stage="B"):
    for dk in dataset_keys:
        split = dataset_splits[dk]
        staging_root = staging_base / dk
        staging_root.mkdir(parents=True, exist_ok=True)

        if dk in CRC_ZENODO_FILES:
            # Coerce to Path so .exists() works even if download_crc_zip returned a str.
            zip_path = Path(zip_paths.get(dk) or (raw_dir / f"{dk}.zip"))
            if not zip_path.exists():
                raise FileNotFoundError(f"Missing zip for {dk}: {zip_path}. Re-run B1.0 or enable downloads.")
            images_dir = extract_crc_zip_sample(
                zip_path,
                staging_root,
                safe_mode=SAFE_MODE,
                max_items=int(max_items or 10**9),
                seed=seed,
                overwrite=FORCE_REEXTRACT,
            )
        elif dk in MEDMNIST_DATASETS:
            # MedMNIST exports train/val/test splits
            images_dir = export_medmnist_to_staging(
                dk,
                staging_root,
                split=split,
                max_items=int(max_items or 10**9),
                seed=seed,
                overwrite=FORCE_REEXTRACT,
            )
        elif dk in HF_DATASETS:
            # HuggingFace uses split strings like "train", "validation", "test" depending on dataset.
            # For HF_PCAM, "train" exists by default.
            images_dir = export_hf_to_staging(
                dk,
                staging_root,
                split=split,
                max_items=int(max_items or 10**9),
                seed=seed,
                overwrite=FORCE_REEXTRACT,
            )
        else:
            raise ValueError(f"Unknown dataset_key={dk}. See list_available_datasets().")

        images_dirs[dk] = Path(images_dir)

        # -----------------------
        # Quick QC / sanity check
        # -----------------------
        exts = {".tif", ".tiff", ".png", ".jpg", ".jpeg"}
        counts = {}
        total = 0
        for cls_dir in sorted([p for p in Path(images_dir).iterdir() if p.is_dir()]):
            n = sum(1 for p in cls_dir.iterdir() if p.is_file() and p.suffix.lower() in exts)
            if n:
                counts[cls_dir.name] = n
                total += n

        print(f"✅ {dk} ({split}): images_dir={images_dir} | n_images={total} | n_classes={len(counts)}")
        if counts:
            top = dict(sorted(counts.items(), key=lambda kv: kv[1], reverse=True)[:10])
            print("   top label counts:", top)
        else:
            print("   ⚠️ No class subfolders detected. Inspect:", images_dir)

        # Show a few sample images (1 per top label)
        sample_paths = []
        for lbl, _ in list(sorted(counts.items(), key=lambda kv: kv[1], reverse=True))[:4]:
            cls_dir = Path(images_dir) / lbl
            for p in cls_dir.iterdir():
                if p.is_file() and p.suffix.lower() in exts:
                    sample_paths.append(p)
                    break
        if sample_paths:
            show_images(sample_paths, max_images=4, width=260)

print("📁 images_dirs keys:", list(images_dirs))

INFO:histo_cartography:▶️  B2.0: Stage images to data_staging/<dataset>/images
INFO:histo_cartography:CRC zip class histogram
INFO:histo_cartography:Extracting images


🧩 dataset_splits: {'CRC_VAL_HE_7K': 'val'}


INFO:histo_cartography:✅ B2.0 finished in 10.16s


✅ CRC_VAL_HE_7K (val): images_dir=/content/drive/MyDrive/mit/histopathology_202601012/data_staging/CRC_VAL_HE_7K/images | n_images=512 | n_classes=9
   top label counts: {'ADI': 57, 'DEB': 57, 'LYM': 57, 'MUC': 57, 'MUS': 57, 'NORM': 57, 'STR': 57, 'TUM': 57, 'BACK': 56}
Failed to show /content/drive/MyDrive/mit/histopathology_202601012/data_staging/CRC_VAL_HE_7K/images/ADI/ADI-TCGA-LYIAPIQD.tif : Cannot embed the 'tif' image format
Failed to show /content/drive/MyDrive/mit/histopathology_202601012/data_staging/CRC_VAL_HE_7K/images/DEB/DEB-TCGA-SKPIRDCM.tif : Cannot embed the 'tif' image format
Failed to show /content/drive/MyDrive/mit/histopathology_202601012/data_staging/CRC_VAL_HE_7K/images/LYM/LYM-TCGA-VPFEGNEN.tif : Cannot embed the 'tif' image format
Failed to show /content/drive/MyDrive/mit/histopathology_202601012/data_staging/CRC_VAL_HE_7K/images/MUC/MUC-TCGA-PFCYWYRT.tif : Cannot embed the 'tif' image format
📁 images_dirs keys: ['CRC_VAL_HE_7K']


<a id="B3.0"></a>
### Cell B3.0 - Build items table

- **Purpose:** Create the canonical items dataframe for embedding and KG steps.
- **Inputs:** data_staging/<dataset>/images
- **Outputs:** items_df (DataFrame)
- **Depends on:** B2.0
- **Writes checkpoints:** checkpoints/B/items.parquet


In [4]:
from pathlib import Path
import yaml
import pandas as pd

from histo_cartography.runtime import cell_context
from histo_cartography.datasets import build_items_table_from_images_dir
from histo_cartography.schema import ITEMS_SCHEMA_V1, validate_df_schema
from histo_cartography.debug_tools import display_df

cfg = yaml.safe_load((PROJECT_ROOT / "pipeline_config.yaml").read_text())

items_dfs = []

with cell_context("B3.0", purpose="Build items table from staged images", stage="B"):
    for dk, img_dir in images_dirs.items():
        split = dataset_splits.get(dk, cfg["data"].get("split", "train"))
        df_i = build_items_table_from_images_dir(
            Path(img_dir),
            source=dk,
            split=split,
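            mpp=0.5,  # microns per pixel; 0.5 is the published value for the Kather CRC patches, other datasets may differ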
            use_text_modality=bool(cfg.get("embeddings", {}).get("text", {}).get("use_text_modality", False)),
            text_template_version=str(cfg.get("embeddings", {}).get("text", {}).get("text_template_version", "v2_no_label")),
        )
        print(f"  - {dk} ({split}): {len(df_i)} items")
        items_dfs.append(df_i)

    items_df = pd.concat(items_dfs, ignore_index=True) if items_dfs else pd.DataFrame()

if items_df.empty:
    raise RuntimeError(
        "items_df is empty. This means staging produced 0 images.\n"
        "Common fixes:\n"
        "  - Set data.force_reextract=true in pipeline_config.yaml and re-run B2.0\n"
        "  - If using CRC zips: confirm the zip downloaded fully (size ~800MB for CRC_VAL_HE_7K)\n"
        "  - Inspect data_staging/<dataset>/images for class subfolders\n"
    )

ok, errors = validate_df_schema(items_df, ITEMS_SCHEMA_V1, strict=False)
print("Schema ok?", ok)
if not ok:
    print("Schema errors (first 20):", errors[:20])

display_df(items_df, title="📄 items_df preview", n=10)

if "label" in items_df.columns:
    try:
        from IPython.display import display
        display(items_df["label"].value_counts().head(20).to_frame("count"))
    except Exception:
        print(items_df["label"].value_counts().head(20))
print("n_items =", len(items_df))

INFO:histo_cartography:▶️  B3.0: Build items table from staged images
INFO:histo_cartography:✅ B3.0 finished in 2.89s


  - CRC_VAL_HE_7K (val): 512 items
Schema ok? True
📄 items_df preview
shape=(512, 9)


|   | item_id | source | split | label | text | image_path | width | height | mpp |
|---|---|---|---|---|---|---|---|---|---|
| 0 | CRC_VAL_HE_7K::val::ADI::ADI-TCGA-AEDALKHL | CRC_VAL_HE_7K | val | ADI |  | /content/drive/MyDrive/mit/histopathology_2026... | 224 | 224 | 0.5 |
| 1 | CRC_VAL_HE_7K::val::ADI::ADI-TCGA-AGWWSHFM | CRC_VAL_HE_7K | val | ADI |  | /content/drive/MyDrive/mit/histopathology_2026... | 224 | 224 | 0.5 |
| 2 | CRC_VAL_HE_7K::val::ADI::ADI-TCGA-AIQQNFEC | CRC_VAL_HE_7K | val | ADI |  | /content/drive/MyDrive/mit/histopathology_2026... | 224 | 224 | 0.5 |
| 3 | CRC_VAL_HE_7K::val::ADI::ADI-TCGA-AKVLMQER | CRC_VAL_HE_7K | val | ADI |  | /content/drive/MyDrive/mit/histopathology_2026... | 224 | 224 | 0.5 |
| 4 | CRC_VAL_HE_7K::val::ADI::ADI-TCGA-ASWYCRFC | CRC_VAL_HE_7K | val | ADI |  | /content/drive/MyDrive/mit/histopathology_2026... | 224 | 224 | 0.5 |
| 5 | CRC_VAL_HE_7K::val::ADI::ADI-TCGA-CIWDYDNF | CRC_VAL_HE_7K | val | ADI |  | /content/drive/MyDrive/mit/histopathology_2026... | 224 | 224 | 0.5 |
| 6 | CRC_VAL_HE_7K::val::ADI::ADI-TCGA-CWHWCNCI | CRC_VAL_HE_7K | val | ADI |  | /content/drive/MyDrive/mit/histopathology_2026... | 224 | 224 | 0.5 |
| 7 | CRC_VAL_HE_7K::val::ADI::ADI-TCGA-DFGTTFYR | CRC_VAL_HE_7K | val | ADI |  | /content/drive/MyDrive/mit/histopathology_2026... | 224 | 224 | 0.5 |
| 8 | CRC_VAL_HE_7K::val::ADI::ADI-TCGA-DYEVWMFC | CRC_VAL_HE_7K | val | ADI |  | /content/drive/MyDrive/mit/histopathology_2026... | 224 | 224 | 0.5 |
| 9 | CRC_VAL_HE_7K::val::ADI::ADI-TCGA-DYYTMTTH | CRC_VAL_HE_7K | val | ADI |  | /content/drive/MyDrive/mit/histopathology_2026... | 224 | 224 | 0.5 |


| label | count |
|---|---|
| ADI | 57 |
| DEB | 57 |
| LYM | 57 |
| MUC | 57 |
| MUS | 57 |
| STR | 57 |
| NORM | 57 |
| TUM | 57 |
| BACK | 56 |


n_items = 512
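The `item_id` values in the preview appear to follow the pattern `source::split::label::file_stem`, with the label taken from the class subfolder. The sketch below reproduces that layout with the standard library only. It is an inference from the printed output, not the body of `build_items_table_from_images_dir`; the name `build_item_rows` is hypothetical, and image dimensions, `text`, and `mpp` are left out.

```python
from pathlib import Path

IMAGE_EXTS = {".tif", ".tiff", ".png", ".jpg", ".jpeg"}


def build_item_rows(images_dir, source, split):
    """Walk <images_dir>/<label>/<file> and emit one dict per image."""
    rows = []
    for cls_dir in sorted(p for p in Path(images_dir).iterdir() if p.is_dir()):
        for img in sorted(cls_dir.iterdir()):
            if img.is_file() and img.suffix.lower() in IMAGE_EXTS:
                rows.append({
                    # Assumed ID pattern, inferred from the preview output.
                    "item_id": f"{source}::{split}::{cls_dir.name}::{img.stem}",
                    "source": source,
                    "split": split,
                    "label": cls_dir.name,
                    "image_path": str(img),
                })
    # IDs are used as keys downstream, so fail fast on collisions
    # (for example a.tif and a.png in the same class folder).
    ids = [r["item_id"] for r in rows]
    if len(ids) != len(set(ids)):
        raise ValueError("duplicate item_id values detected")
    return rows
```

Because the ID is built from the file stem, two files that differ only by extension would collide, which is why the sketch checks for duplicates before returning.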


<a id="B4.0"></a>
### Cell B4.0 - Persist items.parquet + manifest

- **Purpose:** Write items parquet to checkpoints with manifest for resume mode.
- **Inputs:** items_df
- **Outputs:** checkpoints/B/items.parquet (+ manifest)
- **Depends on:** B3.0
- **Writes checkpoints:** checkpoints/B/items.parquet, checkpoints/B/items.parquet.manifest.json
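A manifest lets a later run decide whether a checkpoint can be reused without recomputing it. The fields that `write_manifest` actually records are not visible in this notebook, so the sketch below is only an assumption about what a useful manifest might contain (file hash, row count, schema version, key uniqueness). `write_manifest_sketch` and `checkpoint_is_valid` are hypothetical names.

```python
import hashlib
import json
from pathlib import Path


def _manifest_path(data_path):
    # items.parquet -> items.parquet.manifest.json, next to the data file.
    return data_path.with_name(data_path.name + ".manifest.json")


def write_manifest_sketch(data_path, schema_version, keys):
    """Write a JSON sidecar describing an already-saved checkpoint file."""
    data_path = Path(data_path)
    keys = list(keys)
    if len(set(keys)) != len(keys):
        raise ValueError("key column contains duplicates")
    manifest = {
        "file": data_path.name,
        "size_bytes": data_path.stat().st_size,
        "sha256": hashlib.sha256(data_path.read_bytes()).hexdigest(),
        "schema_version": schema_version,
        "n_rows": len(keys),
    }
    out = _manifest_path(data_path)
    out.write_text(json.dumps(manifest, indent=2))
    return out


def checkpoint_is_valid(data_path):
    """Resume check: both files exist and the stored hash still matches."""
    data_path = Path(data_path)
    sidecar = _manifest_path(data_path)
    if not (data_path.exists() and sidecar.exists()):
        return False
    manifest = json.loads(sidecar.read_text())
    return manifest["sha256"] == hashlib.sha256(data_path.read_bytes()).hexdigest()
```

With something like this in place, a resumed run could call `checkpoint_is_valid(items_path)` and skip B1.0 through B3.0 when it returns `True`.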


In [5]:
from pathlib import Path
import yaml

from histo_cartography.runtime import cell_context
from histo_cartography.exports import save_parquet
from histo_cartography.checkpoint import write_manifest
from histo_cartography.debug_tools import show_parquet

cfg = yaml.safe_load((PROJECT_ROOT / "pipeline_config.yaml").read_text())
ckpt_dir = PROJECT_ROOT / cfg["paths"]["checkpoints_dir"] / "B"
ckpt_dir.mkdir(parents=True, exist_ok=True)

items_path = ckpt_dir / "items.parquet"

with cell_context("B4.0", purpose="Save items checkpoint", stage="B", checkpoint_paths=[str(items_path)]):
    save_parquet(items_df, items_path)
    write_manifest(items_path, schema_version=cfg["project"]["schema_version"], df=items_df, key_cols=["item_id"])

print("✅ items saved to:", items_path)

# Read back immediately for QC (as a table)
_ = show_parquet(items_path, title="checkpoints/B/items.parquet (saved)", n=10)

INFO:histo_cartography:▶️  B4.0: Save items checkpoint
INFO:histo_cartography:Saved parquet
INFO:histo_cartography:Wrote manifest
INFO:histo_cartography:✅ B4.0 finished in 4.68s


✅ items saved to: /content/drive/MyDrive/mit/histopathology_202601012/checkpoints/B/items.parquet
📄 checkpoints/B/items.parquet (saved)
shape=(512, 9)


|   | item_id | source | split | label | text | image_path | width | height | mpp |
|---|---|---|---|---|---|---|---|---|---|
| 0 | CRC_VAL_HE_7K::val::ADI::ADI-TCGA-AEDALKHL | CRC_VAL_HE_7K | val | ADI |  | /content/drive/MyDrive/mit/histopathology_2026... | 224 | 224 | 0.5 |
| 1 | CRC_VAL_HE_7K::val::ADI::ADI-TCGA-AGWWSHFM | CRC_VAL_HE_7K | val | ADI |  | /content/drive/MyDrive/mit/histopathology_2026... | 224 | 224 | 0.5 |
| 2 | CRC_VAL_HE_7K::val::ADI::ADI-TCGA-AIQQNFEC | CRC_VAL_HE_7K | val | ADI |  | /content/drive/MyDrive/mit/histopathology_2026... | 224 | 224 | 0.5 |
| 3 | CRC_VAL_HE_7K::val::ADI::ADI-TCGA-AKVLMQER | CRC_VAL_HE_7K | val | ADI |  | /content/drive/MyDrive/mit/histopathology_2026... | 224 | 224 | 0.5 |
| 4 | CRC_VAL_HE_7K::val::ADI::ADI-TCGA-ASWYCRFC | CRC_VAL_HE_7K | val | ADI |  | /content/drive/MyDrive/mit/histopathology_2026... | 224 | 224 | 0.5 |
| 5 | CRC_VAL_HE_7K::val::ADI::ADI-TCGA-CIWDYDNF | CRC_VAL_HE_7K | val | ADI |  | /content/drive/MyDrive/mit/histopathology_2026... | 224 | 224 | 0.5 |
| 6 | CRC_VAL_HE_7K::val::ADI::ADI-TCGA-CWHWCNCI | CRC_VAL_HE_7K | val | ADI |  | /content/drive/MyDrive/mit/histopathology_2026... | 224 | 224 | 0.5 |
| 7 | CRC_VAL_HE_7K::val::ADI::ADI-TCGA-DFGTTFYR | CRC_VAL_HE_7K | val | ADI |  | /content/drive/MyDrive/mit/histopathology_2026... | 224 | 224 | 0.5 |
| 8 | CRC_VAL_HE_7K::val::ADI::ADI-TCGA-DYEVWMFC | CRC_VAL_HE_7K | val | ADI |  | /content/drive/MyDrive/mit/histopathology_2026... | 224 | 224 | 0.5 |
| 9 | CRC_VAL_HE_7K::val::ADI::ADI-TCGA-DYYTMTTH | CRC_VAL_HE_7K | val | ADI |  | /content/drive/MyDrive/mit/histopathology_2026... | 224 | 224 | 0.5 |


In [6]:
from histo_cartography.debug_tools import show_parquet
from pathlib import Path

items_path = Path(items_path)

# Convenience re-display (in case user scrolls)
_ = show_parquet(items_path, title="checkpoints/B/items.parquet (re-open)", n=15)

📄 checkpoints/B/items.parquet (re-open)
shape=(512, 9)


|   | item_id | source | split | label | text | image_path | width | height | mpp |
|---|---|---|---|---|---|---|---|---|---|
| 0 | CRC_VAL_HE_7K::val::ADI::ADI-TCGA-AEDALKHL | CRC_VAL_HE_7K | val | ADI |  | /content/drive/MyDrive/mit/histopathology_2026... | 224 | 224 | 0.5 |
| 1 | CRC_VAL_HE_7K::val::ADI::ADI-TCGA-AGWWSHFM | CRC_VAL_HE_7K | val | ADI |  | /content/drive/MyDrive/mit/histopathology_2026... | 224 | 224 | 0.5 |
| 2 | CRC_VAL_HE_7K::val::ADI::ADI-TCGA-AIQQNFEC | CRC_VAL_HE_7K | val | ADI |  | /content/drive/MyDrive/mit/histopathology_2026... | 224 | 224 | 0.5 |
| 3 | CRC_VAL_HE_7K::val::ADI::ADI-TCGA-AKVLMQER | CRC_VAL_HE_7K | val | ADI |  | /content/drive/MyDrive/mit/histopathology_2026... | 224 | 224 | 0.5 |
| 4 | CRC_VAL_HE_7K::val::ADI::ADI-TCGA-ASWYCRFC | CRC_VAL_HE_7K | val | ADI |  | /content/drive/MyDrive/mit/histopathology_2026... | 224 | 224 | 0.5 |
| 5 | CRC_VAL_HE_7K::val::ADI::ADI-TCGA-CIWDYDNF | CRC_VAL_HE_7K | val | ADI |  | /content/drive/MyDrive/mit/histopathology_2026... | 224 | 224 | 0.5 |
| 6 | CRC_VAL_HE_7K::val::ADI::ADI-TCGA-CWHWCNCI | CRC_VAL_HE_7K | val | ADI |  | /content/drive/MyDrive/mit/histopathology_2026... | 224 | 224 | 0.5 |
| 7 | CRC_VAL_HE_7K::val::ADI::ADI-TCGA-DFGTTFYR | CRC_VAL_HE_7K | val | ADI |  | /content/drive/MyDrive/mit/histopathology_2026... | 224 | 224 | 0.5 |
| 8 | CRC_VAL_HE_7K::val::ADI::ADI-TCGA-DYEVWMFC | CRC_VAL_HE_7K | val | ADI |  | /content/drive/MyDrive/mit/histopathology_2026... | 224 | 224 | 0.5 |
| 9 | CRC_VAL_HE_7K::val::ADI::ADI-TCGA-DYYTMTTH | CRC_VAL_HE_7K | val | ADI |  | /content/drive/MyDrive/mit/histopathology_2026... | 224 | 224 | 0.5 |
