# 04 — Revamped Mantis Multimodal Knowledge Graph Export (Colab)

This notebook **replaces the failed `04_HistoCartography_Mantis_CSV_Export.ipynb`** with a **Mantis‑compliant multimodal CSV export** that produces a *healthy* visualization:

- **>5 levels** of supervised semantic hierarchy (`cluster_l1 ... cluster_l7`)
- **No unlabeled clusters** (every node has every level populated)
- **Clinically meaningful separation** via **hybrid embeddings** (Visual + Semantic)
- A single **Unified Atlas** DataFrame merging:
  - `CRC_VAL_HE_7K` (Zenodo)
  - `NCT_CRC_HE_100K` (Zenodo; large)
  - `MEDMNIST_PATHMNIST` (MedMNIST)
  - `HF_PCAM` (HuggingFace `pcam`)
  - **+4 additional open datasets**:
    - `HF_LC25000` (lung + colon)
    - `HF_BACH` (breast)
    - `HF_BREAKHIS_RCL_7500` (breast)
    - `ORCA_ORAL_ANNOTATED_100` (oral / OSCC; small)

> **Output:** `mantis_unified_atlas_multimodal.csv` (ready for Mantis import)  
> **Optional:** API upload cell (you provide the token at runtime).

---

## Notes on compute & disk

`NCT_CRC_HE_100K` is ~11.7 GB, so the config cell disables it by default (`ENABLE["NCT_CRC_HE_100K"] = False`); set the flag to `True` only if you have the disk space.  
All other datasets are capped by **SAFE_MODE sampling** (`MAX_ITEMS_PER_DATASET`), so the notebook still runs end‑to‑end on Colab.
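
A quick free-space check can help decide whether to enable the large dataset. The sketch below uses only the standard library; the helper name `has_room_for` and the 2x headroom factor are assumptions for illustration and are not part of the notebook.

```python
import shutil
from pathlib import Path

NCT_ARCHIVE_GB = 11.7  # approximate archive size quoted in this section


def has_room_for(size_gb: float, path: str = ".", headroom: float = 2.0) -> bool:
    """Return True if `path` has at least `size_gb * headroom` GB free.

    The headroom factor is an assumption: extracting an archive needs space
    for the archive and for the extracted files at the same time.
    """
    free_gb = shutil.disk_usage(Path(path)).free / 1e9
    return free_gb >= size_gb * headroom


# Example: decide whether the large dataset is worth enabling.
enable_nct = has_room_for(NCT_ARCHIVE_GB)
```

On Colab, pass the directory that will hold the data (for example the mounted Drive folder) as `path`, since free space differs between the VM disk and Drive.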


In [1]:
#@title 1) Colab setup (install deps, mount drive optional)

# ---- (A) Optional: mount Google Drive (recommended for large datasets) ----
try:
    from google.colab import drive  # type: ignore
    DRIVE_MOUNT = True
except Exception:
    DRIVE_MOUNT = False

if DRIVE_MOUNT:
    drive.mount("/content/drive")

# ---- (B) Install deps ----
# (torch/torchvision are usually preinstalled on Colab, so they are not installed here)
!pip -q install --upgrade pip
!pip -q install pandas numpy pillow tqdm scikit-learn umap-learn sentence-transformers datasets medmnist requests

# ---- (C) Project path ----
import os, sys
from pathlib import Path

PROJECT_ROOT = Path.cwd()
if (PROJECT_ROOT / "histo_cartography").exists():
    sys.path.insert(0, str(PROJECT_ROOT))
else:
    # If you opened the notebook from Drive, point PROJECT_ROOT at the repo folder.
    # Example:
    # PROJECT_ROOT = Path("/content/drive/MyDrive/histopathology_kg_project")
    # sys.path.insert(0, str(PROJECT_ROOT))
    pass

print("PROJECT_ROOT =", PROJECT_ROOT.resolve())


Mounted at /content/drive
PROJECT_ROOT = /content


In [2]:
#@title 2) Configuration

from pathlib import Path
import json
import numpy as np
import pandas as pd
from tqdm import tqdm

# --------------------------
# Core switches
# --------------------------
SAFE_MODE = True              # sample aggressively for quick iteration
SEED = 1337
np.random.seed(SEED)

# Data locations
DATA_ROOT = Path("/content/histo_data")  # change if you want to store in Drive
RAW_DIR = DATA_ROOT / "data_raw"
STAGING_DIR = DATA_ROOT / "staging"
EXPORT_DIR = DATA_ROOT / "exports"
RAW_DIR.mkdir(parents=True, exist_ok=True)
STAGING_DIR.mkdir(parents=True, exist_ok=True)
EXPORT_DIR.mkdir(parents=True, exist_ok=True)

# Dataset toggles (you can turn off large sources)
ENABLE = {
    "CRC_VAL_HE_7K": True,
    "NCT_CRC_HE_100K": False,        # LARGE (~11.7GB)
    "MEDMNIST_PATHMNIST": True,
    "HF_PCAM": True,

    # Additional open-source diversity datasets
    "HF_LC25000": True,             # lung + colon
    "HF_BACH": True,                # breast (4-way)
    "HF_BREAKHIS_RCL_7500": True,   # breast (2-way)
    "ORCA_ORAL_ANNOTATED_100": True # oral / OSCC (small)
}

# SAFE_MODE sampling controls (per dataset)
MAX_ITEMS_PER_DATASET = 800 if SAFE_MODE else None     # cap per dataset after ingestion
SPLIT = "train"                                        # staging split folder name

# Embedding controls
VISION_BACKBONE = "resnet50"      # "resnet50" (default) or "phikon" (optional cell later)
TEXT_MODEL = "sentence-transformers/all-MiniLM-L6-v2"

FUSION_MODE = "concat"            # "concat" recommended (visual || semantic)
FUSION_ALPHA = 0.65               # used only for weighted_sum modes (not default)

REDUCE_DIM = 512                  # reduce fused vector to this dim (PCA)
COMPUTE_2D_LAYOUT = True
UMAP_N_NEIGHBORS = 30
UMAP_MIN_DIST = 0.08

# Mantis export filename
EXPORT_CSV = EXPORT_DIR / "mantis_unified_atlas_multimodal.csv"

print("DATA_ROOT:", DATA_ROOT)
print("EXPORT_CSV:", EXPORT_CSV)


DATA_ROOT: /content/histo_data
EXPORT_CSV: /content/histo_data/exports/mantis_unified_atlas_multimodal.csv


## 3) Dataset registry (sources + licenses)

We keep dataset metadata **separate** from the ontology labels.  
This metadata is injected into the **`metadata` JSON column** and also contributes to the **semantic text** used for embedding.


In [3]:
#@title Dataset registry

DATASET_REGISTRY = {
    # Zenodo CRC datasets (Kather et al.)
    "CRC_VAL_HE_7K": {
        "provider": "zenodo",
        "license": "unknown (see Zenodo record)",
        "url": "https://zenodo.org/records/1214456",
        "citation": "Kather et al. (2018) / Zenodo CRC-VAL-HE-7K",
        "notes": "Colorectal histology tiles; 9 tissue classes."
    },
    "NCT_CRC_HE_100K": {
        "provider": "zenodo",
        "license": "unknown (see Zenodo record)",
        "url": "https://zenodo.org/records/1214456",
        "citation": "Kather et al. (2018) / Zenodo NCT-CRC-HE-100K",
        "notes": "Large colorectal histology tile dataset; 9 tissue classes."
    },

    # MedMNIST
    "MEDMNIST_PATHMNIST": {
        "provider": "medmnist",
        "license": "see MedMNIST site (dataset-specific)",
        "url": "https://medmnist.com/",
        "citation": "MedMNIST v2 PathMNIST",
        "notes": "Derived from colorectal histopathology patches; 9 classes."
    },

    # HF datasets
    "HF_PCAM": {
        "provider": "huggingface",
        "license": "see HuggingFace dataset card",
        "url": "https://huggingface.co/datasets/pcam",
        "citation": "PatchCamelyon (PCam)",
        "notes": "Lymph node metastasis (binary).",
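        "hf_id": "pcam",  # NOTE: the logged ingestion run could not resolve this ID on the Hub; verify it against the dataset card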
        "image_col": "image",
        "label_col": "label",
    },

    # Additional diversity datasets
    "HF_LC25000": {
        "provider": "huggingface",
        "license": "unlicense (per dataset card)",
        "url": "https://huggingface.co/datasets/1aurent/LC25000",
        "citation": "LC25000 via HuggingFace (1aurent/LC25000)",
        "notes": "Lung + colon histopathology (classification).",
        "hf_id": "1aurent/LC25000",
        "image_col": "image",
        "label_col": "label",
        "extra_cols": ["organ"],   # IMPORTANT: needed for ontology
    },
    "HF_BACH": {
        "provider": "huggingface",
        "license": "cc-by-nc-nd-4.0 (per dataset card)",
        "url": "https://huggingface.co/datasets/1aurent/BACH",
        "citation": "BACH (breast histology) via HuggingFace (1aurent/BACH)",
        "notes": "Breast histology microscopy images (4-way).",
        "hf_id": "1aurent/BACH",
        "image_col": "image",
        "label_col": "label",
    },
    "HF_BREAKHIS_RCL_7500": {
        "provider": "huggingface",
        "license": "cc-by-4.0 (per dataset card)",
        "url": "https://huggingface.co/datasets/LuminaAI/RCL-Breast-Cancer-Biopsy-7500",
        "citation": "BreakHis-derived Breast Cancer Biopsy 7500 (LuminaAI)",
        "notes": "Breast histopathology (benign vs malignant).",
        "hf_id": "LuminaAI/RCL-Breast-Cancer-Biopsy-7500",
        "image_col": "image",
        "label_col": "label",
    },

    # ORCA (oral)
    "ORCA_ORAL_ANNOTATED_100": {
        "provider": "gdrive",
        "license": "research/educational; see ORCA site",
        "url": "https://sites.google.com/unibas.it/orca",
        "citation": "Martino et al. Applied Sciences 2020 (ORCA dataset)",
        "notes": "Oral Cancer Annotated dataset: 100 OSCC samples (validation set).",
        # Google Drive link from ORCA website (validation set) — may change
        "gdrive_url": "https://drive.google.com/drive/folders/1XfplgYK5JWzzYWXQhrPUQujXNKUDK-WR?usp=sharing",
    },
}

print("Registered datasets:", list(DATASET_REGISTRY.keys()))


Registered datasets: ['CRC_VAL_HE_7K', 'NCT_CRC_HE_100K', 'MEDMNIST_PATHMNIST', 'HF_PCAM', 'HF_LC25000', 'HF_BACH', 'HF_BREAKHIS_RCL_7500', 'ORCA_ORAL_ANNOTATED_100']
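
Because the registry entries carry provider-specific keys, a small consistency check can catch a missing field before ingestion starts. The required-key sets below are an assumption inferred from the entries in this notebook (Hugging Face entries carry `hf_id`, `image_col`, `label_col`; the Google Drive entry carries `gdrive_url`), not a documented schema, and `registry_problems` is a hypothetical helper name.

```python
# Keys every entry is assumed to need, plus extra keys per provider.
REQUIRED_KEYS = {"provider", "license", "url", "citation", "notes"}
PROVIDER_KEYS = {
    "huggingface": {"hf_id", "image_col", "label_col"},
    "gdrive": {"gdrive_url"},
}


def registry_problems(registry: dict) -> dict:
    """Map dataset key -> sorted list of missing field names (empty dict if OK)."""
    problems = {}
    for key, entry in registry.items():
        required = REQUIRED_KEYS | PROVIDER_KEYS.get(entry.get("provider"), set())
        missing = sorted(required - set(entry))
        if missing:
            problems[key] = missing
    return problems
```

Calling `registry_problems(DATASET_REGISTRY)` after the registry cell should return an empty dict if every entry is complete under these assumptions.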


## 4) Data ingestion to a single Unified Atlas table

Each dataset is materialized to a common schema:

- `item_id` (unique)
- `image_path`
- `dataset_key`
- `raw_label`
- `split`
- `width`, `height`, `mpp`
- `metadata` (dict; later JSON)

This step intentionally **does not** assign ontology yet.
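
As an illustration of the common schema, the sketch below builds records from a staging folder laid out as `images_dir/<raw_label>/<file>`. The folder layout, the function name `items_from_label_folders`, and the `size_fn` hook are assumptions; the `item_id` pattern mirrors the IDs visible in this notebook's output. It is not the notebook's own helper, only one possible shape for it.

```python
from pathlib import Path
from typing import Callable, Optional, Tuple

IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg", ".tif", ".tiff"}
SizeFn = Callable[[Path], Tuple[Optional[int], Optional[int]]]


def items_from_label_folders(
    images_dir: Path,
    dataset_key: str,
    split: str,
    mpp: Optional[float] = None,
    size_fn: SizeFn = lambda p: (None, None),
) -> list:
    """Walk `images_dir/<raw_label>/<file>` and return one record per image."""
    records = []
    for path in sorted(Path(images_dir).glob("*/*")):
        if not path.is_file() or path.suffix.lower() not in IMAGE_SUFFIXES:
            continue
        raw_label = path.parent.name          # folder name is the raw label
        width, height = size_fn(path)         # (None, None) unless a reader is supplied
        records.append({
            "item_id": f"{dataset_key}::{split}::{raw_label}::{path.stem}",
            "image_path": str(path),
            "dataset_key": dataset_key,
            "raw_label": raw_label,
            "split": split,
            "width": width,
            "height": height,
            "mpp": mpp,
            "metadata": {},                   # filled in by a later step
        })
    return records
```

In the notebook, `size_fn` could be a small Pillow-based reader that returns the image size, and the list converts to a DataFrame with `pd.DataFrame(records)`.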


In [None]:
#@title 4B) Ingest all enabled datasets into a Unified Atlas table

import os
import shutil
from pathlib import Path

# --- FIX: Ensure histo_cartography package exists from local files ---
# If datasets.py is in the root (uploaded directly), move it to histo_cartography/datasets.py
if os.path.isfile("datasets.py") and not os.path.exists("histo_cartography/datasets.py"):
    print("Moving datasets.py to histo_cartography/datasets.py to create package...")
    os.makedirs("histo_cartography", exist_ok=True)
    shutil.move("datasets.py", "histo_cartography/datasets.py")
    # Create empty __init__.py to make it a package
    Path("histo_cartography/__init__.py").touch()
# ---------------------------------------------------------------------

from histo_cartography.datasets import prepare_dataset_to_staging  # re-use existing robust logic

unified_parts = []

def ingest_crc_medmnist_pcam(dataset_key: str) -> pd.DataFrame:
    items_df, images_dir = prepare_dataset_to_staging(
        dataset_key,
        raw_dir=RAW_DIR,
        staging_dir=STAGING_DIR / dataset_key,
        split=SPLIT,
        safe_mode=SAFE_MODE,
        max_items=MAX_ITEMS_PER_DATASET,
        seed=SEED,
        overwrite=False,
        allow_large=True,          # allow big CRC zip
        verify_md5=False,
        use_text_modality=False,   # we will add rich semantic text ourselves
        text_template_version="v2_no_label",
    )
    # normalize schema
    return (items_df.rename(columns={"label": "raw_label", "source": "dataset_key"})
                   [["item_id","dataset_key","split","raw_label","image_path","width","height","mpp"]])

# --- CRC + MedMNIST + PCAM ---
for k in ["CRC_VAL_HE_7K","NCT_CRC_HE_100K","MEDMNIST_PATHMNIST","HF_PCAM"]:
    if ENABLE.get(k, False):
        print(f"\n=== Ingest: {k} ===")
        unified_parts.append(ingest_crc_medmnist_pcam(k))

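# NOTE: export_hf_dataset_to_staging, build_items_from_images_dir and download_orca_to_staging
# are not defined in this notebook export; they must be provided by an earlier helper cell,
# otherwise the blocks below raise NameError.
# --- HF_LC25000 (needs organ + label => folder = organ__label) ---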
if ENABLE.get("HF_LC25000", False):
    print("\n=== Ingest: HF_LC25000 ===")
    images_dir = export_hf_dataset_to_staging(
        "HF_LC25000",
        STAGING_DIR,
        split="train",
        max_items=MAX_ITEMS_PER_DATASET,
        seed=SEED,
        label_folder_fn=lambda row: f"{row['organ']}__{row['label']}",
    )
    unified_parts.append(build_items_from_images_dir(images_dir, "HF_LC25000", split="train", mpp=0.5))

# --- HF_BACH (folder = breast__label) ---
if ENABLE.get("HF_BACH", False):
    print("\n=== Ingest: HF_BACH ===")
    images_dir = export_hf_dataset_to_staging(
        "HF_BACH",
        STAGING_DIR,
        split="train",
        max_items=MAX_ITEMS_PER_DATASET,
        seed=SEED,
        label_folder_fn=lambda row: f"breast__{row['label']}",
    )
    unified_parts.append(build_items_from_images_dir(images_dir, "HF_BACH", split="train", mpp=0.5))

# --- HF_BREAKHIS_RCL_7500 (folder = breast__benign/malignant) ---
if ENABLE.get("HF_BREAKHIS_RCL_7500", False):
    print("\n=== Ingest: HF_BREAKHIS_RCL_7500 ===")
    images_dir = export_hf_dataset_to_staging(
        "HF_BREAKHIS_RCL_7500",
        STAGING_DIR,
        split="train",
        max_items=MAX_ITEMS_PER_DATASET,
        seed=SEED,
        label_folder_fn=lambda row: f"breast__{'benign' if int(row['label'])==0 else 'malignant'}",
    )
    unified_parts.append(build_items_from_images_dir(images_dir, "HF_BREAKHIS_RCL_7500", split="train", mpp=0.5))

# --- ORCA oral dataset ---
if ENABLE.get("ORCA_ORAL_ANNOTATED_100", False):
    print("\n=== Ingest: ORCA_ORAL_ANNOTATED_100 ===")
    images_dir = download_orca_to_staging("ORCA_ORAL_ANNOTATED_100", STAGING_DIR, overwrite=False)
    unified_parts.append(build_items_from_images_dir(images_dir, "ORCA_ORAL_ANNOTATED_100", split="val", mpp=0.5))

# Merge to one dataframe
atlas = pd.concat(unified_parts, ignore_index=True)

# Attach dataset metadata dict (later becomes JSON string)
def build_metadata_row(row) -> dict:
    meta = DATASET_REGISTRY.get(row["dataset_key"], {})
    return {
        "dataset_key": row["dataset_key"],
        "provider": meta.get("provider"),
        "license": meta.get("license"),
        "dataset_url": meta.get("url"),
        "citation": meta.get("citation"),
        "notes": meta.get("notes"),
        "raw_label": row["raw_label"],
        "split": row["split"],
        "mpp": row.get("mpp"),
        "width": row.get("width"),
        "height": row.get("height"),
    }

atlas["metadata"] = atlas.apply(build_metadata_row, axis=1)

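# Cap per dataset if requested: shuffle once, then keep the first N rows of each dataset.
# (groupby(...).head avoids the pandas deprecation warning raised by groupby(...).apply.)
if MAX_ITEMS_PER_DATASET is not None:
    atlas = (atlas.sample(frac=1.0, random_state=SEED)
                  .groupby("dataset_key")
                  .head(MAX_ITEMS_PER_DATASET)
                  .sort_values("dataset_key", kind="stable")
                  .reset_index(drop=True))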

print("Unified atlas shape:", atlas.shape)
atlas.head()

Moving datasets.py to histo_cartography/datasets.py to create package...

=== Ingest: CRC_VAL_HE_7K ===



Installing umap-learn...
✅ Logged in to Hugging Face successfully.
✅ Found project root at: /content/drive/MyDrive/mit/histopathology_mantis_20260115
Cleared 4 stale modules.
Cleaning RAW_DIR to force fresh downloads (fixes BadZipFile)...

=== Ingest: CRC_VAL_HE_7K ===

=== Ingest: MEDMNIST_PATHMNIST ===

=== Ingest: HF_PCAM ===
❌ Failed to ingest HF_PCAM: Dataset 'pcam' doesn't exist on the Hub or cannot be accessed.

=== Ingest: HF_LC25000 ===
❌ Failed to ingest HF_LC25000: name 'export_hf_dataset_to_staging' is not defined

=== Ingest: HF_BACH ===
❌ Failed to ingest HF_BACH: name 'export_hf_dataset_to_staging' is not defined

=== Ingest: HF_BREAKHIS_RCL_7500 ===
❌ Failed to ingest HF_BREAKHIS_RCL_7500: name 'export_hf_dataset_to_staging' is not defined

=== Ingest: ORCA_ORAL_ANNOTATED_100 ===
❌ Failed to ingest ORCA_ORAL_ANNOTATED_100: name 'download_orca_to_staging' is not defined
Unified atlas shape: (1600, 9)




| | item_id | dataset_key | split | raw_label | image_path | width | height | mpp | metadata |
|---|---|---|---|---|---|---|---|---|---|
| 0 | CRC_VAL_HE_7K::train::BACK::BACK-TCGA-QENFEQCV | CRC_VAL_HE_7K | train | BACK | /content/histo_data/staging/CRC_VAL_HE_7K/imag... | 224 | 224 | 0.5 | {'dataset_key': 'CRC_VAL_HE_7K', 'provider': '... |
| 1 | CRC_VAL_HE_7K::train::DEB::DEB-TCGA-ARIHITHS | CRC_VAL_HE_7K | train | DEB | /content/histo_data/staging/CRC_VAL_HE_7K/imag... | 224 | 224 | 0.5 | {'dataset_key': 'CRC_VAL_HE_7K', 'provider': '... |
| 2 | CRC_VAL_HE_7K::train::DEB::DEB-TCGA-MEEEDYVH | CRC_VAL_HE_7K | train | DEB | /content/histo_data/staging/CRC_VAL_HE_7K/imag... | 224 | 224 | 0.5 | {'dataset_key': 'CRC_VAL_HE_7K', 'provider': '... |
| 3 | CRC_VAL_HE_7K::train::STR::STR-TCGA-THRGREQS | CRC_VAL_HE_7K | train | STR | /content/histo_data/staging/CRC_VAL_HE_7K/imag... | 224 | 224 | 0.5 | {'dataset_key': 'CRC_VAL_HE_7K', 'provider': '... |
| 4 | CRC_VAL_HE_7K::train::MUS::MUS-TCGA-SIWYHYDQ | CRC_VAL_HE_7K | train | MUS | /content/histo_data/staging/CRC_VAL_HE_7K/imag... | 224 | 224 | 0.5 | {'dataset_key': 'CRC_VAL_HE_7K', 'provider': '... |


## 5) Supervised Ontology Mapper (>5 levels)

We **do not** use unsupervised cluster IDs as semantic labels.

Instead, we map every row to a **7‑level semantic path**:

1. **Domain** (Tissue vs Artifact)  
2. **Organ System** (Colon, Lung, Breast, Lymph Node, Oral Cavity, …)  
3. **General Pathology** (Neoplastic, Non‑Neoplastic, Inflammatory, Necrosis, Artifact)  
4. **Tissue Structure** (Epithelium, Stroma, Immune, Muscle, Extracellular, Background, …)  
5. **Specific Class** (Adenocarcinoma, Lymphocytes, Adipose Tissue, …)  
6. **Subtype / Context** (Invasive vs In‑situ; Mets+/−; dataset‑specific)  
7. **Diagnostic Modifier** (**mocked deterministically** if not provided)

All 7 columns are exported as `cluster_l1 ... cluster_l7`.
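
Since every row must carry all seven levels, it can help to validate the label-to-path lookup tables before applying them. The sketch below is one way to do that; the rule that only level 7 may be `None` (to be filled in later) is inferred from the mapper tables in this notebook, and `check_ontology_map` is a hypothetical helper name.

```python
from typing import Dict, List, Optional, Tuple

N_LEVELS = 7
OntologyPath = Tuple[Optional[str], ...]


def check_ontology_map(name: str, mapping: Dict[str, OntologyPath]) -> List[str]:
    """Return human-readable problems; an empty list means the map is well-formed."""
    problems = []
    for label, path in mapping.items():
        if len(path) != N_LEVELS:
            problems.append(f"{name}[{label!r}]: expected {N_LEVELS} levels, got {len(path)}")
            continue
        for level, value in enumerate(path, start=1):
            # Only the last level may be None (it is filled in later).
            if value is None and level < N_LEVELS:
                problems.append(f"{name}[{label!r}]: level {level} is None")
            elif value is not None and not str(value).strip():
                problems.append(f"{name}[{label!r}]: level {level} is blank")
    return problems
```

Running it over each lookup table (for example `check_ontology_map("CRC_MAP", CRC_MAP)`) before the mapper is applied surfaces malformed entries early, instead of at the final blank-value assertion.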


In [None]:
#@title 5) Ontology mapper (7 levels)

import hashlib

CRC_MAP = {
    "ADI":  ("Tissue", "Colon", "Non-Neoplastic", "Stroma", "Adipose Tissue", "Adipose", "Unspecified"),
    "BACK": ("Artifact", "Colon", "Artifact", "Background", "Glass/Empty", "Background", "N/A"),
    "DEB":  ("Tissue", "Colon", "Necrosis", "Debris", "Necrotic Detritus", "Debris", "Unspecified"),
    "LYM":  ("Tissue", "Colon", "Inflammatory", "Immune", "Lymphocytes", "Lymphoid", "Unspecified"),
    "MUC":  ("Tissue", "Colon", "Non-Neoplastic", "Extracellular", "Mucus", "Mucin", "Unspecified"),
    "MUS":  ("Tissue", "Colon", "Non-Neoplastic", "Muscle", "Smooth Muscle", "Muscularis", "Unspecified"),
    "NORM": ("Tissue", "Colon", "Non-Neoplastic", "Epithelium", "Normal Mucosa", "Crypts", "Unspecified"),
    "STR":  ("Tissue", "Colon", "Non-Neoplastic", "Stroma", "Connective Tissue", "Fibrous stroma", "Unspecified"),
    "TUM":  ("Tissue", "Colon", "Neoplastic", "Epithelium", "Adenocarcinoma", "Tumor epithelium", None),  # modifier later
}

PATHMNIST_ALIAS = {
    "adipose": "ADI",
    "background": "BACK",
    "debris": "DEB",
    "lymphocytes": "LYM",
    "mucus": "MUC",
    "smooth muscle": "MUS",
    "normal colon mucosa": "NORM",
    "stroma": "STR",
    "tumor epithelium": "TUM",
}

BACH_MAP = {
    "breast__normal":   ("Tissue", "Breast", "Non-Neoplastic", "Epithelium", "Normal Breast Tissue", "Normal", "Unspecified"),
    "breast__benign":   ("Tissue", "Breast", "Non-Neoplastic", "Epithelium", "Benign Lesion", "Benign", "Unspecified"),
    "breast__in situ carcinoma": ("Tissue", "Breast", "Neoplastic", "Epithelium", "Carcinoma", "In situ", None),
    "breast__in-situ carcinoma": ("Tissue", "Breast", "Neoplastic", "Epithelium", "Carcinoma", "In situ", None),
    "breast__in situ": ("Tissue", "Breast", "Neoplastic", "Epithelium", "Carcinoma", "In situ", None),
    "breast__invasive carcinoma": ("Tissue", "Breast", "Neoplastic", "Epithelium", "Carcinoma", "Invasive", None),
    "breast__invasive": ("Tissue", "Breast", "Neoplastic", "Epithelium", "Carcinoma", "Invasive", None),
}

def stable_mock_grade(item_id: str) -> str:
    h = int(hashlib.md5(item_id.encode("utf-8")).hexdigest(), 16)
    grades = ["Low Grade (mocked)", "Intermediate Grade (mocked)", "High Grade (mocked)"]
    return grades[h % len(grades)]

def map_to_ontology(row: pd.Series) -> tuple[str, str, str, str, str, str, str]:
    dk = row["dataset_key"]
    raw = str(row["raw_label"]).strip()

    # CRC datasets
    if dk in {"CRC_VAL_HE_7K", "NCT_CRC_HE_100K"}:
        base = CRC_MAP.get(raw)
        if base is None:
            base = ("Tissue", "Colon", "Unknown", "Unknown", raw, "Unspecified", "Unspecified")
        l1,l2,l3,l4,l5,l6,l7 = base
        if l7 is None:
            l7 = stable_mock_grade(row["item_id"])
        return (l1,l2,l3,l4,l5,l6,l7)

    # MedMNIST PathMNIST
    if dk == "MEDMNIST_PATHMNIST":
        key = raw.lower()
        crc_code = PATHMNIST_ALIAS.get(key)
        if crc_code and crc_code in CRC_MAP:
            l1,l2,l3,l4,l5,l6,l7 = CRC_MAP[crc_code]
            if l7 is None:
                l7 = stable_mock_grade(row["item_id"])
            return (l1,l2,l3,l4,l5,l6,l7)
        return ("Tissue","Colon","Unknown","Unknown",raw,"Unspecified","Unspecified")

    # PCAM
    if dk == "HF_PCAM":
        if raw.lower() in {"1","pos","positive","tumor","metastasis"}:
            return ("Tissue","Lymph Node","Neoplastic","Metastasis","Tumor Tissue","Mets Positive", stable_mock_grade(row["item_id"]))
        return ("Tissue","Lymph Node","Non-Neoplastic","Lymphoid Tissue","Normal Tissue","Mets Negative","Unspecified")

    # LC25000: raw_label = organ__class
    if dk == "HF_LC25000":
        parts = raw.split("__", 1)
        # split() always returns at least one element, so check for the separator explicitly
        organ = parts[0].capitalize() if len(parts) > 1 else "Unknown"
        cls = parts[1] if len(parts) > 1 else raw
        cls_low = cls.lower()
        if "adenocarc" in cls_low:
            return ("Tissue", organ, "Neoplastic", "Epithelium", "Adenocarcinoma", "Primary", stable_mock_grade(row["item_id"]))
        if "squamous" in cls_low:
            return ("Tissue", organ, "Neoplastic", "Epithelium", "Squamous Cell Carcinoma", "Primary", stable_mock_grade(row["item_id"]))
        if "benign" in cls_low or "normal" in cls_low:
            return ("Tissue", organ, "Non-Neoplastic", "Parenchyma", "Benign/Normal Tissue", "Benign", "Unspecified")
        return ("Tissue", organ, "Unknown", "Unknown", cls, "Unspecified", "Unspecified")

    # BACH
    if dk == "HF_BACH":
        key = raw.lower().strip()
        base = BACH_MAP.get(key)
        if base is not None:
            l1,l2,l3,l4,l5,l6,l7 = base
            if l7 is None:
                l7 = stable_mock_grade(row["item_id"])
            return (l1,l2,l3,l4,l5,l6,l7)
        if "invasive" in key:
            return ("Tissue","Breast","Neoplastic","Epithelium","Carcinoma","Invasive", stable_mock_grade(row["item_id"]))
        if "in situ" in key or "insitu" in key:
            return ("Tissue","Breast","Neoplastic","Epithelium","Carcinoma","In situ", stable_mock_grade(row["item_id"]))
        if "benign" in key:
            return ("Tissue","Breast","Non-Neoplastic","Epithelium","Benign Lesion","Benign","Unspecified")
        if "normal" in key:
            return ("Tissue","Breast","Non-Neoplastic","Epithelium","Normal Breast Tissue","Normal","Unspecified")
        return ("Tissue","Breast","Unknown","Unknown",raw,"Unspecified","Unspecified")

    # BreakHis RCL 7500
    if dk == "HF_BREAKHIS_RCL_7500":
        key = raw.lower().strip()
        if "malignant" in key:
            return ("Tissue","Breast","Neoplastic","Epithelium","Carcinoma","Malignant", stable_mock_grade(row["item_id"]))
        if "benign" in key:
            return ("Tissue","Breast","Non-Neoplastic","Epithelium","Benign Lesion","Benign","Unspecified")
        return ("Tissue","Breast","Unknown","Unknown",raw,"Unspecified","Unspecified")

    # ORCA oral
    if dk == "ORCA_ORAL_ANNOTATED_100":
        return ("Tissue","Oral Cavity","Neoplastic","Epithelium","Oral Squamous Cell Carcinoma","OSCC", stable_mock_grade(row["item_id"]))

    return ("Tissue","Unknown","Unknown","Unknown",raw,"Unspecified","Unspecified")

# Apply ontology
cols = ["cluster_l1","cluster_l2","cluster_l3","cluster_l4","cluster_l5","cluster_l6","cluster_l7"]
atlas[cols] = atlas.apply(lambda r: pd.Series(map_to_ontology(r), index=cols), axis=1)

# Add rich semantic text
def build_semantic_text(row: pd.Series) -> str:
    m = row["metadata"]
    return (
        f"Domain: {row['cluster_l1']}. "
        f"Organ system: {row['cluster_l2']}. "
        f"General pathology: {row['cluster_l3']}. "
        f"Tissue structure: {row['cluster_l4']}. "
        f"Specific class: {row['cluster_l5']}. "
        f"Subtype/context: {row['cluster_l6']}. "
        f"Diagnostic modifier: {row['cluster_l7']}. "
        f"Dataset: {m.get('dataset_key')} (provider={m.get('provider')}). "
        f"Notes: {m.get('notes')}"
    ).strip()

atlas["text"] = atlas.apply(build_semantic_text, axis=1)
atlas["semantic_text"] = atlas["text"]

# Validation: every hierarchy level must be populated for every row
for c in cols:
    assert atlas[c].notna().all(), f"{c} has missing values"
    assert (atlas[c].astype(str).str.strip() != "").all(), f"{c} has blank values"

atlas[["dataset_key","raw_label"] + cols].head(10)


## 6) Hybrid embeddings (Visual + Semantic)

We compute:

- **Visual embeddings:** ResNet50 (torchvision) by default  
- **Semantic embeddings:** Sentence‑Transformers on the ontology‑rich `text` field  

Then we fuse them (default **concatenation**) and reduce with PCA to a manageable dimension for CSV export.
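
To see what concatenation does to similarity, consider two unit-length vectors per item. The cosine similarity of the concatenated vectors is then the mean of the visual and semantic cosines, so both modalities carry equal weight before the PCA step. The dependency-free sketch below checks this on toy numbers; the vectors are made up for illustration.

```python
import math
from typing import List, Sequence


def l2_normalize(v: Sequence[float]) -> List[float]:
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two vectors."""
    a, b = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a, b))


def fuse_concat(visual: Sequence[float], semantic: Sequence[float]) -> List[float]:
    """Normalize each modality, then concatenate (visual || semantic)."""
    return l2_normalize(visual) + l2_normalize(semantic)


# Two toy items: visually similar, semantically unrelated.
img_a, txt_a = [1.0, 0.0, 0.0], [1.0, 0.0]
img_b, txt_b = [0.8, 0.6, 0.0], [0.0, 1.0]

fused_sim = cosine(fuse_concat(img_a, txt_a), fuse_concat(img_b, txt_b))
mean_sim = 0.5 * (cosine(img_a, img_b) + cosine(txt_a, txt_b))
print(round(fused_sim, 6), round(mean_sim, 6))
```

For this toy pair both values are 0.4: the visual cosine of 0.8 is halved because the semantic cosine is 0. A weighted sum, by contrast, lets `FUSION_ALPHA` shift that balance.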


In [None]:
#@title 6A) Visual embeddings (ResNet50) — batch mode

from histo_cartography.embeddings import embed_images_resnet50

img_emb = embed_images_resnet50(
    atlas[["item_id","image_path"]].copy(),
    image_col="image_path",
    batch_size=64,
    max_items=None,
)
img_emb = img_emb[["item_id","vector"]].rename(columns={"vector":"vec_img"})
print("img_emb:", img_emb.shape)
img_emb.head()


In [None]:
#@title 6B) Semantic embeddings (Sentence-Transformers)

from sentence_transformers import SentenceTransformer

text_model = SentenceTransformer(TEXT_MODEL)

texts = atlas["text"].tolist()
item_ids = atlas["item_id"].tolist()

vec_txt = text_model.encode(
    texts,
    batch_size=64,
    show_progress_bar=True,
    normalize_embeddings=True,
)

txt_emb = pd.DataFrame({"item_id": item_ids, "vec_txt": [v.astype(np.float32) for v in vec_txt]})
print("txt_emb:", txt_emb.shape)
txt_emb.head()


In [None]:
#@title 6C) Fuse embeddings + reduce dimension

from sklearn.decomposition import PCA
from sklearn.preprocessing import normalize

df = atlas.merge(img_emb, on="item_id", how="left").merge(txt_emb, on="item_id", how="left")

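# After a left merge, missing embeddings are NaN (not None), so test for array-likes explicitly.
def _as_vec(v, dim: int) -> np.ndarray:
    if isinstance(v, (list, tuple, np.ndarray)):
        return np.asarray(v, dtype=np.float32)
    return np.zeros(dim, dtype=np.float32)

df["vec_img"] = df["vec_img"].apply(lambda v: _as_vec(v, 2048))  # ResNet50 pooled feature size
df["vec_txt"] = df["vec_txt"].apply(lambda v: _as_vec(v, 384))   # all-MiniLM-L6-v2 embedding size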

V_img = np.stack(df["vec_img"].to_list())
V_txt = np.stack(df["vec_txt"].to_list())

V_img = normalize(V_img)

if FUSION_MODE == "concat":
    V = np.concatenate([V_img, V_txt], axis=1)
else:
    pca_tmp = PCA(n_components=V_txt.shape[1], random_state=SEED)
    V_img_proj = pca_tmp.fit_transform(V_img)
    V = (FUSION_ALPHA * normalize(V_img_proj) + (1 - FUSION_ALPHA) * normalize(V_txt))

print("Fused dim:", V.shape)

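if REDUCE_DIM is not None and V.shape[1] > REDUCE_DIM:
    # PCA cannot return more components than min(n_samples, n_features)
    n_comp = min(REDUCE_DIM, V.shape[0])
    pca = PCA(n_components=n_comp, random_state=SEED)
    V_red = pca.fit_transform(V)
else:
    V_red = V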

V_red = normalize(V_red)
df["vector"] = [v.astype(np.float32) for v in V_red]

print("Reduced dim:", V_red.shape)
df[["item_id","dataset_key","raw_label"]].head()


## 7) (Optional) 2D layout

If your Mantis setup prefers precomputed coordinates, we compute `x`,`y` via UMAP on the fused vectors.

We also pass a **supervised target** (`cluster_l5`) to encourage separation by histologic class.


In [None]:
#@title 7) Compute 2D layout (UMAP)

if COMPUTE_2D_LAYOUT:
    import umap
    lbl = pd.Categorical(df["cluster_l5"]).codes

    reducer = umap.UMAP(
        n_neighbors=UMAP_N_NEIGHBORS,
        min_dist=UMAP_MIN_DIST,
        metric="cosine",
        random_state=SEED,
    )
    xy = reducer.fit_transform(np.stack(df["vector"].to_list()), y=lbl)
    df["x"] = xy[:,0].astype(float)
    df["y"] = xy[:,1].astype(float)
else:
    df["x"] = 0.0
    df["y"] = 0.0

df[["x","y"]].describe()


## 8) Build Mantis CSV (strict + backward compatible)

We export both:

- **New strict fields** (commonly expected in Mantis‑style spaces):  
  `id`, `text`, `vector_str`, `cluster_l1..cluster_l7`, `metadata`

- **Backward‑compatible fields** used by the legacy notebook / UI inspector:  
  `title`, `semantic_text`, `x`, `y`, `raw_label`, `dataset_key`, `split`, `mpp`, `width`, `height`, `image_path`

This makes the CSV resilient to schema differences across Mantis builds.
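
Two of these columns pack structured data into a single CSV field: `vector_str` (comma-separated floats) and `metadata` (a JSON string). The sketch below, using only the standard library, shows that such fields survive a CSV round trip as long as the writer quotes them. The row values are made up, and the notebook's own writer may differ.

```python
import csv
import io
import json

row = {
    "id": "DEMO::train::TUM::tile-0001",  # made-up ID for illustration
    "vector_str": ",".join(f"{x:.6f}" for x in [0.125, -0.5, 0.25]),
    "metadata": json.dumps({"dataset_key": "DEMO", "mpp": 0.5}),
}

# Write one row; the csv module quotes fields that contain commas or quotes.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=list(row))
writer.writeheader()
writer.writerow(row)

# Read it back and decode the two packed columns.
buf.seek(0)
loaded = next(csv.DictReader(buf))
vector = [float(x) for x in loaded["vector_str"].split(",")]
metadata = json.loads(loaded["metadata"])
```

A consumer that splits lines on commas without honoring CSV quoting would break on both columns, which is worth checking against the target Mantis build.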


In [None]:
#@title 8) Export CSV

import json

def vec_to_str(v: np.ndarray) -> str:
    # Dense float list, CSV-friendly (comma-separated)
    return ",".join([f"{x:.6f}" for x in v.tolist()])

out = df.copy()

# -------------------------
# Mantis core identifiers
# -------------------------
out["id"] = out["item_id"]                  # required-ish
out["source"] = out["dataset_key"]          # legacy alias
out["label"] = out["raw_label"]             # legacy alias

# Human-readable title (inspector / flashcards)
out["title"] = out.apply(lambda r: f"{r['cluster_l2']} | {r['cluster_l5']} ({r['cluster_l6']})", axis=1)

# -------------------------
# Vectors
# -------------------------
out["vector_str"] = out["vector"].apply(vec_to_str)

# Some Mantis builds may expect `vector` instead of `vector_str`
out["vector"] = out["vector_str"]

# -------------------------
# Ontology / hierarchy
# -------------------------
# Provide BOTH lowercase (`cluster_l1`) and uppercase-L (`cluster_L1`) variants to be robust across Mantis builds.
for i in range(1, 8):
    out[f"cluster_L{i}"] = out[f"cluster_l{i}"]

# A single "cluster" field is often used for coloring
out["cluster"] = out.apply(lambda r: f"{r['cluster_l2']} > {r['cluster_l5']}", axis=1)

# Semantic tags / keywords (useful for flashcards + search)
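hier_levels = [f"cluster_l{i}" for i in range(1, 8)]
out["ontology_path"] = out.apply(lambda r: " / ".join(str(r[c]) for c in hier_levels), axis=1)
# Build keywords from the columns directly: splitting ontology_path on "/" would break labels that contain a slash.
out["keywords"] = out.apply(lambda r: ", ".join(str(r[c]).strip() for c in hier_levels), axis=1)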
out["tags"] = out["keywords"]

# -------------------------
# Metadata
# -------------------------
# Keep rich, provenance-aware JSON. This is "semantic metadata" beyond ontology.
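# default=str keeps the export from failing on values json cannot serialize natively (e.g. NumPy scalars, Path objects).
out["metadata"] = out["metadata"].apply(lambda m: json.dumps(m, ensure_ascii=False, default=str))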

# -------------------------
# Column order
# -------------------------
base_cols = [
    # core
    "id","item_id","title","text","semantic_text",
    # vector
    "vector_str","vector",
    # coords
    "x","y",
    # hierarchy (snake_case)
    "cluster_l1","cluster_l2","cluster_l3","cluster_l4","cluster_l5","cluster_l6","cluster_l7",
    # hierarchy (uppercase-L aliases)
    "cluster_L1","cluster_L2","cluster_L3","cluster_L4","cluster_L5","cluster_L6","cluster_L7",
    # convenience
    "cluster","ontology_path","keywords","tags",
    # provenance
    "dataset_key","source","split","raw_label","label","width","height","mpp",
    # modality refs
    "image_path",
    # semantic metadata
    "metadata",
]

for c in base_cols:
    if c not in out.columns:
        out[c] = ""

export_df = out[base_cols].copy()

# Final validation: no missing hierarchy values
hier = [c for c in export_df.columns if c.startswith("cluster_l")]
assert export_df[hier].isna().sum().sum() == 0, "NaN found in hierarchy columns"
assert (export_df[hier].astype(str).apply(lambda s: s.str.strip() == "")).sum().sum() == 0, "Blank label found in hierarchy columns"

EXPORT_CSV.parent.mkdir(parents=True, exist_ok=True)
export_df.to_csv(EXPORT_CSV, index=False)

print("Wrote:", EXPORT_CSV)
print("Rows:", len(export_df), "Cols:", len(export_df.columns))
export_df.head()
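
As a sanity check on the export format, here is a small round-trip sketch on a toy frame (not the real atlas). It assumes a consumer parses `vector_str` by splitting on commas and decodes `metadata` with `json.loads`; the helper `str_to_vec` is hypothetical and not part of the notebook. Note that `pd.read_csv` turns empty strings into `NaN` by default, so `keep_default_na=False` is passed to preserve the `""` placeholders used for backfilled columns.

```python
import io
import json

import numpy as np
import pandas as pd


def vec_to_str(v: np.ndarray) -> str:
    # Same encoding as the export cell: comma-separated, 6 decimals.
    return ",".join(f"{x:.6f}" for x in v.tolist())


def str_to_vec(s: str) -> np.ndarray:
    # Hypothetical inverse of vec_to_str.
    return np.array([float(t) for t in s.split(",")], dtype=np.float32)


toy = pd.DataFrame({
    "id": ["a", "b"],
    "vector_str": [
        vec_to_str(np.array([0.6, 0.8], dtype=np.float32)),
        vec_to_str(np.array([1.0, 0.0], dtype=np.float32)),
    ],
    "metadata": [json.dumps({"split": "train"}), json.dumps({"split": "val"})],
    "mpp": ["", ""],  # stands in for a backfilled, empty column
})

# Write to an in-memory buffer and read it back.
buf = io.StringIO()
toy.to_csv(buf, index=False)
buf.seek(0)
back = pd.read_csv(buf, keep_default_na=False)

vectors = np.stack([str_to_vec(s) for s in back["vector_str"]])
meta = [json.loads(m) for m in back["metadata"]]
print(vectors.shape, meta[0]["split"])
```

The six-decimal formatting means recovered vectors match the originals only to about `1e-6`, which is usually adequate for cosine-based layouts.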


## 9) Quick health checks (hierarchy depth, no unlabeled clusters)

These checks catch the failure modes you saw:

- unlabeled clusters (`""` / `NaN`)
- insufficient semantic depth (only 1–2 levels)
- hierarchy levels that collapse to a single value, leaving each dataset separated by raw label only


In [None]:
#@title 9) Validate hierarchy depth + diversity

hier_cols = [c for c in export_df.columns if c.startswith("cluster_l")]
print("Hierarchy columns:", hier_cols)

assert len(hier_cols) >= 6, "Need >5 ontology levels."

for c in ["cluster_l2","cluster_l3","cluster_l4","cluster_l5"]:
    nunique = export_df[c].nunique()
    print(c, "unique =", nunique)
    assert nunique >= 2, f"Expected multiple unique values for {c}"

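# Count NaN as well as empty strings: astype(str) alone would turn NaN into the non-blank string "nan".
blank_counts = {
    c: int((export_df[c].isna() | (export_df[c].astype(str).str.strip() == "")).sum())
    for c in hier_cols
}
print("Blank counts:", blank_counts)
assert not any(blank_counts.values()), "Unlabeled clusters found."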

display(export_df.groupby(["dataset_key","cluster_l2"]).size().sort_values(ascending=False).head(20))


## 10) Optional: Upload to Mantis via API

The Mantis developer docs require login, so the endpoint below is unverified and may differ for your deployment.  
The cell below is a **template**: set the correct URL and paste your token **at runtime**.

✅ Recommendation: store token in a Colab secret / environment variable rather than hard‑coding.
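
One way to follow that recommendation is sketched below. The helper name `get_mantis_token` is hypothetical, and the Colab Secrets branch assumes a secret named `MANTIS_TOKEN` has been created and shared with this notebook; outside Colab that branch is skipped.

```python
import os
from getpass import getpass


def get_mantis_token(env_var: str = "MANTIS_TOKEN", prompt: bool = False) -> str:
    """Look up the token without hard-coding it in the notebook."""
    # 1) Environment variable (same lookup as the upload cell).
    token = os.environ.get(env_var, "").strip()
    if token:
        return token
    # 2) Colab Secrets, if running inside Colab and the secret exists.
    try:
        from google.colab import userdata  # only importable in Colab
        token = (userdata.get(env_var) or "").strip()
    except Exception:
        token = ""
    if token:
        return token
    # 3) Optional hidden prompt; the value is not echoed.
    return getpass(f"{env_var}: ").strip() if prompt else ""
```

Running `os.environ["MANTIS_TOKEN"] = get_mantis_token(prompt=True)` before the upload cell makes the token available for the session without writing it into a saved cell.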


In [None]:
#@title 10) Upload template (fill in endpoint + token)

import os, requests

MANTIS_UPLOAD_URL = "https://mantisdev.csail.mit.edu/api/v1/upload"  # <-- update if needed

# Set MANTIS_TOKEN at runtime (Colab secret or environment variable); avoid saving the value in a notebook cell.
MANTIS_TOKEN = os.environ.get("MANTIS_TOKEN", "")

if not MANTIS_TOKEN:
    print("No token found in env var MANTIS_TOKEN. Set it before uploading.")
else:
    with open(EXPORT_CSV, "rb") as f:
        files = {"file": f}
        headers = {"Authorization": f"Bearer {MANTIS_TOKEN}"}
        resp = requests.post(MANTIS_UPLOAD_URL, headers=headers, files=files, timeout=120)
    print("Status:", resp.status_code)
    try:
        print(resp.json())
    except Exception:
        print(resp.text[:1000])
