# 🏥 Visual AI in Healthcare with FiftyOne - ARCADE Dataset Loading/Exploration
**Empowering medical imaging workflows with open-source tools and modern AI**

This notebook is part of the **“Visual AI in Healthcare with FiftyOne”** workshop. Through hands-on examples, we explore how to load, visualize, analyze, and enhance medical imaging datasets using state-of-the-art AI tools.

🔬 **What you’ll learn in this notebook:**

- How to **load and organize a multi-task medical imaging dataset** (ARCADE) using FiftyOne  
- How to **import COCO-style segmentation annotations** for both segmentation and stenosis detection tasks  
- How to **tag and enrich samples with metadata** for easier querying and filtering  
- How to **merge multiple subsets** into a single persistent FiftyOne dataset  
- How to **compute image embeddings** using FiftyOne Brain for exploratory analysis and visualization  
- How to **launch the FiftyOne App** for interactive dataset exploration
- How to **export your dataset** and **publish it to the Hugging Face Hub**

📚 **Part of the notebook series:**
1. `01_load_arcade_dataset.ipynb` – Load and visualize the ARCADE dataset.  
2. `02_load_deeplesion_balanced.ipynb` – Curate and balance the DeepLesion dataset.  
3. `03_vlms_analysis_arcade.ipynb` – Use vision foundation models (VFMs) such as NVLabs_CRADIOV3 for dataset understanding on ARCADE.  
4. `04_finetune_yolo8_stenosis.ipynb` – Train and integrate YOLOv8 for stenosis detection.  
5. `05_medsam2_ct_scan.ipynb` – Run MedSAM2 on CT scans for segmentation.  
6. `06_nvidia_vista_segmentation.ipynb` – Explore NVIDIA-VISTA-3D.  
7. `07_medgemma_vqa.ipynb` – Perform visual question answering and classification with MedGemma.

All notebooks are standalone but are best experienced sequentially.

### ✅ Requirements

Install the requirements for running this notebook:

In [None]:
#!pip install datasets fiftyone pandas torch torchvision umap-learn huggingface_hub

### 🗂️ Loading and Merging ARCADE Dataset Subsets into FiftyOne

The ARCADE challenge includes multiple datasets across two key tasks, **segmentation** and **stenosis detection**, spread across training, validation, and test phases. This section consolidates all subsets into a single FiftyOne dataset to simplify exploration and analysis. The dataset can be downloaded from [Zenodo](https://zenodo.org/records/10390295).

#### 🔍 What's happening here:

- Defines the paths and structure for the **segmentation** and **stenosis** datasets across training, validation, and final test phases.
- Initializes a new dataset called `"arcade_combined"` and deletes any prior version with the same name.
- **Adds custom fields** to the dataset schema:
  - `phase` – e.g., `phase_1`, `final_phase`
  - `task` – either `segmentation` or `stenosis`
  - `subset_name` – specific subset like `seg_train` or `sten_val`
- For each subset:
  - Loads annotations in COCO format using `COCODetectionDatasetImporter`
  - Creates a temporary dataset and populates it with images and labels
  - Assigns field values and adds sample-level tags
  - Merges the processed samples into the main dataset
  - Deletes the temporary dataset after merging

This structure allows for **highly flexible querying** (e.g., "show all stenosis test cases") and enables rich filtering and visualization workflows in FiftyOne.
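
As a rough sketch of such a query, the helper below selects the final-phase stenosis samples by their tags. It assumes the tags `stenosis` and `final_phase` were attached as in the loading cell of this notebook; the function name `stenosis_test_cases` is illustrative only.

```python
def stenosis_test_cases(dataset):
    """Return a view with samples tagged both "stenosis" and "final_phase".

    `dataset` is assumed to be the FiftyOne dataset built in this notebook.
    """
    # all=True keeps only samples that carry every listed tag
    return dataset.match_tags(["stenosis", "final_phase"], all=True)


# Usage, once the dataset exists:
# view = stenosis_test_cases(combined_dataset)
# print(view.count())
```

The same selection could also be expressed on the `task` and `phase` fields; tags are used here because they need no extra imports.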

This creates a unified dataset in FiftyOne where you can filter by phase, task type, or subset name, setting the foundation for structured medical image analysis and model evaluation.
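
For orientation, a COCO-style annotation file is a JSON object with `images`, `annotations`, and `categories` lists, where each annotation points to an image and a category and may carry a polygon under `segmentation`. The snippet below is a minimal sketch over a hand-written toy record (file name, IDs, and category names are made up, not taken from ARCADE) that counts annotations per category, which can serve as a quick sanity check before importing.

```python
import json
from collections import Counter

# Toy COCO-style record; all values are illustrative, not real ARCADE data
toy_coco = json.loads("""
{
  "images": [{"id": 1, "file_name": "1.png", "width": 512, "height": 512}],
  "categories": [{"id": 1, "name": "vessel_a"}, {"id": 2, "name": "vessel_b"}],
  "annotations": [
    {"id": 10, "image_id": 1, "category_id": 1,
     "bbox": [100, 120, 40, 30],
     "segmentation": [[100, 120, 140, 120, 140, 150, 100, 150]]},
    {"id": 11, "image_id": 1, "category_id": 2,
     "bbox": [200, 220, 20, 10],
     "segmentation": [[200, 220, 220, 220, 220, 230, 200, 230]]},
    {"id": 12, "image_id": 1, "category_id": 1,
     "bbox": [300, 300, 10, 10],
     "segmentation": [[300, 300, 310, 300, 310, 310, 300, 310]]}
  ]
}
""")


def annotations_per_category(coco: dict) -> Counter:
    """Count annotations by category name in a COCO-style dict."""
    id_to_name = {c["id"]: c["name"] for c in coco["categories"]}
    return Counter(id_to_name[a["category_id"]] for a in coco["annotations"])


counts = annotations_per_category(toy_coco)
print(counts)
```

For a real subset, `toy_coco` would be replaced by the parsed contents of that subset's annotation JSON.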


In [None]:
import os
import fiftyone as fo
from fiftyone.utils.coco import COCODetectionDatasetImporter

# Base path of the downloaded ARCADE data (adjust to your setup)
base_path = "/path/to/arcade_challenge_datasets"
combined_dataset_name = "arcade_combined"

# Dataset config
datasets = [
    ("seg_train", "dataset_phase_1/segmentation_dataset/seg_train", "phase_1", "segmentation"),
    ("seg_val", "dataset_phase_1/segmentation_dataset/seg_val", "phase_1", "segmentation"),
    ("sten_train", "dataset_phase_1/stenosis_dataset/sten_train", "phase_1", "stenosis"),
    ("sten_val", "dataset_phase_1/stenosis_dataset/sten_val", "phase_1", "stenosis"),
    ("test_case_seg", "dataset_final_phase/test_case_segmentation", "final_phase", "segmentation"),
    ("test_case_sten", "dataset_final_phase/test_cases_stenosis", "final_phase", "stenosis"),
]

# Delete existing dataset
if fo.dataset_exists(combined_dataset_name):
    fo.delete_dataset(combined_dataset_name)
combined_dataset = fo.Dataset(combined_dataset_name)

# Add metadata fields
combined_dataset.add_sample_field("phase", fo.StringField)
combined_dataset.add_sample_field("task", fo.StringField)
combined_dataset.add_sample_field("subset_name", fo.StringField)

# Load each dataset separately and assign fields
for subset_name, relative_path, phase, task in datasets:
    print(f"\n📦 Loading {subset_name}...")

    image_dir = os.path.join(base_path, relative_path, "images")
    annotation_dir = os.path.join(base_path, relative_path, "annotations")
    json_files = [f for f in os.listdir(annotation_dir) if f.endswith(".json")]
    assert len(json_files) == 1, f"Expected 1 JSON in {annotation_dir}, found {len(json_files)}"
    labels_path = os.path.join(annotation_dir, json_files[0])

    # Create temporary dataset
    temp_dataset_name = f"{combined_dataset_name}_{subset_name}"
    if fo.dataset_exists(temp_dataset_name):
        fo.delete_dataset(temp_dataset_name)
    temp_dataset = fo.Dataset(temp_dataset_name)

    importer = COCODetectionDatasetImporter(
        data_path=image_dir,
        labels_path=labels_path,
        label_types="segmentations",
        include_id=True,
        extra_attrs=True,
    )

    # Add to temp dataset
    temp_dataset.add_importer(importer)

    # Tag + assign fields
    for sample in temp_dataset:
        sample["phase"] = phase
        sample["task"] = task
        sample["subset_name"] = subset_name
        sample.tags.extend([subset_name, task, phase])
        sample.save()
        combined_dataset.add_sample(sample)

    # Delete temp dataset
    temp_dataset.delete()

print("\n✅ All subsets imported successfully!")

# Optional: launch FiftyOne app
# fo.launch_app(combined_dataset)


### 🧠 Computing and Visualizing Embeddings with FiftyOne Brain

Once the full ARCADE dataset has been assembled, this section applies **embedding generation and visualization** techniques using FiftyOne Brain. Embeddings are vector representations of images or patches that capture visual similarity and semantic content, enabling advanced tasks like clustering, outlier detection, and semantic search.
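
To make the idea of visual similarity concrete, here is a small sketch with made-up 3-dimensional vectors (real embeddings come from a model and have far more dimensions). It ranks candidates by cosine similarity to a query vector; all names and values are purely illustrative.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy "embeddings"; values are invented for illustration
embeddings = {
    "img_a": [1.0, 0.0, 0.0],
    "img_b": [0.9, 0.1, 0.0],
    "img_c": [0.0, 1.0, 0.0],
}
query = [1.0, 0.0, 0.1]

# Sort image names from most to least similar to the query
ranked = sorted(
    embeddings,
    key=lambda name: cosine_similarity(query, embeddings[name]),
    reverse=True,
)
print(ranked)
```

Semantic search and clustering build on this kind of comparison, applied to model-generated vectors instead of toy ones.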

#### 🗂 Step 1: Persist and Launch the App
- We make the dataset **persistent** so it can be reloaded later.
- The FiftyOne App is launched for interactive exploration of the full dataset.

In [None]:
combined_dataset.persistent = True
session = fo.launch_app(combined_dataset, port=5153, auto=False)

#### 🧠 Step 2: Compute Embeddings for Entire Images
- Uses FiftyOne Brain’s `compute_visualization()` to generate embeddings for each image with the Brain's default model (no `model` argument is passed).
- The embeddings are stored in the `"default_embedding"` sample field, and the visualization run is registered under the `"arcade_emb"` brain key.
- This allows for visualizing the full dataset in a **2D latent space** (UMAP by default; t-SNE and PCA are also available).
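
Once the visualization has been computed (next cell), the stored run can be loaded back by its brain key, for example in a later session. The helper below is a minimal sketch: `load_visualization` is a hypothetical name, and it assumes `dataset` is a FiftyOne dataset exposing `has_brain_run()` and `load_brain_results()`.

```python
def load_visualization(dataset, brain_key="arcade_emb"):
    """Load stored visualization results, or raise if the run is missing.

    `dataset` is assumed to be a FiftyOne dataset on which
    compute_visualization() was run with the given brain key.
    """
    if not dataset.has_brain_run(brain_key):
        raise KeyError(f"No brain run named {brain_key!r}; compute it first")
    return dataset.load_brain_results(brain_key)


# Usage, after compute_visualization() has finished:
# results = load_visualization(combined_dataset)
# print(results.points[:5])  # low-dimensional coordinates, one row per sample
```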

In [None]:
import torch
import fiftyone as fo
import fiftyone.brain as fob
import fiftyone.zoo as foz

# Optionally reload the persisted dataset in a fresh session
# combined_dataset = fo.load_dataset("arcade_combined")

# Compute embeddings for all samples
fob.compute_visualization(
    combined_dataset,
    embeddings="default_embedding",
    brain_key="arcade_emb"
)


In [None]:
print(combined_dataset)

In [None]:
combined_dataset.persistent = True

#### 🧩 Step 3: Compute Patch-Level Embeddings (Optional)
- Instead of entire images, we compute embeddings for **regions of interest**, such as segmentations.
- Specifies the patch field (`"segmentations"`) to extract ROIs.
- Loads a pretrained model (`mobilenet-v2-imagenet-torch`) from the FiftyOne Model Zoo.
- Computes embeddings for each patch and stores them under `"patches_embedding"`, with a new brain key `"patches_emb"`.

These visual embeddings enable deep visual exploration and analysis of ARCADE’s segmentation and stenosis data, surfacing patterns and anomalies that go beyond raw pixel inspection.
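
Before running the patch-level computation, it can be useful to know how many patches will be embedded, since every object in the field gets its own embedding. The sketch below assumes the `segmentations` field holds `fo.Detections` labels (objects with instance masks), so the individual objects live under `segmentations.detections`; `count_patches` is a hypothetical helper name.

```python
def count_patches(dataset, patches_field="segmentations"):
    """Return the total number of objects stored in a Detections field.

    Assumes `patches_field` holds `fo.Detections` labels, so the objects
    are found under the nested path "<field>.detections".
    """
    return dataset.count(f"{patches_field}.detections")


# Usage:
# print(count_patches(combined_dataset))
# patches_view = combined_dataset.to_patches("segmentations")  # one sample per object
```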

In [None]:
# Specify the field containing the patches (e.g., detections)
patches_field = "segmentations"

# Load a pre-trained model from the FiftyOne Model Zoo
model = foz.load_zoo_model("mobilenet-v2-imagenet-torch")

# Compute patch embeddings with the model and project them to 2D
results = fob.compute_visualization(
    combined_dataset,
    patches_field=patches_field,
    model=model,
    embeddings="patches_embedding",
    brain_key="patches_emb",
)



In [None]:
combined_dataset.reload()
print(combined_dataset)

### 🗄️ Exporting the Combined ARCADE Dataset and Publishing to Hugging Face 🤗

This final step packages the merged ARCADE dataset into a shareable format and uploads it to the Hugging Face Hub.

#### 🚚 Step 1: Clone the Dataset
We create a **clone** of the `arcade_combined` dataset to keep the export operation clean and isolated. This cloned dataset is named `arcade_combined_export`.



In [None]:
# Clone the dataset
cloned_dataset = combined_dataset.clone(name="arcade_combined_export")

#### 💾 Step 2: Export to Local Directory
Using the `FiftyOneDataset` format, the cloned dataset is exported to a directory (`./arcade_combined_fiftyone`). This includes:
- Image files
- Samples and their labels (including the `"segmentations"` field), serialized in FiftyOne's own JSON format rather than as COCO files
- Dataset metadata

The `overwrite=True` flag ensures any existing content in the folder is safely replaced.
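
After the export cell has run, a quick check of the output directory can catch incomplete exports. This sketch assumes the layout described in the FiftyOne documentation for this format (a `metadata.json`, a `samples.json`, and a `data/` folder); the list of expected entries and the helper name are assumptions, so adjust them if your version writes something different.

```python
from pathlib import Path

# Entries assumed to be present in a FiftyOneDataset export
EXPECTED_ENTRIES = ("metadata.json", "samples.json", "data")


def missing_export_entries(export_dir, expected=EXPECTED_ENTRIES):
    """Return the expected entries that are absent from `export_dir`."""
    root = Path(export_dir)
    return [name for name in expected if not (root / name).exists()]


# Usage, after the export has run:
# print(missing_export_entries("./arcade_combined_fiftyone"))
```

An empty list means all assumed entries were found.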


In [None]:
# Set export directory
export_dir = "./arcade_combined_fiftyone"

# Export in FiftyOne format
cloned_dataset.export(
    export_dir=export_dir,
    dataset_type=fo.types.FiftyOneDataset,
    label_field="segmentations",  # or other label field you want to export
    overwrite=True,
)

#### 🤗 Step 3: Push to Hugging Face Hub
With just one line of code, we use `push_to_hub()` to upload the dataset directly to the Hugging Face Hub. The dataset will appear under the user or organization space defined by the credentials used.

Publishing to Hugging Face allows others to:
- **Easily reuse your dataset** in notebooks, apps, or models
- **View it through the FiftyOne Web App integration**
- **Promote reproducibility and collaboration** in open-source healthcare AI research
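
Once the dataset has been pushed (next cell), others could load it back roughly as follows. This is a sketch under assumptions: the owner name is a placeholder, the repository is assumed to be named `arcade_fiftyone` as in the push call, and the helper names are illustrative. It relies on `load_from_hub` from `fiftyone.utils.huggingface`.

```python
def hub_repo_id(owner: str, repo_name: str = "arcade_fiftyone") -> str:
    """Build the "<owner>/<repo_name>" identifier used on the Hugging Face Hub."""
    return f"{owner}/{repo_name}"


def load_arcade_from_hub(owner: str):
    """Download the published dataset and load it as a FiftyOne dataset."""
    # Imported lazily so hub_repo_id() works without fiftyone installed
    from fiftyone.utils.huggingface import load_from_hub

    return load_from_hub(hub_repo_id(owner))


# Usage ("your-username" is a placeholder):
# dataset = load_arcade_from_hub("your-username")
```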

In [None]:
from fiftyone.utils.huggingface import push_to_hub

push_to_hub(cloned_dataset, "arcade_fiftyone")