# 🏥 Visual AI in Healthcare with FiftyOne - ARCADE Dataset Loading/Exploration
**Empowering medical imaging workflows with open-source tools and modern AI**

This notebook is part of the **“Visual AI in Healthcare with FiftyOne”** workshop. Through hands-on examples, we explore how to load, visualize, analyze, and enhance medical imaging datasets using state-of-the-art AI tools.

🔬 **What you’ll learn in this notebook:**

- How to **load and organize a multi-task medical imaging dataset** (ARCADE) using FiftyOne  
- How to **import COCO-style segmentation annotations** for both segmentation and stenosis detection tasks  
- How to **tag and enrich samples with metadata** for easier querying and filtering  
- How to **merge multiple subsets** into a single persistent FiftyOne dataset  
- How to **compute image embeddings** using FiftyOne Brain for exploratory analysis and visualization  
- How to **launch the FiftyOne App** for interactive dataset exploration
- How to **export your dataset** and **push it to the Hugging Face Hub**

📚 **Part of the notebook series:**
1. `01_load_arcade_dataset.ipynb` – Load and visualize the ARCADE dataset.  
2. `02_load_deeplesion_balanced.ipynb` – Curate and balance the DeepLesion dataset.  
3. `03_vlms_analysis_arcade.ipynb` – Use VFMs like NVLabs_CRADIOV3 for dataset understanding on ARCADE.  
4. `04_finetune_yolo8_stenosis.ipynb` – Train and integrate YOLOv8 for stenosis detection.  
5. `05_medsam2_ct_scan.ipynb` – Run MedSAM2 on CT scans for segmentation.  
6. `06_nvidia_vista_segmentation.ipynb` – Explore NVIDIA-VISTA-3D.  
7. `07_medgemma_vqa.ipynb` – Perform visual question answering and classification with MedGemma.

All notebooks are standalone but are best experienced sequentially.

### ✅ Requirements

Install the packages required to run this notebook:

In [None]:
#!pip install datasets fiftyone pandas torch torchvision umap-learn

### 🗂️ Loading and Merging ARCADE Dataset Subsets into FiftyOne

The ARCADE challenge includes multiple datasets across two key tasks, **segmentation** and **stenosis detection**, spread across training, validation, and test phases. This section consolidates all subsets into a single FiftyOne dataset to simplify exploration and analysis. Download the dataset [here](https://zenodo.org/api/records/8386059/files-archive) and extract it into the root of this repository, in a folder named `arcade_challenge_datasets`.
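Before running the loading cell, it can help to confirm that the archive was extracted where the notebook expects it. The following is a minimal sketch, not part of the workshop code: `find_layout_problems` is a hypothetical helper, and the subset folders are copied from the loading cell in this notebook. It only checks that each subset has an `images/` folder and exactly one JSON file under `annotations/`.

```python
from pathlib import Path

# Relative subset folders, copied from the loading cell in this notebook
SUBSET_DIRS = [
    "dataset_phase_1/segmentation_dataset/seg_train",
    "dataset_phase_1/segmentation_dataset/seg_val",
    "dataset_phase_1/stenosis_dataset/sten_train",
    "dataset_phase_1/stenosis_dataset/sten_val",
    "dataset_final_phase/test_case_segmentation",
    "dataset_final_phase/test_cases_stenosis",
]


def find_layout_problems(base_path, subset_dirs=SUBSET_DIRS):
    """Return a list of human-readable problems (empty if the layout looks right)."""
    problems = []
    for rel in subset_dirs:
        root = Path(base_path) / rel
        if not (root / "images").is_dir():
            problems.append(f"{rel}: missing images/ folder")
        annotations = root / "annotations"
        n_json = len(list(annotations.glob("*.json"))) if annotations.is_dir() else 0
        if n_json != 1:
            problems.append(f"{rel}: expected 1 annotation JSON, found {n_json}")
    return problems


# Prints nothing when every subset folder is in place
for problem in find_layout_problems("arcade_challenge_datasets"):
    print(problem)
```

An empty result means the loading cell's `assert` on the number of JSON files should pass for every subset.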

#### 🔍 What's happening here:

- Defines the paths and structure for the **segmentation** and **stenosis** datasets across training, validation, and final test phases.
- Initializes a new dataset called `"arcade_combined"` and deletes any prior version with the same name.
- **Adds custom fields** to the dataset schema:
  - `phase` – e.g., `phase_1`, `final_phase`
  - `task` – either `segmentation` or `stenosis`
  - `subset_name` – specific subset like `seg_train` or `sten_val`
- For each subset:
  - Loads annotations in COCO format using `COCODetectionDatasetImporter`
  - Creates a temporary dataset and populates it with images and labels
  - Assigns field values and adds sample-level tags
  - Merges the processed samples into the main dataset
  - Deletes the temporary dataset after merging
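
For readers unfamiliar with the COCO layout that the importer consumes, here is a small sketch that summarizes a COCO-style dictionary using only the standard library. The `summarize_coco` helper and the toy data (including the category names) are made up for illustration and are not ARCADE annotations; the sketch assumes the usual top-level `images`, `annotations`, and `categories` keys.

```python
import json
from collections import Counter


def summarize_coco(coco):
    """Count images, annotations, and annotations per category name."""
    id_to_name = {c["id"]: c["name"] for c in coco.get("categories", [])}
    per_category = Counter(
        id_to_name.get(a["category_id"], str(a["category_id"]))
        for a in coco.get("annotations", [])
    )
    return {
        "num_images": len(coco.get("images", [])),
        "num_annotations": len(coco.get("annotations", [])),
        "per_category": dict(per_category),
    }


# Tiny hand-made example (hypothetical values, not ARCADE data)
toy = {
    "images": [{"id": 1, "file_name": "1.png", "width": 512, "height": 512}],
    "categories": [{"id": 1, "name": "vessel_a"}, {"id": 2, "name": "vessel_b"}],
    "annotations": [
        {"id": 10, "image_id": 1, "category_id": 1,
         "bbox": [100, 120, 40, 60],
         "segmentation": [[100, 120, 140, 120, 140, 180, 100, 180]]},
        {"id": 11, "image_id": 1, "category_id": 2,
         "bbox": [200, 220, 30, 30],
         "segmentation": [[200, 220, 230, 220, 230, 250, 200, 250]]},
        {"id": 12, "image_id": 1, "category_id": 2,
         "bbox": [300, 300, 20, 50],
         "segmentation": [[300, 300, 320, 300, 320, 350, 300, 350]]},
    ],
}
print(summarize_coco(toy))

# To inspect a real file instead:
# with open(labels_path) as f:
#     print(summarize_coco(json.load(f)))
```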

Storing `phase`, `task`, and `subset_name` on every sample gives you a unified dataset that you can filter by phase, task type, or subset name (e.g., "show all stenosis test cases"), which sets the foundation for structured medical image analysis and model evaluation in FiftyOne.


In [1]:
import os
import fiftyone as fo
from fiftyone.utils.coco import COCODetectionDatasetImporter

# Base path
base_path = "arcade_challenge_datasets"
combined_dataset_name = "arcade_combined"

# Dataset config
datasets = [
    ("seg_train", "dataset_phase_1/segmentation_dataset/seg_train", "phase_1", "segmentation"),
    ("seg_val", "dataset_phase_1/segmentation_dataset/seg_val", "phase_1", "segmentation"),
    ("sten_train", "dataset_phase_1/stenosis_dataset/sten_train", "phase_1", "stenosis"),
    ("sten_val", "dataset_phase_1/stenosis_dataset/sten_val", "phase_1", "stenosis"),
    ("test_case_seg", "dataset_final_phase/test_case_segmentation", "final_phase", "segmentation"),
    ("test_case_sten", "dataset_final_phase/test_cases_stenosis", "final_phase", "stenosis"),
]

# Delete existing dataset
if fo.dataset_exists(combined_dataset_name):
    fo.delete_dataset(combined_dataset_name)
combined_dataset = fo.Dataset(combined_dataset_name)

# Add metadata fields
combined_dataset.add_sample_field("phase", fo.StringField)
combined_dataset.add_sample_field("task", fo.StringField)
combined_dataset.add_sample_field("subset_name", fo.StringField)

# Load each dataset separately and assign fields
for subset_name, relative_path, phase, task in datasets:
    print(f"\n📦 Loading {subset_name}...")

    image_dir = os.path.join(base_path, relative_path, "images")
    annotation_dir = os.path.join(base_path, relative_path, "annotations")
    json_files = [f for f in os.listdir(annotation_dir) if f.endswith(".json")]
    assert len(json_files) == 1, f"Expected 1 JSON in {annotation_dir}, found {len(json_files)}"
    labels_path = os.path.join(annotation_dir, json_files[0])

    # Create temporary dataset
    temp_dataset_name = f"{combined_dataset_name}_{subset_name}"
    if fo.dataset_exists(temp_dataset_name):
        fo.delete_dataset(temp_dataset_name)
    temp_dataset = fo.Dataset(temp_dataset_name)

    importer = COCODetectionDatasetImporter(
        data_path=image_dir,
        labels_path=labels_path,
        label_types="segmentations",
        
        include_id=True,
        extra_attrs=True,
    )

    # Add to temp dataset
    temp_dataset.add_importer(importer)

    # Tag + assign fields
    for sample in temp_dataset:
        sample["phase"] = phase
        sample["task"] = task
        sample["subset_name"] = subset_name
        sample.tags.extend([subset_name, task, phase])
        sample.save()
        combined_dataset.add_sample(sample)

    # Delete temp dataset
    temp_dataset.delete()

print("\n✅ All subsets imported successfully!")

# Optional: launch FiftyOne app
# fo.launch_app(combined_dataset)





📦 Loading seg_train...
 100% |███████████████| 1000/1000 [3.2s elapsed, 0s remaining, 316.6 samples/s]

📦 Loading seg_val...
 100% |█████████████████| 200/200 [688.3ms elapsed, 0s remaining, 290.6 samples/s]

📦 Loading sten_train...
 100% |███████████████| 1000/1000 [1.3s elapsed, 0s remaining, 757.7 samples/s]

📦 Loading sten_val...
 100% |█████████████████| 200/200 [332.4ms elapsed, 0s remaining, 601.7 samples/s]

📦 Loading test_case_seg...
 100% |█████████████████| 300/300 [1.1s elapsed, 0s remaining, 272.4 samples/s]

📦 Loading test_case_sten...
 100% |█████████████████| 300/300 [374.6ms elapsed, 0s remaining, 802.5 samples/s]

✅ All subsets imported successfully!


### 🧠 Computing and Visualizing Embeddings with FiftyOne Brain

Once the full ARCADE dataset has been assembled, this section applies **embedding generation and visualization** techniques using FiftyOne Brain. Embeddings are vector representations of images or patches that capture visual similarity and semantic content, enabling advanced tasks like clustering, outlier detection, and semantic search.
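
To make "visual similarity" concrete, here is a minimal sketch of nearest-neighbor search over embeddings using cosine similarity. The toy vectors and the `most_similar` helper are invented for illustration; real embeddings have hundreds or thousands of dimensions, and FiftyOne Brain provides its own similarity tooling that this does not reproduce.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def most_similar(query_id, embeddings, k=2):
    """Return the ids of the k embeddings closest to the query, best first."""
    query = embeddings[query_id]
    scored = [
        (cosine_similarity(query, vec), other)
        for other, vec in embeddings.items()
        if other != query_id
    ]
    scored.sort(reverse=True)
    return [other for _, other in scored[:k]]


# Hypothetical 3-dimensional embeddings for four images
toy = {
    "img_a": [1.0, 0.0, 0.0],
    "img_b": [0.9, 0.1, 0.0],
    "img_c": [0.0, 1.0, 0.0],
    "img_d": [0.0, 0.9, 0.1],
}
print(most_similar("img_a", toy, k=1))  # ['img_b']
```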

#### 🗂 Step 1: Persist and Launch the App
- We make the dataset **persistent** so it can be reloaded later.
- The FiftyOne App is launched for interactive exploration of the full dataset.

In [2]:
combined_dataset.persistent=True
session = fo.launch_app(combined_dataset, port=5151, auto=False)

Session launched. Run `session.show()` to open the App in a cell output.


#### 🧠 Step 2: Compute Embeddings for Entire Images
- Uses FiftyOne Brain’s `compute_visualization()` to generate **default embeddings** for each image.
- The embeddings are stored in the `"default_embedding"` sample field, and the visualization results are registered under the `"arcade_emb"` brain key.
- This allows for visualizing the full dataset in a **2D latent space** (UMAP is used in the run shown below; t-SNE is another common projection).
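
The idea of projecting high-dimensional embeddings down to two plottable coordinates can be sketched with plain NumPy. Note that this sketch swaps in PCA for UMAP purely to stay dependency-light, so the resulting layout will differ from what the notebook produces; the `pca_2d` helper and the random toy data are assumptions for illustration.

```python
import numpy as np


def pca_2d(embeddings):
    """Project row-vector embeddings onto their top two principal components."""
    X = np.asarray(embeddings, dtype=float)
    X = X - X.mean(axis=0)  # center each dimension
    # Rows of Vt are principal directions, sorted by decreasing variance
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    return X @ Vt[:2].T


# 100 hypothetical 16-dimensional embeddings
rng = np.random.default_rng(0)
toy_embeddings = rng.normal(size=(100, 16))
points = pca_2d(toy_embeddings)
print(points.shape)  # (100, 2)
```

Each row of `points` is one sample's 2D position, which is the kind of output the embeddings panel in the App plots.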

In [3]:
import torch
import fiftyone as fo
import fiftyone.brain as fob
import fiftyone.zoo as foz

# To resume in a new session, reload the persistent dataset instead:
# combined_dataset = fo.load_dataset("arcade_combined")

# Compute embeddings for all samples
fob.compute_visualization(
    combined_dataset,
    embeddings="default_embedding",
    brain_key="arcade_emb"
)


Computing embeddings...
 100% |███████████████| 3000/3000 [4.8m elapsed, 0s remaining, 10.8 samples/s]
Generating visualization...
UMAP( verbose=True)
Tue Jul 15 16:31:26 2025 Construct fuzzy simplicial set
Tue Jul 15 16:31:29 2025 Finding Nearest Neighbors
Tue Jul 15 16:31:30 2025 Finished Nearest Neighbor Search
Tue Jul 15 16:31:31 2025 Construct embedding


	completed  0  /  500 epochs
	completed  50  /  500 epochs
	completed  100  /  500 epochs
	completed  150  /  500 epochs
	completed  200  /  500 epochs
	completed  250  /  500 epochs
	completed  300  /  500 epochs
	completed  350  /  500 epochs
	completed  400  /  500 epochs
	completed  450  /  500 epochs
Epochs completed: 100%| ██████████ 500/500 [00:01]
Tue Jul 15 16:31:32 2025 Finished embedding


<fiftyone.brain.visualization.VisualizationResults at 0x3342a3250>

In [4]:
print(combined_dataset)

Name:        arcade_combined
Media type:  image
Num samples: 3000
Persistent:  True
Tags:        []
Sample fields:
    id:                fiftyone.core.fields.ObjectIdField
    filepath:          fiftyone.core.fields.StringField
    tags:              fiftyone.core.fields.ListField(fiftyone.core.fields.StringField)
    metadata:          fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.metadata.ImageMetadata)
    created_at:        fiftyone.core.fields.DateTimeField
    last_modified_at:  fiftyone.core.fields.DateTimeField
    phase:             fiftyone.core.fields.StringField
    task:              fiftyone.core.fields.StringField
    subset_name:       fiftyone.core.fields.StringField
    segmentations:     fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.labels.Detections)
    coco_id:           fiftyone.core.fields.IntField
    default_embedding: fiftyone.core.fields.VectorField


In [5]:
combined_dataset.persistent=True

#### 🧩 Step 3: Compute Patch-Level Embeddings (Optional)
- Instead of entire images, we compute embeddings for **regions of interest**, such as segmentations.
- Specifies the patch field (`"segmentations"`) to extract ROIs.
- Loads a pretrained model (`mobilenet-v2-imagenet-torch`) from the FiftyOne Model Zoo.
- Computes an embedding for each patch and stores it in the `"patches_embedding"` attribute of that patch, then builds a visualization under a new brain key, `"segmentations_patch_embeddings"`.

These patch embeddings let you explore ARCADE’s segmentation and stenosis annotations by visual similarity, which can surface clusters and outliers that are hard to spot by inspecting images one at a time.
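
A patch is the image region covered by one detection. FiftyOne stores each detection's box as relative `[top-left-x, top-left-y, width, height]` values in `[0, 1]`, so cropping requires converting to pixel coordinates first. The sketch below shows that arithmetic only; `relative_box_to_pixels` is a hypothetical helper, the 512×512 image size is an assumed example, and this is not FiftyOne's internal cropping code.

```python
def relative_box_to_pixels(bounding_box, img_width, img_height):
    """Convert a relative [x, y, w, h] box to clamped pixel corners (x0, y0, x1, y1)."""
    x, y, w, h = bounding_box
    x0 = max(0, int(round(x * img_width)))
    y0 = max(0, int(round(y * img_height)))
    x1 = min(img_width, int(round((x + w) * img_width)))
    y1 = min(img_height, int(round((y + h) * img_height)))
    return x0, y0, x1, y1


# A box starting a quarter of the way across and halfway down the image
print(relative_box_to_pixels([0.25, 0.5, 0.5, 0.25], 512, 512))
# (128, 256, 384, 384)
```

With an image loaded as an array, the patch would then be `image[y0:y1, x0:x1]`.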

In [6]:
# Specify the field containing the patches (e.g., detections)
patches_field = "segmentations"

# Load a pretrained model from the FiftyOne Model Zoo
model = foz.load_zoo_model("mobilenet-v2-imagenet-torch")

# Compute one embedding per patch and store it on each detection
combined_dataset.compute_patch_embeddings(
    model=model,
    patches_field=patches_field,
    embeddings_field="patches_embedding",
    num_workers=0,
)



 100% |███████████████| 3000/3000 [12.8m elapsed, 0s remaining, 12.2 samples/s]


In [7]:
fob.compute_visualization(
    combined_dataset,
    patches_field="segmentations",              # 👈 your field
    embeddings="patches_embedding",             # 👈 patch embedding field
    brain_key="segmentations_patch_embeddings", # 👈 name for this embedding run
    num_workers=0
)

Generating visualization...
UMAP( verbose=True)
Tue Jul 15 17:03:45 2025 Construct fuzzy simplicial set
Tue Jul 15 17:03:45 2025 Finding Nearest Neighbors
Tue Jul 15 17:03:45 2025 Building RP forest with 10 trees
Tue Jul 15 17:03:47 2025 NN descent for 13 iterations
	 1  /  13
	 2  /  13
	 3  /  13
	 4  /  13
	Stopping threshold met -- exiting after 4 iterations
Tue Jul 15 17:03:51 2025 Finished Nearest Neighbor Search
Tue Jul 15 17:03:51 2025 Construct embedding


	completed  0  /  200 epochs
	completed  20  /  200 epochs
	completed  40  /  200 epochs
	completed  60  /  200 epochs
	completed  80  /  200 epochs
	completed  100  /  200 epochs
	completed  120  /  200 epochs
	completed  140  /  200 epochs
	completed  160  /  200 epochs
	completed  180  /  200 epochs
Epochs completed: 100%| ██████████ 200/200 [00:00]


Tue Jul 15 17:03:52 2025 Finished embedding


<fiftyone.brain.visualization.VisualizationResults at 0x3344ddab0>

In [20]:
combined_dataset.reload()
print(combined_dataset)

sample = combined_dataset.first()
for detection in sample["segmentations"].detections:
    print(detection.label, detection.patches_embedding)

Name:        arcade_combined
Media type:  image
Num samples: 3000
Persistent:  True
Tags:        []
Sample fields:
    id:                fiftyone.core.fields.ObjectIdField
    filepath:          fiftyone.core.fields.StringField
    tags:              fiftyone.core.fields.ListField(fiftyone.core.fields.StringField)
    metadata:          fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.metadata.ImageMetadata)
    created_at:        fiftyone.core.fields.DateTimeField
    last_modified_at:  fiftyone.core.fields.DateTimeField
    phase:             fiftyone.core.fields.StringField
    task:              fiftyone.core.fields.StringField
    subset_name:       fiftyone.core.fields.StringField
    segmentations:     fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.labels.Detections)
    coco_id:           fiftyone.core.fields.IntField
    default_embedding: fiftyone.core.fields.VectorField
16 [0.         0.         0.06283244 ... 0.         0.         0.        ]
3 [0.       

### 🗄️ Exporting the Combined ARCADE Dataset and Publishing to Hugging Face 🤗

This final step packages the merged ARCADE dataset into a shareable format and uploads it to the Hugging Face Hub.

#### 🚚 Step 1: Clone the Dataset
We create a **clone** of the `arcade_combined` dataset to keep the export operation clean and isolated. This cloned dataset is named `arcade_combined_fiftyone`.



In [9]:
# Clone the dataset
cloned_dataset = combined_dataset.clone(name="arcade_combined_fiftyone")

#### 💾 Step 2: Export to Local Directory
Using the `FiftyOneDataset` format, the cloned dataset is exported to a directory (`./arcade_combined_fiftyone`). This includes:
- Image files
- Sample fields and labels (including the `"segmentations"` field), serialized in FiftyOne's own JSON format rather than as COCO files
- Dataset metadata

The `overwrite=True` flag deletes any existing content in the export directory before writing, so point it only at a folder you can afford to replace.


In [10]:
# Set export directory
export_dir = "./arcade_combined_fiftyone"

# Export in FiftyOne format
cloned_dataset.export(
    export_dir=export_dir,
    dataset_type=fo.types.FiftyOneDataset,
    label_field="segmentations",  # or other label field you want to export
    overwrite=True,
)

Exporting samples...
 100% |██████████████████| 3000/3000 [1.3s elapsed, 0s remaining, 2.3K docs/s]


#### 🤗 Step 3: Push to Hugging Face Hub
With a single call to `push_to_hub()`, we upload the dataset directly to the Hugging Face Hub. The dataset will appear under the user or organization space defined by the credentials used, so make sure you are authenticated with Hugging Face (for example, via `huggingface-cli login`) before running the cell.

Publishing to Hugging Face allows others to:
- **Easily reuse your dataset** in notebooks, apps, or models
- **View it through the FiftyOne Web App integration**
- **Promote reproducibility and collaboration** in open-source healthcare AI research

In [11]:
from fiftyone.utils.huggingface import push_to_hub

push_to_hub(cloned_dataset, "arcade_fiftyone")

Directory '/var/folders/6y/g2mslh_s7fz7qtj9vrxntqtm0000gn/T/tmpwjqsg_uf' already exists; export will be merged with existing files
Exporting samples...
 100% |██████████████████| 3000/3000 [1.1s elapsed, 0s remaining, 2.9K docs/s]


Uploading media files in 2 batches of size 3000:   0%|          | 0/2 [00:00<?, ?it/s]It seems you are trying to upload a large folder at once. This might take some time and then fail if the folder is too large. For such cases, it is recommended to upload in smaller batches or to use `HfApi().upload_large_folder(...)`/`huggingface-cli upload-large-folder` instead. For more details, check out https://huggingface.co/docs/huggingface_hub/main/en/guides/upload#upload-a-large-folder.
Uploading media files in 2 batches of size 3000:  50%|█████     | 1/2 [02:37<02:37, 157.61s/it]No files have been modified since last commit. Skipping to prevent empty commit.
Uploading media files in 2 batches of size 3000: 100%|██████████| 2/2 [02:37<00:00, 78.88s/it]
