# 🏥 Visual AI in Healthcare with FiftyOne - DeepLesion Exploration/Quering/Filtering
**Empowering medical imaging workflows with open-source tools and modern AI**

This notebook is part of the **“Visual AI in Healthcare with FiftyOne”** workshop. Through hands-on examples, we explore how to load, visualize, analyze, and enhance medical imaging datasets using state-of-the-art AI tools.

🔬 **What you’ll learn in this notebook:**

- How to **load the DeepLesion dataset** using the Hugging Face `datasets` library  
- How to **explore and assess the quality** of medical imaging data using FiftyOne Brain  
- How to **create balanced subsets** for training or evaluation  
- How to **push curated datasets** to the Hugging Face Hub for sharing and reuse  
- How to **launch the FiftyOne App** for interactive analysis and visualization

📚 **Part of the notebook series:**
1. `01_load_arcade_dataset.ipynb` – Load and visualize the ARCADE dataset.  
2. `02_load_deeplesion_balanced.ipynb` – Curate and balance the DeepLesion dataset.  
3. `03_vlms_analysis_arcade.ipynb` – Use VFMs like NVLabs_CRADIOV3 in dataset undersatnding for ARCADE. 
4. `04_finetune_yolo8_stenosis.ipynb` – Train and integrate YOLOv8 for stenosis detection.  
5. `05_medsam2_ct_scan.ipynb` – Run MedSAM2 on CT scans for segmentation.  
6. `06_nvidia_vista_segmentation.ipynb` – Explore NVIDIA-VISTA-3D.  
7. `07_medgemma_vqa.ipynb` – Perform visual question answering and classification with MedGemma.

All notebooks are standalone but are best experienced sequentially.

### ✅ Requirements

Please install all the requeriments for running this notebook

In [None]:
!pip install datasets fiftyone pandas

### 📥 Import `load_dataset` from Hugging Face to access medical imaging datasets

In [1]:
from datasets import load_dataset

ds = load_dataset("farrell236/DeepLesion")


  from .autonotebook import tqdm as notebook_tqdm
Generating train split: 100%|██████████| 32735/32735 [00:00<00:00, 243998.32 examples/s]


### 📊 Import pandas for data manipulation (if needed)

In [2]:
import pandas as pd

df = pd.DataFrame(ds["train"])

### 🔧 Install the Hugging Face Hub / Using Repo scripts for dataset downloading

In [None]:
pip install huggingface_hub
git lfs install
git clone https://huggingface.co/datasets/farrell236/DeepLesion
# GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/datasets/farrell236/DeepLesion
cd DeepLesion/scripts
python batch_download_zips.py


### 📂 Import zipfile to handle compressed data formats

In [5]:
import zipfile
import os
from glob import glob
from tqdm import tqdm

# Set your zip folder and destination
zip_folder = "./DeepLesion/Images_zip"
extract_dir = "./DeepLesion/deeplesion_images"

os.makedirs(extract_dir, exist_ok=True)

# Loop through all zip files
zip_files = glob(os.path.join(zip_folder, "*.zip"))

for zip_path in tqdm(zip_files, desc="Extracting and deleting zip files"):
    try:
        with zipfile.ZipFile(zip_path, 'r') as zip_ref:
            zip_ref.extractall(extract_dir)
        os.remove(zip_path)  # ✅ Delete zip file after successful extraction
    except Exception as e:
        print(f"Failed to process {zip_path}: {e}")


Extracting and deleting zip files: 100%|██████████| 56/56 [09:50<00:00, 10.54s/it]


### 🏗 Preprocessing DeepLesion CT Scans with DICOM Windowing

This section performs intensity normalization on CT scan slices from the DeepLesion dataset. Each image is originally stored as a 16-bit PNG representing Hounsfield Units (HU). To prepare the data for downstream tasks (e.g., visualization, model training), we apply the following preprocessing steps:

- **Read DICOM metadata** from the DeepLesion CSV file, including windowing parameters for each scan.
- **Clip and normalize intensities** to an 8-bit range \[0–255\] using DICOM windowing (typically -150 to 250 HU).
- **Preserve folder structure**, writing the processed images into a new directory (`Images_png_wn`) with one subfolder per series.
- **Convert all slices in each scan**, transforming them from raw HU values to normalized images suitable for machine learning.

This ensures all CT slices are visually and numerically consistent, facilitating better model performance and human interpretability.


In [23]:
import os
import cv2
import numpy as np
import pandas as pd
from glob import glob
from tqdm import tqdm

# Paths
dir_in = "./DeepLesion/deeplesion_images/Images_png"
dir_out = "./DeepLesion/deeplesion_images/Images_png_wn"
info_fn = "./DeepLesion/DL_info.csv"

# Ensure output folder exists
os.makedirs(dir_out, exist_ok=True)

# Load metadata
dl_info = pd.read_csv(info_fn)

def clip_and_normalize(np_image: np.ndarray, clip_min: int = -150, clip_max: int = 250) -> np.ndarray:
    """Apply intensity windowing and scale to 8-bit"""
    np_image = np.clip(np_image, clip_min, clip_max)
    np_image = (np_image - clip_min) / (clip_max - clip_min)
    return np_image

# Track folders already processed to avoid repeat
created_dirs = set()

for idx, row in tqdm(dl_info.iterrows(), total=len(dl_info)):
    folder = row['File_name'].rsplit('_', 1)[0]  # e.g. 000001_01_01
    subdir_in = os.path.join(dir_in, folder)
    subdir_out = os.path.join(dir_out, folder)

    if not os.path.exists(subdir_out):
        os.makedirs(subdir_out, exist_ok=True)

    # Get DICOM window values
    try:
        DICOM_windows = [float(v.strip()) for v in row['DICOM_windows'].split(',')]
        clip_min, clip_max = DICOM_windows[0], DICOM_windows[1]
    except:
        clip_min, clip_max = -150, 250  # default if broken
        print(f"Invalid DICOM window for {row['File_name']}, using default")

    # Process all slices in folder
    images = sorted(glob(os.path.join(subdir_in, "*.png")))
    for im in images:
        try:
            image = cv2.imread(im, cv2.IMREAD_UNCHANGED)
            image = image.astype("int32") - 32768  # Convert to Hounsfield Units
            image = clip_and_normalize(image, clip_min, clip_max)
            image = (image * 255).astype("uint8")

            out_path = os.path.join(subdir_out, os.path.basename(im))
            cv2.imwrite(out_path, image)

        except Exception as e:
            print(f"Failed to convert {im}: {e}")
            continue

100%|██████████| 32735/32735 [4:54:00<00:00,  1.86it/s]   


### 🧬 Constructing a FiftyOne Dataset from DeepLesion Metadata

In this section, we build a structured `fiftyone.Dataset` using the DeepLesion image directory and accompanying metadata from `DL_info.csv`.

Each row in the metadata corresponds to a CT slice with rich clinical context, including:
- **Patient, study, and series identifiers**
- **Bounding boxes** for lesions (if available)
- **DICOM metadata** such as windowing, slice range, and spacing
- **Patient-level info** like age and gender

The workflow:
- Locates each image in a subfolder by parsing its filename
- Applies any available bounding box annotations, normalized to the \[0, 1\] range (FiftyOne format)
- Embeds image-level metadata and tags into each `fo.Sample`
- Adds the samples to a new dataset called `"deeplesion_wn"` (window-normalized version)

This enables structured, queryable exploration of DeepLesion with FiftyOne, ready for downstream tasks like model training, evaluation, and visualization.


In [24]:
import os
import pandas as pd
import fiftyone as fo
from fiftyone.core.metadata import ImageMetadata
#from datasets import load_dataset
import pandas as pd

csv_path = "./DeepLesion/DL_info.csv"
df = pd.read_csv(csv_path)

# Load metadata
#ds = load_dataset("farrell236/DeepLesion")
#df = pd.DataFrame(ds["train"])

# Root path to nested folders
image_root = "./DeepLesion/deeplesion_images/Images_png_wn"

# Init FiftyOne dataset
dataset = fo.Dataset("deeplesion_wn")

for _, row in df.iterrows():
    file_name = row["File_name"]  # e.g. 000001_01_01_109.png
    parts = file_name.split("_")

    # Subfolder = 000001_01_01
    subfolder = "_".join(parts[:3])

    # Slice file = 109
    slice_part = parts[3].replace(".png", "")

    # Full image path
    img_path = os.path.join(image_root, subfolder, f"{slice_part}.png")

    if not os.path.exists(img_path):
        print(f"Missing: {img_path}")
    else:
        print(f"Found: {img_path}")
    if not os.path.exists(img_path):
        print(f"Missing: {img_path}")
        continue

    # Parse bounding box if available
    try:
        bbox = list(map(float, row["Bounding_boxes"].split(", ")))
        x, y, w, h = bbox[0], bbox[1], bbox[2] - bbox[0], bbox[3] - bbox[1]
        detection = fo.Detection(label="lesion", bounding_box=[x / 512, y / 512, w / 512, h / 512])
        detections = [detection]
    except:
        detections = []

    # Create sample with metadata
    sample = fo.Sample(
        filepath=img_path,
        metadata=ImageMetadata(width=512, height=512),
        ground_truth=fo.Detections(detections=detections),
        patient_index=row["Patient_index"],
        study_index=row["Study_index"],
        series_id=row["Series_ID"],
        key_slice_index=row["Key_slice_index"],
        lesion_type=row["Coarse_lesion_type"],
        pixel_spacing=row["Spacing_mm_px_"],
        dicom_window=row["DICOM_windows"],
        age=row["Patient_age"],
        gender=row["Patient_gender"],
        slice_range=row["Slice_range"],
        lesion_diameters=row["Lesion_diameters_Pixel_"],
        possibly_noisy=row["Possibly_noisy"],
        tags=["deeplesion"]
)


    dataset.add_sample(sample)

Found: ./DeepLesion/deeplesion_images/Images_png_wn/000001_01_01/109.png
Found: ./DeepLesion/deeplesion_images/Images_png_wn/000001_02_01/014.png
Found: ./DeepLesion/deeplesion_images/Images_png_wn/000001_02_01/017.png
Found: ./DeepLesion/deeplesion_images/Images_png_wn/000001_03_01/088.png
Found: ./DeepLesion/deeplesion_images/Images_png_wn/000001_04_01/017.png
Found: ./DeepLesion/deeplesion_images/Images_png_wn/000002_01_01/162.png
Found: ./DeepLesion/deeplesion_images/Images_png_wn/000002_01_01/176.png
Found: ./DeepLesion/deeplesion_images/Images_png_wn/000002_02_01/077.png
Found: ./DeepLesion/deeplesion_images/Images_png_wn/000002_02_01/050.png
Found: ./DeepLesion/deeplesion_images/Images_png_wn/000002_02_01/065.png
Found: ./DeepLesion/deeplesion_images/Images_png_wn/000002_02_01/052.png
Found: ./DeepLesion/deeplesion_images/Images_png_wn/000002_03_01/041.png
Found: ./DeepLesion/deeplesion_images/Images_png_wn/000003_01_01/016.png
Found: ./DeepLesion/deeplesion_images/Images_png_wn

### 💾 Make the FiftyOne dataset persistent

In [25]:
dataset.persistent=True
session = fo.launch_app(dataset, port=5151, auto=False)

Session launched. Run `session.show()` to open the App in a cell output.


### 🧩 Indexing Custom Fields for Efficient Querying

FiftyOne allows you to index specific fields within your dataset to enable **faster filtering, searching, and slicing**—especially useful when working with large medical datasets like DeepLesion.

In this step, we define a list of important metadata fields (e.g., patient ID, study ID, lesion type, age, etc.) and attempt to create an index on each one. These fields cover both:
- **Annotation-level data** (like `ground_truth.detections.label`)
- **Clinical metadata** (like `patient_index`, `pixel_spacing`, `gender`, etc.)

If a field cannot be indexed (e.g., due to missing values or data type issues), it logs an error but continues indexing the rest.

Indexing is critical for enabling high-performance queries and visual exploration in FiftyOne.


In [26]:
# List of custom fields to index
fields_to_index = [
    "ground_truth.detections.label",  # Label name
    "patient_index",
    "study_index",
    "series_id",
    "key_slice_index",
    "lesion_type",
    "pixel_spacing",
    "dicom_window",
    "age",
    "gender",
    "slice_range",
    "lesion_diameters",
    "possibly_noisy",
]

# Create indexes
for field in fields_to_index:
    try:
        dataset.create_index(field)
        print(f"Indexed: {field}")
    except Exception as e:
        print(f"Failed to index {field}: {e}")

Indexed: ground_truth.detections.label
Indexed: patient_index
Indexed: study_index
Indexed: series_id
Indexed: key_slice_index
Indexed: lesion_type
Indexed: pixel_spacing
Indexed: dicom_window
Indexed: age
Indexed: gender
Indexed: slice_range
Indexed: lesion_diameters
Indexed: possibly_noisy


### 🧠 Generating Visual Embeddings with a Pretrained ResNet50 Model

In this step, we use FiftyOne Brain to compute **image embeddings** for the DeepLesion dataset using a pretrained model from the FiftyOne Model Zoo.

Specifically:
- We load **ResNet50 trained on ImageNet** (`resnet50-imagenet-torch`) to extract feature vectors for each CT slice.
- We compute embeddings for all samples using `fob.compute_visualization()`.
- The resulting embeddings are stored under the field `"resnet50_embedding1"` and associated with the brain key `"deeplesion_emb1"`.

These embeddings serve as a compact numerical representation of each image and are essential for:
- Visualizing clusters of similar lesions
- Detecting anomalies or outliers
- Performing semantic search across patient studies

This step sets the foundation for **embedding-based exploration and analysis** in the FiftyOne App.


In [27]:
import torch
import fiftyone as fo
import fiftyone.brain as fob
import fiftyone.zoo as foz

# Load dataset
#dataset = fo.load_dataset("deeplesion")

# Use a pretrained model (e.g., ResNet50)
model = foz.load_zoo_model("resnet50-imagenet-torch", include_logits=False)

# Compute embeddings for all samples
fob.compute_visualization(
    dataset,
    model=model,
    embeddings="resnet50_embedding1",
    brain_key="deeplesion_emb1"
)


Computing embeddings...
 100% |█████████████| 32735/32735 [48.6m elapsed, 0s remaining, 11.3 samples/s]      
Generating visualization...
UMAP( verbose=True)
Mon Jun 23 20:57:00 2025 Construct fuzzy simplicial set




Mon Jun 23 20:57:00 2025 Finding Nearest Neighbors
Mon Jun 23 20:57:00 2025 Building RP forest with 14 trees
Mon Jun 23 20:57:00 2025 NN descent for 15 iterations
	 1  /  15
	 2  /  15
	 3  /  15
	 4  /  15
	 5  /  15
	Stopping threshold met -- exiting after 5 iterations
Mon Jun 23 20:57:01 2025 Finished Nearest Neighbor Search
Mon Jun 23 20:57:01 2025 Construct embedding


Epochs completed:   9%| ▉          18/200 [00:00]

	completed  0  /  200 epochs
	completed  20  /  200 epochs


Epochs completed:  34%| ███▍       68/200 [00:00]

	completed  40  /  200 epochs
	completed  60  /  200 epochs


Epochs completed:  50%| █████      100/200 [00:00]

	completed  80  /  200 epochs
	completed  100  /  200 epochs


Epochs completed:  72%| ███████▎   145/200 [00:00]

	completed  120  /  200 epochs
	completed  140  /  200 epochs


Epochs completed:  95%| █████████▌ 190/200 [00:01]

	completed  160  /  200 epochs
	completed  180  /  200 epochs


Epochs completed: 100%| ██████████ 200/200 [00:01]


Mon Jun 23 20:57:04 2025 Finished embedding


<fiftyone.brain.visualization.VisualizationResults at 0x36252fb80>

### 🔍 Identifying the Most Unique CT Slices with FiftyOne Brain

In this step, we use **FiftyOne Brain's uniqueness algorithm** to analyze the dataset and assign a `uniqueness` score to each sample. This score reflects how different a sample is compared to others in the embedding space.

- `fob.compute_uniqueness(dataset)` computes and stores uniqueness scores for all samples.
- We then **sort the dataset by uniqueness** (in descending order) and select the **top 2,000 most unique slices** using `.limit()`.

This unique view helps surface rare or unusual cases, which is useful for:
- Curating diverse training sets
- Spotting labeling or imaging anomalies
- Selecting high-value examples for manual review or annotation


In [28]:
fob.compute_uniqueness(dataset)
# Create a view with the 100 most unique samples
unique_view = dataset.sort_by("uniqueness", reverse=True).limit(2000)

Computing embeddings...
 100% |█████████████| 32735/32735 [4.4m elapsed, 0s remaining, 134.1 samples/s]      
Computing uniqueness...
Uniqueness computation complete


In [29]:
session = fo.launch_app(unique_view, port=5152, auto=False)

Session launched. Run `session.show()` to open the App in a cell output.


### ⚖️ Creating a Balanced Subset of DeepLesion by Lesion Size and Type

This block builds a curated, balanced subset of the DeepLesion dataset focused on **larger lesions** (short axis > 10mm) and **uniform class distribution** across lesion types.

#### 📋 Step-by-step process:

1. **Compute short-axis diameters**  
   Extracts the smaller of the two lesion diameters from the `lesion_diameters` field using a helper function.

2. **Filter lesions with short diameter > 10mm**  
   Larger lesions are prioritized for clinical relevance and improved visibility.

3. **List all unique lesion types**  
   These will serve as the "classes" for balancing.

4. **Sample uniformly per class**  
   Randomly selects up to `250` samples per lesion type (modifiable) to form a class-balanced dataset.

5. **Tag and save the subset**  
   Uses `dataset.select()` to create a new view and applies the `"balanced_subset"` tag for tracking.

6. **Launch the FiftyOne App (optional)**  
   Opens an interactive session to explore the balanced dataset visually.

This balanced subset is ideal for downstream tasks like training detection models or running fine-tuning experiments.


In [33]:
import fiftyone as fo
from fiftyone import ViewField as F
import numpy as np

# Load your dataset
# dataset = fo.load_dataset("deeplesion")

# Step 1: Compute short diameter from lesion_diameters
def get_short_diameter(diam_str):
    try:
        values = list(map(float, diam_str.split(", ")))
        return min(values)  # short diameter
    except:
        return None

short_diameters = {
    sample.id: get_short_diameter(sample["lesion_diameters"])
    for sample in dataset.select_fields(["lesion_diameters"])
}

dataset.set_values("short_diameter", short_diameters, key_field="id")

# Step 2: Filter for lesions > 10mm short axis
filtered_view = dataset.match(F("short_diameter") > 10)

# Step 3: Determine available lesion types
lesion_types = filtered_view.distinct("lesion_type")
print("Lesion types found:", lesion_types)

# Step 4: Balanced sampling
samples_per_class = 250  # Adjust this based on your total budget (e.g., 8 classes * 250 = 2000)
balanced_ids = []

for lesion_type in lesion_types:
    view = filtered_view.match(F("lesion_type") == lesion_type)
    sampled_view = view.shuffle(seed=42).take(samples_per_class)
    ids = sampled_view.values("id")
    balanced_ids.extend(ids)

# Step 5: Select and tag balanced view
balanced_view = dataset.select(balanced_ids)
balanced_view.tag_samples("balanced_subset")
print(f"✅ Total balanced samples: {len(balanced_view)}")

# Step 6: (Optional) Launch FiftyOne app
session = fo.launch_app(balanced_view, port=5151, auto=False)


Lesion types found: [-1, 1, 2, 3, 4, 5, 6, 7, 8]
✅ Total balanced samples: 2161
Session launched. Run `session.show()` to open the App in a cell output.


### 💾 Cloning and Exporting the Balanced DeepLesion Subset

Once the balanced view has been created and validated, we clone it into a new standalone FiftyOne dataset called `"deeplesion_balanced"`.

This is useful for:
- Reusing the dataset independently of the full DeepLesion dataset
- Exporting for training or sharing
- Ensuring reproducibility

We then export the dataset using FiftyOne’s built-in exporter:
- The dataset is saved in **FiftyOneDataset format**
- It includes all media and labels under the `ground_truth` field
- Files are moved (or optionally copied) into a clean export directory
- Existing content in the export folder is overwritten for freshness

The result is a portable, tagged, and size-filtered dataset ready for downstream modeling or publishing.


In [34]:
# Clone the view to a new dataset
balanced_dataset = balanced_view.clone(name="deeplesion_balanced")
print(f"✅ Cloned to dataset: {balanced_dataset.name}")

export_dir = "./deeplesion_balanced_export"

balanced_dataset.export(
    export_dir=export_dir,
    dataset_type=fo.types.FiftyOneDataset,
    label_field="ground_truth",         # Include labels if they exist
    media_export_policy="move",         # Or "copy" if you want to keep originals
    overwrite=True
)

print(f"📁 Exported dataset to: {export_dir}")

✅ Cloned to dataset: deeplesion_balanced
Ignoring unsupported parameter 'media_export_policy'
Exporting samples...
 100% |██████████████████| 2161/2161 [2.9s elapsed, 0s remaining, 784.7 docs/s]      
📁 Exported dataset to: ./deeplesion_balanced_export


In [35]:
print(dataset)

Name:        deeplesion_wn
Media type:  image
Num samples: 32735
Persistent:  True
Tags:        []
Sample fields:
    id:                  fiftyone.core.fields.ObjectIdField
    filepath:            fiftyone.core.fields.StringField
    tags:                fiftyone.core.fields.ListField(fiftyone.core.fields.StringField)
    metadata:            fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.metadata.ImageMetadata)
    created_at:          fiftyone.core.fields.DateTimeField
    last_modified_at:    fiftyone.core.fields.DateTimeField
    ground_truth:        fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.labels.Detections)
    patient_index:       fiftyone.core.fields.IntField
    study_index:         fiftyone.core.fields.IntField
    series_id:           fiftyone.core.fields.IntField
    key_slice_index:     fiftyone.core.fields.IntField
    lesion_type:         fiftyone.core.fields.IntField
    pixel_spacing:       fiftyone.core.fields.StringField
    dicom_window: 

In [36]:
print(balanced_dataset)

Name:        deeplesion_balanced
Media type:  image
Num samples: 2161
Persistent:  False
Tags:        []
Sample fields:
    id:                  fiftyone.core.fields.ObjectIdField
    filepath:            fiftyone.core.fields.StringField
    tags:                fiftyone.core.fields.ListField(fiftyone.core.fields.StringField)
    metadata:            fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.metadata.ImageMetadata)
    created_at:          fiftyone.core.fields.DateTimeField
    last_modified_at:    fiftyone.core.fields.DateTimeField
    ground_truth:        fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.labels.Detections)
    patient_index:       fiftyone.core.fields.IntField
    study_index:         fiftyone.core.fields.IntField
    series_id:           fiftyone.core.fields.IntField
    key_slice_index:     fiftyone.core.fields.IntField
    lesion_type:         fiftyone.core.fields.IntField
    pixel_spacing:       fiftyone.core.fields.StringField
    dicom_wi

In [39]:
from fiftyone.utils.huggingface import push_to_hub

push_to_hub(balanced_dataset, "deeplesion_balanced_fiftyone")

Directory '/var/folders/6y/g2mslh_s7fz7qtj9vrxntqtm0000gn/T/tmpuoe13quz' already exists; export will be merged with existing files
Exporting samples...
 100% |██████████████████| 2161/2161 [1.3s elapsed, 0s remaining, 1.8K docs/s]         


Uploading media files in 2 batches of size 2137:   0%|          | 0/2 [00:00<?, ?it/s]It seems you are trying to upload a large folder at once. This might take some time and then fail if the folder is too large. For such cases, it is recommended to upload in smaller batches or to use `HfApi().upload_large_folder(...)`/`huggingface-cli upload-large-folder` instead. For more details, check out https://huggingface.co/docs/huggingface_hub/main/en/guides/upload#upload-a-large-folder.
Uploading media files in 2 batches of size 2137:  50%|█████     | 1/2 [01:46<01:46, 106.08s/it]No files have been modified since last commit. Skipping to prevent empty commit.
Uploading media files in 2 batches of size 2137: 100%|██████████| 2/2 [01:46<00:00, 53.12s/it] 
