459 add camelyon16 slide level task #476

Merged
49 commits
75305bf
added panda dataset class
nkaenzig May 7, 2024
35598b0
clean up
nkaenzig May 8, 2024
ac529c7
remove samples with noisy labels
nkaenzig May 8, 2024
8fbd8ef
clean up table in dataset readme
nkaenzig May 8, 2024
f7f9d02
Merge remote-tracking branch 'origin/360-aggregated-feature-support-w…
nkaenzig May 8, 2024
ab836f6
added function for stratified splits
nkaenzig May 13, 2024
160a641
added unit tests
nkaenzig May 13, 2024
fb9f024
cleanup
nkaenzig May 13, 2024
c576e25
addressed comments
nkaenzig May 14, 2024
5b5b17a
Merge remote-tracking branch 'origin/360-aggregated-feature-support-w…
nkaenzig May 14, 2024
739b134
Merge remote-tracking branch 'origin/360-aggregated-feature-support-w…
nkaenzig May 21, 2024
f336bcf
fixed issue with resource download
nkaenzig May 21, 2024
3930ed1
validation fix
nkaenzig May 21, 2024
8a97913
updated readme
nkaenzig May 21, 2024
873f454
added to mkdocs
nkaenzig May 21, 2024
f1c203e
added image_dir to exception print
nkaenzig May 21, 2024
7172c80
updated root path in yaml config
nkaenzig May 21, 2024
7fbe4f1
added panda to datasets overview table in docs
nkaenzig May 21, 2024
101e8bd
added md5 hash for downloaded resources
nkaenzig May 21, 2024
a1fdde7
update init
roman807 May 21, 2024
9fd8996
Merge branch '425-create-panda-dataset-class' into 459-add-camelyon16…
roman807 May 21, 2024
1eaeae3
added camelyon16
roman807 May 21, 2024
b196b69
added camelyon16
roman807 May 21, 2024
0e800cb
Merge branch '360-aggregated-feature-support-wsi-level-tasks' into 45…
roman807 May 22, 2024
58584ca
updated camelyon16 class
roman807 May 23, 2024
695fe17
Merge branch '360-aggregated-feature-support-wsi-level-tasks' into 45…
roman807 May 23, 2024
71c5916
added tests and config
roman807 May 28, 2024
df28847
formatting
roman807 May 28, 2024
983d25a
formatting
roman807 May 28, 2024
6f3f88d
Merge branch '360-aggregated-feature-support-wsi-level-tasks' into 45…
roman807 May 28, 2024
6fda007
formatting
roman807 May 28, 2024
6d03769
Merge remote-tracking branch 'origin/459-add-camelyon16-slide-level-t…
roman807 May 28, 2024
5f99bcd
formatting
roman807 May 28, 2024
872bc6f
added test files
roman807 May 28, 2024
4d70442
formatting
roman807 May 28, 2024
6811b76
lint
roman807 May 28, 2024
24c21c1
added target transforms
roman807 May 28, 2024
ccbdcdd
formatting
roman807 May 28, 2024
bd87440
fixed dataset
roman807 May 29, 2024
06f6c91
addressed comments
roman807 Jun 3, 2024
d3be7c9
addressed comments
roman807 Jun 3, 2024
b108d87
fix test
roman807 Jun 3, 2024
7665752
fix test
roman807 Jun 3, 2024
f8ee2c5
Merge remote-tracking branch 'origin/360-aggregated-feature-support-w…
roman807 Jun 4, 2024
b5a8eda
fixed test
roman807 Jun 4, 2024
fde2d42
addressed comments
roman807 Jun 4, 2024
bd81d4a
updated loss
roman807 Jun 6, 2024
edc7f6f
fix annotations
roman807 Jun 6, 2024
6f88ba8
lint
roman807 Jun 6, 2024
135 changes: 135 additions & 0 deletions configs/vision/dino_vit/offline/camelyon16.yaml
@@ -0,0 +1,135 @@
---
trainer:
  class_path: eva.Trainer
  init_args:
    n_runs: &N_RUNS ${oc.env:N_RUNS, 3}
    default_root_dir: &OUTPUT_ROOT ${oc.env:OUTPUT_ROOT, logs/${oc.env:DINO_BACKBONE, dino_vits16}/offline/camelyon16}
    max_steps: &MAX_STEPS ${oc.env:MAX_STEPS, 12500}
    callbacks:
      - class_path: lightning.pytorch.callbacks.LearningRateMonitor
        init_args:
          logging_interval: epoch
      - class_path: lightning.pytorch.callbacks.ModelCheckpoint
        init_args:
          filename: best
          save_last: true
          save_top_k: 1
          monitor: &MONITOR_METRIC ${oc.env:MONITOR_METRIC, val/BinaryAccuracy}
          mode: &MONITOR_METRIC_MODE ${oc.env:MONITOR_METRIC_MODE, max}
      - class_path: lightning.pytorch.callbacks.EarlyStopping
        init_args:
          min_delta: 0
          patience: 74
          monitor: *MONITOR_METRIC
          mode: *MONITOR_METRIC_MODE
      - class_path: eva.callbacks.EmbeddingsWriter
        init_args:
          output_dir: &DATASET_EMBEDDINGS_ROOT ${oc.env:EMBEDDINGS_ROOT, ./data/embeddings}/${oc.env:DINO_BACKBONE, dino_vits16}/camelyon16
          dataloader_idx_map:
            0: train
            1: val
            2: test
          metadata_keys: ["wsi_id"]
          backbone:
            class_path: eva.models.ModelFromFunction
            init_args:
              path: torch.hub.load
              arguments:
                repo_or_dir: ${oc.env:REPO_OR_DIR, facebookresearch/dino:main}
                model: ${oc.env:DINO_BACKBONE, dino_vits16}
                pretrained: ${oc.env:PRETRAINED, true}
                force_reload: ${oc.env:FORCE_RELOAD, false}
              checkpoint_path: ${oc.env:CHECKPOINT_PATH, null}
    logger:
      - class_path: lightning.pytorch.loggers.TensorBoardLogger
        init_args:
          save_dir: *OUTPUT_ROOT
          name: ""
model:
  class_path: eva.HeadModule
  init_args:
    head:
      class_path: eva.vision.models.networks.ABMIL
      init_args:
        input_size: ${oc.env:IN_FEATURES, 384}
        output_size: &NUM_CLASSES 1
        projected_input_size: 128
    criterion: torch.nn.BCEWithLogitsLoss
    optimizer:
      class_path: torch.optim.AdamW
      init_args:
        lr: &LR_VALUE 0.000039
        betas: [0.9, 0.999]
    lr_scheduler:
      class_path: torch.optim.lr_scheduler.CosineAnnealingLR
      init_args:
        T_max: *MAX_STEPS
        eta_min: 0.0
    metrics:
      common:
        - class_path: eva.metrics.AverageLoss
        - class_path: eva.metrics.BinaryClassificationMetrics
data:
  class_path: eva.DataModule
  init_args:
    datasets:
      train:
        class_path: eva.datasets.MultiEmbeddingsClassificationDataset
        init_args: &DATASET_ARGS
          root: *DATASET_EMBEDDINGS_ROOT
          manifest_file: manifest.csv
          split: train
          embeddings_transforms:
            class_path: eva.core.data.transforms.Pad2DTensor
            init_args:
              pad_size: 10_000
          target_transforms:
            class_path: eva.core.data.transforms.dtype.ArrayToFloatTensor
      val:
        class_path: eva.datasets.MultiEmbeddingsClassificationDataset
        init_args:
          <<: *DATASET_ARGS
          split: val
      test:
        class_path: eva.datasets.MultiEmbeddingsClassificationDataset
        init_args:
          <<: *DATASET_ARGS
          split: test
      predict:
        - class_path: eva.vision.datasets.Camelyon16
          init_args: &PREDICT_DATASET_ARGS
            root: ${oc.env:DATA_ROOT, ./data}/camelyon16
            sampler:
              class_path: eva.vision.data.wsi.patching.samplers.ForegroundGridSampler
              init_args:
                max_samples: 10_000
            width: 224
            height: 224
            target_mpp: 0.25
            split: train
            image_transforms:
              class_path: eva.vision.data.transforms.common.ResizeAndCrop
              init_args:
                size: ${oc.env:RESIZE_DIM, 224}
                mean: ${oc.env:NORMALIZE_MEAN, [0.485, 0.456, 0.406]}
                std: ${oc.env:NORMALIZE_STD, [0.229, 0.224, 0.225]}
        - class_path: eva.vision.datasets.Camelyon16
          init_args:
            <<: *PREDICT_DATASET_ARGS
            split: val
        - class_path: eva.vision.datasets.Camelyon16
          init_args:
            <<: *PREDICT_DATASET_ARGS
            split: test
    dataloaders:
      train:
        batch_size: &BATCH_SIZE ${oc.env:BATCH_SIZE, 16}
        shuffle: true
      val:
        batch_size: *BATCH_SIZE
      test:
        batch_size: *BATCH_SIZE
      predict:
        batch_size: &PREDICT_BATCH_SIZE ${oc.env:PREDICT_BATCH_SIZE, 64}
        num_workers: 12  # multiprocessing.cpu_count
        prefetch_factor: 2
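To get a feel for the learning-rate schedule this config sets up, here is a minimal sketch of the closed-form cosine annealing curve. It uses the `lr`, `T_max` and `eta_min` values from the config and assumes the scheduler is stepped once per training step; the function name `cosine_annealing_lr` is made up for this illustration and is not part of eva.

```python
import math

# Values taken from the config above.
BASE_LR = 0.000039  # optimizer lr
T_MAX = 12500       # T_max, which reuses the max_steps default
ETA_MIN = 0.0       # eta_min


def cosine_annealing_lr(
    step: int,
    base_lr: float = BASE_LR,
    t_max: int = T_MAX,
    eta_min: float = ETA_MIN,
) -> float:
    """Closed-form cosine annealing without restarts.

    lr(step) = eta_min + (base_lr - eta_min) * (1 + cos(pi * step / t_max)) / 2
    """
    return eta_min + 0.5 * (base_lr - eta_min) * (1.0 + math.cos(math.pi * step / t_max))


if __name__ == "__main__":
    # Print the learning rate at the start, the midpoint and the end of training.
    for step in (0, T_MAX // 2, T_MAX):
        print(f"step {step:>6}: lr = {cosine_annealing_lr(step):.3e}")
```

Under these assumptions the rate starts at the configured `lr`, is halved at the midpoint, and reaches `eta_min` at `max_steps`. If early stopping ends a run sooner, the schedule is simply cut short.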
68 changes: 68 additions & 0 deletions docs/datasets/camelyon16.md
@@ -0,0 +1,68 @@
# Camelyon16

The Camelyon16 dataset consists of 400 WSIs of lymph nodes for breast cancer metastasis classification. The dataset is a combination of two independent datasets, collected from two separate medical centers in the Netherlands (Radboud University Medical Center and University Medical Center Utrecht). The dataset contains the slides from which the [PatchCamelyon](patch_camelyon.md) patches were extracted.

The dataset is divided into a train set (270 slides) and a test set (130 slides), each containing images from both centers.

The task was part of the [Grand Challenge](https://grand-challenge.org/) in 2016 and was later superseded by Camelyon17.

Source: https://camelyon16.grand-challenge.org

## Raw data

### Key stats

| | |
|---------------------------|----------------------------------------------------------|
| **Modality** | Vision (Slide-level) |
| **Task** | Binary classification |
| **Cancer type** | Breast |
| **Data size** | ~700 GB |
| **Image dimension** | ~100-250k x ~100-250k x 3 |
| **Magnification (μm/px)** | 40x (0.25) - Level 0 |
| **File format** | `.tif` |
| **Number of images** | 400 (270 train, 130 test) |


### Organization

The `CAMELYON16` data (download links [here](https://camelyon17.grand-challenge.org/Data/)) is organized as follows:

```
CAMELYON16
├── training
│   ├── normal
│   │   ├── normal_001.tif
│   │   └── ...
│   ├── tumor
│   │   ├── tumor_001.tif
│   │   └── ...
│   └── lesion_annotations.zip
└── testing
    ├── images
    │   ├── test_001.tif
    │   └── ...
    ├── evaluation               # masks not in use
    ├── reference.csv            # targets
    └── lesion_annotations.zip
```
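As a rough illustration of this layout, the sketch below collects the `.tif` slides per folder using only the standard library. It is not the `Camelyon16` dataset class. The names `list_slides` and `training_targets` are made up for this example, and deriving the training label from the folder name (normal = 0, tumor = 1) is an assumption. Test targets would come from `reference.csv`, which is not parsed here because its column layout is not described above.

```python
from pathlib import Path
from typing import Dict, List


def list_slides(root: str) -> Dict[str, List[Path]]:
    """Collects the .tif slides of each folder in the layout shown above."""
    base = Path(root)
    return {
        "normal": sorted((base / "training" / "normal").glob("*.tif")),
        "tumor": sorted((base / "training" / "tumor").glob("*.tif")),
        "test": sorted((base / "testing" / "images").glob("*.tif")),
    }


def training_targets(slides: Dict[str, List[Path]]) -> Dict[str, int]:
    """Maps each training slide name to a binary target.

    Assumption for this sketch: the folder name determines the label
    (normal -> 0, tumor -> 1).
    """
    targets = {path.name: 0 for path in slides["normal"]}
    targets.update({path.name: 1 for path in slides["tumor"]})
    return targets


if __name__ == "__main__":
    # Hypothetical root path; adjust it to wherever the data was downloaded.
    found = list_slides("./data/camelyon16")
    for folder, paths in found.items():
        print(f"{folder}: {len(paths)} slides")
```

With the full download in place, the three counts should add up to the 400 slides listed in the key stats.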

## Download and preprocessing

The `Camelyon16` dataset class does not download the data at runtime; the data must be downloaded manually from the links provided [here](https://camelyon17.grand-challenge.org/Data/).

The dataset is split into train / test. Additionally, we split the train set into train/val using the same splits as [PatchCamelyon](patch_camelyon.md) (see metadata CSV files on [Zenodo](https://zenodo.org/records/2546921)).

| Splits | Train | Validation | Test |
|----------|-------------|-------------|------------|
| #Samples | 216 (54%) | 54 (13.5%) | 130 (32.5%) |
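The percentages in the table follow directly from the split sizes. A small sketch of that arithmetic, where the helper name `split_percentages` is made up for this example:

```python
from typing import Dict

# Split sizes from the table above.
SPLIT_SIZES = {"train": 216, "val": 54, "test": 130}


def split_percentages(sizes: Dict[str, int]) -> Dict[str, float]:
    """Returns each split's share of the total number of slides, in percent."""
    total = sum(sizes.values())
    return {name: 100 * count / total for name, count in sizes.items()}


if __name__ == "__main__":
    for name, share in split_percentages(SPLIT_SIZES).items():
        print(f"{name}: {SPLIT_SIZES[name]} slides ({share}%)")
```

The train and validation splits together make up the 270 slides of the original train set, so the validation split is 20% of it.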


## Relevant links

* [Grand Challenge dataset description](https://camelyon16.grand-challenge.org/Data/)
* [Download links](https://camelyon17.grand-challenge.org/Data/)


## References
1 : [A General-Purpose Self-Supervised Model for Computational Pathology](https://arxiv.org/abs/2308.15474)
7 changes: 4 additions & 3 deletions docs/datasets/index.md
@@ -7,9 +7,10 @@
### Whole Slide (WSI) and microscopy image datasets

#### Slide-level
| Dataset | #Slides | Slide Size | Magnification (μm/px) | Task | Cancer Type |
|------------------------------------|----------|------------|------------------------|----------------------------|------------------|
| [PANDA](panda.md) | 3,152 | ~20k x 20k x 3 | 20x (0.5) | Classification (6 classes) | Prostate |
| Dataset | #Slides | Slide Size | Magnification (μm/px) | Task | Cancer Type |
|------------------------------------|----------|---------------------------|------------------------|----------------------------|------------------|
| [Camelyon16](camelyon16.md) | 400 | ~100-250k x ~100-250k x 3 | 40x (0.25) | Classification (2 classes) | Breast |
| [PANDA](panda.md) | 10,616 | ~20k x 20k x 3 | 20x (0.5) | Classification (6 classes) | Prostate |


#### Patch-level
2 changes: 2 additions & 0 deletions src/eva/vision/data/datasets/__init__.py
@@ -5,6 +5,7 @@
    CRC,
    MHIST,
    PANDA,
    Camelyon16,
    PatchCamelyon,
    TotalSegmentatorClassification,
    WsiClassificationDataset,
@@ -20,6 +21,7 @@
    "ImageSegmentation",
    "PatchCamelyon",
    "PANDA",
    "Camelyon16",
    "TotalSegmentatorClassification",
    "TotalSegmentator2D",
    "VisionDataset",
13 changes: 13 additions & 0 deletions src/eva/vision/data/datasets/_validators.py
@@ -57,3 +57,16 @@ def check_dataset_exists(dataset_dir: str, download_available: bool) -> None:
        if download_available:
            error_message += " You can set `download=True` to download the dataset automatically."
        raise FileNotFoundError(error_message)


def check_number_of_files(file_paths: List[str], expected_length: int, split: str | None) -> None:
    """Verifies the number of files in the dataset.

    Raises:
        ValueError: If the number of files in the dataset does not match the expected one.
    """
    if len(file_paths) != expected_length:
        raise ValueError(
            f"Expected {expected_length} files for split '{split}', found {len(file_paths)}. "
            f"{_SUFFIX_ERROR_MESSAGE}"
        )
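A minimal usage sketch for this validator. The function is copied so that the example runs on its own, with `Optional[str]` standing in for `str | None` to stay compatible with older Python versions. The value of `_SUFFIX_ERROR_MESSAGE` is a placeholder, since the real constant is defined elsewhere in `_validators.py` and is not part of this diff. The helper `is_complete` and the file names are hypothetical.

```python
from typing import List, Optional

# Placeholder text: the real `_SUFFIX_ERROR_MESSAGE` is defined elsewhere in
# `_validators.py` and is not shown in this diff.
_SUFFIX_ERROR_MESSAGE = "Please verify that the dataset is complete."


def check_number_of_files(
    file_paths: List[str], expected_length: int, split: Optional[str]
) -> None:
    """Copy of the validator above so that this example is self-contained."""
    if len(file_paths) != expected_length:
        raise ValueError(
            f"Expected {expected_length} files for split '{split}', found {len(file_paths)}. "
            f"{_SUFFIX_ERROR_MESSAGE}"
        )


def is_complete(file_paths: List[str], expected_length: int, split: Optional[str]) -> bool:
    """Returns True if the check passes and False if it raises ValueError."""
    try:
        check_number_of_files(file_paths, expected_length, split)
    except ValueError:
        return False
    return True


if __name__ == "__main__":
    # Hypothetical file names; 216 is the train split size from the docs table.
    train_files = [f"slide_{index:03d}.tif" for index in range(216)]
    print(is_complete(train_files, 216, "train"))
    print(is_complete(train_files[:-1], 216, "train"))
```

Raising on any mismatch, rather than only when files are missing, also catches stray or duplicated files in the dataset folder.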
2 changes: 2 additions & 0 deletions src/eva/vision/data/datasets/classification/__init__.py
@@ -1,6 +1,7 @@
"""Image classification datasets API."""

from eva.vision.data.datasets.classification.bach import BACH
from eva.vision.data.datasets.classification.camelyon16 import Camelyon16
from eva.vision.data.datasets.classification.crc import CRC
from eva.vision.data.datasets.classification.mhist import MHIST
from eva.vision.data.datasets.classification.panda import PANDA
@@ -16,4 +17,5 @@
    "TotalSegmentatorClassification",
    "WsiClassificationDataset",
    "PANDA",
    "Camelyon16",
]