Updated documentation with new datasets and leaderboard (#531)
* updated layout

* updated layout

* addressed comment
roman807 committed Jun 12, 2024
1 parent 094b9e4 commit ecfd966
Showing 7 changed files with 94 additions and 73 deletions.
9 changes: 5 additions & 4 deletions docs/datasets/camelyon16.md
@@ -2,7 +2,7 @@

The Camelyon16 dataset consists of 400 WSIs of lymph nodes for breast cancer metastasis classification. The dataset is a combination of two independent datasets, collected from two separate medical centers in the Netherlands (Radboud University Medical Center and University Medical Center Utrecht). The dataset contains the slides from which [PatchCamelyon](patch_camelyon.md)-patches were extracted.

The dataset is divided in a train set (270 slides) and test set (130 slides), both containing images from both centers.
The dataset is divided into a train set (270 slides) and test set (130 slides), both containing images from both centers. Note that one test set slide was a duplicate and has been removed (see [here](https://github.com/DIDSR/dldp?tab=readme-ov-file#04-data-description-important)).

The task was part of [Grand Challenge](https://grand-challenge.org/) in 2016 and was later replaced by Camelyon17.

@@ -14,14 +14,14 @@ Source: https://camelyon16.grand-challenge.org

| | |
|---------------------------|----------------------------------------------------------|
| **Modality** | Vision (Slide-level) |
| **Modality** | Vision (WSI) |
| **Task** | Binary classification |
| **Cancer type** | Breast |
| **Data size** | ~700 GB |
| **Image dimension** | ~100-250k x ~100-250k x 3 |
| **Magnification (μm/px)** | 40x (0.25) - Level 0 |
| **Files format** | `.tif` |
| **Number of images** | 400 (270 train, 130 test) |
| **Number of images** | 399 (270 train, 129 test) |


### Organization
@@ -55,13 +55,14 @@ The dataset is split into train / test. Additionally, we split the train set int

| Splits | Train | Validation | Test |
|----------|-------------|-------------|------------|
| #Samples | 216 (54%) | 54 (13.5%) | 130 (32.5%) |
| #Samples | 216 (54.1%) | 54 (13.5%) | 129 (32.3%) |


## Relevant links

* [Grand Challenge dataset description](https://camelyon16.grand-challenge.org/Data/)
* [Download links](https://camelyon17.grand-challenge.org/Data/)
* [GitHub with dataset description by DIDSR](https://github.com/DIDSR/dldp)


## References
14 changes: 7 additions & 7 deletions docs/datasets/index.md
@@ -6,13 +6,6 @@

### Whole Slide (WSI) and microscopy image datasets

#### Slide-level
| Dataset | #Slides | Slide Size | Magnification (μm/px) | Task | Cancer Type |
|------------------------------------|----------|---------------------------|------------------------|----------------------------|------------------|
| [Camelyon16](camelyon16.md) | 400 | ~100-250k x ~100-250k x 3 | 40x (0.25) | Classification (2 classes) | Breast |
| [PANDA](panda.md) | 10,616 | ~20k x 20k x 3 | 20x (0.5) | Classification (6 classes) | Prostate |


#### Patch-level
| Dataset | #Patches | Patch Size | Magnification (μm/px) | Task | Cancer Type |
|------------------------------------|----------|------------|------------------------|----------------------------|------------------|
@@ -23,6 +16,13 @@

\* Downsampled from 40x (0.25 μm/px) to increase the field of view.

#### Slide-level
| Dataset | #Slides | Slide Size | Magnification (μm/px) | Task | Cancer Type |
|------------------------------------|----------|---------------------------|------------------------|----------------------------|------------------|
| [Camelyon16](camelyon16.md) | 400 | ~100-250k x ~100-250k x 3 | 40x (0.25) | Classification (2 classes) | Breast |
| [PANDA](panda.md) | 10,616 | ~20k x 20k x 3 | 20x (0.5) | Classification (6 classes) | Prostate |


### Radiology datasets

| Dataset | #Images | Image Size | Task | Download provided
4 changes: 2 additions & 2 deletions docs/datasets/panda.md
@@ -1,6 +1,6 @@
# PANDA (Prostate cANcer graDe Assessment)

The PANDA datasets consists of 10616 whole-slide images of digitized H&E-stained prostate tissue biopsies originating from two medical centers. After the biopsy, the slides were classified into Gleason patterns (3, 4 or 5) based on the architectural growth patterns of the tumor, which are then converted into an ISUP grade on a 0-5 scale.
The PANDA dataset consists of 10,616 whole-slide images of digitized H&E-stained prostate tissue biopsies originating from two medical centers. After the biopsy, the slides were classified into Gleason patterns (3, 4 or 5) based on the architectural growth patterns of the tumor, which are then converted into an ISUP grade on a 0-5 scale.

The Gleason grading system is the most important prognostic marker for prostate cancer and the ISUP grade has a crucial role when deciding how a patient should be treated. However, the system suffers from significant inter-observer variability between pathologists, leading to imperfect and noisy labels.
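For intuition, the conventional conversion from Gleason score to ISUP grade group can be sketched as a lookup table. This is an illustrative sketch of the standard clinical mapping (ISUP 2014 grade groups, with 0 denoting no detected cancer), not code or label definitions taken from the PANDA dataset itself:

```python
# Standard Gleason-score -> ISUP grade-group mapping (illustrative only).
GLEASON_TO_ISUP = {
    "negative": 0,              # no cancerous tissue detected
    "3+3": 1,                   # grade group 1
    "3+4": 2,                   # grade group 2
    "4+3": 3,                   # grade group 3
    "4+4": 4, "3+5": 4, "5+3": 4,  # Gleason score 8
    "4+5": 5, "5+4": 5, "5+5": 5,  # Gleason score 9-10
}

assert GLEASON_TO_ISUP["4+3"] == 3  # pattern order matters: 4+3 outranks 3+4
```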

@@ -20,7 +20,7 @@ Source: https://www.kaggle.com/competitions/prostate-cancer-grade-assessment
| **Image dimension** | ~20k x 20k x 3 |
| **Magnification (μm/px)** | 20x (0.5) - Level 0 |
| **Files format** | `.tiff` |
| **Number of images** | 10616 (9555 after removing noisy labels) |
| **Number of images** | 10,616 (9,555 after removing noisy labels) |


### Organization
68 changes: 9 additions & 59 deletions docs/index.md
@@ -31,17 +31,17 @@ hide:

_Oncology FM Evaluation Framework by [kaiko.ai](https://www.kaiko.ai/)_

With the first release, *eva* supports performance evaluation for vision Foundation Models ("FMs") and supervised machine learning models on WSI-patch-level image classification task. Support for radiology (CT-scans) segmentation tasks will be added soon.
*eva* currently supports performance evaluation for vision Foundation Models ("FMs") and supervised machine learning models on WSI (patch- and slide-level) as well as radiology image classification tasks.

With *eva* we provide the open-source community with an easy-to-use framework that follows industry best practices to deliver a robust, reproducible and fair evaluation benchmark across FMs of different sizes and architectures.

Support for additional modalities and tasks will be added in future releases.
Support for additional modalities and tasks will be added soon.

## Use cases

### 1. Evaluate your own FMs on public benchmark datasets

With a specified FM as input, you can run *eva* on several publicly available datasets & tasks. One evaluation run will download and preprocess the relevant data, compute embeddings, fit and evaluate a downstream head and report the mean and standard deviation of the relevant performance metrics.
With a specified FM as input, you can run *eva* on several publicly available datasets & tasks. One evaluation run will download (if supported) and preprocess the relevant data, compute embeddings, fit and evaluate a downstream head and report the mean and standard deviation of the relevant performance metrics.
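To make the pipeline concrete, here is a minimal sketch of what one run does conceptually: embed images with the frozen backbone, fit a small downstream head, then repeat and aggregate the metric. This is an illustrative PyTorch sketch, not *eva*'s actual API; the toy backbone and random data are placeholders for a real FM and dataset:

```python
import statistics
import torch

@torch.no_grad()
def embed(backbone, loader):
    """Embed every batch once with the frozen FM backbone."""
    backbone.eval()
    return [(backbone(x), y) for x, y in loader]

def fit_and_score(embeddings, dim, n_classes, steps=100):
    """Fit a single linear head on precomputed embeddings; return accuracy."""
    head = torch.nn.Linear(dim, n_classes)
    opt = torch.optim.SGD(head.parameters(), lr=0.01, momentum=0.9)
    for _ in range(steps):
        for z, y in embeddings:
            opt.zero_grad()
            torch.nn.functional.cross_entropy(head(z), y).backward()
            opt.step()
    hits = sum((head(z).argmax(dim=1) == y).sum().item() for z, y in embeddings)
    return hits / sum(len(y) for _, y in embeddings)  # toy: scored on the same split

# Toy stand-ins: a fake "backbone" and random data.
backbone = torch.nn.Flatten()  # maps an 8x8 "image" to a 64-d "embedding"
loader = [(torch.randn(16, 8, 8), torch.randint(0, 2, (16,))) for _ in range(4)]

scores = [fit_and_score(embed(backbone, loader), dim=64, n_classes=2) for _ in range(5)]
print(f"{statistics.mean(scores):.3f} ± {statistics.stdev(scores):.3f}")
```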

Supported datasets & tasks include:

@@ -52,6 +52,11 @@ Supported datasets & tasks include:
- **[CRC](datasets/crc.md)**: multiclass colorectal cancer classification
- **[MHIST](datasets/mhist.md)**: binary colorectal polyp cancer classification

*WSI slide-level pathology datasets*

- **[Camelyon16](datasets/camelyon16.md)**: binary breast cancer classification
- **[PANDA](datasets/panda.md)**: multiclass prostate cancer classification

*Radiology datasets*

- **[TotalSegmentator](datasets/total_segmentator.md)**: segmentation of anatomical structures in radiology/CT scans (*support coming soon*)
@@ -65,62 +70,7 @@ If you have your own labeled dataset, all that is needed is to implement a datas

## Evaluation results

We evaluated the following FMs on the 4 supported WSI-patch-level image classification tasks. On the table below we report *Balanced Accuracy* for binary & multiclass tasks and show the average performance & standard deviation over 5 runs.


<center>

| FM-backbone | pretraining | BACH | CRC | MHIST | PCam/val | PCam/test |
|-----------------------------|-------------|------------------ |----------------- |----------------- |----------------- |-------------- |
| DINO ViT-S16 | N/A | 0.410 (±0.009) | 0.617 (±0.008) | 0.501 (±0.004) | 0.753 (±0.002) | 0.728 (±0.003) |
| DINO ViT-S16 | ImageNet | 0.695 (±0.004) | 0.935 (±0.003) | 0.831 (±0.002) | 0.864 (±0.007) | 0.849 (±0.007) |
| DINO ViT-B8 | ImageNet | 0.710 (±0.007) | 0.939 (±0.001) | 0.814 (±0.003) | 0.870 (±0.003) | 0.856 (±0.004) |
| DINOv2 ViT-L14 | ImageNet | 0.707 (±0.008) | 0.916 (±0.002) | 0.832 (±0.003) | 0.873 (±0.001) | 0.888 (±0.001) |
| Lunit - ViT-S16 | TCGA | 0.801 (±0.005) | 0.934 (±0.001) | 0.768 (±0.004) | 0.889 (±0.002) | 0.895 (±0.006) |
| Owkin - iBOT ViT-B16 | TCGA | 0.725 (±0.004) | 0.935 (±0.001) | 0.777 (±0.005) | 0.912 (±0.002) | 0.915 (±0.003) |
| UNI - DINOv2 ViT-L16 | Mass-100k | 0.814 (±0.008) | 0.950 (±0.001) | **0.837 (±0.001)** | **0.936 (±0.001)** | **0.938 (±0.001)**|
| kaiko.ai - DINO ViT-S16 | TCGA | 0.797 (±0.003) | 0.943 (±0.001) | 0.828 (±0.003) | 0.903 (±0.001) | 0.893 (±0.005) |
| kaiko.ai - DINO ViT-S8 | TCGA | 0.834 (±0.012) | 0.946 (±0.002) | 0.832 (±0.006) | 0.897 (±0.001) | 0.887 (±0.002) |
| kaiko.ai - DINO ViT-B16 | TCGA | 0.810 (±0.008) | **0.960 (±0.001)** | 0.826 (±0.003) | 0.900 (±0.002) | 0.898 (±0.003) |
| kaiko.ai - DINO ViT-B8 | TCGA | 0.865 (±0.019) | 0.956 (±0.001) | 0.809 (±0.021) | 0.913 (±0.001) | 0.921 (±0.002) |
| kaiko.ai - DINOv2 ViT-L14 | TCGA | **0.870 (±0.005)**| 0.930 (±0.001) | 0.809 (±0.001) | 0.908 (±0.001) | 0.898 (±0.002) |

</center>

The runs use the default setup described in the section below.

*eva* trains the decoder on the "train" split and uses the "validation" split for monitoring, early stopping and checkpoint selection. Evaluation results are reported on the "validation" split and, if available, on the "test" split.

For more details on the FM-backbones and instructions to replicate the results, check out [Replicate evaluations](user-guide/advanced/replicate_evaluations.md).

## Evaluation setup

*Note that the current version of eva implements the task- & model-independent and fixed default set up following the standard evaluation protocol proposed by [1] and described in the table below. We selected this approach to prioritize reliable, robust and fair FM-evaluation while being in line with common literature. Additionally, with future versions we are planning to allow the use of cross-validation and hyper-parameter tuning to find the optimal setup to achieve best possible performance on the implemented downstream tasks.*

With a provided FM, *eva* computes embeddings for all input images (WSI patches) which are then used to train a downstream head consisting of a single linear layer in a supervised setup for each of the benchmark datasets. We use early stopping with a patience of 5% of the maximal number of epochs.

| | |
|-------------------------|---------------------------|
| **Backbone** | frozen |
| **Hidden layers** | none |
| **Dropout** | 0.0 |
| **Activation function** | none |
| **Number of steps** | 12,500 |
| **Base Batch size** | 4,096 |
| **Batch size** | dataset specific* |
| **Base learning rate** | 0.01 |
| **Learning Rate** | [Base learning rate] * [Batch size] / [Base batch size] |
| **Max epochs** | [Number of steps] * [Batch size] / [Number of samples] |
| **Early stopping** | 5% * [Max epochs] |
| **Optimizer** | SGD |
| **Momentum** | 0.9 |
| **Weight Decay** | 0.0 |
| **Nesterov momentum** | true |
| **LR Schedule** | Cosine without warmup |

\* For smaller datasets (e.g. BACH with 400 samples) we reduce the batch size to 256 and scale the learning rate accordingly.

- [1]: [Virchow: A Million-Slide Digital Pathology Foundation Model, 2024](https://arxiv.org/pdf/2309.07778.pdf)
Check out our [Leaderboards](leaderboards.md) to inspect evaluation results of publicly available FMs.

## License

69 changes: 69 additions & 0 deletions docs/leaderboards.md
@@ -0,0 +1,69 @@
---
hide:
- navigation
---

# Leaderboards

We evaluated the following FMs on the 6 supported WSI classification tasks. We report *Balanced Accuracy* for binary & multiclass tasks; each score is the average performance over 5 runs.

<br/>

<center>

| Vision FM | pretraining | BACH | CRC | MHIST | PCam |Camelyon16| PANDA |
|-----------------------------|-------------|--------- |-----------|-----------|----------|----------|----------|
| [DINO ViT-S16](https://arxiv.org/abs/2104.14294) | N/A | 0.410 | 0.617 | 0.501 | 0.728 | TBD | TBD |
| [DINO ViT-S16](https://arxiv.org/abs/2104.14294) | ImageNet | 0.695 | 0.935 | 0.831 | 0.849 | TBD | TBD |
| [Lunit - ViT-S16](https://github.com/lunit-io/benchmark-ssl-pathology/releases/) | TCGA | 0.801 | 0.934 | 0.768 | 0.895 | TBD | TBD |
| [Owkin (Phikon) - iBOT ViT-B16](https://huggingface.co/owkin/phikon) | TCGA | 0.725 | 0.935 | 0.777 | 0.915 | TBD | TBD |
| [UNI - DINOv2 ViT-L16](https://huggingface.co/MahmoodLab/UNI) | Mass-100k | 0.814 | 0.950 | **0.837** | **0.938**| TBD | TBD |
| [kaiko.ai - DINO ViT-S16](https://github.com/kaiko-ai/towards_large_pathology_fms) | TCGA | 0.797 | 0.943 | 0.828 | 0.893 | TBD | TBD |
| [kaiko.ai - DINO ViT-S8](https://github.com/kaiko-ai/towards_large_pathology_fms) | TCGA | 0.834 | 0.946 | 0.832 | 0.887 | TBD | TBD |
| [kaiko.ai - DINO ViT-B16](https://github.com/kaiko-ai/towards_large_pathology_fms) | TCGA | 0.810 | **0.960** | 0.826 | 0.898 | TBD | TBD |
| [kaiko.ai - DINO ViT-B8](https://github.com/kaiko-ai/towards_large_pathology_fms) | TCGA | 0.865 | 0.956 | 0.809 | 0.921 | TBD | TBD |
| [kaiko.ai - DINOv2 ViT-L14](https://github.com/kaiko-ai/towards_large_pathology_fms) | TCGA | **0.870**| 0.930 | 0.809 | 0.898 | TBD | TBD |

</center>

The runs use the default setup described in the section below.

*eva* trains the decoder on the "train" split and uses the "validation" split for monitoring, early stopping and checkpoint selection. Evaluation results are reported on the "test" split if available and otherwise on the "validation" split.

For details on the FM-backbones and instructions to replicate the results, check out [Replicate evaluations](user-guide/advanced/replicate_evaluations.md).
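The reported *Balanced Accuracy* is the unweighted mean of per-class recall, which keeps scores meaningful on imbalanced datasets. A quick sketch of the metric itself (for illustration; *eva* computes this internally):

```python
import numpy as np

def balanced_accuracy(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Unweighted mean of per-class recall."""
    classes = np.unique(y_true)
    recalls = [np.mean(y_pred[y_true == c] == c) for c in classes]
    return float(np.mean(recalls))

y_true = np.array([0, 0, 0, 0, 1, 1])
y_pred = np.array([0, 0, 0, 0, 1, 0])
print(balanced_accuracy(y_true, y_pred))  # (4/4 + 1/2) / 2 = 0.75
```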

## Evaluation protocol

*eva* uses a fixed, task- and model-independent default setup which closely follows the standard evaluation protocol proposed by [1] (with adjustments for slide-level tasks to ensure convergence and computational efficiency).

We selected this approach to prioritize reliable, robust and fair FM-evaluation while being in line with common literature.

| | WSI patch-level tasks | WSI slide-level tasks |
|--------------------------------|---------------------------|---------------------------|
| **Backbone** | frozen | frozen |
| **Head** | single layer MLP | ABMIL |
| **Dropout** | 0.0 | 0.0 |
| **Hidden activation function** | n/a | ReLU |
| **Output activation function** | none | none |
| **Number of steps** | 12,500 | 12,500 (2) |
| **Base batch size** | 4,096 (1) | 32 |
| **Base learning rate** | 0.01 (1) | 0.001 |
| **Early stopping** | 5% * [Max epochs] | 10% * [Max epochs] (3) |
| **Optimizer** | SGD | AdamW |
| **Momentum** | 0.9 | n/a |
| **Weight Decay** | 0.0 | n/a |
| **betas** | n/a | [0.9, 0.999] |
| **LR Schedule** | Cosine without warmup | Cosine without warmup |
| **Number of patches per slide**| 1                         | dataset specific (4)      |


(1) For smaller datasets (e.g. BACH with 400 samples) we reduce the batch size to 256 and scale the learning rate accordingly.
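As a worked check of footnote (1), the linear learning-rate scaling rule ([Base learning rate] * [Batch size] / [Base batch size], as in the patch-level setup) gives the following values; the snippet assumes the patch-level defaults above:

```python
BASE_LR, BASE_BATCH_SIZE = 0.01, 4096

def scaled_lr(batch_size: int) -> float:
    """Linear scaling: lr = base_lr * batch_size / base_batch_size."""
    return BASE_LR * batch_size / BASE_BATCH_SIZE

print(scaled_lr(4096))  # 0.01     (default patch-level batch size)
print(scaled_lr(256))   # 0.000625 (reduced batch size for small datasets like BACH)
```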

(2) Upper cap at a maximum of 100 epochs.

(3) Lower cap at a minimum of 8 epochs.

(4) The number of patches per slide depends on the task and slide size. For PANDA and Camelyon16 we use at most 1,000 and 10,000 random patches per slide, respectively.
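For reference, the slide-level ABMIL head pools the patch embeddings of a slide with learned attention weights (attention-based multiple-instance learning, Ilse et al., 2018). A minimal sketch, assuming PyTorch and using the ReLU hidden activation from the table above; this simplifies the original gated-attention variant and is not *eva*'s exact implementation:

```python
import torch
from torch import nn

class ABMILHead(nn.Module):
    """Attention-based MIL head: pools patch embeddings into one slide prediction."""

    def __init__(self, dim: int, n_classes: int, hidden: int = 128):
        super().__init__()
        # Scores one attention weight per patch (ReLU hidden activation, per the table).
        self.attention = nn.Sequential(
            nn.Linear(dim, hidden), nn.ReLU(), nn.Linear(hidden, 1)
        )
        self.classifier = nn.Linear(dim, n_classes)

    def forward(self, patches: torch.Tensor) -> torch.Tensor:
        # patches: (n_patches, dim) embeddings from the frozen FM backbone.
        weights = torch.softmax(self.attention(patches), dim=0)  # (n_patches, 1)
        slide = (weights * patches).sum(dim=0)                   # attention-weighted pooling
        return self.classifier(slide)                            # slide-level logits

head = ABMILHead(dim=384, n_classes=2)   # e.g. ViT-S16 embeddings, binary task
logits = head(torch.randn(1000, 384))    # up to 1,000 patches for a PANDA slide
```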


- [1]: [Virchow: A Million-Slide Digital Pathology Foundation Model, 2024](https://arxiv.org/pdf/2309.07778.pdf)
2 changes: 1 addition & 1 deletion docs/user-guide/advanced/replicate_evaluations.md
@@ -4,7 +4,7 @@ To produce the evaluation results presented [here](../../index.md#evaluation-res

Make sure to replace `<task>` in the commands below with `bach`, `crc`, `mhist` or `patch_camelyon`.

Note that to run the commands below you will need to first download the data. [BACH](../../datasets/bach.md), [CRC](../../datasets/crc.md) and [PatchCamelyon](../../datasets/patch_camelyon.md) provide automatic download by setting the argument `download: true` (either modify the config-files or set the environment variable `DOWNLOAD=true`). In the case of MHIST you will need to download the data manually by following the instructions provided [here](../../datasets/mhist.md#download-and-preprocessing).*
*Note that to run the commands below you will need to first download the data. [BACH](../../datasets/bach.md), [CRC](../../datasets/crc.md) and [PatchCamelyon](../../datasets/patch_camelyon.md) provide automatic download by setting the argument `download: true` (either modify the config-files or set the environment variable `DOWNLOAD=true`). In the case of MHIST you will need to download the data manually by following the instructions provided [here](../../datasets/mhist.md#download-and-preprocessing).*
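For example, a small wrapper script could set the environment variable and loop over the auto-downloadable tasks. This is a hypothetical sketch: it assumes the `eva` CLI is installed on your PATH and uses an illustrative config path; adapt both to the actual config files in the repository:

```python
import os
import subprocess

# Hypothetical wrapper: enable automatic download and run each evaluation.
env = dict(os.environ, DOWNLOAD="true")
for task in ("bach", "crc", "patch_camelyon"):  # MHIST requires a manual download
    subprocess.run(
        ["eva", "predict_fit", "--config", f"configs/vision/dino_vit/offline/{task}.yaml"],
        env=env,
        check=True,
    )
```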

## DINO ViT-S16 (random weights)

1 change: 1 addition & 0 deletions mkdocs.yml
@@ -65,6 +65,7 @@ markdown_extensions:
- pymdownx.superfences
nav:
- Introduction: index.md
- Leaderboards: leaderboards.md
- User Guide:
- user-guide/index.md
- Getting started:
