Commit

Merge 5ef0db9 into ea59604
nicolebussola committed Nov 4, 2020
2 parents ea59604 + 5ef0db9 commit b5f9d3f
Showing 3 changed files with 330 additions and 0 deletions.
55 changes: 55 additions & 0 deletions examples/TCGA/README.md
@@ -0,0 +1,55 @@
# Create a leakage-free dataset of tiles for TCGA

`extract_tile_pw_tcga.py` is a Python script provided as a reference for building a reproducible dataset of tiles from a collection of WSIs in the [TCGA](https://www.cancer.gov/about-nci/organization/ccg/research/structural-genomics/tcga) public repository. It can be easily integrated into deep learning pipelines for computational pathology.

## Prerequisites
To run this script you will need the following packages, other than `histolab`:

- `pandas`
- `scikit-learn`

In addition, a CSV file of patient clinical data (`clinical_csv`) is required. This file can be downloaded from the TCGA [data portal](https://portal.gdc.cancer.gov/) and is structured as follows:

| case_id | case_submitter_id | project_id | age_at_index | ... | primary_diagnosis | ... | treatment_type |
|---------------------|-----------------------------------|---------------|-----------|----------------|----------------|-------------------------|-----------------------------------------------------------------|
| 6cd9baf5-bbe0-4c1e-a87f-c53b3af22890 | TCGA-A7-A13G | TCGA-BRCA | 79 | ... | Infiltrating duct carcinoma, NOS | ... | Pharmaceutical Therapy, NOS |
| 928c48a0-68ee-4e28-ae83-9832e52850ca | TCGA-CH-5753 | TCGA-PRAD | 70 | ... | Adenocarcinoma, NOS | ... | Radiation Therapy, NOS |
| ... | ... | ... | ... | ... | ... | ... | ... |
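
As a minimal sketch of how this file can be consumed, the snippet below parses two rows copied from the table above and indexes them by `case_submitter_id`. It uses only the standard library `csv` module for self-containment (the script itself depends on `pandas`), keeps only a subset of the columns, and the helper name `load_patients` is hypothetical rather than part of `histolab` or the script.

```python
import csv
import io

# Two rows copied from the table above; only a subset of the columns is kept.
CLINICAL_CSV = """case_id,case_submitter_id,project_id,age_at_index,primary_diagnosis,treatment_type
6cd9baf5-bbe0-4c1e-a87f-c53b3af22890,TCGA-A7-A13G,TCGA-BRCA,79,"Infiltrating duct carcinoma, NOS","Pharmaceutical Therapy, NOS"
928c48a0-68ee-4e28-ae83-9832e52850ca,TCGA-CH-5753,TCGA-PRAD,70,"Adenocarcinoma, NOS","Radiation Therapy, NOS"
"""


def load_patients(csv_file):
    """Return a dict mapping case_submitter_id to its clinical record.

    Hypothetical helper for illustration; not part of the original script.
    """
    reader = csv.DictReader(csv_file)
    return {row["case_submitter_id"]: row for row in reader}


patients = load_patients(io.StringIO(CLINICAL_CSV))
for patient_id, record in patients.items():
    print(patient_id, record["project_id"], record["primary_diagnosis"])
```

Fields that contain commas, such as `primary_diagnosis`, must be quoted in the CSV for the columns to line up.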

## Workflow

The `extract_tile_pw_tcga.py` script performs the following steps:

1. A fixed number of tiles (100 by default) is randomly extracted from each WSI by the `extract_random_tiles` function. The directory in which to store the tiles, along with several parameters that define the extraction protocol (e.g. `n_tiles`, `seed`, `check_tissue`), can be set as command-line arguments.
**Note:** `histolab` automatically saves the generated tiles in the 'tiles' subdirectory.

2. The `split_tiles_patient_wise` function assigns the tiles to the training and test sets (80-20 partition by default) following a *Patient-Wise* splitting protocol, which ensures that all tiles belonging to the same subject end up either in the training set or in the test set, never in both.
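
The patient-wise idea in step 2 can be illustrated with a short standard-library sketch. This is not the script's `split_tiles_patient_wise` implementation: the function names, the placeholder patient IDs (`TCGA-XX-0000` and so on), and the assumption that each tile name starts with the 12-character case submitter ID are all hypothetical and chosen only for illustration.

```python
import random
from collections import defaultdict


def patient_id_from_tile(tile_name):
    """Assumed naming scheme: the tile name starts with the 12-character
    case submitter ID, e.g. 'TCGA-A7-A13G'."""
    return tile_name[:12]


def split_patient_wise(tile_names, test_size=0.2, seed=7):
    """Split tiles into train/test lists so that no patient appears in both."""
    # Group tiles by the patient they come from.
    tiles_by_patient = defaultdict(list)
    for name in tile_names:
        tiles_by_patient[patient_id_from_tile(name)].append(name)

    # Shuffle the patients (not the tiles) reproducibly, then pick the test patients.
    patients = sorted(tiles_by_patient)
    random.Random(seed).shuffle(patients)
    n_test = max(1, round(len(patients) * test_size))
    test_patients = set(patients[:n_test])

    # Every tile follows its patient into the corresponding partition.
    train, test = [], []
    for patient, names in tiles_by_patient.items():
        (test if patient in test_patients else train).extend(names)
    return train, test


# Placeholder data: 10 made-up patients with 3 tiles each.
tiles = [f"TCGA-XX-{i:04d}_tile_{j}.png" for i in range(10) for j in range(3)]
train, test = split_patient_wise(tiles)
print(len(train), len(test))
```

In this sketch the 80-20 ratio applies to patients rather than to tiles, so the tile-level ratio can deviate from it when patients contribute different numbers of tiles.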

## Usage

```
usage: extract_tile_pw_tcga.py [-h] [--clinical_csv CLINICAL_CSV]
                               [--wsi_dataset_dir WSI_DATASET_DIR]
                               [--tile_dataset_dir TILE_DATASET_DIR]
                               [--tile_size TILE_SIZE TILE_SIZE]
                               [--n_tiles N_TILES] [--level LEVEL]
                               [--seed SEED] [--check_tissue CHECK_TISSUE]

Retrieve a leakage-free dataset of tiles using a collection of WSI.

optional arguments:
  -h, --help            show this help message and exit
  --clinical_csv CLINICAL_CSV
                        CSV with WSI clinical data. Default examples/TCGA/clinical_csv_example.csv.
  --wsi_dataset_dir WSI_DATASET_DIR
                        Path where to save the WSIs. Default WSI_TCGA.
  --tile_dataset_dir TILE_DATASET_DIR
                        Path where to save the tiles. Default tiles_TCGA.
  --tile_size TILE_SIZE TILE_SIZE
                        Width and height of the cropped tiles. Default (512, 512).
  --n_tiles N_TILES     Maximum number of tiles to extract. Default 100.
  --level LEVEL         Magnification level from which to extract the tiles. Default 2.
  --seed SEED           Seed for RandomState. Default 7.
  --check_tissue CHECK_TISSUE
                        Whether to check if the tile has enough tissue to be saved. Default True.
```
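
For readers who want to reproduce this interface in their own pipeline, here is a sketch of an `argparse` parser that accepts the same options with the same defaults. It is reconstructed from the help text above, not copied from the script: the argument types and the `str2bool` helper are assumptions.

```python
import argparse


def str2bool(value):
    """Parse a command-line string into a boolean (hypothetical helper)."""
    return str(value).lower() in ("1", "true", "yes", "y")


def build_parser():
    """Build a parser mirroring the options listed in the usage message."""
    parser = argparse.ArgumentParser(
        prog="extract_tile_pw_tcga.py",
        description="Retrieve a leakage-free dataset of tiles using a collection of WSI.",
    )
    parser.add_argument("--clinical_csv", default="examples/TCGA/clinical_csv_example.csv")
    parser.add_argument("--wsi_dataset_dir", default="WSI_TCGA")
    parser.add_argument("--tile_dataset_dir", default="tiles_TCGA")
    # Two integers: width and height of the cropped tiles.
    parser.add_argument("--tile_size", type=int, nargs=2, default=(512, 512))
    parser.add_argument("--n_tiles", type=int, default=100)
    parser.add_argument("--level", type=int, default=2)
    parser.add_argument("--seed", type=int, default=7)
    # Assumed type: an explicit converter instead of type=bool.
    parser.add_argument("--check_tissue", type=str2bool, default=True)
    return parser


args = build_parser().parse_args(
    ["--n_tiles", "50", "--tile_size", "256", "256", "--check_tissue", "False"]
)
print(args)
```

An explicit converter is used for `--check_tissue` because `type=bool` would turn any non-empty string, including `"False"`, into `True`.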
