Improve script CLI help and add README
alessiamarcolini committed Nov 3, 2020
1 parent 9bb29e8 commit 40f7591
Showing 2 changed files with 71 additions and 13 deletions.
59 changes: 59 additions & 0 deletions examples/GTEx/README.md
@@ -0,0 +1,59 @@
# Create a leakage-free dataset of tiles for GTEx

`extract_tile_pw_gtex.py` is a Python script proposed as a reference for retrieving a reproducible dataset of tiles from a collection of WSIs in the [GTEx](https://gtexportal.org/home/) public repository. In particular, it can be easily integrated into a deep learning pipeline for computational pathology.

## Prerequisites
To run this script you will need the following packages in addition to `histolab`:

- `pandas`
- `requests`
- `tqdm`
- `scikit-learn`

Moreover, a CSV file of patient metadata (`metadata_csv`) is required. This file can be retrieved from the GTEx [data portal](https://gtexportal.org/home/histologyPage) and is structured as follows:

| Tissue Sample ID | Tissue | Subject ID | Sex | Age Bracket | Hardy Scale | Pathology Categories | Pathology Notes |
|---------------------|-----------------------------------|---------------|-----------|----------------|----------------|-------------------------|-----------------------------------------------------------------|
| GTEX-1117F-0126 | Skin - Sun Exposed (Lower leg) | GTEX-1117F | female | 60-69 | Slow death | | 6 pieces, minimal fat, squamous epithelium is ~50-70 microns |
| GTEX-1117F-0226 | Adipose - Subcutaneous | GTEX-1117F | female | 60-69 | Slow death | | 2 pieces, ~15% vessel stroma, rep delineated |
| ... | ... | ... | ... | ... | ... | ... | ... |
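
As a rough illustration of how this metadata can be grouped by patient, the sketch below reads two rows copied from the table above and maps each `Subject ID` to its `Tissue Sample ID` values. It assumes a comma-separated file whose header names match the table exactly; the helper name `samples_by_subject` is hypothetical and is not part of the script.

```python
import csv
import io
from collections import defaultdict

# Two rows copied from the table above. A real run would open the
# metadata CSV file instead of this in-memory string (assumed layout).
SAMPLE_CSV = '''Tissue Sample ID,Tissue,Subject ID,Sex,Age Bracket,Hardy Scale,Pathology Categories,Pathology Notes
GTEX-1117F-0126,Skin - Sun Exposed (Lower leg),GTEX-1117F,female,60-69,Slow death,,"6 pieces, minimal fat, squamous epithelium is ~50-70 microns"
GTEX-1117F-0226,Adipose - Subcutaneous,GTEX-1117F,female,60-69,Slow death,,"2 pieces, ~15% vessel stroma, rep delineated"
'''


def samples_by_subject(csv_file):
    """Map each Subject ID to the list of its Tissue Sample IDs (hypothetical helper)."""
    groups = defaultdict(list)
    for row in csv.DictReader(csv_file):
        groups[row["Subject ID"]].append(row["Tissue Sample ID"])
    return dict(groups)


groups = samples_by_subject(io.StringIO(SAMPLE_CSV))
print(groups)
```

Grouping by subject at this stage is what later makes a patient-wise split possible, since every slide (and therefore every tile) can be traced back to one subject.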

## Workflow

The `extract_tile_pw_gtex.py` script performs the following steps:

1. the WSIs listed in the metadata file (`Tissue Sample ID` column) are downloaded from GTEx via the `download_wsi_gtex` function; slides are saved in the `wsi_dataset_dir` directory, which is specified as a command-line argument.
2. up to a fixed number of tiles (100 by default) are randomly extracted from each WSI by the `extract_random_tiles` function. The directory in which to store the tiles, along with several parameters that define the extraction protocol (e.g. `n_tiles`, `seed`, `check_tissue`), can be set via command-line arguments.

**Note** `histolab` automatically saves the generated tiles in the 'tiles' subdirectory.

3. the `split_tiles_patient_wise` function sorts the tiles into a training set and a test set (80-20 partition by default) following a *patient-wise* splitting protocol, which ensures that all tiles belonging to the same subject end up in either the training set or the test set, never in both.
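
The patient-wise idea in step 3 can be sketched with the standard library alone. This is not the script's `split_tiles_patient_wise` implementation, which is not shown here. The sketch assumes that each tile file name starts with the Tissue Sample ID and that the first two dash-separated fields of that ID form the Subject ID (as in `GTEX-1117F-0126` and `GTEX-1117F`). The helper names, the tile file names, and all subject IDs other than `1117F` are made up for illustration. Note that the 80-20 fraction is applied to subjects here, not to individual tiles.

```python
import random


def subject_of(tile_name):
    """Assumed naming convention: '<Subject ID>-<sample suffix>_...'."""
    return "-".join(tile_name.split("-")[:2])


def split_patient_wise(tile_names, test_fraction=0.2, seed=7):
    """Assign whole subjects to train or test so no subject appears in both."""
    subjects = sorted({subject_of(name) for name in tile_names})
    rng = random.Random(seed)  # seeded for a reproducible partition
    rng.shuffle(subjects)
    n_test = max(1, round(len(subjects) * test_fraction))
    test_subjects = set(subjects[:n_test])
    train = [n for n in tile_names if subject_of(n) not in test_subjects]
    test = [n for n in tile_names if subject_of(n) in test_subjects]
    return train, test


# Hypothetical tile names: 5 subjects with 3 tiles each.
tiles = [
    f"GTEX-{subject}-0126_tile_{i}.png"
    for subject in ("1117F", "AAAAA", "BBBBB", "CCCCC", "DDDDD")
    for i in range(3)
]
train_tiles, test_tiles = split_patient_wise(tiles)
print(len(train_tiles), len(test_tiles))
```

Splitting at the subject level rather than at the tile level is what prevents leakage: a model never sees tiles from a test subject during training.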

## Usage

```
usage: extract_tile_pw_gtex.py [-h] [--metadata_csv METADATA_CSV]
                               [--wsi_dataset_dir WSI_DATASET_DIR]
                               [--tile_dataset_dir TILE_DATASET_DIR]
                               [--tile_size TILE_SIZE TILE_SIZE]
                               [--n_tiles N_TILES] [--level LEVEL]
                               [--seed SEED] [--check_tissue CHECK_TISSUE]

Retrieve a leakage-free dataset of tiles using a collection of WSI.

optional arguments:
  -h, --help            show this help message and exit
  --metadata_csv METADATA_CSV
                        CSV with WSI metadata. Default examples/GTEx/GTEx_AIDP2021.csv.
  --wsi_dataset_dir WSI_DATASET_DIR
                        Path where to save the WSIs. Default WSI_GTEx.
  --tile_dataset_dir TILE_DATASET_DIR
                        Path where to save the WSIs. Default tiles_GTEx.
  --tile_size TILE_SIZE TILE_SIZE
                        width and height of the cropped tiles. Default (512, 512).
  --n_tiles N_TILES     Maximum number of tiles to extract. Default 100.
  --level LEVEL         Magnification level from which extract the tiles. Default 2.
  --seed SEED           Seed for RandomState. Default 7.
  --check_tissue CHECK_TISSUE
                        Whether to check if the tile has enough tissue to be saved. Default True.
```
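
To make the usage text more concrete, the sketch below rebuilds a reduced copy of the parser with four of the documented options and parses one example command line. The `build_parser` function is hypothetical and exists only for illustration; the option names, types, and defaults are taken from the help text above.

```python
import argparse


def build_parser():
    """Reduced, illustrative copy of the script's CLI (not the full parser)."""
    parser = argparse.ArgumentParser(
        description="Reduced copy of the documented options (illustration only)."
    )
    parser.add_argument("--tile_size", type=int, nargs=2, default=(512, 512))
    parser.add_argument("--n_tiles", type=int, default=100)
    parser.add_argument("--level", type=int, default=2)
    parser.add_argument("--seed", type=int, default=7)
    return parser


# Equivalent to running: extract_tile_pw_gtex.py --tile_size 256 256 --n_tiles 50
args = build_parser().parse_args(["--tile_size", "256", "256", "--n_tiles", "50"])
print(args.tile_size, args.n_tiles, args.level, args.seed)  # [256, 256] 50 2 7
```

With `nargs=2`, a value given on the command line arrives as a list (`[256, 256]`), whereas the default stays a tuple (`(512, 512)`). Code that consumes `tile_size` may want to normalize it with `tuple(args.tile_size)`.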
25 changes: 12 additions & 13 deletions examples/GTEx/extract_tile_pw_gtex.py
@@ -186,54 +186,53 @@ def split_tiles_patient_wise(

 def main():
     parser = argparse.ArgumentParser(
-        description="Retrieve a leakage-free dataset of tiles using a collection of WSI"
-        "from the GTEx repository. The WSIs that will be used for tile extraction are "
-        "specified in the 'metadata_csv'. First, slides are downloaded from the GTEx "
-        "portal. Then, tiles are randomly cropped from each WSI and saved only if they "
-        "consist of, at least, 80% of tissue. Finally, tiles are sorted ...."
+        description="Retrieve a leakage-free dataset of tiles using a collection of WSI."
     )
     parser.add_argument(
         "--metadata_csv",
         type=str,
         default="examples/GTEx/GTEx_AIDP2021.csv",
-        help="CSV with WSI metadata",
+        help="CSV with WSI metadata. Default examples/GTEx/GTEx_AIDP2021.csv.",
     )
     parser.add_argument(
         "--wsi_dataset_dir",
         type=str,
         default="WSI_GTEx",
-        help="Path where to save the WSIs",
+        help="Path where to save the WSIs. Default WSI_GTEx.",
     )
     parser.add_argument(
         "--tile_dataset_dir",
         type=str,
         default="tiles_GTEx",
-        help="Path where to save the WSIs",
+        help="Path where to save the WSIs. Default tiles_GTEx.",
     )
     parser.add_argument(
         "--tile_size",
         type=int,
         nargs=2,
         default=(512, 512),
-        help="width and height of the cropped tiles",
+        help="width and height of the cropped tiles. Default (512, 512).",
     )
     parser.add_argument(
-        "--n_tiles", type=int, default=100, help="Maximum number of tiles to extract"
+        "--n_tiles",
+        type=int,
+        default=100,
+        help="Maximum number of tiles to extract. Default 100.",
     )
     parser.add_argument(
         "--level",
         type=int,
         default=2,
-        help="Magnification level from which extract the tiles",
+        help="Magnification level from which extract the tiles. Default 2.",
     )
     parser.add_argument(
-        "--seed", type=int, default=7, help="Seed for RandomState",
+        "--seed", type=int, default=7, help="Seed for RandomState. Default 7.",
     )
     parser.add_argument(
         "--check_tissue",
         type=bool,
         default=True,
-        help="Whether to check if the tile has enough tissue to be saved",
+        help="Whether to check if the tile has enough tissue to be saved. Default True.",
     )
     args = parser.parse_args()
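
One point worth checking in the `--check_tissue` option: with `type=bool`, argparse calls `bool()` on the raw string, and any non-empty string (including `"False"`) is truthy. The sketch below contrasts that behavior with an explicit converter. The `str_to_bool` helper is hypothetical and is only one possible remedy; it is not part of this commit.

```python
import argparse


def str_to_bool(value):
    """Convert common textual spellings of a boolean (hypothetical helper)."""
    lowered = value.lower()
    if lowered in ("true", "t", "yes", "y", "1"):
        return True
    if lowered in ("false", "f", "no", "n", "0"):
        return False
    raise argparse.ArgumentTypeError(f"expected a boolean, got {value!r}")


# Same declaration style as in the diff above: bool("False") is True.
naive = argparse.ArgumentParser()
naive.add_argument("--check_tissue", type=bool, default=True)

# Alternative with an explicit string-to-bool conversion.
strict = argparse.ArgumentParser()
strict.add_argument("--check_tissue", type=str_to_bool, default=True)

naive_value = naive.parse_args(["--check_tissue", "False"]).check_tissue
strict_value = strict.parse_args(["--check_tissue", "False"]).check_tissue
print(naive_value, strict_value)  # True False
```

With the declaration as committed, the tissue check can only be disabled by passing an empty string (`--check_tissue ""`), which is probably not what a user would expect.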
