# Dataset Preparation Tutorial

Welcome to the dataset preparation tutorial! In this notebook, we will download the toy data set for the tutorial and prepare the necessary tables used for later analysis. Here are the steps we will review:

1. Verify prerequisites
2. Create a new project workspace
3. Download data
4. Build the proxy table
5. Run regional annotation ETL

**NOTE**: All of the configuration files for this tutorial have been provided in the container, but you will have to download the input data and add it to the container's volume mount as shown in the steps below. 

**NOTE**: The current working directory is '~/notebooks'. All file and directory paths specified in the configuration files are relative to the current working directory. 

## 1. Verify Prerequisites

**NOTE**: This tutorial requires a machine with at least 16 GB of free memory and 40 GB of free disk space. It will not run on a laptop with 16 GB of total memory.
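
The thresholds above can be checked before you start. The following is a minimal sketch and is not part of the luna packages: the function names are hypothetical, GB is approximated as GiB, and the memory check relies on `os.sysconf` names that are only exposed on some Unix systems such as Linux (elsewhere the memory check is skipped).

```python
import os
import shutil

GIB = 1024 ** 3


def free_disk_gib(path="."):
    """Return the free disk space at `path` in GiB."""
    return shutil.disk_usage(path).free / GIB


def available_memory_gib():
    """Return available physical memory in GiB, or None if it cannot be measured."""
    try:
        pages = os.sysconf("SC_AVPHYS_PAGES")
        page_size = os.sysconf("SC_PAGE_SIZE")
    except (AttributeError, ValueError, OSError):
        # os.sysconf or these names are not available on this platform
        return None
    if pages < 0 or page_size < 0:
        return None
    return pages * page_size / GIB


def meets_requirements(min_disk_gib=40, min_mem_gib=16, path="."):
    """Compare disk and (when measurable) memory against the stated minimums."""
    disk_ok = free_disk_gib(path) >= min_disk_gib
    mem = available_memory_gib()
    mem_ok = True if mem is None else mem >= min_mem_gib
    return disk_ok and mem_ok


print(f"free disk: {free_disk_gib():.1f} GiB")
print(f"available memory (GiB): {available_memory_gib()}")
print(f"meets tutorial requirements: {meets_requirements()}")
```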

Here are the software prerequisites for executing tasks with luna packages. These prerequisites have already been baked into this Docker container. To view the setup, please see the corresponding Dockerfile.

In [3]:
!python3 --version
!java -version
%env JAVA_HOME=/usr
!echo PYSPARK_PYTHON: $PYSPARK_PYTHON
!echo PYSPARK_DRIVER_PYTHON: $PYSPARK_DRIVER_PYTHON
!echo SPARK_HOME: $SPARK_HOME
!echo JAVA_HOME: $JAVA_HOME
!echo LUNA_HOME: $LUNA_HOME
!which jupyter
!pip list | grep luna-
import luna
luna.__path__
import luna.pathology
luna.pathology.__path__

Python 3.6.9
openjdk version "11.0.11" 2021-04-20
OpenJDK Runtime Environment (build 11.0.11+9-Ubuntu-0ubuntu2.18.04)
OpenJDK 64-Bit Server VM (build 11.0.11+9-Ubuntu-0ubuntu2.18.04, mixed mode, sharing)
env: JAVA_HOME=/usr
PYSPARK_PYTHON: /usr/bin/python3
PYSPARK_DRIVER_PYTHON: /usr/bin/python3
SPARK_HOME:
JAVA_HOME: /usr
LUNA_HOME: /home/pashaa/luna
/home/pashaa/.local/bin/jupyter
pyluna-common           0.0.3
pyluna-core             0.0.3
pyluna-pathology        0.0.3


['/home/pashaa/.local/lib/python3.6/site-packages/luna/pathology']

## 2. Create a new project workspace



Next, we create a luna home directory and place the configuration files there. Using a manifest file, we then create a project workspace that will hold the configurations, data, models, and outputs for this tutorial.

In [4]:
%%bash
mkdir -p ~/luna
cp -R ~/vmount/conf ~/luna
cat ~/luna/conf/manifest.yaml
python3 -m luna.project.generate --manifest_file ~/luna/conf/manifest.yaml
tree ~/vmount/PRO_12-123

# project manifest template

# MIND project id
PROJECT: PRO_12-123

# IRB
IRB:

# project title
TITLE: pathology-tutorial

# project description
DESCRIPTION: End-to-end pathology analysis tutorial

DATA_MODALITIES: pathology

ROOT_PATH: ../
/home/pashaa/vmount/PRO_12-123
└── manifest.yaml

0 directories, 1 file


2021-12-06 11:33:26,484 - INFO - root - FYI: Initalized logger, log file at: data-processing.log with handlers: [<StreamHandler <stderr> (INFO)>, <RotatingFileHandler /home/pashaa/vmount/notebooks/data-processing.log (INFO)>]
2021-12-06 11:33:26,485 - INFO - luna.common.config - loading config file /home/pashaa/luna/conf/manifest.yaml
2021-12-06 11:33:26,499 - INFO - root - config files copied to ../PRO_12-123
2021-12-06 11:33:26,499 - INFO - root - Code block 'generate project folder' took: 0.014463242143392563s


You should now see a new directory called *PRO_12-123* with the manifest file in it. This will be your project workspace!
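
If you prefer to confirm this from Python rather than by reading the `tree` output, a small check along these lines may help. It is only a sketch: `check_workspace` is a hypothetical helper, not a luna function, and it only tests that the directory and its `manifest.yaml` exist.

```python
from pathlib import Path


def check_workspace(project_dir):
    """Return the manifest path if the workspace looks initialized, else raise."""
    project_dir = Path(project_dir).expanduser()
    manifest = project_dir / "manifest.yaml"
    if not project_dir.is_dir():
        raise FileNotFoundError(f"project directory not found: {project_dir}")
    if not manifest.is_file():
        raise FileNotFoundError(f"manifest not found: {manifest}")
    return manifest
```

In this tutorial's layout the call would be `check_workspace("~/vmount/PRO_12-123")`.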

## 3. Download data

The data that you will be using for this tutorial is a set of 5 whole slide images of ovarian cancer H&E slides, available in the svs file format. Whole slide imaging refers to the scanning of conventional glass slides for research purposes; in this case, these are slides that oncologists have used to inspect cancer samples! We will download these images from Synapse, a data warehouse used for digital research.

We will now make a folder for your data and the toy data set in this new project workspace.

In [5]:
%%bash
mkdir -p ~/vmount/PRO_12-123/data/toy_data_set
tree ~/vmount/PRO_12-123

/home/pashaa/vmount/PRO_12-123
├── data
│   └── toy_data_set
└── manifest.yaml

2 directories, 1 file


You can find the pathology slides for your toy data set on Synapse. First, navigate to the Synapse website (https://www.synapse.org/) and create an account if you do not already have one. Once your account is created, open the site, search for the project ID (syn25946167) in the right-hand corner, click the "Files" tab, and download the tar.gz file as a file (not as a package). This may take a while, as you will be downloading a little under 5 GB of data onto your machine. Once the download finishes, extract the archive, and then move the five svs files into the host '~/vmount/PRO_12-123/data/toy_data_set/' directory that is volume mounted into the container.
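
The extraction step can also be scripted. The sketch below uses only the Python standard library and is an illustration, not part of the tutorial's tooling: `extract_svs` is a hypothetical helper that pulls only the `.svs` members out of the archive and drops any folder structure inside it.

```python
import shutil
import tarfile
from pathlib import Path


def extract_svs(archive_path, dest_dir):
    """Extract only the .svs members of a tar archive into dest_dir (flattened)."""
    dest = Path(dest_dir).expanduser()
    dest.mkdir(parents=True, exist_ok=True)
    extracted = []
    with tarfile.open(Path(archive_path).expanduser(), "r:*") as tar:
        for member in tar.getmembers():
            if not (member.isfile() and member.name.lower().endswith(".svs")):
                continue
            # keep only the file name, ignoring directories inside the archive
            target = dest / Path(member.name).name
            with tar.extractfile(member) as src, open(target, "wb") as out:
                # stream the copy, since whole slide images are large
                shutil.copyfileobj(src, out)
            extracted.append(target)
    return sorted(extracted)
```

You would call it with the path of the archive you downloaded and `~/vmount/PRO_12-123/data/toy_data_set` as the destination; the archive's file name depends on your download and is not assumed here.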

You should now be able to view the .svs files from your notebook as shown below. 

In [6]:
%%bash
cp ~/vmount/*.svs ~/vmount/PRO_12-123/data/toy_data_set
tree ~/vmount/PRO_12-123/data/toy_data_set

/home/pashaa/vmount/PRO_12-123/data/toy_data_set
├── 2551028.svs
├── 2551129.svs
├── 2551389.svs
├── 2551531.svs
└── 2551571.svs

0 directories, 5 files


## 4. Build the proxy table

Now, we will run the Whole Slide Image (WSI) ETL to build a meta-data catalog of the slides in a proxy table. 

For reference, ETL stands for extract-transform-load: a process that extracts data from a source, transforms it (for example by cleaning it or converting data types), and loads it into a target system.
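
To make the three stages concrete, here is a toy ETL over a directory of slide files. It is a sketch under stated assumptions and not how luna builds its proxy table: it writes a CSV instead of a Delta table, it records only three fields (`path`, `length`, `slide_id`), and the function names are hypothetical.

```python
import csv
from pathlib import Path

FIELDS = ["path", "length", "slide_id"]


def extract(slide_dir):
    """Extract: list the .svs files in a directory."""
    return sorted(Path(slide_dir).glob("*.svs"))


def transform(paths):
    """Transform: turn each file into a metadata record."""
    return [
        {"path": str(p), "length": p.stat().st_size, "slide_id": p.stem}
        for p in paths
    ]


def load(records, out_csv):
    """Load: write the records to a CSV catalog and return its path."""
    with open(out_csv, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(records)
    return out_csv
```

Chaining the stages as `load(transform(extract(slide_dir)), "catalog.csv")` yields one row per slide, which is the same idea as the metadata catalog built in this section.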

In [7]:
!cat ~/luna/conf/wsi_config.yaml 

REQUESTOR: viki mancoridis                                     # The name of the requestor. You are likely the requestor
REQUESTOR_DEPARTMENT: computational oncology                   # The department to which the requestor belongs
REQUESTOR_EMAIL: MancoriV@mskcc.org                            # The email address of the requestor
PROJECT: PRO_12-123                                            # The project name decided by data coordination
SOURCE: toy_set                                                # Source name of the input data file
MODALITY: radiology                                            # Data modality
DATA_TYPE: WSI                                                 # Data type within this modality
COMMENTS:                                                      # Description of template defined by requestor. You may leave blank
DATE: 2021-07-06                                               # The date on which the request was made, likely today
DATASET_NAME: toy_data_s

In [8]:
%%bash
python3 -m luna.pathology.proxy_table.generate \
        -d ~/luna/conf/wsi_config.yaml \
        -a ~/luna/conf/app_config.yaml \
        -p delta


:: loading settings :: url = jar:file:/home/pashaa/.local/lib/python3.6/site-packages/pyspark/jars/ivy-2.4.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
root
 |-- path: string (nullable = true)
 |-- modificationTime: timestamp (nullable = true)
 |-- length: long (nullable = true)
 |-- wsi_record_uuid: string (nullable = true)
 |-- slide_id: string (nullable = true)
 |-- metadata: map (nullable = true)
 |    |-- key: string
 |    |-- value: string (valueContainsNull = true)

+--------------------+--------------------+----------+--------------------+--------+--------------------+
|                path|    modificationTime|    length|     wsi_record_uuid|slide_id|            metadata|
+--------------------+--------------------+----------+--------------------+--------+--------------------+
|file:/home/pashaa...|2021-12-06 11:34:...|1413574341|WSI-03662b6be585f...| 2551571|{aperio_User -> d...|
|file:/home/pashaa...|2021-12-06 11:33:...|1322921471|WSI-1ba07f58166fc...| 2551028|{aperio

2021-12-06 11:34:27,529 - INFO - root - FYI: Initalized logger, log file at: data-processing.log with handlers: [<StreamHandler <stderr> (INFO)>, <RotatingFileHandler /home/pashaa/vmount/notebooks/data-processing.log (INFO)>]
2021-12-06 11:34:27,529 - INFO - root - data_ingestions_template: /home/pashaa/luna/conf/wsi_config.yaml
2021-12-06 11:34:27,529 - INFO - root - config_file: /home/pashaa/luna/conf/app_config.yaml
2021-12-06 11:34:27,529 - INFO - root - processes: ['delta']
2021-12-06 11:34:27,529 - INFO - luna.common.config - loading config file /home/pashaa/luna/conf/app_config.yaml
2021-12-06 11:34:27,531 - INFO - luna.common.config - loading config file /home/pashaa/luna/conf/wsi_config.yaml
2021-12-06 11:34:27,536 - INFO - luna.common.config - validating config /home/pashaa/luna/conf/wsi_config.yaml against schema /home/pashaa/.local/lib/python3.6/site-packages/luna/pathology/proxy_table/data_ingestion_template_schema.yml for DATA_CFG
2021-12-06 11:34:27,567 - INFO - root - c

This step may take a while. At the end, your proxy table should be generated!

Before we view the table, we must first update it to associate patient IDs with the slides. This is necessary for correctly training and validating the machine learning model in the coming notebooks. Once the slides are divided into "tiles" in the next notebook, the tiles are split between the training and validation sets for the ML model. If the tiles do not have patient IDs associated with them, then tiles from one individual can appear in both the training and validation sets; this would give researchers an exaggerated impression of the model's accuracy, since we would essentially be validating the model on information that is too close to what it has already seen.
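
The idea of a patient-level split can be sketched in a few lines. This is an illustration only, not the split used in the later notebooks: the record layout (dicts with a `patient_id` key) and the function name are assumptions made for this example.

```python
import random
from collections import defaultdict


def split_by_patient(tiles, val_fraction=0.2, seed=0):
    """Split tile records into train/validation so no patient appears in both.

    `tiles` is a list of dicts that each carry a 'patient_id' key.
    """
    by_patient = defaultdict(list)
    for tile in tiles:
        by_patient[tile["patient_id"]].append(tile)

    # shuffle patients (not tiles), then assign whole patients to one side
    patients = sorted(by_patient)
    random.Random(seed).shuffle(patients)
    n_val = max(1, round(len(patients) * val_fraction))
    val_patients = patients[:n_val]
    train_patients = patients[n_val:]

    train = [t for p in train_patients for t in by_patient[p]]
    val = [t for p in val_patients for t in by_patient[p]]
    return train, val
```

Because whole patients are assigned to one side or the other, the split sizes are only approximately `val_fraction` when patients contribute different numbers of tiles. With a single patient, every tile lands in the validation set.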

Note that we will not be using patient IDs associated with MSK. Instead, we will be using spoof IDs that will suffice for this tutorial. When running this workflow with real data, make sure to include the IDs safely and securely. Run the following block of code to add a 'patient_id' column to the table and store it using Spark.

In [9]:
from pyspark.sql import SparkSession

# setup spark session
spark = SparkSession.builder \
        .appName("test") \
        .master('local[*]') \
        .config("spark.driver.host", "127.0.0.1") \
        .config("spark.jars.packages", "io.delta:delta-core_2.12:0.7.0") \
        .config("spark.delta.logStore.class", "org.apache.spark.sql.delta.storage.HDFSLogStore") \
        .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
        .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog") \
        .config("spark.databricks.delta.retentionDurationCheck.enabled", "false") \
        .config("spark.hadoop.dfs.client.use.datanode.hostname", "true") \
        .config("spark.driver.memory", "6g") \
        .config("spark.executor.memory", "6g") \
        .getOrCreate()

print(spark)

# read WSI delta table
wsi_table = spark.read.format("delta").load("../PRO_12-123/tables/WSI_toy_data_set").toPandas()

# insert spoof patient ids
patient_id=[1,2,3,4,5]
wsi_table['patient_id']=patient_id

wsi_table

# convert back to a spark table (update table)
x = spark.createDataFrame(wsi_table)
x.write.format("delta").mode("overwrite").option("mergeSchema", "true").save("../PRO_12-123/tables/WSI_toy_data_set")


<pyspark.sql.session.SparkSession object at 0x7fc91e735a90>


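Vacuum the Delta table with a retention of 0 hours so that only the files for the latest version remain. This reduces the table to a single layer, so all data can be read as a parquet table.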

In [10]:
from delta.tables import DeltaTable
wsi_table = DeltaTable.forPath(spark, "../PRO_12-123/tables/WSI_toy_data_set")  
wsi_table.vacuum(0)

DataFrame[]

Next, we may view the WSI table! This table should have the metadata associated with the WSI slides that you just collected, including the patient IDs. 

In [11]:
# read WSI delta table
wsi_table = spark.read.format("delta") \
            .load("../PRO_12-123/tables/WSI_toy_data_set").toPandas()

# view table
wsi_table


Unnamed: 0,path,modificationTime,length,wsi_record_uuid,slide_id,metadata,patient_id
0,file:/home/pashaa/vmount/PRO_12-123/data/toy_d...,2021-12-06 11:34:13.058,1413574341,WSI-03662b6be585f8bdb1a16a175a7cfda07c4057afe5...,2551571,"{'aperio_StripeWidth': '2032', 'aperio_User': ...",1
1,file:/home/pashaa/vmount/PRO_12-123/data/toy_d...,2021-12-06 11:33:53.967,584611357,WSI-93ccfd50a210d0b8c7589352be9036ef5abf6b4f81...,2551129,"{'aperio_StripeWidth': '2032', 'aperio_User': ...",2
2,file:/home/pashaa/vmount/PRO_12-123/data/toy_d...,2021-12-06 11:34:04.663,520642043,WSI-12677b7d98691d1eef8043727f27878eb9fda14b65...,2551531,"{'aperio_StripeWidth': '2032', 'aperio_User': ...",3
3,file:/home/pashaa/vmount/PRO_12-123/data/toy_d...,2021-12-06 11:33:51.883,1322921471,WSI-1ba07f58166fc2073c854dd9b00a11eaca2203ff20...,2551028,"{'aperio_Left': '12.423057', 'aperio_StripeWid...",4
4,file:/home/pashaa/vmount/PRO_12-123/data/toy_d...,2021-12-06 11:33:58.155,966069709,WSI-f3890775a7f36c982aae28ac58de43b1852652fc20...,2551389,"{'aperio_Left': '23.100784', 'aperio_StripeWid...",5


If the table is displayed above, congratulations! You have successfully run the Whole Slide Image (WSI) ETL and cataloged the slides.

## 5. Run the regional annotation ETL

The whole slide images that you downloaded are images of ovarian cancer, but not every pixel on each slide is a tumor. In fact, the images show tumor cells, normal ovarian cells, necrosis (dead cells), fibrosis (scarred cells), and more. Pathologists at Memorial Sloan Kettering examined each slide and denoted these different features by hand, providing us with regional annotations. You may think of regional annotations as scientific highlighter marks over the different regions of the image.

What actually happens when the regional annotation ETL is run? First, annotation bitmaps are downloaded from SlideViewer, a repository which stores WSI images and their annotation data. These bitmaps are converted into numpy arrays, which are then converted into GeoJSON files and organized in the proxy table. The GeoJSON files store the annotation regions marked by pathologists as polygons, which makes the data simpler to store and analyze. Once the annotation files are loaded into QuPath (software used for digital pathology) later in the pipeline, this data format becomes very convenient to work with.
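
To illustrate what "regions stored as polygons" means, the sketch below turns a tiny label mask into a GeoJSON FeatureCollection. It is deliberately simplified and is not the luna implementation: it outlines each label's bounding box instead of tracing contours, it uses a nested Python list rather than a numpy array, and the property names and label names are hypothetical.

```python
import json


def mask_to_geojson(mask, label_names):
    """Build a GeoJSON FeatureCollection with one bounding-box polygon per label.

    `mask` is a 2D list of integer labels, where 0 means "not annotated".
    """
    boxes = {}  # label -> [min_x, min_y, max_x, max_y]
    for y, row in enumerate(mask):
        for x, label in enumerate(row):
            if label == 0:
                continue
            box = boxes.setdefault(label, [x, y, x, y])
            box[0] = min(box[0], x)
            box[1] = min(box[1], y)
            box[2] = max(box[2], x)
            box[3] = max(box[3], y)

    features = []
    for label, (x0, y0, x1, y1) in sorted(boxes.items()):
        # pixel (x, y) covers [x, x+1] x [y, y+1], so extend the max edges by one;
        # a GeoJSON ring repeats its first point at the end
        ring = [[x0, y0], [x1 + 1, y0], [x1 + 1, y1 + 1], [x0, y1 + 1], [x0, y0]]
        features.append({
            "type": "Feature",
            "properties": {
                "label_num": label,
                "label_name": label_names.get(label, "unknown"),
            },
            "geometry": {"type": "Polygon", "coordinates": [ring]},
        })
    return {"type": "FeatureCollection", "features": features}


example = mask_to_geojson(
    [[0, 1, 1],
     [0, 1, 1],
     [2, 0, 0]],
    {1: "tumor", 2: "stroma"},
)
print(json.dumps(example, indent=2))
```

A real pipeline would trace the actual region outlines, which is what the `Building contours for label ...` messages in the ETL output suggest; the bounding boxes here only show the shape of the resulting data.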

To run the regional annotation ETL, try:

In [14]:
%%bash

python3 -m luna.pathology.refined_table.regional_annotation.dask_generate \
        -d ~/luna/conf/regional_annotation_config.yaml \
        -a ~/luna/conf/app_config.yaml


No label 1 found
Building contours for label 2
num_pixels with label 344474790
num_contours 2
[-1, 0]
No label 3 found
Building contours for label 4
num_pixels with label 62336170
num_contours 3
[-1, 0, 0]
No label 5 found
No label 6 found
Building contours for label 7
num_pixels with label 2720232
num_contours 2
[-1, -1]
No label 8 found
No label 9 found
No label 10 found
No label 11 found
No label 12 found
No label 13 found
No label 14 found
No label 15 found
Building contours for label 1
num_pixels with label 3612930
num_contours 3
[-1, -1, -1]
No label 2 found
Building contours for label 3
num_pixels with label 20257170
num_contours 3
[-1, -1, -1]
No label 4 found
Building contours for label 5
num_pixels with label 38403188
num_contours 2
[-1, -1]
Building contours for label 6
num_pixels with label 28809658
num_contours 29
[-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1]
Building contours for label 7
num_pixels wit

2021-12-06 12:32:46,963 - INFO - root - FYI: Initalized logger, log file at: data-processing.log with handlers: [<StreamHandler <stderr> (INFO)>, <RotatingFileHandler /home/pashaa/vmount/notebooks/data-processing.log (INFO)>]
2021-12-06 12:32:46,963 - INFO - luna.common.config - loading config file /home/pashaa/luna/conf/regional_annotation_config.yaml
2021-12-06 12:32:46,972 - INFO - luna.common.config - loading config file /home/pashaa/luna/conf/app_config.yaml
2021-12-06 12:32:46,974 - INFO - root - data template: /home/pashaa/luna/conf/regional_annotation_config.yaml
2021-12-06 12:32:46,974 - INFO - root - config_file: /home/pashaa/luna/conf/app_config.yaml
2021-12-06 12:32:47,001 - INFO - root - config files copied to ../PRO_12-123/configs/REGIONAL_METADATA_RESULTS
2021-12-06 12:32:50,215 - INFO - root - FYI: Initalized logger, log file at: data-processing.log with handlers: [<StreamHandler <stderr> (INFO)>, <RotatingFileHandler /home/pashaa/vmount/notebooks/data-processing.log (I

To check that the regional annotation ETL ran correctly, wait for the Jupyter cell to finish and then load the regional annotations table. This table contains the metadata saved from running the ETL, including paths to the bitmap files, numpy files, and GeoJSON files mentioned above. To load the table, run the following code cell:

In [15]:
from pyarrow.parquet import read_table

regional_annotation_table = read_table("../PRO_12-123/tables/REGIONAL_METADATA_RESULTS",
                                      filters = [('user', '!=', 'CONCAT')]).to_pandas()
regional_annotation_table


Unnamed: 0,sv_project_id,slideviewer_path,slide_id,user,bmp_filepath,npy_filepath,geojson_path,date,labelset
0,134,2019;HobS19-409411851898;2551028.svs,2551028,ellensol,./regional_bmps/2019_HobS19-409411851898_25510...,./regional_npys/2019_HobS19-409411851898_25510...,./slides/2551028/ellensol/RegionalAnnotationJS...,2021-12-06 12:32:50.376895,DEFAULT_LABELS
1,134,2019;HobS19-409411851898;2551028.svs,2551028,ellensol,./regional_bmps/2019_HobS19-409411851898_25510...,./regional_npys/2019_HobS19-409411851898_25510...,./slides/2551028/ellensol/RegionalAnnotationJS...,2021-12-06 12:32:50.376895,PIXEL_CLASSIFIER_LABELS
2,134,2019;HobS19-409411851898;2551028.svs,2551028,ellensol,./regional_bmps/2019_HobS19-409411851898_25510...,./regional_npys/2019_HobS19-409411851898_25510...,./slides/2551028/ellensol/RegionalAnnotationJS...,2021-12-06 12:32:50.376895,OBJECT_CLASSIFIER_LABELS
3,134,2019;HobS19-409411851898;2551028.svs,2551028,ellensol,./regional_bmps/2019_HobS19-409411851898_25510...,./regional_npys/2019_HobS19-409411851898_25510...,./slides/2551028/ellensol/RegionalAnnotationJS...,2021-12-06 12:32:50.376895,SIMPLIFIED_PIXEL_CLASSIFIER_LABELS
4,134,2019;HobS19-159147602774;2551129.svs,2551129,ellensol,./regional_bmps/2019_HobS19-159147602774_25511...,./regional_npys/2019_HobS19-159147602774_25511...,./slides/2551129/ellensol/RegionalAnnotationJS...,2021-12-06 12:32:50.376746,DEFAULT_LABELS
5,134,2019;HobS19-159147602774;2551129.svs,2551129,ellensol,./regional_bmps/2019_HobS19-159147602774_25511...,./regional_npys/2019_HobS19-159147602774_25511...,./slides/2551129/ellensol/RegionalAnnotationJS...,2021-12-06 12:32:50.376746,PIXEL_CLASSIFIER_LABELS
6,134,2019;HobS19-159147602774;2551129.svs,2551129,ellensol,./regional_bmps/2019_HobS19-159147602774_25511...,./regional_npys/2019_HobS19-159147602774_25511...,./slides/2551129/ellensol/RegionalAnnotationJS...,2021-12-06 12:32:50.376746,OBJECT_CLASSIFIER_LABELS
7,134,2019;HobS19-159147602774;2551129.svs,2551129,ellensol,./regional_bmps/2019_HobS19-159147602774_25511...,./regional_npys/2019_HobS19-159147602774_25511...,./slides/2551129/ellensol/RegionalAnnotationJS...,2021-12-06 12:32:50.376746,SIMPLIFIED_PIXEL_CLASSIFIER_LABELS
8,134,2019;HobS19-475053909405;2551389.svs,2551389,soslowr,./regional_bmps/2019_HobS19-475053909405_25513...,./regional_npys/2019_HobS19-475053909405_25513...,./slides/2551389/soslowr/RegionalAnnotationJSO...,2021-12-06 12:32:50.376476,DEFAULT_LABELS
9,134,2019;HobS19-475053909405;2551389.svs,2551389,soslowr,./regional_bmps/2019_HobS19-475053909405_25513...,./regional_npys/2019_HobS19-475053909405_25513...,./slides/2551389/soslowr/RegionalAnnotationJSO...,2021-12-06 12:32:50.376476,PIXEL_CLASSIFIER_LABELS


At this point, you have successfully set up your workspace, downloaded the data, and run both the pathology and regional annotation ETLs to prepare your data. You are ready to move on to the tiling notebook!