# Dataset Preparation Tutorial

Welcome to the dataset preparation tutorial! In this notebook, we will download the toy dataset for the tutorial and prepare the tables used for later analysis. We will cover the following steps:

1. Verify prerequisites
2. Create a new project workspace
3. Review sample dataset
4. Build the proxy table
5. Run regional annotation ETL

**NOTE**: All of the configuration files for this tutorial have been provided in the container. The host and port values in the configuration files are dynamically set based on your system. 

**NOTE**: The current working directory is '~/vmount/notebooks'. All file and directory paths specified in the configuration files are relative to the current working directory. 

## 1. Verify prerequisites

Here are the software prerequisites for executing tasks with luna packages. These prerequisites have already been baked into this docker container. To view the setup, please see the corresponding dockerfile. 

In [1]:
!python3 --version
!echo LUNA_HOME: $LUNA_HOME
import luna.pathology
print(luna.pathology.__path__)

Python 3.9.16
LUNA_HOME: /home/pollardw/vmount
['/opt/conda/lib/python3.9/site-packages/luna/pathology']


Verify that the dask scheduler is running in the docker container.

In [2]:
# !ps -ewo pid,ppid,args
!echo "LUNA_DASK_SCHEDULER: $LUNA_DASK_SCHEDULER"

LUNA_DASK_SCHEDULER: tcp://192.168.176.4:8786
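
Printing the variable only shows that an address is configured. As a minimal sketch, assuming the address has the `tcp://host:port` form shown above, the following cell tries to open a TCP connection to that address. The helper names `parse_scheduler_address` and `scheduler_is_reachable` are hypothetical and not part of luna or dask.

```python
import os
import socket
from urllib.parse import urlparse


def parse_scheduler_address(address):
    """Split an address such as 'tcp://192.168.176.4:8786' into (host, port)."""
    parsed = urlparse(address)
    if parsed.hostname is None or parsed.port is None:
        raise ValueError(f"Cannot parse scheduler address: {address!r}")
    return parsed.hostname, parsed.port


def scheduler_is_reachable(address, timeout=2.0):
    """Return True if a TCP connection to the scheduler port can be opened."""
    host, port = parse_scheduler_address(address)
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# Hypothetical check: only tests that the port accepts connections,
# not that a healthy dask scheduler is answering on it.
address = os.environ.get("LUNA_DASK_SCHEDULER", "")
if address:
    print(address, "reachable:", scheduler_is_reachable(address))
else:
    print("LUNA_DASK_SCHEDULER is not set")
```

An open port is a necessary but not sufficient sign that the scheduler is healthy, so treat a `True` result as a quick sanity check only.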


## 2. Create a new project workspace



Next, we create a luna home space and place the configuration files there. Using a manifest file, we will create a project workspace where your configurations, data, models, and outputs will go for this tutorial.

In [3]:
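%%bash
# Create the luna home space and copy the provided configuration files into it.
mkdir -p ~/luna
cp -R ~/vmount/conf ~/luna
cat ~/luna/conf/manifest.yaml
# Make sure the project workspace exists before copying the manifest into it.
mkdir -p "${LUNA_HOME}/PRO-12-123"
cp ~/luna/conf/manifest.yaml "${LUNA_HOME}/PRO-12-123/"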

# project manifest template

# MIND project id
PROJECT: PRO-12-123

# IRB
IRB:

# project title
TITLE: pathology-tutorial

# project description
DESCRIPTION: End-to-end pathology analysis tutorial

DATA_MODALITIES: pathology

ROOT_PATH: ../


You should now see the `manifest.yaml` file in your `vmount/PRO-12-123` directory. This directory is your project workspace.
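
To confirm that the copy worked, the manifest can also be checked from Python. This is a minimal sketch that assumes the manifest stays a flat list of `KEY: value` lines, as printed above. The helpers `read_flat_manifest` and `missing_keys` are hypothetical, and a real YAML parser would be the safer choice for anything with nesting.

```python
import os
from pathlib import Path

# Keys that appear in the manifest printed above.
EXPECTED_KEYS = ["PROJECT", "IRB", "TITLE", "DESCRIPTION", "DATA_MODALITIES", "ROOT_PATH"]


def read_flat_manifest(text):
    """Parse 'KEY: value' lines, skipping blank lines and '#' comments.

    Assumes a flat manifest with no nesting.
    """
    entries = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or ":" not in line:
            continue
        key, _, value = line.partition(":")
        entries[key.strip()] = value.strip()
    return entries


def missing_keys(entries):
    """Return the expected keys that are absent from the manifest."""
    return [key for key in EXPECTED_KEYS if key not in entries]


manifest_path = Path(os.environ.get("LUNA_HOME", ".")) / "PRO-12-123" / "manifest.yaml"
if manifest_path.exists():
    entries = read_flat_manifest(manifest_path.read_text())
    print("project:", entries.get("PROJECT"))
    print("missing keys:", missing_keys(entries))
else:
    print(f"{manifest_path} not found")
```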

## 3. Review sample dataset

The data we'll be using for this tutorial is a set of 5 whole slide images (WSI) of ovarian cancer H&E slides, available in the svs file format. Whole slide imaging refers to the scanning of conventional glass slides for research purposes; in this case, these are slides that oncologists have used while inspecting cancer samples.

The slides have been downloaded by the script `vmount/provision_girder.py`, which ...
  
  - Creates an admin user and default assetstore
  
  - Downloads sample data from the [public kitware site](https://data.kitware.com/#user/61b9f3dc4acac99f42ca7678/folder/61b9f4564acac99f42ca7692) to `~/vmount/PRO-12-123/data/toy_data_set/`
  
  - Creates a collection and adds the slides to your local DSA
    
If this was successful, the downloaded svs files will be listed by the `tree` command below.

In [4]:
!tree "${LUNA_HOME}/PRO-12-123/data/toy_data_set"

/home/pollardw/vmount/PRO-12-123/data/toy_data_set
├── 01OV002-bd8cdc70-3d46-40ae-99c4-90ef77.svs
├── 01OV002-ed65cf94-8bc6-492b-9149-adc16f.svs
├── 01OV007-9b90eb78-2f50-4aeb-b010-d642f9.svs
├── 01OV008-308ad404-7079-4ff8-8232-12ee2e.svs
├── 01OV008-7579323e-2fae-43a9-b00f-a15c28.svs
└── table
    ├── ANNOTATIONS
    │   ├── 01OV002-bd8cdc70-3d46-40ae-99c4-90ef77.annotation.geojson
    │   ├── 01OV002-ed65cf94-8bc6-492b-9149-adc16f.annotation.geojson
    │   ├── 01OV007-9b90eb78-2f50-4aeb-b010-d642f9.annotation.geojson
    │   ├── 01OV008-308ad404-7079-4ff8-8232-12ee2e.annotation.geojson
    │   ├── 01OV008-7579323e-2fae-43a9-b00f-a15c28.annotation.geojson
    │   ├── metadata.yml
    │   └── slide_annotation_dataset_TCGA collection_ov_regional.parquet
    └── SLIDES
        └── slide_ingest_PRO-12-123.parquet

3 directories, 13 files
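
The same check can be scripted. The sketch below assumes the directory layout shown in the listing above and reports any `.svs` file that has no matching `.annotation.geojson` file. The helper `slides_missing_annotations` is hypothetical. Note that the annotation files are produced by the regional annotation ETL, so on a fresh workspace every slide may be reported until that step has run.

```python
import os
from pathlib import Path


def slides_missing_annotations(data_dir):
    """Return stems of .svs files lacking a matching .annotation.geojson file."""
    data_dir = Path(data_dir)
    annotation_dir = data_dir / "table" / "ANNOTATIONS"
    missing = []
    for slide in sorted(data_dir.glob("*.svs")):
        expected = annotation_dir / f"{slide.stem}.annotation.geojson"
        if not expected.exists():
            missing.append(slide.stem)
    return missing


data_dir = Path(os.environ.get("LUNA_HOME", ".")) / "PRO-12-123" / "data" / "toy_data_set"
print("slides found:", len(list(data_dir.glob("*.svs"))))
print("slides without annotations:", slides_missing_annotations(data_dir))
```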


These files may also be viewed through the Minio instance. The URL for this instance is available in the docker-compose terminal log. 

If you want to import your own data, you can do so from your local filesystem or from an object store. For more details, refer to the [girder user documentation](https://girder.readthedocs.io/en/latest/user-guide.html#assetstores).

To import images from your local filesystem:

- Log in to DSA with admin/password1
- Navigate to **Collections** and use the create collection interface to create a new collection
- Under the branch icon on the right, create a folder within the collection
- In this folder, use the blue **info** button to find the folder's Unique ID, then copy it somewhere for reference
- Place the images on your local computer under `vmount/assetstore`
- Navigate to **Admin Console** -> **Assetstores**
- From the default assetstore, click on **Import data**
- Specify the absolute path to the images you wish to import, e.g. `/home/<user>/vmount/assetstore/yourimage`. Set the destination type to 'Folder' and the destination ID to the ID copied earlier, then click import

As the `/assetstore` mount is available to DSA, this import should be much faster than uploading the image through the **Upload files** button in the UI.


## 4. Build the proxy table

Now, we will run the Whole Slide Image (WSI) ETL to build a metadata catalog of the slides in a proxy table.

For reference, ETL stands for extract-transform-load. It is a process that typically involves extracting data from a source, cleaning it and transforming its data types, and loading the result into a different system.
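
To make the three stages concrete, here is a toy sketch of the general ETL pattern using only the Python standard library. It is not how `slide_etl` works internally. The CSV content, the column names, and the `slides` table are all hypothetical.

```python
import csv
import io
import sqlite3

# Hypothetical raw input: slide names with sizes recorded as text.
RAW_CSV = """slide,size_mb
slide_a.svs,120.5
slide_b.svs,98
"""


def extract(text):
    """Extract: read raw rows from CSV text."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: strip the file extension and convert sizes from text to float."""
    return [(row["slide"].rsplit(".", 1)[0], float(row["size_mb"])) for row in rows]


def load(records, connection):
    """Load: write the cleaned records into a database table."""
    connection.execute("CREATE TABLE slides (slide_id TEXT, size_mb REAL)")
    connection.executemany("INSERT INTO slides VALUES (?, ?)", records)
    connection.commit()


connection = sqlite3.connect(":memory:")
load(transform(extract(RAW_CSV)), connection)
print(connection.execute("SELECT slide_id, size_mb FROM slides").fetchall())
```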

In [5]:
!slide_etl "${LUNA_HOME}/PRO-12-123/data/toy_data_set" \
--project_name PRO-12-123 --comment "Example Ingestion Job" \
--no-copy --output-urlpath "${LUNA_HOME}/PRO-12-123/data/toy_data_set/table/SLIDES"

Perhaps you already have a cluster running?
Hosting the HTTP server on port 41313 instead
[                                        ] | 0% Completed |  2.8s
2023-08-01 19:00:31.805 | DEBUG    | luna.common.utils:wrapper:146 - get_downscaled_thumbnail ran in 0.72s
2023-08-01 19:00:31.817 | DEBUG    | luna.common.utils:wrapper:146 - get_downscaled_thumbnail ran in 0.75s
2023-08-01 19:00:31.894 | DEBUG    | luna.common.utils:wrapper:146 - get_downscaled_thumbnail ran in 0.8s
[                                        ] | 0% Completed |  3.4s
2023-08-01 19:00:32.506 | DEBUG    | luna.common.utils:wrapper:146 - get_downscaled_thumbnail ran in 1.42s
[########################                ] | 60% Completed |  3.9s
2023-08-01 19:00:33.005 |

2023-08-01 19:00:47,424 - distributed.worker - ERROR - Failed to communicate with scheduler during heartbeat.
Traceback (most recent call last):
  File "/opt/conda/lib/python3.9/site-packages/distributed/comm/tcp.py", line 225, in read
    frames_nbytes = await stream.read_bytes(fmt_size)
tornado.iostream.StreamClosedError: Stream is closed

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/opt/conda/lib/python3.9/site-packages/distributed/worker.py", line 1215, in heartbeat
    response = await retry_operation(
  File "/opt/conda/lib/python3.9/site-packages/distributed/utils_comm.py", line 400, in retry_operation
    return await retry(
  File "/opt/conda/lib/python3.9/site-packages/distributed/utils_comm.py", line 385, in retry
    return await coro()
  File "/opt/conda/lib/python3.9/site-packages/distributed/core.py", line 1221, in send_recv_from_rpc
    return await send_recv(comm=comm, op=key, **kwargs)
  File "/opt/c



This step may take a while. At the end, your proxy table should be generated!

**NOTE**: The code for the step described below has been handled elsewhere, but the explanation is still worth reading to familiarize yourself with some potential overfitting issues.

Before we view the table, we must first update it to associate patient IDs with the slides. This is necessary for correctly training and validating the machine learning model in the coming notebooks. Once the slides are divided into "tiles" in the next notebook, the tiles are split between the training and validation sets for the ML model. If the tiles do not have patient IDs associated with them, tiles from one individual can appear in both the training and validation sets. Researchers would then overestimate the model's accuracy, because the model would be validated on data too similar to what it has already seen.
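
The idea can be sketched as a patient-level split: patients, not individual tiles, are assigned to the training or validation set. This is an illustration under stated assumptions and not the splitting code used in the later notebooks. The function `split_by_patient`, the tile names, and the patient IDs are hypothetical.

```python
import random
from collections import defaultdict


def split_by_patient(tile_to_patient, validation_fraction=0.2, seed=0):
    """Split tiles so that every patient lands in exactly one of the two sets."""
    tiles_by_patient = defaultdict(list)
    for tile, patient in tile_to_patient.items():
        tiles_by_patient[patient].append(tile)

    # Shuffle the patients (not the tiles) and reserve a fraction for validation.
    patients = sorted(tiles_by_patient)
    random.Random(seed).shuffle(patients)
    n_validation = max(1, round(len(patients) * validation_fraction))
    validation_patients = set(patients[:n_validation])

    train, validation = [], []
    for patient, patient_tiles in tiles_by_patient.items():
        target = validation if patient in validation_patients else train
        target.extend(patient_tiles)
    return train, validation


# Hypothetical tile names and spoof patient IDs, for illustration only.
tiles = {
    "tile_001": "P1", "tile_002": "P1",
    "tile_003": "P2", "tile_004": "P2",
    "tile_005": "P3", "tile_006": "P4", "tile_007": "P5",
}
train, validation = split_by_patient(tiles)
overlap = {tiles[t] for t in train} & {tiles[t] for t in validation}
print("patients in both sets:", overlap)
```

Because whole patients are assigned to one side, the overlap is empty by construction, at the cost of set sizes that only approximate the requested fraction.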

Note that we will not be using patient IDs associated with MSK. Instead, we will be using spoof IDs that will suffice for this tutorial. When running this workflow with real data, make sure to include the IDs safely and securely. The code block that added a 'patient_id' column to the table and stored it using Spark is deprecated and is no longer part of this notebook.

Next, we may view the WSI table! This table should have the metadata associated with the slides that you just collected, including the patient IDs.

This table may also be viewed through the Dremio instance. The URL for this instance is available in the docker-compose terminal log. 

In [None]:
import os
LUNA_HOME = os.environ["LUNA_HOME"]
TABLE_DIR = f"{LUNA_HOME}/PRO-12-123/data/toy_data_set/table"

import pandas as pd
pd.read_parquet(f"{TABLE_DIR}/SLIDES/slide_ingest_PRO-12-123.parquet")

If the table loads, then you have successfully run the Whole Slide Image (WSI) ETL to catalog the slides.

## 5. Run the regional annotation ETL

The whole slide images that you downloaded are images of ovarian cancer, but not every pixel on each slide shows tumor. In fact, the images typically include tumor cells, normal ovarian cells, lymphocytes, and more. *Note: A non-expert annotated this slide for demo purposes only.*

The regional annotation ETL performs the following steps:

- Download DSA json annotations
- Convert DSA jsons to GeoJSON format, which is compatible with downstream applications
- Save configs in your `~/vmount/PRO-12-123/configs/REGIONAL_METADATA_RESULTS`
- Save parquet table in your `~/vmount/PRO-12-123/tables/REGIONAL_METADATA_RESULTS`
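
To illustrate the conversion step, the sketch below turns a closed DSA polyline element into a GeoJSON polygon feature. It is not luna's converter. The element fields (`type`, `closed`, `points`, `label.value`) follow the DSA annotation format as assumed here, and the function names, the sample element, and the `label` property name in the output are hypothetical.

```python
import json


def dsa_polyline_to_feature(element):
    """Convert one closed DSA polyline element into a GeoJSON Polygon feature."""
    # DSA points carry (x, y, z); keep only x and y for a 2D polygon.
    ring = [[x, y] for x, y, *_ in element["points"]]
    # GeoJSON linear rings must end on the same position they start from.
    if ring and ring[0] != ring[-1]:
        ring.append(list(ring[0]))
    return {
        "type": "Feature",
        "properties": {"label": element.get("label", {}).get("value")},
        "geometry": {"type": "Polygon", "coordinates": [ring]},
    }


def dsa_to_geojson(elements):
    """Wrap closed polylines in a FeatureCollection, skipping other elements."""
    features = [
        dsa_polyline_to_feature(element)
        for element in elements
        if element.get("type") == "polyline" and element.get("closed")
    ]
    return {"type": "FeatureCollection", "features": features}


# Hypothetical annotation element, for illustration only.
elements = [
    {
        "type": "polyline",
        "closed": True,
        "label": {"value": "tumor"},
        "points": [[0, 0, 0], [10, 0, 0], [10, 10, 0], [0, 10, 0]],
    }
]
print(json.dumps(dsa_to_geojson(elements), indent=2))
```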


To run the regional annotation ETL, we use the `dsa_annotation` CLI. For more details on the `dsa_annotation` tool and the annotations we support, please check out the `7_dsa-annotation.ipynb` notebook.


In [None]:
!mkdir -p "{TABLE_DIR}/ANNOTATIONS"
!dsa_annotation http://girder:8080/api/v1 \
    --collection-name "TCGA collection" \
    --annotation-name "ov_regional" \
    --username admin --password password1 \
    --output-urlpath "{TABLE_DIR}/ANNOTATIONS"

To check that the regional annotation ETL ran correctly, you can examine the regional annotations table. This table contains the metadata saved by the ETL, including paths to the bitmap, numpy, and GeoJSON files. To load the table, run the following cell:

In [None]:
# `pd` and `TABLE_DIR` are defined in the earlier cell that loads the slide table.
annotations_table = f"{TABLE_DIR}/ANNOTATIONS/slide_annotation_dataset_TCGA collection_ov_regional.parquet"
pd.read_parquet(annotations_table)

Lastly, let's get our GeoJSONs and join on slide ID.

In [None]:
pd.read_parquet(annotations_table) \
    .query("type=='geojson'")[['slide_geojson']] \
    .join(
        pd.read_parquet(f"{TABLE_DIR}/SLIDES/slide_ingest_PRO-12-123.parquet")['id']
    )
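
Note that `DataFrame.join` aligns rows on the index by default, so the cell above pairs annotations with slides correctly only when both tables share the same index. As a hedged alternative, the sketch below matches rows on an explicit key column with `merge`. The miniature tables and the column names `slide_id` and `url` are assumptions for illustration; check the actual columns of your parquet tables before adapting it.

```python
import pandas as pd

# Hypothetical miniature tables; `slide_id` and `url` are assumed column names.
annotations = pd.DataFrame(
    {
        "slide_id": ["S2", "S1", "S1"],
        "type": ["geojson", "geojson", "bitmap"],
        "slide_geojson": ["S2.geojson", "S1.geojson", None],
    }
)
slides = pd.DataFrame({"id": ["S1", "S2"], "url": ["S1.svs", "S2.svs"]})

# Keep only the GeoJSON rows, then match rows on the slide identifier itself.
# validate="many_to_one" raises if a slide ID appears more than once in `slides`.
joined = annotations.query("type == 'geojson'").merge(
    slides, left_on="slide_id", right_on="id", how="left", validate="many_to_one"
)
print(joined[["id", "slide_geojson", "url"]])
```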

At this point, you have successfully set up your workspace, downloaded the data, and run both the pathology and regional annotation ETLs to prepare your data. You are ready to move on to the tiling notebook!