# Dataset Preparation Tutorial

Welcome to the dataset preparation tutorial! In this notebook, we will download the toy data set for the tutorial and prepare the necessary tables used for later analysis. Here are the steps we will review:

1. Verify prerequisites
2. Create a new project workspace
3. Review sample dataset
4. Build the proxy table
5. Run regional annotation ETL

**NOTE**: All of the configuration files for this tutorial have been provided in the container. The host and port values in the configuration files are dynamically set based on your system. 

**NOTE**: The current working directory is '~/vmount/notebooks'. All file and directory paths specified in the configuration files are relative to the current working directory. 

## 1. Verify prerequisites

Here are the software prerequisites for executing tasks with luna packages. These prerequisites have already been baked into this Docker container. To view the setup, please see the corresponding Dockerfile.

In [1]:
!python3 --version
!echo LUNA_HOME: $LUNA_HOME
!which jupyter
!pip list | grep luna-
import luna
luna.__path__
import luna.pathology
luna.pathology.__path__

Python 3.9.12
LUNA_HOME: /home/kohlia/vmount
/home/kohlia/.local/bin/jupyter
pyluna-common                 0.3.3       /home/kohlia/luna/pyluna-common
pyluna-pathology              0.3.3       /home/kohlia/luna/pyluna-pathology


['/home/kohlia/luna/pyluna-pathology/luna/pathology']

## 2. Create a new project workspace



Next, we create a luna home space and place the configuration files there. Using a manifest file, we will then create a project workspace to hold the configurations, data, models, and outputs for this tutorial.

In [2]:
%%bash
mkdir -p ~/luna
cp -R ~/vmount/conf ~/luna
cat ~/luna/conf/manifest.yaml
python3 -m luna.project.generate --manifest_file ~/luna/conf/manifest.yaml
tree ~/vmount/PRO_12-123

# project manifest template

# MIND project id
PROJECT: PRO_12-123

# IRB
IRB:

# project title
TITLE: pathology-tutorial

# project description
DESCRIPTION: End-to-end pathology analysis tutorial

DATA_MODALITIES: pathology

ROOT_PATH: ../


/usr/local/bin/python3: Error while finding module specification for 'luna.project.generate' (ModuleNotFoundError: No module named 'luna.project')


/home/kohlia/vmount/PRO_12-123
├── data
│   └── toy_data_set
│       ├── 01OV002-bd8cdc70-3d46-40ae-99c4-90ef77.svs
│       ├── 01OV002-ed65cf94-8bc6-492b-9149-adc16f.svs
│       ├── 01OV007-9b90eb78-2f50-4aeb-b010-d642f9.svs
│       ├── 01OV008-308ad404-7079-4ff8-8232-12ee2e.svs
│       ├── 01OV008-7579323e-2fae-43a9-b00f-a15c28.svs
│       └── table
│           ├── ANNOTATIONS
│           │   ├── 01OV002-bd8cdc70-3d46-40ae-99c4-90ef77.annotation.geojson
│           │   ├── 01OV002-ed65cf94-8bc6-492b-9149-adc16f.annotation.geojson
│           │   ├── 01OV007-9b90eb78-2f50-4aeb-b010-d642f9.annotation.geojson
│           │   ├── 01OV008-308ad404-7079-4ff8-8232-12ee2e.annotation.geojson
│           │   ├── 01OV008-7579323e-2fae-43a9-b00f-a15c28.annotation.geojson
│           │   ├── metadata.yml
│           │   └── slide_annotation_dataset_TCGA collection_ov_regional.parquet
│           └── SLIDES
│               ├── metadata.yml
│               └── slide_ingest_PRO_12-123.parquet
├── da

You should now see a new directory called *PRO_12-123* with the manifest file in it. This will be your project workspace!
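
To make the relationship between the manifest and the workspace location concrete, here is a minimal sketch of how the `PROJECT` and `ROOT_PATH` fields could combine with the working directory to give the workspace path. It is not the luna implementation: `parse_flat_manifest` and `workspace_path` are hypothetical helpers, and the parser only handles flat `KEY: value` lines like the ones printed above.

```python
import posixpath

# A subset of the manifest printed above, inlined so the sketch is self-contained.
MANIFEST_TEXT = """\
# project manifest template
PROJECT: PRO_12-123
TITLE: pathology-tutorial
DATA_MODALITIES: pathology
ROOT_PATH: ../
"""


def parse_flat_manifest(text):
    """Hypothetical helper: parse flat 'KEY: value' lines, skipping comments and blanks."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition(":")
        fields[key.strip()] = value.strip()
    return fields


def workspace_path(fields, working_dir):
    """Hypothetical helper: resolve working_dir / ROOT_PATH / PROJECT as a string.

    Assumes ROOT_PATH is relative to the working directory, as the note at the
    top of this notebook says for paths in the configuration files.
    """
    joined = posixpath.join(working_dir, fields["ROOT_PATH"], fields["PROJECT"])
    return posixpath.normpath(joined)


fields = parse_flat_manifest(MANIFEST_TEXT)
print(workspace_path(fields, "~/vmount/notebooks"))  # ~/vmount/PRO_12-123
```

With the working directory `~/vmount/notebooks` and `ROOT_PATH: ../`, the path resolves to `~/vmount/PRO_12-123`, which matches the directory listed by `tree` above.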

## 3. Review sample dataset

The data that you will be using for this tutorial is a set of 5 whole slide images of ovarian cancer H&E slides, available in the svs file format. Whole slide imaging refers to the scanning of conventional glass slides for research purposes; in this case, these are slides that oncologists have used to inspect cancer samples!

While bringing up the DSA container, we already ran a script to get the data, and set up DSA. The `vmount/provision.py` script ran these steps:
  
  - Set up admin user and default assetstore
  
  - Download sample data from the [public kitware site](https://data.kitware.com/#user/61b9f3dc4acac99f42ca7678/folder/61b9f4564acac99f42ca7692) to `~/vmount/PRO_12-123/data/toy_data_set/`
  
  - Create a collection and add slides/annotations to your local DSA


In [3]:
%%bash
tree ~/vmount/PRO_12-123/data/toy_data_set

/home/kohlia/vmount/PRO_12-123/data/toy_data_set
├── 01OV002-bd8cdc70-3d46-40ae-99c4-90ef77.svs
├── 01OV002-ed65cf94-8bc6-492b-9149-adc16f.svs
├── 01OV007-9b90eb78-2f50-4aeb-b010-d642f9.svs
├── 01OV008-308ad404-7079-4ff8-8232-12ee2e.svs
├── 01OV008-7579323e-2fae-43a9-b00f-a15c28.svs
└── table
    ├── ANNOTATIONS
    │   ├── 01OV002-bd8cdc70-3d46-40ae-99c4-90ef77.annotation.geojson
    │   ├── 01OV002-ed65cf94-8bc6-492b-9149-adc16f.annotation.geojson
    │   ├── 01OV007-9b90eb78-2f50-4aeb-b010-d642f9.annotation.geojson
    │   ├── 01OV008-308ad404-7079-4ff8-8232-12ee2e.annotation.geojson
    │   ├── 01OV008-7579323e-2fae-43a9-b00f-a15c28.annotation.geojson
    │   ├── metadata.yml
    │   └── slide_annotation_dataset_TCGA collection_ov_regional.parquet
    └── SLIDES
        ├── metadata.yml
        └── slide_ingest_PRO_12-123.parquet

3 directories, 14 files


If you want to import your own data, you can do so from your local filesystem as well as an object store. For more details, refer to the [Girder user documentation](https://girder.readthedocs.io/en/latest/user-guide.html#assetstores).

To import images from your local filesystem:

- Log in to DSA with admin/password
- Add images to your local computer at `vmount/assetstore`
- Navigate to **Admin Console** -> **Assetstores**
- From the default assetstore, click on **Import data**
- Specify the path to the images you wish to import, e.g. `/assetstore/yourimage`, and click import

As the `/assetstore` mount is available to DSA, this import should be much faster than uploading the image through **Upload files** in the UI.


## 4. Build the proxy table

Now, we will run the Whole Slide Image (WSI) ETL to build a metadata catalog of the slides in a proxy table.

For reference, ETL stands for extract-transform-load: a process that typically involves extracting data from a source, cleaning it and transforming data types, and loading the result into a different system.
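
To illustrate the three ETL stages in general terms, here is a minimal sketch that scans a folder for `.svs` files, turns each one into a metadata record, and writes the records to a CSV table. It is not how `slide_etl` is implemented: the function names, the placeholder files, and the CSV output are assumptions made for illustration, and only the column names `slide_id`, `size`, and `slide_image` are borrowed from the proxy table shown later.

```python
import csv
import tempfile
from pathlib import Path

FIELDS = ["slide_id", "size", "slide_image"]


def extract(slide_dir):
    """Extract: list the .svs files in a folder (sorted for determinism)."""
    return sorted(Path(slide_dir).glob("*.svs"))


def transform(paths):
    """Transform: turn each file path into a flat metadata record."""
    return [
        {
            "slide_id": p.stem,                    # file name without extension
            "size": p.stat().st_size,              # file size in bytes
            "slide_image": p.resolve().as_uri(),   # file:// URI of the slide
        }
        for p in paths
    ]


def load(records, out_csv):
    """Load: write the records to a CSV table."""
    with open(out_csv, "w", newline="") as fh:
        writer = csv.DictWriter(fh, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(records)


with tempfile.TemporaryDirectory() as tmp:
    # Two empty placeholder files stand in for real slides.
    for name in ("slide_a.svs", "slide_b.svs"):
        (Path(tmp) / name).write_bytes(b"")
    records = transform(extract(tmp))
    load(records, Path(tmp) / "slides.csv")
    table_text = (Path(tmp) / "slides.csv").read_text()

print(table_text)
```

The real ETL additionally reads scanner metadata from each slide and writes a parquet table rather than a CSV file.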

In [4]:
!slide_etl ~/vmount/PRO_12-123/data/toy_data_set \
--project_name PRO_12-123 --comment "Example Ingestion Job" \
--store_url "" --no-write \
-o ../PRO_12-123/data/toy_data_set/table/SLIDES 

2022-04-26 16:32:10,910 - INFO - root - Initalized logger, log file at: data-processing.log
2022-04-26 16:32:10,912 - INFO - luna.common.utils - Started CLI Runner wtih <function slide_etl at 0x7fa830346e50>
2022-04-26 16:32:10,913 - INFO - luna.common.utils - Validating params...
2022-04-26 16:32:10,914 - INFO - luna.common.utils -  -> Set input_slide_folder (<class 'str'>) = /home/kohlia/vmount/PRO_12-123/data/toy_data_set
2022-04-26 16:32:10,915 - INFO - luna.common.utils -  -> Set comment (<class 'str'>) = Example Ingestion Job
2022-04-26 16:32:10,916 - INFO - luna.common.utils -  -> Set no_write (<class 'bool'>) = True
2022-04-26 16:32:10,917 - INFO - luna.common.utils -  -> Set subset_csv (<class 'str'>) =
2022-04-26 16:32:10,918 - INFO - luna.common.utils -  -> Set debug_limit (<class 'int'>) = -1
2022-04-26 16:32:10,919 - INFO - luna.common.utils -  -> Set num_cores (<class 'int'>) = 4
2022-04-26 16:32:10,919 - INFO - luna.common.utils -  -> Set store_url (<class 'str'>) =
2022

This step may take a while. At the end, your proxy table should be generated!

Before we view the table, we must first update it to associate patient IDs with the slides. This is necessary for correctly training and validating the machine learning model in the coming notebooks. Once the slides are divided into "tiles" in the next notebook, the tiles are split between the training and validation sets for the ML model. If the tiles do not have patient IDs associated with them, then tiles from one individual can appear in both the training and validation sets; this would give researchers an exaggerated impression of the model's accuracy, since we would essentially be validating the model on information that is too close to what it has already seen.
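
The leakage concern above can be avoided by splitting at the patient level rather than the tile level. The following is a minimal sketch of that idea, not the split used in the later notebooks: `split_by_patient`, the tile names, and the patient labels `P1` to `P4` are all hypothetical.

```python
import random
from collections import defaultdict


def split_by_patient(tile_to_patient, val_fraction=0.25, seed=0):
    """Assign whole patients (not individual tiles) to train or validation."""
    by_patient = defaultdict(list)
    for tile, patient in tile_to_patient.items():
        by_patient[patient].append(tile)

    # Shuffle the patients reproducibly, then hold out a fraction of them.
    patients = sorted(by_patient)
    random.Random(seed).shuffle(patients)
    n_val = max(1, round(len(patients) * val_fraction))
    val_patients = set(patients[:n_val])

    train = [t for p in patients if p not in val_patients for t in by_patient[p]]
    val = [t for p in sorted(val_patients) for t in by_patient[p]]
    return train, val, val_patients


# Hypothetical tiles and spoof patient IDs, for illustration only.
tile_to_patient = {
    "tile_0": "P1", "tile_1": "P1",
    "tile_2": "P2",
    "tile_3": "P3", "tile_4": "P3",
    "tile_5": "P4",
}

train, val, val_patients = split_by_patient(tile_to_patient)
train_patients = {tile_to_patient[t] for t in train}

# No patient contributes tiles to both sets.
print(train_patients.isdisjoint(val_patients))  # True
```

Because every tile of a held-out patient goes to the validation set, the model is never validated on tissue from an individual it was trained on.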

Note that we will not be using patient IDs associated with MSK. Instead, we will be using spoof IDs that will suffice for this tutorial. When running this workflow with real data, make sure to include the IDs safely and securely. Run the following block of code to add a 'patient_id' column to the table and store it using Spark.

Next, we may view the WSI table! This table should have the metadata associated with the WSI slides that you just collected, including the patient IDs. 

In [5]:
import pandas as pd
pd.read_parquet("../PRO_12-123/data/toy_data_set/table/SLIDES/slide_ingest_PRO_12-123.parquet")

Unnamed: 0_level_0,valid_slide,aperio.AppMag,aperio.DSR ID,aperio.Date,aperio.DisplayColor,aperio.Exposure Scale,aperio.Exposure Time,aperio.Filename,aperio.Focus Offset,aperio.ICC Profile,...,data_url,size,slide_image,openslide.level[3].downsample,openslide.level[3].height,openslide.level[3].tile-height,openslide.level[3].tile-width,openslide.level[3].width,project_name,comment
slide_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
01OV008-308ad404-7079-4ff8-8232-12ee2e,True,40.0,spaperio,08/14/14,0.0,1e-06,109.0,109330.0,0.0,ScanScope v1,...,file:///home/kohlia/vmount/PRO_12-123/data/toy...,207479411,file:///home/kohlia/vmount/PRO_12-123/data/toy...,,,,,,PRO_12-123,Example Ingestion Job
01OV002-ed65cf94-8bc6-492b-9149-adc16f,True,40.0,spaperio,08/14/14,0.0,1e-06,109.0,109327.0,0.0,ScanScope v1,...,file:///home/kohlia/vmount/PRO_12-123/data/toy...,240691747,file:///home/kohlia/vmount/PRO_12-123/data/toy...,32.008682,1555.0,256.0,256.0,1590.0,PRO_12-123,Example Ingestion Job
01OV002-bd8cdc70-3d46-40ae-99c4-90ef77,True,40.0,spaperio,08/14/14,0.0,1e-06,109.0,109316.0,0.0,ScanScope v1,...,file:///home/kohlia/vmount/PRO_12-123/data/toy...,237047223,file:///home/kohlia/vmount/PRO_12-123/data/toy...,32.007005,1713.0,256.0,256.0,1680.0,PRO_12-123,Example Ingestion Job
01OV007-9b90eb78-2f50-4aeb-b010-d642f9,True,40.0,spaperio,08/14/14,0.0,1e-06,109.0,109328.0,0.0,ScanScope v1,...,file:///home/kohlia/vmount/PRO_12-123/data/toy...,262796337,file:///home/kohlia/vmount/PRO_12-123/data/toy...,32.003727,1878.0,256.0,256.0,3420.0,PRO_12-123,Example Ingestion Job
01OV008-7579323e-2fae-43a9-b00f-a15c28,True,40.0,spaperio,08/14/14,0.0,1e-06,109.0,109331.0,0.0,ScanScope v1,...,file:///home/kohlia/vmount/PRO_12-123/data/toy...,215796305,file:///home/kohlia/vmount/PRO_12-123/data/toy...,32.002432,1850.0,256.0,256.0,1320.0,PRO_12-123,Example Ingestion Job


If the table is displayed above, congratulations, you have successfully run the Whole Slide Image (WSI) ETL to catalog the slides!

## 5. Run the regional annotation ETL

The whole slide images that you downloaded are images of ovarian cancer, but not every pixel on each slide is a tumor. In fact, the images show tumor cells, normal ovarian cells, and more. A non-expert annotated these slides for demo purposes only.

The regional annotation ETL performs the following steps:

- Downloads DSA json annotations
- Converts DSA JSON annotations to GeoJSON format, which is compatible with downstream applications
- Saves configs in your `~/vmount/PRO_12-123/configs/REGIONAL_METADATA_RESULTS`
- Saves parquet table in your `~/vmount/PRO_12-123/tables/REGIONAL_METADATA_RESULTS`
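
The conversion step in the list above can be sketched as follows. This is an illustration under assumptions, not the code that `dsa_annotation` runs: it assumes each regional annotation is a closed `polyline` element with `[x, y, z]` points and an optional `label.value`, and the example annotation (including the `tumor` label) is made up.

```python
import json


def element_to_feature(element):
    """Convert one closed polyline element to a GeoJSON Polygon Feature."""
    # Drop the z coordinate; GeoJSON polygons here only need x and y.
    ring = [[x, y] for x, y, *_ in element["points"]]
    if ring[0] != ring[-1]:
        ring.append(list(ring[0]))  # GeoJSON linear rings must be closed
    return {
        "type": "Feature",
        "properties": {"label": element.get("label", {}).get("value", "")},
        "geometry": {"type": "Polygon", "coordinates": [ring]},
    }


def annotation_to_geojson(annotation):
    """Collect all closed polylines of an annotation into a FeatureCollection."""
    features = [
        element_to_feature(e)
        for e in annotation["elements"]
        if e.get("type") == "polyline" and e.get("closed")
    ]
    return {"type": "FeatureCollection", "features": features}


# Hypothetical annotation: one square region plus a point that is skipped.
annotation = {
    "name": "ov_regional",
    "elements": [
        {
            "type": "polyline",
            "closed": True,
            "points": [[0, 0, 0], [10, 0, 0], [10, 10, 0], [0, 10, 0]],
            "label": {"value": "tumor"},
        },
        {"type": "point", "center": [5, 5, 0]},
    ],
}

geojson = annotation_to_geojson(annotation)
print(json.dumps(geojson, indent=2))
```

Repeating the first vertex at the end of the ring is what makes the polygon valid GeoJSON, since the DSA polyline marks closure with a flag instead.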


To run the regional annotation ETL, we use the `dsa_annotation` CLI. For more details on `dsa_annotation` and the annotations we support, please check out the `7_dsa-annotation.ipynb` notebook.

**Note**: details of your DSA instance are specified as `DSA_URI` in `../conf/dsa_regional_annotation.yaml` and should be updated to reflect your DSA setup. If you are using Docker, replace `localhost` with the IP you get from running:

```docker inspect -f '{{range.NetworkSettings.Networks}}{{.IPAddress}}{{end}}' USERNAME-girder-1```
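
If you prefer to rewrite the URI programmatically once you have the IP, a small helper along these lines would do it. This is only a sketch: `replace_host` is a hypothetical function, and the `localhost` URI is an assumed example of what the configuration might contain. The IP is the one used in the cell below.

```python
from urllib.parse import urlsplit, urlunsplit


def replace_host(uri, new_host):
    """Swap the host of a URI while keeping scheme, port, and path.

    Any user:password@ prefix in the original URI is dropped.
    """
    parts = urlsplit(uri)
    netloc = new_host if parts.port is None else f"{new_host}:{parts.port}"
    return urlunsplit(parts._replace(netloc=netloc))


# Assumed starting value; your configuration file may differ.
print(replace_host("http://localhost:8080/api/v1", "172.20.0.4"))
# → http://172.20.0.4:8080/api/v1
```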


In [11]:
!dsa_annotation http://172.20.0.4:8080/api/v1 \
    --collection_name "TCGA collection" \
    --annotation_name "ov_regional" \
    --username admin --password password \
    --output_dir ../PRO_12-123/data/toy_data_set/table/ANNOTATIONS 

2022-04-26 16:35:31,052 - INFO - root - Initalized logger, log file at: data-processing.log
2022-04-26 16:35:31,054 - INFO - luna.common.utils - Started CLI Runner wtih <function dsa_annotation_etl at 0x7fa35876dee0>
2022-04-26 16:35:31,055 - INFO - luna.common.utils - Validating params...
2022-04-26 16:35:31,056 - INFO - luna.common.utils -  -> Set input_dsa_endpoint (<class 'str'>) = http://172.20.0.4:8080/api/v1
2022-04-26 16:35:31,057 - INFO - luna.common.utils -  -> Set collection_name (<class 'str'>) = TCGA collection
2022-04-26 16:35:31,058 - INFO - luna.common.utils -  -> Set annotation_name (<class 'str'>) = ov_regional
2022-04-26 16:35:31,059 - INFO - luna.common.utils -  -> Set num_cores (<class 'int'>) = 4
2022-04-26 16:35:31,060 - INFO - luna.common.utils -  -> Set username (<class 'str'>) = *****
2022-04-26 16:35:31,061 - INFO - luna.common.utils -  -> Set password (<class 'str'>) = *****
2022-04-26 16:35:31,062 - INFO - luna.common.utils -  -> Set output_dir (<class 'str

2022-04-26 16:35:34,302 - INFO - dsa_annotation_etl - 	Created geometry POLYGON ((15745 10073, 15904 10059, 1606...
2022-04-26 16:35:34,305 - INFO - dsa_annotation_etl - 	Created geometry POLYGON ((29241 24267, 28934 24536, 2850...
2022-04-26 16:35:34,306 - INFO - dsa_annotation_etl - 	Created geometry POLYGON ((12928 29472, 12872 29504, 1280...
2022-04-26 16:35:34,309 - INFO - dsa_annotation_etl - 	Created geometry POLYGON ((15784 15684, 15840 15673, 1590...
2022-04-26 16:35:34,310 - INFO - dsa_annotation_etl - Checking geojson, errors with geojson FeatureCollection: []
2022-04-26 16:35:34,314 - INFO - dsa_annotation_etl - 	Created geometry POLYGON ((24707 16018, 24700 15990, 2463...
2022-04-26 16:35:34,316 - INFO - dsa_annotation_etl - Checking geojson, errors with geojson FeatureCollection: []
2022-04-26 16:35:34,318 - INFO - luna.pathology.dsa.dsa_api_handler - Found 1 total annotations: {'ov_regional'}
2022-04-26 16:35:34,323 - INFO - luna.pathology.dsa.dsa_api_handler - Found an 

To check that the regional annotation ETL ran correctly, you may load the regional annotations table after the Jupyter cell finishes! This table contains the metadata saved from running the ETL. It includes paths to the bitmap files, numpy files, and GeoJSON files that were mentioned before. To load the table, run the following code cell:

In [12]:
import pandas as pd

pd.read_parquet("../PRO_12-123/data/toy_data_set/table/ANNOTATIONS/slide_annotation_dataset_TCGA collection_ov_regional.parquet")

Unnamed: 0_level_0,_id,baseParentId,baseParentType,created,creatorId,description,folderId,largeImage,lowerName,name,...,xmin,xmax,ymin,ymax,bbox_area,x_coords,y_coords,slide_geojson,collection_name,annotation_name
slide_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
01OV002-bd8cdc70-3d46-40ae-99c4-90ef77,62681d02787e8d61b58adf2d,62681d02787e8d61b58adf2b,collection,2022-04-26T16:25:38.663000+00:00,62681d02787e8d61b58adf27,,62681d02787e8d61b58adf2c,"{'fileId': '62681d2c787e8d61b58adf2f', 'source...",01ov002-bd8cdc70-3d46-40ae-99c4-90ef77.svs,01OV002-bd8cdc70-3d46-40ae-99c4-90ef77.svs,...,25250.0,28661.0,40529.0,44372.0,13108473.0,"[28211, 28328, 28587, 28630, 28655, 28661, 286...","[42225, 42546, 43126, 43261, 43379, 43607, 437...",,TCGA collection,ov_regional
01OV002-bd8cdc70-3d46-40ae-99c4-90ef77,62681d02787e8d61b58adf2d,62681d02787e8d61b58adf2b,collection,2022-04-26T16:25:38.663000+00:00,62681d02787e8d61b58adf27,,62681d02787e8d61b58adf2c,"{'fileId': '62681d2c787e8d61b58adf2f', 'source...",01ov002-bd8cdc70-3d46-40ae-99c4-90ef77.svs,01OV002-bd8cdc70-3d46-40ae-99c4-90ef77.svs,...,31532.0,33932.0,35713.0,39793.0,9792000.0,"[32252, 32252, 32220, 32164, 32140, 32108, 320...","[37097, 37001, 36897, 36745, 36713, 36689, 366...",,TCGA collection,ov_regional
01OV002-bd8cdc70-3d46-40ae-99c4-90ef77,62681d02787e8d61b58adf2d,62681d02787e8d61b58adf2b,collection,2022-04-26T16:25:38.663000+00:00,62681d02787e8d61b58adf27,,62681d02787e8d61b58adf2c,"{'fileId': '62681d2c787e8d61b58adf2f', 'source...",01ov002-bd8cdc70-3d46-40ae-99c4-90ef77.svs,01OV002-bd8cdc70-3d46-40ae-99c4-90ef77.svs,...,23557.0,25542.0,18922.0,21735.0,5583805.0,"[24180, 24136, 24063, 23980, 23874, 23813, 237...","[18972, 18978, 19017, 19078, 19161, 19217, 192...",,TCGA collection,ov_regional
01OV002-bd8cdc70-3d46-40ae-99c4-90ef77,62681d02787e8d61b58adf2d,62681d02787e8d61b58adf2b,collection,2022-04-26T16:25:38.663000+00:00,62681d02787e8d61b58adf27,,62681d02787e8d61b58adf2c,"{'fileId': '62681d2c787e8d61b58adf2f', 'source...",01ov002-bd8cdc70-3d46-40ae-99c4-90ef77.svs,01OV002-bd8cdc70-3d46-40ae-99c4-90ef77.svs,...,26951.0,30651.0,23202.0,26990.0,14015600.0,"[30133, 30073, 30005, 29951, 29904, 29857, 298...","[26411, 26465, 26506, 26532, 26553, 26566, 265...",,TCGA collection,ov_regional
01OV002-bd8cdc70-3d46-40ae-99c4-90ef77,62681d02787e8d61b58adf2d,62681d02787e8d61b58adf2b,collection,2022-04-26T16:25:38.663000+00:00,62681d02787e8d61b58adf27,,62681d02787e8d61b58adf2c,"{'fileId': '62681d2c787e8d61b58adf2f', 'source...",01ov002-bd8cdc70-3d46-40ae-99c4-90ef77.svs,01OV002-bd8cdc70-3d46-40ae-99c4-90ef77.svs,...,13533.0,17365.0,26776.0,29672.0,11097472.0,"[16525, 16597, 16701, 16781, 16845, 16901, 169...","[27088, 26992, 26896, 26840, 26816, 26808, 267...",,TCGA collection,ov_regional
01OV002-bd8cdc70-3d46-40ae-99c4-90ef77,62681d02787e8d61b58adf2d,62681d02787e8d61b58adf2b,collection,2022-04-26T16:25:38.663000+00:00,62681d02787e8d61b58adf27,,62681d02787e8d61b58adf2c,"{'fileId': '62681d2c787e8d61b58adf2f', 'source...",01ov002-bd8cdc70-3d46-40ae-99c4-90ef77.svs,01OV002-bd8cdc70-3d46-40ae-99c4-90ef77.svs,...,21459.0,23500.0,16457.0,19848.0,6921031.0,"[23389, 23435, 23463, 23481, 23500, 23500, 234...","[18929, 19021, 19103, 19186, 19250, 19425, 194...",,TCGA collection,ov_regional
01OV002-bd8cdc70-3d46-40ae-99c4-90ef77,62681d02787e8d61b58adf2d,62681d02787e8d61b58adf2b,collection,2022-04-26T16:25:38.663000+00:00,62681d02787e8d61b58adf27,,62681d02787e8d61b58adf2c,"{'fileId': '62681d2c787e8d61b58adf2f', 'source...",01ov002-bd8cdc70-3d46-40ae-99c4-90ef77.svs,01OV002-bd8cdc70-3d46-40ae-99c4-90ef77.svs,...,,,,,,,,../PRO_12-123/data/toy_data_set/table/ANNOTATI...,TCGA collection,ov_regional
01OV002-ed65cf94-8bc6-492b-9149-adc16f,62681d2d787e8d61b58adf38,62681d02787e8d61b58adf2b,collection,2022-04-26T16:26:21.385000+00:00,62681d02787e8d61b58adf27,,62681d02787e8d61b58adf2c,"{'fileId': '62681d5b787e8d61b58adf3a', 'source...",01ov002-ed65cf94-8bc6-492b-9149-adc16f.svs,01OV002-ed65cf94-8bc6-492b-9149-adc16f.svs,...,20217.0,25783.0,4477.0,8874.0,24473702.0,"[24143, 23829, 23663, 23611, 23532, 23436, 233...","[4477, 4477, 4503, 4521, 4556, 4591, 4608, 464...",,TCGA collection,ov_regional
01OV002-ed65cf94-8bc6-492b-9149-adc16f,62681d2d787e8d61b58adf38,62681d02787e8d61b58adf2b,collection,2022-04-26T16:26:21.385000+00:00,62681d02787e8d61b58adf27,,62681d02787e8d61b58adf2c,"{'fileId': '62681d5b787e8d61b58adf3a', 'source...",01ov002-ed65cf94-8bc6-492b-9149-adc16f.svs,01OV002-ed65cf94-8bc6-492b-9149-adc16f.svs,...,14683.0,19778.0,4925.0,8728.0,19376285.0,"[19647, 19691, 19735, 19778, 19778, 19770, 197...","[5125, 5370, 5605, 5876, 6076, 6190, 6268, 634...",,TCGA collection,ov_regional
01OV002-ed65cf94-8bc6-492b-9149-adc16f,62681d2d787e8d61b58adf38,62681d02787e8d61b58adf2b,collection,2022-04-26T16:26:21.385000+00:00,62681d02787e8d61b58adf27,,62681d02787e8d61b58adf2c,"{'fileId': '62681d5b787e8d61b58adf3a', 'source...",01ov002-ed65cf94-8bc6-492b-9149-adc16f.svs,01OV002-ed65cf94-8bc6-492b-9149-adc16f.svs,...,10240.0,12487.0,28794.0,34122.0,11972016.0,"[10388, 10374, 10334, 10294, 10253, 10240, 102...","[30395, 30328, 30207, 30059, 29938, 29830, 296...",,TCGA collection,ov_regional
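
As a quick sanity check on the bounding-box columns, the values shown above are consistent with `bbox_area = (xmax - xmin) * (ymax - ymin)`. The sketch below verifies this for the first three rows of the table; the formula is inferred from the displayed numbers rather than taken from the luna source, and `bbox_area` here is a hypothetical helper.

```python
def bbox_area(xmin, xmax, ymin, ymax):
    """Area of an axis-aligned bounding box, in square pixels."""
    return (xmax - xmin) * (ymax - ymin)


# (xmin, xmax, ymin, ymax, bbox_area) copied from the first three rows above.
rows = [
    (25250, 28661, 40529, 44372, 13108473),
    (31532, 33932, 35713, 39793, 9792000),
    (23557, 25542, 18922, 21735, 5583805),
]

for xmin, xmax, ymin, ymax, expected in rows:
    print(bbox_area(xmin, xmax, ymin, ymax) == expected)  # True for each row
```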


Last, let's get our GeoJSON files and join them with the slide table on slide ID!

In [13]:
pd.read_parquet("../PRO_12-123/data/toy_data_set/table/ANNOTATIONS/slide_annotation_dataset_TCGA collection_ov_regional.parquet") \
    .query("type=='geojson'")[['slide_geojson']] \
    .join(
        pd.read_parquet("../PRO_12-123/data/toy_data_set/table/SLIDES/slide_ingest_PRO_12-123.parquet")['slide_image']
    )


Unnamed: 0_level_0,slide_geojson,slide_image
slide_id,Unnamed: 1_level_1,Unnamed: 2_level_1
01OV002-bd8cdc70-3d46-40ae-99c4-90ef77,../PRO_12-123/data/toy_data_set/table/ANNOTATI...,file:///home/kohlia/vmount/PRO_12-123/data/toy...
01OV002-ed65cf94-8bc6-492b-9149-adc16f,../PRO_12-123/data/toy_data_set/table/ANNOTATI...,file:///home/kohlia/vmount/PRO_12-123/data/toy...
01OV007-9b90eb78-2f50-4aeb-b010-d642f9,../PRO_12-123/data/toy_data_set/table/ANNOTATI...,file:///home/kohlia/vmount/PRO_12-123/data/toy...
01OV008-308ad404-7079-4ff8-8232-12ee2e,../PRO_12-123/data/toy_data_set/table/ANNOTATI...,file:///home/kohlia/vmount/PRO_12-123/data/toy...
01OV008-7579323e-2fae-43a9-b00f-a15c28,../PRO_12-123/data/toy_data_set/table/ANNOTATI...,file:///home/kohlia/vmount/PRO_12-123/data/toy...


At this point, you have successfully set up your workspace, downloaded the data, and run both the pathology and regional annotation ETLs to prepare your data. You are ready to move on to the tiling notebook!