You can download and run this notebook locally, or run it for free in a cloud environment using Colab or SageMaker Studio Lab:

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/kirbyju/TCIA_Notebooks/blob/main/TCIA_MedImageTools.ipynb)

[![Open In SageMaker Studio Lab](https://studiolab.sagemaker.aws/studiolab.svg)](https://studiolab.sagemaker.aws/import/github.com/kirbyju/TCIA_Notebooks/blob/main/TCIA_MedImageTools.ipynb)

# Background
This notebook is designed to showcase the core functionalities of Med-ImageTools using datasets from The Cancer Imaging Archive (TCIA). The notebook will guide you through the following steps:

1. Installing [Med-ImageTools](https://pypi.org/project/med-imagetools/) and [tcia_utils](https://pypi.org/project/tcia-utils/)
2. Downloading and processing a sample TCIA dataset using `AutoPipeline` for deep learning segmentation
   
   i. Understanding outputs from `AutoPipeline` for segmentation
   
   ii. Understanding full outputs from `AutoPipeline`
   
3. Processing a sample TCIA dataset using `AutoPipeline` with radiotherapy data

# 1 Setup: Installing Med-ImageTools and tcia_utils
Med-ImageTools and tcia_utils are available on PyPI and can be installed using pip.

In [1]:
import sys

# install Med-ImageTools and tcia_utils
!{sys.executable} -m pip install --upgrade -q med-imagetools
!{sys.executable} -m pip install --upgrade -q tcia_utils
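
After the install cell finishes, it can be useful to confirm which versions were actually picked up. The following is a minimal sketch using only the standard library. The helper name `installed_version` is hypothetical, and the distribution names are simply the ones passed to pip above.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(dist_name):
    """Return the installed version of a distribution, or None if it is missing."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


# Distribution names as passed to pip in this notebook.
for name in ("med-imagetools", "tcia_utils"):
    print(f"{name}: {installed_version(name) or 'not installed'}")
```

Recording these versions alongside your results makes it easier to reproduce a run later.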



In [2]:
import requests
import pandas as pd
import os
import imgtools
from tcia_utils import nbia

# set logging level to INFO in Google Colab (not necessary in Jupyter)
if 'google.colab' in sys.modules:
  import logging

  for handler in logging.root.handlers[:]:
      logging.root.removeHandler(handler)

  # Set handler with level = info
  logging.basicConfig(format='%(asctime)s:%(levelname)s:%(message)s',
                      level=logging.INFO)

  print("Google Colab Logging set to INFO")

Google Colab Logging set to INFO


# 2 Downloading and processing a sample TCIA dataset using `AutoPipeline` for deep learning segmentation

We're going to start off by downloading some sample data and then telling AutoPipeline where to find and save data.

For this example, let's assume you've already spent some time [browsing the data on TCIA](https://www.cancerimagingarchive.net/access-data) and decided that you're interested in working with the [Pancreatic-CT-CBCT-SEG](https://doi.org/10.7937/TCIA.ESHQ-4D90) collection.  

**Note:** If you're interested in learning more about the functionality of **tcia_utils** for querying and downloading data please check out https://github.com/kirbyju/TCIA_Notebooks for a more thorough explanation.  

In [3]:
# download series metadata for the Pancreatic-CT-CBCT-SEG collection
series = nbia.getSeries("Pancreatic-CT-CBCT-SEG", format = "df")

2024-09-24 04:34:18,133:INFO:Success - Token saved to api_call_headers variable and expires at 2024-09-24 06:34:18.133528
2024-09-24 04:34:18,137:INFO:Accessing public data anonymously. To access restricted data use nbia.getToken() with your credentials.
2024-09-24 04:34:18,141:INFO:Calling getSeries with parameters {'Collection': 'Pancreatic-CT-CBCT-SEG'}


In [4]:
# sort the 'series' dataframe by PatientID
series_sorted = series.sort_values('PatientID')

# get the first 3 unique PatientIDs
first_3_ids = series_sorted['PatientID'].unique()[:3]

# filter the dataframe to include only rows for the first 3 PatientIDs
sample_series = series_sorted[series_sorted['PatientID'].isin(first_3_ids)]

sample_series

Unnamed: 0,SeriesInstanceUID,StudyInstanceUID,Modality,SeriesDate,SeriesDescription,BodyPartExamined,SeriesNumber,Collection,PatientID,Manufacturer,...,ImageCount,TimeStamp,LicenseName,LicenseURI,CollectionURI,FileSize,DateReleased,StudyDesc,StudyDate,ThirdPartyAnalysis
285,1.3.6.1.4.1.14519.5.2.1.3094364563653665937958...,1.3.6.1.4.1.14519.5.2.1.2108734576221172452337...,CT,2012-07-19 00:00:00.0,,ABDOMEN,5,Pancreatic-CT-CBCT-SEG,Pancreas-CT-CB_001,Varian Medical Systems,...,93,2022-08-11 13:02:12.0,Creative Commons Attribution 4.0 International...,https://creativecommons.org/licenses/by/4.0/,https://doi.org/10.7937/TCIA.ESHQ-4D90,48997226,2022-08-23 00:00:00.0,PANCREAS,2012-07-06 00:00:00.0,NO
241,1.3.6.1.4.1.14519.5.2.1.1873504185020096547520...,1.3.6.1.4.1.14519.5.2.1.2108734576221172452337...,CT,2012-07-28 00:00:00.0,,ABDOMEN,8,Pancreatic-CT-CBCT-SEG,Pancreas-CT-CB_001,Varian Medical Systems,...,93,2022-08-11 13:02:19.0,Creative Commons Attribution 4.0 International...,https://creativecommons.org/licenses/by/4.0/,https://doi.org/10.7937/TCIA.ESHQ-4D90,49000938,2022-08-23 00:00:00.0,PANCREAS,2012-07-06 00:00:00.0,NO
108,1.3.6.1.4.1.14519.5.2.1.3023827908555824457224...,1.3.6.1.4.1.14519.5.2.1.2108734576221172452337...,CT,2012-07-06 00:00:00.0,"PANCREAS DI, iDose (3)",CHEST,201,Pancreatic-CT-CBCT-SEG,Pancreas-CT-CB_001,Philips,...,134,2021-09-16 12:49:02.0,Creative Commons Attribution 4.0 International...,https://creativecommons.org/licenses/by/4.0/,https://doi.org/10.7937/TCIA.ESHQ-4D90,70579828,2021-09-16 12:49:02.0,PANCREAS,2012-07-06 00:00:00.0,NO
216,1.3.6.1.4.1.14519.5.2.1.2982170419892875273227...,1.3.6.1.4.1.14519.5.2.1.2108734576221172452337...,RTSTRUCT,2015-11-12 00:00:00.0,BSCB_LL_LR_SDCB,ABDOMEN,1,Pancreatic-CT-CBCT-SEG,Pancreas-CT-CB_001,MIM Software Inc.,...,1,2021-09-16 12:42:59.0,Creative Commons Attribution 4.0 International...,https://creativecommons.org/licenses/by/4.0/,https://doi.org/10.7937/TCIA.ESHQ-4D90,2374672,2021-09-16 12:42:59.0,PANCREAS,2012-07-06 00:00:00.0,NO
83,1.3.6.1.4.1.14519.5.2.1.3090686406672815654659...,1.3.6.1.4.1.14519.5.2.1.2108734576221172452337...,CT,2012-07-19 00:00:00.0,Aligned resampled CB02,ABDOMEN,56094,Pancreatic-CT-CBCT-SEG,Pancreas-CT-CB_001,Varian Medical Systems / MIM Software,...,134,2021-09-16 12:48:25.0,Creative Commons Attribution 4.0 International...,https://creativecommons.org/licenses/by/4.0/,https://doi.org/10.7937/TCIA.ESHQ-4D90,70603434,2021-09-16 12:48:25.0,PANCREAS,2012-07-06 00:00:00.0,NO
116,1.3.6.1.4.1.14519.5.2.1.1813878355529485429185...,1.3.6.1.4.1.14519.5.2.1.2108734576221172452337...,RTSTRUCT,2015-11-09 00:00:00.0,BSPC_LL_LR_ROI_SDPC,ABDOMEN,1,Pancreatic-CT-CBCT-SEG,Pancreas-CT-CB_001,MIM Software Inc.,...,1,2021-09-16 12:18:00.0,Creative Commons Attribution 4.0 International...,https://creativecommons.org/licenses/by/4.0/,https://doi.org/10.7937/TCIA.ESHQ-4D90,3265634,2021-09-16 12:18:00.0,PANCREAS,2012-07-06 00:00:00.0,NO
26,1.3.6.1.4.1.14519.5.2.1.1027072244733702318999...,1.3.6.1.4.1.14519.5.2.1.2108734576221172452337...,CT,2012-07-28 00:00:00.0,Aligned CB07,ABDOMEN,46049,Pancreatic-CT-CBCT-SEG,Pancreas-CT-CB_001,Varian Medical Systems / MIM Software,...,134,2021-09-16 12:49:14.0,Creative Commons Attribution 4.0 International...,https://creativecommons.org/licenses/by/4.0/,https://doi.org/10.7937/TCIA.ESHQ-4D90,70602134,2021-09-16 12:49:14.0,PANCREAS,2012-07-06 00:00:00.0,NO
123,1.3.6.1.4.1.14519.5.2.1.2817663921020979728325...,1.3.6.1.4.1.14519.5.2.1.2108734576221172452337...,RTSTRUCT,2015-11-13 00:00:00.0,BSCB_LL_LR_SDCB,ABDOMEN,1,Pancreatic-CT-CBCT-SEG,Pancreas-CT-CB_001,MIM Software Inc.,...,1,2021-09-16 12:19:18.0,Creative Commons Attribution 4.0 International...,https://creativecommons.org/licenses/by/4.0/,https://doi.org/10.7937/TCIA.ESHQ-4D90,2384336,2021-09-16 12:19:18.0,PANCREAS,2012-07-06 00:00:00.0,NO
339,1.3.6.1.4.1.14519.5.2.1.2452239859660614261024...,1.3.6.1.4.1.14519.5.2.1.2108734576221172452337...,RTDOSE,2012-07-13 00:00:00.0,Eclipse Doses,ABDOMEN,321,Pancreatic-CT-CBCT-SEG,Pancreas-CT-CB_001,Varian Medical Systems,...,1,2022-08-11 12:48:59.0,Creative Commons Attribution 4.0 International...,https://creativecommons.org/licenses/by/4.0/,https://doi.org/10.7937/TCIA.ESHQ-4D90,96955372,2022-08-23 00:00:00.0,PANCREAS,2012-07-06 00:00:00.0,NO
250,1.3.6.1.4.1.14519.5.2.1.4539053103052673453496...,1.3.6.1.4.1.14519.5.2.1.1582046344163509315842...,CT,2012-06-08 00:00:00.0,,ABDOMEN,12,Pancreatic-CT-CBCT-SEG,Pancreas-CT-CB_002,Varian Medical Systems,...,88,2022-08-11 13:02:10.0,Creative Commons Attribution 4.0 International...,https://creativecommons.org/licenses/by/4.0/,https://doi.org/10.7937/TCIA.ESHQ-4D90,46366390,2022-08-23 00:00:00.0,Pancreas/Liver DIBH 2mm,2012-05-19 00:00:00.0,NO
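
Before downloading, it may help to check how much data the selection represents. The sketch below sums the `FileSize` column per patient on a toy stand-in for `sample_series` built from a few rows of the table above. It assumes `FileSize` is reported in bytes, which the table does not state explicitly, and the helper name `size_by_patient_mb` is hypothetical.

```python
import pandas as pd

# Toy stand-in for `sample_series`, using a few rows from the table above.
toy_series = pd.DataFrame(
    {
        "PatientID": [
            "Pancreas-CT-CB_001",
            "Pancreas-CT-CB_001",
            "Pancreas-CT-CB_001",
            "Pancreas-CT-CB_002",
        ],
        "Modality": ["CT", "RTSTRUCT", "RTDOSE", "CT"],
        "FileSize": [48997226, 2374672, 96955372, 46366390],
    }
)


def size_by_patient_mb(df):
    """Sum FileSize per patient and convert to megabytes (assumes bytes)."""
    return df.groupby("PatientID")["FileSize"].sum() / 1e6


# Total size per patient, and number of series per patient and modality.
print(size_by_patient_mb(toy_series).round(1))
print(toy_series.groupby(["PatientID", "Modality"]).size())
```

In the notebook itself, the same calls can be applied directly to `sample_series`.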


In [5]:
# Download the data
nbia.downloadSeries(sample_series, path="tciaDownload/data", input_type = "df")

2024-09-24 04:34:20,969:INFO:Downloading 28 out of 28 Series Instance UIDs (scans).
2024-09-24 04:34:20,976:INFO:Downloading... https://services.cancerimagingarchive.net/nbia-api/services/v2/getImage?NewFileNames=Yes&SeriesInstanceUID=1.3.6.1.4.1.14519.5.2.1.309436456365366593795854640412150952125
2024-09-24 04:34:27,542:INFO:Downloading... https://services.cancerimagingarchive.net/nbia-api/services/v2/getImage?NewFileNames=Yes&SeriesInstanceUID=1.3.6.1.4.1.14519.5.2.1.187350418502009654752055781793285464352
2024-09-24 04:34:32,304:INFO:Downloading... https://services.cancerimagingarchive.net/nbia-api/services/v2/getImage?NewFileNames=Yes&SeriesInstanceUID=1.3.6.1.4.1.14519.5.2.1.302382790855582445722435410442490497846
2024-09-24 04:34:40,573:INFO:Downloading... https://services.cancerimagingarchive.net/nbia-api/services/v2/getImage?NewFileNames=Yes&SeriesInstanceUID=1.3.6.1.4.1.14519.5.2.1.298217041989287527322718266582029491750
2024-09-24 04:34:41,961:INFO:Downloading... https://serv

### AutoPipeline dry-run
Now let's dry-run AutoPipeline to understand its crawl functionality.  
We'll add the `--dry_run` flag to see what the command would do without actually processing any data.

In [6]:
INPUT_PATH  ="tciaDownload"
OUTPUT_PATH ="autoPipeline"

In [7]:
!autopipeline \
     $INPUT_PATH \
     $OUTPUT_PATH \
     --modalities CT,RTSTRUCT \
     --n_jobs 4 \
     --dry_run

initializing AutoPipeline...
Indexing the dataset...
  0% 0/1 [00:00<?, ?it/s]100% 1/1 [00:00<00:00, 62.51it/s]
Number of patients in the dataset: 3
Edge table not present. Forming the edge table based on the crawl data...

Total time taken: 0.03749489784240723
Saving edge table in /content/.imgtools/imgtools_tciaDownload_edges.csv
Forming the graph based on the given modalities: CT,RTSTRUCT
  relevant_study_id = self.df_new.loc[(self.df_new.edge_type.str.contains(regex_term)), "study_x"].unique()
There are 10 cases containing all CT,RTSTRUCT modalities.
dry run complete, no processing done
Outputted data to autoPipeline
Dataset info found at autoPipeline/dataset.csv
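
If you expect to run the command several times with different options, one approach is to assemble the arguments in Python first. This is only a sketch: the helper `build_autopipeline_args` is hypothetical, and it uses only the flags that appear in this notebook (`--modalities`, `--n_jobs`, `--dry_run`).

```python
import shlex


def build_autopipeline_args(input_path, output_path, modalities, n_jobs=4, dry_run=False):
    """Assemble the argument list for the `autopipeline` command used in this notebook."""
    args = [
        "autopipeline",
        input_path,
        output_path,
        "--modalities", ",".join(modalities),
        "--n_jobs", str(n_jobs),
    ]
    if dry_run:
        args.append("--dry_run")
    return args


args = build_autopipeline_args("tciaDownload", "autoPipeline", ["CT", "RTSTRUCT"], dry_run=True)

# Print the command as a single shell-safe string for inspection.
print(shlex.join(args))
```

The resulting list can be passed to `subprocess.run(args, check=True)` instead of using the `!` shell syntax.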


Running `AutoPipeline` creates a `.imgtools` folder in the dataset's parent directory.

```
parent_folder
└───.imgtools
│   ├── imgtools_dataset.csv
│   ├── imgtools_dataset.json
│   └── imgtools_dataset_edges.csv
│
└───dataset
    ├── patient-001
    ├── patient-002
    ...
```

There are three files in the `.imgtools` folder:

* `imgtools_dataset.csv` contains the metadata for the dataset
* `imgtools_dataset.json` contains the metadata for the dataset in JSON format
* `imgtools_dataset_edges.csv` contains the "edges" for the dataset.
    * An edge is a pair of DICOM series that are linked through their metadata.
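
To make the idea of an edge concrete, here is a simplified illustration with pandas. It is not Med-ImageTools' actual implementation. It assumes that the `reference_ct` value of an RTSTRUCT row is the series UID of the CT it was drawn on, and the UIDs below are made-up placeholders.

```python
import pandas as pd

# Toy crawl table: two CT series and one RTSTRUCT that references the first CT.
# The UIDs are made-up placeholders, not real series identifiers.
crawl = pd.DataFrame(
    {
        "patient_ID": ["patient-001", "patient-001", "patient-001"],
        "series": ["ct-series-A", "ct-series-B", "rt-series-C"],
        "modality": ["CT", "CT", "RTSTRUCT"],
        "reference_ct": [None, None, "ct-series-A"],
    }
)

ct = crawl[crawl["modality"] == "CT"]
rtstruct = crawl[crawl["modality"] == "RTSTRUCT"]

# Pair each CT with the RTSTRUCT whose reference_ct points at it.
# pandas appends _x (left, CT) and _y (right, RTSTRUCT) to overlapping columns.
edges = ct.merge(rtstruct, left_on="series", right_on="reference_ct")
edges["edge_type"] = 2  # RTSTRUCT-CT in the notebook's edge-type list

print(edges[["series_x", "series_y", "edge_type"]])
```

The `_x` and `_y` suffixes produced by the merge mirror the column names in the edge table shown later in this notebook. The CT without a referencing RTSTRUCT (`ct-series-B`) produces no edge.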


In [8]:
parent_folder   = os.path.dirname(INPUT_PATH)
imgtools_folder = os.path.join(parent_folder, ".imgtools")
imgtools_files  = os.listdir(imgtools_folder)

print("Files generated by Med-ImageTools:\n")
print("\n".join(imgtools_files))

Files generated by Med-ImageTools:

imgtools_tciaDownload_edges.csv
imgtools_tciaDownload.json
imgtools_tciaDownload.csv


This is what the crawled dataset looks like.
Each row represents a DICOM series (CT, MRI, PET, RTSTRUCT, SEG, RTDOSE, RTPLAN, etc).

In [9]:
# select the crawl table by name; os.listdir() returns files in arbitrary order
df_crawl = pd.read_csv(os.path.join(imgtools_folder, "imgtools_tciaDownload.csv"), index_col=0)
df_crawl

Unnamed: 0,patient_ID,study,study_description,series,series_description,subseries,modality,instances,instance_uid,reference_ct,...,reference_frame,folder,orientation,orientation_type,MR_repetition_time,MR_echo_time,MR_scan_sequence,MR_magnetic_field_strength,MR_imaged_nucleus,file_path
0,Pancreas-CT-CB_002,1.3.6.1.4.1.14519.5.2.1.1582046344163509315842...,Pancreas/Liver DIBH 2mm,1.3.6.1.4.1.14519.5.2.1.6512363249234457542274...,,1,CT,88,1.3.6.1.4.1.14519.5.2.1.8308560590762491066032...,,...,1.3.6.1.4.1.14519.5.2.1.1756147160543368957118...,tciaDownload/data/1.3.6.1.4.1.14519.5.2.1.6512...,"[0.99999951136927, 0, 0.00098856523562, 0, 1, 0]",,,,,,,tciaDownload/data/1.3.6.1.4.1.14519.5.2.1.6512...
1,Pancreas-CT-CB_002,1.3.6.1.4.1.14519.5.2.1.1582046344163509315842...,Pancreas/Liver DIBH 2mm,1.3.6.1.4.1.14519.5.2.1.1324728530717687446661...,Aligned CBCT01,1,CT,228,1.3.6.1.4.1.14519.5.2.1.1442070368379428302503...,,...,1.3.6.1.4.1.14519.5.2.1.2395440829883110188894...,tciaDownload/data/1.3.6.1.4.1.14519.5.2.1.1324...,"[1.0, 0.0, 0.0, 0.0, 1.0, 0.0]",,,,,,,tciaDownload/data/1.3.6.1.4.1.14519.5.2.1.1324...
2,Pancreas-CT-CB_002,1.3.6.1.4.1.14519.5.2.1.1582046344163509315842...,Pancreas/Liver DIBH 2mm,1.3.6.1.4.1.14519.5.2.1.4539053103052673453496...,,1,CT,88,1.3.6.1.4.1.14519.5.2.1.3303738426766133244428...,,...,1.3.6.1.4.1.14519.5.2.1.2158793040532861477462...,tciaDownload/data/1.3.6.1.4.1.14519.5.2.1.4539...,"[1, 0, 0, 0, 1, 0]",,,,,,,tciaDownload/data/1.3.6.1.4.1.14519.5.2.1.4539...
3,Pancreas-CT-CB_002,1.3.6.1.4.1.14519.5.2.1.1582046344163509315842...,Pancreas/Liver DIBH 2mm,1.3.6.1.4.1.14519.5.2.1.3214919025494098598614...,Eclipse Doses,default,RTDOSE,1,1.3.6.1.4.1.14519.5.2.1.2812895455145780179879...,,...,1.3.6.1.4.1.14519.5.2.1.2395440829883110188894...,tciaDownload/data/1.3.6.1.4.1.14519.5.2.1.3214...,"[1, 0, 0, 0, 1, 0]",,,,,,,tciaDownload/data/1.3.6.1.4.1.14519.5.2.1.3214...
4,Pancreas-CT-CB_002,1.3.6.1.4.1.14519.5.2.1.1582046344163509315842...,Pancreas/Liver DIBH 2mm,1.3.6.1.4.1.14519.5.2.1.2129190817963846087988...,BSCB_LL_LR_SDCB,default,RTSTRUCT,1,1.3.6.1.4.1.14519.5.2.1.2129190817963846087988...,1.3.6.1.4.1.14519.5.2.1.1324728530717687446661...,...,1.3.6.1.4.1.14519.5.2.1.2395440829883110188894...,tciaDownload/data/1.3.6.1.4.1.14519.5.2.1.2129...,,,,,,,,tciaDownload/data/1.3.6.1.4.1.14519.5.2.1.2129...
5,Pancreas-CT-CB_002,1.3.6.1.4.1.14519.5.2.1.1582046344163509315842...,Pancreas/Liver DIBH 2mm,1.3.6.1.4.1.14519.5.2.1.1058177309410879024400...,"DIBH, iDose (3)",3,CT,228,1.3.6.1.4.1.14519.5.2.1.3070319204424362694259...,,...,1.3.6.1.4.1.14519.5.2.1.2395440829883110188894...,tciaDownload/data/1.3.6.1.4.1.14519.5.2.1.1058...,"[1, 0, 0, 0, 1, 0]",,,,,,,tciaDownload/data/1.3.6.1.4.1.14519.5.2.1.1058...
6,Pancreas-CT-CB_002,1.3.6.1.4.1.14519.5.2.1.1582046344163509315842...,Pancreas/Liver DIBH 2mm,1.3.6.1.4.1.14519.5.2.1.1116851988950543841719...,Aligned CT,1,CT,228,1.3.6.1.4.1.14519.5.2.1.3282911217272883942436...,,...,1.3.6.1.4.1.14519.5.2.1.2395440829883110188894...,tciaDownload/data/1.3.6.1.4.1.14519.5.2.1.1116...,"[1.0, 0.0, 0.0, 0.0, 1.0, 0.0]",,,,,,,tciaDownload/data/1.3.6.1.4.1.14519.5.2.1.1116...
7,Pancreas-CT-CB_002,1.3.6.1.4.1.14519.5.2.1.1582046344163509315842...,Pancreas/Liver DIBH 2mm,1.3.6.1.4.1.14519.5.2.1.1768106710823996410134...,BSCB_LL_LR_SDCB,default,RTSTRUCT,1,1.3.6.1.4.1.14519.5.2.1.1768106710823996410134...,1.3.6.1.4.1.14519.5.2.1.1116851988950543841719...,...,1.3.6.1.4.1.14519.5.2.1.2395440829883110188894...,tciaDownload/data/1.3.6.1.4.1.14519.5.2.1.1768...,,,,,,,,tciaDownload/data/1.3.6.1.4.1.14519.5.2.1.1768...
8,Pancreas-CT-CB_002,1.3.6.1.4.1.14519.5.2.1.1582046344163509315842...,Pancreas/Liver DIBH 2mm,1.3.6.1.4.1.14519.5.2.1.2114137508584631957882...,BSPC_LL_LR_ROI_SDPC,default,RTSTRUCT,1,1.3.6.1.4.1.14519.5.2.1.2114137508584631957882...,1.3.6.1.4.1.14519.5.2.1.1058177309410879024400...,...,1.3.6.1.4.1.14519.5.2.1.2395440829883110188894...,tciaDownload/data/1.3.6.1.4.1.14519.5.2.1.2114...,,,,,,,,tciaDownload/data/1.3.6.1.4.1.14519.5.2.1.2114...
9,Pancreas-CT-CB_003,1.3.6.1.4.1.14519.5.2.1.2027136189212788770525...,PANCREAS,1.3.6.1.4.1.14519.5.2.1.7696748946279977700328...,BSO1_BSO2_ROI_SDO1_SDO2,default,RTSTRUCT,1,1.3.6.1.4.1.14519.5.2.1.7696748946279977700328...,1.3.6.1.4.1.14519.5.2.1.6161518552827897770510...,...,1.3.6.1.4.1.14519.5.2.1.3113641383176229395155...,tciaDownload/data/1.3.6.1.4.1.14519.5.2.1.7696...,,,,,,,,tciaDownload/data/1.3.6.1.4.1.14519.5.2.1.7696...


This is what the edge table for the dataset looks like.

In [10]:
# select the edge table by name; os.listdir() returns files in arbitrary order
df_edges = pd.read_csv(os.path.join(imgtools_folder, "imgtools_tciaDownload_edges.csv"))
df_edges

Unnamed: 0,patient_ID_x,study_x,study_description_x,series_x,series_description_x,subseries_x,modality_x,instances_x,instance_uid_x,reference_ct_x,...,folder_y,orientation_y,orientation_type_y,MR_repetition_time_y,MR_echo_time_y,MR_scan_sequence_y,MR_magnetic_field_strength_y,MR_imaged_nucleus_y,file_path_y,edge_type
0,Pancreas-CT-CB_002,1.3.6.1.4.1.14519.5.2.1.1582046344163509315842...,Pancreas/Liver DIBH 2mm,1.3.6.1.4.1.14519.5.2.1.1324728530717687446661...,Aligned CBCT01,1.0,CT,228,1.3.6.1.4.1.14519.5.2.1.1442070368379428302503...,,...,tciaDownload/data/1.3.6.1.4.1.14519.5.2.1.2129...,,,,,,,,tciaDownload/data/1.3.6.1.4.1.14519.5.2.1.2129...,2
1,Pancreas-CT-CB_002,1.3.6.1.4.1.14519.5.2.1.1582046344163509315842...,Pancreas/Liver DIBH 2mm,1.3.6.1.4.1.14519.5.2.1.1058177309410879024400...,"DIBH, iDose (3)",3.0,CT,228,1.3.6.1.4.1.14519.5.2.1.3070319204424362694259...,,...,tciaDownload/data/1.3.6.1.4.1.14519.5.2.1.2114...,,,,,,,,tciaDownload/data/1.3.6.1.4.1.14519.5.2.1.2114...,2
2,Pancreas-CT-CB_002,1.3.6.1.4.1.14519.5.2.1.1582046344163509315842...,Pancreas/Liver DIBH 2mm,1.3.6.1.4.1.14519.5.2.1.1116851988950543841719...,Aligned CT,1.0,CT,228,1.3.6.1.4.1.14519.5.2.1.3282911217272883942436...,,...,tciaDownload/data/1.3.6.1.4.1.14519.5.2.1.1768...,,,,,,,,tciaDownload/data/1.3.6.1.4.1.14519.5.2.1.1768...,2
3,Pancreas-CT-CB_003,1.3.6.1.4.1.14519.5.2.1.2027136189212788770525...,PANCREAS,1.3.6.1.4.1.14519.5.2.1.1767644549555512228280...,Aligned CT,1.0,CT,93,1.3.6.1.4.1.14519.5.2.1.7921605915459345120496...,,...,tciaDownload/data/1.3.6.1.4.1.14519.5.2.1.7616...,,,,,,,,tciaDownload/data/1.3.6.1.4.1.14519.5.2.1.7616...,2
4,Pancreas-CT-CB_003,1.3.6.1.4.1.14519.5.2.1.2027136189212788770525...,PANCREAS,1.3.6.1.4.1.14519.5.2.1.6161518552827897770510...,,1.0,CT,88,1.3.6.1.4.1.14519.5.2.1.2356922013514428755032...,,...,tciaDownload/data/1.3.6.1.4.1.14519.5.2.1.7696...,,,,,,,,tciaDownload/data/1.3.6.1.4.1.14519.5.2.1.7696...,2
5,Pancreas-CT-CB_003,1.3.6.1.4.1.14519.5.2.1.2027136189212788770525...,PANCREAS,1.3.6.1.4.1.14519.5.2.1.1004411337869179670896...,Aligned CT,1.0,CT,93,1.3.6.1.4.1.14519.5.2.1.2008744807020779037397...,,...,tciaDownload/data/1.3.6.1.4.1.14519.5.2.1.1693...,,,,,,,,tciaDownload/data/1.3.6.1.4.1.14519.5.2.1.1693...,2
6,Pancreas-CT-CB_003,1.3.6.1.4.1.14519.5.2.1.2027136189212788770525...,PANCREAS,1.3.6.1.4.1.14519.5.2.1.9566682535928442256760...,DI,,CT,93,1.3.6.1.4.1.14519.5.2.1.2416640574708273083639...,,...,tciaDownload/data/1.3.6.1.4.1.14519.5.2.1.1043...,,,,,,,,tciaDownload/data/1.3.6.1.4.1.14519.5.2.1.1043...,2
7,Pancreas-CT-CB_001,1.3.6.1.4.1.14519.5.2.1.2108734576221172452337...,PANCREAS,1.3.6.1.4.1.14519.5.2.1.3023827908555824457224...,"PANCREAS DI, iDose (3)",1.0,CT,134,1.3.6.1.4.1.14519.5.2.1.3328043612256966168356...,,...,tciaDownload/data/1.3.6.1.4.1.14519.5.2.1.1813...,,,,,,,,tciaDownload/data/1.3.6.1.4.1.14519.5.2.1.1813...,2
8,Pancreas-CT-CB_001,1.3.6.1.4.1.14519.5.2.1.2108734576221172452337...,PANCREAS,1.3.6.1.4.1.14519.5.2.1.3090686406672815654659...,Aligned resampled CB02,1.0,CT,134,1.3.6.1.4.1.14519.5.2.1.8740201381973198460214...,,...,tciaDownload/data/1.3.6.1.4.1.14519.5.2.1.2982...,,,,,,,,tciaDownload/data/1.3.6.1.4.1.14519.5.2.1.2982...,2
9,Pancreas-CT-CB_001,1.3.6.1.4.1.14519.5.2.1.2108734576221172452337...,PANCREAS,1.3.6.1.4.1.14519.5.2.1.1027072244733702318999...,Aligned CB07,1.0,CT,134,1.3.6.1.4.1.14519.5.2.1.1971756455065354165673...,,...,tciaDownload/data/1.3.6.1.4.1.14519.5.2.1.2817...,,,,,,,,tciaDownload/data/1.3.6.1.4.1.14519.5.2.1.2817...,2


Let's see how many edges of each type we have in this dataset.
There are 8 edge types detected by Med-ImageTools:
* (0) RTDOSE-RTSTRUCT
* (1) RTDOSE-CT
* (2) RTSTRUCT-CT
* (3) RTSTRUCT-PET
* (4) CT-PET
* (5) RTDOSE-PET
* (6) RTPLAN-RTSTRUCT
* (7) SEG-CT

In [11]:
print(df_edges.edge_type.value_counts())

edge_type
2    10
Name: count, dtype: int64
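
The numeric codes are easier to read with their names attached. The sketch below maps the codes from the list above to their labels. The helper `named_edge_counts` is hypothetical and uses only the standard library.

```python
from collections import Counter

# Edge-type codes as listed above.
EDGE_TYPE_NAMES = {
    0: "RTDOSE-RTSTRUCT",
    1: "RTDOSE-CT",
    2: "RTSTRUCT-CT",
    3: "RTSTRUCT-PET",
    4: "CT-PET",
    5: "RTDOSE-PET",
    6: "RTPLAN-RTSTRUCT",
    7: "SEG-CT",
}


def named_edge_counts(edge_types):
    """Count edges per type and label each count with its readable name."""
    counts = Counter(edge_types)
    return {EDGE_TYPE_NAMES.get(code, f"unknown ({code})"): n for code, n in counts.items()}


# Ten edges of type 2, matching the value_counts output above.
print(named_edge_counts([2] * 10))
```

In the notebook, `named_edge_counts(df_edges.edge_type.tolist())` would give the same summary for the real edge table.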


# 3 Processing a sample TCIA dataset using `AutoPipeline` with radiotherapy data

Now let's actually run the AutoPipeline and see what we get!

In [12]:
!autopipeline \
     $INPUT_PATH \
     $OUTPUT_PATH \
     --modalities CT,RTSTRUCT \
     --n_jobs 4

initializing AutoPipeline...
The dataset has already been indexed.
Edge table is already present. Loading the data...
Forming the graph based on the given modalities: CT,RTSTRUCT
  relevant_study_id = self.df_new.loc[(self.df_new.edge_type.str.contains(regex_term)), "study_x"].unique()
There are 10 cases containing all CT,RTSTRUCT modalities.
starting AutoPipeline...
5_Pancreas-CT-CB_003
9_Pancreas-CT-CB_001
2_Pancreas-CT-CB_002
8_Pancreas-CT-CB_001
Processing: 5_Pancreas-CT-CB_003
Processing: 2_Pancreas-CT-CB_002
Processing: 9_Pancreas-CT-CB_001
Processing: 8_Pancreas-CT-CB_001
5_Pancreas-CT-CB_003  start
8_Pancreas-CT-CB_001  start
9_Pancreas-CT-CB_001  start
8_Pancreas-CT-CB_001  SAVED IMAGE
labels: {'Bowel_sm_CBCT': 0, 'LUNG_L': 1, 'LUNG_R': 2, 'Stomach_duo_CBCT': 3}
9_Pancreas-CT-CB_001  SAVED IMAGE
labels: {'Bowel_sm_CBCT': 0, 'LUNG_L': 1, 'LUNG_R': 2, 'Stomach_duo_CBCT': 3}
5_Pancreas-CT-CB_003  SAVED IMAGE
labels: {'Bowel_sm_CBCT': 0, 'LUNG_L': 1, 'LUNG_R': 2, 'Stomach_duo_CBCT

The output folder will be structured like this:
```
output_folder
├── dataset.csv
├── report.md
│
├── 0_patient-001
│   ├── CT
│   │   └── CT.nii.gz
│   └── RTSTRUCT_CT
│       ├── Head.nii.gz
│       ├── Shoulder.nii.gz
│       ├── Knees.nii.gz
│       └── Toes.nii.gz
│
├── 1_patient-002
├── 2_patient-003
...
```
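
A short helper can confirm that an output folder follows this layout. The sketch below builds a tiny mock of the structure in a temporary directory so that it runs anywhere. The helper name `list_case_files` is hypothetical.

```python
import tempfile
from pathlib import Path


def list_case_files(output_folder):
    """Map each case folder name to the sorted relative paths of its .nii.gz files."""
    output_folder = Path(output_folder)
    cases = {}
    for case_dir in sorted(p for p in output_folder.iterdir() if p.is_dir()):
        files = sorted(f.relative_to(case_dir).as_posix() for f in case_dir.rglob("*.nii.gz"))
        cases[case_dir.name] = files
    return cases


# Build a tiny mock of the layout shown above.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for rel in ("0_patient-001/CT/CT.nii.gz", "0_patient-001/RTSTRUCT_CT/Head.nii.gz"):
        path = root / rel
        path.parent.mkdir(parents=True, exist_ok=True)
        path.touch()
    (root / "dataset.csv").touch()  # top-level files are ignored by the helper
    mock_listing = list_case_files(root)

print(mock_listing)
```

Calling `list_case_files(OUTPUT_PATH)` on the real output would list the CT and mask files for every processed case.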

Let's see what's inside the folder:

In [13]:
output_folders = [path for path in os.listdir(OUTPUT_PATH) if os.path.isdir(os.path.join(OUTPUT_PATH, path))]
print("Output folders:\n")
print("\n".join(output_folders))


Output folders:

1_Pancreas-CT-CB_002
0_Pancreas-CT-CB_002
2_Pancreas-CT-CB_002
8_Pancreas-CT-CB_001
4_Pancreas-CT-CB_003
6_Pancreas-CT-CB_003
5_Pancreas-CT-CB_003
7_Pancreas-CT-CB_001
9_Pancreas-CT-CB_001
3_Pancreas-CT-CB_003


Let's take a look at the `dataset.csv` file.

In [14]:
df_data = pd.read_csv(os.path.join(OUTPUT_PATH, "dataset.csv"), index_col=0)
df_data

Unnamed: 0,study,patient_ID,series_CT,input_folder_CT,series_RTSTRUCT_CT,input_folder_RTSTRUCT_CT,BodyPartExamined,DataCollectionDiameter,SliceThickness,PatientPosition,...,output_folder_RTSTRUCT_CT,Modalities,numRTSTRUCTs,raw_labels_Bowel_sm_planCT,raw_labels_ROI,raw_labels_Stomach_duo_planCT,raw_labels_BowelSmObs1,raw_labels_BowelSmObs2,raw_labels_StomachDuoObs1,raw_labels_StomachDuoObs2
0_Pancreas-CT-CB_002,1.3.6.1.4.1.14519.5.2.1.1582046344163509315842...,Pancreas-CT-CB_002,1.3.6.1.4.1.14519.5.2.1.1324728530717687446661...,data/1.3.6.1.4.1.14519.5.2.1.13247285307176874...,1.3.6.1.4.1.14519.5.2.1.2129190817963846087988...,data/1.3.6.1.4.1.14519.5.2.1.21291908179638460...,ABDOMEN,464.906433,2.0,HFS,...,0_Pancreas-CT-CB_002/RTSTRUCT_CT,"['RTSTRUCT', 'CT']",1.0,,,,,,,
1_Pancreas-CT-CB_002,1.3.6.1.4.1.14519.5.2.1.1582046344163509315842...,Pancreas-CT-CB_002,1.3.6.1.4.1.14519.5.2.1.1058177309410879024400...,data/1.3.6.1.4.1.14519.5.2.1.10581773094108790...,1.3.6.1.4.1.14519.5.2.1.2114137508584631957882...,data/1.3.6.1.4.1.14519.5.2.1.21141375085846319...,ABDOMEN,700.0,2.0,HFS,...,1_Pancreas-CT-CB_002/RTSTRUCT_CT,"['RTSTRUCT', 'CT']",1.0,0.0,3.0,4.0,,,,
2_Pancreas-CT-CB_002,1.3.6.1.4.1.14519.5.2.1.1582046344163509315842...,Pancreas-CT-CB_002,1.3.6.1.4.1.14519.5.2.1.1116851988950543841719...,data/1.3.6.1.4.1.14519.5.2.1.11168519889505438...,1.3.6.1.4.1.14519.5.2.1.1768106710823996410134...,data/1.3.6.1.4.1.14519.5.2.1.17681067108239964...,ABDOMEN,464.906433,2.0,HFS,...,2_Pancreas-CT-CB_002/RTSTRUCT_CT,"['RTSTRUCT', 'CT']",1.0,,,,,,,
3_Pancreas-CT-CB_003,1.3.6.1.4.1.14519.5.2.1.2027136189212788770525...,Pancreas-CT-CB_003,1.3.6.1.4.1.14519.5.2.1.1767644549555512228280...,data/1.3.6.1.4.1.14519.5.2.1.17676445495555122...,1.3.6.1.4.1.14519.5.2.1.7616487906661996620699...,data/1.3.6.1.4.1.14519.5.2.1.76164879066619966...,ABDOMEN,464.906433,3.0,HFS,...,3_Pancreas-CT-CB_003/RTSTRUCT_CT,"['RTSTRUCT', 'CT']",1.0,,,,,,,
4_Pancreas-CT-CB_003,1.3.6.1.4.1.14519.5.2.1.2027136189212788770525...,Pancreas-CT-CB_003,1.3.6.1.4.1.14519.5.2.1.6161518552827897770510...,data/1.3.6.1.4.1.14519.5.2.1.61615185528278977...,1.3.6.1.4.1.14519.5.2.1.7696748946279977700328...,data/1.3.6.1.4.1.14519.5.2.1.76967489462799777...,ABDOMEN,464.906423,1.98849,HFS,...,4_Pancreas-CT-CB_003/RTSTRUCT_CT,"['RTSTRUCT', 'CT']",1.0,,2.0,,0.0,1.0,3.0,4.0
5_Pancreas-CT-CB_003,1.3.6.1.4.1.14519.5.2.1.2027136189212788770525...,Pancreas-CT-CB_003,1.3.6.1.4.1.14519.5.2.1.1004411337869179670896...,data/1.3.6.1.4.1.14519.5.2.1.10044113378691796...,1.3.6.1.4.1.14519.5.2.1.1693968016598330084597...,data/1.3.6.1.4.1.14519.5.2.1.16939680165983300...,ABDOMEN,464.906433,3.0,HFS,...,5_Pancreas-CT-CB_003/RTSTRUCT_CT,"['RTSTRUCT', 'CT']",1.0,,,,,,,
6_Pancreas-CT-CB_003,1.3.6.1.4.1.14519.5.2.1.2027136189212788770525...,Pancreas-CT-CB_003,1.3.6.1.4.1.14519.5.2.1.9566682535928442256760...,data/1.3.6.1.4.1.14519.5.2.1.95666825359284422...,1.3.6.1.4.1.14519.5.2.1.1043846495280936961716...,data/1.3.6.1.4.1.14519.5.2.1.10438464952809369...,ABDOMEN,600.0,3.0,HFS,...,6_Pancreas-CT-CB_003/RTSTRUCT_CT,"['RTSTRUCT', 'CT']",1.0,0.0,3.0,4.0,,,,
7_Pancreas-CT-CB_001,1.3.6.1.4.1.14519.5.2.1.2108734576221172452337...,Pancreas-CT-CB_001,1.3.6.1.4.1.14519.5.2.1.3023827908555824457224...,data/1.3.6.1.4.1.14519.5.2.1.30238279085558244...,1.3.6.1.4.1.14519.5.2.1.1813878355529485429185...,data/1.3.6.1.4.1.14519.5.2.1.18138783555294854...,ABDOMEN,700.0,3.0,HFS,...,7_Pancreas-CT-CB_001/RTSTRUCT_CT,"['RTSTRUCT', 'CT']",1.0,0.0,3.0,4.0,,,,
8_Pancreas-CT-CB_001,1.3.6.1.4.1.14519.5.2.1.2108734576221172452337...,Pancreas-CT-CB_001,1.3.6.1.4.1.14519.5.2.1.3090686406672815654659...,data/1.3.6.1.4.1.14519.5.2.1.30906864066728156...,1.3.6.1.4.1.14519.5.2.1.2982170419892875273227...,data/1.3.6.1.4.1.14519.5.2.1.29821704198928752...,ABDOMEN,261.729645,3.0,HFS,...,8_Pancreas-CT-CB_001/RTSTRUCT_CT,"['RTSTRUCT', 'CT']",1.0,,,,,,,
9_Pancreas-CT-CB_001,1.3.6.1.4.1.14519.5.2.1.2108734576221172452337...,Pancreas-CT-CB_001,1.3.6.1.4.1.14519.5.2.1.1027072244733702318999...,data/1.3.6.1.4.1.14519.5.2.1.10270722447337023...,1.3.6.1.4.1.14519.5.2.1.2817663921020979728325...,data/1.3.6.1.4.1.14519.5.2.1.28176639210209797...,ABDOMEN,261.729645,3.0,HFS,...,9_Pancreas-CT-CB_001/RTSTRUCT_CT,"['RTSTRUCT', 'CT']",1.0,,,,,,,


There are 3 main types of columns in the dataset.csv file that are important for analysis:
* `patient_ID`: The patient ID, without the index prefix used in the output folder names (for example, `Pancreas-CT-CB_002` for the case `0_Pancreas-CT-CB_002`).
* `output_folder_{modality}`: Path to the output folder per modality.
  * For example, for CT,RTSTRUCT modality pairs, the output folders will be `output_folder_CT` and `output_folder_RTSTRUCT_CT`.
* DICOM imaging metadata: Imaging parameters saved in the metadata.

In [15]:
df_data.columns.tolist()

['study',
 'patient_ID',
 'series_CT',
 'input_folder_CT',
 'series_RTSTRUCT_CT',
 'input_folder_RTSTRUCT_CT',
 'BodyPartExamined',
 'DataCollectionDiameter',
 'SliceThickness',
 'PatientPosition',
 'Manufacturer',
 'ScanOptions',
 'RescaleType',
 'RescaleSlope',
 'ManufacturerModelName',
 'PixelSize',
 'KVP',
 'XRayTubeCurrent',
 'ReconstructionDiameter',
 'ConvolutionKernel',
 'size_CT',
 'output_folder_CT',
 'numROIs',
 'metadata_RTSTRUCT_CT',
 'raw_labels_Bowel_sm_CBCT',
 'raw_labels_LUNG_L',
 'raw_labels_LUNG_R',
 'raw_labels_Stomach_duo_CBCT',
 'output_folder_RTSTRUCT_CT',
 'Modalities',
 'numRTSTRUCTs',
 'raw_labels_Bowel_sm_planCT',
 'raw_labels_ROI',
 'raw_labels_Stomach_duo_planCT',
 'raw_labels_BowelSmObs1',
 'raw_labels_BowelSmObs2',
 'raw_labels_StomachDuoObs1',
 'raw_labels_StomachDuoObs2']
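
With this many columns, it can help to group them by role. The sketch below splits column names by prefix and recovers the ROI names from the `raw_labels_*` columns, which appear to hold one column per ROI label. The helper `group_columns` is hypothetical and works on a plain list of names.

```python
def group_columns(columns):
    """Split dataset.csv column names into groups based on their prefix."""
    groups = {"output_folders": [], "raw_labels": [], "other": []}
    for col in columns:
        if col.startswith("output_folder_"):
            groups["output_folders"].append(col)
        elif col.startswith("raw_labels_"):
            groups["raw_labels"].append(col)
        else:
            groups["other"].append(col)
    return groups


# A subset of the column names printed above.
example_columns = [
    "patient_ID",
    "SliceThickness",
    "output_folder_CT",
    "raw_labels_LUNG_L",
    "raw_labels_LUNG_R",
    "output_folder_RTSTRUCT_CT",
]
groups = group_columns(example_columns)

# Strip the prefix to recover the ROI names.
roi_names = [c[len("raw_labels_"):] for c in groups["raw_labels"]]
print(groups["output_folders"], roi_names)
```

In the notebook, `group_columns(df_data.columns)` applies the same grouping to the full table.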

If you want to create a PyTorch Dataset/DataLoader using a Med-ImageTools processed dataset, you can use the `dataset.csv` to easily refer to the data.

Here's an example of what a PyTorch Dataset/DataLoader might look like:

```python
import os

import pandas as pd
import SimpleITK as sitk
from torch.utils.data import Dataset, DataLoader


class MedImageToolsDataset(Dataset):
    def __init__(self,
                 data_folder,
                 roi="GTV"):
        self.data_dir = data_folder
        self.data_df  = pd.read_csv(os.path.join(data_folder, "dataset.csv"))
        self.roi      = roi

    def __len__(self):
        return len(self.data_df)

    def __getitem__(self, idx):
        # get the row of the dataframe
        row = self.data_df.iloc[idx]

        # get image and mask
        img = sitk.ReadImage(os.path.join(self.data_dir, row["output_folder_CT"], "CT.nii.gz"))
        mask = sitk.ReadImage(os.path.join(self.data_dir, row["output_folder_RTSTRUCT_CT"], f"{self.roi}.nii.gz"))

        # return the pair!
        return img, mask

# Create a DataLoader from the AutoPipeline output folder.
# The default collate function cannot batch SimpleITK images,
# so each batch is kept as a plain list of (img, mask) pairs.
dataloader = DataLoader(MedImageToolsDataset("autoPipeline"),
                        batch_size=32,
                        collate_fn=lambda batch: batch)
```

A fuller version of the code, with error handling and more comments, looks like this. Try it out!

In [16]:
from torch.utils.data import Dataset, DataLoader
from torch.utils.data.dataloader import default_collate
import SimpleITK as sitk
import re
import pathlib

# Define the Dataset class
class MedImageToolsDataset(Dataset):
    def __init__(self,
                 data_folder,
                 roi="GTV"):
        """
        Parameters
        ----------
        data_folder : str
            Path to the folder containing the dataset.csv file and the output folders
        roi : str
            Name of the Region of Interest (ROI) to extract from the RTSTRUCT masks.
            Regex expressions are accepted.
        """

        if not os.path.exists(data_folder):
            raise FileNotFoundError(f"Folder {data_folder} does not exist")
        self.data_dir = data_folder

        # Load the dataset.csv file
        data_df_path = os.path.join(data_folder, "dataset.csv")
        if not os.path.exists(data_df_path):
            raise FileNotFoundError(f"File dataset.csv not found in {data_folder}")
        self.data_df  = pd.read_csv(data_df_path)

        self.output_cols   = [col for col in self.data_df.columns if col.startswith("output_folder_")]
        self.roi           = roi

    def __len__(self):
        return len(self.data_df)

    def __getitem__(self, idx):
        row = self.data_df.iloc[idx]
        img, mask = None, None

        for col in self.output_cols:
            if 'folder_CT' in col:
                # Load the CT volume for this case
                img_path = pathlib.Path(self.data_dir, row[col], "CT.nii.gz").as_posix()
                if not os.path.exists(img_path):
                    raise FileNotFoundError(f"CT image not found at {img_path}")
                img = sitk.ReadImage(img_path)
            elif 'RTSTRUCT' in col:
                # Load the first mask whose name matches the requested ROI
                mask_folder_path = pathlib.Path(self.data_dir, row[col]).as_posix()
                if not os.path.exists(mask_folder_path):
                    continue
                for mask_file in os.listdir(mask_folder_path):
                    roi_name = mask_file.split(".")[0]
                    if re.fullmatch(self.roi, roi_name, flags=re.IGNORECASE) or self.roi in roi_name:
                        mask = sitk.ReadImage(os.path.join(mask_folder_path, mask_file))
                        break
                if mask is None:
                    raise FileNotFoundError(f"Mask of {self.roi} not found in {row[col]}")

        if img is not None and mask is not None:
            return img, mask
        return None

# Define a collate function
def my_collate(batch):
    "Keeps the batch as a list of (img, mask) pairs, dropping samples that are None"
    return [x for x in batch if x is not None]
    # batch = filter(lambda x: x is not None, batch)
    # return default_collate(batch)

# Create a DataLoader
dataloader = DataLoader(MedImageToolsDataset(OUTPUT_PATH,
                                             roi="LUNG_L"),
                        batch_size=4,
                        collate_fn=my_collate)

# Print the first batch of data
batch = next(iter(dataloader))
print("Batch size:", len(batch))

img, mask = batch[0]
print(f"Image: {img.GetSize()}")
print(f"Mask: {mask.GetSize()}")

Batch size: 4
Image: (500, 500, 228)
Mask: (500, 500, 228)
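
The ROI lookup in `__getitem__` accepts a mask when the `roi` argument either fully matches the ROI name as a case-insensitive regular expression or appears in it as a plain substring. The sketch below isolates that rule so you can test a pattern before building a Dataset. The helper `matching_rois` is hypothetical, and the file names assume that each mask is saved as `<label>.nii.gz`, using the labels printed by AutoPipeline above.

```python
import re


def matching_rois(roi, mask_files):
    """Return the ROI names that the Dataset's matching rule would accept."""
    matches = []
    for mask_file in mask_files:
        roi_name = mask_file.split(".")[0]
        if re.fullmatch(roi, roi_name, flags=re.IGNORECASE) or roi in roi_name:
            matches.append(roi_name)
    return matches


# Assumed file names, derived from the labels printed by AutoPipeline above.
mask_files = [
    "Bowel_sm_CBCT.nii.gz",
    "LUNG_L.nii.gz",
    "LUNG_R.nii.gz",
    "Stomach_duo_CBCT.nii.gz",
]

print(matching_rois("LUNG_L", mask_files))   # exact name
print(matching_rois("lung_.*", mask_files))  # case-insensitive regex
print(matching_rois("CBCT", mask_files))     # plain substring
print(matching_rois("GTV", mask_files))      # the Dataset's default ROI
```

Note that the Dataset stops at the first match, so a pattern that matches several ROIs returns whichever file `os.listdir` happens to yield first. With the default `roi="GTV"`, none of these names match and `__getitem__` raises `FileNotFoundError`.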
