You can download and run this notebook locally, or run it for free in a cloud environment using Colab or SageMaker Studio Lab:

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/kirbyju/TCIA_Notebooks/blob/main/TCIA_NCTN_Annotations.ipynb)

[![Open In SageMaker Studio Lab](https://studiolab.sagemaker.aws/studiolab.svg)](https://studiolab.sagemaker.aws/import/github.com/kirbyju/TCIA_Notebooks/blob/main/TCIA_NCTN_Annotations.ipynb)

# Summary

Access to large, high-quality data is essential for researchers to understand disease and precision medicine pathways, especially in cancer. However, HIPAA constraints make sharing medical images outside an individual institution a complex process. [The Cancer Imaging Archive (TCIA)](https://www.cancerimagingarchive.net/) is a public service funded by the National Cancer Institute that addresses this challenge by providing hosting and de-identification services, taking major burdens of data sharing off researchers.

TCIA has published over 200 unique data collections containing more than 60 million images. Recognizing that images alone are not enough to conduct meaningful research, most collections are linked to rich supporting data including patient outcomes, treatment information, genomic / proteomic analyses, and expert image analyses (segmentations, annotations, and radiomic / radiogenomic features).

**This notebook focuses on using the TCIA REST APIs to access NCI National Clinical Trials Network (NCTN) collections. These collections contain radiologist-generated tumor annotations (3D segmentation labels and seed points) that can be used to develop automated methods for detecting and segmenting tumors.**  More information about this activity can be found on the [Imaging Clinical Trials](https://wiki.cancerimagingarchive.net/x/BQHDAg) page on TCIA.


# 1 Setup

The following installs **[tcia_utils](https://pypi.org/project/tcia-utils/)**, which contains a variety of useful functions for accessing TCIA data via Python.  It also installs **[simpleDicomViewer](https://pypi.org/project/simpleDicomViewer/)** as a very basic way to visualize example images.  There are a few conditional steps that will execute if you're using Google Colab to adjust log settings.

In [None]:
import sys

# install tcia utils
!{sys.executable} -m pip install --upgrade -q tcia_utils

# install simpleDicomViewer and forked pydicom-seg dependency
!{sys.executable} -m pip install --upgrade -q git+https://github.com/kirbyju/pydicom-seg.git@master
!{sys.executable} -m pip install --upgrade -q simpleDicomViewer
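
If you'd like to confirm that the installs worked before moving on, a quick check like the one below can help.  This is only a sketch: `missing_modules` is a hypothetical helper written for this example, and it only tests whether Python can locate each module, not whether the installed version is current.

```python
from importlib.util import find_spec


def missing_modules(names):
    """Return the module names from `names` that Python cannot locate."""
    return [name for name in names if find_spec(name) is None]


# Module names taken from the import statements used in this notebook.
required = ["tcia_utils", "simpleDicomViewer"]

missing = missing_modules(required)
if missing:
    print("Not importable yet:", ", ".join(missing))
else:
    print("All required modules are importable.")
```

If anything is reported as missing, re-running the install cell (or restarting the kernel) is usually the first thing to try.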

In [None]:
import requests
import pandas as pd
from tcia_utils import nbia
from simpleDicomViewer import dicomViewer

# set logging level to INFO in Google Colab (not necessary in Jupyter)
if 'google.colab' in sys.modules:
  import logging

  for handler in logging.root.handlers[:]:
      logging.root.removeHandler(handler)

  # Set handler with level = info
  logging.basicConfig(format='%(asctime)s:%(levelname)s:%(message)s',
                      level=logging.INFO)

  print("Google Colab Logging set to INFO")

# 2 Learn about the datasets

The images, annotations (tumor segmentation and seed point labels), and clinical data associated with these trials are described in detail at the following links.  These pages are publicly visible without logging in, and can be used to obtain an understanding of each dataset before going through the trouble of requesting access.  Instructions for obtaining access can be found on the **Collection Summary** page for each dataset.

1. **Annotations for Chemotherapy and Radiation Therapy in Treating Young Patients With Newly Diagnosed, Previously Untreated, High-Risk Medulloblastoma/PNET (ACNS0332)**: [Image Collection Summary](https://doi.org/10.7937/TCIA.582B-XZ89), [Annotation Summary](https://doi.org/10.7937/D8A8-6252),   [Clinical datasets](https://nctn-data-archive.nci.nih.gov/node/838)
2. **Combination Chemotherapy and Radiation Therapy in Treating Young Patients With Newly Diagnosed Hodgkin Lymphoma (AHOD0831)**: [Collection Summary](https://doi.org/10.7937/CV5M-1H59), [Annotation Summary](https://doi.org/10.7937/4QAD-4280), [Clinical datasets](https://nctn-data-archive.nci.nih.gov/node/1137)
3. **Vincristine, Dactinomycin, and Doxorubicin With or Without Radiation Therapy or Observation Only in Treating Younger Patients Who Are Undergoing Surgery for Newly Diagnosed Stage I, Stage II, or Stage III Wilms' Tumor (AREN0532)**: [Collection Summary](https://doi.org/10.7937/6PJ1-M859), [Annotation Summary](https://doi.org/10.7937/kja4-1z76), [Clinical datasets](https://nctn-data-archive.nci.nih.gov/node/689)
4. **Combination Chemotherapy With or Without Radiation Therapy in Treating Young Patients With Newly Diagnosed Stage III or Stage IV Wilms Tumor (AREN0533)**: [Collection Summary](https://doi.org/10.7937/SJEZ-CJ78), [Annotation Summary](https://doi.org/10.7937/WFCC-DA41), [Clinical datasets](https://nctn-data-archive.nci.nih.gov/node/737)
5. **Combination Chemotherapy and Surgery in Treating Young Patients With Wilms Tumor (AREN0534)**: [Collection Summary](https://doi.org/10.7937/TCIA.5M9S-6Y97), [Annotation Summary](https://doi.org/10.7937/N930-BM78), [Clinical datasets](https://nctn-data-archive.nci.nih.gov/node/728)
6. **Rituximab and Combination Chemotherapy in Treating Patients With Diffuse Large B-Cell Non-Hodgkin's Lymphoma (CALGB50303)**: [Collection Summary](https://doi.org/10.7937/CM65-A013), [Annotation Summary](https://doi.org/10.7937/9jer-g980), [Clinical datasets](https://nctn-data-archive.nci.nih.gov/node/989)
7. **Sorafenib Tosylate in Treating Patients With Desmoid Tumors or Aggressive Fibromatosis (A091105)**: [Collection Summary](https://doi.org/10.7937/0WF5-SJ50), [Annotation Summary](https://doi.org/10.7937/T8RN-J447), [Clinical datasets](https://nctn-data-archive.nci.nih.gov/node/1266)
8. **Radiation Therapy, Amifostine, and Chemotherapy in Treating Young Patients With Newly Diagnosed Nasopharyngeal Cancer (ARAR0331)**: [Collection Summary](https://doi.org/10.7937/WTEC-MN22), [Annotation Summary](https://doi.org/10.25737/H65S-8F58), [Clinical datasets](https://nctn-data-archive.nci.nih.gov/node/1302)
9. **ACRIN 6685 (ACRIN-HNSCC-FDG-PET-CT)**: [Collection Summary and clinical datasets](https://doi.org/10.7937/K9/TCIA.2016.JQEJZZNG), [Annotation Summary](https://doi.org/10.7937/JVGC-AQ36)
10. **Risk-Based Therapy in Treating Younger Patients With Newly Diagnosed Liver Cancer (AHEP0731)**: [Collection Summary](https://doi.org/10.7937/F2DB-8826), [Annotation Summary](https://doi.org/10.7937/BDBN-NQ81), [Clinical datasets](https://nctn-data-archive.nci.nih.gov/node/696)
11. **A Randomized Phase III Study Comparing Carboplatin/Paclitaxel or Carboplatin/Paclitaxel/Bevacizumab With or Without Concurrent Cetuximab in Patients With Advanced Non-small Cell Lung Cancer (S0819)**: [Collection Summary](http://doi.org/10.7937/DT39-JS04), [Annotation Summary](https://doi.org/10.7937/R0R8-BN93), [Clinical datasets](https://nctn-data-archive.nci.nih.gov/node/850)

**Note:** The **Clinical datasets** links above allow you to view data dictionaries outlining the specific clinical variables that were collected before you request access.

After obtaining access to the dataset(s) you're interested in, select the dataset you'd like to explore through the rest of this notebook by setting the collection variable below.  The variable should be set to the collection **short name** which is listed in parentheses at the end of each title above.

In [None]:
collection = "S0819"
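
Because the short names are case-sensitive strings, a typo here can lead to confusing empty results later.  The sketch below shows one way to guard against that.  The set of names is copied from the list above, and `check_collection` is a hypothetical helper for this example, not part of **tcia_utils**.

```python
# Short names copied from the dataset list in this section.
KNOWN_COLLECTIONS = {
    "ACNS0332", "AHOD0831", "AREN0532", "AREN0533", "AREN0534",
    "CALGB50303", "A091105", "ARAR0331", "ACRIN-HNSCC-FDG-PET-CT",
    "AHEP0731", "S0819",
}


def check_collection(name):
    """Return `name` without surrounding whitespace, or raise if it is not listed above."""
    cleaned = name.strip()
    if cleaned not in KNOWN_COLLECTIONS:
        options = ", ".join(sorted(KNOWN_COLLECTIONS))
        raise ValueError(f"Unknown collection {name!r}. Expected one of: {options}")
    return cleaned


print(check_collection("S0819"))
```

You could wrap the assignment as `collection = check_collection("S0819")` if you want the notebook to stop early on a misspelled name.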

# 3 Downloading images and annotations with the NBIA Data Retriever

TCIA uses software called NBIA to manage its DICOM data.  One way to download TCIA data is to install the NBIA Data Retriever and use the predefined manifest files that are found on the summary pages mentioned in section 2.  

This tool provides a number of useful features such as auto-retry if there are any problems, saving data in an organized hierarchy on your hard drive (Collection > Patient > Study > Series > Images), and providing a CSV file containing key DICOM metadata about the images you've downloaded.  
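
To illustrate how that folder hierarchy can be used after a download, here is a rough sketch that counts the DICOM files in each series folder.  It assumes exactly four directory levels (Collection > Patient > Study > Series) below the folder you point it at and a `.dcm` file extension.  The folder names in the demo are made up, and your actual download may have additional parent folders, so adjust the starting path accordingly.

```python
import tempfile
from pathlib import Path


def count_images_per_series(root):
    """Count .dcm files in each Collection/Patient/Study/Series folder under `root`."""
    counts = {}
    # Four directory levels below root: collection, patient, study, series.
    for series_dir in Path(root).glob("*/*/*/*"):
        if series_dir.is_dir():
            key = series_dir.relative_to(root).as_posix()
            counts[key] = sum(1 for _ in series_dir.glob("*.dcm"))
    return counts


# Build a tiny fake tree so the sketch runs without any TCIA data.
with tempfile.TemporaryDirectory() as tmp:
    series = Path(tmp) / "S0819" / "patient-001" / "study-1" / "series-1"
    series.mkdir(parents=True)
    for i in range(3):
        (series / f"1-{i + 1}.dcm").touch()
    demo_counts = count_images_per_series(tmp)
    print(demo_counts)
```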

There are versions available for Windows, Mac and Linux.  If you're working from a system with a GUI you can follow the [instructions](https://wiki.cancerimagingarchive.net/display/NBIA/Downloading+TCIA+Images) to install Data Retriever on your computer.  There is also a [Linux command-line version of the NBIA Data Retriever](https://wiki.cancerimagingarchive.net/x/2QKPBQ) which is demonstrated in [this notebook](https://github.com/kirbyju/TCIA_Notebooks/blob/main/TCIA_Linux_Data_Retriever_App.ipynb).

# 4 Accessing the REST APIs
The [NBIA REST APIs](https://wiki.cancerimagingarchive.net/x/ZoATBg) are another useful way for TCIA users to query metadata and download image data, which will be the focus of the rest of this notebook.  We'll rely heavily on [tcia_utils](https://pypi.org/project/tcia-utils/) to simplify accessing them.  

If you have questions that are not covered in this notebook you can find many additional examples in the other notebooks at https://github.com/kirbyju/TCIA_Notebooks.

### Create login token
First, you must create a login token with your user name and password in order to access restricted data via the API.  

In [None]:
nbia.getToken()

## 4.1 Explore the data with REST API queries

Let's start by looking at what body parts and modalities are contained in the collection.  By default, most functions from **tcia_utils** return results in JSON.

For these datasets, **RTSTRUCT** DICOM series were used to record the segmentations, seed points, and scans where no tumor was found. There is one exception to this, which is that DICOM **SEG** was the format used for segmentations in the **ACNS0332** dataset.

In [None]:
# count patients for each modality
data = nbia.getModalityCounts(collection)
print(data)
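
If you prefer to work with the JSON output directly, it can be reshaped with plain Python.  The sketch below uses made-up sample records and assumes each record has `criteria` and `count` keys with the count stored as a string, which is an assumption based on the column names used elsewhere in this notebook.  Check the printed output above before relying on it.

```python
# Made-up sample in the assumed shape: a list of {"criteria": ..., "count": ...} records.
sample = [
    {"criteria": "RTSTRUCT", "count": "9"},
    {"criteria": "CT", "count": "12"},
    {"criteria": "PT", "count": "4"},
]


def counts_by_criteria(records):
    """Map each criteria value to an integer count, largest count first."""
    pairs = ((rec["criteria"], int(rec["count"])) for rec in records)
    return dict(sorted(pairs, key=lambda pair: pair[1], reverse=True))


for name, count in counts_by_criteria(sample).items():
    print(f"{name}: {count}")
```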

You can also use **format = "df"** to return the results as a dataframe.  Let's try that for viewing the body parts examined.

In [None]:
# Count patients for each body part examined,
# return results as dataframe
df = nbia.getBodyPartCounts(collection, format = "df")

# rename headers and sort by PatientCount
df.rename(columns = {'criteria':'BodyPartExamined', 'count':'PatientCount'}, inplace = True)
df.PatientCount = df.PatientCount.astype(int)
display(df.sort_values(by='PatientCount', ascending=False, ignore_index = True))

Now let's run **nbia.getPatient()** and **nbia.getStudy()** to see what we can learn about the patient cohort from the DICOM metadata.  The patient information can include things like age, gender, and ethnicity. The study information includes details recorded on the date the patient was scanned, such as the patient's age at the time of imaging or how many days had passed since they were registered to the trial.  This information can augment the **Clinical datasets** available through the links in section 2.

In [None]:
df = nbia.getPatient(collection, format = "df")

display(df)

Let's use **format = "csv"** this time to save a CSV file in addition to returning a dataframe.  Verify that it has been saved to your file system before proceeding.

In [None]:
# obtain study/visit details (e.g. anonymized study date, age at the time of visit)
df = nbia.getStudy(collection, format = "csv")
display(df)
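
One simple way to verify the file was written is to list the CSV files in your working directory, newest first.  This is a minimal sketch: `newest_csv_files` is a hypothetical helper, and since the exact file name chosen by **tcia_utils** isn't covered here, the example just looks for any `.csv` file.  The demo runs against a temporary folder with placeholder files.

```python
import os
import tempfile
from pathlib import Path


def newest_csv_files(directory="."):
    """Return CSV paths in `directory`, most recently modified first."""
    files = Path(directory).glob("*.csv")
    return sorted(files, key=lambda p: p.stat().st_mtime, reverse=True)


# Demo with two placeholder files and explicit modification times.
with tempfile.TemporaryDirectory() as tmp:
    older = Path(tmp) / "older.csv"
    newer = Path(tmp) / "newer.csv"
    older.touch()
    newer.touch()
    os.utime(older, (1_000_000, 1_000_000))
    os.utime(newer, (2_000_000, 2_000_000))
    demo_names = [p.name for p in newest_csv_files(tmp)]
    print(demo_names)
```

In the notebook itself, calling `newest_csv_files()` with no arguments checks the current working directory.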

We can also create a report with **nbia.getSeries()** that gives useful metadata about each scan in the dataset (e.g. series description, modality, scanner manufacturer & software version, number of images).

In [None]:
# obtain scan/series metadata and save to variable for use in next example
df = nbia.getSeries(collection, format = "df")

display(df)
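
As an illustration of the kind of question this series-level metadata can answer, the sketch below tallies series and images per modality.  The records are made up, and the `Modality` and `ImageCount` field names are assumptions about the output, so compare them against the columns displayed above before adapting this.

```python
from collections import defaultdict

# Made-up records; field names are assumed to match the series metadata columns.
sample_series = [
    {"SeriesInstanceUID": "1.2.3.1", "Modality": "CT", "ImageCount": 120},
    {"SeriesInstanceUID": "1.2.3.2", "Modality": "CT", "ImageCount": 80},
    {"SeriesInstanceUID": "1.2.3.3", "Modality": "RTSTRUCT", "ImageCount": 1},
]


def summarize_by_modality(records):
    """Return {modality: {"series": n_series, "images": n_images}}."""
    summary = defaultdict(lambda: {"series": 0, "images": 0})
    for rec in records:
        entry = summary[rec["Modality"]]
        entry["series"] += 1
        entry["images"] += int(rec["ImageCount"])
    return dict(summary)


print(summarize_by_modality(sample_series))
```

With a real dataframe, `df.to_dict("records")` produces a list of dictionaries in this shape.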

Finally, we can pass the results from the **getSeries()** query to **reportDoiSummary()** to generate some summary statistics about the data in the collection.  Note that there are separate rows summarizing the contents of the original collection and the contents of the annotation dataset.

In [None]:
# Calculate summary statistics for a given collection
nbia.reportDoiSummary(df, input_type = "df")

## 4.2 Downloading data with the REST API
Next we'll demonstrate using the API to download data.  This can be useful if you'd like to download results from API queries rather than using an existing manifest file.  It's also useful if you can't install the NBIA Data Retriever or want to integrate TCIA downloads into other pipelines/tools.  

Here we will focus on the following use cases:

1. Download and visualize a sample case
2. Download seed point labels
3. Download 3D segmentation labels
4. Download source images used to create seed points and segmentations
5. Download source images with negative finding assessments

To identify the subsets for our use cases, we'll leverage the **annotation metadata** spreadsheet the authors provided, which you can download manually from the **Annotation Summary** links in section 2 of the notebook or retrieve directly into a dataframe with the code below.

In [None]:
metadata_urls = {
    "ACNS0332": "https://www.cancerimagingarchive.net/wp-content/uploads/ACNS0332_annotations_metadata-2023-08-03.csv",
    "AHOD0831": "https://www.cancerimagingarchive.net/wp-content/uploads/Metadata_Report_AHOD0831_01222023.csv",
    "AREN0532": "https://www.cancerimagingarchive.net/wp-content/uploads/Metadata_Report_AREN0532_01122023.csv",
    "AREN0533": "https://www.cancerimagingarchive.net/wp-content/uploads/AREN0533_Annotations_Metadata__01-12-2023.csv",
    "AREN0534": "https://www.cancerimagingarchive.net/wp-content/uploads/Metadata_Report_AREN0534_01122023.csv",
    "CALGB50303": "https://www.cancerimagingarchive.net/wp-content/uploads/Metadata_Report_CALGB50303_02272023.csv",
    "A091105": "https://www.cancerimagingarchive.net/wp-content/uploads/Metadata_Report_A091105_2023_11_06.csv",
    "ARAR0331": "https://www.cancerimagingarchive.net/wp-content/uploads/Metadata_Report_ARAR0331_2023_11_13.csv",
    "AHEP0731": "https://www.cancerimagingarchive.net/wp-content/uploads/Metadata_Report_AHEP0731_2024_1_3.csv",
    "ACRIN-HNSCC-FDG-PET-CT": "https://www.cancerimagingarchive.net/wp-content/uploads/Metadata_Report_ACRIN-HNSCC_2023_11_07.csv",
    "S0819": "https://www.cancerimagingarchive.net/wp-content/uploads/Metadata_Report_S0819_v2_2024-04-11.csv"
}

if collection in metadata_urls:
    spreadsheet_url = metadata_urls[collection]
    annotation_Metadata = pd.read_csv(spreadsheet_url)
    display(annotation_Metadata)
else:
    print("URL for collection not found.")
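
Before filtering, it can help to see how many rows fall under each annotation type.  The sketch below does this with the standard library on a small made-up CSV.  The column names and the three type labels are the ones used in the rest of this notebook, but the UIDs are invented and a given spreadsheet may contain other columns or labels.

```python
import csv
import io
from collections import Counter

# Made-up rows that mimic the columns this notebook relies on.
sample_csv = """SeriesInstanceUID,ReferencedSeriesInstanceUID,Annotation Type
1.2.3.10,1.2.3.1,Segmentation
1.2.3.11,1.2.3.1,Seed point
1.2.3.12,1.2.3.2,No findings
1.2.3.13,1.2.3.3,Segmentation
"""

rows = list(csv.DictReader(io.StringIO(sample_csv)))

# Count how many annotation rows exist for each type.
type_counts = Counter(row["Annotation Type"] for row in rows)
for annotation_type, count in type_counts.most_common():
    print(f"{annotation_type}: {count}")
```

With the real dataframe, `annotation_Metadata['Annotation Type'].value_counts()` gives the equivalent tally.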

### Download and visualize a sample case
Here we'll walk through some steps to identify an example segmentation file, find the corresponding reference series and visualize them together in the notebook.

First, let's pull a series UID for a random segmentation from our annotation metadata.

In [None]:
random_row = annotation_Metadata.loc[annotation_Metadata['Annotation Type'] == 'Segmentation'].sample(n=1)

segSeries = random_row['SeriesInstanceUID'].iloc[0]

print(segSeries)

In this case we are fortunate to also have the corresponding reference Series UID in the spreadsheet so we can obtain that from our dataframe as well.

In [None]:
# Find the row where 'SeriesInstanceUID' is equal to segSeries
filtered_row = annotation_Metadata[annotation_Metadata['SeriesInstanceUID'] == segSeries]

# Extract the value from 'ReferencedSeriesInstanceUID' column in the filtered row
refSeries = filtered_row['ReferencedSeriesInstanceUID'].iloc[0]

print(refSeries)

Alternatively, if you ever have a situation where you don't have a spreadsheet like this and want to determine the Reference Series UID you can use **getSegRefSeries()** to obtain it.  Note that it matches the UID from the spreadsheet in the previous step.

In [None]:
refSeries = nbia.getSegRefSeries(segSeries)

print(refSeries)

Next let's download these two series.  Since we're working with Series UIDs saved as variables instead of JSON output from other API calls, we'll use the  **input_type = "list"** parameter in the remaining download steps.  

In [None]:
nbia.downloadSeries([refSeries, segSeries], input_type= "list")

Now we can look at the images and segmentation together with **viewSeriesAnnotation()** from [simpleDicomViewer](https://pypi.org/project/simpleDicomViewer/).  This function is only meant to be a quick and dirty way to preview the data.  There are more comprehensive solutions such as [3D Slicer](https://slicer.org/) or [itkWidgets](https://github.com/kirbyju/TCIA_Notebooks/blob/main/TCIA_RTStruct_SEG_Visualization_with_itkWidgets.ipynb) if you want to analyze the data.

**Note:** Unfortunately this annotation visualization function doesn't support the type of DICOM SEG data that were generated for ACNS0332, but the images will still display.  These segmentations do work properly in 3D Slicer.

In [None]:
# Assuming you didn't change the default download options for downloadSeries
imgPath = "tciaDownload/" + refSeries

# The annotation path has to be a file name (not directory name).  Since there is generally
# only one file in a segmentation series we can assume it will always be called 1-1.dcm
segPath = "tciaDownload/" + segSeries + "/1-1.dcm"

# Display the viewer
dicomViewer.viewSeriesAnnotation(imgPath, segPath)
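
If you'd rather not hard-code the `1-1.dcm` file name, one alternative is to look up the single file in the series folder and fail loudly when the assumption doesn't hold.  This is a sketch only: `single_dicom_file` is a hypothetical helper, and the demo uses a throwaway folder with a made-up series UID.

```python
import tempfile
from pathlib import Path


def single_dicom_file(series_dir):
    """Return the only .dcm file in `series_dir`, or raise if there isn't exactly one."""
    files = sorted(Path(series_dir).glob("*.dcm"))
    if len(files) != 1:
        raise ValueError(
            f"Expected exactly 1 DICOM file in {series_dir}, found {len(files)}"
        )
    return files[0]


# Demo on a temporary folder so this runs without downloaded data.
with tempfile.TemporaryDirectory() as tmp:
    seg_dir = Path(tmp) / "1.2.3.10"  # made-up series UID
    seg_dir.mkdir()
    (seg_dir / "1-1.dcm").touch()
    demo_name = single_dicom_file(seg_dir).name
    print(demo_name)
```

In this notebook, that would look like `segPath = str(single_dicom_file("tciaDownload/" + segSeries))`, assuming the default download location.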

#### Download seed points
The remaining examples are set up to download only a small sample, limited by the **number** parameter (set to 2 here).  Remove the **number** parameter to download the entire dataset.  We'll also specify a **csv_filename** to save the related metadata to a file.

In [None]:
# filter dataframe to only include seed point rows
seedPoints = annotation_Metadata[annotation_Metadata['Annotation Type'].str.contains('Seed point')]
#display(seedPoints)

# extract series UID column to list for downloading
series_data = seedPoints["SeriesInstanceUID"].tolist()

# download a small sample (limited by number = 2),
# return the series metadata as a dataframe,
# and save a CSV of the metadata
nbia.downloadSeries(series_data, number = 2, input_type = "list", csv_filename = collection + "-seedPoints")

#### Download 3D segmentations

In [None]:
# filter dataframe to only include segmentations
segs = annotation_Metadata[annotation_Metadata['Annotation Type'].str.contains('Segmentation')]
#display(segs)

# extract series UID column to list for downloading
series_data = segs["SeriesInstanceUID"].tolist()

# download a small sample (limited by number = 2),
# return the series metadata as a dataframe,
# and save a CSV of the metadata
nbia.downloadSeries(series_data, number = 2, input_type = "list", csv_filename = collection + "-segs")

#### Download source images for seed points and segmentations

In [None]:
# filter dataframe to only include seg and seed point rows (remove "no findings")
ref_Series = annotation_Metadata[(annotation_Metadata['Annotation Type'] == 'Seed point') |
                                 (annotation_Metadata['Annotation Type'] == 'Segmentation')]

# remove duplicate ReferencedSeriesUIDs
clean_refSeries = ref_Series.drop_duplicates(subset='ReferencedSeriesInstanceUID')
#display(clean_refSeries)

# extract series UID column to list for downloading
series_data = clean_refSeries["ReferencedSeriesInstanceUID"].tolist()

# download a small sample (limited by number = 2),
# return the series metadata as a dataframe,
# and save a CSV of the metadata
nbia.downloadSeries(series_data, number = 2, input_type = "list", csv_filename = collection + "-seg_seed_source_images")
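
The de-duplication step matters because several annotations (for example a seed point and a segmentation) can reference the same source scan, and we only want to download each scan once.  Here is a minimal plain-Python sketch of the same idea, using made-up UIDs.  Like `drop_duplicates()` with its default settings, it keeps the first occurrence of each value and preserves the original order.

```python
def unique_in_order(values):
    """Remove repeated values while keeping the first occurrence of each, in order."""
    return list(dict.fromkeys(values))


# Made-up referenced series UIDs; "1.2.3.1" is referenced by two annotations.
referenced = ["1.2.3.1", "1.2.3.3", "1.2.3.1", "1.2.3.2"]

print(unique_in_order(referenced))
```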

#### Download source images with negative finding assessments

The following code will download the scans with negative finding assessments.  These are cases where the authors of the dataset did not find anything that could be annotated.

In [None]:
# filter dataframe to only include scans with "no findings"
ref_Series = annotation_Metadata[annotation_Metadata['Annotation Type'] == 'No findings']

# remove duplicate ReferencedSeriesUIDs
clean_refSeries = ref_Series.drop_duplicates(subset='ReferencedSeriesInstanceUID')
#display(clean_refSeries)

# extract series UID column to list for downloading
series_data = clean_refSeries["ReferencedSeriesInstanceUID"].tolist()

# download a small sample (limited by number = 2),
# return the series metadata as a dataframe,
# and save a CSV of the metadata
nbia.downloadSeries(series_data, number = 2, input_type = "list", csv_filename = collection + "-noFinding_source_images")

# Acknowledgements
[The Cancer Imaging Archive (TCIA)](https://www.cancerimagingarchive.net/) is a service which de-identifies and hosts a large publicly available archive of medical images of cancer.  TCIA is funded by the [Cancer Imaging Program (CIP)](https://imaging.cancer.gov/), a part of the United States [National Cancer Institute (NCI)](https://www.cancer.gov/), and is managed by the [Frederick National Laboratory for Cancer Research (FNLCR)](https://frederick.cancer.gov/).

This notebook was created by [Justin Kirby](https://www.linkedin.com/in/justinkirby82/).  If you leverage TCIA datasets in your work please be sure to comply with the [TCIA Data Usage Policy](https://wiki.cancerimagingarchive.net/x/c4hF) and include all relevant citations.

## TCIA Citation

Clark, K., Vendt, B., Smith, K., Freymann, J., Kirby, J., Koppel, P., Moore, S., Phillips, S., Maffitt, D., Pringle, M., Tarbox, L., & Prior, F. (2013). The Cancer Imaging Archive (TCIA): Maintaining and Operating a Public Information Repository. Journal of Digital Imaging, 26(6), 1045–1057. https://doi.org/10.1007/s10278-013-9622-7

## Dataset Citations

Instructions for citing the datasets can be found on their summary pages, which are listed in section 2.