You can download and run this notebook locally, or you can run it for free in a cloud environment using Colab or Sagemaker Studio Lab:

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/kirbyju/TCIA_Notebooks/blob/main/AREN0532/AREN0532.ipynb)

[![Open In SageMaker Studio Lab](https://studiolab.sagemaker.aws/studiolab.svg)](https://studiolab.sagemaker.aws/import/github.com/kirbyju/TCIA_Notebooks/blob/main/AREN0532/AREN0532.ipynb)

# Accessing DICOM images and annotations from the AREN0532 dataset hosted on TCIA

This notebook is focused on accessing the **"Vincristine, Dactinomycin, and Doxorubicin With or Without Radiation Therapy or Observation Only in Treating Younger Patients Who Are Undergoing Surgery for Newly Diagnosed Stage I, Stage II, or Stage III Wilms' Tumor (AREN0532)"** Collection hosted on [The Cancer Imaging Archive (TCIA)](https://cancerimagingarchive.net).  This dataset includes DICOM images hosted on TCIA and clinical data hosted by the NCTN Data Archive.  The National Cancer Institute has also funded an activity to generate and publish annotations (3D segmentation labels and seed points) on TCIA to help jumpstart research on tumor detection and auto-segmentation methods.


# 1 Learn about and request access to the datasets

The images, annotations (tumor segmentation and seed point labels), and clinical data associated with this trial are described in detail at the following links.  These pages are publicly visible without logging in, and can be used to obtain an understanding of the dataset before going through the trouble of requesting access:

1.  [Collection Summary](https://doi.org/10.7937/6PJ1-M859)
2.  [Annotation Summary](https://doi.org/10.7937/kja4-1z76)
3.  [Clinical datasets](https://nctn-data-archive.nci.nih.gov/node/689)

**Note:** You can use the link above to view data dictionaries outlining the specific clinical variables that were collected before requesting access.

### Requesting Access to the data
In order to download the actual data you must request access through the NCTN Data Archive via the following steps:
 
 1. [Register an account on the NCTN Data Archive](https://nctn-data-archive.nci.nih.gov/).  
 2. After logging in, use the "Request Data" link in the left side menu.  
 3. Follow the on-screen instructions, and enter **NCT00352534** when asked which trial you want to request.  
 4. In step 2 of the Create Request form, be sure to select “Imaging Data Requested”. 
 
Once you are approved for access you'll be able to download the clinical data from the NCTN Archive.  You will then be asked to create an account on TCIA with the same email address to access the imaging data.  Please contact NCINCTNDataArchive@mail.nih.gov for any questions about access requests.  

# 2 Import tcia_utils

The following imports **[tcia_utils](https://pypi.org/project/tcia-utils/)**, which contains a variety of useful functions for accessing TCIA via Python and Jupyter Notebooks.

In [None]:
!pip install tcia_utils -q
!pip install pandas -q

In [None]:
import requests
import pandas as pd
from tcia_utils import nbia

# 3 Downloading images and annotations with the NBIA Data Retriever

TCIA uses software called NBIA to manage its DICOM data.  One way to download TCIA data is to install the [Linux command-line version of the NBIA Data Retriever](https://wiki.cancerimagingarchive.net/x/2QKPBQ) using the following steps.  This tool provides a number of useful features such as auto-retry if there are any problems, saving data in an organized hierarchy on your hard drive (Collection > Patient > Study > Series > Images), and providing a CSV file containing key DICOM metadata about the images you've downloaded.

_**Note:**_ It's also possible to download these data via our REST API if you can't or don't want to install the NBIA Data Retriever. This is covered later in the notebook.
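
Once a download finishes, the folder hierarchy described above makes it easy to check what arrived. The following is a minimal sketch, not part of the NBIA Data Retriever itself: `count_files_per_series` is a hypothetical helper, and it assumes the series folders sit exactly four levels below the folder you point it at (Collection > Patient > Study > Series).

```python
from collections import Counter
from pathlib import Path


def count_files_per_series(root):
    """Count files in each series folder under a download root.

    Assumes the layout Collection/Patient/Study/Series/<image files>,
    so image files are found five path components below `root`.
    """
    root = Path(root)
    counts = Counter()
    for path in root.glob("*/*/*/*/*"):
        if path.is_file():
            # Key each series by its folder path relative to the root.
            counts[path.parent.relative_to(root).as_posix()] += 1
    return dict(counts)
```

Comparing these counts with the image counts in the metadata CSV is one way to spot an incomplete download.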

## 3.1 Install the NBIA Data Retriever CLI package

In [None]:
# Install NBIA Data Retriever CLI software for downloading images later in this notebook.

!mkdir -p /usr/share/desktop-directories/
!wget -P /content/NBIA-Data-Retriever https://cbiit-download.nci.nih.gov/nbia/releases/ForTCIA/NBIADataRetriever_4.4/nbia-data-retriever-4.4.deb
!dpkg -i /content/NBIA-Data-Retriever/nbia-data-retriever-4.4.deb

# NOTE: If you're working on a Linux OS that uses RPM packages, you can change the wget line above to point to
#       https://cbiit-download.nci.nih.gov/nbia/releases/ForTCIA/NBIADataRetriever_4.4/NBIADataRetriever-4.4-1.x86_64.rpm

## 3.2 Set up your credential file
Since this Collection requires logging in, you must set up a **credentials.txt** file that contains your user name and password. We will leverage the **nbia.makeCredentialFile()** function to do this.

In [None]:
nbia.makeCredentialFile()

## 3.3 Download data
The NBIA Data Retriever software works by ingesting a "manifest" file that contains the DICOM Series Instance UIDs of the scans you want to download. The manifest files can be downloaded from [this page](https://doi.org/10.7937/D8A8-6252), but you can also obtain these manifests with the commands below.

* Annotations - segmentations, seed points, and Negative Findings Assessments 
* Source images used to create segmentations & seed points
* Source images used to create Negative Assessment Reports

In [None]:
# Annotations - segmentations, seed points, and Negative Findings Assessments
manifest = requests.get("https://wiki.cancerimagingarchive.net/download/attachments/145752341/AREN0532_Tumor-Annotations-manifest_2-8-2023.tcia?api=v2")
with open('AREN0532_Tumor-Annotations-manifest_2-8-2023.tcia', 'wb') as f:
    f.write(manifest.content)

In [None]:
# Source images used to create segmentations & seed points
manifest = requests.get("https://wiki.cancerimagingarchive.net/download/attachments/145752341/AREN0532_SourceImages_SEGSandSeedpoints-manifest_2-8-2023.tcia?api=v2")
with open('AREN0532_SourceImages_SEGSandSeedpoints-manifest_2-8-2023.tcia', 'wb') as f:
    f.write(manifest.content)


In [None]:
# Source images used to create Negative Assessment Reports
# (no segmentation or seed points created for the scan)
manifest = requests.get("https://wiki.cancerimagingarchive.net/download/attachments/145752341/AREN0532_SourceImages_NegativeAssessments-manifest-2-8-2023.tcia?api=v2")
with open('AREN0532_SourceImages_NegativeAssessments-manifest-2-8-2023.tcia', 'wb') as f:
    f.write(manifest.content)

A manifest containing sample images and annotations for a single subject has also been created for use with this notebook to facilitate quick testing and demonstrations.

In [None]:
# Single subject manifest containing examples of each annotation type
# Use this one for a quick demo.
manifest = requests.get("https://github.com/kirbyju/TCIA_Notebooks/raw/main/AREN0532/AREN0532-PAPVBY.tcia")
with open('AREN0532-PAPVBY.tcia', 'wb') as f:
    f.write(manifest.content)

Now we can open the sample manifest file with the NBIA Data Retriever to download the actual data. You can repeat the step below for each dataset by changing the name of the TCIA manifest you want to download.

**<font color='red'>After running the following command, click in the output cell, type "y," and press Enter to agree with the TCIA Data Usage Policy and start the download.</font>**

In [None]:
!/opt/nbia-data-retriever/nbia-data-retriever --cli '/content/AREN0532-PAPVBY.tcia' -d /content/ -l /content/credentials.txt
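
Because this step is repeated with different manifests, it can help to assemble the command in Python and confirm the manifest file exists before launching the download. The sketch below is illustrative only: `build_retriever_command` is a hypothetical helper, and it uses only the install path and the `--cli`, `-d`, and `-l` options shown in the command above.

```python
from pathlib import Path

# Install location used by the command shown in this notebook.
RETRIEVER = "/opt/nbia-data-retriever/nbia-data-retriever"


def build_retriever_command(manifest, download_dir, credentials):
    """Return the retriever command as an argument list.

    Raises FileNotFoundError if the manifest is missing, which catches
    a mistyped manifest name before any download is attempted.
    """
    manifest = Path(manifest)
    if not manifest.is_file():
        raise FileNotFoundError(f"Manifest not found: {manifest}")
    return [
        RETRIEVER,
        "--cli", str(manifest),
        "-d", str(download_dir),
        "-l", str(credentials),
    ]
```

Printing the result with `shlex.join()` gives a single line that can be pasted after `!` in a notebook cell.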

# 4 Accessing the REST APIs 
The [NBIA REST APIs](https://wiki.cancerimagingarchive.net/x/ZoATBg) allow TCIA users to query metadata and download image data.  We'll rely heavily on [tcia_utils](https://pypi.org/project/tcia-utils/) to simplify accessing them.  

## 4.1 Create an API token

We'll use **nbia.getToken()** to generate an access token to query restricted Collections on TCIA.  **<font color='red'>Tokens are valid for 2 hours and must be refreshed after that point.</font>** See https://wiki.cancerimagingarchive.net/x/X4ATBg for more details. 

In [None]:
nbia.getToken()
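
Since tokens expire after 2 hours, a long-running session may want to track when the token was issued. This is a minimal sketch of that bookkeeping using only the standard library; `token_needs_refresh` and the five-minute safety margin are assumptions for illustration and are not part of tcia_utils.

```python
from datetime import datetime, timedelta, timezone

# Token lifetime stated above.
TOKEN_LIFETIME = timedelta(hours=2)


def token_needs_refresh(issued_at, now=None, margin=timedelta(minutes=5)):
    """Return True once the token is within `margin` of expiring."""
    if now is None:
        now = datetime.now(timezone.utc)
    return now - issued_at >= TOKEN_LIFETIME - margin
```

One way to use it is to store `issued_at = datetime.now(timezone.utc)` immediately after calling `nbia.getToken()`, then call `nbia.getToken()` again whenever `token_needs_refresh(issued_at)` returns True.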

## 4.2 Explore the data with REST API queries

Let's start by looking at what body parts and modalities are contained in the Collection.  For this dataset, RTSTRUCTs were used to record the segmentations, seed points, and scans where no tumor was found. By default, most functions from **tcia_utils** return results in JSON.

In [None]:
# count patients for each modality
data = nbia.getModalityCounts(collection = "AREN0532")
print(data)

However, you can also use **format = "df"** to return the results as a dataframe.

In [None]:
# Count patients for each body part examined, 
# return results as dataframe
df = nbia.getBodyPartCounts(collection = "AREN0532", format = "df")

# rename headers and sort by PatientCount
df.rename(columns = {'criteria':'BodyPartExamined', 'count':'PatientCount'}, inplace = True)
df.PatientCount = df.PatientCount.astype(int)
display(df.sort_values(by='PatientCount', ascending=False, ignore_index = True))
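
The integer conversion in the cell above matters because sorting counts as text puts "7" after "40". The sketch below shows the same idea on plain JSON-style records without pandas. It assumes each record has `criteria` and `count` keys, as the column rename above suggests, and the sample values are invented for illustration.

```python
def sort_counts(records):
    """Return (criteria, count) pairs sorted by count, largest first."""
    # Convert counts to int so "7" does not sort above "40".
    pairs = [(r["criteria"], int(r["count"])) for r in records]
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)


# Hypothetical records, not real AREN0532 counts.
sample = [
    {"criteria": "CHEST", "count": "12"},
    {"criteria": "ABDOMEN", "count": "40"},
    {"criteria": "PELVIS", "count": "7"},
]

for body_part, patients in sort_counts(sample):
    print(body_part, patients)
```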

Now let's run **nbia.getPatient()** and **nbia.getStudy()** to see what we can learn about the patient cohort from the DICOM metadata.  This information can include things like age, gender, and ethnicity.  However, in the case of AREN0532, most of this information is also available in the clinical data at https://nctn-data-archive.nci.nih.gov/node/689.

In [None]:
# obtain patient details (e.g. species, gender, ethnicity) for the collection 
df = nbia.getPatient(collection = "AREN0532", api_url = "restricted", format = "df")
display(df)

Let's use **format = "csv"** this time to save a CSV file in addition to returning a dataframe.  Verify that **getPatientStudy.csv** has been saved to your file system before proceeding.

In [None]:
# obtain study/visit details (e.g. anonymized study date, age at the time of visit)
df = nbia.getStudy(collection = "AREN0532", api_url = "restricted", format = "csv")
display(df)

We can also create a report with **nbia.getSeries()** that gives useful metadata about each scan in the dataset (e.g. series description, modality, scanner manufacturer & software version, number of images).  We'll return the results as JSON this time so we can use them in a subsequent step, but still convert them to a dataframe for readability.

In [None]:
# obtain scan/series metadata for a collection as JSON for use in next example
data = nbia.getSeries(collection = "AREN0532", api_url = "restricted")

# format as dataframe for easy viewing
df = pd.DataFrame(data)
display(df)

Finally, we can use the results from the getSeries() query to generate some summary statistics about the scans in the collection.

In [None]:
# Calculate summary statistics for a given collection 
nbia.makeSeriesReport(data)
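
Because the series metadata is a list of JSON records, it can also be tallied or filtered with plain Python before any later step. The sketch below assumes each record carries a `Modality` key alongside `SeriesInstanceUID`; the helper names and the sample records are hypothetical.

```python
from collections import Counter


def modality_counts(series_records):
    """Count series per modality in getSeries()-style records."""
    return Counter(r.get("Modality", "UNKNOWN") for r in series_records)


def filter_by_modality(series_records, modality):
    """Keep only the records whose Modality matches."""
    return [r for r in series_records if r.get("Modality") == modality]


# Hypothetical records with made-up UIDs.
sample_series = [
    {"SeriesInstanceUID": "1.2.3.1", "Modality": "CT"},
    {"SeriesInstanceUID": "1.2.3.2", "Modality": "RTSTRUCT"},
    {"SeriesInstanceUID": "1.2.3.3", "Modality": "CT"},
]

print(modality_counts(sample_series))
print(filter_by_modality(sample_series, "RTSTRUCT"))
```

The filtered list keeps the original record structure, so each element still contains its `SeriesInstanceUID`.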

## 4.3 Downloading data with the REST API
Next we'll cover using the API to download data.  This can be useful if you'd like to download results from API queries rather than using an existing manifest file.  It's also useful if you can't install the NBIA Data Retriever or want to integrate TCIA downloads into other pipelines/tools.  

As a reminder, many of the scans in the AREN0532 Collection were not annotated by the authors of https://doi.org/10.7937/KJA4-1Z76.  The reasons are outlined in the Annotation Protocol on that page.  As a result, you may want to download only a subset of the scans such as:

1. Seed point labels 
2. 3D segmentation labels 
3. Source images used to create seed points and segmentations
4. Source images with negative finding assessments

The following examples demonstrate how to tackle each of these use cases. 

### 4.3.1 tcia_utils download functions
**tcia_utils** contains a **downloadSeries()** function that has multiple options for specifying the seriesUids you'd like to download.  By default, the function expects JSON data containing "SeriesInstanceUID" elements, which can be generated using **getSeries()**.  However, if you have a series UID list from some other source, you can set **input_type = "list"** to pass a python list of one or more series UIDs instead of JSON.

You can also specify **number = n** to tell the function to only download the first **n** scans of your seriesUids.  Remove this parameter in the examples below to download the full dataset. 

The function returns a dataframe of the series metadata describing the scans that were downloaded, and you can optionally export a CSV of the series metadata by specifying the **csv_filename** parameter.

Last but not least, there is some logic built in to detect whether you've already downloaded a series.  If a directory named after the seriesUid already exists the function will assume it's already been downloaded and skip that series.
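
The **number** limit and the skip-if-present check described above can be previewed before downloading anything. This is a rough sketch of that logic, not the tcia_utils implementation: `plan_downloads` is a hypothetical helper, the `tciaDownload` default mirrors the folder mentioned in this notebook, and applying the limit before the directory check is an assumption.

```python
from pathlib import Path


def plan_downloads(series_uids, root="tciaDownload", number=None):
    """Split UIDs into those to fetch and those already on disk.

    A series counts as downloaded when a directory named after its
    UID exists under `root`.
    """
    selected = list(series_uids)
    if number is not None:
        selected = selected[:number]  # keep only the first n UIDs
    root = Path(root)
    to_fetch, skipped = [], []
    for uid in selected:
        if (root / uid).is_dir():
            skipped.append(uid)
        else:
            to_fetch.append(uid)
    return to_fetch, skipped
```

Note that a partially downloaded series would still be skipped under this rule, so deleting its directory is the simplest way to force a retry.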

### 4.3.2 Download a sample subject
First, let's re-use the **AREN0532-PAPVBY.tcia** file we downloaded earlier to demonstrate how we can work with manifest files to download images via the API.

If you open this manifest file in a text editor you'll notice that it contains several lines of download parameters that precede a list of Series Instance UIDs to download.  The **manifestToList()** function will put the Series UIDs into a Python list and ignore the parameters in the first six lines of the manifest so that we can feed it to **nbia.downloadSeries()**.

In [None]:
# enter manifest path/filename
manifest = "AREN0532-PAPVBY.tcia"

# converts manifest to list of UIDs
uids = nbia.manifestToList(manifest)
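
To make the conversion less of a black box, here is a minimal sketch of how a manifest with six parameter lines followed by Series UIDs could be split by hand. It is not the tcia_utils implementation; `parse_manifest`, the parameter names, and the UIDs in the sample text are all placeholders.

```python
def parse_manifest(text, header_lines=6):
    """Split manifest text into (parameters, series UIDs).

    Assumes the first `header_lines` lines are key=value parameters
    and every non-blank line after them is a Series Instance UID.
    """
    lines = [line.strip() for line in text.splitlines()]
    header = lines[:header_lines]
    params = dict(line.split("=", 1) for line in header if "=" in line)
    uids = [line for line in lines[header_lines:] if line]
    return params, uids


# Placeholder manifest: invented parameter names and UIDs.
SAMPLE_MANIFEST = "\n".join([
    "paramOne=a",
    "paramTwo=b",
    "paramThree=c",
    "paramFour=d",
    "paramFive=e",
    "seriesListHeader=",
    "1.2.3.1",
    "1.2.3.2",
    "",
])

params, sample_uids = parse_manifest(SAMPLE_MANIFEST)
print(len(params), sample_uids)
```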

Now we can download the series in the list and return the metadata to a dataframe. 

In [None]:
df = nbia.downloadSeries(uids, input_type = "list", api_url = "restricted")


Here's what the metadata looks like in the dataframe.

In [None]:
display(df)

#### Visualize the data
You can preview a series that you've downloaded directly in your notebook using the **viewSeries()** function.  This function requires EITHER a seriesUid or path parameter.  Leave the seriesUid empty if you want to provide a custom path.  The function assumes "tciaDownload/**seriesUid**/" as the path if a seriesUid is provided since this is where downloadSeries() saves data.

**Note:** This function only works with regular scans and cannot be used to visualize the annotation data.

In [None]:
# view a sample scan we've downloaded using a Series UID from the previous dataframe
nbia.viewSeries("1.3.6.1.4.1.14519.5.2.1.1610.1216.996552775034540775693160619298")

# example showing how to use the path parameter instead of a UID
# nbia.viewSeries(path = "tciaDownload/1.3.6.1.4.1.14519.5.2.1.1610.1216.996552775034540775693160619298")

### 4.3.3 Download subsets of the data
To identify the subsets for the other use cases, we'll leverage the **annotation metadata** spreadsheet the authors provided, which you can download from https://doi.org/10.7937/kja4-1z76 or retrieve directly into a dataframe with the code below.

In [None]:
# load annotation metadata spreadsheet to dataframe

annotation_Metadata = pd.read_csv('https://wiki.cancerimagingarchive.net/download/attachments/145752341/Metadata_Report_AREN0532_01122023.csv?api=v2')

display(annotation_Metadata)
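
Before filtering, it is worth checking how many rows fall under each annotation type. The sketch below does this with the standard library on a small in-memory CSV. The `Annotation Type` column name comes from the spreadsheet used in this notebook, while `count_annotation_types` and the sample rows are hypothetical.

```python
import csv
import io
from collections import Counter


def count_annotation_types(csv_text):
    """Count rows per value of the 'Annotation Type' column."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return Counter(row["Annotation Type"] for row in reader)


# Hypothetical rows with made-up UIDs.
SAMPLE_CSV = (
    "SeriesInstanceUID,ReferencedSeriesInstanceUID,Annotation Type\n"
    "9.1,1.1,Seed point\n"
    "9.2,1.1,Segmentation\n"
    "9.3,1.2,No findings\n"
    "9.4,1.3,Seed point\n"
)

print(count_annotation_types(SAMPLE_CSV))
```

With the dataframe loaded above, `annotation_Metadata['Annotation Type'].value_counts()` gives the equivalent summary in pandas.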

#### Download seed points
Since we're working with Series UIDs from a dataframe instead of JSON output from the API, we'll use the **input_type = "list"** parameter in the remaining download steps.  Each example downloads a sample of three scans; remove **number = 3** to download the entire subset.  We'll also specify a **csv_filename** to save the related metadata to a file.

In [None]:
# filter dataframe to only include seed point rows
seedPoints = annotation_Metadata[annotation_Metadata['Annotation Type'].str.contains('Seed point')]
#display(seedPoints)

# extract series UID column to list for downloading
series_data = seedPoints["SeriesInstanceUID"].tolist()

# download a sample set of three scans 
# return the series metadata as a dataframe
# save a CSV of the metadata 
df = nbia.downloadSeries(series_data, number = 3, api_url = "restricted", input_type = "list", csv_filename = "seedPoints")

#### Download 3D segmentations

In [None]:
# filter dataframe to only include segmentations
segs = annotation_Metadata[annotation_Metadata['Annotation Type'].str.contains('Segmentation')]
#display(segs)

# extract series UID column to list for downloading
series_data = segs["SeriesInstanceUID"].tolist()

# download a sample set of three scans 
# return the series metadata as a dataframe
# save a CSV of the metadata 
df = nbia.downloadSeries(series_data, number = 3, api_url = "restricted", input_type = "list", csv_filename = "aren0532_SEGs")

#### Download source images for seed points and segmentations

In [None]:
# filter dataframe to only include seg and seed point rows (remove "no findings")
ref_Series = annotation_Metadata[(annotation_Metadata['Annotation Type'] == 'Seed point') |
                                 (annotation_Metadata['Annotation Type'] == 'Segmentation')]

# remove duplicate ReferencedSeriesUIDs
clean_refSeries = ref_Series.drop_duplicates(subset='ReferencedSeriesInstanceUID')
#display(clean_refSeries)

# extract series UID column to list for downloading
series_data = clean_refSeries["ReferencedSeriesInstanceUID"].tolist()

# download a sample set of three scans 
# return the series metadata as a dataframe
# save a CSV of the metadata 
df = nbia.downloadSeries(series_data, number = 3, api_url = "restricted", input_type = "list", csv_filename = "seg_seed_source_images")
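
The de-duplication step matters because more than one annotation row can reference the same source scan, and each scan only needs to be downloaded once. This is a small pandas-free sketch of the same filter-then-deduplicate idea; `unique_referenced_series` and the sample rows are hypothetical, while the column names match the annotation metadata spreadsheet.

```python
def unique_referenced_series(rows, wanted_types):
    """Return referenced series UIDs for the wanted annotation types.

    Duplicates are removed while preserving first-seen order.
    """
    uids = [
        row["ReferencedSeriesInstanceUID"]
        for row in rows
        if row["Annotation Type"] in wanted_types
    ]
    return list(dict.fromkeys(uids))


# Hypothetical rows: scan 1.1 has both a seed point and a segmentation.
sample_rows = [
    {"Annotation Type": "Seed point", "ReferencedSeriesInstanceUID": "1.1"},
    {"Annotation Type": "Segmentation", "ReferencedSeriesInstanceUID": "1.1"},
    {"Annotation Type": "No findings", "ReferencedSeriesInstanceUID": "1.2"},
    {"Annotation Type": "Seed point", "ReferencedSeriesInstanceUID": "1.3"},
]

print(unique_referenced_series(sample_rows, {"Seed point", "Segmentation"}))
```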

#### Download source images with negative finding assessments
The following code will download the scans with negative finding assessments.  These are cases where the authors of the dataset did not find anything that could be annotated.  Downloading these scans could be useful if you are training a tumor/metastases detection model.

In [None]:
# filter dataframe to only include scans with "no findings"
ref_Series = annotation_Metadata[annotation_Metadata['Annotation Type'] == 'No findings']

# remove duplicate ReferencedSeriesUIDs
clean_refSeries = ref_Series.drop_duplicates(subset='ReferencedSeriesInstanceUID')
#display(clean_refSeries)

# extract series UID column to list for downloading
series_data = clean_refSeries["ReferencedSeriesInstanceUID"].tolist()

# download a sample set of three scans
# return the series metadata as a dataframe
# save a CSV of the metadata 
df = nbia.downloadSeries(series_data, number = 3, api_url = "restricted", input_type = "list", csv_filename = "noFinding_source_images")

# Acknowledgements
[The Cancer Imaging Archive (TCIA)](https://www.cancerimagingarchive.net/) is a service which de-identifies and hosts a large publicly available archive of medical images of cancer.  TCIA is funded by the [Cancer Imaging Program (CIP)](https://imaging.cancer.gov/), a part of the United States [National Cancer Institute (NCI)](https://www.cancer.gov/), and is managed by the [Frederick National Laboratory for Cancer Research (FNLCR)](https://frederick.cancer.gov/).

This notebook was created by [Justin Kirby](https://www.linkedin.com/in/justinkirby82/).  If you leverage TCIA datasets in your work please be sure to comply with the [TCIA Data Usage Policy](https://wiki.cancerimagingarchive.net/x/c4hF). Upon receiving access, you must also abide by the terms of your NCTN/NCORP Data Archive’s Data Use Agreement (DUA). You are not allowed to redistribute the data or use it for other purposes. Attribution should include references to the following citations:

## Data Citations

1. Fernandez, C. V., Mullen, E. A., Chi, Y.-Y., Ehrlich, P. F., Perlman, E. J., Kalapurakal, J. A., Khanna, G., Paulino, A. C., Hamilton, T. E., Gow, K. W., Tochner, Z., Hoffer, F. A., Withycombe, J. S., Shamberger, R. C., Kim, Y., Geller, J. I., Anderson, J. R., Grundy, P. E., & Dome, J. S. (2022). Vincristine, Dactinomycin, and Doxorubicin With or Without Radiation Therapy or Observation Only in Treating Younger Patients Who Are Undergoing Surgery for Newly Diagnosed Stage I, Stage II, or Stage III Wilms’ Tumor (AREN0532) (Version 1) [Data set]. The Cancer Imaging Archive. https://doi.org/10.7937/6PJ1-M859
2. Rozenfeld, M., & Jordan, P. (2023). Annotations for Vincristine, Dactinomycin, and Doxorubicin With or Without Radiation Therapy or Observation Only in Treating Younger Patients Who Are Undergoing Surgery for Newly Diagnosed Stage I, II, or III Wilms' Tumor (AREN0532-Tumor-Annotations) [Data set]. The Cancer Imaging Archive. https://doi.org/10.7937/KJA4-1Z76

## Publication Citation

Fernandez, C. V., Mullen, E. A., Chi, Y.-Y., Ehrlich, P. F., Perlman, E. J., Kalapurakal, J. A., Khanna, G., Paulino, A. C., Hamilton, T. E., Gow, K. W., Tochner, Z., Hoffer, F. A., Withycombe, J. S., Shamberger, R. C., Kim, Y., Geller, J. I., Anderson, J. R., Grundy, P. E., & Dome, J. S. (2018). Outcome and Prognostic Factors in Stage III Favorable-Histology Wilms Tumor: A Report From the Children’s Oncology Group Study AREN0532. In Journal of Clinical Oncology (Vol. 36, Issue 3, pp. 254–261). American Society of Clinical Oncology (ASCO). https://doi.org/10.1200/jco.2017.73.7999

## TCIA Citation

Clark, K., Vendt, B., Smith, K., Freymann, J., Kirby, J., Koppel, P., Moore, S., Phillips, S., Maffitt, D., Pringle, M., Tarbox, L., & Prior, F. (2013). The Cancer Imaging Archive (TCIA): Maintaining and Operating a Public Information Repository. Journal of Digital Imaging, 26(6), 1045–1057. https://doi.org/10.1007/s10278-013-9622-7