You can download and run this notebook locally, or you can run it for free in a cloud environment using Colab or Sagemaker Studio Lab:

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/kirbyju/TCIA_Notebooks/blob/main/CPTAC/CPTAC.ipynb)

[![Open In SageMaker Studio Lab](https://studiolab.sagemaker.aws/studiolab.svg)](https://studiolab.sagemaker.aws/import/github.com/kirbyju/TCIA_Notebooks/blob/main/CPTAC/CPTAC.ipynb)

# Accessing DICOM images and annotations from the CPTAC datasets hosted on TCIA

This notebook focuses on accessing the **"Clinical Proteomic Tumor Analysis Consortium"** (CPTAC) collections hosted on [The Cancer Imaging Archive (TCIA)](https://cancerimagingarchive.net). These datasets include radiology and histopathology images hosted on TCIA, as well as proteomic, genomic, and clinical data hosted in the Proteomic Data Commons and Genomic Data Commons. The National Cancer Institute has also funded an effort to generate and publish annotations (3D segmentation labels and seed points) on TCIA to help jumpstart research on tumor detection, auto-segmentation methods, and the generation of imaging features that can be compared with the proteomic, genomic, and clinical data.


# 1 Learn about the datasets

The images, annotations (tumor segmentation and seed point labels), and other related data are described in detail at the following links.  These pages are publicly visible without logging in:

1. [CPTAC-UCEC](https://doi.org/10.7937/89M3-KQ43): Corpus Endometrial Carcinoma
2. [CPTAC-PDA](https://doi.org/10.7937/BW9V-BX61): Pancreatic Ductal Adenocarcinoma
3. [CPTAC-CCRCC](https://doi.org/10.7937/SKQ4-QX48): Clear Cell Renal Carcinoma
4. [CPTAC-HNSCC](https://doi.org/10.7937/PFEC-T641): Head and Neck Squamous Cell Carcinoma **(restricted access - requires extra steps below)**

After taking a look at these collections, select the one you'd like to explore through the rest of this notebook by setting the collection variable below.

In [None]:
collection = "CPTAC-PDA"
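A small guard can catch a typo in the collection name before any later cell depends on it. The sketch below is only illustrative: `KNOWN_COLLECTIONS` and `check_collection` are hypothetical names introduced here, not part of tcia_utils, and the four names are taken from the list above.

```python
# Hypothetical helper (not part of tcia_utils). The names below are the
# four CPTAC collections listed in section 1.
KNOWN_COLLECTIONS = {"CPTAC-UCEC", "CPTAC-PDA", "CPTAC-CCRCC", "CPTAC-HNSCC"}

def check_collection(name):
    """Return `name` unchanged if it is one of the collections listed above."""
    if name not in KNOWN_COLLECTIONS:
        raise ValueError(
            f"Unknown collection {name!r}; expected one of {sorted(KNOWN_COLLECTIONS)}"
        )
    return name

collection = check_collection("CPTAC-PDA")
```

The check is case-sensitive, so a value such as "cptac-pda" would be rejected rather than silently returning no results later on.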

# 2 Setup

The following cells install and import **[tcia_utils](https://pypi.org/project/tcia-utils/)**, which contains a variety of useful functions for accessing TCIA via Python and Jupyter Notebooks.

In [None]:
!pip install --upgrade -q tcia_utils
!pip install -q pandas

In [None]:
import requests
import pandas as pd
from tcia_utils import nbia

### Set logging level to INFO (optional)
This step isn't necessary in a local JupyterLab installation, but Google Colab's root logging handler only shows warnings and errors by default. If you'd like to see INFO statements, run the following code. This is particularly helpful when running some of the API examples, because it lets you see progress as downloads complete, and it is required if you want to see the output of the makeSeriesReport() example.

In [None]:
import logging

# Check current handlers
#print(logging.root.handlers)

# Remove all handlers associated with the root logger object.
for handler in logging.root.handlers[:]:
    logging.root.removeHandler(handler)
#print(logging.root.handlers)

# Set handler with level = info
logging.basicConfig(format='%(asctime)s:%(levelname)s:%(message)s',
                    level=logging.INFO)

print("Logging set to INFO")
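On Python 3.8 or newer, the same reset can be written more compactly with the `force` argument of `logging.basicConfig`, which removes any handlers already attached to the root logger before applying the new configuration. This is an alternative sketch, not a required step; it assumes your runtime is at least Python 3.8.

```python
import logging

# force=True (Python 3.8+) removes existing root handlers first, which
# replaces the explicit removeHandler loop.
logging.basicConfig(format='%(asctime)s:%(levelname)s:%(message)s',
                    level=logging.INFO,
                    force=True)

logging.info("Logging set to INFO")
```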

# 3 Downloading images and annotations with the NBIA Data Retriever

TCIA uses software called NBIA to manage its DICOM data.  One way to download TCIA data is to install the NBIA Data Retriever.  There are regular [GUI versions](https://wiki.cancerimagingarchive.net/display/NBIA/Downloading+TCIA+Images) of this tool for Windows, Mac and Linux, but here we will highlight the [Linux command-line version of the NBIA Data Retriever](https://wiki.cancerimagingarchive.net/x/2QKPBQ).  This tool provides a number of useful features such as auto-retry if there are any problems, saving data in an organized hierarchy on your hard drive (Collection > Patient > Study > Series > Images), and providing a CSV file containing key DICOM metadata about the images you've downloaded.

_**Note:**_ It's also possible to download these data via our REST API if you can't or don't want to install the NBIA Data Retriever. This is covered later in the notebook.

## 3.1 Install the NBIA Data Retriever CLI package

In [None]:
# Install NBIA Data Retriever CLI software for downloading images later in this notebook.

!mkdir -p /usr/share/desktop-directories/
!wget -P /content/NBIA-Data-Retriever https://cbiit-download.nci.nih.gov/nbia/releases/ForTCIA/NBIADataRetriever_4.4/nbia-data-retriever-4.4.deb
!dpkg -i /content/NBIA-Data-Retriever/nbia-data-retriever-4.4.deb

# NOTE: If you're working on a Linux OS that uses RPM packages, you can change the wget line above to point to
#       https://cbiit-download.nci.nih.gov/nbia/releases/ForTCIA/NBIADataRetriever_4.4/NBIADataRetriever-4.4-1.x86_64.rpm
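Before moving on, it can be worth confirming that the install produced the executable the later cells call. The sketch below assumes the package installs to `/opt/nbia-data-retriever/nbia-data-retriever`, which is the path used by the download commands in this notebook; `retriever_installed` is a hypothetical helper name.

```python
from pathlib import Path

# Install location used by the download commands later in this notebook.
# Adjust it if the package placed the executable elsewhere.
RETRIEVER_PATH = Path("/opt/nbia-data-retriever/nbia-data-retriever")

def retriever_installed(path=RETRIEVER_PATH):
    """Return True if a file exists at the expected executable path."""
    return Path(path).is_file()

print("NBIA Data Retriever found:", retriever_installed())
```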

## 3.2 Download data
The NBIA Data Retriever software works by ingesting a "manifest" file that contains the DICOM Series Instance UIDs of the scans you want to download. We'll use some smaller manifests here that contain sample data for a single subject.  When you're ready to download the full dataset(s) you can obtain the manifests you are interested in from the summary pages mentioned in section 1.

In [None]:
manifest_urls = {
    "CPTAC-CCRCC": "https://github.com/kirbyju/TCIA_Notebooks/raw/main/CPTAC/CPTAC-CCRCC-demo.tcia",
    "CPTAC-PDA": "https://github.com/kirbyju/TCIA_Notebooks/raw/main/CPTAC/CPTAC-PDA-demo.tcia",
    "CPTAC-HNSCC": "https://github.com/kirbyju/TCIA_Notebooks/raw/main/CPTAC/CPTAC-HNSCC-demo.tcia",
    "CPTAC-UCEC": "https://github.com/kirbyju/TCIA_Notebooks/raw/main/CPTAC/CPTAC-UCEC-demo.tcia"
}

if collection in manifest_urls:
    manifest_url = manifest_urls[collection]
    manifest_filename = "CPTAC-demo.tcia"
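    manifest = requests.get(manifest_url)
    # stop here on an HTTP error rather than saving an error page as the manifest
    manifest.raise_for_status()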
    with open(manifest_filename, 'wb') as f:
        f.write(manifest.content)
    print("Manifest file saved as " + manifest_filename)
else:
    print("URL for collection not found.")
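To check what a manifest will download, you can list the Series Instance UIDs it contains. The sketch below assumes the manifest is plain text with a few `key=value` header lines followed by a `ListOfSeriesToDownload=` line and then one UID per line; open one of the demo manifests to confirm this layout before relying on it. `read_manifest_uids` is a hypothetical helper, and the sample text is made up.

```python
def read_manifest_uids(text):
    """Return the Series Instance UIDs listed in the text of a manifest.

    Assumes every non-empty line after 'ListOfSeriesToDownload=' is one UID.
    """
    uids = []
    in_list = False
    for line in text.splitlines():
        line = line.strip()
        if in_list and line:
            uids.append(line)
        elif line.startswith("ListOfSeriesToDownload="):
            in_list = True
    return uids

# Made-up example content, not a real manifest.
sample = """downloadServerUrl=https://example.org/download
manifestVersion=3.0
ListOfSeriesToDownload=
1.2.3.4
1.2.3.5
"""
print(read_manifest_uids(sample))
```

To apply it to the file saved above, read the file as text first, for example `read_manifest_uids(open(manifest_filename).read())`.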

### Accessing restricted CPTAC-HNSCC data (optional)
The data in the CPTAC-HNSCC collection contains images that could potentially be used to reconstruct a human face. To safeguard the privacy of participants, users must sign and submit a [TCIA Restricted License Agreement](https://wiki.cancerimagingarchive.net/download/attachments/4556915/TCIA%20Restricted%20License%2020220519.pdf?version=1&modificationDate=1652964581655&api=v2) to help@cancerimagingarchive.net before accessing the image data used to create the tumor annotations.  

After being granted access by the helpdesk, you must create a credential file to provide your user name and password to the NBIA Data Retriever.

In [None]:
nbia.makeCredentialFile()

Now we can open the sample manifest file with the NBIA Data Retriever to download the actual data.  You may need to update paths below if you installed Data Retriever elsewhere or saved the manifest/credential files somewhere other than **/content/**.

**<font color='red'>After running the following command, click in the output cell, type "y," and press Enter to agree with the TCIA Data Usage Policy and start the download.</font>**

In [None]:
if collection == "CPTAC-HNSCC":
    # this will include the credential file we created earlier as the final parameter
    !/opt/nbia-data-retriever/nbia-data-retriever --cli '/content/CPTAC-demo.tcia' -d /content/ -l /content/credentials.txt
else:
    !/opt/nbia-data-retriever/nbia-data-retriever --cli '/content/CPTAC-demo.tcia' -d /content/
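Outside a notebook, where the `!` shell syntax is unavailable, the same command can be assembled as an argument list. The sketch below only reuses the executable path and the `--cli`, `-d`, and `-l` options shown in the cell above; `build_retriever_command` is a hypothetical helper name.

```python
def build_retriever_command(collection,
                            manifest="/content/CPTAC-demo.tcia",
                            dest="/content/",
                            credentials="/content/credentials.txt"):
    """Assemble the NBIA Data Retriever command line as a list of arguments."""
    cmd = ["/opt/nbia-data-retriever/nbia-data-retriever",
           "--cli", manifest, "-d", dest]
    # Only the restricted collection needs the credential file.
    if collection == "CPTAC-HNSCC":
        cmd += ["-l", credentials]
    return cmd

cmd = build_retriever_command("CPTAC-PDA")
print(" ".join(cmd))
```

The list can then be passed to `subprocess.run(cmd, check=True)` from the standard library. The tool will still prompt you to agree to the TCIA Data Usage Policy.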

Stop and take a look at how the data are organized and review the metadata CSV file before moving on.
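One quick way to get an overview of the download is to count image files per top-level folder. This is a minimal sketch that assumes the images are saved with a `.dcm` extension; `count_dicom_files` is a hypothetical helper, and the folder names you see depend on how the Data Retriever laid out your download.

```python
from collections import Counter
from pathlib import Path

def count_dicom_files(root):
    """Count .dcm files beneath each immediate subfolder of `root`."""
    root = Path(root)
    counts = Counter()
    for path in root.rglob("*.dcm"):
        if path.is_file():
            # First path component below root identifies the top-level folder.
            counts[path.relative_to(root).parts[0]] += 1
    return dict(counts)

# Example call (point it at the folder you passed to -d):
# count_dicom_files("/content/")
```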

# 4 Accessing the REST APIs
The [NBIA REST APIs](https://wiki.cancerimagingarchive.net/x/ZoATBg) are another useful way for TCIA users to query metadata and download image data.  We'll rely heavily on [tcia_utils](https://pypi.org/project/tcia-utils/) to simplify accessing them.

Before we get started, let's set the api_url parameter.  This needs to be "restricted" if accessing CPTAC-HNSCC, but can be left empty otherwise.

In [None]:
api_url = "restricted" if collection == "CPTAC-HNSCC" else ""

### Create login token to access CPTAC-HNSCC (optional)

If CPTAC-HNSCC was selected as your collection, you must create a login token in order to access this restricted data via the API. In most of the steps below we will then utilize the **api_url = "restricted"** parameter to pass the login token to the API.

In [None]:
nbia.getToken()

## 4.1 Explore the data with REST API queries

Let's start by looking at which body parts and modalities are contained in the collection. For these datasets, RTSTRUCTs were used to record the segmentations, seed points, and scans where no tumor was found. By default, most functions from **tcia_utils** return results as JSON.

In [None]:
# count patients for each modality
data = nbia.getModalityCounts(collection)
print(data)

However, you can also use **format = "df"** to return the results as a dataframe.

In [None]:
# Count patients for each body part examined,
# return results as dataframe
df = nbia.getBodyPartCounts(collection, format = "df")

# rename headers and sort by PatientCount
df.rename(columns = {'criteria':'BodyPartExamined', 'count':'PatientCount'}, inplace = True)
df.PatientCount = df.PatientCount.astype(int)
display(df.sort_values(by='PatientCount', ascending=False, ignore_index = True))

Now let's run **nbia.getPatient()** and **nbia.getStudy()** to see what we can learn about the patient cohort from the DICOM metadata.  The patient information can include things like age, gender, and ethnicity. The study information includes additional information recorded on the date the patient was scanned such as the patient's age or how many days it has been since they were diagnosed.

In [None]:
df = nbia.getPatient(collection, format = "df", api_url = api_url)

display(df)

Let's use **format = "csv"** this time to save a CSV file in addition to returning a dataframe.  Verify that **getPatientStudy.csv** has been saved to your file system before proceeding.

In [None]:
# obtain study/visit details (e.g. anonymized study date, age at the time of visit)
df = nbia.getStudy(collection, format = "csv", api_url = api_url)
display(df)

We can also create a report with **nbia.getSeries()** that gives useful metadata about each scan in the dataset (e.g. series description, modality, scanner manufacturer & software version, number of images).  We'll return the results as JSON this time so we can use them in a subsequent step, but also convert them to a dataframe for readability.

In [None]:
# obtain scan/series metadata for a collection as JSON for use in next example
data = nbia.getSeries(collection, api_url = api_url)

# format as dataframe for easy viewing
df = pd.DataFrame(data)
display(df)

Finally, we can use the results from the getSeries() query to generate some summary statistics about the scans in the collection.

In [None]:
# Calculate summary statistics for a given collection
nbia.makeSeriesReport(data)
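If you want a custom tally that the report does not provide, the series JSON is a list of records that can be counted directly. The sketch below is not how makeSeriesReport() is implemented; it assumes each record carries `Modality` and `ImageCount` fields, and the sample records are made up for illustration.

```python
from collections import Counter

def tally_series(series, key="Modality"):
    """Count series records by the value stored under `key`."""
    return Counter(record.get(key, "unknown") for record in series)

# Made-up records shaped like the series metadata; field names are assumed.
sample_series = [
    {"SeriesInstanceUID": "1.2.3.1", "Modality": "CT", "ImageCount": 120},
    {"SeriesInstanceUID": "1.2.3.2", "Modality": "CT", "ImageCount": 80},
    {"SeriesInstanceUID": "1.2.3.3", "Modality": "RTSTRUCT", "ImageCount": 1},
]

print(tally_series(sample_series))
print("Total images:", sum(r["ImageCount"] for r in sample_series))
```

Passing a different `key`, such as a manufacturer or body part field if present in your results, gives the corresponding breakdown.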

## 4.3 Downloading data with the REST API
Next we'll cover using the API to download data.  This can be useful if you'd like to download results from API queries rather than using an existing manifest file.  It's also useful if you can't install the NBIA Data Retriever or want to integrate TCIA downloads into other pipelines/tools.  

The following examples demonstrate how to download the following subsets of data:

1. Seed point labels
2. 3d segmentation labels
3. Source images used to create seed points and segmentations
4. Source images with negative finding assessments



### 4.3.1 tcia_utils download functions
**tcia_utils** contains a **downloadSeries()** function that has multiple options for specifying the Series UIDs you'd like to download. By default, the function expects JSON data containing "SeriesInstanceUID" elements, which can be generated using **getSeries()** or **getCart()**. However, if you have a list of Series UIDs from some other source, you can set **input_type = "list"** to pass a Python list of one or more Series UIDs instead of JSON. You can also set **input_type = "manifest"** to pass the path of a .tcia manifest file as series_data.

Data are saved to a **tciaDownload** folder in your current working directory by default, but you can use the **path** parameter to change this to a different directory.

There is an optional **format** parameter that can be used to return metadata about what was downloaded. It can be set to **df** to return a dataframe or **csv** to save a spreadsheet. There's also a **csv_filename** parameter if you want to set a specific file name.

You can specify **number = n** to tell the function to only download the first **n** scans of your seriesUids.  Remove this parameter in the examples below to download the full dataset.
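The input options and the **number** limit described above can be pictured with a small stand-in function. This is an illustrative sketch only, not the tcia_utils implementation: `select_series_uids` is a hypothetical name, and it covers just the JSON and list inputs.

```python
def select_series_uids(series_data, input_type="", number=0):
    """Resolve the input into a list of Series UIDs, optionally truncated.

    Illustrative only; mirrors the documented options, not the library code.
    """
    if input_type == "list":
        uids = list(series_data)
    else:
        # Default: JSON-style records that each hold a SeriesInstanceUID.
        uids = [item["SeriesInstanceUID"] for item in series_data]
    # number = 0 means "no limit" in this sketch.
    return uids[:number] if number > 0 else uids

json_style = [{"SeriesInstanceUID": "1.1"}, {"SeriesInstanceUID": "1.2"}]
print(select_series_uids(json_style))
print(select_series_uids(["1.1", "1.2", "1.3", "1.4"], input_type="list", number=3))
```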

The **api_url** parameter can be omitted in most cases.  However, it must be set to **api_url = "nlst"** to access the [National Lung Screening Trial (NLST)](https://doi.org/10.7937/TCIA.HMQ8-J677) collection and you must use **api_url = "restricted"** for datasets that require logging in. This is only relevant for accessing CPTAC-HNSCC for our purposes.

Last but not least, there is some logic built in to detect whether you've already downloaded a series.  If a directory named after the seriesUid already exists the function will assume it's already been downloaded and skip that series.
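The skip rule amounts to a directory-existence check per Series UID. The sketch below shows that idea under the assumption that each series is stored in a folder named after its UID inside the download directory; `series_to_fetch` is a hypothetical helper, not a tcia_utils function.

```python
from pathlib import Path

def series_to_fetch(uids, download_dir="tciaDownload"):
    """Return the UIDs that do not yet have a folder under `download_dir`."""
    base = Path(download_dir)
    return [uid for uid in uids if not (base / uid).is_dir()]

# Example call: series_to_fetch(["1.2.3.4", "1.2.3.5"])
```

Because only the folder's existence is checked, a series whose download was interrupted would also be skipped; deleting that folder forces it to be fetched again.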

### 4.3.2 Download subsets of the data
To identify the subsets for our use cases, we'll leverage the **annotation metadata** spreadsheet the authors provided, which you can download manually from the collection homepage links in section 1 of the notebook or retrieve directly into a dataframe with the code below.

In [None]:
metadata_urls = {
    "CPTAC-CCRCC": "https://wiki.cancerimagingarchive.net/download/attachments/157288300/Metadata_Report_CPTAC-CCRCC_2023_07_14.csv",
    "CPTAC-PDA": "https://wiki.cancerimagingarchive.net/download/attachments/157288334/Metadata_Report_CPTAC-PDA_2023_07_14.csv",
    "CPTAC-HNSCC": "https://wiki.cancerimagingarchive.net/download/attachments/157288486/Metadata_Report_CPTAC-HNSCC_2023_07_14.csv",
    "CPTAC-UCEC": "https://wiki.cancerimagingarchive.net/download/attachments/157288358/Metadata_Report_CPTAC-UCEC_2023_07_14.csv"
}

if collection in metadata_urls:
    spreadsheet_url = metadata_urls[collection]
    annotation_Metadata = pd.read_csv(spreadsheet_url)
    display(annotation_Metadata)
else:
    print("URL for collection not found.")

#### Download seed points
Since we're working with Series UIDs from a dataframe instead of JSON output from the API, we'll use the  **input_type = "list"** parameter in the remaining download steps.  Options to download a sample (three scans) or the entire dataset are provided.  We'll also specify a **csv_filename** to save the related metadata to a file.

In [None]:
# filter dataframe to only include seed point rows
seedPoints = annotation_Metadata[annotation_Metadata['Annotation Type'].str.contains('Seed point')]
#display(seedPoints)

# extract series UID column to list for downloading
series_data = seedPoints["SeriesInstanceUID"].tolist()

# download a sample set of three scans (remove "number = 3" to download everything)
# and save a CSV of the download metadata using the given csv_filename
nbia.downloadSeries(series_data, number = 3, api_url = api_url, input_type = "list", csv_filename = collection + "-seedPoints")

#### Download 3D segmentations

In [None]:
# filter dataframe to only include segmentations
segs = annotation_Metadata[annotation_Metadata['Annotation Type'].str.contains('Segmentation')]
#display(segs)

# extract series UID column to list for downloading
series_data = segs["SeriesInstanceUID"].tolist()

# download a sample set of three scans (remove "number = 3" to download everything)
# and save a CSV of the download metadata using the given csv_filename
nbia.downloadSeries(series_data, number = 3, api_url = api_url, input_type = "list", csv_filename = collection + "-segs")

#### Download source images for seed points and segmentations

In [None]:
# filter dataframe to only include seg and seed point rows (remove "no findings")
ref_Series = annotation_Metadata[(annotation_Metadata['Annotation Type'] == 'Seed point') |
                                 (annotation_Metadata['Annotation Type'] == 'Segmentation')]

# remove duplicate ReferencedSeriesUIDs
clean_refSeries = ref_Series.drop_duplicates(subset='ReferencedSeriesInstanceUID')
#display(clean_refSeries)

# extract series UID column to list for downloading
series_data = clean_refSeries["ReferencedSeriesInstanceUID"].tolist()

# download a sample set of three scans (remove "number = 3" to download everything)
# and save a CSV of the download metadata using the given csv_filename
nbia.downloadSeries(series_data, number = 3, api_url = api_url, input_type = "list", csv_filename = collection + "-seg_seed_source_images")
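The de-duplication step matters because a single source scan can be referenced by more than one annotation, for example both a seed point and a segmentation. The toy example below uses made-up UIDs and the same column names as the annotation metadata spreadsheet to show the effect; it uses `isin` as a compact alternative to combining two equality tests.

```python
import pandas as pd

# Made-up rows; column names match those used in the cells above.
toy = pd.DataFrame({
    "Annotation Type": ["Seed point", "Segmentation", "No findings", "Seed point"],
    "SeriesInstanceUID": ["1.1", "1.2", "1.3", "1.4"],
    "ReferencedSeriesInstanceUID": ["9.1", "9.1", "9.2", "9.3"],
})

# Keep annotated rows only, then list each referenced source scan once.
annotated = toy[toy["Annotation Type"].isin(["Seed point", "Segmentation"])]
source_uids = annotated["ReferencedSeriesInstanceUID"].drop_duplicates().tolist()
print(source_uids)
```

Here the scan "9.1" is referenced twice but appears once in the resulting list, so it would only be requested once.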

#### Download source images with negative finding assessments
The following code downloads the scans with negative finding assessments. These are cases where the authors of the dataset did not find anything that could be annotated. Downloading these scans could be useful if you are training a tumor/metastasis detection model.

In [None]:
# filter dataframe to only include scans with "no findings"
ref_Series = annotation_Metadata[annotation_Metadata['Annotation Type'] == 'No findings']

# remove duplicate ReferencedSeriesUIDs
clean_refSeries = ref_Series.drop_duplicates(subset='ReferencedSeriesInstanceUID')
#display(clean_refSeries)

# extract series UID column to list for downloading
series_data = clean_refSeries["ReferencedSeriesInstanceUID"].tolist()

# download a sample set of three scans (remove "number = 3" to download everything)
# and save a CSV of the download metadata using the given csv_filename
nbia.downloadSeries(series_data, number = 3, api_url = api_url, input_type = "list", csv_filename = collection + "-noFinding_source_images")

# Acknowledgements
[The Cancer Imaging Archive (TCIA)](https://www.cancerimagingarchive.net/) is a service which de-identifies and hosts a large publicly available archive of medical images of cancer.  TCIA is funded by the [Cancer Imaging Program (CIP)](https://imaging.cancer.gov/), a part of the United States [National Cancer Institute (NCI)](https://www.cancer.gov/), and is managed by the [Frederick National Laboratory for Cancer Research (FNLCR)](https://frederick.cancer.gov/).

This notebook was created by [Justin Kirby](https://www.linkedin.com/in/justinkirby82/).  If you leverage TCIA datasets in your work please be sure to comply with the [TCIA Data Usage Policy](https://wiki.cancerimagingarchive.net/x/c4hF) and include all relevant citations.

## TCIA Citation

Clark, K., Vendt, B., Smith, K., Freymann, J., Kirby, J., Koppel, P., Moore, S., Phillips, S., Maffitt, D., Pringle, M., Tarbox, L., & Prior, F. (2013). The Cancer Imaging Archive (TCIA): Maintaining and Operating a Public Information Repository. Journal of Digital Imaging, 26(6), 1045–1057. https://doi.org/10.1007/s10278-013-9622-7

## Dataset Citations

Citations for each dataset can be found on their summary pages that are listed in section 1.