You can download and run this notebook locally, or you can run it for free in a cloud environment using Colab or Sagemaker Studio Lab:

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/kirbyju/TCIA_Notebooks/blob/main/TCIA_REST_API_Queries.ipynb)

[![Open In SageMaker Studio Lab](https://studiolab.sagemaker.aws/studiolab.svg)](https://studiolab.sagemaker.aws/import/github.com/kirbyju/TCIA_Notebooks/blob/main/TCIA_REST_API_Queries.ipynb)

# Summary

Access to large, high-quality datasets is essential for researchers to understand disease and precision medicine pathways, especially in cancer. However, HIPAA constraints make sharing medical images outside an individual institution complex. [The Cancer Imaging Archive (TCIA)](https://www.cancerimagingarchive.net/) is a public service funded by the National Cancer Institute that addresses this challenge by providing hosting and de-identification services that take major burdens of data sharing off researchers.

**This notebook is focused on basic use cases for leveraging the REST APIs to execute queries to learn about TCIA datasets.**  If you're interested in additional TCIA notebooks and coding examples, check out the tutorials at https://github.com/kirbyju/TCIA_Notebooks.

# 1 Learn about Available Collections on the TCIA Website

[Browsing Collections](https://www.cancerimagingarchive.net/collections) and viewing [Analysis Results](https://www.cancerimagingarchive.net/tcia-analysis-results/) of TCIA datasets are the easiest ways to become familiar with what is available. These pages will help you quickly identify datasets of interest, find valuable supporting data (e.g. clinical spreadsheets and non-DICOM segmentation data), and answer the most common questions you might have about the datasets.  Please note, there is a [separate API](https://www.cancerimagingarchive.net/collection-manager-rest-api/) to work with these types of metadata, but that is not the focus of this notebook.

# 2 NBIA REST API Overview
TCIA uses software called NBIA to manage DICOM data. The NBIA REST APIs are provided for the search and download functions used in the TCIA radiology portal and allow access to both public and limited access collections.
1. The [NBIA Search REST APIs](https://wiki.cancerimagingarchive.net/x/fILTB) allow you to perform basic queries and download data from **public** collections. These APIs do not require a TCIA account.
2. The [NBIA Search with Authentication REST APIs](https://wiki.cancerimagingarchive.net/x/X4ATBg) allow you to perform basic queries and download data from **public and limited-access** collections. These APIs require a TCIA account to create authentication tokens.
3. The [NBIA Advanced REST APIs](https://wiki.cancerimagingarchive.net/x/YoATBg) also allow access to **public and limited-access** collections, but provide query endpoints mostly geared towards developers seeking to integrate searching and downloading TCIA data into web and desktop applications. This API requires a TCIA account to create authentication tokens.

As of **tcia_utils** v2.3 you don't have to worry about this complexity.  Each function automatically calls the proper API, and if credentials aren't specified you are logged in via a guest account to view fully public data.  See the section on querying "Limited Access" collections later in this notebook to learn how to log in to view datasets you have received permission to use.
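
For readers curious about what sits underneath these helper functions, here is a minimal sketch of how a request URL for the public Search API could be assembled by hand. The base URL, the endpoint name `getBodyPartValues`, and the `Collection` parameter name are assumptions made for illustration; check the NBIA Search REST API guide linked above before relying on them. `build_query_url` is a hypothetical helper, not part of **tcia_utils**, and the sketch only builds the URL without sending a request.

```python
from urllib.parse import urlencode

# Assumed base URL for the public NBIA Search API; verify it in the API guide.
BASE_URL = "https://services.cancerimagingarchive.net/nbia-api/services/v1"


def build_query_url(endpoint, **params):
    """Return the request URL for an endpoint, skipping parameters set to None."""
    query = {key: value for key, value in params.items() if value is not None}
    url = f"{BASE_URL}/{endpoint}"
    return f"{url}?{urlencode(query)}" if query else url


# Endpoint and parameter names here are assumptions for illustration only.
print(build_query_url("getBodyPartValues", Collection="TCGA-LUAD"))
```

In practice the **tcia_utils** functions shown in the rest of this notebook handle this step, along with API selection and authentication, for you.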

# 3 Setup

The following cells install and import [**tcia_utils**](https://github.com/kirbyju/tcia_utils), which contains a variety of useful functions for accessing TCIA via Jupyter/Python. We'll step through many of its functions in the following section.

By default, most functions from tcia_utils return results in JSON.  However, you can use **format = "df"** to return the results as a dataframe, or **format = "csv"** to save a CSV file in addition to returning a dataframe.

Nearly all functions allow you to specify **api_url** as a query parameter.  This allows you to specify if you'd like to access the [National Lung Screening Trial (NLST)](https://doi.org/10.7937/TCIA.HMQ8-J677) collection, which lives on a separate server due to its size (>26,000 patients!).  We'll provide examples to show how this works later in the notebook.

In [None]:
import sys

# install tcia utils
!{sys.executable} -m pip install --upgrade -q tcia_utils

In [None]:
import requests
import pandas as pd
from tcia_utils import nbia

# set logging level to INFO in Google Colab (not necessary in Jupyter)
if 'google.colab' in sys.modules:
    import logging

    for handler in logging.root.handlers[:]:
        logging.root.removeHandler(handler)

    # Set handler with level = info
    logging.basicConfig(format='%(asctime)s:%(levelname)s:%(message)s',
                        level=logging.INFO)

    print("Google Colab Logging set to INFO")

# 4 Query Examples

## 4.1 getCollections()
The **getCollections()** function returns a list of collections.  

In [None]:
nbia.getCollections()
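
Because the default output is JSON, it can be filtered with plain Python. The sketch below assumes the result is a list of records that each carry a `Collection` key; that shape is an assumption, so inspect the real output first. The sample data and the `collection_names` helper are hypothetical and only illustrate the idea.

```python
# Hypothetical sample shaped like the JSON output
# (assumed: a list of {"Collection": name} records).
sample_collections = [
    {"Collection": "TCGA-LUAD"},
    {"Collection": "CPTAC-LUAD"},
    {"Collection": "TCGA-BRCA"},
]


def collection_names(records, prefix=""):
    """Return sorted collection names that start with the given prefix."""
    return sorted(
        record["Collection"]
        for record in records
        if record["Collection"].startswith(prefix)
    )


print(collection_names(sample_collections, prefix="TCGA-"))
# ['TCGA-BRCA', 'TCGA-LUAD']
```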


## 4.2 getBodyPart()
The **getBodyPart()** function returns a list of available body parts that were examined. Query parameters include **collection** and **modality**.

Let's look at the **TCGA-LUAD** collection from the list above and find out more about what body parts were examined.

In [None]:
nbia.getBodyPart(collection = "TCGA-LUAD")

## 4.3 getModality()
The **getModality()** function returns a list of available modalities. Query parameters include **collection** and **bodyPart**.

In [None]:
nbia.getModality(collection = "TCGA-LUAD")

## 4.4 getPatient()
The **getPatient()** function returns available patient information (e.g. species, gender, and ethnicity). You can also learn whether the subject is a [phantom](https://www.nist.gov/physics/what-are-imaging-phantoms) or not.  The only query parameter for this function is **collection**.

Let's try looking at the **CPTAC-LUAD** collection this time.  We'll also set the output format to a dataframe using **format = "df"** to make it easier to view in a notebook.

In [None]:
df = nbia.getPatient(collection = "CPTAC-LUAD", format = "df")
display(df)

Here's an example that does the same thing with the [National Lung Screening Trial (NLST) Collection](https://doi.org/10.7937/TCIA.HMQ8-J677).  In this case we have to set **api_url = "nlst"** to talk to the NLST server, but everything else works the same.

In [None]:
df = nbia.getPatient(collection = "NLST", format = "df", api_url = "nlst")
display(df)

## 4.5 getPatientByCollectionAndModality()

This function requires specifying collection and modality.  It returns a list of patient IDs that match your query.

In [None]:
patients = nbia.getPatientByCollectionAndModality(collection = "CPTAC-LUAD", modality = "CT")
print(patients)

## 4.6 getNewPatientsInCollection()
Gets patient metadata for all subjects in a given collection that were published after a specified release date.  Requires specifying a collection and release date.  The date format is YYYY/MM/DD.

In [None]:
df = nbia.getNewPatientsInCollection(collection = "CPTAC-LUAD", date = "2019/04/15", format = "df")
display(df)

## 4.7 getStudy()

The **getStudy()** function returns study/visit details such as the anonymized study date, subject's age at the time of visit, and number of scans acquired at each time point. Query parameters include **collection (required)**, **patientId**, and **studyUid**.

In [None]:
df = nbia.getStudy(collection = "CPTAC-LUAD", format = "df")
display(df)

## 4.8 getNewStudiesInPatient()
Gets metadata for all studies from a given patient that were published after a specified release date. Requires specifying a collection, patient ID and release date. The date format is YYYY/MM/DD.

In [None]:
df = nbia.getNewStudiesInPatient(collection = "CPTAC-LUAD", patientId = "C3N-02973", date = "2019/04/15", format = "df")
display(df)

## 4.9 getSeries()

The **getSeries()** function returns metadata about each scan in the dataset (e.g. series description, modality, scanner manufacturer and software version, number of images). Query parameters include **collection**, **patientId**, **studyUid**, **seriesUid**, **modality**, **bodyPart**, **manufacturer**, and **manufacturerModel**.  This time let's set the format to **CSV**.  Note that the file is saved as **getSeries_YYYY-MM-DD_HH-MM.csv**.

In [None]:
df = nbia.getSeries(collection = "CPTAC-LUAD", format = "csv")
display(df)
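
The timestamped filename pattern mentioned above can be reproduced with the standard library, which is handy if you want to predict or match the saved file's name. This is a sketch of the naming pattern only; `timestamped_csv_name` is a hypothetical helper and not necessarily how **tcia_utils** builds the name internally.

```python
from datetime import datetime


def timestamped_csv_name(function_name, when=None):
    """Build a name like getSeries_YYYY-MM-DD_HH-MM.csv for the given time."""
    when = when or datetime.now()
    return f"{function_name}_{when.strftime('%Y-%m-%d_%H-%M')}.csv"


print(timestamped_csv_name("getSeries", datetime(2024, 4, 15, 9, 30)))
# getSeries_2024-04-15_09-30.csv
```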

## 4.10 getUpdatedSeries()
Gets metadata for all series that were published after a specified release date. The release date is the only parameter for this function. The date format is YYYY/MM/DD.

**NOTE:** Unlike other API endpoints, this one expects DD/MM/YYYY, but **tcia_utils** converts the date before making the request so that date inputs are consistently YYYY/MM/DD across functions.

In [None]:
df = nbia.getUpdatedSeries(date = "2024/04/15", format = "df")
display(df)
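
To make the date handling described in the note concrete, here is a minimal sketch of a YYYY/MM/DD to DD/MM/YYYY conversion. The `to_day_first` helper is hypothetical and only illustrates the kind of conversion involved; it is not the actual **tcia_utils** implementation.

```python
from datetime import datetime


def to_day_first(date_string):
    """Convert 'YYYY/MM/DD' to 'DD/MM/YYYY'; raises ValueError on bad input."""
    parsed = datetime.strptime(date_string, "%Y/%m/%d")
    return parsed.strftime("%d/%m/%Y")


print(to_day_first("2024/04/15"))
# 15/04/2024
```

Parsing with `strptime` rather than splitting on slashes means a malformed or impossible date fails early with a `ValueError` instead of being sent to the server.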

## 4.11 getSopInstanceUids()
This returns the unique identifier (SOP Instance UID) for each image contained in a series/scan.  It requires that you specify the Series Instance UID of the scan you're interested in.

In [None]:
sopUids = nbia.getSopInstanceUids(seriesUid = "1.3.6.1.4.1.14519.5.2.1.7695.2311.498163603405178114978583022189")
print(sopUids)

# 5 Shared Carts
It's possible to use https://nbia.cancerimagingarchive.net to create a "[Shared Cart](https://wiki.cancerimagingarchive.net/display/NBIA/TCIA+Radiology+Portal+User+Guide#TCIARadiologyPortalUserGuide-SharingDatainYourCart)" which includes a specific set of scans you'd like to share with others.

After creating a Shared Cart you will receive a URL that looks like https://nbia.cancerimagingarchive.net/nbia-search/?saved-cart=nbia-49121659384603347 which can be shared with others.  Try clicking the link to see what this looks like on the TCIA website.  Then use the code below to see how you can achieve this using **tcia_utils**.

## 5.1 makeSharedCart()  (requires account registration)
First let's create a shared cart using **makeSharedCart()**.  This function requires a list of Series Instance UIDs as well as your cart's name, description and optional URL (if you want to provide a link to a website with more information about what's in your cart).

**Note:** Shared Cart names must be unique and you must log in to create one.  You can register for a TCIA account [here](https://wiki.cancerimagingarchive.net/x/xgHDAg).

In [None]:
# Create a random cart identifier to avoid naming collisions.
# You could also update the code below to choose your own custom name if you like.
import random

def generate_random_name():
  random_numbers = ''.join([str(random.randint(0, 9)) for _ in range(18)])
  return f"nbia-{random_numbers}"

name = generate_random_name()
series_list = ['1.2.826.0.1.534147.756.812677238.2022918711993.4', '1.2.826.0.1.534147.756.812677238.20229187057220.4', '1.2.826.0.1.534147.756.812677238.20229187148814.4']
description = "Testing out API cart creation with tcia_utils."
description_url = "https://pypi.org/project/tcia-utils/"

nbia.makeSharedCart(series_list, name, description, description_url)

## 5.2 getSharedCart()
Now let's try using the API to retrieve the cart we just created.  This function only requires that you specify the name of the cart you're trying to retrieve.  We'll reuse the randomly created **name** variable we set up in the last step.

In [None]:
df = nbia.getSharedCart(name = name, format = "df")
display(df)
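
If someone shares a Shared Cart URL like the one shown at the start of this section, the cart name is the value of its `saved-cart` query parameter. The sketch below extracts it with the standard library so it can be passed to **getSharedCart()**. `cart_name_from_url` is a hypothetical helper written for illustration.

```python
from urllib.parse import parse_qs, urlparse


def cart_name_from_url(url):
    """Return the value of the saved-cart query parameter in a Shared Cart URL."""
    values = parse_qs(urlparse(url).query).get("saved-cart")
    if not values:
        raise ValueError("URL has no saved-cart parameter")
    return values[0]


cart_url = "https://nbia.cancerimagingarchive.net/nbia-search/?saved-cart=nbia-49121659384603347"
print(cart_name_from_url(cart_url))
# nbia-49121659384603347
```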

# 6 Functions to analyze query results

Here we'll briefly discuss a couple of special functions in **tcia_utils** that can help further assist you in understanding your query results before you decide to download the data.

## 6.1 Collection and DOI summaries
These functions allow you to generate a summary report about Collections or DOIs from series metadata produced by getSeries(), getSeriesList(), getSharedCart(), getUpdatedSeries(), a Python list of Series UIDs, or a TCIA manifest.  

**Note:** getSharedCart() and getUpdatedSeries() do not provide DOI information in their output so these two queries only work as input to reportCollectionSummary().
```
Parameters:
series_data: The input data to be summarized (expects JSON by default).
input_type: Set to 'df' for dataframe.
            Set to 'list' for python list, or 'manifest' for *.TCIA manifest file.
            If manifest is used, series_data should be the path to the TCIA manifest file.
format: Output format (default is dataframe, 'csv' for CSV file, 'chart' for charts).
api_url: Only necessary if input_type = list or manifest.
        Set to 'restricted' for limited-access collections or 'nlst'
        for National Lung Screening trial.
```
Let's say you want to create a report summarizing scans that are CT modality which have a Body Part Examined of CHEST.  First we'll run the query for that and then we'll pass the results to our reporting functions.

Note that Analysis Result datasets have their own DOIs and the analysis data lives within the collection(s) that they analyzed.  Therefore, if you're trying to understand how those Analysis Results fit into collections you should use the DOI report option.  If you're just trying to understand what's in each collection and don't care if it was primary data or derived analyses contributed by others then you can use the Collection report option.

In [None]:
series = nbia.getSeries(modality = "CT", bodyPart="CHEST")

In [None]:
nbia.reportCollectionSummary(series)

Let's take a look at the breakdown by DOI now.  This will separate out the derived "analysis result" datasets that are outlined at https://www.cancerimagingarchive.net/browse-analysis-results/.  We can also use the **format** parameter to display some pie charts (shown below) in addition to providing the dataframe output, or you could set **format="csv"** to save the dataframe to a CSV file.

In [None]:
nbia.reportDoiSummary(series, format = "chart")
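
To give a feel for what a per-collection summary involves, here is a minimal sketch that aggregates series records by collection using only the standard library. The sample records are made up, and the field names (`Collection`, `PatientID`, `ImageCount`) are assumed to mirror the series metadata; this is not how **reportCollectionSummary()** is implemented, and the real report may include different columns.

```python
from collections import defaultdict

# Hypothetical series records; the field names are assumptions for illustration.
sample_series = [
    {"Collection": "CPTAC-LUAD", "PatientID": "P-001", "ImageCount": 120},
    {"Collection": "CPTAC-LUAD", "PatientID": "P-001", "ImageCount": 80},
    {"Collection": "CPTAC-LUAD", "PatientID": "P-002", "ImageCount": 100},
    {"Collection": "TCGA-LUAD", "PatientID": "P-003", "ImageCount": 60},
]


def summarize_by_collection(records):
    """Count unique patients, series, and images for each collection."""
    totals = defaultdict(lambda: {"patients": set(), "series": 0, "images": 0})
    for record in records:
        entry = totals[record["Collection"]]
        entry["patients"].add(record["PatientID"])
        entry["series"] += 1
        entry["images"] += record["ImageCount"]
    # Replace each patient set with its size so the result is easy to display.
    return {
        name: {
            "patients": len(entry["patients"]),
            "series": entry["series"],
            "images": entry["images"],
        }
        for name, entry in totals.items()
    }


for name, stats in summarize_by_collection(sample_series).items():
    print(name, stats)
```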

## 6.2 makeSeriesReport()

This function ingests the JSON output from **getSeries()** or **getSharedCart()** and creates a summary report.  Let's try it using the example Shared Cart introduced in section 5.

In [None]:
data = nbia.getSharedCart(name = "nbia-49121659384603347")

nbia.makeSeriesReport(data)

## 6.3 makeVizLinks()
This function ingests JSON output from **getSeries()** or **getSharedCart()**  and creates URLs to visualize them in a browser.  The links appear in the last 2 columns of the dataframe.  

The TCIA column displays the individual series described in each row.  The [Imaging Data Commons (IDC)](https://portal.imaging.datacommons.cancer.gov/) column displays the entire study (all series/scans from that time point).  The function accepts a **csv_filename** parameter if you'd like to save a CSV file of the output.  It just returns the dataframe if this is omitted.

There are a few caveats worth noting about this function:
* Modalities such as SEG/RTSTRUCT will not load using the TCIA series viewer, but opening the entire study with the IDC viewer generally enables you to see RTSTRUCT/SEG annotations overlaid on top of the images they were derived from.
* IDC links may not work if they haven't mirrored the series from TCIA yet. Here is the [list of the collections](https://portal.imaging.datacommons.cancer.gov/collections/) they currently host.
* The visualization URLs only work if the series/study you selected is from a fully public dataset. Visualization of limited-access collections is not currently supported.

In [None]:
# use getSeries() to identify some scans of interest
data = nbia.getSeries(collection = "CPTAC-LUAD", modality = "CT")

# create a dataframe and CSV file of visualization links
nbia.makeVizLinks(data, csv_filename="viz_links")

# 7 Querying "Limited Access" Collections (optional)
In some cases, you must specifically request access to collections before you can download them.  These are listed as **limited access** on the [Browse Collections](https://www.cancerimagingarchive.net/collections/) page.

The steps to request access may vary depending on the collection, but will always require that you first [create a TCIA user account](https://wiki.cancerimagingarchive.net/x/xgHDAg). Once you've created an account and have access to restricted collections you can use your login/password to create an API token with the **getToken()** function from **tcia_utils** to verify your permissions. Tokens are valid for 2 hours and must be refreshed after that point, but **tcia_utils** monitors the timeout for you and automatically refreshes tokens if needed.

**Note:** Historically the **api_url** parameter needed to be specified if you wanted to access 'restricted' datasets in most functions.  As of **tcia_utils version 2.3** this is no longer needed.  Simply create a token with your credentials and you're good to go!

In [None]:
nbia.getToken()

Let's say that we're interested in the [QIN-Breast-02](https://doi.org/10.7937/TCIA.2019.4cfm06rr) collection. As you can see on the collection page, you must email help@cancerimagingarchive.net to request access to the data. Once you've received approval and created a token we can use **nbia.getSeries()** to get a full list of series UIDs in this restricted collection.

In [None]:
# getSeries with query parameters
df = nbia.getSeries(collection = "QIN-Breast-02", format = "df")
display(df)

# 8 Downloading Data
Once you've mastered querying for data the next logical step would be to download it.  You can learn more about how to do this in https://github.com/kirbyju/TCIA_Notebooks/blob/main/TCIA_REST_API_Downloads.ipynb.

# Acknowledgements
TCIA is funded by the [Cancer Imaging Program (CIP)](https://imaging.cancer.gov/), a part of the United States [National Cancer Institute (NCI)](https://www.cancer.gov/).  It is managed by the [Frederick National Laboratory for Cancer Research (FNLCR)](https://frederick.cancer.gov/) and hosted by the [University of Arkansas for Medical Sciences (UAMS)](https://www.uams.edu/).

This notebook was created by [Justin Kirby](https://www.linkedin.com/in/justinkirby82/). If you leverage this notebook or any TCIA datasets in your work, please be sure to comply with the [TCIA Data Usage Policy](https://wiki.cancerimagingarchive.net/x/c4hF). In particular, make sure to cite the DOI(s) for the specific TCIA datasets you used in addition to the following paper!

## TCIA Citation

Clark, K., Vendt, B., Smith, K., Freymann, J., Kirby, J., Koppel, P., Moore, S., Phillips, S., Maffitt, D., Pringle, M., Tarbox, L., & Prior, F. (2013). The Cancer Imaging Archive (TCIA): Maintaining and Operating a Public Information Repository. Journal of Digital Imaging, 26(6), 1045–1057. https://doi.org/10.1007/s10278-013-9622-7