You can download and run this notebook locally, or you can run it for free in a cloud environment using Colab or Sagemaker Studio Lab:

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/kirbyju/TCIA_Notebooks/blob/main/TCIA_REST_API_Queries.ipynb)

[![Open In SageMaker Studio Lab](https://studiolab.sagemaker.aws/studiolab.svg)](https://studiolab.sagemaker.aws/import/github.com/kirbyju/TCIA_Notebooks/blob/main/TCIA_REST_API_Queries.ipynb)

# Summary

Access to large, high-quality datasets is essential for researchers to understand disease and precision medicine pathways, especially in cancer. However, HIPAA constraints make sharing medical images outside an individual institution complex. [The Cancer Imaging Archive (TCIA)](https://www.cancerimagingarchive.net/) is a public service funded by the National Cancer Institute that addresses this challenge by providing hosting and de-identification services that take major burdens of data sharing off researchers.

**This notebook is focused on basic use cases for leveraging the REST APIs to execute queries to learn about TCIA datasets.**  If you're interested in additional TCIA notebooks and coding examples, check out the tutorials at https://github.com/kirbyju/TCIA_Notebooks.

# 1 Learn about Available Collections on the TCIA Website

[Browsing Collections](https://www.cancerimagingarchive.net/collections) and viewing [Analysis Results](https://www.cancerimagingarchive.net/tcia-analysis-results/) of TCIA datasets are the easiest ways to become familiar with what is available. These pages will help you quickly identify datasets of interest, find valuable supporting data that are not available via our APIs (e.g. clinical spreadsheets and non-DICOM segmentation data), and answer the most common questions you might have about the datasets.  

# 2 REST API Overview
TCIA uses software called NBIA to manage DICOM data. The NBIA REST APIs are provided for the search and download functions used in the TCIA radiology portal and allow access to both public and limited access collections.
1. The [NBIA Search REST APIs](https://wiki.cancerimagingarchive.net/x/fILTB) allow you to perform basic queries and download data from **public** collections. These APIs do not require a TCIA account.
2. The [NBIA Search with Authentication REST APIs](https://wiki.cancerimagingarchive.net/x/X4ATBg) allow you to perform basic queries and download data from **public and limited-access** collections. These APIs require a TCIA account to create authentication tokens.
3. The [NBIA Advanced REST APIs](https://wiki.cancerimagingarchive.net/x/YoATBg) also allow access to **public and limited-access** collections, but provide query endpoints mostly geared towards developers seeking to integrate searching and downloading TCIA data into web and desktop applications. This API requires a TCIA account to create authentication tokens.

# 3 Setup

The following cells install and import [**tcia_utils**](https://github.com/kirbyju/tcia_utils), which contains a variety of useful functions for accessing TCIA via Jupyter/Python. We'll step through many of its functions in the following section.

By default, most functions from tcia_utils return results in JSON.  However, you can use **format = "df"** to return the results as a dataframe, or **format = "csv"** to save a CSV file in addition to returning a dataframe.

Nearly all functions allow you to specify **api_url** as a query parameter.  This allows you to specify if you'd like to access restricted collections or the [National Lung Screening Trial (NLST)](https://doi.org/10.7937/TCIA.HMQ8-J677) collection, which lives on a separate server due to its size (>26,000 patients!).  We'll provide examples to show how this works later in the notebook.


In [1]:
import sys

# install tcia utils
!{sys.executable} -m pip install --upgrade -q tcia_utils

In [2]:
import requests
import pandas as pd
from tcia_utils import nbia

# set logging level to INFO in Google Colab (not necessary in Jupyter)
if 'google.colab' in sys.modules:
  import logging

  for handler in logging.root.handlers[:]:
      logging.root.removeHandler(handler)

  # Set handler with level = info
  logging.basicConfig(format='%(asctime)s:%(levelname)s:%(message)s',
                      level=logging.INFO)

  print("Google Colab Logging set to INFO")

Google Colab Logging set to INFO


# 4 Query Examples

## 4.1 getCollections()
The **getCollections()** function returns a list of collections.  

In [None]:
nbia.getCollections()
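
If you only care about a subset of collections, you can filter the JSON result with plain Python. The sketch below assumes the result is a list of dictionaries with a `Collection` key; the sample records are hand-written stand-ins rather than real API output, and `find_collections` is a hypothetical helper, not part of tcia_utils.

```python
# Hand-written stand-in for the JSON returned by nbia.getCollections().
# Assumption: each record is a dict with a "Collection" key.
sample_collections = [
    {"Collection": "TCGA-LUAD"},
    {"Collection": "ACRIN-FLT-Breast"},
    {"Collection": "CPTAC-LUAD"},
    {"Collection": "C4KC-KiTS"},
]

def find_collections(records, keyword):
    """Return sorted collection names containing keyword (case-insensitive)."""
    keyword = keyword.lower()
    names = (r["Collection"] for r in records if "Collection" in r)
    return sorted(n for n in names if keyword in n.lower())

luad_collections = find_collections(sample_collections, "luad")
print(luad_collections)
```

With real data you would pass the output of `nbia.getCollections()` in place of `sample_collections`.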


## 4.2 getBodyPart()
The **getBodyPart()** function returns a list of available body parts that were examined. Query parameters include **collection** and **modality**.

Let's look at the **TCGA-LUAD** collection from the list above and find out more about what body parts were examined.

In [None]:
nbia.getBodyPart(collection = "TCGA-LUAD")

## 4.3 getModality()
The **getModality()** function returns a list of available modalities. Query parameters include **collection** and **bodyPart**.

In [None]:
nbia.getModality(collection = "TCGA-LUAD")

## 4.4 getPatient()
The **getPatient()** function returns available patient information (e.g. species, gender, and ethnicity). You can also learn whether the subject is a [phantom](https://www.nist.gov/physics/what-are-imaging-phantoms) or not.  The only query parameter for this function is **collection**.

Let's try looking at the **CPTAC-LUAD** collection this time.  We'll also set the output format to a dataframe to make it easier to view.

In [None]:
df = nbia.getPatient(collection = "CPTAC-LUAD", format = "df")
display(df)

Here's an example that does the same thing with the [National Lung Screening Trial (NLST) Collection](https://doi.org/10.7937/TCIA.HMQ8-J677).  In this case we have to set **api_url = "nlst"** to talk to the NLST server, but everything else works the same.

In [None]:
df = nbia.getPatient(collection = "NLST", format = "df", api_url = "nlst")
display(df)

## 4.5 getStudy()

The **getStudy()** function returns study/visit details such as the anonymized study date, subject's age at the time of visit, and number of scans acquired at each timepoint. Query parameters include **collection (required)**, **patientId**, and **studyUid**.

In [None]:
df = nbia.getStudy(collection = "CPTAC-LUAD", format = "df")
display(df)

## 4.6 getSeries()

The **getSeries()** function returns metadata about each scan in the dataset (e.g. series description, modality, scanner manufacturer and software version, number of images). Query parameters include **collection**, **patientId**, **studyUid**, **seriesUid**, **modality**, **bodyPart**, **manufacturer**, and **manufacturerModel**.  This time let's set the format to **CSV**.  Note that the file is saved as **getSeries.csv**.

In [None]:
df = nbia.getSeries(collection = "CPTAC-LUAD", format = "csv")
display(df)
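
If you keep the default JSON output instead, a quick tally by modality takes only a few lines. This is a minimal sketch that assumes each series record is a dictionary with `Modality` and `ImageCount` keys; the records below are made-up placeholders, not real TCIA series.

```python
from collections import Counter

# Made-up placeholder records shaped like the JSON we assume getSeries() returns.
sample_series = [
    {"SeriesInstanceUID": "1.2.3.1", "Modality": "CT", "ImageCount": 120},
    {"SeriesInstanceUID": "1.2.3.2", "Modality": "CT", "ImageCount": 98},
    {"SeriesInstanceUID": "1.2.3.3", "Modality": "MR", "ImageCount": 40},
]

# Number of series per modality
series_per_modality = Counter(s["Modality"] for s in sample_series)

# Total number of images per modality
images_per_modality = Counter()
for s in sample_series:
    images_per_modality[s["Modality"]] += int(s["ImageCount"])

print(dict(series_per_modality))
print(dict(images_per_modality))
```

For anything more elaborate, the dataframe output (**format = "df"**) combined with pandas grouping is usually more convenient.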

## 4.7 getSharedCart()
You can use https://nbia.cancerimagingarchive.net to create a "[Shared Cart](https://wiki.cancerimagingarchive.net/display/NBIA/TCIA+Radiology+Portal+User+Guide#TCIARadiologyPortalUserGuide-SharingDatainYourCart)" which includes a specific set of scans you'd like to share with others. After creating a Shared Cart you receive a URL like https://nbia.cancerimagingarchive.net/nbia-search/?saved-cart=nbia-49121659384603347.  Try clicking the link to see what this looks like on the TCIA website.  Then use the code below to see how you can use the cart name at the end of the URL to access the related scans via the API.

In [None]:
df = nbia.getSharedCart(name = "nbia-49121659384603347", format = "df")
display(df)
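
If someone sends you the full Shared Cart URL, you can pull the cart name out of it rather than copying it by hand. The sketch below uses only the Python standard library; `cart_name_from_url` is a hypothetical helper and not part of tcia_utils.

```python
from urllib.parse import urlparse, parse_qs

def cart_name_from_url(url):
    """Return the value of the 'saved-cart' query parameter, or None if absent."""
    query = parse_qs(urlparse(url).query)
    values = query.get("saved-cart")
    return values[0] if values else None

shared_cart_url = (
    "https://nbia.cancerimagingarchive.net/nbia-search/"
    "?saved-cart=nbia-49121659384603347"
)
cart_name = cart_name_from_url(shared_cart_url)
print(cart_name)
```

The resulting string can then be passed as the **name** parameter of **getSharedCart()**.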

# 5 Functions to analyze query results

Here we'll briefly discuss a couple of special functions in **tcia_utils** that can help further assist you in understanding your query results before you decide to download the data.

## 5.1 Collection and DOI summaries
These functions generate a summary report about Collections or DOIs from series metadata. The input can be the output of getSeries(), getSeriesList(), getSharedCart(), or getUpdatedSeries(), a Python list of Series UIDs, or a TCIA manifest.

**Note:** getSharedCart() and getUpdatedSeries() do not provide DOI information in their output so these two queries only work as input to reportCollectionSummary().
```
Parameters:
series_data: The input data to be summarized (expects JSON by default).
input_type: Set to 'df' for dataframe.
            Set to 'list' for python list, or 'manifest' for *.TCIA manifest file.
            If manifest is used, series_data should be the path to the TCIA manifest file.
format: Output format (default is dataframe, 'csv' for CSV file, 'chart' for charts).
api_url: Only necessary if input_type = list or manifest.
        Set to 'restricted' for limited-access collections or 'nlst'
        for National Lung Screening trial.
```
Let's say you want to create a report summarizing scans that are CT modality which have a Body Part Examined of CHEST.  First we'll run the query for that and then we'll pass the results to our reporting functions.

Note that Analysis Result datasets have their own DOIs, and the analysis data lives within the collection(s) that they analyzed.  Therefore, if you're trying to understand how those Analysis Results fit into collections, you should use the DOI report option.  If you're just trying to understand what's in each collection and don't care whether it was primary data or derived analyses contributed by others, you can use the Collection report option.

In [3]:
series = nbia.getSeries(bodyPart="CHEST")

2024-04-03 23:00:28,141:INFO:Calling... https://services.cancerimagingarchive.net/nbia-api/services/v1/getSeries with parameters {'BodyPartExamined': 'CHEST'}
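
The log line above shows the endpoint and parameter name that the query was sent with. If you ever want to reproduce such a request outside of tcia_utils, a URL can be assembled from those two pieces. This is only a sketch: it assumes the service accepts the parameters as an ordinary GET query string, and `build_query_url` is a hypothetical helper. No request is actually sent here.

```python
from urllib.parse import urlencode

# Base URL taken from the log line above
BASE_URL = "https://services.cancerimagingarchive.net/nbia-api/services/v1/"

def build_query_url(endpoint, **params):
    """Join the base URL, endpoint, and any parameters that are not None."""
    clean = {k: v for k, v in params.items() if v is not None}
    url = BASE_URL + endpoint
    return url + "?" + urlencode(clean) if clean else url

url = build_query_url("getSeries", BodyPartExamined="CHEST")
print(url)
```

You could pass the resulting URL to `requests.get()` (imported during setup) to fetch the raw response.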


In [4]:
nbia.reportCollectionSummary(series)

Unnamed: 0,Collection,DOIs,Licenses,Subjects,Studies,Series,Images,File Size,Disk Space,Body Parts,Modalities,Manufacturers,Min TimeStamp,Max TimeStamp,UniqueTimeStamps
0,ACRIN-FLT-Breast,https://doi.org/10.7937/K9/TCIA.2017.ol20zmxg,Creative Commons Attribution 3.0 Unported License,1,2,8,777,468101618,468.10 MB,CHEST,CT,SIEMENS,2020-01-17 11:40:07.0,2020-01-17 11:42:38.0,2020-01-17
1,ACRIN-NSCLC-FDG-PET,https://doi.org/10.7937/tcia.2019.30ilqfcl,Creative Commons Attribution 3.0 Unported License,124,199,523,48668,26054313156,26.05 GB,CHEST,"CT, PT, CR, DX","SIEMENS, TOSHIBA, GE MEDICAL SYSTEMS, Siemens,...",2018-12-06 13:21:28.0,2020-01-22 18:57:51.0,"2018-12-06, 2018-12-26, 2018-12-07, 2018-12-27..."
2,Anti-PD-1_Lung,https://doi.org/10.7937/tcia.2019.zjjwb9ip,Creative Commons Attribution 3.0 Unported License,8,9,95,17676,10066569008,10.07 GB,CHEST,CT,"SIEMENS, TOSHIBA",2019-03-25 20:49:47.0,2019-03-26 02:19:38.0,"2019-03-26, 2019-03-25"
3,C4KC-KiTS,https://doi.org/10.7937/TCIA.2019.IX49E8NX,Creative Commons Attribution 3.0 Unported License,11,11,22,1885,1115022430,1.12 GB,CHEST,"CT, SEG","SIEMENS, TOSHIBA, QIICR",2020-06-17 15:29:10.0,2020-06-17 17:13:41.0,2020-06-17
4,CC-Radiomics-Phantom-2,https://doi.org/10.7937/TCIA.2019.4l24tz5g,Creative Commons Attribution 3.0 Unported License,138,138,138,33517,17660595234,17.66 GB,CHEST,CT,SIEMENS,2019-02-21 16:05:49.0,2019-02-21 18:18:29.0,2019-02-21
5,CMB-CRC,https://doi.org/10.7937/DJG7-GZ87,Creative Commons Attribution 4.0 International...,12,23,88,9533,5798326084,5.80 GB,CHEST,"CT, DX","SIEMENS, TOSHIBA, Samsung Electronics, GE MEDI...",2021-07-30 09:39:17.0,2024-02-23 05:45:12.0,"2021-07-30, 2022-08-01, 2023-03-31, 2023-04-02..."
6,CMB-LCA,https://doi.org/10.7937/3CX3-S132,Creative Commons Attribution 4.0 International...,7,10,56,5373,2914546916,2.91 GB,CHEST,"CT, DX","TOSHIBA, SIEMENS, GE MEDICAL SYSTEMS, FUJIFILM...",2021-07-30 09:48:18.0,2024-02-23 05:45:10.0,"2021-07-30, 2022-08-01, 2023-04-02, 2024-02-23"
7,CMB-MEL,https://doi.org/10.7937/GWSP-WH72,Creative Commons Attribution 4.0 International...,3,7,32,4135,3350837220,3.35 GB,CHEST,CT,"SIEMENS, Philips",2022-08-01 14:44:29.0,2024-02-23 05:45:12.0,"2022-08-01, 2023-04-02, 2024-02-23"
8,CMB-PCA,https://doi.org/10.7937/25T7-6Y12,Creative Commons Attribution 4.0 International...,2,5,16,1162,624779868,624.78 MB,CHEST,"CT, NM",GE MEDICAL SYSTEMS,2023-03-31 13:45:03.0,2024-02-23 05:44:40.0,"2023-03-31, 2023-04-02, 2024-02-23"
9,COVID-19-AR,https://doi.org/10.7937/tcia.2020.py71-5978,Creative Commons Attribution 4.0 International...,104,244,329,10180,8362258186,8.36 GB,CHEST,"CR, DX, CT","FUJIFILM Corporation, Philips, Philips Medical...",2020-07-11 16:43:53.0,2020-07-11 17:22:45.0,2020-07-11
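
The **Disk Space** column is a human-readable version of the **File Size** column, which is given in bytes. Judging by the values in the table, it appears to use decimal units (1 kB = 1000 bytes). The sketch below reproduces that formatting under this assumption; `format_size` is a hypothetical helper, and the library's own implementation may differ.

```python
def format_size(num_bytes):
    """Format a byte count with decimal (SI) units and two decimal places."""
    units = ["B", "kB", "MB", "GB", "TB"]
    size = float(num_bytes)
    for unit in units:
        # Stop at the first unit where the value drops below 1000,
        # or at the largest unit we know about.
        if size < 1000 or unit == units[-1]:
            return f"{size:.2f} {unit}"
        size /= 1000

# File Size values copied from the report output above
for file_size in (468101618, 26054313156, 624779868):
    print(file_size, "->", format_size(file_size))
```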


Let's take a look at the breakdown by DOI now.  This will separate out the derived "analysis result" datasets that are outlined at https://www.cancerimagingarchive.net/browse-analysis-results/.  Setting **format = "chart"** displays some pie charts in addition to returning the dataframe output.  Alternatively, you could set **format = "csv"** to save the dataframe to a CSV file.

In [6]:
nbia.reportDoiSummary(series, format = "chart")

2024-04-03 23:01:27,471:INFO:Calling... https://api.datacite.org/dois/ with parameters {'provider-id': 'tciar', 'page[size]': 1000}


Unnamed: 0,Identifier,CollectionURI,Collections,Licenses,Subjects,Studies,Series,Images,File Size,Disk Space,Body Parts,Modalities,Manufacturers,Min TimeStamp,Max TimeStamp,UniqueTimeStamps
0,Anti-PD-1_Lung,https://doi.org/10.7937/tcia.2019.zjjwb9ip,Anti-PD-1_Lung,Creative Commons Attribution 3.0 Unported License,8,9,95,17676,10066569008,10.07 GB,CHEST,CT,"SIEMENS, TOSHIBA",2019-03-25 20:49:47.0,2019-03-26 02:19:38.0,"2019-03-26, 2019-03-25"
1,CMB-PCA,https://doi.org/10.7937/25T7-6Y12,CMB-PCA,Creative Commons Attribution 4.0 International...,2,5,16,1162,624779868,624.78 MB,CHEST,"CT, NM",GE MEDICAL SYSTEMS,2023-03-31 13:45:03.0,2024-02-23 05:44:40.0,"2023-03-31, 2023-04-02, 2024-02-23"
2,EA1141,https://doi.org/10.7937/2BAS-HR33,EA1141,Creative Commons Attribution 4.0 International...,20,30,217,15259,7629649460,7.63 GB,CHEST,MR,GE MEDICAL SYSTEMS,2023-08-13 16:25:41.0,2023-08-14 20:50:21.0,"2023-08-14, 2023-08-13"
3,MIDRC-RICORD-1B,https://doi.org/10.7937/31V8-4A40,MIDRC-RICORD-1B,Creative Commons Attribution-NonCommercial 4.0...,53,54,54,9217,4635307706,4.64 GB,CHEST,CT,Not Specified,2021-01-08 10:15:41.0,2021-01-08 10:16:54.0,2021-01-08
4,CMB-LCA,https://doi.org/10.7937/3CX3-S132,CMB-LCA,Creative Commons Attribution 4.0 International...,7,10,56,5373,2914546916,2.91 GB,CHEST,"CT, DX","TOSHIBA, SIEMENS, GE MEDICAL SYSTEMS, FUJIFILM...",2021-07-30 09:48:18.0,2024-02-23 05:45:10.0,"2021-07-30, 2022-08-01, 2023-04-02, 2024-02-23"
5,CT-vs-PET-Ventilation-Imaging,https://doi.org/10.7937/3ppx-7s22,CT-vs-PET-Ventilation-Imaging,Creative Commons Attribution 4.0 International...,20,21,60,23175,12230968756,12.23 GB,CHEST,CT,SIEMENS,2022-05-26 11:18:47.0,2022-05-26 11:20:02.0,2022-05-26
6,CPTAC-UCEC-Tumor-Annotations,https://doi.org/10.7937/89M3-KQ43,CPTAC-UCEC,Creative Commons Attribution 4.0 International...,2,2,4,4,34080,34.08 kB,CHEST,RTSTRUCT,Open Health Imaging Foundation,2023-07-14 19:50:09.0,2023-07-14 19:50:55.0,2023-07-14
7,MIDRC-RICORD-1C,https://doi.org/10.7937/91ah-v663,MIDRC-RICORD-1C,Creative Commons Attribution-NonCommercial 4.0...,348,948,1164,1180,11506468200,11.51 GB,CHEST,"DX, CR","Not Specified, Agfa",2020-12-11 13:38:23.0,2020-12-15 10:39:43.0,"2020-12-11, 2020-12-15"
8,CPTAC-PDA-Tumor-Annotations,https://doi.org/10.7937/BW9V-BX61,CPTAC-PDA,Creative Commons Attribution 4.0 International...,4,4,14,14,150186,150.19 kB,CHEST,RTSTRUCT,Open Health Imaging Foundation,2023-07-03 14:52:24.0,2023-07-03 14:54:15.0,2023-07-03
9,CMB-CRC,https://doi.org/10.7937/DJG7-GZ87,CMB-CRC,Creative Commons Attribution 4.0 International...,12,23,88,9533,5798326084,5.80 GB,CHEST,"CT, DX","SIEMENS, TOSHIBA, Samsung Electronics, GE MEDI...",2021-07-30 09:39:17.0,2024-02-23 05:45:12.0,"2021-07-30, 2022-08-01, 2023-03-31, 2023-04-02..."


## 5.2 makeSeriesReport()

This function ingests the JSON output from **getSeries()** or **getSharedCart()** and creates a summary report.  Let's try it using the Shared Cart results that we looked at in our last query.

In [None]:
data = nbia.getSharedCart(name = "nbia-49121659384603347")

nbia.makeSeriesReport(data)

## 5.3 makeVizLinks()
This function ingests JSON output from **getSeries()** or **getSharedCart()**  and creates URLs to visualize them in a browser.  The links appear in the last 2 columns of the dataframe.  

The TCIA column displays the individual series described in each row.  The [Imaging Data Commons (IDC)](https://portal.imaging.datacommons.cancer.gov/) column displays the entire study (all series/scans from that time point).  The function accepts a **csv_filename** parameter if you'd like to save a CSV file of the output.  If this parameter is omitted, the function just returns the dataframe.

There are a few caveats worth noting about this function:
* Modalities such as SEG/RTSTRUCT will not load using the TCIA series viewer, but opening the entire study with the IDC viewer generally enables you to see RTSTRUCT/SEG annotations overlaid on top of the images they were derived from.
* IDC links may not work if they haven't mirrored the series from TCIA yet. Here is the [list of the collections](https://portal.imaging.datacommons.cancer.gov/collections/) they currently host.
* The visualization URLs only work if the series/study you selected is from a fully public dataset. Visualization of limited-access collections is not currently supported.
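
Given the first caveat, it can help to separate SEG/RTSTRUCT series from the rest before looking at the series-level links. The following sketch assumes the JSON records carry a `Modality` key; the sample records are made up, and `split_by_viewability` is a hypothetical helper rather than a tcia_utils function.

```python
# Modalities that, per the caveat above, do not load in the TCIA series viewer
NON_VIEWABLE = {"SEG", "RTSTRUCT"}

def split_by_viewability(series):
    """Split series records into (viewable, study_only) lists by modality."""
    viewable = [s for s in series if s.get("Modality") not in NON_VIEWABLE]
    study_only = [s for s in series if s.get("Modality") in NON_VIEWABLE]
    return viewable, study_only

# Made-up placeholder records
sample_series = [
    {"SeriesInstanceUID": "1.2.3.1", "Modality": "CT"},
    {"SeriesInstanceUID": "1.2.3.2", "Modality": "SEG"},
    {"SeriesInstanceUID": "1.2.3.3", "Modality": "RTSTRUCT"},
]

viewable, study_only = split_by_viewability(sample_series)
print(len(viewable), "series for the TCIA viewer")
print(len(study_only), "series best opened at the study level in IDC")
```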

In [None]:
# use getSeries() to identify some scans of interest
data = nbia.getSeries(collection = "CPTAC-LUAD", modality = "CT")

# create a dataframe and CSV file of visualization links
nbia.makeVizLinks(data, csv_filename="viz_links")

# 6 Querying "Limited Access" Collections (optional)
In some cases, you must specifically request access to collections before you can download them.  These are listed as **limited access** on the [Browse Collections](https://www.cancerimagingarchive.net/collections/) page.

The steps to request access may vary depending on the collection, but will always require that you first [create a TCIA user account](https://wiki.cancerimagingarchive.net/x/xgHDAg). Once you've created an account and have access to restricted collections you can use your login/password to create an API token with the **getToken()** function from **tcia_utils** to verify your permissions. **<font color='red'>Tokens are valid for 2 hours and must be refreshed after that point.</font>**

In [None]:
nbia.getToken()
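
Because tokens only last 2 hours, long-running sessions may want to keep track of when the token was created. The sketch below is one simple way to do that with the standard library. `token_expired` is a hypothetical helper (tcia_utils may handle this differently), and the 5-minute safety margin is an arbitrary choice.

```python
from datetime import datetime, timedelta

# Token lifetime stated above
TOKEN_LIFETIME = timedelta(hours=2)

def token_expired(created_at, now=None, margin=timedelta(minutes=5)):
    """Return True if a token created at created_at is expired or about to expire."""
    now = now or datetime.now()
    return now >= created_at + TOKEN_LIFETIME - margin

# Example timestamps: a token created at 23:00 is treated as expired from 00:55 onward
created = datetime(2024, 4, 3, 23, 0, 0)
print(token_expired(created, now=datetime(2024, 4, 4, 0, 30, 0)))
print(token_expired(created, now=datetime(2024, 4, 4, 0, 56, 0)))
```

In practice you would record `datetime.now()` right after calling **nbia.getToken()** and call it again whenever the check returns `True`.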

Let's say that we're interested in the [QIN-Breast-02](https://doi.org/10.7937/TCIA.2019.4cfm06rr) collection. As you can see on the collection page, you must email help@cancerimagingarchive.net to request access to the data. Once you've received approval, you can use **nbia.getSeries()** to get a full list of series UIDs in this restricted collection by including **api_url = "restricted"** as a parameter.

In [None]:
# getSeries with query parameters
df = nbia.getSeries(collection = "QIN-Breast-02",
                      format = "df",
                      api_url = "restricted")
display(df)

**Note:** If you'd like to do further exploration of restricted datasets, you can modify any of the previously discussed queries in the notebook by adding the **api_url = "restricted"** parameter as shown above.

# 7 Downloading Data
Once you've mastered querying for data, the next logical step is to download it.  You can learn more about how to download, visualize and analyze TCIA data in the other notebooks at https://github.com/kirbyju/TCIA_Notebooks.

# Acknowledgements
TCIA is funded by the [Cancer Imaging Program (CIP)](https://imaging.cancer.gov/), a part of the United States [National Cancer Institute (NCI)](https://www.cancer.gov/).  It is managed by the [Frederick National Laboratory for Cancer Research (FNLCR)](https://frederick.cancer.gov/) and hosted by the [University of Arkansas for Medical Sciences (UAMS)](https://www.uams.edu/).

This notebook was created by [Justin Kirby](https://www.linkedin.com/in/justinkirby82/). If you leverage this notebook or any TCIA datasets in your work, please be sure to comply with the [TCIA Data Usage Policy](https://wiki.cancerimagingarchive.net/x/c4hF). In particular, make sure to cite the DOI(s) for the specific TCIA datasets you used in addition to the following paper!

## TCIA Citation

Clark, K., Vendt, B., Smith, K., Freymann, J., Kirby, J., Koppel, P., Moore, S., Phillips, S., Maffitt, D., Pringle, M., Tarbox, L., & Prior, F. (2013). The Cancer Imaging Archive (TCIA): Maintaining and Operating a Public Information Repository. Journal of Digital Imaging, 26(6), 1045–1057. https://doi.org/10.1007/s10278-013-9622-7