You can download and run this notebook locally, or you can run it for free in a cloud environment using Colab or Sagemaker Studio Lab:

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/kirbyju/TCIA_Notebooks/blob/main/TCIA_REST_API_Queries.ipynb)

[![Open In SageMaker Studio Lab](https://studiolab.sagemaker.aws/studiolab.svg)](https://studiolab.sagemaker.aws/import/github.com/kirbyju/TCIA_Notebooks/blob/main/TCIA_REST_API_Queries.ipynb)

# Summary

Access to large, high-quality datasets is essential for researchers to understand disease and precision medicine pathways, especially in cancer. However, HIPAA constraints make sharing medical images outside an individual institution complex. [The Cancer Imaging Archive (TCIA)](https://www.cancerimagingarchive.net/) is a public service funded by the National Cancer Institute that addresses this challenge by providing hosting and de-identification services that take major burdens of data sharing off researchers.

**This notebook is focused on basic use cases for leveraging the REST APIs to execute queries to learn about TCIA datasets.**  If you're interested in additional TCIA notebooks and coding examples, check out the tutorials at https://github.com/kirbyju/TCIA_Notebooks.

# 1 Learn about Available Collections on the TCIA Website

[Browsing Collections](https://www.cancerimagingarchive.net/collections) and viewing [Analysis Results](https://www.cancerimagingarchive.net/tcia-analysis-results/) of TCIA datasets are the easiest ways to become familiar with what is available. These pages will help you quickly identify datasets of interest, find valuable supporting data (e.g. clinical spreadsheets and non-DICOM segmentation data), and answer the most common questions you might have about the datasets.  Please note that there is a [separate API](https://www.cancerimagingarchive.net/collection-manager-rest-api/) for working with these types of metadata, but that is not the focus of this notebook.

# 2 NBIA REST API Overview
TCIA uses software called NBIA to manage DICOM data. The NBIA REST API is provided for the search and download functions that are described in more detail on our [NBIA Swagger](https://cbiit.github.io/NBIA-TCIA/#/) page.

# 3 Setup

The following cells install and import [**tcia_utils**](https://github.com/kirbyju/tcia_utils), which contains a variety of useful functions for accessing TCIA via Jupyter/Python.

In v3 of **tcia_utils** there was a major overhaul of nbia.py to bring everything in line with the latest NBIA API upgrades, and to account for the fact that TCIA no longer directly hosts controlled-access datasets due to recent changes in NIH policies.  You can learn more at https://www.cancerimagingarchive.net/new-nih-policies-for-controlled-access-data/.  Other notebooks will eventually be developed to describe how to obtain the controlled-access data we've published from the other NCI repositories we've partnered with.

In [None]:
import sys

# install tcia utils
!{sys.executable} -m pip install --upgrade -q tcia_utils
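
Because the examples in this notebook rely on the v3 behavior of **tcia_utils**, it can be worth confirming which version pip actually installed. The following is a minimal sketch using only the standard library. The helper name `installed_major_version` is hypothetical (it is not part of tcia_utils), and the sketch assumes the package's version string starts with an integer major version.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_major_version(package):
    """Return the major version of an installed package, or None.

    Hypothetical helper, not part of tcia_utils. Returns None when the
    package is not installed or its version does not start with an integer.
    """
    try:
        major = version(package).split(".")[0]
    except PackageNotFoundError:
        return None
    return int(major) if major.isdigit() else None


major = installed_major_version("tcia_utils")
if major is None:
    print("tcia_utils is not installed (or has an unexpected version string)")
elif major < 3:
    print(f"tcia_utils v{major} found; this notebook assumes v3 or later")
else:
    print(f"tcia_utils v{major} found")
```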

In [None]:
import sys

import requests
import pandas as pd
from tcia_utils import nbia

# set logging level to INFO in Google Colab (not necessary in Jupyter)
if 'google.colab' in sys.modules:
    import logging

    # remove any handlers already attached to the root logger
    for handler in logging.root.handlers[:]:
        logging.root.removeHandler(handler)

    # Set handler with level = info
    logging.basicConfig(format='%(asctime)s:%(levelname)s:%(message)s',
                        level=logging.INFO)

    print("Google Colab Logging set to INFO")

# 4 Basic Query Examples
By default, most functions from tcia_utils return results in JSON.  However, you can use **format = "df"** to return the results as a dataframe or **format = "csv"** to save a CSV file in addition to returning a dataframe.  

Most query functions allow you to specify **api_url** as a query parameter.  The only reason to use this is if you'd like to access the [National Lung Screening Trial (NLST)](https://doi.org/10.7937/TCIA.HMQ8-J677) collection, which lives on a separate server due to its size (>26,000 patients!).  We'll provide examples to show how this works later in the notebook.

We'll start out with some simple stuff for learning purposes, but **don't skip section 5 getSimpleSearch()** as this is by far the most powerful and useful way to find data once you understand the basics of how everything is organized.

## 4.1 getCollections()
The **getCollections()** function returns a list of collections.  

In [None]:
nbia.getCollections()
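
The default JSON output is convenient to post-process with plain Python. The sketch below assumes the result is a list of dictionaries that each carry a `Collection` key; that shape is an assumption, so compare it with what the cell above returns. The `sample` list is illustration data standing in for real query output, and `collection_names` is a hypothetical helper.

```python
def collection_names(records, contains=""):
    """Return sorted, de-duplicated collection names.

    `records` is assumed to be a list of dicts with a "Collection" key.
    `contains` optionally keeps only names that include that substring
    (case-insensitive).
    """
    names = {record["Collection"] for record in records if "Collection" in record}
    return sorted(name for name in names if contains.upper() in name.upper())


# Illustration data only; replace with the output of nbia.getCollections().
sample = [
    {"Collection": "TCGA-LUAD"},
    {"Collection": "CPTAC-LUAD"},
    {"Collection": "LIDC-IDRI"},
]

print(collection_names(sample, contains="luad"))
```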


## 4.2 getBodyPart()
The **getBodyPart()** function returns a list of available body parts that were examined. Query parameters include **collection** and **modality**.

Let's look at the **TCGA-LUAD** collection from the list above and find out more about what body parts were examined.

In [None]:
nbia.getBodyPart(collection = "TCGA-LUAD")

## 4.3 getModality()
The **getModality()** function returns a list of available modalities. Query parameters include **collection** and **bodyPart**.

In [None]:
nbia.getModality(collection = "TCGA-LUAD")

## 4.4 getPatient()
The **getPatient()** function returns available patient information (e.g. species, gender, and ethnicity). You can also learn whether the subject is a [phantom](https://www.nist.gov/physics/what-are-imaging-phantoms) or not.  

It's **important to note** that this information is often missing or filled with nonsense values by the people who acquired the images.  If you can find an accompanying spreadsheet or other source of patient information, you should almost always assume it is more accurate than the information from the DICOM headers, although sometimes the headers are the only thing available.

Let's try looking at the **CPTAC-LUAD** collection this time.  We'll also set the output format to a dataframe using **format = "df"** to make it easier to view in a notebook.  The only query parameter for this function is **collection**.

In [None]:
df = nbia.getPatient(collection = "CPTAC-LUAD", format = "df")
display(df)

Here's an example that does the same thing with the [National Lung Screening Trial (NLST) Collection](https://doi.org/10.7937/TCIA.HMQ8-J677).  In this case we have to set **api_url = "nlst"** to talk to the NLST server, but everything else works the same.

In [None]:
df = nbia.getPatient(collection = "NLST", format = "df", api_url = "nlst")
display(df)
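
Since patient fields taken from DICOM headers are often missing or filled with placeholder values, it can help to check how complete each field is before relying on it. The following is a rough sketch that works on the default JSON output (a list of dictionaries). The field names in `records` and the strings treated as "missing" are assumptions for illustration, not a description of what the API returns.

```python
from collections import Counter

# Strings treated as "missing"; extend this set to match what you see in practice.
MISSING_STRINGS = {"", "N/A", "UNKNOWN"}


def is_missing(value):
    """True for None or for a string that looks like a placeholder."""
    if value is None:
        return True
    return isinstance(value, str) and value.strip().upper() in MISSING_STRINGS


def missing_counts(records):
    """Count, per field, how many records have a missing-looking value.

    A field that is absent from a record counts as missing for that record.
    Fields with no missing values are left out of the result.
    """
    fields = {key for record in records for key in record}
    counts = Counter()
    for record in records:
        for field in fields:
            if is_missing(record.get(field)):
                counts[field] += 1
    return dict(counts)


# Made-up records with hypothetical field names.
records = [
    {"PatientId": "P-001", "PatientSex": "F", "EthnicGroup": ""},
    {"PatientId": "P-002", "PatientSex": None},
    {"PatientId": "P-003", "PatientSex": "unknown", "EthnicGroup": "N/A"},
]

print(missing_counts(records))
```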

## 4.5 getPatientByCollectionAndModality()

This function requires specifying collection and modality.  It returns a list of patient IDs that match your query.

In [None]:
patients = nbia.getPatientByCollectionAndModality(collection = "CPTAC-LUAD", modality = "CT")
print(patients)

## 4.6 getNewPatientsInCollection()
Gets patient metadata for all subjects in a given collection that were published after a specified release date.  Requires specifying a collection and release date.  The date format is YYYY/MM/DD.

In [None]:
df = nbia.getNewPatientsInCollection(collection = "CPTAC-LUAD", date = "2019/04/15", format = "df")
display(df)
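
Several functions in this notebook expect dates as YYYY/MM/DD strings. If you already work with `datetime.date` objects, a small formatter avoids hand-typing them. This is a minimal sketch; `nbia_date` and `days_ago` are hypothetical helper names, not part of tcia_utils.

```python
from datetime import date, timedelta


def nbia_date(day):
    """Format a datetime.date as a YYYY/MM/DD string."""
    return day.strftime("%Y/%m/%d")


def days_ago(n, today=None):
    """Return the YYYY/MM/DD string for n days before today (or a given date)."""
    today = today or date.today()
    return nbia_date(today - timedelta(days=n))


print(nbia_date(date(2019, 4, 15)))           # 2019/04/15
print(days_ago(30, today=date(2019, 5, 15)))  # 2019/04/15
```

The resulting string can then be passed as the `date` argument, for example `date = days_ago(90)` to look for anything published in roughly the last three months.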

## 4.7 getStudy()

The **getStudy()** function returns study/visit details such as the anonymized study date, subject's age at the time of visit, and number of scans acquired at each time point. Query parameters include **collection (required)**, **patientId**, and **studyUid**.

Functions that return Study Instance UIDs or Series Instance UIDs also support **format = "html"**.  This is a way to quickly preview a sample of studies or series in your query results by leveraging the work of [NCI's Imaging Data Commons](https://learn.canceridc.dev/getting-started-with-idc).

**Note:** IDC contains a copy of all publicly available DICOM data from TCIA, but their release schedule is slightly delayed from ours so there may be occasional times when our most recently published DICOM data are not available in their viewers yet.

In [None]:
df = nbia.getStudy(collection = "CPTAC-LUAD", format = "html")
display(df)

## 4.8 getNewStudiesInPatient()
Gets metadata for all studies from a given patient that were published after a specified release date. Requires specifying a collection, patient ID and release date. The date format is YYYY/MM/DD.

In [None]:
df = nbia.getNewStudiesInPatient(collection = "CPTAC-LUAD", patientId = "C3N-02973", date = "2019/04/15", format = "df")
display(df)

## 4.9 getSeries()

The **getSeries()** function returns metadata about each scan in the dataset (e.g. series description, modality, scanner manufacturer and software version, number of images). Query parameters include **collection**, **patientId**, **studyUid**, **seriesUid**, **modality**, **bodyPart**, **manufacturer**, and **manufacturerModel**.  This time let's set the format to **CSV**.  Note that the file is saved as **getSeries_YYYY-MM-DD_HH-MM.csv**.

In [None]:
df = nbia.getSeries(collection = "CPTAC-LUAD", format = "csv")
display(df)
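
Because each CSV export gets a timestamped name, repeated runs leave several files behind. The sketch below picks the most recent one, assuming the files are saved in the current working directory and follow the **getSeries_YYYY-MM-DD_HH-MM.csv** pattern noted above. `latest_export` is a hypothetical helper.

```python
import glob
import os


def latest_export(prefix="getSeries", directory="."):
    """Return the path of the newest '<prefix>_YYYY-MM-DD_HH-MM.csv' file.

    The timestamp in the name is zero-padded, so sorting the names
    alphabetically also sorts them chronologically.
    Returns None if no file matches.
    """
    pattern = os.path.join(directory, f"{prefix}_*.csv")
    matches = sorted(glob.glob(pattern))
    return matches[-1] if matches else None


path = latest_export()
print(path if path else "No getSeries CSV export found in this directory")
```

If a file is found, it can be loaded back into a dataframe with `pd.read_csv(path)`.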

## 4.10 getUpdatedSeries()
Gets metadata for all series that were published after a specified release date. The release date is the only parameter for this function. The date format is YYYY/MM/DD.

In [None]:
df = nbia.getUpdatedSeries(date = "2025/04/15", format = "df")
display(df)

## 4.11 getSopInstanceUids()
This returns the unique identifier (SOP Instance UID) for each image contained in a series/scan.  It requires that you specify the Series Instance UID of the scan you're interested in.

In [None]:
sopUids = nbia.getSopInstanceUids(seriesUid = "1.3.6.1.4.1.14519.5.2.1.7695.2311.498163603405178114978583022189")
print(sopUids)

## 4.12 getDicomTags()
This returns DICOM metadata for a given series UID.  The output defaults to JSON but `format` can be set to **df** or **csv**.

In [None]:
tags = nbia.getDicomTags(seriesUid = "1.3.6.1.4.1.14519.5.2.1.7695.2311.498163603405178114978583022189", format = "df")
display(tags)

# 5 getSimpleSearch()
This function lets you do pretty much everything you can do in the **Simple Search** screen at https://nbia.cancerimagingarchive.net. As a result, there are a lot of parameters.

```
collections: list[str]   -- The DICOM collections of interest to you
species: list[str]       -- Filter collections by species. Possible values are 'human', 'mouse', and 'dog'
modalities: list[str]    -- Filter collections by modality
modalityAnded: bool      -- If true, only return subjects with all requested modalities, as opposed to any
minStudies: int          -- The minimum number of studies a collection must have to be included in the results
manufacturers: list[str] -- Imaging device manufacturers, e.g. SIEMENS
bodyParts: list[str]     -- Body parts of interest, e.g. CHEST, ABDOMEN
fromDate: str            -- First cutoff date, in YYYY/MM/DD format. Defaults to 1900/01/01
toDate: str              -- Second cutoff date, in YYYY/MM/DD format. Defaults to today's date
patients: list[str]      -- Patients to include in the output
start: int               -- Starting point of returned subject results. Defaults to 0.
size: int                -- Number of returned subjects per page. Defaults to 10.
sortDirection            -- 'ascending' or 'descending'. Defaults to 'ascending'.
sortField                -- 'subject', 'studies', 'series', or 'collection'. Defaults to 'subject'.
format: str              -- Defaults to JSON. Can be set to "uids" to return a python list of
                            Series Instance UIDs or "manifest" to save a TCIA manifest file (up to 1,000,000 series).
```
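
With this many parameters it is easy to pass a string where a list is expected or to mistype a date. The following is a hypothetical pre-flight check built only from the parameter list above. It is not part of tcia_utils and does not contact the API; it just reports obvious mismatches before you run a query.

```python
import re
from datetime import datetime

SORT_DIRECTIONS = ("ascending", "descending")
SORT_FIELDS = ("subject", "studies", "series", "collection")
LIST_PARAMS = {"collections", "species", "modalities",
               "manufacturers", "bodyParts", "patients"}


def is_nbia_date(value):
    """True if value is a real calendar date written as YYYY/MM/DD."""
    if not isinstance(value, str) or not re.fullmatch(r"\d{4}/\d{2}/\d{2}", value):
        return False
    try:
        datetime.strptime(value, "%Y/%m/%d")
    except ValueError:
        return False
    return True


def check_search_args(**kwargs):
    """Return a list of problems found; an empty list means none were found."""
    problems = []
    for name in sorted(LIST_PARAMS.intersection(kwargs)):
        if not isinstance(kwargs[name], list):
            problems.append(f"{name} should be a list, e.g. ['CT']")
    for name in ("fromDate", "toDate"):
        if name in kwargs and not is_nbia_date(kwargs[name]):
            problems.append(f"{name} should be a YYYY/MM/DD date")
    if kwargs.get("sortDirection", "ascending") not in SORT_DIRECTIONS:
        problems.append("sortDirection should be 'ascending' or 'descending'")
    if kwargs.get("sortField", "subject") not in SORT_FIELDS:
        problems.append("sortField should be 'subject', 'studies', 'series', or 'collection'")
    return problems


# A string instead of a list, and dashes instead of slashes in the date.
print(check_search_args(modalities="CT", fromDate="2019-04-15"))

# Well-formed arguments produce an empty list.
print(check_search_args(bodyParts=["CHEST"], modalities=["CT"]))
```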

Here's an example of the default JSON output.  Note that some of the identifiers used here are for internal tracking in NBIA and do not align with the DICOM identifiers (e.g. Series Instance UID, Study Instance UID).  More details about how this works can be found on our [NBIA Swagger](https://cbiit.github.io/NBIA-TCIA/#/Patient%20Metadata/post_getSimpleSearch) page.

In [None]:
nbia.getSimpleSearch(bodyParts=["CHEST"], modalities=["CT"])

## 5.1 IMPORTANT: getSimpleSearch(format="uids")
This is one of the most efficient ways to search our DICOM data.
When you set the `format` parameter to be **uids** it will return a python list of all Series Instance UIDs that match your query.  Once you have your UID list you can review metadata about your results with `getSeriesList()` in case you want to refine anything before you start your actual download.  

In [None]:
uids = nbia.getSimpleSearch(collections=["LIDC-IDRI"], modalities=["CT"], format = "uids")

metadata = nbia.getSeriesList(uids)
display(metadata)

You can also set `format` to **manifest** to save a TCIA manifest file, which lets you download those series with the NBIA Data Retriever.

In [None]:
nbia.getSimpleSearch(bodyParts=["CHEST"], modalities=["CT"], format = "manifest")
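
If you later want the Series Instance UIDs back out of a saved manifest, the file can be read with a few lines of Python. This sketch assumes the manifest is a plain-text file in which a line starting with `ListOfSeriesToDownload=` is followed by one Series Instance UID per line. That layout is an assumption, so check it against a manifest you have saved. `read_manifest_uids` is a hypothetical helper, and the manifest text in the demo is illustrative.

```python
import os
import tempfile


def read_manifest_uids(path):
    """Return the Series Instance UIDs listed in a TCIA manifest file.

    Assumes header lines, then a 'ListOfSeriesToDownload=' line,
    then one UID per line. Blank lines are skipped.
    """
    uids = []
    in_list = False
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line.startswith("ListOfSeriesToDownload="):
                in_list = True
                continue
            if in_list and line:
                uids.append(line)
    return uids


# Demo with an illustrative manifest written to a temporary file.
demo_text = (
    "manifestVersion=3.0\n"
    "ListOfSeriesToDownload=\n"
    "1.3.6.1.4.1.14519.5.2.1.7695.2311.498163603405178114978583022189\n"
    "1.2.3.4\n"
)
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.tcia")
    with open(demo_path, "w", encoding="utf-8") as handle:
        handle.write(demo_text)
    demo_uids = read_manifest_uids(demo_path)

print(len(demo_uids), "series listed")
```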

# 6 Shared Carts
It's possible to use https://nbia.cancerimagingarchive.net to create a "[Shared Cart](https://wiki.cancerimagingarchive.net/display/NBIA/TCIA+Radiology+Portal+User+Guide#TCIARadiologyPortalUserGuide-SharingDatainYourCart)" which includes a specific set of scans you'd like to share with others.

After creating a Shared Cart you will receive a URL that looks like https://nbia.cancerimagingarchive.net/nbia-search/?saved-cart=nbia-49121659384603347 which can be shared with others.  Try clicking the link to see what this looks like on the TCIA website.  Then use the code below to see how you can achieve this using **tcia_utils**.

In [None]:
df = nbia.getSharedCart(name = "nbia-49121659384603347", format = "df")
display(df)
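
The `name` passed to getSharedCart() is the value of the `saved-cart` query parameter in the Shared Cart URL. If someone sends you the full link, the name can be pulled out with the standard library, as in this small sketch (`cart_name_from_url` is a hypothetical helper):

```python
from urllib.parse import parse_qs, urlparse


def cart_name_from_url(url):
    """Extract the 'saved-cart' value from a Shared Cart URL, or None if absent."""
    values = parse_qs(urlparse(url).query).get("saved-cart")
    return values[0] if values else None


url = "https://nbia.cancerimagingarchive.net/nbia-search/?saved-cart=nbia-49121659384603347"
print(cart_name_from_url(url))  # nbia-49121659384603347
```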

# 7 Functions to analyze query results

Here we'll briefly discuss a couple of special functions in **tcia_utils** that can help further assist you in understanding your query results before you decide to download the data.

## 7.1 Collection and DOI summaries
These functions allow you to generate a summary report about Collections or DOIs from the series metadata created by the output of getSeries(), getSeriesList(), getSharedCart(), getUpdatedSeries(), a python list of Series UIDs, or from a TCIA manifest.  

**Note:** getSharedCart() and getUpdatedSeries() do not provide DOI information in their output so these two queries only work as input to reportCollectionSummary().
```
Parameters:
series_data: The input data to be summarized (expects JSON by default).
input_type: Set to 'df' for dataframe.
            Set to 'list' for python list, or 'manifest' for *.TCIA manifest file.
            If manifest is used, series_data should be the path to the TCIA manifest file.
format: Output format (default is dataframe, 'csv' for CSV file, 'chart' for charts).
api_url: Only necessary if input_type = list or manifest.
        Set to 'restricted' for limited-access collections or 'nlst'
        for National Lung Screening trial.
```
Let's say you want to create a report summarizing scans that are CT modality which have a Body Part Examined of CHEST.  First we'll run the query for that and then we'll pass the results to our reporting functions.

Note that Analysis Result datasets have their own DOIs, and the analysis data lives within the collection(s) that they analyzed.  Therefore, if you're trying to understand how those Analysis Results fit into collections, you should use the DOI report option.  If you're just trying to understand what's in each collection and don't care whether it was primary data or derived analyses contributed by others, then you can use the Collection report option.

In [None]:
series = nbia.getSeries(modality = "CT", bodyPart="CHEST", format = "df")

In [None]:
nbia.reportCollectionSummary(series)

Now let's look at the breakdown by DOI.  This separates out the derived "analysis result" datasets that are outlined at https://www.cancerimagingarchive.net/browse-analysis-results/.  Setting **format = "chart"** displays pie charts in addition to returning the dataframe, while **format = "csv"** saves the dataframe to a CSV file.

In [None]:
nbia.reportDoiSummary(series, format = "chart")

## 7.2 Visualize before downloading with idcOhifViewer()
If you'd like to use the previously mentioned OHIF Viewer functionality supported by IDC, but want to investigate more than just a small sample, then you can do that with the **idcOhifViewer()** function.

First, run any query that includes Study or Series Instance UIDs in the return fields.  Then, pass the result (either JSON or a dataframe) to idcOhifViewer() to create an HTML table with links in the SeriesInstanceUID and StudyInstanceUID columns.

This function provides added flexibility by allowing you to specify the max number of rows you'd like to view in your table, and provides opportunity to further refine the results of your query (e.g. with Pandas operations) before creating the HTML table.

Carrying forward our last example, let's say you thought the Collection called "C4KC-KiTS" seemed particularly interesting.  Let's limit the contents of our series dataframe to that dataset and then visualize some example scans.

In [None]:
df = series[series["Collection"] == "C4KC-KiTS"]
nbia.idcOhifViewer(df, max_rows=5)
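
As an example of refining results with Pandas before building the viewer table, the sketch below keeps only CT series from one collection and then one series per patient, so a short preview covers more subjects. It runs on a made-up dataframe: `Collection` and `SeriesInstanceUID` are column names used in this notebook, while `PatientID` and `Modality` are assumed names that you should check against your own dataframe.

```python
import pandas as pd

# Made-up rows standing in for getSeries(..., format = "df") output.
toy = pd.DataFrame({
    "Collection": ["C4KC-KiTS", "C4KC-KiTS", "C4KC-KiTS", "LIDC-IDRI"],
    "PatientID": ["patient-A", "patient-A", "patient-B", "patient-C"],
    "Modality": ["CT", "SEG", "CT", "CT"],
    "SeriesInstanceUID": ["1.1", "1.2", "1.3", "1.4"],
})

subset = (
    toy[(toy["Collection"] == "C4KC-KiTS") & (toy["Modality"] == "CT")]
    .drop_duplicates(subset="PatientID")  # keep the first series per patient
    .reset_index(drop=True)
)

print(subset)
```

A dataframe filtered this way can be passed to idcOhifViewer() in the same way as `df` in the cell above.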

## 7.3 makeSeriesReport()

This function ingests the JSON output from **getSeries()** or **getSharedCart()** and creates a summary report.  Let's try it using the Shared Cart that we looked at in section 6.

In [None]:
data = nbia.getSharedCart(name = "nbia-49121659384603347")

nbia.makeSeriesReport(data)

# 8 Querying "Limited Access" Collections (deprecated)
TCIA no longer directly hosts controlled access datasets due to recent changes in NIH policies.  Learn more at https://www.cancerimagingarchive.net/new-nih-policies-for-controlled-access-data/.

# 9 Downloading Data
Once you've mastered querying for data the next logical step would be to download it.  You can learn more about how to do this in https://github.com/kirbyju/TCIA_Notebooks/blob/main/TCIA_REST_API_Downloads.ipynb.

# Acknowledgements
TCIA is funded by the [Cancer Imaging Program (CIP)](https://imaging.cancer.gov/), a part of the United States [National Cancer Institute (NCI)](https://www.cancer.gov/).  It is managed by the [Frederick National Laboratory for Cancer Research (FNLCR)](https://frederick.cancer.gov/) and hosted by the [University of Arkansas for Medical Sciences (UAMS)](https://www.uams.edu/).

This notebook was created by [Justin Kirby](https://www.linkedin.com/in/justinkirby82/). If you leverage this notebook or any TCIA datasets in your work, please be sure to comply with the [TCIA Data Usage Policy](https://wiki.cancerimagingarchive.net/x/c4hF). In particular, make sure to cite the DOI(s) for the specific TCIA datasets you used in addition to the following paper!

## TCIA Citation

Clark, K., Vendt, B., Smith, K., Freymann, J., Kirby, J., Koppel, P., Moore, S., Phillips, S., Maffitt, D., Pringle, M., Tarbox, L., & Prior, F. (2013). The Cancer Imaging Archive (TCIA): Maintaining and Operating a Public Information Repository. Journal of Digital Imaging, 26(6), 1045–1057. https://doi.org/10.1007/s10278-013-9622-7