You can download and run this notebook locally, or you can run it for free in a cloud environment using Colab or SageMaker Studio Lab:

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/kirbyju/TCIA_Notebooks/blob/main/TCIA_REST_API_Downloads.ipynb)

[![Open In SageMaker Studio Lab](https://studiolab.sagemaker.aws/studiolab.svg)](https://studiolab.sagemaker.aws/import/github.com/kirbyju/TCIA_Notebooks/blob/main/TCIA_REST_API_Downloads.ipynb)

# Summary

Access to large, high-quality data is essential for researchers to understand disease and precision medicine pathways, especially in cancer. However, HIPAA constraints make sharing medical images outside an individual institution a complex process. [The Cancer Imaging Archive (TCIA)](https://www.cancerimagingarchive.net/) is a public service funded by the National Cancer Institute which addresses this challenge by providing hosting and de-identification services to take major burdens of data sharing off researchers.

**This notebook is focused on basic use cases for leveraging TCIA's REST APIs to query and download data.**  If you're interested in additional TCIA notebooks and coding examples check out https://github.com/kirbyju/TCIA_Notebooks.

# 1 Learn about available Collections on the TCIA website

[Browsing Collections](https://www.cancerimagingarchive.net/collections) and [Analysis Results](https://www.cancerimagingarchive.net/tcia-analysis-results/) datasets on TCIA are the easiest ways to become familiar with what is available.  These pages will help you quickly identify datasets of interest, find valuable supporting data that are not available via our APIs (e.g. clinical spreadsheets, non-DICOM segmentation data), and answer most common questions you might have about the datasets.  

# 2 REST API Overview
TCIA uses software called NBIA to manage DICOM data.  The NBIA REST APIs include:
1. [NBIA Search REST APIs](https://wiki.cancerimagingarchive.net/x/fILTB) that allow you to perform basic queries and download data from **public** collections. This API does not require a TCIA account.
2. [NBIA Search with Authentication REST APIs](https://wiki.cancerimagingarchive.net/x/X4ATBg) that allow you to perform basic queries and download data from **public and limited-access** collections. This API requires a TCIA account for creation of authentication tokens.
3. [NBIA Advanced REST APIs](https://wiki.cancerimagingarchive.net/x/YoATBg) that allow access to **public and limited-access** collections, but provide query endpoints mostly geared towards developers seeking to integrate searching and downloading TCIA data into web and desktop applications.  This API requires a TCIA account for creation of authentication tokens.

As of **tcia_utils** v2.3, you don't have to worry about this complexity.  Each function automatically calls the proper API, and if credentials aren't specified you are logged in via a guest account to view fully public data.  See section 4.5 of this notebook to learn how to download "limited access" datasets you have received permission to use.

# 3 Setup

The following cells install [**tcia_utils**](https://pypi.org/project/tcia-utils/) and import its **nbia** module, which contains a variety of useful functions for accessing TCIA via Jupyter/Python.

**tcia_utils** contains a **downloadSeries()** function that has multiple options for specifying the seriesUids you'd like to download.  By default, the function expects JSON data containing "SeriesInstanceUID" elements, which can be generated using any of the series-related queries such as **getSeries()** or **getSharedCart()**.  However, if you have a series UID list from some other source, you can set **input_type = "list"** to pass a Python list of one or more series UIDs instead of JSON. You can also set **input_type = "manifest"** to provide a path to a **.TCIA** manifest file.
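
If your series metadata is JSON-like but you'd rather work with a plain list of UIDs, a small helper can do the conversion. The following is only a sketch: the helper name `extract_series_uids` and the sample records (including their UIDs) are made up for illustration, and the only assumption taken from the text above is that each record carries a "SeriesInstanceUID" element.

```python
# Illustrative records shaped like the JSON described above; the UIDs are placeholders.
series_json = [
    {"SeriesInstanceUID": "1.2.3.4.1", "Modality": "MR"},
    {"SeriesInstanceUID": "1.2.3.4.2", "Modality": "CT"},
    {"Modality": "CT"},  # a record without a UID is skipped
]


def extract_series_uids(records):
    """Return the SeriesInstanceUID of each record that has one, preserving order."""
    return [r["SeriesInstanceUID"] for r in records if r.get("SeriesInstanceUID")]


uid_list = extract_series_uids(series_json)
print(uid_list)
```

A list like `uid_list` is the kind of input you would pass to **downloadSeries()** together with **input_type = "list"**.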

Data are saved to a **tciaDownload** folder in your current working directory by default, but you can use the **path** parameter to change this to a different directory.

There is an optional **format** parameter that can be used to return metadata about what was downloaded.  It can be set to **df** to return a dataframe or **csv** to save a spreadsheet. There's also a **csv_filename** parameter if you want to set a specific file name.

You can specify **number = n** to tell the function to only download the first **n** scans of your seriesUids.  Remove this parameter in the examples below to download the full dataset.

The **api_url** parameter can be omitted in most cases.  However, it must be set to **api_url = "nlst"** to access the [National Lung Screening Trial (NLST)](https://doi.org/10.7937/TCIA.HMQ8-J677) collection.  

Last but not least, there is some logic built in to detect whether you've already downloaded a series.  If a directory named after the seriesUid already exists the function will assume it's already been downloaded and skip that series.
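
If you'd like to know ahead of time which series would be skipped, you can apply the same directory check yourself. This is a rough sketch and not the library's internal code: the helper name `split_by_download_status` is hypothetical, the UIDs are placeholders, and a temporary directory stands in for your **tciaDownload** folder.

```python
import tempfile
from pathlib import Path


def split_by_download_status(series_uids, download_dir):
    """Split UIDs into (present, pending) by checking for a folder named after each UID."""
    root = Path(download_dir)
    present = [uid for uid in series_uids if (root / uid).is_dir()]
    pending = [uid for uid in series_uids if not (root / uid).is_dir()]
    return present, pending


# Demonstrate with a temporary folder that contains one "already downloaded" series.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "1.2.3.4.1").mkdir()
    present, pending = split_by_download_status(["1.2.3.4.1", "1.2.3.4.2"], tmp)

print("already present:", present)
print("still to download:", pending)
```

Keep in mind that a directory can exist even if an earlier download was interrupted, so you may want to delete a partially downloaded series folder before retrying.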

In [None]:
import sys

# install tcia utils
!{sys.executable} -m pip install --upgrade -q tcia_utils

In [None]:
import requests
import pandas as pd
from tcia_utils import nbia

# set logging level to INFO in Google Colab (not necessary in Jupyter)
if 'google.colab' in sys.modules:
  import logging

  for handler in logging.root.handlers[:]:
      logging.root.removeHandler(handler)

  # Set handler with level = info
  logging.basicConfig(format='%(asctime)s:%(levelname)s:%(message)s',
                      level=logging.INFO)

  print("Google Colab Logging set to INFO")

# 4 Download Examples

In this section we'll cover downloading data via the REST API for the following use cases:

1.   Download a full TCIA collection
2.   Download custom results of an API query
3.   Download a "[Shared Cart](https://wiki.cancerimagingarchive.net/display/NBIA/TCIA+Radiology+Portal+User+Guide#TCIARadiologyPortalUserGuide-SharingDatainYourCart)" that was created via https://nbia.cancerimagingarchive.net/
4.   Download data from a TCIA manifest file
5.   Download data from a **restricted** collection that requires logging in with your account
6.   Download individual images from a series

## 4.1 Download a full collection

You can [Browse Collections](https://www.cancerimagingarchive.net/collections) on our website to figure out what you might want to download, but you can also get a list of available collections via the API as shown below.

In [None]:
# get list of available collections as JSON
nbia.getCollections()


Let's say that we're interested in downloading the entire **Soft-tissue-Sarcoma** collection.  First we need to get a list of all Series Instance UIDs in that collection.  We can use **nbia.getSeries()** to return JSON metadata about all series (scans) in this collection.

In [None]:
data = nbia.getSeries(collection = "Soft-tissue-Sarcoma")
print(data)

We can then pass that to our download function and view/save the metadata for what was downloaded.  We'll leverage the **number** parameter here to just grab the first 2 scans as a test.  You can remove this parameter if you want to download the full collection.

In [None]:
nbia.downloadSeries(data, number = 2)

Take a second to go look at your **tciaDownload** folder to view the data.  Note that each series is saved in a directory named by its Series Instance UID.

You can learn more about various ways to visualize and analyze your data in the other notebooks at https://github.com/kirbyju/TCIA_Notebooks.  However, let's use **nbia.viewSeries()** to get a quick look at one of the series we've downloaded.  You can change the Series UID in the code below to view other scans you've downloaded.

In [None]:
seriesUid = "1.3.6.1.4.1.14519.5.2.1.5168.1900.104193299251798317056218297018"
nbia.viewSeries(seriesUid)

## 4.2 Download custom API query
The REST API allows for a variety of different query options as demonstrated in [this notebook](https://github.com/kirbyju/TCIA_Notebooks/blob/main/TCIA_REST_API_Queries.ipynb).  For this use case, let's assume that you are only interested in the MR scans from the [TCGA-BRCA](https://doi.org/10.7937/K9/TCIA.2016.AB2NAZRP) collection that were acquired on Siemens scanners.

In [None]:
# getSeries with query parameters
data = nbia.getSeries(collection = "TCGA-BRCA",
                      modality = "MR",
                      manufacturer = "SIEMENS")

print(len(data), 'Series returned')

Once again, let's pass those Series Instance UIDs to our download function.  This time we'll also set **format = "df"** to return a dataframe about what we downloaded.

In [None]:
# feed the series metadata to downloadSeries() and return a dataframe
df = nbia.downloadSeries(data, number = 2, format = "df")
display(df)

Another common use case may be that you'd like to review the results from **getSeries()** and do some further filtering.  This time we'll add **format = "df"** to save the output to a dataframe.  After removing unwanted scans from the dataframe, we'll pass the remaining Series Instance UIDs to **downloadSeries()**.  For example, let's say that you only wanted to download **T2 MR** series instead of any MR.

In [None]:
# getSeries with query parameters
df = nbia.getSeries(collection = "TCGA-BRCA",
               modality = "MR",
               manufacturer = "SIEMENS", format = "df")

Let's drop scans that don't contain **t2** in either the **ProtocolName** or **SeriesDescription** fields.  Note that we're down to 21 scans now instead of 353.

In [None]:
# lowercase both columns so the 't2' filter is case-insensitive; na=False treats missing values as non-matches
filtered_df = df[(df['ProtocolName'].str.lower().str.contains('t2', na=False)) |
                 (df['SeriesDescription'].str.lower().str.contains('t2', na=False))]

display(filtered_df)
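
If you'd rather not use a dataframe, the same kind of filter can be applied directly to JSON-style records. The sketch below assumes each record is a dictionary that may contain "ProtocolName" and "SeriesDescription" keys; the helper name `mentions_t2` and the sample records are invented for illustration.

```python
def mentions_t2(record, fields=("ProtocolName", "SeriesDescription")):
    """True if any of the given fields contains 't2', ignoring case and missing values."""
    return any("t2" in str(record.get(f) or "").lower() for f in fields)


# Placeholder records; real metadata would come from a series query.
series_json = [
    {"SeriesInstanceUID": "1.2.3.4.1", "ProtocolName": "AX T2 FSE", "SeriesDescription": None},
    {"SeriesInstanceUID": "1.2.3.4.2", "ProtocolName": "T1 post", "SeriesDescription": "dynamic"},
    {"SeriesInstanceUID": "1.2.3.4.3", "SeriesDescription": "t2_tse_sag"},
]

t2_series = [r for r in series_json if mentions_t2(r)]
t2_uids = [r["SeriesInstanceUID"] for r in t2_series]
print(t2_uids)
```

Because the filtered records still contain "SeriesInstanceUID" elements, they keep the JSON shape that **downloadSeries()** expects by default.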

Now we can feed the Series Instance UIDs from this dataframe back to **downloadSeries()**.  Note that since we're not working with the API's default JSON output anymore we need to specify **input_type = "df"** when we call **downloadSeries()**.

In [None]:
# download the selected series_uids
nbia.downloadSeries(filtered_df, input_type = "df", number = 2)

## 4.3 Download custom NLST API query
Let's show a similar example where we look for a specific modality and manufacturer within the [National Lung Screening Trial (NLST) Collection](https://doi.org/10.7937/TCIA.HMQ8-J677).  We have to set **api_url = "nlst"** in our functions for this to work, but otherwise the steps are the same.

In [None]:
# getSeries with query parameters
data = nbia.getSeries(modality = "CT",
                      manufacturer = "Philips",
                      api_url = "nlst")

print(len(data), 'Series returned')

In [None]:
# feed the series metadata to downloadSeries() and return a dataframe
df = nbia.downloadSeries(data, number = 2, api_url = "nlst", format = "df")
display(df)

## 4.4 Download a "shared cart"
It's possible to use https://nbia.cancerimagingarchive.net to create a "[Shared Cart](https://wiki.cancerimagingarchive.net/display/NBIA/TCIA+Radiology+Portal+User+Guide#TCIARadiologyPortalUserGuide-SharingDatainYourCart)" which includes a specific set of scans you'd like to share with others.

After creating a Shared Cart you will receive a URL that looks like https://nbia.cancerimagingarchive.net/nbia-search/?saved-cart=nbia-49121659384603347 which can be shared with others.  Try clicking the link to see what this looks like on the TCIA website.  Then use the code below to see how you can use the cart name at the end of the URL to download the related scans via the API.
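
If someone sends you the full link rather than the cart name, you can pull the name out of the URL's `saved-cart` query parameter. This is a minimal sketch using only the standard library; the helper name `cart_name_from_url` is hypothetical.

```python
from urllib.parse import parse_qs, urlparse


def cart_name_from_url(url):
    """Return the value of the 'saved-cart' query parameter, or None if it is absent."""
    values = parse_qs(urlparse(url).query).get("saved-cart")
    return values[0] if values else None


cart_url = "https://nbia.cancerimagingarchive.net/nbia-search/?saved-cart=nbia-49121659384603347"
cart_name = cart_name_from_url(cart_url)
print(cart_name)
```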

In [None]:
# getSharedCart metadata
data = nbia.getSharedCart(name = "nbia-49121659384603347")
print(len(data), 'Series returned')

We'll skip the use of the **number** parameter this time since the full cart is only 4 series.  Let's also try **format = "csv"** to save a spreadsheet of the metadata in addition to returning a dataframe.

In [None]:
# feed the cart metadata to downloadSeries() and save the results to a CSV file
df = nbia.downloadSeries(data, format = "csv")
display(df)

## 4.4 Download data from a TCIA manifest file

When working with manifest files you can install the NBIA Data Retriever to open the manifest and download the data as shown in [this notebook](https://github.com/kirbyju/TCIA_Notebooks/blob/main/TCIA_Linux_Data_Retriever_App.ipynb).  However, there may be cases where you don't have administrative rights to install software or prefer using the REST API to download the series listed in a manifest.  

In order to demonstrate this use case, let's assume that after you [Browse Collections](https://www.cancerimagingarchive.net/collections) you are interested in the [RIDER Breast MRI](https://doi.org/10.7937/K9/TCIA.2015.H1SXNUXL) collection.  We can find the URL of the manifest to download the full collection by looking at the blue "Download" button on that page.  Then we can download the manifest with the following commands.  

In [None]:
# download manifest file from RIDER Breast MRI page
manifest = requests.get("https://www.cancerimagingarchive.net/wp-content/uploads/doiJNLP-Fo0H1NtD.tcia")
manifest.raise_for_status()  # stop here if the download failed
with open('RIDER_Breast_MRI.tcia', 'wb') as f:
    f.write(manifest.content)

If you open this manifest file in a text editor you'll notice that it contains several lines of download parameters that precede a list of Series Instance UIDs to download.  If we set **input_type = "manifest"** we can provide the path/filename to **downloadSeries()** and it will extract the UIDs from the file and download them.
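
To make the manifest layout more concrete, here is a rough sketch of how the UIDs could be separated from the parameter lines. It is not the parser used by **tcia_utils**: the helper name `uids_from_manifest_text` is hypothetical, the sample manifest content is a made-up stand-in (its parameter names and UIDs are placeholders), and the approach simply assumes that UID lines consist of digits separated by dots.

```python
import re

# Series Instance UIDs are dot-separated runs of digits.
UID_PATTERN = re.compile(r"^\d+(\.\d+)+$")


def uids_from_manifest_text(text):
    """Return every line that looks like a UID, ignoring the parameter lines."""
    stripped = (line.strip() for line in text.splitlines())
    return [line for line in stripped if UID_PATTERN.match(line)]


# Hypothetical manifest content for illustration only.
sample_manifest = """\
exampleParameter=value
anotherParameter=4
listOfSeries=
1.2.3.4.1
1.2.3.4.2
"""

manifest_uids = uids_from_manifest_text(sample_manifest)
print(manifest_uids)
```

For a real file you could read its contents with `open(path).read()` and pass the text to the helper, which is handy if you want to inspect or count the UIDs before downloading anything.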

In [None]:
df = nbia.downloadSeries("RIDER_Breast_MRI.tcia", input_type = "manifest", number = 2, format = "df")
display(df)

## 4.5 Download data from a restricted collection
In some cases, you must specifically request access to collections before you can download them.  These are listed as **limited access** on the [Browse Collections](https://www.cancerimagingarchive.net/collections/) page. The steps to request access may vary depending on the collection, but will always require that you first [create a TCIA user account](https://wiki.cancerimagingarchive.net/x/xgHDAg). Once you've created an account, you can use your login/password to create an API token with the **getToken()** function from **tcia_utils** to verify your permissions. Tokens are valid for 2 hours and must be refreshed after that point, but **tcia_utils** monitors the timeout for you and automatically refreshes tokens if needed.
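
You don't need to track token expiry yourself, but the idea behind the 2-hour timeout is easy to sketch. The code below is purely illustrative and does not reflect how **tcia_utils** implements its refresh logic: the helper name `token_needs_refresh` and the 5-minute safety margin are assumptions made for this example.

```python
from datetime import datetime, timedelta

TOKEN_LIFETIME = timedelta(hours=2)  # token lifetime stated above


def token_needs_refresh(issued_at, now, margin=timedelta(minutes=5)):
    """True once the token is within `margin` of its expiry, or already past it."""
    return now >= issued_at + TOKEN_LIFETIME - margin


issued = datetime(2024, 1, 1, 9, 0)
one_hour_in = token_needs_refresh(issued, datetime(2024, 1, 1, 10, 0))
near_expiry = token_needs_refresh(issued, datetime(2024, 1, 1, 10, 56))
print(one_hour_in, near_expiry)
```

Refreshing slightly before the deadline, rather than exactly at it, avoids a request failing because the token expired while the request was in flight.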

**Note:** Historically the **api_url** parameter needed to be specified if you wanted to access 'restricted' datasets in most functions.  As of **tcia_utils version 2.3** this is no longer needed.  Simply create a token with your credentials and you're good to go!

In [None]:
nbia.getToken()

Let's say that we're interested in the [QIN-Breast-02](https://doi.org/10.7937/TCIA.2019.4cfm06rr) collection. As you can see on the collection page, you must email help@cancerimagingarchive.net to request access to the data. Once you've received approval, you can use **nbia.getSeries()** to get a full list of series UIDs in this restricted collection.

In [None]:
# getSeries with query parameters
data = nbia.getSeries(collection = "QIN-Breast-02")

if data is not None:
    print(len(data), 'Series returned')
else:
    print('No data returned.')
    print('Did you forget to log in with your TCIA account in the previous cell?')
    print('Did you forget to obtain permission from the helpdesk to access this collection?')

Now we can download those scans just like before.

In [None]:
# feed the series metadata to downloadSeries() and return a dataframe
df = nbia.downloadSeries(data, number = 2, format = "df")
display(df)

## 4.6 Download individual images from a series

It is also possible to download specific images if you don't want the entire series. Let's look at an example using the [**CMB-GEC**](https://doi.org/10.7937/E7KH-R486) collection.  First we'll get a list of the metadata about the series.

In [None]:
nbia.getSeries(collection = "CMB-GEC")

Once we identify a series of interest we can pass its Series Instance UID to **nbia.getSopInstanceUids()** to obtain a list of SOP Instance UIDs for the individual images that are part of this series.

In [None]:
seriesUID = "1.3.6.1.4.1.14519.5.2.1.1600.1204.919741553251398079475267746505"

nbia.getSopInstanceUids(seriesUID)

After we have both a series UID and SOP Instance UID, we can call the **nbia.downloadImage()** function to download a specific image from the series.

In [None]:
sopUID = "1.3.6.1.4.1.14519.5.2.1.1600.1204.211684247543622814130853101548"

nbia.downloadImage(seriesUID, sopUID)

# Acknowledgements
TCIA is funded by the [Cancer Imaging Program (CIP)](https://imaging.cancer.gov/), a part of the United States [National Cancer Institute (NCI)](https://www.cancer.gov/).  It is managed by the [Frederick National Laboratory for Cancer Research (FNLCR)](https://frederick.cancer.gov/) and hosted by the [University of Arkansas for Medical Sciences (UAMS)](https://www.uams.edu/).

This notebook was created by [Justin Kirby](https://www.linkedin.com/in/justinkirby82/). If you leverage this notebook or any TCIA datasets in your work, please be sure to comply with the [TCIA Data Usage Policy](https://wiki.cancerimagingarchive.net/x/c4hF). In particular, make sure to cite the DOI(s) for the specific TCIA datasets you used in addition to the following paper!

## TCIA Citation

Clark, K., Vendt, B., Smith, K., Freymann, J., Kirby, J., Koppel, P., Moore, S., Phillips, S., Maffitt, D., Pringle, M., Tarbox, L., & Prior, F. (2013). The Cancer Imaging Archive (TCIA): Maintaining and Operating a Public Information Repository. Journal of Digital Imaging, 26(6), 1045–1057. https://doi.org/10.1007/s10278-013-9622-7