<a href="https://colab.research.google.com/github/kirbyju/TCIA-Citation-Parser/blob/master/TCIA_REST_API_Downloads_for_Public_Datasets.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Summary


Access to large, high-quality datasets is essential for researchers to understand disease and precision medicine pathways, especially in cancer. However, HIPAA constraints make sharing medical images outside an individual institution a complex process. [The Cancer Imaging Archive (TCIA)](https://www.cancerimagingarchive.net/) is a public service funded by the National Cancer Institute which addresses this challenge by providing hosting and de-identification services that take major burdens of data sharing off researchers.

**This notebook is focused on basic use cases for leveraging TCIA's REST APIs to download data from open-access Collections that don't require a user account.**  If you're interested in additional TCIA notebooks and coding examples, check out the tutorials at https://github.com/kirbyju/TCIA_Notebooks. You can also view a list of GitHub repositories that have tagged themselves as relevant to TCIA at https://github.com/topics/tcia-dac.

# 1 Learn about available Collections on the TCIA website

[Browsing Collections](https://www.cancerimagingarchive.net/collections) and [Analysis Results](https://www.cancerimagingarchive.net/tcia-analysis-results/) datasets on TCIA are the easiest ways to become familiar with what is available.  These pages will help you quickly identify datasets of interest, find valuable supporting data that are not available via our APIs (e.g. clinical spreadsheets, non-DICOM segmentation data), and answer most common questions you might have about the datasets.  

# 2 REST API Overview 
TCIA uses software called NBIA to manage DICOM data.  The NBIA REST APIs expose the search and download functions used in the TCIA radiology portal, and allow access to both public and limited-access collections.
1. The [NBIA Search REST APIs](https://wiki.cancerimagingarchive.net/x/fILTB) allow you to perform basic queries and download data from **public** collections. This API does not require a TCIA account.
2. The [NBIA Search with Authentication REST APIs](https://wiki.cancerimagingarchive.net/x/X4ATBg) allow you to perform basic queries and download data from **public and limited-access** collections. This API requires a TCIA account for creation of authentication tokens.
3. The [NBIA Advanced REST APIs](https://wiki.cancerimagingarchive.net/x/YoATBg) also allow access to **public and limited-access** collections, but provide query endpoints mostly geared towards developers seeking to integrate searching and downloading TCIA data into web and desktop applications.  This API requires a TCIA account for creation of authentication tokens.

This notebook will focus on the fully public [NBIA Search REST APIs](https://wiki.cancerimagingarchive.net/x/fILTB).  If you'd like to see examples using the APIs that require authentication, check out [this notebook](https://github.com/kirbyju/TCIA_Notebooks/blob/main/ACNS0332/ACNS0332.ipynb), which shows many similar examples with the additional steps necessary to create a secure token using your TCIA login credentials.

***Note:*** Many of the examples below allow for additional query parameters to refine your results.  These are covered in the documentation links above.

## 2.1 Setting the Base URLs
The URL for accessing the Search APIs changes slightly depending on whether or not you would like to access the [National Lung Screening Trial (NLST)](https://doi.org/10.7937/TCIA.HMQ8-J677) collection, which lives on its own server due to its size (26,000+ patients, ~13 TBytes).  Here are the base URLs:

* All other Collections - https://services.cancerimagingarchive.net/nbia-api/services/v1/
* NLST - https://services.cancerimagingarchive.net/nlst-api/services/v1/

Let's set those as variables and also import a few modules we'll need later.


In [None]:
# set API base URLs

base_url = "https://services.cancerimagingarchive.net/nbia-api/services/v1/"
nlst_url = "https://services.cancerimagingarchive.net/nlst-api/services/v1/"

# imports

import requests
import pandas as pd
import json
import zipfile
from io import BytesIO
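
Since the two servers differ only in their base URL, a small helper can pick the right one from a Collection name. This is a minimal sketch rather than part of the TCIA API: the function name `api_base_for` is hypothetical, and it assumes NLST is the only Collection hosted on the separate server, as described above.

```python
# Base URLs as listed in this notebook
BASE_URL = "https://services.cancerimagingarchive.net/nbia-api/services/v1/"
NLST_URL = "https://services.cancerimagingarchive.net/nlst-api/services/v1/"


def api_base_for(collection):
    """Return the API base URL to use for a Collection name (hypothetical helper)."""
    # Assumption: NLST is the only Collection served from the separate server
    if collection.strip().upper() == "NLST":
        return NLST_URL
    return BASE_URL


print(api_base_for("NLST"))
print(api_base_for("TCGA-BRCA"))
```

With a helper like this, the same query code could be reused for NLST and for every other Collection without switching variables by hand.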

# 3 Download Examples

In this section we'll cover downloading data via the REST API for the following use cases:

1.   Download a full Collection
2.   Download custom results of an API query
3.   Download a "shared cart" that was created via https://nbia.cancerimagingarchive.net/
4.   Download data from a TCIA manifest file

But before we address those, let's define a generic download function that we can re-use for each of these use cases.  This will take a list of series UIDs as the input, download each scan, and create a dataframe/CSV that contains the metadata about each of those scans.  It also accepts an optional parameter to specify a file name if you'd like a CSV export of the dataframe.

***Note: This function is set up to only download the first 3 scans of your results for demonstration purposes.  If you'd like to download the full set of results you'll need to comment out or delete the relevant lines below.***

In [None]:
# define a function that accepts a list of series records (each with a SeriesInstanceUID) and downloads them
# reminder: this only downloads the first 3 scans unless you comment out that section

def downloadSeries(api_url, series_data, csv_filename=""):
    manifestDF = pd.DataFrame()
    count = 0
    for x in series_data:
        seriesUID = x['SeriesInstanceUID']
        # use the Collection name from the series record if present, so the
        # function doesn't depend on a global "collection" variable
        collection_name = x.get('Collection', 'unknown-collection')
        data_url = api_url + "getImage?SeriesInstanceUID=" + seriesUID
        print("Downloading " + data_url)
        data = requests.get(data_url)
        # stop with an error if the server did not return the data
        data.raise_for_status()
        with zipfile.ZipFile(BytesIO(data.content)) as file:
            # print(file.namelist())
            file.extractall(path = "apiDownload/" + collection_name + "/" + seriesUID)
        # write the series metadata to a dataframe
        metadata_url = api_url + "getSeriesMetaData?SeriesInstanceUID=" + seriesUID
        metadata = requests.get(metadata_url).json()
        newRow = pd.DataFrame.from_dict(metadata)
        manifestDF = pd.concat([manifestDF, newRow], ignore_index = True)
        # Repeat n times for demo purposes - comment out these next 3 lines to download the full results
        count += 1
        if count == 3:
            break
    # save manifest to CSV file if a file name was given, then display it
    if csv_filename != "":
        manifestDF.to_csv(csv_filename + '.csv')
    display(manifestDF)
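
The download function above assumes every response body is a valid zip archive. As a hedged sketch of a more defensive variant, the hypothetical helper `extract_series_zip` below checks the bytes before extracting and returns an empty list when they are not a zip file (for example, if an error page came back instead of image data). The demonstration uses an in-memory archive with placeholder content rather than a real API response.

```python
import os
import tempfile
import zipfile
from io import BytesIO


def extract_series_zip(content, dest_dir):
    """Extract zip bytes into dest_dir and return the archive's file names.

    Returns an empty list when the bytes are not a valid zip archive.
    """
    buffer = BytesIO(content)
    if not zipfile.is_zipfile(buffer):
        return []
    buffer.seek(0)
    with zipfile.ZipFile(buffer) as archive:
        archive.extractall(path=dest_dir)
        return archive.namelist()


# Offline demonstration: an in-memory zip stands in for an API response
demo = BytesIO()
with zipfile.ZipFile(demo, "w") as archive:
    archive.writestr("1-1.dcm", b"placeholder bytes, not a real DICOM file")

with tempfile.TemporaryDirectory() as tmp:
    print(extract_series_zip(demo.getvalue(), os.path.join(tmp, "series")))
    print(extract_series_zip(b"not a zip", tmp))
```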

## 3.1 Download a full Collection

You can [browse Collections](https://www.cancerimagingarchive.net/collections) on our website to figure out what you might want to download, but you can also get a list of available collections via the API as shown below.

In [None]:
# get list of available collections as JSON

data_url = base_url + "getCollectionValues"
data = requests.get(data_url).json()
print(json.dumps(data, indent=2))
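
The full list of Collections is long, so it can help to filter it by keyword. The sketch below assumes the response is a list of objects that each carry a `"Collection"` key; the helper name `find_collections` is hypothetical, and the sample list is hand-written for illustration, not real API output.

```python
def find_collections(records, keyword):
    """Return sorted Collection names that contain keyword (case-insensitive)."""
    # Assumption: each record looks like {"Collection": "<name>"}
    names = [r["Collection"] for r in records if "Collection" in r]
    return sorted(n for n in names if keyword.lower() in n.lower())


# Hand-written sample in the assumed response shape
sample = [
    {"Collection": "TCGA-BRCA"},
    {"Collection": "Soft-tissue-Sarcoma"},
    {"Collection": "TCGA-LUAD"},
]
print(find_collections(sample, "tcga"))
```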


Let's say that we're interested in the Soft-tissue-Sarcoma collection.  First we need to get a list of all Series Instance UIDs in that collection.

In [None]:
collection = "Soft-tissue-Sarcoma"

data_url = base_url + "getSeries?Collection=" + collection
data = requests.get(data_url)
if data.text != "":
    series_data = data.json()
    print("Collection contains", len(series_data), "Series Instance UIDs (scans).")
else:
    print("Collection not found.")
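
Before downloading, it can be useful to see what kinds of scans the series list contains. The following sketch tallies series by a metadata field using only the standard library. It assumes each series record includes a `Modality` field; the helper name `count_by_field` and the sample records (including their UIDs) are made up for illustration.

```python
from collections import Counter


def count_by_field(series_data, field="Modality"):
    """Count series records by the value of one field (hypothetical helper)."""
    # Records missing the field are grouped under "unknown"
    return dict(Counter(s.get(field, "unknown") for s in series_data))


# Made-up records for illustration; real UIDs are much longer
sample_series = [
    {"SeriesInstanceUID": "1.2.3.1", "Modality": "MR"},
    {"SeriesInstanceUID": "1.2.3.2", "Modality": "CT"},
    {"SeriesInstanceUID": "1.2.3.3", "Modality": "MR"},
]
print(count_by_field(sample_series))
```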

Next, let's feed those Series Instance UIDs to our downloadSeries function we created earlier.

In [None]:
# feed series_data to our downloadSeries function
downloadSeries(base_url, series_data, collection + "_full_Collection")
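
If you remove the three-scan limit inside the download function, one way to keep control over how much gets downloaded is to trim the series list before passing it in. This is a minimal sketch; `limit_series` is a hypothetical helper, not part of the notebook's existing code.

```python
from itertools import islice


def limit_series(series_data, max_series=None):
    """Return at most max_series items; return everything when max_series is None."""
    if max_series is None:
        return list(series_data)
    return list(islice(series_data, max_series))


# Illustration with placeholder records
placeholder = [{"SeriesInstanceUID": str(i)} for i in range(10)]
print(len(limit_series(placeholder, 3)))
print(len(limit_series(placeholder)))
```

A call such as `downloadSeries(base_url, limit_series(series_data, 5))` would then process only the first five series.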

## 3.2 Download custom results of an API query
The REST API allows for a variety of different query options as demonstrated in [this notebook](https://github.com/kirbyju/TCIA_Notebooks/blob/main/TCIA_REST_API_Queries_for_Public_Datasets.ipynb).  For this use case, let's assume that you are only interested in the MR scans from the TCGA-BRCA Collection that were acquired on Siemens scanners.

In [None]:
# specify query parameters
collection = "TCGA-BRCA"
modality = "MR"
manufacturer = "SIEMENS"

# get Series UIDs
data_url = base_url + "getSeries?Collection=" + collection  + "&Modality=" + modality + "&Manufacturer=" + manufacturer
data = requests.get(data_url)
if data.text != "":
    series_data = data.json()
    print("Result contains", len(series_data), "Series Instance UIDs (scans).")
else:
    print("No results: Please check to make sure the Collection "
      + collection + " exists and it contains "
      + modality + " modality with "
      + manufacturer + " manufacturer.")

Once again, let's pass those Series Instance UIDs to our downloadSeries function.

In [None]:
# feed series_data to our downloadSeries function
downloadSeries(base_url, series_data, collection + "_" + modality + "_" + manufacturer)
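
The query URL above is built by string concatenation, which works as long as the values contain no spaces or special characters. A hedged alternative is to let `urllib.parse.urlencode` handle the escaping. The helper name `build_query_url` is hypothetical; the parameter names are the ones used in the query above.

```python
from urllib.parse import urlencode

BASE_URL = "https://services.cancerimagingarchive.net/nbia-api/services/v1/"


def build_query_url(api_url, endpoint, **params):
    """Build an API URL, skipping parameters that are None or empty."""
    query = urlencode({k: v for k, v in params.items() if v})
    return api_url + endpoint + ("?" + query if query else "")


print(build_query_url(BASE_URL, "getSeries",
                      Collection="TCGA-BRCA", Modality="MR", Manufacturer="SIEMENS"))
```

When using `requests`, passing a dictionary through its `params` argument achieves the same escaping without building the URL yourself.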

Let's show a similar example where we look for a specific modality and manufacturer within the [National Lung Screening Trial (NLST) Collection](https://doi.org/10.7937/TCIA.HMQ8-J677).  Remember that we have to use the NLST API URL we specified earlier for this to work, but otherwise the steps are the same.

In [None]:
# specify query parameters
collection = "NLST"
modality = "CT"
manufacturer = "Philips"

# get Series UIDs -- NOTE: this uses the "nlst_url" we defined earlier
data_url = nlst_url + "getSeries?Collection=" + collection  + "&Modality=" + modality + "&Manufacturer=" + manufacturer
data = requests.get(data_url)
if data.text != "":
    series_data = data.json()
    print("Result contains", len(series_data), "Series Instance UIDs (scans).")
else:
    print("No results: Please check to make sure the Collection "
      + collection + " exists and it contains "
      + modality + " modality with "
      + manufacturer + " manufacturer.")

In [None]:
# feed series_data to our downloadSeries function
downloadSeries(nlst_url, series_data, collection + "_" + modality + "_" + manufacturer)

## 3.3 Download a "shared cart"
It's possible to use https://nbia.cancerimagingarchive.net to create a "[Shared Cart](https://wiki.cancerimagingarchive.net/display/NBIA/TCIA+Radiology+Portal+User+Guide#TCIARadiologyPortalUserGuide-SharingDatainYourCart)" which includes a specific set of scans you'd like to share with others. After creating a Shared Cart you receive a URL like https://nbia.cancerimagingarchive.net/nbia-search/?saved-cart=nbia-49121659384603347 which can be shared with others.  Try clicking the link to see what this looks like on the TCIA website.  Then use the code below to see how you can use the cart name to download the (first 3) related scans via our API.

In [None]:
# Download a "Shared Cart" that has been previously created via the NBIA GUI (https://nbia.cancerimagingarchive.net)

cartName="nbia-49121659384603347"

cart_URL = base_url + "getContentsByName?name=" + cartName
series_data = requests.get(cart_URL).json()
print("Result contains", len(series_data), "Series Instance UIDs (scans).")
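
If you only have the shared URL, the cart name can be pulled out of its `saved-cart` query parameter rather than copied by hand. The sketch below uses the standard library; `cart_name_from_url` is a hypothetical helper name.

```python
from urllib.parse import parse_qs, urlparse


def cart_name_from_url(shared_cart_url):
    """Return the value of the saved-cart parameter, or None if it is absent."""
    query = parse_qs(urlparse(shared_cart_url).query)
    values = query.get("saved-cart", [])
    return values[0] if values else None


url = "https://nbia.cancerimagingarchive.net/nbia-search/?saved-cart=nbia-49121659384603347"
print(cart_name_from_url(url))
```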

In [None]:
# feed series_data to our downloadSeries function
downloadSeries(base_url, series_data, cartName + "_manifest")

## 3.4 Download data from a TCIA manifest file

When working with manifest files in a notebook you can install the NBIA Data Retriever to open the manifest and download the data as shown in [this notebook](https://github.com/kirbyju/TCIA_Notebooks/blob/main/TCIA_Linux_Data_Retriever_App.ipynb).  However, there may be cases when you don't have administrative rights to install software or prefer using the REST API to download the data listed in a manifest.

In order to demonstrate this use case, let's assume that after [Browsing the Collections](https://www.cancerimagingarchive.net/collections) you decided you were interested in the [RIDER Breast MRI](https://doi.org/10.7937/K9/TCIA.2015.H1SXNUXL) Collection.  If you're working from your local machine you can simply click the blue "Download" button on the [RIDER Breast MRI](https://doi.org/10.7937/K9/TCIA.2015.H1SXNUXL) page to save the manifest file to your computer.  If you're working on Google Colab or some other remote server the easiest thing to do is use wget to save it to your VM as shown below.

In [None]:
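# use wget to download the manifest into the current working directory (/content on Colab)
# the URL is quoted so the shell doesn't treat "&" as a command separator

!wget -O manifest.tcia "https://wiki.cancerimagingarchive.net/download/attachments/22512757/doiJNLP-Fo0H1NtD.tcia?version=1&modificationDate=1534787017928&api=v2"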


TCIA manifest files contain several lines of information that precede a list of Series Instance UIDs to download.  The step below will remove the header.

In [None]:
with open("manifest.tcia") as f:
    first_line = f.readline()

if "downloadServerUrl" in first_line:
    !sed -i -e 1,6d manifest.tcia
    print("Header text removed.")
else:
    print("This is not a TCIA manifest file, or you've already removed the header lines.")

Now we'll write the UIDs into a list and count them.

In [None]:
# initialize variable
uid_list = []

# open file and collect one UID per line, skipping blank lines
with open("manifest.tcia") as f:
    for line in f:
        uid = line.strip()
        if uid:
            uid_list.append(uid)

print("Result contains", len(uid_list), "Series Instance UIDs (scans).")
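
The header removal and UID reading steps above can also be done in pure Python, which avoids depending on `sed` and on a fixed header length. This is a sketch under stated assumptions: it assumes the header ends with a `ListOfSeriesToDownload=` line and that header lines contain `=` while UID lines do not. The function name `parse_manifest`, the sample header values, and the sample UIDs are illustrative, not taken from a real manifest.

```python
def parse_manifest(text):
    """Return the Series Instance UIDs listed in TCIA manifest text."""
    lines = [line.strip() for line in text.splitlines()]
    # Assumption: UIDs start after this marker line when it is present
    marker = "ListOfSeriesToDownload="
    if marker in lines:
        lines = lines[lines.index(marker) + 1:]
    # Drop blank lines and any remaining key=value header lines
    return [line for line in lines if line and "=" not in line]


# Illustrative manifest text in the assumed layout (made-up values and UIDs)
sample_manifest = """downloadServerUrl=https://example.org/placeholder
includeAnnotation=true
noOfrRetry=4
databasketId=manifest-placeholder.tcia
manifestVersion=3.0
ListOfSeriesToDownload=
1.2.3.4.5.1
1.2.3.4.5.2

"""
print(parse_manifest(sample_manifest))
```

Because the function skips `key=value` lines even without the marker, it also works on a file whose header was already removed.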

Finally, we'll download the series in the list.  

In [None]:
# local folder name used for the extracted files
collection = "RIDER-Breast-MRI"

manifestDF = pd.DataFrame()
count = 0

for seriesUID in uid_list:
    data_url = base_url + "getImage?SeriesInstanceUID=" + seriesUID
    print("Downloading " + data_url)
    data = requests.get(data_url)
    # stop with an error if the server did not return the data
    data.raise_for_status()
    with zipfile.ZipFile(BytesIO(data.content)) as file:
        # print(file.namelist())
        file.extractall(path = "apiDownload/" + collection + "/" + seriesUID)
    # write the series metadata to a dataframe
    metadata_url = base_url + "getSeriesMetaData?SeriesInstanceUID=" + seriesUID
    metadata = requests.get(metadata_url).json()
    newRow = pd.DataFrame.from_dict(metadata)
    manifestDF = pd.concat([manifestDF, newRow], ignore_index = True)
    # Repeat n times for demo purposes - comment out these next 3 lines to download the full results
    count += 1
    if count == 3:
        break

# save manifest to CSV file and display the dataframe
manifestDF.to_csv(collection + '_manifest.csv')
display(manifestDF)

# Conclusion
This notebook demonstrated various ways to use TCIA's REST APIs to download cohorts of imaging data.  You can find additional TCIA notebooks at https://github.com/kirbyju/TCIA_Notebooks. 

# Acknowledgements
TCIA is funded by the [Cancer Imaging Program (CIP)](https://imaging.cancer.gov/), a part of the United States [National Cancer Institute (NCI)](https://www.cancer.gov/), and is managed by the [Frederick National Laboratory for Cancer Research (FNLCR)](https://frederick.cancer.gov/).

This notebook was created by [Justin Kirby](https://www.linkedin.com/in/justinkirby82/) and Qinyan Pan.  If you leverage this notebook or any TCIA datasets in your work please be sure to comply with the [TCIA Data Usage Policy](https://wiki.cancerimagingarchive.net/x/c4hF). In particular, make sure to cite the DOI(s) for the specific TCIA datasets you used in addition to the following paper!

## TCIA Citation

Clark, K., Vendt, B., Smith, K., Freymann, J., Kirby, J., Koppel, P., Moore, S., Phillips, S., Maffitt, D., Pringle, M., Tarbox, L., & Prior, F. (2013). The Cancer Imaging Archive (TCIA): Maintaining and Operating a Public Information Repository. Journal of Digital Imaging, 26(6), 1045–1057. https://doi.org/10.1007/s10278-013-9622-7