You can download and run this notebook locally, or you can run it for free in a cloud environment using Colab or Sagemaker Studio Lab:

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/kirbyju/TCIA_Notebooks/blob/main/TCIA_Data_Curation_Learning_Lab_SIIM_2024.ipynb)

[![Open In SageMaker Studio Lab](https://studiolab.sagemaker.aws/studiolab.svg)](https://studiolab.sagemaker.aws/import/github.com/kirbyju/TCIA_Notebooks/blob/main/TCIA_Data_Curation_Learning_Lab_SIIM_2024.ipynb)

# SIIM 2024 TCIA Data Curation Learning Lab

Access to large, high-quality datasets is essential for researchers to understand disease and precision medicine pathways, especially in cancer. However, there are many challenges associated with publishing and utilizing radiological imaging data of human subjects. In this hands-on learning lab we'll teach you some solutions to these challenges, including how to properly de-identify and publish your DICOM data as well as how to access freely available datasets that have been published in online archives.

You can view the full course description, requirements and objectives at https://annualmeeting.siim.org/sessions/hands-on-data-curation-learning-lab/.  This notebook was developed for the course to demonstrate command-line and API-based options for accessing data from The Cancer Imaging Archive (TCIA).

# Setup

The following installs and imports **[tcia_utils](https://pypi.org/project/tcia-utils/)**, which contains a variety of useful functions for accessing TCIA via Python and Jupyter Notebooks.

In [None]:
import sys

# install tcia utils
!{sys.executable} -m pip install --upgrade -q tcia_utils

Next we'll import modules to help us work with a few different TCIA APIs and change the logging settings if you're on Colab so you can see more of the INFO statements that tell you what's going on as we run our commands.

In [None]:
import requests
import pandas as pd
from tcia_utils import wordpress
from tcia_utils import nbia

# set logging level to INFO in Google Colab (not necessary in Jupyter)
if 'google.colab' in sys.modules:
  import logging

  for handler in logging.root.handlers[:]:
      logging.root.removeHandler(handler)

  # Set handler with level = info
  logging.basicConfig(format='%(asctime)s:%(levelname)s:%(message)s',
                      level=logging.INFO)

  print("Google Colab Logging set to INFO")

# Finding datasets of interest with our Wordpress API (aka Collection Manager)

This API contains metadata about the datasets we host including free-text summaries, available files for download, citation requirements, related publications and versioning info. Full documentation about this API can be found at https://www.cancerimagingarchive.net/collection-manager-rest-api/, but we'll rely on the **wordpress** module in **tcia_utils** to simplify some common tasks.

# Getting collection metadata
New image datasets are organized as “collections”. Typically these are patient cohorts related by a common disease (e.g. lung cancer), image modality or type (MRI, CT, digital histopathology, etc) or research focus. Supporting data related to the images such as patient outcomes, treatment details, genomics and image analyses are also provided when available.  These are described at https://www.cancerimagingarchive.net/browse-collections/.

To get metadata about collections with tcia_utils we'll use the **wordpress.getCollections()** function.  The default format of the returned data is JSON, but you can set **format** to "df" if you prefer a dataframe.  You can also specify a subset of **fields** to return and set a **file_name** if you'd like to save a CSV of the output.  Finally, since the values in some columns are HTML code, there is a **removeHtml** parameter that can be used to convert those to plain text.  We'll demonstrate all of these features in the next cell.

In [None]:
# select fields to retrieve
fields = ["id", "slug", "collection_page_accessibility", "link", "cancer_types",
          "collection_doi", "cancer_locations", "collection_status", "species",
          "versions", "citations", "collection_title", "version_number",
          "date_updated", "subjects", "collection_short_title", "data_types",
          "supporting_data", "program", "collection_summary", "collection_downloads"]

# request metadata
collections = wordpress.getCollections(format = "df", fields = fields, file_name = "tciaCollections.csv", removeHtml = "yes")

collections

# Getting analysis results metadata
To enhance the value of TCIA collections we encourage the research community to publish their analyses of existing TCIA image collections. Examples of this kind of data include radiologist or pathologist annotations, image classifications, segmentations, radiomics features, or derived/reprocessed images. These datasets are described at https://www.cancerimagingarchive.net/browse-analysis-results/.

The **wordpress.getAnalyses()** function for this is nearly identical to getCollections() as shown below.

In [None]:
# select fields to retrieve
fields = ["id", "slug", "result_page_accessibility", "type",
          "link", "cancer_types", "result_doi", "cancer_locations",
          "status", "citations", "result_title", "version_number",
          "date_updated", "versions", "subjects", "result_short_title",
          "supporting_data", "program", "result_summary", "result_downloads"]

# request metadata
analyses = wordpress.getAnalyses(format = "df", fields = fields, file_name = "tciaAnalyses.csv")

analyses

# Filtering dataframes

tcia_utils has a helper function called **searchDf()** that makes it easy to filter an entire dataframe for keywords regardless of case.  Let's say that you were interested in narrowing the results of your collection dataframe to datasets that mention **brain** in any column.

In [None]:
brain = wordpress.searchDf("brain", dataframe = collections)

brain

Note that many of these datasets are marked **Limited** in the **collection_page_accessibility** column.  Although this is not currently part of HIPAA regulations, there is some preliminary evidence that 3D reconstructions of MR and CT imaging of the head could be used to try to identify subjects.  As a result, data with these features are made available to investigators only under a data use agreement.

For the sake of simplicity, let's reduce our results to only the fully public datasets which have been de-faced or skull stripped and do not require any special data usage agreement.  We can specify that we only want rows where **public** is in the **collection_page_accessibility** column here.

In [None]:
public_brain = wordpress.searchDf("public", dataframe = brain, column_name = "collection_page_accessibility")

public_brain

Next, let's assume that we're only interested in datasets that also contain tumor segmentations.  This can be represented in a few different ways in the metadata.  In the **data_types** column, **RTSTRUCT** or **SEG** indicate DICOM representations of segmentations.  Sometimes we also receive data in other formats such as NIfTI which may be represented as **Segmentation** in this column.  

Note: Including rows that have **Image Analyses** in the **supporting_data** column may also be of interest if you're looking for other types of analyses such as seed points, other 2D measurements or image classifications.

In [None]:
# note: 'Seg' in our search terms will catch 'SEG' and 'Segmentations'
public_brain_seg = wordpress.searchDf(['RTSTRUCT', 'Seg'], dataframe = public_brain)

public_brain_seg

Ok, now let's say that we also only want to include datasets that have supporting clinical data.  We can find these by filtering for **Clinical** in the **supporting_data** column.

In [None]:
# note: the search ignores case, so 'clinical' will catch 'Clinical'
public_brain_seg_clinical = wordpress.searchDf('clinical', dataframe = public_brain_seg, column_name = 'supporting_data')

public_brain_seg_clinical
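To make the filtering steps above more concrete, here is a minimal pandas sketch of case-insensitive keyword filtering. It is not the tcia_utils implementation: `keyword_filter` is a hypothetical helper, the toy dataframe is made up, and treating a list of terms as "match any term" is an assumption based on how the search terms are used above.

```python
import pandas as pd

def keyword_filter(df, terms, column_name=None):
    """Return rows where any searched column contains any term, ignoring case."""
    if isinstance(terms, str):
        terms = [terms]
    # Search a single column if one is named, otherwise every column
    searchable = df[[column_name]] if column_name else df
    # Cast cells to strings so non-text columns can be searched too
    as_text = searchable.astype(str)
    hits = pd.Series(False, index=df.index)
    for term in terms:
        # regex=False treats the term as a literal substring
        per_column = as_text.apply(
            lambda col: col.str.contains(term, case=False, regex=False)
        )
        hits |= per_column.any(axis=1)
    return df[hits]

# Made-up rows for illustration only
toy = pd.DataFrame({
    "collection_short_title": ["Example-A", "Example-B", "Example-C"],
    "collection_page_accessibility": ["Public", "Public", "Limited"],
    "data_types": ["MR, SEG", "CT", "MR, RTSTRUCT"],
})

with_segmentations = keyword_filter(toy, ["RTSTRUCT", "Seg"])
public_only = keyword_filter(toy, "public", column_name="collection_page_accessibility")
```

Chaining calls like these, each one narrowing the previous result, mirrors the brain, public, segmentation and clinical steps above.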

# Downloading DICOM data
Let's assume that after reading the information in the **collection_summary** column and taking the other metadata into account we've decided the **ReMIND** dataset is of highest interest to us.  

The IDs associated with the download records for each file we host are shown in the **collection_downloads** column above.  In the case of **ReMIND** the **ids** are [43723, 43725, 43727].  These can be requested with the **wordpress.getDownloads()** function using the **ids** parameter.  

In [None]:
ids = [43723, 43725, 43727]

fields = ["id", "date_updated", "download_title", "data_license", "download_access",
          "data_type", "file_type", "download_size", "download_size_unit",
          "subjects", "study_count", "series_count", "image_count",
           "download_type", "download_url", "download_file", "search_url"]

remind = wordpress.getDownloads(ids = ids, fields = fields, format = "df", file_name = "ReMIND_downloads.csv")

remind

There is a **query** parameter which is also a handy way to retrieve this metadata.  

In [None]:
fields = ["id", "date_updated", "download_title", "data_license", "download_access",
          "data_type", "file_type", "download_size", "download_size_unit",
          "subjects", "study_count", "series_count", "image_count",
           "download_type", "download_url", "download_file", "search_url"]

remind = wordpress.getDownloads(query="ReMIND", fields = fields, format = "df", file_name = "ReMIND_downloads.csv")

remind

Information about where to download these data appears in the **download_url** column.  The **search_url** column contains links where you can view the data before downloading it (where applicable).  In this particular case you can see that there's an option to look at the DICOM Images and Segmentations in our **NBIA** DICOM portal.

In any case, let's first demonstrate how to download the DICOM and clinical data.  We'll start by downloading the **TCIA manifest file** for the DICOM images.

In [None]:
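url = 'https://www.cancerimagingarchive.net/wp-content/uploads/ReMIND-Manifest-Sept-2023.tcia'
response = requests.get(url)
# stop here if the request failed, rather than saving an error page as the manifest
response.raise_for_status()
with open('ReMIND-Manifest-Sept-2023.tcia', 'wb') as f:
    f.write(response.content)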

## Downloading with the Linux Command Line NBIA Data Retriever

TCIA uses software called NBIA to manage its DICOM data.  One way to download TCIA data is to install the NBIA Data Retriever.  This tool provides a number of useful features such as auto-retry if there are any problems, saving data in an organized hierarchy on your hard drive (Collection > Patient > Study > Series > Images), and providing a CSV file containing key DICOM metadata about the images you've downloaded.

### Install the NBIA Data Retriever
There are versions of this tool for Windows, Mac and Linux.  If you're working from a system with a GUI you can follow the [instructions](https://wiki.cancerimagingarchive.net/display/NBIA/Downloading+TCIA+Images) to install Data Retriever on your computer.

There is also a [command-line version of the NBIA Data Retriever](https://wiki.cancerimagingarchive.net/x/2QKPBQ) which can be installed via the steps below if you're running this notebook in a **Linux** environment.  

In [None]:
# Install NBIA Data Retriever CLI software for downloading images later in this notebook.

!mkdir /usr/share/desktop-directories/
!wget -P /content/NBIA-Data-Retriever https://cbiit-download.nci.nih.gov/nbia/releases/ForTCIA/NBIADataRetriever_4.4/nbia-data-retriever-4.4.2.deb
!dpkg -i /content/NBIA-Data-Retriever/nbia-data-retriever-4.4.2.deb

# NOTE: If you're working on a Linux OS that uses RPM packages, you can change the lines above to use
#       https://cbiit-download.nci.nih.gov//nbia/releases/ForTCIA/NBIADataRetriever_4.4/NBIADataRetriever-4.4-2.x86_64.rpm

If you open the **ReMIND-Manifest-Sept-2023.tcia** file you'll see some configuration information at the top, followed by a list of Series Instance UIDs that are part of the dataset.  

Let's edit the manifest file to only include the first 2 UIDs in the manifest so that we can demonstrate the download process more quickly.


In [None]:
# 'w' overwrites any existing sample file so re-running this cell doesn't append duplicate lines
with open('ReMIND-Manifest-Sept-2023.tcia', 'r') as firstfile, open('ReMIND-Sample.tcia', 'w') as secondfile:
    count = 0
    for line in firstfile:
        # copy the line to the sample manifest
        secondfile.write(line)
        count += 1
        # stop after 8 lines: the header plus the first 2 series UIDs
        if count == 8:
            break
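Counting lines works as long as the header length never changes. As a slightly more defensive sketch, the function below looks for the line that introduces the UID list instead. It assumes the header ends with a line starting with `ListOfSeriesToDownload=`, so check that against your own manifest. `trim_manifest` is a hypothetical helper, and the sample header values and UIDs are made up for illustration.

```python
def trim_manifest(lines, n_series):
    """Keep the manifest header plus the first `n_series` Series Instance UIDs."""
    kept = []
    uids_kept = 0
    in_series_list = False
    for line in lines:
        if not in_series_list:
            # Header lines are copied verbatim
            kept.append(line)
            # Assumption: this key marks the end of the header
            if line.strip().startswith("ListOfSeriesToDownload="):
                in_series_list = True
        elif line.strip():
            # Non-blank lines after the header are treated as series UIDs
            if uids_kept == n_series:
                break
            kept.append(line)
            uids_kept += 1
    return kept

# Made-up manifest content (placeholder header values and UIDs)
sample = [
    "downloadServerUrl=https://example.org/download\n",
    "includeAnnotation=true\n",
    "manifestVersion=3.0\n",
    "ListOfSeriesToDownload=\n",
    "1.2.3.1\n",
    "1.2.3.2\n",
    "1.2.3.3\n",
]
trimmed = trim_manifest(sample, n_series=2)
```

To apply this to a real file, read it with `readlines()` and write the returned list to the sample manifest with `writelines()`.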

### Open the Manifest File with the NBIA Data Retriever
Next, let's open the sample manifest file with the NBIA Data Retriever to download the actual DICOM data.  To do this, we'll call the command that launches Data Retriever and specify the **--cli** flag to indicate we want to run it from the command line (not with a GUI).  We also need to specify the path to the manifest file we want to open and then use **-d** to specify the path where we want to save the data.

**<font color='red'>After running the following command, click in the output cell, type "y," and press Enter to agree with the TCIA Data Usage Policy and start the download.</font>**

In [None]:
# download the data using NBIA Data Retriever

!/opt/nbia-data-retriever/nbia-data-retriever --cli '/content/ReMIND-Sample.tcia' -d /content/

### NBIA Data Retriever Conclusion
You should now find that the data have been saved to your machine in a well-organized hierarchy with some useful metadata in the accompanying CSV file and a license file detailing how it can be used.  Take a look at the data before moving on.

A few other notes:
* The CLI Data Retriever supports both "Descriptive" and "Classic" organization of the data.  Descriptive naming uses information from the DICOM Study/Series Description and Dates to make them easier for humans to interpret.  Classic names everything by machine-readable unique identifiers.  If you prefer machine-readable directory names simply add the **-cd** parameter to your download command.
* In some cases, you must specifically request access to [Collections](https://www.cancerimagingarchive.net/browse-collections/) before you can download them.  Information about how to do this can be found on the homepage for the Collection(s) you're interested in, but will always require that you first [create a TCIA user account](https://wiki.cancerimagingarchive.net/x/xgHDAg).  Once you've created an account and obtained permission to the restricted data you want to download, you can use your login/password to create the **credentials.txt** file that NBIA Data Retriever uses to verify your permissions.  The path to the credential file is specified using the **-l** parameter.

You can find examples for these use cases at [TCIA_Linux_Data_Retriever_App.ipynb](https://github.com/kirbyju/TCIA_Notebooks/blob/main/TCIA_Linux_Data_Retriever_App.ipynb).



## Downloading DICOM data with the REST API
The NBIA REST APIs are provided for the search and download functions used in the TCIA radiology portal and allow access to both public and limited access collections.
1. The [NBIA Search REST APIs](https://wiki.cancerimagingarchive.net/x/fILTB) allow you to perform basic queries and download data from **public** collections. These APIs do not require a TCIA account.
2. The [NBIA Search with Authentication REST APIs](https://wiki.cancerimagingarchive.net/x/X4ATBg) allow you to perform basic queries and download data from **public and limited-access** collections. These APIs require a TCIA account to create authentication tokens.
3. The [NBIA Advanced REST APIs](https://wiki.cancerimagingarchive.net/x/YoATBg) also allow access to **public and limited-access** collections, but provide query endpoints mostly geared towards developers seeking to integrate searching and downloading TCIA data into web and desktop applications. This API requires a TCIA account to create authentication tokens.

Feel free to reference this documentation if you run into issues or need clarification about anything related to the API.  

### tcia_utils

Rather than working directly with the NBIA API we will once again be relying on [**tcia_utils**](https://pypi.org/project/tcia-utils/) to simplify things. Docstrings detailing how to use each function are provided in the code itself, and an extensive library of tutorial notebooks can be found at https://github.com/kirbyju/TCIA_Notebooks to address use cases we don't have time for today.

By default, most search functions from tcia_utils return results in JSON.  However, you can use **format = "df"** to return the results as a dataframe, or **format = "csv"** to save a CSV file in addition to returning a dataframe.

There are also two download functions that you can use after you've finished searching and determined what you want to download.  The first is **downloadSeries()** which is used for downloading an entire scan.  The second is **downloadImage()** which you can use to pull a single slice from a scan if needed.


### Download the full collection and preview a series



Let's say that we're interested in downloading the entire **ReMIND** collection via the API instead of installing the **NBIA Data Retriever** software.  One way to do this is to leverage the manifest file we obtained earlier.  If we set **input_type = "manifest"** when calling **downloadSeries()** we can pass it the path to a manifest file, and it will automatically extract the UIDs from the file and download them.  The cell below uses the small sample manifest we created earlier to keep the download short; point it at **ReMIND-Manifest-Sept-2023.tcia** instead to download the full collection.

In [None]:
nbia.downloadSeries("/content/ReMIND-Sample.tcia", input_type = "manifest")


Another way to download data from the API is to get a list of all Series Instance UIDs in that collection.  We can use **nbia.getSeries()** to save the JSON metadata about all series (scans) in this collection to a variable called **data**.

In [None]:
data = nbia.getSeries(collection = "ReMIND")
print(data)

Then we can pass that **data** variable to our download function.  We'll leverage the **number** parameter here to just grab the first scan as a test.  You can remove this parameter if you want to download the full collection.

In [None]:
nbia.downloadSeries(data, number = 1)

Before moving on, take a second to review how the data are saved.  You'll note that by default the data are stored to a directory called **tciaDownload** with each directory named after the **Series Instance UID**.  While these UIDs are important to uniquely identify each scan, it can be difficult to figure out what is what unless you have additional metadata.  

In order to obtain such metadata, you can set the **format** parameter to **df** to return a dataframe containing the metadata for the files you've downloaded.  Setting **format = "csv"** will save a spreadsheet in addition to returning a dataframe.

In [None]:
nbia.downloadSeries(data, number = 1, format = "csv")
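Once you have such a dataframe, you can use it to translate UID-named directories into something readable. The sketch below uses a toy dataframe with placeholder UIDs. The column names (`Series UID`, `Modality`, `Series Description`) are assumptions for illustration, so check the columns of the dataframe you actually get back. `label_for` is a hypothetical helper, not part of tcia_utils.

```python
import pandas as pd

# Toy metadata with placeholder UIDs; the column names are assumptions
meta = pd.DataFrame({
    "Series UID": ["1.2.3.1", "1.2.3.2"],
    "Modality": ["MR", "SEG"],
    "Series Description": ["T1 post-contrast", "Tumor segmentation"],
})

def label_for(series_uid, metadata):
    """Build a human-readable label for a directory named after a series UID."""
    row = metadata.loc[metadata["Series UID"] == series_uid]
    if row.empty:
        # Fall back to the raw UID when no metadata is available
        return series_uid
    first = row.iloc[0]
    return f"{first['Modality']} - {first['Series Description']}"

labels = {uid: label_for(uid, meta) for uid in ["1.2.3.1", "1.2.3.2", "9.9.9"]}
```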

**tcia_utils** also provides a function called **viewSeries()** that we can use to quickly preview the images here in the notebook.  If **seriesUid** is provided, the function assumes "tciaDownload/**seriesUid**/" as the path, since this is where downloadSeries() saves data.  However, you can also open a series from a custom location using the **path** parameter.

In [None]:
# example using just the series UID
nbia.viewSeries("1.3.6.1.4.1.14519.5.2.1.129870686332767537746901777770022995560")

# alternate example specifying a custom path
# nbia.viewSeries(path = "/content/tciaDownload/1.3.6.1.4.1.14519.5.2.1.129870686332767537746901777770022995560")

### Download custom API query
Now let's assume that you've recently trained a model to segment tumors in brain MRIs so you're not interested in the ultrasound data that ReMIND contains. First, let's get an inventory of the available scans.  Then we'll filter it to only include the MR and SEG modalities.

In [None]:
remind_series = nbia.getSeries(collection = "ReMIND",format = "df")
remind_seg_mr_series = nbia.searchDf(["MR", "SEG"], dataframe = remind_series, column_name = "Modality")

display(remind_seg_mr_series)

Now we can pass this dataframe to **downloadSeries()** to grab only this subset of data.  We'll use the **number** parameter here for demonstration purposes, but you can remove this if you want to download the full dataset.

In [None]:
nbia.downloadSeries(remind_seg_mr_series, input_type= "df", format = "df", number = 2)

### Download images based on clinical data

Basic demographic data such as patient sex, ethnicity and age at the time of the imaging study can sometimes be found within DICOM tags.  When these data are populated you can use **getStudy()** to access this information.  

In [None]:
nbia.getStudy(collection = "ReMIND", format = "df")

In this particular case there is no basic demographic information such as patient age, sex, race/ethnicity that sometimes resides in the DICOM metadata.  However, many collections in TCIA also come with clinical data about the subjects that are not contained in the DICOM itself.  

For the larger data collection initiatives sponsored by NCI/NIH such as [The Cancer Genome Atlas (TCGA), the Clinical Proteomic Tumor Analysis Consortium (CPTAC), the Cancer Moonshot Biobank (CMB)](https://www.cancerimagingarchive.net/imaging-proteogenomics/) and the [NCI Clinical Trial Network (NCTN)](https://wiki.cancerimagingarchive.net/x/BQHDAg) these clinical data are generally stored in external databases which can be found in the **Additional Resources** section of the pages describing those datasets.  

However, for **community-proposed** datasets we typically host the data directly in the form of CSV files provided by the submitters.  ReMIND happens to be an example of this as we learned while doing our Wordpress searches earlier.

Let's first **ReMIND** ourselves where that clinical file lives (har har har).

In [None]:
remind

Ok, now let's read the XLSX file into a dataframe.

In [None]:
# Keep only the rows where the 'download_type' column equals 'Clinical Data'
clinical = remind[remind['download_type'] == 'Clinical Data']

# Since there's only one URL, we can directly read it into a DataFrame
url = clinical['download_url'].values[0]
remind_clinical = pd.read_excel(url)

remind_clinical

This data could now be easily merged with the image metadata in the **remind_seg_mr_series** we created earlier and used for training AI models on any of the available data.  
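As a rough illustration of such a merge, the sketch below joins toy series metadata to a toy clinical table on a patient identifier. All values are made up, and the clinical column names (`Case Number`, `Histopathology`) are hypothetical, so inspect **remind_clinical** and **remind_seg_mr_series** to find the real identifier columns before adapting it.

```python
import pandas as pd

# Toy stand-in for the series metadata (one row per scan)
series = pd.DataFrame({
    "PatientID": ["Subject-001", "Subject-001", "Subject-002"],
    "SeriesInstanceUID": ["1.2.3.1", "1.2.3.2", "1.2.3.3"],
    "Modality": ["MR", "SEG", "MR"],
})

# Toy stand-in for the clinical spreadsheet (one row per patient)
clinical = pd.DataFrame({
    "Case Number": ["Subject-001", "Subject-002"],  # hypothetical ID column
    "Histopathology": ["Type A", "Type B"],         # hypothetical clinical field
})

# Left join keeps every scan; validate raises if a patient appears twice in clinical
merged = series.merge(
    clinical,
    how="left",
    left_on="PatientID",
    right_on="Case Number",
    validate="many_to_one",
)
```

A left join is used so that scans without a matching clinical row are kept (with missing values) rather than silently dropped.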

### Downloading with Aspera ASCLI

You may have noticed there was also a separate row for some NRRD segmentation data in the **remind** dataframe.  

In [None]:
remind

Most non-DICOM content in TCIA is provided via IBM Aspera Faspex packages, which are typically accessed via a browser plugin.  However, the IBM Aspera developers also maintain an open source tool called [ascli (aspera-cli)](https://github.com/IBM/aspera-cli) that allows a client to download an Aspera Faspex package via the command line.

You can tell if data is hosted in an Aspera Faspex package by the **download_url**, which begins with https://faspex.cancerimagingarchive.net/.
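That prefix check is easy to automate. In this small sketch, `is_faspex_package` is a hypothetical helper and the first URL is a placeholder rather than a real package link.

```python
FASPEX_PREFIX = "https://faspex.cancerimagingarchive.net/"

def is_faspex_package(download_url):
    """Return True when a download URL points at the Aspera Faspex server."""
    # Guard against missing values (e.g. NaN) in a dataframe column
    return isinstance(download_url, str) and download_url.startswith(FASPEX_PREFIX)

urls = [
    "https://faspex.cancerimagingarchive.net/placeholder/package",  # placeholder path
    "https://www.cancerimagingarchive.net/wp-content/uploads/ReMIND-Manifest-Sept-2023.tcia",
]
flags = [is_faspex_package(u) for u in urls]
```

Applied to the dataframe from earlier, `remind[remind["download_url"].apply(is_faspex_package)]` would keep only the rows served through Aspera.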

In order to pull data from this package we need to install Ruby, then the aspera-cli gem (which provides the **ascli** command), and then use ascli to install the **ascp** transfer client.  The steps to [install Ruby](https://www.ruby-lang.org/en/downloads/) vary by operating system but the gem and ascli commands should be the same as the last two lines below.

In [None]:
# consult the link above for Ruby installation instructions if you're not on an OS that uses apt
!apt install -y ruby ruby-dev rubygems ruby-json

# these should work in all environments after ruby is installed
!gem install aspera-cli
!ascli conf ascp install

First, we need to know the package **--url** where the NRRD files reside.  Then you can use the **browse** command to look around inside the package before downloading things.  Running the command without any path specified will show you the root folder of the package.  

In [None]:
# First let's pull the id associated with the Aspera package to get the url
url = remind.loc[remind['id'] == 43725, 'download_url'].values[0]

# Now you can use this url in your shell command
command = f'ascli faspex5 packages browse --url="{url}"'

# run the ascli command
get_ipython().system(command)



If you want to look at the contents of one of the directories you just add the directory path to the end of this command.

In [None]:
command = f'ascli faspex5 packages browse --url="{url}" ReMIND_NRRD_Seg_Sep_2023'

# run the ascli command
get_ipython().system(command)

When you decide what you want to download, you can use the **receive** command.  Let's say that we're interested in grabbing a single subject's data to assess it before we download the full package.  You can do that by simply appending the path of the directory to the end of the **receive** command.

In [None]:
command = f'ascli faspex5 packages receive --url="{url}" ReMIND_NRRD_Seg_Sep_2023/ReMIND-001'

# run the ascli command
get_ipython().system(command)

Take a second to look at the files downloaded.  Congratulations! You've now seen how to search and download all data types for this dataset programmatically, without a browser!

# Additional Resources
The following pages on TCIA may be of special interest to deep learning researchers:

1. [Finding Annotated Data for AI/ML on TCIA](https://wiki.cancerimagingarchive.net/x/TAGJAw) provides basic guidance for finding datasets that could be useful for deep learning tasks.
2. [Challenge Competitions using TCIA data](https://wiki.cancerimagingarchive.net/x/nYIaAQ) can be useful for benchmarking your model's performance.
3. [ACR Data Science Institute's Define AI Directory](https://www.acrdsi.org/DSI-Services/Define-AI) links clinically relevant AI use-cases to TCIA datasets that can be used to address them.
4. [Additional TCIA Notebooks](https://github.com/kirbyju/TCIA_Notebooks) about accessing and visualizing data are available.

# Acknowledgements
TCIA is funded by the [Cancer Imaging Program (CIP)](https://imaging.cancer.gov/), a part of the United States [National Cancer Institute (NCI)](https://www.cancer.gov/).  It is managed by the [Frederick National Laboratory for Cancer Research (FNLCR)](https://frederick.cancer.gov/) and hosted by the [University of Arkansas for Medical Sciences (UAMS)](https://www.uams.edu/).

This notebook was created by [Justin Kirby](https://www.linkedin.com/in/justinkirby82/). If you leverage this notebook or any TCIA datasets in your work, please be sure to comply with the [TCIA Data Usage Policy](https://wiki.cancerimagingarchive.net/x/c4hF). In particular, make sure to cite the DOI(s) for the specific TCIA datasets you used in addition to the following paper!

## TCIA Citation

Clark, K., Vendt, B., Smith, K., Freymann, J., Kirby, J., Koppel, P., Moore, S., Phillips, S., Maffitt, D., Pringle, M., Tarbox, L., & Prior, F. (2013). The Cancer Imaging Archive (TCIA): Maintaining and Operating a Public Information Repository. Journal of Digital Imaging, 26(6), 1045–1057. https://doi.org/10.1007/s10278-013-9622-7