You can download and run this notebook locally, or you can run it for free in a cloud environment using Colab or Sagemaker Studio Lab:

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/kirbyju/TCIA_Notebooks/blob/main/TCIA_Linux_Data_Retriever_App.ipynb)

[![Open In SageMaker Studio Lab](https://studiolab.sagemaker.aws/studiolab.svg)](https://studiolab.sagemaker.aws/import/github.com/kirbyju/TCIA_Notebooks/blob/main/TCIA_Linux_Data_Retriever_App.ipynb)

# Summary

Access to large, high-quality datasets is essential for researchers to understand disease and precision medicine pathways, especially in cancer. However, HIPAA constraints make sharing medical images outside an individual institution a complex process. [The Cancer Imaging Archive (TCIA)](https://www.cancerimagingarchive.net/) is a public service funded by the National Cancer Institute that addresses this challenge by providing hosting and de-identification services to take major burdens of data sharing off researchers.

**This notebook is focused on basic use cases for identifying TCIA datasets of interest and downloading them using the NBIA Data Retriever application via the command line on a Linux operating system.** If you're interested in additional TCIA notebooks and coding examples, check out the tutorials at https://github.com/kirbyju/TCIA_Notebooks.

# 1 Learn about Available Collections on the TCIA Website

[Browsing Collections](https://www.cancerimagingarchive.net/collections) and viewing [Analysis Results](https://www.cancerimagingarchive.net/tcia-analysis-results/) of datasets on TCIA are the easiest ways to become familiar with what is available. These pages will help you quickly identify datasets of interest, find valuable supporting data that are not available via our APIs (e.g. clinical spreadsheets and non-DICOM segmentation data), and answer the most common questions you might have about the datasets.  

# 2 Downloading images and annotations with the NBIA Data Retriever

TCIA uses software called NBIA to manage its DICOM data.  One way to download TCIA data is to install the NBIA Data Retriever.  This tool provides a number of useful features such as auto-retry if there are any problems, saving data in an organized hierarchy on your hard drive (Collection > Patient > Study > Series > Images), and providing a CSV file containing key DICOM metadata about the images you've downloaded.

**Note:** It's also possible to download these data via our REST API if you can't or don't want to install the NBIA Data Retriever. This is covered in a [separate notebook](https://github.com/kirbyju/TCIA_Notebooks/blob/main/TCIA_REST_API_Downloads.ipynb).

## 2.1 Install the NBIA Data Retriever
There are versions of this tool for Windows, Mac and Linux.  If you're working from a system with a GUI you can follow the [instructions](https://wiki.cancerimagingarchive.net/display/NBIA/Downloading+TCIA+Images) to install Data Retriever on your computer.

There is also a [command-line version of the NBIA Data Retriever](https://wiki.cancerimagingarchive.net/x/2QKPBQ) which can be installed via the steps below if you're running this notebook in a **Linux** environment.  

In [None]:
# Install NBIA Data Retriever CLI software for downloading images later in this notebook.

!mkdir -p /usr/share/desktop-directories/
!wget -P /content/NBIA-Data-Retriever https://cbiit-download.nci.nih.gov/nbia/releases/ForTCIA/NBIADataRetriever_4.4/nbia-data-retriever-4.4.2.deb
!dpkg -i /content/NBIA-Data-Retriever/nbia-data-retriever-4.4.2.deb

# NOTE: If you're working on a Linux OS that uses RPM packages, you can change the lines above to use
#       https://cbiit-download.nci.nih.gov/nbia/releases/ForTCIA/NBIADataRetriever_4.4/NBIADataRetriever-4.4-2.x86_64.rpm

## 2.2 Download a Manifest File
The NBIA Data Retriever software works by ingesting a "manifest" file that contains the DICOM Series Instance UIDs of the scans you'd like to download. Let's assume that after [browsing the collections](https://www.cancerimagingarchive.net/collections), you decided you were interested in the [RIDER Breast MRI](https://doi.org/10.7937/K9/TCIA.2015.H1SXNUXL) Collection.  We can find the URL of the manifest to download the full collection by looking at the blue "Download" button on that page.  Then we can download the manifest with the following command.


In [None]:
!wget https://wiki.cancerimagingarchive.net/download/attachments/22512757/doiJNLP-Fo0H1NtD.tcia

If you look at the file you'll see some configuration information at the top, followed by a list of Series Instance UIDs that are part of the dataset.  

Let's edit the manifest file to only include the first 3 UIDs in the manifest so that we can demonstrate the download process more quickly.


In [None]:
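# Open the output in write mode ('w') so that rerunning this cell
# overwrites the sample manifest instead of appending duplicate lines.
with open('doiJNLP-Fo0H1NtD.tcia', 'r') as firstfile, open('RIDER-Breast-MRI-Sample.tcia', 'w') as secondfile:
    count = 0
    for line in firstfile:
        # copy the line to the sample manifest
        secondfile.write(line)
        # stop after the header and the first 3 series UIDs (9 lines in total)
        count += 1
        if count == 9:
            break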
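
The fixed line count used above only works while the manifest header keeps the same length. As a minimal sketch, the function below instead looks for the line that ends the header. It assumes that line starts with `ListOfSeriesToDownload=` and that every non-empty line after it is one Series Instance UID; check this against your own manifest before relying on it. The name `trim_manifest` is hypothetical and not part of any TCIA tool.

```python
from pathlib import Path

# Assumption: the manifest header ends with a line starting with this marker,
# and each non-empty line after it holds one Series Instance UID.
SERIES_MARKER = "ListOfSeriesToDownload="


def trim_manifest(src, dst, n_series=3):
    """Copy the manifest header plus its first n_series UIDs from src to dst."""
    lines = Path(src).read_text().splitlines()

    # Find the end of the header instead of hard-coding its length.
    marker_index = next(
        (i for i, line in enumerate(lines) if line.startswith(SERIES_MARKER)),
        None,
    )
    if marker_index is None:
        raise ValueError(f"{SERIES_MARKER!r} not found in {src}")

    header = lines[: marker_index + 1]
    uids = [line.strip() for line in lines[marker_index + 1:] if line.strip()]
    kept = uids[:n_series]

    # write_text replaces any existing file, so reruns never append duplicates.
    Path(dst).write_text("\n".join(header + kept) + "\n")
    return kept
```

For example, `trim_manifest('doiJNLP-Fo0H1NtD.tcia', 'RIDER-Breast-MRI-Sample.tcia', 3)` would write the sample manifest and return the three UIDs it kept.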

## 2.3 Open the Manifest File with the NBIA Data Retriever
Next, let's open the sample manifest file with the NBIA Data Retriever to download the actual DICOM data.

**<font color='red'>After running the following command, click in the output cell, type "y" and press Enter to agree to the TCIA Data Usage Policy and start the download.</font>**

In [None]:
# download the data using NBIA Data Retriever

!/opt/nbia-data-retriever/nbia-data-retriever --cli '/content/RIDER-Breast-MRI-Sample.tcia' -d /content/

## 2.4 Review the Downloaded Data
You should now find that the data have been saved to your machine in a well-organized hierarchy with some useful metadata in the accompanying CSV file and a license file detailing how it can be used.
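
One quick way to check the result is to list every folder that holds image files, along with how many it contains. The sketch below assumes the downloaded images are saved with a `.dcm` extension; `summarize_downloads` is a hypothetical helper name, not part of the Data Retriever.

```python
from pathlib import Path


def summarize_downloads(root):
    """Map each folder under root that directly contains .dcm files to its file count."""
    root = Path(root)
    counts = {}
    # Assumption: downloaded images use the .dcm file extension.
    for dcm_file in root.rglob("*.dcm"):
        folder = dcm_file.parent.relative_to(root).as_posix()
        counts[folder] = counts.get(folder, 0) + 1
    # Sort by path so series from the same patient and study appear together.
    return dict(sorted(counts.items()))
```

Calling `summarize_downloads('/content/')` and printing each key and value would show one line per series folder, which makes it easy to compare the folder names produced by different naming options.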

As of v4.4.1, the CLI Data Retriever supports both "Descriptive" and "Classic" organization of the data.  In short, Descriptive naming provides more human-readable directory names, whereas Classic naming labels every directory with a machine-readable unique identifier.

* The **Descriptive Directory Name (default)** organizes the files in a child folder under the destination folder as follows: Collection Name > Patient ID > Study Date + Study ID + Study Description (54 char max) + last 5 digits of Study Instance UID > Series Number + Series Description (54 char max) + last 5 digits of Series Instance UID
* The **Classic Directory Name** organizes the files in a child folder under the destination folder as follows: Collection Name > Patient ID > Study Instance UID > Series Instance UID.  

Let's try downloading the same data with the classic directory flag **-cd** so you can see the difference.  Data Retriever will prompt you with a warning letting you know that these data have already been downloaded.  Choose that you'd like to redownload "All" of the data so that you can see the difference in how it names the Study and Series subdirectories.



In [None]:
# download the data using NBIA Data Retriever with the classic directory name flag

!/opt/nbia-data-retriever/nbia-data-retriever --cli '/content/RIDER-Breast-MRI-Sample.tcia' -d /content/ -cd

## 2.5 Downloading "Limited-Access" Collections with the NBIA Data Retriever
In some cases, you must specifically request access to [Collections](https://www.cancerimagingarchive.net/collections/) before you can download them.  Information about how to do this can be found on the homepage for the Collection(s) you're interested in, but will always require that you first [create a TCIA user account](https://wiki.cancerimagingarchive.net/x/xgHDAg).

Let's say that we're interested in the [RIDER Neuro MRI](http://doi.org/10.7937/K9/TCIA.2015.VOSN3HN1) Collection. As you can see on the Collection page, you must sign and submit a TCIA Restricted License Agreement to help@cancerimagingarchive.net before accessing the data. Once you've done this, click the blue Download button on the RIDER Neuro MRI page to save the manifest file to your computer or grab it by using the code shown below.

In [None]:
!wget https://wiki.cancerimagingarchive.net/download/attachments/22512753/TCIA_RIDER_NEURO_MRI_06-22-2015.tcia

Once again, let's edit the manifest file to download only the first three scans.

In [None]:
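# Open the output in write mode ('w') so that rerunning this cell
# overwrites the sample manifest instead of appending duplicate lines.
with open('TCIA_RIDER_NEURO_MRI_06-22-2015.tcia', 'r') as firstfile, open('RIDER-Neuro-MRI-Sample.tcia', 'w') as secondfile:
    count = 0
    for line in firstfile:
        # copy the line to the sample manifest
        secondfile.write(line)
        # stop after the header and the first 3 series UIDs (9 lines in total)
        count += 1
        if count == 9:
            break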

Once you've created an account, you can use your login/password to create the **credentials.txt** file that NBIA Data Retriever uses to verify your permissions. The format of the text file must be identical to this:

```
userName=YourUserName
passWord=YourPassword
```
The userName and passWord parameters are case sensitive.  Alternatively, the **nbia.makeCredentialFile()** function in **[tcia_utils](https://github.com/kirbyju/tcia_utils)** can create a properly formed credential file for you.  It is offered as a convenience to those who already have this Python package installed, which provides many features for working with TCIA APIs.  If you have no plans to use our APIs, it's easier to just create the text file yourself.
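
If you'd rather create the file from the notebook than in a text editor, a minimal sketch follows. It writes the two lines in the format shown above and then restricts the file to the current user, since the password is stored in plain text. `write_credentials` is a hypothetical helper, and the sketch assumes a trailing newline after the second line is acceptable to the Data Retriever.

```python
import os
from pathlib import Path


def write_credentials(path, username, password):
    """Write a credential file with the case-sensitive userName/passWord keys."""
    content = f"userName={username}\npassWord={password}\n"
    Path(path).write_text(content)
    # The password is stored in plain text, so limit access to the file owner.
    os.chmod(path, 0o600)
    return content
```

To avoid leaving your password in a notebook cell, you could collect it interactively with Python's standard `getpass` module, for example `write_credentials('/content/credentials.txt', input('User name: '), getpass.getpass())` after `import getpass`.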

Once you've created your credentials.txt file, call NBIA Data Retriever with the "-l" parameter to tell it where you saved the file.  Don't forget to update your paths if necessary.

**<font color='red'>After running the following command, click in the output cell, type "y" and press Enter to agree to the TCIA Data Usage Policy and start the download.</font>**

In [None]:
# download the data using the NBIA Data Retriever
# you may need to update the path to your credential file

!/opt/nbia-data-retriever/nbia-data-retriever --cli '/content/RIDER-Neuro-MRI-Sample.tcia' -d /content/ -l /content/credentials.txt

# Acknowledgements
TCIA is funded by the [Cancer Imaging Program (CIP)](https://imaging.cancer.gov/), a part of the United States [National Cancer Institute (NCI)](https://www.cancer.gov/), and is managed by the [Frederick National Laboratory for Cancer Research (FNLCR)](https://frederick.cancer.gov/).

This notebook was created by [Justin Kirby](https://www.linkedin.com/in/justinkirby82/). If you leverage this notebook or any TCIA datasets in your work, please be sure to comply with the [TCIA Data Usage Policy](https://wiki.cancerimagingarchive.net/x/c4hF). In particular, make sure to cite the DOI(s) for the specific TCIA datasets you used in addition to the following paper!

## TCIA Citation

Clark, K., Vendt, B., Smith, K., Freymann, J., Kirby, J., Koppel, P., Moore, S., Phillips, S., Maffitt, D., Pringle, M., Tarbox, L., & Prior, F. (2013). The Cancer Imaging Archive (TCIA): Maintaining and Operating a Public Information Repository. Journal of Digital Imaging, 26(6), 1045–1057. https://doi.org/10.1007/s10278-013-9622-7