You can download and run this notebook locally, or you can run it for free in a cloud environment using Colab or Sagemaker Studio Lab:

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/kirbyju/TCIA_Notebooks/blob/main/TCIA_Jupyter_Learning_Lab_2023.ipynb)

[![Open In SageMaker Studio Lab](https://studiolab.sagemaker.aws/studiolab.svg)](https://studiolab.sagemaker.aws/import/github.com/kirbyju/TCIA_Notebooks/blob/main/TCIA_Jupyter_Learning_Lab_2023.ipynb)

# Course Description

Access to large, high-quality data is essential for researchers to understand disease and precision medicine pathways, especially in cancer. However, HIPAA constraints make sharing medical images outside an individual institution a complex process. [The Cancer Imaging Archive (TCIA)](https://www.cancerimagingarchive.net/) is a public service funded by the National Cancer Institute that addresses this challenge by providing hosting and de-identification services that take major burdens of data sharing off researchers.

TCIA has published over 200 unique data collections containing more than 60 million images. Recognizing that images alone are not enough to conduct meaningful research, most collections are linked to rich supporting data including patient outcomes, treatment information, genomic / proteomic analyses, and expert image analyses (segmentations, annotations, and radiomic / radiogenomic features). **In this course we will address a variety of use cases for identifying TCIA datasets of interest and downloading them via Jupyter Notebooks.**

# Learning Objectives

* Learn how TCIA makes data sharing easier for researchers, and hear a summary of existing datasets that are freely available for download
* Practice utilizing TCIA for data exploration, cohort definition, and downloading of data
* Learn how to access public and restricted access datasets using TCIA's REST APIs and other command line tools via Google Colab

# 1 Learn about Available Collections on the TCIA Website

[Browsing Collections](https://www.cancerimagingarchive.net/collections) and viewing [Analysis Results](https://www.cancerimagingarchive.net/tcia-analysis-results/) of datasets on TCIA are the easiest ways to become familiar with what is available. These pages will help you quickly identify datasets of interest, find valuable supporting data that are not available via our APIs (e.g. clinical spreadsheets and non-DICOM segmentation data), and answer the most common questions you might have about the datasets.  

# 2 Downloading with the NBIA Data Retriever

TCIA uses software called NBIA to manage its DICOM data.  One way to download TCIA data is to install the NBIA Data Retriever.  This tool provides a number of useful features such as auto-retry if there are any problems, saving data in an organized hierarchy on your hard drive (Collection > Patient > Study > Series > Images), and providing a CSV file containing key DICOM metadata about the images you've downloaded.

## 2.1 Install the NBIA Data Retriever
There are versions of this tool for Windows, Mac and Linux.  If you're working from a system with a GUI you can follow the [instructions](https://wiki.cancerimagingarchive.net/display/NBIA/Downloading+TCIA+Images) to install Data Retriever on your computer.

There is also a [command-line version of the NBIA Data Retriever](https://wiki.cancerimagingarchive.net/x/2QKPBQ) which can be installed via the steps below if you're running this notebook in a **Linux** environment.  

In [None]:
# Install NBIA Data Retriever CLI software for downloading images later in this notebook.

!mkdir -p /usr/share/desktop-directories/
!wget -P /content/NBIA-Data-Retriever https://cbiit-download.nci.nih.gov/nbia/releases/ForTCIA/NBIADataRetriever_4.4/nbia-data-retriever-4.4.2.deb
!dpkg -i /content/NBIA-Data-Retriever/nbia-data-retriever-4.4.2.deb

# NOTE: If you're working on a Linux OS that uses RPM packages, you can change the lines above to use
#       https://cbiit-download.nci.nih.gov//nbia/releases/ForTCIA/NBIADataRetriever_4.4/NBIADataRetriever-4.4-2.x86_64.rpm

## 2.2 Download a Manifest File
The NBIA Data Retriever software works by ingesting a "manifest" file that contains the DICOM Series Instance UIDs of the scans you'd like to download. How do you find these manifest files?  

Starting from the TCIA homepage, choose [Access the data](https://www.cancerimagingarchive.net/access-data) and then click on [Browse collections](https://www.cancerimagingarchive.net/collections).  After reviewing the table of datasets, let's say you decided you were interested in the [RIDER Breast MRI](https://doi.org/10.7937/K9/TCIA.2015.H1SXNUXL) collection.  We can find the URL of the manifest to download the full collection by looking at the blue "Download" button on that page.  You can download the file to your computer and upload it to Colab manually, or you can import it directly to Colab by using **wget** with the URL of the manifest file.


In [None]:
!wget https://wiki.cancerimagingarchive.net/download/attachments/22512757/doiJNLP-Fo0H1NtD.tcia

If you look at the file you'll see some configuration information at the top, followed by a list of Series Instance UIDs that are part of the dataset.  

Let's edit the manifest file to only include the first 3 UIDs in the manifest so that we can demonstrate the download process more quickly.


In [None]:
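# Copy the manifest header plus the first 3 Series Instance UIDs into a smaller sample manifest.
# Mode 'w' (rather than 'a') keeps re-runs of this cell from appending duplicate lines.
with open('doiJNLP-Fo0H1NtD.tcia', 'r') as firstfile, open('RIDER-Breast-MRI-Sample.tcia', 'w') as secondfile:
    count = 0
    for line in firstfile:
        secondfile.write(line)
        count += 1
        # Stop after the header lines and the first 3 series UIDs (9 lines in total)
        if count == 9:
            break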
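
The same trimming can be written without a hard-coded line count. The sketch below is a minimal illustration, assuming the manifest header ends with a line that starts with `ListOfSeriesToDownload=` (check this against your own file before relying on it). `trim_manifest` is a hypothetical helper name and is not part of any TCIA tool, and the sample text is made up.

```python
def trim_manifest(text, max_series=3, marker="ListOfSeriesToDownload="):
    """Return manifest text keeping the header and the first `max_series` UIDs.

    Assumption: every line up to and including the one starting with
    `marker` is header, and every non-blank line after it is a Series Instance UID.
    """
    lines = text.splitlines()
    for index, line in enumerate(lines):
        if line.strip().startswith(marker):
            header = lines[:index + 1]
            uids = [row.strip() for row in lines[index + 1:] if row.strip()]
            return "\n".join(header + uids[:max_series]) + "\n"
    raise ValueError(f"marker {marker!r} not found in manifest")


# Made-up manifest text for illustration only (not a real TCIA manifest).
sample = "\n".join([
    "exampleSetting=1",
    "ListOfSeriesToDownload=",
    "1.2.3.1", "1.2.3.2", "1.2.3.3", "1.2.3.4",
]) + "\n"

trimmed = trim_manifest(sample, max_series=3)
print(trimmed)
```

Keying on the marker line instead of a fixed count means the helper keeps working if the number of header lines differs between manifests.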

## 2.3 Open the Manifest File with the NBIA Data Retriever
Next, let's open the sample manifest file with the NBIA Data Retriever to download the actual DICOM data.  To do this, we'll call the command that launches Data Retriever and add the **--cli** flag to indicate that we want to run it from the command line (not with a GUI).  We also need to specify the path to the manifest file we want to open and then use **-d** to specify the path where we want to save the data.

**<font color='red'>After running the following command, click in the output cell, type "y," and press Enter to agree with the TCIA Data Usage Policy and start the download.</font>**

In [None]:
# download the data using NBIA Data Retriever

!/opt/nbia-data-retriever/nbia-data-retriever --cli '/content/RIDER-Breast-MRI-Sample.tcia' -d /content/

## 2.4 NBIA Data Retriever Conclusion
You should now find that the data have been saved to your machine in a well-organized hierarchy with some useful metadata in the accompanying CSV file and a license file detailing how it can be used.  Take a look at the data before moving on.

A few other notes:
* With the release of v4.4.1 the CLI Data Retriever now supports both "Descriptive" and "Classic" organization of the data.  In short, Descriptive naming provides more human-readable directory names whereas Classic names everything by machine-readable unique identifiers.  If you prefer machine-readable directory names simply add the **-cd** parameter to your download command.
* In some cases, you must specifically request access to [Collections](https://www.cancerimagingarchive.net/collections/) before you can download them.  Information about how to do this can be found on the homepage for the Collection(s) you're interested in, but will always require that you first [create a TCIA user account](https://wiki.cancerimagingarchive.net/x/xgHDAg).  Once you've created an account and obtained permission to access the restricted data you want to download, you can use your login/password to create the **credentials.txt** file that NBIA Data Retriever uses to verify your permissions.  The path to the credential file is specified using the **-l** parameter.

You can find examples showing how to do this at [TCIA_Linux_Data_Retriever_App.ipynb](https://github.com/kirbyju/TCIA_Notebooks/blob/main/TCIA_Linux_Data_Retriever_App.ipynb).
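
To see how the flags mentioned in this section fit together, here is a minimal sketch that assembles the command as an argument list. It only builds and prints the command; it does not run the Data Retriever. `build_retriever_command` is a hypothetical helper, `/content/credentials.txt` is a placeholder path, and placing the optional **-cd** and **-l** flags after **-d** is an assumption rather than a documented requirement.

```python
import shlex


def build_retriever_command(manifest, output_dir, classic_names=False,
                            credentials=None,
                            executable="/opt/nbia-data-retriever/nbia-data-retriever"):
    """Assemble the Data Retriever CLI call as an argument list."""
    command = [executable, "--cli", manifest, "-d", output_dir]
    if classic_names:
        command.append("-cd")            # Classic (UID-based) directory names
    if credentials:
        command += ["-l", credentials]   # path to the credentials file
    return command


command = build_retriever_command(
    "/content/RIDER-Breast-MRI-Sample.tcia", "/content/",
    classic_names=True, credentials="/content/credentials.txt")

# shlex.join quotes each argument so the line can be pasted into a shell.
print(shlex.join(command))
```

Keeping the command as a list avoids quoting problems when a manifest path contains spaces.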



# 3 Downloading with the REST API
The NBIA REST APIs are provided for the search and download functions used in the TCIA radiology portal and allow access to both public and limited access collections.
1. The [NBIA Search REST APIs](https://wiki.cancerimagingarchive.net/x/fILTB) allow you to perform basic queries and download data from **public** collections. These APIs do not require a TCIA account.
2. The [NBIA Search with Authentication REST APIs](https://wiki.cancerimagingarchive.net/x/X4ATBg) allow you to perform basic queries and download data from **public and limited-access** collections. These APIs require a TCIA account to create authentication tokens.
3. The [NBIA Advanced REST APIs](https://wiki.cancerimagingarchive.net/x/YoATBg) also allow access to **public and limited-access** collections, but provide query endpoints mostly geared towards developers seeking to integrate searching and downloading TCIA data into web and desktop applications. This API requires a TCIA account to create authentication tokens.

Feel free to reference this documentation if you run into issues or need clarification about anything related to the API.  

## 3.1 Setup

Rather than working directly with the API we will be relying heavily on [**tcia_utils**](https://pypi.org/project/tcia-utils/), which contains a variety of useful functions for accessing TCIA via Jupyter/Python. Docstrings detailing how to use each function are provided in the code itself, and an extensive library of tutorial notebooks can be found at https://github.com/kirbyju/TCIA_Notebooks.

By default, most functions from tcia_utils return results in JSON.  However, you can use **format = "df"** to return the results as a dataframe, or **format = "csv"** to save a CSV file in addition to returning a dataframe.

Nearly all functions allow you to specify **api_url** as a query parameter.  This is generally only needed for accessing restricted collections, but is also used if you want to access the [National Lung Screening Trial (NLST)](https://doi.org/10.7937/TCIA.HMQ8-J677) collection.  NLST lives on a separate server with a different URL from the rest of our datasets due to its size (>26,000 patients!).  
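
As a rough illustration of when **api_url** comes into play, the sketch below builds the keyword arguments for a query and adds **api_url** only when it is needed. The values `"nlst"` and `"restricted"` are the ones used later in this notebook; `series_query_kwargs` is a hypothetical helper and is not part of tcia_utils.

```python
def series_query_kwargs(collection, limited_access=False):
    """Build keyword arguments for a query, adding api_url only when needed."""
    kwargs = {"collection": collection}
    if collection == "NLST":
        kwargs["api_url"] = "nlst"          # NLST lives on its own server
    elif limited_access:
        kwargs["api_url"] = "restricted"    # limited-access data needs a token
    return kwargs


print(series_query_kwargs("Soft-tissue-Sarcoma"))
print(series_query_kwargs("NLST"))
print(series_query_kwargs("QIN-Breast-02", limited_access=True))
```

The resulting dictionary could then be unpacked into a call such as `nbia.getSeries(**series_query_kwargs("NLST"))`.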

We'll incorporate examples using these different output formats and API URLs below.  But first, let's install the tools we need and import them so they're ready for use.

**Note:** The setup steps have been updated slightly since the creation of the explainer video at https://vimeo.com/893497931 to make them more generalizable and efficient but everything else afterwards should work the same.


In [None]:
import sys

# install tcia utils
!{sys.executable} -m pip install --upgrade -q tcia_utils

In [None]:
import requests
import pandas as pd
from tcia_utils import nbia

# set logging level to INFO in Google Colab (not necessary in Jupyter)
if 'google.colab' in sys.modules:
  import logging

  for handler in logging.root.handlers[:]:
      logging.root.removeHandler(handler)

  # Set handler with level = info
  logging.basicConfig(format='%(asctime)s:%(levelname)s:%(message)s',
                      level=logging.INFO)

  print("Google Colab Logging set to INFO")

## 3.2 Download Examples

In this section we'll cover downloading data via the REST API for the following use cases:

1.   Download a full TCIA collection and preview a series
2.   Download custom results of an API query
3.   Download custom results of an NLST API query
4.   Download a "[Shared Cart](https://wiki.cancerimagingarchive.net/display/NBIA/TCIA+Radiology+Portal+User+Guide#TCIARadiologyPortalUserGuide-SharingDatainYourCart)" that was created via https://nbia.cancerimagingarchive.net/
5.   Download data from a TCIA manifest file
6.   Download images with DICOM segmentations
7.   Download images based on related clinical data
8.   Download data from a restricted collection

### 3.2.1 Download a full collection and preview a series

As mentioned earlier, you can [Browse Collections](https://www.cancerimagingarchive.net/collections) on our website to figure out what you might want to download, but you can also get a list of available collections via the API as shown below.

In [None]:
# get list of available collections as JSON
nbia.getCollections()


Let's say that we're interested in downloading the entire **Soft-tissue-Sarcoma** collection.  First we need to get a list of all Series Instance UIDs in that collection.  We can use **nbia.getSeries()** to save the JSON metadata about all series (scans) in this collection to a variable called **data**.

In [None]:
data = nbia.getSeries(collection = "Soft-tissue-Sarcoma")
print(data)

Then we can pass that **data** variable to our download function.  We'll leverage the **number** parameter here to just grab the first scan as a test.  You can remove this parameter if you want to download the full collection.

In [None]:
nbia.downloadSeries(data, number = 1)

Before moving on, take a second to review how the data are saved.  You'll note that by default the data are stored to a directory called **tciaDownload** with each directory named after the **Series Instance UID**.  While these UIDs are important to uniquely identify each scan, it can be difficult to figure out what is what unless you have additional metadata.  

In order to obtain such metadata, you can set the **format** parameter to **df** to return a dataframe containing the metadata for the files you've downloaded.  Setting **format = "csv"** will save a spreadsheet in addition to returning a dataframe.

In [None]:
nbia.downloadSeries(data, number = 1, format = "df")
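
To double-check what landed on disk, a short sketch like the one below can list each series directory and count its files. It assumes only the layout described above (one directory per Series Instance UID under **tciaDownload**). `summarize_downloads` is a hypothetical helper, and the demonstration runs on a throwaway directory with a made-up UID and empty placeholder files.

```python
from pathlib import Path
import tempfile


def summarize_downloads(root="tciaDownload"):
    """Map each series directory name under `root` to its file count."""
    root = Path(root)
    if not root.is_dir():
        return {}
    return {
        series_dir.name: sum(1 for item in series_dir.iterdir() if item.is_file())
        for series_dir in sorted(root.iterdir())
        if series_dir.is_dir()
    }


# Demonstration on a temporary directory so nothing real is touched.
with tempfile.TemporaryDirectory() as tmp:
    fake_series = Path(tmp) / "tciaDownload" / "1.2.3.4"
    fake_series.mkdir(parents=True)
    for name in ("placeholder-1.dcm", "placeholder-2.dcm"):
        (fake_series / name).write_bytes(b"")
    summary = summarize_downloads(Path(tmp) / "tciaDownload")

print(summary)
```

In the notebook itself you would call `summarize_downloads()` with no arguments after a download has finished.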

**tcia_utils** also provides a function called **viewSeries()** that we can use for a quick preview of the images here in the notebook.  If **seriesUid** is provided, the function assumes "tciaDownload/**seriesUid**/" as the path, since this is where downloadSeries() saves data.  However, you can also open a series from a custom path using the **path** parameter.

In [None]:
# example using just the series UID
nbia.viewSeries("1.3.6.1.4.1.14519.5.2.1.5168.1900.104193299251798317056218297018")

# alternate example specifying a custom path
# nbia.viewSeries(path = "/content/tciaDownload/1.3.6.1.4.1.14519.5.2.1.5168.1900.104193299251798317056218297018")

### 3.2.2 Download custom API query
For this use case, let's assume that you've recently trained a model to segment tumors in breast MRIs and are looking for data to test it with.  First, let's use **getBodyPartCounts()**, which returns the number of subjects with a given value in the Body Part Examined DICOM tag.

In [None]:
nbia.getBodyPartCounts(modality = "MR")

Note that the data are returned in JSON format by default, and it appears we have ~3,000 subjects with breast MR images.  

Next, let's use **getManufacturerCounts()** to see what kind of MRI scanners were used to acquire the images for these breast cancer subjects.  This time we'll also use the **format = "df"** parameter to make the result a little easier to read.

In [None]:
nbia.getManufacturerCounts(modality = "MR", bodyPart = "BREAST", format = "df")

Ok, let's say that we'd like to only focus on the MR scans acquired on GE scanners to see if our model works well on this kind of data.  We can use **getSeries()** to create a detailed inventory of these for our review before we download the data.

In [None]:
# getSeries with query parameters
data = nbia.getSeries(modality = "MR",
                      bodyPart = "BREAST",
                      manufacturer = "GE MEDICAL SYSTEMS",
                      format = "df")

display(data)

Looks like we've got over 16,000 scans in our results.  As you can see from the **SeriesDescription** column, there are a variety of different types of scans that match these criteria.  

Let's say that the tool or model you're developing is aimed specifically at analyzing T2 MR data, so let's filter our dataframe to only keep the scans that mention "T2" in the **SeriesDescription** or **ProtocolName** columns.

In [None]:
# case-insensitive match for 't2'; na=False treats missing descriptions as non-matches
t2_data = data[(data['ProtocolName'].str.contains('t2', case=False, na=False)) |
               (data['SeriesDescription'].str.contains('t2', case=False, na=False))].reset_index(drop=True)


display(t2_data)
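
If you are working with the default JSON output instead of a dataframe, the same keyword filter can be sketched in plain Python. The records below are made up and only mimic the shape of series metadata. The field names `SeriesDescription`, `ProtocolName` and `SeriesInstanceUID` come from the columns used above, and `mentions_keyword` is a hypothetical helper.

```python
def mentions_keyword(record, keyword, fields=("SeriesDescription", "ProtocolName")):
    """True if any of `fields` in a series record contains `keyword` (case-insensitive)."""
    keyword = keyword.lower()
    # `or ""` covers fields that are missing or explicitly set to None
    return any(keyword in str(record.get(field) or "").lower() for field in fields)


# Made-up records for illustration only.
records = [
    {"SeriesInstanceUID": "1.2.3.1", "SeriesDescription": "Ax T2 FSE", "ProtocolName": None},
    {"SeriesInstanceUID": "1.2.3.2", "SeriesDescription": "DCE", "ProtocolName": "breast t2 sag"},
    {"SeriesInstanceUID": "1.2.3.3", "SeriesDescription": "T1 post"},
]

t2_records = [record for record in records if mentions_keyword(record, "t2")]
print([record["SeriesInstanceUID"] for record in t2_records])
```

A plain substring match also keeps descriptions where "t2" appears inside a longer token, so it is worth reviewing the filtered list before downloading.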

Now we can provide our **t2_data** dataframe as input to **downloadSeries()** using **input_type = "df"**.  By setting this as the **input_type** our download function will look in the SeriesInstanceUID column to determine which series to download.

In [None]:
# download the selected series_uids
nbia.downloadSeries(t2_data, input_type = "df", number = 3, format = "df")

### 3.2.3 Download custom NLST API query
Let's show a simplified example where we look for a specific modality and manufacturer within the [National Lung Screening Trial (NLST) Collection](https://doi.org/10.7937/TCIA.HMQ8-J677).  Remember that we have to set **api_url = "nlst"** in our functions for this to work since this data lives on a different server, but otherwise the steps are basically the same.  

Let's start by checking what modalities are available in the collection.

In [None]:
nbia.getModalityCounts(collection = "NLST", format = "df", api_url = "nlst")

Looks like CT is the only modality for this collection.  Let's check the manufacturers.

In [None]:
nbia.getManufacturerCounts(collection = "NLST", format = "df", api_url = "nlst")

Let's say we're interested in Philips scanners this time and just want to know how many scans we'll be downloading rather than doing additional filtering.

In [None]:
# getSeries with query parameters
data = nbia.getSeries(collection = "NLST",
               modality = "CT",
               manufacturer = "Philips",
               api_url = "nlst")

print(len(data), 'Series returned')

Nearly 14,000 scans!  Looks like we'll have plenty of data for training this time.  Here's the final command to download the data.  We don't need to specify an **input_type** this time since we are passing the default JSON data that we stored with **getSeries()**.

In [None]:
# pass the series metadata to downloadSeries()
df = nbia.downloadSeries(data, number = 1, api_url = "nlst", format = "df")
display(df)

### 3.2.4 Download a "shared cart"
If you're collaborating with someone who prefers to use https://nbia.cancerimagingarchive.net to define their cohort rather than the API, they can create a "[Shared Cart](https://wiki.cancerimagingarchive.net/display/NBIA/TCIA+Radiology+Portal+User+Guide#TCIARadiologyPortalUserGuide-SharingDatainYourCart)" which includes a specific set of scans they'd like to share with others.

After creating a Shared Cart the user receives a URL that looks like https://nbia.cancerimagingarchive.net/nbia-search/?saved-cart=nbia-49121659384603347 which can be shared with others to let them easily download the same data.  Try clicking the link to see what this looks like on the TCIA website.  Then use the code below to see how you can use the cart name at the end of the URL to download the related scans via the API.

In [None]:
# getSharedCart metadata
data = nbia.getSharedCart(name = "nbia-49121659384603347")
print(len(data), 'Series returned')
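
When a collaborator sends the full link rather than just the cart name, the name can be pulled out of the URL with the standard library. This is a small sketch: `cart_name_from_url` is a hypothetical helper, and it assumes the cart is always carried in the `saved-cart` query parameter, as in the example URL above.

```python
from urllib.parse import urlparse, parse_qs


def cart_name_from_url(url):
    """Return the value of the saved-cart query parameter from a Shared Cart URL."""
    query = parse_qs(urlparse(url).query)
    names = query.get("saved-cart")
    if not names:
        raise ValueError("URL has no saved-cart parameter")
    return names[0]


cart = cart_name_from_url(
    "https://nbia.cancerimagingarchive.net/nbia-search/?saved-cart=nbia-49121659384603347")
print(cart)
```

The returned string can then be passed as the **name** argument of **getSharedCart()**.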

We'll skip the use of the **number** parameter this time since the full cart is only 4 series.  Let's also try **format = "csv"** to save a spreadsheet of the metadata in addition to returning a dataframe.  Feel free to open the resulting CSV file if you like.  It will look identical to the dataframe.

In [None]:
# pass the series metadata to downloadSeries()
df = nbia.downloadSeries(data, format = "csv")
display(df)

### 3.2.5 Download data from a TCIA manifest file

When working with manifest files you can install the NBIA Data Retriever to open the manifest and download the data as discussed earlier in this notebook.  However, there may be cases where you don't have administrative rights to install software or prefer using the REST API to download an existing TCIA manifest file.  

In order to demonstrate this use case, let's assume that after you [Browse Collections](https://www.cancerimagingarchive.net/collections) you are interested in the [RIDER Breast MRI](https://doi.org/10.7937/K9/TCIA.2015.H1SXNUXL) collection.  We can find the URL of the manifest to download the full collection by looking at the blue "Download" button on that page.  After copying and pasting that URL into the code below we can download the manifest and save it.  

In [None]:
# download manifest file from RIDER Breast MRI page
manifest = requests.get("https://wiki.cancerimagingarchive.net/download/attachments/22512757/doiJNLP-Fo0H1NtD.tcia?version=1&modificationDate=1534787017928&api=v2")
with open('RIDER_Breast_MRI.tcia', 'wb') as f:
    f.write(manifest.content)

If you open this manifest file in a text editor you'll notice that it contains several lines of download parameters that precede a list of Series Instance UIDs to download.  If we set **input_type = "manifest"** when calling **downloadSeries()** we can provide the path/filename to **downloadSeries()** rather than the JSON or dataframes that we have shown in other examples.  This will automatically extract the UIDs from the file and download them.

In [None]:
df = nbia.downloadSeries("RIDER_Breast_MRI.tcia", input_type = "manifest", number = 3, format = "df")
display(df)
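
If you want to inspect which UIDs a manifest contains before downloading, the sketch below pulls them out with a regular expression. It is an illustration only and not necessarily how **downloadSeries()** parses the file. It assumes that UID lines consist solely of digits separated by dots, while header lines contain other characters such as `=`. `extract_series_uids` is a hypothetical helper and the sample text is made up.

```python
import re

# A line made only of dot-separated groups of digits, e.g. 1.2.840.1
UID_PATTERN = re.compile(r"^\d+(\.\d+)+$")


def extract_series_uids(manifest_text):
    """Return the lines of a manifest that look like DICOM UIDs, in file order."""
    uids = []
    for line in manifest_text.splitlines():
        candidate = line.strip()
        if UID_PATTERN.match(candidate):
            uids.append(candidate)
    return uids


# Made-up manifest text for illustration only.
sample_manifest = """exampleSetting=4.0
ListOfSeriesToDownload=
1.2.3.10
1.2.3.11

1.2.3.12
"""

uids = extract_series_uids(sample_manifest)
print(len(uids), "UIDs found:", uids)
```

A list produced this way could be handed to **downloadSeries()** with **input_type = "list"**, which is demonstrated in the next section.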

### 3.2.6 Download images with DICOM segmentations
Here we'll walk through some steps to identify an example DICOM segmentation file, find the corresponding reference series and visualize them together inside the notebook.  There are two DICOM modalities that are commonly used for storing segmentation data.  These are **SEG** and **RTSTRUCT**.

The best way to get a broad sense of what kinds of DICOM segmentation data are available in TCIA is to review the [Browse Collections](https://www.cancerimagingarchive.net/collections) and [Browse Analysis Results](https://www.cancerimagingarchive.net/tcia-analysis-results/) pages.  You can use the filter box on both pages to limit the results in the table to **SEG** or **RTSTRUCT** on the collections page, or you could filter for **Segmentation** on the Analysis Results page.  

It's worth noting that the vast majority of the Analysis Results datasets contain segmentations so that won't reduce the list by very much.  Further, we often receive segmentations in other formats (e.g. NIfTI, NRRD) which are not discoverable via the NBIA API since this system only handles DICOM data.  However, that makes the [Browse Analysis Results](https://www.cancerimagingarchive.net/tcia-analysis-results/) page a great place to look for these other kinds of segmentations!

In any case, let's say that we are a researcher interested in kidney cancer so we filtered for **kidney** and **SEG** on the [Browse Collections](https://www.cancerimagingarchive.net/collections) page and discovered the [C4KC-KiTS](https://doi.org/10.7937/TCIA.2019.IX49E8NX) collection.  This dataset contains CT scans and segmentations from subjects from the training set of the [2019 Kidney and Kidney Tumor Segmentation Challenge (KiTS19)](https://kits19.grand-challenge.org/) in DICOM SEG format.

Let's start by getting an inventory of all scans in the collection using **nbia.getSeries()** and then taking a look at the first patient's data.

In [None]:
df = nbia.getSeries(collection = "C4KC-KiTS", format = "df")
# use "sorted_df" rather than "sorted" so Python's built-in sorted() is not shadowed
sorted_df = df.sort_values(["PatientID", "SeriesDescription"])
sorted_df.head(4)

Here we can see that patient KiTS-00000 has 3 CT series and one SEG series.  How do we know which one of the CTs goes with the SEG?  

The vast majority of SEG/RTSTRUCT data in TCIA has the Reference Series UID tag populated, which will tell you the Series Instance UID of the images that were used to create the segmentation.  However, in some rare cases the data was submitted to us without this information present.  We encourage all future submitters to be sure to populate this tag to make it easier for others to figure out the relationship between segmentations and their images!

In any case, let's see if this C4KC-KiTS collection has this information.  First, we'll need to save the SEG Series Instance UID to a variable.


In [None]:
# take the first SEG series in sorted order (this belongs to the first patient)
segSeries = sorted_df.loc[sorted_df['Modality'] == 'SEG', 'SeriesInstanceUID'].iloc[0]

print(segSeries)

Now we can pass that UID to **nbia.getSegRefSeries()** to try and determine the Reference Series Instance UID of the CT scan that goes with the segmentation.

In [None]:
refSeries = nbia.getSegRefSeries(segSeries)

print(refSeries)

Hurray! The submitter populated the Referenced Series UID tag in their SEG data.  Now let's download these two series.  This time we'll set **input_type = "list"** so that we can send a simple Python list of UIDs to the function to be downloaded.

In [None]:
nbia.downloadSeries([refSeries, segSeries], input_type= "list", format = "df")
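
To repeat this pairing for every segmentation in an inventory rather than a single one, the logic can be sketched as a loop over series records. In the sketch below `lookup_reference` is a stand-in for whatever returns the referenced UID (for example **nbia.getSegRefSeries**). Here it is replaced by a dictionary lookup on made-up UIDs so the example runs offline, and returning `None` for a missing reference is an assumption of this sketch, not documented behavior.

```python
def pair_segmentations(records, lookup_reference, modalities=("SEG",)):
    """Return (segmentation UID, referenced series UID) pairs for matching records."""
    pairs = []
    for record in records:
        if record.get("Modality") in modalities:
            seg_uid = record["SeriesInstanceUID"]
            pairs.append((seg_uid, lookup_reference(seg_uid)))
    return pairs


# Made-up inventory and reference table for illustration only.
inventory = [
    {"PatientID": "P-001", "Modality": "CT", "SeriesInstanceUID": "1.2.3.1"},
    {"PatientID": "P-001", "Modality": "SEG", "SeriesInstanceUID": "1.2.3.2"},
    {"PatientID": "P-002", "Modality": "SEG", "SeriesInstanceUID": "1.2.3.4"},
]
fake_references = {"1.2.3.2": "1.2.3.1"}  # the second SEG has no known reference

pairs = pair_segmentations(inventory, fake_references.get)
print(pairs)

# Flatten the complete pairs into one list of UIDs, skipping missing references.
to_download = [uid for pair in pairs if pair[1] is not None for uid in pair]
print(to_download)
```

Passing the lookup in as a function keeps the pairing logic testable without network access.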

Now we can look at the images and segmentation together with **viewSeriesAnnotation()**.  

You can move the slider to flip through the images and toggle the segmentation layer on/off.  Once the slider is selected, sometimes it's easier to move between images using the left/right arrow keys on your keyboard than to use your mouse.



In [None]:
nbia.viewSeriesAnnotation(seriesUid = refSeries, annotationUid = segSeries)

This function is only meant to be a quick and dirty way to preview the data.  There are many edge cases in which the segmentation data will not load properly.  If you are looking for more comprehensive solutions to visualize images together with segmentations please check out the [3D Slicer](https://slicer.org/) desktop application or [itkWidgets](https://github.com/kirbyju/TCIA_Notebooks/blob/main/TCIA_RTStruct_SEG_Visualization_with_itkWidgets.ipynb) if you're trying to do this inside of a notebook.

### 3.2.7 Download images based on demographics and treatment data

Basic demographic data such as patient sex, ethnicity and age at the time of the imaging study can sometimes be found within DICOM tags.  When these data are populated you can use **getStudy()** to access this information.  Here's a random example showing this with the [Prostate-MRI-US-Biopsy](https://doi.org/10.7937/TCIA.2020.A61IOC1A) collection.

In [None]:
nbia.getStudy(collection = "Prostate-MRI-US-Biopsy", format = "df")

Many collections in TCIA also come with clinical data about the subjects that are not contained in the DICOM itself.  You can find collections which have such data by going to the [Browse Collections](https://www.cancerimagingarchive.net/browse-collections/) page and typing **clinical** into the filter box at the top of the table.  

For the larger data collection initiatives sponsored by NCI/NIH such as [The Cancer Genome Atlas (TCGA), the Clinical Proteomic Tumor Analysis Consortium (CPTAC), the Cancer Moonshot Biobank (CMB)](https://www.cancerimagingarchive.net/imaging-proteogenomics/) and the [NCI Clinical Trial Network (NCTN)](https://wiki.cancerimagingarchive.net/x/BQHDAg) these clinical data are generally stored in external databases which can be found in the **Additional Resources** section of the pages describing those datasets.  You can see an example of accessing clinical data from NCI's Genomic Data Commons website and then using that to define a custom cohort of radiology images at https://github.com/kirbyju/TCIA_Notebooks/blob/main/TCGA/TCGA_Clinical.ipynb.

However, for **community-proposed** datasets we typically host the data directly in the form of CSV files provided by the submitters.  To demonstrate this, let's assume that you're interested in looking at CT images from COVID-19 patients and are looking for additional labels that could be used to train an AI model.  

First, we'll need to go to the [Browse Collections](https://www.cancerimagingarchive.net/browse-collections/) page and type both **clinical** and **covid** into the filter box.  Looking at the resulting datasets, let's say we've decided to focus on [COVID-19-NY-SBU](https://doi.org/10.7937/TCIA.BBAG-2923) since it has many more patients than the others.  

After clicking the link to view the homepage for this dataset we notice that there are two spreadsheets available.  Let's take a quick look at them to see what we can learn by copying the URLs and reading them into a dataframe.  We'll start with the file on the page called **Clinical data**.

In [None]:
clinical_data = pd.read_csv("https://www.cancerimagingarchive.net/wp-content/uploads/deidentified_overlap_tcia.csv.cleaned.csv_20210806.csv")

display(clinical_data)

Wow, this looks pretty extensive! Let's take a quick look at the **Clinical data template** file now.

In [None]:
clinical_data_template = pd.read_csv("https://www.cancerimagingarchive.net/wp-content/uploads/deidentified_overlap_tcia.csv.cleaned.csv.template_20210806.csv")

display(clinical_data_template)

It appears the **Clinical Data Template** file is summarizing the contents of the clinical data.

After reviewing the contents of these files, let's say that we're interested in trying to predict **last.status** from CT scans. To achieve this, let's start by creating a list of the patient IDs in each category.

In [None]:
deceased = clinical_data.loc[clinical_data['last.status'] == 'deceased', 'to_patient_id'].tolist()
discharged = clinical_data.loc[clinical_data['last.status'] == 'discharged', 'to_patient_id'].tolist()

print(len(deceased), "deceased patients.")
print(len(discharged), "discharged patients.")

This dataset has a number of different modalities in it, so let's get an inventory of all of the available CT scans.

In [None]:
ct_scans = nbia.getSeries(collection = "COVID-19-NY-SBU", modality = "CT", format = "df")

display(ct_scans)

Now let's use the lists we created previously to add the **last.status** column to our inventory of CT scans.

In [None]:
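# Create the 'last.status' column and populate it based on the 'PatientID' values.
# Patients found in neither list are left as None instead of being labeled 'discharged' by default.
ct_scans['last.status'] = ct_scans['PatientID'].apply(
    lambda x: 'deceased' if x in deceased else ('discharged' if x in discharged else None))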


In [None]:
display(ct_scans)
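
The labeling step can also be expressed as a dictionary lookup, which makes it explicit what happens to patients that appear in neither list. The sketch below uses plain Python records and made-up patient IDs. `build_status_lookup` and `label_records` are hypothetical helpers, and labeling unknown patients as `None` is a choice made for this illustration.

```python
def build_status_lookup(groups):
    """Turn {"status": [patient ids]} into {patient id: "status"}."""
    lookup = {}
    for status, patient_ids in groups.items():
        for patient_id in patient_ids:
            lookup[patient_id] = status
    return lookup


def label_records(records, lookup, column="last.status"):
    """Return copies of the records with a status column; unknown patients get None."""
    return [{**record, column: lookup.get(record["PatientID"])} for record in records]


# Made-up IDs for illustration only.
lookup = build_status_lookup({"deceased": ["A01"], "discharged": ["A02", "A03"]})
scans = [
    {"PatientID": "A01", "SeriesInstanceUID": "1.2.3.1"},
    {"PatientID": "A03", "SeriesInstanceUID": "1.2.3.2"},
    {"PatientID": "A99", "SeriesInstanceUID": "1.2.3.3"},  # not in the clinical data
]

labeled = label_records(scans, lookup)
print([(row["PatientID"], row["last.status"]) for row in labeled])
```

Rows labeled `None` can then be reviewed or dropped before training, so they do not end up in a category by accident.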

Now that we have everything we need in our dataframe, we can download the actual images.

**Reminder:** We're using the **number** parameter to download 1 sample series.  Remove this to download the full dataset.

In [None]:
nbia.downloadSeries(ct_scans, input_type = "df", number = 1, format = "df")

### 3.2.8 Download data from a restricted collection
In some cases, you must specifically request access to collections before you can download them.  These are listed as **limited access** on the [Browse Collections](https://www.cancerimagingarchive.net/collections/) page. The steps to request access may vary depending on the collection, but will always require that you first [create a TCIA user account](https://wiki.cancerimagingarchive.net/x/xgHDAg). Once you've created an account, you can use your login/password to create an API token with the **getToken()** function from **tcia_utils** to verify your permissions.

Tokens are valid for 2 hours and must be refreshed after that point.  The tcia_utils functions all check to see if you need to refresh your token when they're called and will take care of that for you automatically if necessary.

In [None]:
nbia.getToken()
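
To make the refresh rule concrete, here is a minimal sketch of the kind of expiry check described above. It illustrates the idea only and is not how tcia_utils implements it. The 2-hour lifetime comes from the text above, while the 5-minute safety margin and the name `needs_refresh` are assumptions made for this example.

```python
from datetime import datetime, timedelta

TOKEN_LIFETIME = timedelta(hours=2)  # lifetime stated in this notebook


def needs_refresh(issued_at, now=None, lifetime=TOKEN_LIFETIME,
                  margin=timedelta(minutes=5)):
    """True once the token is within `margin` of expiring (or already expired)."""
    now = now or datetime.now()
    return now >= issued_at + lifetime - margin


issued = datetime(2024, 1, 1, 12, 0)
print(needs_refresh(issued, now=datetime(2024, 1, 1, 13, 0)))   # one hour in
print(needs_refresh(issued, now=datetime(2024, 1, 1, 13, 56)))  # inside the margin
```

Refreshing slightly early avoids a request failing because the token expired while a long download was still in progress.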

Let's say that we're interested in the [QIN-Breast-02](https://doi.org/10.7937/TCIA.2019.4cfm06rr) collection. As you can see on the collection page, you must email help@cancerimagingarchive.net to request access to the data. Once you've received approval we can use **nbia.getSeries()** to get a full list of series UIDs in this restricted collection by including **api_url = "restricted"** as a parameter.

In [None]:
# getSeries with query parameters
data = nbia.getSeries(collection = "QIN-Breast-02",
                      api_url = "restricted")

print(len(data), 'Series returned')

Now we can download those scans.  Don't forget to include **api_url = "restricted"** in the download functions as well!

In [None]:
# pass the series metadata to downloadSeries()
df = nbia.downloadSeries(data, number = 3, api_url = "restricted", format = "df")
display(df)

# Additional Resources
The following pages on TCIA may be of special interest to deep learning researchers:

1. [Finding Annotated Data for AI/ML on TCIA](https://wiki.cancerimagingarchive.net/x/TAGJAw) provides basic guidance for finding datasets that could be useful for deep learning tasks.
2. [Challenge Competitions using TCIA data](https://wiki.cancerimagingarchive.net/x/nYIaAQ) can be useful for benchmarking your model's performance.
3. [ACR Data Science Institute's Define AI Directory](https://www.acrdsi.org/DSI-Services/Define-AI) links clinically relevant AI use-cases to TCIA datasets that can be used to address them.
4. [Additional TCIA Notebooks](https://github.com/kirbyju/TCIA_Notebooks) about accessing and visualizing data are available.

# Acknowledgements
TCIA is funded by the [Cancer Imaging Program (CIP)](https://imaging.cancer.gov/), a part of the United States [National Cancer Institute (NCI)](https://www.cancer.gov/).  It is managed by the [Frederick National Laboratory for Cancer Research (FNLCR)](https://frederick.cancer.gov/) and hosted by the [University of Arkansas for Medical Sciences (UAMS)](https://www.uams.edu/).

This notebook was created by [Justin Kirby](https://www.linkedin.com/in/justinkirby82/). If you leverage this notebook or any TCIA datasets in your work, please be sure to comply with the [TCIA Data Usage Policy](https://wiki.cancerimagingarchive.net/x/c4hF). In particular, make sure to cite the DOI(s) for the specific TCIA datasets you used in addition to the following paper!

## TCIA Citation

Clark, K., Vendt, B., Smith, K., Freymann, J., Kirby, J., Koppel, P., Moore, S., Phillips, S., Maffitt, D., Pringle, M., Tarbox, L., & Prior, F. (2013). The Cancer Imaging Archive (TCIA): Maintaining and Operating a Public Information Repository. Journal of Digital Imaging, 26(6), 1045–1057. https://doi.org/10.1007/s10278-013-9622-7