You can download and run this notebook locally, or you can run it for free in a cloud environment using Colab or SageMaker Studio Lab:

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/RSNA/AI-Deep-Learning-Lab-2024/blob/main/sessions/tcia/TCIA_RSNA_2024_Deep_Learning_Lab.ipynb)

[![Open In SageMaker Studio Lab](https://studiolab.sagemaker.aws/studiolab.svg)](https://studiolab.sagemaker.aws/import/github.com/RSNA/AI-Deep-Learning-Lab-2024/blob/main/sessions/tcia/TCIA_RSNA_2024_Deep_Learning_Lab.ipynb)

# TCIA RSNA 2024 Deep Learning Lab

Access to large, high-quality datasets is essential for researchers to understand disease and precision medicine pathways, especially in cancer. However, there are many challenges associated with publishing and utilizing radiological imaging data of human subjects. In this hands-on learning lab we'll teach you some solutions to these challenges, including how to properly de-identify and publish your DICOM data as well as how to access freely available datasets that have been published in online archives.

This notebook was developed to demonstrate command-line and API-based options for accessing data from [The Cancer Imaging Archive](https://www.cancerimagingarchive.net/).  You can view the full course description, requirements and objectives in the [RSNA 24 Meeting Program](https://reg.meeting.rsna.org/flow/rsna/rsna24/MeetingCentralRSNA24/page/session-catalog/session/1715624815092017Z13l).

# Brief intro to TCIA

* [TCIA Overview Slides](https://github.com/RSNA/AI-Deep-Learning-Lab-2024/blob/main/sessions/tcia/2024-12-03_RSNA_Deep_Learning_Lab_Intro.pptx)

Primary datasets are organized into [Collections](https://www.cancerimagingarchive.net/browse-collections/), which can include images and supporting data such as clinical, genomics, proteomics and image analyses (classifications, segmentations, etc).  There are also secondary [Analysis Result](https://www.cancerimagingarchive.net/browse-analysis-results/) datasets, which are published by users who have downloaded the Collections and generated new data derived from them.  Generally these analysis datasets are additional image classifications and segmentations.

In this learning lab we'll explore both types of datasets and a couple of the different systems we use to host them via the following use cases.


# Deep Learning Lab Use Cases

1. You are a researcher interested in finding datasets that already have expert tumor segmentation labels to validate the performance of an AI tumor segmentation model you've developed.
2. You are a researcher interested in doing a multi-modal study that includes clinical, genomic and proteomic classification labels that you can try to predict from the images.
3. You are a researcher that wants to publish or share your analyses of existing TCIA image data with colleagues.

# Setup

The following installs and imports **[tcia_utils](https://pypi.org/project/tcia-utils/)**, which contains a variety of useful functions for accessing TCIA via Python and Jupyter Notebooks.  It also installs [simpleDicomViewer](https://pypi.org/project/simpleDicomViewer/) to allow us to view some of the data we'll download directly in the notebook.

In [None]:
import sys

# install tcia utils
!{sys.executable} -m pip install --upgrade -q tcia_utils

# install simpleDicomViewer and forked pydicom-seg dependency
!{sys.executable} -m pip install --upgrade -q git+https://github.com/kirbyju/pydicom-seg.git@master
!{sys.executable} -m pip install --upgrade -q simpleDicomViewer

Next we'll import modules to help us work with a few different TCIA APIs.  We'll also reset the logging configuration so that, if you're on Colab, you can see more of the INFO statements that tell you what's going on as we run our commands.

In [None]:
import requests
import pandas as pd
import json
import io
from tcia_utils import wordpress
from tcia_utils import nbia
from simpleDicomViewer import dicomViewer

# set logging level to INFO
import logging

for handler in logging.root.handlers[:]:
    logging.root.removeHandler(handler)

# Set handler with level = info
logging.basicConfig(format='%(asctime)s:%(levelname)s:%(message)s',
                    level=logging.INFO)

# Use Case: Find datasets with tumor labels to test AI segmentation models

Many collections on TCIA contain tumor segmentations which can be used for training and testing artificial intelligence models.  In this section of the course we'll review some tips and tricks for finding them using our Wordpress API (aka Collection Manager).  It contains metadata about the datasets we host including descriptive summaries, stats about the files available for download, citation requirements, related publications and versioning info. Full documentation about this API can be found at https://www.cancerimagingarchive.net/collection-manager-rest-api/, but we'll rely on the **wordpress** module in **tcia_utils** to simplify some common tasks.



## Exploring collection metadata
In order to get metadata about collections with tcia_utils we'll use the **wordpress.getCollections()** function and save the results to a dataframe that we'll be able to use to start looking for relevant datasets with segmentations.

**Note:** You can mouse over function names in Colab to view their docstrings. These provide helpful explanations about what the functions do and the available parameters you can use with them.



In [None]:
# select fields to retrieve
fields = ["id", "slug", "collection_page_accessibility", "link", "cancer_types",
          "collection_doi", "cancer_locations", "collection_status", "species",
          "versions", "citations", "collection_title", "version_number",
          "date_updated", "subjects", "collection_short_title", "data_types",
          "supporting_data", "program", "collection_summary",
          "collection_downloads", "related_analysis_results"]

# request metadata
collections = wordpress.getCollections(format = "df", fields = fields, file_name = "tciaCollections.csv", removeHtml = "yes")

collections

## Filtering for datasets with segmentation labels
As you can see, there's quite a bit of information in this dataframe.  In order to find the ones with segmentations, the most important columns to check are **data_types** and **supporting_data**.

Let's look at data_types first.  DICOM provides support for segmentations using SEG and RTSTRUCT modalities.  Many popular open-source tools export these labels in other formats.  Popular formats include NIFTI, NRRD, and MHA.  TCIA contains segmentation data in pretty much all of these formats.  When querying our datasets, you can filter for **SEG and RTSTRUCT** to catch the DICOM segmentations.  If a Collection contains this type of data in other formats we simply list them as **Segmentations**.

Seasoned data scientists can filter for these values in our Collections dataframe using regular Pandas commands, but to help make the course more beginner friendly we'll use the **searchDf()** helper function from tcia_utils to make it even easier.  We'll include our search terms as the first parameter and the dataframe we want to filter as the second one.

In [None]:
# note that 'seg' catches both the SEG (DICOM) and Segmentation values here
segs = wordpress.searchDf(['seg','rtstruct'], column_name = "data_types", dataframe = collections)

segs
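If you'd rather use plain pandas, a similar filter can be sketched as shown here.  This is only an approximation run on a toy dataframe with made-up rows.  **filter_list_column()** is a hypothetical helper (it is not part of tcia_utils) and it may not match the behavior of **searchDf()** in every case.

```python
import re

import pandas as pd


def filter_list_column(df, column, terms):
    """Keep rows where any term appears (case-insensitively) in the column.

    Hypothetical helper: cells are converted to text first, so it works
    whether a cell holds a list, a plain string, or a missing value.
    """
    pattern = "|".join(re.escape(term) for term in terms)
    as_text = df[column].astype(str)
    mask = as_text.str.contains(pattern, case=False, na=False)
    return df[mask]


# Toy stand-in for the collections dataframe (made-up rows)
toy = pd.DataFrame({
    "collection_short_title": ["A", "B", "C", "D"],
    "data_types": [["CT", "SEG"], ["MR"], ["CT", "RTSTRUCT"], ["PT", "Segmentation"]],
})

toy_segs = filter_list_column(toy, "data_types", ["seg", "rtstruct"])
print(toy_segs["collection_short_title"].tolist())
```

Converting each cell to text keeps the sketch short, at the cost of also matching substrings you may not intend, so it's worth spot checking the results.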

Let's take advantage of the AI-driven coding features of Colab to auto-generate some code to plot this data in a bar chart.

In [None]:
# prompt: create a bar chart with plotly showing how many collections have SEG vs RTSTRUCT vs Segmentation in the data_type column.  Note that the values in this column are lists.

import plotly.express as px

# Count occurrences of 'SEG', 'RTSTRUCT', and 'Segmentation' in the 'data_types' column
seg_count = 0
rtstruct_count = 0
segmentation_count = 0

for index, row in segs.iterrows():
    data_types = row['data_types']
    if isinstance(data_types, list):  # Check if data_types is a list
        if 'SEG' in data_types:
            seg_count += 1
        if 'RTSTRUCT' in data_types:
            rtstruct_count += 1
        if 'Segmentation' in data_types:
            segmentation_count += 1

# Create a DataFrame for the bar chart
import pandas as pd
data = {'Data Type': ['SEG', 'RTSTRUCT', 'Segmentation'],
        'Count': [seg_count, rtstruct_count, segmentation_count]}
df = pd.DataFrame(data)

# Create the bar chart using Plotly Express
fig = px.bar(df, x='Data Type', y='Count',
             title='Number of Collections with Segmentation Data Types',
             labels={'Count': 'Number of Collections'})
fig.show()
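The counting step in the generated code can also be written more compactly.  Here's a minimal sketch using **collections.Counter** from the standard library, run on made-up rows rather than the real dataframe.  **count_data_types()** is a hypothetical helper name.

```python
from collections import Counter


def count_data_types(rows, wanted):
    """Count how many rows (lists of data types) contain each wanted value."""
    counts = Counter()
    for data_types in rows:
        if not isinstance(data_types, list):
            continue  # skip missing or malformed cells
        for value in set(data_types) & set(wanted):
            counts[value] += 1
    # Return the counts in the requested order, including zeros
    return {value: counts[value] for value in wanted}


# Made-up rows standing in for segs['data_types']
toy_rows = [["CT", "SEG"], ["CT", "RTSTRUCT", "SEG"], ["MR", "Segmentation"], None]
toy_counts = count_data_types(toy_rows, ["SEG", "RTSTRUCT", "Segmentation"])
print(toy_counts)
```

With the real data you could pass `segs['data_types']` as the first argument and feed the resulting dictionary to the plotting code.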

Let's say that you were specifically interested in working with a lung cancer tumor segmentation model.  Try using the Colab AI assistant to help you filter the **segs** dataframe down to only the lung datasets.

**Note:** You can click "generate with AI" or you can type out what you need help with using code comments and it will auto-suggest solutions when you go to the next line.

In [None]:
# prompt: filter my "segs" dataframe for rows that contain "lung"

# Filter the 'segs' DataFrame for rows containing "lung" in the 'cancer_types' column.
lung_segs = segs[segs['cancer_types'].astype(str).str.contains("lung", case=False, na=False)]

lung_segs

You can now review the datasets you've identified to decide which ones you'd like to include in your project.  In the interest of time, let's say that after reviewing them you decided you wanted to work with [RIDER Lung CT](https://doi.org/10.7937/k9/tcia.2015.u1x8a5nr).  Go ahead and take a quick look at this page so we can see how it looks compared to the API output we'll generate in a minute.

In order to download the data from the links that you see in the "Data Access" table of the RIDER Lung CT web page, we need to first use the **getDownloads()** function to look up the **id** values in the **collection_downloads** column.

In [None]:
# prompt: save the ids in the 'collection_downloads' column of my 'lung' dataframe to a list where collection short title = "RIDER Lung CT"

# Assuming 'collections' dataframe is already loaded as in the provided code.

rider_lung_ct_ids = []

# Iterate through the rows of the dataframe
for index, row in collections.iterrows():
    if row['collection_short_title'] == "RIDER Lung CT":
        # Extract and append the ids from the 'collection_downloads' column.
        # Handle cases where 'collection_downloads' might be a string, list, or missing.
        if isinstance(row['collection_downloads'], list):
            rider_lung_ct_ids.extend(row['collection_downloads'])
        elif isinstance(row['collection_downloads'], str):
            rider_lung_ct_ids.append(row['collection_downloads'])
        # You can add an 'else' block to handle missing values if needed,
        # e.g. else: print(f"Missing collection_downloads for {row['collection_short_title']}")
        break # Assuming there's only one row with that title

rider_lung_ct_ids

Note that we have 2 ids, which correspond to the 2 download rows you saw on the web page.  Let's feed these ids to the getDownloads() function now.  We'll also provide a subset of the more interesting fields to focus on rather than returning them all.

In [None]:
fields = ["id", "date_updated", "download_title", "data_license", "download_access",
          "data_type", "file_type", "download_size", "download_size_unit",
          "subjects", "study_count", "series_count", "image_count",
           "download_type", "download_url", "download_file", "search_url"]

downloads = wordpress.getDownloads(ids = rider_lung_ct_ids, fields=fields, format = "df")
downloads

Looking at the results, we can see that the first row contains the DICOM data with both the CT images and segmentations.  Let's grab that download_url and use it to save the file.

In [None]:
# prompt: Using dataframe downloads: download the file from the URL in the download_url field where file_type contains "DICOM". Note that file_type is a list.

import pandas as pd
import requests

# Assuming 'downloads' is your pandas DataFrame
# Find rows where 'file_type' contains "DICOM"
dicom_downloads = downloads[downloads['file_type'].astype(str).str.contains("DICOM")]

# Iterate through the filtered DataFrame and download the files
for index, row in dicom_downloads.iterrows():
    url = row['download_url']
    # Get the filename from the URL
    filename = url.split('/')[-1]
    print(f"Downloading file from: {url}")

    try:
        response = requests.get(url, stream=True)
        response.raise_for_status()  # Raise an exception for bad status codes

        with open(filename, 'wb') as f:
            for chunk in response.iter_content(chunk_size=8192):
                f.write(chunk)
        print(f"Downloaded {filename} successfully.")
    except requests.exceptions.RequestException as e:
        print(f"Error downloading {filename}: {e}")
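One thing to watch for in the generated code is that splitting the URL on "/" keeps any query string as part of the file name.  Here's a minimal sketch of a more defensive approach using the standard library.  **filename_from_url()** is a hypothetical helper, and the example.org URLs are placeholders rather than real TCIA links.

```python
from pathlib import PurePosixPath
from urllib.parse import unquote, urlparse


def filename_from_url(url, default="download.bin"):
    """Return the last path segment of a URL, ignoring query and fragment."""
    path = unquote(urlparse(url).path)
    name = PurePosixPath(path).name
    # Fall back to a default when the URL has no file name in its path
    return name or default


# Placeholder URLs for illustration only
name_with_query = filename_from_url("https://example.org/files/sample_manifest.tcia?download=1")
name_without_path = filename_from_url("https://example.org/")
print(name_with_query)
print(name_without_path)
```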

## Downloading with the Linux Command Line NBIA Data Retriever

TCIA uses software called NBIA to manage its DICOM data.  One way to download TCIA data is to install the NBIA Data Retriever and use it to open a **.TCIA** manifest that contains the list of files to download.

This tool provides a number of useful features such as auto-retry if there are any problems, saving data in an organized hierarchy on your hard drive (Collection > Patient > Study > Series > Images), and providing a CSV file containing key DICOM metadata about the images you've downloaded.  

### Install the NBIA Data Retriever
There are versions of this tool for Windows, Mac and Linux.  If you're working from a system with a GUI you can follow the [instructions](https://wiki.cancerimagingarchive.net/display/NBIA/Downloading+TCIA+Images) to install Data Retriever on your computer.

There is also a [command-line version of the NBIA Data Retriever](https://wiki.cancerimagingarchive.net/x/2QKPBQ) which can be installed via the steps below if you're running this notebook in a **Linux** environment.  

In [None]:
# Install NBIA Data Retriever CLI software for downloading images later in this notebook.

!mkdir -p /usr/share/desktop-directories/
!wget -P /content/NBIA-Data-Retriever https://github.com/CBIIT/NBIA-TCIA/releases/download/DR-4_4_3-TCIA-20240916-1/nbia-data-retriever_4.4.3-1_amd64.deb
!dpkg -i /content/NBIA-Data-Retriever/nbia-data-retriever_4.4.3-1_amd64.deb

# NOTE: If you're working on a Linux OS that uses RPM packages, you can change the lines above to use
#       https://github.com/CBIIT/NBIA-TCIA/releases/download/DR-4_4_3-TCIA-20240916-1/nbia-data-retriever-4.4.3-1.x86_64.rpm

Click on the icon in the left sidebar that looks like a file folder to view the files we've downloaded thus far. You can inspect the manifest file in a text editor by double clicking it. You'll see some configuration information at the top, followed by a list of Series Instance UIDs that are part of the dataset.  

Don't worry if this next cell doesn't make much sense.  Typically you wouldn't need to do this, but for the purposes of this demo we're editing the manifest so that it only includes the UIDs for a single CT scan and its segmentation data from this Collection.  That way we're not waiting a long time for the full dataset to download.



In [None]:
!cp /content/RIDER-Lung-CT_v3_20240625.tcia /content/RIDER-sample.tcia

# Open the original file for reading and the new file for writing
with open('/content/RIDER-Lung-CT_v3_20240625.tcia', 'r') as infile, open('/content/RIDER-sample.tcia', 'w') as outfile:
    # Read the first six lines and write them to the new file
    for i in range(6):
        line = infile.readline()
        outfile.write(line)
    outfile.write("1.2.276.0.7230010.3.1.3.8323329.3151.1554820054.833742\n")
    outfile.write("1.3.6.1.4.1.9328.50.1.244501235124778437799277943012383090930\n")
    outfile.write("1.2.246.352.71.2.494841863751.4253265.20190214220030")
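If you find yourself trimming manifests often, the same idea can be wrapped in a small function.  This is a minimal sketch that assumes, as in the file above, that the first six lines are configuration and every later line is a Series Instance UID.  **write_sample_manifest()** is a hypothetical helper, and the demo runs on a made-up manifest in a temporary folder.

```python
import tempfile
from pathlib import Path


def write_sample_manifest(src, dst, series_uids, header_lines=6):
    """Copy the manifest header from src to dst, then append the chosen UIDs.

    Assumes the first `header_lines` lines are configuration and the rest
    of the file is one Series Instance UID per line.
    """
    lines = Path(src).read_text().splitlines()
    header = lines[:header_lines]
    Path(dst).write_text("\n".join(header + list(series_uids)) + "\n")
    return len(series_uids)


# Demo on a made-up manifest (header values and UIDs are placeholders)
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "full.tcia"
    dst = Path(tmp) / "sample.tcia"
    fake_header = [f"header{i}=x" for i in range(6)]
    src.write_text("\n".join(fake_header + ["1.2.3", "4.5.6", "7.8.9"]) + "\n")
    write_sample_manifest(src, dst, ["4.5.6"])
    sample_lines = dst.read_text().splitlines()

print(sample_lines)
```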

### Open the Manifest File with the NBIA Data Retriever
Next, let's open the sample manifest file with the NBIA Data Retriever to download the actual DICOM data.  To do this, we'll call the Data Retriever executable with the **--cli** flag, which tells it to run from the command line rather than with a GUI.  We also need to provide the path to the manifest file we want to open, and use **-d** to specify the path where we want to save the data.

**<font color='red'>We are using the --agree-to-license flag to bypass the need to interactively agree to the [Data Usage Policy](https://www.cancerimagingarchive.net/data-usage-policies-and-restrictions/)</font>**.  Please be sure you always check the licensing and attribution requirements before utilizing any TCIA data in your work.

In [None]:
# download the data using NBIA Data Retriever

!/opt/nbia-data-retriever/bin/nbia-data-retriever --cli '/content/RIDER-sample.tcia' -d /content/ --agree-to-license

Let's use the simpleDicomViewer we imported at the beginning of the notebook to take a quick look at these scans in the notebook.  You can use the slider to scroll through the image stack.

In [None]:
# Assuming the Data Retriever saved the data to /content/ with its default (descriptive) directory names
imgPath = "/content/RIDER-sample/RIDER Lung CT/RIDER-1532432635/04-12-2007-05084/109.000000-90930"

# The annotation path has to be a file name (not directory name).  Since there is generally
# only one file in a segmentation series we can assume it will always be called 1-1.dcm
segPath = "/content/RIDER-sample/RIDER Lung CT/RIDER-1532432635/04-12-2007-05084/110.000000-TEST-20030/1-1.dcm"

# Display the viewer
dicomViewer.viewSeriesAnnotation(imgPath, segPath)
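Rather than typing out long paths by hand, you can list which folders actually contain DICOM files.  Here's a minimal sketch using **pathlib**, run on a made-up folder tree in a temporary directory.  **count_dicom_files()** is a hypothetical helper, and it assumes the files use the .dcm extension.

```python
import tempfile
from pathlib import Path


def count_dicom_files(root):
    """Map each folder that directly contains .dcm files to its file count."""
    counts = {}
    for dcm in Path(root).rglob("*.dcm"):
        counts[dcm.parent] = counts.get(dcm.parent, 0) + 1
    return dict(sorted(counts.items()))


# Demo on a made-up Collection > Patient > Study > Series hierarchy
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    ct = root / "Collection" / "Patient-1" / "Study-1" / "CT-series"
    seg = root / "Collection" / "Patient-1" / "Study-1" / "SEG-series"
    ct.mkdir(parents=True)
    seg.mkdir(parents=True)
    for i in range(3):
        (ct / f"1-{i + 1}.dcm").touch()
    (seg / "1-1.dcm").touch()
    summary = {
        folder.relative_to(root).as_posix(): n
        for folder, n in count_dicom_files(root).items()
    }

print(summary)
```

Pointing this at the folder you passed to **-d** would show each series directory and how many images it holds, which makes it easier to spot the image and segmentation paths.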

### NBIA Data Retriever Conclusion
You should now find that the data have been saved to your machine in a well-organized hierarchy with some useful metadata in the accompanying CSV file and a license file detailing how it can be used.  Take a look at the data before moving on.

A few other notes:
* The CLI Data Retriever supports both "Descriptive" and "Classic" organization of the data.  Descriptive naming uses information from the DICOM Study/Series Description and Dates to make them easier for humans to interpret.  Classic names everything by machine-readable unique identifiers.  If you prefer machine-readable directory names simply add the **-cd** parameter to your download command.
* In some cases, you must specifically request access to [Collections](https://www.cancerimagingarchive.net/browse-collections/) before you can download them.  Information about how to do this can be found on the homepage for the Collection(s) you're interested in, but will always require that you first [create a TCIA user account](https://wiki.cancerimagingarchive.net/x/xgHDAg).  Once you've created an account and obtained permission to the restricted data you want to download, you can use your login/password to create the **credentials.txt** file that NBIA Data Retriever uses to verify your permissions.  The path to the credential file is specified using the **-l** parameter.

You can find examples for these use cases at [TCIA_Linux_Data_Retriever_App.ipynb](https://github.com/kirbyju/TCIA_Notebooks/blob/main/TCIA_Linux_Data_Retriever_App.ipynb).
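If you're scripting downloads, it can help to assemble the command from Python.  The sketch below only builds the argument list using the flags mentioned in this notebook; it doesn't run anything.  **build_retriever_command()** is a hypothetical helper, and the position of the optional flags is an assumption, so check the Data Retriever documentation before relying on it.

```python
import shlex

# Install location used earlier in this notebook
RETRIEVER = "/opt/nbia-data-retriever/bin/nbia-data-retriever"


def build_retriever_command(manifest, out_dir, classic_names=False, credentials=None):
    """Build (but do not run) a Data Retriever CLI command as a list."""
    # Only include --agree-to-license after reviewing the Data Usage Policy
    cmd = [RETRIEVER, "--cli", manifest, "-d", out_dir, "--agree-to-license"]
    if classic_names:
        cmd.append("-cd")  # machine-readable "Classic" directory names
    if credentials:
        cmd += ["-l", credentials]  # path to credentials.txt for restricted data
    return cmd


cmd = build_retriever_command("/content/RIDER-sample.tcia", "/content/", classic_names=True)
print(shlex.join(cmd))
```

The resulting list could be passed to `subprocess.run(cmd, check=True)` in an environment where the Data Retriever is installed.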



# Use Case: Predict clinical, genomic and proteomic classification from images
In this use case we'll look for datasets that have clinical, genomic and proteomic supporting data and then grab their related images.  

**Note:** It's outside the scope of this notebook to do a deep dive on how to obtain the genomic and proteomic data, but you can find lots of additional information about the NCI programs that are generating this type of data and how to access it at https://www.cancerimagingarchive.net/imaging-omics/.

To get started, let's filter the collections dataframe again that we created earlier.  This time we'll look for 'omics' in the **supporting_data** column.

In [None]:
omics = wordpress.searchDf('omics', column_name = "supporting_data", dataframe = collections)

omics

Over 40 datasets!  Let's say that I'm particularly interested in pancreatic data though.

In [None]:
pancreas_omics = wordpress.searchDf('pancreas', dataframe = omics)

pancreas_omics

Ok, we're down to just one, **CPTAC-PDA**, and it looks like a good candidate since it also has clinical data listed in the supporting_data column.

Let's navigate to the URL in the link column.  

In [None]:
# prompt: print the full value of the URL in the link column of my pancreas_omics dataframe

# Assuming 'pancreas_omics' DataFrame is already loaded as in the provided code.

for index, row in pancreas_omics.iterrows():
    print(row['link'])

Note that, in addition to the images, there are Analysis Result datasets which have been published containing image segmentations.  This looks like a really great dataset to explore in more detail!

## Accessing DICOM REST APIs
Instead of just downloading the full dataset with NBIA Data Retriever like we did earlier, this time we'll use the [NBIA REST APIs](https://wiki.cancerimagingarchive.net/x/ZoATBg) to query metadata and download DICOM data.  These can be particularly helpful if you want to build a more customized cohort, or if you can't or don't want to install software like the Data Retriever in your computational environment.

In this notebook we'll rely heavily on [tcia_utils](https://pypi.org/project/tcia-utils/) again to simplify accessing the APIs.

## Exploring the data with REST API queries

Let's start by looking at what body parts and modalities are contained in the CPTAC-PDA collection.  By default, most functions from **tcia_utils** return results in JSON.

In [None]:
# count patients for each modality
data = nbia.getModalityCounts(collection = "CPTAC-PDA")
print(data)

However, you can also use **format = "df"** to return the results as a dataframe.  Let's try that for looking at the patient counts by Body Part Examined values.

In [None]:
# Count patients for each body part examined,
# return results as dataframe
df = nbia.getBodyPartCounts(collection = "CPTAC-PDA", format = "df")

# rename headers and sort by PatientCount
df.rename(columns = {'criteria':'BodyPartExamined', 'count':'PatientCount'}, inplace = True)
df.PatientCount = df.PatientCount.astype(int)
display(df.sort_values(by='PatientCount', ascending=False, ignore_index = True))

Now let's run **nbia.getPatient()** and **nbia.getStudy()** to see what we can learn about the patient cohort from the DICOM metadata.  The patient information can include things like age, gender, and ethnicity. The study information might include additional information recorded on the date the patient was scanned such as the patient's age or how many days it has been since they were diagnosed.

In [None]:
df = nbia.getPatient(collection = "CPTAC-PDA", format = "df")

display(df)

Let's use **format = "csv"** this time to save a CSV file in addition to returning a dataframe.  Verify that **getPatientStudy.csv** has been saved to your file system before proceeding.

In [None]:
# obtain study/visit details (e.g. anonymized study date, age at the time of visit)
df = nbia.getStudy(collection = "CPTAC-PDA", format = "csv")
display(df)

We can also create a report with **nbia.getSeries()** that gives useful metadata about each scan in the dataset (e.g. series description, modality, scanner manufacturer & software version, number of images).  Here's a full list of the return values.

In [None]:
# obtain scan/series metadata and save to variable for use in next example
series = nbia.getSeries(collection = "CPTAC-PDA", format = "df")

# list all column headers in the series dataframe
list(series.columns)

# uncomment the next line if you'd like to see the actual data
#display(series)

Finally, we can use the results from the getSeries() query to generate some summary statistics about the data in the collection.  Note that there are separate rows summarizing the contents of the original collection and the contents of any related Analysis Result datasets that have DICOM data.

The other Analysis Result datasets we saw on the CPTAC-PDA webpage do not appear because the data they generated were in other file formats that were not submitted to the NBIA DICOM system.

In [None]:
# Calculate summary statistics for a given collection
nbia.reportDoiSummary(series, input_type = "df")

## Downloading data with the REST API
There are a wide variety of ways to use **downloadSeries()** to download data from TCIA.  You can learn more about this in https://github.com/kirbyju/TCIA_Notebooks/blob/main/TCIA_REST_API_Downloads.ipynb, but we'll cover a few basic use cases in this notebook.

First we'll demonstrate downloading a segmentation and the corresponding image series for a single subject.  To do this we'll pull a random segmentation using the **series** dataframe we created earlier with **getSeries()**.  All annotation data in the CPTAC-PDA collection is in RTSTRUCT format, so we'll filter for this in the Modality column and then use SeriesDescription to make sure we're pulling a 3D segmentation rather than a seed point annotation, since seed points are too small to visualize.



In [None]:
random_row = series.loc[(series['Modality'] == 'RTSTRUCT') &
                        (~series['SeriesDescription'].fillna('').str.lower().str.contains('seed'))].sample(n=1)
segSeries = random_row['SeriesInstanceUID'].iloc[0]

print(segSeries)

To determine the Reference Series UID of the image data that goes with this segmentation you can use **nbia.getSegRefSeries()**.

**Note:** This should work 100% of the time with RTSTRUCT data, but there are some older SEG datasets that were submitted to us without this DICOM element populated so you may encounter issues using this function with those.

In [None]:
refSeries = nbia.getSegRefSeries(segSeries)

print(refSeries)

If you wanted to inspect the DICOM tags for these series before downloading them you can use the **nbia.reportDicomTags()** function.  

**Note:** This function looks up each series UID with a separate API call, so if you run this with a really large list of UIDs it could take a very long time.

In [None]:
nbia.reportDicomTags([refSeries, segSeries])

Next let's download these two series by passing their UIDs as a list.  Note that this is not the default way downloadSeries() expects to receive input about what to download.  This is why we're specifying input_type = "list".

In [None]:
nbia.downloadSeries([refSeries, segSeries], input_type= "list")

Now we can look at the images and segmentation together with **viewSeriesAnnotation()** from [simpleDicomViewer](https://pypi.org/project/simpleDicomViewer/).  Note that this function is only meant to be a quick and dirty way to preview the data.  There are more comprehensive solutions such as [3D Slicer](https://slicer.org/) or [itkWidgets](https://github.com/kirbyju/TCIA_Notebooks/blob/main/TCIA_RTStruct_SEG_Visualization_with_itkWidgets.ipynb) if you want to analyze the data.

In [None]:
# Assuming you didn't change the default download options for downloadSeries
imgPath = "tciaDownload/" + refSeries

# The annotation path has to be a file name (not directory name).  Since there is generally
# only one file in a segmentation series we can assume it will always be called 1-1.dcm
segPath = "tciaDownload/" + segSeries + "/1-1.dcm"

# Display the viewer
dicomViewer.viewSeriesAnnotation(imgPath, segPath)
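The 1-1.dcm file name is an assumption that won't hold if a segmentation series contains more than one file.  Here's a minimal sketch of a check you could run first.  **find_annotation_file()** is a hypothetical helper, and the demo uses an empty placeholder file in a temporary folder.

```python
import tempfile
from pathlib import Path


def find_annotation_file(series_dir):
    """Return the single .dcm file in a segmentation series folder."""
    files = sorted(Path(series_dir).glob("*.dcm"))
    if len(files) != 1:
        raise ValueError(
            f"Expected exactly one .dcm file in {series_dir}, found {len(files)}"
        )
    return files[0]


# Demo with an empty placeholder file (not real DICOM data)
with tempfile.TemporaryDirectory() as tmp:
    seg_dir = Path(tmp) / "seg_series"
    seg_dir.mkdir()
    (seg_dir / "1-1.dcm").touch()
    found_name = find_annotation_file(seg_dir).name

print(found_name)
```

In this notebook you could then set `segPath = str(find_annotation_file("tciaDownload/" + segSeries))` before calling the viewer.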

Let's say that after spot checking a few cases you decide you'd like to download the full dataset.  To achieve this, we can pass the full `series` dataframe we created earlier to `downloadSeries()` as shown here.  We'll need to include the `input_type` parameter again because the function expects JSON by default and we're using a dataframe here.

In [None]:
# Note: the 'number' parameter lets you specify how many series to download
# This is useful for demos or if you just want a small sample of data for testing.

nbia.downloadSeries(series, number = 2, input_type = 'df')

## Accessing Genomics, Proteomics and Clinical data
NCI has funded several large projects generating vast amounts of imaging, proteomic, genomic and clinical data.  While TCIA hosts much of the imaging data for these projects, the 'omic and clinical data are hosted in the [Proteomic Data Commons (PDC)](https://pdc.cancer.gov/) and [Genomic Data Commons (GDC)](https://gdc.cancer.gov/).  

In the interest of time, we will only touch briefly on how to access the clinical data for CPTAC-PDA.  
* If you are interested in learning more about how to access imaging and clinical data from these datasets please check out the notebooks for the [Clinical Proteomic Tumor Analysis Consortium (CPTAC)](https://github.com/kirbyju/TCIA_Notebooks/blob/main/CPTAC/CPTAC.ipynb) and [The Cancer Genome Atlas (TCGA)](https://github.com/kirbyju/TCIA_Notebooks/blob/main/TCGA/TCGA_Clinical.ipynb).
* If you have questions about accessing the genomic or proteomic data please contact [GDC support](mailto:support@nci-gdc.datacommons.io) or [PDC support](mailto:PDCHelpDesk@mail.nih.gov).

In [None]:
cases_endpt = 'https://api.gdc.cancer.gov/cases'

filters = {
    "op": "in",
    "content":{
        "field": "project.project_id",
        "value": ["CPTAC-3"]
        }
    }

fields = [
    "submitter_id",
    ]

fields = ','.join(fields)

expand = [ ## For the allowable values for this list, look under "mapping" at https://api.gdc.cancer.gov/cases/_mapping
    "demographic",
    "diagnoses",
    "diagnoses.treatments",
    "exposures",
    "family_histories"
    ]

expand = ','.join(expand)

params = {
    "filters": json.dumps(filters),
    "expand": expand,
    "fields": fields,
    "format": "TSV", ## This can be "JSON" too
    "size": "2000", ## If you are including several projects, I would recommend playing with this and the "from" number.
    "from":"0"
    }

response = requests.get(cases_endpt, params = params)

output = response.content.decode('UTF-8')
clinicalDf = pd.read_csv(io.StringIO(output), sep='\t')

clinicalDf
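The comment about "size" and "from" in the cell above hints at paging through larger result sets.  Here's a minimal sketch of how the request parameters for each page could be generated.  It makes no network calls.  **gdc_page_params()** is a hypothetical helper, and the total number of cases is a made-up value that you'd normally need to look up first.

```python
import json


def gdc_page_params(filters, fields, expand, page_size, total):
    """Yield one params dict per page, advancing "from" by page_size."""
    for start in range(0, total, page_size):
        yield {
            "filters": json.dumps(filters),
            "fields": ",".join(fields),
            "expand": ",".join(expand),
            "format": "TSV",
            "size": str(page_size),
            "from": str(start),
        }


filters = {
    "op": "in",
    "content": {"field": "project.project_id", "value": ["CPTAC-3"]},
}

# total=1200 is a made-up number for illustration
pages = list(gdc_page_params(filters, ["submitter_id"], ["demographic"],
                             page_size=500, total=1200))
page_offsets = [page["from"] for page in pages]
print(page_offsets)
```

Each dictionary could be passed as `params` to `requests.get(cases_endpt, params=...)`, and the resulting tables concatenated with `pd.concat()`.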

Now let's merge the clinical data with our imaging data so that we're only looking at subjects where we have both.

In [None]:
# create new dataframe from 'series' with unique IDs of patients with imaging
uniquePatients = pd.DataFrame(series['PatientID'].unique(), columns=['PatientID'])

# Rename the patient id column in clinicalDf to match
clinicalDf = clinicalDf.rename(columns={'submitter_id': 'PatientID'})

# Inner merge keeps only the patients that appear in both dataframes
mergedClinical = uniquePatients.merge(clinicalDf, how='inner', on='PatientID')

mergedClinical
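It's often worth checking how many patients matched between the two sources.  Here's a minimal sketch using the **indicator** option of the pandas merge function on two toy dataframes.  The patient IDs and the toy_age column are made up for illustration.

```python
import pandas as pd

# Toy stand-ins for uniquePatients and clinicalDf (made-up values)
imaging = pd.DataFrame({"PatientID": ["P-001", "P-002", "P-003"]})
clinical = pd.DataFrame({
    "PatientID": ["P-002", "P-003", "P-004"],
    "toy_age": [61, 70, 55],
})

# An outer merge keeps everyone; the _merge column records where each row came from
check = imaging.merge(clinical, how="outer", on="PatientID", indicator=True)
coverage = check["_merge"].value_counts().to_dict()
both = check.loc[check["_merge"] == "both", "PatientID"].tolist()

print(coverage)
print(both)
```

Rows marked "left_only" are patients with imaging but no clinical record, and "right_only" rows are the reverse.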

We'll stop there for now, but from here you could review the clinical data in more detail and start looking for possible fields that you want to try to predict from the imaging data.

# Use Case: Publishing or sharing your analyses of TCIA images
After accessing TCIA data you may find that you'd like to publish your results in connection with a manuscript, or maybe you just want to share specific data with others to demonstrate your model's performance.  Here, we will demonstrate how the API can help you achieve this, as well as discuss general considerations and best practices.

Let's say that you're a researcher interested in developing or refining an all-purpose foundational tumor segmentation model for CT images.  You might begin by using **getSimpleSearchWithModalityAndBodyPartPaged()**.  Take a minute to look at the docstring for this function.  It's extremely powerful and versatile!

We'll run this twice to create a starting point for the data we may want to use for creating our model.  The first query is returning the CT/SEG series from subjects that have both CT AND SEG, and the second is returning RTSTRUCT/CT series from subjects that have both CT AND RTSTRUCT segmentations.

In [None]:
ct_seg = nbia.getSimpleSearchWithModalityAndBodyPartPaged(modalities=['CT', 'SEG'], modalityAnded=True, format = 'uids')
ct_rtstruct = nbia.getSimpleSearchWithModalityAndBodyPartPaged(modalities=['CT', 'RTSTRUCT'], modalityAnded=True, format = 'uids')

We can see in the following cell that this results in approximately 30,000 series to review.  


In [None]:
print(len(ct_seg))
print(len(ct_rtstruct))
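
To make the "AND" behavior described above concrete, here is a small pandas illustration on made-up data. It is only a sketch of the selection logic as described in this notebook, not how the API is implemented, and the table, column names and helper function are hypothetical.

```python
import pandas as pd

# Made-up series table: one row per series.
toy = pd.DataFrame({
    'Subject ID': ['A', 'A', 'B', 'C', 'C', 'C'],
    'Modality':   ['CT', 'SEG', 'CT', 'CT', 'RTSTRUCT', 'MR'],
})

def series_for_subjects_with_all(df, modalities):
    """Keep series of the requested modalities, but only for subjects
    that have at least one series of every requested modality."""
    wanted = set(modalities)
    per_subject = df.groupby('Subject ID')['Modality'].apply(set)
    keep = per_subject[per_subject.apply(wanted.issubset)].index
    return df[df['Subject ID'].isin(keep) & df['Modality'].isin(wanted)]

ct_seg_toy = series_for_subjects_with_all(toy, ['CT', 'SEG'])
ct_rt_toy = series_for_subjects_with_all(toy, ['CT', 'RTSTRUCT'])

print(ct_seg_toy)
print(ct_rt_toy)
```

Subject B has a CT but no segmentation, so it drops out of both results, and subject C's MR series is left out because MR was not requested.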

Next we can use **getSeriesList()** to retrieve metadata for these series and save it as a CSV for both datasets. Take a minute to review what sort of metadata are in the spreadsheets by double-clicking one of the files in the left sidebar to open it.

In [None]:
nbia.getSeriesList(ct_seg, format = 'csv')

nbia.getSeriesList(ct_rtstruct, format = 'csv')

Now the real fun begins.  You can do some data wrangling with these spreadsheets to rule out some series you obviously don't want.  For example, you might want to eliminate scans with certain Series Descriptions (e.g. "Scout") or that have fewer than a certain number of images in them.  
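
A minimal sketch of that kind of filtering is shown below. It uses a made-up dataframe instead of a real series report, and the column names `Series Description` and `Number of Images`, the keyword pattern and the image-count threshold are all assumptions, so check them against the header of your own CSV before reusing this.

```python
import pandas as pd

# Made-up rows standing in for a series report spreadsheet.
report = pd.DataFrame({
    'Series ID': ['1.2.3.1', '1.2.3.2', '1.2.3.3', '1.2.3.4'],
    'Series Description': ['CHEST 5mm', 'Scout', None, 'ABD/PELVIS'],
    'Number of Images': [120, 2, 85, 12],
})

MIN_IMAGES = 20                        # hypothetical threshold
EXCLUDE_PATTERN = r'scout|localizer'   # hypothetical description keywords

# Missing descriptions count as "no match", so those rows are not dropped here.
is_excluded = report['Series Description'].str.contains(
    EXCLUDE_PATTERN, case=False, na=False, regex=True)
has_enough_images = report['Number of Images'] >= MIN_IMAGES

filtered = report[~is_excluded & has_enough_images]
print(filtered['Series ID'].tolist())
```

Calling `filtered.to_csv('series.csv', index=False)` afterwards would save the retained rows to a new spreadsheet.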

Let's pretend that after doing this you've created a new file, **series.csv**, where you've retained only the series you think are worth downloading for further review.  Please use the file manager in the left sidebar to right-click and rename one of the **series_report_datetime.csv** spreadsheets you downloaded in our previous steps to **series.csv**.  

Next we'll demonstrate how to download those series using the CSV file by isolating the Series Instance UIDs and passing them to **downloadSeries()** with **input_type = 'list'**.

**Note:** This time we're including the **as_zip = True** feature for downloading.  This skips the unzipping process that typically occurs as part of the downloading function, which some users may prefer to help save disk space.

In [None]:
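# '/content' is the default working directory in Colab; adjust the path if you're running elsewhere
series_metadata = pd.read_csv('/content/series.csv')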

uids = series_metadata['Series ID'].tolist()

# you'd remove the number parameter if you wanted to download the full dataset
nbia.downloadSeries(uids, input_type = 'list', number = 1, as_zip = True)

After inspecting the data more carefully with other tools, you will likely find scans and segmentations that aren't quite what you wanted and that need to be excluded due to issues like these:

1. You might disagree with the way some segmentations were created.
2. You may find some segmentations are of organs or something else besides tumors.
3. You might find that certain types of CT scans do not work well with your model (e.g. slice thickness is too thick or too thin).
4. You might want to exclude some subjects to create a better demographic distribution for gender, race, and ethnicity.

After you whittle the dataset down to exactly the CTs and segmentations that you'd like to use to train and test your model, you'll need to decide how to share these with the rest of the world.

Our recommendation for the best practice here is to keep track of the Series Instance UIDs!  You can use these in conjunction with the API to create a "[Shared Cart](https://wiki.cancerimagingarchive.net/display/NBIA/TCIA+Radiology+Portal+User+Guide#TCIARadiologyPortalUserGuide-SharingDatainYourCart)" that lets you share these specific scans with others.  
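
One way to keep track of the UIDs while you curate is to record every exclusion together with its reason, then derive the final list from that record. The sketch below is just one possible bookkeeping approach. The UIDs are short placeholders (real Series Instance UIDs are much longer) and the reasons are invented for illustration.

```python
import pandas as pd

# Placeholder UIDs; real Series Instance UIDs are much longer.
candidate_uids = ['1.2.3.1', '1.2.3.2', '1.2.3.3', '1.2.3.4']

# Record each exclusion with a reason so the curation can be reproduced.
exclusions = {
    '1.2.3.2': 'segmentation is of an organ, not a tumor',
    '1.2.3.4': 'slice thickness outside accepted range',
}

# The final list keeps the original order and drops anything excluded.
final_uids = [uid for uid in candidate_uids if uid not in exclusions]

# A small log you could save alongside your manuscript or model.
curation_log = pd.DataFrame({
    'SeriesInstanceUID': candidate_uids,
    'included': [uid not in exclusions for uid in candidate_uids],
    'reason': [exclusions.get(uid, '') for uid in candidate_uids],
})
print(curation_log)
print(final_uids)
```

A list like `final_uids` is the kind of input that a Shared Cart is built from, and the log documents why the other series were left out.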

After creating a Shared Cart you will receive a URL that looks like https://nbia.cancerimagingarchive.net/nbia-search/?saved-cart=nbia-49121659384603347 which can be shared with others.  Try clicking the link to see what this looks like on the TCIA website.  Then use the code below to see how you can achieve this using **tcia_utils**.

## makeSharedCart()
First let's create a shared cart using **makeSharedCart()**.  This function accepts a list of Series Instance UIDs and (optionally) your cart's name, description and URL (if you want to provide more information about what's in your cart).

Let's take a look at how to create one using the **uids** list that we extracted from the **series_metadata** dataframe.

In [None]:
# You can update the code below to choose your own custom name if you like.
# A random one will be generated for you if left empty.
name = ""
description = "Testing out API cart creation with tcia_utils."
description_url = "https://pypi.org/project/tcia-utils/"

carts = nbia.makeSharedCart(uids, name, description, description_url)

The **makeSharedCart()** function returns the URL(s) that can be used to access your shared cart(s).  

In [None]:
carts
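
If you only have a cart URL and need the cart's name, you can pull it out with Python's standard library. This sketch assumes the name is the value of the `saved-cart` query parameter, as in the example URL shown earlier. The helper `cart_name_from_url()` is hypothetical and not part of **tcia_utils**.

```python
from urllib.parse import urlparse, parse_qs

def cart_name_from_url(url):
    """Return the value of the 'saved-cart' query parameter, or None if absent."""
    values = parse_qs(urlparse(url).query).get('saved-cart', [])
    return values[0] if values else None

example_url = 'https://nbia.cancerimagingarchive.net/nbia-search/?saved-cart=nbia-49121659384603347'
cart_name = cart_name_from_url(example_url)
print(cart_name)
```

The helper returns `None` for a URL without that parameter, so check for that before using the result.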

## getSharedCart()
Users who want to use their browsers can just click on these links and download the data using the Data Retriever.  

If you want to download the associated series in a cart using the API you can do so by updating the code below to specify the name of the cart you're trying to retrieve. First we'll retrieve the metadata about the cart, and then we can feed that to **downloadSeries()** to actually download the DICOM data.

In [None]:
df = nbia.getSharedCart(name = "nbia-58178536016905286-part1", format = "df")
display(df)

In [None]:
nbia.downloadSeries(df, input_type = "df", number = 1)

# Additional TCIA Resources for AI/ML Researchers
The following pages on TCIA may be of special interest to deep learning researchers:

1. [Finding Annotated Data for AI/ML on TCIA](https://wiki.cancerimagingarchive.net/x/TAGJAw) provides basic guidance for finding datasets that could be useful for deep learning tasks.
2. [Challenge Competitions using TCIA data](https://wiki.cancerimagingarchive.net/x/nYIaAQ) can be useful for benchmarking your model's performance.
3. [ACR Data Science Institute's Define AI Directory](https://www.acrdsi.org/DSI-Services/Define-AI) links clinically relevant AI use-cases to TCIA datasets that can be used to address them.
4. [Additional TCIA Notebooks](https://github.com/kirbyju/TCIA_Notebooks) about accessing and visualizing data are available.

# AI/LLM programming tips and tricks for beginners
Here are some tips and tricks that can be extremely helpful to novice coders and data scientists thanks to recent advances in AI and LLMs.  We don't have time to dive into them all, but I wanted to provide them for you to review on your own time.
1. Google Colab has built-in AI coding assistance.  Try out the auto-complete and `generate with AI` features.  These are excellent tools to speed up your ability to write code within the notebook.
2. https://claude.ai/ is an amazing tool for generating code to tackle more in-depth tasks.  Rather than just suggesting your next line of code, it lets you write detailed instructions in paragraph form and then generates entire functions or even full scripts/programs for you.  You can then iterate with it via chat to refine specific parts.  You can upload files or larger blobs of code and it seems to generally handle them pretty well in my experience.
3. Microsoft Copilot and ChatGPT are also great LLM solutions to help with your code.  For ChatGPT you can even click on "Explore GPTs" and try out the ones that are specifically tailored for Python coding.  ChatGPT is also nice in that you can upload full .py files and ask it to analyze them rather than having to copy/paste in code blocks.
4. If you find yourself running into message limits when interacting with LLMs, you can alternate back and forth between these different solutions while you wait for your free credits to build back up on the one(s) you've fully depleted.  
5. Another option is to install [Ollama](https://ollama.com/download), which lets you run LLMs locally on your laptop with code-specific models like [qwen](https://ollama.com/library/qwen2.5-coder).  The downside here is that the performance may not be quite as good as the commercial solutions, and I've found the user experience to lag behind things like Claude and ChatGPT.

Even as someone who is a very novice programmer, these tools have helped me code everything you'll see in the rest of this course.  I've also recently discovered [Streamlit](https://streamlit.io/), which makes it super easy to build and deploy small, but very functional, web applications like https://tcia-cohort-builder.streamlit.app/ and https://tcia-shared-cart-creator.streamlit.app/ that interact with our APIs. No sysadmin experience required!  If you come up with an idea for something like this which may benefit other TCIA users, please contact our helpdesk and we'll add it to our list of [Data Analysis Centers](https://wiki.cancerimagingarchive.net/x/x49XAQ)!

# Acknowledgements
TCIA is funded by the [Cancer Imaging Program (CIP)](https://imaging.cancer.gov/), a part of the United States [National Cancer Institute (NCI)](https://www.cancer.gov/).  It is managed by the [Frederick National Laboratory for Cancer Research (FNLCR)](https://frederick.cancer.gov/) and hosted by the [University of Arkansas for Medical Sciences (UAMS)](https://www.uams.edu/).

This notebook was created by [Justin Kirby](https://www.linkedin.com/in/justinkirby82/). If you leverage this notebook or any TCIA datasets in your work, please be sure to comply with the [TCIA Data Usage Policy](https://www.cancerimagingarchive.net/data-usage-policies-and-restrictions/). In particular, make sure to cite the DOI(s) for the specific TCIA datasets you used in addition to the following paper!

## TCIA Citation

Clark, K., Vendt, B., Smith, K., Freymann, J., Kirby, J., Koppel, P., Moore, S., Phillips, S., Maffitt, D., Pringle, M., Tarbox, L., & Prior, F. (2013). The Cancer Imaging Archive (TCIA): Maintaining and Operating a Public Information Repository. Journal of Digital Imaging, 26(6), 1045–1057. https://doi.org/10.1007/s10278-013-9622-7