You can download and run this notebook locally, or you can run it for free in a cloud environment using Colab or Sagemaker Studio Lab:

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/kirbyju/TCIA_Notebooks/blob/main/CPTAC/CPTAC.ipynb)

[![Open In SageMaker Studio Lab](https://studiolab.sagemaker.aws/studiolab.svg)](https://studiolab.sagemaker.aws/import/github.com/kirbyju/TCIA_Notebooks/blob/main/CPTAC/CPTAC.ipynb)

# Analyzing DICOM images and annotations from the CPTAC datasets hosted on TCIA

This notebook is focused on accessing the [Clinical Proteomic Tumor Analysis Consortium collections](https://wiki.cancerimagingarchive.net/display/Public/CPTAC+Imaging+Proteomics) hosted on The Cancer Imaging Archive (TCIA).  These datasets include radiology and histopathology images hosted on TCIA as well as proteomic, genomic and clinical data hosted in the Proteomic Data Commons (PDC) and Genomic Data Commons (GDC).

The National Cancer Institute has also funded an activity to generate and publish annotations (3D segmentation labels and seed points) on TCIA to help jumpstart research on tumor detection, auto-segmentation methods and the generation of radiomic imaging features, which can be compared with the proteomic, genomic and clinical data.  **This notebook is focused on:**

1. Demonstrating how to access the radiology images and tumor annotations
2. Generating radiomic features from the 3D tumor segmentations using [MIRP](https://github.com/oncoray/mirp)
3. Extracting the corresponding clinical data from GDC to facilitate correlation with the image features.
4. Leveraging potential genomic and proteomic classification data from [publications written by the Clinical Proteomic Tumor Analysis Consortium](https://proteomics.cancer.gov/resources/milestones-and-publications).

**Note:** Users of this notebook may also find the [CPTAC Python package](https://github.com/PayneLab/cptac) useful for working with the non-image data types.  This repository includes documentation about how to access genomic, proteomic and clinical data for CPTAC subjects using Python dataframes.


# 1 Setup

The following installs and imports the necessary Python packages and runs a few conditional steps if you're using Google Colab to adjust log settings.

In [None]:
import sys

!{sys.executable} -m pip install --upgrade -q tcia_utils
!{sys.executable} -m pip install --upgrade -q altair
!{sys.executable} -m pip install --upgrade -q mirp
!{sys.executable} -m pip install --upgrade -q simpleDicomViewer

In [None]:
import requests
import json
import io
import pandas as pd
import altair as alt
from tcia_utils import nbia_v4 as nbia
from simpleDicomViewer import dicomViewer
from mirp import extract_mask_labels
from mirp import extract_image_parameters
from mirp import extract_images
from mirp import extract_features

# set logging level to INFO in Google Colab (not necessary in Jupyter)
if 'google.colab' in sys.modules:
  import logging

  for handler in logging.root.handlers[:]:
      logging.root.removeHandler(handler)

  # Set handler with level = info
  logging.basicConfig(format='%(asctime)s:%(levelname)s:%(message)s',
                      level=logging.INFO)

  print("Google Colab Logging set to INFO")

# 2 Learn about the datasets

TCIA maintains a [summary page for CPTAC](https://wiki.cancerimagingarchive.net/display/Public/CPTAC+Imaging+Proteomics) that details all of the radiology and digitized histopathology images we host.

As previously mentioned, a subset of these images also include annotations (tumor segmentation and seed point labels).  The annotation datasets are described at:

1. [CPTAC-UCEC](https://doi.org/10.7937/89M3-KQ43): Corpus Endometrial Carcinoma
2. [CPTAC-PDA](https://doi.org/10.7937/BW9V-BX61): Pancreatic Ductal Adenocarcinoma
3. [CPTAC-CCRCC](https://doi.org/10.7937/SKQ4-QX48): Clear Cell Renal Carcinoma
4. [CPTAC-HNSCC](https://doi.org/10.7937/PFEC-T641): Head and Neck Squamous Cell Carcinoma **(restricted access - Images are temporarily unavailable due to [new NIH policies](https://www.cancerimagingarchive.net/nih-controlled-data-access-policy/) on controlled-access data)**

After taking a look at these collections, select the one you'd like to explore through the rest of this notebook by setting the collection variable below.

In [None]:
collection = "CPTAC-PDA"

# 3 Downloading images and annotations with the NBIA Data Retriever

TCIA uses software called NBIA to manage its DICOM data.  One way to download TCIA data is to install the NBIA Data Retriever and use the predefined manifest files that are found on the summary pages mentioned in section 2.  

This tool provides a number of useful features such as auto-retry if there are any problems, saving data in an organized hierarchy on your hard drive (Collection > Patient > Study > Series > Images), and providing a CSV file containing key DICOM metadata about the images you've downloaded.  

There are versions available for Windows, Mac and Linux.  If you're working from a system with a GUI you can follow the [instructions](https://wiki.cancerimagingarchive.net/display/NBIA/Downloading+TCIA+Images) to install Data Retriever on your computer.  There is also a [Linux command-line version of the NBIA Data Retriever](https://wiki.cancerimagingarchive.net/x/2QKPBQ) which is demonstrated in [this notebook](https://github.com/kirbyju/TCIA_Notebooks/blob/main/TCIA_Linux_Data_Retriever_App.ipynb).

# 4 Accessing the REST APIs
The [NBIA REST APIs](https://wiki.cancerimagingarchive.net/x/ZoATBg) are another useful way for TCIA users to query metadata and download image data, which will be the focus of the rest of this notebook.  We'll rely heavily on [tcia_utils](https://pypi.org/project/tcia-utils/) to simplify accessing them.

If you have questions that are not covered in this notebook you can find many additional examples in the other notebooks at https://github.com/kirbyju/TCIA_Notebooks.

## 4.1 Exploring the data with REST API queries

Let's start by looking at what body parts and modalities are contained in the collection.  For these datasets, RTSTRUCTs were used to record the segmentations and seed points as well as to indicate scans where no tumor was found. By default, most functions from **tcia_utils** return results in JSON.

In [None]:
# count patients for each modality
data = nbia.getModalityCounts(collection)
print(data)

However, you can also use **format = "df"** to return the results as a dataframe.

In [None]:
# Count patients for each body part examined,
# return results as dataframe
df = nbia.getBodyPartCounts(collection, format = "df")

# rename headers and sort by PatientCount
df.rename(columns = {'criteria':'BodyPartExamined', 'Count':'PatientCount'}, inplace = True)
df.PatientCount = df.PatientCount.astype(int)
display(df.sort_values(by='PatientCount', ascending=False, ignore_index = True))

Now let's run **nbia.getPatient()** and **nbia.getStudy()** to see what we can learn about the patient cohort from the DICOM metadata.  The patient information can include things like age, gender, and ethnicity. The study information includes additional information recorded on the date the patient was scanned such as the patient's age or how many days it has been since they were diagnosed.

In [None]:
df = nbia.getPatient(collection, format = "df")

display(df)

Let's use **format = "csv"** this time to save a CSV file in addition to returning a dataframe.  Verify that **getPatientStudy.csv** has been saved to your file system before proceeding.

In [None]:
# obtain study/visit details (e.g. anonymized study date, age at the time of visit)
df = nbia.getStudy(collection, format = "csv")
display(df)

We can also create a report with **nbia.getSeries()** that gives useful metadata about each scan in the dataset (e.g. series description, modality, scanner manufacturer & software version, number of images).

In [None]:
# obtain scan/series metadata and save to variable for use in next example
series = nbia.getSeries(collection, format = "df")

display(series)

Finally, we can use the results from the getSeries() query to generate some summary statistics about the data in the collection.  Note that there are separate rows summarizing the contents of the original collection and the contents of the annotation dataset.

In [None]:
# Calculate summary statistics for a given collection
nbia.reportDoiSummary(series, input_type = "df")

## 4.2 Downloading data with the REST API
There are a wide variety of ways to use **downloadSeries()** to download data from TCIA.  You can learn more about this in https://github.com/kirbyju/TCIA_Notebooks/blob/main/TCIA_REST_API_Downloads.ipynb, but we'll cover a few basic use cases in this notebook.

First we'll demonstrate downloading a segmentation and the corresponding image series for a single subject.  To do this we'll pull a random segmentation using the **series** dataframe we created earlier with **getSeries()**.  All annotation data is in RTSTRUCT format, so we'll filter for this in the Modality column and then use SeriesDescription to make sure we're pulling a 3D segmentation and not a seed point annotation, since seed points are too small to visualize.



In [None]:
random_row = series.loc[(series['Modality'] == 'RTSTRUCT') &
                        (~series['SeriesDescription'].fillna('').str.lower().str.contains('seed'))].sample(n=1)
segSeries = random_row['SeriesInstanceUID'].iloc[0]

print(segSeries)
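
The filter above combines two conditions: the Modality must be RTSTRUCT, and the SeriesDescription (treated as empty when missing) must not contain "seed" in any letter case. Here is a minimal sketch of the same logic on plain dictionaries. The rows and UIDs are made up for illustration, and `is_segmentation_series` is a hypothetical helper, not part of tcia_utils.

```python
def is_segmentation_series(row):
    """Return True for RTSTRUCT rows whose description does not mention 'seed'."""
    description = row.get("SeriesDescription") or ""
    return row.get("Modality") == "RTSTRUCT" and "seed" not in description.lower()

# Hypothetical rows for illustration only; real values come from nbia.getSeries()
sample_rows = [
    {"SeriesInstanceUID": "1.2.3.1", "Modality": "CT", "SeriesDescription": "ABDOMEN"},
    {"SeriesInstanceUID": "1.2.3.2", "Modality": "RTSTRUCT", "SeriesDescription": "Seed Point"},
    {"SeriesInstanceUID": "1.2.3.3", "Modality": "RTSTRUCT", "SeriesDescription": "Tumor Segmentation"},
    {"SeriesInstanceUID": "1.2.3.4", "Modality": "RTSTRUCT", "SeriesDescription": None},
]

segmentation_uids = [r["SeriesInstanceUID"] for r in sample_rows if is_segmentation_series(r)]
print(segmentation_uids)
```

Note that a row with a missing description is kept, which mirrors the `fillna('')` step in the pandas version.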

To determine the Reference Series UID of the image data that goes with this segmentation you can use **nbia.getSegRefSeries()**.

In [None]:
refSeries = nbia.getSegRefSeries(segSeries)

print(refSeries)

Next let's download these two series by passing their UIDs as a list.

In [None]:
nbia.downloadSeries([refSeries, segSeries], input_type= "list", format = 'df')

Now we can look at the images and segmentation together with **viewSeriesAnnotation()** from [simpleDicomViewer](https://pypi.org/project/simpleDicomViewer/).  Note that this function is only meant to be a quick and dirty way to preview the data.  There are more comprehensive solutions such as [3D Slicer](https://slicer.org/) or [itkWidgets](https://github.com/kirbyju/TCIA_Notebooks/blob/main/TCIA_RTStruct_SEG_Visualization_with_itkWidgets.ipynb) if you want to analyze the data.

In [None]:
# Assuming you didn't change the default download options for downloadSeries
imgPath = "tciaDownload/" + refSeries

# The annotation path has to be a file name (not directory name).  Since there is generally
# only one file in a segmentation series we can assume it will always be called 1-1.dcm
segPath = "tciaDownload/" + segSeries + "/1-1.dcm"

# Display the viewer
dicomViewer.viewDicom(imgPath, segPath)
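
The cell above assumes the segmentation file is always named 1-1.dcm. If you would rather not rely on that, one option is to look inside the series folder and confirm it holds exactly one DICOM file. The sketch below uses only the standard library; `find_single_dicom` is a hypothetical helper, and the temporary folder stands in for a real `tciaDownload/<SeriesInstanceUID>` directory.

```python
from pathlib import Path
import tempfile

def find_single_dicom(series_dir):
    """Return the only .dcm file in series_dir, or raise if there is not exactly one."""
    files = sorted(Path(series_dir).glob("*.dcm"))
    if len(files) != 1:
        raise ValueError(f"Expected exactly one .dcm file in {series_dir}, found {len(files)}")
    return files[0]

# Demonstration with a temporary directory standing in for tciaDownload/<SeriesInstanceUID>
with tempfile.TemporaryDirectory() as tmp:
    series_dir = Path(tmp) / "1.2.3.4"  # hypothetical SeriesInstanceUID
    series_dir.mkdir()
    (series_dir / "1-1.dcm").touch()
    found_name = find_single_dicom(series_dir).name
    print(found_name)
```

In the notebook this could be used as `segPath = str(find_single_dicom("tciaDownload/" + segSeries))`, which fails loudly instead of passing a wrong path to the viewer.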

# 5 Exploring the annotation metadata

Now let's take a look at the annotation metadata spreadsheet the authors provided.  You can download it manually from the **Annotation Summary** links in section 2 of the notebook or retrieve the one relevant to your selected collection directly into a dataframe with the code below.

In [None]:
metadata_urls = {
    "CPTAC-CCRCC": "https://www.cancerimagingarchive.net/wp-content/uploads/Metadata_Report_CPTAC-CCRCC_2023_07_14.csv",
    "CPTAC-PDA": "https://www.cancerimagingarchive.net/wp-content/uploads/Metadata_Report_CPTAC-PDA_2023_07_14.csv",
    "CPTAC-HNSCC": "https://www.cancerimagingarchive.net/wp-content/uploads/Metadata_Report_CPTAC-HNSCC_2023_07_14.csv",
    "CPTAC-UCEC": "https://www.cancerimagingarchive.net/wp-content/uploads/Metadata_Report_CPTAC-UCEC_2023_07_14.csv"
}

if collection in metadata_urls:
    spreadsheet_url = metadata_urls[collection]
    annotation_Metadata = pd.read_csv(spreadsheet_url)
    display(annotation_Metadata)
else:
    print("URL for collection not found.")

## Visualize StructureSetLabels
Let's take a look at a couple of plots to help us understand the available data. First we'll look at the structure set labels together with Clinical Trial Timepoint for each subject.

In [None]:
# Create the scatter plot
scatter_plot = alt.Chart(annotation_Metadata).mark_circle().encode(
    x=alt.X('PatientID', axis=alt.Axis(labelAngle=-45)),
    y='StructureSetLabel',
    color='ClinicalTrialTimePointID',
    tooltip=['PatientID', 'StudyDate', 'SeriesDescription', 'ROIVolume', 'ClinicalTrialTimePointID', 'StructureSetLabel']
).properties(
    title="Structure Set Label by Patient ID and Clinical Trial Time Point ID"
).configure_axis(
    grid=True
).interactive()

# Show the scatter plot
scatter_plot


## Visualize Tumor Volumes
Next let's compare the tumor volumes for each patient at each available Clinical Trial Timepoint.

In [None]:
# Convert the ROIVolume column to string type
annotation_Metadata['ROIVolume'] = annotation_Metadata['ROIVolume'].astype(str)

# Remove the "cc" text from each value
annotation_Metadata['ROIVolume'] = annotation_Metadata['ROIVolume'].str.replace(' cc', '')

# Convert the ROIVolume column to float type
annotation_Metadata['ROIVolume'] = pd.to_numeric(annotation_Metadata['ROIVolume'], errors='coerce')

# Create the scatter plot
scatter_plot = alt.Chart(annotation_Metadata).mark_circle().encode(
    x=alt.X('PatientID', axis=alt.Axis(labelAngle=-45)),
    y='ROIVolume',
    color='StructureSetLabel',
    tooltip=['PatientID', 'StudyDate', 'SeriesDescription', 'ROIVolume', 'ClinicalTrialTimePointID', 'StructureSetLabel']
).properties(
    title="ROI Volume by Patient ID and Clinical Trial Time Point ID"
).configure_axis(
    grid=True  # Add grid lines to the plot
).interactive()

# Show the scatter plot
scatter_plot
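
The cleanup above relies on every ROIVolume value ending in " cc" with a single space. As a slightly more forgiving alternative, the number can be pulled out with a regular expression. This is only a sketch: `parse_volume_cc` is a hypothetical helper, and the example strings are made up rather than taken from the spreadsheet.

```python
import re

# Matches an optionally signed decimal number with an optional exponent
NUMBER_PATTERN = re.compile(r"[-+]?(?:\d+\.?\d*|\.\d+)(?:[eE][-+]?\d+)?")

def parse_volume_cc(value):
    """Convert strings such as '12.5 cc' to a float; return None when no number is found."""
    match = NUMBER_PATTERN.search(str(value))
    return float(match.group()) if match else None

# Hypothetical inputs for illustration only
examples = ["12.5 cc", "0.8cc", "7", "nan", ""]
parsed = [parse_volume_cc(v) for v in examples]
print(parsed)
```

With pandas, the same function could be applied as `annotation_Metadata['ROIVolume'].map(parse_volume_cc)`.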

## Downloading specific subjects
If, after reviewing these charts, you want to download a specific subject's images and annotations, you can do so with the steps shown below.  Make sure to update the **patient** variable with the PatientID you want to download.

In [None]:
# set this to the PatientID you want to download
patient = "C3L-00395"

# Get instances of ReferencedSeriesInstanceUID and SeriesInstanceUID associated with the PatientID
unique_series = annotation_Metadata.loc[annotation_Metadata['PatientID'] == patient, ['ReferencedSeriesInstanceUID', 'SeriesInstanceUID']]

# Flatten it into a single list
unique_series_list = [item for sublist in unique_series.values for item in sublist]

# Remove any duplicates from the unique series list
unique_series_list = list(set(unique_series_list))

# download the series from the unique series list
nbia.downloadSeries(unique_series_list, input_type="list", format = "df")
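
One thing to be aware of is that `list(set(...))` does not preserve the original order, so the order of the UIDs may differ between runs. If a stable order matters to you, for example when only the first few series are downloaded, duplicates can be removed while keeping first-seen order. The sketch below uses a hypothetical helper, `flatten_unique`, with made-up UIDs; it also skips missing values.

```python
def flatten_unique(pairs):
    """Flatten (referenced UID, annotation UID) pairs, dropping blanks and duplicates in order."""
    flat = [uid for pair in pairs for uid in pair if isinstance(uid, str) and uid]
    # dict keys keep insertion order, so this removes duplicates without reordering
    return list(dict.fromkeys(flat))

# Hypothetical UID pairs for illustration only
pairs = [("1.2.3.10", "1.2.3.20"), ("1.2.3.10", "1.2.3.21"), (None, "1.2.3.22")]
unique_uids = flatten_unique(pairs)
print(unique_uids)
```

In the notebook, `flatten_unique(unique_series.values)` would take the place of the flatten and `set()` steps.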


## Downloading in bulk
You can also download data in bulk by filtering the dataframe for whatever you're interested in and feeding the resulting UIDs to the nbia.downloadSeries() function.  

However, since these types of queries will result in much larger numbers of series to download, we'll set the **number** parameter in **downloadSeries()** so that it just grabs the first couple of series for testing purposes.  Remove this parameter in the code below to download everything.

For this example, let's assume a researcher is interested in grabbing all of the seed point annotations and the images used to create them.  You can change this to whatever other criteria you like if you want a different subset of data.  

In [None]:
# specify how many test series you'd like to download
number = 2

# Get instances of ReferencedSeriesInstanceUID and SeriesInstanceUID associated with your query
unique_series = annotation_Metadata.loc[annotation_Metadata['Annotation Type'] == "Seed point", ['ReferencedSeriesInstanceUID', 'SeriesInstanceUID']]

# Flatten it into a single list
unique_series_list = [item for sublist in unique_series.values for item in sublist]

# Remove any duplicates from the unique series list
unique_series_list = list(set(unique_series_list))

# download the series from the unique series list
nbia.downloadSeries(unique_series_list, input_type="list", format = "df", number = number)

# 6 Computing radiomic features with MIRP

Now let's use [Medical Image Radiomics Processor (MIRP)](https://github.com/oncoray/mirp) to compute some radiomic features from the tumor segmentations.  We'll also merge in some of the metadata from the original images (saved to the **series** dataframe earlier in the notebook) that were annotated so we can see that side by side with the annotation details.

In [None]:
# Make a copy of annotation_Metadata and drop the "DICOM Type" column
radiomics_metadata = annotation_Metadata.drop(columns=['DICOM Type'])

# Exclude rows where the "Annotation Type" does not equal "Segmentation"
radiomics_metadata = radiomics_metadata[radiomics_metadata['Annotation Type'] == 'Segmentation']

# Select the series columns we want to merge
refSeries = series[['SeriesInstanceUID', 'Modality', 'BodyPartExamined', 'SeriesDescription', 'Manufacturer', 'ManufacturerModelName']]

# Rename the columns in 'refSeries' dataframe
refSeries = refSeries.rename(columns={
    'SeriesInstanceUID': 'ReferencedSeriesInstanceUID',
    'Modality': 'ReferencedSeriesModality',
    'BodyPartExamined': 'ReferencedSeriesBodyPartExamined',
    'SeriesDescription': 'ReferencedSeriesDescription',
    'Manufacturer': 'ReferencedSeriesManufacturer',
    'ManufacturerModelName': 'ReferencedSeriesManufacturerModelName'
})

# Merge 'radiomics_metadata' and 'refSeries' on the matching column
radiomics_metadata = pd.merge(radiomics_metadata, refSeries, how='left', on='ReferencedSeriesInstanceUID')

radiomics_metadata

Use the code below to specify whether you want to test with a few sample scans or process the entire dataset. This will download both the relevant RTSTRUCT segmentation(s) and the image series used to create them.

In [None]:
# set the number of segmentations or use None to download the full set of scans
num_segs = 3

# selected_segs is reused by later cells, so define it in both cases
if num_segs is None:
    selected_segs = radiomics_metadata
else:
    selected_segs = radiomics_metadata.head(num_segs)

uids_list = list(selected_segs['SeriesInstanceUID']) + list(selected_segs['ReferencedSeriesInstanceUID'])

# Pass the uids_list to nbia.downloadSeries() for downloading
nbia.downloadSeries(uids_list, input_type = "list")



In [None]:
selected_segs

## Assessing image metadata
Image metadata are important for understanding the image and how it was acquired and reconstructed. MIRP provides a function called **extract_image_parameters()** to help with this.

In [None]:
# Create an empty DataFrame to store the concatenated results
all_parameters = pd.DataFrame()

# Loop through the selected_segs DataFrame
for index, row in selected_segs.iterrows():
    # Set image_path for each row
    image_path = "tciaDownload/" + row['ReferencedSeriesInstanceUID']
    # Extract parameters for each image series
    parameters = extract_image_parameters(image = image_path)

    # Concatenate the results into the all_parameters DataFrame
    all_parameters = pd.concat([all_parameters, parameters])

all_parameters

## Computing image features
Next we will compute our image features.  Note that certain optional parameters and pre-processing steps may be required to achieve meaningful feature results.  Since the goal of this notebook is to demonstrate a basic workflow for loading TCIA data into MIRP (not to provide a tutorial in radiomic feature analysis), we'll stick to computing morphological features which are mostly unaffected by these more advanced topics.

In [None]:
# Create an empty list to store the DataFrames
all_features_list = []

# Loop through the selected_segs DataFrame
for index, row in selected_segs.iterrows():
    # Set image_path and seg_path for each row
    image_path = "tciaDownload/" + row['ReferencedSeriesInstanceUID']
    seg_path = "tciaDownload/" + row['SeriesInstanceUID'] + "/1-1.dcm"

    # Extract features for each image series and its segmentation
    features = extract_features(
        image=image_path,
        mask=seg_path,
        by_slice=True,
        base_feature_families=["morphology"],
        #intensity_normalisation="standardisation",
        #new_spacing=1.0,
        #base_discretisation_method="fixed_bin_number",
        #base_discretisation_n_bins=16
    )

    # Extend the list of DataFrames
    all_features_list.extend(features)

# Concatenate all DataFrames in the list
all_features = pd.concat(all_features_list, ignore_index=True)
all_features


# 7 Accessing clinical data from NCI's Genomic Data Commons (GDC)

There is a significant amount of supporting genomic and clinical data for these subjects in the [Genomic Data Commons](https://portal.gdc.cancer.gov/).  Here we'll provide an example showing how to obtain all available clinical data and merge it with the subset of subjects who also have annotated radiology images.  

If you have any questions about this section, please consult their documentation at https://docs.gdc.cancer.gov/API/Users_Guide/Getting_Started/ or contact their helpdesk at support@nci-gdc.datacommons.io.

To begin, let's download all clinical data they have available associated with the CPTAC-3 project.


In [None]:
cases_endpt = 'https://api.gdc.cancer.gov/cases'

filters = {
    "op": "in",
    "content":{
        "field": "project.project_id",
        "value": ["CPTAC-3"]
        }
    }

fields = [
    "submitter_id",
    ]

fields = ','.join(fields)

expand = [ ## For the allowable values for this list, look under "mapping" at https://api.gdc.cancer.gov/cases/_mapping
    "demographic",
    "diagnoses",
    "diagnoses.treatments",
    "exposures",
    "family_histories"
    ]

expand = ','.join(expand)

params = {
    "filters": json.dumps(filters),
    "expand": expand,
    "fields": fields,
    "format": "TSV", ## This can be "JSON" too
    "size": "2000", ## If you are including several projects, I would recommend playing with this and the "from" number.
    "from":"0"
    }

response = requests.get(cases_endpt, params = params)

output = response.content.decode('UTF-8')
clinicalDf = pd.read_csv(io.StringIO(output), sep='\t')

#clinicalDf
#clinicalDf.to_csv("gdc-cptac-clinical.csv")
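
The request above asks for up to 2000 rows in a single call, so any cases beyond that limit would not be returned. When a query could exceed `size`, the usual approach is to page through the results by advancing `from`. The sketch below shows that loop in isolation. It assumes the JSON form of the response, with the rows under `data.hits` and the overall count under `data.pagination.total`; please check the GDC documentation before relying on those names. `fetch_all_hits` and `fake_fetch` are hypothetical helpers, and the fake data lets the example run offline.

```python
def fetch_all_hits(fetch_page, page_size=500):
    """Collect every hit by advancing the offset until the reported total is reached.

    fetch_page(offset, size) must return the parsed JSON body of one request.
    """
    hits, offset = [], 0
    while True:
        data = fetch_page(offset, page_size)["data"]
        hits.extend(data["hits"])
        offset += len(data["hits"])
        # Stop on an empty page or once every reported record has been collected
        if not data["hits"] or offset >= data["pagination"]["total"]:
            return hits

# Stand-in for the real API so the sketch runs without network access
fake_cases = [{"submitter_id": f"CASE-{i}"} for i in range(7)]

def fake_fetch(offset, size):
    page = fake_cases[offset:offset + size]
    return {"data": {"hits": page, "pagination": {"total": len(fake_cases)}}}

all_cases = fetch_all_hits(fake_fetch, page_size=3)
print(len(all_cases))
```

A real `fetch_page` would call `requests.get(cases_endpt, params=...)` with `format` set to JSON and `size` and `from` set per page, then return `response.json()`.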

Now let's merge the clinical data with our image annotation data so that we're only looking at subjects where we have both.

In [None]:
# create new dataframe from annotation_Metadata with only unique IDs of patients with imaging
uniquePatients = pd.DataFrame(annotation_Metadata['PatientID'].unique(), columns=['PatientID'])

# Rename the patient id column in clinicalDf to match annotation_Metadata
clinicalDf = clinicalDf.rename(columns={'submitter_id': 'PatientID'})

# Merge the dataframes
mergedClinical = uniquePatients.merge(clinicalDf, how='left', on='PatientID')

mergedClinical

## Visualize missing clinical data

Let's investigate what types of clinical information are available and how often they are populated.  We'll drop all the columns where no information is provided and then visualize the number of times there are null values in the columns that remain.

In [None]:
# Drop columns with all NaN values from clinical data
cleanClinical = mergedClinical.dropna(axis=1, how='all')

null_counts = cleanClinical.isnull().sum()

null_df = null_counts.reset_index()
null_df.columns = ['Column', 'Null Count']

chart = alt.Chart(null_df).mark_bar().encode(
    x=alt.X('Column', axis=alt.Axis(labelAngle=-45)),
    y='Null Count',
    tooltip=['Column', 'Null Count']  # Show these values on mouse-over
).interactive()  # Enable zooming and panning

chart.show()


Let's say we want to understand the racial distribution of patients before we start our analysis.  

In [None]:
# Convert the 'value_counts' to a DataFrame for Altair
counts_df = cleanClinical['demographic.race'].value_counts().reset_index()
counts_df.columns = ['race', 'count']

# Create a pie chart with Altair
chart = alt.Chart(counts_df).mark_arc().encode(
    theta=alt.Theta(field='count', type='quantitative', stack=True),  # Stack ensures the pie chart is a full circle
    color=alt.Color(field='race', type='nominal', legend=alt.Legend(title="Races")),  # Add a legend for color
    tooltip=['race', 'count']  # Optional: to display tooltips on hover
).properties(
    title='Subject distribution'
)

# Show the pie chart
chart.show()

Let's also take a look at age distribution.

In [None]:
counts_df = cleanClinical['diagnoses.0.age_at_diagnosis'].value_counts().reset_index()
counts_df.columns = ['age_at_diagnosis', 'count']

# Convert age from days to years
counts_df['age_at_diagnosis'] = counts_df['age_at_diagnosis'] / 365.25

# Create a histogram with Altair
hist = alt.Chart(counts_df).mark_bar().encode(
    x=alt.X('age_at_diagnosis:Q', bin=True),  # Quantitative scale with automatic binning
    y='sum(count):Q',  # Sum the per-age counts that fall into each bin
).properties(
    title='Age (years) at Diagnosis'
)

# Show the histogram
hist.show()
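
The conversion above treats age_at_diagnosis as a number of days and divides by 365.25 to get years. To make the binning step explicit, here is a small standard-library sketch that converts ages and counts patients per decade. `age_decade_counts` is a hypothetical helper and the sample ages are made up.

```python
from collections import Counter

DAYS_PER_YEAR = 365.25

def age_decade_counts(ages_in_days):
    """Convert ages from days to years and count patients per 10-year bin."""
    counts = Counter()
    for days in ages_in_days:
        if days is None:
            continue  # skip patients with no recorded age
        years = days / DAYS_PER_YEAR
        counts[int(years // 10) * 10] += 1
    return dict(sorted(counts.items()))

# Hypothetical ages in days, for illustration only
sample_ages = [21916, 22000, 25568, None, 18263]
decade_counts = age_decade_counts(sample_ages)
print(decade_counts)
```

Each key is the lower edge of a 10-year bin, so a key of 60 covers ages from 60 up to, but not including, 70.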

# 8 Genomic and Proteomic feature classification
The Clinical Proteomic Tumor Analysis Consortium (CPTAC) brings together leading centers nationwide in a comprehensive and coordinated effort to accelerate the understanding of the molecular basis of cancer through the application of large-scale proteome and genome analysis, or proteogenomics, and to address mechanisms of treatment response, resistance, or toxicity.  They maintain a list of the consortium's most [significant milestones and publications](https://proteomics.cancer.gov/resources/milestones-and-publications).  

Let's take a quick look at an example from their "[Pan-cancer proteogenomics connects oncogenic drivers to functional states](https://doi.org/10.1016/j.cell.2023.07.014)" study, which has a wealth of [supplemental classification data](https://www.cell.com/cms/10.1016/j.cell.2023.07.014/attachment/dcd2885a-81a2-44bd-ab35-a0440f1352b1/mmc1.xlsx) which might be interesting to try and predict using the image data.  

Unfortunately Cell blocks direct programmatic access to their supplemental data files, so I've created a copy of the "Table S1A" tab from that file and put it on Github for convenience.  Please make sure you consult the links above for the full dataset if you're going to do any actual research with this.  This is only meant to give you an example of what's available.

In [None]:
omics = pd.read_excel('https://github.com/kirbyju/TCIA_Notebooks/raw/refs/heads/main/CPTAC/cptac-pan-cancer-drivers-table-S1A.xlsx')

print(omics.columns.tolist())

display(omics)

In order to find out what radiology imaging data are available for these subjects let's start by saving the omic case IDs to a list.

In [None]:
omic_patients = omics['CASE_ID'].tolist()

Next we'll feed that case list to a tcia_utils function called **getSimpleSearchWithModalityAndBodyPartPaged()**.  This function lets you do pretty much everything you can do in the Simple Search screen at https://nbia.cancerimagingarchive.net. As a result, there are a lot of parameters, but we'll just use it to request all imaging data that corresponds to any of our omic patient IDs.

**Note:** if you want to include imaging data that contains faces in your search you must request access as described near the beginning of the notebook and log in with nbia.getToken() before executing the next query.

In [None]:
# optional step -- use if you want to include restricted data
nbia.getToken()

In [None]:
radiology_patients = nbia.getSimpleSearchWithModalityAndBodyPartPaged(patients = omic_patients, format = 'uids')

Before we jump into downloading the data, let's take a minute to inspect what this returned.  We can do that by feeding the UID list to **nbia.reportCollectionSummary()**, which will provide information about how many patients, studies, series, and images are part of your search results.

In [None]:
nbia.reportCollectionSummary(radiology_patients, input_type = 'list', format = 'df')

If you wanted a more detailed inventory of the available scans you can use **nbia.getSeriesList()** and refine the results to weed out things you're not interested in.  Let's say that we only care about reviewing CT images for our project.  All we need to do is drop everything that doesn't have a CT modality and then pass the resulting dataframe to **nbia.downloadSeries()** to save the final dataset as shown below.

In [None]:
scan_inventory = nbia.getSeriesList(radiology_patients, format = 'df')

In [None]:
# Filter out rows where 'Modality' is not 'CT'
ct_radiology = scan_inventory[scan_inventory['Modality'] == 'CT']

ct_radiology

In [None]:
# downloadSeries expects SeriesInstanceUID to be the column name if you feed it a dataframe
ct_radiology = ct_radiology.rename(columns={'Series ID': 'SeriesInstanceUID'})

# we'll use number = 1 to just download a single example CT scan
nbia.downloadSeries(ct_radiology, input_type = 'df', number = 1)

# Acknowledgements
TCIA is funded by the [Cancer Imaging Program (CIP)](https://imaging.cancer.gov/), a part of the United States [National Cancer Institute (NCI)](https://www.cancer.gov/).  It is managed by the [Frederick National Laboratory for Cancer Research (FNLCR)](https://frederick.cancer.gov/) and hosted by the [University of Arkansas for Medical Sciences (UAMS)](https://www.uams.edu/).

This notebook was created by [Justin Kirby](https://www.linkedin.com/in/justinkirby82/). If you leverage this notebook or any TCIA datasets in your work, please be sure to comply with the [TCIA Data Usage Policy](https://wiki.cancerimagingarchive.net/x/c4hF). In particular, make sure to cite the DOI(s) for the specific TCIA datasets you used in addition to the following paper!

## TCIA Citation

Clark, K., Vendt, B., Smith, K., Freymann, J., Kirby, J., Koppel, P., Moore, S., Phillips, S., Maffitt, D., Pringle, M., Tarbox, L., & Prior, F. (2013). The Cancer Imaging Archive (TCIA): Maintaining and Operating a Public Information Repository. Journal of Digital Imaging, 26(6), 1045–1057. https://doi.org/10.1007/s10278-013-9622-7