You can download and run this notebook locally, or you can run it for free in a cloud environment using Colab or SageMaker Studio Lab:

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/kirbyju/TCIA_Notebooks/blob/main/TCGA/TCGA_Clinical.ipynb)

[![Open In SageMaker Studio Lab](https://studiolab.sagemaker.aws/studiolab.svg)](https://studiolab.sagemaker.aws/import/github.com/kirbyju/TCIA_Notebooks/blob/main/TCGA/TCGA_Clinical.ipynb)

# Summary

Access to large, high-quality data is essential for researchers to understand disease and precision medicine pathways, especially in cancer. However, HIPAA constraints make sharing medical images outside an individual institution a complex process. [The Cancer Imaging Archive (TCIA)](https://www.cancerimagingarchive.net/) is a public service funded by the National Cancer Institute that addresses this challenge by providing hosting and de-identification services, taking major burdens of data sharing off researchers.

[The Cancer Genome Atlas (TCGA)](https://www.cancer.gov/ccg/research/genome-sequencing/tcga) began in 2006 as a three-year pilot jointly sponsored by the National Cancer Institute (NCI) and National Human Genome Research Institute (NHGRI). The TCGA pilot project (focused initially on glioblastoma, ovarian, and lung cancers) confirmed that an atlas of genomic changes could be constructed for specific cancer types. It also showed that national networks of research and technology teams working on related projects could pool their efforts, create an economy of scale, and develop an infrastructure for making the data publicly accessible. The success of that pilot encouraged the National Institutes of Health (NIH) to invest in TCGA's efforts to collect and characterize more than 20 additional tumor types and make findings freely accessible for researchers to download.

The genomic data, clinical data and histopathology images from the project are available via NCI's [Genomic Data Commons (GDC)](https://gdc.cancer.gov/). NCI's Cancer Imaging Program subsequently leveraged the agreements with TCGA Tissue Source Sites to collect [clinical diagnostic images from these subjects](https://wiki.cancerimagingarchive.net/x/sgEe) and make them available on The Cancer Imaging Archive (TCIA). Combining the imaging data from TCIA with the other data types collected by TCGA gave rise to a research community focused on connecting cancer phenotypes to genotypes, which has produced over one hundred peer-reviewed publications about these data.

**This notebook is focused on using the clinical data available on the GDC to create a cohort of interest and then obtaining the related radiology data for those subjects from TCIA.** If you're interested in additional TCIA notebooks and coding examples check out https://github.com/kirbyju/TCIA_Notebooks.

# Setup
We'll leverage https://pypi.org/project/tcia-utils/ to make it easier to access TCIA data via its APIs, as well as pandas and NumPy for working with the clinical data from GDC.

In [None]:
!pip install --upgrade -q tcia-utils
!pip install -q pandas
!pip install -q numpy

In [None]:
import pandas as pd
import numpy as np
from tcia_utils import nbia

This link to the [clinical data](https://portal.gdc.cancer.gov/repository?facetTab=files&filters=%7B%22op%22%3A%22and%22%2C%22content%22%3A%5B%7B%22op%22%3A%22in%22%2C%22content%22%3A%7B%22field%22%3A%22cases.project.program.name%22%2C%22value%22%3A%5B%22TCGA%22%5D%7D%7D%2C%7B%22op%22%3A%22in%22%2C%22content%22%3A%7B%22field%22%3A%22cases.project.project_id%22%2C%22value%22%3A%5B%22TCGA-BLCA%22%2C%22TCGA-BRCA%22%2C%22TCGA-CESC%22%2C%22TCGA-COAD%22%2C%22TCGA-ESCA%22%2C%22TCGA-GBM%22%2C%22TCGA-HNSC%22%2C%22TCGA-KICH%22%2C%22TCGA-KIRC%22%2C%22TCGA-KIRP%22%2C%22TCGA-LGG%22%2C%22TCGA-LIHC%22%2C%22TCGA-LUAD%22%2C%22TCGA-LUSC%22%2C%22TCGA-OV%22%2C%22TCGA-PRAD%22%2C%22TCGA-READ%22%2C%22TCGA-SARC%22%2C%22TCGA-STAD%22%2C%22TCGA-THCA%22%2C%22TCGA-UCEC%22%5D%7D%7D%2C%7B%22op%22%3A%22in%22%2C%22content%22%3A%7B%22field%22%3A%22files.data_category%22%2C%22value%22%3A%5B%22clinical%22%5D%7D%7D%2C%7B%22op%22%3A%22in%22%2C%22content%22%3A%7B%22field%22%3A%22files.data_format%22%2C%22value%22%3A%5B%22bcr%20biotab%22%5D%7D%7D%2C%7B%22op%22%3A%22in%22%2C%22content%22%3A%7B%22field%22%3A%22files.data_type%22%2C%22value%22%3A%5B%22Clinical%20Supplement%22%5D%7D%7D%5D%7D)  can be used to download all TCGA clinical files from the Genomic Data Commons for the cancer types where TCIA hosts the radiology images. As a convenience, the resulting TSV file has been uploaded [here](https://github.com/kirbyju/TCIA_Notebooks/raw/main/TCGA/clinical.tsv) but the GDC should be considered the authoritative copy in case of any updates or discrepancies.

# Import the clinical data
Next, let's read in the clinical TSV file and take a look at the contents.  Note that many of the columns contain missing values or are completely empty.

In [None]:
# Read the TSV file into a DataFrame
df = pd.read_csv("https://github.com/kirbyju/TCIA_Notebooks/raw/main/TCGA/clinical.tsv", delimiter='\t')
display(df)
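
The GDC export marks missing values with the placeholder string `'--`. As a minimal sketch (not part of the original workflow), the `na_values` parameter of `pd.read_csv` can treat that placeholder as missing at load time. The toy TSV below is invented for illustration, and the `age_at_index` column is an assumed example column rather than something this notebook relies on.

```python
import io
import pandas as pd

# Toy TSV standing in for clinical.tsv; all values are invented for illustration.
toy_tsv = (
    "case_submitter_id\tproject_id\tage_at_index\n"
    "TCGA-00-0001\tTCGA-LUAD\t61\n"
    "TCGA-00-0002\tTCGA-LUSC\t'--\n"
)

# na_values tells pandas to parse the placeholder string as NaN while reading
toy_df = pd.read_csv(io.StringIO(toy_tsv), sep="\t", na_values=["'--"])

# Count the missing values per column
print(toy_df.isna().sum())
```

Parsing the placeholder up front also lets pandas infer a numeric dtype for columns that would otherwise be read as text.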

# Dataset cleanup

Let's figure out which columns aren't well populated with useful data so we can drop them. We'll do this by creating two CSV files. The first summarizes how often each column is populated, and the second contains the clinical data with the completely empty columns removed.

In [None]:
# Replace the string "'--" with NaN values
df = df.replace("'--", np.nan)

# Calculate the count and percentage of populated values in each column
populated_count = df.notna().sum()
populated_percentage = (populated_count / len(df)) * 100

# Drop columns that are empty
columns_to_drop = populated_percentage[populated_percentage == 0].index
df = df.drop(columns_to_drop, axis=1)

# Format the populated percentage as a string rounded to two decimal places
populated_percentage = populated_percentage.round(2).astype(str) + '%'

# Combine non-missing count and percentage into a DataFrame
populated_info = pd.DataFrame({'Populated Count': populated_count, 'Populated Percentage': populated_percentage})

# Save the non-missing information to a CSV file
populated_info.to_csv('tcga_clinical_inventory.csv')

# Save the updated clinical CSV
df.to_csv('tcga_clean_clinical.csv')
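
To make the cleanup logic easier to follow, here is a small sketch of the same idea on a toy DataFrame. The values and the `days_to_recurrence` column are invented for illustration. It uses `notna().mean()` as a shortcut for the populated percentage and `dropna(axis=1, how="all")` as an alternative way to drop completely empty columns.

```python
import numpy as np
import pandas as pd

# Toy clinical table; values are invented and "'--" marks missing entries
toy = pd.DataFrame({
    "case_submitter_id": ["A", "B", "C", "D"],
    "primary_diagnosis": ["Adenocarcinoma, NOS", "'--", "Squamous cell carcinoma, NOS", "'--"],
    "days_to_recurrence": ["'--", "'--", "'--", "'--"],
})

# Convert the placeholder to NaN so pandas recognizes it as missing
toy = toy.replace("'--", np.nan)

# The mean of a boolean mask is the fraction of populated rows in each column
populated_pct = toy.notna().mean() * 100

# Drop the columns in which every value is missing
toy_clean = toy.dropna(axis=1, how="all")

print(populated_pct)
print(list(toy_clean.columns))
```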


Take a minute to look over the two spreadsheets to get a sense of what kind of data remains.  We'll go over a few examples in this notebook to show how to build a cohort based on the clinical data and imaging metadata, but as you can see from the results of these files, there are lots of other columns you could use to customize your analysis.  

# Cohort selection

As a starting point, let's investigate how many potential cases are available for each primary_diagnosis.

In [None]:
case_group = "primary_diagnosis"

# Create a DataFrame to store the count of unique case_submitter_id values for each primary_diagnosis
origin_count = pd.DataFrame(df.groupby(case_group)['case_submitter_id'].nunique()).reset_index()
origin_count.columns = [case_group, 'Count of Unique case_submitter_id']

# Sort the DataFrame by the count of case_submitter_id in descending order
origin_count = origin_count.sort_values('Count of Unique case_submitter_id', ascending=False)

# Reset the index of the DataFrame
origin_count = origin_count.reset_index(drop=True)

# Display the sorted DataFrame
display(origin_count)
origin_count.to_csv('tcga_' + case_group + '_counts.csv')

You can easily repeat this with other criteria by changing the **case_group** variable to a different column name in the CSVs we created.  Let's try grouping cases by **tissue_or_organ_of_origin** this time.

In [None]:
case_group = "tissue_or_organ_of_origin"

# Create a DataFrame to store the count of unique case_submitter_id values for each tissue_or_organ_of_origin
origin_count = pd.DataFrame(df.groupby(case_group)['case_submitter_id'].nunique()).reset_index()
origin_count.columns = [case_group, 'Count of Unique case_submitter_id']

# Sort the DataFrame by the count of case_submitter_id in descending order
origin_count = origin_count.sort_values('Count of Unique case_submitter_id', ascending=False)

# Reset the index of the DataFrame
origin_count = origin_count.reset_index(drop=True)

# Display the sorted DataFrame
display(origin_count)
origin_count.to_csv('tcga_' + case_group + '_counts.csv')
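
Since the two cells above differ only in the grouping column, the counting step could be wrapped in a small helper. The sketch below is one possible way to do that. `count_cases_by` is a hypothetical name (it is not part of pandas or tcia_utils), and the toy data is invented for illustration.

```python
import pandas as pd

def count_cases_by(frame, column, id_column="case_submitter_id"):
    """Count unique subjects per value of `column`, largest group first."""
    return (
        frame.groupby(column)[id_column]
        .nunique()                              # unique subjects per group
        .sort_values(ascending=False)           # largest groups first
        .rename("Count of Unique " + id_column)
        .reset_index()
    )

# Toy data: subject P1 appears twice but should only be counted once
toy = pd.DataFrame({
    "case_submitter_id": ["P1", "P1", "P2", "P3"],
    "tissue_or_organ_of_origin": ["Lung, NOS", "Lung, NOS", "Lung, NOS", "Breast, NOS"],
})

counts = count_cases_by(toy, "tissue_or_organ_of_origin")
print(counts)
```

Note that `groupby` skips rows where the grouping column is missing, so subjects without a value in that column are not counted.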


Now let's create a filtered dataframe that contains only the clinical data for the tissue type you're interested in.  You can leave the tissue_type variable below set to lung, or change it to match other rows in the previous dataframe. Note that the code is set up to allow partial, case-insensitive matches, so typing "lung" will catch all 5 records that contain "lung" somewhere within the **tissue_or_organ_of_origin** column.

In [None]:
# feel free to change this to other tissue types
tissue_type = "lung"

# Create dataframe for selected tissue type
tissue_type_df = df[df['tissue_or_organ_of_origin'].str.contains(tissue_type, case=False, na=False)]

display(tissue_type_df)
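
To illustrate how the partial matching behaves, here is a short sketch on a toy Series with invented values. `case=False` makes the match case-insensitive, and `na=False` treats missing values as non-matches so they can't break the boolean filter.

```python
import numpy as np
import pandas as pd

# Toy values standing in for the tissue_or_organ_of_origin column
sites = pd.Series([
    "Upper lobe, lung",
    "Lung, NOS",
    "Breast, NOS",
    np.nan,
    "Lower lobe, lung",
])

# Case-insensitive substring match; missing values count as "no match"
mask = sites.str.contains("lung", case=False, na=False)
matches = sites[mask].tolist()
print(matches)
```

By default `str.contains` interprets the search text as a regular expression, so if your search term includes characters such as parentheses or `+`, consider passing `regex=False`.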

Now let's see how many of these subjects have imaging data on TCIA.  To achieve this efficiently, first we'll figure out which TCGA **project_id** values are included in our **tissue_type_df**.  These project_id values are the same as what TCIA uses for its collection names.

Then we'll get a full inventory of the scans associated with those collections on TCIA and compare the GDC case_submitter_id with the TCIA PatientID to determine the overlap.  Here is where we'll take advantage of [tcia_utils](https://pypi.org/project/tcia-utils/) to download the scan metadata.

**Note:** TCIA datasets that contain images of the head require special permission due to the potential privacy risks associated with 3D facial reconstruction of such images.  As a result, if you would like to look at TCGA-LGG, TCGA-GBM or TCGA-HNSC datasets you must sign and submit a [TCIA Restricted License Agreement](https://wiki.cancerimagingarchive.net/download/attachments/4556915/TCIA%20Restricted%20License%2020220519.pdf?version=1&modificationDate=1652964581655&api=v2) to help@cancerimagingarchive.net before accessing the data.  After completing this process, you'll be able to easily access them by using the **getToken()** function in **tcia_utils** to create a token with your login credentials.

In [None]:
# Extract unique project_id values from tissue_type_df into a list
project_ids = tissue_type_df['project_id'].unique().tolist()

# Create list of restricted TCGA datasets to facilitate login process if necessary
restricted_datasets = ["TCGA-GBM", "TCGA-LGG", "TCGA-HNSC"]

if any(dataset in project_ids for dataset in restricted_datasets):
    api_url = "restricted"
    print("You're attempting to access datasets from TCIA that require special permission. Please log in to create an access token:")
    nbia.getToken()
else:
    api_url = ""

# Create an empty list to store the DataFrames for each project_id
dataframes = []

# Iterate through the project_ids list and download the scan metadata for each project/collection
for project_id in project_ids:
    # Call the nbia.getSeries(project_id, format="df") function and store the resulting DataFrame
    series_df = nbia.getSeries(project_id, format="df", api_url = api_url)
    # Append the DataFrame to the dataframes list
    dataframes.append(series_df)

# Concatenate the DataFrames in the dataframes list into a single DataFrame called tcia_scan_inventory
# (ignore_index=True gives the combined DataFrame a fresh, unique row index)
tcia_scan_inventory = pd.concat(dataframes, ignore_index=True)

# Extract unique PatientID values from tcia_scan_inventory
patient_ids = tcia_scan_inventory['PatientID'].unique()

# Filter tissue_type_df to include only rows where case_submitter_id is in patient_ids
clinical_cases_with_radiology = tissue_type_df[tissue_type_df['case_submitter_id'].isin(patient_ids)]
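
If you'd like to see the overlap in both directions (subjects with clinical data but no imaging, and vice versa), plain Python sets are a simple option. This is a sketch on toy data that needs no network access. The identifiers are invented, and only the column names **case_submitter_id** and **PatientID** come from this notebook.

```python
import pandas as pd

# Toy stand-ins for tissue_type_df and tcia_scan_inventory
clinical = pd.DataFrame({
    "case_submitter_id": ["TCGA-AA-0001", "TCGA-AA-0002", "TCGA-AA-0003"],
})
scans = pd.DataFrame({
    "PatientID": ["TCGA-AA-0001", "TCGA-AA-0001", "TCGA-AA-0003", "TCGA-AA-0009"],
})

clinical_ids = set(clinical["case_submitter_id"])
imaging_ids = set(scans["PatientID"])

both = clinical_ids & imaging_ids           # subjects with clinical and imaging data
clinical_only = clinical_ids - imaging_ids  # clinical data but no scans
imaging_only = imaging_ids - clinical_ids   # scans but no matching clinical rows

print(len(both), len(clinical_only), len(imaging_only))
```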


In [None]:
# Display the clinical_cases_with_radiology DataFrame and save a CSV
display(clinical_cases_with_radiology)
clinical_cases_with_radiology.to_csv("tcga_patients_with_tcia_radiology.csv")

Note that subjects often appear in multiple rows.  This seems to be mostly because their information is repeated if they've had both pharmaceutical and radiation therapy treatments (see the treatment_type column). Let's run a quick check to see how many unique subjects remain for us to analyze.

In [None]:
# Count the unique values in the case_submitter_id column
unique_count = clinical_cases_with_radiology['case_submitter_id'].nunique()

# Display the count
print("Number of unique case_submitter_id values:", unique_count)
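
If your analysis needs exactly one row per subject, one option is to collapse the repeated rows and keep the distinct treatment types as a single text field. The sketch below shows the idea on toy data. The values are invented, and you may prefer a different aggregation for your own columns.

```python
import pandas as pd

# Toy rows: subject P1 is repeated once per treatment type
rows = pd.DataFrame({
    "case_submitter_id": ["P1", "P1", "P2"],
    "primary_diagnosis": ["Adenocarcinoma, NOS", "Adenocarcinoma, NOS", "Squamous cell carcinoma, NOS"],
    "treatment_type": ["Radiation Therapy, NOS", "Pharmaceutical Therapy, NOS", "Radiation Therapy, NOS"],
})

# One row per subject: keep the first diagnosis and join the distinct treatment types
per_subject = rows.groupby("case_submitter_id", as_index=False).agg(
    primary_diagnosis=("primary_diagnosis", "first"),
    treatment_types=("treatment_type", lambda s: "; ".join(sorted(s.dropna().unique()))),
)

print(per_subject)
```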

Now let's take a look at what kind of imaging data are available for these subjects by reviewing the scan inventory dataframe we created earlier.  First we'll use a reporting function from tcia_utils to get a high-level understanding of the data, and then we'll display and export the metadata for each individual scan to CSV.

**Note:** The report below will include relevant image collections as well as 3rd party "analysis result" datasets related to those images (if any exist).  You can learn more about the various datasets by visiting their CollectionURI.

In [None]:
nbia.reportDoiSummary(tcia_scan_inventory, input_type = "df", api_url = api_url)

If you stuck with the lung example, you should see that there are 69 TCGA-LUAD (lung adenocarcinoma) subjects and 37 TCGA-LUSC (lung squamous cell) subjects that contain a mix of PET, CT and nuclear medicine modalities.

Now let's take a quick look at the scan-level report, where you can see a variety of additional info about each scan:

In [None]:
# Display the tcia_scan_inventory DataFrame and save it as a CSV
display(tcia_scan_inventory)
tcia_scan_inventory.to_csv('tcia_scan_inventory.csv')
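
For a quick summary of the scan-level table, you could also count subjects and scans per modality with a pandas `groupby`. This is a sketch on a toy inventory with invented identifiers. It uses only the **PatientID**, **Modality** and **SeriesInstanceUID** columns that appear elsewhere in this notebook.

```python
import pandas as pd

# Toy scan inventory; identifiers are invented for illustration
toy_inventory = pd.DataFrame({
    "PatientID": ["P1", "P1", "P2", "P1"],
    "Modality": ["CT", "CT", "CT", "PT"],
    "SeriesInstanceUID": ["1.2.3.1", "1.2.3.2", "1.2.3.3", "1.2.3.4"],
})

# Count unique subjects and unique scans for each modality
modality_summary = (
    toy_inventory.groupby("Modality")
    .agg(patients=("PatientID", "nunique"), series=("SeriesInstanceUID", "nunique"))
    .sort_values("series", ascending=False)
)

print(modality_summary)
```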

Let's pretend that you're only interested in a particular modality of imaging to analyze.  Feel free to leave it as "CT" if you're following along with our lung example, or try customizing the modality value to anything else you saw in the previous report.  

In [None]:
modality = "CT"

# Create dataframe for selected modality using an exact, case-insensitive match
# (a substring match on "CT" would also select "RTSTRUCT" series)
download_df = tcia_scan_inventory[tcia_scan_inventory['Modality'].str.upper() == modality.upper()]

display(download_df)
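
One thing to keep in mind when filtering on short modality codes is that a substring test can select more than you intended, since the letters "CT" also occur inside "RTSTRUCT". The sketch below compares the two approaches on a toy list of modality values (which values appear in your own inventory will vary).

```python
import pandas as pd

# Toy modality values
modalities = pd.Series(["CT", "PT", "RTSTRUCT", "NM", "ct"])

# Substring match: also selects RTSTRUCT because it contains the letters "CT"
substring_match = modalities.str.contains("CT", case=False, na=False)

# Exact, case-insensitive match: selects only CT series
exact_match = modalities.str.upper() == "CT"

print(modalities[substring_match].tolist())
print(modalities[exact_match].tolist())
```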

Finally, let's download the scans that match your modality of interest for your analysis.

**Note:** The download step includes a parameter called **number** which lets you set the number of scans to download.  This is useful for quick tests/demos.  If you'd like to download the full cohort of images you should remove this parameter.

In [None]:
# extract the SeriesInstanceUID column as a list variable
series_uids = download_df['SeriesInstanceUID'].tolist()

nbia.downloadSeries(series_uids, number = 1, api_url = api_url, input_type = "list")

# Acknowledgements
TCIA is funded by the [Cancer Imaging Program (CIP)](https://imaging.cancer.gov/), a part of the United States [National Cancer Institute (NCI)](https://www.cancer.gov/).  It is managed by the [Frederick National Laboratory for Cancer Research (FNLCR)](https://frederick.cancer.gov/) and hosted by the [University of Arkansas for Medical Sciences (UAMS)](https://www.uams.edu/).

This notebook was created by [Justin Kirby](https://www.linkedin.com/in/justinkirby82/). If you leverage this notebook or any TCIA datasets in your work, please be sure to comply with the [TCIA Data Usage Policy](https://wiki.cancerimagingarchive.net/x/c4hF). In particular, make sure to cite the DOI(s) for the specific TCIA datasets you used in addition to the following paper.

## TCIA Citation

Clark, K., Vendt, B., Smith, K., Freymann, J., Kirby, J., Koppel, P., Moore, S., Phillips, S., Maffitt, D., Pringle, M., Tarbox, L., & Prior, F. (2013). The Cancer Imaging Archive (TCIA): Maintaining and Operating a Public Information Repository. Journal of Digital Imaging, 26(6), 1045–1057. https://doi.org/10.1007/s10278-013-9622-7