You can download and run this notebook locally, or you can run it for free in a cloud environment using Colab or Sagemaker Studio Lab:

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/kirbyju/TCIA_Notebooks/blob/main/TCGA/TCGA_Clinical.ipynb)

[![Open In SageMaker Studio Lab](https://studiolab.sagemaker.aws/studiolab.svg)](https://studiolab.sagemaker.aws/import/github.com/kirbyju/TCIA_Notebooks/blob/main/TCGA/TCGA_Clinical.ipynb)

# Summary

Access to large, high-quality datasets is essential for researchers to understand disease and precision medicine pathways, especially in cancer. However, HIPAA constraints make sharing medical images outside an individual institution a complex process. [The Cancer Imaging Archive (TCIA)](https://www.cancerimagingarchive.net/) is a public service funded by the National Cancer Institute that addresses this challenge by providing hosting and de-identification services, taking major burdens of data sharing off researchers.

[The Cancer Genome Atlas (TCGA)](https://www.cancer.gov/ccg/research/genome-sequencing/tcga) began in 2006 as a three-year pilot jointly sponsored by the National Cancer Institute (NCI) and National Human Genome Research Institute (NHGRI). The TCGA pilot project (focused initially on glioblastoma, ovarian, and lung cancers) confirmed that an atlas of genomic changes could be constructed for specific cancer types. It also showed that national networks of research and technology teams working on related projects could pool their efforts, create an economy of scale, and develop an infrastructure for making the data publicly accessible. The success of that pilot encouraged the National Institutes of Health (NIH) to invest in TCGA's efforts to collect and characterize more than 20 additional tumor types and make the resulting data freely accessible for researchers to download.

The genomic data, clinical data, and histopathology images from the project are available via NCI's [Genomic Data Commons (GDC)](https://gdc.cancer.gov/). NCI's Cancer Imaging Program subsequently leveraged the agreements with TCGA Tissue Source Sites to collect [clinical diagnostic images from these subjects](https://wiki.cancerimagingarchive.net/x/sgEe) and make them available on The Cancer Imaging Archive (TCIA). By combining the imaging data from TCIA with the other data types collected by TCGA, a research community focused on connecting cancer phenotypes to genotypes was formed, resulting in over one hundred peer-reviewed publications about these data.

**This notebook is focused on using the clinical data available on the GDC to create a cohort of interest and then obtaining the related radiology data for those subjects from TCIA.** If you're interested in additional TCIA notebooks and coding examples check out https://github.com/kirbyju/TCIA_Notebooks.

# Setup
Install [tcia-utils](https://pypi.org/project/tcia-utils/) to make it easier to access TCIA data via its APIs.

In [None]:
import sys

!{sys.executable} -m pip install --upgrade -q tcia-utils

In [None]:
import sys
import pandas as pd
import numpy as np
import json
import io
import requests
import plotly.express as px
from tcia_utils import nbia

# set logging level to INFO in Google Colab (not necessary in Jupyter)
if 'google.colab' in sys.modules:
    import logging

    for handler in logging.root.handlers[:]:
        logging.root.removeHandler(handler)

    # Set handler with level = info
    logging.basicConfig(format='%(asctime)s:%(levelname)s:%(message)s',
                        level=logging.INFO)

    print("Google Colab Logging set to INFO")

# Accessing clinical data from NCI's Genomic Data Commons (GDC)

There is a significant amount of supporting genomic and clinical data for these subjects in the [Genomic Data Commons](https://portal.gdc.cancer.gov/).  If you have any questions about GDC, please consult their documentation at https://docs.gdc.cancer.gov/API/Users_Guide/Getting_Started/ and their helpdesk at support@nci-gdc.datacommons.io.

**Note:** TCIA datasets that contain images of the head require special permission due to the potential privacy risks associated with 3D facial reconstruction of such images.  As a result, if you would like to look at TCGA-LGG, TCGA-GBM or TCGA-HNSC datasets you must sign and submit a [TCIA Restricted License Agreement](https://wiki.cancerimagingarchive.net/download/attachments/4556915/TCIA%20Restricted%20License%2020220519.pdf?version=1&modificationDate=1652964581655&api=v2) to help@cancerimagingarchive.net before accessing the data.  After completing this process, you'll be able to easily access them by using the **getToken()** function in **tcia_utils** to create a token with your login credentials.


In [None]:
# If you want to include "restricted" collections (GBM, LGG, HNSC),
# you must log in first.
#
# Skip this step if you just want to anonymously access
# the fully public datasets.

nbia.getToken()

First let's create an inventory of the TCGA cancer types where imaging exists on TCIA.

In [None]:
# get list of all collections
collections_json = nbia.getCollections()
print(str(len(collections_json)) + " collections were found.")
collections = [item['Collection'] for item in collections_json]

# select only TCGA collections
collectionSubset = [item for item in collections if "TCGA" in item]
collections = collectionSubset
print(str(len(collections)) + " collections matched your subset.")
print(collections)

Next, let's query the GDC API to obtain the clinical data for these TCGA collections.

In [None]:
cases_endpt = 'https://api.gdc.cancer.gov/cases'

filters = {
    "op": "in",
    "content":{
        "field": "project.project_id",
        "value": collections
        }
    }

fields = [
    "project.project_id",
    "submitter_id",
    ]

fields = ','.join(fields)

expand = [ ## For the allowable values for this list, look under "mapping" at https://api.gdc.cancer.gov/cases/_mapping
    "demographic",
    "diagnoses",
    "diagnoses.treatments",
    "exposures",
    "family_histories"
    ]

expand = ','.join(expand)

params = {
    "filters": json.dumps(filters),
    "expand": expand,
    "fields": fields,
    "format": "TSV", ## This can be "JSON" too
    "size": "10000", ## If you are re-using this for other projects, you may need to modify this and the "from" number.
    "from":"0"
    }

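response = requests.get(cases_endpt, params = params)
# Stop here with a clear error if the GDC API rejected the request
response.raise_for_status()

# Parse the tab-separated response into a dataframe
output = response.content.decode('UTF-8')
clinicalDf = pd.read_csv(io.StringIO(output), sep='\t')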

clinicalDf
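
If you reuse this query for a larger set of projects, a single request with a fixed `size` may not return every case. The following is a minimal sketch of how the `size` and `from` parameters could be stepped to page through results. The helper names `build_case_params` and `page_offsets` are hypothetical, and the total of 2,500 cases is a made-up number for illustration. In practice you would take the total from the API response (the GDC JSON format is understood to report it under `data.pagination.total`, but please verify this against the GDC documentation). No request is sent by this sketch.

```python
import json


def build_case_params(project_ids, size, start):
    """Build query parameters for one page of the GDC /cases endpoint.

    Hypothetical helper: it mirrors the filter used in this notebook, but
    lets the caller choose the page size and the starting offset.
    """
    filters = {
        "op": "in",
        "content": {
            "field": "project.project_id",
            "value": list(project_ids),
        },
    }
    return {
        "filters": json.dumps(filters),
        "fields": "project.project_id,submitter_id",
        "format": "TSV",
        "size": str(size),
        "from": str(start),
    }


def page_offsets(total, size):
    """Return the starting offset of every page needed to cover `total` records."""
    return list(range(0, total, size))


# Made-up total for illustration; a real script would take it from the API.
offsets = page_offsets(total=2500, size=1000)
page_params = [
    build_case_params(["TCGA-LUAD", "TCGA-LUSC"], size=1000, start=offset)
    for offset in offsets
]

# Show the offset and page size of each request that would be made
for page in page_params:
    print(page["from"], page["size"])
```

Each entry of `page_params` could then be passed to `requests.get` in the same way as `params`, and the resulting pages concatenated with `pd.concat`.
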


Now let's merge the clinical data with our radiology data so that we're only looking at subjects where we have both.  We'll first need to pull the list of patient IDs for each cancer type from TCIA.

In [None]:
# get inventory of patients with radiology data
patients = pd.DataFrame()

for collection in collections:
    patientCollection = nbia.getPatient(collection)
    patients = pd.concat([patients, pd.DataFrame(patientCollection)], ignore_index=True)

patients

Now we'll reduce the clinical data to patients where we also have radiology images.

In [None]:
# create new dataframe from patients with only unique IDs of patients with imaging
uniquePatients = pd.DataFrame(patients['PatientId'].unique(), columns=['PatientId'])

# Rename the patient id column to match uniquePatients
clinicalDf = clinicalDf.rename(columns={'submitter_id': 'PatientId'})

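# Merge the dataframes, keeping every patient that has imaging.
# Imaging patients with no matching GDC clinical record (if any) will have NaN clinical values.
mergedClinical = uniquePatients.merge(clinicalDf, how='left', on='PatientId')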

mergedClinical
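
A left merge keeps every imaging patient, including any that have no clinical record. As a minimal sketch of how you might check for such cases, the example below uses the `indicator` option of `DataFrame.merge` on two small tables. The patient IDs and the gender values are made up for illustration and do not come from TCIA or the GDC.

```python
import pandas as pd

# Made-up patient IDs for illustration only
imaging = pd.DataFrame({"PatientId": ["TCGA-XX-0001", "TCGA-XX-0002", "TCGA-XX-0003"]})
clinical = pd.DataFrame({
    "PatientId": ["TCGA-XX-0001", "TCGA-XX-0003", "TCGA-XX-0004"],
    "demographic.gender": ["female", "male", "female"],
})

# indicator=True adds a "_merge" column describing where each row came from
checked = imaging.merge(clinical, how="left", on="PatientId", indicator=True)

# Imaging patients that found no clinical record
missing_clinical = checked.loc[checked["_merge"] == "left_only", "PatientId"].tolist()

# An inner join keeps only patients present in both tables
both = imaging.merge(clinical, how="inner", on="PatientId")

print(missing_clinical)
print(len(both))
```

If you want a table restricted strictly to patients with both imaging and clinical data, `how="inner"` is one option.
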

## Visualize clinical data

Let's investigate what types of clinical information are available and how often they are populated.  First we'll drop all the columns where no information is provided and then visualize the number of times there are null values in the columns that remain.

In [None]:
# Drop columns with all NaN values from clinical data
cleanClinical = mergedClinical.dropna(axis=1, how='all')

null_counts = cleanClinical.isnull().sum()

null_df = null_counts.reset_index()
null_df.columns = ['Column', 'Null Count']

# Create a bar chart using Plotly
fig = px.bar(null_df, x='Column', y='Null Count', title='Null Count per Column',
             labels={'Column': 'Column', 'Null Count': 'Null Count'},
             hover_data=['Column', 'Null Count'])

# Update layout for better readability
fig.update_layout(xaxis_tickangle=-45)

# Show the chart
fig.show()
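
Raw null counts can be hard to compare at a glance, so it may also help to look at the fraction of rows that are populated in each column. The sketch below shows one way to compute that on a small made-up table. The column names and values are invented for illustration.

```python
import pandas as pd

# Made-up table for illustration only
toy = pd.DataFrame({
    "age": [61, None, 70, 55],
    "stage": ["I", "II", None, None],
    "unused": [None, None, None, None],
})

# Same cleanup as in the notebook: drop columns that are empty for every row
toy_clean = toy.dropna(axis=1, how="all")

# Fraction of rows with a value in each remaining column, least populated first
fill_rate = toy_clean.notna().mean().sort_values()

print(fill_rate)
```
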


# Cohort selection

Let's investigate how many potential cases are available for each **tissue_or_organ_of_origin**.

In [None]:
case_group = "diagnoses.0.tissue_or_organ_of_origin"

# Create a DataFrame to store the count of unique PatientId values for each tissue_or_organ_of_origin
origin_count = pd.DataFrame(cleanClinical.groupby(case_group)['PatientId'].nunique()).reset_index()
origin_count.columns = [case_group, 'Count of Unique PatientId']

# Sort the DataFrame by the count of unique PatientId values in descending order
origin_count = origin_count.sort_values('Count of Unique PatientId', ascending=False)

# Reset the index of the DataFrame
origin_count = origin_count.reset_index(drop=True)

# Display the sorted DataFrame and save a spreadsheet
display(origin_count)
origin_count.to_csv('tcga_' + case_group + '_counts.csv')
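
To illustrate why the count uses `nunique()` rather than a plain row count, here is a small sketch on made-up data in which one patient appears on two rows. The IDs and the short column name `origin` are invented for illustration.

```python
import pandas as pd

# Made-up rows for illustration only; patient P1 appears twice
toy = pd.DataFrame({
    "PatientId": ["P1", "P1", "P2", "P3"],
    "origin": ["Lung, NOS", "Lung, NOS", "Lung, NOS", "Kidney, NOS"],
})

# nunique() counts each patient once per group, even if they have several rows
counts = (
    toy.groupby("origin")["PatientId"]
    .nunique()
    .sort_values(ascending=False)
    .reset_index(name="Count of Unique PatientId")
)

print(counts)
```
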

Now let's create a filtered dataframe that contains only the clinical data for the tissue type you're interested in.  You can leave the tissue_type variable below set to lung, or change it to match other rows in the previous dataframe. Note that the code is set up to allow partial, case-insensitive matches, so typing "lung" will catch all 5 entries that contain "lung" somewhere within the **tissue_or_organ_of_origin** column.

In [None]:
# feel free to change this to other tissue types
tissue_type = "lung"

# Create dataframe for selected tissue type
tissue_type_df = cleanClinical[cleanClinical['diagnoses.0.tissue_or_organ_of_origin'].str.contains(tissue_type, case=False, na=False)]

display(tissue_type_df)
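
As a quick illustration of how the partial match behaves, the sketch below applies the same `str.contains` options to a handful of made-up values, including a missing one.

```python
import pandas as pd

# Made-up values for illustration only
origins = pd.Series(["Upper lobe, lung", "Lung, NOS", "Kidney, NOS", None])

# case=False ignores capitalization; na=False treats missing values as non-matches
mask = origins.str.contains("lung", case=False, na=False)

print(origins[mask].tolist())
```
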

Now let's get a full inventory of the scans associated with those cancer types on TCIA.

In [None]:
# Extract unique project_id values from tissue_type_df into a list
project_ids = tissue_type_df['project.project_id'].unique().tolist()

# Create an empty list to store the DataFrames for each project_id
dataframes = []

# Iterate through the project_ids list and download the scan metadata for each project/collection
for project_id in project_ids:
    # Call the nbia.getSeries(project_id, format="df") function and store the resulting DataFrame
    series_df = nbia.getSeries(project_id, format="df")
    # Append the DataFrame to the dataframes list
    dataframes.append(series_df)

# Concatenate the DataFrames in the dataframes list into a single DataFrame called tcia_scan_inventory
tcia_scan_inventory = pd.concat(dataframes)

Let's take a look at what kind of imaging data are available for these subjects using a reporting function from tcia_utils to get a high-level understanding of the data.

**Note:** The report below will include relevant image collections as well as 3rd party "analysis result" datasets related to those images (if any exist).  You can learn more about the various datasets by visiting their CollectionURI.

In [None]:
nbia.reportDoiSummary(tcia_scan_inventory, input_type = "df")

If you stuck with the lung example, you should see that there are both TCGA-LUAD (lung adenocarcinoma) subjects and TCGA-LUSC (lung squamous cell carcinoma) subjects that contain a mix of PET, CT and nuclear medicine modalities.

Now let's take a quick look at the scan-level report, where you can see a variety of additional info about each scan:

In [None]:
# Display the tcia_scan_inventory DataFrame
display(tcia_scan_inventory)


Let's say that you're only interested in a particular modality of imaging to analyze.  Feel free to leave it as "CT" if you're following along with our lung example, or try customizing the modality value to anything else you saw in the previous report.  

In [None]:
modality = "CT"

# Create dataframe for selected modality.
# Use an exact, case-insensitive match: a substring match on "CT" would also select "RTSTRUCT".
download_df = tcia_scan_inventory[tcia_scan_inventory['Modality'].str.upper() == modality.upper()]

display(download_df)
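
Modality codes are short strings, so it is worth checking how a filter behaves with substring matching versus exact matching. The sketch below compares the two on a made-up list of modality values. It is only an illustration and does not query TCIA.

```python
import pandas as pd

# Made-up modality values for illustration only
modalities = pd.Series(["CT", "PT", "RTSTRUCT", "NM", None])

# Substring match: "RTSTRUCT" ends in "CT", so it is selected too
substring_hits = modalities[modalities.str.contains("CT", case=False, na=False)].tolist()

# Exact, case-insensitive match: only the "CT" entry is selected
exact_hits = modalities[modalities.str.upper() == "CT"].tolist()

print(substring_hits)
print(exact_hits)
```
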

Finally, let's download the scans that match the modality of interest for your analysis.

**Note:** The download step includes a parameter called **number** which lets you set the number of scans to download.  This is useful for quick tests/demos.  If you'd like to download the full cohort of images you should remove this parameter.

In [None]:
# extract the SeriesInstanceUID column as a list variable
series_uids = download_df['SeriesInstanceUID'].tolist()

nbia.downloadSeries(series_uids, number = 1, input_type = "list")

# Acknowledgements
TCIA is funded by the [Cancer Imaging Program (CIP)](https://imaging.cancer.gov/), a part of the United States [National Cancer Institute (NCI)](https://www.cancer.gov/).  It is managed by the [Frederick National Laboratory for Cancer Research (FNLCR)](https://frederick.cancer.gov/) and hosted by the [University of Arkansas for Medical Sciences (UAMS)](https://www.uams.edu/).

This notebook was created by [Justin Kirby](https://www.linkedin.com/in/justinkirby82/). If you leverage this notebook or any TCIA datasets in your work, please be sure to comply with the [TCIA Data Usage Policy](https://wiki.cancerimagingarchive.net/x/c4hF). In particular, make sure to cite the DOI(s) for the specific TCIA datasets you used in addition to the following paper.

## TCIA Citation

Clark, K., Vendt, B., Smith, K., Freymann, J., Kirby, J., Koppel, P., Moore, S., Phillips, S., Maffitt, D., Pringle, M., Tarbox, L., & Prior, F. (2013). The Cancer Imaging Archive (TCIA): Maintaining and Operating a Public Information Repository. Journal of Digital Imaging, 26(6), 1045–1057. https://doi.org/10.1007/s10278-013-9622-7