<a href="https://colab.research.google.com/github/ncihtan/Community-Notebooks/blob/master/HTAN/Python%20Notebooks/Identifying_HTAN_Data_Files_by_Organ_in_ISB_CGC.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#  Identifying HTAN Data Files by Organ in ISB-CGC

        Title:   Identifying HTAN Data Files by Organ in ISB-CGC
        Author:  Clarisse Lau (clau@systemsbiology.org)
        Created: October 2023

# 1. Introduction & Overview
[HTAN](https://humantumoratlas.org/) is a National Cancer Institute (NCI)-funded Cancer Moonshot initiative to construct 3-dimensional atlases of the dynamic cellular, morphological, and molecular features of human cancers as they evolve from precancerous lesions to advanced disease (see [Cell, April 2020](https://www.sciencedirect.com/science/article/pii/S0092867420303469)).

Clinical data, sample biospecimen data and assay files in HTAN have a rich set of annotations supplied by HTAN data contributors. These annotations are made according to the [HTAN Data Model](https://data.humantumoratlas.org/standards), a set of standards defined by the HTAN consortium. The supplied values of these attributes have been collected into comprehensive Google BigQuery tables in the cloud.

This notebook will demonstrate how users can identify and access assay data for a particular organ or cancer type using Google BigQuery metadata tables.

## 1.1 Goal
In this notebook we introduce the tables and metadata attributes needed to identify data files for a particular organ type, and illustrate how these files can be accessed in BigQuery to enable further analysis.  

## 1.2 Inputs, Outputs, & Data
The originating data can be found on the [HTAN Data Portal](https://data.humantumoratlas.org/), and the compiled tables are on the [Cancer Gateway in the Cloud](https://isb-cgc.appspot.com/).

## 1.3 Notes
The tables correspond to HTAN Data Version 4.

# 2. Environment & Library Setup


In [1]:
# Import libraries
import pandas as pd

# 3. Google Authentication

Running the BigQuery cells in this notebook requires a Google Cloud Project. Instructions for creating a project can be found in the [Google Cloud Documentation](https://cloud.google.com/resource-manager/docs/creating-managing-projects#console). The instance needs to be authorized to bill the project for queries. For more information on getting started with ISB-CGC, see the [Quick Start Guide to ISB-CGC](https://isb-cancer-genomics-cloud.readthedocs.io/en/latest/sections/HowToGetStartedonISB-CGC.html). Alternative authentication methods can be found in the [Google Cloud Documentation](https://cloud.google.com/resource-manager/docs/creating-managing-projects#console).

## 3.1 Authenticating with Google Credentials


#### Option 1. Running in Google Colab

If you are using Google Colab, run the code block below to authenticate.

In [2]:
from google.colab import auth
auth.authenticate_user()

#### Option 2. Running on local machine

Alternatively, if you're running the notebook locally, take the following steps to authenticate.

1.   Run `gcloud auth application-default login` on your local machine.
2.   Uncomment and run the command below, replacing `<path to key>` with the path to your credentials file.

In [None]:
# %env GOOGLE_APPLICATION_CREDENTIALS=<path to key>
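
Before creating a client locally, it can help to confirm that the environment variable set above actually points at a file. The sketch below is one way to check; `describe_credentials` is a hypothetical helper introduced here for illustration, not part of any Google library. When the variable is unset, Google client libraries typically fall back to the credentials saved by `gcloud auth application-default login`, so "unset" is not necessarily an error.

```python
import os


def describe_credentials(env=None):
    """Report whether GOOGLE_APPLICATION_CREDENTIALS points at an existing file.

    `env` defaults to the process environment; passing a dict makes the
    helper easy to exercise without touching real settings.
    """
    env = os.environ if env is None else env
    path = env.get("GOOGLE_APPLICATION_CREDENTIALS")
    if not path:
        return "unset"
    return "ok" if os.path.isfile(path) else "missing-file"


# Check the current session.
print(describe_credentials())
```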

## 3.2 Initializing the Google BigQuery client




In [3]:
# Import the Google BigQuery client
from google.cloud import bigquery

# Replace <my-project> with the name of the Google project that will be billed for this notebook's computations
google_project = '<my-project>'

# Create a client to access the data within BigQuery
client = bigquery.Client(google_project)
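
As an alternative to editing the placeholder in the cell above, the billing project could be read from the environment. This is a minimal sketch, assuming the project ID is stored in the `GOOGLE_CLOUD_PROJECT` environment variable; `resolve_project` and `make_client` are hypothetical helpers, and `my-billing-project` is a made-up project ID.

```python
import os


def resolve_project(explicit=None, env=None):
    """Return the billing project: an explicit value wins, then the environment."""
    env = os.environ if env is None else env
    project = explicit or env.get("GOOGLE_CLOUD_PROJECT")
    # Reject an empty value or an unedited '<my-project>' style placeholder.
    if not project or project.startswith("<"):
        raise ValueError("Set a billing project before creating the BigQuery client")
    return project


def make_client(project=None):
    """Create a BigQuery client billed to the resolved project."""
    # Imported here so the helper above can be used without the library installed.
    from google.cloud import bigquery

    return bigquery.Client(project=resolve_project(project))


print(resolve_project("my-billing-project"))
```

Failing early on the placeholder value gives a clearer message than the error BigQuery would return on the first query.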

# 4. "Tissue or Organ of Origin" to Whole Organ Mapping Table

In the HTAN Clinical Diagnosis data model, the anatomic site of origin of the patient's malignant disease is recorded in the attribute "Tissue or Organ of Origin". This attribute currently has 332 possible values, which include both whole organs and parts of an organ.

For convenience, the HTAN Data Coordinating Center (DCC) has compiled a mapping of all "Tissue or Organ of Origin" values to 30 whole organ types. This mapping is provided as a BigQuery table `isb-cgc-bq.HTAN.tissueOrOrganOfOrigin_to_wholeOrgan_mapping_current`.

To begin, we will query this table and explore its contents.


In [4]:
primary_map = client.query("""
  SELECT * FROM `isb-cgc-bq.HTAN.tissueOrOrganOfOrigin_to_wholeOrgan_mapping_current`
""").result().to_dataframe()

primary_map

|     | Whole_Organ | Tissue_or_Organ_of_Origin |
| --- | --- | --- |
| 0 | Adrenal Gland | Adrenal gland NOS |
| 1 | Adrenal Gland | Cortex of adrenal gland |
| 2 | Adrenal Gland | Medulla of adrenal gland |
| 3 | Bile Duct | Ampulla of Vater |
| 4 | Bile Duct | Biliary tract NOS |
| ... | ... | ... |
| 327 | Uterus | Fundus uteri |
| 328 | Uterus | Isthmus uteri |
| 329 | Uterus | Myometrium |
| 330 | Uterus | Overlapping lesion of corpus uteri |
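
Because queries are billed to the configured project, it may be useful to estimate how much data a query would scan before running it. The sketch below uses BigQuery's dry-run mode for that; `estimate_scan` and `format_bytes` are hypothetical helper names, and `client` is assumed to be the BigQuery client created earlier.

```python
def format_bytes(n):
    """Render a byte count with a binary unit suffix, e.g. 1536 -> '1.5 KiB'."""
    size = float(n)
    for unit in ("B", "KiB", "MiB", "GiB", "TiB"):
        if size < 1024 or unit == "TiB":
            return f"{size:.1f} {unit}"
        size /= 1024


def estimate_scan(client, sql):
    """Dry-run `sql` and return the number of bytes BigQuery reports it would process."""
    # Imported here so format_bytes can be used without the library installed.
    from google.cloud import bigquery

    config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
    job = client.query(sql, job_config=config)  # validated only, not executed
    return job.total_bytes_processed


# Usage, assuming `client` exists in the session:
# sql = "SELECT * FROM `isb-cgc-bq.HTAN.tissueOrOrganOfOrigin_to_wholeOrgan_mapping_current`"
# print(format_bytes(estimate_scan(client, sql)))
```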


Let's take a look at the possible values for `Whole_Organ`:

In [5]:
set(primary_map['Whole_Organ'])

{'Adrenal Gland',
 'Bile Duct',
 'Bladder',
 'Bone',
 'Bone Marrow',
 'Brain',
 'Breast',
 'Cervix',
 'Colorectal',
 'Esophagus',
 'Eye',
 'Head and Neck',
 'Kidney',
 'Liver',
 'Lung',
 'Lymph Nodes',
 'Nervous System',
 'Not Reported',
 'Other and Ill-defined Sites',
 'Ovary',
 'Pancreas',
 'Pleura',
 'Prostate',
 'Skin',
 'Soft Tissue',
 'Stomach',
 'Testis',
 'Thymus',
 'Thyroid',
 'Uterus'}

In [None]:
len(set(primary_map['Whole_Organ']))

30

We see that each of the 332 Tissue or Organ of Origin values is mapped to one of 30 whole organ types.
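
To make the many-to-one structure of the mapping concrete, the sketch below groups tissue values under their whole organ and performs the reverse lookup. It runs on a handful of rows copied from the mapping output shown earlier, so it needs no BigQuery access; `group_by_organ` and `organ_for` are hypothetical helper names.

```python
from collections import defaultdict

# (Whole_Organ, Tissue_or_Organ_of_Origin) pairs copied from the mapping output.
sample_rows = [
    ("Adrenal Gland", "Adrenal gland NOS"),
    ("Adrenal Gland", "Cortex of adrenal gland"),
    ("Adrenal Gland", "Medulla of adrenal gland"),
    ("Bile Duct", "Ampulla of Vater"),
    ("Bile Duct", "Biliary tract NOS"),
]


def group_by_organ(rows):
    """Collect the tissue values that map to each whole organ."""
    grouped = defaultdict(list)
    for organ, tissue in rows:
        grouped[organ].append(tissue)
    return dict(grouped)


def organ_for(tissue, rows):
    """Reverse lookup: return the whole organ for a tissue value, or None."""
    lookup = {t: o for o, t in rows}
    return lookup.get(tissue)


for organ, tissues in group_by_organ(sample_rows).items():
    print(organ, len(tissues))
```

With the full table loaded, the same pairs could be drawn from the dataframe with `primary_map[["Whole_Organ", "Tissue_or_Organ_of_Origin"]].itertuples(index=False, name=None)`, or the per-organ counts obtained directly with `primary_map.groupby("Whole_Organ").size()`.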

# 5. Identifying Breast Cancer Data or Precancer Data in ISB-CGC

Perhaps you would like to know whether assay data for breast cancer is available in ISB-CGC. The HTAN DCC has made nearly 800 Single Cell and Imaging Level 4 assay data files available in ISB-CGC. Are any of these files derived from breast cancer patients?

We will use four BigQuery tables, linked through a series of nested subqueries, to build a SQL query that answers this question and gives access to the data in ISB-CGC.

We start with the `isb-cgc-bq.HTAN.tissueOrOrganOfOrigin_to_wholeOrgan_mapping_current` table, this time updating our query to select only the tissues and organs corresponding to 'Breast'. We see that there are 9 applicable values for this organ type.

In [6]:
breast_organs = client.query("""
  SELECT *
  FROM `isb-cgc-bq.HTAN.tissueOrOrganOfOrigin_to_wholeOrgan_mapping_current`
  WHERE Whole_Organ = 'Breast'
""").result().to_dataframe()

breast_organs

|     | Whole_Organ | Tissue_or_Organ_of_Origin |
| --- | --- | --- |
| 0 | Breast | Axillary tail of breast |
| 1 | Breast | Breast NOS |
| 2 | Breast | Central portion of breast |
| 3 | Breast | Lower-inner quadrant of breast |
| 4 | Breast | Lower-outer quadrant of breast |
| 5 | Breast | Nipple |
| 6 | Breast | Overlapping lesion of breast |
| 7 | Breast | Upper-inner quadrant of breast |
| 8 | Breast | Upper-outer quadrant of breast |
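
The same filter can be reused for any of the whole organ types by passing the organ as a query parameter instead of editing the SQL string. This is a sketch only, assuming the `client` created earlier; `tissues_for_organ` is a hypothetical helper name.

```python
MAPPING_TABLE = "isb-cgc-bq.HTAN.tissueOrOrganOfOrigin_to_wholeOrgan_mapping_current"

# @organ is a named BigQuery query parameter, filled in at run time.
ORGAN_SQL = f"""
  SELECT Whole_Organ, Tissue_or_Organ_of_Origin
  FROM `{MAPPING_TABLE}`
  WHERE Whole_Organ = @organ
  ORDER BY Tissue_or_Organ_of_Origin
"""


def tissues_for_organ(client, organ):
    """Return the mapping rows for one whole organ as a dataframe."""
    # Imported here so the SQL above can be inspected without the library installed.
    from google.cloud import bigquery

    config = bigquery.QueryJobConfig(
        query_parameters=[bigquery.ScalarQueryParameter("organ", "STRING", organ)]
    )
    return client.query(ORGAN_SQL, job_config=config).result().to_dataframe()


# Usage, assuming `client` exists in the session:
# lung_organs = tissues_for_organ(client, "Lung")
```

Passing the value as a parameter avoids string formatting inside the SQL and keeps the query text identical across organs.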


With this information, we can identify the participants that have diagnoses originating in the breast, using the Clinical Diagnosis BigQuery table `isb-cgc-bq.HTAN.clinical_tier1_diagnosis_current`.

In [7]:
breast_cases = client.query("""
  SELECT HTAN_Participant_ID, HTAN_Center, Tissue_or_Organ_of_Origin
  FROM `isb-cgc-bq.HTAN.clinical_tier1_diagnosis_current`
  WHERE Tissue_or_Organ_of_Origin IN (
      SELECT Tissue_or_Organ_of_Origin
      FROM `isb-cgc-bq.HTAN.tissueOrOrganOfOrigin_to_wholeOrgan_mapping_current`
      WHERE Whole_Organ = 'Breast'
  )
""").result().to_dataframe()

breast_cases

|     | HTAN_Participant_ID | HTAN_Center | Tissue_or_Organ_of_Origin |
| --- | --- | --- | --- |
| 0 | HTA6_7 | HTAN Duke | Breast NOS |
| 1 | HTA6_1042 | HTAN Duke | Breast NOS |
| 2 | HTA6_1076 | HTAN Duke | Breast NOS |
| 3 | HTA6_1024 | HTAN Duke | Breast NOS |
| 4 | HTA6_1132 | HTAN Duke | Breast NOS |
| ... | ... | ... | ... |
| 913 | HTA14_41 | HTAN TNP - TMA | Breast NOS |
| 914 | HTA14_43 | HTAN TNP - TMA | Breast NOS |
| 915 | HTA14_44 | HTAN TNP - TMA | Breast NOS |
| 916 | HTA14_45 | HTAN TNP - TMA | Breast NOS |


918 HTAN participants have tumors originating in the breast.
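
The figure above is a row count. If a participant could appear on more than one diagnosis row (an assumption here, not something the output confirms), counting distinct participant IDs per center would be the safer summary. The sketch below illustrates the idea on a few IDs copied from the output, with one repeated row added purely for illustration; `participants_per_center` is a hypothetical helper name.

```python
from collections import defaultdict

# (HTAN_Participant_ID, HTAN_Center) pairs copied from the query output.
sample_cases = [
    ("HTA6_7", "HTAN Duke"),
    ("HTA6_1042", "HTAN Duke"),
    ("HTA14_41", "HTAN TNP - TMA"),
    ("HTA14_43", "HTAN TNP - TMA"),
    ("HTA14_43", "HTAN TNP - TMA"),  # hypothetical repeat, added for illustration
]


def participants_per_center(cases):
    """Count distinct participant IDs for each center."""
    seen = defaultdict(set)
    for participant_id, center in cases:
        seen[center].add(participant_id)
    return {center: len(ids) for center, ids in seen.items()}


print(participants_per_center(sample_cases))
```

On the full result, the pandas equivalent would be `breast_cases.groupby("HTAN_Center")["HTAN_Participant_ID"].nunique()`.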

To determine whether any of the nearly 800 assay data files available in ISB-CGC are derived from these 918 participants of interest, we will use the HTAN ID Provenance table.

For an in-depth introduction to the ID Provenance table, please see the notebook [HTAN_ID_Provenance_In_BQ.ipynb](https://github.com/isb-cgc/Community-Notebooks/blob/master/HTAN/Python%20Notebooks/HTAN_ID_Provenance_In_BQ.ipynb).




Below, we select for Imaging Level 4 data files derived from the participants found in our query above.

In [11]:
breast_img = client.query("""
  SELECT DISTINCT HTAN_Center, Filename, entityId, Component
  FROM `isb-cgc-bq.HTAN.id_provenance_current`
  WHERE HTAN_Participant_ID IN (
      SELECT HTAN_Participant_ID
      FROM `isb-cgc-bq.HTAN.clinical_tier1_diagnosis_current`
      WHERE Tissue_or_Organ_of_Origin IN (
          SELECT Tissue_or_Organ_of_Origin
          FROM `isb-cgc-bq.HTAN.tissueOrOrganOfOrigin_to_wholeOrgan_mapping_current`
          WHERE Whole_Organ = 'Breast'
      )
  )
  AND Component = 'ImagingLevel4'
""").result().to_dataframe()

breast_img

|     | HTAN_Center | Filename | entityId | Component |
| --- | --- | --- | --- | --- |
| 0 | HTAN Duke | mibi_imaging_level_4/Single_cell_data.csv | syn52126197 | ImagingLevel4 |
| 1 | HTAN HTAPP | merfish_level_4/counts_combined/counts_HTAPP-3... | syn25541295 | ImagingLevel4 |
| 2 | HTAN HTAPP | merfish_level_4/counts_combined/counts_HTAPP-5... | syn25541298 | ImagingLevel4 |
| 3 | HTAN HTAPP | merfish_level_4/counts_combined/counts_HTAPP-8... | syn25541290 | ImagingLevel4 |
| 4 | HTAN HTAPP | merfish_level_4/counts_combined/counts_HTAPP-8... | syn25541291 | ImagingLevel4 |
| ... | ... | ... | ... | ... |
| 512 | HTAN TNP - TMA | phase1_imaging_level_4/OHSU_TMA1_005-H11_OHSU_... | syn52044641 | ImagingLevel4 |
| 513 | HTAN TNP - TMA | phase1_imaging_level_4/OHSU_TMA1_010-H11_OHSU_... | syn52044733 | ImagingLevel4 |
| 514 | HTAN TNP - TMA | phase1_imaging_level_4/OHSU_TMA1_011-H11_OHSU_... | syn52044913 | ImagingLevel4 |
| 515 | HTAN TNP - TMA | phase3_imaging_level_4/LSP12021-H11_LSP12021-H... | syn52044011 | ImagingLevel4 |


There are 517 Imaging Level 4 data files derived from participants with diagnoses originating in the breast.
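
The nested subqueries can also be written as explicit joins, with the organ and the component supplied as query parameters so that one query text serves every combination. This is a sketch under the assumption that the three tables share the column names used in the queries above; `files_for` is a hypothetical helper name. Because the outer `SELECT` is `DISTINCT`, duplicate rows produced by the joins collapse, so the result should match the nested form.

```python
FILES_SQL = """
  SELECT DISTINCT p.HTAN_Center, p.Filename, p.entityId, p.Component
  FROM `isb-cgc-bq.HTAN.id_provenance_current` AS p
  JOIN `isb-cgc-bq.HTAN.clinical_tier1_diagnosis_current` AS d
    ON p.HTAN_Participant_ID = d.HTAN_Participant_ID
  JOIN `isb-cgc-bq.HTAN.tissueOrOrganOfOrigin_to_wholeOrgan_mapping_current` AS m
    ON d.Tissue_or_Organ_of_Origin = m.Tissue_or_Organ_of_Origin
  WHERE m.Whole_Organ = @organ
    AND p.Component = @component
"""


def files_for(client, organ, component):
    """Return the files of one component type derived from participants with the given organ."""
    # Imported here so the SQL above can be inspected without the library installed.
    from google.cloud import bigquery

    config = bigquery.QueryJobConfig(
        query_parameters=[
            bigquery.ScalarQueryParameter("organ", "STRING", organ),
            bigquery.ScalarQueryParameter("component", "STRING", component),
        ]
    )
    return client.query(FILES_SQL, job_config=config).result().to_dataframe()


# Usage, assuming `client` exists in the session:
# breast_img = files_for(client, "Breast", "ImagingLevel4")
```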

We can run a similar query to select for Single Cell RNA sequencing Level 4 data files:

In [12]:
breast_scrnaseq = client.query("""
  SELECT DISTINCT HTAN_Center, Filename, entityId, Component
  FROM `isb-cgc-bq.HTAN.id_provenance_current`
  WHERE HTAN_Participant_ID IN (
      SELECT HTAN_Participant_ID
      FROM `isb-cgc-bq.HTAN.clinical_tier1_diagnosis_current`
      WHERE Tissue_or_Organ_of_Origin IN (
          SELECT Tissue_or_Organ_of_Origin
          FROM `isb-cgc-bq.HTAN.tissueOrOrganOfOrigin_to_wholeOrgan_mapping_current`
          WHERE Whole_Organ = 'Breast'
      )
  )
  AND Component = 'ScRNA-seqLevel4'
""").result().to_dataframe()

breast_scrnaseq

|     | HTAN_Center | Filename | entityId | Component |
| --- | --- | --- | --- | --- |
| 0 | HTAN HTAPP | single_cell_RNAseq_level_4_breast/MBC_sc_L4.tsv | syn26127156 | ScRNA-seqLevel4 |
| 1 | HTAN HTAPP | single_cell_RNAseq_level_4_breast/MBC_sn_L4.tsv | syn26127157 | ScRNA-seqLevel4 |


Two single-cell RNA-seq Level 4 data files derived from these participants have been released on the HTAN Data Portal.

Single-cell Level 4 data from HTAPP is available on ISB-CGC in the table `isb-cgc-bq.HTAN.scRNAseq_HTAPP_level4_current`. This table contains single cell data across seven cancer types: breast, colon, lung, glioma, ovarian, neuroblastoma, and sarcoma.  
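
Since this table appears to hold one row per cell, it may be worth checking how many rows each source file contributes before downloading anything. The sketch below pushes that aggregation into BigQuery; it assumes the `Source_entityId` and `Source_filename` columns of the HTAPP table and the `client` created earlier, and `cells_per_source` is a hypothetical helper name.

```python
COUNTS_SQL = """
  SELECT Source_entityId, Source_filename, COUNT(*) AS n_cells
  FROM `isb-cgc-bq.HTAN.scRNAseq_HTAPP_level4_current`
  GROUP BY Source_entityId, Source_filename
  ORDER BY n_cells DESC
"""


def cells_per_source(client):
    """Return one row per source file with the number of rows it contributes."""
    return client.query(COUNTS_SQL).result().to_dataframe()


# Usage, assuming `client` exists in the session:
# cells_per_source(client)
```

Aggregating on the server returns a handful of rows instead of the full per-cell table.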

To select for only the two breast files, we will utilize the `Source_entityId` column in the HTAPP table, and join on `entityId` in our above result.

In [13]:
breast_scrna_l4 = client.query("""
  SELECT * FROM `isb-cgc-bq.HTAN.scRNAseq_HTAPP_level4_current`
  WHERE Source_entityId IN (
      SELECT DISTINCT entityId
      FROM `isb-cgc-bq.HTAN.id_provenance_current`
      WHERE HTAN_Participant_ID IN (
          SELECT HTAN_Participant_ID
          FROM `isb-cgc-bq.HTAN.clinical_tier1_diagnosis_current`
          WHERE Tissue_or_Organ_of_Origin IN (
              SELECT Tissue_or_Organ_of_Origin
              FROM `isb-cgc-bq.HTAN.tissueOrOrganOfOrigin_to_wholeOrgan_mapping_current`
              WHERE Whole_Organ = 'Breast'
          )
      )
      AND Component = 'ScRNA-seqLevel4'
  )
""").result().to_dataframe()

breast_scrna_l4

|     | NAME | X | Y | Cell_Subset | Biospecimen | Source_filename | Source_entityId | HTAN_Center | CellType | HTAPP_ID | Top_Level_Cell_Subset | batchID |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | AGCCAGCAGGCTTCCG-1 | 4.192876 | 10.292455 | Monocyte | HTA1_983_7659101 | MBC_sc_L4.tsv | syn26127156 | HTAN HTAPP |  | HTAPP-983-SMP-7659_fresh_channel1 |  |  |
| 1 | GCGCGATCATAAAGGT-1 | -0.310580 | 2.113862 | Chondrocyte | HTA1_382_1441101 | MBC_sc_L4.tsv | syn26127156 | HTAN HTAPP |  | HTAPP-382-SMP-1441_fresh_channel1 |  |  |
| 2 | GTCGTAACATCCGTGG-1 | -0.382505 | 2.100729 | Chondrocyte | HTA1_382_1441101 | MBC_sc_L4.tsv | syn26127156 | HTAN HTAPP |  | HTAPP-382-SMP-1441_fresh_channel1 |  |  |
| 3 | TGACGGCAGATCTGAA-1 | -0.312379 | 2.275506 | Chondrocyte | HTA1_382_1441101 | MBC_sc_L4.tsv | syn26127156 | HTAN HTAPP |  | HTAPP-382-SMP-1441_fresh_channel1 |  |  |
| 4 | GTAGGCCGTGTTTGGT-1 | -0.644821 | 2.104925 | Chondrocyte | HTA1_382_1441101 | MBC_sc_L4.tsv | syn26127156 | HTAN HTAPP |  | HTAPP-382-SMP-1441_fresh_channel1 |  |  |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 552060 | CCGGTGATCTACTGAG-1 | 13.330817 | 4.213646 | Endothelial_vascular | HTA1_997_7789601 | MBC_sn_L4.tsv | syn26127157 | HTAN HTAPP |  | HTAPP-997-SMP-7789_TST_channel2 |  |  |
| 552061 | TGGTTAGGTGCCTACG-1 | 11.232491 | 4.193148 | Endothelial_vascular | HTA1_997_7789601 | MBC_sn_L4.tsv | syn26127157 | HTAN HTAPP |  | HTAPP-997-SMP-7789_TST_channel2 |  |  |
| 552062 | TCATACTTCCCGTTGT-1 | 13.404366 | 4.072702 | Endothelial_vascular | HTA1_997_7789601 | MBC_sn_L4.tsv | syn26127157 | HTAN HTAPP |  | HTAPP-997-SMP-7789_TST_channel2 |  |  |
| 552063 | TTTATGCCACTCAGAT-1 | 13.564521 | 3.161808 | Endothelial_vascular | HTA1_997_7789601 | MBC_sn_L4.tsv | syn26127157 | HTAN HTAPP |  | HTAPP-997-SMP-7789_TST_channel2 |  |  |


Our final query result contains single cell RNA-seq data for breast samples only.
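
One way to sanity-check this claim is to compare the `Source_entityId` values in the result against the two `entityId` values found for the breast Level 4 files. The sketch below does this with plain sets; `unexpected_sources` and `missing_sources` are hypothetical helper names, and the IDs are copied from the outputs above.

```python
# entityId values of the two breast ScRNA-seqLevel4 files listed earlier.
expected_ids = {"syn26127156", "syn26127157"}


def unexpected_sources(source_ids, expected):
    """Source IDs present in the result but not among the expected files."""
    return sorted(set(source_ids) - set(expected))


def missing_sources(source_ids, expected):
    """Expected files that contributed no rows to the result."""
    return sorted(set(expected) - set(source_ids))


# A small stand-in for the Source_entityId column of the result.
sample_sources = ["syn26127156", "syn26127156", "syn26127157"]
print(unexpected_sources(sample_sources, expected_ids))
print(missing_sources(sample_sources, expected_ids))
```

With the notebook's dataframes in memory, the same check could be run as `unexpected_sources(breast_scrna_l4["Source_entityId"], breast_scrnaseq["entityId"])`; two empty lists would indicate that the result covers exactly the two breast files.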


# 6. Relevant Citations and Links

[HTAN Data Portal](https://humantumoratlas.org/)

[HTAN: The Missing Manual](https://docs.humantumoratlas.org/)