<a href="https://colab.research.google.com/github/isb-cgc/Community-Notebooks/blob/master/HTAN/Python%20Notebooks/A_Guide_to_HTAN_Data.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# A Guide to HTAN Data



        Title:   An Overview of Assay Data in HTAN
        Author:  Clarisse Lau
        Created: April 2024
        Purpose: Provide an overview of HTAN data types and volumes


# 1. Introduction & Overview
The Human Tumor Atlas Network ([HTAN](https://humantumoratlas.org/)) is a National Cancer Institute (NCI)-funded Cancer Moonshot<sup>SM</sup> initiative to construct 3-dimensional atlases of the dynamic cellular, morphological, and molecular features of human cancers as they evolve from precancerous lesions to advanced disease [[Cell April 2020](https://www.sciencedirect.com/science/article/pii/S0092867420303469)].


### 1.1 Goal

This notebook aims to provide an overview of the data accessible within HTAN, presenting at-a-glance summaries, specific statistics such as total cell counts, as well as example queries and attributes that can be used to filter for data of interest. The findings presented reflect the status of HTAN Data Release 5.1.

### 1.2 Inputs, Outputs, & Data
The originating data can be found on the [HTAN Data Portal](https://humantumoratlas.org/), and the compiled tables are on the [ISB-Cancer Gateway in the Cloud](https://isb-cgc.appspot.com/bq_meta_search/).

Each query output loads into a Data Table, an interactive display of the resulting rows and columns. The link below each table opens the [Data Table Notebook](https://colab.research.google.com/notebooks/data_table.ipynb), which gives tips on filtering and further customizing the table.

### 1.3 Notes
The queries and results in this notebook correspond to ISB-CGC HTAN Release 5 (consisting of data through HTAN Data Release 5.1). To choose a different release, edit the BigQuery table names in this notebook by replacing the string `r5` with the selected numbered release, e.g. `r4`. To get results for the most current data release, replace `r5` with `current` and `HTAN_versioned` with `HTAN`.

(For example, replace `isb-cgc-bq.HTAN_versioned.clinical_tier1_demographics_r5` with `isb-cgc-bq.HTAN.clinical_tier1_demographics_current`.)
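
The naming rule above can be captured in a small helper so that the release is changed in one place. This is a minimal sketch: `htan_table` is a hypothetical function written for illustration, not part of any ISB-CGC library, and it assumes every table follows the `<name>_<release>` pattern described in this section.

```python
def htan_table(name, release="r5"):
    """Return a fully qualified HTAN BigQuery table ID.

    Hypothetical helper for illustration only.
    release: a numbered tag such as "r5" or "r4", or the string "current".
    """
    if release == "current":
        # The most current tables live in the unversioned HTAN dataset
        return f"isb-cgc-bq.HTAN.{name}_current"
    # Numbered releases live in the HTAN_versioned dataset
    return f"isb-cgc-bq.HTAN_versioned.{name}_{release}"


print(htan_table("clinical_tier1_demographics"))
print(htan_table("clinical_tier1_demographics", release="current"))
```

The returned string can be interpolated into the queries in this notebook in place of a hard-coded table name.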



# 2. Environment & Module Setup

In [1]:
# import libraries
import pandas as pd

Enable interactive table formatting:

In [2]:
from google.colab import data_table
data_table.enable_dataframe_formatter()

In [3]:
from google.colab.data_table import DataTable
DataTable.max_columns = 30

# 3. Google Authentication

Running the BigQuery cells in this notebook requires a Google Cloud Project. Instructions for creating a project can be found in [Google Cloud Documentation](https://cloud.google.com/resource-manager/docs/creating-managing-projects#console). The instance needs to be authorized to bill the project for queries. For more information on getting started with ISB-CGC, see [Quick Start Guide to ISB-CGC](https://isb-cancer-genomics-cloud.readthedocs.io/en/latest/sections/HowToGetStartedonISB-CGC.html). Alternative authentication methods can be found in [Google Cloud Authentication Documentation](https://cloud.google.com/docs/authentication).

## 3.1 Authenticating with Google Credentials



#### Option 1. Running in Google Colab

If you are using Google Colab, run the code block below to authenticate.

In [4]:
from google.colab import auth
auth.authenticate_user()

#### Option 2. Running on local machine

Alternatively, if you're running the notebook locally, take the following steps to authenticate.

1.   Run `gcloud auth application-default login` on your local machine
2.   Run the command below replacing `<path to key>` with the path to your credentials file

In [None]:
# %env GOOGLE_APPLICATION_CREDENTIALS=<path to key>

## 3.2 Initializing the Google BigQuery client


In [5]:
# Import the Google BigQuery client
from google.cloud import bigquery

# Set the Google Cloud project that will be billed for this notebook's computations
google_project = '<my-project>'  # replace with your own Google Cloud project ID

# Create a client to access the data within BigQuery
client = bigquery.Client(google_project)

# 4. Summary of Data in HTAN

## 4.1 Data Levels by Center

In the query below, a data file listing is first obtained using the ID Provenance table (see the [`HTAN_ID_Provenance_In_BQ` notebook](https://github.com/isb-cgc/Community-Notebooks/blob/master/HTAN/Python%20Notebooks/HTAN_ID_Provenance_In_BQ.ipynb)).

HTAN imaging metadata annotations differ slightly from other assay types, as they encompass multiple imaging and spatial modalities. For example, H&E, MIBI, and CODEX data are all annotated under the imaging template. To distinguish these assay types in our summary, we use the `Imaging Assay Type` attribute. Because `Imaging Assay Type` is found only in the `ImagingLevel2` table, we perform a sequence of joins along the parent data file lineage to classify higher-level imaging data by assay type.

In the next portions of the query, we split the component name, e.g. `ScRNA-seqLevel2`, into separate assay type (`ScRNA-seq`) and data level (`Level2`) attributes.
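
The same split can be sketched in plain Python to make the string handling easier to follow. The functions `split_component` and `assay_label` are hypothetical names used only for this illustration. They mirror the query's `SPLIT`, `REGEXP_EXTRACT`, and `ARRAY_TO_STRING` steps, including the way a missing imaging assay type is left out of the label.

```python
import re


def split_component(component):
    """Split e.g. 'ScRNA-seqLevel2' into ('ScRNA-seq', 'Level2')."""
    # Everything before the first 'Level', like SPLIT(Component, 'Level')[OFFSET(0)]
    assay = component.split("Level")[0]
    # 'Level' followed by one digit, like REGEXP_EXTRACT(Component, r'Level\d')
    match = re.search(r"Level\d", component)
    level = match.group(0) if match else None
    return assay, level


def assay_label(component, imaging_assay_type=None):
    """Join the assay name and imaging assay type with ' - ', skipping a missing type."""
    assay, _ = split_component(component)
    parts = [p for p in (assay, imaging_assay_type) if p is not None]
    return " - ".join(parts)


for c in ["ScRNA-seqLevel2", "ImagingLevel3Segmentation", "BulkWESLevel1"]:
    print(c, "->", split_component(c))

print(assay_label("ImagingLevel2", "CODEX"))
print(assay_label("BulkWESLevel1"))
```

Note that a component such as `ImagingLevel3Segmentation` keeps only `Imaging` and `Level3`; the trailing `Segmentation` suffix is dropped by both the query and this sketch.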

Lastly, we group by assay type, data level, and HTAN center, and pivot the table to provide a concise overview of all HTAN data types by center, along with the levels available.

In [6]:
summary = client.query("""
  WITH prov AS (
    SELECT DISTINCT HTAN_Data_File_ID, entityId, Component, HTAN_Center
    FROM `isb-cgc-bq.HTAN_versioned.id_provenance_r5`
    WHERE Component IS NOT NULL
    AND Component NOT LIKE '%Auxiliary%'
    AND Component NOT LIKE 'OtherAssay'
  ),
  img AS(SELECT * EXCEPT(HTAN_Data_File_ID) FROM (
    SELECT HTAN_Data_File_ID,Imaging_Assay_Type,entityId
    FROM `isb-cgc-bq.HTAN_versioned.imaging_level2_metadata_r5`
    WHERE Component IS NOT NULL
    UNION ALL
    SELECT il3s.HTAN_Data_File_ID,il2.Imaging_Assay_Type,il3s.entityId
    FROM `isb-cgc-bq.HTAN_versioned.imaging_level2_metadata_r5` il2
    JOIN (SELECT * FROM `isb-cgc-bq.HTAN_versioned.id_provenance_r5`
      WHERE Component = 'ImagingLevel3Segmentation') il3s
    ON il2.HTAN_Data_File_ID = il3s.HTAN_Parent_Data_File_ID
    UNION ALL
    SELECT il4.HTAN_Data_File_ID,il2.Imaging_Assay_Type,il4.entityId
    FROM `isb-cgc-bq.HTAN_versioned.imaging_level2_metadata_r5` il2
    JOIN (SELECT * FROM `isb-cgc-bq.HTAN_versioned.id_provenance_r5`
      WHERE Component = 'ImagingLevel3Segmentation') il3s
    ON il2.HTAN_Data_File_ID = il3s.HTAN_Parent_Data_File_ID
    JOIN (SELECT * FROM `isb-cgc-bq.HTAN_versioned.id_provenance_r5`
      WHERE Component = 'ImagingLevel4') il4
    ON il3s.HTAN_Data_File_ID = il4.HTAN_Parent_Data_File_ID
  )
  ),
  files AS (
    SELECT HTAN_Center,array_to_string([SPLIT(Component, 'Level')[OFFSET(0)], Imaging_Assay_Type], ' - ') AS Assay,
    REGEXP_EXTRACT(Component, r'Level\d') AS Level
    FROM prov
    LEFT JOIN img USING(entityId)
    GROUP BY HTAN_Center, Assay, Level
    ORDER BY HTAN_Center, Assay, Level
  )
  SELECT HTAN_Center, Assay, STRING_AGG(DISTINCT Level ORDER BY Level) AS Levels
  FROM files
  GROUP BY HTAN_Center, Assay
""").result().to_dataframe()

In [7]:
summary.pivot(index='Assay', columns='HTAN_Center', values='Levels').fillna('')

Assay,HTAN BU,HTAN CHOP,HTAN DFCI,HTAN Duke,HTAN HMS,HTAN HTAPP,HTAN MSK,HTAN OHSU,HTAN SRRS,HTAN Stanford,HTAN TNP - TMA,HTAN TNP SARDANA,HTAN Vanderbilt,HTAN WUSTL
10xVisiumSpatialTranscriptomics-RNA-seq,,,,,,"Level1,Level2,Level3",,,,,,,"Level1,Level2,Level3,Level4","Level1,Level2,Level3"
BulkMethylation-seq,,,,,,,,,,"Level1,Level2",,,,
BulkRNA-seq,Level1,,,"Level1,Level2,Level3","Level1,Level2,Level3",Level2,,"Level1,Level2,Level3",,Level1,,,,Level1
BulkWES,Level1,"Level1,Level2,Level3",Level2,"Level1,Level2,Level3",,Level2,,"Level1,Level2,Level3",,"Level1,Level2",,,"Level1,Level2,Level3",Level1
ElectronMicroscopy,,,,,,,,"Level1,Level2",,,,,,
HI-C-seq,,,,,,,,,,"Level1,Level2",,,,
Imaging,,,,Level4,,Level4,,Level4,,Level1,,,,
Imaging - CODEX,,"Level2,Level3,Level4",,,,,,,,"Level2,Level3,Level4",,Level2,,Level2
Imaging - CyCIF,,,,,,,,Level2,,,"Level2,Level3",Level2,,
Imaging - H&E,Level2,,,Level2,Level2,Level2,,,,,,Level2,Level2,Level2


## 4.2 Number of files per center and assay type

This next table provides a long-form breakdown of the number of data files submitted by each HTAN center per assay type and level.

In [8]:
all_files = client.query("""
  SELECT Component, HTAN_Center, Count(*) AS Count
  FROM `isb-cgc-bq.HTAN_versioned.id_provenance_r5`
  WHERE Component LIKE '%Level%'
  GROUP BY Component, HTAN_Center
  ORDER BY Component, HTAN_Center
  """).result().to_dataframe()

In [9]:
all_files.pivot_table(
    index='Component',
    columns='HTAN_Center',
    values='Count',
    fill_value=0).astype(int)

Component,HTAN BU,HTAN CHOP,HTAN DFCI,HTAN Duke,HTAN HMS,HTAN HTAPP,HTAN MSK,HTAN OHSU,HTAN SRRS,HTAN Stanford,HTAN TNP - TMA,HTAN TNP SARDANA,HTAN Vanderbilt,HTAN WUSTL
10xVisiumSpatialTranscriptomics-RNA-seqLevel1,0,0,0,0,0,400,0,0,0,0,0,0,96,100
10xVisiumSpatialTranscriptomics-RNA-seqLevel2,0,0,0,0,0,400,0,0,0,0,0,0,96,100
10xVisiumSpatialTranscriptomics-RNA-seqLevel3,0,0,0,0,0,336,0,0,0,0,0,0,192,266
10xVisiumSpatialTranscriptomics-RNA-seqLevel4,0,0,0,0,0,0,0,0,0,0,0,0,192,0
BulkMethylation-seqLevel1,0,0,0,0,0,0,0,0,0,80,0,0,0,0
BulkMethylation-seqLevel2,0,0,0,0,0,0,0,0,0,80,0,0,0,0
BulkRNA-seqLevel1,826,0,0,885,444,0,0,88,0,348,0,0,0,358
BulkRNA-seqLevel2,0,0,0,774,444,154,0,56,0,0,0,0,0,0
BulkRNA-seqLevel3,0,0,0,1548,222,0,0,38,0,0,0,0,0,0
BulkWESLevel1,516,622,0,2614,0,0,0,336,0,632,0,0,464,744


## 4.3 Total Cell Counts in Google BigQuery
In this section, we aim to retrieve the total cell counts across HTAN genomic and spatial assay data tables in Google BigQuery. Given the varying formats of assay tables, we adopt different approaches to gather cell count information, distinguishing between narrow and non-narrow formats.
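
To illustrate why the two formats need different counting strategies, here is a toy sketch. All values are made up. The `iObs` column name follows the queries in this notebook, while `iVar`, the feature names, and the 1-based contiguous indexing are assumptions made for illustration.

```python
# Toy data only: three cells measured for two features.

# Non-narrow (wide) layout: one row per cell, one column per feature.
wide_rows = [
    {"cell": "c1", "CD45": 0.1, "PanCK": 2.3},
    {"cell": "c2", "CD45": 1.7, "PanCK": 0.4},
    {"cell": "c3", "CD45": 0.9, "PanCK": 0.8},
]

# Narrow layout: one row per (cell, feature) value, keyed by integer indices.
# Assumption: indices are 1-based and contiguous.
narrow_rows = [
    {"iObs": 1, "iVar": 1, "value": 0.1},
    {"iObs": 1, "iVar": 2, "value": 2.3},
    {"iObs": 2, "iVar": 1, "value": 1.7},
    {"iObs": 2, "iVar": 2, "value": 0.4},
    {"iObs": 3, "iVar": 1, "value": 0.9},
    {"iObs": 3, "iVar": 2, "value": 0.8},
]

# Wide tables: every row is a cell, so a row count is a cell count (COUNT(*)).
wide_cells = len(wide_rows)

# Narrow tables: a row count would overcount, so use the largest obs index (MAX(iObs)).
narrow_cells = max(r["iObs"] for r in narrow_rows)

# Cross-check: number of distinct obs indices that appear in the narrow table.
distinct_cells = len({r["iObs"] for r in narrow_rows})

print(wide_cells, narrow_cells, distinct_cells, len(narrow_rows))
```

In this toy case, the narrow table has twice as many rows as cells, which is why a plain row count is not used for it. If a table's obs index started at 0 rather than 1, the cell count would be the maximum index plus one.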

In [15]:
# obtain assay tables containing concatenated data derived from CSVs
full = client.query("""
  SELECT table_name
  FROM `isb-cgc-bq.HTAN_versioned`.INFORMATION_SCHEMA.TABLES
  WHERE (table_name LIKE 'imaging_level4%' AND table_name NOT LIKE '%metadata%')
  OR table_name = 'scRNAseq_HTAPP_level4_r5'
  ORDER BY table_name
""").result().to_dataframe()

# obtain assay tables derived from h5ads
narrow = client.query("""
  SELECT table_name
  FROM `isb-cgc-bq.HTAN_versioned`.INFORMATION_SCHEMA.TABLES
  WHERE (table_name LIKE 'scRNAseq%' AND table_name NOT LIKE '%metadata%')
  AND (table_name NOT LIKE 'scRNAseq_VUMC%' OR table_name LIKE '%cellxgene%')
  AND table_name NOT LIKE 'scRNAseq_HTAPP_level4_%'
  ORDER BY table_name
""").result().to_dataframe()

In [11]:
def row_count(table):
  '''
  For csv/tsv derived tables, we can count one cell per row
  '''
  count = client.query(f"SELECT COUNT(*) AS count FROM `isb-cgc-bq.HTAN_versioned.{table}`")
  result = count.result()
  for row in result:
      return row['count']

In [12]:
def obs_count(table):
  '''
  Tables that were derived from h5ads have been converted into narrow format using
  obs & var indices. We can obtain cell count by selecting the max obs index
  '''
  count = client.query(f"SELECT MAX(iObs) AS count FROM `isb-cgc-bq.HTAN_versioned.{table}`")
  result = count.result()
  for row in result:
      return row['count']

In [18]:
cell_count = 0
for t in list(full['table_name']):
  cell_count = cell_count + row_count(t)
for t in list(narrow['table_name']):
  cell_count = cell_count + obs_count(t)


The total cell count in Google BigQuery (genomic and spatial data) is around 204 million cells.

In [19]:
cell_count

204060254

# 5. Specific modes of access and identification

### 5.1 Identifying data by HTAN Data Release

Here we demonstrate how users can filter files based on a specific HTAN Data Release, such as `Release 5.0`. We accomplish this by again utilizing the ID Provenance table, which contains information on all HTAN data files.

In [20]:
id_prov = client.query("""
  SELECT Component, Data_Release, HTAN_Center, Count(*) AS Count
  FROM `isb-cgc-bq.HTAN_versioned.id_provenance_r5`
  WHERE Data_Release = 'Release 5.0'
  GROUP BY Component, Data_Release, HTAN_Center
""").result().to_dataframe()
id_prov

,Component,Data_Release,HTAN_Center,Count
0,BulkWESLevel1,Release 5.0,HTAN BU,188
1,ScRNA-seqLevel1,Release 5.0,HTAN BU,208
2,BulkRNA-seqLevel1,Release 5.0,HTAN BU,170
3,ImagingLevel2,Release 5.0,HTAN HMS,28
4,BulkWESLevel3,Release 5.0,HTAN CHOP,186
5,ScRNA-seqLevel4,Release 5.0,HTAN CHOP,1093
6,ScATAC-seqLevel1,Release 5.0,HTAN CHOP,869
7,ScATAC-seqLevel2,Release 5.0,HTAN CHOP,108
8,ScATAC-seqLevel3,Release 5.0,HTAN CHOP,144
9,ScRNA-seqLevel1,Release 5.0,HTAN CHOP,1234


### 5.2 Identifying Data in Cancer Data Service (CDS)

Similarly, we can identify data that has been released on CDS by using the `isb-cgc-bq.HTAN_versioned.cds_drs_map_r5` BigQuery table. This table lists all ~30,000 data files available in CDS as of Release 4.0. The DCC submits data to CDS following each major data release; as of the writing of this notebook, the DCC is awaiting publication of Release 5.0 data on CDS.

For more information on how to utilize this table to prepare a DRS manifest used to access CDS files in Seven Bridges, please see the notebook `Creating CDS Data Import Manifests from BQ` https://isb-cancer-genomics-cloud.readthedocs.io/en/latest/sections/HTANNotebooks.html

In [21]:
cds = client.query("""
SELECT * FROM `isb-cgc-bq.HTAN_versioned.cds_drs_map_r5`
""").result().to_dataframe()

cds



,name,entityId,HTAN_Data_File_ID,drs_uri
0,S1.bam,syn26546212,HTA7_1_1453,drs://nci-crdc.datacommons.io/dg.4DFC/adde2232...
1,S2.bam,syn26546217,HTA7_1_1564,drs://nci-crdc.datacommons.io/dg.4DFC/e836d169...
2,S3.bam,syn26546241,HTA7_1_1598,drs://nci-crdc.datacommons.io/dg.4DFC/b7a010f7...
3,S4.bam,syn26546518,HTA7_1_1609,drs://nci-crdc.datacommons.io/dg.4DFC/897a0598...
4,S5.bam,syn26546529,HTA7_1_1620,drs://nci-crdc.datacommons.io/dg.4DFC/3a036835...
...,...,...,...,...
29294,CRC6_FreshFrozen.sorted.rmdup.bam.mergeContext...,syn45173302,HTA10_27_74065341231732400243811149443120,drs://nci-crdc.datacommons.io/dg.4DFC/5d041ccc...
29295,F_007.sorted.rmdup.bam.mergeContext_CpG.bedGraph,syn51597422,HTA10_05_37789973255977341101735157351126,drs://nci-crdc.datacommons.io/dg.4DFC/5d04442c...
29296,F_034.sorted.rmdup.bam.mergeContext_CpG.bedGraph,syn51597425,HTA10_05_01228730521995281044701324632363,drs://nci-crdc.datacommons.io/dg.4DFC/5d04681c...
29297,F_88B.sorted.rmdup.bam.mergeContext_CpG.bedGraph,syn51597428,HTA10_05_85728373087630024932513533801044,drs://nci-crdc.datacommons.io/dg.4DFC/5d048cca...


### 5.3 Identifying Assay Data in ISB-CGC BigQuery

The HTAN Data Coordinating Center (DCC) has now made over 1000 data files available in ISB-CGC BigQuery, including tabular assay data and channel metadata files.

These files are structured in two main formats: in some instances, each data file corresponds directly to a BigQuery table, while in others, files within a dataset are combined into a single BigQuery table.

The `isb-cgc-bq.HTAN_versioned.dataFileSynapseID_to_BigQueryTableID_map_r5` table can be used to check whether data files are present in Google BigQuery, using their Synapse IDs, and to determine which tables they are associated with.
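
As a minimal sketch of such a lookup, the function below maps a list of Synapse IDs to the BigQuery tables that contain them. `tables_for` is a hypothetical helper, the sample pairs are copied from this notebook's query output, and `syn00000000` is a made-up ID used to show the not-found case.

```python
from collections import defaultdict

# Sample (entityId, bq_table_id) pairs taken from this notebook's query output.
# In the notebook, the full list could be built with
# zip(syn_bq_map.entityId, syn_bq_map.bq_table_id).
pairs = [
    ("syn26535445", "isb-cgc-bq.HTAN.imaging_level4_OHSU_current"),
    ("syn53275842", "isb-cgc-bq.HTAN.imaging_level4_OHSU_current"),
    ("syn26535453", "isb-cgc-bq.HTAN.imaging_level4_OHSU_current"),
]


def tables_for(synapse_ids, pairs):
    """Return {synapse_id: sorted list of BigQuery table IDs}, empty if not found."""
    lookup = defaultdict(set)
    for entity_id, table_id in pairs:
        lookup[entity_id].add(table_id)
    # .get avoids creating empty entries for IDs that are absent from the map
    return {s: sorted(lookup.get(s, ())) for s in synapse_ids}


result = tables_for(["syn26535445", "syn00000000"], pairs)
for synapse_id, tables in result.items():
    print(synapse_id, "->", tables if tables else "not in BigQuery")
```

A set is used per Synapse ID so that the sketch still works if a file were ever listed under more than one table.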

In [22]:
syn_bq_map = client.query("""
SELECT * FROM `isb-cgc-bq.HTAN_versioned.dataFileSynapseID_to_BigQueryTableID_map_r5`
""").result().to_dataframe()

syn_bq_map

,entityId,bq_table_id
0,syn26535445,isb-cgc-bq.HTAN.imaging_level4_OHSU_current
1,syn53275842,isb-cgc-bq.HTAN.imaging_level4_OHSU_current
2,syn53275838,isb-cgc-bq.HTAN.imaging_level4_OHSU_current
3,syn26535453,isb-cgc-bq.HTAN.imaging_level4_OHSU_current
4,syn31547329,isb-cgc-bq.HTAN.imaging_level4_OHSU_current
...,...,...
1311,syn27056099,isb-cgc-bq.HTAN.scRNAseq_VUMC_HTAN_VAL_EPI_cel...
1312,syn27056098,isb-cgc-bq.HTAN.scRNAseq_VUMC_VAL_DIS_NOEPI_ce...
1313,syn51301557,isb-cgc-bq.HTAN.scRNAseq_MSK_Myeloid_SCLC_samp...
1314,syn51301559,isb-cgc-bq.HTAN.scRNAseq_MSK_SCLC_RU1215_epith...


A breakdown of the number of data files contained in each BigQuery assay table:

In [23]:
syn_bq_map.bq_table_id.value_counts()

bq_table_id
isb-cgc-bq.HTAN.imaging_channel_metadata_current                                                   423
isb-cgc-bq.HTAN.imaging_level4_TNP_TMA_phase1_current                                              352
isb-cgc-bq.HTAN.imaging_level4_TNP_TMA_phase3_current                                              176
isb-cgc-bq.HTAN.imaging_level4_OHSU_current                                                        120
isb-cgc-bq.HTAN.imaging_level4_HMS_orion_current                                                    75
isb-cgc-bq.HTAN.scRNAseq_HTAPP_level4_current                                                       47
isb-cgc-bq.HTAN.imaging_level4_HMS_crc_mask_current                                                 40
isb-cgc-bq.HTAN.imaging_level4_HMS_crc_current                                                      32
isb-cgc-bq.HTAN.imaging_level4_HMS_mel_mask_current                                                 20
isb-cgc-bq.HTAN.imaging_level4_HTAPP_merfish_current         

# 6. Relevant Citations and Links



[HTAN Portal](https://humantumoratlas.org/)

[Overview paper, Cell, April 2020](https://www.sciencedirect.com/science/article/pii/S0092867420303469)

