<a href="https://colab.research.google.com/github/isb-cgc/Community-Notebooks/blob/master/HTAN/Python%20Notebooks/Identifying_and_Compiling_Precancer_Cases_and_Samples_in_HTAN.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Identifying and Compiling Precancer Cases and Samples in HTAN

        Title:   Identifying and Compiling Precancer Cases and Samples in HTAN
        Author:  Clarisse Lau, Vesteinn Thorsson & Kristen Anton
        Created: February 2024
        Purpose: Identify precancer cases and samples in HTAN using Google BigQuery metadata tables


# 1. Introduction & Overview
The Human Tumor Atlas Network ([HTAN](https://humantumoratlas.org/)) is a National Cancer Institute (NCI)-funded Cancer Moonshot<sup>SM</sup> initiative to construct 3-dimensional atlases of the dynamic cellular, morphological, and molecular features of human cancers as they evolve from precancerous lesions to advanced disease [[Cell April 2020](https://www.sciencedirect.com/science/article/pii/S0092867420303469)].


HTAN comprises 15 research centers and trans-network projects (TNPs), 5 of which are designated precancer atlases [[Cancer Prevention Research, July 2023](https://aacrjournals.org/cancerpreventionresearch/article-abstract/16/7/379/727480/PreCancer-Atlas-Present-and-FuturePreCancer)], with additional atlases generating precancer data.

HTAN atlases have supplied a rich set of clinical and biospecimen metadata. These annotations are made according to the HTAN Data Model, a set of standards defined by the HTAN consortium. The supplied values of these attributes have been collected into comprehensive data tables in the cloud, using Google BigQuery, which is part of Google Cloud.

This notebook shows how clinical and biospecimen tables can be queried to identify precancer cases and samples in HTAN by querying on relevant metadata attributes. For the purpose of this notebook, data provided by the HTAN precancer atlases, designated as precancer, are included in the query output. The queries run from this notebook provide a snapshot of the HTAN data.

### 1.1 Goal
This example notebook illustrates how to make use of HTAN Google BigQuery clinical and biospecimen tables to identify precancer cases and specimens in HTAN.

### 1.2 Approach
Elements of the HTAN data model may be used to identify patients diagnosed with precancerous lesions and biospecimens collected at the precancer stage of disease. There are several approaches to creating precancer subsets. Because of how the data elements pertaining to precancer are structured, and because not all of them are required, the identification of precancer cases and biospecimens is currently not comprehensive.

To identify precancer cases, the data elements `Precancerous Condition Type` from Clinical Tier 2 and `Age at Diagnosis` from Clinical Tier 1 may be queried. `Precancerous Condition Type` is not a required element, so not all cases will be annotated. Because `Age at Diagnosis` is a required element, many precancer cases use the value `0` to indicate no cancer diagnosis, providing a known value for the query.

To identify biospecimens collected at the precancer stage of disease, the data element `Tumor Tissue Type` from the Biospecimen table may be queried for the permissible values `Premalignant`, `Premalignant - in situ`, and `Atypia - hyperplasia`. These data are required for all biospecimens. The permissible values `Not Otherwise Specified` and `None` create ambiguity in the query result.
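
The biospecimen rule above can be sketched as a small classifier. This is a minimal illustration, not part of the HTAN data model: the function name and the labels "definitive", "ambiguous", and "other" are assumptions made for this sketch, and missing values are treated as ambiguous.

```python
# Values taken from the paragraph above. The grouping into "definitive" and
# "ambiguous" is this sketch's own labelling, not an HTAN-defined field.
DEFINITIVE = {"Premalignant", "Premalignant - in situ", "Atypia - hyperplasia"}
# Missing values (Python None) are treated like the ambiguous permissible values.
AMBIGUOUS = {"Not Otherwise Specified", "None", None}


def precancer_status(tumor_tissue_type):
    """Return 'definitive', 'ambiguous', or 'other' for one Tumor Tissue Type value."""
    if tumor_tissue_type in DEFINITIVE:
        return "definitive"
    if tumor_tissue_type in AMBIGUOUS:
        return "ambiguous"
    return "other"


for value in ["Premalignant", "Not Otherwise Specified", None, "Primary"]:
    print(value, "->", precancer_status(value))
```

Only values in the "definitive" group identify a precancer specimen without further review.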

The HTAN DCC is refining the data model to facilitate direct queries that will provide comprehensive subsetting of precancer cases and specimens.


### 1.3 Inputs, Outputs, & Data
The originating data can be found on the [HTAN Data Portal](https://humantumoratlas.org/), and the compiled tables are on the [ISB-Cancer Gateway in the Cloud](https://isb-cgc.appspot.com/bq_meta_search/).

Query outputs load to DataFrames, which display columns and rows.  Beside each output table you see two icons.  The top icon converts the DataFrame to an interactive table.  You are able to select the link below the table to review the Data Table Notebook (https://colab.research.google.com/notebooks/data_table.ipynb) that gives tips on filtering and further customizing the table. The lower icon converts that output to graphical format which, depending on the output, is more or less informative.

### 1.4 Notes
These tables correspond to ISB-CGC HTAN Release 4.



# 2. Environment & Module Setup

In [None]:
# import libraries
import pandas as pd

# 3. Google Authentication

Running the BigQuery cells in this notebook requires a Google Cloud Project. Instructions for creating a project can be found in [Google Cloud Documentation](https://cloud.google.com/resource-manager/docs/creating-managing-projects#console). The instance needs to be authorized to bill the project for queries. For more information on getting started with ISB-CGC, see [Quick Start Guide to ISB-CGC](https://isb-cancer-genomics-cloud.readthedocs.io/en/latest/sections/HowToGetStartedonISB-CGC.html). Alternative authentication methods can be found in [Google Cloud Documentation](https://cloud.google.com/resource-manager/docs/creating-managing-projects#console).

## 3.1 Authenticating with Google Credentials



#### Option 1. Running in Google Colab

If you are using Google Colab, run the code block below to authenticate.

In [None]:
from google.colab import auth
auth.authenticate_user()

#### Option 2. Running on local machine

Alternatively, if you're running the notebook locally, take the following steps to authenticate.

1.   Run `gcloud auth application-default login` on your local machine
2.   Run the command below replacing `<path to key>` with the path to your credentials file

In [None]:
# %env GOOGLE_APPLICATION_CREDENTIALS=<path to key>

## 3.2 Initializing the Google BigQuery client


In [None]:
# Import the Google BigQuery client
from google.cloud import bigquery

# Set the google project that will be billed for this notebook's computations
google_project = "<my-project>"  # replace with your own Google Cloud project ID

# Create a client to access the data within BigQuery
client = bigquery.Client(google_project)
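
A common stumbling block is leaving the project placeholder unchanged. The following is a minimal sketch of a guard that could run before the client is created. The helper name `resolve_project` is hypothetical, and the fallback to the `GOOGLE_CLOUD_PROJECT` environment variable is an assumption about how your environment is configured.

```python
import os

PLACEHOLDER = "<my-project>"


def resolve_project(explicit=None, env=None):
    """Pick a billing project ID: an explicit value first, then GOOGLE_CLOUD_PROJECT."""
    env = os.environ if env is None else env
    project = explicit or env.get("GOOGLE_CLOUD_PROJECT")
    if not project or project == PLACEHOLDER:
        raise ValueError("Set a Google Cloud project ID before creating the BigQuery client.")
    return project


# Hypothetical usage in this notebook:
# google_project = resolve_project("my-real-project-id")
# client = bigquery.Client(project=google_project)
```

Failing early with a clear message is easier to diagnose than a billing or permission error raised by the first query.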

# 4. Identifying Precancerous Cases using the Clinical Data BigQuery Tables

### 4.1 Using 'Age at Diagnosis' in the Clinical Tier 1 Diagnosis Table

`Age at Diagnosis` is a required data element defined as the age of the participant at the time of diagnosis, expressed in number of days since birth. Atlases contributing data on precancers populate `Age at Diagnosis` with `0` since participants do not receive a cancer diagnosis.

In [None]:
aad = client.query("""
  SELECT HTAN_Center, Age_at_Diagnosis, Primary_Diagnosis, HTAN_Participant_ID
  FROM `isb-cgc-bq.HTAN.clinical_tier1_diagnosis_current`
  WHERE HTAN_Center in (
    "HTAN Duke",
    "HTAN Stanford",
    "HTAN Vanderbilt",
    "HTAN BU",
    "HTAN HMS")
  AND Age_at_Diagnosis = '0'
  GROUP BY HTAN_Center, Age_at_Diagnosis, Primary_Diagnosis, HTAN_Participant_ID
""").result().to_dataframe()

aad

Unnamed: 0,HTAN_Center,Age_at_Diagnosis,Primary_Diagnosis,HTAN_Participant_ID
0,HTAN Vanderbilt,0,Not Reported,HTA11_10034
1,HTAN Vanderbilt,0,Not Reported,HTA11_10167
2,HTAN Vanderbilt,0,Not Reported,HTA11_104
3,HTAN Vanderbilt,0,Not Reported,HTA11_10466
4,HTAN Vanderbilt,0,Not Reported,HTA11_10557
...,...,...,...,...
142,HTAN Vanderbilt,0,Not Reported,HTA11_99999971397
143,HTAN Vanderbilt,0,Not Reported,HTA11_99999971662
144,HTAN Vanderbilt,0,Not Reported,HTA11_99999973458
145,HTAN Vanderbilt,0,Not Reported,HTA11_99999973899


Using this criterion, and narrowing the search to precancer centers, we find 147 precancer cases from the Vanderbilt HTAN center, all designating primary diagnosis as `Not Reported`. There are no additional records containing `Age at Diagnosis` value of `0` at this time.
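
The query above groups by every selected column, which removes repeated rows in the same way as `SELECT DISTINCT`. The sketch below illustrates this on a toy in-memory SQLite table rather than BigQuery; the table name and all rows are hypothetical and exist only to show the behaviour.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE diagnosis (HTAN_Center TEXT, Age_at_Diagnosis TEXT, HTAN_Participant_ID TEXT)"
)
# Hypothetical rows: participant P1 appears twice.
conn.executemany(
    "INSERT INTO diagnosis VALUES (?, ?, ?)",
    [
        ("Center A", "0", "P1"),
        ("Center A", "0", "P1"),
        ("Center A", "0", "P2"),
        ("Center B", "21000", "P3"),
    ],
)

# Grouping by all selected columns collapses identical rows.
grouped = conn.execute("""
    SELECT HTAN_Center, Age_at_Diagnosis, HTAN_Participant_ID
    FROM diagnosis
    WHERE Age_at_Diagnosis = '0'
    GROUP BY HTAN_Center, Age_at_Diagnosis, HTAN_Participant_ID
    ORDER BY HTAN_Participant_ID
""").fetchall()

# SELECT DISTINCT expresses the same intent more directly.
distinct = conn.execute("""
    SELECT DISTINCT HTAN_Center, Age_at_Diagnosis, HTAN_Participant_ID
    FROM diagnosis
    WHERE Age_at_Diagnosis = '0'
    ORDER BY HTAN_Participant_ID
""").fetchall()

print(grouped)
print(grouped == distinct)
```

Either form yields one row per participant as long as the selected columns are identical across that participant's records.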


### 4.2 Using 'Precancerous Condition Type' in the Clinical Tier 2 Table

`Precancerous Condition Type` is an optional data element that uses standardized terms to classify the precancerous cells observed in the participant's tissue. Because the element is not required, few records have been captured, so this method gives only limited information about precancers. Only records with values for `Precancerous Condition Type` are reported in the table below.


In [None]:
pct = client.query("""
  SELECT Precancerous_Condition_Type, HTAN_Center, HTAN_Participant_ID
  FROM `isb-cgc-bq.HTAN.clinical_tier2_current`
  WHERE Precancerous_Condition_Type IS NOT NULL
""").result().to_dataframe()

pct

Unnamed: 0,Precancerous_Condition_Type,HTAN_Center,HTAN_Participant_ID
0,Ductal Carcinoma In Situ,HTAN HTAPP,HTA2_225
1,Ductal Carcinoma In Situ,HTAN HTAPP,HTA2_229


This query identifies two additional cases.
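
To confirm that cases found by one criterion are new relative to an earlier result, the participant IDs can be compared as sets. This is a minimal sketch; the helper name `additional_ids` is hypothetical, and the sample IDs are copied from the displayed rows above rather than from the full results.

```python
def additional_ids(candidate_ids, known_ids):
    """Return candidate IDs not already present in known_ids, sorted for stable display."""
    return sorted(set(candidate_ids) - set(known_ids))


# IDs copied from the displayed outputs (first rows of Section 4.1, all rows of 4.2).
known = ["HTA11_10034", "HTA11_10167", "HTA11_104"]
candidates = ["HTA2_225", "HTA2_229"]
print(additional_ids(candidates, known))

# With the notebook's DataFrames this would be:
# additional_ids(pct["HTAN_Participant_ID"], aad["HTAN_Participant_ID"])
```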

### 4.3 Using 'Primary Diagnosis' in the Clinical Tier 1 Diagnosis Table

`Primary Diagnosis` is a required data element that uses the World Health Organization's (WHO) International Classification of Diseases for Oncology (ICD-O) to describe the patient's histologic diagnosis. The permissible value list contains a limited number of values that describe precancers. `Ductal carcinoma in situ NOS` is one such value, and we can use it to identify a set of breast precancer cases; the query below also selects `Familial adenomatous polyposis`.

In [None]:
pddcis = client.query("""
  SELECT HTAN_Center, Primary_Diagnosis, HTAN_Participant_ID
  FROM `isb-cgc-bq.HTAN.clinical_tier1_diagnosis_current`
  WHERE Primary_Diagnosis = 'Ductal carcinoma in situ NOS' OR Primary_Diagnosis = 'Familial adenomatous polyposis'
""").result().to_dataframe()

pddcis

Unnamed: 0,HTAN_Center,Primary_Diagnosis,HTAN_Participant_ID
0,HTAN Duke,Ductal carcinoma in situ NOS,HTA6_7
1,HTAN Duke,Ductal carcinoma in situ NOS,HTA6_1042
2,HTAN Duke,Ductal carcinoma in situ NOS,HTA6_1076
3,HTAN Duke,Ductal carcinoma in situ NOS,HTA6_1024
4,HTAN Duke,Ductal carcinoma in situ NOS,HTA6_1132
...,...,...,...
772,HTAN Stanford,Familial adenomatous polyposis,HTA10_06
773,HTAN Stanford,Familial adenomatous polyposis,HTA10_08
774,HTAN Stanford,Familial adenomatous polyposis,HTA10_10
775,HTAN Stanford,Familial adenomatous polyposis,HTA10_04


This query returns an additional 777 cases: 767 cases with disease identified as Ductal Carcinoma in situ, and 10 cases with disease identified as Familial Adenomatous Polyposis (FAP).
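
A per-diagnosis breakdown like the one above can be computed by counting distinct participant IDs for each diagnosis value. The sketch below uses only a few of the rows displayed in the output, so its counts are not the full totals; the helper name is hypothetical.

```python
from collections import defaultdict


def participants_per_diagnosis(rows):
    """Map each Primary_Diagnosis to the number of distinct participant IDs."""
    seen = defaultdict(set)
    for diagnosis, participant_id in rows:
        seen[diagnosis].add(participant_id)
    return {diagnosis: len(ids) for diagnosis, ids in seen.items()}


# A few rows copied from the displayed output, not the full result.
rows = [
    ("Ductal carcinoma in situ NOS", "HTA6_7"),
    ("Ductal carcinoma in situ NOS", "HTA6_1042"),
    ("Familial adenomatous polyposis", "HTA10_06"),
    ("Familial adenomatous polyposis", "HTA10_08"),
]
print(participants_per_diagnosis(rows))

# With the notebook's DataFrame, an equivalent call would be:
# pddcis.groupby("Primary_Diagnosis")["HTAN_Participant_ID"].nunique()
```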

# 5. Identifying Precancerous Samples using the Biospecimen BigQuery Table

### 5.1 Using 'Tumor Tissue Type' in the Biospecimen Table

The Biospecimen attributes offer some metadata to help differentiate precancerous lesion samples from tumor samples.  The required data element `Tumor Tissue Type`, defined as text that describes the kind of disease present in the tumor specimen as related to a specific time point, includes several permissible values that classify a precancerous lesion.  These include `Premalignant`, `Atypia - hyperplasia`, and `Premalignant - in situ`. A value of `Not Otherwise Specified` may include precancers, but this value is ambiguous. Although the name of this data element implies that it is an attribute of tumor tissue only, the permissible values include both precancerous and cancerous descriptors.

The table below contains Tumor_Tissue_Types from the full HTAN Biospecimen dataset.

In [None]:
ttt_all = client.query("""
  SELECT HTAN_Center, Tumor_Tissue_Type, Count(*) as count
  FROM `isb-cgc-bq.HTAN.biospecimen_current`
  GROUP BY HTAN_Center, Tumor_Tissue_Type
  ORDER BY HTAN_Center, Tumor_Tissue_Type
""").result().to_dataframe()

ttt_all

Unnamed: 0,HTAN_Center,Tumor_Tissue_Type,count
0,HTAN BU,Additional Primary,3
1,HTAN BU,Not Otherwise Specified,572
2,HTAN BU,Primary,6
3,HTAN CHOP,Primary,34
4,HTAN DFCI,,5
5,HTAN DFCI,Metastatic,10
6,HTAN Duke,Additional Primary,79
7,HTAN Duke,Normal,8
8,HTAN Duke,Normal distant,1
9,HTAN Duke,Not Otherwise Specified,112



Restricting the query to permissible values that do or may indicate precancer yields the following output.

In [None]:
ttt = client.query("""
  SELECT HTAN_Center, Tumor_Tissue_Type, Count(*) as count
  FROM `isb-cgc-bq.HTAN.biospecimen_current`
  WHERE Tumor_Tissue_Type is null OR Tumor_Tissue_Type in
    ('Atypia - hyperplasia',
    'Not Otherwise Specified',
    'Premalignant',
    'Premalignant - in situ')
  GROUP BY HTAN_Center, Tumor_Tissue_Type
  ORDER BY HTAN_Center, Tumor_Tissue_Type
""").result().to_dataframe()

ttt

Unnamed: 0,HTAN_Center,Tumor_Tissue_Type,count
0,HTAN BU,Not Otherwise Specified,572
1,HTAN DFCI,,5
2,HTAN Duke,Not Otherwise Specified,112
3,HTAN HMS,Not Otherwise Specified,304
4,HTAN HTAPP,,20
5,HTAN HTAPP,Not Otherwise Specified,538
6,HTAN HTAPP,Premalignant - in situ,2
7,HTAN MSK,Not Otherwise Specified,1
8,HTAN OHSU,,37
9,HTAN Stanford,,8


Further restricting the query to definitive values indicating precancer yields the following output, allowing us to identify 302 precancer biospecimens.

In [None]:
tttr = client.query("""
  SELECT HTAN_Center, Tumor_Tissue_Type, Count(*) as count
  FROM `isb-cgc-bq.HTAN.biospecimen_current`
  WHERE Tumor_Tissue_Type in
    ('Atypia - hyperplasia',
    'Premalignant',
    'Premalignant - in situ')
  GROUP BY HTAN_Center, Tumor_Tissue_Type
  ORDER BY HTAN_Center, Tumor_Tissue_Type
""").result().to_dataframe()

tttr

Unnamed: 0,HTAN_Center,Tumor_Tissue_Type,count
0,HTAN HTAPP,Premalignant - in situ,2
1,HTAN Stanford,Premalignant,190
2,HTAN Vanderbilt,Atypia - hyperplasia,16
3,HTAN Vanderbilt,Premalignant,94
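
The total of 302 precancer biospecimens follows from summing the `count` column of this output. A minimal check, with the rows copied from the table above:

```python
# Rows copied from the tttr output above: (center, tumor tissue type, count).
tttr_rows = [
    ("HTAN HTAPP", "Premalignant - in situ", 2),
    ("HTAN Stanford", "Premalignant", 190),
    ("HTAN Vanderbilt", "Atypia - hyperplasia", 16),
    ("HTAN Vanderbilt", "Premalignant", 94),
]

# Overall number of specimens with a definitive precancer value.
total = sum(count for _, _, count in tttr_rows)
print(total)  # 302

# The same counts rolled up per center.
per_center = {}
for center, _, count in tttr_rows:
    per_center[center] = per_center.get(center, 0) + count
print(per_center)

# With the notebook's DataFrame: tttr["count"].sum()
```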


### 5.2 Using 'Tumor Tissue Type' and 'Histologic Morphology Code' in the Biospecimen Table

The histologic morphology code, based on the ICD-O-3 coding, can offer insight into the classification of the biospecimen. In the query below, we look at `Tumor Tissue Type` values for premalignancies, and associated histologic morphology codes.

The histologic morphology codes do not identify additional precancerous specimens.

In [None]:
hmc_all = client.query("""
  SELECT HTAN_Center,Tumor_Tissue_Type, Histologic_Morphology_Code
  FROM `isb-cgc-bq.HTAN.biospecimen_current`
  WHERE Tumor_Tissue_Type='Atypia - hyperplasia' OR Tumor_Tissue_Type='Premalignant' OR Tumor_Tissue_Type='Premalignant - in situ'
  ORDER BY HTAN_Center DESC, Histologic_Morphology_Code
""").result().to_dataframe()

hmc_all

Unnamed: 0,HTAN_Center,Tumor_Tissue_Type,Histologic_Morphology_Code
0,HTAN Vanderbilt,Atypia - hyperplasia,0
1,HTAN Vanderbilt,Atypia - hyperplasia,0
2,HTAN Vanderbilt,Atypia - hyperplasia,0
3,HTAN Vanderbilt,Atypia - hyperplasia,0
4,HTAN Vanderbilt,Atypia - hyperplasia,0
...,...,...,...
297,HTAN Stanford,Premalignant,Not Available
298,HTAN Stanford,Premalignant,Not Available
299,HTAN Stanford,Premalignant,Not Available
300,HTAN HTAPP,Premalignant - in situ,unknown


Examining the morphology codes, we find codes for tubular adenoma (M82110), serrated adenoma (M82130), and tubulovillous adenoma (M82630). No other codes indicating premalignancy have been reported to date.


In [None]:
hmc = client.query("""
  SELECT Tumor_Tissue_Type, Histologic_Morphology_Code, count(*) AS Count
  FROM `isb-cgc-bq.HTAN.biospecimen_current`
  WHERE Tumor_Tissue_Type='Atypia - hyperplasia' OR Tumor_Tissue_Type='Premalignant' OR Tumor_Tissue_Type='Premalignant - in situ'
  GROUP BY Tumor_Tissue_Type, Histologic_Morphology_Code
""").result().to_dataframe()

hmc

Unnamed: 0,Tumor_Tissue_Type,Histologic_Morphology_Code,Count
0,Premalignant - in situ,unknown,2
1,Premalignant,Not Available,190
2,Premalignant,M82110,61
3,Premalignant,M82130,26
4,Atypia - hyperplasia,Unknown,11
5,Atypia - hyperplasia,0,5
6,Premalignant,M82630,7


The morphology codes allow further granularity for selecting specific types of precancer samples.
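
As a sketch of that finer selection, the counts can be split into specimens carrying one of the three adenoma codes named earlier in this section and specimens without an informative code. The rows are copied from the `hmc` output above; storing the code `0` as a string is an assumption about the column type.

```python
# Code-to-name pairs as stated in the text of this section.
ADENOMA_CODES = {
    "M82110": "tubular adenoma",
    "M82130": "serrated adenoma",
    "M82630": "tubulovillous adenoma",
}

# Rows copied from the hmc output: (tumor tissue type, morphology code, count).
hmc_rows = [
    ("Premalignant - in situ", "unknown", 2),
    ("Premalignant", "Not Available", 190),
    ("Premalignant", "M82110", 61),
    ("Premalignant", "M82130", 26),
    ("Atypia - hyperplasia", "Unknown", 11),
    ("Atypia - hyperplasia", "0", 5),
    ("Premalignant", "M82630", 7),
]

# Specimens with a recognised adenoma code, keyed by readable name.
coded = {ADENOMA_CODES[code]: count for _, code, count in hmc_rows if code in ADENOMA_CODES}
# Specimens whose code is missing, unknown, or otherwise uninformative.
uncoded = sum(count for _, code, count in hmc_rows if code not in ADENOMA_CODES)

print(coded)
print(sum(coded.values()), uncoded)
```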

### 5.3 Final Outputs


Using the queries described above, we can identify a set of participants as precancerous cases and a set of biospecimens containing precancerous lesions. Because of limitations in the data, the sets may not be exhaustive; however, the included cases and specimens are definitively annotated as precancerous.

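HTAN Precancerous Cases: the query below outputs 916 cases (147 + 2 + 767; the Familial adenomatous polyposis cases from Section 4.3 are not part of this query).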

In [None]:
pcc_all = client.query("""
  SELECT HTAN_Center, HTAN_Participant_ID
  FROM `isb-cgc-bq.HTAN.clinical_tier1_diagnosis_current`
  WHERE Age_at_Diagnosis = '0'
  UNION ALL
  SELECT HTAN_Center, HTAN_Participant_ID
  FROM `isb-cgc-bq.HTAN.clinical_tier2_current`
  WHERE Precancerous_Condition_Type IS NOT NULL
  UNION ALL
  SELECT HTAN_Center, HTAN_Participant_ID
  FROM `isb-cgc-bq.HTAN.clinical_tier1_diagnosis_current`
  WHERE Primary_Diagnosis = 'Ductal carcinoma in situ NOS'
  GROUP BY HTAN_Center, HTAN_Participant_ID
""").result().to_dataframe()

pcc_all

Unnamed: 0,HTAN_Center,HTAN_Participant_ID
0,HTAN Vanderbilt,HTA11_10034
1,HTAN Vanderbilt,HTA11_10167
2,HTAN Vanderbilt,HTA11_104
3,HTAN Vanderbilt,HTA11_10466
4,HTAN Vanderbilt,HTA11_10557
...,...,...
911,HTAN Duke,HTA6_2428
912,HTAN Duke,HTA6_2431
913,HTAN Duke,HTA6_2432
914,HTAN HTAPP,HTA2_225
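
Note that `UNION ALL` keeps every row from each branch, so a participant who satisfies more than one criterion would appear more than once. The sketch below illustrates the difference on a toy in-memory SQLite database with hypothetical tables and rows. SQLite writes the de-duplicating form as `UNION`, whereas BigQuery spells it `UNION DISTINCT`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tier1 (HTAN_Participant_ID TEXT, Primary_Diagnosis TEXT)")
conn.execute("CREATE TABLE tier2 (HTAN_Participant_ID TEXT, Precancerous_Condition_Type TEXT)")
# Hypothetical rows: participant P1 satisfies both criteria.
conn.executemany(
    "INSERT INTO tier1 VALUES (?, ?)",
    [("P1", "Ductal carcinoma in situ NOS"), ("P2", "Ductal carcinoma in situ NOS")],
)
conn.executemany("INSERT INTO tier2 VALUES (?, ?)", [("P1", "Ductal Carcinoma In Situ")])

# UNION ALL concatenates the branches and keeps duplicates.
union_all = conn.execute("""
    SELECT HTAN_Participant_ID FROM tier1 WHERE Primary_Diagnosis = 'Ductal carcinoma in situ NOS'
    UNION ALL
    SELECT HTAN_Participant_ID FROM tier2 WHERE Precancerous_Condition_Type IS NOT NULL
""").fetchall()

# UNION (UNION DISTINCT in BigQuery) removes duplicate rows across branches.
union_distinct = conn.execute("""
    SELECT HTAN_Participant_ID FROM tier1 WHERE Primary_Diagnosis = 'Ductal carcinoma in situ NOS'
    UNION
    SELECT HTAN_Participant_ID FROM tier2 WHERE Precancerous_Condition_Type IS NOT NULL
""").fetchall()

print(len(union_all), len(union_distinct))
```

In the notebook itself, `pcc_all.drop_duplicates()` would be one way to check whether the combined result contains repeated participants.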


HTAN Precancerous Specimens: the query below outputs 302 specimens.

In [None]:
pcs_all = client.query("""
  SELECT HTAN_Center, Tumor_Tissue_Type, HTAN_Biospecimen_ID
  FROM `isb-cgc-bq.HTAN.biospecimen_current`
  WHERE Tumor_Tissue_Type in
    ('Atypia - hyperplasia',
    'Premalignant',
    'Premalignant - in situ')
""").result().to_dataframe()

pcs_all

Unnamed: 0,HTAN_Center,Tumor_Tissue_Type,HTAN_Biospecimen_ID
0,HTAN HTAPP,Premalignant - in situ,HTA2_229_1
1,HTAN HTAPP,Premalignant - in situ,HTA2_225_1
2,HTAN Stanford,Premalignant,HTA10_08_008
3,HTAN Stanford,Premalignant,HTA10_08_015
4,HTAN Stanford,Premalignant,HTA10_10_018
...,...,...,...
297,HTAN Vanderbilt,Premalignant,HTA11_9700_2000001011
298,HTAN Vanderbilt,Premalignant,HTA11_9827_2000001011
299,HTAN Vanderbilt,Premalignant,HTA11_9829_2000001011
300,HTAN Vanderbilt,Premalignant,HTA11_9998_2000001011
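
To relate these specimens back to cases, the participant ID can be derived from each biospecimen ID. The sketch below assumes, based only on the IDs displayed in this notebook, that the participant ID is the first two underscore-separated fields of the biospecimen ID. The helper name `participant_of` is hypothetical.

```python
def participant_of(biospecimen_id):
    """Return the first two underscore-separated fields of a biospecimen ID."""
    parts = biospecimen_id.split("_")
    if len(parts) < 3:
        raise ValueError(f"Unexpected biospecimen ID format: {biospecimen_id!r}")
    return "_".join(parts[:2])


# IDs copied from the pcs_all output above.
specimens = ["HTA2_229_1", "HTA2_225_1", "HTA10_08_008", "HTA10_08_015", "HTA10_10_018"]
participants = sorted({participant_of(s) for s in specimens})
print(participants)
```

Several specimens can map to the same participant, so the number of distinct participants is generally smaller than the number of specimens.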


NOTE:
We anticipate additional contribution of lung precancer cases and specimens from BU with the next data release.

The HTAN HMS team investigates skin lesions that include precursor fields, or regions within tissue and slides that reflect early events in melanoma.  Although these early events are not yet described in the HTAN metadata, we anticipate building annotations into the HTAN Data Model, at which time they will be retrievable.


# 6. Relevant Citations and Links



[HTAN Portal](https://humantumoratlas.org)

[Overview paper, Cell, April 2020](https://www.sciencedirect.com/science/article/pii/S0092867420303469)

[Cancer Prevention Research, July 2023](https://aacrjournals.org/cancerpreventionresearch/article-abstract/16/7/379/727480/PreCancer-Atlas-Present-and-FuturePreCancer)