<a href="https://colab.research.google.com/github/isb-cgc/Community-Notebooks/blob/master/HTAN/Python%20Notebooks/Identifying_and_Compiling_Precancer_Cases_and_Samples_in_HTAN.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Identifying and Compiling Precancer Cases and Samples in HTAN

        Title:   Identifying and Compiling Precancer Cases and Samples in HTAN
        Author:  Clarisse Lau, Vesteinn Thorsson & Kristen Anton
        Created: 2024-02-29
        Updated: 2025-06-12
        Purpose: Identify precancer cases and samples in HTAN using Google BigQuery metadata tables


# 1. Introduction & Overview
The Human Tumor Atlas Network ([HTAN](https://humantumoratlas.org/)) is a National Cancer Institute (NCI)-funded Cancer Moonshot<sup>SM</sup> initiative to construct 3-dimensional atlases of the dynamic cellular, morphological, and molecular features of human cancers as they evolve from precancerous lesions to advanced disease [[Cell April 2020](https://www.sciencedirect.com/science/article/pii/S0092867420303469)].


HTAN comprises 15 research centers and trans-network projects (TNPs), 5 of which are designated precancer atlases [[Cancer Prevention Research, July 2023](https://aacrjournals.org/cancerpreventionresearch/article-abstract/16/7/379/727480/PreCancer-Atlas-Present-and-FuturePreCancer)], with additional atlases generating precancer data.

HTAN atlases have supplied a rich set of clinical and biospecimen metadata. These annotations are made according to the HTAN Data Model, a set of standards defined by the HTAN consortium. The supplied values of these attributes have been collected into comprehensive data tables in the cloud using Google BigQuery, which is part of Google Cloud.

This notebook shows how clinical and biospecimen tables can be queried to identify precancer cases and samples in HTAN by querying on relevant metadata attributes. For the purpose of this notebook, data provided by the HTAN precancer atlases, designated as precancer, are included in the query output. The queries run from this notebook provide a snapshot of the HTAN data.

### 1.1 Goal
This example notebook illustrates how to make use of HTAN Google BigQuery clinical and biospecimen tables to identify precancer cases and specimens in HTAN.

### 1.2 Approach
Elements of the HTAN data model may be used to identify patients diagnosed with precancerous lesions and biospecimens collected at the precancer stage of disease. There are several approaches to creating precancer subsets. Because of how the data elements pertaining to precancer are structured, and because not all of them are required, the identification of precancer cases and biospecimens is currently not comprehensive.

To identify precancer cases, the data elements `Precancerous Condition Type` and `Age at Diagnosis`, both from Clinical Tier 1, may be queried. `Precancerous Condition Type` is not a required element, so not all cases will be annotated. Because `Age at Diagnosis` is a required element, many precancer cases use the value `0` to indicate no cancer diagnosis, providing a known value for the query.

To identify biospecimens collected at the precancer stage of disease, the data element `Tumor Tissue Type` from the Biospecimen table may be queried for the permissible values `Premalignant`, `Premalignant - in situ`, and `Atypia - hyperplasia`. This data element is required for all biospecimens. The values `Not Otherwise Specified` and `None` create ambiguity in the query result.

The HTAN DCC is refining the data model to facilitate direct query of the data that will provide comprehensive subsetting of precancer cases and specimens.


### 1.3 Inputs, Outputs, & Data
The originating data can be found on the [HTAN Data Portal](https://humantumoratlas.org/), and the compiled tables are on the [ISB-Cancer Gateway in the Cloud](https://isb-cgc.appspot.com/bq_meta_search/).

Query outputs load into DataFrames, which display columns and rows. Beside each output table, there are two icons. The top icon converts the DataFrame to an interactive table; a link below the table opens the [Data Table Notebook](https://colab.research.google.com/notebooks/data_table.ipynb), which gives tips on filtering and further customizing the table. The lower icon converts the output to a graphical format, which may be more or less informative depending on the output.

### 1.4 Notes
The queries and results in this notebook correspond to ISB-CGC's most current HTAN Release.

To choose a specific release, edit the BigQuery table names in this notebook by replacing the relevant string endings:
- `current` with a selected numbered release, e.g., `r2`
- `HTAN` with `HTAN_versioned`
- `gc` with `cds` (if applicable)
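
The first two substitutions above can be sketched as a small helper. This is a minimal illustration, not part of any ISB-CGC tooling: the function name `versioned_table_name` is hypothetical, the `gc`/`cds` substitution is not handled, and whether a given release exists for a given table should be checked in BigQuery.

```python
def versioned_table_name(current_table: str, release: str) -> str:
    """Turn a `..._current` table ID into its numbered-release counterpart.

    Hypothetical helper: applies the `HTAN` -> `HTAN_versioned` and
    `current` -> release substitutions described in the notes above.
    """
    project, dataset, table = current_table.split(".")
    if not table.endswith("_current"):
        raise ValueError(f"expected a table name ending in '_current': {table}")
    # Swap the trailing 'current' for the release label, e.g. 'r2'
    versioned = table[: -len("current")] + release
    return f"{project}.{dataset}_versioned.{versioned}"


print(versioned_table_name("isb-cgc-bq.HTAN.biospecimen_current", "r2"))
```

The returned string can be pasted into the `FROM` clause of any query in this notebook in place of the `_current` table name.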

### ⚠️PAUSE
Notebooks associated with ISB-CGC HTAN Releases 6.0 or earlier are based on the **HTAN Phase 1 Data Model**. Please be aware that the structure, terminology, and available data elements may differ from those in Phase 2. 

# 2. Environment & Module Setup

In [2]:
# import libraries
import pandas as pd

# 3. Google Authentication

Running the BigQuery cells in this notebook requires a Google Cloud Project. Instructions for creating a project can be found in the [Google Cloud Documentation](https://cloud.google.com/resource-manager/docs/creating-managing-projects#console). The instance needs to be authorized to bill the project for queries. For more information on getting started with ISB-CGC, see the [Quick Start Guide to ISB-CGC](https://isb-cancer-genomics-cloud.readthedocs.io/en/latest/sections/HowToGetStartedonISB-CGC.html). Alternative authentication methods can be found in the [Google Cloud Documentation](https://cloud.google.com/resource-manager/docs/creating-managing-projects#console).

## 3.1 Authenticating with Google Credentials



#### Option 1. Running in Google Colab

If you are using Google Colab, run the code block below to authenticate.

In [None]:
from google.colab import auth
auth.authenticate_user()

#### Option 2. Running on local machine

Alternatively, if you're running the notebook locally, take the following steps to authenticate.

1.   Run `gcloud auth application-default login` on your local machine
2.   Run the command below replacing `<path to key>` with the path to your credentials file

In [None]:
# env GOOGLE_APPLICATION_CREDENTIALS='<path to key>'
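
As an alternative to the `env` line above, the same environment variable can be set from Python before the BigQuery client is created. The sketch below is one possible way to do this, with a hypothetical helper name (`use_credentials_file`) and a check that the file exists; Google's client libraries read `GOOGLE_APPLICATION_CREDENTIALS` when they look for default credentials.

```python
import os
from pathlib import Path


def use_credentials_file(path_to_key: str) -> str:
    """Point Google client libraries at a credentials file via the environment.

    Hypothetical helper: fails early if the file is missing instead of
    letting the first query fail with an authentication error.
    """
    key_path = Path(path_to_key).expanduser()
    if not key_path.is_file():
        raise FileNotFoundError(f"credentials file not found: {key_path}")
    os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = str(key_path)
    return str(key_path)
```

Call `use_credentials_file('<path to key>')` with your own path before running the client initialization cell.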

## 3.2 Initializing the Google BigQuery client


In [1]:
# Import the Google BigQuery client
from google.cloud import bigquery

# Set the Google project that will be billed for this notebook's computations
# Replace <my-project> with your BigQuery Project ID
google_project = '<my-project>'

# Create a client to access the data within BigQuery
client = bigquery.Client(google_project)

# 4. Identifying Precancerous Cases using the Clinical Data BigQuery Tables

## 4.1 Using 'Age at Diagnosis' in the Clinical Tier 1 Diagnosis Table

`Age at Diagnosis` is a required data element defined as the age of the participant at the time of diagnosis, expressed in number of days since birth. Atlases contributing data on precancers populate `Age at Diagnosis` with `0` since participants do not receive a cancer diagnosis.

In [3]:
aad = client.query("""
  SELECT HTAN_Center, Age_at_Diagnosis, Primary_Diagnosis, HTAN_Participant_ID
  FROM `isb-cgc-bq.HTAN.clinical_tier1_diagnosis_current`
  WHERE HTAN_Center in (
    "HTAN Duke",
    "HTAN Stanford",
    "HTAN Vanderbilt",
    "HTAN BU",
    "HTAN HMS")
  AND Age_at_Diagnosis = '0'
  GROUP BY HTAN_Center, Age_at_Diagnosis, Primary_Diagnosis, HTAN_Participant_ID
""").result().to_dataframe()

aad

Unnamed: 0,HTAN_Center,Age_at_Diagnosis,Primary_Diagnosis,HTAN_Participant_ID
0,HTAN Vanderbilt,0,Not Reported,HTA11_10034
1,HTAN Vanderbilt,0,Not Reported,HTA11_10167
2,HTAN Vanderbilt,0,Not Reported,HTA11_104
3,HTAN Vanderbilt,0,Not Reported,HTA11_10466
4,HTAN Vanderbilt,0,Not Reported,HTA11_10557
...,...,...,...,...
251,HTAN BU,0,Not Reported,HTA3_70137
252,HTAN BU,0,Not Reported,HTA3_70151
253,HTAN BU,0,Not Reported,HTA3_70154
254,HTAN BU,0,Not Reported,HTA3_70160


Using this criterion, and narrowing the search to precancer centers, the output above includes cases from the Vanderbilt and BU HTAN centers, and the rows shown designate primary diagnosis as `Not Reported`.
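
To see how the matching cases are distributed across centers, the result can be grouped by `HTAN_Center`. The sketch below is a minimal illustration: `aad_sample` is a hand-built stand-in containing four rows copied from the output above, so the code runs without a BigQuery connection. With a live result, the `aad` DataFrame can be used in its place.

```python
import pandas as pd

# Four rows copied from the output above; the real `aad` comes from the query.
aad_sample = pd.DataFrame(
    {
        "HTAN_Center": ["HTAN Vanderbilt", "HTAN Vanderbilt", "HTAN BU", "HTAN BU"],
        "HTAN_Participant_ID": ["HTA11_10034", "HTA11_10167", "HTA3_70137", "HTA3_70151"],
    }
)

# Count distinct participants per center
cases_per_center = aad_sample.groupby("HTAN_Center")["HTAN_Participant_ID"].nunique()
print(cases_per_center)
```

Using `nunique()` rather than `count()` guards against a participant being counted twice if the same ID appears in more than one row.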


## 4.2 Using 'Precancerous Condition Type' in the Clinical Tier 1 Diagnosis Table

`Precancerous Condition Type` is an optional data element that uses standardized terms to classify the precancerous cells observed in the participant's tissue. Because the element is not required, few records carry a value, so this method gives only limited information about precancers. Only records with values for `Precancerous Condition Type` are reported in the table below.


In [4]:
pct = client.query("""
  SELECT Precancerous_Condition_Type, HTAN_Center, HTAN_Participant_ID
  FROM `isb-cgc-bq.HTAN.clinical_tier1_diagnosis_current`
  WHERE Precancerous_Condition_Type IS NOT NULL AND Precancerous_Condition_Type != "Not Applicable"
""").result().to_dataframe()

pct

Unnamed: 0,Precancerous_Condition_Type,HTAN_Center,HTAN_Participant_ID
0,Ductal Carcinoma In Situ,HTAN HTAPP,HTA2_225
1,Ductal Carcinoma In Situ,HTAN HTAPP,HTA2_229


This query identifies two additional cases.

## 4.3 Using 'Primary Diagnosis' in the Clinical Tier 1 Diagnosis Table

`Primary Diagnosis` is a required data element that uses the World Health Organization's (WHO) International Classification of Diseases for Oncology (ICD-O) to describe the patient's histologic diagnosis. The data element's permissible value list contains a limited number of values that describe precancers. `Ductal carcinoma in situ NOS` is one such value, and we can use it to identify a set of breast precancer cases; the query below also includes `Familial adenomatous polyposis`.

In [5]:
pddcis = client.query("""
  SELECT HTAN_Center, Primary_Diagnosis, HTAN_Participant_ID
  FROM `isb-cgc-bq.HTAN.clinical_tier1_diagnosis_current`
  WHERE Primary_Diagnosis = 'Ductal carcinoma in situ NOS' OR Primary_Diagnosis = 'Familial adenomatous polyposis'
""").result().to_dataframe()

pddcis

Unnamed: 0,HTAN_Center,Primary_Diagnosis,HTAN_Participant_ID
0,HTAN WUSTL,Ductal carcinoma in situ NOS,HTA12_247
1,HTAN WUSTL,Ductal carcinoma in situ NOS,HTA12_250
2,HTAN WUSTL,Ductal carcinoma in situ NOS,HTA12_283
3,HTAN WUSTL,Ductal carcinoma in situ NOS,HTA12_285
4,HTAN Duke,Ductal carcinoma in situ NOS,HTA6_7
...,...,...,...
792,HTAN Stanford,Familial adenomatous polyposis,HTA10_06
793,HTAN Stanford,Familial adenomatous polyposis,HTA10_08
794,HTAN Stanford,Familial adenomatous polyposis,HTA10_10
795,HTAN Stanford,Familial adenomatous polyposis,HTA10_04


In [7]:
pddcis.groupby('Primary_Diagnosis')['HTAN_Participant_ID'].count()

Primary_Diagnosis
Ductal carcinoma in situ NOS      787
Familial adenomatous polyposis     10
Name: HTAN_Participant_ID, dtype: int64

This query returns an additional 797 cases: 787 cases with disease identified as Ductal Carcinoma in situ, and 10 cases with disease identified as Familial Adenomatous Polyposis (FAP).

# 5. Identifying Precancerous Samples using the Biospecimen BigQuery Table

## 5.1 Using 'Tumor Tissue Type' in the Biospecimen Table

The Biospecimen attributes offer some metadata to help differentiate precancerous lesion samples from tumor samples.  The required data element `Tumor Tissue Type`, defined as text that describes the kind of disease present in the tumor specimen as related to a specific time point, includes several permissible values that classify a precancerous lesion.  These include `Premalignant`, `Atypia - hyperplasia`, and `Premalignant - in situ`. A value of `Not Otherwise Specified` may include precancers, but this value is ambiguous. Although the name of this data element implies that it is an attribute of tumor tissue only, the permissible values include both precancerous and cancerous descriptors.
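
The distinction between definitive and ambiguous values can be written down as a small classifier. This is a sketch under stated assumptions: the function name `precancer_status` and the three labels it returns are hypothetical, and treating missing values as ambiguous is a choice made here for illustration rather than a rule from the HTAN data model.

```python
from typing import Optional

# Permissible values named in the text above
DEFINITIVE = {"Premalignant", "Premalignant - in situ", "Atypia - hyperplasia"}
AMBIGUOUS = {"Not Otherwise Specified"}


def precancer_status(tumor_tissue_type: Optional[str]) -> str:
    """Return 'precancer', 'ambiguous', or 'other' for a Tumor Tissue Type value."""
    # Assumption: a missing or empty value could hide a precancer sample
    if tumor_tissue_type is None or not tumor_tissue_type.strip():
        return "ambiguous"
    value = tumor_tissue_type.strip()
    if value in DEFINITIVE:
        return "precancer"
    if value in AMBIGUOUS:
        return "ambiguous"
    return "other"


for example in ["Premalignant", "Not Otherwise Specified", None, "Primary"]:
    print(example, "->", precancer_status(example))
```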

The table below counts biospecimens by `HTAN_Center` and `Tumor_Tissue_Type` across the full HTAN Biospecimen dataset.

In [8]:
ttt_all = client.query("""
  SELECT HTAN_Center, Tumor_Tissue_Type, Count(*) as count
  FROM `isb-cgc-bq.HTAN.biospecimen_current`
  GROUP BY HTAN_Center, Tumor_Tissue_Type
  ORDER BY HTAN_Center, Tumor_Tissue_Type
""").result().to_dataframe()

ttt_all

Unnamed: 0,HTAN_Center,Tumor_Tissue_Type,count
0,HTAN BU,,13
1,HTAN BU,Additional Primary,6
2,HTAN BU,Atypia - hyperplasia,231
3,HTAN BU,Normal,63
4,HTAN BU,Normal adjacent,158
...,...,...,...
70,HTAN WUSTL,Normal distant,2
71,HTAN WUSTL,Post therapy neoadjuvant,58
72,HTAN WUSTL,Premalignant,9
73,HTAN WUSTL,Primary,719



Restricting the query to permissible values that do or may indicate precancer, along with missing values, yields the following output.

In [9]:
ttt = client.query("""
  SELECT HTAN_Center, Tumor_Tissue_Type, Count(*) as count
  FROM `isb-cgc-bq.HTAN.biospecimen_current`
  WHERE Tumor_Tissue_Type is null OR Tumor_Tissue_Type in
    ('Atypia - hyperplasia',
    'Not Otherwise Specified',
    'Premalignant',
    'Premalignant - in situ') 
  GROUP BY HTAN_Center, Tumor_Tissue_Type
  ORDER BY HTAN_Center, Tumor_Tissue_Type
""").result().to_dataframe()

ttt

Unnamed: 0,HTAN_Center,Tumor_Tissue_Type,count
0,HTAN BU,,13
1,HTAN BU,Atypia - hyperplasia,231
2,HTAN BU,Not Otherwise Specified,723
3,HTAN BU,Premalignant,349
4,HTAN BU,Premalignant - in situ,152
5,HTAN DFCI,,42
6,HTAN Duke,Not Otherwise Specified,112
7,HTAN Duke,Premalignant - in situ,45
8,HTAN HMS,Not Otherwise Specified,332
9,HTAN HTAPP,,20


Further restricting the query to definitive values indicating precancer yields the following output, allowing us to identify 1128 precancer biospecimens.


In [10]:
tttr = client.query("""
  SELECT HTAN_Center, Tumor_Tissue_Type, Count(*) as count
  FROM `isb-cgc-bq.HTAN.biospecimen_current`
  WHERE Tumor_Tissue_Type in
    ('Atypia - hyperplasia',
    'Premalignant',
    'Premalignant - in situ')
  GROUP BY HTAN_Center, Tumor_Tissue_Type
  ORDER BY HTAN_Center, Tumor_Tissue_Type
""").result().to_dataframe()

tttr

Unnamed: 0,HTAN_Center,Tumor_Tissue_Type,count
0,HTAN BU,Atypia - hyperplasia,231
1,HTAN BU,Premalignant,349
2,HTAN BU,Premalignant - in situ,152
3,HTAN Duke,Premalignant - in situ,45
4,HTAN HTAPP,Premalignant - in situ,2
5,HTAN SRRS,Premalignant,4
6,HTAN Stanford,Premalignant,216
7,HTAN Vanderbilt,Atypia - hyperplasia,17
8,HTAN Vanderbilt,Premalignant,103
9,HTAN WUSTL,Premalignant,9


In [13]:
tttr['count'].sum()

1128
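
The same counts can also be rolled up by center or by tissue type. The sketch below re-enters the ten rows of the `tttr` output above by hand so that it runs without a BigQuery connection; `tttr_counts` is a name introduced here for illustration, and with a live result `tttr` itself can be used instead.

```python
import pandas as pd

# Rows copied from the `tttr` output above
tttr_counts = pd.DataFrame(
    [
        ("HTAN BU", "Atypia - hyperplasia", 231),
        ("HTAN BU", "Premalignant", 349),
        ("HTAN BU", "Premalignant - in situ", 152),
        ("HTAN Duke", "Premalignant - in situ", 45),
        ("HTAN HTAPP", "Premalignant - in situ", 2),
        ("HTAN SRRS", "Premalignant", 4),
        ("HTAN Stanford", "Premalignant", 216),
        ("HTAN Vanderbilt", "Atypia - hyperplasia", 17),
        ("HTAN Vanderbilt", "Premalignant", 103),
        ("HTAN WUSTL", "Premalignant", 9),
    ],
    columns=["HTAN_Center", "Tumor_Tissue_Type", "count"],
)

# Precancer biospecimens per center and per tissue type
per_center = tttr_counts.groupby("HTAN_Center")["count"].sum()
per_type = tttr_counts.groupby("Tumor_Tissue_Type")["count"].sum()

print(per_center)
print(per_type)
```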

## 5.2 Using 'Tumor Tissue Type' and 'Histologic Morphology Code' in the Biospecimen Table

The histologic morphology code, based on the ICD-O-3 coding, can offer insight into the classification of the biospecimen. In the query below, we look at `Tumor Tissue Type` values for premalignancies, and associated histologic morphology codes.

The histologic morphology codes do not identify additional precancerous specimens.

In [14]:
hmc_all = client.query("""
  SELECT HTAN_Center, Tumor_Tissue_Type, Histologic_Morphology_Code
  FROM `isb-cgc-bq.HTAN.biospecimen_current`
  WHERE Tumor_Tissue_Type='Atypia - hyperplasia'
  OR Tumor_Tissue_Type='Premalignant'
  OR Tumor_Tissue_Type='Premalignant - in situ'
  ORDER BY HTAN_Center DESC, Histologic_Morphology_Code
""").result().to_dataframe()

hmc_all

Unnamed: 0,HTAN_Center,Tumor_Tissue_Type,Histologic_Morphology_Code
0,HTAN WUSTL,Premalignant,9732/3
1,HTAN WUSTL,Premalignant,9732/3
2,HTAN WUSTL,Premalignant,9732/3
3,HTAN WUSTL,Premalignant,9732/3
4,HTAN WUSTL,Premalignant,9732/3
...,...,...,...
1123,HTAN BU,Atypia - hyperplasia,Not Specified
1124,HTAN BU,Atypia - hyperplasia,Not Specified
1125,HTAN BU,Premalignant,Not Specified
1126,HTAN BU,Premalignant,Not Specified


When reviewing the `Histologic_Morphology_Code` data element, we find that while some entries are valid morphology codes, some are full International Classification of Diseases for Oncology, Third Edition (ICD-O-3) codes. This field is specifically intended to capture *only morphology codes*. However, since data is entered by different HTAN Centers, some inconsistencies are to be expected.

In [15]:
hmc_all['Histologic_Morphology_Code'].unique().tolist()

['9732/3',
 '0',
 'M82110',
 'M82130',
 'M82630',
 'Unknown',
 'Not Available',
 '8220/0',
 'unknown',
 '8500',
 '99999',
 'Not Specified']

This isn't a big issue as we can easily filter the results to include only the appropriate values for this scenario. Specifically, we will exclude entries that are unknown, not specified, or correspond to malignant ICD-O-3 behavior codes.

**Note:** Values such as `0` and `99999` are *not valid* codes. They indicate missing, unknown, or unspecified data.

ICD-O-3 codes follow a specific structure:

- The first four digits represent the histologic type.

- The digit following the slash indicates tumor behavior:

  - `/0` = benign

  - `/1` = uncertain whether benign or malignant

  - `/2` = in situ

  - `/3` = malignant

  - `/6` = metastatic (secondary)

By filtering out malignant and invalid codes, we ensure that only relevant precancer morphology codes are used in downstream analyses.
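
The code structure described above can also be applied programmatically instead of listing excluded values by hand. The following is a minimal sketch, not the notebook's method: the names `parse_morphology` and `is_non_malignant` are hypothetical, and it assumes that `M`-prefixed entries such as `M82110` are a four-digit histologic type followed by a single behavior digit.

```python
import re
from typing import NamedTuple, Optional


class MorphologyCode(NamedTuple):
    histology: str           # four-digit histologic type
    behavior: Optional[str]  # behavior digit, or None if not recorded


_SLASH = re.compile(r"^(\d{4})/(\d)$")     # e.g. 8220/0
_M_PREFIX = re.compile(r"^M(\d{4})(\d)$")  # e.g. M82110 (assumed layout)
_BARE = re.compile(r"^(\d{4})$")           # e.g. 8500, no behavior digit


def parse_morphology(raw: str) -> Optional[MorphologyCode]:
    """Parse a Histologic_Morphology_Code entry; return None if it is not a code."""
    value = raw.strip()
    for pattern in (_SLASH, _M_PREFIX):
        match = pattern.match(value)
        if match:
            return MorphologyCode(match.group(1), match.group(2))
    if _BARE.match(value):
        return MorphologyCode(value, None)
    return None


def is_non_malignant(raw: str) -> bool:
    """True only for parseable codes with behavior /0, /1, or /2."""
    code = parse_morphology(raw)
    return code is not None and code.behavior in {"0", "1", "2"}


observed = ["9732/3", "0", "M82110", "M82130", "M82630", "Unknown",
            "Not Available", "8220/0", "unknown", "8500", "99999", "Not Specified"]
kept = [value for value in observed if is_non_malignant(value)]
print(kept)
```

Entries without a behavior digit, such as `8500`, are dropped by this sketch because their behavior cannot be determined from the value alone.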

In [18]:
exclude_vals = ['9732/3', '8500', 'Not Specified', 'Unknown', 'unknown', 'Not Available', '0', '99999']
filtered_hmc = hmc_all[~hmc_all['Histologic_Morphology_Code'].isin(exclude_vals)]
filtered_hmc

Unnamed: 0,HTAN_Center,Tumor_Tissue_Type,Histologic_Morphology_Code
15,HTAN Vanderbilt,Premalignant,M82110
16,HTAN Vanderbilt,Premalignant,M82110
17,HTAN Vanderbilt,Premalignant,M82110
18,HTAN Vanderbilt,Premalignant,M82110
19,HTAN Vanderbilt,Premalignant,M82110
...,...,...,...
117,HTAN Vanderbilt,Premalignant,M82630
345,HTAN SRRS,Premalignant,8220/0
346,HTAN SRRS,Premalignant,8220/0
347,HTAN SRRS,Premalignant,8220/0


Examining the morphology codes after filtering inappropriate values, we find codes for tubular adenoma (M82110), serrated adenoma (M82130), tubulovillous adenoma (M82630), and the benign condition of adenomatous polyposis coli (8220/0). 

In [22]:
hmc = client.query("""
  SELECT Tumor_Tissue_Type, Histologic_Morphology_Code, COUNT(*) AS Count
  FROM `isb-cgc-bq.HTAN.biospecimen_current`
  WHERE (Tumor_Tissue_Type = 'Atypia - hyperplasia'
         OR Tumor_Tissue_Type = 'Premalignant'
         OR Tumor_Tissue_Type = 'Premalignant - in situ')
    AND Histologic_Morphology_Code NOT IN ('9732/3', '8500', 'Not Specified', 'Unknown', 'unknown', 'Not Available', '0', '99999')
  GROUP BY Tumor_Tissue_Type, Histologic_Morphology_Code
""").result().to_dataframe()

hmc

Unnamed: 0,Tumor_Tissue_Type,Histologic_Morphology_Code,Count
0,Premalignant,M82110,72
1,Premalignant,M82130,24
2,Premalignant,M82630,7
3,Premalignant,8220/0,4


The morphology codes allow further granularity for selecting specific types of precancer samples.
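
As one illustration of that granularity, the codes can be mapped to readable labels and then used to select a subset. The sketch below re-enters the four rows of the `hmc` output by hand so it runs without a BigQuery connection; `hmc_counts` and `CODE_LABELS` are names introduced here, and the labels are the ones given in the text above.

```python
import pandas as pd

# Rows copied from the `hmc` output above
hmc_counts = pd.DataFrame(
    {
        "Histologic_Morphology_Code": ["M82110", "M82130", "M82630", "8220/0"],
        "Count": [72, 24, 7, 4],
    }
)

# Labels as described in this notebook
CODE_LABELS = {
    "M82110": "tubular adenoma",
    "M82130": "serrated adenoma",
    "M82630": "tubulovillous adenoma",
    "8220/0": "adenomatous polyposis coli",
}
hmc_counts["Label"] = hmc_counts["Histologic_Morphology_Code"].map(CODE_LABELS)

# Select only the adenoma subtypes and total their specimen counts
adenomas = hmc_counts[hmc_counts["Label"].str.endswith("adenoma")]
adenoma_total = int(adenomas["Count"].sum())
print(adenomas)
print(adenoma_total)
```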

## 5.3 Final Outputs


Using the queries described above, we can identify a set of participants as precancerous cases and a set of biospecimens containing precancerous lesions. Because of limitations in the data, the sets may not be exhaustive; however, the included cases and specimens are definitively annotated as precancerous.
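
When several criteria are combined, it can be useful to keep track of which criterion identified each participant. The sketch below shows one way to do that with plain Python sets. The participant IDs (`EX_001` and so on) are invented placeholders, not HTAN identifiers, and `criteria_by_participant` is a hypothetical helper.

```python
from collections import defaultdict
from typing import Dict, List, Set

# Placeholder IDs for illustration only; real IDs would come from the
# HTAN_Participant_ID column of each query result.
matches: Dict[str, Set[str]] = {
    "Age_at_Diagnosis = '0'": {"EX_001", "EX_002"},
    "Precancerous_Condition_Type": {"EX_002", "EX_003"},
    "Primary_Diagnosis": {"EX_004"},
}


def criteria_by_participant(matches: Dict[str, Set[str]]) -> Dict[str, List[str]]:
    """Invert criterion -> IDs into participant ID -> list of matching criteria."""
    by_id: Dict[str, List[str]] = defaultdict(list)
    for criterion, ids in matches.items():
        for participant_id in sorted(ids):
            by_id[participant_id].append(criterion)
    return dict(by_id)


provenance = criteria_by_participant(matches)
for participant_id in sorted(provenance):
    print(participant_id, provenance[participant_id])
```

The number of keys in `provenance` is the number of distinct participants, which corresponds to what a SQL query combining the criteria with `OR` and grouping by participant returns.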

HTAN Precancerous Cases: output 1045 cases.

In [20]:
pcc_all = client.query("""
  SELECT HTAN_Center, HTAN_Participant_ID
  FROM `isb-cgc-bq.HTAN.clinical_tier1_diagnosis_current`
  WHERE Age_at_Diagnosis = '0'
  OR (Precancerous_Condition_Type IS NOT NULL AND Precancerous_Condition_Type != "Not Applicable")
  OR Primary_Diagnosis = 'Ductal carcinoma in situ NOS'
  GROUP BY HTAN_Center, HTAN_Participant_ID
""").result().to_dataframe()

pcc_all

Unnamed: 0,HTAN_Center,HTAN_Participant_ID
0,HTAN HTAPP,HTA2_225
1,HTAN HTAPP,HTA2_229
2,HTAN Vanderbilt,HTA11_10034
3,HTAN Vanderbilt,HTA11_10167
4,HTAN Vanderbilt,HTA11_104
...,...,...
1040,HTAN Duke,HTA6_2505
1041,HTAN Duke,HTA6_2506
1042,HTAN Duke,HTA6_2507
1043,HTAN Duke,HTA6_2508


HTAN Precancerous Specimens: output 1128 specimens

In [21]:
pcs_all = client.query("""
  SELECT HTAN_Center, Tumor_Tissue_Type, HTAN_Biospecimen_ID
  FROM `isb-cgc-bq.HTAN.biospecimen_current`
  WHERE Tumor_Tissue_Type in
    ('Atypia - hyperplasia',
    'Premalignant',
    'Premalignant - in situ')
""").result().to_dataframe()

pcs_all

Unnamed: 0,HTAN_Center,Tumor_Tissue_Type,HTAN_Biospecimen_ID
0,HTAN HTAPP,Premalignant - in situ,HTA2_229_1
1,HTAN HTAPP,Premalignant - in situ,HTA2_225_1
2,HTAN Vanderbilt,Premalignant,HTA11_10034_2000001011
3,HTAN Vanderbilt,Premalignant,HTA11_10167_2000001011
4,HTAN Vanderbilt,Premalignant,HTA11_104_2000001011
...,...,...,...
1123,HTAN Stanford,Premalignant,HTA10_16_116
1124,HTAN SRRS,Premalignant,HTA15_100005_007
1125,HTAN SRRS,Premalignant,HTA15_100005_007001
1126,HTAN SRRS,Premalignant,HTA15_100004_114


# 6. Relevant Citations and Links



[HTAN Portal](https://humantumoratlas.org)

[Overview paper, Cell, April 2020](https://www.sciencedirect.com/science/article/pii/S0092867420303469)

[Cancer Prevention Research, July 2023](https://aacrjournals.org/cancerpreventionresearch/article-abstract/16/7/379/727480/PreCancer-Atlas-Present-and-FuturePreCancer)

[International Classification of Diseases for Oncology, Third Edition](https://iris.who.int/bitstream/handle/10665/96612/9789241548496_eng.pdf;jsessionid=BD11257ACC2153EBC1E722C3B4E8E7AE?sequence=1)