---
title: Querying Human Reference Epigenome
description: We first collected the TSS of all human genes where we have expression data for the orthologous rat gene. We used Laura's tools to query CAGE tracks from genome-wide Enformer predictions on the reference genome.
date: 8/17/2023
author: Sabrina Mi
---

## Setup

In [35]:
import h5py
import pandas as pd
import numpy as np


enfref_dir = "/grand/TFXcan/imlab/users/lvairus/reftile_project/enformer-reference-epigenome"

def query_epigenome(chr_num, center_bp, num_bins=3, tracks=-1):
    """
    Extract Enformer predictions from the precomputed reference epigenome.

    Reads from the module-level enfref_dir, the directory containing the
    concatenated reference enformer files (one chr{chr_num}_cat.h5 per chromosome).

    Parameters:
        chr_num (int/string): chromosome number
        center_bp (int): center base pair position (1-indexed)
        num_bins (int): number of 128-bp bins to extract centered around center_bp (default: 3)
            note: if the number of bins is even, the center bin will be in the second half of the array
        tracks (int list): list of tracks to extract (default: -1, meaning all 5313 tracks)

    Returns:
        epigen (np.array): enformer predictions centered at center_bp of shape (num_bins, len(tracks))
    """

    # from position choose center bin
    center_ind = center_bp - 1
    center_bin = center_ind // 128
    
    half_bins = num_bins // 2
    start_bin = center_bin - half_bins
    end_bin = center_bin + half_bins
    if num_bins % 2 != 0: # if num_bins is odd
        end_bin += 1

    with h5py.File(f"{enfref_dir}/chr{chr_num}_cat.h5", "r") as f:
        # tracks == -1 means "all tracks"; otherwise select only the listed track indices
        if isinstance(tracks, int) and tracks == -1:
            epigen = f[f'chr{chr_num}'][start_bin:end_bin, :]
        else:
            epigen = f[f'chr{chr_num}'][start_bin:end_bin, tracks]

    return epigen
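
The bin arithmetic inside `query_epigenome` can be checked without touching the HDF5 files. The sketch below reproduces only the index calculation; `bin_window` and `BIN_SIZE` are hypothetical names introduced here for illustration, and the 128-bp bin width is taken from the function above.

```python
BIN_SIZE = 128  # bin width in base pairs, as used in query_epigenome

def bin_window(center_bp, num_bins=3, bin_size=BIN_SIZE):
    """Return (start_bin, end_bin) of the half-open bin slice around a 1-indexed position."""
    center_bin = (center_bp - 1) // bin_size  # convert to 0-indexed, then to a bin index
    half_bins = num_bins // 2
    start_bin = center_bin - half_bins
    end_bin = center_bin + half_bins
    if num_bins % 2 != 0:  # odd window: extend by one so the center bin is in the middle
        end_bin += 1
    return start_bin, end_bin

# PRDM16 TSS (chr1:3069203) with the default 3-bin window
print(bin_window(3069203, num_bins=3))  # (23977, 23980)
```

Note that a position inside the first bin of a chromosome produces a negative `start_bin` (for example, `bin_window(1, 3)` gives `(-1, 2)`), so the resulting slice would not select the intended bins. A guard for that case may be worth adding if any TSS lies that close to a chromosome start.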

## Collect TSS for all human genes

We collected all protein-coding genes and their canonical TSS using the biomaRt package.


In [49]:

hg38_annot = pd.read_csv("protein_coding_TSS.tsv", header=0, sep="\t", index_col='ensembl_gene_id')
print(hg38_annot.shape[0], "genes with TSS annotation")
hg38_annot.head()

19688 genes with TSS annotation


| ensembl_gene_id | external_gene_name | chromosome_name | transcription_start_site |
|---|---|---|---|
| ENSG00000142611 | PRDM16 | 1 | 3069203 |
| ENSG00000157911 | PEX10 | 1 | 2412564 |
| ENSG00000142655 | PEX14 | 1 | 10474950 |
| ENSG00000149527 | PLCH2 | 1 | 2476289 |
| ENSG00000171621 | SPSB1 | 1 | 9292894 |


We computed the predicted human reference epigenome by running Enformer on all intervals spanning the genome and concatenating the results. Now we extract the CAGE:Brain track (track index 4980) and average it over the three 128-bp bins centered on each gene's TSS, in order to quantify relative expression across genes.

## Query reference epigenome at TSS

In [51]:
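def quantify_expression(gene):
    # Look up the gene's chromosome and canonical TSS in the annotation table
    chrom = hg38_annot.loc[gene, 'chromosome_name']
    tss = hg38_annot.loc[gene, 'transcription_start_site']
    # Average the CAGE:Brain track (index 4980) over the 3 bins centered on the TSS
    return np.average(query_epigenome(chrom, tss, num_bins=3, tracks=[4980]))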

In [50]:
CAGE_predictions = []
for gene in hg38_annot.index:
    CAGE_predictions.append(quantify_expression(gene))
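
The loop above opens an HDF5 file once per gene. If a single track for one chromosome fits in memory, the same 3-bin average could be computed for many TSSs at once. The following is a minimal sketch on a toy in-memory array, not the method used in this notebook; `mean_signal_at_tss` is a hypothetical helper and the toy values are made up for illustration.

```python
import numpy as np

def mean_signal_at_tss(track, tss_positions, num_bins=3, bin_size=128):
    """Average a 1-D per-bin track over num_bins bins centered on each 1-indexed TSS."""
    if num_bins % 2 == 0:
        raise ValueError("this sketch only handles an odd number of bins")
    track = np.asarray(track, dtype=float)
    center_bins = (np.asarray(tss_positions) - 1) // bin_size
    half = num_bins // 2
    offsets = np.arange(-half, half + 1)                    # e.g. [-1, 0, 1]
    window_bins = center_bins[:, None] + offsets[None, :]   # shape (n_tss, num_bins)
    return track[window_bins].mean(axis=1)

# Toy track with 10 bins whose value equals the bin index (illustrative only)
toy_track = np.arange(10, dtype=float)
toy_tss = [300, 700]  # fall in bins 2 and 5
print(mean_signal_at_tss(toy_track, toy_tss))  # [2. 5.]
```

In practice, `track` would be one column read from a chromosome's dataset, and the TSS positions would come from `hg38_annot` grouped by `chromosome_name`. Whether that read fits in memory depends on the file layout, which is not checked here.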


In [92]:
gex_df = pd.DataFrame({"enformer": CAGE_predictions}, index=hg38_annot.index)
gex_df.head()

| ensembl_gene_id | enformer |
|---|---|
| ENSG00000142611 | 3.03058 |
| ENSG00000157911 | 4.573285 |
| ENSG00000142655 | 7.280577 |
| ENSG00000149527 | 0.022155 |
| ENSG00000171621 | 3.718518 |


## Join with gene expression in GTEx brain cortex tissue

In [93]:
gtex_tpm = pd.read_csv("gene_tpm_2017-06-05_v8_brain_cortex.gct.gz", header=2, sep="\t")
gtex_tpm['Name'] = gtex_tpm['Name'].apply(lambda gene: gene.split('.')[0])
gtex_tpm.set_index('Name', inplace=True)
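
Stripping the version suffix can make distinct rows collapse to the same gene ID, for instance if the file contains IDs carrying an extra tag after the version (some GENCODE releases mark pseudoautosomal copies with a `_PAR_Y` suffix). The sketch below shows one way to check for that before using the stripped IDs as an index. The IDs and version numbers here are made up for illustration, and whether this GTEx file contains such cases is an assumption that has not been verified.

```python
import pandas as pd

# Hypothetical versioned IDs; the version numbers and the _PAR_Y row are illustrative
names = pd.Series([
    "ENSG00000142611.16",
    "ENSG00000157911.9",
    "ENSG00000157911.9_PAR_Y",
])

# Vectorized equivalent of gene.split('.')[0]
stripped = names.str.split(".").str[0]

# IDs that appear more than once after stripping
dupes = stripped[stripped.duplicated(keep=False)]
print(dupes.unique())
```

If duplicates turn up, `gtex_tpm.loc[gene_list]` would return more than one row for those genes, so they may need to be dropped or aggregated first.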

In [94]:
gene_list = gex_df.index.intersection(gtex_tpm.index)
gtex_tpm = gtex_tpm.loc[gene_list]
print(gtex_tpm.shape[0], "genes in both GTEx and BioMart datasets")

19152 genes in both GTEx and BioMart datasets


In [95]:
# Calculate average gene expression (TPM) across GTEx samples
gtex_mean_tpm = gtex_tpm.drop(columns=['id', 'Description']).mean(axis=1)
gtex_mean_tpm.name = 'gtex'

In [96]:
# Join observed gene expression with Enformer CAGE predictions
gex_df = gex_df.merge(gtex_mean_tpm, left_index=True, right_index=True, how='inner').dropna()
gex_df.head()


| ensembl_gene_id | enformer | gtex |
|---|---|---|
| ENSG00000000003 | 7.037427 | 4.587454 |
| ENSG00000000005 | 1.203017 | 0.20888 |
| ENSG00000000419 | 12.208982 | 26.515365 |
| ENSG00000000457 | 2.586245 | 2.538695 |
| ENSG00000000460 | 1.595694 | 0.840966 |
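
To make the averaging and join steps concrete, here is a small sketch on toy data that mirrors the calls above: drop the non-sample columns, take the per-gene mean across samples, then inner-join on the gene index. All gene names and values are made up for illustration.

```python
import pandas as pd

# Toy predictions and toy TPM matrix (illustrative values only)
pred = pd.DataFrame({"enformer": [1.0, 2.0, 3.0]}, index=["geneA", "geneB", "geneC"])
tpm = pd.DataFrame(
    {"Description": ["A", "B", "D"], "sample1": [10.0, 0.0, 5.0], "sample2": [20.0, 4.0, 7.0]},
    index=["geneA", "geneB", "geneD"],
)

# Mean TPM per gene across sample columns only
mean_tpm = tpm.drop(columns=["Description"]).mean(axis=1)
mean_tpm.name = "gtex"  # the Series name becomes the column name after the merge

# The inner join keeps only genes present in both tables
joined = pred.merge(mean_tpm, left_index=True, right_index=True, how="inner").dropna()
print(joined)
```

Here `geneC` and `geneD` are dropped because each appears in only one table, which is the same reason the joined table above can have fewer rows than either input.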
