---
title: Correlation between human reference CAGE predictions and mean expression across GTEx genes
author: Sabrina Mi
date: 8/22/23
---

We learned that pieces of the hg38 reference epigenome are missing, including all gene regions on chromosomes 5, 6, and 7, and a handful (27 out of 1018) on chromosome 12. In total, 2849 of the 19625 queried TSS regions returned NaN. Based on the genes we could query, the Pearson correlation between expression predicted from the human reference genome and mean expression in GTEx brain cortex tissue is 0.536, which is lower than we expected, even allowing for the missing genes.

We will definitely need to rerun Enformer on chromosomes 5-7, but I'm still stuck on how to debug the low correlation. I double-checked that the FASTA file and the gene annotation were both in hg38.

In [69]:
import h5py
import pandas as pd
import numpy as np

enfref_dir = "/grand/TFXcan/imlab/users/lvairus/reftile_project/enformer-reference-epigenome"

def query_epigenome(chr_num, center_bp, num_bins=3, tracks=-1):
    """
    Query Enformer reference predictions around a genomic position.

    Reads from the module-level `enfref_dir`, the directory containing the
    concatenated reference enformer files (chr{chr_num}_cat.h5).

    Parameters:
        chr_num (int/string): chromosome number
        center_bp (int): center base pair position (1-indexed)
        num_bins (int): number of 128 bp bins to extract centered around center_bp (default: 3)
            note: if the number of bins is even, the center bin will be in the second half of the array
        tracks (int list): list of tracks to extract (default: -1, meaning all 5313 tracks)

    Returns:
        epigen (np.array): enformer predictions centered at center_bp of shape (num_bins, len(tracks))
    """

    # from position choose center bin
    center_ind = center_bp - 1
    center_bin = center_ind // 128
    
    half_bins = num_bins // 2
    start_bin = center_bin - half_bins
    end_bin = center_bin + half_bins
    if num_bins % 2 != 0: # if num_bins is odd
        end_bin += 1

    with h5py.File(f"{enfref_dir}/chr{chr_num}_cat.h5", "r") as f:
        # get tracks if list provided
        if tracks == -1:
            epigen = f[f'chr{chr_num}'][start_bin:end_bin, :] 
        else:
            epigen = f[f'chr{chr_num}'][start_bin:end_bin, tracks] 

    return epigen
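
The bin arithmetic in `query_epigenome` can be checked without opening any HDF5 file. The sketch below reproduces only the index calculation; `bin_window` and `BIN_SIZE` are hypothetical names introduced here, and the 128 bp bin width is taken from the `// 128` in the function above.

```python
BIN_SIZE = 128  # bp per output bin, matching the `// 128` in query_epigenome


def bin_window(center_bp, num_bins=3, bin_size=BIN_SIZE):
    """Return the half-open bin range [start_bin, end_bin) around a 1-indexed position."""
    center_bin = (center_bp - 1) // bin_size  # convert to 0-indexed, then to a bin index
    half_bins = num_bins // 2
    start_bin = center_bin - half_bins
    end_bin = start_bin + num_bins  # same result as the odd/even branch in query_epigenome
    return start_bin, end_bin


# PRDM16 TSS from the annotation table (chr1:3069203)
print(bin_window(3069203, num_bins=1))  # (23978, 23979)
print(bin_window(3069203, num_bins=4))  # (23976, 23980)
```

With an even `num_bins`, the center bin sits at offset `num_bins // 2`, which is the first position of the second half of the returned array.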

In [70]:
hg38_annot = pd.read_csv("/home/s1mi/enformer_rat_data/annotation/hg38.protein_coding_TSS.txt", header=0, sep="\t", index_col='ensembl_gene_id')
hg38_annot.head()

ensembl_gene_id,external_gene_name,chromosome_name,transcription_start_site
ENSG00000142611,PRDM16,1,3069203
ENSG00000157911,PEX10,1,2412564
ENSG00000142655,PEX14,1,10474950
ENSG00000149527,PLCH2,1,2476289
ENSG00000171621,SPSB1,1,9292894


In [83]:
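# Query the single 128 bp bin containing each TSS on track 4980, skipping chrY
gene_list = hg38_annot[hg38_annot["chromosome_name"] != "Y"].index
CAGE_predictions = []
invalid_queries = []
for gene in gene_list:
    chrom = hg38_annot.loc[gene, 'chromosome_name']  # avoid shadowing the built-in chr()
    tss = hg38_annot.loc[gene, 'transcription_start_site']
    bins = query_epigenome(chrom, tss, num_bins=1, tracks=[4980])
    # Record TSS regions whose prediction is missing (NaN) in the reference epigenome
    if np.any(np.isnan(bins)):
        invalid_queries.append(f"{chrom}:{tss}")
    CAGE_predictions.append(np.average(bins))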

In [84]:
print(len(invalid_queries), "out of", len(CAGE_predictions), "TSS regions missing from reference epigenome")

2849 out of 19625 TSS regions missing from reference epigenome


In [85]:
from collections import Counter
print("Number of missing regions by chromosome:")
print(Counter(map(lambda x: x.split(":")[0], invalid_queries)))

Number of missing regions by chromosome:
Counter({'6': 1031, '7': 919, '5': 872, '12': 27})
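
Since only 27 regions are missing on chromosome 12, it may help to see whether they cluster in one stretch of the chromosome. The sketch below summarizes the `"chrom:tss"` strings in the format used by `invalid_queries`; `summarize_missing` is a hypothetical helper, and the example entries are made up for illustration.

```python
def summarize_missing(invalid_queries, chrom):
    """Return (count, min_tss, max_tss) for the missing queries on one chromosome."""
    positions = sorted(
        int(query.split(":")[1])
        for query in invalid_queries
        if query.split(":")[0] == chrom
    )
    if not positions:
        return 0, None, None
    return len(positions), positions[0], positions[-1]


# Hypothetical entries in the same "chrom:tss" format, not real results
example_queries = ["12:1000", "12:5000", "5:300", "12:2500"]
print(summarize_missing(example_queries, "12"))  # (3, 1000, 5000)
```

A narrow span would suggest a contiguous block missing from the chromosome 12 file, while a wide span would point to scattered gaps.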


In [90]:
print("Total number of genes on those chromosomes:")
print({key: Counter(hg38_annot['chromosome_name'])[key] for key in ["6", "7", "5", "12"]})

Total number of genes on those chromosomes:
{'6': 1031, '7': 919, '5': 872, '12': 1018}
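
Dividing the two tallies gives the fraction of genes missing per chromosome. This is a minimal sketch using the counts printed above; `missing_fraction` is a hypothetical helper name.

```python
from collections import Counter


def missing_fraction(missing_counts, total_counts):
    """Fraction of genes with a missing TSS region, per chromosome."""
    return {
        chrom: missing_counts[chrom] / total_counts[chrom]
        for chrom in missing_counts
    }


# Counts copied from the two outputs above
missing = Counter({'6': 1031, '7': 919, '5': 872, '12': 27})
totals = {'6': 1031, '7': 919, '5': 872, '12': 1018}

fractions = missing_fraction(missing, totals)
for chrom, frac in fractions.items():
    print(f"chr{chrom}: {frac:.1%} missing")
```

Chromosomes 5, 6, and 7 come out fully missing, while chromosome 12 is missing 27 of 1018 genes.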


In [75]:
gex_df = pd.DataFrame({"enformer": CAGE_predictions}, index=gene_list)
gex_df.head()

ensembl_gene_id,enformer
ENSG00000142611,4.474399
ENSG00000157911,6.886287
ENSG00000142655,21.222824
ENSG00000149527,0.027461
ENSG00000171621,10.673162


In [76]:
gtex_tpm = pd.read_csv("/home/s1mi/enformer_rat_data/expression_data/gene_tpm_2017-06-05_v8_brain_cortex.gct.gz", header=2, sep="\t")
gtex_tpm['Name'] = gtex_tpm['Name'].apply(lambda gene: gene.split('.')[0])
gtex_tpm.set_index('Name', inplace=True)
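
Stripping the version suffix could in principle map two distinct GTEx rows to the same ID, for instance if an ID carries extra text after the version. The analysis above does not check for this. The sketch below shows one way to look for such collisions; `strip_version` and `duplicated_after_stripping` are hypothetical helpers, and the example IDs are made up.

```python
from collections import Counter


def strip_version(gene_id):
    """Keep only the part of the ID before the first '.', as in the cell above."""
    return gene_id.split(".")[0]


def duplicated_after_stripping(gene_ids):
    """Return the stripped IDs that appear more than once."""
    counts = Counter(strip_version(gene_id) for gene_id in gene_ids)
    return sorted(gene_id for gene_id, n in counts.items() if n > 1)


# Hypothetical IDs for illustration only
example_ids = ["ENSG00000000001.1", "ENSG00000000001.1_PAR_Y", "ENSG00000000002.7"]
print(duplicated_after_stripping(example_ids))  # ['ENSG00000000001']
```

If the real `Name` column yields any duplicates, `gtex_tpm.loc[gene_list]` would return more than one row for those genes, so they would be worth resolving before taking the mean.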

In [77]:
gene_list = gex_df.index.intersection(gtex_tpm.index)
gtex_tpm = gtex_tpm.loc[gene_list]
print(gtex_tpm.shape[0], "genes in both GTEx and BioMart datasets")

19107 genes in both GTEx and BioMart datasets


In [80]:
# Calculate average gene expression
gtex_mean_tpm = gtex_tpm.drop(columns=['id', 'Description']).mean(axis=1)
gtex_mean_tpm.name = 'gtex'
# Join observed gene expression with Enformer-predicted CAGE expression
merged_gex_df = gex_df.merge(gtex_mean_tpm, left_index=True, right_index=True, how='inner').dropna()
merged_gex_df.head()

,enformer,gtex
ENSG00000000003,0.318157,4.587454
ENSG00000000005,2.456774,0.20888
ENSG00000000419,36.233582,26.515365
ENSG00000000457,6.147785,2.538695
ENSG00000000460,3.142622,0.840966


In [81]:
merged_gex_df.corr()

,enformer,gtex
enformer,1.0,0.535759
gtex,0.535759,1.0
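
`DataFrame.corr()` defaults to Pearson correlation on the raw values, so a few very highly expressed genes can dominate the estimate. One possible first check, which the analysis above did not run, is to compare it against Pearson on log-transformed values and against Spearman rank correlation. The sketch below assumes a frame shaped like `merged_gex_df`; `compare_correlations` is a hypothetical helper and the toy values are made up.

```python
import numpy as np
import pandas as pd


def compare_correlations(df, pred_col="enformer", obs_col="gtex"):
    """Correlate predicted and observed expression on three scales."""
    pair = df[[pred_col, obs_col]].dropna()
    logged = np.log1p(pair)  # log(1 + x), defined at zero expression
    return {
        "pearson_raw": pair.corr(method="pearson").loc[pred_col, obs_col],
        "pearson_log1p": logged.corr(method="pearson").loc[pred_col, obs_col],
        "spearman": pair.corr(method="spearman").loc[pred_col, obs_col],
    }


# Toy values for illustration only, not real predictions or GTEx data
toy = pd.DataFrame({
    "enformer": [0.1, 1.0, 10.0, 100.0, 1000.0],
    "gtex": [0.2, 0.9, 12.0, 90.0, 5000.0],
})
results = compare_correlations(toy)
for name, value in results.items():
    print(f"{name}: {value:.3f}")
```

If the log-scale or rank-based values on the real data differ substantially from 0.536, the choice of scale is part of the picture. If all three agree, the cause more likely lies elsewhere, for example in the track choice or the TSS coordinates.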
