---
title: Correlation between Enformer reference CAGE prediction and GTEx brain tissue expression across genes
author: Sabrina Mi
date: 8/22/23
descriptions: We split genes by chromosome to parallelize queries of the predicted human reference epigenome for CAGE predictions, then compared them with observed mean gene expression in GTEx individuals. The lowest per-chromosome correlation is 0.241 (chromosome 18) and the highest is 0.729 (chromosome 21). The correlation across all genes is 0.540.
---

## Collect predicted and observed reference gene expression for each chromosome

In [1]:
import h5py
import pandas as pd
import numpy as np
import parsl
from parsl import python_app
from parsl.configs.local_threads import config
parsl.load(config)

enfref_dir = "/grand/TFXcan/imlab/users/lvairus/reftile_project/enformer-reference-epigenome"

def query_epigenome(chr_num, center_bp, num_bins=3, tracks=-1):
    """
    Extract Enformer reference predictions for a window of bins around a position.

    Reads from the module-level `enfref_dir`, the directory containing the
    concatenated reference enformer files.

    Parameters:
        chr_num (int/string): chromosome number
        center_bp (int): center base pair position (1-indexed)
        num_bins (int): number of 128-bp bins to extract centered around center_bp (default: 3)
            note: if the number of bins is even, the center bin will be in the second half of the array
        tracks (int list): list of tracks to extract (default: -1, meaning all 5313 tracks)

    Returns:
        epigen (np.array): enformer predictions centered at center_bp of shape (num_bins, len(tracks))
    """

    # from position choose center bin
    center_ind = center_bp - 1
    center_bin = center_ind // 128
    
    half_bins = num_bins // 2
    start_bin = center_bin - half_bins
    end_bin = center_bin + half_bins
    if num_bins % 2 != 0: # if num_bins is odd
        end_bin += 1

    with h5py.File(f"{enfref_dir}/chr{chr_num}_cat.h5", "r") as f:
        # get tracks if list provided
        if tracks == -1:
            epigen = f[f'chr{chr_num}'][start_bin:end_bin, :] 
        else:
            epigen = f[f'chr{chr_num}'][start_bin:end_bin, tracks] 

    return epigen
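
The bin arithmetic in `query_epigenome` can be checked without opening the HDF5 files. The following is a minimal sketch: `bin_window` and `bin_size` are hypothetical names introduced here, and the logic mirrors the function above under the assumption that bin 0 covers positions 1 to 128.

```python
def bin_window(center_bp, num_bins=3, bin_size=128):
    """Return (start_bin, end_bin) for a half-open slice of bins around center_bp."""
    # Convert the 1-indexed position to a 0-indexed bin number
    center_bin = (center_bp - 1) // bin_size
    half_bins = num_bins // 2
    start_bin = center_bin - half_bins
    # An odd window needs one extra bin so the slice has num_bins rows
    end_bin = center_bin + half_bins + (num_bins % 2)
    return start_bin, end_bin

print(bin_window(1000))              # (6, 9): bins 6, 7, 8
print(bin_window(1000, num_bins=4))  # (5, 9): bins 5, 6, 7, 8
```

Position 1000 falls in bin 7, so the default three-bin window is centered on it. With this arithmetic, a position inside bin 0 gives a negative `start_bin`, so a TSS that close to the chromosome start may need separate handling.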

In [2]:
## Create lists of genes on each chromosome
hg38_annot = pd.read_csv("/home/s1mi/enformer_rat_data/annotation/hg38.protein_coding_TSS.txt", header=0, sep="\t")
gene_dict = hg38_annot.groupby('chromosome_name')['ensembl_gene_id'].apply(list).to_dict()

In [3]:
## Initialize lists of CAGE predictions for each gene
hg38_annot.set_index("ensembl_gene_id", inplace=True)
chr_list = ["1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12", "13", "14", "15", "16", "17", "18", "19", "20", "21", "22", "X"]
CAGE_predictions = {key: [] for key in chr_list}

In [4]:
## Function to put CAGE predictions in list (in the same order as gene list by chromosome)
@python_app
def query_genes(chr):
    gene_list = gene_dict[chr]
    for gene in gene_list:
        bins = query_epigenome(chr, hg38_annot.loc[gene]['transcription_start_site'], tracks=[4980])
        CAGE_predictions[chr].append(np.average(bins))

In [5]:
app_futures = []
## Collect CAGE predictions in parallel across chromosomes (one app per chromosome)
for chr in chr_list:
    app_futures.append(query_genes(chr))
## Wait for all chromosomes to finish
exec_futures = [q.result() for q in app_futures]
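
The cells above rely on each app appending to the shared `CAGE_predictions` dictionary, which works because the local threads configuration keeps everything in one process. An alternative pattern is to have each task return its list and collect the results from the futures. The sketch below illustrates that pattern with the standard library's `ThreadPoolExecutor` as a stand-in for Parsl; `toy_gene_dict`, `toy_query`, and `query_chromosome` are hypothetical names, and the per-gene lookup is a placeholder, not the Enformer query.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for gene_dict
toy_gene_dict = {"1": ["geneA", "geneB"], "2": ["geneC"]}

def toy_query(gene):
    # Placeholder for the real per-gene CAGE lookup
    return float(len(gene))

def query_chromosome(chrom):
    # Return the values instead of appending to a shared dictionary
    return [toy_query(gene) for gene in toy_gene_dict[chrom]]

with ThreadPoolExecutor(max_workers=2) as pool:
    futures = {chrom: pool.submit(query_chromosome, chrom) for chrom in toy_gene_dict}
    # .result() blocks until the corresponding task has finished
    toy_predictions = {chrom: fut.result() for chrom, fut in futures.items()}

print(toy_predictions)
```

Returning values keeps each task free of side effects, which would matter if the work were moved to an executor that does not share memory with the notebook process.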

## Count missing genes in reference Enformer

We are missing TSS CAGE values for only 27 genes, all on chromosome 12.

In [8]:
for chr in chr_list:
    n_missing = np.sum(np.isnan(np.array(CAGE_predictions[chr])))
    if n_missing > 0:
        print("Chromosome", chr, "missing", n_missing, "genes in the predicted human reference epigenome")

Chromosome 12 missing 27 genes in the predicted human reference epigenome
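
The count above treats a NaN prediction as a missing gene. The sketch below reproduces that count on toy values and shows one way a NaN could arise: averaging an empty array, which is what a slice would be if it returned no rows. Whether that is the cause on chromosome 12 is an assumption, not something the output confirms; `toy_predictions` is a hypothetical name.

```python
import warnings
import numpy as np

# Toy values only; not real predictions
toy_predictions = {"1": [0.5, 1.2], "12": [0.3, float("nan"), float("nan")]}

# Count NaN entries per chromosome, as in the cell above
missing = {c: int(np.isnan(np.asarray(v)).sum()) for c, v in toy_predictions.items()}
print(missing)

# Averaging an array with no elements yields NaN (NumPy also emits a RuntimeWarning)
with warnings.catch_warnings():
    warnings.simplefilter("ignore", RuntimeWarning)
    empty_mean = np.average(np.empty((0, 1)))
print(np.isnan(empty_mean))
```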


In [9]:
merged_dict = {chr: pd.DataFrame(index=gene_dict[chr]) for chr in chr_list}
for chr in chr_list:
    merged_dict[chr]['enformer'] = CAGE_predictions[chr]

In [None]:
all_predicted = pd.concat(merged_dict.values())
all_predicted.to_csv("/home/s1mi/enformer_rat_data/output/hg38_predicted_expression.csv")

In [10]:
gtex_tpm = pd.read_csv("/home/s1mi/enformer_rat_data/expression_data/gene_tpm_2017-06-05_v8_brain_cortex.gct.gz", header=2, sep="\t")
gtex_tpm['Name'] = gtex_tpm['Name'].apply(lambda gene: gene.split('.')[0])
gtex_tpm.set_index('Name', inplace=True)
# Calculate average gene expression
gtex_mean_tpm = gtex_tpm.drop(columns=['id', 'Description']).mean(axis=1)
gtex_mean_tpm.name = 'gtex'
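
The GTEx preprocessing above does two things: it strips the version suffix from each gene ID so the IDs match the annotation, and it averages TPM across individuals. A minimal sketch on a toy table follows; the gene IDs, sample columns, and values are made up for illustration.

```python
import pandas as pd

# Toy table with made-up IDs and TPM values for two samples
toy_tpm = pd.DataFrame({
    "Name": ["ENSGTOY0001.14", "ENSGTOY0002.5"],
    "Description": ["toyA", "toyB"],
    "sample1": [10.0, 0.0],
    "sample2": [20.0, 2.0],
})

# Drop the ".version" suffix so IDs match unversioned Ensembl gene IDs
toy_tpm["Name"] = toy_tpm["Name"].apply(lambda gene: gene.split(".")[0])
toy_tpm = toy_tpm.set_index("Name")

# Average across sample columns (axis=1) to get one value per gene
toy_mean = toy_tpm.drop(columns=["Description"]).mean(axis=1)
toy_mean.name = "gtex"
print(toy_mean)
```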

In [11]:
for chr in chr_list:
    gene_list = merged_dict[chr].index.intersection(gtex_tpm.index)
    merged_dict[chr] = merged_dict[chr].loc[gene_list]
    merged_dict[chr] = merged_dict[chr].merge(gtex_mean_tpm, left_index=True, right_index=True, how='inner').dropna()
    print(merged_dict[chr].shape[0], "genes on chromosome", chr, "to be used in correlation test")

1977 genes on chromosome 1 to be used in correlation test
1201 genes on chromosome 2 to be used in correlation test
1031 genes on chromosome 3 to be used in correlation test
738 genes on chromosome 4 to be used in correlation test
858 genes on chromosome 5 to be used in correlation test
996 genes on chromosome 6 to be used in correlation test
878 genes on chromosome 7 to be used in correlation test
662 genes on chromosome 8 to be used in correlation test
748 genes on chromosome 9 to be used in correlation test
707 genes on chromosome 10 to be used in correlation test
1259 genes on chromosome 11 to be used in correlation test
965 genes on chromosome 12 to be used in correlation test
314 genes on chromosome 13 to be used in correlation test
587 genes on chromosome 14 to be used in correlation test
566 genes on chromosome 15 to be used in correlation test
815 genes on chromosome 16 to be used in correlation test
1133 genes on chromosome 17 to be used in correlation test
259 genes on chrom
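
The filtering in the cell above keeps a gene only if it has both a non-missing prediction and a GTEx value. A minimal sketch of the same inner join on toy data follows; `predicted` and `observed` are hypothetical names with made-up values.

```python
import pandas as pd

# g2 has a missing prediction; g3 has no GTEx value; g4 has no prediction
predicted = pd.DataFrame({"enformer": [0.8, float("nan"), 2.1]}, index=["g1", "g2", "g3"])
observed = pd.Series([5.0, 7.0, 9.0], index=["g1", "g2", "g4"], name="gtex")

# Inner join on the index keeps genes present in both, then dropna removes NaN rows
merged = predicted.merge(observed, left_index=True, right_index=True, how="inner").dropna()
print(merged)
```

Only `g1` survives here. Because the join is already an inner join on the index, it restricts the rows to shared genes on its own.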

In [12]:
for chr in chr_list:
    print("Chromosome", chr, "correlation:", merged_dict[chr].corr().iloc[0,1])

Chromosome 1 correlation: 0.6279205208330061
Chromosome 2 correlation: 0.4577084823841293
Chromosome 3 correlation: 0.5590473627720567
Chromosome 4 correlation: 0.6622514111261547
Chromosome 5 correlation: 0.5659772926746754
Chromosome 6 correlation: 0.5405769582022613
Chromosome 7 correlation: 0.4736588849210786
Chromosome 8 correlation: 0.6197347486783672
Chromosome 9 correlation: 0.6160610260212112
Chromosome 10 correlation: 0.7113057334708601
Chromosome 11 correlation: 0.5348642297249536
Chromosome 12 correlation: 0.6363452896594721
Chromosome 13 correlation: 0.4445465856319643
Chromosome 14 correlation: 0.35533005480427243
Chromosome 15 correlation: 0.664532588686653
Chromosome 16 correlation: 0.6644114213442329
Chromosome 17 correlation: 0.597201417137115
Chromosome 18 correlation: 0.24144231917848386
Chromosome 19 correlation: 0.5346556958916295
Chromosome 20 correlation: 0.6870184790969254
Chromosome 21 correlation: 0.7286380655413807
Chromosome 22 correlation: 0.52125580051563
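
`DataFrame.corr()` computes the Pearson correlation by default, and `.iloc[0,1]` picks the off-diagonal entry for a two-column frame. A minimal sketch on toy values, cross-checked against `np.corrcoef`, follows; the numbers are made up and chosen so the result can be verified by hand.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "enformer": [1.0, 2.0, 3.0, 4.0],
    "gtex": [2.0, 1.0, 4.0, 3.0],
})

# Off-diagonal entry of the 2x2 Pearson correlation matrix
r_pandas = toy.corr().iloc[0, 1]
# Same quantity computed directly with NumPy
r_numpy = np.corrcoef(toy["enformer"], toy["gtex"])[0, 1]
print(round(r_pandas, 6), round(r_numpy, 6))  # 0.6 0.6
```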

In [22]:
all_genes = pd.concat(merged_dict.values())
print("Correlation across all genes:")
all_genes.corr()

Correlation across all genes:


          enformer      gtex
enformer  1.000000  0.539788
gtex      0.539788  1.000000
