---
title: Querying Human Reference Epigenome
description: We first collected the transcription start sites (TSS) of all human genes whose orthologous rat gene has expression data. We then used Laura's tools to query CAGE tracks from genome-wide Enformer predictions on the reference genome.
date: 8/17/2023
author: Sabrina Mi
---

## Setup

In [2]:
import h5py
import pandas as pd


reference_dir = "/grand/TFXcan/imlab/users/lvairus/reftile_project/enformer-reference-epigenome"

def query_epigenome(path_to_enfref, chr_num, center_bp, num_bins=896, tracks=-1):
    """
    Extract Enformer reference predictions for a window of bins centered on a position.

    Parameters:
        path_to_enfref (str): path to the directory containing the concatenated reference Enformer files
        chr_num (int or str): chromosome number, without the "chr" prefix
        center_bp (int): center base pair position (1-indexed)
        num_bins (int): number of 128 bp bins to extract, centered on center_bp (default: 896)
            note: if num_bins is even, the center bin is the first bin of the second half of the array
        tracks (list of int): indices of the tracks to extract (default: -1, meaning all 5313 tracks)

    Returns:
        epigen (np.ndarray): Enformer predictions centered at center_bp, of shape (num_bins, len(tracks))
    """

    # from chr_num choose file
    filename = f"chr{chr_num}_cat.h5"

    # from position choose center bin
    center_ind = center_bp - 1
    center_bin = center_ind // 128
    
    # from bins choose number of bins
    half_bins = num_bins // 2
    start_bin = center_bin - half_bins
    end_bin = center_bin + half_bins
    if num_bins % 2 != 0: # if num_bins is odd
        end_bin += 1

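    # get bins (with all tracks); slicing the h5py dataset directly reads only
    # the requested rows instead of loading the whole chromosome into memory
    with h5py.File(f"{path_to_enfref}/{filename}", "r") as f:
        epigen = f[f'chr{chr_num}'][start_bin:end_bin] # np.array (num_bins, 5313)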

    # get tracks if list provided
    if tracks != -1:
        epigen = epigen[:, tracks] # np.array (num_bins, len(tracks))

    return epigen
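
To make the bin arithmetic in `query_epigenome` concrete, here is a minimal sketch that reproduces only the window calculation, with no file access. The helper name `bin_window` and its `bin_size` parameter are illustrative assumptions and not part of the original tools; the 128 bp bin width is taken from the function above.

```python
def bin_window(center_bp, num_bins=896, bin_size=128):
    """Return (start_bin, end_bin) for a window of num_bins centered on center_bp.

    Hypothetical helper that mirrors the index arithmetic in query_epigenome.
    """
    center_bin = (center_bp - 1) // bin_size  # 1-indexed position -> 0-indexed bin
    half_bins = num_bins // 2
    start_bin = center_bin - half_bins
    end_bin = center_bin + half_bins + (num_bins % 2)  # odd windows get one extra bin
    return start_bin, end_bin


# Default window around position 1,000,000, which falls in bin 7812
start, end = bin_window(1_000_000)
print(start, end, end - start)

# A center close to the chromosome start produces a negative start_bin
near_start = bin_window(1_000)
print(near_start)
```

A negative `start_bin` is not an error in Python-style slicing; it is interpreted relative to the end of the array. A query centered within 448 bins (about 57 kb) of a chromosome start may therefore come back empty or misaligned and would need separate handling.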

## Collect TSS

In [7]:
rn7_gene_list = pd.read_csv("reference_epigenome_predicted_vs_observed.csv", header=0, usecols=['gene_id'])
ortho_genes = pd.read_csv("ortholog_genes_rats_humans.tsv", header=0, sep="\t", usecols=['ensembl_gene_id', 'rnorvegicus_homolog_ensembl_gene'])
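
One plausible way to connect these two tables is an inner join on the rat gene ID, which keeps only human genes whose rat ortholog appears in the expression gene list. The sketch below uses toy data with made-up IDs, and it assumes that `gene_id` holds rat Ensembl IDs comparable to `rnorvegicus_homolog_ensembl_gene`; the names `rat_genes_toy`, `ortho_toy`, `matched`, and `human_genes` are hypothetical and may not reflect what the notebook does next.

```python
import pandas as pd

# Toy stand-ins for the two tables loaded above; all IDs are made up for illustration.
rat_genes_toy = pd.DataFrame({"gene_id": ["RAT_GENE_A", "RAT_GENE_B", "RAT_GENE_C"]})
ortho_toy = pd.DataFrame({
    "ensembl_gene_id": ["HUMAN_GENE_1", "HUMAN_GENE_2", "HUMAN_GENE_3"],
    "rnorvegicus_homolog_ensembl_gene": ["RAT_GENE_A", "RAT_GENE_B", None],
})

# Assumption: gene_id and rnorvegicus_homolog_ensembl_gene use the same ID scheme.
matched = (
    ortho_toy.dropna(subset=["rnorvegicus_homolog_ensembl_gene"])  # drop genes with no rat ortholog
    .merge(
        rat_genes_toy,
        left_on="rnorvegicus_homolog_ensembl_gene",
        right_on="gene_id",
        how="inner",  # keep only orthologs present in the rat gene list
    )
    .drop_duplicates(subset=["ensembl_gene_id", "gene_id"])
    .reset_index(drop=True)
)

human_genes = matched["ensembl_gene_id"].tolist()
print(human_genes)
```

Ortholog tables can contain one-to-many mappings, so the number of matched human genes need not equal the number of rat genes; checking for duplicated IDs on either side after the join is a cheap sanity check.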
