---
title: EpigenomeXcan test on Br rats
author: Sabrina Mi
date: 7/10/2024
description: Calculate associations between predicted gene expression in adipose tissue and BMI to identify significant genes, while I figure out how to scale up the epigenome prediction step.
---

CPU times: 

1. Compute founders matrix (n_genes, 8)
    * 3190 genes: ~110s
2. Compute samples matrix (n_samples, n_genes, 8):
    * 3190 genes, 1 sample: ~4s
    * 3190 genes, 10 samples: ~40s

GPU times (combined steps):

* 3190 genes, 340 samples: 32 minutes
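For a rough comparison with the GPU run, the CPU timings above can be extrapolated to all 340 samples. This is a back-of-envelope sketch, not a measurement: it assumes the per-sample cost stays linear (which matches ~4s for 1 sample and ~40s for 10), and the variable names are made up for illustration.

```python
# Timings reported above, in seconds (3190 genes)
founders_matrix_s = 110   # step 1: founders matrix, computed once
per_sample_s = 4          # step 2: samples matrix, per sample
n_samples = 340

# Assumption: step 2 scales linearly with the number of samples
cpu_total_s = founders_matrix_s + per_sample_s * n_samples
cpu_total_min = cpu_total_s / 60

print(f"Extrapolated CPU time: {cpu_total_min:.1f} minutes")  # 24.5 minutes
```

If the linear assumption holds, the CPU estimate lands in the same range as the 32 minutes measured on GPU, so the matrix step itself is probably not what a GPU speeds up here.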

In [1]:
import pandas as pd
import numpy as np
import tensorflow as tf
import h5py
import os
import time
import bisect
columns = ['ACI', 'BN', 'BUF', 'F344', 'M520', 'MR', 'WKY', 'WN']

In [2]:
gene_annot = pd.read_csv("/eagle/AIHPC4Edu/sabrina/Br_predictions/HS_founder_epigenomes/gene_mapping.txt")
genes_by_chrom = gene_annot.groupby('chromosome')

In [3]:
probabilities_dir = "/eagle/AIHPC4Edu/sabrina/Br_genotype_probabilities"
reference_dir = "/eagle/AIHPC4Edu/sabrina/Br_predictions/HS_founder_epigenomes/human"
output_dir = "/eagle/AIHPC4Edu/sabrina/Br_prediction_from_founders"

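Realizing my haplotype probabilities storage is redundant: every sample dataset carries its own copy of the positions column, which takes up extra space. The cell below drops that column from each dataset and stores the positions once per file.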

In [24]:
#| code-fold: true
probabilities_file = f'{probabilities_dir}/chr2_probabilities.h5'
with h5py.File(probabilities_file, 'a') as hf:
    positions = None
    for dataset_name in list(hf.keys()):
        # Skip the shared positions vector and datasets that are already trimmed
        if dataset_name == 'positions' or hf[dataset_name].shape[1] == 8:
            continue
        elif hf[dataset_name].shape[1] == 9:
            # Read the dataset
            data = hf[dataset_name][:]
            positions = data[:, 0]  # Assuming the positions column is the first column
            new_data = data[:, 1:]  # All columns except the first one

            # Create a temporary dataset without the positions column
            temp_dataset_name = f"temp_{dataset_name}"
            hf.create_dataset(temp_dataset_name, data=new_data)

            # Delete the original dataset
            del hf[dataset_name]

            # Rename the temporary dataset to the original dataset name
            hf.move(temp_dataset_name, dataset_name)

    # Store the positions vector once per file, as its own dataset
    if positions is not None and 'positions' not in hf:
        hf.create_dataset('positions', data=positions)



In [25]:
with h5py.File(probabilities_file, 'r') as hf:
    print(len(hf.keys()))

340


1. Split genes by chromosome.
2. For each individual, query haplotype probabilities at each gene TSS, which returns an 8 x n_genes matrix. Batching by genes should keep n_genes in the 1-2K range. Stack across individuals to get a 3D array with dimensions n_samples x 8 x n_genes.
3. Query the reference epigenome at each gene [446:450, CAGE_index], which returns an 8 x n_genes matrix.
4. Matrix multiply with something like tf.transpose(tf.tensordot(_W, _X, axes=[[1],[1]]),[1,0,2]).

https://stackoverflow.com/questions/41870228/understanding-tensordot
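The contraction over founders can be checked on toy data before running it on real files. This is a minimal sketch using NumPy instead of TensorFlow, with random made-up inputs. It uses the (n_samples, n_genes, 8) and (n_genes, 8) layouts that the functions later in this notebook return, which differ from the 8 x n_genes orientation in the notes above.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_genes, n_founders = 3, 5, 8

# Hypothetical haplotype probabilities: each (sample, gene) row sums to 1 over founders
prob = rng.random((n_samples, n_genes, n_founders))
prob /= prob.sum(axis=2, keepdims=True)

# Hypothetical founder predictions, one value per (gene, founder)
ref = rng.random((n_genes, n_founders))

# Contract over the founder axis k: pred[i, j] = sum_k prob[i, j, k] * ref[j, k]
pred = np.einsum('ijk,jk->ij', prob, ref)

# Same result with an explicit loop, as a sanity check
expected = np.array([[prob[i, j] @ ref[j] for j in range(n_genes)]
                     for i in range(n_samples)])
assert np.allclose(pred, expected)
print(pred.shape)  # (3, 5)
```

Because each probability row sums to 1, every predicted value is a weighted average of the eight founder predictions for that gene.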

In [4]:
def reference_epigenome_matrix(chr, genes_df, individuals):
    reference_file = f'{reference_dir}/{chr}_genes.h5'
    with h5py.File(reference_file, 'r') as ref:
        rows = []
        for gene in genes_df['gene']:
            # Founder predictions at bins 446:450 of track index 5278 -> shape (8, 4)
            founder_predictions = ref[gene][:, 446:450, 5278]
            rows.append(founder_predictions)
        # Stack genes -> (n_genes, 8, 4), then average over the 4 bins -> (n_genes, 8)
        ref_matrix = np.stack(rows, axis=0)
        ref_tensor = tf.reduce_mean(tf.convert_to_tensor(ref_matrix, dtype=tf.float32), axis=2)
        return ref_tensor

In [5]:
def probabilities_matrix(chr, genes_df, individuals):
    probabilities_file = f'{probabilities_dir}/{chr}_probabilities.h5'
    with h5py.File(probabilities_file, 'r') as prob:
        positions = prob['positions'][:]
        # Index of the first marker at or after each TSS, clamped so that both
        # flanking markers (index-1 and index) exist at the chromosome ends
        indices = [min(max(bisect.bisect_left(positions, tss), 1), len(positions) - 1)
                   for tss in genes_df['tss']]
        population_prob = []
        for sample in individuals:
            # Average the founder probabilities at the two markers flanking each TSS
            rows = [(prob[sample][i-1,:] + prob[sample][i,:]) / 2 for i in indices]
            population_prob.append(np.vstack(rows))
        # Shape: (n_samples, n_genes, 8)
        prob_matrix = np.stack(population_prob, axis=0)
        prob_tensor = tf.convert_to_tensor(prob_matrix, dtype = tf.float32)
        return prob_tensor
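The lookup in `probabilities_matrix` averages the probabilities at the two markers flanking each TSS. Here is a minimal sketch of that step on toy data. The positions, the two-founder probability table, and the helper name `flanking_average` are all made up for illustration, and clamping the index at the ends is one possible way to handle a TSS outside the marker range.

```python
import bisect
import numpy as np

# Hypothetical sorted marker positions and founder probabilities
# (2 founders instead of 8, to keep the example readable)
positions = np.array([100, 200, 300, 400])
probs = np.array([[1.0, 0.0],
                  [0.5, 0.5],
                  [0.0, 1.0],
                  [0.0, 1.0]])

def flanking_average(tss):
    """Average the probability rows of the markers on either side of tss."""
    index = bisect.bisect_left(positions, tss)
    # Clamp so that index-1 and index are both valid rows
    index = min(max(index, 1), len(positions) - 1)
    return (probs[index - 1] + probs[index]) / 2

print(flanking_average(250))  # [0.25 0.75], between markers 200 and 300
print(flanking_average(50))   # [0.75 0.25], before the first marker, uses the first two
```

Without the clamp, a TSS before the first marker gives `index - 1 == -1`, which silently wraps around to the last marker, and a TSS past the last marker raises an `IndexError`.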

In [6]:
def predict_epigenome(chr, genes_df, individuals, output_file):
    ref_tensor = reference_epigenome_matrix(chr, genes_df, individuals)
    prob_tensor = probabilities_matrix(chr, genes_df, individuals)
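    # Weight each founder's prediction by the sample's haplotype probability and sum over founders:
    # (n_samples, n_genes, 8) x (n_genes, 8) -> (n_samples, n_genes)
    epigenome_tensor = tf.einsum('ijk,jk->ij', prob_tensor, ref_tensor)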
    epigenome_df = pd.DataFrame(epigenome_tensor.numpy(), columns=genes_df['gene'], index = individuals)
    epigenome_df.to_csv(output_file)


In [8]:
with open('/eagle/AIHPC4Edu/sabrina/Br_genotype_probabilities/individuals.txt', 'r') as f:
    individuals = f.read().splitlines()
#pheno = pd.read_csv('/home/s1mi/enformer_rat_data/phenotypes/pheno.fam', sep = '\t', index_col = 'IID')

In [16]:
genes_df = genes_by_chrom.get_group('chr2')
predict_epigenome('chr2', genes_df, individuals, f'{output_dir}/chr2_adipose_predict.txt')

KeyError: "Unable to open object (object 'positions' doesn't exist)"

In [9]:
for chr, genes_df in genes_by_chrom:
    predict_epigenome(chr, genes_df, individuals, f'{output_dir}/{chr}_adipose_predict.txt')
    break
