---
title: Running personalized Enformer pipeline on Br rats for many more genes
date: 8/31/2023
author: Sabrina Mi
---

## Selecting Genes

We are aiming for ~1,000 genes at the end of the selection process.

In [1]:
import pandas as pd
import numpy as np

In [16]:
tpm = pd.read_csv("/home/s1mi/enformer_rat_data/expression_data/Brain.rn7.expr.tpm.bed", header = 0, sep="\t",  index_col= 'gene_id')
tpm_var = tpm.iloc[:, 3:].var(axis=1)
tpm_threshold = tpm_var.quantile(0.8)
## subset of genes with high observed expression variation
high_tpm_var_genes = set(tpm[tpm_var> tpm_threshold].index)


In [9]:
eqtl = pd.read_csv("/home/s1mi/enformer_rat_data/annotation/Brain.rn7.cis_qtl_signif.txt", sep="\t")
counts = eqtl['gene_id'].value_counts()
counts.describe()

count    11238.000000
mean      2312.935398
std       1490.079008
min          1.000000
25%       1160.250000
50%       2236.000000
75%       3277.750000
max      10799.000000
Name: gene_id, dtype: float64

In [17]:
eqtl_threshold = counts.quantile(0.8)
eqtl_genes = counts[counts > eqtl_threshold].index

In [18]:
gene_list = high_tpm_var_genes.intersection(set(eqtl_genes))
print(len(gene_list), "candidate genes for enformer prediction experiments")

868 candidate genes for enformer prediction experiments
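
The selection above keeps genes that fall in the top 20% for both expression variance and number of significant eQTLs. Below is a minimal sketch of the same two-filter intersection on made-up data. The helper `top_fraction` and the toy series are hypothetical and only illustrate the logic; they are not part of the pipeline.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
genes = [f"gene{i}" for i in range(100)]

# Hypothetical per-gene statistics standing in for TPM variance and eQTL counts
toy_var = pd.Series(rng.gamma(2.0, 1.0, size=100), index=genes)
toy_counts = pd.Series(rng.permutation(100) + 1, index=genes)

def top_fraction(values, q=0.8):
    """Return the index labels whose value is strictly above the q-th quantile."""
    return set(values[values > values.quantile(q)].index)

high_var = top_fraction(toy_var)      # analogous to high_tpm_var_genes
many_eqtl = top_fraction(toy_counts)  # analogous to eqtl_genes
selected = high_var & many_eqtl       # analogous to gene_list
print(len(high_var), len(many_eqtl), len(selected))
```

Lowering `q` in either filter is the simplest way to move the final count closer to the ~1,000 target, since each filter keeps roughly the top `1 - q` fraction of its own gene set.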


## Run the Pipeline

### Write Metadata

In [19]:
## write intervals.txt
annot_df = pd.read_csv('/home/s1mi/enformer_rat_data/annotation/rn7.gene.txt', sep ='\t',  index_col='geneId')

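def write_intervals(gene_list, file):
    # 'a' appends, so rerunning this cell adds duplicate lines to an existing file
    with open(file, 'a') as f:
        for gene in gene_list:
            gene_annot = annot_df.loc[gene]
            tss = gene_annot['tss']
            # one line per gene: chr<chromosome>_<tss>_<tss>
            interval = f"chr{gene_annot['chromosome']}_{tss}_{tss}"
            f.write(interval + '\n')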


In [21]:
with open("gene_list.txt", "w") as f:
    f.write("\n".join(gene_list))
write_intervals(gene_list, "metadata/intervals.txt")
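
Because `write_intervals` opens its file in append mode, a quick sanity check before submitting jobs can catch duplicated or malformed lines. This is a rough sketch under the assumption that every line follows the `chr<chromosome>_<tss>_<tss>` pattern written above. The helpers `format_interval`, `parse_interval`, and `check_intervals` are hypothetical names introduced here, and the toy intervals are made up.

```python
def format_interval(chromosome, tss):
    """Build an interval string in the same pattern as write_intervals."""
    return f"chr{chromosome}_{tss}_{tss}"

def parse_interval(interval):
    """Split 'chrN_start_end' into (chromosome, start, end)."""
    chrom, start, end = interval.rsplit("_", 2)
    return chrom, int(start), int(end)

def check_intervals(lines, n_genes):
    """Return True if there is exactly one single-base interval per gene."""
    parsed = [parse_interval(line.strip()) for line in lines if line.strip()]
    single_base = all(start == end for _, start, end in parsed)
    return len(parsed) == n_genes and single_base

# Made-up intervals for two genes
toy_lines = [format_interval(1, 1000), format_interval("X", 2500)]
print(check_intervals(toy_lines, 2))              # one line per gene
print(check_intervals(toy_lines + toy_lines, 2))  # file appended to twice
```

On the real files, the same check could be run as `check_intervals(open("metadata/intervals.txt"), len(gene_list))`.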

### Submit Jobs

```
module load conda
conda activate /lus/grand/projects/TFXcan/imlab/shared/software/conda_envs/enformer-predict-tools

python /home/s1mi/Github/enformer_epigenome_pipeline/enformer_predict.py --parameters /home/s1mi/Github/deep-learning-in-genomics/posts/2023-08-31-Br-personalized-prediction-on-more-genes/personalized_config.json
```