---
title: Selecting Rat genes for Enformer CAGE predictions
description: We are looking for genes with (1) variation in observed gene expression across individuals, (2) high prediction performance in elastic net training, and (3) multiple causal variants.
author: Sabrina Mi
date: 8/10/23
---

## Calculate gene expression variance across individuals

In [2]:
import pandas as pd
import numpy as np

In [3]:
## Read in expression data
tpm = pd.read_csv("/home/s1mi/enformer_rat_data/Brain.rn7.expr.tpm.bed", header = 0, sep="\t",  index_col= 'gene_id')
iqn = pd.read_csv("/home/s1mi/enformer_rat_data/Brain.rn7.expr.iqn.bed", header = 0, sep="\t",  index_col= 'gene_id')



In [6]:
tpm_var = tpm.iloc[:, 3:].var(axis=1)
np.average(tpm_var)

1003.8945129200853

In [7]:
iqn_var = iqn.iloc[:, 3:].var(axis=1)
np.average(iqn_var)

0.6306594759954833

We first subset to genes in the top decile for both TPM and IQN variance.

In [8]:
tpm_threshold = tpm_var.quantile(0.9)
iqn_threshold = iqn_var.quantile(0.9)
high_tpm_var_genes = set(tpm[tpm_var> tpm_threshold].index)
high_iqn_var_genes = set(iqn[iqn_var> iqn_threshold].index)
high_var_genes = high_tpm_var_genes.intersection(high_iqn_var_genes)
print(len(high_var_genes), "genes with high variance")

167 genes with high variance
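The top-decile filter above can be factored into a small helper and checked on toy data. This is a minimal sketch, not the notebook's own code: the function name `high_variance_ids` and the toy matrix are hypothetical, and the helper assumes the frame holds only sample columns (the notebook drops the first three columns with `.iloc[:, 3:]` before taking the variance).

```python
import pandas as pd

def high_variance_ids(expr: pd.DataFrame, q: float = 0.9) -> set:
    """Return index labels whose across-sample variance is strictly above the q-th quantile."""
    var = expr.var(axis=1)
    return set(var[var > var.quantile(q)].index)

# Toy matrix: 10 hypothetical genes x 4 samples; gene i has values i, 2i, 3i, 4i,
# so variance grows with i and only the last gene clears the 0.9 quantile.
toy = pd.DataFrame(
    {f"s{j}": [i * j for i in range(10)] for j in range(1, 5)},
    index=[f"gene{i}" for i in range(10)],
)
top = high_variance_ids(toy)
print(top)
```

Applying the same helper to the TPM and IQN sample columns and intersecting the two sets would mirror the cell above. Because the comparison is strict (`>`), genes sitting exactly at the threshold are excluded.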


## Count eQTLs

In [10]:
eqtl = pd.read_csv("/home/s1mi/enformer_rat_data/annotation/Brain.rn7.cis_qtl_signif.txt", sep="\t")
eqtl.head()

Unnamed: 0,gene_id,variant_id,tss_distance,af,ma_samples,ma_count,pval_nominal,slope,slope_se,pval_nominal_threshold
0,ENSRNOG00000050129,chr1:2002359,695174,0.433432,223,294,0.0015,0.129848,0.04054,0.006989
1,ENSRNOG00000050129,chr1:2002361,695176,0.433432,223,294,0.0015,0.129848,0.04054,0.006989
2,ENSRNOG00000050129,chr1:2002408,695223,0.433432,223,294,0.0015,0.129848,0.04054,0.006989
3,ENSRNOG00000050129,chr1:2002450,695265,0.433432,223,294,0.0015,0.129848,0.04054,0.006989
4,ENSRNOG00000050129,chr1:2002464,695279,0.433432,223,294,0.0015,0.129848,0.04054,0.006989


In [11]:
counts = eqtl['gene_id'].value_counts()

In [12]:
counts.describe()

count    11238.000000
mean      2312.935398
std       1490.079008
min          1.000000
25%       1160.250000
50%       2236.000000
75%       3277.750000
max      10799.000000
Name: gene_id, dtype: float64

In [13]:
eqtl_threshold = counts.quantile(0.9)
eqtl_genes = counts[counts > eqtl_threshold].index

In [14]:
print(eqtl_genes)

Index(['ENSRNOG00000031024', 'ENSRNOG00000000451', 'ENSRNOG00000032708',
       'ENSRNOG00000000455', 'ENSRNOG00000021507', 'ENSRNOG00000009389',
       'ENSRNOG00000066838', 'ENSRNOG00000000432', 'ENSRNOG00000039396',
       'ENSRNOG00000002232',
       ...
       'ENSRNOG00000016364', 'ENSRNOG00000008471', 'ENSRNOG00000043350',
       'ENSRNOG00000012337', 'ENSRNOG00000005248', 'ENSRNOG00000068325',
       'ENSRNOG00000012868', 'ENSRNOG00000068200', 'ENSRNOG00000005610',
       'ENSRNOG00000008356'],
      dtype='object', length=1124)
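The rows printed by `eqtl.head()` share identical `af`, `pval_nominal`, and `slope` values, which suggests that many significant variants for a gene may be in tight linkage with each other, so the raw row count can overstate the number of distinct signals. As a rough illustration only, the sketch below collapses rows with identical summary statistics within a gene. This is a crude proxy chosen for the example, not the method used in this post and not a substitute for fine-mapping; the toy table and its values are made up.

```python
import pandas as pd

# Toy table mimicking a few columns of the eQTL file; all values are invented.
toy_eqtl = pd.DataFrame({
    "gene_id": ["geneA"] * 4 + ["geneB"] * 2,
    "variant_id": ["chr1:100", "chr1:105", "chr1:230", "chr1:231", "chr2:50", "chr2:90"],
    "pval_nominal": [0.0015, 0.0015, 0.0002, 0.0002, 0.003, 0.003],
    "slope": [0.13, 0.13, -0.21, -0.21, 0.08, 0.08],
})

# Raw count of significant variants per gene, as in the cell above.
raw_counts = toy_eqtl["gene_id"].value_counts()

# Count groups of variants with identical statistics once per gene.
distinct_signals = (
    toy_eqtl.drop_duplicates(subset=["gene_id", "pval_nominal", "slope"])
    .groupby("gene_id")
    .size()
)
print(raw_counts.to_dict(), distinct_signals.to_dict())
```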


In [15]:
gene_list = high_var_genes.intersection(set(eqtl_genes))
print(len(gene_list), "candidate genes for enformer prediction experiments")

17 candidate genes for enformer prediction experiments


## Check Elastic Net Prediction Performance

Now that we have a manageable number of genes, we can individually check that each one has sufficient prediction performance.

In [18]:
model_genes = pd.read_csv("/home/s1mi/Github/deep-learning-in-genomics/posts/2023-08-08-running-enformer-on-rat-genes-at-TSS/highestR2genes.csv", header=0, index_col="gene")
model_genes.loc[[gene for gene in gene_list if gene in model_genes.index]]

gene,genename,pred.perf.R2,n.snps.in.model,pred.perf.pval,cor,pred.perf.qval
ENSRNOG00000009734,Akr1b8,0.050379,4,0.0006577101,0.224453,0.0004811327
ENSRNOG00000001311,Rab36,0.559618,2,6.036612999999999e-42,0.748076,7.693936e-41
ENSRNOG00000010079,Ca3,0.003307,2,0.3884876,0.057507,0.1439053
ENSRNOG00000028436,Rprml,0.228241,3,2.40791e-14,0.477745,5.222104e-14
ENSRNOG00000032908,Acaa1a,0.548779,2,9.394297e-41,0.740796,1.111821e-39
ENSRNOG00000050647,Hspa1b,0.078443,6,1.847947e-05,0.280077,1.706734e-05
ENSRNOG00000012235,Ppp1r17,0.3702,2,2.229702e-24,0.608441,9.825563999999999e-24
ENSRNOG00000048258,Cisd2,0.661901,2,6.783282e-55,0.813573,2.3913360000000003e-53
ENSRNOG00000054549,Lss,0.124023,3,4.970816e-08,0.352169,6.16943e-08
ENSRNOG00000004430,Cep131,0.650722,2,2.657264e-53,0.806673,8.153414e-52


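All 10 candidate genes found in the model table have positive correlation, although Ca3 is weak (R2 = 0.003, p = 0.39). The other 7 candidates do not appear in `highestR2genes.csv`. For now, we will keep all 17 genes in our list to run Enformer on.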
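Since the lookup above silently skips candidates that are absent from the model table, it can help to list them explicitly. The sketch below does this on stand-in data: `toy_gene_list` and `toy_model` are hypothetical replacements for `gene_list` and `model_genes`, and the 0.01 cutoff for "weak" R2 is an arbitrary value chosen for illustration.

```python
import pandas as pd

# Hypothetical stand-ins for gene_list and model_genes.
toy_gene_list = {"geneA", "geneB", "geneC"}
toy_model = pd.DataFrame(
    {"pred.perf.R2": [0.56, 0.003], "cor": [0.75, 0.06]},
    index=pd.Index(["geneA", "geneC"], name="gene"),
)

# Split candidates into those with and without an elastic net model.
found = sorted(toy_gene_list & set(toy_model.index))
missing = sorted(toy_gene_list - set(toy_model.index))

# Flag modelled genes whose R2 falls below the illustrative cutoff.
sub = toy_model.loc[found]
weak = sub[sub["pred.perf.R2"] < 0.01].index.tolist()
print("found:", found, "missing:", missing, "weak:", weak)
```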


## Run Pipeline

### Write Metadata


First, write our `metadata/intervals.txt` file with the 17 genes we have narrowed down to.

In [19]:
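def write_intervals(gene_list, file):
    """Append one interval line (chr{chromosome}_{tss}_{tss}) per gene to `file`."""
    # annot_df is a global loaded in the next cell; it must exist before this is called.
    # Mode 'a' appends, so rerunning the cell adds duplicate lines to an existing file.
    with open(file, 'a') as f:
        for gene in gene_list:
            gene_annot = annot_df.loc[gene]
            tss = gene_annot['tss']
            interval = f"chr{gene_annot['chromosome']}_{tss}_{tss}"
            f.write(interval + '\n')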

In [20]:
annot_df = pd.read_csv('/home/s1mi/enformer_rat_data/annotation/rn7.gene.txt', sep ='\t',  index_col='geneId')

with open("gene_list.txt", "w") as f:
    f.write("\n".join(gene_list))
write_intervals(gene_list, "metadata/intervals.txt")
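`write_intervals` reads `annot_df` from the global scope and opens the file in append mode, so running the cell twice writes every interval twice. A possible variant, sketched below under the assumption that the annotation table has `chromosome` and `tss` columns as used above, takes the table as an argument and overwrites the file. The name `write_intervals_explicit` and the toy annotation are hypothetical.

```python
import os
import tempfile
import pandas as pd

def write_intervals_explicit(gene_ids, annot, path):
    """Write one chr{chromosome}_{tss}_{tss} line per gene, overwriting `path`."""
    lines = []
    for gene in gene_ids:
        row = annot.loc[gene]
        lines.append(f"chr{row['chromosome']}_{row['tss']}_{row['tss']}")
    # Mode 'w' replaces any existing file, so reruns do not accumulate duplicates.
    with open(path, "w") as f:
        f.write("\n".join(lines) + "\n")
    return lines

# Toy annotation table with made-up coordinates.
toy_annot = pd.DataFrame(
    {"chromosome": ["1", "X"], "tss": [1000, 2500]},
    index=pd.Index(["geneA", "geneB"], name="geneId"),
)
out_path = os.path.join(tempfile.mkdtemp(), "intervals.txt")
written = write_intervals_explicit(["geneA", "geneB"], toy_annot, out_path)
print(written)
```

Passing `sorted(gene_list)` rather than the raw set would also make the line order reproducible across sessions.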



Use all 340 individuals:

In [15]:
!bcftools query -l /home/s1mi/enformer_rat_data/Brain.rn7.vcf.gz > metadata/individuals.txt
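A quick sanity check on the sample list can catch an empty or duplicated file before jobs are submitted. The sketch below uses a hypothetical helper, `read_ids`, and demonstrates it on a throwaway file with made-up IDs; pointing it at `metadata/individuals.txt` and comparing the count to 340 would be the real check.

```python
import os
import tempfile

def read_ids(path):
    """Return the non-empty, whitespace-stripped lines of a sample list file."""
    with open(path) as f:
        return [line.strip() for line in f if line.strip()]

# Demo file with made-up sample IDs; the real file is metadata/individuals.txt.
demo_path = os.path.join(tempfile.mkdtemp(), "individuals.txt")
with open(demo_path, "w") as f:
    f.write("rat1\nrat2\nrat3\n")

ids = read_ids(demo_path)
n_ids = len(ids)
all_unique = len(set(ids)) == n_ids
print(n_ids, "individuals; all unique:", all_unique)
```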

### Submit Jobs


```
module load conda

conda activate /lus/grand/projects/TFXcan/imlab/shared/software/conda_envs/enformer-predict-tools

cd /home/s1mi/Github/shared_pipelines/enformer_pipeline
python3 scripts/enformer_predict.py --parameters /home/s1mi/Github/deep-learning-in-genomics/posts/2023-08-10-selecting-genes/personalized_config.json

```