---
title: Testing the Enformer pipeline with added parameters for personalized prediction on rats
author: Sabrina Mi
date: 8/12/23
---

## Test for a single individual and gene

We chose the gene `ENSRNOG00000054549` and centered the input window on its TSS at chr20:12118762.
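To make "centered at the TSS" concrete, here is a minimal sketch of how a fixed-width window around a single position can be computed. It assumes the 393,216 bp Enformer input length that appears later in this post. `centered_window` is a hypothetical helper, not part of the pipeline or of kipoiseq, and the off-by-one conventions of the actual extractor may differ.

```python
# Hypothetical helper for illustration only (not pipeline or kipoiseq code).
SEQUENCE_LENGTH = 393_216  # Enformer input length in bp


def centered_window(center, width=SEQUENCE_LENGTH):
    """Return a half-open (start, end) window of `width` bp centered on `center`."""
    start = center - width // 2
    return start, start + width


start, end = centered_window(12_118_762)
print(start, end, end - start)  # 11922154 12315370 393216
```

Under these assumptions the model sees roughly 196 kb of sequence on either side of the TSS.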


```
conda activate enformer-predict-tools

cd /Users/sabrinami/Github/shared_pipelines/enformer_pipeline

python scripts/enformer_predict.py --parameters /Users/sabrinami/Github/deep-learning-in-genomics/posts/2023-08-15-test-run-of-personalized-enformer-pipeline-for-rats/local_test_personalized.json

```



## Compare results to non-pipeline method

### Read in h5 prediction files

In [2]:
import h5py
import EnformerVCF
import kipoiseq
import numpy as np



In [3]:
# Open each file in a context manager so the handle is closed after reading
with h5py.File('/Users/sabrinami/Desktop/2022-23/tutorials/enformer_pipeline_test/predictions_folder/personalized_enformer_rat_single_gene/predictions_2023-08-15/enformer_predictions/000789972A/haplotype1/chr20_12118762_12118762_predictions.h5', 'r') as f:
    haplotype1 = f['chr20_12118762_12118762'][()]
with h5py.File('/Users/sabrinami/Desktop/2022-23/tutorials/enformer_pipeline_test/predictions_folder/personalized_enformer_rat_single_gene/predictions_2023-08-15/enformer_predictions/000789972A/haplotype2/chr20_12118762_12118762_predictions.h5', 'r') as f:
    haplotype2 = f['chr20_12118762_12118762'][()]
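The two paths above differ only in the haplotype folder, so they can be built from their parts. The sketch below assumes the directory layout seen in this test run (`enformer_predictions/<individual>/<haplotype>/<region>_predictions.h5`). That layout is an observation from these paths, not a documented contract of the pipeline, and `prediction_path` is a hypothetical helper.

```python
from pathlib import Path


def prediction_path(run_dir, individual, haplotype, region):
    """Build the .h5 path using the layout observed in this test run (assumed)."""
    return (
        Path(run_dir)
        / "enformer_predictions"
        / individual
        / haplotype
        / f"{region}_predictions.h5"
    )


region = "chr20_12118762_12118762"  # also the dataset key inside each file
# Relative placeholder for the dated run directory
run_dir = "predictions_folder/personalized_enformer_rat_single_gene/predictions_2023-08-15"
for haplotype in ("haplotype1", "haplotype2"):
    print(prediction_path(run_dir, "000789972A", haplotype, region))
```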

In [4]:
print("haplotype1:\n", haplotype1)
print("haplotype2:\n", haplotype2)

haplotype1:
 [[0.24076067 0.30101207 0.5132549  ... 0.20521325 1.1217918  0.25558835]
 [0.15946281 0.20442429 0.37761706 ... 0.04465578 0.24607326 0.08344302]
 [0.15568599 0.21775411 0.4520394  ... 0.05306218 0.20978831 0.08246609]
 ...
 [0.17938398 0.22463004 0.29506153 ... 0.01107231 0.02651541 0.0338815 ]
 [0.16948122 0.2044945  0.2620006  ... 0.01690046 0.04069382 0.06031117]
 [0.15266503 0.201914   0.22262897 ... 0.02438843 0.03895664 0.05986918]]
haplotype2:
 [[0.23317184 0.29741773 0.5182305  ... 0.20385785 1.1424272  0.26008573]
 [0.15613721 0.20323968 0.37887666 ... 0.04524086 0.257699   0.08468267]
 [0.15380262 0.21736239 0.45358157 ... 0.05439655 0.22475001 0.08432709]
 ...
 [0.17942066 0.2246648  0.29515463 ... 0.01105907 0.02650285 0.03387856]
 [0.16946748 0.20452495 0.2621123  ... 0.0168827  0.04067391 0.06034191]
 [0.15272975 0.20209791 0.22299151 ... 0.0243816  0.03896997 0.05996798]]


### Run non-pipeline Enformer

In [5]:
fasta_file = '/Users/sabrinami/Desktop/2022-23/tutorials/enformer_pipeline_test/rn7_data/rn7_genome.fasta'
fasta_extractor = EnformerVCF.FastaStringExtractor(fasta_file)

In [8]:
## read vcf and encode haplotypes
target_interval = kipoiseq.Interval("chr20", 12118762, 12118762)
# Path assumed: the VCF is expected in the same rn7_data folder as the FASTA above
chr20_vcf = EnformerVCF.read_vcf("/Users/sabrinami/Desktop/2022-23/tutorials/enformer_pipeline_test/rn7_data/chr20.vcf.gz")
haplo1, haplo2 = EnformerVCF.vcf_to_seq(target_interval, '000789972A', chr20_vcf, fasta_extractor)
haplo1_enc = EnformerVCF.one_hot_encode("".join(haplo1))[np.newaxis]
haplo2_enc = EnformerVCF.one_hot_encode("".join(haplo2))[np.newaxis]
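For readers without access to `EnformerVCF`, the sketch below illustrates the general idea behind this cell: substitute an individual's phased variants into the reference to get two haplotype sequences, then one-hot encode each. It is a simplified stand-in, not the `EnformerVCF` implementation. It assumes SNPs only, phased genotypes written like `0|1`, and an A, C, G, T channel order. Both function names are hypothetical.

```python
import numpy as np


def apply_phased_snps(reference, window_start, snps):
    """Return (hap1, hap2) after substituting phased SNPs into `reference`.

    `snps` holds (position, ref, alt, genotype) tuples with genotypes such as
    "0|1". Positions share the coordinate system of `window_start`.
    """
    hap1, hap2 = list(reference), list(reference)
    for pos, ref, alt, genotype in snps:
        idx = pos - window_start
        if not 0 <= idx < len(reference):
            continue  # variant lies outside the window
        if reference[idx] != ref:
            raise ValueError(f"reference allele mismatch at position {pos}")
        allele1, allele2 = genotype.split("|")
        if allele1 == "1":
            hap1[idx] = alt
        if allele2 == "1":
            hap2[idx] = alt
    return "".join(hap1), "".join(hap2)


def one_hot_encode(sequence):
    """Encode DNA as a (length, 4) float32 array; non-ACGT bases stay all zero."""
    lookup = {"A": 0, "C": 1, "G": 2, "T": 3}
    encoded = np.zeros((len(sequence), 4), dtype=np.float32)
    for i, base in enumerate(sequence.upper()):
        if base in lookup:
            encoded[i, lookup[base]] = 1.0
    return encoded


hap1, hap2 = apply_phased_snps("ACGTN", 100, [(101, "C", "T", "0|1")])
print(hap1, hap2)                              # ACGTN ATGTN
print(one_hot_encode(hap2)[np.newaxis].shape)  # (1, 5, 4)
```

The `[np.newaxis]` adds the leading batch dimension, as in the cell above.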

In [9]:
## run predictions
prediction1 = EnformerVCF.model.predict_on_batch(haplo1_enc)['human'][0]
prediction2 = EnformerVCF.model.predict_on_batch(haplo2_enc)['human'][0]

In [10]:
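# Count entries that differ between pipeline and non-pipeline predictions
n_diff1 = np.count_nonzero(haplotype1 != prediction1)
n_diff2 = np.count_nonzero(haplotype2 != prediction2)
print("There are", n_diff1, "differences between haplotype1 matrices and", n_diff2, "differences between haplotype2 matrices.")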

There are 0 differences between haplotype1 matrices and 0 differences between haplotype2 matrices.


The pipeline outputs match the non-pipeline predictions exactly for both haplotypes!
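Exact equality is a strict test for floating-point predictions. If a run on different hardware ever shows small discrepancies, it may help to report the largest absolute difference alongside the mismatch count. The helper below is a hypothetical sketch of that idea and the tolerance is an arbitrary choice. Neither comes from the pipeline.

```python
import numpy as np


def compare_predictions(a, b, atol=1e-6):
    """Summarize how two prediction matrices of equal shape differ."""
    a, b = np.asarray(a), np.asarray(b)
    if a.shape != b.shape:
        raise ValueError(f"shape mismatch: {a.shape} vs {b.shape}")
    return {
        "n_different": int(np.count_nonzero(a != b)),   # exact mismatches
        "max_abs_diff": float(np.max(np.abs(a - b))),   # largest deviation
        "allclose": bool(np.allclose(a, b, atol=atol)), # equal within tolerance
    }


# Toy matrices standing in for pipeline and non-pipeline outputs
x = np.array([[0.25, 0.5], [1.0, 2.0]], dtype=np.float32)
y = x.copy()
y[0, 1] += 0.125
print(compare_predictions(x, x))
print(compare_predictions(x, y))
```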

## Test mouse head on reference genome

```
conda activate enformer-predict-tools

cd /Users/sabrinami/Github/shared_pipelines/enformer_pipeline

python scripts/enformer_predict.py --parameters /Users/sabrinami/Github/deep-learning-in-genomics/posts/2023-08-15-test-run-of-personalized-enformer-pipeline-for-rats/local_test_reference.json

```


## Compare reference results to non-pipeline method

In [14]:
import numpy as np
## Read prediction file
with h5py.File("/Users/sabrinami/Desktop/2022-23/tutorials/enformer_pipeline_test/predictions_folder/reference_enformer_rat_single_gene/predictions_2023-08-26/enformer_predictions/reference_enformer_rat/haplotype0/chr20_12118762_12118762_predictions.h5", "r") as f:
    haplotype0 = f["chr20_12118762_12118762"][()]
print("shape:", haplotype0.shape)

shape: (896, 1643)
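The shape is what the mouse head should give: 896 output bins of 128 bp each, covering the central 114,688 bp of the input, by 1,643 mouse tracks. The sketch below maps a genomic position to its output bin. It assumes a half-open input window centered exactly on the TSS. `bin_index` is a hypothetical helper, and the extractor's own coordinate conventions could shift the result by one bin.

```python
SEQUENCE_LENGTH = 393_216  # Enformer input length in bp
BIN_SIZE = 128             # bp per output bin
N_BINS = 896               # output bins per prediction


def bin_index(position, window_start):
    """Return the 0-based output bin containing `position`, or None if outside."""
    # The predicted region is the central N_BINS * BIN_SIZE bp of the input window
    output_start = window_start + (SEQUENCE_LENGTH - N_BINS * BIN_SIZE) // 2
    offset = position - output_start
    if not 0 <= offset < N_BINS * BIN_SIZE:
        return None
    return offset // BIN_SIZE


tss = 12_118_762
window_start = tss - SEQUENCE_LENGTH // 2
print(bin_index(tss, window_start))  # 448
```

Under these assumptions, rows 447 and 448 of the prediction matrix flank the TSS.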


In [15]:
SEQUENCE_LENGTH = 393216
target_interval = kipoiseq.Interval("chr20", 12118762, 12118762)
sequence_one_hot = EnformerVCF.one_hot_encode(fasta_extractor.extract(target_interval.resize(SEQUENCE_LENGTH)))
predictions = EnformerVCF.model.predict_on_batch(sequence_one_hot[np.newaxis])['mouse'][0]

In [1]:
# Compare the pipeline output (haplotype0) with the direct Enformer prediction
print("There are", np.count_nonzero(haplotype0 != predictions), "differences between pipeline and non-pipeline outputs.")