---
title: Personalized Borzoi test on a few genes
date: 10/4/2023
author: Sabrina Mi
---

### Select Genes

We want to pick a handful of human genes with rat orthologs from the genes used in the personalized Enformer runs on rats, chosen so that their Spearman correlations roughly follow the same distribution as the full gene set.

In [8]:
import pandas as pd
import numpy as np

In [43]:
rn7_gene_list = pd.read_csv("/home/s1mi/enformer_rat_data/output/Br_personalized_spearman_corr_human.csv", index_col = 0)

In [33]:
#| code-fold: true
# Calculate the mean and standard deviation of the Spearman correlations
mean = np.mean(rn7_gene_list['spearman r'])
std_dev = np.std(rn7_gene_list['spearman r'])


# Group the elements based on their distance from the mean
df_1 = pd.DataFrame(columns=['gene', 'spearman r'])
df_2 = pd.DataFrame(columns=['gene', 'spearman r'])
df_3 = pd.DataFrame(columns=['gene', 'spearman r'])

for gene, row in rn7_gene_list.iterrows():
    deviation = abs(row['spearman r'] - mean)
    df_row = pd.DataFrame({'gene': [gene], 'spearman r': row['spearman r']})
    if deviation <= std_dev:
        df_1 = pd.concat([df_1, df_row], ignore_index=True)
    elif deviation <= 2 * std_dev:
        df_2 = pd.concat([df_2, df_row], ignore_index=True)
    else:
        df_3 = pd.concat([df_3, df_row], ignore_index=True)
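
The loop above assigns each gene to one of three groups by how far its correlation lies from the mean. The same grouping can be sketched without row-by-row `pd.concat`, which may be faster on long gene lists. This is only an illustration on a toy table: the gene IDs, the values, and the `sd_bin` column name are made up, and the real input would be `rn7_gene_list`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for rn7_gene_list: gene IDs as the index, one 'spearman r' column.
# IDs and values are invented for illustration only.
toy = pd.DataFrame(
    {"spearman r": [-0.3, 0.0, 0.02, 0.05, 0.05, 0.08, 0.1, 0.1, 0.25, 0.65]},
    index=[f"gene{i}" for i in range(10)],
)

mean = toy["spearman r"].mean()
std_dev = toy["spearman r"].std(ddof=0)  # ddof=0 matches the default of np.std

# Absolute distance of every gene from the mean, computed in one step
deviation = (toy["spearman r"] - mean).abs()

# 1: within one SD, 2: between one and two SDs, 3: more than two SDs away
toy["sd_bin"] = np.where(
    deviation <= std_dev, 1, np.where(deviation <= 2 * std_dev, 2, 3)
)

# Gene IDs per group, keyed by bin number
groups = {int(k): g.index.tolist() for k, g in toy.groupby("sd_bin")}
print(groups)
```

Keeping the bin as a column leaves every gene in one table, so later filters (for example, on ortholog membership) only need to be applied once.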


In [47]:
rn7_hg38_ortho = pd.read_csv("/home/s1mi/enformer_rat_data/annotation/rn7_hg38.ortholog_genes.txt", sep="\t", index_col="ensembl_gene_id")
hg38_annot = pd.read_csv("/home/s1mi/enformer_rat_data/annotation/hg38.gene.txt", sep="\t")
ortho_genes = list((rn7_gene_list.index).intersection(rn7_hg38_ortho.index))

In [50]:
# Keep genes with a human ortholog, then sample one gene from each standard deviation group
df_1 = df_1[df_1['gene'].isin(ortho_genes)]
df_2 = df_2[df_2['gene'].isin(ortho_genes)]
df_3 = df_3[df_3['gene'].isin(ortho_genes)]
test_genes = [df_1['gene'].sample().item(), df_2['gene'].sample().item(), df_3['gene'].sample().item()]
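
Because `.sample()` is called without a seed, rerunning the cell picks different genes, and sampling would fail if a group ended up empty after the ortholog filter. Below is a minimal sketch of a seeded, guarded version. The helper `sample_one_gene`, the seed value, and the toy groups are all hypothetical and not part of the original notebook.

```python
import pandas as pd

def sample_one_gene(group, seed):
    """Return one gene ID from a group, or None if the group is empty."""
    if group.empty:
        return None
    # random_state makes the draw repeatable across runs
    return group["gene"].sample(n=1, random_state=seed).item()

# Toy groups standing in for df_1, df_2, df_3 (gene IDs are invented)
toy_groups = [
    pd.DataFrame({"gene": ["geneA", "geneB", "geneC"]}),
    pd.DataFrame({"gene": ["geneD"]}),
    pd.DataFrame({"gene": []}),  # an empty group, e.g. no orthologs left
]

seed = 0  # arbitrary choice; any fixed integer works
picked = [sample_one_gene(g, seed) for g in toy_groups]
picked = [gene for gene in picked if gene is not None]
print(picked)
```

Recording the seed alongside `gene_list.csv` would make it possible to regenerate the same test set later.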

In [81]:
hg38_gene_list = rn7_hg38_ortho['hsapiens_homolog_ensembl_gene'].loc[test_genes].to_list()
hg38_gene_df = hg38_annot[hg38_annot['ensembl_gene_id'].isin(hg38_gene_list)]
hg38_gene_df = hg38_gene_df[["ensembl_gene_id", "chromosome_name", "transcript_start", "transcript_end"]]

In [86]:
hg38_gene_df.to_csv("gene_list.csv", index=False)

### Write Individuals List

There are 455 individuals in the GEUVADIS data with LCL gene expression data.

In [None]:
import cyvcf2
# Sample IDs are the same across chromosomes, so reading the chr1 header is enough
vcf_chr = cyvcf2.VCF("/grand/TFXcan/imlab/data/1000G/vcf_snps_only/ALL.chr1.shapeit2_integrated_SNPs_v2a_27022019.GRCh38.phased.vcf.gz")
vcf_samples = vcf_chr.samples

In [9]:
geuvadis_gex = pd.read_csv("/lus/grand/projects/TFXcan/imlab/data/1000G/expression/GD462.GeneQuantRPKM.50FN.samplename.resk10.txt.gz", sep="\t")
individuals = geuvadis_gex.columns[4:].tolist()
# Keep individuals that have both genotypes and expression; sort so the file is reproducible
samples = sorted(set(vcf_samples).intersection(individuals))
with open("individuals.txt", "w") as f:
    f.write("\n".join(samples))
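
Before launching a long job it can help to read the list back and confirm it looks as expected. The sketch below uses a hypothetical helper, `read_individuals`, and writes a toy file to a temporary directory. The sample IDs are placeholders; the real check would point at `individuals.txt`.

```python
from pathlib import Path
import tempfile

def read_individuals(path):
    """Read one sample ID per line, skipping blank lines."""
    lines = Path(path).read_text().splitlines()
    return [line.strip() for line in lines if line.strip()]

# Toy file standing in for individuals.txt (placeholder sample IDs)
with tempfile.TemporaryDirectory() as tmp:
    toy_path = Path(tmp) / "individuals.txt"
    toy_path.write_text("\n".join(sorted({"NA21144", "HG00096", "NA12878"})))
    ids = read_individuals(toy_path)

# Basic sanity checks: no duplicates, and the count can be compared by eye
# against the number of individuals expected
assert len(ids) == len(set(ids)), "duplicate sample IDs"
print(len(ids), "individuals")
```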

### Run Predictions

I started a pipeline for personalized prediction in this [notebook](https://sabrina-dl.hakyimlab.org/posts/2023-09-26-borzoi-personalized-test/geuvadis_personalized_test) and moved it into a [Python script](personalized_prediction.py).

I submitted the script as a [PBS job](borzoi_test_run.pbs) with `qsub borzoi_test_run.pbs`.

```
module load conda
conda activate borzoi
cd /home/s1mi/Github/deep-learning-in-genomics/posts/2023-10-04-personalized-test-on-a-few-genes

python3 personalized_prediction.py \
    --gene_df gene_list.csv \
    --fasta_file /home/s1mi/borzoi_tutorial/hg38.fa \
    --vcf_dir /grand/TFXcan/imlab/data/1000G/vcf_snps_only \
    --individuals_file individuals.txt \
    --model_dir /home/s1mi/borzoi_tutorial \
    --output_dir /grand/TFXcan/imlab/users/sabrina/borzoi-personalized-test
```

### Check Results

In [91]:
import h5py
with h5py.File("/grand/TFXcan/imlab/users/sabrina/borzoi-personalized-test/NA21144/chr1_43530883_43623666_predictions.h5", "r") as hf:
    for key, value in hf.items():
        print(key)
        print(value)

haplotype1
<HDF5 dataset "haplotype1": shape (4, 16352, 7611), type "<f2">
haplotype2
<HDF5 dataset "haplotype2": shape (4, 16352, 7611), type "<f2">
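
To see which individuals have finished, one option is to build the expected file path for each individual and gene and check whether it exists. The sketch below assumes the layout `<output_dir>/<individual>/<chrom>_<start>_<end>_predictions.h5`, which is inferred only from the single path opened above. Both helper functions are hypothetical, and the `chr` prefix handling and the second sample ID are assumptions for illustration.

```python
from pathlib import Path
import tempfile

def expected_prediction_path(output_dir, individual, chrom, start, end):
    """Build the assumed path of one predictions file (hypothetical helper)."""
    chrom = str(chrom)
    if not chrom.startswith("chr"):
        chrom = "chr" + chrom  # assumption: file names carry a 'chr' prefix
    return Path(output_dir) / individual / f"{chrom}_{start}_{end}_predictions.h5"

def missing_predictions(output_dir, individuals, regions):
    """List expected prediction files that do not exist yet."""
    missing = []
    for individual in individuals:
        for chrom, start, end in regions:
            path = expected_prediction_path(output_dir, individual, chrom, start, end)
            if not path.exists():
                missing.append(path)
    return missing

# Toy check in a temporary directory: one individual is done, one is not
with tempfile.TemporaryDirectory() as tmp:
    regions = [("1", 43530883, 43623666)]
    done = expected_prediction_path(tmp, "NA21144", *regions[0])
    done.parent.mkdir(parents=True)
    done.touch()
    missing = missing_predictions(tmp, ["NA21144", "HG00096"], regions)
    missing_names = [p.parent.name for p in missing]

print(missing_names)
```

On the real run, `regions` would come from the rows of `gene_list.csv` and `individuals` from `individuals.txt`, provided the script names its outputs the way this sketch assumes.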
