In this notebook, we will walk through an example of using our pairwise regression model to predict RNA-seq expression of the ZNF83 gene in lymphoblastoid cell lines (LCLs) for the test individuals.

Along the way, we will detail three important artifacts of our study.
- the pre-processed GEUVADIS RNA-seq data
- an h5 object containing one-hot encoded sequences of GEUVADIS individuals
- the pairwise regression model

In [9]:
REPO_DIR = "/data/yosef3/users/ruchir/finetuning-enformer" # Path to the finetuning-enformer repo
DOWNLOAD_DIR = ".download" # Directory to store downloaded data (sequences, models, etc.)

In [10]:
import gzip
import h5py
import os
import sys

import numpy as np
import pandas as pd

# In case you don't have finetuning-enformer installed as a package, add it to the system path
sys.path.append(os.path.join(REPO_DIR, "finetuning"))
from models import PairwiseRegressionFloatPrecision


# 1. Load true (processed) gene expression data for ZNF83

In [7]:
EXPRESSION_DATA_PATH = os.path.join(REPO_DIR, "process_geuvadis_data", "log_tpm", "corrected_log_tpm.annot.csv.gz")

In [8]:
expression_data = pd.read_csv(EXPRESSION_DATA_PATH, index_col=0)

The processed expression dataframe contains log TPM values that have been corrected for batch effects/population stratification by regressing out the top 10 expression PCs.
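
As a rough illustration of what "regressing out the top 10 expression PCs" means, the sketch below removes the top principal components from a toy genes × donors matrix. The helper name `regress_out_top_pcs` and the toy data are hypothetical, and the repository's actual preprocessing may differ in details such as centering or how the PCs are computed.

```python
import numpy as np

def regress_out_top_pcs(log_tpm, n_pcs=10):
    """Remove the top `n_pcs` principal components (across donors) from a genes x donors matrix.

    Hypothetical helper for illustration; not the repository's preprocessing code.
    """
    # Center each gene across donors so PCs capture variation rather than mean expression
    gene_means = log_tpm.mean(axis=1, keepdims=True)
    centered = log_tpm - gene_means

    # SVD of the centered matrix: rows of vt are orthonormal PC axes in donor space
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    top_pcs = vt[:n_pcs]  # shape: (n_pcs, n_donors)

    # Project each gene onto the top PCs and subtract that projection (the least-squares fit)
    fitted = centered @ top_pcs.T @ top_pcs
    return centered - fitted + gene_means

rng = np.random.default_rng(0)
toy = rng.normal(size=(50, 20))  # 50 genes x 20 donors (toy data)
corrected = regress_out_top_pcs(toy, n_pcs=3)
```

After correction, each gene keeps its mean expression but has no remaining component along the removed PCs.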

Each row corresponds to a different gene. Most columns refer to GEUVADIS donor IDs; other columns contain gene metadata, such as whether a significant eQTL was detected in the European or Yoruba population in the original GEUVADIS analysis.

Note: we only train and evaluate on genes where a significant eQTL was detected in the European population, but this table contains information for all genes that are not lowly expressed in LCLs.
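
A minimal sketch of how the table could be restricted to the genes used for training and evaluation, assuming the `EUR_eGene` column is parsed as booleans. The toy dataframe below is made up for illustration and only mimics a few columns of the real table.

```python
import pandas as pd

# Toy stand-in for the expression table (values and gene names are made up)
toy_expression = pd.DataFrame(
    {
        "our_gene_name": ["gene_a", None, "gene_c"],
        "EUR_eGene": [True, False, True],
        "YRI_eGene": [False, False, True],
        "HG00096": [1.2, 0.3, -0.5],
    },
    index=["ENSG_A", "ENSG_B", "ENSG_C"],
)

# Keep only genes with a significant eQTL in the European population
eur_egenes = toy_expression[toy_expression["EUR_eGene"]]
print(f"{len(eur_egenes)} of {len(toy_expression)} genes are EUR eGenes")
```

The same boolean mask applied to `expression_data` should give the set of candidate genes, provided the column contains no missing values.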

In [9]:
expression_data.head()

| TargetID | Gene_Symbol | Chr | Coord | HG00096 | HG00097 | HG00099 | HG00100 | HG00101 | HG00102 | HG00103 | ... | NA20828 | stable_id | gencode_v12_gene_name | our_gene_name | EUR_eGene | YRI_eGene | top_EUR_eqtl_rsid | top_YRI_eqtl_rsid | top_EUR_eqtl_distance | top_YRI_eqtl_distance |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ENSG00000257527.1 | ENSG00000257527.1 | 16 | 18505708 | -0.057361 | -0.31316 | -0.684395 | -1.209085 | -0.012644 | -0.270612 | -0.930251 | ... | -1.127696 | ENSG00000257527 | rp11-1212a22.6 | | False | False | | | | |
| ENSG00000151503.7 | ENSG00000151503.7 | 11 | 134095348 | 3.653703 | 3.555238 | 3.969966 | 3.832266 | 3.620463 | 3.682108 | 3.86241 | ... | 3.984807 | ENSG00000151503 | ncapd3 | | False | False | | | | |
| ENSG00000254681.2 | ENSG00000254681.2 | 16 | 18495797 | 2.088882 | 2.326419 | 2.128807 | 2.199625 | 2.331783 | 2.627187 | 1.608311 | ... | 1.565265 | ENSG00000254681 | rp11-1212a22.3 | | False | False | | | | |
| ENSG00000228477.1 | ENSG00000228477.1 | 1 | 40428352 | 5.579332 | 5.352685 | 5.758683 | 6.045576 | 5.563191 | 5.176924 | 5.579479 | ... | 5.187391 | ENSG00000228477 | rp3-342p20.2 | | False | False | | | | |
| ENSG00000159733.9 | ENSG00000159733.9 | 4 | 2420390 | -0.984586 | -1.124469 | -0.433654 | -1.025796 | -0.70515 | -1.333362 | -0.532541 | ... | 0.044033 | ENSG00000159733 | zfyve28 | zfyve28 | True | False | rs4974687 | | 9347.0 | |


In [10]:
# All donor columns start with either HG or NA
donor_columns = [c for c in expression_data.columns if c.startswith("HG") or c.startswith("NA")]
print(f"Number of donors: {len(donor_columns)}")

# Restrict to the ZNF83 gene and then store values in a dictionary
znf83_data = expression_data[expression_data["our_gene_name"] == "znf83"].iloc[0] # only one row, so select first
znf83_true_values = znf83_data[donor_columns].to_dict()

# Display a subset of the true values
print("Example true expression values for ZNF83:")
for donor, value in list(znf83_true_values.items())[:5]:
    print(f"{donor}: {value}")

Number of donors: 462
Example true expression values for ZNF83:
HG00096: 1.815088910192252
HG00097: 1.7825214478821585
HG00099: 2.0967075753385607
HG00100: 2.0062706302142934
HG00101: 1.4431392504217315


# 2. Load one-hot sequences for individuals in the test set for ZNF83

The one-hot encoded sequences for the train, val, and test splits have been uploaded to [HuggingFace](https://huggingface.co/anikethjr/finetuning-enformer/tree/main/data/). We will download the `test.h5.gz` file to our local `DOWNLOAD_DIR` and then extract the one-hot encoded sequences for ZNF83.

*Note*: All the test sequences for random-split genes and population-split genes are in `test.h5.gz`. That file also contains the test sequences for 100 unseen genes. The test sequences for the remaining unseen genes are present in `rest_unseen_filtered.h5.gz`. Because ZNF83 is a random-split gene, we will just download `test.h5.gz`.

In [18]:
HF_DATA_DIR = "https://huggingface.co/anikethjr/finetuning-enformer/resolve/main/data"

HF_TEST_PATH = f"{HF_DATA_DIR}/test.h5.gz?download=true"

In [19]:
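# Download can take multiple minutes
# The URL is quoted so the shell does not treat `?` as a glob; --create-dirs makes DOWNLOAD_DIR if needed
!curl -L --create-dirs "{HF_TEST_PATH}" -o {os.path.join(DOWNLOAD_DIR, "test.h5.gz")}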

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  1350  100  1350    0     0  10129      0 --:--:-- --:--:-- --:--:-- 10150
100 1493M  100 1493M    0     0  50.3M      0  0:00:29  0:00:29 --:--:-- 50.9M


In [None]:
# Unzipping can also take several minutes (requires more than 25 GB of disk space)
!pigz -d -p 8 {os.path.join(DOWNLOAD_DIR, "test.h5.gz")}
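
If `pigz` is not available, the file can also be decompressed from Python using only the standard library. The helper below is a sketch (the name `gunzip_file` is hypothetical); it streams the data in chunks so the file never has to fit in memory, and it is demonstrated on a small temporary file rather than the real download.

```python
import gzip
import os
import shutil
import tempfile

def gunzip_file(src_path, dst_path, chunk_size=16 * 1024 * 1024):
    """Stream-decompress src_path (.gz) to dst_path without loading it into memory."""
    with gzip.open(src_path, "rb") as f_in, open(dst_path, "wb") as f_out:
        shutil.copyfileobj(f_in, f_out, length=chunk_size)

# Small self-contained demonstration on a temporary file
with tempfile.TemporaryDirectory() as tmp_dir:
    gz_path = os.path.join(tmp_dir, "example.bin.gz")
    out_path = os.path.join(tmp_dir, "example.bin")
    payload = b"ACGT" * 1000
    with gzip.open(gz_path, "wb") as f:
        f.write(payload)
    gunzip_file(gz_path, out_path)
    with open(out_path, "rb") as f:
        roundtrip_ok = f.read() == payload
```

For this notebook, the call would be `gunzip_file(os.path.join(DOWNLOAD_DIR, "test.h5.gz"), os.path.join(DOWNLOAD_DIR, "test.h5"))`. Unlike `pigz -d`, this keeps the compressed file, so it needs additional disk space on top of the requirement noted above.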

This h5 file contains multiple fields:
- `seqs`: one-hot encoded sequences
    - shape: (n_seqs, n_haplotypes = 2, sequence_length = 49152, alphabet_size = 4)
- `genes`: gene associated with each sequence
    - shape: (n_seqs)
- `samples`: donor (individual) associated with each sequence
    - shape: (n_seqs)
- `ancestries`: ancestry of the donor for each sequence
    - shape: (n_seqs)
- `Y`: log TPM values corrected for the top 10 expression PCs
    - shape: (n_seqs)
- `Z`: per-gene z-scores of `Y`, computed across donors
    - shape: (n_seqs)
- `P`: per-gene percentiles of `Y`, computed across donors
    - shape: (n_seqs)
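
To make the relationship between `Y`, `Z`, and `P` concrete, here is a sketch of how per-gene z-scores and percentiles could be computed from one gene's expression values. The exact conventions are assumptions, not taken from the source: population standard deviation (`ddof=0`), percentiles defined as the percentage of donors with a value less than or equal to the donor's own, and an unspecified set of donors over which the statistics are computed.

```python
import numpy as np

def per_gene_z_and_percentile(y):
    """Return z-scores and percentiles of one gene's expression values across donors.

    Hypothetical helper; the conventions here (ddof=0, "<=" percentiles) are assumptions.
    """
    y = np.asarray(y, dtype=float)
    z = (y - y.mean()) / y.std()
    # For each donor, count how many donors have expression <= theirs
    ranks = (y[None, :] <= y[:, None]).sum(axis=1)
    p = 100.0 * ranks / len(y)
    return z, p

toy_y = np.array([1.8, 1.4, 2.1, 2.0])  # toy values for four donors
z, p = per_gene_z_and_percentile(toy_y)
print(p)  # [ 50.  25. 100.  75.]
```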

Here, we will only extract the sequences for individuals in the test set of ZNF83.

In [11]:
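with h5py.File(os.path.join(DOWNLOAD_DIR, "test.h5"), "r") as f:
    # Load all gene labels as str and find the rows belonging to ZNF83
    all_genes = f["genes"][:].astype(str)
    znf83_idxs = np.where(all_genes == "znf83")[0]
    # Read only the ZNF83 rows from disk; each sequence has shape (2, 49152, 4)
    znf83_donors = f["samples"][znf83_idxs].astype(str)
    znf83_seqs = f["seqs"][znf83_idxs]
    # Map each donor ID to that donor's pair of one-hot encoded haplotypes
    znf83_donor_to_seq = dict(zip(znf83_donors, znf83_seqs))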

print(f"# of ZNF83 individuals in test set: {len(znf83_donor_to_seq)}")

# of ZNF83 individuals in test set: 77
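
As a sanity check on the extracted arrays, a one-hot encoded haplotype can be decoded back into a nucleotide string. The sketch below assumes the channel order is A, C, G, T and that all-zero positions represent unknown bases; neither is confirmed by the source, so check the repository's encoding before relying on it. The helper `one_hot_to_string` is hypothetical.

```python
import numpy as np

ALPHABET = np.array(list("ACGT"))  # assumed channel order; check the repository's encoding

def one_hot_to_string(one_hot):
    """Decode a (sequence_length, 4) one-hot array into a nucleotide string.

    Positions with no channel set (all zeros) are written as 'N'.
    """
    one_hot = np.asarray(one_hot)
    bases = ALPHABET[one_hot.argmax(axis=-1)]
    bases = np.where(one_hot.sum(axis=-1) == 0, "N", bases)
    return "".join(bases)

# Toy example with the same layout as one entry of `seqs`: (n_haplotypes=2, length, 4)
toy_seq = np.zeros((2, 5, 4), dtype=np.float32)
toy_seq[0, np.arange(5), [0, 1, 2, 3, 0]] = 1  # haplotype 1: ACGTA
toy_seq[1, np.arange(4), [3, 3, 2, 1]] = 1     # haplotype 2: TTGC, last position left empty
haplotypes = [one_hot_to_string(h) for h in toy_seq]
print(haplotypes)  # ['ACGTA', 'TTGCN']
```

On the real data, `one_hot_to_string(znf83_donor_to_seq[donor][0])` would decode the first haplotype of a given donor.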


# 3. Make predictions using the pairwise regression model

The weights of all primary models (i.e. those not trained for ablation analyses) are uploaded to [HuggingFace](https://huggingface.co/anikethjr/finetuning-enformer/tree/main/saved_models). Each primary model has three replicates, corresponding to different random seeds. We will first download the seed 42 replicate of the pairwise regression model.

In [36]:
HF_MODEL_DIR = "https://huggingface.co/anikethjr/finetuning-enformer/resolve/main/saved_models"

PR_SEED_42_CKPT_PATH = f"{HF_MODEL_DIR}/regression_data_seed_42_lr_0.0001_wd_0.001_rcprob_0.5_rsmax_3/checkpoints/best.ckpt?download=true"

In [37]:
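# Download may take several minutes
# The URL is quoted so the shell does not treat `?` as a glob; --create-dirs makes DOWNLOAD_DIR if needed
!curl -L --create-dirs "{PR_SEED_42_CKPT_PATH}" -o {os.path.join(DOWNLOAD_DIR, "pairwise_regression_seed42.ckpt")}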

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  1307  100  1307    0     0   9355      0 --:--:-- --:--:-- --:--:--  9335
100 2712M  100 2712M    0     0  47.0M      0  0:00:57  0:00:57 --:--:-- 41.6M
