In this notebook, we will walk through an example of using our pairwise regression model to predict RNA-seq expression in lymphoblastoid cell lines (LCLs) for the test individuals of the EFTUD2 gene.

Along the way, we will detail three important artifacts of our study.
- the pre-processed GEUVADIS RNA-seq data
- an h5 file containing one-hot encoded sequences of GEUVADIS individuals
- the pairwise regression model

In [1]:
REPO_DIR = "/oak/stanford/groups/akundaje/rrastogi/external_repos/finetuning-enformer" # Path to the finetuning-enformer repo
DOWNLOAD_DIR = ".download" # Directory to store downloaded data (sequences, models, etc.)

In [24]:
import gzip
import h5py
import os
import sys

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
import torch
from scipy.stats import pearsonr
from tqdm.auto import tqdm

# In case you don't have finetuning-enformer installed as a package, add it to the system path
sys.path.append(os.path.join(REPO_DIR, "finetuning"))
from models import PairwiseRegressionFloatPrecision

# 1. Load true (processed) gene expression data for EFTUD2

In [3]:
EXPRESSION_DATA_PATH = os.path.join(REPO_DIR, "process_geuvadis_data", "log_tpm", "corrected_log_tpm.annot.csv.gz")

In [65]:
expression_data = pd.read_csv(EXPRESSION_DATA_PATH, index_col=0)

The processed expression dataframe contains log TPM values that have been corrected for batch effects/population stratification by regressing out the top 10 expression PCs.
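
As a rough illustration of what "regressing out the top expression PCs" means, here is a minimal sketch on random data. It assumes a genes x donors matrix and PCs taken from an SVD of the gene-centered matrix. The helper `regress_out_top_pcs` is hypothetical, and the actual preprocessing in `process_geuvadis_data` may differ in its details.

```python
import numpy as np


def regress_out_top_pcs(log_tpm: np.ndarray, n_pcs: int = 10) -> np.ndarray:
    """Remove the top `n_pcs` donor-space PCs from a genes x donors matrix.

    Hypothetical helper for illustration only; not the repo's implementation.
    """
    # Center each gene across donors so PCs capture between-donor variation
    gene_means = log_tpm.mean(axis=1, keepdims=True)
    centered = log_tpm - gene_means
    # Rows of vt are orthonormal donor-space directions, ordered by variance explained
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    top = vt[:n_pcs]  # [n_pcs, n_donors]
    # Regressing on orthonormal covariates reduces to subtracting a projection
    residual = centered - centered @ top.T @ top
    return residual + gene_means


rng = np.random.default_rng(0)
toy = rng.normal(size=(50, 20))  # 50 genes x 20 donors
# Simulated batch effect shared across genes (first 10 donors vs last 10)
batch = np.outer(rng.normal(size=50), np.repeat([1.0, -1.0], 10))
raw = toy + 3 * batch
corrected = regress_out_top_pcs(raw, n_pcs=1)
```

The per-gene means are preserved, so the corrected values stay on the log TPM scale. Only the dominant shared directions of variation across donors are removed.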

Each row corresponds to a different gene. Most columns refer to GEUVADIS donor IDs; other columns contain gene metadata, such as whether a significant eQTL was detected in the European or Yoruba population in the original GEUVADIS analysis.

Note: we only train and evaluate on genes where a significant eQTL was detected in the European population, but this table contains information for all genes that are not lowly expressed in LCLs.
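
The restriction to European eGenes can be expressed as a boolean filter on the `EUR_eGene` column. The sketch below uses a three-row stand-in table with rounded values, and it assumes `EUR_eGene` is read as a boolean column. The names `toy_expression` and `eur_egenes` are illustrative only.

```python
import pandas as pd

# Tiny stand-in for the expression table (one donor column, rounded values)
toy_expression = pd.DataFrame(
    {
        "HG00096": [-0.06, 3.65, -0.98],
        "our_gene_name": [None, None, "zfyve28"],
        "EUR_eGene": [False, False, True],
    },
    index=["ENSG00000257527.1", "ENSG00000151503.7", "ENSG00000159733.9"],
)

# Keep only genes with a significant eQTL in the European population
eur_egenes = toy_expression[toy_expression["EUR_eGene"]]
print(eur_egenes["our_gene_name"].tolist())
```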

In [5]:
expression_data.head()

TargetID,Gene_Symbol,Chr,Coord,HG00096,HG00097,HG00099,HG00100,HG00101,HG00102,HG00103,...,NA20828,stable_id,gencode_v12_gene_name,our_gene_name,EUR_eGene,YRI_eGene,top_EUR_eqtl_rsid,top_YRI_eqtl_rsid,top_EUR_eqtl_distance,top_YRI_eqtl_distance
ENSG00000257527.1,ENSG00000257527.1,16,18505708,-0.057361,-0.31316,-0.684395,-1.209085,-0.012644,-0.270612,-0.930251,...,-1.127696,ENSG00000257527,rp11-1212a22.6,,False,False,,,,
ENSG00000151503.7,ENSG00000151503.7,11,134095348,3.653703,3.555238,3.969966,3.832266,3.620463,3.682108,3.86241,...,3.984807,ENSG00000151503,ncapd3,,False,False,,,,
ENSG00000254681.2,ENSG00000254681.2,16,18495797,2.088882,2.326419,2.128807,2.199625,2.331783,2.627187,1.608311,...,1.565265,ENSG00000254681,rp11-1212a22.3,,False,False,,,,
ENSG00000228477.1,ENSG00000228477.1,1,40428352,5.579332,5.352685,5.758683,6.045576,5.563191,5.176924,5.579479,...,5.187391,ENSG00000228477,rp3-342p20.2,,False,False,,,,
ENSG00000159733.9,ENSG00000159733.9,4,2420390,-0.984586,-1.124469,-0.433654,-1.025796,-0.70515,-1.333362,-0.532541,...,0.044033,ENSG00000159733,zfyve28,zfyve28,True,False,rs4974687,,9347.0,


In [85]:
# All donor columns start with either HG or NA
donor_columns = [c for c in expression_data.columns if c.startswith("HG") or c.startswith("NA")]
print(f"Number of donors: {len(donor_columns)}")

# Restrict to the EFTUD2 gene and then store values in a dictionary
eftud2_data = expression_data[expression_data["our_gene_name"] == "eftud2"].iloc[0] # only one row, so select first
eftud2_data = eftud2_data[donor_columns]
eftud2_true_values = eftud2_data.to_dict()

# Display a subset of the true values
print("Example true expression values for EFTUD2:")
for donor, value in list(eftud2_true_values.items())[:5]:
    print(f"{donor}: {value}")

Number of donors: 462
Example true expression values for EFTUD2:
HG00096: 3.5083597317592243
HG00097: 3.528019606030445
HG00099: 3.27850089238909
HG00100: 3.4304214030439257
HG00101: 3.53166263223106


# 2. Load one-hot sequences for individuals in the test set for EFTUD2

The one-hot encoded sequences for the train, val, and test splits have been uploaded to [HuggingFace](https://huggingface.co/anikethjr/finetuning-enformer/tree/main/data/). We will download the `test.h5.gz` file to our local `DOWNLOAD_DIR` and then extract the one-hot encoded sequences for EFTUD2.

*Note*: All the test sequences for random-split genes and population-split genes are in `test.h5.gz`. That file also contains the test sequences for 100 unseen genes. The test sequences for the remaining unseen genes are present in `rest_unseen_filtered.h5.gz`. Because EFTUD2 is a random-split gene, we will just download `test.h5.gz`.

In [9]:
HF_DATA_DIR = "https://huggingface.co/anikethjr/finetuning-enformer/resolve/main/data"
HF_TEST_H5_PATH = f"{HF_DATA_DIR}/test.h5.gz?download=true"

LOCAL_TEST_H5_PATH = os.path.join(DOWNLOAD_DIR, "test.h5.gz")

In [11]:
os.makedirs(DOWNLOAD_DIR, exist_ok=True)

In [12]:
# Download can take multiple minutes
!curl -L {HF_TEST_H5_PATH} -o {LOCAL_TEST_H5_PATH}

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  1356  100  1356    0     0   8521      0 --:--:-- --:--:-- --:--:--  8528
100 1493M  100 1493M    0     0  77.5M      0  0:00:19  0:00:19 --:--:-- 79.4M


In [13]:
# Unzipping can also take several minutes (requires more than 25 GB of disk space)
!pigz -d -p 8 {LOCAL_TEST_H5_PATH}

This h5 file contains multiple fields:
- `seqs`: one-hot-encoded sequences
    - shape: (n_seqs, n_haplotypes = 2, sequence_length = 49152, alphabet_size = 4)
- `genes`: gene associated with each sequence
    - shape: (n_seqs)
- `samples`: donor (individual) associated with each sequence
    - shape: (n_seqs)
- `ancestries`: ancestry of the donor for each sequence
    - shape: (n_seqs)
- `Y`: log TPM values corrected for top 10 expression PCs
    - shape: (n_seqs)
- `Z`: per-gene z-scores of `Y`, computed across all donors
    - shape: (n_seqs)
- `P`: per-gene percentiles of `Y`, computed across all donors
    - shape: (n_seqs)
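
To make the layout of one `seqs` entry concrete, here is a small sketch of one-hot encoding two short haplotypes and stacking them. The A, C, G, T channel order and the all-zero encoding of unknown bases are assumptions made for illustration, and `one_hot_encode` is a hypothetical helper, not the function used to build the h5 file.

```python
import numpy as np

ALPHABET = "ACGT"  # assumed channel order


def one_hot_encode(seq: str) -> np.ndarray:
    """Encode a DNA string as a [len(seq), 4] array.

    Bases outside the alphabet (e.g. N) are left as all-zero rows (an assumption).
    """
    encoded = np.zeros((len(seq), len(ALPHABET)), dtype=np.uint8)
    for pos, base in enumerate(seq.upper()):
        idx = ALPHABET.find(base)
        if idx >= 0:
            encoded[pos, idx] = 1
    return encoded


# Two haplotypes of one donor that differ at a single position
hap1 = one_hot_encode("ACGTN")
hap2 = one_hot_encode("ACCTN")
donor_seq = np.stack([hap1, hap2], axis=0)  # [n_haplotypes=2, sequence_length=5, 4]
print(donor_seq.shape)  # (2, 5, 4)
```

In the real file the sequence length is 49152 rather than 5, and there is one such entry per (gene, donor) pair.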

Here, we will only extract the sequences for individuals in the test set of EFTUD2.

In [95]:
with h5py.File(LOCAL_TEST_H5_PATH.replace(".gz", ""), "r") as f:
    all_genes = f["genes"][:].astype(str)
    eftud2_idxs = np.where(all_genes == "eftud2")[0]
    eftud2_donors = f["samples"][eftud2_idxs].astype(str)
    eftud2_seqs = f["seqs"][eftud2_idxs]
    eftud2_donor_to_seq = {donor: seq for donor, seq in zip(eftud2_donors, eftud2_seqs)}

print(f"# of EFTUD2 individuals in test set: {len(eftud2_donor_to_seq)}")

# of EFTUD2 individuals in test set: 77
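
The selection logic can be tried without downloading the file by using small in-memory arrays as stand-ins for the h5 datasets. The toy arrays below are invented for illustration, and storing gene and sample names as byte strings is an assumption about the file.

```python
import numpy as np

# Stand-ins for f["genes"], f["samples"], and f["seqs"]
genes = np.array([b"eftud2", b"zfyve28", b"eftud2"])
samples = np.array([b"HG00096", b"HG00096", b"NA20828"])
seqs = np.zeros((3, 2, 8, 4), dtype=np.uint8)  # short sequences instead of 49152 bp

# Row indices belonging to the gene of interest
idxs = np.where(genes.astype(str) == "eftud2")[0]

# Map each donor to its [n_haplotypes, sequence_length, 4] array
donor_to_seq = {str(d): s for d, s in zip(samples[idxs].astype(str), seqs[idxs])}
print(len(donor_to_seq))  # 2
```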


# 3. Make predictions using the pairwise regression model

The weights of all primary models (i.e. those not trained for ablation analyses) are uploaded to [HuggingFace](https://huggingface.co/anikethjr/finetuning-enformer/tree/main/saved_models). Each primary model has three replicates, corresponding to different random seeds. We will first download the seed 42 replicate of the pairwise regression model.

In [54]:
HF_MODEL_DIR = "https://huggingface.co/anikethjr/finetuning-enformer/resolve/main/saved_models"

HF_CKPT_PATH = f"{HF_MODEL_DIR}/regression_data_seed_42_lr_0.0001_wd_0.001_rcprob_0.5_rsmax_3/checkpoints/best.ckpt?download=true"
LOCAL_CKPT_PATH = os.path.join(DOWNLOAD_DIR, "pairwise_regression_seed_42.ckpt")

In [16]:
# Download may take several minutes
!curl -L {HF_CKPT_PATH} -o {LOCAL_CKPT_PATH}

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  1307  100  1307    0     0   4070      0 --:--:-- --:--:-- --:--:--  4084
100 2712M  100 2712M    0     0  69.5M      0  0:00:38  0:00:38 --:--:-- 44.5M


Load the model onto the GPU.

In [None]:
assert torch.cuda.is_available(), "A CUDA-capable GPU is required to run this cell"
device = torch.device("cuda:0")
model = PairwiseRegressionFloatPrecision.load_from_checkpoint(LOCAL_CKPT_PATH)
model = model.to(device)
model.eval()

Make predictions for each individual by averaging over the forward and reverse complement sequences for both haplotypes.

*Note*: The pairwise regression model is trained to predict $z$-scored expression values, so its predictions are comparable across individuals for a given gene, but not across different genes.
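
Predictions on a $z$-score scale can still be evaluated against log TPM values for the same gene, because Pearson correlation is unchanged by shifting and positively rescaling either variable. The sketch below checks this on simulated values. The data and the `pearson` helper are illustrative only.

```python
import numpy as np


def pearson(x: np.ndarray, y: np.ndarray) -> float:
    """Pearson correlation between two 1D arrays."""
    return float(np.corrcoef(x, y)[0, 1])


rng = np.random.default_rng(0)
log_tpm = rng.normal(loc=3.5, scale=0.2, size=77)        # simulated true values for one gene
preds = 0.5 * log_tpm + rng.normal(scale=0.2, size=77)   # simulated predictions

# z-scoring within a gene is an affine transform with positive scale
z_scores = (log_tpm - log_tpm.mean()) / log_tpm.std()

r_raw = pearson(log_tpm, preds)
r_z = pearson(z_scores, preds)
print(np.isclose(r_raw, r_z))  # True
```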

In [77]:
@torch.inference_mode()
def make_predictions_for_donor(
    model: PairwiseRegressionFloatPrecision,
    device: torch.device,
    donor_to_seq: dict[str, np.ndarray],
    donor: str,
):
    fwd_seq = donor_to_seq[donor].astype(np.float32) # [n_haplotypes=2, sequence_length=49152, 4]
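    # Reverse complement: reverse the sequence axis and the base axis (complements bases if channels are ordered A, C, G, T)
    rc_seq = np.flip(fwd_seq, axis=(-1, -2)).copy()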
    batch_seq = np.stack([fwd_seq, rc_seq], axis=0) # [n_strands=2, n_haplotypes=2, sequence_length=49152, 4]
    
    batch_seq_tensor = torch.from_numpy(batch_seq).to(device)
    preds = model(batch_seq_tensor) # [n_strands=2]
    preds = preds.detach().cpu().numpy()
    return np.mean(preds) # average over strands
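
The `np.flip` call in the function above can be checked on a short sequence. The sketch assumes an A, C, G, T channel order, under which reversing the channel axis swaps A with T and C with G. The `encode` and `decode` helpers are hypothetical.

```python
import numpy as np

ALPHABET = "ACGT"  # assumed channel order


def encode(seq: str) -> np.ndarray:
    """One-hot encode a DNA string as a [len(seq), 4] float32 array."""
    return np.eye(4, dtype=np.float32)[[ALPHABET.index(b) for b in seq]]


def decode(one_hot: np.ndarray) -> str:
    """Invert `encode` by taking the argmax over the base axis."""
    return "".join(ALPHABET[i] for i in one_hot.argmax(axis=-1))


fwd = encode("AAGCT")                    # [sequence_length=5, 4]
rc = np.flip(fwd, axis=(-1, -2)).copy()  # reverse positions and swap A<->T, C<->G
rc_string = decode(rc)
print(rc_string)  # AGCTT

# With a leading haplotype axis, only the last two axes are flipped
haps = np.stack([encode("AAGCT"), encode("ACGTT")], axis=0)  # [2, 5, 4]
haps_rc = np.flip(haps, axis=(-1, -2))
```

Flipping twice returns the original array, so the operation is its own inverse.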

In [88]:
eftud2_pred_values = {}
for donor in tqdm(eftud2_donor_to_seq):
    eftud2_pred_values[donor] = make_predictions_for_donor(
        model=model,
        device=device,
        donor_to_seq=eftud2_donor_to_seq,
        donor=donor,
    )


In [89]:
def compute_pearson_corr(true_values: dict[str, float], pred_values: dict[str, float]) -> float:
    assert set(pred_values.keys()).issubset(set(true_values.keys()))
    common_donors = list(pred_values.keys())
    x = np.asarray([true_values[donor] for donor in common_donors]) 
    y = np.asarray([pred_values[donor] for donor in common_donors])
    return pearsonr(x, y)[0]

In [96]:
compute_pearson_corr(eftud2_true_values, eftud2_pred_values)

np.float64(0.34543399621640836)

Check that this matches the performance reported in the `all_gene_perf.csv` file, up to small numerical differences (the two values agree to three decimal places).

In [91]:
ALL_GENE_PERF_PATH = os.path.join(REPO_DIR, "analysis", "all_gene_perf.csv")

In [92]:
all_gene_perf_df = pd.read_csv(ALL_GENE_PERF_PATH)
all_gene_perf_df[
    (all_gene_perf_df["gene"] == "eftud2")
    & (all_gene_perf_df["model"] == "regression_data_seed_42_lr_0.0001_wd_0.001_rcprob_0.5_rsmax_3")
]

Unnamed: 0,gene,Pearson,|Pearson|,model,class,Chr
114,eftud2,0.345252,0.345252,regression_data_seed_42_lr_0.0001_wd_0.001_rcp...,random_split,17
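
The comparison can be made explicit with a tolerance check. The sketch below uses the two values shown in this notebook. The absolute tolerance of `1e-3` is an assumption chosen for illustration, not a threshold defined by the repo.

```python
import math

recomputed = 0.34543399621640836  # Pearson correlation computed in this notebook
reported = 0.345252               # Pearson value listed in all_gene_perf.csv

# Absolute difference between the two estimates
difference = abs(recomputed - reported)
print(f"{difference:.6f}")

# Passes if the values agree to within the chosen (assumed) tolerance
assert math.isclose(recomputed, reported, abs_tol=1e-3)
```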
