**Variant Effect for MPRA_eQTL from promoterAI**

Here is an example workflow for variant effect prediction. For instructions on downloading the data, see the manuscript repository: [https://github.com/egilfeather/lowrank-s2f-code](https://github.com/egilfeather/lowrank-s2f-code).

**Load the `seillra` model**

In [8]:
import torch
import seillra as sl
import pandas as pd

rank = 256  # 1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024, 2048
quant = "CPU"
model = sl.Sei_LLRA(k=rank, projection=True, mode="variant", quant=quant)

2026-01-23 15:04:02,701 - INFO - Checksum verified for url_a6038b62128b5b01_wts: 28a1a49ca62e4d67a62c170df3751f7255db6eea3923455c119c762dde446308
2026-01-23 15:04:02,702 - INFO - Loading state dict from /home/kostka/.cache/seillra/1.4/url_a6038b62128b5b01_wts
2026-01-23 15:04:03,211 - INFO - Model weights loaded and set to eval mode.
2026-01-23 15:04:03,769 - INFO - Checksum verified for url_9c83e76615711914_wts: ce0baa7e8533604ab579a37ada184832c716e61b285a04b2f97db9367b351df7
2026-01-23 15:04:03,770 - INFO - Loading state dict from /home/kostka/.cache/seillra/1.4/url_9c83e76615711914_wts
2026-01-23 15:04:03,921 - INFO - Model weights loaded and set to eval mode.
2026-01-23 15:04:03,974 - INFO - Checksum verified for url_f6f3b1c27e97399e_wts: ba3e530e53694c66a9573c0a38e0915986b95f19b99a962f357997eadaa33d44
2026-01-23 15:04:03,974 - INFO - Loading state dict from /home/kostka/.cache/seillra/1.4/url_f6f3b1c27e97399e_wts
2026-01-23 15:04:03,979 - INFO - Model weights loaded and set to eva

**Interface with data**

In [9]:
file_path = "./data/MPRA_eQTL.tsv"  # tab-separated variant table
df = pd.read_csv(file_path, sep="\t", header=0)
print(df.head())

  chrom      pos ref alt  strand     gene consequence
0  chr1   966179   G   A       1  PLEKHN1        over
1  chr1  1000335   C   T      -1     HES4        none
2  chr1  6602853   G   A      -1   KLHL21        none
3  chr1  6602971   G   C      -1   KLHL21        none
4  chr1  6613720   C   G       1    PHF13        none


**Use `grelu` to one-hot encode the reference and alternative sequences**

In [26]:
import grelu.sequence.format
import torch
import tqdm
import grelu.io.genome

genome = grelu.io.genome.get_genome("hg38")

nvars = df.shape[0]
refs = torch.empty(nvars, 4, 4096)
alts = torch.empty(nvars, 4, 4096)
labs = torch.empty(nvars)


for index, row in tqdm.tqdm(enumerate(df.itertuples(index=False)), total=nvars):

    # - variant position
    pos = row.pos - 1

    # - chromosome sequence
    chrseq = genome[row.chrom]
    # - row.chrom already carries the "chr" prefix, so it is not repeated in the message
    assert (
        str(row.ref).upper() == str(chrseq[pos]).upper()
    ), f"Reference mismatch at {row.chrom}:{pos + 1}"

    # - alternative sequence
    alt_seq = (
        str(chrseq[pos - 2048 : pos]) + row.alt + str(chrseq[pos + 1 : pos + 2048])
    ).upper()
    # - reference sequence
    ref_seq = str(chrseq[pos - 2048 : pos + 2048]).upper()
    # - one-hot encoding
    alt_one_hot = grelu.sequence.format.convert_input_type(alt_seq, "one_hot")
    alt_one_hot = alt_one_hot.unsqueeze(0)
    ref_one_hot = grelu.sequence.format.convert_input_type(ref_seq, "one_hot")
    ref_one_hot = ref_one_hot.unsqueeze(0)

    # - store
    refs[index] = ref_one_hot
    alts[index] = alt_one_hot

    # - label
    if row.consequence == "over":
        labs[index] = 1
    else:
        labs[index] = 0

100%|██████████| 686/686 [00:08<00:00, 81.43it/s] 
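
The window arithmetic in the cell above can be checked on a toy sequence. The following is a minimal sketch, not the `grelu` implementation: `one_hot` and `extract_windows` are hypothetical helpers, the A, C, G, T row order is an assumption, and the window is shrunk from 4096 to 8 bases so the result can be read by eye. It shows the 1-based to 0-based conversion and that the ref and alt windows have equal length and differ only at the center index (2048 in the notebook, 4 here).

```python
# Toy illustration only: `one_hot` and `extract_windows` are hypothetical helpers,
# not part of grelu or seillra.
BASES = "ACGT"  # assumed channel order


def one_hot(seq):
    """Return a 4 x len(seq) nested list; rows follow the order A, C, G, T."""
    return [[1.0 if base == b else 0.0 for base in seq] for b in BASES]


def extract_windows(chrom_seq, pos_1based, ref, alt, half_width):
    """Build equal-length ref/alt windows around a single-nucleotide variant."""
    pos = pos_1based - 1  # 1-based variant position -> 0-based string index
    if chrom_seq[pos].upper() != ref.upper():
        raise ValueError(f"Reference mismatch at position {pos_1based}")
    start, stop = pos - half_width, pos + half_width
    if start < 0 or stop > len(chrom_seq):
        raise ValueError("Window extends past the end of the sequence")
    ref_seq = chrom_seq[start:stop].upper()
    # Same flanks as the reference window, with the alt allele substituted at `pos`
    alt_seq = (chrom_seq[start:pos] + alt + chrom_seq[pos + 1 : stop]).upper()
    return ref_seq, alt_seq


chrom_seq = "ACGTACGTACGTACGT"  # stand-in for a chromosome
ref_seq, alt_seq = extract_windows(chrom_seq, pos_1based=9, ref="A", alt="G", half_width=4)
print(ref_seq)  # ACGTACGT
print(alt_seq)  # ACGTGCGT
alt_encoded = one_hot(alt_seq)  # 4 rows (A, C, G, T) by 8 positions
```

Like the notebook cell, this sketch only handles single-nucleotide substitutions; an insertion or deletion would change the window length and needs separate handling.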


**Make a `DataLoader`**

In [28]:
from torch.utils.data import Dataset, DataLoader


class VarD(Dataset):
    def __init__(self, refs, alts, labels):
        self.refs = refs
        self.alts = alts
        self.labels = labels

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        ref = self.refs[idx]
        alt = self.alts[idx]
        label = self.labels[idx]
        return ref, alt, label


var_ds = VarD(refs, alts, labs)
var_dl = DataLoader(var_ds, 16)
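
`DataLoader(var_ds, 16)` sets `batch_size=16` and leaves `shuffle` and `drop_last` at their defaults (`False`), so batches follow the row order of `df` and the last, smaller batch is kept. The dependency-free sketch below mirrors that batching arithmetic for the 686 variants; `batch_bounds` is a hypothetical helper used only for illustration.

```python
def batch_bounds(n_items, batch_size):
    """Return (start, stop) index pairs in the order an unshuffled loader visits them."""
    return [
        (start, min(start + batch_size, n_items))
        for start in range(0, n_items, batch_size)
    ]


bounds = batch_bounds(686, 16)
n_batches = len(bounds)
last_batch_size = bounds[-1][1] - bounds[-1][0]
print(n_batches, last_batch_size)  # 43 14
```

Because the order is preserved, predictions concatenated across batches line up with the rows of `df`, which the later index-based filtering relies on.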

**Predict Variant Effect**

In [None]:
import numpy as np

model.eval()
all_sc_ref = []
all_sc_alt = []
all_labs = []

# - gradients are not needed for inference
with torch.no_grad():
    for batch in tqdm.tqdm(var_dl, desc="Predicting variant data"):
        ref, alt, label = batch

        # - note that the model in "variant" mode takes care of reverse complementing internally
        sc_outputs = model((ref, alt))  # sequence classes: index 0 = ref, index 1 = alt

        # - accumulate per-batch outputs; they are concatenated after the loop
        all_sc_ref.append(sc_outputs[0])
        all_sc_alt.append(sc_outputs[1])
        all_labs.append(label)

all_sc_ref = torch.cat([t.detach() for t in all_sc_ref], dim=0).numpy()
all_sc_alt = torch.cat([t.detach() for t in all_sc_alt], dim=0).numpy()
all_labs = np.concatenate(all_labs, axis=0)

Predicting variant data: 100%|██████████| 43/43 [02:23<00:00,  3.33s/it]
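
The code comment above notes that the model handles reverse complementing internally in "variant" mode. For readers unfamiliar with the operation, here is a minimal sketch of reverse complementing a sequence and its one-hot encoding. It is not `seillra`'s implementation, and how the model combines the two strands is not shown here; the helper names and the A, C, G, T row order are assumptions.

```python
# Illustrative helpers only; not taken from seillra or grelu.
BASES = "ACGT"  # assumed channel order
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}


def one_hot(seq):
    """Return a 4 x len(seq) nested list; rows follow the order A, C, G, T."""
    return [[1.0 if base == b else 0.0 for base in seq] for b in BASES]


def reverse_complement(seq):
    """Reverse the sequence and complement each base."""
    return "".join(COMPLEMENT[base] for base in reversed(seq.upper()))


def reverse_complement_one_hot(encoded):
    """Reverse complement a 4 x L one-hot encoding with rows ordered A, C, G, T."""
    # Reversing the row order swaps A<->T and C<->G;
    # reversing each row flips the sequence direction.
    return [row[::-1] for row in encoded[::-1]]


seq = "ACGTGCGT"
rc_seq = reverse_complement(seq)
print(rc_seq)  # ACGCACGT
rc_encoded = reverse_complement_one_hot(one_hot(seq))
```

With this row order, flipping both axes of the one-hot array gives the same result as encoding the reverse-complemented string, so no re-encoding is needed.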


In [50]:
from sklearn.metrics import roc_auc_score

# - find the promoter index from class annotations (np.where returns a tuple; take its first element)
promoter_idx = np.where(np.array(model.proj.class_annot) == "P Promoter")[0]
# - calculate the variant score as alt minus ref: overexpression -> more promoter-like
variant_score = (all_sc_alt[:, promoter_idx] - all_sc_ref[:, promoter_idx]).squeeze()

# - find indices where df.consequence is "over" or "under"
over_under_idx = df.index[df["consequence"].isin(["over", "under"])]

auc = roc_auc_score(all_labs[over_under_idx], variant_score[over_under_idx])
print(f"Promoter AUC: {auc:.2f}")

Promoter AUC: 0.81
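
The AUC can be read as the probability that a randomly chosen "over" variant receives a higher promoter score difference than a randomly chosen "under" variant. The sketch below computes that pairwise statistic directly on made-up numbers. `roc_auc` is a hypothetical pure-Python helper meant for intuition, not a replacement for `roc_auc_score`, and the scores are invented for illustration, not taken from the model.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("Need at least one positive and one negative label")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Made-up promoter-class scores for four variants (not model output)
ref_scores = [0.2, 0.5, 0.4, 0.6]
alt_scores = [0.5, 0.4, 0.5, 0.4]
labels = [1, 1, 0, 0]  # 1 = "over", 0 = "under"

# Variant score = alt minus ref, as in the cell above
variant_scores = [alt - ref for alt, ref in zip(alt_scores, ref_scores)]
toy_auc = roc_auc(labels, variant_scores)
print(f"Toy AUC: {toy_auc:.2f}")  # Toy AUC: 0.75
```

Here three of the four positive/negative pairs are ordered correctly, giving 0.75. An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation.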
