In [1]:
import os
os.environ['CUDA_VISIBLE_DEVICES'] = '0'

### Wrappers are Productivity Hacks

A common issue with machine learning models in regulatory genomics is that their outputs are not.. exactly.. what you need. Many downstream analyses work best when a model predicts a single number per example, but many of the best-known models predict more than that. For example, BPNet predicts a base-pair resolution profile and also a count, and Enformer predicts a binned profile. How can you use these models with existing downstream functions that are not built for those outputs?

One potential solution is to modify all of your analysis functions to slice and dice and aggregate the outputs from models until they are the right shape. Maybe you write a `bpnet_deep_lift_shap` function that is a copy of the `deep_lift_shap` function in tangermeme but slices out the profile head and just operates on the counts. Or.. maybe you write an `enformer_deep_lift_shap` function that sums the track across the length dimension before calculating attributions. Although these might technically work, they also leave a lot of brittle, near-duplicate code lying around that makes things messy.

An alternate solution is to use wrappers! PyTorch conveniently allows you to put a model within another model, colloquially called a wrapper, that can be extremely flexible. 

#### Slicing and Dicing Inputs and Outputs

Let's take a look using the built-in wrappers for bpnet-lite, a light-weight library for loading and using BPNet and ChromBPNet models.

First, we can load up a BPNet model.

In [2]:
import torch

model = torch.load("../../../../models/bpnet/GATA2.torch", weights_only=False)

For those unfamiliar with the model, let's take a look at its output on a random sequence. BPNet models additionally need a control track, which holds the control experiment counts on each strand and is usually set to all zeroes after training.

In [3]:
from tangermeme.utils import random_one_hot
from tangermeme.predict import predict

X = random_one_hot((1, 4, 2114)).float()
X_ctl = torch.zeros_like(X)[:, :2]

predict(model, X, args=(X_ctl,))

[tensor([[[ 0.4295,  0.5723,  0.4151,  ..., -0.1043, -0.0767, -0.1003],
          [-0.2152,  0.0224,  0.1121,  ..., -0.3272, -0.2603, -0.2479]]]),
 tensor([[0.3379]])]

Okay, looks like the output is a pair of tensors where the first tensor contains logits for the profile predictions and the second tensor contains count predictions across both strands for that locus. Since we passed in only a single example, both tensors have size 1 for the first dimension.

But... if I wanted attributions, what happens?

In [4]:
from tangermeme.deep_lift_shap import deep_lift_shap

deep_lift_shap(model, X, args=(X_ctl,))

TypeError: tuple indices must be integers or slices, not tuple

We get an error. This is because `deep_lift_shap`, like many other downstream functions, cannot handle outputs of arbitrary shape. In this case, it assumes the model returns a single tensor with one number per example, not a tuple of tensors.
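To make the failure mode concrete, here is a minimal sketch of how a tuple output can produce this kind of error, assuming the downstream code indexes the model output as though it were a single tensor. `TwoHeadToy` is a hypothetical stand-in rather than BPNet, and the exact line inside tangermeme that raises may differ.

```python
import torch

class TwoHeadToy(torch.nn.Module):
    """Hypothetical stand-in for a model with a profile head and a count head."""
    def __init__(self):
        super().__init__()
        self.conv = torch.nn.Conv1d(4, 2, kernel_size=5, padding=2)

    def forward(self, X):
        profile = self.conv(X)                          # (n, 2, length)
        counts = profile.sum(dim=(1, 2)).unsqueeze(-1)  # (n, 1)
        return profile, counts

toy = TwoHeadToy()
X_toy = torch.zeros(1, 4, 20)
X_toy[:, 0] = 1  # an all-A one-hot sequence

y = toy(X_toy)

# Tensor-style indexing is valid on a tensor but not on a Python tuple.
try:
    y[:, 0]
    error_message = None
except TypeError as e:
    error_message = str(e)

print(type(y).__name__, "->", error_message)
```

Indexing the second element first, as in `y[1][:, 0]`, works fine, which is exactly the step a wrapper can take on the function's behalf.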

Time for a wrapper! Let's do something simple and just slice out the count predictions.

In [5]:
class CountWrapper(torch.nn.Module):
    def __init__(self, model):
        super(CountWrapper, self).__init__()
        self.model = model
    
    def forward(self, X, *args):
        return self.model(X, *args)[1]

All we are doing here is running the underlying model but only returning the second output, which for BPNet models is the count head.

In [6]:
count_model = CountWrapper(model)

predict(count_model, X, args=(X_ctl,))

tensor([[0.3379]])

Simple! Now we can pass this into anything downstream without having to modify that function.

In [7]:
deep_lift_shap(count_model, X, args=(X_ctl,))

tensor([[[-0.0000e+00, -0.0000e+00, -0.0000e+00,  ..., -0.0000e+00,
           0.0000e+00,  0.0000e+00],
         [ 0.0000e+00, -1.0636e-08, -8.1422e-09,  ..., -0.0000e+00,
          -0.0000e+00, -0.0000e+00],
         [ 0.0000e+00,  0.0000e+00, -0.0000e+00,  ..., -0.0000e+00,
           0.0000e+00,  0.0000e+00],
         [ 0.0000e+00, -0.0000e+00,  0.0000e+00,  ...,  0.0000e+00,
           0.0000e+00,  0.0000e+00]]])

Hooray, it ran.

But it's been pretty annoying to have to keep passing in this empty control track, right? Good news. Wrappers can do more than just modify the outputs from models -- they can really do anything.

Let's make a wrapper that automatically creates an empty control track that matches the input in size. Naturally, if we wanted to pass in informative control tracks this wrapper wouldn't be helpful, but if all we are doing in these downstream steps is using an all-zeroes track it should be fine.

In [8]:
class ControlWrapper(torch.nn.Module):
    def __init__(self, model):
        super(ControlWrapper, self).__init__()
        self.model = model
    
    def forward(self, X):
        X_ctl = torch.zeros_like(X)[:, :2]
        return self.model(X, X_ctl)

Using this wrapper means that we no longer have to pass in the `X_ctl` input.

In [9]:
control_model = ControlWrapper(count_model)

predict(control_model, X)

tensor([[0.3379]])

Same output as before, without having to write out `args=(X_ctl,)`! 
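Because `CountWrapper.forward` accepts `*args` and `ControlWrapper.forward` only takes `X`, the two wrappers can be nested in either order. The sketch below checks this on a toy model; `ToyBPNet` is a hypothetical stand-in that only mimics the two-input, two-output interface, and the wrapper classes are local copies of the ones defined above.

```python
import torch

class ToyBPNet(torch.nn.Module):
    """Hypothetical model mimicking BPNet's (X, X_ctl) -> (profile, counts) interface."""
    def __init__(self):
        super().__init__()
        self.conv = torch.nn.Conv1d(6, 2, kernel_size=3, padding=1)

    def forward(self, X, X_ctl):
        profile = self.conv(torch.cat([X, X_ctl], dim=1))  # (n, 2, length)
        counts = profile.mean(dim=(1, 2)).unsqueeze(-1)     # (n, 1)
        return profile, counts

class CountWrapper(torch.nn.Module):
    def __init__(self, model):
        super().__init__()
        self.model = model

    def forward(self, X, *args):
        # Keep only the second output (the counts).
        return self.model(X, *args)[1]

class ControlWrapper(torch.nn.Module):
    def __init__(self, model):
        super().__init__()
        self.model = model

    def forward(self, X):
        # Build an all-zeroes control track with two strands.
        X_ctl = torch.zeros_like(X)[:, :2]
        return self.model(X, X_ctl)

torch.manual_seed(0)
toy = ToyBPNet()
X_toy = torch.randn(3, 4, 50)

with torch.no_grad():
    y_a = CountWrapper(ControlWrapper(toy))(X_toy)
    y_b = ControlWrapper(CountWrapper(toy))(X_toy)
    y_ref = toy(X_toy, torch.zeros(3, 2, 50))[1]

print(torch.allclose(y_a, y_b), torch.allclose(y_a, y_ref))
```

The order starts to matter once a wrapper changes the call signature in a way the next wrapper does not expect, so it is worth a quick check like this whenever wrappers are stacked.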

These wrappers are already implemented in bpnet-lite and make using those models really easy. Loading just becomes the following.

In [10]:
from bpnetlite.bpnet import CountWrapper
from bpnetlite.bpnet import ControlWrapper

model = torch.load("../../../../models/bpnet/GATA2.torch", weights_only=False)
model = CountWrapper(ControlWrapper(model))

Saving some characters like this is always nice, but this sort of wrapper, which allows you to control or modify the inputs, is extremely valuable when working with code that may not be as flexible as you would like -- an unfortunately common occurrence in research settings. Imagine that, for example, you want to use a function from another library but that function *does not allow you to pass in additional arguments past the sequence*. Here, you would not be able to use a BPNet model *even though the additional argument is an uninformative all-zeroes tensor*, because the control track argument is required.

Let's pretend we are in a setting where we have an actually informative control track but the function simply does not allow us to pass in more than a single tensor that is intended to be the one-hot encoded sequence. Using the above wrapper does not work because we do not want all zeroes -- we want actual values!

A potential solution (assuming the code does not do dimension checking) is to concatenate the two inputs together and have a wrapper separate them out internally. Basically, because in this case both tensors are the same length, we can concatenate the four channels of the one-hot encoding with the two strand channels of the control track into a tensor of shape `(n, 6, 2114)`.

In [11]:
class ControlSplitter(torch.nn.Module):
    def __init__(self, model):
        super(ControlSplitter, self).__init__()
        self.model = model
        
    def forward(self, X):
        return self.model(X[:, :4], X[:, 4:])
    
splitter_model = ControlSplitter(count_model)

X_ctl2 = torch.abs(torch.randn(1, 2, 2114))
Xp_ctl = torch.cat([X, X_ctl2], axis=1)

predict(count_model, X, args=(X_ctl2,)), predict(splitter_model, Xp_ctl)

(tensor([[6.5609]]), tensor([[6.5609]]))

Looks like we get the same answer in either case.

#### Advanced Slicing+Dicing with Enformer

Enformer poses potentially even more challenges to work with than BPNet models. Its output comes in the form of a dictionary, because predictions are made for both human and mouse tracks. For each species, predictions are made for thousands of different experiments, and for each track multiple bins are reported. Almost no function is going to have built-in functionality to go from the raw predictions of this model to the single number one might be interested in using.

Oh, also, the `enformer_pytorch` implementation expects one-hot inputs with the length dimension before the channel dimension, i.e. `(batch, length, 4)`, even though the usual PyTorch convention for sequence inputs is that the length dimension goes last. This fact alone can break many functions.
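As a quick illustration of the layout mismatch, the snippet below shows what `permute(0, 2, 1)` does to a one-hot tensor in the `(batch, 4, length)` layout used elsewhere in this notebook. It only manipulates shapes and does not call Enformer.

```python
import torch

# Layout used by tangermeme and BPNet: (batch, channels, length).
X_toy = torch.zeros(1, 4, 2114)
X_toy[0, 2, 7] = 1  # a G at position 7, assuming an ACGT channel order

# Layout after swapping the last two dimensions: (batch, length, channels).
X_swapped = X_toy.permute(0, 2, 1)

print(tuple(X_toy.shape), "->", tuple(X_swapped.shape))
```

`permute` returns a view rather than a copy, so doing this inside a wrapper's `forward` adds essentially no memory overhead.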

What to do? A wrapper!

In [12]:
import os
os.environ['POLARS_ALLOW_FORKING_THREAD'] = '1'  # Needed for Enformer for whatever reason

from enformer_pytorch import from_pretrained

class EnformerInputSwapper(torch.nn.Module):
    def __init__(self, model):
        super(EnformerInputSwapper, self).__init__()
        self.model = model
    
    def forward(self, X):
        return self.model(X.permute(0, 2, 1))
    

enformer_base = from_pretrained('EleutherAI/enformer-official-rough', target_length=16, use_tf_gamma=False)
enformer = EnformerInputSwapper(enformer_base)
enformer(X)

{'human': tensor([[[0.0840, 0.1147, 0.2047,  ..., 0.0486, 0.8827, 0.8969],
          [0.0832, 0.0926, 0.1280,  ..., 0.0093, 0.0721, 0.0559],
          [0.0995, 0.1134, 0.1707,  ..., 0.0017, 0.0161, 0.0112],
          ...,
          [0.0958, 0.1037, 0.1234,  ..., 0.0039, 0.0474, 0.0564],
          [0.0682, 0.0980, 0.1097,  ..., 0.0054, 0.0606, 0.1010],
          [0.0879, 0.1223, 0.1424,  ..., 0.0026, 0.0240, 0.0295]]],
        grad_fn=<SoftplusBackward0>),
 'mouse': tensor([[[0.0811, 0.0877, 0.0587,  ..., 0.6569, 5.2159, 1.7070],
          [0.0698, 0.1061, 0.0628,  ..., 0.2593, 0.6219, 0.4996],
          [0.0818, 0.0929, 0.0783,  ..., 0.2032, 0.2541, 0.3682],
          ...,
          [0.0584, 0.0597, 0.0430,  ..., 0.1296, 0.1003, 0.2206],
          [0.0301, 0.0577, 0.0282,  ..., 0.1081, 0.0935, 0.1894],
          [0.0332, 0.0570, 0.0394,  ..., 0.1293, 0.1594, 0.2478]]],
        grad_fn=<SoftplusBackward0>)}

Now that we have resolved the dimension issue, we can see the dictionary that gets provided by the implementation. Even if functions can handle slicing out indexes from tuples (almost never, anyway), they are even less likely to be able to handle indexing into dictionaries. So, let's put that into our wrapper.

In [13]:
class EnformerWrapper(torch.nn.Module):
    def __init__(self, model):
        super(EnformerWrapper, self).__init__()
        self.model = model
    
    def forward(self, X):
        return self.model(X.permute(0, 2, 1))['human']
    
enformer = EnformerWrapper(enformer_base)

predict(enformer, X)

tensor([[[0.0840, 0.1147, 0.2047,  ..., 0.0486, 0.8827, 0.8969],
         [0.0832, 0.0926, 0.1280,  ..., 0.0093, 0.0721, 0.0559],
         [0.0995, 0.1134, 0.1707,  ..., 0.0017, 0.0161, 0.0112],
         ...,
         [0.0958, 0.1037, 0.1234,  ..., 0.0039, 0.0474, 0.0564],
         [0.0682, 0.0979, 0.1096,  ..., 0.0054, 0.0606, 0.1010],
         [0.0879, 0.1223, 0.1423,  ..., 0.0026, 0.0240, 0.0295]]])

Great, now the input and output are in formats that can be readily used by tangermeme. But what do we do next? Well, we can collapse the predictions across the length dimension. Here, predictions are made for each 128bp bin in the sequence and we could just sum the values across those bins.

In [14]:
class EnformerWrapper2(torch.nn.Module):
    def __init__(self, model):
        super(EnformerWrapper2, self).__init__()
        self.model = model
    
    def forward(self, X):
        return self.model(X.permute(0, 2, 1))['human'].sum(dim=-2)
    
enformer = EnformerWrapper2(enformer_base)

predict(enformer, X)

tensor([[1.8133, 2.1466, 3.1470,  ..., 0.0994, 1.3929, 1.4304]])

At this point we have solved several issues in both the input and output of the Enformer model using only a few lines of code and this wrapper should make it easily usable by most downstream functions without needing to modify them. As a final addition, we could slice out a specific target from the 5313 outputs for humans, yielding a single prediction per example from Enformer.

In [15]:
class EnformerWrapper3(torch.nn.Module):
    def __init__(self, model, target):
        super(EnformerWrapper3, self).__init__()
        self.model = model
        self.target = target
    
    def forward(self, X):
        return self.model(X.permute(0, 2, 1))['human'].sum(dim=-2)[:, self.target]
    
enformer = EnformerWrapper3(enformer_base, 15)

predict(enformer, X)

tensor([2.0108])
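The whole chain of steps (swap dimensions, pick a species, sum over bins, pick a target) can be checked without downloading Enformer by using a stand-in that only mimics the input and output layout. In this sketch `FakeEnformer` and `TargetWrapper` are hypothetical names, the track counts are arbitrary small numbers, and keeping a trailing dimension of size 1 in the output is a design choice that some downstream code may expect, so check the function you are calling.

```python
import torch
import torch.nn.functional as F

class FakeEnformer(torch.nn.Module):
    """Hypothetical stand-in: takes (batch, length, 4), returns a dict of
    (batch, bins, tracks) tensors. It is not a real sequence model."""
    def __init__(self, n_bins=16, n_human=10, n_mouse=7):
        super().__init__()
        self.n_bins = n_bins
        self.human = torch.nn.Linear(4, n_human)
        self.mouse = torch.nn.Linear(4, n_mouse)

    def forward(self, X):
        # Average over length, then repeat the result for each output bin.
        pooled = X.mean(dim=1, keepdim=True).expand(-1, self.n_bins, -1)
        return {'human': F.softplus(self.human(pooled)),
                'mouse': F.softplus(self.mouse(pooled))}

class TargetWrapper(torch.nn.Module):
    def __init__(self, model, target, species='human'):
        super().__init__()
        self.model = model
        self.target = target
        self.species = species

    def forward(self, X):
        y = self.model(X.permute(0, 2, 1))[self.species]  # (batch, bins, tracks)
        y = y.sum(dim=-2)                                  # (batch, tracks)
        return y[:, self.target:self.target + 1]           # (batch, 1)

torch.manual_seed(0)
fake = FakeEnformer()
X_toy = torch.zeros(2, 4, 2114)
X_toy[:, 0] = 1  # all-A one-hot sequences

with torch.no_grad():
    raw = fake(X_toy.permute(0, 2, 1))
    y = TargetWrapper(fake, target=3)(X_toy)

print({k: tuple(v.shape) for k, v in raw.items()}, tuple(y.shape))
```

Slicing with `target:target + 1` instead of indexing with `target` is what preserves the second dimension.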

#### Correcting Mismatched Shapes

Sometimes you want to compare the predictions from two models but the models do not operate on the same sequence length. Naturally, no wrapper can magically expand the model to operate faithfully outside the hard constraints it was trained on, but a wrapper can resize the inputs to the expected shape and that caveat can be noted.

For example, the above Enformer model could be applied directly to a sequence of the same length as the BPNet model. Let's load a larger version of it that requires a ~3kbp input.

In [16]:
enformer_base = from_pretrained('EleutherAI/enformer-official-rough', target_length=24, use_tf_gamma=False)
enformer = EnformerWrapper3(enformer_base, 15)

predict(enformer, X)

ValueError: sequence length 17 is less than target length 24

Oh no, an error message. 

Well, there are two things we can do. The first thing is that we can make a wrapper that plops the 2114bp sequence into the middle of an otherwise-zeroes tensor. This would basically be like padding the sequence with Ns on both sides.

In [17]:
class PaddingWrapper(torch.nn.Module):
    def __init__(self, model, n_padding=500):
        super(PaddingWrapper, self).__init__()
        self.model = model
        self.n_padding = n_padding
    
    def forward(self, X):
        X_ = torch.zeros(X.shape[0], X.shape[1], X.shape[2]+self.n_padding*2, dtype=X.dtype, device=X.device)
        X_[:, :, self.n_padding:self.n_padding+X.shape[2]] = X
        return self.model(X_)
    
enformer_pad = PaddingWrapper(enformer, 500)

predict(enformer_pad, X)

tensor([1.9488])

Ta-da, fixed! This wrapper allows us to pass the same input tensor to models even when they require different shapes. An important caveat is that if these models were not trained to handle Ns, they can technically still make predictions on the padded sequences, but those sequences will be out of distribution and the predictions may not be robust.
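The same zero-padding can also be written with `torch.nn.functional.pad`, which pads the last dimension when given a `(left, right)` pair. The sketch below uses a hypothetical `PadWithN` wrapper around `torch.nn.Identity` so that the padded tensor itself can be inspected and compared against the manual approach.

```python
import torch
import torch.nn.functional as F

class PadWithN(torch.nn.Module):
    """Hypothetical variant of the padding wrapper built on F.pad."""
    def __init__(self, model, n_padding=500):
        super().__init__()
        self.model = model
        self.n_padding = n_padding

    def forward(self, X):
        # (left, right) padding on the last dimension, filled with zeroes.
        return self.model(F.pad(X, (self.n_padding, self.n_padding)))

X_toy = torch.zeros(1, 4, 2114)
X_toy[:, 1] = 1  # all-C one-hot sequence

# Identity returns its input, so this exposes what the wrapped model would see.
padded = PadWithN(torch.nn.Identity(), 500)(X_toy)

# Manual version, mirroring the approach of filling the middle of a zeros tensor.
manual = torch.zeros(1, 4, 2114 + 2 * 500)
manual[:, :, 500:500 + 2114] = X_toy

print(tuple(padded.shape), torch.equal(padded, manual))
```

Both versions keep the dtype and device of the input, so the choice between them is mostly a matter of taste.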

A second way to handle this issue is to make predictions on a longer sequence and then trim it down for the BPNet models.

In [18]:
class TrimmingWrapper(torch.nn.Module):
    def __init__(self, model, n_trim):
        super(TrimmingWrapper, self).__init__()
        self.model = model
        self.n_trim = n_trim
        
    def forward(self, X):
        return self.model(X[:, :, self.n_trim:-self.n_trim])

This wrapper will trim the edges off the sequence and then make predictions using the stored model on the trimmed sequence. This approach may be more robust than adding the Ns because the entire length of the sequence is real. However, a weakness of this approach is that if there is critical information in the flanks that gets trimmed out, the model has no chance of responding to it correctly.

In [19]:
X2 = random_one_hot((1, 4, 3000)).float()
trim_bpnet = TrimmingWrapper(control_model, (3000-2114)//2)

predict(enformer, X2), predict(trim_bpnet, X2)

(tensor([2.1116]), tensor([[0.9279]]))

A technical detail here is that the BPNet models can actually be run on sequences of any length but because they were only trained on sequences of length 2114 the predictions may be unreliable on other lengths.
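If you would rather specify the length the model expects than the number of positions to remove, a center-cropping variant is one option. This is a sketch with a hypothetical `CenterCropWrapper`; it also handles the case where the input is already the right length, whereas a negative-index slice such as `X[:, :, n:-n]` returns an empty tensor when `n` is 0.

```python
import torch

class CenterCropWrapper(torch.nn.Module):
    """Hypothetical wrapper that crops the input to a fixed length around its center."""
    def __init__(self, model, length):
        super().__init__()
        self.model = model
        self.length = length

    def forward(self, X):
        extra = X.shape[-1] - self.length
        if extra < 0:
            raise ValueError("input is shorter than the requested length")
        start = extra // 2
        return self.model(X[..., start:start + self.length])

# Identity returns its input, so the cropped tensor can be inspected directly.
crop = CenterCropWrapper(torch.nn.Identity(), length=2114)

# Fill each channel with position indices so the crop boundaries are visible.
X_long = torch.arange(3000, dtype=torch.float32).repeat(1, 4, 1)

y_long = crop(X_long)
y_same = crop(X_long[..., :2114])

print(tuple(y_long.shape), y_long[0, 0, 0].item(), tuple(y_same.shape))
```

When the difference in lengths is odd, this version drops the extra position from the right side, which is an arbitrary choice.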

#### Squishing Models Together

So far, we have shown how one can modify the inputs to a model and the outputs from a model using wrappers. But we can also squish models together into the same object so that they act like a single model! Although forward and backward passes will still take the same amount of time as running each model separately (wrapping models together is not compressing them), having a single object can sometimes be more manageable.

In [20]:
class SquishWrapper(torch.nn.Module):
    def __init__(self, models):
        super(SquishWrapper, self).__init__()
        # ModuleList registers each model so .to(), .cuda(), and .eval() reach them
        self.models = torch.nn.ModuleList(models)
    
    def forward(self, X):
        return torch.cat([model(X) for model in self.models], axis=-1)

This wrapper gives us an object that takes in an input, applies a series of models to it, and returns the concatenated predictions from all of the models. Let's test it out with three BPNet models and one ChromBPNet model.

In [21]:
from bpnetlite import BPNet

model0 = torch.load("../../../../models/bpnet/GATA2.torch", weights_only=False)
model0 = CountWrapper(ControlWrapper(model0))

model1 = torch.load("../../../../models/bpnet/SOX6.torch", weights_only=False)
model1 = CountWrapper(ControlWrapper(model1))

model2 = torch.load("../../../../models/bpnet/CTCF.torch", weights_only=False)
model2 = CountWrapper(ControlWrapper(model2))

model3 = BPNet.from_chrombpnet("../../../../models/chrombpnet/fold_0/model.chrombpnet_nobias.fold_0.ENCSR868FGK.h5")
model3 = CountWrapper(model3).cuda()

wrapper = SquishWrapper([model0, model1, model2, model3])

predict(wrapper, X).shape, predict(model0, X).shape

(torch.Size([1, 4]), torch.Size([1, 1]))

Looks like it just simply works out of the box even with models that have different sizes and inputs. Each individual model returns only a single value -- the count prediction -- but the wrapper here returns the four numbers together. Because the wrapper is a model, it can be passed into any downstream function like `saturated_mutagenesis` or `marginalize`, etc.

This example also demonstrates how one can stack many wrappers on top of each other to get the desired output without needing to modify the underlying model. Remember, the BPNet models here take in a control track and output a profile output and a count output, but the ChromBPNet model does not need a control track. Using these stacks, we've removed the need to specify the control track input for only those models that previously needed it, sliced out all the profile outputs, and concatenated together the count outputs. No big deal when you're using wrappers.
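One detail worth knowing when a wrapper holds several models: PyTorch only tracks submodules that are assigned as module attributes or stored in a container such as `torch.nn.ModuleList`. Models kept in a plain Python list still run, but calls like `.to(device)`, `.eval()`, and `.parameters()` on the wrapper will not reach them. The sketch below compares the two with toy linear models; the class names are hypothetical.

```python
import torch

class ListSquish(torch.nn.Module):
    def __init__(self, models):
        super().__init__()
        self.models = models  # plain list: submodules are NOT registered

    def forward(self, X):
        return torch.cat([m(X) for m in self.models], dim=-1)

class ModuleListSquish(torch.nn.Module):
    def __init__(self, models):
        super().__init__()
        self.models = torch.nn.ModuleList(models)  # submodules are registered

    def forward(self, X):
        return torch.cat([m(X) for m in self.models], dim=-1)

def make_models():
    # Three toy models that each return one number per example.
    return [torch.nn.Linear(8, 1) for _ in range(3)]

plain = ListSquish(make_models())
registered = ModuleListSquish(make_models())

n_plain = len(list(plain.parameters()))
n_registered = len(list(registered.parameters()))

with torch.no_grad():
    y = registered(torch.randn(5, 8))

print(n_plain, n_registered, tuple(y.shape))
```

If every model is already on the right device and in eval mode, both versions give the same predictions, which is why the difference is easy to miss.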

#### Adding in Processing

Naturally, wrappers are not limited to modifying the inputs and outputs of models. They can also process the data within the wrapper itself, as we have seen with the Enformer example converting a profile into a single number.

As an example of this, let's consider reverse complementing. Most regulatory genomics models are trained and evaluated on sequences that are from only one strand. Sometimes, these examples are derived from reverse complementing another example, but the model still only sees one directionality at a time.

An alternate strategy is to make predictions on an example and its reverse complement and then to average those predictions together. This can be done either during training or only at evaluation time once a model has been trained. Making predictions on an example and its reverse complement is, of course, a huge hassle, particularly if you want to do anything downstream with the model like calculating attributions or marginalizations.

But it becomes trivial when you have a wrapper!

In [22]:
class RCWrapper(torch.nn.Module):
    def __init__(self, model):
        super(RCWrapper, self).__init__()
        self.model = model
        
    def forward(self, X):
        return (self.model(X) + self.model(torch.flip(X, dims=(-1, -2)))) / 2.0

In the above wrapper, predictions from the model are averaged between the forward and reverse complement versions of the sequence. *Importantly*, this wrapper does not handle stranded outputs, which would also have to be flipped before averaging. That is fine here only because the toy models make unstranded count predictions.
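For a model that does produce stranded profiles, one way to handle this is to flip the reverse complement prediction back before averaging. The sketch below assumes the output has shape `(n, 2, length)` with the two strands in the second-to-last dimension and the same length as the input; `StrandedRCWrapper` is a hypothetical name and the convolution is a toy stand-in for a profile model.

```python
import torch

class StrandedRCWrapper(torch.nn.Module):
    """Hypothetical RC-averaging wrapper for stranded profile outputs."""
    def __init__(self, model):
        super().__init__()
        self.model = model

    def forward(self, X):
        y_fwd = self.model(X)
        # Reverse complement the input: flip the length and channel dimensions.
        y_rc = self.model(torch.flip(X, dims=(-1, -2)))
        # Flip the output's length and strand dimensions to realign it.
        y_rc = torch.flip(y_rc, dims=(-1, -2))
        return (y_fwd + y_rc) / 2.0

torch.manual_seed(0)
toy_profile = torch.nn.Conv1d(4, 2, kernel_size=7, padding=3)
wrapper = StrandedRCWrapper(toy_profile)

X_toy = torch.randn(2, 4, 100)
X_toy_rc = torch.flip(X_toy, dims=(-1, -2))

with torch.no_grad():
    y = wrapper(X_toy)
    y_from_rc = wrapper(X_toy_rc)

# The wrapped model's prediction for the reverse complement is the flipped
# prediction for the original sequence.
print(torch.allclose(y_from_rc, torch.flip(y, dims=(-1, -2)), atol=1e-6))
```

Note that flipping the channel dimension only gives the complement when the alphabet order is symmetric, as it is for ACGT.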

In [23]:
wrapper = RCWrapper(model0)

predict(wrapper, X), predict(model0, X), (predict(model0, X) + predict(model0, torch.flip(X, dims=(-1, -2)))) / 2

(tensor([[0.3525]]), tensor([[0.3379]]), tensor([[0.3525]]))

Looks like the predictions from the wrapper are the same as running the forward and reverse versions of the sequence through. This is probably not particularly surprising, but having this single object that does all that processing is a whole lot easier for downstream applications.

In [24]:
from tangermeme.marginalize import marginalize

marginalize(wrapper, X, "GATAAC")

(tensor([[0.3525]]), tensor([[0.6588]]))

Here, we get the predictions before and after substituting a GATA-like motif into the sequence and see that the GATA model makes a higher prediction after substitution. But... remember that the wrapper is not making a single prediction each time: it averages the forward and reverse complement predictions both before and after substituting in the motif. In this one line, a total of four forward passes are happening.
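The count of forward passes can be verified with a toy model that records how often it is called. This is a hand-rolled stand-in for the before-and-after comparison, not tangermeme's `marginalize` implementation; `CountingModel` is hypothetical and `RCWrapper` is a local copy of the wrapper above.

```python
import torch

class CountingModel(torch.nn.Module):
    """Hypothetical toy model that counts how many times forward is called."""
    def __init__(self):
        super().__init__()
        self.n_calls = 0

    def forward(self, X):
        self.n_calls += 1
        return X.sum(dim=(1, 2)).unsqueeze(-1)

class RCWrapper(torch.nn.Module):
    def __init__(self, model):
        super().__init__()
        self.model = model

    def forward(self, X):
        return (self.model(X) + self.model(torch.flip(X, dims=(-1, -2)))) / 2.0

base = CountingModel()
wrapper = RCWrapper(base)

# A background sequence of all As, in ACGT channel order.
X_before = torch.zeros(1, 4, 30)
X_before[:, 0] = 1

# Substitute GATAAC at positions 12..17 (G=2, A=0, T=3, A=0, A=0, C=1).
motif_idx = torch.tensor([2, 0, 3, 0, 0, 1])
positions = torch.arange(12, 18)
X_after = X_before.clone()
X_after[0, :, 12:18] = 0
X_after[0, motif_idx, positions] = 1

with torch.no_grad():
    y_before = wrapper(X_before)
    y_after = wrapper(X_after)

print(base.n_calls)
```

The same trick is handy for estimating the cost of heavier analyses before running them on a real model.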

### Conclusions

Wrappers are productivity hacks. Rather than spending (potentially significant amounts of) time modifying or even rewriting code that someone else has written to accommodate the oddities of your model, you can write a few lines of code that make your model work within the assumptions of the function you want to use. Because wrappers are so light-weight and do not modify the original model itself, you can write them on-the-fly for any particular function or analysis you encounter and even include several of them in the same notebook, as we have done here!

This flexibility has saved me a significant amount of time. In the past, I would spend a lot of time looking for model implementations that matched my assumptions exactly, or even retraining models when I found out that the alphabet was flipped (looking at you, DeepSEA/Beluga with your AGCT alphabet). Wrappers are flexible enough to correct *any* issue in the input and output, as well as add in processing that would be very challenging to account for outside the context of a wrapper.
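As a last illustration, the flipped-alphabet problem can be handled by reordering the channels inside a wrapper instead of retraining. This is a sketch assuming your one-hot encodings use ACGT order and the wrapped model expects AGCT; `AlphabetWrapper` is a hypothetical name, and `torch.nn.Identity` stands in for the model so the reordered input can be inspected.

```python
import torch

class AlphabetWrapper(torch.nn.Module):
    """Hypothetical wrapper that reorders one-hot channels before calling the model."""
    def __init__(self, model, order=(0, 2, 1, 3)):
        super().__init__()
        self.model = model
        # A buffer follows the wrapper across devices with .to() / .cuda().
        self.register_buffer("order", torch.tensor(order))

    def forward(self, X):
        # Pick channels in the order the wrapped model expects.
        return self.model(X.index_select(1, self.order))

# One-hot encoding of the sequence "ACGT" in ACGT channel order.
X_acgt = torch.eye(4).unsqueeze(0)  # (1, 4, 4)

# Identity returns its input, so this shows what the wrapped model would see.
X_agct = AlphabetWrapper(torch.nn.Identity())(X_acgt)

print(X_agct[0])
```

In the reordered tensor, channel 1 now marks the G at position 2 and channel 2 marks the C at position 1, while A and T are unchanged.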