In [1]:
import os
os.environ['CUDA_VISIBLE_DEVICES'] = '0'

### Tutorial 8: Variant Effect

A common use case of sequence-based machine learning methods is predicting the effects of variants. By comparing the model's predictions before and after incorporating a variant, one can evaluate the predicted effect and, potentially, better understand its mechanism of action. For example, if a variant is predicted to stop some particular protein from binding in an enhancer, one could hypothesize that (1) the variant is involved in some change in phenotype and that (2) it acts by blocking the binding of this protein, which is important in some downstream pathway. In contrast, if a variant does not change the model's predictions, it may not be involved in the change in phenotype.

A challenge with evaluating the effects of variants is that there are many kinds of variants, and for some kinds the effect is not trivial to calculate. The most basic kind are <i>substitutions</i>, where one character is changed into another. A more complicated pair of variant types is <i>indels</i> (insertions/deletions), where one character is inserted into or deleted from a sequence. Finally, there are <i>structural variants</i>, where large blocks of sequence are inserted, deleted, or flipped with respect to a reference sequence.

Orthogonally to the type of mutation being considered, the evaluation can be done in two settings: <i>marginal</i>, where each variant is considered individually and independently of other variants, or <i>joint</i>, where all variants are incorporated simultaneously and the effect of each variant is just the difference in model predictions when centered on that variant.
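To make the distinction between the two settings concrete, here is a minimal sketch on plain strings. It is not how `tangermeme` implements either setting; the helper names `window` and `substitute`, the toy sequence, and the centering rule (`start = center - width // 2`) are all assumptions made for illustration.

```python
def window(seq, center, width):
    """Return `width` characters of `seq` around index `center` (assumed centering rule)."""
    start = center - width // 2
    return seq[start:start + width]

def substitute(seq, pos, alt):
    """Return a copy of `seq` with the character at zero-indexed `pos` replaced by `alt`."""
    return seq[:pos] + alt + seq[pos + 1:]

reference = "AACCGGTTAACC"
variants = [(5, "T"), (7, "A")]  # (zero-indexed position, new character)
width = 6

# Marginal: each variant is applied on its own to the untouched reference.
marginal = [window(substitute(reference, pos, alt), pos, width)
            for pos, alt in variants]

# Joint: every variant is applied first, then a window is cut around each one.
joint_seq = reference
for pos, alt in variants:
    joint_seq = substitute(joint_seq, pos, alt)
joint = [window(joint_seq, pos, width) for pos, _ in variants]

print(marginal)  # ['CCGTTT', 'GGTAAA']
print(joint)     # ['CCGTTA', 'GTTAAA']
```

Because the two variants fall within each other's windows, the joint windows each carry both changes while the marginal windows carry exactly one.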

Here, we will explore how to calculate the effect of substitutions and indels in the marginal setting. Note that the following examples are genomics based but the library and functions are general to any type of bio-sequence.

#### Substitutions

The simplest form of variant is the substitution, where one character is exchanged for another. These variants are the simplest to evaluate because one can simply extract reference sequences centered at the mutations and then swap out the characters.
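The extract-and-swap idea can be sketched on plain strings as follows. This is an illustration only, not the library's code: the function name `substitution_pair` is hypothetical, and the exact centering rule (the variant sits at offset `in_window // 2`) is an assumption.

```python
def substitution_pair(seq, pos, alt, in_window):
    """Return (reference window, edited window) for one substitution.

    `pos` is zero-indexed. The window is assumed to start at
    `pos - in_window // 2`; the library may center differently.
    """
    start = pos - in_window // 2
    end = start + in_window
    if start < 0 or end > len(seq):
        raise ValueError("window extends past the ends of the sequence")

    ref_window = seq[start:end]
    offset = pos - start  # index of the variant inside the window
    alt_window = ref_window[:offset] + alt + ref_window[offset + 1:]
    return ref_window, alt_window

seq = "ACCAGTAGTGTACCCACGTTGACCTA"
print(substitution_pair(seq, 8, "A", 10))   # ('AGTAGTGTAC', 'AGTAGAGTAC')
print(substitution_pair(seq, 14, "G", 10))  # ('GTACCCACGT', 'GTACCGACGT')
```

Both windows have the same length and differ at a single position, which is why substitutions are the easiest case to evaluate.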

We can use `tangermeme` to calculate the effect of each substitution individually using the `marginal_substitution_effect` function. This function takes in a model, the sequences (either the filename of a FASTA file or a dictionary whose keys are names like those in a FASTA file and whose values are one-hot encoded sequences), and the variants as a pandas DataFrame. The first column of the variants must be the key (matching the FASTA file or dictionary keys), the second column must be the position (ZERO INDEXED, NOT ONE INDEXED LIKE VCFS), and the third column must be the new character. Note that this differs slightly from VCF format in that you do not need to include the identity of the original character, since that information would not be used by the function. The names of the columns do not matter, only their order.

For the purpose of demonstrating usage, we can use our simple untrained dense model again.
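Since positions in a VCF are one-indexed, records taken from one need shifting before they fit the layout described above. The sketch below shows one way to do that in plain Python; the helper name `vcf_records_to_substitutions` and the tuple layout of the input records are assumptions for illustration, not part of `tangermeme`.

```python
def vcf_records_to_substitutions(records):
    """Convert (chrom, one-indexed pos, ref, alt) records to zero-indexed columns.

    Records whose ref or alt is longer than one character are indels rather
    than substitutions, so they are skipped here.
    """
    columns = {"chrom": [], "pos": [], "alt": []}
    for chrom, pos, ref, alt in records:
        if len(ref) != 1 or len(alt) != 1:
            continue
        columns["chrom"].append(chrom)
        columns["pos"].append(pos - 1)  # one-indexed VCF -> zero-indexed table
        columns["alt"].append(alt)
    return columns

records = [
    ("chr1", 9, "T", "A"),
    ("chr1", 15, "C", "G"),
    ("chr1", 20, "T", "TA"),  # an insertion, so it is dropped
]
columns = vcf_records_to_substitutions(records)
print(columns)  # {'chrom': ['chr1', 'chr1'], 'pos': [8, 14], 'alt': ['A', 'G']}
```

Passing `columns` to `pandas.DataFrame` yields a table with the key, position, and new character in the required column order.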

In [2]:
import torch
from tangermeme.utils import random_one_hot

class FlattenDense(torch.nn.Module):
	def __init__(self, seq_len=10):
		super(FlattenDense, self).__init__()
		self.dense = torch.nn.Linear(seq_len*4, 3)
		self.seq_len = seq_len

	def forward(self, X, alpha=0, beta=1):
		X = X.reshape(X.shape[0], self.seq_len*4)
		return self.dense(X) * beta + alpha

model = FlattenDense()

Instead of pointing to a FASTA file on disk, we can pass in a dictionary to make it clearer to us what's happening in this example.

In [3]:
from tangermeme.utils import one_hot_encode

X = {'chr1': one_hot_encode("ACCAGTAGTGTACCCACGTTGACCTA")}

Next, let's create the DataFrame. This has the first few columns of a VCF file but not the ones related to quality or other metadata, as decisions on how to use those to filter variants should be made by the user independently. Here, we specify two substitutions in the given sequence. Remember that we also do not need to provide the original character at each position like a VCF file does.

In [4]:
import pandas

variants = pandas.DataFrame({
    'chrom': ['chr1', 'chr1'],
    'pos': [8, 14],
    'alt': ['A', 'G']
})

Now, we can use the function to calculate variant effect. This function will return the predictions on the original sequence when centered at the position that the substitution will occur, and the predictions after incorporating the mutation. When using the functions starting with `marginal`, each variant is considered independently of the others. Specifically, each example will contain only one mutation in it even if multiple provided mutations fall within the same window.

In [5]:
from tangermeme.variant_effect import marginal_substitution_effect

y_orig, y_var = marginal_substitution_effect(model, X, variants, in_window=10, device='cpu')
y_orig.shape, y_var.shape

(torch.Size([2, 3]), torch.Size([2, 3]))

The tensors have the shape `(2, 3)` because there are two variants to evaluate the effect of, and the model returns three predictions per example.

Importantly, `tangermeme` does not force the user to adopt a specific distance function. Rather, the functions return the raw predictions with and without the substitutions and allow the user to define their own distance. For example, we could use a simple Euclidean distance function to calculate the difference between predictions before and after including the substitution.

In [6]:
torch.sqrt(torch.sum((y_var - y_orig) ** 2, dim=-1))

tensor([0.3576, 0.1783])

More generally, these functions are meant to be the base operation that your downstream functions are built off of. In your library, you might write your own `variant_effect` function with a similar signature, call this function to get the predictions, calculate distance in the manner you choose, and return that -- either individually, or as part of the variant DataFrame that was passed in originally.
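A wrapper of that kind might look like the sketch below. Everything here is hypothetical: `variant_effect`, `euclidean`, and `stub_effect_fn` are illustrative names, and plain lists stand in for the tensors that the real functions return. The effect function is passed in as an argument so the same wrapper could sit on top of any function that returns a `(y_orig, y_var)` pair.

```python
import math

def euclidean(y_orig, y_var):
    """Euclidean distance between two equal-length prediction vectors."""
    return math.sqrt(sum((v - o) ** 2 for o, v in zip(y_orig, y_var)))

def variant_effect(effect_fn, model, X, variants, distance_fn=euclidean, **kwargs):
    """Return one score per variant.

    `effect_fn` is any callable with the signature (model, X, variants, **kwargs)
    that returns (y_orig, y_var) with one row of predictions per variant.
    """
    y_orig, y_var = effect_fn(model, X, variants, **kwargs)
    return [distance_fn(o, v) for o, v in zip(y_orig, y_var)]

def stub_effect_fn(model, X, variants, **kwargs):
    """Stand-in for a real effect function; returns fixed predictions."""
    y_orig = [[0.0, 0.0, 0.0], [1.0, 1.0, 1.0]]
    y_var = [[3.0, 4.0, 0.0], [1.0, 1.0, 1.0]]
    return y_orig, y_var

scores = variant_effect(stub_effect_fn, model=None, X=None, variants=None)
print(scores)  # [5.0, 0.0]
```

Swapping `distance_fn` changes the scoring without touching the prediction step, which is the separation the paragraph above describes.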

#### Deletions

Another form of mutation is a deletion, where a character is removed from the sequence. Evaluating deletions is a little more challenging than evaluating substitutions because, with substitutions, every character remains the same except for the substituted one and the length of the sequence does not change. With a deletion, many characters will change because they are shifted over by one position. Further, the sequence after the deletion will be one position shorter than the original sequence. The deletion function overcomes this issue by initially loading a window with one additional position on the right-hand side, which the sequence uses after the deletion and which is excluded from the original sequence. So, if the original sequence were `[ACGT]A`, with the A on the right side excluded because the window size is four, and we wanted to delete the C, the two examples fed into the model would be `ACGT` and `AGTA`.
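The `[ACGT]A` example can be reproduced with a few lines of string slicing. This is a sketch of the idea rather than the library's implementation; `deletion_pair` is a hypothetical name, and the window start is passed explicitly so that no centering rule has to be assumed.

```python
def deletion_pair(seq, start, window, pos):
    """Return (original window, window after deleting the character at `pos`).

    One extra character to the right of the window is loaded so that the
    edited sequence keeps the same length as the original one.
    """
    if start < 0 or start + window + 1 > len(seq):
        raise ValueError("need one extra position to the right of the window")
    if not start <= pos < start + window:
        raise ValueError("deleted position must fall inside the window")

    original = seq[start:start + window]      # extra position excluded
    extended = seq[start:start + window + 1]  # extra position included
    offset = pos - start
    edited = extended[:offset] + extended[offset + 1:]
    return original, edited

print(deletion_pair("ACGTA", start=0, window=4, pos=1))  # ('ACGT', 'AGTA')
```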

The API for deletions is very similar to that of substitutions. However, the dataframe is even easier because you only need to pass in keys and positions without needing original or alternate characters.

In [7]:
variants = pandas.DataFrame({
    'chrom': ['chr1', 'chr1'],
    'pos': [8, 14]
})

The function signature is identical to that for substitutions.

In [8]:
from tangermeme.variant_effect import marginal_deletion_effect

y_orig, y_var = marginal_deletion_effect(model, X, variants, in_window=10, device='cpu')
y_orig.shape, y_var.shape

(torch.Size([2, 3]), torch.Size([2, 3]))

As with substitutions, you get the predictions before and after the incorporation of the variant and can then define your own distance measure.

#### Insertions

The conceptual opposite of a deletion is an insertion, where one character is inserted into a sequence. Rather than loading up an additional character, the original sequence remains the same as before but the ersatz sequence contains an inserted character in the middle and trims one character off the right hand side. 
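As a sketch of that insert-then-trim step on plain strings (again an illustration, not the library's code): `insertion_pair` is a hypothetical name, and placing the new character immediately before the character currently at `pos` is an assumption, since the text above does not say which side it lands on.

```python
def insertion_pair(seq, start, window, pos, alt):
    """Return (original window, window after inserting `alt` at `pos`).

    The new character is assumed to be placed just before the character at
    `pos`; the edited window is trimmed on the right to keep the same length.
    """
    if start < 0 or start + window > len(seq):
        raise ValueError("window extends past the ends of the sequence")
    if not start <= pos < start + window:
        raise ValueError("insertion position must fall inside the window")

    original = seq[start:start + window]
    offset = pos - start
    edited = (original[:offset] + alt + original[offset:])[:window]
    return original, edited

print(insertion_pair("ACGTA", start=0, window=4, pos=1, alt="T"))  # ('ACGT', 'ATCG')
```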

Similarly to substitutions, you need to define the keys, positions, and the character to be inserted.

In [9]:
variants = pandas.DataFrame({
    'chrom': ['chr1', 'chr1'],
    'pos': [8, 14],
    'alt': ['A', 'G']
})

Likewise, the function signature is the same as the other two functions.

In [10]:
from tangermeme.variant_effect import marginal_insertion_effect

y_orig, y_var = marginal_insertion_effect(model, X, variants, in_window=10, device='cpu')
y_orig.shape, y_var.shape

(torch.Size([2, 3]), torch.Size([2, 3]))

#### How can I implement my own?

Maybe you disagree with some of the choices made in calculating the variant effect score. If you'd like to change any of the details you can just copy/paste the functions and make the changes you'd like or write something from scratch. Ultimately, the function should take a form like below, where the edits are made using whatever strategy the user would like and then the `predict` function is called on the original sequences and the edited sequences.

In [11]:
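from tangermeme.predict import predict

def my_variant_effect_method(model, X, variants, **kwargs):
    # Placeholder: build the edited sequences using whatever strategy you'd like
    X_alt = ...

    # Predictions on the original and the edited sequences, in that order
    return predict(model, X, **kwargs), predict(model, X_alt, **kwargs)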