### Tutorial 1: Sequence Manipulation

Being able to manipulate sequences is central to analyzing what machine learning models have learned because it lets you easily pose hypotheticals. Many common downstream analyses, such as <i>in silico marginalizations</i> and <i>feature attributions</i>, rely on sequence manipulations to handle motif insertion or to construct background sequences, respectively. These manipulations are also invaluable tools for dissecting individual regions and for sequence design. For instance, given a locus of interest, one can shuffle regions (such as known motifs) and see what impact the shuffling has on the prediction. Alternatively, one can iteratively insert motifs into a sequence to see whether a predictive model yields a desired result.

Given the foundational nature of these operations, they are central to tangermeme and are implemented in `tangermeme.ersatz`. Why `ersatz`?

<img src="imgs/ersatz.png" width=500>

Accordingly, "ersatz sequences" in this context are those that are not natural sequences -- either because they are alterations of a native sequence, or they are fully synthetic.

#### Substitutions

The simplest operation is the substitution, where some positions are switched out for others. Consider the following where an AP-1 motif (https://jaspar.elixir.no/matrix/MA1144.1/) is inserted into uniformly random nucleotides. Colloquially, "substitutions" are sometimes called "insertions", and while there are similarities between the two they are not exactly the same thing. Let's begin by randomly generating one sequence. Following the PyTorch format, the one-hot encoded sequences should have the shape `(batch_size, alphabet_size, sequence_length)`. Generation can be made deterministic using the `random_state` parameter.

As a note, all of the operations in `tangermeme.ersatz` happen on one-hot encoded tensors. To make the effect of each operation easy to see in this tutorial, we will convert these tensors back to characters.

In [2]:
from tangermeme.utils import random_one_hot
from tangermeme.utils import characters

X = random_one_hot((1, 4, 20), random_state=0)
characters(X[0])

'ATCATTTTCTCGATGAAAGC'
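
To make the `(alphabet_size, sequence_length)` layout concrete, here is a minimal pure-Python sketch of a one-hot encoding and its inverse for a single sequence. The helper names `one_hot` and `decode` and the `ACGT` row ordering are assumptions made for illustration; they are not the `tangermeme.utils` implementation, which operates on PyTorch tensors.

```python
ALPHABET = "ACGT"  # assumed row ordering, for illustration only


def one_hot(seq):
    """Return a nested list with shape (alphabet_size, sequence_length)."""
    return [[1 if base == letter else 0 for base in seq] for letter in ALPHABET]


def decode(encoded):
    """Map each column back to the character whose row holds the 1."""
    length = len(encoded[0])
    return "".join(
        ALPHABET[[row[i] for row in encoded].index(1)] for i in range(length)
    )


encoded = one_hot("ATCA")
shape = (len(encoded), len(encoded[0]))  # one row per character, one column per position
```

Each column contains exactly one 1, so the encoding can always be converted back to characters without loss.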

Now, let's put the motif in. We can either pass in a string, in which case the same motif is added to each sequence in the batch at the same position, or we can pass in a one-hot encoding. If the one-hot encoding has shape `(1, alphabet_size, motif_size)`, the same motif is added to each sequence, just as if a string were used. If the one-hot encoding has shape `(batch_size, alphabet_size, motif_size)`, then the i-th sequence in the batch has the i-th one-hot encoding substituted in.

In [3]:
from tangermeme.ersatz import substitute

X_ap1 = substitute(X, "TGACTCA")
characters(X_ap1[0])

'ATCATTTTGACTCAGAAAGC'

By default, the substitution will happen in the middle of the sequence. If you'd like to control where it happens you can pass in a parameter `start` with the index to start the substitution.

In [4]:
X_ap1 = substitute(X, "TGACTCA", start=2)
characters(X_ap1[0])

'ATTGACTCATCGATGAAAGC'
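
The same operation can be sketched on plain strings to show exactly which positions are overwritten. The helper `substitute_str` is hypothetical, and its default offset (`len(seq) // 2 - len(motif) // 2`) is inferred from the outputs shown above rather than taken from the library source.

```python
def substitute_str(seq, motif, start=None):
    """Overwrite len(motif) characters of seq beginning at start.

    Hypothetical string analogue of substitution; the default offset is an
    assumption chosen to match the outputs shown in this tutorial.
    """
    if start is None:
        start = len(seq) // 2 - len(motif) // 2
    end = start + len(motif)
    if start < 0 or end > len(seq):
        raise ValueError("motif does not fit in the sequence at this position")
    # Keep the left flank, write the motif, then keep the right flank.
    return seq[:start] + motif + seq[end:]


original = "ATCATTTTCTCGATGAAAGC"
centered = substitute_str(original, "TGACTCA")
shifted = substitute_str(original, "TGACTCA", start=2)
```

In both cases the result has the same length as the input, because characters are replaced rather than added.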

If we have a one-hot encoding rather than a string sequence, we can pass that in instead without needing to first convert it back to characters.

In [5]:
from tangermeme.utils import one_hot_encode

motif = one_hot_encode("TGACTCA").unsqueeze(0)

X_ap1 = substitute(X, motif)
characters(X_ap1[0])

'ATCATTTTGACTCAGAAAGC'
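
The difference between a shared motif and one motif per sequence can be sketched with lists of strings standing in for batches. The helper `substitute_batch` is hypothetical and only mirrors the broadcasting rule described above: a batch of one motif is reused for every sequence, while a batch of the same size as the sequences is matched element by element.

```python
def substitute_batch(seqs, motifs, start):
    """Substitute motifs into seqs, reusing a single motif for the whole batch.

    Hypothetical helper: lists of strings stand in for one-hot encoded batches.
    """
    if len(motifs) == 1:
        motifs = motifs * len(seqs)  # the same motif goes into every sequence
    if len(motifs) != len(seqs):
        raise ValueError("need either one motif or one motif per sequence")
    return [
        seq[:start] + motif + seq[start + len(motif):]
        for seq, motif in zip(seqs, motifs)
    ]


batch = ["AAAAAA", "CCCCCC"]
shared = substitute_batch(batch, ["GT"], start=2)
per_sequence = substitute_batch(batch, ["GT", "TG"], start=2)
```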

#### Multisubstitutions

Sometimes one would like to make multiple substitutions in the same sequence with some spacing between them. Although this could be achieved by calling `substitute` multiple times, `tangermeme` provides a convenient wrapper for this in the `multisubstitute` function. This function has a very similar signature to `substitute` except that it takes in a list of motifs and the spacing between them. We can try it out first with no spacing between the two motifs.

In [7]:
from tangermeme.ersatz import multisubstitute

X_ap12 = multisubstitute(X, ["TGACTCA", "TGACTCA"], 0, start=2)
characters(X_ap12[0])

'ATTGACTCATGACTCAAAGC'

Now, let's add a little bit of spacing.

In [8]:
X_ap12 = multisubstitute(X, ["TGACTCA", "TGACTCA"], 2, start=2)
characters(X_ap12[0])

'ATTGACTCATCTGACTCAGC'

Finally, if we have more than two motifs, we can optionally provide a separate spacing value between each adjacent pair of motifs. Note that if we keep the spacing value as a single integer but provide more than two motifs, the same spacing is used between all adjacent pairs.

In [10]:
X_ap12 = multisubstitute(X, ["TGA", "TCA", "TGA", "TGAC"], [0, 2, 1], start=2)
characters(X_ap12[0])

'ATTGATCACTTGATTGACGC'
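
The bookkeeping behind the spacing argument can be sketched on plain strings. The helper `multisubstitute_str` is hypothetical; it simply walks a cursor along the sequence, writing each motif and then skipping the requested number of untouched positions, which reproduces the outputs shown above.

```python
def multisubstitute_str(seq, motifs, spacing, start):
    """Write motifs left to right, leaving spacing[i] original characters
    between motif i and motif i + 1. Hypothetical string analogue."""
    if isinstance(spacing, int):
        spacing = [spacing] * (len(motifs) - 1)  # same gap between every pair
    if len(spacing) != len(motifs) - 1:
        raise ValueError("need one spacing value per adjacent motif pair")

    out = list(seq)
    position = start
    for i, motif in enumerate(motifs):
        if position + len(motif) > len(seq):
            raise ValueError("motifs and spacing run past the end of the sequence")
        out[position:position + len(motif)] = motif
        position += len(motif)
        if i < len(spacing):
            position += spacing[i]  # skip over the untouched gap
    return "".join(out)


original = "ATCATTTTCTCGATGAAAGC"
adjacent = multisubstitute_str(original, ["TGACTCA", "TGACTCA"], 0, start=2)
spaced = multisubstitute_str(original, ["TGACTCA", "TGACTCA"], 2, start=2)
mixed = multisubstitute_str(original, ["TGA", "TCA", "TGA", "TGAC"], [0, 2, 1], start=2)
```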

#### Insertions

Related to substitutions are insertions. As mentioned before, sometimes people say "insertions" when what they mean are "substitutions", but insertions involve adding the new sequence without modifying or deleting any of the existing sequence. Essentially, the returned sequence will be longer than the original sequence because it now additionally contains the sequence being added. In contrast, substitutions preserve the length of the sequence because characters are explicitly being changed from those in the original sequence to those in the new sequence. Let's see that in action.

In [5]:
from tangermeme.ersatz import insert

X_ap1 = insert(X, "TGACTCA")
characters(X_ap1[0])

'ATCATTTTCTTGACTCACGATGAAAGC'

In [6]:
X.shape[-1], len("TGACTCA"), len(characters(X_ap1[0]))

(20, 7, 27)
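
The contrast with substitution is easiest to see on plain strings. The helper `insert_str` is hypothetical, and its default position (`len(seq) // 2`) is inferred from the output shown above rather than taken from the library source.

```python
def insert_str(seq, motif, start=None):
    """Splice motif into seq without overwriting anything.

    Hypothetical string analogue; the default position is an assumption
    chosen to match the output shown in this tutorial.
    """
    if start is None:
        start = len(seq) // 2
    # Every original character is kept, so the result grows by len(motif).
    return seq[:start] + motif + seq[start:]


original = "ATCATTTTCTCGATGAAAGC"
inserted = insert_str(original, "TGACTCA")
```

Removing the motif from `inserted` recovers `original` exactly, which is not true of a substitution.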

#### Deletion

In direct contrast to insertions are deletions: insertions involve adding new sequence to an existing sequence and deletions involve deleting existing sequence and not replacing it with anything. Accordingly, one only passes in sequences and the coordinates to delete instead of passing in any form of new sequence.

In [7]:
from tangermeme.ersatz import delete

X_del = delete(X, start=0, end=5)
characters(X[0]), characters(X_del[0])

('ATCATTTTCTCGATGAAAGC', 'TTTCTCGATGAAAGC')

Next, let's try deleting a portion from the middle.

In [8]:
X_del = delete(X, start=10, end=15)
characters(X_del[0])

'ATCATTTTCTAAAGC'

In [9]:
X.shape[-1], X_del.shape[-1]

(20, 15)
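
On plain strings, a deletion is just the concatenation of the two flanks. The helper `delete_str` below is hypothetical and treats `start` as inclusive and `end` as exclusive, which matches the outputs shown above.

```python
def delete_str(seq, start, end):
    """Drop the characters in [start, end) and join the two flanks.

    Hypothetical string analogue of deletion.
    """
    if not 0 <= start <= end <= len(seq):
        raise ValueError("start and end must satisfy 0 <= start <= end <= len(seq)")
    return seq[:start] + seq[end:]


original = "ATCATTTTCTCGATGAAAGC"
prefix_removed = delete_str(original, 0, 5)
middle_removed = delete_str(original, 10, 15)
```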

#### Randomize

Deleting positions is one way that we can remove information from a sequence. However, this can pose some issues -- both practically, in terms of needing a sequence of the same length when using machine learning models, and conceptually, in that removing a motif entirely isn't really a biologically plausible alternative to observed sequence. An alternative approach to deleting positions is to replace those positions with randomly generated characters. This would keep the sequence the same length but remove the motif.

Here, we replace the first five positions with randomly generated characters and keep the remaining characters the same.

In [10]:
from tangermeme.ersatz import randomize

X_rand = randomize(X, start=0, end=5, random_state=0)
characters(X[0]), characters(X_rand[0, 0])

('ATCATTTTCTCGATGAAAGC', 'GGGGCTTTCTCGATGAAAGC')

Frequently, when using randomly generated sequences, one wishes to generate many randomizations so that one can average over the randomness induced by the sequences. To make this easy, `tangermeme` allows you to pass in a parameter `n` specifying the number of randomizations to perform, and returns a tensor with one more dimension than the original tensor whenever randomness is involved. Specifically, the returned tensor will have shape `(n_orig_sequences, n_shuffles, alphabet_size, seq_len)` so that you can randomize or shuffle each of many sequences many times.

In [11]:
X_rand = randomize(X, start=0, end=5, n=10, random_state=0)

for i in range(10):
    print(characters(X_rand[0, i]))

GGGGCTTTCTCGATGAAAGC
GCTTCTTTCTCGATGAAAGC
TGGTATTTCTCGATGAAAGC
AATTTTTTCTCGATGAAAGC
TTCTATTTCTCGATGAAAGC
GATGCTTTCTCGATGAAAGC
CTCGATTTCTCGATGAAAGC
GGGTGTTTCTCGATGAAAGC
CCGAGTTTCTCGATGAAAGC
GAACCTTTCTCGATGAAAGC
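
The extra dimension can be sketched with nested lists, where the outer list indexes the original sequences and the inner list indexes the `n` randomizations. The helper `randomize_str` is hypothetical and uses Python's `random` module, so it will not reproduce the exact characters shown above even with the same seed.

```python
import random


def randomize_str(seqs, start, end, n=1, seed=None):
    """Return result[i][j]: the j-th randomization of the i-th sequence.

    Hypothetical string analogue; only positions in [start, end) are resampled
    uniformly from ACGT, and the flanks are copied unchanged.
    """
    rng = random.Random(seed)
    return [
        [
            seq[:start]
            + "".join(rng.choice("ACGT") for _ in range(end - start))
            + seq[end:]
            for _ in range(n)
        ]
        for seq in seqs
    ]


original = "ATCATTTTCTCGATGAAAGC"
randomized = randomize_str([original], start=0, end=5, n=10, seed=0)
```

Indexing `randomized[0][j]` here plays the same role as `X_rand[0, j]` in the cells above.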


#### Shuffle

A problem with independently generating sequence at each position is that the sampled sequences might have unrealistic compositions. For instance, uniformly random sequences have an expected GC content of 50%, which is high compared to many naturally occurring sequences (the human genome is roughly 41% GC). When trying to create backgrounds for specific loci, you might prefer to instead shuffle the positions so that the composition of each background is the same as that of the original sequence, while any motifs are disrupted.

We can use the `shuffle` function to completely shuffle a batch of sequences. Each returned sequence will have the same composition as the original sequence that was shuffled, but a different ordering of the elements. In the genomics setting, this means that the same number of each nucleotide will be present but motifs will likely be removed.

In [12]:
from tangermeme.ersatz import shuffle

X_shuf = shuffle(X, random_state=0)
characters(X[0]), characters(X_shuf[0, 0])

('ATCATTTTCTCGATGAAAGC', 'GTCCCATTTCTGTTAGAAAA')

Similar to the `randomize` function, if we want to shuffle a sequence many times, we can use the parameter `n`.

In [13]:
X_shuf = shuffle(X, n=10, random_state=0)

for i in range(10):
    print(characters(X_shuf[0, i]))

GTCCCATTTCTGTTAGAAAA
GTGACACACAATACTTTGTT
ATATGCCTAATCAGTTATGC
GATCAATAGAGCTATCCTTT
TTCCTGCAAATTGTCTAAGA
ATCCATTCGGGACTTTATAA
GGTTTACTAACGCCAATTAT
TGTAAGTACCCTATTCTGAA
TTTTTATACAGCAGAGACTC
CAAACTTTCGCTTAGTATAG


Furthermore, we can restrict our shuffling to only a portion of the sequence. This can be valuable if you want to knock out a portion of the sequence, such as a known motif or broader regulatory region. All you need to do is specify the start (inclusive) and end (exclusive) positions between which the shuffling should occur.

In [14]:
X_shuf = shuffle(X, start=5, end=15, random_state=3)
characters(X[0]), characters(X_shuf[0, 0])

('ATCATTTTCTCGATGAAAGC', 'ATCATCTTTGGATCTAAAGC')
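
The two properties described in this section, preserved composition and untouched flanks, can be checked with a short string sketch. The helper `shuffle_region` is hypothetical and uses Python's `random` module, so its permutations will differ from the ones shown above; the final check uses the output of the region shuffle from the cell above.

```python
import random
from collections import Counter


def shuffle_region(seq, start=0, end=None, seed=None):
    """Permute the characters in [start, end) and leave the flanks unchanged.

    Hypothetical string analogue of shuffling.
    """
    end = len(seq) if end is None else end
    rng = random.Random(seed)
    region = list(seq[start:end])
    rng.shuffle(region)
    return seq[:start] + "".join(region) + seq[end:]


original = "ATCATTTTCTCGATGAAAGC"
shuffled = shuffle_region(original, start=5, end=15, seed=3)

# A permutation cannot change how many of each nucleotide are present.
same_composition = Counter(shuffled) == Counter(original)

# The region-shuffled output shown above has the same composition inside the region.
tutorial_output = "ATCATCTTTGGATCTAAAGC"
region_matches = Counter(tutorial_output[5:15]) == Counter(original[5:15])
```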

#### Dinucleotide Shuffle

In the genomics setting, the `CG` dinucleotide plays an outsized role compared to other dinucleotides and is significantly underrepresented in the genome. Because normal shuffling disrupts dinucleotide content, and hence changes the proportion of CGs in the sequence, sometimes one wants to use a shuffling strategy that explicitly preserves dinucleotide content.

In `tangermeme`, the `dinucleotide_shuffle` function operates similarly to the `shuffle` function. For instance, we can shuffle entire sequences:

In [15]:
from tangermeme.ersatz import dinucleotide_shuffle

X_shuf = dinucleotide_shuffle(X, random_state=0)
characters(X[0]), characters(X_shuf[0, 0])

('ATCATTTTCTCGATGAAAGC', 'ATCATTTCTCGATTGAAAGC')
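
What "preserves dinucleotide content" means can be verified by counting overlapping pairs of adjacent characters. The sketch below uses a hypothetical helper, `dinucleotide_counts`, to compare the original sequence against the dinucleotide-shuffled output above and against the plain shuffle output from the previous section.

```python
from collections import Counter


def dinucleotide_counts(seq):
    """Count every overlapping pair of adjacent characters in seq."""
    return Counter(a + b for a, b in zip(seq, seq[1:]))


original = "ATCATTTTCTCGATGAAAGC"
dinuc_shuffled = "ATCATTTCTCGATTGAAAGC"  # dinucleotide_shuffle output shown above
plain_shuffled = "GTCCCATTTCTGTTAGAAAA"  # shuffle output from the previous section

preserved = dinucleotide_counts(dinuc_shuffled) == dinucleotide_counts(original)
disrupted = dinucleotide_counts(plain_shuffled) != dinucleotide_counts(original)
cg_before = dinucleotide_counts(original)["CG"]
cg_after_plain = dinucleotide_counts(plain_shuffled)["CG"]
```

The plain shuffle keeps the single-nucleotide composition but loses the one `CG` in the original, whereas the dinucleotide shuffle keeps every pair count intact.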

We can generate many shuffles using the `n` parameter.

In [16]:
X_shuf = dinucleotide_shuffle(X, n=10, random_state=0)

for i in range(10):
    print(characters(X_shuf[0, i]))

ATCATTTCTCGATTGAAAGC
ATCTCAAATTTTCGATGAGC
ATTTTCTCAATCGAATGAGC
AAATTTTCTCATCGATGAGC
ATCTTCATCGATTTGAAAGC
ATCAATTCTTTCGAATGAGC
AATTCTCAATTCGATTGAGC
AAATCTCATTTCGATTGAGC
ATCATTTCTTCGAAATGAGC
ATCTTTCAATTCGAATGAGC


And we can shuffle only a portion of the sequence.

In [17]:
X_shuf = dinucleotide_shuffle(X, start=5, end=15, random_state=3)
characters(X[0]), characters(X_shuf[0, 0])

('ATCATTTTCTCGATGAAAGC', 'ATCATTCTTTCGATGAAAGC')

As a note, the dinucleotide shuffling strategy that is implemented will always keep the first and last nucleotides the same. Depending on the sequence composition and the length of the region being shuffled, it can be impossible to produce new dinucleotide-shuffled sequences. Passing in `verbose=True` will raise a warning when at least one position (other than the first and last positions) is always the same character. Regardless of the value of `verbose`, an error will be raised when all returned sequences are identical.

In [18]:
X_shuf = dinucleotide_shuffle(X, start=10, end=15, random_state=0, n=100, verbose=True)
characters(X[0]), characters(X_shuf[0, 0])



ValueError: All dinucleotide shuffles yield identical sequences, potentially due to a lack of diversity in sequence.
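
To see why this particular region cannot be shuffled, one can enumerate every ordering that starts with the same character and uses the same dinucleotides. The sketch below is a brute-force search written for illustration with a hypothetical helper name; it is not the algorithm `tangermeme` uses, and it is only practical for short regions. The region `X[0, :, 10:15]` corresponds to the characters `CGATG`.

```python
from collections import Counter


def dinucleotide_preserving_orderings(seq):
    """Enumerate all strings with the same first character and the same
    dinucleotide counts as seq (brute-force illustration only)."""
    remaining = Counter(zip(seq, seq[1:]))
    results = set()

    def walk(path):
        if len(path) == len(seq):
            results.add("".join(path))
            return
        for (a, b), count in list(remaining.items()):
            if a == path[-1] and count > 0:
                remaining[(a, b)] -= 1  # use up one copy of this dinucleotide
                path.append(b)
                walk(path)
                path.pop()
                remaining[(a, b)] += 1  # restore it before trying the next one

    walk([seq[0]])
    return results


stuck = dinucleotide_preserving_orderings("CGATG")       # positions 10 to 15
flexible = dinucleotide_preserving_orderings("TTTCTCGATG")  # positions 5 to 15
```

Every character in `CGATG` is followed by only one possible successor, so the original ordering is the only valid one. The longer region from positions 5 to 15 admits several orderings, including the `TCTTTCGATG` seen in the earlier region-shuffle output.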