### Tutorial D1 (Built-in Tools): FIMO

The name `tangermeme` is a play on `MEME` and the `MEME suite`, one of the original collections of tools for biological sequence analysis. These tools include `MEME`, which discovers repeating patterns in collections of short sequences; `FIMO`, which scans a PWM over sequences and finds statistically significant matches; `TOMTOM`, which compares a PWM to a collection of PWMs; and many other tools that have been developed over decades.

Although the scope of `tangermeme` is larger than that of the MEME suite -- in that `tangermeme` implements operations and analysis tools for machine learning models -- some of the MEME suite tools are also implemented because they are used in downstream `tangermeme` methods. So far, these implementations are in numba and not in PyTorch because they can be sped up much more efficiently when not treated as a dense batched operation.

#### Using FIMO

Find Individual Motif Occurrences ([FIMO](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3065696/)) is a tool for scanning PWMs across sequences and returning statistically significant occurrences. There are basically two steps to the procedure: calculating a score that is just the convolution of the given PWMs and the one-hot encoded sequence, and converting that score to a valid p-value. The first step is trivial to implement. The second step involves using a dynamic programming algorithm that accounts for the length of the sequence and the information content at each position.
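For intuition, the two steps can be sketched in plain Python. This is an illustrative sketch, not the numba implementation in tangermeme: it assumes a uniform background, log2-odds scores, and scores rounded to integers so that the dynamic program can enumerate every reachable score exactly. The names `log_odds`, `scan`, and `score_pvalues`, and the toy two-position motif, are hypothetical.

```python
import math
from collections import defaultdict

ALPHABET = "ACGT"

def log_odds(pwm, background=0.25, scale=10):
    """Convert a probability PWM (4 rows x motif_length) to integer log2-odds."""
    return [[round(scale * math.log2(p / background)) for p in row] for row in pwm]

def scan(scores, sequence):
    """Step 1: slide the motif along the sequence, summing the matching entries."""
    width = len(scores[0])
    idx = [ALPHABET.index(c) for c in sequence]
    return [sum(scores[idx[i + j]][j] for j in range(width))
            for i in range(len(sequence) - width + 1)]

def score_pvalues(scores, background=0.25):
    """Step 2: DP over motif positions, giving P(score >= s) for each integer s."""
    dist = {0: 1.0}  # probability of each partial score before any position
    for j in range(len(scores[0])):
        new = defaultdict(float)
        for s, p in dist.items():
            for a in range(4):
                new[s + scores[a][j]] += p * background
        dist = new
    pvals, running = {}, 0.0
    for s in sorted(dist, reverse=True):  # accumulate from the best score down
        running += dist[s]
        pvals[s] = running
    return pvals

# Toy motif that prefers "AG"; rows are A, C, G, T.
pwm = [[0.7, 0.1], [0.1, 0.1], [0.1, 0.7], [0.1, 0.1]]
int_scores = log_odds(pwm)
hits = scan(int_scores, "TAGT")
pvals = score_pvalues(int_scores)
best_pvalue = pvals[max(hits)]
```

Here only one of the 16 possible dinucleotides reaches the best score, so its p-value under a uniform background is 1/16.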

This algorithm is implemented in `tangermeme.tools.fimo` in the function `fimo`. Minimally, one must provide a filename for a MEME-formatted file of motifs and a filename for a FASTA-formatted file of sequences to scan against.

NOTE: This API has changed significantly in v0.3.0. Rather than having a `FIMO` class that is a PyTorch module, there is now only a `fimo` function that uses numba.

In [1]:
from tangermeme.tools.fimo import fimo

hits = fimo("../../tests/data/test.meme", "../../tests/data/test.fa") 
len(hits)

12

There are 12 motifs in `test.meme` and so there are 12 dataframes returned -- one for each motif.

In [2]:
hits[0]

Unnamed: 0,motif_id,motif_alt_id,sequence_name,start,stop,strand,score,p-value
0,MEOX1_homeodomain_1,,chr7,1350,1360,+,11.446572,7.5e-05


The format of the returned dataframe is meant to match the output from FIMO closely EXCEPT THAT THE COORDINATES HERE ARE 0-based, not 1-based. This decision was made to keep everything in tangermeme consistent -- everything is 0-based. This means that the starts will be 1 position lower than those returned by FIMO. Further, the stop coordinates are inclusive in FIMO and exclusive here, meaning that position 1360 (0-indexed) is not included in the hit above. As a result, the stop coordinates will be the same in both implementations, and slicing with `start:stop` extracts exactly the matched region when using the coordinates returned by tangermeme.
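A small sketch of what this convention means in practice, using a toy sequence and made-up coordinates rather than real output:

```python
sequence = "ACGTACGTAC"

# tangermeme-style coordinates: 0-based start, exclusive stop.
start, stop = 2, 6
matched = sequence[start:stop]  # plain Python slicing works directly

# The same hit in command-line FIMO's convention: 1-based start, inclusive stop.
fimo_start, fimo_stop = start + 1, stop
matched_from_fimo = sequence[fimo_start - 1:fimo_stop]
```

Both slices give the same four characters, and `stop - start` equals the motif length.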

You might notice that there are two missing columns from this format: `q-value` and `matched_sequence`. `q-value` is not implemented because, in my opinion, q-values are not meaningful for this task and are very compute- and memory-inefficient to calculate. Likewise, `matched_sequence` takes a fair amount of time to calculate compared to everything else, since the main implementation is in numba, and is not always used. I decided to remove it to speed things up for the majority of people who do not need it.

#### Alternate Inputs

One of the main reasons I implemented these built-in tools was so they could be easily accessible via Python without the need for intermediary files. So, if you already have a set of motifs you would like to scan, you can pass in a dictionary where the keys are motif names and the values are the PWMs. Note that the PWMs can be either `numpy.ndarray` or `torch.Tensor` objects but they must be formatted to have the shape `(len(alphabet), motif_length)`. The built-in `read_meme` command will read motifs into this format automatically, but you can also scan your own custom motifs built however you'd like.
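As a minimal sketch of building such a dictionary by hand, the snippet below turns a toy count matrix into a PWM. The motif name and counts are made up, and it assumes the values should be per-position probabilities, as they would be in a MEME file.

```python
import numpy as np

# Toy count matrix for a hypothetical 5-bp motif; rows are A, C, G, T.
counts = np.array([
    [8, 0, 0, 9, 1],
    [1, 0, 9, 0, 1],
    [0, 10, 1, 0, 1],
    [1, 0, 0, 1, 7],
], dtype=float)

# Normalize each position (column) so that its four probabilities sum to 1.
pwm = counts / counts.sum(axis=0, keepdims=True)

# Motif name -> array of shape (len(alphabet), motif_length).
custom_motifs = {"my_custom_motif": pwm}

# The dictionary would then take the place of the MEME filename, e.g.
# hits = fimo(custom_motifs, "../../tests/data/test.fa")
```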

In [3]:
from tangermeme.io import read_meme

motifs = read_meme("../../tests/data/test.meme")
hits = fimo(motifs, "../../tests/data/test.fa")
hits[0]

Unnamed: 0,motif_id,motif_alt_id,sequence_name,start,stop,strand,score,p-value
0,MEOX1_homeodomain_1,,chr7,1350,1360,+,11.446572,7.5e-05


You can also pass in `numpy.ndarray` or `torch.Tensor` objects as your sequences. In this case, the sequences must be a single object that has the shape `(n_seqs, len(alphabet), sequence_length)`. 

In [4]:
import pyfaidx

from tangermeme.utils import one_hot_encode

X = pyfaidx.Fasta("../../tests/data/test.fa")['chr7'][:].seq.upper()
X = one_hot_encode(X).unsqueeze(0)
X.shape

torch.Size([1, 4, 2000])

Once you have the sequence object, you can pass it in instead of the filename.

In [5]:
hits = fimo(motifs, X)
hits[0]

Unnamed: 0,motif_id,motif_alt_id,sequence_name,start,stop,strand,score,p-value
0,MEOX1_homeodomain_1,,0,1350,1360,+,11.446572,7.5e-05


Note here that the sequence name will just be the index of the sequence that the hit matched to, but that the score and p-value are still identical. For simplicity, we only had a single sequence, but you can use as many sequences as you would like.

In [6]:
from tangermeme.utils import random_one_hot

X = random_one_hot((100, 4, 5000), random_state=0)

hits = fimo(motifs, X)
hits[0]

Unnamed: 0,motif_id,motif_alt_id,sequence_name,start,stop,strand,score,p-value
0,MEOX1_homeodomain_1,,1,4220,4230,+,12.104512,0.000031
1,MEOX1_homeodomain_1,,2,4774,4784,+,11.133726,0.000093
2,MEOX1_homeodomain_1,,4,1498,1508,+,12.225502,0.000027
3,MEOX1_homeodomain_1,,5,4973,4983,+,13.126775,0.000007
4,MEOX1_homeodomain_1,,8,2169,2179,+,11.218552,0.000086
...,...,...,...,...,...,...,...,...
74,MEOX1_homeodomain_1,,90,665,675,-,11.118969,0.000093
75,MEOX1_homeodomain_1,,92,2941,2951,-,12.177395,0.000031
76,MEOX1_homeodomain_1,,94,870,880,-,11.619565,0.000063
77,MEOX1_homeodomain_1,,94,4133,4143,-,12.225502,0.000027


#### Annotating Sequences

Sometimes, rather than getting all the hits for a motif across all sequences, you would like to annotate each sequence according to what motifs bind to it. In a sense, this is asking to transpose the results -- rather than one dataframe per motif, you would like one dataframe per example. You can easily do this by passing in `dim=1`.

In [7]:
hits = fimo(motifs, X, dim=1)
len(hits)

100

Now, we are getting 100 dataframes because there are 100 sequences, rather than getting 12 because there are 12 motifs.

In [8]:
hits[0]

Unnamed: 0,motif_id,motif_alt_id,sequence_name,start,stop,strand,score,p-value
0,HIC2_MA0738.1,,0,3263,3272,+,11.987469,1.1e-05
1,HIC2_MA0738.1,,0,3480,3489,+,11.316123,3.4e-05
2,ZN263_HUMAN.H11MO.0.A,,0,4178,4198,+,11.073596,4.3e-05
3,ZN263_HUMAN.H11MO.0.A,,0,4049,4069,-,10.941541,4.6e-05
4,TBX19_MA0804.1,,0,4653,4673,+,12.223569,1.5e-05
5,TBX19_MA0804.1,,0,4653,4673,-,10.531827,3.5e-05
6,Hes1_MA1099.1,,0,1101,1111,+,9.933365,5e-05
7,Hes1_MA1099.1,,0,2855,2865,+,9.928051,5e-05
8,Hes1_MA1099.1,,0,1101,1111,-,9.933365,5e-05


When looking at one of these dataframes, we can see that multiple motifs are binding, and that `sequence_name` is constant across them. Basically, we are getting all the motifs that are found in this example, rather than having to do some operations manually after calling FIMO, as one would have to do with the command-line tool.
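For comparison, the manual regrouping that `dim=1` replaces might look roughly like the following. This is a sketch using plain dictionaries and made-up hit records rather than the dataframes that `fimo` returns.

```python
from collections import defaultdict

# Toy per-motif results: one list of hit records per motif (values are illustrative).
per_motif = {
    "motif_A": [
        {"sequence_name": 0, "start": 10, "stop": 20, "strand": "+"},
        {"sequence_name": 2, "start": 5, "stop": 15, "strand": "-"},
    ],
    "motif_B": [
        {"sequence_name": 0, "start": 40, "stop": 52, "strand": "+"},
    ],
}

# Regroup so that each sequence collects the hits from every motif.
per_sequence = defaultdict(list)
for motif_id, motif_hits in per_motif.items():
    for hit in motif_hits:
        per_sequence[hit["sequence_name"]].append({"motif_id": motif_id, **hit})
```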

#### Optional Arguments

In addition to the motifs and sequences, the `fimo` function has a few optional arguments. The first is `threshold`, which sets the p-value threshold that a hit must meet to be reported.

In [9]:
hits = fimo(motifs, X, threshold=1e-5)
hits[0]

Unnamed: 0,motif_id,motif_alt_id,sequence_name,start,stop,strand,score,p-value
0,MEOX1_homeodomain_1,,5,4973,4983,+,13.126775,7e-06
1,MEOX1_homeodomain_1,,40,4426,4436,+,13.126775,7e-06
2,MEOX1_homeodomain_1,,48,4472,4482,+,13.074241,1e-05
3,MEOX1_homeodomain_1,,49,997,1007,+,13.268936,5e-06
4,MEOX1_homeodomain_1,,69,91,101,+,13.33845,4e-06
5,MEOX1_homeodomain_1,,5,4973,4983,-,13.126775,7e-06
6,MEOX1_homeodomain_1,,40,4426,4436,-,13.126775,7e-06


In [10]:
hits = fimo(motifs, X, threshold=1e-2)
hits[0]

Unnamed: 0,motif_id,motif_alt_id,sequence_name,start,stop,strand,score,p-value
0,MEOX1_homeodomain_1,,0,205,215,+,4.453169,0.003351
1,MEOX1_homeodomain_1,,0,255,265,+,4.307101,0.003472
2,MEOX1_homeodomain_1,,0,410,420,+,1.666516,0.008586
3,MEOX1_homeodomain_1,,0,676,686,+,4.730534,0.002991
4,MEOX1_homeodomain_1,,0,742,752,+,1.547685,0.008838
...,...,...,...,...,...,...,...,...
9872,MEOX1_homeodomain_1,,99,4417,4427,-,3.817555,0.004143
9873,MEOX1_homeodomain_1,,99,4452,4462,-,2.583183,0.006454
9874,MEOX1_homeodomain_1,,99,4574,4584,-,8.203560,0.000618
9875,MEOX1_homeodomain_1,,99,4624,4634,-,1.326745,0.009375


As you might expect, setting a loose threshold will result in a lot of hits.
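A rough sanity check on these counts: if the p-values are well calibrated, the expected number of chance hits is about the number of scanned positions times the threshold. The sketch below assumes a motif length of 10 (taken from `stop - start` in the tables above), that both strands are scanned, and that the random sequences match the background used to compute the p-values. These are assumptions rather than documented guarantees.

```python
n_seqs, seq_len = 100, 5000
motif_len = 10   # stop - start for MEOX1_homeodomain_1 in the tables above
n_strands = 2    # forward and reverse complement

# Every valid motif start position on every strand of every sequence.
positions = n_seqs * (seq_len - motif_len + 1) * n_strands

expected = {threshold: positions * threshold for threshold in (1e-5, 1e-2)}

for threshold, count in expected.items():
    print(f"threshold={threshold:g}  expected chance hits ~ {count:.1f}")
```

This gives roughly 10 and 9,982 expected hits at thresholds of 1e-5 and 1e-2, against 7 and 9,876 rows in the outputs above, so on random sequences the reported counts are about what chance alone would produce.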

You can also add a pseudocount to the motifs using `eps`. This is supposed to just be for numeric stability, but you may find it useful to tweak. As the pseudocount is increased, the chance of observing very low p-values goes down and so there will be fewer hits (after all, if the pseudocount were a very large number, all regions would be equally likely to be hits).
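To see why a larger pseudocount flattens a motif, consider a single motif position. The sketch below assumes the pseudocount is simply added to each probability before taking the log-odds against a uniform background. That is an assumption about the internals, made only for illustration, and the column values are made up.

```python
import math

def log_odds_column(probs, eps, background=0.25):
    """Log2-odds for one motif position after adding a pseudocount `eps`."""
    return [math.log2((p + eps) / background) for p in probs]

column = [0.97, 0.01, 0.02, 0.0]  # toy probabilities for A, C, G, T

spreads = []
for eps in (1e-4, 1e-2, 1.0):
    scores = log_odds_column(column, eps)
    # Gap between the best and worst nucleotide at this position.
    spreads.append(max(scores) - min(scores))
    print(f"eps={eps:g}  spread={spreads[-1]:.2f}")
```

The gap between the best and worst nucleotide shrinks as `eps` grows, so matching and non-matching windows score more alike and extreme p-values become rarer.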

In [11]:
hits = fimo(motifs, X, eps=1e-2)
hits[0]

Unnamed: 0,motif_id,motif_alt_id,sequence_name,start,stop,strand,score,p-value
0,MEOX1_homeodomain_1,,1,4220,4230,+,12.390564,0.000043
1,MEOX1_homeodomain_1,,4,1498,1508,+,12.509832,0.000029
2,MEOX1_homeodomain_1,,5,4973,4983,+,13.379201,0.000009
3,MEOX1_homeodomain_1,,8,2169,2179,+,11.539705,0.000094
4,MEOX1_homeodomain_1,,9,962,972,+,13.236133,0.000010
...,...,...,...,...,...,...,...,...
67,MEOX1_homeodomain_1,,86,3595,3605,-,11.916640,0.000069
68,MEOX1_homeodomain_1,,92,2941,2951,-,12.481323,0.000035
69,MEOX1_homeodomain_1,,94,870,880,-,11.922870,0.000069
70,MEOX1_homeodomain_1,,94,4133,4143,-,12.509832,0.000029


By default, all motifs will be scanned on both the forward strand and the reverse complement strand. If you want to only scan the forward strand, you can set `reverse_complement=False`.
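One common way to scan the reverse strand, sketched here with toy integer scores, is to flip the motif matrix along both axes and scan the original sequence. This is not necessarily how tangermeme does it internally, and the helper names are hypothetical.

```python
def reverse_complement_pwm(pwm):
    """Flip both axes of a 4 x L matrix whose rows are ordered A, C, G, T."""
    return [row[::-1] for row in pwm[::-1]]

def reverse_complement_seq(seq):
    """Complement each base, then reverse the string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def scan(pwm, seq):
    """Sum the matching matrix entries at every window of the sequence."""
    width = len(pwm[0])
    return [sum(pwm["ACGT".index(seq[i + j])][j] for j in range(width))
            for i in range(len(seq) - width + 1)]

pwm = [[1, 0, 3], [0, 2, 0], [0, 0, 1], [2, 1, 0]]  # toy scores; rows A, C, G, T
seq = "ACGTTGCA"

# Scanning the flipped motif over the forward sequence...
minus_scores = scan(reverse_complement_pwm(pwm), seq)
# ...matches scanning the original motif over the reverse-complemented sequence,
# once positions are mapped back to forward-strand order.
same_scores = scan(pwm, reverse_complement_seq(seq))[::-1]
```

Flipping the motif rather than the sequence keeps the hit coordinates on the forward strand.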

In [12]:
hits = fimo(motifs, X, reverse_complement=False)
hits[0]

Unnamed: 0,motif_id,motif_alt_id,sequence_name,start,stop,strand,score,p-value
0,MEOX1_homeodomain_1,,1,4220,4230,+,12.104512,3.1e-05
1,MEOX1_homeodomain_1,,2,4774,4784,+,11.133726,9.3e-05
2,MEOX1_homeodomain_1,,4,1498,1508,+,12.225502,2.7e-05
3,MEOX1_homeodomain_1,,5,4973,4983,+,13.126775,7e-06
4,MEOX1_homeodomain_1,,8,2169,2179,+,11.218552,8.6e-05
5,MEOX1_homeodomain_1,,9,962,972,+,12.974658,1.1e-05
6,MEOX1_homeodomain_1,,10,3735,3745,+,12.020241,3.7e-05
7,MEOX1_homeodomain_1,,13,4817,4827,+,11.618673,6.3e-05
8,MEOX1_homeodomain_1,,18,1800,1810,+,11.638134,6.3e-05
9,MEOX1_homeodomain_1,,18,1945,1955,+,12.710449,1.3e-05
