# Motif Searching

We will start by loading `pandas`, a package for working with a kind of data table called a "data frame" in Python. Data frames are similar to spreadsheets -- they have rows and columns.

In [None]:
import pandas as pd

The cell below creates a data frame with observed nucleotide counts from 389 TATA boxes taken from eukaryotic promoters (Bucher, *J Mol Biol* (1990) **212**, 563-578).

The data frame is built from a dictionary -- each key of the dictionary is a column name, and the associated value is a list of values for that column. In this example, all of the columns have integer values.

In [None]:
tata_counts = pd.DataFrame({'A': [  16, 352,   3, 354, 268, 360, 222, 155],
                            'C': [  46,   0,  10,   0,   0,   3,   2,  44],
                            'G': [  18,   2,   2,   5,   0,  20,  44, 157],
                            'T': [ 309,  35, 374,  30, 121,   6, 121,  33]})
tata_counts

Each row is a position in the `TATA` motif, and each column is a nucleotide. It's possible to read off the consensus sequence of `TATA(A/T)A(A/T)(A/G)`, sometimes written `TATAWAWR`, just from looking at the counts in the table.
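The visual read-off can also be sketched in code. The snippet below is a minimal, plain-Python illustration rather than part of the original analysis: at each position it keeps every nucleotide seen in at least 25% of the sequences and translates the resulting set into an IUPAC code. The 25% cutoff and the names `counts`, `IUPAC`, and `threshold` are assumptions chosen for this example.

```python
# Counts copied from the table above; one list entry per motif position.
counts = {'A': [16, 352, 3, 354, 268, 360, 222, 155],
          'C': [46, 0, 10, 0, 0, 3, 2, 44],
          'G': [18, 2, 2, 5, 0, 20, 44, 157],
          'T': [309, 35, 374, 30, 121, 6, 121, 33]}

# Standard IUPAC nucleotide codes, keyed by the sorted set of letters.
IUPAC = {'A': 'A', 'C': 'C', 'G': 'G', 'T': 'T',
         'AC': 'M', 'AG': 'R', 'AT': 'W', 'CG': 'S', 'CT': 'Y', 'GT': 'K',
         'ACG': 'V', 'ACT': 'H', 'AGT': 'D', 'CGT': 'B', 'ACGT': 'N'}

threshold = 0.25  # assumed cutoff: keep nucleotides seen in >= 25% of sequences

consensus = ''
for position in range(8):
    total = sum(counts[nt][position] for nt in 'ACGT')
    # Collect, in alphabetical order, the nucleotides that pass the cutoff.
    kept = ''.join(nt for nt in 'ACGT'
                   if counts[nt][position] / total >= threshold)
    consensus += IUPAC[kept]
print(consensus)
```

A different cutoff would give a different degenerate sequence, so the threshold is a judgment call rather than a fixed rule.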

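Data frames have many useful methods. For instance, we can use the `.sum()` method to add up each row across its columns and create a new, column-like result. The argument `1` is the `axis` and selects this row-wise sum; the default, `0`, would instead sum down each column.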

In [None]:
tata_counts.sum(1)

We can then turn these counts into probabilities by dividing each nucleotide count by the total number of sequences counted. That is, if 35 out of 389 TATA-box sequences have a `T` at the second position (row index 1), then the probability of a `T` at position 1 in a random TATA-box sequence is 35/389, or about 9%.

In [None]:
tata_probs = tata_counts / 389
tata_probs

We can index into data frames using square brackets, just like we index into lists and dictionaries. The format for doing this is to index the column first, by name, and then the row, by number. The cell below looks up column `T` and then finds row index 1 in that column -- this is the second row, since Python starts counting from 0.

In [None]:
tata_probs['T'][1]

We want to build a probabilistic model of TATA-box sequences. In this example, we'll assume that each nucleotide position in the TATA-box is independent. That allows us to multiply each of the probabilities together:

```
P(TATAAAAG) = P(T at 0) * P(A at 1) * ... * P(G at 7)
```
Of course, we need to know which position of the motif we're looking at in order to do this -- `P(T at 0)` is very different from `P(T at 1)`! The `enumerate()` function lets us loop over a string or list and keep track of the position as well. Our `for` loop then needs two variables -- `idx` for the index and `nt` for the nucleotide itself.

In [None]:
sequ = 'TATAAAAG'
for idx,nt in enumerate(sequ):
    print(idx,nt)
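Motif positions are often numbered from 1 in the literature, while Python counts from 0. As a small aside, `enumerate()` accepts an optional `start` argument that changes only the reported number, not the order of the letters. The variable names below are illustrative.

```python
motif = 'TATA'

# Positions as Python sees them (use these to look up rows in the data frame).
zero_based = list(enumerate(motif))

# Positions numbered from 1, as they are often written in papers.
one_based = list(enumerate(motif, start=1))

print(zero_based)
print(one_based)
```

The zero-based index is the one that matches the row numbers of `tata_probs`, so the rest of this notebook sticks with plain `enumerate(sequ)`.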

**Exercise** Complete the `for` loop below in order to look up the relevant probability `P(nt at idx)` along with the index and the nucleotide.

In [None]:
for idx,nt in enumerate(sequ):
    print(...)

**Exercise** Complete the `for` loop below to compute the running product of the probabilities for each position of `sequ`.

In [None]:
prob = 1
for idx,nt in enumerate(sequ):
    prob = ...
print(prob)

We computed the probability of one specific sequence -- the "perfect" TATA-box `TATAAAAG` -- under our probabilistic model of a TATA-box. Now compute the probability of the "worst" TATA box sequence, `ACGCGCCT`.

In [None]:
sequ = 'ACGCGCCT'
prob = 1
for idx,nt in enumerate(sequ):
    prob = prob * tata_probs[nt][idx]
print(prob)

While `P(ACGCGCCT)` is low, it probably shouldn't actually be 0 in our model. It comes out as 0 because positions 1, 3, and 4 in our TATA-box data have zero counts for certain nucleotides, and a single zero makes the whole product zero. Of course, we only counted 389 TATA-boxes -- perhaps if we counted 389,000 different TATA-boxes we'd find a few with `C` or `G` nucleotides at these positions.

We often handle these situations by adding a _pseudocount_ to our data. In essence, we add a fake count for every nucleotide at every position, in order to eliminate zero counts. The impact of this pseudocount depends on the number of real counts. If we add a pseudocount of 1 to 9 real observations, it represents 10% of our overall counts, but if we add it to 999 real observations, it's only 0.1%.
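The arithmetic behind those percentages can be sketched in a few lines. The helper name `pseudocount_share` is hypothetical and exists only for this illustration.

```python
def pseudocount_share(real_counts, pseudocount=1):
    """Fraction of the total count that comes from the pseudocount."""
    return pseudocount / (real_counts + pseudocount)

# Few real observations: the pseudocount carries a lot of weight.
print(pseudocount_share(9))

# Many real observations: the pseudocount barely matters.
print(pseudocount_share(999))

# Our TATA-box table: 389 real counts per position.
print(pseudocount_share(389))
```

With 389 real counts per position, a total pseudocount of 1 contributes only about a quarter of a percent, so it removes the zeros without distorting the well-supported probabilities.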

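It's easy to just add 0.25 to every entry in our count table, and use this to compute a new table of probabilities. Adding 0.25 for each of the four nucleotides raises every row total from 389 to 390, so we divide by 390.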

In [None]:
print(tata_counts + 0.25)
print((tata_counts + 0.25).sum(1))
tata_probs = (tata_counts + 0.25) / 390
tata_probs

Use the new probabilistic model with pseudocounts to compute the probabilities of the "best" and "worst" TATA-box sequences.

In [None]:
sequ = 'TATAAAAG'
prob_TATAAAAG = 1
for idx,nt in enumerate(sequ):
    prob_TATAAAAG = prob_TATAAAAG * tata_probs[nt][idx]
print(prob_TATAAAAG)

sequ = 'ACGCGCCT'
prob_ACGCGCCT = 1
for idx,nt in enumerate(sequ):
    prob_ACGCGCCT = prob_ACGCGCCT * tata_probs[nt][idx]
print(prob_ACGCGCCT)

It's getting tedious to write the same `for` loop every time we want to try a different sequence.

We can write our own function, `likelihood_tata()`, that will compute the likelihood of a sequence under our TATA-box probability model. We _define_ a function with `def` followed by the function name. The _arguments_ to the function are named in parentheses, and inside the function, these become _variables_ that take on a different value each time we use the function.

In the cell below, we define the `likelihood_tata` function and then call it. During that call, `sequ` is a variable inside the function with the value `'TATAAAAG'`.

In [None]:
def likelihood_tata(sequ):
    prob = 1
    for idx,nt in enumerate(sequ):
        prob = prob * tata_probs[nt][idx]
    return prob

print(likelihood_tata('TATAAAAG'))

Now we can easily use our function to compute the likelihood of some other possible TATA-box sequences. For example, the three sequences below are "very good" TATA-boxes that differ from the "best" TATA box at one of the three "degenerate" positions in the motif. Notice that the overall probability of getting one of these three imperfect motifs is substantially higher than the probability of the perfect TATA-box. In fact, although the TATA-box is a strong motif, fewer than 10% of the sequences generated according to our model will actually match the "best" sequence.

In [None]:
prob_TATATAAG = likelihood_tata('TATATAAG')
prob_TATAAATG = likelihood_tata('TATAAATG')
prob_TATAAAAA = likelihood_tata('TATAAAAA')
print(prob_TATATAAG, 
      prob_TATAAATG, 
      prob_TATAAAAA,
      prob_TATATAAG + prob_TATAAATG + prob_TATAAAAA)
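One way to check the claim about the "best" sequence is to enumerate all 4^8 = 65,536 possible 8-mers and score each one. The sketch below does this in plain Python, rebuilding the pseudocount-adjusted probabilities from the counts so that it runs on its own. The names `counts`, `probs`, and `likelihood` are assumptions for this illustration and stand in for `tata_probs` and `likelihood_tata()`.

```python
from itertools import product

# Counts copied from the table at the top of the notebook.
counts = {'A': [16, 352, 3, 354, 268, 360, 222, 155],
          'C': [46, 0, 10, 0, 0, 3, 2, 44],
          'G': [18, 2, 2, 5, 0, 20, 44, 157],
          'T': [309, 35, 374, 30, 121, 6, 121, 33]}

# Add a pseudocount of 0.25 per nucleotide; each position then totals 390.
probs = {nt: [(c + 0.25) / 390 for c in column]
         for nt, column in counts.items()}

def likelihood(sequ):
    """Probability of sequ under the independent-positions model."""
    prob = 1.0
    for idx, nt in enumerate(sequ):
        prob *= probs[nt][idx]
    return prob

# Score every possible 8-nucleotide sequence.
scores = {''.join(letters): likelihood(letters)
          for letters in product('ACGT', repeat=8)}

total = sum(scores.values())        # a valid model sums to (very nearly) 1
best = max(scores, key=scores.get)  # the single most probable sequence
print(total, best, scores[best])
```

Because the probabilities over all sequences add up to 1, the score of the top sequence is directly the fraction of model-generated sequences that match it exactly.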

Now, let's search for TATA boxes in yeast genomic sequence. Of course, the TATA-box is a very AT-rich motif and the yeast genome is very AT-rich, and we need to take this fact into account. We'll start with a background model for yeast nucleotide sequence based on our previous analysis. For simplicity, we won't include the nearest-neighbor correlations we discussed last time.

In [None]:
background = {'A': 0.31, 'C': 0.19, 'G': 0.19, 'T': 0.31}
background

**Exercise** Complete the `for` loop below to compute the probability of the "best" TATA-box sequence `TATAAAAG` arising purely by chance according to our (very simple) background model.

In [None]:
sequ = 'TATAAAAG'
bkgnd_TATAAAAG = 1
for idx,nt in enumerate(sequ):
    ...
print(bkgnd_TATAAAAG)

**Exercise** Complete the function `likelihood_bkgnd(sequ)` below to compute the likelihood of a sequence under our background model, and use the function to compute the background likelihood of the "worst" TATA-box.

In [None]:
def likelihood_bkgnd(sequ):
    ...
bkgnd_ACGCGCCT = likelihood_bkgnd('ACGCGCCT')
print(bkgnd_ACGCGCCT)

Since the "worst" TATA-box is GC-rich and the "best" TATA-box is AT-rich, the odds of getting the "best" TATA-box by chance in random sequence are somewhat higher. Of course, the chance of getting the "best" sequence under our TATA-box probabilistic model is dramatically higher than the chance of getting the "worst" sequence. We can use the _ratio_ of the likelihoods as a measure of how much better one model fits a given sequence than the other.

Below, we compute the likelihood ratios for the "best" sequence `TATAAAAG`, the "worst" sequence `ACGCGCCT`, and getting any one of the three very-good sequences `TATATAAG`, `TATAAATG`, and `TATAAAAA`.

In [None]:
print(prob_TATAAAAG / bkgnd_TATAAAAG)
print(prob_ACGCGCCT / bkgnd_ACGCGCCT)

print( (likelihood_tata('TATATAAG') + likelihood_tata('TATAAATG') + likelihood_tata('TATAAAAA'))
       / (likelihood_bkgnd('TATATAAG') + likelihood_bkgnd('TATAAATG') + likelihood_bkgnd('TATAAAAA')) )

**Exercise** Complete the `likelihood_ratio()` function below.

In [None]:
def likelihood_ratio(sequ):
    ...
print(likelihood_ratio('TATAAAAG'))
print(likelihood_ratio('ACGCGCCT'))

Now that we have a likelihood ratio function, we can use it to search the entire yeast genome for potential TATA-boxes! We'll need to start by installing Biopython and importing the `Bio.SeqIO` module that lets us read in and parse a FASTA file.

In [None]:
!pip install biopython

In [None]:
from Bio import SeqIO

**Exercise** Complete the loop below to find every perfect match to the "ideal" `TATAAAAG` sequence in the yeast genome and then print the chromosome name, position, and sequence.

In [None]:
for record in SeqIO.parse("S288C_R64-2-1/S288C_reference_sequence_R64-2-1_20150113.fsa", "fasta"):
    seq = str(record.seq)
    for position in range(0, len(seq) - 8 + 1):  # + 1 so the final 8-nt window is included
        subseq = ...
        if subseq == "TATAAAAG":
            print(record.id, position, subseq)
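A note on the loop bounds: a sequence of length n contains n - k + 1 windows of length k, so the last valid start position is n - k. The toy check below uses the made-up names `toy` and `k`; it lists only the start positions and leaves the slicing itself to the exercise.

```python
toy = 'ACGTACGTAC'   # hypothetical 10-nt sequence
k = 8                # window (motif) length

# Every start position from which a complete k-nt window fits.
starts = list(range(len(toy) - k + 1))
print(starts)

# The last window should end exactly at the end of the sequence.
print(starts[-1] + k, len(toy))
```

Stopping the range one position earlier would silently skip the final window, which matters little on a whole chromosome but can hide a match in a short test sequence.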

**Exercise** Now, adapt the `for` loop to compute the likelihood ratio for each position, and print the name, position, sequence, and likelihood ratio whenever the ratio is greater than 200.

You may want to stop this before it finishes running! It's both verbose and time-consuming...

In [None]:
for record in SeqIO.parse("S288C_R64-2-1/S288C_reference_sequence_R64-2-1_20150113.fsa", "fasta"):
    seq = str(record.seq)
    for position in range(0, len(seq) - 8 + 1):  # + 1 so the final 8-nt window is included
        ...

Biopython actually includes some tools to work with motifs, in the aptly-named `motifs` module.

In [None]:
from Bio import motifs

The `motifs` module contains a special kind of object, a `Motif`, that represents a sequence motif. We can create a `Motif` directly from our nucleotide counts.

In [None]:
tata_motif = motifs.Motif(counts=tata_counts)
print(tata_motif.counts)

The `Motif` object has some useful attributes. For instance, we can get the "best" sequence with `consensus`, the "worst" sequence with `anticonsensus`, and a representation of the motif with degenerate nucleotides using `degenerate_consensus`.

We can also get a table of probabilities for each position, a Position Weight Matrix (PWM), using `pwm`.

In [None]:
print(tata_motif.consensus)
print(tata_motif.anticonsensus)
print(tata_motif.degenerate_consensus)

print(tata_motif.pwm)

The PWM from the `Motif` object has the same problem as the probability matrix we constructed by hand: we see zero probabilities for certain nucleotides at key positions in the motif. We can set the `pseudocounts` property on the motif in order to include pseudocounts in our calculation. Since only two significant figures are displayed, we rarely see the effects of these pseudocounts, although we can see that the probability of `T` at the last position goes from 0.08 to 0.09.

In [None]:
tata_motif.pseudocounts = 0.25
print(tata_motif.pwm)

When scoring motifs, we multiplied probabilities together. Some of the probabilities got quite small -- lower than 1e-16. For a variety of practical reasons, we often take the log of probabilities. When we do this, multiplication of probabilities turns into addition of log-probabilities, which is also more convenient. We can then consider the log-probability as a "score" for nucleotides and add up the score for each position. A matrix of such scores is called a Position-Specific Scoring Matrix (PSSM). Biopython's PSSM stores log-odds scores: the base-2 log of each motif probability divided by a background probability, which is uniform (0.25 per nucleotide) unless we set one.

In [None]:
print(tata_motif.pssm)
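One of the practical reasons for working with logs is floating-point underflow. The sketch below uses an arbitrary per-position probability and a deliberately long "motif" (both made up for the illustration) to show the plain product collapsing to zero while the sum of logs stays well behaved.

```python
import math

p = 0.01   # an arbitrary small per-position probability (assumed)
n = 200    # far more positions than a TATA box, chosen to force underflow

product_of_probs = 1.0
sum_of_logs = 0.0
for _ in range(n):
    product_of_probs *= p          # shrinks towards the smallest float
    sum_of_logs += math.log10(p)   # just keeps adding -2

print(product_of_probs)
print(sum_of_logs)
```

An 8-nt motif is nowhere near this limit, but the same additive scoring carries over unchanged to much longer models.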

The PSSM from the `pssm` attribute is actually another special Biopython object. In addition to the table of log-odds scores, it also has methods to score sequences.

In [None]:
tata_pssm = tata_motif.pssm
print(tata_pssm.calculate('TATAAAAG'))
print(tata_pssm.calculate('ACGCCCCT'))
print(tata_pssm.calculate('TATAAATA'))
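For comparison, these scores can be reproduced by hand. The sketch below assumes a pseudocount of 0.25 per nucleotide, as set above, together with a uniform background of 0.25 per nucleotide, which is assumed here to match Biopython's default. The function name `log_odds_score` is hypothetical.

```python
import math

# Counts copied from the table at the top of the notebook.
counts = {'A': [16, 352, 3, 354, 268, 360, 222, 155],
          'C': [46, 0, 10, 0, 0, 3, 2, 44],
          'G': [18, 2, 2, 5, 0, 20, 44, 157],
          'T': [309, 35, 374, 30, 121, 6, 121, 33]}

pseudocount = 0.25   # added to every nucleotide at every position
background = 0.25    # assumed uniform background probability

def log_odds_score(sequ):
    """Sum of log2(motif probability / background) over all positions."""
    score = 0.0
    for idx, nt in enumerate(sequ):
        total = sum(counts[b][idx] + pseudocount for b in 'ACGT')
        prob = (counts[nt][idx] + pseudocount) / total
        score += math.log2(prob / background)
    return score

print(log_odds_score('TATAAAAG'))
print(log_odds_score('ACGCCCCT'))
```

A positive score means the sequence is more likely under the motif model than under the background, and a negative score means the reverse. If the hand-computed values differ from `calculate`, the background or pseudocount assumption is the first thing to check.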

The `calculate` method will score each position of a longer sequence and return the result in an array with one entry per starting position.

In [None]:
print(tata_pssm.calculate('CGCTATAAATATATAAGCGCCCCTAC'))

This makes it easier to extract the sequence from a region of a yeast chromosome and plot the TATA-box score at each position. Below, we use `next()` to get the first entry in the yeast genome sequence file and store it in the `chr1` variable.

In [None]:
import numpy as np
import matplotlib.pyplot as plt

chr1 = next(SeqIO.parse("S288C_R64-2-1/S288C_reference_sequence_R64-2-1_20150113.fsa", "fasta"))
print(chr1.id)

We can score the entire chromosome I sequence for TATA motifs using the `calculate` method, and then plot these scores.

In [None]:
score = tata_pssm.calculate(chr1.seq)
plt.plot(score)

The graph is too dense to see much of anything, since it covers over 200,000 chromosome positions.

The gene _CDC19_, which encodes the glycolytic enzyme pyruvate kinase, starts at position 71786 on chromosome I. Here, we'll extract the scores around the _CDC19_ promoter and gene in the range from 70,000 to 72,500. We'll plot each position along with its score.

In [None]:
cdc19_region = range(70000,72500)
plt.plot(np.array(cdc19_region), score[cdc19_region])

The highest score seems to show up around 71,600, which is just upstream of the transcription start site for _CDC19_.

**Exercise** Plot the region of scores from 71,500 to 71,700. Then zoom in more until you can figure out the exact position of the score peak. Use `plt.figure()` between each plot in order to make a new plot, rather than a new line on the existing plot.

In [None]:
cdc19_promoter = ...
plt.plot(...)
plt.figure()
...

**Exercise** Slice out the sequence of the candidate TATA box and compute its score.

In [None]:
box_seq = ...
print(box_seq)
print(...)