# Motif Searching

We will start by loading `pandas`, a package for working with a kind of data table called a "data frame" in Python. Data frames are similar to spreadsheets -- they have rows and columns.

In [1]:
!pip3 install pandas
import pandas as pd



The cell below creates a data frame with observed nucleotide counts from 389 TATA boxes taken from eukaryotic promoters (Bucher, *J Mol Biol* (1990) **212**, 563-578).

The data frame is built from a dictionary -- each key of the dictionary is a column name, and the associated value is a list of values for that column. In this example, all of the columns have integer values.

In [2]:
tata_counts = pd.DataFrame({'A': [  16, 352,   3, 354, 268, 360, 222, 155],
                            'C': [  46,   0,  10,   0,   0,   3,   2,  44],
                            'G': [  18,   2,   2,   5,   0,  20,  44, 157],
                            'T': [ 309,  35, 374,  30, 121,   6, 121,  33]})
tata_counts

     A   C    G    T
0   16  46   18  309
1  352   0    2   35
2    3  10    2  374
3  354   0    5   30
4  268   0    0  121
5  360   3   20    6
6  222   2   44  121
7  155  44  157   33


Each row is a position in the `TATA` motif, and each column is a nucleotide. It's possible to read off the consensus sequence of `TATA(A/T)A(A/T)(A/G)`, sometimes written `TATAWAWR`, just from looking at the counts in the table.
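
Reading the consensus off the table can also be done programmatically. The sketch below is plain Python rather than pandas, with the counts copied into a dictionary; the helper names and the 25% cutoff used to flag "degenerate" positions are illustrative assumptions, not part of the original analysis.

```python
# Counts copied from the table above; list index = motif position 0-7.
counts = {
    'A': [16, 352, 3, 354, 268, 360, 222, 155],
    'C': [46, 0, 10, 0, 0, 3, 2, 44],
    'G': [18, 2, 2, 5, 0, 20, 44, 157],
    'T': [309, 35, 374, 30, 121, 6, 121, 33],
}

def consensus_from_counts(counts):
    """Return the most frequent nucleotide at each position as a string."""
    length = len(counts['A'])
    best = []
    for position in range(length):
        # Pick the nucleotide with the largest count at this position
        nt = max(counts, key=lambda letter: counts[letter][position])
        best.append(nt)
    return ''.join(best)

def frequent_nucleotides(counts, position, min_fraction=0.25):
    """List nucleotides that make up at least min_fraction of a position.

    The 0.25 cutoff is an arbitrary choice for illustration.
    """
    total = sum(counts[letter][position] for letter in 'ACGT')
    return [letter for letter in 'ACGT'
            if counts[letter][position] / total >= min_fraction]

print(consensus_from_counts(counts))
for position in (4, 6, 7):
    print(position, frequent_nucleotides(counts, position))
```

Positions 4, 6, and 7 each report two frequent nucleotides, which is where the `W` (A or T) and `R` (A or G) codes in `TATAWAWR` come from.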

Data frames have many useful methods. For instance, we can use the `.sum()` method to add up the counts across each row (`.sum(1)` sums along axis 1, the columns) and create a new, column-like result with one total per motif position.

In [5]:
tata_counts.sum(1)

0    389
1    389
2    389
3    389
4    389
5    389
6    389
7    389
dtype: int64

We can then turn these counts into probabilities by dividing each nucleotide count by the total number of sequences counted. That is, if 35 out of 389 TATA-box sequences have a `T` at the second position (row index 1), then the probability of a `T` at that position in a random TATA-box sequence is 35/389, or about 9%.

In [9]:
tata_probs = tata_counts / 389
tata_probs

          A         C         G         T
0  0.041131  0.118252  0.046272  0.794344
1  0.904884  0.000000  0.005141  0.089974
2  0.007712  0.025707  0.005141  0.961440
3  0.910026  0.000000  0.012853  0.077121
4  0.688946  0.000000  0.000000  0.311054
5  0.925450  0.007712  0.051414  0.015424
6  0.570694  0.005141  0.113111  0.311054
7  0.398458  0.113111  0.403599  0.084833


We can index into data frames using square brackets, just like we index into lists and dictionaries. The format for doing this is to index the column first, by name, and then the row, by number. The cell below looks up column `A` and then finds row index 1 in that column -- this is the second row, since Python starts counting from 0.

In [12]:
tata_probs['A'][1]

0.9048843187660668

We want to build a probabilistic model of TATA-box sequences. In this example, we'll assume that each nucleotide position in the TATA-box is independent. That allows us to multiply each of the probabilities together:

```
P(TATAAAAG) = P(T at 0) * P(A at 1) * ... * P(G at 7)
```
Of course, we need to know which position of the motif we're looking at in order to do this -- `P(T at 0)` is very different from `P(T at 1)`! The `enumerate()` function lets us loop over a string or list and keep track of the position as well. Our `for` loop then needs two variables -- `position` for the index and `nt` for the nucleotide itself.

In [15]:
sequ = 'TATAAAAG'
for position,nt in enumerate(sequ):
    print("position=", position, "nt=",nt)

position= 0 nt= T
position= 1 nt= A
position= 2 nt= T
position= 3 nt= A
position= 4 nt= A
position= 5 nt= A
position= 6 nt= A
position= 7 nt= G


Now, we'll write a `for` loop to iterate over the sequence and compute a running probability `P(sequence | motif)` -- the likelihood of our sequence under the TATA-box model.

In [17]:
prob = 1
for position, nt in enumerate(sequ):
    p = tata_probs[nt][position]
    prob = prob * p
    print(position, nt, p, prob)

0 T 0.794344473007712 0.794344473007712
1 A 0.9048843187660668 0.7187898573231738
2 T 0.961439588688946 0.6910730247785783
3 A 0.910025706940874 0.6288942179218939
4 A 0.6889460154241646 0.43327416556058507
5 A 0.9254498714652957 0.40097352082727666
6 A 0.570694087403599 0.22883321754153063
7 G 0.40359897172236503 0.09235685129568202


We computed the probability of one specific sequence -- the "perfect" TATA-box `TATAAAAG` -- under our probabilistic model of a TATA-box. Now compute the probability of the "worst" TATA box sequence, `ACGCGCCT`.

In [18]:
sequ = 'ACGCGCCT'
prob = 1
for position, nt in enumerate(sequ):
    p = tata_probs[nt][position]
    prob = prob * p
    print(position, nt, p, prob)

0 A 0.04113110539845758 0.04113110539845758
1 C 0.0 0.0
2 G 0.005141388174807198 0.0
3 C 0.0 0.0
4 G 0.0 0.0
5 C 0.007712082262210797 0.0
6 C 0.005141388174807198 0.0
7 T 0.08483290488431877 0.0


While `P(ACGCGCCT)` is low, it probably shouldn't actually be 0 in our model. Positions 1, 3, and 4 in our TATA-box data have 0 counts for certain nucleotides. Of course, we only counted 389 TATA-boxes -- perhaps if we counted 389,000 different TATA-boxes we'd find a few with `C` and `G` nucleotides right in the middle.

We often handle these situations by adding a _pseudocount_ to our data. In essence, we add a fake count for every nucleotide at every position, in order to eliminate zero counts. The impact of this pseudocount depends on the number of real counts. If we add one pseudocount to 9 real observations, it represents 10% of our overall counts, but if we add one pseudocount to 999 real observations, it's only 0.1%.
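
The arithmetic behind those percentages can be checked in a few lines. This is only a sketch; `pseudocount_share` is a made-up helper name, not something defined elsewhere in this notebook.

```python
def pseudocount_share(real_counts, pseudocount=1.0):
    """Fraction of the combined total that comes from the pseudocount."""
    return pseudocount / (real_counts + pseudocount)

# One pseudocount added to 9 or to 999 real observations
for n in (9, 999):
    print(n, pseudocount_share(n))

# A pseudocount of 0.25 for each of the four nucleotides (1 in total)
# added to 389 real observations at one motif position
print(pseudocount_share(389, pseudocount=4 * 0.25))
```

With 389 real sequences, a total pseudocount of 1 per position makes up about 0.26% of the counts, so it removes the zeros while barely changing the other probabilities.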

It's easy to just add 0.25 to every entry in our count table, and use this to compute a new table of probabilities.

In [23]:
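# Add a pseudocount of 0.25 to every cell (4 x 0.25 = 1 extra count per position)
tata_counts_pseudo = tata_counts + 0.25
print(tata_counts_pseudo)
print(tata_counts_pseudo.sum(axis=1))
# Divide each row by its own total instead of reusing the first row's total
tata_probs = tata_counts_pseudo.div(tata_counts_pseudo.sum(axis=1), axis=0)
print(tata_probs)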

        A      C       G       T
0   16.25  46.25   18.25  309.25
1  352.25   0.25    2.25   35.25
2    3.25  10.25    2.25  374.25
3  354.25   0.25    5.25   30.25
4  268.25   0.25    0.25  121.25
5  360.25   3.25   20.25    6.25
6  222.25   2.25   44.25  121.25
7  155.25  44.25  157.25   33.25
0    390.0
1    390.0
2    390.0
3    390.0
4    390.0
5    390.0
6    390.0
7    390.0
dtype: float64
          A         C         G         T
0  0.041667  0.118590  0.046795  0.792949
1  0.903205  0.000641  0.005769  0.090385
2  0.008333  0.026282  0.005769  0.959615
3  0.908333  0.000641  0.013462  0.077564
4  0.687821  0.000641  0.000641  0.310897
5  0.923718  0.008333  0.051923  0.016026
6  0.569872  0.005769  0.113462  0.310897
7  0.398077  0.113462  0.403205  0.085256


Use the new probabilistic model with pseudocounts to compute the probabilities of the "best" and "worst" TATA-box sequences.

In [25]:
sequ = 'TATAAAAG'
prob = 1
for position, nt in enumerate(sequ):
    p = tata_probs[nt][position]
    prob = prob * p
    print(position, nt, p, prob)
    
sequ = 'ACGCGCCT'
prob = 1
for position, nt in enumerate(sequ):
    p = tata_probs[nt][position]
    prob = prob * p
    print(position, nt, p, prob)

0 T 0.7929487179487179 0.7929487179487179
1 A 0.9032051282051282 0.7161953484549638
2 T 0.9596153846153846 0.6872720747673595
3 A 0.9083333333333333 0.6242721345803516
4 A 0.6878205128205128 0.4293871797466136
5 A 0.9237179487179488 0.3966326448813271
6 A 0.5698717948717948 0.2260297572432691
7 G 0.4032051282051282 0.09113635724744633
0 A 0.041666666666666664 0.041666666666666664
1 C 0.000641025641025641 2.670940170940171e-05
2 G 0.0057692307692307696 1.5409270216962526e-07
3 C 0.000641025641025641 9.877737318565722e-11
4 G 0.000641025641025641 6.331882896516489e-14
5 C 0.008333333333333333 5.276569080430407e-16
6 C 0.0057692307692307696 3.0441744694790814e-18
7 T 0.08525641025641026 2.5953538746199862e-19


It's getting tedious to write the same `for` loop every time we want to try a different sequence.

We can write our own function, `likelihood_tata()`, that will compute the likelihood of a sequence under our TATA-box probability model. We _define_ a function with `def` followed by the function name. The _arguments_ to the function are named in parentheses, and inside the function, these become _variables_ that take on a different value each time we use the function.

In the cell below, we define and then run the `likelihood_tata` function. During the first call, `sequ` inside the function is a variable with the value `TATAAAAG`; during the second call, it has the value `ATATATAT`.

In [29]:
def likelihood_tata(sequ):
    # Multiply together the model probability of each nucleotide in sequ
    prob = 1
    for position, nt in enumerate(sequ):
        p = tata_probs[nt][position]
        prob = prob * p
    return prob
print(likelihood_tata('TATAAAAG'))
print(likelihood_tata('ATATATAT'))

0.09113635724744633
1.3036395799606078e-09


Now we can easily use our function to compute the likelihood of some other possible TATA-box sequences. For example, the three sequences below are "very good" TATA-boxes that differ from the "best" TATA box at one of the three "degenerate" positions in the motif. Notice that the overall probability of getting one of these three imperfect motifs is substantially higher than the probability of the perfect TATA-box. In fact, although the TATA-box is a strong motif, fewer than 10% of the sequences generated according to our model will actually match the "best" sequence.

In [30]:
prob_TATATAAG = likelihood_tata('TATATAAG')
prob_TATAAATG = likelihood_tata('TATAAATG')
prob_TATAAAAA = likelihood_tata('TATAAAAA')
print(prob_TATATAAG, 
      prob_TATAAATG, 
      prob_TATAAAAA,
      prob_TATATAAG + prob_TATAAATG + prob_TATAAAAA)

0.041193973219954765 0.04972005991564844 0.08997723028722442 0.18089126342282763
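
To put the "fewer than 10%" figure in context, the sketch below rebuilds the pseudocount probability table in plain Python and sums the model probability over groups of sequences. The dictionary-based table and the `likelihood` helper are stand-ins for `tata_probs` and `likelihood_tata()` so that the block runs on its own.

```python
from itertools import product

# Counts copied from the table above; list index = motif position 0-7.
counts = {
    'A': [16, 352, 3, 354, 268, 360, 222, 155],
    'C': [46, 0, 10, 0, 0, 3, 2, 44],
    'G': [18, 2, 2, 5, 0, 20, 44, 157],
    'T': [309, 35, 374, 30, 121, 6, 121, 33],
}
PSEUDO = 0.25

# probs[nt][position], using the same pseudocount as the notebook
probs = {nt: [] for nt in 'ACGT'}
for position in range(8):
    total = sum(counts[nt][position] + PSEUDO for nt in 'ACGT')
    for nt in 'ACGT':
        probs[nt].append((counts[nt][position] + PSEUDO) / total)

def likelihood(sequ):
    """Probability of sequ under the independent-position TATA model."""
    prob = 1
    for position, nt in enumerate(sequ):
        prob = prob * probs[nt][position]
    return prob

# Summing over all 4**8 possible 8-mers should give 1 (up to rounding)
total_prob = sum(likelihood(''.join(kmer))
                 for kmer in product('ACGT', repeat=8))

# All sequences matching TATAWAWR (W = A or T, R = A or G)
choices = ['T', 'A', 'T', 'A', 'AT', 'A', 'AT', 'AG']
matching = sum(likelihood(''.join(kmer)) for kmer in product(*choices))

print(total_prob)
print(likelihood('TATAAAAG'), matching)
```

The eight sequences matching the degenerate consensus together account for far more probability than the single "best" sequence, which is the same point the three-sequence sum above makes on a smaller scale.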


Now, let's search for TATA boxes in yeast genomic sequence. Of course, the TATA-box is a very AT-rich motif and the yeast genome is very AT-rich, and we need to take this fact into account. We'll start with a background model for yeast nucleotide sequence based on our previous analysis. For simplicity, we won't include the nearest-neighbor correlations we discussed last time.

In [31]:
background = {'A': 0.31, 'C': 0.19, 'G': 0.19, 'T': 0.31}
print(background)

{'A': 0.31, 'C': 0.19, 'G': 0.19, 'T': 0.31}


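Below, we'll write the function `likelihood_bkgnd(sequ)` to compute the likelihood of a sequence under our (very simple) background model, and use it to compute the probability of the "best" TATA-box sequence `TATAAAAG` and the "worst" sequence `ACGCGCCT` arising purely by chance.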

In [39]:
def likelihood_bkgnd(sequ):
    prob = 1
    for nt in sequ:
        prob_nt = background[nt]
        prob = prob * prob_nt
        #print(nt, prob_nt, prob)
    return(prob)
print(likelihood_bkgnd('TATAAAAG'))
print(likelihood_bkgnd('ACGCGCCT'))

5.227396681090001e-05
4.5211091640999994e-06


The `likelihood_bkgnd(sequ)` function above gives the likelihood of any sequence under our background model, including the background likelihood of the "worst" TATA-box.

Since the "worst" TATA-box is GC-rich and the "best" TATA-box is AT-rich, the chance of getting the "best" TATA-box in random sequence is roughly 12 times higher. Of course, the chance of getting the "best" sequence under our TATA-box probabilistic model is dramatically higher than the chance of getting the "worst" sequence. We can use the _ratio_ of the likelihoods as a measure of how well two different models fit a given sequence.

Below, we compute the likelihood ratios for the "best" sequence `TATAAAAG`, the "worst" sequence `ACGCGCCT`, and getting any one of the three very-good sequences `TATATAAG`, `TATAAATG`, and `TATAAAAA`.

In [37]:
print(likelihood_tata('TATAAAAG') / likelihood_bkgnd('TATAAAAG'))
print(likelihood_tata('ACGCGCCT') / likelihood_bkgnd('ACGCGCCT'))

print( (likelihood_tata('TATATAAG') + likelihood_tata('TATAAATG') + likelihood_tata('TATAAAAA'))
       / (likelihood_bkgnd('TATATAAG') + likelihood_bkgnd('TATAAATG') + likelihood_bkgnd('TATAAAAA')) )

1743.4367966971058
5.74052468192644e-14
952.8765615645912


We'll write the `likelihood_ratio()` function below to compute the likelihood ratio between the TATA-box and background models.

In [38]:
def likelihood_ratio(sequ):
    return(likelihood_tata(sequ) / likelihood_bkgnd(sequ))
print(likelihood_ratio('TATAAAAG'))
print(likelihood_ratio('ACGCGCCT'))

1743.4367966971058
5.74052468192644e-14


Now that we have a likelihood ratio function, we can use it to search the entire yeast genome for potential TATA-boxes! We'll need to start by installing Biopython and importing the `Bio.SeqIO` module, which lets us read in and parse a FASTA file.

In [40]:
!pip install biopython



In [41]:
from Bio import SeqIO

First, we'll find every perfect match to the "ideal" `TATAAAAG` sequence in the yeast genome and then print the chromosome name and position.

In [43]:
for record in SeqIO.parse("../S288C_R64-2-1/S288C_reference_sequence_R64-2-1_20150113.fsa", "fasta"):
    chrname = record.id
    chrseq = str(record.seq)
    for position in range(0,len(chrseq) - 7):
        if chrseq[position:position+8] == 'TATAAAAG':
            print(chrname, position)
            

ref|NC_001133| 23391
ref|NC_001133| 35005
ref|NC_001133| 55261
ref|NC_001133| 143155
ref|NC_001133| 146280
ref|NC_001133| 201501
ref|NC_001133| 203305
ref|NC_001134| 39041
ref|NC_001134| 73204
ref|NC_001134| 80368
ref|NC_001134| 84446
ref|NC_001134| 87952
ref|NC_001134| 90631
ref|NC_001134| 122353
ref|NC_001134| 131985
ref|NC_001134| 140395
ref|NC_001134| 164037
ref|NC_001134| 167342
ref|NC_001134| 180457
ref|NC_001134| 193623
ref|NC_001134| 256222
ref|NC_001134| 270130
ref|NC_001134| 351147
ref|NC_001134| 365960
ref|NC_001134| 370835
ref|NC_001134| 437694
ref|NC_001134| 442686
ref|NC_001134| 474657
ref|NC_001134| 497672
ref|NC_001134| 504325
ref|NC_001134| 533640
ref|NC_001134| 547875
ref|NC_001134| 566277
ref|NC_001134| 621098
ref|NC_001134| 624782
ref|NC_001134| 676878
ref|NC_001134| 685062
ref|NC_001134| 704465
ref|NC_001134| 706930
ref|NC_001134| 722509
ref|NC_001134| 774600
ref|NC_001134| 776456
ref|NC_001134| 781943
ref|NC_001134| 786381
ref|NC_001134| 789142
ref|NC_001134| 7927

ref|NC_001146| 264805
ref|NC_001146| 271472
ref|NC_001146| 275737
ref|NC_001146| 323232
ref|NC_001146| 339461
ref|NC_001146| 382111
ref|NC_001146| 463030
ref|NC_001146| 518976
ref|NC_001146| 536593
ref|NC_001146| 561344
ref|NC_001146| 576134
ref|NC_001146| 583458
ref|NC_001146| 629114
ref|NC_001146| 716007
ref|NC_001146| 727387
ref|NC_001146| 731872
ref|NC_001146| 744069
ref|NC_001147| 28394
ref|NC_001147| 45288
ref|NC_001147| 87386
ref|NC_001147| 96577
ref|NC_001147| 125884
ref|NC_001147| 145818
ref|NC_001147| 150254
ref|NC_001147| 183742
ref|NC_001147| 234199
ref|NC_001147| 268115
ref|NC_001147| 331099
ref|NC_001147| 357611
ref|NC_001147| 391318
ref|NC_001147| 423618
ref|NC_001147| 429925
ref|NC_001147| 470754
ref|NC_001147| 483156
ref|NC_001147| 506035
ref|NC_001147| 519569
ref|NC_001147| 539938
ref|NC_001147| 562636
ref|NC_001147| 581354
ref|NC_001147| 592935
ref|NC_001147| 660858
ref|NC_001147| 685711
ref|NC_001147| 690223
ref|NC_001147| 697869
ref|NC_001147| 722818
ref|NC_001147|

Now, we'll adapt the `for` loop to compute the likelihood ratio for each position, and print the name, position, sequence, and likelihood ratio whenever the ratio is greater than 200.

You may want to stop this before it finishes running! It's both verbose and time-consuming...

In [44]:
for record in SeqIO.parse("../S288C_R64-2-1/S288C_reference_sequence_R64-2-1_20150113.fsa", "fasta"):
    chrname = record.id
    chrseq = str(record.seq)
    for position in range(0,len(chrseq) - 7):
        tataseq = chrseq[position:position+8]
        lr = likelihood_ratio(tataseq)
        if lr > 200:
            print(chrname, position, tataseq,lr)

ref|NC_001133| 756 CATAAAAA 257.42409309736615
ref|NC_001133| 2508 TATAAAAC 490.6014515347976
ref|NC_001133| 3747 CATAAATG 232.08993515238404
ref|NC_001133| 5010 TATAAAAA 1054.9674734206444
ref|NC_001133| 5520 TATAAAAA 1054.9674734206444
ref|NC_001133| 5974 TATAAAAA 1054.9674734206444
ref|NC_001133| 7085 TATAAAAA 1054.9674734206444
ref|NC_001133| 7132 TATATATA 260.1483429923473
ref|NC_001133| 7134 TATATAAA 476.84923076329227
ref|NC_001133| 7565 TATAAAAC 490.6014515347976
ref|NC_001133| 7603 CATAAATG 232.08993515238404
ref|NC_001133| 9026 TATATATG 429.9205475046852
ref|NC_001133| 9141 TATAAATA 575.5446845995642
ref|NC_001133| 11074 TATAAAAA 1054.9674734206444
ref|NC_001133| 16132 TATAAAAT 225.94311427527495
ref|NC_001133| 16499 CATAAATG 232.08993515238404
ref|NC_001133| 16767 TATAAAGA 342.703601128357
ref|NC_001133| 16943 CATAAAAA 257.42409309736615
ref|NC_001133| 18162 CATAAAAA 257.42409309736615
ref|NC_001133| 18894 TATAAAAA 1054.9674734206444
ref|NC_001133| 22053 CATAAAAG 425.4184584

ref|NC_001133| 183302 TATAAAAC 490.6014515347976
ref|NC_001133| 184512 TATAAAAT 225.94311427527495
ref|NC_001133| 187831 TATAAATA 575.5446845995642
ref|NC_001133| 189591 TATAAAAC 490.6014515347976
ref|NC_001133| 190099 TATATATA 260.1483429923473
ref|NC_001133| 196552 TATAAAGA 342.703601128357
ref|NC_001133| 197040 TATAAATC 267.650960623596
ref|NC_001133| 197839 TATAAATC 267.650960623596
ref|NC_001133| 198419 TATAAAAA 1054.9674734206444
ref|NC_001133| 200084 TATAAAAT 225.94311427527495
ref|NC_001133| 200198 CATAAAAA 257.42409309736615
ref|NC_001133| 200492 TATAAAAA 1054.9674734206444
ref|NC_001133| 200734 TATAAATA 575.5446845995642
ref|NC_001133| 201365 TATAAATC 267.650960623596
ref|NC_001133| 201501 TATAAAAG 1743.4367966971058
ref|NC_001133| 203305 TATAAAAG 1743.4367966971058
ref|NC_001133| 213347 CATAAAAA 257.42409309736615
ref|NC_001133| 214188 TATAAAAA 1054.9674734206444
ref|NC_001133| 215045 TATAAAAT 225.94311427527495
ref|NC_001133| 216247 TATAAAAA 1054.9674734206444
ref|NC_001133

ref|NC_001134| 136462 TATAAATA 575.5446845995642
ref|NC_001134| 136468 TATATATA 260.1483429923473
ref|NC_001134| 136470 TATATATA 260.1483429923473
ref|NC_001134| 136472 TATATATA 260.1483429923473
ref|NC_001134| 136474 TATATATA 260.1483429923473
ref|NC_001134| 137598 TATATATA 260.1483429923473
ref|NC_001134| 138634 TATATAAG 788.039931405495
ref|NC_001134| 140393 TATATAAA 476.84923076329227
ref|NC_001134| 140395 TATAAAAG 1743.4367966971058
ref|NC_001134| 141988 TATAAAAA 1054.9674734206444
ref|NC_001134| 143926 CATAAAAA 257.42409309736615
ref|NC_001134| 146912 TATATAAC 221.75368499009954
ref|NC_001134| 147445 TATATATA 260.1483429923473
ref|NC_001134| 147447 TATATATA 260.1483429923473
ref|NC_001134| 149734 CATAAAAG 425.41845845457607
ref|NC_001134| 150261 TATAAAGA 342.703601128357
ref|NC_001134| 150330 TATATAAA 476.84923076329227
ref|NC_001134| 150350 TATATATA 260.1483429923473
ref|NC_001134| 150352 TATATATA 260.1483429923473
ref|NC_001134| 150354 TATATATA 260.1483429923473
ref|NC_001134| 

KeyboardInterrupt: 
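
Much of the running time above probably goes into repeated data frame lookups inside `likelihood_tata()`. One possible way to speed up the scan, sketched here as an alternative rather than as the method used above, is to precompute the log of the per-position likelihood ratio once and then add up table entries for each window. The `scan` function, the `log_ratio` table, and the toy sequence are illustrative assumptions.

```python
import math

# Counts copied from the table above; list index = motif position 0-7.
counts = {
    'A': [16, 352, 3, 354, 268, 360, 222, 155],
    'C': [46, 0, 10, 0, 0, 3, 2, 44],
    'G': [18, 2, 2, 5, 0, 20, 44, 157],
    'T': [309, 35, 374, 30, 121, 6, 121, 33],
}
background = {'A': 0.31, 'C': 0.19, 'G': 0.19, 'T': 0.31}
PSEUDO = 0.25
WIDTH = len(counts['A'])

# log_ratio[position][nt] = log(P(nt | TATA model) / P(nt | background))
log_ratio = []
for position in range(WIDTH):
    total = sum(counts[nt][position] + PSEUDO for nt in 'ACGT')
    log_ratio.append({
        nt: math.log((counts[nt][position] + PSEUDO) / total / background[nt])
        for nt in 'ACGT'
    })

def scan(sequence, threshold=200):
    """Yield (start, window, likelihood ratio) for windows above threshold."""
    log_threshold = math.log(threshold)
    for start in range(len(sequence) - WIDTH + 1):
        window = sequence[start:start + WIDTH]
        # Skip windows containing anything other than A, C, G, T (e.g. N)
        if any(nt not in 'ACGT' for nt in window):
            continue
        score = sum(log_ratio[i][nt] for i, nt in enumerate(window))
        if score > log_threshold:
            yield start, window, math.exp(score)

# Short toy sequence instead of a whole chromosome
toy = 'CGCTATAAATATATAAGCGCCCCTAC'
hits = list(scan(toy))
for start, window, ratio in hits:
    print(start, window, ratio)
```

For a real chromosome, `scan(str(record.seq))` would take the place of the inner loop; the ratios it reports should agree with `likelihood_ratio()` up to floating-point rounding.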

Biopython actually includes some tools to work with motifs, in the aptly-named `motifs` module.

In [45]:
from Bio import motifs

The `motifs` module contains a special kind of object, a `Motif`, that represents a sequence motif. We can create a `Motif` directly from our nucleotide counts.

In [55]:
tata_motif = motifs.Motif(counts=tata_counts)
tata_motif

<Bio.motifs.Motif at 0x7f56352e16d8>

The `Motif` object has some useful attributes. For instance, we can get the "best" sequence with `consensus`, the "worst" sequence with `anticonsensus`, and a representation of the motif with degenerate nucleotides using `degenerate_consensus`.

We can also get a table of probabilities for each position, a Position Weight Matrix (PWM), using `pwm`. Note that this table is transposed relative to our data frame: nucleotides are rows and motif positions are columns.

In [57]:
print(tata_motif.consensus)
print(tata_motif.anticonsensus)
print(tata_motif.degenerate_consensus)
print(tata_motif.pwm)

TATAAAAG
ACGCCCCT
TATAAAWR
        0      1      2      3      4      5      6      7
A:   0.04   0.90   0.01   0.91   0.69   0.93   0.57   0.40
C:   0.12   0.00   0.03   0.00   0.00   0.01   0.01   0.11
G:   0.05   0.01   0.01   0.01   0.00   0.05   0.11   0.40
T:   0.79   0.09   0.96   0.08   0.31   0.02   0.31   0.08



The PWM from the `Motif` object has the same problem as the probability matrix we constructed by hand: we see zero probabilities for certain nucleotides at key positions in the motif. We can set the `pseudocounts` property on the motif in order to include pseudocounts in our calculation. Since only two decimal places are displayed, we rarely see the effects of these pseudocounts, although we can see that the probability of `T` at the last position goes from 0.08 to 0.09.

In [59]:
tata_motif.pseudocounts=1
print(tata_motif.pwm)

        0      1      2      3      4      5      6      7
A:   0.04   0.90   0.01   0.90   0.68   0.92   0.57   0.40
C:   0.12   0.00   0.03   0.00   0.00   0.01   0.01   0.11
G:   0.05   0.01   0.01   0.02   0.00   0.05   0.11   0.40
T:   0.79   0.09   0.95   0.08   0.31   0.02   0.31   0.09
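
The change from 0.08 to 0.09 can be checked by hand. The sketch below assumes that the pseudocount is added to each of the four nucleotide counts at a position before normalizing, which is consistent with the printed values but is stated here as an assumption about Biopython's calculation.

```python
total = 389     # real observations at each position
pseudo = 1      # the value assigned to tata_motif.pseudocounts

def prob(count, pseudo=0):
    """Probability of one nucleotide, with pseudo added to all four counts."""
    return (count + pseudo) / (total + 4 * pseudo)

# T at the last position: 33 real counts
print(round(prob(33), 2), round(prob(33, pseudo), 2))

# C at position 1: 0 real counts; no longer zero, but still displays as 0.0
print(prob(0, pseudo), round(prob(0, pseudo), 2))
```

The zero-count cells become 1/393, roughly 0.0025, which is too small to show up at two decimal places.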



When scoring motifs, we multiplied probabilities together. Some of the products got quite small -- lower than 1e-16. Products of many small numbers can eventually underflow floating-point arithmetic, so we often take the log of probabilities instead. When we do this, multiplication of probabilities turns into addition of log-probabilities, which is also more convenient. We can then consider the log-probability as a "score" for nucleotides and add up the score for each position. A matrix of log-probability scores is called a Position-Specific Scoring Matrix (PSSM).

In [60]:
print(tata_motif.pssm)

        0      1      2      3      4      5      6      7
A:  -2.53   1.85  -4.62   1.85   1.45   1.88   1.18   0.67
C:  -1.06  -6.62  -3.16  -6.62  -6.62  -4.62  -5.03  -1.13
G:  -2.37  -5.03  -5.03  -4.03  -6.62  -2.23  -1.13   0.69
T:   1.66  -1.45   1.93  -1.66   0.31  -3.81   0.31  -1.53
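
The equivalence between multiplying probabilities and adding their logs can be checked directly. This sketch uses the per-position probabilities of the "worst" sequence `ACGCGCCT` copied from the pseudocount output earlier; the underflow demonstration at the end uses made-up numbers.

```python
import math

# Per-position probabilities for ACGCGCCT, copied from the earlier output
probs = [0.041666666666666664, 0.000641025641025641, 0.0057692307692307696,
         0.000641025641025641, 0.000641025641025641, 0.008333333333333333,
         0.0057692307692307696, 0.08525641025641026]

# Multiply the probabilities directly
direct = 1.0
for p in probs:
    direct = direct * p

# Add up the logs, then convert back to a probability
log_sum = sum(math.log(p) for p in probs)
print(direct, math.exp(log_sum))

# Made-up illustration of underflow: 400 factors of 0.001
tiny = 1.0
for _ in range(400):
    tiny = tiny * 1e-3
log_tiny = 400 * math.log(1e-3)
print(tiny, log_tiny)   # the product collapses to 0.0, the log sum does not
```

An eight-position motif is nowhere near the underflow limit, but the same scoring approach applied to long sequences or many sites would be.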



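The PSSM from the `pssm` property is actually another special Biopython object. Its scores are log-odds in base 2: the log of each PWM probability divided by a background probability, which is uniform (0.25 per nucleotide) unless a background is set on the motif. In addition to this table of scores, it also has methods to score sequences.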

In [62]:
tata_motif.pssm.calculate('TATAAAAG')

12.486985
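
This score can be reproduced by hand, which helps show what the PSSM entries mean. The sketch below assumes base-2 log-odds against a uniform background of 0.25, with a pseudocount of 1 added to every count; those assumptions match the printed matrix, but the helper functions are illustrative and not part of Biopython.

```python
import math

# Counts copied from the table above; list index = motif position 0-7.
counts = {
    'A': [16, 352, 3, 354, 268, 360, 222, 155],
    'C': [46, 0, 10, 0, 0, 3, 2, 44],
    'G': [18, 2, 2, 5, 0, 20, 44, 157],
    'T': [309, 35, 374, 30, 121, 6, 121, 33],
}
PSEUDO = 1.0        # same value as tata_motif.pseudocounts above
BACKGROUND = 0.25   # assumed uniform background probability

def log_odds(nt, position):
    """log2 of (probability of nt at position) / (background probability)."""
    total = sum(counts[letter][position] + PSEUDO for letter in 'ACGT')
    prob = (counts[nt][position] + PSEUDO) / total
    return math.log2(prob / BACKGROUND)

def score(sequ):
    """Sum of the per-position log-odds scores for an 8-nucleotide sequence."""
    return sum(log_odds(nt, position) for position, nt in enumerate(sequ))

print(round(log_odds('A', 1), 2), round(log_odds('C', 1), 2))
print(score('TATAAAAG'))
```

The small difference in the last digits relative to Biopython's result is expected, since `calculate` returns 32-bit floats.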

The `calculate` method will score each position of a longer sequence and return the result in an array with one entry per starting position.

In [63]:
tata_motif.pssm.calculate('CGCTATAAATATATAAGCGCCCCTAC')

array([-11.428785 ,  -1.16384  ,  -5.8126936,  11.598444 ,  -9.450841 ,
         3.906972 ,  -9.674721 ,   6.2690573, -10.77038  ,  11.34626  ,
        -9.184476 ,   2.1675098, -20.234314 , -20.15334  , -29.734318 ,
       -32.15345  , -38.44177  , -26.131962 , -29.139618 ], dtype=float32)
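
Each entry in the array corresponds to the 8-nucleotide window starting at that index, so a 26-nucleotide sequence gives 26 - 8 + 1 = 19 scores. The sketch below copies the scores from the output above into a plain list and pairs the highest ones with their windows.

```python
sequence = 'CGCTATAAATATATAAGCGCCCCTAC'
WIDTH = 8

# Scores copied from the calculate() output above
scores = [-11.428785, -1.16384, -5.8126936, 11.598444, -9.450841,
          3.906972, -9.674721, 6.2690573, -10.77038, 11.34626,
          -9.184476, 2.1675098, -20.234314, -20.15334, -29.734318,
          -32.15345, -38.44177, -26.131962, -29.139618]

# Window start positions, ordered from highest to lowest score
ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)

for start in ranked[:3]:
    print(start, sequence[start:start + WIDTH], scores[start])
```

The two top-scoring windows, at indices 3 and 9, are the two TATA-like sequences embedded in the test string.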

This makes it easier to extract the sequence from a region of a yeast chromosome and plot the TATA-box score at each position. Below, we use `next()` to get the first entry in the yeast genome sequence and store it in the `chr1` variable.

In [None]:
import numpy as np
import matplotlib.pyplot as plt

chr1 = next(SeqIO.parse("../S288C_R64-2-1/S288C_reference_sequence_R64-2-1_20150113.fsa", "fasta"))
print(chr1.id)

We can score the entire chromosome I sequence for TATA motifs using the `calculate` method, and then plot these scores.

The graph is too dense to see much of anything, since it covers over 200,000 chromosome positions.

The glycolytic enzyme pyruvate kinase, encoded by _CDC19_, starts at position 71786 on chromosome I. 

Here, we'll extract the scores around the _CDC19_ promoter and gene in the range from 70,000 to 72,500. We'll plot each position along with its score.

In [None]:
cdc19_region = range(70000,72500)


The highest score seems to show up around 71,600, which is just upstream of the transcription start site for _CDC19_.

Now, we'll plot the region of scores from 71,500 to 71,700. Then zoom in more until you can figure out the exact position of the score peak. Use `plt.figure()` between each plot in order to make a new plot, rather than a new line on the existing plot.

Finally, we'll slice out the sequence of the candidate TATA box and compute its score.
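
Zooming in by eye can be complemented by finding the peak programmatically. The sketch below does not read the genome file: it uses a made-up stand-in region (GC repeats with one planted `TATAAAAG`) and a hypothetical `region_start` offset, and it scores windows with a plain-Python log-odds function that assumes a uniform background and a pseudocount of 1.

```python
import math

# Counts copied from the table above; list index = motif position 0-7.
counts = {
    'A': [16, 352, 3, 354, 268, 360, 222, 155],
    'C': [46, 0, 10, 0, 0, 3, 2, 44],
    'G': [18, 2, 2, 5, 0, 20, 44, 157],
    'T': [309, 35, 374, 30, 121, 6, 121, 33],
}
WIDTH = 8

def window_score(window, pseudo=1.0, background=0.25):
    """Base-2 log-odds score of one 8-nucleotide window."""
    score = 0.0
    for position, nt in enumerate(window):
        total = sum(counts[letter][position] + pseudo for letter in 'ACGT')
        prob = (counts[nt][position] + pseudo) / total
        score += math.log2(prob / background)
    return score

# Made-up region, NOT real yeast sequence
region_start = 71500    # hypothetical chromosome coordinate of the region
region_seq = 'GC' * 50 + 'TATAAAAG' + 'GC' * 50

# One score per window start position within the region
scores = [window_score(region_seq[i:i + WIDTH])
          for i in range(len(region_seq) - WIDTH + 1)]

# Offset of the highest score, converted back to a chromosome coordinate
peak_offset = max(range(len(scores)), key=lambda i: scores[i])
peak_position = region_start + peak_offset
candidate = region_seq[peak_offset:peak_offset + WIDTH]
print(peak_position, candidate, scores[peak_offset])
```

With the notebook's own objects, the same idea would use the array returned by `tata_motif.pssm.calculate()` on the chromosome sequence, `np.argmax()` on the slice of interest, and a slice of `chr1.seq` to pull out the candidate.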