### Data frames

In addition to the `Series`, Pandas provides the `DataFrame`, which has rows and columns like a table or a spreadsheet. Pandas data frames are similar to (and based on) data frames in the statistical programming language R.

We can build a data frame from a dictionary in which each entry is one _column_. Each dictionary _key_ is a column header, and the associated _value_ is a list holding that column's entries. The `pd.DataFrame()` function creates the data frame.

```
nucls = pd.DataFrame({'letter': [ 'A', 'C', 'G', 'T' ],
                      'name': ['adenine', 'cytosine', 'guanine', 'thymine'],
                      'ring': ['purine', 'pyrimidine', 'purine', 'pyrimidine']})
```

In [5]:
import pandas as pd
nucls = pd.DataFrame({ 'letter': [ 'A', 'C', 'G', 'T'], 
                      'name': ['adenine', 'cytosine', 'guanine', 'thymine'],
                     'ring': ['purine', 'pyrimidine', 'purine', 'pyrimidine']})
nucls

| | letter | name | ring |
|---|---|---|---|
| 0 | A | adenine | purine |
| 1 | C | cytosine | pyrimidine |
| 2 | G | guanine | purine |
| 3 | T | thymine | pyrimidine |


We can extract one column of a `DataFrame` as a `Series` using square brackets to index it by the name of the column:
```
nucls['name']
```

In [7]:
nucls['letter']

0    A
1    C
2    G
3    T
Name: letter, dtype: object

We can then index by row into the `Series` with a second set of square brackets:
```
nucls['letter'][2]
```


In [8]:
nucls['letter'][2]

'G'

A range inside the square brackets selects _rows_ of the data frame rather than a column; here, rows 2 and 3:

In [10]:
nucls[2:4]

| | letter | name | ring |
|---|---|---|---|
| 2 | G | guanine | purine |
| 3 | T | thymine | pyrimidine |
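
The chained lookup `nucls['letter'][2]` works here because the row labels are the default integers 0 to 3. As a hedged aside, pandas also provides the `.loc` (label-based) and `.iloc` (position-based) indexers, which pick a row and a column in one step. A minimal sketch, with variable names chosen only for illustration:

```python
import pandas as pd

nucls = pd.DataFrame({'letter': ['A', 'C', 'G', 'T'],
                      'name': ['adenine', 'cytosine', 'guanine', 'thymine'],
                      'ring': ['purine', 'pyrimidine', 'purine', 'pyrimidine']})

# Label-based lookup: row label 2, column 'letter'
by_label = nucls.loc[2, 'letter']

# Position-based lookup: third row, first column
by_position = nucls.iloc[2, 0]

# Row slice by position; the end is excluded, so this keeps rows 2 and 3
last_two = nucls.iloc[2:4]

print(by_label, by_position)
print(last_two)
```

Both lookups return the same entry as the chained version; which one to prefer is largely a matter of whether you are thinking in row labels or row positions.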


Here is some Python code to create a data frame with observed nucleotide counts from 389 TATA boxes taken from eukaryotic promoters (Bucher, J Mol Biol (1990) 212, 563-578).
```
tata_counts = pd.DataFrame({'A': [  16, 352,   3, 354, 268, 360, 222, 155],
                            'C': [  46,   0,  10,   0,   0,   3,   2,  44],
                            'G': [  18,   2,   2,   5,   0,  20,  44, 157],
                            'T': [ 309,  35, 374,  30, 121,   6, 121,  33]})
```
Each row is a position in the TATA motif, and each column is a nucleotide. It's possible to read off the consensus sequence of TATA(A/T)A(A/T)(A/G), sometimes written TATAWAWR, just from looking at the counts in the table.
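
Reading off the most common nucleotide at each position can also be sketched in code. The example below assumes the `idxmax()` method is an acceptable stand-in for "looking at the table": it reports only the single most frequent nucleotide per row, so it does not capture the degenerate choices written as (A/T) or (A/G).

```python
import pandas as pd

tata_counts = pd.DataFrame({'A': [  16, 352,   3, 354, 268, 360, 222, 155],
                            'C': [  46,   0,  10,   0,   0,   3,   2,  44],
                            'G': [  18,   2,   2,   5,   0,  20,  44, 157],
                            'T': [ 309,  35, 374,  30, 121,   6, 121,  33]})

# For each row (position), find the column label (nucleotide) with the largest count
most_common = tata_counts.idxmax(axis=1)

# Join the per-position winners into one string
consensus = ''.join(most_common)
print(consensus)
```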

In [11]:
tata_counts = pd.DataFrame({'A': [  16, 352,   3, 354, 268, 360, 222, 155],
                            'C': [  46,   0,  10,   0,   0,   3,   2,  44],
                            'G': [  18,   2,   2,   5,   0,  20,  44, 157],
                            'T': [ 309,  35, 374,  30, 121,   6, 121,  33]})
tata_counts

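| | A | C | G | T |
|---|---|---|---|---|
| 0 | 16 | 46 | 18 | 309 |
| 1 | 352 | 0 | 2 | 35 |
| 2 | 3 | 10 | 2 | 374 |
| 3 | 354 | 0 | 5 | 30 |
| 4 | 268 | 0 | 0 | 121 |
| 5 | 360 | 3 | 20 | 6 |
| 6 | 222 | 2 | 44 | 121 |
| 7 | 155 | 44 | 157 | 33 |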


Data frames have many useful methods. For instance, we can use the `.sum()` method to take sums across rows or columns. The argument `0` calculates column sums (one total per nucleotide) and the argument `1` calculates row sums (one total per position).

In [13]:
tata_counts.sum(1)

0    389
1    389
2    389
3    389
4    389
5    389
6    389
7    389
dtype: int64
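
To make the two directions concrete, here is a small sketch that computes both sets of sums. It spells the argument out as `axis=0` or `axis=1`, which is equivalent to passing `0` or `1` positionally; the variable names are only illustrative.

```python
import pandas as pd

tata_counts = pd.DataFrame({'A': [  16, 352,   3, 354, 268, 360, 222, 155],
                            'C': [  46,   0,  10,   0,   0,   3,   2,  44],
                            'G': [  18,   2,   2,   5,   0,  20,  44, 157],
                            'T': [ 309,  35, 374,  30, 121,   6, 121,  33]})

# axis=0 collapses the rows: one total per column (nucleotide)
col_sums = tata_counts.sum(axis=0)

# axis=1 collapses the columns: one total per row (position)
row_sums = tata_counts.sum(axis=1)

print(col_sums)
print(row_sums)
```

The row sums are the useful ones here, because each row should add up to the number of sequences that were counted.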

We can then turn these counts into probabilities by dividing each nucleotide count by the total number of sequences counted. That is, if 35 out of 389 TATA-box sequences have a `T` at the second position (position 1 in Python counting), then the probability of a `T` at position 1 in a random TATA-box sequence is 35/389, just under 10%.

```
tata_counts / 389
```

will make a new data frame dividing each individual entry in our data frame by 389. We'll use this to make a new `tata_probs` data frame with the _probabilities_ of each nucleotide.

In [18]:
tata_probs = tata_counts / 389
tata_probs

| | A | C | G | T |
|---|---|---|---|---|
| 0 | 0.041131 | 0.118252 | 0.046272 | 0.794344 |
| 1 | 0.904884 | 0.0 | 0.005141 | 0.089974 |
| 2 | 0.007712 | 0.025707 | 0.005141 | 0.96144 |
| 3 | 0.910026 | 0.0 | 0.012853 | 0.077121 |
| 4 | 0.688946 | 0.0 | 0.0 | 0.311054 |
| 5 | 0.92545 | 0.007712 | 0.051414 | 0.015424 |
| 6 | 0.570694 | 0.005141 | 0.113111 | 0.311054 |
| 7 | 0.398458 | 0.113111 | 0.403599 | 0.084833 |
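
Dividing by the literal 389 works because every row has the same total. As a hedged alternative, the sketch below divides each row by its own sum using the `.div()` method with `axis=0`, so the total does not have to be typed in by hand. The name `row_totals` is introduced here only for illustration.

```python
import pandas as pd

tata_counts = pd.DataFrame({'A': [  16, 352,   3, 354, 268, 360, 222, 155],
                            'C': [  46,   0,  10,   0,   0,   3,   2,  44],
                            'G': [  18,   2,   2,   5,   0,  20,  44, 157],
                            'T': [ 309,  35, 374,  30, 121,   6, 121,  33]})

# Total number of observations at each position (one value per row)
row_totals = tata_counts.sum(axis=1)

# Divide every entry by the total of its own row;
# axis=0 lines up row_totals with the row labels
tata_probs = tata_counts.div(row_totals, axis=0)

# Each row of probabilities should now add up to 1
print(tata_probs.sum(axis=1))
```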


We can now look up, e.g., the probability of a `T` at the second position, which is position 1 in Python counting:
```
tata_probs['T'][1]
```

In [19]:
tata_probs['T'][1]

0.08997429305912596

We're most of the way to a probabilistic model of a TATA box. We will assume that each of the nucleotides in the TATA box is independent, so we can multiply these probabilities together:
$$P(\;\mathtt{TATAAAAG}\;|\;\mathrm{TATA-box}\;) = 
P(\;\mathtt{T}\mathrm{\,at\,0\;}) \times
P(\;\mathtt{A}\mathrm{\,at\,1\;}) \times
P(\;\mathtt{T}\mathrm{\,at\,2\;}) \times
P(\;\mathtt{A}\mathrm{\,at\,3\;}) \times
P(\;\mathtt{A}\mathrm{\,at\,4\;}) \times
P(\;\mathtt{A}\mathrm{\,at\,5\;}) \times
P(\;\mathtt{A}\mathrm{\,at\,6\;}) \times
P(\;\mathtt{G}\mathrm{\,at\,7\;})
$$

We need to keep track of which position is which, because $P(\;\mathtt{T}\mathrm{\,at\,0\;}) \neq P(\;\mathtt{T}\mathrm{\,at\,1\;})$. The `enumerate()` function lets us keep track of a position when we're iterating over a sequence.

```
for position, nt in enumerate(sequ):
    print('position = ' + str(position) + ', nt = ' + str(nt))
```

In [24]:
for position, nt in enumerate('TATAAAAG'):
    print(position, nt)

0 T
1 A
2 T
3 A
4 A
5 A
6 A
7 G


Now, we'll write a `for` loop to iterate over the positions in a sequence and compute a running probability.

We'll start with probability 1
```
prob = 1
```
and then multiply the probability for each independent position
```
for position, nt in enumerate(sequ):
    p = tata_probs[nt][position]
    prob = prob * p
    print(position, nt, p, prob)
```

We can use this to compute the probability of a "very good" TATA-box like `TATATATA`. We can also try the worst possible TATA box, `ACGCGCCT`.

In [28]:
sequ = 'ACGCGCCT'
prob = 1
for position, nt in enumerate(sequ):
    # P( nt at position | TATA-box )
    p = tata_probs[nt][position]
    prob = prob * p
    print(position, nt, p, prob)
prob

0 A 0.04113110539845758 0.04113110539845758
1 C 0.0 0.0
2 G 0.005141388174807198 0.0
3 C 0.0 0.0
4 G 0.0 0.0
5 C 0.007712082262210797 0.0
6 C 0.005141388174807198 0.0
7 T 0.08483290488431877 0.0


0.0

Our final probability is 0! While $P(\;\mathtt{ACGCGCCT}\;|\;\textrm{TATA-box}\;)$ is definitely very small, it's probably not 0. We see zero `C` nucleotides at position 1 out of 389 TATA-boxes, but what if we counted 389,000? Would we find 100, 10, or 1? 

We often handle these situations by adding a _pseudocount_ to our data. We add a fake count for each nucleotide, at each position, in order to eliminate zeros. The impact of this pseudocount depends on the number of real counts. If we add a pseudocount with 9 real observations, it represents 10% of our overall counts, but if we add a pseudocount with 999 real observations, it's only 0.1%.
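
As a worked formula, assuming a pseudocount of 1 for each of the 4 nucleotides, write $n_{i,x}$ for the observed count of nucleotide $x$ at position $i$ and $N$ for the number of sequences counted (these symbols are introduced here only for illustration):

$$P(\;x\mathrm{\,at\,}i\;) = \frac{n_{i,x} + 1}{N + 4}$$

With $N = 389$ the denominator becomes 393, so a nucleotide that was never observed at a position gets probability $1/393 \approx 0.0025$ rather than 0.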

We can just add 1 to every entry and use this table with pseudocounts to make our new data.

```
tata_counts_pseudo = tata_counts + 1
```

In [30]:
tata_counts_pseudo = tata_counts + 1
tata_counts_pseudo

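| | A | C | G | T |
|---|---|---|---|---|
| 0 | 17 | 47 | 19 | 310 |
| 1 | 353 | 1 | 3 | 36 |
| 2 | 4 | 11 | 3 | 375 |
| 3 | 355 | 1 | 6 | 31 |
| 4 | 269 | 1 | 1 | 122 |
| 5 | 361 | 4 | 21 | 7 |
| 6 | 223 | 3 | 45 | 122 |
| 7 | 156 | 45 | 158 | 34 |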


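Each row now sums to 393 (389 real counts plus 4 pseudocounts), so we divide by 393 to make a new `tata_probs`. With the new `tata_probs`, the probability of the best TATA-box is pretty similar to what it was before. The probability of the worst TATA-box is still very low, but it is no longer zero.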

In [34]:
tata_counts_pseudo.sum(1)
tata_probs = tata_counts_pseudo / 393
sequ = 'ACGCGCCT'
prob = 1
for position, nt in enumerate(sequ):
    p = tata_probs[nt][position]
    prob = prob * p
    print(position, nt, p, prob)
prob

0 A 0.043256997455470736 0.043256997455470736
1 C 0.002544529262086514 0.00011006869581544718
2 G 0.007633587786259542 8.402190520263142e-07
3 C 0.002544529262086514 2.1379619644435476e-09
4 G 0.002544529262086514 5.440106779754575e-12
5 C 0.010178117048346057 5.5370043559843004e-14
6 C 0.007633587786259542 4.2267208824307635e-16
7 T 0.08651399491094147 3.656705089125851e-17


3.656705089125851e-17

It's getting tedious to write the same for loop every time we want to try a different sequence.

We can write our own function, `likelihood_tata()`, that will compute the likelihood of a sequence under our TATA-box probability model. We define a function with `def` followed by the function name. The arguments to the function are named in parentheses, and inside the function, these become variables that take on a different value each time we use the function. The `return` keyword gives the computed "value" for the function.

```
def likelihood_tata(sequ):
    prob = 1
    for position, nt in enumerate(sequ):
        p = tata_probs[nt][position]
        prob = prob * p
        print(position, nt, p, prob)
    return prob
```

In [36]:
# likelihood_tata('TATAAAAG')
def likelihood_tata(sequ):
    prob = 1
    for position, nt in enumerate(sequ):
        p = tata_probs[nt][position]
        prob = prob * p
    return prob
likelihood_tata('TATAAAAG')

0.08759454254685192
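
Note that `likelihood_tata()` reads `tata_probs` from outside the function, so its result depends on which version of `tata_probs` was assigned most recently. One possible variant, sketched below under the hypothetical name `likelihood_under()`, takes the probability table as a second argument so the dependence is explicit. It assumes the pseudocount table built above.

```python
import pandas as pd

tata_counts = pd.DataFrame({'A': [  16, 352,   3, 354, 268, 360, 222, 155],
                            'C': [  46,   0,  10,   0,   0,   3,   2,  44],
                            'G': [  18,   2,   2,   5,   0,  20,  44, 157],
                            'T': [ 309,  35, 374,  30, 121,   6, 121,  33]})

# Pseudocount probabilities: 389 real counts plus 4 pseudocounts per position
tata_probs = (tata_counts + 1) / 393

def likelihood_under(sequ, probs):
    """Hypothetical helper: likelihood of sequ under a position-specific table."""
    prob = 1.0
    for position, nt in enumerate(sequ):
        # Row = position in the motif, column = nucleotide
        prob = prob * probs.loc[position, nt]
    return prob

best = likelihood_under('TATAAAAG', tata_probs)
print(best)
```

The same function could then be called with a table built without pseudocounts, or with a table for a different motif, without redefining anything.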

Now we can easily use our function to compute the likelihood of some other possible TATA-box sequences. For example, the three sequences below are "very good" TATA-boxes that differ from the "best" TATA box at one of the three "degenerate" positions in the motif. Notice that the overall probability of getting one of these three imperfect motifs is substantially higher than the probability of the perfect TATA-box. In fact, although the TATA-box is a strong motif, fewer than 10% of the sequences generated according to our model will actually match the "best" sequence.
```
TATATAAG
TATAAATG
TATAAAAA
```

In [40]:
likelihood_tata('TATATAAG') + likelihood_tata('TATAAATG') + likelihood_tata('TATAAAAA')


0.17413432175660665

If we want to use our Bayesian framework to think about TATA-boxes, we need some additional information. What is $P(\;\mathtt{TATAAAAG}\;|\;\textit{not}\,\textrm{TATA-box}\;)$? We need a model for all the other sequences in the genome, often called a "background" model.

The easy background model is independent nucleotides, with probabilities determined by the overall composition of the genome, that is, by counting the overall number of `A`s, `C`s, `G`s, and `T`s in the yeast genome. A rough estimate is

```
background = pd.Series({'A': 0.31, 'C': 0.19, 'G': 0.19, 'T': 0.31})
```
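
The rough estimate above comes from counting nucleotides. As an illustration of that counting step only, here is a minimal sketch that computes the composition of a short made-up sequence; the helper name `composition()` and the toy sequence are assumptions for this example and have nothing to do with the real yeast genome.

```python
import pandas as pd

def composition(sequ):
    """Hypothetical helper: fraction of each nucleotide in a sequence."""
    # Split the string into single letters and count each one;
    # normalize=True returns fractions instead of raw counts
    return pd.Series(list(sequ)).value_counts(normalize=True)

toy_sequence = 'ATTAGCAT'   # invented sequence, for illustration only
toy_background = composition(toy_sequence)
print(toy_background)
```

The result is a `Series` indexed by nucleotide, the same shape as `background`, so it could be looked up in the same way with square brackets.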

In [43]:
background = pd.Series({ 'A': 0.31, 'C': 0.19, 'G': 0.19, 'T': 0.31})
background
background['C']

0.19

_Exercise_ Use the `background` defined above to write a `likelihood_background()` function that calculates the likelihood of generating a given sequence under the model of a random yeast genome.

In [44]:
def likelihood_background(sequ):
    prob = 1
    # The background is the same at every position, so we don't need enumerate()
    for nt in sequ:
        p = background[nt]
        prob = prob * p
    return prob

In [45]:
likelihood_background('TATAAAAG')

5.227396681090001e-05

In [46]:
likelihood_background('ACGCGCCT')

4.5211091640999994e-06

Since the "worst" TATA-box is GC-rich and the "best" TATA-box is AT-rich, the odds of getting the "best" TATA-box by chance in random sequence are somewhat higher (about 5.2e-05 versus 4.5e-06). Of course, the chance of getting the "best" sequence under our TATA-box probabilistic model is dramatically higher than the chance of getting the "worst" sequence. We can use the _ratio of the likelihoods_ as a measure of how well two different models fit a given sequence.

Below, we compute the likelihood ratios for the "best" sequence TATAAAAG, the "worst" sequence ACGCGCCT, and getting any one of the three very good sequences TATATAAG, TATAAATG, and TATAAAAA.
```
print(likelihood_tata('TATAAAAG') / likelihood_background('TATAAAAG'))
print(likelihood_tata('ACGCGCCT') / likelihood_background('ACGCGCCT'))

print( (likelihood_tata('TATATAAG') + likelihood_tata('TATAAATG') + likelihood_tata('TATAAAAA'))
       / (likelihood_background('TATATAAG') + likelihood_background('TATAAATG') + likelihood_background('TATAAAAA')) )
```

In [48]:
likelihood_tata('ACGCGCCT') / likelihood_background('ACGCGCCT')

8.088070772902423e-12

We can go one step further and turn this likelihood ratio into a function:
```
def likelihood_ratio(sequ):
    return likelihood_tata(sequ) / likelihood_background(sequ)
```

In [49]:
def likelihood_ratio(sequ):
    return likelihood_tata(sequ) / likelihood_background(sequ)
likelihood_ratio('TATATATA')

251.60167494462385

We might want to scan a whole promoter to find a TATA-box. Here is the promoter region for the yeast _CDC19_ gene.
```
cdc19_prm = 'TATGATGCTAGGTACCTTTAGTGTCTTCCTAAAAAAAAAAAAAGGCTCGCCATCAAAACGATATTCGTTGGCTTTTTTTTCTGAATTATAAATACTCTTTGGTAACTTTTCATTTCCAAGAACCTCTTTTTTCCAGTTATATCATG'
```
We need to extract 8-nucleotide chunks out of the promoter. Square brackets can extract a _range_ of values from a string or a list. To do this, we do `[start:end]` where the start is _included_ and the end is _excluded_.

```
alphabet = 'abcdefghijklmnopqrstuvwxyz'
alphabet[2:6]
```

This code goes from index 2 (the 3rd entry, `c`) to index 5 (`f`) and does not include index 6 (`g`).

In [54]:
cdc19_prm = 'TATGATGCTAGGTACCTTTAGTGTCTTCCTAAAAAAAAAAAAAGGCTCGCCATCAAAACGATATTCGTTGGCTTTTTTTTCTGAATTATAAATACTCTTTGGTAACTTTTCATTTCCAAGAACCTCTTTTTTCCAGTTATATCATG'
len(cdc19_prm)
cdc19_prm[0:8]
cdc19_prm[1:9]

'ATGATGCT'

We can use this to score individual 8-nucleotide windows of the promoter:
```
likelihood_ratio(cdc19_prm[0:8])
likelihood_ratio(cdc19_prm[1:9])
```

In [56]:
likelihood_ratio(cdc19_prm[1:9])

1.5243007104870045e-05

Now we can loop over each starting position in `cdc19_prm` and compute the likelihood ratio of the window that starts there.

We start at position 0 and we run until the _end_ of our 8-position window is at the end of the promoter. This happens when `start+8 = len(cdc19_prm)` or equivalently `start = len(cdc19_prm) - 8`.
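
The window arithmetic is easy to get wrong by one, so it can help to check it on a short example first. The sketch below uses an invented 10-letter string and the built-in `range()`, which stops just before its end value; the names `toy`, `width`, and `windows` are only for illustration.

```python
toy = 'ABCDEFGHIJ'   # invented sequence of length 10
width = 8            # window size

# The last valid start is len(toy) - width, and range() excludes its end,
# so the end of the range has to be one past that
starts = list(range(0, len(toy) - width + 1))

# Extract one window per starting position
windows = [toy[start:start + width] for start in starts]

print(starts)
print(windows)
```

A sequence of length 10 has exactly 3 windows of width 8, and the last one ends on the final letter.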

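The `range(start, end)` function creates a series of numbers from `start` up to, but not including, `end`. To make `len(cdc19_prm) - 8` the last starting position, the end of the range must be `len(cdc19_prm) - 7`.

To start, we can write the loop
```
for start in range(0, len(cdc19_prm) - 7):
    print(str(start) + ' ' + cdc19_prm[start:start+8])
```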
and if all of that looks good we can add in a `likelihood_ratio()`.

Then we can build a _list_ of these likelihood ratios and convert it into a Pandas `Series`.
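
The whole scan, including the conversion to a `Series`, can be sketched in one self-contained block. It assumes the pseudocount probabilities and the rough background estimate from above, and it folds the two likelihood functions into a single `likelihood_ratio()` for brevity. Using `idxmax()` to find the highest-scoring window is an addition for illustration.

```python
import pandas as pd

tata_counts = pd.DataFrame({'A': [  16, 352,   3, 354, 268, 360, 222, 155],
                            'C': [  46,   0,  10,   0,   0,   3,   2,  44],
                            'G': [  18,   2,   2,   5,   0,  20,  44, 157],
                            'T': [ 309,  35, 374,  30, 121,   6, 121,  33]})
tata_probs = (tata_counts + 1) / 393    # pseudocount of 1 per nucleotide
background = pd.Series({'A': 0.31, 'C': 0.19, 'G': 0.19, 'T': 0.31})

def likelihood_ratio(sequ):
    p_tata = 1.0
    p_background = 1.0
    for position, nt in enumerate(sequ):
        p_tata = p_tata * tata_probs[nt][position]    # position-specific model
        p_background = p_background * background[nt]  # same at every position
    return p_tata / p_background

cdc19_prm = 'TATGATGCTAGGTACCTTTAGTGTCTTCCTAAAAAAAAAAAAAGGCTCGCCATCAAAACGATATTCGTTGGCTTTTTTTTCTGAATTATAAATACTCTTTGGTAACTTTTCATTTCCAAGAACCTCTTTTTTCCAGTTATATCATG'

# One score per 8-nucleotide window; the last window starts at len - 8
scores = pd.Series([likelihood_ratio(cdc19_prm[start:start + 8])
                    for start in range(0, len(cdc19_prm) - 7)])

# Index of the highest-scoring window, and the window itself
best_start = scores.idxmax()
print(best_start, cdc19_prm[best_start:best_start + 8], scores[best_start])
```

Because the `Series` index is the starting position, `scores[start]` gives the likelihood ratio of the window beginning at `start`.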

In [61]:
scores = []
for start in range(0, len(cdc19_prm) - 7):
    print(str(start), cdc19_prm[start:start+8], likelihood_ratio(cdc19_prm[start:start+8]))
    scores.append(likelihood_ratio(cdc19_prm[start:start+8]))
scores

0 TATGATGC 0.08402037003121614
1 ATGATGCT 1.5243007104870045e-05
2 TGATGCTA 7.856685684214901e-07
3 GATGCTAG 0.0005434620620826783
4 ATGCTAGG 8.394749243658516e-05
5 TGCTAGGT 0.00040021996619650475
6 GCTAGGTA 0.00014760985830600325
7 CTAGGTAC 4.1651000215254796e-07
8 TAGGTACC 0.001710047300323067
9 AGGTACCT 7.60088191631401e-08
10 GGTACCTT 1.8383645356791617e-05
11 GTACCTTT 7.1095678908228174e-09
12 TACCTTTA 0.0010731331632946833
13 ACCTTTAG 1.5610378954175922e-05
14 CCTTTAGT 0.0032948800433311625
15 CTTTAGTG 0.1916779419022246
16 TTTAGTGT 0.0008727613759418125
17 TTAGTGTC 0.00033714539424310264
18 TAGTGTCT 6.502912112563353e-07
19 AGTGTCTT 2.0787467771005432e-05
20 GTGTCTTC 3.569503832491865e-07
21 TGTCTTCC 5.870894461335304e-06
22 GTCTTCCT 1.6952310264786767e-06
23 TCTTCCTA 2.4552142763171573e-05
24 CTTCCTAA 1.3827684211247231e-05
25 TTCCTAAA 0.010316576568886387
26 TCCTAAAA 0.019587871586508113
27 CCTAAAAA 1.1593876079163083
28 CTAAAAAA 0.2728674834631414
29 TAAAAAAA 10.816325127584

[0.08402037003121614,
 1.5243007104870045e-05,
 7.856685684214901e-07,
 0.0005434620620826783,
 8.394749243658516e-05,
 0.00040021996619650475,
 0.00014760985830600325,
 4.1651000215254796e-07,
 0.001710047300323067,
 7.60088191631401e-08,
 1.8383645356791617e-05,
 7.1095678908228174e-09,
 0.0010731331632946833,
 1.5610378954175922e-05,
 0.0032948800433311625,
 0.1916779419022246,
 0.0008727613759418125,
 0.00033714539424310264,
 6.502912112563353e-07,
 2.0787467771005432e-05,
 3.569503832491865e-07,
 5.870894461335304e-06,
 1.6952310264786767e-06,
 2.4552142763171573e-05,
 1.3827684211247231e-05,
 0.010316576568886387,
 0.019587871586508113,
 1.1593876079163083,
 0.2728674834631414,
 10.816325127584454,
 0.5931533134481795,
 0.5931533134481795,
 0.5931533134481795,
 0.5931533134481795,
 0.5931533134481795,
 0.5931533134481795,
 0.9801838492811008,
 0.32271807168919886,
 0.008723663365821753,
 1.6335098688582658e-06,
 4.618124342853626e-07,
 8.538449699044403e-08,
 5.8769349542515984e-