### Conditionals (`if` statements)

We can perform true/false tests, like comparing two numbers or checking whether two strings are equal. The code below gives the example of comparing two numbers.
```
3 > 2
```

These `True` / `False` results are a data type called `bool`, short for Boolean, which refers to a systematic approach to this sort of true/false logic.

In Boolean logic, true/false values can be combined using _and_ and _or_, and transformed using _not_. Python has exact equivalents of these, allowing us to build more complex tests. Note that Python's `or` is "inclusive", that is, `a or b` is `True` when at least one of `a` or `b` is true, including when both are true.
```
print( 3 > 2 and 3 < 4 )
print( 3 > 2 or  3 < 4 )
print( 3 > 2 and 3 < 1 )
```

To test whether two values are equal to each other, we use a double equal sign symbol, `==`. This symbol distinguishes _testing_ whether two values are equal from _assigning_ a value to a variable, which uses a single equal sign (_e.g._, `x = 3`).

For instance, we can compare integers and strings as shown below. Note that the string `'4'` is not equal to the integer `4`.
```
print(3 == 4)
print(2*2 == 4)
print('4' == 4)
print('4' == str(4))
```

The `!=` operator tests whether two values are not equal, like this
```
print(3 != 4)
```

An `if` statement runs the Python code in its _body_ when a **condition** is `True`, and skips it when the condition is `False`. The body of the `if` statement is indented, like the body of a `for` statement.

```
if x > 0:
    print('x is positive!')
```

We'll test this with a positive value for `x` and a negative value. Then, we'll add a second `if` statement to print a different message when `x` is negative.
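A minimal sketch of that test, assuming we loop over a few example values of `x` (the values and the `messages` list are illustrative additions, not part of the original notebook):

```python
messages = []  # collected only so the results can be inspected afterwards

for x in [5, -2, 0]:
    if x > 0:
        messages.append(str(x) + ' is positive!')
    if x < 0:
        messages.append(str(x) + ' is negative!')

# Neither body runs for x = 0, so it produces no message
for message in messages:
    print(message)
```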

We can do more than just `print()` something in the body of an `if` statement. For instance, we can change the value of a variable and then use that variable outside the `if` statement.
```
sign = 'zero'
if x > 0:
    sign = 'positive'
```

In the examples above, we're picking between one of a few alternatives. We can use a special "if-else" construction to do this more easily. In this example, `elif` is short for else-if.
```
nt = 'C'
if nt == 'A':
    base = 'Adenine'
elif nt == 'C':
    base = 'Cytosine'
elif nt == 'G':
    base = 'Guanine'
elif nt == 'T':
    base = 'Thymine'
```
We can also add an `else` clause that is run when none of the other alternatives are.
```
else:
    base = 'Unknown'
```

We can also combine `if` and `for` to carry out complex operations.

The recognition sequence for the restriction enzyme EcoRI is G/AATTC. Here we will use a `for` loop to iterate over each nucleotide in this site and print the name of the base.

```
ecori = 'GAATTC'
for nt in ecori:
    if nt == 'A':
    ...
    print(base)
```
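One way the elided lines might be filled in, reusing the `if` / `elif` / `else` chain from the nucleotide example. This is a sketch rather than the only possible solution, and the `names` list is an illustrative addition:

```python
ecori = 'GAATTC'
names = []  # illustrative: keeps the results so we can inspect them later

for nt in ecori:
    if nt == 'A':
        base = 'Adenine'
    elif nt == 'C':
        base = 'Cytosine'
    elif nt == 'G':
        base = 'Guanine'
    elif nt == 'T':
        base = 'Thymine'
    else:
        base = 'Unknown'
    # This line is indented under `for` but not under `if`,
    # so it runs once per nucleotide whichever branch was taken
    print(base)
    names.append(base)
```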

_Exercise_ Here is the protein sequence of the yeast gene _MFA1_, encoding a secreted mating pheromone.

```
mfa1='MQPSTATAAPKEKTSSEKKDNYIIKGVFWDPACVIA'
```

Write a `for` loop to count the number of lysine residues (one-letter code `K`) in `mfa1`.

### Dictionaries

**Dictionaries**, often called **dicts** for short, are useful and versatile data structures that link a _key_ with a _value_. It's fast and easy to look up the _value_ based on the _key_, like you look up a definition from a word in a dictionary. 

Dictionaries have one entry per key, and keys can be almost anything -- numbers, strings, and so forth.

The code below creates a dictionary where the _keys_ are one-letter amino acid codes and the _values_ are the names of the amino acid.

```
amino_acids = { 'A': 'alanine', 'C': 'cysteine', 'D': 'aspartic acid' }
```

Looking up dictionary entries is done by _indexing_ with square brackets, just as we saw for lists. An example is given below:

```
amino_acids['C']
```

Indexing can also be used to add or replace entries. For example, the code below replaces the name of protonated aspartic acid with the deprotonated form typically found in water, and also adds an entry for glutamate.

```
amino_acids['D'] = 'aspartate'
amino_acids['E'] = 'glutamate'
```

Dictionaries are another kind of **collection**, and so we can iterate over them. A `for` loop will iterate over the _keys_ in a dictionary.
```
for aa in amino_acids:
    print(aa)
```

It's straightforward to use the keys in a for loop to look up the values.
```
for aa in amino_acids:
    print(amino_acids[aa])
```

Dictionaries also have an `items()` method that iterates over keys and values _together_.

This requires _two_ loop variables, one for the key and one for the value. These two loop variables are separated with a comma, as shown below:

```
for key, value in amino_acids.items():
    print(key)
    print(value)
```

There's an operator called `in` that tests whether a key is contained in a dictionary. For example, the code below tests whether 'B' is a key in our dictionary of amino acids.

```
'B' in amino_acids
```

The `del` statement will delete an entry from a dictionary. Here is an example of using `del`:
```
del amino_acids['C']
```

### Counting with dictionaries

Dictionaries are useful for counting things. We counted the occurrences of lysine in one small yeast protein. Now, we're going to count all amino acids in the protein.

The basic strategy is to create an empty dictionary, and then update the entries to keep a running count of the number of times we see each amino acid.
```
mfa1 = 'MQPSTATAAPKEKTSSEKKDNYIIKGVFWDPACVIA'
aa_count = {}
for aa in mfa1:
    aa_count[aa] = aa_count[aa] + 1
    print(aa_count)
```

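_But_ this code gives us an error. The very first amino acid, `M`, is not yet a key in the empty dictionary, so looking up `aa_count[aa]` on the right-hand side raises a `KeyError`.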

We need to do something special when we see an amino acid for the first time. There are two ways to handle this situation. One is to test whether an amino acid is already in the dictionary, and add the amino acid if it isn't already there.
```
    ...
    if aa not in aa_count:
        aa_count[aa] = 0
    ...
```

Another is to use the `get()` method instead of indexing with square brackets. This method looks up a key and returns its value, or returns a default value that we supply as the second argument when the key is absent.

Here's an example with a dictionary of nucleoside names.

```
nucleotides = { 'A': 'adenosine', 'C': 'cytidine', 'G': 'guanosine', 'T': 'thymidine' }
print('A is ' + nucleotides.get('A', 'unknown'))
print('E is ' + nucleotides.get('E', 'unknown'))
```

We can use the `get()` method to write a shorter and clearer version of our counting loop.
```
    ...
    aa_count[aa] = aa_count.get(aa, 0) + 1
    ...
```
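Putting the fragments above back into the full loop gives two complete versions of the counter. This is a minimal sketch; the name `aa_count_get` is introduced here only so that the two results can be compared:

```python
mfa1 = 'MQPSTATAAPKEKTSSEKKDNYIIKGVFWDPACVIA'

# Version 1: test for the key and start its count at zero when it is missing
aa_count = {}
for aa in mfa1:
    if aa not in aa_count:
        aa_count[aa] = 0
    aa_count[aa] = aa_count[aa] + 1

# Version 2: let get() supply 0 for amino acids we have not seen yet
aa_count_get = {}
for aa in mfa1:
    aa_count_get[aa] = aa_count_get.get(aa, 0) + 1

print(aa_count)
print(aa_count == aa_count_get)  # both approaches give the same counts
```

The `print()` call now sits after the loop, so the dictionary is shown once at the end rather than after every amino acid.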

### BioPython

Next, we'll move on to counting amino acids in the whole yeast proteome. We don't want to include all ~6,000 protein sequences in this notebook, and so we'll use existing Python tools to read them from a file. First, we need to install the Biopython package.

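Doing this within a Jupyter notebook requires the following, somewhat cryptic code that I copied and pasted from an informative web page. The leading `!` tells Jupyter to run the rest of the line as a shell command, and `{sys.executable}` fills in the path of the Python interpreter that the notebook is using, so the package is installed for that interpreter.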

```
import sys
!{sys.executable} -m pip install biopython
```

The biopython module `Bio` has a sub-module specialized for reading and writing files of sequence data, called `SeqIO`. The code below imports this one sub-module.

```
from Bio import SeqIO
```

The SeqIO module has a function called `parse()` that reads sequence entries from a Fasta-format file. The Fasta format is pretty simple: each sequence has a name on a line starting with a >, followed by the sequence itself. So, a Fasta file might look like:

```
>one
AGCTACGT...
>two
TGACTGCA...
...
```

The `parse()` function returns, in essence, an iterator that can loop over all the entries in the file. We just want to look at the first one, though, so we'll use `next` to take just one entry.

```
proteins = SeqIO.parse("../S288C_R64-2-1/orf_trans_all_R64-2-1_20150113.fasta", "fasta")
protein = next(proteins)
```

The `parse()` function will turn each of these entries into a `SeqRecord`, a custom data type that bundles together the name and the sequence. You can get the sequence name from a record using `record.id` and the sequence itself using `record.seq`. This sequence isn't an ordinary Python string -- it's another custom data type, called a `Seq`, but you can convert it into a string using `str(record.seq)`. Here, our record is stored in the variable `protein`.

```
print('ID = ' + protein.id)
print('Seq = ' + str(protein.seq))
```

Now let's run our amino acid counting loop on this protein.
```
...
for aa in str(protein.seq):
    ...
```

Finally, let's loop over every protein in the proteome and count all of the amino acids.
```
proteins = SeqIO.parse("../S288C_R64-2-1/orf_trans_all_R64-2-1_20150113.fasta", "fasta")
...
for protein in proteins:
    ...
```

What are the most and least common amino acids?

### Pandas

Dictionaries are great tools, but it's tedious to read a dictionary. Pandas is the Python Data Analysis library, a module that is excellent for many kinds of data analysis. We'll import Pandas to start working with it.
```
import pandas as pd
```

We can use our dictionary of counts to create a Pandas `Series`, which is a list of data values with a label for each entry. For us, the labels are the amino acids and the data values are the counts. We can create a `Series` directly from our dictionary using the `pd.Series()` function.
```
aa_series = pd.Series(aa_count)
```

Our `aa_series` is still in an arbitrary order. The `Series` type has two sorting methods. `Series.sort_index()` sorts the series by its index (here, by the amino acid) and `Series.sort_values()` sorts the series by the data values.
```
print(aa_series.sort_index())
print(aa_series.sort_values())
```
These output tables are useful for answering questions like "how many glycines are there in the yeast proteome?" or "what are the most and least common amino acids?"

### matplotlib

It's also pretty easy to make plots of data in a `Series`. To do this, we need to import another module
```
import matplotlib.pyplot as plt
```

Now, we can use a `plot()` method on our data series. The default plot is a line plot, but a bar plot makes more sense for this kind of data and so we use the `kind='bar'` argument to the `Series.plot()` method.
```
aa_series.plot(kind='bar')
```
You may find it makes more sense to plot the sorted versions of these `Series`.

_Exercise_ The file `"../S288C_R64-2-1/S288C_reference_sequence_R64-2-1_20150113.fsa"` has the nucleotide sequence of the yeast genome. Count the nucleotide frequencies in the genome.

```
chroms = SeqIO.parse("../S288C_R64-2-1/S288C_reference_sequence_R64-2-1_20150113.fsa", "fasta")
```

### Data frames

In addition to the `Series`, Pandas also provides a `DataFrame` which has rows and columns, like a table or a spreadsheet. They're similar to (and based on) data frames in the statistics programming language R.

We can build a data frame from a dictionary where the _columns_ are entries in a dictionary. Each dictionary _key_ is a column header, and the associated _value_ is a list. The `pd.DataFrame()` function creates a data frame.

```
nucls = pd.DataFrame({'letter': [ 'A', 'C', 'G', 'T' ],
                      'name': ['adenine', 'cytosine', 'guanine', 'thymine'],
                      'ring': ['purine', 'pyrimidine', 'purine', 'pyrimidine']})
```

We can extract one column of a `DataFrame` as a `Series` using square brackets to index it by the name of the column:
```
nucls['name']
```

We can then index by row into the `Series` with a second set of square brackets
```
nucls['letter'][2]
```


Here is some Python code to create a data frame with observed nucleotide counts from 389 TATA boxes taken from eukaryotic promoters (Bucher, J Mol Biol (1990) 212, 563-578).
```
tata_counts = pd.DataFrame({'A': [  16, 352,   3, 354, 268, 360, 222, 155],
                            'C': [  46,   0,  10,   0,   0,   3,   2,  44],
                            'G': [  18,   2,   2,   5,   0,  20,  44, 157],
                            'T': [ 309,  35, 374,  30, 121,   6, 121,  33]})
```
Each row is a position in the TATA motif, and each column is a nucleotide. It's possible to read off the consensus sequence of TATA(A/T)A(A/T)(A/G), sometimes written TATAWAWR, just from looking at the counts in the table.

Data frames have many useful methods. For instance, we can use the `.sum()` method to take sums across rows or columns. The argument `0` will calculate column sums (one total per nucleotide) and the argument `1` will calculate row sums (one total per position).

We can then turn these counts into probabilities by dividing each nucleotide count by the total number of sequences counted. That is, if 35 out of 389 TATA-box sequences have a `T` at the second position (position 1 in Python counting), then the probability of a `T` at position 1 in a random TATA-box sequence is 35/389, just under 10%.

```
tata_counts / 389
```

will make a new data frame dividing each individual entry in our data frame by 389. We'll use this to make a new `tata_probs` data frame with the _probabilities_ of each nucleotide.
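A minimal sketch of both steps: checking the row sums and then dividing by 389. The name `position_totals` is introduced here for illustration, and the axis is passed by keyword (`axis=1`) for readability:

```python
import pandas as pd

tata_counts = pd.DataFrame({'A': [  16, 352,   3, 354, 268, 360, 222, 155],
                            'C': [  46,   0,  10,   0,   0,   3,   2,  44],
                            'G': [  18,   2,   2,   5,   0,  20,  44, 157],
                            'T': [ 309,  35, 374,  30, 121,   6, 121,  33]})

# Row sums: the total number of sequences observed at each motif position
position_totals = tata_counts.sum(axis=1)
print(position_totals)

# Dividing every entry by the number of sequences turns counts into probabilities
tata_probs = tata_counts / 389
print(tata_probs)
```

Every position has the same total, which is why dividing the whole table by the single number 389 is enough here.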

We can now look up, e.g., the probability of a `T` at the second position, which is position 1 in Python counting
```
tata_probs['T'][1]
```

We're most of the way to a probabilistic model of a TATA box. We will assume that each of the nucleotides in the TATA box is independent, so we can multiply these probabilities together
$$P(\;\mathtt{TATAAAAG}\;|\;\textrm{TATA-box}\;) = 
P(\;\mathtt{T}\mathrm{\,at\,0\;}) \times
P(\;\mathtt{A}\mathrm{\,at\,1\;}) \times
P(\;\mathtt{T}\mathrm{\,at\,2\;}) \times
P(\;\mathtt{A}\mathrm{\,at\,3\;}) \times
P(\;\mathtt{A}\mathrm{\,at\,4\;}) \times
P(\;\mathtt{A}\mathrm{\,at\,5\;}) \times
P(\;\mathtt{A}\mathrm{\,at\,6\;}) \times
P(\;\mathtt{G}\mathrm{\,at\,7\;})
$$

We need to keep track of which position is which, because $P(\;\mathtt{T}\mathrm{\,at\,0\;}) \neq P(\;\mathtt{T}\mathrm{\,at\,1\;})$. The `enumerate()` function lets us keep track of a position when we're iterating over a sequence.

```
sequ = 'TATAAAAG'  # example sequence to iterate over
for position, nt in enumerate(sequ):
    print('position = ' + str(position) + ', nt = ' + str(nt))
```

Now, we'll write a `for` loop to iterate over the positions in a sequence and compute a running probability.

We'll start with probability 1
```
prob = 1
```
and then multiply the probability for each independent position
```
for position, nt in enumerate(sequ):
    p = tata_probs[nt][position]
    prob = prob * p
    print(position, nt, p, prob)
```

We can use this to compute the probability of a "very good" TATA-box like `TATATATA`. We can also try the worst possible TATA box, `ACGCGCCT`.

Our final probability is 0! While $P(\;\mathtt{ACGCGCCT}\;|\;\textrm{TATA-box}\;)$ is definitely very small, it's probably not 0. We see zero `C` nucleotides at position 1 out of 389 TATA-boxes, but what if we counted 389,000? Would we find 100, 10, or 1? 

We often handle these situations by adding a _pseudocount_ to our data. We add a fake count for each nucleotide, at each position, in order to eliminate zeros. The impact of this pseudocount depends on the number of real counts. If we add a pseudocount with 9 real observations, it represents 10% of our overall counts, but if we add a pseudocount with 999 real observations, it's only 0.1%.

We can just add 1 to every entry and use this table with pseudocounts to make our new data.

```
tata_counts_pseudo = tata_counts + 1
```

Now we can use the new `tata_probs` to compute the probability of the best TATA-box, which is pretty similar to the value we got without pseudocounts. We can also compute the probability of the worst TATA-box, which is very low but no longer zero.
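Here is a sketch of that comparison. It makes one assumption the text does not spell out: after adding pseudocounts, each position has 389 + 4 = 393 counts, so each row is divided by its own total rather than by 389. The names `tata_probs_raw`, `totals`, and `results` are introduced here for illustration only.

```python
import pandas as pd

tata_counts = pd.DataFrame({'A': [  16, 352,   3, 354, 268, 360, 222, 155],
                            'C': [  46,   0,  10,   0,   0,   3,   2,  44],
                            'G': [  18,   2,   2,   5,   0,  20,  44, 157],
                            'T': [ 309,  35, 374,  30, 121,   6, 121,  33]})

# Probabilities without pseudocounts, kept for comparison
tata_probs_raw = tata_counts / 389

# Add one fake observation of each nucleotide at each position
tata_counts_pseudo = tata_counts + 1

# Assumption: normalize each row by its new total (389 real + 4 pseudocounts)
totals = tata_counts_pseudo.sum(axis=1)
tata_probs = tata_counts_pseudo.div(totals, axis=0)

results = {}
for sequ in ['TATAAAAG', 'ACGCGCCT']:
    prob_raw = 1
    prob_pseudo = 1
    for position, nt in enumerate(sequ):
        prob_raw = prob_raw * tata_probs_raw[nt][position]
        prob_pseudo = prob_pseudo * tata_probs[nt][position]
    results[sequ] = (prob_raw, prob_pseudo)
    print(sequ, prob_raw, prob_pseudo)
```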

It's getting tedious to write the same for loop every time we want to try a different sequence.

We can write our own function, `likelihood_tata()`, that will compute the likelihood of a sequence under our TATA-box probability model. We define a function with `def` followed by the function name. The arguments to the function are named in parentheses, and inside the function, these become variables that take on a different value each time we use the function. The `return` keyword gives the computed "value" for the function.

```
def likelihood_tata(sequ):
    prob = 1
    for position, nt in enumerate(sequ):
        p = tata_probs[nt][position]
        prob = prob * p
        print(position, nt, p, prob)
    return prob
```

Now we can easily use our function to compute the likelihood of some other possible TATA-box sequences. For example, the three sequences below are "very good" TATA-boxes that differ from the "best" TATA box at one of the three "degenerate" positions in the motif. Notice that the overall probability of getting one of these three imperfect motifs is substantially higher than the probability of the perfect TATA-box. In fact, although the TATA-box is a strong motif, fewer than 10% of the sequences generated according to our model will actually match the "best" sequence.
```
TATATAAG
TATAAATG
TATAAAAA
```
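The comparison described above can be checked with a short sketch. It assumes the pseudocounted probabilities are normalized by each row's total (393), and it uses a hypothetical `likelihood_tata_quiet()` that matches `likelihood_tata()` except that it does not print each step:

```python
import pandas as pd

tata_counts = pd.DataFrame({'A': [  16, 352,   3, 354, 268, 360, 222, 155],
                            'C': [  46,   0,  10,   0,   0,   3,   2,  44],
                            'G': [  18,   2,   2,   5,   0,  20,  44, 157],
                            'T': [ 309,  35, 374,  30, 121,   6, 121,  33]})

# Assumption: pseudocount of 1 per entry, each row divided by its own total
tata_counts_pseudo = tata_counts + 1
tata_probs = tata_counts_pseudo.div(tata_counts_pseudo.sum(axis=1), axis=0)

def likelihood_tata_quiet(sequ):
    # Same calculation as likelihood_tata(), without the per-position printing
    prob = 1
    for position, nt in enumerate(sequ):
        prob = prob * tata_probs[nt][position]
    return prob

best = likelihood_tata_quiet('TATAAAAG')

# The three sequences are mutually exclusive, so their probabilities add
near_best = 0
for sequ in ['TATATAAG', 'TATAAATG', 'TATAAAAA']:
    near_best = near_best + likelihood_tata_quiet(sequ)

print('best:', best)
print('any of the three near-best:', near_best)
```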

If we want to use our Bayesian framework to think about TATA-boxes, we need some additional information. What is $P(\;\mathtt{TATAAAAG}\;|\;\textit{not}\,\textrm{TATA-box}\;)$? We need a model for all the other sequences in the genome, often called a "background" model.

The easy background model is independent nucleotides, with probabilities determined by the overall composition of the genome. We just counted the overall number of `A`s etc in the yeast genome. A rough estimate is

```
background = pd.Series({'A': 0.31, 'C': 0.19, 'G': 0.19, 'T': 0.31})
```

_Exercise_ Use the `background` defined above to write a `likelihood_background()` function that calculates the likelihood of generating a given sequence under the model of random yeast genome sequence.

Since the "worst" TATA-box is GC-rich and the "best" TATA-box is AT-rich, the odds of getting the "best" TATA-box by chance in random yeast sequence are somewhat higher than the odds of getting the "worst" one. Of course, the chance of getting the "best" sequence under our TATA-box probabilistic model is dramatically higher than the chance of getting the "worst" sequence. We can use the _ratio of the likelihoods_ as a measure of how well two different models fit a given sequence.

Below, we compute the likelihood ratios for the "best" sequence TATAAAAG, the "worst" sequence ACGCGCCT, and getting any one of the three very-good sequences TATATAAG, TATAAATG, and TATAAAAA.
```
print(likelihood_tata('TATAAAAG') / likelihood_background('TATAAAAG'))
print(likelihood_tata('ACGCGCCT') / likelihood_background('ACGCGCCT'))

print( (likelihood_tata('TATATAAG') + likelihood_tata('TATAAATG') + likelihood_tata('TATAAAAA'))
       / (likelihood_background('TATATAAG') + likelihood_background('TATAAATG') + likelihood_background('TATAAAAA')) )
```

We can go one step further and turn this likelihood ratio into a function
```
def likelihood_ratio(sequ):
    return likelihood_tata(sequ) / likelihood_background(sequ)
```

We might want to scan a whole promoter to find a TATA-box. Here is the promoter region for the yeast _CDC19_ gene.
```
cdc19_prm = 'TATGATGCTAGGTACCTTTAGTGTCTTCCTAAAAAAAAAAAAAGGCTCGCCATCAAAACGATATTCGTTGGCTTTTTTTTCTGAATTATAAATACTCTTTGGTAACTTTTCATTTCCAAGAACCTCTTTTTTCCAGTTATATCATG'
```
We need to extract 8-nucleotide chunks out of the promoter. Square brackets can extract a _range_ of values from a string or a list. To do this, we do `[start:end]` where the start is _included_ and the end is _excluded_.

```
alphabet = 'abcdefghijklmnopqrstuvwxyz'
alphabet[2:6]
```

This code goes from index 2 (the 3rd entry, `c`) to index 5 (`f`) and does not include index 6 (`g`).

We can use this to run
```
likelihood_ratio(cdc19_prm[0:8])
likelihood_ratio(cdc19_prm[1:9])
```

Now we can loop over each starting position in `cdc19_prm` and compute its likelihood.

We start at position 0 and we run until the _end_ of our 8-position window is at the end of the promoter. This happens when `start+8 = len(cdc19_prm)` or equivalently `start = len(cdc19_prm) - 8`.

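The `range(start, end)` function creates a series of numbers that begins at `start` and stops just before `end`; like a slice, it _excludes_ its end value. To include the last window, which starts at `len(cdc19_prm) - 8`, we need to add 1 to the end of the range.

To start, we can write the loop
```
for start in range(0, len(cdc19_prm) - 8 + 1):
    print(str(start) + ' ' + cdc19_prm[start:start+8])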
```
and if all of that looks good we can add in a call to `likelihood_ratio()`.

Then we can build a _list_ of these likelihood ratios and convert it into a Pandas `Series`.
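A self-contained sketch of the whole scan follows. It rests on several assumptions: the pseudocounted probabilities are normalized by each row's total, the background model is the rough estimate given earlier, and `likelihood_ratio_quiet()` is a hypothetical variant that computes both likelihoods in one loop without printing. It is one possible version, not the only way to write it.

```python
import pandas as pd

tata_counts = pd.DataFrame({'A': [  16, 352,   3, 354, 268, 360, 222, 155],
                            'C': [  46,   0,  10,   0,   0,   3,   2,  44],
                            'G': [  18,   2,   2,   5,   0,  20,  44, 157],
                            'T': [ 309,  35, 374,  30, 121,   6, 121,  33]})
tata_counts_pseudo = tata_counts + 1
tata_probs = tata_counts_pseudo.div(tata_counts_pseudo.sum(axis=1), axis=0)

background = pd.Series({'A': 0.31, 'C': 0.19, 'G': 0.19, 'T': 0.31})

def likelihood_ratio_quiet(sequ):
    # Likelihood under the TATA-box model divided by likelihood under background
    p_tata = 1
    p_background = 1
    for position, nt in enumerate(sequ):
        p_tata = p_tata * tata_probs[nt][position]
        p_background = p_background * background[nt]
    return p_tata / p_background

cdc19_prm = 'TATGATGCTAGGTACCTTTAGTGTCTTCCTAAAAAAAAAAAAAGGCTCGCCATCAAAACGATATTCGTTGGCTTTTTTTTCTGAATTATAAATACTCTTTGGTAACTTTTCATTTCCAAGAACCTCTTTTTTCCAGTTATATCATG'

# Build a list with one likelihood ratio per 8-nucleotide window.
# The "+ 1" is needed because range() excludes its end value.
ratios = []
for start in range(0, len(cdc19_prm) - 8 + 1):
    ratios.append(likelihood_ratio_quiet(cdc19_prm[start:start+8]))

# Convert the list into a Series; its index is the window start position
ratio_series = pd.Series(ratios)

# Report the best-scoring window
best_start = ratio_series.idxmax()
print(best_start, cdc19_prm[best_start:best_start+8], ratio_series[best_start])
```

Because the index of `ratio_series` is the start position of each window, `ratio_series.plot()` would show the score along the promoter.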