# Probability As Counting

_MFA1_ is a fairly short budding yeast gene that encodes a-factor, a mating pheromone. The cell below creates a variable named `mfa1` that holds a string containing the nucleotide sequence of the _MFA1_ gene.

Because even this short gene sequence is a long string, it is split across two lines, using a backslash (`\`) to continue the string on the next line.

In [None]:
mfa1="ATGCAACCATCTACCGCTACCGCCGCTCCAAAAGAAAAGACCAGCAGTGAAAAGAAGGAC\
AACTATATTATCAAAGGTGTCTTCTGGGACCCAGCATGTGTTATTGCTTAG"
mfa1

We can use the `len()` function to find the length of this string, which should be a multiple of 3.

In [None]:
len(mfa1)
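One quick way to check whether a length is a multiple of 3 is the modulo operator `%`, which gives the remainder after division. The sketch below uses a short made-up sequence (`toy` is a hypothetical string, not _MFA1_), so it shows the idea without relying on the cell above.

```python
# A made-up 9-nucleotide sequence, used only for illustration
toy = "ATGGCTTAG"

# The remainder is 0 exactly when the length is a multiple of 3
remainder = len(toy) % 3
print(len(toy), remainder)
```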

We can pick individual nucleotides out of this string using square brackets to _index_ into the string. Python starts counting positions from 0, so the first nucleotide is retrieved using `[0]`.

In [None]:
mfa1[0]

and the second nucleotide is retrieved using `[1]`.

In [None]:
mfa1[1]

**Exercise** Retrieve the _last_ nucleotide of _MFA1_ from the string -- but don't just rely on knowing the length of _MFA1_; compute its length using `len()`.

In [None]:
mfa1[...]

We can also _slice_ out a range of nucleotides using square brackets. The cell below will _slice_ out the second codon of _MFA1_.

In [None]:
mfa1[3:6]

In the cell above, we use a colon inside the square brackets to slice out a range of positions. The range is written `start:end`, but notice that the slice above runs from 3 to 6 in order to extract the nucleotides at positions 3, 4, 5 and _not_ 6. That is, the `start` is _included_ in the range but the `end` is _excluded_.
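As a small illustration of the `start:end` rule, the sketch below slices a short made-up sequence (`toy` is a hypothetical string, not part of _MFA1_). One handy consequence of excluding the `end` position is that a slice always contains `end - start` characters.

```python
# A made-up 9-nucleotide sequence, used only for illustration
toy = "ATGGCTTAG"

first_codon = toy[0:3]    # positions 0, 1, 2
second_codon = toy[3:6]   # positions 3, 4, 5 (position 6 is excluded)
print(first_codon, second_codon)

# The number of characters in a slice is end - start
print(len(toy[3:6]) == 6 - 3)
```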

**Exercise** Slice out the last codon of _MFA1_, using `len()`. (_Hint_: the result should be a stop codon.)

In [None]:
mfa1[...]

Now, we want to count how many times each individual nucleotide appears in the _MFA1_ gene. We'll use a dictionary to keep a tally of how many times we encounter each nucleotide as we run down the length of the `mfa1` gene. In this dictionary, the _keys_ will be nucleotides and the _values_ will be counts.

At each position, we'll look up the current nucleotide count and update it by adding one.

In [None]:
nt_counts = {'A': 3, 'C': 7, 'T': 5}
nt_counts['A'] = nt_counts['A'] + 1
nt_counts

We'll start with an empty dictionary, though, and so we won't have an entry for a nucleotide the first time we encounter it. The cell below produces an error because there's no entry for `G`.

In [None]:
nt_counts['G'] = nt_counts['G'] + 1

Instead, we can use the dictionary's `get()` method to look up a key and return a default value of 0 if the key doesn't exist. In the cell below, we show how `get()` allows us to count the first occurrence of `G`, when it isn't in the dictionary yet, and also to count the eighth occurrence of `C`, by finding the old entry for `C` in the dictionary and adding one.

In [None]:
nt_counts['G'] = nt_counts.get('G', 0) + 1
nt_counts['C'] = nt_counts.get('C', 0) + 1
nt_counts

In order to get ready for some more complicated tasks later on, we'll write a `for` loop that iterates through each position of the `mfa1` gene and uses square brackets to look up the nucleotide at that position. We'll use the `range()` function to generate the sequence of positions. Here's an example of using `range()` in a `for` loop:

In [None]:
for position in range(0,4):
    nt = mfa1[position]
    print(position, nt)

Now we can put all of this together in order to 
* create a new, empty dictionary for nucleotide counts
* loop over each position in `mfa1`,
* index into `mfa1` to find the nucleotide at that position, and
* increment the count of nucleotides

In [None]:
nt_counts = {}
for position in range(0, len(mfa1)):
    nt = mfa1[position]
    nt_counts[nt] = nt_counts.get(nt, 0) + 1
nt_counts

We can use a similar approach to count the _codons_ in `mfa1`. To do this, we need to make two changes to our nucleotide-counting approach:
1. We need to slice out a whole codon rather than indexing a single position, and
2. We need to loop over codon starting positions (0, 3, 6, 9, ...) and not nucleotide positions. Our `range()` function can go by steps of 3 like this: `range(start, end, 3)`.
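
To see which starting positions a stepped `range()` produces, we can convert it to a list. The end value of 12 below is arbitrary, standing in for the length of a hypothetical 12-nucleotide sequence.

```python
# Codon start positions for a hypothetical 12-nucleotide sequence
starts = list(range(0, 12, 3))
print(starts)   # [0, 3, 6, 9]
```

As with slices, the end value (12) is excluded, so the last start position is 9.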

**Exercise** Fill in the `...` below in order to count how many times each codon occurs in _MFA1_.

In [None]:
codon_counts = {}
for position in range(0, len(mfa1), 3):
    codon = ...
    codon_counts[codon] = codon_counts.get(codon, 0) + 1
codon_counts

Next, we'll move on to counting nucleotides in the whole yeast genome. We don't want to include the whole genome sequence in this notebook, and so we'll use existing Python tools to read it from a file. First, we need to install the `biopython` package.

In [None]:
!pip install biopython

Then, we need to import the `SeqIO` module, which has functions for reading Fasta-format files.

In [None]:
from Bio import SeqIO

The `SeqIO` module has a function called `parse()` that reads sequence entries from a Fasta-format file. The Fasta format is pretty simple: each sequence has a name on a line starting with a `>`, followed by the sequence itself. So, a Fasta file might look like:
```
>one
AGCTACGT...
>two
TGACTGCA...
...
```
The `parse()` function will turn each of these into a `SeqRecord`, a custom data type that holds the name and the sequence. You can get the sequence name from `record` using `record.id` and the sequence itself using `record.seq`. This sequence isn't an ordinary Python string -- it's another custom data type, called a `Seq`, but you can convert it into a string using `str(record.seq)`.
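
To make the format itself concrete, here is a minimal hand-written sketch of what reading Fasta text involves. It is not how Biopython implements `parse()`; the function name `read_fasta` and the example text are made up for illustration, and real Fasta files have details (such as a description after the name) that this sketch ignores.

```python
def read_fasta(text):
    """Return a list of (name, sequence) pairs from Fasta-format text."""
    records = []
    name = None
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue                      # skip blank lines
        if line.startswith(">"):
            if name is not None:          # finish the previous entry
                records.append((name, "".join(chunks)))
            name = line[1:]               # drop the leading '>'
            chunks = []
        else:
            chunks.append(line)           # a sequence may span several lines
    if name is not None:                  # finish the final entry
        records.append((name, "".join(chunks)))
    return records

# Made-up Fasta text with two entries; the first sequence spans two lines
example = ">one\nAGCT\nACGT\n>two\nTGACTGCA\n"
for name, seq in read_fasta(example):
    print(name, len(seq), seq)
```

In practice we'll rely on `SeqIO.parse()`, which handles these details for us.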

The cell below will loop through entries from the yeast genome sequence and print the sequence name, the length of the sequence, and the first 20 nucleotides of the sequence. 

In [None]:
for record in SeqIO.parse("S288C_R64-2-1/S288C_reference_sequence_R64-2-1_20150113.fsa", "fasta"):
    seq = str(record.seq)
    print(record.id, "\t", len(seq), "\t", seq[0:20])

**Exercise** Fill in the `...` below in order to count the occurrences of each nucleotide in the yeast genome.

In [None]:
nt_counts = {}
for record in SeqIO.parse("S288C_R64-2-1/S288C_reference_sequence_R64-2-1_20150113.fsa", "fasta"):
    seq = str(record.seq)
    for position in range(0, len(record.seq)):
        ...
nt_counts

Our `nt_counts` tells us the _frequency_ of each nucleotide, that is, the number of times it occurs in the genome. We now want to use this to calculate the _probability_ of each nucleotide. To do this, we can simply divide by the total number of nucleotides we counted. Although we do know the four possible nucleotide options and we could manually look up each of `A`, `C`, `G`, and `T` in the `nt_counts` dictionary, it's a better practice to do this in a more programmatic way.

The `values()` method returns a kind of list of all the value entries in a dictionary:

In [None]:
nt_counts.values()

We can loop over this list in a `for` loop:

In [None]:
for value in nt_counts.values():
    print(value)

**Exercise** Complete the for loop below in order to count the total number of nucleotides, which should match the size of the yeast genome.

In [None]:
nt_total = 0
for value in nt_counts.values():
    ...
nt_total

Since looping over a list and adding up all the values is a pretty common task, Python includes a function, `sum()`, that carries out this task.

In [None]:
sum(nt_counts.values())

Now we have all the data we need to construct a new dictionary of nucleotide probabilities. Conceptually, we want to loop over each nucleotide and its count in the `nt_counts` dictionary and use these to build a new `nt_probs` dictionary. Dictionaries have another method, `items()`, that produces a kind of list of keys (nucleotides) and values (counts).

In [None]:
nt_counts.items()

We can loop over this list in a `for` loop as well. Notice how we capture the two parts of the "item" entry, the nucleotide and the count, using two different loop variables.

In [None]:
for nt, count in nt_counts.items():
    print(nt + "\t" + str(count))

**Exercise** Complete the cell below in order to build and print the dictionary of nucleotide probabilities.

In [None]:
nt_probs = {}
for nt, count in nt_counts.items():
    ...
nt_probs

# Conditional Probabilities

Next, we want to investigate the _dependence_ of adjacent nucleotides in the yeast genome. In probabilistic terms, we want to learn the _conditional probabilities_ of, for instance, seeing a `C` when the previous nucleotide was an `A`, and compare this to the unconditional probability of seeing a `C`.

In this case, the easiest way to find these _conditional_ probabilities is to find the _joint_ probability of each dinucleotide in the genome and then use the formula for conditional probability.

**Exercise** Complete the cell below in order to build and print the dictionary of dinucleotide counts in the yeast genome. Just as we saw in counting codons, we'll need to slice a dinucleotide out of the sequence string. However, we want to count _every_ dinucleotide, including overlapping ones, so we step through the sequence one position at a time rather than three at a time. Also, we need to remember that the last _dinucleotide_ starts at the second-to-last nucleotide in the sequence, which is why the loop below stops one position early.

In [None]:
dint_counts = {}
for record in SeqIO.parse("S288C_R64-2-1/S288C_reference_sequence_R64-2-1_20150113.fsa", "fasta"):
    seq = str(record.seq)
    for position in range(0, len(record.seq)-1):
        ...
dint_counts

**Exercise** Complete the cell below in order to compute dinucleotide probabilities.

In [None]:
dint_total = sum(dint_counts.values())
dint_probs = {}
...
dint_probs

Note that some of these dinucleotides are related by symmetry -- our Fasta file contains the sequence of one DNA strand, and that choice is arbitrary, so for instance every 5'-AG-3' on one strand corresponds to a 5'-CT-3' on the opposite strand. For this reason, the number of `AG` and `CT` dinucleotides is almost the same. In some cases, like 5'-AT-3', the dinucleotide is its own reverse complement and so it doesn't have a dinucleotide "partner". There are six symmetric dinucleotide pairs and four individual dinucleotides.
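
The pairing can be checked with a short sketch that computes the reverse complement of each of the 16 dinucleotides. The helper `reverse_complement` and the variable names here are made up for illustration; they aren't used elsewhere in this notebook.

```python
complement = {"A": "T", "C": "G", "G": "C", "T": "A"}

def reverse_complement(seq):
    """Complement each nucleotide, then reverse the order."""
    return "".join(complement[nt] for nt in reversed(seq))

pairs = []          # dinucleotides with a distinct reverse-complement partner
self_partners = []  # dinucleotides that are their own reverse complement
for nt1 in "ACGT":
    for nt2 in "ACGT":
        dint = nt1 + nt2
        partner = reverse_complement(dint)
        if partner == dint:
            self_partners.append(dint)
        elif dint < partner:          # record each pair only once
            pairs.append((dint, partner))

print(pairs)
print(self_partners)
```

This should list six pairs, including `('AG', 'CT')`, and four self-partnered dinucleotides.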

Recall that the formula for conditional probability is
```
Pr(event | condition) = Pr(event AND condition) / Pr(condition)
```
To look at the conditional dependence of neighboring nucleotides, the _event_ is the second nucleotide and the _condition_ is the first nucleotide. Thus, the Pr(event AND condition) is the dinucleotide probability, and the Pr(condition) is the single nucleotide probability we computed above.

In order to compute these conditional probabilities for neighboring nucleotides, we'll nest two `for` loops, one for each possible first nucleotide, and one for each possible second nucleotide. We can create a dinucleotide sequence from two individual nucleotide sequences by "adding together" strings.

In [None]:
for nt1 in "ACGT":
    for nt2 in "ACGT":
        print(nt1 + nt2)

**Exercise** Complete the cell below in order to compute a dictionary of conditional probabilities for adjacent nucleotides.

In [None]:
dint_conds = {}
for nt1 in "ACGT":
    for nt2 in "ACGT":
        dint = ...
        dint_conds[dint] = ...
dint_conds

You should notice that the _conditional_ probabilities of different values for `nt2` don't exactly match the _unconditional_ probabilities for that nucleotide. This means that adjacent nucleotides in the genome are not _independent_. We can look at how strong this dependence is by asking either
* How different is the conditional probability `Pr(nt2=N|nt1=M)` from the marginal (unconditional) probability `Pr(nt2=N)`?
* How different is the _actual_ dinucleotide probability `Pr(nt1=M AND nt2=N)` from the probability that we would expect under independence, i.e., `Pr(nt1=M) * Pr(nt2=N)`?

It turns out that these are equivalent mathematically:
```
   Pr(nt1=M AND nt2=N) / (Pr(nt1=M)  * Pr(nt2=N))
= (Pr(nt1=M AND nt2=N) /  Pr(nt1=M)) / Pr(nt2=N)
=  Pr(nt2=N | nt1=M)                 / Pr(nt2=N)
```
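
As a quick numeric check of this equivalence, the sketch below uses made-up probabilities (they are not values from the yeast genome) and computes the ratio both ways.

```python
import math

# Hypothetical probabilities, chosen only for illustration
pr_m = 0.3          # Pr(nt1=M)
pr_n = 0.2          # Pr(nt2=N)
pr_mn = 0.09        # Pr(nt1=M AND nt2=N)

# Form 1: observed joint probability vs. expectation under independence
ratio_joint = pr_mn / (pr_m * pr_n)

# Form 2: conditional probability vs. marginal probability
pr_n_given_m = pr_mn / pr_m
ratio_cond = pr_n_given_m / pr_n

print(ratio_joint, ratio_cond)
print(math.isclose(ratio_joint, ratio_cond))
```

With these numbers, both ratios come out to about 1.5, so this hypothetical dinucleotide would appear roughly one and a half times as often as independence predicts.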

In [None]:
dint_dependence = {}
for nt1 in "ACGT":
    for nt2 in "ACGT":
        dint = nt1+nt2
        dint_dependence[dint] = dint_conds[dint] / nt_probs[nt2]
dint_dependence

You should see that there are strong dependencies between adjacent nucleotides.

Because the identity of a nucleotide at position *i* affects what nucleotide shows up at *i+1*, 
and the nucleotide at *i+1* in turn affects what happens at position *i+2*, you can see longer-range correlations in nucleotide sequences even with purely nearest-neighbor interactions.

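In reality, genomes have many different nearest-neighbor and longer-range interactions, but we won't worry about those today. If we assume that only nearest-neighbor dependencies matter, we can break down the probability of a trinucleotide `MNP` step by step: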

```
  Pr(nt3=P AND nt2=N AND nt1=M)
= Pr(nt3=P |   nt2=N AND nt1=M) * Pr(nt2=N AND nt1=M)
= Pr(nt3=P |   nt2=N)           * Pr(nt2=N AND nt1=M)
```
This last step relies on our assumption for the moment that there aren't any longer-range direct correlations, and so we can ignore the identity of `nt1` once we know that `nt2=N`.
```
= Pr(nt3=P | nt2=N) * Pr(nt2=N | nt1=M) * Pr(nt1=M)
```

Now we can use this to calculate the over- or under-representation of trinucleotide sequences:
```
   Pr(nt3=P AND nt2=N AND nt1=M)                     / (Pr(nt3=P) * Pr(nt2=N) * Pr(nt1=M))
=  Pr(nt3=P | nt2=N) * Pr(nt2=N | nt1=M) * Pr(nt1=M) / (Pr(nt3=P) * Pr(nt2=N) * Pr(nt1=M))
=  Pr(nt3=P | nt2=N) * Pr(nt2=N | nt1=M)             / (Pr(nt3=P) * Pr(nt2=N))
= (Pr(nt3=P | nt2=N) / Pr(nt3=P)) * (Pr(nt2=N | nt1=M) / Pr(nt2=N))
```
That is, when `MNP` consists of over-represented dinucleotides `MN` and `NP`, the trinucleotide will be over-represented, and likewise for under-represented pairs of dinucleotides. Of course, the pairs must share the common nucleotide `N` in the middle.
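
Here is a small sketch of this product rule, using hypothetical dependence ratios rather than the values computed from the genome. The dictionary `toy_dependence`, its numbers, and the placeholder letters `M`, `N`, `P` are made up for illustration.

```python
# Hypothetical ratios Pr(nt2 | nt1) / Pr(nt2) for two overlapping dinucleotides
toy_dependence = {"MN": 1.2, "NP": 1.5}

# Under the nearest-neighbor assumption, the trinucleotide ratio is the
# product of the ratios for its two overlapping dinucleotides
trint = "MNP"
first_dint = trint[0:2]    # "MN"
second_dint = trint[1:3]   # "NP", sharing the middle nucleotide N
trint_ratio = toy_dependence[first_dint] * toy_dependence[second_dint]
print(trint, trint_ratio)
```

Two moderately over-represented dinucleotides (1.2 and 1.5) combine into a more strongly over-represented trinucleotide (about 1.8).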

**Exercise** Based only on dinucleotide nearest-neighbor dependencies above, what trinucleotide sequence do you expect to be highly over-represented? Highly under-represented?

In [None]:
over_represented = "AAA"
under_represented = "TAC"

# Random Variables

In [None]:
import numpy as np
import matplotlib.pyplot as plt

In [None]:
di_nts = [ nt1 + nt2 for nt1 in "ACGT" for nt2 in "ACGT"]
di_nts

In [None]:
tetra_nts = [ di1 + di2 for di1 in di_nts for di2 in di_nts]
len(tetra_nts)

In [None]:
octa_nts = [ tetra1 + tetra2 for tetra1 in tetra_nts for tetra2 in tetra_nts]
len(octa_nts)

In [None]:
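# c_count[k] will hold the number of 8-nucleotide sequences with exactly k 'C's
c_count = np.zeros(9)
for sequence in octa_nts:
    n_c = 0
    for nt in sequence:
        if nt == 'C':
            n_c = n_c + 1
    c_count[n_c] += 1
print(c_count)
print(sum(c_count))
# Normalize the counts so that they become probabilities that sum to 1
c_count = c_count / sum(c_count)
print(c_count)
print(sum(c_count))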

In [None]:
plt.plot(c_count, 'o-b')

In [None]:
# Instead of enumerating every sequence, draw 100 random 8-nucleotide sequences
c_count_100 = np.zeros(9)
for seqno in range(0,100):
    seq = ''.join(np.random.choice(np.array(['A', 'C', 'G', 'T']), 8))
    n_c = 0
    for nt in seq:
        if nt == 'C':
            n_c += 1
    c_count_100[n_c] += 1
print(c_count_100)
# Normalize the counts into estimated probabilities
c_count_100 = c_count_100 / sum(c_count_100)
print(c_count_100)

In [None]:
plt.plot(c_count, 'o-b')
plt.plot(c_count_100, '.--r')

In [None]:
c_count_1000 = np.zeros(9)
for seqno in range(0,1000):
    seq = ''.join(np.random.choice(np.array(['A', 'C', 'G', 'T']), 8))
    n_c = 0
    for nt in seq:
        if nt == 'C':
            n_c += 1
    c_count_1000[n_c] += 1
c_count_1000 = c_count_1000 / sum(c_count_1000)


plt.plot(c_count, 'o-b')
plt.plot(c_count_1000, '.--r')

In [None]:
c_count_10000 = np.zeros(9)
for seqno in range(0,10000):
    seq = ''.join(np.random.choice(np.array(['A', 'C', 'G', 'T']), 8))
    n_c = 0
    for nt in seq:
        if nt == 'C':
            n_c += 1
    c_count_10000[n_c] += 1
c_count_10000 = c_count_10000 / sum(c_count_10000)

plt.plot(c_count, 'o-b')
plt.plot(c_count_10000, '.--r')

In [None]:
from scipy.stats import binom
binom.pmf(1, 8, 0.25)
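
For reference, the binomial probability can also be written out by hand as `C(n, k) * p**k * (1-p)**(n-k)`. The sketch below uses only the standard library (`math.comb` requires Python 3.8 or later); the helper name `binom_pmf_by_hand` is made up for illustration and is not part of SciPy.

```python
from math import comb

def binom_pmf_by_hand(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

# Probability of exactly one 'C' in 8 nucleotides when each is 'C' with p = 0.25
print(binom_pmf_by_hand(1, 8, 0.25))

# The probabilities for k = 0..8 should add up to 1
print(sum(binom_pmf_by_hand(k, 8, 0.25) for k in range(0, 9)))
```

With `p = 0.25`, this formula equals `comb(8, k) * 3**(8 - k) / 4**8`, which is the fraction of all 8-nucleotide sequences that contain exactly `k` copies of `C`.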

In [None]:
binom_dist = binom(8, 0.25)
x = np.arange(0, 9)
print(x)
c_binom = binom_dist.pmf(x)
print(c_binom)
plt.plot(c_count, 'o-b')
plt.plot(c_binom, '.--r')

In [None]:
plot19, = plt.plot(binom.pmf(x, 8, 0.19), 'o-c', label='19%') ## Like 'C' in the yeast genome
plot25, = plt.plot(binom.pmf(x, 8, 0.25), 'o-k', label="25%") ## Equal probability
plot31, = plt.plot(binom.pmf(x, 8, 0.31), 'o-m', label="31%") ## Like 'T' in the yeast genome
plt.legend(handles=[plot19, plot25,plot31])