## Dinucleotides and dipeptides

Previously, we counted the occurrences of individual nucleotides in the genome and residues in the proteome.

In real biological sequences, adjacent positions are rarely independent. We now have ways to talk about these sorts of interdependencies using probabilities.

We'll start by counting adjacent pairs of nucleotides in the genome. When a sequence has $N$ bases, it has $N-1$ adjacent pairs: $0$ and $1$, $1$ and $2$, $2$ and $3$, and so forth all the way to $N-2$, $N-1$.
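To make the indexing concrete, here is a minimal sketch in plain Python. The sequence `GATTACA` is a made-up example, not data from the yeast genome.

```python
# Hypothetical toy sequence, used only to illustrate the indexing
seq = "GATTACA"
n = len(seq)

# Pair i is made of positions i and i+1, for i = 0 ... N-2
pairs = [seq[i] + seq[i + 1] for i in range(n - 1)]

print(n, len(pairs))  # 7 bases give 6 adjacent pairs
print(pairs)          # ['GA', 'AT', 'TT', 'TA', 'AC', 'CA']
```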

An easy way to get a pandas `Series` of these adjacent pairs is to:
1. create a Series of first nucleotides in a pair
2. create a Series of second nucleotides in a pair
3. add together these two series

We'll see how this works on a test string:
```
alphabet='abcdefghijklmnopqrstuvwxyz'
```
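One way the three steps could look on this test string is sketched below. The names `firsts`, `seconds`, and `pairs` are placeholders chosen for this illustration.

```python
import pandas as pd

alphabet = 'abcdefghijklmnopqrstuvwxyz'

# Step 1: every letter except the last can be the first member of a pair
firsts = pd.Series(list(alphabet[:-1]))

# Step 2: every letter except the first can be the second member of a pair
seconds = pd.Series(list(alphabet[1:]))

# Step 3: adding two Series of strings concatenates them element by element
pairs = firsts + seconds

print(pairs.head())
print(len(pairs))  # 26 letters give 25 adjacent pairs
```

The addition lines up because both Series carry the same default index, so element `i` of `firsts` is joined to element `i` of `seconds`.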

### Yeast proteome dipeptides
First we need to import the `Bio.SeqIO` module from `biopython` so we can read in our yeast sequences.

Then we need to import the `pandas` module for our `Series` and `DataFrame` types, and the `matplotlib.pyplot` module to make graphs.

Here is a copy of our code to:
1. Create `proteins` as an iterator over all the protein sequences
2. Create an empty `Series` of amino acid counts
3. Loop over each protein
    1. Count the number of residues in that one protein
    2. Add that residue count to the running tally
4. Print the sorted version of our count series
5. Plot a bar graph of our counts
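The counting loop can be sketched as follows. This is an illustration rather than the original cell: a short list of made-up sequences stands in for the records we would read with `Bio.SeqIO`, and the name `aa_counts` is an assumption.

```python
import pandas as pd

# Hypothetical toy sequences standing in for the yeast proteome
proteins = ["MAACA", "MACK", "MKACA"]

# Start from an empty tally of amino acid counts
aa_counts = pd.Series(dtype=float)

for seq in proteins:
    # Count the residues in this one protein
    one_protein = pd.Series(list(seq)).value_counts()
    # Add to the running tally; fill_value=0 handles residues not seen before
    aa_counts = aa_counts.add(one_protein, fill_value=0)

print(aa_counts.sort_values())
```

With `matplotlib` available, `aa_counts.plot(kind='bar')` would draw the bar graph from the last step.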

### Dipeptides

Now we'll use the approach above to count every adjacent pair of amino acids.

We'll make a series of first amino acids in `first_aas`, a series of second amino acids in `second_aas`, and then combine them to count them.
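A minimal sketch of this pair counting is given below, again on made-up toy sequences instead of the real proteome. The name `dipeptides` is an assumption; `first_aas`, `second_aas`, and `total_counts` follow the names used in the text.

```python
import pandas as pd

# Hypothetical toy sequences standing in for the yeast proteome
proteins = ["MAACA", "MACK", "MKACA"]

total_counts = pd.Series(dtype=float)

for seq in proteins:
    first_aas = pd.Series(list(seq[:-1]))   # residues 0 ... N-2
    second_aas = pd.Series(list(seq[1:]))   # residues 1 ... N-1
    dipeptides = first_aas + second_aas     # element-wise string concatenation
    # Tally the pairs from this protein into the running total
    total_counts = total_counts.add(dipeptides.value_counts(), fill_value=0)

print(total_counts.sort_values())
```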

Sorting the counts with `total_counts.sort_values()` lists the pairs from rarest to most common.

#### Probabilities

Convert the counts to probabilities in a variable `dipep_probs` by:
1. Using the `.sum()` method to find the total number of amino acid pairs counted
2. Dividing the `total_counts` series by this sum to get "normalized" probabilities
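These two steps might look like the sketch below. The counts are toy values invented for illustration, not results from the yeast proteome, and `n_pairs` is a placeholder name.

```python
import pandas as pd

# Toy pair counts standing in for the real `total_counts`
total_counts = pd.Series(
    {"MA": 2, "AA": 1, "AC": 3, "CA": 2, "CK": 1, "MK": 1, "KA": 1}
)

# Step 1: total number of amino acid pairs counted
n_pairs = total_counts.sum()

# Step 2: divide every count by the total so the values sum to 1
dipep_probs = total_counts / n_pairs

print(dipep_probs)
print(dipep_probs.sum())
```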

#### Marginal probabilities

The table of amino acid _pair_ probabilities gives the _joint_ distribution.

There are two ways to compute the _marginal_ probability of an `A`. We can count every time an `A` shows up in the first position, and we can count every time an `A` shows up in the second position.

Compute this both ways and compare each result to the value we got from the single-residue counting above.
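One possible way to set up this comparison is sketched here on made-up toy sequences. The names `aa_probs`, `p_a_first`, and `p_a_second` are assumptions.

```python
import pandas as pd

# Hypothetical toy sequences standing in for the yeast proteome
proteins = ["MAACA", "MACK", "MKACA"]

# Joint distribution over adjacent pairs
pairs = pd.concat(
    [pd.Series(list(s[:-1])) + pd.Series(list(s[1:])) for s in proteins]
)
dipep_probs = pairs.value_counts(normalize=True)

# Single-residue probabilities for comparison
aa_probs = pd.Series(list("".join(proteins))).value_counts(normalize=True)

# Marginal of A in the first position: sum over all pairs starting with A
p_a_first = dipep_probs[dipep_probs.index.str.startswith("A")].sum()

# Marginal of A in the second position: sum over all pairs ending with A
p_a_second = dipep_probs[dipep_probs.index.str.endswith("A")].sum()

print(p_a_first, p_a_second, aa_probs["A"])
```

On sequences this short the three numbers differ noticeably, because the first residue of each sequence is never a "second" and the last is never a "first". With long sequences these end effects become small.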

Compute all of the marginal probabilities. 

There are many reasonable ways to approach this; one is to use a `for` loop.
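A loop-based version could look like the sketch below. It uses toy counts invented for illustration, and the names `first_marginals`, `second_marginals`, and `marginals` are assumptions.

```python
import pandas as pd

# Toy pair counts standing in for the real `total_counts`
total_counts = pd.Series(
    {"MA": 2, "AA": 1, "AC": 3, "CA": 2, "CK": 1, "MK": 1, "KA": 1}
)
dipep_probs = total_counts / total_counts.sum()

first_marginals = {}
second_marginals = {}

# Each pair contributes its probability to one first-position marginal
# and to one second-position marginal
for pair, prob in dipep_probs.items():
    first, second = pair[0], pair[1]
    first_marginals[first] = first_marginals.get(first, 0.0) + prob
    second_marginals[second] = second_marginals.get(second, 0.0) + prob

# Residues never seen in a position get probability 0 instead of NaN
marginals = pd.DataFrame(
    {"first": first_marginals, "second": second_marginals}
).fillna(0.0)

print(marginals)
```

Each column should sum to 1, which is a quick check that no pair was dropped.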

#### Conditional probabilities

Compute the _conditional_ probability of a `C` following a first `A`. Is this higher or lower than the unconditional (marginal) probability of a `C`?

Another way of looking at this is to compute the ratio `P(AC) / (P(A) * P(C))`, which is the ratio between the observed dipeptide probability and the expected dipeptide probability under the assumption of independence.
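Both quantities can be sketched on toy counts as follows. The numbers are invented for illustration. This sketch also assumes position-specific marginals (`A` as a first residue, `C` as a second residue); using the single-residue probabilities instead would be another reasonable choice.

```python
import pandas as pd

# Toy pair counts standing in for the real `total_counts`
total_counts = pd.Series(
    {"MA": 2, "AA": 1, "AC": 3, "CA": 2, "CK": 1, "MK": 1, "KA": 1}
)
dipep_probs = total_counts / total_counts.sum()

# Joint probability of the pair "A then C"
p_ac = dipep_probs["AC"]

# Marginals: A in the first position, C in the second position
p_a_first = dipep_probs[dipep_probs.index.str.startswith("A")].sum()
p_c_second = dipep_probs[dipep_probs.index.str.endswith("C")].sum()

# Conditional probability of C given that the first residue is A
p_c_given_a = p_ac / p_a_first

# Observed / expected under independence; 1 would mean no association
ratio = p_ac / (p_a_first * p_c_second)

print(p_c_given_a, p_c_second, ratio)
```

The ratio equals `p_c_given_a / p_c_second`, so a value above 1 says the same thing as the conditional probability being higher than the marginal.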

### _Exercise_ 

The file `../S288C_R64-3-1/S288C_reference_sequence_R64-3-1_20210421.fsa` has the nucleotide sequence of the yeast genome. Each chromosome has its own sequence entry. 

Count the dinucleotide frequencies in the genome.

Use the dinucleotide frequencies to compute dinucleotide probabilities in a variable named `dint_probs`.

Compute the marginal probabilities of each nucleotide in a variable called `marginal_probs`.

Determine the _conditional_ probability of a `G` base following a `C` base. How does this compare to the overall probability of a `G` base?