## Dinucleotides and dipeptides

We counted the occurrence of individual nucleotides in the genome and residues in the proteome.

In real biological sequences, adjacent positions are rarely independent. We now have most of the tools to measure this directly.

There are a couple of small, additional things we need to learn first, though. We know how to get _one_ nucleotide from a string, but we need a way to get two adjacent nucleotides out of a string.

We could get each of the two letters separately and add them together. Here is how we would get the 3rd and 4th letters of the alphabet out of a string containing the whole alphabet:
```
alphabet='abcdefghijklmnopqrstuvwxyz'
letter_2 = alphabet[2]
letter_3 = alphabet[3]
letters = letter_2 + letter_3
print(letters)
```

Try this out in the cell below, recalling that Python will start counting from 0.

In [None]:
alphabet='abcdefghijklmnopqrstuvwxyz'

Alternatively, we can **slice** out two letters at once from a string.

Square brackets can extract a _range_ of values from a string or a list. To do this, we write `[start:end]`, where the start is _included_ and the end is _excluded_.

```
alphabet[2:4]
```

This code goes from index 2 (the letter `c`) to index 3 (the letter `d`) and does not include index 4 (`e`). This can be a bit confusing, but one nice aspect is that the length of the slice is easy to see: it is just `end - start`. For instance, `alphabet[5:7]` is `7 - 5 = 2` letters long.
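
As a quick check of that length rule, here is a minimal sketch. The names `start`, `end`, and `piece` are made up for this illustration.

```python
alphabet = 'abcdefghijklmnopqrstuvwxyz'

# Hypothetical names, chosen just for this illustration
start = 5
end = 7

piece = alphabet[start:end]   # includes index 5, excludes index 7
print(piece)                  # fg
print(len(piece))             # 2
print(end - start)            # 2, the same as the length of the slice
```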

Try out this slicing in the cell below.

In [None]:
alphabet[2:4]

Whether we index two letters individually or slice a two-letter piece out of our string, we need to loop over each possible starting position. For `alphabet`, we want to go all the way from `alphabet[0:2]` through `alphabet[24:26]`.

The `range()` function allows us to iterate over a range of numbers. As with slices, the start is included while the end is not, but here we write `range(start, end)` with a comma rather than `[start:end]`. Try out the example below to see how this runs.
```
for x in range(3,6):
    print(x)
```

In [None]:
for x in range(3,6):
    print(x)

Each chromosome is a different length, and so we want to compute the start and end using `len(...)` rather than picking a specific number. In the `alphabet` example, for a 2 letter slice we run from starting position 0 to starting position `len(alphabet) - 2 = 26 - 2 = 24`.

However, the end of the range is not included, and so we want to do `range(0, 1 + len(alphabet)-2)`, which should use all twenty-five starting positions from 0 through 24 inclusive.

The careful tracking of whether an endpoint is included or excluded, and whether you need to add or subtract 1 from a starting and ending point, shows up all the time in bioinformatics. It's often called a "fencepost" problem, referring to the fact that you need four fenceposts to hold up three sections of fence. It's also called an "off-by-one" problem, because we often find ourselves one position too long or too short.
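
One way to catch an off-by-one mistake is to check the first and last slices directly. The sketch below assumes a two-letter window and uses made-up names (`width`, `starts`, `too_short`) to compare the correct range against one that forgets the `1 +`.

```python
alphabet = 'abcdefghijklmnopqrstuvwxyz'
width = 2  # length of each slice (hypothetical name)

# Correct: the excluded end is one past the last valid starting position
starts = list(range(0, 1 + len(alphabet) - width))
print(len(starts))                                       # 25
print(alphabet[starts[-1]:starts[-1] + width])           # yz

# Off by one: forgetting the "1 +" silently drops the final pair
too_short = list(range(0, len(alphabet) - width))
print(len(too_short))                                    # 24
print(alphabet[too_short[-1]:too_short[-1] + width])     # xy
```

Note that the buggy version raises no error; it just returns one pair too few, which is why checking the last slice is worthwhile.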

Try out this example below to confirm how we can get every two-letter pair out of a string.

```
for start in range(0, 1 + len(alphabet)-2):
    print(str(start) + ' ' + alphabet[start:start+2])
```

In [None]:
for start in range(0, 1 + len(alphabet)-2):
    print(str(start) + ' ' + alphabet[start:start+2])

Now that we know how to use `range()` and slices, we are really equipped with all the tools we need to count dinucleotides in the yeast genome.

### Yeast genome dinucleotides
First we need to import the `Bio.SeqIO` module from `biopython` so we can read in our yeast sequences.

In [None]:
import sys
!{sys.executable} -m pip install biopython
from Bio import SeqIO

Then we need to import the `pandas` module for our `Series` and `DataFrame` types, and the `matplotlib.pyplot` module to make graphs.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt 

Here is a copy of our code to
1. Create `chroms` as an iterator over all the chromosomes
2. Create an empty dictionary to hold single-nucleotide counts
3. Loop over each chromosome
    1. Assign the sequence of the chromosome to `chrom_seq`
    1. Loop over each position in the chromosome
        1. Assign the nucleotide at that position to `nt`
        1. Add that nucleotide to the running tally
4. Convert the count dictionary into a `Series`
5. Print the sorted version of our count series
6. Plot a bar graph of our counts

In [None]:
chroms = SeqIO.parse("../S288C_R64-2-1/S288C_reference_sequence_R64-2-1_20150113.fsa", "fasta")
nt_count = {}
for chrom in chroms:
    chrom_seq = str(chrom.seq)
    for position in range(0, len(chrom_seq)):
        nt = chrom_seq[position]
        nt_count[nt] = nt_count.get(nt, 0) + 1
nt_series = pd.Series(nt_count)
print(nt_series.sort_index())
nt_series.sort_index().plot(kind='bar')

### Dinucleotides

Convert this code to count every dinucleotide, that is, every pair of adjacent nucleotides. You'll need to "slice" these out of the chromosome sequences.
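
As a possible starting point, here is a minimal sketch of the same running-tally pattern applied to two-letter slices. The sequence `toy_seq` is made up and the names `toy_seq` and `dinuc_count` are invented for this illustration; in the real exercise the inner loop would run over each `chrom_seq` instead.

```python
# A short made-up sequence, NOT real yeast data
toy_seq = "AACGTACGAT"

dinuc_count = {}
# Starting positions run from 0 through len(toy_seq) - 2 inclusive
for start in range(0, 1 + len(toy_seq) - 2):
    dinuc = toy_seq[start:start + 2]
    dinuc_count[dinuc] = dinuc_count.get(dinuc, 0) + 1

print(dinuc_count)
# A sequence of length 10 contains 9 overlapping dinucleotides
print(sum(dinuc_count.values()))   # 9
```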

#### Probabilities

Convert the counts to probabilities by:
1. Using the `.sum()` method to find the total number of dinucleotides counted
2. Dividing your dinucleotide count series (the counterpart of `nt_series` above) by this sum to get "normalized" probabilities
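
The two steps above can be sketched as follows. The counts are made-up toy numbers rather than yeast results, and `toy_counts`, `dinuc_series`, and `dinuc_prob` are hypothetical names.

```python
import pandas as pd

# Made-up counts for illustration only (not real yeast data)
toy_counts = {'AA': 1, 'AC': 2, 'AT': 1, 'CG': 2, 'GA': 1, 'GT': 1, 'TA': 1}
dinuc_series = pd.Series(toy_counts)

# Step 1: total number of dinucleotides counted
total = dinuc_series.sum()

# Step 2: dividing a Series by a number divides every entry
dinuc_prob = dinuc_series / total

print(dinuc_prob.sort_index())
print(dinuc_prob.sum())   # should be 1, up to floating-point rounding
```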

#### Marginal probabilities

The table of dinucleotide probabilities gives the _joint_ distribution.

There are two ways to compute the _marginal_ probability of an `A`. Compute this both ways and compare it to the value we got from the single-nucleotide counting above.

Write a `for` loop to compute all four marginal probabilities. It's probably easiest to create an empty dictionary, then loop over each nucleotide option, compute its marginal probability, and store it in the dictionary. 

There are many reasonable ways to approach this, though.
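
Here is one such approach, sketched on a made-up toy sequence with plain dictionaries. It assumes the two marginals are taken over the first and the second position of the dinucleotide; the names `joint`, `first_marginal`, and `second_marginal` are invented for this example.

```python
# A short made-up sequence, NOT real yeast data
toy_seq = "AACGTACGAT"

# Count dinucleotides, then normalize to joint probabilities
dinuc_count = {}
for start in range(0, 1 + len(toy_seq) - 2):
    dinuc = toy_seq[start:start + 2]
    dinuc_count[dinuc] = dinuc_count.get(dinuc, 0) + 1
total = sum(dinuc_count.values())
joint = {dinuc: n / total for dinuc, n in dinuc_count.items()}

first_marginal = {}    # P(first nucleotide is nt)
second_marginal = {}   # P(second nucleotide is nt)
for nt in "ACGT":
    # Sum the joint probabilities over the other position
    first_marginal[nt] = sum(p for dinuc, p in joint.items() if dinuc[0] == nt)
    second_marginal[nt] = sum(p for dinuc, p in joint.items() if dinuc[1] == nt)

print(first_marginal)
print(second_marginal)
```

On a sequence this short the two marginals can differ noticeably from each other and from the single-nucleotide frequency, because the last letter is never a first position and the first letter is never a second position. On a whole chromosome that end effect should be tiny.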

#### Conditional probabilities

Compute the _conditional_ probability of a `C` following a first `A`. Is this higher or lower than the unconditional (marginal) probability of a `C`?

If you want to take this a bit further: write a pair of nested for loops to compute all of the conditional probabilities for the 2nd nucleotide of a dinucleotide, conditional on the identity of the first.

What nucleotide combinations have conditional probabilities that are very different from the marginal?

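Another way of looking at this is to compute the ratio `P(MN) / (P(M) * P(N))`, which is the ratio between the observed dinucleotide probability and the dinucleotide probability expected under the assumption of independence. A ratio near 1 is consistent with independence, while a ratio well above or below 1 marks a pair that is over- or under-represented.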
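
A minimal sketch of both calculations, again on a made-up toy sequence. All names here (`joint`, `first_marginal`, `second_marginal`, `cond`, `ratio`) are hypothetical, and the sketch assumes `P(M)` is the first-position marginal and `P(N)` is the second-position marginal.

```python
# A short made-up sequence, NOT real yeast data
toy_seq = "AACGTACGAT"

# Joint probabilities of each dinucleotide
dinuc_count = {}
for start in range(0, 1 + len(toy_seq) - 2):
    dinuc = toy_seq[start:start + 2]
    dinuc_count[dinuc] = dinuc_count.get(dinuc, 0) + 1
total = sum(dinuc_count.values())
joint = {dinuc: n / total for dinuc, n in dinuc_count.items()}

# Marginals for the first and second positions
first_marginal = {}
second_marginal = {}
for nt in "ACGT":
    first_marginal[nt] = sum(p for d, p in joint.items() if d[0] == nt)
    second_marginal[nt] = sum(p for d, p in joint.items() if d[1] == nt)

cond = {}    # cond['AC'] is P(second is C | first is A)
ratio = {}   # ratio['AC'] is P(AC) / (P(A first) * P(C second))
for first in "ACGT":
    for second in "ACGT":
        p_joint = joint.get(first + second, 0.0)
        # Skip any nucleotide that never occurs, to avoid dividing by zero
        if first_marginal[first] > 0 and second_marginal[second] > 0:
            cond[first + second] = p_joint / first_marginal[first]
            ratio[first + second] = p_joint / (first_marginal[first] * second_marginal[second])

print(cond['AC'], second_marginal['C'])
print(ratio['AC'])
```

With only nine dinucleotides the numbers from this toy example are not meaningful in themselves; the point is only the shape of the calculation.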

#### Dipeptides

If you want to take this a lot further, you can run the same sort of analysis on dipeptides in the yeast proteome.
Here's a slightly updated version of our loop to count amino acid frequencies in the yeast proteome, if you want to try this out.

In [None]:
proteins = SeqIO.parse("../S288C_R64-2-1/orf_trans_all_R64-2-1_20150113.fasta", "fasta")
aa_count = {}
for protein in proteins:
    protseq = str(protein.seq)
    for pos in range(0, len(protseq)):
        aa = protseq[pos]
        aa_count[aa] = aa_count.get(aa, 0) + 1
print(aa_count)
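
If you do try dipeptides, one possible adaptation is sketched below on two made-up protein strings. `toy_proteins` and `dipep_count` are hypothetical names; with the real data, the outer loop would run over `proteins` and use `str(protein.seq)` as above.

```python
# Two short made-up protein sequences, NOT real yeast proteins
toy_proteins = ["MKKL", "MKA"]

dipep_count = {}
for protseq in toy_proteins:
    # Slice within each protein separately, so that no dipeptide
    # spans the end of one protein and the start of the next
    for pos in range(0, 1 + len(protseq) - 2):
        dipep = protseq[pos:pos + 2]
        dipep_count[dipep] = dipep_count.get(dipep, 0) + 1

print(dipep_count)   # {'MK': 2, 'KK': 1, 'KL': 1, 'KA': 1}
```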