# Conditional probabilities

We're going to analyze conditional probabilities of neighboring nucleotides in order to look for patterns in the yeast genome. We'll start by setting up Biopython and re-running our single-nucleotide counting analysis.

In [None]:
!pip3 install biopython

Then, we need to import the `SeqIO` module, which has functions for reading Fasta-format files.

In [None]:
from Bio import SeqIO

# Count occurrences of nucleotides in the yeast genome
nt_counts = {}
for record in SeqIO.parse("../S288C_R64-2-1/S288C_reference_sequence_R64-2-1_20150113.fsa", "fasta"):
    seq = str(record.seq)
    # Iterate over the nucleotides directly rather than indexing by position
    for nt in seq:
        nt_counts[nt] = nt_counts.get(nt, 0) + 1
print("Nucleotide counts:", nt_counts)

# Sum up total number of nucleotides
nt_total = sum(nt_counts.values())
print("Total nucleotides:", nt_total)

# Divide counts by nucleotide totals to get probabilities
nt_probs = {}
for nt in nt_counts:
    nt_probs[nt] = nt_counts[nt] / nt_total
print("Nucleotide probabilities:", nt_probs)

print("Total nucleotide probabilities:", sum(nt_probs.values()))

# Conditional Probabilities

Next, we want to investigate the _dependence_ of adjacent nucleotides in the yeast genome. In probabilistic terms, we want to learn the _conditional probabilities_ of, for instance, seeing a `C` when the previous nucleotide was an `A`, and compare this to the unconditional probability of seeing a `C`.

In this case, the easiest way to find these _conditional_ probabilities is to find the _joint_ probability of each dinucleotide in the genome and then use the formula for conditional probability.

Since we're talking about neighboring nucleotides, the joint probability is the probability of a given dinucleotide, that is, a specific first nucleotide followed immediately by a specific second nucleotide.
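A minimal sketch of the counting step, assuming a toy list of sequences in place of the genome records. The names `count_dinucleotides` and `toy_sequences` are illustrative and not part of the original analysis. With the real genome, the same loop would run over `str(record.seq)` for each record from `SeqIO.parse`.

```python
def count_dinucleotides(sequences):
    """Count overlapping dinucleotides within each sequence (never across sequences)."""
    counts = {}
    for seq in sequences:
        # The last position has no following nucleotide, so stop one short
        for position in range(len(seq) - 1):
            dinuc = seq[position:position + 2]
            counts[dinuc] = counts.get(dinuc, 0) + 1
    return counts

# Toy stand-in for the chromosome sequences (an assumption for illustration)
toy_sequences = ["ACGTAC", "GGA"]
dinuc_counts = count_dinucleotides(toy_sequences)
print("Dinucleotide counts:", dinuc_counts)
```

Counting within each sequence separately avoids creating an artificial dinucleotide that spans the end of one chromosome and the start of the next.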

Next, we'll convert these dinucleotide counts into dinucleotide probabilities by dividing each count by the total number of dinucleotides.
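The conversion mirrors the single-nucleotide case. Here is a short sketch, assuming a small hypothetical `dinuc_counts` dictionary rather than real genome counts:

```python
# Hypothetical dinucleotide counts for illustration (not measured from yeast)
dinuc_counts = {"AC": 2, "CG": 1, "GT": 1, "TA": 1, "GG": 1, "GA": 1}

# Sum up total number of dinucleotides
dinuc_total = sum(dinuc_counts.values())

# Divide counts by the total to get joint probabilities
dinuc_probs = {}
for dinuc in dinuc_counts:
    dinuc_probs[dinuc] = dinuc_counts[dinuc] / dinuc_total
print("Dinucleotide probabilities:", dinuc_probs)
print("Total dinucleotide probability:", sum(dinuc_probs.values()))
```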

Note that some of these dinucleotides are related by symmetry. Our Fasta file contains the sequence of one DNA strand, and that choice is arbitrary, so, for instance, every 5'-AG-3' on one strand corresponds to a 5'-CT-3' on the opposite strand. For this reason, the numbers of `AG` and `CT` dinucleotides are almost the same. In some cases, like 5'-AT-3', the dinucleotide is its own reverse complement and so it doesn't have a dinucleotide "partner". Of the 16 possible dinucleotides, 12 fall into six reverse-complement pairs and four are their own reverse complement.
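The grouping into partners can be checked by enumerating all 16 dinucleotides. This is a small sketch; the helper `reverse_complement` and the list names are assumptions introduced here for illustration.

```python
complement = {"A": "T", "C": "G", "G": "C", "T": "A"}

def reverse_complement(seq):
    """Return the sequence read 5'-to-3' on the opposite strand."""
    return "".join(complement[nt] for nt in reversed(seq))

self_complementary = []  # dinucleotides that are their own reverse complement
partner_pairs = []       # (dinucleotide, reverse complement), each pair listed once
for nt1 in "ACGT":
    for nt2 in "ACGT":
        dinuc = nt1 + nt2
        partner = reverse_complement(dinuc)
        if partner == dinuc:
            self_complementary.append(dinuc)
        elif dinuc < partner:
            # Keep only the alphabetically first member to avoid listing pairs twice
            partner_pairs.append((dinuc, partner))

print("Self-complementary:", self_complementary)
print("Partner pairs:", partner_pairs)
```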

Recall that the formula for conditional probability is
```
Pr(event | condition) = Pr(event AND condition) / Pr(condition)
```
To look at the conditional dependence of neighboring nucleotides, the _event_ is the second nucleotide and the _condition_ is the first nucleotide. Thus, `Pr(event AND condition)` is the dinucleotide probability, and `Pr(condition)` is the single-nucleotide probability we computed above.

In order to compute these conditional probabilities for neighboring nucleotides, we'll nest two `for` loops, one for each possible first nucleotide, and one for each possible second nucleotide. We can create a dinucleotide sequence from two individual nucleotide sequences by "adding together" strings.
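As a quick illustration of the nested loops and string "addition", the following sketch simply lists every possible dinucleotide (the variable names are illustrative):

```python
nucleotides = "ACGT"

dinucleotides = []
for nt1 in nucleotides:        # each possible first nucleotide
    for nt2 in nucleotides:    # each possible second nucleotide
        dinuc = nt1 + nt2      # "adding" two strings concatenates them
        dinucleotides.append(dinuc)

print(len(dinucleotides), "dinucleotides:", dinucleotides)
```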

We can use this to build a dictionary of conditional nucleotide probabilities. As we build this table, we can also print out the relevant probabilities.
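One way this table could be built is sketched below. The probability values are hypothetical numbers chosen so that the tables are internally consistent. They are not measurements from the yeast genome, and `cond_probs` is an illustrative name.

```python
# Hypothetical probabilities for illustration only (not measured from yeast)
nt_probs = {"A": 0.30, "C": 0.20, "G": 0.20, "T": 0.30}
dinuc_probs = {
    "AA": 0.11, "AC": 0.06, "AG": 0.06, "AT": 0.07,
    "CA": 0.06, "CC": 0.05, "CG": 0.03, "CT": 0.06,
    "GA": 0.06, "GC": 0.03, "GG": 0.05, "GT": 0.06,
    "TA": 0.07, "TC": 0.06, "TG": 0.06, "TT": 0.11,
}

cond_probs = {}
for nt1 in "ACGT":
    for nt2 in "ACGT":
        dinuc = nt1 + nt2
        # Pr(nt2 | nt1) = Pr(nt1 AND nt2) / Pr(nt1)
        cond_probs[dinuc] = dinuc_probs[dinuc] / nt_probs[nt1]
        print("Pr(nt2=%s | nt1=%s) = %.3f   Pr(nt2=%s) = %.3f"
              % (nt2, nt1, cond_probs[dinuc], nt2, nt_probs[nt2]))
```

For each fixed `nt1`, the four conditional probabilities should sum to 1, which is a handy sanity check on the table.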

You should notice that the _conditional_ probabilities of different values for `nt2` don't exactly match the _unconditional_ probabilities for that nucleotide. This means that adjacent nucleotides in the genome are not _independent_. We can look at how strong this dependence is by asking either
* How different is the conditional probability `Pr(nt2=N|nt1=M)` from the marginal (unconditional) probability `Pr(nt2=N)`?
* How different is the _actual_ dinucleotide probability `Pr(nt1=M AND nt2=N)` from the probability that we would expect under independence, i.e., `Pr(nt1=M) * Pr(nt2=N)`?

It turns out that these are equivalent mathematically:
```
   Pr(nt1=M AND nt2=N) / (Pr(nt1=M)  * Pr(nt2=N))
= (Pr(nt1=M AND nt2=N) /  Pr(nt1=M)) / Pr(nt2=N)
=  Pr(nt2=N | nt1=M)                 / Pr(nt2=N)
```
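The equivalence can also be checked numerically. The sketch below computes the ratio both ways, again using hypothetical probabilities rather than real genome values. A ratio of 1 corresponds to independence, a ratio above 1 to over-representation, and a ratio below 1 to under-representation.

```python
# Hypothetical probabilities for illustration only (not measured from yeast)
nt_probs = {"A": 0.30, "C": 0.20, "G": 0.20, "T": 0.30}
dinuc_probs = {
    "AA": 0.11, "AC": 0.06, "AG": 0.06, "AT": 0.07,
    "CA": 0.06, "CC": 0.05, "CG": 0.03, "CT": 0.06,
    "GA": 0.06, "GC": 0.03, "GG": 0.05, "GT": 0.06,
    "TA": 0.07, "TC": 0.06, "TG": 0.06, "TT": 0.11,
}

ratio_from_joint = {}
ratio_from_conditional = {}
for nt1 in "ACGT":
    for nt2 in "ACGT":
        dinuc = nt1 + nt2
        # Observed joint probability over the product expected under independence
        ratio_from_joint[dinuc] = dinuc_probs[dinuc] / (nt_probs[nt1] * nt_probs[nt2])
        # Conditional probability over the marginal probability
        cond_prob = dinuc_probs[dinuc] / nt_probs[nt1]
        ratio_from_conditional[dinuc] = cond_prob / nt_probs[nt2]
        print("%s: %.3f (from joint)  %.3f (from conditional)"
              % (dinuc, ratio_from_joint[dinuc], ratio_from_conditional[dinuc]))
```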

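A ratio of exactly 1 would mean that the two nucleotides are independent. You should see that some dinucleotides have ratios well above or below 1, which indicates strong dependencies between adjacent nucleotides.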

Because the identity of a nucleotide at position *i* affects what nucleotide shows up at *i+1*, and the nucleotide at *i+1* in turn affects what happens at position *i+2*, you can see longer-range correlations in nucleotide sequences even with purely nearest-neighbor interactions.
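To make this concrete, the sketch below chains two nearest-neighbor steps to get `Pr(nt3 | nt1)` by summing over every possible middle nucleotide. It assumes that only nearest-neighbor dependencies exist and uses hypothetical probabilities, not yeast data. The name `two_step` is illustrative.

```python
nucleotides = "ACGT"

# Hypothetical probabilities for illustration only (not measured from yeast)
nt_probs = {"A": 0.30, "C": 0.20, "G": 0.20, "T": 0.30}
dinuc_probs = {
    "AA": 0.11, "AC": 0.06, "AG": 0.06, "AT": 0.07,
    "CA": 0.06, "CC": 0.05, "CG": 0.03, "CT": 0.06,
    "GA": 0.06, "GC": 0.03, "GG": 0.05, "GT": 0.06,
    "TA": 0.07, "TC": 0.06, "TG": 0.06, "TT": 0.11,
}

# Nearest-neighbor conditional probabilities, keyed as first + second nucleotide
cond_probs = {}
for nt1 in nucleotides:
    for nt2 in nucleotides:
        cond_probs[nt1 + nt2] = dinuc_probs[nt1 + nt2] / nt_probs[nt1]

# Pr(nt3 | nt1): sum over the unobserved middle nucleotide nt2
two_step = {}
for nt1 in nucleotides:
    for nt3 in nucleotides:
        two_step[nt1 + nt3] = sum(
            cond_probs[nt1 + nt2] * cond_probs[nt2 + nt3] for nt2 in nucleotides
        )
        print("Pr(nt3=%s | nt1=%s) = %.4f   Pr(nt3=%s) = %.4f"
              % (nt3, nt1, two_step[nt1 + nt3], nt3, nt_probs[nt3]))
```

In this toy table the two-step probabilities still differ from the marginal probabilities, although less than the one-step conditional probabilities do.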

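In reality, genomes have many different nearest-neighbor and longer-range interactions, but we won't worry about those today. Instead, we'll work out the probability of a trinucleotide `MNP` when only nearest-neighbor dependencies matter, starting from the formula for conditional probability: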

```
  Pr(nt3=P AND nt2=N AND nt1=M)
= Pr(nt3=P |   nt2=N AND nt1=M) * Pr(nt2=N AND nt1=M)
= Pr(nt3=P |   nt2=N)           * Pr(nt2=N AND nt1=M)
```
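This last step relies on our assumption for the moment that there aren't any longer-range direct correlations, and so we can ignore the identity of `nt1` once we know that `nt2=N`. Applying the conditional probability formula once more, this time to the dinucleotide term `Pr(nt2=N AND nt1=M)`, gives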
```
= Pr(nt3=P | nt2=N) * Pr(nt2=N | nt1=M) * Pr(nt1=M)
```

Now we can use this to calculate the over- or under-representation of trinucleotide sequences:
```
   Pr(nt3=P AND nt2=N AND nt1=M)                     / (Pr(nt3=P) * Pr(nt2=N) * Pr(nt1=M))
=  Pr(nt3=P | nt2=N) * Pr(nt2=N | nt1=M) * Pr(nt1=M) / (Pr(nt3=P) * Pr(nt2=N) * Pr(nt1=M))
=  Pr(nt3=P | nt2=N) * Pr(nt2=N | nt1=M)             / (Pr(nt3=P) * Pr(nt2=N))
= (Pr(nt3=P | nt2=N) / Pr(nt3=P)) * (Pr(nt2=N | nt1=M) / Pr(nt2=N))
```
That is, when `MNP` consists of over-represented dinucleotides `MN` and `NP`, the trinucleotide will be over-represented, and likewise for under-represented pairs of dinucleotides. Of course, the pairs must share the common nucleotide `N` in the middle.
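The product rule above can be sketched in code as follows. The probabilities are hypothetical, so the trinucleotides this toy table singles out say nothing about which ones stand out in yeast. The names `enrichment` and `tri_enrichment` are illustrative.

```python
nucleotides = "ACGT"

# Hypothetical probabilities for illustration only (not measured from yeast)
nt_probs = {"A": 0.30, "C": 0.20, "G": 0.20, "T": 0.30}
dinuc_probs = {
    "AA": 0.11, "AC": 0.06, "AG": 0.06, "AT": 0.07,
    "CA": 0.06, "CC": 0.05, "CG": 0.03, "CT": 0.06,
    "GA": 0.06, "GC": 0.03, "GG": 0.05, "GT": 0.06,
    "TA": 0.07, "TC": 0.06, "TG": 0.06, "TT": 0.11,
}

# Dinucleotide over- or under-representation relative to independence
enrichment = {}
for nt1 in nucleotides:
    for nt2 in nucleotides:
        enrichment[nt1 + nt2] = dinuc_probs[nt1 + nt2] / (nt_probs[nt1] * nt_probs[nt2])

# Predicted trinucleotide ratio: product of the two overlapping dinucleotide ratios
tri_enrichment = {}
for nt1 in nucleotides:
    for nt2 in nucleotides:
        for nt3 in nucleotides:
            tri_enrichment[nt1 + nt2 + nt3] = enrichment[nt1 + nt2] * enrichment[nt2 + nt3]

most_over = max(tri_enrichment, key=tri_enrichment.get)
most_under = min(tri_enrichment, key=tri_enrichment.get)
print("Most over-represented (toy data):", most_over, tri_enrichment[most_over])
print("Most under-represented (toy data):", most_under, tri_enrichment[most_under])
```

These are predictions under the nearest-neighbor assumption only. Comparing them with trinucleotide frequencies counted directly from a sequence would show how well that assumption holds.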

Based only on dinucleotide nearest-neighbor dependencies above, what trinucleotide sequence do you expect to be highly over-represented? Highly under-represented?