# Conditional probabilities

We're going to analyze conditional probabilities of neighboring nucleotides in order to look for patterns in the yeast genome. We'll start by setting up Biopython and re-running our single-nucleotide counting analysis.

In [1]:
!pip3 install biopython

Collecting biopython
  Using cached https://files.pythonhosted.org/packages/ed/77/de3ba8f3d3015455f5df859c082729198ee6732deaeb4b87b9cfbfbaafe3/biopython-1.74-cp36-cp36m-manylinux1_x86_64.whl
Installing collected packages: biopython
Successfully installed biopython-1.74


Then, we need to import the `SeqIO` module, which has functions for reading Fasta-format files.

In [2]:
from Bio import SeqIO

# Count occurrences of nucleotides in the yeast genome
nt_counts = {}
for record in SeqIO.parse("../S288C_R64-2-1/S288C_reference_sequence_R64-2-1_20150113.fsa", "fasta"):
    seq = str(record.seq)
    for position in range(0, len(record.seq)):
        nt = seq[position]
        nt_counts[nt] = nt_counts.get(nt, 0) + 1
print("Nucleotide counts:", nt_counts)

# Sum up total number of nucleotides
nt_total = sum(nt_counts.values())
print("Total nucleotides:", nt_total)

# Divide counts by nucleotide totals to get probabilities
nt_probs = {}
for nt in nt_counts:
    nt_probs[nt] = nt_counts[nt] / nt_total
print("Nucleotide probabilities:", nt_probs)

print("Total nucleotide probabilities:", sum(nt_probs.values()))

Nucleotide counts: {'C': 2320576, 'A': 3766349, 'T': 3753080, 'G': 2317100}
Total nucleotides: 12157105
Nucleotide probabilities: {'C': 0.19088228653120953, 'A': 0.309806405390099, 'T': 0.3087149448820258, 'G': 0.19059636319666565}
Total nucleotide probabilities: 0.9999999999999999
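
The same counting pattern can be written more compactly with `collections.Counter` from the standard library. This is only a sketch on a made-up toy sequence (`toy_seq` is an assumption for illustration, not part of the yeast data), shown as an alternative to the explicit loop above.

```python
from collections import Counter

# Made-up toy sequence standing in for one chromosome (illustration only)
toy_seq = "ACGTACGGTA"

# Counter tallies every character of the string in one pass
toy_counts = Counter(toy_seq)
toy_total = sum(toy_counts.values())

# Divide each count by the total to get probabilities
toy_probs = {nt: count / toy_total for nt, count in toy_counts.items()}

print("Toy counts:", dict(toy_counts))
print("Toy probabilities:", toy_probs)
```

For a multi-record file, one option would be to call `toy_counts.update(str(record.seq))` inside the `SeqIO.parse` loop, since `Counter.update` accepts any iterable of items to tally.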


# Conditional Probabilities

Next, we want to investigate the _dependence_ of adjacent nucleotides in the yeast genome. In probabilistic terms, we want to learn the _conditional probabilities_ of, for instance, seeing a `C` when the previous nucleotide was an `A`, and compare this to the unconditional probability of seeing a `C`.

In this case, the easiest way to find these _conditional_ probabilities is to find the _joint_ probability of each dinucleotide in the genome and then use the formula for conditional probability.

Since we're talking about neighboring nucleotides, the joint probability is the probability of a given dinucleotide (a pair of adjacent nucleotides).

In [3]:
dint_counts = {}
for record in SeqIO.parse("../S288C_R64-2-1/S288C_reference_sequence_R64-2-1_20150113.fsa", "fasta"):
    seq = str(record.seq)
    for position in range(0, len(seq) - 1):
        dint = seq[position:position+2]
        dint_counts[dint] = dint_counts.get(dint, 0) + 1
print(dint_counts)

{'CC': 473632, 'CA': 786382, 'AC': 639445, 'AT': 1097112, 'TC': 754106, 'CT': 705263, 'TA': 902972, 'AA': 1320952, 'AG': 708838, 'GC': 453382, 'TG': 782421, 'GG': 470541, 'GT': 637132, 'TT': 1313572, 'CG': 355299, 'GA': 756039}


Next, we'll convert these counts into dinucleotide probabilities.

In [7]:
dint_total = sum(dint_counts.values())
print(dint_total)
dint_probs = {}
for dint,counts in dint_counts.items():
    dint_probs[dint] = counts / dint_total
print(dint_probs)
# OR EQUIVALENTLY
dint_probs_alt = {}
for dint in dint_counts:
    dint_probs_alt[dint] = dint_counts[dint] / dint_total
print(dint_probs_alt)

12157088
{'CC': 0.03895932973422583, 'CA': 0.06468506273870848, 'AC': 0.05259853346459284, 'AT': 0.09024463753161942, 'TC': 0.062030150641337795, 'CT': 0.05801249443945787, 'TA': 0.07427535278185039, 'AA': 0.10865694153073499, 'AG': 0.05830656157132366, 'GC': 0.03729363479148954, 'TG': 0.06435924458225523, 'GG': 0.03870507476790495, 'GT': 0.05240827408668918, 'TT': 0.10804988826271554, 'CG': 0.029225666541198025, 'GA': 0.06218915253389627}
{'CC': 0.03895932973422583, 'CA': 0.06468506273870848, 'AC': 0.05259853346459284, 'AT': 0.09024463753161942, 'TC': 0.062030150641337795, 'CT': 0.05801249443945787, 'TA': 0.07427535278185039, 'AA': 0.10865694153073499, 'AG': 0.05830656157132366, 'GC': 0.03729363479148954, 'TG': 0.06435924458225523, 'GG': 0.03870507476790495, 'GT': 0.05240827408668918, 'TT': 0.10804988826271554, 'CG': 0.029225666541198025, 'GA': 0.06218915253389627}


Note that some of these dinucleotides are related by symmetry. Our Fasta file contains the sequence of one DNA strand, and that choice is arbitrary, so for instance every 5'-AG-3' on one strand corresponds to a 5'-CT-3' on the opposite strand. For this reason, the numbers of `AG` and `CT` dinucleotides are almost the same. In some cases, like 5'-AT-3', the dinucleotide is its own reverse complement and so it doesn't have a dinucleotide "partner". There are six symmetric dinucleotide pairs and four individual dinucleotides.
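
The pairing described here can be checked with a few lines of plain Python. The sketch below hard-codes the dinucleotide counts printed above; the helper `reverse_complement` is a hypothetical name introduced for this illustration, not a Biopython function.

```python
# Watson-Crick complements
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}

def reverse_complement(seq):
    """Return the reverse complement of a DNA string (illustrative helper)."""
    return "".join(COMPLEMENT[nt] for nt in reversed(seq))

# Dinucleotide counts copied from the output above
dint_counts = {
    'CC': 473632, 'CA': 786382, 'AC': 639445, 'AT': 1097112,
    'TC': 754106, 'CT': 705263, 'TA': 902972, 'AA': 1320952,
    'AG': 708838, 'GC': 453382, 'TG': 782421, 'GG': 470541,
    'GT': 637132, 'TT': 1313572, 'CG': 355299, 'GA': 756039,
}

self_complementary = []
partner_pairs = []
for dint in sorted(dint_counts):
    rc = reverse_complement(dint)
    if rc == dint:
        self_complementary.append(dint)
    elif dint < rc:
        # Record each pair once, keyed by the alphabetically first member
        partner_pairs.append((dint, rc))

print("Self-complementary:", self_complementary)
for dint, rc in partner_pairs:
    print(dint, dint_counts[dint], rc, dint_counts[rc])
```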

Recall that the formula for conditional probability is
```
Pr(event | condition) = Pr(event AND condition) / Pr(condition)
```
To look at the conditional dependence of neighboring nucleotides, the _event_ is the second nucleotide and the _condition_ is the first nucleotide. Thus, `Pr(event AND condition)` is the dinucleotide probability, and `Pr(condition)` is the single-nucleotide probability we computed above.

In order to compute these conditional probabilities for neighboring nucleotides, we'll nest two `for` loops, one for each possible first nucleotide, and one for each possible second nucleotide. We can create a dinucleotide sequence from two individual nucleotide sequences by "adding together" strings.

In [9]:
for nt1 in "ACGT":
    for nt2 in "ACGT":
        print(nt1, nt2, nt1+nt2)

A A AA
A C AC
A G AG
A T AT
C A CA
C C CC
C G CG
C T CT
G A GA
G C GC
G G GG
G T GT
T A TA
T C TC
T G TG
T T TT


We can use this to build a dictionary of conditional nucleotide probabilities. As we build this table, we can also print out the relevant probabilities.

In [11]:
dint_conds = {}
for nt1 in "ACGT":
    for nt2 in "ACGT":
        dint = nt1 + nt2
        joint = dint_probs[dint]
        marginal = nt_probs[nt1]
        conditional = joint / marginal
        print(nt1, nt2, dint, joint, marginal, conditional)
        dint_conds[dint] = conditional
print(dint_conds)

A A AA 0.10865694153073499 0.309806405390099 0.3507252905049442
A C AC 0.05259853346459284 0.309806405390099 0.16977871518944979
A G AG 0.05830656157132366 0.309806405390099 0.18820321515917585
A T AT 0.09024463753161942 0.309806405390099 0.29129364648863876
C A CA 0.06468506273870848 0.19088228653120953 0.3388740983471632
C C CC 0.03895932973422583 0.19088228653120953 0.20410133618058854
C G CG 0.029225666541198025 0.19088228653120953 0.1531083217426756
C T CT 0.05801249443945787 0.19088228653120953 0.3039176420907591
G A GA 0.06218915253389627 0.19059636319666565 0.3262871939992202
G C GC 0.03729363479148954 0.19059636319666565 0.1956681343022707
G G GG 0.03870507476790495 0.19059636319666565 0.20307352206908252
G T GT 0.05240827408668918 0.19059636319666565 0.274969958543291
T A TA 0.07427535278185039 0.3087149448820258 0.24059526114044927
T C TC 0.062030150641337795 0.3087149448820258 0.2009301838789903
T G TG 0.06435924458225523 0.3087149448820258 0.20847466457074135
T T TT 0.10804988826271554 0.3087149448820258 0.3499988907372346
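
One quick sanity check on a table like this: for each fixed first nucleotide, the four conditional probabilities should sum to approximately 1. The sketch below recomputes them from the counts printed earlier in this notebook. The sums are not expected to be exactly 1, because the last nucleotide of each sequence record is counted in `nt_counts` but never starts a dinucleotide.

```python
# Counts copied from the outputs above
nt_counts = {'C': 2320576, 'A': 3766349, 'T': 3753080, 'G': 2317100}
dint_counts = {
    'CC': 473632, 'CA': 786382, 'AC': 639445, 'AT': 1097112,
    'TC': 754106, 'CT': 705263, 'TA': 902972, 'AA': 1320952,
    'AG': 708838, 'GC': 453382, 'TG': 782421, 'GG': 470541,
    'GT': 637132, 'TT': 1313572, 'CG': 355299, 'GA': 756039,
}
nt_total = sum(nt_counts.values())
dint_total = sum(dint_counts.values())

# For each first nucleotide, add up Pr(nt2 | nt1) over all four nt2
row_sums = {}
for nt1 in "ACGT":
    marginal = nt_counts[nt1] / nt_total
    total = 0.0
    for nt2 in "ACGT":
        joint = dint_counts[nt1 + nt2] / dint_total
        total += joint / marginal
    row_sums[nt1] = total
    print(nt1, total)
```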

You should notice that the _conditional_ probabilities of different values for `nt2` don't exactly match the _unconditional_ probabilities for that nucleotide. This means that adjacent nucleotides in the genome are not _independent_. We can look at how strong this dependence is by asking either
* How different is the conditional probability `Pr(nt2=N|nt1=M)` from the marginal (unconditional) probability `Pr(nt2=N)`?
* How different is the _actual_ dinucleotide probability `Pr(nt1=M AND nt2=N)` from the probability that we would expect under independence, i.e., `Pr(nt1=M) * Pr(nt2=N)`?

It turns out that these are equivalent mathematically:
```
   Pr(nt1=M AND nt2=N) / (Pr(nt1=M)  * Pr(nt2=N))
= (Pr(nt1=M AND nt2=N) /  Pr(nt1=M)) / Pr(nt2=N)
=  Pr(nt2=N | nt1=M)                 / Pr(nt2=N)
```
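
The identity above is purely algebraic, so it holds for any set of probabilities. As a small numerical check, the sketch below computes both forms of the ratio on a made-up toy sequence (`toy_seq` and the `toy_` names are assumptions for illustration) and compares them with `math.isclose`.

```python
import math
from collections import Counter

# Made-up toy sequence (illustration only)
toy_seq = "AACGTTACGATTAACG"

toy_nt_counts = Counter(toy_seq)
toy_dint_counts = Counter(toy_seq[i:i + 2] for i in range(len(toy_seq) - 1))
toy_nt_total = sum(toy_nt_counts.values())
toy_dint_total = sum(toy_dint_counts.values())

toy_ratios = {}
for dint, count in sorted(toy_dint_counts.items()):
    nt1, nt2 = dint
    joint = count / toy_dint_total
    p1 = toy_nt_counts[nt1] / toy_nt_total
    p2 = toy_nt_counts[nt2] / toy_nt_total
    # Form 1: observed joint probability vs. expectation under independence
    ratio_joint = joint / (p1 * p2)
    # Form 2: conditional probability vs. unconditional probability
    ratio_cond = (joint / p1) / p2
    toy_ratios[dint] = (ratio_joint, ratio_cond)
    print(dint, ratio_joint, ratio_cond, math.isclose(ratio_joint, ratio_cond))
```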

In [12]:
for nt1 in "ACGT":
    for nt2 in "ACGT":
        dint = nt1 + nt2
        conditional = dint_conds[dint]
        unconditional = nt_probs[nt2]
        ratio = conditional / unconditional
        print(dint, conditional, unconditional, ratio)

AA 0.3507252905049442 0.309806405390099 1.1320788866948097
AC 0.16977871518944979 0.19088228653120953 0.8894419606697802
AG 0.18820321515917585 0.19059636319666565 0.9874438945352779
AT 0.29129364648863876 0.3087149448820258 0.9435683348597053
CA 0.3388740983471632 0.309806405390099 1.0938253452844624
CC 0.20410133618058854 0.19088228653120953 1.0692523643214933
CG 0.1531083217426756 0.19059636319666565 0.803311874239131
CT 0.3039176420907591 0.3087149448820258 0.9844604128475221
GA 0.3262871939992202 0.309806405390099 1.0531970557173247
GC 0.1956681343022707 0.19088228653120953 1.0250722466606594
GG 0.20307352206908252 0.19059636319666565 1.0654637825357791
GT 0.274969958543291 0.3087149448820258 0.8906920869942648
TA 0.24059526114044927 0.309806405390099 0.776598730544318
TC 0.2009301838789903 0.19088228653120953 1.0526392340031925
TG 0.20847466457074135 0.19059636319666565 1.0938019019577414
TT 0.3499988907372346 0.3087149448820258 1.133728368320443


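You should see clear dependencies between adjacent nucleotides: `AA` and `TT` occur about 13% more often than expected under independence, while `TA` and `CG` occur roughly 20% less often.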

Because the identity of a nucleotide at position *i* affects what nucleotide shows up at *i+1*, and the nucleotide at *i+1* in turn affects what happens at position *i+2*, you can see longer-range correlations in nucleotide sequences even with purely nearest-neighbor interactions.
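
To illustrate, here is a sketch that treats the conditional probabilities as a one-step transition table and chains two steps together, giving `Pr(nt at i+2 | nt at i)` under the assumption of purely nearest-neighbor dependence. The numbers are the conditional and single-nucleotide probabilities from the outputs above, rounded to four decimals; the names `one_step`, `two_step`, and `marginal` are introduced only for this example.

```python
NTS = "ACGT"

# Pr(nt2 | nt1): rows are nt1, columns are nt2 in the order A, C, G, T
# (values rounded from the conditional probabilities above)
one_step = {
    "A": [0.3507, 0.1698, 0.1882, 0.2913],
    "C": [0.3389, 0.2041, 0.1531, 0.3039],
    "G": [0.3263, 0.1957, 0.2031, 0.2750],
    "T": [0.2406, 0.2009, 0.2085, 0.3500],
}
# Unconditional single-nucleotide probabilities, rounded
marginal = {"A": 0.3098, "C": 0.1909, "G": 0.1906, "T": 0.3087}

# Sum over the middle nucleotide: Pr(nt3 | nt1) = sum_N Pr(nt3 | N) * Pr(N | nt1)
two_step = {}
for nt1 in NTS:
    two_step[nt1] = []
    for j, nt3 in enumerate(NTS):
        p = sum(one_step[nt1][k] * one_step[nt2][j] for k, nt2 in enumerate(NTS))
        two_step[nt1].append(p)

for nt1 in NTS:
    for j, nt3 in enumerate(NTS):
        print(nt1, nt3, one_step[nt1][j], two_step[nt1][j], marginal[nt3])
```

In this sketch the two-step probabilities sit much closer to the unconditional probabilities than the one-step ones do, but they are not identical to them, which is the indirect correlation described above.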

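In reality, genomes have many different nearest-neighbor and longer-range interactions, but we won't worry about those today. If we assume only nearest-neighbor dependence, the joint probability of a trinucleotide `MNP` can be broken down as follows: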

```
  Pr(nt3=P AND nt2=N AND nt1=M)
= Pr(nt3=P |   nt2=N AND nt1=M) * Pr(nt2=N AND nt1=M)
= Pr(nt3=P |   nt2=N)           * Pr(nt2=N AND nt1=M)
```
This last step relies on our assumption for the moment that there aren't any longer-range direct correlations, and so we can ignore the identity of `nt1` once we know that `nt2=N`.
```
= Pr(nt3=P | nt2=N) * Pr(nt2=N | nt1=M) * Pr(nt1=M)
```

Now we can use this to calculate the over- or under-representation of trinucleotide sequences:
```
   Pr(nt3=P AND nt2=N AND nt1=M)                     / (Pr(nt3=P) * Pr(nt2=N) * Pr(nt1=M))
=  Pr(nt3=P | nt2=N) * Pr(nt2=N | nt1=M) * Pr(nt1=M) / (Pr(nt3=P) * Pr(nt2=N) * Pr(nt1=M))
=  Pr(nt3=P | nt2=N) * Pr(nt2=N | nt1=M)             / (Pr(nt3=P) * Pr(nt2=N))
= (Pr(nt3=P | nt2=N) / Pr(nt3=P)) * (Pr(nt2=N | nt1=M) / Pr(nt2=N))
```
That is, when `MNP` consists of over-represented dinucleotides `MN` and `NP`, the trinucleotide will be over-represented, and likewise for under-represented pairs of dinucleotides. Of course, the pairs must share the common nucleotide `N` in the middle.
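
As a rough sketch, the last line of the derivation can be turned into a small function that multiplies the two overlapping dinucleotide ratios. The ratios below are the values computed above, rounded to four decimals, and `predicted_trint_ratio` is a hypothetical helper name. Its output is a prediction under the nearest-neighbor assumption, not a measured trinucleotide frequency.

```python
# Dinucleotide over/under-representation ratios from above, rounded
dint_ratios = {
    "AA": 1.1321, "AC": 0.8894, "AG": 0.9874, "AT": 0.9436,
    "CA": 1.0938, "CC": 1.0693, "CG": 0.8033, "CT": 0.9845,
    "GA": 1.0532, "GC": 1.0251, "GG": 1.0655, "GT": 0.8907,
    "TA": 0.7766, "TC": 1.0526, "TG": 1.0938, "TT": 1.1337,
}

def predicted_trint_ratio(trint):
    """Predicted ratio for trinucleotide MNP: ratio(MN) * ratio(NP)."""
    first = trint[0:2]   # dinucleotide MN
    second = trint[1:3]  # dinucleotide NP, sharing the middle nucleotide N
    return dint_ratios[first] * dint_ratios[second]

for trint in ["AAA", "ACG"]:
    print(trint, predicted_trint_ratio(trint))
```

Looping this function over all 64 trinucleotides would be one way to explore the question that follows.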

Based only on dinucleotide nearest-neighbor dependencies above, what trinucleotide sequence do you expect to be highly over-represented? Highly under-represented?

The next cell shows how the step argument of `range` changes the slicing: with the default step of 1, the two-character windows overlap (as in our dinucleotide count), while a step of 2 gives non-overlapping windows.

In [13]:
x = "ABCDEFGHIJ"
for position in range(0,len(x)-1):
    y = x[position:position+2]
    print(position, y)
for position in range(0,len(x)-1,2):
    y = x[position:position+2]
    print(position, y)

0 AB
1 BC
2 CD
3 DE
4 EF
5 FG
6 GH
7 HI
8 IJ
0 AB
2 CD
4 EF
6 GH
8 IJ
