# Probability As Counting

## Strings

Python has a data type for _strings_. String values are enclosed in quotes -- either single or double quotes can be used. So, to create a string and store it in a variable:

In [1]:
my_name = 'Nick'
print("printing my_name:", my_name)
my_name

printing my_name: Nick


'Nick'

Notice that when we `print(...)` a string, we see the contents of the string. When the output of a cell is a string, we get that string in quotes. This lets us distinguish between the numeric value `1` and the string `'1'`.

In [4]:
print('2' + '2')
print(2+2)
2 + 2

22
4


4
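
To make the difference concrete, the sketch below converts between the two types with the built-in `int()` and `str()` functions. The variable names are made up for illustration.

```python
# Illustrative names only; int() and str() are Python built-ins.
text_two = '2'
number_two = 2

# '+' joins strings but adds numbers.
joined = text_two + text_two      # string concatenation
added = number_two + number_two   # numeric addition

# Converting the value changes which behaviour we get.
converted_sum = int(text_two) + int(text_two)
converted_join = str(number_two) + str(number_two)

print(repr(joined), added, converted_sum, repr(converted_join))
```

Using `repr()` inside `print()` shows the quotes, the same way a cell's output does.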

_MFA1_ is a fairly short budding yeast gene that encodes a-factor, a mating pheromone. The cell below creates a variable named `mfa1` that holds a string containing the nucleotide sequence of the _MFA1_ gene.

Because even this short gene sequence is a long string, it is split across two lines, using a backslash (`\`) to continue the string on the next line.

In [5]:
mfa1='ATGCAACCATCTACCGCTACCGCCGCTCCAAAAGAAAAGACCAGCAGTGAAAAGAAGGAC\
AACTATATTATCAAAGGTGTCTTCTGGGACCCAGCATGTGTTATTGCTTAG'
mfa1

'ATGCAACCATCTACCGCTACCGCCGCTCCAAAAGAAAAGACCAGCAGTGAAAAGAAGGACAACTATATTATCAAAGGTGTCTTCTGGGACCCAGCATGTGTTATTGCTTAG'

In many ways, we can treat a string like a list of characters.

We can use the `len()` function to find the length of this string, which should be a multiple of 3.

In [6]:
len(mfa1)

111

We can pick individual nucleotides out of this string using square brackets to _index_ into the string. Python starts counting positions from 0, and so the first nucleotide is retrieved using `[0]`.

In [7]:
mfa1[0]

'A'

The second nucleotide is retrieved using `[1]`.

In [8]:
mfa1[1]

'T'

**Exercise** Retrieve the _last_ nucleotide of _MFA1_ from the string -- but don't just rely on knowing the length of _MFA1_; compute its length using `len()`.

In [9]:
mfa1[len(mfa1) - 1]

'G'

We can also _slice_ out a range of nucleotides using square brackets. The cell below will _slice_ out the second codon of _MFA1_.

In [10]:
mfa1[3:6]

'CAA'

In the cell above, we use a colon inside the square brackets to slice out a range of positions. The range is written `start:end`, but notice that the slice above runs from 3 to 6 in order to extract the nucleotides at positions 3, 4, 5 and _not_ 6. That is, the `start` is _included_ in the range but the `end` is _excluded_.
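
A small sketch of two consequences of this "start included, end excluded" rule, using a short made-up sequence rather than the real gene:

```python
# A short made-up sequence, used only for illustration.
seq = 'ATGCAACCATCT'

first_codon = seq[0:3]    # positions 0, 1, 2
second_codon = seq[3:6]   # positions 3, 4, 5

# The length of a slice is simply end - start.
slice_length = len(second_codon)
print(slice_length)

# Adjacent slices share a boundary without overlapping or leaving a gap.
tiles_exactly = (first_codon + second_codon == seq[0:6])
print(tiles_exactly)
```

This is why stepping through a sequence three positions at a time yields one codon after another with nothing skipped or repeated.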

**Exercise** Slice out the last codon of _MFA1_, using `len()`. (Hint: the result should be a stop codon)

In [12]:
mfa1[len(mfa1)-3:len(mfa1)]

'TAG'
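
As an aside, Python also accepts negative positions, which count back from the end of the string. The sketch below, on a made-up sequence, shows that they give the same results as the `len()`-based expressions.

```python
# Made-up example sequence that ends in a stop codon.
seq = 'ATGCAACCATCTTAG'

# Negative positions count from the end: -1 is the last character.
last_nt = seq[-1]
last_nt_with_len = seq[len(seq) - 1]

# Leaving out the end of a slice means "up to the end of the string".
last_codon = seq[-3:]
last_codon_with_len = seq[len(seq) - 3:len(seq)]

print(last_nt, last_codon)
```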

## Dictionaries ("Dicts")

Dictionaries are incredibly useful data structures that link keys with values. You can have one dictionary entry per key, and keys can be almost anything -- numbers, strings, tuples, and so forth. It's very fast to look up a key in the dictionary.

We might want a dictionary in order to look up the standard name of a yeast gene given its systematic name. Since we want to quickly find entries according to the systematic name, this is the key. We can create a dictionary using curly braces around `key:value` pairs separated by commas:

In [13]:
yeast_genes = { 'YAL001C': 'TFC3', 'YAL002W': 'VPS8', 'YAL003W': 'EFB1'}
print(yeast_genes)

{'YAL001C': 'TFC3', 'YAL002W': 'VPS8', 'YAL003W': 'EFB1'}


We can look up entries from the dictionary using square brackets, just like looking up entries in a list, except that we give a key instead of a position.

In [15]:
yeast_genes['YAL001C']

'TFC3'

We can also add or replace entries using square brackets:

In [16]:
print(yeast_genes)
yeast_genes['YAL003W'] = 'TEF5'
yeast_genes['YAL005C'] = 'SSA1'
print(yeast_genes)

{'YAL001C': 'TFC3', 'YAL002W': 'VPS8', 'YAL003W': 'EFB1'}
{'YAL001C': 'TFC3', 'YAL002W': 'VPS8', 'YAL003W': 'TEF5', 'YAL005C': 'SSA1'}


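We can use a `for` loop to run through all the keys in a dictionary. Older versions of Python **do not** guarantee anything about the order of the keys in this loop (since Python 3.7, keys come out in the order they were added), so the cell below wraps the dictionary in `sorted()` to visit the keys in alphabetical order.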

In [19]:
for systematic in sorted(yeast_genes):
    print("gene", systematic, "is", yeast_genes[systematic])

gene YAL001C is TFC3
gene YAL002W is VPS8
gene YAL003W is TEF5
gene YAL005C is SSA1
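
The sketch below contrasts the two orderings on a small dictionary filled out of alphabetical order. It assumes Python 3.7 or later, where a dictionary remembers the order in which keys were added.

```python
# Add entries in a deliberately non-alphabetical order.
genes = {}
genes['YAL003W'] = 'EFB1'
genes['YAL001C'] = 'TFC3'
genes['YAL002W'] = 'VPS8'

# Looping over the dictionary itself follows insertion order (Python 3.7+).
insertion_order = list(genes)

# sorted() returns a new list of keys in alphabetical order.
sorted_order = sorted(genes)

print(insertion_order)
print(sorted_order)
```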


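Alternatively, we can use `items()` to get `(key, value)` pairs directly. The second loop in the cell below uses these pairs to build a reversed dictionary, `backwards`, that maps each standard name back to its systematic name.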

In [24]:
for key,value in yeast_genes.items():
    print("gene",key,"is",value)
    
backwards = {}
for systematic,standard in yeast_genes.items():
    backwards[standard] = systematic
print("backwards:",backwards)

gene YAL001C is TFC3
gene YAL002W is VPS8
gene YAL003W is TEF5
gene YAL005C is SSA1
backwards: {'TFC3': 'YAL001C', 'VPS8': 'YAL002W', 'TEF5': 'YAL003W', 'SSA1': 'YAL005C'}
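
The same reversal can be written more compactly as a dictionary comprehension. The sketch below also shows one caveat: if two keys share a value, only one of them survives the reversal. The second dictionary is made up purely to show that case.

```python
yeast_genes = {'YAL001C': 'TFC3', 'YAL002W': 'VPS8', 'YAL003W': 'TEF5'}

# Swap each (key, value) pair to build the reversed dictionary.
backwards = {standard: systematic for systematic, standard in yeast_genes.items()}
print(backwards)

# Caveat: duplicate values collapse into a single key, and the later entry wins.
# This dictionary is hypothetical, made up to show the collision.
with_duplicates = {'first': 'same', 'second': 'same'}
collapsed = {value: key for key, value in with_duplicates.items()}
print(collapsed)
```

Comparing `len()` of the original and reversed dictionaries is a quick way to check whether any entries were lost.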


We can also quickly test whether a dictionary contains a key using `in`, which gives `True` or `False`.

In [29]:
print("YAL001C in yeast_genes?", 'YAL001C' in yeast_genes)
print("YPR204W in yeast_genes?", 'YPR204W' in yeast_genes)
print( 7 > 5 )
if 'YAL002W' in yeast_genes:
    print("Yes it is")
else:
    print("No it is not")

YAL001C in yeast_genes? True
YPR204W in yeast_genes? False
True
Yes it is
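
One reason to test with `in` first is that looking up a missing key with square brackets raises a `KeyError`. The sketch below shows both the guarded lookup and the error. `standard_name` is a hypothetical helper written for this example.

```python
yeast_genes = {'YAL001C': 'TFC3', 'YAL002W': 'VPS8',
               'YAL003W': 'TEF5', 'YAL005C': 'SSA1'}

def standard_name(genes, systematic):
    """Hypothetical helper: return the standard name, or None if the key is absent."""
    if systematic in genes:
        return genes[systematic]
    return None

print(standard_name(yeast_genes, 'YAL002W'))
print(standard_name(yeast_genes, 'YPR204W'))

# Without the guard, square brackets raise KeyError for a missing key.
try:
    yeast_genes['YPR204W']
except KeyError as error:
    missing_key = error.args[0]
    print("no entry for", missing_key)
```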


We can delete entries from a dictionary using `del`.

In [26]:
del yeast_genes['YAL001C']
print(yeast_genes)

{'YAL002W': 'VPS8', 'YAL003W': 'TEF5', 'YAL005C': 'SSA1'}

Before turning to dictionaries for counting, the cell below tallies each nucleotide in `mfa1` the long way, with one counter variable per nucleotide.


In [30]:
a = 0
c = 0
g = 0
t = 0
for nt in mfa1:
    if nt == 'A':
        a = a + 1
    elif nt == 'C':
        c = c + 1
    elif nt == 'G':
        g = g + 1
    elif nt == 'T':
        t = t + 1
print("A", a, "C", c, "G", g, "T", t)    


A 37 C 27 G 23 T 24
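
For comparison, strings have a built-in `count()` method that returns how many times a substring occurs. The sketch below uses it on a short made-up sequence. Like the `if`/`elif` version, it needs one line per nucleotide and knows nothing about characters we did not list.

```python
seq = 'ATGCAACCATCT'   # made-up example sequence

# str.count() returns the number of occurrences of a substring.
a_count = seq.count('A')
c_count = seq.count('C')
g_count = seq.count('G')
t_count = seq.count('T')

print("A", a_count, "C", c_count, "G", g_count, "T", t_count)
```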


## Counting things with dictionaries

Now, we want to count how many times each individual nucleotide appears in the _MFA1_ gene. We'll use a dictionary to keep a tally of how many times we encounter each nucleotide as we run down the length of the `mfa1` gene. In this dictionary, the keys will be nucleotides and the values will be counts.

At each position, we'll look up the current nucleotide count and update it by adding one.

In [31]:
nt_counts = {'A': 3, 'C': 7, 'T': 5}
nt_counts['A'] = nt_counts['A'] + 1
print(nt_counts)

{'A': 4, 'C': 7, 'T': 5}


We'll start with an empty dictionary, though, and so we won't have an entry for a nucleotide the first time we encounter it. Looking up a missing key with square brackets, as in `nt_counts['G'] + 1`, produces an error (a `KeyError`) because there's no entry for `G`.

Instead, we can use the `get()` method to look up a key, or return a default value such as 0 if the key doesn't exist. In the cell below, `get()` allows us to count the first occurrence of `G`, when it isn't in the dictionary yet. Looking up `U`, which is also missing, simply returns 0 without adding an entry.

In [36]:
nt_counts['G'] = nt_counts.get('G', 0) + 1
print(nt_counts)
nt_counts.get('U', 0)

{'A': 4, 'C': 7, 'T': 5, 'G': 1}


0

The same line works when the key is already present. In the cell below, we count a second occurrence of `G` and the eighth occurrence of `C`, in each case by finding the old entry in the dictionary and adding one.

In [37]:
nt_counts['G'] = nt_counts.get('G', 0) + 1
nt_counts['C'] = nt_counts.get('C', 0) + 1
print(nt_counts)

{'A': 4, 'C': 8, 'T': 5, 'G': 2}


Now we can put all of this together in order to:
* create a new, empty dictionary for nucleotide counts
* loop over each position in `mfa1`,
  * index into `mfa1` to find the nucleotide at that position, and
  * increment the count of nucleotides

In [40]:
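# Demonstration: append 'UUU' so the loop meets a nucleotide it has not seen before.
# Note that this changes mfa1 for every cell that follows.
mfa1 = mfa1 + 'UUU'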
nt_counts = {}
for position in range(0, len(mfa1)):
    nt = mfa1[position]
    nt_counts[nt] = nt_counts.get(nt, 0) + 1
print(nt_counts)

{'A': 37, 'T': 24, 'G': 23, 'C': 27, 'U': 3}
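
Tallying items like this is common enough that Python's standard library offers `collections.Counter`, a dictionary subclass built for counting. The sketch below, on a made-up sequence, checks that it gives the same tallies as the `get()` loop. It is shown as an alternative, not as a replacement for understanding the loop.

```python
from collections import Counter

seq = 'ATGCAACCATCT'   # made-up example sequence

# The get()-based tally from the text, looping directly over the characters.
loop_counts = {}
for nt in seq:
    loop_counts[nt] = loop_counts.get(nt, 0) + 1

# Counter does the same tallying in one step.
counter_counts = Counter(seq)

print(dict(counter_counts) == loop_counts)

# A Counter reports 0 for a key it has never seen, instead of raising KeyError.
print(counter_counts['U'])
```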


We can use a similar approach to count the _codons_ in `mfa1`. To do this, we need to make two changes to our nucleotide-counting approach:
1. We need to slice out a whole codon rather than indexing a single position, and
2. We need to loop over codon starting positions (0, 3, 6, 9, ...) and not nucleotide positions. Our `range()` function can go by steps of 3 like this: `range(start, end, 3)`.

**Exercise** Fill in the `...` below in order to count how many times each codon occurs in _MFA1_.

In [47]:
for i in range(0,len(mfa1), 3):
    print(i)

0
3
6
9
12
15
18
21
24
27
30
33
36
39
42
45
48
51
54
57
60
63
66
69
72
75
78
81
84
87
90
93
96
99
102
105
108
111


In [48]:
codon_counts = {}
for position in range(0, len(mfa1), 3):
    codon = mfa1[position:position+3]
    codon_counts[codon] = codon_counts.get(codon, 0) + 1
print(codon_counts)

{'ATG': 1, 'CAA': 1, 'CCA': 3, 'TCT': 1, 'ACC': 3, 'GCT': 3, 'GCC': 1, 'AAA': 2, 'GAA': 2, 'AAG': 3, 'AGC': 1, 'AGT': 1, 'GAC': 2, 'AAC': 1, 'TAT': 1, 'ATT': 2, 'ATC': 1, 'GGT': 1, 'GTC': 1, 'TTC': 1, 'TGG': 1, 'GCA': 1, 'TGT': 1, 'GTT': 1, 'TAG': 1, 'UUU': 1}
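
The loop above assumes the sequence length is a multiple of 3. If it is not, the final slice is shorter than a codon and gets counted as its own entry. The sketch below wraps the counting in a hypothetical helper, `count_codons`, that stops before any trailing partial codon. Ignoring the leftover nucleotides is a design choice made for this example.

```python
def count_codons(seq):
    """Hypothetical helper: tally complete codons, ignoring a trailing partial one."""
    counts = {}
    # Round the length down to the nearest multiple of 3.
    usable_length = len(seq) - len(seq) % 3
    for position in range(0, usable_length, 3):
        codon = seq[position:position + 3]
        counts[codon] = counts.get(codon, 0) + 1
    return counts

# Made-up sequence of 13 nucleotides: four codons plus one leftover 'G'.
example_counts = count_codons('ATGCCACCATAGG')
print(example_counts)
```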


Next, we'll move on to counting nucleotides in the whole yeast genome. We don't want to include the whole genome sequence in this notebook, and so we'll use existing Python tools to read it from a file. First, we need to install the `biopython` package.

In [49]:
!pip install biopython

Collecting biopython
  Using cached https://files.pythonhosted.org/packages/ed/77/de3ba8f3d3015455f5df859c082729198ee6732deaeb4b87b9cfbfbaafe3/biopython-1.74-cp36-cp36m-manylinux1_x86_64.whl
Installing collected packages: biopython
Successfully installed biopython-1.74


Then, we need to import the `SeqIO` module, which has functions for reading Fasta-format files.

In [50]:
from Bio import SeqIO

The `SeqIO` module has a function called `parse()` that reads sequence entries from a Fasta-format file. The Fasta format is pretty simple: each sequence has a name on a line starting with a `>`, followed by the sequence itself. So, a Fasta file might look like:
```
>one
AGCTACGT...
>two
TGACTGCA...
...
```
The `parse()` function will turn each of these into a `SeqRecord`, a custom data type that holds the name and the sequence. You can get the sequence name from `record` using `record.id` and the sequence itself using `record.seq`. This sequence isn't an ordinary Python string -- it's another custom data type, called a `Seq`, but you can convert it into a string using `str(record.seq)`.
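
To make the file format itself concrete, here is a minimal pure-Python sketch that reads Fasta-formatted text into a dictionary of names and sequences. `parse_fasta_text` is a hypothetical function written for this illustration. It is not how Biopython works internally, and `SeqIO.parse()` handles far more cases.

```python
def parse_fasta_text(text):
    """Hypothetical helper: map each sequence name to its sequence string."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue                      # skip blank lines
        if line.startswith('>'):
            fields = line[1:].split()
            name = fields[0] if fields else ''   # name is the first word after '>'
            records[name] = ''
        elif name is not None:
            records[name] += line         # sequences may span several lines
    return records

# Made-up Fasta text with two short entries.
example_text = ">one\nAGCT\nACGT\n>two\nTGAC\n"
records = parse_fasta_text(example_text)
for name, seq in records.items():
    print(name, len(seq), seq)
```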

The cells below first show that `\n` inside a string produces a line break, and then loop through entries from the yeast genome sequence, printing the sequence name, the length of the sequence, and the first 20 nucleotides of the sequence.

In [58]:
my_string = "this has a line break\nhere"
print(my_string)

this has a line break
here


In [59]:
for record in SeqIO.parse("../S288C_R64-2-1/S288C_reference_sequence_R64-2-1_20150113.fsa", "fasta"):
    seq = str(record.seq)
    print(record.id, "\n", len(seq), " ", seq[0:20])

ref|NC_001133| 
 230218   CCACACCACACCCACACACC
ref|NC_001134| 
 813184   AAATAGCCCTCATGTACGTC
ref|NC_001135| 
 316620   CCCACACACCACACCCACAC
ref|NC_001136| 
 1531933   ACACCACACCCACACCACAC
ref|NC_001137| 
 576874   CGTCTCCTCCAAGCCCTGTT
ref|NC_001138| 
 270161   GATCTCGCAAGTGCATTCCT
ref|NC_001139| 
 1090940   CCACACCCACACACACCACA
ref|NC_001140| 
 562643   CCCACACACACCACACCCAC
ref|NC_001141| 
 439888   CACACACACCACACCCACAC
ref|NC_001142| 
 745751   CCCACACACACACCACACCC
ref|NC_001143| 
 666816   CACCACACCCACACACCACA
ref|NC_001144| 
 1078177   CACACACACACACCACCCAC
ref|NC_001145| 
 924431   CCACACACACACCACACCCA
ref|NC_001146| 
 784333   CCGGCTTTCTGACCGAAATT
ref|NC_001147| 
 1091291   ACACCACACCCACACCACAC
ref|NC_001148| 
 948066   AAATAGCCCTCATGTACGTC
ref|NC_001224| 
 85779   TTCATAATTAATTTTTTATA


**Exercise** Fill in the `...` below in order to count the occurrences of each nucleotide in the yeast genome.

In [60]:
nt_counts = {}
for record in SeqIO.parse("../S288C_R64-2-1/S288C_reference_sequence_R64-2-1_20150113.fsa", "fasta"):
    seq = str(record.seq)
    for position in range(0, len(seq)):
        nt = seq[position]
        nt_counts[nt] = nt_counts.get(nt, 0) + 1
print(nt_counts)

{'C': 2320576, 'A': 3766349, 'T': 3753080, 'G': 2317100}


Our `nt_counts` tells us the _frequency_ of each nucleotide, that is, the number of times it occurs in the genome. We now want to use this to calculate the _probability_ of each nucleotide. To do this, we can simply divide by the total number of nucleotides we counted. Although we do know the four possible nucleotide options and we could manually look up each of `A`, `C`, `G`, and `T` in the `nt_counts` dictionary, it's a better practice to do this in a more programmatic way.

The `values()` method returns a kind of list of all the value entries in a dictionary:

In [61]:
print(nt_counts.values())

dict_values([2320576, 3766349, 3753080, 2317100])


We can loop over this list in a `for` loop:

In [62]:
for value in nt_counts.values():
    print(value)

2320576
3766349
3753080
2317100


**Exercise** Complete the for loop below in order to count the total number of nucleotides, which should match the size of the yeast genome.

In [65]:
nt_total = 0
for value in nt_counts.values():
    nt_total = nt_total + value
print(nt_total)

12157105


Since looping over a list and adding up all the values is a pretty common task, Python includes a function, `sum()`, that carries out this task.

In [None]:
print("sum of nt_counts:", sum(nt_counts.values()))
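
A quick sketch, on a small made-up dictionary of counts, confirming that `sum()` and the explicit loop agree:

```python
counts = {'A': 4, 'C': 4, 'G': 1, 'T': 3}   # made-up counts

# Explicit loop, as in the exercise.
manual_total = 0
for value in counts.values():
    manual_total = manual_total + value

# Built-in sum() over the same values.
builtin_total = sum(counts.values())

print(manual_total, builtin_total)
```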

Now we have all the data we need to construct a new dictionary of nucleotide probabilities. Conceptually, we want to loop over each nucleotide and its count in the `nt_counts` dictionary and use these to build a new `nt_probs` dictionary. Dictionaries have another method, `items()`, that produces a kind of list of keys (nucleotides) and values (counts).

In [66]:
print(nt_counts.items())

dict_items([('C', 2320576), ('A', 3766349), ('T', 3753080), ('G', 2317100)])


We can loop over this list in a `for` loop as well. Notice how we capture the two parts of the "item" entry, the nucleotide and the count, using two different loop variables.

In [67]:
for nt, count in nt_counts.items():
    print(nt + "\t" + str(count))

C	2320576
A	3766349
T	3753080
G	2317100


**Exercise** Complete the cell below in order to build and print the dictionary of nucleotide probabilities.

In [68]:
prob_a = nt_counts['A'] / nt_total
print(prob_a)

0.309806405390099


In [70]:
nt_probs = {}
for nt in nt_counts:
    print(nt_probs)
    print(nt)
    print(nt_counts[nt])
    nt_probs[nt] = nt_counts[nt] / nt_total
print(nt_probs)

{}
C
2320576
{'C': 0.19088228653120953}
A
3766349
{'C': 0.19088228653120953, 'A': 0.309806405390099}
T
3753080
{'C': 0.19088228653120953, 'A': 0.309806405390099, 'T': 0.3087149448820258}
G
2317100
{'C': 0.19088228653120953, 'A': 0.309806405390099, 'T': 0.3087149448820258, 'G': 0.19059636319666565}
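
The same calculation can be packaged as a small function. The sketch below uses the genome-wide counts printed earlier. `counts_to_probs` is a hypothetical helper name, and raising an error on empty input is a choice made for this example. It also checks that the probabilities add up to 1, allowing for floating-point rounding.

```python
import math

def counts_to_probs(counts):
    """Hypothetical helper: divide each count by the total of all counts."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("cannot compute probabilities from zero counts")
    return {key: count / total for key, count in counts.items()}

# Nucleotide counts for the yeast genome, copied from the output above.
nt_counts = {'C': 2320576, 'A': 3766349, 'T': 3753080, 'G': 2317100}
nt_probs = counts_to_probs(nt_counts)
print(nt_probs)

# Probabilities over all outcomes should sum to 1, up to rounding error.
print(math.isclose(sum(nt_probs.values()), 1.0))

# The combined probability of G or C is the sum of the two entries.
gc_prob = nt_probs['G'] + nt_probs['C']
print(gc_prob)
```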
