<h1 id="toctitle">Dictionaries</h1>
<ul id="toc"/>

##Introducing paired data

Say we want to count all the A's in a DNA sequence:

In [None]:
dna = "ATCGATCGATCGTACGCTGA"
a_count = dna.count("A")
a_count

That was pretty straightforward. How about all four bases:

In [None]:
dna = "ATCGTATCGATGTACGCTGA"
a_count = dna.count("A")
t_count = dna.count("T")
g_count = dna.count("G")
c_count = dna.count("C")
print(a_count, t_count, g_count, c_count)
g_count

This is getting repetitive. How about dinucleotides (16 variables):

```python
aa_count = dna.count("AA")
at_count = dna.count("AT")
ag_count = dna.count("AG")
...
```

or trinucleotides (64 variables):

```python
aaa_count = dna.count("AAA")
aat_count = dna.count("AAT")
aag_count = dna.count("AAG")
```

We could use a list to store these counts:

In [None]:
dna = "ATCGTATCGATGTACGCTGA"
dinucleotides = ['AA','AT','AG','AC',
                 'TA','TT','TG','TC',
                 'GA','GT','GG','GC',
                 'CA','CT','CG','CC']
all_counts = []
for dinucleotide in dinucleotides:
    count = dna.count(dinucleotide)
    all_counts.append(str(count) + " : " + dinucleotide)
print(all_counts)

But you can see the problem: once they're stored in the list, there's no easy way to look up the count for a given dinucleotide. There's no longer any connection between the dinucleotides and the counts.
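
To make the problem concrete, here is a minimal sketch of what a lookup would take even if we stored each dinucleotide and its count together as a tuple in a list. The `lookup_count` helper is hypothetical, written only for this illustration, and the list of dinucleotides is shortened for brevity:

```python
dna = "ATCGTATCGATGTACGCTGA"
dinucleotides = ['AT', 'TC', 'CG', 'GA']  # a short subset, for brevity

# store each dinucleotide alongside its count as a tuple
pairs = []
for dinucleotide in dinucleotides:
    pairs.append((dinucleotide, dna.count(dinucleotide)))

def lookup_count(pairs, wanted):
    # hypothetical helper: scan the list until we find a match
    for dinucleotide, count in pairs:
        if dinucleotide == wanted:
            return count
    return None  # the dinucleotide is not in the list

cg_count = lookup_count(pairs, 'CG')
print(cg_count)
```

Every lookup has to walk through the list from the start, which is the work we would like the data structure to do for us.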

This is an example of paired data, also called key/value data:

| keys | values |
|------|--------|
|dinucleotide|count|
|name|protein sequence|
|codon|amino acid residue|
|sample|coordinates|
|word|definition|

Python's data structure for storing this type of data is a __dict__ (short for dictionary).

##Creating dicts

###Literal dicts

To make a dict:

- start and end with curly brackets
- separate keys and values with colons
- separate each pair (item) with a comma

In [None]:
enzymes = {
    'EcoRI' : 'GAATTC',
    'AvaII' : 'GGACC',
    'BisI' : 'GCNGC'
}

We often write dicts across multiple lines to keep them readable. Getting a single value is similar to indexing a list, but instead of giving the index, we give the key for the value we want:

In [None]:
motif = enzymes['BisI']
print(motif)

###Building up a dict

We can create an empty dict, and add items to it one at a time:

In [None]:
# create an empty dict
enzymes = {}

# add one key/value pair at a time
enzymes['EcoRI'] = 'GAATTC'
enzymes['AvaII'] = 'GGACC'
enzymes['BisI'] = 'GCNGC'
#print(enzymes)
print(enzymes['EcoRI'])

The thing that goes inside the square brackets is always the key, whether we are setting a value or retrieving a value. 
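
One consequence is worth a small sketch: because each key can appear only once, assigning to a key that already exists replaces its value instead of adding a second item. The replacement string below is just an illustrative value chosen for this example:

```python
enzymes = {}
enzymes['EcoRI'] = 'GAATTC'
enzymes['AvaII'] = 'GGACC'
print(len(enzymes))        # two items so far

# assigning to an existing key replaces the old value
enzymes['AvaII'] = 'GGWCC'  # illustrative replacement value
print(enzymes['AvaII'])
print(len(enzymes))        # still two items
```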

How does this help us with our dinucleotides problem?

##Counting dinucleotides with a dict

Here's how we store the counts in a dict. We start with an empty dict, and add one key/value pair for each dinucleotide:

In [None]:
dna = "AATGATGAACGAC"
dinucleotides = ['AA','AT','AG','AC',
                 'TA','TT','TG','TC',
                 'GA','GT','GG','GC',
                 'CA','CT','CG','CC']

all_counts = {}
for dinucleotide in dinucleotides:
    count = dna.count(dinucleotide)
    print("count is " + str(count) + " for " + dinucleotide)
    all_counts[dinucleotide] = count

print(all_counts)

Notice that, although it's bigger than our previous examples, the `all_counts` dict has the same key/value structure.

We can now look up the count (value) for a particular dinucleotide (key) very easily:

In [None]:
all_counts['GA']

###Removing zero counts

Problem: many of the counts are zero (and for 3-mers, 4-mers, etc. nearly all the counts will be zero). Solution: store only the counts that are greater than zero:

In [30]:
dna = "AATGATCGATCGTACGCTGA"
counts = {}

for base1 in ['A', 'T', 'G', 'C']:
    for base2 in ['A', 'T', 'G', 'C']:
        dinucleotide = base1 + base2
        count = dna.count(dinucleotide)
        if count > 0:
            counts[dinucleotide] = count
print(counts)

{'AA': 1, 'AC': 1, 'GT': 1, 'CG': 3, 'GC': 1, 'AT': 3, 'GA': 3, 'TG': 2, 'CT': 1, 'TC': 2, 'TA': 1}
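
The nested loops need one extra level for every extra base. As a sketch of one way to generalize to longer k-mers, the standard library's `itertools.product` can generate every combination of bases for any length. The `count_kmers` function name is an assumption made for this example, not part of the original code:

```python
from itertools import product

def count_kmers(dna, k):
    """Return a dict holding only the k-mers that occur in dna."""
    counts = {}
    for bases in product('ATGC', repeat=k):
        kmer = ''.join(bases)          # e.g. ('A', 'T') becomes 'AT'
        count = dna.count(kmer)
        if count > 0:
            counts[kmer] = count
    return counts

dna = "AATGATCGATCGTACGCTGA"
dinuc_counts = count_kmers(dna, 2)
trinuc_counts = count_kmers(dna, 3)
print(len(dinuc_counts))
print(trinuc_counts['ATC'])
```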


Now we are storing only the positive counts. This can lead to trouble when we look up the count for a dinucleotide that doesn't occur in the sequence. The first lookup below works, but the second raises a `KeyError`:

In [None]:
counts['AA']

In [None]:
counts['AG']
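
As a minimal sketch of how to guard against a missing key, the code below shows two standard approaches: testing with `in` before the lookup, and catching the `KeyError` that a failed lookup raises. It uses a small hand-written dict rather than the full `counts`:

```python
counts = {'AA': 1, 'AT': 3, 'GA': 3}   # a small subset, for brevity

# option 1: test for the key before looking it up
if 'AG' in counts:
    ag_count = counts['AG']
else:
    ag_count = 0

# option 2: attempt the lookup and catch the error
try:
    ag_count_2 = counts['AG']
except KeyError:
    ag_count_2 = 0

print(ag_count, ag_count_2)
```

Both work, but each takes several lines for what is conceptually a single lookup.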

The `get()` method lets us specify a default for when the key isn't found:


In [38]:
#print(counts.get('AA', 0))

print(counts.get('NG', 0))

0
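
`get()` with a default also enables a common counting idiom, sketched here as an alternative to the approach above: slide a window along the sequence and add one to whatever count is stored so far. Note that this counts overlapping occurrences, whereas `str.count()` does not, so the two approaches can disagree on sequences with repeats such as 'AAA'. The `window_counts` name is a choice made for this example:

```python
dna = "AATGATCGATCGTACGCTGA"
window_counts = {}

# look at every position where a dinucleotide can start
for start in range(len(dna) - 1):
    dinucleotide = dna[start:start + 2]
    # a missing key counts as 0, so there is no need to fill in zeros first
    window_counts[dinucleotide] = window_counts.get(dinucleotide, 0) + 1

print(window_counts['CG'])
print(window_counts.get('AG', 0))
```

This version needs no list of possible dinucleotides up front, because keys are created only as they are encountered.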


##Looping with dicts

The `keys()` method returns all the keys in a dict (as a list in Python 2, or as a view object that can be looped over in the same way in Python 3):

In [39]:
counts.keys()

['AA', 'AC', 'GT', 'CG', 'GC', 'AT', 'GA', 'TG', 'CT', 'TC', 'TA']

so we can easily answer questions that require us to look at every pair. For example, which dinucleotides occur exactly twice in the sequence?

In [41]:
for dinucleotide in counts.keys():
    dinuc_count = counts.get(dinucleotide)
    if dinuc_count == 2:
        print(dinucleotide)

TG
TC
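
The order in which a loop visits the keys depends on the Python version, so the output order above should not be relied on. A small sketch, assuming we want alphabetical order, wraps the keys in the built-in `sorted()`. The `twice` variable name and the shortened dict are choices made for this example:

```python
counts = {'TG': 2, 'AA': 1, 'TC': 2, 'CG': 3}   # a small subset, for brevity

twice = []
# sorted() returns a new list of the keys in alphabetical order
for dinucleotide in sorted(counts.keys()):
    if counts[dinucleotide] == 2:
        twice.append(dinucleotide)

print(twice)
```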


###Looping over pairs

This pattern of iterating over keys and looking up values is common, so there's a shortcut:

In [43]:
for dinucleotide, count in counts.items():
    if count == 2:
        print(dinucleotide)

TG
TC
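
As a further sketch of what `items()` makes easy, the pairs can be sorted by count to list the most frequent dinucleotides first. The `by_count` name, the sort key, and the shortened dict are all choices made for this example:

```python
counts = {'AA': 1, 'AT': 3, 'TG': 2, 'TC': 2, 'CG': 3}   # a small subset

# sort the pairs by count (largest first), breaking ties alphabetically
by_count = sorted(counts.items(), key=lambda pair: (-pair[1], pair[0]))

for dinucleotide, count in by_count:
    print(dinucleotide, count)
```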


###Lookup vs. iteration

Remember, we don't need to write a loop if we just want to get a single value. If we are looking for the count for 'AT' then we __don't__ need to do this:

In [None]:
for dinucleotide, count in counts.items():
    if dinucleotide == 'AT':
        print(count)

We can just ask for the value directly:

In [None]:
print(counts.get('AT'))

##Exercises

###DNA translation 

Here's a dict that stores the standard genetic code, with stop codons shown as '_':

In [None]:
gencode = {
'ATA':'I', 'ATC':'I', 'ATT':'I', 'ATG':'M',
'ACA':'T', 'ACC':'T', 'ACG':'T', 'ACT':'T',
'AAC':'N', 'AAT':'N', 'AAA':'K', 'AAG':'K',
'AGC':'S', 'AGT':'S', 'AGA':'R', 'AGG':'R',
'CTA':'L', 'CTC':'L', 'CTG':'L', 'CTT':'L',
'CCA':'P', 'CCC':'P', 'CCG':'P', 'CCT':'P',
'CAC':'H', 'CAT':'H', 'CAA':'Q', 'CAG':'Q',
'CGA':'R', 'CGC':'R', 'CGG':'R', 'CGT':'R',
'GTA':'V', 'GTC':'V', 'GTG':'V', 'GTT':'V',
'GCA':'A', 'GCC':'A', 'GCG':'A', 'GCT':'A',
'GAC':'D', 'GAT':'D', 'GAA':'E', 'GAG':'E',
'GGA':'G', 'GGC':'G', 'GGG':'G', 'GGT':'G',
'TCA':'S', 'TCC':'S', 'TCG':'S', 'TCT':'S',
'TTC':'F', 'TTT':'F', 'TTA':'L', 'TTG':'L',
'TAC':'Y', 'TAT':'Y', 'TAA':'_', 'TAG':'_',
'TGC':'C', 'TGT':'C', 'TGA':'_', 'TGG':'W'}

Write a program that will take a DNA sequence and translate it into protein using the translation table.

What happens if the DNA sequence contains undetermined bases (e.g. N)?

Can you generate a translation in all three forward frames? All three reverse frames?
