##What do we already know about data structures?

Lists store ordered collections of elements:

In [2]:
t = [1,2,3]
print(t[1])
for e in t:
    print(e+1) 

2
2
3
4


Dicts store key-value pairs as items:

In [4]:
d = {'monday' : 1, 'tuesday' : 2}
for day, number in d.items():
    print(day, number)

('tuesday', 2)
('monday', 1)


###Tuples

Tuples appear to be similar to lists:

In [6]:
t = (4, 5, 6)
print(t[1])
for e in t:
    print(e+1) 

5
5
6
7


until we try to change one of the values:

In [8]:
t[1] = 9

TypeError: 'tuple' object does not support item assignment

####Tuples are immutable; they cannot be changed once created

This lets Python make some time/memory optimizations.

Tuples are useful for storing heterogeneous data (think records / rows from a table / simple objects):

In [11]:
t1 = ('actgctagt', 'ABC123', 1)
t2 = ('ttaggttta', 'XYZ456', 1)
t3 = ('cgcgatcgt', 'HIJ789', 5)

More to say about tuples later...

##Sets

Sets are like lists but with
- no order,
- no duplicate elements, and
- fast lookup

Imagine we have a long list of accession numbers that contains duplicates and we want to do some expensive processing on each:

In [34]:
# cheat by generating random accession numbers
import random
accessions = [random.randrange(10000) for _ in range(100000)]

In [38]:
%%timeit
import time
for acc in accessions:
    #pretend to do some long calculation
    time.sleep(0.00001)

1 loops, best of 3: 6.56 s per loop


To avoid doing the slow processing multiple times for the same accession number, we will keep track of which ones we've processed:

In [39]:
%%timeit
processed = []
for acc in accessions:
    if acc not in processed:
        #pretend to do some long calculation
        time.sleep(0.00001)
        processed.append(acc)

1 loops, best of 3: 7.73 s per loop


but as `processed` gets bigger, it takes longer to evaluate `acc in processed`. One solution: switch to using a dict:

In [40]:
%%timeit
processed = {}
for acc in accessions:
    if acc not in processed:
        #pretend to do some long calculation
        time.sleep(0.00001)
        processed[acc] = 1

1 loops, best of 3: 682 ms per loop


Much quicker, but confusing as we are storing useless values. Switching to a set:

In [42]:
%%timeit
processed = set()
for acc in accessions:
    if acc not in processed:
        #pretend to do some long calculation
        time.sleep(0.00001)
        processed.add(acc)

1 loops, best of 3: 675 ms per loop


Think of sets as either like
- unordered lists with rapid lookup, or
- dicts without values
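
If all we need is to visit each unique accession once, a set can also be built directly from the list. A minimal sketch, using a short made-up `accessions` list in place of the random one above:

```python
# hypothetical, short stand-in for the long list of accession numbers
accessions = [101, 205, 101, 333, 205, 101]

# building a set from a list discards the duplicates
unique_accessions = set(accessions)

processed_count = 0
for acc in unique_accessions:
    # the expensive calculation would go here, once per unique accession
    processed_count += 1

print(len(accessions))   # 6 elements in the original list
print(processed_count)   # 3 unique accessions processed
```

Note that the set does not remember the order of the original list, so this only suits cases where the processing order does not matter.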

##A closer look at lists

Hopefully we are all familiar with the idea of lists of numbers and strings:

In [43]:
[1,2,3,4]
['a', 'b', 'c']

['a', 'b', 'c']

A slightly more exotic idea: we can have lists of File objects:

In [45]:
files = [open("blast_result.txt"), open("sequences.fasta")]
files

[<open file 'blast_result.txt', mode 'r' at 0x7f24f89d6780>,
 <open file 'sequences.fasta', mode 'r' at 0x7f24f89d6810>]

or of regular expression match objects:

In [47]:
import re
regex_list = [ re.search(r'[^ATGC]', 'ACTRGGT'), 
               re.search(r'[^ATGC]', 'ACTYGGT')]
regex_list

[<_sre.SRE_Match at 0x7f24f4a9e440>, <_sre.SRE_Match at 0x7f24f4a9e4a8>]

If we create a list where each element is also a list, we have a two-dimensional list or list-of-lists:

In [49]:
list_of_lists = [[1,2,3],[4,5,6],[7,8,9]]
list_of_lists

[[1, 2, 3], [4, 5, 6], [7, 8, 9]]

More readably:

In [51]:
list_of_lists = [[1,2,3],
                 [4,5,6],
                 [7,8,9]]

Each element is just a normal list:

In [53]:
list_of_lists[1]

[4, 5, 6]

and we can use two brackets to address one of the inner elements:

In [55]:
list_of_lists[1][2]

6

Where might this be useful? Imagine storing a multiple sequence alignment:

In [57]:
aln = [['A', 'T', '-', 'T', 'G'], 
       ['A', 'A', 'T', 'A', 'G'], 
       ['T', '-', 'T', 'T', 'G'], 
       ['A', 'A', '-', 'T', 'A']]  

We could get a single aligned sequence:

In [59]:
aln[2]

['T', '-', 'T', 'T', 'G']

or a single column (don't worry about list comprehensions if you haven't seen them yet):

In [62]:
#get the fourth column
[seq[3] for seq in aln]

['T', 'A', 'T', 'T']
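
For anyone who has not met list comprehensions yet, the same column can be built with an ordinary loop. A sketch of the equivalent code, where `column` is just an illustrative variable name:

```python
aln = [['A', 'T', '-', 'T', 'G'],
       ['A', 'A', 'T', 'A', 'G'],
       ['T', '-', 'T', 'T', 'G'],
       ['A', 'A', '-', 'T', 'A']]

# build the fourth column one aligned sequence at a time
column = []
for seq in aln:
    column.append(seq[3])

print(column)  # ['T', 'A', 'T', 'T']
```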

We can build lists of other things too. Imagine we have a collection of DNA sequence records. We could store this as a list of dicts:

In [65]:
# a list of dicts
records = [
    {'seq' : 'actgctagt', 'accession' : 'ABC123', 'genetic_code' : 1},
    {'seq' : 'ttaggttta', 'accession' : 'XYZ456', 'genetic_code' : 1},
    {'seq' : 'cgcgatcgt', 'accession' : 'HIJ789', 'genetic_code' : 5}
]

for record in records:
    print('accession number : ' + record['accession'])
    print('genetic code: ' + str(record['genetic_code'])) 

accession number : ABC123
genetic code: 1
accession number : XYZ456
genetic code: 1
accession number : HIJ789
genetic code: 5


Question: do we really need rapid lookup for this? Not really; we could just use tuples and rely on the order:

In [67]:
# a list of tuples
records = [
    ('actgctagt', 'ABC123', 1),
    ('ttaggttta', 'XYZ456', 1),
    ('cgcgatcgt', 'HIJ789', 5)
]

for record in records:
    (this_sequence, this_accession, this_code) = record
    print('accession number : ' + this_accession)
    print('genetic code: ' + str(this_code)) 


accession number : ABC123
genetic code: 1
accession number : XYZ456
genetic code: 1
accession number : HIJ789
genetic code: 5
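
The unpacking step can also be folded into the `for` statement itself. A short sketch of the same loop written that way:

```python
records = [
    ('actgctagt', 'ABC123', 1),
    ('ttaggttta', 'XYZ456', 1),
    ('cgcgatcgt', 'HIJ789', 5)
]

# each tuple is unpacked into three names as the loop goes round
for this_sequence, this_accession, this_code in records:
    print('accession number : ' + this_accession)
    print('genetic code: ' + str(this_code))
```

This only works if every tuple has exactly three elements, which is the price of relying on position rather than on names.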


##Interlude

We have looked at
- lists of lists
- lists of dicts
- lists of tuples

It's also possible to build lists of sets. 

How about other data structures?

###Sets
Elements in sets have to be hashable, which in practice means immutable, so we can't build
- sets of lists
- sets of dicts
- sets of sets

We can build sets of tuples, as long as the tuples themselves contain only hashable values (though I'm not sure why).
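
One possible use for a set of tuples is keeping track of which combinations of values we have already seen. The sketch below uses made-up (accession, genetic code) pairs. It also shows `frozenset`, the built-in immutable set type, which can be stored inside another set where a normal set cannot:

```python
# made-up (accession, genetic code) pairs, for illustration only
seen_pairs = {('ABC123', 1), ('XYZ456', 1)}
print(('ABC123', 1) in seen_pairs)   # True
print(('ABC123', 5) in seen_pairs)   # False

# a list is mutable and therefore unhashable, so it cannot go in a set
try:
    bad = {[1, 2, 3]}
except TypeError as error:
    print('TypeError: ' + str(error))

# a frozenset is immutable, so a set of frozensets is allowed
set_of_sets = {frozenset([1, 2]), frozenset([2, 3])}
print(frozenset([2, 1]) in set_of_sets)   # True
```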

###Tuples
We can build tuples where the individual elements are lists/sets/dicts/tuples, but they tend not to be very useful.

###Dicts
Dicts of things turn out to be **very** useful. They allow us to 
- attach names to other data structures, and
- rapidly look up other data structures using those names.

####Dicts of sets
Imagine we have run a gene expression experiment in which we subject some cells to various metal elements and record which genes are overexpressed in response. The data might look like this:

In [68]:
gene_sets = {
    'arsenic' : {1,2,3,4,5,6,8,12},
    'cadmium' : {2,12,6,4},
    'copper' : {7,6,10,4,8},
    'mercury' : {3,2,4,5,1}
}

This data structure leverages the features of dicts (rapidly look up a gene set from the metal name) and sets (rapidly check membership). E.g. is gene number 3 over-expressed in response to arsenic?

In [70]:
3 in gene_sets['arsenic']

True

In response to which metals is gene 5 over-expressed?

In [71]:
for metal, genes in gene_sets.items(): 
    if 5 in genes: 
        print(metal) 

mercury
arsenic


Or even more concisely (wait for comprehensions....):

In [73]:
[metal for metal, genes in gene_sets.items() if 5 in genes]

['mercury', 'arsenic']

Now, a more interesting question: are there any conditions whose genes are a subset of another condition's genes? 

In [75]:
for condition1, set1 in gene_sets.items():
    for condition2, set2 in gene_sets.items():
        if set1.issubset(set2) and condition1 != condition2:
            print(condition1 + ' is a subset of ' + condition2)

mercury is a subset of arsenic
cadmium is a subset of arsenic


Notice how we use the features of both dicts (to get hold of the condition names) and sets (using the `issubset()` method).
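
Sets support other comparisons besides `issubset()`. A short sketch using the same gene sets, with the `&` (intersection) and `-` (difference) operators; the variable names `shared` and `copper_only` are only for illustration:

```python
gene_sets = {
    'arsenic' : {1,2,3,4,5,6,8,12},
    'cadmium' : {2,12,6,4},
    'copper' : {7,6,10,4,8},
    'mercury' : {3,2,4,5,1}
}

# genes over-expressed in response to both cadmium and copper
shared = gene_sets['cadmium'] & gene_sets['copper']
print(sorted(shared))        # [4, 6]

# genes over-expressed in response to copper but not arsenic
copper_only = gene_sets['copper'] - gene_sets['arsenic']
print(sorted(copper_only))   # [7, 10]
```

The results are sorted before printing because sets themselves have no defined order.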

####Dicts of tuples

Remember our list of tuples for storing DNA sequence records:

In [76]:
records = [
    ('actgctagt', 'ABC123', 1),
    ('ttaggttta', 'XYZ456', 1),
    ('cgcgatcgt', 'HIJ789', 5)
]

This is great for iterating over all records:

In [78]:
for record in records:
    (sequence, accession, code) = record
    print("looking at record " + accession + " with genetic code " + str(code))
    # do something with the record

looking at record ABC123 with genetic code 1
looking at record XYZ456 with genetic code 1
looking at record HIJ789 with genetic code 5


but not great for finding a specific record:

In [81]:
for record in records:
    if record[1] == 'XYZ456':
        print("Found it!")
        (sequence, accession, code) = record
        # do something with the record

Found it!


Here's the same data as a dict of tuples. We turn the accession into the key:

In [82]:
records = {
    'ABC123' : ('actgctagt', 1),
    'XYZ456' : ('ttaggttta', 1),
    'HIJ789' : ('cgcgatcgt', 5)
}

Now it's just as easy to iterate over all records:

In [84]:
for accession, record in records.items():
    (sequence, code) = record
    print("looking at record " + accession + " with genetic code " + str(code))
    # do something with the record

looking at record XYZ456 with genetic code 1
looking at record HIJ789 with genetic code 5
looking at record ABC123 with genetic code 1


and it's also easy to retrieve a specific record by accession:

In [86]:
this_accession = 'XYZ456'
my_record = records.get(this_accession)
(this_sequence, this_code) = my_record
print("looking at record " + this_accession + " with genetic code " + str(this_code))
# do something with the record

looking at record XYZ456 with genetic code 1
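
It is worth knowing what happens when the accession is not in the dict. In the sketch below, `'NOPE000'` is a made-up accession that does not exist: `get()` quietly returns `None`, whereas square brackets raise a `KeyError`.

```python
records = {
    'ABC123' : ('actgctagt', 1),
    'XYZ456' : ('ttaggttta', 1),
    'HIJ789' : ('cgcgatcgt', 5)
}

# get() returns None when the key is missing
missing = records.get('NOPE000')
print(missing)   # None

# square brackets raise an error instead
try:
    records['NOPE000']
except KeyError:
    print('no record with that accession')

# so check the result of get() before unpacking it
if missing is not None:
    (this_sequence, this_code) = missing
```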


##Some special data structures from the standard library

Common scenario number one: we want to count the number of times each unique element occurs in a collection of things. E.g. counting bases in a DNA sequence:

In [92]:
dna = 'aattggaattggaattg'
base_counts = {}
for base in dna:
    current_count = base_counts.get(base, 0)
    base_counts[base] = current_count + 1

print(base_counts)

{'a': 6, 't': 6, 'g': 5}


`collections.Counter` is a special dict class for doing this. Construct it by passing a list (or string, etc.) as the argument:

In [94]:
import collections

dna = 'aattggaattggaattg'
base_counter = collections.Counter(dna)
print(base_counter)

Counter({'a': 6, 't': 6, 'g': 5})
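
A `Counter` can then be queried like a normal dict, with a couple of conveniences. A brief sketch using the same sequence:

```python
import collections

dna = 'aattggaattggaattg'
base_counter = collections.Counter(dna)

print(base_counter['g'])   # 5
# elements that were never seen have a count of zero rather than raising KeyError
print(base_counter['c'])   # 0

# most_common() returns (element, count) pairs from most to least frequent
print(base_counter.most_common())
```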


Common scenario number two: we want to have a dict where there's a default value for new keys. Example: storing kmer start positions. We can't just use a normal dict like this:

In [95]:
dna = 'aattggaattggaattg'
k = 4
kmer2pos = {}
for start in range(len(dna) - k + 1):
    kmer = dna[start:start + k]
    kmer2pos[kmer] = start # danger, overwrites

print(kmer2pos)

{'ggaa': 10, 'aatt': 12, 'gaat': 11, 'tgga': 9, 'attg': 13, 'ttgg': 8}


because it will overwrite earlier positions with later ones so we just end up with the right-most position for each kmer. We need a **dict of lists** i.e. a list of start positions for each kmer:

In [96]:
dna = 'aattggaattggaattg'
k = 4 
kmer2list = {} 
for start in range(len(dna) - k + 1): 
    kmer = dna[start:start + k] 
    list_of_positions = kmer2list.get(kmer, []) 
    list_of_positions.append(start)
    kmer2list[kmer] = list_of_positions 
print(kmer2list)

{'ggaa': [4, 10], 'aatt': [0, 6, 12], 'gaat': [5, 11], 'tgga': [3, 9], 'attg': [1, 7, 13], 'ttgg': [2, 8]}


but note that we have to explicitly fetch the current list of positions, falling back to an empty list the first time we see a given kmer. An alternative is to use `collections.defaultdict`, which lets us supply a function that creates the default value. The `list()` function creates an empty list:

In [97]:
list()

[]

so we can create a `defaultdict` where the default value is a list:

In [99]:
import collections
kmer2list = collections.defaultdict(list)

Now we can manipulate values in the dict without worrying about whether there is already a value there:

In [101]:
for start in range(len(dna) - k + 1): 
    kmer = dna[start:start + k] 
    kmer2list[kmer].append(start)
kmer2list

defaultdict(<type 'list'>, {'ggaa': [4, 10], 'aatt': [0, 6, 12], 'gaat': [5, 11], 'tgga': [3, 9], 'attg': [1, 7, 13], 'ttgg': [2, 8]})
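
One thing to watch out for: simply looking up a missing key with square brackets is enough to create it. A small sketch of the behaviour, using made-up kmers:

```python
import collections

kmer2list = collections.defaultdict(list)
kmer2list['aatt'].append(0)
print(len(kmer2list))          # 1

# reading a missing key with [] inserts it with the default value
positions = kmer2list['cccc']
print(positions)               # []
print(len(kmer2list))          # 2

# 'in' and get() do not create new entries
print('gggg' in kmer2list)     # False
print(kmer2list.get('gggg'))   # None
print(len(kmer2list))          # still 2
```

So when we only want to check whether a kmer has been seen, `in` or `get()` is the safer choice.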

#Exercises
## Transforming data between structures

Use the heavy metal gene expression data.
 
The similarity score between two conditions is the number of over-expressed genes they have in common (the size of the intersection) divided by the total number of distinct genes over-expressed in either condition (the size of the union).
 
Write a program that will start with the dict of sets and produce a pairwise similarity matrix stored as a dict of dicts.
 
We should be able to get the score for a given pair of conditions like this:

In [None]:
score = similarity_matrix['arsenic']['cadmium']

Hint: take a look at [the documentation for sets](https://docs.python.org/2/library/stdtypes.html#set)

