#Iterators, comprehensions and generators

##List comprehensions

In the session on functional programming, we used `map()` and `filter()` to define lists. For example, the result of calling `get_at()` on each element of `dna_list`:

In [2]:
from __future__ import division

def get_at(dna): 
    return (dna.count('A') + dna.count('T')) / len(dna) 

dna_list = ['TAGC', 'ACGTATGC', 'ATG', 'ACGGCTAG'] 
map(get_at, dna_list) 

[0.5, 0.5, 0.6666666666666666, 0.375]

The elements of `dna_list` which are at least 4 bases long: 

In [5]:
filter(lambda x: len(x) > 3, dna_list)

['TAGC', 'ACGTATGC', 'ACGGCTAG']

Python has a special syntax for defining lists called **list comprehensions**. Here's the list of lengths of the DNA sequences in three ways:

In [10]:
# with a loop

l1 = []
for dna in dna_list:
    l1.append(len(dna))
    
# with a map
l2 = map(len, dna_list)

# as a list comprehension
l3 = [len(dna) for dna in dna_list]

assert l1 == l2
assert l1 == l3
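The assertions above depend on Python 2, where `map()` returns a list. In Python 3 `map()` returns a lazy iterator, so comparing it directly with a list gives `False`. A minimal sketch that should behave the same on both versions wraps the result in `list()`:

```python
dna_list = ['TAGC', 'ACGTATGC', 'ATG', 'ACGGCTAG']

# with a loop
l1 = []
for dna in dna_list:
    l1.append(len(dna))

# with a map; list() forces the result into a real list on Python 3
l2 = list(map(len, dna_list))

# as a list comprehension
l3 = [len(dna) for dna in dna_list]

assert l1 == l2 == l3
```

The list comprehension needs no such wrapping, since it always builds a list.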

Another example: AT content

In [13]:
# with a map
l1 = map(get_at, dna_list)

# as a list comprehension
l2 = [get_at(dna) for dna in dna_list]

assert l1 == l2
l1

[0.5, 0.5, 0.6666666666666666, 0.375]

But what if we don't have `get_at()` defined? 

In [20]:
# with a map
l1 = map(lambda dna : (dna.count('A') + dna.count('T')) / len(dna), dna_list)

# as a list comprehension
l2 = [(dna.count('A') + dna.count('T')) / len(dna) for dna in dna_list]

assert l1 == l2

The list comprehension allows us to express the transformation without having to write either a function or a lambda expression. 

List comprehensions can also have conditions:

In [22]:
[len(dna) for dna in dna_list if get_at(dna) >= 0.5]

[4, 8, 3]

This allows us to effectively do a map+filter all in one go. More readably:

In [24]:
[len(dna) 
 for dna in dna_list 
 if get_at(dna) >= 0.5]

[4, 8, 3]
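To make the "map+filter in one go" claim concrete, here is a small sketch comparing the conditional comprehension with an explicit `filter()` followed by `map()`. The names `with_comprehension` and `with_map_filter` are only for illustration, and `list()` is used so the comparison also works on Python 3:

```python
from __future__ import division  # only needed on Python 2

def get_at(dna):
    return (dna.count('A') + dna.count('T')) / len(dna)

dna_list = ['TAGC', 'ACGTATGC', 'ATG', 'ACGGCTAG']

# the condition and the transformation in a single expression
with_comprehension = [len(dna) for dna in dna_list if get_at(dna) >= 0.5]

# the same thing as two steps: filter first, then map over the survivors
at_rich = filter(lambda dna: get_at(dna) >= 0.5, dna_list)
with_map_filter = list(map(len, at_rich))

assert with_comprehension == with_map_filter
```

Note the order: the condition is applied to the original element (`dna`), not to the transformed value (`len(dna)`).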

List comprehensions can be very concise. They can operate on any iterable type, not just a list. For example, to get a list of all FASTA headers:

In [26]:
# using sequences.fasta from previous exercise
[line[1:] for line in open('sequences.fasta') if line.startswith('>')]

['normal_sequence\n',
 'sequence_in_lowercase\n',
 'sequence_with_unknown_bases\n',
 'this header contains spaces\n',
 'this_header_is_very_long_and_should_be_truncated\n',
 'some_gene\n']

This will become very interesting when we start to talk about iterables later on. 

List comprehensions can be nested, which is just like using nested `for` loops. To generate all possible dinucleotides:

In [28]:
bases = ['A', 'T', 'G', 'C']
[base1 + base2 for base1 in bases for base2 in bases]


['AA',
 'AT',
 'AG',
 'AC',
 'TA',
 'TT',
 'TG',
 'TC',
 'GA',
 'GT',
 'GG',
 'GC',
 'CA',
 'CT',
 'CG',
 'CC']
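The nested comprehension above reads left to right in the same order as the equivalent nested `for` loops, with the first `for` being the outer loop. A short sketch of the loop version (the name `dinucleotides` is just for illustration):

```python
bases = ['A', 'T', 'G', 'C']

# the first "for" in the comprehension is the outer loop,
# the second "for" is the inner loop
dinucleotides = []
for base1 in bases:
    for base2 in bases:
        dinucleotides.append(base1 + base2)

assert dinucleotides == [base1 + base2 for base1 in bases for base2 in bases]
```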

Needless to say, this can get complicated very quickly. 

Closely related to list comprehensions are **generator expressions**. A generator expression is just a list comprehension with round brackets rather than square ones:

In [31]:
(len(dna) for dna in dna_list)

<generator object <genexpr> at 0x7f40e02ae1e0>

The only difference is that generator expressions are **lazy**, i.e. they only calculate the elements as they are needed. This can save a lot of memory since we don't have to create the whole list. E.g. our previous example of getting FASTA headers creates a list in memory, but if we change to a generator:

In [34]:
headers = (line[1:] for line in open('sequences.fasta') if line.startswith('>'))
for header in headers:
    print(header)
    # real processing goes here

normal_sequence

sequence_in_lowercase

sequence_with_unknown_bases

this header contains spaces

this_header_is_very_long_and_should_be_truncated

some_gene



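Then we are only dealing with a single element at a time. To see that a generator expression does no work up front, compare how long it takes to create a list with how long it takes to create the equivalent generator: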

In [39]:
%%timeit
l = [2 ** x for x in range(1000)]

100 loops, best of 3: 2.17 ms per loop


In [41]:
%%timeit 
l = (2 ** x for x in range(1000))

100000 loops, best of 3: 12.8 µs per loop
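The timings show that creating the generator is nearly free because nothing has been calculated yet. The sketch below looks at the same idea from a different angle. It assumes CPython, where `sys.getsizeof()` reports the size of the container object itself (not the numbers inside it), and the variable names are only for illustration:

```python
import sys

powers_list = [2 ** x for x in range(1000)]
powers_gen = (2 ** x for x in range(1000))

# the list holds 1000 references; the generator holds only its own state
list_is_bigger = sys.getsizeof(powers_list) > sys.getsizeof(powers_gen)

# values are calculated one at a time, on request
first = next(powers_gen)   # 2 ** 0
second = next(powers_gen)  # 2 ** 1

# consuming the rest exhausts the generator...
remaining = sum(1 for _ in powers_gen)

# ...so a second pass finds nothing left
leftover = list(powers_gen)
```

The flip side of laziness is visible in the last line: a generator expression can only be iterated over once, whereas a list can be reused as often as we like.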


##Dict comprehensions

Just like we often write loops to create lists (which we can replace with map/filter or list comprehensions), we often write loops to create dicts:

In [42]:
d = {}
for dna in dna_list:
    d[dna] = get_at(dna)
d

{'ACGGCTAG': 0.375, 'ACGTATGC': 0.5, 'ATG': 0.6666666666666666, 'TAGC': 0.5}

Dict comprehensions allow us to express these more compactly:

In [44]:
d = { x : get_at(x) for x in dna_list }
d

{'ACGGCTAG': 0.375, 'ACGTATGC': 0.5, 'ATG': 0.6666666666666666, 'TAGC': 0.5}

This only works for very simple dicts i.e. when we can describe an item with a single expression (`x : get_at(x)`). It wouldn't work for our kmer counting example because that requires us to update the dict. 
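To illustrate why counting needs a loop, here is a minimal sketch of kmer counting. `count_kmers` is a hypothetical stand-in written for this example, not necessarily the function from the earlier session. Each value depends on what is already stored under the same key, so it cannot be written as a single `key : value` expression:

```python
def count_kmers(dna, k):
    """Count how many times each kmer of length k occurs in dna."""
    counts = {}
    for i in range(len(dna) - k + 1):
        kmer = dna[i:i + k]
        # the new value depends on the current value for this key
        counts[kmer] = counts.get(kmer, 0) + 1
    return counts

counts = count_kmers('ATATAT', 2)
```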

One very common use: creating an index for a list of complex objects. Think back to our list-of-tuples DNA records from the session on data structures:

In [46]:
records = [
    ('actgctagt', 'ABC123', 1),
    ('ttaggttta', 'XYZ456', 1),
    ('cgcgatcgt', 'HIJ789', 5)
]

Using a dict comprehension we can quickly create a dict where the keys are the accession number and the values are the tuples themselves:

In [48]:
index = {r[1] : r for r in records}
index

{'ABC123': ('actgctagt', 'ABC123', 1),
 'HIJ789': ('cgcgatcgt', 'HIJ789', 5),
 'XYZ456': ('ttaggttta', 'XYZ456', 1)}

Which (as we saw before) allows us to very quickly look up the complete record if we know the accession:

In [50]:
index['HIJ789']

('cgcgatcgt', 'HIJ789', 5)

###Set comprehensions

Mentioned for completeness. Curly brackets like a dict comprehension, but single elements rather than pairs:

In [55]:
even_integers = {x for x in range(1000) if x % 2 == 0}
# same as...
even_integers == set((x for x in range(1000) if x % 2 == 0))

True

##Iterators and generators

Iterable types are very useful in Python:
- we can iterate over them, obviously
- we can turn them into lists
- we can use map/filter/sort on them
- we can use them in various types of comprehensions

We already know about a bunch of iterable types:
- lists
- sets
- tuples
- strings (as characters)
- File objects (as lines)
- generator expressions 

And about methods that return iterable types:
- re.finditer()
- dict.items()
- map() in Python 3
- range() in Python 3

####How do we create our own iterable data types?

Short story: to make a class iterable, we have to give it a special method called `__iter__()` which returns an iterator. When we want to iterate over an object that is already iterable, e.g. a string, we can get the iterator by calling the `iter()` function. 
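As a rough sketch of what happens behind the scenes in a `for` loop, we can drive an iterator by hand. This uses the built-in `next()` function, which asks the iterator for its next value, and relies on the iterator raising `StopIteration` when it runs out. The variable names are only for illustration:

```python
sequence = 'ATG'

# iter() calls the object's __iter__() method and hands back an iterator
it = iter(sequence)

seen = []
while True:
    try:
        # ask the iterator for its next value
        seen.append(next(it))
    except StopIteration:
        # the iterator signals that there is nothing left
        break
```

A `for base in sequence:` loop does essentially this for us.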

Say we want to make it possible to iterate over a `DNARecord` object:


In [68]:
class DNARecord(object): 
    
    def __init__(self, sequence, gene_name, species_name):
        self.sequence = sequence
        self.gene_name = gene_name
        self.species_name = species_name
        
d1 = DNARecord('ATATAT', 'COX1', 'Homo sapiens')

We can't do this:

In [69]:
for base in d1:
    print(base)

TypeError: 'DNARecord' object is not iterable

We have to do this:

In [70]:
for base in d1.sequence:
    print(base)

A
T
A
T
A
T


This is not the worst thing in the world, but it requires more typing and ties us to a specific implementation of the `DNARecord` class (what happens if we want to change the name of the `sequence` variable?).

To make it possible to iterate over the `DNARecord` object itself, we add the `__iter__()` method thus:

In [71]:
class DNARecord(object): 
    
    def __init__(self, sequence, gene_name, species_name):
        self.sequence = sequence
        self.gene_name = gene_name
        self.species_name = species_name
    
    def __iter__(self):
        return iter(self.sequence)   
    
d1 = DNARecord('ATATATT', 'COX1', 'Homo sapiens')

for base in d1:
    print(base)

A
T
A
T
A
T
T


The `__iter__()` method here is very simple: it just calls `iter()` on the sequence to get an iterator, then returns it. In other words:

>to iterate over a DNARecord, simply iterate over the sequence variable

Now something more complicated: what if we want to iterate over **codons**? We cannot just grab the string iterator, so we must create our own. An iterator has a method called `next()` which has to follow two rules: either it returns the next value, or if it's at the end it raises a `StopIteration` exception (don't worry about exceptions). 

Here it is:

In [80]:
class DNARecord(object): 
    position = 0
    
    def __init__(self, sequence, gene_name, species_name):
        self.sequence = sequence
        self.gene_name = gene_name
        self.species_name = species_name
    
    def __iter__(self):
        return self   
    
    def next(self): 
        if self.position < (len(self.sequence) - 2): 
            codon = self.sequence[self.position:self.position+3] 
            self.position += 3 
            return codon 
        else: 
            self.position = 0
            raise StopIteration  


d1 = DNARecord('ATCGTCGACTGACTACG', 'COX1', 'Homo sapiens')

for codon in d1:
    print(codon)   


ATC
GTC
GAC
TGA
CTA


Take a moment to look at the logic of the `next()` method. 

Note that:
- the object is now responsible for keeping track of its position, so we need a `position` variable
- the `next()` method checks whether at least three bases remain after the current position
- it updates the position each time it's called
- when it gets to the end of the sequence it resets the position back to zero for the next time
- any incomplete codon at the end of the sequence is skipped
- the `__iter__()` method returns the object itself

Complicated stuff!

**Aside: `next()` has been renamed to `__next__()` in Python 3.**
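One way to write the codon iterator so that it runs on both Python 2 and Python 3 is to define `__next__()` and alias `next` to it. This is only a sketch: `CodonRecord` is a hypothetical cut-down class for illustration (sequence only), and it stores the position on the instance in `__init__()` rather than as a class-level variable:

```python
class CodonRecord(object):

    def __init__(self, sequence):
        self.sequence = sequence
        self.position = 0  # each record tracks its own position

    def __iter__(self):
        return self

    def __next__(self):
        # a full codon is available if at least three bases remain
        if self.position <= len(self.sequence) - 3:
            codon = self.sequence[self.position:self.position + 3]
            self.position += 3
            return codon
        # reset so that the record can be iterated over again
        self.position = 0
        raise StopIteration

    # Python 2 looks for next(), Python 3 for __next__()
    next = __next__

record = CodonRecord('ATCGTCGACTGACTACG')
first_pass = list(record)
second_pass = list(record)
```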

The codon iteration works but isn't great; ideally we'd like to be able to iterate over either bases or codons.

##Digression: generators

A generator (not to be confused with the generator expressions above) is a function that calculates one value at a time. Instead of using a `return` statement to return the complete result, we use `yield` to hand back a single value:


In [83]:
def get_4mers(dna):
    result = [] 
    for i in range(len(dna) - 3): 
        result.append(dna[i:i+4]) 
    return result

get_4mers('acgatcgatgc')

['acga', 'cgat', 'gatc', 'atcg', 'tcga', 'cgat', 'gatg', 'atgc']

In [86]:
def generate_4mers(dna): 
    for i in range(len(dna) - 3): 
        yield dna[i:i+4] 
        
generate_4mers('acgatcgatgc')

<generator object generate_4mers at 0x7f40e02ae280>

In [101]:
x = generate_4mers('acgatcgatgc')

In [103]:
x.next()

'acga'
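Each call picks up where the previous `yield` left off. A small sketch using the built-in `next()` function (which works on both Python 2 and 3, unlike the `.next()` method) on a shorter, made-up sequence:

```python
def generate_4mers(dna):
    for i in range(len(dna) - 3):
        yield dna[i:i + 4]

x = generate_4mers('acgatc')

first = next(x)   # runs the function body up to the first yield
second = next(x)  # resumes after the first yield, runs to the second
third = next(x)

# there are only three 4-mers in a six-base sequence, so a further
# call makes the function finish, which raises StopIteration
try:
    next(x)
    exhausted = False
except StopIteration:
    exhausted = True
```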

Generators are useful for all sorts of things. They are often more readable, they are very memory efficient as they only calculate one value at a time, and they are **iterators**. So we can do this:

In [105]:
for fourmer in generate_4mers('acgatcgatgc'):
    print(fourmer)

acga
cgat
gatc
atcg
tcga
cgat
gatg
atgc


and

In [107]:
list(generate_4mers('acgatcgatgc'))

['acga', 'cgat', 'gatc', 'atcg', 'tcga', 'cgat', 'gatg', 'atgc']

and we can use them to implement clever iteration behaviour in our classes:

In [111]:
class DNARecord(object): 
    
    def __init__(self, sequence, gene_name, species_name):
        self.sequence = sequence
        self.gene_name = gene_name
        self.species_name = species_name
    
    def bases(self):
        return iter(self.sequence)
    
    def kmers(self, k):
        for i in range(len(self.sequence) - k +1): 
            yield self.sequence[i:i+k] 
            
    def codons(self):
        return self.kmers(3)

In [113]:
d1 = DNARecord('ATCGTCGACTGACTACG', 'COX1', 'Homo sapiens')
for base in d1.bases():
    print(base)
    
for kmer in d1.kmers(5):
    print(kmer)
    
for codon in d1.codons():
    print(codon)

A
T
C
G
T
C
G
A
C
T
G
A
C
T
A
C
G
ATCGT
TCGTC
CGTCG
GTCGA
TCGAC
CGACT
GACTG
ACTGA
CTGAC
TGACT
GACTA
ACTAC
CTACG
ATC
TCG
CGT
GTC
TCG
CGA
GAC
ACT
CTG
TGA
GAC
ACT
CTA
TAC
ACG
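Note that `codons()` as written returns `self.kmers(3)`, so it yields every overlapping 3-mer (`ATC`, `TCG`, `CGT`, ...) rather than consecutive codons as in the earlier `next()` example. If non-overlapping codons are what we want, one possible sketch is a generator that steps three bases at a time. `generate_codons` is a hypothetical standalone function for illustration, not part of the class above:

```python
def generate_codons(sequence):
    """Yield consecutive, non-overlapping codons from sequence."""
    # step through the sequence three bases at a time; stopping at
    # len(sequence) - 2 skips any incomplete codon at the end
    for i in range(0, len(sequence) - 2, 3):
        yield sequence[i:i + 3]

codons = list(generate_codons('ATCGTCGACTGACTACG'))
```

The same logic could be placed in the body of the `codons()` method, using `self.sequence` in place of the argument.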


##Exercises

###BLAST processor

As a warm up, take your solutions to the BLAST processing exercise from the Functional Programming session and rewrite them to just use list comprehensions.

###Primer search

Write a function that will create a list of all possible PCR primers of a given length. 

Hint: start with the kmer-generating function from the session on recursion. 
Further hint: *all possible PCR primers of a given length* simply means all possible DNA sequences of a given length. In the real world we would do some filtering on the results to find sequences with similarity to the target, melting points, etc. 

For length l there are 

$$4^l$$

possible primers so eventually you will run out of memory when trying to list them all. Rewrite your function as a generator to get around this problem. 

Finally, write a generator which will generate all possible pairs of primers. For length l there will be 

$$(4^l)^2$$

possible pairs, so if you try to do this with a normal function you will run out of memory even faster!