### itertools module
How can we generate all the possible codons of the genetic code ('AAA', 'AAT', 'AAC', 'AAG', ...)?

We want <strong>all</strong> possible ordered triplets of the four nucleotides, with repetition allowed: 4 × 4 × 4 = 64 codons.

In [1]:
def codons_v1():
    bases = 'ATCG'
    
    codons = []
    
    for base1 in bases:
        for base2 in bases:
            for base3 in bases:
                codons.append(
                    base1 + base2 + base3
                )
    
    return codons

len(codons_v1()), codons_v1()

(64,
 ['AAA',
  'AAT',
  'AAC',
  'AAG',
  'ATA',
  'ATT',
  'ATC',
  'ATG',
  'ACA',
  'ACT',
  'ACC',
  'ACG',
  'AGA',
  'AGT',
  'AGC',
  'AGG',
  'TAA',
  'TAT',
  'TAC',
  'TAG',
  'TTA',
  'TTT',
  'TTC',
  'TTG',
  'TCA',
  'TCT',
  'TCC',
  'TCG',
  'TGA',
  'TGT',
  'TGC',
  'TGG',
  'CAA',
  'CAT',
  'CAC',
  'CAG',
  'CTA',
  'CTT',
  'CTC',
  'CTG',
  'CCA',
  'CCT',
  'CCC',
  'CCG',
  'CGA',
  'CGT',
  'CGC',
  'CGG',
  'GAA',
  'GAT',
  'GAC',
  'GAG',
  'GTA',
  'GTT',
  'GTC',
  'GTG',
  'GCA',
  'GCT',
  'GCC',
  'GCG',
  'GGA',
  'GGT',
  'GGC',
  'GGG'])

The nested for loops are equivalent to a <strong>cartesian product</strong>: for sets A and B, the cartesian product A × B is the set of all ordered pairs (a, b) where a ∈ A and b ∈ B:

\begin{equation*}
A \times B = \{(a, b) \mid a \in A, b \in B\}
\end{equation*}

<br/>
<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/4/4e/Cartesian_Product_qtl1.svg/1280px-Cartesian_Product_qtl1.svg.png" width="40%" />

The <strong>itertools</strong> module provides iterators for cartesian products, permutations, and combinations (with or without replacement).

In [2]:
A = ['x', 'y', 'z']
B = [1, 2, 3]

import itertools

for x in itertools.product(A, B):
    print(x)

('x', 1)
('x', 2)
('x', 3)
('y', 1)
('y', 2)
('y', 3)
('z', 1)
('z', 2)
('z', 3)


The <code>product()</code> function takes one or more iterables plus an optional keyword argument <code>repeat</code>, and computes the cartesian product of the iterables. Passing <code>repeat=n</code> is equivalent to passing the same iterables n times.
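As a quick check of how <code>repeat</code> behaves, the following minimal sketch compares the two calling styles on the four bases. The variable names are only illustrative.

```python
import itertools

bases = 'ATCG'

# repeat=2 is shorthand for passing the same iterable twice.
pairs_repeat = list(itertools.product(bases, repeat=2))
pairs_explicit = list(itertools.product(bases, bases))

print(pairs_repeat == pairs_explicit)  # True
print(len(pairs_repeat))               # 16, i.e. 4 * 4

# With a single iterable and the default repeat=1, product() yields 1-tuples.
singles = list(itertools.product(bases))
print(singles)  # [('A',), ('T',), ('C',), ('G',)]
```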

In [3]:
def codons_v2():
    bases = 'ATCG'
    
    codons = []
    for codon_tuple in itertools.product(bases, repeat=3):
        codons.append(
            ''.join(codon_tuple)
        )
    
    return codons

len(codons_v2()), codons_v2()

(64,
 ['AAA',
  'AAT',
  'AAC',
  'AAG',
  'ATA',
  'ATT',
  'ATC',
  'ATG',
  'ACA',
  'ACT',
  'ACC',
  'ACG',
  'AGA',
  'AGT',
  'AGC',
  'AGG',
  'TAA',
  'TAT',
  'TAC',
  'TAG',
  'TTA',
  'TTT',
  'TTC',
  'TTG',
  'TCA',
  'TCT',
  'TCC',
  'TCG',
  'TGA',
  'TGT',
  'TGC',
  'TGG',
  'CAA',
  'CAT',
  'CAC',
  'CAG',
  'CTA',
  'CTT',
  'CTC',
  'CTG',
  'CCA',
  'CCT',
  'CCC',
  'CCG',
  'CGA',
  'CGT',
  'CGC',
  'CGG',
  'GAA',
  'GAT',
  'GAC',
  'GAG',
  'GTA',
  'GTT',
  'GTC',
  'GTG',
  'GCA',
  'GCT',
  'GCC',
  'GCG',
  'GGA',
  'GGT',
  'GGC',
  'GGG'])

How can we generate the codons without repeated bases?

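In mathematics, a <strong>permutation</strong> is an arrangement of the members of a set into a sequence or order. <code>itertools.permutations(iterable, r)</code> yields every ordered selection of r elements taken without replacement.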

In [4]:
def codons_no_repetitions_v1():
    bases = 'ATCG'

    codons = []
    for codon_tuple in itertools.permutations(bases, 3):
        codons.append(
            ''.join(codon_tuple)
        )
    
    return codons

len(codons_no_repetitions_v1()), codons_no_repetitions_v1()

(24,
 ['ATC',
  'ATG',
  'ACT',
  'ACG',
  'AGT',
  'AGC',
  'TAC',
  'TAG',
  'TCA',
  'TCG',
  'TGA',
  'TGC',
  'CAT',
  'CAG',
  'CTA',
  'CTG',
  'CGA',
  'CGT',
  'GAT',
  'GAC',
  'GTA',
  'GTC',
  'GCA',
  'GCT'])

In [5]:
def count_all_bases_v5(dna):
    counts = {
        'A': 0,
        'T': 0,
        'C': 0,
        'G': 0
    }

    for base in dna:
        counts[base] += 1

    return counts

def codons_no_repetitions_v0():
    codons = []
    for codon in codons_v2():
        # e.g. {'A': 2, 'T': 1, 'C': 0, 'G': 0} for 'AAT'
        counts = count_all_bases_v5(codon).values()

        # Keep the codon only if no base occurs more than once
        add_codon = True
        for count in counts:
            if count > 1:
                add_codon = False

        if add_codon:
            codons.append(codon)

    return codons

len(codons_no_repetitions_v0()), codons_no_repetitions_v0()

(24,
 ['ATC',
  'ATG',
  'ACT',
  'ACG',
  'AGT',
  'AGC',
  'TAC',
  'TAG',
  'TCA',
  'TCG',
  'TGA',
  'TGC',
  'CAT',
  'CAG',
  'CTA',
  'CTG',
  'CGA',
  'CGT',
  'GAT',
  'GAC',
  'GTA',
  'GTC',
  'GCA',
  'GCT'])
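The count of 24 matches the formula for permutations without repetition, n! / (n - r)! with n = 4 and r = 3. The sketch below checks this and also shows a shorter way to express the filter: a triplet has no repeated base exactly when its set has three elements. This is only an illustrative cross-check, not a replacement for the functions above.

```python
import itertools
import math

bases = 'ATCG'

# Codons without repeated bases, straight from permutations().
perms = {''.join(p) for p in itertools.permutations(bases, 3)}

# The same codons obtained by filtering the cartesian product:
# keep a triplet only if its three bases are all different.
filtered = {
    ''.join(t)
    for t in itertools.product(bases, repeat=3)
    if len(set(t)) == 3
}

# n! / (n - r)! with n = 4 bases and r = 3 positions.
n_expected = math.factorial(4) // math.factorial(4 - 3)

print(perms == filtered)       # True
print(len(perms), n_expected)  # 24 24
```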

How can we get one representative for each group of codons with the same nucleotide composition? For example, the codons 'AAT', 'ATA', and 'TAA' all have two As and one T.

Permutations differ from <strong>combinations</strong>, which are selections of some members of a set regardless of order.

In [6]:
list(itertools.combinations_with_replacement('ATCG', 3))

[('A', 'A', 'A'),
 ('A', 'A', 'T'),
 ('A', 'A', 'C'),
 ('A', 'A', 'G'),
 ('A', 'T', 'T'),
 ('A', 'T', 'C'),
 ('A', 'T', 'G'),
 ('A', 'C', 'C'),
 ('A', 'C', 'G'),
 ('A', 'G', 'G'),
 ('T', 'T', 'T'),
 ('T', 'T', 'C'),
 ('T', 'T', 'G'),
 ('T', 'C', 'C'),
 ('T', 'C', 'G'),
 ('T', 'G', 'G'),
 ('C', 'C', 'C'),
 ('C', 'C', 'G'),
 ('C', 'G', 'G'),
 ('G', 'G', 'G')]

How can we get one representative for each group of codons with the same nucleotide composition, but without nucleotide repetitions?

In [7]:
list(itertools.combinations('ATCG', 3))

[('A', 'T', 'C'), ('A', 'T', 'G'), ('A', 'C', 'G'), ('T', 'C', 'G')]
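
The sizes of the two results can be checked against the usual counting formulas: C(n + r - 1, r) for combinations with replacement and C(n, r) without. The sketch below does this for n = 4 and r = 3, then groups the 64 codons by composition to confirm that each group corresponds to one combination. It assumes Python 3.8 or later for <code>math.comb</code>; the variable names are illustrative.

```python
import itertools
import math

bases = 'ATCG'

with_rep = list(itertools.combinations_with_replacement(bases, 3))
without_rep = list(itertools.combinations(bases, 3))

print(len(with_rep), math.comb(4 + 3 - 1, 3))  # 20 20
print(len(without_rep), math.comb(4, 3))       # 4 4

# Group all 64 codons by composition: sorting the letters of a codon
# gives a key shared by every codon with the same bases.
groups = {}
for codon_tuple in itertools.product(bases, repeat=3):
    key = ''.join(sorted(codon_tuple))
    groups.setdefault(key, []).append(''.join(codon_tuple))

print(len(groups))    # 20
print(groups['AAT'])  # ['AAT', 'ATA', 'TAA']
```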

### Dictionary (again)
<strong>Recall</strong>: dictionaries are a convenient way to store data for later retrieval by key. Keys must be unique and hashable (in practice, immutable objects such as strings, numbers, or tuples); the values can be anything.
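
A minimal sketch of these rules, using a made-up <code>codon_counts</code> dictionary: strings and tuples work as keys, a list does not, and assigning to an existing key overwrites its value rather than creating a duplicate.

```python
codon_counts = {}

codon_counts['ATG'] = 1              # a string key
codon_counts[('A', 'T', 'G')] = 2    # a tuple key also works

try:
    codon_counts[['A', 'T', 'G']] = 3    # a list is mutable, so not hashable
except TypeError as error:
    print('TypeError:', error)

# Keys are unique: assigning again replaces the old value.
codon_counts['ATG'] = 10

print(codon_counts)  # {'ATG': 10, ('A', 'T', 'G'): 2}
```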

In [8]:
seq = 'TAGGATTACAGGCATGAGCTACCGTATAATGGCCAGGCCCCCTGCCTTTGTAAATAAATTTTCACTGGAACCTGGACACACTTGTTTATGTGTTGTTTGTGCCTGTTTTCACGCTGCGGCAGGAAAGTTGAGTCGTTGTGTCAGAGACCAGAGAGAGAGCCTGCAGAACCTCAAATACTATCTGGCCCTTGCCAGAAAAAGTTTACCAACCCCCTGCCTCCCTGGAATGGGTGGAGGGTGGTTGTAAAGGTACTGGAGGATCTGAAGACATAATAGGGTCCGTGACCCTTGTGAGGTTGTGAAGCTCCCTTAAGGCACATGGTGGCTGGGCTGTGGATTTGGGGTATGGGCAGAGAGTGTGGAGAGCACTTCCAGGGGCCATGTCTGAGAGACTACATGATGCCACTTTGAATGCCCAGTTTGTTCATCCTTTTCTGTTTTCCCCACTTCCCCAGATGGGTGATCTACAATGACCAGAAAGTGTGTGCCTCCGAGAAGCCGCCCAAGGATATAATACATCTACTTCTACCAGAGAGTGGCCAGCTAAGAGCCTGCCTCACCCCTTACCAATGAGGGCAGGGGAAGACCACCTGGCATGAGGGAGAGGGGCTGAGGGATGGACTTCAGCCCCTCTGCTCTGTACCCTTTTTCCTTTTGTCCCCGGCAGCAGGGAAGAAGCTGGAGGCCGTGGGAGAATGGCTGGGCAGAGCAGAGGGGCAGCGATAGACTCTGGGGATGGAGCAGGACGGGGACGGGAGGGGCCGGCCACCTGTCTGTAAGGAGACTTTGTTGCTTCCCCTGCCCCCGGAATCCACAGTGCTCTGCTTCTCTGTGTCGCCCCGCCCAGCCCCCTGGTGTGGAGGGAGGGGTCTCGTTTGTGCGCGTGGGTGTAGCTTTGTGCATCCTCTCCCAGTGGAGCGATCACCTGTGCCTCCCCTCCCCCTTTGTTTGCCCCTGTGTGGTTGGTCAAGGAGGGATGTGAGGGAAATAGGGACCCCCCGACTTGCCCTCCTGCCTCAGTCTTTCCCCCACCCTGTCTCTTCCTTGTCCTTCTCTGGAAAATGCCAAAATACACGATGTGAATAAAAGTACAACGGCTAAATTGTGTCCTGTTTGATACCTTGGGGGAGAGGCTTACCTTCCTGGGGTTAGCAGGAGGGCGCTTAAGAAAACTCCTAACTCTGGCCGCCTCCCTGCCAAAGTCAAGTCTCCACTTTTCACTGGTTCTAGAGCTCTAGGAAAATTGGGGTTGGGTGGGGAGGTGGAGTAGAGTGACTAAATGCCGACACAAAGCCAAGGAAAGATGGAGTGAAGAACCCTTCCCTCTCTTTATTCACACAGGAGTGGAGGATTTCCCAAATGTCCCTAACTGGCTAGCTGGCTTCAGGCTGGGACTCAGTCCCTGCAGTTCCTGCCAGGCCTTGCCAGCCGGGGCGAGGGTTGGGATGATCCTGGCGGCCTATGCCTTATAATGCTGCCCCTCCCGCTGTGAACCCTGCATTTGTCCCGCAAGTTTTCACTCAGGTAGACTCCCTGGGTACAAGGGTGCCTGCTCAGCAGTCGGGCATGAGCTGCTCCGATGGGCGAAGGAGGTTGTCTATCCCACAGTTGGAGAGGGGCCCTCTCTGCCCCAGTGGGCGATCTGGGCTACGGCCAAGTTGCCACCAGCTAGTTCCGCTTGAAAACCACTTCTGGCCCCGTGGGGGACTCAAGTCGCCAAGCGAGGGTTCCCCTGAGCGCCGGAGCTCACAGGTCTCGCCTTGTCCCGAAAGCCCCGCAATCGAGGCGGAGGCGACCGAGCCCCCGACTCTCCTAGAACGTTGCCACAAGAAGGGGGAACGTCGGAACAGTGCATCATCGGGCGGCGGCCGGGGCGGCGGCAGGAGGGCGGGCGGGGGGCAGGGCTCCGGGGGACTGGGCGGGCCATGGCGGAGGACGGCGAGGAGGCGGAGTTCCACTTCGCGGCGCTCTATATAAGTGGGCAGTGGCCGCG
ACTGCGCGCAGACACTGACCTTCAGCGCCTCGGCTCCAGCGCCATGGCGCCCTCCAGGAAGTTCTTCGTTGGGGGAAACTGGAAGATGAACGGGCGGAAGCAGAGTCTGGGGGAGCTCATCGGCACTCTGAACGCGGCCAAGGTGCCGGCCGACACCG'

consensus_motifs = {
    'Shine_Dalgarno': 'AGGAGG',
    'TATA_BOX': 'TATAAT',
    'GC_BOX': 'GGGCGG',
    'FAKE_BOX_1': 'ATGGAAGGCA',
    'FAKE_BOX_2': 'TTTTA'
}

#### Slicing operation
The slicing operation allows us to extract a contiguous piece of a string or list.

In [9]:
example = '0123456'

print('example[0]', example[0]) # Indexing

print('\nexample[2:5]', example[2:5])

print('\nexample[0:2]', example[0:2])
print('example[:2]', example[:2])

print('\nexample[1:len(example)]', example[1:len(example)])
print('example[1:]', example[1:])

example[0] 0

example[2:5] 234

example[0:2] 01
example[:2] 01

example[1:len(example)] 123456
example[1:] 123456


In [10]:
def find_motifs_v1(sequence, motif_list):
    motifs_dict = {}
    
    for motif in motif_list:
        motif_len = len(motif)

        for pos in range(len(sequence) - motif_len + 1):
            # Returns a slice of the string, starting at index pos,
            # and going up to, but not including, index pos+motif_len.
            kmer = sequence[pos:pos+motif_len]
            
            if motif == kmer:
                if motif not in motifs_dict:
                    motifs_dict[motif] = set()
                motifs_dict[motif].add(pos)

    return motifs_dict

find_motifs_v1(seq, consensus_motifs.values())

{'AGGAGG': {969, 1153, 1587, 1881, 1941},
 'TATAAT': {24, 509, 1467},
 'GGGCGG': {1859, 1871, 1885, 1889, 1916, 2084}}

In [11]:
def find_motifs_v2(sequence, motif_list):
    motifs_dict = {}
    
    for motif in motif_list:
        if motif in sequence:
            motif_len = len(motif)

            for pos in range(len(sequence) - motif_len + 1):
                kmer = sequence[pos:pos+motif_len]
                if motif == kmer:
                    if motif not in motifs_dict:
                        motifs_dict[motif] = set()
                    motifs_dict[motif].add(pos)

    return motifs_dict

%timeit find_motifs_v1(seq, consensus_motifs.values())
%timeit find_motifs_v2(seq, consensus_motifs.values())

1.6 ms ± 34.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
1.02 ms ± 30.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
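
The second version is faster because the <code>in</code> test lets it skip the scan for motifs that never occur in the sequence. The same idea can be pushed further by letting <code>str.find()</code> jump directly from one match to the next. The function below, <code>find_motifs_find</code>, is a hypothetical variant written for illustration (it is not part of the notebook above), shown on a short toy sequence.

```python
def find_motifs_find(sequence, motif_list):
    """Return {motif: set of start positions}, using str.find()."""
    motifs_dict = {}

    for motif in motif_list:
        # str.find() returns -1 when there is no (further) match.
        pos = sequence.find(motif)
        while pos != -1:
            motifs_dict.setdefault(motif, set()).add(pos)
            # Restart one position later so overlapping matches are kept.
            pos = sequence.find(motif, pos + 1)

    return motifs_dict

toy_seq = 'AAATATAATAA'
toy_result = find_motifs_find(toy_seq, ['AA', 'TATAAT', 'GGG'])
print(toy_result)
```

Motifs that are absent, such as 'GGG' here, never get a key in the result, which matches the behaviour of the versions above.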


A Python dictionary raises a <code>KeyError</code> if you try to get an item with a key that is not currently in the dictionary.

In [12]:
motifs_dict = {}

# Uncommenting the next line raises KeyError: 'TATAAT'
#motifs_dict['TATAAT'].add(3)
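
To see the error without stopping the notebook, the failing lookup can be wrapped in <code>try</code>/<code>except</code>. The sketch below also shows two built-in ways of handling a missing key with a plain dictionary, <code>dict.get()</code> and <code>dict.setdefault()</code>.

```python
motifs_dict = {}

try:
    motifs_dict['TATAAT'].add(3)
except KeyError as error:
    print('KeyError:', error)

# get() returns None (or a chosen default) instead of raising.
print(motifs_dict.get('TATAAT'))  # None

# setdefault() inserts the default only if the key is missing,
# then returns the stored value.
motifs_dict.setdefault('TATAAT', set()).add(3)
print(motifs_dict)  # {'TATAAT': {3}}
```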

A <code>defaultdict</code> lets you specify, when the container is created, a factory function that supplies the default value for missing keys.

In [13]:
from collections import defaultdict

def find_motifs_v3(sequence, motif_list):
    motifs_dict = defaultdict(set)
    
    for motif in motif_list:
        if motif in sequence:
            motif_len = len(motif)

            for pos in range(len(sequence) - motif_len + 1):
                kmer = sequence[pos:pos+motif_len]
                if motif == kmer:
                    # This check is no longer needed: defaultdict creates
                    # the empty set the first time a key is accessed.
                    #if motif not in motifs_dict.keys():
                    #    motifs_dict[motif] = set()
                    motifs_dict[motif].add(pos)

    return motifs_dict

find_motifs_v3(seq, consensus_motifs.values())

defaultdict(set,
            {'AGGAGG': {969, 1153, 1587, 1881, 1941},
             'TATAAT': {24, 509, 1467},
             'GGGCGG': {1859, 1871, 1885, 1889, 1916, 2084}})
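
The same pattern works with other factories. As a sketch, <code>defaultdict(int)</code> starts every missing key at 0, which removes the need to pre-fill a counting dictionary. The function name <code>count_bases</code> is illustrative. One caveat worth knowing: merely reading a missing key with square brackets inserts it.

```python
from collections import defaultdict

def count_bases(dna):
    # int() returns 0, so every new base starts from a count of zero.
    counts = defaultdict(int)
    for base in dna:
        counts[base] += 1
    return counts

counts = count_bases('GATTACA')
print(dict(counts))  # {'G': 1, 'A': 3, 'T': 2, 'C': 1}

# Caveat: a lookup with [] on a missing key creates that key.
print('N' in counts)  # False
print(counts['N'])    # 0
print('N' in counts)  # True
```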