<img src="IMAGES_PYTHON_COURSE_2018/LOGO_MEINBIO_TRANSPARENT.png" width="200" height="200" />

# <center>Python course 2018
### <center>MeInBio Training Group 
#### <center> by Florian Heyl & Francesco Ferrari 

# PART 6: DICTIONARIES

For some types of problems, you may want to associate one value with another. For example, we may want to associate each gene name with its length, or associate each gene name with a list of information about that gene, such as its length, its symbol, its expression level, etc. 

A Python dictionary allows you to do just that! You can associate one item, which we call the "key", with a certain "value", which can be any Python object (a string, a number, a list, a tuple, a dictionary, a dataframe, etc.). 

The syntax for creating a dictionary is similar to that used for creating a list, but we use curly brackets rather than square ones. Each pair of data, consisting of a key and a value, is called an item. When storing items in a dictionary, we separate them with commas. Within an individual item, we separate the key and the value with a colon.
Here's a bit of code that creates a dictionary of restriction enzymes with three items:

In [None]:
# create a dictionary

enzymes = { 'EcoRI':'GAATTC', 'AvaII':'GG(A|T)CC', 'BisI':'GC[ATGC]GC' }

To retrieve a bit of data from the dictionary – i.e. to look up the motif for a particular enzyme – we write the name of the dictionary, followed by the key in square brackets:

In [None]:
# retrieve a value from a dictionary through its associated key

print(enzymes['AvaII'])
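
The values do not have to be strings. As a minimal sketch of the "gene name to list of information" idea mentioned at the start, the example below stores a list for each key. The gene names and numbers are made-up placeholders, not real data.

```python
# hypothetical records: [length, symbol, expression level] (placeholder values)
gene_info = {
    'geneA': [1200, 'GA', 35.2],
    'geneB': [860, 'GB', 4.7],
}

# look up the list by key, then unpack it into separate variables
length, symbol, expression = gene_info['geneA']
print(symbol, length, expression)

# assigning to a new key adds an item to the dictionary
gene_info['geneC'] = [2310, 'GC', 0.0]
print(len(gene_info))
```

Assigning to a key that already exists replaces its value instead of adding a second item.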

### IMPORTANT NOTE: 

There are three important properties of dictionaries that we should ***NEVER*** forget:

1) Keys must be immutable (more precisely, hashable) objects, such as strings, numbers or tuples of immutable elements. Lists and dictionaries cannot be used as keys.

2) The keys **MUST** be **UNIQUE**; if we assign several values to the same key, this will not work as intended. Only the last value assigned will be associated with that key.

3) Dictionaries are accessed by key, not by position (contrary to the objects in a list). Since Python 3.7 a dictionary remembers the order in which its keys were inserted; in older versions the order of the keys was arbitrary, so do not rely on it if your code must run there.
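
A small sketch of the first two points: a repeated key keeps only the last value, a tuple is accepted as a key, and a list is rejected. The variable names and values are placeholders chosen for illustration.

```python
# the same key written twice: only the last value survives
duplicated = {'EcoRI': 'first value', 'EcoRI': 'second value'}
print(duplicated)

# a tuple is a valid key (here: a hypothetical chromosome/position pair)
bases = {('chr1', 100): 'A'}
print(bases[('chr1', 100)])

# a list is not a valid key: Python raises a TypeError
try:
    invalid = {['chr1', 100]: 'A'}
except TypeError as err:
    error_message = str(err)
    print("TypeError:", error_message)
```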


Usually, we don't build dictionaries by writing each key-value pair by hand. Often, we populate the dictionary through a for loop. 

For example, given a sequence, we want to count all possible trinucleotides that are present in that sequence. 

In [None]:
# count the occurrences of all possible trinucleotides that are present in a DNA sequence
# (note: str.count() counts non-overlapping matches only)

sequence = "ATGCGCGTAGTTCAGATTTAAAACGCGATTACGTGACT"
nucleotides = ['A','T','C','G']

trinuc_dic = {}
for n in nucleotides:
    for nn in nucleotides:
        for nnn in nucleotides:
            trinuc_dic[n+nn+nnn] = sequence.count(n+nn+nnn)
print(trinuc_dic)
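
One caveat about the code above: `str.count` only counts non-overlapping matches, so runs such as "AAAA" are undercounted. The sketch below compares it with a position-by-position scan on a short toy string (`toy_seq` is a name introduced here for illustration).

```python
toy_seq = "AAAA"
motif = "AAA"

# str.count() resumes searching after the end of each match
non_overlapping = toy_seq.count(motif)

# checking every start position also finds overlapping matches
overlapping = 0
for start in range(len(toy_seq) - len(motif) + 1):
    if toy_seq[start:start + len(motif)] == motif:
        overlapping += 1

print(non_overlapping, overlapping)  # prints: 1 2
```

Which of the two counts is appropriate depends on the biological question being asked.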


Then, if we want to retrieve the number of occurrences of the trinucleotide "CGT", how do we do it?

In [None]:
### insert your code in the space below





### Exercise 1 (medium)

Given a sequence and a value for k, count how many times each k-mer is present in the sequence.

(hint: if k is 4, that means that we want to count all possible four-nucleotide combinations. To do that for any k, we can use a function from the itertools library, as in the example below)

In [None]:
import itertools as it

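def get_possible_kmers(k=3):
    """Return a list with all possible DNA k-mers of length k (1 <= k <= 10)."""

    # stop here for unsupported values of k; otherwise the function
    # would reach the return statement with combi undefined
    if k < 1 or k > 10:
        raise ValueError("Choose a value of k between 1 and 10")

    # all ordered combinations (with repetition) of the four nucleotides
    combi = it.product(["A","T","G","C"],repeat=k)
    combi = ["".join(j) for j in combi]

    # uncomment the next line to inspect the result
    #print("The first 10 items in the list are:\n {}\n\nThe list contains {} {}-mers".format(combi[:10], len(combi), k))
    return combi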
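
To see what `itertools.product` does on its own, here is a minimal sketch using only two letters, so that the output stays short. The variable names are chosen for illustration.

```python
import itertools as it

# product(..., repeat=2) yields every ordered pair, as tuples
pairs = list(it.product(["A", "T"], repeat=2))
print(pairs)

# joining each tuple turns it into a string
dimers = ["".join(p) for p in pairs]
print(dimers)  # ['AA', 'AT', 'TA', 'TT']

# with four letters and repeat=k there are 4**k combinations
print(len(list(it.product(["A", "T", "G", "C"], repeat=3))))  # 64
```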


In [None]:
possible_kmers = get_possible_kmers(k=4) # select the k of your choice (from 1 to 10)

sequence_1 = "ATTCGATGCTAGCTTGGGATATACGGGCCCCCGGATGACCCCTGGAAACATGCGATGCATCCCTAAAAGT"

kmer_dic = {}

# now it's up to you! Insert your code in the space below







Now, try to retrieve a value using a key that does not exist in the dictionary. For example, try to retrieve the count for the key "NNN":

In [None]:
print(kmer_dic["NNN"])

You get a **KeyError**. Remember it, as you may encounter it often when working with dictionaries. 

We can also use the dictionary's **get** method to retrieve values from a dictionary. What makes **get** really useful is that it takes an optional second argument: the default value to be returned if the key isn't present in the dictionary. Without this second argument, **get** returns `None` for a missing key instead of raising a **KeyError**.

In [None]:
# syntax for get method

print(kmer_dic.get("NNN",0))
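
The same behaviour can be checked on the small `enzymes` dictionary from the beginning of this part (redefined here so that the cell runs on its own). The sketch also shows the `in` operator, another common way to test for a key before using it; 'HindIII' is just an example of a key that was never stored.

```python
enzymes = {'EcoRI': 'GAATTC', 'AvaII': 'GG(A|T)CC', 'BisI': 'GC[ATGC]GC'}

# key present: get behaves like enzymes['EcoRI']
present = enzymes.get('EcoRI')

# key absent, no default given: get returns None
missing = enzymes.get('HindIII')

# key absent, default given: get returns the default
fallback = enzymes.get('HindIII', 'unknown')

print(present, missing, fallback)

# the in operator tests whether a key exists
if 'HindIII' not in enzymes:
    print("HindIII is not in the dictionary")
```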

- ## Iterating over a dictionary 

We can iterate over a dictionary in two main ways:

1) iterate over the keys of a dictionary;  
2) iterate over items (pairs of key, value)

Here are some examples that show you the syntax to use

In [None]:
# 1) 
# get all keys in a dictionary
all_keys = enzymes.keys()
print("The keys present in the dictionary are:\n{}\n".format(list(all_keys)))

# iterate over the keys to retrieve the corresponding value of each key
for key in all_keys:
    print("{} --> {}".format(key, enzymes[key]))
print("\n")


# 2)
# get all items in a dictionary
all_items = enzymes.items()
print("The tuples (key,value) present in the dictionary are:\n{}\n".format(list(all_items)))

# iterate over key and values at the same time
for key, value in all_items:
    print("{} --> {}".format(key, value))
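
Two further variations that may be handy, sketched under the assumption that the `enzymes` dictionary is defined as at the start of this part (it is redefined here so that the cell runs on its own): iterating over the values only, and iterating over the keys in alphabetical order.

```python
enzymes = {'EcoRI': 'GAATTC', 'AvaII': 'GG(A|T)CC', 'BisI': 'GC[ATGC]GC'}

# iterate over the values only
for motif in enzymes.values():
    print(motif)

# sorted() applied to a dictionary returns a sorted list of its keys
sorted_names = sorted(enzymes)
for name in sorted_names:
    print("{} --> {}".format(name, enzymes[name]))
```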

### Exercise 2 (easy/medium)

Given a dictionary storing the genetic code and a DNA sequence, translate the DNA sequence into the corresponding amino acid (protein) sequence. Assume that the sequence you are given is the open reading frame (ORF) of the gene (so you don't have to search through all possible ORFs).

In [None]:
# genetic code dictionary
gencode = {
'ATA':'I', 'ATC':'I', 'ATT':'I', 'ATG':'M',
'ACA':'T', 'ACC':'T', 'ACG':'T', 'ACT':'T',
'AAC':'N', 'AAT':'N', 'AAA':'K', 'AAG':'K',
'AGC':'S', 'AGT':'S', 'AGA':'R', 'AGG':'R',
'CTA':'L', 'CTC':'L', 'CTG':'L', 'CTT':'L',
'CCA':'P', 'CCC':'P', 'CCG':'P', 'CCT':'P',
'CAC':'H', 'CAT':'H', 'CAA':'Q', 'CAG':'Q',
'CGA':'R', 'CGC':'R', 'CGG':'R', 'CGT':'R',
'GTA':'V', 'GTC':'V', 'GTG':'V', 'GTT':'V',
'GCA':'A', 'GCC':'A', 'GCG':'A', 'GCT':'A',
'GAC':'D', 'GAT':'D', 'GAA':'E', 'GAG':'E',
'GGA':'G', 'GGC':'G', 'GGG':'G', 'GGT':'G',
'TCA':'S', 'TCC':'S', 'TCG':'S', 'TCT':'S',
'TTC':'F', 'TTT':'F', 'TTA':'L', 'TTG':'L',
'TAC':'Y', 'TAT':'Y', 'TAA':'_', 'TAG':'_',
'TGC':'C', 'TGT':'C', 'TGA':'_', 'TGG':'W'}

sequence_gapdh = """GCTCTCTGCTCCTCCTGTTCGACAGTCAGCCGCATCTTCTTTTGCGTCGCCAGGTGAAGACGGGCGGAGA
GAAACCCGGGAGGCTAGGGACGGCCTGAAGGCGGCAGGGGCGGGCGCAGGCCGGATGTGTTCGCGCCGCT
GCGGGGTGGGCCCGGGCGGCCTCCGCATTGCAGGGGCGGGCGGAGGACGTGATGCGGCGCGGGCTGGGCA
TGGAGGCCTGGTGGGGGAGGGGAGGGGAGGCGTGTGTGTCGGCCGGGGCCACTAGGCGCTCACTGTTCTC
TCCCTCCGCGCAGCCGAGCCACATCGCTCAGACACCATGGGGAAGGTGAAGGTCGGAGTCAACGGGTGAG
TTCGCGGGTGGCTGGGGGGCCCTGGGCTGCGACCGCCCCCGAACCGCGTCTACGAGCCTTGCGGGCTCCG
GGTCTTTGCAGTCGTATGGGGGCAGGGTAGCTGTTCCCCGCAAGGAGAGCTCAAGGTCAGCGCTCGGACC
TGGCGGAGCCCCGCACCCAGGCTGTGGCGCCCTGTGCAGCTCCGCCCTTGCGGCGCCATCTGCCCGGAGC
CTCCTTCCCCTAGTCCCCAGAAACAGGAGGTCCCTACTCCCGCCCGAGATCCCGACCCGGACCCCTAGGT
GGGGGACGCTTTCTTTCCTTTCGCGCTCTGCGGGGTCACGTGTCGCAGAGGAGCCCCTCCCCCACGGCCT
CCGGCACCGCAGGCCCCGGGATGCTAGTGCGCAGCGGGTGCATCCCTGTCCGGATGCTGCGCCTGCGGTA
GAGCGGCCGCCATGTTGCAACCGGGAAGGAAATGAATGGGCAGCCGTTAGGAAAGCCTGCCGGTGACTAA
CCCTGCGCTCCTGCCTCGATGGGTGGAGTCGCGTGTGGCGGGGAAGTCAGGTGGAGCGAGGCTAGCTGGC
CCGATTTCTCCTCCGGGTGATGCTTTTCCTAGATTATTCTCTGGTAAATCAAAGAAGTGGGTTTATGGAG
GTCCTCTTGTGTCCCCTCCCCGCAGAGGTGTGGTGGCTGTGGCATGGTGCCAAGCCGGGAGAAGCTGAGT
CATGGGTAGTTGGAAAAGGACATTTCCACCGCAAAATGGCCCCTCTGGTGGTGGCCCCTTCCTGCAGCGC
CGGCTCACCTCACGGCCCCGCCCTTCCCCTGCCAGCCTAGCGTTGACCCGACCCCAAAGGCCAGGCTGTA
AATGTCACCGGGAGGATTGGGTGTCTGGGCGCCTCGGGGAACCTGCCCTTCTCCCCATTCCGTCTTCCGG
AAACCAGATCTCCCACCGCACCCTGGTCTGAGGTTAAATATAGCTGCTGACCTTTCTGTAGCTGGGGGCC
TGGGCTGGGGCTCTCTCCCATCCCTTCTCCCCACACACATGCACTTACCTGTGCTCCCACTCCTGATTTC
TGGAAAAGAGCTAGGAAGGACAGGCAACTTGGCAAATCAAAGCCCTGGGACTAGGGGGTTAAAATACAGC
TTCCCCTCTTCCCACCCGCCCCAGTCTCTGTCCCTTTTGTAGGAGGGACTTAGAGAAGGGGTGGGCTTGC
CCTGTCCAGTTAATTTCTGACCTTTACTCCTGCCCTTTGAGTTTGATGATGCTGAGTGTACAAGCGTTTT
CTCCCTAAAGGGTGCAGCTGAGCTAGGCAGCAGCAAGCATTCCTGGGGTGGCATAGTGGGGTGGTGAATA
CCATGTACAAAGCTTGTGCCCAGACTGTGGGTGGCAGTGCCCCACATGGCCGCTTCTCCTGGAAGGGCTT
CGTATGACTGGGGGTGTTGGGCAGCCCTGGAGCCTTCAGTTGCAGCCATGCCTTAAGCCAGGCCAGCCTG
GCAGGGAAGCTCAAGGGAGATAAAATTCAACCTCTTGGGCCCTCCTGGGGGTAAGGAGATGCTGCATTCG
CCCTCTTAATGGGGAGGTGGCCTAGGGCTGCTCACATATTCTGGAGGAGCCTCCCCTCCTCATGCCTTCT
TGCCTCTTGTCTCTTAGATTTGGTCGTATTGGGCGCCTGGTCACCAGGGCTGCTTTTAACTCTGGTAAAG
TGGATATTGTTGCCATCAATGACCCCTTCATTGACCTCAACTACATGGTGAGTGCTACATGGTGAGCCCC
AAAGCTGGTGTGGGAGGAGCCACCTGGCTGATGGGCAGCCCCTTCATACCCTCACGTATTCCCCCAGGTT
TACATGTTCCAATATGATTCCACCCATGGCAAATTCCATGGCACCGTCAAGGCTGAGAACGGGAAGCTTG
TCATCAATGGAAATCCCATCACCATCTTCCAGGAGTGAGTGGAAGACAGAATGGAAGAAATGTGCTTTGG
GGAGGCAACTAGGATGGTGTGGCTCCCTTGGGTATATGGTAACCTTGTGTCCCTCAATATGGTCCTGTCC
CCATCTCCCCCCCACCCCCATAGGCGAGATCCCTCCAAAATCAAGTGGGGCGATGCTGGCGCTGAGTACG
TCGTGGAGTCCACTGGCGTCTTCACCACCATGGAGAAGGCTGGGGTGAGTGCAGGAGGGCCCGCGGGAGG
GGAAGCTGACTCAGCCCTGCAAAGGCAGGACCCGGGTTCATAACTGTCTGCTTCTCTGCTGTAGGCTCAT
TTGCAGGGGGGAGCCAAAAGGGTCATCATCTCTGCCCCCTCTGCTGATGCCCCCATGTTCGTCATGGGTG
TGAACCATGAGAAGTATGACAACAGCCTCAAGATCATCAGGTGAGGAAGGCAGGGCCCGTGGAGAAGCGG
CCAGCCTGGCACCCTATGGACACGCTCCCCTGACTTGCGCCCCGCTCCCTCTTTCTTTGCAGCAATGCCT
CCTGCACCACCAACTGCTTAGCACCCCTGGCCAAGGTCATCCATGACAACTTTGGTATCGTGGAAGGACT
CATGGTATGAGAGCTGGGGAATGGGACTGAGGCTCCCACCTTTCTCATCCAAGACTGGCTCCTCCCTGCC
GGGGCTGCGTGCAACCCTGGGGTTGGGGGTTCTGGGGACTGGCTTTCCCATAATTTCCTTTCAAGGTGGG
GAGGGAGGTAGAGGGGTGATGTGGGGAGTACGCTGCAGGGCCTCACTCCTTTTGCAGACCACAGTCCATG
CCATCACTGCCACCCAGAAGACTGTGGATGGCCCCTCCGGGAAACTGTGGCGTGATGGCCGCGGGGCTCT
CCAGAACATCATCCCTGCCTCTACTGGCGCTGCCAAGGCTGTGGGCAAGGTCATCCCTGAGCTGAACGGG
AAGCTCACTGGCATGGCCTTCCGTGTCCCCACTGCCAACGTGTCAGTGGTGGACCTGACCTGCCGTCTAG
AAAAACCTGCCAAATATGATGACATCAAGAAGGTGGTGAAGCAGGCGTCGGAGGGCCCCCTCAAGGGCAT
CCTGGGCTACACTGAGCACCAGGTGGTCTCCTCTGACTTCAACAGCGACACCCACTCCTCCACCTTTGAC
GCTGGGGCTGGCATTGCCCTCAACGACCACTTTGTCAAGCTCATTTCCTGGTATGTGGCTGGGGCCAGAG
ACTGGCTCTTAAAAAGTGCAGGGTCTGGCGCCCTCTGGTGGCTGGCTCAGAAAAAGGGCCCTGACAACTC
TTTTCATCTTCTAGGTATGACAACGAATTTGGCTACAGCAACAGGGTGGTGGACCTCATGGCCCACATGG
CCTCCAAGGAGTAAGACCCCTGGACCACCAGCCCCAGCAAGAGCACAAGAGGAAGAGAGAGACCCTCACT
GCTGGGGAGTCCCTGCCACACTCAGTCCCCCACCACACTGAATCTCCCCTCCTCACAGTTGCCATGTAGA
CCCCTTGAAGAGGGGAGGGGCCTAGGGAGCCGCACCTTGTCATGTACCATCAATAAAGTACCCTGTGCTC
AACCAGTTA"""

sequence_gapdh = sequence_gapdh.replace("\n","")


# insert your code in the space below

    




Expected Outcome  

ALCSSCSTVSRIFFCVAR_RRAERNPGG_GRPEGGRGGRRPDVFAPLRGGPGRPPHCRGGRRT_CGAGWAWRPGGGGEGRRVCRPGPLGAHCSLPPRSRATSLRHHGEGEGRSQRVSSRVAGGPWAATAPEPRLRALRAPGLCSRMGAG_LFPARRAQGQRSDLAEPRTQAVAPCAAPPLRRHLPGASFP_SPETGGPYSRPRSRPGPLGGGRFLSFRALRGHVSQRSPSPTASGTAGPGMLVRSGCIPVRMLRLR_SGRHVATGKEMNGQPLGKPAGD_PCAPASMGGVACGGEVRWSEASWPDFSSG_CFS_IILW_IKEVGLWRSSCVPSPQRCGGCGMVPSREKLSHG_LEKDISTAKWPLWWWPLPAAPAHLTAPPFPCQPSVDPTPKARL_MSPGGLGVWAPRGTCPSPHSVFRKPDLPPHPGLRLNIAADLSVAGGLGWGSLPSLLPTHMHLPVLPLLISGKELGRTGNLANQSPGTRGLKYSFPSSHPPQSLSLL_EGLREGVGLPCPVNF_PLLLPFEFDDAECTSVFSLKGAAELGSSKHSWGGIVGW_IPCTKLVPRLWVAVPHMAASPGRASYDWGCWAALEPSVAAMP_ARPAWQGSSREIKFNLLGPPGGKEMLHSPS_WGGGLGLLTYSGGASPPHAFLPLVS_IWSYWAPGHQGCF_LW_SGYCCHQ_PLH_PQLHGECYMVSPKAGVGGATWLMGSPFIPSRIPPGLHVPI_FHPWQIPWHRQG_EREACHQWKSHHHLPGVSGRQNGRNVLWGGN_DGVAPLGIW_PCVPQYGPVPISPPPP_ARSLQNQVGRCWR_VRRGVHWRLHHHGEGWGECRRARGRGS_LSPAKAGPGFITVCFSAVGSFAGGSQKGHHLCPLC_CPHVRHGCEP_EV_QQPQDHQVRKAGPVEKRPAWHPMDTLP_LAPRSLFLCSNASCTTNCLAPLAKVIHDNFGIVEGLMV_ELGNGTEAPTFLIQDWLLPAGAACNPGVGGSGDWLSHNFLSRWGGR_RGDVGSTLQGLTPFADHSPCHHCHPEDCGWPLRETVA_WPRGSPEHHPCLYWRCQGCGQGHP_AEREAHWHGLPCPHCQRVSGGPDLPSRKTCQI__HQEGGEAGVGGPPQGHPGLH_APGGLL_LQQRHPLLHL_RWGWHCPQRPLCQAHFLVCGWGQRLALKKCRVWRPLVAGSEKGP_QLFSSSRYDNEFGYSNRVVDLMAHMASKE_DPWTTSPSKSTRGRERPSLLGSPCHTQSPTTLNLPSSQLPCRPLEEGRGLGSRTLSCTINKVPCAQPV?