## Genome Computing
### Genome Sequences with Lists & Dictionaries

In [None]:
import random

There are four collection data types in the Python programming language:
<li><b>List is a collection which is ordered and changeable. Allows duplicate members.</b> </li>
<li>Tuple is a collection which is ordered and unchangeable. Allows duplicate members.</li>  
<li>Set is a collection which is unordered and unindexed. No duplicate members.</li>
<li><b>Dictionary is a collection which is ordered (by insertion, as of Python 3.7), changeable and indexed by key. No duplicate keys.</b></li>

When choosing a collection type, it helps to understand the properties of each type. Choosing the right type for a particular data set can preserve the meaning of the data, and it can also improve efficiency or security.  
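
As a rough illustration of these properties, here is a minimal sketch. The variable names are made up for this example and are not used elsewhere in the notebook.

```python
# Illustrative values only; these names are hypothetical.
demo_list  = ['A', 'C', 'G', 'T', 'A']   # list: ordered, changeable, duplicates allowed
demo_tuple = ('A', 'C', 'G', 'T', 'A')   # tuple: ordered, unchangeable
demo_set   = {'A', 'C', 'G', 'T', 'A'}   # set: the duplicate 'A' collapses
demo_dict  = {'A': 'T', 'C': 'G'}        # dictionary: values are looked up by key

demo_list[0] = 'G'                       # fine: lists are changeable

try:
    demo_tuple[0] = 'G'                  # tuples cannot be changed in place
except TypeError:
    tuple_is_fixed = True

print(len(demo_list), len(demo_set))
```

The list keeps all five entries while the set keeps only the four distinct bases.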

### List  
A list is a collection which is ordered and changeable. In Python, lists are written with square brackets. Some useful list methods to work with:

| Method |	Description |
|----|----|
|append() |	Adds an element at the end of the list |
|clear() |	Removes all the elements from the list |
|copy() |	Returns a copy of the list |
|count() |	Returns the number of elements with the specified value |
|extend() |	Add the elements of a list (or any iterable), to the end of the current list |
|index() |	Returns the index of the first element with the specified value |
|insert() |	Adds an element at the specified position |
|pop() |	Removes and returns the element at the specified position (the last element by default) |
|remove() |	Removes the first item with the specified value |
|reverse() |	Reverses the order of the list |
|sort() |	Sorts the list |
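
To see a few of these methods in action, here is a small sketch on a short, made-up list of bases. The name `demo_bases` is only for this example.

```python
# A short made-up list of bases to try the methods on.
demo_bases = list("AGTC")      # ['A', 'G', 'T', 'C']

demo_bases.append('G')         # add to the end
demo_bases.insert(0, 'T')      # add at position 0

number_of_g = demo_bases.count('G')   # how many 'G' entries
first_t     = demo_bases.index('T')   # index of the first 'T'

last_base = demo_bases.pop()   # removes AND returns the last element
demo_bases.remove('T')         # removes only the first 'T'

demo_bases.reverse()           # flips the order in place
demo_bases.sort()              # sorts in place (alphabetical for strings)

print(demo_bases)
```

Note that `reverse()` and `sort()` change the list itself and return `None`, so there is no need to assign their result.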

There are several ways to take a sequence string, such as <b>AGTCGGTAATTATACCGACTA</b>, and turn it into a list of its characters.



In [None]:
seq = "AGTCGGTAATTATACCGACTA"

print(seq)

We can start with the `list()` function

In [None]:
seq_list_1 = list(seq)

seq_list_1


We can make a `for` loop and `append()` to a list once for every character in the sequence string

In [None]:
seq_list_2 = []

for BASE_IDX in range( len(seq) ):
    seq_list_2.append( seq[BASE_IDX] )

seq_list_2

We can take the above `for` loop and condense it using <b>list comprehension</b>

* NOTE: <i>list comprehension</i> is often referred to as the more "Pythonic way" of doing a loop

In [None]:
seq_list_3 = [ seq[BASE_IDX] for BASE_IDX in range( len(seq) ) ]

seq_list_3

We can check that these 3 lists are equal using the `==` comparison operator. 

Remove the comment symbol from one of the lines below to see if the comparison is True. Do not uncomment all three at the same time, as the output will only be shown for the final expression


In [None]:
# seq_list_1 == seq_list_2
# seq_list_1 == seq_list_3
# seq_list_2 == seq_list_3

Use the code block below to make a list of a randomly-generated 125-bp insert between two sequences:


5-prime - AAGATACA -- random insert -- GGCGTATA - 3-prime

In [None]:
str1 = "AAGATACA"
str3 = "GGCGTATA"


# Store the nucleotides in a list called "bases"
bases = ['A','C','G','T']


# We can use the function library 'random' imported earlier here with the .choice() function to get any one of those 4 bases 125 times
LENGTH   = 125

sequence = [random.choice(bases) for times in range( LENGTH ) ]

str2     = ''.join(sequence)

del sequence, LENGTH


# Concatenate the three strings together

sequence = str1+str2+str3

#print(sequence)

In [None]:
# OPTIONAL: Use this code block to turn the 'sequence' string into a list using list comprehension
# ------------------------------------------------------------------------------------------------





A _list_ stores items, but if you want to store items based on specific indicators, you can use a __dictionary__. 

----------------------------------

"Genome" is defined in the dictionary as "the complete set of genes or genetic material present in a cell or organism". In finding this definition, you can say that you accessed some system (a <b>dictionary</b>), looked for a key phrase (a "<b>key</b>", which is "Genome"), and got something as a result (the "<b>value</b>", which is the definition).

A programming dictionary uses this same {"key":"value"} system. Think of the key as the word you want to learn about in an actual dictionary and its definition is its value. 

Dictionary items are ordered, can be of any data type, and do not allow duplicate keys (one can, however, have duplicated values). They are also changeable, meaning that we can alter, add, or remove items after the dictionary has been created.


With lists (using square braces), we can access using index notation.
With __dictionaries__ (using curly braces), we access by keys.
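
A minimal sketch of the two access styles, using made-up names (`demo_codon`, `demo_pairs`) that are not part of the rest of the notebook:

```python
demo_codon = ['A', 'T', 'G']           # list: square brackets
demo_pairs = {'A': 'T', 'C': 'G'}      # dictionary: curly braces

first_base = demo_codon[0]             # access by position (index 0)
partner    = demo_pairs['A']           # access by key ('A')

# Asking for a key that is not in the dictionary raises a KeyError
try:
    demo_pairs['N']
except KeyError:
    key_missing = True

# The "in" operator checks for a key without raising an error
has_n = 'N' in demo_pairs

print(first_base, partner, has_n)
```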

----------------------------------

Let's make a dictionary of complementary base pairs:


In [None]:
dna_comp_bases = {'A':'T',
                 'C':'G',
                 'G':'C',
                 'T':'A'}

print(dna_comp_bases)


Some great dictionary methods to work with:

| Method |	Description |
|----|----|
|clear() |	Removes all the elements from the dictionary |
|copy()	| Returns a copy of the dictionary |
|fromkeys() |	Returns a dictionary with the specified keys and value |
|get() |	Returns the value of the specified key |
|items() |	Returns a view object containing a tuple for each key-value pair |
|keys() |	Returns a view object containing the dictionary's keys |
|pop() |	Removes the element with the specified key |
|popitem() |	Removes the last inserted key-value pair |
|setdefault()	| Returns the value of the specified key. If the key does not exist: insert the key, with the specified value |
|update() |	Updates the dictionary with the specified key-value pairs |
|values() |	Returns a view object containing all the values in the dictionary |
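
Here is a short sketch of a few of these methods on a throwaway dictionary. The name `demo_pairs` and the placeholder `'?'` are assumptions made for this example only.

```python
demo_pairs = {'A': 'T', 'C': 'G', 'G': 'C', 'T': 'A'}

# get() returns a fallback instead of raising an error for a missing key
unknown = demo_pairs.get('N', '?')

# keys() gives a live view: it reflects later changes to the dictionary
keys_view = demo_pairs.keys()
demo_pairs.setdefault('N', 'N')        # 'N' is missing, so it is inserted
n_in_view = 'N' in keys_view

# items() yields (key, value) tuples, handy in a for loop
first_item = list(demo_pairs.items())[0]

# pop() removes a key and hands back its value
removed = demo_pairs.pop('N')

print(unknown, n_in_view, first_item, removed)
```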

What are the keys in the dna_comp_bases dictionary?

In [None]:
key_list = dna_comp_bases.keys()

print(key_list)
print()

# The code above returns a view of the keys in the dictionary. 
# It's not subscriptable: we cannot use index notation to access anything specifically in here


# Use the list() function to make a list of the keys
key_list = list( dna_comp_bases.keys() )


# NOW we can access this with index notation
print(key_list[0])


We have a sequence. We now have a way to get the complementary strand.

Start with the sequence.

Access the dictionary using every character in the sequence string


In [None]:
print()
print(sequence)

COMPSEQ = []

for BASE in range( len(sequence) ):

    # we know sequence[BASE] will give us a character at this index
    # that's our KEY
    # access the dictionary using dictionary_name[ ]
    
    COMPSEQ.append( dna_comp_bases[ sequence[BASE] ] )

comp_sequence = ''.join(COMPSEQ)
del COMPSEQ

print()
print(comp_sequence)
print()
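
As an alternative sketch, a string can be looped over directly, character by character, without going through `range( len(...) )`. The names `demo_comp_bases` and `demo_strand` are made up for this example, and the short strand is just the 5-prime flank from earlier.

```python
demo_comp_bases = {'A': 'T', 'C': 'G', 'G': 'C', 'T': 'A'}
demo_strand     = "AAGATACA"

# Each character of the string is used directly as the dictionary key
demo_complement = ''.join(demo_comp_bases[base] for base in demo_strand)

print(demo_strand)
print(demo_complement)
```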

We can use this sequence to get the RNA code that will be __transcribed__ from the DNA. Here we treat our sequence as the template strand, so the process is the same as determining the complementary strand, but instead of thymines we will use uracils (U).

Let's make the RNA sequence using list comprehension, making sure to report the RNA sequence in the correct reading direction: 5-prime to 3-prime


In [None]:
rna_bases = {'A':'U',
             'C':'G',
             'G':'C',
             'T':'A'}

RNASEQ = [ rna_bases[ sequence[BASE] ] for BASE in range( len(sequence) ) ]

# Want to flip the order of a list? Use .reverse()
RNASEQ.reverse()

rna_sequence = ''.join(RNASEQ)

del RNASEQ

print(sequence)
print()
print(rna_sequence)
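
Another way to flip the order, sketched here on a short made-up strand, is slice notation with a step of -1. This works on strings as well as lists. The names `demo_rna_bases` and `demo_strand` are assumptions for this example.

```python
demo_rna_bases = {'A': 'U', 'C': 'G', 'G': 'C', 'T': 'A'}
demo_strand    = "AAGATACA"

# Complement with uracil, in the same order as the DNA
demo_rna = ''.join(demo_rna_bases[base] for base in demo_strand)

# [::-1] returns a reversed COPY; the original string is untouched
demo_rna_5_to_3 = demo_rna[::-1]

print(demo_rna)
print(demo_rna_5_to_3)
```

Unlike `.reverse()`, which changes a list in place, slicing leaves the original untouched and returns a new object.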

________________

We get RNA from DNA using <b>transcription</b> where 1 DNA base --> 1 RNA base.

We can now get PROTEINS from RNA using <b>translation</b>, where 3 RNA bases --> 1 amino acid residue

![rna_to_aa_code_genome-gov.jpg](attachment:d3bb793e-7f3e-422d-bfa2-db850ab12298.jpg)

This image from the National Human Genome Research Institute Genetics Glossary page (https://www.genome.gov/genetics-glossary/Genetic-Code) is designed to be read in-to-out. 

<li>The "UCA" RNA trimer -- the "codon" -- found in the upper-right quadrant will result in a 'serine' amino acid residue.
<li>The "ACU" RNA codon found in the lower-left quadrant will result in a 'threonine' amino acid residue.

While DNA and RNA have 1-character codes, amino acid residues can use either a 1- or 3-character code, as seen here:


![amino-acid-code.png](attachment:41a36be3-ab7d-4db2-9dff-435900f48fad.png)


An interesting feature of translation is that there is one <i>START</i> RNA codon and three <i>STOP</i> codons. RNA is translated to protein only for the sequence between these codons.
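
The START/STOP idea can be sketched in a few lines. This is only an illustration under simplifying assumptions: the table `mini_table` holds just a handful of codons, the function name `translate_between_start_and_stop` is made up, and any codon missing from the table is reported as "X".

```python
# A deliberately tiny codon table: START (AUG = M), two residues, three STOPs
mini_table = {"AUG": "M", "GCU": "A", "UGG": "W",
              "UAA": "*", "UAG": "*", "UGA": "*"}

def translate_between_start_and_stop(rna, table):
    """Translate from the first AUG up to (not including) the first STOP."""
    start = rna.find("AUG")
    if start == -1:
        return ""                          # no START codon, nothing to translate
    residues = []
    # step through the RNA three bases at a time, beginning at the START codon
    for i in range(start, len(rna) - 2, 3):
        residue = table.get(rna[i:i+3], "X")
        if residue == "*":
            break                          # STOP codon ends translation
        residues.append(residue)
    return ''.join(residues)

demo_rna     = "GGAUGGCUUGGUAACC"
demo_protein = translate_between_start_and_stop(demo_rna, mini_table)
print(demo_protein)
```

The two bases before `AUG` and the two after `UAA` are ignored, because they fall outside the START and STOP codons.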

-------------------

Let's think about this programmatically. There are a bunch of codons in this circle. How many? I have three letters in a codon and each letter can be one of four:

[A, C, G, U] + [A, C, G, U] + [A, C, G, U]

<li>A base can be one of 4 possibilities. 
<li>A dimer can be one of 4x4 = 4^2 = 16 possibilities.
<li>A codon or trimer can be one of 4x4x4 = 4^3 = 64 possibilities, and several different codons code for the same amino acid.
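
The counting above can be checked with a short sketch using `itertools.product` from the standard library. The names `rna_letters`, `demo_dimers`, and `demo_codons` are made up for this example.

```python
from itertools import product

rna_letters = ['A', 'C', 'G', 'U']

# product(..., repeat=n) yields every ordered combination of n letters
demo_dimers = [''.join(combo) for combo in product(rna_letters, repeat=2)]
demo_codons = [''.join(combo) for combo in product(rna_letters, repeat=3)]

print(len(rna_letters), len(demo_dimers), len(demo_codons))
print(demo_codons[:4])
```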

    
Manually programming the translation of every codon can be quite taxing. 

Here is a dictionary I made earlier that can be of help. Note: "*" marks a STOP codon

In [None]:
dna_to_aa_dict = {
    "GGG":"G", "GGA":"G", "GGT":"G", "GGC":"G", 
    "GAG":"E", "GAA":"E", "GAT":"D", "GAC":"D", 
    "GTG":"V", "GTA":"V", "GTT":"V", "GTC":"V", 
    "GCG":"A", "GCA":"A", "GCT":"A", "GCC":"A", 
    "AGG":"R", "AGA":"R", "AGT":"S", "AGC":"S", 
    "AAG":"K", "AAA":"K", "AAT":"N", "AAC":"N", 
    "ATG":"M", "ATA":"I", "ATT":"I", "ATC":"I", 
    "ACG":"T", "ACA":"T", "ACT":"T", "ACC":"T", 
    "TGG":"W", "TGA":"*", "TGT":"C", "TGC":"C", 
    "TAG":"*", "TAA":"*", "TAT":"Y", "TAC":"Y", 
    "TTG":"L", "TTA":"L", "TTT":"F", "TTC":"F", 
    "TCG":"S", "TCA":"S", "TCT":"S", "TCC":"S", 
    "CGG":"R", "CGA":"R", "CGT":"R", "CGC":"R", 
    "CAG":"Q", "CAA":"Q", "CAT":"H", "CAC":"H", 
    "CTG":"L", "CTA":"L", "CTT":"L", "CTC":"L", 
    "CCG":"P", "CCA":"P", "CCT":"P", "CCC":"P"}


In [None]:
# What if I want to flip this:

aa_to_dna_dict = {}

for key, value in dna_to_aa_dict.items():
    if value not in aa_to_dna_dict:
        aa_to_dna_dict[value] = [key]
    else:
        aa_to_dna_dict[value].append(key)

    

In [None]:
aa_to_dna_dict

Here's the problem: How do we know which DNA sequence we want to use? This is where more advanced work with BIOINFORMATICS comes in.

Until then, we can simplify this dictionary to just include 1 codon per residue:


In [None]:
aa_to_dna_dict = {}

for key, value in dna_to_aa_dict.items():
    if value not in aa_to_dna_dict:
        aa_to_dna_dict[value] = key
aa_to_dna_dict

In [None]:
aa_sequence  = ''
codon_length = 3
number_of_codons = len(sequence)//codon_length   # any leftover bases are ignored

print(sequence)
print(number_of_codons)
print()

for i in range( number_of_codons ):

    # slice out the i-th codon and look up its amino acid
    codon = sequence[codon_length*i : codon_length*(i+1)]
    aa_sequence += dna_to_aa_dict[codon]

del codon_length, number_of_codons
    
print(aa_sequence)

In [None]:
# Let's start with a string of possible text
# Get the associated DNA sequence
# Then see what THAT sequence brings out

aa_sequence = "ITSNOTEASYBEINGGREEN*"
dna_sequence = ""

print(aa_sequence)
print()
for residue in aa_sequence:
    # letters with no codon in the dictionary (e.g. "O") are silently skipped
    if residue in aa_to_dna_dict:
        dna_sequence += aa_to_dna_dict[residue]

print(dna_sequence)
print()
# -------------------------------------------------------

aa_sequence_check = ''
codon_length = 3
number_of_codons = len(dna_sequence)//codon_length

for i in range( number_of_codons ):

    # slice out the i-th codon and translate it back to an amino acid
    codon = dna_sequence[codon_length*i : codon_length*(i+1)]
    aa_sequence_check += dna_to_aa_dict[codon]

del codon_length, number_of_codons
    
print(aa_sequence_check)

del aa_sequence, aa_sequence_check, dna_sequence

------------
New tool time:  `BioPython`

![biopython_logo_white.png](attachment:8843f639-ff15-48d6-b1dc-b58e41834706.png)

All of this work is already handled in BioPython, under the 'Seq' module (wiki here: https://biopython.org/wiki/Seq)

Therefore, I will show you these commands and will expect you to visit the Biopython Wiki page to read about what each of these code blocks does and why.

NOTE: One thing that BioPython does NOT do is reverse_translation. 

In [None]:
from Bio.Seq import Seq


In [None]:
dna1 = Seq(sequence)

dna1

In [None]:
dimers = [base1 + base2 for base1 in ['A','C','G','T'] for base2 in ['A','C','G','T']]

for dimer in dimers:
    print(dimer, dna1.count(dimer))

print()

for dimer in dimers:
    print(dimer, dna1.count_overlap(dimer))  
    
del dimer
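
The two loops above can give different numbers for the same dimer. As I understand the Biopython documentation, `count()` behaves like Python's string `count()` and skips overlapping matches, while `count_overlap()` includes them. The sketch below illustrates the difference with plain strings only; the helper `count_overlapping` is a made-up stand-in, not the Biopython implementation.

```python
def count_overlapping(text, pattern):
    """Count every position where pattern starts, overlaps included."""
    width = len(pattern)
    return sum(1 for i in range(len(text) - width + 1)
               if text[i:i+width] == pattern)

demo_text = "AAAAC"

non_overlapping = demo_text.count("AA")                # matches at 0 and 2 only
overlapping     = count_overlapping(demo_text, "AA")   # matches at 0, 1 and 2

print(non_overlapping, overlapping)
```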

In [None]:
dna1.complement()


In [None]:
dna1.transcribe()


In [None]:
dna1.translate()


In [None]:
dna1.translate(to_stop=True)


In [None]:
dna1.transcribe().translate(to_stop=True)


----------------------

In [None]:
# Dr. Robert Young, University of Maryland
# UMD FIRE Genome Computing