# `lab03`—DNA & RNA Sequencing

![](./img/header-bkgd.png)

❖ Objectives

-   Write a simple function to implement a mathematical formula.  <!-- connor -->
-   Use functions to modularize code.  <!-- geppar -->
-   Explain how variable scope impacts what the program "sees".  <!-- somzym -->
-   Understand the difference between _returning_ a value and _printing_ a value.  <!-- dinlor -->
-   Use default values in functions.  <!-- luxget -->

### DNA Elements

A DNA sequence is composed of adenine (`'A'`), guanine (`'G'`), cytosine (`'C'`), and thymine (`'T'`) nucleobases.  During transcription, the first step of gene expression, each DNA nucleobase is read off as its complementary RNA base.  An RNA sequence is therefore a string containing uracil (`'U'`), cytosine (`'C'`), guanine (`'G'`), and adenine (`'A'`) bases<sup>[[Types of RNA](https://infogalactic.com/info/RNA#Types_of_RNA)]</sup>.  (RNA does not contain thymine, so A in DNA pairs with U in RNA.)

| Symbol | Name     | Complementary Base |
|--------|----------|--------------------|
| A      | adenine  | T (DNA); U (RNA)   |
| C      | cytosine | G                  |
| G      | guanine  | C                  |
| T      | thymine  | A                  |
| U      | uracil   | A                  |

This multi-part problem will lead you through transcribing DNA sequence data into RNA and then translating the RNA codons into amino acids.

![](https://oerpub.github.io/epubjs-demo-book/resources/0324_DNA_Translation_and_Codons.jpg)

In this notebook, you will create short programs and functions to carry out each part of this process.

#### Complementing DNA

The first step in the process above is to transcribe DNA into RNA.  We are going to write two functions, each of which uses an accumulator pattern with a loop and some comparison logic: one converts DNA to RNA, and the other converts RNA back to DNA.
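
The accumulator pattern mentioned above can be sketched on an unrelated task.  The function `mask_digits` below is a hypothetical example (it is not part of this lab): it starts with an empty string, examines one character at a time, and appends either a replacement or the original character to the accumulator.

```python
# Hypothetical example of the accumulator pattern:
# replace every digit in a string with '#'.
def mask_digits( text ):
    masked = ''                          # the accumulator starts empty
    for symbol in text:
        if symbol in '0123456789':
            masked = masked + '#'        # swap each digit for a placeholder
        else:
            masked = masked + symbol     # keep everything else unchanged
    return masked

print( mask_digits( 'room 101' ) )       # prints: room ###
```

The same shape (empty accumulator, loop, comparison, append, `return`) applies to the functions you will write below.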

-   Write a function `dna2rna` which accepts a string `seq_dna` representing a template strand of DNA.  `dna2rna` should `return` a string `seq_rna` which should contain the RNA strand corresponding to its DNA input.  That is, the input `'ACGT'` should return `'UGCA'`.  The function can expect upper-case input.

![](./img/dna2rna.png)

In [None]:
# define your function here
def dna2rna( seq_dna ):
    seq_rna = ''
    for symbol in seq_dna:
        if symbol == 'A':
            seq_rna = seq_rna + 'U'
        elif symbol == 'T':
            # you write the rest of this function
    return seq_rna

In [None]:
# it should pass this test---do NOT edit this cell
# test for simple case
assert dna2rna( 'CGAT' ) == 'GCUA'
print( 'Success!' )

In [None]:
# it should pass this test---do NOT edit this cell
# test for a longer sequence
assert dna2rna( 'CGATAATTGCGGATTCAGATCGAAACGCG' ) == 'GCUAUUAACGCCUAAGUCUAGCUUUGCGC'
print( 'Success!' )

-   Now turn things around.  Write a function `rna2dna` which accepts a string `seq_rna` representing a strand of RNA.  `rna2dna` should `return` a string `seq_dna` containing the complementary DNA strand.  That is, the input `'ACGU'` should return `'TGCA'`.  The function can expect upper-case input.

In [None]:
# define your function here
def rna2dna( seq_rna ):
    # you write this function

In [None]:
# it should pass this test---do NOT edit this cell
# test for simple case
assert rna2dna( 'CGAU' ) == 'GCTA'
print( 'Success!' )

In [None]:
# it should pass this test---do NOT edit this cell
# test for a longer sequence
assert rna2dna( 'GCUAUUAACGCCUAAGUCUAGCUUUGCGC' ) == 'CGATAATTGCGGATTCAGATCGAAACGCG'
print( 'Success!' )

#### Mapping RNA to Amino Acids (Codons)

At this point, you have two functions which can convert DNA and RNA from one representation to the other.  Next, we need the ability to translate an RNA codon into an amino acid.

One of the major functions of RNA in the body is as “messenger RNA” (mRNA), which is read in three-letter groups called *codons*, each of which maps to an amino acid.  Thus if we find `CUU CAG` in mRNA, we expect the cell to add leucine and then glutamine to the protein it is building, written `LQ`:

    'CUUCAG' → 'LQ'

or, in terms of our program, we could write

    rna2amino( 'CUU' )

which yields

    'L'
    
and so forth.

The full table of codons follows.

<table class="wikitable">
    <h4>Standard genetic code<sup><a href="https://en.wikipedia.org/wiki/Genetic_code#RNA_codon_table">RNA codon table</a></sup></h4>
<tr>
<th rowspan="2">1st<br />
base</th>
<th colspan="8">2nd base</th>
<th rowspan="2">3rd<br />
base</th>
</tr>
<tr>
<th colspan="2">U</th>
<th colspan="2">C</th>
<th colspan="2">A</th>
<th colspan="2">G</th>
</tr>
<tr>
<th rowspan="4">U</th>
<td>UUU</td>
<td rowspan="2" style="background-color:#ffe75f">(Phe/F) Phenylalanine</td>
<td>UCU</td>
<td rowspan="4" style="background-color:#b3dec0">(Ser/S) Serine</td>
<td>UAU</td>
<td rowspan="2" style="background-color:#b3dec0">(Tyr/Y) Tyrosine</td>
<td>UGU</td>
<td rowspan="2" style="background-color:#b3dec0">(Cys/C) Cysteine</td>
<th>U</th>
</tr>
<tr>
<td>UUC</td>
<td>UCC</td>
<td>UAC</td>
<td>UGC</td>
<th>C</th>
</tr>
<tr>
<td>UUA</td>
<td rowspan="6" style="background-color:#ffe75f">(Leu/L) Leucine</td>
<td>UCA</td>
<td>UAA</td>
<td style="background-color:#B0B0B0;">Stop (<i>Ochre</i>)</td>
<td>UGA</td>
<td style="background-color:#B0B0B0;">Stop (<i>Opal</i>)</td>
<th>A</th>
</tr>
<tr>
<td>UUG</td>
<td>UCG</td>
<td>UAG</td>
<td style="background-color:#B0B0B0;">Stop (<i>Amber</i>)</td>
<td>UGG</td>
<td style="background-color:#ffe75f;">(Trp/W) Tryptophan&#160;&#160;&#160;&#160;</td>
<th>G</th>
</tr>
<tr>
<th rowspan="4">C</th>
<td>CUU</td>
<td>CCU</td>
<td rowspan="4" style="background-color:#ffe75f">(Pro/P) Proline</td>
<td>CAU</td>
<td rowspan="2" style="background-color:#bbbfe0">(His/H) Histidine</td>
<td>CGU</td>
<td rowspan="4" style="background-color:#bbbfe0">(Arg/R) Arginine</td>
<th>U</th>
</tr>
<tr>
<td>CUC</td>
<td>CCC</td>
<td>CAC</td>
<td>CGC</td>
<th>C</th>
</tr>
<tr>
<td>CUA</td>
<td>CCA</td>
<td>CAA</td>
<td rowspan="2" style="background-color:#b3dec0">(Gln/Q) Glutamine</td>
<td>CGA</td>
<th>A</th>
</tr>
<tr>
<td>CUG</td>
<td>CCG</td>
<td>CAG</td>
<td>CGG</td>
<th>G</th>
</tr>
<tr>
<th rowspan="4">A</th>
<td>AUU</td>
<td rowspan="3" style="background-color:#ffe75f">(Ile/I) Isoleucine</td>
<td>ACU</td>
<td rowspan="4" style="background-color:#b3dec0">(Thr/T) Threonine&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;</td>
<td>AAU</td>
<td rowspan="2" style="background-color:#b3dec0">(Asn/N) Asparagine</td>
<td>AGU</td>
<td rowspan="2" style="background-color:#b3dec0">(Ser/S) Serine</td>
<th>U</th>
</tr>
<tr>
<td>AUC</td>
<td>ACC</td>
<td>AAC</td>
<td>AGC</td>
<th>C</th>
</tr>
<tr>
<td>AUA</td>
<td>ACA</td>
<td>AAA</td>
<td rowspan="2" style="background-color:#bbbfe0">(Lys/K) Lysine</td>
<td>AGA</td>
<td rowspan="2" style="background-color:#bbbfe0">(Arg/R) Arginine</td>
<th>A</th>
</tr>
<tr>
<td>AUG<sup class="reference" id="ref_methionineA">[A]</sup></td>
<td style="background-color:#ffe75f;">(Met/M) Methionine</td>
<td>ACG</td>
<td>AAG</td>
<td>AGG</td>
<th>G</th>
</tr>
<tr>
<th rowspan="4">G</th>
<td>GUU</td>
<td rowspan="4" style="background-color:#ffe75f">(Val/V) Valine</td>
<td>GCU</td>
<td rowspan="4" style="background-color:#ffe75f">(Ala/A) Alanine</td>
<td>GAU</td>
<td rowspan="2" style="background-color:#f8b7d3">(Asp/D) Aspartic acid</td>
<td>GGU</td>
<td rowspan="4" style="background-color:#ffe75f">(Gly/G) Glycine</td>
<th>U</th>
</tr>
<tr>
<td>GUC</td>
<td>GCC</td>
<td>GAC</td>
<td>GGC</td>
<th>C</th>
</tr>
<tr>
<td>GUA</td>
<td>GCA</td>
<td>GAA</td>
<td rowspan="2" style="background-color:#f8b7d3">(Glu/E) Glutamic acid</td>
<td>GGA</td>
<th>A</th>
</tr>
<tr>
<td>GUG</td>
<td>GCG</td>
<td>GAG</td>
<td>GGG</td>
<th>G</th>
</tr>
</table>

We provide the function `rna2amino`, which accepts a three-letter codon and returns the corresponding amino acid.  It looks the codon up in `genetic_code`, which is a `dict` (covered in the lesson `python/dict`).  A `dict` can store tabular data for later lookup.  Stop codons are represented by `'*'`.

In [None]:
genetic_code = {
    'UUU': 'F', 'UUC': 'F', 'UUA': 'L', 'UUG': 'L',        'CUU': 'L', 'CUC': 'L', 'CUA': 'L', 'CUG': 'L',
    'AUU': 'I', 'AUC': 'I', 'AUA': 'I', 'AUG': 'M',        'GUU': 'V', 'GUC': 'V', 'GUA': 'V', 'GUG': 'V',
    
    'UCU': 'S', 'UCC': 'S', 'UCA': 'S', 'UCG': 'S',        'CCU': 'P', 'CCC': 'P', 'CCA': 'P', 'CCG': 'P',
    'ACU': 'T', 'ACC': 'T', 'ACA': 'T', 'ACG': 'T',        'GCU': 'A', 'GCC': 'A', 'GCA': 'A', 'GCG': 'A',
    
    'UAU': 'Y', 'UAC': 'Y', 'UAA': '*', 'UAG': '*',        'CAU': 'H', 'CAC': 'H', 'CAA': 'Q', 'CAG': 'Q',
    'AAU': 'N', 'AAC': 'N', 'AAA': 'K', 'AAG': 'K',        'GAU': 'D', 'GAC': 'D', 'GAA': 'E', 'GAG': 'E',
    
    'UGU': 'C', 'UGC': 'C', 'UGA': '*', 'UGG': 'W',        'CGU': 'R', 'CGC': 'R', 'CGA': 'R', 'CGG': 'R',
    'AGU': 'S', 'AGC': 'S', 'AGA': 'R', 'AGG': 'R',        'GGU': 'G', 'GGC': 'G', 'GGA': 'G', 'GGG': 'G',
}
allowed_codons = set('ACGU')

def rna2amino(codon):
    '''
    Convert a three-letter RNA codon to an amino acid.
    '''
    # Check for the correct length of the codon.
    if len(codon) != 3:
        return None
    codon = codon.upper()
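    # Check that the codon contains only valid bases.  An invalid codon
    # raises KeyError, just as the dictionary lookup below would.
    if not set(codon) <= allowed_codons:
        raise KeyError(codon)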
    return genetic_code[codon]
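
Because `genetic_code` is a `dict`, it helps to know what happens when a key is missing.  The sketch below uses a small hypothetical table, `atomic_number`, which is not part of this lab, to show a lookup by key, the `KeyError` raised for a missing key, and `dict.get` as an alternative that does not raise.

```python
# Hypothetical lookup table, unrelated to the genetic code.
atomic_number = { 'H': 1, 'He': 2, 'Li': 3 }

print( atomic_number[ 'He' ] )            # prints: 2

# Indexing with a key that is not in the dict raises KeyError.
try:
    atomic_number[ 'Xx' ]
except KeyError:
    print( 'Xx is not in the table' )

# dict.get returns None (or a default you choose) instead of raising.
print( atomic_number.get( 'Xx' ) )        # prints: None
print( atomic_number.get( 'Xx',0 ) )      # prints: 0
```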

Now we can convert an RNA codon to an amino acid trivially using `rna2amino`.  Test it out with a few codons:

In [None]:
rna2amino( 'CGU' )

-   We next need a function `str2seq` which accepts a string `rna_seq` containing RNA sequence data and maps it to a string of amino acids.  This requires that you:
    
    1.  Break the string into three-letter chunks.
    2.  For each chunk, map the codon to its amino acid according to the table above.  (We provide `rna2amino` for this step.)
    3.  Return the result.
    
![](./img/str2seq.png)

The tricky part is figuring out how to chop a string into three-letter chunks.  (This is harder than it seems at first.)  There are many ways to do this.  One possibility:

In [None]:
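example_string = 'abcdefghijklmnopqrstuvwxyz'
# integer division gives the number of complete three-letter chunks;
# the leftover letters ('yz' here) are not printed
for i in range( 0,len( example_string ) // 3 ):
    print( example_string[ 3*i:3*i+3 ] )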
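
Another possibility, offered only as a sketch, is to let `range` step through the starting positions three at a time.  The list name `chunks` is an assumption made for illustration.  Unlike the loop above, this version keeps a shorter final chunk when the length is not a multiple of three, so RNA data with an incomplete last codon would need separate handling.

```python
example_string = 'abcdefghijklmnopqrstuvwxyz'

chunks = []
# range( start,stop,step ) yields 0, 3, 6, ... up to the string length
for start in range( 0,len( example_string ),3 ):
    chunks.append( example_string[ start:start+3 ] )

print( chunks[ 0 ] )     # prints: abc
print( chunks[ -1 ] )    # prints: yz
```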

In [None]:
# define your function here
def str2seq( '''(delete this string and replace it with the incoming variables)''' ):
    # divide the string into three-letter chunks
    
    # map each three-letter codon to an amino acid using rna2amino
    
    # append the amino acid to the result string
    
    # return the result string

In [None]:
# it should pass this test---do NOT edit this cell
# test for simple case
assert str2seq( 'ACUGAU' ) == 'TD'
print( 'Success!' )

In [None]:
# it should pass this test---do NOT edit this cell
# test for a more complicated case
assert str2seq( 'AUCACUGUAGUAGUAGCUGGAAAGAGAAAUCUGUGACUCCAAUUAGCC' ) == 'ITVVVAGKRNL*LQLA'
print( 'Success!' )

In [None]:
# it should pass this test---do NOT edit this cell
# test for failure case
try:
    str2seq( 'ASDF' )
except KeyError:
    assert True
else:
    assert False
print( 'Success!  You rejected invalid data.' )

Finally, we are interested in taking a string of DNA sequence data, transcribing its RNA complement, and translating the resulting RNA to amino acids.  This requires that you:

1.  Convert the string from DNA to RNA.  (Which function does this?)
2.  Convert the RNA string to its corresponding amino acid string.  (Which function does this?)
3.  Return the resulting string.

-   Write a function `dna2seq` which accepts a string `dna_seq`.  This function will `return` (NOT `print`) a string containing the amino acids encoded by the string `dna_seq`.

![](./img/dna2seq.png)
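
The distinction between returning and printing matters here because the result of one function must be passed on to the next.  The sketch below uses two hypothetical functions, `area_printed` and `area_returned`, which are not part of this lab: a function that only prints gives `None` back to its caller, so nothing further can be computed from it.

```python
def area_printed( width,height ):
    print( width * height )     # displays the value, then returns None

def area_returned( width,height ):
    return width * height       # hands the value back to the caller

shown = area_printed( 2,3 )     # prints: 6
given = area_returned( 2,3 )    # prints nothing

print( shown )                  # prints: None
print( given + 1 )              # prints: 7
```

Only the returned value can be stored, tested with `assert`, or fed into another function.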

In [None]:
# define your function here
def dna2seq( '''(delete this string and replace it with the incoming variables)''' ):
    result = ''
    
    # Convert the string from DNA to RNA. (Which function does this?)
    
    # Convert the RNA string to its corresponding amino acid string. (Which function does this?)
    
    # Return the resulting string.

In [None]:
# it should pass this test---do NOT edit this cell
# test for simple case
test_data = 'ATGTTTTCCGGTGGCGGCGGCCCGCTGTCCCCCGGAGGAAAGTCGGCGGCCAGGGCGGCGTCCGGGTTTTTTGCGCCCGCCGGCCCTCGCGGAGCCGGCCGGGGACCCCCGCCTTGCTTGAGGCAAAACTTTTACAACCCCTACCTCGCCCCAGTCGGGACGCAACAGAAGCCGACCGGGCCAACCCAGCGCCATACGTACTATAGCGAATGCGATGAATTTCGATTCATCGCCCCGCGGGTGCTGGACGAGGATGCCCCCCCGGAGAAGCGCGCCGGGGTGCACGACGGTCACCTCAAGCGCGCCCCCAAGGTGTACTGCGGGGGGGACGAGCGCGACGTCCTCCGCGTCGGGTCGGGCGGCTTCTGGCCGCGGCGCTCGCGCCTGTGGGGCGGCGTGGACCACGCCCCGGCGGGGTTCAACCCCACCGTCACCGTCTTTCACGTGTACGACATCCTGGAGAACGTGGAGCACGCGTAC'
assert dna2seq( test_data ) == 'YKRPPPPGDRGPPFSRRSRRRPKKRGRPGAPRPAPGGGTNSVLKMLGMERGQPCVVFGWPGWVAVCMISLTLLKAK*RGAHDLLLRGGLFARPHVLPVEFARGFHMTPPLLALQEAQPSPPKTGAASADTPPHLVRGRPKLGWQWQKVHML*DLLHLVRM'
print( 'Success!' )