# Day 1 notebook

The objectives of this notebook are to practice
* the biological concepts of 
  * representing DNA molecules as strings
  * determining the complementary strand of a double-stranded DNA molecule
  * transcription of DNA to RNA
  * codons and their translation to amino acids
* the Python concepts of 
  * working with strings and lists
  * writing functions
  * looping over sequences
  * using list comprehensions
  * using dictionaries
* using and submitting Jupyter notebooks

## Providing answers to problems in Jupyter Notebooks

Much of your work in this class will be done in Jupyter Notebooks with a Python 3 kernel (Python 3.10, to be specific).  In general, you will be provided with a *template* notebook and your job will be to fill in the missing sections by following the specifications preceding each missing section.  The missing sections may be either code blocks, text (in Markdown), or perhaps even a plot or image.  

Following each missing section will be some tests (in the form of `assert` statements) that check the correctness of your solutions.  Some tests will be visible to you and others may be hidden for the purposes of grading.  You do not need to modify the test blocks, but you should certainly run them to see whether your code passes the visible tests.  If you modify a test block by accident, do not worry: the test blocks are automatically restored to their original contents when you submit.
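
As a minimal sketch of how such a test behaves (the function `double` is hypothetical and not part of this notebook), an `assert` statement does nothing when its condition is true and raises an `AssertionError` carrying the optional message when the condition is false:

```python
def double(x):
    # hypothetical function used only to illustrate assert
    return 2 * x

# passes silently because the condition is True
assert double(3) == 6, "double(3) should be 6"

# a failing assertion raises AssertionError with the given message
try:
    assert double(3) == 7, "double(3) should be 7"
except AssertionError as error:
    failure_message = str(error)

print(failure_message)
```

In the test cells of this notebook the assertions are not wrapped in `try`/`except`, so a failing test stops the cell before the `SUCCESS` message is printed.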

### PROBLEM 1 - variable assignment (10 POINTS)
Your first exercise/problem is to simply assign to the variable `my_name` a string containing your name.  Please do so in the code cell below.

In [1]:
my_name = "Matej Popovski"



The code block below will test that you have successfully assigned a non-empty string to the variable `my_name`.  If all of the assertions run without failure, you have passed the tests for this problem and earned the points assigned to it.

In [2]:
# tests for my_name
assert isinstance(my_name, str), "my_name was not assigned to a string object"
assert my_name != "", "my_name is empty"
print("SUCCESS: all tests for my_name passed!")

SUCCESS: all tests for my_name passed!


## Representing and working with DNA molecules as strings

In this notebook we will work with a fragment of the DNA sequence for the human gene *CFTR*, mutations in which are known to cause the disorder [Cystic Fibrosis](https://en.wikipedia.org/wiki/Cystic_fibrosis).

In [3]:
cftr_gene_fragment = ("ACTTCACTTCTAATGGTGATTATGGGAGAACTGGAGCCTTCAGAGGGTAA"
                      "AATTAAGCACAGTGGAAGAATTTCATTCTGTTCTCAGTTTTCCTGGATTA"
                      "TGCCTGGCACCATTAAAGAAAATATCATCTTTGGTGTTTCCTATGATGAA"
                      "TATAGATACAGAAGCGTCATCAAAGCATGCCAACTAGAAGAG")
print(cftr_gene_fragment)

ACTTCACTTCTAATGGTGATTATGGGAGAACTGGAGCCTTCAGAGGGTAAAATTAAGCACAGTGGAAGAATTTCATTCTGTTCTCAGTTTTCCTGGATTATGCCTGGCACCATTAAAGAAAATATCATCTTTGGTGTTTCCTATGATGAATATAGATACAGAAGCGTCATCAAAGCATGCCAACTAGAAGAG


Just to get you familiar with writing functions in Python, here is a silly function that returns the concatenation of the first and last characters of a string.

In [4]:
def first_and_last(s):
    return s[0] + s[-1]

In [5]:
first_and_last(cftr_gene_fragment)

'AG'

### PROBLEM 2 - slicing and concatenating strings (1 POINT)

One of the most famous disease-causing mutations in humans is the deltaF508 mutation in the *CFTR* gene.  This is the most common mutation carried by people with Cystic Fibrosis.  This mutation occurs in the gene fragment specified above and corresponds to the deletion of 3 consecutive bases, starting at base 129 (using 1-based numbering).  The code below shows how to "slice" the string representing this gene fragment to determine the identity of the 3 bases that are deleted by this mutation.  Note that Python uses 0-based indexing, so base 129 is at index 128!


In [6]:
print("The deleted bases in the deltaF508 mutation are", cftr_gene_fragment[128:131])

The deleted bases in the deltaF508 mutation are CTT
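
The conversion from 1-based positions to a Python slice can be sketched with a small helper. The name `slice_one_based` and the short example string are assumptions made for illustration only. Because a Python slice excludes its end index, only the start shifts down by one, which is why bases 129 through 131 correspond to `[128:131]`.

```python
def slice_one_based(s, first, last):
    # hypothetical helper: first and last are 1-based, inclusive positions
    # Python slices are 0-based and exclude the end index,
    # so the start shifts down by one and the end stays the same
    return s[first - 1:last]

example = "ACGTACGT"
# bases 3 through 5 (1-based) are G, T, A
print(slice_one_based(example, 3, 5))
```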


In this problem, you are to write a function that takes as input a string, a `start` position (0-based index), and a `length`, and returns a *new* string that is the result of deleting `length` characters starting at the `start` position from the input string.  Note that string objects are immutable (i.e., you cannot modify them) in Python. We will use this function to determine the sequence of the gene fragment that results from the deltaF508 mutation.  
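
A short sketch of what immutability means in practice (the variable names are chosen only for this illustration): assigning to a position of a string fails, while slicing and concatenation build a new string and leave the original untouched.

```python
s = "Testing"

# item assignment on a str raises TypeError because strings are immutable
try:
    s[0] = "R"
    assignment_failed = False
except TypeError:
    assignment_failed = True

# slicing and concatenation build a new string; the original is unchanged
t = "R" + s[1:]
print(assignment_failed, s, t)
```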

In [7]:
def delete_substring(s, start, length):
    ### YOUR CODE HERE
    return s[:start] + s[start + length:]
    


In [8]:
# tests for delete_substring
assert delete_substring("Testing", 0, 2) == "sting"
assert delete_substring("Testing", 4, 3) == "Test"
assert delete_substring("Testing", 1, 3) == "Ting"
assert delete_substring("Testing", 6, 1) == "Testin"
print("SUCCESS: all tests for delete_substring passed!")

SUCCESS: all tests for delete_substring passed!


Now let's use your function to determine the sequence of the gene fragment that results from the deltaF508 mutation:

In [9]:
delete_substring(cftr_gene_fragment, 128, 3)

'ACTTCACTTCTAATGGTGATTATGGGAGAACTGGAGCCTTCAGAGGGTAAAATTAAGCACAGTGGAAGAATTTCATTCTGTTCTCAGTTTTCCTGGATTATGCCTGGCACCATTAAAGAAAATATCATTGGTGTTTCCTATGATGAATATAGATACAGAAGCGTCATCAAAGCATGCCAACTAGAAGAG'

### PROBLEM 3 - complementary bases (1 POINT)
In this problem we will practice using the Python `if` statement to write a function that returns the complement of a single-base (single-character) string.  You are provided with the first few lines of the function.

In [10]:
def complement(base):
    if base == 'A':
        return 'T'
    elif base == 'C':
        ### YOUR CODE HERE
        return 'G'
    elif base == 'T':
        return 'A'
    elif base == 'G':
        return 'C'
    else:
        raise ValueError(base + " is not a valid DNA base")

In [11]:
# tests for complement
assert complement('A') == 'T'
assert complement('T') == 'A'
assert complement('C') == 'G'
assert complement('G') == 'C'
print("SUCCESS: all tests for complement passed!")

SUCCESS: all tests for complement passed!
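
The same base pairing could also be expressed as a table lookup instead of a chain of `if`/`elif` branches. The sketch below is one possible alternative, not the approach the problem asks for; the names `COMPLEMENT` and `complement_lookup` are hypothetical, and dictionaries are covered later in this notebook.

```python
# hypothetical lookup table pairing each DNA base with its complement
COMPLEMENT = {'A': 'T', 'C': 'G', 'G': 'C', 'T': 'A'}

def complement_lookup(base):
    # mirrors the if/elif version, including the error for invalid input
    if base not in COMPLEMENT:
        raise ValueError(base + " is not a valid DNA base")
    return COMPLEMENT[base]

print([complement_lookup(b) for b in "ACGT"])
```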


### PROBLEM 4 - reverse complement (1 POINT)
In this problem, we will write a function that is given a string representing one strand of a double-stranded DNA molecule and outputs a string representing the opposite (complementary) strand, specified in the 5' to 3' direction.  To do this, we will first construct a `list` containing the characters of the opposite strand and then use the [`join`](https://docs.python.org/3/library/stdtypes.html#str.join) method of string objects to convert the `list` back to a string.  This will be more efficient than repeated concatenation.
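
As a small illustration of the two approaches (the list `pieces` is made up for this example), both loops below produce the same string, but concatenation may create a new intermediate string on every iteration, whereas `join` builds the result in one call:

```python
pieces = ['G', 'A', 'T', 'C']

# repeated concatenation: each += may build a brand-new string
slow = ""
for piece in pieces:
    slow += piece

# join: builds the final string from the list in a single call
fast = "".join(pieces)

print(slow, fast)
```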

One function you might consider using is [`reversed`](https://docs.python.org/3/library/functions.html?#reversed), which allows you to loop over sequences in reverse order.  For example:

In [12]:
for number in reversed(range(5)):
    print(number)

4
3
2
1
0


In [13]:
def reverse_complement(s):
    # we first start with an empty list
    opposite_bases = []
    # using a for loop, we loop through the bases of s in *reverse* order
    # and append the complementary base to the opposite bases list
    ### YOUR CODE HERE
    for x in reversed(s):
        opposite_bases.append(complement(x))
    
    # we can use the 'join' method with an empty string
    # to concatenate all strings in the list
    return ''.join(opposite_bases)

In [14]:
# tests for reverse_complement
assert reverse_complement("ATCG") == "CGAT"
assert reverse_complement("A") == "T"
assert reverse_complement("") == ""
assert reverse_complement("GAATTC") == "GAATTC", "Failed on palindromic EcoR1 recognition sequence"
print("SUCCESS: all tests for reverse_complement passed!")

SUCCESS: all tests for reverse_complement passed!
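
For comparison, here is a sketch of a more compact alternative that uses the built-in `str.maketrans` and `str.translate` methods together with a reversing slice. The names `COMPLEMENT_TABLE` and `reverse_complement_translate` are hypothetical. Unlike a version built on `complement`, this one passes characters other than A, C, G, and T through unchanged instead of raising an error.

```python
# translation table mapping each DNA base to its complement
COMPLEMENT_TABLE = str.maketrans("ACGT", "TGCA")

def reverse_complement_translate(s):
    # hypothetical alternative: complement every base, then reverse with a slice
    return s.translate(COMPLEMENT_TABLE)[::-1]

print(reverse_complement_translate("ATCG"))
```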


Finally, let us use your `reverse_complement` function to determine the opposite strand of the *CFTR* gene fragment that we have been working with in this notebook.

In [None]:
reverse_complement(cftr_gene_fragment)

### PROBLEM 5 - transcription (1 POINT)
Write a function that takes as input a DNA string (**sense strand**) and outputs the string representing the RNA that would result from transcribing this DNA sequence.

You will likely find one of the [methods of string objects](https://docs.python.org/3/library/stdtypes.html#string-methods) very convenient for accomplishing this in a single line.  For example, to obtain an uppercase copy of a string, you can use the `upper` method.

In [None]:
"AcTggggTTatgatTAG".upper()

In [16]:
def transcribe(dna_sequence):
    ### YOUR CODE HERE
    # RNA has uracil (U) wherever the DNA sense strand has thymine (T)
    return dna_sequence.replace('T', 'U')


In [17]:
# tests for function transcribe
assert transcribe("ACGT") == "ACGU", "Failed on input 'ACGT'"
assert transcribe("") == "", "Failed on empty string input"
assert transcribe("CTACTACTGCTA") == 'CUACUACUGCUA', "Failed on input 'CTACTACTGCTA'"
print("SUCCESS: transcribe passed all tests!")

SUCCESS: transcribe passed all tests!


### PROBLEM 6 - Splitting an RNA into its codons (1 POINT)

We will next write a function that takes as input an RNA string and returns a list of its codons, in the same order as they appear in the RNA.  We will assume that the first three bases of the RNA correspond to a codon and that the length of the RNA sequence is a multiple of three.  You need not check that these assumptions are valid.

In this problem I encourage you to learn about and use Python's [list comprehension](https://docs.python.org/3/tutorial/datastructures.html#list-comprehensions) syntax, which is a concise way to construct a list.  For example, the code below, which constructs a list of the squares of the first ten non-negative integers

In [43]:
squares = []
for x in range(10):
    squares.append(x**2)

can be more concisely written as

In [44]:
squares = [x**2 for x in range(10)]

In [49]:
def codons(rna):
    ### YOUR CODE HERE
    # slice out consecutive, non-overlapping triplets starting at index 0
    return [rna[i:i+3] for i in range(0, len(rna), 3)]

In [50]:
# tests for function codons
assert codons("ACGCCUGGG") == ["ACG", "CCU", "GGG"], "Failed on input 'ACGCCUGGG'"
assert codons("UUU") == ["UUU"], "Failed on input 'UUU'"
assert codons("") == [], "Failed on empty string"
print("SUCCESS: codons passed all tests!")

SUCCESS: codons passed all tests!
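
The problem does not require checking its assumptions, but if you wanted to, a sketch of a stricter variant might look like the following. The name `checked_codons` and the error message are hypothetical.

```python
def checked_codons(rna):
    # hypothetical variant that validates the length assumption
    if len(rna) % 3 != 0:
        raise ValueError("RNA length must be a multiple of three")
    # step through the string three bases at a time
    return [rna[i:i + 3] for i in range(0, len(rna), 3)]

print(checked_codons("ACGCCUGGG"))
```

Without such a check, slicing past the end of a string does not raise an error in Python, so a trailing partial codon would simply appear as a shorter final element.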


### PROBLEM 7 - codon translation (1 POINT)

Finally, we will explore the use of the Python [dictionary](https://docs.python.org/3/tutorial/datastructures.html#dictionaries) data structure as a convenient way in which to implement a function that translates an RNA codon to its corresponding amino acid.  We could write a giant set of if-elif statements for this, but it will be more elegant, convenient, and (usually) efficient to implement this as a lookup in a table, such as a dictionary object.  I have done the hard work for you and specified the standard genetic code (RNA codon to amino acid) as a dictionary below.
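
The contrast between the two approaches can be sketched on a tiny table. Everything in this block (`amino_acid_names`, `name_with_if`, `name_with_dict`) is hypothetical and serves only as an illustration:

```python
# hypothetical two-entry table used only for illustration
amino_acid_names = {'M': 'methionine', 'F': 'phenylalanine'}

def name_with_if(symbol):
    # if/elif version: one branch per entry
    if symbol == 'M':
        return 'methionine'
    elif symbol == 'F':
        return 'phenylalanine'
    else:
        raise KeyError(symbol)

def name_with_dict(symbol):
    # dictionary version: a single lookup, whatever the table size
    return amino_acid_names[symbol]

print(name_with_if('F'), name_with_dict('F'))
```

With 64 codons, the `if`/`elif` version would need 64 branches, while the dictionary version stays a one-line lookup.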

In [56]:
genetic_code = {
 'AAA': 'K', 'AAC': 'N', 'AAG': 'K', 'AAU': 'N',
 'ACA': 'T', 'ACC': 'T', 'ACG': 'T', 'ACU': 'T',
 'AGA': 'R', 'AGC': 'S', 'AGG': 'R', 'AGU': 'S',
 'AUA': 'I', 'AUC': 'I', 'AUG': 'M', 'AUU': 'I',
 'CAA': 'Q', 'CAC': 'H', 'CAG': 'Q', 'CAU': 'H',
 'CCA': 'P', 'CCC': 'P', 'CCG': 'P', 'CCU': 'P',
 'CGA': 'R', 'CGC': 'R', 'CGG': 'R', 'CGU': 'R',
 'CUA': 'L', 'CUC': 'L', 'CUG': 'L', 'CUU': 'L',
 'GAA': 'E', 'GAC': 'D', 'GAG': 'E', 'GAU': 'D',
 'GCA': 'A', 'GCC': 'A', 'GCG': 'A', 'GCU': 'A',
 'GGA': 'G', 'GGC': 'G', 'GGG': 'G', 'GGU': 'G',
 'GUA': 'V', 'GUC': 'V', 'GUG': 'V', 'GUU': 'V',
 'UAA': '*', 'UAC': 'Y', 'UAG': '*', 'UAU': 'Y',
 'UCA': 'S', 'UCC': 'S', 'UCG': 'S', 'UCU': 'S',
 'UGA': '*', 'UGC': 'C', 'UGG': 'W', 'UGU': 'C',
 'UUA': 'L', 'UUC': 'F', 'UUG': 'L', 'UUU': 'F'}

Note that in the table above, `*` represents a stop codon rather than an amino acid.

Using this dictionary object, write a (very simple) function that takes as input a single RNA codon string and outputs its corresponding amino acid (or the symbol representing the stop codon).  You need not check that the input is valid.

In [57]:
def translate_codon(rna_codon):
    ### YOUR CODE HERE
    return genetic_code[rna_codon]


In [58]:
# tests for function translate_codon
assert translate_codon("ACG") == "T", "Failed on input 'ACG'"
assert translate_codon("UUU") == "F", "Failed on input 'UUU'"
assert translate_codon("UGA") == "*", "Failed on stop codon input 'UGA'"

# Check that the genetic_code dictionary is being used
orig_genetic_code = genetic_code
del genetic_code
try:
    translate_codon("ACG")
except NameError:
    pass
else:
    raise AssertionError("translate_codon does not use genetic_code")
finally:
    genetic_code = orig_genetic_code

print("SUCCESS: translate_codon passed all tests!")

SUCCESS: translate_codon passed all tests!
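
To see how the pieces of this notebook fit together, here is a minimal, self-contained sketch that goes from a DNA sense strand to a string of one-letter amino acid symbols. It is an illustration under stated assumptions rather than part of the assignment: the name `translate_sense_strand` is hypothetical, and `codon_subset` is a four-codon excerpt of the standard genetic code, not the full `genetic_code` table.

```python
# hypothetical subset of the standard genetic code, enough for this example
codon_subset = {'AUG': 'M', 'UUU': 'F', 'GGG': 'G', 'UAA': '*'}

def translate_sense_strand(dna):
    # transcribe the sense strand by replacing T with U
    rna = dna.replace('T', 'U')
    # split into codons, assuming the length is a multiple of three
    triplets = [rna[i:i + 3] for i in range(0, len(rna), 3)]
    # look up each codon and join the one-letter symbols
    return ''.join(codon_subset[codon] for codon in triplets)

print(translate_sense_strand("ATGTTTGGGTAA"))
```

A codon missing from the subset would raise a `KeyError`, so a real translation would need the complete table.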


## Submitting your notebook

Congratulations, you have reached the end of this notebook!  To double check that you have completed all of the problems correctly, you are encouraged to restart Python and run all of the cells in the notebook from beginning to end.  You can do so automatically by running the command "Restart Kernel and Run All Cells" from the "Kernel" menu.  After running this command, look through your notebook and check that all of the tests ran successfully.

To submit your work, click on the "Submit" button in the upper right corner.  You may submit as many times as you wish and your final grade will be based on your most recent submission.  After you submit, a grade report will become available telling you how many points you received on each problem in the notebook.