# Week 2 - Transcription & Translation

In this workshop, we will be writing Python code to perform tasks related to transcription and translation in silico. The workshop has two parts: writing functions from scratch, and using existing implementations and data structures to store and process sequences.

Previous Python experience is expected for COMP90016 students. However, if you need to review some coding concepts, there are guides to help you in the additional resources modules on the LMS.

You may also want to refer back to workshop 1 for some tips on using Jupyter notebooks.

These exercises build on the concepts presented in the first week of lectures. We recommend watching them before completing the workshop.

<br>

## Task 1 - Compute the reverse complement

First, we will write a script to determine the reverse complement of a given sequence. We begin by creating a dictionary of mappings.

In this workshop we will not be considering the [extended DNA alphabet](http://www.bioinformatics.org/sms/iupac.html), only the 4 standard bases.

In [None]:
complement_dict = {'A': 'T', 'T': 'A', 'C': 'G', 'G': 'C'}
complement_dict['C']

In [None]:
dna_seq = 'ACTATTAAACCCATATAACCTCCCCCAAAATTCAGAATAATAAC'
complement_seq = ''  # An empty string to store the reverse complement sequence.

# Iterate through the bases of the DNA sequence and use the complement mapping dictionary to add the complementary bases to the complement_seq string.
for base in dna_seq:
    complement_seq += complement_dict[base]
    
complement_seq

This gives us the complement of `dna_seq`, but we still need to reverse it. You can reverse a string with the slice `dna_seq[::-1]`. For this sequence it is equivalent to `dna_seq[44::-1]`: the sequence has 44 bases (indices 0 to 43), so a start of 44 is clamped to the last base, and the slice then steps backwards (step -1) until it has included index 0. Try it out below.

In [None]:
print(dna_seq)
print(dna_seq[44::-1])
print(dna_seq[::-1])
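
The slicing behaviour described above can be checked on a shorter string. This is a minimal sketch; `seq` is a made-up example sequence, not part of the workshop data.

```python
seq = 'ACGTAC'  # hypothetical 6-base example (indices 0 to 5)

# With a negative step, a start index past the end is clamped to the last index,
# so all three slices below walk the string from the last base back to the first.
print(seq[::-1])
print(seq[len(seq) - 1::-1])
print(seq[len(seq)::-1])

# Slicing returns a new string; the original is left unchanged.
print(seq)
```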

In [None]:
# Note: we do not modify the original DNA sequence variable. This allows it to be reused in other places.
dna_seq

We can apply this to `complement_seq` to get the reverse complement.

In [None]:
complement_seq[::-1]

All of this code can be combined and written as a function, so it can be reused.

In [None]:
def rev_complement(seq):
    """
    Compute the reverse complement of a given DNA sequence.
    The input and output sequences should be DNA strings with capital letters. 
    """
    
    complement_dict = {'A': 'T', 'T': 'A', 'C': 'G', 'G': 'C'}
    complement_seq = ''
    
    # Iterate through the bases of the DNA sequence and use the complement mapping dictionary to add the complementary bases to the complement_seq string.
    for base in seq:
        complement_seq += complement_dict[base]
        
    rev_complement_seq = complement_seq[::-1]
    
    return rev_complement_seq

In [None]:
print(rev_complement('TAAAG')) # should give 'CTTTA'
print(rev_complement(dna_seq))
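
As an aside, the same result can be obtained without an explicit loop by using the built-in `str.maketrans` and `str.translate`. The sketch below is one possible alternative, not the workshop's reference solution; the names `COMPLEMENT_TABLE` and `rev_complement_translate` are made up for illustration.

```python
# Translation table mapping each base to its complement (A<->T, C<->G).
COMPLEMENT_TABLE = str.maketrans('ACGT', 'TGCA')

def rev_complement_translate(seq):
    """Reverse complement of an upper-case DNA string, using str.translate."""
    # Complement every base in one call, then reverse the result.
    return seq.translate(COMPLEMENT_TABLE)[::-1]

print(rev_complement_translate('TAAAG'))  # CTTTA
```

One difference to keep in mind: the dictionary version raises a `KeyError` on characters outside the four standard bases, whereas `str.translate` leaves them unchanged.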

<br>

## Task 2 - Transcribe DNA sequences

Here, we transcribe a DNA sequence into an RNA sequence. Write a function to transcribe a given DNA sequence.

Note that when referring to the DNA sequence of a gene, we are referring to the coding strand by default.

In [None]:
def transcribe(dna):
    """
    Compute the transcript resulting from a DNA sequence.
    The input should be a DNA string and the output an RNA string, both with capital letters.
    """
    
    # your code here
    

In [None]:
print(transcribe('ATAT')) # should give 'AUAU'
print(transcribe(dna_seq))

<br>

## Task 3 - Translate DNA sequences

As in Task 1, we need a dictionary, this time to map codons to their respective amino acids. We first build the dictionary from a text-based codon table.

In [None]:
# Note: * represents a stop codon; M (methionine) is encoded by the start codon ATG
base1 = 'TTTTTTTTTTTTTTTTCCCCCCCCCCCCCCCCAAAAAAAAAAAAAAAAGGGGGGGGGGGGGGGG'
base2 = 'TTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGG'
base3 = 'TCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAG'
aa = 'FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG'

codon_map = {} # Build a codon map from the four strings above and store it in this dictionary.

# your code here

codon_map

Now, use your dictionary to compute the amino-acid sequence for the first reading frame (no offset in the sequence). You can use the `dict.get` method to return a default value when a key does not exist in the dictionary.
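
To illustrate how `dict.get` behaves, here is a small sketch. The dictionary `partial_map` is a hypothetical two-entry map used only for this example; it is not the full codon table.

```python
partial_map = {'ATG': 'M', 'TGA': '*'}  # hypothetical map with only two codons

# The key exists, so its value is returned.
print(partial_map.get('ATG', 'X'))  # M

# The key does not exist (an incomplete codon), so the default 'X' is returned
# instead of raising a KeyError as partial_map['GA'] would.
print(partial_map.get('GA', 'X'))   # X
```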

In [None]:
def translate(dna, codon_dict):
    """
    Translate a DNA sequence from the first reading frame, given a codon mapping dictionary.
    Codons are keys and amino acids are values in this dictionary.
    The input should be a DNA string and the output an amino-acid string, both with capital letters.
    """
    
    # your code here
    

In [None]:
print(translate('ATGATGA', codon_map)) # should give MM or MMX where X represents an incomplete codon
print(translate(dna_seq, codon_map))

Now, write a function that uses the above function to return the amino-acid sequence of all 6 reading frames. 

Hint: three reading frames will be from the reverse complement strand. 

In [None]:
def six_rfs(dna, codon_dict):
    """
    Return the amino-acid sequence from all six reading frames of a sequence.
    This function should use the translate function implemented earlier.
    Return the result as a list of size 6. The list should contain amino-acid strings with capital letters.
    The input sequence should be a DNA string with capital letters.
    """
    
    # your code here
    

In [None]:
# should give: (with or without the X)
# 'TIKPI*PPPKFRIIX'
# 'LLNPYNLPQNSE**X'
# 'Y*THITSPKIQNNN'
# 'VIILNFGGGYMGLIX'
# 'LLF*ILGEVIWV**X'
# 'YYSEFWGRLYGFNS'

six_rfs(dna_seq, codon_map)

<br>

## Task 4 - the scikit-bio library

All of the above tasks can be performed using functions from the `scikit-bio` library. It provides functions and methods to read and parse some popular file formats, and store and modify sequences.

`scikit-bio` is already installed on SWAN. If you are running this notebook on SWAN, you can skip to the import cell.

If you are using your (Mac or Linux) local computer, you will have to install it by using the cell below. Note that the ! allows us to use UNIX commands from inside a Jupyter notebook. Unfortunately, there is no `scikit-bio` version for Windows computers.

In [None]:
# Uncomment and execute the line below to install scikit-bio on your local machine using pip
#!pip install scikit-bio

# Alternate conda installation command
#!conda install -c conda-forge scikit-bio

In [None]:
# Import the library
import skbio

`scikit-bio`, like many Python libraries, uses an object-oriented programming paradigm. For example, a DNA sequence is treated as an object. All objects have properties and behaviours. Properties could be metadata such as the sequence ID of a DNA sequence or its quality. Behaviours could be computing the transcribed or translated sequence. In Python, properties and behaviours are referred to as *attributes* and *methods*.
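
The difference between attributes and methods can be seen in a tiny hand-written class. This is only a sketch: `ToySequence` and its `gc_content` method are made up for illustration and are not part of `scikit-bio`.

```python
class ToySequence:
    """A hypothetical, minimal sequence object (not the scikit-bio class)."""

    def __init__(self, sequence, seq_id):
        # Attributes: data stored on the object.
        self.sequence = sequence
        self.seq_id = seq_id

    def gc_content(self):
        # Method: behaviour computed from the object's data.
        gc_count = sum(base in 'GC' for base in self.sequence)
        return gc_count / len(self.sequence)

toy = ToySequence('ATGC', 'example-1')
print(toy.seq_id)        # attribute access, no parentheses
print(toy.gc_content())  # method call, with parentheses
```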

In [None]:
# Define an skbio.sequence.DNA object using the same test sequence we used above.
# Note the additional statistics that are computed by default.

dna_seq_skbio = skbio.sequence.DNA(dna_seq)
dna_seq_skbio

In [None]:
# The alphabet used to encode a DNA sequence is an attribute of the skbio.sequence.DNA object.
dna_seq_skbio.alphabet

<br>

## Task 5 - skbio.sequence.DNA object methods

Next we will load the sequence of a bacterial *dnaA* gene from a FASTA file in your data directory using the `skbio.io.read` function. Type `?skbio.io.read` in a code cell to access the help page for this function.

A FASTA file stores sequence information. It is a common filetype that you will learn more about in lecture 6.

In [None]:
dnaA = skbio.io.read('dnaA.fa', format = 'fasta', into = skbio.sequence.DNA)
dnaA

The above DNA object holds attributes such as a description and an ID. We can compute the complement of this sequence, transcribe it and translate it using methods provided by the scikit-bio library. For more information on all the functions and classes (DNA, RNA, etc.) the library provides, read the [documentation page](http://scikit-bio.org/docs/0.5.1/index.html).

In [None]:
dnaA.complement()

In [None]:
dnaA.reverse_complement()

Note that `reverse_complement` is different to `complement`: it also reverses the order of the bases.

In [None]:
dnaA.transcribe()

In [None]:
dnaA.translate()

In [None]:
list(dnaA.translate_six_frames())

<br>

<br>

# If you get stuck

There are many creative solutions for these tasks, but here are some coding tips that may help.


## Making dictionaries

Python dictionaries can be made quickly using `dict(zip())`. 

`zip()` can be used to group elements of multiple lists together,
while `dict()` can be used to create a dictionary.

In [None]:
bases = ["A", "T", "C", "G"]
complements = ["T", "A", "G", "C"]

In [None]:
complement_dict = dict(zip(bases, complements))
complement_dict


## Iterating over lists or strings

For loops are easy to understand and are clear to the reader, but there can be more concise ways to achieve an equivalent outcome.

For example, the built-in string method `str.replace()` replaces every occurrence of a substring in a single call, without an explicit loop.
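
A minimal sketch of `str.replace()` on an ordinary word (the string `word` is just an example):

```python
word = 'banana'

# Replace every 'a' with 'o' in one call; no explicit loop is needed.
replaced = word.replace('a', 'o')
print(replaced)  # bonono

# Strings are immutable, so the original is unchanged.
print(word)      # banana
```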


## The OS library

Sometimes you may have trouble specifying paths to important files such as `dnaA.fa`.

The `os` library can be useful for this. You can read more about it here: https://docs.python.org/3/library/os.html

`os` is part of the Python standard library, so all you need to do is `import os`. Then you can use functions such as `os.getcwd()` to retrieve the current working directory (cwd).

In [None]:
import os
os.getcwd() 
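
Building on this, the sketch below shows one way to check whether a file is where you expect it to be. It assumes `dnaA.fa` should sit in the current working directory, which may not match your setup; the variable name `fasta_path` is made up for this example.

```python
import os

# Build a full path to the file, assuming it lives in the current working directory.
fasta_path = os.path.join(os.getcwd(), 'dnaA.fa')
print(fasta_path)

# Check whether the file actually exists at that location.
print(os.path.exists(fasta_path))

# List the contents of the current directory to see which files are available.
print(os.listdir(os.getcwd()))
```

If the file is not found, compare the printed path with where the file was actually saved, then pass the correct path to `skbio.io.read`.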

<br>

<br>

`Workshop developed by Steven Morgan, Dr Dieter Bulach and Dharmesh Bhuva.`