# Section 01: Practice Python Problems

This notebook contains several biologically motivated example problems for practicing your Python skills. Type your answers in the code cells as indicated, and refer to the section notes for help with Jupyter notebooks and Python. This isn't an assignment, so there's no need to submit it. It's purely a chance to practice and to gauge your understanding of basic Python commands.

These problems are adapted from [Rosalind](http://rosalind.info/problems/list-view/). If you want more Python problems to try out, explore the site!

### 1.  Nucleotides

DNA is composed of four nucleotides: A, C, G, and T. We will often represent a DNA sequence as a Python string of these characters.

**Problem:** Count the number of times each nucleotide (A, C, G, T) occurs in a given DNA sequence.

In [4]:
# Given: a DNA sequence (str)
dna = 'ACTGCAGTCATGCAGTCATGCTGATTTTCAGGCCATGCAAACTG'

# Return: four numbers (ints) separated by spaces counting the number of
#         A's, C's, G's, and T's respectively

# FIXME: Write your solution here
def count_base(dna):
    # Count each base in A, C, G, T order and join the counts with spaces
    return ' '.join(str(dna.count(base)) for base in 'ACGT')

In [5]:
count_base(dna)

'11 11 10 12'

If you are totally stuck and want some pseudocode to get you started, here are some steps you could implement. Try to complete each commented step below with the appropriate Python code.

In [None]:
# Initialize the count of each nucleotide to zero

# Iterate through each nucleotide in the DNA sequence (in a for loop, for example)
    # If this nt is an 'A', increment the A count
    # Else if this nt is a 'C', increment the C count
    # Else if this nt is a 'G', increment the G count
    # Else if this nt is a 'T', increment the T count
    # Else, this nt is not a valid nucleotide so print a warning

# Print the frequency of each nucleotide separated by spaces
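
The steps above can be sketched as follows. This is one possible implementation rather than the only correct one; the function name `count_bases_loop` and the wording of the warning message are assumptions made for illustration.

```python
def count_bases_loop(dna):
    # Initialize the count of each nucleotide to zero
    a_count = c_count = g_count = t_count = 0

    # Iterate through each nucleotide in the DNA sequence
    for nt in dna:
        if nt == 'A':
            a_count += 1
        elif nt == 'C':
            c_count += 1
        elif nt == 'G':
            g_count += 1
        elif nt == 'T':
            t_count += 1
        else:
            # Not a valid nucleotide, so print a warning
            print('Warning: unexpected character ' + repr(nt))

    # Return the counts separated by spaces
    return '{} {} {} {}'.format(a_count, c_count, g_count, t_count)

print(count_bases_loop('ACTGCAGTCATGCAGTCATGCTGATTTTCAGGCCATGCAAACTG'))  # prints 11 11 10 12
```

Unlike `str.count`, the explicit loop visits each character once and can flag unexpected characters such as `N`.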

### 2. Complement

DNA is double-stranded. One strand is called the *sense* strand and the other is called the *antisense* strand. Usually you are given only the sequence of the sense (coding) strand. In DNA, A and T are complements of each other, as are C and G. The antisense strand is the complement of the sense strand, formed by taking the complement of each nucleotide. For example, the complement of GTCA is CAGT.

**Problem:** Print the complement of the given DNA sequence.

In [None]:
# Given: a DNA sequence (str)
dna = 'ACTGCAGTCATGCAGTCATGCTGATTTTCAGGCCATGCAAACTG'

# Return: the complement of the sequence (str)

# FIXME: Write your solution here

**Problem:** Now rewrite your solution as a function, so that you can find the complement of multiple sequences.

In [6]:
def complement(dna):
    '''
    Given: a DNA sequence (str)
    Return: the complement of the sequence (str)
    '''
    # Map each base to its complementary base
    comp = {'A': 'T', 'C': 'G', 'T': 'A', 'G': 'C'}
    comp_str = ''
    # Build the complement one base at a time
    for base in dna:
        comp_str += comp[base]
    return comp_str

In [7]:
dna1 = 'ACTGCAGTCATGCAGTCATGCTGATTTTCAGGCCATGCAAACTG'
dna2 = 'CGTCAGTCGTATTGCGTAGCTGGTACTGCATATATTGGTGCGAAACGT'

# FIXME: Run your function on both sequences
print(complement(dna1))
print(complement(dna2))

TGACGTCAGTACGTCAGTACGACTAAAAGTCCGGTACGTTTGAC
GCAGTCAGCATAACGCATCGACCATGACGTATATAACCACGCTTTGCA
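
The base-by-base lookup can also be expressed with Python's built-in `str.maketrans` and `str.translate`. The sketch below is an alternative, not a replacement for the loop; the names `COMPLEMENT_TABLE` and `complement_translate` are made up for this example.

```python
# Translation table mapping each base to its complement: A<->T, C<->G
COMPLEMENT_TABLE = str.maketrans('ACGT', 'TGCA')

def complement_translate(dna):
    # str.translate applies the table to every character of the string
    return dna.translate(COMPLEMENT_TABLE)

print(complement_translate('GTCA'))  # prints CAGT
```

One difference to keep in mind: a dictionary lookup raises a `KeyError` on an unexpected character, whereas `str.translate` leaves characters that are not in the table unchanged.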


### 3. Transcription

DNA is transcribed to RNA, which is composed of four nucleotides: A, C, G, and U (in place of T). Given a coding-strand DNA sequence, the transcribed RNA sequence is formed by replacing every T with U.

**Problem:** Transcribe a given DNA sequence to RNA.

In [8]:
def transcribe(dna):
    '''
    Given: a DNA sequence (str)
    Return: the transcribed RNA sequence (str)
    '''
    trans_str = ''
    for base in dna:
        # RNA has U wherever the coding strand has T
        if base == 'T':
            trans_str += 'U'
        else:
            trans_str += base
    return trans_str

In [9]:
dna = 'ACTGCAGTCATGCAGTCATGCTGATTTTCAGGCCATGCAAACTG'
transcribe(dna)
# FIXME: Run your function on this sequence

'ACUGCAGUCAUGCAGUCAUGCUGAUUUUCAGGCCAUGCAAACUG'
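
Because transcription here is a plain character substitution, the same result can be obtained with the built-in `str.replace`. This is a minimal sketch; the function name `transcribe_replace` is an assumption for illustration.

```python
def transcribe_replace(dna):
    # Replace every T with U; all other characters are left unchanged
    return dna.replace('T', 'U')

print(transcribe_replace('GATTACA'))  # prints GAUUACA
```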

### 4. Translation

There are 20 commonly occurring amino acids that make up protein sequences. RNA is translated to protein according to the [codon table](https://en.wikipedia.org/wiki/Genetic_code#RNA_codon_table). Given a messenger RNA sequence, its corresponding protein sequence is formed by translating each codon (triplet of nucleotides) to the appropriate amino acid or stop codon (denoted \*).

**Problem:** Translate a given RNA sequence to protein.

In [10]:
def translate(rna):
    '''
    Given: an RNA sequence (str)
    Return: the translated protein sequence (str)
    '''
    
    # Save the codon table as a dict
    aas = {'UUU':'F','UUC':'F',
           'UUA':'L','UUG':'L',
           'UCU':'S','UCC':'S','UCA':'S','UCG':'S','AGU':'S','AGC':'S',
           'UAU':'Y','UAC':'Y',
           'UAA':'*','UAG':'*','UGA':'*',
           'UGU':'C','UGC':'C',
           'UGG':'W',
           'CUU':'L','CUC':'L','CUA':'L','CUG':'L',
           'CCU':'P','CCC':'P','CCA':'P','CCG':'P',
           'CAU':'H','CAC':'H',
           'CAA':'Q','CAG':'Q',
           'CGU':'R','CGC':'R','CGA':'R','CGG':'R','AGA':'R','AGG':'R',
           'AUU':'I','AUC':'I','AUA':'I',
           'AUG':'M',
           'ACU':'T','ACC':'T','ACA':'T','ACG':'T',
           'AAU':'N','AAC':'N',
           'AAA':'K','AAG':'K',
           'GUU':'V','GUC':'V','GUA':'V','GUG':'V',
           'GCU':'A','GCC':'A','GCA':'A','GCG':'A',
           'GAU':'D','GAC':'D',
           'GAA':'E','GAG':'E',
           'GGU':'G','GGC':'G','GGA':'G','GGG':'G'}
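    # Split the RNA into codons (consecutive triplets of nucleotides).
    # Note: a trailing partial codon is not in the table and raises a KeyError.
    codons = [rna[i:i+3] for i in range(0, len(rna), 3)]
    amino_acid = ''
    # Look up each codon and append its one-letter amino acid code
    for codon in codons:
        amino_acid += aas[codon]
    return amino_acid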

In [11]:
rna = 'AUGGCCAUGGCGCCCAGAACUGAGAUCAAUAGUACCCGUAUUAACGGGUGA'
translate(rna)
# FIXME: Run your function on this sequence

'MAMAPRTEINSTRING*'
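
A sequence whose length is not a multiple of three ends in an incomplete codon, which has no entry in the codon table. The sketch below shows one way to handle that case, by ignoring the trailing one or two bases; whether to ignore them or report an error is a design choice, and this choice is an assumption. It also builds the same codon table compactly from the standard genetic code listed in U, C, A, G order. The names `CODON_TABLE` and `translate_complete_codons` are made up for this example.

```python
from itertools import product

BASES = 'UCAG'
# One-letter amino acid codes for all 64 codons, ordered UUU, UUC, UUA, UUG, UCU, ...
AMINO_ACIDS = 'FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG'

# Pair each codon with its amino acid (or * for a stop codon)
CODON_TABLE = {''.join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

def translate_complete_codons(rna):
    protein = ''
    # Stop before any trailing 1 or 2 bases that do not form a full codon
    last_full = len(rna) - len(rna) % 3
    for i in range(0, last_full, 3):
        protein += CODON_TABLE[rna[i:i+3]]
    return protein

print(translate_complete_codons('AUGGCCUGA'))  # prints MA*
print(translate_complete_codons('AUGGC'))      # prints M
```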

### 5. Mutations

Biological sequences are frequently stored in text files called FASTA files. These files follow the format:
```
>SEQ_ID_1
GCATCAGTACTGACTATGTTGTTACGTGCAGTCGATGACTCACAGACTGCATGTTTTTCTATTGTAACACATGCATACATGTACTGTTACGTGTACGTGTGCATGTCATCGTTGATCTGCATGTCATTGCATGTACGTGTGACTACTACTACTGCGCGGATAACTCGCGCCCAT
>SEQ_ID_2
GCATAAGTACTGACTATGTTGTTACGTGCAGTCGCTGACTCACAGACTGCATGTTTTTCTATTGTAACACATGCATACATGTACTGTTACGTGTACGCGTGCATGTCATCGTTGATCTGCATGTCAGTGCATGTGCGTGTGACTACTACTACTGCGCGGATCACTCGCGCCCAT
```
This can be a useful way to store lots of DNA, RNA, or protein sequences for analysis.

**Problem:** Copy the above sequences into a text file in your Jupyter notebook directory. Import the sequences from the file and find the [Hamming distance](https://en.wikipedia.org/wiki/Hamming_distance) (the number of positions at which the nucleotides differ) between the two DNA sequences.

In [12]:
# FIXME: Write your solution here
def hamming_distance(str1, str2):
    return sum(base1 != base2 for base1,base2 in zip(str1, str2))

seq1 = 'GCATCAGTACTGACTATGTTGTTACGTGCAGTCGATGACTCACAGACTGCATGTTTTTCTATTGTAACACATGCATACATGTACTGTTACGTGTACGTGTGCATGTCATCGTTGATCTGCATGTCATTGCATGTACGTGTGACTACTACTACTGCGCGGATAACTCGCGCCCAT'
seq2 = 'GCATAAGTACTGACTATGTTGTTACGTGCAGTCGCTGACTCACAGACTGCATGTTTTTCTATTGTAACACATGCATACATGTACTGTTACGTGTACGCGTGCATGTCATCGTTGATCTGCATGTCAGTGCATGTGCGTGTGACTACTACTACTGCGCGGATCACTCGCGCCCAT'

hamming_distance(seq1, seq2)

6
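
The cell above pastes the two sequences in directly. To read them from a file instead, as the problem asks, a small FASTA reader can be sketched as below. The function name `read_fasta` is an assumption, and the demonstration uses an in-memory stand-in with shortened sequences so that it runs without any file on disk.

```python
import io

def read_fasta(handle):
    '''
    Given: an open text file (or any iterable of lines) in FASTA format
    Return: a dict mapping each sequence ID to its sequence (str)
    '''
    sequences = {}
    seq_id = None
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith('>'):
            # Header line: start a new record named by the text after '>'
            seq_id = line[1:]
            sequences[seq_id] = ''
        elif seq_id is not None:
            # Sequence line: append to the current record
            sequences[seq_id] += line
    return sequences

# Demonstration with an in-memory stand-in for a file (shortened sequences)
example = io.StringIO('>SEQ_ID_1\nGCATCAGT\nACTG\n>SEQ_ID_2\nGCATAAGTACTG\n')
seqs = read_fasta(example)
print(seqs['SEQ_ID_1'])  # prints GCATCAGTACTG
print(seqs['SEQ_ID_2'])  # prints GCATAAGTACTG
```

With a real file, the same function could be called as `with open('sequences.fasta') as handle: seqs = read_fasta(handle)`, where `sequences.fasta` is a hypothetical name for the file you saved.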

**Problem:** Transcribe and then translate the DNA sequences to protein. Now calculate the Hamming distance (the number of positions at which the amino acids differ) between the two protein sequences.

In [13]:
# FIXME: Write your solution here
seq1_transcribed = transcribe(seq1)
seq1_translated = translate(seq1_transcribed)

seq2_transcribed = transcribe(seq2)
seq2_translated = translate(seq2_transcribed)

hamming_distance(seq1_translated, seq2_translated)

4
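
Hamming distance is defined only for sequences of equal length, but `zip` silently stops at the end of the shorter sequence. The sketch below adds an explicit length check; raising a `ValueError` is one reasonable choice among several, and the function name `hamming_distance_strict` is an assumption for illustration.

```python
def hamming_distance_strict(str1, str2):
    # Refuse to compare sequences of different lengths
    if len(str1) != len(str2):
        raise ValueError('Sequences must have the same length')
    # Count the positions at which the two sequences differ
    return sum(c1 != c2 for c1, c2 in zip(str1, str2))

print(hamming_distance_strict('GATTACA', 'GACTATA'))  # prints 2
```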