# Section 01: Practice Python Problems

This notebook contains several biologically motivated example problems for practicing your Python skills. Type your answers in the code cells as indicated. Refer to the section notes for help with Jupyter notebooks and Python. This isn't an assignment, so there's no need to submit it. It's purely a chance for you to practice your Python skills and gauge your understanding of basic Python commands.

These problems are adapted from [Rosalind](http://rosalind.info/problems/list-view/). If you want more Python problems to try out, explore the site!

### 1.  Nucleotides

DNA is comprised of four nucleotides: A, C, G, T. Often we will represent a DNA sequence as a string of these characters in Python.

**Problem:** Count the number of times each nucleotide (A, C, G, T) occurs in a given DNA sequence.

**Solution:** When trying to solve a problem, I usually start with the simplest strategy I can think of. Often this is a brute-force approach – it won't be fast but for simple problems that won't matter. Just getting something working is usually the best way to start, no matter how simple or "stupid" the approach may feel, rather than obsessing over writing "[good code](https://xkcd.com/844/)".

My initial solution was to iterate through each nucleotide of the DNA string using a for loop and increment the count of the appropriate nucleotide. Then I print the counts separated by spaces at the end.

In [1]:
# Given: a DNA sequence (str)
dna = 'ACTGCAGTCATGCAGTCATGCTGATTTTCAGGCCATGCAAACTG'

# Return: four numbers (ints) separated by spaces counting the number of
#         A's, C's, G's, and T's respectively

# Count the frequency of each nucleotide
A = 0 
C = 0
G = 0
T = 0
for nt in dna:          # Iterate through each nucleotide in the DNA sequence
    if nt == 'A':       # If this nt is an 'A', increment the A count
        A += 1
    elif nt == 'C':     # Otherwise if this nt is a 'C', increment the C count
        C += 1
    elif nt == 'G':     # Otherwise if this nt is a 'G', increment the G count
        G += 1
    elif nt == 'T':     # Otherwise if this nt is a 'T', increment the T count
        T += 1
    else:               # Otherwise, this nt is not a valid nucleotide so print a warning
        print('Unexpected nucleotide: ' + nt)

# Print the frequency of each nucleotide separated by spaces
print(str(A) + ' ' + str(C) + ' ' + str(G) + ' ' + str(T)) # Remember to cast to a str to print

11 11 10 12


If I wanted to get a bit fancier, I could also solve this problem using a `list` to store the counts and the string `count()` method to tally each nucleotide. I could also use a slightly different type of print statement.

In [25]:
# Given: a DNA sequence (str)
dna = 'ACTGCAGTCATGCAGTCATGCTGATTTTCAGGCCATGCAAACTG'

# Return: four numbers (ints) separated by spaces counting the number of
#         A's, C's, G's, and T's respectively

# Count the frequency of each nucleotide
counts = [0,0,0,0]      # Initialize the count of each nucleotide to zero in a list

# Use the Python str count() method to tally the number of each nt
counts[0] = dna.count('A')
counts[1] = dna.count('C')
counts[2] = dna.count('G')
counts[3] = dna.count('T')

# Print the frequency of each nucleotide separated by spaces
print('A: %i, C: %i, G: %i, T: %i' % tuple(counts)) # %i indicates that the value you want to print is an int

A: 11, C: 11, G: 10, T: 12


These are just two example solutions. There are countless ways to solve this problem in Python. If your solution gave you the correct answer, then your approach is "right". Good work!
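
As one more variation on the two approaches above, a dictionary can hold the counts keyed by nucleotide, so there is no need to remember which list index belongs to which letter. This is only a sketch: the function name `count_nucleotides` is hypothetical, and unlike the first solution it silently ignores any character other than A, C, G, or T.

```python
def count_nucleotides(dna):
    '''
    Given: a DNA sequence (str)
    Return: a dict mapping each of A, C, G, T to its count (dict)
    '''
    # Start every nucleotide at zero so all four keys are always present
    counts = {'A': 0, 'C': 0, 'G': 0, 'T': 0}
    for nt in dna:
        if nt in counts:        # Skip anything that is not A, C, G, or T
            counts[nt] += 1
    return counts

dna = 'ACTGCAGTCATGCAGTCATGCTGATTTTCAGGCCATGCAAACTG'
counts = count_nucleotides(dna)

# print() accepts several values and separates them with spaces by default
print(counts['A'], counts['C'], counts['G'], counts['T'])
```

For the example sequence this prints the same four numbers as the first solution.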

### 2. Complement

DNA is double stranded. One strand is called the *sense* strand and the other is called the *antisense* strand. Usually you are only given the sequence of the sense (coding) strand. In DNA, A and T are complements of each other, while C and G are complements of each other. The antisense strand is the complement of the sense strand, which is formed by taking the complement of each nucleotide. For example, the complement of GTCA is CAGT.

**Problem:** Print the complement of the given DNA sequence.

**Solution:** My solution here was to use a dictionary to map each nucleotide to its complement. Again, there are countless ways to solve this problem. This is just one way to go about it.

In [13]:
# Given: a DNA sequence (str)
dna = 'ACTGCAGTCATGCAGTCATGCTGATTTTCAGGCCATGCAAACTG'

# Return: the complement of the sequence (str)

# Create a dictionary mapping each nucleotide to its complement
complement_map = {'A':'T', 'C':'G', 'G':'C', 'T':'A'}

# Compute the complement of the DNA sequence
comp = ''          # Initialize the complement sequence as an empty str
for nt in dna:
    comp += complement_map[nt]

print(comp)

TGACGTCAGTACGTCAGTACGACTAAAAGTCCGGTACGTTTGAC


**Problem:** Now rewrite your solution as a function, so that you can find the complement of multiple sequences.

**Solution:** I simply moved my code into a function that accepts a DNA sequence as an argument and returns the complement sequence. Functions are a good way to modularize your code, which makes it easy to rerun steps repeatedly without copying and pasting. It also makes your code easier to understand and explain.

In [14]:
def complement(dna):
    '''
    Given: a DNA sequence (str)
    Return: the complement of the sequence (str)
    '''
    # Create a dictionary mapping each nucleotide to its complement
    complement_map = {'A':'T', 'C':'G', 'G':'C', 'T':'A'}

    # Compute the complement of the DNA sequence
    comp = ''          # Initialize the complement sequence as an empty str
    for nt in dna:
        comp += complement_map[nt]
        
    # Return the complement DNA sequence
    return comp

In [15]:
dna1 = 'ACTGCAGTCATGCAGTCATGCTGATTTTCAGGCCATGCAAACTG'
dna2 = 'CGTCAGTCGTATTGCGTAGCTGGTACTGCATATATTGGTGCGAAACGT'

# Calculate the complements of both sequences
comp1 = complement(dna1)
comp2 = complement(dna2)

# Print the results
print(comp1)
print(comp2)

TGACGTCAGTACGTCAGTACGACTAAAAGTCCGGTACGTTTGAC
GCAGTCAGCATAACGCATCGACCATGACGTATATAACCACGCTTTGCA
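
Another possible way to write the same function, shown here only as a sketch, is to let Python's built-in `str.maketrans()` and `str.translate()` do the character-by-character mapping. The names `COMPLEMENT_TABLE` and `complement_translate` are hypothetical. One behavioral difference to be aware of: characters that are not in the table (for example `N`) pass through unchanged instead of raising a `KeyError` as the dictionary lookup would.

```python
# Build a translation table once: A->T, C->G, G->C, T->A
COMPLEMENT_TABLE = str.maketrans('ACGT', 'TGCA')

def complement_translate(dna):
    '''
    Given: a DNA sequence (str)
    Return: the complement of the sequence (str)
    '''
    # translate() applies the table to every character of the string
    return dna.translate(COMPLEMENT_TABLE)

print(complement_translate('GTCA'))
```

The call above reproduces the example from the problem description, where the complement of GTCA is CAGT.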


### 3. Transcription

DNA is transcribed to RNA, which is comprised of four nucleotides: A, C, G, U (instead of T). Given a coding strand DNA sequence, its transcribed RNA sequence is formed by replacing all of the instances of T with U.

**Problem:** Transcribe a given DNA sequence to RNA.

**Solution:** My solution goes back to the basics of for loops and conditional statements to add a U in place of any T nucleotide and leave all other nucleotides unchanged.

In [16]:
def transcribe(dna):
    '''
    Given: a DNA sequence (str)
    Return: the transcribed RNA sequence (str)
    '''
    # Generate the transcribed RNA for the DNA sequence
    rna = ''          # Initialize the RNA sequence as an empty str
    for nt in dna:
        if nt == 'T': # If this nt is T, replace it with U
            rna += 'U'
        else:         # Otherwise, keep the nt unchanged
            rna += nt
        
    # Return the RNA sequence
    return rna

In [17]:
dna = 'ACTGCAGTCATGCAGTCATGCTGATTTTCAGGCCATGCAAACTG'

# Transcribe the sequence
rna = transcribe(dna)

# Print the results
print(rna)

ACUGCAGUCAUGCAGUCAUGCUGAUUUUCAGGCCAUGCAAACUG
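
Since transcription here is a single character substitution, the built-in `str.replace()` method is a shorter alternative to the loop. The sketch below uses a hypothetical function name, `transcribe_replace`, and should give the same result as the loop for any input string.

```python
def transcribe_replace(dna):
    '''
    Given: a DNA sequence (str)
    Return: the transcribed RNA sequence (str)
    '''
    # replace() returns a new string with every 'T' swapped for 'U'
    return dna.replace('T', 'U')

dna = 'ACTGCAGTCATGCAGTCATGCTGATTTTCAGGCCATGCAAACTG'
rna = transcribe_replace(dna)
print(rna)
```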


### 4. Translation

There are 20 commonly occurring amino acids that make up protein sequences. RNA is translated to protein according to the [codon table](https://en.wikipedia.org/wiki/Genetic_code#RNA_codon_table). Given a messenger RNA sequence, its corresponding protein sequence is formed by translating each codon (triplet of nucleotides) to the appropriate amino acid or stop codon (denoted \*).

**Problem:** Translate a given RNA sequence to protein.

**Solution:** I gave you the codon table as a dictionary to start (to save you extra typing), so I'll show you how I used that strategy. There are other ways to go about this as well. Here I iterated through the RNA sequence in steps of 3 nucleotides and converted each codon to the appropriate amino acid using the codon table.

In [18]:
def translate(rna):
    '''
    Given: an RNA sequence (str)
    Return: the translated protein sequence (str)
    '''
    
    # Save the codon table as a dict
    aas = {'UUU':'F','UUC':'F',
           'UUA':'L','UUG':'L',
           'UCU':'S','UCC':'S','UCA':'S','UCG':'S','AGU':'S','AGC':'S',
           'UAU':'Y','UAC':'Y',
           'UAA':'*','UAG':'*','UGA':'*',
           'UGU':'C','UGC':'C',
           'UGG':'W',
           'CUU':'L','CUC':'L','CUA':'L','CUG':'L',
           'CCU':'P','CCC':'P','CCA':'P','CCG':'P',
           'CAU':'H','CAC':'H',
           'CAA':'Q','CAG':'Q',
           'CGU':'R','CGC':'R','CGA':'R','CGG':'R','AGA':'R','AGG':'R',
           'AUU':'I','AUC':'I','AUA':'I',
           'AUG':'M',
           'ACU':'T','ACC':'T','ACA':'T','ACG':'T',
           'AAU':'N','AAC':'N',
           'AAA':'K','AAG':'K',
           'GUU':'V','GUC':'V','GUA':'V','GUG':'V',
           'GCU':'A','GCC':'A','GCA':'A','GCG':'A',
           'GAU':'D','GAC':'D',
           'GAA':'E','GAG':'E',
           'GGU':'G','GGC':'G','GGA':'G','GGG':'G'}

    prot = ''         # Initialize the protein sequence as an empty str
    L = len(rna)      # Get the length of the RNA sequence
    
    # Iterate through each nt position of the RNA sequence from 0 to L-3 in steps of 3 nts
    for i in range(0,L-2,3):
        codon = rna[i:i+3]   # This codon is the RNA sequence from i to i+2
        prot += aas[codon]   # Add the corresponding amino acid for this codon
        
    # Return the protein sequence
    return prot

In [19]:
rna = 'AUGGCCAUGGCGCCCAGAACUGAGAUCAAUAGUACCCGUAUUAACGGGUGA'

# Translate the sequence
prot = translate(rna)

# Print the results
print(prot)

MAMAPRTEINSTRING*
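
If typing out all 64 codons feels error-prone, one possible shortcut is to generate the table programmatically. The sketch below assumes the standard genetic code written out in U, C, A, G order for each codon position; the names `BASES`, `AMINO_ACIDS`, `CODON_TABLE`, and `translate_compact` are hypothetical. Like the solution above, it ignores any leftover nucleotides that do not form a complete codon.

```python
from itertools import product

BASES = 'UCAG'
# Amino acids for UUU, UUC, UUA, UUG, UCU, ... GGG (third position changes fastest)
AMINO_ACIDS = 'FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG'

# product() yields the 64 codons in the same order as the amino acid string
CODON_TABLE = {''.join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

def translate_compact(rna):
    '''
    Given: an RNA sequence (str)
    Return: the translated protein sequence (str)
    '''
    usable = len(rna) - len(rna) % 3   # Drop any incomplete trailing codon
    return ''.join(CODON_TABLE[rna[i:i+3]] for i in range(0, usable, 3))

rna = 'AUGGCCAUGGCGCCCAGAACUGAGAUCAAUAGUACCCGUAUUAACGGGUGA'
print(translate_compact(rna))
```

Building the protein with `''.join(...)` also avoids repeatedly growing a string with `+=`, although for sequences of this size the difference is negligible.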


### 5. Mutations

Biological sequences are frequently stored in text files called FASTA files. These files follow the format:
```
>SEQ_ID_1
GCATCAGTACTGACTATGTTGTTACGTGCAGTCGATGACTCACAGACTGCATGTTTTTCTATTGTAACACATGCATACATGTACTGTTACGTGTACGTGTGCATGTCATCGTTGATCTGCATGTCATTGCATGTACGTGTGACTACTACTACTGCGCGGATAACTCGCGCCCAT
>SEQ_ID_2
GCATAAGTACTGACTATGTTGTTACGTGCAGTCGCTGACTCACAGACTGCATGTTTTTCTATTGTAACACATGCATACATGTACTGTTACGTGTACGCGTGCATGTCATCGTTGATCTGCATGTCAGTGCATGTGCGTGTGACTACTACTACTGCGCGGATCACTCGCGCCCAT
```
This can be a useful way to store lots of DNA, RNA, or protein sequences for analysis.

**Problem:** Copy the above sequences into a text file in your Jupyter notebook directory. Import the sequences from the file and find the [Hamming distance](https://en.wikipedia.org/wiki/Hamming_distance) (number of positions at which the nucleotides differ) between the two DNA sequences.

In [20]:
# I'm going to write the sequences to a file directly from this notebook 
# instead of manually copy/pasting them into a file
filename = 'dna_sequences.fasta'

with open(filename, 'w') as outfile:
    outfile.write('>SEQ_ID_1\n')
    outfile.write('GCATCAGTACTGACTATGTTGTTACGTGCAGTCGATGACTCACAGACTGCATGTTTTTCTATTGTAACACATGCATACATGTACTGTTACGTGTACGTGTGCATGTCATCGTTGATCTGCATGTCATTGCATGTACGTGTGACTACTACTACTGCGCGGATAACTCGCGCCCAT\n')
    outfile.write('>SEQ_ID_2\n')
    outfile.write('GCATAAGTACTGACTATGTTGTTACGTGCAGTCGCTGACTCACAGACTGCATGTTTTTCTATTGTAACACATGCATACATGTACTGTTACGTGTACGCGTGCATGTCATCGTTGATCTGCATGTCAGTGCATGTGCGTGTGACTACTACTACTGCGCGGATCACTCGCGCCCAT\n')


**Solution:** I wrote two functions: one to read the sequences in the file into a list, and one to calculate the Hamming distance between two sequences. Then I called my `hamming_dist` function on the two sequences in my list.

In [21]:
def read_fasta(filename):
    '''
    Given: a fasta file name (str)
    Return: a list of sequences (list)
    '''
    # Initialize an empty list of sequences
    seqs = []
    
    # Read the file
    with open(filename, 'r') as infile:  # Open the file for reading
        for line in infile:              # Iterate through each line in the file
            
            # If this line begins with >, skip it (does not contain a sequence)
            if line[0] == '>':
                continue
                
            # Otherwise, save this line as the seq
            else:
                seq = line.strip()       # Strip deletes the trailing newline character
                seqs.append(seq)         # Append this sequence to the list of sequences
    
    # Return the list of sequences
    return seqs
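
The reader above assumes each sequence sits on a single line, which matches the file written earlier. FASTA files often wrap long sequences across several lines, so here is a sketch of one way to handle that case. The function name `read_fasta_multiline`, the example IDs, and the temporary file are all hypothetical and exist only for illustration; the function returns a dict keyed by sequence ID rather than a list.

```python
import os
import tempfile

def read_fasta_multiline(filename):
    '''
    Given: a fasta file name (str)
    Return: a dict mapping each sequence ID to its full sequence (dict)
    '''
    records = {}
    seq_id = None
    with open(filename, 'r') as infile:
        for line in infile:
            line = line.strip()
            if not line:                  # Skip blank lines
                continue
            if line.startswith('>'):      # Header line starts a new record
                seq_id = line[1:]
                records[seq_id] = ''
            elif seq_id is not None:      # Sequence line: append to current record
                records[seq_id] += line
    return records

# Write a small wrapped example to a temporary file, then read it back
with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, 'example.fasta')
    with open(path, 'w') as outfile:
        outfile.write('>SEQ_A\nACGT\nTTGA\n>SEQ_B\nGGCC\n')
    records = read_fasta_multiline(path)

print(records)
```

Here the two lines under `SEQ_A` are joined into the single sequence `ACGTTTGA`.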

In [22]:
def hamming_dist(seq1, seq2):
    '''
    Given: two sequences of equal length (str)
    Return: the number of differences between the two sequences (int)
    '''
    # Initialize the hamming distance to be zero
    dist = 0
    
    # Check the length of both sequences
    len1 = len(seq1)
    len2 = len(seq2)
    
    # If the lengths are not equal, raise an error instead of returning a distance
    if len1 != len2:
        raise ValueError('The sequence lengths are not equal!')
    
    for i in range(len1):                # Iterate through each position of the sequences
        if seq1[i] != seq2[i]:           # If the value at this position is not the same in both seqs
            dist += 1                    # Increment the hamming distance
    
    # Return the hamming distance
    return dist
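
The same position-by-position comparison can also be written with `zip()`, which pairs up the characters of the two sequences so no index variable is needed. This is only a sketch with a hypothetical name, `hamming_dist_zip`; it raises a `ValueError` when the lengths differ, because `zip()` would otherwise stop quietly at the end of the shorter sequence.

```python
def hamming_dist_zip(seq1, seq2):
    '''
    Given: two sequences of equal length (str)
    Return: the number of differences between the two sequences (int)
    '''
    if len(seq1) != len(seq2):
        raise ValueError('The sequence lengths are not equal!')

    # a != b is True (counted as 1) wherever the two sequences disagree
    return sum(a != b for a, b in zip(seq1, seq2))

print(hamming_dist_zip('GATTACA', 'GACTATA'))
```

The two short example strings differ at the third and sixth positions, so the call prints 2.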

In [27]:
filename = 'dna_sequences.fasta'

# Import the sequences from the file
dna_seqs = read_fasta(filename)

# Compute the hamming distance between the two DNA sequences
dist = hamming_dist(dna_seqs[0], dna_seqs[1])

print('There are {} nucleotide differences.'.format(dist))

There are 6 nucleotide differences.


**Problem:** Transcribe and then translate the DNA sequences to protein. Now calculate the Hamming distance (number of positions at which the amino acids differ) between the two protein sequences.

**Solution:** I wrote my hamming_dist function above in a general way so that it can accept any type of sequence (DNA, RNA, protein, random letters). So all I need to do here is transcribe then translate the sequences and rerun the hamming_dist function on the protein sequences. I can do all of this using previously defined functions, which makes life much easier!

In [28]:
# Initialize an empty list of protein sequences
prot_seqs = []

# Iterate through each DNA sequence
for dna in dna_seqs:
    
    # Transcribe the DNA to RNA
    rna = transcribe(dna)
    
    # Translate the RNA to protein
    prot = translate(rna)
    
    # Append the new protein sequence to the list of protein seqs
    prot_seqs.append(prot)

    
# Compute the hamming distance between the two protein sequences
dist = hamming_dist(prot_seqs[0], prot_seqs[1])

print('There are {} amino acid differences.'.format(dist))

There are 4 amino acid differences.
