# COMP90016 - Assignment 1
Version 1 Last edited 13/3/2024

## Semester 1, 2024

In [None]:
NAME = "Keziah Tikno"

ID = "1319716"


This assignment should be completed by each student individually. Make sure you read this entire document, and ask for help if anything is not clear. Any changes or clarifications to this document will be announced via the LMS.

Please make sure you review the University's rules on academic integrity: https://academicintegrity.unimelb.edu.au/

Your submission must be your own work. Do not copy material from other students, from the internet or from AI tools. 

Your completed notebook file containing all your answers will be turned in via Canvas. Please also submit an HTML file with the output cleared.

To complete the assignment, finish the tasks in this notebook.

The tasks are a combination of writing your own code, interpreting the results and answering related short-answer questions.

In some cases, we have provided test input and test output that you can use to try out your solutions. These tests are just samples and are **not** exhaustive - they may warn you if you've made a mistake, but they are not guaranteed to. It's up to you to decide whether your code is correct.

**Remember to save your work early and often.**

## Marking

Cells that must be completed to receive marks are labelled like this:

`# -- GRADED CELL (1 mark) - complete this cell --`

Some graded cells are code cells, in which you must complete the code to solve a problem. Other graded cells are markdown cells, in which you must write your answers to short-answer questions. 

You will see the following text in graded code cells:

```
# YOUR CODE HERE
raise NotImplementedError()
```

***You must remove the `raise NotImplementedError()` line from the cell, and replace it with your solution.***

Only add answers to graded cells. If you want to import a library or use a helper function, this must be included in a graded cell.

Only graded cells will be marked.
**Don't make changes outside graded cells, and don't add or remove cells from the notebook**.

>Word limits, where stated, will be strictly enforced. Answers exceeding the limit **will not be marked**.

>Run-time limits will be imposed for each coding question. The run-time of a code cell can be calculated by including `%time` at the top of your cell. Cells exceeding the run-time limit **will not be marked**. The run-time limits only apply to test cases that are included in this document.
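
As a rough cross-check on run-time, the wall-clock time of a call can also be measured with the standard library. The sketch below is an illustration only: `slow_sum` is a hypothetical stand-in for a solution function and is not part of this assignment. In IPython, the line magic `%time` times the single statement written on its own line, while the cell magic `%%time` times a whole cell.

```python
import time


def slow_sum(n):
    """Hypothetical stand-in for a solution function."""
    return sum(i * i for i in range(n))


# Record the clock before and after the call, then take the difference
start = time.perf_counter()
result = slow_sum(100_000)
elapsed = time.perf_counter() - start

print(f"result = {result}, elapsed = {elapsed:.4f} s")
```

The measured time depends on the machine and on what else is running, so treat it as an estimate rather than an exact figure.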

No marks are allocated to commenting in your code. We do, however, encourage efficient and well-commented code.

The total marks for the assignment add up to 100, and it will be worth 10% of your overall subject grade.

Part 1: 35 marks

Part 2: 35 marks

Part 3: 30 marks


## Submitting

Before you turn this assignment in, make sure everything runs as expected. First, **restart the kernel** (in the menubar, select Kernel$\rightarrow$Restart) and then **run all cells** (in the menubar, select Cell$\rightarrow$Run All).

Make sure you fill in any place that says `YOUR CODE HERE` or "YOUR ANSWER HERE", as well as your name and student ID at the top of this notebook.


Your completed notebook file containing all your answers must be turned in via LMS in `.ipynb` format.
You must also submit a copy of this notebook in `html` format with the output cleared.
You can do this by using the `clear all output` option in the menu.

Your submission should include **only two** files with names formatted as: **Assignment_1.ipynb** and **Assignment_1.html**

## Overview

In this assignment, you will answer questions about working with short reads, sequence motifs and codon bias.

You will use the `biopython` library in your functions. You may want to refer to sections of the `biopython` documentation for additional help (https://biopython.org/wiki/Documentation). In addition to `biopython` and standard Python 3 functions and methods, you may also use any other library we have used in Computational Genomics, including `collections`, `numpy`, `pandas`, `math`, `itertools`, `seaborn` and `matplotlib`.

## Part 1: Working with short reads

### Setup

In [None]:
import os
import requests
from IPython.display import HTML

# Function to get data. DO NOT MODIFY!
def fetch_file(url, outpath='.'):
    response = requests.get(url)
    if response.status_code == 200:
        print('File found!')
        # Get the filename from the URL
        filename = os.path.basename(url).split('?', 1)[0]
        # Construct the filepath using the specified directory and filename
        filepath = os.path.join(outpath, filename)
        # Create the directory if it doesn't exist
        if not os.path.exists(outpath):
            print(f'Creating output dir: {outpath}')
            os.makedirs(outpath)
        # Check if the file already exists in the specified directory
        if os.path.exists(filepath):
            print(f'{filename} already exists in {outpath}. Skip download.')
        else:
            with open(filepath, 'wb') as f:
                f.write(response.content)
                f.close()
            print(f'Saved to: {filepath}')
    else:
        print(f'File not found: Code {response.status_code}')

In [None]:
# Make the notebook pretty
HTML(requests.get('https://raw.githubusercontent.com/melbournebioinformatics/COMP90016/main/data/2023/style/custom.css').text)

In [None]:
# Fetch assignment data
url = 'https://github.com/melbournebioinformatics/COMP90016/blob/main/data/2023/Assignment_01/data/comp90016_assignment_1.fastq.gz?raw=true'
fetch_file(url)

First, we read in a read set from the `comp90016_assignment_1.fastq.gz` file. Note that converting a readset into a list of `biopython` objects makes it easier to handle.

In [None]:
import gzip
from Bio import SeqIO, SeqRecord, Seq

In [None]:
fname = 'comp90016_assignment_1.fastq.gz'

# Our fastq file is compressed using gzip. 
# We must open it before SeqIO can read the contents
with gzip.open(fname, "rt") as handle:
    readset = list(SeqIO.parse(handle, "fastq"))
# Check the first read in the readset to ensure file read correctly
print(readset[0])

### Questions
In the cells below, complete the following tasks:

<div class="info">
<h3> Question 1.1 </h3>

(5 marks)  
 
<b>Challenge:</b> Write a Python function to compute the percentage of reads that are longer than n bases.

- [ ] Input: a list of Bio.SeqRecord.SeqRecord objects, and a length value n.
- [ ] Output: Return a floating-point number in the range 0 - 100, rounded to 2 decimal places.
- [ ] Assume n is a positive integer.
- [ ] If the input list is empty, return None.
</div>

In [None]:
# GRADED CELL 1.1 (5 marks, max 1 min run-time)
%time
def percent_reads_len(reads, n):
    """
    Compute the percentage of reads that are longer than n bases.  
    Assume reads is a list of Bio.SeqRecord.SeqRecord objects containing DNA sequences. 
    Assume n is a positive integer.
    Return a floating-point number in the range 0 - 100.
    If the input list is empty, return None.
    """
    
    # YOUR CODE HERE
    if len(reads) == 0:
        return None

    # Count the reads that are strictly longer than n bases
    n_longer = sum(1 for read in reads if len(read) > n)

    return round(n_longer / len(reads) * 100, 2)

In [None]:
# Test your function in this cell
# First we will create a list of dummy reads
demo_reads_a = [SeqRecord.SeqRecord(Seq.Seq('ATATA'), 'ERR024571.2', '', ''),
             SeqRecord.SeqRecord(Seq.Seq('GCGCGCGC'), 'ERR024571.2', '', '')]

print(percent_reads_len(demo_reads_a, 5)) # should return 50.0

print(percent_reads_len(readset, 72))

In [None]:
# --- AUTOGRADING CELL DO NOT EDIT ----


<div class="info">

<h3> Question 1.2 </h3>

(10 marks)

Suppose quality control metrics indicated that the bases on the 3' end of the reads were of unacceptable base quality.
    
<b>Challenge:</b> Write a Python function that trims a specified number of bases from the 3' end of reads in a list of reads, removes trimmed reads below a specified length threshold, and returns the new readset as a list of SeqRecord objects.

- [ ] Input:  
    - reads = a list of SeqRecord objects
    - trim_len = number of bases to remove (int)
    - min_length = Minimum length threshold (int)
- [ ] Remove any read where trim_len is greater than or equal to the read length.
- [ ] Remove trim_len bases from the 3' end of each read.
- [ ] Remove any read that is *shorter* than min_length bases long after trimming.
- [ ] Return the list of trimmed reads.
- [ ] If the input list is empty, return None.
- [ ] If no reads remain after trimming, return an empty list.
    
Assumptions:
- [ ] Assume reads is a list of Bio.SeqRecord.SeqRecord objects, and each object contains a sequence that is a Bio.Seq.Seq object.   
- [ ] Assume trim_len and min_length are positive integers.


</div>
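
The trimming and filtering rules above can be sketched on plain Python strings, which stand in here for the read sequences. This is a minimal illustration under that assumption, not the required solution; `trim_3prime` is a hypothetical helper name. The last two lines show a slicing pitfall that is worth knowing even though `trim_len` is assumed to be positive.

```python
def trim_3prime(sequence, trim_len):
    """Drop trim_len characters from the right-hand (3') end of a string."""
    return sequence[:len(sequence) - trim_len]


reads = ["AA", "GAAATCGG", "TTATTT"]  # plain strings standing in for reads
trim_len, min_length = 2, 5

# Keep a read only if it survives trimming and is still long enough afterwards
kept = [
    trim_3prime(r, trim_len)
    for r in reads
    if trim_len < len(r) and len(r) - trim_len >= min_length
]
print(kept)  # ['GAAATC']

# Pitfall: a negative-index slice returns an empty string when trim_len is 0
print(repr("GAAATCGG"[:-0]))       # ''
print(trim_3prime("GAAATCGG", 0))  # GAAATCGG
```

`Bio.SeqRecord.SeqRecord` objects can be sliced in the same way, and the slice is itself a new `SeqRecord`.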

In [None]:
# GRADED CELL 1.2 (10 marks, max 1 min run-time)
%time
def preprocess_reads(reads, trim_len, min_length):
    """
    Remove any read where trim_len is greater than or equal to the read length.
    Remove trim_len bases from the 3' end of each read.
    Remove any read that is shorter than min_length bases long after trimming.
    Assume trim_len and min_length are positive integers.
    Assume reads is a list of Bio.SeqRecord.SeqRecord objects.
    If the input list is empty, return None.
    Return the processed readset as a list of Bio.SeqRecord.SeqRecord objects.
    """

    # YOUR CODE HERE
    if len(reads) == 0:
        return None

    filtered_reads = []
    for read in reads:
        # Skip reads that would be trimmed away entirely
        if trim_len >= len(read):
            continue

        # Slice the SeqRecord itself (not read.seq) so a SeqRecord is returned,
        # keeping the read id and per-base qualities alongside the sequence
        trimmed = read[:len(read) - trim_len]

        if len(trimmed) >= min_length:
            filtered_reads.append(trimmed)

    return filtered_reads

In [None]:
# Test your function in this cell
demo_reads_b = [SeqRecord.SeqRecord(Seq.Seq('AA')), SeqRecord.SeqRecord(Seq.Seq('GAAATCGG')), SeqRecord.SeqRecord(Seq.Seq('TTATTT'))]

# Should output a list with just one SeqRecord object containing the sequence GAAATC
print(preprocess_reads(demo_reads_b, 2, 5))
# Experiment with different settings. Does your function behave as expected?
len(preprocess_reads(readset, 15, 50))

In [None]:
# --- AUTOGRADING CELL DO NOT EDIT ----



<div class="question">
<h3>Question 1.3</h3>
    
(5 marks, max 50 words)

Suppose these reads were to be aligned to a reference genome. Explain the potential consequences of not removing very short reads.

</div>

#### -- GRADED CELL (5 marks) - complete this cell -- 

Very short reads are more likely to match several locations in the reference genome equally well, so they map ambiguously or to the wrong position. They are also more likely to be artifacts or low-quality sequence. Both effects add noise to downstream analyses and make biological interpretation harder.
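
The mapping ambiguity of short reads can be illustrated with a toy exact-match search. The sketch below is only an illustration: the reference string is made up, `find_all` is a hypothetical helper, and real aligners use indexed, mismatch-tolerant algorithms rather than a plain string search.

```python
def find_all(reference, read):
    """Return every start position where read matches reference exactly."""
    positions = []
    start = reference.find(read)
    while start != -1:
        positions.append(start)
        # Resume one base later so overlapping hits are also found
        start = reference.find(read, start + 1)
    return positions


# Hypothetical toy reference, made up for illustration
reference = "ACGTACGTTTGACGTACCA"

print(find_all(reference, "ACGT"))        # [0, 4, 11]
print(find_all(reference, "ACGTACGTTT"))  # [0]
```

The 4-base read has three candidate origins, while the 10-base read has only one.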

<div class="info">
    
<h3>Question 1.4</h3>
    
(10 marks)

<b>Challenge:</b> Write Python code to produce a histogram showing the distribution of read lengths in a set of reads.

- [ ] Input: The "readset" list of Bio.SeqRecord.SeqRecord objects
- [ ] Trim reads with settings: reads = readset, trim_len = 15 and min_length = 50.
- [ ] The plot should be produced inline, in the Jupyter notebook. 
- [ ] Choose an appropriate bin-width.
- [ ] Label your axes appropriately (include units).
- [ ] Add a vertical line marking the median read length.

Your code does not need to include a function.

</div>

In [None]:
# Here's some Jupyter magic to render plots in the notebook
%matplotlib inline

# You may want to import some additional packages for building and formatting your plot (non-essential)
# Un-comment as required
import numpy as np
import matplotlib.pyplot as plt
#import seaborn
from collections import Counter
from numpy import median

In [None]:
# GRADED CELL 1.4 (10 marks, max 1 min run-time)
%time
# Use this cell to make your histogram.

# YOUR CODE HERE
trimmed_reads = preprocess_reads(readset, 15, 50)
read_lengths = [len(read) for read in trimmed_reads]
median_length = np.median(read_lengths)

# Use bins that are 2 bases wide and span the observed range of read lengths
bin_width = 2
bins = np.arange(min(read_lengths), max(read_lengths) + bin_width + 1, bin_width)

plt.hist(read_lengths, bins=bins, edgecolor='black')
plt.title("Distribution of trimmed read lengths")
plt.xlabel("Trimmed read length (bases)")
plt.ylabel("Number of reads")
# Mark the median read length with a vertical line
plt.axvline(x=median_length, linewidth=1, color='r', label=f"Median = {median_length:g} bases")
plt.legend()
plt.show()

<div class="question">

<h3>Question 1.5</h3>

(5 marks, max 50 words)

Two different strains of a disease-causing DNA virus have been isolated on a fish farm. Both strains infect the fish, but one strain appears to cause more severe symptoms. A computational genomics expert suggests sequencing the whole genome of both strains with Illumina sequencing, then comparing the sequences to identify single-base differences (genetic variants) between the two strains. It is important that the sequencing is accurate so that the differences are likely to be genuine and not errors.
    
Explain why Illumina is an appropriate sequencing platform for this case study.
    
</div>

    

#### -- GRADED CELL (5 marks) - complete this cell -- 

Illumina sequencing has a low per-base error rate (below 1%) and high throughput, giving high coverage depth for a small viral genome. Each base is supported by many reads, increasing confidence that single-base differences between the strains are genuine. It is also cost-effective for two genomes.

## Part 2: Sequence motifs
Sequence motifs are short, recurring patterns in nucleic-acid sequences. Many are involved in important biological functions.

### Setup

In [None]:
# Set up two DNA sequences to test your code on. Do not change these sequences.
linear_seq = Seq.Seq('TTACAGTGATTATGAAAACTTTGCGGGGCATGGCTACGACTTGTTCAGCCACGTCCGAGGGCAGAAACCTCGAGGGGTTTGTATGTTCAGCTATCTTCTACCCATCCCCGGAGGTTAAGTACGAGGGGAGATGCGGAAGAGGCTCTCGATCATCCCGTGGGACATCAACCTTTCCCTTGATAAAGCACCCCGCTCGGGTA')
circular_seq = Seq.Seq('TGGCAGAGAGAACGCCTTCTGAATTGTGCTATCCTTCGACCTTATCAAAGCTTGCTACCAATAATTAGGATTATTGCCTTGCGACAGACCTCCTACTCACACTGCCTCACATTGAGCTAGTCAGTGAGCGATTAGCTTGACCCGCTCTCTAGGGTCGCGAGTACGTGAGCTAGGGCTCCGGACTGGGCTATATAGTCGAG')

# Set up a dictionary of sequence motifs. Do not change this dictionary.
motif_dict = {'motif_a': Seq.Seq('TACAGTG'), 
              'motif_b': Seq.Seq('AGCTTGCT'), 
              'motif_c': Seq.Seq('ATATATAC'), 
              'motif_d': Seq.Seq('CGAGGGG'), 
              'motif_e': Seq.Seq('CGAGTG')}

In [2]:
a='TTACAGTGATTATGAAAACTTTGCGGGGCATG'
a[10::-1]

'TTAGTGACATT'

### Questions

<div class="info">

<h3>Question 2.1</h3>

(10 marks)

<b>Challenge:</b> Write a Python function to count the number of times sequence motifs are present in a DNA sequence. 

- [ ] Input:
    - A sequence to search for the specified motifs (Bio.Seq.Seq)
    - A dictionary of motifs with motif names (str) as keys and Bio.Seq.Seq objects as values.     
- [ ] Assume seq is a Bio.Seq.Seq object. 
- [ ] For each motif count the number of exact matches in the sequence.
- [ ] Overlapping motifs should be considered.
- [ ] Return a pandas DataFrame with each motif represented as a row. 
    - The first column should contain motif names (str),
    - The second column should contain counts of exact matches (int). 
    - The column names should be “Motif” and “Counts”, in that order.
- [ ] If either seq or motifs are empty, return None.

</div>
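
The requirement to consider overlapping motifs matters because Python's built-in `str.count` only counts non-overlapping occurrences. The sketch below shows the difference on a plain string, which stands in for a sequence; `count_overlapping` is a hypothetical helper name used for illustration only.

```python
def count_overlapping(sequence, motif):
    """Count exact matches of motif in sequence, allowing overlaps."""
    k = len(motif)
    # Test every window of length k, one per start position
    return sum(sequence[i:i + k] == motif for i in range(len(sequence) - k + 1))


sequence = "GAAAAT"
print(sequence.count("AA"))               # 2 (non-overlapping)
print(count_overlapping(sequence, "AA"))  # 3 (overlapping)
```

The `count` method of `Bio.Seq.Seq` is likewise non-overlapping, so a sliding window (or an overlap-aware method) is needed to meet this requirement.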

In [None]:
# GRADED CELL Question 2.1 (10 marks, max 1 min run-time)
%time
import pandas as pd

def motif_count(seq, motifs):
    """
    Count the number of times sequence motifs are present in a DNA sequence. 
    Overlapping motifs should be considered.
    Assume motifs is a dictionary with motif names as strings for keys and Bio.Seq.Seq objects for values. 
    Assume seq is a Bio.Seq.Seq object. 
    Return a pandas DataFrame with each motif represented as a row. 
    The first column should contain motif names as strings, the second column should contain integer counts of exact matches. 
    The column names should be “Motif” and “Counts”, in that order. 
    If either seq or motifs are empty, return None.
    """
    
    # YOUR CODE HERE
    if len(seq) == 0 or len(motifs) == 0:
        return None

    rows = []
    for name, motif in motifs.items():
        k = len(motif)
        # Slide a window of length k along seq; overlapping matches are each counted
        count = sum(1 for start in range(len(seq) - k + 1) if seq[start:start + k] == motif)
        rows.append((name, count))

    return pd.DataFrame(rows, columns=['Motif', 'Counts'])

In [None]:
# ~~ Test your function in this cell ~~

demo_seq = Seq.Seq('GTTGGATTCATGAAAGA')
demo_motifs = {'motif_a': Seq.Seq('GTTG'), 'motif_b': Seq.Seq('AAAA')}

# Should output a pandas dataframe with 2 rows and 2 columns. Row 1: motif_a, 1. Row 2: motif_b, 0
# Note: The rows are NOT named in this output.
'''
     Motif  Counts
0  motif_a       1
1  motif_b       0
'''
print(motif_count(demo_seq, demo_motifs)) 

# Test on linear seq
print(motif_count(linear_seq, motif_dict))


In [None]:
# --- AUTOGRADING CELL DO NOT EDIT ----




<div class="info">

<h3>Question 2.2</h3>

(15 marks)

A colleague is working on sequence motifs in circular DNA molecules. Circular DNA molecules can also be stored in FASTQ files and Bio.Seq.Seq objects, but care must be taken when programming so that the final base is treated as though it is adjacent to the first base. You decide to help your colleague by enhancing your function from 2.1.

<b>Challenge:</b> Write a Python function to count the number of times sequence motifs are present in a circular DNA sequence. 

The function should count the number of exact matches AND the number of near misses. A near miss is defined as a sequence that is the same length as a sequence motif but has a single base mismatch. 
    
    
- [ ] Input:
    - Assume motifs is a dictionary with motif names as strings for keys and Bio.Seq.Seq objects for values.
    - Assume seq_circular is a Bio.Seq.Seq object. 
- [ ] For each motif, count exact matches and near-miss matches (where the target sequence has 1 mismatch with the query motif).
- [ ] Treat the input sequence as circular. The last base is followed by the first base.
- [ ] Overlapping motifs should be considered separately.
- [ ] Return a pandas DataFrame with each motif represented as a row. 
    - The first column should contain motif names as strings,
    - The second column should contain integer counts of exact matches, 
    - The third column should contain integer counts of near misses. 
    - The column names should be “Motif”, “Match_counts” and "Near_miss_counts", in that order.
- [ ] If either of seq_circular or motifs are empty, return None.
    
</div>
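
One way to treat a sequence as circular is to append its first k-1 bases to its end, so that a window of length k can start at every position. The sketch below illustrates this on a plain string with the demo sequence used in the test cell; `circular_windows` and `hamming` are hypothetical names for this illustration, and other approaches (such as modular indexing) work equally well.

```python
def hamming(s1, s2):
    """Number of positions at which two equal-length strings differ."""
    assert len(s1) == len(s2)
    return sum(c1 != c2 for c1, c2 in zip(s1, s2))


def circular_windows(sequence, k):
    """Yield every length-k window of a circular sequence, one per start position."""
    # Appending the first k-1 bases lets windows near the end wrap around
    extended = sequence + sequence[:k - 1]
    for start in range(len(sequence)):
        yield extended[start:start + k]


sequence = "GTTGGATTCATGAAAGA"
windows = list(circular_windows(sequence, 4))
print(windows[-3:])  # ['AGAG', 'GAGT', 'AGTT']

# Exact matches have distance 0; near misses have distance 1
distances = [hamming(w, "AAAA") for w in windows]
print(distances.count(0), distances.count(1))  # 0 3
```

The last two windows wrap from the end of the sequence back to its start.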

In [None]:
# Import additional library
from collections import Counter 

# Helper function
# Use this to identify near miss instances of a reference motif i.e. containing a single base mismatch
def hamming_dist(s1, s2):
    """
    A helper function to calculate the hamming distance between two sequences.
    Assume the input sequences are strings.
    They should be of equal length in order to calculate hamming distance
    """
    assert len(s1) == len(s2)
    return sum(c1 != c2 for c1, c2 in zip(s1, s2))

In [None]:
# GRADED CELL Question 2.2 (15 marks, max 1 min run-time)
%time
def motif_count_circular(seq, motifs):
    """
    Count the number of times sequence motifs are present in a circular DNA sequence. 
    Counts the number of exact matches and the number of near misses. 
    A near miss is defined as a sequence that is the same length as a sequence motif but has a single base mismatch. 
    Assume motifs is a dictionary with motif names as strings for keys and Bio.Seq.Seq objects for values. 
    Assume seq is a Bio.Seq.Seq object. 
    Return a pandas DataFrame with each motif represented as a row. 
    The first column should contain motif names as strings, the second column should contain integer counts of exact matches, the third column should contain integer counts of near misses.
    The column names should be “Motif”, “Match_counts” and "Near_miss_counts".
    If either seq or motifs are empty, return None.
    """

    # YOUR CODE HERE
    if len(seq) == 0 or len(motifs) == 0:
        return None

    rows = []
    for name, motif in motifs.items():
        k = len(motif)
        # Append the first k-1 bases so that windows starting near the end wrap around
        extended = seq + seq[:k - 1]
        m_counts = 0
        nm_counts = 0

        # One window per start position in the circular sequence
        for start in range(len(seq)):
            dist = hamming_dist(extended[start:start + k], motif)
            if dist == 0:
                m_counts += 1
            elif dist == 1:
                nm_counts += 1

        rows.append((name, m_counts, nm_counts))

    return pd.DataFrame(rows, columns=['Motif', 'Match_counts', 'Near_miss_counts'])

In [None]:
# Test your function in this cell
# Should output a pandas dataframe with 2 rows and 3 columns that looks like this:
'''
     Motif  Match_counts  Near_miss_counts
0  motif_a             1                 0
1  motif_b             0                 3
'''
print(motif_count_circular(demo_seq, demo_motifs)) 

# Consider function behaviour on a circular sequence
print(motif_count_circular(circular_seq, motif_dict))


In [None]:
# --- AUTOGRADING CELL DO NOT EDIT ----


<div class="question">

<h3>Question 2.3</h3>

(10 marks, max 100 words)

Your colleague plans on running their analyses using the Spartan high-performance computing system. They ask your advice on requesting job resources. 
    
What could be the effect of requesting insufficient memory for a job? What could be the effect of requesting much more memory than is required for a job?
    
</div>

    

#### -- GRADED CELL (10 marks) - complete this cell -- 

Requesting insufficient memory typically causes the job to fail or be terminated early when it exceeds its allocation. The compute time and queue time already spent are wasted, and the job must be resubmitted with a larger request, which delays results. 

Requesting much more memory than required wastes resources: the excess is reserved for the job but unused, so other jobs cannot access it. This creates contention and bottlenecks for other users, and the job itself may wait longer in the queue until that much memory becomes available. 

## Part 3: Codon bias

### Setup
The frequency of occurrence of synonymous codons in coding DNA differs between species. Some codons that encode a particular amino acid are common, while others are rare. This is referred to as codon bias.

Each codon requires a different tRNA. The efficiency of translation is influenced by the availability of specific tRNAs. tRNA availability differs from one species to another.

### Questions

<div class="info">
     
<h3>Question 3.1</h3>

(10 marks)

<b>Challenge:</b> Write a Python function to calculate the codon usage frequencies from a set of DNA coding sequences. 

- [ ] Input: A list of DNA coding sequences (i.e. CDS) as Bio.Seq.Seq objects
- [ ] Assume the CDS sequences begin with a start codon, end with an in-frame stop codon and do not contain any introns. 
- [ ] Use the standard genetic code.
- [ ] Return a 2D dictionary, where the keys are one-letter amino-acid strings and the values are dictionaries containing codon frequencies. The inner dictionaries must have DNA codon strings as keys and codon usage frequencies as values (floating-point numbers between 0 and 1). 
- [ ] If coding_seqs is of length 0, return None.
- [ ] If no instances of any codon for an amino acid are observed then omit that amino acid from the output dict.

Note: The "codon usage frequency" for a codon is the total number of occurrences of that codon divided by the total number of codons that code for the same amino acid. 
    
</div>
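
The definition in the note above can be worked through on two short sequences. The sketch below uses plain strings and a deliberately tiny codon table that covers only the codons in this example (an assumption made to keep it self-contained; the full standard table has 64 entries). `codon_usage_sketch` is a hypothetical name for this illustration.

```python
from collections import Counter, defaultdict

# Only the codons needed for this example, taken from the standard genetic code
MINI_CODON_TABLE = {"ATG": "M", "TCG": "S", "TCC": "S",
                    "AAA": "K", "TAA": "*", "TAG": "*"}


def codon_usage_sketch(coding_seqs):
    """Frequency of each codon among all codons for the same amino acid."""
    counts = defaultdict(Counter)
    for seq in coding_seqs:
        # Step through the sequence one complete codon at a time
        for start in range(0, len(seq) - 2, 3):
            codon = seq[start:start + 3]
            counts[MINI_CODON_TABLE[codon]][codon] += 1
    return {
        aa: {codon: n / sum(codon_counts.values()) for codon, n in codon_counts.items()}
        for aa, codon_counts in counts.items()
    }


usage = codon_usage_sketch(["ATGTCGTAA", "ATGTCCAAATAG"])
print(usage["S"])  # {'TCG': 0.5, 'TCC': 0.5}
```

Serine is encoded once by TCG and once by TCC across the two sequences, so each codon has a usage frequency of 1/2.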

In [None]:
# GRADED CELL Question 3.1 (10 marks, max 1 min run-time)
%time
from collections import Counter
from collections import defaultdict

def codon_usage(coding_seqs):
    """
    Calculate the codon usage frequencies from a set of DNA coding sequences. 
    Assume coding_seqs is a list of Bio.Seq.Seq objects. 
    Assume the DNA sequences begin with a start codon, end with an in-frame stop codon and do not contain any introns. 
    Use the standard genetic code.
    Return a 2D dictionary, where the keys are one-letter amino-acid strings and the values are dictionaries containing codon frequencies. 
    The inner dictionaries must have DNA codon strings as keys and codon usage frequencies as values (floating-point numbers between 0 and 1). 
    The codon usage frequency for a codon is the total number of occurrences of that codon divided by the total number of codons that code for the same amino acid. 
    If coding_seqs is of length 0, return None.
    """
    
    # YOUR CODE HERE
    if len(coding_seqs) == 0:
        return None

    # Group every observed codon under the amino acid it encodes
    codons_by_aa = defaultdict(list)
    for seq in coding_seqs:
        for start in range(0, len(seq), 3):
            codon = seq[start:start + 3]
            # Convert to str so that dictionary keys are plain one-letter strings
            amino_acid = str(codon.translate())
            codons_by_aa[amino_acid].append(str(codon))

    # Divide each codon count by the total number of codons for that amino acid
    usage = {}
    for amino_acid, codons in codons_by_aa.items():
        counts = Counter(codons)
        usage[amino_acid] = {codon: count / len(codons) for codon, count in counts.items()}

    return usage

In [None]:
# ~~ Test your function in this cell ~~
demo_seqs = [Seq.Seq('ATGTCGTAA'), Seq.Seq('ATGTCCAAATAG')]

sequence_a = Seq.Seq('ATGGCTGAAGCCGCATCCCCAGCTTTTATAGAGTATCTCCCACGATCTGACCCGTTGCTCTGTATTATACACAAGGTTGGAGTCGGATGTGAGTCTTCTCACCGGAGACCCAAGACATAG')
sequence_b = Seq.Seq('ATGGTTCTCCTTTGGATCTTATTCGGCAAAAGCATCGCGGCGTCACTAGCTAGTTACGTTTTGAGGACGCTCGCAGATTCTGCCAACCAATCTTATACGAAAATACGGAGGCACGCGTAA')
sequence_c = Seq.Seq('ATGCTAGTGATCCCGTCGTGGGCTAGAGAGCGGCGAGAGTGGGGTTTGGAAATTCGCGACATTGAGCCAACTATGGCTAATGCCATGGGCGGATTATGGGGGTCACTCAGAACTGTATAA')
sequence_d = Seq.Seq('ATGCTACTTACCAAGAGATATCTTCTTAATACACTGATCCATAACGTCATTTCGCGGCTGAGATGGTGGGTGCATGAGCAAGTAATTGTGACTCCGCGGGTTTCGCCAAAGCAAACCTAA')
seqs = [sequence_a, sequence_b, sequence_c, sequence_d]

# Should return {'M':{'ATG':1.0}, 'S':{'TCG':0.5, 'TCC':0.5}, 'K':{'AAA':1.0}, '*':{'TAA':0.5, 'TAG':0.5}}
print(codon_usage(demo_seqs)) 

print(codon_usage(seqs))


In [None]:
# --- AUTOGRADING CELL DO NOT EDIT ----


<div class="question">
    
<h3>Question 3.2</h3>

(10 marks, max 100 words)
    
RNA viruses infect cells and utilise the molecular machinery of their host cell for transcription and translation. This allows the virus to make more copies of itself.
    
Explain why codon bias is an important factor in the potential transmission of RNA viruses between different species.

</div>


#### -- GRADED CELL (10 marks) - complete this cell -- 

RNA viruses depend on the host's translation machinery, including its tRNA pool, to make their proteins. Species differ in tRNA abundance, so each host translates some synonymous codons more efficiently than others. A virus whose codon usage matches the host's preferred codons is translated efficiently and replicates well. A virus that relies on codons that are rare in a new host is translated slowly, which limits replication. Successful transmission to a new species is therefore more likely when the virus's codon usage already resembles, or adapts to, the codon bias of that host.

<div class="info">
    
<h3>Question 3.3</h3>
    
(10 marks)

<b>Challenge:</b> Write a Python function which takes as input a list of coding sequences and a single letter amino-acid identifier. The function should calculate a codon usage frequency dict (as per Q3.1) and then output a bar plot of the codon usage frequencies for the specified amino acid. Each bar represents a codon.

- [ ] Input:
    - Assume coding_seqs is a list of Bio.Seq.Seq objects. 
    - Assume the DNA sequences begin with a start codon, end with an in-frame stop codon and do not contain any introns. 
    - Single letter amino acid identifier (str)
- [ ] Use the standard genetic code
- [ ] The codon usage frequency for a codon is the total number of occurrences of that codon divided by the total number of codons that code for the same amino acid. 
- [ ] If no codons for the specified AA are found in the CDS list, return None.
- [ ] If coding_seqs is of length 0, return None.
- [ ] Output: Generate a bar plot of codon usage frequency for the specified amino acid.
    

You may use your function from question 3.1 in your solution. 
    
Your function should produce a bar plot inline with appropriate labels, including the codon sequences and the amino acid encoded.
    
</div>

In [None]:
# Here's some Jupyter magic to render plots in the notebook
%matplotlib inline

# You may want to import some additional packages for building and formatting your plot (non-essential)
# Un-comment as required
import numpy as np
import matplotlib.pyplot as plt
#import seaborn
from collections import Counter

In [None]:
# GRADED CELL 3.3 (10 marks, max 1 min run-time)
%time
def codon_usage_plot(coding_seqs, aa):
    """
    Create a bar plot of the codon usage frequencies for a specific amino acid from a set of DNA coding sequences.  
    Assume coding_seqs is a list of Bio.Seq.Seq objects. 
    Assume the DNA sequences begin with a start codon, end with an in-frame stop codon and do not contain any introns.  
    Assume aa is a one-letter amino acid code.
    Use the standard genetic code.
    The codon usage frequency for a codon is the total number of occurrences of that codon divided by the total number of codons that code for the same amino acid. 
    If coding_seqs is of length 0, return None.
    If there are no codons encoding aa in coding_seqs, return None.
    """
    
    # YOUR CODE HERE
    if len(coding_seqs) == 0:
        return None

    freq_dict = codon_usage(coding_seqs)
    if not freq_dict or aa not in freq_dict:
        return None

    codons = list(freq_dict[aa].keys())
    freqs = list(freq_dict[aa].values())

    plt.bar(codons, freqs, width=0.5, align='center')
    # Print each frequency just above its bar
    for codon, freq in zip(codons, freqs):
        plt.text(codon, freq, f'{freq:.2f}', ha='center', va='bottom')
    plt.ylim(0, max(freqs) * 1.1)
    plt.xlabel("Codon")
    plt.ylabel("Codon usage frequency")
    plt.title(f"Codon usage frequencies for amino acid {aa}")
    plt.show()

In [None]:
# Test your function in this cell

# Should return a bar plot with two equal bars for 'TCG' and 'TCC'
print(codon_usage_plot(demo_seqs, 'S')) 

# Should return None
#print(codon_usage_plot(demo_seqs, 'P')) 

print(codon_usage_plot(seqs, 'L'))

# END OF ASSIGNMENT

