# COMP90016 - Assignment 1



Version 1. Last edited 19/4/2020.

## Sample solutions

Well done on completing assignment 1! In general, the solutions we received show that you have been doing a great job keeping up with the subject content and applying it to answer questions.

Here are some of the most common mistakes.
- Read alignment is different from de novo assembly. Think about what applications might require each one.
- For coding questions, read the instructions carefully. Some students missed boundary cases (be careful with inequalities).
- Please only output the result specified in the instructions. Remove all internal print statements before submitting.
- If your function works on the demo reads, it is not guaranteed to work on all test cases. Try making up some test cases of your own.

The sample solutions below provide a guide for students to compare to their own answers. These solutions are not the most efficient or succinct; they are written to be simple and easy to understand.

## Semester 1, 2020

This assignment should be completed by each student individually. Make sure you read this entire document, and ask for help if anything is not clear. Any changes or clarification to this document will be announced via the LMS.

Please make sure you are aware of the University's rules on academic honesty and plagiarism, which are very strict: https://academichonesty.unimelb.edu.au/

Make sure you do not copy any code either from other students or from the internet. This is considered plagiarism. It is generally a good idea to avoid looking at any solutions as you may find it surprisingly difficult to generate your own solution to the problem once you have seen somebody else's.

Your completed notebook file containing all your answers will be turned in via LMS. No other files or formats will be accepted - **only upload the completed .ipynb file.**

## Overview

To complete the assignment you will need to finish the tasks in this notebook. There are multiple tasks that are connected in a logical order.

The tasks combine writing your own code, which uses library implementations to perform analyses, with interpreting the results in short-answer format.

>Word limits, where stated, will be strictly enforced! Answers exceeding the limit will **NOT** be marked.

In some cases, we have provided test input and test output that you can use to try out your solutions. These tests are just samples and are **not** exhaustive - they may warn you if you've made a mistake, but they are not guaranteed to. It's up to you to decide whether your code is correct.

## Marking

Cells that must be completed to receive marks are clearly labeled. Some cells are code cells, in which you must complete the code to solve a problem, and some are markdown cells, in which you must write your answers to short-answer questions. Please only add to graded cells; **do not remove or modify any text in graded cells** unless specified otherwise (please don't remove text like this: # ~~ GRADED CELL (1 mark) - complete this cell ~~ ).

No marks are allocated to commenting in your code. Likewise, no marks are allocated to the speed or complexity of your code. We do, however, encourage efficient and well-commented code.

Your code will be tested on the comp90016_assignment_1.fastq file as well as additional test cases. Your code must work on any correctly formatted FASTQ file.

The total marks for the assignment add up to 10, and it will be worth 10% of your overall subject grade. The mark breakdown is as follows:

* 5 marks for coding tasks
* 5 marks for short answer tasks

## Background and data

In this assignment, you will use functions in the `skbio` library to parse a *FASTQ* file. You will then use the data to perform some basic analyses. You may want to refer to sections of the `skbio` documentation for additional help (http://scikit-bio.org/docs/0.5.1/generated/skbio.sequence.DNA.html). You may use functions in the `skbio` library for any of the coding questions. Aside from `skbio` functions, please only use standard Python 3 functions and methods.

## Task 0 - setup

We begin by importing the `skbio` library. The `skbio.io.read` function reads the *FASTQ* file, applying an offset of 33 to decode the Phred quality scores (quality scores are not used in this assignment). The function returns a generator object, which can be used to loop through the file or simply converted to a list. Read https://realpython.com/introduction-to-python-generators/ to better understand generators and iterators in Python (this is not necessary to complete the assignment).
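
To make the offset of 33 concrete, here is a minimal sketch of decoding a Phred+33 quality string by hand. The helper names `decode_phred33` and `error_probability` and the example quality string are made up for illustration; they are not part of `skbio`, which handles this conversion for you when `phred_offset = 33` is passed.

```python
def decode_phred33(quality_string):
    """Convert a Phred+33 encoded quality string to a list of integer scores."""
    # Each character's ASCII code minus the offset (33) gives the Phred score.
    return [ord(char) - 33 for char in quality_string]


def error_probability(score):
    """Convert a Phred score Q to the estimated error probability 10^(-Q/10)."""
    return 10 ** (-score / 10)


# Hypothetical quality string: '!' is ASCII 33, '5' is ASCII 53, 'I' is ASCII 73.
scores = decode_phred33("!5I")
print(scores)  # [0, 20, 40]
```

A score of 20 corresponds to an estimated 1 in 100 chance that the base call is wrong, and a score of 40 to 1 in 10,000.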

The *FASTQ* file you will be using can be downloaded from the LMS. Download the file and place it in the same folder where this notebook is. **DO NOT** rename the file. 

In [None]:
# Import the skbio library.
import skbio

In [None]:
# Read in the FASTQ file to produce a generator object named registry.
fname = 'comp90016_assignment_1.fastq'
registry = skbio.io.read(fname, format = 'fastq', phred_offset = 33)
registry

In [None]:
# Append the reads from registry to a list named readset.
readset = []
for r in registry:
    readset.append(r)

# Compute the number of reads in the readset. Should give an output of 783.
len(readset)
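
One consequence of `registry` being a generator is that it can only be looped over once. The sketch below illustrates this with a toy generator; `toy_reads` is a hypothetical stand-in for the object returned by `skbio.io.read` and yields plain strings rather than `skbio.sequence.DNA` objects.

```python
def toy_reads():
    """Hypothetical stand-in for a generator of reads."""
    for seq in ["ACGT", "GGCA", "TTAG"]:
        yield seq


gen = toy_reads()
first_pass = list(gen)   # Consumes every item from the generator.
second_pass = list(gen)  # The generator is now exhausted, so this is empty.

print(len(first_pass))   # 3
print(len(second_pass))  # 0
```

This is why the reads are copied into the `readset` list: a list can be reused in every later task, whereas the generator would need to be recreated by reading the file again.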


The resulting list is a list of `skbio.sequence.DNA` objects, which are used to store DNA sequences. As shown below, each object stores the sequence ID, the sequence qualities as positional information, and basic statistics such as the sequence length.

In [None]:
# View an individual skbio.sequence.DNA object from the readset list. 
# Try a different index to view a different read.
readset[0]

## Part 1 - Working with short reads

In the cells below, complete the following tasks:

1.1. (1 mark) Write a Python function to compute the percentage of reads that are longer than n bases. Return a floating-point number in the range 0 - 100. Assume n is a positive integer. Assume the input reads are a non-empty list of skbio.sequence.DNA objects.


1.2. (2 marks) Suppose quality control metrics indicated that the bases on the tail end of the reads were of unacceptable quality. Write a Python function that performs the following tasks and returns the new readset as a list of skbio.sequence.DNA objects. Assume trim and min_length are positive integers. Assume the input reads are a non-empty list of skbio.sequence.DNA objects.

    - Remove any read where trim is greater than or equal to the read length.
    - Remove trim bases from the end of each read. 
    - Remove any read that is less than min_length bases long after trimming.


1.3. (1 mark) Suppose these reads were to be aligned to a reference genome. Why would removing very short reads be useful?


In [None]:
# ~~ GRADED CELL (1 mark) - complete this cell ~~
def percent_reads_len(reads, n):
    """
    Compute the percentage of reads (skbio DNA sequences) that are longer than n bases. 
    Assume n is a positive integer.
    Assume the input reads are a non-empty list of skbio.sequence.DNA objects.
    Return a floating-point number in the range 0 - 100. 
    """
    # Initialise a variable to count the number of reads longer than n bases.
    long_reads = 0
    # Iterate through the reads and increase the counter if the read length is greater than n.
    for read in reads:
        if len(read) > n:
            long_reads += 1
    # Return the percentage of reads longer than n bases.
    return long_reads/len(reads) * 100

In [None]:
# ~~ Test your function in this cell ~~
demo_reads_a = [skbio.sequence.DNA('ATATA'), skbio.sequence.DNA('GCGCGCGC')]
print(percent_reads_len(demo_reads_a, 5)) # should give 50.0
print(percent_reads_len(readset, 99))
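
The boundary case mentioned in the common mistakes ("longer than" means strictly greater than n) can be checked with a few hand-made inputs. The sketch below uses a compact variant with the hypothetical name `percent_longer_than` and plain strings as stand-ins for `skbio.sequence.DNA` objects, on the assumption that only `len()` is needed.

```python
def percent_longer_than(reads, n):
    """Percentage of reads strictly longer than n bases (illustrative variant)."""
    # Each comparison is True (1) or False (0), so sum() counts the long reads.
    return sum(len(read) > n for read in reads) / len(reads) * 100


toy = ["ATATA", "GCGCGCGC"]  # Lengths 5 and 8.
print(percent_longer_than(toy, 4))  # 100.0
print(percent_longer_than(toy, 5))  # 50.0 (the read of length 5 is not longer than 5)
print(percent_longer_than(toy, 8))  # 0.0
```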

In [None]:
# ~~ GRADED CELL (2 marks) - complete this cell ~~
def preprocess_reads(reads, trim, min_length):
    """
    Preprocesses a readset (list of skbio DNA sequences).
    Remove any read where trim is greater than or equal to the read length.
    Remove trim bases from the end of each read.
    Remove any read that is less than min_length bases long after trimming.
    Assume trim and min_length are positive integers.
    Assume the input reads are a non-empty list of skbio.sequence.DNA objects.
    Return the processed readset as a list of skbio.sequence.DNA objects.
    """
    # Initiate a list to contain the preprocessed reads.
    preprocessed_reads = []
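    # Iterate through the reads and add a trimmed read to preprocessed_reads if the length after trimming
    # is greater than or equal to the minimum length. Because min_length is positive, this single check
    # also removes any read where trim is greater than or equal to the read length.
    for read in reads:
        if len(read) - trim >= min_length:
            preprocessed_reads.append(read[:-trim])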
    return preprocessed_reads

In [None]:
# ~~ Test your function in this cell ~~
demo_reads_b = [skbio.sequence.DNA('AA'), skbio.sequence.DNA('GAAATCGG'), skbio.sequence.DNA('TTATTT')]
print(preprocess_reads(demo_reads_b, 2, 5)) # should give only one sequence, 'GAAATC' as an skbio.sequence.DNA object, in a list.
print(preprocess_reads(readset, 10, 50))
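
As an extra check on the inequalities, the sketch below runs the same filtering logic on inputs chosen to sit exactly on the boundaries. It uses the hypothetical name `preprocess_plain` and plain strings as stand-ins for `skbio.sequence.DNA` objects, assuming only `len()` and slicing are needed.

```python
def preprocess_plain(reads, trim, min_length):
    """Trim `trim` bases from the end of each read and drop reads that end up too short."""
    # A read is kept only if its trimmed length is at least min_length.
    return [read[:-trim] for read in reads if len(read) - trim >= min_length]


# 'TTATTTA' has length 7, so it is exactly 5 bases long after trimming 2 and is kept.
print(preprocess_plain(["AA", "GAAATCGG", "TTATTTA"], 2, 5))  # ['GAAATC', 'TTATT']

# A read whose length equals trim is removed, even with the smallest min_length.
print(preprocess_plain(["AA"], 2, 1))  # []
```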

~~ GRADED CELL (1 mark) - complete this cell ~~

Suppose these reads were to be aligned to a reference genome. Why would removing very short reads be useful? **50 words max**

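Shorter reads are more likely to map to multiple regions of the reference genome, so their true origin is ambiguous. Removing them reduces the number of multi-mapping reads, which improves alignment run time and the reliability of the resulting alignments.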
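
The multi-mapping effect can be illustrated with a toy example. The reference string and queries below are invented for illustration, and `count_occurrences` is a hypothetical helper that counts exact matches only; real aligners also tolerate mismatches and gaps.

```python
def count_occurrences(reference, query):
    """Count exact (possibly overlapping) occurrences of query in reference."""
    return sum(
        reference[i:i + len(query)] == query
        for i in range(len(reference) - len(query) + 1)
    )


reference = "ACGTACGTTTACGAACGT"  # Made-up toy reference.
print(count_occurrences(reference, "ACG"))    # 4
print(count_occurrences(reference, "ACGTT"))  # 1
```

The 3-base query matches four locations, while extending it by two bases leaves a single, unambiguous location.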

## Part 2 - K-mer analysis

Please consider **overlapping** k-mers for the following questions.
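
As a small illustration of what overlapping means here, consecutive k-mers start one base apart, so a read of length L contains L - k + 1 of them. The helper name `overlapping_kmers` is hypothetical and the input is a plain string.

```python
def overlapping_kmers(sequence, k):
    """Return every k-mer in sequence, sliding the window one base at a time."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


kmers = overlapping_kmers("ATTCG", 3)
print(kmers)       # ['ATT', 'TTC', 'TCG']
print(len(kmers))  # 3, which is 5 - 3 + 1
```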

In the cells below, complete the following tasks:

2.1. (1 mark) Write a Python function to compute the number of distinct k-mers in the total readset for a given value of k. Assume the value of k is greater than or equal to 1, and less than or equal to the length of the shortest read. Assume the input reads are a non-empty list of skbio.sequence.DNA objects. Return a positive integer.

2.2. (1 mark) Write a Python function to identify the most abundant k-mer in the readset for a given value of k. Return the k-mer as a string. If there is a tie, and two or more k-mers are equally abundant, return only one of them. Assume the value of k is greater than or equal to 1 and less than or equal to the length of the shortest read. Assume the input reads are a non-empty list of skbio.sequence.DNA objects.


In [None]:
# ~~ GRADED CELL (1 mark) - complete this cell ~~
def distinct_kmers(reads, k):
    """
    Compute the number of distinct k-mers in the total readset for a given value of k. 
    Assume the value of k is greater than or equal to 1, and less than or equal to the length of the shortest read.
    Assume the input reads are a non-empty list of skbio.sequence.DNA objects.
    Return a positive integer.
    """
    # Initialise a set to contain the distinct k-mers.
    distinct_set = set()
    # Iterate through each k-mer in each read and add it to the set as a string.
    for read in reads:
        # Convert the read to a string once, rather than once per k-mer.
        seq = str(read)
        for i in range(0, len(seq) - k + 1):
            distinct_set.add(seq[i: i + k])
    # Return the length of the set.
    return len(distinct_set)

In [None]:
# ~~ Test your function in this cell ~~
demo_reads_c = [skbio.sequence.DNA('AAAAATTC'), skbio.sequence.DNA('CAT')]
print(distinct_kmers(demo_reads_c, 3)) # should give 5
print(distinct_kmers(readset, 9))

In [None]:
# ~~ GRADED CELL (1 mark) - complete this cell ~~
def top_kmer(reads, k):
    """
    Identify the most abundant k-mer in the readset for a given value of k.
    Assume the value of k is greater than or equal to 1 and less than or equal to the length of the shortest read.
    Assume the input reads are a non-empty list of skbio.sequence.DNA objects.
    Return the k-mer as a string. If there is a tie, and two or more k-mers are equally abundant, return only one of them.
    """
    # Initialise a dictionary to count the occurrences of k-mers.
    # A default dictionary could also be used.
    kmer_freq = {}
    # Iterate through each k-mer in each read.
    for read in reads:
        # Convert the read to a string once, rather than once per k-mer.
        seq = str(read)
        for i in range(0, len(seq) - k + 1):
            kmer = seq[i: i + k]
            # If the k-mer is not in the dictionary, add it with a count of 1.
            if kmer not in kmer_freq:
                kmer_freq[kmer] = 1
            # If the k-mer is in the dictionary, increase the counter by 1.
            else:
                kmer_freq[kmer] += 1
    # Return the string of the sequence of the k-mer with the highest frequency.
    return max(kmer_freq, key = kmer_freq.get)

In [None]:
# ~~ Test your function in this cell ~~
print(top_kmer(demo_reads_c, 3)) # should give 'AAA'
print(top_kmer(readset, 9))
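
The comment in the solution notes that other dictionary types could be used. One possible alternative, sketched here under the hypothetical name `top_kmer_counter`, uses `collections.Counter` from the Python standard library. It calls `str()` on each read, so it should accept either plain strings (as in this demo) or objects that convert to their sequence with `str()`.

```python
from collections import Counter


def top_kmer_counter(reads, k):
    """Return one of the most abundant overlapping k-mers across all reads."""
    counts = Counter()
    for read in reads:
        seq = str(read)
        # Counter.update adds one to the count of every k-mer it is given.
        counts.update(seq[i:i + k] for i in range(len(seq) - k + 1))
    # most_common(1) returns a list holding the single (k-mer, count) pair with the top count.
    return counts.most_common(1)[0][0]


print(top_kmer_counter(["AAAAATTC", "CAT"], 3))  # AAA
```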

## Part 3 - Read alignment

In the cells below, complete the following tasks:

3.1. (1 mark) What does it mean for read alignment algorithms to be semi-global?

3.2. (1 mark) Explain the difference between simple and affine gap costs.

3.3. (2 marks) Aside from read length, describe one difference between reads produced on an Illumina NextSeq and reads produced on an Oxford Nanopore MinION. How would this difference affect read alignment?

~~ GRADED CELL (1 mark) - complete this cell ~~

What does it mean for read alignment algorithms to be semi-global? **50 words max**

In semi-global alignment, the whole of a short query sequence is aligned to part of a reference sequence. Gaps at either end of the alignment are not penalised. The alignment is global with respect to the query sequence and local with respect to the reference sequence.
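
For readers who want to see the idea in code, the following is a minimal dynamic-programming sketch of semi-global scoring. It is an illustration, not a description of how any particular aligner is implemented. The function name `semi_global_score` and the scoring values (match +1, mismatch -1, gap -2) are assumptions chosen for the example.

```python
def semi_global_score(query, reference, match=1, mismatch=-1, gap=-2):
    """Best score for aligning the whole query to any substring of the reference."""
    rows, cols = len(query) + 1, len(reference) + 1
    # The first row stays at 0: skipping a prefix of the reference is free.
    score = [[0] * cols for _ in range(rows)]
    # The first column is penalised: every query base must be accounted for.
    for i in range(1, rows):
        score[i][0] = i * gap
    for i in range(1, rows):
        for j in range(1, cols):
            is_match = query[i - 1] == reference[j - 1]
            diagonal = score[i - 1][j - 1] + (match if is_match else mismatch)
            gap_in_reference = score[i - 1][j] + gap
            gap_in_query = score[i][j - 1] + gap
            score[i][j] = max(diagonal, gap_in_reference, gap_in_query)
    # Taking the best value in the last row makes skipping a reference suffix free.
    return max(score[-1])


print(semi_global_score("ACGT", "TTTACGTTTT"))  # 4
```

With these scores, the four-base query reaches the maximum score of 4 even though six reference bases are left unaligned. A fully global alignment would charge a gap penalty for each of those six bases.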

~~ GRADED CELL (1 mark) - complete this cell ~~

Explain the difference between simple and affine gap costs. **50 words max**

Simple gap cost models assign a fixed penalty to an indel in an alignment. Affine gap models assign different penalties for opening a gap and for extending a gap. 
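
The difference can be made concrete with a small calculation. The sketch below assumes that "simple" means a linear model in which every gapped position costs the same amount, and it uses one common affine convention (an opening penalty plus an extension penalty per gapped position); other conventions exist. The function names and penalty values are hypothetical.

```python
def linear_gap_cost(length, per_base=2):
    """Simple (linear) model: every gapped position costs the same."""
    return per_base * length


def affine_gap_cost(length, gap_open=5, gap_extend=1):
    """Affine model: a one-off opening penalty plus a smaller per-position cost."""
    if length == 0:
        return 0
    return gap_open + gap_extend * length


# One gap of 6 bases versus two separate gaps of 3 bases each.
print(linear_gap_cost(6), 2 * linear_gap_cost(3))  # 12 12
print(affine_gap_cost(6), 2 * affine_gap_cost(3))  # 11 16
```

Under the linear model the two scenarios cost the same, whereas the affine model favours a single long gap over several short ones.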

~~ GRADED CELL (2 marks) - complete this cell ~~

Aside from read length, describe one difference between reads produced on an Illumina NextSeq and reads produced on an Oxford Nanopore MinION.
How would this difference affect read alignment? **100 words max**

Reads produced by an Illumina NextSeq sequencer have a higher per-base sequence quality compared to reads produced by an Oxford Nanopore MinION device. Reads containing more errors are more difficult to align correctly to a reference genome. The errors may cause reads to be aligned to the incorrect region of the genome, or the reads may not map to the genome at all. 
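
One way to see why errors make alignment harder: many aligners begin by looking for short exact matches (seeds) between a read and the reference, and each error destroys every k-mer that overlaps it. The toy sketch below uses invented sequences and hypothetical helper names; it measures the fraction of a read's k-mers that also occur in the reference.

```python
def kmer_set(sequence, k):
    """Return the set of overlapping k-mers in sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def shared_kmer_fraction(read, reference, k):
    """Fraction of the read's distinct k-mers that also occur in the reference."""
    read_kmers = kmer_set(read, k)
    return len(read_kmers & kmer_set(reference, k)) / len(read_kmers)


reference = "ACGTTGCAATCGGATCCTAG"  # Made-up toy reference.
accurate_read = "TGCAATCGGATC"      # Exact copy of reference positions 4 to 15.
noisy_read = "TGCTATCGCATC"         # Same region with two substitutions.

print(shared_kmer_fraction(accurate_read, reference, 4))         # 1.0
print(round(shared_kmer_fraction(noisy_read, reference, 4), 2))  # 0.11
```

In this example, just two substitutions in a 12-base read leave only one of its nine 4-mers intact, so far fewer exact seeds are available to anchor the read to the correct location.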