## Questions 1 & 2

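Differences from the `editDistance` algorithm:

* rows correspond to P, columns to T
* the first row is initialised to all zeros, so a match may start at any offset in T
* the result is the minimum value in the bottom row, so a match may end at any offset in T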
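
For contrast, here is a minimal sketch of a standard global edit-distance function, assuming that is what `editDistance` refers to above (the original function is not shown in this notebook). Compared with `approximateMatching`, the first row is initialised to `0..len(y)` rather than zeros, and the result is read from the bottom-right cell rather than the minimum of the bottom row.

```python
def editDistance(x, y):
    # Assumed reference version: standard global edit distance
    D = [[0] * (len(y) + 1) for _ in range(len(x) + 1)]
    # First column: cost of deleting the first i characters of x
    for i in range(len(x) + 1):
        D[i][0] = i
    # First row: cost of inserting the first j characters of y
    for j in range(len(y) + 1):
        D[0][j] = j
    for i in range(1, len(x) + 1):
        for j in range(1, len(y) + 1):
            delta = 0 if x[i-1] == y[j-1] else 1
            D[i][j] = min(D[i][j-1] + 1,        # horizontal
                          D[i-1][j] + 1,        # vertical
                          D[i-1][j-1] + delta)  # diagonal
    # Global distance is the bottom-right cell
    return D[-1][-1]
```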

In [1]:
def approximateMatching(p, t):
    # Create distance matrix
    D = []
    for i in range(len(p)+1):
        D.append([0]*(len(t)+1))
    
    # Leave first row untouched (all zeros)
    # Initialise first column (same as editDistance)
    for i in range(len(p)+1):
        D[i][0] = i
    
    # Fill in the rest of the matrix
    for i in range(1, len(p)+1):
        for j in range(1, len(t)+1):
            distHor = D[i][j-1] + 1
            distVer = D[i-1][j] + 1
            if p[i-1] == t[j-1]:
                distDiag = D[i-1][j-1]
            else:
                distDiag = D[i-1][j-1] + 1
            D[i][j] = min(distHor, distVer, distDiag)
    
    # Fewest edits over all substrings of t is the minimum value of the bottom row
    return min(D[-1])

In [2]:
p = "GCGTATGC"
t = "TATTGGCTATACGGTT"
approximateMatching(p, t)

2

In [3]:
def readGenome(filename):
    genome = ''
    with open(filename, 'r') as f:
        for line in f:
            # ignore header line with genome information
            if not line[0] == '>':
                genome += line.rstrip()
    return genome

In [4]:
chr1_genome = readGenome("chr1.GRCh38.excerpt.fasta")

In [5]:
# question 1
approximateMatching("GCTGATCGATCGTACG", chr1_genome)

3

In [6]:
# question 2
approximateMatching("GATTTACCAGATTGAG", chr1_genome)

2

## Questions 3 & 4

`!wget https://d28rh4a8wq0iu5.cloudfront.net/ads1/data/ERR266411_1.for_asm.fastq`

The function below returns the length of the longest suffix of `a` that matches a prefix of `b`, provided the match is at least `min_length` characters long.

In [65]:
def overlap(a, b, min_length=3):
    """ Return length of longest suffix of 'a' matching
        a prefix of 'b' that is at least 'min_length'
        characters long.  If no such overlap exists,
        return 0. """
    start = 0  # start all the way at the left
    while True:
        start = a.find(b[:min_length], start)  # look for b's prefix in a
        if start == -1:  # no more occurrences to right
            return 0
        # found occurrence; check for full suffix/prefix match
        if b.startswith(a[start:]):
            return len(a)-start
        start += 1  # move just past previous match

Calling this `overlap` function on every possible pair of reads is very slow. However, if the length-k suffix of a read `a` **does not occur** in any other read, then `a` cannot have an overlap of length k or more with any of them.

This means we only need to call `overlap` on the other reads that contain that length-k suffix, which saves a lot of time.
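
To make the slow baseline concrete, here is a minimal sketch of the naive approach that checks every ordered pair of reads. The names `suffix_prefix_overlap` and `naive_overlap_all_pairs` are hypothetical, and `suffix_prefix_overlap` is a simplified stand-in for `overlap` so that the block runs on its own.

```python
from itertools import permutations

def suffix_prefix_overlap(a, b, min_length=3):
    # Hypothetical stand-in for `overlap`: try the longest candidate first
    for length in range(min(len(a), len(b)), min_length - 1, -1):
        if a.endswith(b[:length]):
            return length
    return 0

def naive_overlap_all_pairs(reads, k):
    # Check every ordered pair of reads, skipping a read paired with itself
    return [(a, b) for a, b in permutations(reads, 2)
            if a != b and suffix_prefix_overlap(a, b, k) > 0]

toy_reads = ['ABCDEFG', 'EFGHIJ', 'HIJABC']
print(naive_overlap_all_pairs(toy_reads, 3))
```

With n reads this makes on the order of n² overlap checks, which is the cost the k-mer lookup is meant to avoid.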

We first need to build a lookup table for our `overlap` function:

* Create a dictionary that associates each k-mer with a set. Key: k-mer, value: set.
* The set contains all reads that contain the corresponding k-mer.
* To find the reads to call `overlap` on for a read `a`, take the last k letters of `a` and look up the set associated with that k-mer.
* This returns all reads containing the k-mer.
* Remember to ignore a read matching back to itself.
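
The steps above can be sketched on a toy input as follows. This is only an illustration: `build_kmer_index` is a hypothetical name, and it uses `collections.defaultdict` instead of an explicit check for a missing key.

```python
from collections import defaultdict

def build_kmer_index(reads, k):
    # Hypothetical helper: map each k-mer to the set of reads containing it
    index = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            index[read[i:i+k]].add(read)
    return index

toy_reads = ['ABCDEFG', 'EFGHIJ', 'HIJABC']
index = build_kmer_index(toy_reads, 3)

# Candidate partners for a read: reads containing its length-k suffix,
# excluding the read itself
read = 'ABCDEFG'
candidates = index[read[-3:]] - {read}
print(candidates)  # {'EFGHIJ'}
```

Only the reads in `candidates` then need to be passed to `overlap`.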

In [71]:
def generate_read_lut(reads, k):
    # Create a lookup table mapping each k-mer to the set of reads containing it.
    # Iterate over all reads, and within each read over all of its k-mers.
    # If a k-mer is not yet present, add it to the LUT.
    read_lut = {}
    for read in reads:
        for i in range(0, len(read)-k+1):
            kmer = read[i:i+k]
            if read_lut.get(kmer) is None:
                read_lut[kmer] = set([read])  # wrap in a list; set(read) would split the read into bases
            else:
                read_lut[kmer].add(read)
    return read_lut 

In [87]:
def overlap_all_pairs(reads, k):
    read_lut = generate_read_lut(reads, k)
    overlap_list = []
    for read in reads:
        k_suffix = read[-k:] # get last k-mer
        matches = read_lut.get(k_suffix)
        if matches is not None:
            for match in matches:
                if match != read:
                    if overlap(read, match, k) != 0:
                        overlap_list.append((read, match))
    return overlap_list
            

### Test examples

In [88]:
reads = ['ABCDEFG', 'EFGHIJ', 'HIJABC']

In [89]:
overlap_all_pairs(reads, 3)

[('ABCDEFG', 'EFGHIJ'), ('EFGHIJ', 'HIJABC'), ('HIJABC', 'ABCDEFG')]

In [90]:
overlap_all_pairs(reads, 4)

[]

In [91]:
reads = ['CGTACG', 'TACGTA', 'GTACGT', 'ACGTAC', 'GTACGA', 'TACGAT']

In [92]:
overlap_all_pairs(reads, 4)

[('CGTACG', 'GTACGA'),
 ('CGTACG', 'GTACGT'),
 ('CGTACG', 'TACGTA'),
 ('CGTACG', 'TACGAT'),
 ('TACGTA', 'CGTACG'),
 ('TACGTA', 'ACGTAC'),
 ('GTACGT', 'ACGTAC'),
 ('GTACGT', 'TACGTA'),
 ('ACGTAC', 'CGTACG'),
 ('ACGTAC', 'GTACGT'),
 ('ACGTAC', 'GTACGA'),
 ('GTACGA', 'TACGAT')]

In [93]:
overlap_all_pairs(reads, 5)

[('CGTACG', 'GTACGT'),
 ('CGTACG', 'GTACGA'),
 ('TACGTA', 'ACGTAC'),
 ('GTACGT', 'TACGTA'),
 ('ACGTAC', 'CGTACG'),
 ('GTACGA', 'TACGAT')]

### The actual question

In [100]:
def readFastq(filename):
    sequences = []
    qualities = []
    with open(filename) as fh:
        while True:
            fh.readline()  # skip name line
            seq = fh.readline().rstrip()  # read base sequence
            fh.readline()  # skip placeholder line
            qual = fh.readline().rstrip() # base quality line
            if len(seq) == 0:
                break
            sequences.append(seq)
            qualities.append(qual)
    return sequences, qualities

phix_reads, _ = readFastq("ERR266411_1.for_asm.fastq")

In [101]:
phix_graph = overlap_all_pairs(phix_reads, 30)

In [102]:
# question 3
len(phix_graph)

904746

In [104]:
phix_graph[0]

('TAAACAAGCAGTAGTAATTCCTGCTTTATCAAGATAATTTTTCGACTCATCAGAAATATCCGAAAGTGTTAACTTCTGCGTCATGGAAGCGATAAAACTC',
 'AACAAGCAGTAGTAATTCCTGCTTTATCAAGATAATTTTTCGACTCATCAGAAATATCCGAAAGTGTTAACTTCTGCGTCATGGAAGCGATAAAACTCTG')

In [110]:
# question 4
# count the distinct reads that appear as the first element of a pair
len({a for a, _ in phix_graph})

7161