## Functions

In [8]:
def readGenome(filename):
    genome = ''
    with open(filename, 'r') as f:
        for line in f:
            # ignore header line with genome information
            if not line.startswith('>'):
                genome += line.rstrip()
    return genome

In [2]:
def reverseComplement(s):
    complement = {'A': 'T', 'C': 'G', 'G': 'C', 'T': 'A', 'N': 'N'}
    t = ''
    for base in s:
        t = complement[base] + t
    return t

In [3]:
def naive(p, t):
    occurrences = []
    for i in range(len(t) - len(p) + 1):  # loop over alignments
        match = True
        for j in range(len(p)):  # loop over characters
            if t[i+j] != p[j]:  # compare characters
                match = False
                break
        if match:
            occurrences.append(i)  # all chars matched; record
    return occurrences

## Questions with naive_with_rc

First, implement a version of the naive exact matching algorithm that is strand-aware. That is, instead of looking only for occurrences of P in T, additionally look for occurrences of the reverse complement of P in T. If P is ACT, your function should find occurrences of both ACT and its reverse complement AGT in T.

If P and its reverse complement are identical (e.g. AACGTT), then a given match offset should be reported only once. So if your new function is called naive_with_rc, then the old naive function and your new naive_with_rc function should return the same results when P equals its reverse complement.
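
As a quick check of the two cases described above, the sketch below computes reverse complements with `str.translate`. The helper name `reverse_complement` and the table-based approach are illustrative assumptions, not the notebook's own `reverseComplement`; the only aim is to show that ACT maps to AGT while AACGTT maps to itself.

```python
# Illustrative helper (hypothetical name). For strings over A, C, G, T, N it
# gives the same result as reverseComplement above; unlike that function,
# it passes unknown characters through instead of raising KeyError.
_COMPLEMENT = str.maketrans('ACGTN', 'TGCAN')

def reverse_complement(s):
    """Return the reverse complement of a DNA string."""
    return s.translate(_COMPLEMENT)[::-1]

print(reverse_complement('ACT'))                 # AGT
print(reverse_complement('AACGTT') == 'AACGTT')  # True
```

A pattern like AACGTT that equals its own reverse complement is the case where the second search has to be skipped to avoid reporting each offset twice.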

In [5]:
def naive_with_rc(p, t):
    # occurrences of p itself on the forward strand
    occurrences = naive(p, t)
    rc = reverseComplement(p)
    # search for the reverse complement only when it differs from p,
    # so that palindromic patterns are not reported twice
    if rc != p:
        occurrences.extend(naive(rc, t))
    return occurrences

### Example 1

In [10]:
p = 'CCC'
ten_as = 'AAAAAAAAAA'
t = ten_as + 'CCC' + ten_as + 'GGG' + ten_as
occurrences = naive_with_rc(p, t)
print(occurrences)

[10, 23]


### Example 2

In [11]:
p = 'CGCG'
t = ten_as + 'CGCG' + ten_as + 'CGCG' + ten_as
occurrences = naive_with_rc(p, t)
print(occurrences)

[10, 24]


### Example 3

In [12]:
phix_genome = readGenome('./data/phix.fa')

In [13]:
occurrences = naive_with_rc('ATTA', phix_genome)

In [14]:
print('offset of leftmost occurrence: %d' % min(occurrences))

offset of leftmost occurrence: 62


In [15]:
print('# occurrences: %d' % len(occurrences))

# occurrences: 60


In [9]:
genome = readGenome('./data/lambda_virus.fa')

> Question 1

How many times does AGGT or its reverse complement (ACCT) occur in the lambda virus genome?  E.g. if AGGT occurs 10 times and ACCT occurs 12 times, you should report 22.

In [17]:
occurrences = naive_with_rc('AGGT', genome)
len(occurrences)

306

> Question 2

How many times does TTAA or its reverse complement occur in the lambda virus genome?  
Hint: TTAA and its reverse complement are equal, so remember not to double count.

In [18]:
occurrences = naive_with_rc('TTAA', genome)
len(occurrences)

195

In [19]:
occurrences = naive('TTAA', genome)
len(occurrences)

195

> Question 3

What is the offset of the leftmost occurrence of ACTAAGT or its reverse complement in the Lambda virus genome?  E.g. if the leftmost occurrence of ACTAAGT is at offset 40 (0-based) and the leftmost occurrence of its reverse complement ACTTAGT is at offset 29, then report 29.

In [20]:
occurrences = naive_with_rc('ACTAAGT', genome)
print('offset of leftmost occurrence: %d' % min(occurrences))

offset of leftmost occurrence: 26028


> Question 4

What is the offset of the leftmost occurrence of AGTCGA or its reverse complement in the Lambda virus genome?

In [21]:
occurrences = naive_with_rc('AGTCGA', genome)
print('offset of leftmost occurrence: %d' % min(occurrences))

offset of leftmost occurrence: 450


## Questions with naive_2mm

As we will discuss, sometimes we would like to find approximate matches for P in T. That is, we want to find occurrences with one or more differences.

For Questions 5 and 6, make a new version of the naive function called naive_2mm that allows up to 2 mismatches per occurrence. Unlike for the previous questions, do not consider the reverse complement here.  We're looking for approximate matches for P itself, not its reverse complement.

For example, ACTTTA occurs twice in ACTTACTTGATAAAGT, once at offset 0 with 2 mismatches, and once at offset 4 with 1 mismatch. So naive_2mm('ACTTTA', 'ACTTACTTGATAAAGT') should return the list [0, 4].
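
The mismatch counts quoted in this example can be verified directly. The following is a minimal sketch, assuming a hypothetical helper `count_mismatches` that computes the Hamming distance between the pattern and each equal-length window of the text; it is a check on the example, not the required `naive_2mm` implementation.

```python
def count_mismatches(p, window):
    """Number of positions where p and an equal-length window differ."""
    return sum(1 for a, b in zip(p, window) if a != b)

p = 'ACTTTA'
t = 'ACTTACTTGATAAAGT'

# Mismatch count for every alignment of p against t.
counts = {i: count_mismatches(p, t[i:i + len(p)])
          for i in range(len(t) - len(p) + 1)}
# Keep the offsets with at most 2 mismatches.
hits = [i for i, c in counts.items() if c <= 2]

print(counts[0], counts[4])  # 2 1
print(hits)                  # [0, 4]
```

Counting every mismatch in each window is simpler to read but slower than stopping at the third mismatch, which is what an early `break` achieves.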

In [26]:
def naive_2mm(p, t):
    occurrences = []
    for i in range(len(t) - len(p) + 1):  # loop over alignments
        match = True
        mismatch = 2
        for j in range(len(p)):  # loop over characters
            if t[i+j] != p[j]:  # compare characters
                mismatch -= 1
                if mismatch < 0:
                    match = False
                    break
        if match:
            occurrences.append(i)  # at most 2 mismatches; record
    return occurrences

### Example 1

In [27]:
p = 'CTGT'
ten_as = 'AAAAAAAAAA'
t = ten_as + 'CTGT' + ten_as + 'CTTT' + ten_as + 'CGGG' + ten_as
occurrences = naive_2mm(p, t)
print(occurrences)

[10, 24, 38]


### Example 2

In [28]:
occurrences = naive_2mm('GATTACA', phix_genome)
print('offset of leftmost occurrence: %d' % min(occurrences))

offset of leftmost occurrence: 10


In [29]:
print('# occurrences: %d' % len(occurrences))

# occurrences: 79


> Question 5

How many times does TTCAAGCC occur in the Lambda virus genome when allowing up to 2 mismatches? 

In [30]:
occurrences = naive_2mm('TTCAAGCC', genome)
print('# occurrences: %d' % len(occurrences))

# occurrences: 191


> Question 6

What is the offset of the leftmost occurrence of AGGAGGTT in the Lambda virus genome when allowing up to 2 mismatches?

In [31]:
occurrences = naive_2mm('AGGAGGTT', genome)
print('offset of leftmost occurrence: %d' % min(occurrences))

offset of leftmost occurrence: 49


> Question 7

Finally, download and parse the provided FASTQ file containing real DNA sequencing reads derived from a human:

 https://d28rh4a8wq0iu5.cloudfront.net/ads1/data/ERR037900_1.first1000.fastq

Note that the file has many reads in it and you should examine all of them together when answering this question.  The reads are taken from this study:

Ajay, S. S., Parker, S. C., Abaan, H. O., Fajardo, K. V. F., & Margulies, E. H. (2011). Accurate and comprehensive sequencing of personal genomes. Genome research, 21(9), 1498-1505. 

This dataset has something wrong with it; one of the sequencing cycles is poor quality.

Report which sequencing cycle has the problem.  Remember that a sequencing cycle corresponds to a particular offset in all the reads. For example, if the leftmost read position seems to have a problem consistently across reads, report 0. If the fourth position from the left has the problem, report 3. Do whatever analysis you think is needed to identify the bad cycle. It might help to review the "Analyzing reads by position" video.

In [32]:
def readFastq(filename):
    sequences = []
    qualities = []
    with open(filename) as fh:
        while True:
            fh.readline()  # skip name line
            seq = fh.readline().rstrip()  # read base sequence
            fh.readline()  # skip placeholder line
            qual = fh.readline().rstrip() # base quality line
            if len(seq) == 0:
                break
            sequences.append(seq)
            qualities.append(qual)
    return sequences, qualities

In [37]:
seqs, quals = readFastq('./data/ERR037900_1.first1000.fastq')

In [38]:
def phred33ToQ(qual):
    return ord(qual) - 33

In [53]:
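def createHist(qualities):
    # hist[i] is the sum of quality values at read offset i across all reads
    hist = {}
    for qual in qualities:
        for i, phred in enumerate(qual):
            q = phred33ToQ(phred)
            hist[i] = hist.get(i, 0) + q
    return hist

h = createHist(quals)
# (offset, total quality) for the cycle with the lowest total quality
print(min(h.items(), key=lambda x: x[1]))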

(66, 4526)
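
Comparing per-offset totals, as above, implicitly assumes that every read has the same length; with reads of unequal length, later offsets would have lower totals simply because fewer reads reach them. A possible variant, sketched here under that concern, averages the quality at each offset instead. The name `mean_quality_by_cycle` is hypothetical, and the quality strings are made up for illustration rather than taken from the FASTQ file.

```python
def mean_quality_by_cycle(qualities):
    """Mean Phred+33 quality at each read offset, tolerating unequal read lengths."""
    totals, counts = {}, {}
    for qual in qualities:
        for i, ch in enumerate(qual):
            totals[i] = totals.get(i, 0) + (ord(ch) - 33)
            counts[i] = counts.get(i, 0) + 1
    return {i: totals[i] / counts[i] for i in totals}

# Made-up quality strings for illustration only: 'I' is Q40, '#' is Q2.
toy_quals = ['II#II', 'II#I', 'II#III']
means = mean_quality_by_cycle(toy_quals)
worst_cycle = min(means, key=means.get)
print(worst_cycle)  # 2
```

On real data the same call would take the `quals` list returned by `readFastq`.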
