# 1. DNA Sequencing, Strings, and Matching

In this training project, I performed several key steps for DNA sequence analysis:
- Download desired FASTA or FASTQ files
- Parse files for entire genome sequence (FASTA) or sequenced fragments (FASTQ)
- Determine reverse complement of a given sequence
- Locate positions of a matching sequence pattern among a larger genome sequence 
    - *(Note: This method is called naive pattern matching. There are several better options for real-world NGS analysis.)*
    - Include the reverse complement and/or an allowance for up to 2 base-pair mismatches
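
As one illustration of the "better options" mentioned in the note above, here is a minimal sketch of a k-mer index: the text is preprocessed once into a dictionary of k-mer offsets, and each query only verifies the candidate offsets instead of scanning every alignment. The names `build_kmer_index` and `query_index`, the toy text, and the choice of `k = 4` are assumptions made for this example and are not part of the project code.

```python
from collections import defaultdict

def build_kmer_index(t, k):
    """Map every length-k substring of t to the list of offsets where it starts."""
    index = defaultdict(list)
    for i in range(len(t) - k + 1):
        index[t[i:i+k]].append(i)
    return index

def query_index(p, t, index, k):
    """Return offsets where p occurs exactly in t, using the index for candidates."""
    if len(p) < k:
        raise ValueError("pattern must be at least k characters long")
    occurrences = []
    # Only offsets whose first k bases match p[:k] need to be verified
    for i in index.get(p[:k], []):
        if t[i:i+len(p)] == p:
            occurrences.append(i)
    return occurrences

# Toy text (hypothetical, not the lambda virus genome)
text = "GCTACGATCTAGAATCTA"
index = build_kmer_index(text, 4)
hits = query_index("ATCTA", text, index, 4)
print(hits)
```

The index is built once and can be reused for many patterns, which is where the saving over the naive scan comes from.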

I used these code cells to answer the following questions:

- Question 1: How many times does `AGGT` or its reverse complement (`ACCT`) occur in the lambda virus genome? E.g. if `AGGT` occurs 10 times and `ACCT` occurs 12 times, you should report 22.
    - 306

- Question 2: How many times does `TTAA` or its reverse complement occur in the lambda virus genome?  
Hint: `TTAA` and its reverse complement are equal, so remember not to double count.
    - 195

- Question 3: What is the offset of the leftmost occurrence of `ACTAAGT` or its reverse complement in the Lambda virus genome?  E.g. if the leftmost occurrence of `ACTAAGT` is at offset 40 (0-based) and the leftmost occurrence of its reverse complement `ACTTAGT` is at offset 29, then report 29.
    - 26028

- Question 4: What is the offset of the leftmost occurrence of `AGTCGA` or its reverse complement in the Lambda virus genome?
    - 450

- Question 5: How many times does `TTCAAGCC` occur in the Lambda virus genome when allowing up to 2 mismatches?
    - 191

- Question 6: What is the offset of the leftmost occurrence of `AGGAGGTT` in the Lambda virus genome when allowing up to 2 mismatches?
    - 213

- Question 7: Given a fastq file, report which sequencing cycle has a quality problem.
    - 66

In [112]:
# Import desired FASTA/FASTQ from the Internet

import urllib.request

def download_file(url, output_path):
    urllib.request.urlretrieve(url, output_path)
    print("Download complete!")

url1 = "https://d28rh4a8wq0iu5.cloudfront.net/ads1/data/lambda_virus.fa" # FASTA
output_path1 = "lambda_virus.fa"

url2 = "https://d28rh4a8wq0iu5.cloudfront.net/ads1/data/ERR037900_1.first1000.fastq" # FASTQ
output_path2 = "human.fastq"

# Download files

download_file(url1, output_path1)
download_file(url2, output_path2)


Download complete!
Download complete!


In [None]:
# Read genome from imported FASTA file

def readFasta(filename):
    genome = ''
    with open(filename, 'r') as fa:
        for line in fa:
            # ignore header line with genome information
            if not line.startswith('>'):
                genome += line.rstrip()
    return genome

In [None]:
# Read sequences and qualities from imported FASTQ file

def readFastq(filename):
    sequences = []
    qualities = []
    with open(filename, 'r') as fq:
        while True:
            fq.readline()  # skip name line
            seq = fq.readline().rstrip()  # read base sequence
            fq.readline()  # skip placeholder line
            qual = fq.readline().rstrip() # base quality line
            if len(seq) == 0:
                break
            sequences.append(seq)
            qualities.append(qual)
    return sequences, qualities

In [None]:
# Determine the reverse complement of a DNA sequence s

def reverseComplement(s):
    complement = {'A': 'T', 'C': 'G', 'G': 'C', 'T': 'A', 'N': 'N'}
    t = ''
    for base in s:
        t = complement[base] + t
    return t
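
The same operation can also be written with Python's built-in `str.translate`, which avoids rebuilding the string one base at a time. This is a sketch of an alternative, not the function used to produce the answers above; the names `COMPLEMENT_TABLE` and `reverse_complement_translate` are made up for the example. The last line shows the palindrome case behind the Question 2 hint.

```python
# Translation table mapping each base to its complement (N stays N)
COMPLEMENT_TABLE = str.maketrans("ACGTN", "TGCAN")

def reverse_complement_translate(s):
    """Complement every base, then reverse the result."""
    return s.translate(COMPLEMENT_TABLE)[::-1]

print(reverse_complement_translate("AGGT"))
print(reverse_complement_translate("ACTAAGT"))
print(reverse_complement_translate("TTAA") == "TTAA")
```

One behavioral difference to keep in mind: characters outside `ACGTN` pass through `str.translate` unchanged, whereas the dictionary lookup in `reverseComplement` raises a `KeyError`.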

In [None]:
# Read in genome from FASTA file and sequences from FASTQ file

genome = readFasta("lambda_virus.fa")
sequences, qualities = readFastq("human.fastq")

*Note: The 4 functions below could be streamlined into one (i.e. with a toggle for reverse complement and variable mismatch allowance), but they were written separately for training purposes.*
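
A minimal sketch of what such a combined function might look like is given here. The name `naive_general` and its parameters `max_mismatches` and `include_rc` are hypothetical and do not appear elsewhere in this notebook; the helper `_reverse_complement` is repeated only so that the block runs without the other cells.

```python
def _reverse_complement(s):
    complement = {'A': 'T', 'C': 'G', 'G': 'C', 'T': 'A', 'N': 'N'}
    return ''.join(complement[base] for base in reversed(s))

def _within_mismatches(p, t, i, max_mismatches):
    """True if p aligned at offset i of t has at most max_mismatches mismatches."""
    mismatches = 0
    for j in range(len(p)):
        if t[i+j] != p[j]:
            mismatches += 1
            if mismatches > max_mismatches:
                return False
    return True

def naive_general(p, t, max_mismatches=0, include_rc=False):
    """Naive matching with optional mismatch allowance and strand awareness."""
    patterns = [p]
    if include_rc:
        rc_p = _reverse_complement(p)
        if rc_p != p:  # palindromic patterns are only searched once
            patterns.append(rc_p)
    occurrences = []
    for i in range(len(t) - len(p) + 1):  # loop over alignments
        # Record each offset once, even if both strands match there
        if any(_within_mismatches(q, t, i, max_mismatches) for q in patterns):
            occurrences.append(i)
    return occurrences

# Toy text (hypothetical): AGGT at 0, its reverse complement ACCT at 4, AGCT at 8
t = "AGGTACCTAGCT"
print(naive_general("AGGT", t))
print(naive_general("AGGT", t, include_rc=True))
print(naive_general("AGGT", t, max_mismatches=1))
```

With the defaults the function behaves like `naive`; setting `include_rc=True` and `max_mismatches=2` corresponds to `naive_with_rc_2mm`.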

In [34]:
# Locate index position of matches of pattern p in text t

def naive(p, t):
    occurrences = []
    for i in range(len(t) - len(p) + 1):  # loop over alignments
        match = True
        for j in range(len(p)):  # loop over characters
            if t[i+j] != p[j]:  # compare characters
                match = False
                break
        if match:
            occurrences.append(i)  # all chars matched; record
    return occurrences

In [32]:
# Locate index position of matches of pattern p in text t while being strand-aware (reverse complement)

def naive_with_rc(p, t):
    occurrences = []
    rc_p = reverseComplement(p)

    for i in range(len(t) - len(p) + 1):  # loop over alignments
        # Check forward strand
        match_forward = True
        for j in range(len(p)):  # loop over characters
            if t[i+j] != p[j]:   # compare characters
                match_forward = False
                break
        if match_forward:
            occurrences.append(i)  # all chars matched; record
            continue               # skip checking reverse if forward matched to avoid duplicates
    
        # Check reverse strand
        match_reverse = True
        for j in range(len(p)):
            if t[i+j] != rc_p[j]:
                match_reverse = False
                break
        if match_reverse:
            occurrences.append(i)   

    return occurrences

In [None]:
# Adding allowance of 2 basepair mismatches to naive (no reverse complement)

def naive_2mm(p, t):
    occurrences = []
    max_mismatches = 2

    for i in range(len(t) - len(p) + 1):  # loop over alignments
        # Check forward strand with mismatches allowed
        mismatches = 0
        for j in range(len(p)):  # loop over characters
            if t[i+j] != p[j]:   # compare characters
                mismatches += 1  # add to mismatch counter
                if mismatches > max_mismatches:    # stop if we exceed max mismatches (2)
                    break
        if mismatches <= max_mismatches:
            occurrences.append(i)  # within the mismatch allowance; record

    return occurrences

In [50]:
# Adding allowance of 2 basepair mismatches to naive_with_rc

def naive_with_rc_2mm(p, t):
    occurrences = []
    rc_p = reverseComplement(p)
    max_mismatches = 2

    for i in range(len(t) - len(p) + 1):  # loop over alignments
        # Check forward strand with mismatches allowed
        mismatches = 0
        for j in range(len(p)):  # loop over characters
            if t[i+j] != p[j]:   # compare characters
                mismatches += 1  # add to mismatch counter
                if mismatches > max_mismatches:    # stop if we exceed max mismatches (2)
                    break
        if mismatches <= max_mismatches:
            occurrences.append(i)  # within the mismatch allowance; record
            continue               # skip checking reverse if forward matched to avoid duplicates
    
        # Check reverse strand with mismatches allowed
        mismatches = 0
        for j in range(len(p)):
            if t[i+j] != rc_p[j]:
                mismatches += 1
                if mismatches > max_mismatches:
                    break
        if mismatches <= max_mismatches:
            occurrences.append(i)   

    return occurrences

In [None]:
# Testing each naive function for different results using a test pattern

test_pattern = "ACTAAGT"

naive_occurrences = naive(test_pattern, genome)
naive_with_rc_occurrences = naive_with_rc(test_pattern, genome)
naive_2mm_occurrences = naive_2mm(test_pattern, genome)
naive_with_rc_2mm_occurrences = naive_with_rc_2mm(test_pattern, genome)

print("Occurrences with naive function:", len(naive_occurrences))
print("Occurrences with naive_with_rc function:", len(naive_with_rc_occurrences))
print("Occurrences with naive_2mm function:", len(naive_2mm_occurrences))
print("Occurrences with naive_with_rc_2mm function:", len(naive_with_rc_2mm_occurrences))

# Checking the offset of the leftmost occurrence of the test pattern

print(naive_occurrences[0])
print(naive_with_rc_occurrences[0])
print(naive_2mm_occurrences[0])
print(naive_with_rc_2mm_occurrences[0])


Occurrences with naive function: 2
Occurrences with naive_with_rc function: 3
Occurrences with naive_2mm function: 414
Occurrences with naive_with_rc_2mm function: 664
[27733, 45382]
26028
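
For the sequencing-cycle question (Question 7), one way to look across all reads rather than a single one is to average the Q score at each cycle and pick the lowest. The sketch below assumes Phred+33 encoding and equal-length reads; the name `mean_quality_by_cycle` and the three short quality strings are invented for illustration and are not taken from the FASTQ file.

```python
def mean_quality_by_cycle(qualities):
    """Average Phred+33 Q score at each cycle (0-based offset) across reads."""
    if not qualities:
        return []
    read_len = len(qualities[0])
    totals = [0] * read_len
    for qual in qualities:
        if len(qual) != read_len:
            raise ValueError("all reads must have the same length")
        for cycle, c in enumerate(qual):
            totals[cycle] += ord(c) - 33  # Phred+33 character to Q score
    return [total / len(qualities) for total in totals]

# Toy quality strings (hypothetical): every read has '#' (Q2) at offset 2
toy_qualities = ["HH#HH", "HG#FH", "HH#HE"]
means = mean_quality_by_cycle(toy_qualities)
worst_cycle = means.index(min(means))
print(worst_cycle, means[worst_cycle])
```

The same call could be made on the `qualities` list returned by `readFastq`; the reported offset is 0-based, so the 1-based cycle number is one higher.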


In [108]:
# Assessment of sequencing cycles from FASTQ data

print(qualities)

# Function to convert Phred+33 scores to Q scores
def phred_to_q(phred):
    q_scores = []
    for line in phred:
        clean_line = line.rstrip()
        for c in clean_line:
            q = ord(c) - 33
            if q < 0:
                print(f"Warning: Invalid Phred character: '{c}' → Q = {q}")
                continue
            q_scores.append(q)
    return q_scores

# Convert qualities from FASTQ to Q scores for just the first sequence read
q_scores = phred_to_q(qualities[:1])
print(q_scores)

# Determine the lowest Q score and the 0-based index where it first occurs
lowest_q = min(q_scores)
cycle = q_scores.index(lowest_q)
print(f"The lowest quality Q score was {lowest_q} at index {cycle} (0-based).")

# Confirm the position (0-based index vs 1-based cycle number)
print(f"The Q score at index 66 (cycle 67) is {q_scores[66]}.")

['HHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHGFHHHFHFFHHHHHGHHFHEH@4#55554455HGFBF<@C>7EEF@FBEDDD<=C<E', 'HHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHCHHHHEHHBA#C>@54455C/7=CGHEGEB;C############', 'HHHHHHHHHHHHHHHHHHHHHHHHHHHHHHGHHHHHHHHHHHHHDHHHDEHHHHFGIHEHEGGGF4#45655366GIGEHAGBG################', 'HHHHHHHHHHHHHHHHHHHHHHHHHIHHHHHHHHHHHHHHHHHHHHHHIHHHHHIHFHHHIHHHHD#ECA54655GGIBH?BD@+BCBF?5A=::>8?##', 'HHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHIHIHEHHIGHIFFHIIGF6#555:2=7=CB;?3CAACBAC2B###########', 'HHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHIHHHIHIHHHGH:#@@@@9@C@EEGCGGFIFFF9FCAF?EEE4B8>>', "HHHHHHHHHHHHHHHHHHHHHHIFHFEGGFHHHHHHGHHHHGHHHHHFHAFGHEHHIHHGBCCDC,#55564565CE:BB44+'5/36,(<<BC<DDBCE", 'HHFHHDHHHHDDGGGDHDHHHHHGHHHHHHHDHHECHHH8GGDEEHHHHEH?3HG<=4>555624/#5/55/555DADA#####################', 'HHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHIHIIFIHEIIGFI@#==?46560GAAEDGGDGCA8CCB=@########', 'HHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHEHHHGH@