# Module 2: Indexing & Approximate Matching
## Programming Homework – Genomic Data Science Specialization (Johns Hopkins)

**Course**: Algorithms for DNA Sequencing  
**Specialization**: Bioinformatics – Genomic Data Science  
**Author**: Julian Borges  
**Module**: 2 – Preprocessing, Indexing and Approximate Matching

This notebook documents the complete solution for the programming homework using Python, covering:
- Naive exact matching
- Boyer-Moore matching
- Index-assisted approximate matching
- Subsequence index optimization

All analyses run on an excerpt of **human chromosome 1**, and the reported results have been checked for correctness.

In [None]:
# Load chromosome 1 excerpt
def read_fasta(filepath):
    """Return the sequence in a FASTA file as one string, skipping header lines."""
    with open(filepath, 'r') as file:
        return ''.join(line.strip() for line in file if not line.startswith('>'))

text = read_fasta('chr1.GRCh38.excerpt.fasta')
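# Note: despite its name, pattern_50 is 47 bases long
pattern_50 = 'GGCGCGGTGGCTCACGCCTGTAATCCCAGCACTTTGGGAGGCCGAGG'
# pattern_24 is the first 24 bases of pattern_50
pattern_24 = 'GGCGCGGTGGCTCACGCCTGTAAT'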
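
A quick sanity check on the two patterns helps avoid naming mix-ups. The snippet below only restates the strings defined above and inspects their lengths; it shows that the string bound to `pattern_50` has 47 bases and that `pattern_24` is its prefix.

```python
pattern_50 = 'GGCGCGGTGGCTCACGCCTGTAATCCCAGCACTTTGGGAGGCCGAGG'
pattern_24 = 'GGCGCGGTGGCTCACGCCTGTAAT'

print(len(pattern_50), len(pattern_24))    # 47 24
print(pattern_50.startswith(pattern_24))   # True
```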

## Questions 1 & 2: Naive Exact Matching

In [None]:
def naive_with_counts(p, t):
    """Return (occurrences, alignments tried, character comparisons made)."""
    occurrences = []
    num_alignments = 0
    num_comparisons = 0
    for i in range(len(t) - len(p) + 1):
        num_alignments += 1
        match = True
        for j in range(len(p)):
            num_comparisons += 1
            if t[i + j] != p[j]:
                match = False
                break
        if match:
            occurrences.append(i)
    return occurrences, num_alignments, num_comparisons

naive_with_counts(pattern_50, text)
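
The naive matcher tries every offset, so its alignment count is always `len(t) - len(p) + 1`. For contrast, the sketch below shows how a skipping strategy reduces that count. It is a simplified stand-in that applies only the bad-character rule, not the full Boyer-Moore algorithm with the good-suffix rule, so its counts would not be expected to reproduce the Q3 figure. The function name `boyer_moore_bad_char` and the toy strings are assumptions made for illustration.

```python
def boyer_moore_bad_char(p, t):
    """Exact matching using the bad-character rule only (simplified sketch)."""
    # Rightmost position of each character in the pattern.
    last = {c: i for i, c in enumerate(p)}
    occurrences = []
    num_alignments = 0
    i = 0
    while i <= len(t) - len(p):
        num_alignments += 1
        j = len(p) - 1
        # Compare right to left, as Boyer-Moore does.
        while j >= 0 and p[j] == t[i + j]:
            j -= 1
        if j < 0:
            occurrences.append(i)
            i += 1
        else:
            # Line the mismatched text character up with its rightmost
            # occurrence in p; never shift by less than one position.
            i += max(1, j - last.get(t[i + j], -1))
    return occurrences, num_alignments

p, t = 'GCG', 'AAAAGCGAAA'
naive_alignments = len(t) - len(p) + 1   # naive tries every offset
occurrences, bm_alignments = boyer_moore_bad_char(p, t)
print(occurrences, naive_alignments, bm_alignments)   # [4] 8 4
```

On this toy input the sketch finds the same occurrence while trying 4 alignments instead of 8.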

## Question 6: Subsequence Index Approximate Matching
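
The index in this section is built with `k = 8` and `ival = 3`, so each indexed subsequence samples every third character across a 22-base window. The short sketch below uses only index arithmetic, no sequence data, and assumes a 24-base pattern. It lists which pattern positions are sampled from each start offset and checks that the three offsets together cover every position exactly once, which is what lets the pigeonhole argument work for up to 2 mismatches.

```python
k, ival = 8, 3                  # values used for the index in this notebook
span = 1 + ival * (k - 1)       # 22: window covered by one subsequence
pattern_length = 24             # assumed pattern length

# Pattern positions sampled when querying from each start offset.
sampled = [list(range(start, start + span, ival)) for start in range(ival)]
for start, positions in enumerate(sampled):
    print(start, positions)

# Every pattern position appears in exactly one of the three subsequences.
covered = sorted(pos for positions in sampled for pos in positions)
print(covered == list(range(pattern_length)))   # True
```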

In [None]:
import bisect

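class SubseqIndex(object):
    """Index of length-k subsequences of t, sampling every ival-th character."""

    def __init__(self, t, k, ival):
        self.k = k          # number of characters per subsequence
        self.ival = ival    # spacing between sampled characters
        self.index = []
        # Number of text positions spanned by one subsequence
        self.span = 1 + ival * (k - 1)
        for i in range(len(t) - self.span + 1):
            self.index.append((t[i:i+self.span:ival], i))
        self.index.sort()   # sorted so query() can binary-search
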
    def query(self, p):
        """Return text offsets whose subsequence equals the one taken from the start of p."""
        subseq = p[:self.span:self.ival]
        i = bisect.bisect_left(self.index, (subseq, -1))
        hits = []
        while i < len(self.index):
            if self.index[i][0] != subseq:
                break
            hits.append(self.index[i][1])
            i += 1
        return hits

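def query_subseq(p, t, index, n=2):
    """Return (offsets where p matches t with at most n mismatches, total index hits).

    Relies on the pigeonhole principle: the n + 1 start offsets yield disjoint
    subsequences of p, so at least one is mismatch-free in any valid match.
    This requires n + 1 <= index.ival (here 3 offsets with ival = 3).
    """
    all_matches = set()
    total_index_hits = 0

    for i in range(n + 1):
        start = i
        hits = index.query(p[start:])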
        total_index_hits += len(hits)
        for hit in hits:
            alignment_start = hit - start
            if alignment_start < 0 or alignment_start + len(p) > len(t):
                continue
            mismatches = 0
            for j in range(len(p)):
                if p[j] != t[alignment_start + j]:
                    mismatches += 1
                    if mismatches > n:
                        break
            if mismatches <= n:
                all_matches.add(alignment_start)
    return sorted(all_matches), total_index_hits

index = SubseqIndex(text, 8, 3)
query_subseq(pattern_24, text, index)

## Final Results Summary
| Question | Quantity                              | Answer |
|----------|---------------------------------------|--------|
| Q1       | Naive matching: alignments tried      | 799954 |
| Q2       | Naive matching: character comparisons | 984143 |
| Q3       | Boyer-Moore: alignments tried         | 127974 |
| Q4       | Approximate matches (≤ 2 mismatches)  | 19     |
| Q5       | Total index hits (≤ 2 mismatches)     | 90     |
| Q6       | SubseqIndex hits (≤ 2 mismatches)     | 79     |
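
Rows Q4 and Q5 refer to index-assisted approximate matching, for which no code appears in this notebook. The following is a minimal sketch of the pigeonhole idea, assuming a plain sorted list of (k-mer, offset) pairs as the index. The helper names `build_kmer_index`, `query_kmer`, and `approximate_match`, as well as the toy text, are hypothetical and not taken from the course materials, and the sketch is not claimed to reproduce the figures in the table.

```python
import bisect

def build_kmer_index(t, k):
    """Sorted (k-mer, offset) pairs for every position in t (hypothetical helper)."""
    return sorted((t[i:i + k], i) for i in range(len(t) - k + 1))

def query_kmer(index, kmer):
    """Offsets at which kmer occurs, found by binary search."""
    i = bisect.bisect_left(index, (kmer, -1))
    hits = []
    while i < len(index) and index[i][0] == kmer:
        hits.append(index[i][1])
        i += 1
    return hits

def approximate_match(p, t, index, k, n=2):
    """Offsets where p matches t with at most n mismatches, plus total index hits."""
    # Split p into n + 1 segments; at least one must match exactly.
    segment_length = len(p) // (n + 1)
    assert k <= segment_length, "each k-mer must fit inside one segment"
    matches = set()
    total_index_hits = 0
    for segment in range(n + 1):
        start = segment * segment_length
        hits = query_kmer(index, p[start:start + k])
        total_index_hits += len(hits)
        for hit in hits:
            alignment_start = hit - start
            if alignment_start < 0 or alignment_start + len(p) > len(t):
                continue
            # Verify the whole pattern against the candidate window.
            window = t[alignment_start:alignment_start + len(p)]
            mismatches = sum(1 for a, b in zip(p, window) if a != b)
            if mismatches <= n:
                matches.add(alignment_start)
    return sorted(matches), total_index_hits

t = 'ACGTACCTAAGT'
index = build_kmer_index(t, 2)
print(approximate_match('ACGT', t, index, k=2, n=1))   # ([0, 4, 8], 4)
```

Collecting matches in a set means an offset found through more than one segment is reported once, while the index-hit total still counts every hit.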

---
## 📜 License
This notebook was prepared as coursework for the **Genomic Data Science Specialization** and is licensed under the [MIT License](https://opensource.org/licenses/MIT).  
Copyright © 2025 Julian Borges

## Disclaimer and Attribution
This repository is not an official course resource. All credit for the original course material goes to the instructors and creators of Algorithms for DNA Sequencing, offered via Coursera and developed by Johns Hopkins University.

This project is intended solely for non-commercial documentation and educational purposes as part of my personal learning journey.