# Demo: Python Modules and Packages for Bioinformatics

Welcome to this Jupyter Notebook! You're in a Master of Bioinformatics program and know basic Python (functions, strings, lists). This notebook will teach you how to create a **Python package** called `bio_utils` with reusable DNA analysis tools. You'll implement 8 functions, test them, and make the package installable with `pip`—a real-world skill for bioinformatics!

## What You'll Learn
- **Modules**: Python files (`.py`) with functions you can import and reuse.
- **Packages**: Folders with modules and an `__init__.py` file.
- **Why Use Them?**
  - **Reuse**: Write DNA tools once, use them in any project.
  - **Organize**: Group related functions (e.g., DNA cleaning, GC content).
  - **Share**: Install your package or share with lab mates.
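  - **Maintain**: Fix a bug in one file, and every project that imports the package picks up the fix (after reinstalling, unless the package was installed in editable mode).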
- **Testing**: Write tests to ensure your code is reliable.

## Project Setup
Create this structure in a `demo_modules` folder:
```
demo_modules/
├── demo-modules.ipynb  # This notebook
├── src/
│   ├── bio_utils/
│   │   ├── __init__.py  # Package marker (starts empty)
│   │   └── sequence_utils.py  # Your functions
├── tests/
│   └── test_sequence_utils.py  # Your tests
├── pyproject.toml  # Makes package installable
└── README.md  # Explains your package
```

**Steps**:
1. Create `demo_modules/` folder.
2. Create `src/bio_utils/` and add an `__init__.py`. An empty file is enough for now; Step 0 of the packaging section fills it with re-exports.
3. Create `pyproject.toml` and `README.md` (see end of notebook).
4. You'll create `sequence_utils.py` and `test_sequence_utils.py` after implementing the functions.

## Your Tasks
- **Implement 8 Functions**: Each has a `# TODO` section, pseudocode, and tests for instant feedback.
- **Write Tests**: Add tests for `count_base` and `base_counts` to learn test-driven development.
- **Build a Package**: Move functions to `sequence_utils.py`, tests to `test_sequence_utils.py`, and install with `pip`.
- **Apply the Package**: Solve example bioinformatics problems using the package.

## Why This Matters
- Bioinformatics involves repetitive tasks (e.g., cleaning DNA, finding motifs).
- Packages let you store tools in one place, saving time and reducing errors.

**Instructions**:
- For each function: read the purpose, pseudocode, and efficiency.
- Write the `# TODO` code and run the cell to test. If tests fail, check the error message and debug.
- If stuck, check the solution (which passes all tests).
- Use type hints (e.g., `def clean_dna(seq: str) -> str:`) and docstrings for clean code.
- At the end, create and install the package, then use it for examples.
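
As a small illustration of the type-hint and docstring style mentioned in the instructions, here is a minimal sketch. The helper `seq_length` is hypothetical and is not one of the eight `bio_utils` functions.

```python
def seq_length(seq: str) -> int:
    """Return the number of characters in a sequence (hypothetical example)."""
    return len(seq)

print(seq_length('ACGT'))          # 4
print(seq_length.__doc__)          # The docstring is stored on the function
print(seq_length.__annotations__)  # Type hints are stored here; Python does not enforce them
```

Type hints document intent for readers and tools; they do not stop a caller from passing the wrong type at runtime.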

Let's start coding!

## Exercise 1: Clean DNA Sequence

**Purpose**: Standardize a DNA sequence to uppercase A, C, G, T, N, replacing invalid characters, including spaces, with 'N'.
**Why?** Real DNA data is messy (lowercase, spaces, ambiguous bases like 'R'). Cleaning ensures reliable analysis.
**Question**: "How do I validate my DNA sequence for analysis?"

**Pseudocode**:
1. Convert sequence to uppercase.
2. Create a list for cleaned characters.
3. For each character: if A, C, G, T, or N, add it; else add 'N'.
4. Join list into a string and return.

**Efficiency**: Time O(n), Space O(n) (n = sequence length).

**Task**: Implement below and run tests. If a test fails, check the error (e.g., are you handling spaces?).

In [1]:
x = set(('A', 'B', 'C'))
y = 'A'

print(y in x)

True


In [2]:
def clean_dna(seq: str) -> str:
    """Returns an uppercase DNA string with only A, C, G, T, N. Other characters, including spaces, become 'N'."""
    # TODO
    clean_seq = ''
    valid_bases = set(('A', 'C', 'G', 'T', 'N'))
    
    for base in seq.upper():
        if base in valid_bases:
            clean_seq += base
        else:
            clean_seq += 'N'
            
    return clean_seq
    pass

# Tests
# Test 1: Mixed case and invalid characters
# Why? Ensures lowercase and ambiguous bases (e.g., 'R') are handled.
assert clean_dna('acgtRY') == 'ACGTNN', "Test 1 failed: Should convert lowercase and invalid chars to N"

# Test 2: Empty string
# Why? Functions should handle edge cases like empty input.
assert clean_dna('') == '', "Test 2 failed: Empty string should return empty"

# Test 3: All valid bases
# Why? Confirms valid bases stay unchanged.
assert clean_dna('ACGT') == 'ACGT', "Test 3 failed: Valid bases should stay the same"

# Test 4: Only invalid characters
# Why? Tests if all non-ACGTN characters become N.
assert clean_dna('XYZ123') == 'NNNNNN', "Test 4 failed: All invalid chars should become N"

# Test 5: Mixed sequence with spaces
# Why? Real data might include spaces, which should become N.
assert clean_dna('AT CG N') == 'ATNCGNN', "Test 5 failed: Spaces should become N"

print("All clean_dna tests passed!")

All clean_dna tests passed!


**Solution**: If stuck, compare with this working version.

In [3]:
def clean_dna(seq: str) -> str:
    """Returns an uppercase DNA string with only A, C, G, T, N. Other characters, including spaces, become 'N'."""
    allowed = {'A', 'C', 'G', 'T', 'N'}  # Valid DNA bases, including N
    cleaned = []  # Build result in a list
    for char in seq.upper():  # Convert to uppercase
        cleaned.append(char if char in allowed else 'N')  # Keep valid, else N
    return ''.join(cleaned)  # Join to string

# Verify
assert clean_dna('acgtRY') == 'ACGTNN'
assert clean_dna('') == ''
assert clean_dna('ACGT') == 'ACGT'
assert clean_dna('XYZ123') == 'NNNNNN'
assert clean_dna('AT CG N') == 'ATNCGNN'
print("Solution for clean_dna passes all tests!")

Solution for clean_dna passes all tests!
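
For comparison, the same cleaning rule can probably be expressed with a regular expression. This is only a sketch of an alternative, not the notebook's solution; the name `clean_dna_regex` is an assumption made for illustration.

```python
import re

def clean_dna_regex(seq: str) -> str:
    """Hypothetical alternative to clean_dna using a regular expression."""
    # Replace every character that is not A, C, G, T, or N with 'N'
    return re.sub(r'[^ACGTN]', 'N', seq.upper())

for raw in ['acgtRY', '', 'XYZ123', 'AT CG N']:
    print(repr(raw), '->', repr(clean_dna_regex(raw)))
```

Both versions uppercase first and then substitute, so they should agree on the five test inputs used above.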


## Exercise 2: Calculate GC Content

**Purpose**: Compute fraction of G and C bases, important for PCR or genome analysis.
**Why?** GC content affects DNA stability and experimental conditions.
**Question**: "Does this sequence have high GC content for my experiment?"

**Pseudocode**:
1. Clean sequence using `clean_dna`.
2. Count G and C bases.
3. Compute total non-N bases (length minus N count).
4. Return GC count / total, or 0.0 if total is 0.

**Efficiency**: Time O(n), Space O(1) extra (plus O(n) for the cleaned copy made by `clean_dna`).

**Task**: Implement using provided `clean_dna`. Run tests.
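
To make the arithmetic in the pseudocode concrete, here is a worked example on a sequence that is assumed to be already cleaned. The variable names are illustrative only.

```python
cleaned = 'GCNNAT'  # Example sequence, assumed already cleaned

gc = cleaned.count('G') + cleaned.count('C')  # 1 + 1 = 2
non_n = len(cleaned) - cleaned.count('N')     # 6 - 2 = 4
fraction = gc / non_n if non_n else 0.0       # Guard against division by zero

print(gc, non_n, fraction)  # 2 4 0.5
```

Note that the two N bases are excluded from the denominator, so the result is 2/4 rather than 2/6.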

In [4]:
def clean_dna(seq: str) -> str:
    allowed = {'A', 'C', 'G', 'T', 'N'}
    cleaned = []
    for char in seq.upper():
        cleaned.append(char if char in allowed else 'N')
    return ''.join(cleaned)

def gc_content(seq: str) -> float:
    """Returns fraction of G and C bases (0.0 to 1.0), ignoring N."""
    # TODO - Implement here!
    clean_seq = clean_dna(seq)
    valid_base_count = len(clean_seq) - clean_seq.count('N')
    gc_count = clean_seq.count('G') + clean_seq.count('C')

    if valid_base_count != 0:
        gc = gc_count / valid_base_count
    else:
        gc = 0.0
        
    return gc
    
    # Hint: Use clean_dna(seq), count G and C, divide by (length - N count).
    pass

# Tests
# Test 1: All GC bases
# Why? Ensures 100% GC returns 1.0.
assert abs(gc_content('GCGC') - 1.0) < 1e-9, "Test 1 failed: GCGC should have 100% GC"

# Test 2: Mixed sequence
# Why? Checks calculation with A, T, G, C (4/7 bases are GC).
assert abs(gc_content('GCGCAAA') - 4/7) < 1e-9, "Test 2 failed: GCGCAAA should have 4/7 GC"

# Test 3: All N bases
# Why? Ensures division by zero returns 0.0.
assert gc_content('NNNN') == 0.0, "Test 3 failed: All N should return 0.0"

# Test 4: Empty string
# Why? Edge case: empty input returns 0.0.
assert gc_content('') == 0.0, "Test 4 failed: Empty string should return 0.0"

# Test 5: Messy input
# Why? Ensures cleaning works (xy becomes N).
assert abs(gc_content('GCxyGC') - 1.0) < 1e-9, "Test 5 failed: GCxyGC should have 100% GC after cleaning"

print("All gc_content tests passed!")

All gc_content tests passed!


**Solution**:

In [5]:
def clean_dna(seq: str) -> str:
    allowed = {'A', 'C', 'G', 'T', 'N'}
    cleaned = []
    for char in seq.upper():
        cleaned.append(char if char in allowed else 'N')
    return ''.join(cleaned)

def gc_content(seq: str) -> float:
    """Returns fraction of G and C bases (0.0 to 1.0), ignoring N."""
    s = clean_dna(seq)  # Clean first
    gc_count = s.count('G') + s.count('C')  # Count G and C
    total = len(s) - s.count('N')  # Non-N bases
    return gc_count / total if total > 0 else 0.0  # Avoid division by zero

# Verify
assert abs(gc_content('GCGC') - 1.0) < 1e-9
assert abs(gc_content('GCGCAAA') - 4/7) < 1e-9
assert gc_content('NNNN') == 0.0
assert gc_content('') == 0.0
assert abs(gc_content('GCxyGC') - 1.0) < 1e-9
print("Solution for gc_content passes all tests!")

Solution for gc_content passes all tests!


## Exercise 3: Reverse Complement

**Purpose**: Return reverse complement (A↔T, C↔G, reversed) for PCR or strand analysis.
**Why?** DNA is double-stranded, and we need the complementary strand.
**Question**: "What sequence pairs with this DNA strand?"

**Pseudocode**:
1. Clean sequence using `clean_dna`.
2. Create translation table (A→T, C→G, G→C, T→A, N→N).
3. Apply translation.
4. Reverse and return.

**Efficiency**: Time O(n), Space O(n).

**Task**: Use `str.maketrans()` and `translate()`.
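
If `str.maketrans()` and `translate()` are new to you, this short sketch shows how they work on a different mapping (DNA `T` to RNA `U`), so it does not give away the exercise.

```python
# Map each character in the first string to the character at the same
# position in the second string (here: DNA 'T' -> RNA 'U').
table = str.maketrans('T', 'U')
rna = 'ATGCTT'.translate(table)

print(rna)        # AUGCUU
print(rna[::-1])  # UUCGUA (slicing with step -1 reverses a string)
```

Characters that are not in the table, such as `A`, `G`, and `C` here, pass through unchanged.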

In [6]:
def clean_dna(seq: str) -> str:
    allowed = {'A', 'C', 'G', 'T', 'N'}
    cleaned = []
    for char in seq.upper():
        cleaned.append(char if char in allowed else 'N')
    return ''.join(cleaned)

def revcomp(seq: str) -> str:
    """Returns the reverse complement of a DNA sequence."""
    # TODO - Implement here!
    s = clean_dna(seq)
    complements = str.maketrans('ACGTN', 'TGCAN')
    revcomp_seq = s.translate(complements)[::-1]
    return revcomp_seq
    # Hint: Use str.maketrans('ACGTN', 'TGCAN') and translate()[::-1].
    pass

# Tests
# Test 1: Standard DNA sequence
# Why? Checks complement and reverse.
assert revcomp('ATGC') == 'GCAT', "Test 1 failed: ATGC should reverse complement to GCAT"

# Test 2: Sequence with N
# Why? Ensures N stays N (common in real data).
assert revcomp('ATGCNN') == 'NNGCAT', "Test 2 failed: N should stay N"

# Test 3: Empty string
# Why? Edge case: empty input returns empty.
assert revcomp('') == '', "Test 3 failed: Empty string should return empty"

# Test 4: Messy input
# Why? Ensures cleaning before complementing.
assert revcomp('atXYgc') == 'GCNNAT', "Test 4 failed: atXYgc should clean and complement to GCNNAT"

# Test 5: Single base
# Why? Tests minimal case.
assert revcomp('A') == 'T', "Test 5 failed: Single A should complement to T"

print("All revcomp tests passed!")

All revcomp tests passed!


**Solution**:

In [7]:
def clean_dna(seq: str) -> str:
    allowed = {'A', 'C', 'G', 'T', 'N'}
    cleaned = []
    for char in seq.upper():
        cleaned.append(char if char in allowed else 'N')
    return ''.join(cleaned)

def revcomp(seq: str) -> str:
    """Returns the reverse complement of a DNA sequence."""
    s = clean_dna(seq)  # Clean first
    complement = str.maketrans('ACGTN', 'TGCAN')  # A->T, C->G, etc.
    return s.translate(complement)[::-1]  # Translate and reverse

# Verify
assert revcomp('ATGC') == 'GCAT'
assert revcomp('ATGCNN') == 'NNGCAT'
assert revcomp('') == ''
assert revcomp('atXYgc') == 'GCNNAT'
assert revcomp('A') == 'T'
print("Solution for revcomp passes all tests!")

Solution for revcomp passes all tests!


## Exercise 4: Count Base

**Purpose**: Count occurrences of a specific base (e.g., 'A').
**Why?** Useful for sequence composition analysis.
**Question**: "How many A's are in this sequence?"

**Pseudocode**:
1. Clean sequence.
2. Convert base to uppercase.
3. If base not A, C, G, T, or N, raise ValueError.
4. Count base occurrences and return.

**Efficiency**: Time O(n), Space O(1) extra (plus O(n) for the cleaned copy made by `clean_dna`).

**Task**: Implement and add two additional tests in the cell.
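
One of your extra tests will likely need to check that a `ValueError` is raised. The sketch below shows one possible pattern using plain `try`/`except`; both `require_positive` and `raises_value_error` are hypothetical names used only for illustration.

```python
def require_positive(n: int) -> int:
    """Hypothetical function used only to illustrate testing for exceptions."""
    if n <= 0:
        raise ValueError("n must be positive")
    return n

def raises_value_error(func, *args) -> bool:
    """Return True if func(*args) raises ValueError, else False."""
    try:
        func(*args)
    except ValueError:
        return True
    return False

assert raises_value_error(require_positive, 0)      # Invalid input raises
assert not raises_value_error(require_positive, 3)  # Valid input does not
print("Exception checks passed")
```

Wrapping the check in a helper keeps each test to a single `assert` line, which can be easier to read than a repeated `try`/`except` block.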

In [8]:
def clean_dna(seq: str) -> str:
    allowed = {'A', 'C', 'G', 'T', 'N'}
    cleaned = []
    for char in seq.upper():
        cleaned.append(char if char in allowed else 'N')
    return ''.join(cleaned)

def count_base(seq: str, base: str) -> int:
    """Returns count of a specific base (A, C, G, T, or N)."""
    # TODO - Implement here!
    s = clean_dna(seq)
    base = base.upper()  # Uppercase base
    if base not in {'A', 'C', 'G', 'T', 'N'}:  # Validate
        raise ValueError("Base must be A, C, G, T, or N")
    return s.count(base)
    # Hint: Clean seq, check base is valid, use str.count().
    pass

# Tests
# Test 1: Count A's
# Why? Basic functionality for valid base.
assert count_base('AAGCT', 'A') == 2, "Test 1 failed: Should count 2 A's"

# Test 2: No N's
# Why? Ensures zero count for absent base.
assert count_base('AAGCT', 'N') == 0, "Test 2 failed: No N's in sequence"

# Test 3: Empty string
# Why? Edge case: no bases.
assert count_base('', 'A') == 0, "Test 3 failed: Empty string has no bases"

# TODO: Add two more tests here!

assert count_base('XYZA', 'A') == 1, "Test 4 failed: Invalid chars become N"
try:
    count_base('ACGT', 'X')
    assert False
except ValueError:
    pass

print("All count_base tests passed!")

All count_base tests passed!


**Solution**:

In [9]:
def clean_dna(seq: str) -> str:
    allowed = {'A', 'C', 'G', 'T', 'N'}
    cleaned = []
    for char in seq.upper():
        cleaned.append(char if char in allowed else 'N')
    return ''.join(cleaned)

def count_base(seq: str, base: str) -> int:
    """Returns count of a specific base (A, C, G, T, or N)."""
    s = clean_dna(seq)  # Clean first
    base = base.upper()  # Uppercase base
    if base not in {'A', 'C', 'G', 'T', 'N'}:  # Validate
        raise ValueError("Base must be A, C, G, T, or N")
    return s.count(base)  # Count occurrences

# Verify
assert count_base('AAGCT', 'A') == 2
assert count_base('AAGCT', 'N') == 0
assert count_base('', 'A') == 0
assert count_base('XYZA', 'A') == 1, "Test 4 failed: Invalid chars become N"
try:
    count_base('ACGT', 'X')
    assert False
except ValueError:
    pass
print("Solution for count_base passes all tests!")

Solution for count_base passes all tests!


## Exercise 5: Hamming Distance

**Purpose**: Count differing positions between two equal-length sequences.
**Why?** Measures similarity (e.g., mutations between strains).
**Question**: "How many mutations between two sequences?"

**Pseudocode**:
1. Clean both sequences.
2. If lengths differ, raise ValueError.
3. Count positions where characters differ.
4. Return count.

**Efficiency**: Time O(n), Space O(1) extra (plus O(n) for the cleaned copies made by `clean_dna`).
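
The length check in step 2 matters because `zip()` stops at the shorter input without any warning. A quick sketch of that behavior:

```python
pairs = list(zip('ACG', 'ACGT'))

print(pairs)       # [('A', 'A'), ('C', 'C'), ('G', 'G')]
print(len(pairs))  # 3, the trailing 'T' is silently dropped
```

Without the explicit check, comparing `'ACG'` with `'ACGT'` would report zero differences instead of raising an error.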

In [10]:
x = 'abcd'
y = 'wxyz'
z = 0

for a, b in zip(x, y):
    z += (not a == b)

print(z)

4


In [11]:
def clean_dna(seq: str) -> str:
    allowed = {'A', 'C', 'G', 'T', 'N'}
    cleaned = []
    for char in seq.upper():
        cleaned.append(char if char in allowed else 'N')
    return ''.join(cleaned)

def hamming(a: str, b: str) -> int:
    """Returns number of differing positions (equal length required)."""
    # TODO - Implement here!
    if len(a) != len(b):
        raise ValueError('Sequences passed must be of equal length')
    a_clean, b_clean = clean_dna(a), clean_dna(b)
    
    mismatches = 0
    for x, y in zip(a_clean, b_clean):
        mismatches += not (x == y)

    return mismatches
    # Hint: Use zip() to pair characters, check lengths.
    pass

# Tests
# Test 1: One difference
# Why? Basic functionality.
assert hamming('ACGT', 'ACGA') == 1, "Test 1 failed: One difference"

# Test 2: Identical sequences
# Why? Ensures zero differences.
assert hamming('ACGT', 'ACGT') == 0, "Test 2 failed: Identical sequences"

# Test 3: Invalid input
# Why? Tests cleaning of invalid characters.
assert hamming('axyz', 'ACGT') == 3, "Test 3 failed: Invalid chars become N"

# Test 4: Unequal lengths
# Why? Should raise ValueError.
try:
    hamming('ACG', 'ACGT')
    assert False, "Test 4 failed: Should raise ValueError"
except ValueError:
    pass

# Test 5: Empty strings
# Why? Edge case: no differences.
assert hamming('', '') == 0, "Test 5 failed: Empty strings have no differences"

print("All hamming tests passed!")

All hamming tests passed!


**Solution**:

In [None]:
def clean_dna(seq: str) -> str:
    allowed = {'A', 'C', 'G', 'T', 'N'}
    cleaned = []
    for char in seq.upper():
        cleaned.append(char if char in allowed else 'N')
    return ''.join(cleaned)

def hamming(a: str, b: str) -> int:
    """Returns number of differing positions (equal length required)."""
    sa = clean_dna(a)  # Clean both
    sb = clean_dna(b)
    if len(sa) != len(sb):  # Check lengths
        raise ValueError("Sequences must have equal length")
    return sum(1 for x, y in zip(sa, sb) if x != y)  # Count differences

# Verify
assert hamming('ACGT', 'ACGA') == 1
assert hamming('ACGT', 'ACGT') == 0
assert hamming('axyz', 'ACGT') == 3
try:
    hamming('ACG', 'ACGT')
    assert False
except ValueError:
    pass
assert hamming('', '') == 0
print("Solution for hamming passes all tests!")

## Exercise 6: K-mer Counts

**Purpose**: Count k-length subsequences (k-mers) for motif analysis.
**Why?** Identifies frequent patterns in sequences.
**Question**: "What are common 3-letter patterns in this gene?"

**Pseudocode**:
1. Clean sequence.
2. If k <= 0, raise ValueError.
3. Slide window of size k, skip if contains N.
4. Count each k-mer in a dictionary.

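**Efficiency**: Time O(n·k) (each of the roughly n windows costs O(k) to slice and hash; effectively O(n) for small fixed k), Space O(min(n, 4^k)) distinct k-mers.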
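
The sliding window in step 3 can be sketched on a plain string first. This is only an illustration of the indexing, without the N check or the counting.

```python
s = 'abcdef'
k = 3

# Valid start positions are 0 .. len(s) - k, so range() stops at len(s) - k + 1
windows = [s[i:i + k] for i in range(len(s) - k + 1)]

print(windows)  # ['abc', 'bcd', 'cde', 'def']
```

The window always advances by one position, which is what makes the k-mers overlapping.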

In [None]:
s = 'abcdef'
k = 3
i = 0



In [13]:
def clean_dna(seq: str) -> str:
    allowed = {'A', 'C', 'G', 'T', 'N'}
    cleaned = []
    for char in seq.upper():
        cleaned.append(char if char in allowed else 'N')
    return ''.join(cleaned)

def kmer_counts(seq: str, k: int) -> dict:
    """Returns dictionary of k-mer frequencies (overlapping, skips N)."""
    # TODO - Implement here!
    if k <= 0:
        raise ValueError('k must be an integer greater than 0')
    s = clean_dna(seq)
    kmer_dict = {}
    i = 0
    while i <= len(s) - k:
        kmer = s[i:i + k]  # Window of length k starting at i
        if 'N' not in kmer:  # Skip any window that contains N
            kmer_dict[kmer] = kmer_dict.get(kmer, 0) + 1
        i += 1  # Always advance by one so windows overlap

    return kmer_dict
    # Hint: Use dict, slide window, check for N.
    pass

# Tests
# Test 1: k=2 counts
# Why? Checks basic k-mer counting.
assert kmer_counts('ACGTCG', 2) == {'AC': 1, 'CG': 2, 'GT': 1, 'TC': 1}, "Test 1 failed: Wrong k-mer counts"

# Test 2: k larger than sequence
# Why? Ensures an empty result when k exceeds the sequence length.
assert kmer_counts('ACGT', 5) == {}, "Test 2 failed: k larger than sequence"

# Test 3: Sequence with N
# Why? Skips k-mers with N.
assert kmer_counts('ACGN', 2) == {'AC': 1, 'CG': 1}, "Test 3 failed: Skip N k-mers"

# Test 4: Empty sequence
# Why? Edge case: no k-mers.
assert kmer_counts('', 1) == {}, "Test 4 failed: Empty sequence"

# Test 5: Invalid k
# Why? Should raise ValueError.
try:
    kmer_counts('ACGT', 0)
    assert False, "Test 5 failed: Should raise ValueError"
except ValueError:
    pass

print("All kmer_counts tests passed!")

All kmer_counts tests passed!

**Solution**:

In [None]:
def clean_dna(seq: str) -> str:
    allowed = {'A', 'C', 'G', 'T', 'N'}
    cleaned = []
    for char in seq.upper():
        cleaned.append(char if char in allowed else 'N')
    return ''.join(cleaned)

def kmer_counts(seq: str, k: int) -> dict:
    """Returns dictionary of k-mer frequencies (overlapping, skips N)."""
    if k <= 0:  # Validate k
        raise ValueError("k must be positive")
    s = clean_dna(seq)  # Clean sequence
    counts = {}  # Store k-mer counts
    for i in range(len(s) - k + 1):  # Slide window
        kmer = s[i:i+k]
        if 'N' in kmer:  # Skip invalid
            continue
        counts[kmer] = counts.get(kmer, 0) + 1  # Increment count
    return counts

# Verify
assert kmer_counts('ACGTCG', 2) == {'AC': 1, 'CG': 2, 'GT': 1, 'TC': 1}
assert kmer_counts('ACGT', 5) == {}
assert kmer_counts('ACGN', 2) == {'AC': 1, 'CG': 1}
assert kmer_counts('', 1) == {}
try:
    kmer_counts('ACGT', 0)
    assert False
except ValueError:
    pass
print("Solution for kmer_counts passes all tests!")

## Exercise 7: Find Motif

**Purpose**: Find exact motif positions in a sequence.
**Why?** Identifies regulatory elements (e.g., TATA box).
**Question**: "Where does this motif appear?"

**Pseudocode**:
1. Clean sequence and motif.
2. If motif empty, return [].
3. Slide motif-length window, collect matching positions.
4. Return list of 0-based positions.

**Efficiency**: Time O(n*m), Space O(number of matches).

In [None]:
def clean_dna(seq: str) -> str:
    allowed = {'A', 'C', 'G', 'T', 'N'}
    cleaned = []
    for char in seq.upper():
        cleaned.append(char if char in allowed else 'N')
    return ''.join(cleaned)

def find_motif(seq: str, motif: str) -> list:
    """Returns 0-based start positions of exact motif matches."""
    # TODO - Implement here!
    # Hint: Slide window of len(motif), compare substrings.
    pass

# Tests
# Test 1: Multiple motif occurrences
# Why? Checks basic functionality.
assert find_motif('ACGTCG', 'CG') == [1, 4], "Test 1 failed: CG appears at positions 1, 4"

# Test 2: Single occurrence
# Why? Tests specific match.
assert find_motif('ACGT', 'GT') == [2], "Test 2 failed: GT at position 2"

# Test 3: Empty motif
# Why? Edge case: empty motif returns [].
assert find_motif('ACGT', '') == [], "Test 3 failed: Empty motif returns []"

# Test 4: Sequence with N
# Why? Ensures N handling in sequence.
assert find_motif('ACGN', 'CG') == [1], "Test 4 failed: Skip N in sequence"

# Test 5: Empty sequence
# Why? Edge case: no matches.
assert find_motif('', 'CG') == [], "Test 5 failed: Empty sequence returns []"

print("All find_motif tests passed!")

**Solution**:

In [None]:
def clean_dna(seq: str) -> str:
    allowed = {'A', 'C', 'G', 'T', 'N'}
    cleaned = []
    for char in seq.upper():
        cleaned.append(char if char in allowed else 'N')
    return ''.join(cleaned)

def find_motif(seq: str, motif: str) -> list:
    """Returns 0-based start positions of exact motif matches."""
    s = clean_dna(seq)  # Clean sequence
    m = clean_dna(motif)  # Clean motif
    if not m:  # Handle empty motif
        return []
    k = len(m)
    positions = []  # Store matching positions
    for i in range(len(s) - k + 1):  # Slide window
        if s[i:i+k] == m:  # Check match
            positions.append(i)
    return positions

# Verify
assert find_motif('ACGTCG', 'CG') == [1, 4]
assert find_motif('ACGT', 'GT') == [2]
assert find_motif('ACGT', '') == []
assert find_motif('ACGN', 'CG') == [1]
assert find_motif('', 'CG') == []
print("Solution for find_motif passes all tests!")
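
As an alternative to slicing every window, Python's built-in `str.find` can be called repeatedly with a moving start index. The helper below is a sketch under the assumption that both inputs are already cleaned; `find_all` is a hypothetical name, not part of `bio_utils`.

```python
def find_all(text: str, pattern: str) -> list:
    """Hypothetical helper: 0-based starts of all (overlapping) matches."""
    if not pattern:
        return []
    positions = []
    start = text.find(pattern)
    while start != -1:
        positions.append(start)
        # Resume one position later so overlapping matches are found too
        start = text.find(pattern, start + 1)
    return positions

print(find_all('ACGTCG', 'CG'))  # [1, 4]
print(find_all('AAAA', 'AA'))    # [0, 1, 2]
```

Resuming at `start + 1` rather than `start + len(pattern)` is what allows overlapping matches such as those in `'AAAA'`.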

## Exercise 8: Base Counts

**Purpose**: Count all bases (A, C, G, T, N) and total length.
**Why?** Summarizes sequence composition for quality control.
**Question**: "What's the base distribution in this sequence?"

**Pseudocode**:
1. Clean sequence.
2. Use Counter to count A, C, G, T, N.
3. Add total length to dictionary.
4. Return dictionary.

**Efficiency**: Time O(n), Space O(1) extra (plus O(n) for the cleaned copy made by `clean_dna`).

**Task**: Implement and add two additional tests.
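
Before using `Counter` in the exercise, it may help to see how it treats bases that never appear. A short sketch:

```python
from collections import Counter

counts = Counter('AACGN')

print(counts['A'])    # 2
print(counts['T'])    # 0, missing keys read as zero without raising KeyError
print('T' in counts)  # False, reading a missing key does not add it
```

Because absent bases are not stored as keys, a result that must list every base (as the tests expect) probably needs the zero entries added explicitly.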

In [None]:
from collections import Counter

def clean_dna(seq: str) -> str:
    allowed = {'A', 'C', 'G', 'T', 'N'}
    cleaned = []
    for char in seq.upper():
        cleaned.append(char if char in allowed else 'N')
    return ''.join(cleaned)

def base_counts(seq: str) -> dict:
    """Returns counts of A, C, G, T, N and total length ('LEN')."""
    # TODO - Implement here!
    # Hint: Use Counter(s) and add 'LEN': len(s).
    pass

# Tests
# Test 1: Mixed sequence
# Why? Checks basic counting.
assert base_counts('ACGTN') == {'A': 1, 'C': 1, 'G': 1, 'T': 1, 'N': 1, 'LEN': 5}, "Test 1 failed"

# Test 2: Empty sequence
# Why? Edge case: all zeros.
assert base_counts('') == {'A': 0, 'C': 0, 'G': 0, 'T': 0, 'N': 0, 'LEN': 0}, "Test 2 failed"

# Test 3: Invalid input
# Why? Ensures invalid chars become N.
assert base_counts('xyz') == {'A': 0, 'C': 0, 'G': 0, 'T': 0, 'N': 3, 'LEN': 3}, "Test 3 failed"

# TODO: Add two more tests here!
# Suggestions: Test mixed case, test spaces.
# Example: assert base_counts('acGT') == {'A': 1, 'C': 1, 'G': 1, 'T': 1, 'N': 0, 'LEN': 4}

print("All base_counts tests passed!")

**Solution**:

In [None]:
from collections import Counter

def clean_dna(seq: str) -> str:
    allowed = {'A', 'C', 'G', 'T', 'N'}
    cleaned = []
    for char in seq.upper():
        cleaned.append(char if char in allowed else 'N')
    return ''.join(cleaned)

def base_counts(seq: str) -> dict:
    """Returns counts of A, C, G, T, N and total length ('LEN')."""
    s = clean_dna(seq)  # Clean first
    counts = Counter(s)  # Count bases
    counts['LEN'] = len(s)  # Add total length
    # Ensure all bases are in dict, even if count is 0
    defaults = {'A': 0, 'C': 0, 'G': 0, 'T': 0, 'N': 0, 'LEN': len(s)}
    defaults.update(counts)
    return defaults

# Verify
assert base_counts('ACGTN') == {'A': 1, 'C': 1, 'G': 1, 'T': 1, 'N': 1, 'LEN': 5}
assert base_counts('') == {'A': 0, 'C': 0, 'G': 0, 'T': 0, 'N': 0, 'LEN': 0}
assert base_counts('xyz') == {'A': 0, 'C': 0, 'G': 0, 'T': 0, 'N': 3, 'LEN': 3}
assert base_counts('acGT') == {'A': 1, 'C': 1, 'G': 1, 'T': 1, 'N': 0, 'LEN': 4}
assert base_counts('AT CG') == {'A': 1, 'C': 1, 'G': 1, 'T': 1, 'N': 1, 'LEN': 5}
print("Solution for base_counts passes all tests!")

## Creating the bio_utils Package

Now that you've implemented and tested the functions, let's package them into `bio_utils` for reuse.

### Step 0: Create `__init__.py`

`demo_modules/src/bio_utils/__init__.py`

```python
from .sequence_utils import (
    clean_dna,
    base_counts,
    find_motif,
    kmer_counts,
    hamming,
    count_base,
    revcomp,
    gc_content,
)

__all__ = [
    "clean_dna",
    "base_counts",
    "find_motif",
    "kmer_counts",
    "hamming",
    "count_base",
    "revcomp",
    "gc_content",
]
```
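
To see what the re-exports in `__init__.py` buy you, the sketch below builds a tiny throwaway package in a temporary directory and imports it. The names `demo_pkg`, `tools`, and `shout` are hypothetical and exist only for this illustration.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Build a throwaway package named demo_pkg (a hypothetical name) in a temp dir
root = Path(tempfile.mkdtemp())
pkg = root / 'demo_pkg'
pkg.mkdir()
(pkg / 'tools.py').write_text("def shout(s):\n    return s.upper()\n")
# The re-export: lifts shout from the submodule to the package namespace
(pkg / '__init__.py').write_text("from .tools import shout\n__all__ = ['shout']\n")

sys.path.insert(0, str(root))  # Make the temp dir importable
importlib.invalidate_caches()
demo_pkg = importlib.import_module('demo_pkg')

print(demo_pkg.shout('acgt'))        # ACGT, via the package-level name
print(demo_pkg.tools.shout('acgt'))  # ACGT, via the submodule path
```

In the same way, the `__init__.py` above should let users write `from bio_utils import clean_dna` in addition to the longer `from bio_utils.sequence_utils import clean_dna`.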

### Step 1: Create `sequence_utils.py`
Copy all **solution** versions of the functions into `demo_modules/src/bio_utils/sequence_utils.py`:

```python
# demo_modules/src/bio_utils/sequence_utils.py
from collections import Counter

def clean_dna(seq: str) -> str:
    """Returns an uppercase DNA string with only A, C, G, T, N. Other characters, including spaces, become 'N'."""
    allowed = {'A', 'C', 'G', 'T', 'N'}
    cleaned = []
    for char in seq.upper():
        cleaned.append(char if char in allowed else 'N')
    return ''.join(cleaned)

def gc_content(seq: str) -> float:
    """Returns fraction of G and C bases (0.0 to 1.0), ignoring N."""
    s = clean_dna(seq)
    gc_count = s.count('G') + s.count('C')
    total = len(s) - s.count('N')
    return gc_count / total if total > 0 else 0.0

def revcomp(seq: str) -> str:
    """Returns the reverse complement of a DNA sequence."""
    s = clean_dna(seq)
    complement = str.maketrans('ACGTN', 'TGCAN')
    return s.translate(complement)[::-1]

def count_base(seq: str, base: str) -> int:
    """Returns count of a specific base (A, C, G, T, or N)."""
    s = clean_dna(seq)
    base = base.upper()
    if base not in {'A', 'C', 'G', 'T', 'N'}:
        raise ValueError("Base must be A, C, G, T, or N")
    return s.count(base)

def hamming(a: str, b: str) -> int:
    """Returns number of differing positions (equal length required)."""
    sa = clean_dna(a)
    sb = clean_dna(b)
    if len(sa) != len(sb):
        raise ValueError("Sequences must have equal length")
    return sum(1 for x, y in zip(sa, sb) if x != y)

def kmer_counts(seq: str, k: int) -> dict:
    """Returns dictionary of k-mer frequencies (overlapping, skips N)."""
    if k <= 0:
        raise ValueError("k must be positive")
    s = clean_dna(seq)
    counts = {}
    for i in range(len(s) - k + 1):
        kmer = s[i:i+k]
        if 'N' in kmer:
            continue
        counts[kmer] = counts.get(kmer, 0) + 1
    return counts

def find_motif(seq: str, motif: str) -> list:
    """Returns 0-based start positions of exact motif matches."""
    s = clean_dna(seq)
    m = clean_dna(motif)
    if not m:
        return []
    k = len(m)
    positions = []
    for i in range(len(s) - k + 1):
        if s[i:i+k] == m:
            positions.append(i)
    return positions

def base_counts(seq: str) -> dict:
    """Returns counts of A, C, G, T, N and total length ('LEN')."""
    s = clean_dna(seq)
    counts = Counter(s)
    counts['LEN'] = len(s)
    defaults = {'A': 0, 'C': 0, 'G': 0, 'T': 0, 'N': 0, 'LEN': len(s)}
    defaults.update(counts)
    return defaults
```

### Step 2: Create `test_sequence_utils.py`
Copy all tests from the solution cells into `demo_modules/tests/test_sequence_utils.py`, using `unittest`:

```python
# demo_modules/tests/test_sequence_utils.py
import unittest
from bio_utils.sequence_utils import (
    clean_dna, gc_content, revcomp, count_base, hamming,
    kmer_counts, find_motif, base_counts
)

class TestSequenceUtils(unittest.TestCase):
    def test_clean_dna(self):
        self.assertEqual(clean_dna('acgtRY'), 'ACGTNN')
        self.assertEqual(clean_dna(''), '')
        self.assertEqual(clean_dna('ACGT'), 'ACGT')
        self.assertEqual(clean_dna('XYZ123'), 'NNNNNN')
        self.assertEqual(clean_dna('AT CG N'), 'ATNCGNN')

    def test_gc_content(self):
        self.assertAlmostEqual(gc_content('GCGC'), 1.0)
        self.assertAlmostEqual(gc_content('GCGCAAA'), 4/7)
        self.assertEqual(gc_content('NNNN'), 0.0)
        self.assertEqual(gc_content(''), 0.0)
        self.assertAlmostEqual(gc_content('GCxyGC'), 1.0)

    def test_revcomp(self):
        self.assertEqual(revcomp('ATGC'), 'GCAT')
        self.assertEqual(revcomp('ATGCNN'), 'NNGCAT')
        self.assertEqual(revcomp(''), '')
        self.assertEqual(revcomp('atXYgc'), 'GCNNAT')
        self.assertEqual(revcomp('A'), 'T')

    def test_count_base(self):
        self.assertEqual(count_base('AAGCT', 'A'), 2)
        self.assertEqual(count_base('AAGCT', 'N'), 0)
        self.assertEqual(count_base('', 'A'), 0)
        self.assertEqual(count_base('XYZA', 'A'), 1)
        with self.assertRaises(ValueError):
            count_base('ACGT', 'X')

    def test_hamming(self):
        self.assertEqual(hamming('ACGT', 'ACGA'), 1)
        self.assertEqual(hamming('ACGT', 'ACGT'), 0)
        self.assertEqual(hamming('axyz', 'ACGT'), 3)
        with self.assertRaises(ValueError):
            hamming('ACG', 'ACGT')
        self.assertEqual(hamming('', ''), 0)

    def test_kmer_counts(self):
        self.assertEqual(kmer_counts('ACGTCG', 2), {'AC': 1, 'CG': 2, 'GT': 1, 'TC': 1})
        self.assertEqual(kmer_counts('ACGT', 5), {})
        self.assertEqual(kmer_counts('ACGN', 2), {'AC': 1, 'CG': 1})
        self.assertEqual(kmer_counts('', 1), {})
        with self.assertRaises(ValueError):
            kmer_counts('ACGT', 0)

    def test_find_motif(self):
        self.assertEqual(find_motif('ACGTCG', 'CG'), [1, 4])
        self.assertEqual(find_motif('ACGT', 'GT'), [2])
        self.assertEqual(find_motif('ACGT', ''), [])
        self.assertEqual(find_motif('ACGN', 'CG'), [1])
        self.assertEqual(find_motif('', 'CG'), [])

    def test_base_counts(self):
        self.assertEqual(base_counts('ACGTN'), {'A': 1, 'C': 1, 'G': 1, 'T': 1, 'N': 1, 'LEN': 5})
        self.assertEqual(base_counts(''), {'A': 0, 'C': 0, 'G': 0, 'T': 0, 'N': 0, 'LEN': 0})
        self.assertEqual(base_counts('xyz'), {'A': 0, 'C': 0, 'G': 0, 'T': 0, 'N': 3, 'LEN': 3})
        self.assertEqual(base_counts('acGT'), {'A': 1, 'C': 1, 'G': 1, 'T': 1, 'N': 0, 'LEN': 4})
        self.assertEqual(base_counts('AT CG'), {'A': 1, 'C': 1, 'G': 1, 'T': 1, 'N': 1, 'LEN': 5})

if __name__ == '__main__':
    unittest.main()
```
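
When one test method checks many inputs, a failing first assertion hides the rest. One option is `unittest`'s `subTest`, sketched here on a hypothetical function `double` so the example stays self-contained.

```python
import unittest

def double(n: int) -> int:
    """Hypothetical function under test."""
    return 2 * n

class TestDouble(unittest.TestCase):
    def test_cases(self):
        cases = [(0, 0), (1, 2), (-3, -6)]
        for given, expected in cases:
            # Each subTest is reported separately if it fails
            with self.subTest(given=given):
                self.assertEqual(double(given), expected)

# Run the tests programmatically (this also works in a notebook cell)
suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestDouble)
result = unittest.TextTestRunner(verbosity=0).run(suite)
print(result.wasSuccessful())
```

The same pattern could be applied to the input and expected-output pairs in `test_sequence_utils.py`, though the flat assertions above are perfectly valid too.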

### Step 3: Create `pyproject.toml`
Create `demo_modules/pyproject.toml` to make the package installable:

```toml
[build-system]
requires = ["hatchling"]
build-backend = "hatchling.build"

[project]
name = "bio_utils"
version = "0.1.0"
description = "A Python package for DNA sequence analysis."
readme = "README.md"
requires-python = ">=3.7"
dependencies = []

[tool.hatch.build.targets.wheel]
packages = ["src/bio_utils"]
```

### Step 4: Create `README.md`
Create `demo_modules/README.md`:

````markdown
# bio_utils

A Python package for DNA sequence analysis, designed for bioinformatics tasks.

## Installation
```bash
pip install .
```

## Usage
```python
from bio_utils.sequence_utils import clean_dna, gc_content
print(clean_dna('atcgXY'))  # Outputs: ATCGNN
print(gc_content('GCGC'))   # Outputs: 1.0
```

## Functions
- `clean_dna`: Standardize DNA sequences.
- `gc_content`: Calculate GC content.
- `revcomp`: Compute reverse complement.
- `count_base`: Count specific bases.
- `hamming`: Calculate Hamming distance.
- `kmer_counts`: Count k-mers.
- `find_motif`: Find motif positions.
- `base_counts`: Count all bases and sequence length.

## Running Tests
```bash
python -m unittest discover tests
```
````


### Step 5: Install and Test the Package
Run these commands in your terminal from the `demo_modules` directory:
```bash
pip install .
python -m unittest discover tests
```
This installs `bio_utils` and runs all tests to ensure everything works.
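
After installing, you can confirm from Python that the distribution is registered. This is a minimal sketch using the standard-library `importlib.metadata` module (available in Python 3.8 and later); the helper name `installed_version` is an assumption made for illustration.

```python
from importlib import metadata

def installed_version(dist_name: str):
    """Return the installed version string, or None if not installed."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None

# 'bio_utils' is the project name from pyproject.toml
print(installed_version('bio_utils'))
```

If the install succeeded, this should print the version declared in `pyproject.toml`; `None` suggests the package was installed into a different environment than the one running the notebook. While you are still editing the source, an editable install (`pip install -e .`) may be more convenient, because changes take effect without reinstalling.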

## Using the Package: Example Applications

Now, let's use `bio_utils` to solve real-world bioinformatics problems.

**Example 1: Analyze a Gene Sequence**
- Clean a messy sequence.
- Calculate GC content to check PCR suitability.
- Find occurrences of a promoter motif (e.g., 'TATA').

In [None]:
!pip install .
!python -m unittest discover tests

In [None]:
from bio_utils.sequence_utils import clean_dna, gc_content, find_motif

# Sample gene sequence (messy input)
gene = 'atg cgt tata xy gcttata'

# Clean the sequence
cleaned = clean_dna(gene)
print(f"Cleaned sequence: {cleaned}")

# Calculate GC content
gc = gc_content(gene)
print(f"GC content: {gc:.2f} (suitable for PCR if > 0.4)")

# Find TATA box motifs
positions = find_motif(gene, 'TATA')
print(f"TATA box found at positions: {positions}")

**Example 2: Compare Two Sequences**
- Compute Hamming distance to estimate mutations.
- Check base composition for quality control.

In [None]:
from bio_utils.sequence_utils import hamming, base_counts

# Two sequences (e.g., from different strains)
seq1 = 'ACGTACGT'
seq2 = 'ACGTACGA'

# Calculate Hamming distance
distance = hamming(seq1, seq2)
print(f"Hamming distance: {distance} mutations")

# Check base composition
counts1 = base_counts(seq1)
counts2 = base_counts(seq2)
print(f"Seq1 composition: {counts1}")
print(f"Seq2 composition: {counts2}")

**Example 3: Analyze K-mers for Motif Discovery**
- Find frequent 3-mers to identify potential regulatory elements.

In [None]:
from bio_utils.sequence_utils import kmer_counts

# Sample sequence
seq = 'ACGTCGTACG'

# Count 3-mers
kmers = kmer_counts(seq, 3)
print(f"3-mer counts: {kmers}")

# Find most frequent 3-mer
if kmers:
    most_common = max(kmers.items(), key=lambda x: x[1])
    print(f"Most common 3-mer: {most_common[0]} (count: {most_common[1]})")
else:
    print("No 3-mers found.")
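
One caveat with `max()` is that it returns only the first k-mer when several are tied for the highest count. The sketch below lists all tied k-mers instead; the dictionary is written out by hand from the sequence `'ACGTCGTACG'` so the example runs without the package installed.

```python
from collections import Counter

# The 3-mer counts for 'ACGTCGTACG', written out by hand
kmers = {'ACG': 2, 'CGT': 2, 'GTC': 1, 'TCG': 1, 'GTA': 1, 'TAC': 1}

top = max(kmers.values())  # Highest count in the table
most_frequent = sorted(kmer for kmer, n in kmers.items() if n == top)
print(most_frequent)  # ['ACG', 'CGT']

# Counter offers a ranked view of the same data
print(Counter(kmers).most_common(2))
```

Here both `ACG` and `CGT` occur twice, so reporting a single "most common" 3-mer would hide half of the answer.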

## Next Steps
- **Extend the Package**: Add more functions (e.g., `translate` for protein translation).
- **Share It**: Upload to GitHub and share with lab mates.
- **Test More**: Add complex test cases in `test_sequence_utils.py`.
- **Document**: Expand `README.md` with more examples.

Congratulations! You've built a reusable bioinformatics package!