# GC Content of a DNA Sequence (Live Coding Exercise)

**Difficulty:** Easy → Medium  
**Time:** ~10–20 minutes  

**You’ll practice:**
- Counting characters in strings
- Dictionary / conditional logic
- Writing small, composable functions
- Sliding window algorithms (interview-style)


### Exercises

**Exercise 1**  
Compute the **GC content** of a single DNA sequence.

**Exercise 2**  
Compute **GC content in sliding windows** across a longer DNA sequence.


## Problem

**GC content** is the fraction of bases in a DNA sequence that are **G** or **C**.

Common assumptions (state these before coding):
- DNA bases: `A, T, C, G`, plus `N` for an unknown base
- GC content is usually **case-insensitive**
- `N` bases are typically **ignored** when computing GC%

Given a DNA sequence string, compute:
- sequence length
- number of `G` bases
- number of `C` bases
- GC content

GC content formula (ignoring `N` bases):  
`GC = (G + C) / (number of non-N bases)`, a fraction between 0 and 1 (multiply by 100 for GC%)

> **Note**  
> For this exercise, **ignore `N` bases** when computing GC%.
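
As a quick worked example of the formula, the sketch below tallies bases with `collections.Counter` and leaves `N` out of the denominator. The sequence `example` is made up for illustration and is not part of the exercise data.

```python
from collections import Counter

# Illustrative sequence (an assumption for this example):
# A, T, G, C, N, N, C, A -> 3 G/C bases, 3 A/T bases, 2 N bases
example = "ATGCNNCA"

counts = Counter(example.upper())        # case-insensitive tally of every base
gc = counts["G"] + counts["C"]           # 3
valid = sum(counts[b] for b in "ACGT")   # 6, N bases are excluded

gc_fraction = gc / valid
print(gc_fraction)  # 0.5
```

Dividing by `len(example)` instead would give 3/8 = 0.375, which is why the handling of `N` is worth stating up front.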


## Exercise 1 – GC content of a single DNA sequence

### Step 1 – Define assumptions 
- What should happen if the sequence contains invalid characters?
- Should lowercase bases be allowed?
- What should happen if the sequence is empty or only contains `N`?


### Step 2 – Implement helper functions

You will implement two functions:

1. `validate_dna(sequence: str) -> str`
2. `gc_content(sequence: str) -> float`

Fill in the code in the cells below.
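
One possible shape for `validate_dna` is sketched here. It assumes these answers to the Step 1 questions: lowercase input is accepted and normalized to uppercase, and any character outside `A, T, C, G, N` raises a `ValueError`. Both are choices made for illustration, not requirements of the exercise, and the constant `VALID_BASES` is a name introduced only for this sketch.

```python
# Alphabet taken from the Problem section; the constant name is hypothetical.
VALID_BASES = set("ATCGN")


def validate_dna(sequence: str) -> str:
    """Return the uppercase sequence, or raise ValueError on invalid characters."""
    normalized = sequence.upper()
    # Collect every distinct character that is not a recognized base.
    invalid = sorted(set(normalized) - VALID_BASES)
    if invalid:
        raise ValueError(f"Invalid DNA characters: {invalid}")
    return normalized


print(validate_dna("attcatgN"))  # ATTCATGN
```

Raising an exception rather than silently dropping bad characters keeps errors visible; an interviewer may prefer a different policy, so it is worth asking.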

In [6]:
from typing import Optional

dna = "ATTCATGACTGGATAC"

def compute_gc_content(seq: str) -> Optional[float]:
    """Return the GC fraction of seq (N bases ignored), or None if it is undefined."""
    # normalize case so lowercase bases are counted too
    s = seq.upper()

    # N bases are excluded from the denominator
    non_n_length = len(s) - s.count("N")

    # empty or all-N sequence: GC content is undefined
    if non_n_length == 0:
        return None

    g_count = s.count("G")
    c_count = s.count("C")
    return (g_count + c_count) / non_n_length

print(compute_gc_content(dna))

0.375


## Exercise 2 – Sliding-window GC content across a longer DNA sequence

In [None]:
dna = "ATTCATGACTGGATAC"

def gc_sliding_window(seq: str, k: int) -> list[float]:
    """
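    Compute GC content (%) for each sliding window of size k across a DNA sequence.

    Parameters
        - seq: str
            DNA sequence of A, C, G, T (case-insensitive); N is not treated
            specially and counts toward the window size
        - k: int
            window size

    Returns a list[float] with the GC percentage of each window
    (empty if k <= 0 or k exceeds the sequence length)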
    """
    
    # normalizing sequence
    s = seq.upper()
    
    # length of the seq
    n = len(s)
    
    # edge cases
    if k <= 0 or n < k:
        return []
    
    gc_percentages = []
    
    # counting GC bases in the first window
    gc = 0
    for i in range(k):
        if s[i] == "G" or s[i] == "C":
            gc += 1
            
    gc_percentages.append((gc / k) * 100)  # percent, not fraction
    
    # slide the window across the sequence
    for i in range(k, n):
        # remove the base leaving the window
        if s[i - k] == "G" or s[i - k] == "C":
            gc -= 1
            
        # add the base entering the window
        if s[i] == "G" or s[i] == "C":
            gc += 1
            
        gc_percentages.append((gc / k) * 100)
        
    return gc_percentages

print(gc_sliding_window(dna, 5))


[20.0, 20.0, 40.0, 40.0, 40.0, 40.0, 60.0, 60.0, 60.0, 40.0, 40.0, 40.0]
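
A brute-force version is a handy cross-check for the incremental update above. It recounts every window from scratch, which costs O(n·k) rather than O(n) but is hard to get wrong. The name `gc_windows_naive` is hypothetical, introduced only for this sketch.

```python
def gc_windows_naive(seq: str, k: int) -> list[float]:
    """Recount G/C bases in every window of size k and return percentages."""
    s = seq.upper()
    # Same edge cases as the sliding-window version: no valid window, no output.
    if k <= 0 or len(s) < k:
        return []
    return [
        sum(base in "GC" for base in s[i:i + k]) / k * 100
        for i in range(len(s) - k + 1)
    ]


print(gc_windows_naive("ATTCATGACTGGATAC", 5))
```

For a sequence of length n there are n - k + 1 windows, so the 16-base example with k = 5 yields 12 values, matching the output above. In an interview, comparing the two implementations on a few inputs is a quick way to show the optimized version is correct.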
