# Validate Quality Check Metrics (Live Coding Exercise)

**Difficulty:** Easy → Medium  
**Time:** ~15–20 minutes  

**You’ll practice:**
- String iteration and counting in Python
- Dictionary / conditional logic
- Writing small, composable functions
- Returning structured outputs


### Exercises

**Exercise 1**  
Write code to validate sequences and compute QC metrics such as sequence length and GC content (ignoring `N` bases).

**Exercise 2**  
Return a structured output, with one record per sample.


> This is a foundational preprocessing pattern used in real bioinformatics pipelines and should be implementable live, without any external libraries.


## Problem

You are given a **list of sample records**, where each record is a dictionary with:

- `sample_id`: unique ID string for the sample
- `sequence`: DNA sequence string
- `platform`: sequencing platform (e.g. `illumina`, `nanopore`)

For example:

```python
samples = [
    {
        "sample_id": "S1",
        "sequence": "ATTCATGACTGGATAC",
        "platform": "illumina"
    },
    {
        "sample_id": "S2",
        "sequence": "NNNTGCATGCA",
        "platform": "illumina"
    },
    {
        "sample_id": "S3",
        "sequence": "ATGCATGCATGC",
        "platform": "nanopore"
    }
]
```

Your job is to write code that **validates each sequence**, **computes QC metrics**, and **returns a structured summary of results**, taking into account that:

- Different platforms (Illumina vs Nanopore) may have different typical read lengths and error profiles (this exercise ignores those differences)
- Some sequences may contain `N` bases (unknown nucleotides)

### Workflow

1. Iterate over the input `samples` list.
2. For each sample:
   - Validate the `sequence` string
   - Compute QC metrics
3. Collect the results into a structured output (e.g. a new list of dicts), with **one record per sample**.

### Validation rules

Assume the following:

- Allowed DNA bases: `A, T, C, G, N`
- Validation is **case-insensitive**
- A sequence is **invalid** if:
  - It contains any character outside `A, T, C, G, N`
  - Or it is empty / only whitespace (decide and document your choice)

For invalid sequences, decide on a clear behavior, for example:

- Mark the sample as invalid and **skip QC metrics**, or
- Return metrics but include a flag like `is_valid = False`

> **Note**  
> Whatever rule you choose, keep it **consistent** and make it easy to see which samples failed validation.
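
To make the rules above concrete, here is a minimal sketch of a case-insensitive character check. The helper name `find_invalid_chars` is hypothetical and is not part of the exercise scaffold. Note that an empty string contains no disallowed characters, so a set-based check alone does not catch it; the empty case needs its own explicit rule.

```python
ALLOWED_BASES = set("ATCGN")


def find_invalid_chars(sequence: str) -> set:
    """Hypothetical helper: characters in `sequence` that are not allowed bases."""
    # Upper-casing first makes the check case-insensitive
    return set(sequence.upper()) - ALLOWED_BASES


print(sorted(find_invalid_chars("atgcn")))   # [] (lowercase is accepted)
print(sorted(find_invalid_chars("ATGXCU")))  # ['U', 'X']
print(sorted(find_invalid_chars("   ")))     # [' '] (whitespace is flagged)
print(sorted(find_invalid_chars("")))        # [] (empty input needs a separate check)
```

Whether whitespace should be stripped before checking, or reported as an invalid character as it is here, is one of the choices to document.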

### QC metrics

For each **valid** sequence, compute at least:

- `length`: total number of bases (including `N`)
- `g_count`: number of `G` bases
- `c_count`: number of `C` bases
- `gc_content`: GC fraction or percent, **ignoring `N` bases**

Remember:

- `N` bases are **ignored** when computing GC%

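GC content formula (ignoring `N` bases), where the denominator counts only `A`, `T`, `C`, and `G`:  
`GC fraction = (G + C) / (A + T + C + G)`; multiply by 100 to report it as a percentage.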

> **Edge case**  
> If a sequence has only `N` bases (no valid `A/T/C/G`), avoid division by zero.  
> Decide how to represent GC% in this case (e.g. `None`, `0.0`, or a special flag).
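
As a worked check of the formula and the edge case above, the arithmetic for sample `S2` can be spelled out step by step. This is only an illustration; returning `None` for an all-`N` sequence is one of the options listed above, chosen here as an assumption.

```python
sequence = "NNNTGCATGCA"  # sample S2 from the problem statement

n_count = sequence.count("N")          # 3 unknown bases
valid_bases = len(sequence) - n_count  # 11 - 3 = 8
g_count = sequence.count("G")          # 2
c_count = sequence.count("C")          # 2

gc_fraction = (g_count + c_count) / valid_bases
print(gc_fraction)  # 0.5

# Edge case: only N bases, so the denominator would be zero
all_n = "NNNN"
all_n_valid_bases = len(all_n) - all_n.count("N")  # 0
# Assumption: represent "undefined" GC content as None instead of dividing
all_n_gc = None if all_n_valid_bases == 0 else (all_n.count("G") + all_n.count("C")) / all_n_valid_bases
print(all_n_gc)  # None
```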

The goal is to produce a **clean, easy-to-read QC table** (e.g. list of dicts) that summarizes validation and basic metrics for each sample.

## Exercise 1 – Validating Sequences and Computing Metrics

### Step 1 – Define assumptions 
- What should happen if the sequence contains invalid characters?
- Should lowercase bases be allowed?
- What should happen if the sequence is empty or only contains `N`?

### Step 2 – Implement helper functions

You will implement two functions:

1. `compute_metrics(seq: str) -> Tuple[int, Optional[float], float]`
    - returns `(length, gc_content, n_fraction)`
2. `validate_sequence(seq: str, n_fraction: float, max_n_fraction: float = 0.1) -> Tuple[bool, Optional[str]]`
    - returns ...

Fill in the code in the cells below.

In [30]:
# your sample dataset

samples = [
    {
        "sample_id": "S1",
        "sequence": "ATTCATGACTGGATAC",
        "platform": "illumina"
    },
    {
        "sample_id": "S2",
        "sequence": "NNNTGCATGCA",
        "platform": "illumina"
    },
    {
        "sample_id": "S3",
        "sequence": "ATGCATGCATGC",
        "platform": "nanopore"
    }
]

print(samples)

[{'sample_id': 'S1', 'sequence': 'ATTCATGACTGGATAC', 'platform': 'illumina'}, {'sample_id': 'S2', 'sequence': 'NNNTGCATGCA', 'platform': 'illumina'}, {'sample_id': 'S3', 'sequence': 'ATGCATGCATGC', 'platform': 'nanopore'}]


In [None]:
# to solve
from typing import Tuple, Optional

ALLOWED_BASES = set("ATCGN")

def compute_metrics(seq: str) -> Tuple[int, Optional[float], float]:
    """
    Takes a sequence from Dict
    Returns the following metrics
        - length: length of sequence
        - gc_content: the fraction of G and C bases in the sequence (ignoring N)
        - n_fraction: the fraction of N in the original sequence
    """
    # your code here


def validate_sequence(seq: str, n_fraction: float, max_n_fraction: float = 0.1) -> Tuple[bool, Optional[str]]:
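    """
    Takes a sequence string and its precomputed N fraction
    Returns whether the sequence is valid for downstream analysis, plus a reason if it is not
        - It is invalid if it is empty or contains characters outside ALLOWED_BASES
        - In this example, it is also invalid if its N fraction is above 10% (denoted by max_n_fraction)
    """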
    # your code here


## Solutions to Exercise 1

> If you haven’t attempted the exercise yet, stop here and try implementing the functions yourself before expanding this section!

Below is one possible solution.

In [32]:
def compute_metrics(seq: str) -> Tuple[int, Optional[float], float]:
    """
    Takes a DNA sequence string
    Returns the following metrics
        - length: length of the sequence (including N)
        - gc_content: the fraction of G and C among non-N bases (None if there are none)
        - n_fraction: the fraction of N in the original sequence
    """
    seq = seq.upper()
    length = len(seq)
    if length == 0:
        return 0, None, 0.0

    n_count = seq.count("N")
    n_fraction = n_count / length

    non_n_length = length - n_count
    if non_n_length == 0:
        gc_content = None
    else:
        g_count = seq.count("G")
        c_count = seq.count("C")
        gc_content = (g_count + c_count) / non_n_length

    return length, gc_content, n_fraction


def validate_sequence(
    seq: str,
    n_fraction: float,
    max_n_fraction: float = 0.10,
) -> Tuple[bool, Optional[str]]:
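    """
    Takes a sequence string and its precomputed N fraction
    Returns (is_valid, reason); reason is None when the sequence is valid
        - Empty or whitespace-only sequences are invalid
        - Sequences with characters outside ALLOWED_BASES are invalid
        - In this example, sequences with an N fraction above 10% are also invalid
    """
    seq = seq.upper()

    # Design choice: treat empty / whitespace-only input as invalid.
    # Without this check, "" has no disallowed characters and would pass.
    if not seq.strip():
        return False, "empty_sequence"

    invalid = set(seq) - ALLOWED_BASES
    if invalid:
        return False, f"invalid_bases = {''.join(sorted(invalid))}"

    if n_fraction > max_n_fraction:
        return False, f"too_many_Ns (n_fraction = {n_fraction:.3f} > {max_n_fraction:.2f})"

    return True, None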
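
The problem statement also lists `g_count` and `c_count` among the metrics. As a variation, here is a minimal sketch that tallies every base in one pass with `collections.Counter` and reports those counts alongside the GC fraction. The function name `count_bases` and the returned keys are assumptions made for illustration. Unlike a `length - n_count` denominator, this version counts only `A`, `C`, `G`, and `T`, so stray invalid characters do not dilute the GC fraction.

```python
from collections import Counter
from typing import Dict, Optional, Union


def count_bases(seq: str) -> Dict[str, Union[int, float, None]]:
    """Hypothetical helper: per-base counts plus GC fraction, ignoring N."""
    counts = Counter(seq.upper())  # missing keys count as 0
    g_count = counts["G"]
    c_count = counts["C"]

    # Only A/C/G/T contribute to the denominator; N is ignored
    acgt_total = sum(counts[base] for base in "ACGT")
    gc_content: Optional[float] = (
        (g_count + c_count) / acgt_total if acgt_total else None
    )

    return {
        "length": len(seq),
        "g_count": g_count,
        "c_count": c_count,
        "gc_content": gc_content,
    }


print(count_bases("ATTCATGACTGGATAC"))
# {'length': 16, 'g_count': 3, 'c_count': 3, 'gc_content': 0.375}
```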


## Exercise 2 – Returning a Structured Dictionary with Validation & Metrics

Fill in the code in the cells below using your `compute_metrics` and `validate_sequence` functions.

Your goal is to create a helper that takes a single sample dictionary and returns a new dictionary containing:

- `sample_id`
- `length`
- `gc_content`
- `n_fraction`
- `gc_pass` (boolean validation flag)
- `reason` (string explaining why validation failed, or `None`)


In [41]:
# to solve

from typing import Dict

def process_sample(sample: Dict) -> Dict:
    """
    Takes one sample dictionary and returns the combined compute_metrics
    and validate_sequence results as a new dictionary:

        {
            "sample_id": sample_id,
            "length": length,
            "gc_content": gc_content,
            "n_fraction": n_fraction,
            "gc_pass": is_valid,
            "reason": reason
        }
    """
    
    # your code here

In [42]:
# to check if your code works!
from typing import Dict, List

def run_qc(samples: List[Dict]) -> List[Dict]:
    # One output record per input sample
    return [process_sample(s) for s in samples]

results = run_qc(samples)

for r in results:
    print(r)

None
None
None


## Solutions to Exercise 2

> If you haven't attempted the exercise yet, stop here and try implementing `process_sample` yourself before expanding this section!

Below is one possible solution.

In [43]:
from typing import Dict, List

def process_sample(sample: Dict) -> Dict:
    """
    Takes one sample dictionary and returns the combined compute_metrics
    and validate_sequence results as a new dictionary:

        {
            "sample_id": sample_id,
            "length": length,
            "gc_content": gc_content,
            "n_fraction": n_fraction,
            "gc_pass": is_valid,
            "reason": reason
        }
    """
    sample_id = sample.get("sample_id")
    seq = sample.get("sequence", "")

    # Metrics come first because validate_sequence needs the N fraction
    length, gc_content, n_fraction = compute_metrics(seq)
    is_valid, reason = validate_sequence(seq, n_fraction)

    return {
        "sample_id": sample_id,
        "length": length,
        "gc_content": gc_content,
        "n_fraction": n_fraction,
        "gc_pass": is_valid,
        "reason": reason,
    }


def run_qc(samples: List[Dict]) -> List[Dict]:
    # One output record per input sample
    return [process_sample(s) for s in samples]

results = run_qc(samples)

for r in results:
    print(r)

{'sample_id': 'S1', 'length': 16, 'gc_content': 0.375, 'n_fraction': 0.0, 'gc_pass': True, 'reason': None}
{'sample_id': 'S2', 'length': 11, 'gc_content': 0.5, 'n_fraction': 0.2727272727272727, 'gc_pass': False, 'reason': 'too_many_Ns (n_fraction = 0.273 > 0.10)'}
{'sample_id': 'S3', 'length': 12, 'gc_content': 0.5, 'n_fraction': 0.0, 'gc_pass': True, 'reason': None}
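
Since the stated goal is a clean, easy-to-read QC table, the list of dicts can also be printed as aligned columns without any external libraries. The sketch below hard-codes the three records shown above so that it runs on its own; the helper `format_cell` and the choice of `-` for missing values are assumptions for illustration.

```python
results = [
    {"sample_id": "S1", "length": 16, "gc_content": 0.375, "n_fraction": 0.0,
     "gc_pass": True, "reason": None},
    {"sample_id": "S2", "length": 11, "gc_content": 0.5, "n_fraction": 0.2727272727272727,
     "gc_pass": False, "reason": "too_many_Ns (n_fraction = 0.273 > 0.10)"},
    {"sample_id": "S3", "length": 12, "gc_content": 0.5, "n_fraction": 0.0,
     "gc_pass": True, "reason": None},
]

columns = ["sample_id", "length", "gc_content", "n_fraction", "gc_pass", "reason"]


def format_cell(value) -> str:
    """Hypothetical helper: render one table cell as text."""
    if value is None:
        return "-"              # assumption: show missing values as a dash
    if isinstance(value, float):
        return f"{value:.3f}"   # fixed precision keeps columns tidy
    return str(value)


# Convert every record to a row of strings, then size each column to its widest cell
rows = [[format_cell(record[col]) for col in columns] for record in results]
widths = [
    max([len(col)] + [len(row[i]) for row in rows])
    for i, col in enumerate(columns)
]

print("  ".join(col.ljust(width) for col, width in zip(columns, widths)))
for row in rows:
    print("  ".join(cell.ljust(width) for cell, width in zip(row, widths)))
```

Keeping the records as plain dicts and formatting only at print time means the same `results` list can still be filtered or written to a file later.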
