# GWAS Tutorial — Part 2: Quality Control

Before running a GWAS, we must remove low-quality samples and variants. Poor-quality data leads to spurious associations and inflated test statistics (artificially small p-values). This notebook covers:

1. **Sample QC** — identify and remove poor-quality individuals
2. **Genotype-level filtering** — remove unreliable genotype calls
3. **Variant QC** — filter rare variants and Hardy-Weinberg violations

> **Prerequisites:** Run [Part 1](https://colab.research.google.com/github/gosborcz/winterschool-gwas-tutorial/blob/main/01_data_and_exploration.ipynb) first, or use the Setup cell below to start fresh.

## Setup

Run this cell at the start of every session. It installs Hail, downloads the data if needed, and loads the annotated MatrixTable.

In [None]:
!apt-get install -y openjdk-11-jdk-headless -q
!pip install hail -q

import hail as hl
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
from pprint import pprint

hl.init()
from hail.plot import show
hl.plot.output_notebook()

# Download data if not already present
import os
os.makedirs('data', exist_ok=True)
if not os.path.exists('data/1kg.mt'):
    hl.utils.get_1kg('data/')
    hl.import_vcf('data/1kg.vcf.bgz').write('data/1kg.mt', overwrite=True)

# Load and annotate
table = hl.import_table('data/1kg_annotations.txt', impute=True).key_by('Sample')
mt = hl.read_matrix_table('data/1kg.mt')
mt = mt.annotate_cols(pheno=table[mt.s])
print('Loaded: %d variants x %d samples' % mt.count())

## 1. Sample QC

`hl.sample_qc()` computes per-sample quality metrics and adds them to the column fields under `sample_qc`. Key metrics include:

| Metric | Meaning |
|---|---|
| `call_rate` | Fraction of variants with a genotype call (non-missing) |
| `dp_stats.mean` | Mean read depth across all called genotypes |
| `gq_stats.mean` | Mean genotype quality score |
| `n_het` | Number of heterozygous calls |

Samples with low call rate or low depth are flagged for removal.
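
To make these metrics concrete, here is a minimal pure-Python sketch of how they could be computed for a single sample, following the definitions in the table above. The `toy_calls` data and the `sample_metrics` helper are hypothetical illustrations and not part of Hail; the sketch also assumes the sample has at least one called genotype.

```python
# Toy illustration (not Hail): each entry is (genotype, depth, genotype quality).
# A genotype of None stands for a missing call.
toy_calls = [
    ((0, 0), 10, 60),
    ((0, 1), 8, 45),
    (None, 0, 0),
    ((1, 1), 6, 30),
]


def sample_metrics(calls):
    """Summarize one sample: call rate, mean DP, mean GQ, and number of het calls."""
    called = [c for c in calls if c[0] is not None]
    n_called = len(called)  # assumed > 0 in this sketch
    return {
        "call_rate": n_called / len(calls),
        "dp_mean": sum(dp for _, dp, _ in called) / n_called,
        "gq_mean": sum(gq for _, _, gq in called) / n_called,
        "n_het": sum(1 for gt, _, _ in called if gt[0] != gt[1]),
    }


metrics = sample_metrics(toy_calls)
print(metrics)
```

With three of four genotypes called, this toy sample has a call rate of 0.75 and would fail a 0.97 call-rate threshold.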

In [None]:
mt = hl.sample_qc(mt)
mt.col.describe()

In [None]:
# Mean read depth per sample
p = hl.plot.histogram(mt.sample_qc.dp_stats.mean, range=(0, 30), bins=30,
                      title='Mean Read Depth per Sample', legend='Mean DP')
show(p)

In [None]:
# Call rate per sample (fraction of non-missing genotypes)
p = hl.plot.histogram(mt.sample_qc.call_rate, range=(0.88, 1.0), bins=30,
                      title='Call Rate per Sample', legend='Call Rate')
show(p)

In [None]:
# Mean genotype quality per sample
p = hl.plot.histogram(mt.sample_qc.gq_stats.mean, range=(10, 70), bins=30,
                      title='Mean Genotype Quality per Sample', legend='Mean GQ')
show(p)

In [None]:
# Scatter of mean depth vs. call rate: low-depth samples fall to the left, low-call-rate samples toward the bottom
p = hl.plot.scatter(mt.sample_qc.dp_stats.mean, mt.sample_qc.call_rate,
                    xlabel='Mean DP', ylabel='Call Rate',
                    title='Mean Depth vs. Call Rate')
show(p)

### Sample Filtering

Based on the distributions above, we apply two filters:
- **Mean depth ≥ 4** — removes samples with insufficient sequencing coverage
- **Call rate ≥ 0.97** — removes samples with >3% missing genotypes

In [None]:
n_before = mt.count_cols()
mt = mt.filter_cols((mt.sample_qc.dp_stats.mean >= 4) & (mt.sample_qc.call_rate >= 0.97))
n_after = mt.count_cols()
print('Samples before filter: %d' % n_before)
print('Samples after filter:  %d' % n_after)
print('Removed: %d samples' % (n_before - n_after))

## 2. Allelic Balance Filter

Even within a passing sample, individual genotype calls can be unreliable. The **allelic balance (AB)** — the fraction of reads supporting the alternate allele — should be:

- ~0 for homozygous reference calls (0/0)
- ~0.5 for heterozygous calls (0/1)
- ~1.0 for homozygous alternate calls (1/1)

Calls with allelic balance far from these expected values are filtered out (set to missing).
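
The rule above can be sketched for a single biallelic call in plain Python. The helper names `allelic_balance` and `passes_ab` are hypothetical, the cutoffs (0.1, 0.25 to 0.75, 0.9) mirror the ones used in this notebook's filter, and treating a call with zero reads as failing is a choice made for this sketch.

```python
def allelic_balance(ad):
    """Fraction of reads supporting the alternate allele; ad = (ref_reads, alt_reads)."""
    total = sum(ad)
    return ad[1] / total if total else None


def passes_ab(gt, ad):
    """Return True if the allelic balance is consistent with the genotype call."""
    ab = allelic_balance(ad)
    if ab is None:
        return False  # no reads at all: treated as failing in this sketch
    n_alt = sum(gt)  # 0 = hom-ref, 1 = het, 2 = hom-alt
    if n_alt == 0:
        return ab <= 0.1
    if n_alt == 1:
        return 0.25 <= ab <= 0.75
    return ab >= 0.9


print(passes_ab((0, 1), (10, 9)))   # het call with balanced reads
print(passes_ab((0, 1), (18, 2)))   # het call with only 10% alt reads
print(passes_ab((1, 1), (3, 17)))   # hom-alt call with 85% alt reads
```

A heterozygous call backed by only 2 alternate reads out of 20 has an allelic balance of 0.1, which is more consistent with a homozygous reference genotype plus sequencing error, so it is rejected.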

In [None]:
ab = mt.AD[1] / hl.sum(mt.AD)

filter_condition_ab = (
    (mt.GT.is_hom_ref() & (ab <= 0.1)) |
    (mt.GT.is_het() & (ab >= 0.25) & (ab <= 0.75)) |
    (mt.GT.is_hom_var() & (ab >= 0.9))
)

fraction_filtered = mt.aggregate_entries(hl.agg.fraction(~filter_condition_ab))
print('Filtering %.2f%% of entries (setting to missing).' % (fraction_filtered * 100))
mt = mt.filter_entries(filter_condition_ab)

## 3. Variant QC

`hl.variant_qc()` computes per-variant quality metrics. Key metrics:

| Metric | Meaning |
|---|---|
| `AF[1]` | Allele frequency of the alternate allele in this dataset |
| `call_rate` | Fraction of samples with a genotype call at this variant |
| `p_value_hwe` | Hardy-Weinberg equilibrium test p-value |
| `n_het` | Number of heterozygous calls |

We filter on allele frequency and HWE.
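
As a rough illustration of what these per-variant metrics mean, the sketch below computes them for one biallelic variant from a list of alternate-allele counts. The `variant_metrics` helper and `toy_genotypes` data are hypothetical and not part of Hail; the sketch assumes at least one sample has a called genotype.

```python
def variant_metrics(genotypes):
    """genotypes: alt-allele count per sample (0, 1, or 2), or None if missing."""
    called = [g for g in genotypes if g is not None]
    n_called = len(called)  # assumed > 0 in this sketch
    return {
        "call_rate": n_called / len(genotypes),
        # Each called diploid genotype contributes two alleles to the denominator.
        "alt_af": sum(called) / (2 * n_called),
        "n_het": called.count(1),
    }


toy_genotypes = [0, 0, 1, 2, None, 0, 1, 0, 0, 0]
vm = variant_metrics(toy_genotypes)
print(vm)
```

Here 9 of 10 samples are called and carry 4 alternate alleles among 18, giving an alternate allele frequency of about 0.22.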

In [None]:
mt = hl.variant_qc(mt)
mt.row.describe()

### Enhanced QC Visualizations

Before applying variant filters, let's visualize the distributions to see where the filter thresholds fall relative to the data.

In [None]:
# Alternate allele frequency distribution
# AF[1] is the alt allele frequency; MAF = min(AF, 1-AF).
# The histogram covers 0 to 0.5, so variants with AF[1] > 0.5 fall outside the plotted range.
af_hist = mt.aggregate_rows(hl.agg.hist(mt.variant_qc.AF[1], 0, 0.5, 50))

edges = af_hist.bin_edges
widths = [(edges[i+1] - edges[i]) * 0.9 for i in range(len(af_hist.bin_freq))]
mids = [(edges[i] + edges[i+1]) / 2 for i in range(len(af_hist.bin_freq))]

fig, ax = plt.subplots(figsize=(9, 4))
ax.bar(mids, af_hist.bin_freq, width=widths, color='steelblue', edgecolor='white', linewidth=0.5)
ax.axvline(x=0.01, color='red', linestyle='--', linewidth=1.5, label='AF = 1% filter')
ax.set_xlabel('Alternate Allele Frequency (AF)', fontsize=12)
ax.set_ylabel('Number of Variants', fontsize=12)
ax.set_title('Alternate Allele Frequency Distribution (before filtering)', fontsize=14)
ax.legend()
plt.tight_layout()
plt.show()

low_freq = sum(f for m, f in zip(mids, af_hist.bin_freq) if m < 0.01)
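# Include variants outside the plotted range (n_smaller, n_larger) in the denominator
total = sum(af_hist.bin_freq) + af_hist.n_smaller + af_hist.n_larger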
print('Variants with AF < 1%%: %d (%.1f%%)' % (low_freq, 100 * low_freq / total if total else 0))
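
Alternate allele frequency and minor allele frequency (MAF) coincide only when the alternate allele is the rarer one, since MAF = min(AF, 1 - AF). The short sketch below, using a hypothetical helper `minor_allele_frequency`, shows how a threshold on alternate AF and a threshold on MAF can disagree.

```python
def minor_allele_frequency(alt_af):
    """MAF is the frequency of whichever allele is rarer."""
    return min(alt_af, 1 - alt_af)


for alt_af in (0.005, 0.30, 0.995):
    maf = minor_allele_frequency(alt_af)
    # Compare a filter on alt AF with the same threshold applied to MAF.
    print('alt AF=%.3f  MAF=%.3f  alt AF > 1%%: %s  MAF > 1%%: %s'
          % (alt_af, maf, alt_af > 0.01, maf > 0.01))
```

A variant with alternate AF 0.995 passes an `AF > 0.01` filter even though its minor (reference) allele has a frequency of only 0.005. If such variants matter for your analysis, filtering on MAF instead is one option.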

In [None]:
# Hardy-Weinberg Equilibrium p-value distribution
# Under HWE, genotype frequencies follow the expected proportions p^2, 2pq, q^2.
# Very low p-values (< 1e-6) often indicate genotyping errors or population structure.
hwe_hist = mt.aggregate_rows(
    hl.agg.filter(hl.is_defined(mt.variant_qc.p_value_hwe),
                  hl.agg.hist(-hl.log10(mt.variant_qc.p_value_hwe), 0, 10, 40))
)

edges = hwe_hist.bin_edges
widths = [(edges[i+1] - edges[i]) * 0.9 for i in range(len(hwe_hist.bin_freq))]
mids = [(edges[i] + edges[i+1]) / 2 for i in range(len(hwe_hist.bin_freq))]

fig, ax = plt.subplots(figsize=(9, 4))
ax.bar(mids, hwe_hist.bin_freq, width=widths, color='darkorange', edgecolor='white', linewidth=0.5)
ax.axvline(x=6, color='red', linestyle='--', linewidth=1.5, label='p = 1e-6 filter')
ax.set_xlabel('-log10(HWE p-value)', fontsize=12)
ax.set_ylabel('Number of Variants', fontsize=12)
ax.set_title('Hardy-Weinberg Equilibrium P-value Distribution', fontsize=14)
ax.legend()
plt.tight_layout()
plt.show()

### Variant Filtering

We apply two standard filters:
- **AF > 1%** — removes rare variants (unreliable frequency estimates at low N)
- **HWE p-value > 1e-6** — removes variants with extreme genotype imbalances
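
To give a feel for what the HWE filter measures, the sketch below runs a simple chi-square goodness-of-fit test (1 degree of freedom) on genotype counts. This is a swapped-in approximation for illustration only: the `p_value_hwe` field comes from Hail's own test, so its values will generally differ from this one. The function name `hwe_chi_square` is hypothetical, and the sketch assumes both alleles are observed at the site.

```python
import math


def hwe_chi_square(n_hom_ref, n_het, n_hom_alt):
    """Chi-square HWE test on genotype counts; returns (chi2, p_value)."""
    n = n_hom_ref + n_het + n_hom_alt
    p = (2 * n_hom_ref + n_het) / (2 * n)  # reference allele frequency
    q = 1 - p                              # alternate allele frequency
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_hom_ref, n_het, n_hom_alt)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Survival function of a chi-square with 1 d.f.: P(X > x) = erfc(sqrt(x / 2))
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Counts that match HWE expectations for p = 0.6, q = 0.4
chi2_ok, p_ok = hwe_chi_square(360, 480, 160)
# No heterozygotes at all despite both alleles being common
chi2_bad, p_bad = hwe_chi_square(500, 0, 500)
print(chi2_ok, p_ok)
print(chi2_bad, p_bad)
```

The first variant sits essentially at its expected genotype counts and gets a p-value near 1. The second has a complete absence of heterozygotes, a pattern more typical of a genotyping artifact, and falls far below the 1e-6 cutoff.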

In [None]:
n_vars_before = mt.count_rows()

mt = mt.filter_rows(mt.variant_qc.AF[1] > 0.01)
mt = mt.filter_rows(mt.variant_qc.p_value_hwe > 1e-6)

n_vars_after = mt.count_rows()
print('Variants before QC: %d' % n_vars_before)
print('Variants after QC:  %d' % n_vars_after)
print('Removed: %d variants (%.1f%%)' % (
    n_vars_before - n_vars_after,
    100 * (n_vars_before - n_vars_after) / n_vars_before if n_vars_before else 0
))

In [None]:
# Visual comparison: before vs. after filtering
fig, ax = plt.subplots(figsize=(7, 4))
stages = ['Before QC', 'After QC']
values = [n_vars_before, n_vars_after]
colors = ['#d9534f', '#5cb85c']
bars = ax.bar(stages, values, color=colors, edgecolor='white', linewidth=0.8, width=0.5)
for bar, v in zip(bars, values):
    ax.text(bar.get_x() + bar.get_width() / 2, bar.get_height() + 50,
            '%d' % v, ha='center', va='bottom', fontsize=12, fontweight='bold')
ax.set_ylabel('Number of Variants', fontsize=12)
ax.set_title('Variants Before and After QC Filtering', fontsize=14)
ax.set_ylim(0, max(values) * 1.15)
plt.tight_layout()
plt.show()

In [None]:
n_samples, n_variants = mt.count_cols(), mt.count_rows()
print('Final dataset after QC:')
print('  Samples:  %d' % n_samples)
print('  Variants: %d' % n_variants)

## 4. Save the QC-Filtered MatrixTable

We write the filtered MatrixTable to disk so that Part 3 can load it directly without re-running QC.

In [None]:
mt.write('data/1kg_qc.mt', overwrite=True)
print('Saved QC-filtered MatrixTable to data/1kg_qc.mt')

## Summary

In this notebook you:

- ✅ Computed per-sample QC metrics and removed low-quality samples (low depth or call rate)
- ✅ Filtered unreliable genotype calls by allelic balance
- ✅ Computed per-variant QC metrics and visualized allele-frequency and HWE distributions
- ✅ Removed rare variants (AF < 1%) and HWE outliers
- ✅ Saved a clean, filtered MatrixTable ready for association testing

---

**Next:** Open [Part 3 — Association Analysis](https://colab.research.google.com/github/gosborcz/winterschool-gwas-tutorial/blob/main/03_gwas_association.ipynb) to run the GWAS and visualize the results.