# GWAS Tutorial — Part 1: Data Loading & Exploration

This notebook introduces [Hail](https://hail.is/), a Python library for scalable genomic data analysis, and walks you through downloading, importing, and exploring the **1000 Genomes dataset**.

This is the first of three notebooks:

| Notebook | Topic |
|---|---|
| **Part 1 (this notebook)** | Data loading & exploration |
| **Part 2** | Quality control |
| **Part 3** | Association analysis & results |

> **Note:** Run cells from top to bottom. The setup cell installs dependencies, and Section 1 downloads ~20 MB of data; expect this to take a couple of minutes the first time.

## Setup

Hail requires Java to run its Spark backend. The cell below installs OpenJDK 11 and Hail in this Colab environment.

In [None]:
!apt-get install -y openjdk-11-jdk-headless -q
!pip install hail -q
print('Installation complete!')

In [None]:
import hail as hl
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
from collections import Counter
from pprint import pprint

hl.init()
from hail.plot import show
hl.plot.output_notebook()
print('Hail initialized.')

## 1. Download the 1000 Genomes Dataset

We use a publicly available subset of the [1000 Genomes Project](https://www.internationalgenome.org/), which contains genomic variants from individuals across 26 world populations.

`hl.utils.get_1kg()` downloads:
- `1kg.vcf.bgz` — a compressed VCF file with genotype calls
- `1kg_annotations.txt` — sample phenotypes and population labels

In [None]:
import os
os.makedirs('data', exist_ok=True)
hl.utils.get_1kg('data/')

## 2. Import the VCF and Create a MatrixTable

Hail's core data structure is the **MatrixTable**, which has its own native on-disk format. We import the VCF once and write it to disk in that format; subsequent reads are much faster than re-parsing the VCF each time.

In [None]:
hl.import_vcf('data/1kg.vcf.bgz').write('data/1kg.mt', overwrite=True)
mt = hl.read_matrix_table('data/1kg.mt')
n_vars, n_samples = mt.count()
print('Loaded: %d variants x %d samples' % (n_vars, n_samples))

## 3. The MatrixTable: Hail's Core Data Structure

A **MatrixTable** organizes genomic data as a 2D matrix:

```
              Sample1  Sample2  Sample3  ...
 chr1:100 A/T   0/1      0/0      1/1
 chr1:200 G/C   0/0      0/1      0/1
 chr2:500 A/G   1/1      0/1      0/0
    ...          ...      ...      ...
```

- **Rows** = variants (identified by locus + alleles)
- **Columns** = samples (identified by sample ID `s`)
- **Entries** = genotype calls (GT, DP, GQ, AD, PL per variant-sample pair)
- **Row fields** = variant-level INFO annotations
- **Column fields** = sample metadata (we add these next)
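
To make the layout concrete, here is a minimal sketch of the same idea using plain Python containers rather than Hail. It reuses the three variants and three samples from the diagram above and stores each genotype as an alt-allele count. The variable names are illustrative only and are not part of the Hail API; the point is that row aggregations yield one value per variant, while column aggregations yield one value per sample.

```python
# Toy stand-in for a MatrixTable using plain Python containers (not Hail).
# Row keys: (locus, alleles); column keys: sample IDs; entries: one record per pair.
row_keys = [("chr1:100", ("A", "T")), ("chr1:200", ("G", "C")), ("chr2:500", ("A", "G"))]
col_keys = ["Sample1", "Sample2", "Sample3"]

# Genotypes from the diagram, stored as alt-allele counts (0/0 -> 0, 0/1 -> 1, 1/1 -> 2).
genotypes = [
    [1, 0, 2],
    [0, 1, 1],
    [2, 1, 0],
]

# One entry record per (variant, sample) pair.
entries = {
    (row, col): {"n_alt": genotypes[i][j]}
    for i, row in enumerate(row_keys)
    for j, col in enumerate(col_keys)
}

# Aggregating along rows gives one value per variant (here: total alt alleles).
alt_per_variant = {
    row: sum(entries[(row, col)]["n_alt"] for col in col_keys) for row in row_keys
}

# Aggregating along columns gives one value per sample (here: number of non-ref calls).
non_ref_per_sample = {
    col: sum(entries[(row, col)]["n_alt"] > 0 for row in row_keys) for col in col_keys
}

print(alt_per_variant)
print(non_ref_per_sample)
```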

In [None]:
# Row key: locus (chromosome + position) and alleles (ref, alt)
mt.rows().select().show(5)

In [None]:
# Entry fields: GT=genotype, DP=read depth, GQ=genotype quality, AD=allele depths, PL=Phred-scaled genotype likelihoods
mt.entry.take(3)

## 4. Sample Annotations

The annotations file contains one row per sample with:
- `SuperPopulation` — one of 5 continental groups (AFR, AMR, EAS, EUR, SAS)
- `Population` — one of 26 specific populations
- `isFemale` — biological sex
- `CaffeineConsumption` — a simulated continuous phenotype (our GWAS trait)
- `PurpleHair` — a simulated binary phenotype

We load this table and join it to the MatrixTable columns by sample ID.
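
Joining by sample ID behaves like a keyed lookup: each column of the matrix looks up its own key in the annotation table, and samples with no matching row end up with missing annotations. The sketch below mimics that behaviour with plain Python dictionaries. The sample IDs and values are made up for illustration, and the code does not use Hail.

```python
# Toy illustration of a keyed lookup join (plain Python, not Hail).
# The annotation rows and sample IDs below are made up for illustration.
annotations = {
    "S1": {"SuperPopulation": "AFR", "CaffeineConsumption": 4},
    "S2": {"SuperPopulation": "EUR", "CaffeineConsumption": 6},
    "S9": {"SuperPopulation": "EAS", "CaffeineConsumption": 3},  # not in the matrix
}
matrix_samples = ["S1", "S2", "S3"]  # column keys of the toy matrix

# For each column key, look up the matching annotation row; unmatched keys get None.
pheno = {s: annotations.get(s) for s in matrix_samples}

# Samples that are present in the matrix but absent from the annotation table.
unmatched = [s for s, row in pheno.items() if row is None]

print(pheno)
print("Samples without annotations:", unmatched)
```

Annotation rows for samples that are not in the matrix (here `S9`) are simply never looked up, so the number of columns does not change.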

In [None]:
table = hl.import_table('data/1kg_annotations.txt', impute=True).key_by('Sample')
table.describe()

In [None]:
table.show(5, width=120)

In [None]:
mt = mt.annotate_cols(pheno=table[mt.s])
mt.col.describe()

## 5. Phenotype & Population Overview

Let's look at the distribution of samples across super-populations and explore the phenotype of interest.

In [None]:
pop_counts = mt.aggregate_cols(hl.agg.counter(mt.pheno.SuperPopulation))
pprint(pop_counts)

In [None]:
fig, ax = plt.subplots(figsize=(8, 4))
pops = sorted(pop_counts.keys())
counts = [pop_counts[p] for p in pops]
colors = ['#4e9af1', '#f4a261', '#2a9d8f', '#e76f51', '#8ecae6']
bars = ax.bar(pops, counts, color=colors, edgecolor='white', linewidth=0.8)
for bar, c in zip(bars, counts):
    ax.text(bar.get_x() + bar.get_width() / 2, bar.get_height() + 1,
            str(c), ha='center', va='bottom', fontsize=11)
ax.set_xlabel('Super-Population', fontsize=12)
ax.set_ylabel('Number of Samples', fontsize=12)
ax.set_title('Sample Counts by Super-Population', fontsize=14)
ax.set_ylim(0, max(counts) * 1.15)
plt.tight_layout()
plt.show()

In [None]:
# Summary statistics for caffeine consumption
caff_stats = mt.aggregate_cols(hl.agg.stats(mt.pheno.CaffeineConsumption))
pprint(caff_stats)

In [None]:
caff_values = mt.aggregate_cols(hl.agg.collect(mt.pheno.CaffeineConsumption))
fig, ax = plt.subplots(figsize=(8, 4))
ax.hist(caff_values, bins=30, color='steelblue', edgecolor='white')
ax.set_xlabel('Caffeine Consumption (simulated, arbitrary units)', fontsize=12)
ax.set_ylabel('Number of Samples', fontsize=12)
ax.set_title('Distribution of Caffeine Consumption Phenotype', fontsize=14)
plt.tight_layout()
plt.show()

## 6. Variant Exploration

Now let's dig into the genomic variants — their types, allele frequencies, and distribution across the genome.

In [None]:
# Count all unique ref->alt substitution pairs
snp_counts = mt.aggregate_rows(
    hl.agg.counter(hl.Struct(ref=mt.alleles[0], alt=mt.alleles[1]))
)
print('Top 10 most common allele changes:')
for change, count in Counter(snp_counts).most_common(10):
    print('  %s -> %s: %d' % (change.ref, change.alt, count))

### Transition / Transversion Ratio (Ti/Tv)

**Transitions (Ti)** are purine↔purine or pyrimidine↔pyrimidine: A↔G, C↔T  
**Transversions (Tv)** are purine↔pyrimidine: A↔C, A↔T, G↔C, G↔T

For whole-genome sequencing, a Ti/Tv of **~2.0–2.1** is expected. Markedly lower values can indicate sequencing or variant-calling artifacts.
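
As a reference point, it helps to see what the ratio would be if every single-base substitution were equally likely. The short sketch below enumerates all ordered substitutions among A, C, G, and T and classifies each one using the purine/pyrimidine definition above. The helper name `is_transition` is illustrative and is not part of Hail.

```python
from itertools import permutations

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def is_transition(ref, alt):
    """True when ref and alt are both purines or both pyrimidines."""
    return ({ref, alt} <= PURINES) or ({ref, alt} <= PYRIMIDINES)


# Enumerate all 12 ordered single-base substitutions (ref != alt).
substitutions = list(permutations("ACGT", 2))
n_ti = sum(is_transition(r, a) for r, a in substitutions)
n_tv = len(substitutions) - n_ti

print("Possible transitions:  ", n_ti)
print("Possible transversions:", n_tv)
print("Ti/Tv if all substitutions were equally likely:", n_ti / n_tv)
```

There are twice as many possible transversions as transitions, so uniformly random substitutions would give a ratio of 0.5. Real data sits well above that, and random calling errors tend to pull the ratio down toward it.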

In [None]:
transitions = {'A>G', 'G>A', 'C>T', 'T>C'}
transversions = {'A>C', 'A>T', 'G>C', 'G>T', 'C>A', 'T>A', 'C>G', 'T>G'}

ti_count = sum(v for k, v in snp_counts.items() if '%s>%s' % (k.ref, k.alt) in transitions)
tv_count = sum(v for k, v in snp_counts.items() if '%s>%s' % (k.ref, k.alt) in transversions)

print('Transitions:   %d' % ti_count)
print('Transversions: %d' % tv_count)
print('Ti/Tv ratio:   %.3f' % (ti_count / tv_count if tv_count else float('nan')))

### Allele Frequency Spectrum (AFS)

The AFS shows the distribution of variant allele frequencies. In large sequenced cohorts, most variants are **rare**, as population genetics theory predicts, which gives the AFS a characteristic L shape. A small subset such as this one may deviate from that shape depending on how its variants were selected.

We use the `AF` field from the VCF INFO column, which gives the allele frequency in the original 1000 Genomes cohort.
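
For intuition about what an allele frequency measures, the sketch below computes one by hand for a single biallelic variant, as the alt-allele count divided by the number of called alleles. The genotype values are made up, and the code is plain Python rather than Hail. A frequency computed this way from the samples in a subset can differ from the cohort-wide `AF` value stored in the INFO field.

```python
# Toy allele-frequency calculation for one biallelic variant (plain Python).
# Genotypes are alt-allele counts per diploid sample; None marks a missing call.
# The values are made up for illustration.
calls = [0, 0, 1, 0, 2, None, 0, 1, 0, 0]

called = [g for g in calls if g is not None]
alt_alleles = sum(called)          # AC: number of alt alleles observed
total_alleles = 2 * len(called)    # AN: two alleles per called diploid sample
af = alt_alleles / total_alleles

print("AC=%d  AN=%d  AF=%.3f" % (alt_alleles, total_alleles, af))
```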

In [None]:
afs = mt.aggregate_rows(hl.agg.hist(mt.info.AF[0], 0, 1.0, 50))

edges = afs.bin_edges
widths = [(edges[i+1] - edges[i]) * 0.9 for i in range(len(afs.bin_freq))]
mids = [(edges[i] + edges[i+1]) / 2 for i in range(len(afs.bin_freq))]

fig, ax = plt.subplots(figsize=(9, 4))
ax.bar(mids, afs.bin_freq, width=widths, color='steelblue', edgecolor='white', linewidth=0.5)
ax.set_xlabel('Allele Frequency (AF)', fontsize=12)
ax.set_ylabel('Number of Variants', fontsize=12)
ax.set_title('Allele Frequency Spectrum', fontsize=14)
ax.yaxis.set_major_formatter(ticker.FuncFormatter(lambda x, _: '%d' % int(x)))
plt.tight_layout()
plt.show()

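# Count rare variants directly: the histogram bins are 0.02 wide, so 0.05 is not
# a bin edge and summing bins cannot give an exact AF < 5% count.
rare_stats = mt.aggregate_rows(hl.struct(
    rare=hl.agg.count_where(mt.info.AF[0] < 0.05),
    total=hl.agg.count_where(hl.is_defined(mt.info.AF[0])),
))
rare, total = rare_stats.rare, rare_stats.total
print('Variants with AF < 5%%: %d (%.1f%%)' % (rare, 100 * rare / total if total else 0))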

### Variant Density by Chromosome

Longer chromosomes generally carry more variants. This plot shows the raw number of variants per chromosome for the ~10k variants in this subset; the counts are not normalized by chromosome length.
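
If you want a true density rather than a raw count, one option is to divide each count by the chromosome's length. The sketch below shows the arithmetic in plain Python. The variant counts are hypothetical, and the lengths are rounded, approximate GRCh37 sizes in megabases, so treat the output as an illustration of the calculation rather than a result for this dataset.

```python
# Toy normalization of variant counts by chromosome length (plain Python).
# Counts are hypothetical; lengths are rounded, approximate GRCh37 sizes in megabases.
variant_counts = {"1": 900, "21": 150, "22": 200}
approx_length_mb = {"1": 249, "21": 48, "22": 51}

# Variants per megabase for each chromosome.
density_per_mb = {
    chrom: variant_counts[chrom] / approx_length_mb[chrom] for chrom in variant_counts
}

for chrom, density in density_per_mb.items():
    print("chr%s: %d variants, %.2f per Mb" % (chrom, variant_counts[chrom], density))
```

In this made-up example chromosome 1 has by far the most variants, yet chromosome 22 has the higher density, which is why raw counts and densities can tell different stories.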

In [None]:
chrom_counts = mt.aggregate_rows(hl.agg.counter(mt.locus.contig))

def chrom_sort_key(c):
    c = c.replace('chr', '')
    try:
        return (0, int(c))
    except ValueError:
        return (1, c)

chroms = sorted(chrom_counts.keys(), key=chrom_sort_key)
cnts = [chrom_counts[c] for c in chroms]

fig, ax = plt.subplots(figsize=(13, 4))
ax.bar(range(len(chroms)), cnts, color='teal', edgecolor='white', linewidth=0.5)
ax.set_xticks(range(len(chroms)))
ax.set_xticklabels(chroms, rotation=45, ha='right', fontsize=9)
ax.set_xlabel('Chromosome', fontsize=12)
ax.set_ylabel('Number of Variants', fontsize=12)
ax.set_title('Variant Density by Chromosome', fontsize=14)
plt.tight_layout()
plt.show()

### Genotype Distribution

For each variant-sample pair, the genotype is:
- **0/0** — homozygous reference (both alleles are the reference)
- **0/1** — heterozygous (one ref, one alt allele)
- **1/1** — homozygous alternate

Because most variants are rare, the vast majority of entries should be homozygous reference.
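
A quick calculation shows why. Assuming Hardy-Weinberg equilibrium (an idealization that real data only approximates), a variant with alt allele frequency p is expected to have genotype proportions (1 - p)^2, 2p(1 - p), and p^2. The helper `hwe_genotype_proportions` below is a hypothetical name used only for this sketch.

```python
def hwe_genotype_proportions(p):
    """Expected (hom-ref, het, hom-alt) proportions for alt allele frequency p,
    assuming Hardy-Weinberg equilibrium."""
    q = 1.0 - p
    return q * q, 2 * p * q, p * p


# Compare a rare, a low-frequency, and a common variant.
for p in (0.01, 0.05, 0.5):
    hom_ref, het, hom_alt = hwe_genotype_proportions(p)
    print("AF=%.2f  hom-ref=%.4f  het=%.4f  hom-alt=%.4f" % (p, hom_ref, het, hom_alt))
```

At an allele frequency of 5%, about 90% of samples are expected to be homozygous reference and only about 0.25% homozygous alternate.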

In [None]:
gt_counts = mt.aggregate_entries(hl.agg.counter(mt.GT.n_alt_alleles()))

labels = {0: 'Hom Ref (0/0)', 1: 'Het (0/1)', 2: 'Hom Alt (1/1)'}
# Only keys 0, 1, and 2 are tallied; missing genotype calls are excluded from the totals.
values = [gt_counts.get(k, 0) for k in [0, 1, 2]]
total = sum(values)
colors = ['#4e9af1', '#f4a261', '#e76f51']

fig, axes = plt.subplots(1, 2, figsize=(12, 4))

ax = axes[0]
ax.bar([labels[k] for k in [0, 1, 2]], values, color=colors, edgecolor='white')
ax.set_ylabel('Count', fontsize=12)
ax.set_title('Genotype Counts', fontsize=14)
ax.yaxis.set_major_formatter(ticker.FuncFormatter(lambda x, _: '%d' % int(x)))

ax2 = axes[1]
ax2.pie(values, labels=[labels[k] for k in [0, 1, 2]], colors=colors,
        autopct='%1.1f%%', startangle=90)
ax2.set_title('Genotype Proportions', fontsize=14)

plt.suptitle('Genotype Distribution Across All Entries', fontsize=13)
plt.tight_layout()
plt.show()

for k in [0, 1, 2]:
    print('%s: %d (%.1f%%)' % (labels[k], gt_counts.get(k, 0), 100 * gt_counts.get(k, 0) / total if total else 0))

## Summary

In this notebook you:

- ✅ Installed Hail and initialized it in Colab
- ✅ Downloaded and imported the 1000 Genomes VCF into a Hail MatrixTable
- ✅ Learned the MatrixTable structure (rows/cols/entries)
- ✅ Joined sample phenotypes and population labels
- ✅ Visualized population composition, the allele frequency spectrum, Ti/Tv ratio, chromosome-level variant density, and genotype distribution

---

**Next:** Open [Part 2 — Quality Control](https://colab.research.google.com/github/gosborcz/winterschool-gwas-tutorial/blob/main/02_quality_control.ipynb) to filter low-quality samples and variants before running the GWAS.