# Variant Calling Workflow (chr20, NA12878)

Snakemake-based workflow for calling SNPs and indels on chromosome 20 of the NA12878 human genome sample.

Input: raw FASTQ  
Output: aligned BAM, compressed VCF, FastQC reports, and simple summary plots.

## Project layout
<pre>
2025WS_snv_workflow/
├── config/
│   └── <a href="config/config.yaml">config.yaml</a>
├── workflow/
│   ├── <a href="workflow/Snakefile">Snakefile</a>
│   └── rules/
│       ├── <a href="workflow/rules/qc.smk">qc.smk</a>
│       ├── <a href="workflow/rules/mapping.smk">mapping.smk</a>
│       └── <a href="workflow/rules/calling.smk">calling.smk</a>
├── envs/
│   └── <a href="envs/snv_env.yaml">snv_env.yaml</a>
├── scripts/
│   └── <a href="scripts/plot_variants.py">plot_variants.py</a>
├── data/
│   ├── raw/
│   ├── ref/
│   └── processed/
├── results/
│   ├── qc/
│   │   ├── <a href="results/qc/NA12878_fastqc.html">NA12878_fastqc.html</a>
│   │   └── <a href="results/qc/NA12878_fastqc.zip">NA12878_fastqc.zip</a>
│   ├── mapping/
│   │   ├── NA12878.bam
│   │   ├── NA12878.sorted.bam
│   │   └── NA12878.sorted.bam.bai
│   ├── variants/
│   │   ├── NA12878.vcf.gz
│   │   ├── NA12878.vcf.gz.csi
│   │   └── NA12878.bcf
│   └── plots/
│       ├── <a href="results/plots/chr20_variant_distribution.png">chr20_variant_distribution.png</a>
│       └── <a href="results/plots/variant_types_snp_vs_indel.png">variant_types_snp_vs_indel.png</a>
├── logs/
├── .gitignore
└── README.md
</pre>

## Key workflow files

Below you can expand the core files of the workflow.

<details>
<summary><code>config/config.yaml</code></summary>

```yaml
samples:
  - NA12878

ref:
  fasta: "data/ref/chr20.fa"

paths:
  raw: "data/raw"
  qc: "results/qc"
  mapping: "results/mapping"
  variants: "results/variants"
```

</details> <details>
<summary><code>workflow/Snakefile</code></summary>

```
configfile: "config/config.yaml"

SAMPLES = config["samples"]

# Include paths are resolved relative to the directory of this Snakefile (workflow/).
include: "rules/qc.smk"
include: "rules/mapping.smk"
include: "rules/calling.smk"

rule all:
    input:
        expand("results/qc/{sample}_fastqc.html", sample=SAMPLES),
        expand("results/mapping/{sample}.sorted.bam", sample=SAMPLES),
        expand("results/variants/{sample}.vcf.gz", sample=SAMPLES)
```

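To make the `rule all` targets concrete, here is a plain-Python sketch of what the three `expand(...)` calls evaluate to when there is a single `{sample}` wildcard. `expand_targets` is a hypothetical helper written only for illustration; Snakemake's own `expand` is more general (multiple wildcards, custom combinators).

```python
def expand_targets(samples):
    """Mimic the three expand() calls in `rule all` for a list of sample names."""
    # Same path patterns as in the Snakefile excerpt.
    patterns = [
        "results/qc/{sample}_fastqc.html",
        "results/mapping/{sample}.sorted.bam",
        "results/variants/{sample}.vcf.gz",
    ]
    # Each pattern is filled in for every sample, pattern by pattern.
    return [p.format(sample=s) for p in patterns for s in samples]


if __name__ == "__main__":
    for path in expand_targets(["NA12878"]):
        print(path)
```

With `samples: [NA12878]` from the config, this yields one FastQC report, one sorted BAM, and one compressed VCF, which is the set of files Snakemake works backwards from when building the DAG.
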
</details> <details> <summary>
<code>workflow/rules/mapping.smk</code></summary>

```
rule index_reference:
    input:  "data/ref/chr20.fa"
    output: "data/ref/chr20.fa.bwt"

rule bwa_map:
    input:
        fastq = "data/raw/{sample}.fastq.gz",
        ref   = "data/ref/chr20.fa"
    output:
        "results/mapping/{sample}.bam"
    threads: 4

rule sort_bam:
    input:  "results/mapping/{sample}.bam"
    output: "results/mapping/{sample}.sorted.bam"

rule index_bam:
    input:  "results/mapping/{sample}.sorted.bam"
    output: "results/mapping/{sample}.sorted.bam.bai"
```

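The excerpt above lists only inputs, outputs, and threads, so the command each rule runs is not shown. As a rough sketch, assuming conventional `bwa` and `samtools` invocations (an assumption, not taken from the repository), the four rules could correspond to the commands below. `mapping_commands` is a hypothetical helper that only builds the command strings; it does not execute anything.

```python
def mapping_commands(sample, ref="data/ref/chr20.fa", threads=4):
    """Return, in order, one assumed shell command per mapping rule."""
    raw = f"data/raw/{sample}.fastq.gz"
    bam = f"results/mapping/{sample}.bam"
    sorted_bam = f"results/mapping/{sample}.sorted.bam"
    return [
        # index_reference: builds the BWA index files (including .bwt) next to the FASTA
        f"bwa index {ref}",
        # bwa_map: align reads, then convert the SAM stream to BAM
        f"bwa mem -t {threads} {ref} {raw} | samtools view -b -o {bam} -",
        # sort_bam: coordinate-sort the alignments
        f"samtools sort -o {sorted_bam} {bam}",
        # index_bam: writes <sorted_bam>.bai by default
        f"samtools index {sorted_bam}",
    ]


if __name__ == "__main__":
    for cmd in mapping_commands("NA12878"):
        print(cmd)
```

Note that `bwa mem` needs the index to exist before it runs, so in a Snakemake rule the index file would typically be listed among the inputs of the alignment step.
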
</details> <details> <summary>
<code>workflow/rules/calling.smk</code></summary>

```
rule call_variants:
    input:
        bam = "results/mapping/{sample}.sorted.bam",
        bai = "results/mapping/{sample}.sorted.bam.bai",
        ref = "data/ref/chr20.fa"
    output:
        "results/variants/{sample}.vcf.gz"
    log:
        "logs/call_variants_{sample}.log"
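    # Parentheses group the pipeline so stderr from both bcftools steps reaches the log.
    shell:
        "(bcftools mpileup -f {input.ref} {input.bam} | "
        "bcftools call -mv -Oz -o {output}) > {log} 2>&1"
```

</details>
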
## Results (chr20, NA12878)

**Main outputs**

- Sorted and indexed BAM: `results/mapping/NA12878.sorted.bam`
- Compressed VCF: `results/variants/NA12878.vcf.gz`
- Workflow DAG: `results/dag.png`
- Summary plots in `results/plots/`

<div>
  <b>DAG</b><br>
  <img src="results/dag.png" height="260">
</div><br><br><br>

<div style="display:flex; gap:30px; align-items:flex-end;">

  <div>
    <b>Variant distribution along chr20</b><br>
    <img src="results/plots/chr20_variant_distribution.png" height="260">
  </div>

  <div>
    <b>Variant types (SNP vs INDEL)</b><br>
    <img src="results/plots/variant_types_snp_vs_indel.png" height="260">
  </div>

</div>

**Short interpretation**

- Variants are spread across the whole of chr20; local peaks and dips in density likely correspond to repetitive or low-mappability regions.
- SNPs far outnumber indels, which is the expected pattern for a human germline genome.
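
Both observations can be checked directly against the VCF. The sketch below is not the repository's `scripts/plot_variants.py` (its contents are not shown here); it is a minimal illustration under stated assumptions: records are tab-separated VCF lines, an allele pair counts as a SNP when REF and ALT are both one base, as an indel when their lengths differ, and anything else (symbolic alleles, same-length multi-base changes) as `OTHER`. The function names and the 1 Mb default bin size are hypothetical choices.

```python
from collections import Counter


def classify(ref, alt):
    """Label one REF/ALT pair as 'SNP', 'INDEL', or 'OTHER'."""
    if alt.startswith("<") or alt == "*":
        return "OTHER"  # symbolic or spanning-deletion allele
    if len(ref) == 1 and len(alt) == 1:
        return "SNP"
    if len(ref) != len(alt):
        return "INDEL"
    return "OTHER"  # e.g. same-length multi-base substitution


def summarize_vcf(lines, bin_size=1_000_000):
    """Count variant types per ALT allele and records per position bin."""
    type_counts = Counter()
    bin_counts = Counter()
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header lines
        fields = line.rstrip("\n").split("\t")
        pos, ref, alts = int(fields[1]), fields[3], fields[4]
        # VCF positions are 1-based; bin 0 covers positions 1..bin_size.
        bin_counts[(pos - 1) // bin_size] += 1
        for alt in alts.split(","):
            type_counts[classify(ref, alt)] += 1
    return type_counts, bin_counts
```

For the real output, the lines could be read with `gzip.open("results/variants/NA12878.vcf.gz", "rt")`, since bgzip-compressed files are readable as gzip. The type counts correspond to the SNP vs INDEL plot, and the per-bin counts to the distribution along chr20.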
