# WGS GVCF samples joint calling, filtering and annotation

Implementing a GATK + ANNOVAR workflow in [SoS](https://github.com/vatlab/SOS), written by Isabelle Schrauwen with software containers built by Gao Wang. 

In [1]:
%revisions -s -n 10

| Revision | Author | Date | Message |
|---|---|---|---|
| 4762450 | haoyueshuai | 2020-08-25 | update submit_csg |
| b7958f9 | Gao Wang | 2020-08-25 | Fix a bash variable bug |
| 2986c3c | Gao Wang | 2020-08-25 | Update documentation |
| c1da803 | Gao Wang | 2020-08-25 | Add job submission template for CSG cluster |
| 69e450a | Gao Wang | 2020-08-24 | Add documentation |
| 3b5a1e8 | Gao Wang | 2020-08-24 | Fix ANNOVAR step |
| 43b3150 | Gao Wang | 2020-08-23 | Update joint variant calling pipeline with minimal working example |
| 2cacdc2 | Gao Wang | 2020-08-20 | Remove the need to mount workdir due to recent changes in SoS |
| afe343c | Gao Wang | 2020-08-20 | Add variant calling pipeline |


## Overview

This SoS workflow notebook contains four workflows:

- `call`
- `filter`
- `annovar`
- `submit_csg`

The first three workflows are for the analysis and the last one is for submitting jobs on the cluster.

All workflow steps are numerically ordered to reflect the execution logic. This is the most straightforward SoS workflow style, the "process-oriented" style. 

## Input data

Samples in `GVCF` format, already indexed:

```
*.gvcf.gz
*.gvcf.gz.tbi
```

To pass the list of samples to the workflow, list the file names of all samples you would like to analyze in a text file, one per line. For example:

```
GH.AR.SAD.P1.001.0_X3547_S42_1180478_GVCF.hard-filtered.gvcf.gz
GH.AR.SAD.P1.003.0_92455_S43_1189700_GVCF.hard-filtered.gvcf.gz
GH.AR.SAD.P1.004.0_92456_S44_1189701_GVCF.hard-filtered.gvcf.gz
GH.AR.SAD.P1.005.0_92457_S20_1189702_GVCF.hard-filtered.gvcf.gz
...
```

and save it as, e.g., `20200820_sample_manifest.txt`. This text file will be the input file to the pipeline.
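
A manifest like this can be generated rather than typed by hand. The sketch below is one possible way to do it and is not part of the workflow; the function name `write_manifest` and its arguments are illustrative assumptions.

```python
from pathlib import Path


def write_manifest(samples_dir, manifest_path, suffix=".gvcf.gz"):
    """Write the base names of all GVCF files in samples_dir, one per line."""
    # Only files ending in .gvcf.gz are listed, so the .tbi indexes are skipped.
    names = sorted(
        p.name for p in Path(samples_dir).iterdir()
        if p.is_file() and p.name.endswith(suffix)
    )
    if not names:
        raise FileNotFoundError(f"No *{suffix} files found in {samples_dir}")
    Path(manifest_path).write_text("\n".join(names) + "\n")
    return names
```

For instance, `write_manifest("/path/to/sample_gvcf", "20200820_sample_manifest.txt")` would list every `*.gvcf.gz` file found in that directory.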

## Reference data preparation

Human genome reference files are needed for `GATK` joint calling; `ANNOVAR` database references are needed for `ANNOVAR` annotations.

- `GATK` reference files include:

```
*.fa
*.fa.fai
*.dict
```

- `ANNOVAR` reference files ship with `ANNOVAR` software, under a folder called `humandb`.

This workflow assumes that the required files already exist. It does not provide steps to download or generate them automatically; instructions for that can be found in the tutorial slides. The pipeline does, however, check that the reference files are available and quits with an error if any are missing.
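
To check the `GATK` reference files before launching a run, something along the following lines may help. It is a minimal sketch that mirrors the three file patterns listed above (the `.dict` file replaces the `.fa` extension rather than being appended to it); `missing_reference_files` is a hypothetical helper, not a function of the workflow.

```python
from pathlib import Path


def missing_reference_files(ref_genome):
    """Return the GATK reference files that cannot be found for a FASTA path."""
    fasta = Path(ref_genome)
    required = [
        fasta,                       # reference.fa
        Path(str(fasta) + ".fai"),   # reference.fa.fai
        fasta.with_suffix(".dict"),  # reference.dict
    ]
    return [str(p) for p in required if not p.is_file()]
```

An empty list means all three files are in place.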

## Run the workflow

The workflow is currently designed to run on a Linux cluster (via `singularity`), although it can also be executed on a Mac computer (via `docker`). In brief, after installing [SoS](https://github.com/vatlab/SOS) (also see the section "Software configuration" below), you can choose which workflow modules to run.

For example to run the variant calling workflow,

```
sos run gatk_joint_calling.ipynb call \
    --vcf-prefix /path/to/some_vcf_file_prefix \
    --samples /path/to/list/of/sample_gvcf.txt \
    --samples-dir /path/to/sample_gvcf \
    --ref-genome /path/to/reference_genome.fa \
    ...
```

to run variant filtering, 


```
sos run gatk_joint_calling.ipynb filter \
    --vcf-prefix /path/to/some_vcf_file_prefix \
    ...
```

to run annotation,

```
sos run gatk_joint_calling.ipynb annovar \
    --vcf-prefix /path/to/some_vcf_file_prefix \
    ...
```

You can put all three commands in one bash script and execute it to run the steps one after another.

Note that `...` are additional options that fall into two categories:

1. Options needed to run the bioinformatics steps (e.g. `--ref-genome`)
2. Options needed for SoS to run on different platforms (e.g. `--container-option`)

To view all options,

In [2]:
sos run gatk_joint_calling.ipynb -h

usage: sos run gatk_joint_calling.ipynb
               [workflow_name | -t targets] [options] [workflow_options]
  workflow_name:        Single or combined workflows defined in this script
  targets:              One or more targets to generate
  options:              Single-hyphen sos parameters (see "sos run -h" for details)
  workflow_options:     Double-hyphen workflow-specific parameters

Workflows:
  call
  filter
  annovar
  submit_csg

Global Workflow Options:
  --vcf-prefix joint_call_output (as path)
                        Combined VCF file prefix, including path to the output
                        but without vcf.gz extension, eg
                        "/path/to/output_filename".
  --build hg19
                        Human genome build
  --mem 12 (as int)
                        Memory allocated to a job, in terms of Gigabyte
  --container-option 'gaow/gatk4-annovar'
                        Software container option

Sections
  call_1:               Combine GVCF files
 

Please read these options carefully before you start running the analysis.

## Minimal working example

A minimal example data-set can be found on the CSG cluster. The following commands use this data-set; in practice, change the paths to point to your own data.

Joint calling:

```
sos run gatk_joint_calling.ipynb call \
    --container-option /mnt/mfs/statgen/containers/gatk4-annovar.sif \
    --vcf-prefix output/minimal_example \
    --samples /mnt/mfs/statgen/data_private/gatk_joint_call_example/20200820_sample_manifest.txt \
    --samples-dir /mnt/mfs/statgen/data_private/gatk_joint_call_example/ \
    --ref-genome /mnt/mfs/statgen/isabelle/REF/refs/Homo_sapiens.GRCh37.75.dna_sm.primary_assembly.fa
```

Filtering:

```
sos run gatk_joint_calling.ipynb filter \
    --container-option /mnt/mfs/statgen/containers/gatk4-annovar.sif \
    --vcf-prefix output/minimal_example
```

Annotating:

```
sos run gatk_joint_calling.ipynb annovar \
    --container-option /mnt/mfs/statgen/containers/gatk4-annovar.sif \
    --vcf-prefix output/minimal_example.snp_indel.filter.PASS \
    --keep "splic|exonic" \
    --humandb /mnt/mfs/statgen/isabelle/REF/humandb \
    --x-ref /mnt/mfs/statgen/isabelle/REF/humandb/mart_export_2019_LOFtools3.txt
```

## Share the workflow with people

Use 

```
sos convert gatk_joint_calling.ipynb gatk_joint_calling.html --template sos-cm-toc
```

to convert this workflow to an HTML file that you can share with others.

## Software configuration

Instructions on SoS and docker installation can be found on [our CSG wiki](http://statgen.us/lab-wiki/orientation/jupyter-setup.html). 
The instructions work for both Mac and Linux, unless otherwise specified.

## Global parameter settings

In [3]:
[global]
# Combined VCF file prefix, including path to the output but without vcf.gz extension, 
# eg "/path/to/output_filename".
parameter: vcf_prefix = path('joint_call_output')
# Human genome build
parameter: build = 'hg19'
# Memory allocated to a job, in terms of Gigabyte
parameter: mem=12
# Software container option
parameter: container_option = 'gaow/gatk4-annovar'

## Joint variant calling from GVCF files

In [9]:
# Combine GVCF files
[call_1]
# A file listing all sample GVCF files you would like to analyze.
# Each line is one sample GVCF name.
parameter: samples = path
# Directory where the sample GVCF files are located.
parameter: samples_dir = path()
# Path to reference genome file
parameter: ref_genome = path('refs/Homo_sapiens.GRCh37.75.dna_sm.primary_assembly.fa')
#
fail_if(not samples.is_file(), msg = 'Need valid sample name list file input via ``--samples`` option!')
import os
# Skip blank lines in the manifest so they do not turn into bogus file paths
sample_files = [f'{samples_dir}/{os.path.basename(x.strip())}' for x in open(samples).readlines() if x.strip()]
for x in sample_files:
    fail_if(not path(x).is_file(), msg = f'Cannot find file ``{x}``. Please use ``--samples-dir`` option to specify the directory for sample files.')
    fail_if(not x.endswith('gvcf.gz'), msg = f'Input file ``{x}`` does not have ``.gvcf.gz`` extension.')
fail_if(len(sample_files) == 0, msg = 'Need at least one input sample file!')
fail_if(not ref_genome.is_file(), msg = f'Cannot find reference genome ``{ref_genome}``. Please use ``--ref-genome`` option to specify it.')
fail_if(not path(f"{ref_genome:a}.fai").is_file(), msg = f'Cannot find reference genome index file ``{ref_genome}.fai``. Please make sure it exists.')
fail_if(not path(f"{ref_genome:an}.dict").is_file(), msg = f'Cannot find reference genome dict file ``{ref_genome:n}.dict``. Please make sure it exists.')

depends: system_resource(mem = f'{mem}G'), ref_genome
input: sample_files
output: f'{vcf_prefix:a}.combined.vcf.gz'

bash: container=container_option, volumes=[f'{ref_genome:ad}:{ref_genome:ad}'], expand="${ }", stderr=f'{_output:n}.err', stdout=f'{_output:n}.out'
    ${'&&'.join(["tabix -p vcf %s" % x for x in _input if not path(x + '.tbi').is_file()])}
    gatk --java-options "-Xmx${mem}g" CombineGVCFs \
        -R ${ref_genome} \
        ${' '.join(['--variant %s' % x for x in _input])} \
        -O ${_output}
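
The way this step resolves sample paths and assembles the `CombineGVCFs` call can be illustrated outside SoS. The sketch below is only an approximation of the logic above: `resolve_samples` and `combine_gvcfs_command` are hypothetical names, blank manifest lines are skipped in this sketch, and nothing is executed.

```python
import os
import shlex


def resolve_samples(manifest_lines, samples_dir):
    """Map each manifest entry to a path inside samples_dir (base name only)."""
    return [
        f"{samples_dir}/{os.path.basename(line.strip())}"
        for line in manifest_lines
        if line.strip()
    ]


def combine_gvcfs_command(sample_files, ref_genome, output, mem=12):
    """Assemble the CombineGVCFs call as an argument list, one --variant per sample."""
    cmd = ["gatk", "--java-options", f"-Xmx{mem}g", "CombineGVCFs", "-R", str(ref_genome)]
    for gvcf in sample_files:
        cmd += ["--variant", str(gvcf)]
    cmd += ["-O", str(output)]
    return cmd


files = resolve_samples(["s1.gvcf.gz\n", "s2.gvcf.gz\n"], "/path/to/sample_gvcf")
print(shlex.join(combine_gvcfs_command(files, "reference_genome.fa", "joint.combined.vcf.gz")))
```

Because only the base name of each manifest entry is kept, any directory written in the manifest is ignored in favor of `--samples-dir`.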

In [6]:
# Joint calling
[call_2]
# Path to reference genome file
parameter: ref_genome = path
output: f'{vcf_prefix:a}.vcf.gz'


bash: container=container_option, volumes=[f'{ref_genome:ad}:{ref_genome:ad}'], expand="${ }", stderr=f'{_output:nn}.err', stdout=f'{_output:nn}.out'
    gatk --java-options "-Xmx${mem}g" GenotypeGVCFs \
        -R ${ref_genome} \
        -V ${_input} \
        -O ${_output}

## Variant filtering

Since we have two types of variants, SNPs and indels, the first two steps of the filter workflow process the two variant types in parallel. Steps 3 and 4 then merge them and apply additional filtering.

In [10]:
# Split into SNP and INDEL for separate PASS filters
[filter_1]
variant_type = ['SNP', 'INDEL']
input: f'{vcf_prefix:a}.vcf.gz', for_each='variant_type', concurrent = True
output: f'{vcf_prefix:a}.{_variant_type.lower()}.vcf.gz'


bash: container=container_option, expand="${ }", stderr=f'{_output:nn}.err', stdout=f'{_output:nn}.out'
    gatk --java-options '-Xmx${mem}g' SelectVariants \
        -V ${_input} \
        -select-type ${_variant_type} \
        -O ${_output}

In [10]:
# PASS or filter for indels and SNPs (Note | not recommended for filters)
# Ignore MQRankSum warnings <- can only be calculated for het sites (not homs)
[filter_2]
parameter: snp_filters = ['QD < 2.0, QD2', 'QUAL < 30.0, QUAL30', 'SOR > 3.0, SOR3', 'FS > 60.0, FS60', 'MQ < 40.0, MQ40', 'MQRankSum < -12.5, MQRankSum-12.5', 'ReadPosRankSum < -8.0, ReadPosRankSum-8']
parameter: indel_filters = ["QD < 2.0, QD2", "QUAL < 30.0, QUAL30", "FS > 200.0, FS200", "ReadPosRankSum < -20.0, ReadPosRankSum-20"]
input: paired_with = dict(filter_option=[snp_filters, indel_filters])
output: f'{_input:nn}.filter.vcf.gz'


bash: container=container_option, expand="${ }", stderr=f'{_output:nn}.err', stdout=f'{_output:nn}.out'
    gatk --java-options '-Xmx${mem}g' VariantFiltration \
        -V ${_input} \
        ${" ".join(['-filter "%s" --filter-name "%s"' % tuple([y.strip() for y in x.split(',')]) for x in _input.filter_option])} \
        -O ${_output}
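
Each entry in `snp_filters` and `indel_filters` is a single string of the form `"EXPRESSION, NAME"`, which the step splits on the comma. The following sketch reproduces that expansion in plain Python so the resulting arguments can be inspected; `filter_arguments` is a hypothetical helper name.

```python
def filter_arguments(filters):
    """Turn 'EXPRESSION, NAME' strings into VariantFiltration arguments."""
    args = []
    for item in filters:
        # Exactly one comma is expected: expression on the left, name on the right.
        expression, name = (part.strip() for part in item.split(","))
        args.append(f'-filter "{expression}" --filter-name "{name}"')
    return " ".join(args)


print(filter_arguments(["QD < 2.0, QD2", "QUAL < 30.0, QUAL30"]))
# -filter "QD < 2.0" --filter-name "QD2" -filter "QUAL < 30.0" --filter-name "QUAL30"
```

A custom filter passed on the command line therefore needs exactly one comma; an expression that itself contains a comma would not split cleanly.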

In [None]:
# Merge back SNP and INDEL
[filter_3]
input: group_by = 'all'
output: f'{vcf_prefix:a}.snp_indel.filter.vcf.gz'


bash: container=container_option, expand="${ }", stderr=f'{_output:nn}.err', stdout=f'{_output:nn}.out'
    gatk --java-options '-Xmx${mem}g' MergeVcfs \
     -I ${_input[0]} -I ${_input[1]} -O ${_output}

In [None]:
# remove non-PASS variants if wanted
[filter_4]
output: f'{vcf_prefix:a}.snp_indel.filter.PASS.vcf.gz'


bash: container=container_option, expand="${ }", stderr=f'{_output:nn}.err', stdout=f'{_output:nn}.out'
    gatk --java-options '-Xmx${mem}g' SelectVariants \
        -V ${_input} -O ${_output} \
        --exclude-filtered
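
To keep track of what the four filter steps write, the file names can be derived from `--vcf-prefix` alone. The sketch below simply restates the `output:` patterns of the steps above (without the absolute-path conversion the workflow applies); `filter_outputs` is a hypothetical helper.

```python
def filter_outputs(vcf_prefix):
    """List the joint-call VCF and the files written by filter_1 to filter_4."""
    types = ("snp", "indel")
    return [
        f"{vcf_prefix}.vcf.gz",                                 # input from call_2
        *[f"{vcf_prefix}.{t}.vcf.gz" for t in types],           # filter_1
        *[f"{vcf_prefix}.{t}.filter.vcf.gz" for t in types],    # filter_2
        f"{vcf_prefix}.snp_indel.filter.vcf.gz",                # filter_3
        f"{vcf_prefix}.snp_indel.filter.PASS.vcf.gz",           # filter_4
    ]


for name in filter_outputs("output/minimal_example"):
    print(name)
```

The last entry explains why the annotation example uses `--vcf-prefix output/minimal_example.snp_indel.filter.PASS`: it is the filter_4 output without the `.vcf.gz` extension.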

## Annotation

In [None]:
# convert vcf to annovar input format
[annovar_1]
input: f'{vcf_prefix:a}.vcf.gz'
output: f'{_input:nn}.avinput'

bash: container=container_option,expand="${ }", stderr=f'{_output[0]:n}.err', stdout=f'{_output[0]:n}.out'

    convert2annovar.pl \
        -includeinfo \
        -allsample \
        -withfreq \
        -format vcf4 ${_input} > ${_output[0]} 

In [11]:
# Annotate 
[annovar_2]
# humandb path for ANNOVAR
parameter: humandb = path
#add xreffile to option without -exonicsplicing
#mart_export_2019_LOFtools3.txt #xreffile latest option -> Phenotype description,HGNC symbol,MIM morbid description,CGD_CONDITION,CGD_inh,CGD_man,CGD_comm,LOF_tools
parameter: x_ref = path(f"{humandb}/mart_export_2021_LOFtools.txt")
# Annovar protocol
parameter: protocol = ['refGene', 'refGeneWithVer', 'knownGene', 'ensGene', 'wgEncodeBroadHmmGm12878HMM', 'wgEncodeBroadHmmHmecHMM', 'wgEncodeBroadHmmHepg2HMM', 'wgEncodeBroadHmmH1hescHMM', 'wgEncodeRegDnaseClusteredV3', 'wgEncodeRegTfbsClusteredV3', 'genomicSuperDups', 'wgRna', 'targetScanS', 'phastConsElements46way', 'tfbsConsSites', 'gwasCatalog', 'gnomad211_genome', 'gnomad211_exome', 'popfreq_max_20150413', 'gme', 'kaviar_20150923', 'abraom', 'avsnp150', 'dbnsfp41a', 'dbscsnv11', 'regsnpintron', 'cadd13gt20', 'clinvar_20210123', 'mcap13', 'gene4denovo201907']
# Annovar operation
parameter: operation = ['g', 'g', 'g', 'gx', 'r', 'r', 'r', 'r', 'r', 'r', 'r', 'r', 'r', 'r', 'r', 'r', 'f', 'f', 'f', 'f', 'f', 'f', 'f', 'f', 'f', 'f', 'f', 'f', 'f', 'f']
# Annovar args
parameter: arg = ['"-splicing 12 -exonicsplicing"', '"-splicing 30"', '"-splicing 12 -exonicsplicing"', '"-splicing 12"', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '']
input: f'{vcf_prefix:a}.avinput'
output: f'{vcf_prefix:a}.{build}_multianno.txt'

bash: container=container_option, volumes=[f'{humandb:a}:{humandb:a}', f'{x_ref:ad}:{x_ref:ad}'], expand="${ }", stderr=f'{vcf_prefix:a}.err', stdout=f'{vcf_prefix:a}.out'
    #do not add -intronhgvs as option -> writes cDNA variants as HGVS but creates issues (+2 splice site reported only)
    #-nastring . can only be . for VCF files
    #regsnpintron might cause shifted lines (use with care)
    table_annovar.pl \
        ${_input} \
        ${humandb} \
        -buildver ${build} \
        -out ${vcf_prefix:a} \
        -remove \
        -polish \
        -nastring . \
        -protocol ${",".join(protocol)} \
        -operation ${",".join(operation)} \
        -arg ${",".join(arg)} \
        -xreffile ${x_ref}
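
The `protocol`, `operation` and `arg` lists are joined with commas and matched by position, so they need the same number of entries when you customize them. A small check like the one below can catch a mismatch before the job is submitted. It is a sketch using a shortened, illustrative subset of the protocols above, and `annovar_table_options` is a hypothetical helper.

```python
def annovar_table_options(protocol, operation, arg):
    """Check that the three lists line up, then join them for table_annovar.pl."""
    if not (len(protocol) == len(operation) == len(arg)):
        raise ValueError(
            f"protocol ({len(protocol)}), operation ({len(operation)}) and "
            f"arg ({len(arg)}) must have the same number of entries"
        )
    return {
        "-protocol": ",".join(protocol),
        "-operation": ",".join(operation),
        "-arg": ",".join(arg),
    }


opts = annovar_table_options(
    ["refGene", "genomicSuperDups", "gnomad211_exome"],
    ["g", "r", "f"],
    ['"-splicing 12 -exonicsplicing"', "", ""],
)
print(opts["-arg"])  # "-splicing 12 -exonicsplicing",,
```

Empty strings in `arg` still occupy a position, which is why the joined value ends in bare commas.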

The step below produces a filtered subset of the annotation results. To run your own annotation or filtering instead, you can call `ANNOVAR` from the singularity image directly, for example:

```
singularity exec  /mnt/mfs/statgen/containers/gatk4-annovar.sif annotate_variation.pl \
    -filter -dbtype gnomad211_exome \
    -build hg19 \
    -score_threshold 0.005 \
    minimal_example.snp_indel.filter.PASS.hg19_multianno.exonic_splic.txt \
    humandb \
    -out minimal_example.snp_indel.filter.PASS.hg19_multianno.exonic_splic
```

In [12]:
# Filter out common variants (from 3 databases) with annovar 
[annovar_3]
# humandb path for ANNOVAR
parameter: humandb = path("humandb/")
# keep pathogenic: use 'pathogenic|Pathogenic',
# keep splice_exonic: use 'splic|exonic'
parameter: keep="splic|exonic"
tag = '_'.join(sorted(set(keep.lower().split('|'))))
# Use the annotation table written by annovar_2
input: f'{vcf_prefix:a}.{build}_multianno.txt'
output: f'{_input[0]:n}.{tag}.txt', 
        f'{_input[0]:n}.{tag}.exome_genome.{build}_popfreq_max_20150413_filtered'

bash: container=container_option, volumes=[f'{humandb:a}:{humandb:a}'], expand="${ }", stderr=f'{_output[0]:n}.err', stdout=f'{_output[0]:n}.out'
    set -e
    awk 'FNR == 1 {print} /${keep}/{print}' ${_input[0]}  > ${_output[0]}
    
    annotate_variation.pl -filter -dbtype gnomad211_exome \
        -build ${build} \
        -score_threshold 0.005 \
        ${_output[0]} \
        ${humandb} \
        -out ${_output[0]:n}
    
    annotate_variation.pl -filter -dbtype gnomad211_genome \
        -build ${build} \
        -score_threshold 0.005 \
        ${_output[0]:n}.${build}_gnomad211_exome_filtered \
        ${humandb} \
        -out ${_output[0]:n}.exome

    annotate_variation.pl -filter -dbtype popfreq_max_20150413 \
        -build ${build} \
        -score_threshold 0.005 \
        ${_output[0]:n}.exome.${build}_gnomad211_genome_filtered \
        ${humandb} \
        -out ${_output[0]:n}.exome_genome
    rm ${_output[0]:nn}*_dropped
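
The `--keep` pattern does two things in this step: it selects lines of the annotation table, and it determines the tag in the output file name. The sketch below imitates both in plain Python for simple alternation patterns such as `splic|exonic`. `keep_tag` and `keep_lines` are hypothetical names, and unlike the `awk` command the sketch never prints the header twice.

```python
import re


def keep_tag(keep):
    """Build the file-name tag: lower-cased, de-duplicated, sorted, joined by '_'."""
    return "_".join(sorted(set(keep.lower().split("|"))))


def keep_lines(lines, keep):
    """Keep the header (first line) plus every line matching the keep pattern."""
    pattern = re.compile(keep)
    return [line for i, line in enumerate(lines) if i == 0 or pattern.search(line)]


print(keep_tag("splic|exonic"))           # exonic_splic
print(keep_tag("pathogenic|Pathogenic"))  # pathogenic
```

This is why the file in the standalone example above carries the `exonic_splic` tag even though the option was given as `splic|exonic`.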

## Submit jobs to the cluster

Suppose we would like to submit these lines of commands to the cluster:

```
sos run gatk_joint_calling.ipynb call \
    --container-option /mnt/mfs/statgen/containers/gatk4-annovar.sif \
    --vcf-prefix output/minimal_example \
    --samples /mnt/mfs/statgen/data_private/gatk_joint_call_example/20200820_sample_manifest.txt \
    --samples-dir /mnt/mfs/statgen/data_private/gatk_joint_call_example/ \
    --ref-genome /mnt/mfs/statgen/isabelle/REF/refs/Homo_sapiens.GRCh37.75.dna_sm.primary_assembly.fa

sos run gatk_joint_calling.ipynb filter \
    --container-option /mnt/mfs/statgen/containers/gatk4-annovar.sif \
    --vcf-prefix output/minimal_example

sos run gatk_joint_calling.ipynb annovar \
    --container-option /mnt/mfs/statgen/containers/gatk4-annovar.sif \
    --vcf-prefix output/minimal_example.snp_indel.filter.PASS \
    --keep "splic|exonic" \
    --humandb /mnt/mfs/statgen/isabelle/REF/humandb \
    --x-ref /mnt/mfs/statgen/isabelle/REF/humandb/mart_export_2019_LOFtools3.txt
    
    
```

First, we save the above lines to a text file, e.g. call it `analysis_commands_20200825.txt`, then use the following workflow steps to allocate resources and submit the jobs.

Example to submit a job:

```
sos run gatk_joint_calling.ipynb submit_csg \
    --cmd_file analysis_commands_20200825.txt
```


To run in dry-run mode, which prints the job script instead of submitting it:

```
sos run gatk_joint_calling.ipynb submit_csg \
    --cmd_file analysis_commands_20200825.txt \
    --dryrun True
```

In [None]:
# Job submission on CSG cluster
[submit_csg]
# Path to job file
parameter: cmd_file=path
# Total run time allocated to the script
parameter: time='36:00:00'
parameter: dryrun = False
input: cmd_file
python3: expand = '$[ ]'
    tpl = '''
    #!/bin/sh
    #$ -l h_rt=$[time]
    #$ -l h_vmem=$[mem+6]G
    #$ -N gatk_joint_call
    #$ -cwd
    #$ -j y
    #$ -S /bin/bash
    module load Singularity
    export PATH=$HOME/miniconda3/bin:$PATH
    set -e
    '''
    script = tpl.lstrip() + ''.join(open($[_input:r]).readlines())
    exe = 'cat' if $[dryrun] else 'qsub'
    from subprocess import Popen, PIPE
    import sys
    p = Popen(exe, shell = False, stdin = PIPE, stdout = PIPE, stderr = PIPE, close_fds = True)
    for item in p.communicate(script.encode(sys.getdefaultencoding())):
        output = item.decode(sys.getdefaultencoding()).rstrip()
        if output:
            print(output)