# NGS: SNP calling, genotype calling and imputation

We will work on 33 European samples where we have reduced the genome so there is very little data. 

We will be looking at the default behaviour for 3 different programs.

| tool | SAMtools/bcftools | GATK | ANGSD |
| --- | --- | --- | --- |
| Genotype likelihood | tries to model error dependencies | simple | user specified |
| SNP caller | uses SFS as prior | uses SFS as prior | likelihood ratio test |
| Genotype caller | MAF | ML | MAF prior |

<br>
All three programs have additional filters and differ in further ways.

## Setup

We first set up the environment variables that help run the software later.

In [None]:
COURSE_PATH=/course/popgen25
DATA_PATH=${COURSE_PATH}/Imputation
SOFTWARE_PATH=${COURSE_PATH}/software

## shared tools and data folder
#TOOL_PATH=/course/popgen25/software 
#RESOURCE_PATH=/course/popgen25/Imputation/resources
#INPUT_PATH=/course/popgen25/Imputation  # for input data

# go to working folder
mkdir -p ~/popgen25_imputation
cd ~/popgen25_imputation

ln -sfn ~/popgen25_imputation ~/current_folder
ln -sfn ${DATA_PATH} ~/data_folder

OUT_DIR=~/popgen25_imputation

If the environment is set up correctly, we should have access to all files and tools.

In [None]:
# check environment
which angsd
which samtools
which bcftools
which vcftools

# Additional tools
BEAGLE4=${SOFTWARE_PATH}/beagle.27Jan18.7e1.jar
BEAGLE5=${SOFTWARE_PATH}/beagle.27Feb25.75f.jar
QUILT2=${SOFTWARE_PATH}/QUILT/QUILT2.R

# Reference data
QUILT2_MAP=~/data_folder/resources/CEU-chr20-final.b38.txt.gz
BEAGLE5_MAP=~/data_folder/resources/plink.chr20.GRCh38.rename.map
REF_GENOME=~/data_folder/resources/GRCh38_full_analysis_set_plus_decoy_hla.fa

REF_VCF=~/data_folder/vcfs/CEU_ref_set.chr20.vcf.gz
TRUE_VCF=~/data_folder/vcfs/CEU_true_set.chr20.vcf.gz
EXAMPLE_VCF=~/data_folder/vcfs/example.vcf.gz
QUILT2_EXAMPLE=~/data_folder/vcfs/quilt2_all_inds.vcf.gz

# Input data
SAM_FOLDER=~/data_folder/bams
FAKE_SNPCHIP=~/data_folder/vcfs/CEU_fake_chip.chr20.vcf.gz

# you could run this to check whether files exist
ls ${BEAGLE4} ${BEAGLE5} ${BEAGLE5_MAP} ${QUILT2} ${QUILT2_MAP} ${REF_GENOME}* ${TRUE_VCF} ${REF_VCF} ${QUILT2_EXAMPLE} ${FAKE_SNPCHIP} ${SAM_FOLDER}

## Input formats and data preparation

### Input Reference panel
IMPUTE-format hap and legend files containing the reference haplotypes. These can be made from haplotype VCFs using bcftools convert --haplegendsample, or built manually. The haplotype file is a gzipped, space-separated file with no header and no row names, with one row per SNP and one column per reference haplotype, holding values of 0 (ref) and 1 (alt). The legend file is a gzipped file with no row names and a header line that includes position for the physical position in 1-based coordinates, a0 for the reference allele, and a1 for the alternate allele. An optional sample file, and a file listing samples to exclude, can be used to change who is included in the reference panel.
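
To make the layout concrete, here is a minimal toy sketch that writes a two-SNP, four-haplotype panel in this layout and checks that the haplotype file has one row per legend entry. The positions, alleles, file names and the extra `id` column are made up for illustration and are not course data.

```shell
# Toy reference panel in hap/legend layout (made-up values, hypothetical file names).
panel_dir=$(mktemp -d)

# Legend: header line, then one row per SNP (1-based position, ref allele, alt allele).
printf 'id position a0 a1\nsnp1 1000 A G\nsnp2 2000 C T\n' | gzip > "${panel_dir}/toy.legend.gz"

# Haplotypes: no header, one row per SNP, one space-separated column per haplotype.
printf '0 1 0 0\n1 1 0 1\n' | gzip > "${panel_dir}/toy.hap.gz"

# The legend has exactly one more line (its header) than the haplotype file.
n_legend=$(gzip -dc "${panel_dir}/toy.legend.gz" | wc -l | tr -d ' ')
n_hap=$(gzip -dc "${panel_dir}/toy.hap.gz" | wc -l | tr -d ' ')
n_hap_cols=$(gzip -dc "${panel_dir}/toy.hap.gz" | awk 'NR == 1 { print NF }')
echo "legend rows: $((n_legend - 1)), hap rows: ${n_hap}, haplotypes: ${n_hap_cols}"
```

A row-count mismatch between the two files is a quick sign that the panel was not exported consistently.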

### Genetic map 
This is sometimes referred to as a recombination map. It is a file with genetic map information, with 3 whitespace-delimited columns giving position (1-based), recombination rate in cM/Mbp, and genetic map position in cM.
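
The sketch below shows how the third column can relate to the second, using a toy map with made-up values. It assumes that the rate on each row applies from that row's position up to the next one, which is a common convention but should be checked against the map you actually use. The header names are illustrative.

```shell
# Toy genetic map: position, rate (cM/Mbp), cumulative map (cM). Values are made up.
map_file=$(mktemp)
printf 'position rate_cM_per_Mb genetic_map_cM\n1000000 1.0 0\n2000000 2.0 1\n2500000 0 2\n' > "${map_file}"

# Recompute the cumulative cM column, assuming each row's rate applies
# from that row's position up to the next position.
recomputed=$(awk 'NR == 2 { cm = 0; printf "%.4f", cm }
                  NR > 2  { cm += prev_rate * ($1 - prev_pos) / 1e6; printf " %.4f", cm }
                  NR >= 2 { prev_pos = $1; prev_rate = $2 }' "${map_file}")
echo "${recomputed}"
```

Here the recomputed values agree with the third column of the toy file (0, 1 and 2 cM).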

### Bams
Given as a bamlist (i.e. a file with one row per sample, the path to the bam)

### Truth data (Optional) 
Consists of two files - phasefile and posfile - these are useful for understanding performance of imputation. 

Phasefile has a header row with a name for each sample, matching what is found in the bam file. File is tab separated, one subject per column, with 0 = ref and 1 = alt, separated by a vertical bar |, e.g. 0|0 or 0|1. Note therefore this file has one more row than posfile which has no header. 

The posfile gives the positions at which to impute, lining up one-to-one with the SNPs of the phasefile. The file is tab separated with no header, one row per SNP, with col 1 = chromosome, col 2 = physical position (sorted from smallest to largest), col 3 = reference base, col 4 = alternate base. Bases are capitalized. Example first row: 1 1000 A G
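
As a small sanity check of these two layouts, the sketch below builds a toy posfile and phasefile, confirms that the phasefile has exactly one more row than the posfile, and converts the phased genotypes into alternate-allele counts. The sample names, sites and file names are hypothetical.

```shell
# Toy truth data (made-up samples and sites; file names are hypothetical).
truth_dir=$(mktemp -d)
printf 'chr20\t1000\tA\tG\nchr20\t2000\tC\tT\n' > "${truth_dir}/toy.posfile.txt"
printf 'NA00001\tNA00002\n0|0\t0|1\n1|1\t1|0\n' > "${truth_dir}/toy.phasefile.txt"

# The phasefile should have exactly one more row (the header) than the posfile.
n_pos=$(wc -l < "${truth_dir}/toy.posfile.txt" | tr -d ' ')
n_phase=$(wc -l < "${truth_dir}/toy.phasefile.txt" | tr -d ' ')

# Convert phased genotypes to alternate-allele counts (0, 1 or 2) per sample.
dosages=$(awk -F'\t' 'NR > 1 {
    for (i = 1; i <= NF; i++) {
        split($i, allele, "|")
        printf "%s%d", (i > 1 ? " " : ""), allele[1] + allele[2]
    }
    print ""
}' "${truth_dir}/toy.phasefile.txt")
echo "${dosages}"
```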

- create folders and prepare inputs

In [None]:
ls $SAM_FOLDER/NA*.bam > ${OUT_DIR}/CEU_inds_bams.list

* Check your data to see it looks OK. 

In [None]:
samtools mpileup -b ${OUT_DIR}/CEU_inds_bams.list | head -n 10

- View the data in mpileup format and identify the columns
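
As a hint for reading the columns, the sketch below works on two made-up pileup lines for two BAM files. It assumes the usual mpileup layout: chromosome, position and reference base, followed by three columns per BAM (depth, read bases, base qualities).

```shell
# Toy mpileup lines for two BAMs (made-up values).
# Columns: chromosome, position, reference base, then depth / bases / qualities per BAM.
pileup=$(printf 'chr20\t2000001\tN\t2\tAA\tII\t0\t*\t*\nchr20\t2000002\tN\t1\tc\tI\t4\tCCCC\tIIII\n')

# Mean depth per BAM: the depth columns are 4, 7, 10, ...
mean_depth=$(printf '%s\n' "${pileup}" | awk -F'\t' '
    { for (i = 4; i <= NF; i += 3) depth[(i - 1) / 3] += $i; n_bam = (NF - 3) / 3 }
    END { for (s = 1; s <= n_bam; s++) printf "%s%.1f", (s > 1 ? " " : ""), depth[s] / NR; print "" }')
echo "${mean_depth}"
```

With real data the same awk program could be fed from the samtools mpileup command above instead of the toy lines.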


You should also index the FASTA reference (for example with samtools faidx).

## Output formats
Output VCF with both SNP annotation information (see below) and per-sample genotype information. 

Per-sample genotype information includes the following entries:

* GT: Phased genotypes. Genotypes with phase where each allele is the rounded per-haplotype posterior probability (HD below).

* GP: Genotype posteriors. Posterior probabilities of the three genotypes given the data.

* DS: Diploid dosage. Posterior expectation of the diploid genotype, i.e. the expected number of copies of the alternate allele.

* HD: Haploid dosages. Per-haplotype posterior probability of an alternate allele

Note that in QUILT, genotype posteriors (GP) and dosages (DS) are taken from the main Gibbs sampling, while the phasing results (GT and HD) are taken from an additional special phasing Gibbs sample. As such, phasing results (GT and HD) might not be consistent with genotype information (GP and DS). If consistency is necessary, note that you can create a consistent GP and DS from HD.
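
One possible way to derive a consistent GP and DS from HD is sketched below. It assumes the two haploid dosages can be treated as independent probabilities of carrying the alternate allele; this is an illustrative assumption and not necessarily what QUILT does internally. The function name `hd_to_gp_ds` and the input values are made up.

```shell
# Derive genotype posteriors (GP) and dosage (DS) from haploid dosages (HD),
# assuming the two haplotypes are independent. Input values are made up.
hd_to_gp_ds() {
    awk -v h1="$1" -v h2="$2" 'BEGIN {
        gp0 = (1 - h1) * (1 - h2)               # both haplotypes carry the reference allele
        gp1 = h1 * (1 - h2) + (1 - h1) * h2     # exactly one haplotype carries the alternate allele
        gp2 = h1 * h2                           # both haplotypes carry the alternate allele
        printf "GP=%.3f,%.3f,%.3f DS=%.3f\n", gp0, gp1, gp2, h1 + h2
    }'
}

hd_to_gp_ds 0.9 0.2
```

The three GP values sum to one by construction, and DS equals the second GP value plus twice the third.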

In [None]:
zcat ${EXAMPLE_VCF} | head -n 100 | tail -n +95

## Genotype calling without Imputation

In this exercise you will try to generate the widely used **vcf** formatted files. We will use

 - angsd
 - bcftools
 - vcftools

### samtools/bcftools

SAMtools outputs a binary version of VCF files, called BCF files. To get the VCF equivalent you should pipe the data to bcftools, which is located within the samtools/bcftools subfolder.

bcftools can do single-sample genotyping/SNP calling; however, it calculates the genotype likelihoods a little differently from the simple GATK model.

In [None]:
mkdir -p ${OUT_DIR}/bcftoolsgt_out

## 3 min to run
#bcftools mpileup -f ${REF_GENOME} -Ou -b ${OUT_DIR}/CEU_inds_bams.list -r chr20:2000001-5000000 | \
#    bcftools call -V indels -a GQ,GP -mv -Ov -o ${OUT_DIR}/bcftoolsgt_out/CEU_inds_bam.vcf
#
#bgzip -f ${OUT_DIR}/bcftoolsgt_out/CEU_inds_bam.vcf
#tabix -f ${OUT_DIR}/bcftoolsgt_out/CEU_inds_bam.vcf.gz
#
## this takes 3min to run, we just copy the output file
## you can uncomment (remove the leading single "#") the above lines to run it by yourself
cp -sf ~/data_folder/outputs/bcftoolsgt_out/* ~/current_folder/bcftoolsgt_out/


### ANGSD
We now call genotypes using ANGSD by simply calling the genotype that has the highest likelihood. In the VCF format above, genotypes are identified by the count of non-reference alleles. By default ANGSD estimates the major and minor alleles from the genotype likelihoods, but we can force it to use the reference allele as major with -doMajorMinor 4 and by supplying the reference (as is done in GATK and SAMtools).
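
Calling the genotype with the highest likelihood amounts to taking an argmax over the three genotype likelihoods at each site. A minimal sketch on made-up likelihoods follows. The helper name `call_max_gl` is hypothetical and this is not ANGSD's own code.

```shell
# Pick the genotype (0, 1 or 2 non-reference alleles) with the highest likelihood.
# Input: three genotype likelihoods per line; values are made up. Ties go to the first maximum.
call_max_gl() {
    awk '{
        best = 1
        for (g = 2; g <= 3; g++) if ($g > $best) best = g
        print best - 1
    }'
}

printf '0.90 0.09 0.01\n0.10 0.60 0.30\n0.01 0.19 0.80\n' | call_max_gl
```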

In [None]:
mkdir -p ${OUT_DIR}/angsd_out

# 5 min to run
#angsd -bam ${OUT_DIR}/CEU_inds_bams.list -SNP_pval 0.001 \
# -doMaf 2 -doMajorMinor 4 -r chr20:2000001-5000000 \
# -doGlf 2 -doGeno 2 -doPost 2 -GL 1 -doBcf 1 \
# -out ${OUT_DIR}/angsd_out/angsd_genotype -ref ${REF_GENOME}
# 
#bcftools index ${OUT_DIR}/angsd_out/angsd_genotype.bcf

## again, this takes 5min to run, we just copy the output file
## you can uncomment (remove the leading single "#") the above lines to run it by yourself
cp -sf ~/data_folder/outputs/angsd_out/angsd_genotype* ~/current_folder/angsd_out/


This command can be read as follows: we run the analysis on the BAM files listed in 'CEU_inds_bams.list' and limit it to the region chr20:2000001-5000000. We are not interested in all sites, only in those that are variable according to a likelihood ratio test with a p-value threshold of 0.001. Our output files are prefixed with 'angsd_genotype'. We want to estimate the allele frequency, which requires that we also determine the major and minor allele. All analyses are based on the SAMtools model of genotype likelihoods (-GL 1).
For all information, please view <https://www.popgen.dk/angsd/index.php/ANGSD>

Here we use the allele frequency in order to call SNPs.


### Comparing the results (SNP-discovery)

This requires R and the vcfppR and UpSetR packages.
Let us examine the difference between the two different approaches for SNP-calling, and their difference from the truth.

In [None]:
wd <- path.expand("~/current_folder/")
#wd <- path.expand("~/popgen25_imputation/")
setwd(wd)

library(vcfppR)
library(UpSetR)

# Return the data_type matrix (default "gt") of vcf_in, with rows and columns reordered
# to match the sites and samples of vcf_truth; sites absent from vcf_in become NA rows.
find_match <- function(vcf_in, vcf_truth, name="tool", data_type="gt"){
    if(! all(vcf_truth$samples %in% vcf_in$samples)){stop(name, ": samples don't match")}
    true_pos_match <- match(vcf_truth$pos, vcf_in$pos)
    vcf_in[[data_type]][true_pos_match,  match(vcf_truth$samples, vcf_in$samples)]
}

gt_truth_file <- paste0("/course/popgen25/Imputation/vcfs/CEU_true_set.chr20.vcf.gz")
gt_bcftool_file <- paste0(wd, "/bcftoolsgt_out/CEU_inds_bam.vcf.gz")
gt_angsd_file <- paste0(wd, "/angsd_out/angsd_genotype.bcf")

res_truth <- vcftable(gt_truth_file, "chr20:2000001-5000000", vartype = "snps")
res_bcftools <- vcftable(gt_bcftool_file, "chr20:2000001-5000000", vartype = "snps")
res_angsd <- vcftable(gt_angsd_file, "chr20:2000001-5000000", vartype = "snps")


We will use an UpSet plot to see the performance of angsd and bcftools genotype calling.

In [None]:
wd <- path.expand("~/current_folder/")
#wd <- path.expand("~/popgen25_imputation/")
setwd(wd)

## Make the data frame to make the upset plot. 
upsetdf <- fromList(list(
    Truth=res_truth$pos,
    angsd=res_angsd$pos,
    bcftools=res_bcftools$pos
))

## Visualize the results
upset(
  upsetdf,
  order.by         = "freq",
  sets.x.label     = "SNP count",
  mainbar.y.label  = "shared SNP",
  text.scale = c(2.2, 1.8, 2.2, 1.8, 2, 1.8)
)

Looking at the last two columns, angsd shares slightly fewer SNPs with the truth. The reason is that angsd mainly focuses on genotype likelihood estimation rather than genotype calling.
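
The set overlaps behind such a plot can also be counted on the command line. The sketch below uses made-up position lists and `comm`. On real data the lists could come from something like `bcftools query -f '%POS\n'`, which is a suggestion and not part of the exercise.

```shell
# Toy SNP position lists (made-up values), one position per line.
overlap_dir=$(mktemp -d)
printf '100\n200\n300\n400\n' | sort > "${overlap_dir}/truth.pos"
printf '100\n200\n500\n'      | sort > "${overlap_dir}/caller.pos"

# comm needs sorted input: -12 keeps lines common to both files,
# -13 lines only in the second file, -23 lines only in the first file.
n_shared=$(comm -12 "${overlap_dir}/truth.pos" "${overlap_dir}/caller.pos" | wc -l | tr -d ' ')
n_caller_only=$(comm -13 "${overlap_dir}/truth.pos" "${overlap_dir}/caller.pos" | wc -l | tr -d ' ')
n_truth_only=$(comm -23 "${overlap_dir}/truth.pos" "${overlap_dir}/caller.pos" | wc -l | tr -d ' ')
echo "shared=${n_shared} caller_only=${n_caller_only} truth_only=${n_truth_only}"
```

Sites found only by the caller are candidate false positives, and sites found only in the truth set are missed SNPs.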

# Genotype calling with imputation
We will use several different strategies to impute the genotype.

## Imputation using Beagle 4

We should first prepare the information needed for imputation. As we already have genotype (likelihood) data called by angsd, we will just use those output files.

Before that, we need to convert the data to vcf format, and build index, which is required for running Beagle 4 in genotype (likelihood) mode.

In [None]:

# prepare data for beagle 4, using angsd bcf output
mkdir -p ${OUT_DIR}/beagle4_out
bcftools view ${OUT_DIR}/angsd_out/angsd_genotype.bcf -Ov -o ${OUT_DIR}/beagle4_out/angsd_for_beagle.vcf
bgzip -f ${OUT_DIR}/beagle4_out/angsd_for_beagle.vcf
tabix -f ${OUT_DIR}/beagle4_out/angsd_for_beagle.vcf.gz
 

Run the imputation based on the angsd genotype likelihoods.
The "gl" argument stands for genotype likelihood.

In [None]:

# 10 secs to run
java -Xmx8g -jar ${BEAGLE4} \
    gl=${OUT_DIR}/beagle4_out/angsd_for_beagle.vcf.gz \
    out=${OUT_DIR}/beagle4_out/beagle4_imputation \
    niterations=10 \
    nthreads=32
bcftools index ${OUT_DIR}/beagle4_out/beagle4_imputation.vcf.gz


## Imputation using QUILT2

We will then use QUILT2 to go all the way from bams to genotypes with imputation, i.e. it will start from the bam file, do genotype calling and perform imputation.

Because of time constraints, we will take 1 sample as an example, and use the prepared outcome of 30 samples for further comparison.

In [None]:
# 2 min to run
#mkdir -p ${OUT_DIR}/quilt2_1_ind

quilt_out=${OUT_DIR}/quilt2_1_ind
if [ -d ${quilt_out}/ ]; then rm -rf ${quilt_out}/; fi
head -n 1 ${OUT_DIR}/CEU_inds_bams.list > ${OUT_DIR}/CEU_1_ind.list

$QUILT2 \
    --outputdir=${quilt_out} \
    --chr=chr20 \
    --regionStart=2000001 \
    --regionEnd=5000000 \
    --buffer=500000 \
    --nGen=100 \
    --bamlist=${OUT_DIR}/CEU_1_ind.list \
    --genetic_map_file=${QUILT2_MAP} \
    --reference_vcf_file=${REF_VCF} \
    --save_prepared_reference=TRUE


In [None]:

# it takes 20 min to run for all individuals, we will just copy the prepared results
#quilt_out=${OUT_DIR}/quilt2_output
#if [ -d $quilt_out ]; then rm -rf $quilt_out; fi
#$QUILT2 \
#    --outputdir=${quilt_out} \
#    --chr=chr20 \
#    --regionStart=2000001 \
#    --regionEnd=5000000 \
#    --buffer=500000 \
#    --nGen=100 \
#    --bamlist=${OUT_DIR}/CEU_inds_bams.list \
#    --genetic_map_file=${QUILT2_MAP} \
#    --reference_vcf_file=${REF_VCF} \
#    --save_prepared_reference=TRUE

cp ${QUILT2_EXAMPLE} ${OUT_DIR}/quilt2_1_ind/quilt2_all_inds.vcf.gz
tabix ${OUT_DIR}/quilt2_1_ind/quilt2_all_inds.vcf.gz


## SNP-chip data Beagle 5.5 imputation

Beagle 5.5 supports large reference panels, which should help improve imputation accuracy and retain more SNPs with good imputation results.

In [None]:
# 3 secs to run
mkdir -p ${OUT_DIR}/beagle5_out

java -Xmx8g -jar $BEAGLE5 \
    gt=${FAKE_SNPCHIP} \
    ref=${REF_VCF} \
    map=${BEAGLE5_MAP} \
    nthreads=16 \
    impute=true \
    gp=true \
    out=${OUT_DIR}/beagle5_out/beagle5_imputed

bcftools index ${OUT_DIR}/beagle5_out/beagle5_imputed.vcf.gz


# Compare results from tools

Let's first copy the true VCF.

In [None]:
cp ${TRUE_VCF} ${OUT_DIR}/
tabix ${OUT_DIR}/CEU_true_set.chr20.vcf.gz

In [None]:
# load the function
library(vcfppR)

estimate_gt_tools <- function(gt_mat_list = list(), gt_true_mat=matrix(), keep=c()){
    if(!all(sapply(gt_mat_list, nrow) == nrow(gt_true_mat))){stop("SNP count doesn't match")}
    if(!all(sapply(gt_mat_list, ncol) == ncol(gt_true_mat))){stop("Sample count doesn't match")}
    if(length(keep)==0){
        keep <- rep(TRUE, nrow(gt_true_mat))
    }
    # discordance with missing calls counted as discordant (NA recoded to -1, which never matches)
    res <- 
    sapply(gt_mat_list,function(x){
        x[is.na(x)] <- -1;
      mean(gt_true_mat[keep, ]!=x[keep, ],na.rm=T)})
    
    # fraction of missing calls
    resMis <- sapply(gt_mat_list,function(x) 
      mean(is.na(x[keep, ]),na.rm=T))
    # discordance among non-missing calls only
    resNA <- 
    sapply(gt_mat_list,function(x) 
      mean(gt_true_mat[keep, ]!=x[keep, ],na.rm=T))
    
    cat("Missing genotype rate:\n")
    print(resMis)
    
    cat("\nDiscordance rate assuming missing is discordant:\n")
    print(res)
    
    cat("\nDiscordance rate when ignoring missing genotypes:\n")
    print(resNA)
}
           
gp_to_gq <- function(gp_row) {
  if (any(is.na(gp_row))) return(0)  # NA to 0
  p_max <- max(gp_row)
  if (p_max >= 0.999999) return(99)  # cap GQ at 99
  gq <- -10 * log10(1 - p_max)
  return(round(gq, 2))
}

In [None]:
# Read all data

gt_truth_file <- paste0(wd, "/CEU_true_set.chr20.vcf.gz")
gt_bcftool_file <- paste0(wd, "/bcftoolsgt_out/CEU_inds_bam.vcf.gz")
gt_angsd_file <- paste0(wd, "/angsd_out/angsd_genotype.bcf")
gt_beagle4gl_file <- paste0(wd, "/beagle4_out/beagle4_imputation.vcf.gz")
gt_quilt2_file <- paste0(wd, "/quilt2_1_ind/quilt2_all_inds.vcf.gz")
gt_beagle5gl_file <- paste0(wd, "/beagle5_out/beagle5_imputed.vcf.gz")


res_truth <- vcftable(gt_truth_file, "chr20:2000001-5000000", vartype = "snps")
res_bcftools <- vcftable(gt_bcftool_file, "chr20:2000001-5000000", vartype = "snps")
res_angsd <- vcftable(gt_angsd_file, "chr20:2000001-5000000", vartype = "snps")
res_beagle4gl <- vcftable(gt_beagle4gl_file, "chr20:2000001-5000000", vartype = "snps")
res_quilt2 <- vcftable(gt_quilt2_file, "chr20:2000001-5000000", vartype = "snps")
res_beagle5gl <- vcftable(gt_beagle5gl_file, "chr20:2000001-5000000", vartype = "snps")


# strip the literal ".lc.bam" suffix (fixed = TRUE avoids treating "." as a regex wildcard)
res_truth$samples <- sub(".lc.bam", "", basename(res_truth$samples), fixed = TRUE)
res_bcftools$samples <- sub(".lc.bam", "", basename(res_bcftools$samples), fixed = TRUE)
res_angsd$samples <- sub(".lc.bam", "", basename(res_angsd$samples), fixed = TRUE)
res_beagle4gl$samples <- sub(".lc.bam", "", basename(res_beagle4gl$samples), fixed = TRUE)
res_quilt2$samples <- sub(".lc.bam", "", basename(res_quilt2$samples), fixed = TRUE)
res_beagle5gl$samples <- sub(".lc.bam", "", basename(res_beagle5gl$samples), fixed = TRUE)



In [None]:
true_freq <- rowMeans(res_truth$gt)/2
table(true_common <- true_freq > 0.05 & true_freq < 0.95)
names(res_truth)
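
The frequency filter used here can be illustrated on a plain genotype matrix. The sketch below assumes rows are sites, columns are samples, and values are alternate-allele counts (0, 1 or 2). The helper name `site_freq` and the genotypes are made up.

```shell
# Alternate allele frequency per site from a genotype matrix (rows = sites, columns = samples,
# values = 0/1/2 alternate-allele counts). Toy values; "common" means 0.05 < freq < 0.95.
site_freq() {
    awk '{
        sum = 0
        for (i = 1; i <= NF; i++) sum += $i
        freq = sum / (2 * NF)
        printf "%.3f %s\n", freq, (freq > 0.05 && freq < 0.95 ? "common" : "rare")
    }'
}

printf '0 0 0 0\n0 1 1 2\n2 2 2 2\n' | site_freq
```

Sites that are fixed for either allele fall outside the interval and are labelled "rare" in this sketch.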

In [None]:
res_truth_mat <- res_truth$gt
res_bcftools_mat <- find_match(res_bcftools, res_truth, name="bcftools")
res_angsd_mat <- find_match(res_angsd, res_truth, name="angsd")
res_beagle4gl_mat <- find_match(res_beagle4gl, res_truth, name="beagle4gl")
res_beagle5gl_mat <- find_match(res_beagle5gl, res_truth, name="beagle5gl")
res_quilt2_mat <- find_match(res_quilt2, res_truth, name="quilt2")


In [None]:
estimate_gt_tools(list(
    bcftools=res_bcftools_mat,
    angsd=res_angsd_mat,
    beagle4gl=res_beagle4gl_mat,
    beagle5gl=res_beagle5gl_mat,
    quilt2=res_quilt2_mat), res_truth_mat)

In [None]:
estimate_gt_tools(list(
    bcftools=res_bcftools_mat,
    angsd=res_angsd_mat,
    beagle4gl=res_beagle4gl_mat,
    beagle5gl=res_beagle5gl_mat,
    quilt2=res_quilt2_mat), res_truth_mat, keep=true_common)
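
To make the three reported rates concrete, here is a small stand-alone sketch on made-up genotype matrices, with NA marking a missing call. The helper name `discordance` and the file names are hypothetical. It prints the missing rate, the discordance when missing calls count as errors, and the discordance among non-missing calls only.

```shell
# Discordance between called and true genotypes (toy values, NA = missing call).
# Output: missing rate, discordance counting missing as wrong, discordance ignoring missing.
discordance() {  # usage: discordance truth_file calls_file
    awk 'NR == FNR { for (i = 1; i <= NF; i++) truth[FNR, i] = $i; next }
         { for (i = 1; i <= NF; i++) {
               total++
               if ($i == "NA") { missing++; continue }
               if ($i != truth[FNR, i]) wrong++
           } }
         END { printf "%.3f %.3f %.3f\n", missing / total, (wrong + missing) / total, wrong / (total - missing) }' "$1" "$2"
}

cmp_dir=$(mktemp -d)
printf '0 1 2 0\n1 1 0 2\n' > "${cmp_dir}/truth.txt"
printf '0 1 NA 0\n1 0 0 NA\n' > "${cmp_dir}/calls.txt"
discordance "${cmp_dir}/truth.txt" "${cmp_dir}/calls.txt"
```

In the toy data 2 of 8 calls are missing and 1 of the 6 remaining calls is wrong, so the two discordance rates differ depending on how missing calls are treated.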