```sh
# Download and install evaluation suite (Linux only)
curl -L https://github.com/lh3/CHM-eval/releases/download/v0.4/CHM-evalkit-20180221.tar \
  | tar xf -

# Call CHM1-CHM13 variants in the GRCh37 coordinate (will take a while...)
wget -qO- ftp://ftp.sra.ebi.ac.uk/vol1/ERA596/ERA596361/bam/CHM1_CHM13_2.bam \
  | freebayes -f hs37.fa - > CHM1_CHM13_2.raw.vcf

# Filter (use your own filters if you like)
CHM-eval.kit/run-flt -o CHM1_CHM13_2.flt CHM1_CHM13_2.raw.vcf

# Distance-based evaluation
CHM-eval.kit/run-eval -g 37 CHM1_CHM13_2.flt.vcf.gz | sh
more CHM1_CHM13_2.flt.summary

# Evaluating allele and genotype accuracy (Java required)
CHM-eval.kit/rtg format -o hs37.sdf hs37.fa  # if you haven't done this before
CHM-eval.kit/run-eval -g 37 -s hs37.sdf CHM1_CHM13_2.flt.vcf.gz | sh
more CHM1_CHM13_2.flt.rtg.summary
```
CHM-eval, aka Syndip, is a benchmark dataset for evaluating the accuracy of small variant callers. It is constructed from the PacBio assemblies of two independent CHM cell lines using procedures largely orthogonal to the methodology used for short-read variant calling, which makes it more comprehensive and less biased than existing benchmark datasets. The following figure briefly explains how this dataset was generated:
The truth data can be downloaded from the release page. The package contains the list of confident regions, phased variant calls including thousands of long insertions/deletions, and evaluation scripts (see below). Illumina short reads sequenced from the two cell lines and from the experimental mixture of the two cell lines are available via project PRJEB13208 at ENA.
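To illustrate what the list of confident regions is for, here is a minimal sketch that counts how many records of a VCF fall inside a set of BED intervals. It is not how `run-eval` works internally. The function name `count_in_regions` and the file names in the usage line are hypothetical, and both inputs are assumed to be uncompressed, tab-delimited text.

```sh
# count_in_regions BED VCF
# Print the number of VCF records whose POS lies inside any BED interval.
# BED intervals are 0-based and half-open; VCF POS is 1-based, so a
# position p is inside an interval when start < p <= end.
count_in_regions() {
  awk -F'\t' '
    # First file (BED): store the intervals per chromosome.
    NR == FNR { n[$1]++; s[$1, n[$1]] = $2 + 0; e[$1, n[$1]] = $3 + 0; next }
    # Second file (VCF): skip header lines.
    /^#/ { next }
    {
      p = $2 + 0
      # Linear scan over the intervals on this chromosome (illustration only).
      for (i = 1; i <= n[$1]; i++)
        if (p > s[$1, i] && p <= e[$1, i]) { c++; break }
    }
    END { print c + 0 }
  ' "$1" "$2"
}
```

For example, `count_in_regions confident.bed calls.vcf` prints a single integer. The linear scan is slow on genome-scale inputs, so for real evaluations the scripts shipped in the package are the better choice.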