Minigraph/CACTUS v1.0

The CACTUS pangenome pipeline adds base-level alignments to the minigraph graphs above (so both GRCh38- and CHM13-based graphs are available).

Graphs and associated files are summarized below.

| Description | GRCh38 Graph | CHM13 Graph |
|---|---|---|
| graph | gfa | gfa |
| Decomposed VCF | VCF, VCF index | |
| Pangenie-ready VCF | VCF, VCF index | |
| Raw VCF | VCF, VCF index | VCF, VCF index, VCF(CHM13), VCF(CHM13) index |
| multiple alignment | HAL | HAL |
| sequences clipped out before alignment | masking | masking |
| VG indexes | xg, snarls, trans | xg, snarls, trans |
| Giraffe indexes | dist, min, gg, gbwt, dist(vg<1.44.0), min(vg<1.44.0) | dist, min, gg, gbwt, dist(vg<1.44.0), min(vg<1.44.0) |

The graphs are available in gfa format alongside other graph and index files. Information about the associated file formats can be found:

VCF Decomposition

The Raw VCF files contain one site for each bubble in the graph, so nested bubbles produce overlapping sites. The nesting relationships are recorded in the PS (parent snarl), LV (level) and AT (allele traversal) tags, which must be taken into account when interpreting the VCF. Alternatively, you can use the "Decomposed VCFs", which have been normalized by using vcfbub to "pop" bubbles with alleles larger than 100 kb and vcfwave to realign each alt allele to the reference (script). The "Pangenie-ready VCF" was created using a different decomposition that does not use realignment (description, intermediate files), with the aim of optimizing genotyping performance with Pangenie.
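As an illustration of taking the nesting tags into account, the following minimal Python sketch keeps only top-level sites from a Raw VCF by reading the LV tag in the INFO column. It is not part of the release tooling. The function names `parse_info` and `top_level_sites`, the example records, and the choice to treat a missing LV tag as level 0 are all assumptions made for this example.

```python
def parse_info(info_field):
    """Turn a VCF INFO string such as 'AC=1;LV=1;PS=>1>9' into a dict."""
    info = {}
    for entry in info_field.split(";"):
        if not entry or entry == ".":
            continue
        key, sep, value = entry.partition("=")
        # Flag-style entries (no '=') are stored as True.
        info[key] = value if sep else True
    return info


def top_level_sites(vcf_lines):
    """Yield (chrom, pos, id) for records at nesting level 0.

    Assumption for this sketch: a record without an LV tag is top-level.
    """
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        info = parse_info(fields[7])
        if int(info.get("LV", 0)) == 0:
            yield fields[0], int(fields[1]), fields[2]


# Made-up records for illustration only: one top-level site and one
# nested site whose PS tag points at its parent.
example = [
    "##fileformat=VCFv4.2\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
    "chr1\t100\t>1>9\tACGT\tA\t60\tPASS\tAC=1;LV=0\n",
    "chr1\t101\t>2>5\tC\tG\t60\tPASS\tAC=1;LV=1;PS=>1>9\n",
]
sites = list(top_level_sites(example))
print(sites)
```

Dropping nested records this way avoids counting overlapping sites twice, but it also discards the finer-grained alleles inside each bubble, which is the problem the Decomposed VCFs are meant to address.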

Filtered Graphs

The Giraffe short-read mapper relies on the graph's snarl decomposition. The Minigraph/Cactus graphs released here contain some spurious large deletion edges that make this decomposition less efficient, which increases Giraffe runtime. Furthermore, we have found that when calling small variants with the Giraffe-DeepVariant pipeline, accuracy improves if all alleles with frequency < 10% are removed from the graph before indexing. Two filtered versions of each of the two Minigraph/Cactus graphs are available here. The graphs with maxdel.10mb in the name (recommended to speed up general mapping experiments) were created by removing edges that imply deletions > 10 Mb. The graphs with minaf.0.1 in the name (recommended for use with DeepVariant) were created by removing, in addition to those deletion edges, nodes that are covered by fewer than 9 haplotypes. To use vg versions older than v1.44.0 with these graphs, download the .dist.old and .min.old indexes and rename them to .dist and .min. (Update 4/18/2023: all .dist and .min indexes have been regenerated using a patched vg to fix a speed regression. They remain compatible with vg versions >= v1.44.0.)
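The renaming step for older vg versions can be scripted. The sketch below is one possible way to do it in Python, assuming the .dist.old and .min.old files have already been downloaded into a single directory. The function name `use_old_indexes` and the placeholder file names are hypothetical and not part of the release.

```python
from pathlib import Path
import tempfile


def use_old_indexes(directory):
    """Rename every *.dist.old / *.min.old file to *.dist / *.min.

    Returns the new paths. Note that Path.replace overwrites an existing
    file of the same name, so keep new-format indexes elsewhere.
    """
    renamed = []
    # sorted() materializes the listing before any file is renamed.
    for old in sorted(Path(directory).iterdir()):
        if old.name.endswith((".dist.old", ".min.old")):
            new = old.with_name(old.name[: -len(".old")])
            old.replace(new)
            renamed.append(new)
    return renamed


# Demonstration on placeholder files in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("graph.dist.old", "graph.min.old", "graph.gbwt"):
        (Path(tmp) / name).write_text("placeholder")
    renamed = sorted(p.name for p in use_old_indexes(tmp))
    remaining = sorted(p.name for p in Path(tmp).iterdir())
print(renamed, remaining)
```

Files that do not end in .dist.old or .min.old (the .gbwt placeholder here) are left untouched.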

Masked Sequence

Highly repetitive sequence, such as that found in centromeres, was excluded from the Minigraph/Cactus graphs using the following process. dna-brnn was first run with its default parameters and model to identify alpha satellite and HSat2/3 regions > 100 kb, which were clipped out of the input FASTA files. Gaps > 100 kb between minigraph mappings were likewise removed. Any remaining contigs or contig fragments that could not be assigned to a reference chromosome were excluded. Finally, gaps > 10 kb left unaligned after Cactus were removed. Please note that no sequence was removed from the reference genome of either graph. Each removed interval, along with the step that removed it, is available:
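To make the length-threshold clipping concrete, here is a small Python sketch of the general idea: intervals longer than a cutoff are cut out of a contig, and the remaining fragments are reported. This is not the pipeline's own code. The function name `clip_long_intervals`, the half-open coordinate convention, and the example coordinates are assumptions for illustration; the 100 kb default mirrors the first threshold described above.

```python
def clip_long_intervals(seq_length, intervals, min_length=100_000):
    """Return the (start, end) fragments left after clipping long intervals.

    intervals: half-open (start, end) pairs flagged for possible removal.
    Only intervals strictly longer than min_length are clipped out.
    """
    to_clip = sorted((s, e) for s, e in intervals if e - s > min_length)
    kept, cursor = [], 0
    for start, end in to_clip:
        if start > cursor:
            kept.append((cursor, start))
        # max() handles clipped intervals that overlap each other.
        cursor = max(cursor, end)
    if cursor < seq_length:
        kept.append((cursor, seq_length))
    return kept


# A 1 Mb contig with a 150 kb flagged region (clipped) and a 50 kb
# flagged region (kept, because it is below the threshold).
fragments = clip_long_intervals(
    1_000_000,
    [(200_000, 350_000), (500_000, 550_000)],
)
print(fragments)  # [(0, 200000), (350000, 1000000)]
```

Clipping splits a contig into fragments, which is why the process above refers to contig fragments in the later chromosome-assignment step.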