human-pangenomics/hpp_pangenome_resources

HPRC Pangenome Resources

This repo describes pangenomes produced by the Human Pangenome Reference Consortium from year 1 data. For information about data reuse and publishing with HPRC data, please see the HPRC's Data Use Protocol.

Note: The pangenomes and resultant files referred to in this repo have not been fully QC'd, are not published, and may have known issues.

Background Information

Preprint

A Draft Human Pangenome Reference

Graph Creation Strategies

Graphs are available from three different strategies summarized in the table (and relevant sections) below:

| | Minigraph | Minigraph-Cactus | PGGB |
| --- | --- | --- | --- |
| sequence comparison | reference-based, progressive | reference-based, progressive | symmetric, all-vs-all |
| resolution | SV only | base-level (via abPOA) | base-level (via abPOA) |
| scope | full assemblies | Non-centromeric | full assemblies |
| cyclic paths | no | non-reference | all |
| short read mapping | untested | yes (fast) | untested |
| long read mapping | yes (fastest) | yes | yes (slowest) |
| Assembly mapping | yes (direct) | untested | yes (via injection) |

Index files listing file locations for download with the AWS CLI can be found in the indexes folder of this repository. Alternatively, tables are listed below in each graph creation strategy's section. Note that the index files list file locations as s3:// URIs, whereas the tables link to http:// URLs.
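If you start from an index file but want a browser-friendly link, an s3:// URI can be rewritten as an HTTPS URL. The sketch below assumes the generic virtual-hosted-style S3 endpoint (`https://<bucket>.s3.amazonaws.com/<key>`), which may not match the exact URL form used in the tables; the function name and the example bucket are hypothetical.

```python
from urllib.parse import urlparse


def s3_to_https(uri: str) -> str:
    """Rewrite s3://bucket/key as a virtual-hosted-style HTTPS URL.

    Assumption: the bucket is publicly readable and reachable through the
    generic S3 endpoint; this is not taken from the repository itself.
    """
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an s3:// URI: {uri!r}")
    key = parsed.path.lstrip("/")  # drop the leading slash of the object key
    return f"https://{parsed.netloc}.s3.amazonaws.com/{key}"


if __name__ == "__main__":
    # Hypothetical URI, for illustration only.
    print(s3_to_https("s3://example-bucket/dir/graph.gfa.gz"))
```

The reverse direction (HTTPS to s3://) depends on which endpoint style a given URL uses, so it is left out here.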

Assembly Inputs

Information about the source assemblies can be found in the HPRC Assembly GitHub repository. Of the 47 samples assembled (94 assemblies) in year 1, all but three samples were included in graph constructions (HG002, HG005 and NA19240 were excluded for evaluation purposes). GRCh38 and CHM13 were added to make the total number of haplotypes included 90.
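The haplotype count above can be checked with a few lines of arithmetic. This sketch assumes each sample contributes two haplotype assemblies (94 assemblies / 47 samples) and that each reference counts as one haplotype; the variable and function names are illustrative.

```python
SAMPLES_ASSEMBLED = 47
HELD_OUT = ("HG002", "HG005", "NA19240")  # excluded for evaluation
REFERENCES = ("GRCh38", "CHM13")          # each counted as one haplotype
HAPLOTYPES_PER_SAMPLE = 2                 # 94 assemblies / 47 samples


def total_haplotypes() -> int:
    """Haplotypes in the graphs: included samples x 2, plus the references."""
    included_samples = SAMPLES_ASSEMBLED - len(HELD_OUT)
    return included_samples * HAPLOTYPES_PER_SAMPLE + len(REFERENCES)


if __name__ == "__main__":
    print(total_haplotypes())  # (47 - 3) * 2 + 2 = 90
```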

Graphs

Minigraph

Minigraph (cite) is a generalization of minimap2 that builds the graph iteratively and is very fast. It aligns sequences to approximate locations and can be used to call structural variants (>50 nt). Graphs were built with both GRCh38 and CHM13+Y (found here) used as reference sequences.

| Description | GRCh38 Graph | CHM13 Graph |
| --- | --- | --- |
| graph | graph | graph |
| bed | bed, index | bed, index |

Minigraph-Cactus

Minigraph-Cactus (cite) adds base-level alignment to minigraph graphs.

Note: The links below point to version 1.1 of the graphs, which contains numerous bug fixes and updated file formats (including a switch from . to # as the path name separator in all vg files). The original version 1.0 graph described in the HPRC paper has been moved here. The input assemblies are the same for both versions, so unless you are trying to reproduce results from the paper exactly, please consider using the updated version.
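If you have scripts that still carry version 1.0 path names, the separator change can be handled with a small helper. This is only a sketch: it assumes path names have three leading fields (sample, haplotype, contig) and that only the first two separators changed, since contig accessions can themselves contain dots. Neither the field layout nor the example name is confirmed by this README, so check against the actual path names in your files.

```python
def dotted_to_hash(name: str, fields: int = 3) -> str:
    """Replace the first (fields - 1) '.' separators with '#'.

    Assumption: names look like sample.haplotype.contig, and dots inside
    the contig accession (e.g. a version suffix) must be left untouched.
    """
    parts = name.split(".", fields - 1)  # split on the leading separators only
    return "#".join(parts)


if __name__ == "__main__":
    # Hypothetical path name, for illustration only.
    print(dotted_to_hash("SAMPLE.1.CONTIG0001.1"))  # SAMPLE#1#CONTIG0001.1
```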

Graphs and associated files are summarized below.

| Description | GRCh38 Graph | CHM13 Graph |
| --- | --- | --- |
| Graph | gfa, gbz | gfa, gbz |
| Full (Unclipped) Graph | gfa, gbz, odgi | gfa, gbz, odgi |
| Chromosome Graphs | chroms | chroms |
| Decomposed VCF | VCF, VCF index | VCF, VCF index, GRCh38-VCF, GRCh38-VCF index |
| Raw VCF | VCF, VCF index | VCF, VCF index, GRCh38-VCF, GRCh38-VCF index |
| Multiple Alignment | HAL, MAF, MAF Index, TAF, TAF Index | HAL, MAF, MAF Index, TAF, TAF Index |
| Multiple Alignment (Duplications removed) | MAF, MAF Index, TAF, TAF Index | MAF, MAF Index, TAF, TAF Index |
| VG Indexes | gbz, hapl, dist, min, snarls | gbz, hapl, dist, min, snarls |
| AF-Filtered VG Indexes | gbz, dist, min, snarls | gbz, dist, min, snarls |
| Excluded Regions | full graph bed, clipped graph bed | full graph bed, clipped graph bed |
| All Files | files | files |

The graphs are available in gfa format alongside other graph and index files. Information about the associated file formats can be found:

VCF Decomposition

The Raw VCF files contain a site for each bubble in the graph. Nested bubbles will result in overlapping sites. The nesting relationships are denoted with the PS (parent snarl), LV (level) and AT (allele traversal) tags and need to be taken into account when interpreting the VCF. Alternatively, you can use the "Decomposed VCFs" which have been normalized by using vcfbub to "pop" bubbles with alleles larger than 100k and vcfwave to realign each alt allele to the reference (script). Note that in order to reproduce the PanGenie analyses from the papers, you should instead use the PanGenie HPRC Workflow. This workflow has a CHM13 branch to use when working with that reference.
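To make the nesting tags concrete, here is a minimal sketch that reads PS, LV and AT from the INFO column of a single raw VCF record. It uses only the standard library and assumes the usual tab-separated VCF layout with INFO in the eighth column; the function names and the example records in the test are made up for illustration.

```python
def parse_info(info: str) -> dict:
    """Split a VCF INFO string into a dict (flags map to True)."""
    fields = {}
    for item in info.split(";"):
        key, sep, value = item.partition("=")  # split on the first '=' only
        fields[key] = value if sep else True
    return fields


def nesting(vcf_line: str):
    """Return (site_id, level, parent_snarl, allele_traversals) for one record."""
    cols = vcf_line.rstrip("\n").split("\t")
    info = parse_info(cols[7])
    level = int(info.get("LV", 0))   # assumption: a missing LV means top level
    parent = info.get("PS")          # None when the site has no parent snarl
    traversals = info["AT"].split(",") if "AT" in info else []
    return cols[2], level, parent, traversals


def is_top_level(vcf_line: str) -> bool:
    """True for sites that are not nested inside another bubble."""
    return nesting(vcf_line)[1] == 0
```

Keeping only records where `is_top_level` is true is one simple way to avoid overlapping sites, at the cost of losing the finer-grained nested variants.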

The exact tools and commands used to produce the VCFs are given here.

Filtered Graphs

The "AF-Filtered VG indexes" above were created by dropping nodes and edges supported by fewer than 10% of haplotypes. They give the best performance for Giraffe and are what have been used in the various papers to date. Note that Giraffe requires only the .gbz, .dist and .min indexes.
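The filtering rule can be illustrated on a toy data structure. This is not how the graphs were actually filtered; it is a sketch that assumes you already have, for each node, the set of haplotypes passing through it. The function name and the input layout are hypothetical.

```python
def filter_by_support(node_haplotypes: dict, n_haplotypes: int, min_percent: int = 10) -> set:
    """Keep nodes supported by at least min_percent of haplotypes.

    node_haplotypes maps a node id to the set of haplotype names that
    traverse it. Integer arithmetic avoids floating-point edge cases.
    """
    return {
        node
        for node, haps in node_haplotypes.items()
        if len(haps) * 100 >= min_percent * n_haplotypes
    }


if __name__ == "__main__":
    # With 90 haplotypes, 10% corresponds to 9 supporting haplotypes.
    toy = {
        "n1": {f"h{i}" for i in range(90)},  # on every haplotype: kept
        "n2": {f"h{i}" for i in range(9)},   # exactly 10%: kept
        "n3": {f"h{i}" for i in range(8)},   # below 10%: dropped
    }
    print(sorted(filter_by_support(toy, 90)))  # ['n1', 'n2']
```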

Excluded Sequence

Some input contigs could not be assigned to a reference chromosome and were dropped. See the "full graph bed" files above for a listing of these. Contig fragments >10kb that did not map anywhere were likewise excluded (these regions are predominantly centromeric). See the "clipped graph bed" files above for these regions (this file includes the unassigned contigs). dna-brnn was not used to make these graphs.
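To get a feel for how much sequence was excluded, the BED files can be summarized per contig. The sketch below assumes standard BED layout (name, 0-based start, end as the first three whitespace-separated columns); the function name is illustrative, and the file name in the usage note is a placeholder.

```python
from collections import defaultdict


def excluded_bases(bed_lines) -> dict:
    """Sum interval lengths per sequence name from BED-formatted lines."""
    totals = defaultdict(int)
    for line in bed_lines:
        # Skip blank lines and the header/comment lines BED allows.
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        name, start, end = line.split()[:3]
        totals[name] += int(end) - int(start)  # BED intervals are half-open
    return dict(totals)
```

Typical use would be `with open("excluded.bed") as fh: totals = excluded_bases(fh)`, after decompressing the file if needed.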

PGGB

The Pangenome Graph Builder pipeline (PGGB) (cite) creates an all-vs-all graph with base-level alignments and no clipping of mitochondrial or centromeric regions.

Graphs and associated files are summarized below.

| Description | Location |
| --- | --- |
| graph | gfa |
| untangle | delta, paf |
| Decomposed VCFs | GRCh38 VCF, GRCh38 VCF Index |
| Raw VCFs | chm13.1-22+X, chm13.M, grch38.1-22+X, grch38.M, grch38.Y |

Graph chromosome files and images can be found here and here.

See above for more information on VCF decomposition (script).

Change Log

* Dec 03, 2021: updated minigraph-cactus VCFs to fix headers (thanks to Wen-Wei)