T2T Consortium: centromeric satellite datasets and scripts
The human genome reference sequence has remained incomplete for two decades. Genome assembly efforts to date have excluded an estimated 5-10% of the human genome, most of which is found in and around each chromosome’s highly repetitive centromere, owing to a fundamental inability to assemble across long, repetitive sequences using short DNA sequencing reads. As a result, millions of bases in each chromosome’s peri/centromere have remained largely uncharacterized and have been omitted from essentially all contemporary genetic and epigenetic studies. However, emerging long-read sequencing and assembly methods have now enabled the Telomere-to-Telomere Consortium to produce the first complete assembly of an entire human genome (T2T-CHM13) (Nurk et al.). This effort relied on careful measures to correctly assemble, polish, and validate entire centromeric and pericentromeric repeat arrays for the first time. By deeply characterizing these newly assembled sequences, here we present the first high-resolution, genome-wide atlas of the sequence content and organization of human peri/centromeric regions.
T2T Consortium webpage: https://sites.google.com/ucsc.edu/t2tworkinggroup
All data tracks and satellite annotations can be visualized on the UCSC Genome Browser: http://genome.ucsc.edu/cgi-bin/hgTracks?genome=t2t-chm13-v1.0&hubUrl=http://t2t.gi.ucsc.edu/chm13/hub/hub.txt
The Human Pangenome Reference Consortium (HPRC) generated long, accurate HiFi reads for sixteen human samples: HG002, HG003, HG004, HG005, HG006, HG007, HG01243, HG02055, HG02109, HG02723, HG03492, HG01109, HG01442, HG02080, HG02145, and HG03098 (https://github.com/human-pangenomics/hpgp-data). We refer to these datasets as HPRC samples.
Sequence data are available through https://www.ncbi.nlm.nih.gov/bioproject/559484
Sequencing data, assemblies, and other supporting data on AWS:
https://github.com/marbl/CHM13
Assembly issues and known heterozygous sites:
https://github.com/marbl/CHM13-issues
NTRprism scripts: http://public.gi.ucsc.edu/~khmiga/NTRprism_v0.1.zip