T2T Consortium: centromeric satellite datasets and scripts

kmiga/t2t_censat

t2t_censat

T2T Consortium: centromeric satellite datasets and scripts

The human genome reference sequence has remained incomplete for two decades. Genome assembly efforts to date have excluded an estimated 5-10% of the human genome, most of which lies in and around each chromosome’s highly repetitive centromere, because short DNA sequencing reads cannot be assembled across long, repetitive sequences. As a result, millions of bases in each chromosome’s peri/centromere have remained largely uncharacterized and have been omitted from essentially all contemporary genetic and epigenetic studies. Emerging long-read sequencing and assembly methods have now enabled the Telomere-to-Telomere (T2T) Consortium to produce the first complete assembly of an entire human genome, T2T-CHM13 (Nurk et al.). This effort relied on careful measures to correctly assemble, polish, and validate entire centromeric and pericentromeric repeat arrays for the first time. By deeply characterizing these newly assembled sequences, we present here the first high-resolution, genome-wide atlas of the sequence content and organization of human peri/centromeric regions.

T2T Consortium webpage: https://sites.google.com/ucsc.edu/t2tworkinggroup

All data tracks and satellite annotations can be visualized on the UCSC Genome Browser: http://genome.ucsc.edu/cgi-bin/hgTracks?genome=t2t-chm13-v1.0&hubUrl=http://t2t.gi.ucsc.edu/chm13/hub/hub.txt

The Human Pangenome Reference Consortium (HPRC) generated long, accurate HiFi reads for sixteen human samples: HG002, HG003, HG004, HG005, HG006, HG007, HG01243, HG02055, HG02109, HG02723, HG03492, HG01109, HG01442, HG02080, HG02145, and HG03098. We refer to these datasets as the HPRC samples (https://github.com/human-pangenomics/hpgp-data).

Sequence data are available through https://www.ncbi.nlm.nih.gov/bioproject/559484

Sequencing data, assemblies, and other supporting data on AWS:

https://github.com/marbl/CHM13

Assembly issues and known heterozygous sites:

https://github.com/marbl/CHM13-issues

NTRprism scripts: http://public.gi.ucsc.edu/~khmiga/NTRprism_v0.1.zip
