T2T-Primates is a project of the Telomere-to-Telomere consortium and is led by the Makova, Phillippy, and Eichler labs. The project seeks to finish complete, diploid assemblies for key non-human primate species. The project is currently focused on gorilla, bonobo, chimpanzee, orangutan, and gibbon. Following the approach of the human T2T-CHM13 project, all species have been sequenced with high-coverage PacBio HiFi (>50x) and Oxford Nanopore ultra-long 100 kb+ (>30x) sequencing reads. For haplotype phasing, Dovetail Hi-C data was generated for all genomes and Strand-seq data is also expected. Parental Illumina data was collected for bonobo and gorilla, where familial trios were available.
Phase one of the project focused on completing the sex chromosomes (v1 release), and phase two focused on finishing the autosomes (v2 release). Version 2 assemblies for all species are now available, both here and via GenBank/RefSeq. See below for publications detailing our initial analyses of these assemblies.
All data is released to the public domain (CC0) and we encourage its reuse. However, we are in the process of finishing and analyzing these genomes, so to avoid duplicating effort, we encourage you to contact us if you are interested in contributing. The following working groups have been formed: assembly, annotation, sex chromosomes, comparative and evolutionary genomics, segmental duplications, acrocentric chromosomes and rDNAs, satellite DNAs, mobile elements, and pangenomics.
- Yoo D, et al. Complete sequencing of ape genomes. bioRxiv, 2024.
- Makova K, et al. The complete sequence and comparative analysis of ape sex chromosomes. Nature, 2024.
The raw genome sequencing data generated by this study are available under NCBI BioProjects PRJNA602326, PRJNA976699–PRJNA976702, and PRJNA986878–PRJNA986879, and transcriptome data are deposited under BioProjects PRJNA902025 (UW Iso-Seq) and PRJNA1016395 (UW and PSU Iso-Seq and short-read RNA-seq). The genome assemblies are available from GenBank under accessions GCA_028858775.2, GCA_028878055.2, GCA_028885625.2, GCA_028885655.2, GCA_029281585.2, and GCA_029289425.2, and can be downloaded via NCBI.
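For scripted retrieval it can help to expand the BioProject ranges above into individual accessions. The following is a minimal sketch, assuming that a range such as PRJNA976699–PRJNA976702 covers every consecutive accession between its endpoints; the function name `expand_accessions` is hypothetical and not part of this repository.

```python
import re

# Matches a single accession (PRJNA602326) or a range written with an
# en dash or hyphen (PRJNA976699–PRJNA976702).
ACCESSION = re.compile(r"PRJNA(\d+)(?:[–-]PRJNA(\d+))?")

def expand_accessions(text):
    """Return every BioProject accession named in text, expanding ranges.

    Assumption: a range includes all consecutive accessions between
    its two endpoints, inclusive.
    """
    accessions = []
    for match in ACCESSION.finditer(text):
        start = int(match.group(1))
        end = int(match.group(2)) if match.group(2) else start
        accessions.extend(f"PRJNA{n}" for n in range(start, end + 1))
    return accessions

if __name__ == "__main__":
    listed = "PRJNA602326, PRJNA976699–PRJNA976702, and PRJNA986878–PRJNA986879"
    for accession in expand_accessions(listed):
        print(accession)
```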
A UCSC Browser hub is available, including genome-wide alignments, CAT annotations, methylation, and various other annotation and analysis tracks used in this study. The T2T-CHM13v2.0 and HG002v1.0 assemblies used here are also available via the same browser hub, and from GenBank via accessions GCA_009914755.4 (T2T-CHM13), GCA_018852605.1 (HG002 paternal), and GCA_018852615.1 (HG002 maternal). The alignments are available to download or browse in HAL, MAF, and UCSC Chains formats.
Version 2 diploid assemblies were generated by Verkko with additional finishing and polishing steps to reach T2T. Chromosomes were named and oriented according to the prior cytogenetics literature for each species. For convenience, the "hsa" suffix in the chromosome names refers to the human homologous chromosome, where applicable. Gorilla and bonobo were phased using familial trios, and so complete maternal and paternal haplotypes are available for these species. All other species were phased using Hi-C. In the case of Hi-C phasing, each chromosome is completely phased, but it is not known which haplotype is maternal and which is paternal, so the higher quality haplotype was assigned to hap1 and the lower quality haplotype to hap2. All assemblies have been submitted to NCBI GenBank and are currently being processed. The curated and submitted versions can be downloaded from AWS in a variety of configurations:
- Gorilla gorilla v2.0 (gorilla, 20231122)
- Pan paniscus v2.0 (bonobo, 20231122)
- Pan troglodytes v2.0 (chimpanzee, 20231122)
- Pongo abelii v2.0 (Sumatran orangutan, 20231205)
- Pongo pygmaeus v2.0 (Bornean orangutan, 20231122)
- Symphalangus syndactylus v2.1 (siamang gibbon, 20240514)*
There are a number of files within these directories with the following tags:
- dip: diploid assembly including both haplotypes
- analysis-dip: diploid assembly + MT + rDNA morph + EBV contigs
- pri: "Primary linear haplotype". Higher quality haplotype per chromosome (hap1 in non-trios) + ChrXY
- alt: "Alternate haplotype". Equal or lower quality haplotype (hap2 in non-trios) with no ChrXY
- mat/pat: maternal and paternal haplotypes, with chrX in mat and chrY in pat
- hap1/hap2: hap1 and hap2 haplotypes, with chrX in hap1 and chrY in hap2
- chrEBV/MT/rDNA: consensus EBV, mitochondria, and rDNA contigs
- unloc: any unlocalized sequences from unresolved gaps
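When sorting a directory listing, the tags above can be picked out of each file name programmatically. The sketch below assumes that tags appear as dot-separated tokens in the file name; that layout, the example file names, and the helper `find_tags` are illustrative assumptions rather than a documented naming specification. The chrEBV/MT/rDNA tags are left out because their exact spelling in file names is not clear from the list.

```python
from typing import List

# Descriptions condensed from the tag list in this README.
TAG_DESCRIPTIONS = {
    "dip": "diploid assembly including both haplotypes",
    "analysis-dip": "diploid assembly + MT + rDNA morph + EBV contigs",
    "pri": "primary linear haplotype, including chrX and chrY",
    "alt": "alternate haplotype, without chrX and chrY",
    "mat": "maternal haplotype (carries chrX)",
    "pat": "paternal haplotype (carries chrY)",
    "hap1": "hap1 haplotype (carries chrX)",
    "hap2": "hap2 haplotype (carries chrY)",
    "unloc": "unlocalized sequences from unresolved gaps",
}

def find_tags(filename: str) -> List[str]:
    """Return the known tags found among the dot-separated tokens of filename.

    Assumption: tags are whole tokens, so "analysis-dip" is never
    reported as "dip".
    """
    return [token for token in filename.split(".") if token in TAG_DESCRIPTIONS]

if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    for name in ["example.pri.20231122.fasta.gz", "example.analysis-dip.20231122.fasta.gz"]:
        for tag in find_tags(name):
            print(f"{name}: {tag} = {TAG_DESCRIPTIONS[tag]}")
```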
Files with the date tags 20231122 and 20231205 are the v2.0 assemblies that were initially submitted to GenBank. To serve as a linear reference genome, a haploid “primary” assembly was selected from the diploid assembly of each species. For each chromosome pair, the more complete and accurate haplotype was selected. When rDNA was present in only one haplotype, that haplotype was chosen as the primary regardless of its completeness. Both diploid and primary assemblies were submitted, but only the primary assemblies containing both chrX and chrY will be annotated and serve as a linear reference for each species.
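The selection rule for the primary haplotype can be sketched as a small comparison function. This is an illustration only, not the script used for the release: the `Haplotype` fields, the use of gap count as a proxy for completeness, and the use of a QV score as a proxy for accuracy are all assumptions made for the example.

```python
from dataclasses import dataclass

@dataclass
class Haplotype:
    name: str
    has_rdna: bool   # whether the rDNA array is present on this copy
    n_gaps: int      # hypothetical completeness metric: fewer is better
    qv: float        # hypothetical accuracy metric: higher is better

def pick_primary(a: Haplotype, b: Haplotype) -> Haplotype:
    """Choose the primary copy of a chromosome pair.

    Rule from the text: if rDNA is present in only one haplotype, that
    haplotype wins regardless of completeness. Otherwise prefer the more
    complete copy, then the more accurate one. Ties go to the first argument.
    """
    if a.has_rdna != b.has_rdna:
        return a if a.has_rdna else b
    def quality(h: Haplotype):
        return (-h.n_gaps, h.qv)
    return a if quality(a) >= quality(b) else b

if __name__ == "__main__":
    # Made-up values, for illustration only.
    with_rdna = Haplotype("copy_A", has_rdna=True, n_gaps=1, qv=50.0)
    without_rdna = Haplotype("copy_B", has_rdna=False, n_gaps=0, qv=60.0)
    print(pick_primary(with_rdna, without_rdna).name)
```

In this example the rDNA-containing copy is selected even though the other copy has fewer gaps and a higher score, which mirrors the exception described above.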
All primary haplotypes are in "T2T" status (gapless and complete, telomere on both ends, higher accuracy) with the exception of the large rDNA arrays; one additional gap in mPanPan1 chr22_pat_hsa21, mPonAbe1 chr18_hap1_hsa16, and mPonAbe1 chr1_hap1_hsa1; and one missing telomere from mPonPyg2 chr21_hap1_hsa20.
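Sequence names such as chr22_pat_hsa21 encode the chromosome, the haplotype, and the human homolog. A minimal parsing sketch follows; the pattern is inferred only from the names quoted in this README, so names in the actual FASTA headers may include forms it does not cover, and the helper `parse_seq_name` is hypothetical.

```python
import re
from typing import NamedTuple, Optional

# Pattern inferred from names quoted in this README, e.g. chr22_pat_hsa21.
# The "_hsa..." suffix is optional because it is only given where applicable.
SEQ_NAME = re.compile(
    r"^chr(?P<chrom>[0-9]+|X|Y)"
    r"_(?P<hap>mat|pat|hap1|hap2)"
    r"(?:_hsa(?P<hsa>[0-9A-Za-z]+))?$"
)

class SeqName(NamedTuple):
    chrom: str                    # chromosome label in the species' own numbering
    haplotype: str                # mat, pat, hap1, or hap2
    human_homolog: Optional[str]  # label after "hsa", or None if absent

def parse_seq_name(name: str) -> Optional[SeqName]:
    """Split a sequence name into its parts, or return None if it does not match."""
    match = SEQ_NAME.match(name)
    if match is None:
        return None
    return SeqName(match.group("chrom"), match.group("hap"), match.group("hsa"))

if __name__ == "__main__":
    for name in ["chr22_pat_hsa21", "chr18_hap1_hsa16", "chr21_hap1_hsa20"]:
        print(name, parse_seq_name(name))
```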
*Symphalangus syndactylus (mSymSyn1, siamang gibbon) has been updated to v2.1 with date tag 20240514, and GenBank has been updated accordingly. The only change from v2.0 is that the labels of chromosomes 12 and 19 were swapped to match the prior chromosome assignment for this species.
Version 1 diploid assemblies were generated with Verkko, and contigs were chromosome-assigned and oriented by alignment to the previous references. Both X and Y chromosomes are complete for all species listed. Gorilla and bonobo were phased using familial trios, and all others using Hi-C. To avoid confusion, we have removed links to these assemblies, but they still exist in the AWS bucket.
All generated sequencing data and assemblies are available for browsing and download from GenomeArk.
- Gorilla gorilla (gorilla)
- Pan paniscus (bonobo)
- Pan troglodytes (chimpanzee)
- Pongo abelii (Sumatran orangutan)
- Pongo pygmaeus (Bornean orangutan)
- Symphalangus syndactylus (siamang gibbon)
Files are generously hosted by Amazon Web Services under s3://genomeark. Although the files are available as HTTP links above, download performance is improved by using the Amazon Web Services command-line interface. To do so, amend the references to use the s3:// addressing scheme. Tuning settings such as max_concurrent_requests as per this guide will improve download performance further.
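As a rough illustration of amending a reference, the sketch below rewrites an HTTP link into the s3:// scheme and assembles the matching AWS CLI commands without running them. It assumes the HTTP links use the virtual-hosted form `https://<bucket>.s3.amazonaws.com/<key>` and that the bucket permits anonymous reads (hence `--no-sign-request`); the example path and the helper names are hypothetical.

```python
from urllib.parse import urlparse

S3_HOST_SUFFIX = ".s3.amazonaws.com"

def to_s3_uri(url: str) -> str:
    """Rewrite https://<bucket>.s3.amazonaws.com/<key> as s3://<bucket>/<key>."""
    parsed = urlparse(url)
    if not parsed.netloc.endswith(S3_HOST_SUFFIX):
        raise ValueError(f"not a virtual-hosted S3 URL: {url}")
    bucket = parsed.netloc[: -len(S3_HOST_SUFFIX)]
    key = parsed.path.lstrip("/")
    return f"s3://{bucket}/{key}"

def download_command(s3_uri: str, dest: str = ".") -> list:
    """Build an 'aws s3 cp' command; --no-sign-request skips credential lookup."""
    return ["aws", "s3", "cp", "--no-sign-request", s3_uri, dest]

def tuning_command(max_requests: int) -> list:
    """Build the command that raises the CLI's concurrent S3 request limit."""
    return ["aws", "configure", "set",
            "default.s3.max_concurrent_requests", str(max_requests)]

if __name__ == "__main__":
    # Hypothetical path, for illustration only.
    uri = to_s3_uri("https://genomeark.s3.amazonaws.com/example/path/assembly.fasta.gz")
    print(" ".join(tuning_command(20)))
    print(" ".join(download_command(uri)))
```

The commands are returned as argument lists so they can be inspected first and then passed to `subprocess.run` if desired.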
Custom scripts used for the v2 assemblies are listed below:
- Polishing
- Assembly QC
- Implicit graph & pangenome
- Cactus alignment
- Assessment of ancestral sequence
- AQER analysis
- Population genome processing and selection analysis
- Acrocentric/rDNA analyses - rDNA image processing and copy number analysis
- Species-specific MEI analysis
- Non-B DNA annotation and NUMT detection
- Transcript comparison
- Analysis of IG/TR loci
In addition to the custom scripts, the following software was used:
- Alignment: Winnowmap, lastz, minimap2, blastn, blastp, wfmash, MashMap, nucmer
- Alignment processing: paf2chain, wgatools, rustybam, perbase
- Conservation score calculation: PhastCons
- Pan-genome graph: impg
- Further assembly QC: Flagger, NucFreq
- Non-B DNA annotation: non-B_gfa
- Gene annotation: CAT, TOGA, IgDetective, Digger, Exonerate
- Repeat annotation: RepeatMasker, TRF, ULTRA, WindowMasker
- Transcriptome data alignment: StringTie2
- Incomplete lineage sorting: TRAILS, mcmc2
- Selection signature scans: Sweepfinder2, saltiLASSI
- Replication timing: Phylo-HMGP
- Structural variation calling: SyRI, PAV
- Segmental duplication: SEDEF
- Alpha satellites higher order array prediction: HumAS-HMMER
- Data visualization: SVbyEye, StainedGlass, ModDotPlot
For any problems related to this dataset, please raise issues on this GitHub repository. For general questions regarding the project, please contact adam.phillippy@nih.gov. More information about our consortium can be found on the T2T homepage.
* Dec 2022. v1 release.
* Nov 2023. v2 release.
* Dec 2023. hap1 and hap2 swapped in mPonAbe1 chr14 (hsa13) and mSymSyn1 chr3 to keep the rDNA-containing or higher-quality haplotype in hap1 and in the primary assembly.
* May 2024. mSymSyn1 v2.1 release. Chr12 and Chr19 are swapped to follow prior chromosome assignments for this species.