Sequencing Data

HiFi Data

A total of 100 Gbp of data (32.4x coverage) in HiFi 20 kbp libraries (used for v0.9-v1.1 assemblies) is available from NCBI. An additional 76 Gbp of data (24.4x coverage) is available in HiFi 10 kbp libraries at NCBI. The raw subreads for the 20 kbp libraries are available below.

raw subreads (genome DNA) (NOTE: these are the individual raw subreads, NOT HiFi reads. Most users will want to download the HiFi reads from the links above).
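The coverage figures quoted on this page follow from total sequenced bases divided by genome size. As a rough sanity check, the sketch below inverts that relation to show the genome size implied by each quoted yield and coverage pair. The function name is hypothetical, and the inputs are simply the HiFi numbers quoted above.

```python
def implied_genome_size_gbp(total_gbp: float, coverage: float) -> float:
    """Genome size (Gbp) implied by coverage = total bases / genome size."""
    if coverage <= 0:
        raise ValueError("coverage must be positive")
    return total_gbp / coverage


# (label, yield in Gbp, coverage) as quoted for the two HiFi library types.
for label, gbp, cov in [("HiFi 20 kbp", 100, 32.4), ("HiFi 10 kbp", 76, 24.4)]:
    print(f"{label}: ~{implied_genome_size_gbp(gbp, cov):.2f} Gbp")
```

Small differences between the implied sizes are expected, since the quoted yields are rounded.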

Oxford Nanopore Data

Nanopore sequencing was performed using Josh Quick's ultra-long read (UL) protocol and modifications as described in The structure, function, and evolution of a complete human chromosome 8.

We sequenced a total of 390 Gbp of data (126x coverage). The read N50 is 58 kbp and there are 219 Gbp in reads >50 kbp (71x). The longest full-length mapping read is 1.3 Mbp. Sequencing data was generated from three lines of CHM13 (NHGRI, UW, UCD), all of which originate from the original line established by Urvashi Surti. Only the NHGRI line was karyotyped and confirmed to be stable prior to sequencing.

  • NHGRI line: NHGRI (PI: Phillippy) and University of Nottingham (PI: Loose) contributed approximately 140 flowcells of UL data using Quick's ultra-long protocol; 199 Gbp (64x, 1.4 Gbp/flowcell). The read N50 is 71 kbp and there are 128 Gbp of data in reads >50 kbp (41x).

  • UW line: University of Washington (PI: Eichler) contributed 106 flowcells of UL data using a new UL protocol developed by Glennis Logsdon; 69 Gbp (22x, 0.6 Gbp/flowcell). The read N50 is 133 kbp and there are 57 Gbp of data in reads >50 kbp (18x).

  • UCD line: UC Davis (PI: Dennis) contributed two PromethION cells using a ligation prep; 114 Gbp (37x, 57 Gbp/flowcell). The read N50 is 36 kbp and there are 25 Gbp of data in reads >50 kbp (8x).
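For readers unfamiliar with the read N50 statistic quoted above: it is the length L such that reads of length L or longer contain at least half of all sequenced bases. The following is a minimal sketch, assuming read lengths are already available as a list of integers. The function names and the toy lengths are illustrative only and are not taken from the released data.

```python
def read_n50(lengths):
    """Return the N50 of a collection of read lengths (0 if empty)."""
    total = sum(lengths)
    running = 0
    # Walk reads from longest to shortest until half the bases are covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0


def bases_in_long_reads(lengths, min_len=50_000):
    """Sum of bases in reads strictly longer than min_len."""
    return sum(length for length in lengths if length > min_len)


# Toy example with made-up read lengths (in bp).
toy = [10_000, 20_000, 60_000, 70_000, 140_000]
print(read_n50(toy), bases_in_long_reads(toy))
```

In practice the lengths would come from the basecalled FASTQ files or a sequencing summary, which this sketch does not parse.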

Read ids broken out by sequencing location are available for NHGRI, U of Nottingham, UW, and UCD.

Guppy 6 + BLOW5 (genome DNA)

Thanks to Hasindu Gamaarachchi, who contributed the BLOW5-formatted data and basecalled sequences.

Downloads

rel8 (genome DNA)

rel8 is the full dataset as of 2020/10/01. All data was re-called using Guppy 5.0.7.

Downloads

rel7 (genome DNA)

rel7 is the full dataset as of 2020/10/01. All data was re-called using Bonito v0.3.1.

Downloads

rel6 (genomic DNA)

rel6 is the full dataset as of 2020/10/01, adding UW data from partitions 232-243. All data was re-called using Guppy 3.6.0 with the HAC model.

Downloads

rel5 (genomic DNA)

rel5 is the full dataset as of 2019/09/01. All data was re-called using Guppy 3.6.0 with the HAC model.

Downloads

rel4 (genomic DNA)

rel4 is the full dataset as of 2019/09/01. All data was re-called using Guppy 3.4.5 with the HAC model.

Downloads

rel3 (genomic DNA)

rel3 is the full dataset as of 2019/09/01. All data was re-called using Guppy 3.1.5 with the HAC model. We have provided minimap2 mappings, in CRAM format, both to our current draft assembly and to GRCh38 with decoys.

Downloads

rel2 (genomic DNA)

rel2 is the same data as rel1 but re-called with the latest generation callers (Guppy flip-flop 2.3.1). We have provided minimap2 mappings, in CRAM format, both to our current draft assembly and to GRCh38 with decoys.

Downloads

rel1 (genomic DNA)

The full dataset as of 2019/01/09. These basecalls were generated on-instrument and use older versions of Guppy (depending on when the flowcell ran on the instrument).

Downloads

fast5 data

The raw fast5 data, without basecalls, is available for completeness. The data is grouped into 243 sets.

  • Partitions 1-94 were sequenced at NHGRI

  • Partitions 95-98 were sequenced at University of Nottingham

  • Partitions 99-144 were sequenced at NHGRI

  • Partitions 145-224 were sequenced at University of Washington

  • Partitions 225-226 were sequenced at UC Davis

  • Partitions 227-231 were sequenced at NHGRI

  • Partitions 232-243 were sequenced at University of Washington

  • Note that when the tgz files were grouped and uploaded, some inadvertently included more than a single partition. These are denoted as partition ranges in the downloads (e.g. 145-149).
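The partition list above can be captured as a small lookup, which may be handy when deciding which fast5 archives to fetch for a given site. This is only a sketch: the ranges are copied from the list above, while the table and function names are hypothetical.

```python
# Inclusive partition ranges and the site that sequenced them (from the list above).
PARTITION_SITES = [
    (1, 94, "NHGRI"),
    (95, 98, "University of Nottingham"),
    (99, 144, "NHGRI"),
    (145, 224, "University of Washington"),
    (225, 226, "UC Davis"),
    (227, 231, "NHGRI"),
    (232, 243, "University of Washington"),
]


def site_for_partition(partition: int) -> str:
    """Return the sequencing site for a fast5 partition number (1-243)."""
    for first, last, site in PARTITION_SITES:
        if first <= partition <= last:
            return site
    raise ValueError(f"partition {partition} is outside the range 1-243")


def partitions_for_site(site: str) -> list:
    """Return every partition number sequenced at the given site."""
    return [
        p
        for first, last, name in PARTITION_SITES
        if name == site
        for p in range(first, last + 1)
    ]
```

Because some archives span several partitions, a download helper built on this would still need to match partition numbers against the ranges in the archive names.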

Downloads

Illumina PCRFree Data

A total of >300 Gbp of data (105x coverage) in PCR-Free Illumina libraries is available from NCBI.

10X Genomics Data

Raw fastq files

Approximately 50x of data was generated on a NovaSeq instrument. Based on the summary output of Supernova, there are 1.2 billion reads with 41x effective coverage. The mean molecule length is 130 kbp, with an N50 of 864 reads per barcode.

Downloads

BioNano DLS Data

Approximately 430x of data was generated using the Saphyr instrument and the DLE-1 enzyme. There are 15.2 M molecules with an N50 molecule length of 115.9 kbp and a max of 2.3 Mbp (2 M molecules > 150 kbp, N50 218 kbp). The assembly of the molecules is 2.97 Gbp in size with 255 contigs and an NG50 of 59.6 Mbp.

The BNX file was produced from the CHM13 cell line and therefore does not include the Y chromosome. Due to a low frequency of restriction sites in some regions of the genome, the BNX data has gaps and is not telomere to telomere. The CMAP was generated from the BNX file and therefore also has gaps and does not include the Y chromosome. However, if needed, Bionano has a tool that can convert the complete T2T sequence (FASTA file) into a format compatible with data output from the Saphyr instrument (CMAP file). The name of this tool is "in silico digestion", as described in Bionano's documentation here: https://bionanogenomics.com/wp-content/uploads/2018/04/30205-Guidelines-for-Running-Bionano-Solve-Pipeline-on-Command-Line.pdf. The CMAP resulting from converting the FASTA file would be truly complete (telomere to telomere for all chromosomes including Y).

Downloads

  • BNX (md5: 59a7a5583e900e1e5cecb08a34b5b0dc)
  • CMAP (md5: cf1a6fbcf006a26673499b9297664fdb)
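After downloading, the md5 values listed above can be compared against the local files. Here is a minimal sketch using Python's standard `hashlib`. The helper names are hypothetical, and the file is read in chunks because these files are large.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the hex md5 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_md5(path, expected_hex):
    """True if the file's md5 equals the expected hex string (case-insensitive)."""
    return md5_of_file(path) == expected_hex.strip().lower()
```

For example, `matches_md5("chm13.bnx", "59a7a5583e900e1e5cecb08a34b5b0dc")` would check the BNX download, where `chm13.bnx` is a placeholder for whatever local file name you saved it under.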

Hi-C Data

A library was generated using an Arima Genomics kit and sequenced to approximately 40x on an Illumina HiSeq X.

Downloads

RNA-seq data

Two separate poly-A prep libraries were generated at UC Davis, and 2x150 bp RNA-seq reads were generated on an Illumina NovaSeq (~25 million PE reads each).

Downloads

Previously generated PacBio data

The PacBio data (both CLR and HiFi) was previously generated and is available from the SRA. The cells used for Arrow polishing the v0.7 assembly are listed here.