Human Pangenome Reference Consortium
HG002 Data Freeze (v1.0)
The Human Pangenome Reference Consortium has organized a collection of HG002 (NA24385) data to guide the development of new assembly and evaluation strategies for our reference production efforts.
Information about our consortium can be found at https://humanpangenome.org/ and on Twitter @HumanPangenome.
The following data are available from the HG002 (NA24385) human cell line:
- PacBio HiFi: SMRTbell libraries were prepared and size-selected with SageELF to the targeted size (15 kb, 19 kb, 20 kb, or 25 kb), and sequenced on the Sequel II System with Chemistry 2.0, the Sequel II System with pre-2.0 Early Access Chemistry, and the Sequel System with Chemistry 3.0.
- PacBio CLR: Continuous Long Reads (30 kb size selection), >60x coverage from 2 SMRT Cells (1 SMRT Cell is 30x coverage; v1.0 Chem).
- Oxford Nanopore (Unsheared PromethION data): 658x coverage of unsheared sequencing in total (28 PromethION flow cells), including ~51x coverage in 100kb+ reads. Please note the redundancy explained under "Attention to redundancy".
- Oxford Nanopore (Ultra-long GridION data): ~52x coverage of unsheared sequencing in total (106 MinION flow cells), including ~15x coverage in 100kb+ reads.
- Hi-C: Sequencing libraries represent two distinct protocols/optimized methods (labeled HiC.1 and HiC.2, from two anonymized companies) to reach "smooth-coverage", with access to high coverage Nova-Seq (250bp PE).
- Strand-seq: Strand-specific libraries were generated in collaboration with Jan Korbel’s laboratory (EMBL, Heidelberg). These data represent 192 barcoded single-cell libraries.
- 10X Genomics: Chromium Genome Platform from 10X Genomics was sequenced to two depths: 51.7x coverage, and a deeper coverage (300Gb, 84.4x coverage) dataset. Additional data is available from BioProject: PRJNA527321
- BioNano Maps: DLE1; data collected (molecules >150 kbp): 317 Gbp; read N50 (molecules >150 kbp): 323 kbp. Provided by BioNano Genomics and the Genome in a Bottle (GIAB) Consortium.
- Illumina: Whole genome data, 300x PCR-free Illumina 150bp + 40x PCR-free 250bp, from GIAB
All HG002 data (and links to parental data) are posted at the following Amazon AWS public dataset: human-pangenomics.
In addition, we will host links to the following parental data for both Father (HG003, NA24149) and Mother (HG004, NA24143): PacBio (HiFi and CLR), Nanopore-PromethION unsheared, BioNano maps, 10XG, Illumina data (300x PCR-free and 6kb mate-pair sequencing), and HiC. The links are provided at: https://s3-us-west-2.amazonaws.com/human-pangenomics/index.html?prefix=HG002/hpp_HG002_NA24385_son_v1/parents/
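For scripted access, it can help to turn a browse link of the form index.html?prefix=... into a bucket-and-prefix URI. The sketch below assumes the bucket name is the first path component of the URL (human-pangenomics here); the function name is illustrative only.

```python
from urllib.parse import urlparse, parse_qs

def browse_url_to_s3_uri(url: str) -> str:
    """Turn an index.html?prefix=... browse link into an s3:// URI."""
    parsed = urlparse(url)
    # Assumption: the bucket is the first path component, e.g. /human-pangenomics/index.html
    bucket = parsed.path.strip("/").split("/")[0]
    # The key prefix is carried in the "prefix" query parameter
    prefix = parse_qs(parsed.query).get("prefix", [""])[0]
    return f"s3://{bucket}/{prefix}"

parents = browse_url_to_s3_uri(
    "https://s3-us-west-2.amazonaws.com/human-pangenomics/index.html"
    "?prefix=HG002/hpp_HG002_NA24385_son_v1/parents/"
)
print(parents)
```

The resulting URI can then be passed to an S3 client of your choice, for example the AWS CLI with unsigned requests for public buckets.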
Data reuse and license
All data are released to the public domain (CC0) and we encourage its reuse. We would appreciate it if you would acknowledge the Human Pangenome Reference Consortium when assemblies are posted or if you use the data sets generated by the consortium.
Please credit the use of GIAB datasets:
Zook, Justin M., et al. "Extensive sequencing of seven human genomes to characterize benchmark reference materials." Scientific data 3.1 (2016): 1-26.
Please credit the use of UCSC nanopore and HiC (HG003 and HG004) datasets:
Shafin, Kishwar, et al. "Efficient de novo assembly of eleven human genomes using PromethION sequencing and a novel nanopore toolkit." BioRxiv (2019): 715722.
Please credit the use of previously published PacBio HiFi data:
Wenger, Aaron M., et al. "Accurate circular consensus long-read sequencing improves variant detection and assembly of a human genome." Nature biotechnology 37.10 (2019): 1155-1162.
The annotated table of sequence data can be downloaded here.
HG002 Data Freeze (v1.0) Recommended downsampled data mix
We encourage assembly groups to use as much of the data from the HG002 freeze as possible to get the best assembly they can. However, no two groups are likely to use exactly the same subset of data, which makes comparison more difficult, and the size and variety of the HG002 freeze are not representative of what is likely to be available in future freezes. We therefore recommend that assembly groups also run their pipeline on the following set of 4 downsampled datasets from the HG002 (NA24385) human cell line:
- PacBio: ~34X coverage of Sequel II System with Chemistry 2.0
- Oxford Nanopore (Unsheared PromethION data):
  - Standard production (60x coverage of unsheared sequencing from 3 PromethION flow cells, N50 ~44kb; from Shafin et al. BioRxiv 2019)
  - Ultra-long data (~15x coverage of 100kb+ reads GridION and ~45x coverage of 100kb+ reads PromethION data).
  - NOTE: The 100kb+ reads are included from Standard production data (File name: HG002_ucsc_Jan_2019_Guppy_3.4.4). That is, ultra-long data from the standard unsheared run are included in this file, and read names are expected to be redundant.
- Hi-C: Sequencing libraries represent two distinct protocols/optimized methods (labeled HiC.1 and HiC.2, from two anonymized companies) to reach "smooth-coverage", with access to high coverage Nova-Seq (250bp PE). Downsampled matched HiC_1 and HiC_2 data: 441,683,225 raw read-pairs, or ~69X coverage
- Illumina: Whole-genome data, downsampled to ~30x PCR-free Illumina 150bp (HG002, HG003, and HG004) to match the expected production data available for the 1000 genome samples (produced at the NYGC: ~30x 150 bp PE Illumina reads from the 2504 1KG samples (using ~450bp fragment size))
  - HG002: https://s3-us-west-2.amazonaws.com/human-pangenomics/index.html?prefix=HG002/hpp_HG002_NA24385_son_v1/ILMN/downsampled/
  - HG003 and HG004: https://s3-us-west-2.amazonaws.com/human-pangenomics/index.html?prefix=HG002/hpp_HG002_NA24385_son_v1/parents/ILMN/downsampled/
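As a rough check of how a read-pair count maps to fold coverage, the sketch below reproduces the ~69X figure quoted for the downsampled Hi-C data. It assumes every read is 250 bp (2x250 paired-end) and a genome size rounded to 3.2 Gbp; neither assumption is stated alongside the figure itself.

```python
def fold_coverage(read_pairs: int, read_length_bp: int, genome_size_bp: float) -> float:
    """Sequenced bases divided by genome size, for paired-end data."""
    return read_pairs * 2 * read_length_bp / genome_size_bp

# Assumptions (not stated in the text): all reads are 250 bp and the
# genome size is rounded to 3.2 Gbp.
hic_cov = fold_coverage(441_683_225, 250, 3.2e9)
print(f"{hic_cov:.1f}x")  # 69.0x
```

The same function can be used to estimate how many read pairs a target coverage requires for other read lengths.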
Sequencing Data - whole dataset
README - PacBio HiFi reads of HG002, Ashkenazim Son
Last Updated: January 24, 2020
DNA samples were extracted from large homogenized growths of B-lymphoblastoid cell lines from the Coriell Institute for Medical Research and sheared to 20 kb with Megaruptor. SMRTbell libraries were prepared and size-selected with SageELF to the targeted size (15 kb, 19 kb, 20 kb, or 25 kb). Circular Consensus Sequence was generated in SMRT Link, and reads were filtered to 99% predicted accuracy (QV20). Sample preparation, library preparation, sequencing, and CCS analysis were performed at PacBio.
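The correspondence between 99% predicted accuracy and QV20 follows from the Phred scale, QV = -10 * log10(error probability). A minimal sketch of the conversion (the function names are illustrative, not part of SMRT Link):

```python
import math

def accuracy_to_qv(accuracy: float) -> float:
    """Phred-scaled quality: QV = -10 * log10(error probability)."""
    return -10.0 * math.log10(1.0 - accuracy)

def qv_to_accuracy(qv: float) -> float:
    """Inverse conversion: accuracy = 1 - 10^(-QV/10)."""
    return 1.0 - 10.0 ** (-qv / 10.0)

print(round(accuracy_to_qv(0.99)))  # 20
```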
Sequel II System with Chemistry 2.0
- Size selection - 15 kb or 20 kb selected on SageELF
- Run time - 30 hrs per SMRT Cell 8M
- CCS - "Circular Consensus Sequencing" analysis in SMRT Link v8.0
- SRA - https://www.ncbi.nlm.nih.gov/bioproject/?term=PRJNA586863
- 15 kb Library, ~36-fold coverage
- 20 kb Library, ~16-fold coverage
Sequel II System with pre-2.0 Early Access Chemistry
- Size selection - 15 kb, 19 kb, or 25 kb selected on SageELF
- Run time - 30 hrs per SMRT Cell 8M
- CCS - "Circular Consensus Sequencing" analysis in SMRT Link v8.0 pre-release
- 15 kb Library
- 19 kb Library
- 25 kb Library
Sequel System with Chemistry 3.0
- Size selection - 15 kb selected on SageELF
- Run time - 24 hrs per SMRT Cell 1M
- CCS - "Circular Consensus Sequencing" analysis in SMRT Link v6.0
- SRA - https://www.ncbi.nlm.nih.gov/bioproject/?term=PRJNA529679
- 15kb, ~28-fold coverage
For Research Use Only. Not for use in diagnostic procedures. © Copyright 2020, Pacific Biosciences of California, Inc. All rights reserved. The data provided in these files is subject to change without notice and Pacific Biosciences assumes no responsibility for any errors or omissions. Certain notices, terms, conditions and/or use restrictions may pertain to your use of Pacific Biosciences data, products and/or third party products. Please refer to the applicable Pacific Biosciences Terms and Conditions of Sale and to the applicable license terms at https://www.pacb.com/legal-and-trademarks/terms-and-conditions-of-sale. Pacific Biosciences, the Pacific Biosciences logo, PacBio, SMRT, SMRTbell, Iso-Seq, and Sequel are trademarks of Pacific Biosciences. All other trademarks are the sole property of their respective owners.
README - PacBio CLR reads of HG002, Ashkenazim Son
Last Updated: January 24, 2020
DNA samples were extracted from large homogenized growths of B-lymphoblastoid cell lines from the Coriell Institute for Medical Research and sheared to 30 kb with Megaruptor. A SMRTbell library was prepared with SMRTbell Express 2.0 and size-selected with BluePippin to >15 kb. The library was sequenced on the Sequel II System with Chemistry 2.0 for 15 hrs per SMRT Cell 8M. Sample preparation, library preparation and sequencing were performed at PacBio.
m64070_190824_163708, ~30-fold unique molecular coverage
HG002 CLR data were provided by the McDonnell Genome Institute at Washington University in St. Louis, Missouri.
HG002 DNA was extracted from cells using the Qiagen MagAttract Kit.
- Shearing - 60kb with Megaruptor
- Library prep - SMRTbell Express Template Prep 2.0
- Size selection - >40 kb with BluePippin
- Sequencing - Sequel II System with binding kit (101-842-900) and sequencing kit (101-820-200)
- SMRT Cells - SMRT Cell 8M Tray 101-389-001
- Run time - 20 hrs per SMRT Cell
SMRT cell 'A01':
- Polymerase read bases 140,829,614,688
- Polymerase reads 5,606,947
- Polymerase read N50 42,922
- Subread Length (mean) 21,871
- Subread N50 37,116
- Insert length (mean) 22,072
- Insert N50 37,703
- Unique Molecular Yield: 123,059,829,616
SMRT cell 'C01':
- Polymerase read bases 159,869,879,629
- Polymerase reads 6,678,991
- Polymerase read N50 23,936
- Subread Length (mean) 20,549
- Subread N50 34,601
- Insert length (mean) 20,962
- Insert N50 35,428
- Unique Molecular Yield: 138,808,414,978
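For readers unfamiliar with the N50 values reported above, the sketch below shows the usual definition: the length L such that reads of length at least L contain half or more of all bases. The input lengths are made up for illustration and are not taken from these SMRT Cells.

```python
def n50(lengths):
    """Length L such that reads of length >= L hold at least half of all bases."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    raise ValueError("no lengths given")

toy_lengths = [2, 2, 2, 3, 3, 4, 8, 8]  # made-up read lengths, 32 bases in total
print(n50(toy_lengths))  # 8
```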
Attention to redundancy
Please note that UCSC standard unsheared data from January 2019 are present in two directories:
- Guppy 3.4.4: https://s3-us-west-2.amazonaws.com/human-pangenomics/HG002/hpp_HG002_NA24385_son_v1/nanopore/downsampled/standard_unsheared/HG002_ucsc_Jan_2019_Guppy_3.4.4.fastq.gz
- Guppy 3.0.5: https://s3-us-west-2.amazonaws.com/human-pangenomics/HG002/hpp_HG002_NA24385_son_v1/nanopore/HG002_ucsc_Jan_2019_Guppy_3.0.fastq.gz
100 kb+ read data: the 100 kb+ read set includes reads from all datasets.
Note that read IDs will be redundant if you use data from both the main directory (https://s3-us-west-2.amazonaws.com/human-pangenomics/index.html?prefix=HG002/hpp_HG002_NA24385_son_v1/nanopore/) and the downsampled directory (https://s3-us-west-2.amazonaws.com/human-pangenomics/index.html?prefix=HG002/hpp_HG002_NA24385_son_v1/nanopore/downsampled/).
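If files from both directories are combined, one way to handle the redundancy is to keep only the first record seen for each read ID. The sketch below assumes plain four-line FASTQ records and uses small in-memory examples; for real files the handles would typically come from gzip.open(path, "rt"). The read names are hypothetical.

```python
import io

def read_fastq(handle):
    """Yield (read_id, four-line record) from a FASTQ text handle."""
    while True:
        header = handle.readline()
        if not header:
            return
        record = [header] + [handle.readline() for _ in range(3)]
        # The read ID is the first whitespace-delimited token after '@'
        read_id = header[1:].split()[0]
        yield read_id, "".join(record)

def unique_records(handles):
    """Yield each record once, keeping the first occurrence of every read ID."""
    seen = set()
    for handle in handles:
        for read_id, record in read_fastq(handle):
            if read_id not in seen:
                seen.add(read_id)
                yield record

# Hypothetical toy inputs standing in for the main and downsampled directories
main_dir = io.StringIO("@r1 ch=1\nACGT\n+\n!!!!\n@r2\nGGCC\n+\n!!!!\n")
downsampled = io.StringIO("@r2\nGGCC\n+\n!!!!\n@r3\nTTAA\n+\n!!!!\n")
merged = list(unique_records([main_dir, downsampled]))
print(len(merged))  # 3
```

Holding every read ID in memory is a deliberate simplification; for very large read sets a disk-backed index may be preferable.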
GIAB UL data
GIAB UL dataset (HG002_giab_ULfastqs_guppy3.2.4.fastq.gz) was generated using RAD004 sequencing chemistry on a GridION.
NOTE: If you downloaded this file before January 9, 2020, please note that some of the reads in that file were later flagged for removal. The read IDs to exclude are listed in the HG002_giab_UL_old_fastqs_mislabeled-sample_read-ids.txt.gz file.
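For an older download, the flagged reads could be dropped with a small filter like the one below. This is a sketch only: it assumes the exclusion file holds one read ID per line (not verified here) and that the FASTQ uses four-line records. The toy inputs are hypothetical.

```python
import io

def load_excluded_ids(lines):
    """Collect read IDs, one per line, ignoring blank lines."""
    return {line.strip() for line in lines if line.strip()}

def drop_excluded(fastq_handle, excluded):
    """Yield four-line FASTQ records whose read ID is not in `excluded`."""
    while True:
        record = [fastq_handle.readline() for _ in range(4)]
        if not record[0]:
            return
        read_id = record[0][1:].split()[0]
        if read_id not in excluded:
            yield "".join(record)

# Hypothetical toy data; real inputs would be opened with gzip.open(path, "rt")
excluded_ids = load_excluded_ids(io.StringIO("bad1\n\nbad2\n"))
old_fastq = io.StringIO("@bad1\nAC\n+\n!!\n@ok1 extra\nGT\n+\n!!\n")
kept = list(drop_excluded(old_fastq, excluded_ids))
print(len(kept))  # 1
```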
Nanopore PromethION data
The UCSC data (HG002_ucsc_Jan_2019_Guppy_3.4.fastq.gz) were generated using the Shasta publication protocol (3 LSK109-based sequencing libraries per PromethION flow cell with 3 flow cells per individual), with observed N50s (average) ~42 kb and 6x coverage per individual in 100kb+ reads.
The additional UCSC data (HG002_ucsc_Dec_2019_Guppy_3.2.fastq.gz) were generated in efforts to boost coverage in 100kb+ reads using the same Shasta publication protocol.
The HG002_ucsc_Oct_2018_Guppy_3.0.fastq.gz dataset was generated at UCSC in 2018 as part of optimization using unsheared DNA and LSK109 sequencing chemistry.
The HG002_ONT_PAD64459_Guppy_3.2.fastq.gz was generated and donated by ONT using unsheared DNA and LSK109 chemistry.
(1) GIAB Data for the A/J trio from 10X Genomics, Inc.
The following datasets were produced using the Chromium Genome Platform from 10X Genomics:
- Son (NA24385): 51.7x coverage (300Gb: 84.4x coverage)
- Mother (NA24143): 69.1x coverage
- Father (NA24149): 70.6x coverage
Samples were processed using Long Ranger (version 2.2) against both GRCh37 and GRCh38, as well as assembled with Supernova 2.0.1.
The son in the A/J trio (NA24385) was sequenced to two depths. The first is the same dataset that went into the Supernova assemblies. The second is a deeper (300Gb, 84.4x coverage) dataset that was also run through Long Ranger 2.2 against both GRCh37 and GRCh38.
Default Long Ranger file formats are described here: http://support.10xgenomics.com/genome-exome/software/pipelines/latest/output/overview
And Supernova file formats are described here:
The three default VCFs (phased_variants, dels and svs) were also reformatted to support structural variant analysis with SURVIVOR. In particular, the SVLEN, SVTYPE and END tags were added. The phased_variants file was filtered to remove SNVs.
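To illustrate what these INFO tags encode, the sketch below derives SVTYPE, SVLEN and END for a sequence-resolved insertion or deletion following the usual VCF conventions (END = POS + len(REF) - 1, SVLEN = len(ALT) - len(REF)). It is not the script used to reformat the files, and it does not handle symbolic alleles such as <DEL>; the example record is hypothetical.

```python
def sv_info_tags(pos: int, ref: str, alt: str) -> dict:
    """Derive SVTYPE, SVLEN and END for a sequence-resolved indel record."""
    svlen = len(alt) - len(ref)
    if svlen == 0:
        raise ValueError("REF and ALT have equal length; not an indel")
    return {
        "SVTYPE": "DEL" if svlen < 0 else "INS",
        "SVLEN": svlen,                 # negative for deletions
        "END": pos + len(ref) - 1,      # last reference base covered by REF
    }

# Hypothetical record: a 3 bp deletion at position 100
tags = sv_info_tags(pos=100, ref="ACGT", alt="A")
info = ";".join(f"{key}={value}" for key, value in tags.items())
print(info)  # SVTYPE=DEL;SVLEN=-3;END=103
```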
All samples were sequenced on the Illumina Xten at 2x150bp. The A/J trio was done using v1 of the 10x library prep protocol.
(2) Additional A/J (Son, NA24385) 10XG data from Accession: PRJNA527321.
Abstract: This study aims to explore the parameter space of 10x Linked-Read data for human diploid assembly and structural variant detection. The submission includes eight 10x Linked-Reads libraries, five for NA12878 and three for NA24385. These libraries have different parameters of Cf, Cr and MuFL.
SRX5532186: 10x data for L5
1 ILLUMINA (HiSeq X Ten) run: 1.1G spots, 321.4G bases, 146.7Gb downloads
Design: 10x Linked-Read sequencing of NA24385 with Cf=208, WmuFL=267.4X
88.6 GB - SRR8739414_1.fastq.gz
101.2 GB - SRR8739414_2.fastq.gz
SRX5532189: 10x data for L4
1 ILLUMINA (HiSeq X Ten) run: 1.2G spots, 373.7G bases, 154.3Gb downloads
Design: 10x Linked-Read sequencing of NA24385 with Cf=1504, WmuFL=246.9X
93.8 GB - SRR8739411_1.fastq.gz
108.8 GB - SRR8739411_2.fastq.gz
Data generation at BioNano Genomics in collaboration with the Genome in a Bottle Consortium
Technology: Bionano optical mapping
Link to raw data:
GM24385 DLE1:
Data collected (Molecules >150 kbp): 317 Gbp
Read N50(Molecules >150 kbp): 323 kbp
Data collected (Molecules >150 kbp): 249 Gbp
Read N50(Molecules >150 kbp): 383 kbp
Data collected (Molecules >150 kbp): 316 Gbp
Read N50(Molecules >150 kbp): 332 kbp
Software used for assembly: Bionano Solve3.2.1
Link to Solve3.2.1: http://bnxinstall.com/solve/Solve3.2.1_04122018.tar.gz
Parameters used: optArguments_haplotype_DLE1_saphyr_human.xml in Solve 3.2.1
(1) 2x250bp overlapping libraries with nominally 350bp insert size (NIST, GIAB)
For each Reference Material, 1 library was prepared using the Illumina TruSeq (LT) DNA PCR-Free Sample Prep Kits (FC-121-3001).
DNA concentrations were measured using a Qubit 2.0 fluorometer (Life Technologies). Genomic DNA (1.5 ug) was fragmented using a Covaris S2 focused ultrasonicator in microTUBE AFA Fiber Pre-Slit Snap-Cap 6x16mm micro tubes and the Covaris MicroTUBE holder (Covaris part numbers 520045 and 500114, respectively) under the following conditions for a target insert size of 350 base pairs. Size selection was done using a 96-well 0.8 mL plate (Fisher Scientific Part # AB-0859), a magnetic stand-96 (Ambion part # AM10027) and the Illumina sample purification beads according to the 350 bp insert protocol.
Adenylation of 3' ends was done in 0.2 mL PCR tubes on an MJ Research PTC-200 thermal cycler. The optional A-Tailing control was not used. Ligation of indexed paired-end adapters was done in 0.2 mL PCR tubes using the DNA adapter tubes included in the Illumina TruSeq (LT) DNA PCR-Free Sample Prep Kit on an MJ Research PTC-200 thermal cycler. The optional ligation control was not used. The libraries were cleaned up in a 96-well 0.8 mL plate (Fisher Scientific Part # AB-0859) and a magnetic stand-96 (Ambion part # AM10027) using the Illumina sample purification beads. The final libraries were run on an Agilent 2100 Bioanalyzer HS-DNA chip to verify fragment size distribution. Final library concentration was measured via qPCR using the KAPA library quantification kit for Illumina sequencing platforms (KAPA part # KK4835).
The TruSeq libraries were run on an Illumina HiSeq 2500 in Rapid mode (v2) with 2x250 paired-end reads. Pooled libraries were initially loaded at a concentration of 10 pM. The loading concentration was adjusted accordingly on subsequent runs to balance the libraries as well as possible.
(2) 6kb mate pair libraries (NIST, GIAB)
Mate Pair libraries were generated using Nextera Mate Pair Sample Preparation Kit (Illumina, Cat# FC-132-1001). Briefly, 4 µg of high molecular weight genomic DNA from the NIST Reference Materials (or from Coriell for the Asian parents) was fragmented to about 7 kb in a 400 µL tagmentation reaction containing 12 µL of Tagment Enzyme at 55°C for 30 minutes. The tagmented DNA fragments were purified with Zymo Genomic DNA Clean & Concentrator™ Kit (Zymo Research, Cat# D4010). The gap in the tagmented DNA was filled with a Strand Displacement Polymerase in a 200 µL strand displacement reaction at 20°C for 30 minutes. DNA was then purified with AMPure XP Beads (0.5x vol, Beckman Coulter, Cat# A63880) and size-selected by 0.6% agarose gel electrophoresis in 0.5x TBE buffer. The 6-9 kb fragments were excised from a gel and DNA was recovered using a Zymoclean™ Large Fragment DNA Recovery Kit (Zymo Research, Cat# D4045). Up to 600 µg of DNA was then circularized overnight at 30°C with Circularization Ligase in a 300 µL reaction.
After overnight circularization, the uncircularized linear DNA was removed by Exonuclease digestion. Both DNA Ligase and Exonuclease were inactivated by heat treatment and the addition of Stop Ligation Buffer. Circularized DNA was then sheared to smaller sized fragments (300-1000 bp) using Covaris S2 with T6 (6x32 mm) glass tube (Covaris, Part# 520031 and 520042) under these conditions: Intensity of 8, Duty Cycle of 20%, Cycles Per Burst of 200, Time of 40 sec, Temperature of 6-8°C.
The sheared DNA fragments that contain the biotinylated junction adapter are mate pair fragments. These fragments were isolated by binding to Dynabeads M-280 Streptavidin Magnetic Beads (Invitrogen, Part# 112-05D) in Bead Bind Buffer. The unbiotinylated molecules in solution are unwanted genomic fragments that are removed through a series of washes. All downstream reactions were carried out on bead and beads were washed between successive reactions. The sheared DNA was first end-repaired to generate blunt ends followed by an A-Tailing reaction to add a single "A" nucleotide to the 3' ends of the blunt fragments. Then the Illumina T-tailed indexing adapters were ligated to the A-tailed fragments.
The adapter-ligated fragments were PCR amplified [98°C/1 min, 11 cycles of (98°C/10 sec, 60°C/30 sec, 72°C/30 sec), 72°C/5 min, 4°C/hold] to generate the final library. The amplified library was purified using AMPure XP Beads (0.67x vol) and eluted in Resuspension Buffer. The size distribution of the library was determined by running a sample on an Agilent Technologies 2100 Bioanalyzer. Library concentration was measured by the Qubit dsDNA HS Assay Kit (Life Technologies, Cat# Q32851).
Pooled Mate-Pair libraries were sequenced on an Illumina HiSeq 2500 in Rapid mode (v1) with 2x101 bp paired-end reads. The loading concentration was 9.5 pM. This initial run was for library QC purposes prior to running high throughput.
The Mate-Pair libraries were also sequenced on an Illumina HiSeq 2500 in high output mode (v4) with 2x125 bp paired-end reads. Libraries were sequenced on individual lanes (not pooled). The template loading concentration for each lane was adjusted based on the cluster density from the QC run. Two replicate flowcells were sequenced simultaneously, each with 6 lanes of mate-pair libraries.
To assess duplication rate, coverage, and insert size of the mate-pair libraries, reads were stripped of adapter sequences. Read pairs were removed if the sequence of one or both mates was less than 20 bp after adapter stripping, or if the adapter sequence was at the beginning rather than the end of a read (indicating the read inserts were likely to be in inward-facing F/R orientation rather than the expected outward-facing R/F orientation). Reads were then mapped to the hg19 reference genome from UCSC or the GRCh38 reference genome with decoy but no alts using bwa mem (Li 2013) with default settings, and duplicates were marked using samblaster (Faust 2014).
The high rate of PCR duplicates (close to 50% in some libraries) resulted in lower than expected sequence coverage (13-17x average across all sequenced genomic positions). A more relevant metric for mate-pair data is the physical coverage, which measures the number of inferred fragments that cover a particular genomic position (including both the sequenced ends as well as the unsequenced genomic region between the ends). Because the empirical insert size average was between 6-7 kb per individual, the physical coverage of the genome was quite high (>400x per individual). BAMs were stripped of duplicate reads to reduce file size, but the full data are available in fastq format.
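The difference between sequence coverage and physical coverage can be made concrete with a short calculation. The numbers below are hypothetical, chosen only to fall in the ranges described above: 200 million unique read pairs, a 6.5 kb mean insert, 2x125 bp reads, and a 3.1 Gbp genome.

```python
def sequence_coverage(fragments: int, read_length_bp: int, genome_size_bp: float) -> float:
    """Bases actually sequenced (two reads per fragment) per genomic position."""
    return fragments * 2 * read_length_bp / genome_size_bp

def physical_coverage(fragments: int, mean_insert_bp: float, genome_size_bp: float) -> float:
    """Fragments spanning a position, counting the unsequenced interior too."""
    return fragments * mean_insert_bp / genome_size_bp

# Hypothetical inputs, not measured values from these libraries
pairs, insert_bp, read_bp, genome_bp = 200_000_000, 6_500, 125, 3.1e9
seq_cov = sequence_coverage(pairs, read_bp, genome_bp)
phys_cov = physical_coverage(pairs, insert_bp, genome_bp)
print(f"{seq_cov:.1f}x sequence, {phys_cov:.0f}x physical")  # 16.1x sequence, 419x physical
```

With these inputs the physical coverage is 26 times the sequence coverage, the ratio of insert size to the 250 sequenced bases per fragment.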
Strand-seq data produced, sequenced and processed by Ashley D. Sanders and Jan O. Korbel at the EMBL, Heidelberg on behalf of the Human Pangenome Consortium.
Strand-specific libraries were generated as described in Sanders et al. Nat Protoc. 2017 and sequenced on a NextSeq Illumina platform. 192 barcoded single-cell libraries were pooled for sequencing of the HG002 sample.
Raw demultiplexed fastq files from the paired-end sequencing run (80bp read length) were uploaded for each single cell library. These data can be found at https://s3-us-west-2.amazonaws.com/human-pangenomics/index.html?prefix=HG002/hpp_HG002_NA24385_son_v1/Strand_seq/.
Strand-seq stat
- Estimated depth of coverage for 83 selected libraries: 2.01x
- % of GRCh38 covered by at least one read in 83 selected libraries: 69.97%
Sequencing libraries represent two distinct protocols/optimized methods (labeled HiC.1 and HiC.2, from two anonymized companies) to reach "smooth-coverage", with access to high coverage Nova-Seq (250bp PE). Pay attention to the coverage difference between the two methods during processing and comparisons.
Guidance for processing is listed below:
These three sequencing libraries are triplicates made from one sample of 100,000 cells of the HG002 sample. They are proximity ligation libraries generated using a procedure that includes a linker sequence. Each library was generated using
The DNA fragmentation and proximity ligation do not use restriction enzymes. Thus, the mapping and analysis procedures may need to be modified.
The raw fastq data should be aligned using bwa mem with the -5SP option. This effectively excludes the linker sequence from the bam output alignments. For example: bwa mem -5SP index_path/hg38.fa Sample_R1.fastq Sample_R2.fastq
The proximity-ligation procedure uses an adapter with a 3' overhang as template for the proximity ligation. The adapter sequence is as follows:
5' GGTTCGTCCATCGATC 3'
3' CCAAGCAGGTAG 5'
The resulting bridge after proximity ligation will be (Ns being genomic DNA)
5' NNNNGGTTCGTCCATCGATCCCAAGCAGGTAGNNNN 3'
3' NNNNCCAAGCAGGTAGCTAGGGTTCGTCCATCNNNN 5'
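The two adapter strands listed above can be checked for base pairing with a few lines of code. The sketch below confirms that the short strand (written 3' to 5' in the text) is the complement of the first 12 bases of the long strand, which leaves a 4-base 3' overhang; the variable names are illustrative.

```python
# Translation table mapping each base to its Watson-Crick partner
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def complement(seq: str) -> str:
    """Base-by-base complement, without reversing the sequence."""
    return seq.translate(COMPLEMENT)

long_strand = "GGTTCGTCCATCGATC"   # written 5'->3', as in the text
short_strand = "CCAAGCAGGTAG"      # written 3'->5', as in the text

duplex = long_strand[: len(short_strand)]
overhang = long_strand[len(short_strand):]
assert complement(duplex) == short_strand
print(overhang)  # GATC
```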
Files - 2x150bp Sequencing
Files - 2x250bp Sequencing*
* As of Jan 22 2020, only one dataset is available. Additional datasets will be made available within 1 week of the initial data freeze, in order to increase the total sequence depth of HiC_2 libraries to be equivalent with HiC_1 libraries.
These two sequencing libraries are generated from biological replicates (i.e. two independent cultures) of HG002 lymphoblast cells obtained from Coriell (https://www.coriell.org/0/Sections/Search/Sample_Detail.aspx?Ref=GM24385&Product=CC). The 2x150bp sequencing library is prepared from Biological Replicate 1, while the 2x250bp library is prepared from Biological Replicate 2.
They are HiC libraries generated using a procedure that deploys a specialized combination of restriction enzymes (RE), where the RE cut site distributions are optimized to produce uniform genomic coverage. The specific RE cut site motifs are: ^GATC, G^ANTC, C^TNAG, T^TAA. The "^" is the cut site on the + DNA strand, and the 'N' can be any of the 4 genomic bases. Because of this, there are 10 possible RE cut sites.
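The count of 10 cut sites follows from expanding each 'N' into the four bases. A minimal sketch that enumerates the concrete recognition sequences from the four motifs (the helper name is illustrative):

```python
from itertools import product

MOTIFS = ["^GATC", "G^ANTC", "C^TNAG", "T^TAA"]

def expand(motif: str):
    """List every concrete sequence matched by a motif, expanding N to A/C/G/T."""
    bare = motif.replace("^", "")  # drop the cut-position marker
    choices = ["ACGT" if base == "N" else base for base in bare]
    return ["".join(combo) for combo in product(*choices)]

sites = [site for motif in MOTIFS for site in expand(motif)]
print(len(sites))  # 10
```

Such a list could serve as input to tools that expect explicit restriction-site sequences rather than degenerate motifs.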
We align the HiC data using bwa mem with the -SP5M option.
For example: bwa mem -SP5M index_path/hg38.fa Sample_R1.fastq Sample_R2.fastq
A full mapping and data analysis pipeline for certain applications, such as identifying SNVs from HiC data, can be made available upon request.
Other Analysis Considerations - Normalization
The RE cut site distributions are optimized to produce uniform genomic coverage. For certain applications such as genome scaffolding, several open source tools (e.g. SALSA2) conduct data normalization based on RE cut site locations. We advise analysis groups to perform their analyses both with normalization based on known RE cut site locations AND without assuming a priori knowledge of RE cut site locations. For applications such as polishing, knowledge of RE cut site locations could help reduce errors (especially at chimeric junctions) and improve polishing accuracy.
Available Parental Datasets:
Father HG003 NA24149
- 300x PCR-free Illumina 150bp + 40x PCR-free 250bp (GIAB): ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/data/AshkenazimTrio/HG003_NA24149_father/NIST_Illumina_2x250bps/
- PacBio HiFi (Google + HudsonAlpha + GIAB): https://s3-us-west-2.amazonaws.com/human-pangenomics/index.html?prefix=HG003/PacBio_HiFi/
Mother HG004 NA24143
- 300x PCR-free Illumina 150bp + 40x PCR-free 250bp (GIAB): ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/data/AshkenazimTrio/HG004_NA24143_mother/NIST_Illumina_2x250bps/
- PacBio HiFi (Google + HudsonAlpha + GIAB): https://s3-us-west-2.amazonaws.com/human-pangenomics/index.html?prefix=HG004/PacBio_HiFi/