The Platinum Genomes Truthset

Platinum Genomes

This repo contains the Platinum Genomes small variant truthset for samples NA12878 (also known as hg001) and NA12877. Platinum Genomes truthset variants were validated using haplotype inheritance information through a well-studied 17-member pedigree (CEPH 1463).

Truthsets

Truthsets are made up of a VCF of validated variant records and a BED file of confident regions. These files aren't huge (hundreds of MB), but they are too large to play nicely with git and GitHub. Here are a few ways to download them:

AWS CLI

Truthset files are stored in an AWS S3 bucket called platinum-genomes. One way to download them is via the AWS CLI:

aws s3 cp s3://platinum-genomes/2017-1.0 pg2017 --recursive

To download without AWS credentials, add the --no-sign-request flag. You can also explore the bucket and download individual files with this S3 bucket display.
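
As a minimal sketch, the anonymous download can be wrapped in a small shell function. The function name download_truthset is hypothetical and not part of this repo; the bucket name, release prefix and --no-sign-request flag are the ones described in this section.

```shell
# download_truthset RELEASE DEST
# Recursively copy one truthset release from the platinum-genomes bucket
# without AWS credentials. The function name is illustrative only.
download_truthset() {
    aws s3 cp "s3://platinum-genomes/$1" "$2" --recursive --no-sign-request
}
```

Calling download_truthset 2017-1.0 pg2017 should then be equivalent to the command above with the --no-sign-request flag added.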

wget

Alternatively, use wget or similar with the file URIs in this repo, e.g.:

wget -xi files/2017-1.0.files

You can then use the relevant md5 checksum in each release to validate data integrity.

Finally, truthset files can also be downloaded via FTP, e.g.:

wget ftp://platgene_ro:''@ussd-ftp.illumina.com/2017-1.0/hg38/small_variants/NA12878/NA12878.vcf.gz

Usage

To compare a VCF against these truthsets, we recommend hap.py, which performs a sophisticated haplotype comparison, rather than a simpler tool such as bcftools isec.
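
A hedged sketch of such a comparison follows. It assumes hap.py is installed and on your PATH, and that you have a reference FASTA matching the truthset build. The function name compare_to_truthset and the HAPPY override variable are hypothetical; the truth-then-query argument order and the -f (confident regions), -r (reference) and -o (output prefix) options follow hap.py's documented command line.

```shell
# compare_to_truthset QUERY_VCF TRUTH_VCF CONFIDENT_BED REFERENCE_FASTA OUT_PREFIX
# Run hap.py with the truthset VCF as truth and your VCF as the query,
# restricting the comparison to the confident regions BED.
# HAPPY may point at a different hap.py executable (illustrative convention).
compare_to_truthset() {
    "${HAPPY:-hap.py}" "$2" "$1" -f "$3" -r "$4" -o "$5"
}
```

All five arguments are paths you supply; none of the file names are fixed by this sketch.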

Applications wrapping hap.py and containing these truthsets are available on the following platforms:

Details

See the attached wiki for technical information.

Raw data

Sequencing data for NA12878, NA12877 and samples NA12889-NA12892 (grandparents) are available through the ENA:

BaseSpace users can access this data via a shared BaseSpace project:

Sequencing data for the remaining pedigree members is not consented for public release and so is made available through the dbGaP database:

Issues

Please open an issue for comments, bug reports and other feedback.

Citation

For further information or to cite Platinum Genomes resources, see:

  • Eberle, MA et al. (2017) A reference data set of 5.4 million phased human variants validated by genetic inheritance from sequencing a three-generation 17-member pedigree. Genome Research, 27:157-164. doi:10.1101/gr.210500.116