In 1 sentence
This repo provides tools to convert ClinVar data into a tab-delimited flat file, along with the resulting flat file itself.
The main output files are:
- clinvar_allele_trait_pairs.tsv.gz: variant-condition-specific records
- clinvar_alleles.tsv.gz: variant-specific aggregated records, generated by grouping clinvar_allele_trait_pairs.tsv.gz by variant
- clinvar_alleles.vcf.gz: generated from clinvar_alleles.tsv.gz
- clinvar_alleles_with_exac.tsv.gz: generated from clinvar_alleles.tsv.gz
- clinvar_multi_alleles.tsv (with option
ClinVar is a public database hosted by NCBI for the purpose of collecting assertions as to genotype-phenotype pairings in the human genome. One common use case for ClinVar is as a catalogue of genetic variants that have been reported to cause Mendelian disease. In our work in the MacArthur Lab, we have two major use cases for ClinVar:
- To check whether candidate causal variants we find in Mendelian disease exomes have been previously reported as pathogenic.
- To pair with ExAC data to enable exome-wide analyses of reportedly pathogenic variants.
ClinVar makes its data available via FTP in three formats: XML, TXT, and VCF. We found that none of these files was ideally suited to our purposes. The VCF only contains variants present in dbSNP; it is not a comprehensive catalogue of ClinVar variants. The TXT file lacks certain annotations such as PubMed IDs for related publications. The XML file is large and complex, with multiple entries for the same genomic variant, making it difficult to quickly look up a variant of interest. In addition, neither the XML nor the TXT representation is guaranteed to be unique on genomic coordinates. Both also contain many genomic coordinates that have been parsed from HGVS notation, and these may be right-aligned (in contrast to left alignment, the standard for VCF) and may also be non-minimal (containing additional nucleotides of context to the left or right of a given variant).
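To make the uniqueness problem concrete, the following minimal sketch shows three different coordinate representations of the same two-base deletion. The toy reference sequence and the helper name apply_variant are assumptions made for illustration; they are not part of this repository.

```python
def apply_variant(reference, pos, ref, alt):
    """Apply a (pos, ref, alt) variant to a reference string; pos is 1-based."""
    start = pos - 1
    if reference[start:start + len(ref)] != ref:
        raise ValueError("ref allele does not match the reference sequence")
    return reference[:start] + alt + reference[start + len(ref):]


# Toy reference containing a CA repeat (hypothetical, not real genomic sequence).
reference = "GGGCACACACAGGG"

# Three ways to describe deleting one CA unit from the repeat.
left_aligned = (3, "GCA", "G")    # left-aligned and minimal (VCF convention)
right_aligned = (9, "ACA", "A")   # right-aligned, as HGVS-derived coordinates may be
non_minimal = (3, "GCAC", "GC")   # carries an extra base of context on the right

results = {apply_variant(reference, *v) for v in (left_aligned, right_aligned, non_minimal)}
print(results)  # a single sequence: all three describe the same allele
```

Because all three tuples yield the same altered sequence, a lookup keyed on raw (pos, ref, alt) would miss matches unless the variants are normalized first.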
- Download the latest XML and TXT dumps from ClinVar FTP.
- Parse the XML file using src/parse_clinvar_xml.py to extract fields of interest into a flat file.
- Sort on genomic coordinates (we use GRCh37).
- Normalize using our Python implementation of vt normalize (see [Tan 2015]). The output file clinvar_allele_trait_pairs.tsv.gz after this step contains the variant-condition specific ClinVar records.
- Join with the TXT file using src/join_data.R to aggregate interpretations from multiple submitters independent of conditions.
- Sort and de-duplicate (this removes duplicates arising from duplicate records in the TXT dump). The output file clinvar_alleles.tsv.gz after this step contains the variant-specific aggregated data.
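The normalization step above can be sketched as follows. This is a minimal illustration of the left-align-and-trim procedure described in Tan 2015, not the repository's actual code; the function name normalize and the use of a plain string as the reference contig are assumptions.

```python
def normalize(pos, ref, alt, contig):
    """Left-align and trim a variant. pos is 1-based; contig is the reference
    sequence of a single chromosome as a string (an assumption for this sketch)."""
    changed = True
    while changed:
        changed = False
        # Drop a shared rightmost base.
        if ref and alt and ref[-1] == alt[-1]:
            ref, alt = ref[:-1], alt[:-1]
            changed = True
        # If an allele became empty, extend both alleles one base to the left.
        if not ref or not alt:
            if pos <= 1:
                raise ValueError("cannot extend left of the contig start")
            pos -= 1
            base = contig[pos - 1]
            ref, alt = base + ref, base + alt
            changed = True
    # Trim shared leading bases, keeping at least one base in each allele.
    while len(ref) > 1 and len(alt) > 1 and ref[0] == alt[0]:
        ref, alt = ref[1:], alt[1:]
        pos += 1
    return pos, ref, alt


contig = "GGGCACACACAGGG"  # toy sequence, not real genomic data
print(normalize(9, "ACA", "A", contig))    # right-aligned deletion -> (3, 'GCA', 'G')
print(normalize(3, "GCA", "GTA", contig))  # padded SNV -> (4, 'C', 'T')
```

After normalization, equivalent representations collapse to the same (pos, ref, alt) tuple, which is what makes the later sort and de-duplicate steps meaningful.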
†Because a ClinVar record may contain multiple assertions of Clinical Significance, we defined three additional columns:
- One column is 1 if the variant has ever been asserted "Pathogenic" or "Likely pathogenic" by any submitter for any phenotype, and 0 otherwise.
- A second column is 1 if the variant has ever been asserted "Benign" or "Likely benign" by any submitter for any phenotype, and 0 otherwise.
- A third column, which flags conflicted variants, is 1 if the variant has ever been asserted "Pathogenic" or "Likely pathogenic" by any submitter for any phenotype, and has also been asserted "Benign" or "Likely benign" by any submitter for any phenotype, and 0 otherwise. Note that having one assertion of pathogenic and one of uncertain significance does not count as conflicted for this column.
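As an illustration of the rules above, here is a minimal sketch that derives the three flags from a list of Clinical Significance assertions. The function name, the returned tuple order, and the case-insensitive matching are assumptions for this example; the pipeline's own implementation may differ.

```python
PATHOGENIC_TERMS = {"pathogenic", "likely pathogenic"}
BENIGN_TERMS = {"benign", "likely benign"}


def significance_flags(assertions):
    """Return (pathogenic, benign, conflicted) as 0/1 flags.

    assertions: iterable of Clinical Significance strings from all submitters
    and phenotypes for one variant (hypothetical input shape).
    """
    seen = {a.strip().lower() for a in assertions}
    pathogenic = int(bool(seen & PATHOGENIC_TERMS))
    benign = int(bool(seen & BENIGN_TERMS))
    # Conflicted only when both a pathogenic-type and a benign-type assertion exist.
    conflicted = int(pathogenic and benign)
    return pathogenic, benign, conflicted


print(significance_flags(["Pathogenic", "Uncertain significance"]))  # (1, 0, 0)
print(significance_flags(["Likely pathogenic", "Benign"]))           # (1, 1, 1)
```

The first call shows the case noted above: a pathogenic assertion paired with uncertain significance is not flagged as conflicted.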
To run the pipeline:
cd ./src
pip install --user --upgrade -r requirements.txt
python master.py -R hg19.fasta -E ExAC.r0.3.1.sites.vep.vcf.gz
Run python master.py -h for additional options. By default, the pipeline outputs only simple variations, i.e. variations consisting of a single variant. With the -M option, the pipeline also outputs a second flat file containing complex variations (i.e. more than one variant interpreted together).
Because ClinVar contains a great deal of data complexity, we made a deliberate decision not to attempt to capture all fields in our resulting file. We made an effort to capture a subset of fields that we believed would be most useful for genome-wide filtering, and also included measureset_id as a column to enable the user to look up additional details on the ClinVar website. For instance, the page for the variant with measureset_id 7105 is located at ncbi.nlm.nih.gov/clinvar/variation/7105/. Note that we also do not capture all of the complexity of the fields that are included. For example, the ClinVar website may display multiple HGVS notations for a single variant, while our file displays only one HGVS notation drawn from the ClinVar TXT dump.
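A small helper along these lines could turn a measureset_id value into the corresponding ClinVar page address. The function name and the https://www. prefix are assumptions; the path pattern follows the example given above.

```python
def clinvar_variation_url(measureset_id):
    """Build the ClinVar variation page URL for a measureset_id (hypothetical helper)."""
    # int() rejects values that are not whole numbers, e.g. empty strings.
    return "https://www.ncbi.nlm.nih.gov/clinvar/variation/{}/".format(int(measureset_id))


print(clinvar_variation_url(7105))
```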
To analyze the output file in R, a suitable line of code to read it in would be:
clinvar = read.table('clinvar_alleles.tsv',sep='\t',header=T,quote='',comment.char='')
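For readers working in Python rather than R, a roughly equivalent reader can be sketched with the standard library alone. The function name read_clinvar_table is an assumption; csv.QUOTE_NONE plays the role of quote='' in the R call, and the csv module has no comment character, which matches comment.char=''.

```python
import csv
import gzip


def read_clinvar_table(path):
    """Read a tab-delimited file (optionally gzipped) into a list of dicts,
    one per row, keyed by the header line. All values are kept as strings."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt", encoding="utf-8", newline="") as handle:
        # QUOTE_NONE: treat quote characters as ordinary text.
        reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        return list(reader)


# Example usage (assumes the file is in the working directory):
# clinvar = read_clinvar_table("clinvar_alleles.tsv.gz")
```

This loads the whole table into memory; for very large files, iterating over the reader row by row would be a lighter alternative.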
License, terms, and conditions
ClinVar data, as a work of the United States federal government, are in the public domain and are redistributed here under the same terms as they are distributed by ClinVar itself. Importantly, note that ClinVar data are "not intended for direct diagnostic use or medical decision-making without review by a genetics professional". The code in this repository is distributed under an MIT license.