Fix_Aug2017 #41

Merged
merged 31 commits into from
Sep 28, 2017


XiaoleiZ (Collaborator) commented Aug 29, 2017

Fix the bugs reported in the issues and in the referee reports.

The key changes are:

In parse_clinvar_xml.py:

  1. Adding columns start, stop and strand for variant representation. Fix #36 (Feature request: Including strand info and genomic start and end coordinates)
  2. Adding columns pathogenic, likely_pathogenic, uncertain_significance, likely_benign and benign (the standard terms used by the ACMG guideline) to record how many individual submissions reported the variant as "Pathogenic", "Likely pathogenic", "Uncertain significance", "Likely benign" and "Benign" respectively (case-insensitive). Note that these replace the previous pathogenic and benign columns, which encoded binary information. Fix #40 (Improper labelling of conflicting variants as pathogenic)
  3. Adding column scv to list the SCV accession numbers of all individual submissions
  4. Changing column names: every column name with the prefix measureset now uses variation instead, since the latter is more familiar to ClinVar users.
  5. Changing the way the gene symbol is extracted: the symbol used in the variant name/title is now taken. Fix #37 (Wrong gene mapping) and #31 (Sometimes "symbol" disagrees with primary hgvs gene annotation)
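
As a rough illustration of items 2 and 5, here is a minimal sketch of case-insensitive counting of the five terms and of pulling a gene symbol out of a variant name. It is not the code in parse_clinvar_xml.py: the function names, the regular expression and the example name format (`NM_...(SYMBOL):c....`) are assumptions made for the example.

```python
import re

# The five terms named in the description, mapped to their column names.
SIGNIFICANCE_COLUMNS = {
    "pathogenic": "pathogenic",
    "likely pathogenic": "likely_pathogenic",
    "uncertain significance": "uncertain_significance",
    "likely benign": "likely_benign",
    "benign": "benign",
}


def count_significance(submissions):
    """Count submissions per term, ignoring case; any other term is skipped."""
    counts = {column: 0 for column in SIGNIFICANCE_COLUMNS.values()}
    for term in submissions:
        column = SIGNIFICANCE_COLUMNS.get(term.strip().lower())
        if column is not None:
            counts[column] += 1
    return counts


# Assumed name format: the symbol sits in parentheses right before a colon,
# e.g. "NM_000059.3(BRCA2):c.3G>A (p.Met1Ile)".
SYMBOL_IN_NAME = re.compile(r"\(([^()]+)\):")


def symbol_from_name(variant_name):
    """Return the parenthesised symbol from a variant name, or None."""
    match = SYMBOL_IN_NAME.search(variant_name)
    return match.group(1) if match else None
```

Keeping counts rather than a binary flag means a variant with both "Pathogenic" and "Benign" submissions stays visible as such instead of being labelled one way or the other.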

In group_by_allele.py:
6. Adding the counts for each of the terms pathogenic, likely_pathogenic, uncertain_significance, likely_benign and benign
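
One possible reading of item 6 is that the per-term counts are summed across all records that share an allele. The sketch below shows that reading only; the key columns (`chrom`, `pos`, `ref`, `alt`) and the function name are hypothetical and may not match group_by_allele.py.

```python
from collections import defaultdict

COUNT_COLUMNS = [
    "pathogenic",
    "likely_pathogenic",
    "uncertain_significance",
    "likely_benign",
    "benign",
]


def sum_counts_by_allele(rows, key_columns=("chrom", "pos", "ref", "alt")):
    """Sum each count column over rows that share the same allele key."""
    totals = defaultdict(lambda: dict.fromkeys(COUNT_COLUMNS, 0))
    for row in rows:
        key = tuple(row[column] for column in key_columns)
        for column in COUNT_COLUMNS:
            # Values read from a tab-delimited file arrive as strings.
            totals[key][column] += int(row[column])
    return dict(totals)
```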

When joining the variant_summary.txt file:
7. Replacing the R script with a Python equivalent. Fix #35
8. Changing the way the column conflicted is encoded: following the updated terms used in ClinVar aggregated variation reports, conflicted now indicates whether the aggregated report for the variation is "Conflicting interpretations of pathogenicity". Fix #40
9. Propagating columns such as last_evaluated and submitters_ordered. Fix #38
10. Removing the duplicated records in variant_summary before joining: the variant_summary file is not allele_id-specific. Variants with alternative loci (e.g. in the PAR) or complex variations such as translocations have more than one set of genomic coordinates but the same allele_id, and each alternative locus is recorded as a separate entry in the variant_summary file. I simply remove the duplicated records after extracting the columns of interest from variant_summary. Currently, only one of the sequence locations of these variants is kept after parsing the XML file. The current pipeline still has problems handling these types of variants: e.g. variants in the PAR are represented on the Y chromosome, so their info cannot be found in ExAC and gnomAD, and for complex variations such as translocations, only one allele is represented in the final output files. Since these are rare cases, I am not sure how to deal with them uniformly. For the variants with alternative loci, a separate VCF file is available for download on the ClinVar FTP. Fix #39
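
Items 8 and 10 can be sketched roughly as follows: project each variant_summary row onto the columns of interest, drop rows that become identical after that projection, and derive the conflicted flag from the aggregated significance string. This is only an illustration; the column names (`allele_id`, `clinical_significance`) and function names are hypothetical, and the Python script in this PR may do it differently.

```python
def dedupe_rows(rows, columns):
    """Keep the first occurrence of each distinct projection onto `columns`."""
    seen = set()
    unique = []
    for row in rows:
        projected = tuple(row[column] for column in columns)
        if projected not in seen:
            seen.add(projected)
            unique.append(dict(zip(columns, projected)))
    return unique


def is_conflicted(clinical_significance):
    """Return 1 if the aggregated report is a conflicting interpretation, else 0."""
    text = clinical_significance.lower()
    return int("conflicting interpretations of pathogenicity" in text)
```

Because the genomic coordinates are not among the projected columns, two entries for the same allele_id on alternative loci collapse into one row, which matches the limitation described in item 10.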

In add_gnomad_field.py and add_exac_field.py:
11. Adding DP (approximate read depth) so that users can query coverage information
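
For context, DP is a key in the VCF INFO column, which is a semicolon-separated list of `KEY=value` pairs and bare flags. A minimal sketch of reading one key from a raw INFO string is shown below; the helper name is hypothetical, and how add_gnomad_field.py and add_exac_field.py actually read the field is not shown here.

```python
def info_value(info, key):
    """Return the string value of `key` in a VCF INFO string, or None if absent."""
    for entry in info.split(";"):
        name, separator, value = entry.partition("=")
        # Bare flags such as "DB" have no "=" and therefore no value.
        if name == key and separator:
            return value
    return None
```

For example, `info_value("AC=3;AN=100;DP=2345;DB", "DP")` returns the string `"2345"`, which the caller can convert to an integer.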

kristjaneerik and others added 27 commits May 5, 2017 10:38
…ing reference genomes if desired; use gunzip -c rather than zcat to enable compatibility with os x; add new ordered columns
Fix the issues reported in ISSUES and referee reports
change measureset to variation;
update the doi link
@konradjk (Contributor)

FWIW we just ran this code as-is and it worked totally fine! Might want to merge since master definitely does not work on the current clinvar xml

bw2 (Contributor) commented Sep 26, 2017

@XiaoleiZ should we merge this into master?

4 participants