redraft clinvar write #175

Merged
merged 1 commit into from
Dec 19, 2022


MattWellie
Collaborator

@MattWellie MattWellie commented Dec 19, 2022

Fixes

Proposed Changes

  • The initial version of the ClinVar summarising script built all ~800k entries as a single JSON object, converted that to a DataFrame, then converted the DataFrame to a Hail Table. Somehow, turning 200MB of JSON into a Hail Table this way consumed 80GB+ of memory and required a massive resource allocation.
  • This PR switches to the syntax already in use by @vladsavelyev (modelled on the original pipeline from Broad?): the data is written and ingested line by line against a fixed schema (see code here).
  • The whole process now takes 3 mins, needs only a trivial resource allocation, and is light enough that it could be run with each AIP run (i.e. pull down the latest ClinVar data at runtime and regenerate annotations on the fly). For reference, this can now be run on my local machine, whereas before it could barely run in GCP.
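
A minimal sketch of the line-by-line approach described above, not the code from this PR. The field names in `FIELDS`, the helper names, and the file path are hypothetical. The Hail call assumes that `hl.import_table` with `no_header=True` exposes each line as field `f0` and parses it as JSON when given a struct type; Hail is imported lazily so the writer can be used without it.

```python
import json
from typing import Iterable

# Hypothetical fixed schema: every line carries exactly these keys, in this order.
FIELDS = ('contig', 'position', 'alleles', 'clinical_significance', 'gold_stars')


def write_json_lines(entries: Iterable[dict], path: str) -> int:
    """Write one JSON object per line and return the number of lines written.

    Entries are streamed, so memory use does not grow with the entry count.
    """
    count = 0
    with open(path, 'w', encoding='utf-8') as handle:
        for entry in entries:
            # Selecting by FIELDS drops extra keys and raises KeyError on a missing one.
            row = {key: entry[key] for key in FIELDS}
            handle.write(json.dumps(row) + '\n')
            count += 1
    return count


def import_json_lines(path: str):
    """Assumed Hail ingestion: parse each line against a fixed struct schema."""
    import hail as hl  # lazy import; only needed for this step

    schema = hl.dtype(
        'struct{contig:str,position:int32,alleles:array<str>,'
        'clinical_significance:str,gold_stars:int32}'
    )
    # With no header, the single column is named f0 and typed as the struct.
    ht = hl.import_table(path, no_header=True, types={'f0': schema})
    # Promote the struct's fields to top-level table fields.
    return ht.transmute(**ht.f0)
```

Because the schema is declared up front, Hail does not need to infer types from the full dataset, which is the likely reason this route is so much lighter than going through a DataFrame.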

Notes

Conflicting and Unknown entries were previously removed, as they are not relevant to any AIP category, and removing them from consideration was the only way to get the process to complete. These could now be re-introduced if we feel like keeping all that data.
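
As an illustration of how that filter could be made optional, here is a small sketch. The `clinical_significance` key, the `EXCLUDED` set, and the `keep_all` flag are hypothetical names; only the Conflicting and Unknown labels come from the note above.

```python
from typing import Iterable, Iterator

# Ratings that are not relevant to any AIP category (per the note above).
EXCLUDED = frozenset({'Conflicting', 'Unknown'})


def relevant_entries(entries: Iterable[dict], keep_all: bool = False) -> Iterator[dict]:
    """Yield entries one at a time, skipping excluded ratings unless keep_all is set."""
    for entry in entries:
        if keep_all or entry['clinical_significance'] not in EXCLUDED:
            yield entry
```

Calling `relevant_entries(entries, keep_all=True)` would re-introduce the full dataset without changing the rest of the process.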

I'd also like to find a good place to stash this new table in cpg-reference-main so that the same data is available to all project buckets. I can see the following process working:

  • Add this script in analysis-runner's scripts folder
  • Copy the latest data from the NCBI FTP into the CPG-reference bucket
  • Run to generate the new data, and write back into cpg-reference-main/test
  • Re-summarised data is then available to all projects

Checklist

  • Related Issue created
  • Tests covering new change
  • Linting checks pass

@MattWellie MattWellie merged commit f713e18 into main Dec 19, 2022
@MattWellie MattWellie deleted the change_up_clinvar branch December 19, 2022 01:47
Development

Successfully merging this pull request may close these issues.

Change Clinvar Hail Table generation to per-line JSON