-
Notifications
You must be signed in to change notification settings - Fork 3
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Updates to submission portal jobs to handle sequencing data #329
Conversation
…urrent nmdc-schema
…ce OmicsProcessing and DataObject records
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for updating the tests! I wonder how it'll affect #301.
Looks good to me (I reviewed it in terms of "do I understand what this does in isolation"), with the caveat that I don't have experience with the submission portal or this part of the codebase (i.e. "do I understand what this does to the big picture").
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You addressed all my concerns. Thanks!
Re: #315 (will close once we actually run the job on production)
We previously had jobs to fetch a submission portal entry, translate it into
Study
andBiosample
objects, validate, and submit to Mongo. It turns out Andrew Tritt's submission (and possibly future ones??) came with sequencing data. The submission portal doesn't have the capability to accept that information, so it was provided in a CSV file mapping sample IDs (that is, the ID provided in the submission portal, not an NMDC minted ID) to FASTA files.These changes:
OmicsProcessing
andDataObject
slots, with one extra column to join on sample IDs. If URLs to such files are provided in the job config, each row will be expanded into the corresponding objects by theSubmissionPortalTranslator
class.SubmissionPortalTranslator
class and corresponding tests were also a little behind-the-times on some recent schema changes (DOI stuff, ID patterns, etc). That has been resolved, and the test now does a schema validation step to catch these things earlier (especially if we, ahem, ran tests against PRs in CI)tests/test_data/test_submission_portal_translator_data.yaml
-- it was redundant as far as testing is concerned and it made test failure diffs hard to read. I added a second test case in that file to cover the case of providingOmicsProcessing
andDataObject
mapping data.