-
Notifications
You must be signed in to change notification settings - Fork 3
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
NEON benthic samples ingest pipeline #308
Conversation
|
||
BENTHIC_ENV_MEDIUM_MAPPINGS = { | ||
"plant-associated": { | ||
"term_id": "ENVO:01001057", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is not an environmental medium
} | ||
|
||
BENTHIC_LOCAL_SCALE_MAPPINGS = { | ||
"pool": {"term_id": "ENVO:03600094", "term_name": "stream pool"}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Mappings should live outside code
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is out strategy for alternative identifiers that link back to NEON samples?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry for the delay in reviewing this! I don't know if you have specific deadlines on this so I'm hitting approve just to not hold anything up, but I do have some mild concerns about mixing data fetching with data translation in the same class. If it works for now it's probably fine, but it's not a model I'd like to proliferate.
return NeonBenthicDataTranslator(benthic_data) | ||
|
||
def test_neon_envo_mappings_download(self): | ||
response = requests.get( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm always apprehensive about making live network requests in tests and this is all this and the following test do. I'd consider removing them.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great comment, yes! I'll remove live network requests from all the NEON pipeline tests.
|
||
return minted_nmdc_ids | ||
|
||
def test_get_database(self, translator): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add an assertion to this test that the database validates against the schema? If in the future we update the schema in such a way that this translator no longer produces compliant objects, we want to find out early by this test failing rather than later if/when we try to use it again.
See:
validation_result = validate_json(json_dumper.to_dict(actual), mongo_db) |
"neonRawDataFile", self.conn, if_exists="replace", index=False | ||
) | ||
|
||
def get_site_by_code(self, site_code: str) -> str: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Similar to above comment it would be great to see this implemented as a separate @op
that just produces a mapping from site code to location string (perhaps by calling the https://data.neonscience.org/api/v0/sites
endpoint) and handing that mapping to the Translator
class.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yup, excellent suggestion. I made an @op
that produces a dict mapping from site code to location here: https://github.com/microbiomedata/nmdc-runtime/blob/benthic-samples-ingest/nmdc_runtime/site/ops.py#L931-L947
f"You are missing one of the aquatic benthic microbiome tables: {neon_amb_data_tables}" | ||
) | ||
|
||
neon_envo_mappings_file = "https://raw.githubusercontent.com/microbiomedata/nmdc-schema/main/assets/neon_mixs_env_triad_mappings/neon-nlcd-local-broad-mappings.tsv" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd prefer seeing a model where Translator
classes are purely about translation and have no knowledge of information sources. In that world this type of information would be fetched by a separate @op
that only knows about data fetching (and is configured by inputs to the run) and passes that off to the Translator
. That makes testing easier and more reliable because there are no external network dependencies. And it means that these mapping file locations don't have to be hardcoded, but can change from run to run.
This may not exactly meet your needs but it illustrates the model:
nmdc-runtime/nmdc_runtime/site/ops.py
Line 829 in 017aa9b
def get_csv_rows_from_url(url: Optional[str]) -> List[Dict]: |
nmdc-runtime/nmdc_runtime/site/graphs.py
Lines 148 to 161 in 017aa9b
omics_processing_mapping = get_csv_rows_from_url(omics_processing_mapping_file_url) | |
data_object_mapping = get_csv_rows_from_url(data_object_mapping_file_url) | |
biosample_extras = get_csv_rows_from_url(biosample_extras_file_url) | |
biosample_extras_slot_mapping = get_csv_rows_from_url( | |
biosample_extras_slot_mapping_file_url | |
) | |
database = translate_portal_submission_to_nmdc_schema_database( | |
metadata_submission, | |
omics_processing_mapping, | |
data_object_mapping, | |
biosample_extras=biosample_extras, | |
biosample_extras_slot_mapping=biosample_extras_slot_mapping, | |
) |
nmdc-runtime/nmdc_runtime/site/repository.py
Lines 522 to 528 in 017aa9b
"inputs": { | |
"submission_id": "", | |
"omics_processing_mapping_file_url": None, | |
"data_object_mapping_file_url": None, | |
"biosample_extras_file_url": None, | |
"biosample_extras_slot_mapping_file_url": None, | |
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for illustrating the pattern for an ideal Translator
Patrick, I think this pattern's really clean and I'd like to follow it. I've tried to follow it and adopt it, but I think there's something very trivial that I'm overlooking which is causing the dagster repo to break (on dagit). Can you tell simply by visual inspection if I'm missing something?
In ops.py
:
- Very similar to
get_csv_rows_from_url()
I've definedget_df_from_url()
here: https://github.com/microbiomedata/nmdc-runtime/blob/benthic-samples-ingest/nmdc_runtime/site/ops.py#L908-L928 - I've also modified the Translator op to accept the processed results from
get_df_from_url
as inputs: https://github.com/microbiomedata/nmdc-runtime/blob/benthic-samples-ingest/nmdc_runtime/site/ops.py#L837-L860
In graphs.py
:
Calls to all the ops are being made in the graph here: https://github.com/microbiomedata/nmdc-runtime/blob/benthic-samples-ingest/nmdc_runtime/site/graphs.py#L246-L275
In repository.py
:
https://github.com/microbiomedata/nmdc-runtime/blob/benthic-samples-ingest/nmdc_runtime/site/repository.py#L656-L695 this is where I think I might be making a mistake in configuration. Do you seeing anything glaringly incorrect?
def __init__(self, benthic_data: dict, *args, **kwargs) -> None: | ||
super().__init__(*args, **kwargs) | ||
|
||
self.conn = sqlite3.connect("neon.db") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm still concerned about writing this to disk and now we have two Translator
classes that are accessing the same file! I'm a little worried about cross-contamination between runs of different translators.
Thank you for the thorough review @pkalita-lbl 😁 these are great review comments, and I'll have them addressed in the next day/two, and yes, we definitely won't merge this in before addressing all the comments. |
nmdc_runtime/site/ops.py
Outdated
:return: pandas DataFrame of CSV/TSV/etc content | ||
""" | ||
if not url: | ||
return pd.DataFrame() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this seems an odd pattern. Better to make teh signature include Optional
and return None
?
nmdc_runtime/site/ops.py
Outdated
} | ||
return site_code_mapping | ||
else: | ||
raise Exception( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
6bb6aff
to
5be0647
Compare
I've made a new branch — neon-pipeline-refactor which includes all the commits/updates that I had piled on on this branch post the code review from @pkalita-lbl. I will open up a PR from that branch soon. We can merge this PR in so that it can be included in the release that is about to go out on 12/18, and the refactored code can be included in the release after. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay, yeah we can take care of the cleanup and refactoring in a separate PR
Fixes microbiomedata/issues#335