Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

NEON benthic samples ingest pipeline #308

Merged
merged 9 commits into from
Dec 15, 2023
Merged

Conversation

sujaypatil96
Copy link
Collaborator


BENTHIC_ENV_MEDIUM_MAPPINGS = {
"plant-associated": {
"term_id": "ENVO:01001057",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not an environmental medium

}

BENTHIC_LOCAL_SCALE_MAPPINGS = {
"pool": {"term_id": "ENVO:03600094", "term_name": "stream pool"},
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mappings should live outside code

Copy link
Contributor

@cmungall cmungall left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is out strategy for alternative identifiers that link back to NEON samples?

@sujaypatil96 sujaypatil96 marked this pull request as ready for review November 22, 2023 17:15
pkalita-lbl
pkalita-lbl previously approved these changes Dec 5, 2023
Copy link
Collaborator

@pkalita-lbl pkalita-lbl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the delay in reviewing this! I don't know if you have specific deadlines on this so I'm hitting approve just to not hold anything up, but I do have some mild concerns about mixing data fetching with data translation in the same class. If it works for now it's probably fine, but it's not a model I'd like to proliferate.

return NeonBenthicDataTranslator(benthic_data)

def test_neon_envo_mappings_download(self):
response = requests.get(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm always apprehensive about making live network requests in tests and this is all this and the following test do. I'd consider removing them.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great comment, yes! I'll remove live network requests from all the NEON pipeline tests.


return minted_nmdc_ids

def test_get_database(self, translator):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add an assertion to this test that the database validates against the schema? If in the future we update the schema in such a way that this translator no longer produces compliant objects, we want to find out early by this test failing rather than later if/when we try to use it again.

See:

validation_result = validate_json(json_dumper.to_dict(actual), mongo_db)

"neonRawDataFile", self.conn, if_exists="replace", index=False
)

def get_site_by_code(self, site_code: str) -> str:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similar to above comment it would be great to see this implemented as a separate @op that just produces a mapping from site code to location string (perhaps by calling the https://data.neonscience.org/api/v0/sites endpoint) and handing that mapping to the Translator class.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup, excellent suggestion. I made an @op that produces a dict mapping from site code to location here: https://github.com/microbiomedata/nmdc-runtime/blob/benthic-samples-ingest/nmdc_runtime/site/ops.py#L931-L947

f"You are missing one of the aquatic benthic microbiome tables: {neon_amb_data_tables}"
)

neon_envo_mappings_file = "https://raw.githubusercontent.com/microbiomedata/nmdc-schema/main/assets/neon_mixs_env_triad_mappings/neon-nlcd-local-broad-mappings.tsv"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd prefer seeing a model where Translator classes are purely about translation and have no knowledge of information sources. In that world this type of information would be fetched by a separate @op that only knows about data fetching (and is configured by inputs to the run) and passes that off to the Translator. That makes testing easier and more reliable because there are no external network dependencies. And it means that these mapping file locations don't have to be hardcoded, but can change from run to run.

This may not exactly meet your needs but it illustrates the model:

def get_csv_rows_from_url(url: Optional[str]) -> List[Dict]:

omics_processing_mapping = get_csv_rows_from_url(omics_processing_mapping_file_url)
data_object_mapping = get_csv_rows_from_url(data_object_mapping_file_url)
biosample_extras = get_csv_rows_from_url(biosample_extras_file_url)
biosample_extras_slot_mapping = get_csv_rows_from_url(
biosample_extras_slot_mapping_file_url
)
database = translate_portal_submission_to_nmdc_schema_database(
metadata_submission,
omics_processing_mapping,
data_object_mapping,
biosample_extras=biosample_extras,
biosample_extras_slot_mapping=biosample_extras_slot_mapping,
)

"inputs": {
"submission_id": "",
"omics_processing_mapping_file_url": None,
"data_object_mapping_file_url": None,
"biosample_extras_file_url": None,
"biosample_extras_slot_mapping_file_url": None,
}

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for illustrating the pattern for an ideal Translator Patrick, I think this pattern's really clean and I'd like to follow it. I've tried to follow it and adopt it, but I think there's something very trivial that I'm overlooking which is causing the dagster repo to break (on dagit). Can you tell simply by visual inspection if I'm missing something?

In ops.py:

In graphs.py:
Calls to all the ops are being made in the graph here: https://github.com/microbiomedata/nmdc-runtime/blob/benthic-samples-ingest/nmdc_runtime/site/graphs.py#L246-L275

In repository.py:
https://github.com/microbiomedata/nmdc-runtime/blob/benthic-samples-ingest/nmdc_runtime/site/repository.py#L656-L695 this is where I think I might be making a mistake in configuration. Do you seeing anything glaringly incorrect?

def __init__(self, benthic_data: dict, *args, **kwargs) -> None:
super().__init__(*args, **kwargs)

self.conn = sqlite3.connect("neon.db")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm still concerned about writing this to disk and now we have two Translator classes that are accessing the same file! I'm a little worried about cross-contamination between runs of different translators.

@sujaypatil96
Copy link
Collaborator Author

Thank you for the thorough review @pkalita-lbl 😁 these are great review comments, and I'll have them addressed in the next day/two, and yes, we definitely won't merge this in before addressing all the comments.

:return: pandas DataFrame of CSV/TSV/etc content
"""
if not url:
return pd.DataFrame()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this seems an odd pattern. Better to make teh signature include Optional and return None?

}
return site_code_mapping
else:
raise Exception(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sujaypatil96
Copy link
Collaborator Author

sujaypatil96 commented Dec 15, 2023

I've made a new branch — neon-pipeline-refactor which includes all the commits/updates that I had piled on on this branch post the code review from @pkalita-lbl. I will open up a PR from that branch soon.

We can merge this PR in so that it can be included in the release that is about to go out on 12/18, and the refactored code can be included in the release after.

Copy link
Collaborator

@pkalita-lbl pkalita-lbl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, yeah we can take care of the cleanup and refactoring in a separate PR

@pkalita-lbl pkalita-lbl merged commit 6e2c472 into main Dec 15, 2023
@sujaypatil96 sujaypatil96 deleted the benthic-samples-ingest branch December 15, 2023 17:52
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
3 participants