NEON benthic samples ingest pipeline #308

sujaypatil96 · 2023-09-20T23:56:23Z

cmungall · 2023-10-02T21:57:33Z

nmdc_runtime/site/translation/neon_benthic_translator.py

+
+BENTHIC_ENV_MEDIUM_MAPPINGS = {
+    "plant-associated": {
+        "term_id": "ENVO:01001057",


This is not an environmental medium

cmungall · 2023-10-02T21:58:02Z

nmdc_runtime/site/translation/neon_benthic_translator.py

+}
+
+BENTHIC_LOCAL_SCALE_MAPPINGS = {
+    "pool": {"term_id": "ENVO:03600094", "term_name": "stream pool"},


Mappings should live outside code

cmungall

What is our strategy for alternative identifiers that link back to NEON samples?

pkalita-lbl

Sorry for the delay in reviewing this! I don't know if you have specific deadlines on this so I'm hitting approve just to not hold anything up, but I do have some mild concerns about mixing data fetching with data translation in the same class. If it works for now it's probably fine, but it's not a model I'd like to proliferate.

pkalita-lbl · 2023-12-05T17:48:26Z

tests/test_data/test_neon_benthic_data_translator.py

+        return NeonBenthicDataTranslator(benthic_data)
+
+    def test_neon_envo_mappings_download(self):
+        response = requests.get(


I'm always apprehensive about making live network requests in tests, and that is all this test and the following one do. I'd consider removing them.

Great comment, yes! I'll remove live network requests from all the NEON pipeline tests.
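
One way to keep tests like these offline, sketched under assumptions: patch the HTTP call and feed the test a canned TSV. `load_mapping_rows`, the column names, and the example URL are hypothetical stand-ins, and the sketch uses the standard library's `urllib` instead of `requests` so that it runs without extra dependencies. The same idea applies to patching `requests.get`.

```python
import csv
import io
import urllib.request
from typing import Dict, List
from unittest import mock


def load_mapping_rows(url: str) -> List[Dict[str, str]]:
    """Hypothetical helper: download a TSV and return its rows as dicts."""
    with urllib.request.urlopen(url) as response:
        text = response.read().decode("utf-8")
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


# Canned payload standing in for the remote mapping file (made-up columns).
FAKE_TSV = b"neon_value\tenvo_id\npool\tENVO:03600094\n"


def test_load_mapping_rows_offline():
    fake_response = mock.MagicMock()
    fake_response.read.return_value = FAKE_TSV
    # urlopen() is used as a context manager, so __enter__ returns the response.
    fake_response.__enter__.return_value = fake_response
    with mock.patch("urllib.request.urlopen", return_value=fake_response) as patched:
        rows = load_mapping_rows("https://example.invalid/mappings.tsv")
    # The patched function was called, so no real request left the machine.
    patched.assert_called_once()
    assert rows == [{"neon_value": "pool", "envo_id": "ENVO:03600094"}]
```

The test still exercises the parsing logic, but it no longer fails when the network or the remote file is unavailable.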

pkalita-lbl · 2023-12-05T17:51:30Z

tests/test_data/test_neon_benthic_data_translator.py

+
+        return minted_nmdc_ids
+
+    def test_get_database(self, translator):


Can you add an assertion to this test that the database validates against the schema? If in the future we update the schema in such a way that this translator no longer produces compliant objects, we want to find out early by this test failing rather than later if/when we try to use it again.

See:

nmdc-runtime/tests/test_data/test_submission_portal_translator.py

Line 314 in 3522fae

validation_result = validate_json(json_dumper.to_dict(actual), mongo_db)

pkalita-lbl · 2023-12-05T18:17:39Z

nmdc_runtime/site/translation/neon_benthic_translator.py

+            "neonRawDataFile", self.conn, if_exists="replace", index=False
+        )
+
+    def get_site_by_code(self, site_code: str) -> str:


Similar to above comment it would be great to see this implemented as a separate @op that just produces a mapping from site code to location string (perhaps by calling the https://data.neonscience.org/api/v0/sites endpoint) and handing that mapping to the Translator class.

Yup, excellent suggestion. I made an @op that produces a dict mapping from site code to location here: https://github.com/microbiomedata/nmdc-runtime/blob/benthic-samples-ingest/nmdc_runtime/site/ops.py#L931-L947
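
For readers following along, a minimal sketch of the transformation such an op might perform once the sites payload has been fetched. This is not the code at the link above. The field names (`siteCode`, `siteDescription`, `stateCode`) and the output format are assumptions about the payload, and `build_site_code_mapping` is a hypothetical name.

```python
from typing import Any, Dict, List


def build_site_code_mapping(sites: List[Dict[str, Any]]) -> Dict[str, str]:
    """Map each site code to a location string.

    The keys read here are assumptions about the payload shape;
    adjust them to match the real response.
    """
    mapping: Dict[str, str] = {}
    for site in sites:
        location = f'USA: {site["stateCode"]}, {site["siteDescription"]}'
        mapping[site["siteCode"]] = location
    return mapping


# Illustrative input only, not a recorded API response.
sample_sites = [
    {"siteCode": "ARIK", "siteDescription": "Arikaree River", "stateCode": "CO"},
]
site_code_mapping = build_site_code_mapping(sample_sites)
```

Because the function takes already-parsed data, the Translator can be handed the resulting dict and tested without any network access.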

pkalita-lbl · 2023-12-05T18:17:46Z

nmdc_runtime/site/translation/neon_benthic_translator.py

+                f"You are missing one of the aquatic benthic microbiome tables: {neon_amb_data_tables}"
+            )
+
+        neon_envo_mappings_file = "https://raw.githubusercontent.com/microbiomedata/nmdc-schema/main/assets/neon_mixs_env_triad_mappings/neon-nlcd-local-broad-mappings.tsv"


I'd prefer seeing a model where Translator classes are purely about translation and have no knowledge of information sources. In that world this type of information would be fetched by a separate @op that only knows about data fetching (and is configured by inputs to the run) and passes that off to the Translator. That makes testing easier and more reliable because there are no external network dependencies. And it means that these mapping file locations don't have to be hardcoded, but can change from run to run.

This may not exactly meet your needs but it illustrates the model:

nmdc-runtime/nmdc_runtime/site/ops.py

Line 829 in 017aa9b

def get_csv_rows_from_url(url: Optional[str]) -> List[Dict]:

nmdc-runtime/nmdc_runtime/site/graphs.py

Lines 148 to 161 in 017aa9b

omics_processing_mapping = get_csv_rows_from_url(omics_processing_mapping_file_url)
data_object_mapping = get_csv_rows_from_url(data_object_mapping_file_url)
biosample_extras = get_csv_rows_from_url(biosample_extras_file_url)
biosample_extras_slot_mapping = get_csv_rows_from_url(
    biosample_extras_slot_mapping_file_url
)

database = translate_portal_submission_to_nmdc_schema_database(
    metadata_submission,
    omics_processing_mapping,
    data_object_mapping,
    biosample_extras=biosample_extras,
    biosample_extras_slot_mapping=biosample_extras_slot_mapping,
)

nmdc-runtime/nmdc_runtime/site/repository.py

Lines 522 to 528 in 017aa9b

"inputs": {
    "submission_id": "",
    "omics_processing_mapping_file_url": None,
    "data_object_mapping_file_url": None,
    "biosample_extras_file_url": None,
    "biosample_extras_slot_mapping_file_url": None,
}
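
The separation described above can be sketched in plain Python, without Dagster, to show why it helps testing. Everything here is hypothetical: `SketchTranslator`, `run_pipeline`, the column names, and the URLs are illustrations, not the NMDC classes or ops.

```python
from typing import Callable, Dict, List

Rows = List[Dict[str, str]]


class SketchTranslator:
    """Hypothetical translator: receives mapping rows, never fetches them."""

    def __init__(self, samples: Rows, env_mappings: Rows) -> None:
        self.samples = samples
        # Index the mapping rows once so lookups are cheap during translation.
        self.env_lookup = {row["neon_value"]: row["envo_id"] for row in env_mappings}

    def translate(self) -> Rows:
        return [
            {"id": s["sample_id"], "env_local_scale": self.env_lookup.get(s["habitat"], "")}
            for s in self.samples
        ]


def run_pipeline(fetch_rows: Callable[[str], Rows], samples_url: str, mappings_url: str) -> Rows:
    """Stand-in for the graph: fetch first, then hand plain data to the translator."""
    samples = fetch_rows(samples_url)
    mappings = fetch_rows(mappings_url)
    return SketchTranslator(samples, mappings).translate()


# In a test, the fetcher is just a dict lookup; no network is involved.
fake_sources = {
    "samples": [{"sample_id": "s1", "habitat": "pool"}],
    "mappings": [{"neon_value": "pool", "envo_id": "ENVO:03600094"}],
}
result = run_pipeline(fake_sources.__getitem__, "samples", "mappings")
```

Since the source locations are arguments to the pipeline and unknown to the translator, they can change from run to run without touching translation code.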

Thank you for illustrating the pattern for an ideal Translator, Patrick. I think this pattern is really clean and I'd like to follow it. I've tried to adopt it, but I think I'm overlooking something trivial that is causing the dagster repo to break (on dagit). Can you tell by visual inspection if I'm missing something?

In ops.py:

Very similar to get_csv_rows_from_url() I've defined get_df_from_url() here: https://github.com/microbiomedata/nmdc-runtime/blob/benthic-samples-ingest/nmdc_runtime/site/ops.py#L908-L928

I've also modified the Translator op to accept the processed results from get_df_from_url as inputs: https://github.com/microbiomedata/nmdc-runtime/blob/benthic-samples-ingest/nmdc_runtime/site/ops.py#L837-L860

In graphs.py:
Calls to all the ops are being made in the graph here: https://github.com/microbiomedata/nmdc-runtime/blob/benthic-samples-ingest/nmdc_runtime/site/graphs.py#L246-L275

In repository.py:
https://github.com/microbiomedata/nmdc-runtime/blob/benthic-samples-ingest/nmdc_runtime/site/repository.py#L656-L695 is where I think I might be making a mistake in configuration. Do you see anything glaringly incorrect?

pkalita-lbl · 2023-12-05T18:20:23Z

nmdc_runtime/site/translation/neon_benthic_translator.py

+    def __init__(self, benthic_data: dict, *args, **kwargs) -> None:
+        super().__init__(*args, **kwargs)
+
+        self.conn = sqlite3.connect("neon.db")


I'm still concerned about writing this to disk and now we have two Translator classes that are accessing the same file! I'm a little worried about cross-contamination between runs of different translators.
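
One possible way to avoid the shared file, offered as a sketch and not as a drop-in fix: give each instance its own in-memory SQLite database. `ScratchTableStore` and its table layout are hypothetical. The table name is interpolated directly into SQL, which is acceptable only because this sketch uses fixed, trusted names.

```python
import sqlite3
from typing import List


class ScratchTableStore:
    """Hypothetical helper owning a private in-memory SQLite database."""

    def __init__(self) -> None:
        # ":memory:" creates a database that exists only for this connection,
        # so nothing is written to disk and instances cannot see each other.
        self.conn = sqlite3.connect(":memory:")

    def load(self, table: str, sample_ids: List[str]) -> None:
        self.conn.execute(f"CREATE TABLE {table} (sample_id TEXT)")
        self.conn.executemany(
            f"INSERT INTO {table} VALUES (?)", [(sid,) for sid in sample_ids]
        )

    def count(self, table: str) -> int:
        return self.conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]


# Two stores reuse the same table name without contaminating each other.
soil_store = ScratchTableStore()
benthic_store = ScratchTableStore()
soil_store.load("samples", ["a", "b"])
benthic_store.load("samples", ["c"])
```

Each store sees only its own rows, so runs of different translators stay isolated even when they use identical table names.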

sujaypatil96 · 2023-12-05T20:47:03Z

Thank you for the thorough review @pkalita-lbl 😁 these are great review comments, and I'll have them addressed in the next day or two. And yes, we definitely won't merge this in before addressing all the comments.

cmungall · 2023-12-14T15:55:58Z

nmdc_runtime/site/ops.py

+    :return: pandas DataFrame of CSV/TSV/etc content
+    """
+    if not url:
+        return pd.DataFrame()


This seems an odd pattern. Better to make the signature include Optional and return None?
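
A minimal sketch of that suggestion, assuming a hypothetical `get_rows_from_url` that returns a list of dicts so the example stays dependency-free. The same shape would apply to a function annotated as returning `Optional[pd.DataFrame]`. The `fetch` parameter is an illustrative injection point, not part of the code under review.

```python
from typing import Callable, Dict, List, Optional

Rows = List[Dict[str, str]]


def get_rows_from_url(url: Optional[str], fetch: Callable[[str], Rows]) -> Optional[Rows]:
    """Return fetched rows, or None when no URL was configured."""
    if not url:
        # None says "nothing was requested", which an empty result cannot express.
        return None
    return fetch(url)


def fake_fetch(url: str) -> Rows:
    # Stand-in for a real download; returns one made-up row.
    return [{"source": url}]


missing = get_rows_from_url(None, fake_fetch)
present = get_rows_from_url("https://example.invalid/data.tsv", fake_fetch)
```

Callers can then distinguish "no URL given" from "URL given but the file was empty" with an explicit `is None` check.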

cmungall · 2023-12-14T15:57:41Z

nmdc_runtime/site/ops.py

+        }
+        return site_code_mapping
+    else:
+        raise Exception(


Better pattern: https://stackoverflow.com/questions/24518944/try-except-when-using-python-requests-module
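
The general shape of that pattern, sketched with the standard library's `urllib` in place of `requests` so the example runs without extra dependencies: let the HTTP layer raise on failure, then catch specific exception types, not a manually checked status code followed by a bare `Exception`. `SiteLookupError`, `fetch_json`, and the `opener` parameter are hypothetical names introduced for illustration.

```python
import json
import urllib.error
import urllib.request
from typing import Any, Callable


class SiteLookupError(RuntimeError):
    """Hypothetical error type raised when the request fails."""


def fetch_json(url: str, opener: Callable[..., Any] = urllib.request.urlopen) -> Any:
    try:
        with opener(url, timeout=30) as response:
            return json.loads(response.read().decode("utf-8"))
    except urllib.error.HTTPError as err:
        # The server answered, but with a 4xx/5xx status.
        # HTTPError subclasses URLError, so it must be caught first.
        raise SiteLookupError(f"{url} returned HTTP {err.code}") from err
    except urllib.error.URLError as err:
        # The request never got a response (DNS failure, refused connection, ...).
        raise SiteLookupError(f"could not reach {url}: {err.reason}") from err
```

With `requests`, the analogous approach would be calling `response.raise_for_status()` inside a `try` block and catching `requests.exceptions.RequestException`.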

sujaypatil96 · 2023-12-15T07:03:57Z

I've made a new branch, neon-pipeline-refactor, which includes all the commits and updates that I had piled onto this branch after the code review from @pkalita-lbl. I will open a PR from that branch soon.

We can merge this PR in so that it can be included in the release that is about to go out on 12/18, and the refactored code can be included in the release after.

pkalita-lbl

Okay, yeah we can take care of the cleanup and refactoring in a separate PR

sujaypatil96 added 2 commits September 20, 2023 16:55

first pass benthic samples ingest pipeline

f564dc2

import values for depth and geo_loc_name slots

6045d80

cmungall reviewed Oct 2, 2023

cmungall requested changes Oct 2, 2023

sujaypatil96 added 4 commits October 13, 2023 14:43

populate Extraction, OmicsProcessing, DataObject set records

37f8bff

merge conflict in graphs.py

bb814b5

fix base query for NEON benthic samples ingest

4bad42a

split out neon pipeline helper methods into separate module

40e7e00

sujaypatil96 mentioned this pull request Nov 18, 2023

NEON soil metadata translator refactor microbiomedata/issues#402

Closed

7 tasks

sujaypatil96 added 3 commits November 22, 2023 09:13

add tests for NEON benthic translator

90e5ffd

rename NEON soil translator module

ac4cd36

change path to NEON soil translator import in ops.py

5be0647

sujaypatil96 marked this pull request as ready for review November 22, 2023 17:15

sujaypatil96 requested review from cmungall and pkalita-lbl November 22, 2023 17:16

pkalita-lbl previously approved these changes Dec 5, 2023

sujaypatil96 dismissed pkalita-lbl’s stale review via 8cab140 December 11, 2023 15:15

aclum linked an issue Dec 11, 2023 that may be closed by this pull request

NEON soil metadata translator refactor microbiomedata/issues#402

Closed

7 tasks

cmungall reviewed Dec 14, 2023

sujaypatil96 force-pushed the benthic-samples-ingest branch from 6bb6aff to 5be0647 Compare December 15, 2023 06:58

sujaypatil96 requested a review from pkalita-lbl December 15, 2023 06:59

pkalita-lbl approved these changes Dec 15, 2023

pkalita-lbl merged commit 6e2c472 into main Dec 15, 2023

sujaypatil96 deleted the benthic-samples-ingest branch December 15, 2023 17:52

sujaypatil96 mentioned this pull request Jan 23, 2024

NEON pipeline infrastructure refactoring #448

Merged

