Use MIxS samp_taxon_id to model NCBI taxonomy ID from GOLD? #574

Closed
Tracked by #587 ...
sujaypatil96 opened this issue Dec 21, 2022 · 9 comments · Fixed by #626

@sujaypatil96
Collaborator

sujaypatil96 commented Dec 21, 2022

Add a slot called ncbi_taxon_id to capture the host taxonomy ID as present in GOLD, and possibly in other sources.

For example, let's look at Gb0291745 on the GOLD website: https://gold.jgi.doe.gov/biosample?id=Gb0291745

Look under the Host Metadata section, and you'll see a value of 3689 associated with the host taxonomy ID.

We need a slot under the Biosample class to capture this information. In this issue, I'm proposing the addition of a general slot called ncbi_taxon_id, and asserting its usage in the Biosample class using its slots property.

@sujaypatil96 sujaypatil96 added the schema change Term updates to NMDC Schema label Dec 21, 2022
@sujaypatil96 sujaypatil96 self-assigned this Dec 21, 2022
@sujaypatil96 sujaypatil96 changed the title Add host_taxon_id slot to schema Add ncbi_taxon_id slot to schema Dec 21, 2022
@sujaypatil96
Collaborator Author

Comments from discussion with @turbomam:

There is a slot in MIxS called specific_host already, which models the taxonomy name or id.

We have two approaches here:

  • make changes to the mixs.yaml file in nmdc-schema repo, add in specific_host slot, import that into Biosample class in nmdc.yaml and constrain specific_host under the slot_usage for Biosample
  • add a new slot in nmdc.yaml called ncbi_taxon_id which is inherited from specific_host using the is_a property, constrain the slot and then assert it on the Biosample class

@ssarrafan
Collaborator

@sujaypatil96 is this something you're currently working on? Can I add it to the sprint board for January?

@sujaypatil96
Collaborator Author

Model this similarly to ENVO terms. Follow up with GSC to figure out why there are three similar terms for taxon ID.

@mslarae13 mslarae13 mentioned this issue Jan 6, 2023
99 tasks
@sujaypatil96
Collaborator Author

Following up on this, the three MIxS terms with confusing definitions are: specific_host, host_taxid and samp_taxon_id

@ssarrafan
Copy link
Collaborator

@sujaypatil96 I'll move this to the next sprint but let me know if you won't be actively working on it for the next couple of weeks

@sujaypatil96
Collaborator Author

@ssarrafan yup, we plan to address this at the metadata call today.

@sujaypatil96
Collaborator Author

sujaypatil96 commented Jan 18, 2023

The information that we intend to capture in the schema from GOLD is the NCBI taxonomy ID (this is the label for the field that appears on the JGI GOLD website)

For example, below is a snippet of the output from the GOLD API:

{'biosampleGoldId': 'Gb0291653', 'biosampleName': 'Bulk soil microbial communities from poplar common garden site in 
Clatskanie, Oregon, USA - BESC-86-CL2_69_17', 'ncbiTaxId': 410658, 'ncbiTaxName': 'soil metagenome', 
'sampleCollectionSite': 'Bulk Soil', 'geographicLocation': 'USA: Oregon'...}

Semantic considerations

"soil metagenome" is not a host, so neither of the MIxS terms with the word "host" is applicable, leaving only the samp_taxon_id term from the original 3 candidates (specific_host, host_taxid and samp_taxon_id)

Format considerations

Looking at the ncbiTaxId field, it seems to be an integer value. And here is an example value for the samp_taxon_id field that MIxS provides:

Gut Metagenome [NCBI:txid749906]

We could reconstruct a value that looks like the above example using two GOLD fields - ncbiTaxName and ncbiTaxId. We propose replacing the syntax implied by the MIxS example above with syntax like: NCBITaxon:410658 (as found on OLS)
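
A minimal sketch of the syntax change proposed above, not the NMDC implementation: the helper name `mixs_to_obo_taxon` is hypothetical, and the only input assumed is a MIxS-style string like the example quoted in this comment.

```python
import re

# Matches the taxon identifier inside a MIxS-style value, e.g. "[NCBI:txid749906]".
_MIXS_TAXID = re.compile(r"NCBI:txid(\d+)")


def mixs_to_obo_taxon(value: str) -> str:
    """Rewrite 'NCBI:txid<digits>' to 'NCBITaxon:<digits>' wherever it appears.

    Hypothetical helper for illustration only.
    """
    return _MIXS_TAXID.sub(r"NCBITaxon:\1", value)


print(mixs_to_obo_taxon("Gut Metagenome [NCBI:txid749906]"))
# prints: Gut Metagenome [NCBITaxon:749906]
```

Values that carry no `NCBI:txid` identifier pass through unchanged.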

@sujaypatil96 sujaypatil96 changed the title Add ncbi_taxon_id slot to schema Use MIxS samp_taxon_id to model NCBI taxonomy ID from GOLD? Jan 18, 2023
@mslarae13
Contributor

mslarae13 commented Jan 18, 2023

  • samp_taxon_id

Agreed, samp_taxon_id is the right slot. This is NOT collected from the user or in the submission portal, but I think that’s fine. This can just be a GOLD-assigned field.

These descriptions are more or less forward and reverse, so how they're different is unclear, but I think this is a different issue.

Decision during Wednesday 1pm metadata meeting

Format change ncbiTaxName to NCBITaxon:####
Capture from GOLD as “ncbiTaxName [NCBITaxon:ncbiTaxId]”, for example gut metagenome [NCBITaxon:749906] or soil metagenome [NCBITaxon:######]
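
The composition agreed on in the meeting could be sketched roughly as follows. The function name `compose_samp_taxon_id` is hypothetical, and the record is a trimmed copy of the GOLD API snippet quoted earlier in this thread; how NMDC actually ingests GOLD records is not shown here.

```python
def compose_samp_taxon_id(record: dict) -> str:
    """Build '<ncbiTaxName> [NCBITaxon:<ncbiTaxId>]' from a GOLD biosample record.

    Hypothetical helper; assumes both keys are present in the record.
    """
    name = record["ncbiTaxName"]
    # GOLD appears to return an integer; int() also accepts a numeric string.
    tax_id = int(record["ncbiTaxId"])
    return f"{name} [NCBITaxon:{tax_id}]"


# Trimmed version of the GOLD API output quoted in this thread.
gold_record = {
    "biosampleGoldId": "Gb0291653",
    "ncbiTaxId": 410658,
    "ncbiTaxName": "soil metagenome",
}

print(compose_samp_taxon_id(gold_record))
# prints: soil metagenome [NCBITaxon:410658]
```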

@sujaypatil96 sujaypatil96 linked a pull request Jan 19, 2023 that will close this issue
@turbomam
Member

turbomam commented Jan 20, 2023

Thanks for the notes, @mslarae13

A couple of questions:

  1. You're suggesting that there wouldn't be any column for samp_taxon_id in the Submission Portal. How would that relate to our vision of gathering sample metadata in bulk from external sources (like GOLD) and then loading it into the submission portal for a "data wrangler" to check?
  2. What does "descriptions are more or less forward and reverse" mean?
  3. Could the last two lines in your comment be summarized as below?

  • NMDC will populate samp_taxon_id slots for Biosamples by composing the ncbiTaxName and ncbiTaxId. For the taxon identifier portion, MIxS implies a syntax of ^NCBI:txid[0-9]+$ like 'NCBI:txid749906' but instead NMDC will follow the syntax used by the OBO foundry, ^NCBITaxon:[0-9]+$ like 'NCBITaxon:749906'

See regexr.com for experimenting with those patterns
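
For anyone who prefers to try the two patterns locally rather than in a browser, here is a small sketch using Python's standard `re` module. The constant names are made up for illustration; the patterns themselves are copied from the comment above.

```python
import re

# Syntax implied by the MIxS example value.
MIXS_PATTERN = re.compile(r"^NCBI:txid[0-9]+$")
# Syntax used by the OBO Foundry, which NMDC would follow instead.
OBO_PATTERN = re.compile(r"^NCBITaxon:[0-9]+$")

for candidate in ("NCBI:txid749906", "NCBITaxon:749906"):
    matches_mixs = bool(MIXS_PATTERN.match(candidate))
    matches_obo = bool(OBO_PATTERN.match(candidate))
    print(f"{candidate}: MIxS={matches_mixs}, OBO={matches_obo}")
```

Each identifier matches only its own pattern, so a simple pattern check is enough to tell the two syntaxes apart.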

Projects
Status: ✅ SubPort 1 - Done
4 participants