Portal Search - EnvO #361

Closed
ssarrafan opened this issue May 5, 2021 · 14 comments
Labels: priority: high, type: question (Further information is requested)

Comments

@ssarrafan

This is a request to work on navigation for EnvO so that we can put something in front of users and iterate. EnvO has a steep learning curve, and the portal has the potential to be integrated into training efforts if done well. What's in the portal now does not show relationships between EnvO terms and is not intuitive.

This will require working with Chris and Aim 1 (I do not expect Kitware to solve this on their own). Integrating search with EnvO support for aliases would be really powerful. Can we track the searches people are doing? (to try to understand EnvO vs. GOLD)

Priority - High
Urgency - High

@ssarrafan ssarrafan created this issue from a note in NMDC May 2021 Sprint (To do) May 5, 2021
@jbeezley

jbeezley commented May 5, 2021

One actionable (and quick) task here is to store queries for later analysis. We can make that a standalone task while the rest of the issue is fleshed out.

@kfagnan

kfagnan commented May 5, 2021

This is fuzzy to me... @pvangay can you clarify how this issue relates to the existing EnvO browsers?
https://sites.google.com/site/environmentontology/
https://www.ebi.ac.uk/ols/ontologies/envo

@kfagnan kfagnan added the type: question Further information is requested label May 5, 2021
@kfagnan kfagnan removed this from To do in NMDC May 2021 Sprint May 5, 2021
@pvangay

pvangay commented May 6, 2021

The original request is to improve how the EnvO terms are displayed on the portal to reflect the hierarchy (similar to what's at https://www.ebi.ac.uk/ols/ontologies/envo). Each of the 3 terms has an underlying hierarchical structure -- what's on the portal now is flat.

But before any implementation, we need to have a broader discussion about how we should expose EnvO on the portal. I have lots of ideas about how researchers would/could use EnvO for search/refinement/etc. but none of them are backed up by actual data :) -- which indicates this is definitely an opportunity to put something in front of users to get feedback. Yet, what do we put in front of users? This? Or are there alternatives? Tagging @cmungall because I thought he had some ideas.

@cmungall
Contributor

cmungall commented May 6, 2021

We should think carefully about the strategy of getting feedback from users where (a) the data we have doesn't have the range of environments we will have in the future and (b) we are asking them to imagine without putting forward specific possibilities.

I'll give a high-level description of a general strategy for ontological faceting for now, but I think there is a more detailed discussion to be had about scientific use cases, UI/UX, and ontology content.

I like how the current facets are dynamic and driven by the content in the database. Here are some small changes that can improve things:

  • nest the facets using the is-a/part-of graph
  • use an envo slim to eliminate 'astronomical body part' and the like
  • only include MRCAs of any directly used terms

For example, right now if we click on feature we see:

image

There are many terms like river, stream, watercourse, etc., but note that you get the same results whichever one you select. They have no discriminatory power.

You can see why if we feed these terms into a graph viewer (I am not suggesting we do this for users, this is by way of explanation for us in NMDC):

image

The exact term that was used to annotate the samples was "river". It's good that the facet browsing uses inference, such that querying for "water body" correctly gives you annotations to "river". But unless there are annotations to other "water body" concepts, like say "hypoxic lake", you don't get any value from filtering by intermediate terms.
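To make the "no discriminatory power" point concrete, here is a minimal sketch of facet querying with ancestor inference. The is-a fragment, the sample IDs, and the function names are all hypothetical and chosen only for illustration; they are not actual EnvO content or portal code.

```python
# Hypothetical is-a fragment (child -> parents); illustrative only, not actual EnvO content.
PARENTS = {
    "river": ["watercourse"],
    "watercourse": ["water body"],
    "water body": [],
}

# Hypothetical annotations: every sample is annotated directly to "river".
SAMPLE_TERMS = {"s1": "river", "s2": "river", "s3": "river"}


def ancestors(term):
    """Return the term itself plus everything reachable via parent edges."""
    seen, stack = set(), [term]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(PARENTS.get(current, []))
    return seen


def samples_for(facet_term):
    """Samples annotated to the facet term or to any of its descendants."""
    return {s for s, t in SAMPLE_TERMS.items() if facet_term in ancestors(t)}


# Every facet along the path returns the identical sample set.
for facet in ("river", "watercourse", "water body"):
    print(facet, sorted(samples_for(facet)))
```

With only "river" annotations in the data, all three facets select the same samples, so listing the intermediate terms adds nothing to the filter.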

If you trim out terms that are not MRCA of any pair of samples, then you get a list of terms each of which yields different sets of samples. This is also a tractable set for nesting the facets visually.

For example, let's say we had drilled down to a set of samples that were collected from 3 different environments:

  • hypoxic lake (20 samples)
  • river (30 samples)
  • tidal creek (40 samples)

the subgraph induced by these terms is:

image

when you strip to MRCAs you get:

  • water body (90)
    • hypoxic lake (20)
    • watercourse (70)
      • river (30)
      • tidal creek (40)
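The trimming step above can be sketched roughly as follows. This is one possible reading of the approach, not the portal's implementation: the is-a edges are a hypothetical fragment (not actual EnvO content), the function names are invented for illustration, and picking the deepest kept ancestor as the parent is an assumption made to force a tree out of what is in general a DAG.

```python
from itertools import combinations

# Hypothetical is-a fragment (child -> parents); illustrative only, not actual EnvO content.
PARENTS = {
    "hypoxic lake": ["lake"],
    "lake": ["water body"],
    "river": ["watercourse"],
    "tidal creek": ["watercourse"],
    "watercourse": ["lotic water body"],
    "lotic water body": ["water body"],
    "water body": ["environmental feature"],
    "environmental feature": [],
}
# Sample counts per directly used term, taken from the example above.
COUNTS = {"hypoxic lake": 20, "river": 30, "tidal creek": 40}


def ancestors(term):
    """Return the term itself plus everything reachable via parent edges."""
    seen, stack = set(), [term]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(PARENTS.get(current, []))
    return seen


def mrcas(a, b):
    """Common ancestors of a and b that have no other common ancestor below them."""
    common = ancestors(a) & ancestors(b)
    return {c for c in common
            if not any(d != c and c in ancestors(d) for d in common)}


def trimmed_tree(counts):
    """Keep directly used terms plus pairwise MRCAs; return (totals, parent map)."""
    keep = set(counts)
    for a, b in combinations(counts, 2):
        keep |= mrcas(a, b)
    totals = {k: sum(n for t, n in counts.items() if k in ancestors(t)) for k in keep}
    parent = {}
    for k in keep:
        above = [a for a in ancestors(k) if a != k and a in keep]
        # Assumption: the kept ancestor with the most ancestors of its own is the nearest.
        parent[k] = max(above, key=lambda a: len(ancestors(a)), default=None)
    return totals, parent


def render(totals, parent, node=None, depth=0):
    """Return the tree as indented lines, children sorted by name."""
    lines = []
    for child in sorted(k for k, p in parent.items() if p == node):
        lines.append("  " * depth + f"{child} ({totals[child]})")
        lines.extend(render(totals, parent, child, depth + 1))
    return lines


print("\n".join(render(*trimmed_tree(COUNTS))))
```

On this toy input the terms "lake", "lotic water body", and "environmental feature" drop out because they are not the MRCA of any pair of used terms, and the printed tree matches the nested list above.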

I would say that nesting the facets by rendering them as a tree in this way is a good way to provide dynamic drill-down that leverages the ontology groupings, but that is a wider UI/UX decision (trimming by MRCA removes polyhierarchical aspects; there are other strategies to get a tree rendering).

We can also interleave this strategy with curated subsets for intermediate nodes. E.g., we may decide that "watercourse" is not a useful grouping level; if we exclude it, then the MRCA of creek and river will be "lotic water body". We may decide that this is also not a useful grouping, in which case everything rolls up to "water body".
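A small sketch of how a curated exclusion set could interact with the MRCA computation. The is-a edges are again a hypothetical fragment rather than actual EnvO content, and the `excluded` parameter is an assumed design for illustration, not an existing API.

```python
# Hypothetical is-a fragment (child -> parents); illustrative only, not actual EnvO content.
PARENTS = {
    "river": ["watercourse"],
    "tidal creek": ["watercourse"],
    "watercourse": ["lotic water body"],
    "lotic water body": ["water body"],
    "water body": [],
}


def ancestors(term):
    """Return the term itself plus everything reachable via parent edges."""
    seen, stack = set(), [term]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(PARENTS.get(current, []))
    return seen


def mrcas(a, b, excluded=frozenset()):
    """Lowest common ancestors of a and b, ignoring any term in the exclusion set."""
    common = (ancestors(a) & ancestors(b)) - set(excluded)
    return {c for c in common
            if not any(d != c and c in ancestors(d) for d in common)}


print(mrcas("river", "tidal creek"))
print(mrcas("river", "tidal creek", excluded={"watercourse"}))
print(mrcas("river", "tidal creek", excluded={"watercourse", "lotic water body"}))
```

Dropping excluded terms before picking the lowest common ancestor means the grouping falls back to the next level up, which is the roll-up behavior described above.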

We could also combine the 3 facets into one hierarchy this way.

Note that this exact same strategy could be used for any hierarchical system - e.g. KEGG for the function classification.

We have code in JS and Python for doing some of this kind of thing (there are a few engineering challenges, e.g., do you load the ontology ahead of time into the client, or do pre-processing of the facets on the server side?).

Straw man proposal for proceeding

  1. Ontology group defines initial exclusion sets (e.g. astronomical body part). T-shirt size: Small
  2. Kitware implements MRCA and exclusion set filtering. T-shirt size: Medium?
  3. Kitware implements nesting/hierarchical layout of facets. T-shirt size: Medium/Large?
  4. Deploy a test database instance that has many more samples (all public samples in GOLD that are EnvO-annotated). T-shirt size: Medium/Large
  5. Iterate within NMDC, potentially expanding inclusion/exclusion sets depending on feedback
  6. Test with larger user group

@pvangay

pvangay commented May 6, 2021

@cmungall - agree re: need for a broader range of data to demonstrate value. Thanks for laying it out here and for the suggestions. #1-2 seems like a reasonable start to me but I'll let others chime in.

@dehays
Contributor

dehays commented Jul 14, 2021

@subdavis Spoke with Chris regarding the questions you raised yesterday (How do I proceed?). The two pieces I think you need to display the EnvO terms as nested in hierarchy are:

  1. Only display the Most Recent Common Ancestor (MRCA) for paths in the EnvO graph
  2. We (probably @turbomam ) will provide a list of terms to filter out of the display (i.e. the terrestrial body terms)

If you have additional questions - please comment

@jeffbaumes
Collaborator

I believe I've been able to describe a process for building the simplified tree in this notebook:

https://observablehq.com/@jeffbaumes/ontology-directed-acyclic-graph-simplification

This is in JavaScript, but we could implement it similarly in Python at data ingestion and make the result available to the client as a static tree to use for navigating EnvO.

@cmungall does this match what you had in mind?

@ssarrafan
Author

@cmungall I will leave this assigned to you for now and move to the August sprint. Let me know if it should be assigned to someone else.

@ssarrafan ssarrafan removed this from To do in NMDC July 2021 Sprint Jul 30, 2021
@ssarrafan ssarrafan added this to To do in NMDC August 2021 Sprint via automation Jul 30, 2021
@ssarrafan ssarrafan modified the milestones: Sprint 4, Sprint 5 Jul 30, 2021
@cmungall
Contributor

@jeffbaumes - as discussed briefly in the call the other day, I think this is great for a first iteration. I think the step where you make it a tree could lose information that may be relevant to the optimal trimmed tree, but we can try more later. It will certainly be better than a flat list!

And, as also discussed, this should naturally take care of filtering the non-informative upper-level terms.

@ssarrafan
Author

@jeffbaumes and @subdavis do you need anything else from anyone for this issue? Let me know if I can help.

@jeffbaumes jeffbaumes assigned zachmullen and unassigned cmungall Aug 17, 2021
@jeffbaumes
Collaborator

@zachmullen see microbiomedata/nmdc-ontology#4 (comment) for the new data.

@ssarrafan
Author

@zachmullen and @jeffbaumes any update on this? Are you still actively working on this? I can move it to the September sprint but if you're not working on it I can remove it and add the backlog label. Let me know. Thank you.

@ssarrafan ssarrafan removed this from In progress in NMDC August 2021 Sprint Sep 1, 2021
@ssarrafan ssarrafan added this to To do in NMDC September 2021 Sprint via automation Sep 1, 2021
@ssarrafan ssarrafan modified the milestones: Sprint 5, Sprint 6 Sep 1, 2021
@zachmullen
Contributor

I think we can call this one done.

@ssarrafan
Author

> I think we can call this one done.

That's great! I'll close it. Thank you.

9 participants