Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add MassSpectrometryConfiguration and Configuration classes #130

Merged
merged 24 commits into from
May 13, 2024

Conversation

kheal
Copy link

@kheal kheal commented Apr 12, 2024

This PR addresses microbiomedata#1913.

As currently modeled, each MassSpectrometry instance (currently applicable to about 2000 process samples), correspond to only 4 configurations in which the mass spectrometry instrument was actually used. We propose to have a MassSpectrometryConfiguration class that houses the configuration information which would 1) greatly reduce repeating data on MassSpectrometry instances and 2) allow for workflows to be configured based on MassSpectrometryConfiguration instance and 3) allow users to query samples that were run with similar configurations for downstream comparisons.

This would move several slots from MassSpectrometry to MassSpectrometryConfiguration and associate the configuration through a new has_configuration slot (see below for diagram).

Note that all slots inherited from PlannedProcess (like start_date, has_input) will remain on MassSpectrometry
Screenshot 2024-04-12 at 12 17 42 PM

This same configuration abstraction pertains to ChromatographicSeparationProcess (see microbiomedata#1920), so we've also created an abstract Configuration parent class.

@kheal kheal self-assigned this Apr 12, 2024
@kheal kheal requested review from turbomam and removed request for turbomam April 12, 2024 19:33
@kheal kheal marked this pull request as ready for review April 12, 2024 19:38
@kheal kheal requested a review from turbomam April 12, 2024 19:38
@turbomam
Copy link
Collaborator

turbomam commented Apr 12, 2024

Presumably we will need a configuration_set collection. It would be good to include that in this PR, along with an example data file that instantiates a Database and includes a DataGeneration and a Configuration

Configuration:
abstract: true
class_uri: nmdc:Configuration
description: A platform that enables the user to configure the appearance, actions, and other usage preferences on a process.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In what way is a configuration a "platform"?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally we were thinking a computer platform. We struggled with making this (and associated) descriptions and are open to other suggestions. Maybe "A framework that enables...."?

Copy link
Author

@kheal kheal Apr 18, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've updated the definition of the Configuration class to "A programmable set of parameters that define the actions of a process and is shared among multiple instances of the process.". I think that addresses some of the outstanding issues regarding understanding when we would use this class to hold configuration information.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the case of MassSpectrometryConfiguration, this would be 1:1 with a method file that we use to run the mass spectrometer.

@@ -686,6 +704,12 @@ slots:
range: FluidHandling
description: how a processed sample is introduced into a mass spectrometer, for example liquid chromatography or direct infusion through a syringe

has_configuration:
any_of:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why would the range be Configuration in some cases and DataObject in other cases?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should go to great lengths to avoid any_of ranges. They are complex to reason over and diagram. That may change in the future.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@brynnz22 and I were thinking ahead to accommodate workflow configuration files (DataObjects) that will be associated with a has_configuration slot on WorkflowExcecution class. See microbiomedata#1912 .

Copy link
Collaborator

@turbomam turbomam Apr 12, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would like to make sure it's very clear when we

  • embed configuration-like data in an instance of a PlannedProcess
  • associate configuration-like data with a process via a DataObject, which is a pointer to a configuration file (but doesn't actually manifest the settings in the instance)
  • embed configuration-like data in instances of the new Configuration class

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A review of those configuration embedding/linking options may shine some additional light on the idea of a configurations as platforms

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry I do not quite understand this.

I would like to make sure it's very clear we we

  • embed configuration-like data in an instance of a PlannedProcess
  • associate configuration-like data with a process via a DataObject, which is a pointer to a configuration file (but doesn't actually manifest the settings in the instance)
  • embed configuration-like data in instances of the new Configuration class

Are there actions we can take in the PR to meet this?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this PR adds a new way of capturing configuration, it is required that this PR includes documentation (or additional schema constraints) on on how it relates to existing ways of capturing configuration information.

@kheal
Copy link
Author

kheal commented Apr 12, 2024

Presumably we will need a configuration_set collection. It would be good to include that in this PR, along with an example data file that instantiates a Database and includes a DataGeneration and a Configuration

We have a few following issues related to configurations (microbiomedata#1912 and microbiomedata#1920), and we'll plan to make a configuration_set with one of those, would that work?

@turbomam
Copy link
Collaborator

@cmungall @sierra-moxon and I just started discussing this today.

This PR has a lot of good features, like a small size, inclusion of examples and the abstract Configuration parent class. @cmungall is in favor of merging small PRs like this quickly, but this one will need some refactoring and or better documentation.

We can help more if

  • links to real-world data are provided, like some of the 2000 process samples
  • modeling is kept as simple as possible. any_of ranges (unions), rules and slot_usages should be kept to an absolute minimum. Use of those features can have dramatic impact on the ability to use LinkML-related tools, which may not be readily apparent in the existing tests.
  • the bodies of issues or PRs include a justification for the change and a holistic review of existing modeling in the schema. The justification is pretty good here, but there are other issues from this squad that are just a title.

@turbomam
Copy link
Collaborator

If it's currently necessary for processes to indicate that they were configured by either a Configuration instance or a DataObject instance, @cmungall suggested using two different slots, like has_configuration and configuration_data_object. If they aren't really both needed now, let's just use one slot and one range.

@turbomam
Copy link
Collaborator

We will also have to think about the life cycle of Configuration instances in MongoDB. Who gets to add them? How? Who is responsible for ensuring that we don't have duplicates?

@kheal
Copy link
Author

kheal commented Apr 12, 2024

  • links to real-world data are provided, like some of the 2000 process samples

I can work on this using the runtime API, but just as a simple demonstrative, all the Natural Organic Matter process samples that have resulted in data objects currently available in the data portal (aka FT ICR-MS analysis results as shown below) would share a single MassSpectrometryConfiguration object. In the data portal, I'm seeing this as 2,549 DataObjects each originating from their own DataGeneration instance.
Screenshot 2024-04-12 at 4 17 41 PM

@kheal
Copy link
Author

kheal commented Apr 18, 2024

We will also have to think about the life cycle of Configuration instances in MongoDB. Who gets to add them? How? Who is responsible for ensuring that we don't have duplicates?

I think this is outside of the scope of the modeling at this moment, but MassSpectrometryConfigurations should be easy to generate from DMS (actually easier than generating MassSpectrometry instances, at the moment). Naming a Configuration instance would help with dealing with duplicates and deciding if we need to add an additional I've added an inheritance of NamedThing to Configuration to facilitate this.

@kheal
Copy link
Author

kheal commented May 1, 2024

I've made the following updates after PRs #142 and #141

  1. Used updated slot names and update examples to multivalued (after Edit slots and usages to accommodate multiple detection configurations for single MassSpectrometry instance #141)
  2. Made abstract Configuration class inherit from InformationObject class.

@turbomam, this is ready for re-review.

@SamuelPurvine, please take a look, especially the example file https://github.com/microbiomedata/berkeley-schema-fy24/blob/mass_spec_config/src/data/valid/Database-mass-spectrometry.yaml which holds two example DataGeneration (specifically MassSpectrometry) instances with their respective MassSpectrometryConfigurations. In practice, we'll likely load many more MassSpectrometry instances paired with very few MassSpectrometryConfigurations.

@kheal
Copy link
Author

kheal commented May 8, 2024

has_output:
- nmdc:dobj-00-9n9n9n
has_configuration: nmdc:mscon-99-oW43DzG0
eluent_introduction: nmdc:cspro-99-hello00
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do you know if we have any example data file that populate an eluant introduction instance? Is that still going to be relevant after you do some chromatography configuration modeling?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this will still be relevant. I will provide an extend these MassSpectrometry examples alongside the parallel work I've been doing for the ChromatographicSeparationConfiguration work.

@@ -727,6 +754,10 @@ slots:
range: FluidHandling
description: how a processed sample is introduced into a mass spectrometer, for example liquid chromatography or direct infusion through a syringe

has_configuration:
range: Configuration
description: a set of parameters that define the actions of the process
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am going to merge this but make a PR afterwards, with you as an approver, to change this description. As a matter of principle, we can't use near-identical descriptions for has_configuration and and Configuration

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good, thanks. I could definitely use support in proper/ontologically correct descriptions.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
2 participants