DO NOT MERGE: Cumulative configurations etc (reuse as documentation only) #137

Closed
wants to merge 51 commits into from


turbomam
Member

@turbomam turbomam commented Apr 23, 2024

This merge is intended as input to an LLM analysis of the various configuration, calibration, and related mechanisms in NMDC's bleeding-edge berkeley-schema-fy24.

Use code like the following:

import vertexai
from vertexai.generative_models import GenerativeModel, Part, GenerationConfig
import requests

project_id = "820943527916"
gc_location = "us-east4"
vertex_model = "gemini-1.5-pro-preview-0409"

# Sampling temperature; the analyses posted below were generated at 0 and at 2
vertex_temp = 2.0
vertex_max_tokens_out = 8192

prompt_alone = """
I am concerned about all of the different ways to model configurations in this LinkML schema.
Some processes have slots for describing their configuration inline.
Others can be associated with a Configuration subclass or a calibration DataObject.
We also have some protocol modeling. I may be overlooking other configuration patterns.
Please provide a thorough analysis on the topic of configuration patterns in the schema.
Provide suggestions for simplifying if possible.
I would like to align our modeling with the Ontology for Biomedical Investigations as much as possible,
possibly using OBI:0000654 device setting or a subclass of IAO:0000033 directive information entity.
Having said that, our modeling will probably have to be flatter than OBI.
"""

nmdc_schema_uri = "https://raw.githubusercontent.com/microbiomedata/berkeley-schema-fy24/cumulative-configurations/nmdc_schema/nmdc_schema_merged.yaml"
# nmdc_schema_content = Part.from_uri(nmdc_schema_uri, mime_type="text/yaml")

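# Download the merged schema as plain text; a timeout keeps the call from hanging indefinitely
schema_response = requests.get(nmdc_schema_uri, timeout=60)
schema_response.raise_for_status()  # Raise an exception on an HTTP error status

nmdc_schema_content = schema_response.text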


contents = [prompt_alone, nmdc_schema_content]

vertexai.init(project=project_id, location=gc_location)

model = GenerativeModel(vertex_model)

generation_config = GenerationConfig(
    temperature=vertex_temp,
    # top_p=1.0,
    # top_k=32,
    # candidate_count=1,
    max_output_tokens=vertex_max_tokens_out,
)

response = model.generate_content(
    contents,
    generation_config=generation_config,
)

print(response.text)
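Since the comments below compare output at two temperatures, a small helper could collect one response per temperature. This is only a sketch: `compare_temperatures` and its `generate` parameter are hypothetical names, and the Vertex AI call is passed in as a callable so the helper itself has no dependency on the SDK.

```python
from typing import Callable, Dict, Iterable


def compare_temperatures(
    generate: Callable[[float], str],
    temperatures: Iterable[float],
) -> Dict[float, str]:
    """Call `generate` once per temperature and return the text keyed by temperature."""
    return {temperature: generate(temperature) for temperature in temperatures}


# Assumed wiring with the names from the script above (not executed here):
#
# def generate(temperature: float) -> str:
#     config = GenerationConfig(
#         temperature=temperature,
#         max_output_tokens=vertex_max_tokens_out,
#     )
#     return model.generate_content(contents, generation_config=config).text
#
# results = compare_temperatures(generate, [0.0, 2.0])
```

Keeping the model call behind a callable also makes it easy to substitute a stub when checking the surrounding code.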

brynnz22 and others added 30 commits April 9, 2024 16:16
and has_configuration slot
…n.yaml

Needed because nmdc.yaml inherits from workflow_execution.yaml but not vice versa, so in order to use class on both, need to define on workflow_execution.yaml.
@turbomam turbomam marked this pull request as draft April 23, 2024 14:20
@turbomam
Member Author

With a temperature of 0:

Analysis of Configuration Patterns in the NMDC Schema

The NMDC schema presents several patterns for modeling configurations, leading to potential complexity and inconsistency. Here's a breakdown of the observed patterns:

1. Inline Configuration Slots:

  • Some processes, like MassSpectrometry, have dedicated slots (e.g., acquisition_category, mass_analyzer) to describe configuration parameters directly within the process instance.

2. Configuration Subclasses:

  • The schema defines an abstract Configuration class with subclasses like MassSpectrometryConfiguration to hold configuration parameters separately. These are then linked to processes via the has_configuration slot.

3. Calibration DataObjects:

  • The has_calibration slot in classes like MassSpectrometry and ChromatographicSeparationProcess can point to a Calibration instance or a string representing a calibration file path. The Calibration class itself has slots for specific calibration parameters.

4. Protocol Modeling:

  • The schema includes Protocol and ProtocolExecution classes. While not directly representing configurations, protocols can implicitly define configurations through the steps and parameters they specify.

5. DataObjects for Software-Specific Settings:

  • The workflow_configuration_objects slot allows associating DataObject instances with workflows to capture software-specific settings.
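
The patterns listed above could be tallied per class with a short script. The following is a minimal sketch, assuming the merged schema has been loaded into a Python dict with the usual LinkML `classes` / `slots` layout. The toy schema, the `INLINE_SLOTS` grouping, and the pattern labels are illustrative assumptions, not excerpts from the NMDC schema, and the function only inspects slots listed directly on a class (it ignores inheritance and `slot_usage`).

```python
from typing import Dict, List, Set

# Slot names are taken from the analysis above; how they are grouped is an assumption.
INLINE_SLOTS: Set[str] = {"acquisition_category", "mass_analyzer"}
LINK_SLOTS: Dict[str, str] = {
    "has_configuration": "configuration_subclass",
    "has_calibration": "calibration",
    "workflow_configuration_objects": "data_object_settings",
}


def configuration_patterns(schema: dict) -> Dict[str, List[str]]:
    """Map each class name to the sorted configuration patterns its direct slots suggest."""
    found: Dict[str, List[str]] = {}
    for class_name, class_def in schema.get("classes", {}).items():
        slots = set((class_def or {}).get("slots", []))
        patterns = {label for slot, label in LINK_SLOTS.items() if slot in slots}
        if slots & INLINE_SLOTS:
            patterns.add("inline_slots")
        if patterns:
            found[class_name] = sorted(patterns)
    return found


# Hypothetical, heavily trimmed stand-in for the merged schema.
toy_schema = {
    "classes": {
        "MassSpectrometry": {
            "slots": [
                "acquisition_category",
                "mass_analyzer",
                "has_configuration",
                "has_calibration",
            ]
        },
        "ChromatographicSeparationProcess": {"slots": ["has_calibration"]},
        "Protocol": {"slots": ["name"]},
    }
}

print(configuration_patterns(toy_schema))
```

A class that shows up with more than one label is a candidate for the consolidation discussed in the suggestions that follow.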

Suggestions for Simplification and Alignment with OBI

1. Standardize on Configuration Subclasses:

  • Instead of inline configuration slots, consistently use Configuration subclasses for all processes. This promotes modularity, reusability, and easier management of configuration parameters.

2. Leverage OBI or IAO for Configuration Modeling:

  • Align the Configuration class with OBI:0000654 (device setting) or a subclass of IAO:0000033 (directive information entity). This ensures interoperability with other ontologies and facilitates data integration.

3. Flatten the Hierarchy:

  • While aligning with OBI, consider a flatter hierarchy than OBI's detailed structure. This simplifies the schema and avoids unnecessary complexity for NMDC's specific needs.

4. Reconsider Calibration Modeling:

  • Evaluate if the Calibration class is necessary or if its information can be incorporated into the relevant Configuration subclass. This reduces redundancy and streamlines the schema.

5. Clarify Protocol Usage:

  • Clearly define the relationship between protocols and configurations. If protocols implicitly define configurations, consider explicitly linking them to the corresponding Configuration instances.

6. Evaluate DataObjects for Software Settings:

  • Assess if using DataObject instances for software settings is the most efficient approach. Consider alternative options like dedicated configuration slots or controlled vocabularies within the Configuration subclasses.

Implementation Considerations

  • Transition Plan: Develop a strategy for migrating existing data to the new configuration model.
  • Tooling Support: Ensure that existing tools and applications can work with the updated schema.
  • Community Engagement: Discuss the proposed changes with the NMDC community and gather feedback.
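
One possible shape for the transition step is a function that lifts inline slots out of a process record into a separate configuration record and links the two through has_configuration. This is a sketch under stated assumptions: the record layout, the `ex:` identifiers, and the slot values are hypothetical, and `extract_configuration` is not an existing NMDC migrator.

```python
from typing import Dict, Iterable, Tuple


def extract_configuration(
    process: Dict[str, object],
    inline_slots: Iterable[str],
    config_id: str,
    config_type: str,
) -> Tuple[Dict[str, object], Dict[str, object]]:
    """Return (migrated_process, configuration) without mutating the input record."""
    inline = set(inline_slots)
    configuration: Dict[str, object] = {"id": config_id, "type": config_type}
    migrated: Dict[str, object] = {}
    for slot, value in process.items():
        if slot in inline:
            configuration[slot] = value  # move the setting to the configuration record
        else:
            migrated[slot] = value  # keep everything else on the process
    migrated["has_configuration"] = config_id  # link by identifier
    return migrated, configuration


# Hypothetical record; identifiers and values are placeholders.
old_record = {
    "id": "ex:ms-1",
    "type": "MassSpectrometry",
    "acquisition_category": "example_category",
    "mass_analyzer": "example_analyzer",
}

new_record, new_config = extract_configuration(
    old_record,
    inline_slots=["acquisition_category", "mass_analyzer"],
    config_id="ex:msconfig-1",
    config_type="MassSpectrometryConfiguration",
)
print(new_record)
print(new_config)
```

Returning new dicts rather than editing in place leaves the original records available for comparison during a migration.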

Conclusion

By adopting a consistent and standardized approach to configuration modeling, the NMDC schema can be simplified and aligned with established ontologies like OBI. This will improve data interoperability, facilitate data integration, and enhance the overall usability of the schema.

@turbomam
Member Author

With a temperature of 2:

Analysis of Configuration Patterns in the NMDC Schema

The NMDC schema presents a multifaceted approach to modeling configurations, leading to potential complexity and challenges in data integration and interpretation. Let's delve into the different patterns observed:

1. Inline Configuration Slots:

  • Certain processes like MassSpectrometry and ChromatographicSeparationProcess have dedicated slots for directly specifying configuration parameters, e.g., acquisition_category and mass_analyzer.
  • This approach offers a simple and direct way to capture specific configuration details but can lead to a proliferation of slots and schema complexity.

2. Configuration Subclasses:

  • The schema utilizes abstract classes like Configuration and its concrete subclasses like MassSpectrometryConfiguration to model configuration parameters as independent entities.
  • This promotes reusability and modularity but necessitates managing and linking separate configuration entities to their respective processes.

3. Data Objects for Calibration:

  • Calibration information, a crucial aspect of configuration, is sometimes represented using DataObject entities, especially for chromatographic processes.
  • This provides flexibility in capturing diverse calibration data formats but introduces another layer of indirection and potential confusion.

4. Protocol Modeling:

  • Protocol classes offer a way to document standardized procedures but are not directly linked to configuration parameters. This can result in ambiguity regarding specific configurations used during protocol execution.

Suggestions for Simplification and Alignment with OBI:

1. Utilize OBI and IAO Classes:

  • Leverage the existing ontological framework of OBI and IAO for configuration modeling.
  • Consider adopting OBI:0000654 (device setting) or a subclass of IAO:0000033 (directive information entity) as a central class for representing configuration parameters.
  • This ensures semantic consistency and interoperability with other biomedical ontologies.

2. Introduce a Dedicated Configuration Class:

  • Introduce a new class, perhaps NMDCConfiguration, as a subclass of OBI:0000654 or IAO:0000033.
  • This class can hold relevant configuration parameters as slots, such as:
    • has_setting_value (with range encompassing diverse data types)
    • has_unit (for units of measurement)
    • has_parameter_type (referencing relevant ontology terms)

3. Standardize Calibration Modeling:

  • Move away from using DataObject for calibration and integrate calibration parameters as slots within the dedicated configuration class or as instances linked to it.
  • This simplifies data representation and improves clarity regarding calibration details.

4. Link Processes to Configurations:

  • Establish a direct link between processes and configuration entities through a dedicated slot, such as has_configuration, with the new NMDCConfiguration class as its range.
  • This provides explicit association and simplifies querying for specific configurations used in processes.

5. Refactor Inline Configuration Slots:

  • Gradually refactor existing inline configuration slots, replacing them with references to the dedicated configuration class. This process may require data migration and schema evolution, ensuring backward compatibility.

6. Improve Protocol Integration:

  • Explore ways to more tightly integrate Protocol classes with configuration entities. This might involve adding a slot to Protocol for associating it with a specific configuration or using annotations to explicitly indicate parameter settings.
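
The dedicated class and the has_configuration link suggested above could look roughly like this in plain Python. It is a sketch only: the class and slot names come from the suggestions, but treating has_configuration as a multivalued list of identifiers, the `Process` stand-in, the `resolve_configurations` helper, and every `ex:` value are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Union


@dataclass(frozen=True)
class NMDCConfiguration:
    """One setting; intended to map to OBI:0000654 (device setting) per the suggestion."""

    id: str
    has_parameter_type: str  # identifier of an ontology term (placeholder values below)
    has_setting_value: Union[str, int, float, bool]
    has_unit: Optional[str] = None


@dataclass
class Process:
    """Hypothetical stand-in for any process class that carries has_configuration."""

    id: str
    has_configuration: List[str] = field(default_factory=list)  # configuration ids


def resolve_configurations(
    process: Process, registry: Dict[str, NMDCConfiguration]
) -> List[NMDCConfiguration]:
    """Look up the configuration objects a process points to, in declared order."""
    return [registry[config_id] for config_id in process.has_configuration]


# Placeholder data for illustration only.
flow_rate = NMDCConfiguration(
    id="ex:config-1",
    has_parameter_type="ex:flow_rate",
    has_setting_value=0.5,
    has_unit="mL/min",
)
registry = {flow_rate.id: flow_rate}
run = Process(id="ex:process-1", has_configuration=["ex:config-1"])

for setting in resolve_configurations(run, registry):
    print(setting.has_parameter_type, setting.has_setting_value, setting.has_unit)
```

Because configurations are referenced by identifier, the same `NMDCConfiguration` instance can be shared by several processes.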

Benefits of Simplifying Configuration Modeling:

  • Increased Clarity and Consistency: Simplifying configuration patterns promotes a clearer understanding of the relationships between processes, protocols, and configurations.
  • Improved Data Integration: Standardized configuration modeling enhances data integration and interoperability with other ontologies and datasets.
  • Enhanced Search and Retrieval: A dedicated configuration class enables easier searching and retrieval of specific configurations used in experiments.
  • Greater Reusability and Modularity: Using OBI and IAO classes fosters reusability and modularity, allowing configuration entities to be linked to diverse processes.

Considerations and Tradeoffs:

  • Flattening OBI Hierarchy: The need for a flatter model in NMDC may require creating subclasses of OBI classes or using slots to capture specific details. Careful consideration should be given to maintaining semantic consistency with OBI.
  • Data Migration and Schema Evolution: Refactoring existing configuration patterns may require data migration and schema evolution. This should be managed in a controlled and well-documented manner.

Overall, adopting a standardized approach using OBI and IAO principles can significantly enhance configuration modeling in the NMDC schema. While some complexities exist, the benefits of improved clarity, data integration, and reusability outweigh the challenges. Careful planning and implementation will be key to ensuring a successful transition.

@turbomam
Member Author

Added some questions about instruments to the prompt:

Analysis of Configuration Patterns in the NMDC Schema

The NMDC schema presents several patterns for modeling configurations, each with its own advantages and disadvantages:

1. Inline Configuration Slots:

  • Examples: pcr_cycles, library_preparation_kit, input_mass
  • Advantages: Simple and direct, easy to understand and implement.
  • Disadvantages: Can lead to redundancy if the same configuration is used in multiple processes, makes it difficult to manage and update configurations independently.

2. Configuration Subclasses:

  • Examples: MassSpectrometryConfiguration, ChromatographicSeparationConfiguration
  • Advantages: Allows for centralized management of configurations, promotes reusability and reduces redundancy.
  • Disadvantages: Adds complexity to the schema, requires additional classes and relationships.

3. Calibration DataObjects:

  • Examples: Calibration class with calibration_file slot
  • Advantages: Allows for storing detailed calibration data, including reference standards and target measurements.
  • Disadvantages: Introduces another layer of indirection, requires managing separate data objects.

4. Protocol Modeling:

  • Examples: ProtocolExecution class with has_process_parts slot
  • Advantages: Captures the sequence of steps in a protocol, allows for associating protocols with specific processes.
  • Disadvantages: Can be complex to model and implement, may not be necessary for all use cases.

5. Instrument Specification:

  • Examples: instrument_used slot in PlannedProcess
  • Advantages: Provides information about the instrument used in a process, allows for linking processes to specific instruments and their configurations.
  • Disadvantages: May not be relevant for all processes, requires managing instrument information separately.
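
The redundancy mentioned for inline slots can be made concrete by grouping process records that carry identical inline values. This is a minimal sketch: the slot names are the examples listed above, while the record layout, the `ex:` identifiers, and the values are hypothetical.

```python
import json
from collections import defaultdict
from typing import Dict, Iterable, List, Sequence

INLINE_SLOTS = ("pcr_cycles", "library_preparation_kit", "input_mass")


def group_by_inline_configuration(
    processes: Iterable[dict], slots: Sequence[str] = INLINE_SLOTS
) -> Dict[str, List[str]]:
    """Return process ids grouped by identical inline values, keeping groups of two or more."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for process in processes:
        # JSON with sorted keys gives a stable key even when a value is a nested dict.
        key = json.dumps([process.get(slot) for slot in slots], sort_keys=True)
        groups[key].append(process["id"])
    return {key: ids for key, ids in groups.items() if len(ids) > 1}


# Placeholder records for illustration only.
records = [
    {"id": "ex:lp-1", "pcr_cycles": 30, "library_preparation_kit": "example kit"},
    {"id": "ex:lp-2", "pcr_cycles": 30, "library_preparation_kit": "example kit"},
    {"id": "ex:lp-3", "pcr_cycles": 25, "library_preparation_kit": "example kit"},
]

for shared_values, ids in group_by_inline_configuration(records).items():
    print(shared_values, ids)
```

Each group that comes back is a set of processes that could point to a single shared configuration record instead of repeating the same values.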

Additional Patterns:

  • Controlled Vocabulary Terms: Several slots use controlled vocabularies to restrict possible values, such as analyte_category and extraction_target. This improves data consistency and facilitates data integration.
  • Quantity Values: The QuantityValue class allows for capturing both numeric values and units, ensuring data accuracy and interoperability.

Suggestions for Simplification and Alignment with OBI

  • Consolidate Inline Configuration Slots: Consider creating dedicated configuration classes for commonly used configurations, reducing redundancy and improving maintainability.
  • Leverage OBI Classes: Utilize OBI classes like OBI:0000654 (device setting) or subclasses of IAO:0000033 (directive information entity) to represent configurations, aligning the schema with established ontologies.
  • Flatten the OBI Hierarchy: Adapt the OBI hierarchy to fit the specific needs of NMDC, potentially creating a flatter structure for easier implementation.
  • Standardize Instrument Modeling: Ensure consistent use of the instrument_used slot and consider creating a dedicated instrument class with relevant properties.
  • Harmonize Terminology: Align the terminology used in the schema with OBI and other relevant ontologies, promoting interoperability and data sharing.

Conclusion

The NMDC schema offers various configuration modeling approaches, each with its own strengths and weaknesses. By analyzing these patterns and considering the suggestions for simplification and alignment with OBI, the schema can be further refined to improve data consistency, reusability, and interoperability.

@turbomam turbomam changed the title Cumulative configurations etc DO NOT MERGE: Cumulative configurations etc (reuse as documentation only) May 21, 2024
@turbomam turbomam closed this Sep 13, 2024
@turbomam turbomam deleted the cumulative-configurations branch September 13, 2024 14:47

3 participants