# Using OntoGPT for Environmental and Earth Science Data Extraction

Last updated Oct 9, 2024

The following examples demonstrate basic functionality of OntoGPT and the SPIRES method for extracting and integrating data (i.e., concepts and relationships) from texts in the environmental and earth science domains.
These examples assume use of the LBNL CBORG computing resource.

## Setup

In [None]:
%pip install ontogpt
import yaml
import pprint

## Creating a template for extracting EcoSIM terms

EcoSIM is on BioPortal here: https://bioportal.bioontology.org/ontologies/ECOSIM

Let's start with an extraction template that finds *any* EcoSIM term, then refine it.

In [28]:
text = \
"""id: http://w3id.org/ontogpt/ecosim_simple
name: ecosim_simple
title: Simple EcoSIM Extraction Template
description: >-
  Simple EcoSIM Extraction Template
license: https://creativecommons.org/publicdomain/zero/1.0/
prefixes:
  rdf: http://www.w3.org/1999/02/22-rdf-syntax-ns#
  linkml: https://w3id.org/linkml/
  ecosim_simple: http://w3id.org/ontogpt/ecosim_simple
  ecosim: http://purl.obolibrary.org/obo/ecosim

default_prefix: ecosim_simple
default_range: string

imports:
  - linkml:types
  - core

classes:
  TermSet:
    tree_root: true
    is_a: NamedEntity
    attributes:
      terms:
        range: Term
        multivalued: true
        description: >-
          A semicolon-separated list of variables
          for earth system simulation. Do not include
          abbreviations in parentheses, e.g., "Carbon (C)"
          should be represented as "carbon". Examples include:
          carboxylation, sodium, underground irrigation.

  Term:
    is_a: NamedEntity
    annotations:
      annotators: bioportal:ECOSIM
      prompt: >-
        The name of a variable for earth system simulation.
"""
with open('ecosim_simple.yaml', 'w') as outfile:
    outfile.write(text)
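
Before running an extraction, it can help to confirm which classes in the template name an annotator for grounding. The following is a minimal sketch, not part of OntoGPT itself: `annotated_classes` is a hypothetical helper that takes a template already parsed into a dictionary (in this notebook, `yaml.safe_load(text)` would produce one) and reports the `annotators` value for each class. The inline `schema` dictionary only mirrors the relevant part of the template above.

```python
def annotated_classes(schema):
    """Map each class name to its `annotators` annotation, or None if absent."""
    result = {}
    for name, definition in (schema.get("classes") or {}).items():
        annotations = (definition or {}).get("annotations") or {}
        result[name] = annotations.get("annotators")
    return result


# A cut-down copy of the template structure above, written as a plain dict.
schema = {
    "classes": {
        "TermSet": {"tree_root": True, "is_a": "NamedEntity"},
        "Term": {
            "is_a": "NamedEntity",
            "annotations": {"annotators": "bioportal:ECOSIM"},
        },
    }
}

for class_name, annotators in annotated_classes(schema).items():
    print(class_name, "->", annotators or "no annotator listed")
```

With the full template, the same call on `yaml.safe_load(text)` should list `Term` with `bioportal:ECOSIM` and `TermSet` with no annotator.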

Let's retrieve a particular set of method descriptions from an ESS-DIVE entry.

In [None]:
!wget -O essdive_test.csv https://data.ess-dive.lbl.gov/catalog/d1/mn/v2/object/ess-dive-212bbaf7e0d1597-20240919T163005378

Now a quick extraction (should take about a minute).

In [None]:
# Replace the API key with your own
!export OPENAI_API_KEY="" && ontogpt -vvv extract -t ecosim_simple.yaml -i essdive_test.csv -m lbl/cborg-chat:latest -o output.yaml --model-provider openai --api-base "https://api.cborg.lbl.gov"
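
The same command can be assembled from Python, which makes it easier to reuse across templates and to read the key from the environment instead of typing it into a cell. This is a minimal sketch: `build_extract_command` is a hypothetical helper, and the flags are exactly those used in the cell above. Nothing is executed here; the command is only built and printed.

```python
import os
import shlex


def build_extract_command(template, input_path, output_path,
                          model="lbl/cborg-chat:latest",
                          api_base="https://api.cborg.lbl.gov"):
    """Return the ontogpt CLI invocation as an argument list."""
    return [
        "ontogpt", "-vvv", "extract",
        "-t", template,
        "-i", input_path,
        "-m", model,
        "-o", output_path,
        "--model-provider", "openai",
        "--api-base", api_base,
    ]


cmd = build_extract_command("ecosim_simple.yaml", "essdive_test.csv", "output.yaml")
print(shlex.join(cmd))

# The key is expected in the environment, as in the shell cell above.
if not os.environ.get("OPENAI_API_KEY"):
    print("OPENAI_API_KEY is not set; export it before running the command.")
```

To actually run it, `subprocess.run(cmd, check=True)` would inherit `OPENAI_API_KEY` from the notebook's environment.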

In [None]:
# Read the output file
with open('output.yaml', 'r') as infile:
    output1 = yaml.safe_load(infile)
pprint.pprint(output1)

In [None]:
for entity in output1["named_entities"]:
    # Entities that were not matched to an ontology term carry the AUTO prefix
    prefix = entity["id"].split(":")[0]
    if prefix == "AUTO":
        print("NOT GROUNDED -> " + entity["label"])
    else:
        print(entity["id"], entity["label"])
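
A quick tally by identifier prefix gives a rough sense of how much of the extraction was grounded. The sketch below assumes the same entity structure used in the loop above (a list of dictionaries with `id` and `label`); `grounding_summary` is a hypothetical helper, and the sample identifiers are made up for illustration rather than taken from real output.

```python
from collections import Counter


def grounding_summary(named_entities):
    """Count entities by the prefix of their identifier (the part before ':')."""
    counts = Counter()
    for entity in named_entities:
        prefix = str(entity.get("id", "")).split(":", 1)[0]
        counts[prefix] += 1
    return counts


# Hypothetical entities; real identifiers come from output1["named_entities"].
sample_entities = [
    {"id": "ECOSIM:0000001", "label": "example grounded term"},
    {"id": "AUTO:soil%20core", "label": "soil core"},
    {"id": "AUTO:field%20site", "label": "field site"},
]

summary = grounding_summary(sample_entities)
total = sum(summary.values())
for prefix, count in summary.most_common():
    print(f"{prefix}: {count} of {total}")
```

In the notebook, `grounding_summary(output1["named_entities"])` would produce the same kind of tally for the real extraction.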

Now let's attempt to extract some more specific concepts and relationships.
We'll make a new template first.

In [36]:
text = \
"""id: http://w3id.org/ontogpt/ecosim_methods
name: ecosim_methods
title: EcoSIM Methods Extraction Template
description: >-
  EcoSIM Methods Extraction Template
license: https://creativecommons.org/publicdomain/zero/1.0/
prefixes:
  rdf: http://www.w3.org/1999/02/22-rdf-syntax-ns#
  linkml: https://w3id.org/linkml/
  ecosim_methods: http://w3id.org/ontogpt/ecosim_methods
  ecosim: http://purl.obolibrary.org/obo/ecosim

default_prefix: ecosim_methods
default_range: string

imports:
  - linkml:types
  - core

classes:
  TermSet:
    tree_root: true
    is_a: NamedEntity
    attributes:
      locations:
        range: Location
        multivalued: true
        description: >-
          A semicolon-separated list of research locations.
          Examples include: Vermont, New York City,
          Ethiopia
      methods:
        range: Method
        multivalued: true
        description: >-
          A semicolon-separated list of methods used in
          environmental and earth science research. Examples
          include: sampling, spectroscopy
      variables:
        range: Variable
        multivalued: true
        description: >-
          A semicolon-separated list of variables measured in
          environmental and earth science research. Examples
          include: root shape, biomass, water turbidity
      equipment:
        range: Equipment
        multivalued: true
        description: >-
          A semicolon-separated list of equipment used in
          environmental and earth science research.
      equipment_to_variable_relationships:
        range: EquipmentMeasuresVariable
        description: >-
          A semicolon-separated list of relationships
          between specific equipment and variables
          they are used to measure as described in the input.
          Example: NMR spectrometer was used to measure
          chemical content
        multivalued: true
        inlined: true

  Location:
    is_a: NamedEntity
    annotations:
      prompt: >-
        The name of a location used in research.

  Method:
    is_a: NamedEntity
    annotations:
      annotators: bioportal:ECOSIM
      prompt: >-
        The name of a method used in environmental and
        earth science research.

  Variable:
    is_a: NamedEntity
    annotations:
      annotators: bioportal:ECOSIM
      prompt: >-
        The name of a variable measured in environmental and
        earth science research.

  Equipment:
    is_a: NamedEntity
    annotations:
      prompt: >-
        The name of a piece of equipment used in
        environmental and earth science research.

  EquipmentMeasuresVariable:
    is_a: CompoundExpression
    attributes:
      equipment:
        range: Equipment
        description: Name of the equipment used to measure a variable.
      variable:
        range: Variable
        description: Name of the variable being measured.

"""
with open('ecosim_methods.yaml', 'w') as outfile:
    outfile.write(text)
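
In LinkML, `multivalued: true` marks an attribute that holds more than one value, so an attribute whose description asks for a list would normally set it. The sketch below is a rough heuristic for catching mismatches, not a real schema validator: `list_attributes_without_multivalued` is a hypothetical helper that works on a template parsed into a dictionary (for example `yaml.safe_load(text)` in this notebook), and the inline `schema` is a small made-up fragment.

```python
def list_attributes_without_multivalued(schema):
    """Return (class, attribute) pairs whose description mentions a list
    but which do not set `multivalued: true`."""
    flagged = []
    for class_name, definition in (schema.get("classes") or {}).items():
        attributes = (definition or {}).get("attributes") or {}
        for attr_name, attr in attributes.items():
            attr = attr or {}
            description = str(attr.get("description", "")).lower()
            if "list" in description and not attr.get("multivalued", False):
                flagged.append((class_name, attr_name))
    return flagged


# Made-up fragment: one attribute is consistent, the other is not.
schema = {
    "classes": {
        "TermSet": {
            "attributes": {
                "methods": {
                    "range": "Method",
                    "multivalued": True,
                    "description": "A semicolon-separated list of methods.",
                },
                "sites": {
                    "range": "Location",
                    "description": "A semicolon-separated list of sites.",
                },
            }
        }
    }
}

for class_name, attr_name in list_attributes_without_multivalued(schema):
    print(f"{class_name}.{attr_name} asks for a list but is not multivalued")
```

Because it only looks for the word "list" in the description, treat anything it reports as a prompt to re-read the template, not as an error.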

In [None]:
# Replace the API key with your own
!export OPENAI_API_KEY="" && ontogpt -vvv extract -t ecosim_methods.yaml -i essdive_test.csv -m lbl/cborg-chat:latest -o output.yaml --model-provider openai --api-base "https://api.cborg.lbl.gov"

In [None]:
# Read the output file
with open('output.yaml', 'r') as infile:
    output1 = yaml.safe_load(infile)
pprint.pprint(output1)
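
The equipment-to-variable relationships are easier to read once identifiers are replaced with labels. The following sketch assumes the output has an `extracted_object` mapping holding the template's attributes alongside the `named_entities` list used earlier, and that each relationship is a dictionary with `equipment` and `variable` keys, as in the template. `relationship_labels` is a hypothetical helper, and the sample data is invented for illustration.

```python
def relationship_labels(output):
    """Return (equipment, variable) label pairs from an extraction result."""
    # Build an id -> label lookup from the named entities.
    labels = {
        entity["id"]: entity.get("label", entity["id"])
        for entity in output.get("named_entities") or []
    }
    extracted = output.get("extracted_object") or {}
    pairs = []
    for rel in extracted.get("equipment_to_variable_relationships") or []:
        equipment = rel.get("equipment")
        variable = rel.get("variable")
        # Fall back to the raw identifier when no label is known.
        pairs.append((labels.get(equipment, equipment),
                      labels.get(variable, variable)))
    return pairs


# Invented result shaped like the assumed output structure.
sample_output = {
    "extracted_object": {
        "equipment_to_variable_relationships": [
            {"equipment": "AUTO:NMR%20spectrometer",
             "variable": "AUTO:chemical%20content"},
        ]
    },
    "named_entities": [
        {"id": "AUTO:NMR%20spectrometer", "label": "NMR spectrometer"},
        {"id": "AUTO:chemical%20content", "label": "chemical content"},
    ],
}

for equipment, variable in relationship_labels(sample_output):
    print(f"{equipment} measures {variable}")
```

If the real output is shaped differently, the `pprint` output above shows which keys to use instead.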