# Comparing Grounding between OntoGPT and GPT Alone

This notebook performs the following:
* Installs dependencies, primarily ontogpt
* Retrieves all identifiers and terms for a specified ontology
* Prepares a random selection of 100 terms (identifier and label pairs)
* Attempts to ground all terms in the selection with OntoGPT and a template specific to the ontology
* Attempts to ground all terms in the selection through a text completion query sent to GPT-3.5-turbo and GPT-4-turbo

This notebook is intended to be run with Python 3.9 on a *nix-like system.

## Setup

In [None]:
!pip install ontogpt

In [None]:
import linecache
import random
import yaml

The following variable sets the ontology name. It should be the lowercase name of an ontology in the OBO Foundry, e.g., "caro" or "envo". The other filenames used in this notebook are derived from it.

In [None]:
onto_name = "mondo"
termfile = f"{onto_name}_terms.tsv"
select_termfile = f"{onto_name}_terms_select.tsv"
extract_file_35 = f"{onto_name}_extract_35.yaml"
extract_file_4 = f"{onto_name}_extract_4.yaml"
prompting_file = f"{onto_name}_prompt.txt"
prompt_extract_file_35 = f"{onto_name}_prompt_extract_35.txt"
prompt_extract_file_4 = f"{onto_name}_prompt_extract_4.txt"

Now we retrieve all terms. Note that this will exclude obsolete terms by default. It may take a moment to complete this for larger ontologies.

Then we reduce that list to 100 randomly selected terms, keeping the identifier and label of each.

In [None]:
!runoak -i sqlite:obo:{onto_name} terms > {termfile}

In [None]:
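desired_count = 100
term_map = {}
with open(termfile) as infile:
    linecount = sum(1 for line in infile)
# Draw random lines until we have the desired number of distinct terms.
# Note: this never finishes if the file holds fewer matching terms than desired_count.
while len(term_map) < desired_count:
    # linecache line numbers start at 1, so draw from 1..linecount inclusive.
    oneterm = linecache.getline(termfile, random.randint(1, linecount)).rstrip()
    # Keep only terms from this ontology that have a label after the "!" separator.
    if oneterm.lower().startswith(onto_name) and "!" in oneterm:
        curie, label = (part.strip() for part in oneterm.split("!", 1))
        term_map[curie] = label

len(term_map)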
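
As an alternative to drawing line numbers one at a time, the selection can be made without replacement by filtering the lines first and then sampling from them. The following is a minimal sketch rather than the method used in this notebook: the helper name `sample_terms` is hypothetical, the example lines are made-up placeholders instead of real `runoak` output, and it assumes each line has the `CURIE ! label` form that the cell above splits on.

```python
import random


def sample_terms(lines, prefix, count, seed=None):
    """Return up to `count` distinct CURIE -> label pairs chosen at random."""
    rng = random.Random(seed)
    candidates = {}
    for line in lines:
        line = line.strip()
        # Keep only lines for this ontology that have a label after "!".
        if not line.lower().startswith(prefix.lower()) or "!" not in line:
            continue
        curie, label = line.split("!", 1)
        candidates[curie.strip()] = label.strip()
    # sample() draws without replacement, so no CURIE is picked twice.
    chosen = rng.sample(sorted(candidates), min(count, len(candidates)))
    return {curie: candidates[curie] for curie in chosen}


# Placeholder lines for illustration only; these are not real ontology terms.
example_lines = [
    "MONDO:9999991 ! example disease A",
    "MONDO:9999992 ! example disease B",
    "BFO:9999993 ! example term from another ontology",
    "MONDO:9999994 ! example disease C",
]
selection = sample_terms(example_lines, "mondo", 2, seed=0)
print(len(selection))
```

Because the candidates are collected before sampling, this version returns fewer than `count` terms instead of looping when the file is small. Passing a `seed` makes the selection repeatable between runs.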

## Grounding with OntoGPT

We assemble a document of _terms alone_ to parse with OntoGPT and SPIRES.

In [None]:
with open(select_termfile, "w") as outfile:
    for label in term_map.values():
        outfile.write(f"{label}\n")

In [None]:
# This uses the gpt-3.5-turbo-16k model.
!ontogpt -vvv extract -i {select_termfile} -t {onto_name}_simple -o {extract_file_35} -m gpt-3.5-turbo-16k

In [None]:
# This uses the gpt-4-turbo model (as gpt-4-1106-preview)
!ontogpt -vvv extract -i {select_termfile} -t {onto_name}_simple -o {extract_file_4} -m gpt-4-1106-preview

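Now we parse the results and evaluate them. A term counts as correctly grounded only when its identifier and label appear together, exactly as selected, among the extracted named entities.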

In [None]:
for filename in [extract_file_35, extract_file_4]:
    score = 0
    with open(filename) as infile:
        print(f"*** {filename} ***")
        extract = yaml.safe_load(infile)
        # Fall back to an empty list if no named entities were extracted.
        named_entities = extract.get("named_entities") or []
        print(f"Total named entities found: {len(named_entities)}")
        for curie in term_map:
            # A hit requires both the identifier and the label to match.
            term_pair = {'id':curie, 'label':term_map[curie]}
            if term_pair in named_entities:
                score = score + 1
        pct_score = score / len(term_map)
        print(f"Total score: {score} / {len(term_map)} ({pct_score:.1%})")
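
The scoring above accepts only exact identifier and label pairs. To see how many misses come from label differences alone, the same comparison can be wrapped in a helper that also counts identifier-only matches. This is a sketch and not part of the original evaluation: `score_grounding` is a hypothetical name, and the identifiers and labels in the example are made-up placeholders.

```python
def score_grounding(term_map, named_entities):
    """Return (exact pair matches, identifier-only matches) for the selection."""
    found_pairs = {(e.get("id"), e.get("label")) for e in named_entities}
    found_ids = {e.get("id") for e in named_entities}
    # Exact: identifier and label both match, as in the notebook's scoring.
    exact = sum((curie, label) in found_pairs for curie, label in term_map.items())
    # Identifier only: the right CURIE was returned, whatever its label.
    id_only = sum(curie in found_ids for curie in term_map)
    return exact, id_only


# Placeholder data for illustration only.
example_terms = {
    "MONDO:9999991": "example disease A",
    "MONDO:9999992": "example disease B",
}
example_entities = [
    {"id": "MONDO:9999991", "label": "example disease A"},
    {"id": "MONDO:9999992", "label": "Example Disease B"},  # label differs in case
    {"id": "EX:0000001", "label": "something else"},
]
exact, id_only = score_grounding(example_terms, example_entities)
print(f"exact: {exact}, id only: {id_only}")  # exact: 1, id only: 2
```

Using sets for the lookups also avoids scanning the whole entity list once per selected term.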

## Grounding with GPT alone

Now we repeat the process, passing the same selected term list to GPT through the OpenAI API.

In [None]:
# Here's the prompt for GPT-3+. We assemble this and the terms into a file.
inprompt = f"Please provide the corresponding identifier from the {onto_name} Ontology for each of the following terms. The output should be in the following format: \n id: IDENTIFIER ! label: LABEL \n"

In [None]:
with open(prompting_file, "w") as outfile:
    outfile.write(inprompt)
    for label in term_map.values():
        outfile.write(f"{label}\n")

In [None]:
# This uses the gpt-3.5-turbo-16k model.
!ontogpt -vvv complete -o {prompt_extract_file_35} -m gpt-3.5-turbo-16k {prompting_file}

In [None]:
# This uses the gpt-4-turbo model (as gpt-4-1106-preview)
!ontogpt -vvv complete -o {prompt_extract_file_4} -m gpt-4-1106-preview {prompting_file}

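Now we score the completions. Only lines that follow the requested `id: IDENTIFIER ! label: LABEL` format are parsed, and a term again counts only when both its identifier and label match.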

In [None]:
for filename in [prompt_extract_file_35, prompt_extract_file_4]:
    score = 0
    named_entities = []
    with open(filename) as infile:
        print(f"*** {filename} ***")
        for line in infile:
            # Only lines in the requested "id: ... ! label: ..." format are parsed.
            if not line.startswith("id"):
                continue
            try:
                splitline = line.split("!", 1)
                found_id = splitline[0].split("id:")[1].strip()
                found_label = splitline[1].split("label:")[1].strip()
                named_entities.append({"id": found_id, "label": found_label})
            except IndexError:
                # Skip lines that start with "id" but do not follow the format.
                continue
        print(f"Total named entities found: {len(named_entities)}")
        for curie in term_map:
            # A hit requires both the identifier and the label to match.
            term_pair = {'id':curie, 'label':term_map[curie]}
            if term_pair in named_entities:
                score = score + 1
        pct_score = score / len(term_map)
        print(f"Total score: {score} / {len(term_map)} ({pct_score:.1%})")
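
The line parsing above depends on each result starting exactly with `id`, so an indented result line would be skipped. A regular expression can tolerate leading whitespace and variable spacing around the separators. The following is a sketch under assumptions: `LINE_PATTERN` and `parse_completion` are hypothetical names, the sample text is made up, and actual model output may deviate from the requested format in other ways.

```python
import re

# Matches lines such as "id: MONDO:9999991 ! label: example disease A".
LINE_PATTERN = re.compile(
    r"^\s*id:\s*(?P<id>\S+)\s*!\s*label:\s*(?P<label>.+?)\s*$"
)


def parse_completion(text):
    """Collect id/label pairs from lines that follow the requested format."""
    entities = []
    for line in text.splitlines():
        match = LINE_PATTERN.match(line)
        if match:
            entities.append(
                {"id": match.group("id"), "label": match.group("label")}
            )
    return entities


# Placeholder completion text for illustration only.
sample = """Here are the identifiers:
id: MONDO:9999991 ! label: example disease A
  id: MONDO:9999992 ! label: example disease B
not a result line
"""
parsed = parse_completion(sample)
print(len(parsed))  # 2
```

The resulting list has the same shape as `named_entities` in the cell above, so it can be scored in the same way.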