## This notebook demonstrates benchmarking the statement extraction pipeline by evaluating LLM-generated statements against a benchmark corpus of correct INDRA statements.

Import relevant modules

In [1]:
import sys
from pathlib import Path
# Add the parent of the current working directory to the import path
sys.path.append(str(Path.cwd().parent))

from indra_gpt.benchmarks.benchmark import Benchmark
from indra_gpt.resources.constants import INPUT_DEFAULT


INFO: [2025-02-18 21:47:08] indra.preassembler.grounding_mapper.disambiguate - INDRA DB is not available for text content retrieval for grounding disambiguation.
  from .autonotebook import tqdm as notebook_tqdm


Let's evaluate the performance of the OpenAI model 'gpt-4o-mini', with structured_output mode enabled, on a random sample of 500 examples from the benchmark corpus.

In [2]:
model = 'gpt-4o-mini'
benchmark_file = INPUT_DEFAULT
structured_output = True
n_statements = 500
random_sample = True
benchmark = Benchmark(model, benchmark_file, structured_output, n_statements, random_sample)


In [3]:
benchmark_df = benchmark.get_comparison_df()


Extracting:   0%|          | 0/500 [00:00<?, ?statement/s]INFO: [2025-02-18 21:47:15] httpx - HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
INFO: [2025-02-18 21:47:18] httpx - HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
INFO: [2025-02-18 21:47:22] httpx - HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
Extracting:   0%|          | 1/500 [00:09<1:22:58,  9.98s/statement]INFO: [2025-02-18 21:47:26] httpx - HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
INFO: [2025-02-18 21:47:31] httpx - HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
INFO: [2025-02-18 21:47:39] httpx - HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
Extracting:   0%|          | 2/500 [00:27<1:59:23, 14.38s/statement]INFO: [2025-02-18 21:47:42] httpx - HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 20

This generates a dataframe that holds the input and the generated output, along with the comparison results.
Columns:
- <b>'original_statement_json'</b>: The original statement in JSON format, from the benchmark corpus.
- <b>'generated_statements_json'</b>: The generated statements in JSON format, extracted from the input text found in the 'evidence' of the original statement.
- <b>'original_statement'</b>: The original statement converted from JSON format to an INDRA statement.
- <b>'generated_statements'</b>: The generated statements converted from JSON format to INDRA statements.
- <b>'original_statement_grounded'</b>: The result of applying the grounding process to 'original_statement'.
- <b>'generated_statements_grounded'</b>: The result of applying the grounding process to 'generated_statements'.
- <b>'comparison_result'</b>: The result of comparing 'original_statement' with each generated statement in 'generated_statements'.
- <b>'comparison_result_grounded'</b>: Same as 'comparison_result', but for 'original_statement_grounded' and 'generated_statements_grounded'.
- <b>'best_match_index'</b>: The list index in 'generated_statements' of the statement that is most similar to the original statement.
- <b>'best_match_grounded_index'</b>: Same as 'best_match_index', but for 'generated_statements_grounded'.

## Evaluate the accuracy of the model  

The `benchmark_df` above holds the inputs, outputs, and comparison results for the specified configuration.
The performance of the model is measured using three built-in equivalence methods
from `indra`, which ask three questions:

1) Are the two statements equal?
2) Do the two statements have the same type?
3) Do the two statements have the same set of agents?
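
The three questions above can be illustrated with a minimal sketch. It does not use the real `indra` classes or methods: `Stmt`, `equals`, `same_type`, and `same_agent_set` are hypothetical stand-ins, and treating a statement as a type, a tuple of agent names, and a tuple of evidence texts is an assumption made only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stmt:
    """Hypothetical stand-in for an INDRA statement (not the real class)."""
    stmt_type: str          # e.g. "Activation"
    agents: tuple           # agent names, e.g. ("BMP", "PTEN")
    evidence: tuple = ()    # evidence texts attached to the statement


def same_type(a: Stmt, b: Stmt) -> bool:
    # Question 2: do the two statements have the same type?
    return a.stmt_type == b.stmt_type


def same_agent_set(a: Stmt, b: Stmt) -> bool:
    # Question 3: do the two statements have the same set of agents?
    return set(a.agents) == set(b.agents)


def equals(a: Stmt, b: Stmt) -> bool:
    # Question 1 (simplified): type, agent set, and evidence must all agree.
    return same_type(a, b) and same_agent_set(a, b) and a.evidence == b.evidence


original = Stmt("Activation", ("BMP", "PTEN"), ("BMP activates PTEN.",))
generated = Stmt("Activation", ("Caspase", "PTEN"))

checks = (
    equals(original, generated),
    same_type(original, generated),
    same_agent_set(original, generated),
)
```

Here only the type check passes: the agent sets differ, so the stricter equality check fails as well.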

Since the model can extract multiple statements from given input text, we compare the benchmark example with the "best" 
match from the generated set of statements. <br>
E.g. 
- Original statement:  `Activation(BMP(), PTEN())`
- Generated statements: `[Activation(Caspase(), PTEN()), Acetylation(BMP(), PTEN())]` <br><br>
The selection prioritizes matches in this order: statement match > type match > agent set match
<br>So for the above example, `Activation(Caspase(), PTEN())` will be selected over `Acetylation(BMP(), PTEN())`, since its type matches the original while the other statement only shares the original's agent set.
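
The priority rule can be sketched as a small ranking function. This is an illustration under stated assumptions, not the code used by `Benchmark`: statements are modelled as plain `(type, agents)` tuples, evidence is ignored, and `match_rank` and `pick_best_match` are hypothetical names.

```python
def match_rank(original, candidate):
    """Rank a candidate: 3 = full match, 2 = type match, 1 = agent set match, 0 = none."""
    orig_type, orig_agents = original
    cand_type, cand_agents = candidate
    type_match = orig_type == cand_type
    agents_match = set(orig_agents) == set(cand_agents)
    if type_match and agents_match:
        return 3  # stands in for a full statement match (evidence ignored here)
    if type_match:
        return 2
    if agents_match:
        return 1
    return 0


def pick_best_match(original, generated):
    """Return the index of the highest-ranked candidate, or None if there are none."""
    if not generated:
        return None
    # max() keeps the first index when several candidates share the top rank.
    return max(range(len(generated)), key=lambda i: match_rank(original, generated[i]))


original = ("Activation", ("BMP", "PTEN"))
generated = [
    ("Activation", ("Caspase", "PTEN")),   # type match only
    ("Acetylation", ("BMP", "PTEN")),      # agent set match only
]
print(pick_best_match(original, generated))  # 0
```

Encoding the priority as a single integer keeps the selection to one `max` call; ties resolve to the earliest candidate in the list, which is a choice of this sketch.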

In [None]:
benchmark.compute_comparison_statistics(benchmark_df)


* It is very unlikely for 'equals_accuracy' to be non-zero, since two statements are equal only if
their types, agent sets, and evidence are all equal.
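
One plausible way to arrive at a per-criterion accuracy such as 'equals_accuracy' is to take the fraction of benchmark examples whose best match passes that criterion. The sketch below assumes this definition; the input layout (one dict of booleans per example), the function name `accuracy_summary`, and the keys 'type_accuracy' and 'agents_accuracy' are hypothetical and may differ from what `compute_comparison_statistics` actually returns.

```python
def accuracy_summary(best_match_results):
    """Compute the fraction of examples that pass each criterion.

    best_match_results: list of dicts with boolean 'equals', 'type', and
    'agents' flags, one dict per benchmark example (hypothetical layout).
    """
    n = len(best_match_results)
    if n == 0:
        return {}
    return {
        "equals_accuracy": sum(r["equals"] for r in best_match_results) / n,
        "type_accuracy": sum(r["type"] for r in best_match_results) / n,
        "agents_accuracy": sum(r["agents"] for r in best_match_results) / n,
    }


# Made-up flags for four examples, for illustration only.
example_results = [
    {"equals": False, "type": True, "agents": False},
    {"equals": False, "type": True, "agents": True},
    {"equals": False, "type": False, "agents": False},
    {"equals": False, "type": False, "agents": False},
]
summary = accuracy_summary(example_results)
```

With these made-up flags, no example passes the equality check, half pass the type check, and a quarter pass the agent set check.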

Now benchmark another configuration: 'gpt-4o-mini' again, but with structured_output mode disabled and a sample of 10 examples.

In [None]:
model = 'gpt-4o-mini'
benchmark_file = INPUT_DEFAULT
structured_output = False
n_statements = 10
benchmark = Benchmark(model, benchmark_file, structured_output, n_statements)
benchmark_df = benchmark.get_comparison_df()
benchmark.compute_comparison_statistics(benchmark_df)
