## Honeycomb Query Evals

Evals for the Honeycomb Natural Language Query generator from the [Fine Tuning LLMs](https://maven.com/parlance-labs/fine-tuning) course. Related notebooks from the course can be found at <https://github.com/parlance-labs/ftcourse>.

The [queries.csv](queries.csv) dataset contains about 2,300 example queries (along with per-query column schemas generated offline via RAG). Two scoring methods are supported, corresponding to the two `@task` definitions below:

1. validate - score using the validity checker from the course (utils.py)
2. critique - score using the critique prompt from the course (critique.txt)

### Dataset

Inspect uses a standard schema for [Datasets](https://ukgovernmentbeis.github.io/inspect_ai/datasets.html), so we'll map the raw data into that schema when reading it. Note that the "columns" field is saved as metadata so we can use it for prompt engineering.

In [None]:
from inspect_ai.dataset import csv_dataset, FieldSpec

dataset = csv_dataset(
    csv_file="queries.csv",
    sample_fields=FieldSpec(input="user_input", metadata=["columns"]),
    shuffle=True
)
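
To make the field mapping concrete, here is a minimal standard-library sketch of what it amounts to: the `user_input` column becomes the sample's input and `columns` is carried along as metadata. This is only an illustration and not how Inspect reads the file; the two CSV rows and the `rows_to_samples` helper are hypothetical.

```python
import csv
import io

# Hypothetical two-row CSV standing in for queries.csv (not the real data).
RAW_CSV = '''user_input,columns
"slowest endpoints","['duration_ms', 'http.route']"
"count errors by service","['error', 'service.name']"
'''


def rows_to_samples(csv_text):
    """Map each CSV row to a dict with `input` and `metadata` keys."""
    samples = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        samples.append({
            "input": row["user_input"],
            # The column list is kept as the raw string from the CSV cell.
            "metadata": {"columns": row["columns"]},
        })
    return samples


samples = rows_to_samples(RAW_CSV)
print(samples[0]["input"])
print(samples[0]["metadata"]["columns"])
```

In this sketch the column list stays a plain string, which is what allows it to be substituted directly into a prompt template later.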

### Solver

To build the prompt, we'll create a custom [Solver](https://ukgovernmentbeis.github.io/inspect_ai/solvers.html) that merges the user query/prompt and the RAG-retrieved column list into our prompt template:

In [None]:
from inspect_ai.solver import solver
from inspect_ai.util import resource

@solver
def prompt_with_schema():

    prompt_template = resource("prompt.txt")

    async def solve(state, generate):
        # build the prompt
        state.user_prompt.text = prompt_template.replace(
            "{{prompt}}", state.user_prompt.text
        ).replace(
            "{{columns}}", state.metadata["columns"]
        )
        return state

    return solve
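
The substitution step can be tried in isolation. The sketch below assumes a hypothetical template (the real `prompt.txt` is not reproduced here) and a hypothetical `fill_template` helper; only the `{{prompt}}` and `{{columns}}` placeholders come from the solver above.

```python
# Hypothetical template standing in for prompt.txt.
TEMPLATE = (
    "Columns available: {{columns}}\n"
    "User request: {{prompt}}\n"
    "Respond with a Honeycomb query as JSON."
)


def fill_template(template, prompt, columns):
    """Replace the two placeholders with the user prompt and column list."""
    return template.replace("{{prompt}}", prompt).replace("{{columns}}", columns)


filled = fill_template(
    TEMPLATE, "slowest endpoints", "['duration_ms', 'http.route']"
)
print(filled)
```

One plausible reason to prefer chained `str.replace()` over `str.format()` here is that a prompt asking for JSON tends to contain literal braces, which `str.format()` would try to interpret as fields.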


### Scorer

To score the model's responses to the prompt, we'll create a custom [Scorer](https://ukgovernmentbeis.github.io/inspect_ai/scorers.html) that calls the `is_valid()` function to determine whether a valid query has been constructed:

In [None]:
from inspect_ai.scorer import accuracy, scorer, Score, CORRECT, INCORRECT
from utils import is_valid, json_completion

@scorer(metrics=[accuracy()])
def validate_scorer():

    async def score(state, target):

        # check for valid query
        query = json_completion(state.output.completion)
        if is_valid(query, state.metadata["columns"]):
            value = CORRECT
        else:
            value = INCORRECT

        # return score w/ query that was extracted
        return Score(value=value, answer=query)

    return score


Note that the `json_completion()` function takes care of some details around extracting JSON from a model completion (e.g. removing the surrounding backtick code block emitted by some models).
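
As a rough idea of what this kind of cleanup involves, here is a hedged sketch of a fence-stripping helper. It is not the course's `json_completion()` from `utils.py`, whose implementation is not shown here; `strip_code_fence` is a hypothetical name and the behavior is an assumption.

```python
import json

FENCE = "`" * 3  # three backticks, the usual markdown code fence


def strip_code_fence(completion):
    """Return the completion without a surrounding backtick fence, if any."""
    text = completion.strip()
    fenced = (
        text.startswith(FENCE)
        and text.endswith(FENCE)
        and len(text) >= 2 * len(FENCE)
    )
    if not fenced:
        return text
    inner = text[len(FENCE):-len(FENCE)]
    # Drop the rest of the opening line when it is empty or a language tag
    # such as "json".
    first_line, sep, rest = inner.partition("\n")
    if sep and (first_line.strip() == "" or first_line.strip().isalpha()):
        inner = rest
    return inner.strip()


wrapped = FENCE + 'json\n{"calculations": [{"op": "COUNT"}]}\n' + FENCE
print(json.loads(strip_code_fence(wrapped)))
```

Completions that are not fenced pass through unchanged apart from whitespace trimming, so the helper is safe to apply to every response.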

### Validate Task

Now we'll put all of this together to create an evaluation task:

In [None]:
from inspect_ai import eval, task, Task
from inspect_ai.solver import system_message, generate

@task
def validate():
    return Task(
        dataset=dataset,
        plan=[
            system_message("Honeycomb AI suggests queries based on user input."),
            prompt_with_schema(),
            generate()
        ],
        scorer=validate_scorer()
    )

We can run the task using Inspect's `eval()` function (limiting to 100 samples):

In [None]:
eval(validate, model="openai/gpt-4-turbo", limit=100)

### Critique Task

Now we'll create a critique task. For this we'll need an LLM-based scorer that uses a critique template to prompt for whether the generated query is "good" or "bad":

In [None]:
import json
from inspect_ai.model import get_model

@scorer(metrics=[accuracy()])
def critique_scorer(model = "anthropic/claude-3-opus-20240229"):

    async def score(state, target):
       
        # build the critic prompt
        query = state.output.completion.strip()
        critic_prompt = resource("critique.txt").replace(
            "{{prompt}}", state.user_prompt.text
        ).replace(
            "{{columns}}", state.metadata["columns"]
        ).replace(
            "{{query}}", query
        )
       
        # run the critique
        result = await get_model(model).generate(critic_prompt)
        try:
            parsed = json.loads(json_completion(result.completion))
            value = CORRECT if parsed["outcome"] == "good" else INCORRECT
            explanation = parsed["critique"]
        except (json.JSONDecodeError, KeyError):
            value = INCORRECT
            explanation = f"JSON parsing error:\n{result.completion}"
        
        # return value and explanation (critique text)
        return Score(value=value, explanation=explanation)

    return score
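
The parse-and-fall-back logic in the scorer can be exercised without calling a model. The sketch below assumes the critic replies with a JSON object holding `critique` and `outcome` keys, as the scorer above expects; `parse_critique` is a hypothetical helper and the sample replies are made up. It also catches `TypeError`, an addition here, for the case where the reply is valid JSON but not an object.

```python
import json


def parse_critique(completion):
    """Return (is_good, explanation) parsed from a critic completion."""
    try:
        parsed = json.loads(completion)
        is_good = parsed["outcome"] == "good"
        explanation = parsed["critique"]
    except (json.JSONDecodeError, KeyError, TypeError):
        # Any malformed reply is treated as a failed critique and the raw
        # text is kept so it can be inspected later.
        is_good = False
        explanation = f"JSON parsing error:\n{completion}"
    return is_good, explanation


print(parse_critique('{"critique": "Matches the request.", "outcome": "good"}'))
print(parse_critique("The query looks fine to me."))
```

Scoring unparseable replies as incorrect keeps the accuracy metric conservative: a critic that fails to follow the output format can never inflate the score.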

Now we use this scorer in a critique task definition:

In [None]:
@task
def critique():
    return Task(
        dataset=dataset,
        plan=[
            system_message("Honeycomb AI suggests queries based on user input."),
            prompt_with_schema(),
            generate()
        ],
        scorer=critique_scorer()
    )

And then run the task using `eval()` (limiting to 25 samples):

In [None]:
eval(critique, model="openai/gpt-4-turbo", limit=25)