# Auto-Evaluation Example

## Setup imports and API keys

First, we'll need to set our API keys. If we are in DEBUG mode, we don't need to use real OpenAI or Hegel AI API keys, so for now we'll set them to empty strings.

In [1]:
import os
os.environ['HEGELAI_API_KEY'] = ""
os.environ['OPENAI_API_KEY'] = ""
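
The same setup can be wrapped in a small helper that only accepts placeholder keys when debugging. This is a minimal sketch: `configure_api_keys` and its `debug` flag are hypothetical names introduced here for illustration and are not part of `prompttools`.

```python
import os

def configure_api_keys(debug: bool) -> None:
    """Use placeholder keys when debugging; otherwise require real ones."""
    for name in ("HEGELAI_API_KEY", "OPENAI_API_KEY"):
        if debug:
            # Keep any key that is already set, else fall back to an empty string.
            os.environ.setdefault(name, "")
        elif not os.environ.get(name):
            # Outside debug mode an empty or missing key is an error.
            raise RuntimeError(f"{name} must be set when not in debug mode")

configure_api_keys(debug=True)
```

Failing early on a missing key outside debug mode avoids discovering the problem only after the experiment has started.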

Then we'll import the relevant `prompttools` modules to set up our experiment.

In [2]:
from typing import Dict
from prompttools.harness.prompt_template_harness import (
    PromptTemplateExperimentationHarness,
)

## Run experiments

Next, we create our test inputs. For this example, we'll use a prompt template, which uses [jinja](https://jinja.palletsprojects.com/en/3.1.x/) for templating.

In [3]:
prompt_templates = ["Echo the following input: {{input}}", "Repeat the following input: {{input}}"]
user_inputs = [{"input": "This is a test"}, {"input": "This is not a test"}]
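
To see what these templates expand to, the sketch below renders every template against every input with `jinja2` directly. It illustrates the templating step only and is not the harness's internal implementation; `render_all` is a hypothetical helper name.

```python
from itertools import product

import jinja2

prompt_templates = [
    "Echo the following input: {{input}}",
    "Repeat the following input: {{input}}",
]
user_inputs = [{"input": "This is a test"}, {"input": "This is not a test"}]

environment = jinja2.Environment()

def render_all(templates, inputs):
    """Render each template with each input dict (templates vary slowest)."""
    return [
        environment.from_string(template).render(**variables)
        for template, variables in product(templates, inputs)
    ]

rendered_prompts = render_all(prompt_templates, user_inputs)
for rendered in rendered_prompts:
    print(rendered)
```

Two templates and two inputs yield four prompts, one per template and input pair.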

Now we can define an experimentation harness for our inputs and model. We could also pass model arguments if, for example, we wanted to change the model temperature.

In [4]:
harness = PromptTemplateExperimentationHarness("text-davinci-003", prompt_templates, user_inputs)

We can then run the experiment to get results.

In [5]:
harness.prepare()
harness.run()
harness.visualize()

| | messages | response(s) | latency |
|---|---|---|---|
| 0 | Echo the following input: This is a test | [\n\nThis is a test] | 2.60849 |
| 1 | Echo the following input: This is not a test | [\n\nThis is not a test] | 2.105207 |
| 2 | Repeat the following input: This is a test | [\n\nThis is a test] | 0.510293 |
| 3 | Repeat the following input: This is not a test | [\n\nThis is not a test] | 1.843766 |


You can use the `pivot` keyword argument to view results by the template and inputs that created them.

In [6]:
harness.visualize(pivot=True)

| user_input (rows) / prompt_template (columns) | Echo the following input: {{input}} | Repeat the following input: {{input}} |
|---|---|---|
| {'input': 'This is a test'} | [\n\nThis is a test] | [\n\nThis is a test] |
| {'input': 'This is not a test'} | [\n\nThis is not a test] | [\n\nThis is not a test] |
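
Conceptually, the pivoted view regroups the flat results so that each user input becomes a row and each prompt template a column. The following is a minimal pure-Python sketch of that regrouping, using the responses shown in this example; `pivot_rows` is a hypothetical helper and does not reflect how `visualize` is implemented.

```python
from collections import defaultdict

# Flat results: (prompt template, user input, response), one per experiment run.
rows = [
    ("Echo the following input: {{input}}", "This is a test", "\n\nThis is a test"),
    ("Echo the following input: {{input}}", "This is not a test", "\n\nThis is not a test"),
    ("Repeat the following input: {{input}}", "This is a test", "\n\nThis is a test"),
    ("Repeat the following input: {{input}}", "This is not a test", "\n\nThis is not a test"),
]

def pivot_rows(flat_rows):
    """Group responses by user input (rows), then by prompt template (columns)."""
    table = defaultdict(dict)
    for template, user_input, response in flat_rows:
        table[user_input][template] = response
    return dict(table)

pivoted = pivot_rows(rows)
for user_input, by_template in pivoted.items():
    print(user_input, by_template)
```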


## Auto-Evaluate the model response

To evaluate the model response, we can define an eval method that passes the input and response into another LLM to get feedback.

In [7]:
import openai
import jinja2

ECHO_EVALUATION_TEMPLATE="""
Determine whether or not the response is following directions.
Your answer should either be "RIGHT" if the response follows directions,
or "WRONG" if the model is not following directions.

Write your answer here:
PROMPT: {{prompt}}
RESPONSE: {{response}}
ANSWER: 
"""

def extract_responses(output) -> list:
    # Return the generated text of every choice in the completion output.
    return [choice["text"] for choice in output["choices"]]

def auto_eval(prompt: str, results: Dict, metadata: Dict) -> float:
    # Build one evaluation prompt per model response from the template above.
    environment = jinja2.Environment()
    template = environment.from_string(ECHO_EVALUATION_TEMPLATE)
    responses = extract_responses(results)
    eval_prompts = [template.render({"prompt": prompt,
                                     "response": response}) for response in responses]
    # Ask the evaluator model to judge each response.
    evals = [openai.Completion.create(model="text-davinci-003",
                                      prompt=eval_prompt) for eval_prompt in eval_prompts]
    # The score is the number of responses the evaluator judged "RIGHT".
    return float(sum(1 for e in evals if "RIGHT" in e["choices"][0]["text"]))
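
The scoring logic can be exercised without any API call by injecting the judge as a function. This is a sketch under stated assumptions: `count_right`, `offline_eval`, and `fake_judge` are hypothetical names, and `fake_judge` is a crude stand-in for the evaluator LLM rather than anything the library provides.

```python
from typing import Callable, List

def count_right(judgements: List[str]) -> float:
    """Count judgements containing 'RIGHT', mirroring the check in auto_eval."""
    return float(sum(1 for text in judgements if "RIGHT" in text))

def offline_eval(prompt: str, responses: List[str],
                 judge: Callable[[str, str], str]) -> float:
    """Score responses with an injected judge instead of a live model call."""
    return count_right([judge(prompt, response) for response in responses])

def fake_judge(prompt: str, response: str) -> str:
    # Hypothetical stand-in for the LLM: answer "RIGHT" when the response,
    # ignoring surrounding whitespace, appears at the end of the prompt.
    return " RIGHT" if prompt.endswith(response.strip()) else " WRONG"

score = offline_eval(
    "Echo the following input: This is a test",
    ["\n\nThis is a test", "\n\nSomething else"],
    fake_judge,
)
print(score)
```

Note that a substring check for `RIGHT` would also match an answer such as "NOT RIGHT", so a stricter comparison may be preferable if the evaluator is allowed to answer freely.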

Finally, we can evaluate and visualize the results.

In [8]:
harness.evaluate("followed_directions", auto_eval)
harness.visualize()

[<OpenAIObject text_completion id=cmpl-7Yl3SC4jCsVc2mJMNZmUEomK8BKZa at 0x11e656e10> JSON: {
  "choices": [
    {
      "finish_reason": "stop",
      "index": 0,
      "logprobs": null,
      "text": " RIGHT"
    }
  ],
  "created": 1688518386,
  "id": "cmpl-7Yl3SC4jCsVc2mJMNZmUEomK8BKZa",
  "model": "text-davinci-003",
  "object": "text_completion",
  "usage": {
    "completion_tokens": 1,
    "prompt_tokens": 82,
    "total_tokens": 83
  }
}]

| | messages | response(s) | latency | followed_directions |
|---|---|---|---|---|
| 0 | Echo the following input: This is a test | [\n\nThis is a test] | 2.60849 | 1.0 |
| 1 | Echo the following input: This is not a test | [\n\nThis is not a test] | 2.105207 | 1.0 |
| 2 | Repeat the following input: This is a test | [\n\nThis is a test] | 0.510293 | 1.0 |
| 3 | Repeat the following input: This is not a test | [\n\nThis is not a test] | 1.843766 | 1.0 |
