# Auto-Evaluation Example

## Installation

In [1]:
# !pip install --quiet --force-reinstall prompttools

## Setup imports and API keys

First, we need to set our API keys. In DEBUG mode, a real OpenAI key is not required. The values below are empty-string placeholders: set `DEBUG` to `"1"` to use debug mode, or fill in a real `OPENAI_API_KEY` to call the OpenAI API.

In [10]:
import os

os.environ["DEBUG"] = ""  # Set to "1" if you want to use debug mode.
os.environ["OPENAI_API_KEY"] = ""

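Since both variables start out empty, it can help to fail early when neither is configured. The following is a minimal sketch, not part of `prompttools`: `resolve_mode` is a hypothetical helper, and it assumes that any non-empty `DEBUG` value means debug mode.

```python
def resolve_mode(env):
    """Return "debug" or "live", or raise if neither is configured.

    Hypothetical helper: assumes any non-empty DEBUG value enables debug mode.
    """
    if env.get("DEBUG"):
        return "debug"
    if env.get("OPENAI_API_KEY"):
        return "live"
    raise RuntimeError('Set DEBUG to "1" or provide a real OPENAI_API_KEY.')


# In the notebook this would be called as resolve_mode(os.environ).
# With both values left as empty strings, the check fails early.
try:
    print(resolve_mode({"DEBUG": "", "OPENAI_API_KEY": ""}))
except RuntimeError as err:
    print(f"Not configured: {err}")

print(resolve_mode({"DEBUG": "1", "OPENAI_API_KEY": ""}))
```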
Then we'll import the relevant `prompttools` modules to set up our experiment.

In [3]:
from typing import Dict
from prompttools.harness import PromptTemplateExperimentationHarness
from prompttools.experiment import OpenAICompletionExperiment

## Run experiments

Next, we create our test inputs. For this example, we'll use two prompt templates written in [jinja](https://jinja.palletsprojects.com/en/3.1.x/) syntax, where `{{input}}` marks the slot that each user input fills.

In [4]:
prompt_templates = [
    """
INSTRUCTIONS
Answer the following question

QUESTION
{{input}}

ANSWER
""",
    """
INSTRUCTIONS
Answer the following question. 
If it is not prime, give its decomposition

QUESTION
{{input}}

ANSWER
""",
]
user_inputs = [{"input": "is 17077 a prime number"}, {"input": "Is 17077 prime?"}]

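To make the substitution concrete, here is a minimal stand-in that fills the `{{input}}` slot with plain string replacement. It is only an illustration: the harness uses jinja, not this hypothetical `render` function, and real jinja rendering supports far more than simple slots.

```python
def render(template: str, variables: dict) -> str:
    """Fill each {{name}} slot with its value (plain replacement, not jinja)."""
    rendered = template
    for name, value in variables.items():
        rendered = rendered.replace("{{" + name + "}}", str(value))
    return rendered


template = """
INSTRUCTIONS
Answer the following question

QUESTION
{{input}}

ANSWER
"""

prompt = render(template, {"input": "Is 17077 prime?"})
print(repr(prompt))
```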
Now we can define an experimentation harness for our inputs and model. We could also pass model arguments if, for example, we wanted to change the model temperature.

In [5]:
harness = PromptTemplateExperimentationHarness(
    OpenAICompletionExperiment, "text-davinci-003", prompt_templates, user_inputs
)

We can then run the experiment to get results.

In [6]:
harness.run()
harness.visualize()

| | prompt | response(s) | latency |
|---|---|---|---|
| 0 | `\nINSTRUCTIONS\nAnswer the following question\n\nQUESTION\nis 17077 a prime number\n\nANSWER` | `\nNo, 17077 is not a prime number. It is divisible` | 0.743519 |
| 1 | `\nINSTRUCTIONS\nAnswer the following question\n\nQUESTION\nIs 17077 prime?\n\nANSWER` | `\nNo, 17077 is not a prime number. It is divisible` | 0.919933 |
| 2 | `\nINSTRUCTIONS\nAnswer the following question. \nIf it is not prime, give its decomposition\n\nQUESTION\nis 17077 a prime number\n\nANSWER` | `\nNo, 17077 is not a prime number. It can be decomp` | 0.928695 |
| 3 | `\nINSTRUCTIONS\nAnswer the following question. \nIf it is not prime, give its decomposition\n\nQUESTION\nIs 17077 prime?\n\nANSWER` | `\nNo, 17077 is not prime. 17077 = 11 x` | 0.550601 |

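The four rows above are consistent with the harness pairing every template with every user input, template by template. A minimal sketch of that expansion, assuming a plain Cartesian product (the harness internals are not shown in this notebook, and the short template strings here are placeholders):

```python
from itertools import product

# Placeholder templates standing in for the two prompt templates above.
templates = ["template 0: {{input}}", "template 1: {{input}}"]
inputs = [{"input": "is 17077 a prime number"}, {"input": "Is 17077 prime?"}]

# Pair every template with every input; templates vary slowest,
# which matches the row order of the results table.
runs = [
    (template_index, user_input["input"])
    for (template_index, _), user_input in product(enumerate(templates), inputs)
]

for row, (template_index, question) in enumerate(runs):
    print(row, f"template {template_index}", question)
```

With two templates and two inputs this yields four runs, so adding a third template would grow the experiment to six.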

You can use the `pivot` keyword argument to view results by the template and inputs that created them.

In [7]:
harness.visualize(pivot=True)

| user_input (rows) / prompt_template (columns) | `\nINSTRUCTIONS\nAnswer the following question\n\nQUESTION\n{{input}}\n\nANSWER\n` | `\nINSTRUCTIONS\nAnswer the following question. \nIf it is not prime, give its decomposition\n\nQUESTION\n{{input}}\n\nANSWER\n` |
|---|---|---|
| `{'input': 'Is 17077 prime?'}` | `\nNo, 17077 is not a prime number. It is divisible` | `\nNo, 17077 is not prime. 17077 = 11 x` |
| `{'input': 'is 17077 a prime number'}` | `\nNo, 17077 is not a prime number. It is divisible` | `\nNo, 17077 is not a prime number. It can be decomp` |


## Auto-Evaluate the model response

To evaluate the model responses, we can use an eval method that passes the input and response into another LLM to get feedback. Here we import the `autoeval` utility from `prompttools.utils` for this purpose.

In [8]:
from prompttools.utils import autoeval

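The general idea of LLM-based evaluation can be sketched as follows. This is an illustration under stated assumptions, not the actual implementation of `autoeval.evaluate`: the judge prompt wording, `JUDGE_TEMPLATE`, `judge_score`, and the `ask_judge` callable are all hypothetical, and canned replies stand in for a real model call so the example runs offline.

```python
from typing import Callable

# Hypothetical judge prompt; the wording is an assumption for illustration.
JUDGE_TEMPLATE = (
    "Determine whether the RESPONSE follows the instructions in the PROMPT.\n"
    "Reply with RIGHT or WRONG.\n\n"
    "PROMPT:\n{prompt}\n\nRESPONSE:\n{response}\n\nANSWER:"
)


def judge_score(prompt: str, response: str, ask_judge: Callable[[str], str]) -> float:
    """Return 1.0 if the judge's verdict starts with RIGHT, otherwise 0.0."""
    verdict = ask_judge(JUDGE_TEMPLATE.format(prompt=prompt, response=response))
    return 1.0 if verdict.strip().upper().startswith("RIGHT") else 0.0


prompt = "Answer the following question\n\nQUESTION\nIs 17077 prime?"
response = "No, 17077 is not a prime number."

# Canned verdicts replace the judge model here; a real judge would be an LLM call.
print(judge_score(prompt, response, lambda _: "RIGHT"))
print(judge_score(prompt, response, lambda _: " wrong\n"))
```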
Finally, we can evaluate the responses and visualize the results. The scores appear in a new `followed_directions` column.

In [9]:
harness.evaluate("followed_directions", autoeval.evaluate)
harness.visualize()

| | prompt | response(s) | latency | followed_directions |
|---|---|---|---|---|
| 0 | `\nINSTRUCTIONS\nAnswer the following question\n\nQUESTION\nis 17077 a prime number\n\nANSWER` | `\nNo, 17077 is not a prime number. It is divisible` | 0.743519 | 1.0 |
| 1 | `\nINSTRUCTIONS\nAnswer the following question\n\nQUESTION\nIs 17077 prime?\n\nANSWER` | `\nNo, 17077 is not a prime number. It is divisible` | 0.919933 | 1.0 |
| 2 | `\nINSTRUCTIONS\nAnswer the following question. \nIf it is not prime, give its decomposition\n\nQUESTION\nis 17077 a prime number\n\nANSWER` | `\nNo, 17077 is not a prime number. It can be decomp` | 0.928695 | 0.0 |
| 3 | `\nINSTRUCTIONS\nAnswer the following question. \nIf it is not prime, give its decomposition\n\nQUESTION\nIs 17077 prime?\n\nANSWER` | `\nNo, 17077 is not prime. 17077 = 11 x` | 0.550601 | 0.0 |

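Because the `followed_directions` scores come from another model, it can be worth cross-checking the responses against a deterministic ground truth where one exists. A minimal sketch, not part of `prompttools`, using trial division (`is_prime` is a helper introduced here for illustration):

```python
def is_prime(n: int) -> bool:
    """Trial division by 2 and by odd numbers up to sqrt(n)."""
    if n < 2:
        return False
    if n % 2 == 0:
        return n == 2
    divisor = 3
    while divisor * divisor <= n:
        if n % divisor == 0:
            return False
        divisor += 2
    return True


print(is_prime(17077))  # True
print(17077 % 11)       # 5, so 11 is not a factor
```

The check shows that 17077 is prime, so all four responses above are factually wrong, including the two that scored 1.0.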