# Auto-Evaluation Example

To view this example on Google Colab, see [here](https://colab.research.google.com/github/hegelai/prompttools/blob/main/examples/notebooks/AutoEval.ipynb).

## Installations

In [1]:
# !pip install --quiet --force-reinstall prompttools

## Set up imports and API keys

First, we'll need to set our API key. In DEBUG mode we don't need a real OpenAI key, so for now we leave it as an empty string. To run against the real API, replace the empty string with your key; to use debug mode instead, set `DEBUG` to `"1"`.

In [9]:
import os

os.environ["DEBUG"] = ""  # Set to "1" if you want to use debug mode.
os.environ["OPENAI_API_KEY"] = ""

Then we'll import the relevant `prompttools` modules to set up our experiment.

In [3]:
# OpenAICompletionExperiment runs prompts against the OpenAI completion API;
# PromptSelector pairs one instruction with one user input.
from prompttools.experiment import OpenAICompletionExperiment
from prompttools.selector.prompt_selector import PromptSelector

## Run experiments

Next, we create our test inputs. For this example, we write two instructions and two phrasings of the same question, then use `PromptSelector` to pair every instruction with every input.

In [4]:
instructions = [
    """
Answer the following question. 
If it is not prime, give its decomposition.
""",
    """
Answer the following question.
""",
]

inputs = ["is 17077 a prime number", "Is 17077 prime?"]

# Pair every instruction with every input (2 x 2 = 4 selectors).
selectors = [
    PromptSelector(instruction, user_input)
    for instruction in instructions
    for user_input in inputs
]
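
The pairing above is a Cartesian product of instructions and inputs. As a minimal sketch that does not depend on `prompttools`, the same ordering can be reproduced with `itertools.product`; the instruction strings here are flattened to one line for readability, so they are a simplification of the ones in the cell.

```python
from itertools import product

# Simplified copies of the instructions and inputs used in this notebook.
instructions = [
    "Answer the following question. If it is not prime, give its decomposition.",
    "Answer the following question.",
]
inputs = ["is 17077 a prime number", "Is 17077 prime?"]

# product() varies the last iterable fastest, so the order matches the
# nested comprehension: all inputs for the first instruction, then all
# inputs for the second.
pairs = list(product(instructions, inputs))

for index, (instruction, user_input) in enumerate(pairs):
    print(index, "|", instruction, "|", user_input)
```

This ordering is why rows 0 and 1 of the results share the first instruction and rows 2 and 3 share the second.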

Now we can define an experiment for our inputs and model. We could also pass model arguments if, for example, we wanted to change the model temperature.

In [5]:
experiment = OpenAICompletionExperiment(["text-davinci-003"], selectors)
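
To get a feel for how adding a model argument grows an experiment, here is a library-free sketch. It assumes that each argument is supplied as a list of candidate values and that the experiment tries every combination; that behavior, the temperature values, and the placeholder prompt names are assumptions for illustration, not taken from this notebook.

```python
from itertools import product

models = ["text-davinci-003"]
# Placeholder names standing in for the four selectors built earlier.
prompts = ["selector_0", "selector_1", "selector_2", "selector_3"]
# Hypothetical temperature values, chosen only for illustration.
temperatures = [0.0, 1.0]

# One dictionary per combination of model, prompt, and temperature.
grid = [
    {"model": model, "prompt": prompt, "temperature": temperature}
    for model, prompt, temperature in product(models, prompts, temperatures)
]

print(len(grid))
```

Under that assumption, one model, four prompts, and two temperatures would produce eight runs, so each extra argument value multiplies the number of API calls.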

We can then run the experiment to get results.

In [6]:
experiment.run()
experiment.visualize()

| | prompt | response | latency |
|---|---|---|---|
| 0 | `INSTRUCTION:\n\nAnswer the following question. \nIf it is not prime, give its decomposition.\n\nPROMPT:\nis 17077 a prime number\nRESPONSE:\n` | No, 17077 is not a prime number. Its decomposition is 13 | 0.970505 |
| 1 | `INSTRUCTION:\n\nAnswer the following question. \nIf it is not prime, give its decomposition.\n\nPROMPT:\nIs 17077 prime?\nRESPONSE:\n` | No, 17077 is not prime. Its decomposition is 79 x 215 | 0.895672 |
| 2 | `INSTRUCTION:\n\nAnswer the following question.\n\nPROMPT:\nis 17077 a prime number\nRESPONSE:\n` | No, 17077 is not a prime number. It is divisible by | 1.149236 |
| 3 | `INSTRUCTION:\n\nAnswer the following question.\n\nPROMPT:\nIs 17077 prime?\nRESPONSE:\n` | No, 17077 is not a prime number. 17077 is div | 0.921796 |


## Auto-Evaluate the model response

To evaluate the model responses, we can use an evaluation function that passes each prompt and response to another LLM for feedback. Here we import `autoeval_binary_scoring` from `prompttools.utils` for that purpose.

In [7]:
from prompttools.utils import autoeval_binary_scoring
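
To illustrate the general idea of binary scoring with a second model, here is a minimal sketch. It is not the `prompttools` implementation: `JUDGE_INSTRUCTIONS`, `build_judge_prompt`, `binary_score`, and `fake_judge` are hypothetical names, and `fake_judge` is an offline stand-in for a real LLM call.

```python
from typing import Callable

# Hypothetical judging instructions; the wording used by prompttools may differ.
JUDGE_INSTRUCTIONS = (
    "Decide whether the response follows the directions in the prompt. "
    "Answer RIGHT if it does and WRONG if it does not."
)


def build_judge_prompt(prompt: str, response: str) -> str:
    """Combine the original prompt and the model response into one judging prompt."""
    return (
        f"{JUDGE_INSTRUCTIONS}\n\n"
        f"ORIGINAL PROMPT:\n{prompt}\n\n"
        f"MODEL RESPONSE:\n{response}\n\n"
        "VERDICT:"
    )


def binary_score(prompt: str, response: str, judge: Callable[[str], str]) -> float:
    """Return 1.0 if the judge's verdict starts with RIGHT, otherwise 0.0."""
    verdict = judge(build_judge_prompt(prompt, response))
    return 1.0 if verdict.strip().upper().startswith("RIGHT") else 0.0


def fake_judge(judge_prompt: str) -> str:
    """Toy stand-in for an LLM: approve any text that states a decomposition."""
    return "RIGHT" if "Its decomposition is" in judge_prompt else "WRONG"


print(binary_score("Is 17077 prime?", "No. Its decomposition is 79 x 215", fake_judge))
print(binary_score("Is 17077 prime?", "No.", fake_judge))
```

In a real setup, `judge` would wrap a call to an evaluation model, and the verdict parsing would need to tolerate whatever format that model returns.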

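Finally, we can evaluate and visualize the results. The scores are stored in a new `followed_directions` column, and the `prompt_column_name` argument tells the evaluation function which column holds the prompt.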

In [8]:
experiment.evaluate("followed_directions", autoeval_binary_scoring, {"prompt_column_name": "prompt"})
experiment.visualize()

| | prompt | response | latency | followed_directions |
|---|---|---|---|---|
| 0 | `INSTRUCTION:\n\nAnswer the following question. \nIf it is not prime, give its decomposition.\n\nPROMPT:\nis 17077 a prime number\nRESPONSE:\n` | No, 17077 is not a prime number. Its decomposition is 13 | 0.970505 | 0.0 |
| 1 | `INSTRUCTION:\n\nAnswer the following question. \nIf it is not prime, give its decomposition.\n\nPROMPT:\nIs 17077 prime?\nRESPONSE:\n` | No, 17077 is not prime. Its decomposition is 79 x 215 | 0.895672 | 1.0 |
| 2 | `INSTRUCTION:\n\nAnswer the following question.\n\nPROMPT:\nis 17077 a prime number\nRESPONSE:\n` | No, 17077 is not a prime number. It is divisible by | 1.149236 | 1.0 |
| 3 | `INSTRUCTION:\n\nAnswer the following question.\n\nPROMPT:\nIs 17077 prime?\nRESPONSE:\n` | No, 17077 is not a prime number. 17077 is div | 0.921796 | 1.0 |
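
Note that `followed_directions` scores whether a response follows the instruction, not whether it is factually correct. As a quick independent check (plain trial division, unrelated to `prompttools`; the helper names are illustrative), the sketch below tests 17077 directly and multiplies out the decomposition claimed in row 1.

```python
def smallest_factor(n: int) -> int:
    """Return the smallest factor of n greater than 1 (n itself when n is prime)."""
    if n < 2:
        raise ValueError("n must be at least 2")
    divisor = 2
    # A composite n always has a factor no larger than its square root.
    while divisor * divisor <= n:
        if n % divisor == 0:
            return divisor
        divisor += 1
    return n


def is_prime(n: int) -> bool:
    """Return True when n is a prime number."""
    return n >= 2 and smallest_factor(n) == n


print(is_prime(17077))  # True
print(79 * 215)  # 16985, not 17077
```

So 17077 is prime, and every response above is factually wrong even where `followed_directions` is 1.0. A separate correctness metric would be needed to catch that.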
