# Basic Experiment Example

## Setup imports and API keys

First, we'll set our API keys. In DEBUG mode we don't need real OpenAI or Hegel AI API keys, so for now we'll set them to empty strings.

In [1]:
import os
os.environ['DEBUG'] = "1"
os.environ['HEGELAI_API_KEY'] = ""  # Optional; only needed if you use `HegelScribe` to persist/visualize your experiments
os.environ['OPENAI_API_KEY'] = ""
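Outside DEBUG mode the keys need real values. As a minimal sketch of one way to guard against forgetting that, the hypothetical helper `resolve_api_key` below (not part of `prompttools`) accepts an empty key only while `DEBUG` is set to `"1"`:

```python
import os


def resolve_api_key(name: str, env=None) -> str:
    """Hypothetical helper: return the key `name`, allowing an empty value only in DEBUG mode."""
    # Default to the process environment; a plain dict can be passed for testing.
    env = os.environ if env is None else env
    key = env.get(name, "")
    debug = env.get("DEBUG", "") == "1"
    if not key and not debug:
        raise RuntimeError(f"{name} must be set when DEBUG is not enabled")
    return key
```

Calling `resolve_api_key("OPENAI_API_KEY")` after the cell above would return the empty string, because `DEBUG` is `"1"`.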

Then we'll import the relevant `prompttools` modules to set up our experiment.

In [2]:
from typing import Dict, List, Tuple
from prompttools.experiment.openai_chat_experiment import OpenAIChatExperiment

## Run an experiment

Next, we create our test inputs. We can iterate over models, inputs, and configurations like temperature.

In [3]:
models = ['gpt-3.5-turbo', 'gpt-3.5-turbo-0613']
messages = [[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "Who was the first president?"},
]]
temperatures = [0.0, 1.0]

experiment = OpenAIChatExperiment(models, messages, temperature=temperatures)
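To see how many runs these inputs produce, here is a rough sketch that expands them with `itertools.product`. It assumes the experiment tries every combination of its arguments, which matches the four rows in the results table further down, but it is an illustration and not the library's own code:

```python
from itertools import product

models = ["gpt-3.5-turbo", "gpt-3.5-turbo-0613"]
messages = [[
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Who was the first president?"},
]]
temperatures = [0.0, 1.0]

# One dictionary per (model, message list, temperature) combination.
combinations = [
    {"model": model, "messages": message_list, "temperature": temperature}
    for model, message_list, temperature in product(models, messages, temperatures)
]

print(len(combinations))  # 2 models x 1 message list x 2 temperatures = 4
```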

We can then run the experiment to get results.

In [4]:
experiment.run()

[2023-07-05 15:34:24,014] INFO in experiment: Preparing first...


## Evaluate the model response

To evaluate the results, we'll define an eval function. We can use semantic distance to check if the model's response is similar to our expected output.

In [5]:
from prompttools.utils import similarity


EXPECTED = {"Who was the first president?": "George W"}

def extract_responses(output) -> List[str]:
    return [choice["message"]["content"] for choice in output["choices"]]


def measure_similarity(
    messages: List[Dict[str, str]], results: Dict, metadata: Dict
) -> float:
    """
    A simple test that checks semantic similarity between the expected answer
    for the user's question and the model's text responses.
    """
    distances = [
        similarity.compute(EXPECTED[messages[1]["content"]], response)
        for response in extract_responses(results)
    ]
    return min(distances)
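For intuition about what a similarity score between two strings looks like, here is a toy sketch based on word counts and cosine similarity. It is a stand-in for illustration only: it is not how `similarity.compute` works, and an embedding-based score will generally give different numbers.

```python
import math
from collections import Counter


def cosine_similarity(a: str, b: str) -> float:
    """Toy word-count cosine similarity; a stand-in for an embedding-based score."""
    counts_a = Counter(a.lower().split())
    counts_b = Counter(b.lower().split())
    # Dot product over the words of `a`; words missing from `b` count as 0.
    dot = sum(counts_a[word] * counts_b[word] for word in counts_a)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    norm = norm_a * norm_b
    # Guard against empty strings, which have zero norm.
    return dot / norm if norm else 0.0
```

With this toy measure, identical strings score close to 1 and strings that share no words score 0.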

Finally, we can evaluate and visualize the results.

In [6]:
experiment.evaluate("similar_to_expected", measure_similarity)


[2023-07-05 15:34:24,125] INFO in posthog: Anonymized telemetry enabled. See https://docs.trychroma.com/telemetry for more information.
[2023-07-05 15:34:24,137] INFO in ctypes: Successfully imported ClickHouse Connect C data optimizations
[2023-07-05 15:34:24,138] INFO in ctypes: Successfully import ClickHouse Connect C/Numpy optimizations
[2023-07-05 15:34:24,144] INFO in json_impl: Using python library for writing JSON byte strings
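The function passed to `experiment.evaluate` takes `(messages, results, metadata)` and returns a float, so other checks can follow the same shape. As a sketch, the hypothetical `contains_expected` below (not part of `prompttools`) returns 1.0 when every response contains the expected text and 0.0 otherwise. It assumes `results` has the same `choices` layout that `extract_responses` reads above.

```python
from typing import Dict, List

EXPECTED = {"Who was the first president?": "George W"}


def contains_expected(
    messages: List[Dict[str, str]], results: Dict, metadata: Dict
) -> float:
    """Hypothetical eval function: 1.0 if every response contains the expected text."""
    expected = EXPECTED[messages[1]["content"]]
    responses = [choice["message"]["content"] for choice in results["choices"]]
    # An empty response list counts as a failure rather than a vacuous pass.
    return float(bool(responses) and all(expected in r for r in responses))
```

It could then be registered the same way as the similarity metric, for example `experiment.evaluate("contains_expected", contains_expected)`.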


In [7]:
experiment.visualize()

| | messages | response(s) | latency | similar_to_expected | model | temperature |
|---|---|---|---|---|---|---|
| 0 | [{'role': 'system', 'content': 'You are a helpful assistant.'}, {'role': 'user', 'content': 'Who was the first president?'}] | [George Washington] | 3e-06 | 0.199477 | gpt-3.5-turbo | 0.0 |
| 1 | [{'role': 'system', 'content': 'You are a helpful assistant.'}, {'role': 'user', 'content': 'Who was the first president?'}] | [George Washington] | 2e-06 | 0.199477 | gpt-3.5-turbo | 1.0 |
| 2 | [{'role': 'system', 'content': 'You are a helpful assistant.'}, {'role': 'user', 'content': 'Who was the first president?'}] | [George Washington] | 1e-06 | 0.199477 | gpt-3.5-turbo-0613 | 0.0 |
| 3 | [{'role': 'system', 'content': 'You are a helpful assistant.'}, {'role': 'user', 'content': 'Who was the first president?'}] | [George Washington] | 1e-06 | 0.199477 | gpt-3.5-turbo-0613 | 1.0 |
