# Model Comparison Example

## Installations

In [1]:
# !pip install --quiet --force-reinstall prompttools

## Setup imports and API keys

First, we'll need to set our API keys. If we are in DEBUG mode, we don't need to use real OpenAI or Hegel AI API keys, so for now we'll set them to empty strings.

In [2]:
import os
os.environ['DEBUG'] = "1"
os.environ['HEGELAI_API_KEY'] = ""  # Optional; only needed if you use `HegelScribe` to persist/visualize your experiments
os.environ['OPENAI_API_KEY'] = ""
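Outside DEBUG mode the empty strings above would no longer be enough. The sketch below shows one way to fail fast on missing keys. The helper `missing_keys` is hypothetical and not part of `prompttools`, and it assumes that `DEBUG` set to `"1"` is what enables mocked responses, as in the cell above.

```python
# Hypothetical helper for illustration; not part of prompttools.
def missing_keys(env, required=("OPENAI_API_KEY",)):
    """Return the names of required keys that are unset or empty."""
    # Assumption: DEBUG == "1" means responses are mocked, so no real keys are needed.
    if env.get("DEBUG") == "1":
        return []
    return [name for name in required if not env.get(name)]


print(missing_keys({"DEBUG": "1", "OPENAI_API_KEY": ""}))  # []
print(missing_keys({"OPENAI_API_KEY": ""}))                # ['OPENAI_API_KEY']
```

In a notebook, calling `missing_keys(os.environ)` before building the harness would surface an unset key early.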

Then we'll import the relevant `prompttools` modules to set up our experiment.

In [3]:
from prompttools.harness import ChatModelComparisonHarness

## Run experiments

Next, we create our test inputs.

In [4]:
chat_histories = [
    [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Who won the world series in 2020?"},
    ]
]
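Since `chat_histories` is a list of message lists, further test cases can be added as more inner lists. The following is a minimal sketch of building several histories from plain questions. The helper `make_history`, the second question, and the name `more_chat_histories` are illustrative assumptions, not part of the original example.

```python
# Hypothetical helper for illustration; not part of prompttools.
def make_history(user_message, system_message="You are a helpful assistant."):
    """Build one chat history: a system message followed by a user message."""
    return [
        {"role": "system", "content": system_message},
        {"role": "user", "content": user_message},
    ]


questions = [
    "Who won the world series in 2020?",
    "Where was it played?",  # made-up extra question for illustration
]

# One inner list per test case, matching the shape used above.
more_chat_histories = [make_history(q) for q in questions]
print(len(more_chat_histories))  # 2
```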

Now we can define an experimentation harness to compare two models.

In [5]:
harness = ChatModelComparisonHarness(["gpt-3.5-turbo", "gpt-3.5-turbo-0613"], chat_histories)

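We can then run the experiment to get results. Because DEBUG mode is enabled, the responses shown below are mock values rather than real model output.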

In [6]:
harness.run()
harness.visualize()

{'model': ['gpt-3.5-turbo', 'gpt-3.5-turbo-0613'], 'messages': [[{'role': 'system', 'content': 'You are a helpful assistant.'}, {'role': 'user', 'content': 'Who won the world series in 2020?'}]], 'temperature': [1.0], 'top_p': [1.0], 'n': [1], 'stream': [False], 'stop': [None], 'max_token': [inf], 'presence_penalty': [0], 'frequency_penalty': [0], 'logit_bias': [None]}


Unnamed: 0,messages,response(s),latency,model
0,"[{'role': 'system', 'content': 'You are a helpful assistant.'}, {'role': 'user', 'content': 'Who won the world series in 2020?'}]",[George Washington],5e-06,gpt-3.5-turbo
1,"[{'role': 'system', 'content': 'You are a helpful assistant.'}, {'role': 'user', 'content': 'Who won the world series in 2020?'}]",[George Washington],3e-06,gpt-3.5-turbo-0613
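The printed dictionary lists candidate values for each argument, and the table has one row per model. One way to picture how two models and one chat history yield two rows is a Cartesian product over the argument lists. The sketch below illustrates that idea with `itertools.product`. It is not taken from the `prompttools` source, and only three of the arguments are included for brevity.

```python
from itertools import product

models = ["gpt-3.5-turbo", "gpt-3.5-turbo-0613"]
chat_histories = [
    [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Who won the world series in 2020?"},
    ]
]
temperatures = [1.0]

# Every combination of one value per argument becomes one experiment row.
combinations = [
    {"model": model, "messages": messages, "temperature": temperature}
    for model, messages, temperature in product(models, chat_histories, temperatures)
]

for combo in combinations:
    print(combo["model"], combo["temperature"])
```

With two models, one history, and one temperature, this gives 2 x 1 x 1 = 2 combinations, which matches the two rows in the table.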


You can use the `pivot` keyword argument to view results by the template and inputs that created them.

## Compare outputs manually

We can now record our manual evaluations of the output from each model. We are evaluating the quality of the first model based on its side-by-side comparison with the other models.

In [7]:
harness.compare()

GridBox(children=(Label(value='Input'), Label(value='gpt-3.5-turbo'), Label(value='gpt-3.5-turbo-0613'), Label…

Unnamed: 0,gpt-3.5-turbo,feedback
0,"[{'role': 'system', 'content': 'You are a helpful assistant.'}, {'role': 'user', 'content': 'Who won the world series in 2020?'}]",0


Unnamed: 0,gpt-3.5-turbo,feedback
0,"[{'role': 'system', 'content': 'You are a helpful assistant.'}, {'role': 'user', 'content': 'Who won the world series in 2020?'}]",1
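Once several inputs have been rated, the `feedback` column can be summarized into a single number. The sketch below is a plain-Python illustration, not a `prompttools` feature. It assumes that a feedback value of 1 marks an acceptable response and 0 an unacceptable one, and the ratings are made up.

```python
# Hypothetical helper for illustration; not part of prompttools.
def approval_rate(feedback):
    """Fraction of rated inputs with feedback 1, or None if nothing was rated."""
    if not feedback:
        return None
    return sum(feedback) / len(feedback)


# Made-up ratings for four inputs (assumed convention: 1 = acceptable, 0 = not).
ratings = [1, 0, 1, 1]
print(approval_rate(ratings))  # 0.75
```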
