# Anthropic Completion Experiment Example

## Installation

In [None]:
# !pip install --quiet --force-reinstall prompttools

## Setup imports and API keys

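First, we need to set our API key. In DEBUG mode no real Anthropic key is required, so for now both environment variables are set to empty strings. Note that an empty `DEBUG` value means the real Anthropic API is called, so insert your key before running the experiment, or set `DEBUG` to `"1"` to stay in debug mode.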

In [1]:
import os

os.environ["DEBUG"] = ""  # Set this to "" to call Anthropic's API, "1" to use debug mode
os.environ["ANTHROPIC_API_KEY"] = ""  # Insert your key here

Then we'll import the relevant `prompttools` modules to set up our experiment.

In [2]:
from prompttools.experiment import AnthropicCompletionExperiment
from anthropic import HUMAN_PROMPT, AI_PROMPT

## Run an experiment

Next, we create our test inputs. We can iterate over models, prompts, and configurations like temperature.

In this case, we test the models `claude-instant-v1` and `claude-2` on two similar but differently worded prompts.
Both prompts ask Claude "Is 17077 a prime number?", but the second prompt encourages the model to say "I don't know" if it is not sure.

This is a technique to reduce hallucination.
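
As a rough illustration of what "iterating over models and prompts" means, the sketch below expands lists of arguments into every combination. It assumes the experiment runs one completion per combination, which is consistent with the four result rows shown later in this notebook but is not taken from the `prompttools` source. The prompt strings here are hypothetical placeholders, not the real prompts.

```python
from itertools import product

# Hypothetical stand-ins for the argument lists passed to the experiment.
models = ["claude-instant-v1", "claude-2"]
prompts = ["plain question", "question with permission to say 'I don't know'"]
max_tokens_to_sample = [1000]

# Assumption: each combination of arguments corresponds to one run (one result row).
combinations = [
    {"model": m, "prompt": p, "max_tokens_to_sample": t}
    for m, p, t in product(models, prompts, max_tokens_to_sample)
]

for i, combo in enumerate(combinations):
    print(i, combo["model"], "|", combo["prompt"])
```

Adding a second value to any list, for example another `max_tokens_to_sample` setting, would double the number of combinations under this assumption.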

In [3]:
models = ["claude-instant-v1", "claude-2"]

prompts = [
    f"""{HUMAN_PROMPT}Is 17077 a prime number?
    {AI_PROMPT}""",
    f"""{HUMAN_PROMPT}Answer the following question only if you know the answer or can make a well-informed guess; otherwise tell me you don't know it.
    Is 17077 a prime number?
    {AI_PROMPT}""",
]

experiment = AnthropicCompletionExperiment(max_tokens_to_sample=[1000], model=models, prompt=prompts)

In [4]:
experiment.run()
experiment.visualize()

Unnamed: 0,prompt,response,latency,model
0,\n\nHuman:Is 17077 a prime number?\n \n\nAssistant:,"[ No, 17077 is not a prime number. A prime number is a natural number greater than 1 that has no positive divisors other than 1 and itself. 17077 has positive divisors other than 1 and itself, such as 17 and 1009, so it is not a prime number.]",1.496818,claude-instant-v1
1,\n\nHuman:Answer the following question only if you know the answer or can make a well-informed guess; otherwise tell me you don't know it.\n Is 17077 a prime number?\n \n\nAssistant:,[ I don't know it.],0.611383,claude-instant-v1
2,\n\nHuman:Is 17077 a prime number?\n \n\nAssistant:,"[ Okay, let's check if 17077 is a prime number:\n\nTo determine if a number is prime, we check if it is divisible by any integers between 2 and the square root of the number.\n\nThe square root of 17077 is approximately 130. \nLet's check if 17077 is divisible by any integers between 2 and 130:\n\n2, 3, 4, 5, 6, 7, 8, 9, 10, 11, ... 130\n\n17077 is not evenly divisible by any of these numbers.\n\nTherefore, 17077 is a prime number.]",4.751695,claude-2
3,\n\nHuman:Answer the following question only if you know the answer or can make a well-informed guess; otherwise tell me you don't know it.\n Is 17077 a prime number?\n \n\nAssistant:,"[ I don't know for certain if 17077 is a prime number without doing the computation, but I can make a reasonable guess. 17077 is a 5-digit number that does not end in 0, 2, 4, 5, 6, or 8. Also, the only even number factors it could possibly have are 2 and 4, and it is clearly not divisible by either of those. So while I can't say for absolute certain without checking, I would guess that 17077 is likely a prime number.]",5.827512,claude-2
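
To interpret the responses above, it helps to establish the ground truth independently of any model. The following is a minimal trial-division sketch (the helper name `is_prime` is ours, not part of `prompttools`). It also checks the factorization claimed in the first response.

```python
import math


def is_prime(n: int) -> bool:
    """Trial division by every integer from 2 up to the integer square root of n."""
    if n < 2:
        return False
    for d in range(2, math.isqrt(n) + 1):
        if n % d == 0:
            return False
    return True


print(math.isqrt(17077))  # 130, so divisors only need checking up to 130
print(is_prime(17077))    # True
print(17 * 1009)          # 17153, so 17 and 1009 are not divisors of 17077
```

So 17077 is prime, and the divisors cited by `claude-instant-v1` in the first row are a hallucination.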


## Auto-Evaluate the model response

To evaluate the model response, we can define an eval method that passes a fact and the previous model response into another LLM to get feedback.

In this case, we are using a built-in evaluation function `autoeval_scoring` provided within `prompttools`.

The evaluation function provides Claude 2 with a fact (the ground truth) and the previous model response, then asks Claude 2 for a score from 1 to 7. A low score means the answer is factually wrong, a high score means the answer is correct, and a medium score indicates an uncertain answer that is not necessarily wrong.

You can also write your own auto-evaluation function or pick a different model to be the judge.
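
As a sketch of what a hand-written evaluation function could look like, the example below scores a response with simple keyword rules instead of an LLM judge. It assumes an evaluation function receives a row-like object with a `"response"` field plus extra keyword arguments. That signature is an assumption, not confirmed by this notebook, and the name `rule_based_score` and its parameters are hypothetical.

```python
def rule_based_score(row, expected_is_prime, response_column_name="response"):
    """Return 7 for a correct verdict, 1 for a wrong one, 4 if no verdict is found.

    Assumption: `row` supports `row[column_name]` lookup, as a dict or a
    pandas Series would.
    """
    text = str(row[response_column_name]).lower()
    # Check the negative phrasing first, since "not a prime" also contains "a prime".
    says_not_prime = "not a prime" in text or "not prime" in text
    says_prime = not says_not_prime and (
        "is a prime" in text or "is prime" in text or "likely a prime" in text
    )
    if not (says_prime or says_not_prime):
        return 4  # no recognizable verdict, e.g. "I don't know"
    return 7 if says_prime == expected_is_prime else 1


# Plain dicts stand in for result rows here.
samples = [
    {"response": "No, 17077 is not a prime number."},
    {"response": "I don't know it."},
    {"response": "Therefore, 17077 is a prime number."},
]
for sample in samples:
    print(rule_based_score(sample, expected_is_prime=True))
```

Keyword rules like these are crude: a hedged answer that happens to contain the phrase "is a prime number" would still receive the top score, which is one reason to prefer an LLM judge for free-form responses.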

In [5]:
from prompttools.utils import autoeval_scoring

fact = "17077 is a prime number, because it has no divisor aside from 1 and 17077."
experiment.evaluate("Score", autoeval_scoring, expected=[fact] * 4)
experiment.visualize()

Unnamed: 0,prompt,response,latency,Score,model
0,\n\nHuman:Is 17077 a prime number?\n \n\nAssistant:,"[ No, 17077 is not a prime number. A prime number is a natural number greater than 1 that has no positive divisors other than 1 and itself. 17077 has positive divisors other than 1 and itself, such as 17 and 1009, so it is not a prime number.]",1.496818,1,claude-instant-v1
1,\n\nHuman:Answer the following question only if you know the answer or can make a well-informed guess; otherwise tell me you don't know it.\n Is 17077 a prime number?\n \n\nAssistant:,[ I don't know it.],0.611383,3,claude-instant-v1
2,\n\nHuman:Is 17077 a prime number?\n \n\nAssistant:,"[ Okay, let's check if 17077 is a prime number:\n\nTo determine if a number is prime, we check if it is divisible by any integers between 2 and the square root of the number.\n\nThe square root of 17077 is approximately 130. \nLet's check if 17077 is divisible by any integers between 2 and 130:\n\n2, 3, 4, 5, 6, 7, 8, 9, 10, 11, ... 130\n\n17077 is not evenly divisible by any of these numbers.\n\nTherefore, 17077 is a prime number.]",4.751695,7,claude-2
3,\n\nHuman:Answer the following question only if you know the answer or can make a well-informed guess; otherwise tell me you don't know it.\n Is 17077 a prime number?\n \n\nAssistant:,"[ I don't know for certain if 17077 is a prime number without doing the computation, but I can make a reasonable guess. 17077 is a 5-digit number that does not end in 0, 2, 4, 5, 6, or 8. Also, the only even number factors it could possibly have are 2 and 4, and it is clearly not divisible by either of those. So while I can't say for absolute certain without checking, I would guess that 17077 is likely a prime number.]",5.827512,7,claude-2


From the scores above, we see that Claude 2, acting as the judge, assigns a low score when the model response is wrong (1) or uncertain (3), and a high score (7) when the response concludes that 17077 is prime.
You should also consider using a different model (such as GPT-4) as the judge.