# PaLM 2 Completion Experiment Example

## Installations

In [None]:
# !pip install --quiet --force-reinstall prompttools

## Setup imports and API keys

First, we'll need to set our API keys. If the `DEBUG` environment variable is set to `"1"`, we are in debug mode and don't need a real PaLM key.

In [1]:
import os

os.environ["DEBUG"] = ""  # Set to "" to call the real PaLM API, or "1" to use debug mode
os.environ["GOOGLE_PALM_API_KEY"] = ""  # Insert your key here

Then we'll import the relevant `prompttools` modules to set up our experiment.

You can also list the PaLM models that support text completion.

In [8]:
from prompttools.experiment import GooglePaLMCompletionExperiment
import google.generativeai as palm


palm.configure(api_key=os.environ["GOOGLE_PALM_API_KEY"])
[m.name for m in palm.list_models() if "generateText" in m.supported_generation_methods]

['models/text-bison-001']

## Run an experiment

Next, we create our test inputs. We can iterate over models, prompts, and configurations like temperature.


In [31]:
models = ["models/text-bison-001"]

prompts = [
    "Is 97 a prime number?",
    "Answer the following question only if you know the answer or can make a well-informed guess; otherwise tell me you don't know it. Is 17077 a prime number?",
]

temperatures = [0.0]  # [0.0, 1.0]  # You can try different temperatures if you'd like.

experiment = GooglePaLMCompletionExperiment(model=models, prompt=prompts, temperature=temperatures)

In [32]:
experiment.run()
experiment.visualize()

index,prompt,response,latency
0,Is 97 a prime number?,[yes],0.568011
1,Answer the following question only if you know the answer or can make a well-informed guess; otherwise tell me you don't know it. Is 17077 a prime number?,[yes],0.407583
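
The two rows above are consistent with one call per combination of arguments: 1 model × 2 prompts × 1 temperature. The following is a minimal sketch of that expansion, assuming the experiment takes the Cartesian product of its argument lists. `expand_grid` is a hypothetical helper written for illustration, not part of `prompttools`, and the prompts are shortened stand-ins.

```python
from itertools import product

models = ["models/text-bison-001"]
prompts = ["Is 97 a prime number?", "Is 17077 a prime number?"]  # shortened stand-ins
temperatures = [0.0]


def expand_grid(**arg_lists):
    """Hypothetical helper: return one dict per combination of argument values."""
    names = list(arg_lists)
    return [dict(zip(names, values)) for values in product(*arg_lists.values())]


combos = expand_grid(model=models, prompt=prompts, temperature=temperatures)
for combo in combos:
    print(combo)
print(len(combos), "combinations")
```

Under this assumption, uncommenting `[0.0, 1.0]` for the temperatures would double the number of calls, so the grid size is worth checking before running against a paid API.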


## Evaluate the model response

To evaluate the results, we'll define an eval function. We can use semantic similarity to check whether the model's response is close to our expected output.

In [33]:
from prompttools.utils import semantic_similarity

In [35]:
experiment.evaluate("similar_to_expected", semantic_similarity, expected=["yes"] * 2)

In [36]:
experiment.visualize()

index,prompt,response,latency,similar_to_expected
0,Is 97 a prime number?,[yes],0.568011,1.0
1,Answer the following question only if you know the answer or can make a well-informed guess; otherwise tell me you don't know it. Is 17077 a prime number?,[yes],0.407583,1.0
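
To give some intuition for the `similar_to_expected` column, here is a minimal sketch of a cosine-similarity score between two strings. It is a stand-in only: it uses a bag-of-words vector rather than a learned sentence embedding, and it is not how `semantic_similarity` is implemented. The function names are hypothetical.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Count lowercase whitespace-separated tokens (a crude stand-in for an embedding)."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine of the angle between the two token-count vectors."""
    va, vb = bag_of_words(a), bag_of_words(b)
    dot = sum(va[token] * vb[token] for token in va)
    norm_a = math.sqrt(sum(count * count for count in va.values()))
    norm_b = math.sqrt(sum(count * count for count in vb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty string has no direction to compare
    return dot / (norm_a * norm_b)


print(cosine_similarity("yes", "yes"))        # identical strings
print(cosine_similarity("yes", "no"))         # no shared tokens
print(cosine_similarity("yes it is", "yes"))  # partial overlap
```

Identical strings score the maximum, which matches the 1.0 values in the table, where both responses equal the expected `"yes"`. A real embedding model would also give credit to paraphrases that share no tokens, which this sketch cannot do.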


## Auto-Evaluate the model response

To evaluate the model response, we can define an eval method that passes a fact and the previous model response into another LLM to get feedback.

In this case, we are using a built-in evaluation function `autoeval_scoring` provided within `prompttools`.

The evaluation function gives a judge model (you can choose which one) a fact (the ground truth) and the previous model response. It then asks the judge for a score from 1 to 7: a low score means the answer is factually wrong, a high score means it is correct, and a middle score marks an uncertain answer that is not necessarily wrong.

You can also write your own auto-evaluation function or pick a different model to be the judge.
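
As a rough sketch of what a hand-written auto-evaluation could look like, the code below builds a judge prompt from a fact and a response, then parses a 1 to 7 score from the judge's reply. Everything here is an assumption for illustration: the prompt wording, the function names, and the `judge` callable (any function that maps a prompt string to a reply string) are hypothetical and do not reflect how `autoeval_scoring` is implemented. A fake judge is used so the example runs without an API key.

```python
import re


def build_judge_prompt(fact, response):
    """Assemble the text sent to the judge model (wording is illustrative only)."""
    return (
        f"Fact: {fact}\n"
        f"Answer: {response}\n"
        "Rate the answer from 1 (factually wrong) to 7 (correct). "
        "Use a middle score for an uncertain answer that is not necessarily wrong. "
        "Reply with the number only."
    )


def parse_score(judge_reply):
    """Extract the first standalone digit between 1 and 7 from the judge's reply."""
    match = re.search(r"\b([1-7])\b", judge_reply)
    if match is None:
        raise ValueError(f"No score between 1 and 7 found in: {judge_reply!r}")
    return int(match.group(1))


def score_response(fact, response, judge):
    """Ask the judge callable for a reply and convert it to an integer score."""
    return parse_score(judge(build_judge_prompt(fact, response)))


def fake_judge(prompt):
    """Hypothetical stand-in for a real LLM call; always returns the top score."""
    return "7"


score = score_response("97 is a prime number.", "yes", fake_judge)
print(score)
```

Keeping prompt construction, the model call, and score parsing separate makes it easy to swap in a different judge model or to test the parsing on its own.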

In [38]:
from prompttools.utils import autoeval_scoring

os.environ["ANTHROPIC_API_KEY"] = ""  # If you would like to use Claude 2 as the judge
fact = "97 and 17077 are both prime numbers, because they have no divisor aside from 1 and themselves."
experiment.evaluate("Score", autoeval_scoring, expected=[fact] * 2)
experiment.visualize()

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


index,prompt,response,latency,similar_to_expected,Score
0,Is 97 a prime number?,[yes],0.568011,1.0,7
1,Answer the following question only if you know the answer or can make a well-informed guess; otherwise tell me you don't know it. Is 17077 a prime number?,[yes],0.407583,1.0,7
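
The judge's scores are only as reliable as the fact we hand it, so it is worth checking that fact independently. A minimal sketch using trial division, with a hypothetical helper `is_prime`:

```python
def is_prime(n):
    """Trial division: test odd divisors up to the square root of n."""
    if n < 2:
        return False
    if n % 2 == 0:
        return n == 2
    divisor = 3
    while divisor * divisor <= n:
        if n % divisor == 0:
            return False
        divisor += 2
    return True


for number in (97, 17077):
    print(number, is_prime(number))
```

Both numbers pass the check, so the fact string given to the judge is correct.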


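From the scores above, we see that the judge (Claude 2) gives both responses the top score of 7, which matches the fact we supplied, since both answers are correct. This run contains no wrong or uncertain response, so it does not show how the judge scores those cases.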
You should also consider using a different model (such as GPT-4) as the judge.