# Prompt Gym

by [Inspired Cognition](https://inspiredco.ai)

This notebook demonstrates a simple way to experiment with prompts for text generation: testing **different models** and **different prompts**, and evaluating them according to **different criteria**.

<p align="center">
<img src="prompt-gym.png"  width="256" height="256">
</p>

In this example, we test text generation models from two companies: [OpenAI's GPT-3](https://openai.com/blog/gpt-3-apps/) and [Cohere's text generation models](https://cohere.ai/generate). The models are evaluated with the [Inspired Cognition Critique](https://docs.inspiredco.ai/critique/) tool for text generation evaluation. We demonstrate text summarization on 100 examples from the [CNN-DailyMail dataset](https://huggingface.co/datasets/cnn_dailymail), but you can swap in whatever models, prompts, metrics, and data you would like in order to try other tasks too!

By the end of the exploration, you will have a **comprehensive report** of which prompts and models work well along a number of axes, like the actual table below that was generated from this notebook:

| Model | Prompt | UniEval (Consistency) | UniEval (Coherence) | UniEval (Fluency) | UniEval (Relevance) | BartScore (Coverage) | Length Ratio |
| --- | --- | --- | --- | --- | --- | --- | --- |
| cohere_medium | standard | 0.7466 | 0.4006 | 0.8869 | 0.3438 | -3.4095 | 2.5533 |
| cohere_medium | tldr | 0.5006 | 0.2967 | 0.8539 | 0.3312 | -3.1348 | 2.5800 |
| cohere_medium | concise | 0.8542 | 0.6115 | 0.9140 | 0.6167 | -3.4220 | 2.4500 |
| cohere_medium | complete | 0.8331 | 0.4845 | 0.8825 | 0.5214 | -3.1689 | 2.6767 |
| openai_babbage_001 | standard | 0.9409 | 0.9036 | 0.8782 | 0.7975 | -3.4083 | 2.0800 |
| openai_babbage_001 | tldr | 0.8728 | 0.9072 | 0.9593 | 0.8145 | -3.5234 | 1.0200 |
| openai_babbage_001 | concise | 0.9483 | 0.9365 | 0.8669 | 0.8431 | -3.2528 | 2.2800 |
| openai_babbage_001 | complete | 0.9306 | 0.8278 | 0.8634 | 0.6951 | -3.2720 | 2.2633 |
| openai_ada_001 | standard | 0.6750 | 0.7270 | 0.8850 | 0.8174 | -3.6719 | 2.0067 |
| openai_ada_001 | tldr | 0.7999 | 0.7122 | 0.7973 | 0.6728 | -3.7436 | 1.5300 |
| openai_ada_001 | concise | 0.7776 | 0.7439 | 0.8106 | 0.5852 | -3.6096 | 2.3600 |
| openai_ada_001 | complete | 0.7732 | 0.5008 | 0.7332 | 0.3283 | -3.5246 | 2.4567 |

The notebook also gives some pointers on how to do **further exploration** of the results, such as finding examples where a particular method is doing well or poorly, or where one method is outperforming the other.

If you want to discuss more, you can join the [Inspired Cognition Discord](https://discord.com/invite/vJHdpCBqWN) or get in touch through our [contact page](https://inspiredco.ai/contact/). We love talking about applications of generative AI!

## Setup

First, we import the necessary libraries and set up our API keys.

To install the requirements, run:

```bash
pip install -r requirements.txt
```

You can get the necessary API keys here:
* [OpenAI API Key](https://openai.com/blog/openai-api/)
* [Cohere API Key](https://cohere.ai/)
* [Inspired Cognition API Key](https://dashboard.inspiredco.ai)

Then, create a file called `.env` in the same directory as this notebook, and add the following lines (with `...` replaced by your API keys):

```
OPENAI_API_KEY=...
COHERE_API_KEY=...
INSPIREDCO_API_KEY=...
```

Finally, execute the following cell to set everything up:

In [None]:
import os
import json
import time

import cohere
import inspiredco.critique
import openai
import tqdm

# Load environment variables from a .env file
import dotenv
dotenv.load_dotenv()

# Set up API credentials
openai.api_key = os.environ["OPENAI_API_KEY"]
cohere_api_key = os.environ["COHERE_API_KEY"]
co = cohere.Client(cohere_api_key)
inspiredco_api_key = os.environ["INSPIREDCO_API_KEY"]
critique = inspiredco.critique.Critique(inspiredco_api_key)
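
If one of the keys is missing, the setup cell fails with a `KeyError`. As a rough sketch, a check like the one below can tell you which variables are unset before you run it. The helper name `missing_keys` is hypothetical, and the variable names are assumed to match the ones read in the setup cell.

```python
import os

# Assumption: these names match the variables read in the setup cell.
REQUIRED_KEYS = ["OPENAI_API_KEY", "COHERE_API_KEY", "INSPIREDCO_API_KEY"]


def missing_keys(environ, required=REQUIRED_KEYS):
    """Return the required variable names that are unset or empty."""
    return [name for name in required if not environ.get(name)]


missing = missing_keys(os.environ)
if missing:
    print("Missing API keys:", ", ".join(missing))
else:
    print("All API keys found.")
```

Passing the environment in as an argument keeps the helper easy to try out on a plain dictionary.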

## Choosing Models

Next, decide which models and configurations you want to use. These should follow the configuration supported by each provider. You can find more information about each provider's generation API here:

* [OpenAI API Doc](https://beta.openai.com/docs/api-reference/completions/create)
* [Cohere API Doc](https://cohere.ai/docs/api/)

For demonstration purposes, we use the smaller versions of each model (`text-babbage-001` and `text-ada-001` for OpenAI, and `medium` for Cohere), but you can change these to the larger versions (`text-davinci-003` for OpenAI and `xlarge` for Cohere) if you want to try them out.

In [None]:
# Specify which models you want to use
models = {
    "cohere_medium": {
        "provider": "cohere",
        "config": {
            "model": "medium",
            "temperature": 0.3,
            "max_tokens": 100,
            "top_p": 1,
        }
    },
    "openai_babbage_001": {
        "provider": "openai",
        "config": {
            "model": "text-babbage-001",
            "temperature": 0.3,
            "max_tokens": 100,
            "top_p": 1,
        }
    },
    "openai_ada_001": {
        "provider": "openai",
        "config": {
            "model": "text-ada-001",
            "temperature": 0.3,
            "max_tokens": 100,
            "top_p": 1,
        }
    },
}
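
Since every entry above shares the same sampling settings, one possible way to add the larger models is a small helper that fills in the shared values. This is only a sketch: `make_model_entry`, `BASE_CONFIG`, and the dictionary keys `cohere_xlarge` and `openai_davinci_003` are hypothetical names chosen for illustration, while the model identifiers are the ones mentioned in the text.

```python
# Shared sampling settings, copied from the configuration above.
BASE_CONFIG = {"temperature": 0.3, "max_tokens": 100, "top_p": 1}


def make_model_entry(provider, model):
    """Build one entry in the same shape as the `models` dictionary."""
    return {"provider": provider, "config": {"model": model, **BASE_CONFIG}}


# Hypothetical entry names; merge into `models` with models.update(larger_models).
larger_models = {
    "cohere_xlarge": make_model_entry("cohere", "xlarge"),
    "openai_davinci_003": make_model_entry("openai", "text-davinci-003"),
}
print(larger_models["openai_davinci_003"])
```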

## Choosing Prompts

Next, choose your prompts. In each prompt, the input text is substituted for the placeholder `[X]`.

The examples below are for text summarization, but you can change the prompts to do other tasks as well.

In [None]:
# Specify the prompts you want to use
prompts = {
    "standard": "Summarize the following text:\n[X]\n\nSummary:",
    "tldr": "[X]\nTL;DR:",
    "concise": "Write a short and concise summary of the following text:\n[X]\n\nSummary:",
    "complete": "Write a complete summary of the following text:\n[X]\n\nSummary:",
}
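
To preview what a model will actually receive, you can fill in the placeholder by hand. The sketch below uses the same `str.replace` substitution as the generation step later in the notebook. The helper name `fill_prompt` and the sample sentence are made up for illustration.

```python
def fill_prompt(template: str, source: str) -> str:
    """Substitute the source text for the [X] placeholder."""
    return template.replace("[X]", source)


# Two of the templates from above, repeated so the example runs on its own.
example_prompts = {
    "standard": "Summarize the following text:\n[X]\n\nSummary:",
    "tldr": "[X]\nTL;DR:",
}
for name, template in example_prompts.items():
    print(f"--- {name} ---")
    print(fill_prompt(template, "The cat sat on the mat."))
```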

## Choosing Evaluation Metrics

Finally, you'll want to decide which metrics you use to evaluate the quality of the generated text. Critique supports a wide variety of metrics, so you'll want to pick appropriate ones for your task. You can read more about this on the following pages:
* [Critique Evaluation Criteria](https://docs.inspiredco.ai/critique/criteria.html)
* [Critique Metrics](https://docs.inspiredco.ai/critique/metrics.html)

In [None]:
metrics = {
    "UniEval (Consistency)": {
        "metric": "uni_eval",
        "config": {"task": "summarization", "evaluation_aspect": "consistency"},
    },
    "UniEval (Coherence)": {
        "metric": "uni_eval",
        "config": {"task": "summarization", "evaluation_aspect": "coherence"},
    },
    "UniEval (Fluency)": {
        "metric": "uni_eval",
        "config": {"task": "summarization", "evaluation_aspect": "fluency"},
    },
    "UniEval (Relevance)": {
        "metric": "uni_eval",
        "config": {"task": "summarization", "evaluation_aspect": "relevance"},
    },
    "BartScore (Coverage)": {
        "metric": "bart_score",
        "config": {"model": "facebook/bart-large-cnn", "variety": "reference_target_bidirectional"},
    },
    "Length Ratio": {
        "metric": "length_ratio",
        "config": {},
    },
}

## Set up Data

Now we'll load our data! Put the inputs in a jsonl file, where the input is in the `source` field and a gold-standard output is in the `reference` field.
The source and target examples in this repo are 10 documents and summaries from the CNN-DailyMail dataset, but you can swap in whatever data you want. You'll probably want to use more examples for robust results in practice, but we use a small number here for demonstration purposes.

In [None]:
with open("input_data.jsonl", "r") as f:
    input_data = [json.loads(line) for line in f.readlines()]
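
If you want to try your own data, the expected file has one JSON object per line. As a minimal sketch, the snippet below writes and reads back a toy file in that format. The helper names, the file name `toy_input_data.jsonl`, and the two toy examples are all invented for illustration.

```python
import json
import os
import tempfile


def write_jsonl(path, rows):
    """Write one JSON object per line."""
    with open(path, "w") as f:
        for row in rows:
            f.write(json.dumps(row) + "\n")


def read_jsonl(path):
    """Read a JSONL file, skipping blank lines."""
    with open(path, "r") as f:
        return [json.loads(line) for line in f if line.strip()]


# Toy examples with the two fields the notebook expects.
toy_rows = [
    {"source": "A long article about a storm.", "reference": "A storm hit."},
    {"source": "A long article about an election.", "reference": "An election was held."},
]
toy_path = os.path.join(tempfile.mkdtemp(), "toy_input_data.jsonl")
write_jsonl(toy_path, toy_rows)
loaded = read_jsonl(toy_path)
print(len(loaded), "examples loaded")
```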

## Generate Output Text

Iterate through the models and prompts and generate the output text. This may take a little while, since it calls the APIs many times. It also writes the generated data to `output_data/targets.json` so you can re-run the evaluation step later without having to generate outputs again.

In [None]:
targets = {}
for model_name, model_info in models.items():
    config, provider = model_info["config"], model_info["provider"]
    targets[model_name] = {}
    for prompt_name, prompt_template in prompts.items():
        my_data = []
        for input in tqdm.tqdm(input_data, desc=f"Generating {model_name=} {prompt_name=}"):
            source = input["source"]
            if provider == "openai":
                response = openai.Completion.create(
                    engine=config["model"],
                    prompt=prompt_template.replace("[X]", source),
                    temperature=config["temperature"],
                    max_tokens=config["max_tokens"],
                    top_p=config["top_p"],
                )
                my_data.append(response["choices"][0]["text"])
            elif provider == "cohere":
                response = co.generate(  
                    model=config["model"],  
                    prompt=prompt_template.replace("[X]", source),
                    temperature=config["temperature"],  
                    max_tokens=config["max_tokens"],
                    p=config["top_p"], 
                )
                my_data.append(response.generations[0].text)
                time.sleep(10)  # Sleep to avoid rate limiting on developer API
            else:
                raise ValueError("Unknown provider, but you can add your own!")
        targets[model_name][prompt_name] = my_data
# Create the output directory if it does not exist yet
os.makedirs("output_data", exist_ok=True)
with open("output_data/targets.json", "w") as f:
    json.dump(targets, f, indent=2)
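
To re-run the evaluation without generating again, you can read the saved file back in. The sketch below assumes the file was written by the cell above. The helper name `load_targets` is hypothetical.

```python
import json
import os


def load_targets(path="output_data/targets.json"):
    """Return cached generations, or None if the file does not exist."""
    if not os.path.exists(path):
        return None
    with open(path, "r") as f:
        return json.load(f)


cached = load_targets()
if cached is None:
    print("No cached outputs found; run the generation cell first.")
else:
    print("Loaded outputs for models:", ", ".join(cached))
```

If the file exists, assigning the result to `targets` lets the evaluation cell run as written.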

## Evaluate Output Text

Now we'll evaluate the output text from a number of different perspectives. We'll also save the evaluation results for later use.

In [None]:
critique_data = {}
# Dispatch jobs
for model_name, model_data in targets.items():
    critique_data[model_name] = {}
    for prompt_name, target_data in model_data.items():
        critique_data[model_name][prompt_name] = {}
        print(f"Submitting evaluation jobs for {model_name=} {prompt_name=}")
        for metric_name, metric_info in metrics.items():
            metric, config = metric_info["metric"], metric_info["config"]
            dataset = [
                {"source": input["source"], "target": target, "references": [input["reference"]]}
                for input, target in zip(input_data, target_data)
            ]
            critique_data[model_name][prompt_name][metric_name] = critique.submit_task(
                metric=metric,
                config=config,
                dataset=dataset,
            )

# Collect results
for model_name, model_data in critique_data.items():
    for prompt_name, prompt_data in model_data.items():
        print(f"Retrieving evaluation results for {model_name=} {prompt_name=}")
        for metric_name, task_id in prompt_data.items():
            prompt_data[metric_name] = critique.wait_for_result(task_id)
with open("output_data/evaluations.json", "w") as f:
    json.dump(critique_data, f, indent=2)

## Report Evaluation Table

Finally, we'll generate a table of the evaluation results, and save it to `output_data/evaluation_table.md`, as well as displaying it here.

In [None]:
with open("output_data/evaluation_table.md", "w") as f:
    def print_both(text):
        print(text)
        print(text, file=f)
    metric_list = list(metrics.keys())
    headers = ["Model", "Prompt"] + metric_list
    print_both("| " + " | ".join(headers) + " |")
    print_both("| " + " | ".join("---" for _ in headers) + " |")
    for model_name, model_data in critique_data.items():
        for prompt_name, prompt_data in model_data.items():
            row = [model_name, prompt_name]
            for metric_name, metric_data in prompt_data.items():
                row.append(f"{metric_data['overall']['value']:0.4f}")
            print_both("| " + " | ".join(row) + " |")
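
Beyond the table, you may want a quick summary of which configuration scores highest on each metric. The sketch below assumes the nested `model -> prompt -> metric -> {"overall": {"value": ...}}` layout used by the table cell above. The helper name `best_per_metric` is hypothetical, and the toy values are copied from the UniEval (Relevance) column of the table at the top of this notebook. Note that "highest" is not always "best": for a metric like Length Ratio, a larger value is not necessarily better.

```python
def best_per_metric(results):
    """Map each metric name to (value, model, prompt) with the highest overall value."""
    best = {}
    for model_name, model_data in results.items():
        for prompt_name, prompt_data in model_data.items():
            for metric_name, metric_data in prompt_data.items():
                value = metric_data["overall"]["value"]
                if metric_name not in best or value > best[metric_name][0]:
                    best[metric_name] = (value, model_name, prompt_name)
    return best


# A small slice of results in the same shape as `critique_data`.
toy_results = {
    "cohere_medium": {
        "standard": {"UniEval (Relevance)": {"overall": {"value": 0.3438}}},
        "concise": {"UniEval (Relevance)": {"overall": {"value": 0.6167}}},
    },
    "openai_babbage_001": {
        "concise": {"UniEval (Relevance)": {"overall": {"value": 0.8431}}},
    },
}
for metric_name, (value, model_name, prompt_name) in best_per_metric(toy_results).items():
    print(f"{metric_name}: {model_name} / {prompt_name} ({value:0.4f})")
```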

## Further Exploration

You can also further explore the results on an example-by-example basis for a particular model, prompt, and metric. First, let's specify the model, prompt, and metric we want to explore.

In [None]:
model = "cohere_medium"
prompt = "standard"
metric = "UniEval (Relevance)"

### Finding High/low-performing Examples

Here's an example of how you can sort the outputs and find particularly high-performing or low-performing examples to do error analysis. Here we're only outputting the system outputs and references because the summarization sources are long, but you could also print out the sources if you wanted to.

In [None]:
metric_data = [x["value"] for x in critique_data[model][prompt][metric]["examples"]]
target_data = targets[model][prompt]
graded_data = list(zip(input_data, target_data, metric_data))
graded_data.sort(key=lambda x: x[2], reverse=True)

def print_graded(input, target, metric_value):
    print(f"=== {metric_value:0.4f} ===")
    print("*** Reference ***")
    print(input["reference"])
    print()
    print("*** Target ***")
    print(target)
    print()

print(f"----- Best {metric} Examples for '{model} {prompt}' -----")
for input, target, metric_value in graded_data[:3]:
    print_graded(input, target, metric_value)


print(f"----- Worst {metric} Examples for '{model} {prompt}' -----")
for input, target, metric_value in graded_data[-3:]:
    print_graded(input, target, metric_value)


### Finding Examples Where One Method Outperforms Another

When working with prompting, we'll often want to know which prompts are better than others and in which ways. First, let's specify two methods we're interested in:

In [None]:
model1 = "cohere_medium"
prompt1 = "standard"
model2 = "cohere_medium"
prompt2 = "tldr"
metric = "UniEval (Relevance)"

Then we can find examples where one method outperforms the other:

In [None]:
metric_data1 = [x["value"] for x in critique_data[model1][prompt1][metric]["examples"]]
target_data1 = targets[model1][prompt1]
metric_data2 = [x["value"] for x in critique_data[model2][prompt2][metric]["examples"]]
target_data2 = targets[model2][prompt2]
graded_data = list(zip(input_data, target_data1, target_data2, metric_data1, metric_data2))
graded_data.sort(key=lambda x: x[3]-x[4], reverse=True)

def print_graded(input, target1, target2, metric_value1, metric_value2):
    print(f"=== {metric_value1:0.4f} vs. {metric_value2:0.4f} ===")
    print("*** Reference ***")
    print(input["reference"])
    print()
    print("*** Target 1 ***")
    print(target1)
    print()
    print("*** Target 2 ***")
    print(target2)
    print()

print(f"----- Examples where '{model1} {prompt1}' outperforms '{model2} {prompt2}' on {metric} -----")
for input, target1, target2, metric_value1, metric_value2 in graded_data[:3]:
    print_graded(input, target1, target2, metric_value1, metric_value2)

print(f"----- Examples where '{model2} {prompt2}' outperforms '{model1} {prompt1}' on {metric} -----")
for input, target1, target2, metric_value1, metric_value2 in graded_data[-3:]:
    print_graded(input, target1, target2, metric_value1, metric_value2)
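
Looking at individual examples is easier to interpret alongside an overall count of how often each method wins. The following is a sketch only: `win_tie_loss` is a hypothetical helper, and the toy scores are invented. It takes two lists of per-example scores like `metric_data1` and `metric_data2` above.

```python
def win_tie_loss(scores1, scores2, tolerance=0.0):
    """Count examples where method 1 scores higher, equal (within tolerance), or lower."""
    wins = ties = losses = 0
    for s1, s2 in zip(scores1, scores2):
        diff = s1 - s2
        if abs(diff) <= tolerance:
            ties += 1
        elif diff > 0:
            wins += 1
        else:
            losses += 1
    return wins, ties, losses


# Invented per-example scores for two methods.
toy_scores1 = [0.9, 0.4, 0.7, 0.5]
toy_scores2 = [0.6, 0.4, 0.8, 0.2]
wins, ties, losses = win_tie_loss(toy_scores1, toy_scores2)
print(f"Method 1 wins {wins}, ties {ties}, loses {losses}")
```

A nonzero `tolerance` treats very small score differences as ties, which may be useful when the metric is noisy.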

## Final Words

That's it for today! We hope you had a good prompting workout. If you have any comments, questions, or suggestions, drop us a line on [Discord](https://discord.com/invite/vJHdpCBqWN) or through our [contact page](https://inspiredco.ai/contact/).