<a href="https://colab.research.google.com/github/jgracie52/bh-2025/blob/main/Promptfoo_colab.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This notebook will guide you through the basics of using the promptfoo command-line interface (CLI) for evaluating and testing Large Language Models (LLMs) and prompts.

promptfoo is a powerful open-source tool that helps you:

* Compare different prompts.

* Evaluate various LLM providers and models.

* Run automated tests for correctness, safety, and quality.

* Track performance metrics.

Note: This notebook is designed to be run in Google Colab or a local Jupyter environment with Node.js installed.

**1. Setup and Installation**

promptfoo is built with Node.js. Colab runtimes typically include Node.js already, so the cell below only installs promptfoo with npm. If you're running locally and don't have Node.js, install it first (Node.js 18+ is recommended).
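
Before installing, it can help to confirm that Node.js is actually available. The following is a minimal sketch, not part of promptfoo: the helper names `parse_node_major` and `installed_node_major` are made up for this illustration, and the check simply reads the output of `node --version`.

```python
import re
import shutil
import subprocess


def parse_node_major(version_text):
    """Return the major version from text like 'v18.17.0', or None if it does not match."""
    match = re.match(r"v?(\d+)\.", version_text.strip())
    return int(match.group(1)) if match else None


def installed_node_major():
    """Return the major version of the `node` binary on PATH, or None if it is missing."""
    node_path = shutil.which("node")
    if node_path is None:
        return None
    result = subprocess.run([node_path, "--version"], capture_output=True, text=True)
    return parse_node_major(result.stdout)


major = installed_node_major()
if major is None:
    print("Node.js not found on PATH; install it before running npm.")
elif major < 18:
    print(f"Node.js {major} found; version 18 or newer is recommended.")
else:
    print(f"Node.js {major} found.")
```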

In [None]:
# @title 1. Install Promptfoo CLI
# Install promptfoo globally using npm.

print("Installing promptfoo...")
!npm install -g promptfoo

print("\nPromptfoo installed successfully!")
# Verify installation
!promptfoo --version

Installing promptfoo...
[1G[0K⠙[1G[0K⠹[1G[0K⠸[1G[0K⠼[1G[0K⠴[1G[0K⠦[1G[0K⠧[1G[0K⠇[1G[0K⠏[1G[0K⠋[1G[0K⠙[1G[0K⠹[1G[0K⠸[1G[0K⠼[1G[0K⠴[1G[0K⠦[1G[0K⠧[1G[0K⠇[1G[0K⠏[1G[0K⠋[1G[0K⠙[1G[0K⠹[1G[0K⠸[1G[0K⠼[1G[0K⠴[1G[0K⠦[1G[0K⠧[1G[0K⠇[1G[0K⠏[1G[0K⠋[1G[0K⠙[1G[0K⠹[1G[0K⠸[1G[0K⠼[1G[0K⠴[1G[0K⠦[1G[0K⠧[1G[0K⠇[1G[0K⠏[1G[0K⠋[1G[0K⠙[1G[0K⠹[1G[0K⠸[1G[0K⠼[1G[0K⠴[1G[0K⠦[1G[0K⠧[1G[0K⠇[1G[0K⠏[1G[0K⠋[1G[0K⠙[1G[0K⠹[1G[0K⠸[1G[0K⠼[1G[0K⠴[1G[0K⠦[1G[0K⠧[1G[0K⠇[1G[0K⠏[1G[0K⠋[1G[0K⠙[1G[0K⠹[1G[0K⠸[1G[0K⠼[1G[0K⠴[1G[0K⠦[1G[0K⠧[1G[0K⠇[1G[0K⠏[1G[0K⠋[1G[0K⠙[1G[0K⠹[1G[0K⠸[1G[0K⠼[1G[0K⠴[1G[0K⠦[1G[0K⠧[1G[0K⠇[1G[0K⠏[1G[0K⠋[1G[0K⠙[1G[0K⠹[1G[0K⠸[1G[0K⠼[1G[0K⠴[1G[0K⠦[1G[0K⠧[1G[0K⠇[1G[0K⠏[1G[0K⠋[1G[0K⠙[1G[0K⠹[1G[0K⠸[1G[0K⠼[1G[0K⠴[1G[0K⠦[1G[0K⠧[1G[0K⠇[1G[0K⠏[1G[0K⠋[1G[0K⠙[1G[0K⠹[1G[0K⠸[1G[0K⠼[1G[0K⠴[1G[0K⠦[1G[0K⠧[1G[0K⠇[1G

**2. Basic Concepts: Prompts, Providers, and Assertions**

Before we dive into testing, let's understand the core components promptfoo uses:

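* `prompts.txt` / `prompts.yaml`: A file (or files) where you define the prompts you want to test. Prompts can also be listed inline in the configuration file, as this notebook does.

* `promptfoo.yaml` (promptfoo looks for `promptfooconfig.yaml` by default): The main configuration file, where you specify: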

  * `providers`: The LLM APIs or local models you want to use (e.g., OpenAI, Anthropic, Hugging Face, local Ollama).

  * `tests`: Defines your test cases, including:

     * `vars` (variables): Data points to inject into your prompts.

     * `assert` (assertions): Rules to evaluate the LLM's output (e.g., regex match, JSON validation, AI-powered evaluation).

* `output.jsonl` / `output.csv`: The results file generated by promptfoo.
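
To see how these pieces fit together: promptfoo runs every prompt against every provider for every test case, filling in the `vars` each time. The sketch below imitates that expansion in plain Python using the sample values that appear later in this notebook. The `render` helper is a simplified stand-in written for this illustration; it is not promptfoo's own templating engine.

```python
import itertools
import re

prompts = [
    "Write a positive review about {{product}}.",
    "Suggest ways to improve {{product}} in a positive tone.",
]
providers = ["google:gemini-2.5-flash"]
tests = [
    {"vars": {"product": "smartphone"}},
    {"vars": {"product": "coffee maker"}},
]


def render(template, variables):
    """Replace each {{name}} placeholder with its value (simplified stand-in for real templating)."""
    return re.sub(r"\{\{\s*(\w+)\s*\}\}", lambda m: str(variables[m.group(1)]), template)


# One evaluated case per (prompt, provider, test) combination.
cases = [
    (provider, render(prompt, test["vars"]))
    for prompt, provider, test in itertools.product(prompts, providers, tests)
]

for provider, rendered_prompt in cases:
    print(f"{provider}: {rendered_prompt}")
```

With two prompts, one provider, and two test cases, this yields four rendered prompts, and each one would then be checked against the assertions of its test case.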


**3. Defining the `promptfoo.yaml` Configuration**

This file will specify:

* Your prompts.

* Which LLM provider to use (e.g., OpenAI `gpt-3.5-turbo`, or Hugging Face `gpt2` for local testing).

* The variables (`product` in this case) we want to inject.

* Simple assertions that check the output is not empty and contains an expected keyword.

**Important for LLM Providers:**

**OpenAI/Anthropic/Google**: You'll need an API key. It's best practice to set this as an environment variable (`OPENAI_API_KEY`, `ANTHROPIC_API_KEY`, `GOOGLE_API_KEY`).

**Hugging Face (local)**: `promptfoo` can run Hugging Face models locally using the `huggingface-local` provider, but this might require significant resources (especially GPU) for larger models. For quick demos, `gpt2` is a good choice.
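
A quick way to catch a missing key before running an evaluation is to check the environment up front. This is a minimal sketch under the assumption that the variable names are the ones used in this notebook; the `missing_api_key` helper and the `API_KEY_ENV_VARS` mapping are hypothetical names introduced for illustration.

```python
import os

# Environment variable that holds the key for each API-based provider type,
# using the names that appear in this notebook.
API_KEY_ENV_VARS = {
    "openai": "OPENAI_API_KEY",
    "anthropic": "ANTHROPIC_API_KEY",
    "google": "GOOGLE_API_KEY",
}


def missing_api_key(provider_type, environ=os.environ):
    """Return the name of the unset key variable for provider_type, or None if nothing is missing."""
    env_var = API_KEY_ENV_VARS.get(provider_type)
    if env_var is None:
        return None  # provider type with no key listed here, e.g. a local model
    return None if environ.get(env_var) else env_var


for provider_type in ["openai", "anthropic", "google", "huggingface-local"]:
    missing = missing_api_key(provider_type)
    status = f"missing {missing}" if missing else "ok"
    print(f"{provider_type}: {status}")
```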



In [None]:
# @title 3.2. Create promptfoo.yaml
# Choose a provider with the LLM_PROVIDER_TYPE form field below.
# API-based providers (OpenAI, Anthropic, Google) need a matching API key in Colab secrets.
# No API key is loaded for the huggingface-local option.

LLM_PROVIDER_TYPE = "google" # @param ["openai", "huggingface-local", "anthropic", "google"]
OPENAI_MODEL_NAME = "gpt-3.5-turbo" # @param
HUGGINGFACE_LOCAL_MODEL_NAME = "gpt2" # @param
ANTHROPIC_MODEL_NAME = "claude-3-haiku-20240307" # @param
GOOGLE_MODEL_NAME = "gemini-2.5-flash" # @param

# Load the API key for the selected provider from Colab secrets
if LLM_PROVIDER_TYPE == "openai":
    import os
    try:
        from google.colab import userdata
        os.environ["OPENAI_API_KEY"] = userdata.get('OPENAI_API_KEY')
        print("OpenAI API key loaded from Colab secrets.")
    except Exception as e:
        print(f"Could not load OpenAI API key from Colab secrets: {e}")
        print("Please ensure you have an 'OPENAI_API_KEY' secret set in Colab.")
        print("You can also set it manually: os.environ['OPENAI_API_KEY'] = 'YOUR_API_KEY'")

elif LLM_PROVIDER_TYPE == "anthropic":
    import os
    try:
        from google.colab import userdata
        os.environ["ANTHROPIC_API_KEY"] = userdata.get('ANTHROPIC_API_KEY')
        print("Anthropic API key loaded from Colab secrets.")
    except Exception as e:
        print(f"Could not load Anthropic API key from Colab secrets: {e}")
        print("Please ensure you have an 'ANTHROPIC_API_KEY' secret set in Colab.")
        print("You can also set it manually: os.environ['ANTHROPIC_API_KEY'] = 'YOUR_API_KEY'")

elif LLM_PROVIDER_TYPE == "google":
    import os
    try:
        from google.colab import userdata
        os.environ["GOOGLE_API_KEY"] = userdata.get('GOOGLE_API_KEY')
        print("Google API key loaded from Colab secrets.")
    except Exception as e:
        print(f"Could not load Google API key from Colab secrets: {e}")
        print("Please ensure you have a 'GOOGLE_API_KEY' secret set in Colab.")
        print("You can also set it manually: os.environ['GOOGLE_API_KEY'] = 'YOUR_API_KEY'")


product = '{{product}}'
config_content = f"""
prompts:
  - 'Write a positive review about {product}.'
  - 'Suggest ways to improve {product} in a positive tone.'

providers:
"""

if LLM_PROVIDER_TYPE == "openai":
    config_content += f"""  - openai:{OPENAI_MODEL_NAME}
"""
elif LLM_PROVIDER_TYPE == "huggingface-local":
    config_content += f"""  - huggingface-local:{HUGGINGFACE_LOCAL_MODEL_NAME}
"""
elif LLM_PROVIDER_TYPE == "anthropic":
    config_content += f"""  - anthropic:{ANTHROPIC_MODEL_NAME}
"""
elif LLM_PROVIDER_TYPE == "google":
    config_content += f"""  - google:{GOOGLE_MODEL_NAME}
"""


config_content += f"""
tests:
  - vars:
      product: "smartphone"
    assert:
      - type: javascript
        value: "output.length > 0"
      - type: icontains
        value: "positive" # Ensure the review contains positive sentiment
  - vars:
      product: "coffee maker"
    assert:
      - type: javascript
        value: "output.length > 0"
      - type: icontains
        value: "great" # Another simple check
"""

# Write the content to promptfoo.yaml
with open("promptfoo.yaml", "w") as f:
    f.write(config_content)

print("promptfoo.yaml created successfully:")
!cat promptfoo.yaml

Google API key loaded from Colab secrets.
promptfoo.yaml created successfully:

prompts:
  - 'Write a positive review about {{product}}.'
  - 'Suggest ways to improve {{product}} in a positive tone.'

providers:
  - google:gemini-2.5-flash

tests:
  - vars:
      product: "smartphone"
    assert:
      - type: javascript
        value: "output.length > 0"
      - type: icontains
        value: "positive" # Ensure the review contains positive sentiment
  - vars:
      product: "coffee maker"
    assert:
      - type: javascript
        value: "output.length > 0"
      - type: icontains
        value: "great" # Another simple check
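
To make the two assertion types in this configuration concrete, here is a rough Python analogue of what each one checks. It is an illustrative sketch, not promptfoo's implementation, and `sample_output` is an invented model response. Note that `icontains` looks for the literal text, case-insensitively, so a glowing review that never uses the word "positive" would not pass that check.

```python
def passes_not_empty(output):
    """Python analogue of the `javascript` assertion `output.length > 0`."""
    return len(output) > 0


def passes_icontains(output, expected):
    """Python analogue of `icontains`: case-insensitive substring match."""
    return expected.lower() in output.lower()


# Hypothetical model output, invented for this example.
sample_output = "This smartphone is fantastic: fast, sleek, and reliable."

print(passes_not_empty(sample_output))               # True
print(passes_icontains(sample_output, "positive"))   # False: the word never appears
print(passes_icontains(sample_output, "FANTASTIC"))  # True: case is ignored
```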


**4. Running the Tests**

Now that we have our `promptfoo.yaml` set up, we can run the tests using the `promptfoo eval` command.
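
Outside a notebook, where the `!` shell shortcut is not available, the same command can be launched from a Python script. The sketch below uses only the flags shown in this notebook (`--config`, `--output`, `--verbose`); the helper names `build_eval_command` and `run_eval` are hypothetical.

```python
import shutil
import subprocess


def build_eval_command(config_path, output_path, verbose=False):
    """Assemble the argument list for `promptfoo eval` using flags shown in this notebook."""
    command = ["promptfoo", "eval", "--config", config_path, "--output", output_path]
    if verbose:
        command.append("--verbose")
    return command


def run_eval(config_path, output_path, verbose=False):
    """Run the evaluation and return the process exit code, or None if promptfoo is not installed."""
    if shutil.which("promptfoo") is None:
        return None
    completed = subprocess.run(build_eval_command(config_path, output_path, verbose))
    return completed.returncode


# Show the command that would be executed, without running it.
print(" ".join(build_eval_command("promptfoo.yaml", "promptfoo_output.json", verbose=True)))
```

Calling `run_eval("promptfoo.yaml", "promptfoo_output.json")` would then start the evaluation itself.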

In [None]:
# @title 4. Run the Promptfoo Evaluation
# This will execute the tests defined in promptfoo.yaml.
# The results will be stored in promptfoo_output.json.
print("Running promptfoo evaluation...")
!promptfoo eval --config promptfoo.yaml --output promptfoo_output.json --verbose

print("\nEvaluation complete. Results saved to promptfoo_output.json")

Running promptfoo evaluation...
[main.js:88] Verbose mode enabled via --verbose flag
[migrate.js:51] [DB Migrate] Running migrations from /tools/node/lib/node_modules/promptfoo/dist/drizzle...
[migrate.js:53] [DB Migrate] Migrations completed
[index.js:156] Reading prompts from ["Write a positive review about {{product}}.","Suggest ways to improve {{product}} in a positive tone."]
[filterTests.js:45] Starting filterTests with options: {}
[filterTests.js:46] Initial test count: 2
[eval.js:169] Inserting prompt 31db021bccff5faa494b57abcfc523c19400d5bb2692e256c2466d4f32ebd8b2
[eval.js:169] Inserting prompt 1ed6cb7fa14ca319ed81f0d03b62ea09b1a322179705329c46234eb533b35ac3
[eval.js:192] Inserting dataset 6d6dfc5eb64d2ec9c9802e7e1b4eba34560af0e0365474b1638983be195dd7fc
[evaluatorTracing.js:83] [EvaluatorTracing] Checking tracing config: undefined
[evaluatorTracing.js:84] [EvaluatorTra

**5. Viewing Results**

`promptfoo` provides a few ways to view results:

* **CLI output**: The `eval` command prints a summary to the console.

* **JSON/CSV output**: The `--output` flag saves raw results to a file.

* **Web UI**: The `promptfoo view` command launches a local web server for interactive viewing. This is highly recommended!

In [None]:
# @title 5.1. View Raw JSON Output
# This cell will display the contents of the generated JSON output file.

import json

try:
    with open('promptfoo_output.json', 'r') as f:
        data = json.load(f)
        print(json.dumps(data, indent=2))
except FileNotFoundError:
    print("promptfoo_output.json not found. Please run the evaluation cell first.")
except json.JSONDecodeError:
    print("Error decoding JSON from promptfoo_output.json.")

{
  "evalId": "eval-pSb-2025-08-05T17:14:23",
  "results": {
    "version": 3,
    "timestamp": "2025-08-05T17:14:23.192Z",
    "prompts": [
      {
        "raw": "Write a positive review about {{product}}.",
        "label": "Write a positive review about {{product}}.",
        "id": "31db021bccff5faa494b57abcfc523c19400d5bb2692e256c2466d4f32ebd8b2",
        "provider": "google:gemini-2.5-flash",
        "metrics": {
          "score": 1.5,
          "testPassCount": 1,
          "testFailCount": 1,
          "testErrorCount": 0,
          "assertPassCount": 3,
          "assertFailCount": 1,
          "totalLatencyMs": 152,
          "tokenUsage": {
            "prompt": 0,
            "completion": 0,
            "cached": 3098,
            "total": 3098,
            "numRequests": 0,
            "completionDetails": {
              "reasoning": 2252,
              "acceptedPrediction": 0,
              "rejectedPrediction": 0
            },
            "assertions": {
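
The raw JSON is verbose, so a short summary per prompt is often easier to read. The sketch below assumes the layout visible in the output above (`results` → `prompts` → `metrics` with `testPassCount`, `testFailCount`, and `testErrorCount`). The `summarize_prompt_metrics` helper is a hypothetical name, and `sample` is a trimmed copy of the values shown above.

```python
def summarize_prompt_metrics(data):
    """Return one summary row per prompt from a promptfoo results dictionary."""
    rows = []
    for prompt in data.get("results", {}).get("prompts", []):
        metrics = prompt.get("metrics", {})
        passed = metrics.get("testPassCount", 0)
        failed = metrics.get("testFailCount", 0)
        errors = metrics.get("testErrorCount", 0)
        total = passed + failed + errors
        rows.append({
            "label": prompt.get("label", ""),
            "provider": prompt.get("provider", ""),
            "passed": passed,
            "total": total,
            "pass_rate": passed / total if total else None,
        })
    return rows


# Trimmed sample with the same shape as the output above.
sample = {
    "results": {
        "prompts": [
            {
                "label": "Write a positive review about {{product}}.",
                "provider": "google:gemini-2.5-flash",
                "metrics": {"testPassCount": 1, "testFailCount": 1, "testErrorCount": 0},
            }
        ]
    }
}

for row in summarize_prompt_metrics(sample):
    print(f"{row['provider']} | {row['label']} | {row['passed']}/{row['total']} tests passed")
```

To summarize a real run, pass the dictionary loaded from `promptfoo_output.json` (the `data` variable in the cell above) instead of `sample`.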
            

**6. Advanced Assertions and Providers (Optional Exploration)**

This section provides examples of more advanced `promptfoo` features. You can modify `promptfoo.yaml` and re-run the `eval` command to test these.
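
When several rubrics share the same judge model, the assertion entries can be generated rather than typed out by hand. This is a small sketch that builds entries with the `type`, `value`, and `provider` keys used in the cell below; `rubric_assertions` is a hypothetical helper, and the judge name is just the notebook's default model.

```python
import json


def rubric_assertions(rubrics, judge_provider):
    """Build one llm-rubric assertion entry per rubric string."""
    return [
        {"type": "llm-rubric", "value": rubric, "provider": judge_provider}
        for rubric in rubrics
    ]


test_case = {
    "vars": {"topic": "the history of the internet"},
    "assert": rubric_assertions(
        [
            "The response must be factually accurate and cover key milestones.",
            "The tone should be informative and neutral.",
        ],
        "google:gemini-2.5-flash",
    ),
}

print(json.dumps(test_case, indent=2))
```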

In [None]:
# @title 6.1. Example: Using AI-powered Assertions
# promptfoo can use an LLM (e.g., GPT-4) to evaluate outputs.
# This is powerful for subjective evaluations like "politeness" or "factuality".

# Modify promptfoo.yaml to include an AI assertion
# Ensure you have a powerful enough model configured (e.g., openai:gpt-4o, anthropic:claude-3-opus)

topic = '{{topic}}'
ai_assert_config_content = f"""
prompts:
  - "Tell me about {topic}."

providers:
"""

llm_judge = LLM_PROVIDER_TYPE
# Assuming you've already set up an API key for one of these
if LLM_PROVIDER_TYPE == "openai":
    ai_assert_config_content += f"""  - openai:{OPENAI_MODEL_NAME}
"""
    llm_judge += f":{OPENAI_MODEL_NAME}"
elif LLM_PROVIDER_TYPE == "anthropic":
    ai_assert_config_content += f"""  - anthropic:{ANTHROPIC_MODEL_NAME}
"""
    llm_judge += f":{ANTHROPIC_MODEL_NAME}"
elif LLM_PROVIDER_TYPE == "google":
    ai_assert_config_content += f"""  - google:{GOOGLE_MODEL_NAME}
"""
    llm_judge += f":{GOOGLE_MODEL_NAME}"
else:
    print("AI-powered assertions often require API-based models (OpenAI, Anthropic, Google). Skipping for huggingface-local.")


if LLM_PROVIDER_TYPE in ["openai", "anthropic", "google"]:
    ai_assert_config_content += f"""
tests:
  - vars:
      topic: "the history of the internet"
    assert:
      - type: llm-rubric
        value: "The response must be factually accurate and cover key milestones."
        provider: {llm_judge}
        pass: "yes" # The expected outcome from the evaluation LLM
      - type: llm-rubric
        value: "The tone should be informative and neutral."
        provider: {llm_judge}
        pass: "yes"

  - vars:
      topic: "your favorite color"
    assert:
      - type: llm-rubric
        value: "The response must clearly state a color."
        provider: {llm_judge}
        pass: "yes"
      - type: llm-rubric
        value: "The response should not refuse to answer."
        provider: {llm_judge}
        pass: "yes"
"""

    with open("promptfoo_ai_assert.yaml", "w") as f:
        f.write(ai_assert_config_content)

    print("promptfoo_ai_assert.yaml created:")
    !cat promptfoo_ai_assert.yaml

    print("\nRunning AI-powered assertion test (may take longer and incur API costs)...")
    !promptfoo eval -c promptfoo_ai_assert.yaml --output promptfoo_ai_output.json --verbose

    print("\nAI-powered evaluation complete. View results in promptfoo_ai_output.json or via `promptfoo view`.")

    # Display JSON output
    try:
        with open('promptfoo_ai_output.json', 'r') as f:
            data_ai = json.load(f)
            print(json.dumps(data_ai, indent=2))
    except FileNotFoundError:
        print("promptfoo_ai_output.json not found.")

promptfoo_ai_assert.yaml created:

prompts:
  - "Tell me about {{topic}}."

providers:
  - google:gemini-2.5-flash

tests:
  - vars:
      topic: "the history of the internet"
    assert:
      - type: llm-rubric
        value: "The response must be factually accurate and cover key milestones."
        provider: google:gemini-2.5-flash
        pass: "yes" # The expected outcome from the evaluation LLM
      - type: llm-rubric
        value: "The tone should be informative and neutral."
        provider: google:gemini-2.5-flash
        pass: "yes"

  - vars:
      topic: "your favorite color"
    assert:
      - type: llm-rubric
        value: "The response must clearly state a color."
        provider: google:gemini-2.5-flash
        pass: "yes"
      - type: llm-rubric
        value: "The response should not refuse to answer."
        provider: google:gemini-2.5-flash
        pass: "yes"

Running AI-powered assertion test (may take longer and incur API costs)...
[main.js:88]

**7. Clean Up (Optional)**

Remove the generated files.
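
On systems where `rm` is not available (for example, a local Jupyter session on Windows), the same clean-up can be done in Python. This is a minimal sketch; `remove_generated_files` is a hypothetical helper, and the demonstration runs in a temporary directory so that it does not touch your working files.

```python
import tempfile
from pathlib import Path

# The files this tutorial may have created.
GENERATED_FILES = [
    "prompts.txt",
    "promptfoo.yaml",
    "promptfoo_output.json",
    "promptfoo_ai_assert.yaml",
    "promptfoo_ai_output.json",
]


def remove_generated_files(directory=".", names=GENERATED_FILES):
    """Delete each named file in directory if it exists and return the names actually removed."""
    removed = []
    for name in names:
        path = Path(directory) / name
        if path.is_file():
            path.unlink()
            removed.append(name)
    return removed


# Demonstration in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "promptfoo.yaml").write_text("prompts: []\n")
    print(remove_generated_files(tmp))  # ['promptfoo.yaml']
```

Calling `remove_generated_files()` with no arguments would clean the current working directory.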

In [None]:
# @title 7. Clean Up
# This cell removes the files created during this tutorial.

!rm -f prompts.txt promptfoo.yaml promptfoo_output.json promptfoo_ai_assert.yaml promptfoo_ai_output.json

print("Clean up complete.")

Clean up complete.
