# Promptfoo Python Integration Guide

This notebook demonstrates how to use promptfoo with Python in Google Colab:

1. **Setup**: Install Node.js 22 and promptfoo
2. **Basic Evaluation**: Compare multiple LLM models with different prompts
3. **Python Integration**: Evaluate custom Python code alongside LLMs
4. **Advanced Examples**: Multiple Python adapters and use cases

**What you'll learn:**
- Multi-model comparisons (o1-mini, GPT-4.1-mini, GPT-4o-mini, Claude 3.5 Sonnet)
- Custom Python provider evaluation
- LangChain integration patterns
- Web-based result analysis



# Node/Promptfoo setup

Install Node.js 22 (required for promptfoo). Colab comes with an older version, so we'll upgrade it.

In [None]:
# Install Node.js 22 (required for promptfoo)
# Colab comes with Node 18, but we'll upgrade to Node 22 for best compatibility
!curl -fsSL https://deb.nodesource.com/setup_22.x | sudo -E bash -
!sudo apt-get install -y nodejs

# Verify installation
!node --version
!npm --version
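
To check the upgrade from Python rather than by eye, the string printed by `node --version` can be parsed. This is a minimal sketch that assumes the usual `vMAJOR.MINOR.PATCH` format; `node_major` is a hypothetical helper name, not part of promptfoo.

```python
import re


def node_major(version_output: str) -> int:
    """Extract the major version from output such as 'v22.11.0'."""
    match = re.match(r"v?(\d+)\.", version_output.strip())
    if match is None:
        raise ValueError(f"Unrecognized Node version string: {version_output!r}")
    return int(match.group(1))


# In the notebook, the input could come from the real binary, for example:
#   import subprocess
#   out = subprocess.run(["node", "--version"], capture_output=True, text=True).stdout
#   assert node_major(out) >= 22, "Node.js 22 is expected by this notebook"
print(node_major("v22.11.0"))
```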

Next, we'll install and initialize promptfoo.

In [2]:
# Set up promptfoo
%env npm_config_yes=true
!npx promptfoo@latest init

env: npm_config_yes=true
Wrote prompts.txt and promptfooconfig.yaml. Open README.md to get started!


# Configure promptfoo

First, we set up the prompts. See https://promptfoo.dev/docs/configuration/parameters for more info on prompt files.

In [3]:
%%writefile prompts.txt
You're an ecommerce chat assistant for a shoe company.
Answer this user's question: {{name}}: "{{question}}"
---
You're a smart, bubbly chat assistant for a shoe company.
Answer this user's question: {{name}}: "{{question}}"

Overwriting prompts.txt
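
The file above holds two prompts separated by `---`, and each `{{variable}}` placeholder is filled from a test case's `vars`. promptfoo renders templates with Nunjucks; the sketch below only imitates simple variable substitution with a regular expression to show what one rendered prompt looks like. `split_prompts` and `render` are illustrative names, not promptfoo functions.

```python
import re

PROMPTS_TXT = """You're an ecommerce chat assistant for a shoe company.
Answer this user's question: {{name}}: "{{question}}"
---
You're a smart, bubbly chat assistant for a shoe company.
Answer this user's question: {{name}}: "{{question}}"
"""


def split_prompts(text):
    """Split a prompt file on lines that contain only '---'."""
    parts = re.split(r"^---\s*$", text, flags=re.MULTILINE)
    return [part.strip() for part in parts if part.strip()]


def render(template, variables):
    """Replace {{name}}-style placeholders; unknown names become empty strings."""
    return re.sub(
        r"\{\{\s*(\w+)\s*\}\}",
        lambda m: str(variables.get(m.group(1), "")),
        template,
    )


prompts = split_prompts(PROMPTS_TXT)
rendered = render(
    prompts[0],
    {"name": "Bob", "question": "Can you help me find a pair of sandals on your website?"},
)
print(len(prompts))
print(rendered)
```

With two prompts, three providers, and three test cases, an eval of this configuration produces 2 × 3 × 3 = 18 outputs.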


Next, we set up the configuration. See https://promptfoo.dev/docs/configuration/guide for more info on configuration.

In [4]:
%%writefile promptfooconfig.yaml
# yaml-language-server: $schema=https://promptfoo.dev/config-schema.json
description: "E-commerce chat assistant evaluation across latest models"
prompts: [prompts.txt]
providers: 
  - openai:o1-mini
  - openai:gpt-4.1-mini
  - anthropic:claude-3-5-sonnet-20241022
tests:
  - vars:
      name: Bob
      question: Can you help me find a pair of sandals on your website?
    assert:
      - type: contains
        value: sandals
      - type: python
        value: "len(output) > 50 and 'helpful' in output.lower()"
      - type: llm-rubric
        value: "Response is helpful and professional for an e-commerce site"
  - vars:
      name: Jane
      question: Do you have any discounts available?
    assert:
      - type: contains
        value: discount
      - type: python
        value: |
          # Check if response mentions specific discount types
          discount_types = ['sale', 'coupon', 'promo', 'percent', '%', 'off']
          return any(word in output.lower() for word in discount_types)
  - vars:
      name: Dave
      question: What are your shipping and return policies?
    assert:
      - type: contains
        value: shipping
      - type: python
        value: |
          # Custom scoring for policy completeness
          policy_elements = ['shipping', 'return', 'exchange', 'refund']
          score = sum(1 for element in policy_elements if element in output.lower()) / len(policy_elements)
          return {
            'pass': score > 0.5,
            'score': score,
            'reason': f'Policy completeness: {score:.1%}'
          }

Overwriting promptfooconfig.yaml
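
Inline `python` assertions are easier to debug when the same logic is first run as an ordinary function. The sketch below copies the policy-completeness check from the third test case into a standalone function (`policy_completeness` is a name chosen here for illustration) and applies it to two made-up responses.

```python
def policy_completeness(output):
    """Mirror of the inline assertion: fraction of policy keywords mentioned."""
    policy_elements = ['shipping', 'return', 'exchange', 'refund']
    score = sum(1 for element in policy_elements if element in output.lower()) / len(policy_elements)
    return {
        'pass': score > 0.5,
        'score': score,
        'reason': f'Policy completeness: {score:.1%}',
    }


# Hypothetical model outputs, used only to exercise the function
partial = policy_completeness("We offer free shipping and 30-day returns.")
fuller = policy_completeness("Shipping is free. Returns and refunds take 5 days.")
print(partial)
print(fuller)
```

Because the threshold is a strict `score > 0.5`, a response that mentions exactly two of the four keywords fails, while one that mentions three passes.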


# Set up secrets

This section loads API keys for the LLM providers we'll be testing.

**Option 1: Google Drive secrets.json** (recommended for this notebook):

Create `/content/drive/MyDrive/Projects/promptfoo/secrets.json`:
```json
{
  "OPENAI_API_KEY": "sk-abc123...",
  "ANTHROPIC_API_KEY": "sk-ant-api03-abc123..."
}
```

**Option 2: Colab Secrets** (alternative):
Use Colab's built-in secrets management (🔑 icon in sidebar).
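
For Option 2, one possible approach is to copy each secret into `os.environ` so that promptfoo's child processes can see it. The sketch takes the lookup function as a parameter so that it can be tried outside Colab. In Colab, `google.colab.userdata.get` is assumed to be the lookup. `load_keys` is a hypothetical helper name.

```python
import os
from typing import Callable, Iterable, List


def load_keys(get_secret: Callable[[str], str], names: Iterable[str]) -> List[str]:
    """Copy the named secrets into os.environ and return the names that were set."""
    loaded = []
    for name in names:
        try:
            value = get_secret(name)
        except Exception:
            # Secret missing or access not granted: skip it
            continue
        if value:
            os.environ[name] = value
            loaded.append(name)
    return loaded


# In Colab (assumption: secrets with these names exist in the sidebar):
#   from google.colab import userdata
#   load_keys(userdata.get, ["OPENAI_API_KEY", "ANTHROPIC_API_KEY"])
fake_store = {"OPENAI_API_KEY": "sk-test"}  # stand-in for the Colab secret store
print(load_keys(fake_store.__getitem__, ["OPENAI_API_KEY", "ANTHROPIC_API_KEY"]))
```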

In [5]:
from google.colab import drive

drive.mount("/content/drive")

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [6]:
import json
import os

with open("/content/drive/MyDrive/Projects/promptfoo/secrets.json") as f:
    secrets = json.load(f)

for key, value in secrets.items():
    os.environ[key] = value

# Run the eval

First, run the eval. This produces a quick side-by-side table view in the cell output.

In [7]:
!npx promptfoo@latest eval -c /content/promptfooconfig.yaml --no-progress-bar

Creating cache folder at /root/.promptfoo/cache.

┌────────────────────┬────────────────────┬────────────────────┬────────────────────┬────────────────────┬────────────────────┐
│ name               │ question           │ [openai:gpt-3.5-tu │ [openai:gpt-4] You │ [openai:gpt-3.5-tu │ [openai:gpt-4] You │
│                    │                    │ rbo-0613] You're a │ 're an ecommerce c │ rbo-0613] You're a │ 're a smart, bubbl │
│                    │                    │ n ecommerce chat 

Now we can share and view this in the web viewer, which is easier to use and contains tools for drilling down and grading the outputs.

In [15]:
!npx promptfoo@latest share --yes

View results: https://app.promptfoo.dev/eval/f:ffaf05af-6d34-4ff3-b035-2a73e0c97375


# Eval a Python script in Colab

# Python Code Evaluation Examples

This section shows how to evaluate custom Python code alongside LLM providers. We'll demonstrate:

1. **LangChain Math Chain** - Compare LangChain's math capabilities vs direct LLM calls
2. **Custom Python Functions** - Evaluate your own Python logic
3. **API Integration** - Test external API calls vs LLM responses

This approach works with any Python code - machine learning models, APIs, databases, etc.
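
The scripts in this section are attached through `exec:` providers. As the examples here assume, the rendered prompt arrives as the first command-line argument and whatever the script prints to standard output is treated as the response. A minimal sketch of that contract follows, with a placeholder `answer` function standing in for real logic.

```python
import sys


def answer(prompt: str) -> str:
    """Placeholder logic: report how many words the prompt contains."""
    return f"Received a prompt with {len(prompt.split())} words"


def main(argv):
    # argv[1] is assumed to hold the rendered prompt
    prompt = argv[1] if len(argv) > 1 else ""
    # Only the response should go to stdout; send debug output to stderr
    print(answer(prompt))


if __name__ == "__main__":
    main(sys.argv)
```

Keeping the logic in a plain function such as `answer` makes it testable without going through the command line.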

In [None]:
# Install Python packages for our examples (LLMMathChain needs numexpr)
%pip install -q langchain langchain-openai langchain-community openai requests numpy numexpr

## Example 1: LangChain Math Chain

Here's a LangChain implementation that solves mathematical problems:

In [10]:
%%writefile langchain_math.py
import sys
import os
from langchain_openai import OpenAI
from langchain.chains import LLMMathChain

# Initialize OpenAI LLM with modern LangChain
llm = OpenAI(
    temperature=0,
    openai_api_key=os.getenv('OPENAI_API_KEY'),
    model_name="gpt-3.5-turbo-instruct"  # Use instruct model for math
)

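# Create the math chain. verbose=False keeps chain logs out of stdout,
# which the exec provider reads as the model's response.
math_chain = LLMMathChain.from_llm(llm, verbose=False)

# Process the math question passed as the first command-line argument
try:
    result = math_chain.run(sys.argv[1])
    # LLMMathChain already prefixes its result with "Answer: "
    print(result)
except Exception as e:
    print(f"Error: {e}")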

Overwriting langchain_math.py


In [None]:
%%writefile custom_assertions.py
def evaluate_customer_service(output, context):
    """
    Advanced assertion function that evaluates customer service quality
    with detailed scoring and multiple criteria.
    """
    import re
    
    # Initialize scoring components
    scores = {
        'politeness': 0,
        'helpfulness': 0,
        'specificity': 0,
        'completeness': 0
    }
    
    output_lower = output.lower()
    
    # Check politeness
    polite_words = ['please', 'thank you', 'sorry', 'appreciate', 'welcome']
    politeness_score = sum(1 for word in polite_words if word in output_lower) / len(polite_words)
    scores['politeness'] = min(politeness_score, 1.0)
    
    # Check helpfulness 
    helpful_phrases = ['i can help', 'let me assist', 'here are some options', 'i recommend']
    helpfulness_score = sum(1 for phrase in helpful_phrases if phrase in output_lower) / len(helpful_phrases)
    scores['helpfulness'] = min(helpfulness_score, 1.0)
    
    # Check specificity (mentions specific products, prices, etc.)
    specific_indicators = ['$', 'size', 'color', 'material', 'brand', 'model']
    specificity_score = sum(1 for indicator in specific_indicators if indicator in output_lower) / len(specific_indicators)
    scores['specificity'] = min(specificity_score, 1.0)
    
    # Check completeness (length and structure)
    word_count = len(output.split())
    completeness_score = min(word_count / 50, 1.0)  # Normalize to 50 words as ideal
    scores['completeness'] = completeness_score
    
    # Calculate overall score
    overall_score = sum(scores.values()) / len(scores)
    
    # Determine pass/fail
    passes = overall_score > 0.6
    
    return {
        'pass': passes,
        'score': overall_score,
        'reason': f'Customer service quality: {overall_score:.1%}',
        'namedScores': scores,
        'componentResults': [
            {
                'pass': scores['politeness'] > 0.3,
                'score': scores['politeness'],
                'reason': f'Politeness: {scores["politeness"]:.1%}'
            },
            {
                'pass': scores['helpfulness'] > 0.3,
                'score': scores['helpfulness'], 
                'reason': f'Helpfulness: {scores["helpfulness"]:.1%}'
            }
        ]
    }

def evaluate_translation_quality(output, context):
    """Evaluate translation quality with language-specific checks"""
    target_lang = context.get('vars', {}).get('target_language', '').lower()
    
    # Language-specific keyword checks
    language_indicators = {
        'french': ['le', 'la', 'de', 'et', 'je', 'vous'],
        'spanish': ['el', 'la', 'de', 'y', 'que', 'en'],
        'german': ['der', 'die', 'das', 'und', 'ist', 'mit'],
        'italian': ['il', 'la', 'di', 'e', 'che', 'in']
    }
    
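    if target_lang in language_indicators:
        indicators = language_indicators[target_lang]
        # Match whole words only: as substrings, short indicators such as
        # 'e' or 'la' would be found in almost any text.
        tokens = {word.strip('.,;:!?"\'()') for word in output.lower().split()}
        found_indicators = sum(1 for word in indicators if word in tokens)
        language_score = found_indicators / len(indicators)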
        
        return {
            'pass': language_score > 0.2,
            'score': language_score,
            'reason': f'Language authenticity for {target_lang}: {language_score:.1%}'
        }
    
    # Fallback for unknown languages
    return {
        'pass': len(output) > 0,
        'score': 0.5,
        'reason': 'Basic translation check passed'
    }
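
Keyword-based language checks are sensitive to how the keywords are matched. The sketch below compares substring matching with whole-word matching on an English sentence, using the Italian indicator list from `evaluate_translation_quality`. The sentence and both helper names are made up for illustration.

```python
import re


def substring_hits(text, words):
    """Indicators found anywhere in the text, even inside longer words."""
    lower = text.lower()
    return [w for w in words if w in lower]


def whole_word_hits(text, words):
    """Indicators that appear as complete words."""
    tokens = set(re.findall(r"[^\W\d_]+", text.lower()))
    return [w for w in words if w in tokens]


italian = ['il', 'la', 'di', 'e', 'che', 'in']
english = "The delivery will be late, including checked items."

print(substring_hits(english, italian))
print(whole_word_hits(english, italian))
```

With substring matching, every Italian indicator is found in this English sentence ("will", "late", "including", "checked"), so the text would score as fully Italian. Whole-word matching finds none.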


In [None]:
%%writefile test_generators.py
from typing import Any, Dict, Optional

def generate_translation_tests(config: Optional[Dict[str, Any]] = None):
    """Generate translation test cases with optional configuration."""
    
    # Default test data
    phrases = [
        "Hello, how are you?",
        "Thank you very much",
        "Where is the library?",
        "I would like to order food",
        "What time is it?"
    ]
    
    languages = ["French", "Spanish", "German", "Italian"]
    
    # Override with config if provided
    if config:
        phrases = config.get("phrases", phrases)
        languages = config.get("languages", languages)
        max_tests = config.get("max_tests", len(phrases) * len(languages))
    else:
        max_tests = 8  # Default limit
    
    test_cases = []
    count = 0
    
    for phrase in phrases:
        for lang in languages:
            if count >= max_tests:
                break
                
            test_case = {
                "vars": {
                    "text": phrase,
                    "target_language": lang
                },
                "assert": [
                    {"type": "python", "value": "file://custom_assertions.py:evaluate_translation_quality"},
                    {"type": "regex", "value": ".{10,}"}  # At least 10 characters
                ],
                "description": f"Translate '{phrase}' to {lang}"
            }
            test_cases.append(test_case)
            count += 1
            
        if count >= max_tests:
            break
    
    return test_cases

def generate_customer_service_tests(config: Optional[Dict[str, Any]] = None):
    """Generate customer service evaluation test cases."""
    
    # Realistic customer service scenarios
    scenarios = [
        {
            "name": "Sarah",
            "question": "I received a damaged product, what should I do?",
            "category": "returns"
        },
        {
            "name": "Mike", 
            "question": "Do you have this shirt in size large and blue color?",
            "category": "product_inquiry"
        },
        {
            "name": "Lisa",
            "question": "My order hasn't arrived yet and it's been 2 weeks",
            "category": "shipping"
        },
        {
            "name": "Alex",
            "question": "Can I get a discount on my first order?",
            "category": "promotions"
        }
    ]
    
    # Override scenarios with config if provided
    if config and "scenarios" in config:
        scenarios = config["scenarios"]
    
    test_cases = []
    for scenario in scenarios:
        test_case = {
            "vars": {
                "name": scenario["name"],
                "question": scenario["question"]
            },
            "assert": [
                {"type": "python", "value": "file://custom_assertions.py:evaluate_customer_service"},
                {"type": "contains", "value": scenario["name"]},
                {"type": "llm-rubric", "value": f"Response appropriately addresses a {scenario['category']} concern"}
            ],
            "description": f"Handle {scenario['category']} question from {scenario['name']}"
        }
        test_cases.append(test_case)
    
    return test_cases
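
The nested loops and counter in `generate_translation_tests` enumerate phrase/language pairs and stop at `max_tests`. The same enumeration can be written more compactly with `itertools`, as sketched below. `build_translation_vars` is a hypothetical name and returns only the `vars` portion of each test case.

```python
from itertools import islice, product


def build_translation_vars(phrases, languages, max_tests):
    """Pair every phrase with every language, keeping at most max_tests pairs."""
    pairs = islice(product(phrases, languages), max_tests)
    return [{"text": phrase, "target_language": lang} for phrase, lang in pairs]


cases = build_translation_vars(
    ["Hello world", "How are you?"],
    ["French", "Spanish"],
    max_tests=3,
)
for case in cases:
    print(case)
```

`product` varies the last argument fastest, so the order matches the original loops: every language for the first phrase, then the next phrase.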


In [None]:
%%writefile advanced_config.yaml
# yaml-language-server: $schema=https://promptfoo.dev/config-schema.json
description: "Comprehensive Python integration showcase"
prompts:
  - "You're a helpful e-commerce assistant. {{name}} asks: '{{question}}'"
  - "Translate '{{text}}' to {{target_language}}"
providers:
  - openai:o1-mini
  - openai:gpt-4.1-mini
  - anthropic:claude-3-5-sonnet-20241022

# Test cases using different Python patterns
tests:
  # 1. Basic test with multiple assertion types
  - vars:
      name: "Emma"
      question: "What's your return policy?"
    assert:
      - type: contains
        value: "return"
      - type: python
        value: "len(output.split()) >= 20"  # Inline Python
      - type: python
        value: "file://custom_assertions.py:evaluate_customer_service"  # External function
      - type: llm-rubric
        value: "Response is professional and informative"

  # 2. Python test case generators
  - file://test_generators.py:generate_customer_service_tests

  # 3. Test generator with custom configuration
  - path: file://test_generators.py:generate_translation_tests
    config:
      max_tests: 4
      languages: ["French", "Spanish"]
      phrases: ["Hello world", "How are you?"]

  # 4. Complex inline Python assertion
  - vars:
      name: "John"
      question: "Do you have winter boots in size 10?"
    assert:
      - type: python
        value: |
          # Multi-criteria evaluation
          criteria = {
            'mentions_size': '10' in output,
            'mentions_winter': 'winter' in output.lower(),
            'mentions_boots': 'boots' in output.lower() or 'footwear' in output.lower(),
            'helpful_tone': any(word in output.lower() for word in ['yes', 'available', 'stock', 'have'])
          }
          
          score = sum(criteria.values()) / len(criteria)
          
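          return {
            'pass': score >= 0.75,
            'score': score,
            'reason': f'Met {sum(criteria.values())}/{len(criteria)} criteria',
            # Report each criterion as a numeric score (1.0 or 0.0) rather than a boolean
            'namedScores': {name: float(met) for name, met in criteria.items()}
          }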


In [None]:
!npx promptfoo@latest eval -c advanced_config.yaml --no-progress-bar

In [None]:
%%writefile custom_math.py
import sys
import re
import math

def solve_math_problem(question):
    """Simple math solver using Python"""
    question = question.lower()
    
    # Handle different types of math problems
    if "cube root" in question:
        # Extract number for cube root
        numbers = re.findall(r'\d+', question)
        if numbers:
            num = int(numbers[0])
            result = round(num ** (1/3), 2)
            return f"The cube root of {num} is {result}"
    
    elif "binary" in question and "base 10" in question:
        # Convert binary to decimal
        binary_match = re.search(r'\b[01]+\b', question)
        if binary_match:
            binary = binary_match.group()
            decimal = int(binary, 2)
            return f"Binary {binary} equals {decimal} in base 10"
    
    elif "fourth root" in question:
        # Extract number for fourth root
        numbers = re.findall(r'\d+', question)
        if numbers:
            num = int(numbers[0])
            result = num ** (1/4)
            return f"The fourth root of {num} is {result}"
    
    elif "magnitude" in question and "complex" in question:
        # Handle complex number magnitude
        real_match = re.search(r'(\d+)\s*\+', question)
        imag_match = re.search(r'\+\s*(\d+)i', question)
        if real_match and imag_match:
            real = int(real_match.group(1))
            imag = int(imag_match.group(1))
            magnitude = math.sqrt(real**2 + imag**2)
            return f"The magnitude of {real} + {imag}i is {magnitude}"
    
    return f"Python solver: Unable to solve '{question}'"

if __name__ == "__main__":
    result = solve_math_problem(sys.argv[1])
    print(result)


## Math Problem Configuration

Now let's set up a comprehensive evaluation comparing:
1. **o1-mini** - Direct LLM math reasoning
2. **LangChain Math Chain** - Structured math problem solving  
3. **Custom Python** - Our own math logic

In [17]:
%%writefile mathconfig.yaml
# yaml-language-server: $schema=https://promptfoo.dev/config-schema.json
description: "Math problem solving: Latest LLMs vs Python"
prompts: mathprompt.txt
providers:
  - openai:o1-mini
  - exec:python langchain_math.py
  - exec:python custom_math.py
tests:
  - vars:
      question: What is the cube root of 389017?
    assert:
      - type: contains
        value: "73"
  - vars:
      question: If you have 101101 in binary, what number does it represent in base 10?
    assert:
      - type: contains
        value: "45"
  - vars:
      question: What is the natural logarithm (ln) of 89234?
    assert:
      - type: contains
        value: "11.39"
  - vars:
      question: If a complex number is represented as 3 + 4i, what is its magnitude?
    assert:
      - type: contains
        value: "5"
  - vars:
      question: What is the fourth root of 1296?
    assert:
      - type: contains
        value: "6"

Overwriting mathconfig.yaml
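
The expected values in these assertions can be checked independently with Python's standard library. A short sketch that recomputes each one:

```python
import math

# Cube root of 389017 (floating point, so round before comparing)
cube_root = round(389017 ** (1 / 3))

# Binary 101101 interpreted in base 10
binary_value = int("101101", 2)

# Natural logarithm of 89234
natural_log = math.log(89234)

# Magnitude of the complex number 3 + 4i
magnitude = abs(complex(3, 4))

# Fourth root of 1296
fourth_root = round(1296 ** 0.25)

print(cube_root, binary_value, natural_log, magnitude, fourth_root)
```

The `contains` checks on single digits such as "5" and "6" are loose: any response that happens to include that digit passes, whether or not the answer is right.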


And provide a prompt:

In [12]:
%%writefile mathprompt.txt
Think carefully and answer this math problem: {{question}}

Overwriting mathprompt.txt


In [None]:
%%writefile weather_api.py
import sys
import requests
import json

def get_weather_info(location):
    """Return mock weather data for a location (stand-in for a real API call)."""
    try:
        # Demo only: no network request is made. In production, call a real
        # weather service (e.g. OpenWeatherMap) with an API key loaded from
        # the environment instead of using the hard-coded table below.
        weather_data = {
            "New York": {"temp": "22°C", "condition": "Partly cloudy", "humidity": "65%"},
            "London": {"temp": "15°C", "condition": "Rainy", "humidity": "80%"},
            "Tokyo": {"temp": "28°C", "condition": "Sunny", "humidity": "55%"},
            "Paris": {"temp": "18°C", "condition": "Overcast", "humidity": "70%"}
        }
        
        # Simple location matching
        for city, data in weather_data.items():
            if city.lower() in location.lower():
                return f"Weather in {city}: {data['temp']}, {data['condition']}, Humidity: {data['humidity']}"
        
        return f"Weather API: Sorry, no weather data available for {location}"
        
    except Exception as e:
        return f"Weather API Error: {str(e)}"

if __name__ == "__main__":
    location = sys.argv[1] if len(sys.argv) > 1 else "Unknown"
    result = get_weather_info(location)
    print(result)


In [None]:
%%writefile weatherconfig.yaml
# yaml-language-server: $schema=https://promptfoo.dev/config-schema.json
description: "Weather information: LLM knowledge vs API data"
prompts: 
  - "What's the current weather in {{location}}?"
providers:
  - openai:gpt-4o-mini
  - exec:python weather_api.py
tests:
  - vars:
      location: "New York"
    assert:
      # icontains is case-insensitive, so it also matches "Weather in ..."
      - type: icontains
        value: "weather"
  - vars:
      location: "London"
    assert:
      - type: icontains
        value: "weather"
  - vars:
      location: "Tokyo"
    assert:
      - type: icontains
        value: "weather"
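
Case sensitivity matters for these checks. promptfoo documents `contains` as case-sensitive and `icontains` as its case-insensitive counterpart. The two functions below are illustrative stand-ins for those semantics, not promptfoo code, applied to the string that `weather_api.py` prints for London.

```python
def contains(output: str, value: str) -> bool:
    """Case-sensitive substring check."""
    return value in output


def icontains(output: str, value: str) -> bool:
    """Case-insensitive substring check."""
    return value.lower() in output.lower()


mock_output = "Weather in London: 15°C, Rainy, Humidity: 80%"

print(contains(mock_output, "weather"))
print(icontains(mock_output, "weather"))
```

The mock output capitalizes "Weather", so only the case-insensitive check matches it.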


In [None]:
!npx promptfoo eval -c weatherconfig.yaml --no-progress-bar

Run the math eval (`mathconfig.yaml`) to produce a comparison:

In [19]:
!npx promptfoo@latest eval -c mathconfig.yaml --no-progress-bar


┌────────────────────────────────────────┬────────────────────────────────────────┬────────────────────────────────────────┐
│ question                               │ [openai:gpt-4-0613] Think carefully an │ [exec:python langchain_example.py] Thi │
│                                        │ d answer this math problem: {{question │ nk carefully and answer this math prob │
│                                        │ }}                                     │ lem: {{question}}                      │
├────────────────────────────────────────┼────────────────────────────────────────┼────────────────────────────────────────┤
│ What is the

And view the results in the web viewer:

In [20]:
!npx promptfoo@latest share --yes

View results: https://app.promptfoo.dev/eval/f:f447942f-42a9-46fa-b38e-3ec7bf860418
