# Evaluating prompt effectiveness

Evaluating how well a prompt performs is crucial for any application that uses language models. The quality of the prompt directly affects the model's output in terms of relevance, consistency, specificity, and clarity. In this notebook, we aim to build a toolkit that allows us to evaluate prompts using both manual and automated methods.

In [1]:
import os
from langchain_openai import ChatOpenAI
from sklearn.metrics.pairwise import cosine_similarity
from sentence_transformers import SentenceTransformer
import numpy as np
from dotenv import load_dotenv

# Load environment variables
load_dotenv()

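# Fail early with a clear message if the OpenAI API key is missing.
# ChatOpenAI reads OPENAI_API_KEY from the environment, which load_dotenv() has populated.
if not os.getenv("OPENAI_API_KEY"):
    raise RuntimeError("OPENAI_API_KEY is not set; add it to your environment or .env file.")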

### Initialize the language model
We instantiate a lightweight GPT model from OpenAI using LangChain.

In [2]:
# Initialize the language model
llm = ChatOpenAI(model="gpt-4o-mini-2024-07-18")

### Semantic similarity calculation

One of the core tools in our evaluation is the ability to compare how semantically similar two responses are. We will use cosine similarity over embeddings for this.

In [3]:
# Initialize sentence transformer for semantic similarity
sentence_model = SentenceTransformer('all-MiniLM-L6-v2')

def semantic_similarity(text1, text2):
    """Calculate semantic similarity between two texts using cosine similarity."""
    # Encode both texts into embeddings
    embeddings = sentence_model.encode([text1, text2])
    # Return cosine similarity between the two vectors
    return cosine_similarity([embeddings[0]], [embeddings[1]])[0][0]

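This function encodes both input texts with the `sentence-transformers` model and returns the cosine similarity of the two embeddings. Cosine similarity ranges from -1 to 1; in practice, scores for natural-language text with this model mostly fall between 0 and 1. Higher values indicate closer semantic meaning.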
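To make the score concrete, the following minimal sketch computes cosine similarity by hand for small made-up vectors. The `cosine` helper and the toy two-dimensional vectors are illustrative assumptions and are not part of the notebook; real embeddings from `all-MiniLM-L6-v2` have 384 dimensions.

```python
import math

def cosine(u, v):
    """Cosine similarity: dot(u, v) / (||u|| * ||v||)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

same_direction = cosine([1.0, 2.0], [2.0, 4.0])    # parallel vectors
orthogonal = cosine([1.0, 0.0], [0.0, 1.0])        # perpendicular vectors
opposite = cosine([1.0, 2.0], [-1.0, -2.0])        # vectors pointing in opposite directions

print(same_direction, orthogonal, opposite)
```

The three results are approximately 1, 0, and -1, which shows why the score depends only on the direction of the vectors and not on their length.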


### Metrics for measuring prompt performance
We will define three automated metrics to assess response quality:
- Relevance: Measures semantic similarity to a reference/expected answer.
- Consistency: Measures similarity across multiple responses to the same prompt.
- Specificity: Approximates how varied, rather than repetitive, a response's wording is, using the ratio of unique words to total words.

In [4]:
def relevance_score(response, expected_content):
    """Calculate relevance score based on semantic similarity to expected content."""
    return semantic_similarity(response, expected_content)

def consistency_score(responses):
    """Calculate consistency score based on similarity between multiple responses."""
    if len(responses) < 2:
        return 1.0  # Perfect consistency if there's only one response
    similarities = []
    for i in range(len(responses)):
        for j in range(i+1, len(responses)):
            similarities.append(semantic_similarity(responses[i], responses[j]))
    return np.mean(similarities)

def specificity_score(response):
    """Calculate specificity as the ratio of unique words to total words."""
    words = response.split()
    unique_words = set(words)
    return len(unique_words) / len(words) if words else 0

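These functions offer a quick, automated way to rate response quality. Specificity here is the ratio of unique words to total words (a type-token ratio), so it rewards varied vocabulary but not length; longer responses tend to repeat common words and therefore score lower. Relevance requires a reference answer to compare against.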
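As a small worked example of how the unique-to-total word ratio behaves, the sketch below applies the same formula as `specificity_score` to two made-up sentences. The helper name `type_token_ratio` and both sample texts are assumptions chosen for illustration.

```python
def type_token_ratio(text):
    """Unique whitespace-separated words divided by total words (0 for empty text)."""
    words = text.split()
    return len(set(words)) / len(words) if words else 0

short_text = "Supervised learning uses labeled data."
long_text = "The model learns from the data and the model improves as the data grows."

short_ratio = type_token_ratio(short_text)
long_ratio = type_token_ratio(long_text)

print(f"short: {short_ratio:.2f}, long: {long_ratio:.2f}")
```

The short sentence has no repeated words, so its ratio is 1.0, while the longer sentence repeats "the", "model", and "data" and scores lower. The split is case-sensitive and keeps punctuation attached, so "The" and "the" count as different words.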

## Manual evaluation techniques
Although automation helps scale evaluations, human feedback is still important for subjective qualities like clarity, tone, or factuality. The following function supports manual scoring.

In [5]:
def manual_evaluation(prompt, response, criteria):
    """Simulate manual evaluation of a prompt-response pair."""
    print(f"Prompt: {prompt}")
    print(f"Response: {response}")
    print("\nEvaluation Criteria:")

    # Iterate through each manual criterion
    for criterion in criteria:
        score = float(input(f"Score for {criterion} (0-10): "))
        print(f"{criterion}: {score}/10")

    print("\nAdditional Comments:")
    comments = input("Enter any additional comments: ")
    print(f"Comments: {comments}")

# Example usage
prompt = "Explain the concept of machine learning in simple terms."
response = llm.invoke(prompt).content
criteria = ["Clarity", "Accuracy", "Simplicity"]
manual_evaluation(prompt, response, criteria)

Prompt: Explain the concept of machine learning in simple terms.
Response: Machine learning is a type of technology that allows computers to learn from data and improve their performance over time without being explicitly programmed. 

Here’s a simple way to think about it:

1. **Learning from Examples**: Just like how a child learns to recognize animals by looking at pictures and hearing names, a machine learning model learns by being fed lots of examples. For instance, if you show it many pictures of cats and dogs, it learns to distinguish between the two.

2. **Patterns and Predictions**: As it processes these examples, the machine finds patterns or features that help it tell the difference. Once it's learned enough, it can make predictions or decisions based on new data it hasn't seen before. For example, it can look at a new picture and say whether it’s a cat or a dog.

3. **Improvement Over Time**: The more data the machine learns from, the better it gets at making accurate predi

Score for Clarity (0-10):  5
Clarity: 5.0/10
Score for Accuracy (0-10):  5
Accuracy: 5.0/10
Score for Simplicity (0-10):  5
Simplicity: 5.0/10

Additional Comments:
Enter any additional comments:  
Comments: 

The `manual_evaluation` function prints a prompt-response pair, asks the evaluator for a numerical score on each criterion, and collects free-form comments. It would typically be used in a controlled annotation environment. In the example above, the evaluator rates how well the model explains machine learning on clarity, accuracy, and simplicity. Note that the function only prints the scores; it does not return or store them.
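If the manual scores need to be kept for later analysis, one option is a variant that validates each entry and returns the results. The sketch below is a hypothetical helper, not part of the notebook: the name `collect_manual_scores` and the `ask` parameter (which defaults to `input` and can be swapped for scripted answers) are assumptions.

```python
def collect_manual_scores(criteria, ask=input):
    """Ask for a 0-10 score per criterion and return scores and comments as a dict."""
    scores = {}
    for criterion in criteria:
        while True:
            raw = ask(f"Score for {criterion} (0-10): ")
            try:
                score = float(raw)
            except ValueError:
                print("Please enter a number.")
                continue
            if 0 <= score <= 10:
                scores[criterion] = score
                break
            print("Score must be between 0 and 10.")
    comments = ask("Enter any additional comments: ")
    return {"scores": scores, "comments": comments}

# Non-interactive demonstration: scripted answers stand in for a human rater.
scripted = iter(["8", "eleven", "7", "12", "9", "Clear but long."])
result = collect_manual_scores(
    ["Clarity", "Accuracy", "Simplicity"],
    ask=lambda _prompt: next(scripted),
)
print(result)
```

In the demonstration, the invalid entries "eleven" and "12" are rejected and the rater is asked again, so only in-range numeric scores end up in the returned dictionary.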

## Automated evaluation techniques
Next, let’s combine our automated metrics into a single evaluation function that prints and returns the results.

In [6]:
def automated_evaluation(prompt, response, expected_content, n_additional_responses=2):
    """Perform automated evaluation of a prompt-response pair."""
    # Generate multiple responses to assess consistency
    other_responses = [llm.invoke(prompt).content for _ in range(n_additional_responses)]
    # Include the original response
    all_responses = [response] + other_responses

    # Relevance: how close is this response to the expected answer?
    relevance = relevance_score(response, expected_content)

    # Specificity: how non-generic is this response?
    specificity = specificity_score(response)

    # Consistency: mean pairwise similarity across the main response and the N additional ones
    consistency = consistency_score(all_responses)

    print(f"Prompt: {prompt}")
    print(f"Response: {response}")
    print(f"\nRelevance Score: {relevance:.2f}")
    print(f"Specificity Score: {specificity:.2f}")
    print(f"Consistency Score (across {n_additional_responses + 1} responses): {consistency:.2f}")

    return {"relevance": relevance, "specificity": specificity, "consistency": consistency}

# Example usage
prompt = "What are the three main types of machine learning?"
expected_content = "The three main types of machine learning are supervised learning, unsupervised learning, and reinforcement learning."
response = llm.invoke(prompt).content
automated_evaluation(prompt, response, expected_content)

Prompt: What are the three main types of machine learning?
Response: The three main types of machine learning are:

1. **Supervised Learning**: In supervised learning, the model is trained on a labeled dataset, which means that each training example is paired with an output label. The goal is to learn a mapping from inputs to outputs so that the model can predict the output for new, unseen data. Common applications include classification (e.g., spam detection) and regression (e.g., predicting house prices).

2. **Unsupervised Learning**: In unsupervised learning, the model is trained on data without labeled outputs. The goal is to identify patterns or structures within the data. Common techniques include clustering (e.g., grouping similar customers) and dimensionality reduction (e.g., reducing the number of features while retaining important information).

3. **Reinforcement Learning**: In reinforcement learning, an agent learns to make decisions by taking actions in an environment to 

{'relevance': 0.73353064,
 'specificity': 0.6414141414141414,
 'consistency': 0.9613181}

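This function prints the prompt and response, followed by the three scores, and returns the scores as a dictionary for logging or further analysis. Each call makes `n_additional_responses` extra model requests in order to compute consistency.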
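For logging, note that the relevance and consistency values in the output above appear to be NumPy `float32` scalars, which the standard `json` module cannot serialize directly. The following is a minimal sketch of one way to build a loggable record; the helper name `to_log_record` and the score values are made up for illustration.

```python
import json

def to_log_record(prompt, scores):
    """Build a JSON-serializable record, coercing each score to a built-in float."""
    return {"prompt": prompt, **{name: float(value) for name, value in scores.items()}}

# Made-up scores; in the notebook these would come from automated_evaluation().
scores = {"relevance": 0.73, "specificity": 0.64, "consistency": 0.96}
record = to_log_record("What are the three main types of machine learning?", scores)

# One JSON object per line (JSON Lines) is convenient for appending to a log file.
line = json.dumps(record)
print(line)
```

Calling `float()` on each value is harmless for built-in floats and converts NumPy scalars to a type that `json.dumps` accepts.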

## Comparing multiple prompts
To improve prompts, we often test several variations. The following function automates this comparison.

In [7]:
def compare_prompts(prompts, expected_content, n_additional_responses=2):
    """Compare the effectiveness of multiple prompts for the same task."""
    results = []
    for prompt in prompts:
        # Generate an initial response
        response = llm.invoke(prompt).content
        # Evaluate using all metrics
        evaluation = automated_evaluation(prompt, response, expected_content, n_additional_responses=n_additional_responses)
        # Store results with prompt
        results.append({"prompt": prompt, **evaluation})

    # Sort results by relevance score descending
    sorted_results = sorted(results, key=lambda x: x['relevance'], reverse=True)

    # Display all sorted results
    print("Prompt Comparison Results:")
    for i, result in enumerate(sorted_results, 1):
        print(f"\n{i}. Prompt: {result['prompt']}")
        print(f"   Relevance: {result['relevance']:.2f}")
        print(f"   Specificity: {result['specificity']:.2f}")
        print(f"   Consistency: {result['consistency']:.2f}")

    return sorted_results

# Example usage
prompts = [
    "List the types of machine learning.",
    "What are the main categories of machine learning algorithms?",
    "Explain the different approaches to machine learning."
]
expected_content = "The main types of machine learning are supervised learning, unsupervised learning, and reinforcement learning."
compare_prompts(prompts, expected_content)

Prompt: List the types of machine learning.
Response: Machine learning can be broadly categorized into several types based on the nature of the learning process and the types of data used. Here are the main types of machine learning:

1. **Supervised Learning**: In this approach, the model is trained on a labeled dataset, which means that the input data is paired with the correct output. The goal is to learn a mapping from inputs to outputs. Examples include:
   - Classification (e.g., spam detection, image classification)
   - Regression (e.g., predicting house prices, stock prices)

2. **Unsupervised Learning**: This type involves training on data that does not have labeled responses. The model tries to learn the underlying structure or distribution in the data. Examples include:
   - Clustering (e.g., customer segmentation, grouping similar items)
   - Dimensionality Reduction (e.g., PCA, t-SNE for data visualization)
   - Association (e.g., market basket analysis)

3. **Semi-Superv

[{'prompt': 'List the types of machine learning.',
  'relevance': 0.70720327,
  'specificity': 0.5922619047619048,
  'consistency': 0.92151624},
 {'prompt': 'Explain the different approaches to machine learning.',
  'relevance': 0.65206325,
  'specificity': 0.5345794392523364,
  'consistency': 0.89919156},
 {'prompt': 'What are the main categories of machine learning algorithms?',
  'relevance': 0.6494788,
  'specificity': 0.5943661971830986,
  'consistency': 0.9645951}]

This is especially helpful during prompt tuning. The function evaluates each prompt against the same expected content and ranks the prompts by relevance, which helps us decide which phrasing yields the most useful output. Because the ranking uses relevance alone, a prompt can rank first while scoring lower on consistency or specificity than another prompt, as the results above show.
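When more than one metric matters, the results can be re-ranked by a weighted combination of scores. The sketch below is one possible approach, not the notebook's method: the helper `rank_by_composite` and the weights are hypothetical, and the scores are rounded from the comparison output above.

```python
def rank_by_composite(results, weights):
    """Sort result dicts by a weighted sum of their metric scores, best first."""
    def composite(result):
        return sum(weight * result[metric] for metric, weight in weights.items())
    return sorted(results, key=composite, reverse=True)

# Scores rounded from the comparison output above.
results = [
    {"prompt": "List the types of machine learning.",
     "relevance": 0.71, "specificity": 0.59, "consistency": 0.92},
    {"prompt": "Explain the different approaches to machine learning.",
     "relevance": 0.65, "specificity": 0.53, "consistency": 0.90},
    {"prompt": "What are the main categories of machine learning algorithms?",
     "relevance": 0.65, "specificity": 0.59, "consistency": 0.96},
]

# Hypothetical weights; choose values that reflect what matters for the task.
weights = {"relevance": 0.5, "specificity": 0.2, "consistency": 0.3}

ranked = rank_by_composite(results, weights)
for position, result in enumerate(ranked, 1):
    print(f"{position}. {result['prompt']}")
```

With these weights, the "main categories" prompt moves from third to second place because of its higher consistency and specificity, while the relevance-only ranking placed it last.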

## Putting it all together
Lastly, let’s combine both automated and manual evaluations into a unified function that gives us a comprehensive prompt evaluation.

In [8]:
def evaluate_prompt(prompt, expected_content, manual_criteria=("Clarity", "Accuracy", "Relevance")):
    """Perform a comprehensive evaluation of a prompt using both manual and automated techniques."""
    response = llm.invoke(prompt).content

    print("Automated Evaluation:")
    auto_results = automated_evaluation(prompt, response, expected_content)

    print("\nManual Evaluation:")
    manual_evaluation(prompt, response, manual_criteria)

    return {"prompt": prompt, "response": response, **auto_results}

# Example usage
prompt = "Explain the concept of overfitting in machine learning."
expected_content = "Overfitting occurs when a model learns the training data too well, including its noise and fluctuations, leading to poor generalization on new, unseen data."
evaluate_prompt(prompt, expected_content)

Automated Evaluation:
Prompt: Explain the concept of overfitting in machine learning.
Response: Overfitting is a common problem in machine learning where a model learns to capture the noise and details of the training data to an extent that it performs poorly on new, unseen data. Essentially, the model becomes too complex and tailored to the training dataset, which limits its ability to generalize to other datasets.

### Key Points about Overfitting:

1. **Complexity of the Model**: Overfitting often occurs when a model is too complex relative to the amount of training data available. For instance, deep neural networks with many layers can fit very intricate patterns in the training data, including noise that doesn't represent the underlying distribution.

2. **Training vs. Validation Performance**: In overfitting, a model shows excellent performance (low error) on the training data but significantly poorer performance on validation or test datasets. This disparity indicates that the m

Score for Clarity (0-10):  5
Clarity: 5.0/10
Score for Accuracy (0-10):  5
Accuracy: 5.0/10
Score for Relevance (0-10):  5
Relevance: 5.0/10

Additional Comments:
Enter any additional comments:  
Comments: 

{'prompt': 'Explain the concept of overfitting in machine learning.',
 'response': "Overfitting is a common problem in machine learning where a model learns to capture the noise and details of the training data to an extent that it performs poorly on new, unseen data. Essentially, the model becomes too complex and tailored to the training dataset, which limits its ability to generalize to other datasets.\n\n### Key Points about Overfitting:\n\n1. **Complexity of the Model**: Overfitting often occurs when a model is too complex relative to the amount of training data available. For instance, deep neural networks with many layers can fit very intricate patterns in the training data, including noise that doesn't represent the underlying distribution.\n\n2. **Training vs. Validation Performance**: In overfitting, a model shows excellent performance (low error) on the training data but significantly poorer performance on validation or test datasets. This disparity indicates that the model h

This approach is useful as a final check before deploying a prompt in production. It combines automated metrics with subjective human assessment, giving a more nuanced picture of prompt behavior than either provides alone.