# Prompt optimization techniques

This notebook explores practical methods for improving prompt effectiveness when working with LLMs. We focus on two main techniques: A/B testing and iterative refinement. These strategies help us fine-tune prompt phrasing to yield better, more relevant responses from the model.

Crafting the right prompt matters: how we ask a question can dramatically change the quality and usefulness of the output. The goal here is to learn how to evaluate and improve prompts systematically, much as we would tune hyperparameters in a machine learning model.

In [1]:
import os
import re

from langchain_openai import ChatOpenAI
from langchain.prompts import PromptTemplate
import numpy as np
from dotenv import load_dotenv

# Load environment variables
load_dotenv()

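# Fail early if the OpenAI API key is missing (os.environ cannot store None)
if not os.getenv("OPENAI_API_KEY"):
    raise RuntimeError("OPENAI_API_KEY is not set; add it to your environment or .env file.")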

### Initialize the language model
We instantiate a lightweight GPT model from OpenAI using LangChain.

In [2]:
# Initialize the language model
llm = ChatOpenAI(model="gpt-4o-mini-2024-07-18")

## A/B testing prompts
A/B testing allows us to compare different versions of a prompt to evaluate which performs better for a specific task. The idea is to isolate variations in phrasing and measure how those variations affect the response quality.

In [3]:
# Define prompt variation A
prompt_a = PromptTemplate(
    input_variables=["topic"],
    template="Explain {topic} in simple terms."
)

# Define prompt variation B
prompt_b = PromptTemplate(
    input_variables=["topic"],
    template="Provide a beginner-friendly explanation of {topic}, including key concepts and an example."
)

We define two alternative prompts that aim to explain the same topic. Prompt A is very general, while Prompt B adds a request for structure and an example, which may help guide the model toward more informative answers. We use LangChain’s `PromptTemplate` to define reusable prompt formats. Both prompts accept a `topic` variable and ask the model to explain it.

#### Evaluation function
We need a way to evaluate the responses generated by each prompt. We will do this by scoring them based on several criteria: clarity, informativeness, and engagement.

In [4]:
# Function to evaluate response quality
def evaluate_response(response, criteria):
    """Evaluate the quality of a response based on given criteria.

    Args:
        response (str): The generated response.
        criteria (list): List of criteria to evaluate.

    Returns:
        float: The average score across all criteria.
    """
    scores = []
    for criterion in criteria:
        print(f"Evaluating response based on {criterion}...")
        # Ask the model to rate the response
        prompt = f"On a scale of 1-10, rate the following response on {criterion}. Start your response with the numeric score:\n\n{response}"
        # Keep the rating in its own variable so the original response is not overwritten
        rating = llm.invoke(prompt).content
        # Use regex to find the first number in the rating
        score_match = re.search(r'\d+', rating)
        if score_match:
            score = int(score_match.group())
            scores.append(max(1, min(score, 10)))  # Clamp score to the 1-10 range
        else:
            print(f"Warning: Could not extract numeric score for {criterion}. Using default score of 5.")
            scores.append(5)  # Default score if no number is found
    return np.mean(scores)

For each response, we generate follow-up prompts asking the model to self-evaluate its output against a given criterion. This approach gives us a semi-automated evaluation loop using the model itself for subjective scoring. We extract the numerical score using regex and calculate the average across all criteria.
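The score-extraction step can be isolated and tested without calling the model. The following is a minimal sketch, not part of the original notebook: the helper name `parse_score` and its `low`, `high`, and `default` parameters are assumptions chosen for illustration. Unlike the regex above, it also accepts decimal ratings such as `8.5` and clamps the result at both ends of the scale.

```python
import re


def parse_score(rating_text, low=1.0, high=10.0, default=5.0):
    """Extract the first number from a rating reply and clamp it to [low, high].

    Hypothetical helper: returns `default` when the reply contains no number.
    """
    match = re.search(r"\d+(?:\.\d+)?", rating_text)
    if match is None:
        return float(default)
    value = float(match.group())
    # Clamp so that out-of-range replies cannot skew the average
    return float(max(low, min(value, high)))


# Example replies a rater model might plausibly produce
for reply in ["8/10 - clear and well organized", "8.5 out of 10", "Score: 0", "No rating given"]:
    print(f"{reply!r} -> {parse_score(reply)}")
```

Falling back to a default keeps the loop running, but a missing score is arguably worth logging, as the notebook's warning message does.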

#### Running the A/B test on an example
Now we can plug in a topic and compare both prompt versions side-by-side.

In [5]:
# Define the topic to explain
topic = "machine learning"

# Generate responses for both prompt versions
response_a = llm.invoke(prompt_a.format(topic=topic)).content
response_b = llm.invoke(prompt_b.format(topic=topic)).content

# Define evaluation criteria
criteria = ["clarity", "informativeness", "engagement"]

# Evaluate both responses
score_a = evaluate_response(response_a, criteria)
score_b = evaluate_response(response_b, criteria)

# Print results
print(f"Prompt A score: {score_a:.2f}")
print(f"Prompt B score: {score_b:.2f}")
print(f"Winning prompt: {'A' if score_a > score_b else 'B'}")

Evaluating response based on clarity...
Evaluating response based on informativeness...
Evaluating response based on engagement...
Evaluating response based on clarity...
Evaluating response based on informativeness...
Evaluating response based on engagement...
Prompt A score: 9.00
Prompt B score: 9.00
Winning prompt: B


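We run both prompts with the same topic and generate responses. These are then scored based on our criteria, and the scores are compared. In the run shown, both prompts score 9.00; because the comparison uses `score_a > score_b`, the tie is reported as a win for B. A single run like this gives only a rough indication of which prompt leads to better output.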
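Since model outputs and ratings vary between calls, one way to make the comparison less fragile is to score each prompt several times and treat small gaps as a tie. The sketch below illustrates that idea under stated assumptions: `pick_winner` and its `min_margin` threshold are hypothetical, and the score lists are made-up illustration values, not results from this notebook.

```python
from statistics import mean


def pick_winner(scores_a, scores_b, min_margin=0.5):
    """Compare mean scores from repeated runs; report a tie when the gap is small.

    `min_margin` is an arbitrary illustrative threshold, not a statistical test.
    """
    mean_a, mean_b = mean(scores_a), mean(scores_b)
    if abs(mean_a - mean_b) < min_margin:
        return "tie"
    return "A" if mean_a > mean_b else "B"


# Made-up scores standing in for five repeated evaluate_response() runs per prompt
runs_a = [9.0, 8.67, 9.0, 8.33, 9.0]
runs_b = [9.0, 9.0, 8.67, 9.0, 8.67]
print(pick_winner(runs_a, runs_b))
```

In practice, each list would be filled by calling `evaluate_response` on freshly generated responses, at the cost of more API calls.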

## Iterative refinement of prompts
Now that we have tested which prompt performs better, let’s say we want to improve it further. Iterative refinement lets us gradually improve a prompt by analyzing its outputs and generating suggestions for enhancement.

In [6]:
def refine_prompt(initial_prompt, topic, iterations=3):
    """Refine a prompt through multiple iterations.

    Args:
        initial_prompt (PromptTemplate): The starting prompt template.
        topic (str): The topic to explain.
        iterations (int): Number of refinement iterations.

    Returns:
        PromptTemplate: The final refined prompt template.
    """
    current_prompt = initial_prompt
    for i in range(iterations):
        try:
            # Generate response using current prompt
            response = llm.invoke(current_prompt.format(topic=topic)).content
        except KeyError as e:
            print(f"Error in iteration {i+1}: Missing key {e}. Adjusting prompt...")
            # Replace the unexpected placeholder with literal text
            current_prompt.template = current_prompt.template.replace(f"{{{e.args[0]}}}", "relevant example")
            response = llm.invoke(current_prompt.format(topic=topic)).content

        # Ask the model how to improve the prompt
        feedback_prompt = f"Analyze the following explanation of {topic} and suggest improvements to the prompt that generated it:\n\n{response}"
        feedback = llm.invoke(feedback_prompt).content

        # Use model's feedback to refine the prompt
        # (named refinement_request to avoid shadowing the refine_prompt function)
        refinement_request = f"Based on this feedback: '{feedback}', improve the following prompt template. Ensure to only use the variable {{topic}} in your template:\n\n{current_prompt.template}"
        refined_template = llm.invoke(refinement_request).content

        # Create a new prompt template from refined result
        current_prompt = PromptTemplate(
            input_variables=["topic"],
            template=refined_template
        )

        print(f"Iteration {i+1} prompt: {current_prompt.template}")

    return current_prompt

This function takes a prompt and iteratively improves it. After each round, it asks the model for feedback on how to improve the prompt based on the generated response. Then it uses that feedback to create a refined version of the prompt, repeating this loop for a specified number of iterations. This loop mimics how we would improve a prompt manually, but automates it with the LLM's help.
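Because the model's reply is used directly as the next template, it may introduce placeholders other than `{topic}`, which is what the `KeyError` handler guards against. A lighter-weight alternative is to validate a candidate template before adopting it. The sketch below assumes the templates use Python `str.format`-style placeholders, as the ones in this notebook do; the helper names `placeholder_names` and `is_valid_template` are hypothetical.

```python
from string import Formatter


def placeholder_names(template):
    """Return the set of placeholder names in a str.format-style template."""
    return {
        name
        for _, name, _, _ in Formatter().parse(template)
        if name is not None  # None means a literal-only chunk with no field
    }


def is_valid_template(template, required="topic", allowed=("topic",)):
    """Check that the template uses `required` and nothing outside `allowed`."""
    try:
        names = placeholder_names(template)
    except ValueError:
        # Raised for unbalanced braces such as "Explain {topic"
        return False
    return required in names and names <= set(allowed)


print(is_valid_template("Explain {topic} in simple terms."))
print(is_valid_template("Explain {topic} using {example}."))
```

A refinement loop could skip or retry any candidate for which `is_valid_template` returns `False` instead of patching the template after the fact.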

#### Applying iterative refinement to the best prompt on an example
We use the prompt that performed better in the A/B test and apply our refinement process.

In [7]:
# Perform A/B test
topic = "machine learning"
# Generate responses for both prompt versions
response_a = llm.invoke(prompt_a.format(topic=topic)).content
response_b = llm.invoke(prompt_b.format(topic=topic)).content

# Define evaluation criteria
criteria = ["clarity", "informativeness", "engagement"]
# Evaluate both responses
score_a = evaluate_response(response_a, criteria)
score_b = evaluate_response(response_b, criteria)

print(f"Prompt A score: {score_a:.2f}")
print(f"Prompt B score: {score_b:.2f}")
print(f"Winning prompt: {'A' if score_a > score_b else 'B'}")

# Start with the winning prompt from A/B testing
initial_prompt = prompt_b if score_b > score_a else prompt_a

# Refine the prompt through multiple iterations
refined_prompt = refine_prompt(initial_prompt, "machine learning")

print("\nFinal refined prompt:")
print(refined_prompt.template)

Evaluating response based on clarity...
Evaluating response based on informativeness...
Evaluating response based on engagement...
Evaluating response based on clarity...
Evaluating response based on informativeness...
Evaluating response based on engagement...
Prompt A score: 9.00
Prompt B score: 8.00
Winning prompt: A
Iteration 1 prompt: Here’s an improved prompt template that incorporates the suggestions provided while maintaining the variable {topic}:

---

"Explain {topic} in simple terms, including its different aspects such as key categories (e.g., supervised, unsupervised, reinforcement learning), real-world applications (like healthcare, finance, and autonomous vehicles), and the significance of data quality. Describe the fundamental learning process of algorithms, touching on how they adjust based on feedback. Additionally, highlight some limitations or challenges associated with {topic}. Use a relatable analogy to clarify the concept, and consider incorporating relevant tech

We rerun the A/B test on prompts A and B, select the winner, and feed it through the refinement loop to improve it further. Scores vary between runs: here Prompt A wins with 9.00 against 8.00, whereas the earlier run produced a 9.00 tie. The iteration output also shows that the model's entire reply, including its introductory sentence, becomes the new template. We can apply this same loop to any kind of prompt: summarization, translation, extraction, etc.

#### Comparing original and refined prompts
Finally, let’s compare the actual outputs generated by the original and refined prompts side by side using the same evaluation method as before.

In [8]:
# Generate responses using both prompts
original_response = llm.invoke(initial_prompt.format(topic="machine learning")).content
refined_response = llm.invoke(refined_prompt.format(topic="machine learning")).content

# Score both responses
original_score = evaluate_response(original_response, criteria)
refined_score = evaluate_response(refined_response, criteria)

# Show final comparison
print(f"Original prompt score: {original_score:.2f}")
print(f"Refined prompt score: {refined_score:.2f}")
print(f"Improvement: {(refined_score - original_score):.2f} points")

Evaluating response based on clarity...
Evaluating response based on informativeness...
Evaluating response based on engagement...
Evaluating response based on clarity...
Evaluating response based on informativeness...
Evaluating response based on engagement...
Original prompt score: 9.00
Refined prompt score: 9.00
Improvement: 0.00 points


We measure the uplift achieved by our refinement strategy using the same scoring criteria. A higher refined score would suggest that the iteration loop helps; in this run both prompts score 9.00, so the refinement produced no measurable improvement.
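One possible variation is to score each candidate inside the loop and keep it only when it beats the current best, so that refinement cannot end on a worse prompt. The sketch below shows that greedy pattern with the model calls abstracted away: `refine_keep_best`, `propose`, and `score` are hypothetical names, and the toy stand-ins exist only so the example runs offline. They say nothing about real prompt quality.

```python
def refine_keep_best(initial_prompt, propose, score, iterations=3):
    """Greedy refinement: adopt a candidate only if it scores strictly higher.

    `propose(prompt)` returns a candidate prompt and `score(prompt)` returns a
    number; both are injected so they can wrap LLM calls or simple stubs.
    """
    best_prompt, best_score = initial_prompt, score(initial_prompt)
    for _ in range(iterations):
        candidate = propose(best_prompt)
        candidate_score = score(candidate)
        if candidate_score > best_score:
            best_prompt, best_score = candidate, candidate_score
    return best_prompt, best_score


# Toy stand-ins (assumptions for illustration only)
def toy_propose(prompt):
    return prompt + " Include an example."


def toy_score(prompt):
    # Word count capped at 10: a placeholder, not a meaningful quality metric
    return min(10, len(prompt.split()))


best, best_score = refine_keep_best("Explain {topic}.", toy_propose, toy_score, iterations=2)
print(best_score, best)
```

With real calls, `propose` would wrap the feedback and rewrite steps of `refine_prompt`, and `score` would generate a response and pass it to `evaluate_response`.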