Evaluator-optimizer

## Workflow: **Evaluator-optimizer**

The evaluator-optimizer workflow uses iterative refinement to make sure task requirements are fully met. One LLM call (the generator) performs the task, and a second LLM call (the evaluator) checks whether the result satisfies all specified criteria. If it does not, the generator tries again using the evaluator's feedback, and the loop repeats until the evaluator confirms that all requirements are met or an iteration limit is reached.

![evaluator-optimizer](resources/evaluator-optimizer.jpg)
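
The loop described above can be sketched without any model calls. This is a minimal illustration, not the notebook's implementation: `refine`, `toy_generate`, and `toy_evaluate` are hypothetical names, and the two toy functions stand in for the generator and evaluator LLM calls.

```python
from typing import Callable, Optional


def refine(
    task: str,
    generate_fn: Callable[[str, Optional[str]], str],
    evaluate_fn: Callable[[str, str], tuple[bool, str]],
    max_rounds: int = 3,
) -> tuple[str, bool]:
    """Run up to max_rounds generate/evaluate rounds; return (last_draft, passed)."""
    feedback: Optional[str] = None
    draft = ""
    for _ in range(max_rounds):
        draft = generate_fn(task, feedback)              # generator step
        passed, feedback = evaluate_fn(task, draft)      # evaluator step
        if passed:
            return draft, True
    return draft, False


# Hypothetical stand-ins for the two LLM calls.
def toy_generate(task: str, feedback: Optional[str]) -> str:
    # Produce a shorter draft only once feedback has been received.
    if feedback:
        return "A unicorn slept."
    return "A unicorn wandered far and then finally slept."


def toy_evaluate(task: str, draft: str) -> tuple[bool, str]:
    # Single criterion for the sketch: five words or fewer.
    if len(draft.split()) <= 5:
        return True, ""
    return False, "Use five words or fewer."


print(refine("bedtime story", toy_generate, toy_evaluate))
```

Returning a `passed` flag alongside the draft lets the caller tell an accepted result from one that merely ran out of rounds.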

## **Use cases:**


- Generating code that meets specific requirements, such as a target runtime complexity.
- Searching for information and using an evaluator to verify that the results include all the required details.
- Writing a story or article with specific tone or style requirements and using an evaluator to ensure the output matches the desired criteria, such as adhering to a particular voice or narrative structure.
- Generating structured data from unstructured input and using an evaluator to verify that the data is properly formatted, complete, and consistent.
- Creating user interface text, like tooltips or error messages, and using an evaluator to confirm the text is concise, clear, and contextually appropriate.
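
For the structured-data use case, part of the evaluation can sometimes be done with ordinary code before (or instead of) asking a second LLM. The sketch below is one possible shape for such a check. The schema in `REQUIRED_FIELDS` and the function name `evaluate_record` are assumptions made for illustration, and the `PASS`/`FAIL` status strings simply mirror the convention used later in this notebook.

```python
import json

# Hypothetical schema: field name -> expected Python type.
REQUIRED_FIELDS = {"name": str, "age": int}


def evaluate_record(raw: str) -> tuple[str, str]:
    """Return ("PASS", "") or ("FAIL", feedback) for a JSON string."""
    try:
        record = json.loads(raw)
    except json.JSONDecodeError as exc:
        return "FAIL", f"Output is not valid JSON: {exc.msg}"

    if not isinstance(record, dict):
        return "FAIL", "Top-level value must be a JSON object."

    # Collect every problem so the generator gets complete feedback in one round.
    problems = []
    for field, expected in REQUIRED_FIELDS.items():
        if field not in record:
            problems.append(f"missing field '{field}'")
        elif not isinstance(record[field], expected):
            problems.append(f"field '{field}' should be {expected.__name__}")

    if problems:
        return "FAIL", "; ".join(problems)
    return "PASS", ""


print(evaluate_record('{"name": "Ada"}'))
```

The feedback string can be passed back to the generator in the same way as feedback from an LLM evaluator.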

In [None]:
%pip install --upgrade openai

In [None]:
import getpass
import json
import os  # needed by _set_env below
from openai import OpenAI

def _set_env(var: str):
    if not os.environ.get(var):
        os.environ[var] = getpass.getpass(f"{var}: ")

_set_env("OPENAI_API_KEY")

In [3]:
from openai import OpenAI
client = OpenAI()

def run_llm(user_prompt: str, model: str = 'gpt-4o-mini', system_prompt: str = None):
    messages = []
    if system_prompt:
        messages.append({"role": "developer", "content": system_prompt})
    messages.append({"role": "user", "content": user_prompt})

    completion = client.chat.completions.create(
        model=model,
        messages=messages
    )
    return completion.choices[0].message.content

run_llm('what is the meaning of life?')

'The meaning of life is a deeply philosophical question and can vary greatly from person to person. Some people find meaning through relationships, love, and community, while others seek it in personal achievement, knowledge, or spiritual beliefs. Philosophers, religious leaders, and thinkers throughout history have proposed various interpretations, ranging from existentialism, which suggests we create our own meaning, to religious viewpoints that posit a divinely ordained purpose. Ultimately, the meaning of life may be something each individual must explore and define for themselves based on their beliefs, experiences, and values.'

In [4]:
import re

task = """
Write a one-sentence bedtime story about a unicorn, for a five year old girl
"""

GENERATOR_PROMPT = """
Your goal is to complete the task based on <user input>. If there is feedback
from your previous generations, you should reflect on it to improve your solution.

Output your answer concisely in the following format:

Thoughts:
[Your understanding of the task and feedback and how you plan to improve]

Response:
[Your response here]
"""

def generate(task: str, generator_prompt: str, context: str = "") -> str:
    """Generate and improve a solution based on feedback."""
    full_prompt = f"{generator_prompt}\n{context}\nTask: {task}" if context else f"{generator_prompt}\nTask: {task}"

    response = run_llm(full_prompt)

    print("\n## Generation start")
    print(f"Output:\n{response}\n")

    return response


EVALUATOR_PROMPT = """
Evaluate the following response for:
1. age appropriateness
2. length of ten words or fewer
3. style and best practices

You should be evaluating only and not attempting to solve the task.

Only output "PASS" if all criteria are met and you have no further suggestions for improvements.

Provide detailed feedback if there are areas that need improvement. You should specify what needs improvement and why.

Return in this format:
Status: [PASS/FAIL]
Feedback: [Your feedback here]
"""

def evaluate(task: str, evaluator_prompt: str, generated_content: str) -> tuple[str, str]:
    """Evaluate if a solution meets requirements"""

    full_prompt = f"{evaluator_prompt}\nOriginal task: {task}\nContent to evaluate: {generated_content}"

    response = run_llm(full_prompt)

    status_match = re.search(r"Status:\s*(.*?)(?:\n|$)", response, re.IGNORECASE)
    feedback_match = re.search(r"Feedback:\s*([\s\S]*)", response, re.IGNORECASE)

    if status_match is None or feedback_match is None:
        raise ValueError("Could not parse evaluation response. Expected format: 'Status: [PASS/FAIL]\\nFeedback: [feedback]'")

    evaluation = status_match.group(1).strip()
    feedback = feedback_match.group(1).strip()

    print("## Evaluation start")
    print(f"Status: {evaluation}")
    print(f"Feedback: {feedback}")

    return evaluation, feedback

def loop_workflow(task: str, generator_prompt: str, evaluator_prompt: str, context: str = "") -> str:
    """Generate and evaluate until the evaluator passes a response or the iteration budget runs out."""
    memory = []

    # Honor any caller-supplied context on the first attempt.
    response = generate(task, generator_prompt, context)
    memory.append(response)

    max_iterations = 5
    while max_iterations > 0:
        evaluation, feedback = evaluate(task, evaluator_prompt, response)

        if evaluation.upper() == "PASS":
            return response

        # Give the generator every earlier attempt plus the latest feedback.
        context = "\n".join([
            "Previous attempts:",
            *[f"- {m}" for m in memory],
            f"\nFeedback: {feedback}"
        ])
        response = generate(task, generator_prompt, context)
        memory.append(response)

        max_iterations -= 1

    # Budget exhausted: return the latest (not yet evaluated) attempt instead of None.
    return response

loop_workflow(task, GENERATOR_PROMPT, EVALUATOR_PROMPT)


## Generation start
Output:
Thoughts:
I understand that the task requires creating a simple and enchanting bedtime story suitable for a five-year-old girl, and I'm aiming to make it magical yet easy to understand. I need to ensure that the sentence is engaging and evoking a sense of wonder to inspire sweet dreams.

Response:
Once upon a time, a gentle unicorn named Sparkle danced through a rainbow forest, spreading glittering dreams and laughter wherever she went, making every little girl’s wishes come true.

## Evaluation start
Status: FAIL
Feedback: The response is inappropriate for several reasons. First, it exceeds the ten-word limit significantly, which is a key requirement for the task. Second, while the content is age-appropriate for a five-year-old, it would need to be simplified further to meet the word count stipulation. A more concise sentence that captures the magical essence while adhering to the stated limit is essential for aligning with best practices for young childre