## Evaluator-Optimizer Workflow
In this workflow, one LLM call generates a response while another provides evaluation and feedback in a loop.

### When to use this workflow
This workflow is particularly effective when we have:

- Clear evaluation criteria
- Value from iterative refinement

The two signs of good fit are:

- LLM responses can be demonstrably improved when feedback is provided
- The LLM can provide meaningful feedback itself

In [1]:
#!pip install anthropic

### Get API Key

Buy $5 of credits and create an API key in the Anthropic console:
https://console.anthropic.com/dashboard

In [2]:
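# Set the environment variable. Needed by util.py

import os

# Assigning to os.environ makes the key visible to code that reads the environment.
os.environ["ANTHROPIC_API_KEY"] = "xxx"  # replace "xxx" with your own key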
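
The next cell imports `llm_call` and `extract_xml` from `util.py`, which is not reproduced in this notebook. To give a rough idea of what the parsing helper might do, here is a minimal sketch of a tag extractor built on a regular expression. The name `extract_xml_sketch` and its behavior (returning an empty string when the tag is missing) are assumptions for illustration, not the actual contents of `util.py`.

```python
import re


def extract_xml_sketch(text: str, tag: str) -> str:
    """Return the content of the first <tag>...</tag> pair in text, or "" if absent.

    Hypothetical stand-in for util.extract_xml; the real helper may differ.
    """
    # re.DOTALL lets "." match newlines, so multi-line content is captured.
    pattern = rf"<{re.escape(tag)}>(.*?)</{re.escape(tag)}>"
    match = re.search(pattern, text, re.DOTALL)
    return match.group(1) if match else ""


sample = "<evaluation>PASS</evaluation>\n<feedback>\nLooks good.\n</feedback>"
print(extract_xml_sketch(sample, "evaluation"))
print(repr(extract_xml_sketch(sample, "feedback")))
```

Note that a helper like this keeps any whitespace inside the tags, which matters later when the evaluation status is compared against the literal string `PASS`.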

In [3]:
from util import llm_call, extract_xml

def generate(prompt: str, task: str, context: str = "") -> tuple[str, str]:
    """Generate and improve a solution based on feedback."""
    full_prompt = f"{prompt}\n{context}\nTask: {task}" if context else f"{prompt}\nTask: {task}"
    response = llm_call(full_prompt)
    thoughts = extract_xml(response, "thoughts")
    result = extract_xml(response, "response")
    
    print("\n=== GENERATION START ===")
    print(f"Thoughts:\n{thoughts}\n")
    print(f"Generated:\n{result}")
    print("=== GENERATION END ===\n")
    
    return thoughts, result

def evaluate(prompt: str, content: str, task: str) -> tuple[str, str]:
    """Evaluate if a solution meets requirements."""
    full_prompt = f"{prompt}\nOriginal task: {task}\nContent to evaluate: {content}"
    response = llm_call(full_prompt)
    evaluation = extract_xml(response, "evaluation")
    feedback = extract_xml(response, "feedback")
    
    print("=== EVALUATION START ===")
    print(f"Status: {evaluation}")
    print(f"Feedback: {feedback}")
    print("=== EVALUATION END ===\n")
    
    return evaluation, feedback

def loop(task: str, evaluator_prompt: str, generator_prompt: str) -> tuple[str, list[dict]]:
    """Keep generating and evaluating until requirements are met."""
    memory = []
    chain_of_thought = []
    
    thoughts, result = generate(generator_prompt, task)
    memory.append(result)
    chain_of_thought.append({"thoughts": thoughts, "result": result})
    
    while True:
        evaluation, feedback = evaluate(evaluator_prompt, result, task)
        # Tolerate stray whitespace around the status returned by the evaluator.
        if evaluation.strip() == "PASS":
            return result, chain_of_thought
            
        context = "\n".join([
            "Previous attempts:",
            *[f"- {m}" for m in memory],
            f"\nFeedback: {feedback}"
        ])
        
        thoughts, result = generate(generator_prompt, task, context)
        memory.append(result)
        chain_of_thought.append({"thoughts": thoughts, "result": result})
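
The `while True` loop above only exits when the evaluator returns `PASS`, so a strict evaluator can keep it running (and spending API credits) indefinitely. One possible safeguard is an iteration cap. The sketch below is a variant under stated assumptions, not part of the original workflow: `bounded_loop`, `max_rounds`, and the injected `generate_fn` / `evaluate_fn` callables are hypothetical names, and the two stub functions stand in for real LLM calls so the example runs offline.

```python
from typing import Callable


def bounded_loop(
    task: str,
    generate_fn: Callable[[str, str], str],
    evaluate_fn: Callable[[str, str], tuple[str, str]],
    max_rounds: int = 3,
) -> tuple[str, bool, list[str]]:
    """Generate and evaluate until PASS or until max_rounds attempts were made.

    Returns (last_result, passed, all_attempts).
    """
    attempts: list[str] = []
    context = ""
    result = ""
    for _ in range(max_rounds):
        result = generate_fn(task, context)
        attempts.append(result)
        status, feedback = evaluate_fn(result, task)
        if status.strip() == "PASS":
            return result, True, attempts
        # Feed every earlier attempt plus the latest feedback into the next round.
        context = "\n".join(
            ["Previous attempts:", *[f"- {a}" for a in attempts], f"\nFeedback: {feedback}"]
        )
    return result, False, attempts


# Stubs replacing the LLM calls (hypothetical, for offline illustration only).
def fake_generate(task: str, context: str) -> str:
    return "draft v2" if "Feedback" in context else "draft v1"


def fake_evaluate(content: str, task: str) -> tuple[str, str]:
    if content == "draft v2":
        return "PASS", ""
    return "NEEDS_IMPROVEMENT", "Handle the empty case."


final, passed, history = bounded_loop("demo task", fake_generate, fake_evaluate)
print(final, passed, history)
```

Returning a `passed` flag alongside the last attempt lets the caller decide whether an unapproved result is still usable, rather than silently treating it as accepted.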

### Example Use Case: Iterative coding loop



In [5]:
evaluator_prompt = """
Evaluate the following code implementation for:
1. code correctness
2. time complexity
3. style and best practices

You should be evaluating only and not attempting to solve the task.
Only output "PASS" if all criteria are met and you have no further suggestions for improvements.
Output your evaluation concisely in the following format.

<evaluation>PASS, NEEDS_IMPROVEMENT, or FAIL</evaluation>
<feedback>
What needs improvement and why.
</feedback>
"""

generator_prompt = """
Your goal is to complete the task based on <user input>. If there is feedback 
from your previous generations, you should reflect on it to improve your solution.

Output your answer concisely in the following format: 

<thoughts>
[Your understanding of the task and feedback and how you plan to improve]
</thoughts>

<response>
[Your code implementation here]
</response>
"""

task = """
<user input>
Implement a Stack with:
1. push(x)
2. pop()
3. getMin()
All operations should be O(1).
</user input>
"""

# Note: this assignment overrides the Stack task above, so only this task is run.
task = """
<user input>
Implement a python program to retrieve stock data.
1. Use yfinance to acquire data for QQQ
2. Plot the results with plotly
All operations should be O(1).
</user input>
"""

loop(task, evaluator_prompt, generator_prompt)



=== GENERATION START ===
Thoughts:

I understand I need to create a program that:
1. Uses yfinance library to get QQQ stock data
2. Visualizes it using plotly
3. Ensures operations are O(1) time complexity
4. Keep code concise and efficient

The main operations (API call and plotting) are inherently O(1) as they're single operations regardless of data size.


Generated:

```python
import yfinance as yf
import plotly.graph_objects as go

# Get QQQ data - O(1) operation
qqq = yf.Ticker("QQQ")
data = qqq.history(period="1y")

# Create plot - O(1) operation
fig = go.Figure(data=[go.Candlestick(x=data.index,
                open=data['Open'],
                high=data['High'],
                low=data['Low'],
                close=data['Close'])])

fig.update_layout(title='QQQ Stock Price',
                 yaxis_title='Price (USD)',
                 xaxis_title='Date')

fig.show()
```

=== GENERATION END ===

=== EVALUATION START ===
Status: NEEDS_IMPROVEMENT
Feedback: 
While the code is 

 [{'thoughts': "\nI understand I need to create a program that:\n1. Uses yfinance library to get QQQ stock data\n2. Visualizes it using plotly\n3. Ensures operations are O(1) time complexity\n4. Keep code concise and efficient\n\nThe main operations (API call and plotting) are inherently O(1) as they're single operations regardless of data size.\n",
   'result': '\n```python\nimport yfinance as yf\nimport plotly.graph_objects as go\n\n# Get QQQ data - O(1) operation\nqqq = yf.Ticker("QQQ")\ndata = qqq.history(period="1y")\n\n# Create plot - O(1) operation\nfig = go.Figure(data=[go.Candlestick(x=data.index,\n                open=data[\'Open\'],\n                high=data[\'High\'],\n                low=data[\'Low\'],\n                close=data[\'Close\'])])\n\nfig.update_layout(title=\'QQQ Stock Price\',\n                 yaxis_title=\'Price (USD)\',\n                 xaxis_title=\'Date\')\n\nfig.show()\n```\n'},
  {'thoughts': "\nBased on the feedback, I understand that:\n1. The O(1) 