<a href="https://colab.research.google.com/github/micah-shull/AI_Agents/blob/main/094_Testing_Agent_Designs_Through_Conversation_Simulation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Testing Agent Designs Through Conversation Simulation

The big focus for this lecture is **testing your GAME design through simulation before coding** — basically a “dress rehearsal” for your agent.

Here’s what you should concentrate on:

---

### **Key Concepts to Learn**

1. **Simulation Before Implementation**

   * You’re verifying that your *Goals, Actions, Memory, and Environment* are actually sufficient to achieve the intended results.
   * This is the low-cost, low-risk phase to discover issues early.

2. **Structured Simulation Setup**

   * Start with a clear prompt that defines:

     * **Goals**: What the agent is trying to achieve.
     * **Actions**: The tools it has available.
   * Tell the LLM to output only the next action and wait for you to simulate the result.

3. **Observing Agent Reasoning**

   * Watch the order in which it uses tools.
   * Experiment with how much context (metadata, extra structure) helps it make better decisions.

4. **Refining Goals and Tools**

   * Use the simulation to spot vague instructions or unclear tool descriptions.
   * Update them until the agent behaves as intended.

5. **Memory in Simulation**

   * The chat naturally mirrors the agent’s message list memory.
   * You can see how well it keeps context as history grows.

6. **Learning from Failures**

   * Introduce errors, malformed data, or missing information to see how it recovers.
   * This informs how you’ll build **robust error handling** in the real agent.

7. **Termination Testing**

   * Ensure your agent knows when to stop — simulate different stopping conditions.

8. **Rapid Iteration**

   * You can test dozens of scenarios in minutes without writing a line of “real” code.

9. **Agent Reflection**

   * Ask the LLM what tools, details, or rules it wishes it had — often it will point out gaps you missed.

10. **Example Library Building**

    * Save both **good** and **bad** simulation runs.
    * Later, these become *training material* for your agent prompts and test cases.
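
Item 9 (Agent Reflection) can be as simple as one extra user turn appended to the finished conversation. The sketch below is only an illustration: `REFLECTION_PROMPT` and `with_reflection` are hypothetical names, the wording of the question is an assumption, and the history is assumed to be a list of role/content dicts like the one the simulation keeps.

```python
# Hypothetical reflection turn; the exact wording is an assumption, adjust freely.
REFLECTION_PROMPT = (
    "The simulation is over. Looking back at this run: which tools, "
    "result fields, or rules did you wish you had? List concrete gaps."
)


def with_reflection(messages: list[dict]) -> list[dict]:
    """Return a copy of the history with the reflection question appended."""
    return messages + [{"role": "user", "content": REFLECTION_PROMPT}]


history = [
    {"role": "system", "content": "You are simulating a Proactive Coder agent."},
    {"role": "user", "content": "Task: scan the project."},
    {"role": "assistant", "content": "Done."},
]
extended = with_reflection(history)
print(extended[-1]["content"])
```

You would then send the extended list to the model as one more turn and read its answer as design feedback, not as ground truth.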

---

💡 **Most Valuable Takeaway:**
This stage isn’t about *how* to implement the agent but about *whether* the design works, and it lets you catch many design problems before you write your first line of Python.



In [1]:
!pip install --quiet python-dotenv openai

In [2]:
# === GAME Simulation: Proactive Coder (real model calls, mocked tool results) ===
import json
import os
from dotenv import load_dotenv
from openai import OpenAI

# 1) Model setup
load_dotenv('/content/API_KEYS.env')
client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

# 2) GAME: Goals & Actions (abstract)
GOALS = [
    "Identify small, self-contained code enhancements",
    "Keep interfaces stable",
    "Only proceed with user approval"
]

ACTIONS = [
    # We simulate these; no real side effects here.
    # Note: this compact list is for human reference only; the model sees SYSTEM and TOOLS.
    {"name":"list_project_files", "desc":"List project files"},
    {"name":"read_project_file", "desc":"Read a file's content", "params":{"filename":"string"}},
    {"name":"ask_user_approval", "desc":"Ask user to approve a proposal", "params":{"proposal":"string"}},
    {"name":"edit_project_file", "desc":"Apply small edits to a file", "params":{"filename":"string","changes":"string"}},
    {"name":"terminate", "desc":"Finish with a summary message", "params":{"message":"string"}}
]

# 3) Function-calling tool specs for structured actions
TOOLS = [
    {"type":"function","function":{
        "name":"list_project_files",
        "description":"Return a list of project file names.",
        "parameters":{"type":"object","properties":{},"required":[]}
    }},
    {"type":"function","function":{
        "name":"read_project_file",
        "description":"Read a file's content.",
        "parameters":{"type":"object","properties":{"filename":{"type":"string"}},"required":["filename"]}
    }},
    {"type":"function","function":{
        "name":"ask_user_approval",
        "description":"Ask the user to approve a proposal.",
        "parameters":{"type":"object","properties":{"proposal":{"type":"string"}},"required":["proposal"]}
    }},
    {"type":"function","function":{
        "name":"edit_project_file",
        "description":"Apply a small, self-contained edit to a file.",
        "parameters":{"type":"object","properties":{"filename":{"type":"string"},"changes":{"type":"string"}},"required":["filename","changes"]}
    }},
    {"type":"function","function":{
        "name":"terminate",
        "description":"Conclude the task and stop.",
        "parameters":{"type":"object","properties":{"message":{"type":"string"}},"required":["message"]}
    }},
]

# 4) Minimal simulation prompt (what + how)
SYSTEM = f"""
You are simulating a Proactive Coder agent using the GAME framework.

GOALS:
- {GOALS[0]}
- {GOALS[1]}
- {GOALS[2]}

ACTIONS (choose ONE per step):
- list_project_files()
- read_project_file(filename)
- ask_user_approval(proposal)
- edit_project_file(filename, changes)
- terminate(message)

Rules:
- Always return a function call if a tool is needed; otherwise, you may terminate.
- Keep actions small and self-contained.
- Seek approval before edits.
"""

# 5) Mock environment: provide canned results to test reasoning
def mock_env_dispatch(name: str, args: dict):
    # You can mutate these to test different scenarios (errors, big projects, etc.)
    PROJECT_FILES = ["main.py", "utils.py", "data_processor.py"]
    MOCK_CONTENT = {
        "main.py": "def main(): pass\n# TODO: improve error handling",
        "utils.py": "def add(a,b): return a+b\n# no input validation",
        "data_processor.py": "def clean(df): return df.dropna()"
    }

    if name == "list_project_files":
        return {"ok": True, "files": PROJECT_FILES, "total": len(PROJECT_FILES), "dir": "/project"}

    if name == "read_project_file":
        fn = args.get("filename","")
        if fn not in MOCK_CONTENT:
            # JIT guidance to test recovery
            return {"ok": False, "error": f"{fn} not found", "hint": "Pick from list_project_files()", "retryable": True}
        return {"ok": True, "filename": fn, "content": MOCK_CONTENT[fn]}

    if name == "ask_user_approval":
        proposal = args.get("proposal","").lower()
        # Simulate a user approval flow: a case-insensitive keyword match stands in for "small change"
        approved = "validate" in proposal or "error handling" in proposal or "docstring" in proposal
        return {"ok": True, "approved": approved, "note": ("Approved" if approved else "Rejected: too large")}

    if name == "edit_project_file":
        fn = args.get("filename","")
        changes = args.get("changes","")
        if not fn or not changes:
            return {"ok": False, "error":"Missing filename/changes", "retryable": True}
        if len(changes) > 400:
            return {"ok": False, "error":"Change set too large", "hint":"Propose smaller edits", "retryable": True}
        return {"ok": True, "result": f"Applied small edits to {fn}"}

    if name == "terminate":
        return {"ok": True, "message": args.get("message","Done.")}

    return {"ok": False, "error": f"Unknown tool {name}"}

# 6) Conversation loop: model chooses action → we simulate result → feed back → repeat
def simulate(task: str, max_steps: int = 8, verbose: bool = True):
    messages = [
        {"role":"system", "content": SYSTEM},
        {"role":"user", "content": f"Task: {task}"}
    ]
    for step in range(1, max_steps+1):
        resp = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=messages,
            tools=TOOLS,
            tool_choice="auto"
        )
        msg = resp.choices[0].message

        if msg.tool_calls:  # model decided to call a tool
            for call in msg.tool_calls:
                name = call.function.name
                args = json.loads(call.function.arguments) if call.function.arguments else {}
                if verbose:
                    print(f"\n[{step}] ACTION → {name}({args})")

                # Simulate environment result
                result = mock_env_dispatch(name, args)
                if verbose:
                    print(f"[{step}] RESULT ← {result}")

                # Feed result back (as a tool result or user feedback)
                messages.append({"role":"assistant", "tool_calls":[call]})
                messages.append({"role":"tool", "tool_call_id": call.id, "content": json.dumps(result)})

                # Optional: stop if terminated
                if name == "terminate":
                    if verbose: print("\nSimulation ended by agent.")
                    return messages
        else:
            # No tool call; the model might be giving a final message
            if verbose:
                print(f"\n[{step}] TEXT → {msg.content}")
            messages.append({"role":"assistant", "content": msg.content})
            break
    return messages

# 7) Try a few “dress rehearsal” scenarios
_ = simulate("Scan the project, propose one small improvement, get approval, apply it, then terminate.")
# To test failure handling, try:
# _ = simulate("Read README.md then terminate.")  # file doesn't exist → see recovery behavior
# _ = simulate("Make sweeping refactors across all files.")  # should seek approval / keep small



[1] ACTION → list_project_files({})
[1] RESULT ← {'ok': True, 'files': ['main.py', 'utils.py', 'data_processor.py'], 'total': 3, 'dir': '/project'}

[2] ACTION → read_project_file({'filename': 'main.py'})
[2] RESULT ← {'ok': True, 'filename': 'main.py', 'content': 'def main(): pass\n# TODO: improve error handling'}

[3] ACTION → ask_user_approval({'proposal': "Enhance error handling in 'main.py' by adding a try-except block to catch and handle exceptions."})
[3] RESULT ← {'ok': True, 'approved': True, 'note': 'Approved'}

[4] ACTION → edit_project_file({'filename': 'main.py', 'changes': "def main():\n    try:\n        pass\n    except Exception as e:\n        print(f'An error occurred: {e}')"})
[4] RESULT ← {'ok': True, 'result': 'Applied small edits to main.py'}

[5] ACTION → terminate({'message': 'Task completed successfully with the enhancement implemented.'})
[5] RESULT ← {'ok': True, 'message': 'Task completed successfully with the enhancement implemented.'}

Simulation ended by agent.

## Code Walk Through

Here’s how that simulation harness maps directly to **“Testing Agent Designs Through Conversation Simulation”**:

### 1) Define the agent up front (design-first)

* **`SYSTEM` prompt** embeds the **Goals** and **Actions** exactly like the lecture suggests.
* You tell the model: *pick ONE action per step, keep edits small, seek approval.*
* This is your “dress rehearsal script”—before any real I/O exists.
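
One way to keep the prompt and the design in sync is to generate the prompt text from the goal and action lists instead of typing them twice. This is a minimal sketch of that idea, not the notebook's code: `signature` and `build_system_prompt` are hypothetical helpers, and the action dicts are assumed to follow the `ACTIONS` shape (`name`, `desc`, optional `params`).

```python
def signature(action: dict) -> str:
    """Render an action dict as a call signature, e.g. read_project_file(filename)."""
    params = ", ".join(action.get("params", {}))
    return f"{action['name']}({params})"


def build_system_prompt(goals: list[str], actions: list[dict], rules: list[str]) -> str:
    """Assemble the simulation prompt from the design lists (hypothetical helper)."""
    lines = [
        "You are simulating a Proactive Coder agent using the GAME framework.",
        "",
        "GOALS:",
    ]
    lines += [f"- {goal}" for goal in goals]
    lines += ["", "ACTIONS (choose ONE per step):"]
    lines += [f"- {signature(action)}" for action in actions]
    lines += ["", "Rules:"]
    lines += [f"- {rule}" for rule in rules]
    return "\n".join(lines)


prompt = build_system_prompt(
    goals=["Keep interfaces stable", "Only proceed with user approval"],
    actions=[
        {"name": "list_project_files", "desc": "List project files"},
        {"name": "read_project_file", "desc": "Read a file's content",
         "params": {"filename": "string"}},
    ],
    rules=["Seek approval before edits."],
)
print(prompt)
```

With this in place, editing a goal or adding an action changes the prompt on the next run, so the rehearsal always tests the current design.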

### 2) Use function calling to get structured actions

* **`TOOLS`** (the function-calling specs) list the actions the agent may take:
  `list_project_files`, `read_project_file`, `ask_user_approval`, `edit_project_file`, `terminate`.
* Because we pass `tools=...` with `tool_choice="auto"`, the model can return **`tool_calls`** (name + JSON args) or plain text; it decides which. Setting `tool_choice="required"` forces a tool call on every turn.
* This avoids the “please output JSON” prompt fragility and keeps the simulation crisp.
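
The notebook writes the compact `ACTIONS` list and the verbose `TOOLS` specs by hand, so the two can drift apart. A possible way around that is to derive one from the other. The sketch below assumes every parameter is required and that the type strings in `params` are valid JSON Schema type names; `to_tool_spec` is a hypothetical helper.

```python
def to_tool_spec(action: dict) -> dict:
    """Convert a compact action dict into a function-calling tool spec.

    Assumption: all parameters are required and their type strings
    (e.g. "string") are valid JSON Schema types.
    """
    params = action.get("params", {})
    return {
        "type": "function",
        "function": {
            "name": action["name"],
            "description": action["desc"],
            "parameters": {
                "type": "object",
                "properties": {name: {"type": typ} for name, typ in params.items()},
                "required": list(params),
            },
        },
    }


actions = [
    {"name": "list_project_files", "desc": "List project files"},
    {"name": "read_project_file", "desc": "Read a file's content",
     "params": {"filename": "string"}},
]
tools = [to_tool_spec(action) for action in actions]
print(tools[1]["function"]["parameters"])
```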

### 3) Mock the Environment to iterate fast

* **`mock_env_dispatch(name, args)`** is the “stage set”: canned project files, canned file contents, and **deliberately crafted errors/hints**.
* You can flip behaviors instantly (e.g., file missing, too-large changes) to see how the agent responds.
* No real files, no real APIs → you can test dozens of scenarios in minutes.
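
In the notebook, the project files are constants inside `mock_env_dispatch`, so switching scenarios means editing the function. A small factory makes the swap a one-liner. This is a sketch under assumptions: `make_mock_env` is a hypothetical helper and, to stay short, it covers only the two read-only tools.

```python
def make_mock_env(files: dict[str, str]):
    """Return a dispatch function bound to one scenario's files (hypothetical helper)."""

    def dispatch(name: str, args: dict) -> dict:
        if name == "list_project_files":
            return {"ok": True, "files": sorted(files), "total": len(files)}
        if name == "read_project_file":
            filename = args.get("filename", "")
            if filename not in files:
                return {"ok": False, "error": f"{filename} not found", "retryable": True}
            return {"ok": True, "filename": filename, "content": files[filename]}
        return {"ok": False, "error": f"Unknown tool {name}"}

    return dispatch


# Three stage sets: a tiny project, a large one, and an empty one.
small = make_mock_env({"main.py": "def main(): pass"})
large = make_mock_env({f"module_{i}.py": "pass" for i in range(200)})
empty = make_mock_env({})

print(small("list_project_files", {}))
print(empty("read_project_file", {"filename": "main.py"}))
```

Passing one of these to the conversation loop in place of the fixed mock lets you compare how the agent behaves on each stage set.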

### 4) The conversation loop mirrors the agent loop

* **`simulate(...)`** sends `{system, user}` → gets **a tool call** → **feeds back a result** → repeats.
* Tool **results** are added as **`role: "tool"`** messages tied to the originating call by `tool_call_id`. This is the API-native way to give the agent its “world feedback.”
* This **chat history** *is* your memory—exactly the list-of-messages memory you’ll use in a real agent.
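
Since the message list is the memory, it helps to see how fast it grows. The helper below is a rough sketch: `summarize_history` is a hypothetical name, it only reads the `role` and `content` keys, and character count is used as a crude stand-in for token count.

```python
import json
from collections import Counter


def summarize_history(messages: list[dict]) -> dict:
    """Count messages per role and total content characters (a rough size proxy)."""
    roles = Counter(message["role"] for message in messages)
    # Assistant tool-call messages may carry no content, so default to "".
    chars = sum(len(message.get("content") or "") for message in messages)
    return {"messages": len(messages), "by_role": dict(roles), "content_chars": chars}


history = [
    {"role": "system", "content": "rules"},
    {"role": "user", "content": "Task: scan"},
    {"role": "assistant", "tool_calls": []},  # placeholder for a tool-call turn
    {"role": "tool", "tool_call_id": "call_1", "content": json.dumps({"ok": True})},
]
print(summarize_history(history))
```

Calling it on the list returned by a simulation run after each scenario gives a quick sense of which tool results dominate the context.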

### 5) Observe and refine decision quality

* Because each step prints **ACTION → RESULT**, you can watch the agent’s decisions unfold:

  * Does it **list files first** before reading?
  * Does it **seek approval** before editing?
  * Does it **propose small changes** (as instructed)?
* You can change the **result format** (add or remove fields such as `total` and `dir`) to see whether richer metadata improves choices, as the lecture recommends.
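
The three questions above can also be checked mechanically once you have more than a handful of runs. The sketch below assumes each run has been collected as a list of `(action_name, result)` pairs, which the notebook does not do by itself, and it assumes one approval covers exactly one edit. `check_trace` is a hypothetical helper.

```python
def check_trace(trace: list[tuple[str, dict]]) -> list[str]:
    """Return ordering violations for one run; an empty list means the run passed."""
    violations = []
    listed = False
    approved = False
    for step, (name, result) in enumerate(trace, start=1):
        if name == "list_project_files":
            listed = True
        elif name == "read_project_file":
            if not listed:
                violations.append(f"step {step}: read before listing files")
        elif name == "ask_user_approval":
            approved = bool(result.get("approved"))
        elif name == "edit_project_file":
            if not approved:
                violations.append(f"step {step}: edit without approval")
            approved = False  # assumption: one approval covers one edit
    return violations


happy = [
    ("list_project_files", {"ok": True}),
    ("read_project_file", {"ok": True}),
    ("ask_user_approval", {"ok": True, "approved": True}),
    ("edit_project_file", {"ok": True}),
    ("terminate", {"ok": True}),
]
pushy = [
    ("read_project_file", {"ok": True}),
    ("ask_user_approval", {"ok": True, "approved": False}),
    ("edit_project_file", {"ok": True}),
]
print(check_trace(happy))
print(check_trace(pushy))
```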

### 6) Test error handling and recovery behavior

* The mock returns **structured failures** with **hints** (e.g., `"retryable": true` or “Pick from `list_project_files()`”).
* You can inject malformed/missing data to see if the agent recovers or derails.
* This directly informs how you’ll design **robust tool errors** in the real environment.
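
Instead of editing the mock for every failure case, you could wrap it and inject faults at chosen steps. This is a sketch only: `with_faults` is a hypothetical wrapper, the `base` dispatch is a stand-in for the notebook's mock, and the fault payloads are made-up examples of a transient error and a malformed result.

```python
from typing import Callable

Dispatch = Callable[[str, dict], dict]


def with_faults(dispatch: Dispatch, faults: dict[int, dict]) -> Dispatch:
    """Wrap a dispatch function so the Nth call (1-based) returns a canned result."""
    calls = 0

    def wrapped(name: str, args: dict) -> dict:
        nonlocal calls
        calls += 1
        if calls in faults:
            return faults[calls]
        return dispatch(name, args)

    return wrapped


def base(name: str, args: dict) -> dict:
    """Stand-in environment that always succeeds."""
    return {"ok": True, "tool": name}


flaky = with_faults(base, {
    2: {"ok": False, "error": "timeout", "retryable": True},  # transient failure
    3: {"ok": True, "content": None},                          # malformed: content missing
})

for _ in range(4):
    print(flaky("read_project_file", {"filename": "main.py"}))
```

Because the wrapper leaves the underlying mock untouched, the same scenario can be run clean and faulty back to back.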

### 7) Practice termination and guardrails

* Include a **`terminate`** action and test that the agent knows when to stop.
* You can also cap `max_steps` to prevent runaway loops and study whether the agent converges under your rules.
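
A step cap stops a runaway loop, but it does not tell you *why* the agent failed to converge. One common symptom is the same call repeated over and over. The sketch below detects that; `is_stuck` and the `window` parameter are hypothetical, and the calls are assumed to be collected as `(name, args)` pairs.

```python
import json


def is_stuck(calls: list[tuple[str, dict]], window: int = 3) -> bool:
    """True if the last `window` calls are identical (same tool, same arguments)."""
    if len(calls) < window:
        return False
    # Serialize args with sorted keys so equal dicts compare equal as strings.
    recent = {(name, json.dumps(args, sort_keys=True)) for name, args in calls[-window:]}
    return len(recent) == 1


looping = [("read_project_file", {"filename": "README.md"})] * 3
progressing = [
    ("list_project_files", {}),
    ("read_project_file", {"filename": "main.py"}),
    ("read_project_file", {"filename": "utils.py"}),
]
print(is_stuck(looping), is_stuck(progressing))
```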

### 8) Build your example library

* Interesting traces (good and bad) can be copy-pasted into your notebook as **training cues** for prompts and future tests.
* Over time, this becomes your “playbook” of patterns to encourage or avoid.
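
Copy-paste works for a few traces; for more, an append-only JSON Lines file is one simple option. The sketch below is an assumption about how you might store runs, not something the notebook does: `save_trace`, `load_traces`, the record fields, and the `good`/`bad` labels are all hypothetical, and steps are stored as plain dicts so they serialize cleanly.

```python
import json
import tempfile
from pathlib import Path
from typing import Optional


def save_trace(path: Path, task: str, steps: list[dict], label: str) -> None:
    """Append one labeled run to a JSON Lines file."""
    record = {"task": task, "label": label, "steps": steps}
    with path.open("a", encoding="utf-8") as handle:
        handle.write(json.dumps(record) + "\n")


def load_traces(path: Path, label: Optional[str] = None) -> list[dict]:
    """Load all runs, optionally keeping only those with the given label."""
    with path.open(encoding="utf-8") as handle:
        records = [json.loads(line) for line in handle if line.strip()]
    return [r for r in records if label is None or r["label"] == label]


library = Path(tempfile.mkdtemp()) / "traces.jsonl"
save_trace(
    library,
    task="Read README.md then terminate.",
    steps=[{"action": "read_project_file", "args": {"filename": "README.md"},
            "result": {"ok": False, "error": "README.md not found"}}],
    label="bad",
)
save_trace(
    library,
    task="Scan the project and propose one small improvement.",
    steps=[{"action": "list_project_files", "args": {}, "result": {"ok": True}}],
    label="good",
)
print(len(load_traces(library)), len(load_traces(library, label="bad")))
```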

---

## What this buys you (why simulate first)

* **Catches design gaps early** (unclear tools, vague goals, missing hints).
* **De-risks implementation**—you’re not guessing whether the plan works; you’ve watched it work.
* **Speeds iteration**—change the mock or the rules, rerun, observe.
* **Directly transferable**—the chat loop you see here becomes the real agent loop once you swap the mock environment for a real one.

Two quick runs to try next:

* A **happy path** (list → read → propose → ask approval → edit → terminate), which is the run shown above.
* A **failure path**: run `simulate("Read README.md then terminate.")` to request a missing file and watch the recovery.
