<a target="_blank" href="https://colab.research.google.com/github/okareo-ai/okareo-cookbook/blob/main/notebooks/multiturn-evaluation/custom-endpoint-multiturn-demo.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>


# Conversation Simulation with Custom Endpoint

In this notebook you will **run and evaluate a full multi‑turn conversation** between a simulated *customer* (the **Driver**) and your *system* (the **Target**) using **Okareo Simulations**.

## 🎯 Objectives

By the end of the notebook you should be able to:

1. **Set up** the Okareo Python SDK and authenticate with your API key.
2. **Wrap any HTTP endpoint** (or raw LLM) so it can act as a *Target* in a multi‑turn Simulation.
3. **Author a Driver prompt** that role‑plays a realistic user persona and pursues specific objectives.
4. **Attach templated Scenarios** so one prompt can generate many conversations.
5. **Launch a Simulation run** and inspect the auto‑generated **Checks / metrics** inside the Okareo UI.

> 📝 *If you’re new to Okareo Simulations, read the **Primer** embedded below and/or our in-depth documentation: [https://docs.okareo.com/docs/simulation/introduction](https://docs.okareo.com/docs/simulation/introduction)*

---

## 📚 Primer snapshot

<img src="./simulation-diagram.png" alt="Okareo simulation overview diagram" style="width:50%;">

1. **Setting Profile** – defines *who* is speaking (Driver) and *what* we are testing (Target).
2. **Scenario** – per‑test inputs + the expected behavior description.
3. **Simulation Execution** – Okareo orchestrates the turns and records a simulation.
4. **Evaluation** – Judge & Code Checks score each simulation.

---

## Install & authenticate

This step installs the latest Okareo SDK and sets you up for secure API access.

**Tip →** Store `OKAREO_API_KEY` as an environment variable *before* launching this notebook so you never hard‑code secrets.

**Don’t have an API key?**

👉 [How to create an API token](https://docs.okareo.com/docs/reference/api-token)
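
As a minimal sketch of the tip above, the helper below fails fast with a readable error when the variable is missing, instead of silently falling back to a placeholder string. The function name `read_api_key` is hypothetical and is not part of the Okareo SDK.

```python
import os


def read_api_key(env=os.environ, name="OKAREO_API_KEY"):
    """Return the key stored under `name`, or raise a clear error if it is unset."""
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(
            f"{name} is not set. Export it in your shell before starting the notebook."
        )
    return key


# Demonstration with plain dicts so the example runs without a real key.
print(read_api_key({"OKAREO_API_KEY": "example-key"}))
try:
    read_api_key({})
except RuntimeError as err:
    print(f"Setup problem: {err}")
```

Calling `read_api_key()` with no arguments reads from the real process environment.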

In [None]:
# can be skipped if the kernel already has okareo installed
%pip install --upgrade okareo

In [None]:
import os
from okareo import Okareo

OKAREO_API_KEY = os.environ.get("OKAREO_API_KEY", "<YOUR_OKAREO_API_KEY>")
okareo = Okareo(OKAREO_API_KEY)
print("✅ Successfully connected to Okareo!")

## Define the Target (your system under test)

A **Target** can be:

* A live HTTP endpoint connecting to your agents
* A raw LLM with a system prompt
* A Python function/class that returns a response object

This notebook points the Target at HTTP endpoints so Okareo can simulate many full conversations with an agentic system.

A `CustomEndpointTarget` lets you run multi-turn simulations against API endpoints and evaluate the results. It is configured with up to three API calls:
1. An API that starts a session (optional)
2. An API that handles the next turn of a conversation within that session
3. An API that ends a session (optional)

Okareo automatically **records each request/response pair** so they can be shown in the evaluation and scored by Checks later.

We have provided Okareo endpoints as examples for start session, next turn, and end session, but you can swap in your own endpoints to run a simulation.
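
The configuration in this section relies on two conventions: dotted paths such as `response.thread_id` that pick a field out of a reply, and `{latest_message}` / `{session_id}` placeholders in the request body. The sketch below illustrates both ideas with plain Python. It is not how Okareo implements them, and it assumes the leading `response` segment names the parsed JSON reply. The helpers `resolve_path` and `fill_body` and the sample payload are hypothetical.

```python
import json


def resolve_path(payload, dotted_path):
    """Follow a dotted path such as 'response.thread_id' through nested dicts."""
    value = payload
    for key in dotted_path.split("."):
        value = value[key]
    return value


def fill_body(body_template, **values):
    """Replace '{name}' placeholders in every string value of a flat body dict."""
    filled = {}
    for field, text in body_template.items():
        for name, value in values.items():
            text = text.replace("{" + name + "}", value)
        filled[field] = text
    return filled


# Hypothetical payload; a real endpoint will return its own field names.
start_reply = {"response": {"thread_id": "t-123"}}
session_id = resolve_path(start_reply, "response.thread_id")

body = fill_body(
    {"message": "{latest_message}", "thread_id": "{session_id}"},
    latest_message="What is the return window?",
    session_id=session_id,
)
print(json.dumps(body))
```

If your own API nests the session ID or the assistant message differently, the path strings are the only values that should need to change.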

In [None]:
from okareo.model_under_test import (
    CustomEndpointTarget,
    Target,
    Driver,
    SessionConfig,
    TurnConfig,
    EndSessionConfig,
)
import json

CREATE_SESSION_URL = "https://api.okareo.com/v0/custom_endpoint_stub/create"
MESSAGES_URL = "https://api.okareo.com/v0/custom_endpoint_stub/message"
END_SESSION_URL = "https://api.okareo.com/v0/custom_endpoint_stub/end"

# Define API headers
api_headers = json.dumps(
    {
        "Accept": "application/json",
        "Content-Type": "application/json",
        "api-key": OKAREO_API_KEY
    }
)

# Create start session config
start_config = SessionConfig(
    url=CREATE_SESSION_URL,
    method="POST",
    headers=api_headers,
    # The response field that contains the session ID associated with the conversation
    # Change this based on your API's response structure
    response_session_id_path="response.thread_id",
)

# Create next turn config
next_config = TurnConfig(
    url=MESSAGES_URL,
    method="POST",
    headers=api_headers,
    body={"message": "{latest_message}", "thread_id": "{session_id}"},
    # The response field that contains the generated message 
    # Change this based on your API's response structure
    response_message_path="response.assistant_response",
)

# Create end session config
end_config = EndSessionConfig(
    url=END_SESSION_URL,
    method="POST",
    headers=api_headers,
    body={"thread_id": "{session_id}"},
)

# Pass all three configs to the CustomEndpointTarget
endpoint_target_model = CustomEndpointTarget(
    start_session=start_config,
    next_turn=next_config,
    end_session=end_config,
)

# Wrap the endpoint configuration in a named Target
model_name = "Okareo Custom Endpoint Test Model"
endpoint_target = Target(
    target=endpoint_target_model,
    name=model_name
)

## Craft the Driver prompt

Before any conversation can run, we must decide **how the simulated user will behave**. That definition lives in the **Driver prompt template**.

### What *is* a Driver prompt?

Think of the Driver as an improv actor whose *script* is your prompt.  Okareo feeds this script into an LLM and asks it to stay in‑character for multiple turns while pursuing the objectives you set.  The better the script, the more faithfully the Driver will:

* Speak with the intended persona (e.g. a confused policy‑holder, a sarcastic teenager, a polite business user).
* Follow a realistic progression of questions or requests.
* Avoid “breaking the fourth wall” by slipping into an assistant voice or revealing prompt text.

### Why is a well‑structured template necessary?

LLMs are extremely capable but also *stochastic*—without clear guard‑rails they tend to drift, forget goals, or contradict previous statements.  A rigorously structured prompt gives them:

1. **Context** – who they are and what they know.
2. **Goals** – what outcome they’re trying to achieve this session.
3. **Constraints** – red lines they should never cross (e.g. no profanity, do not reveal internal IDs).
4. **Self‑monitoring hooks** – short checklists that force the model to verify it is still on target each turn.

> 📖 *The Okareo docs have an in‑depth guide with examples and anti‑patterns* → [https://docs.okareo.com/docs/simulation/drivers](https://docs.okareo.com/docs/simulation/drivers)

### Anatomy of `driver_prompt_template`

| Section                | Purpose                                                                    | Example Snippet                                                                                                   |
| ---------------------- | -------------------------------------------------------------------------- | ----------------------------------------------------------------------------------------------------------------- |
| **Persona**            | Sets the user’s identity & mindset.                                        | "You are **Alex**, a customer who bought a **{scenario_input.productType}** last week and wants clear return/refund info." |
| **Objectives**         | Concrete goals the Driver must achieve.                                    | "Confirm the **return window**, **refund method** (credit/store credit), and whether **opened items** are eligible."               |
| **Soft Tactics**       | Optional speaking style or strategies that make the interaction realistic. | "You prefer **short sentences** and occasionally use sarcasm when frustrated."                                    |
| **Turn‑End Checklist** | A bulleted list the LLM must silently confirm before sending each message. | "- Did I ask a question that moves me toward my goal?"                                                            |

Notice the use of placeholders like `{scenario_input.name}` and `{scenario_input.productType}` → these are **dynamically replaced** with values from each Scenario row you provide next.
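
Because every placeholder must be backed by a key in each Scenario row, it can help to check the template against your rows before launching a run. The following is a minimal sketch of such a pre-flight check. The helper names and the regular expression are assumptions made for illustration (single-level keys only) and are not part of the Okareo SDK.

```python
import re

# Matches {scenario_input} and {scenario_input.someKey}
PLACEHOLDER = re.compile(r"\{scenario_input(?:\.(\w+))?\}")


def referenced_fields(template):
    """Return the set of scenario_input field names a template refers to."""
    return {m.group(1) for m in PLACEHOLDER.finditer(template) if m.group(1)}


def missing_fields(template, scenario_inputs):
    """Map each row index to the fields the template needs but the row lacks."""
    needed = referenced_fields(template)
    problems = {}
    for index, row in enumerate(scenario_inputs):
        missing = needed - set(row)
        if missing:
            problems[index] = sorted(missing)
    return problems


template = "Name: {scenario_input.name}\nProduct Type: {scenario_input.productType}"
rows = [
    {"name": "James Taylor", "productType": "Apparel"},
    {"name": "Julie May"},  # productType deliberately left out
]
print(sorted(referenced_fields(template)))
print(missing_fields(template, rows))
```

An empty result from `missing_fields` means every row supplies every field the template references.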

`first_turn="driver"` tells Okareo who speaks first.

The `StopConfig` lets you hook in a custom Check that can *terminate early* if the behavior diverges too far (for example, the Target starts answering questions your policies forbid).
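
To make the interplay of `first_turn`, the turn limit, and an early-stop check concrete, here is a toy conversation loop. It is a sketch under stated assumptions, not Okareo's engine: one turn is counted as a driver message plus a target reply, the conversation stops when the check result equals `stop_on`, and the driver, target, and check functions are hypothetical stand-ins.

```python
def simulate(driver, target, check, stop_on, max_turns, first_turn="driver"):
    """Alternate messages until the check result equals `stop_on` or turns run out.

    Counting one turn as a driver message plus a target reply is an
    assumption of this sketch, not a statement about Okareo's engine.
    """
    transcript = []
    if first_turn == "target":
        # The target opens the conversation before the first full turn.
        transcript.append(("target", target(transcript)))
    for _ in range(max_turns):
        transcript.append(("driver", driver(transcript)))
        transcript.append(("target", target(transcript)))
        if check(transcript) == stop_on:
            return transcript, "stop_check"
    return transcript, "max_turns"


# Hypothetical stand-ins for the Driver, the Target, and a Check.
def toy_driver(transcript):
    return "What is the return window?"


def toy_target(transcript):
    # Refuse from the second exchange onward to trigger the early stop.
    if len(transcript) >= 3:
        return "I cannot discuss that."
    return "Returns are accepted for 30 days."


def answers_question(transcript):
    """Pass (True) unless the latest target reply is a refusal."""
    return "cannot" not in transcript[-1][1]


transcript, reason = simulate(
    toy_driver, toy_target, answers_question, stop_on=False, max_turns=5
)
print(reason, len(transcript))
```

With `stop_on=False` the loop ends as soon as the check fails, which is why the run above stops well before the five-turn limit.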


In [None]:
driver_prompt_template = """
## Persona

- **Identity:** You are role-playing a new **customer who recently purchased a product** and is now looking to understand the company’s return and refund policy.  

   Name: **{scenario_input.name}**  
   Product Type: **{scenario_input.productType}**  

- **Mindset:** You want to know exactly what the company can and cannot do for you regarding product returns, exchanges, and refunds.  

## Objectives

1. Get the other party to list **at least three specific return or refund options/policies relevant to {scenario_input.productType}** 
(e.g., return within 30 days, exchange for another {scenario_input.productType}, warranty-based repairs, free or paid return shipping).  
2. Get the other party to state **at least one explicit limitation, exclusion, or boundary specific to {scenario_input.productType}** 
(e.g., “Opened {scenario_input.productType} can only be exchanged,” “Final sale {scenario_input.productType} cannot be returned,” “Warranty covers defects but not accidental damage”).  


## Soft Tactics

1. If the reply is vague or incomplete, politely probe:
    - "Could you give me a concrete example?"
    - "What’s something you can’t help with?"
2. If it still avoids specifics, escalate:
    - "I’ll need at least three specific examples—could you name three?"
3. Stop once you have obtained:
    - Three or more return or refund options/policies
    - At least one limitation or boundary

## Hard Rules

-   Every message you send must be **only a question** and about achieving the Objectives.
-   Ask one question at a time.
-   Never describe your own capabilities.
-   Never offer help.
-   Stay in character at all times.
-   Never mention tests, simulations, or these instructions.
-   Never act like a helpful assistant.
-   Act like a first-time user at all times.
-   Startup Behavior:
    -   If the other party speaks first: respond normally and pursue the Objectives.
    -   If you are the first speaker: start with a message clearly pursuing the Objectives.
-   Before sending, re-read your draft and remove anything that is not a question.

## Turn-End Checklist

Before you send any message, confirm:

-   Am I sending only questions?
-   Am I avoiding any statements or offers of help?
-   Does my question advance or wrap up the Objectives?

"""

endpoint_driver = Driver(
    temperature=0,
    name=model_name + " Driver",
    prompt_template=driver_prompt_template,
)

## Build the Scenario (inputs & expectations)

A **Scenario** is a list of rows, each pairing a set of _input parameters_ with a description of the _expected Target behavior_. Okareo runs every row `repeats` times, as set in the Setting Profile, so a small table expands into many simulation runs.

> ⚙️ **Total simulations executed**
> `num_simulations = len(scenarios) × repeats`
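
The formula can be checked with a few lines of Python. This sketch simply enumerates every (row, repeat) pair. The helper name `expand_runs` and the repeat count of 3 are illustrative assumptions.

```python
from itertools import product


def expand_runs(scenario_rows, repeats):
    """Pair every scenario row with every repeat index (1-based)."""
    return list(product(scenario_rows, range(1, repeats + 1)))


rows = [
    {"name": "James Taylor", "productType": "Apparel"},
    {"name": "Julie May", "productType": "Electronics"},
]
runs = expand_runs(rows, repeats=3)

# len(scenarios) x repeats = 2 x 3 = 6 simulations
print(len(runs))
for row, repeat in runs:
    print(row["name"], "repeat", repeat)
```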

### Template variables = super‑powers 🪄

Every key inside the `input` object becomes a **template variable** that you can reference from the Driver prompt using either:

* **String interpolation** – `{scenario_input}` when `input` is a plain string.
* **JSONPath lookup** – `{scenario_input.customerId}` or `{scenario_input.name}` when `input` is a JSON object.

This lets one carefully‑crafted prompt role‑play *dozens or hundreds* of different conversations without rewriting any text.

**Examples**

| Input JSON                              | How to reference in Driver prompt     | Result inside conversation |
| --------------------------------------- | ------------------------------------- | -------------------------- |
| `{ "name": "James Taylor" }`            | `I'm {scenario_input.name}!`        | "I'm James Taylor!"      |
| `{ "plan": "Gold", "deductible": 500 }` | `My plan is {scenario_input.plan}.` | "My plan is Gold."       |
| String: `"renew policy"`                | `Task: {scenario_input}`              | "Task: renew policy"       |
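
The rows of the table can be reproduced with a small rendering function. This is only an illustration of the substitution idea, assuming single-level keys. Okareo performs the real substitution for you, and the `render` helper is hypothetical.

```python
import re


def render(template, scenario_input):
    """Substitute {scenario_input} and {scenario_input.key} placeholders."""

    def replace(match):
        key = match.group(1)
        if key is None:
            # Plain {scenario_input}: use the whole input as text.
            return str(scenario_input)
        return str(scenario_input[key])

    return re.sub(r"\{scenario_input(?:\.(\w+))?\}", replace, template)


print(render("I'm {scenario_input.name}!", {"name": "James Taylor"}))
print(render("My plan is {scenario_input.plan}.", {"plan": "Gold", "deductible": 500}))
print(render("Task: {scenario_input}", "renew policy"))
```

Keys that the template never references, such as `deductible` above, are simply ignored.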

In [None]:
from okareo_api_client.models import ScenarioSetCreate
import random, string

seed_str = "".join(random.choices(string.ascii_letters + string.digits, k=6))

seed_data = Okareo.seed_data_from_list(
    [
        {
            "input": {"name": "James Taylor", "productType": "Apparel"},
            "result": "Share refund limits for Apparel."
        },
        {
            "input": {"name": "Julie May", "productType": "Electronics"},
            "result": "Provide exchange policy and any exclusions for Electronics."
        },
    ]
)
scenario_set = okareo.create_scenario_set(
    ScenarioSetCreate(
        # The random suffix keeps the scenario set name unique across reruns
        name=f"Product Returns — Conversation Context {seed_str}",
        seed_data=seed_data,
    )
)

## Run the Simulation & evaluate

With the Setting Profile, Driver, Target, and Scenario table in place, you’re ready to **launch the simulation**.  The helper method `run_simulation` spins up every conversation (rows × repeats), records transcripts, and *immediately* feeds them into the evaluation pipeline.

### What is a *Check*?

A **Check** is an automated evaluator that inspects the transcript (or last message) and returns a score, label, or pass/fail flag.

| Type            | Engine              | Typical Use‑case                                                               |
| --------------- | ------------------- | ------------------------------------------------------------------------------ |
| **Judge Check** | LLM + prompt rubric | Grading factual accuracy, instruction‑following, tone, depth, usefulness, etc. |
| **Code Check**  | Python function     | Regex matches, JSON validity, KPI assertions, blacklist/whitelist tests.       |
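
To give a feel for the kind of deterministic logic a Code Check might contain, here is a regex-based sketch that passes when any Target message states a return window in days. It is a plain Python function and does not follow the Okareo check interface. The transcript shape (a list of `role`/`content` dicts) and the role names are assumptions made for the example.

```python
import re

# Matches phrases such as "30 days", "14 day", or "30-day"
DAYS = re.compile(r"\b\d+\s*-?\s*days?\b", re.IGNORECASE)


def mentions_return_window(transcript):
    """Pass when at least one target message names a number of days."""
    return any(
        message["role"] == "target" and DAYS.search(message["content"]) is not None
        for message in transcript
    )


# Hypothetical transcripts for illustration only.
good = [
    {"role": "driver", "content": "How long do I have to return apparel?"},
    {"role": "target", "content": "Apparel can be returned within 30 days."},
]
vague = [
    {"role": "driver", "content": "How long do I have to return apparel?"},
    {"role": "target", "content": "It depends on the item."},
]
print(mentions_return_window(good), mentions_return_window(vague))
```

A Judge Check would suit the same question when the wording is too varied for a pattern, at the cost of an LLM call per evaluation.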

### What happens under the hood?

<img src="./simulation-sketch.png" alt="Sketch of the simulation pipeline, left to right" style="width:75%;">

The engine follows the **left ➜ right** progression shown above:

1. **Setting Profile**  – Okareo loads the Driver + Target definitions and reads the global parameters such as `max_turns`, `repeats`, and which party speaks first.
2. **Scenario expansion**  – Okareo runs every row in your Scenario table `repeats` times, producing *n* independent runs.
   *Example*  → 3 rows × 2 repeats = **6** simulations.
3. **Conversation simulation**  – Each run executes turn‑by‑turn until one of these stop conditions is met:

   * *Max turns* reached.
   * A StopConfig rule fires (Driver/Target broke a hard constraint).
   
4. **Check execution**  – Checks you define in your `checks` list are evaluated at conversation level and optionally on each turn.

   * **Judge Checks** call an LLM with a rubric prompt and return a score (typically Pass/Fail or a 1‑5 rating).
   * **Code Checks** run deterministic Python logic (regex, JSON parse, business rules).
5. **Aggregation & storage** – Scores and metadata are returned in the evaluation for your review:


In [None]:
from okareo.model_under_test import StopConfig

# Run the test
test_run = okareo.run_simulation(
    target=endpoint_target,
    driver=endpoint_driver,
    name="Okareo Custom Endpoint Test Simulation",
    api_key=OKAREO_API_KEY,
    scenario=scenario_set,
    stop_check=StopConfig(check_name="behavior_adherence", stop_on=False),
    max_turns=2,
    checks=["behavior_adherence"],
)

print(f"See results in Okareo app: {test_run.app_link}")

## ✅ Wrap‑up & further exploration

* Simulations intro — [https://docs.okareo.com/docs/simulation/introduction](https://docs.okareo.com/docs/simulation/introduction)
* Driver design — [https://docs.okareo.com/docs/simulation/drivers](https://docs.okareo.com/docs/simulation/drivers)
* Python SDK — [https://docs.okareo.com/docs/reference/python-sdk/okareo](https://docs.okareo.com/docs/reference/python-sdk/okareo)