# LangWatch Evaluation Tracking

## Simple Evaluation Loop

In [None]:
import langwatch

langwatch.login()

LangWatch API key is already set, if you want to login again, please call as langwatch.login(relogin=True)


In [3]:
import random
import pandas as pd
import time

df = pd.DataFrame(
    [
        {
            "question": "What is LangWatch?",
            "answer": "LangWatch is a platform for evaluating and improving language models.",
        },
        {
            "question": "How do I use LangWatch?",
            "answer": "You can use LangWatch by installing the LangWatch SDK and then calling the LangWatch API.",
        },
        {
            "question": "Does LangWatch support multiple language models?",
            "answer": "Yes, LangWatch is compatible with all language models by using LiteLLM under the hood.",
        },
        {
            "question": "Can I visualize evaluation metrics in LangWatch?",
            "answer": "Yes, LangWatch provides dashboards for visualizing key evaluation metrics.",
        },
        {
            "question": "Is there a free tier for LangWatch?",
            "answer": "LangWatch offers a free tier with limited usage, ideal for small projects and evaluation.",
        },
        {
            "question": "Where can I find documentation for LangWatch?",
            "answer": "You can find the official documentation on the LangWatch website or GitHub repository.",
        },
        {
            "question": "![](https://i.imgur.com/Tb5hyby.jpeg)",
            "answer": "This is a screenshot of the LangWatch website",
        },
    ]
)

evaluation = langwatch.experiment.init("my-incredible-experiment")


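@langwatch.trace()
def agent(question):
    # Placeholder agent: sleep 0-10 s to mimic model latency, then return a fixed response
    time.sleep(random.randint(0, 10))
    return {"text": "foo bar"}


for index, row in evaluation.loop(df.iterrows()):
    result = agent(row["question"])  # replace with a call to your own code

    # Placeholder metric: a random score in [0.2, 1.0]
    score = random.randint(0, 80) / 100 + 0.2
    evaluation.log("sample_metric", index=index, score=score, passed=score > 0.5)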

Follow the results at: http://localhost:5560/inbox-narrator/experiments/my-incredible-experiment?runId=eggplant-trout-of-painting


Evaluating: 100%|██████████| 7/7 [00:42<00:00,  6.05s/it]


## Parallel Evaluation Loop

In [None]:
import random
import time

langwatch.setup()
evaluation = langwatch.experiment.init("my-incredible-experiment")

@langwatch.trace()
def agent(question):
    time.sleep(random.randint(0, 10))
    return "foo parallel"

for index, row in evaluation.loop(df.iterrows(), threads=4):
    # index and row are passed to submit explicitly, so each task works on its own item
    def task(index, row):
        result = agent(row["question"])
        evaluation.log("sample_metric", index=index, data={"response": result}, score=1)

    evaluation.submit(task, index, row)

2026-01-15 12:44:09,320 - langwatch.client - INFO - Configuring OTLP exporter with endpoint: http://localhost:5560/api/otel/v1/traces
Follow the results at: http://localhost:5560/inbox-narrator/experiments/my-incredible-experiment?runId=prophetic-uakari-of-satiation


Evaluating: 100%|██████████| 7/7 [00:13<00:00,  1.91s/it]
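
The `submit` pattern above hands each item to a worker so that slow calls can overlap. As a rough illustration of that idea only, here is a minimal stand-alone sketch built on the standard library's `ThreadPoolExecutor`. It is a stand-in and makes no claim about how LangWatch schedules tasks internally; `fake_agent` and `run_parallel` are hypothetical names introduced for this example.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_agent(question):
    # Hypothetical placeholder: a short sleep stands in for a slow model call
    time.sleep(0.05)
    return f"answer to {question}"


def run_parallel(questions, threads=4):
    results = {}

    def task(index, question):
        # index and question arrive as arguments, so each task sees its own item
        results[index] = fake_agent(question)

    with ThreadPoolExecutor(max_workers=threads) as pool:
        for index, question in enumerate(questions):
            pool.submit(task, index, question)
    # Leaving the with-block waits for all submitted tasks to finish
    return results


questions = [f"q{i}" for i in range(8)]
start = time.perf_counter()
results = run_parallel(questions)
elapsed = time.perf_counter() - start
print(len(results), f"{elapsed:.2f}s")
```

With eight tasks of about 0.05 s each and four workers, the sleeps overlap, so the total should land well below the sum of the individual sleeps.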


## Multi-Target Comparison

Use `evaluation.target()` to compare multiple models or configurations on the same dataset.
Each target gets its own dataset entry, and its latency is tracked automatically.
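
To make the idea of per-target latency tracking concrete, here is a minimal stand-alone sketch of a context manager that times the block it wraps. It is an illustration under stated assumptions, not LangWatch's implementation: `timed_target` and `latencies` are hypothetical names, and the sleeps stand in for model calls.

```python
import time
from contextlib import contextmanager

# Hypothetical store for this illustration: target name -> seconds elapsed
latencies = {}


@contextmanager
def timed_target(name):
    """Record the wall-clock duration of the enclosed block under `name`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        # Runs even if the block raises, so a failed call is still timed
        latencies[name] = time.perf_counter() - start


with timed_target("fast-model"):
    time.sleep(0.01)  # stand-in for a quick model call

with timed_target("slow-model"):
    time.sleep(0.05)  # stand-in for a slower model call

print(latencies)
```

Because the timing lives in the context manager, the code inside each `with` block does not have to measure anything itself, which is the convenience the target blocks in the next cell rely on.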

In [5]:
import random
import time

langwatch.setup()
evaluation = langwatch.experiment.init("model-comparison-experiment")


@langwatch.trace()
def call_model(model_name: str, question: str):
    """Simulate calling different LLM models."""
    # Simulate different latencies per model
    latency = {"gpt-4": 0.5, "gpt-3.5": 0.2, "claude": 0.3}.get(model_name, 0.1)
    time.sleep(latency + random.random() * 0.2)
    return f"Response from {model_name}: This is a simulated answer to '{question[:30]}...'"


for index, row in evaluation.loop(df.iterrows(), threads=4):
    def task(index, row):
        # Compare GPT-4 vs GPT-3.5 vs Claude on the same question
        with evaluation.target("gpt-4", {"model": "openai/gpt-4", "temperature": 0.7}):
            response = call_model("gpt-4", row["question"])
            evaluation.log_response(response)  # Store the model output
            score = random.uniform(0.8, 1.0)  # Simulated score: highest range for GPT-4
            evaluation.log("quality", index=index, score=score)

        with evaluation.target("gpt-3.5", {"model": "openai/gpt-3.5-turbo", "temperature": 0.7}):
            response = call_model("gpt-3.5", row["question"])
            evaluation.log_response(response)
            score = random.uniform(0.6, 0.9)  # Simulated score: lowest range for GPT-3.5
            evaluation.log("quality", index=index, score=score)

        with evaluation.target("claude", {"model": "anthropic/claude-3-sonnet", "temperature": 0.7}):
            response = call_model("claude", row["question"])
            evaluation.log_response(response)
            score = random.uniform(0.75, 0.95)  # Simulated score: middle range for Claude
            evaluation.log("quality", index=index, score=score)

    evaluation.submit(task, index, row)

2026-01-15 12:44:34,997 - langwatch.client - INFO - Configuring OTLP exporter with endpoint: http://localhost:5560/api/otel/v1/traces
Follow the results at: http://localhost:5560/inbox-narrator/experiments/model-comparison-experiment?runId=versatile-arrogant-woodpecker


Evaluating: 100%|██████████| 7/7 [00:02<00:00,  2.38it/s]
