# LangWatch Evaluation Tracking

## Simple Evaluation Loop

In [2]:
import random
import langwatch
import pandas as pd
import time

df = pd.DataFrame(
    [
        {
            "question": "What is LangWatch?",
            "answer": "LangWatch is a platform for evaluating and improving language models.",
        },
        {
            "question": "How do I use LangWatch?",
            "answer": "You can use LangWatch by installing the LangWatch SDK and then calling the LangWatch API.",
        },
        {
            "question": "Does LangWatch support multiple language models?",
            "answer": "Yes, LangWatch is compatible with all language models by using LiteLLM under the hood.",
        },
        {
            "question": "Can I visualize evaluation metrics in LangWatch?",
            "answer": "Yes, LangWatch provides dashboards for visualizing key evaluation metrics.",
        },
        {
            "question": "Is there a free tier for LangWatch?",
            "answer": "LangWatch offers a free tier with limited usage, ideal for small projects and evaluation.",
        },
        {
            "question": "Where can I find documentation for LangWatch?",
            "answer": "You can find the official documentation on the LangWatch website or GitHub repository.",
        },
    ]
)

evaluation = langwatch.evaluation.init("my-incredible-experiment")


@langwatch.trace()
def agent(question):
    time.sleep(random.randint(0, 10) / 10)
    return {"text": "foo bar"}


for index, row in evaluation.loop(df.iterrows()):
    result = agent(row["question"])  # your code

    score = random.randint(0, 80) / 100 + 0.2
    evaluation.log("sample_metric", index=index, score=score, passed=score > 0.5)


Follow the results at: http://localhost:5560/inbox-narrator/experiments/my-incredible-experiment?runId=industrious-convivial-waxbill


Evaluating:   0%|          | 0/6 [00:00<?, ?it/s]
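
The loop above logs a random number as `sample_metric`. As a minimal sketch of what a real metric might look like, the hypothetical helper below (`token_f1` is not part of LangWatch) computes a token-overlap F1 between the agent's text and the expected answer, using only the standard library:

```python
from collections import Counter


def token_f1(predicted: str, expected: str) -> float:
    """Token-overlap F1 between two strings (hypothetical helper, not a LangWatch API)."""
    pred_tokens = predicted.lower().split()
    gold_tokens = expected.lower().split()
    if not pred_tokens or not gold_tokens:
        return 0.0
    # Multiset intersection counts each shared token at most as often as it appears in both.
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


print(token_f1("foo bar", "foo bar"))
print(token_f1("foo bar", "LangWatch is a platform"))
```

Inside the loop, the returned value could replace the random `score` passed to `evaluation.log("sample_metric", index=index, score=score, passed=score > 0.5)`.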

## Parallel Evaluation Loop

In [9]:
import random
import time

langwatch.setup()
evaluation = langwatch.evaluation.init("my-incredible-experiment")

@langwatch.trace()
def agent(question):
    time.sleep(random.randint(0, 10) / 10)
    return "foo parallel"

for index, row in evaluation.loop(df.iterrows(), threads=4):
    def evaluate(index, row):
        result = agent(row["question"])
        evaluation.log("sample_metric", index=index, data={"response": result}, score=1)
    evaluation.submit(evaluate, index, row)

2025-05-24 09:17:59,631 - langwatch.client - INFO - Registering atexit handler to flush tracer provider on exit
Follow the results at: http://localhost:5560/inbox-narrator/experiments/my-incredible-experiment?runId=flawless-conscious-lori


Evaluating:   0%|          | 0/6 [00:00<?, ?it/s]

Failed to detach context
Traceback (most recent call last):
  File "/Users/rchaves/Projects/langwatch-saas/langwatch/python-sdk/.venv/lib/python3.11/site-packages/pydantic/type_adapter.py", line 271, in _init_core_attrs
    self.core_schema = _getattr_no_parents(self._type, '__pydantic_core_schema__')
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/rchaves/Projects/langwatch-saas/langwatch/python-sdk/.venv/lib/python3.11/site-packages/pydantic/type_adapter.py", line 55, in _getattr_no_parents
    raise AttributeError(attribute)
AttributeError: __pydantic_core_schema__

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/rchaves/Projects/langwatch-saas/langwatch/python-sdk/.venv/lib/python3.11/site-packages/opentelemetry/context/__init__.py", line 155, in detach
    _RUNTIME_CONTEXT.detach(token)
  File "/Users/rchaves/Projects/langwatch-saas/langwatch/python-sdk/.venv/li
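
In the parallel loop, `index` and `row` are handed to `evaluation.submit` as arguments instead of being read from the enclosing loop inside `evaluate`. One likely reason is Python's late-binding closures: a function that reads a loop variable sees whatever value the variable has when the function finally runs. The sketch below illustrates this with the standard library's `ThreadPoolExecutor` as a stand-in. It is not LangWatch's implementation, and the `rows` list is made-up data.

```python
from concurrent.futures import ThreadPoolExecutor

rows = ["what is it?", "how do i use it?", "is it free?"]  # hypothetical stand-in data

# Closures created in a loop all read the loop variable's final value.
late_bound = [lambda: index for index, _ in enumerate(rows)]
print([f() for f in late_bound])  # every call returns the last index


def evaluate(index, row):
    # Arguments are bound when the task is submitted, so each task keeps its own pair.
    return index, row.upper()


with ThreadPoolExecutor(max_workers=4) as pool:
    futures = [pool.submit(evaluate, index, row) for index, row in enumerate(rows)]
    results = dict(future.result() for future in futures)

print(results)
```

Passing the values explicitly keeps each logged score attached to the row that produced it, regardless of the order in which threads finish.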

In [3]:
for index, row in df.iterrows():
    print(row.to_dict())

{'_is_copy': None, '_mgr': SingleBlockManager
Items: Index(['question', 'answer'], dtype='object')
NumpyBlock: 2 dtype: object, '_item_cache': {}, '_attrs': {}, '_flags': <Flags(allows_duplicate_labels=True)>, '_name': 0}
{'_is_copy': None, '_mgr': SingleBlockManager
Items: Index(['question', 'answer'], dtype='object')
NumpyBlock: 2 dtype: object, '_item_cache': {}, '_attrs': {}, '_flags': <Flags(allows_duplicate_labels=True)>, '_name': 1}


In [17]:
from dotenv import load_dotenv

load_dotenv(dotenv_path="langwatch/python-sdk/.env")

from langchain.prompts import ChatPromptTemplate

from langchain_community.document_loaders import WebBaseLoader
from langchain_community.vectorstores.faiss import FAISS
from langchain_core.vectorstores.base import VectorStoreRetriever
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain.tools.retriever import create_retriever_tool
from langchain.agents import AgentExecutor, create_tool_calling_agent
from langchain.tools import BaseTool, StructuredTool, tool
from langchain_core.documents import Document


loader = WebBaseLoader("https://docs.langwatch.ai")
docs = loader.load()
documents = RecursiveCharacterTextSplitter(
    chunk_size=1000, chunk_overlap=200
).split_documents(docs)

vector = FAISS.from_documents(documents, OpenAIEmbeddings())
retriever = vector.as_retriever()

retrieved_documents = []

# Wrap the FAISS retriever so that we can capture which documents were used to generate the response
@tool
def langwatch_search(
    query: str
) -> list[Document]:
    """Search for information about LangWatch. For any questions about LangWatch, use this tool if you haven't already."""

    global retrieved_documents
    retrieved_documents = retriever.get_relevant_documents(query)
    return retrieved_documents

tools = [langwatch_search]
model = ChatOpenAI(model="gpt-4o-mini")
prompt = ChatPromptTemplate.from_messages(
    [
        (
            "system",
            "You are a helpful assistant that only replies in short tweet-like responses. Use tools only once.\n\n{agent_scratchpad}",
        ),
        ("human", "{question}"),
    ]
)
agent = create_tool_calling_agent(model, tools, prompt)
executor = AgentExecutor(agent=agent, tools=tools, verbose=False)  # type: ignore

output = executor.invoke({"question": "What is LangWatch?"})["output"]

print("")
print("retrieved_documents:", [ d.page_content for d in retrieved_documents])
print("output:", output)


retrieved_documents: ['Introduction - LangWatchLangWatch home pageSearch...llms.txtSupportDashboardlangwatch/langwatchlangwatch/langwatchSearch...NavigationGet StartedIntroductionDocumentationOpen DashboardGitHub RepoGet StartedIntroductionSelf HostingCookbooksLLM ObservabilityOverviewConceptsLanguage APIs & SDKsUser EventsMonitoring & AlertsCode ExamplesLLM EvaluationOffline EvaluationReal-Time EvaluationList of EvaluatorsDatasetsAnnotationsLLM DevelopmentPrompt Optimization StudioDSPy VisualizationLangWatch MCPPrompt VersioningAPI EndpointsTracesPromptsAnnotationsDatasetsSupportTroubleshooting and SupportStatus PageGet StartedIntroductionCopy pageWelcome to LangWatch, the all-in-one open-source LLMops platform.LangWatch allows you to track, monitor, guardrail and evaluate your LLMs apps for measuring quality and alert on issues.\nFor domain experts, it allows you to easily sift through conversations, see topics being discussed and annotate and score messages', 'For domain experts, i

## Step 2: Run the Offline Evaluation

With a dataset in hand, either built inline or retrieved from LangWatch, we can run a batch evaluation experiment through our LLM pipeline, inspect the results, and tune the pipeline based on them.

In [None]:
import langwatch
import pandas as pd

# Create a dataset
df = pd.DataFrame(
    [
        {
            "question": "What is LangWatch?",
            "answer": "LangWatch is a platform for evaluating and improving language models.",
        },
        {
            "question": "How do I use LangWatch?",
            "answer": "You can use LangWatch by installing the LangWatch SDK and then calling the LangWatch API.",
        },
        {
            "question": "Does LangWatch support multiple language models?",
            "answer": "Yes, LangWatch is compatible with all language models by using LiteLLM under the hood.",
        },
        {
            "question": "Can I visualize evaluation metrics in LangWatch?",
            "answer": "Yes, LangWatch provides dashboards for visualizing key evaluation metrics.",
        },
        {
            "question": "Is there a free tier for LangWatch?",
            "answer": "LangWatch offers a free tier with limited usage, ideal for small projects and evaluation.",
        },
        {
            "question": "Where can I find documentation for LangWatch?",
            "answer": "You can find the official documentation on the LangWatch website or GitHub repository.",
        },
    ]
)
# Or retrieve it from LangWatch:
# df = langwatch.dataset.get_dataset("CEtFivQeya4kyAzy9eJht").to_pandas()  # dataset--rSAYL4HxQRXHSayq6c7A


evaluation = langwatch.evaluation.init("my-incredible-experiment")

for index, row in evaluation.loop(df.iterrows()):
    response = executor.invoke({"question": row["question"]})["output"]

    evaluation.run(
        "ragas/faithfulness",
        index=index,
        data={
            "input": row["question"],
            "output": response,
            "contexts": [d.page_content for d in retrieved_documents],
        },
        settings={
            "model": "openai/gpt-4o-mini",
            "max_tokens": 2048,
            "autodetect_dont_know": True,
        },
    )

Follow the results at: http://localhost:5560/inbox-narrator/experiments/my-incredible-experiment?runId=aloof-simple-mammoth


Evaluating:   0%|          | 0/12 [00:00<?, ?it/s]

NameError: name 'executor' is not defined

In [None]:
from dotenv import load_dotenv

load_dotenv()

from langwatch.batch_evaluation import BatchEvaluation, DatasetEntry


def callback(entry: DatasetEntry):
    output = executor.invoke({"question": entry["question"]})["output"]

    return {"input": entry["question"], "output": output, "contexts": [d.page_content for d in retrieved_documents]}

# Instantiate the BatchEvaluation object
evaluation = BatchEvaluation(
    experiment="LangWatch RAG Experiment",
    dataset="dataset--rSAYL4HxQRXHSayq6c7A",
    evaluations=["jailbreak-detection", "faithfulness"],
    callback=callback,
)

# Run the evaluation
results = evaluation.run()
results.df

Starting batch evaluation...
Follow the results at: https://app.langwatch.ai/public-documentation-examples-YxS3Bf/experiments/langwatch-rag-experiment?runId=mysterious-peculiar-cricket


100%|██████████| 10/10 [00:45<00:00,  4.53s/it]

Batch evaluation done!





| Unnamed: 0 | question | input | output | contexts | jailbreak-detection | faithfulness |
|---|---|---|---|---|---|---|
| 0 | Can I customize the evaluation metrics in Lang... | Can I customize the evaluation metrics in Lang... | Yes, you can customize evaluation metrics in L... | [Introduction - LangWatchLangWatch home pageSe... | True | 0.0 |
| 1 | What programming languages are supported by La... | What programming languages are supported by La... | LangWatch supports Python and TypeScript for i... | [Introduction - LangWatchLangWatch home pageSe... | True | 1.0 |
| 2 | How do I configure alerts in LangWatch? | How do I configure alerts in LangWatch? | Check the LangWatch documentation for configur... | [Introduction - LangWatchLangWatch home pageSe... | True | 0.0 |
| 3 | Does it support langchain? | Does it support langchain? | Yes, LangWatch supports LangChain! You can tra... | [Introduction - LangWatchLangWatch home pageSe... | True | 0.666667 |
| 4 | How does LangWatch help in LLMOps? | How does LangWatch help in LLMOps? | LangWatch aids in LLMOps by tracking, monitori... | [Introduction - LangWatchLangWatch home pageSe... | True | 0.888889 |
| 5 | How can I visualize the traces collected by La... | How can I visualize the traces collected by La... | Check out the LangWatch documentation for guid... | [Introduction - LangWatchLangWatch home pageSe... | True | 0.0 |
| 6 | Is it possible to automate evaluations with La... | Is it possible to automate evaluations with La... | Yes, LangWatch allows for batch evaluations an... | [Introduction - LangWatchLangWatch home pageSe... | True | 0.666667 |
| 7 | What programming languages are supported by La... | What programming languages are supported by La... | Agent stopped due to max iterations. | [Introduction - LangWatchLangWatch home pageSe... | True | 0.0 |
| 8 | How do I set up LangWatch in my python app? | How do I set up LangWatch in my python app? | Check out the [Python Integration Guide](https... | [Introduction - LangWatchLangWatch home pageSe... | True | 0.0 |
| 9 | What are the best practices for using LangWatc... | What are the best practices for using LangWatc... | 1. **Integration**: Follow Python, TypeScript,... | [Introduction - LangWatchLangWatch home pageSe... | True | 1.0 |
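
To read the results at a glance, a short summary helps. The sketch below copies the ten `faithfulness` values from the table above into a plain list and reports the mean and the rows that scored zero. Working from a hard-coded list rather than `results.df` is an assumption made to keep the example self-contained.

```python
from statistics import mean

# Faithfulness scores copied from the results table above (rows 0 to 9).
faithfulness = [0.0, 1.0, 0.0, 0.666667, 0.888889, 0.0, 0.666667, 0.0, 0.0, 1.0]

average = mean(faithfulness)
# Rows with a zero score are the first candidates for manual inspection.
zero_rows = [index for index, score in enumerate(faithfulness) if score == 0.0]

print(f"mean faithfulness: {average:.3f}")
print("rows scoring 0.0:", zero_rows)
```

Half of the rows score zero, including row 7, where the agent stopped at its iteration limit.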

