# LangWatch Batch Evaluation Cookbook

## Step 1: Define our LLM pipeline

Let's create a simple RAG pipeline using LangChain, making sure we can access both the generated output and the retrieved documents used during generation.

In [1]:
from dotenv import load_dotenv

load_dotenv(dotenv_path="langwatch/python-sdk/.env")

from langchain.prompts import ChatPromptTemplate

from langchain_community.document_loaders import WebBaseLoader
from langchain_community.vectorstores.faiss import FAISS
from langchain_core.vectorstores.base import VectorStoreRetriever
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain.tools.retriever import create_retriever_tool
from langchain.agents import AgentExecutor, create_tool_calling_agent
from langchain.tools import BaseTool, StructuredTool, tool
from langchain_core.documents import Document


loader = WebBaseLoader("https://docs.langwatch.ai")
docs = loader.load()
documents = RecursiveCharacterTextSplitter(
    chunk_size=1000, chunk_overlap=200
).split_documents(docs)

vector = FAISS.from_documents(documents, OpenAIEmbeddings())
retriever = vector.as_retriever()

retrieved_documents = []

# Wrap the FAISS retriever so that we can capture which documents were used to generate the response
@tool
def langwatch_search(
    query: str
) -> list[Document]:
    """Search for information about LangWatch. For any questions about LangWatch, use this tool if you haven't already."""

    global retrieved_documents
    # `invoke` replaces the deprecated `get_relevant_documents`
    retrieved_documents = retriever.invoke(query)
    return retrieved_documents

tools = [langwatch_search]
model = ChatOpenAI(model="gpt-4o-mini")
prompt = ChatPromptTemplate.from_messages(
    [
        (
            "system",
            "You are a helpful assistant that only replies in short tweet-like responses, use tools only once.\n\n{agent_scratchpad}",
        ),
        ("human", "{question}"),
    ]
)
agent = create_tool_calling_agent(model, tools, prompt)
executor = AgentExecutor(agent=agent, tools=tools, verbose=False)  # type: ignore

output = executor.invoke({"question": "What is LangWatch?"})["output"]

print("")
print("retrieved_documents:", [d.page_content for d in retrieved_documents])
print("output:", output)

retrieved_documents: ['Introduction - LangWatchLangWatch home pageSearch...llms.txtSupportDashboardlangwatch/langwatchlangwatch/langwatchSearch...NavigationGet StartedIntroductionDocumentationOpen DashboardGitHub RepoGet StartedIntroductionSelf HostingCookbooksLLM ObservabilityOverviewConceptsLanguage APIs & SDKsUser EventsMonitoring & AlertsCode ExamplesLLM EvaluationOffline EvaluationReal-Time EvaluationList of EvaluatorsDatasetsAnnotationsLLM DevelopmentPrompt Optimization StudioDSPy VisualizationLangWatch MCPPrompt VersioningAPI EndpointsTracesPromptsAnnotationsDatasetsSupportTroubleshooting and SupportStatus PageGet StartedIntroductionCopy pageWelcome to LangWatch, the all-in-one open-source LLMops platform.LangWatch allows you to track, monitor, guardrail and evaluate your LLMs apps for measuring quality and alert on issues.\nFor domain experts, it allows you to easily sift through conversations, see topics being discussed and annotate and score messages', 'For domain experts, i
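
The pipeline above captures the retrieved documents in a module-level global. As a minimal sketch of an alternative, the wrapper below keeps the captured documents on an object that can be reset between questions. `CapturingRetriever` and `fake_retrieve` are hypothetical names used only for illustration, and the in-memory corpus stands in for the FAISS retriever so that the example runs without network access or API keys.

```python
from typing import Callable, List


class CapturingRetriever:
    """Hypothetical helper: wraps a retrieval function and remembers its last results."""

    def __init__(self, retrieve: Callable[[str], List[str]]):
        self._retrieve = retrieve
        self.last_documents: List[str] = []

    def reset(self) -> None:
        # Forget documents captured for a previous question
        self.last_documents = []

    def __call__(self, query: str) -> List[str]:
        # Run the wrapped retrieval and record what it returned
        self.last_documents = self._retrieve(query)
        return self.last_documents


def fake_retrieve(query: str) -> List[str]:
    # Stand-in for a real retriever: case-insensitive substring match over a tiny corpus
    corpus = ["LangWatch is an LLMOps platform.", "FAISS is a vector index."]
    return [doc for doc in corpus if query.lower() in doc.lower()]


search = CapturingRetriever(fake_retrieve)
search("langwatch")
contexts_first = list(search.last_documents)

search.reset()
contexts_second = list(search.last_documents)

print(contexts_first)
print(contexts_second)
```

Resetting before each question means that an answer produced without any retrieval reports an empty context list instead of the documents from the previous question.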

## Step 2: Run the Batch Evaluation Experiment

Now we can use the dataset we have on LangWatch to run a batch evaluation experiment through our LLM pipeline, then inspect the results and tweak the pipeline to improve it.

In [2]:
from dotenv import load_dotenv

load_dotenv()

from langwatch.batch_evaluation import BatchEvaluation, DatasetEntry


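def callback(entry: DatasetEntry):
    # Clear the documents captured for the previous entry, so that stale contexts
    # are not reported if the agent answers without calling the search tool
    global retrieved_documents
    retrieved_documents = []

    output = executor.invoke({"question": entry["question"]})["output"]

    return {
        "input": entry["question"],
        "output": output,
        "contexts": [d.page_content for d in retrieved_documents],
    }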

# Instantiate the BatchEvaluation object
evaluation = BatchEvaluation(
    experiment="LangWatch RAG Experiment",
    dataset="dataset--rSAYL4HxQRXHSayq6c7A",
    evaluations=["jailbreak-detection", "faithfulness"],
    callback=callback,
)

# Run the evaluation
results = evaluation.run()
results.df

Starting batch evaluation...
Follow the results at: https://app.langwatch.ai/demo/experiments/langwatch-rag-experiment?runId=electric-dashing-chameleon


100%|██████████| 10/10 [00:50<00:00,  5.05s/it]

Batch evaluation done!





| | question | input | output | contexts | jailbreak-detection | faithfulness |
|---|---|---|---|---|---|---|
| 0 | Can I customize the evaluation metrics in Lang... | Can I customize the evaluation metrics in Lang... | Yes, you can customize evaluation metrics in L... | [Introduction - LangWatchLangWatch home pageSe... | True | 0.0 |
| 1 | What programming languages are supported by La... | What programming languages are supported by La... | LangWatch supports Python, TypeScript, and RES... | [Introduction - LangWatchLangWatch home pageSe... | True | 1.0 |
| 2 | How do I configure alerts in LangWatch? | How do I configure alerts in LangWatch? | Check the LangWatch documentation for configur... | [Introduction - LangWatchLangWatch home pageSe... | True | 0.0 |
| 3 | What programming languages are supported by La... | What programming languages are supported by La... | LangWatch supports Python and TypeScript, alon... | [Introduction - LangWatchLangWatch home pageSe... | True | 1.0 |
| 4 | Does it support langchain? | Does it support langchain? | Yes, LangWatch supports LangChain! You can tra... | [Introduction - LangWatchLangWatch home pageSe... | True | 0.5 |
| 5 | How does LangWatch help in LLMOps? | How does LangWatch help in LLMOps? | LangWatch streamlines LLMOps by tracking, moni... | [Introduction - LangWatchLangWatch home pageSe... | True | 1.0 |
| 6 | How can I visualize the traces collected by La... | How can I visualize the traces collected by La... | You can visualize traces collected by LangWatc... | [Introduction - LangWatchLangWatch home pageSe... | True | 0.25 |
| 7 | Is it possible to automate evaluations with La... | Is it possible to automate evaluations with La... | Yes, LangWatch allows for automated evaluation... | [Introduction - LangWatchLangWatch home pageSe... | True | 1.0 |
| 8 | How do I set up LangWatch in my python app? | How do I set up LangWatch in my python app? | Check the [Python Integration Guide](https://d... | [Introduction - LangWatchLangWatch home pageSe... | True | 0.0 |
| 9 | What are the best practices for using LangWatc... | What are the best practices for using LangWatc... | Best practices for using LangWatch in producti... | [Introduction - LangWatchLangWatch home pageSe... | True | 0.133333 |
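
To decide what to tweak, it helps to summarize the scores and pick out the weakest answers. The sketch below does this in plain Python over the faithfulness scores from the run above. The 0.5 threshold and the `low_scoring` helper are assumptions made for illustration, not part of LangWatch.

```python
from statistics import mean

# Faithfulness scores copied from the results of the run above, keyed by row index.
# Assumption: with a live run, similar rows could be obtained from `results.df`
# if it is a pandas DataFrame, e.g. via `results.df.to_dict("records")`.
rows = [
    {"index": 0, "faithfulness": 0.0},
    {"index": 1, "faithfulness": 1.0},
    {"index": 2, "faithfulness": 0.0},
    {"index": 3, "faithfulness": 1.0},
    {"index": 4, "faithfulness": 0.5},
    {"index": 5, "faithfulness": 1.0},
    {"index": 6, "faithfulness": 0.25},
    {"index": 7, "faithfulness": 1.0},
    {"index": 8, "faithfulness": 0.0},
    {"index": 9, "faithfulness": 0.133333},
]


def low_scoring(rows, threshold=0.5):
    """Return the indices of rows whose faithfulness is below the (hypothetical) threshold."""
    return [row["index"] for row in rows if row["faithfulness"] < threshold]


average = mean(row["faithfulness"] for row in rows)
needs_review = low_scoring(rows)

print(f"average faithfulness: {average:.3f}")
print("rows below threshold:", needs_review)
```

The flagged rows are the ones worth inspecting first, for example by comparing each answer against its retrieved contexts in the LangWatch experiment page.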
