# LangWatch Evaluation Notebook

This notebook demonstrates how to evaluate LangWatch integration capabilities using a simple coding agent. It shows the process of:
1. Setting up LangWatch
2. Preparing test data
3. Creating a simple coding agent that adds LangWatch instrumentation to code
4. Evaluating the agent's performance using LangWatch's evaluation framework

## Introduction

LangWatch is an open-source LLMOps platform that helps monitor, evaluate, and improve LLM applications. This notebook shows how to use LangWatch to evaluate a simple coding agent that adds LangWatch instrumentation to existing code. 

## Setup

This section initializes LangWatch by logging in. You'll need to have a LangWatch account and API key set up.

In [1]:
import langwatch

langwatch.login()

LangWatch API key is already set, if you want to login again, please call as langwatch.login(relogin=True)


## Prepare the dataset

This section loads test cases from fixture files. Each test case consists of:
- An input Python file (without LangWatch instrumentation)
- An expected output file (with proper LangWatch instrumentation)

The code reads these files, wraps each one in a Python code fence, and creates a pandas DataFrame with the columns `test_case`, `input`, and `expected`. The test cases cover frameworks such as DSPy and LangChain in several configurations.

In [2]:
import pandas as pd
from pathlib import Path

fixtures_path = Path("./tests/fixtures")

data = []
for folder in fixtures_path.iterdir():
    input_files = list(folder.glob("*_input.py"))
    for input_file in input_files:
        case_name = input_file.stem.replace("_input", "")
        expected_file = folder / f"{case_name}_expected.py"

        with open(input_file, 'r', encoding='utf-8') as f:
            input_content = f.read()

        with open(expected_file, 'r', encoding='utf-8') as f:
            expected_content = f.read()

        data.append({
            'test_case': f"{folder.name}/{case_name}",
            'input': f"```python\n{input_content}\n```",
            'expected': f"```python\n{expected_content}\n```"
        })

df = pd.DataFrame(data)
df.head()

| | test_case | input | expected |
|---|---|---|---|
| 0 | dspy/dspy_bot | ```` ```python\nimport os\nfrom dotenv import load_... ```` | ```` ```python\nimport os\nfrom dotenv import load_... ```` |
| 1 | langchain/langchain_rag_bot | ```` ```python\nfrom dotenv import load_dotenv\n\nl... ```` | ```` ```python\nfrom dotenv import load_dotenv\n\nf... ```` |
| 2 | langchain/langchain_bot_with_memory | ```` ```python\nfrom dotenv import load_dotenv\nfro... ```` | ```` ```python\nfrom dotenv import load_dotenv\nfro... ```` |
| 3 | langchain/langchain_bot | ```` ```python\nfrom dotenv import load_dotenv\nfro... ```` | ```` ```python\nfrom dotenv import load_dotenv\nfro... ```` |
| 4 | langchain/langchain_rag_bot_vertex_ai | ```` ```python\nimport json\nimport os\nimport temp... ```` | ```` ```python\nimport json\nimport os\nimport temp... ```` |
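The loader assumes that every `*_input.py` file has a sibling `*_expected.py`; if one is missing, `open` raises `FileNotFoundError` partway through the loop. A minimal sketch of a pre-flight check follows. The helper name `find_unpaired_inputs` is hypothetical and not part of the original notebook; it reuses the same naming rule as the loader.

```python
from pathlib import Path


def find_unpaired_inputs(fixtures_path: Path) -> list[Path]:
    """Return input fixtures that have no matching *_expected.py file."""
    unpaired = []
    # Only look inside sub-directories, one per framework (e.g. dspy, langchain)
    for folder in sorted(p for p in fixtures_path.iterdir() if p.is_dir()):
        for input_file in sorted(folder.glob("*_input.py")):
            # Same naming rule as the loader: foo_input.py pairs with foo_expected.py
            case_name = input_file.stem.replace("_input", "")
            if not (folder / f"{case_name}_expected.py").exists():
                unpaired.append(input_file)
    return unpaired
```

Calling `find_unpaired_inputs(Path("./tests/fixtures"))` before building the DataFrame should return an empty list when the fixtures are complete.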


## Set up MCP calling

This section sets up the Model Context Protocol (MCP) client to communicate with the LangWatch documentation server. MCP allows the agent to access LangWatch documentation to learn how to properly instrument code.

The code:
1. Builds the MCP server if needed
2. Creates a function to call MCP tools
3. Fetches the LangWatch documentation index

In [3]:
import os

from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

# Build the MCP server once if the compiled output is missing
if not os.path.exists("dist"):
    !pnpm build

mcp_server_params = StdioServerParameters(
    command="node",
    args=["dist/index.js", "--apiKey", langwatch.get_api_key()], # type: ignore
)

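async def call_mcp_documentation_tool(tool_name: str, arguments: dict):
    # Spawn the MCP server over stdio for this call and close it afterwards
    async with stdio_client(mcp_server_params) as (read, write):
        async with ClientSession(read, write) as session:
            # Complete the MCP handshake before issuing tool calls
            await session.initialize()
            return await session.call_tool(tool_name, arguments)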

await call_mcp_documentation_tool("fetch_langwatch_docs", {})


CallToolResult(meta=None, content=[TextContent(type='text', text="# LangWatch\n\nThis is the full index of LangWatch documentation, to answer the user question, do not use just this file, first explore the urls that make sense using the markdown navigation links below to understand how to implement LangWatch and use specific features.\nAlways navigate to docs links using the .md extension for better readability.\n\n## Get Started\n\n- [Introduction](https://docs.langwatch.ai/introduction.md): Welcome to LangWatch, the all-in-one [open-source](https://github.com/langwatch/langwatch) LLMOps platform.\n\n### Self Hosting\n\n- [Overview](https://docs.langwatch.ai/self-hosting/overview.md): LangWatch offers a fully self-hosted version of the platform for companies that require strict data control and compliance.\n- [Docker Compose](https://docs.langwatch.ai/self-hosting/docker-compose.md): LangWatch is available as a Docker Compose setup for easy deployment on your local machine\n- [Docker 
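The index returned above is a markdown document whose navigation entries are ordinary markdown links. As a rough illustration of what the agent has to pick from, the sketch below pulls `(title, url)` pairs out of such text with a regular expression. The helper `extract_doc_links` is hypothetical and not part of the notebook or the MCP server; the sample string is copied from the output above.

```python
import re

# Matches markdown links such as [Title](https://example.com/page.md)
MARKDOWN_LINK = re.compile(r"\[([^\]]+)\]\((https?://[^)\s]+)\)")


def extract_doc_links(index_text: str) -> list[tuple[str, str]]:
    """Return (title, url) pairs for links ending in .md, in order, without duplicates."""
    seen = set()
    links = []
    for title, url in MARKDOWN_LINK.findall(index_text):
        # The index asks readers to follow the .md versions of the pages
        if url.endswith(".md") and url not in seen:
            seen.add(url)
            links.append((title, url))
    return links


sample = (
    "- [Introduction](https://docs.langwatch.ai/introduction.md): Welcome to LangWatch, "
    "the all-in-one [open-source](https://github.com/langwatch/langwatch) LLMOps platform."
)
print(extract_doc_links(sample))
# prints [('Introduction', 'https://docs.langwatch.ai/introduction.md')]
```

The GitHub link is skipped because it does not end in `.md`.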

## Set up a simple agent

This section creates a simple coding agent that:
1. Takes input code without LangWatch instrumentation
2. Uses Claude Sonnet 4 to identify relevant documentation links
3. Fetches those documentation pages
4. Uses Claude again to implement LangWatch in the code based on the documentation
5. Returns the instrumented code

Key components:
- `LinksToFetch`: A Pydantic model to parse the LLM's output
- First LLM call: Identifies relevant documentation links
- Second LLM call: Implements LangWatch based on documentation and input code
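The first prompt asks for 1-3 links, but nothing in the agent enforces that limit or checks where the links point. A minimal sketch of such a guard is shown below, using only the standard library on a JSON payload shaped like `LinksToFetch`. The function `select_links` and the choice of `https://docs.langwatch.ai/` as the allowed prefix are assumptions made for illustration, based on the URLs visible in the documentation index.

```python
import json


def select_links(
    raw_json: str,
    allowed_prefix: str = "https://docs.langwatch.ai/",  # assumed docs host
    max_links: int = 3,
) -> list[str]:
    """Parse a {"links": [...]} payload, keep documentation URLs only, and cap the count."""
    payload = json.loads(raw_json)
    links = payload.get("links", []) if isinstance(payload, dict) else []
    selected = []
    for link in links:
        # Drop non-strings, off-site URLs, and repeats while preserving order
        if isinstance(link, str) and link.startswith(allowed_prefix) and link not in selected:
            selected.append(link)
    return selected[:max_links]
```

A guard like this could sit between the first LLM call and the documentation fetch, so that an unexpected response cannot trigger fetches of arbitrary URLs.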

In [12]:
import asyncio
import litellm
from pydantic import BaseModel

llms_text = (await call_mcp_documentation_tool("fetch_langwatch_docs", {})).content[0].text  # type: ignore


@langwatch.trace()
async def simple_coding_agent(input_code: str):
    class LinksToFetch(BaseModel):
        links: list[str]

    response = litellm.completion(
        model="anthropic/claude-sonnet-4-20250514",
        messages=[
            {
                "role": "system",
                "content": f"""
                <system>
                    You are LangWatch coding assistant for helping users implement LangWatch in a codebase.

                    Given LangWatch llms.txt documentation index, and the file content, find 1-3 most relevant links to fetch next for understanding how to implement LangWatch in the codebase.
                </system>

                <file>
                    {llms_text}
                </file>
            """,
            },
            {"role": "user", "content": input_code},
        ],
        response_format=LinksToFetch,
    )
    links = LinksToFetch.model_validate_json(response.choices[0].message.content).links  # type: ignore

    documentations = await asyncio.gather(
        *[
            call_mcp_documentation_tool("fetch_langwatch_docs", {"url": link})
            for link in links
        ]
    )
    documentations = [doc.content[0].text for doc in documentations]  # type: ignore
    documentations = "\n".join(documentations)

    response = litellm.completion(
        model="anthropic/claude-sonnet-4-20250514",
        messages=[
            {
                "role": "system",
                "content": f"""
                <system>
                    You are LangWatch coding assistant for helping users implement LangWatch in a codebase.

                    Given the LangWatch documentation below, and the user's code, implement LangWatch in the codebase.

                    Return the full updated code as a string with ```python marker at the beginning and ``` at the end, nothing else.
                </system>

                <file>
                    {documentations}
                </file>
            """,
            },
            {"role": "user", "content": input_code},
        ],
    )

    return response.choices[0].message.content or "<empty>"  # type: ignore


print("Implementing LangWatch for the first test case...")

result = await simple_coding_agent(df.iloc[0]["input"])

print(result)

Implementing LangWatch for the first test case...
```python
import os
from dotenv import load_dotenv

load_dotenv()

import chainlit as cl
import langwatch
import dspy

# Initialize LangWatch
langwatch.setup()

lm = dspy.LM("openai/gpt-5", api_key=os.environ["OPENAI_API_KEY"], temperature=1)

colbertv2_wiki17_abstracts = dspy.ColBERTv2(
    url="http://20.102.90.50:2017/wiki17_abstracts"
)

dspy.settings.configure(lm=lm, rm=colbertv2_wiki17_abstracts)


class GenerateAnswer(dspy.Signature):
    """Answer questions with careful explanations to the user."""

    context = dspy.InputField(desc="may contain relevant facts")
    question = dspy.InputField()
    answer = dspy.OutputField(desc="markdown formatted answer, use some emojis")


class RAG(dspy.Module):
    def __init__(self, num_passages=3):
        super().__init__()

        self.retrieve = dspy.Retrieve(k=num_passages)
        self.generate_answer = dspy.ChainOfThought(GenerateAnswer)

    def forward(self, question):
        c
```

## Set up the diff metric

This section creates a function to compare the agent's output with the expected output using a diff. The diff shows:
- Lines added (prefixed with +)
- Lines removed (prefixed with -)
- Unchanged lines

It also counts the changed lines, skipping the `+++`/`---` file headers and any added or removed line that contains only whitespace. This count serves as a simple metric for how far the agent's implementation is from the expected one.

In [53]:
import difflib

def file_diff(str1, str2, filename1="file1", filename2="file2"):
    lines1 = str1.replace("```python", "").replace("```", "").splitlines(keepends=True)
    lines2 = str2.replace("```python", "").replace("```", "").splitlines(keepends=True)

    # The input lines keep their newlines (keepends=True), so leave lineterm at its
    # default of "\n"; with lineterm="" the header lines would run into each other.
    diff = difflib.unified_diff(
        lines1, lines2,
        fromfile=filename1,
        tofile=filename2,
    )

    diff_lines = list(diff)
    diff_output = ''.join(diff_lines)

    # Count changed lines (lines starting with + or - but not +++ or --- headers)
    # Also ignore empty lines (whitespace-only lines)
    changed_count = sum(1 for line in diff_lines
                       if line.startswith(('+', '-'))
                       and not line.startswith(('+++', '---'))
                       and line[1:].strip())  # Check if content after +/- is not empty/whitespace

    return f"```diff\n{diff_output}```", changed_count

diff, lines_changed_count = file_diff(df.iloc[0]["expected"], result)
print(lines_changed_count, "lines changed")
print(diff)

10 lines changed
```diff
--- file1
+++ file2
@@ -5,11 +5,11 @@
 load_dotenv()
 
 import chainlit as cl
-
 import langwatch
-
 import dspy
 
+# Initialize LangWatch
+langwatch.setup()
 
 lm = dspy.LM("openai/gpt-5", api_key=os.environ["OPENAI_API_KEY"], temperature=1)
 
@@ -44,10 +44,10 @@
 @cl.on_message
 @langwatch.trace()
 async def main(message: cl.Message):
-    langwatch.get_current_trace().autotrack_dspy()
-    langwatch.get_current_trace().update(
-        metadata={"labels": ["dspy", "thread"], "thread_id": "90210"},
-    )
+    # Get the current LangWatch trace and enable DSPy autotracking
+    current_trace = langwatch.get_current_trace()
+    if current_trace:
+        current_trace.autotrack_dspy()
 
     msg = cl.Message(
         content="",
@@ -60,4 +60,3 @@
     await msg.update()
 
     return prediction.answer
-
```
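To see how the count behaves on a small input, the sketch below isolates the counting rule in a standalone function. The name `count_changed_lines` and the two sample strings are made up for illustration; the filtering logic mirrors the rule in `file_diff`.

```python
import difflib


def count_changed_lines(expected: str, actual: str) -> int:
    """Count added or removed non-blank lines in a unified diff, skipping file headers."""
    diff_lines = difflib.unified_diff(
        expected.splitlines(keepends=True),
        actual.splitlines(keepends=True),
    )
    return sum(
        1
        for line in diff_lines
        if line.startswith(("+", "-"))
        and not line.startswith(("+++", "---"))  # skip the two file header lines
        and line[1:].strip()  # skip whitespace-only additions and removals
    )


expected = "import langwatch\n\nlangwatch.setup()\nprint('hi')\n"
actual = "import langwatch\nlangwatch.setup()\nprint('hello')\n"
print(count_changed_lines(expected, actual))  # prints 2
```

The removed blank line is ignored, while the replaced `print` line counts twice, once as a removal and once as an addition. One limitation of the header check is that a removed line whose content starts with `--` (or an added line starting with `++`) is also skipped.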


## Evaluate

This section runs the evaluation process:
1. Initializes a LangWatch evaluation experiment
2. For each test case:
   - Runs the agent on the input code
   - Computes the diff between the agent's output and expected output
   - Uses LangWatch's evaluation framework to judge if the implementation is correct
   - Logs the diff and line change count as metrics

The evaluation results are available in the LangWatch dashboard, where you can analyze:
- Success rate across test cases
- Common patterns in failures
- Detailed diffs for each test case

This approach allows for systematic evaluation of the agent's ability to correctly
implement LangWatch instrumentation across different frameworks and configurations.
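The dashboard reports these aggregates, but the same numbers can be cross-checked locally if the per-case results are also collected in a list. The following is a minimal sketch under that assumption; the `summarize` helper, the record fields, and the sample records are all hypothetical and do not come from the run in this notebook.

```python
from statistics import mean


def summarize(records: list[dict]) -> dict:
    """Compute the pass rate and mean changed-line count from per-case results."""
    if not records:
        return {"cases": 0, "pass_rate": 0.0, "mean_lines_changed": 0.0}
    passed = sum(1 for r in records if r["passed"])
    return {
        "cases": len(records),
        "pass_rate": passed / len(records),
        "mean_lines_changed": mean(r["lines_changed"] for r in records),
    }


# Hypothetical records for illustration only; these are not results from the run above.
records = [
    {"test_case": "example/a", "passed": True, "lines_changed": 4},
    {"test_case": "example/b", "passed": False, "lines_changed": 12},
]
print(summarize(records))
```

Reading the pass rate next to the mean diff size helps separate cases that fail outright from cases that pass the LLM judgement but still drift from the expected code.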

In [51]:
evaluation = langwatch.evaluation.init("mcp-server-docs-setup")

for index, row in evaluation.loop(df.iterrows(), threads=4):

    async def evaluate(index, row):
        result = await simple_coding_agent(row["input"])
        diff, lines_changed_count = file_diff(row["expected"], result)

        evaluation.run(
            "langevals/llm_boolean",
            index=index,
            name="LLM Judgement",
            data={
                "input": f"""
                    <original>
                        {row['input']}
                    </original>
                    <result>
                        {result}
                    </result>
                    <expected>
                        {row['expected']}
                    </expected>
                """,
            },
            settings={
                "prompt": """
                    You are an LLM evaluator, do a quick check if the code in the <result /> tag which was generated by
                    an AI closely matches how the code should have been implemented as expected in the <expected /> tag.

                    Additional metadata capturing is fine and encouraged, but check if the instrumentation would likely
                    work the same.
                """,
            },
        )

        evaluation.log(
            "Diff", index=index, data={"diff": diff}, score=lines_changed_count
        )

    evaluation.submit(evaluate, index, row)

Follow the results at: https://app.langwatch.ai/inbox-narrator/experiments/mcp-server-docs-setup?runId=offbeat-yellow-dingo


Evaluating: 100%|██████████| 19/19 [03:52<00:00, 12.21s/it]


This notebook demonstrates how to use LangWatch to evaluate a code instrumentation agent. Key takeaways:
1. LangWatch can be used to trace and evaluate LLM-powered coding assistants
2. The evaluation framework allows for systematic testing across multiple examples
3. The diff line count gives a numeric score for each test case, and the LLM judgement adds a pass/fail check on whether the instrumentation would behave as expected
4. Results are tracked in the LangWatch dashboard for analysis