## LLM-as-Judge for Semantic Evaluation

### Motivation
Exact-match metrics are often too restrictive for evaluating generative systems, especially when multiple responses can be semantically correct. To address this, recent work has explored using large language models themselves as evaluators, assigning scores based on semantic correctness rather than string-level overlap. This notebook examines LLM-as-judge evaluation as a practical but imperfect tool for assessing open-ended model outputs.

The goal here is not to treat LLM judgments as ground truth, but to study when and how they provide useful signal, and where they may introduce new sources of error or bias.

> **Note 1:** This walkthrough focuses on testing the end-to-end performance of the system. It's also crucial to evaluate individual components. For instance, the retriever can be tested separately using standard information retrieval metrics (e.g., hit rate, MRR) to ensure it's fetching relevant documents effectively.
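
As a rough illustration of the retriever metrics mentioned in Note 1, the sketch below computes hit rate and mean reciprocal rank (MRR) over ranked lists of document IDs. The function names, document IDs, and the one-relevant-document-per-query setup are assumptions made for this example; they are not part of the notebook's pipeline.

```python
from typing import Sequence


def hit_rate(ranked_ids: Sequence[Sequence[str]], relevant_ids: Sequence[str], k: int = 4) -> float:
    """Fraction of queries whose relevant document appears in the top-k results."""
    hits = sum(1 for ranked, rel in zip(ranked_ids, relevant_ids) if rel in ranked[:k])
    return hits / len(relevant_ids)


def mean_reciprocal_rank(ranked_ids: Sequence[Sequence[str]], relevant_ids: Sequence[str]) -> float:
    """Average of 1/rank of the relevant document (0 when it was not retrieved)."""
    total = 0.0
    for ranked, rel in zip(ranked_ids, relevant_ids):
        if rel in ranked:
            total += 1.0 / (list(ranked).index(rel) + 1)
    return total / len(relevant_ids)


# Hypothetical retrieval results for three queries (document IDs are made up).
retrieved = [["d1", "d7", "d3"], ["d4", "d2", "d9"], ["d5", "d6", "d8"]]
relevant = ["d1", "d2", "d0"]

print(hit_rate(retrieved, relevant, k=3))
print(mean_reciprocal_rank(retrieved, relevant))
```

With real data, `retrieved` would hold the IDs returned by the retriever for each question, and `relevant` would come from a hand-labelled set of question-to-document mappings.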

> **Note 2:** If your knowledge base (the documents your system answers from) is constantly changing, your reference answers might become outdated. It's important to have a strategy to manage this, such as freezing the knowledge source during testing or regularly updating your evaluation dataset.

### Experimental Setup
We use a high-capability language model as a grader, prompting it to compare a system’s output against a reference answer and assign a correctness score. The evaluation is applied across a fixed dataset, allowing direct comparison with simpler baselines such as exact match.

The grading prompt is held constant across runs to reduce variability, but the evaluator model remains stochastic, reflecting a realistic deployment scenario where judgments are approximate rather than deterministic.

### What LLM-as-Judge Captures
LLM-based evaluators are able to detect:
- semantic equivalence despite surface-level variation,
- partially correct answers that capture core intent,
- answers that are fluent but factually inconsistent with the reference.

In this sense, LLM-as-judge aligns more closely with human intuition than strict lexical metrics and can substantially reduce false negatives introduced by exact match.
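
To make the false-negative problem concrete, here is a minimal sketch of a normalized exact-match baseline. The reference string is one of the reference answers from the dataset defined later in this notebook; the paraphrase and the helper names `normalize` and `exact_match` are invented for illustration.

```python
import re
import string


def normalize(text: str) -> str:
    """Lowercase, drop ASCII punctuation, and collapse whitespace."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", text).strip()


def exact_match(prediction: str, reference: str) -> bool:
    """True only when the normalized strings are identical."""
    return normalize(prediction) == normalize(reference)


reference = "LangSmith doesn't directly support moving projects between organizations."
# Made-up paraphrase that a human would accept as equivalent.
paraphrase = "Moving a project from one organization to another is not directly supported in LangSmith."

print(exact_match(reference, reference))   # True
print(exact_match(paraphrase, reference))  # False, although the meaning matches
```

Even with normalization, the paraphrase is scored as wrong, which is the kind of case a semantic judge is intended to recover.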

### Prerequisites and Setup

Configure the following LangSmith settings before running the notebook:

- **`LANGCHAIN_ENDPOINT`**: This URL tells LangChain to send all tracing data to the LangSmith platform.
- **`LANGCHAIN_API_KEY`**: This is your secret key for authenticating with LangSmith.
- **`project_name`**: (Optional) A name that lets you group related runs in LangSmith under a specific project. It's highly recommended for organization.

This tutorial uses OpenAI models, ChromaDB for the vector store, and LangChain for building the RAG chain. You will also need to set your OpenAI API key.

In [1]:
import os # Import the 'os' module to interact with the operating system's environment variables.

# Update with your API URL if using a hosted instance of LangSmith.
os.environ["LANGCHAIN_ENDPOINT"] = "https://api.smith.langchain.com" # Set the API endpoint for LangSmith.
os.environ["LANGCHAIN_API_KEY"] = "YOUR API KEY"  # Update with your personal LangSmith API key.
project_name = "YOUR PROJECT NAME"  # Update with a name for your LangSmith project.

The next cell installs the following packages:

- `langchain[openai]`: Installs the core LangChain library along with the specific integrations for OpenAI models.
- `chromadb`: The vector database we will use to store and retrieve document embeddings.
- `lxml`: A robust parser for HTML and XML, used by our document loader.
- `html2text`: A utility to convert HTML into clean, readable plain text.

In [2]:
# The '%pip install' command installs python packages. The '> /dev/null' part suppresses the output for a cleaner notebook.
# %pip install -U "langchain[openai]" > /dev/null
# %pip install chromadb > /dev/null
# %pip install lxml > /dev/null
# %pip install html2text > /dev/null

Set your OpenAI API Key. This is required to use OpenAI's embedding and language models. Replace `<YOUR-API-KEY>` with your actual key.

In [3]:
# The '%env' magic command sets an environment variable for the notebook session.
# %env OPENAI_API_KEY=<YOUR-API-KEY>

### Create a Dataset

For our Q&A system, the dataset will consist of question-answer pairs. The questions represent typical user queries, and the answers are the "ground truth" or reference responses we expect the system to provide.

For this example, we'll create a dataset about LangSmith documentation. We have hard-coded a few examples below. For a real-world scenario, it's best to have a much larger dataset (e.g., >100 examples) to get statistically significant results. These examples should ideally be sourced from real user interactions to ensure they are representative.

In [4]:
# We define a list of tuples, where each tuple contains a (question, answer) pair.
examples = [
    (
        "What is LangChain?", # This is the input question.
        "LangChain is an open-source framework for building applications using large language models. It is also the name of the company building LangSmith.", # This is the reference answer.
    ),
    (
        "How might I query for all runs in a project?",
        "client.list_runs(project_name='my-project-name'), or in TypeScript, client.listRuns({projectName: 'my-project-name'})",
    ),
    (
        "What's a langsmith dataset?",
        "A LangSmith dataset is a collection of examples. Each example contains inputs and optional expected outputs or references for that data point.",
    ),
    (
        "How do I use a traceable decorator?",
        """The traceable decorator is available in the LangSmith Python SDK. To use, configure your environment with your API key, \
import the required function, decorate your function, and then call the function. Below is an example:
```python
from langsmith.run_helpers import traceable
@traceable(run_type="chain") # or "llm", etc.
def my_function(input_param):
    # Function logic goes here
    return output
result = my_function(input_param)
```""",
    ),
    (
        "Can I trace my Llama V2 llm?",
        "So long as you are using one of LangChain's LLM implementations, all your calls can be traced",
    ),
    (
        "Why do I have to set environment variables?",
        "Environment variables can tell your LangChain application to perform tracing and contain the information necessary to authenticate to LangSmith."
        " While there are other ways to connect, environment variables tend to be the simplest way to configure your application.",
    ),
    (
        "How do I move my project between organizations?",
        "LangSmith doesn't directly support moving projects between organizations.",
    ),
]

Now, let's create a LangSmith client, which is our main entry point for interacting with the LangSmith platform.

In [6]:
from langsmith import Client # Import the Client class from the langsmith library.

client = Client() # Instantiate the client. It will automatically use the environment variables we set earlier.

Using the client, we will programmatically create a new dataset in LangSmith and populate it with our examples. We add a unique identifier (`uuid`) to the dataset name to prevent naming conflicts if we run this notebook multiple times.

In [7]:
import uuid # Import the uuid library to generate unique identifiers.

# Define a unique name for the dataset using a UUID to avoid collisions.
dataset_name = f"Retrieval QA Questions {str(uuid.uuid4())}"
# Create the dataset on the LangSmith platform and get back a dataset object.
dataset = client.create_dataset(dataset_name=dataset_name)
# Loop through our list of hard-coded question-answer pairs.
for q, a in examples:
    # For each pair, create an example in our LangSmith dataset.
    client.create_example(
        inputs={"question": q}, # The input dictionary must have keys that match what our chain expects.
        outputs={"answer": a}, # The output dictionary contains the ground truth reference answer.
        dataset_id=dataset.id # We specify which dataset to add this example to.
    )

### Define the RAG Q&A System

We are using a **Retrieval-Augmented Generation (RAG)** architecture:

1.  **Retrieval**: Given a user's question, the system first retrieves relevant information from a knowledge base. In our case, this knowledge base is the LangSmith documentation. This stage consists of:
    -   An **Embedding Model** (`OpenAIEmbeddings`): Converts both the documents and the user's question into numerical vectors (embeddings).
    -   A **Vector Store** (`Chroma`): A specialized database that stores the document vectors and allows for efficient searching to find vectors (and thus documents) that are most similar to the question vector.
    -   A **Retriever**: The component that orchestrates the search in the vector store and returns the most relevant documents.

2.  **Generation**: The retrieved documents are then passed to an LLM, along with the original question, to generate a final, synthesized answer. This stage consists of:
    -   A **Prompt Template** (`ChatPromptTemplate`): Structures the input for the LLM, combining the retrieved context and the user's question with instructions on how to answer.
    -   An **LLM** (`ChatOpenAI`): The language model that reads the prompt and generates the textual response.

We will use LangChain Expression Language (LCEL) to elegantly combine these components into a single, executable chain.

In [8]:
from langchain_community.document_loaders import RecursiveUrlLoader # A loader for recursively scraping a website.
from langchain_community.document_transformers import Html2TextTransformer # A transformer to convert HTML content to plain text.
from langchain_community.vectorstores import Chroma # The Chroma vector store implementation.
from langchain_text_splitters import TokenTextSplitter # A text splitter that splits based on token count.
from langchain_openai import OpenAIEmbeddings # The class for using OpenAI's embedding models.

# Initialize a loader to fetch all documents from the LangSmith documentation website.
api_loader = RecursiveUrlLoader("https://docs.smith.langchain.com")
# Initialize a text splitter to break large documents into smaller chunks.
text_splitter = TokenTextSplitter(
    model_name="gpt-3.5-turbo", # The model used to count tokens for splitting.
    chunk_size=2000, # The maximum size of each chunk in tokens.
    chunk_overlap=200, # The number of tokens to overlap between consecutive chunks.
)
# Initialize a transformer to clean up the raw HTML.
doc_transformer = Html2TextTransformer()
# Load the raw documents from the URL.
raw_documents = api_loader.load()
# Transform the raw HTML documents into plain text.
transformed = doc_transformer.transform_documents(raw_documents)
# Split the transformed documents into smaller, manageable chunks.
documents = text_splitter.split_documents(transformed)



With the documents processed, we can now create the vector store and the retriever. The vector store will embed and index our document chunks, and the retriever will provide the interface for searching them.

In [9]:
# Initialize the OpenAI embeddings model.
embeddings = OpenAIEmbeddings()
# Create a Chroma vector store from the documents, using the OpenAI embeddings model.
vectorstore = Chroma.from_documents(documents, embeddings)
# Create a retriever from the vector store, configured to return the top 4 most relevant documents.
retriever = vectorstore.as_retriever(search_kwargs={"k": 4})

In [10]:
from datetime import datetime # Import datetime to include the current time in the prompt.
from langchain_core.output_parsers import StrOutputParser # Import the string output parser.
from langchain_core.prompts import ChatPromptTemplate # Import the chat prompt template class.
from langchain_openai import ChatOpenAI # Import the OpenAI chat model class.

# Define the prompt template. This structures the input for the LLM.
prompt = ChatPromptTemplate.from_messages(
    [
        (
            "system", # The system message provides high-level instructions to the model.
            "You are a helpful documentation Q&A assistant, trained to answer"
            " questions from LangSmith's documentation."
            " LangChain is a framework for building applications using large language models."
            "\nThe current time is {time}.\n\nRelevant documents will be retrieved in the following messages.",
        ),
        ("system", "{context}"), # A placeholder for the retrieved documents (context).
        ("human", "{question}"), # A placeholder for the user's question.
    ]
).partial(time=str(datetime.now())) # Pre-fill the 'time' variable once, with the time at which the prompt is defined.

# Initialize the LLM we'll use for generation. We use gpt-3.5-turbo with a large context window and low temperature for factual answers.
model = ChatOpenAI(model="gpt-3.5-turbo-16k", temperature=0)
# Define the response generator part of the chain using LCEL. It pipes the prompt to the model, then to an output parser.
response_generator = prompt | model | StrOutputParser()

In [11]:
# The full chain combines the retriever and the response generator.
from operator import itemgetter # Import itemgetter for convenient data routing.

chain = (
    # A Runnable Map takes the input dictionary and prepares a new dictionary for the next step.
    {
        # The 'context' key is populated by a sub-chain: get the question, pass it to the retriever, and format the resulting documents.
        "context": itemgetter("question")
        | retriever
        | (lambda docs: "\n".join([doc.page_content for doc in docs])),
        # The 'question' key is passed through directly from the input.
        "question": itemgetter("question"),
    }
    | response_generator # The output of the map is piped into our response generator chain.
)

In [12]:
# We stream the output of the chain for a sample question.
for tok in chain.stream({"question": "How do I log user feedback to a run?"}):
    print(tok, end="", flush=True) # Print each token as it is generated for a real-time effect.

To log user feedback to a run in LangSmith, you can use the `create_feedback` method provided by the LangSmith client. Here's an example of how to log user feedback using the Python client:

```python
from langsmith import Client

client = Client()

# Specify the run ID and feedback key
run_id = "<run_id>"
feedback_key = "thumbs_up"

# Log the feedback
client.create_feedback(
    run_id,
    feedback_key,
    score=True
)
```

In this example, we log a "thumbs up" feedback for a specific run by calling `create_feedback` with the run ID, feedback key, and a score of `True`. You can customize the feedback by providing additional optional fields such as `value`, `correction`, `comment`, `source_info`, and `feedback_source_type`.

You can also log feedback using the LangSmith client in TypeScript. Here's an example:

```typescript
import { Client } from "langsmith";

const client = new Client();

// Specify the run ID and feedback key
const runId = "<run_id>";
const feedbackKey = "thumbs_u
```

[output truncated]

### Evaluate the Chain

We will use one of LangSmith's built-in, LLM-assisted evaluators called `"qa"`. This evaluator is specifically designed for Q&A tasks. For each example in our dataset, it will:
1.  Receive the generated answer from our RAG chain.
2.  Receive the reference answer from our dataset.
3.  Use an LLM to determine if the generated answer is a "correct" answer based on the reference. It returns a binary score (1 for correct, 0 for incorrect).
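
The three steps above can be sketched in plain Python. This is an illustration of the general pattern only: the template text, the helper names, and the `fake_judge` stand-in are assumptions for this example and are not the prompt or code that LangSmith's `"qa"` evaluator actually uses.

```python
from typing import Callable

# Hypothetical grading template; the built-in evaluator's real prompt may differ.
GRADING_TEMPLATE = """You are grading a student's answer against a reference answer.
Question: {question}
Reference answer: {reference}
Student answer: {prediction}
Reply with a single word: CORRECT or INCORRECT."""


def build_grading_prompt(question: str, reference: str, prediction: str) -> str:
    """Fill the template with the question, the reference, and the generated answer."""
    return GRADING_TEMPLATE.format(question=question, reference=reference, prediction=prediction)


def parse_grade(judge_reply: str) -> int:
    """Map the judge's free-text reply to 1 (correct) or 0 (incorrect)."""
    verdict = judge_reply.strip().upper()
    # Check INCORRECT first, because the word "INCORRECT" contains "CORRECT".
    if "INCORRECT" in verdict:
        return 0
    if "CORRECT" in verdict:
        return 1
    raise ValueError(f"Unparseable judge reply: {judge_reply!r}")


def grade(question: str, reference: str, prediction: str, judge: Callable[[str], str]) -> int:
    """Send the grading prompt to the judge and convert its reply to a binary score."""
    return parse_grade(judge(build_grading_prompt(question, reference, prediction)))


def fake_judge(prompt: str) -> str:
    """Stand-in so the sketch runs offline; replace with a real LLM call in practice."""
    return "CORRECT"


score = grade("What is LangChain?", "An open-source LLM framework.", "A framework for LLM apps.", fake_judge)
print(score)
```

Raising on an unparseable reply, rather than defaulting to 0, keeps judge formatting failures visible instead of silently counting them as wrong answers.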

We configure this using the `RunEvalConfig` object.

In [13]:
from langchain.smith import RunEvalConfig # Import the evaluation configuration class.

# Create an evaluation configuration object.
eval_config = RunEvalConfig(
    # Specify the evaluators to use. 'qa' is a built-in evaluator for question-answering correctness.
    evaluators=["qa"],
    # You can optionally configure the LLM used for evaluation if you want to use a different model.
    # eval_llm=ChatAnthropic(model="claude-2", temperature=0)
)

Now we execute the evaluation. The `client.arun_on_dataset` method orchestrates the entire process: it iterates through each example in our dataset, runs our RAG chain on the input question, and then applies the `qa` evaluator to score the result. Because the method is asynchronous, it must be awaited, and it can run evaluations concurrently for greater efficiency.

In [14]:
# Asynchronously run the evaluation on the dataset.
_ = await client.arun_on_dataset(
    dataset_name=dataset_name, # The name of the dataset to test against.
    llm_or_chain_factory=lambda: chain, # A function that returns an instance of the chain to be tested.
    evaluation=eval_config, # The evaluation configuration we defined earlier.
)

View the evaluation results for project 'test-virtual-attitude-2' at:
https://smith.langchain.com/o/9a6371ef-ea6a-4860-b3bd-9614084873e7/projects/p/b539e1db-d7db-4da7-87c3-d9087ed5d0b9
[------------------------------------------------->] 7/7

### Analyzing the Results in LangSmith

As the test progresses, you can click the link printed above to go to the LangSmith project. There, you can see real-time results, including the chain's outputs, the feedback scores from the evaluator, and detailed traces for each run.

To find problematic examples, you can filter the results. For example, to see all the runs that the `qa` evaluator marked as incorrect, you can filter for `"Correctness==0"`.

### Diagnosing and Fixing the Error
In this example, one of the traces was marked as "incorrect". By inspecting the trace, we might find that the model is "hallucinating", that is, asserting something the retrieved documents do not support.

To fix the hallucination, we can try making the prompt more robust, for example by appending a system message such as:

> Respond as best as you can. If no documents are retrieved or if you do not see an answer in the retrieved documents, admit you do not know or that you don't see it being supported at the moment.

After adding this message to the prompt in the LangSmith Playground and resubmitting, we can see if the model's behavior improves.

The new prompt seems to fix the issue for this specific example. However, we need to ensure this change doesn't negatively affect other examples (i.e., we're not overfitting to a single failure case). The next step is to re-run the entire evaluation with our improved chain.

### Iterate and Re-Evaluate

Evaluation is not a one-time event; it's a cycle. Below, we define a new RAG chain (`chain_2`) that includes the improved prompt with the added system message to discourage hallucination.

In [15]:
# Define the new, improved prompt template.
prompt = ChatPromptTemplate.from_messages(
    [
        (
            "system",
            "You are a helpful documentation Q&A assistant, trained to answer"
            " questions from LangSmith's documentation."
            "\nThe current time is {time}.\n\nRelevant documents will be retrieved in the following messages.",
        ),
        ("system", "{context}"),
        ("human", "{question}"),
        # Add the new system message here to make the model more cautious:
        (
            "system",
            "Respond as best as you can. If no documents are retrieved or if you do not see an answer in the retrieved documents,"
            " admit you do not know or that you don't see it being supported at the moment.",
        ),
    ]
).partial(time=lambda: str(datetime.now())) # Use a lambda to get the current time dynamically for each run.

# Re-initialize the model and response generator with the new prompt.
model = ChatOpenAI(model="gpt-3.5-turbo-16k", temperature=0)
response_generator_2 = prompt | model | StrOutputParser()
# Assemble the second version of our RAG chain.
chain_2 = {
    "context": itemgetter("question")
    | retriever
    | (lambda docs: "\n".join([doc.page_content for doc in docs])),
    "question": itemgetter("question"),
} | response_generator_2

In [16]:
# Run the evaluation again with the updated chain factory.
_ = await client.arun_on_dataset(
    dataset_name=dataset_name, # Use the same dataset as before.
    llm_or_chain_factory=lambda: chain_2, # Point to the new, improved chain.
    evaluation=eval_config, # Use the same evaluation configuration.
)

View the evaluation results for project 'test-impressionable-lead-60' at:
https://smith.langchain.com/o/9a6371ef-ea6a-4860-b3bd-9614084873e7/projects/p/fc4f3319-1707-4035-b4e8-b3b3fafcf5b7
[------------------------------------------------->] 7/7
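
To check that a prompt change fixes a failure without breaking other examples, the per-example scores of the two runs can be compared side by side. The sketch below assumes the scores have already been collected into dictionaries keyed by example ID; the IDs and scores shown are placeholders, not the results of the runs above, and `compare_runs` is a hypothetical helper.

```python
from typing import Dict, List


def compare_runs(before: Dict[str, int], after: Dict[str, int]) -> Dict[str, List[str]]:
    """Classify each example by how its score changed between two runs."""
    report: Dict[str, List[str]] = {"fixed": [], "regressed": [], "unchanged": []}
    for example_id, old_score in before.items():
        new_score = after[example_id]
        if new_score > old_score:
            report["fixed"].append(example_id)
        elif new_score < old_score:
            report["regressed"].append(example_id)
        else:
            report["unchanged"].append(example_id)
    return report


# Placeholder scores for illustration only.
scores_chain = {"ex1": 1, "ex2": 0, "ex3": 1}
scores_chain_2 = {"ex1": 1, "ex2": 1, "ex3": 0}

report = compare_runs(scores_chain, scores_chain_2)
print(report)
```

A non-empty `regressed` list is the signal to look for: it means the new prompt traded one failure for another rather than improving the chain overall.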

### Limitations and Failure Modes
LLM-based evaluation introduces its own risks. The judge may:
- reward plausible but incorrect reasoning,
- overvalue fluency or verbosity,
- reflect its own biases or blind spots,
- vary across runs in subtle but consequential ways.

Importantly, the evaluator has no privileged access to the true causal process that generated the answer. As a result, agreement between a model and an LLM judge should not be conflated with faithfulness or correctness in a stronger sense.
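
One way to probe run-to-run variation is to grade the same answer several times and measure how often the verdicts agree. The following is a minimal sketch under that assumption; `flaky_judge` replays canned verdicts so the example is deterministic, and all names here are hypothetical.

```python
from collections import Counter
from typing import Callable, List


def repeated_verdicts(judge: Callable[[], int], n_trials: int = 5) -> List[int]:
    """Call the judge several times on the same (answer, reference) pair."""
    return [judge() for _ in range(n_trials)]


def majority_vote(verdicts: List[int]) -> int:
    """Most common verdict across trials."""
    return Counter(verdicts).most_common(1)[0][0]


def agreement_rate(verdicts: List[int]) -> float:
    """Share of trials that agree with the majority verdict."""
    return Counter(verdicts).most_common(1)[0][1] / len(verdicts)


# Canned verdicts standing in for repeated calls to a stochastic LLM judge.
_canned = iter([1, 1, 0, 1, 1])


def flaky_judge() -> int:
    return next(_canned)


verdicts = repeated_verdicts(flaky_judge, n_trials=5)
print(majority_vote(verdicts), agreement_rate(verdicts))
```

Examples with a low agreement rate are good candidates for manual review, since the judge's verdict on them depends on sampling noise rather than on the answer itself.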

### Role in a Broader Evaluation Framework
In this project, LLM-as-judge is treated as one signal among many rather than a definitive arbiter. Its outputs are most informative when compared against simpler baselines and more structured evaluations, such as trajectory analysis or component-wise testing.

By examining where LLM-as-judge agrees or disagrees with other metrics, we can better understand which types of errors each method surfaces—and which remain hidden.
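
A simple way to examine agreement between two metrics is to cross-tabulate their binary scores and list the examples on which they differ. The sketch below assumes both metrics have been computed per example; the IDs and scores are placeholders, and the helper names are hypothetical.

```python
from collections import Counter
from typing import List, Sequence, Tuple


def cross_tabulate(exact_match_scores: Sequence[int], judge_scores: Sequence[int]) -> Counter:
    """Count examples in each (exact_match, judge) cell."""
    return Counter(zip(exact_match_scores, judge_scores))


def disagreements(example_ids: Sequence[str], a: Sequence[int], b: Sequence[int]) -> List[str]:
    """IDs of examples where the two metrics give different scores."""
    return [ex for ex, x, y in zip(example_ids, a, b) if x != y]


# Placeholder scores for illustration only.
example_ids = ["q1", "q2", "q3", "q4", "q5"]
exact_match_scores = [1, 0, 0, 0, 1]
judge_scores = [1, 1, 0, 1, 1]

table = cross_tabulate(exact_match_scores, judge_scores)
print(table[(0, 1)])  # examples rejected by exact match but accepted by the judge
print(disagreements(example_ids, exact_match_scores, judge_scores))
```

The `(0, 1)` cell holds the cases the judge rescues from exact match, while a non-empty `(1, 0)` cell would point to a likely judge error, since an exact match with the reference should rarely be graded incorrect.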

## Discussion
LLM-as-judge offers a scalable and flexible way to evaluate generative systems, but it shifts the problem rather than solving it: evaluation becomes another modelling task with its own assumptions and failure modes. This notebook treats LLM-based evaluation as an object of study in its own right, highlighting both its utility and its limits within a safety-oriented evaluation pipeline.