## Dynamic Ground Truth for Evaluating Non-Stationary Tasks

### Motivation
Many evaluation setups implicitly assume that the ground truth is static. This assumption breaks down in settings where answers depend on external state: live databases, changing APIs, time-dependent facts, or evolving environments. In these cases, storing a fixed reference answer can be misleading, as a system may be penalised for producing a response that is correct at execution time but differs from an outdated label.

This notebook explores dynamic ground truth evaluation, where correctness is defined by executable logic rather than static annotations.

### Experimental Setup
Instead of storing answers directly, the dataset encodes ground truth as executable functions. At evaluation time, these functions are run to generate the correct answer based on the current state of the underlying data source.

This approach allows evaluation to remain aligned with the task as it actually exists at inference time, rather than with a snapshot of the world taken during dataset creation.
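
A minimal sketch of this idea, using only the standard library, is shown below. Each example pairs a question with a function, and the reference answer is computed when the evaluation runs. The `Example` class, the `resolve_reference` helper, and the in-memory `rows` list are hypothetical stand-ins for illustration; the notebook itself stores code snippets in a LangSmith dataset and queries a pandas DataFrame.

```python
from dataclasses import dataclass
from typing import Any, Callable

# Hypothetical stand-in for a live data source (a table of passenger records).
rows = [
    {"name": "A", "survived": 1},
    {"name": "B", "survived": 0},
    {"name": "C", "survived": 1},
]


@dataclass
class Example:
    question: str
    # The label is a function of the data source, not a stored answer.
    reference_fn: Callable[[list], Any]


examples = [
    Example("How many passengers are listed?", lambda data: len(data)),
    Example(
        "How many passengers survived?",
        lambda data: sum(r["survived"] for r in data),
    ),
]


def resolve_reference(example: Example, data: list) -> Any:
    """Compute the reference answer against the data as it exists right now."""
    return example.reference_fn(data)


before = [resolve_reference(e, rows) for e in examples]
rows.append({"name": "D", "survived": 1})  # the data source changes
after = [resolve_reference(e, rows) for e in examples]
print(before, after)
```

The same `examples` list yields different reference answers before and after the update, without any change to the dataset itself.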

### What Dynamic Ground Truth Captures
Dynamic evaluation is particularly effective for detecting:
- failures caused by stale knowledge,
- incorrect tool use when querying live systems,
- mismatches between model assumptions and current system state,
- fragile reliance on memorised facts instead of retrieval or computation.

By recomputing ground truth at runtime, this method separates failures of reasoning from failures of data freshness.
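
One way to make this separation concrete is to compare a prediction against both the old snapshot label and the freshly recomputed label. The sketch below is an assumption about how such a comparison could be organised, not part of this notebook's pipeline; the `classify` function and its three outcome names are hypothetical.

```python
def classify(prediction, stale_label, live_label):
    """Label an outcome by comparing a prediction with both references."""
    if prediction == live_label:
        return "correct"
    if prediction == stale_label:
        # Matches only the old snapshot: the system is working from outdated data.
        return "stale"
    # Matches neither reference: more likely a reasoning or tool-use error.
    return "other_error"


stale_label = 891   # passenger count recorded before the data changed
live_label = 1782   # passenger count recomputed after the rows were doubled

outcomes = {p: classify(p, stale_label, live_label) for p in (1782, 891, 900)}
print(outcomes)
```

Exact equality is used here only to keep the sketch short; free-text answers would need a more tolerant comparison, such as the LLM judge used later in this notebook.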

> **Note**: We use a simple CSV file and pandas DataFrame to simulate a dynamic data source. This is for illustrative purposes; in a real-world scenario, this could be a SQL database, a GraphQL API, or any other data source.

### Prerequisites and Setup

Set the following environment variables before running the notebook:

- **`LANGCHAIN_ENDPOINT`**: This URL tells LangChain to send all tracing data to the LangSmith platform.
- **`LANGCHAIN_API_KEY`**: This is your secret key for authenticating with LangSmith.

In [None]:
import os # Import the 'os' module to interact with the operating system's environment variables.

# Update with your API URL if using a hosted instance of LangSmith.
os.environ["LANGCHAIN_ENDPOINT"] = "https://api.smith.langchain.com" # Set the API endpoint for LangSmith.
os.environ["LANGCHAIN_API_KEY"] = "YOUR API KEY"  # Update with your personal LangSmith API key.

Install the packages used in this notebook:

- `langchain[openai]`: Installs the core LangChain library and integrations for OpenAI models.
- `langchain-experimental`: Provides the pandas dataframe agent imported later from `langchain_experimental`.
- `pandas`: The library for data manipulation and analysis, used here as our data source.

In [None]:
# The '%pip install' command installs Python packages. '> /dev/null' suppresses the output.
# %pip install -U "langchain[openai]" langchain-experimental > /dev/null
# %pip install pandas > /dev/null
# The '%env' magic command sets an environment variable for the notebook session.
# %env OPENAI_API_KEY=<YOUR-API-KEY>

### Create a Dataset with Dynamic References

We will use the classic Titanic dataset as our data source. The key difference in our approach is how we define the labels. Instead of calculating the answers beforehand and storing them as static values, we will store Python code snippets that can be executed on the DataFrame to get the correct answer.

This is the principle of **indirection** in action. The label is not the answer itself, but a *recipe for finding the answer*. This ensures that our evaluation always compares against the most up-to-date data.

In [1]:
# Define a list of tuples, where each tuple is a (question, code_snippet) pair.
questions = [
    ("How many passengers were on the Titanic?", "len(df)"),
    ("How many passengers survived?", "df['Survived'].sum()"),
    ("What was the average age of the passengers?", "df['Age'].mean()"),
    ("How many male and female passengers were there?", "df['Sex'].value_counts()"),
    ("What was the average fare paid for the tickets?", "df['Fare'].mean()"),
    ("How many passengers were in each class?", "df['Pclass'].value_counts()"),
    (
        "What was the survival rate for each gender?",
        "df.groupby('Sex')['Survived'].mean()",
    ),
    (
        "What was the survival rate for each class?",
        "df.groupby('Pclass')['Survived'].mean()",
    ),
    (
        "Which port had the most passengers embark from?",
        "df['Embarked'].value_counts().idxmax()",
    ),
    (
        "How many children under the age of 18 survived?",
        "df[df['Age'] < 18]['Survived'].sum()",
    ),
]

In [2]:
import uuid # Import the uuid library to generate unique identifiers.

from langsmith import Client # Import the Client class to interact with LangSmith.

client = Client() # Instantiate the LangSmith client.
# Define a unique name for the dataset using a short hex code from a UUID.
dataset_name = f"Dynamic Titanic CSV {uuid.uuid4().hex[:4]}"
# Create the dataset on the LangSmith platform.
dataset = client.create_dataset(
    dataset_name=dataset_name, # The name for the new dataset.
    description="Test QA over CSV", # An optional description for the dataset.
)

# Create all the examples in the dataset in a single API call for efficiency.
client.create_examples(
    # The inputs are a list of dictionaries, each with a 'question' key.
    inputs=[{"question": example[0]} for example in questions],
    # The outputs are a list of dictionaries, each with a 'code' key containing the reference snippet.
    outputs=[{"code": example[1]} for example in questions],
    dataset_id=dataset.id, # Link these examples to the dataset we just created.
)

### Define the Q&A System

With the dataset created, it's time to define our question-answering system. We'll use a pre-built LangChain component: the **pandas dataframe agent**. This agent is specifically designed to answer questions about a pandas DataFrame by generating and executing Python code.

First, we load the Titanic data into a DataFrame. Then, we create a constructor function for our agent that we can pass to the evaluator.

In [3]:
import pandas as pd # Import the pandas library for data manipulation.

# The URL of the raw CSV file for the Titanic dataset.
titanic_path = "https://raw.githubusercontent.com/jorisvandenbossche/pandas-tutorial/master/data/titanic.csv"
# Read the CSV data from the URL into a pandas DataFrame.
df = pd.read_csv(titanic_path)

Now, we define the `predict` function. This function will be our "system under test". For each run, it initializes a new pandas dataframe agent with our designated LLM and the current state of the DataFrame `df`. It then invokes the agent with the user's question.

In [16]:
from langchain_experimental.agents import create_pandas_dataframe_agent # Import the agent constructor.
from langchain_openai import ChatOpenAI # Import the OpenAI chat model wrapper.

# Initialize the LLM. We use a powerful model like GPT-4 for code generation tasks and set temperature to 0 for deterministic outputs.
llm = ChatOpenAI(model="gpt-4-turbo-preview", temperature=0.0)


# Define the function to be evaluated.
def predict(inputs: dict):
    # Inside the function, create an instance of the pandas dataframe agent.
    agent = create_pandas_dataframe_agent(agent_type="openai-tools", llm=llm, df=df)
    # Invoke the agent with the question from the input dictionary.
    return agent.invoke({"input": inputs["question"]})

In [17]:
# Run an example prediction to see the agent in action.
predict({"question": "How many passengers were on the Titanic?"})

{'input': 'How many passengers were on the Titanic?',
 'output': 'There were 891 passengers on the Titanic according to the dataframe.'}

### Run Evaluation with a Custom Evaluator

We need an evaluator that understands our dynamic labels. We'll create a custom evaluator by inheriting from `LabeledCriteriaEvalChain`. This base class is an LLM-powered evaluator that assesses a prediction based on a given criterion (e.g., "correctness") and a reference label.

Our customization is simple but powerful: we will override the `_get_eval_input` method. This method is responsible for preparing the inputs that get passed to the evaluator's LLM. In our overridden version, we will first call the parent method to get the standard inputs, and then we will **execute the `reference` value (our code snippet) using Python's `eval()` function**. This replaces the code snippet with its live result.

The result is that the evaluator's LLM never sees the code; it only sees the prediction and the freshly fetched, up-to-the-minute correct answer.

> **Security Warning**: Using `eval()` on untrusted code is extremely dangerous as it can execute arbitrary commands. In this tutorial, we are only evaluating code that we have written ourselves in a controlled environment. **Never** use this `eval()` approach in a production system where the code snippets could come from untrusted users.

In [11]:
from typing import Optional # Import typing hints.

from langchain.evaluation.criteria.eval_chain import LabeledCriteriaEvalChain # Import the base class for our custom evaluator.


# Define our custom evaluator by inheriting from the base class.
class CustomCriteriaEvalChain(LabeledCriteriaEvalChain):
    def _get_eval_input(
        self,
        prediction: str,
        reference: Optional[str],
        input: Optional[str],
    ) -> dict:
        # First, get the standard dictionary of inputs from the parent class.
        raw = super()._get_eval_input(prediction, reference, input)
        # This is the key step: we take the 'reference' (our code snippet) and execute it.
        # The result of the execution replaces the code snippet in the dictionary.
        # WARNING: This uses `eval`, which is a security risk with untrusted code.
        raw["reference"] = eval(raw["reference"])
        # Return the modified dictionary with the live, dereferenced answer.
        return raw
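
Where the snippets cannot be fully trusted, one possible alternative to `eval()` is to store a short key in the dataset and resolve it through an allow-list of reference functions. This is a sketch of that design under stated assumptions, not what this notebook does: the `REFERENCE_FUNCTIONS` registry, the `dereference` helper, and the sample `data` are all hypothetical.

```python
from typing import Any, Callable, Dict

# Hypothetical stand-in for the data source.
data = [{"sex": "male"}, {"sex": "female"}, {"sex": "male"}]

# Allow-list of reference computations, keyed by a name stored in the dataset.
REFERENCE_FUNCTIONS: Dict[str, Callable[[list], Any]] = {
    "passenger_count": lambda rows: len(rows),
    "male_count": lambda rows: sum(1 for r in rows if r["sex"] == "male"),
}


def dereference(key: str, rows: list) -> Any:
    """Resolve a stored key to its live value; unknown keys are rejected."""
    try:
        fn = REFERENCE_FUNCTIONS[key]
    except KeyError:
        raise ValueError(f"Unknown reference key: {key!r}") from None
    return fn(rows)


print(dereference("passenger_count", data))
print(dereference("male_count", data))
```

The trade-off is flexibility: every new question needs a registered function, but no string from the dataset is ever executed as code.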

In [23]:
from langsmith.evaluation import LangChainStringEvaluator, evaluate # Import the necessary evaluation functions.

# Instantiate our custom evaluator. We'll use GPT-4 as the judge for high-quality grading.
base_evaluator = CustomCriteriaEvalChain.from_llm(
    criteria="correctness", llm=ChatOpenAI(model="gpt-4", temperature=0.0)
)


# Define a helper function to prepare the data format that our evaluator expects.
def prepare_inputs(run, example):
    return {
        # The agent returns {"input": ..., "output": ...}; read the answer by key
        # rather than relying on dictionary order.
        "prediction": run.outputs["output"],
        "reference": example.outputs["code"], # Get the reference (our code snippet).
        "input": example.inputs["question"], # Get the original input question.
    }


# Wrap our custom evaluator in a LangChainStringEvaluator to make it compatible with the `evaluate` function.
criteria_evaluator = LangChainStringEvaluator(
    base_evaluator, prepare_data=prepare_inputs
)
# Run the evaluation.
chain_results = evaluate(
    predict, # The function representing our Q&A system.
    data=dataset_name, # The name of our dataset in LangSmith.
    evaluators=[criteria_evaluator], # The list of evaluators to apply.
    # The pandas agent does not currently support parallel execution.
    max_concurrency=1,
    metadata={
        "time": "T1", # Add metadata to tag this run as our first time point.
    },
)

View the evaluation results for experiment: 'sparkling-suit-54' at:
https://smith.langchain.com/o/ebbaf2eb-769b-4505-aca2-d11de10372a4/datasets/8cf28879-611e-4532-9641-d593f6bffa20/compare?selectedSessions=e3e8387c-65b9-4f0e-bd20-519c28731949

### Re-evaluate After Data Changes

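While the Titanic dataset is static, we can simulate the kind of data update that occurs in a real-world system. We will modify the DataFrame by duplicating all the rows and shuffling the `Age` and `Sex` columns. This changes the correct answer to many of the questions in our dataset: counts double, and statistics that combine a shuffled column with `Survived` shift, while some answers, such as the average fare, stay the same.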

Because our dataset contains *instructions* for finding the answer, not the answers themselves, we can re-run the exact same evaluation on the new data and get a meaningful correctness score.

In [22]:
# Simulate a data update by doubling the number of rows.
df_doubled = pd.concat([df, df], ignore_index=True)
# Shuffle some of the columns to make the data changes less trivial.
df_doubled["Age"] = df_doubled["Age"].sample(frac=1).reset_index(drop=True)
df_doubled["Sex"] = df_doubled["Sex"].sample(frac=1).reset_index(drop=True)
# Overwrite the original DataFrame with the new, modified data.
df = df_doubled
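
Duplicating and shuffling do not affect every kind of reference equally. The standard-library sketch below, which uses made-up fare values rather than the Titanic data, illustrates the arithmetic: duplicating rows doubles counts but leaves means unchanged, and shuffling a single column leaves that column's own aggregates intact.

```python
import random
from statistics import mean

fares = [7.25, 71.5, 8.0, 53.25]  # made-up values for illustration
doubled = fares + fares            # analogous to concatenating a table with itself

count_before, count_after = len(fares), len(doubled)
mean_before, mean_after = mean(fares), mean(doubled)

# Shuffling reorders values within the column without changing its contents.
shuffled = random.Random(0).sample(doubled, k=len(doubled))

print(count_before, count_after)
print(mean_before, mean_after)
print(sorted(shuffled) == sorted(doubled))
```

What shuffling does change is the pairing between columns, so references that relate a shuffled column to another column (for example, a rate grouped by the shuffled column) will generally differ after the update.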

Now, we run the evaluation again. Note that the code is identical to our first evaluation run, except for the metadata tag, which we'll change to `"T2"` to signify the second time point.

In [None]:
# Re-run the evaluation on the modified DataFrame.
chain_results = evaluate(
    predict, # The same Q&A system function.
    data=dataset_name, # The same dataset of questions and code snippets.
    evaluators=[criteria_evaluator], # The same custom evaluator.
    max_concurrency=1, # The agent still doesn't support concurrent runs.
    metadata={
        "time": "T2", # Update the metadata to mark this as the second run.
    },
)

View the evaluation results for experiment: 'perfect-sofa-52' at:
https://smith.langchain.com/o/ebbaf2eb-769b-4505-aca2-d11de10372a4/datasets/8cf28879-611e-4532-9641-d593f6bffa20/compare?selectedSessions=06bb0be8-4b77-43e3-80b3-e2c0b67900f8

### Review the results

Let's inspect the example for the question, "How many male and female passengers were there?". The table of linked runs clearly shows two different predictions for our two test runs (`T1` and `T2`).

- In the first run, the agent correctly predicted 577 male and 314 female passengers.
- In the second run, after we doubled the data, it correctly predicted 1154 male and 628 female passengers.

**Both test runs were marked as correct**. This demonstrates that our evaluation setup is working as intended. The agent's predictions changed to reflect the new data, and our evaluator fetched the new ground truth, confirming that both answers were correct *at the time they were generated*.

To confirm this, we can inspect the traces of the evaluator itself. Clicking the "correctness" feedback chips shows exactly what inputs the evaluator's LLM received, including the `reference` value passed to the LLM judge. For the `T1` run, the dereferenced value was `(577, 314)`, and for the `T2` run, it was `(1154, 628)`. This confirms that our custom evaluator dereferences the labels and fetches the live data before making its judgment.

### Limitations and Practical Considerations
Dynamic ground truth introduces additional complexity and requires careful control to ensure reproducibility. External dependencies must be stable enough to support repeated evaluation, and changes in upstream systems can alter results over time.

For this reason, dynamic evaluation is best used in conjunction with logging and versioning, so that changes in outcomes can be attributed to changes in the environment rather than to model behaviour alone.
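
As a rough illustration of such logging, the sketch below records a fingerprint of the data source alongside each resolved reference, so that two evaluation runs can later be compared by data version. The `fingerprint` and `log_entry` helpers and the record fields are hypothetical; hashing a JSON dump is only one simple way to version a small in-memory table.

```python
import hashlib
import json
from datetime import datetime, timezone


def fingerprint(rows: list) -> str:
    """Return a short, deterministic hash of the data source contents."""
    payload = json.dumps(rows, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:12]


def log_entry(question: str, reference, rows: list) -> dict:
    """Bundle the resolved reference with the data version it came from."""
    return {
        "question": question,
        "reference": reference,
        "data_fingerprint": fingerprint(rows),
        "evaluated_at": datetime.now(timezone.utc).isoformat(),
    }


rows_t1 = [{"id": 1}, {"id": 2}]
rows_t2 = rows_t1 + rows_t1  # simulated update to the data source

entry_t1 = log_entry("How many rows?", len(rows_t1), rows_t1)
entry_t2 = log_entry("How many rows?", len(rows_t2), rows_t2)
print(entry_t1["data_fingerprint"] != entry_t2["data_fingerprint"])
```

If two runs share a fingerprint but disagree on correctness, the difference is more plausibly due to the model than to the environment.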

### Role in a Broader Evaluation Framework
Within this project, dynamic ground truth addresses a class of failures that static benchmarks cannot capture. When combined with trajectory analysis and structured validation, it helps identify whether an agent failed because it misunderstood the task, queried the wrong information, or operated on outdated assumptions.

This distinction is especially important for agentic systems that interact with external tools and evolving data sources.

## Discussion
As models are increasingly deployed in environments that change over time, evaluation methods must account for non-stationarity. Dynamic ground truth provides a principled way to do this, shifting evaluation from matching stored answers to verifying behaviour against the current state of the world.