## Exact-Match Evaluation as a Baseline

### Motivation

Exact match is one of the simplest evaluation metrics used in question-answering systems: a response is marked correct only if it matches the reference answer exactly. Despite its simplicity and well-known limitations, it remains a useful baseline in controlled settings. This notebook includes exact match not because it is sufficient, but because it provides a clear lower bound against which more flexible or semantic evaluation methods can be compared.

In high-stakes or open-ended tasks, exact match is often too brittle to reflect meaningful correctness. However, its strictness can be informative when the task admits a narrow answer space, or when the goal is to detect surface-level failures such as hallucination, formatting errors, or deviation from required outputs.

### Experimental Setup
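
We apply exact-match evaluation to a set of question-answer pairs with predefined reference answers. The metric assigns a binary score: a response is marked correct only when the generated output equals the reference string. The custom evaluator defined later in this notebook compares the raw strings, so any normalisation (for example trimming whitespace or lower-casing) would have to be added explicitly.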

This evaluation deliberately ignores semantic equivalence and paraphrasing. As such, it isolates a narrow class of failures related to precision and adherence to specification, rather than general understanding.
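
A minimal sketch of this binary scoring rule with an optional normalisation step. It assumes that normalisation means trimming surrounding whitespace and lower-casing; the helper names `normalise` and `exact_match` are illustrative and do not come from LangChain or LangSmith.

```python
def normalise(text: str) -> str:
    """Assumed normalisation: strip surrounding whitespace and lower-case."""
    return text.strip().lower()


def exact_match(prediction: str, reference: str) -> int:
    """Return 1 if the normalised strings are identical, otherwise 0."""
    return int(normalise(prediction) == normalise(reference))


# Surrounding whitespace is normalised away, so this counts as a match.
print(exact_match(" 1776\n", "1776"))
# Extra words are not removed, so a correct but verbose answer still fails.
print(exact_match("The year was 1776.", "1776"))
```

Even with normalisation, the score stays strictly binary: there is no partial credit for an answer that merely contains the reference.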

### What This Metric Captures and What It Misses

Exact match is effective at detecting:
- hallucinated content where a precise answer is required,
- formatting or schema violations,
- failure to follow explicit instructions.

At the same time, it systematically underestimates performance in cases where multiple correct phrasings exist or where partial correctness is meaningful. For this reason, exact match should not be interpreted as a comprehensive measure of model quality, but as a diagnostic signal within a broader evaluation suite.
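
The underestimation can be illustrated on toy data. The predictions and the hand-assigned `acceptable` judgements below are invented for illustration and are not results from this notebook.

```python
reference = "1776"

# Toy predictions paired with a hand-assigned judgement of whether a human
# reader would accept the answer. These values are illustrative only.
predictions = [
    ("1776", True),
    ("The year was 1776.", True),  # correct, but phrased differently
    ("1777", False),
]

# Strict string equality against the single reference answer.
exact_match_accuracy = sum(p == reference for p, _ in predictions) / len(predictions)
# Fraction of answers that the (hypothetical) human judge accepted.
human_accuracy = sum(acceptable for _, acceptable in predictions) / len(predictions)

print(f"exact match:     {exact_match_accuracy:.2f}")
print(f"human judgement: {human_accuracy:.2f}")
```

On this toy set exact match credits one answer out of three, while the human judgement accepts two, which is the kind of gap a broader evaluation suite is meant to expose.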

### Installing Dependencies

The first code cell installs the required Python libraries. The `%pip` line is commented out; uncomment it the first time you run the notebook.
- `langchain`: The core library for building applications with LLMs.
- `langchain_openai`: Provides specific integrations for using OpenAI's models within the LangChain framework.

In [1]:
# The `%pip` command is used to install Python packages directly from a Jupyter cell.
# The `-U` flag ensures that the packages are upgraded to their latest versions.
# The `--quiet` flag suppresses the installation output for a cleaner notebook.
# %pip install -U --quiet langchain langchain_openai

### Setting Up Environment Variables

The notebook relies on three environment variables:

- **`LANGCHAIN_ENDPOINT`**: This tells LangChain where to send logging and tracing data. We point it to the LangSmith API endpoint.
- **`LANGCHAIN_API_KEY`**: This is your personal key to authenticate with your LangSmith account, allowing you to create datasets and log evaluation runs.
- **`OPENAI_API_KEY`**: This is your key for the OpenAI API, which is required to make calls to models like `gpt-3.5-turbo`.

You must replace the placeholder values (`"YOUR API KEY"` and `"Your openai api key"`) with your actual keys for this notebook to run.

In [2]:
import os # Import the 'os' module to interact with the operating system.

# Update with your API URL if using a hosted instance of LangSmith.
os.environ["LANGCHAIN_ENDPOINT"] = "https://api.smith.langchain.com" # Set the LangSmith API endpoint as an environment variable.
# Update with your API key
os.environ["LANGCHAIN_API_KEY"] = "YOUR API KEY" # Set your LangSmith API key as an environment variable.
os.environ["OPENAI_API_KEY"] = "Your openai api key" # Set your OpenAI API key as an environment variable.
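
Hard-coded placeholders are easy to forget. As one possible safeguard, the hypothetical helper `require_env` below (not part of LangChain or LangSmith) raises early if a variable is missing or still holds one of the placeholder strings used in the cell above.

```python
import os

# Placeholder strings used in the cell above; extend this set if you add more.
PLACEHOLDERS = {"YOUR API KEY", "Your openai api key"}


def require_env(name: str) -> str:
    """Return the value of an environment variable, or raise if it is unusable."""
    value = os.environ.get(name, "")
    if not value or value in PLACEHOLDERS:
        raise RuntimeError(f"{name} is not set to a real value")
    return value
```

Calling `require_env("OPENAI_API_KEY")` before the evaluation cell fails fast, instead of surfacing an authentication error midway through a run.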

### Create an Evaluation Dataset

Each example in a LangSmith dataset has two parts:

- **Inputs**: The data that will be fed into your model (e.g., a user's prompt).
- **Outputs (Reference Labels)**: The corresponding "ground truth" or expected answer that you want the model to produce.

Here, we will create a small dataset named `"Oracle of Exactness"` directly in LangSmith. It will contain two examples designed to test for precise outputs. We first check if the dataset already exists to avoid creating duplicates.

In [3]:
import langsmith # Import the LangSmith client library.

client = langsmith.Client() # Instantiate the LangSmith client to interact with the platform.
dataset_name = "Oracle of Exactness" # Define a name for our new dataset.

# Check if a dataset with this name already exists in your LangSmith project.
if not client.has_dataset(dataset_name=dataset_name):
    # If the dataset does not exist, create it.
    ds = client.create_dataset(dataset_name)
    # Add examples to the newly created dataset.
    client.create_examples(
        # 'inputs' is a list of dictionaries, each representing an input to the model.
        inputs=[
            {
                "prompt_template": "State the year of the Declaration of Independence. Respond with just the year in digits, nothing else."
            },
            {"prompt_template": "What's the average speed of an unladen swallow?"},
        ],
        # 'outputs' is a list of dictionaries with the corresponding expected or ground-truth answers.
        outputs=[{"output": "1776"}, {"output": "5"}],
        # 'dataset_id' links these examples to the dataset we created above.
        dataset_id=ds.id,
    )
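
In the cell above, `inputs` and `outputs` are two parallel lists, so each prompt is paired with its reference answer by position. One way to make misalignment harder is to keep every question and answer together and split them just before upload. A sketch, where `build_examples` is a hypothetical helper and the tuple layout is an assumption:

```python
def build_examples(pairs):
    """Split (prompt, answer) pairs into parallel inputs/outputs lists."""
    inputs, outputs = [], []
    for prompt, answer in pairs:
        if not prompt or not answer:
            raise ValueError("every example needs a prompt and a reference answer")
        inputs.append({"prompt_template": prompt})
        outputs.append({"output": answer})
    return inputs, outputs


# Prompts shortened for brevity; the keys mirror those used in the dataset cell.
inputs, outputs = build_examples(
    [
        ("State the year of the Declaration of Independence.", "1776"),
        ("What's the average speed of an unladen swallow?", "5"),
    ]
)
```

The resulting lists could then be passed as `inputs=inputs, outputs=outputs` to `client.create_examples`.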

### Define the System and Evaluators

Now we'll set up the components needed to run the evaluation. This involves three key parts:

1.  **The System Under Test (`predict_result`)**: This is the function that we want to evaluate. It takes an input dictionary (matching the structure of our dataset inputs), uses an OpenAI model to generate a response, and returns the result in a structured output dictionary.

2.  **A Custom Evaluator (`compare_label`)**: While LangSmith provides a built-in `"exact_match"` evaluator, we define our own here to demonstrate how you can create custom evaluation logic. This function receives the model's output (`run`) and the ground truth data (`example`), compares them, and returns a structured `EvaluationResult`. The `@run_evaluator` decorator registers this function with LangSmith so it can be used in an evaluation run.

3.  **The Evaluation Configuration (`RunEvalConfig`)**: This object bundles all the evaluators we want to apply to each model prediction. We include both LangSmith's pre-built `"exact_match"` evaluator and our custom `compare_label` function. This will allow us to see their results side-by-side and confirm they produce the same scores.
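
The comparison inside a custom evaluator can be checked offline before any API call is made. The sketch below uses `types.SimpleNamespace` objects as stand-ins for the `run` and `example` arguments, assuming only that each exposes an `outputs` dictionary; `labels_match` is a hypothetical helper, not a LangSmith function.

```python
from types import SimpleNamespace


def labels_match(run, example) -> bool:
    """Return True only for a non-empty prediction equal to the reference."""
    prediction = (run.outputs or {}).get("output") or ""
    target = (example.outputs or {}).get("output") or ""
    # bool(...) keeps the result a strict True/False even for empty strings.
    return bool(prediction) and prediction == target


example = SimpleNamespace(outputs={"output": "1776"})

# Identical strings: a match.
print(labels_match(SimpleNamespace(outputs={"output": "1776"}), example))
# Correct content with extra wording: not a match.
print(labels_match(SimpleNamespace(outputs={"output": "In 1776."}), example))
# A run that produced no outputs at all: not a match.
print(labels_match(SimpleNamespace(outputs=None), example))
```

Keeping the comparison in a plain function like this makes it easy to unit-test separately from the LangSmith wiring.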

In [5]:
from langchain.smith import RunEvalConfig # Import the configuration class for evaluation runs.
from langchain_openai import ChatOpenAI # Import the ChatOpenAI class to interact with OpenAI's chat models.
from langsmith.evaluation import EvaluationResult, run_evaluator # Import classes for creating custom evaluators.

model = "gpt-3.5-turbo" # Specify the OpenAI model we want to use for our test.


# This is your model/system that you want to evaluate.
def predict_result(input_: dict) -> dict:
    # This function calls the OpenAI model with the provided prompt.
    response = ChatOpenAI(model=model).invoke(input_["prompt_template"])
    # It then returns the model's output in the standard dictionary format.
    return {"output": response.content}


# The '@run_evaluator' decorator registers this function as a LangSmith evaluator.
@run_evaluator
def compare_label(run, example) -> EvaluationResult:
    # Custom evaluators let you define how "exact" the match ought to be.
    # 'run' contains information about the model's execution, including its outputs.
    # 'example' contains information from the dataset, including the reference output.
    
    # Flexibly pick the fields to compare by accessing the dictionaries.
    prediction = run.outputs.get("output") or "" # Get the predicted output string from the run, defaulting to an empty string if not found.
    target = example.outputs.get("output") or "" # Get the target (reference) output string from the example.
    
    # Perform the direct string comparison; bool() keeps the score a strict
    # True/False even when the prediction is an empty string.
    match = bool(prediction) and prediction == target
    
    # Return the result in the required EvaluationResult format.
    return EvaluationResult(key="matches_label", score=match)


# This defines how you generate metrics about the model's performance.
eval_config = RunEvalConfig(
    # Specify a list of built-in evaluators. `"exact_match"` performs the same logic as our custom one.
    evaluators=["exact_match"], 
    # Specify a list of custom evaluator functions to run.
    custom_evaluators=[compare_label],
)

# This is the main function that executes the evaluation.
client.run_on_dataset(
    dataset_name=dataset_name, # The name of the dataset in LangSmith to use for evaluation.
    llm_or_chain_factory=predict_result, # A reference to the function/chain that will be tested.
    evaluation=eval_config, # The evaluation configuration object we defined above.
    verbose=True, # Prints progress and links to the results in LangSmith.
    # Add any metadata to the project to help with tracking and organization.
    project_metadata={"version": "1.0.0", "model": model},
)

View the evaluation results for project 'impressionable-crew-29' at:
https://smith.langchain.com/o/30239cd8-922f-4722-808d-897e1e722845/datasets/4f23ec54-3cf8-44fc-a729-ce08ad855bfd/compare?selectedSessions=a0672ba4-e513-4fef-84b8-bab439581721

View all tests for Dataset Oracle of Exactness at:
https://smith.langchain.com/o/30239cd8-922f-4722-808d-897e1e722845/datasets/4f23ec54-3cf8-44fc-a729-ce08ad855bfd
[------------------------------------------------->] 2/2

| Statistic | feedback.exact_match | feedback.matches_label | error | execution_time | run_id |
|---|---|---|---|---|---|
| count | 2.0 | 2 | 0.0 | 2.0 | 2 |
| unique | | 2 | 0.0 | | 2 |
| top | | False | | | 2b4532af-445e-46aa-8170-d34c3af724a8 |
| freq | | 1 | | | 1 |
| mean | 0.5 | | | 0.545045 | |
| std | 0.707107 | | | 0.265404 | |
| min | 0.0 | | | 0.357376 | |
| 25% | 0.25 | | | 0.451211 | |
| 50% | 0.5 | | | 0.545045 | |
| 75% | 0.75 | | | 0.63888 | |
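
The `feedback.exact_match` figures in the summary above can be reproduced by hand. The sketch assumes the two scores were 0 and 1, which is what a count of 2, a minimum of 0.0, and a mean of 0.5 imply for a binary metric.

```python
import statistics

scores = [0, 1]  # assumed from count=2, min=0.0 and mean=0.5 in the summary

mean = statistics.mean(scores)
# Sample standard deviation (n - 1 in the denominator): sqrt(0.5).
std = statistics.stdev(scores)

print(mean, round(std, 6))
```

With only two examples the standard deviation is as large as a binary metric allows, a reminder that summary statistics on a dataset this small say little about general model quality.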


{'project_name': 'impressionable-crew-29',
 'results': {'893730f0-393d-4c40-92f9-16ce24aaec1f': {'input': {'prompt_template': "What's the average speed of an unladen swallow?"},
   'feedback': [EvaluationResult(key='exact_match', score=0, value=None, comment=None, correction=None, evaluator_info={'__run': RunInfo(run_id=UUID('089a016a-d847-4a26-850c-afc0e78879d5'))}, source_run_id=None, target_run_id=None),
    EvaluationResult(key='matches_label', score=False, value=None, comment=None, correction=None, evaluator_info={}, source_run_id=None, target_run_id=None)],
   'execution_time': 0.732714,
   'run_id': '2b4532af-445e-46aa-8170-d34c3af724a8',
   'output': {'output': 'The average speed of an unladen European swallow is approximately 20.1 miles per hour (32.4 km/h).'},
   'reference': {'output': '5'}},
  'ec9d8754-d264-4cec-802e-0c33513843d8': {'input': {'prompt_template': 'State the year of the declaration of independence.Respond with just the year in digits, nothign else'},
   'feed

### Role in a Broader Evaluation Framework
In this project, exact match serves as a reference point rather than a target metric. Later evaluations introduce semantic judges, trajectory-level analysis, and simulation-based methods that relax the strict assumptions made here. Comparing those methods against exact match helps clarify what each evaluation technique is sensitive to, and where they diverge.

By grounding the evaluation suite with a simple, transparent baseline, we can better interpret the behaviour of more complex evaluators and avoid attributing meaning to improvements that are purely artefacts of metric choice.

## Discussion
While exact match alone is inadequate for evaluating agentic or generative systems, its inclusion is intentional. It provides a clear illustration of how different evaluation choices surface different classes of failure and why relying on a single metric can be misleading when assessing system reliability.