# Evaluating Existing Runs

This tutorial shows how to evaluate and tag runs after they have already been logged. This is useful in scenarios such as:
- You have a new evaluator, or a new version of an evaluator, and want to add its metrics to existing test projects
- You want to use AI-assisted feedback within monitoring projects (non-test projects)

The typical steps are:
- Define the `RunEvaluator`
- Select the runs you wish to evaluate (see the [run filtering](https://docs.smith.langchain.com/tracing/use-cases/export-runs/local) docs for more information)
- Call the `evaluate_run` method, which runs the evaluation and logs the results as feedback.

In general, any evaluation results can be logged as feedback by calling the `client.create_feedback` method.
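
As a minimal sketch of that last point, the helper below logs a dictionary of precomputed scores as feedback on one run. `log_scores` is a hypothetical name, not part of the SDK, and the sketch assumes `client` is a `langsmith.Client` (or any object with a compatible `create_feedback(run_id, key=..., score=...)` method).

```python
from typing import Any, List, Mapping, Union

Score = Union[bool, int, float]


def log_scores(client: Any, run_id: Any, scores: Mapping[str, Score]) -> List[Any]:
    """Log each precomputed score as one feedback entry on the given run.

    `client` is assumed to expose `create_feedback`, as the LangSmith
    client does; this helper itself is hypothetical.
    """
    return [
        client.create_feedback(run_id, key=key, score=score)
        for key, score in scores.items()
    ]
```

With a real client, this could be called as `log_scores(Client(), run.id, {"Contains Digits": True})`.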

## Using just the SDK

You can add automated/algorithmic feedback to existing runs using just the SDK. The LangSmith client has a helpful `evaluate_run` method to apply a run evaluator to a traced run and save the resulting feedback for the run trace. Let's make an example evaluator to check if the output contains any numeric digits.

In [9]:
from typing import Optional

from langsmith.evaluation import EvaluationResult, RunEvaluator
from langsmith.schemas import Example, Run


class ContainsDigits(RunEvaluator):

    def evaluate_run(
        self, run: Run, example: Optional[Example] = None
    ) -> EvaluationResult:
        if run.outputs is None:
            raise ValueError("Run outputs cannot be None")
        prediction = str(next(iter(run.outputs.values())))
        contains_digits = any(c.isdigit() for c in prediction)
        return EvaluationResult(key="Contains Digits", score=contains_digits)

Here we've defined a simple check that returns `True` if the prediction contains any digits, and `False` otherwise.

The logic above assumes your chain only returns one value, meaning the `run.outputs` dictionary will have only one key. If there are multiple keys in your outputs, you will have to select whichever key(s) you wish to evaluate.
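
One way to handle multiple output keys is sketched below. `select_prediction` is a hypothetical helper that works on a plain dictionary shaped like `run.outputs`; the key names in the usage line are made up for illustration.

```python
from typing import Any, Mapping, Optional


def select_prediction(outputs: Mapping[str, Any], key: Optional[str] = None) -> str:
    """Return the output value to evaluate, as a string.

    With `key=None` this mirrors the single-output logic, but refuses to
    guess when several keys are present.
    """
    if key is not None:
        if key not in outputs:
            raise KeyError(f"Output key {key!r} not found; available: {sorted(outputs)}")
        return str(outputs[key])
    if len(outputs) != 1:
        raise ValueError(f"Expected exactly one output key, got: {sorted(outputs)}")
    return str(next(iter(outputs.values())))
```

Inside an evaluator, `select_prediction(run.outputs, key="answer")` would then replace the `next(iter(...))` expression.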

In this case, the evaluator is "reference-free", meaning it never uses the `example` argument, even when one is present. Evaluators that require reference labels can only be applied to runs that are associated with a dataset example.

For more information on creating a custom evaluator, check out the [docs](https://docs.smith.langchain.com/evaluation/custom-evaluators).

In [10]:
from langsmith import Client

client = Client()

# The name of the project containing the runs to evaluate
project_name = "1680dedc34134584be61a59eb5c3f31e-RunnableSequence"

evaluator = ContainsDigits()
runs = client.list_runs(
    project_name=project_name,
    execution_order=1,
)

for run in runs:
    feedback = client.evaluate_run(run, evaluator)

The evaluation results will all be saved as feedback to the run trace.

In our case, we used only the run itself (not the example object) to generate the evaluation results, but some evaluators require reference labels. You can apply these when the run comes from a test project created by calling the `run_on_dataset` function.
That function assigns the correct `reference_example_id` to each run, linking it to the corresponding example in the dataset. When the client then calls `evaluate_run`, it loads the example and passes it to the evaluator, which thereby knows the ground truth for that data point.
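
To illustrate what a reference-based check might look like, here is a rough sketch of the comparison logic only. It uses `SimpleNamespace` stand-ins that expose just an `outputs` attribute rather than real `Run` and `Example` objects, and `exact_match_score` is a hypothetical name.

```python
from types import SimpleNamespace
from typing import Any, Optional


def exact_match_score(run: Any, example: Optional[Any]) -> Optional[bool]:
    """Compare the run's first output value with the example's first output value.

    Returns None when no reference is available, so the caller can skip
    logging or log a null score.
    """
    if example is None or not example.outputs or not run.outputs:
        return None
    prediction = str(next(iter(run.outputs.values()))).strip()
    reference = str(next(iter(example.outputs.values()))).strip()
    return prediction == reference


# Stand-in objects exposing only the `outputs` attribute used above.
run = SimpleNamespace(outputs={"output": "Paris"})
example = SimpleNamespace(outputs={"answer": "Paris"})
print(exact_match_score(run, example))  # True
print(exact_match_score(run, None))     # None: no reference example
```

In a real `RunEvaluator`, the returned value would be wrapped in an `EvaluationResult` in the same way as the `ContainsDigits` score.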

In [14]:
# Updating the aggregate stats is async, but after some time, the "Contains Digits" feedback will be available
client.read_project(project_name=project_name).feedback_stats

{'Perplexity': {'n': 3, 'avg': 20.9166269302368, 'mode': 12.5060758590698},
 'Contains Digits': {'n': 7, 'avg': 0.42857142857142855, 'mode': 0},
 'COT Contextual Accuracy': {'n': 7, 'avg': 0.7142857142857143, 'mode': 1}}
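
The "Contains Digits" numbers are consistent with boolean scores being aggregated as 0/1 values: three of seven runs scoring `True` gives an average of about 0.43 and a mode of 0. The sketch below reproduces that arithmetic locally. The per-run scores are hypothetical, and how the server actually computes these statistics is an assumption here.

```python
from statistics import mean, mode

# Hypothetical per-run scores: 1 where the output contained a digit, else 0.
scores = [1, 0, 0, 1, 0, 1, 0]

stats = {"n": len(scores), "avg": mean(scores), "mode": mode(scores)}
print(stats)
```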

## Using a LangChain evaluator

LangChain has a number of evaluators you can use off-the-shelf or modify to suit your needs. An easy way to use these is to modify the code above and apply the evaluator directly to the run. For more information on available LangChain evaluators, check out the [open source documentation](https://python.langchain.com/docs/guides/evaluation).

Below, we demonstrate this with the criteria evaluator, which uses an LLM to check that the responses contain both a Python and a TypeScript example, if needed.

In [28]:
from langchain import evaluation, callbacks

class SufficientCodeEvaluator(RunEvaluator):
    
    def __init__(self):
        criteria_description=(
            "If the submission contains code, does it contain both a python and typescript example?"
            " Y if no code is needed or if both languages are present, N if response is only in one language"
        )
        self.evaluator = evaluation.load_evaluator(
            "criteria", criteria={"sufficient_code": criteria_description}
        )

    def evaluate_run(
        self, run: Run, example: Optional[Example] = None
    ) -> EvaluationResult:
        if run.outputs is None:
            # Nothing to grade, so log a null score instead of raising
            return EvaluationResult(key="sufficient_code", score=None)
        question = next(iter(run.inputs.values()))
        prediction = str(next(iter(run.outputs.values())))
        # collect_runs captures the evaluator's own trace so the feedback can link to it
        with callbacks.collect_runs() as cb:
            result = self.evaluator.evaluate_strings(input=question, prediction=prediction)
            run_id = cb.traced_runs[0].id
        return EvaluationResult(key="sufficient_code", evaluator_info={"__run": {"run_id": run_id}}, **result)


In [25]:
runs = client.list_runs(
    project_name=project_name,
    execution_order=1,
)
evaluator = SufficientCodeEvaluator()
for run in runs:
    feedback = client.evaluate_run(run, evaluator)

## Evaluating the whole trace

For some evaluations, you may want to consider information contained in multiple runs within a nested trace. You can do this by setting the `load_child_runs` argument to `True` when calling `evaluate_run` and then selecting the desired information from within the run tree.

One example is evaluating the sequence of actions in an agent's trajectory. Below, we create an agent trajectory evaluator to do this. In this example, we will:

- Select the tool child runs to represent the agent's actions
- Use an LLM to grade the action choices based on the responses at each turn and the final answer
- Query the project for runs by name to select the agent executor
- Specify `load_child_runs=True` to direct the client to load the other child runs in the trace before evaluating

In [59]:
from langchain import evaluation, callbacks, agents

class AgentTrajectoryEvaluator(RunEvaluator):
    
    def __init__(self):
        self.evaluator = evaluation.load_evaluator("trajectory")
        
    @staticmethod
    def construct_trajectory(run: Run):
        trajectory = []
        # Only direct children are inspected; each tool run becomes one (action, observation) step
        for child in (run.child_runs or []):
            if child.run_type == "tool":
                action = agents.agent.AgentAction(tool=child.name, tool_input=child.inputs['input'], log='')
                trajectory.append((action, child.outputs['output']))
        return trajectory
        
    def evaluate_run(
        self, run: Run, example: Optional[Example] = None
    ) -> EvaluationResult:
        if run.outputs is None:
            return EvaluationResult(key="trajectory", score=None)
        question = next(iter(run.inputs.values()))
        prediction = str(next(iter(run.outputs.values())))
        trajectory = self.construct_trajectory(run)
        with callbacks.collect_runs() as cb:
            try:
                result = self.evaluator.evaluate_agent_trajectory(input=question,
                                                                  prediction=prediction,
                                                                  agent_trajectory=trajectory)
            except Exception:
                # If the evaluation fails, we can log a null score
                return EvaluationResult(key="trajectory", score=None)
            run_id = cb.traced_runs[0].id
        return EvaluationResult(key="trajectory", evaluator_info={"__run": {"run_id": run_id}}, **result)
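
The evaluator above looks only at the direct children of the selected run. If tool calls sit deeper in the run tree (for example, under an intermediate chain in a custom agent), a recursive walk is one option. The following is a sketch under that assumption; `FakeRun` is a stand-in with only the attributes used here, and `collect_tool_steps` is a hypothetical helper.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional, Tuple


@dataclass
class FakeRun:
    """Stand-in with the handful of run attributes used below (hypothetical)."""
    name: str
    run_type: str
    inputs: Dict[str, Any] = field(default_factory=dict)
    outputs: Optional[Dict[str, Any]] = None
    child_runs: Optional[List["FakeRun"]] = None


def collect_tool_steps(run: Any) -> List[Tuple[str, Any, Any]]:
    """Depth-first walk returning (tool name, tool input, tool output) triples."""
    steps: List[Tuple[str, Any, Any]] = []
    for child in run.child_runs or []:
        if child.run_type == "tool":
            steps.append(
                (child.name, child.inputs.get("input"), (child.outputs or {}).get("output"))
            )
        # Descend regardless of type so tools nested under chains are found.
        steps.extend(collect_tool_steps(child))
    return steps


trace = FakeRun(
    name="AgentExecutor",
    run_type="chain",
    child_runs=[
        FakeRun("LLMChain", "chain", child_runs=[
            FakeRun("search", "tool", {"input": "weather"}, {"output": "sunny"}),
        ]),
        FakeRun("calculator", "tool", {"input": "2+2"}, {"output": "4"}),
    ],
)
print(collect_tool_steps(trace))
```

The triples come back in depth-first order, so the nested `search` step precedes `calculator`. Each triple could then be turned into an `AgentAction` and observation pair, as in `construct_trajectory`.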


In [None]:
project_name = "7034821cd22f47368cfde810d98375b1-AgentExecutor"
runs = client.list_runs(
    project_name=project_name,
    execution_order=1,
    filter='eq(name, "AgentExecutor")',
)

evaluator = AgentTrajectoryEvaluator()
for run in runs:
    feedback = client.evaluate_run(run.id, evaluator, load_child_runs=True)

## Using a custom chain

In LangSmith, evaluation results are feedback. If you have a custom function or chain you'd like to use to evaluate a run, you can log its output directly using the `create_feedback` method on the client. If the evaluator itself is traced, you can optionally pass its run ID as `source_run_id` so the feedback is linked to the evaluator's trace in the app.

Let's make an LLM-powered example that tries to automatically tag the results based on the input content.
We will use LangChain's runnable lambda to conveniently batch calls.

In [27]:
from langchain import chat_models, prompts, callbacks, schema

chain = (
    prompts.ChatPromptTemplate.from_template(
    "The following is a user question:\n<question>\n{question}</question>\n\n"
    "Categorize it into 1 of the following categories:\n"
    "- API\n- Tracing\n- Evaluation\n- Off-Topic\n- Other\n\nCategory:")
    | chat_models.ChatOpenAI()
    | schema.output_parser.StrOutputParser()
)

def evaluate_run(run: Run):
    # You can get the run ID using the collect_runs callback manager
    with callbacks.collect_runs() as cb:
        result = chain.invoke({"question": next(iter(run.inputs.values()))})
        feedback = client.create_feedback(
            run.id,
            key="LangSmith Category",
            value=result,
            source_run_id=cb.traced_runs[0].id,
            feedback_source_type="model",
        )
    return feedback

wrapped_function = schema.runnable.RunnableLambda(evaluate_run)

runs = client.list_runs(
    project_name=project_name,
    execution_order=1,
)
all_feedback = wrapped_function.batch(list(runs), return_exceptions=True)

# Inspect the first feedback result
all_feedback[0]

Feedback(id=UUID('1cae2b52-82cf-41c8-9b67-9f2b8180b66c'), created_at=datetime.datetime(2023, 8, 30, 14, 59, 22, 964340), modified_at=datetime.datetime(2023, 8, 30, 14, 59, 22, 964345), run_id=UUID('6ebc3c3b-9ab6-4bd9-832f-6d567c4bbedf'), key='LangSmith Category', score=None, value='Other', comment=None, correction=None, feedback_source=FeedbackSourceBase(type='model', metadata={'__run': {'run_id': '07dd5f11-dc55-4470-8261-b590986624fd'}}))
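
When a batch call is configured to return errors in place of results, the returned list can mix feedback objects with exceptions. A rough sketch of summarizing such a list follows; `FakeFeedback` is a stand-in exposing only the `value` field, and `tally_categories` is a hypothetical helper.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Any, Iterable, List, Tuple


@dataclass
class FakeFeedback:
    """Stand-in exposing only the `value` field of a feedback record."""
    value: str


def tally_categories(results: Iterable[Any]) -> Tuple[Counter, List[Exception]]:
    """Split batch results into per-category counts and a list of errors."""
    counts: Counter = Counter()
    errors: List[Exception] = []
    for item in results:
        if isinstance(item, Exception):
            errors.append(item)
        else:
            # Strip whitespace, since LLM outputs often carry stray spaces.
            counts[str(item.value).strip()] += 1
    return counts, errors


results = [
    FakeFeedback("Other"),
    FakeFeedback("API"),
    RuntimeError("rate limited"),
    FakeFeedback("API "),
]
counts, errors = tally_categories(results)
print(counts.most_common())
print(len(errors))
```

Applied to the real results, `tally_categories(all_feedback)` would give a quick view of how the runs were categorized and how many evaluations failed.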

## Conclusion

This tutorial showed how to evaluate existing runs using evaluators or any chain. This is useful in scenarios such as:
- You have already run a model on the dataset and want to add evaluation results from a new evaluator for additional metrics
- You want to run an evaluator or chain to generate feedback on runs that aren't within a test project