Update eval intro docs for clarity online offline user journey #1518
Conversation
Mintlify preview ID generated: preview-evalsi-1763502859-c2d102b

Mintlify preview ID generated: preview-evalsi-1763668729-42c6cad
> LangSmith makes building high-quality evaluations easy. This guide explains the key concepts of the LangSmith evaluation framework. The building blocks of the LangSmith framework are:

> LLM outputs are non-deterministic, subjective quality often matters more than correctness, and real-world performance can diverge significantly from controlled tests. LangSmith Evaluation provides a framework for measuring quality throughout the application lifecycle, from pre-deployment testing to production monitoring.
Suggested change:
- LLM outputs are non-deterministic, subjective quality often matters more than correctness, and real-world performance can diverge significantly from controlled tests. LangSmith Evaluation provides a framework for measuring quality throughout the application lifecycle, from pre-deployment testing to production monitoring.
+ LLM outputs are non-deterministic, which makes response quality hard to assess. Evaluations (evals) are a way to break down what "good" looks like and measure it. LangSmith Evaluation provides a framework for measuring quality throughout the application lifecycle, from pre-deployment testing to production monitoring.
> #### Heuristic

> LLM-as-judge evaluators use LLMs to score the application's output. To use them, you typically encode the grading rules / criteria in the LLM prompt. They can be reference-free (e.g., check if system output contains offensive content or adheres to specific criteria). Or, they can compare task output to a reference output (e.g., check if the output is factually accurate relative to the reference).

> _Heuristic evaluators_ are deterministic, rule-based functions. They work well for simple checks such as verifying that a chatbot's response is not empty, that generated code compiles, or that a classification matches exactly.
Suggested change:
- _Heuristic evaluators_ are deterministic, rule-based functions. They work well for simple checks such as verifying that a chatbot's response is not empty, that generated code compiles, or that a classification matches exactly.
+ _Code evaluators_ are deterministic, rule-based functions. They work well for checks such as verifying the structure of a chatbot's response, checking that a response is not empty, that generated code compiles, or that a classification matches exactly.
We don't really use the term heuristic evaluators
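For illustration, the kind of deterministic, rule-based evaluator discussed here can be a plain function that inspects the application's output and returns a score. The sketch below is illustrative only; the argument names and the `{"key", "score"}` return shape are assumptions made for this example, not a guaranteed SDK contract.

```python
# Illustrative sketch of deterministic, rule-based ("code") evaluators.
# The (outputs, reference_outputs) argument names and the {"key", "score"}
# return shape are assumptions for illustration.

def non_empty_response(outputs: dict) -> dict:
    """Score 1 if the chatbot's response is non-empty, else 0."""
    response = (outputs.get("response") or "").strip()
    return {"key": "non_empty_response", "score": int(bool(response))}


def exact_match(outputs: dict, reference_outputs: dict) -> dict:
    """Score 1 if the predicted label exactly matches the reference label."""
    return {
        "key": "exact_match",
        "score": int(outputs.get("label") == reference_outputs.get("label")),
    }


if __name__ == "__main__":
    print(non_empty_response({"response": "Hello!"}))                # score: 1
    print(exact_match({"label": "billing"}, {"label": "refunds"}))   # score: 0
```

An LLM-as-judge evaluator would instead encode the grading criteria in a prompt and have a model produce the score, which suits subjective or reference-based checks.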
> Learn [how to analyze experiment results](/langsmith/analyze-an-experiment).
> ## Experiment configuration
IMO this isn't conceptual anymore, and not something I would want someone new to evals to learn right away. We should move it to the Set up evaluations > Evaluation techniques section.
WDYT about the following structure/content for the concepts page? Feels like there's a lot of info and I think the order can be better:
> Before production deployment, use offline evaluations to validate functionality, benchmark different approaches, and build confidence.

> 1. Create a [dataset](/langsmith/manage-datasets) with representative test cases.
I like this flow, but I'd remove the numbered bullets for each of these sections. For someone new, these steps are so high level that the numbered bullets are hard to follow.
> 1. Deploy the updated application.
> 1. Confirm the fix with online evaluations.
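To make the offline leg of this journey concrete, here is a minimal sketch of running an offline evaluation against a dataset. It assumes the LangSmith Python SDK's `evaluate()` entry point; the dataset name ("support-bot-test-cases"), target function, and evaluator are hypothetical placeholders meant to show the shape of the call rather than a drop-in example.

```python
# Minimal sketch of an offline evaluation run. Assumes LANGSMITH_API_KEY is
# set and a dataset named "support-bot-test-cases" already exists; the target
# function and evaluator below are hypothetical placeholders.
from langsmith import evaluate


def target(inputs: dict) -> dict:
    # Stand-in for the application under test (e.g., a chatbot call).
    return {"response": f"You asked: {inputs['question']}"}


def non_empty_response(outputs: dict) -> dict:
    # Deterministic check: the response must not be empty.
    return {
        "key": "non_empty_response",
        "score": int(bool(outputs.get("response", "").strip())),
    }


results = evaluate(
    target,
    data="support-bot-test-cases",   # dataset of representative test cases
    evaluators=[non_empty_response],
    experiment_prefix="baseline",
)
```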
> ## Core evaluation objects
I'd rather frame it as: these are the kinds of evals you can run online vs. offline. This section makes that really clear, and then you can introduce the different concepts for offline and online.

Thoughts on using the word evaluation "targets" instead of "objects"? Offline evaluations and online evaluations run on different targets; online evals operate on runs, while offline evals operate on examples.
> _Synthetic data generation_ creates additional examples artificially from existing ones. This approach works best when starting with several high-quality, hand-crafted examples, because the synthetic data typically uses these as templates. This provides a quick way to expand dataset size.
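As a rough sketch of that idea: start from a few hand-crafted seed examples and produce paraphrased variants of them. The `generate_variant` helper below is a hypothetical stand-in for an LLM call; all names and data here are illustrative.

```python
# Sketch of synthetic data generation from hand-crafted seed examples.
# `generate_variant` is a hypothetical stand-in for an LLM call that would
# reword a seed question while keeping the same reference answer.
import random

seed_examples = [
    {"question": "How do I reset my password?",
     "answer": "Use the 'Forgot password' link on the sign-in page."},
    {"question": "Can I export my data as CSV?",
     "answer": "Yes. Go to Settings > Export and choose CSV."},
]


def generate_variant(example: dict) -> dict:
    # In practice, prompt an LLM with the seed example as a template and ask
    # for a reworded question that still has the same reference answer.
    prefixes = ["Quick question: ", "Help needed: ", "In other words: "]
    return {"question": random.choice(prefixes) + example["question"],
            "answer": example["answer"]}


synthetic_examples = [generate_variant(random.choice(seed_examples)) for _ in range(10)]
```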
> ### Splits
> #### Splits
I would remove Splits and Versions from here because they're not really conceptual and they're explained elsewhere.
> 
> ### Benchmarking
I don't think Benchmarking, Unit tests, Regression tests, and Backtesting add value to this guide. I'd keep Pairwise though.
|  | ||
|
|
||
> ## Testing
> ### Real-time monitoring
Same with these. They're not super helpful as their own sections, I think. I'd rather weave these concepts into the Online evals section.
Mintlify preview ID generated: preview-evalsi-1764016425-db5351a |
> Before building evaluations, identify what matters for your application. Break down your system into its critical components (LLM calls, retrieval steps, tool invocations, output formatting) and determine quality criteria for each.
> A dataset is a collection of examples used for evaluating an application. An example is a test input, reference output pair.
> **Start with manually curated examples.** Create 5-10 examples of what "good" looks like for each critical component. These examples serve as your ground truth and inform which evaluation approaches to use. For instance:
Suggested change:
- **Start with manually curated examples.** Create 5-10 examples of what "good" looks like for each critical component. These examples serve as your ground truth and inform which evaluation approaches to use. For instance:
+ **Start with manually curated examples.** Create 5-10 examples of what "good" looks like for each component. These examples serve as the ground truth that the eval compares model outputs against. For instance:
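To ground the suggested wording, here is a sketch of seeding such a hand-curated dataset with the LangSmith Python SDK. The dataset name and example contents are invented for illustration, and the client calls shown (`create_dataset`, `create_examples`) should be checked against the current SDK reference.

```python
# Sketch: seed a LangSmith dataset with a handful of hand-curated examples.
# Assumes LANGSMITH_API_KEY is set; the dataset name and examples are illustrative.
from langsmith import Client

client = Client()

dataset = client.create_dataset(
    dataset_name="support-bot-test-cases",
    description="Hand-curated examples of what a good answer looks like.",
)

client.create_examples(
    inputs=[
        {"question": "How do I reset my password?"},
        {"question": "Can I export my data as CSV?"},
    ],
    outputs=[
        {"answer": "Use the 'Forgot password' link on the sign-in page."},
        {"answer": "Yes. Go to Settings > Export and choose CSV."},
    ],
    dataset_id=dataset.id,
)
```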
Mintlify preview ID generated: preview-evalsi-1764599832-a6e239a |
Fixes DOC-456
Preview
https://langchain-5e9cc07a-preview-evalsi-1764016425-db5351a.mintlify.app/langsmith/evaluation-concepts