Update eval intro docs for clarity online offline user journey #1518
Conversation
Mintlify preview ID generated: preview-evalsi-1763502859-c2d102b

Mintlify preview ID generated: preview-evalsi-1763668729-42c6cad
> LangSmith makes building high-quality evaluations easy. This guide explains the key concepts of the LangSmith evaluation framework. The building blocks of the LangSmith framework are:

> LLM outputs are non-deterministic, subjective quality often matters more than correctness, and real-world performance can diverge significantly from controlled tests. LangSmith Evaluation provides a framework for measuring quality throughout the application lifecycle, from pre-deployment testing to production monitoring.
Suggested change:
- LLM outputs are non-deterministic, subjective quality often matters more than correctness, and real-world performance can diverge significantly from controlled tests. LangSmith Evaluation provides a framework for measuring quality throughout the application lifecycle, from pre-deployment testing to production monitoring.
+ LLM outputs are non-deterministic, which makes response quality hard to assess. Evaluations (evals) are a way to break down what "good" looks like and measure it. LangSmith Evaluation provides a framework for measuring quality throughout the application lifecycle, from pre-deployment testing to production monitoring.
> #### Heuristic

> LLM-as-judge evaluators use LLMs to score the application's output. To use them, you typically encode the grading rules / criteria in the LLM prompt. They can be reference-free (e.g., check if system output contains offensive content or adheres to specific criteria). Or, they can compare task output to a reference output (e.g., check if the output is factually accurate relative to the reference).

> _Heuristic evaluators_ are deterministic, rule-based functions. They work well for simple checks such as verifying that a chatbot's response is not empty, that generated code compiles, or that a classification matches exactly.
Suggested change:
- _Heuristic evaluators_ are deterministic, rule-based functions. They work well for simple checks such as verifying that a chatbot's response is not empty, that generated code compiles, or that a classification matches exactly.
+ _Code evaluators_ are deterministic, rule-based functions. They work well for checks such as verifying the structure of a chatbot's response, checking that a response is not empty, that generated code compiles, or that a classification matches exactly.
We don't really use the term heuristic evaluators
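For illustration, the kind of deterministic, rule-based evaluator discussed here can be a plain function that inspects the application's output and returns a score. The sketch below is illustrative only; the argument names and the `{"key", "score"}` return shape are assumptions made for this example, not a guaranteed SDK contract.

```python
# Illustrative sketch of deterministic, rule-based ("code") evaluators.
# The (outputs, reference_outputs) argument names and the {"key", "score"}
# return shape are assumptions for illustration.

def non_empty_response(outputs: dict) -> dict:
    """Score 1 if the chatbot's response is non-empty, else 0."""
    response = (outputs.get("response") or "").strip()
    return {"key": "non_empty_response", "score": int(bool(response))}


def exact_match(outputs: dict, reference_outputs: dict) -> dict:
    """Score 1 if the predicted label exactly matches the reference label."""
    return {
        "key": "exact_match",
        "score": int(outputs.get("label") == reference_outputs.get("label")),
    }


if __name__ == "__main__":
    print(non_empty_response({"response": "Hello!"}))                # score: 1
    print(exact_match({"label": "billing"}, {"label": "refunds"}))   # score: 0
```

An LLM-as-judge evaluator would instead encode the grading criteria in a prompt and have a model produce the score, which suits subjective or reference-based checks.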
> Learn [how to analyze experiment results](/langsmith/analyze-an-experiment).
> ## Experiment configuration
IMO this isn't conceptual anymore, and not something I would want someone new to evals to learn right away. We should move it to the Set up evaluations > Evaluation techniques section.
WDYT about the following structure/content for the concepts page? Feels like there's a lot of info and I think the order can be better:
> Before production deployment, use offline evaluations to validate functionality, benchmark different approaches, and build confidence.

> 1. Create a [dataset](/langsmith/manage-datasets) with representative test cases.
I like this flow, but I'd remove the numbered bullets for each of these sections. For someone new, these steps are so high level that the numbered bullets are hard to follow.
> 1. Deploy the updated application.
> 1. Confirm the fix with online evaluations.
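To make the offline leg of this journey concrete, here is a minimal sketch of running an offline evaluation against a dataset. It assumes the LangSmith Python SDK's `evaluate()` entry point; the dataset name ("support-bot-test-cases"), target function, and evaluator are hypothetical placeholders meant to show the shape of the call rather than a drop-in example.

```python
# Minimal sketch of an offline evaluation run. Assumes LANGSMITH_API_KEY is
# set and a dataset named "support-bot-test-cases" already exists; the target
# function and evaluator below are hypothetical placeholders.
from langsmith import evaluate


def target(inputs: dict) -> dict:
    # Stand-in for the application under test (e.g., a chatbot call).
    return {"response": f"You asked: {inputs['question']}"}


def non_empty_response(outputs: dict) -> dict:
    # Deterministic check: the response must not be empty.
    return {
        "key": "non_empty_response",
        "score": int(bool(outputs.get("response", "").strip())),
    }


results = evaluate(
    target,
    data="support-bot-test-cases",   # dataset of representative test cases
    evaluators=[non_empty_response],
    experiment_prefix="baseline",
)
```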
> ## Core evaluation objects
I'd rather frame it as: these are the kinds of evals you can run online vs. offline. This section makes that really clear, and then you can introduce the different concepts for offline and online.

Thoughts on using the word evaluation "targets" instead of "objects"? Offline evaluations and online evaluations run on different targets; online evals operate on runs, while offline evals operate on examples.
> _Synthetic data generation_ creates additional examples artificially from existing ones. This approach works best when starting with several high-quality, hand-crafted examples, because the synthetic data typically uses these as templates. This provides a quick way to expand dataset size.
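As a rough sketch of that idea: start from a few hand-crafted seed examples and produce paraphrased variants of them. The `generate_variant` helper below is a hypothetical stand-in for an LLM call; all names and data here are illustrative.

```python
# Sketch of synthetic data generation from hand-crafted seed examples.
# `generate_variant` is a hypothetical stand-in for an LLM call that would
# reword a seed question while keeping the same reference answer.
import random

seed_examples = [
    {"question": "How do I reset my password?",
     "answer": "Use the 'Forgot password' link on the sign-in page."},
    {"question": "Can I export my data as CSV?",
     "answer": "Yes. Go to Settings > Export and choose CSV."},
]


def generate_variant(example: dict) -> dict:
    # In practice, prompt an LLM with the seed example as a template and ask
    # for a reworded question that still has the same reference answer.
    prefixes = ["Quick question: ", "Help needed: ", "In other words: "]
    return {"question": random.choice(prefixes) + example["question"],
            "answer": example["answer"]}


synthetic_examples = [generate_variant(random.choice(seed_examples)) for _ in range(10)]
```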
> ### Splits
> #### Splits
I would remove Splits and Versions from here because they're not really conceptual and they're explained elsewhere.
> 
> ### Benchmarking
I don't think Benchmarking, Unit tests, Regression tests, and Backtesting add value to this guide. I'd keep Pairwise though.
|  | ||
|
|
||
> ## Testing
> ### Real-time monitoring
Same with these. They're not super helpful as their own sections, I think. I'd rather weave these concepts into the Online evals section.
Mintlify preview ID generated: preview-evalsi-1764016425-db5351a |
> Before building evaluations, identify what matters for your application. Break down your system into its critical components (LLM calls, retrieval steps, tool invocations, output formatting) and determine quality criteria for each.
> A dataset is a collection of examples used for evaluating an application. An example is a test input, reference output pair.
> **Start with manually curated examples.** Create 5-10 examples of what "good" looks like for each critical component. These examples serve as your ground truth and inform which evaluation approaches to use. For instance:
Suggested change:
- **Start with manually curated examples.** Create 5-10 examples of what "good" looks like for each critical component. These examples serve as your ground truth and inform which evaluation approaches to use. For instance:
+ **Start with manually curated examples.** Create 5-10 examples of what "good" looks like for each component. These examples serve as the ground truth that the eval compares model outputs against. For instance:
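To ground the suggested wording, here is a sketch of seeding such a hand-curated dataset with the LangSmith Python SDK. The dataset name and example contents are invented for illustration, and the client calls shown (`create_dataset`, `create_examples`) should be checked against the current SDK reference.

```python
# Sketch: seed a LangSmith dataset with a handful of hand-curated examples.
# Assumes LANGSMITH_API_KEY is set; the dataset name and examples are illustrative.
from langsmith import Client

client = Client()

dataset = client.create_dataset(
    dataset_name="support-bot-test-cases",
    description="Hand-curated examples of what a good answer looks like.",
)

client.create_examples(
    inputs=[
        {"question": "How do I reset my password?"},
        {"question": "Can I export my data as CSV?"},
    ],
    outputs=[
        {"answer": "Use the 'Forgot password' link on the sign-in page."},
        {"answer": "Yes. Go to Settings > Export and choose CSV."},
    ],
    dataset_id=dataset.id,
)
```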
Mintlify preview ID generated: preview-evalsi-1764599832-a6e239a |
Fixes DOC-456
Preview
https://langchain-5e9cc07a-preview-evalsi-1764016425-db5351a.mintlify.app/langsmith/evaluation-concepts