[DOC] Callout the aggregation result may be approximate#4922

Merged
LantaoJin merged 5 commits into opensearch-project:main from LantaoJin:pr/issues/4915
Dec 11, 2025

Conversation

@LantaoJin
Member

@LantaoJin LantaoJin commented Dec 9, 2025

Description

[DOC] Callout the aggregation result may be approximate

Related Issues

Resolves #4915

Check List

  • New functionality includes testing.
  • New functionality has been documented.
  • New functionality has javadoc added.
  • New functionality has a user manual doc added.
  • New PPL command checklist all confirmed.
  • API changes companion pull request created.
  • Commits are signed per the DCO using --signoff or -s.
  • Public documentation issue/PR created.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Signed-off-by: Lantao Jin <ltjin@amazon.com>
@LantaoJin LantaoJin added the documentation label Dec 9, 2025
@coderabbitai
Contributor

coderabbitai bot commented Dec 9, 2025

📝 Walkthrough

Summary by CodeRabbit

Documentation

  • Expanded limitations docs describing that bucket aggregation results can be approximate on large datasets and how that may affect downstream aggregations.
  • Added guidance on inaccuracies when sorting by ascending doc_count, explaining shard-level discrepancies and possible missed rare terms.
  • Updated aggregation functions documentation to note use of streamstats alongside stats and eventstats.

Walkthrough

Added documentation clarifying that OpenSearch bucket aggregation doc_count can be approximate (affecting downstream aggregations and sorting) and expanded aggregation function docs to include streamstats. No functional code changes.
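
The approximation described above can be reproduced with a small simulation. The following is only a sketch under stated assumptions: each shard returns its local top terms, a coordinating step sums whatever it receives, and the data, the `shard_size` value, and the helper names `shard_top_terms` and `merge_top_terms` are hypothetical. It is not the OpenSearch implementation, and real clusters typically request more candidates per shard than this toy does.

```python
from collections import Counter


def shard_top_terms(docs, shard_size):
    """Return the shard_size most frequent terms on one shard with local counts."""
    return Counter(docs).most_common(shard_size)


def merge_top_terms(shards, size, shard_size):
    """Sum the partial lists from every shard, then keep the top `size` terms.

    The merge step only sees what each shard returned, so a term that fell
    outside a shard's local top list contributes nothing from that shard.
    """
    merged = Counter()
    for docs in shards:
        for term, count in shard_top_terms(docs, shard_size):
            merged[term] += count
    return dict(merged.most_common(size))


# Hypothetical data: two shards, three terms.
shards = [
    ["a"] * 10 + ["b"] * 6 + ["c"] * 5,
    ["a"] * 10 + ["c"] * 7 + ["b"] * 5,
]

exact = Counter(term for docs in shards for term in docs)
approx = merge_top_terms(shards, size=2, shard_size=2)

# A downstream aggregation (here, a sum over the bucket counts) inherits the error.
exact_top2_total = sum(count for _, count in exact.most_common(2))
approx_total = sum(approx.values())

print(dict(exact))                     # {'a': 20, 'b': 11, 'c': 12}
print(approx)                          # {'a': 20, 'c': 7}
print(exact_top2_total, approx_total)  # 32 27
```

In this toy run the term `c` is reported with a count of 7 although it occurs 12 times, because the first shard never returned it. Any aggregation computed on top of the bucket counts is off by the same amount.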

Changes

| Cohort / File(s) | Summary |
|---|---|
| Bucket aggregation limitations: docs/user/ppl/cmd/stats.md, docs/user/ppl/limitations/limitations.md | Added two sections explaining that bucket aggregation doc_count may be approximate on large datasets and that sorting by ascending doc_count can be inaccurate due to shard-level discrepancies; includes example PPL queries and explanations. |
| Aggregation function references: docs/user/ppl/functions/aggregations.md | Expanded description to list streamstats alongside stats and eventstats as commands used with aggregation functions. |

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~3 minutes

  • Docs-only changes; no code or API surface edits.
  • Review focus: clarity and correctness of examples in docs/user/ppl/cmd/stats.md and docs/user/ppl/limitations/limitations.md.

Pre-merge checks and finishing touches

✅ Passed checks (5 passed)
| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title clearly and specifically summarizes the main change: adding documentation callouts about approximate aggregation results. |
| Description check | ✅ Passed | The description is related to the changeset, referencing the documentation updates for issue #4915 about approximate aggregation results. |
| Linked Issues check | ✅ Passed | The PR successfully addresses issue #4915 by adding documentation callouts about approximate doc_count computation and aggregation results across multiple documentation files. |
| Out of Scope Changes check | ✅ Passed | All changes are in-scope documentation updates directly addressing the requirement to call out approximate aggregation results in user documentation. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |

Signed-off-by: Lantao Jin <ltjin@amazon.com>
Signed-off-by: Lantao Jin <ltjin@amazon.com>
Signed-off-by: Lantao Jin <ltjin@amazon.com>
Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (2)
docs/user/ppl/limitations/limitations.md (1)

105-115: Minor wording improvement for clarity.

The phrase starting with "A term that is globally infrequent..." could be tightened for improved readability. Consider simplifying to reduce wordiness while preserving the technical accuracy:

- A term that is globally infrequent might not appear as infrequent on every individual shard or might be entirely absent from the least frequent results returned by some shards. Conversely, a term that appears infrequently on one shard might be common on another. In both scenarios, rare terms can be missed during shard-level aggregation, resulting in incorrect overall results.
+ Rare terms may not be ranked consistently across shards. A term infrequent globally might rank higher on some shards or be absent from others. This shard-level inconsistency can cause rare terms to be missed during aggregation, resulting in incomplete results.
docs/user/ppl/cmd/stats.md (1)

68-79: Minor wording improvement for clarity.

The phrase in the second subsection could be simplified for improved readability. Consider the same improvement suggested for the parallel section in limitations.md:

- A term that is globally infrequent might not appear as infrequent on every individual shard or might be entirely absent from the least frequent results returned by some shards. Conversely, a term that appears infrequently on one shard might be common on another. In both scenarios, rare terms can be missed during shard-level aggregation, resulting in incorrect overall results.
+ Rare terms may not be ranked consistently across shards. A term infrequent globally might rank higher on some shards or be absent from others. This shard-level inconsistency can cause rare terms to be missed during aggregation, resulting in incomplete results.
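
The ascending-sort problem discussed in these two nitpicks can be illustrated with a short simulation. This is a sketch under assumptions, not the OpenSearch implementation: each shard is assumed to return only its locally least frequent terms, the URLs and counts are made up, and the helper names `shard_rarest_terms` and `merge_rarest_terms` are hypothetical.

```python
from collections import Counter


def by_count_then_term(item):
    """Sort key: ascending count, with the term as a deterministic tie-breaker."""
    term, count = item
    return (count, term)


def shard_rarest_terms(docs, shard_size):
    """Return the shard_size least frequent terms on one shard with local counts."""
    return sorted(Counter(docs).items(), key=by_count_then_term)[:shard_size]


def merge_rarest_terms(shards, size, shard_size):
    """Sum the partial lists from every shard, then keep the `size` rarest terms."""
    merged = Counter()
    for docs in shards:
        for term, count in shard_rarest_terms(docs, shard_size):
            merged[term] += count
    return sorted(merged.items(), key=by_count_then_term)[:size]


# Hypothetical data: "/rare" is the rarest URL overall, but it is never the
# rarest URL on any single shard.
shards = [
    ["/home"] * 50 + ["/about"] * 1 + ["/rare"] * 3,
    ["/home"] * 1 + ["/about"] * 40 + ["/rare"] * 2,
]

exact = Counter(url for docs in shards for url in docs)
exact_rarest = sorted(exact.items(), key=by_count_then_term)[:1]
approx_rarest = merge_rarest_terms(shards, size=1, shard_size=1)

print(exact_rarest)   # [('/rare', 5)]
print(approx_rarest)  # [('/about', 1)]
```

Here the merged result reports `/about` with a count of 1 even though it occurs 41 times overall, and the globally rarest URL `/rare` is missed entirely. Both failure modes from the quoted paragraph show up: a term that is rare on one shard is common on another, and the truly rare term never makes any shard's shortlist.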
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between e2cc21b and 4515592.

📒 Files selected for processing (3)
  • docs/user/ppl/cmd/stats.md (1 hunks)
  • docs/user/ppl/functions/aggregations.md (1 hunks)
  • docs/user/ppl/limitations/limitations.md (1 hunks)
🧰 Additional context used
🪛 LanguageTool
docs/user/ppl/limitations/limitations.md

[style] ~115-~115: You can shorten this phrase to improve clarity and avoid wordiness.
Context: ...) as c by URL | sort + c | head 10 ``` A term that is globally infrequent might not appear as infrequent on every...

(NNS_THAT_ARE_JJ)

docs/user/ppl/cmd/stats.md

[style] ~78-~78: You can shorten this phrase to improve clarity and avoid wordiness.
Context: ...) as c by URL | sort + c | head 10 ``` A term that is globally infrequent might not appear as infrequent on every...

(NNS_THAT_ARE_JJ)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (24)
  • GitHub Check: build-linux (21, unit)
  • GitHub Check: build-linux (25, integration)
  • GitHub Check: build-linux (21, doc)
  • GitHub Check: build-linux (25, doc)
  • GitHub Check: build-linux (25, unit)
  • GitHub Check: build-linux (21, integration)
  • GitHub Check: bwc-tests-rolling-upgrade (21)
  • GitHub Check: bwc-tests-rolling-upgrade (25)
  • GitHub Check: security-it-linux (25)
  • GitHub Check: security-it-linux (21)
  • GitHub Check: build-windows-macos (macos-14, 25, integration)
  • GitHub Check: security-it-windows-macos (windows-latest, 21)
  • GitHub Check: build-windows-macos (macos-14, 25, unit)
  • GitHub Check: build-windows-macos (macos-14, 25, doc)
  • GitHub Check: security-it-windows-macos (macos-14, 25)
  • GitHub Check: security-it-windows-macos (windows-latest, 25)
  • GitHub Check: build-windows-macos (windows-latest, 21, -PbuildPlatform=windows, integration)
  • GitHub Check: build-windows-macos (macos-14, 21, unit)
  • GitHub Check: build-windows-macos (macos-14, 21, doc)
  • GitHub Check: build-windows-macos (macos-14, 21, integration)
  • GitHub Check: build-windows-macos (windows-latest, 21, -PbuildPlatform=windows, unit)
  • GitHub Check: build-windows-macos (windows-latest, 25, -PbuildPlatform=windows, unit)
  • GitHub Check: build-windows-macos (windows-latest, 25, -PbuildPlatform=windows, integration)
  • GitHub Check: CodeQL-Scan (java)
🔇 Additional comments (4)
docs/user/ppl/functions/aggregations.md (1)

5-5: Accurate expansion of supported commands.

The addition of streamstats to the list of commands that use aggregation functions is correct and aligns with the related documentation updates. The change reads naturally in context.

docs/user/ppl/limitations/limitations.md (2)

91-115: Well-documented limitations with clear examples.

The new sections effectively document the approximate behavior of bucket aggregations and related limitations. The examples and explanations are clear and directly address the issue #4915 requirements. The placement in the limitations section is appropriate.


52-115: Note: Content duplication with stats.md.

The Limitations section added here (lines 91-115) appears to have identical content in docs/user/ppl/cmd/stats.md (lines 52-79). While this ensures both the general limitations and command-specific documentation are comprehensive, you may want to consider a single source of truth approach or use cross-references to reduce maintenance burden. This is informational only.

docs/user/ppl/cmd/stats.md (1)

52-79: Appropriate command-level limitations documentation.

Adding limitations to the stats command documentation ensures users encounter this critical information in the relevant context. The examples effectively demonstrate the affected scenarios, and the explanations are clear.

Comment thread docs/user/ppl/cmd/stats.md Outdated
Signed-off-by: Lantao Jin <ltjin@amazon.com>
Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (2)
docs/user/ppl/limitations/limitations.md (1)

91-116: Documentation successfully clarifies approximate aggregation behavior.

The two new limitation sections effectively explain the approximation issues with bucket aggregations and sorting behavior, directly addressing issue #4915. Code examples are properly formatted with the ignore flag per doctest guidelines.

Minor clarity suggestion: Line 115 contains a wordy phrase. Consider shortening "A term that is globally infrequent might not appear as infrequent on every individual shard" to something more concise, such as "A globally infrequent term may not appear infrequent on every individual shard" for improved readability.

docs/user/ppl/cmd/stats.md (1)

52-80: Limitations section appropriately documents aggregation approximation behavior.

Placing the Limitations section within the stats command documentation is ideal for context awareness. Code examples follow doctest guidelines with proper ignore flags.

Same minor clarity suggestion as in limitations.md: Line 79 uses a wordy phrasing. Shorten "A term that is globally infrequent might not appear as infrequent on every individual shard" to improve clarity and reduce cognitive load.

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 4515592 and c36888c.

📒 Files selected for processing (2)
  • docs/user/ppl/cmd/stats.md (1 hunks)
  • docs/user/ppl/limitations/limitations.md (1 hunks)
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-12-02T17:27:55.938Z
Learnt from: CR
Repo: opensearch-project/sql PR: 0
File: .rules/REVIEW_GUIDELINES.md:0-0
Timestamp: 2025-12-02T17:27:55.938Z
Learning: For PPL command PRs, refer to docs/dev/ppl-commands.md and verify that the PR satisfies the checklist

Applied to files:

  • docs/user/ppl/cmd/stats.md
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (27)
  • GitHub Check: security-it-linux (25)
  • GitHub Check: security-it-linux (21)
  • GitHub Check: build-linux (21, unit)
  • GitHub Check: build-linux (25, integration)
  • GitHub Check: build-linux (25, unit)
  • GitHub Check: bwc-tests-rolling-upgrade (25)
  • GitHub Check: build-linux (21, integration)
  • GitHub Check: bwc-tests-full-restart (25)
  • GitHub Check: bwc-tests-rolling-upgrade (21)
  • GitHub Check: build-linux (21, doc)
  • GitHub Check: build-linux (25, doc)
  • GitHub Check: bwc-tests-full-restart (21)
  • GitHub Check: build-windows-macos (macos-14, 25, unit)
  • GitHub Check: security-it-windows-macos (macos-14, 25)
  • GitHub Check: security-it-windows-macos (macos-14, 21)
  • GitHub Check: build-windows-macos (macos-14, 21, unit)
  • GitHub Check: security-it-windows-macos (windows-latest, 25)
  • GitHub Check: build-windows-macos (macos-14, 21, doc)
  • GitHub Check: security-it-windows-macos (windows-latest, 21)
  • GitHub Check: build-windows-macos (windows-latest, 21, -PbuildPlatform=windows, integration)
  • GitHub Check: build-windows-macos (macos-14, 21, integration)
  • GitHub Check: build-windows-macos (windows-latest, 25, -PbuildPlatform=windows, unit)
  • GitHub Check: build-windows-macos (windows-latest, 21, -PbuildPlatform=windows, unit)
  • GitHub Check: build-windows-macos (macos-14, 25, integration)
  • GitHub Check: build-windows-macos (windows-latest, 25, -PbuildPlatform=windows, integration)
  • GitHub Check: build-windows-macos (macos-14, 25, doc)
  • GitHub Check: CodeQL-Scan (java)

@LantaoJin LantaoJin merged commit 90ee47c into opensearch-project:main Dec 11, 2025
57 of 59 checks passed
@LantaoJin LantaoJin deleted the pr/issues/4915 branch December 11, 2025 08:34
@opensearch-trigger-bot
Contributor

The backport to 2.19-dev failed:

The process '/usr/bin/git' failed with exit code 128

To backport manually, run these commands in your terminal:

# Navigate to the root of your repository
cd $(git rev-parse --show-toplevel)
# Fetch latest updates from GitHub
git fetch
# Create a new working tree
git worktree add ../.worktrees/sql/backport-2.19-dev 2.19-dev
# Navigate to the new working tree
pushd ../.worktrees/sql/backport-2.19-dev
# Create a new branch
git switch --create backport/backport-4922-to-2.19-dev
# Cherry-pick the merged commit of this pull request and resolve the conflicts
git cherry-pick -x --mainline 1 90ee47c6f909d38f5ba12cef3c2bda8c5f23cce5
# Push it to GitHub
git push --set-upstream origin backport/backport-4922-to-2.19-dev
# Go back to the original working tree
popd
# Delete the working tree
git worktree remove ../.worktrees/sql/backport-2.19-dev

Then, create a pull request where the base branch is 2.19-dev and the compare/head branch is backport/backport-4922-to-2.19-dev.

opensearch-trigger-bot bot pushed a commit that referenced this pull request Dec 12, 2025
* [DOC] Callout the aggregation result may be approximate

Signed-off-by: Lantao Jin <ltjin@amazon.com>

* add to limitation.rst

Signed-off-by: Lantao Jin <ltjin@amazon.com>

* revert

Signed-off-by: Lantao Jin <ltjin@amazon.com>

* add ignore format

Signed-off-by: Lantao Jin <ltjin@amazon.com>

---------

Signed-off-by: Lantao Jin <ltjin@amazon.com>
(cherry picked from commit 90ee47c)
Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
LantaoJin pushed a commit that referenced this pull request Dec 12, 2025
* [DOC] Callout the aggregation result may be approximate

* add to limitation.rst

* revert

* add ignore format

---------

(cherry picked from commit 90ee47c)

Signed-off-by: Lantao Jin <ltjin@amazon.com>
Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>

Labels

backport 2.19-dev, documentation

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[DOC] doc_count computation in OpenSearch is approximated which cause approximated results in PPL/SQL aggregation

4 participants