WIP: statistics, executor: collect singleton sketches for row sampling #68157
0xPoe wants to merge 4 commits into pingcap:master from
Conversation
Skipping CI for Draft Pull Request.

[APPROVAL NOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: The full list of commands accepted by this bot can be found here.

Details: Needs approval from an approver in each of these files. Approvers can indicate their approval by writing
📝 Walkthrough

Changes:
- Dependency Update
- Singleton FM-Sketch NDV Estimation
Sequence Diagram

sequenceDiagram
participant Exec as Executor
participant Sampler as Row Sampler
participant Builder as Sketch Builder
participant Merger as Sketch Merger
participant NDVEst as NDV Estimator
Exec->>Sampler: Collect rows with FMSketch
Sampler->>Builder: Sample & insert into singleton builders (warm-up throttled)
Builder->>Sampler: Accumulate hashed values per column/group
Sampler->>Merger: BuildSingletonSketches() after iteration
Merger->>Sampler: Populate SingletonSketches from builders
Exec->>Merger: MergeCollector() to combine node results
Merger->>Sampler: Merge singleton sketches & sample counts
Exec->>NDVEst: Compute estimateNDVsBySketch(root sketches, node data, special index map)
NDVEst->>Exec: Return per-column NDV estimates (skip special indexes)
Exec->>Exec: Override hist.NDV when sketch estimate > sample NDV
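The "Builder" step in the diagram can be illustrated with a toy, exact-counting version of a singleton builder (a hypothetical simplification; TiDB's real builder uses FMSketch hash sets, not a count map):

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// singletonBuilder is a toy stand-in for the PR's sketch builder: it
// counts hashed values and keeps only those seen exactly once.
type singletonBuilder struct {
	counts map[uint64]int
}

func newSingletonBuilder() *singletonBuilder {
	return &singletonBuilder{counts: map[uint64]int{}}
}

func hash64(s string) uint64 {
	h := fnv.New64a()
	h.Write([]byte(s))
	return h.Sum64()
}

func (b *singletonBuilder) insert(v string) {
	b.counts[hash64(v)]++
}

// build returns the set of hashes that occurred exactly once,
// mirroring the BuildSingletonSketches() step after iteration.
func (b *singletonBuilder) build() map[uint64]bool {
	out := map[uint64]bool{}
	for h, c := range b.counts {
		if c == 1 {
			out[h] = true
		}
	}
	return out
}

func main() {
	b := newSingletonBuilder()
	for _, v := range []string{"a", "b", "b", "c"} {
		b.insert(v)
	}
	fmt.Println(len(b.build())) // "a" and "c" are singletons → 2
}
```

The real collector stores these as FM sketches per column/group so they can be serialized and merged across nodes.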
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~50 minutes

🚥 Pre-merge checks: ✅ 4 | ❌ 1

❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
@0xPoe I've received your pull request and will start the review. I'll conduct a thorough review covering code quality, potential issues, and implementation details. ⏳ This process typically takes 10-30 minutes depending on the complexity of the changes. ℹ️ Learn more details on Pantheon AI.

8a78790 to 1d78d22 (force-pushed)

Review Complete. Findings: 0 issues. ℹ️ Learn more details on Pantheon AI.
@0xPoe: The following test failed, say
Full PR test history. Your PR dashboard.

Details: Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
Actionable comments posted: 4
🧹 Nitpick comments (1)
pkg/statistics/sample_test.go (1)
275-316: ⚡ Quick win: Please cover the merge path too.
This subtest only exercises build + proto round-trip. The analyze path consumes singleton sketches after FromProto() and MergeCollector(), so a deterministic case where the same value is singleton in two children and must disappear after merge would protect the new behavior much better. As per coding guidelines, "Prefer extending existing test suites and fixtures over creating new scaffolding."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/statistics/sample_test.go` around lines 275 - 316, The test SubTestRowSampleSingletonSketches only checks Build + proto round-trip; add a merge/analysis path to cover the case where the same value is singleton in two child collectors and must be removed after MergeCollector/Analyze. Create two ReservoirRowSampleCollector instances via NewReservoirRowSampleCollector, feed them deterministic rows so a particular value is singleton in both, call BuildSingletonSketches on each, use ToProto/FromProto or directly call MergeCollector to combine them, then run the same analysis path that consumes singleton sketches and assert that the merged collector no longer treats that value as singleton (NDV decreases/removes it) and that downstream FromProto/MergeCollector round-trip preserves this behavior; reuse the existing rows/colGroups and assertions in SubTestRowSampleSingletonSketches rather than adding new scaffolding.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@DEPS.bzl`:
- Around line 6585-6591: The Bazel fetch is failing because the module zip for
com_github_pingcap_tipb-v0.0.0-20260414032333-da912b84de6f referenced in
DEPS.bzl (sha256 =
"68768a27ed6c35716fcb01a0b4a15ff13e5c1a5dc11acc7a3d44ba02a2742077", strip_prefix
= "github.com/pingcap/tipb@v0.0.0-20260414032333-da912b84de6f") is not available
on the configured mirrors (the listed URLs), so either upload the module zip for
that pseudo-version to all mirrors (http://bazel-cache.pingcap.net,
http://ats.apps.svc, https://cache.hawkingrei.com,
https://storage.googleapis.com/pingcapmirror) and confirm go.sum contains the
matching hash, or revert/remove this DEPS.bzl entry until the artifact is
mirrored; after uploading or reverting, re-run the CI pipeline to verify the
fetch succeeds.
In `@pkg/executor/analyze_col_sampling.go`:
- Around line 965-973: The code currently computes a single aggregated
sampleSize from nodeSketchSampleCounts/nodeSampleSizes and passes it to
EstimateNDVByGEE(), but sketches are populated per-slice (collectColumns() may
skip nulls), so you must track and pass a per-slice sample count: change the
loop that builds nodeSketchSampleCounts to also compute per-index
actualSampleCounts[i] (the number of rows that contributed to each sketch, using
the same logic collectColumns() uses to skip nulls), and replace uses of the
aggregated sampleSize with actualSampleCounts[i] when computing
sampleNDV/singletonItems and when calling EstimateNDVByGEE() (also apply the
same fix in the similar block around lines 992-999). Ensure variable names
referenced are nodeSketchSampleCounts, nodeSampleSizes, sampleSize,
collectColumns(), EstimateNDVByGEE(), sampleNDV, and singletonItems so the
correct per-slice counts are used.
In `@pkg/statistics/row_sampler.go`:
- Around line 435-450: The current mergeSingletonSketches uses MergeFMSketch
(union), which incorrectly treats "seen-once" items as still singletons after
merge; instead preserve singleton sketches at the original child-collector
granularity by changing the storage and merge behavior: stop calling
MergeFMSketch and instead append a copy of each incoming singletonSketch to a
per-child list (e.g., change baseCollector.SingletonSketches from []*FMSketch to
[][]*FMSketch or otherwise store sketches by child index) inside
mergeSingletonSketches (use singletonSketch.Copy()), and update
buildSamplingStats/estimateNDVsBySketch consumers to iterate the per-child
sketches rather than a single merged sketch so singletonItems aren’t
double-counted.
- Around line 485-487: When deserializing singleton sketches, FMSketchFromProto
leaves maxSize == 0 which later causes shrink logic/corruption when those
sketches are merged; after calling FMSketchFromProto(pbSketch) in the loop that
populates s.SingletonSketches, set the returned sketch's maxSize to the expected
non-zero capacity (e.g. the same maxSize used for new sketches in this package
or derived from s or pbCollector) so that MergeFMSketch and
mergeSingletonSketches operate on a properly initialized sketch; update the loop
that builds s.SingletonSketches to normalize maxSize on each sketch returned by
FMSketchFromProto before appending.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
Run ID: f100b334-ed63-4074-9cae-e68df8946d0b
⛔ Files ignored due to path filters (1)
go.sum is excluded by !**/*.sum
📒 Files selected for processing (7)
DEPS.bzl, go.mod, pkg/executor/analyze_col_sampling.go, pkg/executor/analyze_utils_test.go, pkg/statistics/fmsketch.go, pkg/statistics/row_sampler.go, pkg/statistics/sample_test.go
sha256 = "68768a27ed6c35716fcb01a0b4a15ff13e5c1a5dc11acc7a3d44ba02a2742077",
strip_prefix = "github.com/pingcap/tipb@v0.0.0-20260414032333-da912b84de6f",
urls = [
    "http://bazel-cache.pingcap.net:8080/gomod/github.com/pingcap/tipb/com_github_pingcap_tipb-v0.0.0-20260210113932-1447c9d7e9fe.zip",
    "http://ats.apps.svc/gomod/github.com/pingcap/tipb/com_github_pingcap_tipb-v0.0.0-20260210113932-1447c9d7e9fe.zip",
    "https://cache.hawkingrei.com/gomod/github.com/pingcap/tipb/com_github_pingcap_tipb-v0.0.0-20260210113932-1447c9d7e9fe.zip",
    "https://storage.googleapis.com/pingcapmirror/gomod/github.com/pingcap/tipb/com_github_pingcap_tipb-v0.0.0-20260210113932-1447c9d7e9fe.zip",
    "http://bazel-cache.pingcap.net:8080/gomod/github.com/pingcap/tipb/com_github_pingcap_tipb-v0.0.0-20260414032333-da912b84de6f.zip",
    "http://ats.apps.svc/gomod/github.com/pingcap/tipb/com_github_pingcap_tipb-v0.0.0-20260414032333-da912b84de6f.zip",
    "https://cache.hawkingrei.com/gomod/github.com/pingcap/tipb/com_github_pingcap_tipb-v0.0.0-20260414032333-da912b84de6f.zip",
    "https://storage.googleapis.com/pingcapmirror/gomod/github.com/pingcap/tipb/com_github_pingcap_tipb-v0.0.0-20260414032333-da912b84de6f.zip",
Bazel build is broken — mirror artifacts return 404 for the new tipb version.
The pipeline confirms that the zip for com_github_pingcap_tipb-v0.0.0-20260414032333-da912b84de6f does not exist at either the cache.hawkingrei.com or storage.googleapis.com mirror URLs. Bazel resolves URLs in order, so once all four fail the build is completely blocked.
This typically means the artifact still needs to be uploaded/mirrored before the DEPS.bzl entry can be merged. The steps are usually:
- Ensure
go.sumcontains the correct hash for the new pseudo-version. - Upload the module zip to all four mirror locations (internal cluster mirrors + the two public caches).
- Re-run the pipeline to confirm the fetch succeeds before removing the
do-not-mergelabel.
var sampleSize uint64
for _, size := range nodeSketchSampleCounts {
	sampleSize += uint64(size)
}
if sampleSize == 0 {
	for _, size := range nodeSampleSizes {
		sampleSize += uint64(size)
	}
}
Use per-slice sketch sample counts here.
sampleSize is aggregated once for the whole collector, but the sketches are not populated uniformly: collectColumns() skips null single-column values while multi-column groups still hash them. That means sampleNDV/singletonItems for a nullable slice can be computed from fewer sampled rows than the shared sampleSize passed to EstimateNDVByGEE(), which skews the estimate. Please track the number of rows that actually contributed to each sketch and use that per i.
Also applies to: 992-999
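The issue described above can be illustrated with a small, self-contained sketch of per-slice counting (hypothetical helper, not TiDB's actual code): when single-column slices skip NULLs, each sketch needs to be paired with the number of rows that actually fed it, rather than sharing one aggregate sampleSize.

```go
package main

import "fmt"

// perSliceCounts returns, for each column index, the number of sampled
// rows that actually contributed a value — NULL cells are skipped, the
// way collectColumns() is described as skipping nulls for single columns.
func perSliceCounts(rows [][]*string) []uint64 {
	if len(rows) == 0 {
		return nil
	}
	counts := make([]uint64, len(rows[0]))
	for _, row := range rows {
		for i, cell := range row {
			if cell != nil { // a NULL never reaches the sketch
				counts[i]++
			}
		}
	}
	return counts
}

func main() {
	s := func(v string) *string { return &v }
	rows := [][]*string{
		{s("x"), s("1")},
		{nil, s("2")}, // NULL in column 0: only column 1's sketch sees this row
		{s("y"), s("3")},
	}
	fmt.Println(perSliceCounts(rows)) // [2 3]
}
```

With counts like these, the estimator can be called with counts[i] for slice i instead of one shared total, which is the substance of the review suggestion.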
func (s *baseCollector) mergeSingletonSketches(singletonSketches []*FMSketch) {
	if len(singletonSketches) == 0 {
		return
	}
	if len(s.SingletonSketches) < len(singletonSketches) {
		s.SingletonSketches = append(s.SingletonSketches, make([]*FMSketch, len(singletonSketches)-len(s.SingletonSketches))...)
	}
	for i, singletonSketch := range singletonSketches {
		if singletonSketch == nil {
			continue
		}
		if s.SingletonSketches[i] == nil {
			s.SingletonSketches[i] = singletonSketch.Copy()
		} else {
			s.SingletonSketches[i].MergeFMSketch(singletonSketch)
		}
Don't merge singleton sketches with FM union.
singletonSketch means “seen exactly once in this partition”, but MergeFMSketch() only unions hash membership. If a value is singleton in two child collectors, it still survives in the merged sketch even though it is no longer a singleton for the merged partition. buildSamplingStats() later feeds these merged sketches into estimateNDVsBySketch(), so singletonItems is biased upward and NDV can be overstated. Please preserve singleton sketches at the original collector granularity or merge them with once/multiple state instead.
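The correct merge semantics can be shown with exact sets (a simplified model, not FM sketches, where intersections are only approximate): a value that is a singleton in both children must move to the "seen multiple times" side after merge.

```go
package main

import "fmt"

// state models one collector's knowledge with exact sets:
// once = values seen exactly once, multi = values seen more than once.
type state struct {
	once, multi map[string]bool
}

// merge combines two partitions' states. A value is in the merged multi
// set if it was multi on either side, OR once on both sides — this is
// exactly what a plain FM union loses.
func merge(a, b state) state {
	out := state{once: map[string]bool{}, multi: map[string]bool{}}
	for v := range a.multi {
		out.multi[v] = true
	}
	for v := range b.multi {
		out.multi[v] = true
	}
	for v := range a.once {
		if b.once[v] { // singleton in both children → not a singleton overall
			out.multi[v] = true
		}
	}
	for v := range a.once {
		if !out.multi[v] {
			out.once[v] = true
		}
	}
	for v := range b.once {
		if !out.multi[v] {
			out.once[v] = true
		}
	}
	return out
}

func main() {
	left := state{once: map[string]bool{"a": true, "b": true}, multi: map[string]bool{}}
	right := state{once: map[string]bool{"a": true, "c": true}, multi: map[string]bool{}}
	m := merge(left, right)
	fmt.Println(len(m.once), m.multi["a"]) // "a" is demoted: prints 2 true
}
```

FM sketches only support union, not the intersection step above, which is why the review suggests keeping the sketches at per-child granularity instead of collapsing them with MergeFMSketch.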
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@pkg/statistics/row_sampler.go` around lines 435 - 450, The current
mergeSingletonSketches uses MergeFMSketch (union), which incorrectly treats
"seen-once" items as still singletons after merge; instead preserve singleton
sketches at the original child-collector granularity by changing the storage and
merge behavior: stop calling MergeFMSketch and instead append a copy of each
incoming singletonSketch to a per-child list (e.g., change
baseCollector.SingletonSketches from []*FMSketch to [][]*FMSketch or otherwise
store sketches by child index) inside mergeSingletonSketches (use
singletonSketch.Copy()), and update buildSamplingStats/estimateNDVsBySketch
consumers to iterate the per-child sketches rather than a single merged sketch
so singletonItems aren’t double-counted.
s.SingletonSketches = make([]*FMSketch, 0, len(pbCollector.GetSingletonSketch()))
for _, pbSketch := range pbCollector.GetSingletonSketch() {
	s.SingletonSketches = append(s.SingletonSketches, FMSketchFromProto(pbSketch))
Initialize deserialized singleton sketches with a non-zero maxSize.
FMSketchFromProto() leaves maxSize == 0. Here that sketch can later be copied into s.SingletonSketches and become the destination of MergeFMSketch() in mergeSingletonSketches(), which makes every insert trigger shrink logic and corrupts the sketch. Please normalize maxSize during deserialization before reusing these sketches.
Suggested fix
for _, pbSketch := range pbCollector.GetSingletonSketch() {
- s.SingletonSketches = append(s.SingletonSketches, FMSketchFromProto(pbSketch))
+ sketch := FMSketchFromProto(pbSketch)
+ if sketch != nil && sketch.maxSize == 0 {
+ sketch.maxSize = MaxSketchSize
+ }
+ s.SingletonSketches = append(s.SingletonSketches, sketch)
}
@0xPoe: The following tests failed, say

Full PR test history. Your PR dashboard.

Details: Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
Codecov Report

✅ All modified and coverable lines are covered by tests.

Additional details and impacted files:

@@ Coverage Diff @@
## master #68157 +/- ##
================================================
- Coverage 77.7624% 77.2950% -0.4675%
================================================
Files 1990 1984 -6
Lines 551788 555565 +3777
================================================
+ Hits 429084 429424 +340
- Misses 121784 125397 +3613
+ Partials 920 744 -176
Flags with carried forward coverage won't be shown.
What problem does this PR solve?
Issue Number: ref #67449
Problem Summary:
Row-sampling analyze needs singleton sketches to improve NDV estimation from distributed samples.
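One way to see why singleton counts matter: GEE-style estimators scale the number of values sampled exactly once to extrapolate total NDV. The sketch below shows the textbook GEE formula under stated assumptions; TiDB's actual EstimateNDVByGEE may differ in details.

```go
package main

import (
	"fmt"
	"math"
)

// estimateNDVByGEE applies the classic GEE formula
//   NDV ≈ sqrt(N/n)·f1 + (d − f1)
// where N = total rows, n = sample size, d = distinct values seen in the
// sample, and f1 = values seen exactly once (the singletons the new
// sketches track). Non-singletons are assumed already fully observed.
func estimateNDVByGEE(totalRows, sampleRows, sampleNDV, singletons float64) float64 {
	if sampleRows == 0 {
		return 0
	}
	return math.Sqrt(totalRows/sampleRows)*singletons + (sampleNDV - singletons)
}

func main() {
	// 1M rows, 10k sampled, 500 distinct in the sample, 200 singletons.
	fmt.Printf("%.0f\n", estimateNDVByGEE(1_000_000, 10_000, 500, 200)) // 2300
}
```

This also shows why merged-sketch singleton counts must be accurate: inflating f1 (e.g. by union-merging per-node singleton sketches) directly inflates the extrapolated NDV.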
What changed and how does it work?
Check List
Tests
Manual test:
make bazel_prepare
make lint
Documentation
Release note
Please refer to Release Notes Language Style Guide to write a quality release note.
Summary by CodeRabbit
Improvements
Dependencies
Updated the github.com/pingcap/tipb dependency.

Tests