
fix(vectorized): correct filter operator to exhaust all child batches #56

Merged

poyrazK merged 3 commits into main from fix/vectorized-filter-operator on Apr 20, 2026

Conversation

Owner

@poyrazK poyrazK commented Apr 20, 2026

Summary

VectorizedFilterOperator::next_batch() was returning true after copying matches from the first batch, instead of continuing to consume all child batches until EOF.

Fix

  • Remove early return after first batch with matches
  • Continue processing until child is exhausted
  • Return true only if any rows were accumulated
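The loop shape this fix describes can be sketched with a toy model. All names below are illustrative (the real operator lives in include/executor/vectorized_operator.hpp and works on VectorBatch, not std::vector<int>); this is a sketch of the control flow only:

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// Toy sketch of the fixed loop shape: keep pulling child batches until
// EOF, accumulate every match, and report true only if any rows were
// accumulated. `child_next` yields one batch per call, false at EOF.
bool filter_next_batch(const std::function<bool(std::vector<int>&)>& child_next,
                       const std::function<bool(int)>& pred,
                       std::vector<int>& out) {
    std::vector<int> input;
    while (child_next(input)) {        // no early return after the first batch
        for (int v : input) {
            if (pred(v)) out.push_back(v);
        }
        input.clear();                 // reset input state each iteration
    }
    return !out.empty();               // true only if rows were accumulated
}
```

With two child batches {1, 2, 3} and {4, 5, 6} and an is-even predicate, a single call now returns all of {2, 4, 6} rather than stopping after the first batch.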

Test plan

  • analytics_tests passes
  • CI verification

Summary by CodeRabbit

  • Bug Fixes
    • Fixed batch processing in filter operators to correctly accumulate results across multiple input batches instead of returning prematurely after the first batch.


coderabbitai Bot commented Apr 20, 2026

Warning

Rate limit exceeded

@github-actions[bot] has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 0 minutes and 27 seconds before requesting another review, or trigger one afterwards with the @coderabbitai review command.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: fdac59ea-8e72-4fd0-9acb-e3000f91e9c2

📥 Commits

Reviewing files that changed from the base of the PR and between 8e5d762 and 0996e66.

📒 Files selected for processing (1)
  • include/executor/vectorized_operator.hpp
📝 Walkthrough

The VectorizedFilterOperator::next_batch method was modified to process all available child batches before returning, accumulating output rows and clearing input state after each iteration, rather than returning immediately upon first non-empty selection.

Changes

Cohort / File(s): VectorizedFilterOperator batch processing logic — include/executor/vectorized_operator.hpp
Summary: Modified next_batch return behavior to continue scanning subsequent child batches and to return based on accumulated output rows rather than returning after the first non-empty selection.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐰 Batches now flow without early escape,
Each one processed, no hasty reshape,
The operator gathers its harvest complete,
Before deciding if output's a treat!

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)
  • Docstring Coverage ⚠️ Warning — docstring coverage is 0.00%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (2 passed)
  • Description Check ✅ Passed — check skipped; CodeRabbit's high-level summary is enabled.
  • Title Check ✅ Passed — the title accurately describes the main fix: correcting the filter operator to exhaust all child batches instead of returning early.



@coderabbitai coderabbitai Bot left a comment


Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
include/executor/vectorized_operator.hpp (1)

104-138: ⚠️ Potential issue | 🟠 Major

Filter now drains the entire child pipeline in a single call — breaks vectorized/pipelined execution and risks unbounded memory.

With the early return removed, next_batch no longer returns once a non-empty match set is produced; instead it loops until the child reports EOF, appending every matching row across every child batch into a single out_batch. Consequences:

  • Unbounded output batch. For a low-selectivity predicate over a large table, out_batch accumulates potentially millions of rows in one VectorBatch, defeating the batch-at-a-time design (compare with VectorizedSeqScanOperator which yields ~batch_size_ rows).
  • Breaks pipelining. Any downstream operator that benefits from incremental delivery (e.g., a future LIMIT, top-N, or a consumer that interleaves work) is now forced to wait for the full scan. This turns Filter into a de-facto blocking operator, similar to VectorizedAggregateOperator — but Filter is not terminal.
  • Diagnosis of original bug may be wrong. Standard vectorized semantics is exactly "return true when this call produced rows; caller re-invokes to pull more." Since child_ preserves its own position across calls, the previous early-return was correct as long as the caller loops on next_batch. If the reported failure was that the caller only called next_batch once, the real fix belongs in the caller.

Suggested fix: return as soon as any matches are produced (or cap by a target batch size) and let the next caller invocation resume. Because child_->next_batch naturally returns false at EOF, no extra state is needed.

♻️ Proposed fix (return per-batch, preserving pipelining)
         while (child_->next_batch(*input_batch_)) {
             selection_mask_->clear();
             condition_->evaluate_vectorized(*input_batch_, child_->output_schema(),
                                             *selection_mask_);

             std::vector<size_t> selection;
             for (size_t r = 0; r < input_batch_->row_count(); ++r) {
                 common::Value val = selection_mask_->get(r);
                 if (!val.is_null() && val.as_bool()) {
                     selection.push_back(r);
                 }
             }

             if (!selection.empty()) {
-                // Batch-level append optimization: iterate columns once
                 for (size_t c = 0; c < input_batch_->column_count(); ++c) {
                     auto& src_col = input_batch_->get_column(c);
                     auto& dest_col = out_batch.get_column(c);
                     for (size_t r : selection) {
                         dest_col.append(src_col.get(r));
                     }
                 }
                 out_batch.set_row_count(out_batch.row_count() + selection.size());
+                input_batch_->clear();
+                return true;
             }
             input_batch_->clear();
         }
-        // Return true if we accumulated any rows, false if no matches found
-        return out_batch.row_count() > 0;
+        return false;

If coalescing small selections across child batches is desired for efficiency, cap the accumulation by a target size (e.g., batch_size) and return when that threshold is reached, rather than only at child EOF.
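That capped coalescing strategy can be sketched in the same toy style. The names (filter_next_batch_capped, target) are illustrative, not the real API; the point is only the control flow — accumulate across child batches, but yield as soon as the cap is reached instead of only at EOF:

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical capped variant: coalesce selections across child batches,
// but return once `target` rows are accumulated. The child callback owns
// its position, so the next call naturally resumes where this one stopped.
bool filter_next_batch_capped(const std::function<bool(std::vector<int>&)>& child_next,
                              const std::function<bool(int)>& pred,
                              std::size_t target,
                              std::vector<int>& out) {
    std::vector<int> input;
    while (child_next(input)) {
        for (int v : input) {
            if (pred(v)) out.push_back(v);
        }
        input.clear();
        if (out.size() >= target) {    // cap reached: yield this batch now
            return true;
        }
    }
    return !out.empty();               // EOF: flush whatever remains
}
```

With batches {1, 2, 3, 4} and {5, 6, 7, 8}, an is-even predicate, and a target of 2, the first call yields {2, 4}, the second resumes and yields {6, 8}, and the third reports EOF — incremental delivery is preserved while small selections still coalesce.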

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@include/executor/vectorized_operator.hpp` around lines 104 - 138, The current
VectorizedFilter next_batch implementation drains the entire child in one call,
causing unbounded out_batch growth and breaking pipelining; fix next_batch (the
override) so it returns as soon as it has appended any rows to out_batch (or
when out_batch reaches a configured target size like batch_size_), i.e., after
processing a child_->next_batch() and appending selection rows to out_batch,
immediately return true if out_batch.row_count() > 0 (or if accumulated >=
batch_size_), otherwise continue fetching child batches until EOF; update logic
around input_batch_, selection_mask_, and out_batch to preserve position across
calls but avoid aggregating all child rows in one invocation.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: ac3ddd17-5450-4732-b33c-6b74f89a50ff

📥 Commits

Reviewing files that changed from the base of the PR and between 3332b0d and 8e5d762.

📒 Files selected for processing (1)
  • include/executor/vectorized_operator.hpp

poyrazK and others added 2 commits April 20, 2026 20:43
…h at a time

Previously, VectorizedFilterOperator::next_batch() would drain the entire
child operator in one call, causing unbounded out_batch growth and breaking
pipelining. Now each call processes exactly one child batch, enabling
pipelined execution downstream.
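The one-batch-per-call semantics this commit describes can be sketched as a toy model (hypothetical names; the real code operates on VectorBatch in include/executor/vectorized_operator.hpp). Each call consumes child batches only until it has produced rows, then returns, so the caller pulls repeatedly:

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// Per-batch sketch: skip child batches with no matches, but return true
// as soon as one batch produces rows, leaving the rest of the child's
// output for subsequent calls. This keeps downstream operators pipelined.
bool filter_next_one_batch(const std::function<bool(std::vector<int>&)>& child_next,
                           const std::function<bool(int)>& pred,
                           std::vector<int>& out) {
    std::vector<int> input;
    while (child_next(input)) {
        for (int v : input) {
            if (pred(v)) out.push_back(v);
        }
        input.clear();
        if (!out.empty()) {
            return true;               // yield after the first productive batch
        }
    }
    return false;                      // EOF with no matches
}
```

With batches {1, 3}, {2, 4}, {6} and an is-even predicate, the first call skips {1, 3}, yields {2, 4}, the second call yields {6}, and the third reports EOF — exactly the pull loop a caller of next_batch is expected to run.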

@poyrazK poyrazK left a comment


It's okay to merge

@poyrazK poyrazK merged commit adfa4c5 into main Apr 20, 2026
9 checks passed
