CBO Phase 2: filter selectivity + join reordering by poyrazK · Pull Request #79 · poyrazK/cloudSQL

poyrazK · 2026-05-07T12:18:21Z

Summary

Filter selectivity: execute_select() now estimates rows after WHERE clause using RowEstimator::estimate_filter_rows() (NDV-based selectivity for equality, range-based for int, string-length-based for LIKE prefixes) and compares against the 10k VectorizedRowThreshold
Join reordering: build_vectorized_plan() estimates both A⋈B and B⋈A join cardinalities via RowEstimator::estimate_join_rows() — when the reverse order is smaller, key expressions are swapped so the smaller table feeds the build-side hash table
current_est_rows tracked across the join chain for multi-join queries

Tests

AnalyzeFilterSelectivity — verifies SELECT with WHERE > constant uses Vectorized after ANALYZE (15k rows, filter estimated ~5k, threshold 10k)
AnalyzeJoinOrder — creates big (10k) and small (100) tables, ANALYZEs both, verifies join succeeds using Vectorized path

Verification

37/37 tests pass
Build: clean (1 self-assign warning in vectorized_operator.hpp, unrelated)

Summary by CodeRabbit

Improvements
- Optimized query execution with smarter filtering estimation for WHERE clauses.
- Enhanced join operation ordering to improve query performance on multi-table queries.
- Added validation tests to ensure query planning and execution work correctly with table statistics.

When a WHERE clause is present with a simple column OP constant predicate (e.g., id > 5000), use estimate_filter_rows to get a more accurate row estimate before comparing against kVectorizedRowThreshold. Falls back to estimate_scan_rows when no filter is present or the filter is too complex to analyze. Add AnalyzeFilterSelectivity test to verify SELECT with WHERE clause works after ANALYZE and the chooser path is exercised.

Use RowEstimator::estimate_join_rows() to compare forward (A⋈B) vs reverse (B⋈A) join cardinalities. When the reverse order is estimated to produce fewer rows, swap the key expressions so the smaller side feeds as the build-side hash table. This reduces hash table size and probe cost without changing the operator tree topology. Track current_est_rows across the join chain so subsequent joins in a multi-join query also benefit from the heuristic.

Creates big (10000 rows) and small (100 rows) tables, runs ANALYZE on both, then executes a join. Verifies the join succeeds using the Vectorized path without error, exercising the join reordering logic.

coderabbitai · 2026-05-07T12:18:29Z

Warning

Rate limit exceeded

@poyrazK has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 53 minutes and 3 seconds before requesting another review.

To continue reviewing without waiting, purchase usage credits in the billing tab.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 839486c0-ce79-4bc5-a393-014cd027c226

📥 Commits

Reviewing files that changed from the base of the PR and between 133410b and 766c703.

📒 Files selected for processing (4)

docs/adr/001-analyze-table-and-chooser.md
docs/adr/002-join-reordering-inner-only.md
src/executor/query_executor.cpp
tests/cloudSQL_tests.cpp

📝 Walkthrough

Walkthrough

This PR refines vectorized query planning by integrating row-count estimation. The executor now selects plans based on filtered-row estimates for simple WHERE clauses and reorders equi-joins by comparing estimated cardinalities to place the smaller input on the hash-build side, with updated cardinality tracking across join chains.

Changes

Query Optimizer – Vectorized Plan Selection and Join Reordering

Layer / File(s)	Summary
Filter Row Estimation `src/executor/query_executor.cpp`	`execute_select` now conditionally calls `estimate_filter_rows` for simple column–constant binary WHERE predicates, falling back to `estimate_scan_rows` when filtering is unavailable or ineligible.
Join Reordering & Cardinality Tracking `src/executor/query_executor.cpp`	`build_vectorized_plan` initializes a `current_est_rows` cardinality estimate and, for each join, compares forward and reverse join orders via `estimate_join_rows`; if reverse is smaller, join keys are swapped, and cardinality is updated for subsequent joins.
Execution Tests `tests/cloudSQL_tests.cpp`	Two new test cases (`AnalyzeFilterSelectivity`, `AnalyzeJoinOrder`) validate that table statistics collected by `ANALYZE TABLE` correctly inform filter selectivity and join-order decisions at execution time.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~22 minutes

Possibly related PRs

poyrazK/cloudSQL#75: Main PR directly builds on PR #75's ANALYZE/RowEstimator work: it uses RowEstimator::estimate_filter_rows and ::estimate_join_rows inside QueryExecutor (execute_select/build_vectorized_plan) to refine vectorized selection and join-side swapping.

Poem

🐰 A rabbit hops through join trees fair,
Reordering sides with careless care—
Stats whisper: "build the smaller one!"
Now queries dance and run, and run!

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title 'CBO Phase 2: filter selectivity + join reordering' directly and clearly summarizes the main changes: implementation of filter selectivity estimation and join reordering logic as part of the cost-based optimization phase 2.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch feature/cbo-phase2

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/executor/query_executor.cpp`:
- Around line 418-453: The code treats estimated_rows == 0 as "no filter
estimate" but eligible is scoped inside the bin_expr block and lost, so a
legitimate zero returned by optimizer::RowEstimator::estimate_filter_rows gets
overridden; fix by introducing a boolean (e.g., filter_estimated) declared
before checking stmt.where() and set to true only when estimate_filter_rows is
called (keep existing bin_expr, col_name, pred_val logic and call
optimizer::RowEstimator::estimate_filter_rows as before), then change the
fallback to call estimate_scan_rows only when filter_estimated is false (not
when estimated_rows == 0), ensuring zero-row estimates from
RowEstimator::estimate_filter_rows are preserved.
- Around line 1602-1633: The join-reorder currently swaps only
left_key/right_key (and unused left_key_col/right_key_col) which breaks
pre-resolved column indices in VectorizedHashJoinOperator; instead, when you
decide to flip build/probe (est_reverse < est_forward), swap the operator
children as well: exchange current_root and right_scan alongside swapping
left_key and right_key so that the left child truly matches left_key and the
right child matches right_key; remove the dead std::swap of
left_key_col/right_key_col (they are unused), and ensure the
VectorizedHashJoinOperator is constructed from the possibly-swapped current_root
and right_scan so its constructor resolves column indices against the correct
schemas (if SQL-declared column order must be preserved, add a projection after
the join).

In `@tests/cloudSQL_tests.cpp`:
- Around line 1477-1601: The tests AnalyzeFilterSelectivity and AnalyzeJoinOrder
never enable the vectorized path because QueryExecutor is constructed but never
configured with parallel_ and storage_manager_, so execute_select() skips the
CBO logic; before running the SELECT/JOIN/ANALYZE queries call
QueryExecutor::set_parallel(true) and
QueryExecutor::set_storage_manager(&storage) so the vectorized gate (parallel_
&& storage_manager_) is true and build_vectorized_plan / CBO filter selectivity
and join reordering (kVectorizedRowThreshold logic) are exercised; also correct
the comment about the estimated row count to reflect the actual threshold
comparison.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: da842d5c-a7db-4897-a539-b98b8f05458d

📥 Commits

Reviewing files that changed from the base of the PR and between b6eaad9 and 133410b.

📒 Files selected for processing (2)

src/executor/query_executor.cpp
tests/cloudSQL_tests.cpp

coderabbitai · 2026-05-07T12:28:53Z

+        // Join reordering: estimate both join orders and pick the smaller-first approach.
+        // This heuristic puts the side estimated to produce fewer rows as the build (probe-side
+        // hash table) to reduce memory footprint and hash probe cost.
+        // Note: For multi-join chains, current_root may already be a join result — we use its
+        // estimated cardinality as the "left" side estimate.
+        std::string left_key_col = left_key ? left_key->to_string() : "";
+        std::string right_key_col = right_key ? right_key->to_string() : "";
+        uint64_t est_forward = 0;
+        uint64_t est_reverse = 0;
+        if (!left_key_col.empty() && !right_key_col.empty()) {
+            // Estimate forward: current_est_rows ⋈ join_table_meta (probe = right)
+            est_forward = optimizer::RowEstimator::estimate_join_rows(
+                ::cloudsql::TableInfo{0, "", {}, {}, {}, current_est_rows, "", 0, 0},
+                *join_table_meta, right_key_col);
+            // Estimate reverse: join_table_meta ⋈ current_est_rows (probe = left)
+            ::cloudsql::TableInfo left_est;
+            left_est.num_rows = current_est_rows;
+            est_reverse = optimizer::RowEstimator::estimate_join_rows(*join_table_meta, left_est,
+                                                                      left_key_col);
+            // If reverse order is smaller, swap the key expressions so build/probe flip.
+            // The VectorizedHashJoinOperator uses left child as build, right as probe —
+            // swapping keys redirects the hash table to the smaller side.
+            if (est_reverse < est_forward && est_reverse > 0) {
+                // Swap left_key and right_key to redirect build to the smaller side
+                auto swapped_left = std::move(right_key);
+                auto swapped_right = std::move(left_key);
+                left_key = std::move(swapped_left);
+                right_key = std::move(swapped_right);
+                // Also swap the schema-based key column names for output schema ordering
+                std::swap(left_key_col, right_key_col);
+            }
+        }


⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash # Locate VectorizedHashJoinOperator definition and its key-evaluation logic rg -n --type=cpp -C 8 "class VectorizedHashJoinOperator" rg -n --type=cpp -C 5 "left_key_\|right_key_\|build_key\|probe_key" src/executor/vectorized_operator.cpp

Repository: poyrazK/cloudSQL

Length of output: 1580

🏁 Script executed:

# Find all cpp files in the executor directory fd --type f "\.cpp$" src/executor/ | head -20

Repository: poyrazK/cloudSQL

Length of output: 147

🏁 Script executed:

# Search for VectorizedHashJoinOperator implementation in all cpp files rg -n "VectorizedHashJoinOperator::" --type=cpp -A 15 | head -100

Repository: poyrazK/cloudSQL

Length of output: 42

🏁 Script executed:

# Look for where left_key_ is used (evaluation against children) rg -n "left_key_" --type=cpp -B 2 -A 2 include/executor/

Repository: poyrazK/cloudSQL

Length of output: 2938

🏁 Script executed:

# Look at the context after line 1633 in query_executor.cpp to find operator construction sed -n '1633,1680p' src/executor/query_executor.cpp | cat -n

Repository: poyrazK/cloudSQL

Length of output: 2640

🏁 Script executed:

# Search for VectorizedHashJoinOperator construction in query_executor.cpp rg -n "VectorizedHashJoinOperator\|make_unique.*Join" src/executor/query_executor.cpp -A 3

Repository: poyrazK/cloudSQL

Length of output: 42

🏁 Script executed:

# Check if left_key_col or right_key_col are used after line 1631 sed -n '1631,1660p' src/executor/query_executor.cpp | cat -n

Repository: poyrazK/cloudSQL

Length of output: 1651

Join reordering swaps key expressions but not the operator inputs — causes column resolution failure when reorder condition fires

Prior to this block, the key assignment logic (lines 1569–1594) guarantees left_key refers to a column in current_root's schema and right_key refers to a column in right_scan's schema. The swap at lines 1626–1629 inverts only the key expressions. current_root (left child) and right_scan (right child) are not swapped.

When VectorizedHashJoinOperator is constructed (lines 1653–1654), its constructor pre-resolves column indices once:

left_key_col_idx_ = left_->output_schema().find_column(left_key_->to_string());

After the swap, left_key_ now contains the original right_scan column name, but the constructor searches for it in current_root->output_schema(). The column does not exist in that schema, causing the lookup to fail or silently select the wrong column. Similarly, right_key_ (now the original current_root column) is looked up in right_scan's schema and fails to match.

Additionally, std::swap(left_key_col, right_key_col) at line 1631 is dead code — neither variable is used after this point.

The correct fix is to swap both the operator children and the key expressions together; the output schema construction at lines 1644–1650 then rebuilds itself correctly from the swapped children:

🛠️ Proposed fix

if (est_reverse < est_forward && est_reverse > 0) { - // Swap left_key and right_key to redirect build to the smaller side - auto swapped_left = std::move(right_key); - auto swapped_right = std::move(left_key); - left_key = std::move(swapped_left); - right_key = std::move(swapped_right); - // Also swap the schema-based key column names for output schema ordering - std::swap(left_key_col, right_key_col); + // Move the smaller estimated table to the build (left) side + std::swap(current_root, right_scan); + std::swap(left_key, right_key); }

Note: Swapping the children changes the visible column order for SELECT * (the new build-side table's columns appear first). If preserving SQL-declaration column order is required, add a reordering projection after the join.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/executor/query_executor.cpp` around lines 1602 - 1633, The join-reorder currently swaps only left_key/right_key (and unused left_key_col/right_key_col) which breaks pre-resolved column indices in VectorizedHashJoinOperator; instead, when you decide to flip build/probe (est_reverse < est_forward), swap the operator children as well: exchange current_root and right_scan alongside swapping left_key and right_key so that the left child truly matches left_key and the right child matches right_key; remove the dead std::swap of left_key_col/right_key_col (they are unused), and ensure the VectorizedHashJoinOperator is constructed from the possibly-swapped current_root and right_scan so its constructor resolves column indices against the correct schemas (if SQL-declared column order must be preserved, add a projection after the join).

…comment 1. Filter selectivity: start with scan estimate as baseline, override only when filter is eligible. Fixes 0-ambiguity bug where empty tables would incorrectly fall back to full scan estimate. 2. Join reordering comment: "build (probe-side hash table)" → "build (hashed) side" — code was correct, comment was wrong. 3. Add NOTE comment about to_string() format stability dependency in column key lookup for join reordering.

Verify the join returns exactly 100 rows (small.id 0-99 match big.id 0-99), not just that it succeeds. Confirms correctness.

Outer joins (LEFT/RIGHT/FULL) must preserve build/probe semantics — reordering keys could change which side is the outer side, breaking query correctness. The reordering block is now wrapped with: if (exec_join_type == executor::JoinType::Inner) { ... } Also moves join type determination before the reordering block so the gating variable is available. Row estimate update is also scoped to inner joins only.

ADR 002 documents the outer join safety decision (inner joins only for reordering) from PR #79 review. ADR 001 gets a Phase 2 Extensions section covering filter selectivity strategies and join reordering mechanics, with a reference to ADR 002.

poyrazK

It's okay to merge

poyrazK added 3 commits May 7, 2026 14:46

tests: add AnalyzeJoinOrder to verify join reordering after ANALYZE

75ea76a

Creates big (10000 rows) and small (100 rows) tables, runs ANALYZE on both, then executes a join. Verifies the join succeeds using the Vectorized path without error, exercising the join reordering logic.

style: automated clang-format fixes

133410b

coderabbitai Bot reviewed May 7, 2026

View reviewed changes

poyrazK and others added 6 commits May 7, 2026 16:04

tests: add result row count assertion to AnalyzeJoinOrder

4d5916f

Verify the join returns exactly 100 rows (small.id 0-99 match big.id 0-99), not just that it succeeds. Confirms correctness.

style: automated clang-format fixes

7e53e29

style: automated clang-format fixes

38264a2

poyrazK commented May 7, 2026

View reviewed changes

poyrazK merged commit e4a1f57 into main May 7, 2026
9 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

CBO Phase 2: filter selectivity + join reordering#79

CBO Phase 2: filter selectivity + join reordering#79
poyrazK merged 10 commits intomainfrom
feature/cbo-phase2

poyrazK commented May 7, 2026 •

edited by coderabbitai Bot

Loading

Uh oh!

coderabbitai Bot commented May 7, 2026 •

edited

Loading

Rate limit exceeded

Walkthrough

Changes

Estimated code review effort

Possibly related PRs

Poem

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Uh oh!

Uh oh!

coderabbitai Bot May 7, 2026

Uh oh!

Uh oh!

poyrazK left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

poyrazK commented May 7, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Tests

Verification

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented May 7, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Rate limit exceeded

Walkthrough

Changes

Estimated code review effort

Possibly related PRs

Poem

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

coderabbitai Bot May 7, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

poyrazK left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

poyrazK commented May 7, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented May 7, 2026 •

edited

Loading