Skip to content

perf: DirectIndexAgg for low-cardinality integer GROUP BY#157

Open
poyrazK wants to merge 21 commits into
mainfrom
perf/group-by-optimization
Open

perf: DirectIndexAgg for low-cardinality integer GROUP BY#157
poyrazK wants to merge 21 commits into
mainfrom
perf/group-by-optimization

Conversation

@poyrazK
Copy link
Copy Markdown
Owner

@poyrazK poyrazK commented May 21, 2026

Summary

  • Add DirectIndexAgg class: O(1) direct array indexing for integer GROUP BY keys (slot = key - min_key) instead of string-based hash lookup
  • Add steal() methods to ColumnVector/NumericVector/StringVector for O(1) batch transfers via vector swap
  • Fix vectorized path activation for GROUP BY with unanalyzed tables by hoisting estimated_rows declaration to enclosing scope
  • Enable vectorized path when GROUP BY exists but ANALYZE hasn't been run (num_rows==0), allowing DirectIndexAgg optimization to activate

Performance

  • Q1 GROUP BY: ~2.6M/s → ~28-31M/s (10x improvement)
  • Q6 (no GROUP BY): unchanged at ~5.9G/s

Test plan

  • make -j4 duckdb_comparison_bench
  • ./build/duckdb_comparison_bench --benchmark_filter=BM_CloudSQL_Q[16]/100000 --benchmark_repetitions=3

Summary by CodeRabbit

  • New Features

    • Optional parallel table scans using background threads
    • Memory-mapped columns for zero-copy reads
    • Column data transfer capability to avoid intermediate copies
  • Performance

    • Faster GROUP BY for integer keys (direct-index path) and a new fast general hash path
    • Reduced data copying when assembling parallel scan results
    • Smarter planner bias toward vectorized execution for applicable queries
  • Bug Fixes

    • AVG now correctly yields NULL for groups with no non-null inputs
  • Tests

    • Updated tests to cover new execution constructor/paths

Review Change Stack

poyrazK added 4 commits May 18, 2026 17:12
Key changes:
- query_executor.cpp: Remove parallel_ requirement from use_vectorized gate
  (Fix 3: enable vectorized path for large scans without parallel mode)
- query_executor.cpp: Resolve input_col_idx from aggregate function arguments
  (Fix 5: enable SUM/AVG to work on specific columns, not just COUNT(*))
- vectorized_operator.hpp: Add AVG support in update_accumulators and produce_output_batch
- operator.cpp: Replace std::map with std::unordered_map + binary key encoding
  (Fix 1+2: O(log n) → O(1) hash, string concat → length-prefixed binary)

Benchmark baseline (Q1 @ 100k rows):
- Before fixes: ~152k rows/sec
- After fixes: ~166k rows/sec
- DuckDB: ~197M rows/sec

The vectorized path is now enabled but VectorizedGroupByOperator needs
further optimization (batch-oriented key construction) to close gap.
This commits the full parallel scan implementation:
- VectorizedSeqScanOperator now accepts optional ThreadPool parameter
- When enabled (>50k rows, >1 thread), splits table into range chunks
  and reads each range in parallel via thread pool
- Added steal() to ColumnVector/NumericVector/StringVector for O(1)
  batch transfer via vector swap (avoids per-element copy)
- Parallel results are collected and returned via steal() to out_batch

The slow benchmark (2.6M/s) vs fast (598M/s for Q6/10k) suggests the
GROUP BY aggregation is the bottleneck, not I/O. The parallel scan
overhead (file opens/closes per task) likely exceeds I/O benefit at
100k row scale.
Implements direct array indexing for integer GROUP BY keys instead
of string-based hashing. When GROUP BY is on a single integer column,
uses slot = key - min_key for O(1) lookup without hash computation.

Also adds steal() method to ColumnVector/NumericVector/StringVector
for O(1) batch transfers via vector swap.

Note: Benchmark still shows ~2.6M/s suggesting the optimization path
may not be triggered, or there's another bottleneck in the execution
path that needs investigation.
The estimated_rows variable was being redeclared inside a nested block,
causing the GROUP BY heuristic check to always see stale (0) values.
Also adds the condition to enable vectorized path when GROUP BY exists
but num_rows==0 (ANALYZE not run), allowing DirectIndexAgg optimization
to activate.
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 21, 2026

📝 Walkthrough

Walkthrough

This PR accelerates vectorized query execution by adding parallel scan support with column stealing, introducing memory-mapped storage for zero-copy reads, and implementing a direct-index aggregation fast path for low-cardinality integer GROUP BY operations. Plan selection now uses estimated row cardinality to drive vectorization decisions and provisions a ThreadPool for parallel scans.

Changes

Vectorized Execution and Parallel Processing

Layer / File(s) Summary
ColumnVector steal() interface and implementations
include/executor/types.hpp
ColumnVector declares a new pure-virtual steal(ColumnVector&&) method. NumericVector<T> and StringVector implement steal via dynamic-cast type-checking, then swap internal buffers (data, null bitmap, size); they throw std::runtime_error on type mismatch.
Memory-mapped column storage abstraction
include/storage/memory_mapped_column.hpp, src/storage/memory_mapped_column.cpp
New MemoryMappedColumn class enables zero-copy reads by mmap-backing data and null regions. Provides map(data_path, null_path, element_size, row_count) to open/mmap files and unmap() for RAII cleanup; inline helpers (data_at, null_at) return per-row pointers via pointer arithmetic.
Group-by key encoding optimization
src/executor/operator.cpp
AggregateOperator switches the grouping container from ordered std::map to std::unordered_map with pre-allocation. Group-by key generation replaces string concatenation with binary length-prefixed encoding and explicit NULL handling.
Vectorized execution planning and wiring
src/executor/query_executor.cpp
Cost-based vectorization now tracks estimated_rows across scan/filter selectivity estimation, removes the parallel_ gate, and forces vectorized GROUP BY when table stats report zero rows. Vectorized plan construction creates a shared ThreadPool sized by std::thread::hardware_concurrency() and passes it to VectorizedSeqScanOperator. Aggregate metadata derives input_col_idx by resolving the first aggregate argument column name in the operator output schema; defaults to -1 for non-column arguments like COUNT(*).
Parallel vectorized scan with column stealing
include/executor/vectorized_operator.hpp
VectorizedSeqScanOperator accepts an optional ThreadPool in its constructor and detects parallel eligibility (pool present, multiple threads, table sufficiently large). next_batch_parallel() partitions remaining rows by thread, submits concurrent ColumnarTable::read_batch tasks, waits for completion, and assembles output by stealing columns from per-thread result batches.
Direct-index and open-addressing GROUP BY aggregation
include/executor/vectorized_operator.hpp
DirectIndexAgg provides slot-based accumulation for integer keys; OpenAddressHashAgg implements an open-addressing hash aggregator. VectorizedGroupByOperator detects single-column integer GROUP BY eligibility and routes input/output to direct or open-addressing processing paths. AVG is accumulated as sums with per-group counts and emitted as NULL when count is zero.
Benchmark storage manager wiring
benchmarks/duckdb_comparison_bench.cpp
CloudSQLContext initialization calls executor->set_storage_manager(storage.get()) to enable the executor to use the provided StorageManager.
Tests updated for new overload**
tests/*
Test call sites updated to pass a third nullptr ThreadPool argument into VectorizedSeqScanOperator constructors where applicable.

Sequence Diagram(s)

sequenceDiagram
  participant SeqScan as VectorizedSeqScanOperator
  participant Pool as ThreadPool
  participant Table as ColumnarTable
  participant Out as OutputBatch
  SeqScan->>SeqScan: partition remaining rows per thread
  SeqScan->>Pool: submit read_batch tasks (per-range)
  Pool->>Table: ColumnarTable::read_batch(range)
  Table-->>Pool: per-thread result batches
  Pool-->>SeqScan: tasks complete
  SeqScan->>Out: steal/move columns from per-thread batches
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

  • poyrazK/cloudSQL#59: Related changes to GROUP BY aggregation logic and operator code paths.
  • poyrazK/cloudSQL#75: Related vectorized execution/plan-selection heuristics using estimated row counts.
  • poyrazK/cloudSQL#10: Foundational vectorized execution and columnar operator work that this PR extends.

Poem

🐰 I hop through columns, buffers light,

I swap and steal to speed the flight,
Direct-index counts in tidy rows,
mmap gives zero-copy repose,
Fast queries bloom where the rabbit goes.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 5.66% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The PR title 'perf: DirectIndexAgg for low-cardinality integer GROUP BY' directly reflects the primary optimization introduced—a performance enhancement using the DirectIndexAgg class for efficient integer GROUP BY operations.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch perf/group-by-optimization

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 5

🧹 Nitpick comments (3)
src/executor/operator.cpp (1)

577-605: ⚡ Quick win

Hoist key outside the loop and prefer try_emplace to avoid per-row allocations.

std::string key; is default-constructed inside the loop and then reserve(64) is called every iteration, which forces a heap allocation per input row (SSO capacity is well under 64). Hoisting the buffer above the while and using clear() to reset preserves capacity across rows and removes that per-row malloc/free pair from this hot path. While here, find followed by emplace does two hash probes on the miss path; try_emplace collapses it to one and avoids constructing the key string twice.

♻️ Proposed refactor
     Tuple tuple;
     auto child_schema = child_->output_schema();
+    std::string key;
+    key.reserve(64);  // reused across rows; capacity is preserved by clear()
     while (child_->next(tuple)) {
-        std::string key;
-        key.reserve(64);  // Pre-reserve to avoid repeated allocations
+        key.clear();
         std::vector<common::Value> gb_vals;
@@
-        auto it = groups_map.find(key);
-        if (it == groups_map.end()) {
-            it = groups_map.emplace(key, GroupState(aggregates_.size())).first;
-            it->second.group_values = std::move(gb_vals);
+        auto [it, inserted] = groups_map.try_emplace(key, GroupState(aggregates_.size()));
+        if (inserted) {
+            it->second.group_values = std::move(gb_vals);
         }

Also note the comment on Line 585 ("no string allocation") is slightly misleading — val.to_string() still allocates for non-null values; only the per-iteration key buffer can be made allocation-free.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/executor/operator.cpp` around lines 577 - 605, Hoist the std::string key
buffer out of the per-row loop and reserve its capacity once (e.g.,
key.reserve(64) before the loop) then call key.clear() each iteration to
preserve capacity and avoid per-row heap allocations; keep gb_vals as a local
per-row vector but reuse it similarly if needed. Replace the find+emplace
pattern on groups_map with a single try_emplace call (using key as the
lookup/insert key and constructing GroupState(aggregates_.size()) only on miss)
to eliminate the double hash probe and duplicate key construction; after
insertion/update, assign it->second.group_values = std::move(gb_vals) as before.
Apply these changes around the existing symbols key, group_by_, gb_vals,
groups_map, try_emplace, GroupState, and aggregates_.
src/executor/query_executor.cpp (1)

1515-1517: ⚡ Quick win

Thread pool recreation on every vectorized query.

A new ThreadPool is created for each vectorized query execution (line 1515). Thread pool creation incurs overhead from spawning worker threads. For workloads with many short queries, this repeated construction/destruction degrades throughput. Consider caching a shared ThreadPool at the QueryExecutor instance level for reuse across queries.

♻️ Suggested approach

Add a member to QueryExecutor:

std::shared_ptr<ThreadPool> thread_pool_;

Initialize once in the constructor:

QueryExecutor::QueryExecutor(...) 
    : ..., thread_pool_(std::make_shared<ThreadPool>(std::thread::hardware_concurrency())) {}

Reuse in build_vectorized_plan:

-auto thread_pool = std::make_shared<executor::ThreadPool>(std::thread::hardware_concurrency());
 std::unique_ptr<VectorizedOperator> current_root =
-    std::make_unique<VectorizedSeqScanOperator>(base_table_name, col_table, thread_pool);
+    std::make_unique<VectorizedSeqScanOperator>(base_table_name, col_table, thread_pool_);
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/executor/query_executor.cpp` around lines 1515 - 1517, The vectorized
path currently creates a new ThreadPool per query; add a
std::shared_ptr<executor::ThreadPool> thread_pool_ member to QueryExecutor,
initialize it once in QueryExecutor's constructor (using
std::thread::hardware_concurrency() or configured size), and update
build_vectorized_plan (where ThreadPool is currently created and used to
construct VectorizedSeqScanOperator) to reuse thread_pool_ instead of making a
fresh pool; ensure lifetime is managed by the QueryExecutor instance and remove
the local std::make_shared<executor::ThreadPool> creation in
build_vectorized_plan.
include/executor/vectorized_operator.hpp (1)

80-83: 💤 Low value

Consider making parallel scan threshold configurable.

The hardcoded 50,000 row threshold for enabling parallel scan may not be optimal for all workloads. The crossover point depends on thread pool size, column count, column width, and system characteristics. Consider making this threshold configurable or using a more sophisticated cost model.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@include/executor/vectorized_operator.hpp` around lines 80 - 83, The code
currently hardcodes the parallel-scan enable threshold (50000) in the block that
sets num_threads_ and parallel_enabled_; replace this magic number with a
configurable parameter (e.g., parallel_scan_threshold_) that is read from
operator configuration or a global runtime config and fall back to 50000 if
unset, expose it via the VectorizedOperator constructor or a setter so callers
can tune it, and update the check to set parallel_enabled_ = table_->row_count()
> parallel_scan_threshold_; optionally document the new parameter where operator
options are defined.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@include/executor/types.hpp`:
- Around line 247-252: The doc for virtual void steal(ColumnVector&& other)
promises that 'other' is emptied after stealing but current implementations only
swap buffers; update each steal implementation (e.g., ColumnVector::steal and
StringVector::steal) to perform the swap and then explicitly clear or reset the
moved-from 'other' (call its clear/reset method, set size/length to zero, and
release/empty any auxiliary buffers) so that after the call 'other' is empty
while 'this' holds the original data; ensure you use the class-specific
clear/reset APIs to leave 'other' in a valid empty state.

In `@include/executor/vectorized_operator.hpp`:
- Around line 385-392: The GroupSlot struct has sums_float64 sized as
MAX_AGGREGATES / 2 which can be indexed up to aggregates_.size() in
process_input_batch_direct(), causing OOB access; change sums_float64 to match
sums_int64 (size MAX_AGGREGATES) so all aggregate indexes
(0..aggregates_.size()-1) are valid, and update any related assumptions/comments
referencing half-size to reflect the new full-size array; ensure GroupSlot,
sums_float64, MAX_AGGREGATES and process_input_batch_direct usage remain
consistent.
- Around line 417-431: The find_or_insert method currently grows slots_ to cover
min_key_/max_key_ which can allocate huge memory for sparse/wide keys; modify
find_or_insert(int64_t key1, int64_t key2) to check the computed new_size
(max_key_ - min_key_ + 1) against a configured safe threshold (e.g.,
MAX_SLOT_RANGE) before resizing, and if it exceeds the threshold switch to or
fallback on a sparse container (e.g., an unordered_map or existing hash-based
path) or return an error/limit-hit indicator instead of resizing; ensure you
reference and update min_key_, max_key_, and slots_ only when within limits and
add clear handling in callers of find_or_insert for the fallback/limit
condition.

In `@include/storage/memory_mapped_column.hpp`:
- Around line 45-54: The helpers data_at() and null_at() currently do pointer
arithmetic without validating row_idx; update both functions to return nullptr
if data_region_.addr or null_region_.addr is null OR if row_idx >= row_count_,
ensuring you check bounds before computing static_cast<char*>(data_region_.addr)
+ row_idx * element_size_ (and similarly for null_region_); this prevents
out-of-bounds pointers — use the existing members row_count_, element_size_,
data_region_.addr and null_region_.addr to perform the checks and only perform
the pointer arithmetic when within range.

In `@src/executor/query_executor.cpp`:
- Around line 1709-1711: The code unsafely casts the size_t from
current_root->output_schema().find_column(col->name()) to int; first capture the
result into a size_t (e.g., auto idx =
current_root->output_schema().find_column(col->name())), check explicitly for
the not-found sentinel (idx == size_t(-1)) and handle that case (set
info.input_col_idx = -1 or raise/log an error), otherwise safely
static_cast<int>(idx) into info.input_col_idx; update references to
info.input_col_idx and use col->name() in the check to preserve current
behavior.

---

Nitpick comments:
In `@include/executor/vectorized_operator.hpp`:
- Around line 80-83: The code currently hardcodes the parallel-scan enable
threshold (50000) in the block that sets num_threads_ and parallel_enabled_;
replace this magic number with a configurable parameter (e.g.,
parallel_scan_threshold_) that is read from operator configuration or a global
runtime config and fall back to 50000 if unset, expose it via the
VectorizedOperator constructor or a setter so callers can tune it, and update
the check to set parallel_enabled_ = table_->row_count() >
parallel_scan_threshold_; optionally document the new parameter where operator
options are defined.

In `@src/executor/operator.cpp`:
- Around line 577-605: Hoist the std::string key buffer out of the per-row loop
and reserve its capacity once (e.g., key.reserve(64) before the loop) then call
key.clear() each iteration to preserve capacity and avoid per-row heap
allocations; keep gb_vals as a local per-row vector but reuse it similarly if
needed. Replace the find+emplace pattern on groups_map with a single try_emplace
call (using key as the lookup/insert key and constructing
GroupState(aggregates_.size()) only on miss) to eliminate the double hash probe
and duplicate key construction; after insertion/update, assign
it->second.group_values = std::move(gb_vals) as before. Apply these changes
around the existing symbols key, group_by_, gb_vals, groups_map, try_emplace,
GroupState, and aggregates_.

In `@src/executor/query_executor.cpp`:
- Around line 1515-1517: The vectorized path currently creates a new ThreadPool
per query; add a std::shared_ptr<executor::ThreadPool> thread_pool_ member to
QueryExecutor, initialize it once in QueryExecutor's constructor (using
std::thread::hardware_concurrency() or configured size), and update
build_vectorized_plan (where ThreadPool is currently created and used to
construct VectorizedSeqScanOperator) to reuse thread_pool_ instead of making a
fresh pool; ensure lifetime is managed by the QueryExecutor instance and remove
the local std::make_shared<executor::ThreadPool> creation in
build_vectorized_plan.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: b5a44b60-77fa-4086-ab79-c50bf70f0427

📥 Commits

Reviewing files that changed from the base of the PR and between fa275e2 and 864bbca.

📒 Files selected for processing (7)
  • benchmarks/duckdb_comparison_bench.cpp
  • include/executor/types.hpp
  • include/executor/vectorized_operator.hpp
  • include/storage/memory_mapped_column.hpp
  • src/executor/operator.cpp
  • src/executor/query_executor.cpp
  • src/storage/memory_mapped_column.cpp

Comment on lines +247 to +252
/**
* @brief Steals data from another column vector by swapping internal buffers.
* After steal(), 'other' is emptied and 'this' holds the original data from 'other'.
* Throws std::runtime_error if types are incompatible.
*/
virtual void steal(ColumnVector&& other) = 0;
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Semantic contract mismatch between documentation and implementation.

The documentation states "After steal(), 'other' is emptied," but the implementations (lines 339-345, 411-417) only swap internal buffers without clearing other. After steal(), both vectors remain valid with swapped contents rather than other being empty. This violates the documented contract and could mislead callers who rely on other being emptied.

🔄 Proposed fix: clear 'other' after swapping

Update implementations to explicitly clear other after swapping:

 void steal(ColumnVector&& other) override {
     auto* other_num = dynamic_cast<NumericVector<T>*>(&other);
     if (!other_num) throw std::runtime_error("NumericVector::steal: type mismatch");
     data_.swap(other_num->data_);
     null_bitmap_.swap(other_num->null_bitmap_);
     std::swap(size_, other_num->size_);
+    other_num->data_.clear();
+    other_num->null_bitmap_.clear();
+    other_num->size_ = 0;
 }

Apply the same pattern to StringVector::steal at lines 411-417.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@include/executor/types.hpp` around lines 247 - 252, The doc for virtual void
steal(ColumnVector&& other) promises that 'other' is emptied after stealing but
current implementations only swap buffers; update each steal implementation
(e.g., ColumnVector::steal and StringVector::steal) to perform the swap and then
explicitly clear or reset the moved-from 'other' (call its clear/reset method,
set size/length to zero, and release/empty any auxiliary buffers) so that after
the call 'other' is empty while 'this' holds the original data; ensure you use
the class-specific clear/reset APIs to leave 'other' in a valid empty state.

Comment thread include/executor/vectorized_operator.hpp
Comment thread include/executor/vectorized_operator.hpp
Comment on lines +45 to +54
const void* data_at(size_t row_idx) const {
if (!data_region_.addr) return nullptr;
return static_cast<char*>(data_region_.addr) + row_idx * element_size_;
}

// Direct pointer to null bit at row index
const uint8_t* null_at(size_t row_idx) const {
if (!null_region_.addr) return nullptr;
return static_cast<uint8_t*>(null_region_.addr) + row_idx;
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Add bounds validation to pointer arithmetic helpers.

data_at() and null_at() perform pointer arithmetic without validating that row_idx < row_count_. Returning out-of-bounds pointers can lead to undefined behavior when dereferenced by callers. Even if the caller is expected to validate, defensive bounds checking here prevents subtle bugs.

🛡️ Proposed fix: add bounds checking
 const void* data_at(size_t row_idx) const {
-    if (!data_region_.addr) return nullptr;
+    if (!data_region_.addr || row_idx >= row_count_) return nullptr;
     return static_cast<char*>(data_region_.addr) + row_idx * element_size_;
 }

 const uint8_t* null_at(size_t row_idx) const {
-    if (!null_region_.addr) return nullptr;
+    if (!null_region_.addr || row_idx >= row_count_) return nullptr;
     return static_cast<uint8_t*>(null_region_.addr) + row_idx;
 }
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
const void* data_at(size_t row_idx) const {
if (!data_region_.addr) return nullptr;
return static_cast<char*>(data_region_.addr) + row_idx * element_size_;
}
// Direct pointer to null bit at row index
const uint8_t* null_at(size_t row_idx) const {
if (!null_region_.addr) return nullptr;
return static_cast<uint8_t*>(null_region_.addr) + row_idx;
}
const void* data_at(size_t row_idx) const {
if (!data_region_.addr || row_idx >= row_count_) return nullptr;
return static_cast<char*>(data_region_.addr) + row_idx * element_size_;
}
// Direct pointer to null bit at row index
const uint8_t* null_at(size_t row_idx) const {
if (!null_region_.addr || row_idx >= row_count_) return nullptr;
return static_cast<uint8_t*>(null_region_.addr) + row_idx;
}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@include/storage/memory_mapped_column.hpp` around lines 45 - 54, The helpers
data_at() and null_at() currently do pointer arithmetic without validating
row_idx; update both functions to return nullptr if data_region_.addr or
null_region_.addr is null OR if row_idx >= row_count_, ensuring you check bounds
before computing static_cast<char*>(data_region_.addr) + row_idx * element_size_
(and similarly for null_region_); this prevents out-of-bounds pointers — use the
existing members row_count_, element_size_, data_region_.addr and
null_region_.addr to perform the checks and only perform the pointer arithmetic
when within range.

Comment on lines +1709 to +1711
info.input_col_idx = static_cast<int>(
current_root->output_schema().find_column(col->name()));
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Unsafe cast of size_t to int for column index.

find_column() returns size_t(-1) to indicate "not found" (line 1710). Casting size_t(-1) directly to int is implementation-defined and may not reliably produce -1. On platforms where sizeof(size_t) > sizeof(int), the cast can produce unexpected values. Check for "not found" explicitly before casting.

🔧 Proposed fix: explicit not-found check
 if (!func->args().empty()) {
     const auto& arg = func->args()[0];
     if (arg->type() == parser::ExprType::Column) {
         const auto* col = dynamic_cast<const parser::ColumnExpr*>(arg.get());
         if (col != nullptr) {
-            info.input_col_idx = static_cast<int>(
-                current_root->output_schema().find_column(col->name()));
+            size_t col_idx = current_root->output_schema().find_column(col->name());
+            info.input_col_idx = (col_idx != static_cast<size_t>(-1)) 
+                ? static_cast<int>(col_idx) 
+                : -1;
         }
     }
 }
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
info.input_col_idx = static_cast<int>(
current_root->output_schema().find_column(col->name()));
}
size_t col_idx = current_root->output_schema().find_column(col->name());
info.input_col_idx = (col_idx != static_cast<size_t>(-1))
? static_cast<int>(col_idx)
: -1;
}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/executor/query_executor.cpp` around lines 1709 - 1711, The code unsafely
casts the size_t from current_root->output_schema().find_column(col->name()) to
int; first capture the result into a size_t (e.g., auto idx =
current_root->output_schema().find_column(col->name())), check explicitly for
the not-found sentinel (idx == size_t(-1)) and handle that case (set
info.input_col_idx = -1 or raise/log an error), otherwise safely
static_cast<int>(idx) into info.input_col_idx; update references to
info.input_col_idx and use col->name() in the check to preserve current
behavior.

poyrazK and others added 8 commits May 21, 2026 16:21
…_insert

- Remove unused valid_slot_indices_ member (produces_output_batch_direct
  iterates direct_group_keys_ instead)
- Remove orphaned valid_slots() accessor
- Fix find_or_insert to mark slot.valid=true on first insertion (was
  relying on accidental post-condition in caller)
- Open-addressing hash table with linear probing and FNV-1a 64-bit hash
- Binary key encoding: [type_tag (1B)][len (4B)][data...] - avoids string allocation
- HashBucket stores key_bytes inline for fast comparison, is_new flag for iteration
- Added to VectorizedGroupByOperator as fallback when DirectIndexAgg doesn't apply
- Fixed grow() to use key_data instead of casting key_int64

Note: process_input_batch_open_addressing() is written but not yet wired
into process_input_batch() - the old hash path is still being used.
This is intentional to allow incremental testing.
Instead of returning an error for GROUP BY keys larger than 56 bytes,
use std::vector<uint8_t> heap_key when stack buffer is too small.
Also fix hash_agg_.find_or_insert call to use key_ptr instead of
key_buf when heap allocation was used.
- Add slot() accessor to OpenAddressHashAgg for bucket iteration
- Add produce_output_batch_open_addressing() to iterate hash_agg_ buckets
- Wire it into produce_output_batch() instead of the old hash path
- Keep process_input_batch_hash as dead code for now (for reference)
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
include/executor/vectorized_operator.hpp (1)

68-72: ⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

Build failure: ThreadPool is undeclared.

The pipeline failures confirm that ThreadPool is used but never included or forward-declared. This breaks the entire CI build across all configurations.

Add the missing include or forward declaration at the top of the file:

🐛 Proposed fix
 `#include` "executor/types.hpp"
 `#include` "parser/expression.hpp"
 `#include` "storage/columnar_table.hpp"
+#include "concurrency/thread_pool.hpp"  // or appropriate header for ThreadPool

If ThreadPool is defined elsewhere, ensure the correct header path is used.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@include/executor/vectorized_operator.hpp` around lines 68 - 72, The build
fails because ThreadPool is referenced by the member thread_pool_ in
vectorized_operator.hpp but ThreadPool is neither included nor forward-declared;
fix by either adding the proper header include that defines ThreadPool at the
top of include/executor/vectorized_operator.hpp or, if ThreadPool is a
class/struct declared in another header, add a forward declaration (e.g., class
ThreadPool;) before it is used; ensure the chosen fix resolves the reference for
the member std::shared_ptr<ThreadPool> thread_pool_ and compiles across all
configurations.
♻️ Duplicate comments (1)
include/executor/vectorized_operator.hpp (1)

394-394: ⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

sums_float64 array is undersized for MAX_AGGREGATES.

Same issue as identified in DirectIndexAgg::GroupSlot: sums_float64 is declared with size MAX_AGGREGATES / 2 (4 elements) while aggregate indices can range from 0 to aggregates_.size() - 1. If more than 4 aggregates are used, bucket.sums_float64[i] at lines 824-825 causes out-of-bounds access.

🐛 Proposed fix: match array sizes
-        double sums_float64[MAX_AGGREGATES / 2] = {0.0};
+        double sums_float64[MAX_AGGREGATES] = {0.0};
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@include/executor/vectorized_operator.hpp` at line 394, The local array
sums_float64 is declared too small (double sums_float64[MAX_AGGREGATES / 2])
which can overflow when aggregates_.size() > MAX_AGGREGATES/2; change its
declaration to use the full MAX_AGGREGATES (e.g. double
sums_float64[MAX_AGGREGATES]) so accesses like bucket.sums_float64[i] (used in
VectorizedOperator / GroupSlot logic) cannot go out of bounds and match the
other aggregate arrays.
🧹 Nitpick comments (1)
include/executor/vectorized_operator.hpp (1)

728-733: 💤 Low value

Dead code: slot validity check is always true.

After find_or_insert() returns, slot.valid is always true (set at line 579), so the condition if (!slot.valid) at line 729 is never entered. This makes lines 730-732 unreachable.

♻️ Proposed cleanup
             size_t slot_idx = agg_.find_or_insert(key, 0);
             auto& slot = agg_.slot(slot_idx);

-            if (!slot.valid) {
-                slot.valid = true;
-                slot.key1 = key;
+            if (slot.key1 != key) {  // First time seeing this key
+                slot.key1 = key;
                 direct_group_keys_.push_back(key);
             }

Or simply track new keys differently (e.g., via a return value from find_or_insert).

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@include/executor/vectorized_operator.hpp` around lines 728 - 733, The
conditional checking slot.valid after calling find_or_insert() is dead because
find_or_insert() always sets slot.valid to true; update the API so the caller
can tell if the slot was newly inserted and only then push to
direct_group_keys_. Concretely, change find_or_insert(...) to return a boolean
(or a pair/struct) indicating "was_inserted" (or return an enum) alongside the
slot reference, then replace the if (!slot.valid) branch with a check on that
returned flag and push key to direct_group_keys_ only when was_inserted is true;
update call sites that use find_or_insert() (the caller around
direct_group_keys_ and the find_or_insert() implementation) accordingly.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@include/executor/vectorized_operator.hpp`:
- Around line 804-807: The code is pushing bucket.key_int64 into
hash_group_keys_ even for non-int64 GROUP BY keys, causing uninitialized values
to be emitted; update the open-addressing path in vectorized_operator.hpp so
that when a new bucket is created you either store the full key (decoded from
bucket.key_data) instead of key_int64, or only allow this path for direct int64
keys (use the same is_direct_indexable_ guard) and fall back to the hash-based
path for string/complex keys; ensure produce_output_batch_open_addressing() can
retrieve the correct key form you store (reference bucket.key_data,
bucket.key_int64, hash_group_keys_, produce_output_batch_open_addressing(), and
is_direct_indexable_ to locate the change).
- Around line 444-446: find_or_insert() copies key_len bytes into
bucket.key_data (a fixed 64-byte buffer) without bounds checking, causing a
buffer overflow when key_len > 64; update find_or_insert() to guard the copy: if
key_len <= sizeof(bucket.key_data) copy into bucket.key_data as now, otherwise
set bucket.key_len and bucket.key_type appropriately and store the full key into
the existing dynamic storage (heap_key / key_string path) instead of copying
into key_data; also update key comparison logic and grow()'s rehashing to
consult key_string/heap_key when bucket.key_len > 64 so long keys are compared
and moved safely; ensure process_input_batch_open_addressing() still routes long
keys to the heap path but that find_or_insert()/grow() use the same overflow
branch.

---

Outside diff comments:
In `@include/executor/vectorized_operator.hpp`:
- Around line 68-72: The build fails because ThreadPool is referenced by the
member thread_pool_ in vectorized_operator.hpp but ThreadPool is neither
included nor forward-declared; fix by either adding the proper header include
that defines ThreadPool at the top of include/executor/vectorized_operator.hpp
or, if ThreadPool is a class/struct declared in another header, add a forward
declaration (e.g., class ThreadPool;) before it is used; ensure the chosen fix
resolves the reference for the member std::shared_ptr<ThreadPool> thread_pool_
and compiles across all configurations.

---

Duplicate comments:
In `@include/executor/vectorized_operator.hpp`:
- Line 394: The local array sums_float64 is declared too small (double
sums_float64[MAX_AGGREGATES / 2]) which can overflow when aggregates_.size() >
MAX_AGGREGATES/2; change its declaration to use the full MAX_AGGREGATES (e.g.
double sums_float64[MAX_AGGREGATES]) so accesses like bucket.sums_float64[i]
(used in VectorizedOperator / GroupSlot logic) cannot go out of bounds and match
the other aggregate arrays.

---

Nitpick comments:
In `@include/executor/vectorized_operator.hpp`:
- Around line 728-733: The conditional checking slot.valid after calling
find_or_insert() is dead because find_or_insert() always sets slot.valid to
true; update the API so the caller can tell if the slot was newly inserted and
only then push to direct_group_keys_. Concretely, change find_or_insert(...) to
return a boolean (or a pair/struct) indicating "was_inserted" (or return an
enum) alongside the slot reference, then replace the if (!slot.valid) branch
with a check on that returned flag and push key to direct_group_keys_ only when
was_inserted is true; update call sites that use find_or_insert() (the caller
around direct_group_keys_ and the find_or_insert() implementation) accordingly.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 53c94c16-99dc-4b16-8045-561114c03eb8

📥 Commits

Reviewing files that changed from the base of the PR and between 864bbca and bb64516.

📒 Files selected for processing (1)
  • include/executor/vectorized_operator.hpp

Comment on lines +444 to +446
bucket.key_len = static_cast<uint32_t>(key_len);
bucket.key_type = key[0];
std::memcpy(bucket.key_data, key, key_len);
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

Buffer overflow when key_len > 64.

key_data is a fixed 64-byte buffer, but find_or_insert() copies key_len bytes without bounds checking. While process_input_batch_open_addressing() switches to heap_key when the encoded key exceeds 64 bytes (lines 789-793), that larger key is still passed to find_or_insert() which blindly copies it into the 64-byte key_data buffer.

🐛 Proposed fix: guard the copy or use dynamic storage

Option 1 - Guard the copy and use key_string for overflow:

     HashBucket& find_or_insert(const uint8_t* key, size_t key_len, uint64_t hash) {
         // ... existing code ...
         if (!bucket.occupied) {
             bucket.occupied = true;
             bucket.is_new = true;
             bucket.key_hash = hash;
             bucket.key_len = static_cast<uint32_t>(key_len);
             bucket.key_type = key[0];
-            std::memcpy(bucket.key_data, key, key_len);
+            if (key_len <= sizeof(bucket.key_data)) {
+                std::memcpy(bucket.key_data, key, key_len);
+            } else {
+                bucket.key_string.assign(reinterpret_cast<const char*>(key), key_len);
+            }

Then update the key comparison and grow() rehashing to check key_string when key_len > 64.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@include/executor/vectorized_operator.hpp` around lines 444 - 446,
find_or_insert() copies key_len bytes into bucket.key_data (a fixed 64-byte
buffer) without bounds checking, causing a buffer overflow when key_len > 64;
update find_or_insert() to guard the copy: if key_len <= sizeof(bucket.key_data)
copy into bucket.key_data as now, otherwise set bucket.key_len and
bucket.key_type appropriately and store the full key into the existing dynamic
storage (heap_key / key_string path) instead of copying into key_data; also
update key comparison logic and grow()'s rehashing to consult
key_string/heap_key when bucket.key_len > 64 so long keys are compared and moved
safely; ensure process_input_batch_open_addressing() still routes long keys to
the heap path but that find_or_insert()/grow() use the same overflow branch.

Comment thread include/executor/vectorized_operator.hpp
poyrazK and others added 8 commits May 22, 2026 15:14
- Remove unused key_string field from HashBucket (saves 32 bytes/bucket)
- Delete process_input_batch_hash dead code (never called)
- Fix self-assignment warning (resumed_bucket_idx_ = resumed_bucket_idx_)
- Add trailing newlines to MemoryMappedColumn files
…Operator

The constructor signature changed to accept an optional thread_pool
parameter, but the test constructions weren't updated.
…canOperator construction

Line 57 uses direct construction, not make_unique, so wasn't caught by
the previous nullptr fix.
g++ was treating ThreadPool as int in forward declaration resolution.
Include the header to ensure the type is fully defined.
Store actual common::Value objects in hash_group_keys_ instead of
just int64_t keys, preserving column type (TEXT, INT64, etc.)
for correct output emission.
- Add mins/maxes/has_mins fields to HashBucket for MIN/MAX tracking
- Wire MIN/MAX into process_input_batch_open_addressing accumulator loop
- Emit MIN/MAX in produce_output_batch_open_addressing
- Fix VectorizedSeqScanOperator construction in vectorized_operator_tests
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (3)
src/storage/memory_mapped_column.cpp (2)

17-35: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Clean up existing mappings before remapping.

Calling map() twice on the same instance leaks the previous data mapping: data_region_ is overwritten on Line 35 before the old region is released. Either unmap() at the top of map() or reject remapping while already mapped.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/storage/memory_mapped_column.cpp` around lines 17 - 35,
MemoryMappedColumn::map currently overwrites data_region_ (and similarly
null_region_ if present) when called a second time, leaking the previous mmap
and file descriptor; before creating new mappings in map(), call the class's
unmap() helper (or implement unmap() if missing) to munmap existing regions and
close their fds, or alternatively detect an existing mapping and return false to
reject remapping; update MemoryMappedColumn::map to invoke unmap() at the start
(or guard with a check on data_region_.ptr/file descriptor) so the old mapping
is properly released before assigning data_region_ and null_region_.

23-57: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Validate file sizes before exposing row-based access.

map() never checks that the mapped lengths are consistent with row_count/element_size. A truncated column can map successfully and then data_at() / null_at() will hand out pointers beyond the mapped region; empty columns also hit mmap(..., 0, ...), which fails. Please add overflow-safe size checks before mapping and special-case the zero-row path instead of calling mmap with a zero length.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/storage/memory_mapped_column.cpp` around lines 23 - 57, The map()
implementation maps data and null files without validating that st.st_size
matches the expected sizes derived from row_count and element_size, and it calls
mmap with length 0 for empty columns; update map() to compute expected_data_size
= row_count * element_size and expected_null_size = (row_count + 7) / 8 using
overflow-safe multiplication/checks, verify st.st_size == expected size (or at
least >= expected size) before calling mmap, and treat zero-row cases specially
by not calling mmap (set data_region_ / null_region_ to an empty sentinel and
keep their fds closed) and return false/cleanup if sizes mismatch; ensure
unmap()/::close() are called on any early failures to avoid leaks.
include/storage/memory_mapped_column.hpp (1)

19-35: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Make MemoryMappedColumn non-copyable (and explicitly handle move semantics).

MemoryMappedColumn owns raw mmap pointers and file descriptors, but it’s implicitly copyable; a copy would duplicate the handles and both destructors would later munmap/close the same resources. Delete the copy constructor and copy assignment (and explicitly decide move ctor/assignment, since current destructor suppresses implicit moves).

Proposed fix
 class MemoryMappedColumn {
    public:
+    MemoryMappedColumn(const MemoryMappedColumn&) = delete;
+    MemoryMappedColumn& operator=(const MemoryMappedColumn&) = delete;
     ~MemoryMappedColumn() { unmap(); }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@include/storage/memory_mapped_column.hpp` around lines 19 - 35, The class
MemoryMappedColumn must be made non-copyable and support move semantics: delete
the copy constructor and copy assignment operator, and add a noexcept move
constructor and move assignment that transfer ownership of MappedRegion fields
(data_region_ and null_region_) and scalar state (element_size_,
is_fixed_width_, row_count_) to the destination, leaving the source regions with
addr=nullptr, size=0, fd=-1 and resetting scalar state as appropriate; ensure
unmap() and the destructor continue to work and that moved-from objects are safe
to destroy. Use the existing MappedRegion struct and unmap() helper when
implementing moves so file descriptors and mmap pointers are not duplicated or
closed twice.
♻️ Duplicate comments (2)
include/executor/vectorized_operator.hpp (2)

391-398: ⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

sums_float64 can still overflow for later aggregate positions.

bucket.sums_float64[i] is indexed by the aggregate slot, but the backing array is still only MAX_AGGREGATES / 2. A fifth FLOAT SUM/AVG will write past the end of the bucket. DirectIndexAgg::GroupSlot has the same layout, so the direct path still has the same memory corruption risk.

Also applies to: 821-832

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@include/executor/vectorized_operator.hpp` around lines 391 - 398, The
sums_float64 array is too small (MAX_AGGREGATES/2) causing out-of-bounds writes
when more than half the aggregates are FLOAT SUM/AVG; change its declaration and
any matching layout in DirectIndexAgg::GroupSlot (and the other occurrence
around the 821-832 region) to allocate space for MAX_AGGREGATES (or otherwise
compute exact per-aggregate float count) and adjust initializers (e.g., {0.0})
accordingly so each aggregate slot maps safely to bucket.sums_float64[i] without
overflow.

399-401: ⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

Fixed-size key_data[64] still breaks wide GROUP BY keys.

hash_group_keys_ fixes output reconstruction, but insertion/rehash still stores the encoded key in bucket.key_data[64]. Any composite key longer than 64 bytes will still overflow the bucket during find_or_insert()/grow(), so the open-addressing path is not safe for wide keys yet.

Also applies to: 804-805

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@include/executor/vectorized_operator.hpp` around lines 399 - 401, The
bucket's fixed-size key_data[64] overflows for wide composite GROUP BY keys;
replace the in-bucket fixed buffer with an indirection (e.g., a key_ptr /
key_index plus key_len and key_type) so buckets store either a pointer to
heap-allocated key bytes or an index into a separate contiguous key storage
(small-buffer-optimization can still be used in that external storage). Update
all code paths that read/write bucket.key_data: change bucket layout (remove
key_data[64], add key_ptr/key_index and key_len/key_type), then modify
find_or_insert(), grow(), and any rehash/move logic to allocate/copy keys into
the external storage and move pointers/indices instead of memcpy’ing into
key_data; ensure proper allocation, deallocation, and copy semantics and that
hash_group_keys_ reconstruction uses the same external storage semantics.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@include/executor/vectorized_operator.hpp`:
- Around line 835-846: DirectIndexAgg’s fast path currently only updates
COUNT/SUM/AVG, so add handling for AggregateType::Min and AggregateType::Max in
the DirectIndexAgg update and emit logic (mirror the OpenAddressHashAgg
behavior): when agg.input_col_idx >= 0 read the value from
batch.get_column(agg.input_col_idx), check nulls, and update the per-bucket
fields bucket.mins[i], bucket.maxes[i] and bucket.has_mins[i] (and initialize
both min/max on first seen) just like the OpenAddressHashAgg branch; also ensure
the emit/finalize code for DirectIndexAgg outputs MIN/MAX correctly (use the has
flag to decide NULL vs value). Apply the same changes in the second occurrence
noted around the other block (the region referenced by lines 1027-1041).

In `@tests/vectorized_operator_tests.cpp`:
- Line 117: The test suite uses inconsistent calls to VectorizedSeqScanOperator;
inspect the constructor declaration for VectorizedSeqScanOperator to determine
whether the third parameter is required or has a default, then make a consistent
fix: either add a default value to the constructor signature in the class (e.g.,
give the third parameter a default nullptr) or update every test call site (all
uses of VectorizedSeqScanOperator(...) across tests) to pass the third argument
(nullptr) so signatures match; search for all call sites and adjust them, and
update the header/implementation where the constructor is declared/defined to
keep tests compiling.

---

Outside diff comments:
In `@include/storage/memory_mapped_column.hpp`:
- Around line 19-35: The class MemoryMappedColumn must be made non-copyable and
support move semantics: delete the copy constructor and copy assignment
operator, and add a noexcept move constructor and move assignment that transfer
ownership of MappedRegion fields (data_region_ and null_region_) and scalar
state (element_size_, is_fixed_width_, row_count_) to the destination, leaving
the source regions with addr=nullptr, size=0, fd=-1 and resetting scalar state
as appropriate; ensure unmap() and the destructor continue to work and that
moved-from objects are safe to destroy. Use the existing MappedRegion struct and
unmap() helper when implementing moves so file descriptors and mmap pointers are
not duplicated or closed twice.

In `@src/storage/memory_mapped_column.cpp`:
- Around line 17-35: MemoryMappedColumn::map currently overwrites data_region_
(and similarly null_region_ if present) when called a second time, leaking the
previous mmap and file descriptor; before creating new mappings in map(), call
the class's unmap() helper (or implement unmap() if missing) to munmap existing
regions and close their fds, or alternatively detect an existing mapping and
return false to reject remapping; update MemoryMappedColumn::map to invoke
unmap() at the start (or guard with a check on data_region_.ptr/file descriptor)
so the old mapping is properly released before assigning data_region_ and
null_region_.
- Around line 23-57: The map() implementation maps data and null files without
validating that st.st_size matches the expected sizes derived from row_count and
element_size, and it calls mmap with length 0 for empty columns; update map() to
compute expected_data_size = row_count * element_size and expected_null_size =
(row_count + 7) / 8 using overflow-safe multiplication/checks, verify st.st_size
== expected size (or at least >= expected size) before calling mmap, and treat
zero-row cases specially by not calling mmap (set data_region_ / null_region_ to
an empty sentinel and keep their fds closed) and return false/cleanup if sizes
mismatch; ensure unmap()/::close() are called on any early failures to avoid
leaks.

---

Duplicate comments:
In `@include/executor/vectorized_operator.hpp`:
- Around line 391-398: The sums_float64 array is too small (MAX_AGGREGATES/2)
causing out-of-bounds writes when more than half the aggregates are FLOAT
SUM/AVG; change its declaration and any matching layout in
DirectIndexAgg::GroupSlot (and the other occurrence around the 821-832 region)
to allocate space for MAX_AGGREGATES (or otherwise compute exact per-aggregate
float count) and adjust initializers (e.g., {0.0}) accordingly so each aggregate
slot maps safely to bucket.sums_float64[i] without overflow.
- Around line 399-401: The bucket's fixed-size key_data[64] overflows for wide
composite GROUP BY keys; replace the in-bucket fixed buffer with an indirection
(e.g., a key_ptr / key_index plus key_len and key_type) so buckets store either
a pointer to heap-allocated key bytes or an index into a separate contiguous key
storage (small-buffer-optimization can still be used in that external storage).
Update all code paths that read/write bucket.key_data: change bucket layout
(remove key_data[64], add key_ptr/key_index and key_len/key_type), then modify
find_or_insert(), grow(), and any rehash/move logic to allocate/copy keys into
the external storage and move pointers/indices instead of memcpy’ing into
key_data; ensure proper allocation, deallocation, and copy semantics and that
hash_group_keys_ reconstruction uses the same external storage semantics.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 92252dce-f6c8-4554-9418-f723006711b9

📥 Commits

Reviewing files that changed from the base of the PR and between bb64516 and ab7e939.

📒 Files selected for processing (5)
  • include/executor/vectorized_operator.hpp
  • include/storage/memory_mapped_column.hpp
  • src/storage/memory_mapped_column.cpp
  • tests/analytics_tests.cpp
  • tests/vectorized_operator_tests.cpp

Comment on lines +396 to +398
int64_t mins[MAX_AGGREGATES] = {0};
int64_t maxes[MAX_AGGREGATES] = {0};
bool has_mins[MAX_AGGREGATES] = {false}; // Track if initialized
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

MIN/MAX in the open-addressing path is hard-coded to int64_t.

This path forces every MIN/MAX input through to_int64() and always emits make_int64(...). That changes both the value and the result type for non-INT64 aggregates, while the legacy hash path still preserves the original common::Value type. Either store typed common::Values here as well, or decline the open-addressing optimization for non-INT64 MIN/MAX.

Also applies to: 835-846, 1027-1038

Comment on lines +835 to +846
} else if ((agg.type == AggregateType::Min || agg.type == AggregateType::Max) &&
agg.input_col_idx >= 0) {
const auto& col = batch.get_column(agg.input_col_idx);
if (!col.is_null(r)) {
auto val = col.get(r).to_int64();
if (!bucket.has_mins[i]) {
bucket.mins[i] = val;
bucket.maxes[i] = val;
bucket.has_mins[i] = true;
} else {
bucket.mins[i] = std::min(bucket.mins[i], val);
bucket.maxes[i] = std::max(bucket.maxes[i], val);
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

This only wires MIN/MAX for one aggregation path.

The new cases update and emit extrema for OpenAddressHashAgg, but DirectIndexAgg still only handles COUNT/SUM/AVG. Integer GROUP BY queries that take the direct-index fast path can therefore return NULL/default results for MIN/MAX unless that path is updated too, or explicitly opted out of the optimization.

Also applies to: 1027-1041

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@include/executor/vectorized_operator.hpp` around lines 835 - 846,
DirectIndexAgg’s fast path currently only updates COUNT/SUM/AVG, so add handling
for AggregateType::Min and AggregateType::Max in the DirectIndexAgg update and
emit logic (mirror the OpenAddressHashAgg behavior): when agg.input_col_idx >= 0
read the value from batch.get_column(agg.input_col_idx), check nulls, and update
the per-bucket fields bucket.mins[i], bucket.maxes[i] and bucket.has_mins[i]
(and initialize both min/max on first seen) just like the OpenAddressHashAgg
branch; also ensure the emit/finalize code for DirectIndexAgg outputs MIN/MAX
correctly (use the has flag to decide NULL vs value). Apply the same changes in
the second occurrence noted around the other block (the region referenced by
lines 1027-1041).


auto table_ptr = std::make_shared<ColumnarTable>(table);
VectorizedSeqScanOperator scan("sequential_scan", table_ptr);
VectorizedSeqScanOperator scan("sequential_scan", table_ptr, nullptr);
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

Verify constructor signature and update all call sites consistently.

Only one VectorizedSeqScanOperator constructor call in this file is updated with the third nullptr argument (Line 117), but dozens of other call sites throughout the file (Lines 41, 63, 91, 154, 184, 227, 267, 297, 330, 367, 395, 426, and many more in GROUP BY and JOIN tests) still use the 2-parameter constructor. This inconsistency suggests either:

  1. If the third parameter is required: The update is incomplete and will cause compilation failures.
  2. If the third parameter has a default value: The selective update is confusing and inconsistent.

Run the following script to check the VectorizedSeqScanOperator constructor signature and identify all unconverted call sites:

#!/bin/bash
# Description: Verify VectorizedSeqScanOperator constructor signature and find all call sites

# Check constructor declarations in header file
echo "=== Constructor signature(s) ==="
rg -n -A2 'VectorizedSeqScanOperator\s*\(' include/executor/vectorized_operator.hpp

echo -e "\n=== 2-parameter call sites in tests ==="
# Find all 2-parameter constructor calls (table_ptr); or table));
rg -n 'VectorizedSeqScanOperator\([^,]+,\s*[^,)]+\);' tests/

echo -e "\n=== 3-parameter call sites in tests ==="
# Find all 3-parameter constructor calls with nullptr
rg -n 'VectorizedSeqScanOperator\([^,]+,\s*[^,]+,\s*nullptr\);' tests/
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/vectorized_operator_tests.cpp` at line 117, The test suite uses
inconsistent calls to VectorizedSeqScanOperator; inspect the constructor
declaration for VectorizedSeqScanOperator to determine whether the third
parameter is required or has a default, then make a consistent fix: either add a
default value to the constructor signature in the class (e.g., give the third
parameter a default nullptr) or update every test call site (all uses of
VectorizedSeqScanOperator(...) across tests) to pass the third argument
(nullptr) so signatures match; search for all call sites and adjust them, and
update the header/implementation where the constructor is declared/defined to
keep tests compiling.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant