feat(executor): add StringVector for TEXT column support#55

Merged
poyrazK merged 3 commits into main from feat/string-vector
Apr 20, 2026

Conversation

@poyrazK
Owner

@poyrazK poyrazK commented Apr 20, 2026

Summary

  • Add StringVector class for variable-length string storage in columnar tables
  • Enable TYPE_TEXT, TYPE_VARCHAR, and TYPE_CHAR columns in VectorBatch
  • Add TEXT append/read support to ColumnarTable with length-prefixed format
  • Add 14 unit tests for StringVector
  • Add integration tests for TEXT in ColumnarTable
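
As a rough illustration of the null-aware storage the summary describes, the following is a minimal sketch, not the PR's actual class. `SimpleStringVector` and its use of `std::optional` are hypothetical; the real `StringVector` works with `Value` objects and a column type, both omitted here.

```cpp
#include <cstddef>
#include <optional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical, simplified stand-in for a string column vector:
// one std::string per row plus a parallel null flag.
class SimpleStringVector {
public:
    // Append a value; std::nullopt records a NULL row.
    void append(std::optional<std::string> value) {
        nulls_.push_back(!value.has_value());
        data_.push_back(value ? std::move(*value) : std::string{});
    }

    // Return the value at `row`, or std::nullopt if the row is NULL.
    // Throws std::out_of_range for an invalid row.
    std::optional<std::string> get(std::size_t row) const {
        if (nulls_.at(row)) return std::nullopt;
        return data_.at(row);
    }

    bool is_null(std::size_t row) const { return nulls_.at(row); }
    std::size_t size() const { return data_.size(); }

    void clear() {
        data_.clear();
        nulls_.clear();
    }

private:
    std::vector<std::string> data_;  // row payloads (empty for NULL rows)
    std::vector<bool> nulls_;        // true where the row is NULL
};
```

Keeping the null flags separate from the payloads lets an empty string and a NULL row remain distinguishable, which is one of the edge cases the unit tests are said to cover.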

Test plan

  • string_vector_tests: 14/14 passed
  • columnar_table_tests: 12/12 passed (including TEXT tests)
  • analytics_tests: 5/5 passed

Summary by CodeRabbit

  • New Features

    • Vectorized execution now supports TEXT, VARCHAR, and CHAR column types with proper null handling.
    • Variable-length string persistence implemented for columnar storage.
  • Tests

    • Added comprehensive test coverage for string vector operations and end-to-end scenarios.
  • Documentation

    • Updated documentation to reflect StringVector support for Phase 8 analytics.

@coderabbitai

coderabbitai Bot commented Apr 20, 2026

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: b04f1bd8-1d21-487b-abc0-91d5fdb0cdcc

📥 Commits

Reviewing files that changed from the base of the PR and between c609e92 and e8ae49b.

📒 Files selected for processing (2)
  • src/storage/columnar_table.cpp
  • tests/string_vector_tests.cpp

Walkthrough

This pull request implements support for variable-length string storage through a new StringVector class in the vectorized execution system. The implementation adds serialization/deserialization logic to the columnar table layer and includes comprehensive unit and integration tests for the new functionality.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Build & Documentation (CMakeLists.txt, docs/phases/PHASE_8_ANALYTICS.md) | Added test target for string_vector_tests and documented StringVector type as part of Phase 8 vectorized data structures for TEXT/VARCHAR/CHAR columns. |
| StringVector Implementation (include/executor/types.hpp) | Introduced StringVector class supporting append (null-aware), get, set, clear, resize, and raw_data access. Updated VectorBatch::init_from_schema to instantiate StringVector for TYPE_TEXT, TYPE_VARCHAR, and TYPE_CHAR instead of throwing runtime exception. |
| Persistence Layer (src/storage/columnar_table.cpp) | Added variable-length string serialization in append_batch (4-byte little-endian length prefix + bytes) and deserialization in read_batch with null flag handling for text column types. |
| Test Coverage (tests/columnar_table_tests.cpp, tests/string_vector_tests.cpp) | Replaced UnsupportedTypeThrows test with TextTypeNowSupported validating successful batch creation; added TextLifecycle end-to-end test; created comprehensive new unit test suite for StringVector covering append, get, set, null semantics, resize, edge cases, type preservation, long strings, and special characters. |

Sequence Diagram(s)

sequenceDiagram
    participant Client
    participant VectorBatch
    participant StringVector
    participant ColumnarTable
    participant Storage as Storage Layer

    Client->>VectorBatch: create(schema with TEXT/VARCHAR/CHAR)
    VectorBatch->>VectorBatch: init_from_schema()
    VectorBatch->>StringVector: new StringVector(column_type)
    StringVector-->>VectorBatch: created

    Client->>VectorBatch: append_tuple(values)
    VectorBatch->>StringVector: append(text_value)
    StringVector->>StringVector: store string + update null_bitmap
    StringVector-->>VectorBatch: appended

    Client->>ColumnarTable: append_batch(vector_batch)
    ColumnarTable->>StringVector: iterate columns
    ColumnarTable->>Storage: write null flags + length prefix + string bytes
    Storage-->>ColumnarTable: persisted

    Client->>ColumnarTable: read_batch(row_range)
    ColumnarTable->>Storage: read null flags + length prefixes + string bytes
    Storage-->>ColumnarTable: data loaded
    ColumnarTable->>StringVector: populate with deserialized strings
    StringVector-->>ColumnarTable: batch ready
    ColumnarTable-->>Client: VectorBatch returned

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~22 minutes

Poem

🐰 Hop, hop, strings now flow,
Variable-length fields all aglow,
Text and VARCHAR, CHAR take flight,
Vectorized storage burning bright!

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 66.67%, which is below the required threshold of 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit's high-level summary is enabled. |
| Title check | ✅ Passed | The title clearly and specifically describes the main change: adding StringVector to support TEXT columns, which is the primary feature in this pull request. |

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/storage/columnar_table.cpp`:
- Around line 152-173: read_batch's TEXT branch reads variable-length data
starting at d_in offset 0, so when start_row>0 the string stream isn't advanced
and strings get misaligned with the null bitmap; fix by advancing d_in to the
beginning of the requested row before reading actual_rows. In the branch
handling common::ValueType::TYPE_TEXT/TYPE_VARCHAR/TYPE_CHAR (where you have
d_in and n_in and str_vec/dynamic_cast<executor::StringVector&>), if start_row >
0 first seek d_in to 0 and loop r from 0 to start_row-1: read the 4-byte length
(uint32_t) and then seek forward by that length (use d_in.read to consume or
d_in.seekg(len, std::ios::cur)), handling EOF/errors; after skipping start_row
records proceed to the existing loop that reads actual_rows lengths+data and
appends values. Ensure you still read the null bitmap from n_in as currently
done and add error checks when skipping/reading lengths.

In `@tests/string_vector_tests.cpp`:
- Around line 175-183: The second append in TEST_F
StringVectorTests.SpecialCharacters is a no-op because the literal "emoji: 🎉
NULL: \0 embedded" is truncated at the embedded NUL; update the test to either
build the string with an explicit length so the embedded NUL is preserved (e.g.
construct a std::string with the bytes including '\0' and pass that to
Value::make_text) and then assert vec.get(1).as_text() contains the embedded
NUL, or remove the second vec.append(Value::make_text(...)) and its misleading
comment; modify the StringVector vec, vec.append, and Value::make_text usage
accordingly to reflect the chosen approach.
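
The truncation described in this comment follows from how `std::string` is built from a C string literal. The sketch below contrasts the two constructions; the function names are hypothetical, and whether `Value::make_text` accepts a `std::string` with embedded NUL bytes is an assumption that would need checking against the actual code.

```cpp
#include <cstddef>
#include <string>

// Constructing from a const char* stops at the first NUL byte,
// so everything after "NULL: " is silently dropped.
inline std::string truncated_literal() {
    return std::string("NULL: \0 embedded");
}

// Passing an explicit length keeps every byte, including the
// embedded NUL; only the literal's terminator is excluded.
inline std::string full_literal() {
    static const char bytes[] = "NULL: \0 embedded";
    return std::string(bytes, sizeof(bytes) - 1);
}
```

A test that intends to exercise embedded NUL bytes would need the second form, and could then assert on both the length and the position of the NUL in the value read back.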

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: e55c50ef-3620-4cee-a002-5ac820eecdb0

📥 Commits

Reviewing files that changed from the base of the PR and between b52a71b and c609e92.

📒 Files selected for processing (6)
  • CMakeLists.txt
  • docs/phases/PHASE_8_ANALYTICS.md
  • include/executor/types.hpp
  • src/storage/columnar_table.cpp
  • tests/columnar_table_tests.cpp
  • tests/string_vector_tests.cpp

Owner Author

@poyrazK poyrazK left a comment

It's okay to merge

@poyrazK poyrazK merged commit 3332b0d into main Apr 20, 2026
9 checks passed