Introduce RowData structure for efficient CDC event handling #21

Merged
isdaniel merged 2 commits into main from refactor/row-data-improvement
Feb 17, 2026

@isdaniel
Owner

- Replace HashMap<String, serde_json::Value> with RowData for event payloads.
- Use Arc<str> for column names to optimize memory usage.
- Implement methods for RowData including push, get, and serialization.
- Update ChangeEvent to utilize RowData for Insert, Update, and Delete events.
- Add benchmarks comparing RowData performance against the old HashMap approach.
@gemini-code-assist

Summary of Changes

Hello @isdaniel, I'm Gemini Code Assist [1]! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a fundamental optimization to how Change Data Capture (CDC) event payloads are handled. By replacing the HashMap-based approach with a custom RowData structure that leverages Arc<str> for column and relation names, the system achieves substantial improvements in memory efficiency and processing speed. These changes are critical for high-throughput replication scenarios, reducing overhead associated with string allocations and hash map operations, and are validated by new performance benchmarks.
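
To make the shape described above concrete, here is a minimal sketch of an ordered row container keyed by `Arc<str>`. It is not the code from this PR: the `Value` enum is a stand-in for `serde_json::Value` so the example needs only the standard library, and the method set is reduced to a few of the names listed in the PR description.

```rust
use std::sync::Arc;

/// Stand-in for `serde_json::Value` (an assumption made to keep this std-only).
#[derive(Debug, Clone, PartialEq)]
pub enum Value {
    Null,
    Int(i64),
    Text(String),
}

/// Ordered list of (column name, value) pairs.
#[derive(Debug, Clone, Default, PartialEq)]
pub struct RowData {
    columns: Vec<(Arc<str>, Value)>,
}

impl RowData {
    /// Pre-size the backing vector when the column count is known.
    pub fn with_capacity(capacity: usize) -> Self {
        Self { columns: Vec::with_capacity(capacity) }
    }

    /// Append a column; the name is a shared pointer, not a fresh String.
    pub fn push(&mut self, name: Arc<str>, value: Value) {
        self.columns.push((name, value));
    }

    /// Look a column up by name with a linear scan (no hashing involved).
    pub fn get(&self, name: &str) -> Option<&Value> {
        self.columns
            .iter()
            .find(|(column, _)| &**column == name)
            .map(|(_, value)| value)
    }

    pub fn len(&self) -> usize {
        self.columns.len()
    }

    pub fn is_empty(&self) -> bool {
        self.columns.is_empty()
    }
}
```

A linear scan is usually a reasonable trade for rows with a modest number of columns; for very wide tables a hash-based lookup could still win, which is the kind of question the benchmarks in this PR are meant to answer.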

Highlights

  • Introduced RowData Structure: Replaced HashMap<String, serde_json::Value> with a new RowData struct for event payloads, providing an ordered list of (Arc<str>, serde_json::Value) pairs. This change significantly optimizes memory usage and performance for Change Data Capture (CDC) events.
  • Optimized String Handling with Arc<str>: Utilized Arc<str> for column names, schema names, and table names within ColumnInfo, RelationInfo, and ChangeEvent types. Cloning an Arc<str> only increments a reference count instead of copying the string, which reduces heap allocations when sharing these immutable strings across multiple events.
  • Enhanced Performance Benchmarking: Added a new benchmark suite (benches/rowdata_vs_hashmap.rs) using the criterion crate to compare the performance of the new RowData approach against the old HashMap approach across various CDC event operations like construction, lookup, cloning, and serialization.
  • Deprecated TupleData::to_hash_map: The TupleData::to_hash_map method was deprecated in favor of TupleData::to_row_data, which returns the more efficient RowData structure. The deprecated method now internally calls to_row_data for compatibility.
  • Refined String Reading in Buffer: Optimized the BufferReader's string reading methods to perform UTF-8 validation on borrowed slices and avoid unnecessary copy_to_bytes and to_vec allocations, improving efficiency during protocol parsing.
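
The `Arc<str>` sharing mentioned in the highlights can be illustrated with a short sketch. The `Event` struct and `events_for_relation` function below are hypothetical names invented for this example, not types from the crate; the point is only that every event built for one relation points at the same two string allocations.

```rust
use std::sync::Arc;

/// Hypothetical event type; only the shared name fields matter here.
#[derive(Debug, Clone)]
pub struct Event {
    pub schema: Arc<str>,
    pub table: Arc<str>,
}

/// Build `count` events that all share the same schema and table names.
pub fn events_for_relation(schema: &str, table: &str, count: usize) -> Vec<Event> {
    // Each name is allocated once, up front.
    let schema: Arc<str> = Arc::from(schema);
    let table: Arc<str> = Arc::from(table);
    (0..count)
        .map(|_| Event {
            // Arc::clone bumps a reference count; the string bytes are not copied.
            schema: Arc::clone(&schema),
            table: Arc::clone(&table),
        })
        .collect()
}
```

With `String` fields, the same loop would allocate and copy two strings per event. The clone is still not free (it is an atomic increment), but it avoids the allocator entirely.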

Changelog
  • Cargo.lock
    • Added numerous new dependencies for benchmarking and related utilities, including criterion, alloca, anes, anstyle, cast, ciborium, clap, criterion-plot, crossbeam, crunchy, half, itertools, oorandom, page_size, plotters, rayon, same-file, tinytemplate, walkdir, web-sys, winapi, and zerocopy.
    • Added tokio-stream dependency for examples.
  • Cargo.toml
    • Enabled the rc feature for the serde dependency to support Arc<str> serialization.
    • Added criterion as a dev-dependency with html_reports feature.
    • Introduced a new [[bench]] target for rowdata_vs_hashmap.
  • README.md
    • Updated the 'Working with Event Data' section to demonstrate usage of the new RowData type.
    • Added new performance considerations to highlight the benefits of Arc<str> and RowData.
  • benches/rowdata_vs_hashmap.rs
    • Added a new benchmark file to compare RowData and HashMap performance across various CDC event operations.
  • examples/arbitrary-fuzzing/src/main.rs
    • Updated TupleData conversion from to_hash_map to to_row_data in fuzzing example.
  • examples/basic-streaming/Cargo.lock
    • Added tokio-stream dependency.
  • examples/basic-streaming/Cargo.toml
    • Added tracing dependency.
  • examples/polling/Cargo.lock
    • Added tokio-util dependency.
  • src/buffer.rs
    • Optimized read_string_until_null to validate UTF-8 on a slice and avoid intermediate Bytes allocation.
    • Optimized read_string to validate UTF-8 on a slice and avoid intermediate Bytes allocation.
  • src/connection.rs
    • Added new test cases for PgReplicationConnection covering null connection behavior and replication mode errors.
  • src/lib.rs
    • Exported the new RowData type.
  • src/protocol.rs
    • Changed ColumnInfo.name from String to Arc<str>.
    • Changed RelationInfo.namespace and RelationInfo.relation_name from String to Arc<str>.
    • Updated ColumnInfo::new and RelationInfo::new to create Arc<str> from String.
    • Introduced TupleData::to_row_data to convert tuple data into the new RowData format.
    • Deprecated TupleData::to_hash_map and re-implemented it to use to_row_data.
    • Adjusted RelationInfo::get_column_by_name to correctly compare Arc<str> with &str.
    • Updated various test assertions to reflect the use of Arc<str> for names.
  • src/retry.rs
    • Updated test assertion for jitter delay to use range-based comparison.
  • src/stream.rs
    • Updated LogicalReplicationStream to use Arc<str> for schema, table, and key column names.
    • Modified tuple_to_data to return RowData instead of HashMap<String, serde_json::Value>.
    • Adjusted get_key_columns_for_relation and relation_metadata to return Vec<Arc<str>> and Arc<str> respectively.
    • Updated EventType::Truncate to store Vec<Arc<str>>.
    • Refactored protocol version validation using range syntax (1..=4).
    • Updated numerous test cases to align with the new RowData and Arc<str> types and their constructors.
  • src/types.rs
    • Introduced the RowData struct, an ordered list of (Arc<str>, serde_json::Value) pairs, replacing HashMap for event payloads.
    • Implemented core API for RowData including new, with_capacity, push, len, is_empty, get, contains_key, iter, into_vec, into_hash_map, and from_pairs.
    • Implemented Default, PartialEq, Eq, Serialize, and Deserialize for RowData.
    • Updated EventType enum variants (Insert, Update, Delete, Truncate) to use Arc<str> for names and RowData for data payloads.
    • Modified ChangeEvent constructor methods (insert, update, delete, truncate) to accept impl Into<Arc<str>> for names and RowData for data.
    • Updated ChangeEvent::get_key_columns to return Option<&Vec<Arc<str>>>.
    • Added extensive new test cases for RowData functionality, serialization, and deserialization, as well as CachePadded.
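
For the src/buffer.rs entries in the changelog above, the general idea of validating UTF-8 on a borrowed slice can be sketched as follows. `read_cstr` is a hypothetical free function written for this illustration; the crate's actual `read_string_until_null` lives on its buffer reader and may differ in signature and return type.

```rust
use std::str::Utf8Error;

/// Hypothetical helper: read a NUL-terminated string from the front of `input`.
/// Returns `None` when no terminator is present, otherwise the validated
/// string together with the bytes that follow the terminator.
pub fn read_cstr(input: &[u8]) -> Option<Result<(&str, &[u8]), Utf8Error>> {
    // Find the terminator without copying anything.
    let end = input.iter().position(|&byte| byte == 0)?;
    let (raw, rest) = input.split_at(end);
    // Validate in place on the borrowed slice; `rest[0]` is the NUL byte, so skip it.
    Some(std::str::from_utf8(raw).map(|text| (text, &rest[1..])))
}
```

Because validation happens on the borrowed bytes, a single allocation is needed only at the point where the caller decides to own the text (for example by converting it to a `String` or `Arc<str>`).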
Activity
  • The author, isdaniel, initiated this pull request to introduce the RowData structure and optimize CDC event handling.
  • New benchmarks were added to quantitatively assess the performance improvements of the new data structures.
  • Documentation in README.md was updated to guide users on working with the new RowData type.
  • Existing code was refactored to integrate RowData and Arc<str> across the codebase, ensuring consistency and efficiency.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@codecov

codecov Bot commented Feb 16, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 88.12%. Comparing base (be92966) to head (8d3e721).
⚠️ Report is 2 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main      #21      +/-   ##
==========================================
+ Coverage   87.97%   88.12%   +0.14%     
==========================================
  Files           8        8              
  Lines        7019     7056      +37     
==========================================
+ Hits         6175     6218      +43     
+ Misses        844      838       -6     


@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This is an excellent pull request that significantly improves the efficiency of CDC event handling by replacing HashMap<String, Value> with a custom RowData structure. The use of Arc<str> for column names, schemas, and table names is a smart optimization that reduces memory allocations and cloning overhead. The new RowData struct is well-designed, and its serialization to a JSON object ensures backward compatibility. I'm particularly impressed with the comprehensive benchmarks you've added, which clearly demonstrate the performance gains of this new approach across various workloads. The API changes are handled gracefully with deprecation attributes, and the documentation has been updated accordingly. I have a couple of suggestions to improve consistency and performance further, but overall, this is a high-quality contribution.
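
The deprecation pattern referred to in the review can be sketched like this. The `Tuple` type and its `String` values are simplifications invented for the example (the real `TupleData` carries protocol data and produces `serde_json::Value`s); only the technique of keeping the old method as a thin wrapper over the new one is the point.

```rust
use std::collections::HashMap;
use std::sync::Arc;

/// Hypothetical stand-in for the tuple type; values are plain strings here.
pub struct Tuple {
    pub columns: Vec<(Arc<str>, String)>,
}

impl Tuple {
    /// New entry point: ordered pairs with shared column names.
    pub fn to_row_data(&self) -> Vec<(Arc<str>, String)> {
        self.columns.clone()
    }

    /// Old entry point, kept for compatibility and implemented via the new one.
    #[deprecated(note = "use `to_row_data` instead")]
    pub fn to_hash_map(&self) -> HashMap<String, String> {
        self.to_row_data()
            .into_iter()
            .map(|(name, value)| (name.to_string(), value))
            .collect()
    }
}
```

Existing callers keep compiling and see a deprecation warning that names the replacement, while the conversion logic lives in one place.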

Comment thread src/types.rs
@isdaniel
Owner Author

/gemini review

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request introduces a significant performance improvement by replacing HashMap<String, serde_json::Value> with a custom RowData struct for handling CDC event payloads. The new RowData struct, which is essentially a Vec<(Arc<str>, serde_json::Value)>, reduces memory allocations and eliminates hashing overhead by using Arc<str> for column names and other identifiers. The changes are well-executed and consistently applied throughout the codebase, including core data types, protocol parsing, stream processing logic, and examples. The addition of a comprehensive benchmark suite to validate the performance gains is particularly commendable. The code quality is high, with good documentation for the new RowData struct. I've only found a couple of minor areas for improvement in the tests to ensure their robustness is maintained after the refactoring.

Comment thread src/stream.rs
Comment thread src/stream.rs
Comment thread src/types.rs
@isdaniel isdaniel merged commit 228c88b into main Feb 17, 2026
10 checks passed
@isdaniel isdaniel deleted the refactor/row-data-improvement branch February 23, 2026 03:06