feat(perf): Streaming batch hashing and parallel table diffing for v0.7 by iamvirul · Pull Request #65 · iamvirul/deepdiff-db

iamvirul · 2026-03-14T04:50:54Z

Description

Implements Issue #12 — Streaming Support for Large Datasets. Introduces keyset-paginated batch hashing and a bounded parallel goroutine pool for table hashing, enabling comparison of databases with millions of rows while keeping memory usage flat.

Type of Change

New feature (non-breaking change which adds functionality)
Performance improvement
Documentation update

Related Issues

Fixes #12

Changes Made

internal/content/cursor.go — new shared BuildCursorQuery / buildCursorWhere for composite-PK keyset pagination; used by both hash and pack paths
internal/content/hash.go — HashTable extended with batchSize int; batchSize > 0 uses keyset-paginated hashTableBatched, batchSize = 0 falls back to original full-scan hashTableFull (backward compatible); per-batch runtime.GC() hint + DEBUG memory telemetry
cmd/deepdiffdb/main.go — --batch-size and --parallel flags on diff and gen-pack; hashTablesParallel replaces sequential for-loop using errgroup + semaphore.NewWeighted
pkg/config/config.go — PerformanceConfig{HashBatchSize, MaxParallelTables} with defaults (10000 / 1) and validation
deepdiffdb.config.yaml.example — documented performance: section
samples/14-streaming-large-datasets/ — SQLite seed script (500k orders / 100k products / 200k audit_logs), Makefile, README with memory tuning guide
CHANGELOG.md, README.md, ROADMAP.md — v0.7 release documentation

Testing

Unit tests added/updated
Manual testing performed
Tested with SQLite

Test Results

Three new unit tests in tests/content/hash_test.go:

TestHashTable_BatchedMatchesUnbatched — 1000 rows, batchSize=100 vs batchSize=0 produce identical hash maps
TestHashTable_KeysetPaginationCorrect — 250 rows / batchSize=50, verifies all 250 keys present with no gaps
TestHashTable_BatchedEmptyTable — empty table with batchSize=1000 returns empty map without error

All 17 existing integration test call-sites updated to pass batchSize=0 (preserves pre-v0.7 behaviour).

Checklist

Code follows the project's style guidelines
Self-review completed
Comments added for complex code
Documentation updated (if needed)
No new warnings generated
Tests pass locally
All existing tests pass
Changes are backward compatible (--batch-size 0 restores pre-v0.7 full-scan path)

Additional Notes

Memory behaviour: with hash_batch_size: 10000 and 800k rows, peak heap stays ~150–200 MB. Without batching (--batch-size 0) it can reach ~700–900 MB on the same dataset.

Performance: --parallel 4 on a multi-core host gives ~4× throughput on multi-table databases by hashing prod and dev tables concurrently.

--batch-size 0 is fully backward compatible — it bypasses pagination entirely and uses the original single-query full-scan path, so no existing behaviour changes without opt-in.

- CHANGELOG: add v0.7 entry with Added/Changed/Performance sections - README: document --batch-size, --parallel flags and performance config section; update How It Works and Limitations - ROADMAP: mark v0.7 complete, promote v0.8 to next upcoming, update priority matrix and success criteria

coderabbitai · 2026-03-14T04:51:02Z

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Free

Run ID: 5f8b58c1-ebec-4d11-bd40-c19841141f5d

📥 Commits

Reviewing files that changed from the base of the PR and between acce2a6 and d58eb89.

📒 Files selected for processing (3)

CHANGELOG.md
README.md
ROADMAP.md

📝 Walkthrough

Walkthrough

Documentation updates for v0.7 release introducing streaming support for large datasets, including keyset-paginated batching, parallel processing capabilities, and performance configuration options. Three core files updated: CHANGELOG with feature details, README with usage documentation, and ROADMAP with project timeline adjustments.

Changes

Cohort / File(s)	Summary
v0.7 Release Documentation `CHANGELOG.md`	Adds v0.7 Unreleased section documenting streaming support with keyset-paginated hashing, CLI batch-size and parallel overrides, shared cursor builder, sample suite, and performance improvements (~4x throughput). Updates HashTable signature to accept batchSize parameter; defaults set to hash_batch_size=10000 and max_parallel_tables=1.
Usage & Configuration Guide `README.md`	Documents streaming and parallel processing capabilities, introduces performance configuration options (hash_batch_size, max_parallel_tables) with defaults and YAML examples. Extends CLI usage with Large Dataset Options flags (--batch-size, --parallel) for diff, gen-pack, and other commands. Updates narrative on keyset-paginated batching and bounded heap behavior.
Project Roadmap `ROADMAP.md`	Transitions current status from v0.6 to v0.7 (dated 2026-03-14). Reworks feature descriptions with new capabilities (HTML reports, visual diff viewer, per-table conflict resolutions). Reorganizes Upcoming/Completed Release sections, moves v0.7 features to Completed Release entry, and updates success criteria and priorities.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~22 minutes

Poem

🐰 A hop and a batch, through datasets so vast,
Keyset-paginated, memory holds fast,
Four times the speed with parallel might,
Streaming solutions working just right!
v0.7 bounds forth with performance in sight. 🚀

📝 Coding Plan

Generate coding plan for human review comments

Note

🎁 Summarized by CodeRabbit Free

Your organization is on the Free plan. CodeRabbit will generate a high-level summary and a walkthrough for each pull request. For a comprehensive line-by-line review, please upgrade your subscription to CodeRabbit Pro by visiting https://app.coderabbit.ai/login.

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

codecov · 2026-03-14T04:55:38Z

Codecov Report

✅ All modified and coverable lines are covered by tests.

📢 Thoughts on this report? Let us know!

github-actions Bot added the documentation Improvements or additions to documentation label Mar 14, 2026

iamvirul self-assigned this Mar 14, 2026

iamvirul merged commit 8f32315 into main Mar 14, 2026
14 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(perf): Streaming batch hashing and parallel table diffing for v0.7#65

feat(perf): Streaming batch hashing and parallel table diffing for v0.7#65
iamvirul merged 1 commit intomainfrom
release/v0.7

iamvirul commented Mar 14, 2026

Uh oh!

coderabbitai Bot commented Mar 14, 2026 •

edited

Loading

Walkthrough

Changes

Estimated code review effort

Poem

Uh oh!

codecov Bot commented Mar 14, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

iamvirul commented Mar 14, 2026

Description

Type of Change

Related Issues

Changes Made

Testing

Test Results

Checklist

Additional Notes

Uh oh!

coderabbitai Bot commented Mar 14, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Estimated code review effort

Poem

Uh oh!

codecov Bot commented Mar 14, 2026

Codecov Report

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

coderabbitai Bot commented Mar 14, 2026 •

edited

Loading