
feat(perf): Streaming batch hashing and parallel table diffing for v0.7#65

Merged
iamvirul merged 1 commit into main from release/v0.7
Mar 14, 2026

Conversation

@iamvirul
Owner

Description

Implements Issue #12 — Streaming Support for Large Datasets. Introduces keyset-paginated batch hashing and a bounded parallel goroutine pool for table hashing, enabling comparison of databases with millions of rows while keeping memory usage flat.

Type of Change

  • New feature (non-breaking change which adds functionality)
  • Performance improvement
  • Documentation update

Related Issues

Fixes #12

Changes Made

  • internal/content/cursor.go — new shared BuildCursorQuery / buildCursorWhere for composite-PK keyset pagination; used by both hash and pack paths
  • internal/content/hash.go — HashTable extended with batchSize int; batchSize > 0 uses the keyset-paginated hashTableBatched, batchSize = 0 falls back to the original full-scan hashTableFull (backward compatible); per-batch runtime.GC() hint + DEBUG memory telemetry
  • cmd/deepdiffdb/main.go — --batch-size and --parallel flags on diff and gen-pack; hashTablesParallel replaces the sequential for-loop using errgroup + semaphore.NewWeighted
  • pkg/config/config.go — PerformanceConfig{HashBatchSize, MaxParallelTables} with defaults (10000 / 1) and validation
  • deepdiffdb.config.yaml.example — documented performance: section
  • samples/14-streaming-large-datasets/ — SQLite seed script (500k orders / 100k products / 200k audit_logs), Makefile, README with memory tuning guide
  • CHANGELOG.md, README.md, ROADMAP.md — v0.7 release documentation
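The core idea behind the shared cursor builder is standard row-value expansion for composite primary keys. A minimal sketch (the function name and signature here are illustrative; the real helpers in internal/content/cursor.go may differ):

```go
package main

import (
	"fmt"
	"strings"
)

// buildCursorWhere builds a keyset-pagination predicate for a composite
// primary key. For PK (a, b) and the last-seen values bound as
// placeholders, it produces (a > ?) OR (a = ? AND b > ?), so each batch
// resumes strictly after the previous batch's last row.
// NOTE: illustrative sketch only, not the PR's exact implementation.
func buildCursorWhere(pkCols []string) string {
	var clauses []string
	for i := range pkCols {
		var parts []string
		for j := 0; j < i; j++ {
			parts = append(parts, fmt.Sprintf("%s = ?", pkCols[j]))
		}
		parts = append(parts, fmt.Sprintf("%s > ?", pkCols[i]))
		clauses = append(clauses, "("+strings.Join(parts, " AND ")+")")
	}
	return strings.Join(clauses, " OR ")
}

func main() {
	fmt.Println(buildCursorWhere([]string{"order_id", "line_no"}))
	// (order_id > ?) OR (order_id = ? AND line_no > ?)
}
```

Because the predicate always references the PK columns in index order, each batch query can use the primary-key index and avoid OFFSET's linear re-scan cost.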

Testing

  • Unit tests added/updated
  • Manual testing performed
  • Tested with SQLite

Test Results

Three new unit tests in tests/content/hash_test.go:

  • TestHashTable_BatchedMatchesUnbatched — 1000 rows, batchSize=100 vs batchSize=0 produce identical hash maps
  • TestHashTable_KeysetPaginationCorrect — 250 rows / batchSize=50, verifies all 250 keys present with no gaps
  • TestHashTable_BatchedEmptyTable — empty table with batchSize=1000 returns empty map without error

All 17 existing integration-test call sites updated to pass batchSize=0, preserving pre-v0.7 behaviour.
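The invariant the pagination test checks can be simulated in a few lines: walking a sorted key space in fixed-size batches via a "key > cursor" predicate must visit every key exactly once, matching an unbatched full scan. A self-contained sketch (not the actual test code):

```go
package main

import "fmt"

// batchedScan simulates keyset pagination over an already-sorted key
// slice: each pass takes up to batchSize keys strictly greater than the
// cursor, then advances the cursor to the last key taken.
func batchedScan(keys []int, batchSize int) []int {
	var out []int
	cursor := -1 // sentinel before the first key
	for {
		var batch []int
		for _, k := range keys {
			if k > cursor && len(batch) < batchSize {
				batch = append(batch, k)
			}
		}
		if len(batch) == 0 {
			return out // no keys past the cursor: scan complete
		}
		out = append(out, batch...)
		cursor = batch[len(batch)-1]
	}
}

func main() {
	keys := make([]int, 250)
	for i := range keys {
		keys[i] = i
	}
	got := batchedScan(keys, 50) // mirrors 250 rows / batchSize=50
	fmt.Println(len(got) == 250) // all keys present, no gaps or duplicates
	// true
}
```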

Checklist

  • Code follows the project's style guidelines
  • Self-review completed
  • Comments added for complex code
  • Documentation updated (if needed)
  • No new warnings generated
  • Tests pass locally
  • All existing tests pass
  • Changes are backward compatible (--batch-size 0 restores pre-v0.7 full-scan path)

Additional Notes

Memory behaviour: with hash_batch_size: 10000 and 800k rows, peak heap stays ~150–200 MB. Without batching (--batch-size 0) it can reach ~700–900 MB on the same dataset.
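For reference, the two settings above live under a performance: section in the YAML config. The key names come from this PR; the surrounding structure and comments are illustrative:

```yaml
performance:
  hash_batch_size: 10000     # rows per keyset-paginated batch; 0 = original full-scan path
  max_parallel_tables: 1     # bounded goroutine pool size for table hashing
```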

Performance: --parallel 4 on a multi-core host gives ~4× throughput on multi-table databases by hashing prod and dev tables concurrently.

--batch-size 0 is fully backward compatible — it bypasses pagination entirely and uses the original single-query full-scan path, so no existing behaviour changes without opt-in.
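The bounded-parallelism pattern behind --parallel can be sketched without external dependencies. The PR uses errgroup + semaphore.NewWeighted; this sketch substitutes a buffered channel as the counting semaphore, which is the dependency-free equivalent (function and variable names here are illustrative):

```go
package main

import (
	"fmt"
	"sync"
)

// hashTablesParallel hashes tables with at most maxParallel hash
// computations in flight. A buffered channel acts as a counting
// semaphore; a mutex guards the shared result map. hashFn stands in
// for the real per-table hashing work.
func hashTablesParallel(tables []string, maxParallel int, hashFn func(string) string) map[string]string {
	sem := make(chan struct{}, maxParallel)
	var mu sync.Mutex
	var wg sync.WaitGroup
	out := make(map[string]string, len(tables))
	for _, t := range tables {
		wg.Add(1)
		go func(t string) {
			defer wg.Done()
			sem <- struct{}{}        // acquire a slot (blocks at maxParallel)
			defer func() { <-sem }() // release the slot
			h := hashFn(t)
			mu.Lock()
			out[t] = h
			mu.Unlock()
		}(t)
	}
	wg.Wait()
	return out
}

func main() {
	res := hashTablesParallel([]string{"orders", "products"}, 2, func(t string) string {
		return "hash-" + t // placeholder for real table hashing
	})
	fmt.Println(res["orders"], res["products"])
	// hash-orders hash-products
}
```

With maxParallel equal to 1 this degenerates to the old sequential loop, which is why max_parallel_tables defaults to 1 without changing existing behaviour.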

- CHANGELOG: add v0.7 entry with Added/Changed/Performance sections
- README: document --batch-size, --parallel flags and performance config section; update How It Works and Limitations
- ROADMAP: mark v0.7 complete, promote v0.8 to next upcoming, update priority matrix and success criteria
@coderabbitai
Contributor

coderabbitai Bot commented Mar 14, 2026

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Free

Run ID: 5f8b58c1-ebec-4d11-bd40-c19841141f5d

📥 Commits

Reviewing files that changed from the base of the PR and between acce2a6 and d58eb89.

📒 Files selected for processing (3)
  • CHANGELOG.md
  • README.md
  • ROADMAP.md

📝 Walkthrough

Documentation updates for v0.7 release introducing streaming support for large datasets, including keyset-paginated batching, parallel processing capabilities, and performance configuration options. Three core files updated: CHANGELOG with feature details, README with usage documentation, and ROADMAP with project timeline adjustments.

Changes

v0.7 Release Documentation — CHANGELOG.md
Adds a v0.7 Unreleased section documenting streaming support with keyset-paginated hashing, CLI batch-size and parallel overrides, the shared cursor builder, the sample suite, and performance improvements (~4× throughput). Updates the HashTable signature to accept a batchSize parameter; defaults set to hash_batch_size=10000 and max_parallel_tables=1.

Usage & Configuration Guide — README.md
Documents streaming and parallel processing capabilities and introduces the performance configuration options (hash_batch_size, max_parallel_tables) with defaults and YAML examples. Extends CLI usage with Large Dataset Options flags (--batch-size, --parallel) for diff, gen-pack, and other commands. Updates the narrative on keyset-paginated batching and bounded heap behaviour.

Project Roadmap — ROADMAP.md
Transitions current status from v0.6 to v0.7 (dated 2026-03-14). Reworks feature descriptions with new capabilities (HTML reports, visual diff viewer, per-table conflict resolution). Reorganizes the Upcoming/Completed Release sections, moves v0.7 features to a Completed Release entry, and updates success criteria and priorities.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~22 minutes

Poem

🐰 A hop and a batch, through datasets so vast,
Keyset-paginated, memory holds fast,
Four times the speed with parallel might,
Streaming solutions working just right!
v0.7 bounds forth with performance in sight. 🚀


Note

🎁 Summarized by CodeRabbit Free

@github-actions github-actions Bot added the documentation Improvements or additions to documentation label Mar 14, 2026
@codecov

codecov Bot commented Mar 14, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.


@iamvirul iamvirul self-assigned this Mar 14, 2026
@iamvirul iamvirul merged commit 8f32315 into main Mar 14, 2026
14 checks passed