feat(perf): Streaming batch hashing and parallel table diffing for v0.7#65
feat(perf): Streaming batch hashing and parallel table diffing for v0.7#65
Conversation
- CHANGELOG: add v0.7 entry with Added/Changed/Performance sections - README: document --batch-size, --parallel flags and performance config section; update How It Works and Limitations - ROADMAP: mark v0.7 complete, promote v0.8 to next upcoming, update priority matrix and success criteria
ℹ️ Recent review info⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Free Run ID: 📒 Files selected for processing (3)
📝 WalkthroughWalkthroughDocumentation updates for v0.7 release introducing streaming support for large datasets, including keyset-paginated batching, parallel processing capabilities, and performance configuration options. Three core files updated: CHANGELOG with feature details, README with usage documentation, and ROADMAP with project timeline adjustments. Changes
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~22 minutes Poem
📝 Coding Plan
Note 🎁 Summarized by CodeRabbit FreeYour organization is on the Free plan. CodeRabbit will generate a high-level summary and a walkthrough for each pull request. For a comprehensive line-by-line review, please upgrade your subscription to CodeRabbit Pro by visiting https://app.coderabbit.ai/login. Comment |
Codecov Report✅ All modified and coverable lines are covered by tests. 📢 Thoughts on this report? Let us know! |
Description
Implements Issue #12 — Streaming Support for Large Datasets. Introduces keyset-paginated batch hashing and a bounded parallel goroutine pool for table hashing, enabling comparison of databases with millions of rows while keeping memory usage flat.
Type of Change
Related Issues
Fixes #12
Changes Made
internal/content/cursor.go— new sharedBuildCursorQuery/buildCursorWherefor composite-PK keyset pagination; used by both hash and pack pathsinternal/content/hash.go—HashTableextended withbatchSize int;batchSize > 0uses keyset-paginatedhashTableBatched,batchSize = 0falls back to original full-scanhashTableFull(backward compatible); per-batchruntime.GC()hint + DEBUG memory telemetrycmd/deepdiffdb/main.go—--batch-sizeand--parallelflags ondiffandgen-pack;hashTablesParallelreplaces sequential for-loop usingerrgroup+semaphore.NewWeightedpkg/config/config.go—PerformanceConfig{HashBatchSize, MaxParallelTables}with defaults (10000 / 1) and validationdeepdiffdb.config.yaml.example— documentedperformance:sectionsamples/14-streaming-large-datasets/— SQLite seed script (500k orders / 100k products / 200k audit_logs), Makefile, README with memory tuning guideCHANGELOG.md,README.md,ROADMAP.md— v0.7 release documentationTesting
Test Results
Three new unit tests in
tests/content/hash_test.go:TestHashTable_BatchedMatchesUnbatched— 1000 rows, batchSize=100 vs batchSize=0 produce identical hash mapsTestHashTable_KeysetPaginationCorrect— 250 rows / batchSize=50, verifies all 250 keys present with no gapsTestHashTable_BatchedEmptyTable— empty table with batchSize=1000 returns empty map without errorAll 17 existing integration test call-sites updated to pass
batchSize=0(preserves pre-v0.7 behaviour).Checklist
--batch-size 0restores pre-v0.7 full-scan path)Additional Notes
Memory behaviour: with
hash_batch_size: 10000and 800k rows, peak heap stays ~150–200 MB. Without batching (--batch-size 0) it can reach ~700–900 MB on the same dataset.Performance:
--parallel 4on a multi-core host gives ~4× throughput on multi-table databases by hashing prod and dev tables concurrently.--batch-size 0is fully backward compatible — it bypasses pagination entirely and uses the original single-query full-scan path, so no existing behaviour changes without opt-in.