fix(scanner): semaphore pool + completion-bitmap watermark #300
The scanner ran files in ordered batches of scanWorkers with a wg.Wait
barrier between batches. One slow file in a batch held back the other
N-1, and all N completions clustered into the same CURRENT_TIMESTAMP
second so the scan-detail UI showed identical timestamps for big
chunks at a time.
Replace the fan-out/fan-in batches with a semaphore-bounded worker
pool plus a completion bitmap and a contiguous-done watermark
ticker. Workers commit their scan_files row as they finish (out of
order); a separate goroutine walks the bitmap and flushes the
contiguous-prefix watermark to scans.current_file_index every
~2s, owning all DB progress writes for the parallel path so the
persisted index can't go backwards.
Resume after interruption replays from the watermark to the actual
highest-completed index; that window is handled by making scan_files
inserts idempotent (migration 010 adds UNIQUE(scan_id, file_path);
Record uses ON CONFLICT DO NOTHING). The scanned_at column also moves
to strftime('%Y-%m-%d %H:%M:%f', 'now') millisecond precision so
real wall-clock order is visible in the UI instead of clustering on
the second.
Cancellation and pause are checked per dispatch iteration; in-flight
workers see ctx.Done() at the same place they always have. The
single-threaded scanWorkers<=1 path is unchanged.
Test infrastructure: NewTestDB now pins MaxOpenConns=1. The
parallel scanner refactor exposed a latent issue where in-memory
SQLite gives each pool connection its own empty database, so writes
on a fresh connection saw 'no such table' once the watermark
goroutine ran alongside per-file workers. The inline scan_files
schemas in scan_file_test.go and handlers_scans_test.go now mirror
migration 010's UNIQUE index so they exercise the ON CONFLICT path.
Closes #290
Closes #290.
What changes
The scanner replaces its batched fan-out/fan-in with a semaphore-bounded worker pool and a completion-bitmap watermark.
| Aspect | Before | After |
| --- | --- | --- |
| Dispatch | `scanWorkers` files; `wg.Wait()` between batches | `scanWorkers` in flight via a buffered-channel semaphore |
| `scanned_at` | `CURRENT_TIMESTAMP` (second precision) | `strftime('%Y-%m-%d %H:%M:%f', 'now')` ms-precision |
| Progress writes | `markFileProcessed` writes worker's `fileIndex` every 10 files (non-monotonic under out-of-order completion) | `markFileProcessedNoSync` for parallel path; a watermark goroutine owns the DB progress write |

Why a watermark + bitmap
Resume safety requires the persisted `current_file_index` to be monotonic so that on the next startup, "replay from index N" is guaranteed to cover every file that hadn't completed. With out-of-order worker completion, the worker's own `fileIndex` is no longer suitable for this. The bitmap (`done []atomic.Bool`, indexed by file position) lets a separate goroutine walk forward over the contiguous prefix of trues; that walk position is the watermark.

The watermark may lag behind the actual highest-completed index; that lag is the replay window. Migration 010 adds `UNIQUE INDEX (scan_id, file_path)` and `Record` now uses `ON CONFLICT(scan_id, file_path) DO NOTHING`, so re-processing a file during the replay window is a no-op rather than producing a duplicate row.

Cancellation & pause
Both checks fire on every dispatch-loop iteration (previously only between batches). In-flight workers honor `ctx.Done()` at the same point they always have. The `stopped` atomic flag prevents already-dispatched workers from starting expensive work, and lets the dispatch loop break out cleanly.

Test changes
- `TestScannerService_ScanFilesParallel_WatermarkIsContiguous`: after a clean run, persisted `current_file_index == total_files` (the watermark walked the whole way).
- `TestScannerService_Record_IdempotentOnDoubleRecord`: a simulated replay (running the same file list twice) must not produce duplicate `scan_files` rows.
- `TestScanFileRepository_Record_IdempotentOnReplay` and `TestScanFileRepository_Record_StoresMillisecondTimestamp` in the repo tests.
- `NewTestDB` pins `MaxOpenConns=1`. The parallel refactor exposed a latent issue where modernc.org/sqlite's `:memory:` is per-connection; multi-connection writes from worker+watermark in tests started seeing 'no such table' on fresh connections. Real production DBs use a file path, so this only affected tests.

Test plan
- `go build ./...` clean
- `/usr/bin/golangci-lint run ./...`: 0 issues
- `go test ./internal/repository/ ./internal/api/ ./internal/integration/ ./internal/db/`: all green
- `go test -race ./internal/services/`: full services suite green under the race detector (44s)
- `WatermarkIsContiguous`, `IdempotentOnDoubleRecord`, `Record_IdempotentOnReplay`, `Record_StoresMillisecondTimestamp` all pass