fix(scanner): semaphore pool + completion-bitmap watermark#300

Merged: mescon merged 1 commit into main from fix/scanner-watermark-pool on Jun 4, 2026
@mescon mescon commented Jun 4, 2026

Closes #290.

What changes

The scanner replaces its batched fan-out/fan-in with a semaphore-bounded worker pool and a completion-bitmap watermark.

| Before | After |
| --- | --- |
| Ordered batches of scanWorkers files; wg.Wait() between batches | One worker per file, up to scanWorkers in flight via a buffered-channel semaphore |
| Slow file in a batch holds back the other N-1 | Workers commit independently |
| All N completions cluster on the same CURRENT_TIMESTAMP second | Per-file strftime('%Y-%m-%d %H:%M:%f', 'now') ms-precision |
| markFileProcessed writes worker's fileIndex every 10 files (non-monotonic under out-of-order completion) | New markFileProcessedNoSync for parallel path; a watermark goroutine owns the DB progress write |

Why a watermark + bitmap

Resume safety requires the persisted current_file_index to be monotonic so that on the next startup, "replay from index N" is guaranteed to cover every file that hadn't completed. With out-of-order worker completion, the worker's own fileIndex is no longer suitable for this. The bitmap (done []atomic.Bool indexed by file position) lets a separate goroutine walk forward over the contiguous prefix of trues; that walk position is the watermark.

The watermark may lag behind the actual highest-completed index; the gap between the two is the replay window. Migration 010 adds UNIQUE INDEX (scan_id, file_path) and Record now uses ON CONFLICT(scan_id, file_path) DO NOTHING, so re-processing a file during the replay window is a no-op rather than producing a duplicate row.

Cancellation & pause

Both the cancellation check and the pause check now fire on every dispatch-loop iteration (previously they ran only between batches). In-flight workers honor ctx.Done() at the same point they always have. The stopped atomic flag prevents already-dispatched workers from starting expensive work and lets the dispatch loop break out cleanly.

Test changes

  • New TestScannerService_ScanFilesParallel_WatermarkIsContiguous: after a clean run, persisted current_file_index == total_files (the watermark walked the whole way).
  • New TestScannerService_Record_IdempotentOnDoubleRecord: a simulated replay (running the same file list twice) must not produce duplicate scan_files rows.
  • New TestScanFileRepository_Record_IdempotentOnReplay and TestScanFileRepository_Record_StoresMillisecondTimestamp in the repo tests.
  • Inline schemas in two test files updated to mirror migration 010's UNIQUE index.
  • NewTestDB pins MaxOpenConns=1. The parallel refactor exposed a latent issue where modernc.org/sqlite's :memory: is per-connection — multi-connection writes from worker+watermark in tests started seeing 'no such table' on fresh connections. Real production DBs use a file path so this only affected tests.

Test plan

  • go build ./... clean
  • /usr/bin/golangci-lint run ./... — 0 issues
  • go test ./internal/repository/ ./internal/api/ ./internal/integration/ ./internal/db/ — all green
  • go test -race ./internal/services/ — full services suite green under the race detector (44s)
  • New tests: WatermarkIsContiguous, IdempotentOnDoubleRecord, Record_IdempotentOnReplay, Record_StoresMillisecondTimestamp all pass

The scanner ran files in ordered batches of scanWorkers with a wg.Wait
barrier between batches. One slow file in a batch held back the other
N-1, and all N completions clustered into the same CURRENT_TIMESTAMP
second so the scan-detail UI showed identical timestamps for big
chunks at a time.

Replace the fan-out/fan-in batches with a semaphore-bounded worker
pool plus a completion bitmap and a contiguous-done watermark
ticker. Workers commit their scan_files row as they finish (out of
order); a separate goroutine walks the bitmap and flushes the
contiguous-prefix watermark to scans.current_file_index every
~2s, owning all DB progress writes for the parallel path so the
persisted index can't go backwards.

Resume after interruption replays from the watermark to the actual
highest-completed index; that window is handled by making scan_files
inserts idempotent (migration 010 adds UNIQUE(scan_id, file_path);
Record uses ON CONFLICT DO NOTHING). The scanned_at column also moves
to strftime('%Y-%m-%d %H:%M:%f', 'now') millisecond precision so
real wall-clock order is visible in the UI instead of clustering on
the second.

Cancellation and pause are checked per dispatch iteration; in-flight
workers see ctx.Done() at the same place they always have. The
single-threaded scanWorkers<=1 path is unchanged.

Test infrastructure: NewTestDB now pins MaxOpenConns=1. The
parallel scanner refactor exposed a latent issue where in-memory
SQLite gives each pool connection its own empty database, so writes
on a fresh connection saw 'no such table' once the watermark
goroutine ran alongside per-file workers. The inline scan_files
schemas in scan_file_test.go and handlers_scans_test.go now mirror
migration 010's UNIQUE index so they exercise the ON CONFLICT path.

Closes #290

coderabbitai Bot commented Jun 4, 2026

Warning

Review limit reached

@mescon, we couldn't start this review because you've reached your PR review rate limit.

More reviews will be available in 22 minutes and 34 seconds. Learn how PR review limits work.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 6f72edba-bfdc-4476-bd17-5e7239ca4488

📥 Commits

Reviewing files that changed from the base of the PR and between 8414614 and 2698727.

📒 Files selected for processing (7)
  • internal/api/handlers_scans_test.go
  • internal/db/migrations/010_scan_files_unique_index.sql
  • internal/repository/scan_file.go
  • internal/repository/scan_file_test.go
  • internal/services/scanner.go
  • internal/services/scanner_test.go
  • internal/testutil/testdb.go

@mescon mescon merged commit 41f4b58 into main Jun 4, 2026
5 of 7 checks passed
@mescon mescon deleted the fix/scanner-watermark-pool branch June 4, 2026 18:18

codecov Bot commented Jun 4, 2026

Codecov Report

❌ Patch coverage is 60.97561% with 32 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| internal/services/scanner.go | 59.49% | 25 Missing and 7 partials ⚠️ |

@mescon mescon mentioned this pull request Jun 4, 2026
5 tasks

Development

Successfully merging this pull request may close these issues.

Scanner: replace ordered fan-in barrier so files commit out-of-order