fix(dspinner): snapshot index before emitting pins by lidel · Pull Request #1140 · ipfs/boxo

lidel · 2026-04-21T18:22:47Z

Note

Found this on collab cluster while testing v0.41.0-rc2, not a release blocker but would be good to fix in 0.42
Kubo PR with changelog: ipfs/kubo#11290

Problem

streamIndex held pins.lock.RLock for the channel's full lifetime, so when the reprovider drained the channel at DHT speed the lock stayed held for hours and starved Pin/Unpin/Flush writers behind RWMutex writer preference.

Not impacting defaults (Provide.Strategy=all), but surfaces with Provide.Strategy=pinned*, concurrent pin ls and ipfs add (common RPCs done by ipfs-cluster) requests then piled up behind the queued writer. We had ipfs add waiting many hours for the lock on one of the boxes.

Proposed Solution (this PR)

streamIndex now materializes (key, value) pairs into a slice under the read lock and returns, letting the goroutine emit without holding any pinner lock.

when pin details are requested, loadPin runs lock-free in the detailed path; if a pin from the snapshot was removed and emits ErrNotFound it is skipped. this way we dont have any races, but also dont need to hold lock all the time.

Number of pins is small enough to keep them in memory, but we can add go-dsqueue is we ever have users with 100M of pins complaining.

TestStreamIndexDoesNotBlockWriters guards the regression.

TODO

add regression rest to not re-introduce lock
green ci here
kubo PR fix(pins): snapshot index before emitting pins kubo#11290 has green ci

streamIndex held p.lock.RLock for the channel's full lifetime, so when the reprovider drained the channel at DHT speed the lock stayed held for hours and starved Pin/Unpin/Flush writers behind RWMutex writer preference. With `Provide.Strategy=pinned*`, concurrent pin ls and ipfs add requests then piled up behind the queued writer. streamIndex now materializes (key, value) pairs into a slice under the read lock and returns, letting the goroutine emit without holding any pinner lock. loadPin runs lock-free in the detailed path; ErrNotFound is skipped as a benign snapshot-vs-load race. TestStreamIndexDoesNotBlockWriters guards the regression.

codecov · 2026-04-21T18:26:08Z

Codecov Report

❌ Patch coverage is 67.85714% with 9 lines in your changes missing coverage. Please review.
✅ Project coverage is 63.01%. Comparing base (956010c) to head (08f27fc).
⚠️ Report is 2 commits behind head on main.

Files with missing lines	Patch %	Lines
pinning/pinner/dspinner/pin.go	67.85%	7 Missing and 2 partials ⚠️

@@            Coverage Diff             @@
##             main    #1140      +/-   ##
==========================================
+ Coverage   62.99%   63.01%   +0.02%     
==========================================
  Files         266      266              
  Lines       26661    26671      +10     
==========================================
+ Hits        16794    16808      +14     
+ Misses       8153     8150       -3     
+ Partials     1714     1713       -1

Files with missing lines	Coverage Δ
pinning/pinner/dspinner/pin.go	`63.15% <67.85%> (+0.81%)`	⬆️

... and 9 files with indirect coverage changes

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

drop the time.Sleep/time.After heuristics. under the broken code path both goroutines are durably blocked, which synctest surfaces as a deadlock instead of a 3s timeout.

* chore: bump boxo to ipfs/boxo#1140 picks up dspinner fix that snapshots the index before emitting pins, avoiding the streaming lock convoy. * docs: changelog entry for pinner stall fix * docs: clarify pinner snapshot behavior * chore: bump boxo to include ipfs/boxo#1146 Picks up the fix for "panic: pebble: closed" on shutdown (#11292): the dspinner streamIndex goroutine now recovers from any datastore panic and reports it as an error on the output channel, so the daemon exits cleanly instead of crashing when the datastore closes before pin enumeration drains. * fix(provider): quiet keystore-close on shutdown When the daemon shuts down, the keystore Close fires while the startup sync goroutine may still be in flight: the OnStart ctx is not yet cancelled, so ResetCids returning keystore.ErrClosed gets logged at Error as "sync failed". Treat keystore.ErrClosed the same as a cancelled ctx and log at Debug as "interrupted by shutdown". Apply the same rule to the periodic reprovide GC loop (whose error log got a unified message in the process). * test(cli): keystore-close log + pin ls shutdown Adds TestProviderKeystoreSyncShutdownQuiet, a CLI test that: 1. Verifies no shutdown-caused keystore-sync error (err="keystore is closed" or err="context canceled") is logged at Error level. Scans stderr line-by-line so unrelated Error logs (e.g. "reset already in progress" from the startup+periodic overlap at tight Intervals) do not false-positive the assertion. 2. Runs `ipfs pin ls --stream` against the live daemon, shuts the daemon down mid-stream, and asserts the CLI returns within 15s, does not observe a daemon panic, and produces a meaningful error message if it exited non-zero. Uses Provide.DHT.Interval=10ms so the periodic reprovide loop is always inside ResetCids when StopDaemon fires, making the shutdown race deterministic enough to catch the regression on most runs (verified empirically against the pre-fix provider.go).

* chore: bump boxo to ipfs/boxo#1140 picks up dspinner fix that snapshots the index before emitting pins, avoiding the streaming lock convoy. * docs: changelog entry for pinner stall fix * docs: clarify pinner snapshot behavior * chore: bump boxo to include ipfs/boxo#1146 Picks up the fix for "panic: pebble: closed" on shutdown (#11292): the dspinner streamIndex goroutine now recovers from any datastore panic and reports it as an error on the output channel, so the daemon exits cleanly instead of crashing when the datastore closes before pin enumeration drains. * fix(provider): quiet keystore-close on shutdown When the daemon shuts down, the keystore Close fires while the startup sync goroutine may still be in flight: the OnStart ctx is not yet cancelled, so ResetCids returning keystore.ErrClosed gets logged at Error as "sync failed". Treat keystore.ErrClosed the same as a cancelled ctx and log at Debug as "interrupted by shutdown". Apply the same rule to the periodic reprovide GC loop (whose error log got a unified message in the process). * test(cli): keystore-close log + pin ls shutdown Adds TestProviderKeystoreSyncShutdownQuiet, a CLI test that: 1. Verifies no shutdown-caused keystore-sync error (err="keystore is closed" or err="context canceled") is logged at Error level. Scans stderr line-by-line so unrelated Error logs (e.g. "reset already in progress" from the startup+periodic overlap at tight Intervals) do not false-positive the assertion. 2. Runs `ipfs pin ls --stream` against the live daemon, shuts the daemon down mid-stream, and asserts the CLI returns within 15s, does not observe a daemon panic, and produces a meaningful error message if it exited non-zero. Uses Provide.DHT.Interval=10ms so the periodic reprovide loop is always inside ResetCids when StopDaemon fires, making the shutdown race deterministic enough to catch the regression on most runs (verified empirically against the pre-fix provider.go). (cherry picked from commit 8416f38)

lidel added 3 commits April 21, 2026 20:29

docs(changelog): note dspinner streamIndex lock fix

40ae2f6

test(dspinner): use synctest for streamIndex lock regression

554b108

drop the time.Sleep/time.After heuristics. under the broken code path both goroutines are durably blocked, which synctest surfaces as a deadlock instead of a 3s timeout.

refactor(dspinner): use errors.Is for ds.ErrNotFound checks

08f27fc

lidel mentioned this pull request Apr 21, 2026

fix(pins): snapshot index before emitting pins ipfs/kubo#11290

Merged

4 tasks

lidel marked this pull request as ready for review April 21, 2026 19:31

lidel requested a review from a team as a code owner April 21, 2026 19:31

guillaumemichel approved these changes Apr 22, 2026

View reviewed changes

gammazero approved these changes Apr 22, 2026

View reviewed changes

lidel merged commit aceb6ce into main Apr 22, 2026
20 checks passed

lidel added a commit to ipfs/kubo that referenced this pull request Apr 22, 2026

chore: bump boxo to latest ipfs/boxo#1140

695eeca

lidel mentioned this pull request Apr 23, 2026

Release 0.41 ipfs/kubo#11082

Open

38 tasks

BrewTestBot mentioned this pull request Apr 23, 2026

kubo 0.41.0 Homebrew/homebrew-core#279184

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(dspinner): snapshot index before emitting pins#1140

fix(dspinner): snapshot index before emitting pins#1140
lidel merged 4 commits intomainfrom
fix/dspinner-streamindex-lock-convoy

lidel commented Apr 21, 2026 •

edited

Loading

Uh oh!

codecov Bot commented Apr 21, 2026 •

edited

Loading

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

lidel commented Apr 21, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Problem

Proposed Solution (this PR)

TODO

Uh oh!

codecov Bot commented Apr 21, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

lidel commented Apr 21, 2026 •

edited

Loading

codecov Bot commented Apr 21, 2026 •

edited

Loading