fix(dspinner): snapshot index before emitting pins#1140
Merged
Conversation
streamIndex held p.lock.RLock for the channel's full lifetime, so when the reprovider drained the channel at DHT speed the lock stayed held for hours and starved Pin/Unpin/Flush writers behind RWMutex writer preference. With `Provide.Strategy=pinned*`, concurrent pin ls and ipfs add requests then piled up behind the queued writer. streamIndex now materializes (key, value) pairs into a slice under the read lock and returns, letting the goroutine emit without holding any pinner lock. loadPin runs lock-free in the detailed path; ErrNotFound is skipped as a benign snapshot-vs-load race. TestStreamIndexDoesNotBlockWriters guards the regression.
Codecov Report
@@ Coverage Diff @@
## main #1140 +/- ##
==========================================
+ Coverage 62.99% 63.01% +0.02%
==========================================
Files 266 266
Lines 26661 26671 +10
==========================================
+ Hits 16794 16808 +14
+ Misses 8153 8150 -3
+ Partials 1714 1713 -1
... and 9 files with indirect coverage changes
Drop the time.Sleep/time.After heuristics: under the broken code path both goroutines are durably blocked, which synctest surfaces as a deadlock instead of a 3s timeout.
guillaumemichel approved these changes Apr 22, 2026
gammazero approved these changes Apr 22, 2026
lidel added a commit to ipfs/kubo that referenced this pull request Apr 22, 2026
lidel added a commit to ipfs/kubo that referenced this pull request Apr 23, 2026
* chore: bump boxo to ipfs/boxo#1140
  Picks up the dspinner fix that snapshots the index before emitting pins, avoiding the streaming lock convoy.
* docs: changelog entry for pinner stall fix
* docs: clarify pinner snapshot behavior
* chore: bump boxo to include ipfs/boxo#1146
  Picks up the fix for "panic: pebble: closed" on shutdown (#11292): the dspinner streamIndex goroutine now recovers from any datastore panic and reports it as an error on the output channel, so the daemon exits cleanly instead of crashing when the datastore closes before pin enumeration drains.
* fix(provider): quiet keystore-close on shutdown
  When the daemon shuts down, the keystore Close fires while the startup sync goroutine may still be in flight: the OnStart ctx is not yet cancelled, so ResetCids returning keystore.ErrClosed gets logged at Error as "sync failed". Treat keystore.ErrClosed the same as a cancelled ctx and log at Debug as "interrupted by shutdown". Apply the same rule to the periodic reprovide GC loop (whose error log got a unified message in the process).
* test(cli): keystore-close log + pin ls shutdown
  Adds TestProviderKeystoreSyncShutdownQuiet, a CLI test that:
  1. Verifies no shutdown-caused keystore-sync error (err="keystore is closed" or err="context canceled") is logged at Error level. Scans stderr line-by-line so unrelated Error logs (e.g. "reset already in progress" from the startup+periodic overlap at tight Intervals) do not false-positive the assertion.
  2. Runs `ipfs pin ls --stream` against the live daemon, shuts the daemon down mid-stream, and asserts the CLI returns within 15s, does not observe a daemon panic, and produces a meaningful error message if it exited non-zero.
  Uses Provide.DHT.Interval=10ms so the periodic reprovide loop is always inside ResetCids when StopDaemon fires, making the shutdown race deterministic enough to catch the regression on most runs (verified empirically against the pre-fix provider.go).
lidel added a commit to ipfs/kubo that referenced this pull request Apr 23, 2026
* chore: bump boxo to ipfs/boxo#1140
  Picks up the dspinner fix that snapshots the index before emitting pins, avoiding the streaming lock convoy.
* docs: changelog entry for pinner stall fix
* docs: clarify pinner snapshot behavior
* chore: bump boxo to include ipfs/boxo#1146
  Picks up the fix for "panic: pebble: closed" on shutdown (#11292): the dspinner streamIndex goroutine now recovers from any datastore panic and reports it as an error on the output channel, so the daemon exits cleanly instead of crashing when the datastore closes before pin enumeration drains.
* fix(provider): quiet keystore-close on shutdown
  When the daemon shuts down, the keystore Close fires while the startup sync goroutine may still be in flight: the OnStart ctx is not yet cancelled, so ResetCids returning keystore.ErrClosed gets logged at Error as "sync failed". Treat keystore.ErrClosed the same as a cancelled ctx and log at Debug as "interrupted by shutdown". Apply the same rule to the periodic reprovide GC loop (whose error log got a unified message in the process).
* test(cli): keystore-close log + pin ls shutdown
  Adds TestProviderKeystoreSyncShutdownQuiet, a CLI test that:
  1. Verifies no shutdown-caused keystore-sync error (err="keystore is closed" or err="context canceled") is logged at Error level. Scans stderr line-by-line so unrelated Error logs (e.g. "reset already in progress" from the startup+periodic overlap at tight Intervals) do not false-positive the assertion.
  2. Runs `ipfs pin ls --stream` against the live daemon, shuts the daemon down mid-stream, and asserts the CLI returns within 15s, does not observe a daemon panic, and produces a meaningful error message if it exited non-zero.
  Uses Provide.DHT.Interval=10ms so the periodic reprovide loop is always inside ResetCids when StopDaemon fires, making the shutdown race deterministic enough to catch the regression on most runs (verified empirically against the pre-fix provider.go).

(cherry picked from commit 8416f38)
Note
Found this on the collab cluster while testing v0.41.0-rc2; not a release blocker, but would be good to fix in 0.42.
Kubo PR with changelog: ipfs/kubo#11290
Problem
`streamIndex` held `pins.lock.RLock` for the channel's full lifetime, so when the reprovider drained the channel at DHT speed the lock stayed held for hours and starved Pin/Unpin/Flush writers behind `RWMutex` writer preference. Not impacting defaults (`Provide.Strategy=all`), but it surfaces with `Provide.Strategy=pinned*`: concurrent `pin ls` and `ipfs add` requests (common RPCs issued by ipfs-cluster) then piled up behind the queued writer. We had `ipfs add` waiting many hours for the lock on one of the boxes.
Proposed Solution (this PR)
`streamIndex` now materializes (key, value) pairs into a slice under the read lock and returns, letting the goroutine emit without holding any pinner lock. When pin details are requested, `loadPin` runs lock-free in the detailed path; if a pin from the snapshot was removed in the meantime and `loadPin` emits `ErrNotFound`, it is skipped. This way we don't have any races, but also don't need to hold the lock the whole time. The number of pins is small enough to keep them in memory, but we can add `go-dsqueue` if we ever have users with 100M pins complaining. `TestStreamIndexDoesNotBlockWriters` guards the regression.
TODO