Skip to content

fix(dspinner): snapshot index before emitting pins#1140

Merged
lidel merged 4 commits intomainfrom
fix/dspinner-streamindex-lock-convoy
Apr 22, 2026
Merged

fix(dspinner): snapshot index before emitting pins#1140
lidel merged 4 commits intomainfrom
fix/dspinner-streamindex-lock-convoy

Conversation

@lidel
Copy link
Copy Markdown
Member

@lidel lidel commented Apr 21, 2026

Note

Found this on collab cluster while testing v0.41.0-rc2, not a release blocker but would be good to fix in 0.42
Kubo PR with changelog: ipfs/kubo#11290

Problem

streamIndex held pins.lock.RLock for the channel's full lifetime, so when the reprovider drained the channel at DHT speed the lock stayed held for hours and starved Pin/Unpin/Flush writers behind RWMutex writer preference.

Not impacting defaults (Provide.Strategy=all), but surfaces with Provide.Strategy=pinned*, concurrent pin ls and ipfs add (common RPCs done by ipfs-cluster) requests then piled up behind the queued writer. We had ipfs add waiting many hours for the lock on one of the boxes.

Proposed Solution (this PR)

streamIndex now materializes (key, value) pairs into a slice under the read lock and returns, letting the goroutine emit without holding any pinner lock.

when pin details are requested, loadPin runs lock-free in the detailed path; if a pin from the snapshot was removed and emits ErrNotFound it is skipped. this way we dont have any races, but also dont need to hold lock all the time.

Number of pins is small enough to keep them in memory, but we can add go-dsqueue is we ever have users with 100M of pins complaining.

TestStreamIndexDoesNotBlockWriters guards the regression.

TODO

streamIndex held p.lock.RLock for the channel's full lifetime, so when the
reprovider drained the channel at DHT speed the lock stayed held for hours
and starved Pin/Unpin/Flush writers behind RWMutex writer preference. With
`Provide.Strategy=pinned*`, concurrent pin ls and ipfs add requests then
piled up behind the queued writer.

streamIndex now materializes (key, value) pairs into a slice under the
read lock and returns, letting the goroutine emit without holding any
pinner lock. loadPin runs lock-free in the detailed path; ErrNotFound is
skipped as a benign snapshot-vs-load race.

TestStreamIndexDoesNotBlockWriters guards the regression.
@codecov
Copy link
Copy Markdown

codecov Bot commented Apr 21, 2026

Codecov Report

❌ Patch coverage is 67.85714% with 9 lines in your changes missing coverage. Please review.
✅ Project coverage is 63.01%. Comparing base (956010c) to head (08f27fc).
⚠️ Report is 2 commits behind head on main.

Files with missing lines Patch % Lines
pinning/pinner/dspinner/pin.go 67.85% 7 Missing and 2 partials ⚠️

Impacted file tree graph

@@            Coverage Diff             @@
##             main    #1140      +/-   ##
==========================================
+ Coverage   62.99%   63.01%   +0.02%     
==========================================
  Files         266      266              
  Lines       26661    26671      +10     
==========================================
+ Hits        16794    16808      +14     
+ Misses       8153     8150       -3     
+ Partials     1714     1713       -1     
Files with missing lines Coverage Δ
pinning/pinner/dspinner/pin.go 63.15% <67.85%> (+0.81%) ⬆️

... and 9 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

lidel added 3 commits April 21, 2026 20:29
drop the time.Sleep/time.After heuristics. under the broken
code path both goroutines are durably blocked, which synctest
surfaces as a deadlock instead of a 3s timeout.
@lidel lidel marked this pull request as ready for review April 21, 2026 19:31
@lidel lidel requested a review from a team as a code owner April 21, 2026 19:31
@lidel lidel merged commit aceb6ce into main Apr 22, 2026
20 checks passed
lidel added a commit to ipfs/kubo that referenced this pull request Apr 22, 2026
@lidel lidel mentioned this pull request Apr 23, 2026
38 tasks
lidel added a commit to ipfs/kubo that referenced this pull request Apr 23, 2026
* chore: bump boxo to ipfs/boxo#1140

picks up dspinner fix that snapshots the index before emitting pins,
avoiding the streaming lock convoy.

* docs: changelog entry for pinner stall fix

* docs: clarify pinner snapshot behavior

* chore: bump boxo to include ipfs/boxo#1146

Picks up the fix for "panic: pebble: closed" on shutdown (#11292):
the dspinner streamIndex goroutine now recovers from any datastore
panic and reports it as an error on the output channel, so the
daemon exits cleanly instead of crashing when the datastore closes
before pin enumeration drains.

* fix(provider): quiet keystore-close on shutdown

When the daemon shuts down, the keystore Close fires while the
startup sync goroutine may still be in flight: the OnStart ctx is
not yet cancelled, so ResetCids returning keystore.ErrClosed gets
logged at Error as "sync failed".

Treat keystore.ErrClosed the same as a cancelled ctx and log at
Debug as "interrupted by shutdown". Apply the same rule to the
periodic reprovide GC loop (whose error log got a unified message
in the process).

* test(cli): keystore-close log + pin ls shutdown

Adds TestProviderKeystoreSyncShutdownQuiet, a CLI test that:

1. Verifies no shutdown-caused keystore-sync error (err="keystore
   is closed" or err="context canceled") is logged at Error level.
   Scans stderr line-by-line so unrelated Error logs (e.g.
   "reset already in progress" from the startup+periodic overlap
   at tight Intervals) do not false-positive the assertion.

2. Runs `ipfs pin ls --stream` against the live daemon, shuts the
   daemon down mid-stream, and asserts the CLI returns within 15s,
   does not observe a daemon panic, and produces a meaningful
   error message if it exited non-zero.

Uses Provide.DHT.Interval=10ms so the periodic reprovide loop is
always inside ResetCids when StopDaemon fires, making the shutdown
race deterministic enough to catch the regression on most runs
(verified empirically against the pre-fix provider.go).
lidel added a commit to ipfs/kubo that referenced this pull request Apr 23, 2026
* chore: bump boxo to ipfs/boxo#1140

picks up dspinner fix that snapshots the index before emitting pins,
avoiding the streaming lock convoy.

* docs: changelog entry for pinner stall fix

* docs: clarify pinner snapshot behavior

* chore: bump boxo to include ipfs/boxo#1146

Picks up the fix for "panic: pebble: closed" on shutdown (#11292):
the dspinner streamIndex goroutine now recovers from any datastore
panic and reports it as an error on the output channel, so the
daemon exits cleanly instead of crashing when the datastore closes
before pin enumeration drains.

* fix(provider): quiet keystore-close on shutdown

When the daemon shuts down, the keystore Close fires while the
startup sync goroutine may still be in flight: the OnStart ctx is
not yet cancelled, so ResetCids returning keystore.ErrClosed gets
logged at Error as "sync failed".

Treat keystore.ErrClosed the same as a cancelled ctx and log at
Debug as "interrupted by shutdown". Apply the same rule to the
periodic reprovide GC loop (whose error log got a unified message
in the process).

* test(cli): keystore-close log + pin ls shutdown

Adds TestProviderKeystoreSyncShutdownQuiet, a CLI test that:

1. Verifies no shutdown-caused keystore-sync error (err="keystore
   is closed" or err="context canceled") is logged at Error level.
   Scans stderr line-by-line so unrelated Error logs (e.g.
   "reset already in progress" from the startup+periodic overlap
   at tight Intervals) do not false-positive the assertion.

2. Runs `ipfs pin ls --stream` against the live daemon, shuts the
   daemon down mid-stream, and asserts the CLI returns within 15s,
   does not observe a daemon panic, and produces a meaningful
   error message if it exited non-zero.

Uses Provide.DHT.Interval=10ms so the periodic reprovide loop is
always inside ResetCids when StopDaemon fires, making the shutdown
race deterministic enough to catch the regression on most runs
(verified empirically against the pre-fix provider.go).

(cherry picked from commit 8416f38)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants