
invoiceregistry: re-design invoice registry notification internals, remove layers of mutexes #8023

Open

Roasbeef opened this issue Sep 22, 2023 · 1 comment

Assignees: ziggie1984
Labels: bug (Unintended code behaviour), concurrency, epic (Issues created to track large feature development), needs triage, P1 (MUST be fixed or reviewed), refactoring
Milestone: v0.18.2

Comments

@Roasbeef (Member) commented:

For the past month or so, we've been hunting down some deadlocks in the daemon. After receiving a blocking profile, we realized that the invoice registry, and in particular the hodl invoice registration and notification code, was the culprit. In the past, we've made incremental fixes to try to resolve the issue. However, it's now clear that we should bite the bullet and refactor the internals to eliminate the layers of mutexes, in favor of:

  1. A central event loop for notification and registration. Channel sends would be used to pass messages into and out of the event loop.
  2. Usage of the subscribe package, where relevant, in favor of singly buffered channels. The package uses an unbounded concurrent queue under the hood. Within the invoice registry today, there are assumptions that singly buffered channels are enough because a given channel is never sent on twice; in practice, we've seen this to be false.

Along the way, we may want to adopt a more type-safe subscription/notification variant from one of our related codebases:
https://github.com/lightninglabs/taproot-assets/blob/603078bd3569886da20a36eb1ae1a89098a62aa5/fn/events.go#L62-L78C2
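
To make that concrete, here's a rough sketch of what a single event loop could look like (all names here are hypothetical, not the existing invoiceregistry API): registration, unsubscription, and notification are all just messages into one goroutine that owns the subscriber map, so no shared mutex is needed.

```go
package sketch

// subscriber and registryLoop are illustrative only; the real types and
// message payloads would come out of the actual re-design.
type subscriber struct {
	id     uint64
	events chan interface{} // delivery channel handed back to the caller
}

type registryLoop struct {
	regChan    chan *subscriber // new subscriptions
	unregChan  chan uint64      // unsubscribe requests, by subscriber id
	notifyChan chan interface{} // events to fan out
	quit       chan struct{}
}

// run is the central event loop. Because this is the only goroutine that
// ever touches the subscriber map, no mutex is required; all interaction
// happens via channel sends into the loop.
func (r *registryLoop) run() {
	subs := make(map[uint64]*subscriber)

	for {
		select {
		case sub := <-r.regChan:
			subs[sub.id] = sub

		case id := <-r.unregChan:
			delete(subs, id)

		case event := <-r.notifyChan:
			for _, sub := range subs {
				select {
				case sub.events <- event:
				default:
					// A real implementation would hand the
					// event to an unbounded concurrent
					// queue here rather than dropping it
					// or blocking the loop.
				}
			}

		case <-r.quit:
			return
		}
	}
}
```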

Short Term Fixes

After digging through the code a bit, I found some candidates for shorter-term fixes to alleviate the issue while we go to the drawing board to re-work the internal architecture:

  1. Add additional buffering to the hodlChan. This would ensure that the main notifyHodlSubscribes method won't block on the send to the subscriber's channel. AFAICT, this is part of the root issue: the main mutex is held here, and if the channel send blocks, then nothing else can proceed.
  2. Modify the subscriber type into a struct. This struct would hold an internal cancel chan that's closed when the subscriber wants to exit. The main send loop would listen on this channel and, once it's closed, remove those entries early. With this, the HodlUnsubscribeAll method no longer needs to try to grab the main mutex; instead, it can just close the internal channel and exit (see the sketch at the end of this comment).
  3. Have HodlUnsubscribeAll drain the subscriber channel before trying to grab the mutex. This would ensure that the notifyHodlSubscribes loop won't block and cause circular waiting.
  4. Run the inner notification dispatch loop of notifyHodlSubscribes in a goroutine. Goroutines go brrr, so we eliminate blocking in that area.
  5. Modify the hodlChan to just be a concurrent queue using the subscribe package.

I lean towards either #5 or #2 for short-term fixes.

I also think we should re-examine the practice of using a channel as the key to a map. I can't find anything concrete re: pitfalls, but it sets my spidey senses off...
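
For option 2, here's a minimal sketch of the subscriber-as-struct idea (again, the names are hypothetical rather than existing lnd code): the cancel channel lets the dispatch loop skip a departing subscriber without HodlUnsubscribeAll ever needing to grab the registry's main mutex.

```go
package sketch

import "sync"

// hodlSubscriber would replace keying the subscriber map by a raw channel.
// The cancel channel is closed by the subscriber to signal that it's gone.
type hodlSubscriber struct {
	hodlChan   chan interface{}
	cancel     chan struct{}
	cancelOnce sync.Once
}

// unsubscribe is what HodlUnsubscribeAll would call per subscriber: it only
// closes the cancel channel, so no locking is required on the exit path.
func (s *hodlSubscriber) unsubscribe() {
	s.cancelOnce.Do(func() {
		close(s.cancel)
	})
}

// notify is the per-subscriber send inside the dispatch loop. It can no
// longer block forever: either the event is delivered, or the subscriber
// has cancelled and the caller can drop it from the map.
func (s *hodlSubscriber) notify(event interface{}) bool {
	select {
	case s.hodlChan <- event:
		return true
	case <-s.cancel:
		return false
	}
}
```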

@Roasbeef Roasbeef added bug Unintended code behaviour P1 MUST be fixed or reviewed refactoring concurrency needs triage labels Sep 22, 2023
Roasbeef added a commit to Roasbeef/lnd that referenced this issue Sep 25, 2023
… queue

In this commit, we modify the incoming contest resolver to use a
concurrent queue. This is meant to ensure that the invoice registry
subscription loop never blocks. This change is meant to be minimal and
implements option `5` as outlined here:
lightningnetwork#8023.

With this change, the inner loop of the subscription dispatch method in
the invoice registry will no longer block, as the concurrent queue uses
a fixed-size buffer and then overflows into an internal unbounded queue
when that buffer fills up.

Fixes lightningnetwork#7917
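
For reference, here's a hedged sketch of what option 5 looks like on the subscriber side, assuming lnd's existing queue.ConcurrentQueue API (NewConcurrentQueue, Start, ChanIn, ChanOut); the exact wiring in the contest resolver may differ.

```go
package sketch

import "github.com/lightningnetwork/lnd/queue"

// newHodlQueue shows the general shape of option 5: instead of handing the
// registry a singly buffered hodlChan, the subscriber hands over the input
// side of a concurrent queue, so the registry's dispatch loop can always
// send without blocking while the consumer drains at its own pace.
func newHodlQueue() (*queue.ConcurrentQueue, chan<- interface{}, <-chan interface{}) {
	// 16 is an arbitrary buffer size for this sketch; once the buffer
	// fills, the queue spills into an internal overflow list.
	hodlQueue := queue.NewConcurrentQueue(16)
	hodlQueue.Start()

	return hodlQueue, hodlQueue.ChanIn(), hodlQueue.ChanOut()
}
```

The registry then sends events into ChanIn() where it previously sent on hodlChan, the resolver reads from ChanOut(), and Stop() tears the queue down when the resolver exits.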
@yyforyongyu yyforyongyu self-assigned this Sep 26, 2023
@saubyk saubyk added the epic Issues created to track large feature development label Nov 21, 2023
@ziggie1984 (Collaborator) commented:

That's very helpful advice, will incorporate it and share as soon as possible to discuss design choices, thank you 🙏

@ziggie1984 ziggie1984 assigned ziggie1984 and unassigned yyforyongyu Jun 12, 2024
@ziggie1984 ziggie1984 added this to the v0.18.2 milestone Jul 1, 2024