Message handler is too slow; dropping message during tECDSA signing #3419
I am not sure if this is the root cause of the observed problem, but I think the control loop of a receiver is still broken: see `keep-core/pkg/net/libp2p/channel.go`, lines 138 to 163 at commit 3a446af.
If the handlers get blocked for whatever reason (a full buffer, for example), the loop stalls and the handler is never unregistered. I opened #3420 with a proposal for fixing the API, but I think the refactoring will require more thought to do right, and we could potentially fix this problem sooner by splitting the handler goroutine in two: one piping messages between channels and one controlling the lifecycle of the handler (see the sketch below).
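For illustration, here is a minimal, self-contained Go sketch of the single-loop pattern described above (all names are hypothetical stand-ins, not the actual channel.go code): one goroutine both pipes messages and watches the context, so once a handler call blocks, the `ctx.Done()` case can never be reached and the handler is never unregistered.

```
package main

import (
	"context"
	"fmt"
	"time"
)

func main() {
	ctx, cancel := context.WithCancel(context.Background())
	messages := make(chan string, 1)

	// Single control loop: piping and lifecycle are handled by one goroutine.
	go func() {
		for {
			select {
			case msg := <-messages:
				fmt.Println("handling", msg)
				select {} // stand-in for a handler call that blocks forever
			case <-ctx.Done():
				fmt.Println("unregistering handler") // never reached while blocked
				return
			}
		}
	}()

	messages <- "m1"
	time.Sleep(100 * time.Millisecond)
	cancel() // cancellation is never observed: the loop is stuck in the handler
	time.Sleep(100 * time.Millisecond)
}
```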
One interesting observation: so far, this problem has always surfaced right after the announcement phase launches.
I think the problem lies in the size of the buffer used by the announcer: see `keep-core/pkg/protocol/announcer/announcer.go`, line 103 at commit 4d9404c.
Before a message in which the announcer is not interested is dropped, it is buffered in this channel. This also includes retransmissions from the previous signing protocols, because the retransmission cache filter is scoped to the given `Recv` handler (see `keep-core/pkg/net/libp2p/channel.go`, line 136, and `keep-core/pkg/net/retransmission/retransmission.go`, lines 54 to 56, at commit 4d9404c).
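To make the failure mode concrete, here is a self-contained sketch (simplified types and names, not the actual keep-core code) of the drop-on-full pattern behind the warning: the producer performs a non-blocking send, and once the handler's buffered channel is full, every further message is dropped.

```
package main

import "fmt"

func main() {
	// A tiny buffer to force drops; the announcer's channel is larger but
	// fills up the same way when nobody drains it fast enough.
	messages := make(chan string, 2)

	for i := 0; i < 5; i++ {
		msg := fmt.Sprintf("message-%d", i)
		select {
		case messages <- msg:
			// Buffered successfully.
		default:
			// Buffer full: messages 2..4 end up here.
			fmt.Println("message handler is too slow; dropping message:", msg)
		}
	}
}
```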
Refs #3419

During the tests of the code orchestrating tECDSA signing in #3404, we realized some buffers get full and the broadcast channel keeps writing to them even though the receiver is no longer alive. This was fixed in #3418 by introducing an additional context in the asynchronous state machine that unregisters the handler if the state machine (receiver) exits sooner than the context.

That fixed the problem on the receiver side, but we still need to fix it on the producer side, and this is what this PR does. A separate goroutine controls the lifecycle of the handler: the message handler is removed from the channel once the context is done. This logic lives in a separate goroutine because the call to `handleWithRetransmission` is blocking, and we do not want to delay removing the handler if that call blocks for a longer period of time, especially when the underlying buffered channel is full (see the sketch below).
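Here is a minimal sketch of that split, assuming hypothetical names modeled on the PR description (the real handler registration and removal logic in channel.go is more involved): one goroutine only pipes messages into the possibly-blocking handler call, while a second goroutine watches the context and can remove the handler immediately, no matter how long the handler blocks.

```
package main

import (
	"context"
	"fmt"
	"time"
)

func main() {
	ctx, cancel := context.WithCancel(context.Background())
	messages := make(chan string, 1)

	// Piping goroutine: may block inside the handler call.
	go func() {
		for msg := range messages {
			fmt.Println("handling", msg)
			select {} // stand-in for a blocked handleWithRetransmission call
		}
	}()

	// Lifecycle goroutine: reacts to context cancellation immediately.
	go func() {
		<-ctx.Done()
		// In the real code this would remove the handler from the channel
		// so the broadcast channel stops writing to its buffer.
		fmt.Println("context done; removing handler")
	}()

	messages <- "m1"
	cancel()
	time.Sleep(100 * time.Millisecond)
}
```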
Closes #3419 Depends on #3422

During the tests of the code orchestrating tECDSA signing in #3404, we noticed the announcer's buffer sometimes gets full and messages are dropped with the famous "message handler is too slow" warning. It turns out that the problem lies in the size of the buffer used by the announcer. Before a message in which the announcer is not interested is dropped, it is buffered in this channel. This also includes retransmissions from the previous signing protocols, because the retransmission cache filter is scoped to the given `Recv` handler:

```
func WithRetransmissionSupport(delegate func(m net.Message)) func(m net.Message) {
	mutex := &sync.Mutex{}
	cache := make(map[string]bool)

	// (...)
}
```

The announcer's buffer size has been increased to 512 elements, just like the buffer of the asynchronous state machine, and based on local machine tests the problem seems to be gone.
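To illustrate why the per-handler scoping matters, here is a self-contained sketch (string messages instead of `net.Message`, simplified fingerprinting; not the actual keep-core code): each call to the wrapper creates its own cache, so a freshly registered handler treats retransmissions of messages from earlier protocol instances as brand new and buffers them again.

```
package main

import (
	"fmt"
	"sync"
)

func withRetransmissionSupport(delegate func(m string)) func(m string) {
	mutex := &sync.Mutex{}
	cache := make(map[string]bool)

	return func(m string) {
		mutex.Lock()
		defer mutex.Unlock()
		if cache[m] {
			return // retransmission already seen by this handler
		}
		cache[m] = true
		delegate(m)
	}
}

func main() {
	oldHandler := withRetransmissionSupport(func(m string) { fmt.Println("old handler got", m) })
	oldHandler("sign-round-1") // delivered
	oldHandler("sign-round-1") // filtered as a retransmission

	// A new protocol instance registers a new handler with a fresh cache,
	// so the same retransmitted message passes the filter again and takes
	// up space in the new handler's buffer.
	newHandler := withRetransmissionSupport(func(m string) { fmt.Println("new handler got", m) })
	newHandler("sign-round-1") // delivered again
}
```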
#3418 fixed the problem with dangling message handlers that no longer processed messages and had their buffers full. During the tests of #3404 on commit 9b28e73, i.e. after the fix was reverse-merged, I noticed similar behavior on one of the clients.
The second client (happy-path-2.txt) at 2022-11-26T00:22:25.120+0100 started reporting a massive number of `message handler is too slow; dropping message` warnings. This problem was not present on other nodes at that time. The number of warnings suggests it could be, again, some problem with a dangling receiver no longer processing messages. Given the size of the files, the logs from all 4 clients running were uploaded to Google Drive:
https://drive.google.com/drive/folders/11MU2L6gDbUHgQ_b3X_m9cFjZLE3_mjxS?usp=sharing