Message handler is too slow; dropping message during tECDSA signing #3419

Closed
pdyraga opened this issue Nov 28, 2022 · 2 comments · Fixed by #3423


pdyraga commented Nov 28, 2022

#3418 fixed the problem of dangling message handlers that were no longer processing messages and had their buffers full. During the tests of #3404 on commit 9b28e73, i.e. after that fix was reverse-merged, I noticed similar behavior on one of the clients.

The second client (happy-path-2.txt) started reporting a massive number of `message handler is too slow; dropping message` warnings at 2022-11-26T00:22:25.120+0100. This problem was not present on the other nodes at that time. The number of warnings suggests it could be, again, some problem with a dangling receiver no longer processing messages.

Given the size of the files, the logs from all 4 running clients were uploaded to Google Drive:
https://drive.google.com/drive/folders/11MU2L6gDbUHgQ_b3X_m9cFjZLE3_mjxS?usp=sharing


pdyraga commented Nov 28, 2022

I am not sure if this is the root cause of the observed problem, but I think the control loop of a receiver is still broken:

```
go func() {
	for {
		select {
		case <-ctx.Done():
			logger.Debug("context is done; removing message handler")
			c.removeHandler(messageHandler)
			return
		case msg := <-messageHandler.channel:
			// The Go language specification says that if one or more of
			// the communications in the select statement can proceed, a
			// single one that will proceed is chosen via a uniform
			// pseudo-random selection.
			// Thus, it can happen that this communication is chosen when
			// ctx is already done. Since we guarantee in the network
			// channel API that the handler is not called after ctx is
			// done (client code could e.g. perform some cleanup), we need
			// to double-check the context state here.
			if messageHandler.ctx.Err() != nil {
				continue
			}
			handleWithRetransmissions(msg)
		}
	}
}()
```

If the handler gets blocked for whatever reason (a full buffer, for example), `c.removeHandler(messageHandler)` will never get called, and the `messageHandler.channel` writer will have to start dropping messages, leading to the logs I observed.
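For context, the warning presumably comes from a non-blocking send on the producer side, along these lines (a sketch only, not the exact libp2p/channel.go code; the `channel` fields shown are hypothetical, while `messageHandler`, `logger`, and `net.Message` are the names from the snippet above):

```
// Sketch: delivering a message to all registered handlers. If a
// handler's buffered channel is full because its receiver stopped
// draining it, the default branch fires and the message is dropped.
func (c *channel) deliver(message net.Message) {
	c.messageHandlersMutex.Lock()
	defer c.messageHandlersMutex.Unlock()

	for _, handler := range c.messageHandlers {
		select {
		case handler.channel <- message:
			// Delivered into the handler's buffer.
		default:
			logger.Warn("message handler is too slow; dropping message")
		}
	}
}
```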

I opened #3420 with a proposal to fix the API, but the refactoring will require more thought to get right. We could potentially fix this problem sooner by splitting the handler goroutine into two: one piping messages between channels and one controlling the lifecycle of the handler, as sketched below.
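A minimal sketch of that split, reusing the identifiers from the snippet above (the actual fix may differ in details):

```
// Goroutine 1 controls the lifecycle of the handler. It only waits on
// the context, so even a blocked handler cannot delay deregistration.
go func() {
	<-ctx.Done()
	logger.Debug("context is done; removing message handler")
	c.removeHandler(messageHandler)
}()

// Goroutine 2 pipes messages between channels. It may still block in
// handleWithRetransmissions, but that no longer prevents the handler
// from being removed by the first goroutine.
go func() {
	for {
		select {
		case <-ctx.Done():
			return
		case msg := <-messageHandler.channel:
			// The handler must not be called after ctx is done.
			if messageHandler.ctx.Err() != nil {
				continue
			}
			handleWithRetransmissions(msg)
		}
	}
}()
```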


pdyraga commented Nov 28, 2022

One interesting observation is that, so far, this problem has always shown up right after the announcement phase started:

```
2022-11-28T15:23:10.079+0100 INFO keep-tbtc tbtc/signing_loop.go:211 [member:61] starting announcement phase for attempt [1] {"wallet": "0x048da8d0ff8bfbb5c59d1b87519457cf90438266594319c82833cac04348c3db860af9e6325debf91c44e3a35c201642c9433b86d5cf03039e2ed81776f128ab58", "message": "0x66a8773a610c219e571734ad2ad614608b5888b6602e5a1a4178283088f30196", "signingStartBlock": 148626, "signingTimeoutBlock": 148831}
2022-11-28T15:23:10.204+0100 WARN keep-libp2p libp2p/channel.go:357 message handler is too slow; dropping message
```

I think the problem lies in the size of the buffer used by the announcer:

```
messagesChan := make(chan net.Message, a.groupSize)
```

Before a message in which the announcer is not interested is dropped, it is buffered in this channel. This also includes retransmissions from previous signing protocols, because the retransmission cache filter is scoped to a single `Recv` handler:

```
handleWithRetransmissions := retransmission.WithRetransmissionSupport(handler)
```

```
func WithRetransmissionSupport(delegate func(m net.Message)) func(m net.Message) {
	mutex := &sync.Mutex{}
	cache := make(map[string]bool)
	// (...)
}
```
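To make the scoping concrete, here is a self-contained toy version of the filter (simplified message type and a hypothetical fingerprint; not the actual keep-core code). Every call to the constructor gets a fresh cache, so a newly registered handler sees, and must buffer, retransmissions that earlier handlers already filtered out:

```
package main

import (
	"fmt"
	"sync"
)

// message is a simplified stand-in for net.Message in this sketch.
type message struct {
	sender string
	seqno  uint64
}

// withRetransmissionSupport mirrors the shape of the real filter: each
// call creates a fresh cache, so deduplication is scoped to one handler.
func withRetransmissionSupport(delegate func(m message)) func(m message) {
	mutex := &sync.Mutex{}
	cache := make(map[string]bool)

	return func(m message) {
		// Hypothetical fingerprint: sender plus sequence number.
		key := fmt.Sprintf("%s-%d", m.sender, m.seqno)

		mutex.Lock()
		seen := cache[key]
		cache[key] = true
		mutex.Unlock()

		if !seen {
			delegate(m)
		}
	}
}

func main() {
	first := withRetransmissionSupport(func(m message) {
		fmt.Println("first handler delivered", m.seqno)
	})
	first(message{"peer-1", 1}) // delivered
	first(message{"peer-1", 1}) // retransmission, filtered out

	// A new Recv handler gets a new cache, so the very same
	// retransmission passes the filter again and occupies the new
	// handler's buffer.
	second := withRetransmissionSupport(func(m message) {
		fmt.Println("second handler delivered", m.seqno)
	})
	second(message{"peer-1", 1}) // delivered again
}
```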

@pdyraga added this to the v2.0.0-m2 milestone Nov 29, 2022
@pdyraga self-assigned this Nov 29, 2022
lukasz-zimnoch added a commit that referenced this issue Nov 29, 2022
Refs #3419


During the tests of the code orchestrating tECDSA signing in #3404, we
realized some buffers get full and the broadcast channel keeps writing
to them even though the receiver is no longer alive. This was fixed in
#3418 by introducing an additional context in the asynchronous state
machine that unregisters the handler if the state machine (receiver)
exits sooner than the context. That fixed the problem on the receiver
side, but we still need to fix it on the producer side, and that is what
this PR does.

A separate goroutine controls the lifecycle of the handler. The message
handler is removed from the channel once the context is done. This logic
is placed in a separate goroutine because the call to
`handleWithRetransmissions` is blocking, and we do not want to delay
removing the handler if that call blocks for a long time, especially
when the underlying buffered channel is full.
lukasz-zimnoch added a commit that referenced this issue Nov 29, 2022
Closes #3419
Depends on #3422

During the tests of the code orchestrating tECDSA signing in #3404, we
noticed the announcer's buffer sometimes gets full and messages are
dropped with the famous "message handler is too slow" warning. It turns
out that the problem lies in the size of the buffer used by the
announcer. Before a message in which the announcer is not interested is
dropped, it is buffered in this channel. This also includes
retransmissions from previous signing protocols, because the
retransmission cache filter is scoped to the given `Recv` handler:
```
func WithRetransmissionSupport(delegate func(m net.Message)) func(m net.Message) {
  mutex := &sync.Mutex{}
  cache := make(map[string]bool)
  // (...)
}
```

The announcer's buffer size has been increased to 512 elements, just
like the buffer of the asynchronous state machine, and the problem seems
to be gone based on local machine tests.
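For illustration, the change boils down to something like this (the constant name is hypothetical; the original allocation sized the channel by `a.groupSize`):

```
// Hypothetical constant; the commit describes the new size as 512,
// matching the buffer of the asynchronous state machine.
const announceReceiveBufferSize = 512

messagesChan := make(chan net.Message, announceReceiveBufferSize)
```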
@pdyraga removed their assignment Jan 25, 2023