Message handler is too slow; dropping message during tECDSA signing #3419

Closed
pdyraga opened this issue Nov 28, 2022 · 2 comments · Fixed by #3423


pdyraga commented Nov 28, 2022

#3418 fixed the problem of dangling message handlers that were no longer processing messages and had their buffers full. During the tests of #3404 on commit 9b28e73, i.e. after that fix was reverse-merged, I noticed similar behavior on one of the clients.

The second client (happy-path-2.txt) started reporting a massive number of `message handler is too slow; dropping message` warnings at 2022-11-26T00:22:25.120+0100. This problem was not present on the other nodes at that time. The number of warnings suggests it could be, again, some problem with a dangling receiver no longer processing messages.

Given the size of the files, the logs from all 4 running clients were uploaded to Google Drive:
https://drive.google.com/drive/folders/11MU2L6gDbUHgQ_b3X_m9cFjZLE3_mjxS?usp=sharing


pdyraga commented Nov 28, 2022

I am not sure if this is the root cause of the observed problem, but I think the control loop of a receiver is still broken:

```
go func() {
	for {
		select {
		case <-ctx.Done():
			logger.Debug("context is done; removing message handler")
			c.removeHandler(messageHandler)
			return
		case msg := <-messageHandler.channel:
			// The Go language specification says that if one or more of
			// the communications in the select statement can proceed, a
			// single one that will proceed is chosen via a uniform
			// pseudo-random selection.
			// Thus, it can happen that this communication is chosen when
			// ctx is already done. Since we guarantee in the network
			// channel API that the handler is not called after ctx is
			// done (client code could e.g. perform some cleanup), we need
			// to double-check the context state here.
			if messageHandler.ctx.Err() != nil {
				continue
			}
			handleWithRetransmissions(msg)
		}
	}
}()
```

If the handler gets blocked for whatever reason (a full buffer, for example), `c.removeHandler(messageHandler)` will never get called, and the `messageHandler.channel` writer will have to start dropping messages, leading to the logs I observed.
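For context, the warning presumably comes from a non-blocking send on the producer side, along these lines (a sketch only, not the exact libp2p/channel.go code; the `channel` fields shown are hypothetical, while `messageHandler`, `logger`, and `net.Message` are the names from the snippet above):

```
// Sketch: delivering a message to all registered handlers. If a
// handler's buffered channel is full because its receiver stopped
// draining it, the default branch fires and the message is dropped.
func (c *channel) deliver(message net.Message) {
	c.messageHandlersMutex.Lock()
	defer c.messageHandlersMutex.Unlock()

	for _, handler := range c.messageHandlers {
		select {
		case handler.channel <- message:
			// Delivered into the handler's buffer.
		default:
			logger.Warn("message handler is too slow; dropping message")
		}
	}
}
```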

I opened #3420 with a proposal to fix the API, but the refactoring will require more thought to get right. We could potentially fix this problem sooner by splitting the handler goroutine into two: one piping messages between channels and one controlling the lifecycle of the handler, as sketched below.
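A minimal sketch of that split, reusing the identifiers from the snippet above (the actual fix may differ in details):

```
// Goroutine 1 controls the lifecycle of the handler. It only waits on
// the context, so even a blocked handler cannot delay deregistration.
go func() {
	<-ctx.Done()
	logger.Debug("context is done; removing message handler")
	c.removeHandler(messageHandler)
}()

// Goroutine 2 pipes messages between channels. It may still block in
// handleWithRetransmissions, but that no longer prevents the handler
// from being removed by the first goroutine.
go func() {
	for {
		select {
		case <-ctx.Done():
			return
		case msg := <-messageHandler.channel:
			// The handler must not be called after ctx is done.
			if messageHandler.ctx.Err() != nil {
				continue
			}
			handleWithRetransmissions(msg)
		}
	}
}()
```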


pdyraga commented Nov 28, 2022

One interesting observation is that, so far, this problem has always shown up right after the announcement phase started:

```
2022-11-28T15:23:10.079+0100 INFO keep-tbtc tbtc/signing_loop.go:211 [member:61] starting announcement phase for attempt [1] {"wallet": "0x048da8d0ff8bfbb5c59d1b87519457cf90438266594319c82833cac04348c3db860af9e6325debf91c44e3a35c201642c9433b86d5cf03039e2ed81776f128ab58", "message": "0x66a8773a610c219e571734ad2ad614608b5888b6602e5a1a4178283088f30196", "signingStartBlock": 148626, "signingTimeoutBlock": 148831}
2022-11-28T15:23:10.204+0100 WARN keep-libp2p libp2p/channel.go:357 message handler is too slow; dropping message
```

I think the problem lies in the size of the buffer used by the announcer:

```
messagesChan := make(chan net.Message, a.groupSize)
```

Before a message in which the announcer is not interested is dropped, it is buffered in this channel. This also includes retransmissions from previous signing protocols, because the retransmission cache filter is scoped to a single `Recv` handler:

```
handleWithRetransmissions := retransmission.WithRetransmissionSupport(handler)
```

```
func WithRetransmissionSupport(delegate func(m net.Message)) func(m net.Message) {
	mutex := &sync.Mutex{}
	cache := make(map[string]bool)
	// (...)
}
```
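To make the scoping concrete, here is a self-contained toy version of the filter (simplified message type and a hypothetical fingerprint; not the actual keep-core code). Every call to the constructor gets a fresh cache, so a newly registered handler sees, and must buffer, retransmissions that earlier handlers already filtered out:

```
package main

import (
	"fmt"
	"sync"
)

// message is a simplified stand-in for net.Message in this sketch.
type message struct {
	sender string
	seqno  uint64
}

// withRetransmissionSupport mirrors the shape of the real filter: each
// call creates a fresh cache, so deduplication is scoped to one handler.
func withRetransmissionSupport(delegate func(m message)) func(m message) {
	mutex := &sync.Mutex{}
	cache := make(map[string]bool)

	return func(m message) {
		// Hypothetical fingerprint: sender plus sequence number.
		key := fmt.Sprintf("%s-%d", m.sender, m.seqno)

		mutex.Lock()
		seen := cache[key]
		cache[key] = true
		mutex.Unlock()

		if !seen {
			delegate(m)
		}
	}
}

func main() {
	first := withRetransmissionSupport(func(m message) {
		fmt.Println("first handler delivered", m.seqno)
	})
	first(message{"peer-1", 1}) // delivered
	first(message{"peer-1", 1}) // retransmission, filtered out

	// A new Recv handler gets a new cache, so the very same
	// retransmission passes the filter again and occupies the new
	// handler's buffer.
	second := withRetransmissionSupport(func(m message) {
		fmt.Println("second handler delivered", m.seqno)
	})
	second(message{"peer-1", 1}) // delivered again
}
```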

@pdyraga added this to the v2.0.0-m2 milestone Nov 29, 2022
@pdyraga self-assigned this Nov 29, 2022
lukasz-zimnoch added a commit that referenced this issue Nov 29, 2022
Refs #3419


During the tests of the code orchestrating tECDSA signing in #3404, we
realized some buffers get full and the broadcast channel keeps writing
to them even though the receiver is no longer alive. This was fixed in
#3418 by introducing an additional context in the asynchronous state
machine that unregisters the handler if the state machine (receiver)
exits sooner than the context. That fixed the problem on the receiver
side, but we still need to fix it on the producer side, and that is what
this PR does.

A separate goroutine controls the lifecycle of the handler. The message
handler is removed from the channel once the context is done. This logic
is placed in a separate goroutine because the call to
`handleWithRetransmissions` is blocking, and we do not want to delay
removing the handler if that call blocks for a long time, especially
when the underlying buffered channel is full.
lukasz-zimnoch added a commit that referenced this issue Nov 29, 2022
Closes #3419
Depends on #3422

During the tests of the code orchestrating tECDSA signing in #3404, we
noticed the announcer's buffer sometimes gets full and messages are
dropped with the famous "message handler is too slow" warning. It turns
out that the problem lies in the size of the buffer used by the
announcer. Before a message in which the announcer is not interested is
dropped, it is buffered in this channel. This also includes
retransmissions from previous signing protocols, because the
retransmission cache filter is scoped to the given `Recv` handler:
```
func WithRetransmissionSupport(delegate func(m net.Message)) func(m net.Message) {
  mutex := &sync.Mutex{}
  cache := make(map[string]bool)
  // (...)
}
```

The announcer's buffer size has been increased to 512 elements, just
like the buffer of the asynchronous state machine, and the problem seems
to be gone based on local machine tests.
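For illustration, the change boils down to something like this (the constant name is hypothetical; the original allocation sized the channel by `a.groupSize`):

```
// Hypothetical constant; the commit describes the new size as 512,
// matching the buffer of the asynchronous state machine.
const announceReceiveBufferSize = 512

messagesChan := make(chan net.Message, announceReceiveBufferSize)
```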
@pdyraga removed their assignment Jan 25, 2023