Reduce overhead around DKG #3770

lukasz-zimnoch · 2024-02-05T10:48:11Z

The extension of the beta operators set made it clear that the current DKG process is resource-heavy and problematic for operators. The current retry scheme is inefficient, a huge amount of network traffic is generated, and logs are often clunky. This task aims to improve the situation.

Tasks

Give feedback

Upgrade Go and libp2p versions #3771

📟 client
Tweaks around libp2p pubsub seen messages cache #3773

📟 client
Single-shot DKG #3776

⛓️ solidity 📟 client
Tweak logs around network #3777

📟 client
Options

Refs: #3770 Closes: #3761 Here we upgrade all libp2p libraries to the recent versions. To make it possible, we were also forced to bump the Go version from 1.18 to 1.20. This is the minimum version supported by recent libp2p packages. I recommend reviewing commit by commit where specific changes are described in detail. Here is a brief summary of what has been done: ### Upgrade Go from 1.18 to 1.20 Upgrade of Go resulted in a need to: - Adjust the return type of the `slices.SortFunc` compare function we are using in one unit test. This is because the version of the `golang.org/x/exp` package had to be bumped up as well. The returned type of the compare function used in `slices.SortFunc` was changed from `bool` to `int` somewhere between (5a980c7) - Fix the `TestCoordinationExecutor_Coordinate` which started to be flaky due to a changed behavior of `ecdsa.GenerateKey`. [Since Go 1.20](golang/go#58637), the returned key no longer depends deterministically on the bytes read from the provided RNG, and may change between calls and/or between versions (2ed7179) - Fix the `TestWalletRegistry_getWalletByPublicKeyHash_NotFound` which used a dummy curve point. Since Go 1.19, such a behavior leads to a panic (50b6bd6) - Reformat code using the new `gofmt` version (3c2274e) - Adjust the Dockerfile (8e07451) - Bump `staticcheck` version used by CI and fix the new warnings about deprecated standard library functions by replacing them as recommended (a87eea3) ### Upgrade of libp2p libraries Upgrade of libp2p packages forced us to: - Adjust `go-libp2p-core` imports to be `go-libp/core` as this package was moved to the `go-libp2p` monorepo (95d60b8) - Adjust our `transport` and `authenticatedConnection` implementations to expose some additional functions required by libp2p interfaces (ac01765) - Set up our `transport` differently due to the changes around libp2p `Security` option (110fbb3, 6953b79)

Refs: #3770 Depends on: #3771 Recent libp2p versions (we started to use them in #3771) introduced a way to set the seen messages cache TTL and strategy. Here we leverage those settings to reduce the excessive message flooding effect that sometimes occurs on mainnet. This pull request consists of two steps ### Use longer TTL for pubsub seen messages cache Once a message is received and validated, pubsub re-broadcasts it to other peers and puts it into the seen messages cache. This way, subsequent arrivals of the same message are not re-broadcasted unnecessarily. This mechanism is important for the network to avoid excessive message flooding. The default value used by libp2p is 2 minutes. However, Keep client messaging sessions are quite time-consuming so, we use a longer TTL of 5 minutes to reduce flooding risk even further. Worth noting that this time cannot be too long as the cache may grow excessively and impact memory consumption. ### Use `LastSeen` as seen messages cache strategy By default, the libp2p seen messages cache uses the `FirstSeen` strategy which expires an entry once TTL elapses from when it was added. This means that if a single message is being received frequently and consistently, pubsub will re-broadcast it every TTL, rather than never re-broadcasting it. In the context of the Keep client which additionally uses app-level retransmissions, that often leads to a strong message amplification in the broadcast channel which causes a significant increase in the network load. As the problem is quite common (see libp2p/go-libp2p-pubsub#502), the libp2p team added a new `LastSeen` strategy which behaves differently. This strategy expires an entry once TTL elapses from the last time the message was touched by a cache write (`Add`) or read (`Has`) operation. That gives the desired behavior of never re-broadcasting a message that was already seen within the last TTL period. This reduces the risk of unintended over-amplification.

Refs: #3770 Depends on: #3775 The currently used DKG retry mechanism based on random exclusion turned out to be ineffective for a higher number of participating operators. Such retries have a very small chance of success and produce a lot of unnecessary network traffic that consumes bandwidth and CPU excessively. Here we aim to improve the situation. First, we are making DKG a single-shot process that fails fast if the result cannot be produced during the first attempt. Second, we are doubling down the announcement period to maximize participation chances for all selected operators, even those at the edge of the network. Last but not least, we are reducing the submission delay that is preserved between operators attempting to submit the final result on-chain. All those changes combined allow us to achieve shorter DKG iterations that can be timed out quicker. This way, we will be able to repeat DKG more often, with different operator sets. Last but not least, we are also changing the re-transmission strategy for the `resultSigningState` which was still using `StandardRetransmissionStrategy` with retransmissions occurring on each tick. All DKG states use the `BackoffRetransmissionStrategy` strategy which leads to a sparse distribution of retransmissions and is considered more lightweight. There is no point in making an exception for the `resultSigningState`. This should reduce network load in case one of the participants fails at the end of the protocol.

Closes: #3770 Here we are tweaking some logs, mostly related to the network layer. The goal here is to make debugging easier. We are trying to achieve that by adding additional logs that may help during that process and limiting the unuseful ones. I recommend reviewing commit by commit. A short summary of changes: - Set libp2p pubsub logs to `warn` by default and make that customizable using the `PUBSUB_LOG_LEVEL` environment variable (b725e1c) - Log members that are not ready during DKG/signing network announcement (551e57a) - Log distinct operators participating in DKG (980c85e) - Execute a ping test for freshly connected peers (b3d97ff)

lukasz-zimnoch self-assigned this Feb 5, 2024

lukasz-zimnoch added this to the v2.0.0-m7 milestone Feb 5, 2024

This was referenced Feb 6, 2024

Upgrade Go and libp2p versions #3771

Merged

Tweaks around libp2p pubsub seen messages cache #3773

Merged

Single-shot DKG #3776

Merged

lukasz-zimnoch mentioned this issue Feb 9, 2024

Tweak logs around network #3777

Merged

tomaszslabon closed this as completed in #3777 Feb 9, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Reduce overhead around DKG #3770

Reduce overhead around DKG #3770

lukasz-zimnoch commented Feb 5, 2024 •

edited

Tasks

Reduce overhead around DKG #3770

Reduce overhead around DKG #3770

Comments

lukasz-zimnoch commented Feb 5, 2024 • edited

Tasks

lukasz-zimnoch commented Feb 5, 2024 •

edited