Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reduce overhead around DKG #3770

Closed
4 tasks done
lukasz-zimnoch opened this issue Feb 5, 2024 · 0 comments · Fixed by #3777
Closed
4 tasks done

Reduce overhead around DKG #3770

lukasz-zimnoch opened this issue Feb 5, 2024 · 0 comments · Fixed by #3777
Assignees
Milestone

Comments

@lukasz-zimnoch
Copy link
Member

lukasz-zimnoch commented Feb 5, 2024

The extension of the beta operators set made it clear that the current DKG process is resource-heavy and problematic for operators. The current retry scheme is inefficient, a huge amount of network traffic is generated, and logs are often clunky. This task aims to improve the situation.

Tasks

  1. 📟 client
    lukasz-zimnoch
  2. 📟 client
    lukasz-zimnoch
  3. ⛓️ solidity 📟 client
    lukasz-zimnoch
  4. 📟 client
    lukasz-zimnoch
@lukasz-zimnoch lukasz-zimnoch self-assigned this Feb 5, 2024
@lukasz-zimnoch lukasz-zimnoch added this to the v2.0.0-m7 milestone Feb 5, 2024
tomaszslabon added a commit that referenced this issue Feb 8, 2024
Refs: #3770
Closes: #3761

Here we upgrade all libp2p libraries to the recent versions. To make it
possible, we were also forced to bump the Go version from 1.18 to 1.20.
This is the minimum version supported by recent libp2p packages.

I recommend reviewing commit by commit where specific changes are
described in detail. Here is a brief summary of what has been done:

### Upgrade Go from 1.18 to 1.20

Upgrade of Go resulted in a need to:
- Adjust the return type of the `slices.SortFunc` compare function we
are using in one unit test. This is because the version of the
`golang.org/x/exp` package had to be bumped up as well. The returned
type of the compare function used in `slices.SortFunc` was changed from
`bool` to `int` somewhere between
(5a980c7)
- Fix the `TestCoordinationExecutor_Coordinate` which started to be
flaky due to a changed behavior of `ecdsa.GenerateKey`. [Since Go
1.20](golang/go#58637), the returned key no
longer depends deterministically on the bytes read from the provided
RNG, and may change between calls and/or between versions
(2ed7179)
- Fix the `TestWalletRegistry_getWalletByPublicKeyHash_NotFound` which
used a dummy curve point. Since Go 1.19, such a behavior leads to a
panic
(50b6bd6)
- Reformat code using the new `gofmt` version
(3c2274e)
- Adjust the Dockerfile
(8e07451)
- Bump `staticcheck` version used by CI and fix the new warnings about
deprecated standard library functions by replacing them as recommended
(a87eea3)

### Upgrade of libp2p libraries

Upgrade of libp2p packages forced us to:
- Adjust `go-libp2p-core` imports to be `go-libp/core` as this package
was moved to the `go-libp2p` monorepo
(95d60b8)
- Adjust our `transport` and `authenticatedConnection` implementations
to expose some additional functions required by libp2p interfaces
(ac01765)
- Set up our `transport` differently due to the changes around libp2p
`Security` option
(110fbb3,
6953b79)
tomaszslabon added a commit that referenced this issue Feb 8, 2024
Refs: #3770
Depends on: #3771

Recent libp2p versions (we started to use them in
#3771) introduced a way to
set the seen messages cache TTL and strategy. Here we leverage those
settings to reduce the excessive message flooding effect that sometimes
occurs on mainnet. This pull request consists of two steps

### Use longer TTL for pubsub seen messages cache

Once a message is received and validated, pubsub re-broadcasts it to
other peers and puts it into the seen messages cache. This way,
subsequent arrivals of the same message are not re-broadcasted
unnecessarily. This mechanism is important for the network to avoid
excessive message flooding. The default value used by libp2p is 2
minutes. However, Keep client messaging sessions are quite
time-consuming so, we use a longer TTL of 5 minutes to reduce flooding
risk even further. Worth noting that this time cannot be too long as the
cache may grow excessively and impact memory consumption.

### Use `LastSeen` as seen messages cache strategy

By default, the libp2p seen messages cache uses the `FirstSeen` strategy
which expires an entry once TTL elapses from when it was added. This
means that if a single message is being received frequently and
consistently, pubsub will re-broadcast it every TTL, rather than never
re-broadcasting it.

In the context of the Keep client which additionally uses app-level
retransmissions, that often leads to a strong message amplification in
the broadcast channel which causes a significant increase in the network
load.

As the problem is quite common (see
libp2p/go-libp2p-pubsub#502), the libp2p team
added a new `LastSeen` strategy which behaves differently. This strategy
expires an entry once TTL elapses from the last time the message was
touched by a cache write (`Add`) or read (`Has`) operation. That gives
the desired behavior of never re-broadcasting a message that was already
seen within the last TTL period. This reduces the risk of unintended
over-amplification.
tomaszslabon added a commit that referenced this issue Feb 8, 2024
Refs: #3770
Depends on: #3775

The currently used DKG retry mechanism based on random exclusion turned
out to be ineffective for a higher number of participating operators.
Such retries have a very small chance of success and produce a lot of
unnecessary network traffic that consumes bandwidth and CPU excessively.

Here we aim to improve the situation. First, we are making DKG a
single-shot process that fails fast if the result cannot be produced
during the first attempt. Second, we are doubling down the announcement
period to maximize participation chances for all selected operators,
even those at the edge of the network. Last but not least, we are
reducing the submission delay that is preserved between operators
attempting to submit the final result on-chain.

All those changes combined allow us to achieve shorter DKG iterations
that can be timed out quicker. This way, we will be able to repeat DKG
more often, with different operator sets.

Last but not least, we are also changing the re-transmission strategy
for the `resultSigningState` which was still using
`StandardRetransmissionStrategy` with retransmissions occurring on each
tick. All DKG states use the `BackoffRetransmissionStrategy` strategy
which leads to a sparse distribution of retransmissions and is
considered more lightweight. There is no point in making an exception
for the `resultSigningState`. This should reduce network load in case
one of the participants fails at the end of the protocol.
tomaszslabon added a commit that referenced this issue Feb 9, 2024
Closes: #3770

Here we are tweaking some logs, mostly related to the network layer. The
goal here is to make debugging easier. We are trying to achieve that by
adding additional logs that may help during that process and limiting
the unuseful ones. I recommend reviewing commit by commit. A short
summary of changes:

- Set libp2p pubsub logs to `warn` by default and make that customizable
using the `PUBSUB_LOG_LEVEL` environment variable
(b725e1c)
- Log members that are not ready during DKG/signing network announcement
(551e57a)
- Log distinct operators participating in DKG
(980c85e)
- Execute a ping test for freshly connected peers
(b3d97ff)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

1 participant