
DKG saga, part 4: On-chain integration for TBTC DKG #3427

Merged · 52 commits merged into main · Jan 4, 2023
Conversation

pdyraga
Member

@pdyraga pdyraga commented Dec 2, 2022

Closes: #3368
Depends on #3428

This PR provides on-chain integration for all ethereum.TbtcChain functions. Functions based on a mocked contract implementation are replaced with real (TM) functions talking to the chain. Specific changes are described below.

Random Beacon test loops disabled

This PR disables the mock random beacon chain functions. The M3 client should talk to the real WalletRegistry contract, generate groups based on contract requests, and generate heartbeat signatures, so we no longer need artificial beacon activity for tests here.

Real implementation of the ethereum.TbtcChain functions

The ethereum.TbtcChain component is the Ethereum implementation of the tbtc.Chain interface that defines the expectations of a tBTC v2 node towards the host chain. So far, ethereum.TbtcChain was actually a mock that did not talk with real contracts but triggered some test actions based on in-memory state. This pull request changes that and provides real implementations for all ethereum.TbtcChain methods, which now talk with on-chain contracts like Bridge and WalletRegistry. That means the tBTC v2 node is now fully integrated with the chain.

Submission and validation of the DKG result

With the chain integration in place, the tBTC v2 node must handle DKG results according to the rules defined by the on-chain WalletRegistry contract. The node was adjusted to do the following:

  1. Specific DKG members form a submission queue according to their member index and try to submit the result to the chain.
  2. Once the result is submitted, all network nodes (even those not taking part in the given DKG) validate the submitted result using a view on-chain function.
  3. If the result is invalid, all nodes try to challenge it.
  4. If the result is valid, members who participated in the given DKG form an approval queue according to their member index, with respect to the submitter precedence period, and try to approve the submitted result once the challenge period ends.

Having that implemented means that the tBTC v2 nodes are able to answer wallet creation requests issued by the Bridge contract. A successful DKG result approval registers the given wallet in the Bridge and makes it ready to handle tBTC protocol actions.

It is worth noting that this PR does not handle the DKG retries that should happen upon a DKG result challenge. This scenario will be handled separately.

Temporary wallet heartbeats

This pull request also changes the existing signing test loop into a temporary heartbeat mechanism that ensures the on-chain wallets are live. The previous signing test loop asked the latest in-memory wallet to sign 10 distinct messages every 500 blocks. The new heartbeat mechanism asks the on-chain active wallet to sign 5 distinct messages every ~8 hours. Importantly, the signed messages follow a special heartbeat format that makes them valid from the fraud mechanism's viewpoint. This matters because wallets registered on-chain can be subject to fraud challenges: if they sign data that is invalid from the tBTC protocol's perspective, they are slashed.

Testing

To test changes presented in this PR, the following on-chain requirements must be met:

  • The Bridge's contract state must allow for new wallet creation
  • The WalletRegistry must use a mocked random beacon that will answer requests by calling back the right WalletRegistry function
  • The WalletRegistry must be authorized in the ReimbursementPool contract (necessary for successful DKG result approval)
  • It is also recommended to use lower values of DKG parameters, especially the challenge period length and the approval precedence period length; this leads to a shorter feedback cycle.

Each test scenario was triggered by calling Bridge's requestNewWallet function. Three off-chain clients were always involved. Here is a summary of the most important scenarios, which were checked multiple times and in different flavors:

  1. New wallet was requested, all 100 members produced a valid DKG result and the first member submitted it to the chain. The result was approved by the first member one block after the challenge period ended.
  2. New wallet was requested, all 100 members produced an invalid DKG result (due to a bug introduced on purpose) and the first member submitted it to the chain. The result was immediately deemed invalid and challenged by all clients.
  3. New wallet was requested, the first member went offline before the announcement phase (temporary hack in the code) and the other 99 members produced a valid result that was submitted by the second member. The result was approved by the second member one block after the approval precedence period ended.
  4. New wallet was requested and members failed to produce a DKG result (temporary hack in the code). They tried to do so until the DKG timeout block was hit.
  5. New wallet was requested and all 100 members produced a valid DKG result but the result could not be approved due to the on-chain contract state (WalletRegistry unauthorized from the ReimbursementPool). All members tried to approve by submitting an approval transaction that eventually failed.

@pdyraga pdyraga added this to the v2.0.0-m3 milestone Dec 2, 2022
@pdyraga pdyraga self-assigned this Dec 2, 2022
This is a preparation for integrating ECDSA DKG with on-chain contracts.
We no longer want the beacon to execute test functions and pollute
production logs. We can always re-enable it for development.
Introduced a function that converts `map[group.MemberIndex][]byte`
into two slices: `[]*big.Int` and `[]byte`. This format conversion is needed
for SubmitDkgResult (to be implemented in one of the next commits).
Introduced a function that converts `ecdsa.PublicKey` into a 64-byte-long
array. This format conversion is needed for SubmitDkgResult (to be implemented
in one of the next commits).
@pdyraga pdyraga changed the base branch from dkg-part-2 to dkg-part-2.5 December 2, 2022 15:04
@pdyraga pdyraga changed the title On-chain integration for TBTC DKG DKG saga, part 4: On-chain integration for TBTC DKG Dec 2, 2022
We need to know the identifiers of members selected for the signing group
during the DKG result submission. This change makes the IDs available
for storing and processing but does not yet use them. That will be done
in one of the next commits.
The function is partially implemented and DOES NOT WORK. It will be
finished in the next commits. Specifically, member IDs selected for the
group by the sortition pool are not passed.
Base automatically changed from dkg-part-2.5 to main December 7, 2022 09:14
Here we wire up the missing pieces, i.e. the IDs of operators selected
for DKG and the hash of operators who completed the protocol successfully.
Here we introduce a temporary heartbeat mechanism that will request a heartbeat
from the active wallet every ~8 hours. A single request consists of 5 messages
that must be signed sequentially by the target wallet. Each message has the
format of a valid heartbeat and is compatible with the fraud mechanism.

Additionally, this changeset attaches to the real Bridge contract and
takes the WalletRegistry address from it instead of taking it from the
config. Therefore, the latter option is now deprecated and should be removed
in the near future.
Here we add the outstanding parts required to finalize the DKG after
the result is submitted on-chain.
The result submission logic should use a block delay and have a cancel signal
based either on the DKG timeout or on the DKG submission event. This changeset
aligns the existing code with those requirements.
@lukasz-zimnoch lukasz-zimnoch marked this pull request as ready for review December 12, 2022 13:01
Member Author

@pdyraga pdyraga left a comment

I went through all files but did no testing. I have no comments beyond those posted so far.

pkg/chain/ethereum/tbtc.go (outdated)
pkg/chain/ethereum/tbtc.go (outdated)
pkg/tbtc/dkg_test.go (outdated)
pkg/tbtc/dkg_submit_test.go
pkg/tbtc/dkg_submit.go
@@ -125,9 +136,36 @@ func Initialize(
}()
Member Author

There is one more thing that occurred to me: at line 112 in this file (not modified in this PR) we have:

_ = chain.OnDKGStarted(func(event *DKGStartedEvent) {
  go func() {
    if ok := deduplicator.notifyDKGStarted(
      event.Seed,
    ); !ok {
      // (...)
      return
    }

    node.joinDKGIfEligible(
      event.Seed,
      event.BlockNumber,
    )
  }()
})

I am pretty sure this will be problematic on mainnet as-is because event.BlockNumber is part of the signature:

// `\x19Ethereum signed message:\n${keccak256(
// groupPubKey, misbehavedMembersIndices, dkgStartBlock
// )}`
bytes signatures;

We will have small chain reorganizations on mainnet, resulting in DKGStartedEvents emitted one after another with different startBlock values. They will be deduplicated by deduplicator.notifyDKGStarted, but the startBlock matters later for the on-chain result submission.

What I would suggest doing here is to wait for 20 blocks (this feels like a safe enough finalization window for Ethereum), then take the last event via GetPastDkgStartedEvent, and trigger DKG based on the values from that event.

Member

Yeah, good point. Just a couple of questions:

  • Are we sure GetPastDkgStartedEvent will return the same event for everyone? I can imagine a case where not all members receive the same event due to ETH node re-connections or slight state differences.
  • What about deduplication? Do we want to keep the current logic? Do we want to compare the original event with the one fetched via GetPastDkgStartedEvent?

Member Author

Are we sure GetPastDkgStartedEvent will return the same event for everyone? I can imagine a case where not all members receive the same event due to ETH node re-connections or slight state differences

If we wait for 20 blocks, then I am pretty sure the answer is yes. Under the hood, this function just fetches filtered logs from the contract. It is a simple call to the Ethereum node and does not maintain any state. If the smart contract's state has enough confirmations, then all nodes should receive the same result.

What about deduplication? Do we want to keep the current logic? Do we want to compare the original event with the one fetched via GetPastDkgStartedEvent?

I think we need to deduplicate so that we do not start multiple monitoring loops during the 20 confirmations. I don't think we need to compare events; we should just accept that the event can change and use the one fetched after the confirmations.

In my mind, this works as:

  • I know DKG is probably going to happen, so I start just one (!) monitoring loop.
  • I know the information I have right now may change; I need to wait 20 blocks to see the final version.
  • After 20 blocks, I confirm that DKG is in progress by checking the event. I take that event as the final source of truth.

Member

Ok, I'll rework that.

Member

I spent some time on it and I'm afraid this is not so simple. If we wait for 20 confirmation blocks, we start the actual off-chain protocol with a delay that breaks a lot of things, because we use the original block number from the event as the starting point while the actual block is more than 20 blocks in the future. This PR is already a ball of mud, so I think we should address that problem separately. I created issue #3456 to capture it.

Member Author

I don't think we should use the original block number from the event. Agreed, let's address it separately given the size of the PR. I'll leave this conversation unresolved for now so that we can reference it.

The DKG result validation logic used so far was vulnerable to malicious result
re-submission and chain reorgs. Here we improve it to be resistant to both
problems. First, we improve the `notifyDKGResultSubmitted` method of the
event deduplicator to consider the DKG seed and the submission block while
computing the cache key. This way, the same malicious result will not be
considered a duplicate when it arrives at a different block (and sometimes
with a different seed), so it will be validated and challenged by nodes.
Second, each node will confirm the DKG state after submitting a DKG result
challenge. This way, nodes will be able to re-submit the challenge if the
previous one has not changed the DKG state (for example, due to a chain reorg
that wiped out the block containing the previous challenge transaction).
Higher values will allow individual members to bump the transaction's gas price
2 times.
The current approach is that the client defines what is supported off-chain and
confirms that the chain does not ask for something else. The requirements about
the group size or threshold do not come from the chain.
Contributor

@tomaszslabon tomaszslabon left a comment

@pdyraga @lukasz-zimnoch
I tested a network of three clients and one bootstrap node. The DKG result was submitted successfully:

2023-01-02T18:47:12.447+0100    INFO    keep-tbtc       tbtc/dkg.go:279 [member:63] DKG result with group public key [0xc0162fd817b7cbb9e4130bec8d69b7830c2610502e5c8e845bca0d1ab37fe6ac9d3e4972586aaa5de424d52505bbba347466a3078a9cb9fe493839a2f2a34d1b] and result hash [0x0ff2b9f501d2d81e608beb1645e1de1abed567dc24895fbf34b0e03982622613] submitted at block [1704] by member [1] {"seed": "0xbed8cada2f7e8cbfaae4b68557faab80ca3f6442532c6540ed181160fdb68dc5"}

Note that I used a simple RandomBeaconChaosnet contract that calls WalletRegistry's callback without any gas limit.

dkgResultApprovalHandlersMutex sync.Mutex
dkgResultApprovalHandlers map[int]func(submission *DKGResultApprovedEvent)

dkgResultApprovalGuard func() bool
Member Author

I would consider renaming it to dkgResultApprovalRejectFn or dkgResultApprovalFailFn and explaining in the comment that it is used in tests to programmatically fail the execution of ApproveDKGResult. It was not clear to me at first what the guard does: delay the call or not allow it at all, and if the latter, why.

// Reject the first approval (with index 0) that will be made by
// member 1 (the submitter) in order to force the member 2 to
// approve after the precedence period.
rejectedApprovalsIndexes: []int{0},
Member Author

Wouldn't it be more natural to have it as failedApprovalsCount or rejectedApprovalsCount? This is how it works later in the test: we increment approvalIndex, which is actually an approval attempt count (I'd rename it as well).

For this test, we always fail the first n attempts of the approval. For example, it's not possible to have rejectedApprovalsIndexes: []int{0, 4} here.

if err != nil {
return fmt.Errorf("cannot assemble DKG chain result [%w]", err)
}

Member Author

Something that has just occurred to me: we could validate the result with a call to the EcdsaDkgValidator contract before submitting it. This way, we can eliminate the risk of slashing the operator in case of a bug in the client. I opened a separate issue to capture this work: #3461.

); !ok {
logger.Warnf(
"Result with hash [0x%x] for DKG with seed [0x%x] "+
"and starting block [%v] has been already processed",
Member Author

This is not the starting block, right? This is the block at which the DKGResultSubmitted event was emitted. The DKG start block is currently the block at which DkgStarted was emitted, and it may change in the future given #3456.

Should this log say submit block instead?


// notifyDKGResultSubmitted notifies the client wants to start some actions
// upon the DKG result submission. It returns boolean indicating whether the
// client should proceed with the actions or ignore the event as a duplicate.
Member Author

I think the current deduplication logic makes a lot of sense given the challenges, but we should make it crystal clear in the docs in case future us ever thinks about removing the result block from the hash. Something like:

// (...)
//
// IMPORTANT: This function considers two events with the same seed and hash as
// separate ones if they do not have the same block number. This is important in
// the context of a possible challenge - we never know how soon the next result
// is submitted after the challenge. In extreme circumstances, it may even be
// the same block as the last challenge.

Comment on lines +583 to +585
dkgLogger.Infof(
"invalid DKG result still not challenged; retrying",
)
Member Author

This log may not always be true. We could have another result submitted in the meantime and challenged. I think we should remove this log from here and move it to the beginning of the loop, after we check that the result we are challenging is the last one submitted.

Captured it in a separate issue: #3462


return
err = de.chain.ChallengeDKGResult(result)
Member Author

In case we run this loop one more time, we do not know if the result is still the one most recently submitted to the chain. It could happen that the challenge was mined and another result was submitted later. Before we execute this loop again, we should check the result, probably reading it from the last event.

Captured it in a separate issue: #3462

@pdyraga
Member Author

pdyraga commented Jan 4, 2023

I did just a few initial commits in this PR; all the heavy work was handled by @lukasz-zimnoch, who is the actual author of the code in this PR. I reviewed @lukasz-zimnoch's code and it looks good. This PR is huge, so we decided to open separate issues to track the improvements.

I did not test this code manually but @lukasz-zimnoch performed a lot of tests, all listed in the PR description and @tomaszslabon confirmed the happy path works as well.

Given that the tests were performed, that I am not the actual author of this code (just a reviewer), and that there is approval from another reviewer, @tomaszslabon, I am 🚢 ing this stuff to main and we'll work on improvements separately.

Successfully merging this pull request may close these issues: Full ECDSA DKG integration.