Khalil/5844 access node fallbacks #1395
- update CLI for LN/SN nodes to remove access-address, and secure-access-node-api flags
- add --access-node-ids flag
- update unit tests
- add fallback logic to retry logic of Broadcast and Vote

- update local net flags for LN/SN nodes
- update default access node count to 2
- put flow client config prep in common function

- update integration tests flags and container configs (from #1323)
Codecov Report
@@ Coverage Diff @@
## master #1395 +/- ##
==========================================
- Coverage 55.41% 55.39% -0.03%
==========================================
Files 512 512
Lines 31942 31972 +30
==========================================
+ Hits 17700 17710 +10
- Misses 11862 11878 +16
- Partials 2380 2384 +4
cmd/collection/main.go
Outdated
return fmt.Errorf("missing required flag --access-address")
Module("sdk client connection options", func(builder cmd.NodeBuilder, node *cmd.NodeConfig) error {
if len(accessNodeIDS) < common.DefaultAccessNodeIDSMinimum {
return fmt.Errorf("invalid flag --access-node-ids atleast %x IDs must be provided", common.DefaultAccessNodeIDSMinimum)
Suggested change:
- return fmt.Errorf("invalid flag --access-node-ids atleast %x IDs must be provided", common.DefaultAccessNodeIDSMinimum)
+ return fmt.Errorf("invalid flag --access-node-ids atleast %d IDs must be provided", common.DefaultAccessNodeIDSMinimum)
cmd/consensus/main.go
Outdated
return fmt.Errorf("missing required flag --access-address")
Module("sdk client connection options", func(builder cmd.NodeBuilder, node *cmd.NodeConfig) error {
if len(accessNodeIDS) < common.DefaultAccessNodeIDSMinimum {
return fmt.Errorf("invalid flag --access-node-ids atleast %x IDs must be provided", common.DefaultAccessNodeIDSMinimum)
Suggested change:
- return fmt.Errorf("invalid flag --access-node-ids atleast %x IDs must be provided", common.DefaultAccessNodeIDSMinimum)
+ return fmt.Errorf("invalid flag --access-node-ids atleast %d IDs must be provided", common.DefaultAccessNodeIDSMinimum)
cmd/util/cmd/common/flow_client.go
Outdated
for _, anID := range accessNodeIDS {
id, err := flow.HexStringToIdentifier(anID)
if err != nil {
return nil, fmt.Errorf("could not get flow identifer from secured access node id: %s", id)
Suggested change:
- return nil, fmt.Errorf("could not get flow identifer from secured access node id: %s", id)
+ return nil, fmt.Errorf("could not get flow identifer from secured access node id (%s): %w", id, err)
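As a rough illustration of why wrapping with %w matters here, the sketch below keeps the underlying parse error reachable for callers. It is not the PR's code: parseNodeID is a hypothetical stand-in for flow.HexStringToIdentifier that only hex-decodes its input, and causedByInvalidHex is an invented helper. The sketch also prints the raw input string in the message, on the assumption that the parsed value is not meaningful when parsing fails.

```go
package main

import (
	"encoding/hex"
	"errors"
	"fmt"
)

// parseNodeID is a hypothetical stand-in for flow.HexStringToIdentifier.
// It only checks that the input is valid hex.
func parseNodeID(raw string) ([]byte, error) {
	id, err := hex.DecodeString(raw)
	if err != nil {
		// %w keeps the original error in the chain instead of flattening it to text.
		return nil, fmt.Errorf("could not parse access node id (%s): %w", raw, err)
	}
	return id, nil
}

// causedByInvalidHex reports whether err wraps a hex.InvalidByteError.
func causedByInvalidHex(err error) bool {
	var invalid hex.InvalidByteError
	return errors.As(err, &invalid)
}

func main() {
	_, err := parseNodeID("zz")
	fmt.Println(err)
	fmt.Println(causedByInvalidHex(err))
}
```

With %s or %v in place of %w, the message would look the same but errors.As and errors.Is would no longer find the underlying error.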
cmd/util/cmd/common/flow_client.go
Outdated

identities, err := snapshot.Identities(filter.HasNodeID(anIDS...))
if err != nil {
return nil, fmt.Errorf("failed get identities access node IDs from snapshot: %v", accessNodeIDS)
Suggested change:
- return nil, fmt.Errorf("failed get identities access node IDs from snapshot: %v", accessNodeIDS)
+ return nil, fmt.Errorf("failed get identities access node identities (ids=%v) from snapshot: %w", accessNodeIDS, err)
cmd/util/cmd/common/flow_client.go
Outdated

// make sure we have identities for all the access node IDs provided
if len(identities) != len(accessNodeIDS) {
return nil, fmt.Errorf("failed to get identity for all the access node IDs provided: %v, got %s", accessNodeIDS, identities)
Suggested change:
- return nil, fmt.Errorf("failed to get identity for all the access node IDs provided: %v, got %s", accessNodeIDS, identities)
+ return nil, fmt.Errorf("failed to get identity for all the access node IDs provided: %v, got %v", accessNodeIDS, identities.NodeIDs())
so the expected and actual values are in the same format
integration/testnet/network.go
Outdated

// string slice argument for LN/SN node --access-node-ids
// remove extra comma at the end of string
anIDS := accessNodeIDS.String()[:accessNodeIDS.Len()-1]
Can use strings.Join to make this a bit cleaner
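A minimal sketch of that suggestion, assuming the IDs are first collected into a []string rather than written into a builder with trailing commas. joinAccessNodeIDs is a made-up helper name and the IDs are placeholders, neither comes from the PR.

```go
package main

import (
	"fmt"
	"strings"
)

// joinAccessNodeIDs builds the comma-separated value for --access-node-ids.
// strings.Join puts the separator only between elements, so there is no
// trailing comma to trim afterwards.
func joinAccessNodeIDs(ids []string) string {
	return strings.Join(ids, ",")
}

func main() {
	fmt.Println(joinAccessNodeIDs([]string{"aa11", "bb22"})) // aa11,bb22
}
```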
module/dkg/broker.go
Outdated
if err != nil {
b.log.Error().Err(err).Msgf("error broadcasting, retrying (%x)", attempts)

// retry with next fallback client every 2 attempts
Suggested change:
- // retry with next fallback client every 2 attempts
+ // retry with next fallback client after 2 failed attempts
committee: committee,
me: me,
myIndex: myIndex,
dkgContractClients: dkgContractClients,
Writing down some more detail of my suggestion from our call yesterday for sharing the fallback logic between the QCVoter and DKGBroker and keeping the details of the logic abstracted from QCVoter/DKGBroker.
type FallbackStrategy interface {
    // return index of client to use
    ClientIndex() int
    // indicate successful use of client at index
    Success(int)
    // indicate failure to use client at index
    Failure(int)
}
// broker.go
type Broker struct {
    fallbackStrategy FallbackStrategy
    clients []DKGContractClient
}
// ... when using a client
clientIndex := fallbackStrategy.ClientIndex()
err := clients[clientIndex].CallSomeFunction()
if err != nil {
    fallbackStrategy.Failure(clientIndex)
} else {
    fallbackStrategy.Success(clientIndex)
}
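One possible implementation of that interface, as a sketch only. roundRobinFallback and its fields are hypothetical names, and the policy (move to the next client after a fixed number of consecutive failures, wrapping around) is an assumption chosen to match the "after 2 failed attempts" behaviour discussed in this PR. It is not safe for concurrent use as written.

```go
package main

import "fmt"

// FallbackStrategy mirrors the interface proposed in the comment above.
type FallbackStrategy interface {
	ClientIndex() int
	Success(int)
	Failure(int)
}

// roundRobinFallback is a hypothetical implementation: it switches to the
// next client after maxFailures consecutive failures on the active one.
type roundRobinFallback struct {
	numClients  int
	maxFailures int
	active      int
	failures    int
}

// ClientIndex returns the index of the client to use next.
func (r *roundRobinFallback) ClientIndex() int { return r.active }

// Success resets the failure counter for the active client.
func (r *roundRobinFallback) Success(i int) {
	if i == r.active {
		r.failures = 0
	}
}

// Failure counts a failed attempt and rotates once the limit is reached.
func (r *roundRobinFallback) Failure(i int) {
	if i != r.active {
		return // stale report for a client we already moved away from
	}
	r.failures++
	if r.failures >= r.maxFailures {
		r.active = (r.active + 1) % r.numClients
		r.failures = 0
	}
}

func main() {
	var s FallbackStrategy = &roundRobinFallback{numClients: 2, maxFailures: 2}
	s.Failure(s.ClientIndex())
	s.Failure(s.ClientIndex())
	fmt.Println(s.ClientIndex()) // 1
}
```

Keeping the counter inside the strategy means QCVoter and DKGBroker would only need to report outcomes, which is the abstraction the comment is asking for.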
This is probably an (excellent) idea that deserves its own issue.
me: me,
signer: signer,
state: state,
qcContractClients: contractClients,
can we initialize activeQCContractClient to 0 here, just so it is explicit and the reader doesn't have to know about Go's initialization rules to reason about the behaviour. (same for Broker)
module/epochs/qc_voter.go
Outdated

// retry with next fallback client every 2 attempts
if attempts%2 == 0 {
log.Info().Msgf("retrying on attempt (%x) with fallback access node", attempts)
Suggested change:
- log.Info().Msgf("retrying on attempt (%x) with fallback access node", attempts)
+ log.Info().Msgf("retrying on attempt (%d) with fallback access node", attempts)
%x is hex
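For reference, a minimal sketch of the difference between the two verbs. The helper names hexAttempt and decAttempt are made up for illustration and are not part of the PR.

```go
package main

import "fmt"

// hexAttempt formats the attempt counter with %x (base 16).
func hexAttempt(attempts int) string {
	return fmt.Sprintf("attempt (%x)", attempts)
}

// decAttempt formats the attempt counter with %d (base 10).
func decAttempt(attempts int) string {
	return fmt.Sprintf("attempt (%d)", attempts)
}

func main() {
	fmt.Println(hexAttempt(10)) // attempt (a)
	fmt.Println(decAttempt(10)) // attempt (10)
}
```

The two only diverge from the tenth attempt onward, which is why this is easy to miss in short test runs.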
a24564d to 492c35c
we always used the public key from the first AN, making any fallback attempts useless (wrong key)
cmd/util/cmd/common/flow_client.go
Outdated
)

// SecureFlowClient creates a flow client with secured GRPC connection
func SecureFlowClient(accessAddress, accessApiNodePubKey string) (*client.Client, error) {
type FlowClientOpt struct {
I would call this FlowClientConfig since they are configs instead of options
cmd/util/cmd/common/flow_client.go
Outdated
@@ -47,3 +79,61 @@ func InsecureFlowClient(accessAddress string) (*client.Client, error) {

return flowClient, nil
}

// PrepareFlowClientOpts will assemble connection options for the flow client for each access node id
func PrepareFlowClientOpts(accessNodeIDS []string, insecureAccessAPI bool, snapshot protocol.Snapshot) ([]*FlowClientOpt, error) {
Suggested change:
- func PrepareFlowClientOpts(accessNodeIDS []string, insecureAccessAPI bool, snapshot protocol.Snapshot) ([]*FlowClientOpt, error) {
+ func FlowClientConfigs(accessNodeIDS []string, insecureAccessAPI bool, snapshot protocol.Snapshot) ([]*FlowClientConfig, error) {
cmd/util/cmd/common/flow_client.go
Outdated

// ConvertAccessAddrFromState takes raw network address from the protocol state in the for of [DNS/IP]:PORT, removes the port and applies the appropriate
// port number depending on the insecureAccessAPI arg.
func ConvertAccessAddrFromState(address string, insecureAccessAPI bool) string {
Suggested change:
- func ConvertAccessAddrFromState(address string, insecureAccessAPI bool) string {
+ func convertAccessAddrFromState(address string, insecureAccessAPI bool) string {
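For readers unfamiliar with what this helper does, here is a rough sketch of the port swap described in its doc comment. It is not the PR's implementation: the function name convertAccessAddr, the two port constants, and the example host are placeholders chosen for illustration, and unlike the function under review this version returns an error for malformed input.

```go
package main

import (
	"fmt"
	"net"
)

// Placeholder port values, assumed for this example only.
const (
	insecureAPIPort = "9000"
	secureAPIPort   = "9001"
)

// convertAccessAddr takes an address of the form [DNS/IP]:PORT, drops the
// port, and applies the API port matching the insecure flag.
func convertAccessAddr(address string, insecure bool) (string, error) {
	host, _, err := net.SplitHostPort(address)
	if err != nil {
		return "", fmt.Errorf("invalid access node address %q: %w", address, err)
	}
	port := secureAPIPort
	if insecure {
		port = insecureAPIPort
	}
	return net.JoinHostPort(host, port), nil
}

func main() {
	addr, err := convertAccessAddr("access-1.example.org:3569", true)
	fmt.Println(addr, err)
}
```

Using net.SplitHostPort rather than splitting on ":" by hand also handles bracketed IPv6 literals.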
left a couple of nits but otherwise lgtm
Thanks a bunch! This is probably close to done, I left a bunch of mostly-nits and will do a second pass soon.
cmd/collection/main.go
Outdated
flags.BoolVar(&insecureAccessAPI, "insecure-access-api", true, "required if insecure GRPC connection should be used")
flags.StringSliceVar(&accessNodeIDS, "access-node-ids", []string{}, "array of access node ID's sorted in priority order where the first ID in this array will get the first connection attempt and each subsequent ID after serves as a fallback. minimum length 2")
Could this refer to DefaultAccessNodeIDSMinimum, if there's no dependency cycle?
cmd/util/cmd/common/flow_client.go
Outdated

// make sure we have identities for all the access node IDs provided
if len(identities) != len(accessNodeIDS) {
return nil, fmt.Errorf("failed to get identity for all the access node IDs provided: %v, got %v", accessNodeIDS, identities.NodeIDs())
Suggested change:
- return nil, fmt.Errorf("failed to get identity for all the access node IDs provided: %v, got %v", accessNodeIDS, identities.NodeIDs())
+ return nil, fmt.Errorf("failed to get %v distinct identities for all the access node IDs provided: %v, found %v", len(accessNodeIDs), accessNodeIDS, identities.NodeIDs())
Nit: the user may repeat a single access node ID parameter in the arguments to the client, in the case where they only want to contact one (their) access node, thereby bypassing the fallback.
One way to solve this is to allow just one access node as a parameter, and if the overall issue was any less urgent, I'd actually suggest it.
But in so far as we won't come back to that, let's at least help with an error message that takes the repetition into account here.
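If a more specific message were wanted, the repeated IDs could be detected up front. The following is only a sketch of that idea: duplicateIDs is an invented helper and the IDs are placeholders, none of it is part of the PR.

```go
package main

import "fmt"

// duplicateIDs returns each ID that occurs more than once in ids,
// listed once, in the order the first repeat is seen.
func duplicateIDs(ids []string) []string {
	seen := make(map[string]int, len(ids))
	var dups []string
	for _, id := range ids {
		seen[id]++
		if seen[id] == 2 {
			dups = append(dups, id)
		}
	}
	return dups
}

func main() {
	ids := []string{"an1", "an2", "an1"}
	if dups := duplicateIDs(ids); len(dups) > 0 {
		fmt.Printf("--access-node-ids contains repeated IDs: %v\n", dups)
	}
}
```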
integration/localnet/bootstrap.go
Outdated
@@ -251,7 +254,7 @@ type Build struct {
Target string
}

func prepareServices(containers []testnet.ContainerConfig, secureAccessNodeID string) Services {
func prepareServices(containers []testnet.ContainerConfig, secureAccessNodeIDS string) Services {
Nit: if secure ANs have become an irrelevant concept, let's remove the adjective from those parameter names as well.
integration/testnet/network.go
Outdated
accessNodeIDS = append(accessNodeIDS, n.NodeID.String())
}
}
require.True(t, len(accessNodeIDS) > 1, "at-least 2 access node that is not a ghost must be configured for test suite")
nit
Suggested change:
- require.True(t, len(accessNodeIDS) > 1, "at-least 2 access node that is not a ghost must be configured for test suite")
+ require.True(t, len(accessNodeIDS) > 1, "at-least 2 access node that is not a ghost must be configured for test suite")
// retry with next fallback client after 2 failed attempts
if attempts%2 == 0 {
b.log.Warn().Msgf("broadcast: retrying on attempt (%x) with fallback access node", attempts)
b.updateActiveDKGContractClient()
Here I would actually call the update first, and then print the activeQCContractClient in the message, so that the engineer trying to debug this from logs will know which node is attempted when.
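A small sketch of the ordering being asked for, under stated assumptions: broker here is a stripped-down stand-in with invented fields, onFailedAttempt is a hypothetical helper, and the message is returned as a string instead of going through the real logger.

```go
package main

import "fmt"

// broker is a hypothetical, stripped-down stand-in for the DKG broker.
type broker struct {
	clients []string // access node IDs in priority order
	active  int      // index of the client currently in use
}

// updateActiveClient advances to the next client, wrapping around.
func (b *broker) updateActiveClient() {
	b.active = (b.active + 1) % len(b.clients)
}

// onFailedAttempt switches client on every second failed attempt. The
// message is built after the switch, so it names the client that the next
// attempt will actually use. It returns "" when no switch happens.
func (b *broker) onFailedAttempt(attempts int) string {
	if attempts%2 != 0 {
		return ""
	}
	b.updateActiveClient()
	return fmt.Sprintf("retrying on attempt (%d) with fallback access node %s (index %d)",
		attempts, b.clients[b.active], b.active)
}

func main() {
	b := &broker{clients: []string{"an1", "an2"}}
	for attempts := 1; attempts <= 4; attempts++ {
		if msg := b.onFailedAttempt(attempts); msg != "" {
			fmt.Println(msg)
		}
	}
}
```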
// retry with next fallback client after 2 failed attempts
if attempts%2 == 0 {
b.log.Warn().Msgf("submit result: retrying on attempt (%d) with fallback access node", attempts)
b.updateActiveDKGContractClient()
Same as above: here I would actually call the update first, and then print the activeQCContractClient in the message, so that the engineer trying to debug this from logs will know which node is attempted when.
Nit: missed the change from d8c0995 here
// retry with next fallback client after 2 failed attempts
if attempts%2 == 0 {
log.Warn().Msgf("retrying on attempt (%d) with fallback access node", attempts)
voter.updateActiveQcContractClient()
Same as elsewhere: I'd print the activeQCContractClient.
cmd/consensus/main.go
Outdated
flags.BoolVar(&insecureAccessAPI, "insecure-access-api", true, "required if insecure GRPC connection should be used")
flags.StringSliceVar(&accessNodeIDS, "access-node-ids", []string{}, "array of access node ID's sorted in priority order where the first ID in this array will get the first connection attempt and each subsequent ID after serves as a fallback. minimum length 2")
Same comment: it would be great if this could use the constant (provided it introduces no dependency cycles)
for i, identity := range identities {
accessAddress := ConvertAccessAddrFromState(identity.Address, insecureAccessAPI)

// remove the 0x prefix from network public keys
Consider using TrimPrefix instead, as you'll be removing key information and getting it reported as "invalid" or "wrong length" later down the line if the initial characters aren't what you expect.
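To illustrate the concern, a small sketch comparing the two approaches. The function names and the sample keys are invented for the example, and stripByIndex is only a guess at the kind of slicing the comment is warning about.

```go
package main

import (
	"fmt"
	"strings"
)

// stripByIndex drops the first two characters unconditionally.
// It panics on inputs shorter than two bytes.
func stripByIndex(key string) string {
	return key[2:]
}

// stripPrefix removes "0x" only when the key actually starts with it.
func stripPrefix(key string) string {
	return strings.TrimPrefix(key, "0x")
}

func main() {
	fmt.Println(stripByIndex("abcdef"))  // cdef (two key characters silently lost)
	fmt.Println(stripPrefix("abcdef"))   // abcdef
	fmt.Println(stripPrefix("0xabcdef")) // abcdef
}
```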
- use trim prefix
LGTM!
…/flow-go into khalil/5844-access-node-fallbacks
bors merge
This PR removes two flags from LN/SN nodes: --access-address and --secure-access-node-id. It replaces them with a single flag, --access-node-ids, because protocol state will be the source of truth for connection info to staked access node APIs.
--access-node-ids is a priority-ordered list that will be used to create multiple flow client connections for the QCVoter and DKG broker. This allows us to fall back to another access node if a node goes down.