Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Khalil/5844 access node fallbacks #1395

Merged
merged 21 commits into from Oct 1, 2021
Merged

Conversation

kc1116
Copy link
Contributor

@kc1116 kc1116 commented Sep 30, 2021

This PR removes 2 flags from LN/SN nodes --access-address & --secure-access-node-id . It replaces them with a single flag --access-node-ids because protocol state will be the source of truth for connection info to staked access node APIs.
--access-node-ids is a priority ordered list that will be used to create multiple flow client connections for QCVoter and DKG broker. This allows us to fallback to another access node if a node goes down.

  • update LN/SN flags
  • add common util funcs to setup multiple client connection options
  • update QCVoter and DKG Broker to support multiple contract client instances and fallback to the next instance in use every 2 failed retry attempts
  • update localnet container configs
  • update integration tests container configs (from make secure grpc access api default for LN/SN client connections #1323)

- update CLI for LN/SN nodes to remove access-address, and secure-access-node-api flags

- add --access-node-ids flag

- update unit tests

- add fallback logic to retry logic of Broadcast and Vote
- update local net flags for LN/SN nodes

- update default access node count to 2

-  put flow client config prep in common function
- update integration tests flags and container configs (from #1323)
@codecov-commenter
Copy link

codecov-commenter commented Sep 30, 2021

Codecov Report

Merging #1395 (78a0825) into master (9d0c789) will decrease coverage by 0.02%.
The diff coverage is 44.73%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #1395      +/-   ##
==========================================
- Coverage   55.41%   55.39%   -0.03%     
==========================================
  Files         512      512              
  Lines       31942    31972      +30     
==========================================
+ Hits        17700    17710      +10     
- Misses      11862    11878      +16     
- Partials     2380     2384       +4     
Flag Coverage Δ
unittests 55.39% <44.73%> (-0.03%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
module/dkg/controller_factory.go 0.00% <0.00%> (ø)
module/dkg/broker.go 58.55% <46.80%> (-7.31%) ⬇️
module/epochs/qc_voter.go 59.67% <52.17%> (+1.34%) ⬆️
engine/collection/synchronization/engine.go 62.90% <0.00%> (-1.08%) ⬇️
admin/command_runner.go 80.00% <0.00%> (+1.48%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 9d0c789...78a0825. Read the comment docs.

return fmt.Errorf("missing required flag --access-address")
Module("sdk client connection options", func(builder cmd.NodeBuilder, node *cmd.NodeConfig) error {
if len(accessNodeIDS) < common.DefaultAccessNodeIDSMinimum {
return fmt.Errorf("invalid flag --access-node-ids atleast %x IDs must be provided", common.DefaultAccessNodeIDSMinimum)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
return fmt.Errorf("invalid flag --access-node-ids atleast %x IDs must be provided", common.DefaultAccessNodeIDSMinimum)
return fmt.Errorf("invalid flag --access-node-ids atleast %d IDs must be provided", common.DefaultAccessNodeIDSMinimum)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

return fmt.Errorf("missing required flag --access-address")
Module("sdk client connection options", func(builder cmd.NodeBuilder, node *cmd.NodeConfig) error {
if len(accessNodeIDS) < common.DefaultAccessNodeIDSMinimum {
return fmt.Errorf("invalid flag --access-node-ids atleast %x IDs must be provided", common.DefaultAccessNodeIDSMinimum)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
return fmt.Errorf("invalid flag --access-node-ids atleast %x IDs must be provided", common.DefaultAccessNodeIDSMinimum)
return fmt.Errorf("invalid flag --access-node-ids atleast %d IDs must be provided", common.DefaultAccessNodeIDSMinimum)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for _, anID := range accessNodeIDS {
id, err := flow.HexStringToIdentifier(anID)
if err != nil {
return nil, fmt.Errorf("could not get flow identifer from secured access node id: %s", id)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
return nil, fmt.Errorf("could not get flow identifer from secured access node id: %s", id)
return nil, fmt.Errorf("could not get flow identifer from secured access node id (%s): %w", id, err)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.


identities, err := snapshot.Identities(filter.HasNodeID(anIDS...))
if err != nil {
return nil, fmt.Errorf("failed get identities access node IDs from snapshot: %v", accessNodeIDS)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
return nil, fmt.Errorf("failed get identities access node IDs from snapshot: %v", accessNodeIDS)
return nil, fmt.Errorf("failed get identities access node identities (ids=%v) from snapshot: %w", accessNodeIDS, err)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.


// make sure we have identities for all the access node IDs provided
if len(identities) != len(accessNodeIDS) {
return nil, fmt.Errorf("failed to get identity for all the access node IDs provided: %v, got %s", accessNodeIDS, identities)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
return nil, fmt.Errorf("failed to get identity for all the access node IDs provided: %v, got %s", accessNodeIDS, identities)
return nil, fmt.Errorf("failed to get identity for all the access node IDs provided: %v, got %v", accessNodeIDS, identities.NodeIDs())

so the expected and actual values are in the same format

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.


// string slice argument for LN/SN node --access-node-ids
// remove extra comma at the end of string
anIDS := accessNodeIDS.String()[:accessNodeIDS.Len()-1]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can use strings.Join to make this a bit cleaner

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if err != nil {
b.log.Error().Err(err).Msgf("error broadcasting, retrying (%x)", attempts)

// retry with next fallback client every 2 attempts
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// retry with next fallback client every 2 attempts
// retry with next fallback client after 2 failed attempts

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

committee: committee,
me: me,
myIndex: myIndex,
dkgContractClients: dkgContractClients,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Writing down some more detail of my suggestion from our call yesterday for sharing the fallback logic between the QCVoter and DKGBroker and keeping the details of the logic abstracted from QCVoter/DKGBroker.

type FallbackStrategy interface {
  // return index of client to use
  ClientIndex() int
  // indicate successful use of client at index
  Success(int)
  // indicate failure to use client at index
  Failure(int)
}

// broker.go
type Broker struct {
  fallbackStrategy FallbackStrategy
  clients []DKGContractClient
}

// ... when using a client
clientIndex := fallbackStrategy.ClientIndex() 
err := clients[clientIndex].CallSomeFunction()
if err != nil {
  fallbackStrategy.Failure(clientIndex)
} else {
  fallbackStrategy.Sucess(clientIndex)
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is probably an (excellent) idea that deserves its own issue.

me: me,
signer: signer,
state: state,
qcContractClients: contractClients,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we initialize activeQCContractClient to 0 here, just so it is explicit and the reader doesn't have to know about Go's initialization rules to reason about the behaviour. (same for Broker)


// retry with next fallback client every 2 attempts
if attempts%2 == 0 {
log.Info().Msgf("retrying on attempt (%x) with fallback access node", attempts)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
log.Info().Msgf("retrying on attempt (%x) with fallback access node", attempts)
log.Info().Msgf("retrying on attempt (%d) with fallback access node", attempts)

%x is hex

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jordanschalm jordanschalm force-pushed the khalil/5844-access-node-fallbacks branch from a24564d to 492c35c Compare October 1, 2021 15:59
we always used the public key from the first AN, making any fallback
attempts useless (wrong key)
)

// SecureFlowClient creates a flow client with secured GRPC connection
func SecureFlowClient(accessAddress, accessApiNodePubKey string) (*client.Client, error) {
type FlowClientOpt struct {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would call this FlowClientConfig since they are configs instead of options

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@@ -47,3 +79,61 @@ func InsecureFlowClient(accessAddress string) (*client.Client, error) {

return flowClient, nil
}

// PrepareFlowClientOpts will assemble connection options for the flow client for each access node id
func PrepareFlowClientOpts(accessNodeIDS []string, insecureAccessAPI bool, snapshot protocol.Snapshot) ([]*FlowClientOpt, error) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
func PrepareFlowClientOpts(accessNodeIDS []string, insecureAccessAPI bool, snapshot protocol.Snapshot) ([]*FlowClientOpt, error) {
func FlowClientConfigs(accessNodeIDS []string, insecureAccessAPI bool, snapshot protocol.Snapshot) ([]*FlowClientConfig, error) {

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.


// ConvertAccessAddrFromState takes raw network address from the protocol state in the for of [DNS/IP]:PORT, removes the port and applies the appropriate
// port number depending on the insecureAccessAPI arg.
func ConvertAccessAddrFromState(address string, insecureAccessAPI bool) string {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
func ConvertAccessAddrFromState(address string, insecureAccessAPI bool) string {
func convertAccessAddrFromState(address string, insecureAccessAPI bool) string {

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

@vishalchangrani vishalchangrani left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

left a couple of nits but otherwise lgtm

Copy link
Contributor

@huitseeker huitseeker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a bunch! This is probably close to done, I left a bunch of mostly-nits and will do a second pass soon.

flags.BoolVar(&insecureAccessAPI, "insecure-access-api", true, "required if insecure GRPC connection should be used")
flags.StringSliceVar(&accessNodeIDS, "access-node-ids", []string{}, "array of access node ID's sorted in priority order where the first ID in this array will get the first connection attempt and each subsequent ID after serves as a fallback. minimum length 2")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could this refer to DefaultAccessNodeIDSMinimum, if there's not dependency cycle?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.


// make sure we have identities for all the access node IDs provided
if len(identities) != len(accessNodeIDS) {
return nil, fmt.Errorf("failed to get identity for all the access node IDs provided: %v, got %v", accessNodeIDS, identities.NodeIDs())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
return nil, fmt.Errorf("failed to get identity for all the access node IDs provided: %v, got %v", accessNodeIDS, identities.NodeIDs())
return nil, fmt.Errorf("failed to get %v distinct identities for all the access node IDs provided: %v, found %v", len(accessNodeIDs), accessNodeIDS, identities.NodeIDs())

Nit: the user may repeat a single access node ID parameter in the arguments to the client, in the case where they only want to contact one (their) access node, thereby bypassing the fallback.

One way to solve this is to allow just one access node as a parameter, and if the overall issue was any less urgent, I'd actually suggest it.

But in so far as we won't come back to that, let's at least help with an error message that takes the repetition into account here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@@ -251,7 +254,7 @@ type Build struct {
Target string
}

func prepareServices(containers []testnet.ContainerConfig, secureAccessNodeID string) Services {
func prepareServices(containers []testnet.ContainerConfig, secureAccessNodeIDS string) Services {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: if secure ANs have become an irrelevant concept, let's remove the adjective from those parameter names as well.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

committee: committee,
me: me,
myIndex: myIndex,
dkgContractClients: dkgContractClients,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is probably an (excellent) idea that deserves its own issue.

accessNodeIDS = append(accessNodeIDS, n.NodeID.String())
}
}
require.True(t, len(accessNodeIDS) > 1, "at-least 2 access node that is not a ghost must be configured for test suite")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit

Suggested change
require.True(t, len(accessNodeIDS) > 1, "at-least 2 access node that is not a ghost must be configured for test suite")
require.True(t, len(accessNodeIDS) > 1, "at-least 2 access node that is not a ghost must be configured for test suite")

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

// retry with next fallback client after 2 failed attempts
if attempts%2 == 0 {
b.log.Warn().Msgf("broadcast: retrying on attempt (%x) with fallback access node", attempts)
b.updateActiveDKGContractClient()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here I would actually call the update first, and then print the activeQCContractClient in the message, so that the engineer trying to debug this from logs will know which node is attempted when.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

// retry with next fallback client after 2 failed attempts
if attempts%2 == 0 {
b.log.Warn().Msgf("submit result: retrying on attempt (%d) with fallback access node", attempts)
b.updateActiveDKGContractClient()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as above: here I would actually call the update first, and then print the activeQCContractClient in the message, so that the engineer trying to debug this from logs will know which node is attempted when.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: missed the change from d8c0995 here

// retry with next fallback client after 2 failed attempts
if attempts%2 == 0 {
log.Warn().Msgf("retrying on attempt (%d) with fallback access node", attempts)
voter.updateActiveQcContractClient()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as elsewhere: I'd print the activeQCContractClient.

flags.BoolVar(&insecureAccessAPI, "insecure-access-api", true, "required if insecure GRPC connection should be used")
flags.StringSliceVar(&accessNodeIDS, "access-node-ids", []string{}, "array of access node ID's sorted in priority order where the first ID in this array will get the first connection attempt and each subsequent ID after serves as a fallback. minimum length 2")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment: it would be great if this could use the constant (provided it introduces no dependency cycles)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for i, identity := range identities {
accessAddress := ConvertAccessAddrFromState(identity.Address, insecureAccessAPI)

// remove the 0x prefix from network public keys
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider using TrimPrefix instead, as you'll be removing key information and getting it reported as "invalid" or "wrong length" later down the line if the initial characters aren't what you expect.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kc1116 kc1116 requested a review from huitseeker October 1, 2021 18:34
Copy link
Contributor

@huitseeker huitseeker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

// retry with next fallback client after 2 failed attempts
if attempts%2 == 0 {
b.log.Warn().Msgf("submit result: retrying on attempt (%d) with fallback access node", attempts)
b.updateActiveDKGContractClient()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: missed the change from d8c0995 here

@jordanschalm
Copy link
Member

bors merge

@bors
Copy link
Contributor

bors bot commented Oct 1, 2021

@bors bors bot merged commit 5b3a98a into master Oct 1, 2021
@bors bors bot deleted the khalil/5844-access-node-fallbacks branch October 1, 2021 23:20
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants