New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Validate machine account config at startup #1236
Validate machine account config at startup #1236
Conversation
…n/5792-check-machine-account-config-at-startup
…n/5792-check-machine-account-config-at-startup
This was previously a special case, because we wanted to detect but ALLOW incorrect configurations temporarily during the migration phase. We will check the flag inline as with all the other flags, and a missing machine account config file will be caught and error when we attempt to read the file
also refactor machine account components a bit
These tests checked that invalid configurations of the machine account were detected and _allowed_ temporarily during the migration phase. Now that we are requiring valid configurations, these tests can be replaced by test cases for the machine account validator.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very no-panic compliant, I like this (perhaps somewhat perversely). Left one nit.
module/epochs/machine_account.go
Outdated
// either we cannot validate the configuration or there is a critical | ||
// misconfiguration - log a warning and retry - we will continue checking | ||
// and logging until the problem is resolved | ||
log.Warn(). | ||
Err(err). | ||
Msg("critical machine account misconfiguration") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: shouldn't a critical misconfiguration be logged to the Error
channel?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Definitely should be, updated here: df22d9b
// attempt to read NodeMachineAccountInfo | ||
info, err := cmd.LoadNodeMachineAccountInfoFile(node.BaseConfig.BootstrapDir, node.Me.NodeID()) | ||
if err != nil { | ||
return nil, fmt.Errorf("could not load node machine account info file: %w", err) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's indeed in LoadNodeMachineAccountInfoFile
already, 👍
use error for critical misconfiguration, and info for confirmation of correctness
log.Error(). | ||
Err(err). | ||
Msg("critical machine account misconfiguration") | ||
return retry.RetryableError(err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need to retry if checking the machine info fails ? What will change about the info between retries ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some failures can be resolved out-of-band and would be addressed after retrying. For example, a temporary failure to connect to the access node, or an insufficient balance.
For the failures that can't be resolved without a restart, if we aren't going to crash, I think it's better to periodically log so that it's easier to find this problem (eg. a query for this log over the last hour would still find it).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Left a small comment
bors merge |
This PR extends #1233 to validate machine account configuration when a node requiring a machine account starts up.