(5.5) Better cluster detection mechanism. #855
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just two thoughts:
- I would leave a comment / note that this will assume the install is complete as long as gravity-site is reachable, and ideally would be improved to also check the install state, although as discussed on slack that portion would need some work.
- I would update gravity status to also use this logic if it's not too much work.
lib/localenv/detect.go
Outdated
// - If the cluster is deployed/healthy, returns nil. | ||
// - If the cluster is not deployed and the node's clean, returns NotFound. | ||
// - If partial state is detected, returns a specific error. | ||
func DetectCluster(ctx context.Context, env *LocalEnvironment) error { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- I believe this API belongs elsewhere (perhaps in lib/status)
- I don't think it will work as described (Initial
SiteOperator
call failure will not let you advance to look at packages). I would split it into functional parts:
// DetectCluster attempts to detect whether there is a deployed cluster.
//
// The behavior is as follows:
//
// - If the cluster is deployed/healthy, returns nil.
// - If the cluster is not deployed and the node's clean, returns NotFound.
// - If partial state is detected, returns a specific error.
func DetectCluster(ctx context.Context, env *LocalEnvironment) error {
err := isClusterController(env)
if err == nil {
return nil
}
err := isPackageState(env.Packages)
if err != nil {
return trace.Wrap(err)
}
// There are local packages: assume partial installation state or
// degraded state and log the original error for troubleshooting.
log.WithError(getErr).Warn("Failed to query local cluster.")
return trace.BadParameter(`Detected local Gravity state on the node but the cluster is inaccessible.
This usually means one of two things:
* The cluster is partially installed/uninstalled.
* The cluster is degraded.
To clean up the node from the partial state, run "gravity leave --force".
Otherwise, run "gravity status" and troubleshoot the degraded cluster.
`)
}
func isClusterController(env *localenv.LocalEnvironment) error {
clusterOperator, err := env.SiteOperator()
if err == nil {
_, getErr := clusterOperator.GetLocalSite()
if getErr == nil {
return nil
}
}
return trace.Wrap(err)
}
func isPackageState(packages pack.PackageService) error {
// Check local packages to see if there is a local state.
packages, err := packages.GetPackages(defaults.SystemAccountOrg)
if err != nil {
return trace.Wrap(err)
}
// There are no local packages: assume this is a clean node.
if len(packages) == 0 {
return trace.NotFound("no cluster installed")
}
return nil
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can't put it into lib/status as-is due to cyclic deps (that's where I originally tried to place it) but I can make a function there that accepts operator/packages and call it from here.
Agree about splitting into functions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@a-palchikov I've split the function into parts but decided to keep it where it is due to circular deps issues I mentioned above as I don't think the effort of trying to split it further and move worth it at this point.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, I tested all 3 scenarios - install checks on a clean node, upgrade checks when there's a cluster and error in the partial state case - and it works fine now.
f2013dc
to
d0ac78e
Compare
if clusterErr == nil { | ||
return nil | ||
} | ||
// Otherwise see if there is a partial state by looking at local packages. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We probably want to know what the error was.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure we do actually want to see it unless there are local packages. Because otherwise it will be logged every time checks are run on a clean node. So I'm logging the original error only in case a partial state detected on purpose.
lib/localenv/detect.go
Outdated
if trace.IsNotFound(err) { | ||
return trace.NotFound("no cluster detected") | ||
} | ||
return trace.Wrap(err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is supposed to show the below message as well because it's the same partial state - i.e. considering failure to query cluster or query packages to be equal in this context.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, we could return an aggregate here for sure.
@a-palchikov @knisbet Take a look. Based on our discussions in #845 I've implemented a bit more intelligent way of detecting the cluster. In addition to installed/not installed it now also recognizes the "partial state" or "degraded state" and recommends the user to figure it out first. Refs #840.