Skip to content
This repository has been archived by the owner on Feb 9, 2024. It is now read-only.

(5.5) Better cluster detection mechanism. #855

Merged
merged 3 commits into from Nov 8, 2019
Merged

Conversation

r0mant
Copy link
Contributor

@r0mant r0mant commented Oct 31, 2019

@a-palchikov @knisbet Take a look. Based on our discussions in #845 I've implemented a bit more intelligent way of detecting the cluster. In addition to installed/not installed it now also recognizes the "partial state" or "degraded state" and recommends the user to figure it out first. Refs #840.

Copy link
Contributor

@knisbet knisbet left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just two thoughts:

  • I would leave a comment / note that this will assume the install is complete as long as gravity-site is reachable, and ideally would be improved to also check the install state, although as discussed on slack that portion would need some work.
  • I would update gravity status to also use this logic if it's not too much work.

// - If the cluster is deployed/healthy, returns nil.
// - If the cluster is not deployed and the node's clean, returns NotFound.
// - If partial state is detected, returns a specific error.
func DetectCluster(ctx context.Context, env *LocalEnvironment) error {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • I believe this API belongs elsewhere (perhaps in lib/status)
  • I don't think it will work as described (Initial SiteOperator call failure will not let you advance to look at packages). I would split it into functional parts:
// DetectCluster attempts to detect whether there is a deployed cluster.
//
// The behavior is as follows:
//
//   - If the cluster is deployed/healthy, returns nil.
//   - If the cluster is not deployed and the node's clean, returns NotFound.
//   - If partial state is detected, returns a specific error.
func DetectCluster(ctx context.Context, env *LocalEnvironment) error {
	err := isClusterController(env)
	if err == nil {
		return nil
	}
	err := isPackageState(env.Packages)
	if err != nil {
		return trace.Wrap(err)
	}
	// There are local packages: assume partial installation state or
	// degraded state and log the original error for troubleshooting.
	log.WithError(getErr).Warn("Failed to query local cluster.")
	return trace.BadParameter(`Detected local Gravity state on the node but the cluster is inaccessible.
This usually means one of two things:

* The cluster is partially installed/uninstalled.
* The cluster is degraded.

To clean up the node from the partial state, run "gravity leave --force".
Otherwise, run "gravity status" and troubleshoot the degraded cluster.
`)
}

func isClusterController(env *localenv.LocalEnvironment) error {
	clusterOperator, err := env.SiteOperator()
	if err == nil {
		_, getErr := clusterOperator.GetLocalSite()
		if getErr == nil {
			return nil
		}
	}
	return trace.Wrap(err)
}

func isPackageState(packages pack.PackageService) error {
	// Check local packages to see if there is a local state.
	packages, err := packages.GetPackages(defaults.SystemAccountOrg)
	if err != nil {
		return trace.Wrap(err)
	}
	// There are no local packages: assume this is a clean node.
	if len(packages) == 0 {
		return trace.NotFound("no cluster installed")
	}
	return nil
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can't put it into lib/status as-is due to cyclic deps (that's where I originally tried to place it) but I can make a function there that accepts operator/packages and call it from here.

Agree about splitting into functions.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@a-palchikov I've split the function into parts but decided to keep it where it is due to circular deps issues I mentioned above as I don't think the effort of trying to split it further and move worth it at this point.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, I tested all 3 scenarios - install checks on a clean node, upgrade checks when there's a cluster and error in the partial state case - and it works fine now.

if clusterErr == nil {
return nil
}
// Otherwise see if there is a partial state by looking at local packages.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We probably want to know what the error was.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure we do actually want to see it unless there are local packages. Because otherwise it will be logged every time checks are run on a clean node. So I'm logging the original error only in case a partial state detected on purpose.

if trace.IsNotFound(err) {
return trace.NotFound("no cluster detected")
}
return trace.Wrap(err)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is supposed to show the below message as well because it's the same partial state - i.e. considering failure to query cluster or query packages to be equal in this context.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, we could return an aggregate here for sure.

@r0mant r0mant merged commit e77679f into version/5.5.x Nov 8, 2019
@r0mant r0mant deleted the roman/5.5/detect branch November 8, 2019 17:21
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants