ClusterRelocate to move ClusterDeployments between Hive instances #1011

Merged

staebler
Contributor

@staebler staebler commented May 28, 2020

When a ClusterRelocate has a label selector that matches a ClusterDeployment, the clusterrelocate controller relocates the matching ClusterDeployment in the following steps.

  1. Set the hive.openshift.io/relocate annotation to outgoing on the ClusterDeployment and the DNSZone.
  2. Copy secrets, configmaps, machinesets, syncsets, and syncidentityproviders.
  3. Copy the DNSZone for the ClusterDeployment, with the relocate annotation set to incoming.
  4. Copy the ClusterDeployment, with the relocate annotation set to incoming.
  5. Set the relocate annotation on the DNSZone and ClusterDeployment to complete.
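
The ordering of the five steps can be sketched with a small in-memory model. This is only an illustration under stated assumptions: the `object` and `hive` types and the `setRelocate`, `copyTo`, and `relocate` helpers are hypothetical stand-ins, not the controller's real types or client calls, and error handling is left out.

```go
package main

import "fmt"

// relocateAnnotation is the annotation key named in the description above.
const relocateAnnotation = "hive.openshift.io/relocate"

// object is a hypothetical, simplified stand-in for a Kubernetes resource.
type object struct {
	Kind, Name  string
	Annotations map[string]string
}

// hive is a hypothetical in-memory stand-in for one Hive instance.
type hive map[string]object

func key(kind, name string) string { return kind + "/" + name }

// put stores obj under its kind/name key.
func (h hive) put(obj object) { h[key(obj.Kind, obj.Name)] = obj }

// setRelocate sets the relocate annotation on the object stored under k.
func setRelocate(h hive, k, value string) {
	obj := h[k]
	if obj.Annotations == nil {
		obj.Annotations = map[string]string{}
	}
	obj.Annotations[relocateAnnotation] = value
	h[k] = obj
}

// copyTo copies the object stored under k from src to dest. A non-empty
// relocate value is set as the relocate annotation on the copy only.
func copyTo(src, dest hive, k, relocate string) {
	orig := src[k]
	c := object{Kind: orig.Kind, Name: orig.Name, Annotations: map[string]string{}}
	for ak, av := range orig.Annotations {
		c.Annotations[ak] = av
	}
	if relocate != "" {
		c.Annotations[relocateAnnotation] = relocate
	}
	dest[k] = c
}

// relocate walks the five steps in order.
func relocate(src, dest hive, cd, zone string, dependents []string) {
	setRelocate(src, cd, "outgoing") // step 1
	setRelocate(src, zone, "outgoing")
	for _, d := range dependents { // step 2
		copyTo(src, dest, d, "")
	}
	copyTo(src, dest, zone, "incoming") // step 3
	copyTo(src, dest, cd, "incoming")   // step 4
	setRelocate(src, cd, "complete")    // step 5
	setRelocate(src, zone, "complete")
}

func main() {
	src, dest := hive{}, hive{}
	src.put(object{Kind: "ClusterDeployment", Name: "cd1"})
	src.put(object{Kind: "DNSZone", Name: "cd1-zone"})
	src.put(object{Kind: "Secret", Name: "cd1-creds"})
	relocate(src, dest, key("ClusterDeployment", "cd1"), key("DNSZone", "cd1-zone"),
		[]string{key("Secret", "cd1-creds")})
	for k, obj := range dest {
		fmt.Println("dest", k, obj.Annotations[relocateAnnotation])
	}
	fmt.Println("src", src[key("ClusterDeployment", "cd1")].Annotations[relocateAnnotation])
}
```

Copying the ClusterDeployment last means the destination never sees a ClusterDeployment before its DNSZone and dependent resources exist.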

When a ClusterDeployment with a relocate annotation set to incoming is reconciled by the clusterrelocate controller, the relocate annotation is removed from the DNSZone and the ClusterDeployment. This indicates that the relocate has completed on the destination side.

When a ClusterDeployment has the relocate annotation, no controller mutates the remote cluster (for example, by syncing syncsets), and controllers do not run finalizer code when the ClusterDeployment is deleted.
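
A guard of this kind might look roughly like the sketch below. It assumes that "has the annotation" means the key is present with any value; `isRelocating` and `reconcile` are hypothetical names for illustration, not functions from the PR.

```go
package main

import "fmt"

// relocateAnnotation is the annotation key named in the description above.
const relocateAnnotation = "hive.openshift.io/relocate"

// isRelocating reports whether the relocate annotation is present at all.
// Treating any value as "relocating" is an assumption of this sketch.
func isRelocating(annotations map[string]string) bool {
	_, ok := annotations[relocateAnnotation]
	return ok
}

// reconcile is a hypothetical controller entry point that returns the action
// it would take, so the guard's effect is easy to see.
func reconcile(annotations map[string]string, deleted bool) string {
	if isRelocating(annotations) {
		// Neither mutate the remote cluster nor run finalizer code.
		return "skip"
	}
	if deleted {
		return "run finalizer"
	}
	return "sync"
}

func main() {
	relocating := map[string]string{relocateAnnotation: "outgoing"}
	fmt.Println(reconcile(relocating, false))
	fmt.Println(reconcile(relocating, true))
	fmt.Println(reconcile(nil, false))
	fmt.Println(reconcile(nil, true))
}
```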

Add a hive_cluster_relocations counter vec that tracks the total number of successful cluster relocations. The metric has a single cluster_relocate label that is the name of the ClusterRelocate directing the relocation.

Add a hive_aborted_cluster_relocations counter vec that tracks the total number of aborted cluster relocations. The metric has two labels: cluster_relocate and reason. The cluster_relocate label is the name of the ClusterRelocate directing the aborted relocation. The reason label is the reason why the relocate was aborted. Possible values for the reason label are "no_match", "multiple_matches", and "new_match".
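
As a rough illustration of the two-label structure, the sketch below counts aborts in a plain map rather than a real Prometheus counter vec. The meaning given to each reason value is an assumption inferred from the names (no ClusterRelocate matches any longer, more than one matches, or a different one matches); `abortKey`, `abortReason`, and `recordAbort` are hypothetical names.

```go
package main

import "fmt"

// abortKey mirrors the two labels of the aborted-relocations metric.
type abortKey struct {
	ClusterRelocate string
	Reason          string
}

// abortReason decides whether an in-progress relocation should be aborted.
// current is the ClusterRelocate directing the relocation; matches are the
// ClusterRelocates whose selector matches the ClusterDeployment right now.
// The interpretation of each reason is an assumption of this sketch.
func abortReason(current string, matches []string) (string, bool) {
	switch {
	case len(matches) == 0:
		return "no_match", true
	case len(matches) > 1:
		return "multiple_matches", true
	case matches[0] != current:
		return "new_match", true
	}
	return "", false
}

// recordAbort increments the counter for (current, reason) when aborted.
// counts is a stdlib stand-in for a labeled counter.
func recordAbort(counts map[abortKey]int, current string, matches []string) {
	if reason, aborted := abortReason(current, matches); aborted {
		counts[abortKey{ClusterRelocate: current, Reason: reason}]++
	}
}

func main() {
	counts := map[abortKey]int{}
	recordAbort(counts, "move-a", nil)
	recordAbort(counts, "move-a", []string{"move-a", "move-b"})
	recordAbort(counts, "move-a", []string{"move-b"})
	recordAbort(counts, "move-a", []string{"move-a"}) // still matches: no abort
	for k, v := range counts {
		fmt.Println(k.ClusterRelocate, k.Reason, v)
	}
}
```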

The number of failing cluster relocates can be seen by looking at the hive_cluster_deployments_conditions metric and filtering on a value of RelocationFailed for the condition label.

https://issues.redhat.com/browse/CO-653

@openshift-ci-robot added the do-not-merge/work-in-progress label (indicates that a PR should not merge because it is a work in progress) on May 28, 2020
@openshift-ci-robot added the approved label (indicates a PR has been approved by an approver from all required OWNERS files) on May 28, 2020
pkg/controller/add_clusterrelocate.go
	}

	for _, cd := range clusterDeployments.Items {
		if cd.Annotations[constants.RelocatingAnnotation] != clusterRelocate.Name &&
Contributor

This wasn't clear to me. Should it be if annotation != name OR !labelMatch ? I suppose the question is, what exactly is this condition catching?

Contributor Author

This check is meant to filter out the ClusterDeployments that are not related to the ClusterRelocate. If the ClusterDeployment does not have a relocating annotation matching the ClusterRelocate AND does not have a label matching the ClusterRelocate, then the ClusterDeployment is not queued.
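
The condition can be restated as a small, self-contained sketch. The types, the `labelsMatch` helper, and the annotation key below are simplified placeholders chosen for illustration, not the real Hive API, and only equality-based selectors are handled.

```go
package main

import "fmt"

// relocatingAnnotation is a placeholder key; the real value of
// constants.RelocatingAnnotation is not shown in this conversation.
const relocatingAnnotation = "example/relocating"

// clusterDeployment and clusterRelocate are simplified, hypothetical types.
type clusterDeployment struct {
	Name        string
	Labels      map[string]string
	Annotations map[string]string
}

type clusterRelocate struct {
	Name        string
	MatchLabels map[string]string
}

// labelsMatch reports whether every selector entry is present in labels.
// An empty selector matches everything in this sketch.
func labelsMatch(selector, labels map[string]string) bool {
	for k, want := range selector {
		if got, ok := labels[k]; !ok || got != want {
			return false
		}
	}
	return true
}

// toQueue returns the names of the ClusterDeployments related to cr.
func toQueue(cds []clusterDeployment, cr clusterRelocate) []string {
	var names []string
	for _, cd := range cds {
		// Skip only when BOTH are false: no relocating annotation naming
		// this ClusterRelocate AND no label match.
		if cd.Annotations[relocatingAnnotation] != cr.Name &&
			!labelsMatch(cr.MatchLabels, cd.Labels) {
			continue
		}
		names = append(names, cd.Name)
	}
	return names
}

func main() {
	cr := clusterRelocate{Name: "move-a", MatchLabels: map[string]string{"region": "east"}}
	cds := []clusterDeployment{
		{Name: "by-label", Labels: map[string]string{"region": "east"}},
		{Name: "by-annotation", Annotations: map[string]string{relocatingAnnotation: "move-a"}},
		{Name: "unrelated", Labels: map[string]string{"region": "west"}},
	}
	fmt.Println(toQueue(cds, cr))
}
```

Using AND in the skip condition is equivalent to queueing when either the annotation or the labels tie the ClusterDeployment to the ClusterRelocate, which also covers a ClusterDeployment whose labels changed after relocation started.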

@openshift-ci-robot added the needs-rebase label (indicates a PR cannot be merged because it has merge conflicts with HEAD) on Jun 4, 2020
@openshift-ci-robot removed the needs-rebase label on Jun 17, 2020
@staebler changed the title from "WIP: controller: re-work clusterrelocate to add annotation to dnszone" to "controller: re-work clusterrelocate to add annotation to dnszone" on Jun 18, 2020
@openshift-ci-robot removed the do-not-merge/work-in-progress label on Jun 18, 2020
@staebler
Contributor Author

This has been tested now and is ready for further review.

@@ -431,6 +431,18 @@ type CertificateBundleStatus struct {
Generated bool `json:"generated"`
}

// RelocateStatus is the status of a cluster relocate
type RelocateStatus string
Contributor

Noting the annotation it's used in would be useful context.

Contributor

I'm still not seeing the annotation it's used in mentioned here, something missed?

Contributor Author

Oh, I added it to ClusterRelocateStatus but not here. That brings to light that I have a ClusterRelocateStatus and a RelocateStatus. That is too confusing. I need to rename one or both of those.

		if err := r.copyResource(dnsZone, destClient, false, logger); err != nil {
			return errors.Wrap(err, "failed to copy dnszone")
		}
	}

	// copy clusterdeployment
	if err := r.copyResource(cd, destClient, true, logger); err != nil {
		return errors.Wrap(err, "failed to copy clusterdeployment")
	{
Contributor

why the naked curly block?

Contributor Author

To keep the logger variable isolated to the inner scope, in case code added after the block later uses logger and does not want the type and resource fields. But now I see that I replaced the function-scoped logger variable instead of creating a new logger variable.
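
The difference between the two forms can be shown with a minimal sketch. The `logger` type here is a hypothetical stand-in for a structured logger, not logrus: inside a bare block, `=` overwrites the function-scoped variable, while `:=` declares a new variable that disappears when the block ends.

```go
package main

import "fmt"

// logger is a hypothetical minimal structured logger that carries fields.
type logger struct{ fields []string }

// withField returns a copy of l with one more field attached.
func (l logger) withField(k, v string) logger {
	f := append(append([]string{}, l.fields...), k+"="+v)
	return logger{fields: f}
}

// assignInBlock uses plain assignment inside the block, so the
// function-scoped variable keeps the extra field after the block.
func assignInBlock(l logger) logger {
	{
		l = l.withField("type", "ClusterDeployment")
	}
	return l
}

// declareInBlock uses a short variable declaration inside the block, so the
// extra field is visible only within the block.
func declareInBlock(l logger) logger {
	{
		l := l.withField("type", "ClusterDeployment")
		_ = l // the block-scoped logger would be used here
	}
	return l
}

func main() {
	fmt.Println(len(assignInBlock(logger{}).fields))
	fmt.Println(len(declareInBlock(logger{}).fields))
}
```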

@staebler force-pushed the rework_cluster_relocate branch 3 times, most recently from 17031c1 to 91144bf, on June 24, 2020 13:30

"github.com/pkg/errors"
log "github.com/sirupsen/logrus"
"sigs.k8s.io/controller-runtime/pkg/handler"
Contributor

the grouping of the imports seems inconsistent. only one controller-runtime import is by itself.

@dgoodwin
Contributor

Despite the minor annotation thing I'm going to lgtm and see if we can get this to merge today.

@dgoodwin
Contributor

/lgtm

@openshift-ci-robot added the lgtm label (indicates that a PR is ready to be merged) on Jun 26, 2020
@openshift-bot

/retest

Please review the full test history for this PR and help us cut down flakes.

@staebler
Contributor Author

Let me fix the annotation comment and address Joel's comment and squash everything down.

@dgoodwin
Contributor

/hold

@openshift-ci-robot added the do-not-merge/hold label (indicates that a PR should not merge because someone has issued a /hold command) on Jun 26, 2020
@openshift-ci-robot removed the lgtm label on Jun 26, 2020
@staebler
Contributor Author

/hold cancel

@openshift-ci-robot removed the do-not-merge/hold label on Jun 26, 2020
@staebler changed the title from "controller: re-work clusterrelocate to add annotation to dnszone" to "ClusterRelocate to move ClusterDeployments between Hive instances" on Jun 26, 2020
@staebler force-pushed the rework_cluster_relocate branch 2 times, most recently from 26fec3c to cb2aa73, on June 26, 2020 14:07
@staebler
Contributor Author

/test unit

@dgoodwin
Contributor

/lgtm

@openshift-ci-robot added the lgtm label on Jun 26, 2020
@openshift-ci-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: dgoodwin, staebler

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-bot

/retest

Please review the full test history for this PR and help us cut down flakes.
