
Bug 1883772: pkg/operator/clustermembercontroller: resync every minute #457

Merged
merged 1 commit into openshift:master from hexfusion:fix-resync-cmc on Sep 30, 2020

Conversation

hexfusion
Contributor

@hexfusion hexfusion commented Sep 30, 2020

In certain circumstances, such as a transient client failure followed by the operator process being killed (leader election lost), the operator can get into a situation where we forget to retry scaling members.

In this case we see the caches sync, but scaling is not retried:

I0929 07:50:39.362608       1 shared_informer.go:223] Waiting for caches to sync for ClusterMemberController
I0929 07:50:40.862752       1 shared_informer.go:230] Caches are synced for ClusterMemberController

This PR ensures that we periodically revisit membership, so that in the case of a leaked event or an unexpected cache state we verify there is no further work to do.

Signed-off-by: Sam Batschelet <sbatsche@redhat.com>
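
For context, a minimal sketch of the kind of change described here, assuming a classic workqueue-based controller; the identifiers (workQueueKey, the queue field, the empty sync body) are illustrative rather than the exact code in this PR:

package clustermembercontroller

import (
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/util/workqueue"
)

const workQueueKey = "key"

// ClusterMemberController sketches a classic workqueue-based controller.
type ClusterMemberController struct {
	queue workqueue.RateLimitingInterface
}

func NewClusterMemberController() *ClusterMemberController {
	return &ClusterMemberController{
		queue: workqueue.NewNamedRateLimitingQueue(workqueue.DefaultControllerRateLimiter(), "ClusterMemberController"),
	}
}

// Run starts the controller. Besides event-driven enqueues (elided here), it
// re-enqueues the single work item every minute so that a leaked event cannot
// leave a pending member scale-up unprocessed forever.
func (c *ClusterMemberController) Run(stopCh <-chan struct{}) {
	defer c.queue.ShutDown()

	// Level-driven safety net: force a resync once per minute.
	go wait.Until(func() { c.queue.Add(workQueueKey) }, time.Minute, stopCh)

	// Single worker draining the queue.
	go wait.Until(c.runWorker, time.Second, stopCh)

	<-stopCh
}

func (c *ClusterMemberController) runWorker() {
	for c.processNextWorkItem() {
	}
}

func (c *ClusterMemberController) processNextWorkItem() bool {
	key, quit := c.queue.Get()
	if quit {
		return false
	}
	defer c.queue.Done(key)

	// sync would re-evaluate etcd membership and retry any pending scale-up.
	if err := c.sync(); err != nil {
		c.queue.AddRateLimited(key)
		return true
	}
	c.queue.Forget(key)
	return true
}

func (c *ClusterMemberController) sync() error {
	// Compare desired vs. actual etcd members and add any missing peer.
	return nil
}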
@openshift-ci-robot openshift-ci-robot added the bugzilla/severity-high Referenced Bugzilla bug's severity is high for the branch this PR is targeting. label Sep 30, 2020
@openshift-ci-robot

@hexfusion: This pull request references Bugzilla bug 1883772, which is valid. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target release (4.6.0) matches configured target release for branch (4.6.0)
  • bug is in the state ASSIGNED, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST)

In response to this:

Bug 1883772: pkg/operator/clustermembercontroller: resync every minute

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci-robot openshift-ci-robot added the bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. label Sep 30, 2020
@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Sep 30, 2020
@retroflexer
Contributor

/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Sep 30, 2020
@openshift-ci-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: hexfusion, retroflexer

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@ironcladlou
Contributor

Seems harmless to resync periodically but I'm curious about what's going on here. Are you saying the operator remains alive but has missed some edge event that should trigger a sync but doesn't? If the process is killed and restarted I'd expect a resync. But are you saying the process is killed and not restarted, and having missed the event we never catch up with the missed event?

@hexfusion
Contributor Author

> Seems harmless to resync periodically but I'm curious about what's going on here. Are you saying the operator remains alive but has missed some edge event that should trigger a sync but doesn't? If the process is killed and restarted I'd expect a resync. But are you saying the process is killed and not restarted, and having missed the event we never catch up with the missed event?

I am saying the operator process is killed and a new process is started, but it appears we leaked an event and then did not retry.

@ironcladlou
Contributor

> Seems harmless to resync periodically but I'm curious about what's going on here. Are you saying the operator remains alive but has missed some edge event that should trigger a sync but doesn't? If the process is killed and restarted I'd expect a resync. But are you saying the process is killed and not restarted, and having missed the event we never catch up with the missed event?
>
> I am saying the operator process is killed and a new process is started, but it appears we leaked an event and then did not retry.

Hm, wouldn't the resync happen on startup then? The logic should be level-driven, after all.

@hexfusion
Contributor Author

hexfusion commented Sep 30, 2020

> Hm, wouldn't the resync happen on startup then? The logic should be level-driven, after all.

It should, yes, and in this case it did:

I0929 07:50:39.362608       1 shared_informer.go:223] Waiting for caches to sync for ClusterMemberController
I0929 07:50:40.862752       1 shared_informer.go:230] Caches are synced for ClusterMemberController

But should we consider the leak a follow-up?

@openshift-ci-robot

@hexfusion: This pull request references Bugzilla bug 1883772, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target release (4.6.0) matches configured target release for branch (4.6.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST)

In response to this:

Bug 1883772: pkg/operator/clustermembercontroller: resync every minute

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@ironcladlou
Contributor

> Hm, wouldn't the resync happen on startup then? The logic should be level-driven, after all.
>
> It should, yes, and in this case it did:
>
> I0929 07:50:39.362608       1 shared_informer.go:223] Waiting for caches to sync for ClusterMemberController
> I0929 07:50:40.862752       1 shared_informer.go:230] Caches are synced for ClusterMemberController
>
> But should we consider the leak a follow-up?

Are you saying that had the process actively resynced every minute, the issue would have resolved before the process ended up killed? I'm not sure of the timing details here (e.g. how long did it take before the process was killed, at which point the immediate resync fixed things).


@hexfusion
Contributor Author

hexfusion commented Sep 30, 2020

> Are you saying that had the process actively resynced every minute, the issue would have resolved before the process ended up killed? I'm not sure of the timing details here (e.g. how long did it take before the process was killed, at which point the immediate resync fixed things).

Well, I am saying that once the event was leaked, the controller would not resync. The net result was that the cluster failed and never scaled this member.

Timing:

Original operator process failure:

W0929 07:50:37.860383 1 leaderelection.go:69] leader election lost

New operator process a few seconds later:

I0929 07:50:39.362608       1 shared_informer.go:223] Waiting for caches to sync for ClusterMemberController
I0929 07:50:40.862752       1 shared_informer.go:230] Caches are synced for ClusterMemberController

Scale of the other two members:

I0929 07:50:57.371684       1 event.go:278] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-etcd-operator", Name:"etcd-operator", UID:"ab4b54e4-a684-4248-96fb-d4cd458f490a", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'MemberAdd' adding new peer https://192.168.1.11:2380
I0929 07:52:53.048252       1 event.go:278] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-etcd-operator", Name:"etcd-operator", UID:"ab4b54e4-a684-4248-96fb-d4cd458f490a", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'MemberAdd' adding new peer https://192.168.1.29:2380

The cache is synced, but no event was triggered to attempt scale-up of the third member. So it seems we leaked one?
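
To make the level-driven point concrete, a small hypothetical sketch of a membership sync (the client interface and names below are invented, not the actual controller API): each run re-derives pending scale-ups from observed state, so a sync triggered by the one-minute resync retries the member add that the leaked event never triggered.

package clustermembercontroller

import "context"

// EtcdMemberClient is an illustrative abstraction over the etcd membership API.
type EtcdMemberClient interface {
	MemberPeerURLs(ctx context.Context) ([]string, error)
	MemberAdd(ctx context.Context, peerURL string) error
}

// syncMembership is level-driven: it computes pending work from observed state
// rather than from any particular event, so it does not matter whether it was
// triggered by an informer event or by the periodic one-minute resync.
func syncMembership(ctx context.Context, client EtcdMemberClient, desiredPeerURLs []string) error {
	current, err := client.MemberPeerURLs(ctx)
	if err != nil {
		return err
	}
	existing := make(map[string]bool, len(current))
	for _, u := range current {
		existing[u] = true
	}
	for _, u := range desiredPeerURLs {
		if !existing[u] {
			// e.g. the third peer that was never added in the log excerpt above
			if err := client.MemberAdd(ctx, u); err != nil {
				return err
			}
		}
	}
	return nil
}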

@openshift-bot
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

2 similar comments

@openshift-ci-robot

openshift-ci-robot commented Sep 30, 2020

@hexfusion: The following test failed, say /retest to rerun all failed tests:

Test name: ci/prow/e2e-disruptive
Commit: 2429c3e
Details: link
Rerun command: /test e2e-disruptive

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@openshift-bot
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-merge-robot openshift-merge-robot merged commit 3c9d46e into openshift:master Sep 30, 2020
@openshift-ci-robot

@hexfusion: All pull requests linked via external trackers have merged:

Bugzilla bug 1883772 has been moved to the MODIFIED state.

In response to this:

Bug 1883772: pkg/operator/clustermembercontroller: resync every minute

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@hexfusion hexfusion deleted the fix-resync-cmc branch September 30, 2020 18:23
@hexfusion
Contributor Author

/cherry-pick release-4.5

@openshift-cherrypick-robot

@hexfusion: new pull request created: #458

In response to this:

/cherry-pick release-4.5

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
