Bug 1728873: pkg/daemon: reconcile killed just prior drain+reboot #995
Conversation
If the MCD has applied the update during a sync but is still in the drain+reboot phase when it gets killed (by the OOM killer, for instance), it can end up permanently failed, leaving the node unschedulable as well. Since we know where in the sync the MCD was (we have state on disk and the boot ID), if it happens to be killed we can attempt the drain+reboot phase again instead of permanently failing. This is definitely better than just bailing out and leaving nodes unschedulable, which might also disrupt upgrades. Signed-off-by: Antonio Murdaca <runcom@linux.com>
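The recovery logic the description outlines can be sketched roughly as follows. This is a minimal illustration, not the daemon's actual code: `pendingState` and `resumeAction` are hypothetical names, and it assumes the daemon persists both an "update applied" flag and the boot ID recorded when the update was written to disk.

```go
package main

import "fmt"

// pendingState is a hypothetical stand-in for the MCD's on-disk record of an
// in-progress update (the description says the daemon keeps state on disk
// together with the boot ID).
type pendingState struct {
	Applied bool   // the new config was already written to disk
	BootID  string // boot ID recorded when the update was applied
}

// resumeAction decides what a restarted daemon should do. If the update was
// applied and we are still in the same boot (boot IDs match), the process was
// killed mid drain+reboot, so we retry that phase instead of permanently
// degrading the node.
func resumeAction(st *pendingState, currentBootID string) string {
	if st == nil || !st.Applied {
		return "normal-sync" // nothing pending, run a regular sync
	}
	if st.BootID == currentBootID {
		// Killed after applying the update but before the reboot happened:
		// re-attempt drain+reboot rather than bailing out.
		return "retry-drain-reboot"
	}
	// Boot ID changed: the reboot already happened, so finish the update.
	return "finish-update"
}

func main() {
	st := &pendingState{Applied: true, BootID: "boot-a"}
	fmt.Println(resumeAction(st, "boot-a")) // killed in the same boot
	fmt.Println(resumeAction(st, "boot-b")) // already rebooted
	fmt.Println(resumeAction(nil, "boot-a"))
}
```

The key design point is that the boot ID disambiguates "killed before reboot" from "came back after reboot", so a crash no longer collapses into a permanent failure state.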
@runcom: This pull request references a valid Bugzilla bug. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
@cgwalters ptal, we need approve and lgtm before running this through architects for the next z release
@runcom: This pull request references a valid Bugzilla bug. In response to this:
/approve
This should fix a real-world failure case we've seen at least a few times.
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: cgwalters, runcom. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
@runcom @cgwalters where is the 4.2/master PR/bug?
It's in the title, right? https://bugzilla.redhat.com/show_bug.cgi?id=1728873
#952 was for master
That's the 4.1.z/release-4.1 BZ. I'm looking for the 4.2 BZ where QA verified this was fixed in master. I'm creating one for you.
/bugzilla refresh
@eparis: This pull request references an invalid Bugzilla bug:
In response to this:
After the 4.2 bug is VERIFIED we'll need to run
This is still not verified by QE in 4.2. Can we work with QE to move forward?
Just managed to verify it myself and provided steps for QE to verify it themselves in 4.2.
4.2 BZ verified just now.
/bugzilla refresh
@runcom: This pull request references Bugzilla bug 1728873, which is valid. In response to this:
This has the BZ in the right state and priority, and the master fix has soaked. Approving.
/retest Please review the full test history for this PR and help us cut down flakes.
So everything is fine test-wise, except the whole infra test fails at the end.
/retest Please review the full test history for this PR and help us cut down flakes.
@runcom: All pull requests linked via external trackers have merged. Bugzilla bug 1728873 has been moved to the MODIFIED state. In response to this: