
Bug 1728873: pkg/daemon: reconcile killed just prior drain+reboot #995

Merged
merged 1 commit into openshift:release-4.1 on Sep 3, 2019

Conversation

runcom
Member

@runcom runcom commented Jul 20, 2019

If the MCD has applied the update during a sync but is still in the
drain+reboot phase when it gets killed (by the OOM killer, for instance), it can
end up in a permanent failure that also leaves the node unschedulable. Since we
know at which point of the sync the MCD was (we have state on disk and the boot ID),
if it does get killed we can attempt the drain+reboot phase again instead of
permanently failing.
This is definitely better than just bailing out and leaving nodes
unschedulable, which might also disrupt upgrades.

Signed-off-by: Antonio Murdaca <runcom@linux.com>

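To make the retry decision concrete, here is a minimal Go sketch of the idea described above, under assumed names (pendingState, shouldRetryDrainReboot, currentBootID are illustrative, not the actual MCD code): the daemon records the pending config and the boot ID before drain+reboot, and on the next sync an unchanged boot ID plus a still-pending config means it was killed mid-phase and can safely retry rather than mark the node degraded.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// pendingState is a stand-in for the record the daemon persists to disk
// before starting the drain+reboot phase. The real MCD keeps its own
// state file and node annotations; these names are illustrative only.
type pendingState struct {
	PendingConfig string // rendered config that was already written to disk
	BootID        string // boot ID captured when the update was applied
}

// currentBootID reads the kernel boot ID, which changes on every reboot.
func currentBootID() (string, error) {
	b, err := os.ReadFile("/proc/sys/kernel/random/boot_id")
	if err != nil {
		return "", err
	}
	return strings.TrimSpace(string(b)), nil
}

// shouldRetryDrainReboot reports whether a previous sync wrote the update
// to disk but was killed before the node actually rebooted: the pending
// config is still recorded and the boot ID is unchanged, so the daemon
// can retry drain+reboot instead of degrading the node.
func shouldRetryDrainReboot(st *pendingState) (bool, error) {
	if st == nil || st.PendingConfig == "" {
		return false, nil // nothing was in flight
	}
	bootID, err := currentBootID()
	if err != nil {
		return false, err
	}
	// Same boot ID means the node never rebooted after the update was
	// written; the daemon was most likely killed (e.g. OOMed) mid-phase.
	return bootID == st.BootID, nil
}

func main() {
	// Hypothetical state left behind by an interrupted sync.
	st := &pendingState{
		PendingConfig: "rendered-worker-abc123",
		BootID:        "boot-id-recorded-before-the-kill",
	}
	retry, err := shouldRetryDrainReboot(st)
	if err != nil {
		fmt.Fprintln(os.Stderr, "error:", err)
		os.Exit(1)
	}
	if retry {
		fmt.Println("update already on disk; retrying drain+reboot instead of degrading the node")
	} else {
		fmt.Println("no interrupted drain+reboot detected")
	}
}
```

The boot ID (from /proc/sys/kernel/random/boot_id) is a convenient marker because it changes exactly once per reboot, so it distinguishes "killed before the reboot happened" from "came back up after the reboot completed".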
@openshift-ci-robot
Contributor

@runcom: This pull request references a valid Bugzilla bug. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker.

In response to this:

Bug 1728873: pkg/daemon: reconcile killed just prior drain+reboot

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci-robot openshift-ci-robot added the bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. label Jul 20, 2019
@openshift-ci-robot openshift-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. approved Indicates a PR has been approved by an approver from all required OWNERS files. labels Jul 20, 2019
@runcom
Member Author

runcom commented Jul 20, 2019

@cgwalters ptal, we need /approve and /lgtm before running this through the architects for the next z release

@openshift-ci-robot
Contributor

@runcom: This pull request references a valid Bugzilla bug.

In response to this:

Bug 1728873: pkg/daemon: reconcile killed just prior drain+reboot

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@cgwalters
Member

/approve

@cgwalters
Member

This should fix a real-world failure case we've seen at least a few times.
/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Jul 22, 2019
@openshift-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: cgwalters, runcom

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@eparis
Member

eparis commented Jul 22, 2019

@runcom @cgwalters where is the 4.2/master PR/bug?

@cgwalters
Member

It's in the title, right? https://bugzilla.redhat.com/show_bug.cgi?id=1728873

@LorbusChris
Member

#952 was for master

@eparis
Member

eparis commented Jul 22, 2019

It's in the title, right? https://bugzilla.redhat.com/show_bug.cgi?id=1728873

that's the 4.1.z/release-4.1 BZ. I'm looking for the 4.2 BZ where QA verified this was fixed in master. I'm creating one for you.

@eparis
Member

eparis commented Jul 22, 2019

/bugzilla refresh

@openshift-ci-robot
Contributor

@eparis: This pull request references an invalid Bugzilla bug:

  • expected dependent Bugzilla bug to be in one of the following states: VERIFIED, RELEASE_PENDING, CLOSED (ERRATA), but it is MODIFIED instead

Comment /bugzilla refresh to re-evaluate validity if changes to the Bugzilla bug are made, or edit the title of this pull request to link to a different bug.

In response to this:

/bugzilla refresh

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci-robot openshift-ci-robot added bugzilla/invalid-bug Indicates that a referenced Bugzilla bug is invalid for the branch this PR is targeting. and removed bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. labels Jul 22, 2019
@eparis
Member

eparis commented Jul 22, 2019

After the 4.2 bug is VERIFIED we'll need to run /bugzilla refresh

@eparis
Member

eparis commented Aug 20, 2019

This is still not verified by QE in 4.2. Can we work with QE to move forward?

@runcom
Member Author

runcom commented Aug 28, 2019

This is still not verified by QE in 4.2. Can we work with QE to move forward?

just managed to verify it myself and provided steps for QE to verify it themselves in 4.2

@runcom
Member Author

runcom commented Aug 28, 2019

After the 4.2 bug is VERIFIED we'll need to run /bugzilla refresh

4.2 BZ verified just now

/bugzilla refresh

@openshift-ci-robot openshift-ci-robot added the bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. label Aug 28, 2019
@openshift-ci-robot
Contributor

@runcom: This pull request references Bugzilla bug 1728873, which is valid.

In response to this:

After the 4.2 bug is VERIFIED we'll need to run /bugzilla refresh

4.2 BZ verified just now

/bugzilla refresh

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci-robot openshift-ci-robot removed the bugzilla/invalid-bug Indicates that a referenced Bugzilla bug is invalid for the branch this PR is targeting. label Aug 28, 2019
@mfojtik
Member

mfojtik commented Sep 3, 2019

This has the BZ in the right state and priority, and the master fix has soaked. Approving.

@mfojtik mfojtik added the cherry-pick-approved Indicates a cherry-pick PR into a release branch has been approved by the release branch manager. label Sep 3, 2019
@openshift-bot
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

2 similar comments
@openshift-bot
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@runcom
Member Author

runcom commented Sep 3, 2019

So everything is fine test-wise, except the whole infra test fails at the end:

2019/09/03 12:29:56 Container test in pod e2e-aws-op completed successfully
2019/09/03 12:30:42 Container lease in pod e2e-aws-op completed successfully
2019/09/03 12:35:03 Container teardown in pod e2e-aws-op completed successfully
2019/09/03 12:35:03 Pod e2e-aws-op succeeded after 1h21m8s
2019/09/03 12:35:03 error: unable to signal to artifacts container to terminate in pod e2e-aws-op, triggering deletion: could not run remote command: container artifacts is not valid for pod e2e-aws-op
2019/09/03 12:35:03 error: unable to retrieve artifacts from pod e2e-aws-op: could not read gzipped artifacts: container artifacts is not valid for pod e2e-aws-op
2019/09/03 12:35:08 error: could not wait for pod 'e2e-aws-op': it is no longer present on the cluster (usually a result of a race or resource pressure. re-running the job should help)
E0903 12:35:08.604840      14 event.go:191] Server rejected event '&v1.Event{TypeMeta:v1.TypeMeta{Kind:"", APIVersion:""}, ObjectMeta:v1.ObjectMeta{Name:".15c0edc6d7ad5853", GenerateName:"", Namespace:"ci-op-fl6yt7sv", SelfLink:"", UID:"", ResourceVersion:"", Generation:0, CreationTimestamp:v1.Time{Time:time.Time{wall:0x0, ext:0, loc:(*time.Location)(nil)}}, DeletionTimestamp:(*v1.Time)(nil), DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string(nil), Annotations:map[string]string(nil), OwnerReferences:[]v1.OwnerReference(nil), Initializers:(*v1.Initializers)(nil), Finalizers:[]string(nil), ClusterName:"", ManagedFields:[]v1.ManagedFieldsEntry(nil)}, InvolvedObject:v1.ObjectReference{Kind:"", Namespace:"ci-op-fl6yt7sv", Name:"", UID:"", APIVersion:"", ResourceVersion:"", FieldPath:""}, Reason:"CiJobFailed", Message:"Running job pull-ci-openshift-machine-config-operator-release-4.1-e2e-aws-op for PR https://github.com/openshift/machine-config-operator/pull/995 in namespace ci-op-fl6yt7sv from author runcom", Source:v1.EventSource{Component:"ci-op-fl6yt7sv", Host:""}, FirstTimestamp:v1.Time{Time:time.Time{wall:0xbf53b55f23ec8053, ext:4919418925858, loc:(*time.Location)(0x26765e0)}}, LastTimestamp:v1.Time{Time:time.Time{wall:0xbf53b55f23ec8053, ext:4919418925858, loc:(*time.Location)(0x26765e0)}}, Count:1, Type:"Warning", EventTime:v1.MicroTime{Time:time.Time{wall:0x0, ext:0, loc:(*time.Location)(nil)}}, Series:(*v1.EventSeries)(nil), Action:"", Related:(*v1.ObjectReference)(nil), ReportingController:"", ReportingInstance:""}': 'events ".15c0edc6d7ad5853" is forbidden: unable to create new content in namespace ci-op-fl6yt7sv because it is being terminated' (will not retry!)

cc @stevekuznetsov

@openshift-bot
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

1 similar comment
@openshift-bot
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-merge-robot openshift-merge-robot merged commit 916adbf into openshift:release-4.1 Sep 3, 2019
@openshift-ci-robot
Contributor

@runcom: All pull requests linked via external trackers have merged. Bugzilla bug 1728873 has been moved to the MODIFIED state.

In response to this:

Bug 1728873: pkg/daemon: reconcile killed just prior drain+reboot

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Labels
approved: Indicates a PR has been approved by an approver from all required OWNERS files.
bugzilla/valid-bug: Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting.
cherry-pick-approved: Indicates a cherry-pick PR into a release branch has been approved by the release branch manager.
lgtm: Indicates that a PR is ready to be merged.
size/M: Denotes a PR that changes 30-99 lines, ignoring generated files.