adjust sigterm handler and bump test timeouts #697
Conversation
The NodeController shouldn't rely just on what it's syncing at that moment to deduce that a node is unavailable. It may happen that the pool is updating to a rendered-config-A but it's not done, and the code was still going to apply a new rendered-config-B, causing more than 1 node at a time to go unschedulable. This patch should fix that.
/retest |
@runcom just to clarify: is this PR ensuring that we don't apply config B unless and until config A is successfully applied/finished, and fixing a problem where things get trampled because another config was applied too soon, right? |
yup, that's the aim :) |
makes sense to me!!! 👍 |
Sounds like a good idea on general principles |
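To make the intent above concrete, here is a minimal sketch (made-up types and names, not the actual node controller code) of the kind of guard being described: only start rolling a new rendered config onto a node once the pool has finished the one it is already applying.

```go
package main

import "fmt"

// poolStatus is a stand-in for the machine config pool status fields;
// the real controller reads these from the pool's status.
type poolStatus struct {
	machineCount int // nodes in the pool
	updatedCount int // nodes already on the pool's current rendered config
	readyCount   int // nodes that are ready
}

// poolIsIdle reports whether every node has finished applying the
// current rendered config, i.e. whether it is safe to start a new one.
func poolIsIdle(p poolStatus) bool {
	return p.updatedCount == p.machineCount && p.readyCount == p.machineCount
}

func main() {
	// rendered-config-A is still rolling out to one node...
	p := poolStatus{machineCount: 3, updatedCount: 2, readyCount: 2}
	// ...so the controller should not start cordoning another node
	// for rendered-config-B yet.
	fmt.Println("safe to progress to a new rendered config:", poolIsIdle(p))
}
```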
pkg/daemon/update.go
Outdated
	if dn.installedSigterm {
		return
	}
	dn.installedSigterm = true
This should probably be dn.updateActive or something now.
fixed
pkg/daemon/daemon.go
Outdated
	if dn.installedSigterm {
		glog.Info("Got SIGTERM, but actively updating")
	} else {
		close(stopCh)
Copying from the BZ your comment:
Now, in the MCD we have a handler for SIGTERM only during our sync; the rest of the code doesn't really care about SIGTERM, so we don't catch it and we exit with 143 instead of 0 (which we'd get if we had a handler).
I have serious difficulty believing that all other pods have SIGTERM handlers of this form. Being killed by SIGTERM seems like a completely normal event - what makes the MCD special here?
Maybe it's that most other pods are drained rather than killed by kubelet? If that's the case...OK I guess, but I would vehemently argue that whatever is treating "container got SIGTERM" as an error is the real problem.
Indeed, other operators aren't getting SIGTERMed like us (we're a daemonset and we don't get drained).
but I would vehemently argue that whatever is treating "container got SIGTERM" as an error is the real problem.
I would be with you :)
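For reference, a self-contained sketch of the handler shape being discussed (variable names are approximate, not the exact daemon code): SIGTERM is noted but ignored while an update is in flight, and otherwise triggers a clean shutdown so the container exits 0 instead of 143.

```go
package main

import (
	"log"
	"os"
	"os/signal"
	"syscall"
)

func main() {
	stopCh := make(chan struct{})
	// In the daemon this flag would be flipped on while an update is
	// being applied; it stays false here for illustration.
	updateActive := false

	sigCh := make(chan os.Signal, 1)
	signal.Notify(sigCh, syscall.SIGTERM)
	go func() {
		for range sigCh {
			if updateActive {
				// Mid-update: just log and keep going so the update
				// isn't interrupted by the kubelet's SIGTERM.
				log.Print("Got SIGTERM, but actively updating")
				continue
			}
			// Idle: shut down cleanly so we exit 0, not 143.
			close(stopCh)
			return
		}
	}()

	<-stopCh // block until a SIGTERM arrives while idle
}
```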
Looks good; will defer to @cgwalters to give his signoff. |
OK I don't quite understand yet how your fix is addressing the problem. Looking at this from the top, it certainly seems to me that Although...our default of I guess I can try your reproducer. OK nevermind, I think I do understand now, (Actually, why do we even have |
I wanted to actually do that, but I'm too afraid of changing the code....... |
Yeah it makes complete sense bc the math should check out. But... we could do it later too? |
/lgtm
Proof will be if this merges, I guess. Though it's not totally clear to me yet why this suddenly broke. |
The e2e I'm not sure about either; it looks like a general slowdown? The race in node_controller was there for a long time, apparently |
Looking at more PRs I see this consistently:
which looks to me like something is actually killing that node. |
Or hm maybe that's normal while it's rebooting. |
Alrighty, I've reverted the very first commit about not progressing the pool if it's not idle; that was breaking the rollback on a bad MC. I've taken that out and I'm opening another PR so we can get this in to unblock tests (new PR here #699) |
/lgtm |
@@ -135,7 +135,7 @@ const (
	pathStateJSON = "/etc/machine-config-daemon/state.json"
	// currentConfigPath is where we store the current config on disk to validate
	// against annotations changes
	currentConfigPath = "/var/machine-config-daemon/currentconfig"
Wait I'm confused, at this point aren't we chrooted into the host?
yeah, I changed this for consistency with the other /etc paths and since /etc/machine-config-daemon is mounted in the daemonset. Will wait for tests to report and revert this if you don't want it (and I understand why you might not)
@cgwalters added a question so: |
this fixed the e2e-aws-op, first pass in days: /lgtm |
/retest |
I wanted to look at the logs of that passed run to see if my new logging added any information; turns out the logs are now compressed twice, just FYI everyone. I noticed this in the MCD logs:
Honestly I don't know why we're trying to do this stop channel stuff, let's just exit the process? |
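As a rough sketch of that suggestion (not the actual change in this PR), the handler could skip the stop-channel plumbing entirely and exit directly once it's safe to do so:

```go
package main

import (
	"os"
	"os/signal"
	"syscall"
)

func main() {
	sigCh := make(chan os.Signal, 1)
	signal.Notify(sigCh, syscall.SIGTERM)

	// Wait for SIGTERM; when nothing critical is in flight there is no
	// need to unwind through a stop channel, just exit 0 so the pod
	// terminates cleanly.
	<-sigCh
	os.Exit(0)
}
```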
Ohhh. so that's why I had problems yesterday afternoon. Thanks! |
fixed that in the latest commit |
This PR feels like a bit of a mess now...it's basically ended up just adjusting the timeouts, right? If that's the case maybe it's clearer to do separate PRs for the other things. But eh. There was a theory about this being network-related with crio/kubelet; wonder if we just ended up winning a race. Anyways I was looking at the controller logs in the passed
You can see how when we revert the config to the previous one we get the "completed update" message (which I hadn't considered when adding it, but it makes sense). However, for 2 minutes afterwards the nodes are still marked Unschedulable. Then the logs stop...and I don't see the nodes as Unschedulable in the |
I'm not sure what we're doing in those 2 minutes. Are the MCD logs telling us something? It may be stuck in checkstateonboot? |
Here's another interesting thing, check out the logs from this failed run, specifically note the MCD is in an infinite loop of:
Which...hum indeed, I think we want
right? It looks like this test was intending to have validation fail due to the version and not the filesystem. So it looks like that MC source was deleted but its contents are still in the rendered version...the pool is correctly targeting the new one
but the MCD hasn't noticed this. Actually the problem is the node controller hasn't changed the desired annotation:
|
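For context on that last point, here's a minimal sketch of the desired-vs-current relationship involved (annotation keys written from memory, so treat them as illustrative rather than authoritative): the daemon derives its target from the node's desired-config annotation, so if the node controller never bumps it, the daemon keeps working toward the stale rendered config.

```go
package main

import "fmt"

// Annotation keys as recalled; illustrative only.
const (
	currentConfigAnnotation = "machineconfiguration.openshift.io/currentConfig"
	desiredConfigAnnotation = "machineconfiguration.openshift.io/desiredConfig"
)

// targetConfig is the rendered config the daemon works toward; it comes
// from the node annotation the node controller is responsible for setting.
func targetConfig(nodeAnnotations map[string]string) string {
	return nodeAnnotations[desiredConfigAnnotation]
}

func main() {
	// The pool already targets a newer rendered config, but if the
	// controller never bumps the desired annotation, the daemon keeps
	// retrying (and failing to validate) the stale one.
	ann := map[string]string{
		currentConfigAnnotation: "rendered-worker-old",
		desiredConfigAnnotation: "rendered-worker-stale",
	}
	fmt.Println("daemon target:", targetConfig(ann))
}
```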
I believe the failed run you're looking at was just related to the fact that we couldn't reconcile back with the patch I've now reverted (in the other PR now). That test is supposed to create a failing MC and that should roll back if you delete it (that's what the test does). |
/retest |
/test e2e-aws-upgrade |
Shazam!! |
OK, might as well get this in |
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: cgwalters, kikisdeliveryservice, runcom
The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
The NodeController shouldn't rely just on what it's syncing at that
moment to deduce that a node is unavailable. It may happen that the pool
is updating to a rendered-config-A but it's not done, and the code was
still going to apply a new rendered-config-B, causing more than 1 node
at a time to go unschedulable. This patch should fix that.
This is also easy to reproduce... create a MachineConfig, wait for 1 node to go unschedulable, then instantly create another MC which creates another rendered config for the given pool, and it starts progressing on a second node.
This PR also:
fixes https://bugzilla.redhat.com/show_bug.cgi?id=1703877
and bumps e2e timeouts