
adjust sigterm handler and bump test timeouts #697

Merged
merged 11 commits into openshift:master from race-mcc on May 4, 2019

Conversation

runcom
Member

@runcom runcom commented May 2, 2019

The NodeController shouldn't rely just on what it's syncing at that
moment to deduce that a node is unavailable. It may happen that the pool
is still updating to rendered-config-A when the code goes ahead and
applies a new rendered-config-B, causing more than one node at a time to
go unschedulable. This patch should fix that.

This is also easy to reproduce: create a MachineConfig, wait for one node to go unschedulable, then immediately create another MC (which produces another rendered config for the pool) and watch the controller start progressing on a second node.

This PR also:

fixes https://bugzilla.redhat.com/show_bug.cgi?id=1703877
and bumps the e2e timeouts
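
(As a side note on the timeout bump: it is the usual widening of an e2e poll deadline. The sketch below is hypothetical; both the helper name and the 20-minute figure are assumptions rather than the values in this PR, and it only shows the shape of such a change.)

package e2e

import (
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

// waitFor is a hypothetical helper illustrating a timeout bump: keep the
// poll interval small, raise the overall deadline so slower CI runs stop
// flaking. The 20-minute value is an assumption for illustration only.
func waitFor(check func() (bool, error)) error {
	return wait.PollImmediate(2*time.Second, 20*time.Minute, check)
}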

@openshift-ci-robot openshift-ci-robot added the size/S Denotes a PR that changes 10-29 lines, ignoring generated files. label May 2, 2019
@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 2, 2019
@runcom
Member Author

runcom commented May 2, 2019

/retest

@kikisdeliveryservice
Contributor

@runcom just to clarify: is this PR ensuring that we don't apply config B unless and until config A is successfully applied/finished, and fixing a problem where things get trampled because another config was applied too soon?

@runcom
Member Author

runcom commented May 2, 2019

@runcom just to clarify: is this PR ensuring that we don't apply config B unless and until config A is successfully applied/finished, and fixing a problem where things get trampled because another config was applied too soon?

yup, that's the aim :)

@kikisdeliveryservice
Contributor

kikisdeliveryservice commented May 2, 2019

yup, that's the aim :)

makes sense to me!!! 👍

@RobertKrawitz
Contributor

Sounds like a good idea on general principles

@openshift-ci-robot openshift-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels May 2, 2019
@runcom runcom changed the title from "pkg/controller: do not progress if the pool isn't idle" to "do not progress if the pool isn't idle, adjust sigterm handler and bump test timeouts" May 2, 2019
if dn.installedSigterm {
return
}
dn.installedSigterm = true
Member

This should probably be dn.updateActive or something now.

Member Author

fixed

if dn.installedSigterm {
glog.Info("Got SIGTERM, but actively updating")
} else {
close(stopCh)
Member

Copying your comment from the BZ:

Now, in the MCD we have a handler for SIGTERM only during our sync; the rest of the code doesn't really care about SIGTERM, so we don't catch it and we exit with 143 instead of 0 (we would exit 0 if we had a handler).

I have serious difficulty believing that all other pods have SIGTERM handlers of this form. Being killed by SIGTERM seems like a completely normal event - what makes the MCD special here?

Maybe it's that most other pods are drained rather than killed by kubelet? If that's the case...OK I guess, but I would vehemently argue that whatever is treating "container got SIGTERM" as an error is the real problem.

Member Author

Other operators indeed aren't getting SIGTERMed like we are (we're a daemonset and we don't get drained).

but I would vehemently argue that whatever is treating "container got SIGTERM" as an error is the real problem.

I would be with you :)
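
As a rough illustration of the pattern under review here (a sketch only; the flag name follows the dn.updateActive suggestion above, and none of this is the MCD's actual code), the daemon ignores SIGTERM while an update is in flight and otherwise exits cleanly with status 0:

package main

import (
	"log"
	"os"
	"os/signal"
	"sync/atomic"
	"syscall"
)

// updateActive is set for the duration of a config update; the name
// follows the review suggestion above and is otherwise an assumption.
var updateActive atomic.Bool

func installSigtermHandler() {
	sigCh := make(chan os.Signal, 1)
	signal.Notify(sigCh, syscall.SIGTERM)
	go func() {
		for range sigCh {
			if updateActive.Load() {
				// Mid-update: finishing (and rebooting) is safer
				// than dying halfway through.
				log.Print("Got SIGTERM, but actively updating")
				continue
			}
			// Idle: exit 0 so the container isn't recorded as
			// having failed with 143.
			os.Exit(0)
		}
	}()
}

func main() {
	installSigtermHandler()
	select {} // stand-in for the daemon's real run loop
}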

@kikisdeliveryservice
Contributor

Looks good; will defer to @cgwalters to give his signoff.

@cgwalters
Member

cgwalters commented May 2, 2019

OK I don't quite understand yet how your fix is addressing the problem. Looking at this from the top, it certainly seems to me that getCandidateMachines() is broken - if getReadyMachines() returns nothing, it will happily reset the desired annotation for a node in progress.

Although...our default of maxUnavailable = 1 means that if there is an unavailable node we'd have already skipped out before going into the candidates.

I guess I can try your reproducer.

OK, never mind, I think I do understand now: getUnavailableNodes() was happily reporting nodes as not unavailable (note the double negative) if they didn't match the pool's current target. One good way to think of this is that getReadyNodes() should be symmetric with getUnavailableNodes(), and with this change I think it is.

(Actually, why do we even have getUnavailableNodes() instead of just doing the set difference of nodes - getAvailable(nodes)?)
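
To make the symmetry point concrete, here is a minimal sketch of the invariant being discussed, with hypothetical types and helper names rather than the actual node_controller code: a node only counts as available when it is ready, schedulable, and fully settled on the pool's target, and "unavailable" is simply the complement of that.

package sketch

// node is an illustrative stand-in for the fields the controller
// actually reads off corev1.Node and its MCO annotations.
type node struct {
	Ready         bool
	Unschedulable bool
	CurrentConfig string // machineconfiguration.openshift.io/currentConfig
	DesiredConfig string // machineconfiguration.openshift.io/desiredConfig
}

// isAvailable: ready, schedulable, and done updating to the pool target.
func isAvailable(n node, poolTarget string) bool {
	return n.Ready &&
		!n.Unschedulable &&
		n.CurrentConfig == poolTarget &&
		n.DesiredConfig == poolTarget
}

// getUnavailableNodes is then just the set difference
// nodes - available(nodes), so it can never disagree with the
// "ready" side of the computation.
func getUnavailableNodes(nodes []node, poolTarget string) []node {
	var out []node
	for _, n := range nodes {
		if !isAvailable(n, poolTarget) {
			out = append(out, n)
		}
	}
	return out
}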

@runcom
Member Author

runcom commented May 2, 2019

(Actually, why do we even have getUnavailableNodes() instead of just doing the set difference of nodes - getAvailable(nodes)?)

I actually wanted to do that, but I'm too afraid of changing that code...

@kikisdeliveryservice
Contributor

kikisdeliveryservice commented May 2, 2019

(Actually, why do we even have getUnavailableNodes() instead of just doing the set difference of nodes - getAvailable(nodes)?)

I actually wanted to do that, but I'm too afraid of changing that code...

Yeah, it makes complete sense because the math should check out. But... we could do it later too?

@cgwalters
Member

/lgtm

Proof will be if this merges I guess. Though it's not totally clear to me yet why this suddenly broke.

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label May 2, 2019
@runcom
Member Author

runcom commented May 2, 2019

Though it's not totally clear to me yet why this suddenly broke.

I'm not sure about the e2e failures either; it looks like a general slowdown? The race in node_controller has apparently been there for a long time.

@cgwalters
Member

Looking at more PRs I see this consistently:

            "spec": {
                "providerID": "aws:///us-east-1b/i-0cc60b89b38b0eb6d",
                "taints": [
                    {
                        "effect": "NoSchedule",
                        "key": "node.kubernetes.io/unschedulable",
                        "timeAdded": "2019-05-02T17:11:05Z"
                    },
                    {
                        "effect": "NoSchedule",
                        "key": "node.kubernetes.io/unreachable",
                        "timeAdded": "2019-05-02T17:11:41Z"
                    },
                    {
                        "effect": "NoExecute",
                        "key": "node.kubernetes.io/unreachable",
                        "timeAdded": "2019-05-02T17:11:47Z"
                    }
                ],
                "unschedulable": true
            },

which looks to me like something is actually killing that node.

@cgwalters
Member

which looks to me like something is actually killing that node.

Or hm maybe that's normal while it's rebooting.

@runcom
Member Author

runcom commented May 3, 2019

Alright, I've reverted the very first commit about not progressing the pool if it's not idle; it was breaking the rollback on a bad MC. I've taken that out, and I'm opening another PR so we can get this in to unblock tests (new PR here: #699).

@kikisdeliveryservice
Contributor

/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label May 3, 2019
@@ -135,7 +135,7 @@ const (
pathStateJSON = "/etc/machine-config-daemon/state.json"
// currentConfigPath is where we store the current config on disk to validate
// against annotations changes
currentConfigPath = "/var/machine-config-daemon/currentconfig"
Member

Wait I'm confused, at this point aren't we chrooted into the host?

Member Author

Yeah, I changed this for consistency with the other /etc paths and since /etc/machine-config-daemon is mounted in the daemonset. I'll wait for tests to report and revert this if you don't want it (and I understand why you might not).

@kikisdeliveryservice
Contributor

@cgwalters added a question so:
/lgtm cancel

@openshift-ci-robot openshift-ci-robot removed the lgtm Indicates that a PR is ready to be merged. label May 3, 2019
@kikisdeliveryservice
Contributor

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label May 3, 2019
@runcom
Member Author

runcom commented May 3, 2019

/retest

@cgwalters
Member

I wanted to look at the logs of that passed run to see if my new logging added any information; turns out the logs are now compressed twice just FYI everyone.

I noticed this in the MCD logs:

I0503 19:18:27.290073    2440 start.go:214] Shutting down MachineConfigDaemon
panic: close of closed channel

goroutine 1 [running]:
main.runStartCmd(0x1d55a60, 0x1d7e4f8, 0x0, 0x0)
	/go/src/github.com/openshift/machine-config-operator/cmd/machine-config-daemon/start.go:214 +0xaa9
github.com/openshift/machine-config-operator/vendor/github.com/spf13/cobra.(*Command).execute(0x1d55a60, 0x1d7e4f8, 0x0, 0x0, 0x1d55a60, 0x1d7e4f8)
	/go/src/github.com/openshift/machine-config-operator/vendor/github.com/spf13/cobra/command.go:766 +0x2c1
github.com/openshift/machine-config-operator/vendor/github.com/spf13/cobra.(*Command).ExecuteC(0x1d55800, 0x0, 0xc42043df78, 0x403dfc)
	/go/src/github.com/openshift/machine-config-operator/vendor/github.com/spf13/cobra/command.go:852 +0x30a
github.com/openshift/machine-config-operator/vendor/github.com/spf13/cobra.(*Command).Execute(0x1d55800, 0xc42043df60, 0x0)
	/go/src/github.com/openshift/machine-config-operator/vendor/github.com/spf13/cobra/command.go:800 +0x2b
main.main()
	/go/src/github.com/openshift/machine-config-operator/cmd/machine-config-daemon/main.go:27 +0x31

Honestly I don't know why we're trying to do this stop channel stuff, let's just exit the process?
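
For context on the panic: closing a Go channel twice always fails with exactly that "close of closed channel" message. Here is a minimal sketch (an assumed shape, not the actual start.go) of the two ways out being discussed: make the close idempotent with sync.Once, or skip the stop-channel dance and exit the process directly.

package main

import (
	"log"
	"os"
	"os/signal"
	"sync"
	"syscall"
)

func main() {
	stopCh := make(chan struct{})

	// Closing stopCh from more than one code path panics with
	// "close of closed channel"; sync.Once makes the close idempotent.
	var once sync.Once
	stop := func() { once.Do(func() { close(stopCh) }) }

	sigCh := make(chan os.Signal, 1)
	signal.Notify(sigCh, syscall.SIGTERM)
	go func() {
		<-sigCh
		stop() // safe even if shutdown was already triggered elsewhere
		// ...or, as suggested above, skip the channel entirely: os.Exit(0)
	}()

	<-stopCh
	log.Print("Shutting down MachineConfigDaemon")
}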

@kikisdeliveryservice
Contributor

I wanted to look at the logs of that passed run to see if my new logging added any information; turns out the logs are now compressed twice just FYI everyone.

Ohhh, so that's why I had problems yesterday afternoon. Thanks!

@openshift-ci-robot openshift-ci-robot removed the lgtm Indicates that a PR is ready to be merged. label May 3, 2019
@runcom
Member Author

runcom commented May 3, 2019

Honestly I don't know why we're trying to do this stop channel stuff, let's just exit the process?

fixed that in the latest commit

@cgwalters
Member

This PR feels like a bit of a mess now... it's basically ended up just adjusting the timeouts, right? If that's the case, maybe it's clearer to do separate PRs for the other things. But eh.

There was a theory about this being network related with crio/kubelet; wonder if we just ended up winning a race.

Anyways I was looking at the controller logs in the passed e2e-aws-op run (after decompressing twice) and this looks odd:

I0503 19:18:04.818702       1 node_controller.go:476] Setting node ip-10-0-132-181.ec2.internal to desired config rendered-worker-fb998f7806949569640baf62cd48d0ad
I0503 19:18:04.830101       1 node_controller.go:476] Setting node ip-10-0-152-113.ec2.internal to desired config rendered-worker-fb998f7806949569640baf62cd48d0ad
I0503 19:18:04.840559       1 node_controller.go:476] Setting node ip-10-0-140-228.ec2.internal to desired config rendered-worker-fb998f7806949569640baf62cd48d0ad
I0503 19:18:04.851071       1 status.go:202] Node ip-10-0-132-181.ec2.internal unavailable: different configs (desired: rendered-worker-fb998f7806949569640baf62cd48d0ad, current rendered-worker-5eedbe776afba6cb62ce01ba6255a653) or node not ready false
I0503 19:18:04.851190       1 status.go:202] Node ip-10-0-152-113.ec2.internal unavailable: different configs (desired: rendered-worker-fb998f7806949569640baf62cd48d0ad, current rendered-worker-5eedbe776afba6cb62ce01ba6255a653) or node not ready false
... repeats some ...
I0503 19:18:14.841463       1 status.go:202] Node ip-10-0-140-228.ec2.internal unavailable: different configs (desired: rendered-worker-fb998f7806949569640baf62cd48d0ad, current rendered-worker-5eedbe776afba6cb62ce01ba6255a653) or node not ready false
I0503 19:18:19.849176       1 node_controller.go:476] Setting node ip-10-0-132-181.ec2.internal to desired config rendered-worker-5eedbe776afba6cb62ce01ba6255a653
I0503 19:18:19.864522       1 node_controller.go:476] Setting node ip-10-0-152-113.ec2.internal to desired config rendered-worker-5eedbe776afba6cb62ce01ba6255a653
I0503 19:18:19.864672       1 node_controller.go:207] Pool worker: node ip-10-0-132-181.ec2.internal has completed update to rendered-worker-5eedbe776afba6cb62ce01ba6255a653
I0503 19:18:19.876107       1 node_controller.go:476] Setting node ip-10-0-140-228.ec2.internal to desired config rendered-worker-5eedbe776afba6cb62ce01ba6255a653
I0503 19:18:19.877922       1 node_controller.go:207] Pool worker: node ip-10-0-152-113.ec2.internal has completed update to rendered-worker-5eedbe776afba6cb62ce01ba6255a653
I0503 19:18:19.889322       1 node_controller.go:207] Pool worker: node ip-10-0-140-228.ec2.internal has completed update to rendered-worker-5eedbe776afba6cb62ce01ba6255a653
I0503 19:18:24.864998       1 status.go:178] Node ip-10-0-132-181.ec2.internal is reporting Unschedulable
I0503 19:18:24.865030       1 status.go:202] Node ip-10-0-132-181.ec2.internal unavailable: different configs (desired: rendered-worker-5eedbe776afba6cb62ce01ba6255a653, current rendered-worker-5eedbe776afba6cb62ce01ba6255a653) or node not ready true
I0503 19:18:24.865083       1 status.go:178] Node ip-10-0-140-228.ec2.internal is reporting Unschedulable
I0503 19:18:24.865089       1 status.go:202] Node ip-10-0-140-228.ec2.internal unavailable: different configs (desired: rendered-worker-5eedbe776afba6cb62ce01ba6255a653, current rendered-worker-5eedbe776afba6cb62ce01ba6255a653) or node not ready true
I0503 19:18:24.865095       1 status.go:178] Node ip-10-0-132-181.ec2.internal is reporting Unschedulable
...
I0503 19:21:06.775118       1 status.go:178] Node ip-10-0-140-228.ec2.internal is reporting Unschedulable
I0503 19:21:06.775124       1 status.go:202] Node ip-10-0-140-228.ec2.internal unavailable: different configs (desired: rendered-worker-5eedbe776afba6cb62ce01ba6255a653, current rendered-worker-5eedbe776afba6cb62ce01ba6255a653) or node not ready true

You can see how, when we revert the config to the previous one, we get the "completed update" message (which I hadn't considered when adding it, but it makes sense).

However, for 2 minutes afterwards the nodes are still marked Unschedulable. Then the logs stop...and I don't see the nodes as Unschedulable in the nodes.json so I guess it cleared up? But why/how were they Unschedulable for 2 minutes after they shouldn't have been?

@runcom
Member Author

runcom commented May 3, 2019

I'm not sure what we're doing in those 2 minutes. Are the MCD logs telling us something? It may be stuck in checkstateonboot?

@cgwalters
Member

cgwalters commented May 3, 2019

Here's another interesting thing: check out the logs from this failed run, and specifically note that the MCD is in an infinite loop of:

I0503 17:46:48.659693    1586 update.go:180] Checking reconcilable for config rendered-worker-9dbe0ae65f449ba6974cd4b769c3f539 to rendered-worker-7abedcbf98d3be5c0b4b6f7f5c575e6f
I0503 17:46:48.659724    1586 update.go:239] Checking if configs are reconcilable
I0503 17:46:48.661298    1586 update.go:708] can't reconcile config rendered-worker-9dbe0ae65f449ba6974cd4b769c3f539 with rendered-worker-7abedcbf98d3be5c0b4b6f7f5c575e6f: invalid Ignition config found: error: no filesystem specified
I0503 17:46:48.664038    1586 daemon.go:411] Unable to apply update: can't reconcile config rendered-worker-9dbe0ae65f449ba6974cd4b769c3f539 with rendered-worker-7abedcbf98d3be5c0b4b6f7f5c575e6f: invalid Ignition config found: error: no filesystem specified: unreconcilable

Which...hum indeed, I think we want

--- a/test/e2e/mcd_test.go
+++ b/test/e2e/mcd_test.go
@@ -288,7 +288,7 @@ func getNodesByRole(cs *framework.ClientSet, role string) ([]v1.Node, error) {
 func TestPoolDegradedOnFailToRender(t *testing.T) {
        cs := framework.NewClientSet("")
 
-       mcadd := createMCToAddFile("add-a-file", "/etc/mytestconfs", "test", "")
+       mcadd := createMCToAddFile("add-a-file", "/etc/mytestconfs", "test", "root")
        mcadd.Spec.Config.Ignition.Version = "" // invalid, won't render
 
        // create the dummy MC now

Right? It looks like this test intended to have validation fail due to the version, not the filesystem.

So it looks like that MC source was deleted but its contents are still in the rendered version...the pool is correctly targeting the new one

                        "reason": "All nodes are updating to rendered-worker-9dbe0ae65f449ba6974cd4b769c3f539",

but the MCD hasn't noticed this. Actually the problem is the node controller hasn't changed the desired annotation:

                    "machineconfiguration.openshift.io/desiredConfig": "rendered-worker-7abedcbf98d3be5c0b4b6f7f5c575e6f",

@runcom
Member Author

runcom commented May 3, 2019

I believe the failed run you're looking at was just related to the fact that we couldn't reconcile back with the patch I've now reverted (it's in the other PR now). That test is supposed to create a failing MC, which should roll back when you delete it (that's what the test does).

@runcom
Member Author

runcom commented May 3, 2019

/retest

@hexfusion
Contributor

/test e2e-aws-upgrade

@hexfusion
Contributor

Shazam!!

@cgwalters
Member

OK, might as well get this in
/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label May 4, 2019
@openshift-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: cgwalters, kikisdeliveryservice, runcom

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • OWNERS [cgwalters,kikisdeliveryservice,runcom]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-merge-robot openshift-merge-robot merged commit ee0f5cd into openshift:master May 4, 2019
@runcom runcom deleted the race-mcc branch May 4, 2019 09:42