
adjust sigterm handler and bump test timeouts #697

Merged
merged 11 commits into openshift:master from race-mcc on May 4, 2019

Conversation

runcom
Member

@runcom runcom commented May 2, 2019

The NodeController shouldn't rely just on what it's syncing at that
moment to deduce that a node is unavailable. It may happen that the pool
is still updating to rendered-config-A when the code goes ahead and
applies a new rendered-config-B, causing more than one node at a time to
go unschedulable. This patch should fix that.

This is also easy to reproduce: create a MachineConfig, wait for one node to go unschedulable, then immediately create another MC (which produces another rendered config for the pool) and watch the controller start progressing on a second node.

This PR also:

fixes https://bugzilla.redhat.com/show_bug.cgi?id=1703877
and bumps the e2e timeouts
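
(As a side note on the timeout bump: it is the usual widening of an e2e poll deadline. The sketch below is hypothetical; both the helper name and the 20-minute figure are assumptions rather than the values in this PR, and it only shows the shape of such a change.)

package e2e

import (
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

// waitFor is a hypothetical helper illustrating a timeout bump: keep the
// poll interval small, raise the overall deadline so slower CI runs stop
// flaking. The 20-minute value is an assumption for illustration only.
func waitFor(check func() (bool, error)) error {
	return wait.PollImmediate(2*time.Second, 20*time.Minute, check)
}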

@openshift-ci-robot openshift-ci-robot added the size/S Denotes a PR that changes 10-29 lines, ignoring generated files. label May 2, 2019
@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 2, 2019
@runcom
Member Author

runcom commented May 2, 2019

/retest

@kikisdeliveryservice
Contributor

@runcom just to clarify: is this PR ensuring that we don't apply config B unless and until config A is successfully applied/finished, and fixing a problem where things get trampled because another config was applied too soon?

@runcom
Member Author

runcom commented May 2, 2019

@runcom just to clarify: is this PR ensuring that we don't apply config B unless and until config A is successfully applied/finished, and fixing a problem where things get trampled because another config was applied too soon?

yup, that's the aim :)

@kikisdeliveryservice
Contributor

kikisdeliveryservice commented May 2, 2019

yup, that's the aim :)

makes sense to me!!! 👍

@RobertKrawitz
Contributor

Sounds like a good idea on general principles

@openshift-ci-robot openshift-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels May 2, 2019
@runcom runcom changed the title from "pkg/controller: do not progress if the pool isn't idle" to "do not progress if the pool isn't idle, adjust sigterm handler and bump test timeouts" May 2, 2019
if dn.installedSigterm {
return
}
dn.installedSigterm = true
Member

This should probably be dn.updateActive or something now.

Member Author

fixed

if dn.installedSigterm {
glog.Info("Got SIGTERM, but actively updating")
} else {
close(stopCh)
Member

Copying your comment from the BZ:

Now, in the MCD we have a handler for SIGTERM only during our sync; the rest of the code doesn't really care about SIGTERM, so we don't catch it and we exit with 143 instead of 0 (we would exit 0 if we had a handler).

I have serious difficulty believing that all other pods have SIGTERM handlers of this form. Being killed by SIGTERM seems like a completely normal event - what makes the MCD special here?

Maybe it's that most other pods are drained rather than killed by kubelet? If that's the case...OK I guess, but I would vehemently argue that whatever is treating "container got SIGTERM" as an error is the real problem.

Member Author

Other operators indeed aren't getting SIGTERMed like we are (we're a daemonset and we don't get drained).

but I would vehemently argue that whatever is treating "container got SIGTERM" as an error is the real problem.

I would be with you :)
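
As a rough illustration of the pattern under review here (a sketch only; the flag name follows the dn.updateActive suggestion above, and none of this is the MCD's actual code), the daemon ignores SIGTERM while an update is in flight and otherwise exits cleanly with status 0:

package main

import (
	"log"
	"os"
	"os/signal"
	"sync/atomic"
	"syscall"
)

// updateActive is set for the duration of a config update; the name
// follows the review suggestion above and is otherwise an assumption.
var updateActive atomic.Bool

func installSigtermHandler() {
	sigCh := make(chan os.Signal, 1)
	signal.Notify(sigCh, syscall.SIGTERM)
	go func() {
		for range sigCh {
			if updateActive.Load() {
				// Mid-update: finishing (and rebooting) is safer
				// than dying halfway through.
				log.Print("Got SIGTERM, but actively updating")
				continue
			}
			// Idle: exit 0 so the container isn't recorded as
			// having failed with 143.
			os.Exit(0)
		}
	}()
}

func main() {
	installSigtermHandler()
	select {} // stand-in for the daemon's real run loop
}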

@kikisdeliveryservice
Contributor

Looks good; will defer to @cgwalters to give his signoff.

@cgwalters
Member

cgwalters commented May 2, 2019

OK I don't quite understand yet how your fix is addressing the problem. Looking at this from the top, it certainly seems to me that getCandidateMachines() is broken - if getReadyMachines() returns nothing, it will happily reset the desired annotation for a node in progress.

Although...our default of maxUnavailable = 1 means that if there is an unavailable node we'd have already skipped out before going into the candidates.

I guess I can try your reproducer.

OK, never mind, I think I do understand now: getUnavailableNodes() was happily reporting nodes as not unavailable (note the double negative) if they didn't match the pool's current target. One good way to think of this is that getReadyNodes() should be symmetric with getUnavailableNodes(), and with this change I think it is.

(Actually, why do we even have getUnavailableNodes() instead of just doing the set difference of nodes - getAvailable(nodes)?)
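
To make the symmetry point concrete, here is a minimal sketch of the invariant being discussed, with hypothetical types and helper names rather than the actual node_controller code: a node only counts as available when it is ready, schedulable, and fully settled on the pool's target, and "unavailable" is simply the complement of that.

package sketch

// node is an illustrative stand-in for the fields the controller
// actually reads off corev1.Node and its MCO annotations.
type node struct {
	Ready         bool
	Unschedulable bool
	CurrentConfig string // machineconfiguration.openshift.io/currentConfig
	DesiredConfig string // machineconfiguration.openshift.io/desiredConfig
}

// isAvailable: ready, schedulable, and done updating to the pool target.
func isAvailable(n node, poolTarget string) bool {
	return n.Ready &&
		!n.Unschedulable &&
		n.CurrentConfig == poolTarget &&
		n.DesiredConfig == poolTarget
}

// getUnavailableNodes is then just the set difference
// nodes - available(nodes), so it can never disagree with the
// "ready" side of the computation.
func getUnavailableNodes(nodes []node, poolTarget string) []node {
	var out []node
	for _, n := range nodes {
		if !isAvailable(n, poolTarget) {
			out = append(out, n)
		}
	}
	return out
}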

@runcom
Member Author

runcom commented May 2, 2019

(Actually, why do we even have getUnavailableNodes() instead of just doing the set difference of nodes - getAvailable(nodes)?)

I actually wanted to do that, but I'm too afraid of changing that code...

@kikisdeliveryservice
Contributor

kikisdeliveryservice commented May 2, 2019

(Actually, why do we even have getUnavailableNodes() instead of just doing the set difference of nodes - getAvailable(nodes)?)

I actually wanted to do that, but I'm too afraid of changing that code...

Yeah, it makes complete sense because the math should check out. But... we could do it later too?

@cgwalters
Member

/lgtm

Proof will be if this merges I guess. Though it's not totally clear to me yet why this suddenly broke.

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label May 2, 2019
@runcom
Member Author

runcom commented May 2, 2019

Though it's not totally clear to me yet why this suddenly broke.

I'm not sure about the e2e failures either; it looks like a general slowdown? The race in node_controller has apparently been there for a long time.

@cgwalters
Member

Looking at more PRs I see this consistently:

            "spec": {
                "providerID": "aws:///us-east-1b/i-0cc60b89b38b0eb6d",
                "taints": [
                    {
                        "effect": "NoSchedule",
                        "key": "node.kubernetes.io/unschedulable",
                        "timeAdded": "2019-05-02T17:11:05Z"
                    },
                    {
                        "effect": "NoSchedule",
                        "key": "node.kubernetes.io/unreachable",
                        "timeAdded": "2019-05-02T17:11:41Z"
                    },
                    {
                        "effect": "NoExecute",
                        "key": "node.kubernetes.io/unreachable",
                        "timeAdded": "2019-05-02T17:11:47Z"
                    }
                ],
                "unschedulable": true
            },

which looks to me like something is actually killing that node.

@cgwalters
Member

which looks to me like something is actually killing that node.

Or hm maybe that's normal while it's rebooting.

@runcom
Member Author

runcom commented May 3, 2019

Alright, I've reverted the very first commit about not progressing the pool if it's not idle; it was breaking the rollback on a bad MC. I've taken that out, and I'm opening another PR so we can get this in to unblock tests (new PR here: #699).

@kikisdeliveryservice
Contributor

/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label May 3, 2019
@@ -135,7 +135,7 @@ const (
pathStateJSON = "/etc/machine-config-daemon/state.json"
// currentConfigPath is where we store the current config on disk to validate
// against annotations changes
currentConfigPath = "/var/machine-config-daemon/currentconfig"
Member

Wait I'm confused, at this point aren't we chrooted into the host?

Member Author

Yeah, I changed this for consistency with the other /etc paths and since /etc/machine-config-daemon is mounted in the daemonset. I'll wait for tests to report and revert this if you don't want it (and I understand why you might not).

@kikisdeliveryservice
Contributor

@cgwalters added a question so:
/lgtm cancel

@openshift-ci-robot openshift-ci-robot removed the lgtm Indicates that a PR is ready to be merged. label May 3, 2019
@kikisdeliveryservice
Contributor

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label May 3, 2019
@runcom
Member Author

runcom commented May 3, 2019

/retest

@cgwalters
Member

I wanted to look at the logs of that passed run to see if my new logging added any information; turns out the logs are now compressed twice just FYI everyone.

I noticed this in the MCD logs:

I0503 19:18:27.290073    2440 start.go:214] Shutting down MachineConfigDaemon
panic: close of closed channel

goroutine 1 [running]:
main.runStartCmd(0x1d55a60, 0x1d7e4f8, 0x0, 0x0)
	/go/src/github.com/openshift/machine-config-operator/cmd/machine-config-daemon/start.go:214 +0xaa9
github.com/openshift/machine-config-operator/vendor/github.com/spf13/cobra.(*Command).execute(0x1d55a60, 0x1d7e4f8, 0x0, 0x0, 0x1d55a60, 0x1d7e4f8)
	/go/src/github.com/openshift/machine-config-operator/vendor/github.com/spf13/cobra/command.go:766 +0x2c1
github.com/openshift/machine-config-operator/vendor/github.com/spf13/cobra.(*Command).ExecuteC(0x1d55800, 0x0, 0xc42043df78, 0x403dfc)
	/go/src/github.com/openshift/machine-config-operator/vendor/github.com/spf13/cobra/command.go:852 +0x30a
github.com/openshift/machine-config-operator/vendor/github.com/spf13/cobra.(*Command).Execute(0x1d55800, 0xc42043df60, 0x0)
	/go/src/github.com/openshift/machine-config-operator/vendor/github.com/spf13/cobra/command.go:800 +0x2b
main.main()
	/go/src/github.com/openshift/machine-config-operator/cmd/machine-config-daemon/main.go:27 +0x31

Honestly I don't know why we're trying to do this stop channel stuff, let's just exit the process?
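
For context on the panic: closing a Go channel twice always fails with exactly that "close of closed channel" message. Here is a minimal sketch (an assumed shape, not the actual start.go) of the two ways out being discussed: make the close idempotent with sync.Once, or skip the stop-channel dance and exit the process directly.

package main

import (
	"log"
	"os"
	"os/signal"
	"sync"
	"syscall"
)

func main() {
	stopCh := make(chan struct{})

	// Closing stopCh from more than one code path panics with
	// "close of closed channel"; sync.Once makes the close idempotent.
	var once sync.Once
	stop := func() { once.Do(func() { close(stopCh) }) }

	sigCh := make(chan os.Signal, 1)
	signal.Notify(sigCh, syscall.SIGTERM)
	go func() {
		<-sigCh
		stop() // safe even if shutdown was already triggered elsewhere
		// ...or, as suggested above, skip the channel entirely: os.Exit(0)
	}()

	<-stopCh
	log.Print("Shutting down MachineConfigDaemon")
}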

@kikisdeliveryservice
Contributor

I wanted to look at the logs of that passed run to see if my new logging added any information; turns out the logs are now compressed twice just FYI everyone.

Ohhh, so that's why I had problems yesterday afternoon. Thanks!

@openshift-ci-robot openshift-ci-robot removed the lgtm Indicates that a PR is ready to be merged. label May 3, 2019
@runcom
Member Author

runcom commented May 3, 2019

Honestly I don't know why we're trying to do this stop channel stuff, let's just exit the process?

fixed that in the latest commit

@cgwalters
Member

This PR feels like a bit of a mess now... it's basically ended up just adjusting the timeouts, right? If that's the case, maybe it's clearer to do separate PRs for the other things. But eh.

There was a theory about this being network related with crio/kubelet; wonder if we just ended up winning a race.

Anyways I was looking at the controller logs in the passed e2e-aws-op run (after decompressing twice) and this looks odd:

I0503 19:18:04.818702       1 node_controller.go:476] Setting node ip-10-0-132-181.ec2.internal to desired config rendered-worker-fb998f7806949569640baf62cd48d0ad
I0503 19:18:04.830101       1 node_controller.go:476] Setting node ip-10-0-152-113.ec2.internal to desired config rendered-worker-fb998f7806949569640baf62cd48d0ad
I0503 19:18:04.840559       1 node_controller.go:476] Setting node ip-10-0-140-228.ec2.internal to desired config rendered-worker-fb998f7806949569640baf62cd48d0ad
I0503 19:18:04.851071       1 status.go:202] Node ip-10-0-132-181.ec2.internal unavailable: different configs (desired: rendered-worker-fb998f7806949569640baf62cd48d0ad, current rendered-worker-5eedbe776afba6cb62ce01ba6255a653) or node not ready false
I0503 19:18:04.851190       1 status.go:202] Node ip-10-0-152-113.ec2.internal unavailable: different configs (desired: rendered-worker-fb998f7806949569640baf62cd48d0ad, current rendered-worker-5eedbe776afba6cb62ce01ba6255a653) or node not ready false
... repeats some ...
I0503 19:18:14.841463       1 status.go:202] Node ip-10-0-140-228.ec2.internal unavailable: different configs (desired: rendered-worker-fb998f7806949569640baf62cd48d0ad, current rendered-worker-5eedbe776afba6cb62ce01ba6255a653) or node not ready false
I0503 19:18:19.849176       1 node_controller.go:476] Setting node ip-10-0-132-181.ec2.internal to desired config rendered-worker-5eedbe776afba6cb62ce01ba6255a653
I0503 19:18:19.864522       1 node_controller.go:476] Setting node ip-10-0-152-113.ec2.internal to desired config rendered-worker-5eedbe776afba6cb62ce01ba6255a653
I0503 19:18:19.864672       1 node_controller.go:207] Pool worker: node ip-10-0-132-181.ec2.internal has completed update to rendered-worker-5eedbe776afba6cb62ce01ba6255a653
I0503 19:18:19.876107       1 node_controller.go:476] Setting node ip-10-0-140-228.ec2.internal to desired config rendered-worker-5eedbe776afba6cb62ce01ba6255a653
I0503 19:18:19.877922       1 node_controller.go:207] Pool worker: node ip-10-0-152-113.ec2.internal has completed update to rendered-worker-5eedbe776afba6cb62ce01ba6255a653
I0503 19:18:19.889322       1 node_controller.go:207] Pool worker: node ip-10-0-140-228.ec2.internal has completed update to rendered-worker-5eedbe776afba6cb62ce01ba6255a653
I0503 19:18:24.864998       1 status.go:178] Node ip-10-0-132-181.ec2.internal is reporting Unschedulable
I0503 19:18:24.865030       1 status.go:202] Node ip-10-0-132-181.ec2.internal unavailable: different configs (desired: rendered-worker-5eedbe776afba6cb62ce01ba6255a653, current rendered-worker-5eedbe776afba6cb62ce01ba6255a653) or node not ready true
I0503 19:18:24.865083       1 status.go:178] Node ip-10-0-140-228.ec2.internal is reporting Unschedulable
I0503 19:18:24.865089       1 status.go:202] Node ip-10-0-140-228.ec2.internal unavailable: different configs (desired: rendered-worker-5eedbe776afba6cb62ce01ba6255a653, current rendered-worker-5eedbe776afba6cb62ce01ba6255a653) or node not ready true
I0503 19:18:24.865095       1 status.go:178] Node ip-10-0-132-181.ec2.internal is reporting Unschedulable
...
I0503 19:21:06.775118       1 status.go:178] Node ip-10-0-140-228.ec2.internal is reporting Unschedulable
I0503 19:21:06.775124       1 status.go:202] Node ip-10-0-140-228.ec2.internal unavailable: different configs (desired: rendered-worker-5eedbe776afba6cb62ce01ba6255a653, current rendered-worker-5eedbe776afba6cb62ce01ba6255a653) or node not ready true

You can see how, when we revert the config to the previous one, we get the "completed update" message (which I hadn't considered when adding it, but it makes sense).

However, for 2 minutes afterwards the nodes are still marked Unschedulable. Then the logs stop...and I don't see the nodes as Unschedulable in the nodes.json so I guess it cleared up? But why/how were they Unschedulable for 2 minutes after they shouldn't have been?

@runcom
Member Author

runcom commented May 3, 2019

I'm not sure what we're doing in those 2 minutes. Are the MCD logs telling us something? It may be stuck in checkstateonboot?

@cgwalters
Member

cgwalters commented May 3, 2019

Here's another interesting thing: check out the logs from this failed run, and specifically note that the MCD is in an infinite loop of:

I0503 17:46:48.659693    1586 update.go:180] Checking reconcilable for config rendered-worker-9dbe0ae65f449ba6974cd4b769c3f539 to rendered-worker-7abedcbf98d3be5c0b4b6f7f5c575e6f
I0503 17:46:48.659724    1586 update.go:239] Checking if configs are reconcilable
I0503 17:46:48.661298    1586 update.go:708] can't reconcile config rendered-worker-9dbe0ae65f449ba6974cd4b769c3f539 with rendered-worker-7abedcbf98d3be5c0b4b6f7f5c575e6f: invalid Ignition config found: error: no filesystem specified
I0503 17:46:48.664038    1586 daemon.go:411] Unable to apply update: can't reconcile config rendered-worker-9dbe0ae65f449ba6974cd4b769c3f539 with rendered-worker-7abedcbf98d3be5c0b4b6f7f5c575e6f: invalid Ignition config found: error: no filesystem specified: unreconcilable

Which...hum indeed, I think we want

--- a/test/e2e/mcd_test.go
+++ b/test/e2e/mcd_test.go
@@ -288,7 +288,7 @@ func getNodesByRole(cs *framework.ClientSet, role string) ([]v1.Node, error) {
 func TestPoolDegradedOnFailToRender(t *testing.T) {
        cs := framework.NewClientSet("")
 
-       mcadd := createMCToAddFile("add-a-file", "/etc/mytestconfs", "test", "")
+       mcadd := createMCToAddFile("add-a-file", "/etc/mytestconfs", "test", "root")
        mcadd.Spec.Config.Ignition.Version = "" // invalid, won't render
 
        // create the dummy MC now

Right? It looks like this test intended to have validation fail due to the version, not the filesystem.

So it looks like that MC source was deleted but its contents are still in the rendered version...the pool is correctly targeting the new one

                        "reason": "All nodes are updating to rendered-worker-9dbe0ae65f449ba6974cd4b769c3f539",

but the MCD hasn't noticed this. Actually the problem is the node controller hasn't changed the desired annotation:

                    "machineconfiguration.openshift.io/desiredConfig": "rendered-worker-7abedcbf98d3be5c0b4b6f7f5c575e6f",

@runcom
Member Author

runcom commented May 3, 2019

I believe the failed run you're looking at was just related to the fact that we couldn't reconcile back with the patch I've now reverted (it's in the other PR now). That test is supposed to create a failing MC, which should roll back when you delete it (that's what the test does).

@runcom
Member Author

runcom commented May 3, 2019

/retest

@hexfusion
Contributor

/test e2e-aws-upgrade

@hexfusion
Contributor

Shazam!!

@cgwalters
Member

OK, might as well get this in
/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label May 4, 2019
@openshift-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: cgwalters, kikisdeliveryservice, runcom

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • OWNERS [cgwalters,kikisdeliveryservice,runcom]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-merge-robot openshift-merge-robot merged commit ee0f5cd into openshift:master May 4, 2019
@runcom runcom deleted the race-mcc branch May 4, 2019 09:42