
Bug 1834948: increase eviction time to avoid preventable timeouts #1739

Merged

Conversation

kikisdeliveryservice
Contributor

@kikisdeliveryservice kikisdeliveryservice commented May 19, 2020

A 20s timeout was added when we switched to the kubectl drain lib. This results in consistent error-retries for the router pod, each of which takes ~90s. Upped the timeout to 90s to avoid these preventable and predictable errors.
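For reference, here is a minimal sketch of where such a per-attempt timeout lives when draining through the upstream kubectl drain library (k8s.io/kubectl/pkg/drain) that the MCO switched to. The field set and values below are illustrative assumptions, not the actual MCO configuration:

```go
package main

import (
	"context"
	"os"
	"time"

	"k8s.io/client-go/kubernetes"
	"k8s.io/kubectl/pkg/drain"
)

// drainNode runs a single drain attempt against nodeName. The point of the
// sketch is that Timeout bounds how long one attempt may wait for evictions
// to finish; pods that need longer than this (e.g. openshift-ingress/router)
// make the attempt error out.
func drainNode(client kubernetes.Interface, nodeName string) error {
	helper := &drain.Helper{
		Ctx:                 context.TODO(), // present in recent library versions
		Client:              client,
		Force:               true,
		IgnoreAllDaemonSets: true,
		GracePeriodSeconds:  -1, // use each pod's own grace period
		// The value under discussion in this PR; too small a value (20s)
		// guarantees errors for pods that legitimately evict slowly.
		Timeout: 90 * time.Second,
		Out:     os.Stdout,
		ErrOut:  os.Stderr,
	}
	return drain.RunNodeDrain(helper, nodeName)
}
```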

@openshift-ci-robot openshift-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. bugzilla/severity-urgent Referenced Bugzilla bug's severity is urgent for the branch this PR is targeting. bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. labels May 19, 2020
@openshift-ci-robot
Contributor

@kikisdeliveryservice: This pull request references Bugzilla bug 1834948, which is valid. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target release (4.5.0) matches configured target release for branch (4.5.0)
  • bug is in the state NEW, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST, POST)

In response to this:

[WIP] Bug 1834948: increase eviction time to avoid preventable timeouts

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 19, 2020
@openshift-ci-robot
Contributor

@kikisdeliveryservice: This pull request references Bugzilla bug 1834948, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target release (4.5.0) matches configured target release for branch (4.5.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST, POST)

In response to this:

[WIP] Bug 1834948: increase eviction time to avoid preventable timeouts

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

1 similar comment

@kikisdeliveryservice
Contributor Author

/retest

@runcom
Member

runcom commented May 20, 2020

/approve
/retest

@kikisdeliveryservice
Contributor Author

/skip

@kikisdeliveryservice
Contributor Author

/skip

@kikisdeliveryservice
Contributor Author

/test e2e-gcp-upgrade

1 similar comment

@sinnykumari
Contributor

Looks like 60 seconds is not enough either for some pods to finish eviction: https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_machine-config-operator/1739/pull-ci-openshift-machine-config-operator-master-e2e-gcp-op/2315/artifacts/e2e-gcp-op/pods/openshift-machine-config-operator_machine-config-daemon-twlk5_machine-config-daemon.log

I'm not sure it's a good idea to keep increasing timeouts. Do we know which pods could impact the upgrade if eviction fails? It's possible that the underlying issue in #1578 is something different that isn't captured by oc describe.

@kikisdeliveryservice
Contributor Author

kikisdeliveryservice commented May 21, 2020

Looks like 60 seconds is not enough either for some pods to finish eviction: https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_machine-config-operator/1739/pull-ci-openshift-machine-config-operator-master-e2e-gcp-op/2315/artifacts/e2e-gcp-op/pods/openshift-machine-config-operator_machine-config-daemon-twlk5_machine-config-daemon.log

I'm not sure it's a good idea to keep increasing timeouts. Do we know which pods could impact the upgrade if eviction fails? It's possible that the underlying issue in #1578 is something different that isn't captured by oc describe.

Yes, I noted this on the bug. I think this is in fact a router issue, not an MCO issue, since even increasing the timeout to 60s (which was more than the time they said they needed) resulted in the same failures.
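To illustrate why a too-short per-attempt timeout shows up as repeated, predictable errors rather than a single failure, here is a rough sketch of a drain attempt wrapped in a retry loop. This is an assumed shape, not the MCO's actual drain code; wait.ExponentialBackoff and drain.RunNodeDrain are upstream helpers, but the backoff values are made up:

```go
package main

import (
	"fmt"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/kubectl/pkg/drain"
)

// drainWithRetries repeats a full drain attempt with exponential backoff.
// If helper.Timeout is shorter than the slowest pod's eviction (the router
// case discussed above), every early attempt fails the same way, which is
// exactly the "preventable and predictable" error pattern this PR removes
// by raising the timeout.
func drainWithRetries(helper *drain.Helper, nodeName string) error {
	backoff := wait.Backoff{
		Duration: 10 * time.Second, // made-up values, for illustration only
		Factor:   2,
		Steps:    5,
	}
	return wait.ExponentialBackoff(backoff, func() (bool, error) {
		if err := drain.RunNodeDrain(helper, nodeName); err != nil {
			fmt.Printf("drain of node %s failed, will retry: %v\n", nodeName, err)
			return false, nil // keep retrying
		}
		return true, nil // drained successfully
	})
}
```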

@kikisdeliveryservice
Contributor Author

I don't think this PR should be used to close the BZ, since we're still seeing the issue, though I do think it's correct to increase the timeout to 45s, as that is what's expected.

@kikisdeliveryservice
Contributor Author

/retest

@sinnykumari
Contributor

I don't think this PR should be used to close the BZ, since we're still seeing the issue, though I do think it's correct to increase the timeout to 45s, as that is what's expected.

Makes sense to me, unless there are other thoughts from the team.

@sinnykumari
Contributor

/retest

@ashcrow
Member

ashcrow commented May 26, 2020

I don't think this PR should be used to close the BZ, since we're still seeing the issue, though I do think it's correct to increase the timeout to 45s, as that is what's expected.

Makes sense to me, unless there are other thoughts from the team.

I also agree... upping to 45s, which is what is currently expected, is a good change, though it doesn't fix the issue at hand, which, thanks to the great debugging here, looks like it may reside in the router.

…h longer expected eviction times, for example: openshift-ingress/router

The 20s timeout was added when the switch to the upstream drain lib was made, which resulted in the router pod consistently erroring at 20s.
@kikisdeliveryservice
Contributor Author

 level=info msg="Cluster operator authentication Progressing is True with _WellKnownNotReady: Progressing: got '404 Not Found' status while trying to GET the OAuth well-known https://10.0.0.4:6443/.well-known/oauth-authorization-server endpoint data"
level=info msg="Cluster operator authentication Available is False with : "
level=error msg="Cluster operator console Degraded is True with RouteHealth_StatusError: RouteHealthDegraded: route not yet available, https://console-openshift-console.apps.ci-op-42r0zyz8-1354f.origin-ci-int-gce.dev.openshift.com/health returns '503 Service Unavailable'" 

/test e2e-gcp-op

@kikisdeliveryservice
Contributor Author

/skip

1 similar comment

@openshift-ci-robot
Contributor

openshift-ci-robot commented May 27, 2020

@kikisdeliveryservice: The following tests failed, say /retest to rerun all failed tests:

Test name Commit Details Rerun command
ci/prow/e2e-aws-scaleup-rhel7 b40112d link /test e2e-aws-scaleup-rhel7
ci/prow/e2e-metal-ipi b40112d link /test e2e-metal-ipi

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@kikisdeliveryservice
Contributor Author

Something seems weird with our CI job: masters are not coming up (obviously not related to this PR). Will investigate.

@kikisdeliveryservice
Contributor Author

/retest

@kikisdeliveryservice kikisdeliveryservice changed the title [WIP] Bug 1834948: increase eviction time to avoid preventable timeouts Bug 1834948: increase eviction time to avoid preventable timeouts May 27, 2020
@openshift-ci-robot openshift-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label May 27, 2020
@openshift-ci-robot
Contributor

@kikisdeliveryservice: This pull request references Bugzilla bug 1834948, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target release (4.5.0) matches configured target release for branch (4.5.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST, POST)

In response to this:

Bug 1834948: increase eviction time to avoid preventable timeouts

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

1 similar comment

@kikisdeliveryservice
Contributor Author

Seems good now, with no effect on overall drain time and no weird router eviction errors that would confuse users.

/assign @runcom

@runcom
Member

runcom commented May 28, 2020

/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label May 28, 2020
@openshift-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: kikisdeliveryservice, runcom

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • OWNERS [kikisdeliveryservice,runcom]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-merge-robot openshift-merge-robot merged commit 6595188 into openshift:master May 28, 2020
@openshift-ci-robot
Contributor

@kikisdeliveryservice: All pull requests linked via external trackers have merged: openshift/machine-config-operator#1739. Bugzilla bug 1834948 has been moved to the MODIFIED state.

In response to this:

Bug 1834948: increase eviction time to avoid preventable timeouts

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. bugzilla/severity-urgent Referenced Bugzilla bug's severity is urgent for the branch this PR is targeting. bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. lgtm Indicates that a PR is ready to be merged.

6 participants