Bug 1834948: increase eviction time to avoid preventable timeouts #1739
Conversation
@kikisdeliveryservice: This pull request references Bugzilla bug 1834948, which is valid. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker. 3 validation(s) were run on this bug
In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
@kikisdeliveryservice: This pull request references Bugzilla bug 1834948, which is valid. 3 validation(s) were run on this bug
In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
1 similar comment
/retest
/approve
/skip
I still see the router timing out even at 45s; less often, but it's still happening...
Force-pushed from decf706 to e01d259
/skip
/test e2e-gcp-upgrade
1 similar comment
Looks like 60 seconds is not enough either for some pods to finish eviction: https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_machine-config-operator/1739/pull-ci-openshift-machine-config-operator-master-e2e-gcp-op/2315/artifacts/e2e-gcp-op/pods/openshift-machine-config-operator_machine-config-daemon-twlk5_machine-config-daemon.log
I'm not sure it's a good idea to keep increasing timeouts. Do we know which pods could impact the upgrade if their eviction failed? It's possible that the underlying issue in #1578 is something different that isn't captured by oc describe.
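(For anyone following along: the timeout being tuned here is the one handed to the upstream drain helper. Below is a minimal sketch of where that knob lives, assuming k8s.io/kubectl/pkg/drain; the surrounding wiring is illustrative rather than the actual MCO code, and some field names vary across kubectl versions, e.g. DeleteLocalData vs the later DeleteEmptyDirData.)

```go
package example

import (
	"os"
	"time"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/kubectl/pkg/drain"
)

// newDrainHelper builds the upstream drain helper. Timeout bounds the
// whole drain, including pod evictions; if a pod (like the router)
// needs longer than this to terminate, the drain call errors out and
// the caller has to retry.
func newDrainHelper(client kubernetes.Interface) *drain.Helper {
	return &drain.Helper{
		Client:              client,
		Force:               true,
		IgnoreAllDaemonSets: true,
		DeleteLocalData:     true,
		GracePeriodSeconds:  -1, // use each pod's own grace period
		Timeout:             60 * time.Second, // the value under discussion
		Out:                 os.Stdout,
		ErrOut:              os.Stderr,
	}
}

// drainNode cordons the node and then drains it with the helper above.
func drainNode(client kubernetes.Interface, node *corev1.Node) error {
	helper := newDrainHelper(client)
	if err := drain.RunCordonOrUncordon(helper, node, true); err != nil {
		return err
	}
	return drain.RunNodeDrain(helper, node.Name)
}
```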
Force-pushed from e01d259 to 070cebb
Yes, I noted this on the bug, as I think this is in fact a router issue, not an MCO issue: even increasing to 60s (which was more than the time they said they needed) resulted in the same failures.
I don't think this PR should be used to close the BZ, since we're still seeing the issue, though I do think it's correct to increase to 45s, as that is what's expected.
/retest
Makes sense to me, unless there are any other thoughts from the team.
/retest
I also agree... upping to 45s, which is what is currently expected, is a good change, though it doesn't fix the issue at hand, which, thanks to the great debugging here, looks like it may reside in the router.
…h longer expected eviction times, for example openshift-ingress/router. The 20s timeout was added when the switch to the upstream drain lib was made, which resulted in the router pod consistently erroring at 20s.
Force-pushed from 911478a to b40112d
/test e2e-gcp-op
/skip
1 similar comment
@kikisdeliveryservice: The following tests failed, say /retest to rerun all failed tests:
Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.
Something seems weird with our CI job... masters not coming up (obviously not related to this PR)... will investigate.
/retest
@kikisdeliveryservice: This pull request references Bugzilla bug 1834948, which is valid. 3 validation(s) were run on this bug
In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
1 similar comment
Seems good now, with no effect on overall drain time and no weird router eviction errors that would leave users confused... /assign @runcom
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: kikisdeliveryservice, runcom. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
@kikisdeliveryservice: All pull requests linked via external trackers have merged: openshift/machine-config-operator#1739. Bugzilla bug 1834948 has been moved to the MODIFIED state. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
The 20s timeout was added when we switched to the kubectl drain lib. It results in consistent error-retries for the router pod, which takes ~90s each time. Upped the timeout to 90s to avoid these preventable and predictable errors.
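As a rough sketch of the failure mode this fixes (the names below are hypothetical, not the actual MCO code): with a per-attempt drain timeout shorter than the router's ~90s eviction, every attempt burns the full timeout, errors, and retries.

```go
package example

import (
	"fmt"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

// Per-attempt budget handed to the drain helper. At 20s the router pod,
// which needs ~90s to terminate cleanly, errored on every attempt and
// forced a retry; at 90s a single attempt can complete.
const drainTimeout = 90 * time.Second // was 20 * time.Second

// runDrain stands in for a call into k8s.io/kubectl/pkg/drain with
// helper.Timeout set to the given timeout.
func runDrain(nodeName string, timeout time.Duration) error {
	// ... drain.RunNodeDrain(helper, nodeName) ...
	return nil
}

// drainNode retries until the drain succeeds. With too small a
// per-attempt timeout, every iteration pays the full timeout and then
// retries, which is the consistent error-retry loop described above.
func drainNode(nodeName string) error {
	return wait.PollImmediate(10*time.Second, 10*time.Minute, func() (bool, error) {
		if err := runDrain(nodeName, drainTimeout); err != nil {
			fmt.Printf("drain of %s failed, retrying: %v\n", nodeName, err)
			return false, nil
		}
		return true, nil
	})
}
```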