
Bug 1834948: increase eviction time to avoid preventable timeouts #1739

Merged

Conversation

kikisdeliveryservice
Contributor

@kikisdeliveryservice kikisdeliveryservice commented May 19, 2020

A 20s timeout was added when we switched to the kubectl drain lib. This results in consistent error-retries for the router pod, each of which takes ~90s. Upped the timeout to 90s to avoid these preventable and predictable errors.
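For reference, here is a minimal sketch of where such a per-attempt timeout lives when draining through the upstream kubectl drain library (k8s.io/kubectl/pkg/drain) that the MCO switched to. The field set and values below are illustrative assumptions, not the actual MCO configuration:

```go
package main

import (
	"context"
	"os"
	"time"

	"k8s.io/client-go/kubernetes"
	"k8s.io/kubectl/pkg/drain"
)

// drainNode runs a single drain attempt against nodeName. The point of the
// sketch is that Timeout bounds how long one attempt may wait for evictions
// to finish; pods that need longer than this (e.g. openshift-ingress/router)
// make the attempt error out.
func drainNode(client kubernetes.Interface, nodeName string) error {
	helper := &drain.Helper{
		Ctx:                 context.TODO(), // present in recent library versions
		Client:              client,
		Force:               true,
		IgnoreAllDaemonSets: true,
		GracePeriodSeconds:  -1, // use each pod's own grace period
		// The value under discussion in this PR; too small a value (20s)
		// guarantees errors for pods that legitimately evict slowly.
		Timeout: 90 * time.Second,
		Out:     os.Stdout,
		ErrOut:  os.Stderr,
	}
	return drain.RunNodeDrain(helper, nodeName)
}
```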

@openshift-ci-robot openshift-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. bugzilla/severity-urgent Referenced Bugzilla bug's severity is urgent for the branch this PR is targeting. bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. labels May 19, 2020
@openshift-ci-robot
Contributor

@kikisdeliveryservice: This pull request references Bugzilla bug 1834948, which is valid. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target release (4.5.0) matches configured target release for branch (4.5.0)
  • bug is in the state NEW, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST, POST)

In response to this:

[WIP] Bug 1834948: increase eviction time to avoid preventable timeouts

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 19, 2020
@openshift-ci-robot
Contributor

@kikisdeliveryservice: This pull request references Bugzilla bug 1834948, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target release (4.5.0) matches configured target release for branch (4.5.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST, POST)

In response to this:

[WIP] Bug 1834948: increase eviction time to avoid preventable timeouts

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

1 similar comment

@kikisdeliveryservice
Contributor Author

/retest

@runcom
Member

runcom commented May 20, 2020

/approve
/retest

@kikisdeliveryservice
Contributor Author

/skip

@kikisdeliveryservice
Contributor Author

/skip

@kikisdeliveryservice
Contributor Author

/test e2e-gcp-upgrade

1 similar comment

@sinnykumari
Contributor

Looks like 60 seconds is not enough either for some pods to finish eviction: https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_machine-config-operator/1739/pull-ci-openshift-machine-config-operator-master-e2e-gcp-op/2315/artifacts/e2e-gcp-op/pods/openshift-machine-config-operator_machine-config-daemon-twlk5_machine-config-daemon.log

I'm not sure it's a good idea to keep increasing timeouts. Do we know which pods could impact the upgrade if eviction fails? It's possible that the underlying issue in #1578 is something different that isn't captured by oc describe.

@kikisdeliveryservice
Contributor Author

kikisdeliveryservice commented May 21, 2020

Looks like 60 seconds is not enough either for some pods to finish eviction: https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_machine-config-operator/1739/pull-ci-openshift-machine-config-operator-master-e2e-gcp-op/2315/artifacts/e2e-gcp-op/pods/openshift-machine-config-operator_machine-config-daemon-twlk5_machine-config-daemon.log

I'm not sure it's a good idea to keep increasing timeouts. Do we know which pods could impact the upgrade if eviction fails? It's possible that the underlying issue in #1578 is something different that isn't captured by oc describe.

Yes, I noted this on the bug. I think this is in fact a router issue, not an MCO issue, since even increasing the timeout to 60s (which was more than the time they said they needed) resulted in the same failures.
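To illustrate why a too-short per-attempt timeout shows up as repeated, predictable errors rather than a single failure, here is a rough sketch of a drain attempt wrapped in a retry loop. This is an assumed shape, not the MCO's actual drain code; wait.ExponentialBackoff and drain.RunNodeDrain are upstream helpers, but the backoff values are made up:

```go
package main

import (
	"fmt"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/kubectl/pkg/drain"
)

// drainWithRetries repeats a full drain attempt with exponential backoff.
// If helper.Timeout is shorter than the slowest pod's eviction (the router
// case discussed above), every early attempt fails the same way, which is
// exactly the "preventable and predictable" error pattern this PR removes
// by raising the timeout.
func drainWithRetries(helper *drain.Helper, nodeName string) error {
	backoff := wait.Backoff{
		Duration: 10 * time.Second, // made-up values, for illustration only
		Factor:   2,
		Steps:    5,
	}
	return wait.ExponentialBackoff(backoff, func() (bool, error) {
		if err := drain.RunNodeDrain(helper, nodeName); err != nil {
			fmt.Printf("drain of node %s failed, will retry: %v\n", nodeName, err)
			return false, nil // keep retrying
		}
		return true, nil // drained successfully
	})
}
```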

@kikisdeliveryservice
Contributor Author

I don't think this PR should be used to close the BZ, since we're still seeing the issue, though I do think it's correct to increase the timeout to 45s, as that is what's expected.

@kikisdeliveryservice
Contributor Author

/retest

@sinnykumari
Contributor

I don't think this PR should be used to close the BZ, since we're still seeing the issue, though I do think it's correct to increase the timeout to 45s, as that is what's expected.

Makes sense to me, unless there are other thoughts from the team.

@sinnykumari
Contributor

/retest

@ashcrow
Member

ashcrow commented May 26, 2020

I don't think this PR should be used to close the BZ, since we're still seeing the issue, though I do think it's correct to increase the timeout to 45s, as that is what's expected.

Makes sense to me, unless there are other thoughts from the team.

I also agree... upping to 45s, which is what is currently expected, is a good change, though it doesn't fix the issue at hand, which, thanks to the great debugging here, looks like it may reside in the router.

…h longer expected eviction times, for example: openshift-ingress/router

The 20s timeout was added when the switch to the upstream drain lib was made, which resulted in the router pod consistently erroring at 20s.
@kikisdeliveryservice
Contributor Author

 level=info msg="Cluster operator authentication Progressing is True with _WellKnownNotReady: Progressing: got '404 Not Found' status while trying to GET the OAuth well-known https://10.0.0.4:6443/.well-known/oauth-authorization-server endpoint data"
level=info msg="Cluster operator authentication Available is False with : "
level=error msg="Cluster operator console Degraded is True with RouteHealth_StatusError: RouteHealthDegraded: route not yet available, https://console-openshift-console.apps.ci-op-42r0zyz8-1354f.origin-ci-int-gce.dev.openshift.com/health returns '503 Service Unavailable'" 

/test e2e-gcp-op

@kikisdeliveryservice
Contributor Author

/skip

1 similar comment

@openshift-ci-robot
Contributor

openshift-ci-robot commented May 27, 2020

@kikisdeliveryservice: The following tests failed, say /retest to rerun all failed tests:

Test name Commit Details Rerun command
ci/prow/e2e-aws-scaleup-rhel7 b40112d link /test e2e-aws-scaleup-rhel7
ci/prow/e2e-metal-ipi b40112d link /test e2e-metal-ipi

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@kikisdeliveryservice
Contributor Author

Something seems weird with our CI job: masters are not coming up (obviously not related to this PR). Will investigate.

@kikisdeliveryservice
Contributor Author

/retest

@kikisdeliveryservice kikisdeliveryservice changed the title [WIP] Bug 1834948: increase eviction time to avoid preventable timeouts Bug 1834948: increase eviction time to avoid preventable timeouts May 27, 2020
@openshift-ci-robot openshift-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label May 27, 2020
@openshift-ci-robot
Contributor

@kikisdeliveryservice: This pull request references Bugzilla bug 1834948, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target release (4.5.0) matches configured target release for branch (4.5.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST, POST)

In response to this:

Bug 1834948: increase eviction time to avoid preventable timeouts

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

1 similar comment

@kikisdeliveryservice
Contributor Author

Seems good now, with no effect on overall drain time and no weird router eviction errors that would confuse users.

/assign @runcom

@runcom
Member

runcom commented May 28, 2020

/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label May 28, 2020
@openshift-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: kikisdeliveryservice, runcom

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • OWNERS [kikisdeliveryservice,runcom]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-merge-robot openshift-merge-robot merged commit 6595188 into openshift:master May 28, 2020
@openshift-ci-robot
Contributor

@kikisdeliveryservice: All pull requests linked via external trackers have merged: openshift/machine-config-operator#1739. Bugzilla bug 1834948 has been moved to the MODIFIED state.

In response to this:

Bug 1834948: increase eviction time to avoid preventable timeouts

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. bugzilla/severity-urgent Referenced Bugzilla bug's severity is urgent for the branch this PR is targeting. bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. lgtm Indicates that a PR is ready to be merged.

6 participants