lower timeout to increase retries #1405
Conversation
/assign @soltysh Maciej, can you give this a look when you have the time?
this looks good to me, but I'd like to see a proof PR before merge
/lgtm
/approve
/hold
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: neisw, soltysh. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
openshift/cluster-kube-scheduler-operator#443 is the proof PR. The first two test passes failed; I commented out the change and they passed, then uncommented the change and they passed again: https://prow.ci.openshift.org/pr-history/?org=openshift&repo=cluster-kube-scheduler-operator&pr=443. Presumably the early failures were environmental (bootstrapping failed), but we could retest a few more times if we want.
The test retries succeeded as well.
/hold cancel
@neisw: all tests passed! Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.
Researching https://issues.redhat.com/browse/TRT-589 shows installer pod failures. The belief is that we are taking 1 minute to time out while trying to fetch secrets. The expectation is that an individual secret should be fetched in under a second, so the connection is likely getting hung up. This PR lowers the timeout to 14 seconds so we can make use of the retry attempts: if we are hung and time out after 14 seconds, we should still have multiple retries left to reconnect and pull the secret.
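As an illustration only (this is not the operator's actual code; the function name, retry budget, and client wiring are assumptions), here is a minimal client-go sketch of the pattern the description argues for, a short per-attempt timeout inside a retry loop, so a hung connection fails fast instead of consuming the whole window in one attempt:

```go
package installerpod

import (
	"context"
	"fmt"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

const (
	// attemptTimeout is the per-attempt budget; 14s is the value this PR proposes.
	attemptTimeout = 14 * time.Second
	// maxAttempts is a hypothetical retry budget for the sketch.
	maxAttempts = 4
)

// getSecretWithRetries fetches a secret with a short per-attempt timeout so a
// hung connection fails fast and leaves time for further attempts, rather than
// a single 1-minute attempt eating the entire window.
func getSecretWithRetries(client kubernetes.Interface, ns, name string) (*corev1.Secret, error) {
	var lastErr error
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		ctx, cancel := context.WithTimeout(context.Background(), attemptTimeout)
		secret, err := client.CoreV1().Secrets(ns).Get(ctx, name, metav1.GetOptions{})
		cancel()
		if err == nil {
			return secret, nil // healthy fetches complete in well under a second
		}
		lastErr = err
	}
	return nil, fmt.Errorf("getting secret %s/%s after %d attempts: %w", ns, name, maxAttempts, lastErr)
}
```

The trade-off: the worst case stays bounded (here roughly 4 × 14s, still under the old single 1-minute timeout), while a transiently hung connection gets several fresh chances to succeed.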
https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_cluster-monitoring-operator/1795/pull-ci-openshift-cluster-monitoring-operator-master-e2e-agnostic/1579855936237867008 shows:

```
Oct 11 16:54:58.794 - 2102s W alert/KubePodNotReady ns/openshift-kube-scheduler pod/installer-4-ip-10-0-195-95.ec2.internal ALERTS{alertname="KubePodNotReady", alertstate="firing", namespace="openshift-kube-scheduler", pod="installer-4-ip-10-0-195-95.ec2.internal", prometheus="openshift-monitoring/k8s", severity="warning"}
```
https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-monitoring-operator/1795/pull-ci-openshift-cluster-monitoring-operator-master-e2e-agnostic/1579855936237867008/artifacts/e2e-agnostic/openshift-e2e-test/artifacts/junit/resource-pods_20221011-164650.zip

openshift-kube-scheduler -> pods.json has the error:

```
"containerID": "cri-o://f5cbf69798901894b79c66246aec409f546c9896dd26a9cfe3d8fb0704036426"
```