lower timeout to increase retries #1405
Conversation
/assign @soltysh Maciej, can you give this a look when you have the time?
this looks good to me, but I'd like to see a proof PR before merge
/lgtm
/approve
/hold
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: neisw, soltysh. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
openshift/cluster-kube-scheduler-operator#443 is the proof PR. The first two test passes failed; I commented out the change and they passed, then uncommented the change and they passed again: https://prow.ci.openshift.org/pr-history/?org=openshift&repo=cluster-kube-scheduler-operator&pr=443. Presumably the early failures were environmental (bootstrapping failed), but we could retest a few more times if we want.
The test retries succeeded as well.
/hold cancel
@neisw: all tests passed! Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.
Researching https://issues.redhat.com/browse/TRT-589 shows installer pod failures. The belief is that we are taking 1 minute to time out while trying to fetch secrets. The expectation is that an individual secret should be fetched in under a second, so the connection is likely getting hung up. This PR lowers the timeout to 14 seconds so we can make use of the retry attempts: if we are hung and time out after 14 seconds, we should still have multiple retries left to reconnect and pull the secret.
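As an illustration only (this is not the operator's actual code; the function name, retry budget, and client wiring are assumptions), here is a minimal client-go sketch of the pattern the description argues for, a short per-attempt timeout inside a retry loop, so a hung connection fails fast instead of consuming the whole window in one attempt:

```go
package installerpod

import (
	"context"
	"fmt"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

const (
	// attemptTimeout is the per-attempt budget; 14s is the value this PR proposes.
	attemptTimeout = 14 * time.Second
	// maxAttempts is a hypothetical retry budget for the sketch.
	maxAttempts = 4
)

// getSecretWithRetries fetches a secret with a short per-attempt timeout so a
// hung connection fails fast and leaves time for further attempts, rather than
// a single 1-minute attempt eating the entire window.
func getSecretWithRetries(client kubernetes.Interface, ns, name string) (*corev1.Secret, error) {
	var lastErr error
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		ctx, cancel := context.WithTimeout(context.Background(), attemptTimeout)
		secret, err := client.CoreV1().Secrets(ns).Get(ctx, name, metav1.GetOptions{})
		cancel()
		if err == nil {
			return secret, nil // healthy fetches complete in well under a second
		}
		lastErr = err
	}
	return nil, fmt.Errorf("getting secret %s/%s after %d attempts: %w", ns, name, maxAttempts, lastErr)
}
```

The trade-off: the worst case stays bounded (here roughly 4 × 14s, still under the old single 1-minute timeout), while a transiently hung connection gets several fresh chances to succeed.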
https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_cluster-monitoring-operator/1795/pull-ci-openshift-cluster-monitoring-operator-master-e2e-agnostic/1579855936237867008 shows:

```
Oct 11 16:54:58.794 - 2102s W alert/KubePodNotReady ns/openshift-kube-scheduler pod/installer-4-ip-10-0-195-95.ec2.internal ALERTS{alertname="KubePodNotReady", alertstate="firing", namespace="openshift-kube-scheduler", pod="installer-4-ip-10-0-195-95.ec2.internal", prometheus="openshift-monitoring/k8s", severity="warning"}
```
https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-monitoring-operator/1795/pull-ci-openshift-cluster-monitoring-operator-master-e2e-agnostic/1579855936237867008/artifacts/e2e-agnostic/openshift-e2e-test/artifacts/junit/resource-pods_20221011-164650.zip

openshift-kube-scheduler -> pods.json has the error:

```
"containerID": "cri-o://f5cbf69798901894b79c66246aec409f546c9896dd26a9cfe3d8fb0704036426"
```