pre submit keeps failing: investigate why it takes minutes for pod to be ready some times #537

Closed
ldemailly opened this issue Aug 3, 2017 · 27 comments

@ldemailly
Member

see #533 and many build failures after > 2mins of retry

@ldemailly ldemailly added this to the Istio 0.2 milestone Aug 3, 2017
@ldemailly
Member Author

cc @douglas-reid , @sebastienvas

@sebastienvas
Contributor

sebastienvas commented Aug 3, 2017

Is it the pod or just the ingress? Could we be running out of external IP addresses, or Backend Services? I guess I need someone to help me look at the devconsole to make sure.

@ldemailly ldemailly changed the title investigate why it takes minutes for pod to be ready some times pre submit keeps failing: investigate why it takes minutes for pod to be ready some times Aug 4, 2017
@ldemailly
Member Author

I don't know but it's really crippling - pretty much 100% failure lately on my PR:
#519

screen shot 2017-08-03 at 9 59 19 pm

@sebastienvas
Contributor

I had to recreate the cluster as some firewall rules were missing. Could you try it again?

@Brian-Xincheng-Zhang
Contributor

is this problem solved?

@ldemailly
Member Author

@Brian-Xincheng-Zhang it's a bit less frequent, but unfortunately no, there is still a problem:

W0808 17:22:47.757] E0808 17:22:47.756137   10751 framework.go:199] Failed to complete Init. Error unable to find ingress

https://k8s-gubernator.appspot.com/build/istio-prow/pull/istio_istio/546/istio-presubmit/253/

PS: those seem more like errors or fatals than warnings, btw

@sebastienvas
Contributor

OK, so I am sure we are running out of external IPs. I asked for more, but this is going to take some time. @mandarjog proposes using the pod IP in the meantime. The e2e framework already supports this; we just need to add a flag. @kyessenov, I am not sure whether the pilot e2e supports using the pod IP instead of the ingress. Anyway, should we use the pod IP instead of the ingress?

@ZackButcher
Contributor

For now I don't think there's a reason we need ingress over using the pod IP, and I'm all for it if it means we get faster/more reliable test execution. Long term I think we will want to use ingress resources, since Istio plans to do work on ingress (e.g. policy via Mixer on ingress traffic), but we may still be able to get away with the pod IP of the ingress proxy pod in that case. Either way, we're not there yet, so I don't see a reason to use ingress/assign public IPs today.

@sebastienvas
Contributor

@ldemailly to provide background on why we use a global ip.

@kyessenov
Contributor

Use service cluster IP if you want to address Istio services. Istio doesn't support direct access to pods (yet, by design, ...).

@kyessenov
Contributor

We also should be using ingress resources for externally exposed services. We don't have to expose ingress proxy externally.

@andraxylia
Contributor

The way e2e tests are structured, we do not need to expose ingress as LoadBalancer type, because all requests to ingress are made from within the cluster, via kubectl exec ....

However, I would like to keep the istio.yaml as LoadBalancer type, and use NodePort only when running e2e.

The fix is to add a type field to the ingress template, and set it depending on the context.
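
A minimal sketch of that idea, assuming the ingress Service is rendered from a Go `text/template`. The template text, the `ingressServiceType` and `renderIngress` helpers, and the `e2e` switch are all hypothetical and only illustrate a type field chosen by context; they are not the actual Istio install templates.

```go
package main

import (
	"bytes"
	"fmt"
	"text/template"
)

// ingressServiceTmpl is a hypothetical, trimmed-down Service manifest.
// Only the "type" field is templated; everything else is illustrative.
const ingressServiceTmpl = `apiVersion: v1
kind: Service
metadata:
  name: istio-ingress
spec:
  type: {{ .Type }}
  ports:
  - port: 80
  selector:
    istio: ingress
`

// ingressServiceType picks the Service type from the context:
// NodePort for e2e runs, LoadBalancer for the published istio.yaml.
func ingressServiceType(e2e bool) string {
	if e2e {
		return "NodePort"
	}
	return "LoadBalancer"
}

// renderIngress fills the template with the type chosen for the context.
func renderIngress(e2e bool) (string, error) {
	t, err := template.New("ingress").Parse(ingressServiceTmpl)
	if err != nil {
		return "", err
	}
	var buf bytes.Buffer
	data := struct{ Type string }{ingressServiceType(e2e)}
	if err := t.Execute(&buf, data); err != nil {
		return "", err
	}
	return buf.String(), nil
}

func main() {
	out, err := renderIngress(true) // render the e2e variant
	if err != nil {
		panic(err)
	}
	fmt.Print(out)
}
```

With this shape, the e2e variant never requests an external IP, while the default manifest keeps the LoadBalancer type.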

@sebastienvas
Contributor

We have to wait for @yutongz's changes to be merged in order to use the NodePort. @Brian-Xincheng-Zhang, can you make sure the rbac rules are being updated correctly, as this is blocking Yutong's PR?

@ldemailly
Member Author

ldemailly commented Aug 10, 2017

now it's

W0810 04:15:44.548] I0810 04:15:44.547373   10410 demo_test.go:131] Error talking to productpage: Get http://10.142.0.7:31748/productpage: dial tcp 10.142.0.7:31748: getsockopt: connection refused
W0810 04:15:44.548] E0810 04:15:44.547427   10410 demo_test.go:144] Couldn't get to the bookinfo product page, trying again in 5 second
W0810 04:15:49.688] I0810 04:15:49.686928   10410 demo_test.go:133] Get from page: 404
W0810 04:15:49.688] E0810 04:15:49.686969   10410 demo_test.go:144] Couldn't get to the bookinfo product page, trying again in 10 second
W0810 04:15:59.759] I0810 04:15:59.757391   10410 demo_test.go:133] Get from page: 404
W0810 04:15:59.759] E0810 04:15:59.757431   10410 demo_test.go:144] Couldn't get to the bookinfo product page, trying again in 15 second
W0810 04:16:14.829] I0810 04:16:14.827767   10410 demo_test.go:133] Get from page: 404
W0810 04:16:14.829] E0810 04:16:14.827806   10410 demo_test.go:144] Couldn't get to the bookinfo product page, trying again in 20 second
W0810 04:16:34.899] I0810 04:16:34.897930   10410 demo_test.go:133] Get from page: 404
W0810 04:16:34.899] E0810 04:16:34.897967   10410 demo_test.go:144] Couldn't get to the bookinfo product page, trying again in 25 second
W0810 04:16:59.969] I0810 04:16:59.967835   10410 demo_test.go:133] Get from page: 404
W0810 04:16:59.969] E0810 04:16:59.967862   10410 demo_test.go:144] Couldn't get to the bookinfo product page, trying again in 30 second
W0810 04:17:30.040] I0810 04:17:30.038359   10410 demo_test.go:133] Get from page: 404
W0810 04:17:30.040] E0810 04:17:30.038427   10410 demo_test.go:144] Couldn't get to the bookinfo product page, trying again in 35 second
W0810 04:18:05.109] I0810 04:18:05.107870   10410 demo_test.go:133] Get from page: 404
W0810 04:18:05.109] E0810 04:18:05.107948   10410 demo_test.go:144] Couldn't get to the bookinfo product page, trying again in 40 second
W0810 04:18:45.179] I0810 04:18:45.178009   10410 demo_test.go:133] Get from page: 404
W0810 04:18:45.179] E0810 04:18:45.178031   10410 demo_test.go:144] Couldn't get to the bookinfo product page, trying again in 45 second
W0810 04:19:30.249] I0810 04:19:30.248460   10410 demo_test.go:133] Get from page: 404
W0810 04:19:30.250] E0810 04:19:30.248487   10410 demo_test.go:144] Couldn't get to the bookinfo product page, trying again in 50 second
W0810 04:20:20.321] I0810 04:20:20.320483   10410 demo_test.go:133] Get from page: 404
W0810 04:20:20.322] E0810 04:20:20.320520   10410 framework.go:199] Failed to complete Init. Error unable to set default route

https://k8s-gubernator.appspot.com/build/istio-prow/pull/istio_istio/535/istio-presubmit/277/
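
The log above shows waits growing by 5 seconds per attempt (5, 10, ... 50) before the framework gives up. A rough sketch of that retry pattern follows; `retryLinear`, `sleepRecorder`, `step`, and `errNotReady` are hypothetical names, and this is not the code in `demo_test.go` or `framework.go`. The sleep function is injected so the loop can be exercised without actually waiting.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// step is the increment between waits, taken from the 5-second steps in the log.
const step = 5 * time.Second

// errNotReady stands in for "connection refused" or a 404 from productpage.
var errNotReady = errors.New("productpage not ready")

// sleepRecorder accumulates requested sleeps instead of waiting.
type sleepRecorder struct{ total time.Duration }

func (r *sleepRecorder) sleep(d time.Duration) { r.total += d }

// retryLinear calls fn until it succeeds or maxAttempts is reached,
// sleeping step, 2*step, 3*step, ... between consecutive attempts.
func retryLinear(maxAttempts int, step time.Duration, sleep func(time.Duration), fn func() error) error {
	var err error
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		if err = fn(); err == nil {
			return nil
		}
		if attempt < maxAttempts {
			sleep(time.Duration(attempt) * step)
		}
	}
	return fmt.Errorf("giving up after %d attempts: %w", maxAttempts, err)
}

func main() {
	rec := &sleepRecorder{}
	// 11 attempts with 10 waits in between, as in the log excerpt.
	err := retryLinear(11, step, rec.sleep, func() error { return errNotReady })
	fmt.Println(err)
	fmt.Println("total wait:", rec.total)
}
```

With 11 attempts the waits add up to 5 × (1 + 2 + ... + 10) = 275 seconds, which is consistent with the roughly four and a half minutes between the first and last log lines.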

@yutongz
Contributor

yutongz commented Aug 10, 2017

Exactly, we need to solve this first #550

@sebastienvas
Contributor

This has been resolved by Kuat and Doug, plus increasing the number of IPs on GCP.

@ldemailly
Member Author

it's back:

https://k8s-gubernator.appspot.com/build/istio-prow/pull/istio_istio/592/e2e-suite-rbac-auth/61/

W0823 20:06:22.939] I0823 20:06:22.938362 889 mixer_test.go:458] Error talking to productpage: Get http://10.150.0.6:30298/productpage: dial tcp 10.150.0.6:30298: getsockopt: connection refused
W0823 20:06:22.939] E0823 20:06:22.938392 889 mixer_test.go:471] Couldn't get to the bookinfo product page, trying again in 5 second
I0823 20:06:23.039] === RUN TestGlobalCheckAndReport
W0823 20:06:27.999] I0823 20:06:27.998541 889 mixer_test.go:458] Error talking to productpage: Get http://10.150.0.6:30298/productpage: dial tcp 10.150.0.6:30298: getsockopt: connection refused
W0823 20:06:28.000] E0823 20:06:27.998568 889 mixer_test.go:471] Couldn't get to the bookinfo product page, trying again in 10 second
W0823 20:06:38.060] I0823 20:06:38.058925 889 mixer_test.go:458] Error talking to productpage: Get http://10.150.0.6:30298/productpage: dial tcp 10.150.0.6:30298: getsockopt: connection refused
W0823 20:06:38.060] E0823 20:06:38.058952 889 mixer_test.go:471] Couldn't get to the bookinfo product page, trying again in 15 second
W0823 20:06:53.120] I0823 20:06:53.119375 889 mixer_test.go:458] Error talking to productpage: Get http://10.150.0.6:30298/productpage: dial tcp 10.150.0.6:30298: getsockopt: connection refused
W0823 20:06:53.121] E0823 20:06:53.119407 889 mixer_test.go:471] Couldn't get to the bookinfo product page, trying again in 20 second
W0823 20:07:13.182] I0823 20:07:13.179648 889 mixer_test.go:458] Error talking to productpage: Get http://10.150.0.6:30298/productpage: dial tcp 10.150.0.6:30298: getsockopt: connection refused
W0823 20:07:13.182] E0823 20:07:13.179678 889 mixer_test.go:471] Couldn't get to the bookinfo product page, trying again in 25 second
W0823 20:07:38.243] I0823 20:07:38.240219 889 mixer_test.go:458] Error talking to productpage: Get http://10.150.0.6:30298/productpage: dial tcp 10.150.0.6:30298: getsockopt: connection refused

@ldemailly ldemailly reopened this Aug 23, 2017
@sebastienvas
Contributor

So I looked at the latest runs from prow: https://prow.istio.io/?job=e2e-suite-no_rbac-no_auth. The last 36 runs have been passing without flakes. The only errors that I see were mine, and I was actually changing the framework.

auth enabled never passes, and rbac + no_auth is flaky. It seems to mostly be the mixer test that is failing.

I did find some istioctl exec format issues, and I am thinking this could be because we might be uploading the file (running the presubmit while doing the smoke test) at the same time as we are downloading it. I'll add a condition to check whether an image exists before pushing (or even better, building) it.

@ldemailly
Member Author

check pilot's ?

@sebastienvas
Contributor

sebastienvas commented Aug 27, 2017

I counted less than 5% flakiness (3 actual flakes out of 69 runs) when combining the last mixer-e2esmoktest, pilot-e2esmoketest and e2e-suite-no_rbac-no_auth runs, which I think is pretty good for an e2e test. Obviously less would be much better :)

Coming back to the exec format error: it cannot be caused by downloading a binary while we are uploading it; it has to be related to the way we download and save the file.

@sebastienvas
Contributor

@ldemailly the run that you linked is for rbac, which is not part of presubmit; maybe it was at some point, but by mistake. I would like to close this bug and open specific issues to help resolution. I think we should open separate issues on why the rbac and auth tests are failing. What do you think?

@sebastienvas
Contributor

The exec format error happens because the file is actually missing; I guess we are not checking the HTTP status code :). Will file a bug for this.
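
As an illustration of the missing check, here is a hedged sketch of a download helper that refuses to save the response unless the server answers 200 OK. The `download` and `newFakeStorage` functions and the `/istioctl` path are hypothetical and are not taken from the Istio test framework; a local `httptest` server stands in for the real storage bucket.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"os"
	"path/filepath"
)

// download fetches url and writes the body to path with the given mode.
// Nothing is written unless the status is 200 OK, so an error page
// cannot end up on disk posing as an executable.
func download(url, path string, mode os.FileMode) error {
	resp, err := http.Get(url)
	if err != nil {
		return err
	}
	defer resp.Body.Close()

	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("GET %s: unexpected status %s", url, resp.Status)
	}

	f, err := os.OpenFile(path, os.O_CREATE|os.O_WRONLY|os.O_TRUNC, mode)
	if err != nil {
		return err
	}
	if _, err := io.Copy(f, resp.Body); err != nil {
		f.Close()
		os.Remove(path) // do not leave a partial file behind
		return err
	}
	return f.Close()
}

// newFakeStorage serves one object at /istioctl; any other path returns 404.
func newFakeStorage() *httptest.Server {
	mux := http.NewServeMux()
	mux.HandleFunc("/istioctl", func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprint(w, "fake binary contents")
	})
	return httptest.NewServer(mux)
}

func main() {
	srv := newFakeStorage()
	defer srv.Close()

	dir, err := os.MkdirTemp("", "download-demo")
	if err != nil {
		panic(err)
	}
	defer os.RemoveAll(dir)

	// Existing object: saved to disk, error is nil.
	fmt.Println(download(srv.URL+"/istioctl", filepath.Join(dir, "istioctl"), 0755))
	// Missing object: the 404 is reported and no file is created.
	fmt.Println(download(srv.URL+"/missing", filepath.Join(dir, "missing"), 0755))
}
```

Failing at download time points directly at the missing file, rather than at the later attempt to execute whatever was saved.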

@andraxylia
Contributor

It is an RBAC issue only if the logs show an error of the type cannot create/list/... .
Otherwise, the mere fact that it is the rbac test that is failing does not indicate an RBAC issue.

@sebastienvas
Contributor

sebastienvas commented Aug 27, 2017 via email

@ldemailly
Member Author

istio/old_pilot_repo#1102 for instance doesn't seem to go through

@sebastienvas
Contributor

One more time, the issue is with pilot-presubmit not the e2e. So I am closing this. Please open issues for new problems. I will open a new issue for the RBAC one.

@ldemailly
Member Author

not sure of the value of closing and opening new issues but sure...

mandarjog pushed a commit to mandarjog/istio that referenced this issue Oct 30, 2017
* Change mock redis to use fork repo to improve code coverage.

* Remove CMakeLists.


Former-commit-id: eb83106ee5253af53ca44d05b1edf15d4c9a4565
mandarjog pushed a commit that referenced this issue Oct 31, 2017
* Change mock redis to use fork repo to improve code coverage.

* Remove CMakeLists.


Former-commit-id: 2e6870ddc1af2f404fd95912ad28b314fd39a01d
howardjohn pushed a commit to howardjohn/istio that referenced this issue Jan 12, 2020
* Updated mixer config to bring in latest changes

Updated to get to 3194864 from istio/istio
master branch.

* Updated after running make gen
howardjohn added a commit to howardjohn/istio that referenced this issue Jan 12, 2020
* Remove demo-auth profile

Fixes istio#18646

* Really kill the demo-auth
luksa pushed a commit to luksa/istio that referenced this issue Sep 20, 2022
Co-authored-by: maistra-bot <null>
antonioberben pushed a commit to antonioberben/istio that referenced this issue Jan 29, 2024
…ntainers

[jaeger] Adding optional initContainers for jaeger query and ingester deployments