dev.kubeflow.org 502s #545

Closed
jlewi opened this issue Mar 30, 2018 · 8 comments

jlewi commented Mar 30, 2018

I'm getting 502s on dev.kubeflow.org.

I believe @ankushagarwal just redeployed the latest Kubeflow.

Following iap.md

  • Backend is reported as unhealthy
  • Health check is reported as "/" not "/healthz"
  • IAP is disabled
  • Timeout is set to 30 seconds.

My conjecture is that when we redeployed, all the backend settings got reset and we didn't reinitialize them.

/cc @danisla @ankushagarwal

jlewi commented Mar 30, 2018

  • Used the UI to change the health check to /healthz
  • It took some time, but eventually the backends were reported as healthy

Now I get

Required JWT token is missing

Because IAP is disabled.

ankushagarwal self-assigned this Mar 30, 2018

danisla commented Mar 30, 2018

The initContainer on the envoy pods is what re-enables IAP. Deleting the envoy deployment and running ks apply again should reconfigure it.

jlewi commented Mar 30, 2018

@danisla Thanks, I just figured that out.

It looks like @ankushagarwal's ks apply didn't cause the envoy containers to get restarted.

danisla commented Mar 30, 2018

If the ingress or service spec is what changed, we might be able to add a coupling in the jsonnet spec that rolls the pods whenever the ingress or service changes. This is done in Helm with an annotation and a checksum.
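
For illustration, the checksum idea could look roughly like the sketch below. It is written in Python rather than jsonnet, it is not the actual Kubeflow manifest code, and the annotation key `checksum/ingress-service` and the trimmed manifests are made up for the example. The point is only that hashing the ingress and service specs into a pod template annotation changes the pod template whenever either spec changes, which is what makes a Deployment roll its pods.

```python
import copy
import hashlib
import json


def spec_checksum(*specs):
    """Return a stable SHA-256 hex digest over one or more spec dicts."""
    # sort_keys makes the digest independent of dict key ordering.
    canonical = json.dumps(specs, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def with_checksum_annotation(deployment, *specs, key="checksum/ingress-service"):
    """Return a copy of `deployment` whose pod template carries the checksum.

    The annotation key is a hypothetical name chosen for this sketch.
    """
    result = copy.deepcopy(deployment)
    template_meta = result["spec"]["template"].setdefault("metadata", {})
    template_meta.setdefault("annotations", {})[key] = spec_checksum(*specs)
    return result


# Hypothetical, heavily trimmed manifests for illustration only.
ingress = {"spec": {"backend": {"serviceName": "envoy", "servicePort": 8080}}}
service = {"spec": {"ports": [{"port": 8080}]}}
envoy = {"spec": {"template": {"metadata": {}, "spec": {"containers": []}}}}

before = with_checksum_annotation(envoy, ingress, service)
service["spec"]["ports"][0]["port"] = 9090  # simulate a change to the service
after = with_checksum_annotation(envoy, ingress, service)

key = "checksum/ingress-service"
changed = (
    before["spec"]["template"]["metadata"]["annotations"][key]
    != after["spec"]["template"]["metadata"]["annotations"][key]
)
print("pod template changed:", changed)
```

Because the annotation lives on the pod template and not on the Deployment's own metadata, applying the updated manifest would trigger a normal rolling update.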

jlewi commented Mar 30, 2018

Would it make sense to turn the script into an agent that periodically runs a bunch of checks?

This would be more work, I think.
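
A rough sketch of what such an agent might look like, under some loud assumptions: the two checks mirror the symptoms listed at the top of this issue (health check path and IAP state), the `backend` dict shape is invented for the example, and `fetch_backend` is a hypothetical callable standing in for however the backend settings would really be read. None of this is existing Kubeflow code.

```python
import time


def check_health_path(backend):
    """Report a problem if the health check path is not /healthz."""
    expected = "/healthz"
    actual = backend.get("health_check_path")
    if actual != expected:
        return f"health check path is {actual!r}, expected {expected!r}"
    return None


def check_iap_enabled(backend):
    """Report a problem if IAP is turned off on the backend."""
    if not backend.get("iap_enabled"):
        return "IAP is disabled"
    return None


CHECKS = [check_health_path, check_iap_enabled]


def run_checks(backend, checks=CHECKS):
    """Run every check once and return the list of problem descriptions."""
    problems = []
    for check in checks:
        problem = check(backend)
        if problem is not None:
            problems.append(problem)
    return problems


def run_agent(fetch_backend, on_problems, interval_seconds=60,
              iterations=None, sleep=time.sleep):
    """Poll the backend settings and hand any problems to `on_problems`.

    `fetch_backend` and `on_problems` are hypothetical hooks; with
    iterations=None the loop runs until the process is stopped.
    """
    done = 0
    while iterations is None or done < iterations:
        problems = run_checks(fetch_backend())
        if problems:
            on_problems(problems)
        done += 1
        if iterations is None or done < iterations:
            sleep(interval_seconds)


# One pass over a backend in the state described at the top of this issue.
broken_backend = {"health_check_path": "/", "iap_enabled": False}
run_agent(lambda: broken_backend, print, iterations=1)
```

In a real agent, `on_problems` would presumably re-apply the settings instead of printing them, which is where most of the extra work would be.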

jlewi commented Mar 30, 2018

It's working now that I kicked one of the envoy pods.

danisla commented Mar 30, 2018

If the liveness probe on the pod was somehow coupled to the JWT_AUDIENCE being correct, then k8s would automatically restart the pods when it changed and self-correct.
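
One way this could be sketched is as a small script run by an exec liveness probe. This is only an assumption about how the coupling might be done: the two file paths passed on the command line are hypothetical (one holding the JWT_AUDIENCE the pod was started with, one holding the current expected value), and nothing in this thread says where either value would actually come from.

```python
import sys
from pathlib import Path


def audience_is_current(configured, expected):
    """True when the configured audience matches a non-empty expected one."""
    expected = expected.strip()
    return expected != "" and configured.strip() == expected


def probe(configured_path, expected_path):
    """Return a process exit code: 0 means healthy, 1 means restart the pod.

    Both paths are hypothetical inputs for this sketch.
    """
    try:
        configured = Path(configured_path).read_text()
        expected = Path(expected_path).read_text()
    except OSError:
        # Treat unreadable input as unhealthy; a real probe might prefer
        # to tolerate this to avoid restart loops.
        return 1
    return 0 if audience_is_current(configured, expected) else 1


if __name__ == "__main__" and len(sys.argv) == 3:
    sys.exit(probe(sys.argv[1], sys.argv[2]))
```

A non-zero exit from an exec probe counts as a failure, so after enough consecutive failures the kubelet would restart the container and the initContainer logic would not be the only path to picking up a new audience.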

jlewi commented Mar 31, 2018

Let's use #550 to figure out a better fix.

jlewi closed this as completed Mar 31, 2018
yanniszark pushed a commit to arrikto/kubeflow that referenced this issue Feb 15, 2021
Signed-off-by: Ce Gao <gaoce@caicloud.io>
elenzio9 pushed a commit to arrikto/kubeflow that referenced this issue Oct 31, 2022
Signed-off-by: Anna Jung (VMware) <antheaj@vmware.com>