Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add Label to Disable Katib Webhooks #2069

Closed
andreyvelich opened this issue Dec 21, 2022 · 11 comments · Fixed by #2203
Closed

Add Label to Disable Katib Webhooks #2069

andreyvelich opened this issue Dec 21, 2022 · 11 comments · Fixed by #2203

Comments

@andreyvelich
Copy link
Member

In this PR: #2018 (comment), I proposed to introduce label for disabling Katib Webhooks (validator, defaulter, mutator). For example: katib.kubeflow.org/webhooks: disabled.
Let's discuss if that would be useful for the users with large-scale environment.

Currently, if user's namespace has katib.kubeflow.org/metrics-collector-injection: enabled label, Katib Mutation Webhook runs for every Pod in that namespace. That might increase latency in the Kubernetes API server. Some users might want to use Katib Experiments and run other pods in their namespaces without Webhook execution.

What do you think @gaocegege @johnugeorge @tenzen-y @anencore94 @terrytangyuan ?

/kind discussion


Love this feature? Give it a 👍 We prioritize the features with the most 👍

@tenzen-y
Copy link
Member

tenzen-y commented Dec 21, 2022

@andreyvelich Does that mean if experiments with that label deployed, all webhooks (validator, defaultor, mutator) are ignored?
Or ignored only the mutator?

@johnugeorge
Copy link
Member

Effectively, this will create problems with Katib. I would not suggest this unless a non Katib user complains about the performance?

@andreyvelich
Copy link
Member Author

andreyvelich commented Dec 22, 2022

Does that mean if experiments with that label deployed, all webhooks (validator, defaultor, mutator) are ignored?

@tenzen-y Yes, we can give users an option to disable all 3 webhooks by setting label in Katib Experiment (disable validator and defaulter) and in user's pods (mutator).

I would not suggest this unless a non Katib user complains about the performance?

Let's hold this issue for some time and ask users to provide feedback.

@andreyvelich
Copy link
Member Author

We can close this issue this @tenzen-y integrated cert-generator to Katib Controller startup 🎉
If we need to deactivate webhook for namespace Pod Mutation, we will create separate issues.

@tenzen-y
Copy link
Member

tenzen-y commented Aug 13, 2023

/reopen

I found this issue still remains, and the replicaset-controller occasionally faces this issue:

Warning  FailedCreate  39s (x15 over 2m2s)  replicaset-controller  Error creating: Internal error occurred: failed calling webhook "mutator.pod.katib.kubeflow.org": failed to call webhook: Post "[https://katib-controller.kubeflow.svc:443/mutate-pod?timeout=10s](https://katib-controller.kubeflow.svc/mutate-pod?timeout=10s)": dial tcp 10.100.157.27:443: connect: connection refused

The above error will happen if the katib starts up in the following steps:

  1. webhooks are registered to kube-apiserver.
  2. the deployment-controller creates the ReplicaSet.
  3. the replicaset-controller tries to create pods.

I think we can select one of the following option:

  1. Revert the following PRs. This means the katib will start up, as well as v0.15.0.
  1. Introduce label katib.kubeflow.org/metrics-collector-injection: disabled as @andreyvelich proposed in Change failurePolicy to Fail for Katib Webhooks #2018 (comment).
    Draft PR: Skip to inject the metrics-collector pods to the katib controller #2203
  2. Replace failurePolicy with Ignore in the MutatingWebhookConfiguration mutator.pod.katib.kubeflow.org.

@andreyvelich @johnugeorge WDYT?

@google-oss-prow
Copy link

@tenzen-y: Reopened this issue.

In response to this:

/reopen

I found this issue still remains, and the replicaset-controller occasionally faces this issue:

Warning  FailedCreate  39s (x15 over 2m2s)  replicaset-controller  Error creating: Internal error occurred: failed calling webhook "mutator.pod.katib.kubeflow.org": failed to call webhook: Post "[https://katib-controller.kubeflow.svc:443/mutate-pod?timeout=10s](https://katib-controller.kubeflow.svc/mutate-pod?timeout=10s)": dial tcp 10.100.157.27:443: connect: connection refused

The above error will happen if the katib starts up in the following steps:

  1. webhooks are registered to kube-apiserver.
  2. the deployment-controller creates the ReplicaSet.
  3. the replicaset-controller tries to create pods.

I think we can select one of the following option:

  1. Revert the following PRs. This means the katib will start up, as well as v0.15.0.
  1. Introduce label katib.kubeflow.org/metrics-collector-injection: disabled as @andreyvelich proposed in Change failurePolicy to Fail for Katib Webhooks #2018 (comment).
    Draft PR: Skip to inject the metrics-collector pods to the katib controller #2203
  2. Replace failurePolicy with Ignore in the MutatingWebhookConfiguration mutator.pod.katib.kubeflow.org.

@andreyvelich @johnugeorge WDYT?

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@google-oss-prow google-oss-prow bot reopened this Aug 13, 2023
@tenzen-y
Copy link
Member

tenzen-y commented Aug 13, 2023

I think 2 would be better because consolidating the cert-generator gives users the following advantages:

  1. Allow users to deploy easily.
  2. Allow users easily debug, and then it is possible to avoid the following confusion:

@andreyvelich
Copy link
Member Author

@tenzen-y On which Katib Component did you see this Warning ?
I thought that webhooks are registered to kube-apiserver. only after certificates are ready, isn't it ?

@tenzen-y
Copy link
Member

tenzen-y commented Aug 14, 2023

@tenzen-y On which Katib Component did you see this Warning ?

@andreyvelich I see this issue on all the katib components such as controller, db-manager...

I thought that webhooks are registered to kube-apiserver. only after certificates are ready, isn't it ?

I thought so too. However, the webhook is occasionally registered to the kube-apiserver before certs are ready. It's distributed system :(

@andreyvelich
Copy link
Member Author

@tenzen-y Did you see the errors even without Leader Election mode enabled ?
What if you just create WebhookConfiguration without caBundle and start Katib DB Manager.
Will the Katib DB Manager deployment fail ?

@tenzen-y
Copy link
Member

@tenzen-y Did you see the errors even without Leader Election mode enabled ? What if you just create WebhookConfiguration without caBundle and start Katib DB Manager. Will the Katib DB Manager deployment fail ?

@andreyvelich Yes, I see it on the installation without leader election mode. You can check the error in the following:

With caBundle: https://github.com/kubeflow/katib/actions/runs/5848826299/job/15856504332
Without caBundle: https://github.com/kubeflow/katib/actions/runs/5848990494/job/15856837725

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants