Failed calling webhook when running random example experiment #1160

Closed
felihong opened this issue Apr 20, 2020 · 9 comments

felihong commented Apr 20, 2020

Hi there,

My Katib failed to submit the random algorithm example https://github.com/kubeflow/katib/blob/master/examples/v1alpha3/random-example.yaml:

Internal error occurred: 
failed calling webhook "mutating.experiment.katib.kubeflow.org": 
Post https://katib-controller.kubeflow.svc:443/mutate-experiments?timeout=30s: 
ssh: rejected: connect failed (Connection refused)

The same experiment worked fine a few days ago and I didn't change anything. All deployments (katib-controller, katib-db-manager, etc.) are in a healthy state. Any ideas what is going wrong here?

Thanks.

Environment:

  • Kubeflow version: v1.0
  • Kubernetes version: (use kubectl version):
Server Version: version.Info{Major:"1", Minor:"14+", GitVersion:"v1.14.10-gke.27", 
GitCommit:"145f9e21a4515947d6fb10819e5a336aff1b6959", GitTreeState:"clean", BuildDate:"2020-02-21T18:01:40Z", GoVersion:"go1.12.12b4", Compiler:"gc", Platform:"linux/amd64"}
@issue-label-bot

Issue-Label Bot is automatically applying the labels:

Label Probability
bug 0.61


@issue-label-bot issue-label-bot bot added the bug label Apr 20, 2020
@jlewi jlewi added kind/bug and removed bug labels Apr 20, 2020
@andreyvelich
Member

Can you check your webhook status?
kubectl get MutatingWebhookConfiguration

@felihong
Author

Hi @andreyvelich , thanks for your reply.

The katib-mutating-webhook-config is listed in Webhook configurations:

kubectl get MutatingWebhookConfiguration

NAME                                                 CREATED AT
admission-webhook-mutating-webhook-configuration     2020-03-08T16:34:33Z
cert-manager-webhook                                 2020-03-08T16:32:44Z
inferenceservice.serving.kubeflow.org                2020-03-08T16:36:05Z
istio-sidecar-injector                               2020-03-08T16:32:21Z
katib-mutating-webhook-config                        2020-03-08T16:37:42Z
pod-ready.config.common-webhooks.networking.gke.io   2020-03-08T16:30:29Z
seldon-mutating-webhook-configuration-kubeflow       2020-03-08T16:35:22Z
webhook.serving.knative.dev                          2020-03-08T16:34:45Z

Below is the YAML file:

apiVersion: admissionregistration.k8s.io/v1beta1
kind: MutatingWebhookConfiguration
metadata:
  generation: 1
  name: katib-mutating-webhook-config
  resourceVersion: "5990"
  selfLink: /apis/admissionregistration.k8s.io/v1beta1/mutatingwebhookconfigurations/katib-mutating-webhook-config
  uid: [ID]
webhooks:
- admissionReviewVersions:
  - v1beta1
  clientConfig:
    caBundle: [Certificat]
    service:
      name: katib-controller
      namespace: kubeflow
      path: /mutate-experiments
  failurePolicy: Fail
  name: mutating.experiment.katib.kubeflow.org
  namespaceSelector:
    matchExpressions:
    - key: control-plane
      operator: DoesNotExist
  rules:
  - apiGroups:
    - kubeflow.org
    apiVersions:
    - v1alpha3
    operations:
    - CREATE
    - UPDATE
    resources:
    - experiments
    scope: '*'
  sideEffects: Unknown
  timeoutSeconds: 30
- admissionReviewVersions:
  - v1beta1
  clientConfig:
    caBundle: [Certificat]
    service:
      name: katib-controller
      namespace: kubeflow
      path: /mutate-pods
  failurePolicy: Ignore
  name: mutating.pod.katib.kubeflow.org
  namespaceSelector:
    matchLabels:
      katib-metricscollector-injection: enabled
  rules:
  - apiGroups:
    - ""
    apiVersions:
    - v1
    operations:
    - CREATE
    resources:
    - pods
    scope: '*'
  sideEffects: Unknown
  timeoutSeconds: 30

Thanks for your help!

@andreyvelich
Member

Can you describe your Kubeflow namespace and check whether it has the label katib-metricscollector-injection=enabled?
Also, where did you see the error you showed above?

If that doesn't help, try reinstalling the Katib deployment.
You can use this script to install Katib: https://github.com/kubeflow/katib/blob/master/scripts/v1alpha3/deploy.sh
It deletes the validating and mutating webhooks.
The Katib controller creates new ones when you deploy it.

@felihong
Author

Hi @andreyvelich , I re-deployed Katib and now it works fine. Thanks a lot for your help!

The error appeared as a pop-up at the top of the page when I tried to submit the experiment YAML file. I also checked the kubeflow namespace, and the katib-metricscollector-injection label is set:

kubectl describe namespace kubeflow
Name:         kubeflow
Labels:       control-plane=kubeflow
              katib-metricscollector-injection=enabled

Any idea what might have caused this?
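
For context on the label check above: in Kubernetes, a webhook's namespaceSelector decides whether the webhook runs for an object based on the labels of that object's namespace. The following is a minimal Python sketch (not Katib or Kubernetes code; `selector_matches` is a hypothetical helper) that evaluates the two selectors from the webhook configuration posted earlier against the kubeflow namespace labels shown above. It only illustrates the selector semantics and does not by itself explain the connection error.

```python
def selector_matches(selector, labels):
    """Return True if a Kubernetes-style label selector matches the labels.

    Hypothetical helper for illustration; it mirrors the documented
    matchLabels / matchExpressions semantics.
    """
    # Every matchLabels entry must be present with exactly that value.
    for key, value in selector.get("matchLabels", {}).items():
        if labels.get(key) != value:
            return False
    # Every matchExpressions entry must hold as well.
    for expr in selector.get("matchExpressions", []):
        key = expr["key"]
        operator = expr["operator"]
        values = expr.get("values", [])
        if operator == "In" and labels.get(key) not in values:
            return False
        if operator == "NotIn" and labels.get(key) in values:
            return False
        if operator == "Exists" and key not in labels:
            return False
        if operator == "DoesNotExist" and key in labels:
            return False
    return True


# Labels taken from the `kubectl describe namespace kubeflow` output above.
kubeflow_labels = {
    "control-plane": "kubeflow",
    "katib-metricscollector-injection": "enabled",
}

# Selectors copied from katib-mutating-webhook-config.
experiment_selector = {
    "matchExpressions": [{"key": "control-plane", "operator": "DoesNotExist"}]
}
pod_selector = {
    "matchLabels": {"katib-metricscollector-injection": "enabled"}
}

print(selector_matches(experiment_selector, kubeflow_labels))  # False
print(selector_matches(pod_selector, kubeflow_labels))         # True
```

Under these semantics, the experiment webhook is skipped for objects in the kubeflow namespace itself, because that namespace carries a control-plane label, while the pod webhook applies there.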

@andreyvelich
Member

andreyvelich commented Apr 22, 2020

@felihong Maybe this webhook was somehow modified on your cluster.
How did you install Kubeflow?

@felihong
Author

@andreyvelich actually not that I remember :(

My kubeflow cluster is hosted on GCP and I deployed v1.0 using the official config URL https://raw.githubusercontent.com/kubeflow/manifests/v1.0-branch/kfdef/kfctl_gcp_iap.v1.0.0.yaml

@andreyvelich
Member

@felihong Ok, if you have this problem again, let me know.

@charlescurt

charlescurt commented Mar 17, 2022

https://github.com/kubeflow/katib/blob/master/scripts/v1alpha3/deploy.sh
Could you repost this link? I have the same issue.

My error:
Error from server (InternalError): error when creating "random-search-example.yaml": Internal error occurred: failed calling webhook "mutating.experiment.katib.kubeflow.org": Post "https://katib-controller.kubeflow.svc:443/mutate-experiments?timeout=30s": x509: certificate relies on legacy Common Name field, use SANs or temporarily enable Common Name matching with GODEBUG=x509ignoreCN=0
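
The x509 message above indicates that the certificate served by the webhook identifies its host only through the Common Name field. Go 1.15 and later reject such certificates by default and require a Subject Alternative Name (SAN) that matches the host. Below is a minimal Python sketch of the DNS names a regenerated certificate would typically list, plus a check that the host from the failing URL is among them. It assumes the default cluster domain cluster.local, and the helper names are hypothetical, not part of Katib.

```python
from urllib.parse import urlparse


def service_dns_names(service, namespace, cluster_domain="cluster.local"):
    """List the in-cluster DNS names for a Service.

    Hypothetical helper; cluster_domain is an assumption and may differ
    on a given cluster.
    """
    base = f"{service}.{namespace}"
    return [service, base, f"{base}.svc", f"{base}.svc.{cluster_domain}"]


def webhook_host(url):
    """Extract the host name the API server connects to."""
    return urlparse(url).hostname


# URL copied from the error message above.
url = "https://katib-controller.kubeflow.svc:443/mutate-experiments?timeout=30s"

sans = service_dns_names("katib-controller", "kubeflow")
host = webhook_host(url)

print(host)          # katib-controller.kubeflow.svc
print(host in sans)  # True
```

A certificate whose SAN list contains the host printed here would satisfy the check that the error message describes.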
