[stable/cert-manager] v0.6.0 - Internal error occurred: failed calling admission webhook "certificates.admission.certmanager.k8s.io": the server is currently unable to handle the request #10869

Closed
rmuehlbauer opened this Issue Jan 24, 2019 · 20 comments

rmuehlbauer commented Jan 24, 2019

Is this a request for help?:


Is this a BUG REPORT or FEATURE REQUEST? (choose one):
BUG REPORT (maybe)

Version of Helm and Kubernetes:
→ helm version
Client: &version.Version{SemVer:"v2.12.3", GitCommit:"eecf22f77df5f65c823aacd2dbd30ae6c65f186e", GitTreeState:"clean"}
Server: &version.Version{SemVer:"v2.12.3", GitCommit:"eecf22f77df5f65c823aacd2dbd30ae6c65f186e", GitTreeState:"clean"}

→ kubectl version
Client Version: version.Info{Major:"1", Minor:"10", GitVersion:"v1.10.7", GitCommit:"0c38c362511b20a098d7cd855f1314dad92c2780", GitTreeState:"clean", BuildDate:"2018-08-20T10:09:03Z", GoVersion:"go1.9.3", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"11+", GitVersion:"v1.11.6-gke.3", GitCommit:"04ad69a117f331df6272a343b5d8f9e2aee5ab0c", GitTreeState:"clean", BuildDate:"2019-01-10T00:39:15Z", GoVersion:"go1.10.3b4", Compiler:"gc", Platform:"linux/amd64"}

Which chart:
cert-manager Version 0.6

What happened:
After upgrading, the cert-manager pod's log is full of messages like:
controller.go:147] certificates controller: Re-queuing item "some-certificate" due to error processing: Internal error occurred: failed calling admission webhook "certificates.admission.certmanager.k8s.io": the server is currently unable to handle the request

What you expected to happen:

How to reproduce it (as minimally and precisely as possible):
I've also done "helm delete --purge" and reinstalled the chart again - same behaviour.
Therefore, I can reproduce the issue by just installing the cert-manager chart

Anything else we need to know:
I followed the install/upgrade instructions at https://cert-manager.readthedocs.io/en/latest/admin/upgrading/index.html and the upgrade went smoothly without any problems. After the upgrade, all the pods are in the "Running" state.

haf commented Jan 25, 2019

rmuehlbauer commented Jan 28, 2019

I think issue #10856 might indeed be related.
Before running the "helm upgrade" command, I labeled the existing namespace as described in the cert-manager documentation, and afterwards verified that the label is set correctly:
→ kubectl describe namespace cert-manager
Name: ingress
Labels: certmanager.k8s.io/disable-validation=true

haf commented Jan 28, 2019

Good to know! I have a helm-let's-encrypt to upgrade going forward as well, so I'm not going to do that until this is resolved. /cc @kragniz

rmuehlbauer commented Jan 30, 2019

Today I had to fall back completely to 0.5.2, as cert-manager 0.6.0 caused very strange side effects in my k8s GKE environment.
I upgraded to a new Kubernetes version today and afterwards had huge problems with the cluster: two pods, "calico-typha-vertical-autoscaler" and "calico-node-vertical-autoscaler", didn't start anymore, and other cluster resources also behaved strangely.
Pods restarted after error messages like
autoscaler.go:49] failed to discover apigroup for kind "DaemonSet": unable to retrieve the complete list of server APIs: admission.certmanager.k8s.io/v1beta1: an error on the server ("service unavailable") has prevented the request from succeeding
My first idea was that this might have been caused by the Kubernetes update, but this was not the case, as a second cluster that was not updated had the same strange issues.
So I completely uninstalled cert-manager (also removed all CRDs) and installed version 0.5.2 again.
After that procedure I also had to recreate the ClusterIssuer, as it was gone as well.
Now everything is up and running again. For the moment I'm going to stay with the old cert-manager version; at least this version is working for me and not causing strange issues.
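
The discovery error above names the API group/version admission.certmanager.k8s.io/v1beta1, which suggests the aggregated API behind the webhook is not reachable. A minimal sketch for checking that from the CLI follows. The APIService name is an assumption derived from that group/version (APIService objects are named `<version>.<group>`), and the function names are hypothetical helpers, not part of cert-manager or kubectl.

```shell
#!/usr/bin/env bash
# Assumed name, derived from the group/version in the error message.
APISERVICE="v1beta1.admission.certmanager.k8s.io"

# Print the status of the APIService's "Available" condition (True/False).
# Needs kubectl access to the cluster; it is only defined here, not called.
apiservice_available() {
  kubectl get apiservice "$1" \
    -o jsonpath='{.status.conditions[?(@.type=="Available")].status}'
}

# Turn that status string into a short hint. Pure function, no cluster access.
describe_status() {
  case "$1" in
    True)  echo "available" ;;
    False) echo "unavailable: check whether the API server can reach the webhook service" ;;
    *)     echo "unknown: APIService missing or no status reported yet" ;;
  esac
}
```

With cluster access, `describe_status "$(apiservice_available "$APISERVICE")"` would print the hint for the current state. An unavailable APIService can have other causes as well, so the hint is only a starting point.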

rmuehlbauer commented Feb 18, 2019

Has anyone already tried cert-manager 0.6.5?

davi5e commented Feb 19, 2019

@rmuehlbauer Trying with v0.6.5, same error... Any workarounds?

rmuehlbauer commented Feb 19, 2019

at least not to my knowledge...

rmuehlbauer commented Feb 25, 2019

Today I had some time to dig somewhat deeper and could finally resolve the issue with the new cert-manager versions. Maybe you can use this to sort things out on your side.

TL;DR:
There was a firewall rule missing. Allow Kubernetes master (network) to access the cert-manager-webhook pod on port 6443.

After working my way through cert-manager's "getting started guide" and "troubleshooting guide", I found a note (at the very bottom of the troubleshooting guide) saying: "If the job continues to fail, please read the Webhook docs for additional information."
In the webhook doc (which you can find here: https://cert-manager.readthedocs.io/en/latest/getting-started/webhook.html) I found an interesting piece of information regarding running cert-manager on private GKE clusters.
On GKE, the K8s masters have only very limited access to their nodes. To be able to use cert-manager's webhook, you have to allow those connections as well.
This was the missing piece of information. From there it was easy to work my way through the GKE docs (found here: https://cloud.google.com/kubernetes-engine/docs/how-to/private-clusters#add_firewall_rules), gather all the little pieces together and create a new firewall rule, which solved the issue for me.
Basically, I allowed the K8s master network to access the webhook pod on port 6443 (if you take a closer look at the webhook pod, you will see that it actually listens on that port, and only the webhook service maps port 443 to the pod's port 6443).

I hope this piece of information helps to sort out the situation on your side.
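
To confirm which pod-side port the firewall has to allow, the service's port mapping can be read directly. This is only a sketch: the service name "cert-manager-webhook" and the namespace "cert-manager" are assumptions based on the names used in this thread, and both helper functions are hypothetical.

```shell
#!/usr/bin/env bash
# Print "<port>:<targetPort>" for the first port of a Service.
# Needs kubectl access to the cluster; it is only defined here, not called.
service_port_mapping() {  # usage: service_port_mapping <service> <namespace>
  kubectl get service "$1" --namespace "$2" \
    -o jsonpath='{.spec.ports[0].port}:{.spec.ports[0].targetPort}'
}

# Extract the pod-side port from a "<port>:<targetPort>" string.
target_port() {
  echo "${1##*:}"
}
```

For example, `target_port "$(service_port_mapping cert-manager-webhook cert-manager)"` prints the port that the master network must be able to reach on the nodes. In the setup described above that was 6443.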

rmuehlbauer commented Feb 26, 2019

issue was solved by allowing my k8s master to access its nodes on port 6443 - which is used on the cert manager webhook pod

alkar added a commit to ministryofjustice/cloud-platform-infrastructure that referenced this issue Mar 11, 2019

Open up cert-manager-webhook to apiserver.
Since the apiserver containers run with host networking and Pod networking is not routable outside the cluster, we can
simply open up traffic for the relevant port.

Reference to the issue: helm/charts#10869 (comment)

gajus commented Mar 15, 2019

> issue was solved by allowing my k8s master to access its nodes on port 6443 - which is used on the cert manager webhook pod

What is the gcloud command line to create this rule?

Izopi4a commented Mar 16, 2019

i would like to see that gke command as well please

gajus commented Mar 16, 2019

For the record, I did not have this problem when setting up cert-manager.

From my notes, here is literally everything that was needed to set up the cert-manager on a new cluster.

# Install the CustomResourceDefinition resources separately
kubectl apply -f https://raw.githubusercontent.com/jetstack/cert-manager/release-0.7/deploy/manifests/00-crds.yaml

# Create the namespace for cert-manager
kubectl create namespace cert-manager

# Label the cert-manager namespace to disable resource validation
kubectl label namespace cert-manager certmanager.k8s.io/disable-validation=true

# Add the Jetstack Helm repository
helm repo add jetstack https://charts.jetstack.io

# Update your local Helm chart repository cache
helm repo update

# Install the cert-manager Helm chart
helm install \
  --name cert-manager \
  --namespace cert-manager \
  --version v0.7.0 \
  jetstack/cert-manager

# Check that the cert-manager pods are running
kubectl get pods --namespace cert-manager

# Set up the cluster issuer (using Let's Encrypt)
cat <<'EOF' | kubectl create -f -
apiVersion: certmanager.k8s.io/v1alpha1
kind: ClusterIssuer
metadata:
  name: letsencrypt-production
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory
    email: 'gajus@gajus.com'
    privateKeySecretRef:
      name: letsencrypt-production
    http01: {}
EOF

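# Set up the certificate (replace with your details).
# "kubectl apply" is used because "kubectl replace" fails if the Certificate does not exist yet.
cat <<'EOF' | kubectl apply -f -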
apiVersion: certmanager.k8s.io/v1alpha1
kind: Certificate
metadata:
  name: queryalert-com
  namespace: default
spec:
  secretName: queryalert-com-tls
  issuerRef:
    name: letsencrypt-production
    kind: ClusterIssuer
  commonName: queryalert.com
  dnsNames:
  - queryalert.com
  acme:
    config:
    - http01:
        ingressClass: nginx
      domains:
      - queryalert.com
EOF

Then just update Ingress:

apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  name: {{ .Release.Name | quote }}
  labels:
    {{- include "release_labels" . | indent 4 }}
  annotations:
    nginx.ingress.kubernetes.io/rewrite-target: /
+    certmanager.k8s.io/cluster-issuer: 'letsencrypt-production'
+    certmanager.k8s.io/acme-challenge-type: http01
+spec:
+  tls:
+    - hosts:
+      - queryalert.com
+      secretName: queryalert-com-tls
  rules:
    - host: queryalert.com
      http:
        paths:
          - path: /api
            backend:
              serviceName: {{ .Release.Name | quote }}
              servicePort: 8080

Izopi4a commented Mar 18, 2019

with 0.7.0 it works indeed thx

rmuehlbauer commented Mar 18, 2019

> > issue was solved by allowing my k8s master to access its nodes on port 6443 - which is used on the cert manager webhook pod
>
> What is the gcloud command line to create this rule?

please have a look at #10869 (comment)

mlushpenko commented Apr 3, 2019

So, we are running a private cluster on GKE:

Server Version: version.Info{Major:"1", Minor:"12+", GitVersion:"v1.12.5-gke.5", GitCommit:"2c44750044d8aeeb6b51386ddb9c274ff0beb50b", GitTreeState:"clean", BuildDate:"2019-02-01T23:53:25Z", GoVersion:"go1.10.8b4", Compiler:"gc", Platform:"linux/amd64"}

I created a firewall rule:

CLUSTER=staging
REGION=europe-west4
SOURCE=$(gcloud container clusters describe $CLUSTER --region $REGION | grep masterIpv4CidrBlock| cut -d ':' -f 2 | tr -d ' ')
NETWORK=$(gcloud container clusters describe $CLUSTER --region $REGION | egrep '^network:' | cut -d ':' -f 2 | tr -d ' ')
TAGS=$(gcloud compute firewall-rules list --filter "name~^gke-$CLUSTER" --format 'value(targetTags.list():label=TARGET_TAGS)' | head -n 1)

gcloud compute firewall-rules create cert-manager-admission-webhook --action ALLOW --direction INGRESS --source-ranges $SOURCE --rules tcp:6443 --target-tags $TAGS --network $NETWORK

Then I tried repeating the steps from #10869 (comment), and I get the same error when trying to create the ClusterIssuer:

Error from server (InternalError): error when creating "STDIN": Internal error occurred: failed calling admission webhook "clusterissuers.admission.certmanager.k8s.io": the server is currently unable to handle the request

Could you point out what else I am missing?
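
As a variation on the commands above, the cluster fields can be read with gcloud's `--format 'value(...)'` output instead of grep/cut, and the firewall command can be printed for review before it is run. This is a sketch under assumptions: the field path `privateClusterConfig.masterIpv4CidrBlock` is assumed to be where a private cluster reports its master range, and `cluster_field` and `firewall_rule_cmd` are hypothetical helper names.

```shell
#!/usr/bin/env bash
# Read a single field of the cluster description.
# Needs gcloud access to the project; it is only defined here, not called.
cluster_field() {  # usage: cluster_field <cluster> <region> <field>
  gcloud container clusters describe "$1" --region "$2" --format "value($3)"
}

# Print (do not run) the firewall-rule command so it can be reviewed first.
firewall_rule_cmd() {  # usage: firewall_rule_cmd <source-cidr> <target-tag> <network>
  printf 'gcloud compute firewall-rules create cert-manager-admission-webhook --action ALLOW --direction INGRESS --source-ranges %s --rules tcp:6443 --target-tags %s --network %s\n' "$1" "$2" "$3"
}
```

Possible usage: `SOURCE=$(cluster_field "$CLUSTER" "$REGION" privateClusterConfig.masterIpv4CidrBlock)` and `NETWORK=$(cluster_field "$CLUSTER" "$REGION" network)`, followed by `firewall_rule_cmd "$SOURCE" "$TAGS" "$NETWORK"` to inspect the resulting command before executing it.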

rmuehlbauer commented Apr 4, 2019

> So, we are running a private cluster on GKE:
>
> Server Version: version.Info{Major:"1", Minor:"12+", GitVersion:"v1.12.5-gke.5", GitCommit:"2c44750044d8aeeb6b51386ddb9c274ff0beb50b", GitTreeState:"clean", BuildDate:"2019-02-01T23:53:25Z", GoVersion:"go1.10.8b4", Compiler:"gc", Platform:"linux/amd64"}
>
> I created firewall
>
> CLUSTER=staging
> REGION=europe-west4
> SOURCE=$(gcloud container clusters describe $CLUSTER --region $REGION | grep masterIpv4CidrBlock| cut -d ':' -f 2 | tr -d ' ')
> TAGS=$(gcloud compute firewall-rules list --filter "name~^gke-$CLUSTER" --format 'value(targetTags.list():label=TARGET_TAGS)' | head -n 1)
>
> gcloud compute firewall-rules create cert-manager-admission-webhook --action ALLOW --direction INGRESS --source-ranges $SOURCE --rules tcp:6443 --target-tags $TAGS
>
> Then, I tried repeating steps from here #10869 (comment) and getting the same error when trying to create ClusterIssuer:
>
> Error from server (InternalError): error when creating "STDIN": Internal error occurred: failed calling admission webhook "clusterissuers.admission.certmanager.k8s.io": the server is currently unable to handle the request
>
> Could you point me what else am I missing?

Hmm... your SOURCE and TAGS variables seem to be populated with the correct values, at least when I tried your commands in my environment.
Your "firewall-rules create" line also looks good.
That is exactly what finally got it working in my case...
What about the resulting firewall rule: did you check that it is effective for the GKE hosts in your cluster? (I don't know how to check this using the CLI, but you can easily see it in the web GUI, at the very bottom of the firewall rule's details page.)

mlushpenko commented Apr 4, 2019

@rmuehlbauer thanks, good observation. I hadn't really checked whether the rules were applied, and they weren't. I am not sure why; maybe it has something to do with custom node pools or preemptible nodes.

UPDATE: It was the network, damn it. Our cluster doesn't run on the default network. I've updated my commands.
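
A rough way to catch this kind of mismatch from the CLI is to compare the network of the firewall rule with the network of the cluster. The sketch below assumes the rule name used earlier in this thread. All four helper functions are hypothetical names, and the firewall rule is assumed to report its network as a full resource URL whose last path segment is the network name.

```shell
#!/usr/bin/env bash
# Keep only the last path segment of a network resource URL.
network_name_from_url() {
  echo "${1##*/}"
}

# Print the network name a firewall rule is attached to.
# Needs gcloud access to the project; it is only defined here, not called.
firewall_rule_network() {  # usage: firewall_rule_network <rule-name>
  network_name_from_url "$(gcloud compute firewall-rules describe "$1" --format 'value(network)')"
}

# Print the network name a GKE cluster runs in (also needs gcloud access).
cluster_network() {  # usage: cluster_network <cluster> <region>
  gcloud container clusters describe "$1" --region "$2" --format 'value(network)'
}

# Succeed only if both names are non-empty and equal.
same_network() {
  [ -n "$1" ] && [ "$1" = "$2" ]
}
```

For example: `same_network "$(firewall_rule_network cert-manager-admission-webhook)" "$(cluster_network "$CLUSTER" "$REGION")" || echo "firewall rule is on a different network"`.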

One more update: I got a step further with the connections, but I am now getting this:

I0404 15:31:48.644874       1 request.go:942] Request Body: {"kind":"SubjectAccessReview","apiVersion":"authorization.k8s.io/v1beta1","metadata":{"creationTimestamp":null},"spec":{"nonResourceAttributes":{"path":"/","verb":"get"},"user":"system:anonymous","group":["system:unauthenticated"]},"status":{"allowed":false}}
I0404 15:31:48.645004       1 round_trippers.go:419] curl -k -v -XPOST  -H "Content-Type: application/json" -H "User-Agent: image.app_linux-amd64.binary/v0.0.0 (linux/amd64) kubernetes/$Format" -H "Accept: application/json, */*" -H "Authorization: Bearer blalblalal" 'https://10.125.192.1:443/apis/authorization.k8s.io/v1beta1/subjectaccessreviews'
I0404 15:31:48.654331       1 round_trippers.go:438] POST https://10.125.192.1:443/apis/authorization.k8s.io/v1beta1/subjectaccessreviews 201 Created in 9 milliseconds
I0404 15:31:48.654376       1 round_trippers.go:444] Response Headers:
I0404 15:31:48.654392       1 round_trippers.go:447]     Audit-Id: c5990609-2b9d-47ce-9bda-d15180940f1c
I0404 15:31:48.654397       1 round_trippers.go:447]     Content-Type: application/json
I0404 15:31:48.654400       1 round_trippers.go:447]     Content-Length: 294
I0404 15:31:48.654403       1 round_trippers.go:447]     Date: Thu, 04 Apr 2019 15:31:48 GMT
I0404 15:31:48.654441       1 request.go:942] Response Body: {"kind":"SubjectAccessReview","apiVersion":"authorization.k8s.io/v1beta1","metadata":{"creationTimestamp":null},"spec":{"nonResourceAttributes":{"path":"/","verb":"get"},"user":"system:anonymous","group":["system:unauthenticated"]},"status":{"allowed":false,"reason":"no RBAC policy matched"}}
I0404 15:31:48.654606       1 authorization.go:73] Forbidden: "/", Reason: "no RBAC policy matched"
I0404 15:31:48.654766       1 wrap.go:47] GET /: (10.220427ms) 403 [Go-http-client/2.0 172.16.0.10:54560]
I0404 15:31:49.774434       1 log.go:172] http: TLS handshake error from 172.16.0.11:56368: remote error: tls: bad certificate
I0404 15:31:50.357270       1 log.go:172] http: TLS handshake error from 172.16.0.10:58938: remote error: tls: bad certificate
I0404 15:31:52.656929       1 log.go:172] http: TLS handshake error from 172.16.0.10:58960: remote error: tls: bad certificate
I0404 15:31:55.179124       1 log.go:172] http: TLS handshake error from 172.16.0.10:58966: remote error: tls: bad certificate
I0404 15:31:56.677010       1 log.go:172] http: TLS handshake error from 172.16.0.10:58972: remote error: tls: bad certificate

I was updating from 0.5.2, so maybe something got messed up along the way. I may try a clean install again a bit later.

rmuehlbauer commented Apr 4, 2019

@mlushpenko I think you are hitting some RBAC issues. Have a look at https://docs.cert-manager.io/en/latest/getting-started/install.html, especially the note regarding RBAC and GKE. Hopefully that fixes your problem.

mlushpenko commented Apr 5, 2019

@rmuehlbauer thanks for the suggestion, although that looks fine:

kubectl describe clusterrolebinding cluster-admin-binding                                           
Name:         cluster-admin-binding
Labels:       <none>
Annotations:  <none>
Role:
  Kind:  ClusterRole
  Name:  cluster-admin
Subjects:
  Kind  Name                     Namespace
  ----  ----                     ---------
  User  mlushpenko@blockport.io

I was testing by running helm with my permissions, but it does look related to RBAC, as the error log states. It feels to me like whoever is calling the API (probably the webhook pod or cert-manager) is not running with a specific service account, because it tries to use the anonymous user:

"user":"system:anonymous","group":["system:unauthenticated"]}

Do you have any idea how the validation process works? I read the "How it works" section but didn't find relevant info.
On the other hand, I probably won't be spending much more time on this now, but maybe this will help other people if they encounter similar issues.

yujunz commented Apr 15, 2019

> TL;DR:
> There was a firewall rule missing. Allow Kubernetes master (network) to access the cert-manager-webhook pod on port 6443.

This hint helped me solve the problem.

Some notes here: for clusters created by kops on AWS, cross-subnet mode may need to be enabled explicitly when using Calico.

apiVersion: kops/v1alpha2
kind: Cluster
metadata:
  name: k8s.local
spec:
  networking:
    # Ref: https://github.com/kubernetes/kops/blob/master/docs/networking.md#enable-cross-subnet-mode-in-calico-aws-only
    calico:
      crossSubnet: true