
Webhook issues: InternalError (failed calling webhook "ipaddresspoolvalidationwebhook.metallb.io") #1597

Closed
fedepaol opened this issue Sep 7, 2022 · 57 comments


@fedepaol
Member

fedepaol commented Sep 7, 2022

This is an umbrella issue to provide troubleshooting information and to collect all the webhook-related issues:

#1563
#1547
#1540

A very good guide that also applies to MetalLB is https://hackmd.io/@maelvls/debug-cert-manager-webhook
Please note that the service name / webhook name may differ slightly when consuming the Helm charts or the manifests.

Given a webhook failure, one must check:

If the MetalLB controller is running and the endpoints of the service are healthy:

kubectl get endpoints -n metallb-system
NAME              ENDPOINTS         AGE
webhook-service   10.244.2.2:9443   4h32m

If the caBundle is generated and the configuration is patched properly

To get the caBundle used by the webhooks:

kubectl get validatingwebhookconfiguration metallb-webhook-configuration -ojsonpath='{.webhooks[0].clientConfig.caBundle}' | base64 -d

To get the caBundle from the secret:

kubectl -n metallb-system get secret webhook-server-cert -ojsonpath='{.data.ca\.crt}' | base64 -d

The caBundle in the webhook configuration and the one in the secret must match, and the raw value you get from kubectl -n metallb-system get secret webhook-server-cert -ojsonpath='{.data.ca\.crt}' must be different from the default dummy one, which can be found here:

caBundle: LS0tLS1CRUdJTiBDRVJUSUZJQ0FURS0tLS0tDQpNSUlGWlRDQ0EwMmdBd0lCQWdJVU5GRW1XcTM3MVpKdGkrMmlSQzk1WmpBV1MxZ3dEUVlKS29aSWh2Y05BUUVMDQpCUUF3UWpFTE1Ba0dBMVVFQmhNQ1dGZ3hGVEFUQmdOVkJBY01ERVJsWm1GMWJIUWdRMmwwZVRFY01Cb0dBMVVFDQpDZ3dUUkdWbVlYVnNkQ0JEYjIxd1lXNTVJRXgwWkRBZUZ3MHlNakEzTVRrd09UTXlNek5hRncweU1qQTRNVGd3DQpPVE15TXpOYU1FSXhDekFKQmdOVkJBWVRBbGhZTVJVd0V3WURWUVFIREF4RVpXWmhkV3gwSUVOcGRIa3hIREFhDQpCZ05WQkFvTUUwUmxabUYxYkhRZ1EyOXRjR0Z1ZVNCTWRHUXdnZ0lpTUEwR0NTcUdTSWIzRFFFQkFRVUFBNElDDQpEd0F3Z2dJS0FvSUNBUUNxVFpxMWZRcC9vYkdlenhES0o3OVB3Ny94azJwellualNzMlkzb1ZYSm5sRmM4YjVlDQpma2ZZQnY2bndscW1keW5PL2phWFBaQmRQSS82aFdOUDBkdVhadEtWU0NCUUpyZzEyOGNXb3F0MGNTN3pLb1VpDQpvcU1tQ0QvRXVBeFFNZjhRZDF2c1gvVllkZ0poVTZBRXJLZEpIaXpFOUJtUkNkTDBGMW1OVW55Rk82UnRtWFZUDQpidkxsTDVYeTc2R0FaQVBLOFB4aVlDa0NtbDdxN0VnTWNiOXlLWldCYmlxQ3VkTXE5TGJLNmdKNzF6YkZnSXV4DQo1L1pXK2JraTB2RlplWk9ZODUxb1psckFUNzJvMDI4NHNTWW9uN0pHZVZkY3NoUnh5R1VpSFpSTzdkaXZVTDVTDQpmM2JmSDFYbWY1ZDQzT0NWTWRuUUV2NWVaOG8zeWVLa3ZrbkZQUGVJMU9BbjdGbDlFRVNNR2dhOGFaSG1URSttDQpsLzlMSmdDYjBnQmtPT0M0WnV4bWh2aERKV1EzWnJCS3pMQlNUZXN0NWlLNVlwcXRWVVk2THRyRW9FelVTK1lsDQpwWndXY2VQWHlHeHM5ZURsR3lNVmQraW15Y3NTU1UvVno2Mmx6MnZCS21NTXBkYldDQWhud0RsRTVqU2dyMjRRDQp0eGNXLys2N3d5KzhuQlI3UXdqVTFITndVRjBzeERWdEwrZ1NHVERnSEVZSlhZelYvT05zMy94TkpoVFNPSkxNDQpoeXNVdyttaGdackdhbUdXcHVIVU1DUitvTWJzMTc1UkcrQjJnUFFHVytPTjJnUTRyOXN2b0ZBNHBBQm8xd1dLDQpRYjRhY3pmeVVscElBOVFoSmFsZEY3S3dPSHVlV3gwRUNrNXg0T2tvVDBvWVp0dzFiR0JjRGtaSmF3SURBUUFCDQpvMU13VVRBZEJnTlZIUTRFRmdRVW90UlNIUm9IWTEyRFZ4R0NCdEhpb1g2ZmVFQXdId1lEVlIwakJCZ3dGb0FVDQpvdFJTSFJvSFkxMkRWeEdDQnRIaW9YNmZlRUF3RHdZRFZSMFRBUUgvQkFVd0F3RUIvekFOQmdrcWhraUc5dzBCDQpBUXNGQUFPQ0FnRUFSbkpsWWRjMTFHd0VxWnh6RDF2R3BDR2pDN2VWTlQ3aVY1d3IybXlybHdPYi9aUWFEa0xYDQpvVStaOVVXT1VlSXJTdzUydDdmQUpvVVAwSm5iYkMveVIrU1lqUGhvUXNiVHduOTc2ZldBWTduM3FMOXhCd1Y0DQphek41OXNjeUp0dlhMeUtOL2N5ak1ReDRLajBIMFg0bWJ6bzVZNUtzWWtYVU0vOEFPdWZMcEd0S1NGVGgrSEFDDQpab1Q5YnZHS25adnNHd0tYZFF0Wnh0akhaUjVqK3U3ZGtQOTJBT051RFNabS8rWVV4b2tBK09JbzdSR3BwSHNXDQo1ZTdNY0FTVXRtb1FORXd6dVFoVkJaRWQ1OGtKYjUrV0VWbGNzanlXNnRTbzErZ25tTWNqR1BsMWgxR2hVbjV4DQpFY0lWRnBIWXM5YWo1NmpBSjk1MVQvZjhMaWxmTlVnanBLQ0c1bnl0SUt3emxhOHNtdGlPdm1UNEpYbXBwSkI2DQo4bmdHRVluVjUrUTYwWFJ2OEhSSGp1VG9CRHVhaERrVDA2R1JGODU1d09FR2V4bkZpMXZYWUxLVllWb1V2MXRKDQo4dVdUR1pwNllDSVJldlBqbzg5ZytWTlJSaVFYUThJd0dybXE5c0RoVTlqTjA0SjdVL1RvRDFpNHE3VnlsRUc5DQorV1VGNkNLaEdBeTJIaEhwVncyTGFoOS9lUzdZMUZ1YURrWmhPZG1laG1BOCtqdHNZamJadnR5Mm1SWlF0UUZzDQpUU1VUUjREbUR2bVVPRVRmeStpRHdzK2RkWXVNTnJGeVVYV2dkMnpBQU4ydVl1UHFGY2pRcFNPODFzVTJTU3R3DQoxVzAyeUtYOGJEYmZFdjBzbUh3UzliQnFlSGo5NEM1Mjg0YXpsdTBmaUdpTm1OUEM4ckJLRmhBPQ0KLS0tLS1FTkQgQ0VSVElGSUNBVEUtLS0tLQ==
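
A quick way to compare the two is sketched below (resource and secret names assume the manifest-based install; they may differ slightly with the Helm chart):

# dump both CA bundles and diff them; any difference means the injection did not happen
kubectl get validatingwebhookconfiguration metallb-webhook-configuration -ojsonpath='{.webhooks[0].clientConfig.caBundle}' | base64 -d > /tmp/webhook-ca.pem
kubectl -n metallb-system get secret webhook-server-cert -ojsonpath='{.data.ca\.crt}' | base64 -d > /tmp/secret-ca.pem
diff /tmp/webhook-ca.pem /tmp/secret-ca.pem && echo "caBundle matches"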

Test if the service is reachable from the apiserver node

Find the webhook service cluster ip:

kubectl get service -n metallb-system  webhook-service
NAME              TYPE        CLUSTER-IP     EXTERNAL-IP   PORT(S)   AGE
webhook-service   ClusterIP   10.96.50.216   <none>        443/TCP   4h15m

Fetch the caBundle:

kubectl -n metallb-system get secret webhook-server-cert -ojsonpath='{.data.ca\.crt}' | base64 -d > caBundle.pem

Move the caBundle.pem file to a node, and from the node try to curl the service, providing the resolution from the service FQDN to the service's ClusterIP (in this case, 10.96.50.216):

curl --cacert ./caBundle.pem --resolve webhook-service.metallb-system.svc:443:10.96.50.216 https://webhook-service.metallb-system.svc:443/validate-metallb-io-v1beta1-ipaddresspool

The expected result is the webhook complaining about missing content:

{"response":{"uid":"","allowed":false,"status":{"metadata":{},"message":"contentType=, expected application/json","code":400}}}

Getting that response guarantees that the certificate is valid.

In case the connection times out:

The instructions at https://hackmd.io/@maelvls/debug-cert-manager-webhook#Error-2-io-timeout can be followed.

Use tcpdump on port 443 to see if the traffic from the apiserver is directed to the endpoint (the controller pod's IP in this case).
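
For example, a minimal capture sketch to run on the node hosting the controller pod (interface and ports are assumptions: the service listens on 443 and the controller pod on 9443):

# watch for the apiserver's connection attempts towards the webhook
tcpdump -i any -nn 'tcp port 443 or tcp port 9443'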

How to disable the webhook

A very quick workaround is to disable the webhook, which requires changing its failurePolicy to Ignore.
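
A sketch of doing that with kubectl patch (the configuration name assumes the manifest-based install, and index 0 assumes the IPAddressPool webhook is the first entry; repeat for the other webhooks as needed):

kubectl patch validatingwebhookconfiguration metallb-webhook-configuration --type=json -p='[{"op": "replace", "path": "/webhooks/0/failurePolicy", "value": "Ignore"}]'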

@michaelvl

Hi,
My observation is that it's a bit random whether this occurs or not. Another observation is that, after seeing the issue, if I retry after some minutes the webhook often succeeds. This suggests that there is some race involved.

Pure speculation, but could the API server initially cache an invalid cert before the caBundle injection, so that waiting causes this cache to expire and the newly injected certificate to be loaded?

@koimad

koimad commented Sep 13, 2022

Hi,
I've got the same issue after installing a clean bare-metal Kubernetes cluster following this guide https://graspingtech.com/install-kubernetes-rhel-8/ and doing the basic MetalLB install as per the guide.

If I turn off the firewall on the worker nodes then all works well.

The firewall dropped logs report: [screenshot]

The configuration of the firewall: [screenshot]

The webhook service description: [screenshot]

I'm a beginner with Kubernetes and MetalLB, so any advice would be gratefully accepted.

Brian..


@elraro

elraro commented Sep 15, 2022

My problem was fixed. I had problems with the Helm deployment, MetalLB version 0.12, and outdated CRDs. Redeploying everything works perfectly.

Best regards.

@LarsBingBong

LarsBingBong commented Sep 16, 2022

I've experienced this when installing MetalLB v0.13.5 via Helm and then, as a post-chart-installation step, creating the advertisement and pool configs via CRD.

But, it doesn't always happen.

@fireflycons

Hi @fedepaol

I've done two installations of MetalLB today, both using the same version of the chart (the latest available at the time of writing).

  1. kubeadm cluster running 1.25 deployed on Hyper-V VMs running Ubuntu 22.04 - This works fine.
  2. Manually installed cluster running 1.24 on VirtualBox running Ubuntu 22.04 as per https://github.com/mmumshad/kubernetes-the-hard-way (which I currently maintain). This one borks with the IPAddressPools issue.

I've run the tests outlined above and got the expected response from curl. I'm not intending to run the second method, but more to demonstrate to KodeKloud students how one might configure nginx ingress in LoadBalancer mode on this hand-built cluster.

While we're here, can MetalLB be configured to get certificates from cert-manager? Happy to submit a PR on the chart if it should work, and is not implemented.

@fedepaol
Member Author

Hi @fireflycons , we had a discussion about the method to provide certificates to the webhooks, and we came to the conclusion that adding an additional dependency such as cert-manager was not ideal, so we ended up embedding https://github.com/open-policy-agent/cert-controller in metallb to perform that task.
So, right now cert-manager is not supported, and adding it would compete with the current method.

@fireflycons

Ok @fedepaol that's fine re cert-manager. At least I know how you're doing it now :-)

Any ideas on why the IPAddressPool isn't working in my virtualbox setup?

@fedepaol
Member Author

What do you mean by "the ipaddresspool isn't working"?

@fireflycons

I mean what this entire discussion is about. When I try to create the IPAddressPool and L2Advertisement resources, they time out.
As mentioned in my initial post, I performed all the steps outlined above and got the expected curl response, which, if I haven't read it incorrectly, means that the API server should be able to reach the webhook endpoint.

{"response":{"uid":"","allowed":false,"status":{"metadata":{},"message":"contentType=, expected application/json","code":400}}}

@fedepaol
Member Author

Ah, apologies. Can you check the logs of the apiserver?
Also, is the error always occurring or intermittent? And what is the error? Is the environment single-node?

@fireflycons

Hi @fedepaol

API server error is

{
    "kind": "Status",
    "apiVersion": "v1",
    "metadata": {},
    "status": "Failure",
    "message": "Internal error occurred: failed calling webhook \"ipaddresspoolvalidationwebhook.metallb.io\": failed to call webhook: Post \"https://metallb-webhook-service.metallb-system.svc:443/validate-metallb-io-v1beta1-ipaddresspool?timeout=10s\": context deadline exceeded",
    "reason": "InternalError",
    "details": {
        "causes": [
            {
                "message": "failed calling webhook \"ipaddresspoolvalidationwebhook.metallb.io\": failed to call webhook: Post \"https://metallb-webhook-service.metallb-system.svc:443/validate-metallb-io-v1beta1-ipaddresspool?timeout=10s\": context deadline exceeded"
            }
        ]
    },
    "code": 500
}

Now, the cluster is running the control plane as OS services (i.e. not kubeadm). Doesn't this usually mean that in-cluster webhooks need hostNetwork: true? It occurs to me that I did the curl test from inside a test pod, which of course was going to work!

I was unable to redeploy the controller on the host network as it has a port clash, I'm guessing with the speaker pod that's on the same worker.

This is the cluster configuration. Note that this is purely a learning exercise for K8s students, however some are asking to extend the exercise to deploy ingress and I thought it would be good to try it with MetalLB.

Note that I have successfully installed the same chart version on a kubeadm cluster.

@mschauf

mschauf commented Sep 29, 2022

Hello! I've deployed Metallb Operator on OKD 4.11 and I'm trying now to configure the address pool via:

apiVersion: metallb.io/v1beta1
kind: IPAddressPool
metadata:
  name: ip-addresspool-sample1
  namespace: metallb-system
spec:
  addresses:
    - xxx.xxx.xxx.xxx-xxx.xxx.xxx.xxx
  autoAssign: true

and, alternatively:

apiVersion: metallb.io/v1beta1
kind: AddressPool
metadata:
  name: ippool
  namespace: metallb-system
spec:
  addresses:
  - xxx.xxx.xxx.xxx-xxx.xxx.xxx.xxx
  autoAssign: false
  protocol: layer2

and I get the following error message:

Error "failed calling webhook "ipaddresspoolvalidationwebhook.metallb.io": failed to call webhook: Post "https://webhook-service.metallb-system.svc:443/validate-metallb-io-v1beta1-ipaddresspool?timeout=10s": x509: certificate is valid for metallb-operator-webhook-server-service.metallb-system, metallb-operator-webhook-server-service.metallb-system.svc, not webhook-service.metallb-system.svc" for field "undefined".

"NO_PROXY" also covers .svc

- name: NO_PROXY
  value: >-
    .cluster.local,.svc,10.0.0.0/16,10.128.0.0/14,127.0.0.1,172.30.0.0/16,api-int.domain.xxx,localhost

Already tried disabling the webhook by

How to disable the webhook
A very quick workaround is to disable the webhook, which requires changing its failurePolicy to failurePolicy=Ignore

which didn't work for me.

@pseymournutanix

pseymournutanix commented Oct 7, 2022

My secret and webhook don't have any values for caBundle and I get this error, on a clean deployment to a new cluster.

UPDATE: I deleted the controller pod and both the secret and the webhook certs were recreated.

UPDATE 2: A few hours later, both the secret and the webhook had no value for the caBundle again...

I suspect this is due to using ArgoCD via Helm and Kustomize. What is the best way to exclude these resources from ArgoCD syncing when rendering via those tools? Any help would be appreciated.
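
One pattern worth trying (an assumption on my part, not something verified in this thread) is telling ArgoCD to ignore the caBundle field that MetalLB's embedded cert-controller rewrites at runtime, so the sync doesn't keep reverting the injection, e.g. in the Application spec:

spec:
  ignoreDifferences:
    # hypothetical example: stop diffing the injected CA bundle on MetalLB's webhook config
    - group: admissionregistration.k8s.io
      kind: ValidatingWebhookConfiguration
      jsonPointers:
        - /webhooks/0/clientConfig/caBundle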

@hidalgopl

We started hitting this issue in our CI after migrating to v0.13.5. We're using a kind cluster, v1.21.
Code that worked before:

kubectl apply -f https://raw.githubusercontent.com/metallb/metallb/v0.12.1/manifests/namespace.yaml
kubectl apply -f https://raw.githubusercontent.com/metallb/metallb/v0.12.1/manifests/metallb.yaml
kubectl -n metallb-system wait deploy/controller --timeout=90s --for=condition=Available
kubectl apply -f ./metal_lb_cm.yaml

I noticed that in kind, the speaker has CreateContainerConfigError initially:

metallb-system       controller-6846c94466-bn2qx                0/1     Running                      0          15s
metallb-system       speaker-6t6x8                              0/1     CreateContainerConfigError   0          15s
metallb-system       speaker-6t6x8                              0/1     Running                      0          17s
metallb-system       controller-6846c94466-bn2qx                1/1     Running                      0          20s
metallb-system       speaker-6t6x8                              1/1     Running                      0          30s

The workaround that seems to be working so far for us is waiting until all pods (speaker & controller) are ready:

kubectl -n metallb-system wait pod --all --timeout=90s --for=condition=Ready
kubectl -n metallb-system wait deploy controller --timeout=90s --for=condition=Available
kubectl -n metallb-system wait apiservice v1beta1.metallb.io --timeout=90s --for=condition=Available
kubectl apply -f ./metal_lb_addrpool.yaml

our metal_lb_addrpool.yaml:

---
# MetalLB config
apiVersion: metallb.io/v1beta1
kind: IPAddressPool
metadata:
  name: example
  namespace: metallb-system
spec:
  addresses:
  - 172.18.255.200-172.18.255.255
---
apiVersion: metallb.io/v1beta1
kind: L2Advertisement
metadata:
  name: empty
  namespace: metallb-system

Posting as an FYI, in case anyone is looking for a workaround in kind.

@jsemohub

Experiencing the same issue with webhook failure.
MicroK8s v1.25.2 revision 4055
Is there a quick workaround to turn off webhooks?
Thank you.

@fedepaol
Member Author

Experiencing the same issue with webhook failure. MicroK8s v1.25.2 revision 4055. Is there a quick workaround to turn off webhooks? Thank you.

It's in the first post. Also, if you are using Helm to deploy, you can use this value https://github.com/metallb/metallb/blob/main/charts/metallb/values.yaml#L332 from the latest release.

@Zveroloff

Hello!

Same Issue,
Error from server (InternalError): error when creating "IPAddressPool.yaml": Internal error occurred: failed calling webhook "ipaddresspoolvalidationwebhook.metallb.io": failed to call webhook: Post "https://webhook-service.metallb-system.svc:443/validate-metallb-io-v1beta1-ipaddresspool?timeout=10s": Unknown Host

Env:
Bare-metal Kubernetes 1.25.2
Cilium CNI 1.13.0-rc1
3 nodes

@jmcgrath207

Originally, when I did a fresh install, I opted out of installing cilium-proxy and went with the default kube-proxy.

I tried everything under the sun for a month, but ultimately it came down to MetalLB not working with my kube-proxy, while it worked with cilium-proxy.

I never found the underlying error in the end, and I'm not saying kube-proxy doesn't work, but it did come down to something in my proxy configuration or node configuration.

Hope this helps.

@coopbri

coopbri commented Oct 23, 2022

Originally, when I did a fresh install, I opted out of installing cilium-proxy and went with the default kube-proxy.

I tried everything under the sun for a month, but ultimately it came down to MetalLB not working with my kube-proxy, while it worked with cilium-proxy.

I never found the underlying error in the end, and I'm not saying kube-proxy doesn't work, but it did come down to something in my proxy configuration or node configuration.

Hope this helps.

Similar situation here using kind. I tested the following configurations:

  • kube-proxy without Cilium: worked
  • kube-proxy with Cilium: did not work (received webhook error)
  • Cilium without kube-proxy: worked

So it does seem to be connected to the proxy networking in my case, as well as @jmcgrath207's.

@1guzzy

1guzzy commented Mar 22, 2023

Hello, I followed the guide in the first comment and everything worked, but I'm still getting the

Error from server (InternalError): error when creating "address-pool.yaml": Internal error occurred: failed calling webhook "ipaddresspoolvalidationwebhook.metallb.io": failed to call webhook: Post "https://webhook-service.metallb-system.svc:443/validate-metallb-io-v1beta1-ipaddresspool?timeout=10s": x509: certificate signed by unknown authority (possibly because of "crypto/rsa: verification error" while trying to verify candidate authority certificate "cert")

error. I want to try disabling the webhook, but I'm using MicroK8s and I'm not sure how to disable it there. I tried adding MetalLB with Helm instead, but it's not working.

I'm pretty new to Kubernetes; any help would be greatly appreciated.

@lgehrke6

I have a 3-node cluster running and got the following messages when I tried to apply IPAddressPools and L2Advertisements.

Every pod is also running: [screenshot]

The errors changed once I disabled the firewall.

With active firewall:
Error from server (InternalError): error when creating "IpaddressPool.yaml": Internal error occurred: failed calling webhook "ipaddresspoolvalidationwebhook.metallb.io": failed to call webhook: Post "https://webhook-service.metallb-system.svc:443/validate-metallb-io-v1beta1-ipaddresspool?timeout=10s": context deadline exceeded
Error from server (InternalError): error when creating "IpaddressPool.yaml": Internal error occurred: failed calling webhook "ipaddresspoolvalidationwebhook.metallb.io": failed to call webhook: Post "https://webhook-service.metallb-system.svc:443/validate-metallb-io-v1beta1-ipaddresspool?timeout=10s": context deadline exceeded
Error from server (InternalError): error when creating "IpaddressPool.yaml": Internal error occurred: failed calling webhook "l2advertisementvalidationwebhook.metallb.io": failed to call webhook: Post "https://webhook-service.metallb-system.svc:443/validate-metallb-io-v1beta1-l2advertisement?timeout=10s": context deadline exceeded


With disabled firewall:
Error from server (InternalError): error when creating "IpaddressPool.yaml": Internal error occurred: failed calling webhook "ipaddresspoolvalidationwebhook.metallb.io": failed to call webhook: Post "https://webhook-service.metallb-system.svc:443/validate-metallb-io-v1beta1-ipaddresspool?timeout=10s": tls: failed to verify certificate: x509: certificate signed by unknown authority (possibly because of "crypto/rsa: verification error" while trying to verify candidate authority certificate "cert")
Error from server (InternalError): error when creating "IpaddressPool.yaml": Internal error occurred: failed calling webhook "ipaddresspoolvalidationwebhook.metallb.io": failed to call webhook: Post "https://webhook-service.metallb-system.svc:443/validate-metallb-io-v1beta1-ipaddresspool?timeout=10s": tls: failed to verify certificate: x509: certificate signed by unknown authority (possibly because of "crypto/rsa: verification error" while trying to verify candidate authority certificate "cert")
Error from server (InternalError): error when creating "IpaddressPool.yaml": Internal error occurred: failed calling webhook "l2advertisementvalidationwebhook.metallb.io": failed to call webhook: Post "https://webhook-service.metallb-system.svc:443/validate-metallb-io-v1beta1-l2advertisement?timeout=10s": tls: failed to verify certificate: x509: certificate signed by unknown authority (possibly because of "crypto/rsa: verification error" while trying to verify candidate authority certificate "cert")

@aleksandrov

aleksandrov commented May 6, 2023

Hi all, I performed the checks from the initial post. I checked every node in the cluster: all good with the certificate:

$ curl --cacert /tmp/caBundle.pem --resolve webhook-service.metallb-system.svc:443:10.152.183.193 https://webhook-service.metallb-system.svc:443/validate-metallb-io-v1beta1-ipaddresspool
{"response":{"uid":"","allowed":false,"status":{"metadata":{},"message":"contentType=, expected application/json","code":400}}}

but API Server is still unable to make a call:

Failed calling webhook, failing closed ipaddresspoolvalidationwebhook.metallb.io: failed calling webhook "ipaddresspoolvalidationwebhook.metallb.io": failed to call webhook: Post "https://webhook-service.metallb-system.svc:443/validate-metallb-io-v1beta1-ipaddresspool?timeout=10s": x509: certificate signed by unknown authority (possibly because of "crypto/rsa: verification error" while trying to verify candidate authority certificate "cert")

Could someone please suggest what else to check? I'm happy to provide logs to get this certs issue finally sorted out.

@chrismedinapy

chrismedinapy commented May 6, 2023

Have you tried disabling the firewalls on your nodes?

@cristian-corbu

cristian-corbu commented May 9, 2023

I fixed this issue by scheduling the controller pod to the master node.
Follow these steps to force pods deployment to master node:
https://stackoverflow.com/questions/41999756/how-to-force-pods-deployments-to-master-nodes

@seyfettinover

seyfettinover commented Jun 1, 2023

To resolve this issue, you can try the following steps:
Verify that the webhook service is deployed and running in the metallb-system namespace. You can use the following command to check the status of the webhook service:

kubectl get pods -n metallb-system

Look for a pod with a name similar to metallb-controller-xxxxx to confirm if it's running.
If the webhook service is not running, you may need to redeploy it. You can do this by deleting the existing webhook resources and letting them be recreated. Use the following command:

kubectl delete validatingwebhookconfigurations.admissionregistration.k8s.io metallb-webhook-config

If you get an error, you may need to change the resource name at the end of the command above to this:

kubectl delete validatingwebhookconfigurations.admissionregistration.k8s.io metallb-webhook-configuration

After all of this, you can apply metallb-adrpool.yaml and metallb-12.yaml.

This is the content of my YAML files:
metallb-adrpool.yaml

apiVersion: metallb.io/v1beta1
kind: IPAddressPool
metadata:
  name: first-pool
  namespace: metallb-system
spec:
  addresses:
  - 10.1.81.151-10.1.81.155

kubectl apply -f metallb-adrpool.yaml

metallb-12.yaml

apiVersion: metallb.io/v1beta1
kind: L2Advertisement
metadata:
  name: example
  namespace: metallb-system

kubectl apply -f metallb-12.yaml

@licecil

licecil commented Jun 8, 2023

For v0.13.10, setting some webhook configs to Ignore works for me;
the changes to metallb-native.yaml are:

2009 path: /validate-metallb-io-v1beta1-ipaddresspool
2010 failurePolicy: Ignore

2029 path: /validate-metallb-io-v1beta1-l2advertisement
2030 failurePolicy: Ignore
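
For reference, a rough sketch of how the corresponding entries in the ValidatingWebhookConfiguration look after that change (fields abbreviated; the exact layout depends on your metallb-native.yaml version):

# excerpt of the webhooks list in metallb-webhook-configuration
- name: ipaddresspoolvalidationwebhook.metallb.io
  failurePolicy: Ignore
  clientConfig:
    service:
      name: webhook-service
      namespace: metallb-system
      path: /validate-metallb-io-v1beta1-ipaddresspool
- name: l2advertisementvalidationwebhook.metallb.io
  failurePolicy: Ignore
  clientConfig:
    service:
      name: webhook-service
      namespace: metallb-system
      path: /validate-metallb-io-v1beta1-l2advertisement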

@github-actions

github-actions bot commented Jul 9, 2023

This issue has been automatically marked as stale because it has been open 30 days
with no activity. This issue will be closed in 10 days unless you do one of the following:

  • respond to this issue
  • have one of these labels applied: bug,good first issue,help wanted,hold,enhancement,documentation,question

@laundry-96

laundry-96 commented Jul 15, 2023

I fixed this issue by scheduling the controller pod to the master node. Follow these steps to force pods deployment to master node: https://stackoverflow.com/questions/41999756/how-to-force-pods-deployments-to-master-nodes

This is what fixed it for me!

kubectl edit deploy metallb-controller [-n <namespace_if_you_have_one>]

Add

nodeName: <kubernetes master>

to the pod spec
https://kubernetes.io/docs/concepts/scheduling-eviction/assign-pod-node/#nodename
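
A minimal sketch of where that goes in the controller Deployment's pod template (the node name below is a placeholder for your control-plane node):

spec:
  template:
    spec:
      # hypothetical node name; use `kubectl get nodes` to find your control-plane node
      nodeName: my-control-plane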

@yurikilian

I fixed this issue by scheduling the controller pod to the master node. Follow these steps to force pods deployment to master node: https://stackoverflow.com/questions/41999756/how-to-force-pods-deployments-to-master-nodes

This is what fixed it for me!

kubectl edit deploy metallb-controller [-n <namespace_if_you_have_one>]

Add

nodeName: <kubernetes master>

to the pod spec https://kubernetes.io/docs/concepts/scheduling-eviction/assign-pod-node/#nodename

For me it also worked. But why?

@dtufood-kihen

I've encountered an issue that seems related to the problem discussed here, specifically while configuring MetalLB on an RKE2 cluster running in VMs provisioned by Vagrant on my local machine (VirtualBox).

"message":"failed calling webhook "ipaddresspoolvalidationwebhook.metallb.io": failed to call webhook: Post "https://metallb-webhook-service.metallb-system.svc:443/validate-metallb-io-v1beta1-ipaddresspool?timeout=10s": context deadline exceeded"

The VMs have two network interfaces: the first one is a NAT interface that Vagrant always configures (including SSH port forwarding rules for provisioning), and I've added a second interface, which is bridged to my host machine's physical interface to make the VMs part of my local subnet.

When I shut down the VMs and remove the NAT interface, leaving only one interface, everything works as expected.

I'm eager to understand why this might be happening and how to overcome this limitation. Any insights or potential fixes would be greatly appreciated.

@willzhang

I managed to fix my error "Service Unavailable",

Internal error occurred: failed calling webhook "ipaddresspoolvalidationwebhook.metallb.io": failed to call webhook: Post "https://webhook-service.metallb-system.svc:443/validate-metallb-io-v1beta1-ipaddresspool?timeout=10s": Service Unavailable

by modifying /etc/kubernetes/manifests/kube-apiserver.yaml:

  • adding .svc to the no_proxy environment variable

Same problem here.

# kubectl apply -f IPAddressPool.yaml
Error from server (InternalError): error when creating "IPAddressPool.yaml": Internal error occurred: failed calling webhook "ipaddresspoolvalidationwebhook.metallb.io": failed to call webhook: Post "https://metallb-webhook-service.metallb-system.svc:443/validate-metallb-io-v1beta1-ipaddresspool?timeout=10s": EOF
Error from server (InternalError): error when creating "IPAddressPool.yaml": Internal error occurred: failed calling webhook "l2advertisementvalidationwebhook.metallb.io": failed to call webhook: Post "https://metallb-webhook-service.metallb-system.svc:443/validate-metallb-io-v1beta1-l2advertisement?timeout=10s": EOF

I find I have this proxy config:

# cat /etc/kubernetes/manifests/kube-apiserver.yaml
apiVersion: v1
kind: Pod
metadata:
  annotations:
........
    env:
    - name: no_proxy
      value: localhost,127.0.0.1,10.0.0.0/8,172.16.0.0/12,192.168.0.0/16,172.17.0.1,.svc.cluster.local,apiserver.cluster.local,100.64.0.0/10
    - name: ftp_proxy
      value: http://192.168.72.1:7890
    - name: https_proxy
      value: http://192.168.72.1:7890
    - name: NO_PROXY
      value: localhost,127.0.0.1,10.0.0.0/8,172.16.0.0/12,192.168.0.0/16,172.17.0.1,.svc.cluster.local,apiserver.cluster.local,100.64.0.0/10
    - name: FTP_PROXY
      value: http://192.168.72.1:7890
    - name: HTTPS_PROXY
      value: http://192.168.72.1:7890
    - name: HTTP_PROXY
      value: http://192.168.72.1:7890
    - name: http_proxy
      value: http://192.168.72.1:7890

When I delete them, everything is OK:

# kubectl apply -f IPAddressPool.yaml
ipaddresspool.metallb.io/first-pool created
l2advertisement.metallb.io/l2 created

@bhudgens

bhudgens commented Sep 3, 2023

I fixed this issue by scheduling the controller pod to the master node.
Follow these steps to force pods deployment to master node:
https://stackoverflow.com/questions/41999756/how-to-force-pods-deployments-to-master-nodes

My setup is a kubeadm cluster on Raspberry Pis, with a default Helm installation of MetalLB. I was getting the following error trying to apply configs:

Error from server (InternalError): error when creating "STDIN": Internal error occurred: failed calling webhook "ipaddresspoolvalidationwebhook.metallb.io": failed to call webhook: Post "https://metallb-webhook-service.metallb-system.svc:443/validate-metallb-io-v1beta1-ipaddresspool?timeout=10s": dial tcp 10.98.186.109:443: connect: connection refused
Error from server (InternalError): error when creating "STDIN": Internal error occurred: failed calling webhook "l2advertisementvalidationwebhook.metallb.io": failed to call webhook: Post "https://metallb-webhook-service.metallb-system.svc:443/validate-metallb-io-v1beta1-l2advertisement?timeout=10s": dial tcp 10.98.186.109:443: connect: connection refused

Your solution fixed this for me. Specifically:

#1547 (comment)

I fixed this issue by scheduling the controller pod to the master node. Follow these steps to force pods deployment to master node: https://stackoverflow.com/questions/41999756/how-to-force-pods-deployments-to-master-nodes

This is what fixed it for me!

kubectl edit deploy metallb-controller [-n <namespace_if_you_have_one>]

Add

nodeName: <kubernetes master>

to the pod spec https://kubernetes.io/docs/concepts/scheduling-eviction/assign-pod-node/#nodename

I really wish we had a bit more understanding of why running the pod on the master node fixes this issue. I do not have firewalls installed on my worker nodes. In either case, thank you @bejay88!

@wgnathanael

wgnathanael commented Sep 6, 2023

OK, so we tried a few different things to get this to work and I thought I'd report on how we've worked around the issue. First, some context.

We deploy MetalLB via Helm via Ansible.

  1. Our first theory was that there was a timing issue.
    1. So the first thing we tried was having the Ansible task that installs the configuration for the IPAddressPool and L2Advertisement retry a number of times. That seemed to work; however, later on it didn't. We had set 3 retries with 10s. Later, while debugging the issue (after the other solution we tried, detailed below), I set 120 retries with 30 seconds and it basically never succeeded (the test above of calling the webhook from a node wouldn't respond at all).
    2. We also added wait: true to the Helm deployment.
  2. We then brought these configuration steps into the Helm step by putting the config in a template and adding annotations to make Helm apply it as a post-install step. This worked sometimes but not others. We thought again it was a timing issue. However, upon further searching we found this, which showed that Helm can't actually wait for the service to be ready.
  3. In this thread we saw mentions of making the controller run on the control-plane node as a solution. In trying to validate this, I noticed that when our Ansible task failed even with hundreds of retries, the controller/webhook was running on an agent/worker node, and when it succeeded it was running on the control-plane node. If I cordoned off the agent and deleted the controller, it came back up on the control-plane node, and then the next retry in the Ansible task immediately succeeded.

So I don't know why it needs to be on the control plane, but the solution for us was to change the values.yaml file in the Helm chart to add the following to the controller nodeSelector section:

  nodeSelector: { node-role.kubernetes.io/control-plane: "true" }

It was around line 230 in the version of the values.yaml file we have, which isn't heavily modified.
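
For context, a rough sketch of how that nests in the chart's values.yaml (layout assumed from the chart version we use; the toleration is an assumption, only needed if your control-plane nodes carry the usual NoSchedule taint):

controller:
  nodeSelector:
    node-role.kubernetes.io/control-plane: "true"
  # assumption: tolerate the control-plane taint so the pod can actually schedule there
  tolerations:
    - key: node-role.kubernetes.io/control-plane
      operator: Exists
      effect: NoSchedule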

I'd be really curious as to the actual reason it works on the control-plane node but not any others. I don't know if that is a legitimate bug or if something about our cluster is at fault for that.

@wgnathanael

OK, so upon further investigation this has turned out to be a firewall issue. Our Ansible scripts were doing the following:

- name: Open ports for Flannel/Calico
  ansible.posix.firewalld:
    port: "8472/udp"
    permanent: true
    state: enabled
  notify: Restart Firewalld

- name: Add cluster network to trusted zone
  ansible.posix.firewalld:
    zone: trusted
    permanent: true
    source: "{{ cluster_cidr }}"
    state: enabled
  notify: Restart Firewalld

- name: Add services network to trusted zone
  ansible.posix.firewalld:
    zone: trusted
    permanent: true
    source: "{{ service_cidr }}"
    state: enabled
  notify: Restart Firewalld

Once we ensured the firewall had been restarted prior to applying the L2Advertisement/IPAddressPool config, applying the config worked without fail. Our setup uses RKE2 1.25 and the firewall was set up per the guidelines.

@github-actions

github-actions bot commented Oct 9, 2023

This issue has been automatically marked as stale because it has been open 30 days
with no activity. This issue will be closed in 10 days unless you do one of the following:

  • respond to this issue
  • have one of these labels applied: bug,good first issue,help wanted,hold,enhancement,documentation,question

@github-actions

This issue was automatically closed because of lack of activity after being marked stale in the
past 10 days. If you feel this was done in error, please feel free to re-open this issue.

@Reedler01

Opening port 8472/udp on all nodes fixed it for me.
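
For example, a sketch of opening that port with firewalld (an assumption that firewalld is the node firewall; 8472/udp is the VXLAN port Flannel/Calico use, as in the Ansible snippet above):

# run on every node
firewall-cmd --permanent --add-port=8472/udp
firewall-cmd --reload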

@lindhe
Contributor

lindhe commented Mar 15, 2024

I too had this issue, and it turned out to be a firewall issue where the controller pod could not receive the webhook calls.

I use Rancher with project network isolation activated for my cluster. That creates a rule in each namespace that is supposed to allow all communication to/from the nodes (since that is required for many system functions, like webhooks). Unfortunately, I use Cilium, which has a bug that prevents such policies from working: cilium/cilium#12277

I debugged this by using Hubble to inspect the traffic flow:

hubble observe -f --namespace metallb-system --verdict DROPPED

$ hubble observe -f --namespace metallb-system --verdict DROPPED
Mar 15 09:03:23.154: 10.42.2.245:42494 (remote-node) <> metallb-system/controller-cdcfbc84c-kfk8b:9443 (ID:22370) policy-verdict:none INGRESS DENIED (TCP Flags: SYN)
Mar 15 09:03:23.154: 10.42.2.245:42494 (remote-node) <> metallb-system/controller-cdcfbc84c-kfk8b:9443 (ID:22370) Policy denied DROPPED (TCP Flags: SYN)
Mar 15 09:03:24.195: 10.42.2.245:42494 (remote-node) <> metallb-system/controller-cdcfbc84c-kfk8b:9443 (ID:22370) policy-verdict:none INGRESS DENIED (TCP Flags: SYN)
Mar 15 09:03:24.195: 10.42.2.245:42494 (remote-node) <> metallb-system/controller-cdcfbc84c-kfk8b:9443 (ID:22370) Policy denied DROPPED (TCP Flags: SYN)
Mar 15 09:03:26.243: 10.42.2.245:42494 (remote-node) <> metallb-system/controller-cdcfbc84c-kfk8b:9443 (ID:22370) policy-verdict:none INGRESS DENIED (TCP Flags: SYN)
Mar 15 09:03:26.243: 10.42.2.245:42494 (remote-node) <> metallb-system/controller-cdcfbc84c-kfk8b:9443 (ID:22370) Policy denied DROPPED (TCP Flags: SYN)
Mar 15 09:03:30.276: 10.42.2.245:42494 (remote-node) <> metallb-system/controller-cdcfbc84c-kfk8b:9443 (ID:22370) policy-verdict:none INGRESS DENIED (TCP Flags: SYN)
Mar 15 09:03:30.276: 10.42.2.245:42494 (remote-node) <> metallb-system/controller-cdcfbc84c-kfk8b:9443 (ID:22370) Policy denied DROPPED (TCP Flags: SYN)
Mar 15 09:03:38.403: 10.42.2.245:42494 (remote-node) <> metallb-system/controller-cdcfbc84c-kfk8b:9443 (ID:22370) policy-verdict:none INGRESS DENIED (TCP Flags: SYN)
Mar 15 09:03:38.403: 10.42.2.245:42494 (remote-node) <> metallb-system/controller-cdcfbc84c-kfk8b:9443 (ID:22370) Policy denied DROPPED (TCP Flags: SYN)
Mar 15 09:03:54.788: 10.42.2.245:42494 (remote-node) <> metallb-system/controller-cdcfbc84c-kfk8b:9443 (ID:22370) policy-verdict:none INGRESS DENIED (TCP Flags: SYN)
Mar 15 09:03:54.788: 10.42.2.245:42494 (remote-node) <> metallb-system/controller-cdcfbc84c-kfk8b:9443 (ID:22370) Policy denied DROPPED (TCP Flags: SYN)
Mar 15 09:04:27.043: 10.42.2.245:42494 (remote-node) <> metallb-system/controller-cdcfbc84c-kfk8b:9443 (ID:22370) policy-verdict:none INGRESS DENIED (TCP Flags: SYN)
Mar 15 09:04:27.043: 10.42.2.245:42494 (remote-node) <> metallb-system/controller-cdcfbc84c-kfk8b:9443 (ID:22370) Policy denied DROPPED (TCP Flags: SYN)

There I could identify that the dropped packets were ingress traffic to port 9443, so I created a NetworkPolicy rule as a workaround, and then everything worked as expected:

NetworkPolicy.yaml

---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-webhook-ingress
  namespace: metallb-system
spec:
  ingress:
    - ports:
        - port: 9443
          protocol: TCP
  podSelector:
    matchLabels:
      component: controller
  policyTypes:
    - Ingress

After applying that NetworkPolicy, I no longer got the failing webhook, and so I could update MetalLB again.

I see that this issue has been closed due to low activity. Maybe it should be reopened?

Including a NetworkPolicy to fix this would be nice, but it's hard to know if that should be the responsibility of MetalLB or the platform team that implemented the CNI in the cluster where MetalLB gets deployed.

@sammagnet7

I fixed this issue by scheduling the controller pod to the master node. Follow these steps to force pods deployment to master node: https://stackoverflow.com/questions/41999756/how-to-force-pods-deployments-to-master-nodes

This is what fixed it for me!

kubectl edit deploy metallb-controller [-n <namespace_if_you_have_one>]

Add

nodeName: <kubernetes master>

to the pod spec https://kubernetes.io/docs/concepts/scheduling-eviction/assign-pod-node/#nodename

This approach worked for me also. Thanks. It's like a Saviour.
