
Webhook issues: InternalError (failed calling webhook "ipaddresspoolvalidationwebhook.metallb.io") #1597

Closed
fedepaol opened this issue Sep 7, 2022 · 57 comments


@fedepaol
Member

fedepaol commented Sep 7, 2022

This is an umbrella issue to provide troubleshooting information and to collect all the webhook-related issues:

#1563
#1547
#1540

A very good guide that also applies to MetalLB is https://hackmd.io/@maelvls/debug-cert-manager-webhook
Please note that the service name / webhook name may differ slightly when consuming the Helm charts or the manifests.

Given a webhook failure, one must check:

If the MetalLB controller is running and the endpoints of the service are healthy:

kubectl get endpoints -n metallb-system
NAME              ENDPOINTS         AGE
webhook-service   10.244.2.2:9443   4h32m

If the caBundle is generated and the configuration is patched properly

To get the caBundle used by the webhooks:

kubectl get validatingwebhookconfiguration metallb-webhook-configuration -ojsonpath='{.webhooks[0].clientConfig.caBundle}' | base64 -d

To get the caBundle from the secret:

kubectl -n metallb-system get secret webhook-server-cert -ojsonpath='{.data.ca\.crt}' | base64 -d

The caBundle in the webhook configuration and the one in the secret must match, and the raw value you get from kubectl -n metallb-system get secret webhook-server-cert -ojsonpath='{.data.ca\.crt}' must be different from the default dummy one, which can be found here:

caBundle: LS0tLS1CRUdJTiBDRVJUSUZJQ0FURS0tLS0tDQpNSUlGWlRDQ0EwMmdBd0lCQWdJVU5GRW1XcTM3MVpKdGkrMmlSQzk1WmpBV1MxZ3dEUVlKS29aSWh2Y05BUUVMDQpCUUF3UWpFTE1Ba0dBMVVFQmhNQ1dGZ3hGVEFUQmdOVkJBY01ERVJsWm1GMWJIUWdRMmwwZVRFY01Cb0dBMVVFDQpDZ3dUUkdWbVlYVnNkQ0JEYjIxd1lXNTVJRXgwWkRBZUZ3MHlNakEzTVRrd09UTXlNek5hRncweU1qQTRNVGd3DQpPVE15TXpOYU1FSXhDekFKQmdOVkJBWVRBbGhZTVJVd0V3WURWUVFIREF4RVpXWmhkV3gwSUVOcGRIa3hIREFhDQpCZ05WQkFvTUUwUmxabUYxYkhRZ1EyOXRjR0Z1ZVNCTWRHUXdnZ0lpTUEwR0NTcUdTSWIzRFFFQkFRVUFBNElDDQpEd0F3Z2dJS0FvSUNBUUNxVFpxMWZRcC9vYkdlenhES0o3OVB3Ny94azJwellualNzMlkzb1ZYSm5sRmM4YjVlDQpma2ZZQnY2bndscW1keW5PL2phWFBaQmRQSS82aFdOUDBkdVhadEtWU0NCUUpyZzEyOGNXb3F0MGNTN3pLb1VpDQpvcU1tQ0QvRXVBeFFNZjhRZDF2c1gvVllkZ0poVTZBRXJLZEpIaXpFOUJtUkNkTDBGMW1OVW55Rk82UnRtWFZUDQpidkxsTDVYeTc2R0FaQVBLOFB4aVlDa0NtbDdxN0VnTWNiOXlLWldCYmlxQ3VkTXE5TGJLNmdKNzF6YkZnSXV4DQo1L1pXK2JraTB2RlplWk9ZODUxb1psckFUNzJvMDI4NHNTWW9uN0pHZVZkY3NoUnh5R1VpSFpSTzdkaXZVTDVTDQpmM2JmSDFYbWY1ZDQzT0NWTWRuUUV2NWVaOG8zeWVLa3ZrbkZQUGVJMU9BbjdGbDlFRVNNR2dhOGFaSG1URSttDQpsLzlMSmdDYjBnQmtPT0M0WnV4bWh2aERKV1EzWnJCS3pMQlNUZXN0NWlLNVlwcXRWVVk2THRyRW9FelVTK1lsDQpwWndXY2VQWHlHeHM5ZURsR3lNVmQraW15Y3NTU1UvVno2Mmx6MnZCS21NTXBkYldDQWhud0RsRTVqU2dyMjRRDQp0eGNXLys2N3d5KzhuQlI3UXdqVTFITndVRjBzeERWdEwrZ1NHVERnSEVZSlhZelYvT05zMy94TkpoVFNPSkxNDQpoeXNVdyttaGdackdhbUdXcHVIVU1DUitvTWJzMTc1UkcrQjJnUFFHVytPTjJnUTRyOXN2b0ZBNHBBQm8xd1dLDQpRYjRhY3pmeVVscElBOVFoSmFsZEY3S3dPSHVlV3gwRUNrNXg0T2tvVDBvWVp0dzFiR0JjRGtaSmF3SURBUUFCDQpvMU13VVRBZEJnTlZIUTRFRmdRVW90UlNIUm9IWTEyRFZ4R0NCdEhpb1g2ZmVFQXdId1lEVlIwakJCZ3dGb0FVDQpvdFJTSFJvSFkxMkRWeEdDQnRIaW9YNmZlRUF3RHdZRFZSMFRBUUgvQkFVd0F3RUIvekFOQmdrcWhraUc5dzBCDQpBUXNGQUFPQ0FnRUFSbkpsWWRjMTFHd0VxWnh6RDF2R3BDR2pDN2VWTlQ3aVY1d3IybXlybHdPYi9aUWFEa0xYDQpvVStaOVVXT1VlSXJTdzUydDdmQUpvVVAwSm5iYkMveVIrU1lqUGhvUXNiVHduOTc2ZldBWTduM3FMOXhCd1Y0DQphek41OXNjeUp0dlhMeUtOL2N5ak1ReDRLajBIMFg0bWJ6bzVZNUtzWWtYVU0vOEFPdWZMcEd0S1NGVGgrSEFDDQpab1Q5YnZHS25adnNHd0tYZFF0Wnh0akhaUjVqK3U3ZGtQOTJBT051RFNabS8rWVV4b2tBK09JbzdSR3BwSHNXDQo1ZTdNY0FTVXRtb1FORXd6dVFoVkJaRWQ1OGtKYjUrV0VWbGNzanlXNnRTbzErZ25tTWNqR1BsMWgxR2hVbjV4DQpFY0lWRnBIWXM5YWo1NmpBSjk1MVQvZjhMaWxmTlVnanBLQ0c1bnl0SUt3emxhOHNtdGlPdm1UNEpYbXBwSkI2DQo4bmdHRVluVjUrUTYwWFJ2OEhSSGp1VG9CRHVhaERrVDA2R1JGODU1d09FR2V4bkZpMXZYWUxLVllWb1V2MXRKDQo4dVdUR1pwNllDSVJldlBqbzg5ZytWTlJSaVFYUThJd0dybXE5c0RoVTlqTjA0SjdVL1RvRDFpNHE3VnlsRUc5DQorV1VGNkNLaEdBeTJIaEhwVncyTGFoOS9lUzdZMUZ1YURrWmhPZG1laG1BOCtqdHNZamJadnR5Mm1SWlF0UUZzDQpUU1VUUjREbUR2bVVPRVRmeStpRHdzK2RkWXVNTnJGeVVYV2dkMnpBQU4ydVl1UHFGY2pRcFNPODFzVTJTU3R3DQoxVzAyeUtYOGJEYmZFdjBzbUh3UzliQnFlSGo5NEM1Mjg0YXpsdTBmaUdpTm1OUEM4ckJLRmhBPQ0KLS0tLS1FTkQgQ0VSVElGSUNBVEUtLS0tLQ==
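
A quick way to compare the two is sketched below (resource and secret names assume the manifest-based install; they may differ slightly with the Helm chart):

# dump both CA bundles and diff them; any difference means the injection did not happen
kubectl get validatingwebhookconfiguration metallb-webhook-configuration -ojsonpath='{.webhooks[0].clientConfig.caBundle}' | base64 -d > /tmp/webhook-ca.pem
kubectl -n metallb-system get secret webhook-server-cert -ojsonpath='{.data.ca\.crt}' | base64 -d > /tmp/secret-ca.pem
diff /tmp/webhook-ca.pem /tmp/secret-ca.pem && echo "caBundle matches"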

Test if the service is reachable from the apiserver node

Find the webhook service cluster ip:

kubectl get service -n metallb-system  webhook-service
NAME              TYPE        CLUSTER-IP     EXTERNAL-IP   PORT(S)   AGE
webhook-service   ClusterIP   10.96.50.216   <none>        443/TCP   4h15m

Fetch the caBundle:

kubectl -n metallb-system get secret webhook-server-cert -ojsonpath='{.data.ca\.crt}' | base64 -d > caBundle.pem

Move the caBundle.pem file to a node, and from the node try to curl the service, providing the resolution from the service FQDN to the service's ClusterIP (in this case, 10.96.50.216):

curl --cacert ./caBundle.pem --resolve webhook-service.metallb-system.svc:443:10.96.50.216 https://webhook-service.metallb-system.svc:443/validate-metallb-io-v1beta1-ipaddresspool

The expected result is the webhook complaining about missing content:

{"response":{"uid":"","allowed":false,"status":{"metadata":{},"message":"contentType=, expected application/json","code":400}}}

Getting that response guarantees that the certificate is valid.

In case the connection times out:

The instructions at https://hackmd.io/@maelvls/debug-cert-manager-webhook#Error-2-io-timeout can be followed.

Use tcpdump on port 443 to see if the traffic from the apiserver is directed to the endpoint (the controller pod's IP in this case).
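
For example, a minimal capture sketch to run on the node hosting the controller pod (interface and ports are assumptions: the service listens on 443 and the controller pod on 9443):

# watch for the apiserver's connection attempts towards the webhook
tcpdump -i any -nn 'tcp port 443 or tcp port 9443'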

How to disable the webhook

A very quick workaround is to disable the webhook, which requires changing its failurePolicy to Ignore.
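
A sketch of doing that with kubectl patch (the configuration name assumes the manifest-based install, and index 0 assumes the IPAddressPool webhook is the first entry; repeat for the other webhooks as needed):

kubectl patch validatingwebhookconfiguration metallb-webhook-configuration --type=json -p='[{"op": "replace", "path": "/webhooks/0/failurePolicy", "value": "Ignore"}]'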

@michaelvl

Hi,
My observation is that it's a bit random whether this occurs or not. Another observation is that, after seeing the issue, if I retry after some minutes the webhook often succeeds. This suggests that there is some race involved.

Pure speculation, but could the API server initially cache an invalid cert before the caBundle injection, so that waiting causes this cache to expire and the newly injected certificate to be loaded?

@koimad

koimad commented Sep 13, 2022

Hi,
I've got the same issue after installing a clean bare-metal Kubernetes cluster following this guide https://graspingtech.com/install-kubernetes-rhel-8/ and doing the basic MetalLB install as per the guide.

If I turn off the firewall on the worker nodes then all works well.

The firewall dropped logs report: [screenshot]

The configuration of the firewall: [screenshot]

The webhook service description: [screenshot]

I'm a beginner with Kubernetes and MetalLB, so any advice would be gratefully accepted.

Brian..


@elraro

elraro commented Sep 15, 2022

My problem was fixed. I had problems with the Helm deployment, MetalLB version 0.12, and outdated CRDs. Redeploying everything works perfectly.

Best regards.

@LarsBingBong

LarsBingBong commented Sep 16, 2022

I've experienced this when installing MetalLB v0.13.5 via Helm and then, as a post-chart-installation step, creating the advertisement and pool configs via CRD.

But, it doesn't always happen.

@fireflycons

Hi @fedepaol

I've done two installations of MetalLB today, both using the same version of the chart (the latest available at the time of writing).

  1. kubeadm cluster running 1.25 deployed on Hyper-V VMs running Ubuntu 22.04 - This works fine.
  2. Manually installed cluster running 1.24 on VirtualBox running Ubuntu 22.04 as per https://github.com/mmumshad/kubernetes-the-hard-way (which I currently maintain). This one borks with the IPAddressPools issue.

I've run the tests outlined above and got the expected response from curl. I'm not intending to run the second method, but more to demonstrate to KodeKloud students how one might configure nginx ingress in LoadBalancer mode on this hand-built cluster.

While we're here, can MetalLB be configured to get certificates from cert-manager? Happy to submit a PR on the chart if it should work, and is not implemented.

@fedepaol
Member Author

Hi @fireflycons , we had a discussion about the method to provide certificates to the webhooks, and we came to the conclusion that adding an additional dependency such as cert-manager was not ideal, so we ended up embedding https://github.com/open-policy-agent/cert-controller in metallb to perform that task.
So, right now cert-manager is not supported, and adding it would compete with the current method.

@fireflycons

Ok @fedepaol that's fine re cert-manager. At least I know how you're doing it now :-)

Any ideas on why the IPAddressPool isn't working in my virtualbox setup?

@fedepaol
Member Author

What do you mean by "the ipaddresspool isn't working"?

@fireflycons

I mean what this entire discussion is about. When I try to create the IPAddressPool and L2Advertisement resources, they time out.
As mentioned in my initial post, I performed all the steps outlined above and got the expected curl response, which, if I haven't read it incorrectly, means that the API server should be able to reach the webhook endpoint.

{"response":{"uid":"","allowed":false,"status":{"metadata":{},"message":"contentType=, expected application/json","code":400}}}

@fedepaol
Member Author

Ah, apologies. Can you check the logs of the apiserver?
Also, is the error always occurring or intermittent? And what is the error? Is the environment single-node?

@fireflycons

Hi @fedepaol

API server error is

{
    "kind": "Status",
    "apiVersion": "v1",
    "metadata": {},
    "status": "Failure",
    "message": "Internal error occurred: failed calling webhook \"ipaddresspoolvalidationwebhook.metallb.io\": failed to call webhook: Post \"https://metallb-webhook-service.metallb-system.svc:443/validate-metallb-io-v1beta1-ipaddresspool?timeout=10s\": context deadline exceeded",
    "reason": "InternalError",
    "details": {
        "causes": [
            {
                "message": "failed calling webhook \"ipaddresspoolvalidationwebhook.metallb.io\": failed to call webhook: Post \"https://metallb-webhook-service.metallb-system.svc:443/validate-metallb-io-v1beta1-ipaddresspool?timeout=10s\": context deadline exceeded"
            }
        ]
    },
    "code": 500
}

Now, the cluster is running the control plane as OS services (i.e. not kubeadm). Doesn't this usually mean that in-cluster webhooks need hostNetwork: true? It occurs to me that I did the curl test from inside a test pod, which of course was going to work!

I was unable to redeploy the controller on the host network as it has a port clash, I'm guessing with the speaker pod that's on the same worker.

This is the cluster configuration. Note that this is purely a learning exercise for K8s students, however some are asking to extend the exercise to deploy ingress and I thought it would be good to try it with MetalLB.

Note that I have successfully installed the same chart version on a kubeadm cluster.

@mschauf

mschauf commented Sep 29, 2022

Hello! I've deployed Metallb Operator on OKD 4.11 and I'm trying now to configure the address pool via:

apiVersion: metallb.io/v1beta1
kind: IPAddressPool
metadata:
  name: ip-addresspool-sample1
  namespace: metallb-system
spec:
  addresses:
    - xxx.xxx.xxx.xxx-xxx.xxx.xxx.xxx
  autoAssign: true

and, alternatively:

apiVersion: metallb.io/v1beta1
kind: AddressPool
metadata:
  name: ippool
  namespace: metallb-system
spec:
  addresses:
  - xxx.xxx.xxx.xxx-xxx.xxx.xxx.xxx
  autoAssign: false
  protocol: layer2

and I get the following error message:

Error "failed calling webhook "ipaddresspoolvalidationwebhook.metallb.io": failed to call webhook: Post "https://webhook-service.metallb-system.svc:443/validate-metallb-io-v1beta1-ipaddresspool?timeout=10s": x509: certificate is valid for metallb-operator-webhook-server-service.metallb-system, metallb-operator-webhook-server-service.metallb-system.svc, not webhook-service.metallb-system.svc" for field "undefined".

"NO_PROXY" also covers .svc

- name: NO_PROXY
  value: >-
    .cluster.local,.svc,10.0.0.0/16,10.128.0.0/14,127.0.0.1,172.30.0.0/16,api-int.domain.xxx,localhost

Already tried disabling the webhook by

How to disable the webhook
A very quick workaround is to disable the webhook, which requires changing its failurePolicy to failurePolicy=Ignore

which didn't work for me.

@pseymournutanix

pseymournutanix commented Oct 7, 2022

My secret and webhook don't have any values for caBundle and I get this error, on a clean deployment to a new cluster.

UPDATE: I deleted the controller pod and both the secret and the webhook certs were recreated.

UPDATE 2: A few hours later, both the secret and the webhook had no value for the caBundle again...

I suspect this is due to using ArgoCD via Helm and Kustomize. What is the best way to exclude these resources from ArgoCD syncing when rendering via those tools? Any help would be appreciated.
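
One pattern worth trying (an assumption on my part, not something verified in this thread) is telling ArgoCD to ignore the caBundle field that MetalLB's embedded cert-controller rewrites at runtime, so the sync doesn't keep reverting the injection, e.g. in the Application spec:

spec:
  ignoreDifferences:
    # hypothetical example: stop diffing the injected CA bundle on MetalLB's webhook config
    - group: admissionregistration.k8s.io
      kind: ValidatingWebhookConfiguration
      jsonPointers:
        - /webhooks/0/clientConfig/caBundle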

@hidalgopl

We started hitting this issue in our CI after migrating to v0.13.5. We're using a kind cluster, v1.21.
Code that worked before:

kubectl apply -f https://raw.githubusercontent.com/metallb/metallb/v0.12.1/manifests/namespace.yaml
kubectl apply -f https://raw.githubusercontent.com/metallb/metallb/v0.12.1/manifests/metallb.yaml
kubectl -n metallb-system wait deploy/controller --timeout=90s --for=condition=Available
kubectl apply -f ./metal_lb_cm.yaml

I noticed that in kind, the speaker has CreateContainerConfigError initially:

metallb-system       controller-6846c94466-bn2qx                0/1     Running                      0          15s
metallb-system       speaker-6t6x8                              0/1     CreateContainerConfigError   0          15s
metallb-system       speaker-6t6x8                              0/1     Running                      0          17s
metallb-system       controller-6846c94466-bn2qx                1/1     Running                      0          20s
metallb-system       speaker-6t6x8                              1/1     Running                      0          30s

The workaround that seems to be working so far for us is waiting until all pods (speaker & controller) are ready:

kubectl -n metallb-system wait pod --all --timeout=90s --for=condition=Ready
kubectl -n metallb-system wait deploy controller --timeout=90s --for=condition=Available
kubectl -n metallb-system wait apiservice v1beta1.metallb.io --timeout=90s --for=condition=Available
kubectl apply -f ./metal_lb_addrpool.yaml

our metal_lb_addrpool.yaml:

---
# MetalLB config
apiVersion: metallb.io/v1beta1
kind: IPAddressPool
metadata:
  name: example
  namespace: metallb-system
spec:
  addresses:
  - 172.18.255.200-172.18.255.255
---
apiVersion: metallb.io/v1beta1
kind: L2Advertisement
metadata:
  name: empty
  namespace: metallb-system

Posting as an FYI, in case anyone is looking for a workaround in kind.

@jsemohub

Experiencing the same issue with webhook failure.
MicroK8s v1.25.2 revision 4055
Is there a quick workaround to turn off webhooks?
Thank you.

@fedepaol
Member Author

Experiencing the same issue with webhook failure. MicroK8s v1.25.2 revision 4055. Is there a quick workaround to turn off webhooks? Thank you.

It's in the first post. Also, if you are using Helm to deploy, you can use this value https://github.com/metallb/metallb/blob/main/charts/metallb/values.yaml#L332 from the latest release.

@Zveroloff

Hello!

Same Issue,
Error from server (InternalError): error when creating "IPAddressPool.yaml": Internal error occurred: failed calling webhook "ipaddresspoolvalidationwebhook.metallb.io": failed to call webhook: Post "https://webhook-service.metallb-system.svc:443/validate-metallb-io-v1beta1-ipaddresspool?timeout=10s": Unknown Host

Env:
Bare-metal Kubernetes 1.25.2
Cilium CNI 1.13.0-rc1
3 nodes

@jmcgrath207

Originally, when I did a fresh install, I opted out of installing cilium-proxy and went with the default kube-proxy.

I tried everything under the sun for a month, but ultimately it came down to MetalLB not working with my kube-proxy, while it worked with cilium-proxy.

I never found the underlying error in the end, and I'm not saying kube-proxy doesn't work, but it did come down to something in my proxy configuration or node configuration.

Hope this helps.

@coopbri

coopbri commented Oct 23, 2022

Originally, when I did a fresh install, I opted out of installing cilium-proxy and went with the default kube-proxy.

I tried everything under the sun for a month, but ultimately it came down to MetalLB not working with my kube-proxy, while it worked with cilium-proxy.

I never found the underlying error in the end, and I'm not saying kube-proxy doesn't work, but it did come down to something in my proxy configuration or node configuration.

Hope this helps.

Similar situation here using kind. I tested the following configurations:

  • kube-proxy without Cilium: worked
  • kube-proxy with Cilium: did not work (received webhook error)
  • Cilium without kube-proxy: worked

So it does seem to be connected to the proxy networking in my case, as well as @jmcgrath207's.

@1guzzy

1guzzy commented Mar 22, 2023

Hello, I followed the guide in the first comment and everything worked, but I'm still getting the

Error from server (InternalError): error when creating "address-pool.yaml": Internal error occurred: failed calling webhook "ipaddresspoolvalidationwebhook.metallb.io": failed to call webhook: Post "https://webhook-service.metallb-system.svc:443/validate-metallb-io-v1beta1-ipaddresspool?timeout=10s": x509: certificate signed by unknown authority (possibly because of "crypto/rsa: verification error" while trying to verify candidate authority certificate "cert")

error. I want to try disabling the webhook, but I'm using MicroK8s and I'm not sure how to disable it there. I tried adding MetalLB with Helm instead, but it's not working.

I'm pretty new to Kubernetes; any help would be greatly appreciated.

@lgehrke6

I have a 3-node cluster running and got the following messages when I tried to apply IPAddressPools and L2Advertisements.

Every pod is also running: [screenshot]

The errors changed once I disabled the firewall.

With active firewall:
Error from server (InternalError): error when creating "IpaddressPool.yaml": Internal error occurred: failed calling webhook "ipaddresspoolvalidationwebhook.metallb.io": failed to call webhook: Post "https://webhook-service.metallb-system.svc:443/validate-metallb-io-v1beta1-ipaddresspool?timeout=10s": context deadline exceeded
Error from server (InternalError): error when creating "IpaddressPool.yaml": Internal error occurred: failed calling webhook "ipaddresspoolvalidationwebhook.metallb.io": failed to call webhook: Post "https://webhook-service.metallb-system.svc:443/validate-metallb-io-v1beta1-ipaddresspool?timeout=10s": context deadline exceeded
Error from server (InternalError): error when creating "IpaddressPool.yaml": Internal error occurred: failed calling webhook "l2advertisementvalidationwebhook.metallb.io": failed to call webhook: Post "https://webhook-service.metallb-system.svc:443/validate-metallb-io-v1beta1-l2advertisement?timeout=10s": context deadline exceeded


With disabled firewall:
Error from server (InternalError): error when creating "IpaddressPool.yaml": Internal error occurred: failed calling webhook "ipaddresspoolvalidationwebhook.metallb.io": failed to call webhook: Post "https://webhook-service.metallb-system.svc:443/validate-metallb-io-v1beta1-ipaddresspool?timeout=10s": tls: failed to verify certificate: x509: certificate signed by unknown authority (possibly because of "crypto/rsa: verification error" while trying to verify candidate authority certificate "cert")
Error from server (InternalError): error when creating "IpaddressPool.yaml": Internal error occurred: failed calling webhook "ipaddresspoolvalidationwebhook.metallb.io": failed to call webhook: Post "https://webhook-service.metallb-system.svc:443/validate-metallb-io-v1beta1-ipaddresspool?timeout=10s": tls: failed to verify certificate: x509: certificate signed by unknown authority (possibly because of "crypto/rsa: verification error" while trying to verify candidate authority certificate "cert")
Error from server (InternalError): error when creating "IpaddressPool.yaml": Internal error occurred: failed calling webhook "l2advertisementvalidationwebhook.metallb.io": failed to call webhook: Post "https://webhook-service.metallb-system.svc:443/validate-metallb-io-v1beta1-l2advertisement?timeout=10s": tls: failed to verify certificate: x509: certificate signed by unknown authority (possibly because of "crypto/rsa: verification error" while trying to verify candidate authority certificate "cert")

@aleksandrov

aleksandrov commented May 6, 2023

Hi all, I performed the checks from the initial post. I checked every node in the cluster: all good with the certificate:

$ curl --cacert /tmp/caBundle.pem --resolve webhook-service.metallb-system.svc:443:10.152.183.193 https://webhook-service.metallb-system.svc:443/validate-metallb-io-v1beta1-ipaddresspool
{"response":{"uid":"","allowed":false,"status":{"metadata":{},"message":"contentType=, expected application/json","code":400}}}

but API Server is still unable to make a call:

Failed calling webhook, failing closed ipaddresspoolvalidationwebhook.metallb.io: failed calling webhook "ipaddresspoolvalidationwebhook.metallb.io": failed to call webhook: Post "https://webhook-service.metallb-system.svc:443/validate-metallb-io-v1beta1-ipaddresspool?timeout=10s": x509: certificate signed by unknown authority (possibly because of "crypto/rsa: verification error" while trying to verify candidate authority certificate "cert")

Could someone please suggest what else to check? I'm happy to provide logs to get this certs issue finally sorted out.

@chrismedinapy

chrismedinapy commented May 6, 2023

Have you tried disabling the firewalls on your nodes?

@cristian-corbu

cristian-corbu commented May 9, 2023

I fixed this issue by scheduling the controller pod to the master node.
Follow these steps to force pods deployment to master node:
https://stackoverflow.com/questions/41999756/how-to-force-pods-deployments-to-master-nodes

@seyfettinover

seyfettinover commented Jun 1, 2023

To resolve this issue, you can try the following steps:
Verify that the webhook service is deployed and running in the metallb-system namespace. You can use the following command to check the status of the webhook service:

kubectl get pods -n metallb-system

Look for a pod with a name similar to metallb-controller-xxxxx to confirm if it's running.
If the webhook service is not running, you may need to redeploy it. You can do this by deleting the existing webhook resources and letting them be recreated. Use the following command:

kubectl delete validatingwebhookconfigurations.admissionregistration.k8s.io metallb-webhook-config

If you get an error, you may need to change the resource name at the end of the command above to this:

kubectl delete validatingwebhookconfigurations.admissionregistration.k8s.io metallb-webhook-configuration

After all of this, you can apply metallb-adrpool.yaml and metallb-12.yaml.

This is the content of my YAML files:
metallb-adrpool.yaml

apiVersion: metallb.io/v1beta1
kind: IPAddressPool
metadata:
  name: first-pool
  namespace: metallb-system
spec:
  addresses:
  - 10.1.81.151-10.1.81.155

kubectl apply -f metallb-adrpool.yaml

metallb-12.yaml

apiVersion: metallb.io/v1beta1
kind: L2Advertisement
metadata:
  name: example
  namespace: metallb-system

kubectl apply -f metallb-12.yaml

@licecil

licecil commented Jun 8, 2023

For v0.13.10, setting some webhook configs to Ignore works for me;
the changes to metallb-native.yaml are:

2009 path: /validate-metallb-io-v1beta1-ipaddresspool
2010 failurePolicy: Ignore

2029 path: /validate-metallb-io-v1beta1-l2advertisement
2030 failurePolicy: Ignore
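
For reference, a rough sketch of how the corresponding entries in the ValidatingWebhookConfiguration look after that change (fields abbreviated; the exact layout depends on your metallb-native.yaml version):

# excerpt of the webhooks list in metallb-webhook-configuration
- name: ipaddresspoolvalidationwebhook.metallb.io
  failurePolicy: Ignore
  clientConfig:
    service:
      name: webhook-service
      namespace: metallb-system
      path: /validate-metallb-io-v1beta1-ipaddresspool
- name: l2advertisementvalidationwebhook.metallb.io
  failurePolicy: Ignore
  clientConfig:
    service:
      name: webhook-service
      namespace: metallb-system
      path: /validate-metallb-io-v1beta1-l2advertisement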

@github-actions

github-actions bot commented Jul 9, 2023

This issue has been automatically marked as stale because it has been open 30 days
with no activity. This issue will be closed in 10 days unless you do one of the following:

  • respond to this issue
  • have one of these labels applied: bug,good first issue,help wanted,hold,enhancement,documentation,question

@laundry-96

laundry-96 commented Jul 15, 2023

I fixed this issue by scheduling the controller pod to the master node. Follow these steps to force pods deployment to master node: https://stackoverflow.com/questions/41999756/how-to-force-pods-deployments-to-master-nodes

This is what fixed it for me!

kubectl edit deploy metallb-controller [-n <namespace_if_you_have_one>]

Add

nodeName: <kubernetes master>

to the pod spec
https://kubernetes.io/docs/concepts/scheduling-eviction/assign-pod-node/#nodename
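
A minimal sketch of where that goes in the controller Deployment's pod template (the node name below is a placeholder for your control-plane node):

spec:
  template:
    spec:
      # hypothetical node name; use `kubectl get nodes` to find your control-plane node
      nodeName: my-control-plane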

@yurikilian

I fixed this issue by scheduling the controller pod to the master node. Follow these steps to force pods deployment to master node: https://stackoverflow.com/questions/41999756/how-to-force-pods-deployments-to-master-nodes

This is what fixed it for me!

kubectl edit deploy metallb-controller [-n <namespace_if_you_have_one>]

Add

nodeName: <kubernetes master>

to the pod spec https://kubernetes.io/docs/concepts/scheduling-eviction/assign-pod-node/#nodename

For me it also worked. But why?

@dtufood-kihen

I've encountered an issue that seems related to the problem discussed here, specifically while configuring MetalLB on an RKE2 cluster running in VMs provisioned by Vagrant on my local machine (VirtualBox).

"message":"failed calling webhook "ipaddresspoolvalidationwebhook.metallb.io": failed to call webhook: Post "https://metallb-webhook-service.metallb-system.svc:443/validate-metallb-io-v1beta1-ipaddresspool?timeout=10s": context deadline exceeded"

The VMs have two network interfaces: the first one is a NAT interface that Vagrant always configures (including SSH port forwarding rules for provisioning), and I've added a second interface, which is bridged to my host machine's physical interface to make the VMs part of my local subnet.

When I shut down the VMs and remove the NAT interface, leaving only one interface, everything works as expected.

I'm eager to understand why this might be happening and how to overcome this limitation. Any insights or potential fixes would be greatly appreciated.

@willzhang

I managed to fix my error "Service Unavailable",

Internal error occurred: failed calling webhook "ipaddresspoolvalidationwebhook.metallb.io": failed to call webhook: Post "https://webhook-service.metallb-system.svc:443/validate-metallb-io-v1beta1-ipaddresspool?timeout=10s": Service Unavailable

by modifying /etc/kubernetes/manifests/kube-apiserver.yaml:

  • adding .svc to the no_proxy environment variable

Same problem here.

# kubectl apply -f IPAddressPool.yaml
Error from server (InternalError): error when creating "IPAddressPool.yaml": Internal error occurred: failed calling webhook "ipaddresspoolvalidationwebhook.metallb.io": failed to call webhook: Post "https://metallb-webhook-service.metallb-system.svc:443/validate-metallb-io-v1beta1-ipaddresspool?timeout=10s": EOF
Error from server (InternalError): error when creating "IPAddressPool.yaml": Internal error occurred: failed calling webhook "l2advertisementvalidationwebhook.metallb.io": failed to call webhook: Post "https://metallb-webhook-service.metallb-system.svc:443/validate-metallb-io-v1beta1-l2advertisement?timeout=10s": EOF

I find I have this proxy config:

# cat /etc/kubernetes/manifests/kube-apiserver.yaml
apiVersion: v1
kind: Pod
metadata:
  annotations:
........
    env:
    - name: no_proxy
      value: localhost,127.0.0.1,10.0.0.0/8,172.16.0.0/12,192.168.0.0/16,172.17.0.1,.svc.cluster.local,apiserver.cluster.local,100.64.0.0/10
    - name: ftp_proxy
      value: http://192.168.72.1:7890
    - name: https_proxy
      value: http://192.168.72.1:7890
    - name: NO_PROXY
      value: localhost,127.0.0.1,10.0.0.0/8,172.16.0.0/12,192.168.0.0/16,172.17.0.1,.svc.cluster.local,apiserver.cluster.local,100.64.0.0/10
    - name: FTP_PROXY
      value: http://192.168.72.1:7890
    - name: HTTPS_PROXY
      value: http://192.168.72.1:7890
    - name: HTTP_PROXY
      value: http://192.168.72.1:7890
    - name: http_proxy
      value: http://192.168.72.1:7890

When I delete them, everything is OK:

# kubectl apply -f IPAddressPool.yaml
ipaddresspool.metallb.io/first-pool created
l2advertisement.metallb.io/l2 created

@bhudgens

bhudgens commented Sep 3, 2023

I fixed this issue by scheduling the controller pod to the master node.
Follow these steps to force pods deployment to master node:
https://stackoverflow.com/questions/41999756/how-to-force-pods-deployments-to-master-nodes

My setup is a kubeadm cluster on Raspberry Pis, with a default Helm installation of MetalLB. I was getting the following error trying to apply configs:

Error from server (InternalError): error when creating "STDIN": Internal error occurred: failed calling webhook "ipaddresspoolvalidationwebhook.metallb.io": failed to call webhook: Post "https://metallb-webhook-service.metallb-system.svc:443/validate-metallb-io-v1beta1-ipaddresspool?timeout=10s": dial tcp 10.98.186.109:443: connect: connection refused
Error from server (InternalError): error when creating "STDIN": Internal error occurred: failed calling webhook "l2advertisementvalidationwebhook.metallb.io": failed to call webhook: Post "https://metallb-webhook-service.metallb-system.svc:443/validate-metallb-io-v1beta1-l2advertisement?timeout=10s": dial tcp 10.98.186.109:443: connect: connection refused

Your solution fixed this for me. Specifically:

#1547 (comment)

I fixed this issue by scheduling the controller pod to the master node. Follow these steps to force pods deployment to master node: https://stackoverflow.com/questions/41999756/how-to-force-pods-deployments-to-master-nodes

This is what fixed it for me!

kubectl edit deploy metallb-controller [-n <namespace_if_you_have_one>]

Add

nodeName: <kubernetes master>

to the pod spec https://kubernetes.io/docs/concepts/scheduling-eviction/assign-pod-node/#nodename

I really wish we had a bit more understanding of why running the pod on the master node fixes this issue. I do not have firewalls installed on my worker nodes. In either case, thank you @bejay88!

@wgnathanael

wgnathanael commented Sep 6, 2023

OK, so we tried a few different things to get this to work and I thought I'd report on how we've worked around the issue. First, some context.

We deploy MetalLB via Helm via Ansible.

  1. Our first theory was that there was a timing issue.
    1. So the first thing we tried was having the Ansible task that installs the configuration for the IPAddressPool and L2Advertisement retry a number of times. That seemed to work; however, later on it didn't. We had set 3 retries with 10s. Later, while debugging the issue (after the other solution we tried, detailed below), I set 120 retries with 30 seconds and it basically never succeeded (the test above of calling the webhook from a node wouldn't respond at all).
    2. We also added wait: true to the Helm deployment.
  2. We then brought these configuration steps into the Helm step by putting the config in a template and adding annotations to make Helm apply it as a post-install step. This worked sometimes but not others. We thought again it was a timing issue. However, upon further searching we found this, which showed that Helm can't actually wait for the service to be ready.
  3. In this thread we saw mentions of making the controller run on the control-plane node as a solution. In trying to validate this, I noticed that when our Ansible task failed even with hundreds of retries, the controller/webhook was running on an agent/worker node, and when it succeeded it was running on the control-plane node. If I cordoned off the agent and deleted the controller, it came back up on the control-plane node, and then the next retry in the Ansible task immediately succeeded.

So I don't know why it needs to be on the control plane, but the solution for us was to change the values.yaml file in the Helm chart to add the following to the controller nodeSelector section:

  nodeSelector: { node-role.kubernetes.io/control-plane: "true" }

It was around line 230 in the version of the values.yaml file we have, which isn't heavily modified.
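
For context, a rough sketch of how that nests in the chart's values.yaml (layout assumed from the chart version we use; the toleration is an assumption, only needed if your control-plane nodes carry the usual NoSchedule taint):

controller:
  nodeSelector:
    node-role.kubernetes.io/control-plane: "true"
  # assumption: tolerate the control-plane taint so the pod can actually schedule there
  tolerations:
    - key: node-role.kubernetes.io/control-plane
      operator: Exists
      effect: NoSchedule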

I'd be really curious as to the actual reason it works on the control-plane node but not any others. I don't know if that is a legitimate bug or if something about our cluster is at fault for that.

@wgnathanael

OK, so upon further investigation this has turned out to be a firewall issue. Our Ansible scripts were doing the following:

- name: Open ports for Flannel/Calico
  ansible.posix.firewalld:
    port: "8472/udp"
    permanent: true
    state: enabled
  notify: Restart Firewalld

- name: Add cluster network to trusted zone
  ansible.posix.firewalld:
    zone: trusted
    permanent: true
    source: "{{ cluster_cidr }}"
    state: enabled
  notify: Restart Firewalld

- name: Add services network to trusted zone
  ansible.posix.firewalld:
    zone: trusted
    permanent: true
    source: "{{ service_cidr }}"
    state: enabled
  notify: Restart Firewalld

Once we ensured the firewall had been restarted prior to applying the L2Advertisement/IPAddressPool config, applying the config worked without fail. Our setup uses RKE2 1.25 and the firewall was set up per the guidelines.

@github-actions

github-actions bot commented Oct 9, 2023

This issue has been automatically marked as stale because it has been open 30 days
with no activity. This issue will be closed in 10 days unless you do one of the following:

  • respond to this issue
  • have one of these labels applied: bug,good first issue,help wanted,hold,enhancement,documentation,question

@github-actions

This issue was automatically closed because of lack of activity after being marked stale in the
past 10 days. If you feel this was done in error, please feel free to re-open this issue.

@Reedler01

Opening port 8472/udp on all nodes fixed it for me.
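
For example, a sketch of opening that port with firewalld (an assumption that firewalld is the node firewall; 8472/udp is the VXLAN port Flannel/Calico use, as in the Ansible snippet above):

# run on every node
firewall-cmd --permanent --add-port=8472/udp
firewall-cmd --reload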

@lindhe
Contributor

lindhe commented Mar 15, 2024

I too had this issue, and it turned out to be a firewall issue where the controller pod could not receive the webhook calls.

I use Rancher with project network isolation activated for my cluster. That creates a rule in each namespace that is supposed to allow all communication to/from the nodes (since that is required for many system functions, like webhooks). Unfortunately, I use Cilium, which has a bug that prevents such policies from working: cilium/cilium#12277

I debugged this by using Hubble to inspect the traffic flow:

hubble observe -f --namespace metallb-system --verdict DROPPED

$ hubble observe -f --namespace metallb-system --verdict DROPPED
Mar 15 09:03:23.154: 10.42.2.245:42494 (remote-node) <> metallb-system/controller-cdcfbc84c-kfk8b:9443 (ID:22370) policy-verdict:none INGRESS DENIED (TCP Flags: SYN)
Mar 15 09:03:23.154: 10.42.2.245:42494 (remote-node) <> metallb-system/controller-cdcfbc84c-kfk8b:9443 (ID:22370) Policy denied DROPPED (TCP Flags: SYN)
Mar 15 09:03:24.195: 10.42.2.245:42494 (remote-node) <> metallb-system/controller-cdcfbc84c-kfk8b:9443 (ID:22370) policy-verdict:none INGRESS DENIED (TCP Flags: SYN)
Mar 15 09:03:24.195: 10.42.2.245:42494 (remote-node) <> metallb-system/controller-cdcfbc84c-kfk8b:9443 (ID:22370) Policy denied DROPPED (TCP Flags: SYN)
Mar 15 09:03:26.243: 10.42.2.245:42494 (remote-node) <> metallb-system/controller-cdcfbc84c-kfk8b:9443 (ID:22370) policy-verdict:none INGRESS DENIED (TCP Flags: SYN)
Mar 15 09:03:26.243: 10.42.2.245:42494 (remote-node) <> metallb-system/controller-cdcfbc84c-kfk8b:9443 (ID:22370) Policy denied DROPPED (TCP Flags: SYN)
Mar 15 09:03:30.276: 10.42.2.245:42494 (remote-node) <> metallb-system/controller-cdcfbc84c-kfk8b:9443 (ID:22370) policy-verdict:none INGRESS DENIED (TCP Flags: SYN)
Mar 15 09:03:30.276: 10.42.2.245:42494 (remote-node) <> metallb-system/controller-cdcfbc84c-kfk8b:9443 (ID:22370) Policy denied DROPPED (TCP Flags: SYN)
Mar 15 09:03:38.403: 10.42.2.245:42494 (remote-node) <> metallb-system/controller-cdcfbc84c-kfk8b:9443 (ID:22370) policy-verdict:none INGRESS DENIED (TCP Flags: SYN)
Mar 15 09:03:38.403: 10.42.2.245:42494 (remote-node) <> metallb-system/controller-cdcfbc84c-kfk8b:9443 (ID:22370) Policy denied DROPPED (TCP Flags: SYN)
Mar 15 09:03:54.788: 10.42.2.245:42494 (remote-node) <> metallb-system/controller-cdcfbc84c-kfk8b:9443 (ID:22370) policy-verdict:none INGRESS DENIED (TCP Flags: SYN)
Mar 15 09:03:54.788: 10.42.2.245:42494 (remote-node) <> metallb-system/controller-cdcfbc84c-kfk8b:9443 (ID:22370) Policy denied DROPPED (TCP Flags: SYN)
Mar 15 09:04:27.043: 10.42.2.245:42494 (remote-node) <> metallb-system/controller-cdcfbc84c-kfk8b:9443 (ID:22370) policy-verdict:none INGRESS DENIED (TCP Flags: SYN)
Mar 15 09:04:27.043: 10.42.2.245:42494 (remote-node) <> metallb-system/controller-cdcfbc84c-kfk8b:9443 (ID:22370) Policy denied DROPPED (TCP Flags: SYN)

There I could identify that the dropped packets were ingress traffic to port 9443, so I created a NetworkPolicy rule as a workaround, and then everything worked as expected:

NetworkPolicy.yaml

---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-webhook-ingress
  namespace: metallb-system
spec:
  ingress:
    - ports:
        - port: 9443
          protocol: TCP
  podSelector:
    matchLabels:
      component: controller
  policyTypes:
    - Ingress

After applying that NetworkPolicy, I no longer got the failing webhook, and so I could update MetalLB again.

I see that this issue has been closed due to low activity. Maybe it should be reopened?

Including a NetworkPolicy to fix this would be nice, but it's hard to know if that should be the responsibility of MetalLB or the platform team that implemented the CNI in the cluster where MetalLB gets deployed.

@sammagnet7

I fixed this issue by scheduling the controller pod to the master node. Follow these steps to force pods deployment to master node: https://stackoverflow.com/questions/41999756/how-to-force-pods-deployments-to-master-nodes

This is what fixed it for me!

kubectl edit deploy metallb-controller [-n <namespace_if_you_have_one>]

Add

nodeName: <kubernetes master>

to the pod spec https://kubernetes.io/docs/concepts/scheduling-eviction/assign-pod-node/#nodename

This approach worked for me also. Thanks. It's like a Saviour.
