Skip to content

Webhook does not process transient errors properly #2173

@oribon

Description

@oribon

MetalLB Version

latest

Deployment method

Not relevant

Main CNI

N/A

Kubernetes Version

No response

Cluster Distribution

No response

Describe the bug

Using the webhooks, we do not want to block resources depending on the order they are received - e.g we don't want to block a peer specifying a bfdProfile that doesn't exist yet because it could eventually converge. today we achieve that using the TransientError type, returning it only when an error that relates to dependencies between resources occur and we ignore this type of err in the webhook's validator.
The problem with that approach is that in some cases this causes the webhook to miss an actual error, because in the webhook we pass multiple objects of the same type to the validation and return early when hitting any error, which for example if a TransientError happens for the first resource the rest of the resources are not being validated at all.
As an example, consider the following resources:

apiVersion: metallb.io/v1beta2
kind: BGPPeer
metadata:
  name: peer1
  namespace: metallb-system
spec:
  peerAddress: 172.16.30.2
  peerASN: 65433
  myASN: 65439
  bfdProfile: "non-existing"
---
apiVersion: metallb.io/v1beta2
kind: BGPPeer
metadata:
  name: peer2
  namespace: metallb-system
spec:
  peerAddress: 172.16.30.3
  peerASN: 0
  myASN: 65439

peer2 is invalid since it specifies peerASN: 0 so we would expect the webhook to block it anyways, but the outcome of the webhook changes as a result of the order: if peer2 is created without peer1 it is blocked, if it is created after peer1 it passes.
A possible solution is instead of ignoring transient errors in the webhook's validator, it would pass the resources to For without fields that can cause a transient error (resetting them), essentially not allowing the webhook to receive a transient error at all.
TransientErrors could stay in the codebase (serving as a "doc"), we would just not act on them in the webhooks.

To Reproduce

described above

Expected Behavior

described above

Additional Context

described above

I've read and agree with the following

  • I've checked all open and closed issues and my request is not there.
  • I've checked all open and closed pull requests and my request is not there.

I've read and agree with the following

  • I've checked all open and closed issues and my issue is not there.
  • This bug is reproducible when deploying MetalLB from the main branch
  • I have read the troubleshooting guide and I am still not able to make it work
  • I checked the logs and MetalLB is not discarding the configuration as not valid
  • I enabled the debug logs, collected the information required from the cluster using the collect script and will attach them to the issue
  • I will provide the definition of my service and the related endpoint slices and attach them to this issue

Metadata

Metadata

Assignees

Labels

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions