Handling of ServerSpec collision errors #10447

Closed
ripta opened this issue Mar 4, 2023 · 2 comments

Comments

ripta commented Mar 4, 2023

What is the issue?

This relates to a feature added in #8076, which also started validating spec.podSelector to ensure that the selector on a new Server doesn't overlap with the pod selectors on existing Server resources.

Unfortunately, this check trips up when users rename their Server resource. Because renames look like a create and delete pair of operations to Kubernetes, this causes the resource with the old name (that still exists in the cluster) and the resource with the new name (being submitted for creation) to have overlapping selectors.

Assuming I haven't missed any special flags, the problem is compounded by some tools not pruning until after creation succeeds, while creation fails because the old resource hasn't been pruned yet:

  • helm doesn't delete resources (deleteResource call on line 459) until after creates (createResource call on line 412) and updates (updateResource call on line 427) succeed.
  • kubectl apply --prune doesn't prune (masked as the call to PostProcessorFn on line 477) until after all objects are applied (applyOneObject).
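The ordering problem can be sketched with a toy model. This is not helm's, kubectl's, or Linkerd's actual code: `ToyCluster`, `Server`, `AdmissionDenied`, and `apply_then_prune` are hypothetical names, and the admission rule here (reject a create when another Server has the same pod selector and port) is an assumed simplification of the real validator.

```python
from dataclasses import dataclass, field


class AdmissionDenied(Exception):
    """Raised when the toy admission check rejects a create."""


@dataclass(frozen=True)
class Server:
    name: str
    pod_selector: tuple  # (label, value) pairs; assumed stand-in for matchLabels
    port: str


@dataclass
class ToyCluster:
    servers: dict = field(default_factory=dict)

    def create(self, server):
        # Assumed rule: deny if a differently named Server has the same
        # selector and port. The real webhook's logic may differ.
        for existing in self.servers.values():
            if (
                existing.name != server.name
                and existing.pod_selector == server.pod_selector
                and existing.port == server.port
            ):
                raise AdmissionDenied("identical server spec already exists")
        self.servers[server.name] = server

    def delete(self, name):
        self.servers.pop(name, None)


def apply_then_prune(cluster, desired):
    """Create everything first, then prune whatever is no longer desired."""
    for server in desired:
        cluster.create(server)
    wanted = {server.name for server in desired}
    for name in list(cluster.servers):
        if name not in wanted:
            cluster.delete(name)


selector = (("app.kubernetes.io/name", "web"),)
cluster = ToyCluster()
apply_then_prune(cluster, [Server("http", selector, "http")])

# Rename http -> web: the create is attempted while "http" still exists,
# so the prune step that would have removed "http" is never reached.
try:
    apply_then_prune(cluster, [Server("web", selector, "http")])
    rename_result = "applied"
except AdmissionDenied as err:
    rename_result = f"denied: {err}"

print(rename_result)
print(sorted(cluster.servers))
```

In this model the rename can never succeed on its own, because the only step that removes the old name runs after the step that fails.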

How can it be reproduced?

Create a new Server resource, e.g.:

apiVersion: policy.linkerd.io/v1beta1
kind: Server
metadata:
  name: http
  labels:
    app.kubernetes.io/name: web
spec:
  podSelector:
    matchLabels:
      app.kubernetes.io/name: web
  port: http
  proxyProtocol: HTTP/1

Apply the resource. Then change the resource name (the metadata.name field) to something new, e.g., web, apply it again, and you'll get an error.

Logs, error output, etc

The error is:

could not create object: admission webhook "linkerd-policy-validator.linkerd.io" denied the request: identical server spec already exists

output of linkerd check -o short

N/A

Environment

linkerd 2.12.3

Possible solution

We currently have to manually delete the resource with the old name before doing any resource creation / update / apply. It was also not always obvious to users which resources are colliding, but #10187 seems to have alleviated that.

With GitOps-based controls, where users might not have write access to production systems, this sometimes requires intervention from cluster operators. With dozens of clusters and hundreds of microservices, this doesn't scale well. We could attempt to build something to handle it automatically, but deleting before creating does mean there could be downtime, because for a time no Server resource targets the set of pods.
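To make the gap concrete, here is a minimal sketch of the delete-before-create workaround. The function `rename_delete_first` and the dictionary standing in for the cluster are hypothetical; the sketch only tracks which Server names exist after each step and makes no claim about how the proxy behaves during the gap.

```python
def rename_delete_first(servers, old_name, new_name):
    """Return the Server names present after each step of a delete-then-create rename."""
    state = dict(servers)
    snapshots = []

    spec = state.pop(old_name)       # step 1: delete the old resource
    snapshots.append(sorted(state))  # nothing targets the pods at this point

    state[new_name] = spec           # step 2: create under the new name
    snapshots.append(sorted(state))
    return snapshots


# Hypothetical starting state: one Server named "http" selecting the web pods.
initial = {"http": {"podSelector": {"app.kubernetes.io/name": "web"}, "port": "http"}}
steps = rename_delete_first(initial, "http", "web")
print(steps)
```

The first snapshot is empty: between the delete and the create, no Server selects the pods, which is the window the paragraph above is worried about.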

Another possible solution is to add a flag to allow the cluster administrator to disable pod selector overlap validation. A big question I haven't taken into account here (and don't know the answer to) is what the repercussions of overlapping selectors are, e.g., whether selector overlaps cause undefined behavior on the proxy.

Additional context

No response

Would you like to work on fixing this bug?

None

@ripta ripta added the bug label Mar 4, 2023
@jeremychase
Contributor

@ripta Thank you for the detailed report!

The admission webhook blocking the above resource renaming will be removed or relaxed as we continue to improve Status fields on Linkerd policy resources. We have work in flight to address updating Statuses and we are keeping this issue open to track resolving the bug you have identified.

@jeremychase jeremychase added this to the stable-2.14.0 milestone Mar 9, 2023
@jeremychase jeremychase added the priority/P1 Planned for Release label Mar 9, 2023

stale bot commented Jun 8, 2023

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.

@stale stale bot added the wontfix label Jun 8, 2023
@stale stale bot closed this as completed Jun 22, 2023
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Jul 23, 2023