
L2: use externalips for speaker allocation and remove the need for a local backend in case of traffic policy cluster #976

Merged — 11 commits merged into metallb:main from testl2issue on Oct 27, 2021

Conversation

@fedepaol (Member) commented Sep 30, 2021

Fixes:

Obsoletes the following PRs:

This started with adding a test to reproduce the issue above. After investigating it, the fix ended up being twofold: removing the need for local endpoints in case of traffic policy of type cluster, and changing the way we determine whether a speaker should advertise a service, using the external IP(s) instead of the service name + namespace as the sort key.

Note: I haven't aligned the unit tests (yet); I will do that if these changes make sense to the reviewers.

@fedepaol fedepaol changed the title Testl2issue L2: use externalips for speaker allocation and remove the need for a local backend in case of traffic policy cluster Sep 30, 2021
@fedepaol (Member Author)

uhm, tests failed 👀

@russellb (Collaborator) commented Oct 8, 2021

You’ve hit an area that has needed to be fixed for a long time. We have many open issues and PRs that are related to this. I had to spend some time going back through all the related issues to help decide what to do with this, so I’m going to share some of those notes here.

I’d love to sort all of this out in this PR and get a bunch of stuff closed out.

You’ve made two key changes here:

  1. You changed the way we calculate the set of candidate nodes: when the traffic policy is cluster, all nodes become candidates, instead of using only the nodes where a speaker is running alongside an Endpoint backing that Service.
  2. You changed the sort key of the candidate node list to be the node name + load balancer IP addresses instead of node name + service name.

On 1 -- related PRs include #781, #665, and #613. Related open issues include #646 and #302.

One thing I liked about other changes compared to yours is to prefer the old algorithm if it produces at least one candidate node. That way we’ll still put the IP on a node with a local endpoint if possible. Otherwise, we can fall back to what you’ve proposed here if we’re allowed to (with the cluster traffic policy). If you agree, updating to incorporate this behavior is my first request.
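A minimal sketch of the fallback described above (the function name, signature, and behavior are my illustration of the suggestion, not MetalLB's actual code):

```go
package main

import "fmt"

// candidateNodes prefers the nodes hosting a ready endpoint; only when
// that set is empty and the traffic policy is Cluster does it fall back
// to all nodes. Local policy must never announce from a node without a
// local backend, so it never falls back.
func candidateNodes(allNodes, nodesWithEndpoints []string, policyIsLocal bool) []string {
	if len(nodesWithEndpoints) > 0 || policyIsLocal {
		return nodesWithEndpoints
	}
	return allNodes
}

func main() {
	// A node with a local endpoint exists: prefer it.
	fmt.Println(candidateNodes([]string{"a", "b", "c"}, []string{"b"}, false))
	// No local endpoints, Cluster policy: fall back to every node.
	fmt.Println(candidateNodes([]string{"a", "b", "c"}, nil, false))
}
```

With this shape, the IP still lands on a node with a local endpoint whenever one exists, and only Cluster-policy services take the wider fallback.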

On 2, you called out issue #968. There’s another class of issues related to change #2, which is to deal with shared IPs for services that use the Local traffic policy. See issues #558, #801 and PRs #562, #887, and #922.

I do see that your change would resolve #968, but I don’t think we would fully resolve the issues related to sharing an IP between multiple services using the Local traffic policy. Sharing an IP across multiple Local policy services is not always going to work, and we need to get better at detecting that. Perhaps we can defer that to a future PR, because I don’t think what you’re proposing here makes it any worse.

Making a change here will potentially cause many/most/all IPs to get reassigned to a new node. It would be nice if we could sort out these local traffic policy issues at the same time so we impose the disruption in node selection just once. We should also document this in the release notes.

If you can, take a look at these other PRs and issues. I’d like to try to minimize the number of times we release a change to node selection. If you’d like, we could split out change #1 (with my suggested updates) into a separate PR to merge first.

@fedepaol (Member Author)

> You’ve hit an area that has needed to be fixed for a long time. We have many open issues and PRs that are related to this. I had to spend some time going back through all the related issues to help decide what to do with this, so I’m going to share some of those notes here.
>
> I’d love to sort all of this out in this PR and get a bunch of stuff closed out.
>
> You’ve made two key changes here:
>
> 1. You changed the way we calculate the set of candidate nodes to be all nodes when the traffic policy is `cluster` instead of using only the nodes where a speaker is running alongside an Endpoint backing that Service.
>
> 2. You changed the sort key of the candidate node list to be the node name + load balancer IP addresses instead of node name + service name.
>
> On 1 -- related PRs include #781, #665, and #613. Related open issues include #646 and #302.
>
> One thing I liked about other changes compared to yours is to prefer the old algorithm if it produces at least one candidate node. That way we’ll still put the IP on a node with a local endpoint if possible. Otherwise, we can fall back to what you’ve proposed here if we’re allowed to (with the cluster traffic policy). If you agree, updating to incorporate this behavior is my first request.

Oh, sounds cool. But please keep in mind that if there is more than one backend pod, there's no guarantee at all that the traffic will be directed to the local pod (unless the traffic policy is local).

> On 2, you called out issue #968. There’s another class of issues related to change #2, which is to deal with shared IPs for services that use the Local traffic policy. See issues #558, #801 and PRs #562, #887, and #922.
>
> I do see that your change would resolve #968, but I don’t think we would fully resolve the issues related to sharing an IP between multiple services using the Local traffic policy. Sharing an IP across multiple Local policy services is not always going to work, and we need to get better at detecting that. Perhaps we can defer that to a future PR, because I don’t think what you’re proposing here makes it any worse.

Well, if the policy is local AND we have two pods on two different nodes assigned to two different services that must share the same IP, that's mechanically impossible, so yes, the only thing we can do is throw an error in that case.

> Making a change here will potentially cause many/most/all IPs to get reassigned to a new node. It would be nice if we could sort out these local traffic policy issues at the same time so we impose the disruption in node selection just once. We should also document this in the release notes.

Do you mean after an upgrade?

> If you can, take a look at these other PRs and issues. I’d like to try to minimize the number of times we release a change to node selection. If you’d like, we could split out change #1 (with my suggested updates) into a separate PR to merge first.

I will!

@russellb (Collaborator)

> One thing I liked about other changes compared to yours is to prefer the old algorithm if it produces at least one candidate node. That way we’ll still put the IP on a node with a local endpoint if possible. Otherwise, we can fall back to what you’ve proposed here if we’re allowed to (with the cluster traffic policy). If you agree, updating to incorporate this behavior is my first request.
>
> Oh, sounds cool. But please keep in mind that if there is more than one backend pod, there's no guarantee at all that the traffic will be directed to the local pod (unless the traffic policy is local).

Right. I was thinking that at least some percentage of the traffic would stay local, which is a little better than none of it.

> Making a change here will potentially cause many/most/all IPs to get reassigned to a new node. It would be nice if we could sort out these local traffic policy issues at the same time so we impose the disruption in node selection just once. We should also document this in the release notes.
>
> Do you mean after an upgrade?

Yes

@fedepaol (Member Author)

> On 2, you called out issue #968. There’s another class of issues related to change #2, which is to deal with shared IPs for services that use the Local traffic policy. See issues #558, #801 and PRs #562, #887, and #922.

I took a look at the others.

That said (and having looked at the code again):

  • having a shared IP between services with local policy means that they have the same endpoints (see `func BackendKey(svc *v1.Service) string` and `if existingSK := a.sharingKeyForIP[ip.String()]; existingSK != nil`)
  • this makes it impossible to have two services, with one of them being local, with a different set of endpoints (and the risk of announcing one from one node and the other from another node)
  • this means that, assuming a stable ordering function of the nodes based on active speakers / endpoints in case of a local traffic policy, all the speakers will agree on which node should advertise a given service (provided that we use the IP as the "service identifier")
  • if such a conflict exists, the second service won't get its IP and will remain in pending

Long story short: if we want to share the same IP between two services with different backend selectors, the second service won't get its IP. If the set of backends is the same, the winning node in each speaker's selection will be the same, assuming we use the service's IP(s) / the sharing key as the ordering criterion instead of the service's name.

#801 (and its related PR) handles a corner case.
MetalLB runs with the simplification that, if two services have local policy and want to share the same IP address, they MUST have the same set of pods as backend.
My understanding is that #887 tries to check whether the set of backend pods is the same, regardless of whether they are chosen using the same selector. This would not change the considerations above; it would just make the logic a bit smarter.

I think the conclusion is: the current PR would address all the issues / PRs you mentioned except #801, which would not affect the node selection behaviour.
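The invariant discussed above — two Local-policy services sharing an IP must resolve to the same backends — can be sketched like this (an assumed helper comparing endpoint address multisets, not the project's actual code):

```go
package main

import "fmt"

// sameBackends reports whether two services resolve to the same multiset
// of endpoint addresses. If they don't, sharing an IP with Local policy
// is mechanically impossible and the allocator should refuse (leaving
// the second service pending).
func sameBackends(a, b []string) bool {
	if len(a) != len(b) {
		return false
	}
	count := map[string]int{}
	for _, ep := range a {
		count[ep]++
	}
	for _, ep := range b {
		count[ep]--
	}
	for _, c := range count {
		if c != 0 {
			return false
		}
	}
	return true
}

func main() {
	fmt.Println(sameBackends([]string{"10.1.1.1", "10.1.1.2"}, []string{"10.1.1.2", "10.1.1.1"}))
	fmt.Println(sameBackends([]string{"10.1.1.1"}, []string{"10.1.1.3"}))
}
```

Using counts rather than a plain set keeps the check order-independent while still catching duplicated addresses.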

@fedepaol (Member Author) commented Oct 12, 2021

> On 1 -- related PRs include #781, #665, and #613. Related open issues include #646 and #302.
>
> One thing I liked about other changes compared to yours is to prefer the old algorithm if it produces at least one candidate node. That way we’ll still put the IP on a node with a local endpoint if possible. Otherwise, we can fall back to what you’ve proposed here if we’re allowed to (with the cluster traffic policy). If you agree, updating to incorporate this behavior is my first request.

Regarding this, I re-checked the issues plus the old PRs.
All the old PRs behave like mine: they take either the list of all nodes OR the list of nodes with backends, depending on the traffic policy.
See https://github.com/metallb/metallb/pull/562/files#diff-2c4fb432cbb842626bb58d73a6d40bc04718aeaca625b44d1711e7e174548ec3R52 and https://github.com/metallb/metallb/pull/922/files#diff-2c4fb432cbb842626bb58d73a6d40bc04718aeaca625b44d1711e7e174548ec3R47

And it can't be any different.
The scenario we want to address here is two different services with traffic policy cluster and different sets of backends.
So, the worst case scenario is the one with the test I added:

  • same IP

  • one service with one backend on one node, the other service with its backend on another node

  • when a speaker produces the list of possible nodes, it only has the list of backends of the current service available, which will be different for the two services (without the current change)

  • we need to change the sorting key, so that the two services sharing the same IP end up sorted in the same way

We could make the logic more complex, collecting all the endpoints belonging to all the services sharing the same IP as the one currently being handled, but I am not sure it's worth the effort. If users care about the traffic being local, they will use the Local traffic policy anyhow.
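The sort-key change described above can be sketched as follows (illustrative Go: function names, the `#` separator, and the use of a hash are my assumptions about the approach, not MetalLB's exact code):

```go
package main

import (
	"crypto/sha256"
	"fmt"
	"sort"
	"strings"
)

// sortKey ranks nodes by a hash of node name + the service's load
// balancer IPs, instead of node name + service name. Two services
// sharing an IP therefore produce identical rankings, so every speaker
// independently elects the same announcing node.
func sortKey(nodeName string, lbIPs []string) string {
	h := sha256.Sum256([]byte(nodeName + "#" + strings.Join(lbIPs, "#")))
	return fmt.Sprintf("%x", h)
}

func rankNodes(nodes, lbIPs []string) []string {
	out := append([]string(nil), nodes...)
	sort.Slice(out, func(i, j int) bool {
		return sortKey(out[i], lbIPs) < sortKey(out[j], lbIPs)
	})
	return out
}

func main() {
	// Two services sharing 192.168.10.0: their names never enter the
	// key, so both get the same ranking, and out[0] is the winner.
	fmt.Println(rankNodes([]string{"node-a", "node-b", "node-c"}, []string{"192.168.10.0"}))
}
```

Because the service name is no longer part of the key, the two services in the worst-case scenario above agree on the winner even though their endpoint sets differ.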

TL;DR: I think it makes sense to keep this as it is, and both changes are needed.

@russellb (Collaborator)

Thanks a lot for putting the time into looking over all of those issues and PRs. I'm OK with proceeding with this PR once you update the related tests. Can you also update the PR / commit message to reflect all of the things it will close?

@fedepaol (Member Author)

> Thanks a lot for putting the time into looking over all of those issues and PRs. I'm OK with proceeding with this PR once you update the related tests. Can you also update the PR / commit message to reflect all of the things it will close?

How about mentioning the things we close in the PR description only? I always thought of the commit message as explaining the rationale, but I can also mention the issues if you think it's worth it.

@russellb (Collaborator)

> Thanks a lot for putting the time into looking over all of those issues and PRs. I'm OK with proceeding with this PR once you update the related tests. Can you also update the PR / commit message to reflect all of the things it will close?
>
> How about mentioning the things we close in the PR description only? I always thought of the commit message as explaining the rationale, but I can also mention the issues if you think it's worth it.

PR is fine

@fedepaol fedepaol force-pushed the testl2issue branch 2 times, most recently from 2e6bb50 to 6ada084 Compare October 13, 2021 17:16
```go
for _, i := range svc.Status.LoadBalancer.Ingress {
	ss = append(ss, i.IP)
}
return nodeName + "#" + strings.Join(ss, "#")
```
A Contributor commented on this diff:

you are adding nodeName twice

The Member Author replied:

Oh, thanks! Hope it doesn't shuffle the tests again 😅

@champtar (Contributor) commented Oct 13, 2021

You might want to share an IPv4 address but have 2 different IPv6 addresses between 2 services. The sort key should be the metallb.universe.tf/allow-shared-ip annotation if not empty, else the svc name (like right now).

@briantopping (Contributor) commented Oct 14, 2021

Hi, I just saw this and thought I would chime in with the reason #801 was written. Gardener is a hyperscaler provider for Kubernetes. It uses what are called "Cloud Providers", a software abstraction to provide worker nodes. The provider I am using creates Kubevirt VMs for the nodes. Kubevirt can schedule child ("shoot") cluster worker nodes to be created anywhere in the root ("seed") cluster, which likely spans several hardware nodes. As these child clusters are created, the root cluster scheduler will typically distribute these worker nodes all across the root cluster to distribute load, regardless of the child cluster grouping.

Closer to the problem, let's say a child cluster wants to pull an address from a MetalLB pool. MetalLB is only running on the root cluster, not the child clusters. To expose services in child clusters, they must be exposed through the root cluster. This happens by creating a service proxy in the root that selects the services on the child by selector.

In order to get a public address, the typical MLB annotations are propagated up to the root cluster proxy in kubevirt/cloud-provider-kubevirt#16. MLB sees the annotation on a service and does its normal thing. The actual problem #801 addresses is in kubevirt/cloud-provider-kubevirt#18:

This behavior in MetalLB in Local mode makes sense: a single IP address cannot route to multiple service nodes. At the same time, it doesn't seem that fixed service selectors referencing dynamic pod labels should be constrained, so this seems to be a misfeature in MetalLB.

Does this patch solve the problem? If not, I'd like to help upgrade the patch to do so. I can't see how it does though.

@fedepaol (Member Author)

> You might want to share an IPv4 address but have 2 different IPv6 addresses between 2 services. The sort key should be the metallb.universe.tf/allow-shared-ip annotation if not empty, else the svc name (like right now).

This is an interesting point (and currently hidden by the fact that we don't handle dual stack yet). I assumed that two services with the same IPv4 address would also have the same IPv6 address, but you are right, that's not always the case.

I'd rather try to change the interface and the logic to focus on the IP, because at the end of the day the shouldAnnounce logic is applied per load balancer IP (`if deleteReason := handler.ShouldAnnounce(l, name, svc, eps); deleteReason != "" {`), so we avoid having crowded announcers in those cases where the sharing key is the same BUT the IPs are different.
Does it make sense?
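A per-IP decision could look roughly like this (the signature and hashing scheme are assumptions for illustration, not MetalLB's actual `ShouldAnnounce`):

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

// shouldAnnounce decides for one load balancer IP, not once per service:
// each speaker ranks the candidate nodes for that IP and announces only
// if its own node wins. Services sharing a key but not every IP thus
// don't all crowd onto the same node.
func shouldAnnounce(myNode string, candidates []string, lbIP string) bool {
	winner, winnerHash := "", ""
	for _, n := range candidates {
		h := fmt.Sprintf("%x", sha256.Sum256([]byte(n+"#"+lbIP)))
		if winner == "" || h < winnerHash {
			winner, winnerHash = n, h
		}
	}
	return winner == myNode
}

func main() {
	nodes := []string{"node-a", "node-b", "node-c"}
	// Each speaker evaluates independently; exactly one says true.
	for _, n := range nodes {
		fmt.Println(n, shouldAnnounce(n, nodes, "10.0.0.5"))
	}
}
```

Since the key depends only on node name and IP, all speakers agree on the winner without any coordination.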

@fedepaol (Member Author)

> Does this patch solve the problem? If not, I'd like to help upgrade the patch to do so. I can't see how it does though.

I saw your PR while wrapping my head around the open work, and I think it's orthogonal to what's being addressed here, which is the other way around: removing the constraint of having a local pod in case of traffic policy cluster.

I did a recap in #976 (comment), but I'd keep the two things separate instead of collapsing that other logic here; it requires a separate discussion.

@fedepaol fedepaol force-pushed the testl2issue branch 3 times, most recently from efcc28b to a0775d9 Compare October 14, 2021 13:40
@briantopping (Contributor) commented Oct 14, 2021

Thanks @fedepaol, that's fine for me if your proposed changes don't block a solution to #801. It sounds like you are saying that's not the case. Merging them at the same time would help ensure that doesn't happen. I've been waiting six months for some actionable feedback on that issue and I'm not sure why it never happened.

cgoncalves and others added 6 commits October 27, 2021 17:58
This is a half-baked commit adding support for Podman-based kind
environments. It continues assuming the container runtime is docker, but
checks behind the scenes whether docker is a wrapper for podman, i.e. a
symlink to podman, for cases where commands and/or outputs differ.

Users on Red Hat and Debian family distributions can either install
package 'podman-docker' or create a symbolic link docker -> podman.

For more information of how user:group ownership is mapped, please refer
to https://www.redhat.com/sysadmin/rootless-podman-makes-sense

Signed-off-by: Carlos Goncalves <cgoncalves@redhat.com>
Merge-with-parent seems not to work when the main branch is not
well aligned with the branch the PR is coming from. Here we roll it
back.

Signed-off-by: Federico Paolinelli <fpaoline@redhat.com>
Updating the release notes to reflect the new L2 announcement logic
based on the load balancer IP, and the change of behaviour in case of
traffic policy cluster.

Signed-off-by: Federico Paolinelli <fpaoline@redhat.com>
Having a way to skip tests is required when we define ipv4 / ipv6 only
tests.

Signed-off-by: Federico Paolinelli <fpaoline@redhat.com>
Now that the tests are able to configure themselves dynamically, having
two separate lanes for the two protocols is not needed anymore.

On top of that, we add skips for IPv6 tests on the IPv4 lane and
vice versa.

Finally, BGP is skipped in the IPv6 lane.

Signed-off-by: Federico Paolinelli <fpaoline@redhat.com>
Instead of failing after the first attempt, we validate the behaviour
inside an `Eventually` loop in order to avoid flakes due to slowness in
CI.

Signed-off-by: Federico Paolinelli <fpaoline@redhat.com>
@russellb (Collaborator) left a review:

lgtm ... just waiting for CI to finish

@russellb russellb merged commit ccc9122 into metallb:main Oct 27, 2021
@russellb russellb added this to the v0.11.0 milestone Oct 27, 2021
openshift-cherrypick-robot pushed a commit to openshift-cherrypick-robot/openshift-docs that referenced this pull request Dec 17, 2021
Addresses the doc impact for SDN-2480.

The metallb/metallb/pull/976 PR seems related,
in the sense that limiting the `speaker` pods
and an ETP of `cluster` seem to work well together.

Add a link to "Placing pods on specific nodes using
node selectors" to address Mohamed's feedback.

Address comments from Federico.

Address comments from Arti.