
[IMPROVEMENT] Remove static sessionAffinity: ClientIP set in most services if not required #7399

Closed
innobead opened this issue Dec 20, 2023 · 9 comments
Labels: area/chart (Helm chart related), area/install-uninstall-upgrade (Install, Uninstall or Upgrade related), backport/1.4.5, backport/1.5.4, kind/improvement (Request for improvement of existing function), priority/0 (Must be fixed in this release, managed by PO)
Milestone: v1.6.0

Comments

@innobead
Member

What's the task? Please describe

The Longhorn API doesn't require sessions, as each API call should be independent. Backing image download and upload do exist, but their internal forwarding is handled by the Longhorn manager rather than the load-balancer layer. Nevertheless, most services have sessionAffinity hard-coded to ClientIP.

/charts/longhorn/templates/services.yaml#L1-L16

apiVersion: v1
kind: Service
metadata:
  labels: {{- include "longhorn.labels" . | nindent 4 }}
    app: longhorn-conversion-webhook
  name: longhorn-conversion-webhook
  namespace: {{ include "release_namespace" . }}
spec:
  type: ClusterIP
  sessionAffinity: ClientIP
  selector:
    app: longhorn-conversion-webhook
  ports:
  - name: conversion-webhook
    port: 9443
    targetPort: conversion-wh

Describe the sub-tasks

Additional context

@innobead innobead added the kind/task General task request to fulfill another primary request label Dec 20, 2023
@innobead innobead added area/install-uninstall-upgrade Install, Uninstall or Upgrade related area/chart Helm chart related priority/0 Must be fixed in this release (managed by PO) labels Dec 20, 2023
@innobead innobead added this to the v1.7.0 milestone Dec 20, 2023
@ejweber
Contributor

ejweber commented Jan 17, 2024

The scope of this issue is to investigate whether we can remove sessionAffinity: ClientIP from our services, and that will be my primary focus. However, the broader context from one user can be summarized as follows:

Can Longhorn run in a "multitenant" environment in which nodes belonging to one tenant can only talk to other nodes belonging to the same tenant? sessionAffinity: ClientIP makes this difficult, because after the Kubernetes network control plane randomly decides which destination IP address to resolve a service to, it will not make another decision for 10800 seconds (as configured for Longhorn's services). If, instead, the control plane could make a different decision every time a connection attempt is made, a pod in one tenant would have a reasonable chance of getting connected to a pod backing the service in the same tenant.
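
For reference, the affinity behavior described above maps to these Service fields (a minimal sketch using the upstream Kubernetes Service API; 10800 seconds is also the upstream default clientIP timeout, which kube-proxy translates into the xt_recent rules examined below):

# Sketch of the relevant Service spec fields (upstream Kubernetes API, not a Longhorn-specific excerpt).
spec:
  sessionAffinity: ClientIP
  sessionAffinityConfig:
    clientIP:
      timeoutSeconds: 10800  # a given client IP keeps resolving to the same endpoint for this long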

Before getting started with the main investigation, I decided to try to understand exactly how/why Longhorn cannot work in the "multitenant" environment today.

Setup

In a two-node cluster, temporarily make it impossible for one node (kgxdq) to route traffic to the other node (nn4rg) and vice versa.

root@eweber-v126-worker-9c1451b4-kgxdq:~# ip route
default via 143.198.224.1 dev eth0 proto static 
10.42.0.0/24 via 10.42.0.0 dev flannel.1 onlink 
10.42.1.0/24 via 10.42.1.0 dev flannel.1 onlink
10.42.2.0/24 dev cni0 proto kernel scope link src 10.42.2.1 
10.42.3.0/24 via 10.42.3.0 dev flannel.1 onlink 
10.48.0.0/16 dev eth0 proto kernel scope link src 10.48.0.37 
10.124.0.0/20 dev eth1 proto kernel scope link src 10.124.0.31 
143.198.224.0/20 dev eth0 proto kernel scope link src 143.198.232.225

root@eweber-v126-worker-9c1451b4-kgxdq:~# ip route del 10.42.1.0/24 via 10.42.1.0 dev flannel.1 onlink

# Repeat on the other node.
  • kgxdq runs pods on 10.42.2.x and can't communicate with 10.42.1.x.
  • nn4rg runs pods on 10.42.1.x and can't communicate with 10.42.2.x.

Test

Attempt to access the longhorn-backend service from the longhorn-csi-plugin on each node.

# Executed inside the pod on nn4rg.

longhorn-csi-plugin-h2z28:/ # curl -k longhorn-backend:9500
curl: (28) Failed to connect to longhorn-backend port 9500 after 129625 ms: Couldn't connect to server

longhorn-csi-plugin-h2z28:/ # host longhorn-backend
longhorn-backend.longhorn-system.svc.cluster.local has address 10.43.2.254

longhorn-csi-plugin-h2z28:/ # ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host 
       valid_lft forever preferred_lft forever
2: eth0@if76: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UP group default 
    link/ether 6a:57:bf:7c:47:d3 brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet 10.42.1.172/24 brd 10.42.1.255 scope global eth0
       valid_lft forever preferred_lft forever
    inet6 fe80::6857:bfff:fe7c:47d3/64 scope link 
       valid_lft forever preferred_lft forever

Deep dive

Examining the iptables rules on nn4rg, we see the following:

root@eweber-v126-worker-9c1451b4-nn4rg:~# iptables -t nat -L

Chain KUBE-SERVICES (2 references)
KUBE-SVC-NQL64TY2T3FWFXFQ  tcp  --  anywhere             10.43.2.254          /* longhorn-system/longhorn-backend:manager cluster IP */ tcp dpt:9500

Chain KUBE-SVC-NQL64TY2T3FWFXFQ (1 references)
target     prot opt source               destination         
KUBE-MARK-MASQ  tcp  -- !10.42.0.0/16         10.43.2.254          /* longhorn-system/longhorn-backend:manager cluster IP */ tcp dpt:9500
KUBE-SEP-W3Y4QUAIHX5U3K5C  all  --  anywhere             anywhere             /* longhorn-system/longhorn-backend:manager -> 10.42.1.171:9500 */ recent: CHECK seconds: 10800 reap name: KUBE-SEP-W3Y4QUAIHX5U3K5C side: source mask: 255.255.255.255
KUBE-SEP-N5TYH7GRWXGGDHYU  all  --  anywhere             anywhere             /* longhorn-system/longhorn-backend:manager -> 10.42.2.150:9500 */ recent: CHECK seconds: 10800 reap name: KUBE-SEP-N5TYH7GRWXGGDHYU side: source mask: 255.255.255.255
KUBE-SEP-W3Y4QUAIHX5U3K5C  all  --  anywhere             anywhere             /* longhorn-system/longhorn-backend:manager -> 10.42.1.171:9500 */ statistic mode random probability 0.50000000000
KUBE-SEP-N5TYH7GRWXGGDHYU  all  --  anywhere             anywhere             /* longhorn-system/longhorn-backend:manager -> 10.42.2.150:9500 */

Chain KUBE-SEP-N5TYH7GRWXGGDHYU (2 references)
target     prot opt source               destination         
KUBE-MARK-MASQ  all  --  10.42.2.150          anywhere             /* longhorn-system/longhorn-backend:manager */
DNAT       tcp  --  anywhere             anywhere             /* longhorn-system/longhorn-backend:manager */ recent: SET name: KUBE-SEP-N5TYH7GRWXGGDHYU side: source mask: 255.255.255.255 tcp to:10.42.2.150:9500

Chain KUBE-SEP-W3Y4QUAIHX5U3K5C (2 references)
target     prot opt source               destination         
KUBE-MARK-MASQ  all  --  10.42.1.171          anywhere             /* longhorn-system/longhorn-backend:manager */
DNAT       tcp  --  anywhere             anywhere             /* longhorn-system/longhorn-backend:manager */ recent: SET name: KUBE-SEP-W3Y4QUAIHX5U3K5C side: source mask: 255.255.255.255 tcp to:10.42.1.171:9500

During PREROUTING, if a packet is destined for the longhorn-backend ClusterIP (10.43.2.254), we jump to the chain that handles load balancing (KUBE-SVC-NQL64TY2T3FWFXFQ).

The chain that handles load balancing (KUBE-SVC-NQL64TY2T3FWFXFQ) has two kinds of rules:

  • The first set of rules are for packets from source IPs that have already been balanced in the last configured number of seconds (10800). If, for example, the source IP is in the KUBE-SEP-W3Y4QUAIHX5U3K5C list, we jump to the KUBE-SEP-W3Y4QUAIHX5U3K5C chain.
  • The second set of rules are for packets from source IPs that have not already been balanced. Probabilities are used to ensure we have an equal chance to jump to any related chain (e.g. KUBE-SEP-W3Y4QUAIHX5U3K5C).

When we jump to a related chain, we update the associated list with the source IP and then change the destination of the packet to the one configured (e.g. 10.42.2.150).

The lists are stored in /proc. In this example, packets from 10.42.1.185 are routed to 10.42.1.171 because the source happened to end up in the KUBE-SEP-W3Y4QUAIHX5U3K5C list.

root@eweber-v126-worker-9c1451b4-nn4rg:~# cat /proc/net/xt_recent/KUBE-SEP-W3Y4QUAIHX5U3K5C
src=10.42.1.185 ttl: 64 last_seen: 4463900565 oldest_pkt: 3 4463900279, 4463900564, 4463900565

The routing attempt from the longhorn-csi-plugin pod on nn4rg above failed because that pod's source IP address ended up in the wrong list. Its packets were destined for 10.42.2.150, which we configured its host to be unable to reach.

root@eweber-v126-worker-9c1451b4-nn4rg:~# grep "" /proc/net/xt_recent/*
/proc/net/xt_recent/KUBE-SEP-N5TYH7GRWXGGDHYU:src=10.42.1.172 ttl: 64 last_seen: 4465135466 oldest_pkt: 5 4465095240, 4465119790, 4465120849, 4465122431, 4465135466
/proc/net/xt_recent/KUBE-SEP-W3Y4QUAIHX5U3K5C:src=10.42.1.185 ttl: 64 last_seen: 4463900565 oldest_pkt: 3 4463900279, 4463900564, 4463900565

We can fix the "problem" by removing the pod's entry from the KUBE-SEP-N5TYH7GRWXGGDHYU list (writing "/" to an xt_recent proc file flushes the whole list) and trying again. This time, we happen to load balance to a longhorn-backend pod we can actually talk to (though it's equally likely to end up in the same "bad" list).

root@eweber-v126-worker-9c1451b4-nn4rg:~# echo / >/proc/net/xt_recent/KUBE-SEP-N5TYH7GRWXGGDHYU

longhorn-csi-plugin-h2z28:/ # curl -k longhorn-backend:9500
{"data":[{"actions":{},"id":"v1","links":{"self":"http://longhorn-backend:9500/v1"},"type":"apiVersion"}],"links":{"latest":"http://longhorn-backend:9500/v1","self":"http://longhorn-backend:9500/"},"resourceType":"apiVersion","type":"collection"}

root@eweber-v126-worker-9c1451b4-nn4rg:~# grep "" /proc/net/xt_recent/*
/proc/net/xt_recent/KUBE-SEP-W3Y4QUAIHX5U3K5C:src=10.42.1.185 ttl: 64 last_seen: 4463900565 oldest_pkt: 3 4463900279, 4463900564, 4463900565
/proc/net/xt_recent/KUBE-SEP-W3Y4QUAIHX5U3K5C:src=10.42.1.172 ttl: 64 last_seen: 4465265740 oldest_pkt: 1 4465265740

Conclusion

It's fairly easy to understand why tenancy (when set up this way) can cause major problems for Longhorn. However, it's not 100% clear that removing sessionAffinity: ClientIP will allow Longhorn to work normally in this environment; it depends on how gracefully we handle connection failures.

  • What happens if a longhorn-csi-plugin pod fails to contact the longhorn-backend service on its first attempt?
  • How many times will it retry?
  • What happens in the off chance that each time it retries, the Kubernetes network control plane chooses the same backend pod?

@ejweber
Contributor

ejweber commented Jan 17, 2024

We have had sessionAffinity: ClientIP on certain services since the first time we ran Longhorn in Kubernetes: #22. There is no discussion in that PR of any particular reason why it would be desirable.

Today we have it in four services:

eweber@laptop:~/longhorn> kl get service -oyaml | grep -e "sessionAffinity: ClientIP" -B 6
    - name: admission-webhook
      port: 9502
      protocol: TCP
      targetPort: admission-wh
    selector:
      app: longhorn-manager
    sessionAffinity: ClientIP
--
    - name: manager
      port: 9500
      protocol: TCP
      targetPort: manager
    selector:
      app: longhorn-manager
    sessionAffinity: ClientIP
--
    - name: conversion-webhook
      port: 9501
      protocol: TCP
      targetPort: conversion-wh
    selector:
      app: longhorn-manager
    sessionAffinity: ClientIP
--
    - name: recovery-backend
      port: 9503
      protocol: TCP
      targetPort: recov-backend
    selector:
      app: longhorn-manager
    sessionAffinity: ClientIP

It seems likely that we simply copied the original services when creating new ones.

@ejweber
Contributor

ejweber commented Jan 17, 2024

So far, I cannot see any reason to keep sessionAffinity: ClientIP. There do not seem to be any mechanisms that rely on the fact that a service client will always talk to the same backend pod. I will do the following without it enabled:

  • Run the end-to-end tests.
  • Click around in the UI and verify it remains functional.

@ejweber
Contributor

ejweber commented Jan 17, 2024

Tests for QA (working list; checked off by me when completed, before QA review):

  • PASS - Deploy a previous version of Longhorn. Then, deploy a version of Longhorn with sessionAffinity: ClientIP removed. Verify with kl get service -oyaml | grep -e "sessionAffinity: ClientIP" that there are no longer any services with sessionAffinity: ClientIP.
  • PASS - Run the end-to-end tests.
  • PASS - Click around in the UI and verify it remains functional.

Tests for developer (working list):

@ejweber
Contributor

ejweber commented Jan 18, 2024

End-to-end tests WITHOUT sessionAffinity: ClientIP passed in https://ci.longhorn.io/job/private/job/longhorn-tests-regression/6011/ with one exception that passed in https://ci.longhorn.io/job/private/job/longhorn-tests-regression/6020/.

@ejweber
Contributor

ejweber commented Jan 22, 2024

I recommend we stop hard coding sessionAffinity: ClientIP and do not provide an additional option to configure it. This means all clusters will no longer have it enabled. Please provide feedback at #7748.
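
For illustration, this is what the conversion webhook Service from the issue description would look like with the hard-coded line dropped (a sketch of the intended end state; the actual change is in #7748). With the field omitted, Kubernetes defaults to sessionAffinity: None, so every new connection can be balanced independently.

apiVersion: v1
kind: Service
metadata:
  labels: {{- include "longhorn.labels" . | nindent 4 }}
    app: longhorn-conversion-webhook
  name: longhorn-conversion-webhook
  namespace: {{ include "release_namespace" . }}
spec:
  type: ClusterIP
  selector:
    app: longhorn-conversion-webhook
  ports:
  - name: conversion-webhook
    port: 9443
    targetPort: conversion-wh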

@longhorn-io-github-bot

longhorn-io-github-bot commented Jan 22, 2024

Pre Ready-For-Testing Checklist

  • Where is the reproduce steps/test steps documented?
    The reproduce steps/test steps are at: [IMPROVEMENT] Remove static sessionAffinity: ClientIP set in most services if not required #7399 (comment).

  • Is there a workaround for the issue? If so, where is it documented?
    If a user needs to unset sessionAffinity: ClientIP before the PR lands, they can make the same manual changes the PR makes to deploy/longhorn.yaml or chart/templates. If the user is using a managed chart (e.g. with Rancher), they can make the changes to the deployed services themselves (see the patch sketch after this checklist).

  • Does the PR include the explanation for the fix or the feature?

  • Does the PR include deployment change (YAML/Chart)? If so, where are the PRs for both YAML file and Chart?
    The PR for the YAML change is at: Remove sessionAffinity ClientIP from services #7748.
    The PR for the chart change is at: Remove sessionAffinity ClientIP from services #7748.

  • Which areas/issues this PR might have potential impacts on?
    It is unclear why sessionAffinity: ClientIP was ever enabled, and disabling it does not appear to have any negative effects. However, there may be unintended consequences we do not expect, especially in corner-case deployments.
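
As a concrete sketch of the workaround above (the service name and patch file name are only examples), a deployed service can be reset to the Kubernetes default with a merge patch:

# affinity-patch.yaml -- apply to each affected service, e.g.:
#   kubectl -n longhorn-system patch svc longhorn-backend --type merge --patch-file affinity-patch.yaml
spec:
  sessionAffinity: None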

@innobead
Member Author

Moved to 1.6.0.

@innobead innobead changed the title [TASK] Review why sessionAffinity: ClientIP is used in most services [IMPROVEMENT] Remove static sessionAffinity: ClientIP used in most services if not required Jan 23, 2024
@innobead innobead added kind/improvement Request for improvement of existing function and removed kind/task General task request to fulfill another primary request labels Jan 23, 2024
@innobead innobead changed the title [IMPROVEMENT] Remove static sessionAffinity: ClientIP used in most services if not required [IMPROVEMENT] Remove static sessionAffinity: ClientIP set in most services if not required Jan 23, 2024
@yangchiu
Member

Verified and passed on master-head (longhorn-manager 9aec52) and v1.6.x-head (longhorn-manager 59df82) following the test plan.

When upgrading Longhorn from v1.5.3 to master-head and from v1.5.3 to v1.6.x-head, sessionAffinity: ClientIP is originally set on the services:

# kubectl get svc -n longhorn-system -oyaml | grep -e "sessionAffinity: ClientIP"
    sessionAffinity: ClientIP
    sessionAffinity: ClientIP
    sessionAffinity: ClientIP
    sessionAffinity: ClientIP

But after the upgrade, the entries no longer exist:

# kubectl get svc -n longhorn-system -oyaml | grep -e "sessionAffinity: ClientIP"
# empty

And there were no issues in the subsequent regression tests.

The UI was checked as well, and it still works.
