Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[WIP] Nftables proxy #124272

Closed
wants to merge 5 commits into from
Closed

[WIP] Nftables proxy #124272

wants to merge 5 commits into from

Conversation

aojea
Copy link
Member

@aojea aojea commented Apr 11, 2024

/kind feature

NONE

@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. release-note-none Denotes a PR that doesn't merit a release note. size/S Denotes a PR that changes 10-29 lines, ignoring generated files. kind/feature Categorizes issue or PR as related to a new feature. labels Apr 11, 2024
@k8s-ci-robot
Copy link
Contributor

Please note that we're already in Test Freeze for the release-1.30 branch. This means every merged PR will be automatically fast-forwarded via the periodic ci-fast-forward job to the release branch of the upcoming v1.30.0 release.

Fast forwards are scheduled to happen every 6 hours, whereas the most recent run was: Thu Apr 11 13:45:22 UTC 2024.

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Apr 11, 2024
@k8s-ci-robot
Copy link
Contributor

This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the needs-priority Indicates a PR lacks a `priority/foo` label and requires one. label Apr 11, 2024
@aojea
Copy link
Member Author

aojea commented Apr 11, 2024

/hold
/cc @danwinship

let's start to get some data

@k8s-ci-robot k8s-ci-robot added do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. area/kube-proxy area/provider/gcp Issues or PRs related to gcp provider sig/cloud-provider Categorizes an issue or PR as relevant to SIG Cloud Provider. sig/network Categorizes an issue or PR as relevant to SIG Network. and removed do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Apr 11, 2024
@aojea
Copy link
Member Author

aojea commented Apr 11, 2024

/test

@k8s-ci-robot
Copy link
Contributor

@aojea: The /test command needs one or more targets.
The following commands are available to trigger required jobs:

  • /test pull-cadvisor-e2e-kubernetes
  • /test pull-cos-containerd-e2e-ubuntu-gce
  • /test pull-kubernetes-conformance-kind-ga-only-parallel
  • /test pull-kubernetes-coverage-unit
  • /test pull-kubernetes-dependencies
  • /test pull-kubernetes-dependencies-go-canary
  • /test pull-kubernetes-e2e-gce
  • /test pull-kubernetes-e2e-gce-100-performance
  • /test pull-kubernetes-e2e-gce-big-performance
  • /test pull-kubernetes-e2e-gce-cos
  • /test pull-kubernetes-e2e-gce-cos-canary
  • /test pull-kubernetes-e2e-gce-cos-no-stage
  • /test pull-kubernetes-e2e-gce-network-proxy-http-connect
  • /test pull-kubernetes-e2e-gce-scale-performance-manual
  • /test pull-kubernetes-e2e-kind
  • /test pull-kubernetes-e2e-kind-ipv6
  • /test pull-kubernetes-integration
  • /test pull-kubernetes-integration-go-canary
  • /test pull-kubernetes-kubemark-e2e-gce-scale
  • /test pull-kubernetes-node-e2e-containerd
  • /test pull-kubernetes-typecheck
  • /test pull-kubernetes-unit
  • /test pull-kubernetes-unit-go-canary
  • /test pull-kubernetes-update
  • /test pull-kubernetes-verify
  • /test pull-kubernetes-verify-go-canary
  • /test pull-kubernetes-verify-lint

The following commands are available to trigger optional jobs:

  • /test check-dependency-stats
  • /test pull-ci-kubernetes-unit-windows
  • /test pull-crio-cgroupv1-node-e2e-eviction
  • /test pull-crio-cgroupv1-node-e2e-features
  • /test pull-crio-cgroupv1-node-e2e-hugepages
  • /test pull-crio-cgroupv1-node-e2e-resource-managers
  • /test pull-crio-cgroupv2-imagefs-separatedisktest
  • /test pull-crio-cgroupv2-node-e2e-eviction
  • /test pull-crio-cgroupv2-splitfs-splitdisk
  • /test pull-e2e-gce-cloud-provider-disabled
  • /test pull-kubernetes-conformance-image-test
  • /test pull-kubernetes-conformance-kind-ga-only
  • /test pull-kubernetes-conformance-kind-ipv6-parallel
  • /test pull-kubernetes-cos-cgroupv1-containerd-node-e2e
  • /test pull-kubernetes-cos-cgroupv1-containerd-node-e2e-features
  • /test pull-kubernetes-cos-cgroupv2-containerd-node-e2e
  • /test pull-kubernetes-cos-cgroupv2-containerd-node-e2e-eviction
  • /test pull-kubernetes-cos-cgroupv2-containerd-node-e2e-features
  • /test pull-kubernetes-cos-cgroupv2-containerd-node-e2e-serial
  • /test pull-kubernetes-crio-node-memoryqos-cgrpv2
  • /test pull-kubernetes-cross
  • /test pull-kubernetes-e2e-autoscaling-hpa-cm
  • /test pull-kubernetes-e2e-autoscaling-hpa-cpu
  • /test pull-kubernetes-e2e-autoscaling-vpa-full
  • /test pull-kubernetes-e2e-capz-azure-disk
  • /test pull-kubernetes-e2e-capz-azure-disk-vmss
  • /test pull-kubernetes-e2e-capz-azure-file
  • /test pull-kubernetes-e2e-capz-azure-file-vmss
  • /test pull-kubernetes-e2e-capz-conformance
  • /test pull-kubernetes-e2e-capz-windows-alpha-feature-vpa
  • /test pull-kubernetes-e2e-capz-windows-alpha-features
  • /test pull-kubernetes-e2e-capz-windows-master
  • /test pull-kubernetes-e2e-capz-windows-serial-slow
  • /test pull-kubernetes-e2e-capz-windows-serial-slow-hpa
  • /test pull-kubernetes-e2e-containerd-gce
  • /test pull-kubernetes-e2e-ec2
  • /test pull-kubernetes-e2e-ec2-arm64
  • /test pull-kubernetes-e2e-ec2-conformance
  • /test pull-kubernetes-e2e-ec2-conformance-arm64
  • /test pull-kubernetes-e2e-ec2-device-plugin-gpu
  • /test pull-kubernetes-e2e-gce-canary
  • /test pull-kubernetes-e2e-gce-correctness
  • /test pull-kubernetes-e2e-gce-cos-alpha-features
  • /test pull-kubernetes-e2e-gce-csi-serial
  • /test pull-kubernetes-e2e-gce-device-plugin-gpu
  • /test pull-kubernetes-e2e-gce-disruptive-canary
  • /test pull-kubernetes-e2e-gce-kubelet-credential-provider
  • /test pull-kubernetes-e2e-gce-network-policies
  • /test pull-kubernetes-e2e-gce-network-proxy-grpc
  • /test pull-kubernetes-e2e-gce-providerless
  • /test pull-kubernetes-e2e-gce-serial
  • /test pull-kubernetes-e2e-gce-serial-canary
  • /test pull-kubernetes-e2e-gce-storage-disruptive
  • /test pull-kubernetes-e2e-gce-storage-slow
  • /test pull-kubernetes-e2e-gce-storage-snapshot
  • /test pull-kubernetes-e2e-gci-gce-autoscaling
  • /test pull-kubernetes-e2e-gci-gce-ingress
  • /test pull-kubernetes-e2e-gci-gce-ipvs
  • /test pull-kubernetes-e2e-inplace-pod-resize-containerd-main-v2
  • /test pull-kubernetes-e2e-kind-alpha-beta-features
  • /test pull-kubernetes-e2e-kind-alpha-features
  • /test pull-kubernetes-e2e-kind-beta-features
  • /test pull-kubernetes-e2e-kind-canary
  • /test pull-kubernetes-e2e-kind-cloud-provider-loadbalancer
  • /test pull-kubernetes-e2e-kind-dual-canary
  • /test pull-kubernetes-e2e-kind-evented-pleg
  • /test pull-kubernetes-e2e-kind-ipv6-canary
  • /test pull-kubernetes-e2e-kind-ipvs-dual-canary
  • /test pull-kubernetes-e2e-kind-kms
  • /test pull-kubernetes-e2e-kind-multizone
  • /test pull-kubernetes-e2e-kind-nftables
  • /test pull-kubernetes-e2e-storage-kind-alpha-features
  • /test pull-kubernetes-e2e-storage-kind-disruptive
  • /test pull-kubernetes-kind-dra
  • /test pull-kubernetes-kind-json-logging
  • /test pull-kubernetes-kind-text-logging
  • /test pull-kubernetes-kubemark-e2e-gce-big
  • /test pull-kubernetes-linter-hints
  • /test pull-kubernetes-local-e2e
  • /test pull-kubernetes-node-arm64-e2e-containerd-ec2
  • /test pull-kubernetes-node-arm64-e2e-containerd-serial-ec2
  • /test pull-kubernetes-node-arm64-ubuntu-serial-gce
  • /test pull-kubernetes-node-crio-cgrpv1-evented-pleg-e2e
  • /test pull-kubernetes-node-crio-cgrpv2-e2e
  • /test pull-kubernetes-node-crio-cgrpv2-e2e-kubetest2
  • /test pull-kubernetes-node-crio-cgrpv2-imagefs-e2e
  • /test pull-kubernetes-node-crio-cgrpv2-splitfs-e2e
  • /test pull-kubernetes-node-crio-e2e
  • /test pull-kubernetes-node-crio-e2e-kubetest2
  • /test pull-kubernetes-node-e2e-containerd-1-7-dra
  • /test pull-kubernetes-node-e2e-containerd-alpha-features
  • /test pull-kubernetes-node-e2e-containerd-ec2
  • /test pull-kubernetes-node-e2e-containerd-features
  • /test pull-kubernetes-node-e2e-containerd-features-kubetest2
  • /test pull-kubernetes-node-e2e-containerd-kubetest2
  • /test pull-kubernetes-node-e2e-containerd-serial-ec2
  • /test pull-kubernetes-node-e2e-containerd-standalone-mode
  • /test pull-kubernetes-node-e2e-containerd-standalone-mode-all-alpha
  • /test pull-kubernetes-node-e2e-crio-dra
  • /test pull-kubernetes-node-kubelet-containerd-flaky
  • /test pull-kubernetes-node-kubelet-credential-provider
  • /test pull-kubernetes-node-kubelet-serial-containerd
  • /test pull-kubernetes-node-kubelet-serial-containerd-alpha-features
  • /test pull-kubernetes-node-kubelet-serial-containerd-kubetest2
  • /test pull-kubernetes-node-kubelet-serial-containerd-sidecar-containers
  • /test pull-kubernetes-node-kubelet-serial-cpu-manager
  • /test pull-kubernetes-node-kubelet-serial-cpu-manager-kubetest2
  • /test pull-kubernetes-node-kubelet-serial-crio-cgroupv1
  • /test pull-kubernetes-node-kubelet-serial-crio-cgroupv2
  • /test pull-kubernetes-node-kubelet-serial-hugepages
  • /test pull-kubernetes-node-kubelet-serial-memory-manager
  • /test pull-kubernetes-node-kubelet-serial-pod-disruption-conditions
  • /test pull-kubernetes-node-kubelet-serial-topology-manager
  • /test pull-kubernetes-node-kubelet-serial-topology-manager-kubetest2
  • /test pull-kubernetes-node-swap-conformance-fedora-serial
  • /test pull-kubernetes-node-swap-conformance-ubuntu-serial
  • /test pull-kubernetes-node-swap-fedora
  • /test pull-kubernetes-node-swap-fedora-serial
  • /test pull-kubernetes-node-swap-ubuntu-serial
  • /test pull-kubernetes-unit-experimental
  • /test pull-publishing-bot-validate

Use /test all to run the following jobs that were automatically triggered:

  • pull-kubernetes-conformance-kind-ga-only-parallel
  • pull-kubernetes-dependencies
  • pull-kubernetes-e2e-ec2
  • pull-kubernetes-e2e-ec2-conformance
  • pull-kubernetes-e2e-gce
  • pull-kubernetes-e2e-gce-canary
  • pull-kubernetes-e2e-gce-cos-alpha-features
  • pull-kubernetes-e2e-kind
  • pull-kubernetes-e2e-kind-ipv6
  • pull-kubernetes-integration
  • pull-kubernetes-linter-hints
  • pull-kubernetes-node-e2e-containerd
  • pull-kubernetes-typecheck
  • pull-kubernetes-unit
  • pull-kubernetes-verify
  • pull-kubernetes-verify-lint

In response to this:

/test

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: aojea
Once this PR has been reviewed and has the lgtm label, please assign mwielgus for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@aojea
Copy link
Member Author

aojea commented Apr 11, 2024

/test pull-kubernetes-e2e-gce-100-performance

@aojea
Copy link
Member Author

aojea commented Apr 11, 2024

/test pull-kubernetes-e2e-gce-100-performance

@aojea
Copy link
Member Author

aojea commented Apr 11, 2024

https://prow.k8s.io/view/gs/kubernetes-jenkins/pr-logs/pull/124272/pull-kubernetes-e2e-gce-100-performance/1778448759718416384

    {
      "data": {
        "Perc50": 1519.802112,
        "Perc90": 2282.213099,
        "Perc99": 6815.068537
      },
      "unit": "ms",
      "labels": {
        "Metric": "NetworkProgrammingLatency"
      }

This is much better, but iptables runs with 10s of sync period, last time we analyzed this the latency in iptables improved but CPU increased #110268 (comment)
#114229 (comment)
#115720

image

      "data": {
        "Perc50": 12.5,
        "Perc90": 22.5,
        "Perc99": 24.75
      },
      "unit": "ms",
      "labels": {
        "Metric": "InClusterNetworkLatency"
      }

is the same https://perf-dash.k8s.io/#/?jobname=gce-100Nodes-master&metriccategoryname=Network&metricname=Load_NetworkLatency&Metric=InClusterNetworkLatency

The sync rules in nftables

image

The sync rules in iptables from https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/ci-kubernetes-e2e-gci-gce-scalability/1778462542213943296 :) that are way better

image

@k8s-ci-robot
Copy link
Contributor

k8s-ci-robot commented Apr 11, 2024

@aojea: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
pull-kubernetes-unit fa74ceb link true /test pull-kubernetes-unit

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@danwinship
Copy link
Contributor

danwinship commented Apr 11, 2024

https://prow.k8s.io/view/gs/kubernetes-jenkins/pr-logs/pull/124272/pull-kubernetes-e2e-gce-100-performance/1778448759718416384

    {
      "data": {
        "Perc50": 1519.802112,
        "Perc90": 2282.213099,
        "Perc99": 6815.068537
      },
      "unit": "ms",
      "labels": {
        "Metric": "NetworkProgrammingLatency"
      }

(So that's the NetworkProgrammingLatency numbers from the e2e-gce-100-performance run you kicked off in this PR, aka "length of time from when the EndpointSliceController makes a change to when kube-proxy programs the change in nftables".)

This is much better [than equivalent current iptables]

For comparision from a recent run:

    {
      "data": {
        "Perc50": 9034.440056,
        "Perc90": 10284.454043,
        "Perc99": 10984.283114
      },
      "unit": "ms",
      "labels": {
        "Metric": "NetworkProgrammingLatency"
      }
    }

but iptables runs with 10s of sync period

(which is to say, the job runs with --iptables-min-sync-period=10s so that it does at most 1 iptables sync every 10 seconds, thus forcing p50 NetworkProgrammingLatency to hover around (10 seconds minus the length of time it takes to do a single sync).)

last time we analyzed [switching to --iptables-min-sync-period=1s] the latency in iptables improved but CPU increased

Honestly, running kube-proxy with min sync period 10s is just crazy. Like, if a pod crashes, you want to keep sending new connections to its former IP address for another 10 seconds?

If we think kube-proxy uses too much CPU with a shorter sync period then we should fix that. (I think it's not clear how much of the CPU was kube-proxy itself vs kube-proxy plus iptables-restore?)

      "data": {
        "Perc50": 12.5,
        "Perc90": 22.5,
        "Perc99": 24.75
      },
      "unit": "ms",
      "labels": {
        "Metric": "InClusterNetworkLatency"
      }

is the same

That's the one measuring packet latency for pod-to-service-IP-to-pod traffic?

How many services does this job create? We'd probably need at least 10s of thousands for the difference between iptables kube-proxy's rules and nftables kube-proxy's rules to become noticeable.

(Note that for purposes of this metric, you don't even need a large cluster to test that; you can create 1 pod and 10,000 services that all point to that same pod.)

The sync rules in nftables
...
The sync rules in iptables from https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/ci-kubernetes-e2e-gci-gce-scalability/1778462542213943296 :) that are way better

So that's p95 of sync_proxy_rules_duration_seconds? (aka SyncProxyRulesLatency, aka the time it takes a single syncProxyRules() call to run)

The nftables proxy currently doesn't have the partial-syncing optimizations that iptables has (because we were waiting for the perf jobs before optimizing anything), so this is expected; every sync rewrites every rule, so nftables's sync_proxy_rules_duration_seconds ought to be more comparable to iptables's sync_full_proxy_rules_duration_seconds (note "full")

@@ -1140,7 +1140,7 @@ var defaultKubernetesFeatureGates = map[featuregate.Feature]featuregate.FeatureS

NewVolumeManagerReconstruction: {Default: true, PreRelease: featuregate.GA, LockToDefault: true}, // remove in 1.32

NFTablesProxyMode: {Default: false, PreRelease: featuregate.Alpha},
NFTablesProxyMode: {Default: true, PreRelease: featuregate.Beta},
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

need to update NFTablesProxyMode comment above with version for beta

if [[ "${KUBE_PROXY_MODE:-}" == "nftables" ]];then
params+=" --proxy-mode=nftables"
else
params+=" --iptables-sync-period=1m --iptables-min-sync-period=10s --ipvs-sync-period=1m --ipvs-min-sync-period=10s"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW, you could set the iptables- and ipvs-specific options regardless of KUBE_PROXY_MODE.

But probably some of this should be merged with the existing [[ "${KUBE_PROXY_MODE:-}" == "ipvs" ]] check? And/or it should do --proxy-mode=${KUBE_PROXY_MODE}.

(Also, typo in the commit message: "configura")

# Optional: Change the kube-proxy implementation. Choices are [iptables, ipvs].
KUBE_PROXY_MODE=${KUBE_PROXY_MODE:-iptables}
# Optional: Change the kube-proxy implementation. Choices are [iptables, ipvs, nftables].
KUBE_PROXY_MODE=${KUBE_PROXY_MODE:-nftables}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(should mark this commit [DO NOT MERGE] or something)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it has to be removed , yes

o.logger.Info("Using iptables proxy")
config.Mode = proxyconfigapi.ProxyModeIPTables
o.logger.Info("Using nftables proxy")
config.Mode = proxyconfigapi.ProxyModeNFTables
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(likewise... or just merge this with the other DNM commit)

params+=" --proxy-mode=nftables"
# to avoid the problems caused of adding iptables rules to drop packets with invalid conntrack state
# kube-proxy has an option to set the tcpBeLiberal sysctl that solves the problem and is less disruptive
# https://github.com/kubernetes/kubernetes/issues/94861ss
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Typo in the link (the trailing "ss"), but https://issues.k8s.io/122663#issuecomment-1885024015 is probably a better link anyway.

In particular, we are intentionally requiring that nftables-kube-proxy-based CI set --conntrack-tcp-be-liberal because we want to test that option.

@danwinship
Copy link
Contributor

/close
replaced by #124383

@k8s-ci-robot
Copy link
Contributor

@danwinship: Closed this PR.

In response to this:

/close
replaced by #124383

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/kube-proxy area/provider/gcp Issues or PRs related to gcp provider cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. kind/feature Categorizes issue or PR as related to a new feature. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. release-note-none Denotes a PR that doesn't merit a release note. sig/cloud-provider Categorizes an issue or PR as relevant to SIG Cloud Provider. sig/network Categorizes an issue or PR as relevant to SIG Network. size/S Denotes a PR that changes 10-29 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants