
kube-proxy metrics cleanup (and stuff) #124557

Merged
merged 5 commits into kubernetes:master from metrics-and-stuff on Apr 27, 2024

Conversation

danwinship
Contributor

What type of PR is this?

/kind bug
/kind cleanup

What this PR does / why we need it:

  1. Organizes kube-proxy metrics registration better
  2. Changes it so we don't register iptables-specific metrics in ipvs/nftables/winkernel mode
  3. Fixes nftables to have its own "sync failure" metric rather than reusing the one with "iptables" in the name.
  4. Adds an nftables "cleanup failure" metric (for when the delayed stale chain cleanup fails)
  5. Fixes a bug that made the cleanup failure happen a lot, as seen in recent nftables CI testing. (Previously any time the sync failed, it would be followed by a cleanup failure on the next sync. Now it doesn't do that.)
  6. Fixes another old FIXME that was in the area. (I had commented out a debug log that used a knftables API that went away, and then never uncommented it when the API came back slightly differently.)

Um, yeah, it's not the most focused PR ever...
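
To illustrate item 5: a hedged sketch of the error-path pattern, with illustrative variable and helper names (the actual one-line change is visible in the quoted diff further down the thread):

// Illustrative sketch only, not the actual kube-proxy code: if the main
// nftables transaction fails, the recorded staleChains set no longer
// matches reality, so reset it instead of letting the next sync's
// delayed cleanup fail while trying to flush those chains.
if err := proxier.nftables.Run(context.TODO(), tx); err != nil {
	klog.ErrorS(err, "nftables sync failed")
	metrics.NFTablesSyncFailuresTotal.Inc()
	// staleChains is now incorrect since we didn't actually flush the
	// chains in it; it will be recomputed on the next sync.
	proxier.staleChains = make(map[string]time.Time)
	return
}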

Does this PR introduce a user-facing change?

The nftables kube-proxy mode now has its own metrics rather than reporting
metrics with "iptables" in their names.

/sig network
/area kube-proxy
/assign @aojea @aroradaman

@k8s-ci-robot k8s-ci-robot added the release-note label Apr 26, 2024
@k8s-ci-robot k8s-ci-robot added kind/bug, size/L, kind/cleanup, sig/network, area/kube-proxy, and cncf-cla: yes labels Apr 26, 2024
@k8s-ci-robot
Contributor

This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added needs-triage and needs-priority labels Apr 26, 2024
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: danwinship

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added approved, area/ipvs, and sig/windows labels Apr 26, 2024
Comment on lines 258 to 298
switch mode {
case kubeproxyconfig.ProxyModeIPTables:
	legacyregistry.MustRegister(SyncFullProxyRulesLatency)
	legacyregistry.MustRegister(SyncPartialProxyRulesLatency)
	legacyregistry.MustRegister(IptablesRestoreFailuresTotal)
	legacyregistry.MustRegister(IptablesPartialRestoreFailuresTotal)
	legacyregistry.MustRegister(IptablesRulesTotal)
	legacyregistry.MustRegister(IptablesRulesLastSync)

case kubeproxyconfig.ProxyModeIPVS:
	legacyregistry.MustRegister(IptablesRestoreFailuresTotal)

case kubeproxyconfig.ProxyModeNFTables:
	// FIXME: should not use the iptables-specific metric
	legacyregistry.MustRegister(IptablesRestoreFailuresTotal)

case kubeproxyconfig.ProxyModeKernelspace:
	// currently no winkernel-specific metrics
}
Contributor Author

@dgrisonnet since you helped with kube-proxy metrics on another PR...

does this make sense? It seemed wrong to me to be registering metrics in modes where they don't get used (eg, registering the "iptables-restore failure" metric in windows or nftables mode), but maybe there's a rule that says that we should always register all the same metrics, even if some of them will always be 0?

Member

There is no such rule; registering conditionally like you did sounds like a cleaner approach to me
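
(For context, a sketch of what the nftables case presumably becomes with this PR, assuming the two nftables-specific metric variables introduced later in this thread; illustrative, not the exact diff:)

case kubeproxyconfig.ProxyModeNFTables:
	legacyregistry.MustRegister(NFTablesSyncFailuresTotal)
	legacyregistry.MustRegister(NFTablesCleanupFailuresTotal)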

Member
@dgrisonnet dgrisonnet left a comment

LGTM from sig-instrumentation

Windows proxy metric registration was in a separate file, which had
led to some metrics (eg the new ProxyHealthzTotal and ProxyLivezTotal)
not being registered for Windows even though they were implemented by
platform-generic code.

(A few other metrics were neither registered nor implemented on
Windows, and that's probably a bug.)

Also, beyond linux-vs-windows, make it clearer which metrics are
specific to individual backends.

// staleChains is now incorrect since we didn't actually flush the
// chains in it. We can recompute it next time.
proxier.staleChains = make(map[string]time.Time)
Member

we used to do this thing to avoid reallocating memory, no?

proxier.staleChains = proxier.staleChains[:0]

Contributor Author

hm... you can't do that with a map though...
ah, no, apparently as of Go 1.21 you can do clear(proxier.staleChains)
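
(A minimal, self-contained example of the Go 1.21 clear() built-in on a map, for reference; the map contents here are made up:)

package main

import (
	"fmt"
	"time"
)

func main() {
	staleChains := map[string]time.Time{"stale-chain-1": time.Now()}
	clear(staleChains) // Go 1.21+: removes all entries without allocating a new map
	fmt.Println(len(staleChains)) // prints 0
}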

Member

to be honest, I didn't realize it was a map when I commented, just saw the make and remembered the allocation problems ... glad it helped anyway

@aojea
Member

aojea commented Apr 26, 2024

lgtm once comment #124557 (comment) is resolved

gofmt errors are legit

NFTablesSyncFailuresTotal = metrics.NewCounter(
	&metrics.CounterOpts{
		Subsystem: kubeProxySubsystem,
		Name:      "sync_proxy_rules_nftables_sync_failures_total",
Member

The kubeProxySubsystem will add the kubeproxy prefix, so the final metric would be kubeproxy_sync_proxy_rules_nftables_sync_failures_total.
How about kubeproxy_sync_nftables_rules_failures_total or kubeproxy_nftables_sync_rules_failures_total?
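
(For reference, a small sketch of how the subsystem and metric name compose into the full name; this uses the client_golang helper that follows the same convention, and is illustrative rather than kube-proxy code:)

package main

import (
	"fmt"

	"github.com/prometheus/client_golang/prometheus"
)

func main() {
	// Empty namespace, "kubeproxy" subsystem: non-empty parts are joined with "_".
	fqName := prometheus.BuildFQName("", "kubeproxy", "sync_proxy_rules_nftables_sync_failures_total")
	fmt.Println(fqName) // kubeproxy_sync_proxy_rules_nftables_sync_failures_total
}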

Contributor Author

The corresponding iptables metric is sync_proxy_rules_iptables_restore_failures_total... I was trying to keep it parallel with that.

Admittedly, I did drop the second sync originally, making it sync_proxy_rules_nftables_failures_total, but then when I added the cleanup failures metric, it seemed unbalanced/ambiguous, so I put it back...

I don't have really strong opinions about the names though...

Contributor Author

@aojea any opinion?

Contributor Author

(FWIW, it makes more sense if you realize that the sync_proxy_rules prefix refers to the syncProxyRules function that the metric is coming from...)

Member

I'm bad at naming ... but consistency sounds better for people that already have scripts or dashboards, so they just need to copy-paste and s/iptables/nftables/ ... I don't think anybody references these from memory

NFTablesCleanupFailuresTotal = metrics.NewCounter(
	&metrics.CounterOpts{
		Subsystem: kubeProxySubsystem,
		Name:      "sync_proxy_rules_nftables_cleanup_failures_total",
Member

something similar for this maybe?

@aroradaman
Member

/lgtm (all threads resolved)

/hold for pull-kubernetes-e2e-capz-windows-master

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold label Apr 26, 2024
@aroradaman
Member

/test pull-kubernetes-e2e-capz-windows-master

@aroradaman
Member

/lgtm
/hold cancel

@k8s-ci-robot k8s-ci-robot added lgtm and removed do-not-merge/hold labels Apr 26, 2024
@k8s-ci-robot
Contributor

LGTM label has been added.

Git tree hash: 6b1b4fa81ab759e15f83a101f8651169855ce23d

@k8s-triage-robot

The Kubernetes project has merge-blocking tests that are currently too flaky to consistently pass.

This bot retests PRs for certain kubernetes repos according to the following rules:

  • The PR does not have any do-not-merge/* labels
  • The PR does not have the needs-ok-to-test label
  • The PR is mergeable (does not have a needs-rebase label)
  • The PR is approved (has cncf-cla: yes, lgtm, approved labels)
  • The PR is failing tests required for merge

You can:

/retest

@k8s-ci-robot k8s-ci-robot merged commit ae8474a into kubernetes:master Apr 27, 2024
18 checks passed
@k8s-ci-robot k8s-ci-robot added this to the v1.31 milestone Apr 27, 2024
@danwinship danwinship deleted the metrics-and-stuff branch April 27, 2024 12:17
@danwinship danwinship mentioned this pull request May 7, 2024
27 tasks