
kube-proxy metrics cleanup (and stuff) #124557

Merged
merged 5 commits into kubernetes:master from metrics-and-stuff on Apr 27, 2024

Conversation

danwinship
Contributor

What type of PR is this?

/kind bug
/kind cleanup

What this PR does / why we need it:

  1. Organizes kube-proxy metrics registration better
  2. Changes it so we don't register iptables-specific metrics in ipvs/nftables/winkernel mode
  3. Fixes nftables to have its own "sync failure" metric rather than reusing the one with "iptables" in the name.
  4. Adds an nftables "cleanup failure" metric (for when the delayed stale chain cleanup fails)
  5. Fixes a bug that made the cleanup failure happen a lot, as seen in recent nftables CI testing. (Previously any time the sync failed, it would be followed by a cleanup failure on the next sync. Now it doesn't do that.)
  6. Fixes another old FIXME that was in the area. (I had commented out a debug log that used a knftables API that went away, and then never uncommented it when the API came back slightly differently.)

Um, yeah, it's not the most focused PR ever...
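
To illustrate item 5: a hedged sketch of the error-path pattern, with illustrative variable and helper names (the actual one-line change is visible in the quoted diff further down the thread):

// Illustrative sketch only, not the actual kube-proxy code: if the main
// nftables transaction fails, the recorded staleChains set no longer
// matches reality, so reset it instead of letting the next sync's
// delayed cleanup fail while trying to flush those chains.
if err := proxier.nftables.Run(context.TODO(), tx); err != nil {
	klog.ErrorS(err, "nftables sync failed")
	metrics.NFTablesSyncFailuresTotal.Inc()
	// staleChains is now incorrect since we didn't actually flush the
	// chains in it; it will be recomputed on the next sync.
	proxier.staleChains = make(map[string]time.Time)
	return
}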

Does this PR introduce a user-facing change?

The nftables kube-proxy mode now has its own metrics rather than reporting
metrics with "iptables" in their names.

/sig network
/area kube-proxy
/assign @aojea @aroradaman

@k8s-ci-robot k8s-ci-robot added the release-note label Apr 26, 2024
@k8s-ci-robot k8s-ci-robot added kind/bug, size/L, kind/cleanup, sig/network, area/kube-proxy, and cncf-cla: yes labels Apr 26, 2024
@k8s-ci-robot
Contributor

This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added needs-triage and needs-priority labels Apr 26, 2024
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: danwinship

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added approved, area/ipvs, and sig/windows labels Apr 26, 2024
Comment on lines 258 to 298
switch mode {
case kubeproxyconfig.ProxyModeIPTables:
	legacyregistry.MustRegister(SyncFullProxyRulesLatency)
	legacyregistry.MustRegister(SyncPartialProxyRulesLatency)
	legacyregistry.MustRegister(IptablesRestoreFailuresTotal)
	legacyregistry.MustRegister(IptablesPartialRestoreFailuresTotal)
	legacyregistry.MustRegister(IptablesRulesTotal)
	legacyregistry.MustRegister(IptablesRulesLastSync)

case kubeproxyconfig.ProxyModeIPVS:
	legacyregistry.MustRegister(IptablesRestoreFailuresTotal)

case kubeproxyconfig.ProxyModeNFTables:
	// FIXME: should not use the iptables-specific metric
	legacyregistry.MustRegister(IptablesRestoreFailuresTotal)

case kubeproxyconfig.ProxyModeKernelspace:
	// currently no winkernel-specific metrics
}
Contributor Author

@dgrisonnet since you helped with kube-proxy metrics on another PR...

does this make sense? It seemed wrong to me to be registering metrics in modes where they don't get used (eg, registering the "iptables-restore failure" metric in windows or nftables mode), but maybe there's a rule that says that we should always register all the same metrics, even if some of them will always be 0?

Member

There is no such rule; registering conditionally like you did sounds like a cleaner approach to me
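
(For context, a sketch of what the nftables case presumably becomes with this PR, assuming the two nftables-specific metric variables introduced later in this thread; illustrative, not the exact diff:)

case kubeproxyconfig.ProxyModeNFTables:
	legacyregistry.MustRegister(NFTablesSyncFailuresTotal)
	legacyregistry.MustRegister(NFTablesCleanupFailuresTotal)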

Member
@dgrisonnet dgrisonnet left a comment

LGTM from sig-instrumentation

Windows proxy metric registration was in a separate file, which had
led to some metrics (eg the new ProxyHealthzTotal and ProxyLivezTotal)
not being registered for Windows even though they were implemented by
platform-generic code.

(A few other metrics were neither registered nor implemented on
Windows, and that's probably a bug.)

Also, beyond linux-vs-windows, make it clearer which metrics are
specific to individual backends.

// staleChains is now incorrect since we didn't actually flush the
// chains in it. We can recompute it next time.
proxier.staleChains = make(map[string]time.Time)
Member

we used to do this thing to avoid reallocating memory, no?

proxier.staleChains = proxier.staleChains[:0]

Contributor Author

hm... you can't do that with a map though...
ah, no, apparently as of Go 1.21 you can do clear(proxier.staleChains)
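
(A minimal, self-contained example of the Go 1.21 clear() built-in on a map, for reference; the map contents here are made up:)

package main

import (
	"fmt"
	"time"
)

func main() {
	staleChains := map[string]time.Time{"stale-chain-1": time.Now()}
	clear(staleChains) // Go 1.21+: removes all entries without allocating a new map
	fmt.Println(len(staleChains)) // prints 0
}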

Member

to be honest, I didn't realize it was a map when I commented, just saw the make and remembered the allocation problems ... glad it helped anyway

@aojea
Member

aojea commented Apr 26, 2024

lgtm once comment #124557 (comment) is resolved

gofmt errors are legit

NFTablesSyncFailuresTotal = metrics.NewCounter(
	&metrics.CounterOpts{
		Subsystem: kubeProxySubsystem,
		Name:      "sync_proxy_rules_nftables_sync_failures_total",
Member

The kubeProxySubsystem will add the kubeproxy prefix, so the final metric would be kubeproxy_sync_proxy_rules_nftables_sync_failures_total.
How about kubeproxy_sync_nftables_rules_failures_total or kubeproxy_nftables_sync_rules_failures_total?
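
(For reference, a small sketch of how the subsystem and metric name compose into the full name; this uses the client_golang helper that follows the same convention, and is illustrative rather than kube-proxy code:)

package main

import (
	"fmt"

	"github.com/prometheus/client_golang/prometheus"
)

func main() {
	// Empty namespace, "kubeproxy" subsystem: non-empty parts are joined with "_".
	fqName := prometheus.BuildFQName("", "kubeproxy", "sync_proxy_rules_nftables_sync_failures_total")
	fmt.Println(fqName) // kubeproxy_sync_proxy_rules_nftables_sync_failures_total
}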

Contributor Author

The corresponding iptables metric is sync_proxy_rules_iptables_restore_failures_total... I was trying to keep it parallel with that.

Admittedly, I did drop the second sync originally, making it sync_proxy_rules_nftables_failures_total, but then when I added the cleanup failures metric, it seemed unbalanced/ambiguous, so I put it back...

I don't have really strong opinions about the names though...

Contributor Author

@aojea any opinion?

Contributor Author

(FWIW, it makes more sense if you realize that the sync_proxy_rules prefix refers to the syncProxyRules function that the metric is coming from...)

Member

I'm bad at naming ... but consistency sounds better for people that already have scripts or dashboards, so they just need to copy-paste and s/iptables/nftables/ ... I don't think anybody references these from memory

NFTablesCleanupFailuresTotal = metrics.NewCounter(
	&metrics.CounterOpts{
		Subsystem: kubeProxySubsystem,
		Name:      "sync_proxy_rules_nftables_cleanup_failures_total",
Member

something similar for this maybe?

@aroradaman
Member

/lgtm (all threads resolved)

/hold for pull-kubernetes-e2e-capz-windows-master

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold label Apr 26, 2024
@aroradaman
Member

/test pull-kubernetes-e2e-capz-windows-master

@aroradaman
Member

/lgtm
/hold cancel

@k8s-ci-robot k8s-ci-robot added lgtm and removed do-not-merge/hold labels Apr 26, 2024
@k8s-ci-robot
Contributor

LGTM label has been added.

Git tree hash: 6b1b4fa81ab759e15f83a101f8651169855ce23d

@k8s-triage-robot

The Kubernetes project has merge-blocking tests that are currently too flaky to consistently pass.

This bot retests PRs for certain kubernetes repos according to the following rules:

  • The PR does not have any do-not-merge/* labels
  • The PR does not have the needs-ok-to-test label
  • The PR is mergeable (does not have a needs-rebase label)
  • The PR is approved (has cncf-cla: yes, lgtm, approved labels)
  • The PR is failing tests required for merge

You can:

/retest

@k8s-ci-robot k8s-ci-robot merged commit ae8474a into kubernetes:master Apr 27, 2024
18 checks passed
@k8s-ci-robot k8s-ci-robot added this to the v1.31 milestone Apr 27, 2024
@danwinship danwinship deleted the metrics-and-stuff branch April 27, 2024 12:17
@danwinship danwinship mentioned this pull request May 7, 2024
27 tasks