Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

dual-stack proxies share healthz and metrics, corrupting the results #116486

Open
danwinship opened this issue Mar 10, 2023 · 10 comments
Open

dual-stack proxies share healthz and metrics, corrupting the results #116486

danwinship opened this issue Mar 10, 2023 · 10 comments
Assignees
Labels
area/kube-proxy priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. sig/network Categorizes an issue or PR as relevant to SIG Network. triage/accepted Indicates an issue or PR is ready to be actively worked on.

Comments

@danwinship
Copy link
Contributor

In a dual-stack proxy there is only a single healthz server, and likewise none of the proxy metrics are differentiated by IP family. Thus, the IPv4 and IPv6 sub-proxies are both updating the same data.

For the healthz server, this means that the "becoming unhealthy" timer starts counting whenever either proxy queues an update, and gets reset whenever either proxy successfully syncs. (So, eg, if every IPv6 sync fails, it could still be reporting healthy as long as IPv4 syncs are succeeding frequently enough.)

For metrics:

  • Some are just broken:
    • The SyncProxyRulesLastQueuedTimestamp and SyncProxyRulesLastTimestamp metrics are broken in the same way as the health check; at any given time it's possible that one of them was set by the IPv4 proxy and the other was set by the IPv6 proxy.
    • The IptablesRulesTotal and SyncProxyRulesNoLocalEndpointsTotal metrics report the most-recently-observed value for either IPv4 or IPv6, whichever synced last.
    • The ServiceChangesPending and EndpointChangesPending metrics report the most-recently-observed value for either IPv4 or IPv6, whichever synced or processed an event last.
  • Some may or may not be useful depending on cluster configuration:
    • SyncProxyRulesLatency will be based on the duration of both IPv4 syncProxyRules calls and IPv6 syncProxyRules calls, which will skew the results if you have either way more IPv4 Services than IPv6 ones or vice versa.
    • Likewise, NetworkProgrammingLatency will include both IPv4 programming latency and IPv6 programming latency, which could potentially be very different.
  • Some other metrics are fine:
    • IptablesRestoreFailuresTotal and IptablesPartialRestoreFailuresTotal correctly report the total number of IPv4 and IPv6 failures combined.
    • ServiceChangesTotal and EndpointChangesTotal correctly report the total number of IPv4 and IPv6 changes combined.

/sig network
/area kube-proxy
cc @khenidak

@k8s-ci-robot k8s-ci-robot added sig/network Categorizes an issue or PR as relevant to SIG Network. area/kube-proxy needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Mar 10, 2023
@aojea
Copy link
Member

aojea commented Mar 13, 2023

/priority important-soon

@k8s-ci-robot k8s-ci-robot added the priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. label Mar 13, 2023
@aroradaman
Copy link
Member

/assign

@aroradaman
Copy link
Member

@aojea @danwinship
I can prepare a draft for tracking new metrics exclusively for IPv6

SyncProxy6RulesLastQueuedTimestamp => sync_proxy6_rules_last_queued_timestamp_seconds
SyncProxyRulesLastTimestamp => sync_proxy6_rules_last_timestamp_seconds
Ip6TablesRulesTotal => sync_proxy6_rules_iptables_total
NetworkProgrammingLatency => network6_programming_duration_seconds

Following this convention suffix of 6 at end of proxy or network keyword for other relevant metrics.

For healthz server maybe we can maintain mappings with key as IPFamily for last updated and oldest pending queued values and return OK based on the values of the map and network configuration.

diff --git a/pkg/proxy/healthcheck/proxier_health.go b/pkg/proxy/healthcheck/proxier_health.go
index 7dc5e4e4b4d..b8c4f204482 100644
--- a/pkg/proxy/healthcheck/proxier_health.go
+++ b/pkg/proxy/healthcheck/proxier_health.go
@@ -60,8 +60,8 @@ type proxierHealthServer struct {
        recorder      events.EventRecorder
        nodeRef       *v1.ObjectReference
 
-       lastUpdated         atomic.Value
-       oldestPendingQueued atomic.Value
+       lastUpdated         map[v1.IPFamily]atomic.Value
+       oldestPendingQueued map[v1.IPFamily]atomic.Value
 }

@danwinship
Copy link
Contributor Author

I don't think we should have metrics with different names. (Then you'd need different alert rules for IPv4 and IPv6 clusters...)

Maybe we want one metric with separate values (like how the SyncProxyRulesNoLocalEndpointsTotal metric has separate Local and Cluster values). Maybe we want to just combine the IPv4 and IPv6 values together. Maybe we want one of those for some metrics and the other for other metrics...

@MikeZappa87
Copy link

/triage accepted

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Mar 16, 2023
@k8s-triage-robot
Copy link

This issue is labeled with priority/important-soon but has not been updated in over 90 days, and should be re-triaged.
Important-soon issues must be staffed and worked on either currently, or very soon, ideally in time for the next release.

You can:

  • Confirm that this issue is still relevant with /triage accepted (org members only)
  • Deprioritize it with /priority important-longterm or /priority backlog
  • Close this issue with /close

For more details on the triage process, see https://www.kubernetes.dev/docs/guide/issue-triage/

/remove-triage accepted

@k8s-ci-robot k8s-ci-robot added needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. and removed triage/accepted Indicates an issue or PR is ready to be actively worked on. labels Jun 14, 2023
@aroradaman
Copy link
Member

/triage accepted

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Jun 17, 2023
@k8s-triage-robot
Copy link

This issue is labeled with priority/important-soon but has not been updated in over 90 days, and should be re-triaged.
Important-soon issues must be staffed and worked on either currently, or very soon, ideally in time for the next release.

You can:

  • Confirm that this issue is still relevant with /triage accepted (org members only)
  • Deprioritize it with /priority important-longterm or /priority backlog
  • Close this issue with /close

For more details on the triage process, see https://www.kubernetes.dev/docs/guide/issue-triage/

/remove-triage accepted

@k8s-ci-robot k8s-ci-robot added needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. and removed triage/accepted Indicates an issue or PR is ready to be actively worked on. labels Jan 19, 2024
@danwinship
Copy link
Contributor Author

/triage accepted

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Jan 19, 2024
@danwinship
Copy link
Contributor Author

Maybe we want one metric with separate values (like how the SyncProxyRulesNoLocalEndpointsTotal metric has separate Local and Cluster values). Maybe we want to just combine the IPv4 and IPv6 values together. Maybe we want one of those for some metrics and the other for other metrics...

While looking at some of @aojea's nftables metrics, I realized that merging the v4 and v6 values together for things like NetworkProgrammingLatency means that our perfscale metrics are consistently off; they get a (roughly) 0 value merged in for IPv6 once every sync period. So we need to do something that will let us avoid that.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/kube-proxy priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. sig/network Categorizes an issue or PR as relevant to SIG Network. triage/accepted Indicates an issue or PR is ready to be actively worked on.
Projects
None yet
Development

No branches or pull requests

6 participants