dual-stack proxies share healthz and metrics, corrupting the results #116486

danwinship · 2023-03-10T22:25:10Z

In a dual-stack proxy there is only a single healthz server, and likewise none of the proxy metrics are differentiated by IP family. Thus, the IPv4 and IPv6 sub-proxies are both updating the same data.

For the healthz server, this means that the "becoming unhealthy" timer starts counting whenever either proxy queues an update, and gets reset whenever either proxy successfully syncs. (So, eg, if every IPv6 sync fails, it could still be reporting healthy as long as IPv4 syncs are succeeding frequently enough.)

For metrics:

Some are just broken:
- The SyncProxyRulesLastQueuedTimestamp and SyncProxyRulesLastTimestamp metrics are broken in the same way as the health check; at any given time it's possible that one of them was set by the IPv4 proxy and the other was set by the IPv6 proxy.
- The IptablesRulesTotal and SyncProxyRulesNoLocalEndpointsTotal metrics report the most-recently-observed value for either IPv4 or IPv6, whichever synced last.
- The ServiceChangesPending and EndpointChangesPending metrics report the most-recently-observed value for either IPv4 or IPv6, whichever synced or processed an event last.
Some may or may not be useful depending on cluster configuration:
- SyncProxyRulesLatency will be based on the duration of both IPv4 syncProxyRules calls and IPv6 syncProxyRules calls, which will skew the results if you have either way more IPv4 Services than IPv6 ones or vice versa.
- Likewise, NetworkProgrammingLatency will include both IPv4 programming latency and IPv6 programming latency, which could potentially be very different.
Some other metrics are fine:
- IptablesRestoreFailuresTotal and IptablesPartialRestoreFailuresTotal correctly report the total number of IPv4 and IPv6 failures combined.
- ServiceChangesTotal and EndpointChangesTotal correctly report the total number of IPv4 and IPv6 changes combined.

/sig network
/area kube-proxy
cc @khenidak

The text was updated successfully, but these errors were encountered:

aojea · 2023-03-13T14:02:46Z

/priority important-soon

aroradaman · 2023-03-13T19:32:59Z

/assign

aroradaman · 2023-03-13T19:56:32Z

@aojea @danwinship
I can prepare a draft for tracking new metrics exclusively for IPv6

SyncProxy6RulesLastQueuedTimestamp => sync_proxy6_rules_last_queued_timestamp_seconds
SyncProxyRulesLastTimestamp => sync_proxy6_rules_last_timestamp_seconds
Ip6TablesRulesTotal => sync_proxy6_rules_iptables_total
NetworkProgrammingLatency => network6_programming_duration_seconds

Following this convention suffix of 6 at end of proxy or network keyword for other relevant metrics.

For healthz server maybe we can maintain mappings with key as IPFamily for last updated and oldest pending queued values and return OK based on the values of the map and network configuration.

diff --git a/pkg/proxy/healthcheck/proxier_health.go b/pkg/proxy/healthcheck/proxier_health.go
index 7dc5e4e4b4d..b8c4f204482 100644
--- a/pkg/proxy/healthcheck/proxier_health.go
+++ b/pkg/proxy/healthcheck/proxier_health.go
@@ -60,8 +60,8 @@ type proxierHealthServer struct {
        recorder      events.EventRecorder
        nodeRef       *v1.ObjectReference
 
-       lastUpdated         atomic.Value
-       oldestPendingQueued atomic.Value
+       lastUpdated         map[v1.IPFamily]atomic.Value
+       oldestPendingQueued map[v1.IPFamily]atomic.Value
 }

danwinship · 2023-03-14T14:22:02Z

I don't think we should have metrics with different names. (Then you'd need different alert rules for IPv4 and IPv6 clusters...)

Maybe we want one metric with separate values (like how the SyncProxyRulesNoLocalEndpointsTotal metric has separate Local and Cluster values). Maybe we want to just combine the IPv4 and IPv6 values together. Maybe we want one of those for some metrics and the other for other metrics...

MikeZappa87 · 2023-03-16T16:22:12Z

/triage accepted

k8s-triage-robot · 2023-06-14T16:51:49Z

This issue is labeled with priority/important-soon but has not been updated in over 90 days, and should be re-triaged.
Important-soon issues must be staffed and worked on either currently, or very soon, ideally in time for the next release.

You can:

Confirm that this issue is still relevant with /triage accepted (org members only)
Deprioritize it with /priority important-longterm or /priority backlog
Close this issue with /close

For more details on the triage process, see https://www.kubernetes.dev/docs/guide/issue-triage/

/remove-triage accepted

aroradaman · 2023-06-17T12:40:21Z

/triage accepted

k8s-triage-robot · 2024-01-19T01:00:47Z

This issue is labeled with priority/important-soon but has not been updated in over 90 days, and should be re-triaged.
Important-soon issues must be staffed and worked on either currently, or very soon, ideally in time for the next release.

You can:

Confirm that this issue is still relevant with /triage accepted (org members only)
Deprioritize it with /priority important-longterm or /priority backlog
Close this issue with /close

For more details on the triage process, see https://www.kubernetes.dev/docs/guide/issue-triage/

/remove-triage accepted

danwinship · 2024-01-19T19:28:31Z

/triage accepted

danwinship · 2024-04-13T21:20:19Z

Maybe we want one metric with separate values (like how the SyncProxyRulesNoLocalEndpointsTotal metric has separate Local and Cluster values). Maybe we want to just combine the IPv4 and IPv6 values together. Maybe we want one of those for some metrics and the other for other metrics...

While looking at some of @aojea's nftables metrics, I realized that merging the v4 and v6 values together for things like NetworkProgrammingLatency means that our perfscale metrics are consistently off; they get a (roughly) 0 value merged in for IPv6 once every sync period. So we need to do something that will let us avoid that.

k8s-ci-robot added sig/network Categorizes an issue or PR as relevant to SIG Network. area/kube-proxy needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Mar 10, 2023

k8s-ci-robot added the priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. label Mar 13, 2023

k8s-ci-robot assigned aroradaman Mar 13, 2023

k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Mar 16, 2023

danwinship mentioned this issue May 17, 2023

clean up kube-proxy nodeport address handling (part 3) #116016

Merged

aroradaman mentioned this issue May 20, 2023

proxy healthz server for dualstack clusters #118146

Merged

k8s-ci-robot added needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. and removed triage/accepted Indicates an issue or PR is ready to be actively worked on. labels Jun 14, 2023

k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Jun 17, 2023

k8s-ci-robot added needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. and removed triage/accepted Indicates an issue or PR is ready to be actively worked on. labels Jan 19, 2024

k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Jan 19, 2024

aroradaman mentioned this issue Apr 14, 2024

Some kube-proxy metrics must be labeled per IP family #124305

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

dual-stack proxies share healthz and metrics, corrupting the results #116486

dual-stack proxies share healthz and metrics, corrupting the results #116486

danwinship commented Mar 10, 2023

aojea commented Mar 13, 2023

aroradaman commented Mar 13, 2023

aroradaman commented Mar 13, 2023

danwinship commented Mar 14, 2023

MikeZappa87 commented Mar 16, 2023

k8s-triage-robot commented Jun 14, 2023

aroradaman commented Jun 17, 2023

k8s-triage-robot commented Jan 19, 2024

danwinship commented Jan 19, 2024

danwinship commented Apr 13, 2024

dual-stack proxies share healthz and metrics, corrupting the results #116486

dual-stack proxies share healthz and metrics, corrupting the results #116486

Comments

danwinship commented Mar 10, 2023

aojea commented Mar 13, 2023

aroradaman commented Mar 13, 2023

aroradaman commented Mar 13, 2023

danwinship commented Mar 14, 2023

MikeZappa87 commented Mar 16, 2023

k8s-triage-robot commented Jun 14, 2023

aroradaman commented Jun 17, 2023

k8s-triage-robot commented Jan 19, 2024

danwinship commented Jan 19, 2024

danwinship commented Apr 13, 2024