dual-stack proxies share healthz and metrics, corrupting the results #116486
/priority important-soon
/assign

@aojea @danwinship
Following this convention, a suffix of "6" at the end of the "proxy" or "network" keyword could be used for the other relevant metrics. For the healthz server, maybe we can maintain a map keyed by IPFamily for the last-updated and oldest-pending-queued values, and return OK based on the values of that map and the network configuration.
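A minimal sketch of that map-based idea, assuming hypothetical names (`perFamilyHealth`, `QueuedUpdate`, `Updated`, `IsHealthy`) rather than the actual kube-proxy healthcheck API:

```go
package healthcheck

import (
	"sync"
	"time"

	v1 "k8s.io/api/core/v1"
)

type perFamilyHealth struct {
	mu            sync.Mutex
	lastUpdated   map[v1.IPFamily]time.Time // last successful sync, per family
	oldestPending map[v1.IPFamily]time.Time // oldest queued-but-unsynced update, per family
	healthTimeout time.Duration
}

func newPerFamilyHealth(timeout time.Duration) *perFamilyHealth {
	return &perFamilyHealth{
		lastUpdated:   map[v1.IPFamily]time.Time{},
		oldestPending: map[v1.IPFamily]time.Time{},
		healthTimeout: timeout,
	}
}

// QueuedUpdate records that the proxier for the given family has a pending update.
func (h *perFamilyHealth) QueuedUpdate(family v1.IPFamily) {
	h.mu.Lock()
	defer h.mu.Unlock()
	if h.oldestPending[family].IsZero() {
		h.oldestPending[family] = time.Now()
	}
}

// Updated records a successful sync for the given family, clearing only that
// family's pending marker.
func (h *perFamilyHealth) Updated(family v1.IPFamily) {
	h.mu.Lock()
	defer h.mu.Unlock()
	delete(h.oldestPending, family)
	h.lastUpdated[family] = time.Now()
}

// IsHealthy returns OK only if no family has had an update pending longer
// than healthTimeout, so a stuck IPv6 proxier can no longer be masked by a
// frequently-syncing IPv4 proxier.
func (h *perFamilyHealth) IsHealthy() bool {
	h.mu.Lock()
	defer h.mu.Unlock()
	for _, pending := range h.oldestPending {
		if time.Since(pending) > h.healthTimeout {
			return false
		}
	}
	return true
}
```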
I don't think we should have metrics with different names. (Then you'd need different alert rules for IPv4 and IPv6 clusters...) Maybe we want one metric with separate values (like how the …)
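A hypothetical sketch of that direction: one metric name with an `ip_family` label, so IPv4 and IPv6 clusters can share alert rules. The variable name, help text, and buckets here are illustrative, not the actual kube-proxy metric definitions:

```go
package metrics

import "k8s.io/component-base/metrics"

// SyncProxyRulesLatency keeps a single metric name but records IPv4 and
// IPv6 samples separately via an ip_family label. (Illustrative only.)
var SyncProxyRulesLatency = metrics.NewHistogramVec(
	&metrics.HistogramOpts{
		Subsystem: "kubeproxy",
		Name:      "sync_proxy_rules_duration_seconds",
		Help:      "SyncProxyRules latency in seconds, partitioned by IP family",
		Buckets:   metrics.ExponentialBuckets(0.001, 2, 15),
	},
	[]string{"ip_family"},
)
```

Each sub-proxy would then record under its own label value, e.g. `SyncProxyRulesLatency.WithLabelValues("IPv4").Observe(d.Seconds())`, and queries can still aggregate across families when the split doesn't matter.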
/triage accepted
While looking at some of @aojea's nftables metrics, I realized that merging the v4 and v6 values together for things like `NetworkProgrammingLatency` means that our perfscale metrics are consistently off; they get a (roughly) 0 value merged in for IPv6 once every sync period. So we need to do something that will let us avoid that.
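A self-contained toy (not kube-proxy code; the metric name and latency numbers are invented) showing how folding a near-zero IPv6 sample into the shared histogram skews the aggregate:

```go
package main

import (
	"fmt"

	"github.com/prometheus/client_golang/prometheus"
	dto "github.com/prometheus/client_model/go"
)

func main() {
	// One shared histogram, as the dual-stack proxy effectively has today.
	merged := prometheus.NewHistogram(prometheus.HistogramOpts{
		Name:    "network_programming_duration_seconds",
		Help:    "toy stand-in for NetworkProgrammingLatency",
		Buckets: prometheus.ExponentialBuckets(0.001, 2, 15),
	})

	for i := 0; i < 100; i++ {
		merged.Observe(2.0)   // IPv4 proxier programming real rules
		merged.Observe(0.001) // IPv6 proxier with nothing to do, once per sync period
	}

	m := &dto.Metric{}
	_ = merged.Write(m)
	avg := m.Histogram.GetSampleSum() / float64(m.Histogram.GetSampleCount())
	// Prints ~1.0005, even though every real IPv4 sync took 2 seconds.
	fmt.Printf("merged average: %.4fs\n", avg)
}
```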
In a dual-stack proxy there is only a single healthz server, and likewise none of the proxy metrics are differentiated by IP family. Thus, the IPv4 and IPv6 sub-proxies are both updating the same data.
For the healthz server, this means that the "becoming unhealthy" timer starts counting whenever either proxy queues an update, and gets reset whenever either proxy successfully syncs. (So, e.g., if every IPv6 sync fails, it could still be reporting healthy as long as IPv4 syncs are succeeding frequently enough.)
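A minimal sketch of that shared bookkeeping; the field and method names are illustrative, not the actual pkg/proxy/healthcheck types. Both sub-proxies call into the same instance, so either family can reset the "becoming unhealthy" timer for both:

```go
package healthcheck

import (
	"sync"
	"time"
)

type sharedProxierHealth struct {
	mu            sync.Mutex
	lastUpdated   time.Time // reset by EITHER family's successful sync
	oldestPending time.Time // set by EITHER family's queued update
	healthTimeout time.Duration
}

func (h *sharedProxierHealth) QueuedUpdate() {
	h.mu.Lock()
	defer h.mu.Unlock()
	if h.oldestPending.IsZero() {
		h.oldestPending = time.Now()
	}
}

func (h *sharedProxierHealth) Updated() {
	h.mu.Lock()
	defer h.mu.Unlock()
	// A successful IPv4 sync clears a pending IPv6 update (and vice versa):
	// this is why a permanently failing IPv6 proxier can still look healthy.
	h.oldestPending = time.Time{}
	h.lastUpdated = time.Now()
}

func (h *sharedProxierHealth) IsHealthy() bool {
	h.mu.Lock()
	defer h.mu.Unlock()
	return h.oldestPending.IsZero() || time.Since(h.oldestPending) < h.healthTimeout
}
```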
For metrics:

- `SyncProxyRulesLastQueuedTimestamp` and `SyncProxyRulesLastTimestamp` are broken in the same way as the health check; at any given time it's possible that one of them was set by the IPv4 proxy and the other was set by the IPv6 proxy. (A toy demonstration of this clobbering appears at the end of this report.)
- `IptablesRulesTotal` and `SyncProxyRulesNoLocalEndpointsTotal` report the most-recently-observed value for either IPv4 or IPv6, whichever synced last.
- `ServiceChangesPending` and `EndpointChangesPending` report the most-recently-observed value for either IPv4 or IPv6, whichever synced or processed an event last.
- `SyncProxyRulesLatency` will be based on the duration of both IPv4 `syncProxyRules` calls and IPv6 `syncProxyRules` calls, which will skew the results if you have either way more IPv4 Services than IPv6 ones, or vice versa.
- `NetworkProgrammingLatency` will include both IPv4 programming latency and IPv6 programming latency, which could potentially be very different.
- `IptablesRestoreFailuresTotal` and `IptablesPartialRestoreFailuresTotal` correctly report the total number of IPv4 and IPv6 failures combined.
- `ServiceChangesTotal` and `EndpointChangesTotal` correctly report the total number of IPv4 and IPv6 changes combined.

/sig network
/area kube-proxy
cc @khenidak
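A self-contained toy (not kube-proxy code) showing the timestamp-gauge clobbering described in the first bullet above:

```go
package main

import (
	"fmt"

	"github.com/prometheus/client_golang/prometheus"
	dto "github.com/prometheus/client_model/go"
)

func main() {
	// One shared, unlabeled gauge, as with SyncProxyRulesLastTimestamp today.
	lastSync := prometheus.NewGauge(prometheus.GaugeOpts{
		Name: "sync_proxy_rules_last_timestamp_seconds",
		Help: "toy stand-in for the shared timestamp gauge",
	})

	lastSync.Set(1000) // IPv4 proxier syncs at t=1000
	lastSync.Set(2000) // IPv6 proxier syncs at t=2000, overwriting IPv4's value

	m := &dto.Metric{}
	_ = lastSync.Write(m)
	// Prints 2000: a reader can no longer tell when (or whether) IPv4 last synced.
	fmt.Println(m.Gauge.GetValue())
}
```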