kube-proxy: change buckets used by NetworkProgrammingLatency #80218
Conversation
refs kubernetes/perf-tests#640
We have too fine bucket granularity for lower latencies, at the cost of the higher latencies (7+ minutes). This is causing spikes in the SLI calculated based on that metric. I don't have a strong opinion about the actual values - these seemed to better match our needs. But let's have a discussion about them. Values: 0.015 s, 0.030 s, 0.060 s, 0.120 s, 0.240 s, 0.480 s, 0.960 s, 1.920 s, 3.840 s, 7.680 s, 15.360 s, 30.720 s, 61.440 s, 122.880 s, 245.760 s, 491.520 s, 983.040 s, 1966.080 s, 3932.160 s, 7864.320 s
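For reference, a minimal sketch of how the values listed above can be reproduced with the Prometheus Go client (client_golang); this is illustrative only, not part of the PR:

```go
package main

import (
	"fmt"

	"github.com/prometheus/client_golang/prometheus"
)

func main() {
	// 20 exponential bucket boundaries starting at 15ms with a factor of 2:
	// 0.015, 0.030, 0.060, ..., 7864.320 seconds (the histogram adds an implicit +Inf bucket).
	for _, upperBound := range prometheus.ExponentialBuckets(0.015, 2, 20) {
		fmt.Printf("%.3f s\n", upperBound)
	}
}
```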
/hold
/hold cancel
If we want even less granularity for lower latencies, using 0.02 as a start would give us those buckets: 0.020 s
pkg/proxy/metrics/metrics.go
 // TODO(mm4tt): Reevaluate buckets before 1.14 release.
 // The last bucket will be [0.001s*2^20 ~= 17min, +inf)
-Buckets: prometheus.ExponentialBuckets(0.001, 2, 20),
+Buckets: prometheus.ExponentialBuckets(0.015, 2, 20),
A couple of thoughts from me:
- For the purpose of this SLI, I don't think anything smaller than 100ms is what we really care about (I don't think we will see a non-negligible number of values lower than 100ms). So I would just start with 100ms. I'm wondering if even 100ms is something we should really care about. Maybe 250ms is also enough?
- O(7800s) is also way too high to be something we care about - if programming the network takes more than 2 hours, it's unacceptable by far. I don't think we care about anything bigger than a few minutes - then it's obviously wrong anyway and I don't think it matters whether it was 20m or 2h.
- A factor of 2 is quite significant - I would suggest using something smaller than 2.
So my first suggestion would be:
(0.1, 1.5, 20) [that translates to a (100ms -> ~221s) interval]
Though ideally, I would like an even smaller step than 1.5 (in many cases there is a difference between, say, 6s and 8.5s, which would be treated equally in this case).
I also think that the SLO we will be targeting will be a value smaller than 1m (and it will probably be a somewhat round number like 30s).
So my second proposal is more custom buckets, like we did for api-call latencies:
https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/apiserver/pkg/endpoints/metrics/metrics.go#L88
Here probably ranging from [0.1 or maybe even 0.25, to O(1-2m)]
So maybe something like:
[0.25, 0.5, 1, 2, 3, 4, 5, 7.5, 10, 12.5, 15, 20, 25, 30, 35, 40, 45, 50, 60, 90]
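For illustration only, a hedged sketch of how such hand-picked buckets would be wired into a Prometheus histogram; the metric name and help string below are placeholders, not the actual kube-proxy definitions:

```go
package metricsexample

import "github.com/prometheus/client_golang/prometheus"

// Sketch: explicit, hand-picked buckets in the spirit of the api-call latency metric.
var exampleNetworkProgrammingLatency = prometheus.NewHistogram(prometheus.HistogramOpts{
	Name: "example_network_programming_duration_seconds", // placeholder name
	Help: "Example histogram with custom, hand-picked buckets.",
	Buckets: []float64{
		0.25, 0.5, 1, 2, 3, 4, 5, 7.5, 10, 12.5, 15,
		20, 25, 30, 35, 40, 45, 50, 60, 90,
	},
})
```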
Thanks for your insights, Wojtek, especially around numbers.
I gave it a second thought and I agree with you and @mm4tt that exponential buckets with a factor of 2 don't provide much value. For this particular case exponential buckets might not make sense at all. As you mentioned, we don't really care about values below XXXms, and I agree it doesn't matter whether the value is 20m or 2h. However, I think we should have some buckets for values on the order of minutes. In the few samples collected so far there were values above 4 minutes. From a debugging perspective it'd be good to distinguish between broken programming (~minutes) and a broken metric/measurement (~hours).
Going with (0.1, 1.5, 20) or using the same buckets as api-call latency doesn't actually solve the problem that sparked this PR. Neither provides enough resolution for minute-scale latencies. What about adding a few more thresholds at the end, like 120, 180, 240, 360?
You mean after what I suggested:
[0.25, 0.5, 1, 2, 3, 4, 5, 7.5, 10, 12.5, 15, 20, 25, 30, 35, 40, 45, 50, 60, 90]
?
That sounds reasonable to me. The question is whether changing the number of buckets is a breaking change. Do you know?
Yes, I meant adding 120, 180, 240, 360 to what you propose. So in the end buckets would look like:
[0.25, 0.5, 1, 2, 3, 4, 5, 7.5, 10, 12.5, 15, 20, 25, 30, 35, 40, 45, 50, 60, 90, 120, 180, 240, 360]
Sorry for not being precise.
This is a breaking change in a similar way to changing the specification of the buckets. If someone queries specific buckets or labels, or has assumptions about their number baked in, changing this will break their code. However, for usual queries over Prometheus histograms this shouldn't have any effect (e.g. the current implementation of the network programming latency SLI).
Reading through the conversation above, is the suggestion to change from exponential to linear buckets? Also, as I understand it, there is really very little cost to having lots of buckets for a global metric like this...
What I'm suggesting isn't even linear - it's somewhat custom.
> Also, as I understand it, there is really very little cost to having lots of buckets for a global metric like this...
So you're suggesting making even more buckets?
I remember a discussion with Prometheus folks where they said that having more than 30 buckets would cause issues, but now that I recall it, that was for a metric with multiple labels with many potential values.
Yeah, for this counter, as I understand it, it would just be an array of N doubles, and the update operation just a handful of increments, so we might as well have 100 values to get the granularity we need.
Yeah - that sounds reasonable to me (given that we don't have any metrics on that label).
Thinking about it more, I actually think the values in large clusters may exceed 30s, because of long calls to iptables - we've seen those take 20-30s in the past.
I'm still for something a bit custom though, so let's say:
Buckets: []float64{prometheus.LinearBuckets(0.25, 0.25, 20)..., prometheus.LinearBuckets(6, 1, 55)..., prometheus.LinearBuckets(65, 5, 12)...}
(roughly, requires checking what exactly it will generate).
But it would give us really good granularity.
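As a side note, a Go composite literal can spread only one slice, so combining several LinearBuckets ranges like the rough line above needs a small concatenation helper. A hedged sketch (the mergeBuckets helper is illustrative, not code from this PR):

```go
package metricsexample

import "github.com/prometheus/client_golang/prometheus"

// mergeBuckets concatenates several ascending bucket slices into a single slice.
func mergeBuckets(slices ...[]float64) []float64 {
	var result []float64
	for _, s := range slices {
		result = append(result, s...)
	}
	return result
}

// Roughly what the suggestion above expands to:
// 0.25s-5s in 0.25s steps, 6s-60s in 1s steps, 65s-120s in 5s steps.
var suggestedBuckets = mergeBuckets(
	prometheus.LinearBuckets(0.25, 0.25, 20), // 0.25, 0.50, ..., 5.00
	prometheus.LinearBuckets(6, 1, 55),       // 6, 7, ..., 60
	prometheus.LinearBuckets(65, 5, 12),      // 65, 70, ..., 120
)
```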
I've slightly updated what you proposed, Wojtek. I removed some lower values and added more higher ones. While I don't expect we'd ever use them in the regular operation of k8s, they are useful for debugging problems around the metric itself.
prometheus.LinearBuckets(1, 1, 59),   // 1s, 2s, 3s, ... 59s
prometheus.LinearBuckets(60, 5, 12),  // 60s, 65s, 70s, ... 115s
prometheus.LinearBuckets(120, 30, 7), // 2min, 2.5min, 3min, ..., 5min
),
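For context, a hedged sketch of how a latency value would typically be observed into a histogram with buckets like the ones above; the variable and function names here are illustrative placeholders, not kube-proxy's actual code:

```go
package metricsexample

import (
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

// exampleLatencyHistogram stands in for the real NetworkProgrammingLatency metric;
// only a subset of the discussed buckets is shown here.
var exampleLatencyHistogram = prometheus.NewHistogram(prometheus.HistogramOpts{
	Name:    "example_network_programming_duration_seconds", // placeholder
	Help:    "Example histogram; buckets abbreviated for brevity.",
	Buckets: prometheus.LinearBuckets(1, 1, 59), // 1s, 2s, ..., 59s
})

// recordLatency observes the time elapsed since a change was first seen
// until the network rules were programmed.
func recordLatency(changeSeen time.Time) {
	exampleLatencyHistogram.Observe(time.Since(changeSeen).Seconds())
}
```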
I'm fine with this.
@bowei - WDYT?
/priority important-soon
/lgtm At some point we will hopefully need more granularity at the lower end, but one can dream :-)
I would love to :) but I'm afraid we have a bunch of work to do before then... /approve
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: oxddr, wojtek-t
The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
/retest Review the full test history for this PR.
11 similar comments
What type of PR is this?
/kind cleanup
What this PR does / why we need it:
We have too fine bucket granularity for lower latencies, at the cost of the higher latencies (7+ minutes). This is causing spikes in the SLI calculated based on that metric.
refs kubernetes/perf-tests#640
Special notes for your reviewer:
Does this PR introduce a user-facing change?:
/sig scalability