
Add option for metrics sampling #16

Open

raxod502-plaid opened this issue Mar 8, 2022 · 9 comments

@raxod502-plaid

We've noticed that this plugin sends metrics to Prometheus sampled every 50ms. This results in an enormous volume of ingested data, which overloads our system. All of our other services sample metrics at roughly 30s intervals. Even 1s would be fine, but 50ms is complete overkill, since that level of detail will never show up in query results or a dashboard.

We've worked around the problem in our fork of xk6-output-prometheus-remote by applying the following diff:

--- a/kube/charts/k6/xk6-output-prometheus-remote/pkg/remotewrite/remotewrite.go
+++ b/kube/charts/k6/xk6-output-prometheus-remote/pkg/remotewrite/remotewrite.go
@@ -131,6 +131,8 @@ func (o *Output) flush() {
 func (o *Output) convertToTimeSeries(samplesContainers []stats.SampleContainer) []prompb.TimeSeries {
        promTimeSeries := make([]prompb.TimeSeries, 0)

+       seen := map[string]bool{}
+
        for _, samplesContainer := range samplesContainers {
                samples := samplesContainer.GetSamples()

@@ -148,9 +150,17 @@ func (o *Output) convertToTimeSeries(samplesContainers []stats.SampleContainer)

                        if newts, err := o.metrics.transform(o.mapping, sample, labels); err != nil {
                                o.logger.Error(err)
-                       } else {
+                       } else if !seen[sample.Metric.Name] {
                                promTimeSeries = append(promTimeSeries, newts...)
                        }
+
+                       // We only need 1 sample per metric per remote
+                       // write, not one every 50ms(!!). Can't do
+                       // this earlier in the function because
+                       // counters have to be computed using all
+                       // incoming samples. We just refrain from
+                       // sending the final value multiple times.
+                       seen[sample.Metric.Name] = true
                }

                // Do not blow up if remote endpoint is overloaded and responds too slowly.

This is clearly a hack; it would be better to set a specific time interval and do the sampling intelligently, rather than hardcoding it to once per remote write. Also, this implementation sends the oldest datapoint per interval rather than the most recent, which is probably what is actually desired. But Prometheus simply can't handle the hundreds of thousands of time series generated by a 50ms sample interval, so we needed to do something to get the plugin working.
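For illustration, a variant of the same hack that keeps the most recent value would overwrite per metric and append once at the end, roughly like this (an untested fragment of the same loop in convertToTimeSeries, reusing the names from the diff above):

        // Fragment only: collect the newest time series per metric name, so
        // each remote write sends the latest value rather than the first one seen.
        latest := map[string][]prompb.TimeSeries{}

        for _, sample := range samples {
                // labels built and transform called exactly as in the diff above ...
                if newts, err := o.metrics.transform(o.mapping, sample, labels); err != nil {
                        o.logger.Error(err)
                } else {
                        latest[sample.Metric.Name] = newts // later samples overwrite earlier ones
                }
        }

        for _, newts := range latest {
                promTimeSeries = append(promTimeSeries, newts...)
        }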

Would a change to make the sampling interval customizable be accepted upstream?

@yorugac
Collaborator

yorugac commented Mar 10, 2022

Hi @raxod502-plaid thanks for opening this issue.

Yes, you're right; 50ms is hard-coded in k6 itself and this extension just re-sends everything. This issue is known and somewhat implied in #2. The main problem is that meaningful work in this extension is currently blocked by plans for metrics refactoring in k6. But given the unclear timeline for that refactoring, adding some workaround here would perhaps be a good idea.

Personally I'd prefer if we had a slightly smarter aggregation, even for a workaround... Would you be willing to add a simple configuration option together with this change? E.g. K6_PROMETHEUS_AGGREGATION_PERIOD option: if this period is > 0 then use the logic of sending one sample per batch; otherwise, send everything. So this behavior can be switched on and off in configuration. WDYT?
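Something roughly like this is what I have in mind for reading the option (just a sketch; the exact name, default, and where it hooks into the extension's config parsing are all up for discussion):

package remotewrite

import (
        "os"
        "time"
)

// aggregationPeriod is a sketch of how K6_PROMETHEUS_AGGREGATION_PERIOD could
// be read. A zero return value means "send every sample as today"; a value
// greater than zero enables the one-sample-per-metric-per-batch logic.
func aggregationPeriod() (time.Duration, error) {
        raw := os.Getenv("K6_PROMETHEUS_AGGREGATION_PERIOD")
        if raw == "" {
                return 0, nil // option unset: keep current behavior
        }
        return time.ParseDuration(raw) // e.g. "1s" or "30s"
}

convertToTimeSeries would then only apply the seen-map logic from your diff when the returned period is greater than zero.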

@raxod502-plaid
Author

I can work on that. One concern I have, though: after implementing the workaround above, we see the following memory utilization pattern for k6:

[image: k6 memory utilization graph]

With this plugin disabled, the memory usage is stable at only a few GB.

Do you know what's up with that? Presumably, metrics are somehow not getting freed, or the amount of data churn is so large the GC cannot keep up.

@yorugac
Collaborator

yorugac commented Mar 17, 2022

This is interesting. Going by the snippet you posted before, the logic could definitely be improved by moving the seen check to the beginning of the loop. My current assumption is that Go has an optimization for slices that are appended to, so the function ends up using more memory when newts is computed but not appended, as in your approach. In other words, I'd try moving the check to the top of the loop and see how it performs then.
But there could be something else at play as well, so I suggest making a PR so we can discuss it properly and catch any additional caveats.
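Roughly this reordering is what I mean (untested, and it would still need checking against your note about counters needing every sample to go through transform):

        for _, sample := range samples {
                // Skip already-seen metrics before doing any work, so we never
                // build time series that just get thrown away.
                if seen[sample.Metric.Name] {
                        continue
                }
                seen[sample.Metric.Name] = true

                // labels built from the sample as in the existing code ...
                if newts, err := o.metrics.transform(o.mapping, sample, labels); err != nil {
                        o.logger.Error(err)
                } else {
                        promTimeSeries = append(promTimeSeries, newts...)
                }
        }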

@raxod502-plaid
Author

Okay - I've been working on this integration on and off for many weeks now, so I'm going to take a break, but will make a note of this internally for when the work is picked back up.

@daniel-maganto

Any news on this issue?
Many thanks

@raxod502-plaid
Author

Nothing on my side. The project has been deprioritized for now, unfortunately. I still think it is a needed feature, though.

@raxod502-plaid
Author

Not sure why GitHub auto-closed that. Reopened.

@raxod502-plaid
Author

Wtf.

@codebien
Contributor

The major issues with this have been resolved: the output now aggregates all the samples of the same series seen during the same configured push interval. I'm keeping this open mostly to track feedback on the case of configuring an aggregation interval that differs from the push interval.
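Conceptually, the aggregation works something like the following sketch (illustrative only, not the output's actual types or code): per series, raw samples are folded into a small aggregate between pushes, and each push flushes one aggregated entry per series.

package remotewrite

// seriesAggregate is an illustrative stand-in for the per-series state kept
// between pushes; the real implementation differs.
type seriesAggregate struct {
        count    int
        sum      float64
        min, max float64
        last     float64
}

// add folds one raw sample value into the aggregate.
func (a *seriesAggregate) add(v float64) {
        if a.count == 0 || v < a.min {
                a.min = v
        }
        if a.count == 0 || v > a.max {
                a.max = v
        }
        a.sum += v
        a.last = v
        a.count++
}

// flushAggregates drains the accumulated state at a push interval, so the
// next interval starts from scratch.
func flushAggregates(aggs map[string]*seriesAggregate) map[string]seriesAggregate {
        out := make(map[string]seriesAggregate, len(aggs))
        for series, agg := range aggs {
                out[series] = *agg
                delete(aggs, series)
        }
        return out
}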
