
Add option for metrics sampling #16

Open

raxod502-plaid opened this issue Mar 8, 2022 · 9 comments

@raxod502-plaid

We've noticed that this plugin sends metrics to Prometheus sampled every 50ms. This results in an enormous volume of ingested data, which overloads our system. All of our other services sample metrics at roughly 30s intervals. Even 1s would be fine, but 50ms is complete overkill, since that level of detail will never show up in query results or a dashboard.

We've worked around the problem in our fork of xk6-output-prometheus-remote by applying the following diff:

--- a/kube/charts/k6/xk6-output-prometheus-remote/pkg/remotewrite/remotewrite.go
+++ b/kube/charts/k6/xk6-output-prometheus-remote/pkg/remotewrite/remotewrite.go
@@ -131,6 +131,8 @@ func (o *Output) flush() {
 func (o *Output) convertToTimeSeries(samplesContainers []stats.SampleContainer) []prompb.TimeSeries {
        promTimeSeries := make([]prompb.TimeSeries, 0)

+       seen := map[string]bool{}
+
        for _, samplesContainer := range samplesContainers {
                samples := samplesContainer.GetSamples()

@@ -148,9 +150,17 @@ func (o *Output) convertToTimeSeries(samplesContainers []stats.SampleContainer)

                        if newts, err := o.metrics.transform(o.mapping, sample, labels); err != nil {
                                o.logger.Error(err)
-                       } else {
+                       } else if !seen[sample.Metric.Name] {
                                promTimeSeries = append(promTimeSeries, newts...)
                        }
+
+                       // We only need 1 sample per metric per remote
+                       // write, not one every 50ms(!!). Can't do
+                       // this earlier in the function because
+                       // counters have to be computed using all
+                       // incoming samples. We just refrain from
+                       // sending the final value multiple times.
+                       seen[sample.Metric.Name] = true
                }

                // Do not blow up if remote endpoint is overloaded and responds too slowly.

This is clearly a hack; it would be better to set a specific time interval and do the sampling intelligently, rather than hardcoding it to once per remote write. Also, this implementation sends the oldest datapoint per interval rather than the most recent, which is probably what is actually desired. But Prometheus simply can't handle the hundreds of thousands of time series generated by a 50ms sample interval, so we needed to do something to get the plugin working.
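For illustration, a variant of the same hack that keeps the most recent value would overwrite per metric and append once at the end, roughly like this (an untested fragment of the same loop in convertToTimeSeries, reusing the names from the diff above):

        // Fragment only: collect the newest time series per metric name, so
        // each remote write sends the latest value rather than the first one seen.
        latest := map[string][]prompb.TimeSeries{}

        for _, sample := range samples {
                // labels built and transform called exactly as in the diff above ...
                if newts, err := o.metrics.transform(o.mapping, sample, labels); err != nil {
                        o.logger.Error(err)
                } else {
                        latest[sample.Metric.Name] = newts // later samples overwrite earlier ones
                }
        }

        for _, newts := range latest {
                promTimeSeries = append(promTimeSeries, newts...)
        }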

Would a change to make the sampling interval customizable be accepted upstream?

@yorugac
Collaborator

yorugac commented Mar 10, 2022

Hi @raxod502-plaid thanks for opening this issue.

Yes, you're right; 50ms is hard-coded in k6 itself and this extension just re-sends everything. This issue is known and somewhat implied in #2. The main problem is that meaningful work in this extension is currently blocked by plans for metrics refactoring in k6. But given the unclear timeline for that refactoring, adding some workaround here would perhaps be a good idea.

Personally I'd prefer if we had a slightly smarter aggregation, even for a workaround... Would you be willing to add a simple configuration option together with this change? E.g. K6_PROMETHEUS_AGGREGATION_PERIOD option: if this period is > 0 then use the logic of sending one sample per batch; otherwise, send everything. So this behavior can be switched on and off in configuration. WDYT?
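Something roughly like this is what I have in mind for reading the option (just a sketch; the exact name, default, and where it hooks into the extension's config parsing are all up for discussion):

package remotewrite

import (
        "os"
        "time"
)

// aggregationPeriod is a sketch of how K6_PROMETHEUS_AGGREGATION_PERIOD could
// be read. A zero return value means "send every sample as today"; a value
// greater than zero enables the one-sample-per-metric-per-batch logic.
func aggregationPeriod() (time.Duration, error) {
        raw := os.Getenv("K6_PROMETHEUS_AGGREGATION_PERIOD")
        if raw == "" {
                return 0, nil // option unset: keep current behavior
        }
        return time.ParseDuration(raw) // e.g. "1s" or "30s"
}

convertToTimeSeries would then only apply the seen-map logic from your diff when the returned period is greater than zero.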

@raxod502-plaid
Author

I can work on that. One concern I have, though: after implementing the workaround above, we see the following memory utilization pattern for k6:

[image: k6 memory utilization graph]

With this plugin disabled, the memory usage is stable at only a few GB.

Do you know what's up with that? Presumably, metrics are somehow not getting freed, or the amount of data churn is so large the GC cannot keep up.

@yorugac
Collaborator

yorugac commented Mar 17, 2022

This is interesting. Going by the snippet you posted before, the logic could definitely be improved by moving the seen check to the beginning of the loop. My current assumption is that Go has an optimization for slices that are appended to, so the function ends up using more memory when newts is computed but not appended, as in your approach. In other words, I'd try moving the check to the top of the loop and see how it performs then.
But there could be something else at play as well, so I suggest making a PR so we can discuss it properly and catch any additional caveats.
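Roughly this reordering is what I mean (untested, and it would still need checking against your note about counters needing every sample to go through transform):

        for _, sample := range samples {
                // Skip already-seen metrics before doing any work, so we never
                // build time series that just get thrown away.
                if seen[sample.Metric.Name] {
                        continue
                }
                seen[sample.Metric.Name] = true

                // labels built from the sample as in the existing code ...
                if newts, err := o.metrics.transform(o.mapping, sample, labels); err != nil {
                        o.logger.Error(err)
                } else {
                        promTimeSeries = append(promTimeSeries, newts...)
                }
        }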

@raxod502-plaid
Author

Okay - I've been working on this integration on and off for many weeks now, so I'm going to take a break, but will make a note of this internally for when the work is picked back up.

@daniel-maganto

Any news on this issue?
Many thanks

@raxod502-plaid
Author

Nothing on my side. The project has been deprioritized for now, unfortunately. I still think it is a needed feature, though.

@raxod502-plaid
Author

Not sure why GitHub auto-closed that. Reopened.

@raxod502-plaid
Author

Wtf.

@codebien
Contributor

The major issues with this have been resolved: the output now aggregates all the samples of the same series seen during the same configured push interval. I'm keeping this open mostly to track feedback on the case of configuring an aggregation interval that differs from the push interval.
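Conceptually, the aggregation works something like the following sketch (illustrative only, not the output's actual types or code): per series, raw samples are folded into a small aggregate between pushes, and each push flushes one aggregated entry per series.

package remotewrite

// seriesAggregate is an illustrative stand-in for the per-series state kept
// between pushes; the real implementation differs.
type seriesAggregate struct {
        count    int
        sum      float64
        min, max float64
        last     float64
}

// add folds one raw sample value into the aggregate.
func (a *seriesAggregate) add(v float64) {
        if a.count == 0 || v < a.min {
                a.min = v
        }
        if a.count == 0 || v > a.max {
                a.max = v
        }
        a.sum += v
        a.last = v
        a.count++
}

// flushAggregates drains the accumulated state at a push interval, so the
// next interval starts from scratch.
func flushAggregates(aggs map[string]*seriesAggregate) map[string]seriesAggregate {
        out := make(map[string]seriesAggregate, len(aggs))
        for series, agg := range aggs {
                out[series] = *agg
                delete(aggs, series)
        }
        return out
}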
