
RFE: PromQL primitives for sampling #4172

Closed
xginn8 opened this Issue May 17, 2018 · 5 comments


xginn8 commented May 17, 2018

Proposal

When collecting data from large populations, it is often useful to sample from the population rather than aggregating into a single metric. Sampling is more resource-friendly and often more descriptive of the actual population distribution, since the data can remain full-fidelity without much performance impact.

Popular TSDBs provide a similar function:
InfluxDB: https://docs.influxdata.com/influxdb/v1.5/query_language/functions/#sample
CrateDB: https://crate.io/docs/crate/reference/en/latest/general/builtins/scalar.html#random

If the maintainers are amenable to this query function, I'm happy to take a stab at implementing it.

brian-brazil commented May 17, 2018

I don't see how this would work sanely semantically, nor how it would be a real performance improvement, as we'd still need to pull in all the data.


xginn8 commented May 18, 2018

I think there are two things to consider here:

  1. Is the sampling utility useful and novel?
    To me the answer is unequivocally yes: sampling full-fidelity data is in many cases more useful for understanding the population distribution than either aggregated metrics or topk()/bottomk(). If there's already something in place, even better! The point is that even without a performance improvement, this function would be useful in itself.

  2. Is there a performance improvement to querying data this way, compared to aggregation?
    Say, for example, I'm querying something like sample(N, up[1h]). I defer to you on the implementation since I haven't spent much time in the source, but algorithmically you could do something like:

  1. generate N random times within the range
  2. for each time generated, perform an instant query at that instant for a single metric
  3. look up and return matching timeseries over the window at full-fidelity
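The random-time step above could be sketched as follows. This is purely illustrative pseudocode for the proposal, not Prometheus internals; the function name and window bounds are hypothetical.

```python
import random

def random_times(start: float, end: float, n: int) -> list:
    """Pick n uniformly random timestamps within [start, end].
    Each timestamp would then drive one instant query (step 2 above)."""
    return sorted(random.uniform(start, end) for _ in range(n))

# Example: 5 random instants within a one-hour window starting at t=0.
times = random_times(0.0, 3600.0, 5)
```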

InfluxDB's implementation uses reservoir sampling to achieve O(n) performance for their sample function. Maybe I'm misunderstanding something fundamental about how the query engine performs a lookup, such that sampling random metrics is prohibitively computationally expensive?
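For reference, reservoir sampling (Vitter's Algorithm R) maintains a uniform k-item sample of a stream in a single O(n) pass with O(k) memory, which is what makes it attractive for a query engine. A minimal sketch:

```python
import random

def reservoir_sample(stream, k):
    """Algorithm R: return a uniform random sample of k items
    from an iterable of unknown length, in one pass."""
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Replace an existing item with probability k/(i+1).
            j = random.randint(0, i)
            if j < k:
                reservoir[j] = item
    return reservoir
```

If the stream has fewer than k items, the whole stream is returned, which matches the usual semantics of a `sample(k, ...)` function.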

brian-brazil commented May 18, 2018

Even if you could figure out how to do it sanely from a statistics standpoint, there's no point in random sampling performance-wise, as you could do any aggregation you want for the same (or, more likely, fewer) resources.

Is there a particular use case you have in mind?

generate N random times within the range

If you want to do that you can do it on top of our existing APIs.
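Doing this on top of the existing APIs would mean issuing instant queries at random evaluation times via Prometheus' HTTP API, which accepts a `time` parameter on `/api/v1/query`. A sketch that only builds the request URLs (the base URL, metric, and window bounds are placeholder values; a real client would fetch each URL and merge the results):

```python
import random
from urllib.parse import urlencode

def sample_query_urls(base_url, promql, start, end, n):
    """Build n instant-query URLs against Prometheus' HTTP API,
    each evaluated at a random timestamp within [start, end]."""
    urls = []
    for _ in range(n):
        t = random.uniform(start, end)
        qs = urlencode({"query": promql, "time": "%.3f" % t})
        urls.append(base_url + "/api/v1/query?" + qs)
    return urls

# Hypothetical usage: 3 random instants across a one-hour window.
urls = sample_query_urls("http://localhost:9090", "up",
                         1526515200, 1526518800, 3)
```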


brian-brazil commented Jun 13, 2018

If you can explain your exact use case we can see if it makes sense, but I'm going to close for now as I can't think of one.

lock bot commented Mar 22, 2019

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@lock lock bot locked and limited conversation to collaborators Mar 22, 2019
