Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

a way to specify the number of datapoints to return over a given time interval #188

Closed
m-mcgowan opened this issue Jan 18, 2014 · 7 comments

Comments

@m-mcgowan
Copy link

commented Jan 18, 2014

When plotting a graph (on a remote client), ideally the data would have as many data points defined as the display resolution.

While at present data can be fetched by time range (start/end) and aggregated over a given time interval (s, m etc.), this can give less than ideal rendering in a graph due to the number of data points not exactly matching the display resolution.

In some cases it's possible to compute the required aggregation interval so that the range requested ends up producing exactly the number of points needed. E.g. if we have a time range of 1 hour and want to get 240 points for display, then the aggregation interval would be 0.25s. But my concern is that in some cases rounding errors may mean we get one data point too few or too many. (Of course, this can be coded around by artificially increasing the number of points 1 or 2 beyond what is required and discarding any extra. so there is always sufficient points for the graph.)

I'm airing this to see what you think. Maybe I'm approaching this from the wrong end?

In psudo-syntax, something like:

select avg(value) from events group by count(60) where time>start && time<end

@FGRibreau

This comment has been minimized.

Copy link

commented Jan 18, 2014

👍 as I have the same kind of issue. Howerver the following request should returns the |ast 60 points with a precision of 10m ending at end.

select avg(value) from events group by time(10m) WHERE time<=end LIMIT 60
@pauldix

This comment has been minimized.

Copy link
Member

commented Jan 18, 2014

@m-mcgowan: your method for returning a fixed number of points would produce odd results I think. If you're not using a group by time interval to give you the approximate number of points you want, the question becomes which points do we drop to get you the desired number. I'm not sure what a desired behavior would be here. If you're talking about drawing a line graph, you don't necessarily need a data point per viewable pixel.

However, I do think that @FGRibreau's method should work. That way you would get a specific number of points and you could specify either a start or an end time to count from. You could specify both, but it's possible that your limit would be higher than the number of points that would be returned anyway.

Would that be an acceptable solution @m-mcgowan?

@m-mcgowan

This comment has been minimized.

Copy link
Author

commented Jan 18, 2014

In my case I have more points that I need - thousands, so we don't need all of them. It would be possible to compute an appropriate time interval to reduce the number of points, but in this case the number of points is more important that the actual time interval, since the time interval may produce rounding errors.

If I wanted 60 points, I could set the time interval so that it will give a minimum of 60 points over the requested start/end range (taking into account rounding errors), and then discard any extra with 'limit 60'. That could be good enough, and would work in my case.

But my hope was that the db engine could do a better job, by allowing the query to specify exactly what is wanted - rather than grouping over a time interval, the query specifies to group by a count. The db would then compute the appropriate time interval internally and use that to group by. Because of rounding errors the last time interval in the group may be slightly larger or smaller, but the db engine just includes it.

Alternatively, rather than using a fixed time interval, which is then applied in succession, which gives eventual errors due to the precision (e.g. a time interval of 1/3 repeated 30 times from 0 to 30s probably doesn't end up with the last interval falling exaclty on 10s, but probably 9.9999 due to accumulated precision errors). To avoid this the db could compute the time slots for each step based on the total, so the error is constant, rather than accumulated with each interval.

E.g. request 30 points over 10s. The first time interval has the range

[ 0*(end-start)/30, 1*(end-start)/30 ),

the 2nd

[1*(end-start)/30, 2*(end-start)/30 ) etc...

In this way, the error for the time interval is constant (rather than accumulated) the query is guaranteed to produce exactly the required number of points.

You asked about which points to return - this functions as an aggregate just like the regular time-interval group by, all points within the range are processed by the aggregate function. Internally it's still a time interval, just the interval is computed based on the query time interval range and group count rather than a specified duration.

@pauldix

This comment has been minimized.

Copy link
Member

commented Jan 28, 2014

hmm, that makes sense, but I'm not sure that it's something we'll get to for a little while. Mainly because this can be solved through just a few lines of code in the client with the caveat that you may have fewer points than you have pixels, which I think is ok.

However, I'll keep this issue open and assign it to 0.7.0 for now.

@pspeybro

This comment has been minimized.

Copy link

commented Apr 22, 2014

Keep in mind that splitting the time range in specific intervals and reduce the amount of data by interpolating on those intervals may not be the most accurate representation of your data.

Imagine a timeseries that has a more or less constant value with periodic spikes. If you just interpolate at specific intervals to reduce the number of datapoints, you will most likely see none of these spikes or with a bit of luck some of the spikes or perhaps even only the spikes causing an unreliable representation of the actual data.

A very basic but more accurate way to do this is to split the range in a number of intervals and per interval return 2, 3 or 4 values being: the value at the beginning of the interval, the value at the end of the interval, the highest value, the lowest value, with the last 2 being optional because they may be identical to the first 2.

So imagine you would like to show data on a trend that is 600 pixels wide, you would request so called "plot values" with 200 or 300 intervals which should give you about the right amount of data for your trend with a focus on accuracy instead of focus on amount of data.

@jvshahid jvshahid modified the milestones: 0.8.0, 0.7.0 May 2, 2014

@jvshahid jvshahid modified the milestones: 0.8.0, Next release Aug 25, 2014

@jvshahid

This comment has been minimized.

Copy link
Contributor

commented Aug 29, 2014

Closing this issue in favor of #230

@jvshahid jvshahid closed this Aug 29, 2014

@jvshahid jvshahid removed this from the Next release milestone Aug 29, 2014

@mitar

This comment has been minimized.

Copy link

commented Apr 21, 2017

This was closed, but then @230 was closed. Can we reopen then this back?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
6 participants
You can’t perform that action at this time.