a way to specify the number of datapoints to return over a given time interval #188
When plotting a graph (on a remote client), ideally the data would have as many data points defined as the display resolution.
While at present data can be fetched by time range (start/end) and aggregated over a given time interval (s, m etc.), this can give less than ideal rendering in a graph due to the number of data points not exactly matching the display resolution.
In some cases it's possible to compute the required aggregation interval so that the requested range produces exactly the number of points needed. E.g. if we have a time range of 1 hour and want 240 points for display, the aggregation interval would be 15 s. But my concern is that in some cases rounding errors may mean we get one data point too few or too many. (Of course, this can be coded around by artificially increasing the number of points by 1 or 2 beyond what is required and discarding any extras, so there are always sufficient points for the graph.)
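The arithmetic above can be sketched as a one-liner (a client-side sketch; `exact_interval` is a hypothetical helper, not part of any API):

```python
def exact_interval(start, end, n):
    # Aggregation interval that divides [start, end) into exactly n buckets.
    return (end - start) / n

# 1 hour (3600 s) rendered as 240 points -> a 15 s group-by interval.
print(exact_interval(0, 3600, 240))  # 15.0
```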
I'm airing this to see what you think. Maybe I'm approaching this from the wrong end?
In pseudo-syntax, something like:
@m-mcgowan: your method for returning a fixed number of points would produce odd results I think. If you're not using a group by time interval to give you the approximate number of points you want, the question becomes which points do we drop to get you the desired number. I'm not sure what a desired behavior would be here. If you're talking about drawing a line graph, you don't necessarily need a data point per viewable pixel.
However, I do think that @FGRibreau's method should work. That way you would get a specific number of points and you could specify either a start or an end time to count from. You could specify both, but it's possible that your limit would be higher than the number of points that would be returned anyway.
Would that be an acceptable solution @m-mcgowan?
In my case I have more points than I need - thousands - so we don't need all of them. It would be possible to compute an appropriate time interval to reduce the number of points, but here the number of points matters more than the actual time interval, since the time interval may produce rounding errors.
If I wanted 60 points, I could set the time interval so that it will give a minimum of 60 points over the requested start/end range (taking into account rounding errors), and then discard any extra with 'limit 60'. That could be good enough, and would work in my case.
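That workaround can be sketched as follows, assuming the client does the aggregation itself for illustration (in practice the database would group, and the client would only apply the limit; `safe_interval` and `group_by_interval` are hypothetical names):

```python
def safe_interval(start, end, n, pad=2):
    # Choose an interval small enough to yield at least n buckets even
    # if the division rounds unfavourably, by asking for a couple extra.
    return (end - start) / (n + pad)

def group_by_interval(samples, start, interval, n):
    # Aggregate (time, value) samples into a mean per bucket, then keep
    # exactly n buckets -- the "limit 60" step described above.
    buckets = {}
    for t, v in samples:
        buckets.setdefault(int((t - start) // interval), []).append(v)
    means = [sum(vs) / len(vs) for _, vs in sorted(buckets.items())]
    return means[:n]

# 600 raw samples (10 Hz for 60 s) reduced to exactly 60 display points:
samples = [(t / 10, float(t)) for t in range(600)]
points = group_by_interval(samples, 0.0, safe_interval(0.0, 60.0, 60), 60)
print(len(points))  # 60
```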
But my hope was that the db engine could do a better job, by allowing the query to specify exactly what is wanted - rather than grouping over a time interval, the query specifies to group by a count. The db would then compute the appropriate time interval internally and use that to group by. Because of rounding errors the last time interval in the group may be slightly larger or smaller, but the db engine just includes it.
Alternatively, rather than applying a fixed time interval in succession - which accumulates precision errors (e.g. a time interval of 1/3 s repeated 30 times from 0 to 10 s probably doesn't end with the last interval falling exactly on 10 s, but on something like 9.9999 s) - the db could compute the time slot for each step from the total range, so the error is constant rather than accumulated with each interval.
E.g. request 30 points over 10 s. The first time interval has the range [0 × 10/30, 1 × 10/30), the second [1 × 10/30, 2 × 10/30), and so on, with each boundary computed directly from the total rather than by repeated addition.
In this way, the error for each time interval is constant (rather than accumulated), and the query is guaranteed to produce exactly the required number of points.
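The boundary computation described above can be sketched like this (`bucket_boundaries` is a hypothetical helper, not an existing API):

```python
def bucket_boundaries(start, end, n):
    """Compute n + 1 bucket boundaries, each derived from the total
    range rather than by repeatedly adding a fixed interval, so the
    rounding error is constant per boundary instead of accumulating
    across buckets."""
    span = end - start
    return [start + (k * span) / n for k in range(n + 1)]

# 30 points over 10 s: the last boundary lands exactly on 10.0,
# whereas repeatedly adding 10/30 may drift away from it.
bounds = bucket_boundaries(0.0, 10.0, 30)
print(bounds[0], bounds[-1])  # 0.0 10.0
```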
You asked about which points to return - this functions as an aggregate just like the regular time-interval group by: all points within each interval are processed by the aggregate function. Internally it's still a time interval; the interval is simply computed from the query's time range and the group count rather than from a specified duration.
hmm, that makes sense, but I'm not sure that it's something we'll get to for a little while. Mainly because this can be solved through just a few lines of code in the client with the caveat that you may have fewer points than you have pixels, which I think is ok.
However, I'll keep this issue open and assign it to 0.7.0 for now.
Keep in mind that splitting the time range into specific intervals and reducing the amount of data by interpolating on those intervals may not be the most accurate representation of your data.
Imagine a time series that has a more or less constant value with periodic spikes. If you just interpolate at specific intervals to reduce the number of datapoints, you will most likely see none of these spikes, or with a bit of luck some of them, or perhaps even only the spikes - giving an unreliable representation of the actual data.
A very basic but more accurate way to do this is to split the range into a number of intervals and return 2, 3 or 4 values per interval: the value at the beginning of the interval, the value at the end of the interval, the highest value and the lowest value, with the last 2 being optional because they may be identical to the first 2.
So imagine you would like to show data on a trend that is 600 pixels wide: you would request these so-called "plot values" with 200 or 300 intervals, which should give you about the right amount of data for your trend, with a focus on accuracy rather than on the amount of data.
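A minimal sketch of the "plot values" idea, assuming a time-sorted list of `(timestamp, value)` pairs (the name `plot_values` is only illustrative):

```python
def plot_values(samples, start, end, n):
    """Split [start, end) into n intervals and keep, per interval, the
    first, last, minimum and maximum values, so short-lived spikes
    survive the downsampling."""
    span = end - start
    out = []
    for k in range(n):
        lo, hi = start + k * span / n, start + (k + 1) * span / n
        window = [v for t, v in samples if lo <= t < hi]
        if window:
            out.append((window[0], window[-1], min(window), max(window)))
    return out

# Constant signal with one spike: the spike survives even at 4 intervals,
# where a plain mean or sampled interpolation could easily miss it.
samples = [(t, 100.0) for t in range(100)]
samples[37] = (37, 900.0)  # a single spike
print(max(top for *_, top in plot_values(samples, 0, 100, 4)))  # 900.0
```

The linear scan per interval is just for clarity; a real implementation would walk the sorted samples once.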