Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

POC: add aggregate funciton top and bottom #409 #551

Closed

Conversation

chobie
Copy link
Contributor

@chobie chobie commented May 17, 2014

Updated: 20 May 2014

This patch provides aggregate function top and bottom. these function are useful when executing analytics or summarize queries like:

select top(elapsed_time, 10), url from access group by url;

Added functions:

  top(column, limit)
  bottom(column, limit)

select top(value, 10) from series will output 10 value values with descending order.
select bottom(value, 10) from series will output 10 value values with ascending order.
select top(value, 10), name from series group by name will output 10 value and name values with descending order (order key is 1st column. this doesn't support multiple sort).

supported column types are int64, float64 and string.

And, I'm not good at English and Golang. so if you find some wrong points, please feel free to comment it.

this patch provides aggregate function top and low. these function are
useful when executing analytics or summarize queries.

Added functions:
  low(columns..., limit)
  top(columns..., limit)

For instance:
  `select low(value, 10) from series` will output 10 `value` values with ascending order.
  `select top(value, 10) from series` will output 10 `value` values with descending order.

How to build:
  go get github.com/ryszard/goskiplist/skiplist
  go build src/daemon.go

Currently, these function support float64 and int64 types.

Limitations:
  * can't handle same value correctly (goskiplist limitation)

Misc:
  * each column should support alias.
    e.g) low(score as low_score, 10)

And, I'm not good at English and Golang. so if you find some wrong
points, plz feel free to fix it.
@pauldix
Copy link
Member

pauldix commented May 17, 2014

Thanks for the PR @chobie! Here are a few thoughts. I just added comments to #409 about a new way to structure the query. I think using having is cleaner. So there are a few things we'd need here. First would be some tests around it. I'd like to see the three cases I bulleted in #409 to have tests.

The nice thing about the having way of doing things is that it works cleanly for when you're doing counts on grouped by columns.

Then also have tests for what happens if a top query is run against a non-numerical column.

A comment on the implementation. I think it's overkill to use a skiplist. Most people are going to be asking for the top 10 or to 50 at most. The simplest method is to have an array of points, implement sorting methods to sort by the field. Should be easy by following this: https://github.com/influxdb/influxdb/blob/master/src/protocol/protocol_extensions.go#L125-L164

Pick the sort based on if you're doing top or bottom. Have the slice sized the same as the N in top. Then as each new point comes in, check vs the last element in the slice. If it's greater than, replace, resort.

Sorts on small slices are probably very efficient.

If it can be updated to work like those last two comments I left in #409 with tests added, we'd jump at taking this PR. Thanks again, hope that's not too much of a pain. Or let me know if there's something that you think would be cleaner or better.

@chobie chobie changed the title POC: add aggregate funciton top and low #409 POC: add aggregate funciton top and bottom #409 May 18, 2014
@chobie
Copy link
Contributor Author

chobie commented May 18, 2014

Added test cases and choose sort based implementation. Let me know If I missed something.

select mean(value), host from cpu_idle group by host having top(10, mean)  where time > now() - 10m

That's great. I prefer adding having clause (#183) to this patch.

The other potential problem is that when you're doing a top query over the series, you'd probably want the timestamps of the points that are actually at the top

This would be helpful for finding problems with top. For example: to show slow url with select top(100, elapsed, url, user_id) from access query. we could find problem easily if we have timestamp. but we can find the cause with querying multiple times. so it's trivial thing.

For now, my blocker was solved by this. I'm not in any rush though.

@chobie
Copy link
Contributor Author

chobie commented May 18, 2014

btw, What is the better way to do data_test test cases? go test integration takes several minutes.
I've commented out unwanted test cases.

@pauldix
Copy link
Member

pauldix commented May 18, 2014

reviewing now. On the data_test thing, the best place is in integration/data_test.go. Then you can run by doing make integration_test only=DataTest which should run quickly.

@pauldix
Copy link
Member

pauldix commented May 18, 2014

So this doesn't implement the having portion of the query I take it. What happens when you do these queries?

-- say you only have 3 data points total and you run this
select top(10, value) from some_series

--- what happens here?
select top(10, value), host from cpu_idle group by time(1h), host

It's not clear to me what the additional column arguments in the top function do. Do those just select columns to return? If so then it should be more like this:

select top(10, value), host from cpu_idle

That means you'll filter by the top value and get the host.

I also just realized that the percentile function has the number and column parameters reversed, so top should work like that too to keep things consistent. So:

select top(value, 10), host from cpu_idle

…or it.

top(column, limit)
bottom(column, limit)
@chobie
Copy link
Contributor Author

chobie commented May 19, 2014

I've changed top and bottom function arguments. Is my understanding correct?

I'll improve this when I back home.

We need to detect correct column offset when group by clause contains other fields.
otherwise we correct wrong field.
previous commit was just my wrong approach. basically don't need to keep whole columns.
@chobie
Copy link
Contributor Author

chobie commented May 19, 2014

updated. probably this will work fine. could you review again when you have a chance?

@jvshahid jvshahid closed this in d382ea0 May 19, 2014
@jvshahid jvshahid added this to the 0.6.5 milestone May 19, 2014
@tisba
Copy link

tisba commented May 20, 2014

That is an awesome feature! Thanks!

@tisba
Copy link

tisba commented May 20, 2014

Will the documentation be updated to reflect those new functions?

@pauldix
Copy link
Member

pauldix commented May 20, 2014

It will, I'm on it.

On Tue, May 20, 2014 at 10:16 AM, Sebastian Cohnen <notifications@github.com

wrote:

Will the documentation be updated to reflect those new functions?


Reply to this email directly or view it on GitHubhttps://github.com//pull/551#issuecomment-43631075
.

pauldix pushed a commit that referenced this pull request May 27, 2014
this patch provides aggregate function top and low. these function are
useful when executing analytics or summarize queries.

Added functions:
  low(columns..., limit)
  top(columns..., limit)

For instance:
  `select low(value, 10) from series` will output 10 `value` values with ascending order.
  `select top(value, 10) from series` will output 10 `value` values with descending order.

How to build:
  go get github.com/ryszard/goskiplist/skiplist
  go build src/daemon.go

Currently, these function support float64 and int64 types.

Limitations:
  * can't handle same value correctly (goskiplist limitation)

Misc:
  * each column should support alias.
    e.g) low(score as low_score, 10)

And, I'm not good at English and Golang. so if you find some wrong
points, plz feel free to fix it.
@samuraraujo
Copy link

Dear Paul, is there any plan to support my first suggestion?

" select top(10, value) from cpu.* where time > now() - 10m"
"I would expect It to retrieve the top 10 series."

I have thousands of series and I would like to obtain the top 10 on a specific interval. The mean or sum can be used to rank the series. e.g. select top(10, mean(value)) from cpu.* where time > now() - 10m

Currently, I have to run: select mean(value) from cpu.* where time > now() - 10m
Then, it retrieves hundred thousands of series that are sorted inside my application. This is not efficient.

Please let me know if you have this feature on your roadmap.

Best

@pauldix
Copy link
Member

pauldix commented May 28, 2014

It's definitely something we want to get to. However, we'll probably be doing column indexes first: #582. And when we do those the better way to structure things would be to have a series called cpu and have the host as an indexed column. Then the query using #183 and top would work.

It'll probably be more than a month until we get to this since there are some other features that are much higher priority at the moment.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants