Improve time complexity of Postings merge #50
Awesome! Thanks for digging so deep. The proper construction of the iterator tree is of course an easy and welcome improvement, and my initial oversight. As stated in the blog post, there seems to be a ton of material on this. I roughly went with what was easiest for now, in particular because k is typically very small for our use case. For merge (i.e., regexp matchers in Prometheus), k might indeed get larger than for intersections, but still, it would be interesting to benchmark the impact of different solutions for the typical inputs we encounter.
Sure, this seems like a minor optimization iff k << n. Yet it's a pretty easy one, and it helps in some degenerate cases (many labels).
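The "proper construction of the iterator tree" for intersections can be sketched as a balanced pairwise combination. This is only an illustrative sketch, not the tsdb's actual code: `intersectTwo`/`intersect` are hypothetical helper names, and plain `[]uint64` slices stand in for postings iterators.

```go
package main

import "fmt"

// intersectTwo intersects two sorted lists in O(len(a)+len(b)).
func intersectTwo(a, b []uint64) []uint64 {
	var out []uint64
	i, j := 0, 0
	for i < len(a) && j < len(b) {
		switch {
		case a[i] < b[j]:
			i++
		case a[i] > b[j]:
			j++
		default:
			out = append(out, a[i])
			i++
			j++
		}
	}
	return out
}

// intersect combines k lists as a balanced binary tree of pairwise
// intersections, so each element passes through O(log k) levels
// instead of the O(k) levels of a left-leaning chain.
func intersect(lists ...[]uint64) []uint64 {
	switch len(lists) {
	case 0:
		return nil
	case 1:
		return lists[0]
	}
	mid := len(lists) / 2
	return intersectTwo(intersect(lists[:mid]...), intersect(lists[mid:]...))
}

func main() {
	fmt.Println(intersect([]uint64{1, 2, 4, 8}, []uint64{2, 4, 6, 8}, []uint64{2, 8, 9}))
}
```

The balanced shape matters mostly when k grows: with a left-leaning chain, elements of the first list can be re-examined k-1 times.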
(I was thinking about binary heaps, backed by an array (slice), not tree based heaps)
Indeed, they are! :) Please take a look at the Python implementation of
There definitely is! If you don't mind, I would prepare some benchmarks and improvements to this in my spare time (some evening or Saturday).
That would be fantastic! My only background is an information retrieval lecture from years ago and some basic research I've done for this project. So any input is greatly appreciated. The serialization format has version flags on several levels, so it should be rather easy to iterate on and add better solutions without breaking things.
So I couldn't find any literature on how to do an efficient union of sorted postings (search engines call it […]). When it comes to […], your postings ([…])
This was indeed in the original design doc. Then the index size turned out to be so small that I decided not to improve the naive implementation yet. I'm currently revamping various pieces of the serialization format though, so it's a good time to add relevant metadata for cost estimation functions and such. Is there anything beyond the total number of elements in a postings list that could be interesting?
I haven't dived deep into the index format yet, but I noticed that it's quite simple and there's a lot of room for improvement (postings encoding is one thing, lookup dicts are another -- I see that right now these are simply Go's maps serialized to disk).
Lucene uses "packed blocks" and "vint blocks" + skip tables. Anyway, I don't want to hijack this thread for index format discussion :)
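As a rough illustration of the "vint blocks" idea, here is a minimal delta-plus-varint encoding of a sorted postings list using Go's encoding/binary. This is only a sketch of the general technique, not the actual on-disk format of Lucene or this project.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// encodePostings delta-encodes a sorted postings list and writes each
// gap as a varint, so small gaps (dense lists) take a single byte.
func encodePostings(ids []uint64) []byte {
	buf := make([]byte, binary.MaxVarintLen64)
	var out []byte
	var prev uint64
	for _, id := range ids {
		n := binary.PutUvarint(buf, id-prev)
		out = append(out, buf[:n]...)
		prev = id
	}
	return out
}

// decodePostings reverses encodePostings by accumulating the gaps.
func decodePostings(b []byte) []uint64 {
	var out []uint64
	var prev uint64
	for len(b) > 0 {
		gap, n := binary.Uvarint(b)
		b = b[n:]
		prev += gap
		out = append(out, prev)
	}
	return out
}

func main() {
	enc := encodePostings([]uint64{1000, 1001, 1003, 1500})
	fmt.Println(len(enc), decodePostings(enc))
}
```

Note how the three trailing IDs cost one or two bytes each, whereas fixed-width encoding would spend eight bytes per ID regardless of density.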
No, the number of elements should be enough.
Closed via #98
-- https://fabxc.org/blog/2017-04-10-writing-a-tsdb/
That's not the most efficient way of performing a k-way merge. Your implementation indeed has O(nk) time complexity, but this short drop-in replacement: […] has O(n log k) (hope I got it right). There are many other efficient implementations of k-way merge; one of the most popular (I think) is based on a binary heap (container/heap in Go land). I could prepare a couple of implementations and benchmark them. Alternatively, we can take a look at how other search engines do that (https://github.com/blevesearch/bleve ?) and go with that.
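The heap-based approach mentioned above can be sketched with container/heap roughly as follows. The `cursor`/`postingsHeap` types and plain `[]uint64` slices are illustrative stand-ins, not the tsdb's actual Postings interface.

```go
package main

import (
	"container/heap"
	"fmt"
)

// cursor tracks the current position in one sorted postings list.
type cursor struct {
	list []uint64
	pos  int
}

// postingsHeap is an array-backed min-heap of cursors, ordered by
// each cursor's current value.
type postingsHeap []*cursor

func (h postingsHeap) Len() int           { return len(h) }
func (h postingsHeap) Less(i, j int) bool { return h[i].list[h[i].pos] < h[j].list[h[j].pos] }
func (h postingsHeap) Swap(i, j int)      { h[i], h[j] = h[j], h[i] }

func (h *postingsHeap) Push(x interface{}) { *h = append(*h, x.(*cursor)) }
func (h *postingsHeap) Pop() interface{} {
	old := *h
	n := len(old)
	c := old[n-1]
	*h = old[:n-1]
	return c
}

// mergeSorted unions k sorted lists in O(n log k): each of the n total
// elements costs one heap fix-up of O(log k).
func mergeSorted(lists ...[]uint64) []uint64 {
	h := make(postingsHeap, 0, len(lists))
	for _, l := range lists {
		if len(l) > 0 {
			h = append(h, &cursor{list: l})
		}
	}
	heap.Init(&h)

	var out []uint64
	for h.Len() > 0 {
		c := h[0]
		v := c.list[c.pos]
		// Deduplicate: the same series ID may occur in several lists.
		if len(out) == 0 || out[len(out)-1] != v {
			out = append(out, v)
		}
		c.pos++
		if c.pos == len(c.list) {
			heap.Pop(&h)
		} else {
			heap.Fix(&h, 0)
		}
	}
	return out
}

func main() {
	fmt.Println(mergeSorted([]uint64{1, 4, 9}, []uint64{2, 4, 10}, []uint64{3, 9}))
}
```

Using heap.Fix on the root after advancing a cursor avoids the Pop-then-Push pair, halving the heap operations per element.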
@fabxc