
Improve time complexity of Postings merge #50

Closed
telendt opened this issue Apr 17, 2017 · 7 comments

Comments

@telendt
Contributor

telendt commented Apr 17, 2017

The procedure for more than two lists of different set operations works similarly. So the number of k set operations merely modifies the factor (O(k*n)) instead of the exponent (O(n^k)) of our worst-case lookup runtime. A great improvement.

-- https://fabxc.org/blog/2017-04-10-writing-a-tsdb/

That's not the most efficient way of performing a k-way merge. Your implementation indeed has O(nk) time complexity, but this short drop-in replacement:

func Merge(its ...Postings) Postings {
	l := len(its)
	switch l {
	case 0:
		return nil
	case 1:
		return its[0]
	default:
		// Split the inputs in half and merge each half recursively,
		// building a balanced binary tree of pairwise merges.
		m := l / 2
		return newMergePostings(Merge(its[:m]...), Merge(its[m:]...))
	}
}

has O(n log k) time complexity (hope I got it right -- each of the n elements passes through at most ceil(log2 k) pairwise merges). There are many other efficient implementations of k-way merge; one of the most popular (I think) is based on a binary heap (container/heap in Go land).

I could prepare a couple of implementations and benchmark them. Alternatively, we can take a look at how other search engines do that (https://github.com/blevesearch/bleve ?) and go with that.

@fabxc

@fabxc
Contributor

fabxc commented Apr 18, 2017

Awesome! Thanks for digging so deep.

The proper construction of the iterator tree is of course an easy and welcome improvement and my initial oversight.

As stated in the blog post, there seems to be tons of material on this. I roughly went with what was easiest for now, in particular because k is typically very small for our use case. For merge (i.e., regexp matchers in Prometheus), it might indeed get larger than for intersections, but still k << n.

It would indeed be interesting to benchmark the impact of different solutions for the typical inputs we encounter.
Heaps are certainly interesting, the caveat being that their cache locality might be a lot worse than for a simple list, and they would require a re-implementation that allows us to hold the heap in a contiguous byte slice (for reading the data structure directly from our mmapped index file). From my understanding, sorted lists are pretty established in that regard – which may partially be because they compress really well, which is not as important for us, but also because adding elements is very cheap if it happens in-order.

@telendt
Contributor Author

telendt commented Apr 18, 2017

For merge (i.e, regexp matchers in Prometheus), it might indeed get larger than intersections but still k << n.

Sure, this seems like a minor optimization iff k << n. Yet, it's a pretty easy one and helps in some degenerate cases (many labels).

Heaps are certainly interesting, the caveat being that its cache locality might be a lot worse than for a simple list[..]

(I was thinking about binary heaps backed by an array (slice), not tree-based heaps)

From my understanding sorted lists are pretty established in that regard [...]

Indeed, they are! :)
Maybe I didn't express myself clearly -- I'm not advocating for changing the organization of the posting lists to a binary heap, but for using a binary (min-)heap to get the posting list with the smallest At() in constant (O(1)) time.

Please take a look at the Python implementation of heapq.merge (or, if you're not familiar with Python, take a look at this example) to understand the idea.
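
To make the idea concrete, here is a rough Go sketch along the lines of heapq.merge, using container/heap. The Postings interface below is a simplified stand-in (only Next() and At()), not the actual interface of this package:

package mergesketch

import "container/heap"

// Postings is a simplified stand-in for the real iterator interface,
// just enough to show the shape of the idea.
type Postings interface {
	Next() bool
	At() uint32
}

// postingsHeap is a min-heap ordered by each iterator's current At() value,
// so the iterator with the smallest head is always at index 0.
type postingsHeap []Postings

func (h postingsHeap) Len() int           { return len(h) }
func (h postingsHeap) Less(i, j int) bool { return h[i].At() < h[j].At() }
func (h postingsHeap) Swap(i, j int)      { h[i], h[j] = h[j], h[i] }

func (h *postingsHeap) Push(x interface{}) { *h = append(*h, x.(Postings)) }
func (h *postingsHeap) Pop() interface{} {
	old := *h
	x := old[len(old)-1]
	*h = old[:len(old)-1]
	return x
}

// mergeAll drains all iterators as one sorted, de-duplicated stream in
// O(n log k): peeking at the minimum is O(1), each advance costs O(log k).
func mergeAll(its ...Postings) []uint32 {
	h := postingsHeap{}
	for _, it := range its {
		if it.Next() { // position each iterator on its first element
			h = append(h, it)
		}
	}
	heap.Init(&h)

	var out []uint32
	for h.Len() > 0 {
		v := h[0].At()
		if len(out) == 0 || out[len(out)-1] != v {
			out = append(out, v) // skip duplicates across lists
		}
		if h[0].Next() {
			heap.Fix(&h, 0) // the advanced iterator sinks to its new place
		} else {
			heap.Pop(&h) // iterator exhausted, drop it
		}
	}
	return out
}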

As stated in the blog post, there seems to be tons of material on this

There definitely is!
It looks like you're building an inverted index here, and since I've spent the last few years working with search, I thought I could help.

If you don't mind I would prepare some benchmarks and improvements to this in my spare time (some evening or SAT).

@fabxc
Contributor

fabxc commented Apr 18, 2017

If you don't mind I would prepare some benchmarks and improvements to this in my spare time (some evening or SAT).

That would be fantastic! My only background is an information retrieval lecture from years ago and some basic research I've done for this project. So any input is greatly appreciated.

The serialization format has version flags on several levels, so it should be rather easy to iterate on and add better solutions without breaking things.

@telendt
Contributor Author

telendt commented Apr 21, 2017

So I couldn't find any literature on how to do an efficient union of sorted postings (search engines call it Disjunction). But from what I see, disjunction of postings is only triggered on a regexp label match (label = val1 OR val2 OR ...) and there's no intersection there, so it is a classic k-way merge problem.

When it comes to Intersect (search engines often call this Conjunction), there is more information about it. In short, it's very similar to yours, except that it merges Postings in ascending length order. That's also how bleve and Lucene do it. So either Intersect should do that, or the caller should make sure that the postings are in the right order (but then it should be documented).

Your postings (bigEndianPostings) are backed by a byte slice and have fixed-size items, so it's easy to compare their lengths.
(In the real world, posting lists are usually delta-encoded with some variable-length int encoding.)
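
To illustrate the ordering, here is a rough sketch on plain sorted ID slices (not the real Postings iterators; intersectAll and intersectPair are made-up helpers):

package intersectsketch

import "sort"

// Sketch only: intersect sorted ID lists, processing them shortest-first
// so the intermediate result stays as small as possible (the bleve/Lucene
// style conjunction).
func intersectAll(lists ...[]uint32) []uint32 {
	if len(lists) == 0 {
		return nil
	}
	// Ascending length order; for fixed-size encodings like
	// bigEndianPostings this is just len(bytes)/itemSize.
	sort.Slice(lists, func(i, j int) bool { return len(lists[i]) < len(lists[j]) })

	res := lists[0]
	for _, l := range lists[1:] {
		res = intersectPair(res, l)
		if len(res) == 0 {
			break // early exit: the conjunction is already empty
		}
	}
	return res
}

// intersectPair keeps the elements common to two sorted lists.
func intersectPair(a, b []uint32) []uint32 {
	out := make([]uint32, 0, len(a))
	i, j := 0, 0
	for i < len(a) && j < len(b) {
		switch {
		case a[i] < b[j]:
			i++
		case a[i] > b[j]:
			j++
		default:
			out = append(out, a[i])
			i++
			j++
		}
	}
	return out
}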

@fabxc
Contributor

fabxc commented Apr 25, 2017

Your postings (bigEndianPostings) are backed by byte slice and have fixed size items, so it's easy to compare their length.
(In the real word posting lists are usually delta encoded with some variable int encoding.)

This was indeed in the original design doc. Then the index size turned out to be so small I decided not to improve the naive implementation yet.
Groupvarint is certainly a good candidate here as it has even greater throughput than regular varint. One complexity concern was that it requires adding skip pointers to advance effectively through compressed postings.
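
For reference, a rough sketch of the plain delta + varint baseline using encoding/binary (encodeDeltas and decodeDeltas are made-up names, not part of the index format here):

package varintsketch

import "encoding/binary"

// Sketch only: delta-encode a sorted posting list with plain varints --
// the "regular varint" baseline; group-varint instead packs the length
// information of four deltas together for better decode throughput.
func encodeDeltas(ids []uint32) []byte {
	out := make([]byte, 0, len(ids))
	tmp := make([]byte, binary.MaxVarintLen32)
	var prev uint32
	for _, id := range ids {
		n := binary.PutUvarint(tmp, uint64(id-prev)) // store the gap, not the absolute ID
		out = append(out, tmp[:n]...)
		prev = id
	}
	return out
}

func decodeDeltas(buf []byte) []uint32 {
	var out []uint32
	var prev uint32
	for len(buf) > 0 {
		gap, n := binary.Uvarint(buf)
		if n <= 0 {
			break // malformed input; stop decoding
		}
		buf = buf[n:]
		prev += uint32(gap)
		out = append(out, prev)
	}
	return out
}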

Currently revamping various pieces of the serialization format though. So it's a good time to add relevant metadata for cost estimation functions and such. Is there anything beyond the total number of elements in a postings list that could be interesting?

@telendt
Contributor Author

telendt commented Apr 25, 2017

Then the index size turned out to be so small I decided not to improve the naive implementation yet.

I haven't dived deep into the index format yet, but I noticed that it's quite simple and there's a lot of room for improvement (postings encoding is one thing, lookup dicts are another -- I see that right now these are simply Go maps serialized to disk).

Groupvarint is certainly a good candidate[...]

Lucene uses "packed blocks" and "vint blocks" + skip tables.
https://lucene.apache.org/core/6_0_1/core/org/apache/lucene/codecs/lucene50/Lucene50PostingsFormat.html
But maybe Prometheus does not need such sophisticated structures (I guess that its posting lists are on average much shorter).

Anyway, I don't want to hijack this thread for an index format discussion :)

Is there anything beyond total number of elements in a postings list that could be interesting?

No, number of elements should be enough.

@fabxc
Contributor

fabxc commented Jun 16, 2017

Closed via #98

@fabxc fabxc closed this as completed Jun 16, 2017