Optimise std::trie's iterator #11497

huonw · 2014-01-12T12:35:07Z

This stores the stack of iterators inline (we have a maximum depth with
uint keys), and then uses direct pointer offsetting to manipulate it,
in a blazing fast way:

Before:

bench_iter_large          ... bench:     43187 ns/iter (+/- 3082)
bench_iter_small          ... bench:       618 ns/iter (+/- 288)

After:

bench_iter_large          ... bench:     13497 ns/iter (+/- 1575)
bench_iter_small          ... bench:       220 ns/iter (+/- 91)

Also, removes .each_{key,value}_reverse as an offering to
placate the gods of external iterators for my heinous sin of
attempting to add new internal ones (in a previous version of this
PR).

bill-myers · 2014-01-12T15:19:21Z

Doesn't the trie have a bounded tree height since the keys are integers? (specifically it seems something like at most 16 levels on 64-bit)

If so, why not just replace the growable heap-allocated array in the iterators with an immediate fixed-size array?

That should result in the same or faster performance than an internal iterator.

alexcrichton · 2014-01-12T18:29:48Z

I agree with @bill-myers, I would rather attempt to optimize the external iterators rather than adding internal ones. Another option would be to have the iterator maintain a "reasonably sized" fixed-size vector which then gets spilled onto the heap if you iterate farther down perhaps?

Regardless, I'd like to pursue optimization before adding internal iterators. If it does indeed turn out that internal iterators are fundamentally faster, then I think it's ok to add them, but right now I wouldn't necessarily be convinced that there's a fundamental reason external iterators can't be faster.

huonw · 2014-01-13T14:40:48Z

No need for internal iterators: just a gentle application of unsafe to get the external iterators even faster than the internal ones were.

thestinger · 2014-01-13T22:59:15Z

I like that this is faster, but it makes me worried. The trie can rewritten with raw pointers and have over 5x faster insert/removal too... but isn't this an admission that Rust's safety model doesn't really work for a systems language? It's going to occur for all owned tree-like data structures. I think a general solution without unsafe code is necessary.

huonw · 2014-01-13T23:44:20Z

I agree in principal, but I'm not sure it is possible to have an safe solution that is this low level/performant, at least, not without some sort of dependent typing. I certainly don't have a better proposal than carefully writing parts of the stdlib unsafely for performance (at least generics mean that we only have to write it once for each data structure, but as you say, there are many tree-like data structures).

(I seriously had to go all the way down to stepping through by .offseting pointers directly to get this performance, the following were all at least 2x slower self.stack[self.length], self.stack.unsafe_mut_ref(self.length), start_ptr.offset(self.length).)

thestinger · 2014-01-13T23:48:37Z

There are plenty of user-defined tree-like data structures too, and Rust needs a solution to avoid them being painfully slow compared to C++. It's not good that after lots of tweaking, TrieMap and TreeMap are several times slower at inserts and removals compared to easier to write unsafe code. I really don't want the solution to be that you have to write your tree with unsafe code for fast insertion, removal and iteration.

bill-myers · 2014-01-13T23:54:20Z

It's a bit off-topic, but...

The issue with self.stack.unsafe_mut_ref(self.length) and start_ptr.ofsset(self.length) is that the element type is VecMutIterator, which is large, and x86 can only address arrays or compute addresses with element sizes of 1, 2, 4 or 8 for free.

If the element size is different, you need an explicit multiplication instruction to compute the final address, which of course reduces performance since it has lower throughput and latency.

This multiplication and safety issue can be solved by using iterators for the fixed array instead of raw pointers or integer offsets.

However, there is still a bounds check on each iterator increment, which is unavoidable for safety, because the compiler has no easy way of divining that the increment will not happen more times than the fixed stack size (that would require automatic theorem proving based, in this case, on the fact that if you shift any integer by K, it will become zero in max_int/K steps and this means that nodes at that level won't be split, etc.)

Ultimately, the outcome is that library data structures probably must be written with some unsafe code, and using only safe code is possible only for applications.

That's still far better than the C++ situation.

thestinger · 2014-01-14T00:02:04Z

Owned tree data structures are common in user code too and they'll certainly be everywhere in Servo. I don't think it's acceptable to require that these all use unsafe code to avoid a 2-3x performance hit on everything but non-iterative searches.

huonw · 2014-01-14T00:25:22Z

VecMutIterator, which is large

It's actually "only" 3 words (which is still too large on 64-bit platforms; even if/when we remove the unnecessary 3rd word).

This multiplication issue can be solved by using iterators for the fixed array instead of raw pointers or integer offsets.

We'd need a reversible iterator of some sort to be able to push and pop (and, as you say, there's the bounds checking).

Owned tree data structures are common in user code too and they'll certainly be everywhere in Servo. I don't think it's acceptable to require that these all use unsafe code to avoid a 2-3x performance hit on everything but non-iterative searches.

I imagine we could get somewhere with some limited bounded integer/pointer types, but I don't see how we can offer a general solution other than dependent typing, or getting rustc/LLVM to recognise & optimise the "swap out, put back in" idiom (for inserts/removes) where possible (which may not be many places: probably interacts poorly with destructors/failure).

bill-myers · 2014-01-14T00:31:09Z

@thestinger

Well, they could use a common library implementation of a tree.

Specifically, the bounds checks for the iteration stack can only be avoided for trees with a constant bound on their height, which generally means they are either tries, balanced search trees or heaps, which are or can all be provided by libstd.

If you are iterating over an unbounded tree (say, an HTML document), then you need a growable heap-allocated stack (or something that is either on the stack or on the heap depending on size), and you only need to add/peek/remove the last element, so you don't save any bounds checks with unsafe code.

Also, with code that actually does something with the elements its iterating on, I doubt that that bounds check matters.

@huonw

We'd need a reversible iterator of some sort to be able to push and pop (and, as you say, there's the bounds checking).

I think you in fact just need a fixed-capacity stack type with fast access to the last element (i.e. something that holds a [T, ..n] and the current length in bytes).

Then, optimal performance only requires an "unsafe_push" function that doesn't check for going out of bounds by assuming that you know it can't happen.

Without integer generic parameters it seems that such a type cannot be implemented as a generic struct, but it should be implementable as a macro.

alexcrichton · 2014-01-17T19:18:42Z

Looks like the internal iterators were removed from this commit, and it's just the external iterators with some optimizations are left (nice job!)

r=me with a rebase

This are *trivial* to reimplement in terms of each_reverse if that extra little bit of performance is needed.

This stores the stack of iterators inline (we have a maximum depth with `uint` keys), and then uses direct pointer offsetting to manipulate it, in a blazing fast way: Before: bench_iter_large ... bench: 43187 ns/iter (+/- 3082) bench_iter_small ... bench: 618 ns/iter (+/- 288) After: bench_iter_large ... bench: 13497 ns/iter (+/- 1575) bench_iter_small ... bench: 220 ns/iter (+/- 91)

This stores the stack of iterators inline (we have a maximum depth with `uint` keys), and then uses direct pointer offsetting to manipulate it, in a blazing fast way: Before: bench_iter_large ... bench: 43187 ns/iter (+/- 3082) bench_iter_small ... bench: 618 ns/iter (+/- 288) After: bench_iter_large ... bench: 13497 ns/iter (+/- 1575) bench_iter_small ... bench: 220 ns/iter (+/- 91) Also, removes `.each_{key,value}_reverse` as an offering to placate the gods of external iterators for my heinous sin of attempting to add new internal ones (in a previous version of this PR).

Support multi-character punct tokens in MBE Fixes rust-lang#11497 In the context of MBE, consecutive puncts are parsed as multi-character punct tokens whenever possible. For example, `:::` is parsed as ``[Punct(`::`), Punct(`:`)]`` and shouldn't get matched to patterns like `: : :` or `: ::`. We have implemented this behavior only for when we match puncts against `tt` fragments, but not when we match puncts literally. This PR extracts the multi-character punct handling procedure into a separate method and extends its support for literal matching. For good measure, this PR adds support for `<-` token, which is still [considered as one token in rustc](https://github.com/rust-lang/rust/blob/e3961864075eaa9e855e5eec6b4f148029684539/compiler/rustc_ast/src/token.rs#L249) despite the placement syntax having been removed.

Remove `derive_new` test dependency It is the last thing depending on syn 1.0 in clippy changelog: none

huonw added 2 commits January 18, 2014 10:45

std::trie: remove each_{key,value}_reverse internal iterators.

f0c554d

This are *trivial* to reimplement in terms of each_reverse if that extra little bit of performance is needed.

bors closed this Jan 18, 2014

bors merged commit 0148055 into rust-lang:master Jan 18, 2014

huonw deleted the trie-internal-iter branch June 27, 2014 06:48

flip1995 pushed a commit to flip1995/rust that referenced this pull request Sep 25, 2023

Auto merge of rust-lang#11497 - Alexendoo:remove-derive-new, r=dswij

0273ed3

Remove `derive_new` test dependency It is the last thing depending on syn 1.0 in clippy changelog: none

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Optimise std::trie's iterator #11497

Optimise std::trie's iterator #11497

huonw commented Jan 12, 2014

bill-myers commented Jan 12, 2014

alexcrichton commented Jan 12, 2014

huonw commented Jan 13, 2014

thestinger commented Jan 13, 2014

huonw commented Jan 13, 2014

thestinger commented Jan 13, 2014

bill-myers commented Jan 13, 2014

thestinger commented Jan 14, 2014

huonw commented Jan 14, 2014

bill-myers commented Jan 14, 2014

alexcrichton commented Jan 17, 2014

Optimise std::trie's iterator #11497

Optimise std::trie's iterator #11497

Conversation

huonw commented Jan 12, 2014

bill-myers commented Jan 12, 2014

alexcrichton commented Jan 12, 2014

huonw commented Jan 13, 2014

thestinger commented Jan 13, 2014

huonw commented Jan 13, 2014

thestinger commented Jan 13, 2014

bill-myers commented Jan 13, 2014

thestinger commented Jan 14, 2014

huonw commented Jan 14, 2014

bill-myers commented Jan 14, 2014

alexcrichton commented Jan 17, 2014