
[Rust] Make find_duplicate_stored_vtable_revloc more efficient #5580

Closed
wants to merge 1 commit

Conversation

TethysSvensson
Contributor

When profiling code, we noticed that a lot of time in our application was spent inside the find_duplicate_stored_vtable_revloc function.

Currently this function works by doing a linear scan of all the previous vtable positions to look for a matching one.

This PR improves upon the status quo by splitting the written_vtable_revpos field into multiple Vecs instead of just a single one. We place these Vecs in a BTreeMap, and we index into this map using the key (vtable_byte_len, table_object_size) to find out which Vec to use. After having found the correct vector to check, we do a linear scan of this vector using the same logic as the current implementation.
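
In rough terms, the new bookkeeping looks something like this (a simplified sketch; the exact field names and offset types in the PR may differ):

use std::collections::BTreeMap;

// Simplified sketch: revlocs are grouped by (vtable_byte_len, table_object_size)
// instead of living in one flat Vec.
struct VtableCache {
    written_vtable_revpos: BTreeMap<(usize, usize), Vec<u32>>,
}

impl VtableCache {
    // Look for an existing identical vtable among the candidates that share the
    // same key; `is_duplicate` does the same byte-for-byte comparison as the
    // current linear scan, just over far fewer candidates.
    fn find_duplicate(
        &self,
        vtable_byte_len: usize,
        table_object_size: usize,
        is_duplicate: impl Fn(u32) -> bool,
    ) -> Option<u32> {
        self.written_vtable_revpos
            .get(&(vtable_byte_len, table_object_size))?
            .iter()
            .copied()
            .find(|&revpos| is_duplicate(revpos))
    }

    // Record a newly written vtable under its key.
    fn insert(&mut self, vtable_byte_len: usize, table_object_size: usize, revpos: u32) {
        self.written_vtable_revpos
            .entry((vtable_byte_len, table_object_size))
            .or_default()
            .push(revpos);
    }
}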

This PR will increase heap usage slightly, as it introduces a BTreeMap to manage the Vecs. Even so, I believe this to be an acceptable sacrifice and it should still be a net improvement for most applications.

We use a BTreeMap instead of a HashMap, because that was what gave the best result for our application. I have not done any other benchmarks, so I am not sure if our application is representative.

I am sure this could probably be improved even further, but I do not understand the code well enough to write a more invasive change.

@aardappel
Collaborator

I have so far resisted making a similar change in the C++ implementation, because it is very important to me that basic FlatBuffer usage can work with very few to no allocations. It is kind of what the library is about.

I'm not sure how many allocations this BTreeMap causes, but using it outright for all users of FlatBuffers doesn't seem that great to me. I would suggest, if this is easy to do in Rust, defaulting to the existing Vec but allowing users with special requirements to swap in a BTreeMap statically through a template parameter (or whatever the Rust equivalent is).

If that is not possible, I'd do it dynamically, by keeping this BTreeMap unallocated/unused until the number of vtables exceeds N (where N is... 10 or so). I think you'll find the vast majority of FlatBuffers use cases do not exceed this.
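
In rough Rust-ish terms (I'm not a Rust user, so all names here are purely illustrative), the dynamic fallback I have in mind is something like:

use std::collections::BTreeMap;

const LINEAR_SCAN_LIMIT: usize = 10; // "N" -- just a guess, to be tuned

// Illustrative hybrid: stay allocation-light with a plain Vec for small buffers,
// and only build the map-based index once there are enough vtables for it to pay off.
enum VtableRevlocs {
    Small(Vec<((usize, usize), u32)>),           // (key, revloc), scanned linearly
    Indexed(BTreeMap<(usize, usize), Vec<u32>>), // revlocs grouped by key
}

impl VtableRevlocs {
    fn push(&mut self, key: (usize, usize), revloc: u32) {
        match self {
            VtableRevlocs::Small(v) if v.len() < LINEAR_SCAN_LIMIT => {
                v.push((key, revloc));
            }
            VtableRevlocs::Small(v) => {
                // Too many vtables for a linear scan: move everything into the map.
                let mut map: BTreeMap<(usize, usize), Vec<u32>> = BTreeMap::new();
                for (k, p) in v.drain(..) {
                    map.entry(k).or_default().push(p);
                }
                map.entry(key).or_default().push(revloc);
                *self = VtableRevlocs::Indexed(map);
            }
            VtableRevlocs::Indexed(m) => {
                m.entry(key).or_default().push(revloc);
            }
        }
    }
}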

@aardappel
Collaborator

@rw

@TethysSvensson
Contributor Author

First of all, I completely understand that our use-case might not be very representative. If that is the case, we will be happy to continue maintaining our fork.

To give you some context, the purpose of the application I was talking about before is to convert ~300 MB files in a proprietary format into ~30 MB files in a flatbuffer format. The output format contains ~40 tables. We first do a streaming conversion to an in-memory format that closely represents the input file and then do a second pass which converts it to a flatbuffer.

It might be feasible to do the entire process streaming. However we do a few semantic transformations while converting, and we have had a lot of trouble achieving this while still writing maintainable and correct code.

Most of our code is very optimized, to the point that I believe that we are currently spending most of the time doing the flatbuffer serialization -- though it's been a while since I have done a proper benchmark of the code. I am planning on doing another round of profiling within the next few weeks.

For our use-case, sacrificing a few kilobytes for a multi-digit percentage boost in performance is well worth it. That would be the case, even if it was a few megabytes.

And to answer some of your questions:

Regarding memory usage: Both the current implementation and the BTreeMap implementation can allocate memory when serializing a vtable they have not seen before. The BTreeMap implementation will of course consume more memory, but in my opinion the difference is insignificant: it is on the order of ~512 bytes per tree node, and each tree node contains up to 11 (K, V) pairs. If you are very worried about memory usage, I can switch to a HashMap, which is slightly slower but more compact in memory (and makes fewer allocations in total).

Regarding implementation options: I think both of your options are feasible, though they will of course complicate the code a bit. Another option would be to do it with feature flags.

@rw
Collaborator

rw commented Oct 28, 2019

For a long time I've thought it would be nice to make the vtable deduplicator be user-configurable. We would just need to define the API for it. Do either of you think this is a good idea?

@aardappel
Collaborator

@TethysSvensson in your use case it may be insignificant, but for others it may not be. For comparison, the C++ implementation defaults to working with a single allocation for all data, including vtable tracking, and can work with zero allocations. Rust should aim to be as close as possible to that, by default.

@rw yes, that's essentially what I am suggesting, but more along the lines of making the linear vtable storage the default, and allowing people with special needs to plug in a tree based one statically (or automatically at runtime, if that works better).

@TethysSvensson
Contributor Author

@aardappel As I said before: We know that we are not exactly the common use case. We understand that you might not want to support our needs as a default option, or at all for that matter. Our worst case here is that we have to continue maintaining our fork, which is not a lot of work at all.

I made the PR to discuss whether this was something you wanted in one form or another, because there might be other users with similar needs.

The part you mentioned about the C++ version has me a bit confused. I am not sure which of the following you are optimizing for here, and on which points you are highlighting C++ as being superior:

  • Do you want to minimize the total number of allocations made?
  • The total number of bytes allocated?
  • The total memory cost of the implementation (i.e. including code size)?
  • Are you trying to avoid the performance penalty of allocating while serializing?
  • Are you trying to reach a point where you can provide an API that supports e.g. embedded platforms without a heap?
  • All of the above?

I ask because the ideal implementation might not look the same depending on which of these you want to prioritize the most.

@TethysSvensson
Contributor Author

Perhaps we should close this PR and have this discussion in an issue instead?

@aardappel
Copy link
Collaborator

@TethysSvensson I didn't say we shouldn't support this use case. It seems totally useful to me. I just think it shouldn't be the default, but that comes from the assumption that Rust users have the same performance sensibilities as C++ users, and I am not (yet) a Rust user, so I don't know.

The answer to your list is mostly "all of the above". My comment about C++ is just to illustrate the lengths we have gone through so far, and how that contrasts with a dynamically allocated tree (as a default).

And I am not trying to decide on this, just give context. @rw and other actual Rust users should.

This PR is a fine place for the discussion. It already contains the discussion so far, and possible code for an implementation.

@TethysSvensson
Contributor Author

@aardappel Thank you for giving a bit more context. It seems I interpreted your reply as being more critical than it actually was.

I think the easiest solution to implement would be an additional generic parameter on the FlatBufferBuilder which determines the type of written_vtable_revpos.

As an alternative/complementary approach, perhaps it would be possible to create a reusable version of MyTableArgs which would save the vtable offset internally. That way you could do something like this:

let mut weapon_template = WeaponTemplate::new();

// Looks for a vtable in written_vtable_revpos and creates a new one as a fallback.
// This also stores the vtable offset inside the weapon_template struct.
weapon_template.name = Some(weapon_one_name);
weapon_template.damage = 3;
let weapon1 = weapon_template.create(&mut builder);

// Reuses the vtable offset found inside the weapon_template.
weapon_template.name = Some(weapon_two_name);
weapon_template.damage = 4;
let weapon2 = weapon_template.create(&mut builder);

// Tries to reuse the vtable offset found inside the weapon_template;
// however, the desired vtable does not match, so we fall back to searching
// through written_vtable_revpos and potentially write a new vtable.
weapon_template.name = None;
weapon_template.damage = 4;
let weapon3 = weapon_template.create(&mut builder);

@rw
Collaborator

rw commented Oct 30, 2019

@TethysSvensson Would you want to explore defining a trait for pluggable vtable deduplicators? As test cases, we'd want to support a no-op deduplicator, the current linear deduplicator, and a heap-heavy deduplicator like yours from this PR.
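
Something along these lines, say (all names here are hypothetical, just to make the shape concrete):

// Hypothetical trait for pluggable vtable deduplication: the builder hands the
// deduplicator the bytes of a freshly assembled vtable, and the deduplicator
// either returns the revloc of an identical vtable written earlier or records
// the new one.
pub trait VtableDeduplicator {
    // Returns Some(revloc) if an identical vtable has already been written.
    fn find(&self, vtable_bytes: &[u8]) -> Option<u32>;
    // Records a vtable that was just written at `revloc`.
    fn record(&mut self, vtable_bytes: &[u8], revloc: u32);
}

// No-op deduplicator: every table gets its own vtable.
pub struct NoDedup;

impl VtableDeduplicator for NoDedup {
    fn find(&self, _vtable_bytes: &[u8]) -> Option<u32> {
        None
    }
    fn record(&mut self, _vtable_bytes: &[u8], _revloc: u32) {}
}

// The builder would then be generic over the deduplicator, e.g.
// `FlatBufferBuilder<'a, D: VtableDeduplicator = LinearDedup>`, defaulting to
// the current linear scan.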

@TethysSvensson
Contributor Author

Yes, I would be interested in exploring that, but I have unfortunately been a bit busy for the last few months.

I promised I would get back to you with some more concrete numbers, and I have finally managed to get around to re-running our profiling. For this test, I am running our application on my old laptop and using it to convert a ~350 MB file into a single ~200 MB flatbuffer (which we then compress and write to disk).

Our application currently takes around 12 seconds to do this conversion. Of these 12 seconds, we spent ~3s on serialization. If I disable my vtable optimization, this part of the process is increased to ~10s.

According to valgrind --tool=callgrind, about half of the time spent serializing is used inside the function write_vtable, even with my optimization. Specifically, it says that 12.54% of the entire application time is spent inside this function. If I disable this optimization, this function's contribution increases to 58.61%.

@rw
Collaborator

rw commented Dec 19, 2019

@TethysSvensson Sounds like more evidence in favor of pluggable vtable deduplicators using a trait :-]

@TethysSvensson
Contributor Author

TethysSvensson commented Dec 21, 2019

I decided to also experiment with pluggable buffers. This would allow the flatbuffer library to work on embedded platforms without a heap. My current design uses these traits:

pub trait BufferStorage<'a> {
    type Allocator: Allocator<'a>;
    type Iterator: Iterator<Item = &'a [u8]>;
    fn allocator(&'a mut self) -> Self::Allocator;
    fn iter_bytes(&'a mut self) -> Self::Iterator;
}

pub trait Allocator<'a> {
    fn allocate(&mut self, data: &[u8]) -> Option<&'a [u8]>;
}

pub trait CacheStorage<'allocator, 'cache> {
    type Cache: Cache<'allocator>;
    fn initialize(&'cache mut self) -> Self::Cache;
}

pub trait Cache<'allocator> {
    fn lookup<A>(&mut self, data: &[u8], allocate: A) -> RawBuildOffset
    where
        A: FnOnce() -> (&'allocator [u8], RawBuildOffset);
}
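
For example, the HashMap-backed cache mentioned below implements Cache roughly like this (a sketch; it assumes RawBuildOffset is a small Copy type):

use std::collections::HashMap;

// Sketch of a HashMap-backed cache under the Cache trait above.
struct HashMapCache<'allocator> {
    map: HashMap<&'allocator [u8], RawBuildOffset>,
}

impl<'allocator> Cache<'allocator> for HashMapCache<'allocator> {
    fn lookup<A>(&mut self, data: &[u8], allocate: A) -> RawBuildOffset
    where
        A: FnOnce() -> (&'allocator [u8], RawBuildOffset),
    {
        if let Some(&offset) = self.map.get(data) {
            // An identical vtable was already written: reuse its offset.
            return offset;
        }
        // Otherwise write the vtable via the allocator and remember where it went.
        let (stored, offset) = allocate();
        self.map.insert(stored, offset);
        offset
    }
}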

This is not the simplest design possible, but it was the simplest I could come up with after having experimented a bit. The design requirements I tried to work from were the following:

  • I want to be able to re-use the backing storages
  • I want the design to be able to work in no_std environments without a heap
  • I want to be able to write a bumpalo-based buffer storage, which does not give back the allocated bytes in one contiguous slice (hence the iterator).
  • I want to be able to write storage implementations with as little unsafe as possible.
  • I want to be able to work with u8 slices in the vtable cache, to make the design both as simple as possible and as fast as possible.

To prove the design, I have currently written a buffer storage based on bumpalo and one based on a &mut [u8]. Additionally, I have written a cache storage based on HashMap, one based on lru_cache::LruCache, and one that simply does no caching. The only places I needed unsafe were:

  • Some lifetime juggling to allow storage re-use in the vtable caches
  • Bumpalo's byte iterator requires the use of unsafe to prevent undefined behavior in certain cases. My implementation fulfills the requirements needed to avoid this undefined behavior.

I am imagining a flatbuffer library based around this API would be used something like this:

let mut flatbuffer = Flatbuffer::with_storages(MyBufferStorage::new(), MyCacheStorage::new());
// or alternatively:
let mut flatbuffer = Flatbuffer::with_default_storages();

for _ in 0..3 {
    let mut builder = flatbuffer.new_message();

    // [uses the builder to write a message]

    // This consumes the builder
    builder.set_root(object);

    for slice in flatbuffer.finished_message() {
        // [writes the slice to somewhere]
    }
}

@rw What do you think? Would this be an okay API to work with?

@TethysSvensson
Contributor Author

The rewrite I started working on has turned out to be quite large.

I have support for the following buffer backends:

  • A raw slice buffer for use on embedded platforms without a heap, or with very limited memory available.
  • A buffer based on bumpalo. This is also a no_std crate, but needs access to a heap implementation.

I have the following vtable cache backends:

  • A noop cache, which does not save anything. This works on any platform.
  • An implementation with a small LRU cache based on uluru. This works on any platform and does not require a heap.
  • A backend based on a HashMap, which requires not only a heap but also std.
    • It has theoretically unbounded memory usage: ~32 bytes of memory per unique vtable in the generated buffer. This means you can create artificial examples where memory_usage(cache) ~= k * memory_usage(output_message), though I do not think you can create any examples with k larger than about 2.
    • Any table serialize operation can now cause a re-allocation of the hashmap. Amortized this is still O(1), but any individual operation might in principle take an arbitrarily long time.
    • It is likely to be the fastest implementation overall.
    • It has the best performance in terms of output size.
  • A backend based on lru-cache. This uses a HashMap internally, but is bounded in size. The only advantage it has over the HashMap is that there is an upper bound on the worst-case runtime and memory use.

To support this I have unfortunately been forced to rewrite some of the APIs. The good news is that I think they are both more ergonomic and are likely to result in faster code than the current ones.

The bad news is that it is a lot of work and that I am worried about not being able to support some features (either current or planned ones).

I am quite far with my rewrite though. The library itself works as expected and I have a working example of some generated code. My TODOs are:

  • I am still missing proper docstrings on about half of the new API.
  • I have not updated any of the documentation outside the flatbuffer library itself yet.
  • I have not updated any tests outside the flatbuffer library itself yet.
  • My "generated code" is a hand-generated example based on monster.fbs. I have just begun working on updating the code generator.

@rw I'm hoping you will still be interested in merging this once I'm finished. I think it will be an overall improvement to the status quo, but I am very worried that I am missing some of the planned features. I am also worried that I am stepping on somebody's toes by doing such an extensive rewrite without talking to you about it first.

Would you have time to review my design and say whether it is something you would be interested in?

@rw
Collaborator

rw commented Jan 2, 2020

@TethysSvensson That's very exciting! You call it a "rewrite"... is that accurate? It seems to me that the existing code would be mostly unchanged, except that the Builder type will be parameterized by a buffer allocator type and a vtable deduplicator type.

@TethysSvensson
Contributor Author

Unfortunately it is a rewrite.

I wanted to support basing the vtable cache on a HashMap<&'fbb [u8], Offset>.

You cannot do that by having a single struct:

pub struct FlatBufferBuilder {
    owned_buf: Vec<u8>,
    cache: HashMap<&'??? [u8], Offset>, // This hashmap contains references to the owned_buf, which cannot be expressed in rust
}

I then thought I could get by with splitting up the struct like this:

pub struct Flatbuffer {
    owned_buf: Vec<u8>,
}

pub struct FlatbufferBuilder<'fbb> {
    builder: &'fbb mut Flatbuffer,
    cache: HashMap<&'fbb [u8], Offset>,
}

impl Flatbuffer {
    fn new_builder<'fbb>(&'fbb mut self) -> FlatbufferBuilder<'fbb> { ... }
}

This almost worked, except for the fact that owned_buf will re-allocate when full, which invalidates the pointer in the hashmap. To solve this I changed the owned buffer to be parameterized over a storage which can always allocate without moving existing data. This is why I only support bumpalo and slices.
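
For example, the bumpalo-backed storage boils down to roughly this under the Allocator trait from my earlier comment (a sketch):

use bumpalo::Bump;

// A bump arena can always hand out a fresh, stable slice, so existing
// references (such as the HashMap keys) are never invalidated by growth.
struct BumpAllocator<'a> {
    bump: &'a Bump,
}

impl<'a> Allocator<'a> for BumpAllocator<'a> {
    fn allocate(&mut self, data: &[u8]) -> Option<&'a [u8]> {
        // Copies `data` into the arena; the returned slice lives as long as the arena.
        Some(self.bump.alloc_slice_copy(data))
    }
}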

Then I noticed the final problem: The current FlatbufferBuilder also uses Vec<...>s to store field_locs, and this will not work on platforms without a heap.

At this point I decided to start working on a rewrite, mostly as a proof-of-concept to figure out what was actually achievable. I ended up with a full re-implementation -- and I am currently finding it very hard to reintegrate that with the existing code.

With this being a re-implementation, it naturally makes a couple of different design choices. For instance, in the generated code my design currently needs two different types, serialize::MyStruct and deserialize::MyStruct<'buffer>. One of them is a "normal" Rust struct, while the other is a wrapper around a &'buffer [u8; N].

@rw
Collaborator

rw commented Jan 2, 2020

@TethysSvensson Hmm, I see what you mean. The current logic uses offsets, not references, due to this kind of issue. Did you pursue an implementation with an offset-based approach?

@maxburke
Contributor

maxburke commented Jan 2, 2020

( @rw did you mean @TethysSvensson instead of me? :) )

@rw
Collaborator

rw commented Jan 2, 2020

( @rw did you mean @TethysSvensson instead of me? :) )

Woops, fixed! :-D

@TethysSvensson
Contributor Author

@rw I am not sure what you would use for your keys in the HashMap. The keys need to implement Hash + Eq. If you use an offset as a key in your HashMap or HashSet, how would you use hashing to check whether an existing offset matches a newly generated vtable?

@rw
Collaborator

rw commented Jan 3, 2020

@TethysSvensson It would be a two-step process, similar to how it is now: look up an offset based on a hash, then check the underlying buffer for byte-for-byte equality.

If I understand your approach correctly, you may be getting stuck while trying to use a standard hash table to store actual byte slice data. Instead of storing slices, I recommend just storing (hash, offset) key/value pairs. This has the smell of using a hash table to implement a hash table, but that should get you started.

(It's possible that the best implementation of a hash table for this situation would be a custom one, but I wouldn't try that immediately.)

(An alternative approach would be to implement a Hash function for offsets that somehow ties the data in the buffer to the offset, but I think that would run into lifetime issues.)
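
In code, the shape I'm picturing is roughly this (names invented for illustration):

use std::collections::HashMap;
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Illustrative two-step dedup: keep hash -> candidate offsets, then confirm a
// match by comparing bytes in the buffer itself, so no references into the
// buffer are ever stored.
struct OffsetDedup {
    candidates: HashMap<u64, Vec<usize>>,
}

impl OffsetDedup {
    fn find_or_insert(&mut self, buf: &[u8], new_vtable: &[u8], new_offset: usize) -> usize {
        let mut hasher = DefaultHasher::new();
        new_vtable.hash(&mut hasher);
        let key = hasher.finish();

        let offsets = self.candidates.entry(key).or_default();
        for &offset in offsets.iter() {
            // Byte-for-byte check against the data already in the buffer.
            if &buf[offset..offset + new_vtable.len()] == new_vtable {
                return offset;
            }
        }
        offsets.push(new_offset);
        new_offset
    }
}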

@TethysSvensson
Contributor Author

@rw I think one of us is misunderstanding the other.

I already have a fully working re-implementation. I am using a HashMap<&'a [u8], Offset>. I am not having any trouble getting it to work. From a performance viewpoint, I think this is more or less the best possible solution. This is doubly true because an offset-based approach like you are suggesting would not be compatible with bumpalo, which is a much faster buffer allocation strategy than what you are currently using.

What I am stuck at is how to do this as an incremental improvement to the current codebase:

  • The user-facing APIs do not match up very well
  • The current code uses a lot of Vecs, which do not work on platforms without a heap. They are also slow.
  • I have not found a solution for how to enable an approach centered around the MonsterBuilder design without using Vec. My current design has renamed user_code::MonsterArgs to user_code::serialize::Monster. You then call flatbuffer_builder.write_table(monster), where monster has the type user_code::serialize::Monster.

@rw
Collaborator

rw commented Jan 3, 2020

@TethysSvensson Totally understandable how we could be misunderstanding each other. Thanks for trying to clarify!

I'm essentially saying that the algorithm that builds the table (that moves data after every resize operation) is core to how the builder works. This has natural implications for how auxiliary data, like offsets, are handled during construction ops.

It sounds like you changed that core algorithm, is that correct?

@TethysSvensson
Contributor Author

It sounds like you changed that core algorithm, is that correct?

Correct, but not by choice. I did it because I believe that the current algorithm is incompatible with the goals I am trying to achieve, i.e. faster performance using hashmap+bumpalo, while also supporting heapless contexts and contexts with very bounded memory.

@TethysSvensson
Contributor Author

@rw Status update: I have documentation for almost the entire support library and I am more than halfway through rewriting the C++ codegen as well. I will try to get it into a form where I can push it to this branch, so you can take a look.

In the meantime I have run into some open questions.

My main problem is that I do not know how to handle invalid/missing/malicious data when deserializing. As I see it, there are three options:

  • Treat invalid data as though it is simply not there
  • Wrap almost every return type in a Result<T, InvalidData>
  • Panic inside the library

I dislike the second option, because it makes the API harder to use. I dislike the third one, because I think it is a potential DoS vector, and it should be up to the users of the library to decide how to handle invalid data.

So far I have tried to go with the first option. I have also verified that the current code will never panic, at least not in my example code.

However, I am beginning to wonder if this is the right choice. If e.g. an enum field contains an invalid value, I am not sure it is meaningful to turn that into the default value for that enum.

What do you think I should do here?

@rw
Collaborator

rw commented Jan 19, 2020

@TethysSvensson I can tell you're putting a lot of effort into this!

That said, the PR you're describing sounds enormous. As I said above, I think it's likely that only the Builder initializer would need to change, to accept user-provided deduplicators. I don't see the connection to the code generator.

Could you explain more of your ideas here in words?

@TethysSvensson
Contributor Author

My core goals are:

  • I want to be able to run a version using bumpalo as the backing storage, since it is significantly faster than the allocation strategy currently used.
  • I want to be able to run a version using HashMap<&[u8], Offset> as the cache for vtables.
  • I want to be able to run a version without any heap usage.

The current code is not very compatible with those goals for multiple reasons:

  • The currently generated code uses parts of the API that I cannot see how to support without dynamic allocations; for example, I cannot see how to implement the fn push() function.
  • The buffer storage and cache need to be stored in a different struct than the builder itself, because you cannot have self-referential structs in Rust without a lot of unsafe code that is very hard to get right.

@TethysSvensson
Contributor Author

I can tell you more about how my code works, but to be honest I would rather show you; I think that makes more sense.

@rw
Collaborator

rw commented Jan 20, 2020

@TethysSvensson Replying inline to your three stated goals:

  • Would parameterizing the Builder with a backing buffer type work? Then you can use, say, bumpalo::collections::Vec<u8>.
  • The Builder algorithm relocates in-progress data, so those HashMap references will be invalid (as you know). I think we've talked about this before, but my two recommendations are to instead 1) use a HashMap<std::ops::Range<usize>, Offset> (or equivalent) to store ranges that you then look up at the time of deduplication, and/or 2) store the hashes manually, then perform deduplication by first hashing the slice in question.
  • Running a version without heap usage can probably be done using the heapless crate to create the backing buffer for the Builder.

WDYT?

@TethysSvensson
Contributor Author

TethysSvensson commented Jan 20, 2020

  • I am using bumpalo::Bump as the backing storage. This means that there will not be a single contiguous buffer, thus avoiding the need to ever relocate the data.
  • Yes, the current builder algorithm does this. The implementation I am working on does not move the data when the buffer needs to be expanded. I am not sure what you are suggesting with HashMap<std::ops::Range<usize>, Offset>. If I have a vtable, say in a &[u8], how do I use that HashMap to find out if I already have that vtable somewhere?
  • It is not enough to create a backing buffer for the builder. It uses multiple Vecs internally.

@TethysSvensson
Contributor Author

I have a working prototype now, including the C++ codegen. I will fiddle a bit more with it and push it sometime tomorrow afternoon (EU timezone).

@TethysSvensson TethysSvensson changed the title Make find_duplicate_stored_vtable_revloc more efficient [Rust] Make find_duplicate_stored_vtable_revloc more efficient Jan 20, 2020
@TethysSvensson
Contributor Author

@rw I have submitted my PR at #5729. Should we continue the conversation there?

@github-actions

github-actions bot commented Apr 8, 2021

This pull request is stale because it has been open 6 months with no activity. Please comment or this will be closed in 14 days.

@github-actions github-actions bot added the stale label Apr 8, 2021
@TethysSvensson
Contributor Author

Closing. See my comment on #5729 for context.
