[Rust] Make find_duplicate_stored_vtable_revloc more efficient #5580
Conversation
I have so far resisted making a similar change in the C++ implementation, because it is very important to me that basic FlatBuffer usage can work with very few to no allocations. That is kind of what the library is about. I'm not sure how many allocations this `BTreeMap` approach adds. If keeping those low is not possible, I'd do it dynamically, by only using the map when it is actually needed.
First of all, I completely understand that our use-case might not be very representative. If that is the case, we will be happy to continue maintaining our fork.

To give you some context, the purpose of the application I was talking about before is to convert ~300 MB files in a proprietary format into ~30 MB files in a flatbuffer format. The output format contains ~40 tables. We first do a streaming conversion to an in-memory format that closely represents the input file and then do a second pass which converts it to a flatbuffer. It might be feasible to do the entire process streaming. However, we do a few semantic transformations while converting, and we have had a lot of trouble achieving this while still writing maintainable and correct code. Most of our code is very optimized, to the point that I believe we are currently spending most of the time doing the flatbuffer serialization -- though it's been a while since I have done a proper benchmark of the code. I am planning on doing another round of profiling within the next few weeks.

For our use-case, sacrificing a few kilobytes for a multi-digit percentage boost in performance is well worth it. That would be the case even if it was a few megabytes.

And to answer some of your questions:

Regarding memory usage: both the current implementation and the `BTreeMap` version track every previously written vtable, so the difference is only the map's bookkeeping overhead.

Regarding implementation options: I think both of your options are feasible, though they will of course complicate the code a bit. Another option would be to do it with feature flags.
For a long time I've thought it would be nice to make the vtable deduplicator user-configurable. We would just need to define the API for it. Do either of you think this is a good idea?
@TethysSvensson in your use case it may be insignificant, but for others it may not be. For comparison, the C++ implementation defaults to working with a single allocation for all data, including vtable tracking, and can work with zero allocations. Rust should want to be as close as possible to that, by default. @rw yes, that's essentially what I am suggesting, but more along the lines of making the linear vtable storage the default, and allowing people with special needs to plug in a tree-based one statically (or automatically at runtime, if that works better).
@aardappel As I said before: we know that we are not exactly the common use case. We understand that you might not want to support our needs as a default option, or at all for that matter. Our worst case here is that we have to continue maintaining our fork, which is not a lot of work at all. I made the PR to discuss whether this was something you wanted in one form or another, because there might be other users with similar needs. The part you mentioned about the C++ version has me a bit confused. I am not sure which parameters you are optimizing for here, and on which of them you consider the C++ implementation superior.
I ask because the ideal implementation might not look the same depending on which of these you want to prioritize the most.
Perhaps we should close this PR and have this discussion in an issue instead?
@TethysSvensson I didn't say we shouldn't support this use case. It seems totally useful to me. I just think it shouldn't be the default, but that comes from the assumption that Rust users have the same performance sensibilities as C++ users, and I am not (yet) a Rust user, so I don't know. The answer to your list is mostly "all of the above". My comment about C++ was just to illustrate the lengths we have gone to so far, and how that contrasts with a dynamically allocated tree (as a default). And I am not trying to decide this, just give context. @rw and other actual Rust users should. This PR is a fine place for the discussion: it already contains the discussion so far, and possible code for an implementation.
@aardappel Thank you for giving a bit more context. It seems I interpreted your reply as being more critical than it actually was. I think the easiest solution to implement would be to have an additional generic parameter on the builder for the vtable cache. As an alternative/complementary approach, perhaps it would be possible to create a reusable template for a given table type, something like this:

```rust
let mut weapon_template = WeaponTemplate::new();

weapon_template.name = Some(weapon_one_name);
weapon_template.damage = 3;
// Looks for a vtable in written_vtable_revpos, creating a new one as a fallback.
// This also stores the vtable offset inside the weapon_template struct.
let weapon1 = weapon_template.create(&mut builder);

weapon_template.name = Some(weapon_two_name);
weapon_template.damage = 4;
// Reuses the vtable offset found inside the weapon_template.
let weapon2 = weapon_template.create(&mut builder);

weapon_template.name = None;
weapon_template.damage = 4;
// Tries to reuse the vtable offset found inside the weapon_template;
// however, the desired vtable does not match, so we fall back to searching
// through written_vtable_revpos and potentially write a new vtable.
let weapon3 = weapon_template.create(&mut builder);
```
@TethysSvensson Would you want to explore defining a trait for pluggable vtable deduplicators? As test cases, we'd want to support a no-op deduplicator, the current linear deduplicator, and a heap-heavy deduplicator like yours from this PR.
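One possible shape for such a trait is sketched below. All names here (`VtableDeduplicator`, `find_or_record`, and both implementations) are hypothetical, not the library's actual API; the point is just that the no-op and heap-heavy strategies fit behind the same interface:

```rust
use std::collections::HashMap;

/// Hypothetical trait: given the serialized vtable bytes, either return the
/// offset of an identical vtable written earlier, or record the new offset.
pub trait VtableDeduplicator {
    /// Returns Some(offset) of a previously written identical vtable,
    /// or None after recording `offset` as the canonical copy.
    fn find_or_record(&mut self, vtable_bytes: &[u8], offset: usize) -> Option<usize>;
}

/// No-op deduplicator: every vtable is written anew, zero bookkeeping.
pub struct NoDedup;

impl VtableDeduplicator for NoDedup {
    fn find_or_record(&mut self, _bytes: &[u8], _offset: usize) -> Option<usize> {
        None
    }
}

/// Heap-heavy deduplicator: owns copies of the vtable bytes as map keys,
/// avoiding any borrow of the (growing) output buffer.
#[derive(Default)]
pub struct MapDedup {
    seen: HashMap<Vec<u8>, usize>,
}

impl VtableDeduplicator for MapDedup {
    fn find_or_record(&mut self, bytes: &[u8], offset: usize) -> Option<usize> {
        match self.seen.get(bytes) {
            Some(&prev) => Some(prev),
            None => {
                self.seen.insert(bytes.to_vec(), offset);
                None
            }
        }
    }
}
```

A builder could then be generic over `VtableDeduplicator`, with the linear scan as the default implementation.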
Yes, I would be interested in exploring that, but I have unfortunately been a bit busy for the last few months. I promised I would get back to you with some more concrete numbers, and I have finally managed to get around to re-running our profiling. For this test, I am running our application on my old laptop and using it to convert a ~350 MB file into a single ~200 MB flatbuffer (which we then compress and write to disk). Our application currently takes around 12 seconds to do this conversion. Of these 12 seconds, we spend ~3s on serialization. If I disable my vtable optimization, this part of the process increases to ~10s.
@TethysSvensson Sounds like more evidence in favor of pluggable vtable deduplicators using a trait :-]
I decided to also experiment with pluggable buffers. This would allow the flatbuffer library to work on embedded platforms without a heap. My current design uses these traits:

```rust
pub trait BufferStorage<'a> {
    type Allocator: Allocator<'a>;
    type Iterator: Iterator<Item = &'a [u8]>;

    fn allocator(&'a mut self) -> Self::Allocator;
    fn iter_bytes(&'a mut self) -> Self::Iterator;
}

pub trait Allocator<'a> {
    fn allocate(&mut self, data: &[u8]) -> Option<&'a [u8]>;
}

pub trait CacheStorage<'allocator, 'cache> {
    type Cache: Cache<'allocator>;

    fn initialize(&'cache mut self) -> Self::Cache;
}

pub trait Cache<'allocator> {
    fn lookup<A>(&mut self, data: &[u8], allocate: A) -> RawBuildOffset
    where
        A: FnOnce() -> (&'allocator [u8], RawBuildOffset);
}
```

This is not the simplest design possible, but it was the simplest I could come up with after having experimented a bit, given the design requirements I was trying to satisfy.
To prove the design, I have currently written a buffer storage implementation and a cache storage implementation.
I am imagining a flatbuffer library based around this API would be used something like this:

```rust
let mut flatbuffer = Flatbuffer::with_storages(MyBufferStorage::new(), MyCacheStorage::new());
// or alternatively:
let mut flatbuffer = Flatbuffer::with_default_storages();

for _ in 0..3 {
    let mut builder = flatbuffer.new_message();
    // [uses the builder to write a message]

    // This consumes the builder
    builder.set_root(object);

    for slice in flatbuffer.finished_message() {
        // [writes the slice to somewhere]
    }
}
```

@rw What do you think? Would this be an okay API to work with?
This turned out to be quite a large rewrite. I have support for several buffer backends and several vtable cache backends. To support these, I have unfortunately been forced to rewrite some of the APIs. The good news is that I think the new APIs are both more ergonomic and likely to result in faster code than the current ones. The bad news is that it is a lot of work, and I am worried about not being able to support some features (either current or planned ones). I am quite far along with my rewrite, though: the library itself works as expected and I have a working example of some generated code, but a number of TODOs remain.
@rw I'm hoping you will still be interested in merging this once I'm finished. I think it will be an overall improvement to the status quo, but I am very worried that I am missing some of the planned features. I am also worried that I am stepping on somebody's toes by doing such an extensive rewrite without talking to you about it first. Would you have time to review my design and whether it is something you would be interested in?
@TethysSvensson That's very exciting! You call it a "rewrite"... is that accurate? It seems to me that the existing code would be mostly unchanged, except that the builder initializer would accept a user-provided deduplicator.
Unfortunately it is a rewrite. I wanted to support basing the vtable cache on a `HashMap` keyed by slices of the already-written buffer. You cannot do that by having a single struct:

```rust
pub struct FlatBufferBuilder {
    owned_buf: Vec<u8>,
    // This hashmap contains references into owned_buf, which cannot be
    // expressed in Rust:
    cache: HashMap<&'??? [u8], Offset>,
}
```

I then thought I could get by with splitting up the struct like this:

```rust
pub struct Flatbuffer {
    owned_buf: Vec<u8>,
}

pub struct FlatbufferBuilder<'fbb> {
    builder: &'fbb mut Flatbuffer,
    cache: HashMap<&'fbb [u8], Offset>,
}

impl Flatbuffer {
    fn new_builder<'fbb>(&'fbb mut self) -> FlatbufferBuilder<'fbb> { ... }
}
```

This almost worked, except that the cached slices still cannot be allowed to outlive a reallocation of the buffer. Then I noticed the final problem: the current builder moves the existing data after every resize operation, so even plain offsets into the buffer are not stable while building.

At this point I decided to start working on a rewrite, mostly as a proof-of-concept to figure out what was actually achievable. I ended up with a full re-implementation -- and I am currently finding it very hard to reintegrate it with the existing code. Being a re-implementation, it naturally makes a couple of different design choices. For instance, in the generated code my design currently needs two different types.
@TethysSvensson Hmm, I see what you mean. The current logic uses offsets, not references, due to this kind of issue. Did you pursue an implementation with an offset-based approach?
( @rw did you mean @TethysSvensson instead of me? :) )
Woops, fixed! :-D
@rw I am not sure what you would use for your keys in the `HashMap`.
@TethysSvensson It would be a two-step process, similar to how it is now: look up an offset based on a hash, then check the underlying buffer for byte-for-byte equality. If I understand your approach correctly, you may be getting stuck while trying to use a standard hash table to store actual byte slice data. Instead of storing slices, I recommend just storing (hash, offset) key/value pairs. This has the smell of using a hash table to implement a hash table, but that should get you started. (It's possible that the best implementation of a hash table for this situation would be a custom one, but I wouldn't try that immediately.) (An alternative approach would be to implement a Hash function for offsets that somehow ties the data in the buffer to the offset, but I think that would run into lifetime issues.)
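The two-step scheme described here (hash to candidate offsets, then byte-for-byte confirmation against the buffer itself) could be sketched like this; the function name and map layout are illustrative, not code from the PR:

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::Hasher;

/// Hash the raw vtable bytes; any stable hasher works here.
fn hash_bytes(data: &[u8]) -> u64 {
    let mut h = DefaultHasher::new();
    h.write(data);
    h.finish()
}

/// Look up a previously written vtable via a (hash -> candidate offsets)
/// index, confirming each candidate byte-for-byte against the buffer.
/// No slices borrowed from the growing buffer are ever stored in the map,
/// which sidesteps the lifetime problem.
fn find_vtable(
    buf: &[u8],
    index: &HashMap<u64, Vec<usize>>,
    vtable: &[u8],
) -> Option<usize> {
    let candidates = index.get(&hash_bytes(vtable))?;
    candidates
        .iter()
        .copied()
        .find(|&off| buf.get(off..off + vtable.len()) == Some(vtable))
}
```

The byte-for-byte check also makes hash collisions harmless: a colliding candidate simply fails the comparison.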
@rw I think one of us is misunderstanding the other. I already have a fully working re-implementation, using a hashmap+bumpalo-based cache. What I am stuck at is how to do this as an incremental improvement to the current codebase.
@TethysSvensson Totally understandable how we could be misunderstanding each other. Thanks for trying to clarify! I'm essentially saying that the algorithm that builds the table (that moves data after every resize operation) is core to how the builder works. This has natural implications for how auxiliary data, like offsets, are handled during construction ops. It sounds like you changed that core algorithm, is that correct?
Correct, but not by choice. I did it because I believe that the current algorithm is incompatible with the goals I am trying to achieve, i.e. faster performance using hashmap+bumpalo, while also supporting heapless contexts and contexts with very bounded memory.
@rw Status update: I have documentation for almost the entire support library and I am more than halfway through rewriting the C++ codegen as well. I will try to get it into a form where I can push it to this branch, so you can take a look. In the meantime I have run into some open questions. My main problem is that I do not know how to handle invalid/missing/malicious data when deserializing. As I see it, there are three options:

- Silently fall back to the schema's default value whenever the data is invalid.
- Make every accessor fallible, so the caller has to handle the error.
- Panic when invalid data is encountered.

I dislike the second option, because it makes the API harder to use. I dislike the third one, because I think it is a potential DoS vector, and it should be up to the users of the library to decide how to handle invalid data. So far I have tried to go with the first option, and I have verified that the current code will never panic, at least not in my example code. However, I am beginning to wonder if this is the right choice. If e.g. an enum contains an invalid value, I am not sure it is meaningful to turn that into the default value for that enum. What do you think I should do here?
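The first option could look roughly like the sketch below for a generated enum; the `Color` type, its variants, and the `from_wire` name are made up for illustration, not taken from the actual generated code:

```rust
/// Hypothetical generated enum with an explicit schema default.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Color {
    Red,
    Green,
    Blue,
}

impl Color {
    /// The schema's default value for this field.
    const DEFAULT: Color = Color::Red;

    /// Decode a wire value. Unknown (possibly malicious) discriminants are
    /// silently mapped to the default instead of erroring or panicking.
    fn from_wire(v: u8) -> Color {
        match v {
            0 => Color::Red,
            1 => Color::Green,
            2 => Color::Blue,
            _ => Color::DEFAULT,
        }
    }
}
```

The open question in the comment above is exactly the last match arm: whether mapping an unknown discriminant to `Color::DEFAULT` is meaningful, or whether the accessor should surface the error instead.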
@TethysSvensson I can tell you're putting a lot of effort into this! That said, the PR you're describing sounds enormous. As I said above, I think it's likely that only the Builder initializer would need to change, to accept user-provided deduplicators. I don't see the connection to the code generator. Could you explain more of your ideas here in words?
My core goals are:

- Faster serialization, via a hashmap+bumpalo-based vtable cache.
- Supporting heapless contexts.
- Supporting contexts with very bounded memory.

The current code is not very compatible with those goals, for multiple reasons.
I can tell you more about how my code works, but to be honest I would rather show you; I think that makes more sense.
@TethysSvensson Replying inline to your three stated goals. WDYT?
I have a working prototype now, including the C++ codegen. I will fiddle a bit more with it and push it sometime tomorrow afternoon (EU timezone).
This pull request is stale because it has been open for 6 months with no activity. Please comment or this will be closed in 14 days.
Closing. See my comment on #5729 for context.
When profiling code, we noticed that a lot of time in our application was spent inside the `find_duplicate_stored_vtable_revloc` function. Currently this function works by doing a linear scan of all the previous vtable positions to look for a matching one.

This PR improves upon the status quo by splitting the `written_vtable_revpos` field into multiple `Vec`s instead of just a single one. We place these `Vec`s in a `BTreeMap`, and we index into this map using the key `(vtable_byte_len, table_object_size)` to find out which `Vec` to use. After having found the correct vector to check, we do a linear scan of this vector using the same logic as the current implementation.

This PR will increase heap usage slightly, as it introduces a `BTreeMap` to manage the `Vec`s. Even so, I believe this to be an acceptable sacrifice and it should still be a net improvement for most applications.

We use a `BTreeMap` instead of a `HashMap`, because that was what gave the best result for our application. I have not done any other benchmarks, so I am not sure if our application is representative.

I am sure this could probably be improved even further, but I do not understand the code well enough to write a more invasive change.