Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add "vector of union" feature to FlatBuffers C++ #4143

Closed
wants to merge 1 commit into from

Conversation

b-nonymous
Copy link
Contributor

@b-nonymous b-nonymous commented Jan 19, 2017

Currently to have a vector of union in FlatBuffers, unions have to be encapsulated by tables. However, tables have certain overhead, e.g., it has to keep vtable per table instance, and store offsets instead of actual values for many value types. vtable itself has a few bytes overhead for store the vtable metadata, and the actual indices.

In order to minimize the overhead, improve the coding throughput and reduce the wire format size, it would be nice to support the "vector of union" feature, without paying the one table per union cost.

wvo@ pointed out that being non-invasive to the existing coding format is one of the key issues for implementing this feature.

We agreed on making the vector of union into 2 parallel fields: a vector of types and a vector of values. This will not break any existing encoding and has efficient storage. The downside is, iterating the vector has more work, as there are two underlying vectors to read from instead of just one.

This CL only adds the feature to CPP. There will be a followup CL that adds JSON support. Supports of other programming languages will be added later.

@b-nonymous b-nonymous changed the title Add "vector of union" feature to C++ Add "vector of union" feature to FlatBuffers C++ Jan 19, 2017
Copy link
Collaborator

@aardappel aardappel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome PR.. some minor changes needed.

@@ -356,10 +356,25 @@ struct IDLOptions {
bool allow_non_utf8;

// Possible options for the more general generator below.
enum Language { kJava, kCSharp, kGo, kMAX };
enum Language {
kJava = 1,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

1 << 0 :)

@@ -510,6 +511,14 @@ inline bool VerifyEquipment(flatbuffers::Verifier &verifier, const void *obj, Eq
}
}

inline bool VerifyEquipmentVector(flatbuffers::Verifier &verifier, const flatbuffers::Vector<flatbuffers::Offset<void>> *values, const flatbuffers::Vector<uint8_t> *types) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we cannot be adding vectors of unions to the samples, since this sample is used by all languages. It has to be a separate schema and test until support is ported to all.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

actually the sample schema is not changed. the code here generated is due to the change of the GenUnionPost, which is invoked on the existing union.

code_ += "inline " + UnionVectorVerifySignature(enum_def) + " {";
code_ += " assert(values->size() == types->size());";
code_ += " for (flatbuffers::uoffset_t i = 0; i < values->size(); ++i) {";
code_ += " if (!Verify" + enum_def.name + "(verifier, values->Get(i), types->GetEnum<" + enum_def.name + ">(i))) return false;";
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

make sure lines fit in 80 columns (google c++ style guide)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@@ -510,6 +511,14 @@ inline bool VerifyEquipment(flatbuffers::Verifier &verifier, const void *obj, Eq
}
}

inline bool VerifyEquipmentVector(flatbuffers::Verifier &verifier, const flatbuffers::Vector<flatbuffers::Offset<void>> *values, const flatbuffers::Vector<uint8_t> *types) {
assert(values->size() == types->size());
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

since the sizes of these arrays are created by whoever builds these arrays, returning false is better than an assert. It be nice to add an assert in the builder code, but I am not sure where.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i'd love to return a error code / message so it's easier to debug. for now i'd just return false as you suggested. any plan/demand to add such code/msg?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The task of the verifier is to find out if a buffer is safe or corrupt, and do so quickly, and by, default, without asserts. If a buffer is corrupt, that could have all sorts of reasons, including bad transmissions etc, so errors are not necessarily that useful.

But yes, rather than just return, you should wrap the condition in Verifier::Check, which allows anyone to debug a verifier issue.

return &language_parameters[idx];
}
}
return nullptr;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

assert.. this function should never be called with a language that is not in the table.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it could be: if (lang == java) return .. else { assert(lang == C#); return .. }

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@@ -613,6 +605,17 @@ CheckedError Parser::ParseField(StructDef &struct_def) {
type.enum_def->underlying_type, &typefield));
}

if (type.base_type == BASE_TYPE_VECTOR && type.element == BASE_TYPE_UNION) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

else if

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@@ -613,6 +605,17 @@ CheckedError Parser::ParseField(StructDef &struct_def) {
type.enum_def->underlying_type, &typefield));
}

if (type.base_type == BASE_TYPE_VECTOR && type.element == BASE_TYPE_UNION) {
// Only cpp supports the union vector feature so far.
assert(opts.lang_to_generate == IDLOptions::kCpp);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

has to be an error instead of assert, since it comes from flags supplied by user?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. make sense.

TEST_EQ(flatbuffers::LoadFile(
"tests/union_vector/union_vector.fbs", false, &schemafile), true);

// parse schema.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

note that you're parsing this schema, but it is not used in the code below. for this to be useful, you should also load a json file with such a vector in it?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Discussed offline. Added a TODO.

const flatbuffers::Offset<Movie> movie_offset =
CreateMovie(fbb, fbb.CreateVector(types), fbb.CreateVector(characters));
FinishMovieBuffer(fbb, movie_offset);
uint8_t *buf = fbb.GetBufferPointer();
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use auto whereever possible

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done for flatbuffers::Offset types. I personally find buf concrete type to be pretty useful.

FinishMovieBuffer(fbb, movie_offset);
uint8_t *buf = fbb.GetBufferPointer();

// verify.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

leave out comments that don't add to what the code already says.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@aardappel
Copy link
Collaborator

@mikkelfj you may be interested in this PR, since we want to add support for other languages eventually.

@aardappel
Copy link
Collaborator

@bei, oh, and besides the comments above, some minimal additions to the docs would be good. This can be very minimal (in Schemas.md) since it is only supported by C++, and should be noted as such.

Also, JSON parsing/generating would be awesome in this PR or a followup.

@mikkelfj
Copy link
Contributor

mikkelfj commented Jan 23, 2017

Haven't read through it all, but to the original point at top

We agreed on making the vector of union into 2 parallel fields: a vector of types and a vector of values. This will not break any existing encoding and has efficient storage. The downside is, iterating the vector has more work, as there are two underlying vectors to read from instead of just one.

Using two vectors is likely very efficient. First, the type field is a byte so it packs better. In the cpu it will fit comfortably in data cache and you avoid 3 wasted bytes of padding both in cpu cache and in buffer format. Second, you are possibly only interested in specific types in the union. If so, you need to access much less memory and can find the offset to the actual object with a simple array lookup. Furthermore this lookup is more likely to be in lookahead cache because it is more densely packed. The initial lookup could be more costly to feed the cache with both type and vector, but if it is a small vector, both are likely cached. If it is a large vector, then having the separate is all the more important.

@aardappel
Copy link
Collaborator

@mikkelfj : glad you agree! but yes, besides efficiency, a big motivation was that a pair of type/union would require an entirely new kind of type in FlatBuffers, whereas this doesn't.

@mikkelfj
Copy link
Contributor

It is probably too late now. But one thing worth considering it requiring the layout to be a type vector padded and immediately followed by the table vector. This will enable an implementation to reference the union vector with just one pointer which is highly practical while still allowing the union vector to appear as two normal vectors to other code. A verifier would enforce this. Otherwise the C implementation needs to define a struct to carry a dereferenced union vector.

@mikkelfj
Copy link
Contributor

Sorting, searching, key attribute if all tables have a key field of same type (and possibly same name)?

type.element == BASE_TYPE_UNION) {
// Only cpp supports the union vector feature so far.
if (opts.lang_to_generate != IDLOptions::kCpp) {
Error("The union vector feature is C++ only.");
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"Vectors of unions are not yet supported in all the specified programming languages"

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@aardappel
Copy link
Collaborator

@mikkelfj : I see your point, but such a special encoding requires significant new support code across languages. The beauty of the current implementation is that it is both efficient and a very minor extension of the code that is already there.

@b-nonymous
Copy link
Contributor Author

closed the PR by accident. Trying to squash the commits by using "
git reset $(git commit-tree HEAD^{tree} -m "commit message")", then "git push -f origin vector_unions". The PR is then closed. Now trying to recover.

@b-nonymous b-nonymous mentioned this pull request Jan 24, 2017
@b-nonymous
Copy link
Contributor Author

Restored the PR at #4148

@jsanmiya
Copy link
Contributor

Ugh. It's too bad we can't reopen this PR. We'll submit the new PR though (linked above), which is the same change.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants