Cached validation #126
Conversation
Great job on the performance analysis! I'll try to get this reviewed and merged soon.

Thanks @wravery, we're using this in an embedded system and it's very sensitive to memory pressure. I have other patches in my https://github.com/profusion/cppgraphqlgen/tree/perf, but I'll wait for this one to be reviewed and merged so I can rebase and propose the other PR.
wravery
left a comment
I think a lot of the benefits of this change could be achieved just by keeping a `std::unique_ptr<ValidateExecutableVisitor>` alive in a member on the `Request`. Would you try that change as a smaller comparison? I think it would be a lot easier to reason about it that way than by passing the response to the `IntrospectionQuery` into the `Request`. If that one change is enough to get similar results, I'd prefer to do that.
BTW, depending on your scenario and how much query caching you can do, you might be able to erase the impact of validation entirely after an initial parse. The sample test case would have 0 overhead for validation after the first iteration if the `peg::ast` variable was declared outside of the loop. After validation it sets `ast.validated` to true, and it skips validation after that.
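A minimal sketch of that suggestion (the query text and the exact `resolve()` signature here are illustrative, not taken from the sample):

```cpp
// Hoist the AST out of the loop so validation runs exactly once.
auto ast = peg::parseString(R"gql({ fieldName })gql");

for (size_t i = 0; i < 100; ++i)
{
	// The first resolve() validates the query and sets ast.validated = true;
	// every later iteration reuses the same peg::ast and skips validation.
	auto result = service->resolve(nullptr, ast, "", response::Value(response::Type::Map)).get();
}
```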
include/Validation.h
Outdated
```cpp
std::shared_ptr<ValidateType> type;

ValidateArgument() = default;
ValidateArgument(std::shared_ptr<ValidateType>& type_)
```
`const std::shared_ptr<ValidateType>&`?
Alternatively, if you move the declaration of `type` to the top, the default constructor/initializer order should just do the right thing, with default values for the other 2 members.
Actually, that applies to `ValidateType` as well. You could omit the constructor overrides and use initializer syntax and default constructors for pretty much every struct.
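A sketch of what that would look like, assuming the constructors are removed (the members besides `type` are illustrative):

```cpp
// With no user-declared constructors the struct stays an aggregate, so
// default member initializers and brace initialization cover every case.
struct ValidateArgument
{
	std::shared_ptr<ValidateType> type;
	bool defaultValue = false;
	bool nonNullDefaultValue = false;
};

// Callers brace-initialize instead of calling a hand-written constructor:
// ValidateArgument argument { std::move(argumentType) };
```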
OK, will change that one.
include/Validation.h
Outdated
```cpp
std::shared_ptr<ValidateType> returnType;
ValidateTypeFieldArguments arguments;

ValidateTypeField() = default;
```
I think all of these constructors match the default compiler-generated constructors. You could just omit them.
include/Validation.h
Outdated
```diff
 bool isInputType() const;
-ValidateType getType();
+std::shared_ptr<ValidateType>&& getType();
```
Should return by value. From what I've heard, return-value optimization (RVO) works better that way.
To get move semantics you can move to a local variable (e.g. `auto result = std::move(value);`) and then return the local variable by value, and all the RVO goodness should apply.
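For example (a sketch; the member name `_type` is illustrative):

```cpp
std::shared_ptr<ValidateType> getType()
{
	// Move into a named local and return it by value: NRVO can elide the
	// copy, and at worst this is a cheap move. It also avoids the dangling
	// reference that returning std::shared_ptr<ValidateType>&& can produce.
	auto result = std::move(_type);
	return result;
}
```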
Oh really? I changed some of those to make it shorter 👀
Seriously, the more I look into C++, the more I dislike this language 😆
src/Validation.cpp
Outdated
```cpp
// This is taking advantage of the fact that during validation we can choose to execute
// unvalidated queries against the Introspection schema. This way we can use fragment
// cycles to expand an arbitrary number of wrapper types.
ast.validated = true;
```
Using the canonical `IntrospectionQuery` does have this limitation: it can only follow links to a fixed depth. It's compatible with more tools, but it does put a limit on the complexity of the schema.
I get this, but while there may be a combination of lists, non-null, and the actual type, it's pretty unusual in real life, even more so given that it would not work in any other GraphQL tool either.
Do you want me to move the non-standard recursive query here and use it?
I went ahead and merged that change as part of #131, so if you want to repeat your measurements with that version, we can see how effective it is compared to the full change in your PR. I can try to take my own before-and-after measurements, but I won't be able to compare them directly to your environment or results.
@wravery I'll spare some time to do the measurements and resolve the conflicts. I also have the … However, just keeping the visitor alive as you did will not solve all the issues: namely, it will not allow me to remove the introspection from the production server, and there is no optimized lookup for types (you keep the not-so-cheap maps, lookups, conversions to kind, and generation of type strings to make comparison simpler).

Also, not sure you've noticed, but in my code you can still get the introspection if you do not provide the values; it's backward compatible. However, in the near future I plan to generate the introspection results in code, then we can avoid the introspection query, only building the results using …

Problem is that we're still too slow, at least compared to Apollo: we're about 3x slower, and Apollo in turn is like 2-3x slower than Go implementations. So we need some more work; my initial PR is not enough.
Massif using fb4a589: it's running around 374KB, while my branch runs at 322KB. DHAT: this improved a bit compared to mine (224,575), likely because of the lazy … I'll rebase my work on top and see how both play along together (but still avoiding individual …).
Doing the introspection query all the time is hurting performance, and the schema does not change, so a single query can be done with all the fields, used to build a validation tree that is then consulted for all validations. This is the first part, moving the top-level query to be read-only. The next commits will eliminate the other `executeQuery()` calls; then a shared context will be created and hosted by the `Request` class, which can be discovered using introspection or fed using JSON (`schema.json`).
This is an incremental commit: just make use of the read-only data instead of the `release` primitives, which allows sharing the query results.
Split the fields getter and the cache/insertion into the map so they can be used later in a different way. There should be no code differences, just moving the internal branch to its own function. In the next commits this will be removed from the getter, which will become query-only since the types will all be cached beforehand.
Wouldn't you need to edit the generated files to do this? Alternatively, maybe there should be a switch for … If we modify the …
Interesting, I never tried a direct comparison with either of those. Partly that's because I've been thinking of it as filling a different niche, specifically interop with existing C++ code in a hybrid web or React Native client. Upon handing off to the JS UI code, most of the native perf concerns become less relevant; they're generally orders of magnitude faster/cheaper just being native. I do mostly desktop development, so even Electron is generally fast enough. So in your scenario, are you running just a GraphQL service on the device and handling the results elsewhere? Can you share a sample for either of those alternatives so I can see what we're up against?
This handles OBJECT, INTERFACE, UNION and INPUT_OBJECT types. It should have no behavior change, just moving code around. Minor adjustments were made to cope with the iterator return …
It should have no behavior change, just moving code around.
This uses the information being queried in the introspection and allows the fields and input fields to be processed in one go.
This is another step to split the visitor from the lookup data structures; in the future the lookup will be shared.
`ValidateExecutableVisitor` was split into a lookup data structure (`ValidationContext`) and the actual visitor. The lookup data structure is shared across requests, saving queries and processing.
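A rough sketch of the shape this split produces (method and member names are illustrative, not necessarily the ones in the PR):

```cpp
#include <memory>
#include <string_view>

struct ValidateType;

class ValidationContext
{
public:
	// Built once, from an introspection query or parsed JSON, then shared
	// read-only by every request.
	std::shared_ptr<ValidateType> getNamedType(std::string_view name) const;
};

class ValidateExecutableVisitor
{
public:
	// Each request constructs a lightweight visitor that borrows the shared
	// context instead of re-running introspection against the schema.
	explicit ValidateExecutableVisitor(std::shared_ptr<const ValidationContext> context);

private:
	const std::shared_ptr<const ValidationContext> _context;
};
```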
We do not need a map; there are only 3 well-defined names.
I was thinking a …
Yeah, this is my ultimate goal. I'm close to that in my PR (still cleaning up): first I'm working on the data structures (which I should push tomorrow or so), then I'll generate this `ValidationContext` directly in code.
Usually, yes. But in my case there is no rendering being done, just data being normalized for another device (web, Android, ...) where the GraphQL data is displayed.
Yes, this is an embedded device and it will normalize various sources as GraphQL queries. I cannot disclose much at this point (working for a customer under NDA), but we generate the resolvers to access some sources in C++. As the number of sources and properties is large and the hardware is underpowered, we ran into performance issues; that's why I'm trying to fix them.

👍 No problem, I was just hoping you already had benchmarks you could share for Apollo or Go.
@wravery pushed what I have done so far, and changed what you pointed out in the first review (commits were edited, so take a look at them again; there are no fixup commits). I've reworked the … Adding fields to inputs/objects was split into a second iteration; this way we know for sure the named types exist and will use those references. Things like … Also did some work to reduce the memory allocations, moving some of the … This PR now contains the …
Weird, on macOS/clang it's not giving that error. I'll test on Linux.
Instead of using a map with properties `name`, `kind` (string) and `ofType` (another map), use a set of custom classes with `kind` as an enumeration and `ofType` as a shared pointer. This allows much simpler handling and comparison; there's no need to serialize to a string just to make comparison simpler. We can also store references to types, know which kind it is (i.e. isInputType?), and save memory by using references, in particular for common types such as Int, Float, String... The matching types and fields are stored as part of each `ValidateType` specialization.
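A hedged sketch of the idea (the PR's real `ValidateType` is richer; this flattens it for brevity):

```cpp
#include <memory>
#include <string_view>

enum class TypeKind { Scalar, Object, Interface, Union, Enum, InputObject, List, NonNull };

struct ValidateType
{
	TypeKind kind;                        // enumeration instead of a "kind" string
	std::string_view name;                // empty for LIST / NON_NULL wrappers
	std::shared_ptr<ValidateType> ofType; // wrapper types chain here

	// Structural comparison, no string serialization required.
	bool operator==(const ValidateType& other) const
	{
		if (kind != other.kind || name != other.name)
		{
			return false;
		}

		return ofType == other.ofType
			|| (ofType && other.ofType && *ofType == *other.ofType);
	}
};
```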
Instead of 2 maps + a set (both ordered), use one single `unordered_map` (keyed by `string_view`) + an `unordered_set` (pointer to definition). The `string_view` is okay since the `ast_node` tree is valid during the processing, so the references are valid. The pointer to the definition is also okay for `_referencedVariables`, since the definitions are all created upfront and the map (thus the references) won't change while visiting the fields. The 2->1 map reduction was possible since we're now storing the definition location instead of using a second map just to store the `ast_node` used to query the position on errors.
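Roughly (the value types are illustrative; the stored definition node doubles as the error position, which is what removes the second map):

```cpp
#include <string_view>
#include <unordered_map>
#include <unordered_set>

namespace peg { struct ast_node; } // stand-in for the parser's node type

// Visitor members (sketch). string_view keys are safe because the ast_node
// tree outlives the whole validation pass, so the views never dangle.
std::unordered_map<std::string_view, const peg::ast_node*> _variableDefinitions;
std::unordered_set<const peg::ast_node*> _referencedVariables;
```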
Pre-allocate a vector and populate it, then iterate directly instead of using a queue.
Instead of always creating a recursive resolver, which in turn may call `std::async()`, only do that if the result is not readily available. The wrap is not free and is, more often than not, useless.
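The readiness check could be as simple as this sketch (the helper name is assumed):

```cpp
#include <chrono>
#include <future>

template <typename T>
bool isReady(const std::shared_future<T>& value)
{
	// A zero-length wait tells us whether the value can be fetched without
	// blocking; only when it can't is the async wrapper worth paying for.
	return value.wait_for(std::chrono::seconds(0)) == std::future_status::ready;
}
```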
Just minor tweaks to make it compile: moving the template functions to the header and marking virtuals as final. In the next commits it will be moved to more public usage, including the generator.
Soon there will be a generated `ValidationContext`, so we won't need to carry any of the introspection bits.
This is basically moving code around, changing the parameters to allow the request to receive the `ValidationContext`. Bring back the original `Request()` constructor so it will not break dynamically linked binaries. The introspection results are kept around; in the future the validation context will only use pointers to strings (`string_view`), and everything must stay alive in order to work.
@wravery finally got everything to work 😅 It became HUGE; the commits are not in the best order possible, as I fixed some issues while I was reading/measuring the code paths. The code generator now outputs … You can skim through the commit messages to see all that was changed, but a summary is: …

DHAT is impressive: 125,212,274 -> 9,472,310 (max), with half of the memory used by … This massif chart should give you a clear picture of the final results: there is no peak anymore and memory is stable at ~300KB.
Damn, the test failed on CI, related to …
This avoids the introspection query and simplifies the build of the lookup maps.
The generated file contains `#ifdef SCHEMAGEN_DISABLE_INTROSPECTION`; if that is set, the introspection blocks will be disabled:
- no `__schema` and `__type` resolvers
- no `AddTypesToSchema`
- no `_schema` field
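An illustrative shape of the guarded output (the macro name comes from the commit message; the surrounding generated code is assumed):

```cpp
#ifndef SCHEMAGEN_DISABLE_INTROSPECTION
	// The __schema / __type resolvers and the introspection type
	// registration only exist when introspection support is compiled in.
	introspection::AddTypesToSchema(schema);
#endif // SCHEMAGEN_DISABLE_INTROSPECTION
```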
@wravery now it includes all the introspection stuff (I did a complete generation; even if the current schema doesn't contain any enums or input types, if we add those in the future the generator will just work). It also includes …
Running the sample without introspection, the results are: DHAT 125,212,274 -> 6,522,087; 49% of the allocated memory is in graphqlpeg, followed by … Massif reports less than half the memory used: …
Provide `push()` and `pop()` convenience methods so it's the same as `queue`. `list.size()` is not as fast (it walks the list counting the elements), however these lists are often small enough for that not to matter.
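A hedged sketch of those wrappers (the actual container and names in the PR may differ):

```cpp
#include <list>
#include <utility>

template <typename T>
struct FieldQueue
{
	std::list<T> values;

	// Mirror the std::queue interface: push at the back...
	void push(T value) { values.push_back(std::move(value)); }

	// ...and pop from the front (returning the value for convenience).
	T pop()
	{
		T value = std::move(values.front());
		values.pop_front();
		return value;
	}
};
```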
Change `SelectionSetParams` to keep an optional reference to the parent; this way we don't need to build the path over and over again just to add one element, and instead create it on demand. Since `errorPath` was accessed directly, this breaks the existing code: it became a method that dynamically computes the error path (recursively). This is important since we only pay the list-copy price when there is an error, not on every field resolution.
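A hedged sketch of the on-demand path (the real `SelectionSetParams` carries much more state, and the segment type is simplified to a field name here):

```cpp
#include <list>
#include <string_view>

struct PathSegment
{
	std::string_view fieldName;          // this level's field (or list index)
	const PathSegment* parent = nullptr; // link to the parent instead of a copied list
};

// Build the full path only when an error is actually reported, so ordinary
// field resolution never pays the list-copy price.
std::list<std::string_view> buildErrorPath(const PathSegment* segment)
{
	std::list<std::string_view> path;

	for (; segment != nullptr; segment = segment->parent)
	{
		path.push_front(segment->fieldName);
	}

	return path;
}
```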
Use this specific constructor in the list converter, creating one `itemParams` with the new `ownErrorPath`, instead of changing the wrapper request param.
We shouldn't modify the parameters; using a string causes it to copy the field name, which is particularly bad when processing huge lists (it would copy the name for each item).
We don't change it anymore and we don't push to the array, so we can keep it inline in the parent structure, avoiding the extra allocation.
@wravery with this last commit, peg is 68% of the allocation; everything else runs much smoother. I'm running out of time to work on more optimizations, but if you know how to get peg to play nicer with memory, let me know.

Sounds good. I'm going to make a cleanup pass to make sure it's consistent with the rest of the project, and then I should be able to get it merged sometime this week. Thanks for this contribution!
I forgot to reply to this one. I can't share many details due to the NDA, but raw numbers (C++ runs with …):
"Pre Optimizations" is 3add6d3, BEFORE your cached validation visitor. By far that was the biggest source of slowness. Just that helped a lot, however other changes like changing the converters to be more efficient, remove some useless This test is an artificially generated schemas, one is deeply nested, the other is a huge flat schema. We're querying 50 leaf fields, 500 queries over the network (HTTP/GET), using websocketpp in the C++ version. We don't have it written in Go to say for sure, but given this https://github.com/appleboy/golang-graphql-benchmark and https://github.com/the-benchmarker/graphql-benchmarks/blob/develop/rates.md we can estimate how slow JS is compared to Go. |
|
While reviewing this, I thought of another approach that I'd like to take. Rather than building a separate …

TL;DR: I'm not driving the validation through a separate hierarchy of cheaper validation objects, I'm driving both validation and introspection through a separate hierarchy of cheaper introspection objects. Also, instead of using pre-processor directives, I added a …

I have a little more cleanup on this approach to go, and I want to try to rebase or merge some of your other fixes on top of that, but the memory savings seem very promising. I also noticed my unit tests run about 2-3 times faster 🎉! Here's what I got from debug builds running on WSL 2 (using a …):

- massif against master
- massif against this PR

I re-ran this after confirming my copy of this branch was up-to-date, and I'm still getting very different results from your last update, more in line with the previous update:

- massif against my version
- massif against my version with `--no-introspection`

This is generating the compact schema representation, but it blocks loading the …

The branch where I'm working on this is https://github.com/wravery/cppgraphqlgen/tree/merge-cached-validation. If you want to run any of your own tests on that, you can.

Remaining work: …
@wravery that's okay, just be sure to compare to my latest version, since you report … As for the …

Meanwhile … was triggered by the ast stuff; as I never worked with PEGTL it took me a while, but https://github.com/profusion/cppgraphqlgen/compare/cached-validation...profusion:parser-tweaks is evolving. What is left is a way to cache the …

Do you know of some simple way to get this memory arena/allocator pool done? Currently PEGTL generates A LOT of useless nodes; every node is first built only to be discarded later, which causes too much pressure on the memory allocator. So far, without the arena/pool, it removes around 10KB from the peak:
Got it. I made a separate … And this is … Almost all of the savings apply with or without …
Interesting, so it looks like this is meant to ignore the schema definition part of the grammar, correct? That makes sense for most purposes, and the same optimization could be used in reverse for the schema generator. However, the validation section of the spec specifically mentions rejecting documents for execution if they have any non-executable elements (and vice versa for schema definitions, IIRC). This will convert an error about that specifically into a parser error. If you split the document rules into separate executable and schema documents, no single document should satisfy both, but it might be nice to add a handler for the parse error which checks whether it matches the unified document grammar and converts that to the same error message about how they shouldn't be mixed together. It should even be possible to parse the grammar and see if it matches without executing any of the actions, so parsing against the unified … I would also swap the meaning of the two …
I think you can still inherit from … For simplicity, high throughput, and data locality, I would suggest using a `std::deque` for the node mempool:
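The original snippet wasn't preserved in this thread, so here's a reconstructed sketch of the idea (every name is an assumption):

```cpp
#include <deque>
#include <vector>

template <typename Node>
class NodeMemPool
{
public:
	// std::deque allocates fixed-size chunks, so most adjacent nodes share
	// cache lines and the allocator is hit once per chunk, not once per node.
	Node* allocate()
	{
		if (!_recycled.empty())
		{
			Node* node = _recycled.back();
			_recycled.pop_back();
			return node;
		}

		return &_nodes.emplace_back();
	}

	// Discarded parse nodes are reset and recycled instead of freed, which
	// suits a parser that builds many short-lived nodes.
	void release(Node* node)
	{
		*node = Node {};
		_recycled.push_back(node);
	}

private:
	std::deque<Node> _nodes;      // chunked storage with stable addresses
	std::vector<Node*> _recycled; // free list of reusable nodes
};
```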
The tradeoff is allocating a few extra nodes (based on the chunk size) to round up, but if they're empty that's not a lot, and it will greatly decrease the number of calls to the allocator, in addition to making most adjacent nodes fit together in a cache line. Since the actual parsing is single-threaded, but parsing might happen concurrently on multiple threads, you might try adding the …
More random thoughts about the mempool: …
There are a lot of ways to misuse … Generally, replacing … A couple of other points about `std::string_view`: …
Never mind, this was not as efficient as I hoped. The overall memory usage is higher and it's a little slower. I was able to plug in a …
I tested that hack as well, but only quickly, without checking the number of entries. What I found earlier with some prints is that the quick alloc-dealloc pairs were hitting the malloc cache, so it was not going to the kernel. However, that reuse improved only once I reduced the …
Introduction
cppgraphqlgen validates each query against a schema before it's executed.

Most tools (e.g. Apollo) use a given schema to work on; most will load it using a JSON file resulting from an introspection query (usually called `schema.json`). However, `cppgraphqlgen` does this by running an introspection query prior to executing each query. This is bad for two reasons: … In particular the last point is hurting us, since for large schemas (or small user queries) the cost of each introspection is way larger than the user query itself, so the validation takes more time than the execution.
This PR moves the schema information used to do validation into a class `ValidationContext` that is hosted by `graphql::service::Request` and shared with each `ValidateExecutableVisitor` used. The `ValidationContext` can be created using an introspection query on the service, or from a `response::Value` with the results of such a query, which is useful for people using `parseJSON()`. In addition to that, `ValidateType` is a specific structure instead of the more expensive `response::Value`; this allows lower memory usage and faster/simpler handling due to direct access to `kind` (as an enum), `name`, and `ofType` (as a shared pointer).

Test Environment
The following tests were executed on Ubuntu 20.04.1 LTS (Focal Fossa) running in Docker on macOS, compiled with GCC 9.3.0 and running kernel 5.4.39-linuxkit.
The test was done using a loop of 100 iterations inside `samples/today/sample.cpp`. The input query: …
valgrind --tool=massif
Valgrind provides a heap profiler called Massif.
As seen below in the `ms_print` results, the introspection query costs much more than the actual query (since it's small). If we keep redoing the introspection on each query, we keep up the memory pressure. This may also lead to fragmentation that causes more memory to be used, as seen in the final snapshot of each.

Notice there is a higher peak using Cached, since the introspection is handled as read-only (values are not released as the `ValidationContext` is built). This was done to enable the `response::Value` to be used elsewhere, as well as being loaded from `parseJSON()`. However, after the peak (695,248) it goes down (324,736) and remains mostly stable, around 50KB smaller than the Pristine solution.

The results were edited to present only the most relevant information.
Pristine results (commit: 3add6d3)
Cached Validation Results
valgrind --tool=dhat
Valgrind provides a dynamic heap analysis tool called DHAT.
As seen in the results from `dh_view.html`, while we have a peak (t-gmax: 629KB) as explained under `valgrind --tool=massif`, we use far fewer bytes, with far fewer reads and writes (total), and increased memory utilization (reads/writes are greater than total): … That's 22 times less memory used and around 15 times less memory being read and written.
Comparing the most important allocation origins:
- … 27 times better.
- … 28 times better.
- … 28 times better.
- … 28 times better.
- … 17 times better.
The results were edited to present only the most relevant information.
Pristine results (commit: 3add6d3)
Cached Validation Results