
Conversation


@wravery wravery commented Dec 14, 2020

Fixes #126

This branch incorporates most of the changes from PR #126. Rather than building a separate schema structure specifically for validation, it generates an optimized representation of the schema that is shared between Validation and Introspection. The schemagen tool still has a --no-introspection option, but that is now mostly for cases where you want Introspection disabled outright, rather than just for the performance benefit; the overhead of re-enabling Introspection on top of the shared schema is minimal.

There's more to do to optimize allocations in the peg::graphql grammar and parsing, but I think we'll target a separate PR for that.

@wravery wravery merged commit cde90d7 into microsoft:master Dec 14, 2020
@wravery wravery deleted the merge-cached-validation branch December 14, 2020 06:09
```cpp
MapData copy {};

copy.map.reserve(other.size());
for (const auto& entry : other)
```
Contributor:

Did you test whether this copy is triggered, and how frequently?

In my PR I did some tests and found that it wasn't needed if we copy-on-write. That's why I had a shared_ptr<> for the internal data: a copy then became just a reference, and during modification I checked unique() and, if it returned false, copied there on demand (search for willModify).

After I did that, it turned out that no copies were being done.

Contributor Author:

I suspect that switching to a result type eliminated all of the use cases for copying. Maybe it can be a move-only type now; I'll try deleting the copy constructor/assignment operator and see.

Contributor Author:

Ah, no. It does get copied. It's used by variables and directives, which both get passed around as a Map, but neither of those is used in the sample for the benchmark.

Contributor:

Do you think these need to be copied to the resolvers? Is there any use case where they would be changed?

Anyway, as you can see in my PR, the COW is pretty simple to do. Just use a shared_ptr and check unique() before doing any modification (e.g. emplace_back()); if it's shared, create a copy before modifying.

```cpp
ListType copy {};

copy.reserve(other.size());
for (size_t i = 0; i < other.size(); ++i)
```
Contributor:

Same here: I made this one a shared_ptr with COW, and in the end it was never copied.

```cpp
}

case Type::String:
	_data = { StringData { other.get<StringType>(), other.maybe_enum() } };
```
Contributor:

I also did the COW here.


```cpp
wrappedParams.errorPath.push(size_t { 0 });
children.reserve(wrappedResult.size());
wrappedParams.errorPath.push_back(size_t { 0 });
```
Contributor:

You should avoid this part: you modify the list, then force a copy of the modified version (ResolverParams(wrappedParams)), which is confusing and makes the code uglier.

Take a look at my PR: there I just keep a reference to the previous SelectionSetParams, so errorPath is not even a list/queue/vector anymore and it becomes copy-free. Each SelectionSetParams is a node in an implicit linked list. You only pay the price on errors, which we expect not to be the common case.

The code also becomes simpler: no need to push, set, and reset the index.

Contributor Author:

I have been experimenting with a version of this along with the mempool in https://github.com/wravery/cppgraphqlgen/tree/parser-mempool. The thing I don't particularly like about putting the link in the SelectionSetParams is that there are places in client code (e.g. the EdgeConstraints template in TodayMock) which use move semantics on just the errorPath, which then outlives the SelectionSetParams. So if it relies on a reference to the parent SelectionSetParams struct, or even a reference to another field_path, the memory might have been freed by the time it tries to dereference the moved copy. It's not trivially relocatable from a single node; it needs to do a deep copy or otherwise keep a separate allocation alive for the parent.

To get around this I tried wrapping the segments in a std::shared_ptr and building a linked list from each segment to its parent. That was a little slower, and it used more memory. Most recently I tried allocating them on the stack and only copying the parent into a shared_ptr if a child is allocated on the stack. That seems slightly better, but it's still hitting std::shared_ptr quite a bit.

Contributor:

I think we're paying a huge price to cope with this EdgeConstraints usage...

Actually, why do we need this to be handled by that class? Shouldn't the class just throw? I recall some parts of the code detected there was no path/location information and would properly enhance the errors before copying them.

shared_ptr doesn't seem like the way to go, in particular if we keep changing the errorPath as currently done by the list converter (something I'd also avoid).

Contributor Author:

I've been worried about breaking backwards compatibility, figuring that existing external code might have implemented this like the sample. But I went back and checked the history for the sample, and I already modified it to add this reference in a minor version update (https://github.com/microsoft/cppgraphqlgen/releases/tag/v3.1.0). So maybe that's enough of a precedent to change this and drop the new code from EdgeConstraints. As you said, it should be able to just throw the exception without decoration and let the shared resolver code fill it in.

Contributor Author:

It should noticeably break the build, though, if it's still used this way, since otherwise users will get a runtime error and crash.

```cpp
for (auto& error : value.errors)
{
	data.emplace_back(std::move(entry.second));
	document.errors.push_back(std::move(error));
```
Contributor:

Since you're changing the types, it's worth looking into making errors a list so you can splice(), which is simpler (one line) and cheaper (although errors are not the hot path).

Contributor Author:

For large enough nodes, a std::list may be worth it. But otherwise you pay a premium in memory usage per node to store the extra next/prev link pointers. It's also bad from a data-locality/cache perspective, since every node is allocated separately, so I generally avoid using them. Some more background here: https://www.stroustrup.com/bs_faq.html#list.

Contributor:

Yeah, I get the difference between list/vector, and even variations such as inlists (inline lists, where each element is a linked-list node per se).

Really, the wasted memory or locality does not matter for handling errors, even more so compared to the current situation where we're creating a new array and moving everything -- or before, where those were actually response::Value maps, with the map + array and strings getting copied. We already improved a lot; the move to list/splice would basically save a minimum while making the code simpler (no more reserve/loop/...).

Contributor Author:

That's true, propagating errors happens a lot more than the final packaging into a response::Value, and std::list::splice can do that in constant time.

```cpp
};

using ValidateTypeKinds = std::map<std::string_view, introspection::TypeKind>;
using ValidateTypes = std::map<std::string_view, ValidateType>;
```
Contributor:

Any reason to keep using map instead of unordered_map?

Contributor Author:

I'm actually thinking of replacing them all with sorted vectors: https://youtu.be/fHNmRkzxHWs?t=2818

Contributor Author:

The key with vectors is to avoid reallocating on append, by reserving or resizing to the exact size you need, and to avoid inserting/removing anywhere other than at the end. For most of these objects that should be doable, since they're static for the lifetime of the service.

Contributor:

If you're moving to sorted vectors and looking into static data... why not target std::array? There is a generator tool to help with that, and then we can try to get more stuff into the .rodata section of the binaries.

There would be no need for the vector helpers.

The binary search is not as fast as an unordered map, but I think it doesn't matter much for these use cases. If it does, we can look into integrating https://www.gnu.org/software/gperf/manual/gperf.html or an equivalent later.

Contributor Author:

> if you're moving to sorted and looking into static... why not target array? There is a generator tool to help that, then we can try to get more stuff into .ro-data section of the binaries.

Most of them still have a size which varies, so they couldn't be built into the graphqlservice library directly. The generated code would need to pass a span/range into graphqlservice to get dynamic sizing, backed by a std::array with a fixed size in the generated code.

C++20 adds span/range, so that would make sense whenever cppgraphqlgen moves to that language version (I'm tentatively planning on doing that for v4.0). There's also the GSL (C++ Core Guidelines Support Library), which adds lookalike definitions of std::span<>, so that's another option that makes C++20 adoption easier down the road. But I'm reluctant to add another build dependency unless/until it's absolutely worth it.

> The binary search is not as fast as an unordered map, but I think it doesn't matter much for these use cases. If so we can look into integrating https://www.gnu.org/software/gperf/manual/gperf.html or an equivalent later.

I'm not so worried about the quality of the std::hash implementation for std::string_view, and integers/pointers can just hash to their own value.

The multiple levels of indirection and poor data locality (cache misses) in std::unordered_map overwhelm the savings on lookup compared to a binary search, until you get to a very large number of items. Anecdotal benchmarks I've heard say you need at least 100 entries to make std::unordered_map break even, but really we should do some of our own benchmarking/profiling.

You mentioned you are dealing with large numbers of fields per object, so maybe you're more likely to hit that break even point than I expected. If you want to try comparing them for your scenario, take a look at fa772a1 where I already did this for the ResolverMap generated for Object. If you locally change where it declares the ResolverMap type as a std::vector<std::pair<>> back to std::unordered_map and change the lookup in SelectionVisitor::visitField back to a call to find, you can see if/what difference it makes.

@barbieri (Contributor):

@wravery thanks for merging this, it did improve a lot; however, my original PR was producing something reasonably better.

Massif comparing master (cde90d7) and my PR (a142c6c) shows:

| Code | Largest | Smallest |
| --- | --- | --- |
| cde90d7 | 156,328 | 145,808 |
| a142c6c | 122,600 | 106,664 |
| Difference | +27.5% | +36.7% |

DHAT also points out to big differences:

| Code | t-gmax | t-end |
| --- | --- | --- |
| cde90d7 | 7,377,431 | 699,095,217 |
| a142c6c | 6,307,022 | 312,019,366 |
| Difference | +16.9% | +124% |

Okay, the "t-end" is not that meaningful, but the total allocated at peak is.

DHAT points out that a lot of your memory allocation happens in field_path allocation; in particular, GraphQLService.h:139 leads to errorPath. Take a look at my commits https://github.com/profusion/cppgraphqlgen/commit/a142c6c9035ee46123127e9ad0764614952c9055, https://github.com/profusion/cppgraphqlgen/commit/7eb32e49c67afab87ab71b1c8f0d240da3a68b4e and https://github.com/profusion/cppgraphqlgen/commit/25afd869f8b15caf2ecc406861a5787cce745678

I'd also point out the conversion optimizations; they're clean and faster: https://github.com/profusion/cppgraphqlgen/commit/cb52a63bfcb7c73bdc000eed4e063d803e3a83f7 and https://github.com/profusion/cppgraphqlgen/commit/7d746b594f2696e9b215990c7c486e2475272f84 (this last one is more important to people using std::launch::async, as we're still doing).
