
Performance code review #15

Closed
FooBarWidget opened this issue Oct 15, 2018 · 6 comments

Comments

@FooBarWidget

FooBarWidget commented Oct 15, 2018

Hi there, great work writing oat++. I am the author of Passenger, an application server that powers 650,000 companies and sites worldwide, such as Apple, Pixar, and Juniper. Passenger is mostly written in C++ and, like you, we treat performance as a priority.

I've done a quick code review, and I thought I'd share some of my findings with you. These are based on my experience writing high-performance servers. You can read about some of that experience in these blog posts.

Hopefully you can use my feedback to make oat++ even faster and better.

  • EDIT: elaborated on the documentation feedback
  • EDIT: added feedback about zero-copy architecture

Performance feedback

Pools and contention

You are using memory pools and object pools everywhere. This is good, because avoiding dynamic memory allocation is key to performance. It is also cool that you've attempted to write a standardized framework to allow every object to easily make use of pools, including extras like counters.

One issue with your approach is that your pools are static, i.e. all objects of a certain type are allocated from a single pool. In a multithreaded program this creates contention. You can avoid this by using one pool per thread.

I see you already tried to mitigate this somewhat with ThreadDistributedMemoryPool, but that still depends on a single contention point, namely the statically allocated atomic counter. Nothing is faster than no contention.

Shared_ptr and atomics everywhere

Shared_ptr is great for safety, but it uses atomic counters under the hood. Atomic operations are expensive; they even impose memory barriers.

What I do in Passenger is use smart pointers that do not use atomic counters. This makes changing the refcount from multiple threads unsafe, but performance is much better.

Using shared_ptr has another implication, namely the use of pointers. Pointers take up space too. I am not familiar enough with your codebase, but maybe you can reduce the amount of pointer chasing? Your cache locality is probably not that bad thanks to the use of pools, but I suspect it could be even better if you reduce the number of pointers and make more use of fixed storage. I am not sure how to do this, but maybe you can think about it.

List vs vector

You use std::list in some places, such as MemoryPool::m_chunks. std::vector has better memory allocation behavior and better cache locality, which often makes it faster even in the cases where std::list has better theoretical complexity (e.g. removing from the middle).

In any case, it appears you are not relying on the algorithmic complexity properties of std::list, so consider replacing with std::vector.

Or even better: along the lines of improving cache locality and reducing allocations, consider using a small_vector. See boost::container::small_vector, or llvm::SmallVector from the LLVM codebase's ADT library.

Zero-copy architecture

You are making limited use of a zero-copy architecture. You already have heap-allocated I/O buffers which you can move around without copying the data, and this is good. However, your stream interfaces read data by copying it into a void* buffer.

In Passenger we make use of "mbufs": slices/substrings of an I/O buffer. I/O buffers are non-atomically reference counted (making them usable within only one thread). Mbufs act as smart pointers, manipulating the refcount of the underlying I/O buffer. This way we never need to copy any data around.

You may ask: what about non-contiguous data that is spread over multiple I/O buffers? In Passenger we embrace the fact that data may not be contiguous, using a data structure called LString. We then build algorithms around that data structure instead of around contiguous strings. Only when we can't build such an algorithm (and thus have no choice but to use a contiguous string, e.g. when we need to call an external library or a system call like stat()) do we turn the LString into a contiguous string.

Other feedback

SpinLock everywhere

You're using spinlocks everywhere instead of OS mutexes. Spinlocks are often faster than mutexes in microbenchmarks, as well as smaller, but they can be dangerous depending on the amount of contention you have. With little to no contention, they are great. With large amounts of contention you risk spending a significant amount of time spinning, which is slower than having the OS pause the thread and do the scheduling for you.

I see that you use spinlocks in MemoryPool's obtain() method. That method performs a non-trivial amount of work. For example, it calls into the native memory allocator using new, which can perform an arbitrary amount of work.

New does not return nullptr?

There is some code which allocates something with new and then checks for a nullptr result, such as ChunkedBuffer::obtainNewEntry(). But new does not return nullptr; it throws an exception on failure.

More documentation

It can be a bit hard to figure out what a class is supposed to do or how it is supposed to be used. More code documentation — both on the class level and on the architecture/concept level — would be nice.

On the class level, I do not understand what ChunkedBuffer is supposed to be and how it differs from FifoBuffer. I also did not understand what ParsingCaret is and how it is used until I had studied quite a lot of code that uses ParsingCaret. It also took me a while to figure out how SHARED_OBJECT_POOL and OBJECT_POOL are used.

On the conceptual level, I have no idea how the coroutine stuff works or how it is supposed to be used. I also have no idea what concurrency model you are using. Is it one-client-per-thread? Your benchmarks seem to suggest that you can't be using that. But I grepped for select/poll/epoll/kqueue/O_NONBLOCK and could not find any of those mechanisms in your code, so apparently you aren't using non-blocking evented I/O either. It is a mystery.

More tests

Unit tests have saved my life so many times. Debugging regressions that are discovered after release is not fun. The number of tests you have is fairly small compared to the size of your library.

@Joshua-Ashton

But new does not return nullptr, it throws an exception.

This is "standard" correct, but not real-world correct -- not everyone uses C++ exceptions (they can be disabled at the compiler level), so it could be nullptr. It's better to be safe than sorry.

@lganzzzo
Member

@FooBarWidget
Thank you so much for your code review, and your time!
It is very valuable input!

More documentation is really an issue and I am working on it right now.

Here is a brief answer and I will get back to you with the full answer a bit later:

What I do in Passenger is using smart pointers that do not use atomic counters.

I used to have custom smart pointers in oat++ but gave up on that idea in favor of compatibility, so that end users could work with the framework more easily.
Also, the performance impact was really very small.

I do not understand what ChunkedBuffer is supposed to be and how that's different from FifoBuffer

  • ChunkedBuffer is basically a chain of chunks. It makes it possible to grow the buffer without copying memory. The chunks are also pool-allocated in order to reduce memory fragmentation.
  • FifoBuffer is for a future feature and is not currently used.

On the conceptual level, I have no idea how the coroutine stuff works and is supposed to be used. I have also no idea what concurrency model you are using. Is it one-client-per-thread?

Currently there are two concurrency models implemented:

Coroutines and Async:
These are used in the async model together with non-blocking I/O (non-blocking sockets).
Coroutines are served by oatpp::async::Processor.

These are stateless coroutines. On each iteration a coroutine returns an Action, which is basically a description of what to do next. Actions also make additional prioritization possible, such as putting a coroutine that is waiting for I/O into a "waiting queue".

You may treat this approach as an app-level implementation of epoll/kqueue. It gives us additional room for experiments, and it makes the end-user API a bit easier.

Did you manage to run some performance tests on your machine?

Thank you for the code-review once again!

@FooBarWidget
Author

I added some edits. In particular, I added feedback about zero-copy.

Also performance impact was really very very small

In my experience with Passenger, such impact was indeed small on its own. But I optimized Passenger by solving a thousand paper cuts — optimizing 1% here, 0.5% there, 1.2% there, etc. They all added up.

@jcelerier

jcelerier commented Oct 15, 2018

This is "standard" correct, but not real world correct -- not everyone uses C++ exceptions (at the compiler level), therefore it could be nullptr

I just tried with both GCC and Clang: adding -fno-exceptions still makes new throw an exception, so I guess the number of cases where new would return nullptr in real-world code is very close to zero.

#include <cstdio>

int main()
{
  constexpr auto max = 100000000000;
  auto blob = new char[max];

  // the following added to ensure that the compiler does not
  // optimize the allocation away if it's not used
  auto f = fopen("/dev/null", "w");
  fwrite(blob, 1, max, f);
  fclose(f);
}

->

$ g++ -O3 foo.cpp -fno-exceptions 
terminate called after throwing an instance of 'std::bad_alloc'
  what():  std::bad_alloc

The generated assembly does not even change: https://gcc.godbolt.org/z/YCrwAX. Looking at how it's done in libc++, you would have to recompile it with an additional define, for instance. And how many people are actually able to recompile their stdlib?

@Joshua-Ashton

I was mainly referring to MSVC (not sure if it has the same behaviour as GCC now; it never used to) and to people who may override new with their own memory management in the future.

@FooBarWidget
Author

My point w.r.t. operator new and exceptions is that the calling code only checks for nullptr, but does not attempt to catch an exception.
